Back in the 80’s, there was a movie called “Scanners”. Scanners were mutants that could think intensely about you with a very constipated look and veins standing out on their faces. Then your head would explode. That’s the way the parallel_scan construct makes me feel.
There is no example for parallel_scan in the TBB tutorial, but there is a terse explanation in the reference manual. Basically, parallel_scan breaks a range into subranges and computes a partial result in each subrange in parallel. Then, the partial result for subrange k is used to update the information in subrange k+1, starting from k=0 and proceeding sequentially up to the last subrange. Then each subrange uses its updated information to compute its final result in parallel with all the other subranges.
That’s suitably abstract, and it took me over an hour to get that much out of it. In order to get a better feel for parallel_scan, I tried to build a program that would compute the running minimum for a floating-point vector. Each location i in the running minimum stores the minimum value of the original vector found between locations 0 and i . The algorithm has the following phases:
- The vector is divided into subranges.
- A pre-scan version of the
()operator finds the minimum value in each subrange. - A
reverse_joinmethod communicates the minimum found in subrange k over to subrange k +1, starting from k=0. This tells subrange k +1 what the minimum value is in the entire vector that precedes it. - A final scan version of the
()operator computes the running minimum over each subrange using the minimum passed in by the reverse_join as the starting minimum.
The algorithm seems straight-forward. Here’s the resulting program:
// example4.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <FLOAT.H>
#include <iostream>
#include <stdlib.h>
#include <time.h>
#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_scan.h"
#include "tbb/blocked_range.h"
using namespace tbb;
using namespace std;
class vector_run_min {
const double *const v; // vector of data
public:
double *const run_min; // vector of running minimums of data
double min_in; // minimum at beginning of subrange
double min_out; // minimum at end of subrange
// constructor copies the arguments into local storage
vector_run_min (double *v_, double *run_min_)
: v(v_), run_min(run_min_), min_in(FLT_MAX), min_out(FLT_MAX) { }
// splitting constructor
vector_run_min (vector_run_min &vrm, split)
: v(vrm.v), run_min(vrm.run_min), min_in(FLT_MAX), min_out(FLT_MAX) { }
// overload () so it finds the minimum within each subrange
void operator() (const blocked_range<size_t> &r, pre_scan_tag) {
for(size_t i=r.begin(); i!=r.end(); i++)
min_out = (v[i]<min_out) ? v[i] : min_out;
cout << "pre_scan: [" << r.begin() << "," << r.end() << ") min_out = " << min_out << endl;
}
// join partial results starting at subrange 0, 1, 2 ... and proceeding upward
void reverse_join (vector_run_min &vrm) {
min_in = vrm.min_out; // min at start of subrange k = min at end of subrange k-1
// min at end of subrange k is lesser of min at end of subranges k, k-1
min_out = (min_out < vrm.min_out) ? min_out : vrm.min_out;
cout << "reverse_join: " << min_in << " " << min_out << endl;
}
// overload () so it computes the running minimum over a subrange
void operator() (const blocked_range<size_t> &r, final_scan_tag) {
size_t i = r.begin();
run_min[i] = (v[i]<min_in) ? v[i] : min_in; // set running min for first entry in subrange
for(++i; i!=r.end(); i++)
run_min[i] = (v[i]<run_min[i-1]) ? v[i] : run_min[i-1];
cout << "final_scan: [" << r.begin() << "," << r.end() << ") min_in = " << min_in << endl;
}
// assign
void assign(vector_run_min &vrm) {
min_in = vrm.min_in;
min_out = vrm.min_out;
}
};
const size_t vec_size = 10000;
int _tmain(int argc, _TCHAR* argv[])
{
// allocate storage for vectors
double *v = (double*)malloc(vec_size*sizeof(double));
double *run_min_ser = (double*)malloc(vec_size*sizeof(double));
double *run_min_par = (double*)malloc(vec_size*sizeof(double));
// initialize the vector with random data
srand((unsigned)time(NULL)); // set the random number seed
for(size_t i=0; i<vec_size; i++) v[i] = rand();
// serially scan vector and record running minimum at each point
run_min_ser[0] = v[0]; // initialize first entry in running minimum
for(size_t i=1; i<vec_size; i++)
run_min_ser[i] = (v[i] < run_min_ser[i-1]) ? v[i] : run_min_ser[i-1];
// print some of the result to make sure it is correct
cout << "Vector and its running minimum (serial): " << endl;
for(size_t i=0; i<10; i++) printf("%6.0f ",v[i]);
cout << endl;
for(size_t i=0; i<10; i++) printf("%6.0f ",run_min_ser[i]);
cout << endl;
// scan vector in parallel and record running minimum at each point
task_scheduler_init init(1);
vector_run_min vrm(v,run_min_par);
parallel_scan( blocked_range<size_t>(0,vec_size,1000), vrm );
// print some of the result to make sure it is correct
cout << "Vector and its running minimum (parallel): " << endl;
for(size_t i=0; i<10; i++) printf("%6.0f ",v[i]);
cout << endl;
for(size_t i=0; i<10; i++) printf("%6.0f ",run_min_par[i]);
cout << endl;
// compare serial and parallel versions of the running minimum
for(size_t i=0; i<vec_size; i++)
if(run_min_ser[i] != run_min_par[i])
{
cout << "Serial and parallel versions differ!" << endl;
cout << "run_min_ser[" << i << "] = " << run_min_ser[i] << endl;
cout << "run_min_par[" << i << "] = " << run_min_par[i] << endl;
return 1;
}
cout << "Serial and parallel versions are the same!" << endl;
return 0;
}
In _tmain (lines 57-103), the running minimum for a vector of random values is computed in serial and parallel. The parallel running minimum calculation is initiated on lines 81-83. A task scheduler object is initialized. Then a vector_run_min object is constructed from the pointers to the vectors that store the data and the running minimum. Then that object is passed to the parallel_scan subroutine along with a blocked_range object that will partition the range into smaller parcels.
The vector_run_min object (lines 15-53) has both standard and splitting constructors just like those found when using parallel_reduce. A vector_run_min object for a given subrange stores the minimum value found across all the subranges that precede it (min_in), and it stores the updated minimum value after looking at the data within its subrange (min_out). The min_in and min_out values are initialized to the maximum floating-point value whenever a subrange is constructed. (Conversely, if I was doing a running maximum, then I would initialize these to the minimum floating-point value.)
The () operator is used to find the minimum within a subrange (lines 28-32). A dummy argument, pre_scan_tag, is used to identify this as the operator to apply during the pre-scan portion of the algorithm. The minimum value in the subrange is stored in the min_out member.
After the pre-scan () operator is applied in parallel to all the subranges, the reverse_join method is applied sequentially to the subranges (lines 34-39). The minimum value coming into a subrange from the preceding subranges is stored in the min_in member. Then the minimum value including the values in the current subrange is stored in the min_out member. This value is passed on to the next subrange, thus updating and propagating the correct minimum value.
The () operator is used again to calculate the running minimum within a subrange (lines 41-47). The final_scan_tag dummy argument identifies this as the appropriate operator for the final scan. The minimum value found up to the start of the current subrange is compared to the first element of the current subrange to get the first element of the running minimum. Then the running minimum is calculated at each point in the subrange in the same way as for the serial version (lines 69-71).
The logic of this program looks correct, but it is not working. I ran the serial and parallel versions and their answers are different. For a ten-thousand element vector, the blocked_range object partitions are [0,625), [625,1250), [1250,2500), [2500, 5000), and [5000, 10000), and the first difference always occurs at location 625, right where the first two subranges meet. It appears the reverse_join is not passing the minimum from the previous subrange on to the next subrange.
I’ve tried this program using both release and development versions of TBB, and the result is always the same. After several days, I stopped trying to fix it. Here’s the source code for this example if you want to try it. Let me know if you find the error in my code.
[...] finally explained! Filed under: Uncategorized — llpanorama @ 11:59 am I beat my head against parallel_scan for a week and never really understood why I was having the problems I had. Now the developers at [...]
Pingback by parallel_scan finally explained! « /// Parallel Panorama /// — May 22, 2008 @ 11:59 am
[...] under: multicore — Tags: TBB — llpanorama @ 4:47 pm After my problems with parallel_scan, I approached parallel_sort with some trepidation. I was pleasantly surprised when parallel_sort [...]
Pingback by /// Parallelism Panorama /// — March 7, 2008 @ 4:47 pm
[...] kinda, sorta Filed under: multicore — Tags: TBB — llpanorama @ 11:39 am In a previous post, I showed a program that uses the parallel_scan construct but gets the wrong result. Since then, I [...]
Pingback by parallel_scan works … kinda, sorta « /// Parallelism Panorama /// — March 7, 2008 @ 11:39 am