parallel_scan works … kinda, sorta
March 7, 2008 1 Comment
In a previous post, I showed a program that uses the parallel_scan construct but gets the wrong result. Since then, I received a working, running sum example for parallel_scan (from Mike Deskevich of ITT) that I could poke at and observe the results. Doing that I found a mistake I was making, and I found a mistake that TBB is making.
My mistake was creating an incorrect mental model of how reverse_join works. In my original version for the running minimum program, reverse_join stored the incoming minimum to a subrange and compared it to the minimum found within the subrange so that the updated minimum could be passed on to the next subrange:
void reverse_join (vector_run_min &vrm) {
min_in = vrm.min_out; // min at start of subrange k = min at end of subrange k-1
// min at end of subrange k is lesser of min at end of subranges k, k-1
min_out = (min_out < vrm.min_out) ? min_out : vrm.min_out;
}
It actually doesn’t have to be that complicated. The vector_run_min object needs only a single member, mmin, that stores the minimum at the end of the subrange. Then the correct version of reverse_join can be written this way:
void reverse_join (vector_run_min &vrm) {
// min at end of subrange k is lesser of min at end of subranges k, k-1
mmin = (mmin < vrm.mmin) ? mmin : vrm.mmin;
}
Obviously, TBB is doing something in the background to make this work. All I can figure is that, during the final scan, each vector_min_object is passed the subrange that followed it during the pre-scan. Then mmin would hold the minimum at the start of that subrange.
The second problem I had was with the () operator that handles the pre-scan and final scan. During my debugging of my original running minimum program, it seemed that the pre-scan version of the () operator wasn’t being called at the right times. So I re-wrote the original operators:
// overload () so it finds the minimum within each subrange
void operator() (const blocked_range<size_t> &r, pre_scan_tag) {
for(size_t i=r.begin(); i!=r.end(); i++)
mmin = (v[i]<mmin) ? v[i] : mmin;
}
// overload () so it computes the running minimum over a subrange
void operator() (const blocked_range<size_t> &r, final_scan_tag) {
size_t i = r.begin();
run_min[i] = (v[i]<mmin) ? v[i] : mmin; // set running min for first entry in subrange
for(++i; i!=r.end(); i++)
run_min[i] = (v[i]<run_min[i-1]) ? v[i] : run_min[i-1];
}
and made them operate identically so the actual scan phase didn’t matter:
// overload () so it finds the minimum within each subrange
void operator() (const blocked_range<size_t> &r, pre_scan_tag) {
// find the running minimum over the subrange
size_t i = r.begin();
run_min[i] = (v[i]<mmin) ? v[i] : mmin;
for(++i; i!=r.end(); i++)
run_min[i] = (v[i]<run_min[i-1]) ? v[i] : run_min[i-1];
mmin = run_min[i-1]; // subrange minimum is the last entry in the running minimum
}
// overload () so it computes the running minimum over a subrange
void operator() (const blocked_range<size_t> &r, final_scan_tag) {
// find the running minimum over the subrange
size_t i = r.begin();
run_min[i] = (v[i]<mmin) ? v[i] : mmin;
for(++i; i!=r.end(); i++)
run_min[i] = (v[i]<run_min[i-1]) ? v[i] : run_min[i-1];
mmin = run_min[i-1]; // subrange minimum is the last entry in the running minimum
}
Now the pre-scan version of the () operator needlessly computes the running minimum while it is finding the minimum for a subrange. And the final scan version needlessly updates the minimum after it calculates the running minimum. So there is some wasted effort, but at least it works.
Now, this is not the way I wanted to do it. And I’m sure it’s not the way the TBB developers want me to do it. But it’s the only way I could do it.
So here’s the current, working program for finding the running minimum. You can also download it here. Let me know if you find a way to fix my problem with the pre-scan and final scan.
// example4.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <FLOAT.H>
#include <iostream>
#include <stdlib.h>
#include <time.h>
#include "tbb/task_scheduler_init.h"
#include "tbb/blocked_range.h"
#include "tbb/parallel_scan.h"
using namespace std;
using namespace tbb;
class vector_run_min {
const double *const v; // vector of data
public:
double *const run_min; // vector of running minimums of data
double mmin; // minimum at beginning of subrange
// constructor copies the arguments into local storage
vector_run_min (double *v_, double *run_min_)
: v(v_), run_min(run_min_), mmin(FLT_MAX) { }
// splitting constructor
vector_run_min (vector_run_min &vrm, split)
: v(vrm.v), run_min(vrm.run_min), mmin(FLT_MAX) { }
// overload () so it finds the minimum within each subrange
void operator() (const blocked_range<size_t> &r, pre_scan_tag) {
// find the running minimum over the subrange
size_t i = r.begin();
run_min[i] = (v[i]<mmin) ? v[i] : mmin;
for(++i; i!=r.end(); i++)
run_min[i] = (v[i]<run_min[i-1]) ? v[i] : run_min[i-1];
mmin = run_min[i-1]; // subrange minimum is the last entry in the running minimum
}
// join partial results starting at subrange 0, 1, 2 ... and proceeding upward
void reverse_join (vector_run_min &vrm) {
// min at end of subrange k is lesser of min at end of subranges k, k-1
mmin = (mmin < vrm.mmin) ? mmin : vrm.mmin;
}
// overload () so it computes the running minimum over a subrange
void operator() (const blocked_range<size_t> &r, final_scan_tag) {
// find the running minimum over the subrange
size_t i = r.begin();
run_min[i] = (v[i]<mmin) ? v[i] : mmin;
for(++i; i!=r.end(); i++)
run_min[i] = (v[i]<run_min[i-1]) ? v[i] : run_min[i-1];
mmin = run_min[i-1]; // subrange minimum is the last entry in the running minimum
}
// assign
void assign(vector_run_min &vrm) {
mmin = vrm.mmin;
}
};
const size_t vec_size = 10000;
int _tmain(int argc, _TCHAR* argv[])
{
// allocate storage for vectors
double *v = (double*)malloc(vec_size*sizeof(double));
double *run_min_ser = (double*)malloc(vec_size*sizeof(double));
double *run_min_par = (double*)malloc(vec_size*sizeof(double));
// initialize the vector with random data
srand((unsigned)time(NULL)); // set the random number seed
for(size_t i=0; i<vec_size; i++) v[i] = rand();
// serially scan vector and record running minimum at each point
run_min_ser[0] = v[0]; // initialize first entry in running minimum
for(size_t i=1; i<vec_size; i++)
run_min_ser[i] = (v[i] < run_min_ser[i-1]) ? v[i] : run_min_ser[i-1];
// scan vector in parallel and record running minimum at each point
task_scheduler_init init;
vector_run_min vrm(v,run_min_par);
parallel_scan( blocked_range<size_t>(0,vec_size,1000), vrm );
// compare serial and parallel versions of the running minimum
for(size_t i=0; i<vec_size; i++)
if(run_min_ser[i] != run_min_par[i])
{
cout << "Serial and parallel versions differ!" << endl;
cout << "run_min_ser[" << i << "] = " << run_min_ser[i] << endl;
cout << "run_min_par[" << i << "] = " << run_min_par[i] << endl;
return 1;
}
cout << "Serial and parallel versions are the same!" << endl;
// print some of the result to make sure it is correct
cout << "Vector and its running minimum: " << endl;
for(size_t i=0; i<10; i++) printf("%6.0f ",v[i]);
cout << endl;
for(size_t i=0; i<10; i++) printf("%6.0f ",run_min_ser[i]);
cout << endl;
return 0;
}
Pingback: parallel_scan finally explained! « /// Parallel Panorama ///