In my first attempt at writing a parallel program with TBB, my speedup was disappointing (about 4%). I guessed that vector multiplication wasn’t computationally intensive enough to make good use of multiple cores. I tried to remedy that by writing a program to find points in a Julia set. The Julia set is just the set of points that don’t shoot off to infinity when they are iterated by the function:
z(n+1) = z(n) * z(n) + c
So the basic technique is to take a candidate point z and repeatedly square it and add a constant (both z and c are complex numbers). If it exceeds a bound within a set number of iterations, then z is probably not a Julia point (this is an approximate test). By varying the bound and the number of iterations, I could easily vary the computational load.
The C++ code for both the parallel and serial versions of my Julia set generator is shown below:
// example2.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include
#include // for complex numbers
#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/tick_count.h"
using namespace tbb;
using namespace std;
typedef complex JPT;
class julia{
JPT *const z; // pointer to points to test for "Julianess"
const complex c; // complex constant in eq. z_i+1 = (z_i)^2 + c
const unsigned n; // # times to iterate equation
const double b; // Julia point if z_n >= b
bool *is_julia; // put the result here
public:
// constructor copies the arguments into local storage
julia(JPT *arg_z, complex arg_c, int arg_n, double arg_b, bool *arg_is_julia)
: z(arg_z), c(arg_c), n(arg_n), b(arg_b), is_julia(arg_is_julia) { }
// overload () so it finds julia points
void operator() (const blocked_range &r) const {
for(size_t i=r.begin(); i!=r.end(); i++) {
complex zi = z[i]; // store the point to test
is_julia[i] = true; // start off assuming it's a Julia point
for(unsigned j=0; j= b) {
// Not a Julia point if it exceeds the bound within n iterations
is_julia[i] = false;
break;
}
zi = zi * zi + c; // keep iterating the point and see where it goes
}
}
}
};
const size_t n_pts = 100000; // number of points to test
const JPT z_lo = 0.0; // find Julia points between here...
const JPT z_hi = 1.0; // and here
const complex c_julia(-0.123,0.745); // complex constant for iterations
const unsigned n_iter = 100; // number of iterations for checking bounds
const double bound = 100000; // bound for testing "Julianess" of a point
int _tmain(int argc, _TCHAR* argv[])
{
JPT *z = (JPT*)malloc(n_pts*sizeof(JPT)); // storage for points to test
bool *is_julia = (bool*)malloc(n_pts*sizeof(bool)); // storage for test results
complex c = c_julia;
const unsigned n = n_iter;
const double b = bound;
// initialize vector with evenly-spaced points along a line
JPT spacing = (z_hi - z_lo)/JPT(n_pts-1); // spacing between points
z[0] = z_lo; // initial point
for(size_t i=1; i<n_pts; i++) {
z[i] = z[i-1] + spacing; // next point is previous point plus spacing
}
// serially calculate whether points are Julia or not and record duration
tick_count serial_start = tick_count::now();
for(size_t i=0; i<n_pts; i++) {
complex zi = z[i]; // store the point to test
is_julia[i] = true; // start off assuming it's a Julia point
for(unsigned j=0; j= b) {
// Not a Julia point if it exceeds the bound within n iterations
is_julia[i] = false;
break;
}
zi = zi * zi + c; // keep iterating the point and see where it goes
}
}
tick_count serial_end = tick_count::now();
cout << "# Julia points found between [" << z_lo << ", " << z_hi << "] = ";
unsigned j_pts = 0;
for(size_t i=0; i<n_pts; i++)
if(is_julia[i])
j_pts++;
cout << j_pts << endl;
// redo the calculation in parallel and time the operation
task_scheduler_init init;
tick_count parallel_start = tick_count::now();
parallel_for( blocked_range(0,n_pts,1000), julia(z,c,n,b,is_julia) );
tick_count parallel_end = tick_count::now();
cout << "# Julia points found between [" << z_lo << ", " << z_hi << "] = ";
j_pts = 0;
for(size_t i=0; i<n_pts; i++)
if(is_julia[i])
j_pts++;
cout << j_pts << endl;
// print out the speedup delivered by parallelism
cout << "Speedup from parallelism is "
<< (serial_end - serial_start).seconds()
<< " / " << (parallel_end - parallel_start).seconds()
<< " = "
<< (serial_end - serial_start).seconds()
/ (parallel_end - parallel_start).seconds()
<< endl;
return 0;
}
The program structure is similar to my first TBB program. An array for holding the complex points to test is allocated on line 52 and it is loaded with a set of evenly-spaced points along the line connecting z_lo and z_hi (lines 59-63). Then the serial portion of the program tests each point and removes it from the Julian set if it exceeds the bound within a certain number of iterations (lines 66-79). Then the number of Julian points found is printed (lines 81-86).
The same computation is then done in parallel (lines 89-92). A task scheduler is created and then a parallel_for is done over the subranges of the line of points using the methods of the julia object. The structure and use of the julia object is the same as the vector_mult object in my previous program, there are just more arguments and computational steps (lines 16-41). Once again, the number of Julian points is printed to make sure it matches the result found by the serial algorithm (lines 94-99). Then the durations of the serial and parallel portions are printed and the speedup is calculated (lines 102-108).
Over ten runs of the program, the serial duration averaged 1.50s while the parallel portion finished in 0.83s for an average speedup of 1.8. Since I’m using an AMD dual-core Athlon, I’m getting 90% of the potential speedup. Obviously, increasing the amount of work done by each invocation of the object has a major effect on the speedup.
Here’s the sourcecode for this example if you want to play around with it.