# Reductio ad absurdum

February 28, 2008 1 Comment

I used `parallel_for`

in my first TBB program to multiply two vectors element-by-element and store the products into a third vector. But what about calculating their dot product where the vector products are added together? TBB supports this type of operation with the `parallel_reduce`

construct.

In order to use `parallel_reduce`

, I needed to add two methods to the object that performs the parallel operations:

- a splitting constructor that can cut off a piece of an existing object and initialize it;
- a join method that combines the answers of two smaller problems into a result that applies to their total.

I modified my vector multiplication program to compute the dot product using `parallel_reduce`

. Here’s the result:

// example3.cpp : Defines the entry point for the console application.

//

#include “stdafx.h”

#include

#include “tbb/task_scheduler_init.h”

#include “tbb/parallel_reduce.h”

#include “tbb/blocked_range.h”

using namespace tbb;

using namespace std;

class vector_dot_prod {

double *const v1, *const v2; // dot-product these vectors

public:

double result; // put the result here

// constructor copies the arguments into local storage

vector_dot_prod(double *vec1, double *vec2)

: v1(vec1), v2(vec2), result(0) { }

// splitting constructor

vector_dot_prod(vector_dot_prod &vdp, split)

: v1(vdp.v1), v2(vdp.v2), result(0) { }

// overload () so it does a dot product

void operator() (const blocked_range

for(size_t i=r.begin(); i!=r.end(); i++)

result += v1[i] * v2[i];

}

// join method adds partial results

void join(vector_dot_prod &v) { result += v.result; }

};

const size_t vec_size = 100000000;

int _tmain(int argc, _TCHAR* argv[])

{

// allocate storage for vectors

double *v1, *v2, result;

v1 = (double*)malloc(vec_size*sizeof(double));

v2 = (double*)malloc(vec_size*sizeof(double));

// init and dot-prod the vectors serially

for(size_t i=0; i

// print the result to make sure it is correct

cout << "Parallel dot product of the two vectors is " << v.result << endl;
return 0;
}
[/sourcecode]
I'll go over the changes on a line-by-line basis:Line 7: I changed the include file from `parallel_for.h`

to `parallel_reduce.h`

since that’s what I’m using in this program.

Line 12: I changed the class name of the object to `vector_dot_prod`

for obvious reasons.

Line 15: I removed the product vector and added a `result`

member to hold the scalar dot product value.

Lines 17-18: I added an initialization that clears `result`

when the object constructor is called.

Lines 20-21: I added the new splitting constructor which accepts an existing `vector_dot_prod`

object and creates another one with the same pointers to the input vectors while clearing the partial result scalar member. The `split`

argument is a dummy argument that just serves to distinguish this constructor from any others. The original object and the newly constructed object will each get a subrange of the input vectors on which to compute the dot product (although there may be additional splitting of each subrange before actual computations occur). In effect, the parent object that is passed as an argument is transformed into two sibling objects that each operate over smaller subranges of the input vectors.

Lines 23-26: The `()`

operator handles the actual dot product calculation. A subrange of each input vector is multiplied and added to the `result`

member.

Line 28: The `join`

method takes the partial dot product from one sibling (passed as the argument) and adds it to the `result`

in the other sibling, which is transformed back into the parent object. Of course, this parent object may be a sibling of another object to which its `result`

would be added. This joining process will repeat until the original object that spans the full range of the vectors is rebuilt, at which point the complete dot product will be in its `result`

member.

Lines 41-43: The members of each vector are initialized with the value of their respective indices, and then the dot product is computed serially.

Line 46: The dot product is output so I could check the result. For the given input vectors, the dot product is going to be approximately *N*N*N / 3* where *N* is the length of the vectors.

Lines 49-51: This computes the dot product in parallel. The task scheduler is started and then an object is constructed that holds the pointers to the input vectors. This object is passed to the `parallel_reduce`

function along with a `blocked_range`

object that spans the input vectors. The `parallel_reduce`

function uses the `vector_dot_prod`

object’s `split`

method to create objects that span subranges of the input based on the grainsize of the `blocked_range`

object. Then the `()`

operator is used to create the partial dot products of each subrange. Finally, the `join`

method is used to add all the partial dot products together.

Line 54: The result of the parallel dot product calculation is output so I could check to make sure it matched the result of the serial calculation.

When I ran the program, both the serial and parallel portions calculated a dot product of 3.33333 x 10^23, which is correct.

As with the vector product example, the dot product doesn’t do enough computations to make parallelism worthwhile so I didn’t print out the speedup. The dot product calculation serves as an easy-to-understand example of how to use `parallel_reduce`

, but you would never parallelize this in practice.

What other types of operations can be used with `parallel_reduce`

? Any operation that is associative, i.e. where the order of evaluation is not important. Some examples are addition, multiplication, finding the minimum or maximum of a set of numbers, and finding intersections and unions of sets.

Here’s the source code for this example if you want to try it.

Looks like this has been up for a while, but I just found it Googling around. It’s been very helpful getting started on my project. Thank you!