My first TBB program!

In my previous post, I showed how to set up the software tools for creating C++ programs that use TBB. Now I’ll show how I wrote my first TBB program.

I wanted to concentrate on the mechanics of creating a TBB application with Microsoft Visual C++ and not get too bogged down in the details of the algorithm. So I wrote a simple program that just multiplies two vectors item by item and places the products into another, equal-sized vector.

I started by cranking up Visual C++ and clicked on the New Project button (you can also use File→New→Project… from the menu). In the New Project window, I selected Win32 as the project type and Win32 Console Application as the template. I gave the project the creative name of example1 and set its location to the C:\llpanorama\TBB\examples directory. After clicking OK in the New Project window, and then clicking Finish in the Win32 Application Wizard window, a window opened with a simple code skeleton. I made additions to the skeleton and arrived at the following program which I will explain below.

// example1.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include <iostream>
#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/tick_count.h"
using namespace tbb;
using namespace std;

class vector_mult {
    double *const v1, *const v2; // multiply these vectors
    double *v3;                  // put the result here
public:
    // constructor copies the arguments into local storage
    vector_mult(double *vec1, double *vec2, double *vec3)
        : v1(vec1), v2(vec2), v3(vec3) { }
    // overload () so it does a vector multiply over the given subrange
    void operator()(const blocked_range<size_t> &r) const {
        for(size_t i=r.begin(); i!=r.end(); i++)
            v3[i] = v1[i] * v2[i];
    }
};

const size_t vec_size = 1000000;

int _tmain(int argc, _TCHAR* argv[])
{
    // allocate storage for vectors
    double *v1, *v2, *v3;
    v1 = (double*)malloc(vec_size*sizeof(double));
    v2 = (double*)malloc(vec_size*sizeof(double));
    v3 = (double*)malloc(vec_size*sizeof(double));

    // init the vectors, then multiply them serially and time the operation
    for(size_t i=0; i<vec_size; i++) { v1[i] = v2[i] = (double)i; v3[i] = 0.0; }
    tick_count serial_start = tick_count::now();
    for(size_t i=0; i<vec_size; i++)
        v3[i] = v1[i] * v2[i];
    tick_count serial_end = tick_count::now();

    // re-init the vectors, then multiply them in parallel and time the operation
    for(size_t i=0; i<vec_size; i++) { v1[i] = v2[i] = (double)i; v3[i] = 0.0; }
    task_scheduler_init init; // start up the TBB task scheduler
    tick_count parallel_start = tick_count::now();
    parallel_for( blocked_range<size_t>(0,vec_size,1000), vector_mult(v1,v2,v3) );
    tick_count parallel_end = tick_count::now();

    // print some of the result vector to make sure it is correct
    for(size_t i=0; i<10; i++)
        cout << v3[i] << endl;

    // print out the speedup delivered by parallelism
    cout << "Speedup from parallelism is "
         << (serial_end - serial_start).seconds() / (parallel_end - parallel_start).seconds()
         << endl;

    return 0;
}

The program starts by allocating storage for the vectors v1, v2 and v3. The size of these vectors is set by the vec_size constant defined just before _tmain.

I wanted to multiply the vectors serially at first to get a baseline for how long that took. The first loop initializes the vectors so that each element stores its own index. The current time is recorded in serial_start, after which a loop multiplies each element of v1 and v2 and stores the product in the corresponding location in v3. Finally, the completion time for the loop is recorded in serial_end. The starting and ending times are used to calculate how long it takes to multiply the vectors serially.

Next, the same calculation is performed in parallel. The vectors are reinitialized first. (I did this to erase the product values in v3 so I could be sure any correct values came from the parallel threads and not the preceding serial calculation.) Then a task_scheduler_init object is created. This object manages the scheduling of tasks onto physical threads (cores of the CPU, essentially). You can specify the number of threads when the scheduler object is instantiated, but it's usually best to let it determine the number of threads automatically in case your application is run on a different type of CPU.

The starting and ending times for the parallel version are recorded around the call to parallel_for, which does the actual creation of the tasks that are scheduled for execution on the threads. It takes two arguments. The first is a blocked_range object that stores the range of indices to cover along with a grainsize that specifies how large the chunks of the range should be.
The second argument is an object whose constructor stores the pointers to the v1, v2 and v3 vectors as private members of the vector_mult class. What parallel_for does (I think) is partition the range of the vectors into non-overlapping pieces no larger than the grainsize, and then it creates multiple copies of the vector_mult object and passes one of the subrange pieces to each one. Then the vector_mult objects are scheduled and run in parallel on the physical threads.

When one of the vector_mult objects runs, its () operator is called and passed the subrange that specifies the section of the vectors it should multiply. The loop in the operator gets the beginning and ending indices of the subrange and just multiplies the vector elements that lie between them. For example, the range might begin at 51000 and end at 52000, in which case the vector_mult object would do the following calculations:

v3[51000] = v1[51000] * v2[51000];
v3[51001] = v1[51001] * v2[51001];
...
v3[51999] = v1[51999] * v2[51999];

So each vector_mult object computes a small section of the final product vector. The cores just get copies of the vector_mult object and subranges in no particular order and compute the partial results. Since the subranges are non-overlapping, there is no danger that the objects are overwriting each other’s results. And parallel_for makes sure the entire range is covered so a complete product vector is delivered.

After the parallel_for completes, a few of the product values are printed just to make sure the program is doing something correctly. Then the speedup gained from parallelism is calculated from the recorded times and printed.

Once the program was written, I had to get it compiled and linked. This required setting a few project properties. I clicked on the Project→example1 Properties… menu item to open the Property Pages window. Then I selected All Configurations from the Configurations drop-down menu and set the following properties so the compiler and linker could find the TBB header files and libraries:

C/C++→General:
   Additional Include Directories = $(TBB20_INSTALL_DIR)\include
Linker→General:
   Additional Library Directories = $(TBB20_INSTALL_DIR)\ia32\vc8\lib

Then I selected the Debug configuration and set the following property:

Linker→Input:
   Additional Dependencies = kernel32.lib $(NoInherit) tbb_debug.lib

And I did this in the Release configuration:

Linker→Input:
   Additional Dependencies = kernel32.lib $(NoInherit) tbb.lib

Once my application is built, it still needs to find the TBB dynamic link library when it runs. I can copy the DLL into the same directory as my program by setting the following property for the Debug configuration:

Build Events→Post-Build Event:
   CommandLine = copy "$(TBB20_INSTALL_DIR)\ia32\vc8\bin\tbb_debug.dll" "$(OutDir)"

And I set the property this way in the Release configuration:

Build Events→Post-Build Event:
   CommandLine = copy "$(TBB20_INSTALL_DIR)\ia32\vc8\bin\tbb.dll" "$(OutDir)"

(Instead of copying the TBB DLL into the application directory, it might be better to modify the PATH environment variable to include the directory where the DLLs are stored: %TBB20_INSTALL_DIR%\ia32\vc8\bin. This only needs to be done once and it avoids the accumulation of multiple copies of the TBB DLLs in the project directories.)
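For instance, here is one way to extend PATH for the current command-prompt session (this assumes the TBB install scripts have already defined %TBB20_INSTALL_DIR%); to make the change permanent you would instead edit PATH through the System Properties dialog:

```shell
rem Add the TBB DLL directory to PATH for this session only.
set PATH=%PATH%;%TBB20_INSTALL_DIR%\ia32\vc8\bin
```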

Once these properties were all set up, I built the Debug and Release versions (select the configuration and then select Build→Build example1 from the menu bar). Once a version was built, I could run it using the Debug→Start Without Debugging menu item. I ran both versions ten times and got the following speedups:

       Speedup
  Debug    Release
  0.48     1.05
  0.48     1.08
  0.50     0.90
  0.47     0.95
  0.49     1.11
  0.50     1.07
  0.50     1.11
  0.50     1.13
  0.49     0.93
  0.50     1.10

In the Debug version, the parallel code runs at about half the speed of the serial code because all the diagnostic features of the TBB library are enabled. These features watch for many of the stupid things you might do while developing your code (like setting the number of physical threads to a negative number), and I would expect them to have a major impact on the speedup.

However, the speedup in the Release version is really disappointing! The average speedup is just 1.04. In fact, the parallel code is slower than the serial code in 30% of the tests. Why is that!? My feeling is that a simple multiplication of vectors doesn’t provide enough computational mass to outweigh the overhead of starting the threads, but I haven’t run any experiments to investigate this.

If you want to play around with this example and see if you can do better, you can download it from here.

At this point, I feel I’ve come a long way on this portion of my investigation into parallel programming:

  • I set up a development environment for building TBB-based programs.
  • I successfully recompiled the TBB libraries.
  • I learned a little about a few TBB constructs like the task scheduler, blocked ranges and parallel_for.
  • I built a program from scratch that uses TBB, and it ran successfully if not quickly.

And the best part is I haven’t spent a penny to do any of this.
As I continue exploring TBB, I’ll be guided by the following questions:

  • What other constructs does TBB provide, and what parallel programming styles do they support?
  • What characteristics of a program make it suitable for parallelization?
  • If a program doesn’t run well after it is coded for parallel execution, how do I find the bottlenecks?

About dave_vandenbout
President of XESS Corp, a manufacturer of FPGA development boards.

3 Responses to My first TBB program!

  1. Per Nordlöw says:

    I have to do more calculations per memory access to get speedups close to 100%. Mandelbrot fractal generation is an excellent example.

  2. Eugene says:

    may be a bit late for comment, but…

    “Parallel for” is really an MT-antipattern for doing simple computations on sequential blocks of memory, since you get cache line bouncing on access to adjacent addresses by different cores all the time and it brings down the performance to “sub-sequential” level.

  3. Pingback: Reductio ad absurdum « /// Parallelism Panorama ///
