Apple’s competition to TBB and CUDA

Apple recently announced Grand Central and OpenCL, which seem to be competitors to TBB and CUDA, respectively. Grand Central tries to make it easier to write multi-threaded apps for today’s multicore CPUs, and OpenCL (Open Computing Language) aims to make the processing power of GPUs available to general-purpose computing applications. OpenCL sounds like CUDA to me, but Steve Jobs says it’s “way beyond what Nvidia or anyone else has, and it’s really simple.” We’ll see. Here are some blog posts about Grand Central and OpenCL.

Addendum (June 18, 2008): It looks like Apple has submitted OpenCL to the Khronos Group as part of an effort “that aims to define a programming environment for applications running across both x86 and graphics chips”. And here is a Wikipedia entry about OpenCL.

parallel_scan finally explained!

I beat my head against parallel_scan for a week and never really understood why I was having the problems I did. Now the developers at Intel have provided a better explanation of how parallel_scan works. It turns out that the pre_scan method may never be run at all, so the final_scan method always has to redo what was done in pre_scan just to be safe. That explains why I had to make my pre_scan and final_scan methods identical in my example program. It would have been nice if one of the developers had mentioned that within a few days of when I submitted my problem to the Intel TBB forum. Or perhaps they should have called the method pre_scan_sometimes_if_we_feel_like_it just to warn TBB users of the actual behavior.

Anyway, problem solved.
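
To make the pre_scan/final_scan point concrete, here is a minimal sketch in plain C++ with no TBB involved. The RunningSum struct and the running_sum driver are hypothetical names of my own, and the driver only imitates, sequentially, two schedules a scheduler might pick: one where pre_scan never runs, and one where it runs on the middle third of the data. Because final_scan does the accumulation itself, both schedules give the same output.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a scan body; this is NOT the TBB interface.
struct RunningSum {
    int sum = 0;                   // total of everything this body has seen
    const std::vector<int>& in;
    std::vector<int>& out;

    RunningSum(const std::vector<int>& i, std::vector<int>& o) : in(i), out(o) {
        assert(in.size() == out.size());
    }

    // Optional pass: accumulates only, never writes output. May never be called.
    void pre_scan(std::size_t begin, std::size_t end) {
        for (std::size_t k = begin; k < end; ++k) sum += in[k];
    }

    // Mandatory pass: repeats the accumulation AND writes the output.
    void final_scan(std::size_t begin, std::size_t end) {
        for (std::size_t k = begin; k < end; ++k) {
            sum += in[k];
            out[k] = sum;
        }
    }
};

// Sequential imitation of two possible schedules (an assumption for illustration).
std::vector<int> running_sum(const std::vector<int>& in, bool with_pre_scan) {
    const std::size_t n = in.size();
    std::vector<int> out(n);
    if (!with_pre_scan) {
        RunningSum body(in, out);
        body.final_scan(0, n);          // pre_scan is never run on this path
        return out;
    }
    const std::size_t c1 = n / 3, c2 = 2 * n / 3;
    RunningSum first(in, out);
    first.final_scan(0, c1);            // first chunk needs no prefix
    RunningSum middle(in, out);
    middle.pre_scan(c1, c2);            // summary of the middle chunk only
    RunningSum last(in, out);
    last.sum = first.sum + middle.sum;  // prefix for the last chunk
    last.final_scan(c2, n);
    middle.sum = first.sum;             // restart the middle from its true prefix
    middle.final_scan(c1, c2);          // redoes the work pre_scan already did
    return out;
}
```

Calling running_sum(v, false) and running_sum(v, true) on the same vector should return identical results, which is the property the forum explanation boils down to.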

parallel_do? Parallel done!

parallel_do is a new TBB construct. It isn’t even in the Commercial Aligned or Stable releases; I had to install a Development release (tbb20_20080226oss) in order to get access to it.

The parallel_do construct is used when you don’t know how much data you have to process. parallel_do starts up tasks from a list, but these tasks can add further work to the list. parallel_do only shuts down when the list is empty and all the tasks are done.
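
The "list that grows while you process it" idea can be sketched in plain C++ without TBB. This version is single-threaded, so it only shows the control flow, not the parallelism. Feeder, do_until_empty and collect_halvings are hypothetical names invented for this illustration.

```cpp
#include <cassert>
#include <deque>
#include <vector>

// Hypothetical handle that lets a task append more work to the list.
template <typename Item>
class Feeder {
    std::deque<Item>& queue_;
public:
    explicit Feeder(std::deque<Item>& q) : queue_(q) {}
    void add(const Item& item) { queue_.push_back(item); }
};

// Runs body on every item, including items added along the way,
// and returns only when the list is empty. Sequential sketch only.
template <typename Item, typename Body>
void do_until_empty(std::deque<Item> work, Body body) {
    Feeder<Item> feeder(work);
    while (!work.empty()) {
        Item item = work.front();
        work.pop_front();
        body(item, feeder);
    }
}

// Example task: record each number, then queue half of it until we reach 1.
std::vector<int> collect_halvings(const std::vector<int>& seeds) {
    assert(!seeds.empty());
    std::vector<int> visited;
    std::deque<int> work(seeds.begin(), seeds.end());
    do_until_empty(work, [&visited](int n, Feeder<int>& feeder) {
        visited.push_back(n);
        if (n > 1) feeder.add(n / 2);   // the task creates further work
    });
    return visited;
}
```

Starting from the seeds 8 and 3, the list ends up holding six items in total even though only two were known up front.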

Parallel sorting

After my problems with parallel_scan, I approached parallel_sort with some trepidation. I was pleasantly surprised when parallel_sort worked as advertised. (I did have a few problems, but these were related to my C++ skills and not to TBB directly.)

parallel_scan works … kinda, sorta

In a previous post, I showed a program that uses the parallel_scan construct but gets the wrong result. Since then, I received a working running-sum example for parallel_scan (from Mike Deskevich of ITT) that I could poke at and observe the results. Doing that, I found a mistake I was making, and I found a mistake that TBB is making.

Scanners? Aren’t those the guys that make your head explode?

Back in the 1980s, there was a movie called “Scanners”. Scanners were mutants who could think intensely about you with a very constipated look and veins standing out on their faces. Then your head would explode. That’s the way the parallel_scan construct makes me feel.

There is no example for parallel_scan in the TBB tutorial, but there is a terse explanation in the reference manual. Basically, parallel_scan breaks a range into subranges and computes a partial result in each subrange in parallel. Then, the partial result for subrange k is used to update the information in subrange k+1, starting from k=0 and proceeding sequentially up to the last subrange. Then each subrange uses its updated information to compute its final result in parallel with all the other subranges.
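
The three phases described above can be written out for a running sum in plain C++. This is a sequential sketch of the idea, not TBB's implementation: the loops over subranges in phases 1 and 3 are the ones that could run in parallel, because each iteration touches only its own subrange. The function name blocked_running_sum and the fixed number of blocks are assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

std::vector<int> blocked_running_sum(const std::vector<int>& in, std::size_t blocks) {
    assert(blocks > 0);
    const std::size_t n = in.size();
    const std::size_t chunk = (n + blocks - 1) / blocks;  // ceiling division
    // Start index of subrange b, clamped so trailing subranges may be empty.
    auto start = [&](std::size_t b) { return std::min(n, b * chunk); };

    // Phase 1: partial result for each subrange (independent per subrange).
    std::vector<int> partial(blocks, 0);
    for (std::size_t b = 0; b < blocks; ++b)
        for (std::size_t k = start(b); k < start(b + 1); ++k)
            partial[b] += in[k];

    // Phase 2: sequential pass; subrange k's total updates subrange k+1.
    std::vector<int> offset(blocks, 0);
    for (std::size_t b = 1; b < blocks; ++b)
        offset[b] = offset[b - 1] + partial[b - 1];

    // Phase 3: final result for each subrange (independent per subrange).
    std::vector<int> out(n);
    for (std::size_t b = 0; b < blocks; ++b) {
        int sum = offset[b];
        for (std::size_t k = start(b); k < start(b + 1); ++k) {
            sum += in[k];
            out[k] = sum;
        }
    }
    return out;
}
```

Note that every input element is added twice, once in phase 1 and once in phase 3. That extra work is the price of letting the subranges proceed independently.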

Reductio ad absurdum

I used parallel_for in my first TBB program to multiply two vectors element-by-element and store the products into a third vector. But what about calculating their dot product, where the element-wise products are added together into a single number? TBB supports this type of operation with the parallel_reduce construct.

In order to use parallel_reduce, I needed to add two methods to the object that performs the parallel operations:

  • a splitting constructor that can cut off a piece of an existing object and initialize it;
  • a join method that combines the answers of two smaller problems into a result that applies to their total.
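
A rough sketch of those two methods, in plain C++ rather than TBB: SplitTag, DotProduct, reduce and dot are hypothetical names, and the recursive reduce function is a sequential stand-in for whatever the real scheduler does. The point is only to show where the splitting constructor and join fit.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct SplitTag {};  // hypothetical marker that selects the splitting constructor

struct DotProduct {
    const std::vector<double>& a;
    const std::vector<double>& b;
    double sum;

    DotProduct(const std::vector<double>& a_, const std::vector<double>& b_)
        : a(a_), b(b_), sum(0.0) {
        assert(a.size() == b.size());
    }

    // Splitting constructor: same inputs, but a fresh, empty partial answer.
    DotProduct(DotProduct& other, SplitTag) : a(other.a), b(other.b), sum(0.0) {}

    // Work on one piece of the index range.
    void operator()(std::size_t begin, std::size_t end) {
        for (std::size_t k = begin; k < end; ++k) sum += a[k] * b[k];
    }

    // Join: fold the other half's answer into this one.
    void join(const DotProduct& rhs) { sum += rhs.sum; }
};

// Sequential stand-in for the scheduler: split until pieces are small, then join.
void reduce(DotProduct& body, std::size_t begin, std::size_t end, std::size_t grain) {
    if (end - begin <= grain) {
        body(begin, end);
        return;
    }
    const std::size_t mid = begin + (end - begin) / 2;
    DotProduct right(body, SplitTag{});   // cut off a piece
    reduce(body, begin, mid, grain);      // these two calls are the ones that
    reduce(right, mid, end, grain);       // could run on different threads
    body.join(right);                     // combine the two smaller answers
}

double dot(const std::vector<double>& a, const std::vector<double>& b,
           std::size_t grain = 2) {
    DotProduct body(a, b);
    reduce(body, 0, a.size(), grain ? grain : 1);  // a grain of 0 would never terminate
    return body.sum;
}
```

For example, dot applied to the vectors (1, 2, 3) and (4, 5, 6) adds 4 + 10 + 18.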

I modified my vector multiplication program to compute the dot product using parallel_reduce.