Apple’s competition to TBB and CUDA

Apple recently announced Grand Central and OpenCL which seem to be competitors to TBB and CUDA, respectively. Grand Central tries to make it easier to write multi-threaded apps for today’s multicore CPUs, and OpenCL (Open Computer Library) aims to make the processing power of GPUs available in general-purpose computing applications. OpenCL sounds like CUDA to me, but Steve Jobs says it’s “way beyond what Nvidia or anyone else has, and it’s really simple.” We’ll see. Here are some blog posts about Grand Central and OpenCL.

Addendum (June 18, 2008): Looks like Apple has submitted OpenCL to the Khronos Group “that aims to define a programming environment for applications running across both x86 and graphics chips”.  And here is a Wikipedia entry about OpenCL.


Which is more popular? TBB or CUDA?

I’ve been monitoring the traffic on this blog since it started. (That isn’t hard – a big day has 100 page views.) Here are the accumulated hits for the blog posts I’ve made on TBB and CUDA:

Accumulated Hits
Framework #Hits Duration
TBB 1,045 120 days
CUDA 1,112 40 days

It appears that CUDA has garnered as much attention as TBB, but in a much shorter time and with far fewer posts. I’ll give three possible explanations for this:

  1. CUDA has been more visible in the tech press over the past few months, while TBB coverage has been almost non-existent.
  2. People perceive a bigger payoff from learning about CUDA (which offers much more potential parallelism with hundreds of parallel processors) than TBB (which uses the handful of cores available in today’s CPUs).
  3. My most popular posts concern setting up CUDA or TBB on Windows and getting a small example to compile. This is easy to do with TBB (after all, it’s being developed by Intel), but its hard to get the CUDA nvcc compiler integrated into Microsoft’s Visual C++ so people are looking for help with that.

What do you think? Is there some other reason I’ve missed? What’s your parallel programming framework of choice and why?

Update (6/24/2008):

Here are the updated statistics. CUDA is pulling away!

Accumulated Hits
Framework #Hits Duration
TBB 1,267 140 days
CUDA 2,722 60 days

Further Update (6/17/2010):

Let’s see where we are after two years:

Accumulated Hits
Framework #Hits
TBB 10,220
CUDA 235,010

parallel_scan finally explained!

I beat my head against parallel_scan for a week and never really understood why I was having the problems I did. Now the developers at Intel have provided a better explanation of how parallel_scan works. It turns out that the pre_scan method may never be run at all, so the final_scan method always has to re-do what was done in pre_scan just to be safe. That explains why I had to make my pre_scan and final_scan methods identical in my example program. It would have been nice if one of the developers had mentioned that within a few days of when I submitted my problem to the Intel TBB forum. Or perhaps they should have called the method pre_scan_sometimes_if_we_feel_like_it just to warn TBB-users of the actual behavior.

Anyway, problem solved.

parallel_do? Parallel done!

parallel_do is a new TBB construct. It isn’t even in the Commercial Aligned or Stable releases; I had to install a Development release (tbb20_20080226oss) in order to get access to it.

The parallel_do construct is used when you don’t know how much data you have to process. parallel_do starts up tasks from a list, but these tasks can add further work to the list. parallel_do only shuts down when the list is empty and all the tasks are done. Read more of this post

Parallel sorting

After my problems with parallel_scan, I approached parallel_sort with some trepidation. I was pleasantly surprised when parallel_sort worked as advertised. (I did have a few problems, but these were related to my C++ skills and not to TBB directly.) Read more of this post

parallel_scan works … kinda, sorta

In a previous post, I showed a program that uses the parallel_scan construct but gets the wrong result. Since then, I received a working, running sum example for parallel_scan (from Mike Deskevich of ITT) that I could poke at and observe the results. Doing that I found a mistake I was making, and I found a mistake that TBB is making. Read more of this post

Scanners? Aren’t those the guys that make your head explode?

Back in the 80’s, there was a movie called “Scanners”. Scanners were mutants that could think intensely about you with a very constipated look and veins standing out on their faces. Then your head would explode. That’s the way the parallel_scan construct makes me feel.

There is no example for parallel_scan in the TBB tutorial, but there is a terse explanation in the reference manual. Basically, parallel_scan breaks a range into subranges and computes a partial result in each subrange in parallel. Then, the partial result for subrange k is used to update the information in subrange k+1, starting from k=0 and proceeding sequentially up to the last subrange. Then each subrange uses its updated information to compute its final result in parallel with all the other subranges. Read more of this post