Threads and blocks and grids, oh my!

As an engineer, I like C because it is relatively low-level compared to other languages. This lets me infer how the C code is handled by the processor so I can make on-the-fly judgments about the efficiency of a program. For the same reason, I need a mental model of how a CUDA device is organized and how its parts operate.

With a single processor executing a program, it’s easy to tell what’s going on and where it’s happening. With a CUDA device, not so much. There seem to be a lot of things going on at once in a lot of different places. CUDA organizes a parallel computation using the abstractions of threads, blocks and grids for which I provide these simple definitions:

Thread: This is just an execution of a kernel with a given index. Each thread uses its index to access elements in an array (see the kernel in my first CUDA program) such that the collection of all threads cooperatively processes the entire data set.

Block: This is a group of threads. There’s not much you can say about the execution of threads within a block – they could execute concurrently or serially and in no particular order. You can coordinate the threads, somewhat, using the __syncthreads() function that makes a thread stop at a certain point in the kernel until all the other threads in its block reach the same point.

Grid: This is a group of blocks. There’s no synchronization at all between the blocks.
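
To make these abstractions concrete, here is a minimal kernel sketch (hypothetical, not the example program discussed later) showing how a thread combines its block and thread indices to pick the array element it works on:

// Each thread squares one element of a[]; the grid as a whole covers the array.
__global__ void square_one_each(float *a, int N)
{
  // blockIdx.x  = which block this thread belongs to (its position in the grid)
  // blockDim.x  = number of threads in each block
  // threadIdx.x = this thread's position within its block
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < N)               // guard threads that fall past the end of the array
    a[idx] = a[idx] * a[idx];
}

If the threads in a block needed to coordinate, a call to __syncthreads() would go wherever they all have to meet.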

But where do threads, blocks and grids actually get executed? With respect to Nvidia’s G80 GPU chip, it appears the computation is distributed as follows:

Grid → GPU: An entire grid is handled by a single GPU chip.

Block → MP: The GPU chip is organized as a collection of multiprocessors (MPs), with each multiprocessor responsible for handling one or more blocks in a grid. A block is never divided across multiple MPs.

Thread → SP: Each MP is further divided into a number of stream processors (SPs), with each SP handling one or more threads in a block.

(Some universities have extended CUDA to work on multicore CPUs by assigning a grid to the CPU with each core executing the threads in one or more blocks. I don’t have a link to this work, but it is mentioned here.)

In combination with the hierarchy of processing units, the G80 also provides a memory hierarchy:

Global memory: This memory is built from a bank of SDRAM chips connected to the GPU chip. Any thread in any MP can read or write to any location in the global memory. Sometimes this is called device memory.

Texture cache: This is a memory within each MP that can be filled with data from the global memory so it acts like a cache. Threads running in the MP are restricted to read-only access of this memory.

Constant cache: This is a read-only memory within each MP.

Shared memory: This is a small memory within each MP that can be read/written by any thread in a block assigned to that MP.

Registers: Each MP has a number of registers that are shared between its SPs.
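
Here is a hedged sketch (not taken from any real program; the names are invented for illustration) of how these memory spaces show up in CUDA C source:

// Constant memory: declared at file scope, read-only inside kernels,
// filled by the host with cudaMemcpyToSymbol().
__constant__ float coeffs[16];

// The pointer argument refers to global memory allocated with cudaMalloc().
__global__ void memory_spaces_demo(float *d_data)
{
  // Shared memory: one copy per block, visible to all threads in that block.
  // (This sketch assumes blocks of 256 threads.)
  __shared__ float tile[256];

  // Registers: ordinary automatic variables normally live in registers.
  int i = blockIdx.x * blockDim.x + threadIdx.x;

  tile[threadIdx.x] = d_data[i];              // global -> shared
  __syncthreads();                            // wait for the whole tile
  d_data[i] = tile[threadIdx.x] * coeffs[0];  // shared and constant -> global
}

// (The texture cache is used by reading global memory through texture
// references, e.g. tex1Dfetch(); that mechanism is omitted here.)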

As usual, the upper levels of the memory hierarchy provide larger, slower storage (access times of 400-600 cycles) while the lower levels are smaller and faster (access times of several cycles or less, I think). How large are these memories? You can find out by running the DeviceQuery application included in the CUDA SDK. This is the result I get for my Nvidia card:

There is 1 device supporting CUDA

Device 0: "GeForce 8600 GTS"
  Major revision number:                         1
  Minor revision number:                         1
  Total amount of global memory:                 268173312 bytes
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 8192
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1458000 kilohertz

Test PASSED

The GPU device is hooked to 256 MB of SDRAM that provides the global memory. Each MP in the GPU has access to 16 KB of shared memory and 8192 registers (I’m not sure what the register width is). And there is 64 KB of constant memory for all the MPs in the GPU (I’m not sure why that is not reported per-MP like the shared memory is).
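
If you would rather get these numbers from inside your own program than run DeviceQuery, the runtime API exposes them through cudaGetDeviceProperties(). A minimal sketch:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);  // properties of device 0

  printf("Global memory:           %u bytes\n", (unsigned)prop.totalGlobalMem);
  printf("Constant memory:         %u bytes\n", (unsigned)prop.totalConstMem);
  printf("Shared memory per block: %u bytes\n", (unsigned)prop.sharedMemPerBlock);
  printf("Registers per block:     %d\n", prop.regsPerBlock);
  printf("Warp size:               %d\n", prop.warpSize);
  printf("Max threads per block:   %d\n", prop.maxThreadsPerBlock);
  printf("Max block dimensions:    %d x %d x %d\n",
         prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
  printf("Max grid dimensions:     %d x %d x %d\n",
         prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
  return 0;
}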

DeviceQuery also shows the limits on the sizes of blocks and grids. A block is one-, two- or three-dimensional, with maximum x, y and z dimensions of 512, 512 and 64, respectively, subject to x × y × z ≤ 512, the maximum number of threads per block. Blocks are organized into one- or two-dimensional grids of up to 65,535 blocks in each dimension. The key limitation here is the 512-thread maximum per block, imposed largely by the small number of registers that must be divided among all the threads running in all the blocks assigned to an MP. The thread limit constrains the amount of cooperation between threads, because only threads within the same block can synchronize with each other and exchange data through the fast shared memory in an MP.
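
As an example, a two-dimensional launch configuration that stays inside these limits might look like the sketch below (WIDTH, HEIGHT, my_kernel and d_data are made-up names, not part of the program later in this post):

// 16 x 16 = 256 threads per block: under the 512-thread limit and a
// multiple of the warp size.
dim3 block(16, 16);

// Enough blocks to cover a WIDTH x HEIGHT data set, rounding up.
dim3 grid((WIDTH + block.x - 1) / block.x,
          (HEIGHT + block.y - 1) / block.y);

my_kernel<<<grid, block>>>(d_data, WIDTH, HEIGHT);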

Sitting between the memory sizes and the thread/block/grid dimensions in the DeviceQuery output is the warp size. What is a warp? I think the definition that applies to CUDA is “threads in a fabric running lengthwise”. The warp size is the number of threads running concurrently on an MP. In actuality, the threads are running both in parallel and pipelined: at the time this was written, each MP contains eight SPs and the fastest instruction takes four cycles, so each SP can have four instructions in its pipeline for a total of 8 × 4 = 32 instructions being executed concurrently. Within a warp, the threads all have sequential indices, so there is a warp with indices 0..31, the next with indices 32..63, and so on up to the total number of threads in a block.
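
You can see this grouping directly inside a kernel. This fragment (illustrative only, assuming a one-dimensional block) computes which warp a thread belongs to and its position, or lane, within that warp:

// warpSize is a built-in device-code constant (32 on the G80).
int warp_id = threadIdx.x / warpSize;  // warp 0 holds threads 0..31, warp 1 holds 32..63, ...
int lane_id = threadIdx.x % warpSize;  // this thread's position within its warp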

The homogeneity of the threads in a warp has a big effect on the computational throughput. If all the threads are executing the same instruction, then all the SPs in an MP can execute the same instruction in parallel. But if one or more threads in a warp executes a different instruction from the others, then the warp has to be partitioned into groups of threads based on the instructions being executed, after which the groups are executed one after the other. This serialization reduces the throughput as the threads become more and more divergent and split into smaller and smaller groups. So it pays to keep the threads as homogeneous as possible.
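
As a hedged illustration (a fragment, not the program below; i stands for the thread’s global index into an array a in global memory), compare two ways of splitting the threads between two operations:

// Divergent: even and odd threads sit in the same warp, so every warp must
// execute both paths, one after the other.
if (threadIdx.x % 2 == 0)  a[i] = a[i] * a[i];
else                       a[i] = a[i] + a[i];

// Non-divergent: the branch condition only changes at a warp boundary
// (32 threads), so each warp takes exactly one path at full speed.
if ((threadIdx.x / 32) % 2 == 0)  a[i] = a[i] * a[i];
else                              a[i] = a[i] + a[i];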

How the threads access global memory also affects the throughput. Things go much faster if the GPU can coalesce several global addresses into a single burst access over the wide data bus that goes to the external SDRAM. Conversely, reading/writing scattered memory addresses requires multiple accesses to the SDRAM, which slows things down. To help the GPU combine multiple accesses, the addresses generated by the threads in a warp must be sequential with respect to the thread indices, i.e. thread N must access address Base + N, where Base is a pointer to elements of type T aligned to 16 × sizeof(T) bytes.
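
In kernel code the difference is easy to spot. A small fragment (illustrative only, with a pointing into global memory and base aligned as described above):

// Coalesced: thread N reads address base + N, so a warp's accesses can be
// merged into a few wide bursts.
float x = a[base + threadIdx.x];

// Not coalesced: consecutive threads hit addresses two elements apart,
// forcing the GPU to issue more, narrower accesses.
float y = a[base + 2 * threadIdx.x];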

A program that shows some of these performance effects is given below.

// example2.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"

#include <stdio.h>
#include <cuda.h>
#include <cutil.h>

// Kernel that executes on the CUDA device
__global__ void square_array(float *a, int N)
{
#define STRIDE       32
#define OFFSET        0
#define GROUP_SIZE  512
  int n_elem_per_thread = N / (gridDim.x * blockDim.x);
  int block_start_idx = n_elem_per_thread * blockIdx.x * blockDim.x;
  int thread_start_idx = block_start_idx
			+ (threadIdx.x / STRIDE) * n_elem_per_thread * STRIDE
			+ ((threadIdx.x + OFFSET) % STRIDE);
  int thread_end_idx = thread_start_idx + n_elem_per_thread * STRIDE;
  if(thread_end_idx > N) thread_end_idx = N;
  int group = (threadIdx.x / GROUP_SIZE) & 1;
  for(int idx=thread_start_idx; idx < thread_end_idx; idx+=STRIDE)
  {
    if(!group) a[idx] = a[idx] * a[idx];
    else       a[idx] = a[idx] + a[idx];
  }
}

// main routine that executes on the host
int main(void)
{
  float *a_h, *a_d;  // Pointer to host & device arrays
  const int N = 1<<25;  // Make a big array with 2^25 (32M) elements
  size_t size = N * sizeof(float);
  a_h = (float *)malloc(size);        // Allocate array on host
  cudaMalloc((void **) &a_d, size);   // Allocate array on device
  // Initialize host array and copy it to CUDA device
  for (int i=0; i<N; i++) a_h[i] = (float)i;
  cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
  // Create timer for timing CUDA calculation
  unsigned int timer = 0;
  cutCreateTimer( &timer );
  // Set number of threads and blocks
  int n_threads_per_block = 1<<9;  // 512 threads per block
  int n_blocks = 1<<10;  // 1024 blocks
  // Do calculation on device
  cutStartTimer( timer );  // Start timer
  square_array <<< n_blocks, n_threads_per_block >>> (a_d, N);
  cudaThreadSynchronize();  // Wait for square_array to finish on CUDA
  cutStopTimer( timer );  // Stop timer
  // Retrieve result from device and store it in host array
  cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
  // Print some of the results and the CUDA execution time
  for (int i=0; i<N; i+=N/50) printf("%d %f\n", i, a_h[i]);
  printf("CUDA execution time = %f ms\n",cutGetTimerValue( timer ));
  // Cleanup
  free(a_h); cudaFree(a_d);
}

This program is just a modification of my first CUDA program. The size of the array on line 35 is increased from ten to 32M elements so the kernel will execute long enough to get an accurate reading with the timer (created on line 44). The number of threads per block and the number of blocks are increased (lines 46 and 47) because that provides a large pool of available computations the GPU can execute while waiting for memory accesses to complete.

The timer is started (line 49) and the kernel is initiated (line 50). Because the CUDA device works in parallel with the CPU, a call to the cudaThreadSynchronize() subroutine is needed to halt the CPU until the kernel has completed (line 51). Then the timer can be stopped (line 52). The execution time for the kernel is output on line 57.
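
The cutStartTimer()/cutStopTimer() calls come from the SDK’s cutil library; if you would rather not depend on it, the same measurement can be made with CUDA events from the standard runtime API. A sketch of equivalent timing code:

cudaEvent_t start, stop;
float elapsed_ms;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);                          // mark the start
square_array <<< n_blocks, n_threads_per_block >>> (a_d, N);
cudaEventRecord(stop, 0);                           // mark the end
cudaEventSynchronize(stop);                         // wait for the kernel to finish

cudaEventElapsedTime(&elapsed_ms, start, stop);     // time between the marks
printf("CUDA execution time = %f ms\n", elapsed_ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);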

The kernel (lines 11-29) has been modified so that each thread loops over a set of array elements using a memory access pattern that can be changed along with the homogeneity of the threads. The number of array elements handled by each thread is calculated on line 16 by dividing the array size by the total number of threads (which is the product of the number of blocks and the block dimension). Then, each block of threads is assigned a starting address for the contiguous block of array elements that its threads will process (line 17). Each thread in a block is assigned a starting address within its block determined by the thread index and the addressing stride length and offset. For example, if there are 512 total array elements handled by eight threads with the stride and offset set to four and zero, respectively, then the addresses accessed by each thread are:

thread 0:   0   4   8  12  16  20  24 ... 248 252
thread 1:   1   5   9  13  17  21  25 ... 249 253
thread 2:   2   6  10  14  18  22  26 ... 250 254
thread 3:   3   7  11  15  19  23  27 ... 251 255
thread 4: 256 260 264 268 272 276 280 ... 504 508
thread 5: 257 261 265 269 273 277 281 ... 505 509
thread 6: 258 262 266 270 274 278 282 ... 506 510
thread 7: 259 263 267 271 275 279 283 ... 507 511

In this example, you can see that threads 0-3 are accessing sequential addresses in their section of the global memory, so their accesses can be coalesced. The same applies to threads 4-7. But the accesses of threads 0-3 and 4-7 cannot be coalesced together because, taken as a group, they are not sequential. So the GPU has to make a separate memory burst for each group of threads.

Using the same example, if the offset is increased to one then the addresses for each thread become:

thread 0:   1   5   9  13  17  21  25 ... 249 253
thread 1:   2   6  10  14  18  22  26 ... 250 254
thread 2:   3   7  11  15  19  23  27 ... 251 255
thread 3:   0   4   8  12  16  20  24 ... 248 252
thread 4: 257 261 265 269 273 277 281 ... 505 509
thread 5: 258 262 266 270 274 278 282 ... 506 510
thread 6: 259 263 267 271 275 279 283 ... 507 511
thread 7: 256 260 264 268 272 276 280 ... 504 508

Now the addresses are still in the same range, but they are no longer sequential with respect to the thread indices.
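
If you want to check the pattern for other settings, the kernel’s indexing arithmetic can be reproduced on the host with a plain C program. Here is a small sketch; the constants mirror the eight-thread, 512-element example above (a single pretend block), not the real launch configuration:

#include <stdio.h>

int main(void)
{
  const int N       = 512;   // total array elements in the example
  const int STRIDE  = 4;
  const int OFFSET  = 1;     // set to 0 to reproduce the first table
  const int threads = 8;     // pretend launch: one block of eight threads
  const int n_elem_per_thread = N / threads;

  for (int t = 0; t < threads; t++)
  {
    // Same arithmetic as thread_start_idx/thread_end_idx in the kernel,
    // with block_start_idx = 0 because there is only one block here.
    int start = (t / STRIDE) * n_elem_per_thread * STRIDE + ((t + OFFSET) % STRIDE);
    int end   = start + n_elem_per_thread * STRIDE;
    if (end > N) end = N;

    printf("thread %d:", t);
    for (int idx = start; idx < end; idx += STRIDE)
      printf(" %3d", idx);
    printf("\n");
  }
  return 0;
}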

A third macro parameter, GROUP_SIZE (line 15), is used to direct the flow of execution in the for loop (lines 24-28). Setting the group size to four, for example, sets the group variable (line 23) to zero for threads with indices [0..3], [8..11], [16..19] … which makes these threads square the array elements (line 26). The remaining threads, which have group set to one, just double the array elements (line 27). This difference in operations prevents the MP from running the threads in both groups in parallel.

By playing with the STRIDE, OFFSET and GROUP_SIZE macros (lines 13-15), we can vary the memory access patterns and thread homogeneity and see their effect on the kernel's execution time. The following trials were done using the Release version of the program with execution times averaged over 1000 runs:

Trial   STRIDE   OFFSET   GROUP_SIZE   Execution Time
 #1       32       0          512         14.5 ms
 #2       16       0          512         14.7 ms
 #3        8       0          512         86.0 ms
 #4       32       1          512         93.5 ms
 #5       32       0           16         14.4 ms
 #6       32       0            8         22.1 ms
 #7        8       1            8         85.5 ms

Trial #1 uses the most advantageous settings with the stride set equal to the warp size and the offset set to zero so that each thread in a warp generates sequential addresses that track the thread indices. The threads in a block are placed into a maximal group of 512 so they are all executing exactly the same instructions.

In trial #2, decreasing the stride to half the warp size has no effect on the kernel’s execution time. Nvidia mentions that many operations in the MP proceed on a half-warp basis, but you can’t count on this behavior in future versions of their devices.

Further decreasing the stride in trial #3 expands the gaps between thread addresses and reduces the amount of address coalescing the GPU can do, to the point where the execution time increases by a factor of six.

Restoring the stride back to the warp size but edging the offset up to one in trial #4 causes the same problem with coalescing addresses. So addresses that do not track the thread index cause performance problems as great as using non-sequential addresses.

In trial #5, the stride and offset are returned to their best settings while the size of the thread groups executing the same instruction is reduced to 16 threads. No performance decrease is seen at half-warp size.

Decreasing the group size to eight threads in trial #6 does cause the MPs to execute the two groups sequentially. This causes only a 50% increase in run time because the two groups still have many instructions that they can execute in common.

Finally, setting all the parameters to their worst-case values in trial #7 does not lead to a multiplicative increase in execution time because the effect of serializing the thread groups is hidden by the increased memory access times caused by the non-sequential addresses.

Here’s the source code for this example if you want to try it. (Note: I had to move the cutil32D.dll and cutil32.dll libraries from the CUDA SDK directory into my C:\CUDA\bin directory so my example2.exe program could find them. You may have to do the same.)

So now I’ve gotten an idea of how the GPU executes code and some of the factors that affect its performance. (For reference and greater detail, please read the CUDA Programming Guide Version 1.1.) It’s important not to have a bunch of divergent conditional operations in the threads and not to jump around from place to place in global memory. In fact, given the long access times, it’s probably better to minimize global memory traffic and make use of on-chip registers and shared memory instead. I’ll take a look at these memories in a future post.


About dave_vandenbout
President of XESS Corp, a manufacturer of FPGA development boards.

33 Responses to Threads and blocks and grids, oh my!

  5. Rohit says:

    I have one more question. For my 2.0-capability device, it says 32 CUDA cores per multiprocessor. What is this ‘core’ and how is it different from a ‘stream processor’?

  6. Rohit says:

    For the compute capability 2.0 device that I have, the device properties report a maximum of 1024 threads per block, but when I put that value into the occupancy calculator (threads per block), I get an error. Can you explain? Thanks in advance.

  7. Sid says:

    Great post! Thank you so much

  8. Lance Harris says:

    When I click on the link to get the source code, it doesn’t work. Has it been moved or changed? Could you email me the source code?

    Thanks

  11. Azhar says:

    I ran the program with -deviceemu; the execution time remains the same in this mode irrespective of the values of STRIDE, OFFSET and GROUP_SIZE (i.e., ~4800 ms).

    • llpanorama says:

      This makes some sense since you’re running the code with a device emulator in the CPU. The emulator won’t show the same RAM access characteristics that the real GPU will.

  12. Azhar says:

    Just in case one can’t find cutil32D.dll, please go to: C:\Program Files\NVIDIA Corporation\NVIDIA CUDA SDK\bin\win32\Debug

  13. Krzysztof says:

    I used the profiler and compared two cases:

    1) block size 512

    2) block size 384

    For both, GROUP_SIZE was set to 512.

    There is almost no difference in execution time; all the data is very similar except occupancy, which is 67%
    for block size 512 and 100% for 384.
    Can anybody explain this?

  14. Krzysztof says:

    OK, I changed the block size in the code to 384 and got 100% occupancy in the sheet and in the profiler. But there is still no decrease in the execution time! I tried
    with GROUP_SIZE 512, 16 and 8.

  15. Krzysztof says:

    I have an NVIDIA GTS 250 card with Compute Capability 1.1 and I made a test with STRIDE = 32, OFFSET = 0 and
    GROUP_SIZE = 512, with a result of 18 ms.

    However, the occupancy calculator shows 67% load for these settings. Then I changed GROUP_SIZE to 384 and got 100% load in the sheet but no decrease in execution time.

    Any idea what the reason is?

  16. Dominik says:

    You mention an implementation of CUDA on multi-cores. If you’re still interested you can find a paper about it here: http://www.ldc.usb.ve/~vtheok/cursos/ci6323/pdf/lecturas/MCUDA:%20An%20Eficient%20Implementation%20of%20CUDA%20Kernels%20on%20Multi-cores.pdf

  17. Mark King says:

    If you are having a problem with the Visual C++ 2008 Express Edition compiler not finding the cutil.h header file, then go to Tools->Options, Projects and Solutions, and make sure the directory it is in is included in the list of directories for header files listed there.

  18. I tried the code after changing the kernel to the version below and got a time of 27.624880 ms on my Geforce 8600 MGT:

    __global__ void square_array(float *a, int N)
    {
      int totalThreads = blockDim.x * gridDim.x;
      int tasksPerThread = (N % totalThreads) == 0 ? N / totalThreads : N / totalThreads + 1;
      int myThreadIdx = blockIdx.x * blockDim.x + threadIdx.x;
      int startIdx = myThreadIdx;
      int endIdx = myThreadIdx + totalThreads * tasksPerThread;
      if (endIdx > N)
      {
        endIdx = N;
      }
      const int stride = totalThreads;
      for (int i = startIdx; i < endIdx; i += stride)
      {
        a[i] = a[i] * a[i];
      }
    }

  20. llpanorama says:

    Josh, the squares computed by the program will differ from those found by your calculator because the Nvidia GPU does all its calculations using single-precision floating-point (24-bit mantissa, 8-bit exponent) while your calculator uses BCD integers (at least for the smaller integers). Single-precision floating-point can only maintain 6 or 7 digits of accuracy, so the squares output by the GPU are drifting.

  21. Josh says:

    nevermind on the 671088, as I see that it is just 2^25 (N) * 50 as the for loop just prints every 50… sigh, its late…

    Anyway, why is the math wrong?

  22. Josh says:

    Clarification: I understand that the number to the right is the approximate square of the number to the left (I say approximate because my computer calc shows the first answer should be 450359103744 when this program shows 450359099392 – a difference of 4352).

    What I’m not sure of is where “671088” as the first number came from (I also can tell that the following numbers are multiples of the first one).

    Basically, does this number have any significance and why is it incorrect?

  23. Josh says:

    Could you please show what the output from the code above should be? When I run your program, I get the following and I’m not sure I understand what is going on:

    0 0.000000
    671088 450359099392.000000
    1342176 1801436397568.000000
    2013264 4053231992832.000000
    2684352 7205745590272.000000
    3355440 11258977845248.000000
    4026528 16212927971328.000000
    4697616 22067596492800.000000
    5368704 28822982361088.000000
    6039792 36479086624768.000000
    6710880 45035911380992.000000
    7381968 54493450338304.000000
    8053056 64851711885312.000000
    8724144 76110687633408.000000
    9395232 88270385971200.000000
    10066320 101330794315776.000000
    10737408 115291929444352.000000
    11408496 130153782968320.000000
    12079584 145916346499072.000000
    12750672 162579628425216.000000
    13421760 180143645523968.000000
    14092848 198608364240896.000000
    14763936 217973801353216.000000
    15435024 238239973638144.000000
    16106112 259406847541248.000000
    16777200 281474439839744.000000
    17448288 304442750533632.000000
    18119376 328311796400128.000000
    18790464 353081543884800.000000
    19461552 378751992987648.000000
    20132640 405323177263104.000000
    20803728 432795096711168.000000
    21474816 461167717777408.000000
    22145904 490441074016256.000000
    22816992 520615131873280.000000
    23488080 551689891348480.000000
    24159168 583665385996288.000000
    24830256 616541615816704.000000
    25501344 650318513700864.000000
    26172432 684996213866496.000000
    26843520 720574582095872.000000
    27514608 757053685497856.000000
    28185696 794433456963584.000000
    28856784 832713963601920.000000
    29527872 871895205412864.000000
    30198960 911977182396416.000000
    30870048 952959894552576.000000
    31541136 994843274772480.000000
    32212224 1037627390164992.000000
    32883312 1081312240730112.000000
    33554400 1125897759358976.000000
    CUDA execution time = 4.708800 ms

  25. Pingback: Final Year Project Diary » Blog Archive » Terminology research

  26. Pingback: Desenvolvimendo com CUDA no Ubuntu 8.04 « John Tortugo

  27. Paul says:

    Hi there I have just started using CUDA. I am having a little problem using 2D arrays.

    I am currently using cudaMallocPitch and cudaMemcpy2D, but not all of my outputs are correct when the program has finished.

    Any help would be greatly appreciated!

    Sincerely,

    Paul

  28. Rohit says:

    Is there an easy way to determine if your memory accesses are mostly coalesced or not? I guess the profiler (for XP only) gives a count of memory accesses by type but is there any easier way to find out if the accesses are too un-coalesced?

  29. emberhok says:

    Thanks for the post

  30. llpanorama says:

    I add CUDA files by creating a new .cpp file and then I change the file extension to .cu.

    The CUDA example projects have their linker output property set to dump the executables into a common directory. I suspect they do this so the required CUDA DLLs can be stored once in the common directory where all the executables can find it. You can output your executable into the normal VS2005 Debug or Release folders but you will have to copy the required CUDA DLL files there if you want to run it. Or place the CUDA DLLs in one of the standard folders that are searched when the executable starts.

  31. Satakarni says:

    Hi,

    I have installed the CUDA SDK and I was able to run almost every project provided in the SDK.

    I opened one of the programs (deviceQuery), copied in your program ("My first CUDA program"), rebuilt the solution, and when it compiled I got the right output.

    However, my question is: how do I start a new project? I mean, in VC++ 2005 we have an "inc" folder that includes all the *.h item files and likewise a "src" folder that includes all the *.cpp item files.

    How do I add the *.cu item files to those directories?

    I also found that CUDA programs compile when they are created in the CUDA SDK folder. Programs created in the default Visual Studio 2005 folder do not. What could be the problem?
