As an engineer, I like C because it is relatively low-level compared to other languages. This lets me infer how the C code is handled by the processor so I can make on-the-fly judgments about the efficiency of a program. For the same reason, I need a mental model of how a CUDA device is organized and how its parts operate.
With a single processor executing a program, it’s easy to tell what’s going on and where it’s happening. With a CUDA device, not so much. There seem to be a lot of things going on at once in a lot of different places. CUDA organizes a parallel computation using the abstractions of threads, blocks and grids for which I provide these simple definitions:
Thread: This is just an execution of a kernel with a given index. Each thread uses its index to access elements in array (see the kernel in my first CUDA program) such that the collection of all threads cooperatively processes the entire data set.
Block: This is a group of threads. There’s not much you can say about the execution of threads within a block – they could execute concurrently or serially and in no particular order. You can coordinate the threads, somewhat, using the _syncthreads() function that makes a thread stop at a certain point in the kernel until all the other threads in its block reach the same point.
Grid: This is a group of blocks. There’s no synchronization at all between the blocks.
But where do threads, blocks and grids actually get executed? With respect to Nvidia’s G80 GPU chip, it appears the computation is distributed as follows:
Grid → GPU: An entire grid is handled by a single GPU chip.
Block → MP: The GPU chip is organized as a collection of multiprocessors (MPs), with each multiprocessor responsible for handling one or more blocks in a grid. A block is never divided across multiple MPs.
Thread → SP: Each MP is further divided into a number of stream processors (SPs), with each SP handling one or more threads in a block.
(Some universities have extended CUDA to work on multicore CPUs by assigning a grid to the CPU with each core executing the threads in one or more blocks. I don’t have a link to this work, but it is mentioned here.)
In combination with the hierarchy of processing units, the G80 also provides a memory hierarchy:
Global memory: This memory is built from a bank of SDRAM chips connected to the GPU chip. Any thread in any MP can read or write to any location in the global memory. Sometimes this is called device memory.
Texture cache: This is a memory within each MP that can be filled with data from the global memory so it acts like a cache. Threads running in the MP are restricted to read-only access of this memory.
Constant cache: This is a read-only memory within each MP.
Shared memory: This is a small memory within each MP that can be read/written by any thread in a block assigned to that MP.
Registers: Each MP has a number of registers that are shared between its SPs.
As usual, the upper levels of the memory hierarchy provide larger, slower storage (access times of 400-600 cycles) while the lower levels are smaller and faster (access times of several cycles or less, I think). How large are these memories? You can find out by running the DeviceQuery application included in the CUDA SDK. This is the result I get for my Nvidia card:
There is 1 device supporting CUDA Device 0: "GeForce 8600 GTS" Major revision number: 1 Minor revision number: 1 Total amount of global memory: 268173312 bytes Total amount of constant memory: 65536 bytes Total amount of shared memory per block: 16384 bytes Total number of registers available per block: 8192 Warp size: 32 Maximum number of threads per block: 512 Maximum sizes of each dimension of a block: 512 x 512 x 64 Maximum sizes of each dimension of a grid: 65535 x 65535 x 1 Maximum memory pitch: 262144 bytes Texture alignment: 256 bytes Clock rate: 1458000 kilohertz Test PASSED
The GPU device is hooked to 256 MB of SDRAM that provides the global memory. Each MP in the GPU has access to 16 KB of shared memory and 8192 registers (I’m not sure what the register width is). And there is 64 KB of constant memory for all the MPs in the GPU (I’m not sure why that is not reported per-MP like the shared memory is).
DeviceQuery also shows the limits on the sizes of blocks and grids. A block is one-, two- or three-dimensional with the maximum sizes of the x, y and z dimensions being 512, 512 and 64, respectively, and such that x × y × z ≤ 512, which is the maximum number of threads per block. Blocks are organized into one- or two-dimensional grids of up to 65,535 blocks in each dimension. The primary limitation here is the maximum of 512 threads per block, primarily imposed by the small number of registers that can be allocated across all the threads running in all the blocks assigned to an MP. The thread limit constrains the amount of cooperation between threads because only threads within the same block can synchronize with each other and exchange data through the fast shared memory in an MP.
Between the memory sizes and the thread/block/grid dimensions is the warp size. What is a warp? I think the definition that applies to CUDA is “threads in a fabric running lengthwise”. The warp size is the number of threads running concurrently on an MP. In actuality, the threads are running both in parallel and pipelined. At the time this was written, each MP contains eight SPs and the fastest instruction takes four cycles. Therefore, each SP can have four instructions in its pipeline for a total of 8 × 4 = 32 instructions being executed concurrently. Within a warp, the threads all have sequential indices so there is a warp with indices 0..31, the next with indices 32..63 and so on up to the total number of threads in a block.
The homogeneity of the threads in a warp has a big effect on the computational throughput. If all the threads are executing the same instruction, then all the SPs in an MP can execute the same instruction in parallel. But if one or more threads in a warp is executing a different instruction from the others, then the warp has to be partitioned into groups of threads based on the instructions being executed, after which the groups are executed one after the other. This serialization reduces the throughput as the threads become more and more divergent and split into smaller and smaller groups. So it pays to keep the threads as homogenous as possible.
How the threads access global memory also affects the throughput. Things go much faster if the GPU can coalesce several global addresses into a single burst access over the wide data bus that goes to the external SDRAM. Conversely, reading/writing separated memory addresses requires multiple accesses to the SDRAM which slows things down. To help the GPU combine multiple accesses, the addresses generated by the threads in a warp must be sequential with respect to the thread indices, i.e. thread N must access address Base + N where Base is a pointer of type T and is aligned to 16 × sizeof(T) bytes.
A program that shows some of these performance effects is given below.
// example2.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <stdio.h>
#include <cuda.h>
#include <cutil.h>
// Kernel that executes on the CUDA device
__global__ void square_array(float *a, int N)
{
#define STRIDE 32
#define OFFSET 0
#define GROUP_SIZE 512
int n_elem_per_thread = N / (gridDim.x * blockDim.x);
int block_start_idx = n_elem_per_thread * blockIdx.x * blockDim.x;
int thread_start_idx = block_start_idx
+ (threadIdx.x / STRIDE) * n_elem_per_thread * STRIDE
+ ((threadIdx.x + OFFSET) % STRIDE);
int thread_end_idx = thread_start_idx + n_elem_per_thread * STRIDE;
if(thread_end_idx > N) thread_end_idx = N;
int group = (threadIdx.x / GROUP_SIZE) & 1;
for(int idx=thread_start_idx; idx < thread_end_idx; idx+=STRIDE)
{
if(!group) a[idx] = a[idx] * a[idx];
else a[idx] = a[idx] + a[idx];
}
}
// main routine that executes on the host
int main(void)
{
float *a_h, *a_d; // Pointer to host & device arrays
const int N = 1<<25; // Make a big array with 2**N elements
size_t size = N * sizeof(float);
a_h = (float *)malloc(size); // Allocate array on host
cudaMalloc((void **) &a_d, size); // Allocate array on device
// Initialize host array and copy it to CUDA device
for (int i=0; i<N; i++) a_h[i] = (float)i;
cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
// Create timer for timing CUDA calculation
unsigned int timer = 0;
cutCreateTimer( &timer );
// Set number of threads and blocks
int n_threads_per_block = 1<<9; // 512 threads per block
int n_blocks = 1<<10; // 1024 blocks
// Do calculation on device
cutStartTimer( timer ); // Start timer
square_array <<< n_blocks, n_threads_per_block >>> (a_d, N);
cudaThreadSynchronize(); // Wait for square_array to finish on CUDA
cutStopTimer( timer ); // Stop timer
// Retrieve result from device and store it in host array
cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
// Print some of the results and the CUDA execution time
for (int i=0; i<N; i+=N/50) printf("%d %f\n", i, a_h[i]);
printf("CUDA execution time = %f ms\n",cutGetTimerValue( timer ));
// Cleanup
free(a_h); cudaFree(a_d);
}
This program is just a modification of my first CUDA program. The size of the array on line 35 is increased from ten to 32M elements so the kernel will execute long enough to get an accurate reading with the timer (declared on line 44). The number of blocks and threads per block are increased (lines 46 and 47) because that provides a large pool of available computations the GPU can execute while waiting for memory accesses to complete.
The timer is started (line 49) and the kernel is initiated (line 50). Because the CUDA device works in parallel with the CPU, a call to the cudaThreadSynchronize() subroutine is needed to halt the CPU until the kernel has completed (line 51). Then the timer can be stopped (line 52). The execution time for the kernel is output on line 57.
The kernel (lines 11-29) has been modified so that each thread loops over a set of array elements using a memory access pattern that can be changed along with the homogeneity of the threads. The number of array elements handled by each thread is calculated on line 16 by dividing the array size by the total number of threads (which is the product of the number of blocks and the block dimension). Then, each block of threads is assigned a starting address for the contiguous block of array elements that its threads will process (line 17). Each thread in a block is assigned a starting address within its block determined by the thread index and the addressing stride length and offset. For example, if there are 512 total array elements handled by eight threads with the stride and offset set to four and zero, respectively, then the addresses accessed by each thread are:
thread 0: 0 4 8 12 16 20 24 ... 248 252 thread 1: 1 5 9 13 17 21 25 ... 249 253 thread 2: 2 6 10 14 18 22 26 ... 250 254 thread 3: 3 7 11 15 19 23 27 ... 251 255 thread 4: 256 260 264 268 272 276 280 ... 504 508 thread 5: 257 261 265 269 273 277 281 ... 505 509 thread 6: 258 262 266 270 274 278 282 ... 506 510 thread 7: 259 263 267 271 275 279 283 ... 507 511
In this example, you can see that threads 0-3 are accessing sequential addresses in their section of the global memory, so their accesses can be coalesced. The same applies to threads 4-7. But, the accesses for threads 0-3 and 4-7 combined cannot be coalesced because then they are not sequential. So the GPU has to make a separate memory bursts for each group of threads.
Using the same example, if the offset is increased to one then the addresses for each thread become:
thread 0: 1 5 9 13 17 21 25 ... 249 253 thread 1: 2 6 10 14 18 22 26 ... 250 254 thread 2: 3 7 11 15 19 23 27 ... 251 255 thread 3: 0 4 8 12 16 20 24 ... 248 252 thread 4: 257 261 265 269 273 277 281 ... 505 509 thread 5: 258 262 266 270 274 278 282 ... 506 510 thread 6: 259 263 267 271 275 279 283 ... 507 511 thread 7: 256 260 264 268 272 276 280 ... 504 508
Now the addresses are still in the same range, but they are no longer sequential with respect to the thread indices.
A third macro parameter, GROUP_SIZE (line 15), is used to direct the flow of execution in the for loop (lines 24-28). Setting the group size to four, for example, sets the group variable (line 23) to zero for threads with indices [0..3], [8..11], [16..19] … which makes these threads square the array elements (line 26). The remaining threads, which have group set to one, just double the array elements (line 27). This difference in operations prevents the MP from running the threads in both groups in parallel.
By playing with the STRIDE, OFFSET and GROUP macros (lines 13-15), we can vary the memory access patterns and thread homogeneity and see their effect on execution times for the kernel. The following trials were done using the Release version of the program with execution times averaged over 1000 runs:
| Trial | STRIDE | OFFSET | GROUP_SIZE | Execution Time |
|---|---|---|---|---|
| #1 | 32 | 0 | 512 | 14.5 ms |
| #2 | 16 | 0 | 512 | 14.7 ms |
| #3 | 8 | 0 | 512 | 86.0 ms |
| #4 | 32 | 1 | 512 | 93.5 ms |
| #5 | 32 | 0 | 16 | 14.4 ms |
| #6 | 32 | 0 | 8 | 22.1 ms |
| #7 | 8 | 1 | 8 | 85.5 ms |
Trial #1 uses the most advantageous settings with the stride set equal to the warp size and the offset set to zero so that each thread in a warp generates sequential addresses that track the thread indices. The threads in a block are placed into a maximal group of 512 so they are all executing exactly the same instructions.
In trial #2, decreasing the stride to half the warp size has no effect on the kernel’s execution time. Nvidia mentions that many operations in the MP proceed on a half-warp basis, but you can’t count on this behavior in future versions of their devices.
Further decreasing the stride in trial #3 expands the gaps between thread addresses and reduces the amount of address coalescing the GPU can do, to the point where the execution time increases by a factor of six.
Restoring the stride back to the warp size but edging the offset up to one in trial #4 causes the same problem with coalescing addresses. So addresses that do not track the thread index cause performance problems as great as using non-sequential addresses.
In trial #5, the stride and offset are returned to their best settings while the size of thread groups executing the same instruction are reduced to 16 threads. No performance decrease is seen at half-warp size.
Decreasing the groups to eight threads in trial #6 does cause the MPs to execute the groups sequentially. This causes only a 50% increase in run-time because the two groups still have many instructions that they can execute in common.
Finally, setting all the parameters to their worst-case values in trial #7 does not lead to a multiplicative increase in execution times because the effects of serializing the thread groups is hidden by the increased memory access times caused by the non-sequential addresses.
Here’s the source code for this example if you want to try it. (Note: I had to move the cutil32D.dll and cutil32.dll libraries from the CUDA SDK directory into my C:\CUDA\bin directory so my example2.exe program could find it. You may have to do the same.)
So now I’ve gotten an idea of how the GPU is executing code and some of the factors that affect it’s performance. (For reference and greater detail, please read the CUDA Programming Guide Version 1.1.) It’s important not to have a bunch of divergent conditional operations in the threads and not to jump around from place-to-place in global memory. In fact, given the long access times, it’s probably better to minimize global memory traffic and make use of on-chip registers and shared memory instead. I’ll take a look at these memories in a future post.
Love that yahoo, fine stuff.
Will read up and hopefully have something to post.
Cya.
_____________________________________
LocalAdLink
Comment by LocalAdLink — February 23, 2009 @ 12:09 pm
Josh, the squares computed by the program will differ from those found by your calculator because the Nvidia GPU does all its calculations using single-precision floating-point (24-bit mantissa, 8-bit exponent) while your calculator uses BCD integers (at least for the smaller integers). Single-precision floating-point can only maintain 6 or 7 digits of accuracy, so the squares output by the GPU are drifting.
Comment by llpanorama — January 21, 2009 @ 9:31 am
nevermind on the 671088, as I see that it is just 2^25 (N) * 50 as the for loop just prints every 50… sigh, its late…
Anyway, why is the math wrong?
Comment by Josh — January 20, 2009 @ 11:06 pm
Clarification: I understand that the number to the right is the approximate square of the number to the left (I say approximate because my computer calc shows the first answer should be 450359103744 when this program shows 450359099392 – a difference of 4352).
What I’m not sure of is where “671088″ as the first number came from (I also can tell that the following numbers are multiples of the first one).
Basically, does this number have any significance and why is it incorrect?
Comment by Josh — January 20, 2009 @ 11:03 pm
Could you please show what the output from the code above should be? When I run your program, I get the following and I’m not sure I understand what is going on:
0 0.000000
671088 450359099392.000000
1342176 1801436397568.000000
2013264 4053231992832.000000
2684352 7205745590272.000000
3355440 11258977845248.000000
4026528 16212927971328.000000
4697616 22067596492800.000000
5368704 28822982361088.000000
6039792 36479086624768.000000
6710880 45035911380992.000000
7381968 54493450338304.000000
8053056 64851711885312.000000
8724144 76110687633408.000000
9395232 88270385971200.000000
10066320 101330794315776.000000
10737408 115291929444352.000000
11408496 130153782968320.000000
12079584 145916346499072.000000
12750672 162579628425216.000000
13421760 180143645523968.000000
14092848 198608364240896.000000
14763936 217973801353216.000000
15435024 238239973638144.000000
16106112 259406847541248.000000
16777200 281474439839744.000000
17448288 304442750533632.000000
18119376 328311796400128.000000
18790464 353081543884800.000000
19461552 378751992987648.000000
20132640 405323177263104.000000
20803728 432795096711168.000000
21474816 461167717777408.000000
22145904 490441074016256.000000
22816992 520615131873280.000000
23488080 551689891348480.000000
24159168 583665385996288.000000
24830256 616541615816704.000000
25501344 650318513700864.000000
26172432 684996213866496.000000
26843520 720574582095872.000000
27514608 757053685497856.000000
28185696 794433456963584.000000
28856784 832713963601920.000000
29527872 871895205412864.000000
30198960 911977182396416.000000
30870048 952959894552576.000000
31541136 994843274772480.000000
32212224 1037627390164992.000000
32883312 1081312240730112.000000
33554400 1125897759358976.000000
CUDA execution time = 4.708800 ms
Comment by Josh — January 20, 2009 @ 10:57 pm
Hello. It is test.
Comment by Sotohictuts — November 19, 2008 @ 1:39 pm
[...] [...]
Pingback by Final Year Project Diary » Blog Archive » Terminology research — October 28, 2008 @ 8:13 am
[...] http://forums.nvidia.com/ http://www.ddj.com/architect/207200659 http://llpanorama.wordpress.com/2008/06/11/threads-and-blocks-and-grids-oh-my/ http://llpanorama.wordpress.com/2008/05/21/my-first-cuda-program/ [...]
Pingback by Desenvolvimendo com CUDA no Ubuntu 8.04 « John Tortugo — October 25, 2008 @ 5:03 pm
Hi there I have just started using CUDA. I am having a little problem using 2D arrays.
I am currently using cudaMallocPitch and cudaMemcpy2D, but not all of my outputs are correct when the program has finished.
Any help would be greately apreciated!
sincerly
Paul
Comment by Paul — October 2, 2008 @ 5:06 am
Is there an easy way to determine if your memory accesses are mostly coalesced or not? I guess the profiler (for XP only) gives a count of memory accesses by type but is there any easier way to find out if the accesses are too un-coalesced?
Comment by Rohit — September 10, 2008 @ 4:23 pm
Thanks for the post
Comment by emberhok — August 2, 2008 @ 8:36 pm
I add CUDA files by creating a new .cpp file and then I change the file extension to .cu.
The CUDA example projects have their linker output property set to dump the executables into a common directory. I suspect they do this so the required CUDA DLLs can be stored once in the common directory where all the executables can find it. You can output your executable into the normal VS2005 Debug or Release folders but you will have to copy the required CUDA DLL files there if you want to run it. Or place the CUDA DLLs in one of the standard folders that are searched when the executable starts.
Comment by llpanorama — July 14, 2008 @ 8:46 am
Hi,
I have installed CUDA SDK and I was able to run almost every project provided in SDK.
I opened one the program (device Query) and has copied your program (”My first CUDA program”)…and then rebuild the solution and when compiled I got the right output.
However my question is …how to “Start a new project”. I mean in VC++ 2005 we have “inc” folder that includes all *.h item files and like wise we have “src” folder that includes all the *.cpp item files.
How to add the *.cu item files into those directories.
And also like I found CUDA programs are compiled when they are created in the CUDA SDK folder. Programs created in the default Visual Studio 2005 folder are not. What could be the problem?
Comment by Satakarni — July 12, 2008 @ 6:40 pm