CUDA Gets Easier!

Several of my readers have had problems creating CUDA projects in Visual Studio, so I thought I’d update how to do it using the current version of CUDA (3.0 at the time of this writing).  The main point: it’s a lot easier than the procedure I outlined two years ago.

For hardware, I’m now using a Zotac GeForce GT 240 card with 96 stream processors that I purchased last year for $90. For my software development environment, I downloaded and installed the Microsoft SDK for Windows Server 2008 and Microsoft Visual C++ 2008 Express Edition. Then I downloaded and installed the NVIDIA Driver 197.13, the CUDA Toolkit 3.0 and the CUDA SDK 3.0 for 32-bit Windows XP.

Once everything was set up, the first thing I did was to recompile and run the deviceQuery example in

C:\Documents and Settings\All Users\Application Data\NVIDIA Corporation\NVIDIA GPU Computing SDK\C\src\deviceQuery

I just double-clicked the deviceQuery_vc90.sln file and the project popped up in the Visual Studio IDE. I hit F7 to rebuild the program, and then I pressed Ctrl+F5 to run it. The program ran and reported the presence of a GeForce GT 240 in my PC. So far, so good.
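If you’re curious what deviceQuery is doing under the hood, it’s mostly a loop over the CUDA runtime API’s device-property calls. Here’s a minimal sketch of my own (not the SDK source; the file name query_device.cu is just my choice) that prints a few of the same fields:

// query_device.cu : minimal sketch of a deviceQuery-style program
// (my own illustration, not the SDK source)
#include <stdio.h>
#include <cuda_runtime.h>

int main( void )
{
    int count = 0;
    cudaGetDeviceCount( &count );             // How many CUDA-capable devices?
    for ( int i = 0; i < count; i++ )
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties( &prop, i );  // Fill in the property struct
        printf( "Device %d: %s\n", i, prop.name );
        printf( "  Multiprocessors: %d\n", prop.multiProcessorCount );
        printf( "  Global memory:   %lu bytes\n", (unsigned long)prop.totalGlobalMem );
    }
    return 0;
}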

Next, I created a Win32 console application called cuda_example3. I renamed the wizard-generated cuda_example3.cpp to cuda_example3.cu because that’s where the CUDA kernel source is going. Then I copied the source from my first CUDA program into the file and saved it. Here’s the code so you can see it:

// cuda_example3.cu : Defines the entry point for the console application.
//

#include "stdafx.h"

#include <stdio.h>
#include <cuda.h>

// Kernel that executes on the CUDA device
__global__ void square_array( float *a, int N )
{
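    // Compute this thread's global index from its block and thread IDs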
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if ( idx < N )
        a[idx] = a[idx] * a[idx];
}

// main routine that executes on the host
int main( void )
{
    float *a_h, *a_d; // Pointer to host & device arrays
    const int N = 10; // Number of elements in arrays
    size_t size = N * sizeof( float );
    a_h = (float *)malloc( size );    // Allocate array on host
    cudaMalloc( (void **)&a_d, size ); // Allocate array on device
    // Initialize host array and copy it to CUDA device
    for ( int i = 0; i < N; i++ )
        a_h[i] = (float)i;
    cudaMemcpy( a_d, a_h, size, cudaMemcpyHostToDevice );
    // Do calculation on device:
    int block_size = 4;
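    // Integer ceiling of N/block_size so every element gets a thread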
    int n_blocks   = N / block_size + ( N % block_size == 0 ? 0 : 1 );
    square_array <<< n_blocks, block_size >>> ( a_d, N );
    // Retrieve result from device and store it in host array
    cudaMemcpy( a_h, a_d, sizeof( float ) * N, cudaMemcpyDeviceToHost );
    // Print results
    for ( int i = 0; i < N; i++ )
        printf( "%d %f\n", i, a_h[i] ); // Cleanup
    free( a_h );
    cudaFree( a_d );
}
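One thing this example doesn’t do is check for errors; CUDA runtime calls and kernel launches fail silently unless you ask. Here’s a small helper of my own (not part of the SDK or the example above) that you can drop into the .cu file:

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

// My own helper, not part of the SDK example: print a message and exit
// if the most recent CUDA runtime call or kernel launch failed.
void checkCUDAError( const char *msg )
{
    cudaError_t err = cudaGetLastError();
    if ( err != cudaSuccess )
    {
        fprintf( stderr, "CUDA error (%s): %s\n", msg, cudaGetErrorString( err ) );
        exit( EXIT_FAILURE );
    }
}

Calling checkCUDAError( "square_array" ) right after the kernel launch will catch bad launch configurations that would otherwise go unnoticed.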

At this point, Visual Studio had no idea how to compile a .cu file. In the past, I crafted a Custom Build Step in the Project Properties page that invoked Nvidia’s nvcc tool with the appropriate compiler options. No more need for that! Instead, I highlighted cuda_example3 in the Solution Explorer pane, and then selected Project→Custom Build Rules… from the menu. Then I clicked on the Find Existing… button in the Custom Build Rule Files window and steered it to this file:

C:\Documents and Settings\All Users\Application Data\NVIDIA Corporation\NVIDIA GPU Computing SDK\C\common\Cuda.rules

Cuda.rules contains all the rules and options needed to merge .cu files into the Visual Studio C++ compilation flow.
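Under the hood, the rule just invokes nvcc on each .cu file, the same tool the old custom build step ran by hand. If you ever need to reproduce it yourself, the command boils down to something like this (the exact options come from the rule’s property pages, so treat this as an approximation):

nvcc -c cuda_example3.cu -o Debug\cuda_example3.obj -I"C:\CUDA\include"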

The only other changes I needed to make were to indicate the locations of the CUDA libraries in the project properties (I did this for both the Debug and Release configurations):

Configuration Properties → Linker → General:
Additional Library Directories = C:\CUDA\lib;"C:\Documents and Settings\All Users\Application Data\NVIDIA Corporation\NVIDIA GPU Computing SDK\C\common\lib"

Configuration Properties → Linker → Input:
Additional Dependencies = cudart.lib

After doing this, the program compiled and produced the following correct result:

0 0.000000
1 1.000000
2 4.000000
3 9.000000
4 16.000000
5 25.000000
6 36.000000
7 49.000000
8 64.000000
9 81.000000

For those of you who want to try CUDA but don’t have a CUDA-enabled GPU card, there is a way to link to a CUDA device emulator. Simply replace cudart.lib with cudartemu.lib in the project properties as follows:

Configuration Properties → Linker → Input:
Additional Dependencies = cudartemu.lib

This supplants the use of the -deviceemu compiler option in earlier versions of CUDA.

Finally, you may want C++ syntax-coloring and IntelliSense to work on your .cu source files. To get syntax-coloring, click on the Tools→Options menu. Then in the Options window under Text Editor→File Extension, enter the .cu and .cuh file extensions and select Microsoft Visual C++ as the editor. To enable IntelliSense, you’ll have to edit the Windows registry: add the .cu and .cuh file extensions to the NCB Default C/C++ Extensions value under the key HKEY_CURRENT_USER\Software\Microsoft\VisualStudio\9.0\Languages\Language Services\C/C++. That should do it.
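After the edit, the value should end up looking something like this (the exact list of extensions depends on your installation; the key point is that .cu and .cuh appear in the semicolon-separated list):

NCB Default C/C++ Extensions = .cpp;.cxx;.c;.cc;.h;.hh;.hxx;.hpp;.inl;.tlh;.tli;.cu;.cuh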

Here’s the source code for this example if you want to try it.

Updating to CUDA 2.3

It’s been a while since I’ve posted anything, so I thought I’d start again by upgrading from CUDA 1.1 to the new CUDA 2.3.

For hardware, I’m still using my old NX8600GTS graphics card. For my software development environment, I downloaded and installed the Microsoft SDK for Windows Server 2008 and Microsoft Visual C++ 2008 Express Edition. Then I downloaded and installed the NVIDIA Driver 190.38, the CUDA Toolkit 2.3 and the CUDA SDK 2.3 for 32-bit Windows XP.

Nvidia GTX 295 GPU with 480 Cores!

The title says it all.  Read a bit more about it here.

CUDA vs. FPGAs for high-performance computing

A column by Kevin Morris, editor of the FPGA Journal, discusses the new Nvidia GPU offerings. Here’s my response about why GPUs will kill off the use of field-programmable gate arrays (FPGAs) as accelerators in high-performance computing systems.

New Nvidia GTX 280 and 260 GPUs are announced!

Nvidia has announced their new GTX 280 and 260 GPU chips. The 280 and 260 increase the number of SPs to 240 and 192, respectively, and widen the interface to device memory to 512 and 448 bits. (The older 8800 GTX has 128 SPs and a 384-bit wide memory interface.)

Here is a blog posting with a picture of the GTX 280 chip.

Apple’s competition to TBB and CUDA

Apple recently announced Grand Central and OpenCL, which seem to be competitors to TBB and CUDA, respectively. Grand Central tries to make it easier to write multi-threaded apps for today’s multicore CPUs, and OpenCL (Open Computing Language) aims to make the processing power of GPUs available in general-purpose computing applications. OpenCL sounds like CUDA to me, but Steve Jobs says it’s “way beyond what Nvidia or anyone else has, and it’s really simple.” We’ll see. Here are some blog posts about Grand Central and OpenCL.

Addendum (June 18, 2008): Looks like Apple has submitted OpenCL to the Khronos Group, which “aims to define a programming environment for applications running across both x86 and graphics chips.” And here is a Wikipedia entry about OpenCL.

Threads and blocks and grids, oh my!

As an engineer, I like C because it is relatively low-level compared to other languages. This lets me infer how the C code is handled by the processor so I can make on-the-fly judgments about the efficiency of a program. For the same reason, I need a mental model of how a CUDA device is organized and how its parts operate.
