I got CUDA set up and running with Visual C++ 2005 Express Edition in my previous post. Now I’ll write my first CUDA program. It’s a modification of an example program from a great series of articles on CUDA by Rob Farber published in Dr. Dobb’s Journal. Rob does his examples in a make-based build environment; I’ll show how to build a CUDA program in the Visual C++ IDE.
Simple CUDA programs have a basic flow:
- The host initializes an array with data.
- The array is copied from the host to the memory on the CUDA device.
- The CUDA device operates on the data in the array.
- The array is copied back to the host.
My first CUDA program, shown below, follows this flow. It takes an array and squares each element. I can barely contain my excitement.
// example1.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <stdio.h>
#include <cuda.h>
// Kernel that executes on the CUDA device
__global__ void square_array(float *a, int N)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx<N) a[idx] = a[idx] * a[idx];
}
// main routine that executes on the host
int main(void)
{
float *a_h, *a_d; // Pointer to host & device arrays
const int N = 10; // Number of elements in arrays
size_t size = N * sizeof(float);
a_h = (float *)malloc(size); // Allocate array on host
cudaMalloc((void **) &a_d, size); // Allocate array on device
// Initialize host array and copy it to CUDA device
for (int i=0; i<N; i++) a_h[i] = (float)i;
cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
// Do calculation on device:
int block_size = 4;
int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);
square_array <<< n_blocks, block_size >>> (a_d, N);
// Retrieve result from device and store it in host array
cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
// Print results
for (int i=0; i<N; i++) printf("%d %f\n", i, a_h[i]);
// Cleanup
free(a_h); cudaFree(a_d);
}
Two pointers are declared on line 19 of the main routine: a_h points to the array that is stored on the host, while a_d points to the array on the CUDA device. The a_h array is allocated in the host memory on line 22 using the standard malloc subroutine, but a_d is allocated in the CUDA device memory using the cudaMalloc subroutine found in the CUDA API (line 23). (Note that a pointer to the a_d pointer is passed to cudaMalloc so it can store the address of the array in a_d.)
In order to create some values to operate upon, each element in the host array is initialized with its array index (line 25). Then the cudaMemcpy subroutine is used to copy a_h from the host into a_d on the CUDA device. (The cudaMemcpyHostToDevice flag, defined in the API, indicates the direction of the transfer.)
In lines 28-30, the host initiates the execution of the kernel function, square_array, on the CUDA device. A CUDA device contains individual processing elements, each of which can execute a thread. A number of the processing elements are grouped together to form a block, and a number of blocks constitutes a grid. In this example, the number of threads per block is set to four (line 28). Then the total number of blocks that are needed to get enough threads to square each array element is calculated on line 29. (For ten array elements, three blocks each with four threads are needed.) On line 30, the host initiates the kernel function on the CUDA device. The number of blocks and the number of threads in each block are indicated between the <<<…>>> following the kernel name. (This information is picked up by the Nvidia compiler, nvcc, and is used when generating the instructions that start the kernel on the CUDA device. More on nvcc, later.) Following that, the standard argument list to square_array contains a pointer to the array in the CUDA device memory and the number of elements in the array.
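The block-count arithmetic can be checked on the host alone. Here is a minimal sketch in plain C++ (no CUDA needed). The function names blocks_article and blocks_ceil_div are my own, made up for illustration, and are not part of the CUDA API. The first mirrors the expression used in the listing; the second is the common ceiling-division idiom, which should give the same answer for non-negative N and a positive block size.

```cpp
#include <cassert>

// Mirrors the listing: one extra block when N is not a multiple of block_size.
// (Hypothetical helper name, for illustration only.)
int blocks_article(int n, int block_size) {
    assert(block_size > 0);
    return n / block_size + (n % block_size == 0 ? 0 : 1);
}

// Ceiling division written as a single expression.
// (Hypothetical helper name, for illustration only.)
int blocks_ceil_div(int n, int block_size) {
    assert(block_size > 0);
    return (n + block_size - 1) / block_size;
}
```

With N = 10 and a block size of 4, both return 3, so 12 threads are launched and the idx<N test in the kernel leaves the last two with nothing to do.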
The kernel is shown on lines 10-14. The __global__ keyword indicates that this is a kernel function that should be processed by nvcc to create machine code that executes on the CUDA device, not the host. In this example, each thread will execute the same kernel function and will operate upon only a single array element. Each thread is distinguished from all the others by block and thread indices that can be used to determine the array element the thread will access. On line 12, the array index is found by multiplying the thread’s block index (blockIdx.x) by the number of threads in each block (blockDim.x) and then adding the index of the thread within the block (threadIdx.x). If the index is within the bounds of the array, then the corresponding array element is squared (line 13).
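To see which array element each thread touches, the index calculation can be replayed on the host. This is only a sketch under my own assumptions: global_index and emulate_square_array are hypothetical names, and the two nested loops visit the (block, thread) pairs one after another, whereas the real device runs them in parallel. Block and thread indices both start at zero, so the first thread of the first block gets array index 0.

```cpp
#include <cassert>
#include <vector>

// Same arithmetic as the kernel: blockIdx.x * blockDim.x + threadIdx.x.
// (Hypothetical helper name, for illustration only.)
int global_index(int block_idx, int block_dim, int thread_idx) {
    return block_idx * block_dim + thread_idx;
}

// Serial, host-only replay of the launch: every (block, thread) pair
// computes its index and squares one element if the index is in bounds.
std::vector<float> emulate_square_array(std::vector<float> a, int block_size) {
    assert(block_size > 0);
    const int n = static_cast<int>(a.size());
    const int n_blocks = (n + block_size - 1) / block_size;
    for (int block_idx = 0; block_idx < n_blocks; ++block_idx) {
        for (int thread_idx = 0; thread_idx < block_size; ++thread_idx) {
            const int idx = global_index(block_idx, block_size, thread_idx);
            if (idx < n) a[idx] = a[idx] * a[idx];  // same bounds check as the kernel
        }
    }
    return a;
}
```

For ten elements and four threads per block, the last block produces indices 8 through 11; only 8 and 9 pass the bounds check.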
Immediately after starting the kernel, the host begins a transfer of the data from the array in the CUDA device memory back to the array in the host memory (line 32). This transfer is delayed until the CUDA device has finished executing the kernel, so there is no chance of getting data that has not been processed yet. Then the host displays the contents of the array (line 34) and frees the array memory on both itself and the CUDA device (line 36).
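One thing the listing does not do is look at the status codes that the CUDA runtime calls return, so a failed allocation or copy goes unnoticed and the program simply prints whatever is in the host array. A small helper along the following lines could make such failures visible. It is a sketch, not part of the CUDA API: check_status is a name I made up, and it relies only on the fact that cudaSuccess has the value 0, taking the status as a plain int so that it compiles without the CUDA headers.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper: `status` is the integer value of the code returned by
// a CUDA runtime call (0 corresponds to cudaSuccess), `what` names the call.
// Returns true on success; otherwise reports the failure on stderr.
bool check_status(int status, const char* what) {
    assert(what != 0);
    if (status != 0) {
        std::fprintf(stderr, "%s failed with status %d\n", what, status);
        return false;
    }
    return true;
}
```

In the .cu file this might be used as check_status(cudaMalloc((void **) &a_d, size), "cudaMalloc"), and likewise around each cudaMemcpy. After the kernel launch, the value of cudaGetLastError() could be passed in the same way, assuming the toolkit version in use provides it.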
At this point, I have a CUDA-enabled program, but I don’t have it integrated into a Visual C++ project. It actually takes a bit of work to do that. To start, I brought up the Visual C++ 2005 Express Edition IDE and clicked on the New Project button (you can also use File→New→Project… from the menu). In the New Project window, I selected Win32 as the project type and Win32 Console Application as the template. I gave the project the creative name of example1 and set its location to the C:\llpanorama\CUDA\examples directory. After clicking OK in the New Project window, and then clicking Finish in the Win32 Application Wizard window, a window opened with a simple code skeleton. I replaced the code skeleton with the code shown above.
After saving the code, I right-clicked the example1.cpp file, selected Rename from the drop-down menu and renamed the file to example1.cu. Files with the .cu extension are intended to be processed by nvcc. nvcc will extract the kernel portion of example1.cu and compile it for execution on the CUDA device while using the Visual C++ compiler to compile the remainder of the file for execution on the host.
In its default configuration, Visual C++ doesn’t know how to compile .cu files. It has to be told explicitly how to do this using a Custom Build Step. This is done by right-clicking on the example1.cu file and selecting Properties from the drop-down menu. In the Property Pages window that appears, set the Custom Build Step command line as follows:
Configuration Properties → Custom Build Step → General:
Command Line =
"$(CUDA_BIN_PATH)\nvcc.exe" -ccbin "$(VCInstallDir)bin" -c -D_DEBUG -DWIN32 -D_CONSOLE -D_MBCS -Xcompiler /EHsc,/W3,/nologo,/Wp64,/Od,/Zi,/MTd -I"$(CUDA_INC_PATH)" -I./ -o $(ConfigurationName)\example1.obj example1.cu
What does this command line do? Let’s break it down piece-by-piece:
"$(CUDA_BIN_PATH)\nvcc.exe": The location of the nvcc compiler.
-ccbin "$(VCInstallDir)bin": The location of the Visual C++ compiler.
-c: Compile the source file to an object file (.obj extension) and stop there, without linking.
-D_DEBUG -DWIN32 -D_CONSOLE -D_MBCS: Various macro definitions.
-Xcompiler /EHsc,/W3,/nologo,/Wp64,/Od,/Zi,/MTd: Various options that are passed by nvcc directly to the Visual C++ compiler.
-I"$(CUDA_INC_PATH)": Look in the CUDA include directories for needed header files.
-I./: Look in the current directory for needed header files.
-o $(ConfigurationName)\example1.obj: The location and name of the resulting object file.
example1.cu: The source file that the compiler will work on.
In addition to setting the command line for the example1.cu file, the location of the output file is specified as follows:
Configuration Properties → Custom Build Step → General:
Outputs = $(ConfigurationName)\example1.obj
After setting the file properties, the properties for the example1 project have to be modified. Here are the project property settings I used for the Debug configuration:
Configuration Properties → C/C++ → General:
Additional Include Directories = $(CUDA_INC_PATH);"C:\Program Files\NVIDIA Corporation\NVIDIA CUDA SDK\common\inc"
Configuration Properties → C/C++ → General:
Debug Information Format = Program Database (/Zi)
Configuration Properties → C/C++ → Code Generation:
Runtime Library = Multi-threaded Debug (/MTd)
Configuration Properties → Linker → General:
Enable incremental linking = No (/INCREMENTAL:NO)
Configuration Properties → Linker → General:
Additional Library Directories = "C:\CUDA\lib";"C:\Program Files\NVIDIA Corporation\NVIDIA CUDA SDK\common\lib"
Configuration Properties → Linker → Input:
Additional Dependencies = cudart.lib cutil32D.lib
Configuration Properties → Linker → Optimization:
Enable COMDAT folding = Do Not Remove Redundant COMDATs (/OPT:NOICF)
Now the project can be compiled and run. Here’s the result:
0 0.000000
1 1.000000
2 4.000000
3 9.000000
4 16.000000
5 25.000000
6 36.000000
7 49.000000
8 64.000000
9 81.000000
I told you it was exciting! Well, at least it’s right.
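Checking ten numbers by eye is easy, but for larger N the host could compare the returned array against values it computes itself. The sketch below does that in plain C++. The function name count_mismatches and the relative tolerance are my own assumptions, not something from the CUDA toolkit.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Counts the elements of `result` that differ from the expected i * i.
// (Hypothetical helper; `rel_tol` is an assumed relative tolerance.)
std::size_t count_mismatches(const std::vector<float>& result,
                             float rel_tol = 1e-6f) {
    std::size_t bad = 0;
    for (std::size_t i = 0; i < result.size(); ++i) {
        const float x = static_cast<float>(i);
        const float expected = x * x;  // CPU reference value
        if (std::fabs(result[i] - expected) > rel_tol * std::fabs(expected)) {
            ++bad;
        }
    }
    return bad;
}
```

An array that comes back unsquared (0 through 9) yields eight mismatches rather than ten, because 0 and 1 are their own squares.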
In order to compile the Release configuration, a few changes need to be made to the file and project properties. For the example1.cu file, the Custom Build Step command line has to be changed to remove the _DEBUG macro definition, enable compiler optimization, and link with the Release runtime library:
Configuration Properties → Custom Build Step → General:
Command Line =
"$(CUDA_BIN_PATH)\nvcc.exe" -ccbin "$(VCInstallDir)bin" -c -DWIN32 -D_CONSOLE -D_MBCS -Xcompiler /EHsc,/W3,/nologo,/Wp64,/O2,/Zi,/MT -I"$(CUDA_INC_PATH)" -I./ -o $(ConfigurationName)\example1.obj example1.cu
The project properties that have to be changed in the Release configuration are the linking for the runtime library and the use of the non-debug version of the CUDA utilities library:
Configuration Properties → C/C++ → Code Generation:
Runtime Library = Multi-threaded (/MT)
Configuration Properties → Linker → Input:
Additional Dependencies = cudart.lib cutil32.lib
Once those changes are made, the Release version of the example1 project can be compiled and run. It will output the same exciting result.
Here’s the source code for this example if you want to try it.
Don’t have a CUDA-capable GPU board on your PC but still want to try running this program? Easy! Just add the following option to the Custom Build Step command line: -deviceemu. This will link-in a CUDA device emulator that runs on the host. The emulator becomes the target for all the CUDA API calls and executes the kernel. The program will run just like a CUDA device is there, except slower. (Here is the project file with the -deviceemu option.)
So I’ve written my first CUDA program and gotten it to compile using Visual C++ 2005 Express Edition. Setting up the compilation options was as much (more?) work as writing the program, so you might be interested in a CUDA template for Visual C++ 2005 written by kyzhao. The installer doesn’t work for me (maybe because I’m using the free Express Edition), but it might help you.
I’m using it in Ubuntu. I am compiling with nvcc -o out vekadd.cu and running it with ./out, and the result is as follows (no squares):
0 0.000000
1 1.000000
2 2.000000
3 3.000000
4 4.000000
5 5.000000
6 6.000000
7 7.000000
8 8.000000
9 9.000000
What am I doing wrong?
Greets
Comment by alex — July 9, 2009 @ 4:17 am
Alex, this is exactly the result I get if I disable my GPU card. Essentially, the a_h array gets initialized with 0..9 but never gets the squared results because the GPU is not running.
I don’t know how to enable/disable your card under linux. You might try running the deviceQuery example program to see if it picks up your GPU card. (See my previous blog entry which does this.)
Comment by llpanorama — July 9, 2009 @ 8:30 am
Are you sure your example really runs on the GPU instead of on the CPU?
I think you are happy too soon.
Just increase N and/or run it repeatedly, then you will see that the GPU stays cool while the CPU reports load.
T
Comment by tom — July 6, 2009 @ 10:49 pm
Yes, I believe this program runs on the GPU and not the CPU. I can disable my NVIDIA 8600 card and the program computes incorrect results when I do so. The correct results are output once the GPU is re-enabled.
I also compiled the program for CPU-only operation using the -deviceemu option and it computes correct answers regardless of whether the GPU is enabled or disabled.
This program makes very little use of the GPU, even with large N or repeated use (just a single multiplication for each array element). Most of the work involves moving the data from the PC to the GPU card and back under the direction of the CPU. Therefore, it is not surprising that the GPU stays cool and the CPU shows a large load.
I could be wrong, but I would need to see more compelling evidence than you have provided that the program is not actually running in the GPU.
Comment by llpanorama — July 7, 2009 @ 1:20 pm
If you increase N, then you increase the number of lines displayed in the command window using printf. That’s where the CPU load happens.
Put N = 10000 (for example) and comment the line with “printf”, you won’t see CPU load.
Comment by Vincent — July 7, 2009 @ 2:38 pm
Hi,
thank you for providing me your gmail ID. Yesterday CUDA was working fine on VC++ 2005, but it has developed a problem.
when I compile the file it gives this message.
1>—— Build started: Project: example1, Configuration: Debug Win32 ——
1>Performing Custom Build Step
1>nvcc fatal : A single input file is required for a non-link phase when an outputfile is specified
1>Build log was saved at “file://c:\Users\chetan\Desktop\example1_emu1\example1_emu\example1\Debug\BuildLog.htm”
1>example1 – 0 error(s), 0 warning(s)
========== Build: 1 succeeded, 0 failed, 0 up-to-date, 0 skipped ==========
and when i build the project it gives .
>—— Build started: Project: example1, Configuration: Debug Win32 ——
1>Performing Custom Build Step
1>nvcc fatal : A single input file is required for a non-link phase when an outputfile is specified
1>Linking…
1>LINK : fatal error LNK1181: cannot open input file ‘.\Debug\example1.obj’
1>Build log was saved at “file://c:\Users\chetan\Desktop\example1_emu1\example1_emu\example1\Debug\BuildLog.htm”
1>example1 – 1 error(s), 0 warning(s)
========== Build: 0 succeeded, 1 failed, 0 up-to-date, 0 skipped ==========
I think there is some problem in the Custom Build Step. I tried putting quotes (") around every path I specify, but it is still not working. Can you please help me?
with regards
chetan
Comment by Chetan Khaladkar — May 26, 2009 @ 1:03 am
I don’t know. Maybe it’s the version of CUDA you are using. (I used 1.1 for my example.)
Comment by llpanorama — May 26, 2009 @ 8:16 am
Hi,
amazing tutorial! I really appreciate your effort. Please do continue this.
I followed your steps and used your program for the emulator. But when I build, it gives “cannot read input file: cutil32D.lib”. I have checked that I have both cutil32.lib and cutil32D.lib, and I also tried manually adding the file location, but it is not working. Then I deleted the Configuration Properties → Linker → Input setting, but now I am not able to compile your program, although simple CUDA programs compile. So what can I do? Please guide me.
thank you in advance
Comment by Chetan Khaladkar — May 22, 2009 @ 10:15 pm
I’m facing the following problem:
1>—— Build started: Project: hope, Configuration: Emudebug Win32 ——
1>Linking…
1>.\Emudebug\stdafx.obj : fatal error LNK1112: module machine type 'X86' conflicts with target machine type 'x64'
plz help me out!!
Comment by Sonal — May 20, 2009 @ 5:48 am
You are compiling for a Windows 32 machine, but you are running on a 64-bit Windows machine. Either move your development to a 32-bit Windows PC, or upgrade your CUDA to a version that supports 64-bit Windows.
Comment by llpanorama — May 20, 2009 @ 8:47 am
I still get an error (WIN XP 64bit, VS2008)
Error 1 fatal error LNK1181: cannot open input file ‘.\Debug\example1.obj’ CUDA_ex1b CUDA_ex1b
Do you know what it means?
Comment by Alessandro — May 8, 2009 @ 9:06 am
I’ve tried it on linux and it works, just simple: nvcc example.cu
Comment by kkapron — April 29, 2009 @ 2:34 pm
I have a error:
—— Build started: Project: example1, Configuration: Debug Win32 ——
Performing Custom Build Step
Project : error PRJ0002 : Error result -1073741510 returned from ‘C:\WINDOWS\system32\cmd.exe’.
Build log was saved at “file://c:\Program Files\NVIDIA Corporation\NVIDIA CUDA SDK\projects\example1_emu\example1\Debug\BuildLog.htm”
example1 – 1 error(s), 0 warning(s)
========== Build: 0 succeeded, 1 failed, 0 up-to-date, 0 skipped ==========
Comment by cakrud — April 6, 2009 @ 9:44 pm
Hi, I am trying to compile your example in Vista 64 and I got the error: “>nvcc fatal : Visual Studio configuration file ‘(null)’ could not be found for installation at ‘c:/Program File”. Do you have any advice?
Comment by lam — March 27, 2009 @ 10:16 am
Do a search for CUDA, Vista and 64-bits. I’m sure I’ve seen people talk about running CUDA on 64-bit XP.
Comment by llpanorama — March 28, 2009 @ 12:07 am
Here’s a solution for Visual Studio:
nvcc fatal : Visual Studio configuration file ‘(null)’ could not be found for installation at ‘C:/Program Files (x86)/Microsoft Visual Studio 8/VC/bin
This is probably because Visual Studio doesn’t install “X64 Compilers and Tools” by default, so you should go to Control Panel, Programs and Features, select visual studio, add or remove features, and select “X64 Compilers and Tools” under Visual C++.
Just notice that I think this solution doesn’t work for Express Editions because it doesn’t have support for x64.
I hope this info helps.
Comment by aldebaran — June 2, 2009 @ 10:37 pm
Hello thx a lot for this helpful article.
i follow ur steps one by one , but when i build the project ,it blocks at this step :
”
1>—— Build started: Project: example1, Configuration: Debug x32 ——
1>Performing Custom Build Step
1>example1.cu
1>tmpxft_00000be0_00000000-3_example1.cudafe1.gpu
1>tmpxft_00000be0_00000000-8_example1.cudafe2.gpu
”
and i don t see the cause !!!
when i build the cuda program directly from cmd with “nvcc -deviceemu -o exp1 exemple1.cu” it can generate the .exe file and it gives same result !!!! do u have any ideas about this prob????
Comment by MIA — March 19, 2009 @ 11:13 am
Have you tried getting the source and project files and compiling it directly from those? Are you using CUDA 1.1?
Comment by llpanorama — March 19, 2009 @ 9:24 pm
thx for ur answer ,yes i already try to get source of project ,it work perfectly in emulation mode (debugemu vc++ 8 or using cmd “nvcc -deviceemu….)
i use CUDA 2.0 / MSVC 8 /Gforce 8600
Comment by MIA — March 20, 2009 @ 9:08 pm
You should try CUDA 1.1 since that is the version used with my example.
Comment by llpanorama — March 20, 2009 @ 10:27 pm
OK i will try ur advice,thx a lot for ur answer
Comment by MIA — March 21, 2009 @ 4:25 am
thx for ur cooperation ,i tryed CUDA 1.1 and now the program works perfectly.
Comment by MIA — March 23, 2009 @ 10:18 am
Up in the main article, I have added a link to a project zip file with the -deviceemu option.
Comment by llpanorama — March 14, 2009 @ 7:22 pm
would it be possible that you post your program as a zip file with the visual studio 2005 set up with emulation and debug?
perhaps this would be a simple solution for all of us here.
thanks
Comment by robert — March 14, 2009 @ 1:08 pm
Hi;
I have the same problem as one of your other users. I switched on the emumode as suggested, but I still get
the output
1 1.0000
2 2.0000
3 3.0000
etc, etc etc.
Any Suggestions
Comment by robert — March 14, 2009 @ 1:06 pm
me too…any ideas what’s causing this?
Comment by vo — March 28, 2009 @ 11:37 pm
My program runs fine in emudebug mode, but when I try to run it on Debug (using the actual GPU, because my goal is to run the program for more than 1,000,000 threads) I end up with following two problems:
1. I have memcopy from host to device, but when I debug the structure doesn’t get copied.
cutilSafeCall( cudaMemcpy(layerIni,layer, 1 * sizeof(TissueStruct), cudaMemcpyHostToDevice) );
2. The kernel execution fails with cudaThreadSynchronize error : unspecified launch failure.
Can you please help me in this case.
The device I am using is GeForce GTX 260.
Comment by Chathuri — March 11, 2009 @ 10:00 am
No idea what the problem is. I suggest you try asking on the CUDA forums.
Comment by llpanorama — March 16, 2009 @ 9:28 am
I am getting the following message:
cudaSafeCall() Runtime API error in file , line 59: feature is not yet implemented.
line 59 of that piece of code says
cutilSafeCall(cudaGetDeviceProperties(&deviceProp, dev));
Comment by Ho Xung Lenh — February 19, 2009 @ 1:00 am
Thanks for your advice.
Actually, I do not have a CUDA graphics card in my machine, so I must use the emulator mode. I also tried to follow your steps in the previous post, but I can only do steps 1 and 2. I cannot do step 3, which is about installing the driver. It says that it could not locate any drivers compatible with the current hardware. The DeviceQuery example compiles fine but it will not run. The debugger shows the following information when running:
‘deviceQuery.exe’: Loaded ‘C:\Program Files\NVIDIA Corporation\NVIDIA CUDA SDK\bin\win32\Debug\deviceQuery.exe’, Symbols loaded.
‘deviceQuery.exe’: Loaded ‘C:\WINDOWS\system32\ntdll.dll’, No symbols loaded.
‘deviceQuery.exe’: Loaded ‘C:\WINDOWS\system32\kernel32.dll’, No symbols loaded.
‘deviceQuery.exe’: Loaded ‘C:\CUDA\bin\cudart.dll’, Binary was not built with debug information.
‘deviceQuery.exe’: Loaded ‘C:\Program Files\NVIDIA Corporation\NVIDIA CUDA SDK\bin\win32\Debug\cutil32D.dll’, No symbols loaded.
First-chance exception at 0x7c812a5b in deviceQuery.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0012fd2c..
First-chance exception at 0x7c812a5b in deviceQuery.exe: Microsoft C++ exception: cudaError at memory location 0x0012fd7c..
The program ‘[464] deviceQuery.exe: Native’ has exited with code 1 (0x1).
Also note: I use Visual Studio .NET 2005.
Comment by Ho Xung Lenh — February 19, 2009 @ 12:56 am
Have you downloaded my source files and tried to compile and run those? Have you tried the DeviceQuery example in my previous post about setting-up the CUDA tools?
Comment by llpanorama — February 18, 2009 @ 11:36 am
PS: The debug mode shows the following information:
‘example2.exe’: Loaded ‘C:\Documents and Settings\Tuan Anh NGUYEN\My Documents\Visual Studio 2005\Projects\example2\debug\example2.exe’, Symbols loaded.
‘example2.exe’: Loaded ‘C:\WINDOWS\system32\ntdll.dll’, No symbols loaded.
‘example2.exe’: Loaded ‘C:\WINDOWS\system32\kernel32.dll’, No symbols loaded.
‘example2.exe’: Loaded ‘C:\CUDA\bin\cudart.dll’, Binary was not built with debug information.
First-chance exception at 0x7c812a5b in example2.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0012fe5c..
First-chance exception at 0x7c812a5b in example2.exe: Microsoft C++ exception: cudaError at memory location 0x0012feac..
First-chance exception at 0x7c812a5b in example2.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0012fe54..
First-chance exception at 0x7c812a5b in example2.exe: Microsoft C++ exception: cudaError at memory location 0x0012fea4..
First-chance exception at 0x7c812a5b in example2.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0012fe44..
First-chance exception at 0x7c812a5b in example2.exe: Microsoft C++ exception: cudaError at memory location 0x0012fe94..
First-chance exception at 0x7c812a5b in example2.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0012fe54..
First-chance exception at 0x7c812a5b in example2.exe: Microsoft C++ exception: cudaError at memory location 0x0012fea4..
First-chance exception at 0x7c812a5b in example2.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0012fe60..
First-chance exception at 0x7c812a5b in example2.exe: Microsoft C++ exception: cudaError at memory location 0x0012feb0..
The program ‘[1932] example2.exe: Native’ has exited with code 0 (0x0).
Please help me in this case.
Comment by Ho Xung Lenh — February 18, 2009 @ 1:02 am
Hi,
I followed your steps (with the -deviceemu option): it compiled fine but the result is wrong:
0 0.000000
1 1.000000
2 2.000000
3 3.000000
4 4.000000
5 5.000000
6 6.000000
7 7.000000
8 8.000000
9 9.000000
The full command is:
“$(CUDA_BIN_PATH)\nvcc.exe” -ccbin “$(VCInstallDir)bin” -c -D_DEBUG -DWIN32 -D_CONSOLE -D_MBCS -Xcompiler /EHsc,/W3,/nologo,/Wp64,/Od,/Zi,/MTd -I ” $(CUDA_INC_PATH)” -I./ -o -deviceemu $(ConfigurationName)\example1.obj example1.cu
Can you help me for this problem ? I use SDK and Toolkit 1.1
Thanks
Comment by Ho Xung Lenh — February 18, 2009 @ 12:56 am
Marc:
I believe the indices in the program all start at zero, so the lowest array index is 0 * 4 + 0 = 0.
Comment by llpanorama — January 29, 2009 @ 5:08 pm
First of all, thank you for an awesome article!
I’m a bit confused about the inner workings of the kernel function.
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx<N) a[idx] = a[idx] * a[idx];
so blockIdx ranges from 1 to 3 (3 blocks, given by n_blocks), and blockDim is 4.
so 1 * 4 + 1 = 5 is the lowest array index you can get.
What am I not understanding correctly here?
Comment by Marc — January 29, 2009 @ 4:08 pm
Ivan Dj …
Please use the following: I just removed the quotation marks from this statement and it worked
…
Configuration Properties → Linker → General:
Additional Library Directories = C:\CUDA\lib;C:\Program Files\NVIDIA Corporation\NVIDIA CUDA SDK\common\lib
Comment by Asim — January 6, 2009 @ 11:46 pm
Ivan:
I used Visual Studio 2005 and CUDA 1.1. You’re using VS 2008. Go back and use VS 2005 and maybe then the example will work for you. Or find the cudart.lib file on your system and update the linkage paths so it will be found.
Comment by llpanorama — December 20, 2008 @ 10:46 am
Hello!
please help quickly. I have the following problem:
1>—— Build started: Project: example1, Configuration: Debug Win32 ——
1>Linking…
1>LINK : fatal error LNK1181: cannot open input file ‘cudart.lib’
1>Build log was saved at “file://e:\Software Projects\Visual Studio 2008 projects\C++\CUDA\example1\example1\Debug\BuildLog.htm”
1>example1 – 1 error(s), 0 warning(s)
========== Build: 0 succeeded, 1 failed, 0 up-to-date, 0 skipped ==========
What else do I need to do, I did everything like you said in tutorial
Comment by Ivan Dj — December 20, 2008 @ 9:19 am
Thanks for this! With just a little bit of tweaking, I was able to get this code to work under linux without issue. Matter of fact, it was easier to do, I think.
In my case, all I had to do was comment out the stdafx.h include, rename the source to a .cu file, then compile it with nvcc. This created an a.out file that worked first time through!
Comment by Jon — December 11, 2008 @ 3:22 pm
it was really useful…
thanks alot
Comment by Krishna — November 4, 2008 @ 8:12 am
Muchas grasias~!
Comment by Song — October 30, 2008 @ 12:47 pm
I would do the obvious and install CUDA 1.1 and see if the error still occurs. If it does, then there is a problem when you setup the project. Otherwise, there is a problem when using CUDA 2.0.
Comment by llpanorama — October 21, 2008 @ 6:26 am
the tutorial is wonderful. Unfortunately, I got the error from vs2005,
Error 1 error PRJ0019: A tool returned an error code from “Performing Custom Build Step”
I don’t know how to figure it out. Could you help me if possible? Thank you a lot. By the way, I used CUDA 2.0. Is there trouble with that version?
Comment by sky — October 21, 2008 @ 12:05 am
Thank you, wonderful article.
Comment by N — September 26, 2008 @ 7:58 am
Thanks for this great tutorial. I used Vista x64 and works very well.
Thanks a lot.
Comment by J.F. Garamendi — September 10, 2008 @ 10:36 am
When you are messing around with the properties, it might be advantageous to replace all uses of “example1” with “$(InputName)” (without quotes).
This means that the project created can easily be reused just by renaming the files involved, and not requiring that you manually fiddle with the properties every time.
Great guide by the way! I just got bugged having to constantly change those variables, so I went hunting for an alternative.
Comment by Robert Evrae — September 9, 2008 @ 3:03 pm
Thanks for the help, GREAT TUTORIAL
The project can be compiled and run. Here’s the result:
0 0.000000
1 1.000000
2 4.000000
3 9.000000
4 16.000000
5 25.000000
6 36.000000
7 49.000000
8 64.000000
9 81.000000
Comment by Josue — August 18, 2008 @ 1:58 pm
Here is some info I found about compiling for 64-bit Windows on the Nvidia forums:
I ended up getting it to work by following the instructions under “How To Create 64-bit apps” at http://blogs.msdn.com/deeptanshuv/archive/…/11/573795.aspx
In summary I had to:
* List cutil64D.lib instead of cutil32D.lib under Project Properties -> Configuration Properties -> Linker -> Input -> Additional Dependencies
* Change from the MachineX86 to MachineX64 option under Project Properties -> Configuration Properties -> Linker -> Advanced -> Target Machine
* open the solution explorer, select solution, right click->Configuration Manager.
* go to ‘Active Solution Platform’, click New.
* in the ‘New Solution Platform’ dialog that comes up select the new platform x64. Set ‘Copy Settings From’ to ‘Win32’
* click OK.
And if I do this before writing a project it seems to build properly.
Comment by llpanorama — August 18, 2008 @ 7:59 am
OK.my bad.I change the Active solution platform Win32 to x64.But now the problem is other:
1>—— Build started: Project: example1, Configuration: Debug x64 ——
1>Performing Custom Build Step
1>example1.cu
1>tmpxft_00000be0_00000000-3_example1.cudafe1.gpu
1>tmpxft_00000be0_00000000-8_example1.cudafe2.gpu
1>tmpxft_00000be0_00000000-3_example1.cudafe1.cpp
1>tmpxft_00000be0_00000000-12_example1.ii
1>Linking…
1>LINK : fatal error LNK1181: cannot open input file ‘cutil32D.lib’
1>Build log was saved at “file://c:\Documents and Settings\jacevedo\Desktop\example1\example1\example1\x64\Debug\BuildLog.htm”
1>example1 – 1 error(s), 0 warning(s)
========== Build: 0 succeeded, 1 failed, 0 up-to-date, 0 skipped ==========
and this library is not in C:\CUDA\lib
Can somebody give me advice on how to resolve this problem? Thanks.
Comment by Josue — August 15, 2008 @ 9:13 am
hi,Great tutorial
I did all the steps.
But I got a fatal error when i tried to build it (compile) using VS2005 in a XP x64
This is what shows:
1>—— Build started: Project: example1, Configuration: Debug Win32 ——
1>Compiling…
1>stdafx.cpp
1>Linking…
1>LINK : fatal error LNK1181: cannot open input file ‘cudart.lib’
1>Build log was saved at “file://c:\Documents and Settings\jacevedo\Desktop\example1\example1\example1\Debug\BuildLog.htm”
1>example1 – 1 error(s), 0 warning(s)
========== Build: 0 succeeded, 1 failed, 0 up-to-date, 0 skipped ==========
Can somebody give me advice on how to resolve this problem? Thanks.
Comment by Josue — August 15, 2008 @ 8:38 am
The result you are getting is the same thing that happens if I try to run the program in non-emulated mode with my 8600 card disabled. That is similar to trying to run the code on a non-CUDA device such as your Radeon. Are you sure you are running the emulated version of the program?
Comment by llpanorama — August 1, 2008 @ 2:36 pm
Great tutorial.
I have a question. I tried to use this cool sample without an NVIDIA GPU (I use an ATI Radeon). It compiles fine (with the -deviceemu option), but when I launch the exe file the result is strange. See below:
0 0.000000
1 1.000000
2 2.000000
3 3.000000
4 4.000000
5 5.000000
6 6.000000
7 7.000000
8 8.000000
9 9.000000
The values are not squared. Thanks.
Comment by Zebiloute — August 1, 2008 @ 8:13 am
Marek:
When you install CUDA, the installer should create all the CUDA… environment variables. Then, in the Visual Studio project, you have to create all the configuration properties so the correct compiler is called as I did in the example shown above.
Also, I’m not sure if CUDA supports VS2008. I know the version I am using (1.1) doesn’t. That may have changed. Check the Nvidia forums for more information.
Comment by llpanorama — July 31, 2008 @ 9:59 am
Hi. I have a problem compiling CUDA programs in Visual Studio 2008. It doesn’t know the CUDA… variables, so for example it can’t find the CUDA compiler. Can somebody give me advice on how to resolve this problem? Thanks.
Comment by Marek — July 29, 2008 @ 10:57 am
Peter:
Use the -deviceemu compiler option as shown in the second-to-last paragraph of this article. This will create an executable that uses the CUDA emulator instead of a graphics card.
Comment by llpanorama — July 17, 2008 @ 6:23 am
Thank you.
I made a mistake: I copied the option directly into Visual Studio
"$(CUDA_BIN_PATH)\nvcc.exe" -ccbin "$(VCInstallDir)bin" -c -D_DEBUG -DWIN32 -D_CONSOLE -D_MBCS -Xcompiler /EHsc,/W3,/nologo,/Wp64,/Od,/Zi,/MTd -I"$(CUDA_INC_PATH)" -I./ -o $(ConfigurationName)\example1.obj example1.cu
but the double quotes in the option above weren’t copied correctly.
Comment by Hyunhojo — July 17, 2008 @ 6:05 am
I was wondering if you knew how to set up the emulator on XP so I don’t have to buy a new graphics card. I can’t seem to find anything on the internet.
Comment by peter — July 16, 2008 @ 9:12 pm
This article is very helpful
Comment by Sumesh — July 14, 2008 @ 11:27 pm
Satakarni:
I know that a block of threads will be executed on a single multiprocessor and multiple blocks can be assigned to each multiprocessor. Other than that, I don’t know of any static relation between grid elements and multiprocessors.
I don’t see anything in the CUDA API that lets you select the number of processors that will be used to run your code. You might ask on the CUDA forum and see if anyone knows of a switch to do this.
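What the runtime API does let you do is ask how many multiprocessors a device has. The following is a minimal sketch, assuming a toolkit version whose cudaDeviceProp structure includes the multiProcessorCount field (I have not checked that version 1.1 has it). It only reports the count; it does not select processors.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Print a few properties for every device the runtime can see.
    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            fprintf(stderr, "device %d: %s\n", dev, cudaGetErrorString(err));
            continue;
        }
        printf("Device %d: %s\n", dev, prop.name);
        printf("  multiprocessors:       %d\n", prop.multiProcessorCount);
        printf("  max threads per block: %d\n", prop.maxThreadsPerBlock);
    }
    return 0;
}
```

Since each block runs on a single multiprocessor, launching fewer blocks than there are multiprocessors would presumably leave some of them idle. That is an inference about scheduling, not a documented way to pick the number of processors.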
Comment by llpanorama — July 14, 2008 @ 8:23 am
As we know, <<<Dg, Db, Ns>>> is required when calling a kernel to execute on the GPU (or device), where Dg is the grid size, Db is the block size (the number of threads per block), and the optional Ns is the shared memory allocation.
However, I would like to know how the number of processors and the grid are related.
For example, I am using a Tesla C870, which has 16 multiprocessors with 8 processors each, for a total of 128 processors. I want to scale my program by testing it on 16, 32, 48, and so on up to 128 processors. How can I achieve this with CUDA programming?
(I thought that there must be some relation between the grid and/or block size used in the program and the number of processors in the GPU card.)
Kindly let me know.
With Regards,
Satakarni
Comment by Satakarni — July 13, 2008 @ 7:37 pm
The explanation was excellent, and I found it interesting and helpful for my work. Keep writing, my dear friend.
Comment by m ravi kuar — June 24, 2008 @ 6:55 am
You can download a new version of the CUDA wizard for VS Express.
url:
http://forums.nvidia.com/index.php?showtopic=69183
Comment by kyzhao — June 21, 2008 @ 5:54 pm
Thank you for the article.
Comment by samsam99 — June 19, 2008 @ 12:59 pm
Thank you for this article, it is very helpful.
Comment by Fatih — June 15, 2008 @ 6:21 am
[...] with a given index. Each thread uses its index to access elements in array (see the kernel in my first CUDA program) such that the collection of all threads cooperatively processes the entire data [...]
Pingback by Threads and blocks and grids, oh my! « /// Parallel Panorama /// — June 11, 2008 @ 2:49 pm
Thank you very much for this article. It really helped me. Continue writing.
Comment by amput — May 28, 2008 @ 12:17 pm
This is extremely helpful for those of us wanting to start from scratch (which is the only way I can learn anything).
Great work buddy! Particularly the painstaking details given on how to configure VS for cu, etc.
Comment by kurt — May 23, 2008 @ 6:21 pm
Many thanks for the article. It is written very clearly and is easy to read and understand!
I am very interested in this direction, but there is not enough good information! Please do not stop, and write more!
I would be very glad to see the same detailed example using Mersenne Twister and Monte Carlo!
Thanks for the article!
Comment by Anton — May 22, 2008 @ 8:05 am