CUDA Tutorial
Here is a good introductory article on GPU computing that’s oriented toward CUDA: The GPU Computing Era.
Below is a list of my blog entries that discuss developing parallel programs using CUDA. These are listed in the proper sequence so you can just click through them instead of having to search through the entire blog.
- Getting started with CUDA
- My first CUDA program!
- Threads and blocks and grids, oh my!
- Updating to CUDA 2.3
- CUDA Gets Easier!
I would be remiss if I didn’t also point you to the great series of articles written by Rob Farber and published in Dr. Dobb’s Journal:
- CUDA, Supercomputing for the Masses: Part 1
- CUDA, Supercomputing for the Masses: Part 2
- CUDA, Supercomputing for the Masses: Part 3
- CUDA, Supercomputing for the Masses: Part 4
- CUDA, Supercomputing for the Masses: Part 5
- CUDA, Supercomputing for the Masses: Part 6
- CUDA, Supercomputing for the Masses: Part 7
- CUDA, Supercomputing for the Masses: Part 8
- CUDA, Supercomputing for the Masses: Part 9
- CUDA, Supercomputing for the Masses: Part 10
- CUDA, Supercomputing for the Masses: Part 11
- CUDA, Supercomputing for the Masses: Part 12
- CUDA, Supercomputing for the Masses: Part 13
- CUDA, Supercomputing for the Masses: Part 14
- CUDA, Supercomputing for the Masses: Part 15
- CUDA, Supercomputing for the Masses: Part 16
- CUDA, Supercomputing for the Masses: Part 17
- CUDA, Supercomputing for the Masses: Part 18
Very informative tutorial and very helpful tips. Thanks for sharing, guys; it’s so helpful for a newbie like me. :)
Hey, thanks! Here I came to know what CUDA is, and I’ve started studying it now! 🙂
Pingback: Elsewhere, on January 28th - Once a nomad, always a nomad
Hi,
Thanks for the excellent article. Have you written any article on CUDA for C++? If yes, please give me the link.
Manoj
A perfect guide for beginners.
The main concept of CUDA:
http://www.techrefined.com/progamming/cuda-way/
Those who are unaware of what CUDA is may visit this link:
http://www.techrefined.com/progamming/parallel-computing/cuda/
awesome tutorial..
Pingback: O que há de novo no Mathematica 8? « INTEGRALDX
Hello,
I am facing one strange problem. I am very new to CUDA.
I tried the first example from CUDA by Example: I ran the program on the CPU first and then changed it to use CUDA.
Strangely, my CUDA program takes 8 times longer than the CPU version.
It’s very strange.
My CPU program is:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#define N 10
void add( float *a, float *b, float *c ) {
    int tid = 0;
    while (tid < N) {
        c[tid] = (a[tid]/(a[tid]*a[tid])) + (b[tid]/(b[tid]*b[tid]));
        tid += 1;
    }
}
int main( void ) {
    float elapsed;
    float a[N], b[N], c[N];
    int i;
    clock_t timerStart, timerStop;
    for (i = 0; i < N; i++) {
        a[i] = (float) (i)/(i+1);
        b[i] = (float) (i)/(i+1);
        c[i] = 0;
    }
    timerStart = clock();
    add( a, b, c );
    timerStop = clock();
    elapsed = (float) ( timerStop - timerStart ) / CLOCKS_PER_SEC;
    printf( "Time elapsed: %f ", elapsed);
    return 0;
}
My CUDA version is:
#include "Common.h"
#include "cutil.h"
#include <stdio.h>
//#define TIMECUDA
#define TIMECPU
#define N 10
__global__ void add( float *a, float *b, float *c ) {
    int tid = blockIdx.x; // tid is the block index
    if (tid < N) {
        c[tid] = (a[tid]/(a[tid]*a[tid])) + (b[tid]/(b[tid]*b[tid]));
    }
}
int main( void ) {
    float a[N], b[N], c[N];
    float *temp_a, *temp_b, *temp_c;
    long i;
#ifdef TIMECUDA
    float elapsed_time_cpu_gpu, elapsed_time_add, elapsed_time_gpu_cpu;
    cudaEvent_t start, stop, startadd, stopadd, startback, stopback;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventCreate(&startadd);
    cudaEventCreate(&stopadd);
    cudaEventCreate(&startback);
    cudaEventCreate(&stopback);
#endif
#ifdef TIMECPU
    float elapsed_time;
    clock_t timerStart, timerStop;
#endif
    cudaMalloc((void**)&temp_a, N*sizeof(float));
    cudaMalloc((void**)&temp_b, N*sizeof(float));
    cudaMalloc((void**)&temp_c, N*sizeof(float));
    for (i = 0; i < N; i++) {
        a[i] = (float) (i)/(i+1);
        b[i] = (float) (i)/(i+1);
        c[i] = 0;
    }
#ifdef TIMECUDA
    cudaEventRecord(start, 0);
#endif
    cudaMemcpy(temp_a, a, N*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(temp_b, b, N*sizeof(float), cudaMemcpyHostToDevice);
#ifdef TIMECUDA
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsed_time_cpu_gpu, start, stop);
    printf("Time taken CUDA (host to device): %f \n", elapsed_time_cpu_gpu);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
#endif
#ifdef TIMECPU
    timerStart = clock();
#endif
#ifdef TIMECUDA
    cudaEventRecord(startadd, 0);
#endif
    add<<<N,1>>>(temp_a, temp_b, temp_c); // one block per element
#ifdef TIMECUDA
    cudaEventRecord(stopadd, 0);
    cudaEventSynchronize(stopadd);
    cudaEventElapsedTime(&elapsed_time_add, startadd, stopadd);
    printf("Time taken CUDA (kernel): %f \n", elapsed_time_add);
    cudaEventDestroy(startadd);
    cudaEventDestroy(stopadd);
#endif
#ifdef TIMECPU
    timerStop = clock();
    elapsed_time = (float) ( timerStop - timerStart ) / CLOCKS_PER_SEC;
    printf("Time taken CPU : %f \n", elapsed_time);
#endif
#ifdef TIMECUDA
    cudaEventRecord(startback, 0);
#endif
    cudaMemcpy(c, temp_c, N*sizeof(float), cudaMemcpyDeviceToHost);
#ifdef TIMECUDA
    cudaEventRecord(stopback, 0);
    cudaEventSynchronize(stopback);
    cudaEventElapsedTime(&elapsed_time_gpu_cpu, startback, stopback);
    printf("Time taken CUDA (device to host): %f ", elapsed_time_gpu_cpu);
    cudaEventDestroy(startback);
    cudaEventDestroy(stopback);
#endif
    /* for (i = 0; i < N; i++) {
        printf("%f %f %f\n", a[i], b[i], c[i]);
    } */
    cudaFree(temp_a);
    cudaFree(temp_b);
    cudaFree(temp_c);
    return 0;
}
I also see that copying from host to device and back takes the major part of the time.
Can you please look into this and help me understand why it shows such strange behavior?
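For what it’s worth, the slowdown described above is expected rather than strange: with N = 10 the kernel does almost no work, so the host-to-device and device-to-host copies and the kernel launch overhead dominate. A minimal sketch of my own (with a hypothetical, much larger N and the usual multi-threaded launch configuration, not the original one-thread-per-block layout) of how the same add would normally be written:

```cuda
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 20)  /* ~1M elements; N = 10 cannot amortize transfer overhead */

__global__ void add(const float *a, const float *b, float *c) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread index */
    if (tid < N)
        c[tid] = (a[tid] / (a[tid] * a[tid])) + (b[tid] / (b[tid] * b[tid]));
}

int main(void) {
    size_t bytes = N * sizeof(float);  /* sizeof(float), not sizeof(int) */
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    for (int i = 0; i < N; i++) {
        h_a[i] = (float)i / (i + 1);
        h_b[i] = (float)i / (i + 1);
    }

    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    /* 256 threads per block; enough blocks to cover all N elements */
    add<<<(N + 255) / 256, 256>>>(d_a, d_b, d_c);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[1] = %f\n", h_c[1]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```

Even at this size the two transfers may still rival the kernel time; for a memory-bound operation like vector add, beating the CPU usually requires keeping the data on the GPU across many kernels rather than copying it back and forth around a single one.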
Pingback: Tutorial de CUDA « Cómo aprender cuda en un mes…
Hello everybody,
I’m here once again to ask for help.
The goal of my project is to use parallelism to reduce program latency.
I started with data parallelism and succeeded in doing it with the CUDA structure, so the idea now is to improve the program with task parallelism. My first idea is a solution using threads: the first thread does the computation on the GPU, while the second handles the input (CPU memory to GPU memory) and the output (GPU memory to CPU memory).
Unfortunately, I don’t get the desired result, only zeros, which means I have problems passing the GPU parameters to the thread function, even though I do all the allocation and copying!
Do you have any ideas?
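The overlap described in the comment above (compute on the GPU while the next chunk is being copied) is usually easier to get with CUDA streams than with host threads. A minimal sketch of my own (hypothetical chunk size and a placeholder kernel, not the commenter’s actual program) of the double-buffered pattern:

```cuda
#include <stdio.h>

#define CHUNK (1 << 20)
#define NCHUNKS 8

__global__ void process(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f;  /* placeholder "treatment" */
}

int main(void) {
    float *h_buf[2], *d_buf[2];
    cudaStream_t stream[2];
    for (int s = 0; s < 2; s++) {
        /* Pinned host memory is required for truly asynchronous copies */
        cudaHostAlloc(&h_buf[s], CHUNK * sizeof(float), cudaHostAllocDefault);
        cudaMalloc(&d_buf[s], CHUNK * sizeof(float));
        cudaStreamCreate(&stream[s]);
    }
    /* Alternate chunks between the two streams: while one stream runs its
       kernel, the other can be copying its next chunk in or its result out. */
    for (int chunk = 0; chunk < NCHUNKS; chunk++) {
        int s = chunk % 2;
        cudaMemcpyAsync(d_buf[s], h_buf[s], CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        process<<<(CHUNK + 255) / 256, 256, 0, stream[s]>>>(d_buf[s], CHUNK);
        cudaMemcpyAsync(h_buf[s], d_buf[s], CHUNK * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();  /* wait for all streams before using results */
    for (int s = 0; s < 2; s++) {
        cudaStreamDestroy(stream[s]);
        cudaFree(d_buf[s]);
        cudaFreeHost(h_buf[s]);
    }
    return 0;
}
```

The pinned memory is the key detail: cudaMemcpyAsync from ordinary pageable memory silently falls back to synchronous behavior, in which case the two "threads" never actually overlap, which might also explain seeing no benefit from a host-thread approach.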
Hey there, thanks.
I’d love to see CUDA code for Dijkstra’s algorithm.
Pingback: Pengalaman pertama dengan CUDA « Rudy ngeBlog
Pingback: Coding With GPGPU: Useful Links
I love CUDA.
Pingback: Coding With GPGPU: CUDA