<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>/// Parallel Panorama ///</title>
	<atom:link href="http://llpanorama.wordpress.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://llpanorama.wordpress.com</link>
	<description>Groan in exasperation as I explore parallel programming techniques...</description>
	<lastBuildDate>Mon, 01 Apr 2013 02:04:50 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='llpanorama.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>/// Parallel Panorama ///</title>
		<link>http://llpanorama.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://llpanorama.wordpress.com/osd.xml" title="/// Parallel Panorama ///" />
	<atom:link rel='hub' href='http://llpanorama.wordpress.com/?pushpress=hub'/>
		<item>
		<title>CUDA Gets Easier!</title>
		<link>http://llpanorama.wordpress.com/2010/06/18/cuda-gets-easier/</link>
		<comments>http://llpanorama.wordpress.com/2010/06/18/cuda-gets-easier/#comments</comments>
		<pubDate>Fri, 18 Jun 2010 18:43:21 +0000</pubDate>
		<dc:creator>dave_vandenbout</dc:creator>
				<category><![CDATA[GPU]]></category>
		<category><![CDATA[CUDA]]></category>

		<guid isPermaLink="false">http://llpanorama.wordpress.com/?p=67</guid>
		<description><![CDATA[Several of my readers have had problems creating CUDA projects in Visual Studio, so I thought I&#8217;d update how to do it using the current version of CUDA (3.0 at the time of this writing).  The main point: it&#8217;s a lot easier than the procedure I outlined two years ago. For hardware, I&#8217;m now using [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=llpanorama.wordpress.com&#038;blog=2601119&#038;post=67&#038;subd=llpanorama&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>Several of my readers have had problems creating CUDA projects in Visual Studio, so I thought I&#8217;d update how to do it using the current version of CUDA (3.0 at the time of this writing).  The main point: <em>it&#8217;s a lot easier than <a title="My First CUDA Program" href="http://llpanorama.wordpress.com/2008/05/21/my-first-cuda-program/" target="_self">the procedure I outlined two years ago</a>.</em></p>
<p>For hardware, I&#8217;m now using a <a title="Zotac GeForce GT240" href="http://www.newegg.com/Product/Product.aspx?Item=N82E16814500131&amp;cm_re=zotac_gt240-_-14-500-131-_-Product" target="_blank">Zotac GeForce GT240 card with 96 stream processors</a> that I purchased last year for $90. For my software development environment, I downloaded and installed the <a title="Microsoft SDK for Windows Server 2008" href="http://www.microsoft.com/downloads/details.aspx?FamilyId=F26B1AA4-741A-433A-9BE5-FA919850BDBF&amp;displaylang=en" target="_blank">Microsoft SDK for Windows Server 2008</a> and <a title="MS Visual C++ 2008 Express Edition" href="http://www.microsoft.com/express/download/#webInstall" target="_blank">Microsoft Visual C++ 2008 Express Edition</a>.  Then I downloaded and installed the <a title="NVIDIA Driver 197.13" href="http://developer.download.nvidia.com/compute/cuda/3_0/drivers/devdriver_3.0_winxp_32_197.13_general.exe" target="_blank">NVIDIA Driver 197.13</a>, the <a title="CUDA Toolkit 3.0" href="http://www.nvidia.com/object/thankyou.html?url=/compute/cuda/3_0/toolkit/cudatoolkit_3.0_win_32.exe" target="_blank">CUDA Toolkit 3.0</a> and the <a title="CUDA SDK 3.0" href="http://developer.download.nvidia.com/compute/cuda/3_0/sdk/gpucomputingsdk_3.0_win_32.exe" target="_blank">CUDA SDK 3.0</a> for 32-bit Windows XP.</p>
<p>Once everything was set up, the first thing I did was to recompile and run the deviceQuery example in</p>
<blockquote><p>C:\Documents and Settings\All Users\Application Data\NVIDIA Corporation\NVIDIA GPU Computing SDK\C\src\deviceQuery</p></blockquote>
<p>I just double-clicked the deviceQuery_vc90.sln file and the project popped up in the Visual Studio IDE.  I hit F7 to rebuild the program, and then I pressed Ctrl+F5 to run it.  The program ran and reported the presence of a GeForce GT 240 in my PC.  So far, so good.</p>
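<p>For readers curious about what a program like deviceQuery does, here is a minimal sketch of the same kind of query using the CUDA runtime calls cudaGetDeviceCount and cudaGetDeviceProperties.  This is a stripped-down illustration and not the SDK&#8217;s source; it assumes the file is compiled as a .cu file with nvcc, and the choice of fields to print is arbitrary.</p>
<pre class="brush: cpp; title: ; notranslate">
#include &lt;stdio.h&gt;
#include &lt;cuda_runtime.h&gt;

int main( void )
{
    int count = 0;
    // Ask the runtime how many CUDA-capable devices are installed
    if ( cudaGetDeviceCount( &amp;count ) != cudaSuccess || count == 0 )
    {
        printf( &quot;No CUDA device found\n&quot; );
        return 1;
    }
    for ( int dev = 0; dev &lt; count; dev++ )
    {
        cudaDeviceProp prop;
        if ( cudaGetDeviceProperties( &amp;prop, dev ) != cudaSuccess )
            continue; // skip a device that cannot be queried
        printf( &quot;Device %d: \&quot;%s\&quot;\n&quot;, dev, prop.name );
        printf( &quot;  Compute capability:      %d.%d\n&quot;, prop.major, prop.minor );
        printf( &quot;  Global memory:           %llu bytes\n&quot;, (unsigned long long)prop.totalGlobalMem );
        printf( &quot;  Shared memory per block: %llu bytes\n&quot;, (unsigned long long)prop.sharedMemPerBlock );
        printf( &quot;  Registers per block:     %d\n&quot;, prop.regsPerBlock );
        printf( &quot;  Multiprocessors:         %d\n&quot;, prop.multiProcessorCount );
        printf( &quot;  Max threads per block:   %d\n&quot;, prop.maxThreadsPerBlock );
    }
    return 0;
}
</pre>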
<p>Next, I created an empty Win32 console application called cuda_example3.  I renamed cuda_example3.cpp to cuda_example3.cu because that&#8217;s where the CUDA kernel source is going.  Then I copied the source from <a title="My First CUDA Program" href="http://llpanorama.wordpress.com/2008/05/21/my-first-cuda-program/" target="_blank">my first CUDA program</a> into the file and saved it.  Here&#8217;s the code so you can see it:</p>
<pre class="brush: cpp; title: ; notranslate">
// cuda_example3.cu : Defines the entry point for the console application.
//

#include &quot;stdafx.h&quot;

#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;cuda.h&gt;

// Kernel that executes on the CUDA device
__global__ void square_array( float *a, int N )
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if ( idx &lt; N )
        a[idx] = a[idx] * a[idx];
}



// main routine that executes on the host
int main( void )
{
    float *a_h, *a_d; // Pointer to host &amp; device arrays
    const int N = 10; // Number of elements in arrays
    size_t size = N * sizeof( float );
    a_h = (float *)malloc( size );    // Allocate array on host
    cudaMalloc( (void **)&amp;a_d, size ); // Allocate array on device
    // Initialize host array and copy it to CUDA device
    for ( int i = 0; i &lt; N; i++ )
        a_h[i] = (float)i;
    cudaMemcpy( a_d, a_h, size, cudaMemcpyHostToDevice );
    // Do calculation on device:
    int block_size = 4;
    int n_blocks   = N / block_size + ( N % block_size == 0 ? 0 : 1 );
    square_array &lt;&lt;&lt; n_blocks, block_size &gt;&gt;&gt; ( a_d, N );
    // Retrieve result from device and store it in host array
    cudaMemcpy( a_h, a_d, sizeof( float ) * N, cudaMemcpyDeviceToHost );
    // Print results
    for ( int i = 0; i &lt; N; i++ )
        printf( &quot;%d %f\n&quot;, i, a_h[i] );
    // Cleanup
    free( a_h );
    cudaFree( a_d );
}
</pre>
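<p>The listing above never checks whether the CUDA calls succeeded, so a failed allocation or kernel launch can go unnoticed.  Here is a hedged sketch of the same program with error checks added.  CHECK_CUDA is a hypothetical helper macro introduced only for this illustration (it is not part of CUDA), the stdafx.h include is dropped on the assumption that the file is built without precompiled headers, and the block count uses the equivalent round-up form ( N + block_size - 1 ) / block_size.</p>
<pre class="brush: cpp; title: ; notranslate">
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;cuda_runtime.h&gt;

// Hypothetical helper (not part of CUDA): print the error and exit if a runtime call fails
#define CHECK_CUDA( call )                                          \
    do {                                                            \
        cudaError_t err = ( call );                                 \
        if ( err != cudaSuccess ) {                                 \
            fprintf( stderr, &quot;%s:%d: %s\n&quot;, __FILE__, __LINE__,     \
                     cudaGetErrorString( err ) );                   \
            exit( EXIT_FAILURE );                                   \
        }                                                           \
    } while ( 0 )

// Same kernel as above
__global__ void square_array( float *a, int N )
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if ( idx &lt; N )
        a[idx] = a[idx] * a[idx];
}

int main( void )
{
    const int N = 10;
    size_t size = N * sizeof( float );
    float a_h[N];
    float *a_d = NULL;
    for ( int i = 0; i &lt; N; i++ )
        a_h[i] = (float)i;

    CHECK_CUDA( cudaMalloc( (void **)&amp;a_d, size ) );
    CHECK_CUDA( cudaMemcpy( a_d, a_h, size, cudaMemcpyHostToDevice ) );

    int block_size = 4;
    int n_blocks   = ( N + block_size - 1 ) / block_size; // rounds up: 3 blocks for N = 10
    square_array &lt;&lt;&lt; n_blocks, block_size &gt;&gt;&gt; ( a_d, N );
    CHECK_CUDA( cudaGetLastError() ); // reports an invalid launch configuration

    // This copy waits for the kernel to finish, so an error raised
    // while the kernel was running is returned here
    CHECK_CUDA( cudaMemcpy( a_h, a_d, size, cudaMemcpyDeviceToHost ) );

    for ( int i = 0; i &lt; N; i++ )
        printf( &quot;%d %f\n&quot;, i, a_h[i] );
    CHECK_CUDA( cudaFree( a_d ) );
    return 0;
}
</pre>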
<p>At this point, Visual Studio had no idea how to compile a .cu file.  In the past, I crafted a Custom Build Step in the Project Properties page that invoked Nvidia&#8217;s nvcc tool with the appropriate compiler options. No more need for that!  Instead, I highlighted cuda_example3 in the Solution Explorer pane, and then selected Project→Custom Build Rules&#8230; from the menu. Then I clicked on the Find Existing&#8230; button in the Custom Build Rule Files window and steered it to this file:</p>
<blockquote><p>C:\Documents and Settings\All Users\Application Data\NVIDIA Corporation\NVIDIA GPU Computing SDK\C\common\Cuda.rules</p></blockquote>
<p>Cuda.rules contains all the rules and options needed to merge .cu files into the Visual Studio C++ compilation flow.</p>
<p>The only other changes I needed to make were to indicate the locations of the CUDA libraries in the project properties (I did this for both the Debug and Release configurations):</p>
<blockquote><p>Configuration Properties → Linker → General:<br />
Additional Library Directories = C:\CUDA\lib;&quot;C:\Documents and Settings\All Users\Application Data\NVIDIA Corporation\NVIDIA GPU Computing SDK\C\common\lib&quot;</p>
<p>Configuration Properties → Linker → Input:<br />
Additional Dependencies = cudart.lib</p></blockquote>
<p>After doing this, the program compiled and produced the following correct result:</p>
<blockquote><p>0   0.000000<br />
1     1.000000<br />
2     4.000000<br />
3     9.000000<br />
4     16.000000<br />
5     25.000000<br />
6     36.000000<br />
7     49.000000<br />
8     64.000000<br />
9     81.000000</p></blockquote>
<p>For those of you who want to try CUDA but don&#8217;t have a CUDA-enabled GPU card, there is a way to link to a CUDA device emulator.  Simply replace cudart.lib with cudartemu.lib in the project properties as follows:</p>
<blockquote><p>Configuration Properties → Linker → Input:<br />
Additional Dependencies = cudartemu.lib</p></blockquote>
<p>This supplants the use of the -deviceemu compiler option in earlier versions of CUDA.</p>
<p>Finally, you may want C++ syntax-coloring and Intellisense to work on your .cu source files. To get syntax-coloring, click on the Tools→Options menu.  Then in the Options window under Text Editor→File Extension, enter the .cu and .cuh file extensions and select Microsoft Visual C++ as the editor. To enable Intellisense, you&#8217;ll have to edit the Windows registry by adding the .cu and .cuh file extensions to the key <code>HKEY_CURRENT_USER\Software\Microsoft\VisualStudio\9.0\Languages\Language Services\C/C++\NCB Default C/C++ Extensions</code>. That should do it.</p>
<p>Here&#8217;s the <a title="Source code for example 3 (VC++ 2008)" href="http://drop.io/llpanorama/asset/cuda-example3-zip">source code for this example</a> if you want to try it.</p>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/llpanorama.wordpress.com/67/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/llpanorama.wordpress.com/67/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=llpanorama.wordpress.com&#038;blog=2601119&#038;post=67&#038;subd=llpanorama&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://llpanorama.wordpress.com/2010/06/18/cuda-gets-easier/feed/</wfw:commentRss>
		<slash:comments>24</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/d9f61aab37fbd3971dc82737f85ee0c3?s=96&#38;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96&#38;r=G" medium="image">
			<media:title type="html">llpanorama</media:title>
		</media:content>

		<media:content url="http://s7.addthis.com/static/btn/sm-share-en.gif" medium="image">
			<media:title type="html">Bookmark and Share</media:title>
		</media:content>
	</item>
		<item>
		<title>Updating to CUDA 2.3</title>
		<link>http://llpanorama.wordpress.com/2009/08/07/updating-to-cuda-2-3/</link>
		<comments>http://llpanorama.wordpress.com/2009/08/07/updating-to-cuda-2-3/#comments</comments>
		<pubDate>Fri, 07 Aug 2009 17:04:27 +0000</pubDate>
		<dc:creator>dave_vandenbout</dc:creator>
				<category><![CDATA[GPU]]></category>

		<guid isPermaLink="false">http://llpanorama.wordpress.com/?p=49</guid>
		<description><![CDATA[It&#8217;s been a while since I&#8217;ve posted anything, so I thought I&#8217;d start again by upgrading from CUDA 1.1 to the new CUDA 2.3. For hardware, I&#8217;m still using my old NX8600GTS graphics card.  For my software development environment, I downloaded and installed the Microsoft SDK for Windows Server 2008 and Microsoft Visual C++ 2008 [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=llpanorama.wordpress.com&#038;blog=2601119&#038;post=49&#038;subd=llpanorama&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>It&#8217;s been a while since I&#8217;ve posted anything, so I thought I&#8217;d start again by upgrading from CUDA 1.1 to the new CUDA 2.3.</p>
<p>For hardware, I&#8217;m still using my old <a title="NX8600GTS at Newegg.com" href="http://www.newegg.com/Product/Product.aspx?Item=N82E16814127284" target="_blank">NX8600GTS graphics card</a>.  For my software development environment, I downloaded and installed the <a title="Microsoft SDK for Windows Server 2008" href="http://www.microsoft.com/downloads/details.aspx?FamilyId=F26B1AA4-741A-433A-9BE5-FA919850BDBF&amp;displaylang=en" target="_blank">Microsoft SDK for Windows Server 2008</a> and <a title="MS Visual C++ 2008 Express Edition" href="http://www.microsoft.com/express/download/#webInstall" target="_blank">Microsoft Visual C++ 2008 Express Edition</a>.  Then I downloaded and installed the <a title="NVIDIA Driver 190.38" href="http://www.nvidia.com/object/thankyou.html?url=/compute/cuda/2_3/drivers/cudadriver_2.3_winxp_32_190.38_general.exe" target="_blank">NVIDIA Driver 190.38</a>, the <a title="CUDA Toolkit 2.3" href="http://www.nvidia.com/object/thankyou.html?url=/compute/cuda/2_3/toolkit/cudatoolkit_2.3_win_32.exe" target="_blank">CUDA Toolkit 2.3</a> and the <a title="CUDA SDK 2.3" href="http://www.nvidia.com/object/thankyou.html?url=/compute/cuda/2_3/sdk/cudasdk_2.3_win_32.exe" target="_blank">CUDA SDK 2.3</a> for 32-bit XP.<span id="more-49"></span></p>
<p>Once everything was set up, the first thing I did was to recompile and run the deviceQuery example (just as I did in <a title="Getting Started With CUDA" href="http://llpanorama.wordpress.com/2008/04/24/getting-started-with-cuda/" target="_blank">my first attempt with CUDA</a>).  The default installation location of the CUDA examples has changed since version 1.1, so I found the example in</p>
<p><code>C:\Documents and Settings\All Users\Application Data\NVIDIA Corporation\NVIDIA GPU Computing SDK\C\src\deviceQuery</code></p>
<p>I just double-clicked the <code>deviceQuery_vc90.sln</code> file and the project popped up in the Visual C++ IDE.  I hit F7 to rebuild the program, and then I pressed Ctrl+F5 to run it.  The program ran and reported the presence of a &#8220;GeForce 8600 GTS&#8221; in my PC.  So far, so good.</p>
<p>Next, I <a title="CUDA example1 archive" href="ftp://ftp.drivehq.com/llpanorama/CUDA/example1.zip" target="_blank">downloaded </a>and unpacked <a title="My First CUDA Program" href="http://llpanorama.wordpress.com/2008/05/21/my-first-cuda-program/" target="_blank">my first CUDA program</a> (written so long ago).  Because this example was created using MS Visual C++ 2005, I had to start the Visual C++ 2008 IDE and then drag-and-drop the example1.sln file into it.  This started the Visual Studio Conversion Wizard which converted the project into the newer format.  Then I hit F7 to build it and got the following error:</p>
<p><code>1&gt;LINK : fatal error LNK1181: cannot open input file 'cutil32D.lib'</code></p>
<p>This error occurs because CUDA SDK 2.3 installs to a different default directory than version 1.1 did, so the linker no longer finds the library.  To correct this, I opened the Property Pages for this project and, under the <code>Linker</code> properties, I changed the <code>Additional Library Directories</code> to include</p>
<p><code>C:\Documents and Settings\All Users\Application Data\NVIDIA Corporation\NVIDIA GPU Computing SDK\C\common\lib</code></p>
<p>After doing this, the program compiled and ran successfully.</p>
<p>Here&#8217;s the <a title="Source code for example 1 (VC++ 2008)" href="ftp://ftp.drivehq.com/llpanorama/TBB/example1_vc90.zip">source code for this example</a> if you want to try it.</p>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/llpanorama.wordpress.com/49/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/llpanorama.wordpress.com/49/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=llpanorama.wordpress.com&#038;blog=2601119&#038;post=49&#038;subd=llpanorama&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://llpanorama.wordpress.com/2009/08/07/updating-to-cuda-2-3/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/d9f61aab37fbd3971dc82737f85ee0c3?s=96&#38;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96&#38;r=G" medium="image">
			<media:title type="html">llpanorama</media:title>
		</media:content>

		<media:content url="http://s9.addthis.com/button0-bm.gif" medium="image">
			<media:title type="html">Bookmark and Share</media:title>
		</media:content>
	</item>
		<item>
		<title>Nvidia GTX 295 GPU with 480 Cores!</title>
		<link>http://llpanorama.wordpress.com/2008/12/19/nvidia-gtx-295-gpu-with-480-cores/</link>
		<comments>http://llpanorama.wordpress.com/2008/12/19/nvidia-gtx-295-gpu-with-480-cores/#comments</comments>
		<pubDate>Sat, 20 Dec 2008 03:55:54 +0000</pubDate>
		<dc:creator>dave_vandenbout</dc:creator>
				<category><![CDATA[GPU]]></category>

		<guid isPermaLink="false">http://llpanorama.wordpress.com/?p=41</guid>
		<description><![CDATA[The title says it all.  Read a bit more about it here.<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=llpanorama.wordpress.com&#038;blog=2601119&#038;post=41&#038;subd=llpanorama&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>The title says it all.  Read a bit more about it <a title="GTX 295 Specs" href="http://www.maximumpc.com/article/news/nvidia_reveals_geforce_gtx_295_specs" target="_blank">here</a>.</p>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/llpanorama.wordpress.com/41/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/llpanorama.wordpress.com/41/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=llpanorama.wordpress.com&#038;blog=2601119&#038;post=41&#038;subd=llpanorama&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://llpanorama.wordpress.com/2008/12/19/nvidia-gtx-295-gpu-with-480-cores/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/d9f61aab37fbd3971dc82737f85ee0c3?s=96&#38;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96&#38;r=G" medium="image">
			<media:title type="html">llpanorama</media:title>
		</media:content>
	</item>
		<item>
		<title>CUDA vs. FPGAs for high-performance computing</title>
		<link>http://llpanorama.wordpress.com/2008/06/19/cuda-vs-fpgas-for-high-performance-computing/</link>
		<comments>http://llpanorama.wordpress.com/2008/06/19/cuda-vs-fpgas-for-high-performance-computing/#comments</comments>
		<pubDate>Thu, 19 Jun 2008 10:50:54 +0000</pubDate>
		<dc:creator>dave_vandenbout</dc:creator>
				<category><![CDATA[GPU]]></category>
		<category><![CDATA[CUDA]]></category>

		<guid isPermaLink="false">http://llpanorama.wordpress.com/?p=38</guid>
		<description><![CDATA[A column by Kevin Morris, editor of the FPGA Journal, discusses the new Nvidia GPU offerings.  Here&#8217;s my response about why GPUs will kill off the use of field-programmable gate arrays (FPGAs) as accelerators in high-performance computing systems.<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=llpanorama.wordpress.com&#038;blog=2601119&#038;post=38&#038;subd=llpanorama&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>A <a title="A Passel of Processors" href="http://www.fpgajournal.com/articles_2008/20080617_nvidia.htm" target="_blank">column</a> by Kevin Morris, editor of the <a title="FPGA Journal" href="http://www.fpgajournal.com/index.htm" target="_blank">FPGA Journal</a>, discusses the <a title="Nvidia announces new GTX 280 and 260 GPUs!" href="http://llpanorama.wordpress.com/2008/06/16/new-nvidia-gtx280-and-260-gpus-are-announced/" target="_blank">new Nvidia GPU offerings</a>.  Here&#8217;s <a title="my CUDA vs. FPGA response" href="http://www.journalforums.com/cgi-bin/ikonboard.cgi?act=ST;f=1;t=421" target="_blank">my response</a> about why GPUs will kill off the use of field-programmable gate arrays (FPGAs) as accelerators in high-performance computing systems.</p>
<br /><img alt="" border="0" src="http://feeds.wordpress.com/1.0/categories/llpanorama.wordpress.com/38/" /> <img alt="" border="0" src="http://feeds.wordpress.com/1.0/tags/llpanorama.wordpress.com/38/" /> <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/llpanorama.wordpress.com/38/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/llpanorama.wordpress.com/38/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=llpanorama.wordpress.com&#038;blog=2601119&#038;post=38&#038;subd=llpanorama&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://llpanorama.wordpress.com/2008/06/19/cuda-vs-fpgas-for-high-performance-computing/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/d9f61aab37fbd3971dc82737f85ee0c3?s=96&#38;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96&#38;r=G" medium="image">
			<media:title type="html">llpanorama</media:title>
		</media:content>
	</item>
		<item>
		<title>New Nvidia GTX280 and 260 GPUs are announced!</title>
		<link>http://llpanorama.wordpress.com/2008/06/16/new-nvidia-gtx280-and-260-gpus-are-announced/</link>
		<comments>http://llpanorama.wordpress.com/2008/06/16/new-nvidia-gtx280-and-260-gpus-are-announced/#comments</comments>
		<pubDate>Mon, 16 Jun 2008 22:09:40 +0000</pubDate>
		<dc:creator>dave_vandenbout</dc:creator>
				<category><![CDATA[GPU]]></category>
		<category><![CDATA[CUDA]]></category>

		<guid isPermaLink="false">http://llpanorama.wordpress.com/?p=34</guid>
		<description><![CDATA[Nvidia has announced their new GTX 280 and 260 GPU chips. The 280 and 260 increase the number of SPs up to 240 and 192 while the width of the interface to device memory has increased to 512 and 448 bits, respectively. (The older 8800 GTX has 128 SPs and a 384-bit wide memory interface.) [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=llpanorama.wordpress.com&#038;blog=2601119&#038;post=34&#038;subd=llpanorama&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p><a title="Nvidia announcement of GTX 280 and 260." href="http://www.nvidia.com/object/io_1213610051114.html" target="_blank">Nvidia has announced</a> their new <a title="GTX 280 GPU specs." href="http://www.nvidia.com/object/geforce_gtx_280.html" target="_blank">GTX 280</a> and <a title="GTX 260 GPU specs." href="http://www.nvidia.com/object/geforce_gtx_260.html" target="_blank">260</a> GPU chips.  The 280 and 260 increase the number of SPs up to 240 and 192 while the width of the interface to device memory has increased to 512 and 448 bits, respectively.  (The older 8800 GTX has 128 SPs and a 384-bit wide memory interface.)</p>
<p><a title="Picture of GTX 280 chip." href="http://3dimensionaljigsaw.wordpress.com/2008/06/18/physics-based-games-the-new-genre/" target="_blank">Here</a> is a blog posting with a picture of the GTX 280 chip.</p>
<br /><img alt="" border="0" src="http://feeds.wordpress.com/1.0/categories/llpanorama.wordpress.com/34/" /> <img alt="" border="0" src="http://feeds.wordpress.com/1.0/tags/llpanorama.wordpress.com/34/" /> <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/llpanorama.wordpress.com/34/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/llpanorama.wordpress.com/34/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=llpanorama.wordpress.com&#038;blog=2601119&#038;post=34&#038;subd=llpanorama&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://llpanorama.wordpress.com/2008/06/16/new-nvidia-gtx280-and-260-gpus-are-announced/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/d9f61aab37fbd3971dc82737f85ee0c3?s=96&#38;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96&#38;r=G" medium="image">
			<media:title type="html">llpanorama</media:title>
		</media:content>
	</item>
		<item>
		<title>Apple&#8217;s competition to TBB and CUDA</title>
		<link>http://llpanorama.wordpress.com/2008/06/13/apples-competition-to-tbb-and-cuda/</link>
		<comments>http://llpanorama.wordpress.com/2008/06/13/apples-competition-to-tbb-and-cuda/#comments</comments>
		<pubDate>Fri, 13 Jun 2008 17:11:49 +0000</pubDate>
		<dc:creator>dave_vandenbout</dc:creator>
				<category><![CDATA[GPU]]></category>
		<category><![CDATA[multicore]]></category>
		<category><![CDATA[CUDA]]></category>
		<category><![CDATA[TBB]]></category>

		<guid isPermaLink="false">http://llpanorama.wordpress.com/?p=32</guid>
		<description><![CDATA[Apple recently announced Grand Central and OpenCL, which seem to be competitors to TBB and CUDA, respectively. Grand Central tries to make it easier to write multi-threaded apps for today&#8217;s multicore CPUs, and OpenCL (Open Computing Language) aims to make the processing power of GPUs available in general-purpose computing applications. OpenCL sounds like CUDA to [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=llpanorama.wordpress.com&#038;blog=2601119&#038;post=32&#038;subd=llpanorama&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>Apple recently announced <a title="Snow Leopard" href="http://www.apple.com/server/macosx/snowleopard/" target="_blank">Grand Central and OpenCL</a>, which seem to be competitors to TBB and CUDA, respectively.  Grand Central tries to make it easier to write multi-threaded apps for today&#8217;s multicore CPUs, and OpenCL (Open Computing Language) aims to make the processing power of GPUs available in general-purpose computing applications.  OpenCL sounds like CUDA to me, but Steve Jobs says it&#8217;s &#8220;way beyond what Nvidia or anyone else has, and it&#8217;s really simple.&#8221;  We&#8217;ll see.  Here are some blog posts about <a title="Grand Central and OpenCL" href="http://www.anandtech.com/weblog/showpost.aspx?i=461" target="_blank">Grand Central</a> and <a title="OpenCL" href="http://www.betanews.com/article/So_what_is_OpenCL_Apples_next_enhancement_to_Mac_OS_X_106/1213196124" target="_blank">OpenCL</a>.</p>
<p><em>Addendum (June 18, 2008): </em>Looks like <a title="Apple submits OpenCL to Khronos Group." href="http://www.eetimes.com/news/latest/showArticle.jhtml?articleID=208700254" target="_blank">Apple has submitted OpenCL to the Khronos Group</a> &#8220;that aims to define a programming environment for applications running across both x86 and graphics chips&#8221;.  And <a title="Wikipedia entry about OpenCL." href="http://en.wikipedia.org/wiki/OpenCL" target="_blank">here</a> is a Wikipedia entry about OpenCL.</p>
<br /><img alt="" border="0" src="http://feeds.wordpress.com/1.0/categories/llpanorama.wordpress.com/32/" /> <img alt="" border="0" src="http://feeds.wordpress.com/1.0/tags/llpanorama.wordpress.com/32/" /> <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/llpanorama.wordpress.com/32/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/llpanorama.wordpress.com/32/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=llpanorama.wordpress.com&#038;blog=2601119&#038;post=32&#038;subd=llpanorama&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://llpanorama.wordpress.com/2008/06/13/apples-competition-to-tbb-and-cuda/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/d9f61aab37fbd3971dc82737f85ee0c3?s=96&#38;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96&#38;r=G" medium="image">
			<media:title type="html">llpanorama</media:title>
		</media:content>
	</item>
		<item>
		<title>Threads and blocks and grids, oh my!</title>
		<link>http://llpanorama.wordpress.com/2008/06/11/threads-and-blocks-and-grids-oh-my/</link>
		<comments>http://llpanorama.wordpress.com/2008/06/11/threads-and-blocks-and-grids-oh-my/#comments</comments>
		<pubDate>Wed, 11 Jun 2008 19:49:34 +0000</pubDate>
		<dc:creator>dave_vandenbout</dc:creator>
				<category><![CDATA[GPU]]></category>
		<category><![CDATA[CUDA]]></category>

		<guid isPermaLink="false">http://llpanorama.wordpress.com/?p=25</guid>
		<description><![CDATA[As an engineer, I like C because it is relatively low-level compared to other languages. This lets me infer how the C code is handled by the processor so I can make on-the-fly judgments about the efficiency of a program. For the same reason, I need a mental model of how a CUDA device is [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=llpanorama.wordpress.com&#038;blog=2601119&#038;post=25&#038;subd=llpanorama&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>As an engineer, I like C because it is relatively low-level compared to other languages.  This lets me infer how the C code is handled by the processor so I can make on-the-fly judgments about the efficiency of a program.  For the same reason, I need a mental model of how a CUDA device is organized and how its parts operate.<span id="more-25"></span></p>
<p>With a single processor executing a program, it&#8217;s easy to tell what&#8217;s going on and where it&#8217;s happening.  With a CUDA device, not so much.  There seem to be a lot of things going on at once in a lot of different places.  CUDA organizes a parallel computation using the abstractions of threads, blocks and grids for which I provide these simple definitions:</p>
<blockquote><p>Thread: This is just an execution of a kernel with a given index.  Each thread uses its index to access elements in an array (see the kernel in <a title="My first CUDA program" href="http://llpanorama.wordpress.com/2008/05/21/my-first-cuda-program/" target="_blank">my first CUDA program</a>) such that the collection of all threads cooperatively processes the entire data set.</p>
<p>Block: This is a group of threads.  There&#8217;s not much you can say about the execution of threads within a block &#8211; they could execute concurrently or serially and in no particular order.  You can coordinate the threads, somewhat, using the __syncthreads() function that makes a thread stop at a certain point in the kernel until all the other threads in its block reach the same point.</p>
<p>Grid: This is a group of blocks.  There&#8217;s no synchronization at all between the blocks.</p></blockquote>
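<p>To make the block-level synchronization concrete, here is a minimal sketch in which each block reverses its own slice of an array.  The kernel name reverse_in_block and the constant BLOCK_SIZE are made up for this illustration, the array length is assumed to be an exact multiple of the block size, and error checking is left out to keep it short.  Without the __syncthreads() call, a thread could read an element of the temporary array before its neighbor had written it.</p>
<pre class="brush: cpp; title: ; notranslate">
#include &lt;stdio.h&gt;
#include &lt;cuda_runtime.h&gt;

#define BLOCK_SIZE 4 // hypothetical block size chosen for this example

// Each block reverses its own BLOCK_SIZE-element slice of the array
__global__ void reverse_in_block( int *a )
{
    __shared__ int tmp[BLOCK_SIZE]; // visible to every thread in this block only
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    tmp[threadIdx.x] = a[idx];
    __syncthreads(); // wait until all threads in the block have stored their element
    a[idx] = tmp[blockDim.x - 1 - threadIdx.x];
}

int main( void )
{
    const int N = 8; // exactly two blocks, so the kernel needs no bounds check
    int a_h[N];
    int *a_d = NULL;
    for ( int i = 0; i &lt; N; i++ )
        a_h[i] = i;

    cudaMalloc( (void **)&amp;a_d, N * sizeof( int ) );
    cudaMemcpy( a_d, a_h, N * sizeof( int ), cudaMemcpyHostToDevice );
    reverse_in_block &lt;&lt;&lt; N / BLOCK_SIZE, BLOCK_SIZE &gt;&gt;&gt; ( a_d );
    cudaMemcpy( a_h, a_d, N * sizeof( int ), cudaMemcpyDeviceToHost );

    for ( int i = 0; i &lt; N; i++ )
        printf( &quot;%d &quot;, a_h[i] ); // prints 3 2 1 0 7 6 5 4
    printf( &quot;\n&quot; );
    cudaFree( a_d );
    return 0;
}
</pre>
<p>Notice that the two blocks never need to know about each other, which fits the rule that there is no synchronization between blocks in a grid.</p>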
<p>But where do threads, blocks and grids actually get executed?  With respect to Nvidia&#8217;s G80 GPU chip, it appears the computation is distributed as follows:</p>
<blockquote><p>Grid → GPU: An entire grid is handled by a single GPU chip.</p>
<p>Block → MP: The GPU chip is organized as a collection of multiprocessors (MPs), with each multiprocessor responsible for handling one or more blocks in a grid.  A block is never divided across multiple MPs.</p>
<p>Thread → SP: Each MP is further divided into a number of stream processors (SPs), with each SP handling one or more threads in a block.</p></blockquote>
<p>(Some universities have extended CUDA to work on multicore CPUs by assigning a grid to the CPU with each core executing the threads in one or more blocks.  I don&#8217;t have a link to this work, but it is mentioned <a title="CUDA on multicore CPUs." href="http://www.eetimes.com/news/latest/showArticle.jhtml?articleID=207403647&amp;pgno=3" target="_blank">here</a>.)</p>
<p>In combination with the hierarchy of processing units, the G80 also provides a memory hierarchy:</p>
<blockquote><p>Global memory: This memory is built from a bank of SDRAM chips connected to the GPU chip.  Any thread in any MP can read or write to any location in the global memory.  Sometimes this is called <em>device memory</em>.</p>
<p>Texture cache: This is a memory within each MP that can be filled with data from the global memory so it acts like a cache.  Threads running in the MP are restricted to read-only access of this memory.</p>
<p>Constant cache: This is a read-only memory within each MP.</p>
<p>Shared memory: This is a small memory within each MP that can be read/written by any thread in a block assigned to that MP.</p>
<p>Registers: Each MP has a number of registers that are shared between its SPs.</p></blockquote>
<p>As usual, the upper levels of the memory hierarchy provide larger, slower storage (access times of 400-600 cycles) while the lower levels are  smaller and faster (access times of several cycles or less, I think).  How large are these memories?  You can find out by running the DeviceQuery application included in the CUDA SDK.  This is the result I get for my Nvidia card:</p>
<pre>There is 1 device supporting CUDA

Device 0: "GeForce 8600 GTS"
  Major revision number:                         1
  Minor revision number:                         1
  Total amount of global memory:                 268173312 bytes
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 8192
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1458000 kilohertz

Test PASSED</pre>
<p>The GPU device is hooked to 256 MB of SDRAM that provides the global memory.  Each MP in the GPU has access to 16 KB of shared memory and 8192 registers (I&#8217;m not sure what the register width is).  And there is 64 KB of constant memory for all the MPs in the GPU (I&#8217;m not sure why that is not reported per-MP like the shared memory is).</p>
<p>DeviceQuery also shows the limits on the sizes of blocks and grids.  A block is one-, two- or three-dimensional with the maximum sizes of the <em>x</em>, <em>y</em> and <em>z</em> dimensions being 512, 512 and 64, respectively, and such that <em>x</em> × <em>y</em> × <em>z</em> ≤ 512, which is the maximum number of threads per block.  Blocks are organized into one- or two-dimensional grids of up to 65,535 blocks in each dimension.  The primary limitation here is the maximum of 512 threads per block, primarily imposed by the small number of registers that can be allocated across all the threads running in all the blocks assigned to an MP.  The thread limit constrains the amount of cooperation between threads because only threads within the same block can synchronize with each other and exchange data through the fast shared memory in an MP.</p>
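<p>The block-size limits can be written down as a small host-side check.  This is only a sketch: the helper name block_dims_ok is my own invention, and the constants are copied from the DeviceQuery output above for the 8600 GTS, so they will likely differ on other devices.</p>
<pre class="brush: cpp; title: ; notranslate">
// Limits taken from the DeviceQuery output above (assumed, device-specific).
const int MAX_X = 512, MAX_Y = 512, MAX_Z = 64;
const int MAX_THREADS_PER_BLOCK = 512;

// Hypothetical helper: true if an x-by-y-by-z block respects every limit.
bool block_dims_ok(int x, int y, int z)
{
  if (x &lt; 1 || y &lt; 1 || z &lt; 1) return false;
  if (x &gt; MAX_X || y &gt; MAX_Y || z &gt; MAX_Z) return false;
  // Each dimension can be in range while the product is still too large.
  return (long long)x * y * z &lt;= MAX_THREADS_PER_BLOCK;
}
</pre>
<p>For example, block_dims_ok(8, 8, 8) is true because 8 × 8 × 8 = 512, while block_dims_ok(32, 32, 1) is false because the product is 1024 even though each dimension is within its own limit.</p>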
<p>Between the memory sizes and the thread/block/grid dimensions is the <em>warp size</em>.  What is a warp?  I think the definition that applies to CUDA is &#8220;threads in a fabric running lengthwise&#8221;.  The warp size is the number of threads running concurrently on an MP.  In actuality, the threads are running both in parallel and pipelined.  At the time this was written, each MP contains eight SPs and the fastest instruction takes four cycles.  Therefore, each SP can have four instructions in its pipeline for a total of 8 × 4 = 32 instructions being executed concurrently.  Within a warp, the threads all have sequential indices so there is a warp with indices 0..31, the next with indices 32..63 and so on up to the total number of threads in a block.</p>
<p>The homogeneity of the threads in a warp has a big effect on the computational throughput.  If all the threads are executing the same instruction, then all the SPs in an MP can execute the same instruction in parallel.  But if one or more threads in a warp are executing a different instruction from the others, then the warp has to be partitioned into groups of threads based on the instructions being executed, after which the groups are executed one after the other.  This serialization reduces the throughput as the threads become more and more divergent and split into smaller and smaller groups.  So it pays to keep the threads as homogeneous as possible.</p>
<p>How the threads access global memory also affects the throughput.  Things go much faster if the GPU can coalesce several global addresses into a single burst access over the wide data bus that goes to the external SDRAM.  Conversely, reading/writing separated memory addresses requires multiple accesses to the SDRAM which slows things down.  To help the GPU combine multiple accesses, the addresses generated by the threads in a warp must be sequential with respect to the thread indices, i.e. thread N must access address Base + N where Base is a pointer of type T and is aligned  to 16 × sizeof(T) bytes.</p>
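<p>To make the addressing rule concrete, here is a rough host-side check of whether a group of threads follows it.  The helper name tracks_thread_index is hypothetical, and I am assuming the array itself starts at a suitably aligned address so that the alignment requirement can be expressed in elements rather than bytes.</p>
<pre class="brush: cpp; title: ; notranslate">
#include &lt;vector&gt;
#include &lt;cstddef&gt;

// Alignment of Base in elements: 16 x sizeof(T) bytes is 16 elements of type T.
const int ALIGN_ELEMS = 16;

// Hypothetical helper: idx[n] is the array element accessed by thread n of a group.
// Returns true when thread n accesses element base + n and base is aligned.
bool tracks_thread_index(const std::vector&lt;int&gt; &amp;idx)
{
  if (idx.empty()) return false;
  int base = idx[0];
  if (base % ALIGN_ELEMS != 0) return false;      // Base is not aligned
  for (std::size_t n = 0; n &lt; idx.size(); n++)
    if (idx[n] != base + (int)n) return false;    // thread n is not at base + n
  return true;
}
</pre>
<p>Sixteen threads reading elements 0..15 pass the check.  The same threads reading 1..16 fail it (the base is misaligned), as do threads reading 0, 2, 4, &#8230; (the addresses are not sequential).</p>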
<p>A program that shows some of these performance effects is given below.</p>
<pre class="brush: cpp; title: ; notranslate">
// example2.cpp : Defines the entry point for the console application.
//

#include &quot;stdafx.h&quot;

#include &lt;stdio.h&gt;
#include &lt;cuda.h&gt;
#include &lt;cutil.h&gt;

// Kernel that executes on the CUDA device
__global__ void square_array(float *a, int N)
{
#define STRIDE       32
#define OFFSET        0
#define GROUP_SIZE  512
  int n_elem_per_thread = N / (gridDim.x * blockDim.x);
  int block_start_idx = n_elem_per_thread * blockIdx.x * blockDim.x;
  int thread_start_idx = block_start_idx
			+ (threadIdx.x / STRIDE) * n_elem_per_thread * STRIDE
			+ ((threadIdx.x + OFFSET) % STRIDE);
  int thread_end_idx = thread_start_idx + n_elem_per_thread * STRIDE;
  if(thread_end_idx &gt; N) thread_end_idx = N;
  int group = (threadIdx.x / GROUP_SIZE) &amp; 1;
  for(int idx=thread_start_idx; idx &lt; thread_end_idx; idx+=STRIDE)
  {
    if(!group) a[idx] = a[idx] * a[idx];
    else       a[idx] = a[idx] + a[idx];
  }
}

// main routine that executes on the host
int main(void)
{
  float *a_h, *a_d;  // Pointer to host &amp; device arrays
  const int N = 1&lt;&lt;25;  // Make a big array with 2**25 elements
  size_t size = N * sizeof(float);
  a_h = (float *)malloc(size);        // Allocate array on host
  cudaMalloc((void **) &amp;a_d, size);   // Allocate array on device
  // Initialize host array and copy it to CUDA device
  for (int i=0; i&lt;N; i++) a_h[i] = (float)i;
  cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
  // Create timer for timing CUDA calculation
  unsigned int timer = 0;
  cutCreateTimer( &amp;timer );
  // Set number of threads and blocks
  int n_threads_per_block = 1&lt;&lt;9;  // 512 threads per block
  int n_blocks = 1&lt;&lt;10;  // 1024 blocks
  // Do calculation on device
  cutStartTimer( timer );  // Start timer
  square_array &lt;&lt;&lt; n_blocks, n_threads_per_block &gt;&gt;&gt; (a_d, N);
  cudaThreadSynchronize();  // Wait for square_array to finish on CUDA
  cutStopTimer( timer );  // Stop timer
  // Retrieve result from device and store it in host array
  cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
  // Print some of the results and the CUDA execution time
  for (int i=0; i&lt;N; i+=N/50) printf(&quot;%d %f\n&quot;, i, a_h[i]);
  printf(&quot;CUDA execution time = %f ms\n&quot;,cutGetTimerValue( timer ));
  // Cleanup
  free(a_h); cudaFree(a_d);
}
</pre>
<p>This program is just a modification of <a title="My first CUDA program" href="http://llpanorama.wordpress.com/2008/05/21/my-first-cuda-program/" target="_blank">my first CUDA program</a>.  The size of the array on line 35 is increased from ten to 32M elements so the kernel will execute long enough to get an accurate reading with the timer (declared on line 44).  The number of blocks and threads per block are increased (lines 46 and 47) because that provides a large pool of available computations the GPU can execute while waiting for memory accesses to complete.</p>
<p>The timer is started (line 49) and the kernel is initiated (line 50).  Because the CUDA device works in parallel with the CPU, a call to the cudaThreadSynchronize() subroutine is needed to halt the CPU until the kernel has completed (line 51).   Then the timer can be stopped (line 52).  The execution time for the kernel is output on line 57.</p>
<p>The kernel (lines 11-29) has been modified so that each thread loops over a set of array elements using a memory access pattern that can be changed along with the homogeneity of the threads.  The number of array elements handled by each thread is calculated on line 16 by dividing the array size by the total number of threads (which is the product of the number of blocks and the block dimension).  Then, each block of threads is assigned a starting address for the contiguous block of array elements that its threads will process (line 17).  Each thread in a block is assigned a starting address within its block determined by the thread index and the addressing stride length and offset.  For example, if there are 512 total array elements handled by eight threads with the stride and offset set to four and zero, respectively, then the addresses accessed by each thread are:</p>
<pre>thread 0:   0   4   8  12  16  20  24 ... 248 252
thread 1:   1   5   9  13  17  21  25 ... 249 253
thread 2:   2   6  10  14  18  22  26 ... 250 254
thread 3:   3   7  11  15  19  23  27 ... 251 255
thread 4: 256 260 264 268 272 276 280 ... 504 508
thread 5: 257 261 265 269 273 277 281 ... 505 509
thread 6: 258 262 266 270 274 278 282 ... 506 510
thread 7: 259 263 267 271 275 279 283 ... 507 511</pre>
<p>In this example, you can see that threads 0-3 are accessing sequential addresses in their section of the global memory, so their accesses can be coalesced.  The same applies to threads 4-7.  But the accesses for threads 0-3 and 4-7 combined cannot be coalesced because then they are not sequential.  So the GPU has to make a separate memory burst for each group of threads.</p>
<p>Using the same example, if the offset is increased to one then the addresses for each thread become:</p>
<pre>thread 0:   1   5   9  13  17  21  25 ... 249 253
thread 1:   2   6  10  14  18  22  26 ... 250 254
thread 2:   3   7  11  15  19  23  27 ... 251 255
thread 3:   0   4   8  12  16  20  24 ... 248 252
thread 4: 257 261 265 269 273 277 281 ... 505 509
thread 5: 258 262 266 270 274 278 282 ... 506 510
thread 6: 259 263 267 271 275 279 283 ... 507 511
thread 7: 256 260 264 268 272 276 280 ... 504 508</pre>
<p>Now the addresses are still in the same range, but they are no longer sequential with respect to the thread indices.</p>
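<p>The two address tables above can be reproduced on the host by copying the kernel&#8217;s index arithmetic into an ordinary function.  This is a sketch for a single block only (so blockIdx.x is taken to be zero and gridDim.x to be one), and the function name thread_addresses is my own.</p>
<pre class="brush: cpp; title: ; notranslate">
#include &lt;vector&gt;

// Host-side copy of the kernel's index arithmetic for one block.
// Returns the array indices visited by thread t.
std::vector&lt;int&gt; thread_addresses(int t, int n_threads, int n_elems,
                                  int stride, int offset)
{
  int n_elem_per_thread = n_elems / n_threads;
  int start = (t / stride) * n_elem_per_thread * stride
            + ((t + offset) % stride);
  int end = start + n_elem_per_thread * stride;
  if (end &gt; n_elems) end = n_elems;   // same clipping as the kernel
  std::vector&lt;int&gt; addrs;
  for (int idx = start; idx &lt; end; idx += stride) addrs.push_back(idx);
  return addrs;
}
</pre>
<p>Calling thread_addresses(t, 8, 512, 4, 0) for t = 0..7 gives the rows of the first table, and changing the last argument to one gives the rows of the second.</p>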
<p>A third macro parameter, GROUP_SIZE (line 15), is used to direct the flow of execution in the for loop (lines 24-28).  Setting the group size to four, for example, sets the group variable (line 23) to zero for threads with indices [0..3], [8..11], [16..19] &#8230; which makes these threads square the array elements (line 26).  The remaining threads, which have group set to one, just double the array elements (line 27).  This difference in operations prevents the MP from running the threads in both groups in parallel.</p>
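<p>The grouping expression is easy to try out on the host.  In the sketch below, group_of repeats the kernel&#8217;s expression, while same_branch is a hypothetical helper of mine that reports whether a run of consecutive threads all take the same side of the if statement.</p>
<pre class="brush: cpp; title: ; notranslate">
// Same expression as the kernel: 0 means "square", 1 means "double".
int group_of(int thread_idx, int group_size)
{
  return (thread_idx / group_size) &amp; 1;
}

// Hypothetical helper: true if threads first .. first+count-1 all take the same branch.
bool same_branch(int first, int count, int group_size)
{
  for (int t = first; t &lt; first + count; t++)
    if (group_of(t, group_size) != group_of(first, group_size)) return false;
  return true;
}
</pre>
<p>With a group size of 512, same_branch(0, 32, 512) is true, so a whole warp takes one branch.  With a group size of 16, a full warp of 32 threads is mixed but each run of 16 threads is not.  With a group size of eight, even 16 consecutive threads contain both branches.</p>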
<p>By playing with the STRIDE, OFFSET and GROUP_SIZE macros (lines 13-15), we can vary the memory access patterns and thread homogeneity and see their effect on execution times for the kernel.  The following trials were done using the Release version of the program with execution times averaged over 1000 runs:</p>
<table border="1" cellspacing="0" cellpadding="4" width="400" align="center">
<tbody>
<tr align="center">
<th align="center">Trial</th>
<th align="center">STRIDE</th>
<th align="center">OFFSET</th>
<th align="center">GROUP_SIZE</th>
<th align="center">Execution Time</th>
</tr>
<tr>
<td align="center">#1</td>
<td align="center">32</td>
<td align="center">0</td>
<td align="center">512</td>
<td align="center">14.5 ms</td>
</tr>
<tr>
<td align="center">#2</td>
<td align="center">16</td>
<td align="center">0</td>
<td align="center">512</td>
<td align="center">14.7 ms</td>
</tr>
<tr>
<td align="center">#3</td>
<td align="center">8</td>
<td align="center">0</td>
<td align="center">512</td>
<td align="center">86.0 ms</td>
</tr>
<tr>
<td align="center">#4</td>
<td align="center">32</td>
<td align="center">1</td>
<td align="center">512</td>
<td align="center">93.5 ms</td>
</tr>
<tr>
<td align="center">#5</td>
<td align="center">32</td>
<td align="center">0</td>
<td align="center">16</td>
<td align="center">14.4 ms</td>
</tr>
<tr>
<td align="center">#6</td>
<td align="center">32</td>
<td align="center">0</td>
<td align="center">8</td>
<td align="center">22.1 ms</td>
</tr>
<tr>
<td align="center">#7</td>
<td align="center">8</td>
<td align="center">1</td>
<td align="center">8</td>
<td align="center">85.5 ms</td>
</tr>
</tbody>
</table>
<p>Trial #1 uses the most advantageous settings with the stride set equal to the warp size and the offset set to zero so that each thread in a warp generates sequential addresses that track the thread indices.  The threads in a block are placed into a maximal group of 512 so they are all executing exactly the same instructions.</p>
<p>In trial #2, decreasing the stride to half the warp size has no effect on the kernel&#8217;s execution time.  Nvidia mentions that many operations in the MP proceed on a <em>half-warp</em> basis, but you can&#8217;t count on this behavior in future versions of their devices.</p>
<p>Further decreasing the stride in trial #3 expands the gaps between thread addresses and reduces the amount of address coalescing the GPU can do, to the point where the execution time increases by a factor of six.</p>
<p>Restoring the stride back to the warp size but edging the offset up to one in trial #4 causes the same problem with coalescing addresses.  So addresses that do not track the thread index cause performance problems as great as using non-sequential addresses.</p>
<p>In trial #5, the stride and offset are returned to their best settings while the size of the thread groups executing the same instruction is reduced to 16 threads.  No performance decrease is seen at half-warp size.</p>
<p>Decreasing the groups to eight threads in trial #6 does cause the MPs to execute the groups sequentially.  This causes only a 50% increase in run-time because the two groups still have many instructions that they can execute in common.</p>
<p>Finally, setting all the parameters to their worst-case values in trial #7 does not lead to a multiplicative increase in execution times because the effect of serializing the thread groups is hidden by the increased memory access times caused by the non-sequential addresses.</p>
<p>Here&#8217;s the <a title="Source code for example 2" href="ftp://ftp.drivehq.com/llpanorama/CUDA/example2.zip" target="_blank">source code for this example</a> if you want to try it.  (Note: I had to move the cutil32D.dll and cutil32.dll libraries from the CUDA SDK directory into my C:\CUDA\bin directory so my example2.exe program could find them.  You may have to do the same.)</p>
<p>So now I&#8217;ve gotten an idea of how the GPU is executing code and some of the factors that affect its performance.  (For reference and greater detail, please read the <a title="CUDA PRogramming Guide Version 1.1" href="http://developer.download.nvidia.com/compute/cuda/1_1/NVIDIA_CUDA_Programming_Guide_1.1.pdf" target="_blank">CUDA Programming Guide Version 1.1</a>.)  It&#8217;s important not to have a bunch of divergent conditional operations in the threads and not to jump around from place to place in global memory.  In fact, given the long access times, it&#8217;s probably better to minimize global memory traffic and make use of on-chip registers and shared memory instead.  I&#8217;ll take a look at these memories in a future post.</p>
<br /><img alt="" border="0" src="http://feeds.wordpress.com/1.0/categories/llpanorama.wordpress.com/25/" /> <img alt="" border="0" src="http://feeds.wordpress.com/1.0/tags/llpanorama.wordpress.com/25/" /> <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/llpanorama.wordpress.com/25/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/llpanorama.wordpress.com/25/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=llpanorama.wordpress.com&#038;blog=2601119&#038;post=25&#038;subd=llpanorama&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://llpanorama.wordpress.com/2008/06/11/threads-and-blocks-and-grids-oh-my/feed/</wfw:commentRss>
		<slash:comments>30</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/d9f61aab37fbd3971dc82737f85ee0c3?s=96&#38;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96&#38;r=G" medium="image">
			<media:title type="html">llpanorama</media:title>
		</media:content>
	</item>
		<item>
		<title>Which is more popular? TBB or CUDA?</title>
		<link>http://llpanorama.wordpress.com/2008/06/04/which-is-more-popular-tbb-or-cuda/</link>
		<comments>http://llpanorama.wordpress.com/2008/06/04/which-is-more-popular-tbb-or-cuda/#comments</comments>
		<pubDate>Wed, 04 Jun 2008 14:28:21 +0000</pubDate>
		<dc:creator>dave_vandenbout</dc:creator>
				<category><![CDATA[GPU]]></category>
		<category><![CDATA[multicore]]></category>
		<category><![CDATA[general]]></category>

		<guid isPermaLink="false">http://llpanorama.wordpress.com/?p=27</guid>
		<description><![CDATA[I&#8217;ve been monitoring the traffic on this blog since it started. (That isn&#8217;t hard &#8211; a big day has 100 page views.) Here are the accumulated hits for the blog posts I&#8217;ve made on TBB and CUDA: Accumulated Hits Framework #Hits Duration TBB 1,045 120 days CUDA 1,112 40 days It appears that CUDA has [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=llpanorama.wordpress.com&#038;blog=2601119&#038;post=27&#038;subd=llpanorama&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>I&#8217;ve been monitoring the traffic on this blog since it started.  (That isn&#8217;t hard &#8211; a big day has 100 page views.)  Here are the accumulated hits for the blog posts I&#8217;ve made on TBB and CUDA:</p>
<table border="1" cellspacing="0" cellpadding="4" width="200" align="center">
<tbody>
<tr align="center">
<th colspan="3">Accumulated Hits</th>
</tr>
<tr>
<th align="center">Framework</th>
<th align="center">#Hits</th>
<th align="center">Duration</th>
</tr>
<tr>
<td align="center">TBB</td>
<td align="center">1,045</td>
<td align="center">120 days</td>
</tr>
<tr>
<td align="center">CUDA</td>
<td align="center">1,112</td>
<td align="center">40 days</td>
</tr>
</tbody>
</table>
<p>It appears that CUDA has garnered as much attention as TBB, but in a much shorter time and with far fewer posts.  I&#8217;ll give three possible explanations for this:</p>
<ol>
<li>CUDA has been more visible in the tech press over the past few months, while TBB coverage has been almost non-existent.</li>
<li>People perceive a bigger payoff from learning about CUDA (which offers much more potential parallelism  with hundreds of parallel processors) than TBB (which uses the handful of cores available in today&#8217;s CPUs).</li>
<li>My most popular posts concern setting up CUDA or TBB on Windows and getting a small example to compile.  This is easy to do with TBB (after all, it&#8217;s being developed by Intel), but it&#8217;s hard to get the CUDA nvcc compiler integrated into Microsoft&#8217;s Visual C++, so people are looking for help with that.</li>
</ol>
<p>What do you think?  Is there some other reason I&#8217;ve missed?  What&#8217;s your parallel programming framework of choice and why?</p>
<p><em>Update (6/24/2008):</em></p>
<p>Here are the updated statistics.  CUDA is pulling away!</p>
<table border="1" cellspacing="0" cellpadding="4" width="200" align="center">
<tbody>
<tr align="center">
<th colspan="3">Accumulated Hits</th>
</tr>
<tr>
<th align="center">Framework</th>
<th align="center">#Hits</th>
<th align="center">Duration</th>
</tr>
<tr>
<td align="center">TBB</td>
<td align="center">1,267</td>
<td align="center">140 days</td>
</tr>
<tr>
<td align="center">CUDA</td>
<td align="center">2,722</td>
<td align="center">60 days</td>
</tr>
</tbody>
</table>
<p><em>Further Update (6/17/2010):</em></p>
<p>Let&#8217;s see where we are after two years:</p>
<table border="1" cellspacing="0" cellpadding="4" width="200" align="center">
<tbody>
<tr align="center">
<th colspan="2">Accumulated Hits</th>
</tr>
<tr>
<th align="center">Framework</th>
<th align="center">#Hits</th>
</tr>
<tr>
<td align="center">TBB</td>
<td align="center">10,220</td>
</tr>
<tr>
<td align="center">CUDA</td>
<td align="center">235,010</td>
</tr>
</tbody>
</table>
<br /><img alt="" border="0" src="http://feeds.wordpress.com/1.0/categories/llpanorama.wordpress.com/27/" /> <img alt="" border="0" src="http://feeds.wordpress.com/1.0/tags/llpanorama.wordpress.com/27/" /> <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/llpanorama.wordpress.com/27/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/llpanorama.wordpress.com/27/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=llpanorama.wordpress.com&#038;blog=2601119&#038;post=27&#038;subd=llpanorama&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://llpanorama.wordpress.com/2008/06/04/which-is-more-popular-tbb-or-cuda/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/d9f61aab37fbd3971dc82737f85ee0c3?s=96&#38;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96&#38;r=G" medium="image">
			<media:title type="html">llpanorama</media:title>
		</media:content>
	</item>
		<item>
		<title>parallel_scan finally explained!</title>
		<link>http://llpanorama.wordpress.com/2008/05/22/parallel_scan-finally-explained/</link>
		<comments>http://llpanorama.wordpress.com/2008/05/22/parallel_scan-finally-explained/#comments</comments>
		<pubDate>Thu, 22 May 2008 16:59:18 +0000</pubDate>
		<dc:creator>dave_vandenbout</dc:creator>
				<category><![CDATA[multicore]]></category>
		<category><![CDATA[TBB]]></category>

		<guid isPermaLink="false">http://llpanorama.wordpress.com/?p=24</guid>
		<description><![CDATA[I beat my head against parallel_scan for a week and never really understood why I was having the problems I did. Now the developers at Intel have provided a better explanation of how parallel_scan works. It turns out that the pre_scan method may never be run at all, so the final_scan method always has to [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=llpanorama.wordpress.com&#038;blog=2601119&#038;post=24&#038;subd=llpanorama&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p><a title="Problems getting parallel_scan to work." href="http://llpanorama.wordpress.com/2008/03/05/scanners-arent-those-the-guys-that-make-your-head-explode/" target="_self">I beat my head against parallel_scan</a> for a week and <a title="Finally got parallel_scan working." href="http://llpanorama.wordpress.com/2008/03/07/parallel_scan-works-kinda-sorta/" target="_self">never really understood why I was having the problems I did</a>.  Now the developers at Intel have provided <a title="A better explanation of parallel_scan." href="http://softwarecommunity.intel.com/isn/Community/en-US/forums/30253723/PostAttachment.aspx" target="_blank">a better explanation of how parallel_scan works</a>.  It turns out that the pre_scan method may never be run at all, so the final_scan method always has to re-do what was done in pre_scan just to be safe.  That explains why I had to make my pre_scan and final_scan methods identical in my example program.  It would have been nice if one of the developers had mentioned that within a few days of when <a title="TBB forum discussion of my problem with parallel_scan." href="http://softwarecommunity.intel.com/isn/Community/en-US/forums/thread/30250057.aspx" target="_blank">I submitted my problem to the Intel TBB forum</a>.  Or perhaps they should have called the method <strong>pre_scan_sometimes_if_we_feel_like_it</strong> just to warn TBB-users of the actual behavior.</p>
<p>Anyway, problem solved.</p>
<br /><img alt="" border="0" src="http://feeds.wordpress.com/1.0/categories/llpanorama.wordpress.com/24/" /> <img alt="" border="0" src="http://feeds.wordpress.com/1.0/tags/llpanorama.wordpress.com/24/" /> <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/llpanorama.wordpress.com/24/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/llpanorama.wordpress.com/24/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=llpanorama.wordpress.com&#038;blog=2601119&#038;post=24&#038;subd=llpanorama&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://llpanorama.wordpress.com/2008/05/22/parallel_scan-finally-explained/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/d9f61aab37fbd3971dc82737f85ee0c3?s=96&#38;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96&#38;r=G" medium="image">
			<media:title type="html">llpanorama</media:title>
		</media:content>
	</item>
		<item>
		<title>My first CUDA program!</title>
		<link>http://llpanorama.wordpress.com/2008/05/21/my-first-cuda-program/</link>
		<comments>http://llpanorama.wordpress.com/2008/05/21/my-first-cuda-program/#comments</comments>
		<pubDate>Wed, 21 May 2008 14:04:33 +0000</pubDate>
		<dc:creator>dave_vandenbout</dc:creator>
				<category><![CDATA[GPU]]></category>
		<category><![CDATA[CUDA]]></category>

		<guid isPermaLink="false">http://llpanorama.wordpress.com/?p=22</guid>
		<description><![CDATA[Note: Check out &#8220;CUDA Gets Easier&#8221; for a simpler way to create CUDA projects in Visual Studio. I got CUDA setup and running with Visual C++ 2005 Express Edition in my previous post. Now I&#8217;ll write my first CUDA program. It&#8217;s a modification of an example program from a great series of articles on CUDA [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=llpanorama.wordpress.com&#038;blog=2601119&#038;post=22&#038;subd=llpanorama&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p><em>Note: Check out <a title="CUDA Gets Easier" href="http://llpanorama.wordpress.com/2010/06/18/cuda-gets-easier/" target="_blank">&#8220;CUDA Gets Easier&#8221;</a> for a simpler way to create CUDA projects in Visual Studio.</em></p>
<p>I got CUDA set up and running with Visual C++ 2005 Express Edition in <a title="Getting started with CUDA" href="http://llpanorama.wordpress.com/2008/04/24/getting-started-with-cuda/" target="_blank">my previous post</a>.  Now I&#8217;ll write my first CUDA program.  It&#8217;s a modification of an example program from a <a title="Supercomputing for the Masses" href="http://ddj.com/hpc-high-performance-computing/207402986" target="_blank">great series of articles on CUDA by Rob Farber</a> published in <em>Dr. Dobb&#8217;s Journal</em>.  Rob does his examples in a make-based build environment; I&#8217;ll show how to build a CUDA program in the Visual C++ IDE.<span id="more-22"></span></p>
<p>Simple CUDA programs have a basic flow:</p>
<ol>
<li>The host initializes an array with data.</li>
<li>The array is copied from the host to the memory on the CUDA device.</li>
<li>The CUDA device operates on the data in the array.</li>
<li>The array is copied back to the host.</li>
</ol>
<p>My first CUDA program, shown below, follows this flow.  It takes an array and squares each element.  I can barely contain my excitement.</p>
<pre class="brush: cpp; title: ; notranslate">
// example1.cpp : Defines the entry point for the console application.
//

#include &quot;stdafx.h&quot;

#include &lt;stdio.h&gt;
#include &lt;cuda.h&gt;

// Kernel that executes on the CUDA device
__global__ void square_array(float *a, int N)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx&lt;N) a[idx] = a[idx] * a[idx];
}

// main routine that executes on the host
int main(void)
{
  float *a_h, *a_d;  // Pointer to host &amp; device arrays
  const int N = 10;  // Number of elements in arrays
  size_t size = N * sizeof(float);
  a_h = (float *)malloc(size);        // Allocate array on host
  cudaMalloc((void **) &amp;a_d, size);   // Allocate array on device
  // Initialize host array and copy it to CUDA device
  for (int i=0; i&lt;N; i++) a_h[i] = (float)i;
  cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
  // Do calculation on device:
  int block_size = 4;
  int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);
  square_array &lt;&lt;&lt; n_blocks, block_size &gt;&gt;&gt; (a_d, N);
  // Retrieve result from device and store it in host array
  cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
  // Print results
  for (int i=0; i&lt;N; i++) printf(&quot;%d %f\n&quot;, i, a_h[i]);
  // Cleanup
  free(a_h); cudaFree(a_d);
}

</pre>
<p>Two pointers are declared on line 19 of the main routine: a_h points to the array that is stored on the host, while a_d points to the array on the CUDA device.  The a_h array is allocated in the host memory on line 22 using the standard malloc subroutine, but a_d is allocated in the CUDA device memory using the cudaMalloc subroutine found in the CUDA API (line 23).  (Note that a pointer to the a_d pointer is passed to cudaMalloc so it can store the address of the array in a_d.)</p>
<p>In order to create some values to operate upon, each element in the host array is initialized with its array index (line 25).  Then the cudaMemcpy subroutine is used to copy a_h from the host into a_d on the CUDA device. (The cudaMemcpyHostToDevice flag, defined in the API, indicates the direction of the transfer.)</p>
<p>In lines 28-30, the host initiates the execution of the kernel function, square_array, on the CUDA device.  A CUDA device contains individual processing elements, each of which can execute a thread.  A number of the processing elements are grouped together to form a block, and a number of blocks constitutes a grid.  In this example, the number of threads per block is set to four (line 28).   Then the total number of blocks that are needed to get enough threads to square each array element is calculated on line 29.  (For ten array elements, three blocks each with four threads are needed.)  On line 30, the host initiates the kernel function on the CUDA device.  The number of blocks and the number of threads in each block are indicated between the &lt;&lt;&lt;&#8230;&gt;&gt;&gt; following the kernel name.  (This information is picked up by the Nvidia compiler, nvcc, and is used when generating the instructions that start the kernel on the CUDA device. More on nvcc, later.)  Following that, the standard argument list to square_array contains a pointer to the array in the CUDA device memory and the number of elements in the array.</p>
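<p>The block count on line 29 is a round-up division.  As a small sketch, it can also be written in the form below, which gives the same result for positive values; the helper name blocks_needed is my own and does not appear in the program.</p>
<pre class="brush: cpp; title: ; notranslate">
// Hypothetical helper: the smallest number of blocks whose threads cover
// n array elements, assuming n &gt; 0 and block_size &gt; 0.
int blocks_needed(int n, int block_size)
{
  return (n + block_size - 1) / block_size;
}
</pre>
<p>With ten elements and four threads per block, blocks_needed(10, 4) is three, so twelve threads are launched and the last two have nothing to do.</p>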
<p>The kernel is shown on lines 10-14.  The __global__ keyword indicates that this is a kernel function that should be processed by nvcc to create machine code that executes on the CUDA device, not the host.  In this example, each thread will execute the same kernel function and will operate upon only a single array element.  Each thread is distinguished from all the others by block and thread indices that can be used to determine the array element the thread will access. On line 12, the array index is found by multiplying the thread&#8217;s block index (blockIdx.x) by the number of threads in each block (blockDim.x) and then adding the index of the thread within the block (threadIdx.x).  If the index is within the bounds of the array, then the corresponding array element is squared (line 13).</p>
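<p>One way to convince yourself that the index arithmetic covers every element exactly once is to imitate the launch with two nested loops on the host.  This is purely an illustration, not how the device schedules threads; the function name square_on_host is hypothetical.</p>
<pre class="brush: cpp; title: ; notranslate">
#include &lt;vector&gt;

// Host-side imitation of the kernel launch: visits every (block, thread) pair,
// forms idx the same way the kernel does, and applies the same bounds test.
// Returns how many threads were launched but had no element to process.
int square_on_host(std::vector&lt;float&gt; &amp;a, int n_blocks, int block_size)
{
  int n = (int)a.size();
  int idle = 0;
  for (int block = 0; block &lt; n_blocks; block++)
    for (int thread = 0; thread &lt; block_size; thread++)
    {
      int idx = block * block_size + thread;
      if (idx &lt; n) a[idx] = a[idx] * a[idx];
      else idle++;
    }
  return idle;
}
</pre>
<p>For a ten-element array processed by three blocks of four threads, every element is squared once and the function returns two, which is the number of threads that fall outside the bounds of the array.</p>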
<p>Immediately after starting the kernel, the host requests a transfer of the data from the array in the CUDA device memory back to the array in the host memory (line 32).  The transfer does not start until the CUDA device has finished executing the kernel, so there is no chance of copying back data that has not been processed yet.  Then the host displays the contents of the array (line 34) and frees the array memory on both itself and the CUDA device (line 36).</p>
<p>At this point, I have a CUDA-enabled program, but I don&#8217;t have it integrated into a Visual C++ project.  It actually takes a bit of work to do that.  To start, I brought up the Visual C++ 2005 Express Edition IDE and clicked on the New Project button (you can also use File→New→Project… from the menu). In the New Project window, I selected Win32 as the project type and Win32 Console Application as the template. I gave the project the creative name of example1 and set its location to the C:\llpanorama\CUDA\examples directory. After clicking OK in the New Project window, and then clicking Finish in the Win32 Application Wizard window, a window opened with a simple code skeleton. I replaced the code skeleton with the code shown above.</p>
<p>After saving the code, I right-clicked the example1.cpp file, selected Rename from the drop-down menu and renamed the file to example1.cu.  Files with the .cu extension are intended to be processed by nvcc.  nvcc will extract the kernel portion of example1.cu and compile it for execution on the CUDA device while using the Visual C++ compiler to compile the remainder of the file for execution on the host.</p>
<p>In its default configuration, Visual C++ doesn&#8217;t know how to compile .cu files.  It has to be told explicitly how to do this using a Custom Build Step.  This is done by right-clicking on the example1.cu file and selecting Properties from the drop-down menu.  In the Property Pages window that appears, set the Custom Build Step command line as follows:</p>
<blockquote><p>Configuration Properties → Custom Build Step → General:<br />
Command Line =<br />
&quot;$(CUDA_BIN_PATH)\nvcc.exe&quot; -ccbin &quot;$(VCInstallDir)bin&quot; -c -D_DEBUG -DWIN32 -D_CONSOLE -D_MBCS -Xcompiler /EHsc,/W3,/nologo,/Wp64,/Od,/Zi,/MTd -I&quot;$(CUDA_INC_PATH)&quot; -I./ -o $(ConfigurationName)\example1.obj example1.cu</p></blockquote>
<p>What does this command line do?  Let&#8217;s break it down piece-by-piece:</p>
<blockquote><p>&quot;$(CUDA_BIN_PATH)\nvcc.exe&quot;: The location of the nvcc compiler.</p>
<p>-ccbin &quot;$(VCInstallDir)bin&quot;: The location of the Visual C++ compiler.</p>
<p>-c: Compile the source into an object file (.obj extension) and stop without linking.</p>
<p>-D_DEBUG -DWIN32 -D_CONSOLE -D_MBCS: Preprocessor macro definitions.</p>
<p>-Xcompiler /EHsc,/W3,/nologo,/Wp64,/Od,/Zi,/MTd: Options that nvcc passes directly to the Visual C++ compiler.</p>
<p>-I&quot;$(CUDA_INC_PATH)&quot;: Look in the CUDA include directories for needed header files.</p>
<p>-I./: Look in the current directory for needed header files.</p>
<p>-o $(ConfigurationName)\example1.obj: The location and name of the resulting object file.</p>
<p>example1.cu: The source file that the compiler will work on.</p></blockquote>
<p>In addition to setting the command line for the example1.cu file, the location of the output file is specified as follows:</p>
<blockquote><p>Configuration Properties → Custom Build Step → General:<br />
Outputs = $(ConfigurationName)\example1.obj</p></blockquote>
<p>After setting the file properties, the properties for the example1 project have to be modified.  Here are the project property settings I used for the Debug configuration:</p>
<blockquote><p>Configuration Properties → C/C++ → General:<br />
Additional Include Directories = $(CUDA_INC_PATH);&quot;C:\Program Files\NVIDIA Corporation\NVIDIA CUDA SDK\common\inc&quot;</p>
<p>Configuration Properties → C/C++ → General:<br />
Debug Information Format = Program Database (/Zi)</p>
<p>Configuration Properties → C/C++ → Code Generation:<br />
Runtime Library = Multi-threaded Debug (/MTd)</p>
<p>Configuration Properties → Linker → General:<br />
Enable incremental linking = No (/INCREMENTAL:NO)</p>
<p>Configuration Properties → Linker → General:<br />
Additional Library Directories = &quot;C:\CUDA\lib&quot;;&quot;C:\Program Files\NVIDIA Corporation\NVIDIA CUDA SDK\common\lib&quot;</p>
<p>Configuration Properties → Linker → Input:<br />
Additional Dependencies = cudart.lib cutil32D.lib</p>
<p>Configuration Properties → Linker → Optimization:<br />
Enable COMDAT folding = Do Not Remove Redundant COMDATs (/OPT:NOICF)</p></blockquote>
<p>Now the project can be compiled and run.  Here&#8217;s the result:</p>
<blockquote><p>0     0.000000<br />
1     1.000000<br />
2     4.000000<br />
3     9.000000<br />
4     16.000000<br />
5     25.000000<br />
6     36.000000<br />
7     49.000000<br />
8     64.000000<br />
9     81.000000</p></blockquote>
<p>I told you it was exciting!  Well, at least it&#8217;s right.</p>
<p>In order to compile the Release configuration, a few changes need to be made to the file and project properties.  For the example1.cu file, the Custom Build Step command line has to be changed to remove the _DEBUG macro definition, enable compiler optimization, and link with the Release runtime library:</p>
<blockquote><p>Configuration Properties → Custom Build Step → General:<br />
Command Line =<br />
&quot;$(CUDA_BIN_PATH)\nvcc.exe&quot; -ccbin &quot;$(VCInstallDir)bin&quot; -c <span style="text-decoration:line-through;">-D_DEBUG</span> -DWIN32 -D_CONSOLE -D_MBCS -Xcompiler /EHsc,/W3,/nologo,/Wp64,<strong>/O2</strong>,/Zi,<strong>/MT</strong> -I&quot;$(CUDA_INC_PATH)&quot; -I./ -o $(ConfigurationName)\example1.obj example1.cu</p></blockquote>
<p>The project properties that have to be changed in the Release configuration are the linking for the runtime library and the use of the non-debug version of the CUDA utilities library:</p>
<blockquote><p>Configuration Properties → C/C++ → Code Generation:<br />
Runtime Library = Multi-threaded (/MT)</p>
<p>Configuration Properties → Linker → Input:<br />
Additional Dependencies = cudart.lib cutil32.lib</p></blockquote>
<p>Once those changes are made, the Release version of the example1 project can be compiled and run.  It will output the same exciting result.</p>
<p>Here&#8217;s the <a title="Source code for example 1" href="ftp://ftp.drivehq.com/llpanorama/CUDA/example1.zip" target="_blank">source code for this example</a> if you want to try it.</p>
<p>Don&#8217;t have a CUDA-capable GPU board in your PC but still want to try running this program?  Easy!  Just add the following option to the Custom Build Step command line: -deviceemu.  This will link in a CUDA device emulator that runs on the host.  The emulator becomes the target for all the CUDA API calls and executes the kernel.  The program will run just as if a CUDA device were there, only slower.  (<a href="ftp://ftp.drivehq.com/llpanorama/CUDA/example1_emu.zip">Here</a> is the project file with the -deviceemu option.)</p>
<p>So I&#8217;ve written my first CUDA program and gotten it to compile using Visual C++ 2005 Express Edition.  Setting up the compilation options was as much (more?) work as writing the program, so you might be interested in a <a title="CUDA template for Visual C++ 2005" href="http://forums.nvidia.com/index.php?showtopic=65111" target="_blank">CUDA template for Visual C++ 2005</a> written by <a title="kyzhao profile on Nvidia forum" href="http://forums.nvidia.com/index.php?showuser=77682" target="_blank">kyzhao</a>.  The installer doesn&#8217;t work for me (maybe because I&#8217;m using the free Express Edition), but it might help you.</p>
]]></content:encoded>
			<wfw:commentRss>http://llpanorama.wordpress.com/2008/05/21/my-first-cuda-program/feed/</wfw:commentRss>
		<slash:comments>157</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/d9f61aab37fbd3971dc82737f85ee0c3?s=96&#38;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96&#38;r=G" medium="image">
			<media:title type="html">llpanorama</media:title>
		</media:content>
	</item>
	</channel>
</rss>
