Sophie: nvidia-cuda-toolkit-devel-6.5.14-6.1.mga5.nonfree x86

nvidia-cuda-toolkit-devel-6.5.14-6.1.mga5.nonfree.x86_64.rpm

<!DOCTYPE html
  PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-us" xml:lang="en-us">
   <head>
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta>
      <meta http-equiv="X-UA-Compatible" content="IE=edge"></meta>
      <meta name="copyright" content="(C) Copyright 2005"></meta>
      <meta name="DC.rights.owner" content="(C) Copyright 2005"></meta>
      <meta name="DC.Type" content="concept"></meta>
      <meta name="DC.Title" content="Tuning CUDA Applications for Kepler"></meta>
      <meta name="abstract" content="The programming guide to tuning CUDA Applications for GPUs based on the NVIDIA Kepler Architecture."></meta>
      <meta name="description" content="The programming guide to tuning CUDA Applications for GPUs based on the NVIDIA Kepler Architecture."></meta>
      <meta name="DC.Coverage" content="Programming Guides"></meta>
      <meta name="DC.subject" content="CUDA Kepler, CUDA Kepler tuning, CUDA Kepler best practices, CUDA Kepler performance"></meta>
      <meta name="keywords" content="CUDA Kepler, CUDA Kepler tuning, CUDA Kepler best practices, CUDA Kepler performance"></meta>
      <meta name="DC.Format" content="XHTML"></meta>
      <meta name="DC.Identifier" content="abstract"></meta>
      <link rel="stylesheet" type="text/css" href="../common/formatting/commonltr.css"></link>
      <link rel="stylesheet" type="text/css" href="../common/formatting/site.css"></link>
      <title>Kepler Tuning Guide :: CUDA Toolkit Documentation</title>
      <!--[if lt IE 9]>
      <script src="../common/formatting/html5shiv-printshiv.min.js"></script>
      <![endif]-->
      <script type="text/javascript" charset="utf-8" src="../common/scripts/tynt/tynt.js"></script>
      <script type="text/javascript" charset="utf-8" src="../common/formatting/jquery.min.js"></script>
      <script type="text/javascript" charset="utf-8" src="../common/formatting/jquery.ba-hashchange.min.js"></script>
      <script type="text/javascript" charset="utf-8" src="../common/formatting/jquery.scrollintoview.min.js"></script>
      <script type="text/javascript" src="../search/htmlFileList.js"></script>
      <script type="text/javascript" src="../search/htmlFileInfoList.js"></script>
      <script type="text/javascript" src="../search/nwSearchFnt.min.js"></script>
      <script type="text/javascript" src="../search/stemmers/en_stemmer.min.js"></script>
      <script type="text/javascript" src="../search/index-1.js"></script>
      <script type="text/javascript" src="../search/index-2.js"></script>
      <script type="text/javascript" src="../search/index-3.js"></script>
      <link rel="canonical" href="http://docs.nvidia.com/cuda/kepler-tuning-guide/index.html"></link>
      <link rel="stylesheet" type="text/css" href="../common/formatting/qwcode.highlight.css"></link>
   </head>
   <body>
      
      <header id="header"><span id="company">NVIDIA</span><span id="site-title">CUDA Toolkit Documentation</span><form id="search" method="get" action="search">
            <input type="text" name="search-text"></input><fieldset id="search-location">
               <legend>Search In:</legend>
               <label><input type="radio" name="search-type" value="site"></input>Entire Site</label>
               <label><input type="radio" name="search-type" value="document"></input>Just This Document</label></fieldset>
            <button type="reset">clear search</button>
            <button id="submit" type="submit">search</button></form>
      </header>
      <div id="site-content">
         <nav id="site-nav">
            <div class="category closed"><a href="../index.html" title="The root of the site.">CUDA Toolkit
                  v6.5</a></div>
            <div class="category"><a href="index.html" title="Kepler Tuning Guide">Kepler Tuning Guide</a></div>
            <ul>
               <li>
                  <div class="section-link"><a href="#tuning-cuda-applications-for-kepler">1.&nbsp;Kepler Tuning Guide</a></div>
                  <ul>
                     <li>
                        <div class="section-link"><a href="#nvidia-kepler-compute-architecture">1.1.&nbsp;NVIDIA Kepler Compute Architecture</a></div>
                     </li>
                     <li>
                        <div class="section-link"><a href="#cuda-best-practices">1.2.&nbsp;CUDA Best Practices</a></div>
                     </li>
                     <li>
                        <div class="section-link"><a href="#application-compatibility">1.3.&nbsp;Application Compatibility</a></div>
                     </li>
                     <li>
                        <div class="section-link"><a href="#kepler-tuning">1.4.&nbsp;Kepler Tuning</a></div>
                        <ul>
                           <li>
                              <div class="section-link"><a href="#device-utilization-and-occupancy">1.4.1.&nbsp;Device Utilization and Occupancy</a></div>
                           </li>
                           <li>
                              <div class="section-link"><a href="#managing-coarse-grained-parallelism">1.4.2.&nbsp;Managing Coarse-Grained Parallelism</a></div>
                              <ul>
                                 <li>
                                    <div class="section-link"><a href="#concurrent-kernels">1.4.2.1.&nbsp;Concurrent Kernels</a></div>
                                 </li>
                                 <li>
                                    <div class="section-link"><a href="#hyper-q">1.4.2.2.&nbsp;Hyper-Q</a></div>
                                 </li>
                                 <li>
                                    <div class="section-link"><a href="#dynamic-parallelism">1.4.2.3.&nbsp;Dynamic Parallelism</a></div>
                                 </li>
                              </ul>
                           </li>
                           <li>
                              <div class="section-link"><a href="#shared-memory-and-warp-shuffle">1.4.3.&nbsp;Shared Memory and Warp Shuffle</a></div>
                              <ul>
                                 <li>
                                    <div class="section-link"><a href="#shared-memory-bandwidth">1.4.3.1.&nbsp;Shared Memory Bandwidth</a></div>
                                 </li>
                                 <li>
                                    <div class="section-link"><a href="#shared-memory-capacity">1.4.3.2.&nbsp;Shared Memory Capacity</a></div>
                                 </li>
                                 <li>
                                    <div class="section-link"><a href="#warp-shuffle">1.4.3.3.&nbsp;Warp Shuffle</a></div>
                                 </li>
                              </ul>
                           </li>
                           <li>
                              <div class="section-link"><a href="#memory-throughput">1.4.4.&nbsp;Memory Throughput</a></div>
                              <ul>
                                 <li>
                                    <div class="section-link"><a href="#increased-addressable-registers">1.4.4.1.&nbsp;Increased Addressable Registers Per Thread</a></div>
                                 </li>
                                 <li>
                                    <div class="section-link"><a href="#l1-cache">1.4.4.2.&nbsp;L1 Cache</a></div>
                                 </li>
                                 <li>
                                    <div class="section-link"><a href="#read-only-data-cache">1.4.4.3.&nbsp;Read-Only Data Cache</a></div>
                                 </li>
                                 <li>
                                    <div class="section-link"><a href="#fast-global-memory-atomics">1.4.4.4.&nbsp;Fast Global Memory Atomics</a></div>
                                 </li>
                                 <li>
                                    <div class="section-link"><a href="#global-memory-bandwidth-gpuboost">1.4.4.5.&nbsp;Global Memory Bandwidth and GPU Boost</a></div>
                                 </li>
                                 <li>
                                    <div class="section-link"><a href="#memcopy2d">1.4.4.6.&nbsp;2D Memory Copies</a></div>
                                 </li>
                              </ul>
                           </li>
                           <li>
                              <div class="section-link"><a href="#instruction-throughput">1.4.5.&nbsp;Instruction Throughput</a></div>
                              <ul>
                                 <li>
                                    <div class="section-link"><a href="#single-precision-vs-double-precision">1.4.5.1.&nbsp;Single-precision vs. Double-precision</a></div>
                                 </li>
                              </ul>
                           </li>
                           <li>
                              <div class="section-link"><a href="#gpuboost">1.4.6.&nbsp;GPU Boost</a></div>
                           </li>
                           <li>
                              <div class="section-link"><a href="#multi-gpu">1.4.7.&nbsp;Multi-GPU</a></div>
                           </li>
                           <li>
                              <div class="section-link"><a href="#pcie-30">1.4.8.&nbsp;PCIe 3.0</a></div>
                           </li>
                           <li>
                              <div class="section-link"><a href="#warp-synchronous">1.4.9.&nbsp;Warp-synchronous Programming</a></div>
                           </li>
                        </ul>
                     </li>
                  </ul>
               </li>
               <li>
                  <div class="section-link"><a href="#references">A.&nbsp;References</a></div>
               </li>
               <li>
                  <div class="section-link"><a href="#revision-history">B.&nbsp;Revision History</a></div>
               </li>
            </ul>
         </nav>
         <div id="resize-nav"></div>
         <nav id="search-results">
            <h2>Search Results</h2>
            <ol></ol>
         </nav>
         
         <div id="contents-container">
            <div id="breadcrumbs-container">
               <div id="release-info">Kepler Tuning Guide
                  (<a href="../../pdf/Kepler_Tuning_Guide.pdf">PDF</a>)
                  -
                  
                  v6.5
                  (<a href="https://developer.nvidia.com/cuda-toolkit-archive">older</a>)
                  -
                  Last updated August 1, 2014
                  -
                  <a href="mailto:cudatools@nvidia.com?subject=CUDA Toolkit Documentation Feedback: Kepler Tuning Guide">Send Feedback</a>
                  -
                  <span class="st_facebook"></span><span class="st_twitter"></span><span class="st_linkedin"></span><span class="st_reddit"></span><span class="st_slashdot"></span><span class="st_tumblr"></span><span class="st_sharethis"></span></div>
            </div>
            <article id="contents">
               <div class="topic nested0" id="abstract"><a name="abstract" shape="rect">
                     <!-- --></a><h2 class="title topictitle1"><a href="#abstract" name="abstract" shape="rect">Tuning CUDA Applications for Kepler</a></h2>
                  <div class="body conbody"></div>
               </div>
               <div class="topic concept nested0" xml:lang="en-US" id="tuning-cuda-applications-for-kepler"><a name="tuning-cuda-applications-for-kepler" shape="rect">
                     <!-- --></a><h2 class="title topictitle1"><a href="#tuning-cuda-applications-for-kepler" name="tuning-cuda-applications-for-kepler" shape="rect">1.&nbsp;Kepler Tuning Guide</a></h2>
                  <div class="topic concept nested1" xml:lang="en-US" id="nvidia-kepler-compute-architecture"><a name="nvidia-kepler-compute-architecture" shape="rect">
                        <!-- --></a><h3 class="title topictitle2"><a href="#nvidia-kepler-compute-architecture" name="nvidia-kepler-compute-architecture" shape="rect">1.1.&nbsp;NVIDIA Kepler Compute Architecture</a></h3>
                     <div class="body conbody">
                        <p class="p">Kepler is NVIDIA's 3<sup class="ph sup">rd</sup>-generation architecture for CUDA
                           compute applications. Kepler retains and extends the same CUDA
                           programming model as in earlier NVIDIA architectures such as Fermi, and
                           applications that follow the best practices for the Fermi architecture
                           should typically see speedups on the Kepler architecture without any
                           code changes. This guide summarizes the ways that an application can be
                           fine-tuned to gain additional speedups by leveraging Kepler
                           architectural features.<a name="fnsrc_1" href="#fntarg_1" shape="rect"><sup>1</sup></a></p>
                        <p class="p">The Kepler architecture comprises two major variants: GK104 and
                           GK110. A detailed overview of the major improvements in GK104<a name="fnsrc_2" href="#fntarg_2" shape="rect"><sup>2</sup></a> and GK110<a name="fnsrc_3" href="#fntarg_3" shape="rect"><sup>3</sup></a> over the earlier Fermi architecture are described in a pair
                           of whitepapers <a class="xref" href="index.html#references" shape="rect">[1]</a><a class="xref" href="index.html#references" shape="rect">[2]</a> entitled <cite class="cite">NVIDIA GeForce GTX 680:
                              The fastest, most efficient GPU ever built</cite> for GK104 and
                           <cite class="cite">NVIDIA's Next Generation CUDA Compute Architecture: Kepler
                              GK110</cite> for GK110.
                        </p>
                        <p class="p">For details on the programming features discussed in this guide,
                           please refer to the <cite class="cite">CUDA C Programming Guide</cite>. Details
                           on the architectural features are covered in the architecture
                           whitepapers referenced above. Some of the Kepler features described in
                           this guide are specific to GK110, as noted; if not specified, Kepler
                           features refer to both GK104 and GK110.
                        </p>
                     </div>
                  </div>
                  <div class="topic concept nested1" xml:lang="en-US" id="cuda-best-practices"><a name="cuda-best-practices" shape="rect">
                        <!-- --></a><h3 class="title topictitle2"><a href="#cuda-best-practices" name="cuda-best-practices" shape="rect">1.2.&nbsp;CUDA Best Practices</a></h3>
                     <div class="body conbody">
                        <p class="p">The performance guidelines and best practices described in the
                           <cite class="cite">CUDA C Programming Guide</cite><a class="xref" href="index.html#references" shape="rect">[3]</a> and the <cite class="cite">CUDA C Best
                              Practices Guide</cite><a class="xref" href="index.html#references" shape="rect">[4]</a>
                           apply to all CUDA-capable GPU architectures. Programmers must primarily
                           focus on following those recommendations to achieve the best
                           performance.
                        </p>
                        <div class="p">The high-priority recommendations from those guides are as follows:
                           
                           <ul class="ul">
                              <li class="li">Find ways to parallelize sequential code,</li>
                              <li class="li">Minimize data transfers between the host and the device,</li>
                              <li class="li">Adjust kernel launch configuration to maximize device
                                 utilization,
                              </li>
                              <li class="li">Ensure global memory accesses are coalesced,</li>
                              <li class="li">Minimize redundant accesses to global memory whenever
                                 possible,
                              </li>
                              <li class="li">Avoid different execution paths within the same warp.</li>
                           </ul>
                        </div>
                     </div>
                  </div>
                  <div class="topic concept nested1" xml:lang="en-US" id="application-compatibility"><a name="application-compatibility" shape="rect">
                        <!-- --></a><h3 class="title topictitle2"><a href="#application-compatibility" name="application-compatibility" shape="rect">1.3.&nbsp;Application Compatibility</a></h3>
                     <div class="body conbody">
                        <p class="p">Before addressing the specific performance tuning issues covered in
                           this guide, refer to the <cite class="cite">Kepler Compatibility Guide for CUDA
                              Applications</cite> to ensure that your application is being compiled
                           in a way that will be compatible with Kepler.
                        </p>
                        <p class="p">Note that many of the GK110 architectural features described in this
                           document require the device code in the application to be compiled for
                           its native compute capability 3.5 target architecture
                           (<samp class="ph codeph">sm_35</samp>).
                        </p>
                     </div>
                  </div>
                  <div class="topic concept nested1" xml:lang="en-US" id="kepler-tuning"><a name="kepler-tuning" shape="rect">
                        <!-- --></a><h3 class="title topictitle2"><a href="#kepler-tuning" name="kepler-tuning" shape="rect">1.4.&nbsp;Kepler Tuning</a></h3>
                     <div class="topic concept nested2" xml:lang="en-US" id="device-utilization-and-occupancy"><a name="device-utilization-and-occupancy" shape="rect">
                           <!-- --></a><h3 class="title topictitle2"><a href="#device-utilization-and-occupancy" name="device-utilization-and-occupancy" shape="rect">1.4.1.&nbsp;Device Utilization and Occupancy</a></h3>
                        <div class="body conbody">
                           <p class="p">Kepler's new Streaming Multiprocessor, called SMX, has
                              significantly more CUDA Cores than the SM of Fermi GPUs, yielding a
                              throughput improvement of 2-3x per clock.<a name="fnsrc_4" href="#fntarg_4" shape="rect"><sup>4</sup></a> Furthermore, GK110 has increased memory
                              bandwidth over Fermi and GK104. To match these throughput
                              increases, we need roughly twice as much parallelism per
                              multiprocessor on Kepler GPUs, via either an increased number of
                              active warps of threads or increased instruction-level parallelism
                              (ILP) or some combination thereof.
                           </p>
                           <p class="p">Balancing this is the fact that GK104 ships with only 8
                              multiprocessors, half of the size of Fermi GF110, meaning that
                              GK104 needs roughly the same total amount of parallelism as is
                              needed by Fermi GF110, though it needs more parallelism per
                              multiprocessor to achieve this. Since GK110 can have up to 15
                              multiprocessors, which is similar to the number of multiprocessors
                              of Fermi GF110, then GK110 typically needs a larger amount of
                              parallelism than Fermi or GK104.
                           </p>
                           <div class="p">To enable the increased per-multiprocessor warp occupancy
                              beneficial to both GK104 and GK110, several important
                              multiprocessor resources have been significantly increased in
                              SMX:
                              
                              <ul class="ul">
                                 <li class="li">Kepler increases the size of the register file over Fermi
                                    by 2x per multiprocessor. On Fermi, the number of registers
                                    available was the primary limiting factor of occupancy for
                                    many kernels. On Kepler, these kernels can automatically
                                    fit more thread blocks per multiprocessor. For example, a
                                    kernel using 63 registers per thread and 256 threads per
                                    block can fit at most 16 concurrent warps per
                                    multiprocessor on Fermi (out of a maximum of 48, i.e., 33%
                                    theoretical occupancy).  The same configuration can fit 32
                                    warps per multiprocessor on Kepler (out of a maximum of 64,
                                    i.e., 50% theoretical occupancy).
                                 </li>
                                 <li class="li">Kepler has increased the maximum number of simultaneous
                                    blocks per multiprocessor from 8 to 16. As a result,
                                    kernels having their occupancy limited due to reaching the
                                    maximum number of thread blocks per multiprocessor will see
                                    increased theoretical occupancy in Kepler.
                                 </li>
                              </ul>
                           </div>
                           <p class="p">Note that these automatic occupancy improvements require kernel
                              launches with sufficient total thread blocks to fill Kepler. For
                              this reason, it remains a best practice to launch kernels with
                              significantly more thread blocks than necessary to fill current
                              GPUs, allowing this kind of scaling to occur naturally without
                              modifications to the application. The <cite class="cite">CUDA Occupancy
                                 Calculator</cite><a class="xref" href="index.html#references" shape="rect">[5]</a>
                              spreadsheet is a valuable tool in visualizing the achievable
                              occupancy for various kernel launch configurations.
                           </p>
                           <p class="p">Also note that Kepler GPUs can utilize ILP in place of
                              thread/warp-level parallelism (TLP) more readily than Fermi GPUs
                              can. Furthermore, some degree of ILP in conjunction with TLP is
                              <em class="ph i">required</em> by Kepler GPUs in order to approach peak
                              single-precision performance, since SMX's warp scheduler issues
                              one or two independent instructions from each of four warps per
                              clock. ILP can be increased by means of, for example, processing
                              several data items concurrently per thread or unrolling loops
                              in the device code, though note that either of these approaches
                              may also increase register pressure.
                           </p>
                        </div>
                     </div>
                     <div class="topic concept nested2" xml:lang="en-US" id="managing-coarse-grained-parallelism"><a name="managing-coarse-grained-parallelism" shape="rect">
                           <!-- --></a><h3 class="title topictitle2"><a href="#managing-coarse-grained-parallelism" name="managing-coarse-grained-parallelism" shape="rect">1.4.2.&nbsp;Managing Coarse-Grained Parallelism</a></h3>
                        <div class="body conbody">
                           <p class="p">Since GK110 requires more concurrently active threads than
                              either GK104 or Fermi, GK110 introduces several features that can
                              assist applications having more limited parallelism, where the
                              expanded multiprocessor resources described in <a class="xref" href="index.html#device-utilization-and-occupancy" shape="rect">Device Utilization and Occupancy</a> are difficult to
                              leverage from any single kernel launch. These improvements allow
                              the application to more readily use several concurrent kernel grids
                              to fill GK110:
                           </p>
                        </div>
                        <div class="topic concept nested3" xml:lang="en-US" id="concurrent-kernels"><a name="concurrent-kernels" shape="rect">
                              <!-- --></a><h3 class="title topictitle2"><a href="#concurrent-kernels" name="concurrent-kernels" shape="rect">1.4.2.1.&nbsp;Concurrent Kernels</a></h3>
                           <div class="body conbody">
                              <p class="p">Since the introduction of Fermi, applications have had the
                                 ability to launch several kernels concurrently. This provides a
                                 mechanism by which applications can fill the device with
                                 several smaller kernel launches simultaneously as opposed to a
                                 single larger one. On Fermi and on GK104, at most 16 kernels
                                 can execute concurrently; GK110 allows up to 32 concurrent
                                 kernels to execute, which can provide a speedup for
                                 applications with necessarily small (but independent) kernel
                                 launches.
                              </p>
                           </div>
                        </div>
                        <div class="topic concept nested3" xml:lang="en-US" id="hyper-q"><a name="hyper-q" shape="rect">
                              <!-- --></a><h3 class="title topictitle2"><a href="#hyper-q" name="hyper-q" shape="rect">1.4.2.2.&nbsp;Hyper-Q</a></h3>
                           <div class="body conbody">
                              <p class="p">GK110 further improves this with the addition of Hyper-Q,
                                 which removes the false dependencies that can be introduced
                                 among CUDA streams in cases of suboptimal kernel launch or
                                 memory copy ordering across streams in Fermi or GK104. Hyper-Q
                                 allows GK110 to handle the concurrent kernels and/or memory
                                 transfers in separate CUDA streams truly independently, rather
                                 than serializing the several streams into a single work queue
                                 at the hardware level. This allows applications to enqueue
                                 work into separate CUDA streams without considering the
                                 relative order of insertion of otherwise independent work,
                                 making concurrency of multiple kernels as well as overlapping
                                 of memory copies with computation much more readily achievable
                                 on GK110.
                              </p>
                              <p class="p">CUDA streams are automatically mapped onto Hyper-Q's
                                 multiple hardware work queues via connections to the hardware
                                 allocated by the CUDA Driver. While it is possible to allocate
                                 more CUDA streams than there are connections, this simply
                                 implies that the driver will alias several streams onto some or
                                 all of those connections. The
                                 <samp class="ph codeph">CUDA_DEVICE_MAX_CONNECTIONS</samp> environment
                                 variable can be used to specify the preferred number of
                                 connections to be allocated to the driver. The default is 8 (or
                                 fewer if CUDA Multi-Process Service is in use); the
                                 architectural maximum for GK110 is 32.
                              </p>
                              <p class="p">CUDA Multi-Process Service (MPS) presents another means
                                 by which applications can take advantage of Hyper-Q, wherein
                                 several host processes (typically MPI processes) share access
                                 to and submit work to the same GPU concurrently, each process
                                 receiving some subset of the available connections to that GPU.
                                 Using CUDA MPS, processes can achieve overlap of their
                                 respective memory transfers and computations with or without
                                 the use of CUDA streams, although at the cost of some added
                                 latency of work submission and a few other caveats. For more
                                 information see the CUDA MPS Overview<a class="xref" href="index.html#references" shape="rect">[6]</a>.
                              </p>
                           </div>
                        </div>
                        <div class="topic concept nested3" xml:lang="en-US" id="dynamic-parallelism"><a name="dynamic-parallelism" shape="rect">
                              <!-- --></a><h3 class="title topictitle2"><a href="#dynamic-parallelism" name="dynamic-parallelism" shape="rect">1.4.2.3.&nbsp;Dynamic Parallelism</a></h3>
                           <div class="body conbody">
                              <p class="p">GK110 also introduces a new architectural feature called
                                 Dynamic Parallelism, which allows the GPU to create additional
                                 work for itself. A programming model enhancement that leverages
                                 this architectural feature was introduced in CUDA 5.0 to enable
                                 kernels running on GK110 to launch additional kernels onto the
                                 same GPU.  Nested kernel launches are done via the same
                                 <samp class="ph codeph">&lt;&lt;&lt;&gt;&gt;&gt;</samp> triple-angle bracket
                                 notation used from the host and can make use of the familiar
                                 CUDA streams interface to specify whether or not the kernels
                                 launched are independent of one another. More than one GPU
                                 thread can simultaneously launch kernel grids (of the same or
                                 different kernels), further increasing the application's
                                 flexibility in keeping the GPU filled with parallel work.
                              </p>
                           </div>
                        </div>
                     </div>
                     <div class="topic concept nested2" xml:lang="en-US" id="shared-memory-and-warp-shuffle"><a name="shared-memory-and-warp-shuffle" shape="rect">
                           <!-- --></a><h3 class="title topictitle2"><a href="#shared-memory-and-warp-shuffle" name="shared-memory-and-warp-shuffle" shape="rect">1.4.3.&nbsp;Shared Memory and Warp Shuffle</a></h3>
                        <div class="topic concept nested3" xml:lang="en-US" id="shared-memory-bandwidth"><a name="shared-memory-bandwidth" shape="rect">
                              <!-- --></a><h3 class="title topictitle2"><a href="#shared-memory-bandwidth" name="shared-memory-bandwidth" shape="rect">1.4.3.1.&nbsp;Shared Memory Bandwidth</a></h3>
                           <div class="body conbody">
                              <p class="p">In balance with the increased computational throughput in
                                 Kepler's SMX described in <a class="xref" href="index.html#device-utilization-and-occupancy" shape="rect">Device Utilization and Occupancy</a>, shared
                                 memory bandwidth in SMX is twice that of Fermi's SM.  This
                                 bandwidth increase is exposed to the application through a
                                 configurable new 8-byte shared memory bank mode.  When this
                                 mode is enabled, 64-bit (8-byte) shared memory accesses (such
                                 as loading a double-precision floating point number from shared
                                 memory) achieve twice the effective bandwidth of 32-bit
                                 (4-byte) accesses. Applications that are sensitive to shared
                                 memory bandwidth can benefit from enabling this mode as long as
                                 their kernels' accesses to shared memory are for 8-byte
                                 entities wherever possible.
                              </p>
                           </div>
                        </div>
                        <div class="topic concept nested3" xml:lang="en-US" id="shared-memory-capacity"><a name="shared-memory-capacity" shape="rect">
                              <!-- --></a><h3 class="title topictitle2"><a href="#shared-memory-capacity" name="shared-memory-capacity" shape="rect">1.4.3.2.&nbsp;Shared Memory Capacity</a></h3>
                           <div class="body conbody">
                              <p class="p">Fermi introduced an L1 cache in addition to the shared
                                 memory available since the earliest CUDA-capable GPUs. In
                                 Fermi, the shared memory and the L1 cache share the same
                                 physical on-chip storage, and a split of 48 KB shared memory /
                                 16 KB L1 cache or vice versa can be selected per application or
                                 per kernel launch. Kepler continues this pattern and introduces
                                 an additional setting of 32 KB shared memory / 32 KB L1 cache,
                                 the use of which may benefit L1 hit rate in kernels that need
                                 more than 16 KB but less than 48 KB of shared memory per
                                 multiprocessor.
                              </p>
                              <p class="p">Since the maximum shared memory capacity per multiprocessor
                                 remains 48 KB, however, applications that depend on shared
                                 memory capacity either at a per-block level for data exchange
                                 or at a per-thread level for additional thread-private storage
                                 may require some rebalancing on Kepler to improve their
                                 achievable occupancy.  The <a class="xref" href="index.html#warp-shuffle" shape="rect">Warp Shuffle</a>
                                 operation for data-exchange uses of shared memory and the <a class="xref" href="index.html#increased-addressable-registers" shape="rect">Increased Addressable Registers Per Thread</a> as an
                                 alternative to thread-private uses of shared memory present two
                                 possible alternatives to achieve this rebalancing.
                              </p>
                           </div>
                        </div>
                        <div class="topic concept nested3" xml:lang="en-US" id="warp-shuffle"><a name="warp-shuffle" shape="rect">
                              <!-- --></a><h3 class="title topictitle2"><a href="#warp-shuffle" name="warp-shuffle" shape="rect">1.4.3.3.&nbsp;Warp Shuffle</a></h3>
                           <div class="body conbody">
                              <p class="p">Kepler introduces a new warp-level intrinsic called the
                                 <dfn class="term">shuffle</dfn> operation.  This feature allows the
                                 threads of a warp to exchange data with each other directly
                                 without going through shared (or global) memory. The shuffle
                                 instruction also has lower latency than shared memory access
                                 and does not consume shared memory space for data exchange, so
                                 this can present an attractive way for applications to rapidly
                                 interchange data among threads.
                              </p>
                           </div>
                        </div>
                     </div>
                     <div class="topic concept nested2" xml:lang="en-US" id="memory-throughput"><a name="memory-throughput" shape="rect">
                           <!-- --></a><h3 class="title topictitle2"><a href="#memory-throughput" name="memory-throughput" shape="rect">1.4.4.&nbsp;Memory Throughput</a></h3>
                        <div class="topic concept nested3" xml:lang="en-US" id="increased-addressable-registers"><a name="increased-addressable-registers" shape="rect">
                              <!-- --></a><h3 class="title topictitle2"><a href="#increased-addressable-registers" name="increased-addressable-registers" shape="rect">1.4.4.1.&nbsp;Increased Addressable Registers Per Thread</a></h3>
                           <div class="body conbody">
                              <p class="p">GK110 increases the maximum number of registers addressable
                                 per thread from 63 to 255. This can improve performance of
                                 bandwidth-limited kernels that have significant register
                                 spilling on Fermi or GK104. Experimentation should be used to
                                 determine the optimum balance of spilling vs. occupancy,
                                 however, as significant increases in the number of registers
                                 used per thread naturally decreases the warp occupancy that can
                                 be achieved, which trades off latency due to memory traffic for
                                 arithmetic latency due to fewer concurrent warps.
                              </p>
                           </div>
                        </div>
                        <div class="topic concept nested3" xml:lang="en-US" id="l1-cache"><a name="l1-cache" shape="rect">
                              <!-- --></a><h3 class="title topictitle2"><a href="#l1-cache" name="l1-cache" shape="rect">1.4.4.2.&nbsp;L1 Cache</a></h3>
                           <div class="body conbody">
                              <p class="p">L1 caching in Kepler GPUs is reserved only for local memory
                                 accesses, such as register spills and stack data. Global loads
                                 are cached in L2 only (or in the <a class="xref" href="index.html#read-only-data-cache" shape="rect">Read-Only Data Cache</a>).
                              </p>
                              <p class="p">GK110B-based products such as the Tesla K40 GPU Accelerator
                                 retain this behavior by default but also allow applications to
                                 opt-in to the Fermi-style behavior of caching both global and
                                 local loads in L1. To select this mode, pass the
                                 <samp class="ph codeph">-Xptxas -dlcm=ca</samp> flag to <samp class="ph codeph">nvcc</samp>
                                 at compile time.
                              </p>
                           </div>
                        </div>
                        <div class="topic concept nested3" xml:lang="en-US" id="read-only-data-cache"><a name="read-only-data-cache" shape="rect">
                              <!-- --></a><h3 class="title topictitle2"><a href="#read-only-data-cache" name="read-only-data-cache" shape="rect">1.4.4.3.&nbsp;Read-Only Data Cache</a></h3>
                           <div class="body conbody">
                              <p class="p">GK110 adds the ability for read-only data in global memory
                                 to be loaded through the same cache used by the texture
                                 pipeline via a standard pointer without the need to bind a
                                 texture beforehand and without the sizing limitations of
                                 standard textures.  Since this is a separate cache with a
                                 separate memory pipe and with relaxed memory coalescing rules,
                                 use of this feature can benefit the performance of
                                 bandwidth-limited kernels. This feature will be automatically
                                 enabled and utilized where possible by the compiler when
                                 compiling for GK110 as long as certain conditions are met.
                                 Foremost among these requirements is that the data <em class="ph i">must</em>
                                 be guaranteed read-only <em class="ph i">for the duration of the kernel</em>,
                                 as the read-only data cache is incoherent with respect to
                                 writes. In order to allow the compiler to detect that this
                                 condition is satisfied, a necessary (but not always sufficient)
                                 condition is that pointers used for loading such data should be
                                 marked with both the <samp class="ph codeph">const</samp><em class="ph i">and</em><samp class="ph codeph">__restrict__</samp> qualifiers. Note that adding
                                 these qualifiers where applicable can improve code generation
                                 quality via other mechanisms on earlier GPUs as well.
                              </p>
                              <p class="p">In cases where more explicit control over the read-only data
                                 cache mechanism is desired than the <samp class="ph codeph">const
                                    __restrict__</samp> qualifiers provide, or where the code is
                                 sufficiently complex that the compiler is unable to detect that
                                 the read-only data cache is safe to use, the
                                 <samp class="ph codeph">__ldg()</samp> intrinsic can be used in place of a
                                 normal pointer dereference to force the load to go through the
                                 read-only data cache.
                              </p>
                              <p class="p">Note that the read-only data cache accessed via
                                 <samp class="ph codeph">const __restrict__</samp> is separate and distinct
                                 from the constant cache acessed via the
                                 <samp class="ph codeph">__constant__</samp> qualifier.  Data loaded through
                                 the constant cache must be relatively small and must be
                                 accessed uniformly for good performance (i.e., all threads of a
                                 warp should access the same location at any given time),
                                 whereas data loaded through the read-only data cache can be
                                 much larger and can be accessed in a non-uniform pattern.
                                 These two data paths can be used simultaneously for different
                                 data if desired.
                              </p>
                           </div>
                        </div>
                        <div class="topic concept nested3" xml:lang="en-US" id="fast-global-memory-atomics"><a name="fast-global-memory-atomics" shape="rect">
                              <!-- --></a><h3 class="title topictitle2"><a href="#fast-global-memory-atomics" name="fast-global-memory-atomics" shape="rect">1.4.4.4.&nbsp;Fast Global Memory Atomics</a></h3>
                           <div class="body conbody">
                              <p class="p">Global memory atomic operations have dramatically higher
                                 throughput on Kepler than on Fermi. Algorithms requiring
                                 multiple threads to update the same location in memory
                                 concurrently have at times on earlier GPUs resorted to complex
                                 data rearrangements in order to minimize the number of atomics
                                 required. Given the improvements in global memory atomic
                                 performance, many atomics can be performed on Kepler nearly as
                                 quickly as memory loads.  This may simplify implementations
                                 requiring atomicity or enable algorithms previously deemed
                                 impractical.
                              </p>
                           </div>
                        </div>
                        <div class="topic concept nested3" xml:lang="en-US" id="global-memory-bandwidth-gpuboost"><a name="global-memory-bandwidth-gpuboost" shape="rect">
                              <!-- --></a><h3 class="title topictitle2"><a href="#global-memory-bandwidth-gpuboost" name="global-memory-bandwidth-gpuboost" shape="rect">1.4.4.5.&nbsp;Global Memory Bandwidth and GPU Boost</a></h3>
                           <div class="body conbody">
                              <p class="p">GK110B provides higher memory clocks (and, by extension,
                                 higher peak global memory bandwidth) than GK110. For the
                                 GK110B-based Tesla K40 GPU Accelerator, while all of the <a class="xref" href="index.html#gpuboost" shape="rect">GPU Boost</a> clock settings use the same 3GHz memory
                                 clock, the effective memory bandwidth utilization can typically
                                 be increased by using the highest boost setting for SM core
                                 clocks as well.
                              </p>
                           </div>
                        </div>
                        <div class="topic concept nested3" xml:lang="en-US" id="memcopy2d"><a name="memcopy2d" shape="rect">
                              <!-- --></a><h3 class="title topictitle2"><a href="#memcopy2d" name="memcopy2d" shape="rect">1.4.4.6.&nbsp;2D Memory Copies</a></h3>
                           <div class="body conbody">
                              <p class="p">The effective bandwidth of <samp class="ph codeph">cudaMemcpy2D()</samp>
                                 operations is best when avoiding the use of small device pitches
                                 together with large host pitches (&gt;64 KB).
                              </p>
                           </div>
                        </div>
                     </div>
                     <div class="topic concept nested2" xml:lang="en-US" id="instruction-throughput"><a name="instruction-throughput" shape="rect">
                           <!-- --></a><h3 class="title topictitle2"><a href="#instruction-throughput" name="instruction-throughput" shape="rect">1.4.5.&nbsp;Instruction Throughput</a></h3>
                        <div class="body conbody">
                           <p class="p">While the maximum instructions per clock (IPC) of both
                              floating-point and integer operations has been either increased or
                              maintained in Kepler as compared to Fermi, the relative ratios of
                              maximum IPC for various specific instructions has changed somewhat.
                              Refer to the <cite class="cite">CUDA C Programming Guide</cite> for details.
                           </p>
                        </div>
                        <div class="topic concept nested3" xml:lang="en-US" id="single-precision-vs-double-precision"><a name="single-precision-vs-double-precision" shape="rect">
                              <!-- --></a><h3 class="title topictitle2"><a href="#single-precision-vs-double-precision" name="single-precision-vs-double-precision" shape="rect">1.4.5.1.&nbsp;Single-precision vs. Double-precision</a></h3>
                           <div class="body conbody">
                              <p class="p">As one example of these instruction throughput ratios, an
                                 important difference between GK104 and GK110 is the ratio of
                                 peak single-precision to peak double-precision floating point
                                 performance. Whereas GK104 focuses primarily on high
                                 single-precision throughput, GK110 significantly improves the
                                 peak double-precision throughput over Fermi as well.
                                 Applications that depend heavily on high double-precision
                                 performance will generally perform best with GK110.
                              </p>
                           </div>
                        </div>
                     </div>
                     <div class="topic concept nested2" xml:lang="en-US" id="gpuboost"><a name="gpuboost" shape="rect">
                           <!-- --></a><h3 class="title topictitle2"><a href="#gpuboost" name="gpuboost" shape="rect">1.4.6.&nbsp;GPU Boost</a></h3>
                        <div class="body conbody">
                           <p class="p">NVIDIA GPU Boost is a feature available on the GK110B-based
                              Tesla K40 GPU Accelerator that makes use of power headroom to run
                              the SM core clock to a higher frequency. While the default clock
                              is set to the base clock, which is necessary for some applications
                              that are demanding on power (e.g., DGEMM), many application
                              workloads are less demanding on power and can take advantage of a
                              higher boost clock setting for added performance.
                           </p>
                           <p class="p">GPU Boost clocks can be selected through
                              <samp class="ph codeph">nvidia-smi</samp> or <samp class="ph codeph">NVML</samp>. See the
                              <cite class="cite">Tesla K40 Board Specification</cite> for additional
                              information.
                           </p>
                        </div>
                     </div>
                     <div class="topic concept nested2" xml:lang="en-US" id="multi-gpu"><a name="multi-gpu" shape="rect">
                           <!-- --></a><h3 class="title topictitle2"><a href="#multi-gpu" name="multi-gpu" shape="rect">1.4.7.&nbsp;Multi-GPU</a></h3>
                        <div class="body conbody">
                           <p class="p">NVIDIA's Tesla K10 GPU Accelerator is a dual-GK104
                              <dfn class="term">Gemini</dfn> board. As with other dual-GPU NVIDIA boards,
                              the two GPUs on the board will appear as two separate CUDA devices;
                              they have separate memories and operate independently. As such,
                              applications that will target the Tesla K10 GPU Accelerator but
                              that are not yet multi-GPU aware should begin preparing for the
                              multi-GPU paradigm. Since dual-GPU boards appear to the host
                              application exactly the same as two separate single-GPU boards,
                              enabling applications for multi-GPU can benefit application
                              performance on a wide range of systems where more than one
                              CUDA-capable GPU can be installed.
                           </p>
                        </div>
                     </div>
                     <div class="topic concept nested2" xml:lang="en-US" id="pcie-30"><a name="pcie-30" shape="rect">
                           <!-- --></a><h3 class="title topictitle2"><a href="#pcie-30" name="pcie-30" shape="rect">1.4.8.&nbsp;PCIe 3.0</a></h3>
                        <div class="body conbody">
                           <p class="p">Kepler's interconnection to the host system has been enhanced to
                              support PCIe 3.0. For applications where host-to-device,
                              device-to-host, or device-to-device transfer time is significant
                              and not easily overlapped with computation, the additional
                              bandwidth provided by PCIe 3.0, given the requisite host system
                              support, over the earlier PCIe 2.0 specification supported by Fermi
                              GPUs should boost application performance without modifications to
                              the application. Note that best PCIe transfer speeds to or from
                              system memory with either PCIe generation are achieved when using
                              pinned system memory.
                           </p>
                           <div class="note note"><span class="notetitle">Note:</span> In the Tesla K10 GPU Accelerator, the two GPUs sharing
                              a board are connected via an on-board PCIe 3.0 switch. Since these
                              GPUs are also capable of GPUDirect Peer-to-Peer transfers, the
                              inter-device memory transfers between GPUs on the same board can
                              run at PCIe 3.0 speeds even if the host system supports only PCIe
                              2.0 or earlier.
                           </div>
                           <div class="note note"><span class="notetitle">Note:</span> While the Kepler architecture is compliant with the PCIe 3.0
                              specification, not all Kepler-based products support PCIe 3.0
                              speeds. For example, while Tesla K10 and K40 support PCIe 3.0,
                              Tesla K20 and K20X do not.
                           </div>
                           <div class="note note"><span class="notetitle">Note:</span> PCIe 3.0 throughputs may be improved in some circumstances by
                              using the highest-available <a class="xref" href="index.html#gpuboost" shape="rect">GPU Boost</a> clock.
                           </div>
                        </div>
                     </div>
                     <div class="topic concept nested2" xml:lang="en-US" id="warp-synchronous"><a name="warp-synchronous" shape="rect">
                           <!-- --></a><h3 class="title topictitle2"><a href="#warp-synchronous" name="warp-synchronous" shape="rect">1.4.9.&nbsp;Warp-synchronous Programming</a></h3>
                        <div class="body conbody">
                           <p class="p">As a means of mitigating the cost of repeated block-level
                              synchronizations, particularly in parallel primitives such as
                              reduction and prefix sum, some programmers exploit the knowledge
                              that threads in a warp execute in lock-step with each other to omit
                              <samp class="ph codeph">__syncthreads()</samp> in some places where it is
                              semantically necessary for correctness in the CUDA programming
                              model.
                           </p>
                           <p class="p">The absence of an explicit synchronization in a program where
                              different threads communicate via memory constitutes a data race
                              condition or synchronization error.  Warp-synchronous programs are
                              unsafe and easily broken by evolutionary improvements to the
                              optimization strategies used by the CUDA compiler toolchain, which
                              generally has no visibility into cross-thread interactions of this
                              variety in the absence of barriers, or by changes to the hardware
                              memory subsystem's behavior. Such programs also tend to assume that
                              the warp size is 32 threads, which may not necessarily be the case
                              for all future CUDA-capable architectures.
                           </p>
                           <p class="p">Therefore, programmers should avoid warp-synchronous programming
                              to ensure future-proof correctness in CUDA applications. When
                              threads in a block must communicate or synchronize with each other,
                              regardless of whether those threads are expected to be in the same
                              warp or not, the appropriate barrier primitives should be used.
                              Note that the <a class="xref" href="index.html#warp-shuffle" shape="rect">Warp Shuffle</a> primitive
                              presents a future-proof, supported mechanism for intra-warp
                              communication that can safely be used as an alternative in many
                              cases.
                           </p>
                        </div>
                     </div>
                  </div>
               </div>
               <div class="topic reference nested0" id="references"><a name="references" shape="rect">
                     <!-- --></a><h2 class="title topictitle1"><a href="#references" name="references" shape="rect">A.&nbsp;References</a></h2>
                  <div class="body refbody">
                     <div class="section" id="references__1"><a name="references__1" shape="rect">
                           <!-- --></a><p class="p">[1] <cite class="cite">NVIDIA GeForce GTX 680: The fastest, most efficient GPU ever built</cite>.
                        </p>
                        <p class="p"><a class="xref" href="http://www.geforce.com/Active/en_US/en_US/pdf/GeForce-GTX-680-Whitepaper-FINAL.pdf" target="_blank" shape="rect">http://www.geforce.com/Active/en_US/en_US/pdf/GeForce-GTX-680-Whitepaper-FINAL.pdf</a></p>
                     </div>
                     <div class="section" id="references__2"><a name="references__2" shape="rect">
                           <!-- --></a><p class="p">[2] <cite class="cite">NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110</cite>.
                        </p>
                        <p class="p"><a class="xref" href="http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf" target="_blank" shape="rect">http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf</a></p>
                     </div>
                     <div class="section" id="references__3"><a name="references__3" shape="rect">
                           <!-- --></a><p class="p">[3] <cite class="cite">CUDA C Programming Guide</cite>.
                        </p>
                        <p class="p"><a class="xref" href="http://docs.nvidia.com/cuda/cuda-c-programming-guide/" target="_blank" shape="rect">http://docs.nvidia.com/cuda/cuda-c-programming-guide/</a></p>
                     </div>
                     <div class="section" id="references__4"><a name="references__4" shape="rect">
                           <!-- --></a><p class="p">[4] <cite class="cite">CUDA C Best Practices Guide</cite>.
                        </p>
                        <p class="p"><a class="xref" href="http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/" target="_blank" shape="rect">http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/</a></p>
                     </div>
                     <div class="section" id="references__5"><a name="references__5" shape="rect">
                           <!-- --></a><p class="p">[5] <cite class="cite">CUDA Occupancy Calculator</cite> spreadsheet.
                        </p>
                        <p class="p"><a class="xref" href="http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls" target="_blank" shape="rect">http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls</a></p>
                     </div>
                     <div class="section" id="references__6"><a name="references__6" shape="rect">
                           <!-- --></a><p class="p">[6] <cite class="cite">Sharing A GPU Between MPI Processes: Multi-Process Service (MPS) Overview</cite>.
                        </p>
                        <p class="p"><a class="xref" href="http://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf" target="_blank" shape="rect">http://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf</a></p>
                     </div>
                  </div>
               </div>
               <div class="topic reference nested0" id="revision-history"><a name="revision-history" shape="rect">
                     <!-- --></a><h2 class="title topictitle1"><a href="#revision-history" name="revision-history" shape="rect">B.&nbsp;Revision History</a></h2>
                  <div class="body refbody">
                     <div class="section">
                        <h2 class="title sectiontitle">Version 0.9</h2>
                        <ul class="ul">
                           <li class="li">CUDA 5.0 Preview Release</li>
                        </ul>
                     </div>
                     <div class="section">
                        <h2 class="title sectiontitle">Version 1.0</h2>
                        <ul class="ul">
                           <li class="li">Added discussion of ILP vs TLP (see <a class="xref" href="index.html#device-utilization-and-occupancy" shape="rect">Device Utilization and Occupancy</a>).
                           </li>
                           <li class="li">Expanded discussion of cache behaviors (see <a class="xref" href="index.html#memory-throughput" shape="rect">Memory Throughput</a>).
                           </li>
                           <li class="li">Added section regarding <a class="xref" href="index.html#warp-synchronous" shape="rect">Warp-synchronous Programming</a>.
                           </li>
                           <li class="li">Added section regarding <a class="xref" href="index.html#memcopy2d" shape="rect">2D Memory Copies</a>.
                           </li>
                           <li class="li">Minor corrections and clarifications.</li>
                        </ul>
                     </div>
                     <div class="section">
                        <h2 class="title sectiontitle">Version 1.1</h2>
                        <ul class="ul">
                           <li class="li">Clarified <samp class="ph codeph">const __restrict__</samp> discussion and mentioned <samp class="ph codeph">__ldg()</samp>
                              intrinsic in <a class="xref" href="index.html#read-only-data-cache" shape="rect">Read-Only Data Cache</a>.
                           </li>
                        </ul>
                     </div>
                     <div class="section">
                        <h2 class="title sectiontitle">Version 1.2</h2>
                        <ul class="ul">
                           <li class="li">Add references to GK110B, which allows an opt-in to the caching of global loads in the
                              <a class="xref" href="index.html#l1-cache" shape="rect">L1 Cache</a> and enables higher clock speeds via <a class="xref" href="index.html#gpuboost" shape="rect">GPU Boost</a>.
                           </li>
                           <li class="li">Expand discussion of ILP in <a class="xref" href="index.html#device-utilization-and-occupancy" shape="rect">Device Utilization and Occupancy</a>.
                           </li>
                           <li class="li">Expand discussion of <a class="xref" href="index.html#hyper-q" shape="rect">Hyper-Q</a>, adding mention of
                              <samp class="ph codeph">CUDA_DEVICE_MAX_CONNECTIONS</samp> and CUDA Multi-Process Service (MPS).
                           </li>
                           <li class="li">Clarification of PCIe 3.0 support.</li>
                           <li class="li">Add hyperlinks to all endnote references.</li>
                        </ul>
                     </div>
                  </div>
               </div>
               <div class="topic concept nested0" id="notices-header"><a name="notices-header" shape="rect">
                     <!-- --></a><h2 class="title topictitle1"><a href="#notices-header" name="notices-header" shape="rect">Notices</a></h2>
                  <div class="topic reference nested1" id="notice"><a name="notice" shape="rect">
                        <!-- --></a><h3 class="title topictitle2"><a href="#notice" name="notice" shape="rect"></a></h3>
                     <div class="body refbody">
                        <div class="section">
                           <h3 class="title sectiontitle">Notice</h3>
                           <p class="p">ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND
                              SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE
                              WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS
                              FOR A PARTICULAR PURPOSE. 
                           </p>
                           <p class="p">Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the
                              consequences of use of such information or for any infringement of patents or other rights of third parties that may result
                              from its use. No license is granted by implication of otherwise under any patent rights of NVIDIA Corporation. Specifications
                              mentioned in this publication are subject to change without notice. This publication supersedes and replaces all other information
                              previously supplied. NVIDIA Corporation products are not authorized as critical components in life support devices or systems
                              without express written approval of NVIDIA Corporation.
                           </p>
                        </div>
                     </div>
                  </div>
                  <div class="topic reference nested1" id="trademarks"><a name="trademarks" shape="rect">
                        <!-- --></a><h3 class="title topictitle2"><a href="#trademarks" name="trademarks" shape="rect"></a></h3>
                     <div class="body refbody">
                        <div class="section">
                           <h3 class="title sectiontitle">Trademarks</h3>
                           <p class="p">NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation
                              in the U.S. and other countries.  Other company and product names may be trademarks of
                              the respective companies with which they are associated.
                           </p>
                        </div>
                     </div>
                  </div>
                  <div class="topic reference nested1" id="copyright"><a name="copyright" shape="rect">
                        <!-- --></a><h3 class="title topictitle2"><a href="#copyright" name="copyright" shape="rect"></a></h3>
                     <div class="body refbody">
                        <div class="section">
                           <h3 class="title sectiontitle">Copyright</h3>
                           <p class="p">© 2012-<span class="ph">2014</span> NVIDIA Corporation. All rights reserved.
                           </p>
                           <p class="p">This product includes software developed by the Syncro Soft SRL (http://www.sync.ro/).</p>
                        </div>
                     </div>
                  </div>
               </div>
               <div class="fn"><a name="fntarg_1" href="#fnsrc_1" shape="rect"><sup>1</sup></a>  Throughout this guide, <dfn class="term">Fermi</dfn>
                  refers to devices of compute capability 2.x and <dfn class="term">Kepler</dfn>
                  refers to devices of compute capability 3.x. GK104 has compute
                  capability 3.0; GK110 has compute capability 3.5.
               </div>
               <div class="fn"><a name="fntarg_2" href="#fnsrc_2" shape="rect"><sup>2</sup></a>  The
                  features of GK107 are similar to those of GK104.
               </div>
               <div class="fn"><a name="fntarg_3" href="#fnsrc_3" shape="rect"><sup>3</sup></a>  The
                  features of GK110B are similar to those of GK110 except where
                  noted.
               </div>
               <div class="fn"><a name="fntarg_4" href="#fnsrc_4" shape="rect"><sup>4</sup></a>  Note, however, that
                  Kepler clocks are generally lower than Fermi clocks for improved
                  power efficiency.
               </div>
               
               <hr id="contents-end"></hr>
               
            </article>
         </div>
      </div>
      <script language="JavaScript" type="text/javascript" charset="utf-8" src="../common/formatting/common.min.js"></script>
      <script language="JavaScript" type="text/javascript" charset="utf-8" src="../common/scripts/google-analytics/google-analytics-write.js"></script>
      <script language="JavaScript" type="text/javascript" charset="utf-8" src="../common/scripts/google-analytics/google-analytics-tracker.js"></script>
      <script type="text/javascript">var switchTo5x=true;</script><script type="text/javascript" src="http://w.sharethis.com/button/buttons.js"></script><script type="text/javascript">stLight.options({publisher: "998dc202-a267-4d8e-bce9-14debadb8d92", doNotHash: false, doNotCopy: false, hashAddressBar: false});</script></body>
</html>