<!DOCTYPE html
  PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-us" xml:lang="en-us">
   <head>
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta>
      <meta http-equiv="X-UA-Compatible" content="IE=edge"></meta>
      <meta name="copyright" content="(C) Copyright 2005"></meta>
      <meta name="DC.rights.owner" content="(C) Copyright 2005"></meta>
      <meta name="DC.Type" content="concept"></meta>
      <meta name="DC.Title" content="Profiler User's Guide"></meta>
      <meta name="abstract" content="The user manual for NVIDIA profiling tools for optimizing performance of CUDA applications."></meta>
      <meta name="description" content="The user manual for NVIDIA profiling tools for optimizing performance of CUDA applications."></meta>
      <meta name="DC.Coverage" content="Tools"></meta>
      <meta name="DC.subject" content="CUDA nvprof, CUDA profiling, CUDA nvprof profiling, CUDA nvprof kernel, CUDA nvprof session, CUDA nvprof views, CUDA nvprof modes, CUDA nvprof output, CUDA nvprof controls, CUDA nvprof limitations"></meta>
      <meta name="keywords" content="CUDA nvprof, CUDA profiling, CUDA nvprof profiling, CUDA nvprof kernel, CUDA nvprof session, CUDA nvprof views, CUDA nvprof modes, CUDA nvprof output, CUDA nvprof controls, CUDA nvprof limitations"></meta>
      <meta name="DC.Format" content="XHTML"></meta>
      <meta name="DC.Identifier" content="abstract"></meta>
      <link rel="stylesheet" type="text/css" href="../common/formatting/commonltr.css"></link>
      <link rel="stylesheet" type="text/css" href="../common/formatting/site.css"></link>
      <title>Profiler :: CUDA Toolkit Documentation</title>
      <!--[if lt IE 9]>
      <script src="../common/formatting/html5shiv-printshiv.min.js"></script>
      <![endif]-->
      <script type="text/javascript" charset="utf-8" src="../common/formatting/jquery.min.js"></script>
      <script type="text/javascript" charset="utf-8" src="../common/formatting/jquery.ba-hashchange.min.js"></script>
      <link rel="canonical" href="http://docs.nvidia.com/cuda/profiler-users-guide/index.html"></link>
      <link rel="stylesheet" type="text/css" href="../common/formatting/qwcode.highlight.css"></link>
   </head>
   <body>
      
      <article id="contents">
         <div id="release-info">Profiler
            (<a href="../../pdf/CUDA_Profiler_Users_Guide.pdf">PDF</a>)
            -
            CUDA Toolkit v5.5
            (<a href="https://developer.nvidia.com/cuda-toolkit-archive">older</a>)
            -
            Last updated 
            July 19, 2013
            -
            <a href="mailto:cudatools@nvidia.com?subject=CUDA Tools Documentation Feedback: profiler-users-guide">Send Feedback</a></div>
         <div class="topic nested0" id="abstract"><a name="abstract" shape="rect">
               <!-- --></a><h2 class="title topictitle1"><a href="#abstract" name="abstract" shape="rect">Profiler User's Guide</a></h2>
            <div class="body conbody"></div>
         </div>
         <div class="topic concept nested0" id="profiling-overview"><a name="profiling-overview" shape="rect">
               <!-- --></a><h2 class="title topictitle1"><a href="#profiling-overview" name="profiling-overview" shape="rect">Profiling Overview</a></h2>
            <div class="body conbody">
               <p class="p">This document describes NVIDIA profiling tools and APIs that
                  enable you to understand and optimize the performance of your CUDA
                  application. The <a class="xref" href="index.html#visual-profiler" shape="rect">Visual
                     Profiler</a> is a graphical profiling tool that displays a
                  timeline of your application's CPU and GPU activity, and that
                  includes an automated analysis engine to identify optimization
                  opportunities. The Visual Profiler is available as both a
                  standalone application and as part of Nsight Eclipse Edition. The
                  <a class="xref" href="index.html#nvprof-overview" shape="rect"><samp class="ph codeph">nvprof</samp></a> profiling
                  tool enables you to collect and view profiling data from the
                  command line. The existing <a class="xref" href="index.html#compute-command-line-profiler-overview" shape="rect">command-line
                     profiler</a> continues to be supported.
                  
               </p>
            </div>
            <div class="topic concept nested1" id="whats-new"><a name="whats-new" shape="rect">
                  <!-- --></a><h3 class="title topictitle2"><a href="#whats-new" name="whats-new" shape="rect">What's New</a></h3>
               <div class="body conbody">
                  <div class="p">
                     The profiling tools contain a number of changes and new features
                     as part of the CUDA Toolkit 5.5 release.
                     
                     <ul class="ul">
                        <li class="li">The Visual Profiler now supports applications that use
                           CUDA Dynamic Parallelism. The application timeline includes
                           both host-launched and device-launched kernels, and shows the
                           parent-child relationship between kernels.
                        </li>
                        <li class="li">The application analysis performed by the NVIDIA Visual
                           Profiler has been enhanced. A guided analysis mode has been
                           added that provides step-by-step analysis and optimization
                           guidance. Also, the analysis results now include graphical
                           visualizations to more clearly indicate the optimization
                           opportunities.
                        </li>
                        <li class="li">The NVIDIA Visual Profiler and the command-line profiling tool,
                           <samp class="ph codeph">nvprof</samp>, now support power, thermal, and clock
                           profiling.
                        </li>
                        <li class="li"><samp class="ph codeph">nvprof</samp> now collects metrics, and can
                           collect any number of events and metrics during a single run
                           of a CUDA application. <samp class="ph codeph">nvprof</samp> uses kernel
                           replay to execute each kernel as many times as necessary to
                           collect all the requested profile data.
                        </li>
                        <li class="li">The NVIDIA Visual Profiler and <samp class="ph codeph">nvprof</samp>
                           now support metrics that report the floating-point operations
                           performed by a kernel. These metrics include both
                           single-precision and double-precision counts for adds,
                           multiplies, multiply-accumulates, and special floating-point
                           operations.
                        </li>
                        <li class="li"><samp class="ph codeph">nvprof</samp> now supports two multi-process
                           modes. In "profile child processes" mode, a parent process and
                           all child processes are profiled. In "profile all processes"
                           mode, all CUDA processes on a system are profiled.
                        </li>
                        <li class="li">The Visual Profiler now correctly shows all CUDA
                           peer-to-peer memory copies on the timeline.
                        </li>
                     </ul>
                  </div>
               </div>
            </div>
            <div class="topic concept nested1" id="terminology"><a name="terminology" shape="rect">
                  <!-- --></a><h3 class="title topictitle2"><a href="#terminology" name="terminology" shape="rect">Terminology</a></h3>
               <div class="body conbody">
                  <p class="p">
                     An <strong class="ph b">event</strong> is a countable activity, action, or occurrence on a
                     device. It corresponds to a single hardware counter value which
                     is collected during kernel execution. To see a list of all
                     available events on a particular NVIDIA GPU, type <samp class="ph codeph">nvprof
                        --query-events</samp>.
                     
                  </p>
                  <p class="p">
                     A <strong class="ph b">metric</strong> is a characteristic of an application that is calculated
                     from one or more event values. To see a list of all available metrics
                     on a particular NVIDIA GPU, type <samp class="ph codeph">nvprof --query-metrics</samp>.
                     You can also refer to the <a class="xref" href="index.html#metrics-reference" shape="rect">metrics reference
                        </a>.
                     
                  </p>
               </div>
            </div>
         </div>
         <div class="topic concept nested0" id="prepare-application"><a name="prepare-application" shape="rect">
               <!-- --></a><h2 class="title topictitle1"><a href="#prepare-application" name="prepare-application" shape="rect">1.&nbsp;Preparing An Application For Profiling</a></h2>
            <div class="body conbody">
               <p class="p">The CUDA profiling tools do not require any application changes
                  to enable profiling; however, by making some simple modifications
                  and additions, you can greatly increase the usability and
                  effectiveness of the profilers. This section describes these
                  modifications and how they can improve your profiling results.
               </p>
            </div>
            <div class="topic concept nested1" id="focusing-profiling"><a name="focusing-profiling" shape="rect">
                  <!-- --></a><h3 class="title topictitle2"><a href="#focusing-profiling" name="focusing-profiling" shape="rect">1.1.&nbsp;Focused Profiling</a></h3>
               <div class="body conbody">
                  <p class="p">By default, the profiling tools collect profile data over
                     the entire run of your application. However, as explained below,
                     you typically want to profile only the region(s) of your
                     application containing some or all of the performance-critical
                     code. Limiting profiling to performance-critical regions
                     reduces the amount of profile data that both you and the
                     tools must process, and focuses attention on the code where
                     optimization will result in the greatest performance
                     gains.
                  </p>
                  <div class="p">There are several common situations where profiling a region
                     of the application is helpful.
                     
                     <ol class="ol">
                        <li class="li">The application is a test harness that contains a CUDA
                           implementation of all or part of your algorithm. The test
                           harness initializes the data, invokes the CUDA functions to
                           perform the algorithm, and then checks the results for
                           correctness. Using a test harness is a common and productive
                           way to quickly iterate and test algorithm changes. When
                           profiling, you want to collect profile data for the CUDA
                           functions implementing the algorithm, but not for the test
                           harness code that initializes the data or checks the
                           results.
                        </li>
                        <li class="li">The application operates in phases, where a different
                           set of algorithms is active in each phase. When the
                           performance of each phase of the application can be
                           optimized independently of the others, you want to profile
                           each phase separately to focus your optimization
                           efforts.
                        </li>
                        <li class="li">The application contains algorithms that operate over a
                           large number of iterations, but the performance of the
                           algorithm does not vary significantly across those
                           iterations. In this case you can collect profile data from a
                           subset of the iterations.
                        </li>
                     </ol>
                     
                     To limit profiling to a region of your application, CUDA
                     provides functions to start and stop profile data
                     collection. <samp class="ph codeph">cudaProfilerStart()</samp> is used to
                     start profiling and <samp class="ph codeph">cudaProfilerStop()</samp> is used
                     to stop profiling (using the CUDA driver API, you get the same
                     functionality with
                     <samp class="ph codeph">cuProfilerStart()</samp>
                     and <samp class="ph codeph">cuProfilerStop()</samp>). To use these functions
                     you must include <samp class="ph codeph">cuda_profiler_api.h</samp>
                     (or <samp class="ph codeph">cudaProfiler.h</samp> for the driver API).
                  </div>
                  <p class="p">When using the start and stop functions, you also need to
                     instruct the profiling tool to disable profiling at the start of
                     the application. For <samp class="ph codeph">nvprof</samp> you do this with
                     the <samp class="ph codeph">--profile-from-start off</samp> flag. For the
                     Visual Profiler you deselect the <span class="ph uicontrol">Start execution with
                        profiling enabled</span> checkbox in the <a class="xref" href="index.html#settings-view" shape="rect">Settings View</a>.
                  </p>
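                  <p class="p">As a minimal sketch of the start and stop functions described
                     above, a test harness might bracket only its performance-critical
                     region as follows. The <samp class="ph codeph">scale</samp> kernel, the
                     buffer size, and the file name <samp class="ph codeph">focus.cu</samp>
                     are hypothetical, and error checking is omitted for brevity.
                  </p>
                  <pre class="pre screen" xml:space="preserve">
#include &lt;cuda_runtime.h&gt;
#include &lt;cuda_profiler_api.h&gt;

// Hypothetical kernel standing in for the performance-critical algorithm.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i &lt; n) data[i] *= factor;
}

int main()
{
    const int n = 1 &lt;&lt; 20;
    float *d_data = NULL;

    // Harness setup: runs before profiling starts, so it is not collected
    // when the tool is told not to profile from the start.
    cudaMalloc((void **)&amp;d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaProfilerStart();                 // begin collecting profile data
    scale&lt;&lt;&lt;(n + 255) / 256, 256&gt;&gt;&gt;(d_data, n, 2.0f);
    cudaDeviceSynchronize();             // let the kernel finish inside the region
    cudaProfilerStop();                  // stop collecting profile data

    // Result checking would go here, outside the profiled region.
    cudaFree(d_data);
    cudaDeviceReset();
    return 0;
}
</pre>
                  <p class="p">Assuming the file is built with
                     <samp class="ph codeph">nvcc focus.cu -o focus</samp>, running
                     <samp class="ph codeph">nvprof --profile-from-start off ./focus</samp>
                     should then report only the activity between the two calls.
                  </p>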
               </div>
            </div>
            <div class="topic concept nested1" id="marking-regions-of-cpu-activity"><a name="marking-regions-of-cpu-activity" shape="rect">
                  <!-- --></a><h3 class="title topictitle2"><a href="#marking-regions-of-cpu-activity" name="marking-regions-of-cpu-activity" shape="rect">1.2.&nbsp;Marking Regions of CPU Activity</a></h3>
               <div class="body conbody">
                  <p class="p">The Visual Profiler can collect a trace of the CUDA function
                     calls made by your application. The Visual Profiler shows these
                     calls in the <a class="xref" href="index.html#timeline-view" shape="rect">Timeline View</a>,
                     allowing you to see where each CPU thread in the application is
                     invoking CUDA functions. To understand what the application's CPU
                     threads are doing outside of CUDA function calls, you can use
                     the <a class="xref" href="index.html#nvtx" shape="rect">NVIDIA Tools Extension API</a>
                     (NVTX). When you add NVTX markers and ranges to your application, the 
                     <a class="xref" href="index.html#timeline-view" shape="rect">Timeline View</a> shows when
                     your CPU threads are executing within those regions.
                     
                  </p>
                  <p class="p"><samp class="ph codeph">nvprof</samp> also supports NVTX markers and ranges.
                     Markers and ranges are shown in the API trace output in the timeline.
                     In summary mode, each range is shown with CUDA activities associated
                     with that range.
                  </p>
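                  <p class="p">The following sketch shows one way a CPU region might be
                     annotated with an NVTX range and a marker. It assumes the NVTX
                     header <samp class="ph codeph">nvToolsExt.h</samp> and linking against
                     the <samp class="ph codeph">nvToolsExt</samp> library; the
                     <samp class="ph codeph">prepare_input</samp> function and the range and
                     marker names are hypothetical.
                  </p>
                  <pre class="pre screen" xml:space="preserve">
#include &lt;cstdlib&gt;
#include &lt;cuda_runtime.h&gt;
#include &lt;nvToolsExt.h&gt;

// Hypothetical CPU-side work that happens outside any CUDA function call.
static void prepare_input(float *h, int n)
{
    for (int i = 0; i &lt; n; ++i) h[i] = (float)i;
}

int main()
{
    const int n = 1 &lt;&lt; 20;
    float *h = (float *)malloc(n * sizeof(float));
    float *d = NULL;
    cudaMalloc((void **)&amp;d, n * sizeof(float));

    nvtxRangePushA("prepare_input");     // open a range on this CPU thread
    prepare_input(h, n);
    nvtxRangePop();                      // close the most recently pushed range

    nvtxMarkA("upload starts");          // instantaneous marker
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    cudaFree(d);
    free(h);
    cudaDeviceReset();
    return 0;
}
</pre>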
               </div>
            </div>
            <div class="topic concept nested1" id="naming-cpu-objects"><a name="naming-cpu-objects" shape="rect">
                  <!-- --></a><h3 class="title topictitle2"><a href="#naming-cpu-objects" name="naming-cpu-objects" shape="rect">1.3.&nbsp;Naming CPU and CUDA Resources</a></h3>
               <div class="body conbody">
                  <p class="p">The Visual Profiler <a class="xref" href="index.html#timeline-view" shape="rect">Timeline
                        View</a> shows default names for CPU threads and GPU
                     devices, contexts, and streams. Using custom names for these
                     resources can improve understanding of the application
                     behavior, especially for CUDA applications that have many host
                     threads, devices, contexts, or streams.  You can use
                     the <a class="xref" href="index.html#nvtx" shape="rect">NVIDIA Tools Extension API</a> to
                     assign custom names for your CPU and GPU resources. Your
                     custom names will then be displayed in the
                     <a class="xref" href="index.html#timeline-view" shape="rect">Timeline View</a>.
                  </p>
                  <p class="p"><samp class="ph codeph">nvprof</samp> also supports NVTX naming. Names of
                     CUDA devices, contexts and streams are displayed in summary
                     and trace mode. Thread names are displayed in summary mode.
                  </p>
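                  <p class="p">As an illustration, the sketch below names a device, a stream,
                     and the calling CPU thread through NVTX. It assumes the CUDA runtime
                     naming functions declared in
                     <samp class="ph codeph">nvToolsExtCudaRt.h</samp> and a Linux host,
                     where the OS thread ID is obtained with
                     <samp class="ph codeph">syscall(SYS_gettid)</samp>; the names
                     themselves are hypothetical.
                  </p>
                  <pre class="pre screen" xml:space="preserve">
#include &lt;stdint.h&gt;
#include &lt;unistd.h&gt;
#include &lt;sys/syscall.h&gt;
#include &lt;cuda_runtime.h&gt;
#include &lt;nvToolsExt.h&gt;
#include &lt;nvToolsExtCudaRt.h&gt;

int main()
{
    const size_t bytes = 1 &lt;&lt; 20;
    cudaStream_t stream;
    void *d_buf = NULL;

    cudaSetDevice(0);
    cudaStreamCreate(&amp;stream);
    cudaMalloc(&amp;d_buf, bytes);

    // Linux-only assumption: NVTX expects the OS thread ID of the thread to name.
    nvtxNameOsThreadA((uint32_t)syscall(SYS_gettid), "Main thread");
    nvtxNameCudaDeviceA(0, "Primary GPU");
    nvtxNameCudaStreamA(stream, "Upload stream");

    // Some activity on the named stream so that it appears in the profile.
    cudaMemsetAsync(d_buf, 0, bytes, stream);
    cudaStreamSynchronize(stream);

    cudaFree(d_buf);
    cudaStreamDestroy(stream);
    cudaDeviceReset();
    return 0;
}
</pre>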
               </div>
            </div>
            <div class="topic concept nested1" id="flush-profile-data"><a name="flush-profile-data" shape="rect">
                  <!-- --></a><h3 class="title topictitle2"><a href="#flush-profile-data" name="flush-profile-data" shape="rect">1.4.&nbsp;Flush Profile Data</a></h3>
               <div class="body conbody">
                  <p class="p">To reduce profiling overhead, the profiling tools collect and
                     record profile information into internal buffers. These buffers
                     are then flushed asynchronously to disk with low priority to
                     avoid perturbing application behavior. To avoid losing profile
                     information that has not yet been flushed, the application being
                     profiled should call <samp class="ph codeph">cudaDeviceReset()</samp>,
                     <samp class="ph codeph">cudaProfilerStop()</samp> or <samp class="ph codeph">cuProfilerStop()</samp>
                     before exiting. Doing so forces buffered profile information on
                     the corresponding context(s) to be flushed.
                  </p>
                  <p class="p">If your CUDA application includes graphics that operate using
                     a <dfn class="term">display</dfn> or <dfn class="term">main</dfn> loop, care must be
                     taken to call <samp class="ph codeph">cudaDeviceReset()</samp>,
                     <samp class="ph codeph">cudaProfilerStop()</samp> or
                     <samp class="ph codeph">cuProfilerStop()</samp> before the thread executing
                     that loop calls <samp class="ph codeph">exit()</samp>. Failure to call one of
                     these APIs may result in the loss of some or all of the
                     collected profile data.
                  </p>
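                  <p class="p">A minimal sketch of this pattern, assuming a hypothetical
                     <samp class="ph codeph">render_frame</samp> function that returns
                     false when the application should quit, might look like this:
                  </p>
                  <pre class="pre screen" xml:space="preserve">
#include &lt;cstdlib&gt;
#include &lt;cuda_runtime.h&gt;

// Hypothetical per-frame work: returns false once the application should quit.
static bool render_frame(unsigned char *d_pixels, size_t bytes, int frame)
{
    cudaMemset(d_pixels, frame % 256, bytes);   // stand-in for real CUDA work
    return frame &lt; 99;
}

int main()
{
    const size_t bytes = 640 * 480 * 4;
    unsigned char *d_pixels = NULL;
    cudaMalloc((void **)&amp;d_pixels, bytes);

    for (int frame = 0; ; ++frame) {            // the display or main loop
        if (!render_frame(d_pixels, bytes, frame)) {
            // Flush buffered profile data before this thread exits.
            // cudaDeviceReset() also releases d_pixels along with the context.
            cudaDeviceReset();
            exit(EXIT_SUCCESS);
        }
    }
}
</pre>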
               </div>
            </div>
            <div class="topic concept nested1" id="dynamic-parallelism"><a name="dynamic-parallelism" shape="rect">
                  <!-- --></a><h3 class="title topictitle2"><a href="#dynamic-parallelism" name="dynamic-parallelism" shape="rect">1.5.&nbsp;Dynamic Parallelism</a></h3>
               <div class="body conbody">
                  <div class="p">When profiling an application that uses Dynamic Parallelism,
                     there are several limitations to the profiling tools.
                     
                     <ul class="ul">
                        <li class="li">
                           The Visual Profiler timeline does not display CUDA API calls
                           invoked from within device-launched kernels.
                           
                        </li>
                        <li class="li">
                           The Visual Profiler does not display detailed event, metric,
                           and source-level results for device-launched kernels. Event,
                           metric, and source-level results collected for CPU-launched
                           kernels will include event, metric, and source-level results
                           for the entire call-tree of kernels launched from within that
                           kernel.
                           
                        </li>
                        <li class="li">
                           The <samp class="ph codeph">nvprof</samp> event/metric output and the
                           command-line profiler event output do not include results for device-launched
                           kernels. Events/metrics collected for CPU-launched kernels will
                           include events/metrics for the entire call-tree of kernels launched
                           from within that kernel.
                           
                        </li>
                     </ul>
                  </div>
               </div>
            </div>
         </div>
         <div class="topic concept nested0" id="visual-profiler"><a name="visual-profiler" shape="rect">
               <!-- --></a><h2 class="title topictitle1"><a href="#visual-profiler" name="visual-profiler" shape="rect">2.&nbsp;Visual Profiler</a></h2>
            <div class="body conbody">
               <p class="p">The NVIDIA Visual Profiler allows you to visualize and optimize
                  the performance of your CUDA application. The Visual Profiler
                  displays a timeline of your application's activity on both the CPU
                  and GPU so that you can identify opportunities for performance
                  improvement. In addition, the Visual Profiler will analyze your
                  application to detect potential performance bottlenecks and direct
                  you on how to take action to eliminate or reduce those
                  bottlenecks.
               </p>
               <p class="p">The Visual Profiler is available as both a standalone
                  application and as part of Nsight Eclipse Edition. The standalone
                  version of the Visual Profiler, <samp class="ph codeph">nvvp</samp>, is included
                  in the CUDA Toolkit for all supported OSes. Within Nsight Eclipse
                  Edition, the Visual Profiler is located in the Profile Perspective
                  and is activated when an application is run in profile
                  mode. Nsight Eclipse Edition, <samp class="ph codeph">nsight</samp>, is included
                  in the CUDA Toolkit for Linux and Mac OS X.
               </p>
            </div>
            <div class="topic concept nested1" id="getting-started"><a name="getting-started" shape="rect">
                  <!-- --></a><h3 class="title topictitle2"><a href="#getting-started" name="getting-started" shape="rect">2.1.&nbsp;Getting Started</a></h3>
               <div class="body conbody">
                  <p class="p">This section describes the steps you need to take to get started with the Visual Profiler.
                     
                  </p>
               </div>
               <div class="topic concept nested2" id="modify-your-application"><a name="modify-your-application" shape="rect">
                     <!-- --></a><h3 class="title topictitle2"><a href="#modify-your-application" name="modify-your-application" shape="rect">2.1.1.&nbsp;Modify Your Application For Profiling</a></h3>
                  <div class="body conbody">
                     <p class="p">The Visual Profiler does not require any application changes;
                        however, by making some simple modifications and additions, you
                        can greatly increase its usability and effectiveness. Section
                        <a class="xref" href="index.html#prepare-application" shape="rect">Preparing An Application For Profiling</a> describes how you can focus
                        your profiling efforts and add extra annotations to your
                        application that will greatly improve your profiling
                        experience.
                     </p>
                  </div>
               </div>
               <div class="topic concept nested2" id="creating-session"><a name="creating-session" shape="rect">
                     <!-- --></a><h3 class="title topictitle2"><a href="#creating-session" name="creating-session" shape="rect">2.1.2.&nbsp;Creating a Session</a></h3>
                  <div class="body conbody">
                     <p class="p">The first step in using the Visual Profiler to profile your
                        application is to create a new profiling <dfn class="term">session</dfn>. A
                        session contains the settings, data, and results associated with
                        your application. <a class="xref" href="index.html#sessions" shape="rect">Sessions</a> gives more
                        information on working with sessions.
                     </p>
                     <p class="p">You can create a new session by selecting the <span class="ph uicontrol">Profile
                           An Application</span> link on the Welcome page, or by selecting
                        <span class="ph uicontrol">New Session</span> from the
                        <span class="ph uicontrol">File</span> menu. In the <span class="ph uicontrol">Create New
                           Session</span> dialog enter the executable for your
                        application. Optionally, you can also specify the working directory,
                        arguments, and environment. 
                     </p>
                     <div class="p">Press <span class="ph uicontrol">Next</span> to choose some additional profiling options. The options are:
                        
                        <ul class="ul">
                           <li class="li"><span class="ph uicontrol">Start execution with profiling
                                 enabled</span> - If selected, profile data is collected from
                              the start of application execution. If not selected, profile data
                              is not collected until <samp class="ph codeph">cudaProfilerStart()</samp> is
                              called in the application. See <a class="xref" href="index.html#focusing-profiling" shape="rect">Focused Profiling</a> for more information about
                              <samp class="ph codeph">cudaProfilerStart()</samp>.
                           </li>
                           <li class="li"><span class="ph uicontrol">Enable concurrent kernel profiling</span> -
                              This option should be selected for an application that uses CUDA
                              streams to launch kernels that can execute concurrently. If the
                              application uses only a single stream (and therefore cannot have
                              concurrent kernel execution), deselecting this option may
                              decrease profiling overhead.
                           </li>
                           <li class="li"><span class="ph uicontrol">Enable power, clock, and thermal
                                 profiling</span> - If selected, power, clock, and thermal
                              conditions on the GPUs will be sampled and displayed on the
                              timeline. Collection of this data is not supported on all
                              GPUs. See the description of the Device timeline in <a class="xref" href="index.html#timeline-view" shape="rect">Timeline View</a> for more
                              information.
                           </li>
                           <li class="li"><span class="ph uicontrol">Don't run guided analysis</span> - By
                              default guided analysis is run immediately after the creation of
                              a new session. Select this option to disable this behavior.
                           </li>
                        </ul>
                        
                        Press <span class="ph uicontrol">Finish</span>.
                     </div>
                  </div>
               </div>
               <div class="topic concept nested2" id="analyzing-your-application"><a name="analyzing-your-application" shape="rect">
                     <!-- --></a><h3 class="title topictitle2"><a href="#analyzing-your-application" name="analyzing-your-application" shape="rect">2.1.3.&nbsp;Analyzing Your Application</a></h3>
                  <div class="body conbody">
                     <p class="p">If the <span class="ph uicontrol">Don't run guided analysis</span> option
                        was not selected when you created your session, the Visual
                        Profiler will immediately run your application to collect the data
                        needed for the first stage of guided analysis. As described in
                        <a class="xref" href="index.html#analysis-view" shape="rect">Analysis View</a>, you can use the guided analysis
                        system to get recommendations on performance-limiting behavior in
                        your application.
                     </p>
                  </div>
               </div>
               <div class="topic concept nested2" id="exploring-timeline"><a name="exploring-timeline" shape="rect">
                     <!-- --></a><h3 class="title topictitle2"><a href="#exploring-timeline" name="exploring-timeline" shape="rect">2.1.4.&nbsp;Exploring the Timeline</a></h3>
                  <div class="body conbody">
                     <p class="p">In addition to the guided analysis results, you will see a
                        timeline for your application showing the CPU and GPU activity
                        that occurred as your application executed. Read <a class="xref" href="index.html#timeline-view" shape="rect">Timeline View</a> and <a class="xref" href="index.html#properties-view" shape="rect">Properties View</a> to
                        learn how to explore the profiling information that is available
                        in the timeline. <a class="xref" href="index.html#navigating-timeline" shape="rect">Navigating the Timeline</a> describes
                        how you can zoom and scroll the timeline to focus on specific
                        areas of your application.
                     </p>
                  </div>
               </div>
               <div class="topic concept nested2" id="looking-at-details"><a name="looking-at-details" shape="rect">
                     <!-- --></a><h3 class="title topictitle2"><a href="#looking-at-details" name="looking-at-details" shape="rect">2.1.5.&nbsp;Looking at the Details</a></h3>
                  <div class="body conbody">
                     <p class="p">In addition to the results provided in the <a class="xref" href="index.html#analysis-view" shape="rect">Analysis View</a>, you can also look at the specific metric
                        and event values collected as part of the analysis. Metric and
                        event values are displayed in the <a class="xref" href="index.html#details-view" shape="rect">Details View</a>. You can collect specific metric and event
                        values that reveal how the kernels in your application are
                        behaving. You collect metrics and events as described in the <a class="xref" href="index.html#details-view" shape="rect">Details View</a> section.
                     </p>
                  </div>
               </div>
            </div>
            <div class="topic concept nested1" id="sessions"><a name="sessions" shape="rect">
                  <!-- --></a><h3 class="title topictitle2"><a href="#sessions" name="sessions" shape="rect">2.2.&nbsp;Sessions</a></h3>
               <div class="body conbody">
                  <p class="p">A session contains the settings, data, and profiling results
                     associated with your application. Each session is saved in a
                     separate file, so you can delete, move, copy, or share a session
                     by simply deleting, moving, copying, or sharing the session
                     file. By convention, the file extension <samp class="ph codeph">.nvvp</samp> is
                     used for Visual Profiler session files.
                  </p>
                  <p class="p">There are two types of sessions: an executable session that is
                     associated with an application that is executed and profiled from
                     within the Visual Profiler, and an import session that is created
                     by importing data generated by <a class="xref" href="index.html#nvprof-overview" shape="rect">nvprof</a> or
                     the <a class="xref" href="index.html#compute-command-line-profiler-overview" shape="rect">command-line profiler</a>.
                  </p>
               </div>
               <div class="topic concept nested2" id="executable-session"><a name="executable-session" shape="rect">
                     <!-- --></a><h3 class="title topictitle2"><a href="#executable-session" name="executable-session" shape="rect">2.2.1.&nbsp;Executable Session</a></h3>
                  <div class="body conbody">
                     <p class="p">You can create a new executable session for your application by
                        selecting the <span class="ph uicontrol">Profile An Application</span> link
                        on the <span class="ph uicontrol">Welcome</span> page, or by selecting
                        <span class="ph uicontrol">New Session</span> from the
                        <span class="ph uicontrol">File</span> menu. Once a session is created, you
                        can edit the session's settings as described in the <a class="xref" href="index.html#settings-view" shape="rect">Settings View</a>.
                     </p>
                     <p class="p">You can open and save existing sessions using the open and save
                        options in the <span class="ph uicontrol">File</span> menu.
                     </p>
                     <p class="p">To analyze your application and to collect metric and event
                        values, the Visual Profiler will execute your application multiple
                        times. To get accurate profiling results, it is important that
                        your application conform to the requirements detailed in <a class="xref" href="index.html#application-requirements" shape="rect">Application Requirements</a>.
                     </p>
                  </div>
               </div>
               <div class="topic concept nested2" id="import-session"><a name="import-session" shape="rect">
                     <!-- --></a><h3 class="title topictitle2"><a href="#import-session" name="import-session" shape="rect">2.2.2.&nbsp;Import Session</a></h3>
                  <div class="body conbody">
                     <p class="p">You create an import session from the output of
                        <samp class="ph codeph">nvprof</samp> or the command-line profiler by using the
                        <span class="ph uicontrol">Import...</span> option in the
                        <span class="ph uicontrol">File</span> menu. Selecting this option opens the
                        import wizard, which guides you through the import process.
                     </p>
                     <p class="p">Because an executable application is not associated with an
                        import session, the Visual Profiler cannot execute the application
                        to collect additional profile data. As a result, analysis can only
                        be performed with the data that is imported. Also, the <a class="xref" href="index.html#details-view" shape="rect">Details View</a> will show any imported event and metrics
                        values but new metrics and events cannot be selected and collected
                        for the import session.
                     </p>
                  </div>
                  <div class="topic concept nested3" id="import-nvprof-session"><a name="import-nvprof-session" shape="rect">
                        <!-- --></a><h3 class="title topictitle2"><a href="#import-nvprof-session" name="import-nvprof-session" shape="rect">2.2.2.1.&nbsp;Import <samp class="ph codeph">nvprof</samp> Session</a></h3>
                     <div class="body conbody">
                        <p class="p">Using the import wizard you can select one or more
                           <samp class="ph codeph">nvprof</samp> data files for import into the new
                           session.
                        </p>
                        <p class="p">You must have one <samp class="ph codeph">nvprof</samp> data file that
                           contains the timeline information for the session. This data file
                           should be collected by running <samp class="ph codeph">nvprof</samp> with the
                           <samp class="ph codeph">--output-profile</samp> option. You can optionally
                           enable other options such as <samp class="ph codeph">--system-profiling
                              on</samp>, but you should not collect any events or metrics, because
                           doing so distorts the timeline so that it no longer represents
                           the application's true behavior.
                        </p>
                        <p class="p">You may optionally specify one or more event/metric data files
                           that contain event and metric values for the application. These
                           data files should be collected by running <samp class="ph codeph">nvprof</samp> with one or both
                           of the <samp class="ph codeph">--events</samp> and <samp class="ph codeph">--metrics</samp>
                           options. To collect all the events and metrics that are needed for
                           the guided analysis system, you can simply use the
                           <samp class="ph codeph">--analysis-metrics</samp> option along with the
                           <samp class="ph codeph">--kernels</samp> option to select the kernel(s) to
                           collect events and metrics for. See <a class="xref" href="index.html#remote-profiling" shape="rect">Remote Profiling</a> for more information.
                        </p>
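                        <p class="p">As a rough sketch of the two collection passes described above, the
                           following Python helpers build one <samp class="ph codeph">nvprof</samp> command line for
                           the timeline data file and one for the event/metric data file. The
                           application path <samp class="ph codeph">./my_app</samp>, the output file names, and the
                           kernel name <samp class="ph codeph">my_kernel</samp> are hypothetical placeholders, and the
                           ordering of the options is an assumption; only the option names come
                           from the text above.
                        </p>

```python
import shlex


def timeline_command(app_argv, output_file):
    """Build an nvprof invocation that records only timeline data."""
    # No --events or --metrics here: collecting them would distort the timeline.
    return ["nvprof", "--output-profile", output_file] + list(app_argv)


def analysis_command(app_argv, output_file, kernel_filter):
    """Build an nvprof invocation that collects guided-analysis events and metrics."""
    return [
        "nvprof",
        "--kernels", kernel_filter,   # hypothetical kernel selection
        "--analysis-metrics",
        "--output-profile", output_file,
    ] + list(app_argv)


if __name__ == "__main__":
    app = ["./my_app"]  # hypothetical application
    print(shlex.join(timeline_command(app, "timeline.nvprof")))
    print(shlex.join(analysis_command(app, "analysis.nvprof", "my_kernel")))
```

                        <p class="p">Either list could be passed to <samp class="ph codeph">subprocess.run</samp> to launch
                           the collection; the resulting files are then selected in the import wizard.
                        </p>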
                        <p class="p">If you are importing multiple <samp class="ph codeph">nvprof</samp> output
                           files into the session, it is important that your application
                           conform to the requirements detailed in <a class="xref" href="index.html#application-requirements" shape="rect">Application Requirements</a>.
                        </p>
                     </div>
                  </div>
                  <div class="topic concept nested3" id="import-csv-session"><a name="import-csv-session" shape="rect">
                        <!-- --></a><h3 class="title topictitle2"><a href="#import-csv-session" name="import-csv-session" shape="rect">2.2.2.2.&nbsp;Import Command-Line Profiler Session</a></h3>
                     <div class="body conbody">
                        <p class="p">Using the import wizard you can select one or more command-line
                           profiler generated CSV files for import into the new session. When
                           you import multiple CSV files, their contents are combined and
                           displayed in a single timeline.
                        </p>
                        <p class="p">When using the <a class="xref" href="index.html#compute-command-line-profiler-overview" shape="rect">command-line
                              profiler</a> to create a CSV file for import into the Visual
                           Profiler, the following requirements must be met:
                        </p>
                        <ul class="ul">
                           <li class="li"><samp class="ph codeph">COMPUTE_PROFILE_CSV</samp> must be 1 to generate
                              CSV formatted output.
                           </li>
                           <li class="li"><samp class="ph codeph">COMPUTE_PROFILE_CONFIG</samp> must point to a file
                              that contains the <samp class="ph codeph">gpustarttimestamp</samp> and
                              <samp class="ph codeph">streamid</samp> configuration parameters. The
                              configuration file may also contain other
                              configuration parameters, including events.
                           </li>
                        </ul>
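                        <p class="p">The two requirements above can be sketched in Python as follows. This is
                           a minimal illustration, not part of the profiler: the helper names are
                           hypothetical, the one-option-per-line file layout is assumed, and setting
                           <samp class="ph codeph">COMPUTE_PROFILE</samp> to 1 to enable the command-line profiler
                           is an assumption taken from that profiler's own documentation rather than
                           from this section.
                        </p>

```python
import os


def write_profiler_config(path, extra_options=()):
    """Write a command-line profiler config file, one option per line."""
    # gpustarttimestamp and streamid are the two parameters required for import.
    options = ["gpustarttimestamp", "streamid"] + list(extra_options)
    with open(path, "w") as config_file:
        config_file.write("\n".join(options) + "\n")
    return options


def profiler_environment(config_path, base_env=None):
    """Return a copy of the environment with the profiler variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["COMPUTE_PROFILE"] = "1"  # assumption: enables the command-line profiler
    env["COMPUTE_PROFILE_CSV"] = "1"  # required: CSV formatted output
    env["COMPUTE_PROFILE_CONFIG"] = config_path  # required: config file location
    return env
```

                        <p class="p">The returned dictionary could be passed as the <samp class="ph codeph">env</samp>
                           argument of <samp class="ph codeph">subprocess.run</samp> when launching the application.
                        </p>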
                     </div>
                  </div>
               </div>
            </div>
            <div class="topic concept nested1" id="application-requirements"><a name="application-requirements" shape="rect">
                  <!-- --></a><h3 class="title topictitle2"><a href="#application-requirements" name="application-requirements" shape="rect">2.3.&nbsp;Application Requirements</a></h3>
               <div class="body conbody">
                  <div class="p">To collect performance data about your application, the Visual
                     Profiler must be able to execute your application repeatedly in a
                     deterministic manner. Due to software and hardware limitations, it
                     is not possible to collect all the necessary profile data in a
                     single execution of your application. Each time your application
                     is run, it must operate on the same data and perform the same
                     kernel and memory copy invocations in the same order. Specifically,
                     
                     <ul class="ul">
                        <li class="li">For a device, the order of context creation must be the
                           same each time the application executes. For a multi-threaded
                           application where each thread creates its own context(s), care
                           must be taken to ensure that the order of those context
                           creations is consistent across multiple runs. For example, it
                           may be necessary to create the contexts on a single thread and
                           then pass the contexts to the other threads. Alternatively,
                           the <a class="xref" href="index.html#nvtx" shape="rect">NVIDIA Tools Extension
                              API</a> can be used to provide a custom name for each
                           context. As long as the same custom name is applied to the
                           same context on each execution of the application, the Visual
                           Profiler will be able to correctly associate those contexts
                           across multiple runs.
                        </li>
                        <li class="li">For a context, the order of stream creation must be the
                           same each time the application executes. Alternatively,
                           the <a class="xref" href="index.html#nvtx" shape="rect">NVIDIA Tools Extension
                              API</a> can be used to provide a custom name for each
                           stream. As long as the same custom name is applied to the same
                           stream on each execution of the application, the Visual
                           Profiler will be able to correctly associate those streams
                           across multiple runs.
                        </li>
                        <li class="li">Within a stream, the order of kernel and memcpy
                           invocations must be the same each time the application
                           executes.
                        </li>
                     </ul>
                  </div>
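                  <p class="p">To make the per-stream ordering requirement concrete, the following
                     Python sketch compares two hypothetical activity logs. It does not reflect
                     how the Visual Profiler associates runs internally; the record format, a list
                     of (stream name, operation) pairs, is an assumption made for illustration.
                     Two runs are treated as matching when every stream sees the same operations
                     in the same order, even if operations on different streams interleave
                     differently.
                  </p>

```python
from collections import defaultdict


def per_stream_order(records):
    """Group (stream_name, operation) records by stream, keeping invocation order."""
    order = defaultdict(list)
    for stream_name, operation in records:
        order[stream_name].append(operation)
    return dict(order)


def runs_match(run_a, run_b):
    """True if both runs issue the same operations in the same order within each stream."""
    return per_stream_order(run_a) == per_stream_order(run_b)


if __name__ == "__main__":
    first = [("s0", "memcpy HtoD"), ("s1", "kernelA"), ("s0", "kernelB")]
    second = [("s1", "kernelA"), ("s0", "memcpy HtoD"), ("s0", "kernelB")]
    print(runs_match(first, second))
```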
               </div>
            </div>
            <div class="topic concept nested1" id="profiling-limitations"><a name="profiling-limitations" shape="rect">
                  <!-- --></a><h3 class="title topictitle2"><a href="#profiling-limitations" name="profiling-limitations" shape="rect">2.4.&nbsp;Profiling Limitations</a></h3>
               <div class="body conbody">
                  <div class="p">Due to software and hardware restrictions, there are several
                     limitations to the profiling and analysis performed by the Visual
                     Profiler.
                     
                     <ul class="ul">
                        <li class="li">Some analysis results require metrics that are not available
                           on all devices. When these analyses are attempted on a device
                           where the metric is not available, the analysis results will show
                           that the required data is "not available".
                        </li>
                        <li class="li">Some metric values are calculated assuming a kernel is large
                           enough to occupy all device multiprocessors with approximately
                           the same amount of work. If a kernel launch does not have this
                           characteristic, then those metric values may not be
                           accurate.
                        </li>
                        <li class="li">For some metrics, the required events can only be collected
                           for a single CUDA context. For an application that uses multiple
                           CUDA contexts, these metrics will only be collected for one of
                           the contexts. The metrics that can be collected only for a
                           single CUDA context are indicated in the <a class="xref" href="index.html#metrics-reference" shape="rect">metric reference tables</a>.
                        </li>
                        <li class="li">The <samp class="ph codeph">Warp Non-Predicated Execution
                              Efficiency</samp> metric is only available on compute
                           capability 3.5 and later devices.
                        </li>
                        <li class="li">The <samp class="ph codeph">Warp Execution Efficiency</samp> metric is not
                           available on compute capability 3.0 devices.
                        </li>
                        <li class="li">The <samp class="ph codeph">Branch Efficiency</samp> metric is not
                           available on compute capability 3.5 devices.
                        </li>
                        <li class="li">For compute capability 2.x devices, the <samp class="ph codeph">Achieved
                              Occupancy</samp> metric can report inaccurate values that are
                           greater than the actual achieved occupancy. In rare cases this
                           can cause the achieved occupancy value to exceed the theoretical
                           occupancy value for the kernel.
                        </li>
                        <li class="li">The timestamps collected for applications running on GPUs in
                           an SLI configuration are incorrect. As a result, most profiling
                           results collected for the application will be invalid.
                        </li>
                        <li class="li">Concurrent kernel mode can add significant overhead if used
                           on kernels that execute a large number of blocks and that have
                           short execution durations.
                        </li>
                     </ul>
                  </div>
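                  <p class="p">The three metric-specific restrictions above can be summarized as a small
                     lookup. The Python sketch below encodes only those three statements; the
                     function name is hypothetical, treating every other device as supported for
                     these metrics is an assumption, and metrics not named in the list return
                     <samp class="ph codeph">None</samp> because this section says nothing specific about them.
                  </p>

```python
def metric_available(metric, compute_capability):
    """Apply the three per-metric restrictions listed above.

    compute_capability is a (major, minor) pair such as (3, 5).
    Returns True or False for the three named metrics, None otherwise.
    """
    cc = tuple(compute_capability)
    if metric == "Warp Non-Predicated Execution Efficiency":
        return cc >= (3, 5)  # only on compute capability 3.5 and later
    if metric == "Warp Execution Efficiency":
        return cc != (3, 0)  # not available on compute capability 3.0
    if metric == "Branch Efficiency":
        return cc != (3, 5)  # not available on compute capability 3.5
    return None  # not covered by these three rules
```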
               </div>
            </div>
            <div class="topic concept nested1" id="visual-profiler-views"><a name="visual-profiler-views" shape="rect">
                  <!-- --></a><h3 class="title topictitle2"><a href="#visual-profiler-views" name="visual-profiler-views" shape="rect">2.5.&nbsp;Visual Profiler Views</a></h3>
               <div class="body conbody">
                  <p class="p">The Visual Profiler is organized into views. Together, the views allow you to analyze and visualize the performance of your
                     application. This section describes each view and how you use it while profiling your application.
                  </p>
               </div>
               <div class="topic concept nested2" id="timeline-view"><a name="timeline-view" shape="rect">
                     <!-- --></a><h3 class="title topictitle2"><a href="#timeline-view" name="timeline-view" shape="rect">2.5.1.&nbsp;Timeline View</a></h3>
                  <div class="body conbody">
                     <p class="p">The Timeline View shows CPU and GPU activity that occurred
                        while your application was being profiled. Multiple timelines can
                        be opened in the Visual Profiler at the same time. Each opened
                        timeline is represented by a different instance of the view. The
                        following figure shows a Timeline View for a CUDA application.
                     </p><a name="timeline-view__img-timeline-view" shape="rect">
                        <!-- --></a><img class="image" id="timeline-view__img-timeline-view" src="graphics/timeline-view.png" alt="Timeline View shows CPU and GPU activity that occurred while your application was being profiled."></img><p class="p">Along the top of the view is a horizontal ruler that shows
                        elapsed time from the start of application profiling. Along the
                        left of the view is a vertical ruler that describes what is being
                        shown for each horizontal row of the timeline, and that contains
                        various controls for the timeline. These controls are described in
                        <a class="xref" href="index.html#timeline-controls" shape="rect">Timeline Controls</a>.</p>
                     <p class="p">The types of timeline rows that are displayed in the Timeline
                        View are:
                     </p>
                     <dl class="dl">
                        <dt class="dt dlterm">Process</dt>
                        <dd class="dd">A timeline will contain a <span class="ph uicontrol">Process</span> row for each
                           application profiled. The process identifier represents the pid of the process.
                           The timeline row for a process does not contain any intervals of activity.
                           Threads within the process are shown as children of the process.
                        </dd>
                        <dt class="dt dlterm">Thread</dt>
                        <dd class="dd">A timeline will contain a <span class="ph uicontrol">Thread</span> row for each thread in
                           the profiled application that performed either a CUDA driver or runtime API
                           call. The thread identifier is a unique id for that thread. The timeline row for
                           a thread does not contain any intervals of activity.
                        </dd>
                        <dt class="dt dlterm">Runtime API</dt>
                        <dd class="dd">A timeline will contain a <span class="ph uicontrol">Runtime API</span> row for each thread
                           that performs a CUDA Runtime API call. Each interval in the row represents the
                           duration of the call on the CPU.
                        </dd>
                        <dt class="dt dlterm">Driver API</dt>
                        <dd class="dd">A timeline will contain a <span class="ph uicontrol">Driver API</span> row for each thread
                           that performs a CUDA Driver API call. Each interval in the row represents the
                           duration of the call on the CPU.
                        </dd>
                        <dt class="dt dlterm">Markers and Ranges</dt>
                        <dd class="dd">A timeline will contain a single <span class="ph uicontrol">Markers and
                              Ranges</span> row for each thread that uses the <a class="xref" href="index.html#nvtx" shape="rect">NVIDIA Tools Extension API</a>
                           to annotate a time range or marker. Each interval in the row
                           represents the duration of a time range, or the instantaneous
                           point of a marker.
                           
                        </dd>
                        <dt class="dt dlterm">Profiling Overhead</dt>
                        <dd class="dd">A timeline will contain a single <span class="ph uicontrol">Profiling
                              Overhead</span> row for each process. Each interval in
                           the row represents the duration of execution of some activity
                           required for profiling. These intervals represent activity
                           that does not occur when the application is not being
                           profiled.
                        </dd>
                        <dt class="dt dlterm">Device</dt>
                        <dd class="dd">A timeline will contain a <span class="ph uicontrol">Device</span>
                           row for each GPU device utilized by the application being
                           profiled. The name of the timeline row indicates the device ID
                           in square brackets followed by the name of the device.  After
                           running the <span class="ph uicontrol">Compute Utilization</span>
                           analysis, the row will contain an estimate of the compute
                           utilization of the device over time. If power, clock, and
                           thermal profiling are enabled, the row will also contain
                           points representing those readings.
                        </dd>
                        <dt class="dt dlterm">Context</dt>
                        <dd class="dd">A timeline will contain a <span class="ph uicontrol">Context</span>
                           row for each CUDA context on a GPU device. The name of the
                           timeline row indicates the context ID or the custom context
                           name if the <a class="xref" href="index.html#nvtx" shape="rect">NVIDIA Tools
                              Extension API</a> was used to name the context.  The row
                           for a context does not contain any intervals of activity.
                        </dd>
                        <dt class="dt dlterm">Memcpy</dt>
                        <dd class="dd">A timeline will contain memory copy row(s) for each
                           context that performs memcpys. A context may contain up to
                           four memcpy rows for device-to-host, host-to-device,
                           device-to-device, and peer-to-peer memory copies. Each
                           interval in a row represents the duration of a memcpy
                           executing on the GPU.
                        </dd>
                        <dt class="dt dlterm">Compute</dt>
                        <dd class="dd">A timeline will contain a <span class="ph uicontrol">Compute</span> row for each context
                           that performs computation on the GPU. Each interval in a row represents the
                           duration of a kernel on the GPU device. The <span class="ph uicontrol">Compute</span> row
                           indicates all the compute activity for the context on a GPU device. The
                           contained <span class="ph uicontrol">Kernel</span> rows show activity of each individual
                           application kernel.
                        </dd>
                        <dt class="dt dlterm">Kernel</dt>
                        <dd class="dd">A timeline will contain a <span class="ph uicontrol">Kernel</span>
                           row for each type of kernel executed by the application. Each
                           interval in a row represents the duration of execution of an
                           instance of that kernel on the GPU device.  Each row is
                           labeled with a percentage that indicates the total execution
                           time of all instances of that kernel compared to the total
                           execution time of all kernels.  For each context, the kernels
                           are ordered top to bottom by this execution time percentage.
                        </dd>
                        <dt class="dt dlterm">Stream</dt>
                        <dd class="dd">A timeline will contain a <span class="ph uicontrol">Stream</span> row for each stream used
                           by the application (including both the default stream and any application
                           created streams). Each interval in a <span class="ph uicontrol">Stream</span> row
                           represents the duration of a memcpy or kernel execution performed on that
                           stream.
                        </dd>
                     </dl>
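                     <p class="p">The percentage shown on each <span class="ph uicontrol">Kernel</span> row, and the
                        top-to-bottom ordering of those rows, can be reproduced with a short
                        calculation. The Python sketch below assumes a hypothetical list of
                        (kernel name, start, end) intervals for one context; it illustrates the
                        arithmetic described above and is not the Visual Profiler's own code.
                     </p>

```python
def kernel_row_labels(intervals):
    """Return (kernel_name, percent_of_total_kernel_time) pairs, largest first.

    intervals is an iterable of (kernel_name, start, end) tuples, one per
    kernel instance, with start and end in the same time unit.
    """
    totals = {}
    for name, start, end in intervals:
        totals[name] = totals.get(name, 0.0) + (end - start)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    rows = [(name, 100.0 * total / grand_total) for name, total in totals.items()]
    # Rows are ordered top to bottom by execution time percentage.
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows
```

                     <p class="p">For example, two instances of one kernel lasting 30 time units each and a
                        single 20-unit instance of another kernel give rows labeled 75% and 25%.
                     </p>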
                  </div>
                  <div class="topic concept nested3" id="timeline-controls"><a name="timeline-controls" shape="rect">
                        <!-- --></a><h3 class="title topictitle2"><a href="#timeline-controls" name="timeline-controls" shape="rect">2.5.1.1.&nbsp;Timeline Controls</a></h3>
                     <div class="body conbody">
                        <p class="p">The <a class="xref" href="index.html#timeline-view" shape="rect">Timeline View</a> has several controls that
                           determine how the timeline is displayed. Some of
                           these controls also influence the presentation of data in
                           the <a class="xref" href="index.html#details-view" shape="rect">Details View</a> and the
                           <a class="xref" href="index.html#analysis-view" shape="rect">Analysis View</a>.
                        </p>
                        <div class="section">
                           <h5 class="title sectiontitle">Resizing the Vertical Timeline Ruler</h5>
                           <p class="p">The width of the vertical ruler can be adjusted by placing
                              the mouse pointer over the right edge of the ruler. When the
                              double arrow pointer appears, click and hold the left mouse
                              button while dragging. The vertical ruler width is saved with
                              your session.
                           </p>
                        </div>
                        <div class="section">
                           <h5 class="title sectiontitle">Reordering Timelines</h5>
                           <p class="p">The <span class="ph uicontrol">Kernel</span>
                              and <span class="ph uicontrol">Stream</span> timeline rows can be
                              reordered. You may want to reorder these rows to aid in
                              visualizing related kernels and streams, or to move
                              unimportant kernels and streams to the bottom of the
                              timeline. To reorder a row, left-click on the row
                              label. When the double arrow pointer appears, drag up or
                              down to position the row. The timeline ordering is saved
                              with your session.
                           </p>
                        </div>
                        <div class="section">
                           <h5 class="title sectiontitle">Filtering Timelines</h5>
                           <p class="p"><span class="ph uicontrol">Memcpy</span>
                              and <span class="ph uicontrol">Kernel</span> rows can be filtered to
                              exclude their activities from presentation in the
                              <a class="xref" href="index.html#details-view" shape="rect">Details View</a> and the
                              <a class="xref" href="index.html#analysis-view" shape="rect">Analysis View</a>. To filter out a row,
                              left-click on the filter icon just to the left of the row
                              label. To filter all Kernel or Memcpy
                              rows, <span class="ph uicontrol">Shift</span>-left-click one of the
                              rows. When a row is filtered, any intervals on that row are
                              dimmed to indicate their filtered status.
                           </p>
                        </div>
                        <div class="section">
                           <h5 class="title sectiontitle">Expanding and Collapsing Timelines</h5>
                           <div class="p">Groups of timeline rows can be expanded and collapsed using
                              the <span class="ph uicontrol">[+]</span> and <span class="ph uicontrol">[-]</span>
                              controls just to the left of the row labels. There are three
                              expand/collapse states:
                              <dl class="dl">
                                 <dt class="dt dlterm">Collapsed</dt>
                                 <dd class="dd">No timeline rows contained in the collapsed row are shown.</dd>
                                 <dt class="dt dlterm">Expanded</dt>
                                 <dd class="dd">All non-filtered timeline rows are shown.</dd>
                                 <dt class="dt dlterm">All-Expanded</dt>
                                 <dd class="dd">All timeline rows, filtered and non-filtered, are shown.</dd>
                              </dl>
                           </div>
                           <p class="p">Intervals associated with collapsed rows may not be shown in
                              the <a class="xref" href="index.html#details-view" shape="rect">Details View</a> and the <a class="xref" href="index.html#analysis-view" shape="rect">Analysis View</a>, depending on the filtering mode set
                              for those views (see view documentation for more
                              information). For example, if you collapse a device row, then
                              all memcpys, memsets, and kernels associated with that device
                              are excluded from the results shown in those views.
                              
                           </p>
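                           <p class="p">The three expand/collapse states can be modeled as follows. This
                              Python sketch is only an illustration of the rules listed above: the
                              type and function names are hypothetical, and each child row is
                              represented as a (label, is_filtered) pair.
                           </p>

```python
from enum import Enum


class ExpandState(Enum):
    COLLAPSED = "collapsed"
    EXPANDED = "expanded"
    ALL_EXPANDED = "all-expanded"


def visible_children(children, state):
    """Return the labels of the child rows shown for a given expand state.

    children is a list of (row_label, is_filtered) pairs.
    """
    if state is ExpandState.COLLAPSED:
        return []  # no contained rows are shown
    if state is ExpandState.EXPANDED:
        # only non-filtered rows are shown
        return [label for label, is_filtered in children if not is_filtered]
    # ALL_EXPANDED: filtered and non-filtered rows are shown
    return [label for label, _ in children]
```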
                        </div>
                        <div class="section">
                           <h5 class="title sectiontitle">Coloring Timelines</h5>
                           <p class="p">There are two modes for
                              timeline coloring. The coloring mode can be selected in the
                              <span class="ph uicontrol">View</span> menu, in the timeline context menu
                              (accessed by right-clicking in the timeline view), and
                              on the Visual Profiler toolbar. In
                              <span class="ph uicontrol">kernel</span> coloring mode, each type of
                              kernel is assigned a unique color (that is, all activity
                              intervals in a kernel row have the same color).
                              In <span class="ph uicontrol">stream</span> coloring mode, each stream is
                              assigned a unique color (that is, all memcpy and kernel
                              activity occurring on a stream are assigned the same
                              color).
                           </p>
                        </div>
                     </div>
                  </div>
                  <div class="topic concept nested3" id="navigating-timeline"><a name="navigating-timeline" shape="rect">
                        <!-- --></a><h3 class="title topictitle2"><a href="#navigating-timeline" name="navigating-timeline" shape="rect">2.5.1.2.&nbsp;Navigating the Timeline</a></h3>
                     <div class="body conbody">
                        <p class="p">The timeline can be scrolled, zoomed, and focused in several ways to help you better understand and visualize your application's
                           performance.
                        </p>
                        <div class="section">
                           <h5 class="title sectiontitle">Zooming</h5>
                           <p class="p">The zoom controls are available in the <span class="ph uicontrol">View</span> menu, in the timeline
                              context menu (accessed by right-clicking in the timeline view), and on the Visual
                              Profiler toolbar. Zoom-in reduces the timespan displayed in the view, zoom-out
                              increases the timespan displayed in the view, and zoom-to-fit scales the view so
                              that the entire timeline is visible.
                           </p>
                           <p class="p">You can also zoom-in and zoom-out with the mouse wheel while holding the
                              <span class="ph uicontrol">Ctrl</span> key (for Mac OS X use the <span class="ph uicontrol">Command</span> key).
                           </p>
                           <p class="p">Another useful zoom mode is zoom-to-region. Select a region
                              of the timeline by holding <span class="ph uicontrol">Ctrl</span> (for
                              Mac OS X use the <span class="ph uicontrol">Command</span> key) while
                              left-clicking and dragging the mouse. The highlighted region
                              will be expanded to occupy the entire view when the mouse
                              button is released.
                           </p>
                        </div>
                        <div class="section">
                           <h5 class="title sectiontitle">Scrolling</h5>
                           <p class="p">The timeline can be scrolled vertically with the scrollbar
                              or the mouse wheel. The timeline can be scrolled
                              horizontally with the scrollbar or by using the mouse wheel
                              while holding the <span class="ph uicontrol">Shift</span> key.
                           </p>
                        </div>
                        <div class="section">
                           <h5 class="title sectiontitle">Highlighting/Correlation</h5>
                           <p class="p">When you move the mouse pointer over an activity interval
                              on the timeline, that interval is highlighted in all places
                              where the corresponding activity is shown. For example, if
                              you move the mouse pointer over an interval representing a
                              kernel execution, that kernel execution is also highlighted
                              in the <span class="ph uicontrol">Stream</span> and in
                              the <span class="ph uicontrol">Compute</span> timeline row. When a
                              kernel or memcpy interval is highlighted, the corresponding
                              driver or runtime API interval will also highlight.  This
                              allows you to see the correlation between the invocation of
                              a driver or runtime API on the CPU and the corresponding
                              activity on the GPU. Information about the highlighted
                              interval is shown in the <a class="xref" href="index.html#properties-view" shape="rect">Properties View</a>.
                           </p>
                        </div>
                        <div class="section">
                           <h5 class="title sectiontitle">Selecting</h5>
                           <p class="p">You can left-click on a timeline interval or row to select
                              it. Multi-select is done using
                              <span class="ph uicontrol">Ctrl</span>-left-click. To unselect an
                              interval or row, simply <span class="ph uicontrol">Ctrl</span>-left-click
                              on it again. When a single interval or row is selected, the
                              information about that interval or row is pinned in the <a class="xref" href="index.html#properties-view" shape="rect">Properties View</a>.  In the <a class="xref" href="index.html#details-view" shape="rect">Details View</a>, the detailed information for the
                              selected interval is shown in the table.
                           </p>
                        </div>
                        <div class="section">
                           <h5 class="title sectiontitle">Measuring Time Deltas</h5>
                           <p class="p">Measurement rulers can be created by left-click dragging in
                              the horizontal ruler at the top of the timeline. Once a ruler
                              is created it can be activated and deactivated by
                              left-clicking. Multiple rulers can be activated
                              by <span class="ph uicontrol">Ctrl</span>-left-click. Any number of
                              rulers can be created. Active rulers are deleted with
                              the <span class="ph uicontrol">Delete</span>
                              or <span class="ph uicontrol">Backspace</span> keys. After a ruler is
                              created, it can be resized by dragging the vertical guide
                              lines that appear over the timeline. If the mouse is dragged
                              over a timeline interval, the guideline will snap to the
                              nearest edge of that interval.
                           </p>
                        </div>
                     </div>
                  </div>
               </div>
               <div class="topic concept nested2" id="analysis-view"><a name="analysis-view" shape="rect">
                     <!-- --></a><h3 class="title topictitle2"><a href="#analysis-view" name="analysis-view" shape="rect">2.5.2.&nbsp;Analysis View</a></h3>
                  <div class="body conbody">
                     <p class="p">The Analysis View is used to control application analysis and
                        to display the analysis results. There are two analysis modes:
                        <dfn class="term">guided</dfn> and <dfn class="term">unguided</dfn>. In guided mode the
                        analysis system will guide you through multiple analysis stages to
                        help you understand the likely performance limiters and
                        optimization opportunities in your application. In unguided mode
                        you can manually explore all the analysis results collected for your
                        application. The following figure shows the analysis view in
                        guided analysis mode. The left part of the view provides
                        step-by-step directions to help you analyze and optimize your
                        application. The right part of the view shows you detailed
                        analysis results appropriate for each part of the analysis.
                     </p><a name="analysis-view__img-analysis-view" shape="rect">
                        <!-- --></a><img class="image" id="analysis-view__img-analysis-view" src="graphics/analysis-view.png" alt="Analysis View is used to control application analysis and to display the analysis results."></img><div class="section">
                        <h4 class="title sectiontitle">Guided Application Analysis</h4>
                        <p class="p">In guided mode, the analysis view will guide you step-by-step
                           through analysis of your entire application with specific
                           analysis guidance provided for each kernel within your
                           application. Guided analysis starts with <span class="ph uicontrol">CUDA
                              Application Analysis</span> and from there will guide you
                           to optimization opportunities within your application.
                        </p>
                     </div>
                     <div class="section">
                        <h4 class="title sectiontitle">Unguided Application Analysis</h4>
                        <p class="p">In unguided analysis mode each application analysis stage has
                           a <span class="ph uicontrol">Run analysis</span> button that can be used to
                           generate the analysis results for that stage.  When the
                           <span class="ph uicontrol">Run analysis</span> button is selected, the
                           Visual Profiler will execute the application to collect the
                           profiling data needed to perform the analysis. The green
                           checkmark next to an analysis stage indicates that the analysis
                           results for that stage are available. Each analysis result
                           contains a brief description of the analysis and a
                           <span class="ph uicontrol">More…</span> link to detailed documentation on
                           the analysis. When you select an analysis result, the timeline
                           rows or intervals associated with that result are highlighted in
                           the <a class="xref" href="index.html#timeline-view" shape="rect">Timeline View</a>.
                        </p>
                        <p class="p">When a single kernel instance is selected in the timeline,
                           additional kernel-specific analysis stages are available. Each
                           kernel-specific analysis stage has a <span class="ph uicontrol">Run
                              analysis</span> button that operates in the same manner as
                           for the application analysis stages. The following figure shows
                           the analysis results for the <span class="ph uicontrol">Divergent
                              Execution</span> analysis stage. Some kernel instance
                           analysis results, like <span class="ph uicontrol">Divergent
                              Execution</span>, are associated with specific source lines
                           within the kernel. To see the source associated with each
                           result, select an entry from the table. The source file
                           associated with that entry will open.
                        </p><a name="analysis-view__img-analysis-view-sourcelevel" shape="rect">
                           <!-- --></a><img class="image" id="analysis-view__img-analysis-view-sourcelevel" src="graphics/analysis-view-sourcelevel.png" alt="Analysis View is used to control application analysis and to display the analysis results."></img></div>
                  </div>
               </div>
               <div class="topic concept nested2" id="details-view"><a name="details-view" shape="rect">
                     <!-- --></a><h3 class="title topictitle2"><a href="#details-view" name="details-view" shape="rect">2.5.3.&nbsp;Details View</a></h3>
                  <div class="body conbody">
                     <p class="p">The Details View displays a table of information for each
                        memory copy and kernel execution in the profiled
                        application. The following figure shows the table containing
                        several memcpy and kernel executions. Each row of the table
                        contains general information for a kernel execution or memory
                        copy. For kernels, the table will also contain a column for each
                        metric or event value collected for that kernel. In the figure,
                        the <span class="ph uicontrol">Achieved Occupancy</span> column shows the
                        value of that metric for each of the kernel executions.
                     </p><a name="details-view__img-details-view" shape="rect">
                        <!-- --></a><img class="image" id="details-view__img-details-view" src="graphics/details-view.png" alt="Details View displays a table of information for each memory copy and kernel execution in the profiled application."></img><p class="p">You can sort the data by a column by left clicking on the
                        column header, and you can rearrange the columns by left
                        clicking on a column header and dragging it to its new
                        location. If you select a row in the table, the corresponding
                        interval will be selected in the <a class="xref" href="index.html#timeline-view" shape="rect">Timeline View</a>.
                        Similarly, if you select a kernel or memcpy interval in the
                        <a class="xref" href="index.html#timeline-view" shape="rect">Timeline View</a> the table will be scrolled to
                        show the corresponding data.
                     </p>
                     <p class="p">If you hover the mouse over a column header, a tooltip
                        describing the data shown in that column is displayed. For a column
                        containing event or metric data, the tooltip will describe the
                        corresponding event or metric. Section
                        <a class="xref" href="index.html#metrics-reference" shape="rect">Metrics Reference</a> contains more detailed
                        information about each metric.
                     </p>
                     <div class="p">The information shown in the Details View can be filtered in
                        various ways, controlled by the menu accessible from the Details
                        View toolbar. The following modes are available:
                        
                        <ul class="ul">
                           <li class="li"><span class="ph uicontrol">Filter By Selection</span> - If selected,
                              the Details View shows data only for the selected kernel and
                              memcpy intervals.
                           </li>
                           <li class="li"><span class="ph uicontrol">Show Hidden Timeline Data</span> - If not
                              selected, data is shown only for kernels and memcpys that are
                              visible in the timeline. Kernels and memcpys that are not
                              visible because they are inside collapsed parts of the timeline
                              are not shown.
                           </li>
                           <li class="li"><span class="ph uicontrol">Show Filtered Timeline Data</span> - If not
                              selected, data is shown only for kernels and memcpys that are in
                              timeline rows that are not filtered.
                           </li>
                        </ul>
                     </div>
                     <div class="section">
                        <h4 class="title sectiontitle">Collecting Events and Metrics</h4>
                        <p class="p">Specific event and metric values can be collected for each
                           kernel and displayed in the details table. Use the toolbar
                           icon in the upper right corner of the view to configure the
                           events and metrics to collect for each device, and to run the
                           application to collect those events and metrics.
                        </p>
                     </div>
                     <div class="section">
                        <h4 class="title sectiontitle">Show Summary Data</h4>
                        <p class="p">By default the table shows one row for each memcpy and kernel
                           invocation. Alternatively, the table can show summary results
                           for each kernel function. Use the toolbar icon in the upper
                           right corner of the view to select or deselect summary
                           format.
                        </p>
                     </div>
                     <div class="section">
                        <h4 class="title sectiontitle">Formatting Table Contents</h4>
                        <p class="p">The numbers in the table can be displayed either with or
                           without grouping separators. Use the toolbar icon in the upper
                           right corner of the view to select or deselect grouping
                           separators.
                        </p>
                     </div>
                     <div class="section">
                        <h4 class="title sectiontitle">Exporting Details</h4>
                        <p class="p">The contents of the table can be exported in CSV format using
                           the toolbar icon in the upper right corner of the view.
                        </p>
                     </div>
                  </div>
               </div>
               <div class="topic concept nested2" id="properties-view"><a name="properties-view" shape="rect">
                     <!-- --></a><h3 class="title topictitle2"><a href="#properties-view" name="properties-view" shape="rect">2.5.4.&nbsp;Properties View</a></h3>
                  <div class="body conbody">
                     <p class="p">The Properties View shows information about the row or
                        interval highlighted or selected in the
                        <a class="xref" href="index.html#timeline-view" shape="rect">Timeline View</a>. If a row or interval is not
                        selected, the displayed information tracks the motion of the
                        mouse pointer. If a row or interval is selected, the
                        displayed information is pinned to that row or interval.
                     </p>
                  </div>
               </div>
               <div class="topic concept nested2" id="console-view"><a name="console-view" shape="rect">
                     <!-- --></a><h3 class="title topictitle2"><a href="#console-view" name="console-view" shape="rect">2.5.5.&nbsp;Console View</a></h3>
                  <div class="body conbody">
                     <p class="p">The Console View shows the stdout and stderr output of the
                        application each time it executes. If you need to provide
                        stdin input to your application, you do so by typing into the
                        console view.
                     </p>
                  </div>
               </div>
               <div class="topic concept nested2" id="settings-view"><a name="settings-view" shape="rect">
                     <!-- --></a><h3 class="title topictitle2"><a href="#settings-view" name="settings-view" shape="rect">2.5.6.&nbsp;Settings View</a></h3>
                  <div class="body conbody">
                     <p class="p">The Settings View allows you to specify execution settings for
                        the application being profiled.  As shown in the following
                        figure, the <span class="ph uicontrol">Executable</span> settings tab
                        allows you to specify the executable file for the application,
                        the working directory for the application, the command-line
                        arguments for the application, and the environment for the
                        application. Only the executable file is required; all other
                        fields are optional.
                     </p><a name="settings-view__img-executable" shape="rect">
                        <!-- --></a><img class="image" id="settings-view__img-executable" src="graphics/executable.png" alt="Settings View dialogue box. Allows you to specify execution settings for the application being profiled."></img><div class="section">
                        <h4 class="title sectiontitle">Execution timeout</h4>
                        <p class="p">The <span class="ph uicontrol">Executable</span> settings tab also
                           allows you to specify an optional execution timeout. If the
                           execution timeout is specified, the application execution will
                           be terminated after that number of seconds. If the execution
                           timeout is not specified, the application will be allowed to
                           continue execution until it terminates normally.
                        </p>
                        <div class="note note"><span class="notetitle">Note:</span> The timeout starts counting from the moment the
                           CUDA driver is initialized. If the application
                           doesn't call any CUDA APIs, the timeout won't be
                           triggered.
                        </div>
                     </div>
                     <div class="section">
                        <h4 class="title sectiontitle">Start execution with profiling enabled</h4>
                        <p class="p">The <span class="ph uicontrol">Start execution with profiling
                              enabled</span> checkbox is set by default to indicate
                           that application profiling begins at the start of
                           application execution. If you are
                           using <samp class="ph codeph">cudaProfilerStart()</samp>
                           and <samp class="ph codeph">cudaProfilerStop()</samp> to control profiling
                           within your application as described in
                           <a class="xref" href="index.html#focusing-profiling" shape="rect">Focused Profiling</a>, then you should uncheck
                           this box.
                        </p>
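                        <p class="p">As a minimal sketch of that pattern, the CUDA program below
                           brackets only its second kernel launch with
                           <samp class="ph codeph">cudaProfilerStart()</samp> and
                           <samp class="ph codeph">cudaProfilerStop()</samp>. The
                           <samp class="ph codeph">scale</samp> kernel, the buffer size, and the launch
                           configuration are hypothetical and chosen only for illustration. With the
                           checkbox unchecked, the warm-up launch should fall outside the profiled
                           region.
                        </p>
                        <pre xml:space="preserve">#include &lt;cstdio&gt;
#include &lt;cuda_runtime.h&gt;
#include &lt;cuda_profiler_api.h&gt;  // declares cudaProfilerStart() and cudaProfilerStop()

// Hypothetical kernel: multiplies each element by a constant factor.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i &lt; n) {
        data[i] *= factor;
    }
}

int main()
{
    const int n = 1 &lt;&lt; 20;                           // assumed problem size
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *d_data = NULL;
    if (cudaMalloc(&amp;d_data, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    // Warm-up launch, issued before the profiled region begins.
    scale&lt;&lt;&lt;blocks, threads&gt;&gt;&gt;(d_data, 1.0f, n);
    cudaDeviceSynchronize();

    // Region of interest: only this launch is bracketed for profiling.
    cudaProfilerStart();
    scale&lt;&lt;&lt;blocks, threads&gt;&gt;&gt;(d_data, 2.0f, n);
    cudaDeviceSynchronize();
    cudaProfilerStop();

    cudaFree(d_data);
    return 0;
}</pre>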
                     </div>
                     <div class="section">
                        <h4 class="title sectiontitle">Enable concurrent kernel profiling</h4>
                        <p class="p">The <span class="ph uicontrol">Enable concurrent kernel profiling</span>
                           checkbox is set by default to enable profiling of applications
                           that exploit concurrent kernel execution. If this checkbox is
                           unset, the profiler will disable concurrent kernel
                           execution. Disabling concurrent kernel execution can reduce
                           profiling overhead in some cases and so may be appropriate for
                           applications that do not exploit concurrent kernels.
                        </p>
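                        <p class="p">For context, an application usually exploits concurrent kernel
                           execution by launching kernels into different non-default streams. The
                           sketch below is illustrative only: the <samp class="ph codeph">addOne</samp>
                           kernel, the stream count, and the buffer size are assumptions, and whether
                           the kernels actually overlap depends on the device and on the resources
                           each kernel needs.
                        </p>
                        <pre xml:space="preserve">#include &lt;cuda_runtime.h&gt;

// Hypothetical kernel: adds one to each element of its buffer.
__global__ void addOne(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i &lt; n) {
        data[i] += 1.0f;
    }
}

int main()
{
    const int kNumStreams = 4;    // assumed number of independent streams
    const int n = 1024;           // assumed elements per buffer
    cudaStream_t streams[kNumStreams];
    float *buffers[kNumStreams];

    // Give each stream its own buffer so the kernels are independent.
    for (int s = 0; s &lt; kNumStreams; ++s) {
        cudaStreamCreate(&amp;streams[s]);
        cudaMalloc(&amp;buffers[s], n * sizeof(float));
        cudaMemset(buffers[s], 0, n * sizeof(float));
    }

    // Kernels launched into different streams are eligible to run concurrently.
    for (int s = 0; s &lt; kNumStreams; ++s) {
        addOne&lt;&lt;&lt;(n + 255) / 256, 256, 0, streams[s]&gt;&gt;&gt;(buffers[s], n);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s &lt; kNumStreams; ++s) {
        cudaStreamDestroy(streams[s]);
        cudaFree(buffers[s]);
    }
    return 0;
}</pre>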
                     </div>
                     <div class="section">
                        <h4 class="title sectiontitle">Enable power, clock, and thermal profiling</h4>
                        <p class="p">The <span class="ph uicontrol">Enable power, clock, and thermal
                              profiling</span> checkbox can be set to enable low
                           frequency sampling of the power, clock, and thermal behavior of
                           each GPU used by the application.
                        </p>
                     </div>
                  </div>
               </div>
            </div>
            <div class="topic concept nested1" id="customizing-visual-profiler"><a name="customizing-visual-profiler" shape="rect">
                  <!-- --></a><h3 class="title topictitle2"><a href="#customizing-visual-profiler" name="customizing-visual-profiler" shape="rect">2.6.&nbsp;Customizing the Visual Profiler</a></h3>
               <div class="body conbody">
                  <p class="p">When you first start the Visual Profiler, and after closing the <span class="ph uicontrol">Welcome</span> page, you will be presented with a default placement of the views. By moving and resizing the views, you can customize the
                     Visual Profiler to meet your development needs. Any changes you make to the Visual Profiler are restored the next time you
                     start the profiler.
                  </p>
               </div>
               <div class="topic concept nested2" id="resizing-view"><a name="resizing-view" shape="rect">
                     <!-- --></a><h3 class="title topictitle2"><a href="#resizing-view" name="resizing-view" shape="rect">2.6.1.&nbsp;Resizing a View</a></h3>
                  <div class="body conbody">
                     <p class="p">To resize a view, simply left click and drag on the dividing area
                        between the views. All views stacked together in one area are
                        resized at the same time.
                     </p>
                  </div>
               </div>
               <div class="topic concept nested2" id="reordering-view"><a name="reordering-view" shape="rect">
                     <!-- --></a><h3 class="title topictitle2"><a href="#reordering-view" name="reordering-view" shape="rect">2.6.2.&nbsp;Reordering a View</a></h3>
                  <div class="body conbody">
                     <p class="p">To reorder a view in a stacked set of views, left click and drag the view tab to the new location within the view stack.</p>
                  </div>
               </div>
               <div class="topic concept nested2" id="moving-view"><a name="moving-view" shape="rect">
                     <!-- --></a><h3 class="title topictitle2"><a href="#moving-view" name="moving-view" shape="rect">2.6.3.&nbsp;Moving a View</a></h3>
                  <div class="body conbody">
                     <p class="p">To move a view, left click the view tab and drag it to its new location. As you drag the view, an outline will show the target
                        location for the view. You can place the view in a new location, or stack it in the same location as other views.
                     </p>
                  </div>
               </div>
               <div class="topic concept nested2" id="undocking-view"><a name="undocking-view" shape="rect">
                     <!-- --></a><h3 class="title topictitle2"><a href="#undocking-view" name="undocking-view" shape="rect">2.6.4.&nbsp;Undocking a View</a></h3>
                  <div class="body conbody">
                     <p class="p">You can undock a view from the Visual Profiler window so that the
                        view occupies its own stand-alone window. You may want to do this
                        to take advantage of multiple monitors or to maximize the size of an
                        individual view. To undock a view, left click the view tab and drag
                        it outside of the Visual Profiler window. To dock a view, left
                        click the view tab (not the window decoration) and drag it into the
                        Visual Profiler window.
                     </p>
                  </div>
               </div>
               <div class="topic concept nested2" id="conceptId"><a name="conceptId" shape="rect">
                     <!-- --></a><h3 class="title topictitle2"><a href="#conceptId" name="conceptId" shape="rect">2.6.5.&nbsp;Opening and Closing a View</a></h3>
                  <div class="body conbody">
                     <p class="p">Use the <span class="ph uicontrol">X</span> icon on a view tab to close a view. To open a view, use the
                        <span class="ph uicontrol">View</span> menu.
                     </p>
                  </div>
               </div>
            </div>
         </div>
         <div class="topic concept nested0" id="nvprof-overview"><a name="nvprof-overview" shape="rect">
               <!-- --></a><h2 class="title topictitle1"><a href="#nvprof-overview" name="nvprof-overview" shape="rect">nvprof</a></h2>
            <div class="body conbody">
               <p class="p">The <samp class="ph codeph">nvprof</samp> profiling tool enables you to
                  collect and view profiling data from the
                  command-line. <samp class="ph codeph">nvprof</samp> enables the collection of
                  a timeline of CUDA-related activities on both CPU and GPU, including
                  kernel execution, memory transfers, memory sets, and CUDA API
                  calls. <samp class="ph codeph">nvprof</samp> also enables you to collect events/metrics
                  for CUDA kernels. Profiling options are provided to
                  <samp class="ph codeph">nvprof</samp> through command-line options. Profiling results are
                  displayed in the console after the profiling data is collected, and
                  may also be saved for later viewing by
                  either <samp class="ph codeph">nvprof</samp> or
                  the <a class="xref" href="index.html#visual-profiler" shape="rect">Visual Profiler</a>.
               </p>
               <div class="note note"><span class="notetitle">Note:</span> The textual output is redirected to <samp class="ph codeph">stderr</samp> by default.
                  Use <samp class="ph codeph">--log-file</samp> to redirect the output to another file. See
                  <a class="xref" href="index.html#redirecting-output" shape="rect">Redirecting Output</a>.
               </div>
               <div class="p"><samp class="ph codeph">nvprof</samp> is included in the CUDA Toolkit for all
                  supported OSes. Here's how to use nvprof to profile a CUDA
                  application:
                  <pre xml:space="preserve">nvprof [options] [CUDA-application] [application-arguments]</pre><samp class="ph codeph">nvprof</samp> and the
                  <a class="xref" href="index.html#compute-command-line-profiler-overview" shape="rect">Command Line Profiler</a> are mutually
                  exclusive profiling tools. If <samp class="ph codeph">nvprof</samp> is invoked
                  when the command-line profiler is enabled, <samp class="ph codeph">nvprof</samp>
                  will report an error and exit.
               </div>
               <p class="p">To view the full help page, type <samp class="ph codeph">nvprof --help</samp>.
               </p>
            </div>
            <div class="topic concept nested1" id="profiling-modes"><a name="profiling-modes" shape="rect">
                  <!-- --></a><h3 class="title topictitle2"><a href="#profiling-modes" name="profiling-modes" shape="rect">3.1.&nbsp;Profiling Modes</a></h3>
               <div class="body conbody">
                  <p class="p"><samp class="ph codeph">nvprof</samp> operates in one of the modes listed below.
                  </p>
               </div>
               <div class="topic concept nested2" id="summary-mode"><a name="summary-mode" shape="rect">
                     <!-- --></a><h3 class="title topictitle2"><a href="#summary-mode" name="summary-mode" shape="rect">3.1.1.&nbsp;Summary Mode</a></h3>
                  <div class="body conbody">
                     <div class="section">
                        <p class="p">Summary mode is the default operating mode
                           for <samp class="ph codeph">nvprof</samp>. In this
                           mode, <samp class="ph codeph">nvprof</samp> outputs a single result line for
                           each kernel function and each type of CUDA memory copy/set
                           performed by the application. For each kernel or type of
                           memory copy/set, <samp class="ph codeph">nvprof</samp> outputs the number of calls,
                           the total time of all instances, and the
                           average, minimum, and maximum time. Output from <samp class="ph codeph">nvprof</samp>
                           (except for tables) is prefixed with <samp class="ph codeph">==&lt;pid&gt;==</samp>,
                           where <samp class="ph codeph">&lt;pid&gt;</samp> is the process ID of the application
                           being profiled. Here's a simple
                           example of running <samp class="ph codeph">nvprof</samp> on the CUDA sample <samp class="ph codeph">matrixMul</samp>:
                        </p><pre xml:space="preserve">$ nvprof matrixMul
[Matrix Multiply Using CUDA] - Starting...
==27694== NVPROF is profiling process 27694, command: matrixMul
GPU Device 0: "GeForce GT 640M LE" with compute capability 3.0

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 35.35 GFlop/s, Time= 3.708 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: OK

Note: For peak performance, please refer to the matrixMulCUBLAS example.
==27694== Profiling application: matrixMul
==27694== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 99.94%  1.11524s       301  3.7051ms  3.6928ms  3.7174ms  void matrixMulCUDA&lt;int=32&gt;(float*, float*, float*, int, int)
  0.04%  406.30us         2  203.15us  136.13us  270.18us  [CUDA memcpy HtoD]
  0.02%  248.29us         1  248.29us  248.29us  248.29us  [CUDA memcpy DtoH]
</pre><p class="p"><samp class="ph codeph">nvprof</samp> supports CUDA Dynamic Parallelism in summary mode. If your
                           application uses Dynamic Parallelism, the output will contain one column for
                           the number of host-launched kernels and one for the number of device-launched kernels. Here's an
                           example of running <samp class="ph codeph">nvprof</samp> on the CUDA Dynamic Parallelism sample
                           <samp class="ph codeph">cdpSimpleQuicksort</samp>:
                        </p><pre xml:space="preserve">$ nvprof cdpSimpleQuicksort
==27325== NVPROF is profiling process 27325, command: cdpSimpleQuicksort
Running on GPU 0 (Tesla K20c)
Initializing data:
Running quicksort on 128 elements
Launching kernel on the GPU
Validating results: OK
==27325== Profiling application: cdpSimpleQuicksort
==27325== Profiling result:
Time(%)      Time  Calls (host)  Calls (device)       Avg       Min       Max  Name
 99.71%  1.2114ms             1              14  80.761us  5.1200us  145.66us  cdp_simple_quicksort(unsigned int*, int, int, int)
  0.18%  2.2080us             1               -  2.2080us  2.2080us  2.2080us  [CUDA memcpy DtoH]
  0.11%  1.2800us             1               -  1.2800us  1.2800us  1.2800us  [CUDA memcpy HtoD]
</pre></div>
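                     <p class="p">In the sample output above, the Avg column equals the total time divided by the number of calls. The following Python sketch illustrates this by parsing one result row copied from the <samp class="ph codeph">matrixMul</samp> output. The helper names <samp class="ph codeph">to_seconds</samp> and <samp class="ph codeph">parse_summary_row</samp> are hypothetical and are not part of <samp class="ph codeph">nvprof</samp>. The sketch assumes the seven-column layout shown for <samp class="ph codeph">matrixMul</samp> (not the Dynamic Parallelism layout) and time suffixes of s, ms, us, or ns.</p>

```python
# Hypothetical helpers for reading nvprof summary-mode rows; not an nvprof API.
UNIT_SCALE = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0}


def to_seconds(text):
    """Convert a time such as '406.30us' to seconds."""
    # Test the two-letter suffixes before the bare 's'.
    for suffix in ("ns", "us", "ms", "s"):
        if text.endswith(suffix):
            return float(text[:-len(suffix)]) * UNIT_SCALE[suffix]
    raise ValueError("unrecognized time value: " + text)


def parse_summary_row(line):
    """Split one summary row into its seven columns."""
    # The name column may contain spaces, so split at most six times.
    percent, total, calls, avg, low, high, name = line.split(None, 6)
    return {
        "percent": float(percent.rstrip("%")),
        "total": to_seconds(total),
        "calls": int(calls),
        "avg": to_seconds(avg),
        "min": to_seconds(low),
        "max": to_seconds(high),
        "name": name,
    }


# Row copied from the matrixMul summary output.
row = parse_summary_row(
    "  0.04%  406.30us         2  203.15us  136.13us  270.18us  [CUDA memcpy HtoD]"
)
# Total time divided by the call count reproduces the Avg column.
print(row["name"], row["total"] / row["calls"], row["avg"])
```

                     <p class="p">Converting every value to seconds first makes rows comparable even though <samp class="ph codeph">nvprof</samp> picks a different unit for each value.</p>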
                  </div>
               </div>
               <div class="topic concept nested2" id="gpu-trace-and-api-trace-modes"><a name="gpu-trace-and-api-trace-modes" shape="rect">
                     <!-- --></a><h3 class="title topictitle2"><a href="#gpu-trace-and-api-trace-modes" name="gpu-trace-and-api-trace-modes" shape="rect">3.1.2.&nbsp;GPU-Trace and API-Trace Modes</a></h3>
                  <div class="body conbody">
                     <p class="p">GPU-Trace and API-Trace modes can be enabled individually
                        or at the same time. GPU-trace mode provides a timeline of all
                        activities taking place on the GPU in chronological
                        order. Each kernel execution and memory copy/set instance is shown
                        in the output. For each kernel or memory copy, detailed
                        information such as kernel parameters, shared memory usage, and
                        memory transfer throughput is shown. The number shown in
                        square brackets after the kernel name identifies the CUDA API
                        call that launched that kernel; the same number appears next to
                        the corresponding launch call in API-trace output.
                     </p>
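                     <p class="p">The Throughput column of a GPU trace can be cross-checked against the Size and Duration columns. The Python sketch below does this for two memory copy rows taken from the <samp class="ph codeph">matrixMul</samp> GPU trace shown in this section. The function name is hypothetical, and the decimal units (1 KB = 1000 bytes, 1 GB = 10^9 bytes) are an assumption that the sample numbers happen to be consistent with; this guide does not state the convention. Because the displayed values are rounded, small differences in the last digit are possible for other rows.</p>

```python
# Worked check of the Throughput column. Decimal units are an assumption:
# 1 KB = 1000 bytes and 1 GB = 10**9 bytes.
def throughput_gb_per_s(size_kb, duration_us):
    """Return size / duration in GB/s for a size in KB and a duration in us."""
    size_bytes = size_kb * 1000.0
    duration_s = duration_us * 1e-6
    return size_bytes / duration_s / 1e9


# (Size in KB, Duration in us) copied from two rows of the sample trace.
host_to_device = throughput_gb_per_s(819.20, 270.66)
device_to_host = throughput_gb_per_s(819.20, 248.13)

print(round(host_to_device, 4))  # the trace reports 3.0267GB/s for this row
print(round(device_to_host, 4))  # the trace reports 3.3015GB/s for this row
```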
                     <p class="p">Here's an example of GPU-trace output:</p><pre xml:space="preserve">$ nvprof --print-gpu-trace matrixMul
==27706== NVPROF is profiling process 27706, command: matrixMul
==27706== Profiling application: matrixMul
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "GeForce GT 640M LE" with compute capability 3.0

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 35.36 GFlop/s, Time= 3.707 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: OK

Note: For peak performance, please refer to the matrixMulCUBLAS example.
==27706== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
133.81ms  135.78us                    -               -         -         -         -  409.60KB  3.0167GB/s  GeForce GT 640M         1         2  [CUDA memcpy HtoD]
134.62ms  270.66us                    -               -         -         -         -  819.20KB  3.0267GB/s  GeForce GT 640M         1         2  [CUDA memcpy HtoD]
134.90ms  3.7037ms            (20 10 1)       (32 32 1)        29  8.1920KB        0B         -           -  GeForce GT 640M         1         2  void matrixMulCUDA&lt;int=32&gt;(float*, float*, float*, int, int) [94]
138.71ms  3.7011ms            (20 10 1)       (32 32 1)        29  8.1920KB        0B         -           -  GeForce GT 640M         1         2  void matrixMulCUDA&lt;int=32&gt;(float*, float*, float*, int, int) [105]
&lt;...more output...&gt;
1.24341s  3.7011ms            (20 10 1)       (32 32 1)        29  8.1920KB        0B         -           -  GeForce GT 640M         1         2  void matrixMulCUDA&lt;int=32&gt;(float*, float*, float*, int, int) [2191]
1.24711s  3.7046ms            (20 10 1)       (32 32 1)        29  8.1920KB        0B         -           -  GeForce GT 640M         1         2  void matrixMulCUDA&lt;int=32&gt;(float*, float*, float*, int, int) [2198]
1.25089s  248.13us                    -               -         -         -         -  819.20KB  3.3015GB/s  GeForce GT 640M         1         2  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
</pre><p class="p"><samp class="ph codeph">nvprof</samp> supports CUDA Dynamic Parallelism in GPU-trace mode. For host
                        kernel launch, the kernel ID will be shown. For device kernel launch, the kernel ID,
                        parent kernel ID and parent block will be shown. Here's an example:
                     </p><pre xml:space="preserve">$ nvprof --print-gpu-trace cdpSimpleQuicksort
==28128== NVPROF is profiling process 28128, command: cdpSimpleQuicksort
Running on GPU 0 (Tesla K20c)
Initializing data:
Running quicksort on 128 elements
Launching kernel on the GPU
Validating results: OK
==28128== Profiling application: cdpSimpleQuicksort
==28128== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream          ID   Parent ID         Parent Block  Name
192.76ms  1.2800us                    -               -         -         -         -      512B  400.00MB/s   Tesla K20c (0)         1         2           -           -                    -  [CUDA memcpy HtoD]
193.31ms  146.02us              (1 1 1)         (1 1 1)        32        0B        0B         -           -   Tesla K20c (0)         1         2           2           -                    -  cdp_simple_quicksort(unsigned int*, int, int, int) [171]
193.41ms  110.53us              (1 1 1)         (1 1 1)        32        0B      256B         -           -   Tesla K20c (0)         1         2          -5           2              (0 0 0)  cdp_simple_quicksort(unsigned int*, int, int, int)
193.45ms  125.57us              (1 1 1)         (1 1 1)        32        0B      256B         -           -   Tesla K20c (0)         1         2          -6           2              (0 0 0)  cdp_simple_quicksort(unsigned int*, int, int, int)
193.48ms  9.2480us              (1 1 1)         (1 1 1)        32        0B      256B         -           -   Tesla K20c (0)         1         2          -7          -5              (0 0 0)  cdp_simple_quicksort(unsigned int*, int, int, int)
193.52ms  107.23us              (1 1 1)         (1 1 1)        32        0B      256B         -           -   Tesla K20c (0)         1         2          -8          -5              (0 0 0)  cdp_simple_quicksort(unsigned int*, int, int, int)
193.53ms  93.824us              (1 1 1)         (1 1 1)        32        0B      256B         -           -   Tesla K20c (0)         1         2          -9          -6              (0 0 0)  cdp_simple_quicksort(unsigned int*, int, int, int)
193.57ms  117.47us              (1 1 1)         (1 1 1)        32        0B      256B         -           -   Tesla K20c (0)         1         2         -10          -6              (0 0 0)  cdp_simple_quicksort(unsigned int*, int, int, int)
193.58ms  5.0560us              (1 1 1)         (1 1 1)        32        0B      256B         -           -   Tesla K20c (0)         1         2         -11          -8              (0 0 0)  cdp_simple_quicksort(unsigned int*, int, int, int)
193.62ms  108.06us              (1 1 1)         (1 1 1)        32        0B      256B         -           -   Tesla K20c (0)         1         2         -12          -8              (0 0 0)  cdp_simple_quicksort(unsigned int*, int, int, int)
193.65ms  113.34us              (1 1 1)         (1 1 1)        32        0B      256B         -           -   Tesla K20c (0)         1         2         -13         -10              (0 0 0)  cdp_simple_quicksort(unsigned int*, int, int, int)
193.68ms  29.536us              (1 1 1)         (1 1 1)        32        0B      256B         -           -   Tesla K20c (0)         1         2         -14         -12              (0 0 0)  cdp_simple_quicksort(unsigned int*, int, int, int)
193.69ms  22.848us              (1 1 1)         (1 1 1)        32        0B      256B         -           -   Tesla K20c (0)         1         2         -15         -10              (0 0 0)  cdp_simple_quicksort(unsigned int*, int, int, int)
193.71ms  130.85us              (1 1 1)         (1 1 1)        32        0B      256B         -           -   Tesla K20c (0)         1         2         -16         -13              (0 0 0)  cdp_simple_quicksort(unsigned int*, int, int, int)
193.73ms  62.432us              (1 1 1)         (1 1 1)        32        0B      256B         -           -   Tesla K20c (0)         1         2         -17         -12              (0 0 0)  cdp_simple_quicksort(unsigned int*, int, int, int)
193.76ms  41.024us              (1 1 1)         (1 1 1)        32        0B      256B         -           -   Tesla K20c (0)         1         2         -18         -13              (0 0 0)  cdp_simple_quicksort(unsigned int*, int, int, int)
193.92ms  2.1760us                    -               -         -         -         -      512B  235.29MB/s   Tesla K20c (0)         1         2           -           -                    -  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
</pre><p class="p">API-trace mode shows the timeline of all CUDA runtime and
                        driver API calls invoked on the host in chronological
                        order. Here's an example:
                     </p><pre xml:space="preserve">$ nvprof --print-api-trace matrixMul
==27722== NVPROF is profiling process 27722, command: matrixMul
==27722== Profiling application: matrixMul
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "GeForce GT 640M LE" with compute capability 3.0

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 35.35 GFlop/s, Time= 3.708 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: OK

Note: For peak performance, please refer to the matrixMulCUBLAS example.
==27722== Profiling result:
   Start  Duration  Name
108.38ms  6.2130us  cuDeviceGetCount
108.42ms     840ns  cuDeviceGet
108.42ms  22.459us  cuDeviceGetName
108.45ms  11.782us  cuDeviceTotalMem
108.46ms     945ns  cuDeviceGetAttribute
149.37ms  23.737us  cudaLaunch (void matrixMulCUDA&lt;int=32&gt;(float*, float*, float*, int, int) [2198])
149.39ms  6.6290us  cudaEventRecord
149.40ms  1.10156s  cudaEventSynchronize
&lt;...more output...&gt;
1.25096s  21.543us  cudaEventElapsedTime
1.25103s  1.5462ms  cudaMemcpy
1.25467s  153.93us  cudaFree
1.25483s  75.373us  cudaFree
1.25491s  75.564us  cudaFree
1.25693s  10.901ms  cudaDeviceReset
</pre></div>
               </div>
               <div class="topic concept nested2" id="event-summary-mode"><a name="event-summary-mode" shape="rect">
                     <!-- --></a><h3 class="title topictitle2"><a href="#event-summary-mode" name="event-summary-mode" shape="rect">3.1.3.&nbsp;Event/metric Summary Mode</a></h3>
                  <div class="body conbody">
                     <p class="p">To see a list of all available events on a particular NVIDIA GPU,
                        type <samp class="ph codeph">nvprof --query-events</samp>. To see a list of all
                        available metrics on a particular NVIDIA GPU, type <samp class="ph codeph">nvprof
                           --query-metrics</samp>. <samp class="ph codeph">nvprof</samp>
                        is able to collect multiple events/metrics at the same time. Here's an
                        example:
                     </p><pre xml:space="preserve">$ nvprof --events warps_launched,branch --metrics ipc matrixMul
[Matrix Multiply Using CUDA] - Starting...
==60544== NVPROF is profiling process 60544, command: matrixMul
GPU Device 0: "GeForce GT 640M LE" with compute capability 3.0

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
==60544== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
done
Performance= 7.75 GFlop/s, Time= 16.910 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: OK

Note: For peak performance, please refer to the matrixMulCUBLAS example.
==60544== Profiling application: matrixMul
==60544== Profiling result:
==60544== Event result:
Invocations                      Event Name         Min         Max         Avg
Device "GeForce GT 640M LE (0)"
        Kernel: void matrixMulCUDA&lt;int=32&gt;(float*, float*, float*, int, int)
        301                  warps_launched        6400        6400        6400
        301                          branch       70400       70400       70400

==60544== Metric result:
Invocations                     Metric Name              Metric Description         Min         Max         Avg
Device "GeForce GT 640M LE (0)"
        Kernel: void matrixMulCUDA&lt;int=32&gt;(float*, float*, float*, int, int)
        301                             ipc                    Executed IPC    1.386412    1.393312    1.390278
</pre><div class="note note"><span class="notetitle">Note:</span> When collecting multiple events/metrics, <samp class="ph codeph">nvprof</samp>
                        uses kernel replay to execute each kernel multiple times as needed
                        to collect all the requested data. If a large number of events or
                        metrics are requested then a large number of replays may be required,
                        resulting in a significant increase in application execution time.
                     </div>
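                     <p class="p">The <samp class="ph codeph">warps_launched</samp> value in the sample output can be related to the kernel's launch configuration. The <samp class="ph codeph">matrixMul</samp> GPU trace earlier in this guide reports a grid size of (20 10 1) and a block size of (32 32 1) for this kernel. The Python sketch below is a rough cross-check, assuming a warp size of 32 threads and assuming that the event counts every warp of every block in a launch; the function name is hypothetical.</p>

```python
# Assumption: a warp holds 32 threads.
WARP_SIZE = 32


def warps_per_launch(grid, block):
    """Estimate the warps in one kernel launch from grid and block dimensions."""
    blocks = grid[0] * grid[1] * grid[2]
    threads_per_block = block[0] * block[1] * block[2]
    # Round up, because a partially filled warp still counts as a warp.
    warps_per_block = -(-threads_per_block // WARP_SIZE)
    return blocks * warps_per_block


# Grid and block sizes copied from the matrixMul GPU trace.
estimate = warps_per_launch((20, 10, 1), (32, 32, 1))
print(estimate)  # compare with the warps_launched value of 6400 above
```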
                  </div>
               </div>
               <div class="topic concept nested2" id="event-trace-mode"><a name="event-trace-mode" shape="rect">
                     <!-- --></a><h3 class="title topictitle2"><a href="#event-trace-mode" name="event-trace-mode" shape="rect">3.1.4.&nbsp;Event/metric Trace Mode</a></h3>
                  <div class="body conbody">
                     <p class="p">In event/metric trace mode, event and metric values are shown for each
                        kernel execution. By default, event and metric values are aggregated
                        across all units in the GPU. For example,
                        multiprocessor-specific events are aggregated across all
                        multiprocessors on the
                        GPU. If <samp class="ph codeph">--aggregate-mode off</samp> is specified,
                        the values for each unit are shown. In the following
                        example, the value of the <samp class="ph codeph">branch</samp> event is shown for each
                        multiprocessor on the GPU.
                     </p><pre xml:space="preserve">$ nvprof --aggregate-mode off --events branch --print-gpu-trace matrixMul
[Matrix Multiply Using CUDA] - Starting...
==60642== NVPROF is profiling process 60642, command: matrixMul
GPU Device 0: "GeForce GT 640M LE" with compute capability 3.0

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 23.73 GFlop/s, Time= 5.523 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: OK

Note: For peak performance, please refer to the matrixMulCUBLAS example.
==60642== Profiling application: matrixMul
==60642== Profiling result:
         Device          Context           Stream                Kernel  branch (0)  branch (1)
GeForce GT 640M                1                2  void matrixMulCUDA&lt;i       35200       35200
GeForce GT 640M                1                2  void matrixMulCUDA&lt;i       35200       35200
&lt;...more output...&gt;
</pre><div class="note note"><span class="notetitle">Note:</span> <samp class="ph codeph">--aggregate-mode</samp> also applies to metrics. However, some
                        metrics are only available in aggregate mode and some are only available
                        in non-aggregate mode.
                     </div>
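                     <p class="p">For the <samp class="ph codeph">branch</samp> event, the per-multiprocessor values in the trace above add up to the aggregated value of 70400 reported in the event summary example. The short Python sketch below shows that arithmetic. Treating aggregation as a plain sum is an assumption that fits this event in the sample output; other events or metrics may be combined differently.</p>

```python
# Per-multiprocessor values of the "branch" event for one kernel launch,
# copied from the "branch (0)" and "branch (1)" columns of the sample trace.
per_unit_branch = {0: 35200, 1: 35200}

# Assumption: the aggregated value is the sum over all multiprocessors.
aggregate_branch = sum(per_unit_branch.values())
print(aggregate_branch)  # compare with the aggregated branch value of 70400
```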
                  </div>
               </div>
            </div>
            <div class="topic concept nested1" id="profiling-controls"><a name="profiling-controls" shape="rect">
                  <!-- --></a><h3 class="title topictitle2"><a href="#profiling-controls" name="profiling-controls" shape="rect">3.2.&nbsp;Profiling Controls</a></h3>
               <div class="topic concept nested2" id="timeout"><a name="timeout" shape="rect">
                     <!-- --></a><h3 class="title topictitle2"><a href="#timeout" name="timeout" shape="rect">3.2.1.&nbsp;Timeout</a></h3>
                  <div class="body conbody">
                     <p class="p">A timeout (in seconds) can be provided to <samp class="ph codeph">nvprof</samp>.
                        The CUDA application being profiled will be killed by <samp class="ph codeph">nvprof</samp>
                        when the timeout expires. Profiling results collected before the timeout will be shown.
                     </p>
                     <div class="note note"><span class="notetitle">Note:</span> Timeout starts counting from the moment the
                        CUDA driver is initialized. If the application
                        doesn't call any CUDA APIs, timeout won't be
                        triggered.
                     </div>
                  </div>
               </div>
               <div class="topic concept nested2" id="concurrent-kernels"><a name="concurrent-kernels" shape="rect">
                     <!-- --></a><h3 class="title topictitle2"><a href="#concurrent-kernels" name="concurrent-kernels" shape="rect">3.2.2.&nbsp;Concurrent Kernels</a></h3>
                  <div class="body conbody">
                     <p class="p">Concurrent-kernel profiling is supported, and is turned on by default.
                        To turn the feature off, use the option <samp class="ph codeph">--concurrent-kernels off</samp>.
                        This forces concurrent kernel executions to be serialized when a CUDA
                        application is run with <samp class="ph codeph">nvprof</samp>.
                     </p>
                  </div>
               </div>
               <div class="topic concept nested2" id="profiling-scope"><a name="profiling-scope" shape="rect">
                     <!-- --></a><h3 class="title topictitle2"><a href="#profiling-scope" name="profiling-scope" shape="rect">3.2.3.&nbsp;Profiling Scope</a></h3>
                  <div class="body conbody">
                     <p class="p">When collecting events/metrics, <samp class="ph codeph">nvprof</samp> profiles all
                        kernels launched on all visible CUDA devices by default. This profiling
                        scope can be limited by the following options.
                     </p>
                     <p class="p"><samp class="ph codeph">--devices &lt;device IDs&gt;</samp> applies to the <samp class="ph codeph">--events</samp>,
                        <samp class="ph codeph">--metrics</samp>, <samp class="ph codeph">--query-events</samp> and <samp class="ph codeph">--query-metrics</samp>
                        options <strong class="ph b">that follow it</strong>. It
                        limits these options to collect events/metrics only on the devices specified by
                        <samp class="ph codeph">&lt;device IDs&gt;</samp>, which can be a comma-separated list of device ID numbers.
                     </p>
                     <p class="p"><samp class="ph codeph">--kernels &lt;kernel filter&gt;</samp> applies to the <samp class="ph codeph">--events</samp>
                        and <samp class="ph codeph">--metrics</samp> options <strong class="ph b">that follow it</strong>. It
                        limits these options to collect events/metrics only on the kernels specified by
                        <samp class="ph codeph">&lt;kernel filter&gt;</samp>, which has the following syntax:
                     </p><pre class="pre screen" xml:space="preserve">&lt;context id/name&gt;:&lt;stream id/name&gt;:&lt;kernel name&gt;:&lt;invocation&gt;</pre><p class="p">Each string in the angle brackets, except for invocation, can be a standard Perl regular expression.
                        An empty string matches any number or character combination. Invocation should be a positive number,
                        and indicates the <em class="ph i">n</em>th invocation of the kernel.
                     </p>
                     <p class="p">Both <samp class="ph codeph">--devices</samp> and <samp class="ph codeph">--kernels</samp> can be specified multiple
                        times, each with its own associated events/metrics.
                     </p>
                     <p class="p"><samp class="ph codeph">--events</samp>, <samp class="ph codeph">--metrics</samp>, <samp class="ph codeph">--query-events</samp> and
                        <samp class="ph codeph">--query-metrics</samp> are controlled by the nearest scope options before them.
                     </p>
                     <p class="p">As an example, the following command,</p><pre class="pre screen" xml:space="preserve">nvprof --devices 0 --metrics ipc --kernels "1:foo:bar:2" --events local_load a.out</pre><p class="p">collects metric <samp class="ph codeph">ipc</samp> on all kernels launched on device 0. It also
                        collects event <samp class="ph codeph">local_load</samp> for any kernel whose name contains <samp class="ph codeph">bar</samp>
                        and that is the second invocation launched on context 1, in the stream named <samp class="ph codeph">foo</samp>, on device 0.
                     </p>
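                     <p class="p">To make the filter syntax concrete, the Python sketch below splits a kernel filter into its four colon-separated fields and tests a kernel launch against it. It is an illustration only, not how <samp class="ph codeph">nvprof</samp> is implemented. The function names are hypothetical, Python's <samp class="ph codeph">re</samp> module stands in for Perl regular expressions, every field is matched with a substring-style search, all four fields are assumed to be present, and an empty invocation field is assumed to mean any invocation.</p>

```python
import re


def parse_kernel_filter(spec):
    """Split 'context:stream:kernel:invocation' into its four fields."""
    # Take the first two fields from the left and the last from the right,
    # so that colons inside the kernel pattern are preserved.
    context, stream, rest = spec.split(":", 2)
    kernel, invocation = rest.rsplit(":", 1)
    if invocation != "" and not (invocation.isdigit() and int(invocation) != 0):
        raise ValueError("invocation must be a positive number")
    return {"context": context, "stream": stream,
            "kernel": kernel, "invocation": invocation}


def filter_matches(parsed, context, stream, kernel, invocation):
    """Return True if one kernel launch satisfies the parsed filter."""
    launch = (("context", context), ("stream", stream), ("kernel", kernel))
    for field, value in launch:
        pattern = parsed[field]
        # An empty pattern matches anything.
        if pattern != "" and re.search(pattern, str(value)) is None:
            return False
    wanted = parsed["invocation"]
    # Assumption: an empty invocation field accepts every invocation.
    return wanted == "" or int(wanted) == invocation


parsed = parse_kernel_filter("1:foo:bar:2")
# Second invocation of a kernel whose name contains "bar": selected.
print(filter_matches(parsed, 1, "foo", "foobar_kernel", 2))
# First invocation of the same kernel: not selected.
print(filter_matches(parsed, 1, "foo", "foobar_kernel", 1))
```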
                  </div>
               </div>
               <div class="topic concept nested2" id="multiprocess-profiling"><a name="multiprocess-profiling" shape="rect">
                     <!-- --></a><h3 class="title topictitle2"><a href="#multiprocess-profiling" name="multiprocess-profiling" shape="rect">3.2.4.&nbsp;Multiprocess Profiling</a></h3>
                  <div class="body conbody">
                     <p class="p">By default, <samp class="ph codeph">nvprof</samp> only profiles the application
                        specified by the command-line argument. It doesn't trace child processes
                        launched by that process. To profile all processes launched by an application,
                        use the <samp class="ph codeph">--profile-child-process</samp> option.
                     </p>
                     <div class="note note"><span class="notetitle">Note:</span> <samp class="ph codeph">nvprof</samp> cannot profile processes that <samp class="ph codeph">fork()</samp>
                        but do not then <samp class="ph codeph">exec()</samp>.
                     </div>
                     <p class="p"><samp class="ph codeph">nvprof</samp> also has a "profile all processes" mode, in which
                        it profiles every CUDA process launched on the same system by the same user
                        who launched <samp class="ph codeph">nvprof</samp>. Exit this mode by typing "Ctrl-c".
                     </p>
                  </div>
               </div>
               <div class="topic concept nested2" id="system-profiling"><a name="system-profiling" shape="rect">
                     <!-- --></a><h3 class="title topictitle2"><a href="#system-profiling" name="system-profiling" shape="rect">3.2.5.&nbsp;System Profiling</a></h3>
                  <div class="body conbody">
                     <p class="p">For devices that support system profiling, <samp class="ph codeph">nvprof</samp> can enable
                        low-frequency sampling of the power, clock, and thermal behavior of
                        each GPU used by the application. This feature is turned off by
                        default. To turn it on, use <samp class="ph codeph">--system-profiling on</samp>.
                        To see the details of each sample point, combine this option with
                        <samp class="ph codeph">--print-gpu-trace</samp>.
                     </p>
                  </div>
               </div>
            </div>
            <div class="topic concept nested1" id="output"><a name="output" shape="rect">
                  <!-- --></a><h3 class="title topictitle2"><a href="#output" name="output" shape="rect">3.3.&nbsp;Output</a></h3>
               <div class="topic concept nested2" id="adjust-units"><a name="adjust-units" shape="rect">
                     <!-- --></a><h3 class="title topictitle2"><a href="#adjust-units" name="adjust-units" shape="rect">3.3.1.&nbsp;Adjust Units</a></h3>
                  <div class="body conbody">
                     <p class="p">By default, <samp class="ph codeph">nvprof</samp> adjusts the time units
                        automatically to get the most precise time
                        values. The <samp class="ph codeph">--normalized-time-unit</samp> option
                        can be used to get fixed time units throughout the results.
                     </p>
                  </div>
               </div>
               <div class="topic concept nested2" id="csv"><a name="csv" shape="rect">
                     <!-- --></a><h3 class="title topictitle2"><a href="#csv" name="csv" shape="rect">3.3.2.&nbsp;CSV</a></h3>
                  <div class="body conbody">
                     <p class="p">For each profiling mode, the option <samp class="ph codeph">--csv</samp> can
                        be used to generate output in comma-separated values (CSV)
                        format. The result can be imported directly into spreadsheet
                        software such as Excel.
                     </p>
                  </div>
               </div>
               <div class="topic concept nested2" id="export-import"><a name="export-import" shape="rect">
                     <!-- --></a><h3 class="title topictitle2"><a href="#export-import" name="export-import" shape="rect">3.3.3.&nbsp;Export/Import</a></h3>
                  <div class="body conbody">
                     <p class="p">For each profiling mode,
                        option <samp class="ph codeph">--output-profile</samp> can be used to
                        generate a result file. This file is not human-readable, but
                        can be imported to <samp class="ph codeph">nvprof</samp> using the
                        option <samp class="ph codeph">--import-profile</samp>, or into the
                        <a class="xref" href="index.html#visual-profiler" shape="rect">Visual Profiler</a>.
                     </p>
                  </div>
               </div>
               <div class="topic concept nested2" id="demangling"><a name="demangling" shape="rect">
                     <!-- --></a><h3 class="title topictitle2"><a href="#demangling" name="demangling" shape="rect">3.3.4.&nbsp;Demangling</a></h3>
                  <div class="body conbody">
                     <p class="p">By default, <samp class="ph codeph">nvprof</samp> demangles C++ function names.
                        Use option <samp class="ph codeph">--demangling off</samp> to turn this feature off.
                     </p>
                  </div>
               </div>
               <div class="topic concept nested2" id="redirecting-output"><a name="redirecting-output" shape="rect">
                     <!-- --></a><h3 class="title topictitle2"><a href="#redirecting-output" name="redirecting-output" shape="rect">3.3.5.&nbsp;Redirecting Output</a></h3>
                  <div class="body conbody">
                     <p class="p">By default, <samp class="ph codeph">nvprof</samp> sends most of its output to <samp class="ph codeph">stderr</samp>.
                        To redirect the output, use <samp class="ph codeph">--log-file</samp>. <samp class="ph codeph">--log-file %1</samp> tells
                        <samp class="ph codeph">nvprof</samp> to redirect all output to <samp class="ph codeph">stdout</samp>. <samp class="ph codeph">--log-file
                           &lt;filename&gt;</samp> redirects output to a file. Use <samp class="ph codeph">%p</samp> in the filename
                        to insert the process ID of <samp class="ph codeph">nvprof</samp>, <samp class="ph codeph">%h</samp> to insert the hostname,
                        and <samp class="ph codeph">%%</samp> to insert a literal <samp class="ph codeph">%</samp>.
                     </p>
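                     <p class="p">The two forms might be used as sketched below. The application name
                        <samp class="ph codeph">./myApp</samp> and the log file name pattern are placeholders,
                        not names defined by <samp class="ph codeph">nvprof</samp>.
                     </p>
                     <pre class="pre screen" xml:space="preserve"># Send all nvprof output to stdout
nvprof --log-file %1 ./myApp

# Write output to a file named after the hostname and the nvprof process ID
nvprof --log-file nvprof.%h.%p.log ./myApp</pre>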
                  </div>
               </div>
            </div>
            <div class="topic concept nested1" id="limitations"><a name="limitations" shape="rect">
                  <!-- --></a><h3 class="title topictitle2"><a href="#limitations" name="limitations" shape="rect">3.4.&nbsp;Limitations</a></h3>
               <div class="body conbody">
                  <p class="p">
                     This section documents some <samp class="ph codeph">nvprof</samp> limitations.
                     
                  </p>
                  <ul class="ul">
                     <li class="li">For some metrics, the required events can only be collected
                        for a single CUDA context. For an application that uses multiple
                        CUDA contexts, these metrics will only be collected for one of
                        the contexts. The metrics that can be collected only for a
                        single CUDA context are indicated in the <a class="xref" href="index.html#metrics-reference" shape="rect">metric reference tables</a>.
                     </li>
                     <li class="li">The warp_nonpred_execution_efficiency metric is only
                        available on compute capability 3.5 and later devices.
                     </li>
                     <li class="li">The warp_execution_efficiency metric is not available on compute
                        capability 3.0 devices.
                     </li>
                     <li class="li">The branch_efficiency metric is not available on compute
                        capability 3.5 devices.
                     </li>
                     <li class="li">For compute capability 2.x devices, the achieved_occupancy
                        metric can report inaccurate values that are greater than the
                        actual achieved occupancy. In rare cases this can cause the
                        achieved occupancy value to exceed the theoretical occupancy
                        value for the kernel.
                     </li>
                     <li class="li"><samp class="ph codeph">nvprof</samp> cannot profile processes that
                        <samp class="ph codeph">fork()</samp> but do not then
                        <samp class="ph codeph">exec()</samp>.
                     </li>
                     <li class="li">The timestamps collected for applications running on GPUs in
                        an SLI configuration are incorrect. As a result, most profiling
                        results collected for the application will be invalid.
                     </li>
                     <li class="li">Concurrent kernel mode can add significant overhead if used
                        on kernels that execute a large number of blocks and that have
                        short execution durations.
                     </li>
                     <li class="li">If the kernel launch rate is very high, the device memory
                        used to collect profiling data can run out. In such a case some
                        profiling data might be dropped. This will be indicated by a warning.
                     </li>
                     <li class="li"><samp class="ph codeph">nvprof</samp> assumes it has access to the temporary directory
                        on the system, which it uses to store temporary profiling data. On Linux/Mac,
                        the default is <samp class="ph codeph">/tmp</samp>. On Windows, the location is specified by the
                        system environment variables. To specify a custom location,
                        set <samp class="ph codeph">$TMPDIR</samp> on Linux/Mac or <samp class="ph codeph">%TMP%</samp> on
                        Windows.
                     </li>
                  </ul>
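                  <p class="p">For the temporary directory limitation above, a minimal Linux/Mac sketch could look like the
                     following. The directory <samp class="ph codeph">/scratch/nvprof-tmp</samp> and the application
                     name <samp class="ph codeph">./myApp</samp> are hypothetical, the directory is assumed to
                     already exist and be writable, and a Bourne-style shell is assumed.
                  </p>
                  <pre class="pre screen" xml:space="preserve"># Point nvprof at a custom temporary directory (hypothetical path)
export TMPDIR=/scratch/nvprof-tmp
nvprof ./myApp</pre>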
               </div>
            </div>
         </div>
         <div class="topic concept nested0" id="compute-command-line-profiler-overview"><a name="compute-command-line-profiler-overview" shape="rect">
               <!-- --></a><h2 class="title topictitle1"><a href="#compute-command-line-profiler-overview" name="compute-command-line-profiler-overview" shape="rect">Command Line Profiler</a></h2>
            <div class="body conbody">
               <p class="p">The Command Line Profiler is a profiling tool that can be used
                  to measure performance and find potential opportunities for
                  optimization for CUDA applications executing on NVIDIA GPUs.
                  The command line profiler allows users to gather timing
                  information about kernel execution and memory transfer
                  operations. Profiling options are controlled through environment
                  variables and a profiler configuration file. Profiler output is
                  generated in text files in either Key-Value-Pair (KVP) or
                  Comma-Separated Values (CSV) format.
                  
               </p>
            </div>
            <div class="topic concept nested1" id="command-line-profiler-control"><a name="command-line-profiler-control" shape="rect">
                  <!-- --></a><h3 class="title topictitle2"><a href="#command-line-profiler-control" name="command-line-profiler-control" shape="rect">4.1.&nbsp;Command Line Profiler Control</a></h3>
               <div class="body conbody">
                  <p class="p">The command line profiler is controlled using the following environment
                     		variables:
                     	 
                  </p>
                  <p class="p"><strong class="ph b">COMPUTE_PROFILE:</strong> is set to 1 to enable profiling, or to 0 (or left unset) to
                     disable profiling.
                     	 
                  </p>
                  <p class="p"><strong class="ph b">COMPUTE_PROFILE_LOG:</strong> is set to the desired file path for profiling
                     output. If the application uses multiple contexts, you must add '%d' to the
                     COMPUTE_PROFILE_LOG name. This generates a separate profiler output file for
                     each context, with '%d' substituted by the context number. Contexts are
                     numbered starting with zero. If the application uses multiple processes, you must add '%p' to
                     the COMPUTE_PROFILE_LOG name. This generates a separate profiler output file
                     for each process, with '%p' substituted by the process ID. If no log
                     path is specified, the profiler logs data to "cuda_profile_%d.log" for
                     a CUDA context ('%d' is substituted by the context number).
                     	 
                  </p>
                  <p class="p"><strong class="ph b">COMPUTE_PROFILE_CSV</strong>: is set to either 1 (set) or 0 (unset) to
                     enable or disable a comma-separated version of the log output.
                     	 
                  </p>
                  <p class="p"><strong class="ph b">COMPUTE_PROFILE_CONFIG</strong>: is used to specify a config file for
                     		selecting profiling options and performance counters.
                     	 
                  </p>
                  <p class="p">Configuration details are covered in a subsequent section.
                     	 
                  </p>
                  <p class="p">The following older environment variables, which provide the same
                     functionality as the variables above, are still supported:
                     	 
                  </p>
                  <p class="p"><strong class="ph b">CUDA_PROFILE</strong></p>
                  <p class="p"><strong class="ph b">CUDA_PROFILE_LOG</strong></p>
                  <p class="p"><strong class="ph b">CUDA_PROFILE_CSV</strong></p>
                  <p class="p"><strong class="ph b">CUDA_PROFILE_CONFIG</strong></p>
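                  <p class="p">Taken together, the variables described above might be set as in the following sketch.
                     It assumes a Bourne-style shell. The log file name, the configuration file name
                     <samp class="ph codeph">profiler.cfg</samp>, and the application name
                     <samp class="ph codeph">./myApp</samp> are placeholders chosen for illustration.
                  </p>
                  <pre class="pre screen" xml:space="preserve"># Enable the command line profiler
export COMPUTE_PROFILE=1

# One log file per process (%p) and per context (%d)
export COMPUTE_PROFILE_LOG=myApp_%p_%d.log

# Write the log in comma-separated format
export COMPUTE_PROFILE_CSV=1

# Read profiler options from a configuration file (placeholder name)
export COMPUTE_PROFILE_CONFIG=profiler.cfg

# Run the application; the profiler writes its log as the application runs
./myApp</pre>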
               </div>
            </div>
            <div class="topic concept nested1" id="command-line-profiler-default-output"><a name="command-line-profiler-default-output" shape="rect">
                  <!-- --></a><h3 class="title topictitle2"><a href="#command-line-profiler-default-output" name="command-line-profiler-default-output" shape="rect">4.2.&nbsp;Command Line Profiler Default Output</a></h3>
               <div class="body conbody">
                  <p class="p"><a class="xref" href="index.html#command-line-profiler-default-output__command-line-profiler-default-columns" shape="rect">Table 1</a> describes the columns that are output in the profiler log
                     		by default.
                     	 
                  </p>
                  <p class="p"></p>
                  <div class="tablenoborder"><a name="command-line-profiler-default-output__command-line-profiler-default-columns" shape="rect">
                        <!-- --></a><table cellpadding="4" cellspacing="0" summary="" id="command-line-profiler-default-output__command-line-profiler-default-columns" class="table" frame="border" border="1" rules="all">
                        <caption><span class="tablecap">Table 1. Command Line Profiler Default Columns</span></caption>
                        <thead class="thead" align="left">
                           <tr class="row">
                              <th class="entry" valign="top" width="28.57142857142857%" id="d54e2370" rowspan="1" colspan="1">Column</th>
                              <th class="entry" valign="top" width="71.42857142857143%" id="d54e2373" rowspan="1" colspan="1">Description</th>
                           </tr>
                        </thead>
                        <tbody class="tbody">
                           <tr class="row">
                              <td class="entry" valign="top" width="28.57142857142857%" headers="d54e2370" rowspan="1" colspan="1">
                                 <p class="p"> method
                                    				
                                 </p>
                              </td>
                              <td class="entry" valign="top" width="71.42857142857143%" headers="d54e2373" rowspan="1" colspan="1">
                                 <p class="p">This is a character string that gives the name of the GPU kernel
                                    or memory copy method. For kernels, the method name is the mangled name
                                    generated by the compiler.
                                    				
                                 </p>
                              </td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="28.57142857142857%" headers="d54e2370" rowspan="1" colspan="1">
                                 <p class="p"> gputime
                                    				
                                 </p>
                              </td>
                              <td class="entry" valign="top" width="71.42857142857143%" headers="d54e2373" rowspan="1" colspan="1">
                                 <p class="p">This column gives the execution time for the GPU kernel or
                                    				  memory copy method. This value is calculated as (gpuendtimestamp -
                                    				  gpustarttimestamp)/1000.0. The column value is a single precision floating
                                    				  point value in microseconds.
                                    				
                                 </p>
                              </td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="28.57142857142857%" headers="d54e2370" rowspan="1" colspan="1">
                                 <p class="p"> cputime
                                    				
                                 </p>
                              </td>
                              <td class="entry" valign="top" width="71.42857142857143%" headers="d54e2373" rowspan="1" colspan="1">
                                 <p class="p">For non-blocking methods the cputime is only the CPU or host
                                    				  side overhead to launch the method. In this case:
                                    				
                                 </p>
                                 <p class="p">walltime = cputime + gputime
                                    				
                                 </p>
                                 <p class="p">For blocking methods cputime is the sum of gputime and CPU
                                    				  overhead. In this case:
                                    				
                                 </p>
                                 <p class="p">walltime = cputime
                                    				
                                 </p>
                                 <p class="p">Note that all kernel launches are non-blocking by default. However, if any
                                    of the profiler counters are enabled, kernel launches are blocking.
                                    Asynchronous memory copy requests in different streams are also non-blocking.
                                    				
                                 </p>
                                 <p class="p">The column value is a single precision floating point value in
                                    				  microseconds.
                                    				
                                 </p>
                              </td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="28.57142857142857%" headers="d54e2370" rowspan="1" colspan="1">
                                 <p class="p"> occupancy
                                    				
                                 </p>
                              </td>
                              <td class="entry" valign="top" width="71.42857142857143%" headers="d54e2373" rowspan="1" colspan="1">
                                 <p class="p">This column gives the multiprocessor occupancy, which is the
                                    ratio of the number of active warps to the maximum number of warps supported on a
                                    multiprocessor of the GPU. This is helpful in determining how effectively the GPU is kept busy.
                                    This column is output only for GPU kernels, and the
                                    column value is a single precision floating point value in the range 0.0 to
                                    1.0.
                                    				
                                 </p>
                              </td>
                           </tr>
                        </tbody>
                     </table>
                  </div>
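                  <p class="p">As a worked illustration of the gputime formula in Table 1, consider the timestamp pair below.
                     The values are invented for this example and do not come from a real profiling run. They stand in
                     for the gpustarttimestamp and gpuendtimestamp values that the formula refers to, taken here to be
                     in nanoseconds and written in hexadecimal.
                  </p>
                  <pre class="pre screen" xml:space="preserve">gpustarttimestamp = 0x3b9aca00   (1000000000 ns)
gpuendtimestamp   = 0x3b9ad5b8   (1000003000 ns)

gputime = (gpuendtimestamp - gpustarttimestamp) / 1000.0
        = (1000003000 - 1000000000) / 1000.0
        = 3.0 microseconds</pre>
                  <p class="p">If this were a non-blocking method with an assumed cputime of 5.0 microseconds, the
                     relation walltime = cputime + gputime would give a walltime of 8.0 microseconds.
                  </p>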
               </div>
            </div>
            <div class="topic concept nested1" id="command-line-profiler-configuration"><a name="command-line-profiler-configuration" shape="rect">
                  <!-- --></a><h3 class="title topictitle2"><a href="#command-line-profiler-configuration" name="command-line-profiler-configuration" shape="rect">4.3.&nbsp;Command Line Profiler Configuration</a></h3>
               <div class="body conbody">
                  <p class="p">The profiler configuration file is used to select the profiler options
                     and counters that are to be collected during application execution. The
                     configuration file is a simple text file with one option on each line.
                     Options can be commented out using the <samp class="ph codeph">#</samp> character at the start of a line.
                     Refer to the command line profiler options table for the column names in the
                     profiler output for each profiler configuration option.
                     	 
                  </p>
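                  <p class="p">A configuration file along these lines might look like the sketch below. The selection of
                     options is arbitrary and only meant to show the one-option-per-line layout and a commented-out
                     line; the option names are taken from the command line profiler options table.
                  </p>
                  <pre class="pre screen" xml:space="preserve"># Illustrative profiler configuration file
gpustarttimestamp
gpuendtimestamp
gridsize3d
threadblocksize
# regperthread
memtransfersize</pre>
                  <p class="p">The file would then be selected by setting COMPUTE_PROFILE_CONFIG to its path.
                  </p>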
               </div>
               <div class="topic concept nested2" id="command-line-profiler-options"><a name="command-line-profiler-options" shape="rect">
                     <!-- --></a><h3 class="title topictitle2"><a href="#command-line-profiler-options" name="command-line-profiler-options" shape="rect">4.3.1.&nbsp;Command Line Profiler Options</a></h3>
                  <div class="body conbody">
                     <p class="p"><a class="xref" href="index.html#command-line-profiler-options__command-line-profiler-options-table" shape="rect">Table 2</a> contains the options supported by the command line
                        
                        		profiler. Note the following regarding the profiler log that is produced from
                        
                        		the different options:
                        
                        	 
                     </p>
                     <ul class="ul">
                        <li class="li">Typically, each profiler
                           option corresponds to a single output column. There are a few exceptions in
                           which multiple columns are output; these are noted where applicable in
                           <a class="xref" href="index.html#command-line-profiler-options__command-line-profiler-options-table" shape="rect">Table 2</a>.
                           
                           		
                        </li>
                        <li class="li">In most cases the column
                           
                           		  name is the same as the option name; the exceptions are listed in 
                           
                           		  <a class="xref" href="index.html#command-line-profiler-options__command-line-profiler-options-table" shape="rect">Table 2</a>.
                           
                           		
                        </li>
                        <li class="li">In most cases the column
                           
                           		  values are 32-bit integers in decimal format; the exceptions are listed in 
                           
                           		  <a class="xref" href="index.html#command-line-profiler-options__command-line-profiler-options-table" shape="rect">Table 2</a>.
                           
                           		
                        </li>
                     </ul>
                     <div class="tablenoborder"><a name="command-line-profiler-options__command-line-profiler-options-table" shape="rect">
                           <!-- --></a><table cellpadding="4" cellspacing="0" summary="" id="command-line-profiler-options__command-line-profiler-options-table" class="table" frame="border" border="1" rules="all">
                           <caption><span class="tablecap">Table 2. Command Line Profiler Options</span></caption>
                           <thead class="thead" align="left">
                              <tr class="row">
                                 <th class="entry" valign="top" width="28.57142857142857%" id="d54e2524" rowspan="1" colspan="1">Option</th>
                                 <th class="entry" valign="top" width="71.42857142857143%" id="d54e2527" rowspan="1" colspan="1">Description</th>
                              </tr>
                           </thead>
                           <tbody class="tbody">
                              <tr class="row">
                                 <td class="entry" valign="top" width="28.57142857142857%" headers="d54e2524" rowspan="1" colspan="1">
                                    <p class="p"> gpustarttimestamp
                                       
                                       				
                                    </p>
                                 </td>
                                 <td class="entry" valign="top" width="71.42857142857143%" headers="d54e2527" rowspan="1" colspan="1">
                                    <p class="p">Time stamp when a kernel or memory transfer starts.
                                       
                                       				
                                    </p>
                                    <p class="p">The column values are 64-bit unsigned values in nanoseconds, in
                                       hexadecimal format.
                                       
                                       				
                                    </p>
                                 </td>
                              </tr>
                              <tr class="row">
                                 <td class="entry" valign="top" width="28.57142857142857%" headers="d54e2524" rowspan="1" colspan="1">
                                    <p class="p"> gpuendtimestamp
                                       
                                       				
                                    </p>
                                 </td>
                                 <td class="entry" valign="top" width="71.42857142857143%" headers="d54e2527" rowspan="1" colspan="1">
                                    <p class="p">Time stamp when a kernel or memory transfer completes.
                                       
                                       				
                                    </p>
                                    <p class="p">The column values are 64-bit unsigned values in nanoseconds, in
                                       hexadecimal format.
                                       
                                       				
                                    </p>
                                 </td>
                              </tr>
                              <tr class="row">
                                 <td class="entry" valign="top" width="28.57142857142857%" headers="d54e2524" rowspan="1" colspan="1">
                                    <p class="p"> timestamp
                                       
                                       				
                                    </p>
                                 </td>
                                 <td class="entry" valign="top" width="71.42857142857143%" headers="d54e2527" rowspan="1" colspan="1">
                                    <p class="p">Time stamp when a kernel or memory transfer starts. The column
                                       values are single precision floating point values in microseconds. Use of the
                                       gpustarttimestamp column is recommended, as it provides a more accurate time
                                       stamp.
                                       
                                       				
                                    </p>
                                 </td>
                              </tr>
                              <tr class="row">
                                 <td class="entry" valign="top" width="28.57142857142857%" headers="d54e2524" rowspan="1" colspan="1">
                                    <p class="p"> gridsize
                                       
                                       				
                                    </p>
                                 </td>
                                 <td class="entry" valign="top" width="71.42857142857143%" headers="d54e2527" rowspan="1" colspan="1">
                                    <p class="p">Number of blocks in a grid along the X and Y dimensions for a
                                       
                                       				  kernel launch.
                                       
                                       				
                                    </p>
                                    <p class="p">This option outputs the following two columns:
                                       
                                       				
                                    </p>
                                    <div class="p">
                                       <ul class="ul">
                                          <li class="li">gridsizeX
                                             
                                             				  
                                          </li>
                                          <li class="li">gridsizeY
                                             
                                             				  
                                          </li>
                                       </ul>
                                    </div>
                                 </td>
                              </tr>
                              <tr class="row">
                                 <td class="entry" valign="top" width="28.57142857142857%" headers="d54e2524" rowspan="1" colspan="1">
                                    <p class="p"> gridsize3d
                                       
                                       				
                                    </p>
                                 </td>
                                 <td class="entry" valign="top" width="71.42857142857143%" headers="d54e2527" rowspan="1" colspan="1">
                                    <p class="p">Number of blocks in a grid along the X, Y and Z dimensions for a
                                       
                                       				  kernel launch.
                                       
                                       				
                                    </p>
                                    <p class="p">This option outputs the following three columns:
                                       
                                       				
                                    </p>
                                    <ul class="ul">
                                       <li class="li">gridsizeX
                                          
                                          				  
                                       </li>
                                       <li class="li">gridsizeY
                                          
                                          				  
                                       </li>
                                       <li class="li">gridsizeZ
                                          
                                          				  
                                       </li>
                                    </ul>
                                 </td>
                              </tr>
                              <tr class="row">
                                 <td class="entry" valign="top" width="28.57142857142857%" headers="d54e2524" rowspan="1" colspan="1">
                                    <p class="p"> threadblocksize
                                       
                                       				
                                    </p>
                                 </td>
                                 <td class="entry" valign="top" width="71.42857142857143%" headers="d54e2527" rowspan="1" colspan="1">
                                    <p class="p">Number of threads in a block along the X, Y and Z dimensions for
                                       
                                       				  a kernel launch.
                                       
                                       				
                                    </p>
                                    <p class="p">This option outputs the following three columns:
                                       
                                       				
                                    </p>
                                    <div class="p">
                                       <ul class="ul">
                                          <li class="li">threadblocksizeX
                                             
                                             				  
                                          </li>
                                          <li class="li">threadblocksizeY
                                             
                                             				  
                                          </li>
                                          <li class="li">threadblocksizeZ
                                             
                                             				  
                                          </li>
                                       </ul>
                                    </div>
                                 </td>
                              </tr>
                              <tr class="row">
                                 <td class="entry" valign="top" width="28.57142857142857%" headers="d54e2524" rowspan="1" colspan="1">
                                    <p class="p"> dynsmemperblock
                                       
                                       				
                                    </p>
                                 </td>
                                 <td class="entry" valign="top" width="71.42857142857143%" headers="d54e2527" rowspan="1" colspan="1">
                                    <p class="p">Size of dynamically allocated shared memory per block in bytes
                                       
                                       for a kernel launch (CUDA only).
                                       
                                       				
                                    </p>
                                 </td>
                              </tr>
                              <tr class="row">
                                 <td class="entry" valign="top" width="28.57142857142857%" headers="d54e2524" rowspan="1" colspan="1">
                                    <p class="p"> stasmemperblock
                                       
                                       				
                                    </p>
                                 </td>
                                 <td class="entry" valign="top" width="71.42857142857143%" headers="d54e2527" rowspan="1" colspan="1">
                                    <p class="p">Size of statically allocated shared memory per block in bytes
                                       
                                       				  for a kernel launch.
                                       
                                       				
                                    </p>
                                 </td>
                              </tr>
                              <tr class="row">
                                 <td class="entry" valign="top" width="28.57142857142857%" headers="d54e2524" rowspan="1" colspan="1">
                                    <p class="p"> regperthread
                                       
                                       				
                                    </p>
                                 </td>
                                 <td class="entry" valign="top" width="71.42857142857143%" headers="d54e2527" rowspan="1" colspan="1">
                                    <p class="p">Number of registers used per thread for a kernel launch.
                                       
                                       				
                                    </p>
                                 </td>
                              </tr>
                              <tr class="row">
                                 <td class="entry" valign="top" width="28.57142857142857%" headers="d54e2524" rowspan="1" colspan="1">
                                    <p class="p"> memtransferdir
                                       
                                       				
                                    </p>
                                 </td>
                                 <td class="entry" valign="top" width="71.42857142857143%" headers="d54e2527" rowspan="1" colspan="1">
                                    <p class="p">Memory transfer direction. A value of 0 is used for
                                       host to device memory copies and a value of 1 is used for device to host memory
                                       copies.
                                       
                                       				
                                    </p>
                                 </td>
                              </tr>
                              <tr class="row">
                                 <td class="entry" valign="top" width="28.57142857142857%" headers="d54e2524" rowspan="1" colspan="1">
                                    <p class="p"> memtransfersize
                                       
                                       				
                                    </p>
                                 </td>
                                 <td class="entry" valign="top" width="71.42857142857143%" headers="d54e2527" rowspan="1" colspan="1">
                                    <p class="p">Memory transfer size in bytes. This option shows the amount of
                                       memory transferred from the source (host/device) to the destination (host/device).
                                       
                                       				
                                    </p>
                                 </td>
                              </tr>
                              <tr class="row">
                                 <td class="entry" valign="top" width="28.57142857142857%" headers="d54e2524" rowspan="1" colspan="1">
                                    <p class="p"> memtransferhostmemtype
                                       
                                       				
                                    </p>
                                 </td>
                                 <td class="entry" valign="top" width="71.42857142857143%" headers="d54e2527" rowspan="1" colspan="1">
                                    <p class="p">Host memory type (pageable or page-locked). This option indicates
                                       whether the host memory used in a memory transfer is pageable or
                                       page-locked.
                                       
                                       				
                                    </p>
                                 </td>
                              </tr>
                              <tr class="row">
                                 <td class="entry" valign="top" width="28.57142857142857%" headers="d54e2524" rowspan="1" colspan="1">
                                    <p class="p"> streamid
                                       
                                       				
                                    </p>
                                 </td>
                                 <td class="entry" valign="top" width="71.42857142857143%" headers="d54e2527" rowspan="1" colspan="1">
                                    <p class="p">Stream Id for a kernel launch or a memory transfer.
                                       
                                       				
                                    </p>
                                 </td>
                              </tr>
                              <tr class="row">
                                 <td class="entry" valign="top" width="28.57142857142857%" headers="d54e2524" rowspan="1" colspan="1">
                                    <p class="p"> localblocksize
                                       
                                       				
                                    </p>
                                 </td>
                                 <td class="entry" valign="top" width="71.42857142857143%" headers="d54e2527" rowspan="1" colspan="1">
                                    <p class="p">This option is no longer supported. If it is selected,
                                       all values in the column will be -1.
                                    </p>
                                    <p class="p">This option outputs the following column:
                                       
                                       				
                                    </p>
                                    <div class="p">
                                       <ul class="ul">
                                          <li class="li">localworkgroupsize
                                             
                                             				  
                                          </li>
                                       </ul>
                                    </div>
                                 </td>
                              </tr>
                              <tr class="row">
                                 <td class="entry" valign="top" width="28.57142857142857%" headers="d54e2524" rowspan="1" colspan="1">
                                    <p class="p"> cacheconfigrequested
                                       
                                       				
                                    </p>
                                 </td>
                                 <td class="entry" valign="top" width="71.42857142857143%" headers="d54e2527" rowspan="1" colspan="1">
                                    <p class="p">Requested cache configuration option for a kernel launch:
                                       
                                       				
                                    </p>
                                    <ul class="ul">
                                       <li class="li">0 CU_FUNC_CACHE_PREFER_NONE - no preference for shared memory or
                                          
                                          				  L1 (default)
                                          
                                          				
                                       </li>
                                       <li class="li">1 CU_FUNC_CACHE_PREFER_SHARED - prefer larger shared memory and
                                          
                                          				  smaller L1 cache
                                          
                                          				
                                       </li>
                                       <li class="li">2 CU_FUNC_CACHE_PREFER_L1 - prefer larger L1 cache and smaller
                                          
                                          				  shared memory
                                          
                                          				
                                       </li>
                                       <li class="li">3 CU_FUNC_CACHE_PREFER_EQUAL - prefer equal sized L1 cache and
                                          
                                          				  shared memory
                                          
                                          				
                                       </li>
                                    </ul>
                                 </td>
                              </tr>
                              <tr class="row">
                                 <td class="entry" valign="top" width="28.57142857142857%" headers="d54e2524" rowspan="1" colspan="1">
                                    <p class="p"> cacheconfigexecuted
                                       
                                       				
                                    </p>
                                 </td>
                                 <td class="entry" valign="top" width="71.42857142857143%" headers="d54e2527" rowspan="1" colspan="1">
                                    <p class="p">Cache configuration that was used for the kernel launch. The
                                       values are the same as those listed under cacheconfigrequested.
                                       
                                       				
                                    </p>
                                 </td>
                              </tr>
                              <tr class="row">
                                 <td class="entry" valign="top" width="28.57142857142857%" headers="d54e2524" rowspan="1" colspan="1">
                                    <p class="p"> cudadevice &lt;device_index&gt;
                                       
                                       				
                                    </p>
                                 </td>
                                 <td class="entry" valign="top" width="71.42857142857143%" headers="d54e2527" rowspan="1" colspan="1">
                                    <p class="p">This can be used to select different counters for different CUDA
                                       
                                       				  devices. All counters after this option are selected only for a CUDA device
                                       
                                       				  with index &lt;device_index&gt;.
                                       
                                       				
                                    </p>
                                    <p class="p">&lt;device_index&gt; is an integer value specifying the CUDA
                                       
                                       				  device index.
                                       
                                       				
                                    </p>
                                    <p class="p">Example: To select counterA for all devices, counterB for CUDA
                                       
                                       				  device 0 and counterC for CUDA device 1:
                                       
                                       				
                                    </p><pre xml:space="preserve">counterA 
cudadevice 0
counterB
cudadevice 1
counterC</pre></td>
                              </tr>
                              <tr class="row">
                                 <td class="entry" valign="top" width="28.57142857142857%" headers="d54e2524" rowspan="1" colspan="1">
                                    <p class="p"> profilelogformat [CSV|KVP]
                                       
                                       				
                                    </p>
                                 </td>
                                 <td class="entry" valign="top" width="71.42857142857143%" headers="d54e2527" rowspan="1" colspan="1">
                                    <p class="p">Choose format for profiler log.
                                       
                                       				
                                    </p>
                                    <ul class="ul">
                                       <li class="li">CSV: Comma separated format
                                          
                                          				
                                       </li>
                                       <li class="li">KVP: Key Value Pair format
                                          
                                          				
                                       </li>
                                    </ul>
                                    <p class="p">The default format is KVP.
                                       
                                       				
                                    </p>
                                    <p class="p">This option will override the format selected using the
                                       
                                       				  environment variable COMPUTE_PROFILE_CSV.
                                       
                                       				
                                    </p>
                                 </td>
                              </tr>
                              <tr class="row">
                                 <td class="entry" valign="top" width="28.57142857142857%" headers="d54e2524" rowspan="1" colspan="1">
                                    <p class="p"> countermodeaggregate
                                       
                                       				
                                    </p>
                                 </td>
                                 <td class="entry" valign="top" width="71.42857142857143%" headers="d54e2527" rowspan="1" colspan="1">
                                    <p class="p">If this option is selected, aggregate counter values are
                                       output. For an SM counter, the counter value is the sum of the counter values
                                       from all SMs. For l1*, tex*, sm_cta_launched, uncached_global_load_transaction
                                       and global_store_transaction counters, the counter value is collected for one SM
                                       from each GPC and extrapolated to all SMs. This option is supported only
                                       for CUDA devices with compute capability 2.0 or higher.
                                       
                                       				
                                    </p>
                                 </td>
                              </tr>
                              <tr class="row">
                                 <td class="entry" valign="top" width="28.57142857142857%" headers="d54e2524" rowspan="1" colspan="1">
                                    <p class="p"> conckerneltrace
                                       
                                       				
                                    </p>
                                 </td>
                                 <td class="entry" valign="top" width="71.42857142857143%" headers="d54e2527" rowspan="1" colspan="1">
                                    <p class="p">Use this option to get GPU start and end timestamp values when kernels execute concurrently.
                                       Without this option, execution of concurrent kernels is serialized and the timestamps are not correct.
                                       Only CUDA devices with compute capability 2.0 or higher support concurrent execution of multiple kernels.
                                       When this option is enabled, additional code is inserted for each kernel, which results in some additional execution
                                       overhead.
                                       This option cannot be used together with profiler counters. If a counter is given in the configuration file
                                       along with "conckerneltrace", a warning is printed in the profiler output file and the counter is not enabled.
                                       
                                       				
                                    </p>
                                 </td>
                              </tr>
                              <tr class="row">
                                 <td class="entry" valign="top" width="28.57142857142857%" headers="d54e2524" rowspan="1" colspan="1">
                                    <p class="p"> enableonstart 0|1
                                       
                                       				
                                    </p>
                                 </td>
                                 <td class="entry" valign="top" width="71.42857142857143%" headers="d54e2527" rowspan="1" colspan="1">
                                    <p class="p"> Use the <samp class="ph codeph">enableonstart 1</samp> option to enable, or <samp class="ph codeph">enableonstart 0</samp>
                                       to disable, profiling from the start of application execution. If this option is not used,
                                       profiling is enabled from the start by default.
                                       
                                       To limit profiling to a region of your application, CUDA
                                       
                                       provides functions to start and stop profile data
                                       
                                       collection. <samp class="ph codeph">cudaProfilerStart()</samp> is used to
                                       
                                       start profiling and <samp class="ph codeph">cudaProfilerStop()</samp> is used
                                       
                                       to stop profiling (using the CUDA driver API, you get the same
                                       
                                       functionality with
                                       
                                       <samp class="ph codeph">cuProfilerStart()</samp>
                                       
                                       and <samp class="ph codeph">cuProfilerStop()</samp>). 
                                       
                                       When using the start and stop functions, you also need to
                                       
                                       instruct the profiling tool to disable profiling at the start of
                                       
                                       the application. For the command line profiler, you do this by adding 
                                       
                                       <samp class="ph codeph">enableonstart 0</samp> in the profiler configuration file.
                                       
                                       				
                                    </p>
                                 </td>
                              </tr>
                           </tbody>
                        </table>
                     </div>
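                     <p class="p">As an illustration of how the options above combine, the following Python sketch
                        generates the text of a profiler configuration file. The helper name
                        <samp class="ph codeph">build_profiler_config</samp> and its parameters are hypothetical and not part of
                        any NVIDIA tool, and <samp class="ph codeph">counterA</samp>, <samp class="ph codeph">counterB</samp>
                        and <samp class="ph codeph">counterC</samp> are the placeholder counter names used in the
                        <samp class="ph codeph">cudadevice</samp> example. The sketch assumes one option or counter per line, as
                        in the examples in this section.
                     </p>

```python
def build_profiler_config(options, counters_all_devices=(),
                          counters_by_device=None, enable_on_start=True):
    """Return the text of a command line profiler configuration file.

    Hypothetical helper for illustration only. Counters listed before the
    first "cudadevice" line apply to all devices; counters listed after a
    "cudadevice N" line apply only to device N.
    """
    lines = list(options)
    # Profiling is enabled from the start by default, so the option is
    # written only when profiling should start disabled.
    if not enable_on_start:
        lines.append("enableonstart 0")
    # Counters meant for every device must precede any cudadevice line.
    lines.extend(counters_all_devices)
    for device_index in sorted(counters_by_device or {}):
        lines.append(f"cudadevice {device_index}")
        lines.extend(counters_by_device[device_index])
    return "\n".join(lines) + "\n"


config_text = build_profiler_config(
    options=["gridsize3d", "threadblocksize", "streamid"],
    counters_all_devices=["counterA"],          # placeholder counter names
    counters_by_device={0: ["counterB"], 1: ["counterC"]},
    enable_on_start=False,
)
print(config_text)
```

                     <p class="p">Writing the all-device counters first matters because every counter that follows a
                        <samp class="ph codeph">cudadevice</samp> line is selected only for that device.
                     </p>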
                  </div>
               </div>
               <div class="topic concept nested2" id="topic_DC0AA473DE8A4917AC94C817BD231DBB"><a name="topic_DC0AA473DE8A4917AC94C817BD231DBB" shape="rect">
                     <!-- --></a><h3 class="title topictitle2"><a href="#topic_DC0AA473DE8A4917AC94C817BD231DBB" name="topic_DC0AA473DE8A4917AC94C817BD231DBB" shape="rect">Command Line Profiler Counters</a></h3>
                  <div class="body conbody">
                     <p class="p">The command line profiler supports logging of event counters
                        during kernel execution. The list of available events can be
                        found using <samp class="ph codeph">nvprof --query-events</samp> as described
                        in <a class="xref" href="index.html#event-summary-mode" shape="rect">Event/metric Summary Mode</a>. The event name can be
                        used in the command line profiler configuration file. Only a
                        few counter values can be collected in each application run; the
                        exact number depends on the specific counters
                        selected.
                     </p>
                  </div>
               </div>
            </div>
            <div class="topic concept nested1" id="command-line-profiler-output"><a name="command-line-profiler-output" shape="rect">
                  <!-- --></a><h3 class="title topictitle2"><a href="#command-line-profiler-output" name="command-line-profiler-output" shape="rect">4.4.&nbsp;Command Line Profiler Output</a></h3>
               <div class="body conbody">
                  <div class="section">
                     <p class="p"> If the 
                        		  <samp class="ph codeph">COMPUTE_PROFILE</samp> environment variable is set to enable profiling,
                        		  the profiler log records timing information for every kernel launch and memory
                        		  operation performed by the driver. 
                        		
                     </p>
                     <p class="p"><a class="xref" href="index.html#command-line-profiler-output__Example1" shape="rect">Example 1: CUDA Default Profiler Log- No Options or Counters Enabled (File name: cuda_profile_0.log)</a> shows the profiler log for a CUDA application with no
                        		  profiler configuration file specified.
                        		
                     </p>
                  </div>
                  <div class="section" id="command-line-profiler-output__Example1"><a name="command-line-profiler-output__Example1" shape="rect">
                        <!-- --></a><h3 class="title sectiontitle">Example 1: CUDA Default Profiler Log- No Options or Counters Enabled (File name:
                        		  
                        		  <samp class="ph codeph">cuda_profile_0.log</samp>)
                     </h3><pre xml:space="preserve"># CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE 0 Tesla C2075
# CUDA_CONTEXT 1
# TIMESTAMPFACTOR fffff6de60e24570
method,gputime,cputime,occupancy
method=[ memcpyHtoD ] gputime=[ 80.640 ] cputime=[ 278.000 ]
method=[ memcpyHtoD ] gputime=[ 79.552 ] cputime=[ 237.000 ]
method=[ _Z6VecAddPKfS0_Pfi ] gputime=[ 5.760 ] cputime=[ 18.000 ] occupancy=[ 1.000 ]
method=[ memcpyDtoH ] gputime=[ 97.472 ] cputime=[ 647.000 ]</pre><p class="p">The log above in  <a class="xref" href="index.html#command-line-profiler-output__Example1" shape="rect">Example 1: CUDA Default Profiler Log- No Options or Counters Enabled (File name: cuda_profile_0.log)</a> 
                        		  shows 
                        		  data for memory copies and a kernel launch. The 
                        		  <samp class="ph codeph">method</samp> label specifies the name of the memory
                        		  copy method or kernel executed. The 
                        		  <samp class="ph codeph">gputime</samp> and 
                        		  <samp class="ph codeph">cputime</samp> labels specify the actual chip
                        		  execution time and the driver execution time, respectively. Note that 
                        		  <samp class="ph codeph">gputime</samp> and 
                        		  <samp class="ph codeph">cputime</samp> are in microseconds. The 
                        		  <samp class="ph codeph">occupancy</samp> label gives the ratio of the number of active warps per multiprocessor to the
                        		  maximum number of active warps for a particular kernel launch. This is the
                        		  theoretical occupancy and is calculated using kernel block size, register usage
                        		  and shared memory usage.
                        		
                     </p>
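                     <p class="p">The occupancy ratio described above can be approximated with a short calculation.
                        The following Python sketch is a simplified estimate, not the profiler's own
                        algorithm: it ignores register and shared memory allocation granularity, and the
                        per-multiprocessor limits are assumed values for a compute capability 2.0 device
                        with the 48 KB shared memory configuration. The function name
                        <samp class="ph codeph">theoretical_occupancy</samp> is hypothetical.
                     </p>

```python
# Assumed per-multiprocessor limits for a compute capability 2.0 device.
MAX_WARPS_PER_SM = 48
MAX_BLOCKS_PER_SM = 8
REGISTERS_PER_SM = 32768
SHARED_MEM_PER_SM = 49152  # bytes, assuming the 48 KB shared memory setting
WARP_SIZE = 32


def theoretical_occupancy(threads_per_block, regs_per_thread, smem_per_block):
    """Estimate active warps per SM divided by the maximum active warps.

    Simplified: allocation granularity for registers and shared memory
    is ignored, so real tools may report a lower value.
    """
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division
    # Each resource caps how many blocks can be resident on one SM.
    limits = [MAX_BLOCKS_PER_SM, MAX_WARPS_PER_SM // warps_per_block]
    if regs_per_thread:
        regs_per_block = regs_per_thread * warps_per_block * WARP_SIZE
        limits.append(REGISTERS_PER_SM // regs_per_block)
    if smem_per_block:
        limits.append(SHARED_MEM_PER_SM // smem_per_block)
    blocks_per_sm = min(limits)
    return blocks_per_sm * warps_per_block / MAX_WARPS_PER_SM


# A 256-thread block using 4 registers per thread and no shared memory.
print(theoretical_occupancy(256, 4, 0))
```

                     <p class="p">Under these assumptions the warp limit allows six resident blocks of eight warps
                        each, which fills all 48 warp slots and gives an occupancy of 1.0.
                     </p>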
                     <p class="p"><a class="xref" href="index.html#command-line-profiler-output__Example2" shape="rect">Example 2: CUDA Profiler Log- Options and Counters Enabled</a> shows the profiler log of a CUDA application. There are
                        		  a few options and counters enabled in this example using the profiler
                        		  configuration file:
                        		
                     </p><pre xml:space="preserve">gpustarttimestamp
gridsize3d
threadblocksize
dynsmemperblock
stasmemperblock
regperthread
memtransfersize
memtransferdir
streamid
countermodeaggregate
active_warps
active_cycles</pre></div>
                  <div class="section" id="command-line-profiler-output__Example2"><a name="command-line-profiler-output__Example2" shape="rect">
                        <!-- --></a><h3 class="title sectiontitle">Example 2: CUDA Profiler Log- Options and Counters Enabled</h3><pre xml:space="preserve"># CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE 0 Tesla C2075
# CUDA_CONTEXT 1
# TIMESTAMPFACTOR fffff6de5e08e990
gpustarttimestamp,method,gputime,cputime,gridsizeX,gridsizeY,gridsizeZ, threadblocksizeX,threadblocksizeY,threadblocksizeZ,dynsmemperblock, stasmemperblock,regperthread,occupancy,streamid,active_warps, active_cycles,memtransfersize,memtransferdir
gpustarttimestamp=[ 124b9e484b6f3f40 ] method=[ memcpyHtoD ] gputime=[ 80.800 ] cputime=[ 280.000 ] streamid=[ 1 ] memtransfersize=[ 200000 ] memtransferdir=[ 1 ]
gpustarttimestamp=[ 124b9e484b7517a0 ] method=[ memcpyHtoD ] gputime=[ 79.744 ] cputime=[ 232.000 ] streamid=[ 1 ] memtransfersize=[ 200000 ] memtransferdir=[ 1 ]
gpustarttimestamp=[ 124b9e484b8fd8e0 ] method=[ _Z6VecAddPKfS0_Pfi ] gputime=[ 10.016 ] cputime=[ 57.000 ] gridsize=[ 196, 1, 1 ] threadblocksize=[ 256, 1, 1 ] dynsmemperblock=[ 0 ] stasmemperblock=[ 0 ] regperthread=[ 4 ] occupancy=[ 1.000 ] streamid=[ 1 ]active_warps=[ 1545830 ] active_cycles=[ 40774 ]
gpustarttimestamp=[ 124b9e484bb5a2c0 ] method=[ memcpyDtoH ] gputime=[ 98.528 ] cputime=[ 672.000 ] streamid=[ 1 ] memtransfersize=[ 200000 ] memtransferdir=[ 2 ]</pre><p class="p">The default log syntax is easy to parse with a script, but for
                        		  spreadsheet analysis it might be easier to use the comma separated format.
                        		
                     </p>
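                     <p class="p">As a minimal sketch of such a script, the following Python code parses the default
                        key value pair format with a regular expression. The function names are
                        hypothetical, and the sample text is copied from the header and records of Example 1.
                        The sketch assumes that values never contain a closing square bracket.
                     </p>

```python
import re

# Matches one "key=[ value ]" pair; the padding spaces are not captured.
PAIR_RE = re.compile(r"(\w+)=\[\s*(.*?)\s*\]")


def parse_kvp_line(line):
    """Return a dict of the key value pairs found on one log line."""
    return dict(PAIR_RE.findall(line))


def parse_kvp_log(text):
    """Return one dict per record, skipping '#' headers and the column list."""
    records = []
    for line in text.splitlines():
        if line.startswith("#") or "=[" not in line:
            continue
        records.append(parse_kvp_line(line))
    return records


sample = """# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE 0 Tesla C2075
method,gputime,cputime,occupancy
method=[ memcpyHtoD ] gputime=[ 80.640 ] cputime=[ 278.000 ]
method=[ _Z6VecAddPKfS0_Pfi ] gputime=[ 5.760 ] cputime=[ 18.000 ] occupancy=[ 1.000 ]
"""

records = parse_kvp_log(sample)
# gputime is reported in microseconds.
total_gpu_us = sum(float(record["gputime"]) for record in records)
print(len(records), "records, total gputime (us):", round(total_gpu_us, 3))
```

                     <p class="p">All values are returned as strings, so multi-part values such as
                        <samp class="ph codeph">gridsize=[ 196, 1, 1 ]</samp> need a further split by the caller.
                     </p>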
                     <p class="p">When 
                        		  <samp class="ph codeph">COMPUTE_PROFILE_CSV</samp> is set to 1, this same test produces the
                        		  output log shown in <a class="xref" href="index.html#command-line-profiler-output__Example3" shape="rect">Example 3: CUDA Profiler Log- Options and Counters Enabled in CSV Format</a>.
                        		
                     </p>
                  </div>
                  <div class="section" id="command-line-profiler-output__Example3"><a name="command-line-profiler-output__Example3" shape="rect">
                        <!-- --></a><h3 class="title sectiontitle">Example 3: CUDA Profiler Log- Options and Counters Enabled in CSV
                        		  Format
                     </h3><pre xml:space="preserve"># CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE 0 Tesla C2075
# CUDA_CONTEXT 1
# CUDA_PROFILE_CSV 1
# TIMESTAMPFACTOR fffff6de5d77a1c0
gpustarttimestamp,method,gputime,cputime,gridsizeX,gridsizeY,gridsizeZ, threadblocksizeX,threadblocksizeY,threadblocksizeZ,dynsmemperblock, stasmemperblock,regperthread,occupancy,streamid,active_warps, active_cycles,memtransfersize,memtransferdir
124b9e85038d1800,memcpyHtoD,80.352,286.000,,,,,,,,,,,1,,,200000,1
124b9e850392ee00,memcpyHtoD,79.776,232.000,,,,,,,,,,,1,,,200000,1
124b9e8503af7460,_Z6VecAddPKfS0_Pfi,10.048,59.000,196,1,1,256,1,1,0, 0,4,1.000,1,1532814,42030
</pre></div>
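                  <p class="p">For analysis outside a spreadsheet, the comma separated log can be read with a few
                     lines of Python. The sketch below is one possible approach, not an official parser:
                     the function name <samp class="ph codeph">parse_csv_log</samp> is hypothetical, and the sample
                     is a shortened, illustrative log with fewer columns than Example 3. It assumes that
                     comment lines start with <samp class="ph codeph">#</samp>, that the first remaining line is the
                     column list, and that a record may have fewer fields than the column list.
                  </p>

```python
import csv


def parse_csv_log(text):
    """Return one dict per record of a comma separated profiler log."""
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.startswith("#")]
    # skipinitialspace drops the blank that follows some commas in the log.
    rows = list(csv.reader(lines, skipinitialspace=True))
    header, data = rows[0], rows[1:]
    records = []
    for row in data:
        # Pad short records so every column name maps to a value.
        padded = row + [""] * (len(header) - len(row))
        records.append(dict(zip(header, padded)))
    return records


# Shortened, illustrative sample; a real log has more columns.
sample = """# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_PROFILE_CSV 1
method,gputime,cputime, occupancy,streamid
memcpyHtoD,80.352,286.000,,1
_Z6VecAddPKfS0_Pfi,10.048,59.000,1.000
"""

rows = parse_csv_log(sample)
for row in rows:
    print(row["method"], row["gputime"], row["occupancy"] or "n/a")
```

                  <p class="p">Empty strings mark columns that do not apply to a record, such as
                     <samp class="ph codeph">occupancy</samp> for a memory copy.
                  </p>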
               </div>
            </div>
         </div>
         <div class="topic concept nested0" id="remote-profiling"><a name="remote-profiling" shape="rect">
               <!-- --></a><h2 class="title topictitle1"><a href="#remote-profiling" name="remote-profiling" shape="rect">5.&nbsp;Remote Profiling</a></h2>
            <div class="body conbody">
               <p class="p">Remote profiling is the process of collecting profile data from
                  a <dfn class="term">remote</dfn> system that is different from the
                  <dfn class="term">host</dfn> system on which that profile data will be viewed
                  and analyzed. In CUDA Toolkit 5.5 it is possible to use
                  <samp class="ph codeph">nvprof</samp> to collect the profile data on the remote
                  system and then use <samp class="ph codeph">nvvp</samp> on the host system to
                  view and analyze the data.
               </p>
            </div>
            <div class="topic concept nested1" id="collecting-remote-data"><a name="collecting-remote-data" shape="rect">
                  <!-- --></a><h3 class="title topictitle2"><a href="#collecting-remote-data" name="collecting-remote-data" shape="rect">5.1.&nbsp;Collect Data On Remote System</a></h3>
               <div class="body conbody">
                  <p class="p">There are three common remote profiling use cases that can be
                     addressed by using <samp class="ph codeph">nvprof</samp> and
                     <samp class="ph codeph">nvvp</samp>.
                  </p>
                  <div class="section">
                     <h3 class="title sectiontitle">Timeline</h3>
                     <div class="p">The first use case is to collect a timeline of the
                        application executing on the remote system. The timeline should
                        be collected in a way that most accurately reflects the behavior
                        of the application. To collect the timeline execute the
                        following on the remote system. See <a class="xref" href="index.html#nvprof-overview" shape="rect">nvprof</a> for
                        more information on <samp class="ph codeph">nvprof</samp> options.
                        
                        <pre xml:space="preserve">$ nvprof --output-profile timeline.nvprof &lt;app&gt; &lt;app args&gt;
      </pre>
                        
                        The profile data will be collected in
                        <tt class="ph tt">timeline.nvprof</tt>. You should copy this file back to the
                        host system and then import it into <samp class="ph codeph">nvvp</samp> as
                        described in the next section.
                        
                     </div>
                  </div>
                  <div class="section">
                     <h3 class="title sectiontitle">Metrics And Events</h3>
                     <div class="p">The second use case is to collect events or metrics for all
                        kernels in an application for which you have already collected a
                        timeline. Collecting events or metrics for all kernels will
                        significantly change the overall performance characteristics of
                        the application because all kernel executions will be serialized
                        on the GPU. Even though overall application performance is
                        changed, the event or metric values for individual kernels will
                        be correct and so you can merge the collected event and metric
                        values onto a previously collected timeline to get an accurate
                        picture of the application's behavior. To collect events or
                        metrics you use the <samp class="ph codeph">--events</samp> or
                        <samp class="ph codeph">--metrics</samp> flag. The following shows an example
                        using just the <samp class="ph codeph">--metrics</samp> flag to collect two
                        metrics.
                        
                        <pre xml:space="preserve">$ nvprof --metrics achieved_occupancy,executed_ipc -o metrics.nvprof &lt;app&gt; &lt;app args&gt; 
      </pre>
                        
                        You can collect any number of events and metrics for each
                        <samp class="ph codeph">nvprof</samp> invocation, and you can invoke
                        <samp class="ph codeph">nvprof</samp> multiple times to collect multiple
                        <tt class="ph tt">metrics.nvprof</tt> files. To get accurate profiling
                        results, it is important that your application conform to the
                        requirements detailed in <a class="xref" href="index.html#application-requirements" shape="rect">Application Requirements</a>.
                     </div>
                     <p class="p">The profile data will be collected in the
                        <tt class="ph tt">metrics.nvprof</tt> file(s). You should copy these files
                        back to the host system and then import them into
                        <samp class="ph codeph">nvvp</samp> as described in the next section.
                     </p>
                  </div>
                  <div class="section">
                     <h3 class="title sectiontitle">Guided Analysis For Individual Kernel</h3>
                     <div class="p">The third common remote profiling use case is to collect the
                        metrics needed by the guided analysis system for an individual
                        kernel. When imported into <samp class="ph codeph">nvvp</samp> this data will
                        enable the guided analysis system to analyze the kernel and
                        report optimization opportunities for that kernel. To collect
                        the guided analysis data execute the following on the remote
                        system. It is important that the <samp class="ph codeph">--kernels</samp>
                        option appear before the <samp class="ph codeph">--analysis-metrics</samp>
                        option so that metrics are collected only for the kernel(s)
                        specified by <samp class="ph codeph">kernel specifier</samp>. See <a class="xref" href="index.html#profiling-scope" shape="rect">Profiling Scope</a> for more information on the
                        <samp class="ph codeph">--kernels</samp> option.
                        
                        <pre xml:space="preserve">$ nvprof --kernels &lt;kernel specifier&gt; --analysis-metrics -o analysis.nvprof &lt;app&gt; &lt;app args&gt; 
      </pre>
                        
                        The profile data will be collected in
                        <tt class="ph tt">analysis.nvprof</tt>. You should copy this file back to the
                        host system and then import it into <samp class="ph codeph">nvvp</samp> as
                        described in the next section.
                     </div>
                  </div>
               </div>
            </div>
            <div class="topic concept nested1" id="using-remote-data"><a name="using-remote-data" shape="rect">
                  <!-- --></a><h3 class="title topictitle2"><a href="#using-remote-data" name="using-remote-data" shape="rect">5.2.&nbsp;View And Analyze Data</a></h3>
               <div class="body conbody">
                  <p class="p">The collected profile data is viewed and analyzed by importing
                     it into <samp class="ph codeph">nvvp</samp> on the host system. See <a class="xref" href="index.html#import-session" shape="rect">Import Session</a> for more information about importing.
                     
                  </p>
                  <div class="section">
                     <h3 class="title sectiontitle">Timeline, Metrics And Events</h3>
                     <p class="p">To view collected timeline data, the <tt class="ph tt">timeline.nvprof</tt>
                        file can be imported into <samp class="ph codeph">nvvp</samp> as described in
                        <a class="xref" href="index.html#import-nvprof-session" shape="rect">Import nvprof Session</a>. If metric or event data was
                        also collected for the application, the corresponding
                        <tt class="ph tt">metrics.nvprof</tt> file(s) can be imported into
                        <samp class="ph codeph">nvvp</samp> along with the timeline so that the events
                        and metrics collected for each kernel are associated with the
                        corresponding kernel in the timeline.
                     </p>
                  </div>
                  <div class="section">
                     <h3 class="title sectiontitle">Guided Analysis For Individual Kernel</h3>
                     <p class="p">To view collected analysis data for an individual kernel, the
                        <tt class="ph tt">analysis.nvprof</tt> file can be imported into
                        <samp class="ph codeph">nvvp</samp> as described in <a class="xref" href="index.html#import-nvprof-session" shape="rect">Import nvprof Session</a>. The <tt class="ph tt">analysis.nvprof</tt>
                        must be imported by itself. The timeline will show just the
                        individual kernel that we specified during data
                        collection. After importing, the guided analysis system can be
                        used to explore the optimization opportunities for the
                        kernel.
                     </p>
                  </div>
               </div>
            </div>
            <div class="topic concept nested1" id="remote-limitations"><a name="remote-limitations" shape="rect">
                  <!-- --></a><h3 class="title topictitle2"><a href="#remote-limitations" name="remote-limitations" shape="rect">5.3.&nbsp;Limitations</a></h3>
               <div class="body conbody">
                  <div class="p">There are several limitations to remote profiling.
                     
                     <ul class="ul">
                        <li class="li">The host system must have an NVIDIA GPU and the CUDA Toolkit
                           must be installed. The host GPU does not have to match the
                           GPU(s) on the remote system.
                        </li>
                        <li class="li">When collecting events or metrics with the
                           <samp class="ph codeph">--events</samp>, <samp class="ph codeph">--metrics</samp>, or
                           <samp class="ph codeph">--analysis-metrics</samp> options,
                           <samp class="ph codeph">nvprof</samp> will use kernel replay to execute each
                           kernel multiple times as needed to collect all the requested
                           data. If a large number of events or metrics are requested then
                           a large number of replays may be required, resulting in a
                           significant increase in application execution time.
                        </li>
                     </ul>
                  </div>
               </div>
            </div>
         </div>
         <div class="topic concept nested0" id="nvtx"><a name="nvtx" shape="rect">
               <!-- --></a><h2 class="title topictitle1"><a href="#nvtx" name="nvtx" shape="rect">6.&nbsp;NVIDIA Tools Extension</a></h2>
            <div class="body conbody">
               <div class="p">NVIDIA Tools Extension (NVTX) is a C-based Application
                  Programming Interface (API) for annotating events, code ranges,
                  and resources in your applications. Applications which integrate
                  NVTX can use the Visual Profiler to capture and visualize these
                  events and ranges. The NVTX API provides two core services:
                  
                  <ol class="ol">
                     <li class="li">Tracing of CPU events and time ranges.</li>
                     <li class="li">Naming of OS and CUDA resources.</li>
                  </ol>
                  
                  NVTX can be quickly integrated into an application. The sample
                  program below shows the use of marker events, range events, and
                  resource naming.
               </div><pre xml:space="preserve">
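    // Uses the Windows API (Sleep, GetCurrentThreadId) from &lt;windows.h&gt;.
    void Wait(int waitMilliseconds) {
      nvtxRangePush(__FUNCTION__);
      nvtxMark("Waiting...");
      Sleep(waitMilliseconds);
      nvtxRangePop();
    }

    int main(void) {
      // Name the current OS thread; the call takes a thread ID and a name.
      nvtxNameOsThread(GetCurrentThreadId(), "MAIN");
      nvtxRangePush(__FUNCTION__);
      Wait(1000);  // the wait time must be passed explicitly
      nvtxRangePop();
      return 0;
    }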
</pre></div>
            <div class="topic concept nested1" id="nvtx-overview"><a name="nvtx-overview" shape="rect">
                  <!-- --></a><h3 class="title topictitle2"><a href="#nvtx-overview" name="nvtx-overview" shape="rect">6.1.&nbsp;NVTX API Overview</a></h3>
               <div class="body conbody">
                  <div class="section">
                     <h3 class="title sectiontitle">Files</h3>
                     <p class="p">The core NVTX API is defined in file nvToolsExt.h, whereas
                        CUDA-specific extensions to the NVTX interface are defined in
                        nvToolsExtCuda.h and nvToolsExtCudaRt.h. On Linux the NVTX
                        shared library is called <samp class="ph codeph">libnvToolsExt.so</samp> and
                        on Mac OS X the shared library is
                        called <samp class="ph codeph">libnvToolsExt.dylib</samp>. On Windows the
                        library (.lib) and runtime components (.dll) are named
                        <samp class="ph codeph">nvToolsExt[bitness=32|64]_[version].{dll|lib}</samp>.
                     </p>
                  </div>
                  <div class="section">
                     <h3 class="title sectiontitle">Function Calls</h3>
                     <p class="p">All NVTX API functions start with an nvtx name prefix and may
                        end with one of three suffixes: A, W, or Ex. NVTX
                        functions with these suffixes exist in multiple variants,
                        performing the same core functionality with different
                        parameter encodings. Depending on the version of the NVTX
                        library, available encodings may include ASCII (A), Unicode
                        (W), or event structure (Ex).
                     </p>
                     <p class="p">The CUDA implementation of NVTX only implements the ASCII (A)
                        and event structure (Ex) variants of the API; the Unicode (W)
                        versions are not supported and have no effect when called.
                     </p>
                  </div>
                  <div class="section">
                     <h3 class="title sectiontitle">Return Values</h3>
                     <p class="p">Some of the NVTX functions are defined to have return
                        values. For example, the <samp class="ph codeph">nvtxRangeStart()</samp>
                        function returns a unique range identifier
                        and the <samp class="ph codeph">nvtxRangePush()</samp> function returns the
                        current stack level. It is recommended not to use the returned
                        values as part of conditional code in the instrumented
                        application. The returned values can differ between
                        implementations of the NVTX library, so code that depends on
                        them might work with one tool but fail with another.
                     </p>
                  </div>
               </div>
            </div>
            <div class="topic concept nested1" id="nvtx-events"><a name="nvtx-events" shape="rect">
                  <!-- --></a><h3 class="title topictitle2"><a href="#nvtx-events" name="nvtx-events" shape="rect">6.2.&nbsp;NVTX API Events</a></h3>
               <div class="body conbody">
                  <p class="p">Markers are used to describe events that occur at a specific
                     time during the execution of an application, while ranges detail
                     the time span in which they occur. This information is presented
                     alongside all of the other captured data, which makes it easier
                     to understand the collected information. All markers and ranges
                     are identified by a message string. The Ex version of the marker
                     and range APIs also allows category, color, and payload
                     attributes to be associated with the event using the event
                     attributes structure.
                  </p>
               </div>
               <div class="topic concept nested2" id="nvtx-marker"><a name="nvtx-marker" shape="rect">
                     <!-- --></a><h3 class="title topictitle2"><a href="#nvtx-marker" name="nvtx-marker" shape="rect">6.2.1.&nbsp;NVTX Markers</a></h3>
                  <div class="body conbody">
                     <p class="p">A marker is used to describe an instantaneous event.  A marker
                        can contain a text message or specify additional information
                        using the <a class="xref" href="index.html#nvtx-event-attribute-struct" shape="rect">event
                           attributes structure</a>.  Use <samp class="ph codeph">nvtxMarkA()</samp> to
                        create a marker containing an ASCII message.
                        Use <samp class="ph codeph">nvtxMarkEx()</samp> to create a marker containing
                        additional attributes specified by the event attribute
                        structure. The <samp class="ph codeph">nvtxMarkW()</samp> function is not
                        supported in the CUDA implementation of NVTX and has no effect
                        if called.
                     </p>
                     <div class="section">
                        <h4 class="title sectiontitle">Code Example</h4><pre xml:space="preserve">
    nvtxMarkA("My mark");

    nvtxEventAttributes_t eventAttrib = {0};
    eventAttrib.version = NVTX_VERSION;
    eventAttrib.size = NVTX_EVENT_ATTRIB_STRUCT_SIZE;
    eventAttrib.colorType = NVTX_COLOR_ARGB;
    eventAttrib.color = COLOR_RED;
    eventAttrib.messageType = NVTX_MESSAGE_TYPE_ASCII;
    eventAttrib.message.ascii = "my mark with attributes";
    nvtxMarkEx(&amp;eventAttrib);
      </pre></div>
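                     <p class="p">The example above assigns <samp class="ph codeph">COLOR_RED</samp> without showing
                        where it comes from. This guide never introduces that name, so it is presumably
                        defined by the application. The following is a minimal sketch of how such constants
                        could be defined, assuming that <samp class="ph codeph">NVTX_COLOR_ARGB</samp> expects a
                        32-bit value laid out as 0xAARRGGBB. The helper <samp class="ph codeph">make_argb()</samp>
                        and the constant values are illustrative and are not part of NVTX.
                     </p>

```c
#include <stdint.h>

/* Pack 8-bit alpha, red, green and blue channels into one 0xAARRGGBB value. */
static uint32_t make_argb(uint8_t a, uint8_t r, uint8_t g, uint8_t b)
{
    return ((uint32_t)a << 24) | ((uint32_t)r << 16) |
           ((uint32_t)g << 8)  |  (uint32_t)b;
}

/* Hypothetical application-defined colors (assumed, not provided by NVTX). */
#define COLOR_RED    0xFFFF0000u  /* opaque red    */
#define COLOR_GREEN  0xFF00FF00u  /* opaque green  */
#define COLOR_BLUE   0xFF0000FFu  /* opaque blue   */
#define COLOR_YELLOW 0xFFFFFF00u  /* opaque yellow */
```

                     <p class="p">With definitions like these in scope, an assignment such as
                        <samp class="ph codeph">eventAttrib.color = COLOR_RED;</samp> compiles as written.
                     </p>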
                  </div>
               </div>
               <div class="topic concept nested2" id="nvtx-range-start-stop"><a name="nvtx-range-start-stop" shape="rect">
                     <!-- --></a><h3 class="title topictitle2"><a href="#nvtx-range-start-stop" name="nvtx-range-start-stop" shape="rect">6.2.2.&nbsp;NVTX Range Start/Stop</a></h3>
                  <div class="body conbody">
                     <p class="p">A start/end range is used to denote an arbitrary, potentially
                        non-nested, time span. The start of a range can occur on a
                        different thread than the end of the range.  A range can contain
                        a text message or specify additional information using
                        the <a class="xref" href="index.html#nvtx-event-attribute-struct" shape="rect">event attributes
                           structure</a>.  Use <samp class="ph codeph">nvtxRangeStartA()</samp> to
                        create a range containing an ASCII message.
                        Use <samp class="ph codeph">nvtxRangeStartEx()</samp> to create a range
                        containing additional attributes specified by the event
                        attribute structure. The <samp class="ph codeph">nvtxRangeStartW()</samp>
                        function is not supported in the CUDA implementation of NVTX and
                        has no effect if called.  For the correlation of a start/end
                        pair, a unique correlation ID is created that is returned
                        from <samp class="ph codeph">nvtxRangeStartA()</samp>
                        or <samp class="ph codeph">nvtxRangeStartEx()</samp>, and is then passed into
                        <samp class="ph codeph">nvtxRangeEnd()</samp>.
                     </p>
                     <div class="section">
                        <h4 class="title sectiontitle">Code Example</h4><pre xml:space="preserve">
    // non-overlapping range
    nvtxRangeId_t id1 = nvtxRangeStartA("My range");
    nvtxRangeEnd(id1);

    nvtxEventAttributes_t eventAttrib = {0};
    eventAttrib.version = NVTX_VERSION;
    eventAttrib.size = NVTX_EVENT_ATTRIB_STRUCT_SIZE;
    eventAttrib.colorType = NVTX_COLOR_ARGB;
    eventAttrib.color = COLOR_BLUE;
    eventAttrib.messageType = NVTX_MESSAGE_TYPE_ASCII;
    eventAttrib.message.ascii = "my start/stop range";
    nvtxRangeId_t id2 = nvtxRangeStartEx(&amp;eventAttrib);
    nvtxRangeEnd(id2);

    // overlapping ranges
    nvtxRangeId_t r1 = nvtxRangeStartA("My range 0");
    nvtxRangeId_t r2 = nvtxRangeStartA("My range 1");
    nvtxRangeEnd(r1);
    nvtxRangeEnd(r2);
      </pre></div>
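                     <p class="p">The overlapping case above works because each start call hands back its own
                        correlation ID, so ranges can be ended in any order. The following toy model
                        illustrates that bookkeeping. It is not NVTX code: the type
                        <samp class="ph codeph">range_table_model</samp>, the <samp class="ph codeph">model_*</samp>
                        functions, and the fixed capacity are all assumptions made for this sketch.
                     </p>

```c
#include <stdint.h>

#define MAX_OPEN_RANGES 16  /* arbitrary capacity for this sketch */

/* Toy model of start/end range bookkeeping; this is not NVTX code. */
typedef struct {
    uint64_t next_id;                /* last correlation ID handed out     */
    uint64_t open[MAX_OPEN_RANGES];  /* IDs of ranges still open; 0 = free */
} range_table_model;

/* Start a range: returns a unique nonzero ID, or 0 if the table is full. */
static uint64_t model_range_start(range_table_model *t)
{
    for (int i = 0; i < MAX_OPEN_RANGES; ++i) {
        if (t->open[i] == 0) {
            t->open[i] = ++t->next_id;
            return t->open[i];
        }
    }
    return 0;
}

/* End the range with the given ID: returns 1 if it was open, 0 otherwise. */
static int model_range_end(range_table_model *t, uint64_t id)
{
    for (int i = 0; i < MAX_OPEN_RANGES; ++i) {
        if (id != 0 && t->open[i] == id) {
            t->open[i] = 0;
            return 1;
        }
    }
    return 0;
}

/* Number of ranges that have been started but not yet ended. */
static int model_open_count(const range_table_model *t)
{
    int count = 0;
    for (int i = 0; i < MAX_OPEN_RANGES; ++i) {
        if (t->open[i] != 0) {
            ++count;
        }
    }
    return count;
}
```

                     <p class="p">In this model, ending the first range while the second is still open leaves
                        exactly one open range, which mirrors the "My range 0" / "My range 1" sequence.
                     </p>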
                  </div>
               </div>
               <div class="topic concept nested2" id="nvtx-range-push-pop"><a name="nvtx-range-push-pop" shape="rect">
                     <!-- --></a><h3 class="title topictitle2"><a href="#nvtx-range-push-pop" name="nvtx-range-push-pop" shape="rect">6.2.3.&nbsp;NVTX Range Push/Pop</a></h3>
                  <div class="body conbody">
                     <p class="p">A push/pop range is used to denote a nested time span. The start
                        of a range must occur on the same thread as the end of
                        the range.  A range can contain a text message or specify
                        additional information using
                        the <a class="xref" href="index.html#nvtx-event-attribute-struct" shape="rect">event attributes
                           structure</a>.  Use <samp class="ph codeph">nvtxRangePushA()</samp> to create
                        a range containing an ASCII message.
                        Use <samp class="ph codeph">nvtxRangePushEx()</samp> to create a range
                        containing additional attributes specified by the event
                        attribute structure. The <samp class="ph codeph">nvtxRangePushW()</samp>
                        function is not supported in the CUDA implementation of NVTX and
                        has no effect if called.  Each push function returns the
                        zero-based depth of the range being
                        started. The <samp class="ph codeph">nvtxRangePop()</samp> function is used to
                        end the most recently pushed range for the
                        thread. <samp class="ph codeph">nvtxRangePop()</samp> returns the zero-based
                        depth of the range being ended. If the pop does not have a
                        matching push, a negative value is returned to indicate an
                        error.
                     </p>
                     <div class="section">
                        <h4 class="title sectiontitle">Code Example</h4><pre xml:space="preserve">
    nvtxRangePushA("outer");
    nvtxRangePushA("inner");
    nvtxRangePop();  // end "inner" range
    nvtxRangePop();  // end "outer" range

    nvtxEventAttributes_t eventAttrib = {0};
    eventAttrib.version = NVTX_VERSION;
    eventAttrib.size = NVTX_EVENT_ATTRIB_STRUCT_SIZE;
    eventAttrib.colorType = NVTX_COLOR_ARGB;
    eventAttrib.color = COLOR_GREEN;
    eventAttrib.messageType = NVTX_MESSAGE_TYPE_ASCII;
    eventAttrib.message.ascii = "my push/pop range";
    nvtxRangePushEx(&amp;eventAttrib);
    nvtxRangePop();
      </pre></div>
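                     <p class="p">The depth values described above can be pictured with a simple per-thread
                        counter. The following toy model is only a sketch of the documented return
                        values, not the NVTX implementation. The names
                        <samp class="ph codeph">range_stack_model</samp>, <samp class="ph codeph">model_range_push()</samp>
                        and <samp class="ph codeph">model_range_pop()</samp> are hypothetical, and the use of -1 as the
                        negative error value is an assumption.
                     </p>

```c
/* Toy model of one thread's push/pop range stack; this is not NVTX code. */
typedef struct {
    int depth;  /* number of ranges currently open on this thread */
} range_stack_model;

/* Start a nested range: returns the zero-based depth of the new range. */
static int model_range_push(range_stack_model *s)
{
    return s->depth++;
}

/* End the most recently pushed range: returns its zero-based depth,
 * or -1 (assumed error value) when there is no matching push. */
static int model_range_pop(range_stack_model *s)
{
    if (s->depth == 0) {
        return -1;
    }
    return --s->depth;
}
```

                     <p class="p">In this model the "outer" push returns 0, the "inner" push returns 1, and the
                        two pops return 1 and then 0. A further pop has no matching push and returns a
                        negative value.
                     </p>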
                  </div>
               </div>
               <div class="topic concept nested2" id="nvtx-event-attribute-struct"><a name="nvtx-event-attribute-struct" shape="rect">
                     <!-- --></a><h3 class="title topictitle2"><a href="#nvtx-event-attribute-struct" name="nvtx-event-attribute-struct" shape="rect">6.2.4.&nbsp;Event Attributes Structure</a></h3>
                  <div class="body conbody">
                     <p class="p">The event attributes
                        structure, <samp class="ph codeph">nvtxEventAttributes_t</samp>, is used to
                        describe the attributes of an event. The layout of the structure
                        is defined by a specific version of NVTX and can change between
                        different versions of the Tools Extension library.
                     </p>
                     <div class="section">
                        <h4 class="title sectiontitle">Attributes</h4>
                        <p class="p">Markers and ranges can use attributes to provide additional
                           information for an event or to guide the tool's visualization
                           of the data. Each of the attributes is optional and if left
                           unspecified, the attributes fall back to a default value.
                        </p>
                        <dl class="dl">
                           <dt class="dt dlterm">Message</dt>
                           <dd class="dd">The message field can be used to specify an optional
                              string. The caller must set both
                              the <samp class="ph codeph">messageType</samp>
                              and <samp class="ph codeph">message</samp> fields. The default value is
                              <samp class="ph codeph">NVTX_MESSAGE_UNKNOWN</samp>. The CUDA
                              implementation of NVTX only supports ASCII type
                              messages.
                           </dd>
                           <dt class="dt dlterm">Category</dt>
                           <dd class="dd">The category attribute is a user-controlled ID that can
                              be used to group events. The tool may use category IDs to
                              improve filtering, or for grouping events. The default
                              value is 0.
                           </dd>
                           <dt class="dt dlterm">Color</dt>
                           <dd class="dd">The color attribute is used to help visually identify
                              events in the tool. The caller must set both
                              the <samp class="ph codeph">colorType</samp> and <samp class="ph codeph">color</samp>
                              fields.
                           </dd>
                           <dt class="dt dlterm">Payload</dt>
                           <dd class="dd">The payload attribute can be used to provide additional
                              data for markers and ranges. Range events can only specify
                              values at the beginning of a range. The caller must
                              specify valid values for both
                              the <samp class="ph codeph">payloadType</samp>
                              and <samp class="ph codeph">payload</samp> fields.
                           </dd>
                        </dl>
                     </div>
                     <div class="section">
                        <h4 class="title sectiontitle">Initialization</h4>
                        <div class="p">The caller should always perform the following three tasks
                           when using attributes:
                           
                           <ul class="ul">
                              <li class="li">Zero the structure</li>
                              <li class="li">Set the version field</li>
                              <li class="li">Set the size field</li>
                           </ul>
                           
                           Zeroing the structure sets all the event attribute types and
                           values to their defaults. The version and size fields are
                           used by NVTX to handle multiple versions of the attributes
                           structure.
                        </div>
                        <p class="p">It is recommended that the caller use the following method to
                           initialize the event attributes structure.
                        </p><pre xml:space="preserve">
    nvtxEventAttributes_t eventAttrib = {0};
    eventAttrib.version = NVTX_VERSION;
    eventAttrib.size = NVTX_EVENT_ATTRIB_STRUCT_SIZE;
    eventAttrib.colorType = NVTX_COLOR_ARGB;
    eventAttrib.color = COLOR_YELLOW;
    eventAttrib.messageType = NVTX_MESSAGE_TYPE_ASCII;
    eventAttrib.message.ascii = "My event";
    nvtxMarkEx(&amp;eventAttrib);
</pre></div>
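                     <p class="p">To see why the version and size fields matter, consider how a library can
                        accept structures built against different header versions. The following is a
                        generic sketch of that technique, not NVTX's actual code. The structure layouts
                        <samp class="ph codeph">attrs_v1</samp> and <samp class="ph codeph">attrs_v2</samp> and the function
                        <samp class="ph codeph">read_color()</samp> are hypothetical and do not match
                        <samp class="ph codeph">nvtxEventAttributes_t</samp>.
                     </p>

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical older layout: no color field. */
typedef struct {
    uint16_t version;
    uint16_t size;      /* caller sets this to sizeof(its own struct) */
    uint32_t category;
} attrs_v1;

/* Hypothetical newer layout: same prefix, plus a color field. */
typedef struct {
    uint16_t version;
    uint16_t size;
    uint32_t category;
    uint32_t color;
} attrs_v2;

/* Leading fields shared by every layout. */
typedef struct {
    uint16_t version;
    uint16_t size;
} attrs_header;

/* Return the color if the caller's structure is large enough to hold one,
 * otherwise fall back to the default value 0. */
static uint32_t read_color(const void *attrs)
{
    attrs_header hdr;
    memcpy(&hdr, attrs, sizeof hdr);      /* inspect only the shared prefix */

    if (hdr.size >= sizeof(attrs_v2)) {
        attrs_v2 full;
        memcpy(&full, attrs, sizeof full);
        return full.color;
    }
    return 0;                             /* older caller: field is absent */
}
```

                     <p class="p">A caller that forgets to set the size field leaves it at zero, so a check like
                        this one would ignore every optional field. That is one reason to always set
                        both fields after zeroing the structure.
                     </p>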
                  </div>
               </div>
            </div>
            <div class="topic concept nested1" id="nvtx-naming"><a name="nvtx-naming" shape="rect">
                  <!-- --></a><h3 class="title topictitle2"><a href="#nvtx-naming" name="nvtx-naming" shape="rect">6.3.&nbsp;NVTX Resource Naming</a></h3>
               <div class="body conbody">
                  <p class="p">NVTX resource naming allows custom names to be associated with
                     host OS threads and CUDA resources such as devices, contexts, and
                     streams. The names assigned using NVTX are displayed by the Visual
                     Profiler.
                  </p>
                  <div class="section">
                     <h3 class="title sectiontitle">OS Thread</h3>
                     <p class="p">The <samp class="ph codeph">nvtxNameOsThreadA()</samp> function is used to
                        name a host OS thread. The <samp class="ph codeph">nvtxNameOsThreadW()</samp>
                        function is not supported in the CUDA implementation of NVTX and
                        has no effect if called. The following example shows how the
                        current host OS thread can be named.
                     </p><pre xml:space="preserve">
    // Windows
    nvtxNameOsThread(GetCurrentThreadId(), "MAIN_THREAD");

    // Linux/Mac
    nvtxNameOsThread(pthread_self(), "MAIN_THREAD");
      </pre></div>
                  <div class="section">
                     <h3 class="title sectiontitle">CUDA Runtime Resources</h3>
                     <p class="p">The <samp class="ph codeph">nvtxNameCudaDeviceA()</samp>
                        and <samp class="ph codeph">nvtxNameCudaStreamA()</samp> functions are used to
                        name CUDA device and stream objects,
                        respectively. The <samp class="ph codeph">nvtxNameCudaDeviceW()</samp>
                        and <samp class="ph codeph">nvtxNameCudaStreamW()</samp> functions are not
                        supported in the CUDA implementation of NVTX and have no effect
                        if called. The <samp class="ph codeph">nvtxNameCudaEventA()</samp>
                        and <samp class="ph codeph">nvtxNameCudaEventW()</samp> functions are also not
                        supported. The following example shows how a CUDA device and
                        stream can be named.
                     </p><pre xml:space="preserve">
    nvtxNameCudaDeviceA(0, "my cuda device 0");

    cudaStream_t cudastream;
    cudaStreamCreate(&amp;cudastream);
    nvtxNameCudaStreamA(cudastream, "my cuda stream");
      </pre></div>
                  <div class="section">
                     <h3 class="title sectiontitle">CUDA Driver Resources</h3>
                     <p class="p">The <samp class="ph codeph">nvtxNameCuDeviceA()</samp>, <samp class="ph codeph">nvtxNameCuContextA()</samp>
                        and <samp class="ph codeph">nvtxNameCuStreamA()</samp> functions are used to
                        name CUDA driver device, context and stream objects,
                        respectively. The <samp class="ph codeph">nvtxNameCuDeviceW()</samp>, <samp class="ph codeph">nvtxNameCuContextW()</samp>
                        and <samp class="ph codeph">nvtxNameCuStreamW()</samp> functions are not
                        supported in the CUDA implementation of NVTX and have no effect
                        if called. The <samp class="ph codeph">nvtxNameCuEventA()</samp>
                        and <samp class="ph codeph">nvtxNameCuEventW()</samp> functions are also not
                        supported. The following example shows how a CUDA device,
                        context and stream can be named.
                     </p><pre xml:space="preserve">
    CUdevice device;
    cuDeviceGet(&amp;device, 0);
    nvtxNameCuDeviceA(device, "my device 0");

    CUcontext context;
    cuCtxCreate(&amp;context, 0, device);
    nvtxNameCuContextA(context, "my context");

    CUstream stream;
    cuStreamCreate(&amp;stream, 0);
    nvtxNameCuStreamA(stream, "my stream");
      </pre></div>
               </div>
            </div>
         </div>
         <div class="topic concept nested0" id="mpi-profiling"><a name="mpi-profiling" shape="rect">
               <!-- --></a><h2 class="title topictitle1"><a href="#mpi-profiling" name="mpi-profiling" shape="rect">7.&nbsp;MPI Profiling</a></h2>
            <div class="body conbody">
               <p class="p">The <a class="xref" href="index.html#nvprof-overview" shape="rect"><samp class="ph codeph">nvprof</samp></a>
                  profiler and
                  the <a class="xref" href="index.html#compute-command-line-profiler-overview" shape="rect">Command Line Profiler</a> 
                  can be used to profile individual MPI
                  processes. The resulting output can be used directly, or can be
                  imported into the <a class="xref" href="index.html#visual-profiler" shape="rect">Visual Profiler</a>.
                  
               </p>
            </div>
            <div class="topic concept nested1" id="mpi-nvprof"><a name="mpi-nvprof" shape="rect">
                  <!-- --></a><h3 class="title topictitle2"><a href="#mpi-nvprof" name="mpi-nvprof" shape="rect">7.1.&nbsp;MPI Profiling With nvprof</a></h3>
               <div class="body conbody">
                  <p class="p">To use <a class="xref" href="index.html#nvprof-overview" shape="rect"><samp class="ph codeph">nvprof</samp></a> to
                     collect the profiles of the individual MPI processes, you must
                     tell <samp class="ph codeph">nvprof</samp> to send its output to unique files.
                     In CUDA 5.0 and earlier versions, it was recommended to use a script
                     for this.  However, you can now do it easily using the <samp class="ph codeph">%h</samp>
                     and <samp class="ph codeph">%p</samp> features of the <samp class="ph codeph">--output-profile</samp>
                     argument to the <samp class="ph codeph">nvprof</samp> command.  Below is an
                     example run using Open MPI.
                  </p><pre xml:space="preserve">  
$ mpirun -np 2 -host c0-0,c0-1 nvprof -o output.%h.%p a.out  
    </pre><p class="p">Alternatively, you can turn on profiling on only the nodes of interest
                     with the <samp class="ph codeph">--profile-all-processes</samp> argument to
                     <samp class="ph codeph">nvprof</samp>. To do this, first log into the node
                     you want to profile and start <samp class="ph codeph">nvprof</samp> there.
                  </p><pre xml:space="preserve">  
$ nvprof --profile-all-processes -o output.%h.%p
    </pre><p class="p">Then run the MPI job as you normally would.</p><pre xml:space="preserve">  
$ mpirun -np 2 -host c0-0,c0-1 a.out  
    </pre><p class="p">
                     Any process that runs on the node where 
                     <samp class="ph codeph">nvprof --profile-all-processes</samp> is running is
                     profiled automatically. The profiling data is written
                     to the output files.
                  </p>
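                  <p class="p">The effect of the <samp class="ph codeph">%h</samp> and <samp class="ph codeph">%p</samp>
                     substitutions on the output file name can be sketched in a few lines of shell. This is an
                     illustration only: <samp class="ph codeph">nvprof</samp> performs the substitution internally, and
                     the helper <samp class="ph codeph">expand_profile_name</samp>, the host names, and the process IDs
                     below are hypothetical.
                  </p>

```shell
# Illustration only: nvprof performs this substitution internally.
# expand_profile_name TEMPLATE HOST PID prints TEMPLATE with every %h
# replaced by HOST and every %p replaced by PID.
expand_profile_name() {
    printf '%s\n' "$1" | sed -e "s/%h/$2/g" -e "s/%p/$3/g"
}

# Ranks on different hosts, and ranks on the same host, all get
# distinct file names (host names and process IDs are made up).
expand_profile_name 'output.%h.%p' c0-0 4321
expand_profile_name 'output.%h.%p' c0-1 4321
expand_profile_name 'output.%h.%p' c0-1 4322
```

                  <p class="p">Because the host name separates nodes and the process ID separates processes on one
                     node, no two MPI processes in this sketch write to the same file.
                  </p>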
                  <p class="p">Additional arguments to use with
                     <samp class="ph codeph">nvprof</samp> are described in the <a class="xref" href="index.html#multiprocess-profiling" shape="rect">Multiprocess Profiling</a>
                     and <a class="xref" href="index.html#redirecting-output" shape="rect">Redirecting Output</a> sections. For
                     information about how to view the data with <samp class="ph codeph">nvvp</samp>,
                     see the <a class="xref" href="index.html#import-nvprof-session" shape="rect">Import nvprof Session</a>
                     section.
                  </p>
               </div>
            </div>
            <div class="topic concept nested1" id="mpi-compute-command-line"><a name="mpi-compute-command-line" shape="rect">
                  <!-- --></a><h3 class="title topictitle2"><a href="#mpi-compute-command-line" name="mpi-compute-command-line" shape="rect">7.2.&nbsp;MPI Profiling With The Command-Line Profiler</a></h3>
               <div class="body conbody">
                  <div class="p">The <a class="xref" href="index.html#compute-command-line-profiler-overview" shape="rect">command-line
                        profiler</a> is enabled and controlled by environment
                     variables and a configuration file. To correctly profile MPI
                     jobs, the profile output produced by the command-line profiler
                     must be directed to unique output files for each MPI process.
                     The command-line profiler uses the COMPUTE_PROFILE_LOG
                     environment variable for this purpose. You can use special
                     substitute characters in the log name to ensure that different
                     devices and processes record their profile information to
                     different files. The '%d' is replaced by the device ID, and
                     the '%p' is replaced by the process ID.
                     <pre xml:space="preserve">
setenv COMPUTE_PROFILE_LOG cuda_profile.%d.%p
      </pre>
                     If you are running on multiple nodes, store the profile logs on node-local storage so that 
                     processes with the same ID running on different nodes don't clobber each other's log files.
                     <pre xml:space="preserve">
setenv COMPUTE_PROFILE_LOG /tmp/cuda_profile.%d.%p
      </pre>
                     COMPUTE_PROFILE_LOG and the other command-line profiler
                     environment variables must be passed to the remote processes of
                     the job. Most <samp class="ph codeph">mpirun</samp> implementations provide a way to do
                     this. Examples for Open MPI and MVAPICH2 are shown below using
                     the simpleMPI program from the CUDA Software Development
                     Toolkit.
                  </div>
                  <div class="section">
                     <h3 class="title sectiontitle">Open MPI</h3><pre xml:space="preserve">    
&gt; setenv COMPUTE_PROFILE_LOG /tmp/cuda_profile.%d.%p
&gt; setenv COMPUTE_PROFILE_CSV 1
&gt; setenv COMPUTE_PROFILE_CONFIG /tmp/compute_profile.config
&gt; setenv COMPUTE_PROFILE 1
&gt; mpirun -x COMPUTE_PROFILE_CSV -x COMPUTE_PROFILE -x COMPUTE_PROFILE_CONFIG -x COMPUTE_PROFILE_LOG -np 6 -host c0-5,c0-6,c0-7 simpleMPI
Running on 6 nodes
Average of square roots is: 0.667282
PASSED
      </pre></div>
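                  <p class="p">Listing every variable after its own <samp class="ph codeph">-x</samp> is easy to get
                     wrong when the set of variables changes. As a minimal sketch, the forwarding arguments can be
                     generated from a list of names. The helper <samp class="ph codeph">forward_args</samp> is
                     hypothetical, and the command is only printed, not executed.
                  </p>

```shell
# forward_args NAME... prints one " -x NAME" pair per variable name,
# ready to be placed on an Open MPI mpirun command line.
forward_args() {
    for name in "$@"; do
        printf ' -x %s' "$name"
    done
    printf '\n'
}

# Dry run: print the command instead of executing it.
echo "mpirun$(forward_args COMPUTE_PROFILE COMPUTE_PROFILE_CSV COMPUTE_PROFILE_CONFIG COMPUTE_PROFILE_LOG) -np 6 -host c0-5,c0-6,c0-7 simpleMPI"
```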
                  <div class="section">
                     <h3 class="title sectiontitle">MVAPICH2</h3><pre xml:space="preserve">    
&gt; mpirun_rsh -np 6 c0-5 c0-5 c0-6 c0-6 c0-7 c0-7 COMPUTE_PROFILE_CSV=1 COMPUTE_PROFILE=1 COMPUTE_PROFILE_CONFIG=/tmp/compute_profile.config COMPUTE_PROFILE_LOG=cuda_profile.%d.%p simpleMPI
Running on 6 nodes
Average of square roots is: 0.667282
PASSED
      </pre></div>
               </div>
            </div>
         </div>
         <div class="topic concept nested0" id="metrics-reference"><a name="metrics-reference" shape="rect">
               <!-- --></a><h2 class="title topictitle1"><a href="#metrics-reference" name="metrics-reference" shape="rect">8.&nbsp;Metrics Reference</a></h2>
            <div class="body conbody">
               <p class="p">This section contains detailed descriptions of the metrics that
                  can be collected by <samp class="ph codeph">nvprof</samp> and the Visual
                  Profiler. A scope value of single-context indicates that the
                  metric can only be accurately collected when a single context
                  (CUDA or graphics) is executing on the GPU. A scope value of
                  multi-context indicates that the metric can be accurately
                  collected when multiple contexts are executing on the GPU.
               </p>
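               <p class="p">Metrics are requested by name with the <samp class="ph codeph">--metrics</samp> argument to
                  <samp class="ph codeph">nvprof</samp>, which accepts several names separated by commas. The sketch
                  below builds such a command line from a list of metric names. The helper
                  <samp class="ph codeph">join_metrics</samp> is hypothetical, <samp class="ph codeph">./a.out</samp> is a
                  placeholder application, and the command is only printed, not executed.
               </p>

```shell
# join_metrics NAME... prints the names joined by commas, the form
# that nvprof --metrics accepts for several metrics at once.
join_metrics() {
    ( IFS=,; printf '%s\n' "$*" )
}

# Dry run: print the command instead of executing it.
echo "nvprof --metrics $(join_metrics branch_efficiency gld_efficiency gst_efficiency) ./a.out"
```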
               <div class="section">
                  <p class="p">Devices with compute capability less than 2.0 implement the metrics
                     shown in the following table.
                  </p>
                  <div class="tablenoborder">
                     <table cellpadding="4" cellspacing="0" summary="" class="table capabity_metrics" frame="border" border="1" rules="all">
                        <caption><span class="tablecap">Table 3. Capability 1.x Metrics</span></caption>
                        <thead class="thead" align="left">
                           <tr class="row">
                              <th class="entry" valign="top" width="33.33333333333333%" id="d54e3928" rowspan="1" colspan="1">Metric Name</th>
                              <th class="entry" valign="top" width="44.44444444444444%" id="d54e3931" rowspan="1" colspan="1">Description</th>
                              <th class="entry" valign="top" width="22.22222222222222%" id="d54e3934" rowspan="1" colspan="1">Scope</th>
                           </tr>
                        </thead>
                        <tbody class="tbody">
                           <tr class="row">
                              <td class="entry" valign="top" width="33.33333333333333%" headers="d54e3928" rowspan="1" colspan="1">branch_efficiency</td>
                              <td class="entry" valign="top" width="44.44444444444444%" headers="d54e3931" rowspan="1" colspan="1">Ratio of non-divergent branches to total branches</td>
                              <td class="entry" valign="top" width="22.22222222222222%" headers="d54e3934" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="33.33333333333333%" headers="d54e3928" rowspan="1" colspan="1">gld_efficiency</td>
                              <td class="entry" valign="top" width="44.44444444444444%" headers="d54e3931" rowspan="1" colspan="1">Ratio of requested global memory load transactions to actual
                                 global memory load transactions
                              </td>
                              <td class="entry" valign="top" width="22.22222222222222%" headers="d54e3934" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="33.33333333333333%" headers="d54e3928" rowspan="1" colspan="1">gst_efficiency</td>
                              <td class="entry" valign="top" width="44.44444444444444%" headers="d54e3931" rowspan="1" colspan="1">Ratio of requested global memory store transactions to
                                 actual global memory store transactions
                              </td>
                              <td class="entry" valign="top" width="22.22222222222222%" headers="d54e3934" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="33.33333333333333%" headers="d54e3928" rowspan="1" colspan="1">gld_requested_throughput</td>
                              <td class="entry" valign="top" width="44.44444444444444%" headers="d54e3931" rowspan="1" colspan="1">Requested global memory load throughput</td>
                              <td class="entry" valign="top" width="22.22222222222222%" headers="d54e3934" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="33.33333333333333%" headers="d54e3928" rowspan="1" colspan="1">gst_requested_throughput</td>
                              <td class="entry" valign="top" width="44.44444444444444%" headers="d54e3931" rowspan="1" colspan="1">Requested global memory store throughput</td>
                              <td class="entry" valign="top" width="22.22222222222222%" headers="d54e3934" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                        </tbody>
                     </table>
                  </div>
               </div>
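               <p class="p">Several of these metrics are plain ratios of two counts. As a worked example with
                  made-up numbers, a kernel in which 900 of 1000 branches do not diverge has a
                  <samp class="ph codeph">branch_efficiency</samp> ratio of 0.90. The helper
                  <samp class="ph codeph">ratio</samp> below is hypothetical and only shows the arithmetic; it does
                  not reproduce how the profiler gathers or formats the value.
               </p>

```shell
# ratio PART TOTAL prints PART / TOTAL to two decimal places,
# or "n/a" when TOTAL is zero.
ratio() {
    awk -v part="$1" -v total="$2" 'BEGIN {
        if (total == 0) { print "n/a"; exit }
        printf "%.2f\n", part / total
    }'
}

# Hypothetical counts: 900 of 1000 branches did not diverge.
ratio 900 1000
```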
               <div class="section">
                  <p class="p">Devices with compute capability 2.x (2.0 or greater, but less than 3.0)
                     implement the metrics shown in the following table.
                  </p>
                  <div class="tablenoborder">
                     <table cellpadding="4" cellspacing="0" summary="" class="table capabity_metrics" frame="border" border="1" rules="all">
                        <caption><span class="tablecap">Table 4. Capability 2.x Metrics</span></caption>
                        <thead class="thead" align="left">
                           <tr class="row">
                              <th class="entry" valign="top" width="36.36363636363637%" id="d54e4027" rowspan="1" colspan="1">Metric Name</th>
                              <th class="entry" valign="top" width="45.45454545454545%" id="d54e4030" rowspan="1" colspan="1">Description</th>
                              <th class="entry" valign="top" width="18.181818181818183%" id="d54e4033" rowspan="1" colspan="1">Scope</th>
                           </tr>
                        </thead>
                        <tbody class="tbody">
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">sm_efficiency</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">The percentage of time at least one warp is active
                                 on a multiprocessor averaged over all multiprocessors on the GPU
                              </td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">sm_efficiency_instance</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">The percentage of time at least one warp is active
                                 on a specific multiprocessor
                              </td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">achieved_occupancy</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Ratio of the average active warps per active cycle
                                 to the maximum number of warps supported on a
                                 multiprocessor
                              </td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">issue_slot_utilization</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Percentage of issue slots that issued at least one
                                 instruction, averaged across all cycles
                              </td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">inst_executed</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">The number of instructions executed</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">inst_issued</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">The number of instructions issued</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">issue_slots</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">The number of issue slots used</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">executed_ipc</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Instructions executed per cycle</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">issued_ipc</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Instructions issued per cycle</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">ipc_instance</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Instructions executed per cycle for a single multiprocessor</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">inst_per_warp</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Average number of instructions executed by each warp</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">cf_issued</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Number of issued control-flow instructions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">cf_executed</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Number of executed control-flow instructions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">ldst_issued</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Number of issued load and store instructions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">ldst_executed</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Number of executed load and store instructions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">branch_efficiency</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Ratio of non-divergent branches to total branches</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">warp_execution_efficiency</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Ratio of the average active threads per warp to the
                                 maximum number of threads per warp supported on a
                                 multiprocessor
                              </td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">inst_replay_overhead</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Average number of replays for each instruction executed</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">shared_replay_overhead</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Average number of replays due to shared memory
                                 conflicts for each instruction executed
                              </td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">global_cache_replay_overhead</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Average number of replays due to global memory
                                 cache misses for each instruction executed
                              </td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">local_replay_overhead</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Average number of replays due to local memory
                                 accesses for each instruction executed
                              </td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">gld_efficiency</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Ratio of requested global memory load throughput to
                                 required global memory load throughput
                              </td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">gst_efficiency</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Ratio of requested global memory store throughput
                                 to required global memory store throughput
                              </td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">gld_transactions</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Number of global memory load transactions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">gst_transactions</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Number of global memory store transactions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">gld_transactions_per_request</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Average number of global memory load transactions performed for each global memory load</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">gst_transactions_per_request</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Average number of global memory store transactions performed for each global memory store</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">gld_throughput</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Global memory load throughput</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">gst_throughput</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Global memory store throughput</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">gld_requested_throughput</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Requested global memory load throughput</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">gst_requested_throughput</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Requested global memory store throughput</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">local_load_transactions</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Number of local memory load transactions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">local_store_transactions</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Number of local memory store transactions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">local_load_transactions_per_request</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Average number of local memory load transactions performed for each local memory load</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">local_store_transactions_per_request</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Average number of local memory store transactions performed for each local memory store</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">local_load_throughput</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Local memory load throughput</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">local_store_throughput</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Local memory store throughput</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">shared_load_transactions</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Number of shared memory load transactions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">shared_store_transactions</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Number of shared memory store transactions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">shared_load_transactions_per_request</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Average number of shared memory load transactions performed for each shared memory load</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">shared_store_transactions_per_request</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Average number of shared memory store transactions performed for each shared memory store</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">shared_load_throughput</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Shared memory load throughput</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">shared_store_throughput</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Shared memory store throughput</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">shared_efficiency</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Ratio of requested shared memory throughput to required shared memory throughput</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">dram_read_transactions</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Device memory read transactions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">dram_write_transactions</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Device memory write transactions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">dram_read_throughput</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Device memory read throughput</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">dram_write_throughput</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Device memory write throughput</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">sysmem_read_transactions</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">System memory read transactions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">sysmem_write_transactions</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">System memory write transactions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">sysmem_read_throughput</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">System memory read throughput</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">sysmem_write_throughput</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">System memory write throughput</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">l1_cache_global_hit_rate</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Hit rate in L1 cache for global loads</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">l1_cache_local_hit_rate</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Hit rate in L1 cache for local loads and stores</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">tex_cache_hit_rate</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Texture cache hit rate</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">tex_cache_transactions</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Texture cache read transactions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">tex_cache_throughput</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Texture cache throughput</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">l2_read_transactions</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Memory read transactions seen at L2 cache for all read requests</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">l2_write_transactions</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Memory write transactions seen at L2 cache for all write requests</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">l2_read_throughput</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Memory read throughput seen at L2 cache for all read requests</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">l2_write_throughput</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Memory write throughput seen at L2 cache for all write requests</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">l2_l1_read_hit_rate</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Hit rate at L2 cache for all read requests from L1 cache</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">l2_l1_read_throughput</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Memory read throughput seen at L2 cache for read
                                 requests from L1 cache
                              </td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">l2_texture_read_hit_rate</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Hit rate at L2 cache for all read requests from texture cache</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">l2_texture_read_throughput</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Memory read throughput seen at L2 cache for read
                                 requests from the texture cache
                              </td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">local_memory_overhead</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Ratio of local memory traffic to total memory
                                 traffic between the L1 and L2 caches
                              </td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">l1_shared_utilization</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">The utilization level of the L1/shared memory relative to the peak utilization</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">l2_utilization</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">The utilization level of the L2 cache relative to the peak utilization</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">tex_utilization</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">The utilization level of the texture cache relative to the peak utilization</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">dram_utilization</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">The utilization level of the device memory relative to the peak utilization</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">sysmem_utilization</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">The utilization level of the system memory relative to the peak utilization</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">ldst_fu_utilization</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">The utilization level of the multiprocessor function units that execute load and store instructions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">int_fu_utilization</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">The utilization level of the multiprocessor function units that execute integer instructions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">cf_fu_utilization</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">The utilization level of the multiprocessor function units that execute control-flow instructions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">tex_fu_utilization</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">The utilization level of the multiprocessor function units that execute texture instructions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">fp_fu_utilization</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">The utilization level of the multiprocessor function units that execute floating point instructions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">fpspec_fu_utilization</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">The utilization level of the multiprocessor function units that execute special floating point instructions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">misc_fu_utilization</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">The utilization level of the multiprocessor function units that execute miscellaneous instructions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">flops_sp</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Single-precision floating point operations executed</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">flops_sp_add</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Single-precision floating point add operations executed</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">flops_sp_mul</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Single-precision floating point multiply operations executed</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">flops_sp_fma</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Single-precision floating point multiply-accumulate operations executed</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">flops_dp</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Double-precision floating point operations executed</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">flops_dp_add</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Double-precision floating point add operations executed</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">flops_dp_mul</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Double-precision floating point multiply operations executed</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">flops_dp_fma</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Double-precision floating point multiply-accumulate operations executed</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">flops_sp_special</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Single-precision floating point special operations executed</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">stall_inst_fetch</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Percentage of stalls occurring because the next assembly instruction has not yet been fetched</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">stall_exec_dependency</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Percentage of stalls occurring because an input required by the instruction is not yet available</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">stall_data_request</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Percentage of stalls occurring because a memory operation cannot be performed, either because the required resources are
                                 unavailable or fully utilized, or because too many requests of a given type are outstanding
                              </td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">stall_sync</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Percentage of stalls occurring because the warp is blocked at a __syncthreads() call</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">stall_texture</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Percentage of stalls occurring because the texture sub-system is fully utilized or has too many outstanding requests</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e4027" rowspan="1" colspan="1">stall_other</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e4030" rowspan="1" colspan="1">Percentage of stalls occurring due to miscellaneous reasons</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e4033" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                        </tbody>
                     </table>
                  </div>
               </div>
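               <div class="section">
                  <p class="p">As a minimal sketch of how the stall metrics listed above might be read together, the following Python
                     snippet ranks a set of stall percentages for one kernel to find the largest contributor. The values
                     are made up for illustration, and the rank_stall_reasons helper is a hypothetical name, not part
                     of any NVIDIA tool.
                  </p>
<pre class="pre screen" xml:space="preserve">
```python
# Hypothetical stall percentages for one kernel, keyed by metric name.
# These numbers are invented for illustration; real values come from the profiler.
stall_percentages = {
    "stall_inst_fetch": 4.0,
    "stall_exec_dependency": 31.5,
    "stall_data_request": 42.0,
    "stall_sync": 9.5,
    "stall_texture": 1.0,
    "stall_other": 12.0,
}


def rank_stall_reasons(stalls):
    """Return (metric, percentage) pairs sorted from largest to smallest."""
    return sorted(stalls.items(), key=lambda item: item[1], reverse=True)


ranked = rank_stall_reasons(stall_percentages)
dominant_metric, dominant_share = ranked[0]
print(f"dominant stall reason: {dominant_metric} ({dominant_share:.1f}%)")
```
</pre>
                  <p class="p">With these sample values the snippet reports stall_data_request as the largest contributor, which
                     would suggest examining memory access patterns before other optimizations.
                  </p>
               </div>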
               <div class="section">
                  <p class="p">Devices with compute capability 3.0 or higher implement the metrics
                     shown in the following table.
                  </p>
                  <div class="tablenoborder">
                     <table cellpadding="4" cellspacing="0" summary="" class="table capabity_metrics" frame="border" border="1" rules="all">
                        <caption><span class="tablecap">Table 5. Capability 3.x Metrics</span></caption>
                        <thead class="thead" align="left">
                           <tr class="row">
                              <th class="entry" valign="top" width="36.36363636363637%" id="d54e5197" rowspan="1" colspan="1">Metric Name</th>
                              <th class="entry" valign="top" width="45.45454545454545%" id="d54e5200" rowspan="1" colspan="1">Description</th>
                              <th class="entry" valign="top" width="18.181818181818183%" id="d54e5203" rowspan="1" colspan="1">Scope</th>
                           </tr>
                        </thead>
                        <tbody class="tbody">
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">sm_efficiency</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">The percentage of time at least one warp is active
                                 on a multiprocessor averaged over all multiprocessors on the GPU
                              </td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">sm_efficiency_instance</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">The percentage of time at least one warp is active
                                 on a specific multiprocessor
                              </td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">achieved_occupancy</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Ratio of the average active warps per active cycle
                                 to the maximum number of warps supported on a
                                 multiprocessor
                              </td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">issue_slot_utilization</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Percentage of issue slots that issued at least one
                                 instruction, averaged across all cycles
                              </td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">inst_executed</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">The number of instructions executed</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">inst_issued</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">The number of instructions issued</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">issue_slots</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">The number of issue slots used</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">executed_ipc</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Instructions executed per cycle</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">issued_ipc</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Instructions issued per cycle</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">ipc_instance</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Instructions executed per cycle for a single multiprocessor</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">inst_per_warp</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Average number of instructions executed by each warp</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">cf_issued</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Number of issued control-flow instructions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">cf_executed</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Number of executed control-flow instructions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">ldst_issued</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Number of issued load and store instructions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">ldst_executed</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Number of executed load and store instructions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">branch_efficiency</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Ratio of non-divergent branches to total branches</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">warp_execution_efficiency</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Ratio of the average active threads per warp to the
                                 maximum number of threads per warp supported on a
                                 multiprocessor
                              </td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">warp_nonpred_execution_efficiency</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Ratio of the average active threads per warp
                                 executing non-predicated instructions to the maximum
                                 number of threads per warp supported on a
                                 multiprocessor
                              </td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">inst_replay_overhead</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Average number of replays for each instruction executed</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">shared_replay_overhead</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Average number of replays due to shared memory
                                 conflicts for each instruction executed
                              </td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">global_cache_replay_overhead</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Average number of replays due to global memory
                                 cache misses for each instruction executed
                              </td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">local_replay_overhead</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Average number of replays due to local memory
                                 accesses for each instruction executed
                              </td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">gld_efficiency</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Ratio of requested global memory load throughput to
                                 required global memory load throughput
                              </td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">gst_efficiency</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Ratio of requested global memory store throughput
                                 to required global memory store throughput
                              </td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">gld_transactions</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Number of global memory load transactions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">gst_transactions</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Number of global memory store transactions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">gld_transactions_per_request</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Average number of global memory load transactions performed for each global memory load</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">gst_transactions_per_request</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Average number of global memory store transactions performed for each global memory store</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">gld_throughput</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Global memory load throughput</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">gst_throughput</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Global memory store throughput</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">gld_requested_throughput</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Requested global memory load throughput</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">gst_requested_throughput</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Requested global memory store throughput</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">local_load_transactions</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Number of local memory load transactions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">local_store_transactions</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Number of local memory store transactions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">local_load_transactions_per_request</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Average number of local memory load transactions performed for each local memory load</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">local_store_transactions_per_request</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Average number of local memory store transactions performed for each local memory store</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">local_load_throughput</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Local memory load throughput</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">local_store_throughput</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Local memory store throughput</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">shared_load_transactions</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Number of shared memory load transactions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">shared_store_transactions</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Number of shared memory store transactions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">shared_load_transactions_per_request</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Average number of shared memory load transactions performed for each shared memory load</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">shared_store_transactions_per_request</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Average number of shared memory store transactions performed for each shared memory store</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">shared_load_throughput</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Shared memory load throughput</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">shared_store_throughput</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Shared memory store throughput</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">shared_efficiency</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Ratio of requested shared memory throughput to required shared memory throughput</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">dram_read_transactions</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Device memory read transactions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">dram_write_transactions</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Device memory write transactions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">dram_read_throughput</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Device memory read throughput</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">dram_write_throughput</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Device memory write throughput</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">sysmem_read_transactions</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">System memory read transactions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">sysmem_write_transactions</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">System memory write transactions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">sysmem_read_throughput</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">System memory read throughput</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">sysmem_write_throughput</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">System memory write throughput</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">l1_cache_global_hit_rate</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Hit rate in L1 cache for global loads</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">l1_cache_local_hit_rate</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Hit rate in L1 cache for local loads and stores</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">tex_cache_hit_rate</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Texture cache hit rate</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">tex_cache_transactions</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Texture cache read transactions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">tex_cache_throughput</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Texture cache throughput</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">l2_read_transactions</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Memory read transactions seen at L2 cache for all read requests</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">l2_write_transactions</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Memory write transactions seen at L2 cache for all write requests</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">l2_read_throughput</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Memory read throughput seen at L2 cache for all read requests</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">l2_write_throughput</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Memory write throughput seen at L2 cache for all write requests</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">l2_l1_read_hit_rate</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Hit rate at L2 cache for all read requests from L1 cache</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">l2_l1_read_throughput</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Memory read throughput seen at L2 cache for read
                                 requests from L1 cache
                              </td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">l2_texture_read_hit_rate</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Hit rate at L2 cache for all read requests from texture cache</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">l2_texture_read_throughput</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Memory read throughput seen at L2 cache for read
                                 requests from the texture cache
                              </td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">local_memory_overhead</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Ratio of local memory traffic to total memory
                                 traffic between the L1 and L2 caches
                              </td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">l1_shared_utilization</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">The utilization level of the L1/shared memory relative to the peak utilization</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">l2_utilization</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">The utilization level of the L2 cache relative to the peak utilization</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">tex_utilization</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">The utilization level of the texture cache relative to the peak utilization</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">dram_utilization</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">The utilization level of the device memory relative to the peak utilization</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">sysmem_utilization</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">The utilization level of the system memory relative to the peak utilization</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Single-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">ldst_fu_utilization</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">The utilization level of the multiprocessor function units that execute load and store instructions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">int_fu_utilization</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">The utilization level of the multiprocessor function units that execute integer instructions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">cf_fu_utilization</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">The utilization level of the multiprocessor function units that execute control-flow instructions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">tex_fu_utilization</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">The utilization level of the multiprocessor function units that execute texture instructions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">fp_fu_utilization</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">The utilization level of the multiprocessor function units that execute floating point instructions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">fpspec_fu_utilization</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">The utilization level of the multiprocessor function units that execute special floating point instructions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">misc_fu_utilization</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">The utilization level of the multiprocessor function units that execute miscellaneous instructions</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">flops_sp</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Single-precision floating point operations executed</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">flops_sp_add</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Single-precision floating point add operations executed</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">flops_sp_mul</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Single-precision floating point multiply operations executed</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">flops_sp_fma</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Single-precision floating point multiply-accumulate operations executed</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">flops_dp</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Double-precision floating point operations executed</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">flops_dp_add</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Double-precision floating point add operations executed</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">flops_dp_mul</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Double-precision floating point multiply operations executed</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">flops_dp_fma</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Double-precision floating point multiply-accumulate operations executed</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">flops_sp_special</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Single-precision floating point special operations executed</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">stall_inst_fetch</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Percentage of stalls occurring because the next assembly instruction has not yet been fetched</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">stall_exec_dependency</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Percentage of stalls occurring because an input required by the instruction is not yet available</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">stall_data_request</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Percentage of stalls occurring because a memory operation cannot be performed, either because the required resources are unavailable
                                 or fully utilized, or because too many requests of a given type are outstanding
                              </td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">stall_sync</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Percentage of stalls occurring because the warp is blocked at a __syncthreads() call</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">stall_texture</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Percentage of stalls occurring because the texture sub-system is fully utilized or has too many outstanding requests</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                           <tr class="row">
                              <td class="entry" valign="top" width="36.36363636363637%" headers="d54e5197" rowspan="1" colspan="1">stall_other</td>
                              <td class="entry" valign="top" width="45.45454545454545%" headers="d54e5200" rowspan="1" colspan="1">Percentage of stalls occurring due to miscellaneous reasons</td>
                              <td class="entry" valign="top" width="18.181818181818183%" headers="d54e5203" rowspan="1" colspan="1">Multi-context</td>
                           </tr>
                        </tbody>
                     </table>
                  </div>
               </div>
            </div>
         </div>
         <div class="topic concept nested0" id="notices-header"><a name="notices-header" shape="rect">
               <!-- --></a><h2 class="title topictitle1"><a href="#notices-header" name="notices-header" shape="rect">Notices</a></h2>
            <div class="topic reference nested1" id="notice"><a name="notice" shape="rect">
                  <!-- --></a><h3 class="title topictitle2"><a href="#notice" name="notice" shape="rect"></a></h3>
               <div class="body refbody">
                  <div class="section">
                     <h3 class="title sectiontitle">Notice</h3>
                     <p class="p">ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND
                        SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE
                        WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS
                        FOR A PARTICULAR PURPOSE. 
                     </p>
                     <p class="p">Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the
                        consequences of use of such information or for any infringement of patents or other rights of third parties that may result
                        from its use. No license is granted by implication or otherwise under any patent rights of NVIDIA Corporation. Specifications
                        mentioned in this publication are subject to change without notice. This publication supersedes and replaces all other information
                        previously supplied. NVIDIA Corporation products are not authorized as critical components in life support devices or systems
                        without express written approval of NVIDIA Corporation.
                     </p>
                  </div>
               </div>
            </div>
            <div class="topic reference nested1" id="trademarks"><a name="trademarks" shape="rect">
                  <!-- --></a><h3 class="title topictitle2"><a href="#trademarks" name="trademarks" shape="rect"></a></h3>
               <div class="body refbody">
                  <div class="section">
                     <h3 class="title sectiontitle">Trademarks</h3>
                     <p class="p">NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation
                        in the U.S. and other countries.  Other company and product names may be trademarks of
                        the respective companies with which they are associated.
                     </p>
                  </div>
               </div>
            </div>
            <div class="topic reference nested1" id="copyright-past-to-present"><a name="copyright-past-to-present" shape="rect">
                  <!-- --></a><h3 class="title topictitle2"><a href="#copyright-past-to-present" name="copyright-past-to-present" shape="rect"></a></h3>
               <div class="body refbody">
                  <div class="section">
                     <h3 class="title sectiontitle">Copyright</h3>
                     <p class="p">© <span class="ph">2007</span>-<span class="ph">2013</span> NVIDIA
                        Corporation. All rights reserved.
                     </p>
                  </div>
               </div>
            </div>
         </div>
         
         <hr id="contents-end"></hr>
         <div id="release-info">Profiler
            (<a href="../../pdf/CUDA_Profiler_Users_Guide.pdf">PDF</a>)
            -
            CUDA Toolkit v5.5
            (<a href="https://developer.nvidia.com/cuda-toolkit-archive">older</a>)
            -
            Last updated 
            July 19, 2013
            -
            <a href="mailto:cudatools@nvidia.com?subject=CUDA Tools Documentation Feedback: profiler-users-guide">Send Feedback</a></div>
         
      </article>
      
      <header id="header"><span id="company">NVIDIA</span><span id="site-title">CUDA Toolkit Documentation</span>
      </header>
      <nav id="site-nav">
         <div class="category"><span class="twiddle">▼</span><a href="index.html" title="Profiler">Profiler</a></div>
         <ul>
            <li><a href="#profiling-overview">Profiling Overview</a><ul>
                  <li><a href="#whats-new">What's New</a></li>
                  <li><a href="#terminology">Terminology</a></li>
               </ul>
            </li>
            <li><a href="#prepare-application">1.&nbsp;Preparing An Application For Profiling</a><ul>
                  <li><a href="#focusing-profiling">1.1.&nbsp;Focused Profiling</a></li>
                  <li><a href="#marking-regions-of-cpu-activity">1.2.&nbsp;Marking Regions of CPU Activity</a></li>
                  <li><a href="#naming-cpu-objects">1.3.&nbsp;Naming CPU and CUDA Resources</a></li>
                  <li><a href="#flush-profile-data">1.4.&nbsp;Flush Profile Data</a></li>
                  <li><a href="#dynamic-parallelism">1.5.&nbsp;Dynamic Parallelism</a></li>
               </ul>
            </li>
            <li><a href="#visual-profiler">2.&nbsp;Visual Profiler</a><ul>
                  <li><a href="#getting-started">2.1.&nbsp;Getting Started</a><ul>
                        <li><a href="#modify-your-application">2.1.1.&nbsp;Modify Your Application For Profiling</a></li>
                        <li><a href="#creating-session">2.1.2.&nbsp;Creating a Session</a></li>
                        <li><a href="#analyzing-your-application">2.1.3.&nbsp;Analyzing Your Application</a></li>
                        <li><a href="#exploring-timeline">2.1.4.&nbsp;Exploring the Timeline</a></li>
                        <li><a href="#looking-at-details">2.1.5.&nbsp;Looking at the Details</a></li>
                     </ul>
                  </li>
                  <li><a href="#sessions">2.2.&nbsp;Sessions</a><ul>
                        <li><a href="#executable-session">2.2.1.&nbsp;Executable Session</a></li>
                        <li><a href="#import-session">2.2.2.&nbsp;Import Session</a><ul>
                              <li><a href="#import-nvprof-session">2.2.2.1.&nbsp;Import nvprof Session</a></li>
                              <li><a href="#import-csv-session">2.2.2.2.&nbsp;Import Command-Line Profiler Session</a></li>
                           </ul>
                        </li>
                     </ul>
                  </li>
                  <li><a href="#application-requirements">2.3.&nbsp;Application Requirements</a></li>
                  <li><a href="#profiling-limitations">2.4.&nbsp;Profiling Limitations</a></li>
                  <li><a href="#visual-profiler-views">2.5.&nbsp;Visual Profiler Views</a><ul>
                        <li><a href="#timeline-view">2.5.1.&nbsp;Timeline View</a><ul>
                              <li><a href="#timeline-controls">2.5.1.1.&nbsp;Timeline Controls</a></li>
                              <li><a href="#navigating-timeline">2.5.1.2.&nbsp;Navigating the Timeline</a></li>
                           </ul>
                        </li>
                        <li><a href="#analysis-view">2.5.2.&nbsp;Analysis View</a></li>
                        <li><a href="#details-view">2.5.3.&nbsp;Details View</a></li>
                        <li><a href="#properties-view">2.5.4.&nbsp;Properties View</a></li>
                        <li><a href="#console-view">2.5.5.&nbsp;Console View</a></li>
                        <li><a href="#settings-view">2.5.6.&nbsp;Settings View</a></li>
                     </ul>
                  </li>
                  <li><a href="#customizing-visual-profiler">2.6.&nbsp;Customizing the Visual Profiler</a><ul>
                        <li><a href="#resizing-view">2.6.1.&nbsp;Resizing a View</a></li>
                        <li><a href="#reordering-view">2.6.2.&nbsp;Reordering a View</a></li>
                        <li><a href="#moving-view">2.6.3.&nbsp;Moving a View</a></li>
                        <li><a href="#undocking-view">2.6.4.&nbsp;Undocking a View</a></li>
                        <li><a href="#conceptId">2.6.5.&nbsp;Opening and Closing a View</a></li>
                     </ul>
                  </li>
               </ul>
            </li>
            <li><a href="#nvprof-overview">3.&nbsp;nvprof</a><ul>
                  <li><a href="#profiling-modes">3.1.&nbsp;Profiling Modes</a><ul>
                        <li><a href="#summary-mode">3.1.1.&nbsp;Summary Mode</a></li>
                        <li><a href="#gpu-trace-and-api-trace-modes">3.1.2.&nbsp;GPU-Trace and API-Trace Modes</a></li>
                        <li><a href="#event-summary-mode">3.1.3.&nbsp;Event/metric Summary Mode</a></li>
                        <li><a href="#event-trace-mode">3.1.4.&nbsp;Event/metric Trace Mode</a></li>
                     </ul>
                  </li>
                  <li><a href="#profiling-controls">3.2.&nbsp;Profiling Controls</a><ul>
                        <li><a href="#timeout">3.2.1.&nbsp;Timeout</a></li>
                        <li><a href="#concurrent-kernels">3.2.2.&nbsp;Concurrent Kernels</a></li>
                        <li><a href="#profiling-scope">3.2.3.&nbsp;Profiling Scope</a></li>
                        <li><a href="#multiprocess-profiling">3.2.4.&nbsp;Multiprocess Profiling</a></li>
                        <li><a href="#system-profiling">3.2.5.&nbsp;System Profiling</a></li>
                     </ul>
                  </li>
                  <li><a href="#output">3.3.&nbsp;Output</a><ul>
                        <li><a href="#adjust-units">3.3.1.&nbsp;Adjust Units</a></li>
                        <li><a href="#csv">3.3.2.&nbsp;CSV</a></li>
                        <li><a href="#export-import">3.3.3.&nbsp;Export/Import</a></li>
                        <li><a href="#demangling">3.3.4.&nbsp;Demangling</a></li>
                        <li><a href="#redirecting-output">3.3.5.&nbsp;Redirecting Output</a></li>
                     </ul>
                  </li>
                  <li><a href="#limitations">3.4.&nbsp;Limitations</a></li>
               </ul>
            </li>
            <li><a href="#compute-command-line-profiler-overview">4.&nbsp;Command Line Profiler</a><ul>
                  <li><a href="#command-line-profiler-control">4.1.&nbsp;Command Line Profiler Control</a></li>
                  <li><a href="#command-line-profiler-default-output">4.2.&nbsp;Command Line Profiler Default Output</a></li>
                  <li><a href="#command-line-profiler-configuration">4.3.&nbsp;Command Line Profiler Configuration</a><ul>
                        <li><a href="#command-line-profiler-options">4.3.1.&nbsp;Command Line Profiler Options</a></li>
                        <li><a href="#topic_DC0AA473DE8A4917AC94C817BD231DBB">4.3.2.&nbsp;Command Line Profiler Counters</a></li>
                     </ul>
                  </li>
                  <li><a href="#command-line-profiler-output">4.4.&nbsp;Command Line Profiler Output</a></li>
               </ul>
            </li>
            <li><a href="#remote-profiling">5.&nbsp;Remote Profiling</a><ul>
                  <li><a href="#collecting-remote-data">5.1.&nbsp;Collect Data On Remote System</a></li>
                  <li><a href="#using-remote-data">5.2.&nbsp;View And Analyze Data</a></li>
                  <li><a href="#remote-limitations">5.3.&nbsp;Limitations</a></li>
               </ul>
            </li>
            <li><a href="#nvtx">6.&nbsp;NVIDIA Tools Extension</a><ul>
                  <li><a href="#nvtx-overview">6.1.&nbsp;NVTX API Overview</a></li>
                  <li><a href="#nvtx-events">6.2.&nbsp;NVTX API Events</a><ul>
                        <li><a href="#nvtx-marker">6.2.1.&nbsp;NVTX Markers</a></li>
                        <li><a href="#nvtx-range-start-stop">6.2.2.&nbsp;NVTX Range Start/Stop</a></li>
                        <li><a href="#nvtx-range-push-pop">6.2.3.&nbsp;NVTX Range Push/Pop</a></li>
                        <li><a href="#nvtx-event-attribute-struct">6.2.4.&nbsp;Event Attributes Structure</a></li>
                     </ul>
                  </li>
                  <li><a href="#nvtx-naming">6.3.&nbsp;NVTX Resource Naming</a></li>
               </ul>
            </li>
            <li><a href="#mpi-profiling">7.&nbsp;MPI Profiling</a><ul>
                  <li><a href="#mpi-nvprof">7.1.&nbsp;MPI Profiling With nvprof</a></li>
                  <li><a href="#mpi-compute-command-line">7.2.&nbsp;MPI Profiling With The Command-Line Profiler</a></li>
               </ul>
            </li>
            <li><a href="#metrics-reference">8.&nbsp;Metrics Reference</a></li>
         </ul>
      </nav>
      <script language="JavaScript" type="text/javascript" charset="utf-8" src="../common/formatting/common.min.js"></script>
      </body>
</html>