<html>
<body>
<p align="center"><strong><font size="6">NVIDIA CUDA Visual Profiler Version 3.0
</font></strong></p>
<p>
     
 

Published by<br>
   NVIDIA Corporation<br>
   2701 San Tomas Expressway<br>
   Santa 
Clara, CA 95050</p>
<p>
<br>

<a name="Notice">Notice</a>
<B><h3><a name="agreement">BY DOWNLOADING THIS FILE, USER AGREES TO THE FOLLOWING:</a></h3></B>

ALL NVIDIA SOFTWARE, DESIGN SPECIFICATIONS, REFERENCE 
BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND 
SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS". NVIDIA MAKES NO WARRANTIES, 
EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND 
EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, 
AND FITNESS FOR A PARTICULAR PURPOSE. </p>
      
<p>
Information furnished is believed to be accurate and reliable. However, 
NVIDIA Corporation assumes no responsibility for the consequences of use 
of such information or for any infringement of patents or other rights 
of third parties that may result from its use. No license is granted by 
implication or otherwise under any patent or patent rights of NVIDIA 
Corporation. Specifications mentioned in this publication are subject 
to change without notice. These materials supersede and replace all 
information previously supplied. NVIDIA Corporation products are not 
authorized for use as critical components in life support devices or 
systems without express written approval of NVIDIA Corporation. 
<p>
Trademarks<br>

NVIDIA, CUDA, and the NVIDIA logo are trademarks or registered trademarks 
of NVIDIA Corporation in the United States and other countries. Other 
company and product names may be trademarks of the respective companies 
with which they are associated.
<p>

Copyright (C) 2007-2010 by NVIDIA Corporation. All rights reserved. <br><br>

PLEASE REFER TO EULA.txt FOR THE LICENSE AGREEMENT FOR USING NVIDIA SOFTWARE.
<p>
   
<B><h3><a name="ListOfFeatures">List of supported features:</a></h3></B>
    Execute a CUDA program with profiling enabled and view the profiler output
    as a table. The table has the following columns for each GPU method:<br><br> 
    <ul>
     <li><B>GPU Timestamp:</B> Start time stamp. <br></li>
     <li><B>Method:</B> GPU method name. This is either "memcpy*" for memory copies or the name of a GPU kernel. 
               Memory copies have a suffix that describes 
               the type of a memory transfer, e.g. "memcpyDToHasync" means an asynchronous transfer 
               from Device memory to Host memory. <br></li>
     <li><B>GPU Time:</B> Execution time for the method on the GPU.<br></li>
     <li><B>CPU Time:</B> The sum of the GPU time and the CPU overhead to launch the method. At the level of 
                          driver-generated data, CPU Time is only the CPU overhead to launch the method for 
                          non-blocking methods; for blocking methods it is the sum of the GPU time and the CPU 
                          overhead. All kernel launches are non-blocking by default, but if any profiler 
                          counters are enabled, kernel launches are blocking. Asynchronous memory copy requests 
                          in different streams are non-blocking.<br></li>
     <li><B>Stream Id</B>         : Identification number for the stream <br></li>
     <li><B>Columns only for kernel methods:</B> </li>
        <ul type="circle">
         <li><B>Occupancy</B> : Occupancy is the ratio of the number of active warps per multiprocessor to the maximum
number of active warps. <br></li>
         <li><B>Profiler counters</B>: Refer to the profiler counters section for the list of supported counters. </li>
         <li><B>grid size X</B>         : Number of blocks in the grid along dimension X<br></li>
         <li><B>grid size Y</B>         : Number of blocks in the grid along dimension Y<br></li>
         <li><B>block size X</B>        : Number of threads in a block along dimension X<br></li>
         <li><B>block size Y</B>        : Number of threads in a block along dimension Y<br></li>
         <li><B>block size Z</B>        : Number of threads in a block along dimension Z<br></li>
         <li><B>dyn smem per block</B>: Dynamic shared memory size per block in bytes<br></li>
         <li><B>sta smem per block</B>: Static shared memory size per block in bytes <br></li>
         <li><B>reg per thread</B>: Number of registers per thread <br></li>
        </ul>
     <li><B>Columns only for memcopy methods:</B> </li>
        <ul type="circle">
         <li><B>mem transfer size</B>: Memory transfer size in bytes<br></li>
         <li><B>mem transfer hostmemtype</B>: Type of host memory used for a DtoH or HtoD transfer. It is either pageable or page-locked.<br></li>
         <li><B>host mem transfer type</B>: Specifies whether a memory transfer uses "Pageable" or "Page-locked" host memory.<br></li>
        </ul>
    </ul>
        Please refer to the "Interpreting Profiler Counters" section below for more
        information on profiler counters. Note that profiler counters are also
        referred to as profiler signals.<br> <br>
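<p>The Occupancy and CPU Time definitions above can be illustrated with a minimal Python sketch. The function names and sample values are hypothetical and are not part of cudaprof; the limit of 32 warps per multiprocessor is only an example value, since the real limit depends on the GPU.</p>

```python
def occupancy(active_warps_per_sm, max_warps_per_sm):
    """Ratio of active warps per multiprocessor to the maximum number of active warps."""
    if max_warps_per_sm <= 0:
        raise ValueError("max_warps_per_sm must be positive")
    return active_warps_per_sm / max_warps_per_sm


def launch_overhead_usec(cpu_time_usec, gpu_time_usec):
    """CPU launch overhead of a blocking method: CPU time minus GPU time."""
    return cpu_time_usec - gpu_time_usec


# Example values only (assumed, not measured).
print(occupancy(16, 32))                 # prints 0.5
print(launch_overhead_usec(25.0, 18.5))  # prints 6.5
```

<p>The subtraction applies only to blocking methods, where CPU Time includes the GPU time; for non-blocking methods CPU Time is already the launch overhead.</p>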
    Display the summary profiler table. It has the following columns for each
    GPU method:<br>  
    <ul>
        <li><B>Method</B>: Method name<br></li>
        <li><B>#calls</B>: Number of calls<br></li> 
        <li><B>GPU usec</B>: Total GPU time in microseconds<br></li>
        <li><B>CPU usec</B>: Total CPU time in microseconds<br></li>
        <li><B>%GPU time</B>: Percentage GPU time <br></li>
        <li><B>Total counts for each profiler counter </B><br></li>
        <li><B>glob mem read throughput(GB/s)</B>: Global memory read throughput in gigabytes per second.<br></li>
        <li><B>glob mem write throughput(GB/s)</B>: Global memory write throughput in gigabytes per second.<br></li>
        <li><B>glob mem overall throughput(GB/s)</B>: Overall global memory throughput in gigabytes per second.<br></li>
        <li><B>instruction throughput</B>: Instruction throughput ratio for each kernel<br></li>
    </ul>
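<p>The way the summary table aggregates per-call rows can be sketched in Python as follows. This is an illustration under assumptions, not the profiler's own implementation: the row layout (method, GPU time, CPU time) and the method names are hypothetical, and profiler counter totals and throughput columns are left out.</p>

```python
from collections import defaultdict


def summarize(rows):
    """Aggregate (method, gpu_usec, cpu_usec) rows into per-method summary rows."""
    totals = defaultdict(lambda: {"calls": 0, "gpu_usec": 0.0, "cpu_usec": 0.0})
    for method, gpu_usec, cpu_usec in rows:
        entry = totals[method]
        entry["calls"] += 1
        entry["gpu_usec"] += gpu_usec
        entry["cpu_usec"] += cpu_usec

    total_gpu = sum(entry["gpu_usec"] for entry in totals.values())
    summary = []
    for method, entry in totals.items():
        pct = 100.0 * entry["gpu_usec"] / total_gpu if total_gpu else 0.0
        summary.append({"method": method, "pct_gpu_time": pct, **entry})

    # Sort in decreasing order of total GPU time.
    summary.sort(key=lambda row: row["gpu_usec"], reverse=True)
    return summary


# Hypothetical per-call rows: (method, GPU usec, CPU usec).
rows = [
    ("kernelA", 75.0, 90.0),
    ("memcpyHtoD", 25.0, 40.0),
    ("kernelA", 75.0, 88.0),
    ("kernelB", 25.0, 35.0),
]
for row in summarize(rows):
    print(row["method"], row["calls"], row["gpu_usec"], row["pct_gpu_time"])
```

<p>The sketch sorts purely by total GPU time; it does not reproduce any special placement of memory copy rows.</p>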
    Display various kinds of plots: <br>
    <ul>
        <li>Summary profiling data bar plot<br></li>
        <li>GPU Time Height plot<br></li>
        <li>GPU Time Width plot<br></li>
        <li>Profiler counter bar plot<br></li>
        <li>Profiler output table column bar plot<br></li>
        <li>Comparison Summary plot<br></li>
    </ul>
    Analysis of profiler output lists methods with a high number of: <br>
    <ul type="square">
           <li>Uncoalesced stores <br></li>
           <li>Uncoalesced loads <br></li>
           <li>Warp serializations <br></li>
    </ul>
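<p>One simple way to find such methods is to rank them by the total value of a counter. The Python sketch below is only an assumption about how this could be done; the function name, the data layout and the sample numbers are hypothetical, and the criteria cudaprof itself uses for "high" are not described here.</p>

```python
def top_methods(counter_totals, counter, n=3):
    """Return up to n (method, value) pairs with the largest non-zero total for one counter."""
    ranked = sorted(
        counter_totals.items(),
        key=lambda item: item[1].get(counter, 0),
        reverse=True,
    )
    return [
        (method, counters.get(counter, 0))
        for method, counters in ranked[:n]
        if counters.get(counter, 0) > 0
    ]


# Hypothetical per-method counter totals.
totals = {
    "kernelA": {"gld uncoalesced": 10, "gst uncoalesced": 0},
    "kernelB": {"gld uncoalesced": 400, "gst uncoalesced": 120},
}
print(top_methods(totals, "gld uncoalesced"))
print(top_methods(totals, "gst uncoalesced"))
```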

    Compare profiler output for multiple program runs of the same program or for different programs.<br><br>

    Each program run is referred to as a session.<br><br>

    Save profiling data for multiple sessions. A group of sessions is referred to as a project. <br><br>

    Import/Export CUDA Profiler CSV format data. <br><br>
   
<B><h3><a name="PlotDescription">Description of different plots:</a></h3></B>
        <B><h5><a name="SummaryProfilingDataBarPlot">Summary profiling data bar plot:</a></h5></B>
    <ul>
            <li>One bar for each method.</li>
            <li>Bars are sorted in decreasing order of GPU time.</li>
            <li>Bar length is proportional to the cumulative GPU time for a method.</li>
    </ul>
         <B><h5><a name="GPUTimeHeightPlot">GPU time height plot:</a></h5></B>
            It is a bar diagram in which the height of each bar is proportional 
            to the GPU time for a method and a different bar color is assigned 
            for each method. A legend is displayed which shows the color assignment
            for different methods. The width of each bar is fixed and the bars 
            are displayed in the order in which the methods are executed. When the 
            "fit in window" option is enabled the display is adjusted so as to fit
            all the bars in the displayed window width. In this case bars for multiple 
            methods can overlap. The overlapped bars are displayed in decreasing order 
            of height so that all the different bars are visible. When the "Show CPU Time" 
            option is enabled the CPU time is shown as a bar in a different color on 
            top of the GPU time bar. The height of this bar is proportional to the 
            difference of CPU time and GPU time for the method.

        <B><h5><a name="GPUTimeWidthPlot">GPU time width plot:</a></h5></B>
            It is a bar diagram in which the width of each bar is proportional to 
            the GPU time for a method and a different bar color is assigned for each
            method. A legend is displayed which shows the color assignment for 
            different methods. The bars are displayed in the order in which the
            methods are executed. When time stamps are enabled the bars are positioned
            based on the time stamp. The height of each bar is based on the option 
            chosen: 
    <ol type="a">
                   <li>Fixed height : height is fixed.</li>
                   <li>Height proportional to instruction issue rate: the instruction 
                     issue rate for a method is equal to profiler "instructions" counter 
                     value divided by the gpu time for the method.</li> 
                   <li>Height proportional to uncoalesced load + store rate: the uncoalesced
                     load + store rate for a method is equal to the sum of profiler 
                     "gld uncoalesced" and "gst uncoalesced" counter values divided by the 
                     gpu time for the method. </li>
                   <li>Occupancy: height is proportional to the occupancy for the method.</li>
    </ol>
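<p>The two rate-based height options above reduce to simple divisions. The following Python sketch restates them; the function names and sample values are hypothetical, and the rates are expressed per microsecond on the assumption that GPU time is given in microseconds.</p>

```python
def instruction_issue_rate(instructions, gpu_time_usec):
    """Profiler "instructions" counter value divided by the GPU time for the method."""
    return instructions / gpu_time_usec


def uncoalesced_rate(gld_uncoalesced, gst_uncoalesced, gpu_time_usec):
    """Sum of "gld uncoalesced" and "gst uncoalesced" divided by the GPU time."""
    return (gld_uncoalesced + gst_uncoalesced) / gpu_time_usec


# Example values only (assumed, not measured).
print(instruction_issue_rate(5000, 250.0))  # prints 20.0
print(uncoalesced_rate(30, 10, 8.0))        # prints 5.0
```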

            In case of multiple streams or multiple devices the "Split Options" can be used.
    <ol type="a">
                   <li>No Split : A single horizontal group of bars is displayed. Even in case of multiple streams or multiple devices the data is displayed in a single group.</li>
                   <li>Split on Device: In case of multiple devices one separate horizontal group of bars is displayed for each device.</li> 
                   <li>Split on Stream: In case of multiple streams one separate horizontal group of bars is displayed for each stream. </li>
    </ol>    
        <B><h5><a name="ProfilerCounterBarPlot">Profiler counter bar plot:</a></h5></B>
            It is a bar plot of the profiler counter values for a method from the profiler 
            output table or the summary table. There is one bar for each profiler counter. Bars
            are sorted in decreasing order of profiler counter value. Bar length is proportional to 
            the profiler counter value.

        <B><h5><a name="ProfilerOutputTableColumnBarPlot">Profiler output table column bar plot:</a></h5></B>
            It is a bar plot of any column of values from the profiler output table or 
            summary table. There is one bar for each row in the table. Bars are sorted in decreasing 
            order of column value. Bar length is proportional to the column value.

        <B><h5><a name="ComparisonSummaryPlot">Comparison summary plot:</a></h5></B>
            This plot can be used to compare GPU Time summary data for two sessions. 
            The Base Session is the session against which the comparison
            is made, and the other session selected for comparison is called the Compare
            Session. GPU Times for matching kernels from the two sessions are shown in a group. 
            For each matched kernel from the Compare Session, the percentage 
            increase or decrease with respect to the Base Session is displayed
            at the right end of the bar. After all the matched pairs, the GPU Times of the 
            unmatched kernels are shown. 
            At the bottom, two bars show the total GPU Times for the two sessions. 
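<p>The comparison described above can be sketched in Python. This is an illustration under assumptions: each session is represented as a hypothetical dictionary mapping a kernel name to its total GPU time, and kernels are matched by exact name.</p>

```python
def compare_sessions(base, compare):
    """Compare total GPU times (usec) per kernel between a Base and a Compare session."""
    matched = {}
    for method, base_time in base.items():
        if method in compare:
            # Percentage change of the Compare Session relative to the Base Session.
            if base_time == 0:
                matched[method] = None  # change is undefined for a zero base time
            else:
                matched[method] = 100.0 * (compare[method] - base_time) / base_time
    unmatched_base = sorted(set(base) - set(compare))
    unmatched_compare = sorted(set(compare) - set(base))
    return matched, unmatched_base, unmatched_compare


# Hypothetical summary data for two sessions.
base_session = {"kernelA": 200.0, "kernelB": 50.0}
compare_session = {"kernelA": 150.0, "kernelC": 40.0}

print(compare_sessions(base_session, compare_session))
print(sum(base_session.values()), sum(compare_session.values()))  # session totals
```

<p>A negative percentage means the kernel took less GPU time in the Compare Session than in the Base Session.</p>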

        <B><h5><a name="SummaryPlotDeviceLevel">Device level summary plot:</a></h5></B>
            There is one bar for each method. Bars are sorted in decreasing order of GPU time. Bar length 
            is proportional to the cumulative GPU time for a method across all contexts for a device.


        <B><h5><a name="SummaryPlotSessionLevel">Session level summary plot:</a></h5></B>
            There is one bar for each device. Bar length is proportional to GPU utilization. 
            GPU utilization is the ratio of the time during which the GPU was actually executing some method 
            to the total time interval from GPU start to end. The values are presented as percentages.
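<p>GPU utilization as defined above can be sketched in Python. The sketch assumes that each method is described by a start timestamp and a GPU time in the same unit, and that overlapping methods are counted once by merging their intervals; both points are assumptions, as are the function name and the sample values.</p>

```python
def gpu_utilization(methods):
    """Percentage of the start-to-end interval during which some method was executing.

    methods: list of (start, gpu_time) pairs in the same time unit.
    """
    if not methods:
        return 0.0
    spans = sorted((start, start + gpu_time) for start, gpu_time in methods)
    first_start = spans[0][0]
    last_end = max(end for _, end in spans)

    busy = 0.0
    cur_start, cur_end = spans[0]
    for start, end in spans[1:]:
        if start > cur_end:
            # Gap found: close the current busy interval and start a new one.
            busy += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlap: extend the current busy interval.
            cur_end = max(cur_end, end)
    busy += cur_end - cur_start

    total = last_end - first_start
    return 100.0 * busy / total if total > 0 else 0.0


# Hypothetical methods: busy 0-10 and 20-40 over a 0-40 interval.
print(gpu_utilization([(0.0, 10.0), (20.0, 10.0), (25.0, 15.0)]))  # prints 75.0
```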


            


<B><h3><a name="SampleUsage">Steps for sample cudaprof usage:</a></h3></B>
<P>
    <BR><a name="SAMPLE1"><B>Sample1:</B></A> <br>
    <ul>
            <li>Open a new project using the main menu option <kbd>File-&gt;New</kbd> or the toolbar. Select the 
            project name and the project directory where the project files will be saved.<br></li>

            <li>Select the session settings through the dialog. <br> 
                Browse and select the CUDA program to profile. <br>
                Change the working directory if it is different from 
            the program directory. <br>
            Select options for profiler counters.<br>
            Select other kernel and memory transfer options. <br>
            Change maximum program execution time (if needed) <br></li>

            <li>Execute the CUDA program by clicking the Start button of the Session settings 
            dialog or through the main menu option <kbd>Session-&gt;Start</kbd>. If the CUDA program 
            executes correctly, the profiler output will be displayed. <br></li>

            <li>To display the summary table right click on "Session1->Device_0->Context_0" in the session tree. Choose 
            the "Summary table" option. Or use the "Summary table" tool bar option. <br></li>

            <li>To display the GPU Time summary plot right click on "Session1->Device_0->Context_0" in the  session tree
            and choose the "GPU Time Summary Plot" option. Or use the "GPUTime Summary Plot" tool
            bar option. <br></li>

            <li>You can scroll, resize or reposition the profiler output and GPU Time Summary plot 
            windows. <br></li>

            <li>Save the project by using the main menu option <kbd>File-&gt;Save</kbd> or the toolbar.<br></li> 

            <li>Exit cudaprof using the main menu option <kbd>File-&gt;Exit</kbd>.<br></li> 
    </ul><br>

<P>
    <BR><a name="SAMPLE2"><B>Sample2:</B> </A>

    <ul>
            <li> Open the project saved in SAMPLE1 or one of the sample projects using the main menu 
            option <kbd>File-&gt;Open</kbd>. The profiler output table will be displayed.<br></li>

            <li> To display the GPU Time Height plot right click on "Session1->Device_0->Context_0" in the session tree.
            Choose the "GPU Time Height Plot" option. Also try the "GPU Time Width Plot". <br></li>

            <li> Select settings for a new session by using the main menu option "Session-&gt;Session settings". 
            Browse and select the CUDA program to profile. Change the working directory if it is 
            different from the program directory.<br></li>

            <li> Execute the CUDA program by clicking the Start button of the Session settings dialog or 
            through the main menu option "Session-&gt;Start". If the CUDA program executes correctly, 
            the profiler output will be displayed. Compare the profiler output for "Session1" and "Session2".<br></li>

            <li> Try the "Profiler counter plot" and "Column plot" by right clicking on the appropriate row or column
            in the profiler output or summary table for a session. <br></li>
 
            <li> Exit cudaprof using the main menu option "File-&gt;Exit". <br></li>

    </ul>

<B><h3><a name="GUIDescription">Brief description of some cudaprof GUI components: </a></h3></B>
    Top line shows the main menu options:<Code> File, Profile, Session, Options, Window and Help.</Code>
    See the description below for details on the menu options.<P> 

    Second line has 4 groups of tool bar icons. <BR>
    <ul>
        <li> File tool bar group has: </li>
        <ul>
            <li> New project  </li>
            <li> Open existing project  </li>
            <li> Save project  </li>
        </ul>
        <li> Profile tool bar group has: </li>
        <ul>
            <li> Session settings  </li>
            <li> Start profiling  </li>
        </ul>
        <li> Session tool bar group has: </li>
        <ul>
            <li> Summary table  </li>
            <li> Summary plot  </li>
            <li> GPU time height plot </li> 
            <li> GPU time width plot  </li>
            <li> Device level summary plot  </li>
            <li> Session level summary plot  </li>
        </ul>
        <li> View options tool bar group has: </li>
        <ul>
            <li>Session view settings  </li>
        </ul>
    </ul>

The left vertical window lists all the sessions in the current project as a tree with three levels: 
sessions at the top level, devices under a session at the next level, and contexts 
under a device at the lowest level.  

The child of a session is named "Device_&lt;device_number&gt;", e.g. Device_0. 

The child of a device is named "Context_&lt;context_number&gt;", e.g. Context_0.


        <P>
Summary session information is displayed when a session is selected in the tree view.

        <ul>
            <li>Project name</li>
            <li>Project location</li> 
            <li>Session name</li>
            <li>Program location</li>
            <li>Working directory</li>
            <li>Arguments</li>
            <li>Session time</li>
            <li>Normalized Count</li>
            <li>Device Count and List</li>
            <li>Signal and Options count and List</li>
        </ul>  


        <P>
Summary device information is displayed when a device is selected in the tree view.
        <ul>
            <li>Device name</li>
            <li># Contexts</li> 
            <li>List of contexts with row count for each context</li>
        </ul>    



    Right clicking on a session item or a context item in the tree view brings up the context sensitive menus. 
    See the description below for details on the menu options. <P>
    
    Session context menu. <BR>
    <ul>
        <li> Rename  </li>
        <li> Delete  </li>
        <li> Copy Setting to current  </li>
        <li> Properties  </li>
    </ul>

    Session->Device->Context context menu. <BR>
    <ul>
        <li> Summary table  </li>
        <li> Kernel table  </li>
        <li> Memcopy table  </li>
        <li> GPU time summary plot  </li>
        <li> GPU time height plot  </li>
        <li> GPU time width plot  </li>
        <li> Comparison summary Plot  </li>
    </ul>

    The right workspace area contains tabbed windows for each session, for each device in a session, and for each context on a device.<br> 

    The different  windows for each context are shown as different tabs:<br> 
    <ul>
            <li> Profiler output table  </li>
            <li> Summary table  </li>
            <li> Kernel table  </li>
            <li> Memcopy table  </li>
            <li> GPU Time height plot  </li>
            <li> GPU Time width plot  </li>
            <li> Profiler counter plot  </li>
            <li> Column plot  </li>
            <li> Comparison plot </li>
    </ul>

    Table Header context menu, for Profiler Output table and Summary table. <BR>
    <ul>
        <li> Hide  </li>
        <li> Hide zero columns </li>
        <li> Show all columns </li>
    </ul>    

    Output window: When enabled, this window appears at the bottom. It displays the standard output and 
    standard error of the CUDA program that is run. Some additional status messages are also displayed 
    in this window.<P> 


    <a name="MAINMENU">Main menu</A>
    <ul>
        <li> "File" menu   
        <ul>
           <li> New: Create a new project. The "New project"
                dialog is opened to choose the project name and
                project directory. On OK the "Session settings" dialog
                is opened.  </li>
           <li> Open: Open an existing project. The "Open project" dialog 
                is opened to select the profiler project to be opened. 
                On "Open" the project data for all sessions is loaded 
                and the profiler data table is displayed.  </li>
           <li> Save: Save the current project. The profiler data for the 
                current open project is saved to disk.  </li>
           <li> Save As: Save the current project as a new project. The 
                project name &amp; directory can be selected. The profiler 
                data for the current open project is saved to disk. </li>
           <li> Close: Close the current project. The current open project is
                closed. All profiler session data is deleted from memory 
                and all open windows are closed. </li>
           <li> Delete: Delete the project. A file dialog is opened to select the project.
                The selected project file (.cpj) and related data files (.csv) are deleted.</li>
           <li> Import: Import CUDA profiler output in comma-separated values (CSV) format. 
                A new session is created in the current project and the 
                imported data is loaded.  </li>
           <li> Export: Export CUDA profiler output for the current session to
                a file in comma-separated values (CSV) format. </li>
           <li> List of recently opened profiler projects. </li>
           <li> Exit:  Exit the cudaprof program   </li>
        </ul>
        </li>

        <li> "Session" menu 

        <ul>
             <li> Session settings : Change session settings  </li>
             <li> Start : Start CUDA program with profiling enabled  </li>
           <li> Analyze profiler counters: Analyze profiler counter values for the 
               current session. This is the same as the profiler table context menu option. </li>
           <li> Analyze Occupancy: Report details of the occupancy calculation for each kernel
           and the factor that prevents maximum occupancy from being achieved. </li>
           <li> Global Memory Throughput: Display overall application level global memory read throughput, global memory write throughput and overall global memory throughput. </li>
           <li> Rename: Rename the current session.   </li>
           <li> Delete: Delete the current session. This is the same as the Session 
               context menu "Delete" option.   </li>
           <li> Copy settings to current: Copy the settings of the current session
               as the session settings to be used for a new profiling session.    </li>
           <li> Properties: Show the properties for the current session. This is the same
               as the Session context menu "Properties" option.    </li>        
        </ul>
        </li>


        <li> "View" menu  

        <ul>
           <li> Summary Table: View summary profiler table for current session. The
                summary table has the following columns: - 
                <ul> 
                   <li> Method: method name </li>
                   <li> #Calls: number of calls </li> 
                   <li> GPU usec: total GPU time in microseconds</li>
                   <li> CPU usec: total CPU time in microseconds (column is hidden by default)</li>
                   <li> %GPU time: Percentage of total GPU time across all methods</li>
                   <li> Cumulative count column for  each available profiler counter (columns are hidden by default)</li>
                   <li> mem read throughput (GB/s): Global memory read throughput in gigabytes per second. 
                   This is computed as (total bytes read)/(gpu time) where total bytes read is calculated 
                   using the profiler counters gld_32b, gld_64b and gld_128b.
                   Note that 1 gigabyte refers to 10^9 bytes in this calculation.  
                   This is supported only for GPUs with compute capability 1.2 or higher.
                   </li>
                   <li> mem write throughput (GB/s): Global memory write throughput in gigabytes per second. 
                   This is computed as (total bytes written)/(gpu time) where total bytes written 
                   is calculated using the profiler counters gst_32b, gst_64b and gst_128b. 
                   Note that 1 gigabyte refers to 10^9 bytes in this calculation.
                   This is supported only for GPUs with compute capability 1.2 or higher.
                   </li>
                   <li> mem overall throughput (GB/s): Overall global memory access throughput in gigabytes per second. 
                   This is computed as (total bytes read + total bytes written)/(gpu time). 
                   Total bytes read is calculated using the profiler counters gld_32b, gld_64b and gld_128b.
                   Total bytes written is calculated using the profiler counters gst_32b, gst_64b and gst_128b.
                   This is supported only for GPUs with compute capability 1.2 or higher.
                   </li>
                   <li> instruction throughput: Instruction throughput ratio. 
                   This is the ratio of achieved instruction rate to peak single issue instruction rate.
                   The achieved instruction rate is calculated using the "instructions" profiler counter.
                   The peak instruction rate is calculated based on the GPU clock speed.
                   When instruction dual-issue comes into play, this ratio can exceed 1.
                   </li>
                </ul>
               The rows in the table are sorted in decreasing order of total GPU time and 
               memcopy is shown as the last row.
   </li>
           <li> Kernel Table: Show the following kernel properties:
                <ul> 
                   <li> Grid Size (x,y both dimensions separately) </li>
                   <li> Thread Block Size (x,y,z all dimensions separately)</li> 
                   <li> Dynamic Shared Memory per Block </li>
                   <li> Static Shared Memory per Block </li>
                   <li> Registers per Thread</li>
                </ul>
           </li>
           <li> Memcopy Table: Show the following memcopy properties:
                <ul> 
                   <li> Memory Transfer Direction </li>
                   <li> Memory Transfer Size </li> 
                </ul>
           </li>
           <li> GPU Time Summary plot: View the GPU time summary plot for the current session. This is
                the same as the Session context menu "GPU Time Summary plot" option.   </li>
           <li> GPU Time Height plot: View the GPU time height plot for the current session. This is
               the same as the Session context menu "GPU Time Height plot" option.   </li>
           <li> GPU Time Width plot: View the GPU time width plot for the current session. This is the same
               as the Session context menu "GPU Time Width plot" option.   </li>
           <li> Comparison plot: View the Comparison plot with the current session as Base. It first opens
                a dialog for selecting the session for comparison, called the "Compare Session".</li>
         <li> Devices: Show the list of devices. Clicking a listed item 
                  shows the properties of the corresponding device.  </li>
        </ul>
  </li>

        <li> "Options" menu 
        <ul>
            <li> Session view settings: Change session view settings for the current session.   </li>
            <li> Default view settings: Change the default view settings to be used for new sessions.   </li>
            <li> Method Display Option: One of the following options to display method names :
                <ul> 
                   <li> Use Full Name : Full Mangled name is displayed.</li>
                   <li> Use Base Name : Only base name is displayed.</li> 
                   <li> Use Base Name with suffix : Full Mangled name with suffix is displayed.</li>
                </ul>
            </li>
            <li> <a name="GlobalScaleOptionForHeightPlot">Height plot: Change global GPU time height plot options.</A>     
                <ul>
                    <li> Use Global Scale: Enable / disable option to use a common global scale across multiple 
                         sessions.
                    </li>
                </ul>
            </li>
            <li> <a name="ColourConfiguration"> Plot Colors: Select colors for plots.</A>
                <ul>
                    <li>  Method Colors: Pops up a color dialog that can be used to select the colors used for 
                          different methods in plots. The colors are saved on application exit, so they can be 
                          used across cudaprof sessions.  
                    </li>
                </ul>
            </li>

            <li> Show output window: Enable / disable display of output window.   </li>
            <li> <a name="WindowsLayout"> Session window layout settings:</A> Change settings for display of multiple session windows.   </li>
            <li> <a name="EnvironmentVariableSetting">Environment variable settings:</A> Change environment variable settings used by the CUDA
               program.    </li>
        </ul>
        </li>


        <li> "Window" menu 
        <ul>
           <li> Close: Close active window   </li>
           <li> Close All: Close all open windows    </li>
           <li> Tile: Tile all open windows     </li>
           <li> Cascade: Cascade all open windows   </li>
        </ul>
        </li>

        <li> "Help" menu 
        <ul>
           <li>  Cuda Visual Profiler Help: Show the Help for Cuda Visual Profiler. (This is currently not supported on Mac OS)  </li>
           <li>  System Info: Show the Host system machine configuration information.   </li>
           <li>  About: Display CUDA Visual Profiler program version and copyright information.   </li>
        </ul>
        </li>
    </ul>
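<p>The throughput columns described under the "View" menu's Summary Table entry can be sketched in Python. The sketch assumes that each gld_32b, gld_64b and gld_128b count (and likewise each gst_* count) stands for one transaction of 32, 64 or 128 bytes, and it ignores any scaling the profiler may apply to raw counter values; the function names and sample numbers are hypothetical.</p>

```python
BYTES_PER_GB = 10**9  # 1 gigabyte is 10^9 bytes in this calculation


def bytes_from_counters(count_32b, count_64b, count_128b):
    """Total bytes, assuming one 32-, 64- or 128-byte transaction per count."""
    return 32 * count_32b + 64 * count_64b + 128 * count_128b


def throughput_gb_per_s(total_bytes, gpu_time_usec):
    """Throughput in GB/s for a byte total and a GPU time in microseconds."""
    seconds = gpu_time_usec / 1e6
    return total_bytes / BYTES_PER_GB / seconds


# Example values only (assumed, not measured).
bytes_read = bytes_from_counters(0, 1_000_000, 0)       # gld_32b, gld_64b, gld_128b
bytes_written = bytes_from_counters(500_000, 0, 0)      # gst_32b, gst_64b, gst_128b
gpu_time_usec = 1000.0

print(f"read:    {throughput_gb_per_s(bytes_read, gpu_time_usec):.1f} GB/s")
print(f"write:   {throughput_gb_per_s(bytes_written, gpu_time_usec):.1f} GB/s")
print(f"overall: {throughput_gb_per_s(bytes_read + bytes_written, gpu_time_usec):.1f} GB/s")
```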
<P>
    <a name="TOOLBARS">Tool bars </A>
    <ul>
      <li> File tool bar group: 
      <ul>
          <li> Create a new project: The behavior is the same as the "File-&gt;New" menu option </li>
          <li> Open an existing project: The behavior is the same as the "File-&gt;Open" menu option </li>
          <li> Save the current project: The behavior is the same as the "File-&gt;Save" menu option   </li>
      </ul>
      </li>
      <li> Profile tool bar group: 
      <ul>
          <li> Session settings: The behavior is the same as the "Session-&gt;Session settings" menu option   </li>
          <li> Start profiling: The behavior is the same as the "Session-&gt;Start" menu option   </li>
      </ul>
      </li>
      <li> Session tool bar group: 
      <ul>
          <li> Summary table: The behavior is the same as the "View-&gt;Summary table" menu option    </li>
          <li> Summary plot: The behavior is the same as the "View-&gt;Summary plot" menu option    </li>
          <li> GPU time height plot: The behavior is the same as the "View-&gt;GPU time height plot" 
              menu option    </li>
          <li> GPU time width plot: The behavior is the same as the "View-&gt;GPU time width plot" 
              menu option      </li>
      </ul>
      </li>
      <li> View options tool bar group has:
      <ul>
          <li> Session view settings: The behavior is the same as the "Options-&gt;Session View Settings" menu
          option  </li>   
      </ul>
      </li>
    </ul>

<P>
     <a name="DIALOGS">Dialogs </A>
    <ul>
        <li> "New project" dialog  <BR>  
        <ul>
               <li> Project Name: Name of the profiler project  </li> 
               <li> Project location: Directory where the project 
               files will  be saved   </li>
        </ul>
        </li> 
        <li> "Session settings" dialog  <BR>
        <ul>
                <li> "Session" Tab  <BR>
                <ul>
                   <li>Session Name: Name of the profiler session. By default a new session name is chosen 
                   ("Session1", "Session2", ...). The user can change this name. </li>
                   <li>Launch: Select the CUDA program to be profiled.</li>
                   <li>Working directory: Select the working directory to be used for running the CUDA program. </li>
                   <li>Device : Select the device to be used for running the CUDA program. </li>
                   <li>Arguments: Command line arguments to be passed to the CUDA program. </li>
                   <li>Max. execution time (in seconds): Maximum time to wait for the CUDA program to finish
                       executing. The program is aborted after this cutoff time. </li>
                   <li>Run in separate window: This option is useful for console applications which accept
                       keyboard input. In this case the CUDA program is run from a separate window. The 
                       standard output and standard error for the CUDA program are shown in this separate 
                       window. Note that this option is currently supported only on Linux, where a new "xterm" 
                       window is opened. </li>
                 </ul>
                </li>
                <li> "Profiler Counter" Tab  <BR>
                Profiler counters are logically grouped by function. 
                Because only a few of the selected profiler counters can be collected in a single application run, 
                the CUDA application is run multiple times. 
                <br>
                <ul>
                   <li>You can select or de-select all counters by using the "Select All Counters" check box. </li>
                   
                   <li> You can also select any subset of specific counters using the check box for each counter.
                        </li>
                    <li>You can enable or disable normalization of counter values by using the "Normalize counters" check box. </li>
                    <li>Profiler counters are available only with CUDA version 1.1 or later. </li>
                </ul>
                </li>
                <li> "Other Options" Tab  <BR>
                <ul>
                   <li> Timestamp: Include time stamps for kernel/method launches. The GPU timestamp
                        is the time when a method starts execution on the GPU. GPU timestamps are shifted in origin 
                        so that the minimum GPU timestamp is zero across all devices and all contexts in a session. </li> 
                   <li> Stream id: Include the stream id for each kernel/method. This feature is available only
                        with CUDA version 1.1 or later. </li> 
                   <li> Memory Transfer Size: Include the size of each memory transfer.
                        When profiling is done with this option enabled, the total size in bytes is shown 
                        in the Memcopy table.</li> 
                   <li> Kernel Option: A group of the following options:
                        <ul>
                             <li> Grid Size: Show the dimensions of the grid in blocks 
                                 (2-dimensional) in the Kernel table.
                             </li> 
                             <li> Thread Block Size: Show the dimensions of a block in 
                                  threads (3-dimensional).
                             </li>
                             <li> Dynamic shared memory size: Show the dynamic shared memory size.
                             </li>
                             <li> Static shared memory size: Show the static shared memory size.
                             </li>
                             <li> Register per thread: Show the register count per thread.
                             </li>
                        </ul> 
                   </li>   
                </ul>
                </li>
        </ul>
        </li> 
        <li> "Session View Settings" dialog  <BR>
        <ul>
                     This dialog can be invoked using the main menu option "Options-&gt;Session View Settings" or the 
                     toolbar. It allows changing the settings of the different views for the current session. 
                     There is a separate tab for each view. The dialog opens on the tab corresponding to 
                     the current view. Only tabs for views that have already been created can be selected. 

             <li> "Profiler Table" Tab   <BR>
             <ul>
               <li> Hide All Zero Counters: Enable / disable hiding of counter columns in which all values are zero. This is enabled by default. </li>
               <li> Columns Shown: Lists the columns which are shown. Select columns and move them from the hidden list to the shown list using 
               "&lt;&lt;". </li> 
               <li> Columns Hidden: Lists the columns which are hidden. Select columns and move them from the shown list to the
               hidden list using "&gt;&gt;".  </li>
             </ul>
             </li>
             <li> "Summary Table" Tab  <BR>
             <ul>
               <li> Show Average Data: Enable / disable showing average data values. When this option is disabled, the sum total across all
                       the calls for a method is shown. When this option is enabled, the total value is divided by the number of times 
                       the method is called and this average value for the method is displayed. This option is disabled by default.   </li>
               <li> Columns Shown:
                       Lists the columns which are shown. Select columns and move them from the hidden list to the shown list using "&lt;&lt;".   </li>
               <li> Columns Hidden: Lists the columns which are hidden. Select columns and move them from the shown list to the hidden list using 
                       "&gt;&gt;". The CPU usec and all counter columns are hidden by default.   </li>
             </ul>
            </li>
            <li> "Summary Plot" Tab  <BR>
            <ul>
              <li> Percentage Displayed: Enable/disable displaying percentage values. When this option is disabled, total values are shown.
                       This option is enabled by default.  </li>
              <li> Average Displayed: Enable/disable using average data values. When this option is disabled, total values are used. This
                  option is disabled by default. </li> 
              <li> Timestamp based Total: Enable/disable calculation of the total using the initial and final timestamps. If enabled, one extra bar labeled
                   "Gpu Idle", with the total number of method calls, is shown in a different color. </li> 

            </ul>
            </li>
            <li> "Height Plot" Tab 
            <ul>
               <li> Show legend: Enable / disable display of GPU Time plot legend  </li>
               <li> Fit in window: Enable / disable option to fit the GPU plot in the window. When fit is enabled multiple bars can overlap. </li>
               <li> Show CPU Time: Enable / disable option to show CPU time.  </li>
            </ul>
            </li>

            <li> "Width Plot" Tab  <BR>
            <ul>
              <li> Enable Time Stamp: Enable / disable option to use time stamps.   </li>
              <li> Show CPU Time: Enable / disable option to show CPU time.   </li>
              <li> Fit in window: Enable / disable option to fit the plot in the window.   </li>
              <li> Max Width of Bar: Maximum width of a bar in pixels. The plot display is updated immediately when this option changes, so
                       an appropriate value can be chosen interactively.   </li>
              <li> Bar Height Option: Choose option to use for bar height.   </li>
            </ul>
            </li>
            "Apply" changes the view properties temporarily; "Ok" changes them permanently.
        </ul>
        </li> 
        <li> "Default View Settings" dialog  <BR>
              This dialog can be invoked using the main menu option "Options-&gt;Default View Settings". It allows changing 
              the default settings which are used for new session views displayed subsequently. The settings
              are the same as those described for the "Session View Settings" dialog. 

        </li> 
        <li> "Method Colors" dialog  <BR>
              This dialog is invoked using the main menu option "Options-&gt;Plot Colors-&gt;Method Colors". It lets the user
              select the colors which are used for different methods in plots. These colors are saved when cudaprof exits and are reused 
              across cudaprof sessions. 
        </li> 

        <li> "Select Session" dialog  <BR>
              This dialog is invoked using the session context menu item "Comparison Summary Plot", and only when multiple sessions are listed
              in the current project. It is used to select the Compare session, which is compared with the Base session (the session 
              from which the "Select Session" dialog was invoked).
        </li> 
    </ul>

<P>
    <a name="SESSION_LIST_CONTEXT_MENU">Session list context menu :</A>
      
    <ul>
      <li>  Rename: Rename the selected session.  </li> 
      <li>  Delete: Delete the selected session.  </li> 
      <li>  Copy setting to current: Copy the settings of the selected session to the default session settings. This is the same as the main menu option "Session-&gt;Copy settings to current". </li>   
      <li>  Properties: Show the project and session settings for the selected session.  </li> 
    </ul>
<P>
    <a name="SESSION_DEvICE_CONTEXT_MENU">Session-&gt;Device context menu :</A>
    <ul>
      <li>  Summary  table: Display the profiler summary table.  </li>  
      <li>  Kernel  table:  Display the kernel-specific grid and thread related information in a table.  </li>  
      <li>  Memcopy  table: Display the Memcopy related information in a table.  </li>  
      <li>  GPU Time Summary Plot: Display the GPU Time Summary plot for the selected session. The GPU time summary plot options can be changed 
          using the main menu option "Options-&gt;GPU Time Summary Plot".  </li> 
      <li>  GPU Time Height Plot: Display the GPU Time Height plot for the selected session. The GPU time Height plot options can be changed using
          the "Session View Settings" dialog.  </li> 
      <li>  GPU Time Width Plot: Display the GPU Time Width plot for the selected session. The GPU time width plot options can be changed using the
         "Session View Settings" dialog.  </li> 
      <li>  Comparison Summary Plot: Display the GPU Time Comparison plot for the selected sessions. </li> 
    </ul>

<P>
    <a name="PROFILER_TABLE_CONTEXT_MENU">Profiler table context menu :</A><br>
    
    <ul>
      <li>  Profiler counter plot: Display the profiler counter plot for the method in the current row.    </li> 
      <li>  Column plot: Display the column plot for the current column.   </li> 
      <li>  Analyze profiler counters: Analyze profiler counter values. This option is enabled only for the summary table. It highlights any methods
      which have a high rate of uncoalesced loads, uncoalesced stores, or warp serialization. Each rate is calculated
      as the cumulative profiler counter value divided by the cumulative GPU time for a method.  </li> 
      <li>  Export: Export the profiler data to a CSV format file.   </li> 
      <li>  Copy: Copy the selected table cells to the clipboard.  </li> 
      <li>  Average data: Show average data values instead of totals in the summary table.  </li> 
    </ul>
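<p>
The rate calculation described for "Analyze profiler counters" can be sketched as follows. This is an illustration only: the function name and the sample values are hypothetical, and the threshold the profiler uses to decide that a rate is "high" is not stated in this document.
</p>

```python
def counter_rate(cumulative_count, cumulative_gpu_time):
    """Cumulative counter value divided by cumulative GPU time for one method."""
    if cumulative_gpu_time <= 0:
        raise ValueError("cumulative GPU time must be positive")
    return cumulative_count / cumulative_gpu_time


# Hypothetical summary-table values for one method.
uncoalesced_load_rate = counter_rate(cumulative_count=1200, cumulative_gpu_time=400.0)
print(uncoalesced_load_rate)
```

Comparing this rate between two versions of a kernel is more informative than comparing the raw counter values when the GPU times differ.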



<B><h3><a name="ProfilerCounters">Interpreting profiler counters</a></B></h3>

The performance counter values do not correspond to individual thread activity.
Instead, these values represent events within a thread warp. For example, a 
divergent branch within a thread warp increments the divergent_branch 
counter by one, so the final counter value accumulates all divergent 
branches in all warps. In addition, the profiler can only target one of the 
multiprocessors in the GPU, so the counter values will not correspond to the
total number of warps launched for a particular kernel. For this reason, 
when using the performance counter options in the profiler the user should 
always launch enough thread blocks to ensure that the target multiprocessor 
is given a consistent percentage of the total work. In practice, for consistent results
it is best to launch at least twice as many blocks as there are 
multiprocessors in the device on which you are profiling. 
For the reasons listed above, users should not expect the counter values to match the numbers 
one would get by inspecting kernel code. The values are best used to identify 
relative performance differences between unoptimized and optimized code. For 
example, if the profiler reports N non-coalesced global loads for the initial 
version of the program, it is easy to see whether the optimized code produces 
fewer than N non-coalesced loads. In most cases, the goal is to make N go to 
0, so the counter value is useful for tracking progress toward this goal.
<br><br>
Note that the counter values for the same application can differ across
runs, even on the same setup, because they depend on the number of thread
blocks executed on each multiprocessor. For consistent results, the number
of blocks launched for each kernel should be equal to, or a multiple of, 
the total number of multiprocessors on the compute device.
In other words, when profiling, choose the grid configuration so that
all the multiprocessors are uniformly loaded: the number of blocks 
launched on each multiprocessor is the same, and the amount of work of interest
per block is the same. This improves the accuracy of extrapolated counts 
(such as memory and instruction throughput) and gives more consistent
results from run to run.
<br><br>
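<p>
The grid-size guidance above can be sketched as a small helper. This is a minimal sketch, assuming the multiprocessor count of the target device is already known; the function name and its parameters are hypothetical and are not part of the profiler.
</p>

```python
def profiling_block_count(desired_blocks, num_multiprocessors, min_factor=2):
    """Pick a block count suited to profiling.

    The result is at least min_factor blocks per multiprocessor and is
    rounded up to a multiple of the multiprocessor count, so that every
    multiprocessor can receive the same number of blocks.
    """
    if num_multiprocessors <= 0:
        raise ValueError("num_multiprocessors must be positive")
    blocks = max(desired_blocks, min_factor * num_multiprocessors)
    remainder = blocks % num_multiprocessors
    if remainder:
        # Round up to the next multiple of the multiprocessor count.
        blocks += num_multiprocessors - remainder
    return blocks


# Hypothetical device with 30 multiprocessors.
print(profiling_block_count(desired_blocks=100, num_multiprocessors=30))
```

A block count chosen this way only helps if each block also performs the same amount of work, as noted above.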

<B><h3><a name="ProfilerCounters">Profiler counters for GPUs with compute capability 1.x </a></B></h3>

In every application run, at most four counter values can be collected. 
If more than four counters are selected, Visual Profiler executes the application
multiple times to collect all the counter values. Note that if the number of blocks 
in a kernel is less than, or not a multiple of, the number of multiprocessors, the counter values
across multiple runs will not be consistent.
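<p>
As a rough illustration of the four-counters-per-run limit, the sketch below estimates a lower bound on the number of application runs. It assumes any four counters can share a run and ignores the extra runs needed when "Normalize counters" is enabled; the function name is hypothetical.
</p>

```python
import math


def min_runs_needed(num_selected_counters, counters_per_run=4):
    """Lower bound on application runs needed to collect all selected counters."""
    if num_selected_counters < 1:
        raise ValueError("select at least one counter")
    return math.ceil(num_selected_counters / counters_per_run)


# Selecting 9 counters needs at least 3 runs under this assumption.
print(min_runs_needed(9))
```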

</p>      

<li><B>Profiler counters for a single multiprocessor</B> </li> <br><br>
These counter values are a cumulative count for all thread blocks which were
run on multiprocessor zero. Note that the multiprocessor SIMT (single-instruction multi-thread) 
unit creates, manages, schedules, and executes threads in groups of 32 threads called warps.
These counters are incremented by one per warp.

<ul type="circle">
 <li><B>branch</B>            : Number of branches taken by threads executing a kernel. 
                                This counter will be incremented by one if at least 
                                one thread in a warp takes the branch.<br></li>
 
 <li><B>divergent branch</B>  : Number of divergent branches within a warp. This counter will be
                                incremented by one if at least one thread in a warp 
                                diverges (that is, follows a different execution path) via
                                a data dependent conditional branch. The counter will be
                                incremented by one at each point of divergence in a warp.<br></li>
 
 <li><B>instructions</B>      : Number of instructions executed<br></li>
 
 <li><B>warp serialize</B>    : If two addresses of a memory request fall in the same memory
                                bank, there is a bank conflict and the access has to be serialized.
                                This counter gives the number of thread warps that serialize on address conflicts
                                to either shared or constant memory.<br></li>
 <li><B>sm cta launched</B>      : Number of thread blocks launched on a multiprocessor.<br></li>
</ul>

<li><B>Profiler counters for all multiprocessors in a Texture Processing Cluster (TPC)</B> </li> <br><br>
These counter values are a cumulative count for all thread blocks which were run on multiprocessors within Texture Processing Cluster (TPC) zero.
Note that there are two multiprocessors per TPC on compute devices with compute capability less than 1.3
and there are three multiprocessors per TPC on compute devices with compute capability greater than or equal to 1.3.
<br><br>
When simultaneous global memory accesses by threads in a half-warp (during the execution of a single read or
write instruction) can be combined into a single memory transaction of 32, 64, or 128 bytes, it is called
a coalesced access. If the global memory accesses by the threads of a half-warp do not fulfill
the coalescing requirements, it is called a non-coalesced access: a separate memory transaction
is issued for each thread and throughput is significantly reduced. The coalescing requirements 
on devices with compute capability 1.2 and higher are different from those on devices with compute capability 1.0 or 1.1.
Refer to the CUDA Programming Guide for details. The profiler counters related to global memory count the number of
global memory accesses or memory transactions; they are not per-warp counts. They provide counts for all global
memory requests initiated by warps running on a TPC.

<ul type="circle">
 <li><B>gld uncoalesced</B>    : Number of non-coalesced global memory loads. This counter is available only for GPUs with compute capability 1.1 or lower. <br></li>
 <li><B>gld coalesced</B>      : Number of coalesced global memory loads <br></li>
 <li><B>gld request</B>        : Number of global memory load requests. This counter is available only for GPUs with compute capability 1.2 or higher. 
 On devices with compute capability 1.3 enabling this counter
 will result in increased counts for the "instructions" and "branch" counter values
 if they are also enabled in the same application run. <br></li>
 <li><B>gld_32/64/128b</B>     : Number of 32 byte, 64 byte and 128 byte global memory load transactions. 
 These increment by 1 for each 32, 64, or 128 byte transaction.
 These counters are available only for GPUs with compute capability 1.2 or higher.<br></li>
 <li><B>gst uncoalesced</B>    : Number of non-coalesced global memory stores. 
 This counter is available only for GPUs with compute capability 1.1 or lower. <br></li> 
 <li><B>gst coalesced</B>      : Number of coalesced global memory stores <br></li>
 <li><B>gst request</B>        : Number of global memory store requests. 
 This counter is available only for GPUs with compute capability 1.2 or higher. 
 On devices with compute capability 1.3 enabling this counter
 will result in increased counts for the "instructions" and "branch" counter values
 if they are also enabled in the same application run.
 <br></li>
 <li><B>gst_32/64/128b</B>     : Number of 32 byte, 64 byte and 128 byte global memory store transactions. 
 These increment by 2 for each 32 byte transaction, by 4 for each 64 byte transaction and by 8 for each 128 byte transaction.
 These counters are available only for GPUs with compute capability 1.2 or higher.<br></li>
 <li><B>local load</B>         : Number of local memory loads<br></li>
 <li><B>local store</B> : Number of local memory stores<br></li> 
 <li><B>cta launched</B> : Number of thread blocks launched on a TPC. <br></li>
 <li><B>texture cache hit</B> : Number of texture cache hits. <br></li>
 <li><B>texture cache miss</B>  : Number of texture cache misses.<br></li>
 <li><B>prof triggers</B>      : There are 8 such triggers that the user can profile. 
                                 They are generic and can be inserted anywhere in the code to collect 
                                 the related information. <br></li>
</ul>
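<p>
Because the gst_32/64/128b counters increment by 2, 4, and 8 per transaction (unlike the gld counters, which increment by 1), a raw store counter value has to be divided to recover a transaction count. The sketch below shows that conversion, assuming un-normalized counter values; the names are hypothetical.
</p>

```python
# Increment applied to the gst counters for each store transaction,
# keyed by transaction size in bytes (values taken from the list above).
GST_INCREMENT_PER_TRANSACTION = {32: 2, 64: 4, 128: 8}


def gst_transactions(counter_value, transaction_bytes):
    """Convert a raw gst_32b/64b/128b counter value to a transaction count."""
    return counter_value // GST_INCREMENT_PER_TRANSACTION[transaction_bytes]


def gst_bytes(counter_value, transaction_bytes):
    """Bytes moved by the store transactions behind a raw counter value."""
    return gst_transactions(counter_value, transaction_bytes) * transaction_bytes


# A hypothetical raw gst_128b value of 16 corresponds to 2 transactions.
print(gst_transactions(16, 128), gst_bytes(16, 128))
```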

<li><B>Normalized counter values</B> </li> <br><br>
When the "Normalize counters" option is selected, all counter values are normalized and per-block counts are shown.

<ul type="circle">
<li>For single multiprocessor counters the counter value is divided by the number of thread blocks 
which were run on multiprocessor 0. 
The profiler counter "sm cta launched" is used to count thread blocks which were run on multiprocessor 0. <br></li>

<li>For TPC counters the counter value is divided by the number of thread blocks which were run on TPC0.
The profiler counter "cta launched" is used to count thread blocks which were run on multiprocessors in TPC 0. <br></li>
</ul>


In the following cases the counter value is set to zero:

<ul type="circle">

<li> The number of blocks launched on the multiprocessor(s) being profiled is zero. 
   This can happen when the number of blocks launched for a kernel is less than 
   the total number of multiprocessors on a compute device. <br></li>

<li> The counter value is less than the number of blocks launched on the multiprocessor(s) being profiled.
     The normalized fractional value less than one is truncated to zero. <br></li>

</ul>

If any counter value is set to zero a warning is displayed at the end of the application profiling. <br><br>

With the "Normalize counters" option enabled, more application runs are required to collect all 
counter values than when the "Normalize counters" option is disabled. <br><br>

Also when "Normalize counters" option is enabled the "cta launched" and "sm cta launched" columns are not shown in the profiler table.
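<p>
The normalization rules above can be sketched as follows. This is an illustration under the assumption that the division truncates to an integer in all cases (the text only states this for results below one); the function name is hypothetical.
</p>

```python
def normalize_counter(raw_value, blocks_on_profiled_unit):
    """Per-block counter value as described for "Normalize counters".

    blocks_on_profiled_unit is the "sm cta launched" value for single
    multiprocessor counters, or the "cta launched" value for TPC counters.
    """
    if blocks_on_profiled_unit == 0:
        # No blocks ran on the profiled multiprocessor(s): reported as zero.
        return 0
    # Integer division: a fractional result below one truncates to zero.
    return raw_value // blocks_on_profiled_unit


print(normalize_counter(64, 8))   # per-block count
print(normalize_counter(3, 8))    # truncated case
print(normalize_counter(100, 0))  # no blocks on the profiled unit
```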


<B><h3><a name="ProfilerCounters">Profiler counters for GPUs with compute capability 2.0 </a></B></h3>

In every application run, only a few counter values can be collected. The number of counters depends on
the specific counters selected. Visual Profiler executes the application
multiple times to collect all the counter values. Note that if the number of blocks 
in a kernel is less than, or not a multiple of, the number of multiprocessors, the counter values
across multiple runs will not be consistent.

</p>      

All counter values are a cumulative count for all thread blocks which were
run on multiprocessor zero. Note that the multiprocessor SIMT (single-instruction multi-thread) 
unit creates, manages, schedules, and executes threads in groups of 32 threads called warps.
These counters are incremented by one per warp.

<ul type="circle">
 <li><B>branch</B>            : Number of branches taken by threads executing a kernel. 
                                This counter will be incremented by one if at least 
                                one thread in a warp takes the branch.<br></li>
 
 <li><B>divergent branch</B>  : Number of divergent branches within a warp. This counter will be
                                incremented by one if at least one thread in a warp 
                                diverges (that is, follows a different execution path) via
                                a data dependent conditional branch. The counter will be
                                incremented by one at each point of divergence in a warp.<br></li>
 
 <li><B>sm cta launched</B>      : Number of thread blocks launched on a multiprocessor.<br></li>
 
<li><B>local load</B>        :  Number of executed local load instructions per warp on a multiprocessor. <br></li>

<li><B>local store</B>       :  Number of executed local store instructions per warp on a multiprocessor. <br></li>
 
<li><B>gld request</B>       :  Number of executed global load instructions per warp on a multiprocessor. <br></li>

<li><B>gst request</B>       :  Number of executed global store instructions per warp on a multiprocessor.<br></li>

<li><B>shared load</B>       :  Number of executed shared load instructions per warp on a multiprocessor.<br></li>

<li><B>shared store</B>      :  Number of executed shared store instructions per warp on a multiprocessor.<br></li>

<li><B>instructions issued</B>       :  Number of instructions issued including replays <br></li>

<li><B>instructions executed</B>     :  Number of instructions executed, not including replays <br></li>

<li><B>warps launched</B>    :  Number of warps launched on a multiprocessor.<br></li>

<li><B>threads launched</B>  :  Number of threads launched on a multiprocessor.<br></li>

<li><B>l1 global load hit</B> :  Number of global load hits in L1 cache <br></li>

<li><B>l1 global load miss</B> : Number of global load misses in L1 cache <br></li>

</ul>
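<p>
Some of the compute capability 2.0 counters are easier to read in combination. The sketch below derives two such figures from raw counter values: the number of replayed instructions (issued minus executed, following the definitions above) and the fraction of global loads that hit in L1. These derived values are an assumption made for illustration; this document does not state that the profiler computes them, and the function names are hypothetical.
</p>

```python
def replayed_instructions(instructions_issued, instructions_executed):
    """Issued instructions include replays; executed instructions do not."""
    return instructions_issued - instructions_executed


def l1_global_load_hit_fraction(l1_hits, l1_misses):
    """Fraction of counted global loads that hit in L1, or None if none were counted."""
    total = l1_hits + l1_misses
    if total == 0:
        return None
    return l1_hits / total


# Hypothetical counter values for one kernel.
print(replayed_instructions(1500, 1200))
print(l1_global_load_hit_fraction(300, 100))
```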

<B><h3><a name="ProjectFiles">cudaprof project files saved to disk</a></B></h3>
    <ul>

        <li>  &lt;project-name&gt;.cpj               : CUDA profiler project file  </li> 
        <li>  &lt;project-name&gt;_&lt;session-name&gt;_Context_&lt;context-number&gt;.csv : CUDA profiler data file for a context in a session.</li> 

    </ul>
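<p>
The file name patterns above can be expressed as simple string templates, for example when locating the files of a project from a script. This is a sketch only: the function names are hypothetical, and how cudaprof formats the context number (starting value, padding) is an assumption.
</p>

```python
def project_file_name(project_name):
    """Name of the project file: <project-name>.cpj"""
    return f"{project_name}.cpj"


def context_data_file_name(project_name, session_name, context_number):
    """Name of the data file for one context in a session."""
    return f"{project_name}_{session_name}_Context_{context_number}.csv"


# Hypothetical project "demo" with the default first session name.
print(project_file_name("demo"))
print(context_data_file_name("demo", "Session1", 0))
```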

<B><h3><a name="SavedSetting">cudaprof settings which are saved</a></B></h3>
The following cudaprof settings are saved and remembered 
across different cudaprof sessions. 
    <ul>
        <li>  Last opened project path   </li>
        <li>  Method Colors   </li>
        <li>  Recent files list   </li>
        <li>  Recent programs   </li>
        <li>  Recent work Dirs   </li>
        <li>  Show Output window Demangle Method Names   </li><p>

        <li>  Main Window/Size   </li>
        <li>  Main Window/Maximized   </li>
        <li>  Global view dialog/Size   </li>
        <li>  Session view dialog/Size   </li>
        <li>  Horizontal Splitter/Sizes   </li>
        <li>  Vertical Splitter/Sizes   </li><p>

        <li>  Profiler Table/Hide Zero Columns   </li><p>

        <li>  Summary Table/Show Average   </li>
        <li>  Summary Plot/Average   </li><p>

        <li>  Displayed Summary Plot/Percentage   </li>
        <li>  Displayed Height Plot/Fit in window   </li><p>

        <li>  Height Plot/Show CPU Time   </li>
        <li>  Height Plot/Show Legend   </li>
        <li>  Height Plot/Use global scale   </li><p>

        <li>  Width Plot/Enable time stamp   </li>
        <li>  Width Plot/Fit in window   </li>
        <li>  Width Plot/Maximum bar width   </li>
        <li>  Width Plot/Show CPU Time   </li>
        <li>  Width Plot/Show legend   </li>
        <li>  Width Plot/Start time stamp at zero   </li>
        <li>  Width Plot/Type   </li><p>
    </ul>
On Windows these settings are saved in the system registry at the location 
"HKEY_CURRENT_USER\Software\NVIDIA\cudaprof".<br>
On Linux these settings are 
saved to the file "$HOME/.config/NVIDIA Corporation/cudaprof.conf". <br><br>

The CUDA Visual Profiler help cache is saved in the following folder:
<ul>
    <li> Windows : C:\Documents and Settings\&lt;username&gt;\Local Settings\Application Data\NVIDIA Corporation\cudaprof </li>
    <li> Linux   : /home/&lt;username&gt;/.local/share/data/NVIDIA Corporation/cudaprof </li>
</ul>
There is a separate sub-directory for each version.
</body>
</html>