<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" lang="en-us" xml:lang="en-us"> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta> <meta http-equiv="X-UA-Compatible" content="IE=edge"></meta> <meta name="copyright" content="(C) Copyright 2005"></meta> <meta name="DC.rights.owner" content="(C) Copyright 2005"></meta> <meta name="DC.Type" content="cuda_reference"></meta> <meta name="DC.Title" content="Introduction"></meta> <meta name="DC.Format" content="XHTML"></meta> <meta name="DC.Identifier" content="r_main"></meta> <link rel="stylesheet" type="text/css" href="../common/formatting/commonltr.css"></link> <link rel="stylesheet" type="text/css" href="../common/formatting/site.css"></link> <title>CUPTI :: CUDA Toolkit Documentation</title> <!--[if lt IE 9]> <script src="../common/formatting/html5shiv-printshiv.min.js"></script> <![endif]--> <script type="text/javascript" charset="utf-8" src="../common/formatting/jquery.min.js"></script> <script type="text/javascript" charset="utf-8" src="../common/formatting/jquery.ba-hashchange.min.js"></script> <link rel="canonical" href="http://docs.nvidia.com/cuda/cupti/index.html"></link> <link rel="stylesheet" type="text/css" href="../common/formatting/qwcode.highlight.css"></link> </head> <body> <article id="contents"> <div id="release-info">CUPTI (<a href="../../pdf/CUPTI_Library.pdf">PDF</a>) - CUDA Toolkit v5.5 (<a href="https://developer.nvidia.com/cuda-toolkit-archive">older</a>) - Last updated July 19, 2013 - <a href="mailto:cudatools@nvidia.com?subject=CUDA Tools Documentation Feedback: cupti">Send Feedback</a></div> <div class="topic nested1" id="r_main"><a name="r_main" shape="rect"> <!-- --></a><h2 class="topictitle2">1. 
Introduction</h2> <div class="body refbody"> <div class="section"> <p class="p">The <em class="ph i">CUDA Profiling Tools Interface</em> (CUPTI) enables the creation of profiling and tracing tools that target CUDA applications. CUPTI provides four APIs: <em class="ph i">the Activity API</em>, the <em class="ph i">Callback API</em>, the <em class="ph i">Event API</em>, and the <em class="ph i">Metric API</em>. Using these APIs, you can develop profiling tools that give insight into the CPU and GPU behavior of CUDA applications. CUPTI is delivered as a dynamic library on all platforms supported by CUDA. </p> </div> </div> <div class="topic reference cuda_reference nested1" id="r_compatibility_requirements"><a name="r_compatibility_requirements" shape="rect"> <!-- --></a><h3 class="topictitle3">1.1. CUPTI Compatibility and Requirements</h3> <div class="body refbody"> <div class="section"> <p class="p">New versions of the CUDA driver are backwards compatible with older versions of CUPTI. For example, a developer using a profiling tool based on CUPTI 4.1 can update to a more recently released CUDA driver. However, new versions of CUPTI are not backwards compatible with older versions of the CUDA driver. For example, a developer using a profiling tool based on CUPTI 4.1 must have a version of the CUDA driver released with CUDA Toolkit 4.1 (or later) installed as well. CUPTI calls will fail with <tt class="ph tt">CUPTI_ERROR_NOT_INITIALIZED</tt> if the CUDA driver version is not compatible with the CUPTI version. </p> </div> </div> </div> <div class="topic reference cuda_reference nested1" id="r_initialization"><a name="r_initialization" shape="rect"> <!-- --></a><h3 class="topictitle3">1.2. CUPTI Initialization</h3> <div class="body refbody"> <div class="section"> <p class="p">CUPTI initialization occurs lazily the first time you invoke any CUPTI function. For the Event, Metric, and Callback APIs there are no requirements on when this initialization must occur (i.e. 
you can invoke the first CUPTI function at any point). For correct operation, the Activity API does require that CUPTI be initialized before any CUDA driver or runtime API is invoked. See the CUPTI Activity API section for more information on CUPTI initialization requirements for the activity API. </p> </div> </div> </div> <div class="topic reference cuda_reference nested1" id="r_activity"><a name="r_activity" shape="rect"> <!-- --></a><h3 class="topictitle3">1.3. CUPTI Activity API</h3> <div class="body refbody"> <div class="section"> <p class="p">The CUPTI Activity API allows you to asynchronously collect a trace of an application's CPU and GPU CUDA activity. The following terminology is used by the activity API. </p> <dl class="dl"> <dt class="dt dlterm">Activity Record</dt> <dd class="dd">CPU and GPU activity is reported in C data structures called activity records. There is a different C structure type for each activity kind (e.g. <tt class="ph tt">CUpti_ActivityMemcpy</tt>). Records are generically referred to using the <tt class="ph tt">CUpti_Activity</tt> type. This type contains only a kind field that indicates the kind of the activity record. Using this kind, the object can be cast from the generic <tt class="ph tt">CUpti_Activity</tt> type to the specific type representing the activity. See the <tt class="ph tt">printActivity</tt> function in the <a class="xref" href="r_main.html#r_samples__activity_trace_async" shape="rect">activity_trace_async</a> sample for an example. </dd> <dt class="dt dlterm">Activity Buffer</dt> <dd class="dd">An activity buffer is used to transfer one or more activity records from CUPTI to the client. CUPTI fills activity buffers with activity records as the corresponding activities occur on the CPU and GPU. The CUPTI client is responsible for providing empty activity buffers as necessary to ensure that no records are dropped. 
</dd> </dl> <p class="p"> This section describes the new <em class="ph i">asynchronous</em> buffering API implemented by <tt class="ph tt">cuptiActivityRegisterCallbacks</tt>, <tt class="ph tt">cuptiActivityFlush</tt>, and <tt class="ph tt">cuptiActivityFlushAll</tt>. The old buffering API implemented by <tt class="ph tt">cuptiActivityEnqueueBuffer</tt> and <tt class="ph tt">cuptiActivityDequeueBuffer</tt> is still supported but is deprecated and will be removed in a future release (see the API documentation for information on these functions). </p> <p class="p">To ensure that all activity records are collected, CUPTI must be initialized before any CUDA driver or runtime API is invoked. Initialization can be done by enabling one or more activity kinds using <tt class="ph tt">cuptiActivityEnable</tt> or <tt class="ph tt">cuptiActivityEnableContext</tt>, as shown in the <tt class="ph tt">initTrace</tt> function of the <a class="xref" href="r_main.html#r_samples__activity_trace_async" shape="rect">activity_trace_async</a> sample. Some activity kinds cannot be directly enabled; see the API documentation for <tt class="ph tt">CUpti_ActivityKind</tt> for details. Functions <tt class="ph tt">cuptiActivityEnable</tt> and <tt class="ph tt">cuptiActivityEnableContext</tt> will return <tt class="ph tt">CUPTI_ERROR_NOT_COMPATIBLE</tt> if the requested activity kind cannot be enabled. </p> </div> <div class="section"> The new activity buffer API uses callbacks to request and return buffers of activity records. To use the asynchronous buffering API you must first register two callbacks using <tt class="ph tt">cuptiActivityRegisterCallbacks</tt>. One of these callbacks will be invoked whenever CUPTI needs an empty activity buffer. The other callback is used to deliver a buffer containing one or more activity records to the client. To minimize profiling overhead the client should return as quickly as possible from these callbacks. 
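<p class="p">The hand-off described above, where one callback supplies an empty buffer and the other receives a filled one, can be sketched without the CUPTI headers. The following is a minimal, self-contained simulation and not CUPTI code: every <tt class="ph tt">Demo*</tt> type and the <tt class="ph tt">demo_produce</tt> function are hypothetical stand-ins invented for this illustration. It also shows the kind-based cast from a generic record to a specific record type. In a real client the two callbacks would be registered with <tt class="ph tt">cuptiActivityRegisterCallbacks</tt>, would use the signatures given in the API documentation, and would walk the records with the library's own iteration function rather than with manual offsets. </p>

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in types for illustration only; these are NOT the CUPTI definitions. */
typedef enum { DEMO_KIND_MEMCPY = 1, DEMO_KIND_KERNEL = 2 } DemoKind;
typedef struct { DemoKind kind; } DemoActivity;                  /* generic view   */
typedef struct { DemoKind kind; uint64_t bytes; } DemoMemcpy;    /* specific kinds */
typedef struct { DemoKind kind; uint64_t gridSize; } DemoKernel;

/* Hypothetical callback shapes: one hands out a buffer, one receives it back. */
typedef void (*DemoRequestFn)(uint8_t **buffer, size_t *size);
typedef void (*DemoCompleteFn)(uint8_t *buffer, size_t validSize);

static uint64_t g_memcpyBytes = 0;   /* totals gathered by the client */
static unsigned g_kernels = 0;

/* Client callback 1: provide an empty buffer and return immediately. */
static void buffer_requested(uint8_t **buffer, size_t *size)
{
    *size = 256;
    *buffer = (uint8_t *)malloc(*size);
}

/* Client callback 2: walk the records, cast each by its kind, free the buffer. */
static void buffer_completed(uint8_t *buffer, size_t validSize)
{
    size_t offset = 0;
    while (offset + sizeof(DemoActivity) <= validSize) {
        const DemoActivity *rec = (const DemoActivity *)(buffer + offset);
        switch (rec->kind) {
        case DEMO_KIND_MEMCPY:
            g_memcpyBytes += ((const DemoMemcpy *)rec)->bytes;
            offset += sizeof(DemoMemcpy);
            break;
        case DEMO_KIND_KERNEL:
            g_kernels += 1;
            offset += sizeof(DemoKernel);
            break;
        default:
            offset = validSize;      /* unknown kind: stop walking */
            break;
        }
    }
    free(buffer);
}

/* Stand-in for the tracing library: request a buffer, fill it, deliver it. */
static void demo_produce(DemoRequestFn request, DemoCompleteFn complete)
{
    uint8_t *buf = NULL;
    size_t size = 0, used = 0;
    DemoMemcpy m1 = { DEMO_KIND_MEMCPY, 1024 };
    DemoKernel k  = { DEMO_KIND_KERNEL, 64 };
    DemoMemcpy m2 = { DEMO_KIND_MEMCPY, 4096 };

    request(&buf, &size);
    assert(buf != NULL && size >= sizeof m1 + sizeof k + sizeof m2);
    memcpy(buf + used, &m1, sizeof m1); used += sizeof m1;
    memcpy(buf + used, &k,  sizeof k);  used += sizeof k;
    memcpy(buf + used, &m2, sizeof m2); used += sizeof m2;
    complete(buf, used);             /* ownership returns to the client */
}
```

<p class="p">Calling <tt class="ph tt">demo_produce(buffer_requested, buffer_completed)</tt> drives one request and delivery cycle; afterwards <tt class="ph tt">g_memcpyBytes</tt> holds 5120 and <tt class="ph tt">g_kernels</tt> holds 1. Both callbacks do only cheap work, in keeping with the advice to return from them quickly. </p>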
Functions <tt class="ph tt">cuptiActivityFlush</tt> and <tt class="ph tt">cuptiActivityFlushAll</tt> can be used to force CUPTI to deliver any activity buffers that contain completed activity records. Functions <tt class="ph tt">cuptiActivityGetAttribute</tt> and <tt class="ph tt">cuptiActivitySetAttribute</tt> can be used to read and write attributes that control how the buffering API behaves. See the API documentation for more information. <p class="p"> The <a class="xref" href="r_main.html#r_samples__activity_trace_async" shape="rect">activity_trace_async</a> sample shows how to use the activity buffer API to collect a trace of CPU and GPU activity for a simple application. </p> </div> </div> </div> <div class="topic reference cuda_reference nested1" id="r_callback_api"><a name="r_callback_api" shape="rect"> <!-- --></a><h3 class="topictitle3">1.4. CUPTI Callback API</h3> <div class="body refbody"> <div class="section"> <p class="p">The CUPTI Callback API allows you to register a callback into your own code. Your callback will be invoked when the application being profiled calls a CUDA runtime or driver function, or when certain events occur in the CUDA driver. The following terminology is used by the callback API. </p> <dl class="dl"> <dt class="dt dlterm">Callback Domain</dt> <dd class="dd">Callbacks are grouped into domains to make it easier to associate your callback functions with groups of related CUDA functions or events. There are currently five callback domains, as defined by <tt class="ph tt">CUpti_CallbackDomain</tt>: a domain for CUDA runtime functions, a domain for CUDA driver functions, a domain for CUDA resource tracking, a domain for CUDA synchronization notification, and a domain for NVIDIA Tools Extension (NVTX) functions. </dd> <dt class="dt dlterm">Callback ID</dt> <dd class="dd">Each callback is given a unique ID within the corresponding callback domain so that you can identify it within your callback function. 
The CUDA driver API IDs are defined in <tt class="ph tt">cupti_driver_cbid.h</tt> and the CUDA runtime API IDs are defined in <tt class="ph tt">cupti_runtime_cbid.h</tt>. Both of these headers are included for you when you include <tt class="ph tt">cupti.h</tt>. The CUDA resource callback IDs are defined by <tt class="ph tt">CUpti_CallbackIdResource</tt> and the CUDA synchronization callback IDs are defined by <tt class="ph tt">CUpti_CallbackIdSync</tt>. </dd> <dt class="dt dlterm">Callback Function</dt> <dd class="dd">Your callback function must be of type <tt class="ph tt">CUpti_CallbackFunc</tt>. This function type has two arguments that specify the callback domain and ID so that you know why the callback is occurring. The type also has a <tt class="ph tt">cbdata</tt> argument that is used to pass data specific to the callback. </dd> <dt class="dt dlterm">Subscriber</dt> <dd class="dd">A subscriber is used to associate each of your callback functions with one or more CUDA API functions. There can be at most one subscriber initialized with <tt class="ph tt">cuptiSubscribe()</tt> at any time. Before initializing a new subscriber, the existing subscriber must be finalized with <tt class="ph tt">cuptiUnsubscribe()</tt>. </dd> </dl> <p class="p"> Each callback domain is described in detail below. Unless explicitly stated, it is not supported to call any CUDA runtime or driver API from within a callback function. Doing so may cause the application to hang. </p> </div> </div> <div class="topic reference cuda_reference nested2" id="r_driver_runtime_api_callback"><a name="r_driver_runtime_api_callback" shape="rect"> <!-- --></a><h4 class="topictitle4">1.4.1. 
Driver and Runtime API Callbacks</h4> <div class="body refbody"> <div class="section"> <p class="p">Using the callback API with the <tt class="ph tt">CUPTI_CB_DOMAIN_DRIVER_API</tt> or <tt class="ph tt">CUPTI_CB_DOMAIN_RUNTIME_API</tt> domains, you can associate a callback function with one or more CUDA API functions. When those CUDA functions are invoked in the application, your callback function is invoked as well. For these domains, the <tt class="ph tt">cbdata</tt> argument to your callback function will be of the type <tt class="ph tt">CUpti_CallbackData</tt>. </p> <p class="p">It is legal to call <tt class="ph tt">cudaThreadSynchronize()</tt>, <tt class="ph tt">cudaDeviceSynchronize()</tt>, <tt class="ph tt">cudaStreamSynchronize()</tt>, <tt class="ph tt">cuCtxSynchronize()</tt>, and <tt class="ph tt">cuStreamSynchronize()</tt> from within a driver or runtime API callback function. </p> <p class="p">The following code shows a typical sequence used to associate a callback function with one or more CUDA API functions. To simplify the presentation error checking code has been removed. </p><pre xml:space="preserve">CUpti_SubscriberHandle subscriber;
MyDataStruct *my_data = ...;
...
cuptiSubscribe(&subscriber, (CUpti_CallbackFunc)my_callback, my_data);
cuptiEnableDomain(1, subscriber, CUPTI_CB_DOMAIN_RUNTIME_API);</pre><p class="p"> First, <tt class="ph tt">cuptiSubscribe</tt> is used to initialize a subscriber with the <tt class="ph tt">my_callback</tt> callback function. Next, <tt class="ph tt">cuptiEnableDomain</tt> is used to associate that callback with all the CUDA runtime API functions. Using this code sequence will cause <tt class="ph tt">my_callback</tt> to be called twice each time any of the CUDA runtime API functions are invoked, once on entry to the CUDA function and once just before exit from the CUDA function. 
CUPTI callback API functions <tt class="ph tt">cuptiEnableCallback</tt> and <tt class="ph tt">cuptiEnableAllDomains</tt> can also be used to associate CUDA API functions with a callback (see reference below for more information). </p> <p class="p">The following code shows a typical callback function.</p><pre xml:space="preserve">void CUPTIAPI
my_callback(void *userdata, CUpti_CallbackDomain domain,
            CUpti_CallbackId cbid, const void *cbdata)
{
  const CUpti_CallbackData *cbInfo = (const CUpti_CallbackData *)cbdata;
  MyDataStruct *my_data = (MyDataStruct *)userdata;

  if ((domain == CUPTI_CB_DOMAIN_RUNTIME_API) &&
      (cbid == CUPTI_RUNTIME_TRACE_CBID_cudaMemcpy_v3020)) {
    if (cbInfo->callbackSite == CUPTI_API_ENTER) {
      cudaMemcpy_v3020_params *funcParams =
          (cudaMemcpy_v3020_params *)(cbInfo->functionParams);
      size_t count = funcParams->count;
      enum cudaMemcpyKind kind = funcParams->kind;
      ...
    }
    ...</pre><p class="p"> In your callback function, you use the <tt class="ph tt">CUpti_CallbackDomain</tt> and <tt class="ph tt">CUpti_CallbackId</tt> parameters to determine which CUDA API function invocation is causing this callback. In the example above, we are checking for the CUDA runtime <tt class="ph tt">cudaMemcpy</tt> function. The <tt class="ph tt">cbdata</tt> parameter holds a structure of useful information that can be used within the callback. In this case we use the <tt class="ph tt">callbackSite</tt> member of the structure to detect that the callback is occurring on entry to <tt class="ph tt">cudaMemcpy</tt>, and we use the <tt class="ph tt">functionParams</tt> member to access the parameters that were passed to <tt class="ph tt">cudaMemcpy</tt>. To access the parameters we first cast <tt class="ph tt">functionParams</tt> to a structure type corresponding to the <tt class="ph tt">cudaMemcpy</tt> function. These parameter structures are contained in <tt class="ph tt">generated_cuda_runtime_api_meta.h</tt>, <tt class="ph tt">generated_cuda_meta.h</tt>, and a number of other files. 
When possible these files are included for you by <tt class="ph tt">cupti.h</tt>. </p> <p class="p"> The <strong class="ph b">callback_event</strong> and <strong class="ph b">callback_timestamp</strong> samples described on the <a class="xref" href="r_main.html#r_samples" shape="rect">samples page</a> both show how to use the callback API for the driver and runtime API domains. </p> </div> </div> </div> <div class="topic reference cuda_reference nested2" id="r_resource_callbacks"><a name="r_resource_callbacks" shape="rect"> <!-- --></a><h4 class="topictitle4">1.4.2. Resource Callbacks</h4> <div class="body refbody"> <div class="section"> <p class="p">Using the callback API with the <tt class="ph tt">CUPTI_CB_DOMAIN_RESOURCE</tt> domain, you can associate a callback function with some CUDA resource creation and destruction events. For example, when a CUDA context is created, your callback function will be invoked with a callback ID equal to <tt class="ph tt">CUPTI_CBID_RESOURCE_CONTEXT_CREATED</tt>. For this domain, the <tt class="ph tt">cbdata</tt> argument to your callback function will be of the type <tt class="ph tt">CUpti_ResourceData</tt>. </p> </div> </div> </div> <div class="topic reference cuda_reference nested2" id="r_synchronization_callbacks"><a name="r_synchronization_callbacks" shape="rect"> <!-- --></a><h4 class="topictitle4">1.4.3. Synchronization Callbacks</h4> <div class="body refbody"> <div class="section"> <p class="p">Using the callback API with the <tt class="ph tt">CUPTI_CB_DOMAIN_SYNCHRONIZE</tt> domain, you can associate a callback function with CUDA context and stream synchronizations. For example, when a CUDA context is synchronized, your callback function will be invoked with a callback ID equal to <tt class="ph tt">CUPTI_CBID_SYNCHRONIZE_CONTEXT_SYNCHRONIZED</tt>. For this domain, the <tt class="ph tt">cbdata</tt> argument to your callback function will be of the type <tt class="ph tt">CUpti_SynchronizeData</tt>. 
</p> </div> </div> </div> <div class="topic reference cuda_reference nested2" id="r_nvtx_callbacks"><a name="r_nvtx_callbacks" shape="rect"> <!-- --></a><h4 class="topictitle4">1.4.4. NVIDIA Tools Extension Callbacks</h4> <div class="body refbody"> <div class="section"> <p class="p">Using the callback API with the <tt class="ph tt">CUPTI_CB_DOMAIN_NVTX</tt> domain, you can associate a callback function with NVIDIA Tools Extension (NVTX) API functions. When an NVTX function is invoked in the application, your callback function is invoked as well. For this domain, the <tt class="ph tt">cbdata</tt> argument to your callback function will be of the type <tt class="ph tt">CUpti_NvtxData</tt>. </p> <div class="p"> The NVTX library has its own convention for discovering the profiling library that will provide the implementation of the NVTX callbacks. To receive callbacks you must set the NVTX environment variables appropriately so that when the application calls an NVTX function, your profiling library receives the callbacks. The following code sequence shows a typical initialization sequence to enable NVTX callbacks and activity records. <pre xml:space="preserve">/* Set env so CUPTI-based profiling library loads on first nvtx call. */
const char *inj32_path = "/path/to/32-bit/version/of/cupti/based/profiling/library";
const char *inj64_path = "/path/to/64-bit/version/of/cupti/based/profiling/library";
setenv("NVTX_INJECTION32_PATH", inj32_path, 1);
setenv("NVTX_INJECTION64_PATH", inj64_path, 1);</pre></div> <p class="p">The following code shows a typical sequence used to associate a callback function with one or more NVTX functions. To simplify the presentation error checking code has been removed. </p><pre xml:space="preserve">CUpti_SubscriberHandle subscriber;
MyDataStruct *my_data = ...;
...
cuptiSubscribe(&subscriber, (CUpti_CallbackFunc)my_callback, my_data);
cuptiEnableDomain(1, subscriber, CUPTI_CB_DOMAIN_NVTX);</pre><p class="p"> First, <tt class="ph tt">cuptiSubscribe</tt> is used to initialize a subscriber with the <tt class="ph tt">my_callback</tt> callback function. Next, <tt class="ph tt">cuptiEnableDomain</tt> is used to associate that callback with all the NVTX functions. Using this code sequence will cause <tt class="ph tt">my_callback</tt> to be called once each time any of the NVTX functions are invoked. CUPTI callback API functions <tt class="ph tt">cuptiEnableCallback</tt> and <tt class="ph tt">cuptiEnableAllDomains</tt> can also be used to associate NVTX API functions with a callback (see reference below for more information). </p> <p class="p">The following code shows a typical callback function.</p><pre xml:space="preserve">void CUPTIAPI
my_callback(void *userdata, CUpti_CallbackDomain domain,
            CUpti_CallbackId cbid, const void *cbdata)
{
  const CUpti_NvtxData *nvtxInfo = (const CUpti_NvtxData *)cbdata;
  MyDataStruct *my_data = (MyDataStruct *)userdata;

  if ((domain == CUPTI_CB_DOMAIN_NVTX) &&
      (cbid == NVTX_CBID_CORE_NameOsThreadA)) {
    nvtxNameOsThreadA_params *params =
        (nvtxNameOsThreadA_params *)nvtxInfo->functionParams;
    ...
  }
  ...</pre><p class="p"> In your callback function, you use the <tt class="ph tt">CUpti_CallbackDomain</tt> and <tt class="ph tt">CUpti_CallbackId</tt> parameters to determine which NVTX API function invocation is causing this callback. In the example above, we are checking for the <tt class="ph tt">nvtxNameOsThreadA</tt> function. The <tt class="ph tt">cbdata</tt> parameter holds a structure of useful information that can be used within the callback. In this case, we use the <tt class="ph tt">functionParams</tt> member to access the parameters that were passed to <tt class="ph tt">nvtxNameOsThreadA</tt>. 
To access the parameters we first cast <tt class="ph tt">functionParams</tt> to a structure type corresponding to the <tt class="ph tt">nvtxNameOsThreadA</tt> function. These parameter structures are contained in <tt class="ph tt">generated_nvtx_meta.h</tt>. </p> </div> </div> </div> </div> <div class="topic reference cuda_reference nested1" id="r_event_api"><a name="r_event_api" shape="rect"> <!-- --></a><h3 class="topictitle3">1.5. CUPTI Event API</h3> <div class="body refbody"> <div class="section"> <p class="p">The CUPTI Event API allows you to query, configure, start, stop, and read the event counters on a CUDA-enabled device. The following terminology is used by the event API. </p> <dl class="dl"> <dt class="dt dlterm">Event</dt> <dd class="dd">An event is a countable activity, action, or occurrence on a device. </dd> <dt class="dt dlterm">Event ID</dt> <dd class="dd">Each event is assigned a unique identifier. A named event will represent the same activity, action, or occurrence on all device types. But the named event may have different IDs on different device families. Use <tt class="ph tt">cuptiEventGetIdFromName</tt> to get the ID for a named event on a particular device. </dd> <dt class="dt dlterm">Event Category</dt> <dd class="dd">Each event is placed in one of the categories defined by <tt class="ph tt">CUpti_EventCategory</tt>. The category indicates the general type of activity, action, or occurrence measured by the event. </dd> <dt class="dt dlterm">Event Domain</dt> <dd class="dd">A device exposes one or more event domains. Each event domain represents a group of related events available on that device. A device may have multiple instances of a domain, indicating that the device can simultaneously record multiple instances of each event within that domain. </dd> <dt class="dt dlterm">Event Group</dt> <dd class="dd">An event group is a collection of events that are managed together. 
The number and type of events that can be added to an event group are subject to device-specific limits. At any given time, a device may be configured to count events from a limited number of event groups. All events in an event group must belong to the same event domain. </dd> <dt class="dt dlterm">Event Group Set</dt> <dd class="dd">An event group set is a collection of event groups that can be enabled at the same time. Event group sets are created by <tt class="ph tt">cuptiEventGroupSetsCreate</tt> and <tt class="ph tt">cuptiMetricCreateEventGroupSets</tt>. </dd> </dl> <p class="p"> You can determine the events available on a device using the <tt class="ph tt">cuptiDeviceEnumEventDomains</tt> and <tt class="ph tt">cuptiEventDomainEnumEvents</tt> functions. The <strong class="ph b">cupti_query</strong> sample described on the <a class="xref" href="r_main.html#r_samples" shape="rect">samples page</a> shows how to use these functions. You can also enumerate all the CUPTI events available on any device using the <tt class="ph tt">cuptiEnumEventDomains</tt> function. </p> <p class="p"> Configuring and reading event counts requires the following steps. First, select your event collection mode. If you want to count events that occur during the execution of a kernel, use <tt class="ph tt">cuptiSetEventCollectionMode</tt> to set mode <tt class="ph tt">CUPTI_EVENT_COLLECTION_MODE_KERNEL</tt>. If you want to continuously sample the event counts, use mode <tt class="ph tt">CUPTI_EVENT_COLLECTION_MODE_CONTINUOUS</tt>. Next determine the names of the events that you want to count, and then use the <tt class="ph tt">cuptiEventGroupCreate</tt>, <tt class="ph tt">cuptiEventGetIdFromName</tt>, and <tt class="ph tt">cuptiEventGroupAddEvent</tt> functions to create and initialize an event group with those events. If you are unable to add all the events to a single event group then you will need to create multiple event groups. 
Alternatively, you can use the <tt class="ph tt">cuptiEventGroupSetsCreate</tt> function to automatically create the event group(s) required for a set of events. </p> <p class="p"> To begin counting a set of events, enable the event group or groups that contain those events by using the <tt class="ph tt">cuptiEventGroupEnable</tt> function. If your events are contained in multiple event groups you may be unable to enable all of the event groups at the same time, due to device limitations. In this case, you can gather the events across multiple executions of the application or you can enable kernel replay. If you enable kernel replay using <tt class="ph tt">cuptiEnableKernelReplayMode</tt> you will be able to enable any number of event groups and all the contained events will be collected. </p> <p class="p"> Use the <tt class="ph tt">cuptiEventGroupReadEvent</tt> and/or <tt class="ph tt">cuptiEventGroupReadAllEvents</tt> functions to read the event values. When you are done collecting events, use the <tt class="ph tt">cuptiEventGroupDisable</tt> function to stop counting of the events contained in an event group. The <strong class="ph b">callback_event</strong> sample described on the <a class="xref" href="r_main.html#r_samples" shape="rect">samples page</a> shows how to use these functions to create, enable, and disable event groups, and how to read event counts. </p> </div> </div> <div class="topic reference cuda_reference nested2" id="r_collecting_kernel_execution_events"><a name="r_collecting_kernel_execution_events" shape="rect"> <!-- --></a><h4 class="topictitle4">1.5.1. Collecting Kernel Execution Events</h4> <div class="body refbody"> <div class="section"> <p class="p">A common use of the event API is to count a set of events during the execution of a kernel (as demonstrated by the <strong class="ph b">callback_event</strong> sample). The following code shows a typical callback used for this purpose. 
Assume that the callback was enabled only for a kernel launch using the CUDA runtime (i.e. by <tt class="ph tt">cuptiEnableCallback(1, subscriber, CUPTI_CB_DOMAIN_RUNTIME_API, CUPTI_RUNTIME_TRACE_CBID_cudaLaunch_v3020)</tt>). To simplify the presentation error checking code has been removed. </p><pre xml:space="preserve">/* eventGroup, eventId, bytesRead, and eventVal are assumed to be set up elsewhere. */
static void CUPTIAPI
getEventValueCallback(void *userdata, CUpti_CallbackDomain domain,
                      CUpti_CallbackId cbid, const void *cbdata)
{
  const CUpti_CallbackData *cbData = (const CUpti_CallbackData *)cbdata;

  if (cbData->callbackSite == CUPTI_API_ENTER) {
    cudaThreadSynchronize();
    cuptiSetEventCollectionMode(cbData->context,
                                CUPTI_EVENT_COLLECTION_MODE_KERNEL);
    cuptiEventGroupEnable(eventGroup);
  }

  if (cbData->callbackSite == CUPTI_API_EXIT) {
    cudaThreadSynchronize();
    cuptiEventGroupReadEvent(eventGroup, CUPTI_EVENT_READ_FLAG_NONE,
                             eventId, &bytesRead, &eventVal);
    cuptiEventGroupDisable(eventGroup);
  }
}</pre><p class="p"> Two synchronization points are used to ensure that events are counted only for the execution of the kernel. If the application contains other threads that launch kernels, then additional thread-level synchronization must also be introduced to ensure that those threads do not launch kernels while the callback is collecting events. When the cudaLaunch API is entered (that is, before the kernel is actually launched on the device), <tt class="ph tt">cudaThreadSynchronize</tt> is used to wait until the GPU is idle. The event collection mode is set to <tt class="ph tt">CUPTI_EVENT_COLLECTION_MODE_KERNEL</tt> so that the event counters are automatically started and stopped just before and after the kernel executes. Then event collection is enabled with <tt class="ph tt">cuptiEventGroupEnable</tt>. </p> <p class="p"> When the cudaLaunch API is exited (that is, after the kernel is queued for execution on the GPU) another <tt class="ph tt">cudaThreadSynchronize</tt> is used to cause the CPU thread to wait for the kernel to finish execution. 
Finally, the event counts are read with <tt class="ph tt">cuptiEventGroupReadEvent</tt>. </p> </div> </div> </div> <div class="topic reference cuda_reference nested2" id="r_sampling_events"><a name="r_sampling_events" shape="rect"> <!-- --></a><h4 class="topictitle4">1.5.2. Sampling Events</h4> <div class="body refbody"> <div class="section"> <p class="p">The event API can also be used to sample event values while a kernel or kernels are executing (as demonstrated by the <strong class="ph b">event_sampling</strong> sample). The sample shows one possible way to perform the sampling. The event collection mode is set to <tt class="ph tt">CUPTI_EVENT_COLLECTION_MODE_CONTINUOUS</tt> so that the event counters run continuously. Two threads are used in <strong class="ph b">event_sampling</strong>: one thread schedules the kernels and memcpys that perform the computation, while another thread wakes periodically to sample an event counter. In this sample there is no correlation of the event samples with what is happening on the GPU. To get some coarse correlation, you can use <tt class="ph tt">cuptiDeviceGetTimestamp</tt> to collect the GPU timestamp at the time of the sample and also at other interesting points in your application. </p> </div> </div> </div> </div> <div class="topic reference cuda_reference nested1" id="r_metric_api"><a name="r_metric_api" shape="rect"> <!-- --></a><h3 class="topictitle3">1.6. CUPTI Metric API</h3> <div class="body refbody"> <div class="section"> <p class="p">The CUPTI Metric API allows you to collect application metrics calculated from one or more event values. The following terminology is used by the metric API. </p> <dl class="dl"> <dt class="dt dlterm">Metric</dt> <dd class="dd">A characteristic of an application that is calculated from one or more event values. </dd> <dt class="dt dlterm">Metric ID</dt> <dd class="dd">Each metric is assigned a unique identifier. 
A named metric will represent the same characteristic on all device types. But the named metric may have different IDs on different device families. Use <tt class="ph tt">cuptiMetricGetIdFromName</tt> to get the ID for a named metric on a particular device. </dd> <dt class="dt dlterm">Metric Category</dt> <dd class="dd">Each metric is placed in one of the categories defined by <tt class="ph tt">CUpti_MetricCategory</tt>. The category indicates the general type of the characteristic measured by the metric. </dd> <dt class="dt dlterm">Metric Property</dt> <dd class="dd">Each metric is calculated from input values. These input values can be events or properties of the device or system. The available properties are defined by <tt class="ph tt">CUpti_MetricPropertyID</tt>. </dd> <dt class="dt dlterm">Metric Value</dt> <dd class="dd">Each metric has a value that represents one of the kinds defined by <tt class="ph tt">CUpti_MetricValueKind</tt>. For each value kind, there is a corresponding member of the <tt class="ph tt">CUpti_MetricValue</tt> union that is used to hold the metric's value. </dd> </dl> </div> <div class="section"> <p class="p"> The tables included in this section list the metrics available for each device, as determined by the device's compute capability. You can also determine the metrics available on a device using the <tt class="ph tt">cuptiDeviceEnumMetrics</tt> function. The <strong class="ph b">cupti_query</strong> sample described on the <a class="xref" href="r_main.html#r_samples" shape="rect">samples page</a> shows how to use this function. You can also enumerate all the CUPTI metrics available on any device using the <tt class="ph tt">cuptiEnumMetrics</tt> function. </p> <p class="p"> CUPTI provides two functions for calculating a metric value. <tt class="ph tt">cuptiMetricGetValue2</tt> can be used to calculate a metric value when the device is not available. All required event values and metric properties must be provided by the caller. 
<tt class="ph tt">cuptiMetricGetValue</tt> can be used to calculate a metric value when the device is available (as a CUdevice object). All required event values must be provided by the caller, but CUPTI will determine the appropriate property values from the CUdevice object. </p> <p class="p"> Configuring and calculating metric values requires the following steps. First, determine the name of the metric that you want to collect, and then use <tt class="ph tt">cuptiMetricGetIdFromName</tt> to get the metric ID. Use <tt class="ph tt">cuptiMetricEnumEvents</tt> to get the events required to calculate the metric and follow the instructions in the CUPTI Event API section to create the event groups for those events. Alternatively, you can use the <tt class="ph tt">cuptiMetricCreateEventGroupSets</tt> function to automatically create the event group(s) required for the metric's events. </p> <p class="p"> If you are using <tt class="ph tt">cuptiMetricGetValue2</tt> then you must also collect the required metric property values using <tt class="ph tt">cuptiMetricEnumProperties</tt>. </p> <p class="p"> Collect event counts as described in the CUPTI Event API section, and then use either <tt class="ph tt">cuptiMetricGetValue</tt> or <tt class="ph tt">cuptiMetricGetValue2</tt> to calculate the metric value from the collected event and property values. The <strong class="ph b">callback_metric</strong> sample described on the <a class="xref" href="r_main.html#r_samples" shape="rect">samples page</a> shows how to use these functions to collect event values and calculate a metric using <tt class="ph tt">cuptiMetricGetValue</tt>. Note that, as shown in the example, you should collect event counts from all domain instances and normalize the counts to get the most accurate metric values. It is necessary to normalize the event counts because the number of event counter instances varies by device and by the event being counted. 
</p> <p class="p"> For example, a device might have 8 multiprocessors but only have event counters for 4 of the multiprocessors, and might have 3 memory units and only have event counters for one memory unit. When calculating a metric that requires a multiprocessor event and a memory unit event, the 4 multiprocessor counters should be summed and multiplied by 2 to normalize the event count across the entire device. Similarly, the one memory unit counter should be multiplied by 3 to normalize the event count across the entire device. The normalized values can then be passed to <tt class="ph tt">cuptiMetricGetValue</tt> or <tt class="ph tt">cuptiMetricGetValue2</tt> to calculate the metric value. </p> <p class="p"> As described, the normalization assumes the kernel executes a sufficient number of blocks to completely load the device. If the kernel has only a small number of blocks, normalizing across the entire device may skew the result. </p> </div> </div> <div class="topic reference cuda_reference nested2" id="r_metric_reference_1x"><a name="r_metric_reference_1x" shape="rect"> <!-- --></a><h4 class="topictitle4">1.6.1. Metric Reference - Compute Capability 1.x</h4> <div class="body refbody"> <div class="section"> <p class="p">Devices with compute capability less than 2.0 implement the metrics shown in the following table. A scope value of single-context indicates that the metric can only be accurately collected when a single context (CUDA or graphics) is executing on the GPU. A scope value of multi-context indicates that the metric can be accurately collected when multiple contexts are executing on the GPU. </p> <div class="tablenoborder"> <table cellpadding="4" cellspacing="0" summary="" class="table capabity_metrics" frame="border" border="1" rules="all"> <caption><span class="tablecap">Table 1. 
Capability 1.x Metrics</span></caption> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="33.33333333333333%" id="d26284e982" rowspan="1" colspan="1">Metric Name</th> <th class="entry" valign="top" width="44.44444444444444%" id="d26284e985" rowspan="1" colspan="1">Description</th> <th class="entry" valign="top" width="22.22222222222222%" id="d26284e988" rowspan="1" colspan="1">Scope</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="33.33333333333333%" headers="d26284e982" rowspan="1" colspan="1">branch_efficiency</td> <td class="entry" valign="top" width="44.44444444444444%" headers="d26284e985" rowspan="1" colspan="1">Ratio of non-divergent branches to total branches</td> <td class="entry" valign="top" width="22.22222222222222%" headers="d26284e988" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="33.33333333333333%" headers="d26284e982" rowspan="1" colspan="1">gld_efficiency</td> <td class="entry" valign="top" width="44.44444444444444%" headers="d26284e985" rowspan="1" colspan="1">Ratio of requested global memory load transactions to actual global memory load transactions </td> <td class="entry" valign="top" width="22.22222222222222%" headers="d26284e988" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="33.33333333333333%" headers="d26284e982" rowspan="1" colspan="1">gst_efficiency</td> <td class="entry" valign="top" width="44.44444444444444%" headers="d26284e985" rowspan="1" colspan="1">Ratio of requested global memory store transactions to actual global memory store transactions </td> <td class="entry" valign="top" width="22.22222222222222%" headers="d26284e988" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="33.33333333333333%" headers="d26284e982" rowspan="1" colspan="1">gld_requested_throughput</td> <td 
class="entry" valign="top" width="44.44444444444444%" headers="d26284e985" rowspan="1" colspan="1">Requested global memory load throughput</td> <td class="entry" valign="top" width="22.22222222222222%" headers="d26284e988" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="33.33333333333333%" headers="d26284e982" rowspan="1" colspan="1">gst_requested_throughput</td> <td class="entry" valign="top" width="44.44444444444444%" headers="d26284e985" rowspan="1" colspan="1">Requested global memory store throughput</td> <td class="entry" valign="top" width="22.22222222222222%" headers="d26284e988" rowspan="1" colspan="1">Single-context</td> </tr> </tbody> </table> </div> </div> </div> </div> <div class="topic reference cuda_reference nested2" id="r_metric_reference_2x"><a name="r_metric_reference_2x" shape="rect"> <!-- --></a><h4 class="topictitle4">1.6.2. Metric Reference - Compute Capability 2.x</h4> <div class="body refbody"> <div class="section"> <p class="p">Devices with compute capability between 2.0, inclusive, and 3.0 implement the metrics shown in the following table. A scope value of single-context indicates that the metric can only be accurately collected when a single context (CUDA or graphics) is executing on the GPU. A scope value of multi-context indicates that the metric can be accurately collected when multiple contexts are executing on the GPU. </p> <div class="tablenoborder"> <table cellpadding="4" cellspacing="0" summary="" class="table capabity_metrics" frame="border" border="1" rules="all"> <caption><span class="tablecap">Table 2. 
Capability 2.x Metrics</span></caption> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="36.36363636363637%" id="d26284e1090" rowspan="1" colspan="1">Metric Name</th> <th class="entry" valign="top" width="45.45454545454545%" id="d26284e1093" rowspan="1" colspan="1">Description</th> <th class="entry" valign="top" width="18.181818181818183%" id="d26284e1096" rowspan="1" colspan="1">Scope</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">sm_efficiency</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">The percentage of time at least one warp is active on a multiprocessor averaged over all multiprocessors on the GPU </td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">sm_efficiency_instance</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">The percentage of time at least one warp is active on a specific multiprocessor </td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">achieved_occupancy</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor </td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" 
width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">issue_slot_utilization</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Percentage of issue slots that issued at least one instruction, averaged across all cycles </td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">inst_executed</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">The number of instructions executed</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">inst_issued</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">The number of instructions issued</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">issue_slots</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">The number of issue slots used</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">executed_ipc</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Instructions executed per cycle</td> <td class="entry" valign="top" 
width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">issued_ipc</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Instructions issued per cycle</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">ipc_instance</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Instructions executed per cycle for a single multiprocessor</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">inst_per_warp</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Average number of instructions executed by each warp</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">cf_issued</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Number of issued control-flow instructions</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">cf_executed</td> <td class="entry" valign="top" width="45.45454545454545%" 
headers="d26284e1093" rowspan="1" colspan="1">Number of executed control-flow instructions</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">ldst_issued</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Number of issued load and store instructions</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">ldst_executed</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Number of executed load and store instructions</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">branch_efficiency</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Ratio of non-divergent branches to total branches</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">warp_execution_efficiency</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Ratio of the average active threads per warp to the maximum number of threads per warp supported on a multiprocessor </td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" 
colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">inst_replay_overhead</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Average number of replays for each instruction executed</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">shared_replay_overhead</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Average number of replays due to shared memory conflicts for each instruction executed </td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">global_cache_replay_overhead</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Average number of replays due to global memory cache misses for each instruction executed </td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">local_replay_overhead</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Average number of replays due to local memory accesses for each instruction executed </td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" 
headers="d26284e1090" rowspan="1" colspan="1">gld_efficiency</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Ratio of requested global memory load throughput to required global memory load throughput </td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">gst_efficiency</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Ratio of requested global memory store throughput to required global memory store throughput </td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">gld_transactions</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Number of global memory load transactions</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">gst_transactions</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Number of global memory store transactions</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">gld_transactions_per_request</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Average number 
of global memory load transactions performed for each global memory load</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">gst_transactions_per_request</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Average number of global memory store transactions performed for each global memory store</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">gld_throughput</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Global memory load throughput</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">gst_throughput</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Global memory store throughput</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">gld_requested_throughput</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Requested global memory load throughput</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" 
width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">gst_requested_throughput</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Requested global memory store throughput</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">local_load_transactions</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Number of local memory load transactions</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">local_store_transactions</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Number of local memory store transactions</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">local_load_transactions_per_request</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Average number of local memory load transactions performed for each local memory load</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">local_store_transactions_per_request</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" 
colspan="1">Average number of local memory store transactions performed for each local memory store</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">local_load_throughput</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Local memory load throughput</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">local_store_throughput</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Local memory store throughput</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">shared_load_transactions</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Number of shared memory load transactions</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">shared_store_transactions</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Number of shared memory store transactions</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" 
width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">shared_load_transactions_per_request</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Average number of shared memory load transactions performed for each shared memory load</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">shared_store_transactions_per_request</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Average number of shared memory store transactions performed for each shared memory store</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">shared_load_throughput</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Shared memory load throughput</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">shared_store_throughput</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Shared memory store throughput</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">shared_efficiency</td> <td class="entry" valign="top" width="45.45454545454545%" 
headers="d26284e1093" rowspan="1" colspan="1">Ratio of requested shared memory throughput to required shared memory throughput</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">dram_read_transactions</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Device memory read transactions</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">dram_write_transactions</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Device memory write transactions</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">dram_read_throughput</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Device memory read throughput</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">dram_write_throughput</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Device memory write throughput</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" 
width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">sysmem_read_transactions</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">System memory read transactions</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">sysmem_write_transactions</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">System memory write transactions</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">sysmem_read_throughput</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">System memory read throughput</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">sysmem_write_throughput</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">System memory write throughput</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">l1_cache_global_hit_rate</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Hit rate in L1 cache for global loads</td> <td class="entry" valign="top" width="18.181818181818183%" 
headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">l1_cache_local_hit_rate</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Hit rate in L1 cache for local loads and stores</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">tex_cache_hit_rate</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Texture cache hit rate</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">tex_cache_transactions</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Texture cache read transactions</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">tex_cache_throughput</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Texture cache throughput</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">l2_read_transactions</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" 
colspan="1">Memory read transactions seen at L2 cache for all read requests</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">l2_write_transactions</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Memory write transactions seen at L2 cache for all write requests</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">l2_read_throughput</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Memory read throughput seen at L2 cache for all read requests</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">l2_write_throughput</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Memory write throughput seen at L2 cache for all write requests</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">l2_l1_read_hit_rate</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Hit rate at L2 cache for all read requests from L1 cache</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> 
</tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">l2_l1_read_throughput</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Memory read throughput seen at L2 cache for read requests from L1 cache </td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">l2_texture_read_hit_rate</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Hit rate at L2 cache for all read requests from texture cache</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">l2_texture_read_throughput</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Memory read throughput seen at L2 cache for read requests from the texture cache </td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">local_memory_overhead</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Ratio of local memory traffic to total memory traffic between the L1 and L2 caches </td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" 
colspan="1">l1_shared_utilization</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">The utilization level of the L1/shared memory relative to peak utilization</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">l2_utilization</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">The utilization level of the L2 cache relative to the peak utilization</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">tex_utilization</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">The utilization level of the texture cache relative to the peak utilization</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">dram_utilization</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">The utilization level of the device memory relative to the peak utilization</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">sysmem_utilization</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">The utilization level of 
the system memory relative to the peak utilization</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">ldst_fu_utilization</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">The utilization level of the multiprocessor function units that execute load and store instructions</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">int_fu_utilization</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">The utilization level of the multiprocessor function units that execute integer instructions</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">cf_fu_utilization</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">The utilization level of the multiprocessor function units that execute control-flow instructions</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">tex_fu_utilization</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">The utilization level of the multiprocessor function units that execute texture instructions</td> <td class="entry" 
valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">tex_fu_utilization</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">The utilization level of the multiprocessor function units that execute floating point instructions</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">fpspec_fu_utilization</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">The utilization level of the multiprocessor function units that execute special floating point instructions</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">misc_fu_utilization</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">The utilization level of the multiprocessor function units that execute miscellaneous instructions</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">flops_sp</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Single-precision floating point operations executed</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Multi-context</td> 
</tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">flops_sp_add</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Single-precision floating point add operations executed</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">flops_sp_mul</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Single-precision floating point multiply operations executed</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">flops_sp_fma</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Single-precision floating point multiply-accumulate operations executed</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">flops_dp</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Double-precision floating point operations executed</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">flops_dp_add</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" 
colspan="1">Double-precision floating point add operations executed</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">flops_dp_mul</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Double-precision floating point multiply operations executed</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">flops_dp_fma</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Double-precision floating point multiply-accumulate operations executed</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">flops_sp_special</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Single-precision floating point special operations executed</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">stall_inst_fetch</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Percentage of stalls occurring because the next assembly instruction has not yet been fetched</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" 
colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">stall_exec_dependency</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Percentage of stalls occurring because an input required by the instruction is not yet available</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">stall_data_request</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Percentage of stalls occurring because a memory operation cannot be performed due to the required resources not being available or fully utilized, or because too many requests of a given type are outstanding </td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">stall_sync</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Percentage of stalls occurring because the warp is blocked at a __syncthreads() call</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">stall_texture</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Percentage of stalls occurring because the texture sub-system is fully utilized or has too many outstanding requests</td> <td class="entry" valign="top" width="18.181818181818183%" 
headers="d26284e1096" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e1090" rowspan="1" colspan="1">stall_other</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e1093" rowspan="1" colspan="1">Percentage of stalls occurring due to miscellaneous reasons</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e1096" rowspan="1" colspan="1">Multi-context</td> </tr> </tbody> </table> </div> </div> </div> </div> <div class="topic reference cuda_reference nested2" id="r_metric_reference_3x"><a name="r_metric_reference_3x" shape="rect"> <!-- --></a><h4 class="topictitle4">1.6.3. Metric Reference - Compute Capability 3.x</h4> <div class="body refbody"> <div class="section"> <p class="p">Devices with compute capability greater than or equal to 3.0 implement the metrics shown in the following table. A scope value of single-context indicates that the metric can only be accurately collected when a single context (CUDA or graphics) is executing on the GPU. A scope value of multi-context indicates that the metric can be accurately collected when multiple contexts are executing on the GPU. </p> <div class="tablenoborder"> <table cellpadding="4" cellspacing="0" summary="" class="table capabity_metrics" frame="border" border="1" rules="all"> <caption><span class="tablecap">Table 3. 
Capability 3.x Metrics</span></caption> <thead class="thead" align="left"> <tr class="row"> <th class="entry" valign="top" width="36.36363636363637%" id="d26284e2269" rowspan="1" colspan="1">Metric Name</th> <th class="entry" valign="top" width="45.45454545454545%" id="d26284e2272" rowspan="1" colspan="1">Description</th> <th class="entry" valign="top" width="18.181818181818183%" id="d26284e2275" rowspan="1" colspan="1">Scope</th> </tr> </thead> <tbody class="tbody"> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">sm_efficiency</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">The percentage of time at least one warp is active on a multiprocessor averaged over all multiprocessors on the GPU </td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">sm_efficiency_instance</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">The percentage of time at least one warp is active on a specific multiprocessor </td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">achieved_occupancy</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor </td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" 
width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">issue_slot_utilization</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Percentage of issue slots that issued at least one instruction, averaged across all cycles </td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">inst_executed</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">The number of instructions executed</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">inst_issued</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">The number of instructions issued</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">issue_slots</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">The number of issue slots used</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">executed_ipc</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Instructions executed per cycle</td> <td class="entry" valign="top" 
width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">issued_ipc</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Instructions issued per cycle</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">ipc_instance</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Instructions executed per cycle for a single multiprocessor</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">inst_per_warp</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Average number of instructions executed by each warp</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">cf_issued</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Number of issued control-flow instructions</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">cf_executed</td> <td class="entry" valign="top" width="45.45454545454545%" 
headers="d26284e2272" rowspan="1" colspan="1">Number of executed control-flow instructions</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">ldst_issued</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Number of issued load and store instructions</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">ldst_executed</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Number of executed load and store instructions</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">branch_efficiency</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Ratio of non-divergent branches to total branches</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">warp_execution_efficiency</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Ratio of the average active threads per warp to the maximum number of threads per warp supported on a multiprocessor </td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" 
colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">warp_nonpred_execution_efficiency</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Ratio of the average active threads per warp executing non-predicated instructions to the maximum number of threads per warp supported on a multiprocessor </td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">inst_replay_overhead</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Average number of replays for each instruction executed</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">shared_replay_overhead</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Average number of replays due to shared memory conflicts for each instruction executed </td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">global_cache_replay_overhead</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Average number of replays due to global memory cache misses for each instruction executed </td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> 
<tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">local_replay_overhead</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Average number of replays due to local memory accesses for each instruction executed </td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">gld_efficiency</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Ratio of requested global memory load throughput to required global memory load throughput </td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">gst_efficiency</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Ratio of requested global memory store throughput to required global memory store throughput </td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">gld_transactions</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Number of global memory load transactions</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">gst_transactions</td> <td 
class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Number of global memory store transactions</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">gld_transactions_per_request</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Average number of global memory load transactions performed for each global memory load</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">gst_transactions_per_request</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Average number of global memory store transactions performed for each global memory store</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">gld_throughput</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Global memory load throughput</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">gst_throughput</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Global memory store throughput</td> <td class="entry" valign="top" width="18.181818181818183%" 
headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">gld_requested_throughput</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Requested global memory load throughput</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">gst_requested_throughput</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Requested global memory store throughput</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">local_load_transactions</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Number of local memory load transactions</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">local_store_transactions</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Number of local memory store transactions</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">local_load_transactions_per_ request</td> <td class="entry" valign="top" 
width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Average number of local memory load transactions performed for each local memory load</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">local_store_transactions_per_request</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Average number of local memory store transactions performed for each local memory store</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">local_load_throughput</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Local memory load throughput</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">local_store_throughput</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Local memory store throughput</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">shared_load_transactions</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Number of shared memory load transactions</td> <td class="entry" valign="top" width="18.181818181818183%" 
headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">shared_store_transactions</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Number of shared memory store transactions</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">shared_load_transactions_per_request</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Average number of shared memory load transactions performed for each shared memory load</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">shared_store_transactions_per_request</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Average number of shared memory store transactions performed for each shared memory store</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">shared_load_throughput</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Shared memory load throughput</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" 
headers="d26284e2269" rowspan="1" colspan="1">shared_store_throughput</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Shared memory store throughput</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">shared_efficiency</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Ratio of requested shared memory throughput to required shared memory throughput</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">dram_read_transactions</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Device memory read transactions</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">dram_write_transactions</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Device memory write transactions</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">dram_read_throughput</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Device memory read throughput</td> <td class="entry" valign="top" width="18.181818181818183%" 
headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">dram_write_throughput</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Device memory write throughput</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">sysmem_read_transactions</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">System memory read transactions</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">sysmem_write_transactions</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">System memory write transactions</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">sysmem_read_throughput</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">System memory read throughput</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">sysmem_write_throughput</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" 
colspan="1">System memory write throughput</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">l1_cache_global_hit_rate</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Hit rate in L1 cache for global loads</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">l1_cache_local_hit_rate</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Hit rate in L1 cache for local loads and stores</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">tex_cache_hit_rate</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Texture cache hit rate</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">tex_cache_transactions</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Texture cache read transactions</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" 
colspan="1">tex_cache_throughput</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Texture cache throughput</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">l2_read_transactions</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Memory read transactions seen at L2 cache for all read requests</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">l2_write_transactions</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Memory write transactions seen at L2 cache for all write requests</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">l2_read_throughput</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Memory read throughput seen at L2 cache for all read requests</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">l2_write_throughput</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Memory write throughput seen at L2 cache for all write requests</td> <td class="entry" 
valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">l2_l1_read_hit_rate</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Hit rate at L2 cache for all read requests from L1 cache</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">l2_l1_read_throughput</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Memory read throughput seen at L2 cache for read requests from L1 cache</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">l2_texture_read_hit_rate</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Hit rate at L2 cache for all read requests from texture cache</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">l2_texture_read_throughput</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Memory read throughput seen at L2 cache for read requests from the texture cache</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" 
width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">local_memory_overhead</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Ratio of local memory traffic to total memory traffic between the L1 and L2 caches </td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">l1_shared_utilization</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">The utilization level of the L1/shared memory relative to peak utilization</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">l2_utilization</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">The utilization level of the L2 cache relative to the peak utilization</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">tex_utilization</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">The utilization level of the texture cache relative to the peak utilization</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">dram_utilization</td> <td class="entry" valign="top" width="45.45454545454545%" 
headers="d26284e2272" rowspan="1" colspan="1">The utilization level of the device memory relative to the peak utilization</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">sysmem_utilization</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">The utilization level of the system memory relative to the peak utilization</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Single-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">ldst_fu_utilization</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">The utilization level of the multiprocessor function units that execute load and store instructions</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">int_fu_utilization</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">The utilization level of the multiprocessor function units that execute integer instructions</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">cf_fu_utilization</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">The utilization level of the multiprocessor function units that 
execute control-flow instructions</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">tex_fu_utilization</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">The utilization level of the multiprocessor function units that execute texture instructions</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">fp_fu_utilization</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">The utilization level of the multiprocessor function units that execute floating point instructions</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">fpspec_fu_utilization</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">The utilization level of the multiprocessor function units that execute special floating point instructions</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">misc_fu_utilization</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">The utilization level of the multiprocessor function units that execute miscellaneous instructions</td> <td class="entry" 
valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">flops_sp</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Single-precision floating point operations executed</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">flops_sp_add</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Single-precision floating point add operations executed</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">flops_sp_mul</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Single-precision floating point multiply operations executed</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">flops_sp_fma</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Single-precision floating point multiply-accumulate operations executed</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">flops_dp</td> <td 
class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Double-precision floating point operations executed</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">flops_dp_add</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Double-precision floating point add operations executed</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">flops_dp_mul</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Double-precision floating point multiply operations executed</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">flops_dp_fma</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Double-precision floating point multiply-accumulate operations executed</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">flops_sp_special</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Single-precision floating point special operations executed</td> <td class="entry" valign="top" width="18.181818181818183%" 
headers="d26284e2275" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">stall_inst_fetch</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Percentage of stalls occurring because the next assembly instruction has not yet been fetched</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">stall_exec_dependency</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Percentage of stalls occurring because an input required by the instruction is not yet available</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">stall_data_request</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Percentage of stalls occurring because a memory operation cannot be performed due to the required resources not being available or fully utilized, or because too many requests of a given type are outstanding </td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">stall_sync</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Percentage of stalls occurring because the warp is blocked at a __syncthreads() call</td> <td class="entry" valign="top" 
width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">stall_texture</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Percentage of stalls occurring because the texture sub-system is fully utilized or has too many outstanding requests</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Multi-context</td> </tr> <tr class="row"> <td class="entry" valign="top" width="36.36363636363637%" headers="d26284e2269" rowspan="1" colspan="1">stall_other</td> <td class="entry" valign="top" width="45.45454545454545%" headers="d26284e2272" rowspan="1" colspan="1">Percentage of stalls occurring due to miscellaneous reasons</td> <td class="entry" valign="top" width="18.181818181818183%" headers="d26284e2275" rowspan="1" colspan="1">Multi-context</td> </tr> </tbody> </table> </div> </div> </div> </div> </div> <div class="topic reference cuda_reference nested1" id="r_samples"><a name="r_samples" shape="rect"> <!-- --></a><h3 class="topictitle3">1.7. Samples</h3> <div class="body refbody"> <div class="section"> <p class="p">The CUPTI installation includes several samples that demonstrate the use of the CUPTI APIs. The samples are: </p> <dl class="dl"> <dt class="dt dlterm"><a name="r_samples__activity_trace_async" shape="rect"> <!-- --></a>activity_trace_async </dt> <dd class="dd">This sample shows how to collect a trace of CPU and GPU activity using the new asynchronous activity buffer APIs. </dd> <dt class="dt dlterm">callback_event</dt> <dd class="dd">This sample shows how to use both the callback and event APIs to record the events that occur during the execution of a simple kernel. The sample shows the required ordering for synchronization, and for event group enabling, disabling and reading. 
</dd> <dt class="dt dlterm">callback_metric</dt> <dd class="dd">This sample shows how to use both the callback and metric APIs to record the metric's events during the execution of a simple kernel, and then use those events to calculate the metric value. </dd> <dt class="dt dlterm">callback_timestamp</dt> <dd class="dd">This sample shows how to use the callback API to record a trace of API start and stop times. </dd> <dt class="dt dlterm">cupti_query</dt> <dd class="dd">This sample shows how to query CUDA-enabled devices for their event domains, events, and metrics. </dd> <dt class="dt dlterm">event_sampling</dt> <dd class="dd">This sample shows how to use the event API to sample events using a separate host thread. </dd> </dl> </div> </div> </div> </div> <hr id="contents-end"></hr> </article> <header id="header"><span id="company">NVIDIA</span><span id="site-title">CUDA Toolkit Documentation</span><form id="search" method="get" action="search"> <input type="text" name="search-text"></input><fieldset id="search-location"> <legend>Search In:</legend> <label><input type="radio" name="search-type" value="site"></input>Entire Site</label> <label><input type="radio" name="search-type" value="document"></input>Just This Document</label></fieldset> <button type="reset">clear search</button> <button id="submit" type="submit">search</button></form> </header> <nav id="site-nav"> <div class="category closed"><span class="twiddle">▷</span><a href="../index.html" title="The root of the site.">CUDA Toolkit</a></div> <ul class="closed"> <li><a 
href="../cuda-toolkit-release-notes/index.html" title="The Release Notes for the CUDA Toolkit from v4.0 to today.">Release Notes</a></li> <li><a href="../eula/index.html" title="The End User License Agreements for the NVIDIA CUDA Toolkit, the NVIDIA CUDA Samples, the NVIDIA Display Driver, and NVIDIA NSight (Visual Studio Edition).">EULA</a></li> <li><a href="../cuda-getting-started-guide-for-linux/index.html" title="This guide discusses how to install and check for correct operation of the CUDA Development Tools on GNU/Linux systems.">Getting Started Linux</a></li> <li><a href="../cuda-getting-started-guide-for-mac-os-x/index.html" title="This guide discusses how to install and check for correct operation of the CUDA Development Tools on Mac OS X systems.">Getting Started Mac OS X</a></li> <li><a href="../cuda-getting-started-guide-for-microsoft-windows/index.html" title="This guide discusses how to install and check for correct operation of the CUDA Development Tools on Microsoft Windows systems.">Getting Started Windows</a></li> <li><a href="../cuda-c-programming-guide/index.html" title="This guide provides a detailed discussion of the CUDA programming model and programming interface. It then describes the hardware implementation, and provides guidance on how to achieve maximum performance. The Appendixes include a list of all CUDA-enabled devices, detailed description of all extensions to the C language, listings of supported mathematical functions, C++ features supported in host and device code, details on texture fetching, technical specifications of various devices, and concludes by introducing the low-level driver API.">Programming Guide</a></li> <li><a href="../cuda-c-best-practices-guide/index.html" title="This guide presents established parallelization and optimization techniques and explains coding metaphors and idioms that can greatly simplify programming for CUDA-capable GPU architectures. 
The intent is to provide guidelines for obtaining the best performance from NVIDIA GPUs using the CUDA Toolkit.">Best Practices Guide</a></li> <li><a href="../kepler-compatibility-guide/index.html" title="This application note is intended to help developers ensure that their NVIDIA CUDA applications will run effectively on GPUs based on the NVIDIA Kepler Architecture. This document provides guidance to ensure that your software applications are compatible with Kepler.">Kepler Compatibility Guide</a></li> <li><a href="../kepler-tuning-guide/index.html" title="Kepler is NVIDIA's next-generation architecture for CUDA compute applications. Applications that follow the best practices for the Fermi architecture should typically see speedups on the Kepler architecture without any code changes. This guide summarizes the ways that an application can be fine-tuned to gain additional speedups by leveraging Kepler architectural features.">Kepler Tuning Guide</a></li> <li><a href="../parallel-thread-execution/index.html" title="This guide provides detailed instructions on the use of PTX, a low-level parallel thread execution virtual machine and instruction set architecture (ISA). 
PTX exposes the GPU as a data-parallel computing device.">PTX ISA</a></li> <li><a href="../optimus-developer-guide/index.html" title="This document explains how CUDA APIs can be used to query for GPU capabilities in NVIDIA Optimus systems.">Developer Guide for Optimus</a></li> <li><a href="../video-decoder/index.html" title="This document provides the video decoder API specification and the format conversion and display using DirectX or OpenGL following decode.">Video Decoder</a></li> <li><a href="../video-encoder/index.html" title="This document provides the CUDA video encoder specifications, including the C-library API functions and encoder query parameters.">Video Encoder</a></li> <li><a href="../inline-ptx-assembly/index.html" title="This document shows how to inline PTX (parallel thread execution) assembly language statements into CUDA code. It describes available assembler statement parameters and constraints, and the document also provides a list of some pitfalls that you may encounter.">Inline PTX Assembly</a></li> <li><a href="../cuda-runtime-api/index.html" title="The CUDA runtime API.">CUDA Runtime API</a></li> <li><a href="../cuda-driver-api/index.html" title="The CUDA driver API.">CUDA Driver API</a></li> <li><a href="../cuda-math-api/index.html" title="The CUDA math API.">CUDA Math API</a></li> <li><a href="../cublas/index.html" title="The CUBLAS library is an implementation of BLAS (Basic Linear Algebra Subprograms) on top of the NVIDIA CUDA runtime. 
It allows the user to access the computational resources of an NVIDIA Graphics Processing Unit (GPU), but does not auto-parallelize across multiple GPUs.">CUBLAS</a></li> <li><a href="../cufft/index.html" title="The CUFFT library user guide.">CUFFT</a></li> <li><a href="../curand/index.html" title="The CURAND library user guide.">CURAND</a></li> <li><a href="../cusparse/index.html" title="The CUSPARSE library user guide.">CUSPARSE</a></li> <li><a href="../npp/index.html" title="NVIDIA NPP is a library of functions for performing CUDA-accelerated processing. The initial set of functionality in the library focuses on imaging and video processing and is widely applicable for developers in these areas. NPP will evolve over time to encompass more of the compute-heavy tasks in a variety of problem domains. The NPP library is written to maximize flexibility, while maintaining high performance.">NPP</a></li> <li><a href="../thrust/index.html" title="The Thrust getting started guide.">Thrust</a></li> <li><a href="../cuda-samples/index.html" title="This document contains a complete listing of the code samples that are included with the NVIDIA CUDA Toolkit. It describes each code sample, lists the minimum GPU specification, and provides links to the source code and white papers if available.">CUDA Samples</a></li> <li><a href="../cuda-compiler-driver-nvcc/index.html" title="This document is a reference guide on the use of the CUDA compiler driver nvcc. Instead of being a specific CUDA compilation driver, nvcc mimics the behavior of the GNU compiler gcc, accepting a range of conventional compiler options, such as for defining macros and include/library paths, and for steering the compilation process.">NVCC</a></li> <li><a href="../cuda-gdb/index.html" title="The NVIDIA tool for debugging CUDA applications running on Linux and Mac, providing developers with a mechanism for debugging CUDA applications running on actual hardware. 
CUDA-GDB is an extension to the x86-64 port of GDB, the GNU Project debugger.">CUDA-GDB</a></li> <li><a href="../cuda-memcheck/index.html" title="CUDA-MEMCHECK is a suite of run time tools capable of precisely detecting out of bounds and misaligned memory access errors, checking device allocation leaks, reporting hardware errors and identifying shared memory data access hazards.">CUDA-MEMCHECK</a></li> <li><a href="../nsight-eclipse-edition-getting-started-guide/index.html" title="Nsight Eclipse Edition getting started guide">Nsight Eclipse Edition</a></li> <li><a href="../profiler-users-guide/index.html" title="This is the guide to the Profiler.">Profiler</a></li> <li><a href="../cuda-binary-utilities/index.html" title="The application notes for cuobjdump and nvdisasm.">CUDA Binary Utilities</a></li> <li><a href="../floating-point/index.html" title="A number of issues related to floating point accuracy and compliance are a frequent source of confusion on both CPUs and GPUs. The purpose of this white paper is to discuss the most common issues related to NVIDIA GPUs and to supplement the documentation in the CUDA C Programming Guide.">Floating Point and IEEE 754</a></li> <li><a href="../incomplete-lu-cholesky/index.html" title="In this white paper we show how to use the CUSPARSE and CUBLAS libraries to achieve a 2x speedup over CPU in the incomplete-LU and Cholesky preconditioned iterative methods. We focus on the Bi-Conjugate Gradient Stabilized and Conjugate Gradient iterative methods, that can be used to solve large sparse nonsymmetric and symmetric positive definite linear systems, respectively. 
Also, we comment on the parallel sparse triangular solve, which is an essential building block in these algorithms.">Incomplete-LU and Cholesky Preconditioned Iterative Methods</a></li> <li><a href="../libnvvm-api/index.html" title="The libNVVM API.">libNVVM API</a></li> <li><a href="../libdevice-users-guide/index.html" title="The libdevice library is an LLVM bitcode library that implements common functions for GPU kernels.">libdevice User's Guide</a></li> <li><a href="../nvvm-ir-spec/index.html" title="NVVM IR is a compiler IR (internal representation) based on the LLVM IR. The NVVM IR is designed to represent GPU compute kernels (for example, CUDA kernels). High-level language front-ends, like the CUDA C compiler front-end, can generate NVVM IR.">NVVM IR</a></li> <li><a href="../cupti/index.html" title="The CUPTI API.">CUPTI</a></li> <li><a href="../debugger-api/index.html" title="The CUDA debugger API.">Debugger API</a></li> <li><a href="../gpudirect-rdma/index.html" title="A tool for Kepler-class GPUs and CUDA 5.0 enabling a direct path for communication between the GPU and a peer device on the PCI Express bus when the devices share the same upstream root complex using standard features of PCI Express. This document introduces the technology and describes the steps necessary to enable an RDMA for GPUDirect connection to NVIDIA GPUs within the Linux device driver model.">RDMA for GPUDirect</a></li> </ul> <div class="category"><span class="twiddle">▼</span><a href="index.html" title="CUPTI">CUPTI</a></div> <ul> <li><a href="r_main.html#r_main">1. Introduction</a><ul> <li><a href="r_main.html#r_compatibility_requirements">1.1. CUPTI Compatibility and Requirements</a></li> <li><a href="r_main.html#r_initialization">1.2. CUPTI Initialization</a></li> <li><a href="r_main.html#r_activity">1.3. CUPTI Activity API</a></li> <li><a href="r_main.html#r_callback_api">1.4. CUPTI Callback API</a><ul> <li><a href="r_main.html#r_driver_runtime_api_callback">1.4.1. 
Driver and Runtime API Callbacks</a></li> <li><a href="r_main.html#r_resource_callbacks">1.4.2. Resource Callbacks</a></li> <li><a href="r_main.html#r_synchronization_callbacks">1.4.3. Synchronization Callbacks</a></li> <li><a href="r_main.html#r_nvtx_callbacks">1.4.4. NVIDIA Tools Extension Callbacks</a></li> </ul> </li> <li><a href="r_main.html#r_event_api">1.5. CUPTI Event API</a><ul> <li><a href="r_main.html#r_collecting_kernel_execution_events">1.5.1. Collecting Kernel Execution Events</a></li> <li><a href="r_main.html#r_sampling_events">1.5.2. Sampling Events</a></li> </ul> </li> <li><a href="r_main.html#r_metric_api">1.6. CUPTI Metric API</a><ul> <li><a href="r_main.html#r_metric_reference_1x">1.6.1. Metric Reference - Compute Capability 1.x</a></li> <li><a href="r_main.html#r_metric_reference_2x">1.6.2. Metric Reference - Compute Capability 2.x</a></li> <li><a href="r_main.html#r_metric_reference_3x">1.6.3. Metric Reference - Compute Capability 3.x</a></li> </ul> </li> <li><a href="r_main.html#r_samples">1.7. Samples</a></li> </ul> </li> <li><a href="modules.html#modules">2. Modules</a><ul> <li><a href="group__CUPTI__VERSION__API.html#group__CUPTI__VERSION__API">2.1. CUPTI Version</a></li> <li><a href="group__CUPTI__RESULT__API.html#group__CUPTI__RESULT__API">2.2. CUPTI Result Codes</a></li> <li><a href="group__CUPTI__ACTIVITY__API.html#group__CUPTI__ACTIVITY__API">2.3. CUPTI Activity API</a></li> <li><a href="group__CUPTI__CALLBACK__API.html#group__CUPTI__CALLBACK__API">2.4. CUPTI Callback API</a></li> <li><a href="group__CUPTI__EVENT__API.html#group__CUPTI__EVENT__API">2.5. CUPTI Event API</a></li> <li><a href="group__CUPTI__METRIC__API.html#group__CUPTI__METRIC__API">2.6. CUPTI Metric API</a></li> </ul> </li> <li><a href="annotated.html#annotated">3. Data Structures</a><ul> <li><a href="structCUpti__Activity.html#structCUpti__Activity">3.1. CUpti_Activity</a></li> <li><a href="structCUpti__ActivityAPI.html#structCUpti__ActivityAPI">3.2. 
CUpti_ActivityAPI</a></li> <li><a href="structCUpti__ActivityBranch.html#structCUpti__ActivityBranch">3.3. CUpti_ActivityBranch</a></li> <li><a href="structCUpti__ActivityCdpKernel.html#structCUpti__ActivityCdpKernel">3.4. CUpti_ActivityCdpKernel</a></li> <li><a href="structCUpti__ActivityContext.html#structCUpti__ActivityContext">3.5. CUpti_ActivityContext</a></li> <li><a href="structCUpti__ActivityDevice.html#structCUpti__ActivityDevice">3.6. CUpti_ActivityDevice</a></li> <li><a href="structCUpti__ActivityEnvironment.html#structCUpti__ActivityEnvironment">3.7. CUpti_ActivityEnvironment</a></li> <li><a href="structCUpti__ActivityEvent.html#structCUpti__ActivityEvent">3.8. CUpti_ActivityEvent</a></li> <li><a href="structCUpti__ActivityEventInstance.html#structCUpti__ActivityEventInstance">3.9. CUpti_ActivityEventInstance</a></li> <li><a href="structCUpti__ActivityGlobalAccess.html#structCUpti__ActivityGlobalAccess">3.10. CUpti_ActivityGlobalAccess</a></li> <li><a href="structCUpti__ActivityKernel.html#structCUpti__ActivityKernel">3.11. CUpti_ActivityKernel</a></li> <li><a href="structCUpti__ActivityKernel2.html#structCUpti__ActivityKernel2">3.12. CUpti_ActivityKernel2</a></li> <li><a href="structCUpti__ActivityMarker.html#structCUpti__ActivityMarker">3.13. CUpti_ActivityMarker</a></li> <li><a href="structCUpti__ActivityMarkerData.html#structCUpti__ActivityMarkerData">3.14. CUpti_ActivityMarkerData</a></li> <li><a href="structCUpti__ActivityMemcpy.html#structCUpti__ActivityMemcpy">3.15. CUpti_ActivityMemcpy</a></li> <li><a href="structCUpti__ActivityMemcpy2.html#structCUpti__ActivityMemcpy2">3.16. CUpti_ActivityMemcpy2</a></li> <li><a href="structCUpti__ActivityMemset.html#structCUpti__ActivityMemset">3.17. CUpti_ActivityMemset</a></li> <li><a href="structCUpti__ActivityMetric.html#structCUpti__ActivityMetric">3.18. CUpti_ActivityMetric</a></li> <li><a href="structCUpti__ActivityMetricInstance.html#structCUpti__ActivityMetricInstance">3.19. 
CUpti_ActivityMetricInstance</a></li> <li><a href="structCUpti__ActivityName.html#structCUpti__ActivityName">3.20. CUpti_ActivityName</a></li> <li><a href="unionCUpti__ActivityObjectKindId.html#unionCUpti__ActivityObjectKindId">3.21. CUpti_ActivityObjectKindId</a></li> <li><a href="structCUpti__ActivityOverhead.html#structCUpti__ActivityOverhead">3.22. CUpti_ActivityOverhead</a></li> <li><a href="structCUpti__ActivityPreemption.html#structCUpti__ActivityPreemption">3.23. CUpti_ActivityPreemption</a></li> <li><a href="structCUpti__ActivitySourceLocator.html#structCUpti__ActivitySourceLocator">3.24. CUpti_ActivitySourceLocator</a></li> <li><a href="structCUpti__CallbackData.html#structCUpti__CallbackData">3.25. CUpti_CallbackData</a></li> <li><a href="structCUpti__EventGroupSet.html#structCUpti__EventGroupSet">3.26. CUpti_EventGroupSet</a></li> <li><a href="structCUpti__EventGroupSets.html#structCUpti__EventGroupSets">3.27. CUpti_EventGroupSets</a></li> <li><a href="unionCUpti__MetricValue.html#unionCUpti__MetricValue">3.28. CUpti_MetricValue</a></li> <li><a href="structCUpti__NvtxData.html#structCUpti__NvtxData">3.29. CUpti_NvtxData</a></li> <li><a href="structCUpti__ResourceData.html#structCUpti__ResourceData">3.30. CUpti_ResourceData</a></li> <li><a href="structCUpti__SynchronizeData.html#structCUpti__SynchronizeData">3.31. CUpti_SynchronizeData</a></li> </ul> </li> <li><a href="functions.html#functions">4. 
Data Fields</a></li> <li><a href="notices-header.html#notices-header">Notices</a><ul></ul> </li> </ul> </nav> <nav id="search-results"> <h2>Search Results</h2> <ol></ol> </nav> <script language="JavaScript" type="text/javascript" charset="utf-8" src="../common/formatting/common.min.js"></script> <script language="JavaScript" type="text/javascript" charset="utf-8" src="../common/scripts/omniture/s_code_us_dev_aut1-nolinktrackin.js"></script> <script language="JavaScript" type="text/javascript" charset="utf-8" src="../common/scripts/omniture/omniture.js"></script> <noscript><a href="http://www.omniture.com" title="Web Analytics"><img src="http://omniture.nvidia.com/b/ss/nvidiacudadocs/1/H.17--NS/0" height="1" width="1" border="0" alt=""></img></a></noscript> <script language="JavaScript" type="text/javascript" charset="utf-8" src="../common/scripts/google-analytics/google-analytics-write.js"></script> <script language="JavaScript" type="text/javascript" charset="utf-8" src="../common/scripts/google-analytics/google-analytics-tracker.js"></script> </body> </html>