.TH "Execution Control" 3 "24 Apr 2019" "Version 6.0" "Doxygen" \" -*- nroff -*-
.ad l
.nh
.SH NAME
Execution Control \- 
.SS "Functions"

.in +1c
.ti -1c
.RI "\fBCUresult\fP \fBcuFuncGetAttribute\fP (int *pi, \fBCUfunction_attribute\fP attrib, \fBCUfunction\fP hfunc)"
.br
.RI "\fIReturns information about a function. \fP"
.ti -1c
.RI "\fBCUresult\fP \fBcuFuncSetAttribute\fP (\fBCUfunction\fP hfunc, \fBCUfunction_attribute\fP attrib, int value)"
.br
.RI "\fISets information about a function. \fP"
.ti -1c
.RI "\fBCUresult\fP \fBcuFuncSetCacheConfig\fP (\fBCUfunction\fP hfunc, \fBCUfunc_cache\fP config)"
.br
.RI "\fISets the preferred cache configuration for a device function. \fP"
.ti -1c
.RI "\fBCUresult\fP \fBcuFuncSetSharedMemConfig\fP (\fBCUfunction\fP hfunc, \fBCUsharedconfig\fP config)"
.br
.RI "\fISets the shared memory configuration for a device function. \fP"
.ti -1c
.RI "\fBCUresult\fP \fBcuLaunchCooperativeKernel\fP (\fBCUfunction\fP f, unsigned int gridDimX, unsigned int gridDimY, unsigned int gridDimZ, unsigned int blockDimX, unsigned int blockDimY, unsigned int blockDimZ, unsigned int sharedMemBytes, \fBCUstream\fP hStream, void **kernelParams)"
.br
.RI "\fILaunches a CUDA function where thread blocks can cooperate and synchronize as they execute. \fP"
.ti -1c
.RI "\fBCUresult\fP \fBcuLaunchCooperativeKernelMultiDevice\fP (\fBCUDA_LAUNCH_PARAMS\fP *launchParamsList, unsigned int numDevices, unsigned int flags)"
.br
.RI "\fILaunches CUDA functions on multiple devices where thread blocks can cooperate and synchronize as they execute. \fP"
.ti -1c
.RI "\fBCUresult\fP \fBcuLaunchHostFunc\fP (\fBCUstream\fP hStream, \fBCUhostFn\fP fn, void *userData)"
.br
.RI "\fIEnqueues a host function call in a stream. \fP"
.ti -1c
.RI "\fBCUresult\fP \fBcuLaunchKernel\fP (\fBCUfunction\fP f, unsigned int gridDimX, unsigned int gridDimY, unsigned int gridDimZ, unsigned int blockDimX, unsigned int blockDimY, unsigned int blockDimZ, unsigned int sharedMemBytes, \fBCUstream\fP hStream, void **kernelParams, void **extra)"
.br
.RI "\fILaunches a CUDA function. \fP"
.in -1c
.SH "Detailed Description"
.PP 
Execution control functions of the low-level CUDA driver API (\fBcuda.h\fP)
.PP
This section describes the execution control functions of the low-level CUDA driver application programming interface. 
.SH "Function Documentation"
.PP 
.SS "\fBCUresult\fP cuFuncGetAttribute (int * pi, \fBCUfunction_attribute\fP attrib, \fBCUfunction\fP hfunc)"
.PP
Returns in \fC*pi\fP the integer value of the attribute \fCattrib\fP on the kernel given by \fChfunc\fP. The supported attributes are:
.IP "\(bu" 2
\fBCU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK\fP: The maximum number of threads per block, beyond which a launch of the function would fail. This number depends on both the function and the device on which the function is currently loaded.
.IP "\(bu" 2
\fBCU_FUNC_ATTRIBUTE_SHARED_SIZE_BYTES\fP: The size in bytes of statically-allocated shared memory per block required by this function. This does not include dynamically-allocated shared memory requested by the user at runtime.
.IP "\(bu" 2
\fBCU_FUNC_ATTRIBUTE_CONST_SIZE_BYTES\fP: The size in bytes of user-allocated constant memory required by this function.
.IP "\(bu" 2
\fBCU_FUNC_ATTRIBUTE_LOCAL_SIZE_BYTES\fP: The size in bytes of local memory used by each thread of this function.
.IP "\(bu" 2
\fBCU_FUNC_ATTRIBUTE_NUM_REGS\fP: The number of registers used by each thread of this function.
.IP "\(bu" 2
\fBCU_FUNC_ATTRIBUTE_PTX_VERSION\fP: The PTX virtual architecture version for which the function was compiled. This value is the major PTX version * 10 + the minor PTX version, so a PTX version 1.3 function would return the value 13. Note that this may return the undefined value of 0 for cubins compiled prior to CUDA 3.0.
.IP "\(bu" 2
\fBCU_FUNC_ATTRIBUTE_BINARY_VERSION\fP: The binary architecture version for which the function was compiled. This value is the major binary version * 10 + the minor binary version, so a binary version 1.3 function would return the value 13. Note that this will return a value of 10 for legacy cubins that do not have a properly-encoded binary architecture version.
.IP "\(bu" 2
\fBCU_FUNC_ATTRIBUTE_CACHE_MODE_CA\fP: The attribute to indicate whether the function has been compiled with the user-specified option '-Xptxas --dlcm=ca' set.
.IP "\(bu" 2
\fBCU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES\fP: The maximum size in bytes of dynamically-allocated shared memory.
.IP "\(bu" 2
\fBCU_FUNC_ATTRIBUTE_PREFERRED_SHARED_MEMORY_CARVEOUT\fP: Preferred shared memory-L1 cache split ratio in percent of total shared memory.
.PP
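For example, a minimal sketch (assuming a function handle \fChfunc\fP obtained via \fBcuModuleGetFunction\fP) that queries the per-block thread limit:
.PP
.nf
    int maxThreads = 0;
    CUresult status = cuFuncGetAttribute(&maxThreads,
                          CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK, hfunc);
    if (status == CUDA_SUCCESS) {
        /* maxThreads is the largest blockDimX*blockDimY*blockDimZ
           that a launch of hfunc can use on the current device.  */
    }

.fi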
.PP
\fBParameters:\fP
.RS 4
\fIpi\fP - Returned attribute value 
.br
\fIattrib\fP - Attribute requested 
.br
\fIhfunc\fP - Function to query attribute of
.RE
.PP
\fBReturns:\fP
.RS 4
\fBCUDA_SUCCESS\fP, \fBCUDA_ERROR_DEINITIALIZED\fP, \fBCUDA_ERROR_NOT_INITIALIZED\fP, \fBCUDA_ERROR_INVALID_CONTEXT\fP, \fBCUDA_ERROR_INVALID_HANDLE\fP, \fBCUDA_ERROR_INVALID_VALUE\fP 
.RE
.PP
\fBNote:\fP
.RS 4
Note that this function may also return error codes from previous, asynchronous launches.
.RE
.PP
\fBSee also:\fP
.RS 4
\fBcuCtxGetCacheConfig\fP, \fBcuCtxSetCacheConfig\fP, \fBcuFuncSetCacheConfig\fP, \fBcuLaunchKernel\fP, cudaFuncGetAttributes, cudaFuncSetAttribute 
.RE
.PP

.SS "\fBCUresult\fP cuFuncSetAttribute (\fBCUfunction\fP hfunc, \fBCUfunction_attribute\fP attrib, int value)"
.PP
This call sets the value of the specified attribute \fCattrib\fP on the kernel given by \fChfunc\fP to the integer value specified by \fCvalue\fP. This function returns \fBCUDA_SUCCESS\fP if the new value of the attribute is set successfully. If the set fails, this call returns an error. Not all attributes can have values set. Attempting to set a value on a read-only attribute results in an error (\fBCUDA_ERROR_INVALID_VALUE\fP).
.PP
Supported attributes for the cuFuncSetAttribute call are:
.IP "\(bu" 2
\fBCU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES\fP: The maximum size in bytes of dynamically-allocated shared memory. The value should contain the requested maximum size of dynamically-allocated shared memory. The sum of this value and the function attribute \fBCU_FUNC_ATTRIBUTE_SHARED_SIZE_BYTES\fP cannot exceed the device attribute \fBCU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK_OPTIN\fP. The maximum size of requestable dynamic shared memory may differ by GPU architecture.
.IP "\(bu" 2
\fBCU_FUNC_ATTRIBUTE_PREFERRED_SHARED_MEMORY_CARVEOUT\fP: On devices where the L1 cache and shared memory use the same hardware resources, this sets the shared memory carveout preference, in percent of the total shared memory. See \fBCU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR\fP. This is only a hint, and the driver can choose a different ratio if required to execute the function.
.PP
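For example, a minimal sketch that opts a kernel into a larger dynamic shared-memory limit before launch; the handle \fChfunc\fP and the 64 KiB figure are illustrative assumptions:
.PP
.nf
    /* Request up to 64 KiB of dynamic shared memory for this kernel.
       The sum with the static shared-memory usage must stay within
       CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK_OPTIN.        */
    CUresult status = cuFuncSetAttribute(hfunc,
                          CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES,
                          64 * 1024);

.fi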
.PP
\fBParameters:\fP
.RS 4
\fIhfunc\fP - Function to set the attribute for 
.br
\fIattrib\fP - Attribute to set 
.br
\fIvalue\fP - The value to set
.RE
.PP
\fBReturns:\fP
.RS 4
\fBCUDA_SUCCESS\fP, \fBCUDA_ERROR_DEINITIALIZED\fP, \fBCUDA_ERROR_NOT_INITIALIZED\fP, \fBCUDA_ERROR_INVALID_CONTEXT\fP, \fBCUDA_ERROR_INVALID_HANDLE\fP, \fBCUDA_ERROR_INVALID_VALUE\fP 
.RE
.PP
\fBNote:\fP
.RS 4
Note that this function may also return error codes from previous, asynchronous launches.
.RE
.PP
\fBSee also:\fP
.RS 4
\fBcuCtxGetCacheConfig\fP, \fBcuCtxSetCacheConfig\fP, \fBcuFuncSetCacheConfig\fP, \fBcuLaunchKernel\fP, cudaFuncGetAttributes, cudaFuncSetAttribute 
.RE
.PP

.SS "\fBCUresult\fP cuFuncSetCacheConfig (\fBCUfunction\fP hfunc, \fBCUfunc_cache\fP config)"
.PP
On devices where the L1 cache and shared memory use the same hardware resources, this sets through \fCconfig\fP the preferred cache configuration for the device function \fChfunc\fP. This is only a preference. The driver will use the requested configuration if possible, but it is free to choose a different configuration if required to execute \fChfunc\fP. Any context-wide preference set via \fBcuCtxSetCacheConfig()\fP will be overridden by this per-function setting unless the per-function setting is \fBCU_FUNC_CACHE_PREFER_NONE\fP. In that case, the current context-wide setting will be used.
.PP
This setting does nothing on devices where the size of the L1 cache and shared memory are fixed.
.PP
Launching a kernel with a different preference than the most recent preference setting may insert a device-side synchronization point.
.PP
The supported cache configurations are:
.IP "\(bu" 2
\fBCU_FUNC_CACHE_PREFER_NONE\fP: no preference for shared memory or L1 (default)
.IP "\(bu" 2
\fBCU_FUNC_CACHE_PREFER_SHARED\fP: prefer larger shared memory and smaller L1 cache
.IP "\(bu" 2
\fBCU_FUNC_CACHE_PREFER_L1\fP: prefer larger L1 cache and smaller shared memory
.IP "\(bu" 2
\fBCU_FUNC_CACHE_PREFER_EQUAL\fP: prefer equal sized L1 cache and shared memory
.PP
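For example, a minimal sketch (with an assumed function handle \fChfunc\fP) that asks the driver to favor a larger L1 cache for this kernel:
.PP
.nf
    /* Preference only; the driver may pick another split if needed. */
    CUresult status = cuFuncSetCacheConfig(hfunc, CU_FUNC_CACHE_PREFER_L1);

.fi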
.PP
\fBParameters:\fP
.RS 4
\fIhfunc\fP - Kernel to configure cache for 
.br
\fIconfig\fP - Requested cache configuration
.RE
.PP
\fBReturns:\fP
.RS 4
\fBCUDA_SUCCESS\fP, \fBCUDA_ERROR_INVALID_VALUE\fP, \fBCUDA_ERROR_DEINITIALIZED\fP, \fBCUDA_ERROR_NOT_INITIALIZED\fP, \fBCUDA_ERROR_INVALID_CONTEXT\fP 
.RE
.PP
\fBNote:\fP
.RS 4
Note that this function may also return error codes from previous, asynchronous launches.
.RE
.PP
\fBSee also:\fP
.RS 4
\fBcuCtxGetCacheConfig\fP, \fBcuCtxSetCacheConfig\fP, \fBcuFuncGetAttribute\fP, \fBcuLaunchKernel\fP, cudaFuncSetCacheConfig 
.RE
.PP

.SS "\fBCUresult\fP cuFuncSetSharedMemConfig (\fBCUfunction\fP hfunc, \fBCUsharedconfig\fP config)"
.PP
On devices with configurable shared memory banks, this function will force all subsequent launches of the specified device function to have the given shared memory bank size configuration. On any given launch of the function, the shared memory configuration of the device will be temporarily changed if needed to suit the function's preferred configuration. Changes in shared memory configuration between subsequent launches of functions may introduce a device-side synchronization point.
.PP
Any per-function setting of shared memory bank size set via \fBcuFuncSetSharedMemConfig\fP will override the context wide setting set with \fBcuCtxSetSharedMemConfig\fP.
.PP
Changing the shared memory bank size will not increase shared memory usage or affect occupancy of kernels, but may have major effects on performance. Larger bank sizes will allow for greater potential bandwidth to shared memory, but will change what kinds of accesses to shared memory will result in bank conflicts.
.PP
This function will do nothing on devices with fixed shared memory bank size.
.PP
The supported bank configurations are:
.IP "\(bu" 2
\fBCU_SHARED_MEM_CONFIG_DEFAULT_BANK_SIZE\fP: use the context's shared memory configuration when launching this function.
.IP "\(bu" 2
\fBCU_SHARED_MEM_CONFIG_FOUR_BYTE_BANK_SIZE\fP: set shared memory bank width to be natively four bytes when launching this function.
.IP "\(bu" 2
\fBCU_SHARED_MEM_CONFIG_EIGHT_BYTE_BANK_SIZE\fP: set shared memory bank width to be natively eight bytes when launching this function.
.PP
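For example, a minimal sketch (with an assumed function handle \fChfunc\fP) that requests eight-byte banks for a kernel dominated by 64-bit shared-memory accesses:
.PP
.nf
    /* Eight-byte banks can reduce conflicts for 64-bit accesses;
       this overrides the context-wide setting for hfunc only.    */
    CUresult status = cuFuncSetSharedMemConfig(hfunc,
                          CU_SHARED_MEM_CONFIG_EIGHT_BYTE_BANK_SIZE);

.fi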
.PP
\fBParameters:\fP
.RS 4
\fIhfunc\fP - kernel to be given a shared memory config 
.br
\fIconfig\fP - requested shared memory configuration
.RE
.PP
\fBReturns:\fP
.RS 4
\fBCUDA_SUCCESS\fP, \fBCUDA_ERROR_INVALID_VALUE\fP, \fBCUDA_ERROR_DEINITIALIZED\fP, \fBCUDA_ERROR_NOT_INITIALIZED\fP, \fBCUDA_ERROR_INVALID_CONTEXT\fP 
.RE
.PP
\fBNote:\fP
.RS 4
Note that this function may also return error codes from previous, asynchronous launches.
.RE
.PP
\fBSee also:\fP
.RS 4
\fBcuCtxGetCacheConfig\fP, \fBcuCtxSetCacheConfig\fP, \fBcuCtxGetSharedMemConfig\fP, \fBcuCtxSetSharedMemConfig\fP, \fBcuFuncGetAttribute\fP, \fBcuLaunchKernel\fP, cudaFuncSetSharedMemConfig 
.RE
.PP

.SS "\fBCUresult\fP cuLaunchCooperativeKernel (\fBCUfunction\fP f, unsigned int gridDimX, unsigned int gridDimY, unsigned int gridDimZ, unsigned int blockDimX, unsigned int blockDimY, unsigned int blockDimZ, unsigned int sharedMemBytes, \fBCUstream\fP hStream, void ** kernelParams)"
.PP
Invokes the kernel \fCf\fP on a \fCgridDimX\fP x \fCgridDimY\fP x \fCgridDimZ\fP grid of blocks. Each block contains \fCblockDimX\fP x \fCblockDimY\fP x \fCblockDimZ\fP threads.
.PP
\fCsharedMemBytes\fP sets the amount of dynamic shared memory that will be available to each thread block.
.PP
The device on which this kernel is invoked must have a non-zero value for the device attribute \fBCU_DEVICE_ATTRIBUTE_COOPERATIVE_LAUNCH\fP.
.PP
The total number of blocks launched cannot exceed the maximum number of blocks per multiprocessor as returned by \fBcuOccupancyMaxActiveBlocksPerMultiprocessor\fP (or \fBcuOccupancyMaxActiveBlocksPerMultiprocessorWithFlags\fP) times the number of multiprocessors as specified by the device attribute \fBCU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT\fP.
.PP
The kernel cannot make use of CUDA dynamic parallelism.
.PP
Kernel parameters must be specified via \fCkernelParams\fP. If \fCf\fP has N parameters, then \fCkernelParams\fP needs to be an array of N pointers. Each of \fCkernelParams\fP[0] through \fCkernelParams\fP[N-1] must point to a region of memory from which the actual kernel parameter will be copied. The number of kernel parameters and their offsets and sizes do not need to be specified as that information is retrieved directly from the kernel's image.
.PP
Calling \fBcuLaunchCooperativeKernel()\fP sets persistent function state that is the same as function state set through the \fBcuLaunchKernel\fP API.
.PP
When the kernel \fCf\fP is launched via \fBcuLaunchCooperativeKernel()\fP, the previous block shape, shared size and parameter info associated with \fCf\fP is overwritten.
.PP
Note that to use \fBcuLaunchCooperativeKernel()\fP, the kernel \fCf\fP must either have been compiled with toolchain version 3.2 or later so that it will contain kernel parameter information, or have no kernel parameters. If either of these conditions is not met, then \fBcuLaunchCooperativeKernel()\fP will return \fBCUDA_ERROR_INVALID_IMAGE\fP.
.PP
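For example, a minimal sketch that sizes the grid from the occupancy limit described above; \fChfunc\fP, \fCdev\fP, \fCblockDim\fP, \fCsharedBytes\fP, \fChStream\fP and the kernel arguments are assumed to be set up elsewhere, and error checking is omitted:
.PP
.nf
    int numBlocksPerSm = 0, numSms = 0;
    cuOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSm, hfunc,
                                                blockDim, sharedBytes);
    cuDeviceGetAttribute(&numSms,
                         CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT, dev);
    /* One pointer per kernel parameter, in declaration order. */
    void *args[] = { &workBuffer, &workCount };   /* hypothetical arguments */
    CUresult status = cuLaunchCooperativeKernel(hfunc,
                          numBlocksPerSm * numSms, 1, 1,   /* grid  */
                          blockDim, 1, 1,                  /* block */
                          sharedBytes, hStream, args);

.fi
.PP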
\fBParameters:\fP
.RS 4
\fIf\fP - Kernel to launch 
.br
\fIgridDimX\fP - Width of grid in blocks 
.br
\fIgridDimY\fP - Height of grid in blocks 
.br
\fIgridDimZ\fP - Depth of grid in blocks 
.br
\fIblockDimX\fP - X dimension of each thread block 
.br
\fIblockDimY\fP - Y dimension of each thread block 
.br
\fIblockDimZ\fP - Z dimension of each thread block 
.br
\fIsharedMemBytes\fP - Dynamic shared-memory size per thread block in bytes 
.br
\fIhStream\fP - Stream identifier 
.br
\fIkernelParams\fP - Array of pointers to kernel parameters
.RE
.PP
\fBReturns:\fP
.RS 4
\fBCUDA_SUCCESS\fP, \fBCUDA_ERROR_DEINITIALIZED\fP, \fBCUDA_ERROR_NOT_INITIALIZED\fP, \fBCUDA_ERROR_INVALID_CONTEXT\fP, \fBCUDA_ERROR_INVALID_HANDLE\fP, \fBCUDA_ERROR_INVALID_IMAGE\fP, \fBCUDA_ERROR_INVALID_VALUE\fP, \fBCUDA_ERROR_LAUNCH_FAILED\fP, \fBCUDA_ERROR_LAUNCH_OUT_OF_RESOURCES\fP, \fBCUDA_ERROR_LAUNCH_TIMEOUT\fP, \fBCUDA_ERROR_LAUNCH_INCOMPATIBLE_TEXTURING\fP, \fBCUDA_ERROR_COOPERATIVE_LAUNCH_TOO_LARGE\fP, \fBCUDA_ERROR_SHARED_OBJECT_INIT_FAILED\fP 
.RE
.PP
\fBNote:\fP
.RS 4
This function uses standard default stream semantics. 
.PP
Note that this function may also return error codes from previous, asynchronous launches.
.RE
.PP
\fBSee also:\fP
.RS 4
\fBcuCtxGetCacheConfig\fP, \fBcuCtxSetCacheConfig\fP, \fBcuFuncSetCacheConfig\fP, \fBcuFuncGetAttribute\fP, \fBcuLaunchCooperativeKernelMultiDevice\fP, cudaLaunchCooperativeKernel 
.RE
.PP

.SS "\fBCUresult\fP cuLaunchCooperativeKernelMultiDevice (\fBCUDA_LAUNCH_PARAMS\fP * launchParamsList, unsigned int numDevices, unsigned int flags)"
.PP
Invokes kernels as specified in the \fClaunchParamsList\fP array where each element of the array specifies all the parameters required to perform a single kernel launch. These kernels can cooperate and synchronize as they execute. The size of the array is specified by \fCnumDevices\fP.
.PP
No two kernels can be launched on the same device. All the devices targeted by this multi-device launch must be identical. All devices must have a non-zero value for the device attribute \fBCU_DEVICE_ATTRIBUTE_COOPERATIVE_MULTI_DEVICE_LAUNCH\fP.
.PP
All kernels launched must be identical with respect to the compiled code. Note that any __device__, __constant__ or __managed__ variables present in the module that owns the kernel launched on each device are independently instantiated on every device. It is the application's responsibility to ensure these variables are initialized and used appropriately.
.PP
The size of the grids as specified in blocks, the size of the blocks themselves and the amount of shared memory used by each thread block must also match across all launched kernels.
.PP
The streams used to launch these kernels must have been created via either \fBcuStreamCreate\fP or \fBcuStreamCreateWithPriority\fP. The NULL stream or \fBCU_STREAM_LEGACY\fP or \fBCU_STREAM_PER_THREAD\fP cannot be used.
.PP
The total number of blocks launched per kernel cannot exceed the maximum number of blocks per multiprocessor as returned by \fBcuOccupancyMaxActiveBlocksPerMultiprocessor\fP (or \fBcuOccupancyMaxActiveBlocksPerMultiprocessorWithFlags\fP) times the number of multiprocessors as specified by the device attribute \fBCU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT\fP. Since the total number of blocks launched per device has to match across all devices, the maximum number of blocks that can be launched per device will be limited by the device with the least number of multiprocessors.
.PP
The kernels cannot make use of CUDA dynamic parallelism.
.PP
The \fBCUDA_LAUNCH_PARAMS\fP structure is defined as: 
.PP
.nf
        typedef struct CUDA_LAUNCH_PARAMS_st
        {
            CUfunction function;
            unsigned int gridDimX;
            unsigned int gridDimY;
            unsigned int gridDimZ;
            unsigned int blockDimX;
            unsigned int blockDimY;
            unsigned int blockDimZ;
            unsigned int sharedMemBytes;
            CUstream hStream;
            void **kernelParams;
        } CUDA_LAUNCH_PARAMS;

.fi
.PP
 where:
.IP "\(bu" 2
\fBCUDA_LAUNCH_PARAMS::function\fP specifies the kernel to be launched. All functions must be identical with respect to the compiled code.
.IP "\(bu" 2
\fBCUDA_LAUNCH_PARAMS::gridDimX\fP is the width of the grid in blocks. This must match across all kernels launched.
.IP "\(bu" 2
\fBCUDA_LAUNCH_PARAMS::gridDimY\fP is the height of the grid in blocks. This must match across all kernels launched.
.IP "\(bu" 2
\fBCUDA_LAUNCH_PARAMS::gridDimZ\fP is the depth of the grid in blocks. This must match across all kernels launched.
.IP "\(bu" 2
\fBCUDA_LAUNCH_PARAMS::blockDimX\fP is the X dimension of each thread block. This must match across all kernels launched.
.IP "\(bu" 2
\fBCUDA_LAUNCH_PARAMS::blockDimY\fP is the Y dimension of each thread block. This must match across all kernels launched.
.IP "\(bu" 2
\fBCUDA_LAUNCH_PARAMS::blockDimZ\fP is the Z dimension of each thread block. This must match across all kernels launched.
.IP "\(bu" 2
\fBCUDA_LAUNCH_PARAMS::sharedMemBytes\fP is the dynamic shared-memory size per thread block in bytes. This must match across all kernels launched.
.IP "\(bu" 2
\fBCUDA_LAUNCH_PARAMS::hStream\fP is the handle to the stream to perform the launch in. This cannot be the NULL stream or \fBCU_STREAM_LEGACY\fP or \fBCU_STREAM_PER_THREAD\fP. The CUDA context associated with this stream must match that associated with \fBCUDA_LAUNCH_PARAMS::function\fP.
.IP "\(bu" 2
\fBCUDA_LAUNCH_PARAMS::kernelParams\fP is an array of pointers to kernel parameters. If \fBCUDA_LAUNCH_PARAMS::function\fP has N parameters, then \fBCUDA_LAUNCH_PARAMS::kernelParams\fP needs to be an array of N pointers. Each of \fBCUDA_LAUNCH_PARAMS::kernelParams\fP[0] through \fBCUDA_LAUNCH_PARAMS::kernelParams\fP[N-1] must point to a region of memory from which the actual kernel parameter will be copied. The number of kernel parameters and their offsets and sizes do not need to be specified as that information is retrieved directly from the kernel's image.
.PP
.PP
By default, the kernel won't begin execution on any GPU until all prior work in all the specified streams has completed. This behavior can be overridden by specifying the flag \fBCUDA_COOPERATIVE_LAUNCH_MULTI_DEVICE_NO_PRE_LAUNCH_SYNC\fP. When this flag is specified, each kernel will only wait for prior work in the stream corresponding to that GPU to complete before it begins execution.
.PP
Similarly, by default, any subsequent work pushed in any of the specified streams will not begin execution until the kernels on all GPUs have completed. This behavior can be overridden by specifying the flag \fBCUDA_COOPERATIVE_LAUNCH_MULTI_DEVICE_NO_POST_LAUNCH_SYNC\fP. When this flag is specified, any subsequent work pushed in any of the specified streams will only wait for the kernel launched on the GPU corresponding to that stream to complete before it begins execution.
.PP
Calling \fBcuLaunchCooperativeKernelMultiDevice()\fP sets persistent function state that is the same as function state set through the \fBcuLaunchKernel\fP API when called individually for each element in \fClaunchParamsList\fP.
.PP
When kernels are launched via \fBcuLaunchCooperativeKernelMultiDevice()\fP, the previous block shape, shared size and parameter info associated with each \fBCUDA_LAUNCH_PARAMS::function\fP in \fClaunchParamsList\fP is overwritten.
.PP
Note that to use \fBcuLaunchCooperativeKernelMultiDevice()\fP, the kernels must either have been compiled with toolchain version 3.2 or later so that they contain kernel parameter information, or have no kernel parameters. If either of these conditions is not met, then \fBcuLaunchCooperativeKernelMultiDevice()\fP will return \fBCUDA_ERROR_INVALID_IMAGE\fP.
.PP
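For example, a minimal sketch that fills \fClaunchParamsList\fP for two devices; the per-device function handles \fCf[i]\fP, streams \fCstreams[i]\fP and argument arrays \fCargs[i]\fP are assumed to be set up elsewhere:
.PP
.nf
    CUDA_LAUNCH_PARAMS params[2];
    for (int i = 0; i < 2; ++i) {
        params[i].function       = f[i];        /* same compiled kernel, per-device module */
        params[i].gridDimX       = gridDim;     /* grid and block sizes must match */
        params[i].gridDimY       = 1;
        params[i].gridDimZ       = 1;
        params[i].blockDimX      = blockDim;
        params[i].blockDimY      = 1;
        params[i].blockDimZ      = 1;
        params[i].sharedMemBytes = sharedBytes; /* must also match across devices */
        params[i].hStream        = streams[i];  /* non-NULL stream on device i */
        params[i].kernelParams   = args[i];
    }
    CUresult status = cuLaunchCooperativeKernelMultiDevice(params, 2, 0);

.fi
.PP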
\fBParameters:\fP
.RS 4
\fIlaunchParamsList\fP - List of launch parameters, one per device 
.br
\fInumDevices\fP - Size of the \fClaunchParamsList\fP array 
.br
\fIflags\fP - Flags to control launch behavior
.RE
.PP
\fBReturns:\fP
.RS 4
\fBCUDA_SUCCESS\fP, \fBCUDA_ERROR_DEINITIALIZED\fP, \fBCUDA_ERROR_NOT_INITIALIZED\fP, \fBCUDA_ERROR_INVALID_CONTEXT\fP, \fBCUDA_ERROR_INVALID_HANDLE\fP, \fBCUDA_ERROR_INVALID_IMAGE\fP, \fBCUDA_ERROR_INVALID_VALUE\fP, \fBCUDA_ERROR_LAUNCH_FAILED\fP, \fBCUDA_ERROR_LAUNCH_OUT_OF_RESOURCES\fP, \fBCUDA_ERROR_LAUNCH_TIMEOUT\fP, \fBCUDA_ERROR_LAUNCH_INCOMPATIBLE_TEXTURING\fP, \fBCUDA_ERROR_COOPERATIVE_LAUNCH_TOO_LARGE\fP, \fBCUDA_ERROR_SHARED_OBJECT_INIT_FAILED\fP 
.RE
.PP
\fBNote:\fP
.RS 4
This function uses standard default stream semantics. 
.PP
Note that this function may also return error codes from previous, asynchronous launches.
.RE
.PP
\fBSee also:\fP
.RS 4
\fBcuCtxGetCacheConfig\fP, \fBcuCtxSetCacheConfig\fP, \fBcuFuncSetCacheConfig\fP, \fBcuFuncGetAttribute\fP, \fBcuLaunchCooperativeKernel\fP, cudaLaunchCooperativeKernelMultiDevice 
.RE
.PP

.SS "\fBCUresult\fP cuLaunchHostFunc (\fBCUstream\fP hStream, \fBCUhostFn\fP fn, void * userData)"
.PP
Enqueues a host function to run in a stream. The function will be called after currently enqueued work and will block work added after it.
.PP
The host function must not make any CUDA API calls. Attempting to use a CUDA API may result in \fBCUDA_ERROR_NOT_PERMITTED\fP, but this is not required. The host function must not perform any synchronization that may depend on outstanding CUDA work not mandated to run earlier. Host functions without a mandated order (such as in independent streams) execute in undefined order and may be serialized.
.PP
For the purposes of Unified Memory, execution makes a number of guarantees: 
.PD 0

.IP "\(bu" 2
The stream is considered idle for the duration of the function's execution. Thus, for example, the function may always use memory attached to the stream it was enqueued in. 
.IP "\(bu" 2
The start of execution of the function has the same effect as synchronizing an event recorded in the same stream immediately prior to the function. It thus synchronizes streams which have been 'joined' prior to the function. 
.IP "\(bu" 2
Adding device work to any stream does not have the effect of making the stream active until all preceding host functions and stream callbacks have executed. Thus, for example, a function might use global attached memory even if work has been added to another stream, if the work has been ordered behind the function call with an event. 
.IP "\(bu" 2
Completion of the function does not cause a stream to become active except as described above. The stream will remain idle if no device work follows the function, and will remain idle across consecutive host functions or stream callbacks without device work in between. Thus, for example, stream synchronization can be done by signaling from a host function at the end of the stream. 
.PP
.PP
Note that, in contrast to \fBcuStreamAddCallback\fP, the function will not be called in the event of an error in the CUDA context.
.PP
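For example, a minimal sketch that enqueues a host function which sets a hypothetical completion flag (\fChStream\fP is assumed to exist):
.PP
.nf
    /* Host callback matching CUhostFn; it must not call into the CUDA API. */
    static void CUDA_CB markDone(void *userData)
    {
        *(int *)userData = 1;          /* hypothetical completion flag */
    }

    /* ... later, while building the stream's work ... */
    static int done = 0;
    CUresult status = cuLaunchHostFunc(hStream, markDone, &done);

.fi
.PP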
\fBParameters:\fP
.RS 4
\fIhStream\fP - Stream to enqueue function call in 
.br
\fIfn\fP - The function to call once preceding stream operations are complete 
.br
\fIuserData\fP - User-specified data to be passed to the function
.RE
.PP
\fBReturns:\fP
.RS 4
\fBCUDA_SUCCESS\fP, \fBCUDA_ERROR_DEINITIALIZED\fP, \fBCUDA_ERROR_NOT_INITIALIZED\fP, \fBCUDA_ERROR_INVALID_CONTEXT\fP, \fBCUDA_ERROR_INVALID_HANDLE\fP, \fBCUDA_ERROR_NOT_SUPPORTED\fP 
.RE
.PP
\fBNote:\fP
.RS 4
This function uses standard default stream semantics. 
.PP
Note that this function may also return error codes from previous, asynchronous launches.
.RE
.PP
\fBSee also:\fP
.RS 4
\fBcuStreamCreate\fP, \fBcuStreamQuery\fP, \fBcuStreamSynchronize\fP, \fBcuStreamWaitEvent\fP, \fBcuStreamDestroy\fP, \fBcuMemAllocManaged\fP, \fBcuStreamAttachMemAsync\fP, \fBcuStreamAddCallback\fP 
.RE
.PP

.SS "\fBCUresult\fP cuLaunchKernel (\fBCUfunction\fP f, unsigned int gridDimX, unsigned int gridDimY, unsigned int gridDimZ, unsigned int blockDimX, unsigned int blockDimY, unsigned int blockDimZ, unsigned int sharedMemBytes, \fBCUstream\fP hStream, void ** kernelParams, void ** extra)"
.PP
Invokes the kernel \fCf\fP on a \fCgridDimX\fP x \fCgridDimY\fP x \fCgridDimZ\fP grid of blocks. Each block contains \fCblockDimX\fP x \fCblockDimY\fP x \fCblockDimZ\fP threads.
.PP
\fCsharedMemBytes\fP sets the amount of dynamic shared memory that will be available to each thread block.
.PP
Kernel parameters to \fCf\fP can be specified in one of two ways:
.PP
1) Kernel parameters can be specified via \fCkernelParams\fP. If \fCf\fP has N parameters, then \fCkernelParams\fP needs to be an array of N pointers. Each of \fCkernelParams\fP[0] through \fCkernelParams\fP[N-1] must point to a region of memory from which the actual kernel parameter will be copied. The number of kernel parameters and their offsets and sizes do not need to be specified as that information is retrieved directly from the kernel's image.
.PP
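For example, a minimal sketch of the \fCkernelParams\fP form, assuming \fCf\fP takes a device pointer and an element count (hypothetical arguments; error checking omitted):
.PP
.nf
    CUdeviceptr devPtr;
    int n = 1024;
    cuMemAlloc(&devPtr, n * sizeof(float));    /* hypothetical device buffer */
    /* One pointer per kernel parameter, in declaration order. */
    void *args[] = { &devPtr, &n };
    status = cuLaunchKernel(f, gridDimX, 1, 1,     /* grid  */
                            blockDimX, 1, 1,       /* block */
                            0, hStream,            /* dynamic shared mem, stream */
                            args, NULL);           /* kernelParams; extra unused */

.fi
.PP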
2) Kernel parameters can also be packaged by the application into a single buffer that is passed in via the \fCextra\fP parameter. This places the burden on the application of knowing each kernel parameter's size and alignment/padding within the buffer. Here is an example of using the \fCextra\fP parameter in this manner: 
.PP
.nf
    size_t argBufferSize;
    char argBuffer[256];

    // populate argBuffer and argBufferSize

    void *config[] = {
        CU_LAUNCH_PARAM_BUFFER_POINTER, argBuffer,
        CU_LAUNCH_PARAM_BUFFER_SIZE,    &argBufferSize,
        CU_LAUNCH_PARAM_END
    };
    status = cuLaunchKernel(f, gx, gy, gz, bx, by, bz, sh, s, NULL, config);

.fi
.PP
.PP
The \fCextra\fP parameter exists to allow \fBcuLaunchKernel\fP to take additional less commonly used arguments. \fCextra\fP specifies a list of names of extra settings and their corresponding values. Each extra setting name is immediately followed by the corresponding value. The list must be terminated with either NULL or \fBCU_LAUNCH_PARAM_END\fP.
.PP
.IP "\(bu" 2
\fBCU_LAUNCH_PARAM_END\fP, which indicates the end of the \fCextra\fP array;
.IP "\(bu" 2
\fBCU_LAUNCH_PARAM_BUFFER_POINTER\fP, which specifies that the next value in \fCextra\fP will be a pointer to a buffer containing all the kernel parameters for launching kernel \fCf\fP;
.IP "\(bu" 2
\fBCU_LAUNCH_PARAM_BUFFER_SIZE\fP, which specifies that the next value in \fCextra\fP will be a pointer to a size_t containing the size of the buffer specified with \fBCU_LAUNCH_PARAM_BUFFER_POINTER\fP;
.PP
.PP
The error \fBCUDA_ERROR_INVALID_VALUE\fP will be returned if kernel parameters are specified with both \fCkernelParams\fP and \fCextra\fP (i.e. both \fCkernelParams\fP and \fCextra\fP are non-NULL).
.PP
Calling \fBcuLaunchKernel()\fP sets persistent function state that is the same as function state set through the following deprecated APIs: \fBcuFuncSetBlockShape()\fP, \fBcuFuncSetSharedSize()\fP, \fBcuParamSetSize()\fP, \fBcuParamSeti()\fP, \fBcuParamSetf()\fP, \fBcuParamSetv()\fP.
.PP
When the kernel \fCf\fP is launched via \fBcuLaunchKernel()\fP, the previous block shape, shared size and parameter info associated with \fCf\fP is overwritten.
.PP
Note that to use \fBcuLaunchKernel()\fP, the kernel \fCf\fP must either have been compiled with toolchain version 3.2 or later so that it will contain kernel parameter information, or have no kernel parameters. If either of these conditions is not met, then \fBcuLaunchKernel()\fP will return \fBCUDA_ERROR_INVALID_IMAGE\fP.
.PP
\fBParameters:\fP
.RS 4
\fIf\fP - Kernel to launch 
.br
\fIgridDimX\fP - Width of grid in blocks 
.br
\fIgridDimY\fP - Height of grid in blocks 
.br
\fIgridDimZ\fP - Depth of grid in blocks 
.br
\fIblockDimX\fP - X dimension of each thread block 
.br
\fIblockDimY\fP - Y dimension of each thread block 
.br
\fIblockDimZ\fP - Z dimension of each thread block 
.br
\fIsharedMemBytes\fP - Dynamic shared-memory size per thread block in bytes 
.br
\fIhStream\fP - Stream identifier 
.br
\fIkernelParams\fP - Array of pointers to kernel parameters 
.br
\fIextra\fP - Extra options
.RE
.PP
\fBReturns:\fP
.RS 4
\fBCUDA_SUCCESS\fP, \fBCUDA_ERROR_DEINITIALIZED\fP, \fBCUDA_ERROR_NOT_INITIALIZED\fP, \fBCUDA_ERROR_INVALID_CONTEXT\fP, \fBCUDA_ERROR_INVALID_HANDLE\fP, \fBCUDA_ERROR_INVALID_IMAGE\fP, \fBCUDA_ERROR_INVALID_VALUE\fP, \fBCUDA_ERROR_LAUNCH_FAILED\fP, \fBCUDA_ERROR_LAUNCH_OUT_OF_RESOURCES\fP, \fBCUDA_ERROR_LAUNCH_TIMEOUT\fP, \fBCUDA_ERROR_LAUNCH_INCOMPATIBLE_TEXTURING\fP, \fBCUDA_ERROR_SHARED_OBJECT_INIT_FAILED\fP 
.RE
.PP
\fBNote:\fP
.RS 4
This function uses standard default stream semantics. 
.PP
Note that this function may also return error codes from previous, asynchronous launches.
.RE
.PP
\fBSee also:\fP
.RS 4
\fBcuCtxGetCacheConfig\fP, \fBcuCtxSetCacheConfig\fP, \fBcuFuncSetCacheConfig\fP, \fBcuFuncGetAttribute\fP, cudaLaunchKernel 
.RE
.PP

.SH "Author"
.PP 
Generated automatically by Doxygen from the source code.