15 changes: 10 additions & 5 deletions docs/src/api/kernel.md
Original file line number Diff line number Diff line change
@@ -1,14 +1,20 @@
# [Kernel programming](@id KernelAPI)

This section lists the package's public functionality that corresponds to special CUDA
functions for use in device code. It is loosely organized according to the [C language
extensions](http://docs.nvidia.com/cuda/cuda-c-programming-guide/#c-language-extensions)
appendix from the CUDA C programming guide. For more information about certain intrinsics,
refer to the aforementioned NVIDIA documentation.
functions for use in device code. It is loosely organized according to the [C/C++ language
extensions](https://docs.nvidia.com/cuda/cuda-programming-guide/05-appendices/cpp-language-extensions.html)
appendix from the [CUDA programming
guide](https://docs.nvidia.com/cuda/cuda-programming-guide/). For more information about
certain intrinsics, refer to the aforementioned NVIDIA documentation.


## Indexing and dimensions

!!! note "Differences with corresponding C/C++ indexing variables"
    The indexing functions [`blockIdx`](@ref) and [`threadIdx`](@ref) are 1-based,
    unlike the corresponding variables in the C/C++ extensions, which are 0-based.
    Be careful when literally porting code from C/C++.
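As an illustrative sketch (not part of the original note), the common global-index idiom in CUDA.jl stays fully 1-based; the `- 1` offset on `blockIdx` is exactly what a literal port from 0-based C/C++ code would get wrong:

```julia
using CUDA

# Each thread writes its 1-based global linear index. The `- 1` offset on
# blockIdx compensates for Julia's 1-based indices; in C/C++ the equivalent
# expression is blockIdx.x * blockDim.x + threadIdx.x with no offset.
function index_kernel(a)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(a)
        @inbounds a[i] = i
    end
    return
end

a = CUDA.zeros(Int32, 1000)
@cuda threads=256 blocks=cld(length(a), 256) index_kernel(a)
# a now holds 1, 2, ..., 1000
```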

```@docs
gridDim
blockIdx
@@ -19,7 +25,6 @@ laneid
active_mask
```


## Device arrays

CUDA.jl provides a primitive, lightweight array type to manage GPU data organized in an
40 changes: 27 additions & 13 deletions docs/src/development/profiling.md
Expand Up @@ -162,12 +162,13 @@ without having to restart Julia.

#### NVIDIA Nsight Systems

Generally speaking, the first external profiler you should use is Nsight Systems, as it will
give you a high-level overview of your application's performance characteristics. After
downloading and installing the tool (a version might have been installed alongside with the
CUDA toolkit, but it is recommended to download and install the latest version from the
NVIDIA website), you need to launch Julia from the command-line, wrapped by the `nsys`
utility from Nsight Systems:
Generally speaking, the first external profiler you should use is [Nsight
Systems](https://developer.nvidia.com/nsight-systems), as it will give you a
high-level overview of your application's performance characteristics. After
downloading and installing the tool (a version might have been installed
alongside the CUDA toolkit, but it is recommended to download and install
the latest version from the NVIDIA website), you need to launch Julia from the
command-line, wrapped by the `nsys` utility from Nsight Systems:

```
$ nsys launch julia
@@ -214,10 +215,12 @@ You can open the resulting `.qdrep` file with `nsight-sys`:

#### NVIDIA Nsight Compute

If you want details on the execution properties of a single kernel, or inspect API
interactions in detail, Nsight Compute is the tool for you. It is again possible to use this
profiler with an interactive session of Julia, and debug or profile only those sections of
your application that are marked with `CUDA.@profile`.
If you want details on the execution properties of a single kernel, or inspect
API interactions in detail, [Nsight
Compute](https://developer.nvidia.com/nsight-compute) is the tool for you. It is
again possible to use this profiler with an interactive session of Julia, and
debug or profile only those sections of your application that are marked with
`CUDA.@profile`.
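A typical invocation (a sketch; flags may differ between Nsight Compute versions) launches Julia suspended under the profiler, after which you attach to the process from the `ncu-ui` graphical interface:

```shell
# launch Julia under Nsight Compute; execution is suspended until a
# profiler front-end (e.g. ncu-ui) attaches to the process
ncu --mode=launch julia
```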

First, ensure that all (CUDA) packages that are involved in your application have been
precompiled. Otherwise, you'll end up profiling the precompilation process, instead of
@@ -360,11 +363,22 @@ Incompatibility between the ncu and Julia CUDA version. Run `ncu --version` to f

##### "Profiling is not supported on this device" error

Nsight Compute does not support the GPU you have; run `ncu --list-chips` to verify. Either point the `CUDA_PATH` environment variable at an older CUDA Toolkit installation whose Nsight Compute still supports your GPU, or install a newer version of Nsight Compute.

##### ==ERROR== ERR_NVGPUCTRPERM
Run the terminal as administrator. Refer to [the NVIDIA documentation issue webpage](https://developer.nvidia.com/ERR_NVGPUCTRPERM) for more details.
##### `==ERROR== ERR_NVGPUCTRPERM`

Run the terminal as administrator. Refer to [the NVIDIA documentation issue webpage](https://developer.nvidia.com/ERR_NVGPUCTRPERM) for more details.
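On Linux, the NVIDIA page above suggests granting all users access to the GPU performance counters instead of running as root; a sketch (requires root, and a driver reload or reboot to take effect):

```shell
# allow non-admin users to access GPU performance counters
echo 'options nvidia NVreg_RestrictProfilingToAdminUsers=0' | \
    sudo tee /etc/modprobe.d/nvidia-profiling.conf
```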

##### `==ERROR== Cuda driver is not compatible with Nsight Compute`

As the error message suggests, the version of Nsight Compute you are using may be too new for the driver installed on the machine where you are trying to run the GPU application.
If possible, and if your GPU supports it, upgrade the CUDA driver on the target machine; otherwise, downgrade Nsight Compute to a version that supports the CUDA driver installed on the system.

##### `CUDA error: device kernel image is invalid (code 200, ERROR_INVALID_IMAGE)`

First of all, make sure your application works normally outside the profiler.
If you only get this error while running Julia under Nsight Compute, the problem may be a mismatch between the CUDA toolkit used by the profiler and the one loaded by Julia.
In that case, try pointing CUDA.jl to a [local CUDA toolkit](@ref "Using a local CUDA").
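Switching to a local toolkit can be done through CUDA.jl's preferences API; a sketch (assuming a recent CUDA.jl version that provides `CUDA.set_runtime_version!`):

```julia
using CUDA

# prefer the locally-installed CUDA toolkit over the artifact-provided one;
# this writes a preference and takes effect after restarting Julia
CUDA.set_runtime_version!(local_toolkit=true)
```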

## Source-code annotations

Expand Down
14 changes: 10 additions & 4 deletions src/device/intrinsics/indexing.jl
@@ -66,42 +66,48 @@ end
"""
gridDim()::NamedTuple

Returns the dimensions of the grid.
Returns the dimensions of the grid as a `NamedTuple` with keys `x`, `y`, and `z`.
These dimensions are counts, not indices, so they are identical to the `gridDim` built-in variable in the C/C++ extension.
"""
@inline gridDim() = (x=gridDim_x(), y=gridDim_y(), z=gridDim_z())

"""
blockIdx()::NamedTuple

Returns the block index within the grid.
Returns the block index within the grid as a `NamedTuple` with keys `x`, `y`, and `z`.
These indices are 1-based, unlike the `blockIdx` built-in variable in the C/C++ extension which is 0-based.
"""
@inline blockIdx() = (x=blockIdx_x(), y=blockIdx_y(), z=blockIdx_z())

"""
blockDim()::NamedTuple

Returns the dimensions of the block.
Returns the dimensions of the block as a `NamedTuple` with keys `x`, `y`, and `z`.
These dimensions are counts, not indices, so they are identical to the `blockDim` built-in variable in the C/C++ extension.
"""
@inline blockDim() = (x=blockDim_x(), y=blockDim_y(), z=blockDim_z())

"""
threadIdx()::NamedTuple

Returns the thread index within the block.
Returns the thread index within the block as a `NamedTuple` with keys `x`, `y`, and `z`.
These indices are 1-based, unlike the `threadIdx` built-in variable in the C/C++ extension which is 0-based.
"""
@inline threadIdx() = (x=threadIdx_x(), y=threadIdx_y(), z=threadIdx_z())

"""
warpsize()::Int32

Returns the warp size (in threads).
This corresponds to the `warpSize` built-in variable in the C/C++ extension.
"""
@inline warpsize() = ccall("llvm.nvvm.read.ptx.sreg.warpsize", llvmcall, Int32, ())

"""
laneid()::Int32

Returns the thread's lane within the warp.
This ID is 1-based.
"""
@inline laneid() = ccall("llvm.nvvm.read.ptx.sreg.laneid", llvmcall, Int32, ()) + 1i32
