15 changes: 10 additions & 5 deletions docs/src/api/kernel.md
Original file line number Diff line number Diff line change
@@ -1,14 +1,20 @@
# [Kernel programming](@id KernelAPI)

This section lists the package's public functionality that corresponds to special CUDA
functions for use in device code. It is loosely organized according to the [C language
extensions](http://docs.nvidia.com/cuda/cuda-c-programming-guide/#c-language-extensions)
appendix from the CUDA C programming guide. For more information about certain intrinsics,
refer to the aforementioned NVIDIA documentation.
functions for use in device code. It is loosely organized according to the [C/C++ language
extensions](https://docs.nvidia.com/cuda/cuda-programming-guide/05-appendices/cpp-language-extensions.html)
appendix from the [CUDA programming
guide](https://docs.nvidia.com/cuda/cuda-programming-guide/). For more information about
certain intrinsics, refer to the aforementioned NVIDIA documentation.


## Indexing and dimensions

!!! note "Differences with corresponding C/C++ indexing variables"
    The indexing functions [`blockIdx`](@ref) and [`threadIdx`](@ref) are 1-based,
    unlike the corresponding variables in the C/C++ extensions, which are 0-based.
    Be careful when literally porting code from C/C++.
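As an illustrative sketch (not part of the original note), the common global-index idiom in CUDA.jl stays fully 1-based; the `- 1` offset on `blockIdx` is exactly what a literal port from 0-based C/C++ code would get wrong:

```julia
using CUDA

# Each thread writes its 1-based global linear index. The `- 1` offset on
# blockIdx compensates for Julia's 1-based indices; in C/C++ the equivalent
# expression is blockIdx.x * blockDim.x + threadIdx.x with no offset.
function index_kernel(a)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(a)
        @inbounds a[i] = i
    end
    return
end

a = CUDA.zeros(Int32, 1000)
@cuda threads=256 blocks=cld(length(a), 256) index_kernel(a)
# a now holds 1, 2, ..., 1000
```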

```@docs
gridDim
blockIdx
@@ -19,7 +25,6 @@ laneid
active_mask
```


## Device arrays

CUDA.jl provides a primitive, lightweight array type to manage GPU data organized in an
40 changes: 27 additions & 13 deletions docs/src/development/profiling.md
Expand Up @@ -162,12 +162,13 @@ without having to restart Julia.

#### NVIDIA Nsight Systems

Generally speaking, the first external profiler you should use is Nsight Systems, as it will
give you a high-level overview of your application's performance characteristics. After
downloading and installing the tool (a version might have been installed alongside with the
CUDA toolkit, but it is recommended to download and install the latest version from the
NVIDIA website), you need to launch Julia from the command-line, wrapped by the `nsys`
utility from Nsight Systems:
Generally speaking, the first external profiler you should use is [Nsight
Systems](https://developer.nvidia.com/nsight-systems), as it will give you a
high-level overview of your application's performance characteristics. After
downloading and installing the tool (a version might have been installed
alongside the CUDA toolkit, but it is recommended to download and install
the latest version from the NVIDIA website), you need to launch Julia from the
command-line, wrapped by the `nsys` utility from Nsight Systems:

```
$ nsys launch julia
@@ -214,10 +215,12 @@ You can open the resulting `.qdrep` file with `nsight-sys`:

#### NVIDIA Nsight Compute

If you want details on the execution properties of a single kernel, or inspect API
interactions in detail, Nsight Compute is the tool for you. It is again possible to use this
profiler with an interactive session of Julia, and debug or profile only those sections of
your application that are marked with `CUDA.@profile`.
If you want details on the execution properties of a single kernel, or inspect
API interactions in detail, [Nsight
Compute](https://developer.nvidia.com/nsight-compute) is the tool for you. It is
again possible to use this profiler with an interactive session of Julia, and
debug or profile only those sections of your application that are marked with
`CUDA.@profile`.
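A typical invocation (a sketch; flags may differ between Nsight Compute versions) launches Julia suspended under the profiler, after which you attach to the process from the `ncu-ui` graphical interface:

```shell
# launch Julia under Nsight Compute; execution is suspended until a
# profiler front-end (e.g. ncu-ui) attaches to the process
ncu --mode=launch julia
```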

First, ensure that all (CUDA) packages that are involved in your application have been
precompiled. Otherwise, you'll end up profiling the precompilation process, instead of
@@ -360,11 +363,22 @@ Incompatibility between the ncu and Julia CUDA version. Run `ncu --version` to f

##### "Profiling is not supported on this device" error

Nsight Compute does not support the GPU you have; run `ncu --list-chips` to verify. Either point the `CUDA_PATH` environment variable at an older CUDA Toolkit installation whose Nsight Compute still supports your GPU, or install a newer version of Nsight Compute.

##### ==ERROR== ERR_NVGPUCTRPERM
Run the terminal as administrator. Refer to [the NVIDIA documentation issue webpage](https://developer.nvidia.com/ERR_NVGPUCTRPERM) for more details.
##### `==ERROR== ERR_NVGPUCTRPERM`

Run the terminal as administrator. Refer to [the NVIDIA documentation issue webpage](https://developer.nvidia.com/ERR_NVGPUCTRPERM) for more details.
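On Linux, the NVIDIA page above suggests granting all users access to the GPU performance counters instead of running as root; a sketch (requires root, and a driver reload or reboot to take effect):

```shell
# allow non-admin users to access GPU performance counters
echo 'options nvidia NVreg_RestrictProfilingToAdminUsers=0' | \
    sudo tee /etc/modprobe.d/nvidia-profiling.conf
```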

##### `==ERROR== Cuda driver is not compatible with Nsight Compute`

As the error message suggests, the version of Nsight Compute you are using may be too new for the driver installed on the machine where you are trying to run the GPU application.
If possible, and if your GPU supports it, upgrade the CUDA driver on the target machine; otherwise, downgrade Nsight Compute to a version that supports the CUDA driver installed on the system.

##### `CUDA error: device kernel image is invalid (code 200, ERROR_INVALID_IMAGE)`

First of all, make sure your application works normally outside the profiler.
If you only get this error while running Julia under Nsight Compute, the problem may be a mismatch between the CUDA toolkit used by the profiler and the one loaded by Julia.
In that case, try pointing CUDA.jl to a [local CUDA toolkit](@ref "Using a local CUDA").
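Switching to a local toolkit can be done through CUDA.jl's preferences API; a sketch (assuming a recent CUDA.jl version that provides `CUDA.set_runtime_version!`):

```julia
using CUDA

# prefer the locally-installed CUDA toolkit over the artifact-provided one;
# this writes a preference and takes effect after restarting Julia
CUDA.set_runtime_version!(local_toolkit=true)
```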

## Source-code annotations

Expand Down
14 changes: 10 additions & 4 deletions src/device/intrinsics/indexing.jl
@@ -66,42 +66,48 @@ end
"""
gridDim()::NamedTuple

Returns the dimensions of the grid.
Returns the dimensions of the grid as a `NamedTuple` with keys `x`, `y`, and `z`.
These dimensions are counts, not indices, so they are identical to the `gridDim` built-in variable in the C/C++ extension.
"""
@inline gridDim() = (x=gridDim_x(), y=gridDim_y(), z=gridDim_z())

"""
blockIdx()::NamedTuple

Returns the block index within the grid.
Returns the block index within the grid as a `NamedTuple` with keys `x`, `y`, and `z`.
These indices are 1-based, unlike the `blockIdx` built-in variable in the C/C++ extension which is 0-based.
"""
@inline blockIdx() = (x=blockIdx_x(), y=blockIdx_y(), z=blockIdx_z())

"""
blockDim()::NamedTuple

Returns the dimensions of the block.
Returns the dimensions of the block as a `NamedTuple` with keys `x`, `y`, and `z`.
These dimensions are counts, not indices, so they are identical to the `blockDim` built-in variable in the C/C++ extension.
"""
@inline blockDim() = (x=blockDim_x(), y=blockDim_y(), z=blockDim_z())

"""
threadIdx()::NamedTuple

Returns the thread index within the block.
Returns the thread index within the block as a `NamedTuple` with keys `x`, `y`, and `z`.
These indices are 1-based, unlike the `threadIdx` built-in variable in the C/C++ extension which is 0-based.
"""
@inline threadIdx() = (x=threadIdx_x(), y=threadIdx_y(), z=threadIdx_z())

"""
warpsize()::Int32

Returns the warp size (in threads).
This corresponds to the `warpSize` built-in variable in the C/C++ extension.
"""
@inline warpsize() = ccall("llvm.nvvm.read.ptx.sreg.warpsize", llvmcall, Int32, ())

"""
laneid()::Int32

Returns the thread's lane within the warp.
This ID is 1-based.
"""
@inline laneid() = ccall("llvm.nvvm.read.ptx.sreg.laneid", llvmcall, Int32, ()) + 1i32
