I was casually going through matrixMul.cu and noticed that, while the current matrixMul sample effectively demonstrates shared-memory tiling, it assumes the matrix dimensions are exact multiples of BLOCK_SIZE. This leads to undefined behavior or memory-access violations when the sample is given arbitrary dimensions (e.g. 1000×1000).
I would suggest:
- Adding boundary checks in the tile-loading loop: if (row < height && col < width).
- Initializing shared memory to zero for out-of-bounds threads so the partial dot product remains numerically correct.
- Updating the checkCudaErrors logic to provide more descriptive output when a kernel fails with an illegal memory access on non-conforming sizes.
The full explanation is below.
Root cause
Two main issues:

- The grid is undersized: the launch configuration uses truncating integer division,

dim3 grid(dimsB.x / threads.x, dimsA.y / threads.y);

For a 1000×1000 matrix with BLOCK_SIZE=32, this gives a 31×31 grid. The last 8 rows and 8 columns of C are never written and are silently left uninitialized.
- No boundary guard in the tile load: the kernel loads shared memory with no bounds check,

As[ty][tx] = A[a + wA * ty + tx]; // reads past end of A on partial tiles
Bs[ty][tx] = B[b + wB * ty + tx]; // reads past end of B on partial tiles

Out-of-bounds threads load garbage values into shared memory, which then corrupts the dot product for every thread in the tile after __syncthreads().
The proposed fix
1. Use ceiling division for the grid:
dim3 grid(
(dimsB.x + threads.x - 1) / threads.x,
(dimsA.y + threads.y - 1) / threads.y
);
2. Add boundary-predicated tile loads with zero padding. Pass hA into the kernel alongside wA and wB:
int row = BLOCK_SIZE * by + ty;
int col = BLOCK_SIZE * bx + tx;
int a_col = tile * BLOCK_SIZE + tx;
int b_row = tile * BLOCK_SIZE + ty;
As[ty][tx] = (row < hA && a_col < wA) ? A[row * wA + a_col] : 0.0f;
Bs[ty][tx] = (b_row < wA && col < wB) ? B[b_row * wB + col] : 0.0f;
Zero-padding is numerically safe because 0 is the additive identity, so partial tiles contribute nothing to the dot product.
3. Guard the output write:
if (row < hA && col < wB)
C[row * wB + col] = Csub;
4. Add input validation in main to warn users:

if (dimsA.x % block_size != 0 || dimsA.y % block_size != 0 ||
    dimsB.x % block_size != 0) {
  printf("Warning: dimensions are not multiples of block_size (%d).\n"
         "Boundary padding will be applied.\n", block_size);
}

Repro:

matrixMul -wA=1000 -hA=1000 -wB=1000 -hB=1000
- Expected: Result = PASS
- Actual: Result = FAIL (cudaErrorIllegalAddress)
Please let me know if I am wrong here; if not, I would be happy to raise a PR.