Improve matrixMul tiling to support dimensions that are not multiples of BLOCK_SIZE in matrixMul.cu #411

@Codebruh-sudo

While going through matrixMul.cu, I noticed that although the current matrixMul sample effectively demonstrates shared-memory tiling, it assumes that the matrix dimensions are exact multiples of BLOCK_SIZE. This leads to undefined behavior or memory access violations when it is given arbitrary dimensions, for example 1000×1000 with BLOCK_SIZE=32.

I would suggest the following:

  • Add boundary checks in the tile-loading loop: if (row < height && col < width).
  • Initialize shared memory to zero for out-of-bounds threads so that the partial dot product remains numerically correct.
  • Update the checkCudaErrors logic to provide more descriptive output if a kernel fails due to an illegal memory access on non-conforming sizes.

The explanation follows.

Root cause
There are two main issues:

  1. The grid is undersized. The launch configuration uses truncating integer division:
    dim3 grid(dimsB.x / threads.x, dimsA.y / threads.y);
    For a 1000×1000 matrix with BLOCK_SIZE=32, this gives a 31×31 grid, which covers only 992 rows and 992 columns. The last 8 rows and 8 columns of C are never written and are silently left uninitialised.

  2. There is no boundary guard in the tile load. The kernel loads shared memory with no bounds check:
    As[ty][tx] = A[a + wA * ty + tx]; // reads past end of A on partial tiles
    Bs[ty][tx] = B[b + wB * ty + tx]; // reads past end of B on partial tiles
    Out-of-bounds threads load garbage values into shared memory. Because every thread in the block reads the whole tile after __syncthreads(), those values corrupt the dot product for all threads in the tile.
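
The arithmetic behind the first issue can be checked on the host. The following is a minimal sketch in plain C++ (not code from the sample); the function names covered_by_truncated_grid and uncovered_by_truncated_grid are hypothetical and exist only for illustration.

```cpp
#include <cassert>

// Rows (or columns) of C covered by a grid whose size was computed
// with truncating integer division, i.e. n / block blocks per dimension.
int covered_by_truncated_grid(int n, int block) {
    assert(n >= 0 && block > 0);
    return (n / block) * block;  // only whole blocks are launched
}

// Rows (or columns) of C for which no thread block is launched at all.
int uncovered_by_truncated_grid(int n, int block) {
    return n - covered_by_truncated_grid(n, block);
}
```

With n = 1000 and block = 32, the grid covers 992 rows and leaves 8 uncovered, matching the numbers above. For a conforming size such as 1024, nothing is left uncovered.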

Proposed fix

  1. Use ceiling division for the grid:
    dim3 grid(
        (dimsB.x + threads.x - 1) / threads.x,
        (dimsA.y + threads.y - 1) / threads.y
    );

  2. Add boundary-predicated tile loads with zero padding.
    Pass hA into the kernel alongside wA and wB. Then, for each tile index tile along the shared dimension (wA):
    int row = BLOCK_SIZE * by + ty;
    int col = BLOCK_SIZE * bx + tx;

    int a_col = tile * BLOCK_SIZE + tx;
    int b_row = tile * BLOCK_SIZE + ty;

    As[ty][tx] = (row < hA && a_col < wA) ? A[row * wA + a_col] : 0.0f;
    Bs[ty][tx] = (b_row < wA && col < wB) ? B[b_row * wB + col] : 0.0f; // hB == wA

    Zero-padding is numerically safe because 0 is the additive identity: the padded elements of a partial tile contribute nothing to the dot product.

  3. Guard the output write:

    if (row < hA && col < wB)
        C[row * wB + col] = Csub;

  4. Add input validation in main to warn users:
    if (dimsA.x % block_size != 0 || dimsA.y % block_size != 0 ||
        dimsB.x % block_size != 0) {
        printf("Warning: dimensions are not multiples of block_size (%d).\n"
               "Boundary padding will be applied.\n", block_size);
    }

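As a sanity check on the indexing, the ceiling-division grid, the predicated tile loads, and the guarded write proposed above can be emulated on the CPU. This is a minimal sketch in plain C++, not the sample's kernel: the names div_up, tiled_matmul, naive_matmul, and max_abs_diff are hypothetical, TILE stands in for BLOCK_SIZE, and the loops over (by, bx) and (ty, tx) stand in for the grid and the thread block.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Ceiling division, as used for the grid size.
int div_up(int n, int block) {
    assert(n >= 0 && block > 0);
    return (n + block - 1) / block;
}

// CPU emulation of the tiled product C = A * B with zero-padded partial tiles.
// A is hA x wA, B is wA x wB, C is hA x wB, all row-major.
std::vector<float> tiled_matmul(const std::vector<float>& A,
                                const std::vector<float>& B,
                                int hA, int wA, int wB, int TILE) {
    std::vector<float> C(static_cast<std::size_t>(hA) * wB, 0.0f);
    std::vector<float> As(TILE * TILE), Bs(TILE * TILE), Csub(TILE * TILE);
    for (int by = 0; by < div_up(hA, TILE); ++by) {
        for (int bx = 0; bx < div_up(wB, TILE); ++bx) {
            std::fill(Csub.begin(), Csub.end(), 0.0f);
            for (int tile = 0; tile < div_up(wA, TILE); ++tile) {
                // Load phase: every "thread" loads one element or a zero.
                for (int ty = 0; ty < TILE; ++ty) {
                    for (int tx = 0; tx < TILE; ++tx) {
                        int row = TILE * by + ty;
                        int col = TILE * bx + tx;
                        int a_col = tile * TILE + tx;
                        int b_row = tile * TILE + ty;
                        As[ty * TILE + tx] =
                            (row < hA && a_col < wA) ? A[row * wA + a_col] : 0.0f;
                        Bs[ty * TILE + tx] =
                            (b_row < wA && col < wB) ? B[b_row * wB + col] : 0.0f;
                    }
                }
                // Compute phase: runs after all loads, like after __syncthreads().
                for (int ty = 0; ty < TILE; ++ty)
                    for (int tx = 0; tx < TILE; ++tx)
                        for (int k = 0; k < TILE; ++k)
                            Csub[ty * TILE + tx] +=
                                As[ty * TILE + k] * Bs[k * TILE + tx];
            }
            // Guarded write: out-of-range "threads" write nothing.
            for (int ty = 0; ty < TILE; ++ty) {
                for (int tx = 0; tx < TILE; ++tx) {
                    int row = TILE * by + ty;
                    int col = TILE * bx + tx;
                    if (row < hA && col < wB)
                        C[row * wB + col] = Csub[ty * TILE + tx];
                }
            }
        }
    }
    return C;
}

// Straightforward triple-loop reference.
std::vector<float> naive_matmul(const std::vector<float>& A,
                                const std::vector<float>& B,
                                int hA, int wA, int wB) {
    std::vector<float> C(static_cast<std::size_t>(hA) * wB, 0.0f);
    for (int i = 0; i < hA; ++i)
        for (int j = 0; j < wB; ++j)
            for (int k = 0; k < wA; ++k)
                C[i * wB + j] += A[i * wA + k] * B[k * wB + j];
    return C;
}

// Largest element-wise absolute difference between two equally sized vectors.
float max_abs_diff(const std::vector<float>& x, const std::vector<float>& y) {
    assert(x.size() == y.size());
    float m = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i)
        m = std::max(m, std::fabs(x[i] - y[i]));
    return m;
}
```

Calling tiled_matmul on a 5×7 by 7×3 product with TILE = 4 and comparing against naive_matmul exercises partial tiles in all three dimensions. With small integer-valued inputs the two results should agree exactly, since the padded zeros add nothing to any sum.
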
To reproduce:

matrixMul -wA=1000 -hA=1000 -wB=1000 -hB=1000

  • Expected: Result = PASS
  • Actual: Result = FAIL (cudaErrorIllegalAddress)

Please let me know if I am wrong here. If I am not, I would be happy to raise a PR.
