I was casually going through matrixMul.cu and noticed that, while the current matrixMul sample effectively demonstrates shared-memory tiling, it assumes the matrix dimensions are exact multiples of BLOCK_SIZE. This leads to undefined behavior or memory-access violations when the sample is given arbitrary dimensions (e.g. 1000×1000).
I would suggest:
- Adding boundary checks in the tile-loading loop: if (row < height && col < width).
- Initializing shared memory to zero for out-of-bounds threads so the partial dot product remains numerically correct.
- Updating the checkCudaErrors logic to provide more descriptive output when a kernel fails with an illegal memory access on non-conforming sizes.
The full explanation is below.
Root cause
Two main issues:

- The grid is undersized: the launch configuration uses truncating integer division,

dim3 grid(dimsB.x / threads.x, dimsA.y / threads.y);

For a 1000×1000 matrix with BLOCK_SIZE=32, this gives a 31×31 grid. The last 8 rows and 8 columns of C are never written and are silently left uninitialized.
- No boundary guard in the tile load: the kernel loads shared memory with no bounds check,

As[ty][tx] = A[a + wA * ty + tx]; // reads past end of A on partial tiles
Bs[ty][tx] = B[b + wB * ty + tx]; // reads past end of B on partial tiles

Out-of-bounds threads load garbage values into shared memory, which then corrupts the dot product for every thread in the tile after __syncthreads().
The proposed fix
1. Use ceiling division for the grid:
dim3 grid(
(dimsB.x + threads.x - 1) / threads.x,
(dimsA.y + threads.y - 1) / threads.y
);
2. Add boundary-predicated tile loads with zero padding. Pass hA into the kernel alongside wA and wB:
int row = BLOCK_SIZE * by + ty;
int col = BLOCK_SIZE * bx + tx;
int a_col = tile * BLOCK_SIZE + tx;
int b_row = tile * BLOCK_SIZE + ty;
As[ty][tx] = (row < hA && a_col < wA) ? A[row * wA + a_col] : 0.0f;
Bs[ty][tx] = (b_row < wA && col < wB) ? B[b_row * wB + col] : 0.0f;
Zero-padding is numerically safe because 0 is the additive identity, so partial tiles contribute nothing to the dot product.
3. Guard the output write:
if (row < hA && col < wB)
C[row * wB + col] = Csub;
4. Add input validation in main to warn users:

if (dimsA.x % block_size != 0 || dimsA.y % block_size != 0 ||
    dimsB.x % block_size != 0) {
  printf("Warning: dimensions are not multiples of block_size (%d).\n"
         "Boundary padding will be applied.\n", block_size);
}

Repro:

matrixMul -wA=1000 -hA=1000 -wB=1000 -hB=1000
- Expected: Result = PASS
- Actual: Result = FAIL (cudaErrorIllegalAddress)
Please let me know if I am wrong here; if not, I would be happy to raise a PR.