Fix vector size check for strided tensors with unit strides on non-axis dims#1312
Merged
louisfd merged 1 commit intotracel-ai:mainfrom Apr 24, 2026
Merged
Conversation
…is dims try_tensor_vector_size_parallel computed next_stride as the smallest stride > 1, which skipped non-axis dims with stride 1 (produced e.g. by unfold with step=1). That caused vectorized copies to return wrong data: index_offset_contiguous_fastdivmod would compute an unaligned source offset (stride 1 * frame_coord) and divide by the vector size, reading the wrong source vector and duplicating frames. Fix: use the smallest non-zero stride outside the axis. Broadcast dims (stride 0) are still ignored; unit-size dims with stride 1 may now disable vectorization (false negative, not incorrect output). Reproduces as Burn's stft_istft_roundtrip_hann_window failure where unfold(1, 4, 1) on a len-12 signal produced frames duplicated in 4-element chunks instead of sliding by one sample.
This was referenced Apr 24, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
try_tensor_vector_size_parallelpickednext_strideby filtering strides> 1and taking the min. The> 1filter was meant to exclude the axis's own unit stride, but it also silently excluded legitimate non-axis dims with stride 1, letting invalid vector sizes slip through.Changed to filter by
i != axis && s != 0: skip the axis and broadcast dims, keep everything else (including unit-strided non-axis dims).Reproducer
Burn's STFT test failed because
unfold(dim, n_fft=4, step=1)on a[1, 12]contiguous tensor produces shape[1, 9, 4]with strides[12, 1, 1]. Both the frame dim and the sample dim have stride 1. When that view is materialized viacopy_gpu_ref:tensor_vector_size_parallelreturnsvector_size = 4(12 is divisible by 4; the problematic stride-1 on the frame dim is filtered out as it's not> 1).index_offset_contiguous_fastdivmod, which computes an element offset then divides byvector_size.frame_coord * stride[frame] = 1 * 1), which rounds down to source vector 0 after the divide. The kernel reads source vector 0 instead of the unaligned data starting at element 1, so output vector 1 gets the same data as vector 0.Result: frames duplicated in
n_fft-sized chunks — the first 4 output frames all contain the first 4 elements of the signal, the next 4 contain elements 4..8, etc. Visible asstft_istft_roundtrip_hann_windowfailing in Burn (tracel-ai/burn at rev d7bed26, any backend routed through cubecl).Fix
The correct invariant: for every dim other than the axis, changing that coord must shift the source offset by a multiple of
vector_size(or by zero, for broadcasts). The old filters > 1silently dropped stride-1 non-axis dims, violating this.New filter:
i != axis && s != 0. Unit-size dims with stride 1 are now included, which can cause false negatives (vectorization disabled) but never incorrect output - acceptable since the function's contract already allows being "overly strict."Regression tests
Added unit tests in
cubecl-core/src/lib.rscovering:StrideMismatchEach test also exercises the path that produced incorrect output before the fix; verified the step=1 test fails on the old code (returns 4 instead of 1) and passes on the new code.
Test plan
cargo test --lib tests::incubecl-corepassesstft_istft_roundtrip_hann_windowpasses with the originalunfoldpath (no workaround needed)