Add CUDA sort shim for AOTI export (thrust-based sort_stable fallback) by digantdesai · Pull Request #18829 · pytorch/executorch

digantdesai · 2026-04-11T16:15:38Z

WIP

Inductor emits aten::sort.stable for ops like argsort, but lacks a native c-shim for it. This adds a thrust-based implementation (aoti_torch_cuda_sort_stable) that handles int64, int32, and float32 dtypes on contiguous innermost-dim tensors. Registered as a supported fallback kernel in CudaBackend so AOTI-compiled models can use sort.
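To make the expected semantics concrete, here is a minimal pure-Python reference for a stable sort along the innermost dimension. It is an illustrative sketch only, not the thrust-based shim added in this PR; the name `sort_stable_rows` and the list-of-lists input are assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def sort_stable_rows(
    rows: Sequence[Sequence[float]], descending: bool = False
) -> Tuple[List[List[float]], List[List[int]]]:
    """Stable-sort each row; return (values, indices) as a sort op would."""
    values: List[List[float]] = []
    indices: List[List[int]] = []
    for row in rows:
        # sorted() is stable, also with reverse=True: tied elements keep
        # their original left-to-right order.
        order = sorted(range(len(row)), key=lambda i: row[i], reverse=descending)
        indices.append(order)
        values.append([row[i] for i in order])
    return values, indices
```

The `indices` output is what an argsort would return, which is why a missing sort shim blocks argsort in AOTI-compiled models.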

Sweeps prompt lengths [1..4095] with Qwen3.5-35B-A3B shapes (256 experts, top-8, INT4 W4A16). Validates correctness against a loop-based eager reference at small M, and benchmarks vectorized eager, torch.compile, and Triton fused_moe. Handles OOM gracefully at large M, where eager/compile dequantize all experts.
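One way such a sweep can keep going after an out-of-memory failure is sketched below. This is a hypothetical harness, not the benchmark script from this PR: the `sweep` function, the `backends` mapping, and the use of `MemoryError` as the OOM exception type are all assumptions (a GPU benchmark would catch the framework's own OOM exception and synchronize the device before reading the clock).

```python
import time
from typing import Callable, Dict, Iterable, Optional, Tuple, Type


def sweep(
    lengths: Iterable[int],
    backends: Dict[str, Callable[[int], object]],
    oom_error: Type[BaseException] = MemoryError,
) -> Dict[Tuple[str, int], Optional[float]]:
    """Time each backend at each prompt length M; record None on OOM."""
    results: Dict[Tuple[str, int], Optional[float]] = {}
    for m in lengths:
        for name, run in backends.items():
            try:
                start = time.perf_counter()
                run(m)
                results[(name, m)] = time.perf_counter() - start
            except oom_error:
                # Record the failure and continue with the other backends.
                results[(name, m)] = None
    return results
```

Recording `None` instead of aborting lets the remaining backends still report numbers at large M.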

When the Triton tile size fits within a single quantization group, load one scale per N-element instead of per (K, N) element. Reduces scale memory traffic in both GEMM1 and GEMM2 vec-mat kernels.
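The saving can be illustrated with a small pure-Python model of one weight tile. This is a sketch under stated assumptions, not the Triton kernel: the function names are hypothetical, scales are stored as `scales[group][n]` with `group = k // group_size`, and tiles are assumed to start at multiples of the tile height so that a tile no taller than a group never straddles a group boundary.

```python
from typing import List, Tuple


def tile_shares_one_group(k_start: int, block_k: int, group_size: int) -> bool:
    """True if rows [k_start, k_start + block_k) all fall in one quant group."""
    return k_start // group_size == (k_start + block_k - 1) // group_size


def tile_dot_per_element(x, q, scales, k_start, group_size) -> Tuple[List[float], int]:
    """Baseline: fetch a scale for every (k, n) element of the tile."""
    block_n = len(q[0])
    acc, loads = [0.0] * block_n, 0
    for k, row in enumerate(q):
        group = (k_start + k) // group_size
        for n in range(block_n):
            acc[n] += x[k] * row[n] * scales[group][n]
            loads += 1
    return acc, loads


def tile_dot_broadcast(x, q, scales, k_start, group_size) -> Tuple[List[float], int]:
    """Broadcast: fetch one scale per n and reuse it for every k in the tile."""
    assert tile_shares_one_group(k_start, len(q), group_size)
    row_scales = list(scales[k_start // group_size])  # block_n loads in total
    block_n = len(q[0])
    acc = [0.0] * block_n
    for k, row in enumerate(q):
        for n in range(block_n):
            acc[n] += x[k] * row[n] * row_scales[n]
    return acc, block_n
```

Both functions produce the same accumulator; only the number of scale fetches differs (`block_k * block_n` versus `block_n`).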

Adds a batched (M>1) Triton fused MoE kernel using tensor-core mma instructions for prefill workloads. Includes moe_align_block_size for token-expert sorting and scale broadcast optimization in the batched GEMM inner loops. Weight layout: [E, N, K//2] (packed INT4).
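The token-expert sorting step can be pictured with the pure-Python sketch below. It is a hypothetical illustration of the general idea (group routed slots by expert, then pad each group to a multiple of the block size so every block belongs to exactly one expert); the function name, return values, and padding sentinel are assumptions and may not match the `moe_align_block_size` in this PR.

```python
from typing import List, Sequence, Tuple


def align_block_size_sketch(
    topk_ids: Sequence[Sequence[int]], block_size: int, num_experts: int
) -> Tuple[List[int], List[int], int]:
    """Group flattened (token, k) slots by expert and pad to block_size."""
    flat = [expert for row in topk_ids for expert in row]
    pad_id = len(flat)  # sentinel: one past the last valid slot index
    buckets: List[List[int]] = [[] for _ in range(num_experts)]
    for slot, expert in enumerate(flat):
        buckets[expert].append(slot)

    sorted_ids: List[int] = []
    block_experts: List[int] = []
    for expert, slots in enumerate(buckets):
        if not slots:
            continue  # experts with no routed tokens get no blocks
        n_blocks = -(-len(slots) // block_size)  # ceiling division
        padding = n_blocks * block_size - len(slots)
        sorted_ids.extend(slots + [pad_id] * padding)
        block_experts.extend([expert] * n_blocks)
    return sorted_ids, block_experts, len(sorted_ids)
```

With this layout each GEMM block reads the weights of a single expert, and padded slots can be masked out by comparing against the sentinel.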

Add use_batched_moe flag on FusedMoEExperts, toggled by _set_batched_moe in export.py before each method's torch.export call. Decode (T=1) uses the vec-mat fused_moe kernel; prefill (T>=2) uses fused_moe_batched_gemm.
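The routing described above can be sketched as follows. Everything here is a hypothetical stand-in written for illustration: `FusedMoEExpertsSketch`, `set_batched_moe_sketch`, and `plan_exports` are not the classes or helpers from this PR, and the real export would call torch.export where the comment indicates.

```python
from typing import Dict, Iterable, List


class FusedMoEExpertsSketch:
    """Hypothetical stand-in for FusedMoEExperts; only the flag is modeled."""

    def __init__(self) -> None:
        self.use_batched_moe = False

    def kernel_name(self) -> str:
        # Prefill (T >= 2) uses the batched kernel; decode (T = 1) uses vec-mat.
        return "fused_moe_batched_gemm" if self.use_batched_moe else "fused_moe"


def set_batched_moe_sketch(modules: Iterable[object], enabled: bool) -> None:
    """Toggle the flag on every MoE experts module, skipping other modules."""
    for module in modules:
        if isinstance(module, FusedMoEExpertsSketch):
            module.use_batched_moe = enabled


def plan_exports(modules: List[object], method_seq_lens: Dict[str, int]) -> Dict[str, List[str]]:
    """Record which kernel each method would trace with its flag setting."""
    plan: Dict[str, List[str]] = {}
    for method, seq_len in method_seq_lens.items():
        set_batched_moe_sketch(modules, enabled=seq_len >= 2)
        # A real export would trace the method here, while the flag is set.
        plan[method] = [
            m.kernel_name() for m in modules if isinstance(m, FusedMoEExpertsSketch)
        ]
    return plan
```

Setting the flag before each trace means the kernel choice is fixed per exported method instead of being decided at runtime.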

pytorch-bot · 2026-04-11T16:15:44Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18829

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

Rolling out OSDC (ARC) runners on pull workflow for PyTorch trunk commits

❌ 2 New Failures, 4 Unrelated Failures

As of commit a0d199a with merge base 266ff2d:

NEW FAILURES - The following jobs have failed:

Test CUDA Builds / test-model-cuda-e2e (nvidia, parakeet-tdt, non-quantized) / linux-job (gh)
RuntimeError: Command docker exec -t de0261ee0ba46739fbdbeeb09c411f38a161b5d0694a7074429dd023f979f2dd /exec failed with exit code 1
Test CUDA Builds / test-model-cuda-e2e (nvidia, parakeet-tdt, quantized-int4-tile-packed) / linux-job (gh)
RuntimeError: Command docker exec -t 7d138767b941f8aa2342868cf9462b1b23d861811a94ffafe7ea5946d1f6d095 /exec failed with exit code 1

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / test-samsung-models-linux / linux-job (gh) (trunk failure)
test_w2l_fp16
pull / unittest / windows / windows-job (gh) (trunk failure)
##[error]The operation was canceled.
pull / unittest-editable / windows / windows-job (gh) (trunk failure)
##[error]The operation was canceled.
Test CUDA Builds / unittest-cuda / linux-job (gh) (trunk failure)
backends/cuda/tests/test_fused_moe.py::TestFusedMoE::test_e2e_cpp_runner

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-04-11T16:16:23Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

digantdesai added 5 commits April 10, 2026 14:37

Optimize MoE scale loading: broadcast when BLOCK_SIZE_K <= group_size

f22ed0f

Route prefill MoE to batched tensor-core kernel in Qwen3.5 export

a0d199a

meta-cla bot added the `CLA Signed` label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Apr 11, 2026

digantdesai wants to merge 5 commits into main from digantdesai/qwen35_moe
