Add Qwen3.5 MoE (35B-A3B) model export and runner for CUDA backend #18169
mergennachin merged 2 commits into main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18169
Pull request overview
Adds an ExecuTorch export + C++ runner pipeline for the Qwen3.5 MoE (35B-A3B) model targeting the CUDA backend, including build system integration and model-specific loading/quantization/export code.
Changes:
- Introduce a new examples/models/qwen3_5_moe/ package with model definition, HF safetensors loading/remapping, and CUDA export script.
- Add a C++ runner using TextLLMRunner, plus CMake presets/lists to build it.
- Wire up a new top-level make qwen3_5_moe-cuda target and document usage in a new README.
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| Makefile | Adds qwen3_5_moe-cuda build target and help text entry. |
| examples/models/qwen3_5_moe/requirements.txt | Adds extra Python dependency for the Triton kernel path (FLA). |
| examples/models/qwen3_5_moe/README.md | Documents export/build/run workflow and model details. |
| examples/models/qwen3_5_moe/model.py | Implements export-friendly Qwen3.5 MoE model + HF checkpoint remapping/loading. |
| examples/models/qwen3_5_moe/export.py | Exports + lowers the model to CUDA ExecuTorch .pte/.ptd with optional quantization. |
| examples/models/qwen3_5_moe/main.cpp | Adds a minimal C++ CLI runner using TextLLMRunner + HF tokenizer. |
| examples/models/qwen3_5_moe/CMakePresets.json | Adds CMake presets/workflow for building the runner with CUDA. |
| examples/models/qwen3_5_moe/CMakeLists.txt | Adds standalone CMake build for the runner linked against ExecuTorch + CUDA delegate. |
| examples/models/qwen3_5_moe/init.py | Marks the directory as a Python package for imports. |
python -m pytest backends/cuda/tests backends/cuda/passes/tests -v -o "addopts="

# Build Qwen3.5 MoE runner
make qwen3_5_moe-cuda
Will add e2e tests (export and run the model) later. Currently export takes a long time; want to optimize it first.
    return model, config


def _quantize(model, config, args):
Can this be put somewhere generic? It seems non-model-specific.
We're already using quantize_model in extension/llm/export/quantize.py. Agree it's generic in principle. Currently only this model needs it (35B params don't fit on GPU at once). Happy to extract it into extension/llm/export/quantize.py if/when a second model needs the pattern.
@@ -0,0 +1,835 @@
"""
Ideally we should merge these with llama_transformer.py. Do you want to punt that to a later PR?
Yes, ideally. There are so many moving parts to enable MoE. Once we find a common pattern, we will "upstream" to llama_transformer.py.
# Split via slicing (torch.split produces split_copy which lacks AOTI fallback)
kd = self.key_dim
q = qkv_conv[..., :kd].reshape(B, T, self.num_k_heads, self.head_k_dim)
This slicing is sort of expensive after functionalization. Probably fine for AOTI-based flows since they try to lower before functionalization, IIRC, but MLX lowering would have issues.
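The slicing-vs-split tradeoff above can be checked in isolation; a minimal sketch with made-up shapes (not the model's real dims), showing the two forms are numerically identical while lowering to different ops:

```python
import torch

# Illustrative shapes only; the model's real key/value dims differ.
B, T, key_dim, value_dim = 1, 4, 8, 8
qkv = torch.randn(B, T, 2 * key_dim + value_dim)

# Slicing lowers to aten.slice; torch.split functionalizes to
# aten.split_copy, which (per the comment above) lacks an AOTI fallback.
q = qkv[..., :key_dim]
k = qkv[..., key_dim : 2 * key_dim]
v = qkv[..., 2 * key_dim :]

q2, k2, v2 = torch.split(qkv, [key_dim, key_dim, value_dim], dim=-1)
assert torch.equal(q, q2) and torch.equal(k, k2) and torch.equal(v, v2)
```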
# Materialize remaining meta-device buffers (KV caches, conv/recurrent
# state, causal masks, RoPE inv_freq) as zeros on CPU
for fqn, buf in list(model.named_buffers()):
Why materialize them? If we are just tracing, why is meta not OK?
You're right, buffers aren't quantized so they don't need to be materialized during loading. Moved materialization out of from_hf_checkpoint (buffers stay on meta) and into export.py right before torch.export. Unfortunately torch.export(strict=True) can't handle meta buffers — dynamo fails to create FakeTensors from them when there are in-place updates (e.g. cache[:, input_pos] = h), so we still need to materialize before tracing.
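A sketch of the materialization step described above, using a toy module with a hypothetical `kv_cache` buffer (not the PR's actual model): buffers constructed on the meta device are re-registered as real zero tensors right before export.

```python
import torch
import torch.nn as nn

# Toy module standing in for the real model: its buffer is built on the
# meta device, so it occupies no memory until materialized.
class Cache(nn.Module):
    def __init__(self):
        super().__init__()
        with torch.device("meta"):
            self.register_buffer("kv_cache", torch.empty(1, 8, 16))

model = Cache()
assert model.kv_cache.is_meta

# Replace every meta buffer with CPU zeros; re-registering an existing
# buffer name overwrites it in place.
for fqn, buf in list(model.named_buffers()):
    if buf.is_meta:
        mod_path, _, name = fqn.rpartition(".")
        mod = model.get_submodule(mod_path)
        mod.register_buffer(name, torch.zeros(buf.shape, dtype=buf.dtype))

assert not model.kv_cache.is_meta
```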
| `--max-seq-len` | `4096` | KV cache length |
| `--qlinear` | (none) | Linear layer quantization: `4w`, `8w`, `8da4w`, `8da8w` |
| `--qlinear-group-size` | `32` | Group size for linear quantization |
| `--qlinear-packing-format` | (none) | Packing format for 4w: `tile_packed_to_4d` |
Why is the packing format a CLI arg?
Removed the CLI arg.
| `--model-dir` | (required) | HuggingFace model directory with `config.json` + safetensors |
| `--output-dir` | `./qwen35_moe_exports` | Output directory |
| `--max-seq-len` | `4096` | KV cache length |
| `--qlinear` | (none) | Linear layer quantization: `4w`, `8w`, `8da4w`, `8da8w` |
nit: can we combine this and the group size into one arg, e.g. `4w,32`?
I will leave it as-is, since it is consistent with other CUDA models.
executorch_target_link_options_shared_lib(optimized_native_cpu_ops_lib)

# Needed for cpuinfo where it uses android specific log lib
if(ANDROID)
Why do we have this for a CUDA binary?
endif()

# On Windows, copy required DLLs to the executable directory
if(MSVC AND EXECUTORCH_BUILD_CUDA)
Suggested change:
- if(MSVC AND EXECUTORCH_BUILD_CUDA)
+ Assert EXECUTORCH_BUILD_CUDA==1 above
+ if(MSVC)
find_package(CUDAToolkit REQUIRED) already handles it
`model.pte` file with int4 quantization. At inference time, the C++ runner
loads the `.pte`, `.ptd`, and a HuggingFace tokenizer, then generates text.

## Prerequisites

Export produces a `model.pte` and `aoti_cuda_blob.ptd` containing the
compiled CUDA kernels and quantized weights. Int4 quantization is
recommended — the model is too large to fit in VRAM at bf16.
    ]
)

def forward(self, x, expert_indices):
Do you know how this is lowered? If I were to guess, this is where the majority of the perf is to be gained. You should be able to find fused-MoE kernels which leverage grouped GEMM kernels.
Yes, that's the next step.
# Untie lm_head/embedding so they can be quantized independently:
# embedding uses index lookup (8w), lm_head uses matmul (4w).
if model.lm_head.weight.data_ptr() == model.embed_tokens.weight.data_ptr():
    model.lm_head.weight = nn.Parameter(model.embed_tokens.weight.clone())
instead of untying, what if we run lm_head as int8? Memory should be better, but int8 is also slower, just curious.
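The untying check in the hunk above can be illustrated with a toy embedding/head pair; `data_ptr()` equality detects shared storage, and cloning breaks the tie so each side can be quantized independently:

```python
import torch
import torch.nn as nn

# Toy tied pair; sizes are illustrative, not the model's vocab/hidden dims.
emb = nn.Embedding(10, 4)
lm_head = nn.Linear(4, 10, bias=False)
lm_head.weight = emb.weight  # tie: both point at the same storage
assert lm_head.weight.data_ptr() == emb.weight.data_ptr()

# Untie by cloning, as the export code does, so lm_head can get a
# different quantization scheme than the embedding.
lm_head.weight = nn.Parameter(emb.weight.clone())
assert lm_head.weight.data_ptr() != emb.weight.data_ptr()
assert torch.equal(lm_head.weight, emb.weight)  # values unchanged
```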
    qlinear_group_size=args.qlinear_group_size,
    qlinear_packing_format=args.qlinear_packing_format,
)
layer.to(device="cpu")
I guess this is needed for us to fit in VRAM, but does it slow down lowering?
This doesn't slow down the lowering in my profiling.
exported = exported.run_decompositions(
    {torch.ops.aten.conv1d.default: conv1d_to_conv2d}
Should this pass be folded inside the partitioner?
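For context on the decomposition being discussed: a 1-D convolution can be rewritten as a 2-D convolution with a dummy height dimension. A small numeric check of that equivalence (the PR's actual conv1d_to_conv2d pass may differ in details):

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: (N, C_in, L) input, (C_out, C_in, K) kernel.
x = torch.randn(1, 4, 16)
w = torch.randn(8, 4, 3)

ref = F.conv1d(x, w, padding=1)

# Insert a height-1 dimension into both input and kernel, run conv2d
# with padding only along the width, then drop the dummy dimension.
via2d = F.conv2d(x.unsqueeze(2), w.unsqueeze(2), padding=(0, 1)).squeeze(2)

assert via2d.shape == ref.shape
assert torch.allclose(ref, via2d, atol=1e-5)
```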
config.temperature = FLAGS_temperature;
config.max_new_tokens = FLAGS_max_new_tokens;

auto error = runner->generate(FLAGS_prompt.c_str(), config);
Do you want to add basic benchmarking like llama.cpp::llama_bench?
within tinygemm int4 packing limits. 256 experts / 16 = 16 groups, giving
32 matmul nodes per layer instead of 768 with per-expert linears.

Forward pass: compute all groups → cat → gather top-k → SwiGLU → compute
This has redundant compute; I wonder how much we can speed up if we don't compute all the experts and throw them away later.
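The redundant compute being discussed can be seen on a toy example: the grouped layout runs all E experts for every token and gathers the top-k afterwards, so roughly (E - k)/E of the expert FLOPs are discarded. Sizes below are illustrative, not the model's real 256-expert config:

```python
import torch

# tokens, experts, hidden dim, top-k (toy sizes)
T, E, D, k = 4, 8, 16, 2
x = torch.randn(T, D)
w = torch.randn(E, D, D)           # one weight matrix per expert
topk = torch.randint(0, E, (T, k))

# Dense variant (what the grouped-linear layout produces): every expert
# runs on every token, then the top-k outputs are gathered per token.
all_out = torch.einsum("td,edf->tef", x, w)               # (T, E, D)
dense = all_out.gather(1, topk[..., None].expand(T, k, D))

# Sparse reference: only run the selected experts for each token.
sparse = torch.stack(
    [torch.stack([x[t] @ w[e] for e in topk[t]]) for t in range(T)]
)
assert torch.allclose(dense, sparse, atol=1e-5)
```

The two agree numerically; the dense form does E/k = 4x the expert matmul work here, which is the gap a fused/grouped-GEMM MoE kernel would close.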
digantdesai left a comment:
Left a few comments, but if it's functional then we should merge this and improve later. Thanks @mergennachin.
Pull request overview
Adds a new ExecuTorch example for exporting and running the Qwen3.5 MoE (35B-A3B) text model on the CUDA backend, including a memory-efficient safetensors loader, a C++ runner using the shared TextLLMRunner, and build/CI wiring.
Changes:
- Introduces a self-contained Qwen3.5 MoE model implementation + safetensors checkpoint remapping/loading.
- Adds a CUDA-only export pipeline with optional int4/int8 quantization and CUDA backend lowering.
- Adds C++ runner + CMake presets + Makefile target, and builds the runner in CUDA CI.
Reviewed changes
Copilot reviewed 10 out of 11 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| Makefile | Adds qwen3_5_moe-cuda build target and help text entry. |
| examples/models/qwen3_5_moe/requirements.txt | Adds flash-linear-attention dependency pin. |
| examples/models/qwen3_5_moe/README.md | Documents export/build/run workflow for the new model. |
| examples/models/qwen3_5_moe/model.py | Implements export-friendly Qwen3.5 MoE model + safetensors loading/remapping. |
| examples/models/qwen3_5_moe/model.md | Architecture/design notes for the implementation and export strategy. |
| examples/models/qwen3_5_moe/main.cpp | Adds a minimal TextLLMRunner-based C++ runner. |
| examples/models/qwen3_5_moe/export.py | Adds export + (int4/int8) quantization + CUDA lowering pipeline. |
| examples/models/qwen3_5_moe/CMakePresets.json | Adds presets to configure/build the runner with CUDA. |
| examples/models/qwen3_5_moe/CMakeLists.txt | Adds runner build definition and CUDA backend linkage. |
| .github/workflows/cuda.yml | Builds the new runner in the CUDA CI job. |
DEFINE_string(
    data_path,
    "",
    "Comma-separated data files (.ptd) for CUDA backend.");
DEFINE_string(tokenizer_path, "", "HuggingFace tokenizer.json path.");
DEFINE_string(prompt, "Hello", "Prompt text.");
DEFINE_double(temperature, 0.8, "Sampling temperature (0 = greedy).");
DEFINE_int32(max_new_tokens, 128, "Maximum tokens to generate.");

namespace llm = ::executorch::extension::llm;

int main(int argc, char** argv) {
  gflags::ParseCommandLineFlags(&argc, &argv, true);

  if (FLAGS_model_path.empty()) {
    ET_LOG(Error, "Must specify --model_path");
    return 1;
  }
  if (FLAGS_tokenizer_path.empty()) {
    ET_LOG(Error, "Must specify --tokenizer_path");
    return 1;
  }

  std::vector<std::string> data_files;
  if (!FLAGS_data_path.empty()) {
    data_files.push_back(FLAGS_data_path);
  }
--data_path is documented as “Comma-separated data files”, but the runner currently pushes the entire string as a single entry. This breaks multi-.ptd usage (and is inconsistent with the flag description). Split FLAGS_data_path on commas (and trim whitespace) before passing data_files to create_text_llm_runner().
if args.qlinear or args.qembedding:
    _quantize(model, config, args)
else:
    model.to(dtype=torch.bfloat16)

return model, config
Quantization runs before _materialize_buffers(), while the model still contains meta-device buffers from from_hf_checkpoint(). As a result, _to_device_skip_meta() skips moving key submodules (e.g., attention/GatedDeltaNet) to CUDA because they have meta buffers, so quantize_model_() may run on CPU weights or fail to apply CUDA packing. Materialize/replace meta buffers before the layer-by-layer CUDA quantization (or update _to_device_skip_meta() to move parameters even when a module has meta buffers).
# Dynamic shapes
example_tokens = torch.tensor([[0, 1]], dtype=torch.long)
example_input_pos = torch.tensor([0, 1], dtype=torch.long)
seq_dim = Dim("seq_len", min=1, max=config.max_seq_len - 1)
The dynamic shape upper bound uses max=config.max_seq_len - 1, which caps exported sequence length at one token less than the KV cache size (e.g., 4095 for a 4096 cache). Since max_seq_len is the intended context length, this should likely be max=config.max_seq_len to avoid needlessly reducing supported prompt length.
Suggested change:
- seq_dim = Dim("seq_len", min=1, max=config.max_seq_len - 1)
+ seq_dim = Dim("seq_len", min=1, max=config.max_seq_len)
k = k.repeat_interleave(self.n_kv_groups, dim=1)
v = v.repeat_interleave(self.n_kv_groups, dim=1)
repeat_interleave() on the full KV cache creates a new (B, n_heads, max_seq_len, head_dim) tensor every forward, which is a large allocation and can dominate decode-time memory/bandwidth. Consider using an expand-based view (e.g., add a group dimension then reshape) to broadcast K/V to n_heads without copying, if SDPA accepts the resulting strides.
Suggested change:
- k = k.repeat_interleave(self.n_kv_groups, dim=1)
- v = v.repeat_interleave(self.n_kv_groups, dim=1)
+ B_kv, n_kv_heads, S_kv, D_kv = k.shape
+ # Broadcast K/V across groups without materializing full repeats
+ k = k.unsqueeze(2).expand(B_kv, n_kv_heads, self.n_kv_groups, S_kv, D_kv)
+ v = v.unsqueeze(2).expand(B_kv, n_kv_heads, self.n_kv_groups, S_kv, D_kv)
+ k = k.reshape(B_kv, n_kv_heads * self.n_kv_groups, S_kv, D_kv)
+ v = v.reshape(B_kv, n_kv_heads * self.n_kv_groups, S_kv, D_kv)
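A quick numeric check that the expand+reshape broadcast suggested above orders heads the same way as repeat_interleave (illustrative shapes):

```python
import torch

# batch, kv heads, groups per kv head, seq len, head dim (toy sizes)
B, H_kv, G, S, D = 1, 2, 3, 4, 5
k = torch.randn(B, H_kv, S, D)

# Copying variant: each kv head repeated G times consecutively.
a = k.repeat_interleave(G, dim=1)

# View-based variant: add a group dim, broadcast, then flatten heads.
# Note the final reshape still copies; the win is that SDPA can often
# consume the expanded view directly, skipping the materialization.
b = k.unsqueeze(2).expand(B, H_kv, G, S, D).reshape(B, H_kv * G, S, D)

assert torch.equal(a, b)
```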
g = -self.A_log.float().exp() * F.softplus(a.float() + self.dt_bias)

# FLA Triton kernel
state = self.recurrent_state[:B].clone()
state = self.recurrent_state[:B].clone() allocates a full copy of the recurrent state on every forward call. The chunk_gated_delta_rule op does not mutate its initial_state argument (it returns final_state separately), so you can pass self.recurrent_state[:B] directly and avoid the per-step clone.
Suggested change:
- state = self.recurrent_state[:B].clone()
+ state = self.recurrent_state[:B]
target_link_options_gc_sections(qwen3_5_moe_runner)
target_link_options(qwen3_5_moe_runner PRIVATE "LINKER:-s")
This target unconditionally adds the linker option LINKER:-s when not Debug. That flag is not valid under MSVC, but the presets explicitly allow Windows. Guard LINKER:-s (and potentially target_link_options_gc_sections) behind if(NOT MSVC)/platform checks (similar to voxtral_realtime), and consider adding the post-build DLL copy for aoti_cuda_shims on MSVC so the runner can execute on Windows.
Suggested change:
- target_link_options_gc_sections(qwen3_5_moe_runner)
- target_link_options(qwen3_5_moe_runner PRIVATE "LINKER:-s")
+ if(NOT MSVC)
+   target_link_options_gc_sections(qwen3_5_moe_runner)
+   target_link_options(qwen3_5_moe_runner PRIVATE "LINKER:-s")
+ endif()
# Build Qwen3.5 MoE runner
make qwen3_5_moe-cuda
make qwen3_5_moe-cuda re-runs cmake --workflow --preset llm-release-cuda (per the Makefile target), but this job already built ExecuTorch with CUDA just above. This duplicates a full configure/build/install step and will add significant CI time; consider invoking only the runner build workflow (cd examples/models/qwen3_5_moe && cmake --workflow --preset qwen3-5-moe-cuda) here, or adjusting the Makefile target to skip rebuilding ExecuTorch when cmake-out is already populated.
Suggested change:
- # Build Qwen3.5 MoE runner
- make qwen3_5_moe-cuda
+ # Build Qwen3.5 MoE runner (without rebuilding ExecuTorch)
+ pushd examples/models/qwen3_5_moe
+ cmake --workflow --preset qwen3-5-moe-cuda
+ popd
- Defer buffer materialization from model loading to export time (from_hf_checkpoint now returns model with meta buffers, materialized in export.py before torch.export)
- Use _to_device_skip_meta() to move only non-meta submodules to CUDA during quantization, correctly handling tensor subclasses
- Remove --qlinear-packing-format CLI arg, auto-set tile_packed_to_4d for 4w quantization
- Remove Android, MSVC, and Apple platform guards from CMakeLists.txt (CUDA-on-Linux only runner)
- Add GPU spec (A100 80GB) and pip install requirements.txt to README
Memory-efficient loading using meta-device construction + lazy
safetensors shard-by-shard loading + assign=True state dict loading,
following the voxtral_realtime pattern. Peak CPU memory during loading
is ~1x model size instead of ~3x.
Expert weights are structured as grouped nn.Linear modules (16 groups
of 16 experts each) so quantize_model_() handles them automatically.
Layer-by-layer quantization on CUDA avoids loading the full bf16 model
onto GPU at once.
Includes C++ runner using the shared TextLLMRunner, Makefile target,
and CMake presets.
Reference implementations:
- https://github.com/mergennachin/nano_qwen35_moe/
- vLLM: vllm/model_executor/models/qwen3_5.py
6.5 tokens/s for decode -- can improve later