[fix](Qwen3-30b-A3B): fix nonsense output by correcting RoPE write-back#31

Open
yyj6666667 wants to merge 1 commit into kvcache-ai:main from yyj6666667:fix/qwen3-30b

Conversation


@yyj6666667 yyj6666667 commented Apr 15, 2026

Motivation

  • The fused_rope_store kernel wrote rotated k only to k_cache, leaving the caller's k tensor unrotated. Prefill attention ran with rotated q but unrotated k on the first forward of each request, producing gibberish output on Qwen3-30B-A3B.

Modifications

  • In the fused_rope_store kernel, always store the rotated tensor back in place (q → q, k → k), in addition to writing
    rotated k into k_cache.

Tests

  • Verified on Qwen3-30B-A3B: prefill outputs are normal after the fix; the prior gibberish reproducer no longer triggers.
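The corrected write-back semantics can be sketched in plain Python with PyTorch (illustrative only; the actual path is the CUDA `fused_rope_store` kernel, and the function and helper names here are hypothetical):

```python
import torch

def rotate_half_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # Reference rotate-half RoPE over the last dimension: split the rope
    # dim into two halves and apply the standard cos/sin rotation.
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x2 * cos + x1 * sin], dim=-1)

def fused_rope_store_sketch(q, k, k_cache, cos, sin, loc):
    # After the fix: rotated values are stored back into q and k themselves,
    # so prefill attention sees both sides rotated on the first forward...
    q.copy_(rotate_half_rope(q, cos, sin))
    k.copy_(rotate_half_rope(k, cos, sin))
    # ...and rotated k is additionally written into k_cache for decode.
    k_cache[loc].copy_(k)
```

Before the fix, only the `k_cache[loc]` write carried the rotated k, leaving the caller's `k` tensor unrotated.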

- Restore in-place storage for q/k and ensure k_cache is updated with the rotated values.

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request modifies the fused_rope_store_kernel to ensure that rotated values are consistently written back to the input tensor for both query and key inputs, while also updating the k_cache. The review feedback identifies a critical missing update in the Python wrapper's mutates_args for the k tensor, which is necessary for correct dependency tracking and graph capture. Additionally, there is a suggestion to reuse existing pointers for the RoPE dimension offsets to improve code readability and maintain consistency with other kernel implementations.

Comment on lines +207 to +209
      store_as<InputStorage>(input, input_vec_x, lane_id);
      const auto input_y_out = pointer::offset(input, (kRopeDim / 2) * sizeof(DType));
      store_as<InputStorage>(input_y_out, input_vec_y, lane_id);

medium

You can reuse the existing input_x and input_y pointers (defined at lines 185-186) instead of recalculating the offset for the second half of the RoPE dimension. This improves readability and maintains consistency with the fused_rope_kernel implementation in the same file.

      store_as<InputStorage>(input_x, input_vec_x, lane_id);
      store_as<InputStorage>(input_y, input_vec_y, lane_id);


Copilot AI left a comment


Pull request overview

Fixes incorrect RoPE application in the fused RoPE+KV-cache store path for Qwen3-30B-A3B by ensuring rotated keys are written back to the caller’s k tensor (in addition to k_cache), preventing prefill attention from mixing rotated q with unrotated k.

Changes:

  • Update fused_rope_store_kernel to always write rotated outputs back in-place to the input tensor for both q and k.
  • Preserve existing behavior of also writing rotated k into k_cache for decode.


Comment on lines +204 to +212
      // Always write rotated values back to `input` (q itself when
      // load_q==true, k itself when load_q==false). When processing k, also
      // write to k_cache for future decode.
      store_as<InputStorage>(input, input_vec_x, lane_id);
      const auto input_y_out = pointer::offset(input, (kRopeDim / 2) * sizeof(DType));
      store_as<InputStorage>(input_y_out, input_vec_y, lane_id);
      if (!load_q) {
        const auto k_out = pointer::offset(k_cache, loc * cache_stride_bytes, head_id * head_stride_bytes);
        store_as<InputStorage>(k_out, input_vec_x, lane_id);

Copilot AI Apr 15, 2026


This kernel now writes rotated values back into the k input buffer when load_q == false. The Python custom-op wrapper for apply_rope_inplace_with_kvcache currently does not declare k in mutates_args, which can lead to incorrect behavior under torch.compile/graph captures because the op will be treated as not mutating k. Please update the wrapper registration to include k as a mutated argument to match this new in-place writeback.

Comment on lines +230 to +235
      // Always write rotated value back to `input`; also to k_cache when processing k.
      store_as<InputStorage>(input, input_vec, lane_id);
      if (!load_q) {
        const auto k_out = pointer::offset(k_cache, loc * cache_stride_bytes, head_id * head_stride_bytes);
        store_as<InputStorage>(k_out, input_vec, lane_id);
      }

Copilot AI Apr 15, 2026


The existing fused RoPE+KV-cache tests validate q and k_cache, but they don’t assert that the in-place k tensor is rotated. Since this change makes k writeback part of the contract, please extend the test to also compare the post-call k against the reference rotated k to prevent regressions.
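The extra assertion the reviewer asks for could look like this (a sketch; `ref_rope` is a hypothetical rotate-half reference, and the real test's naming and tolerances may differ):

```python
import torch

def ref_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # Rotate-half reference used only for checking: the two halves of the
    # rope dim are rotated by the per-position cos/sin factors.
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x2 * cos + x1 * sin], dim=-1)

def check_k_writeback(k_after, k_before, cos, sin):
    # New assertion: after the fused call, the caller's k tensor itself must
    # hold the rotated values, not just k_cache.
    torch.testing.assert_close(
        k_after, ref_rope(k_before, cos, sin), rtol=1e-3, atol=1e-3
    )
```

In the existing test this would run right after the fused op, with `k_before` cloned before the call.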

