
Deferred GPU-resident sampling and pre-allocated decode tensors#779

Open
mitiskuma wants to merge 1 commit into mlc-ai:main from mitiskuma:Deferred-GPU-resident-sampling-and-pre-allocated-decode-tensors

Conversation

@mitiskuma

Summary

  • Deferred sampling: tokens stay GPU-resident, flushed in batches of 4 to reduce GPU↔CPU round-trips during decode
  • Pre-allocated tensors reused per step instead of per-token allocation
  • Incremental penalty tracking avoids Map spread each token

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces significant performance enhancements to the token generation pipeline by optimizing GPU-CPU interactions and memory management. It achieves this by deferring token sampling to keep data resident on the GPU for longer periods, batching transfers to the CPU, and pre-allocating critical tensors to reduce dynamic memory operations. These changes aim to decrease latency and improve the overall efficiency of the LLM chat pipeline.

Highlights

  • Deferred GPU-resident Sampling: Implemented a DeferredSampler class to keep sampled tokens on the GPU, reducing GPU↔CPU round-trips by flushing tokens in batches (defaulting to 4) instead of processing each token individually. This significantly improves performance for token generation.
  • Pre-allocated Tensors for Sampling: Introduced pre-allocated tensors for sampling parameters (temperatures, top_p, penalties) and sampled tokens. These tensors are reused across decode steps, minimizing memory allocations and garbage collection overhead during generation.
  • Incremental Penalty Tracking: Optimized the tracking of appeared tokens for repetition, frequency, and presence penalties. Instead of recreating arrays from a Map on each token, parallel arrays are now incrementally updated, avoiding expensive spread operations unless a new unique token appears.
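The incremental penalty tracking described above can be sketched like this. The sketch is illustrative (the class and field names are assumptions, loosely mirroring the changelog's `penaltyTokenIds`/`penaltyTokenCnts`): parallel arrays are bumped in place for repeated tokens, and a dirty flag is set only when a new unique token forces the arrays to grow.

```typescript
// Illustrative sketch of incremental penalty tracking: instead of rebuilding
// arrays from a Map with a spread on every token, keep parallel arrays of
// token ids and counts and update them in place. Names are assumptions.

class PenaltyTracker {
  tokenIds: number[] = [];
  tokenCnts: number[] = [];
  private index = new Map<number, number>(); // tokenId -> position in arrays
  arraysDirty = false; // signals that device-side copies must be refreshed

  record(tokenId: number): void {
    const pos = this.index.get(tokenId);
    if (pos !== undefined) {
      this.tokenCnts[pos] += 1; // repeated token: O(1) in-place bump
    } else {
      this.index.set(tokenId, this.tokenIds.length);
      this.tokenIds.push(tokenId);
      this.tokenCnts.push(1);
      this.arraysDirty = true; // new unique token: arrays grew
    }
  }
}
```

The key property is that the common case (a token that already appeared) touches a single array slot, so the per-token cost no longer scales with the number of distinct tokens seen.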


Changelog
  • src/engine.ts
    • Added calls to pipeline.flushDeferredTokens at the end of the generate and chat.request methods to ensure any pending GPU-resident tokens are processed.
  • src/llm_chat.ts
    • Added _hasValue helper function for robust null/undefined checks.
    • Defined SamplingParams interface to encapsulate sampling configuration.
    • Introduced DeferredSampler class to manage GPU-resident sampled tokens, including methods for pushing tokens, flushing batches, and checking deferral eligibility.
    • Added new private members to LLMChatPipeline for pre-allocated penalty tracking arrays (penaltyTokenIds, penaltyTokenCnts, penaltyNumTokens, penaltyArraysDirty) and sampling tensors (temperaturesDevice, sampledTokensHost, topPHost, penaltyPos2seqIds, penaltySeqIdsDevice, penaltiesDevice, penaltiesHost).
    • Initialized the deferredSampler in the LLMChatPipeline constructor.
    • Disposed of the deferredSampler and all new pre-allocated tensors in the dispose method to prevent memory leaks.
    • Reset the deferredSampler and penalty tracking state in resetState and prefillStep methods.
    • Modified the decode method to conditionally use a deferred sampling path, leveraging embedGPUTokenAndForward and sampleTokenFromLogitsDeferred when applicable, and to call flushDeferredTokens.
    • Implemented flushDeferredTokens to synchronize the GPU, read back accumulated tokens, and process them in a batch.
    • Updated processNextToken to incrementally manage appearedTokensFreq and the new parallel penalty arrays, marking penaltyArraysDirty when a new unique token is encountered.
    • Added embedGPUTokenAndForward to handle embedding of GPU-resident tokens directly, avoiding CPU round-trips.
    • Refactored sampleTokenFromLogits to extract sampling parameter validation into extractSamplingParams and penalty/logit bias application into applyPenaltiesAndLogitBias.
    • Modified sampleTokenFromLogits to utilize the pre-allocated temperaturesDevice, topPHost, and sampledTokensHost tensors.
    • Introduced sampleTokenFromLogitsDeferred for GPU-only sampling operations that do not require immediate CPU synchronization, returning a GPU tensor.
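The pre-allocation pattern described in the changelog can be sketched as below. This is a hedged mock, not the PR's code: the buffer names loosely echo the changelog (`topPHost` etc.), but `SamplingBuffers` and `MAX_BATCH` are illustrative assumptions. The point is that buffers are sized once and refilled every decode step, so the hot loop allocates nothing.

```typescript
// Illustrative sketch of tensor pre-allocation: host-side typed arrays are
// allocated once for the maximum batch size and reused across decode steps,
// instead of creating fresh arrays per token. Names are assumptions.

const MAX_BATCH = 4;

class SamplingBuffers {
  // Allocated once in the constructor-equivalent; reused every step.
  readonly temperaturesHost = new Float32Array(MAX_BATCH);
  readonly topPHost = new Float32Array(MAX_BATCH);

  // Overwrite the same buffers each step; no per-token allocation.
  writeStep(temps: number[], topP: number[]): void {
    if (temps.length > MAX_BATCH || topP.length > MAX_BATCH) {
      throw new Error("batch larger than pre-allocated capacity");
    }
    this.temperaturesHost.set(temps);
    this.topPHost.set(topP);
  }
}
```

Because the typed arrays are reused, repeated `writeStep` calls update values without reallocating, which is what keeps garbage-collection pressure flat during long generations.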


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces significant performance optimizations by implementing deferred GPU-resident sampling and pre-allocating decode tensors. This reduces GPU-CPU round-trips and memory allocations during decoding. The changes are well-structured, with the introduction of the DeferredSampler class and refactoring of the sampling logic in llm_chat.ts. Overall, the implementation is solid, but I've identified a potential memory leak in the new DeferredSampler class that should be addressed.

Note: Security Review did not run due to the size of the PR.

@akaashrp
Collaborator

akaashrp commented Mar 7, 2026

Thanks for the contribution @mitiskuma! This change looks reasonable. What kinds of improvements in decode toks/s do you observe? I don't think the pre-allocated tensors alone yield much improvement, so I'm curious to see what benefits deferred sampling and penalty tracking bring.

@akaashrp self-assigned this Mar 9, 2026
@mitiskuma
Author

Hi @akaashrp!
I ran a few benchmarks on Qwen3-0.6B-q4f16_1 (Mac M4 Max, Chrome, WebGPU). Decode goes from ~140 t/s to ~200 t/s, and wall-clock E2E improves similarly (~1.4 s to ~1 s for ~200-token completions).
The main win is the DeferredSampler eliminating the per-token device.sync(): the GPU stays busy instead of stalling between tokens. The pre-allocated tensors are a minor contributor. These numbers are on top of the TVM WebGPU batching PR I authored, which was merged a few days ago (apache/tvm#18871) and batches dispatches into a single command encoder. The two are synergistic: even without TVM batching, deferred sampling still helps (from ~51 to ~150 t/s), but the full benefit requires both.
Important note: the internally reported decode TPS metric currently excludes flushDeferredTokens sync time, so it overreports. The numbers above are computed from wall-clock E2E minus TTFT.

@akaashrp
Collaborator

Thanks, I'll publish a new version of the web-runtime package and test the improvement on my end. I'm a bit swamped with work, but I'll try to review this ASAP.
