
Deferred GPU-resident sampling and pre-allocated decode tensors#779

Open
mitiskuma wants to merge 1 commit into mlc-ai:main from mitiskuma:Deferred-GPU-resident-sampling-and-pre-allocated-decode-tensors

Conversation

@mitiskuma

Summary

  • Deferred sampling: tokens stay GPU-resident, flushed in batches of 4 to reduce GPU↔CPU round-trips during decode
  • Pre-allocated tensors reused per step instead of per-token allocation
  • Incremental penalty tracking avoids Map spread each token

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces significant performance enhancements to the token generation pipeline by optimizing GPU-CPU interactions and memory management. It achieves this by deferring token sampling to keep data resident on the GPU for longer periods, batching transfers to the CPU, and pre-allocating critical tensors to reduce dynamic memory operations. These changes aim to decrease latency and improve the overall efficiency of the LLM chat pipeline.

Highlights

  • Deferred GPU-resident Sampling: Implemented a DeferredSampler class to keep sampled tokens on the GPU, reducing GPU↔CPU round-trips by flushing tokens in batches (defaulting to 4) instead of processing each token individually. This significantly improves performance for token generation.
  • Pre-allocated Tensors for Sampling: Introduced pre-allocated tensors for sampling parameters (temperatures, top_p, penalties) and sampled tokens. These tensors are reused across decode steps, minimizing memory allocations and garbage collection overhead during generation.
  • Incremental Penalty Tracking: Optimized the tracking of appeared tokens for repetition, frequency, and presence penalties. Instead of recreating arrays from a Map on each token, parallel arrays are now incrementally updated, avoiding expensive spread operations unless a new unique token appears.
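The incremental penalty tracking described above can be sketched like this. The sketch is illustrative (the class and field names are assumptions, loosely mirroring the changelog's `penaltyTokenIds`/`penaltyTokenCnts`): parallel arrays are bumped in place for repeated tokens, and a dirty flag is set only when a new unique token forces the arrays to grow.

```typescript
// Illustrative sketch of incremental penalty tracking: instead of rebuilding
// arrays from a Map with a spread on every token, keep parallel arrays of
// token ids and counts and update them in place. Names are assumptions.

class PenaltyTracker {
  tokenIds: number[] = [];
  tokenCnts: number[] = [];
  private index = new Map<number, number>(); // tokenId -> position in arrays
  arraysDirty = false; // signals that device-side copies must be refreshed

  record(tokenId: number): void {
    const pos = this.index.get(tokenId);
    if (pos !== undefined) {
      this.tokenCnts[pos] += 1; // repeated token: O(1) in-place bump
    } else {
      this.index.set(tokenId, this.tokenIds.length);
      this.tokenIds.push(tokenId);
      this.tokenCnts.push(1);
      this.arraysDirty = true; // new unique token: arrays grew
    }
  }
}
```

The key property is that the common case (a token that already appeared) touches a single array slot, so the per-token cost no longer scales with the number of distinct tokens seen.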


Changelog
  • src/engine.ts
    • Added calls to pipeline.flushDeferredTokens at the end of the generate and chat.request methods to ensure any pending GPU-resident tokens are processed.
  • src/llm_chat.ts
    • Added _hasValue helper function for robust null/undefined checks.
    • Defined SamplingParams interface to encapsulate sampling configuration.
    • Introduced DeferredSampler class to manage GPU-resident sampled tokens, including methods for pushing tokens, flushing batches, and checking deferral eligibility.
    • Added new private members to LLMChatPipeline for pre-allocated penalty tracking arrays (penaltyTokenIds, penaltyTokenCnts, penaltyNumTokens, penaltyArraysDirty) and sampling tensors (temperaturesDevice, sampledTokensHost, topPHost, penaltyPos2seqIds, penaltySeqIdsDevice, penaltiesDevice, penaltiesHost).
    • Initialized the deferredSampler in the LLMChatPipeline constructor.
    • Disposed of the deferredSampler and all new pre-allocated tensors in the dispose method to prevent memory leaks.
    • Reset the deferredSampler and penalty tracking state in resetState and prefillStep methods.
    • Modified the decode method to conditionally use a deferred sampling path, leveraging embedGPUTokenAndForward and sampleTokenFromLogitsDeferred when applicable, and to call flushDeferredTokens.
    • Implemented flushDeferredTokens to synchronize the GPU, read back accumulated tokens, and process them in a batch.
    • Updated processNextToken to incrementally manage appearedTokensFreq and the new parallel penalty arrays, marking penaltyArraysDirty when a new unique token is encountered.
    • Added embedGPUTokenAndForward to handle embedding of GPU-resident tokens directly, avoiding CPU round-trips.
    • Refactored sampleTokenFromLogits to extract sampling parameter validation into extractSamplingParams and penalty/logit bias application into applyPenaltiesAndLogitBias.
    • Modified sampleTokenFromLogits to utilize the pre-allocated temperaturesDevice, topPHost, and sampledTokensHost tensors.
    • Introduced sampleTokenFromLogitsDeferred for GPU-only sampling operations that do not require immediate CPU synchronization, returning a GPU tensor.
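The pre-allocation pattern described in the changelog can be sketched as below. This is a hedged mock, not the PR's code: the buffer names loosely echo the changelog (`topPHost` etc.), but `SamplingBuffers` and `MAX_BATCH` are illustrative assumptions. The point is that buffers are sized once and refilled every decode step, so the hot loop allocates nothing.

```typescript
// Illustrative sketch of tensor pre-allocation: host-side typed arrays are
// allocated once for the maximum batch size and reused across decode steps,
// instead of creating fresh arrays per token. Names are assumptions.

const MAX_BATCH = 4;

class SamplingBuffers {
  // Allocated once in the constructor-equivalent; reused every step.
  readonly temperaturesHost = new Float32Array(MAX_BATCH);
  readonly topPHost = new Float32Array(MAX_BATCH);

  // Overwrite the same buffers each step; no per-token allocation.
  writeStep(temps: number[], topP: number[]): void {
    if (temps.length > MAX_BATCH || topP.length > MAX_BATCH) {
      throw new Error("batch larger than pre-allocated capacity");
    }
    this.temperaturesHost.set(temps);
    this.topPHost.set(topP);
  }
}
```

Because the typed arrays are reused, repeated `writeStep` calls update values without reallocating, which is what keeps garbage-collection pressure flat during long generations.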


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces significant performance optimizations by implementing deferred GPU-resident sampling and pre-allocating decode tensors. This reduces GPU-CPU round-trips and memory allocations during decoding. The changes are well-structured, with the introduction of the DeferredSampler class and refactoring of the sampling logic in llm_chat.ts. Overall, the implementation is solid, but I've identified a potential memory leak in the new DeferredSampler class that should be addressed.

Note: Security Review did not run due to the size of the PR.

@akaashrp
Collaborator

akaashrp commented Mar 7, 2026

Thanks for the contribution @mitiskuma! This change looks reasonable. What kinds of improvements in decode toks/s do you observe? I don't think the pre-allocated tensors alone yield much improvement, so I'm curious to see what benefits deferred sampling and penalty tracking bring.

@akaashrp self-assigned this Mar 9, 2026
@mitiskuma
Author

Hi @akaashrp!
I ran a few benchmarks on Qwen3-0.6B-q4f16_1 (Mac M4 Max, Chrome, WebGPU). Decode goes from ~140 t/s to ~200 t/s, and wall-clock E2E improves similarly (~1.4 s to ~1 s for ~200-token completions).
The main win is the DeferredSampler eliminating the per-token device.sync(): the GPU stays busy instead of stalling between tokens. The pre-allocated tensors are a minor contributor. These numbers are on top of the TVM WebGPU batching PR I authored, which was merged a few days ago (apache/tvm#18871) and batches dispatches into a single command encoder. The two are synergistic: even without TVM batching, deferred sampling still helps (from ~51 to ~150 t/s), but the full benefit requires both.
Important note: the internally reported decode TPS metric currently excludes flushDeferredTokens sync time, so it overreports. The numbers above are computed from wall-clock E2E minus TTFT.

@akaashrp
Collaborator

Thanks, I'll publish a new version of the web-runtime package and test the improvement on my end. I'm a bit swamped with work, but I'll try to review this ASAP.
