Deferred GPU-resident sampling and pre-allocated decode tensors #779
Conversation
- Deferred sampling: tokens stay GPU-resident, flushed in batches of 4
- Pre-allocated tensors reused per step instead of per-token allocation
- Incremental penalty tracking avoids Map spread each token
Summary of Changes

This pull request introduces significant performance enhancements to the token generation pipeline by optimizing GPU-CPU interactions and memory management. It defers token sampling so that data stays resident on the GPU for longer, batches transfers to the CPU, and pre-allocates critical tensors to reduce dynamic memory operations. These changes aim to decrease latency and improve the overall efficiency of the LLM chat pipeline.
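The pre-allocation mentioned above can be illustrated with a minimal sketch: per-step buffers are allocated once and written into in place, rather than allocating fresh arrays for every token. The `DecodeBuffers` class and `softmaxInto` helper are hypothetical names, not code from this PR.

```typescript
/** Buffers allocated once at init and reused on every decode step. */
class DecodeBuffers {
  readonly logits: Float32Array; // written in place each step
  readonly probs: Float32Array;  // written in place each step
  constructor(vocabSize: number) {
    this.logits = new Float32Array(vocabSize);
    this.probs = new Float32Array(vocabSize);
  }
}

/** Numerically stable softmax written into the pre-allocated `probs` buffer. */
function softmaxInto(buf: DecodeBuffers): void {
  let max = -Infinity;
  for (const x of buf.logits) max = Math.max(max, x);
  let sum = 0;
  for (let i = 0; i < buf.logits.length; i++) {
    buf.probs[i] = Math.exp(buf.logits[i] - max);
    sum += buf.probs[i];
  }
  for (let i = 0; i < buf.probs.length; i++) buf.probs[i] /= sum;
}
```

Because no arrays are created inside the per-token path, the garbage collector has nothing to reclaim between steps, which is where the latency win comes from.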
Code Review
This pull request introduces significant performance optimizations by implementing deferred GPU-resident sampling and pre-allocating decode tensors. This reduces GPU-CPU round-trips and memory allocations during decoding. The changes are well-structured, with the introduction of the DeferredSampler class and refactoring of the sampling logic in llm_chat.ts. Overall, the implementation is solid, but I've identified a potential memory leak in the new DeferredSampler class that should be addressed.
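The review does not quote the leak itself, but a common source of such leaks in GPU-backed runtimes is failing to release the device handles of buffered tensors after they have been read back. A hedged sketch of a flush that disposes each handle (the `DisposableToken` interface and function name are assumptions, not the PR's API):

```typescript
/** A sampled-token handle that owns a device-side buffer. */
interface DisposableToken {
  readToCpu(): number;
  dispose(): void; // release the underlying device buffer
}

/** Copy pending tokens to the CPU, then free every device handle. */
function flushAndDispose(pending: DisposableToken[]): number[] {
  const ids = pending.map((t) => t.readToCpu());
  for (const t of pending) t.dispose(); // without this, GPU memory leaks
  pending.length = 0; // clear the queue in place
  return ids;
}
```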
Note: Security Review did not run due to the size of the PR.
Thanks for the contribution @mitiskuma! This change looks reasonable. What kinds of improvements in decode toks/s do you observe? I don't think the pre-allocated tensors alone yield much improvement, so I'm curious to see what benefits deferred sampling and penalty tracking bring.
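The penalty tracking asked about here refers to the PR's third change: updating a token-frequency map incrementally instead of rebuilding it (via Map spread) on every step. A minimal sketch of that idea, with illustrative names not taken from the PR:

```typescript
class PenaltyTracker {
  private counts = new Map<number, number>();

  /** O(1) in-place update for the newly sampled token, instead of
   *  reconstructing the whole map (e.g. `new Map([...counts])`) each step. */
  observe(tokenId: number): void {
    this.counts.set(tokenId, (this.counts.get(tokenId) ?? 0) + 1);
  }

  /** Apply frequency and presence penalties in place to a logits buffer. */
  apply(logits: Float32Array, freqPenalty: number, presPenalty: number): void {
    for (const [id, count] of this.counts) {
      logits[id] -= count * freqPenalty + presPenalty;
    }
  }
}
```

The saving is asymptotic: a spread copies every entry seen so far on each token, so total work grows quadratically with sequence length, while the incremental update keeps it linear.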
hi @akaashrp!
Thanks, I'll publish a new version of the web-runtime package and test the improvement on my end. I'm a bit swamped with work, but I'll try to review this ASAP.