Stabilize WebGPU text generation for Gemma-3 ONNX #289
TanmayThakur2209 wants to merge 2 commits into google-gemini:main
Conversation
Summary of Changes

Hello @TanmayThakur2209, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request improves the stability and coherence of Gemma-3 (270M) text generation when running quantized ONNX models in a WebGPU browser environment. By introducing conservative generation settings and explicitly disabling KV-cache reuse, the changes mitigate issues such as incoherent or unstable outputs observed during long, interactive sessions. The overall goal is to provide a more predictable and robust user experience for browser-based Gemma demos.
Code Review
This pull request improves the stability of text generation on WebGPU by introducing a set of safe default generation parameters and disabling KV-cache reuse. The changes are well-motivated and effectively centralize the configuration, making it easier to manage. I have one suggestion to improve the maintainability of a new warning by replacing a hardcoded value with a constant.
@sitammeur to double check, just in case.
Yeah, I will check and let you know here.
I reviewed your code and noticed that the sampling method for inference is set to true. However, it appears that it is mostly set to false for any kind of inference, both for ONNX and Torch. For more information, you can refer to the official model release: https://huggingface.co/onnx-community/gemma-3-270m-it-ONNX.

Regarding the maximum number of new tokens, setting this value too high can lead to issues, just as setting it too low can. I chose 512 tokens, which is a common choice in many Hugging Face demos.

Lastly, I'd like to mention the …

I hope this helps!

cc: @bebechien
Thanks a lot for the detailed feedback and references; really appreciate you taking the time to review this 🙏

You're right that in most official demos sampling is disabled. That said, your point about aligning more closely with the official demo defaults makes sense, especially for performance and consistency. I'm happy to adjust the configuration to better match the recommended inference setup.

I'll experiment with these changes and update the PR accordingly so we strike a better balance between stability and performance. Thanks again for the guidance and the helpful links!
This PR improves the stability and coherence of Gemma-3 (270M) text generation when running quantized ONNX models on WebGPU in the browser.
During testing of long and interactive sessions, I observed repeated, incoherent, or unstable outputs caused by aggressive generation settings and KV-cache reuse. This change introduces conservative, WebGPU-safe defaults to address those issues.
What’s changed
- Added safe default generation parameters for WebGPU inference:
  - `max_new_tokens` to 256
  - `temperature` lowered for more deterministic output
  - `top_p` nucleus sampling
  - `repetition_penalty`
- Disabled KV-cache reuse (`use_cache: false`) to prevent instability with quantized ONNX models
- Added a runtime warning when large `max_new_tokens` values are used
- Centralized generation parameters into a documented configuration block for easier tuning
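The changes above could be sketched roughly as follows. This is a hypothetical illustration of the centralized configuration block and runtime warning described in the PR, not the actual diff; the identifiers (`GENERATION_CONFIG`, `warnOnLargeTokenBudget`, `MAX_SAFE_NEW_TOKENS`) and the `temperature`/`top_p`/`repetition_penalty` values are illustrative assumptions.

```javascript
// Hypothetical sketch of the WebGPU-safe generation defaults from this PR.
// Only max_new_tokens (256) and use_cache (false) are stated in the PR text;
// the remaining values are placeholders for illustration.
const MAX_SAFE_NEW_TOKENS = 256;

const GENERATION_CONFIG = {
  max_new_tokens: MAX_SAFE_NEW_TOKENS, // conservative budget for long sessions
  temperature: 0.7,                    // lower entropy for more deterministic output
  top_p: 0.9,                          // nucleus sampling cutoff
  repetition_penalty: 1.1,             // discourage repeated output
  use_cache: false,                    // disable KV-cache reuse with quantized ONNX
};

// Runtime warning when a caller overrides max_new_tokens with a large value.
// Returns true when a warning was emitted, so callers can react if needed.
function warnOnLargeTokenBudget(maxNewTokens) {
  if (maxNewTokens > MAX_SAFE_NEW_TOKENS) {
    console.warn(
      `max_new_tokens=${maxNewTokens} exceeds the WebGPU-safe default of ` +
      `${MAX_SAFE_NEW_TOKENS}; long generations may become unstable.`
    );
    return true;
  }
  return false;
}
```

Keeping the threshold in a named constant (rather than hardcoding it in the warning) matches the maintainability suggestion raised in the review.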
Why this matters
WebGPU + ONNX + quantized LLMs are particularly sensitive to long generations, cache reuse, and high-entropy sampling. These changes make browser-based Gemma demos more stable, predictable, and suitable for interactive use.
Scope