
Stabilize WebGPU text generation for Gemma-3 ONNX #289

Open
TanmayThakur2209 wants to merge 2 commits into google-gemini:main from TanmayThakur2209:fix-webgpu-generation-stability

Conversation

@TanmayThakur2209

This PR improves the stability and coherence of Gemma-3 (270M) text generation when running quantized ONNX models on WebGPU in the browser.
During testing of long and interactive sessions, I observed repeated, incoherent, or unstable outputs caused by aggressive generation settings and KV-cache reuse. This change introduces conservative, WebGPU-safe defaults to address those issues.

What’s changed

  • Added safe default generation parameters for WebGPU inference:

    • Reduced max_new_tokens to 256
    • Lowered temperature for more deterministic output
    • Enabled top_p nucleus sampling
    • Applied a mild repetition_penalty
  • Disabled KV-cache reuse (use_cache: false) to prevent instability with quantized ONNX models

  • Added a runtime warning when large max_new_tokens values are used

  • Centralized generation parameters into a documented configuration block for easier tuning
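As a sketch of the centralized configuration described above: `max_new_tokens: 256` and `use_cache: false` come from this PR, and the block name `SAFE_WEBGPU_GENERATION_CONFIG` is mentioned later in the review; the specific `temperature`, `top_p`, and `repetition_penalty` values, and the `warnIfTooManyTokens` helper, are illustrative assumptions, not the exact code in the PR.

```javascript
// Sketch of a centralized WebGPU-safe generation config.
// max_new_tokens = 256 and use_cache = false are from the PR description;
// the temperature, top_p, and repetition_penalty values are assumed examples.
const SAFE_WEBGPU_GENERATION_CONFIG = {
  max_new_tokens: 256,      // keep generations short for WebGPU stability
  do_sample: true,
  temperature: 0.7,         // lowered for more deterministic output (assumed value)
  top_p: 0.9,               // nucleus sampling (assumed value)
  repetition_penalty: 1.1,  // mild penalty against repeated spans (assumed value)
  use_cache: false,         // avoid KV-cache reuse with quantized ONNX models
};

// Hypothetical helper: warn when a caller requests more tokens than the stable limit.
function warnIfTooManyTokens(requested, limit = SAFE_WEBGPU_GENERATION_CONFIG.max_new_tokens) {
  if (requested > limit) {
    console.warn(
      `max_new_tokens=${requested} exceeds the recommended WebGPU limit of ${limit}; ` +
      `long generations may become unstable.`,
    );
    return true;
  }
  return false;
}
```

Keeping the values in one named object means later tuning (as discussed in the review below) touches a single place rather than scattered call sites.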

Why this matters

WebGPU + ONNX + quantized LLMs are particularly sensitive to long generations, cache reuse, and high-entropy sampling. These changes make browser-based Gemma demos more stable, predictable, and suitable for interactive use.

Scope

  • No changes to model weights or architecture
  • No backend or API changes
  • Browser-only behavior improvement

@gemini-code-assist
Contributor

Summary of Changes

Hello @TanmayThakur2209, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the stability and coherence of Gemma-3 (270M) text generation when utilizing quantized ONNX models within a WebGPU browser environment. By introducing conservative generation settings and explicitly disabling KV-cache reuse, the changes mitigate issues such as incoherent or unstable outputs observed during long, interactive sessions. The overall goal is to provide a more predictable and robust user experience for browser-based Gemma demos.

Highlights

  • WebGPU Generation Stability: Implemented conservative default generation parameters for WebGPU inference to enhance stability and coherence of Gemma-3 text generation.
  • KV-Cache Management: Disabled KV-cache reuse (use_cache: false) to prevent instability specifically with quantized ONNX models in browser environments.
  • Generation Parameter Configuration: Centralized generation parameters into a SAFE_WEBGPU_GENERATION_CONFIG block for improved readability and easier tuning.
  • Max New Tokens Warning: Added a runtime warning to alert users when max_new_tokens values exceed the recommended stable limit of 256 for WebGPU.

@gemini-code-assist
Contributor

Code Review

This pull request improves the stability of text generation on WebGPU by introducing a set of safe default generation parameters and disabling KV-cache reuse. The changes are well-motivated and effectively centralize the configuration, making it easier to manage. I have one suggestion to improve the maintainability of a new warning by replacing a hardcoded value with a constant.

@bebechien
Collaborator

@sitammeur to double check, just in case.

@sitammeur
Contributor

Yeah, I will check and let you know here.

@sitammeur
Contributor

Hi @TanmayThakur2209,

I reviewed your code and noticed that do_sample is set to true for inference. However, it is mostly set to false for any kind of inference, both for ONNX and Torch. For more information, you can refer to the official model release at this link: https://huggingface.co/onnx-community/gemma-3-270m-it-ONNX.

Regarding the maximum number of new tokens, setting this value too high can lead to issues, just as setting it too low can. I chose 512 tokens, which is a common choice in many Hugging Face demos.

Lastly, I’d like to mention the use_cache argument. If you set it to false, decoding will be slower because key-value pairs cannot be reused. You can see my approach, taken from the official demos, here: https://github.com/huggingface/transformers.js-examples.

I hope this helps!

cc: @bebechien
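Putting the reviewer's three points together (do_sample false, 512 new tokens, cache left enabled), a setup aligned with the official demos might look like the sketch below. The Transformers.js pipeline call, model id, and dtype follow the linked official release and examples, but this is an assumption about shape, not code from the PR, and it is not executed here since it needs a WebGPU browser and a model download.

```javascript
// Reviewer-recommended defaults, per the comment above: greedy decoding,
// 512 new tokens, and KV-cache left at its default (enabled) for faster decoding.
const RECOMMENDED_GENERATION_OPTIONS = {
  max_new_tokens: 512,
  do_sample: false, // greedy decoding, matching the official ONNX/Torch demos
  // use_cache is intentionally left at its default (true) so KV pairs are reused.
};

// Sketch of passing these options to a Transformers.js text-generation pipeline.
// Not invoked here: loading the model requires WebGPU and a network connection.
async function runDemo(prompt) {
  const { pipeline } = await import('@huggingface/transformers');
  const generator = await pipeline(
    'text-generation',
    'onnx-community/gemma-3-270m-it-ONNX',
    { device: 'webgpu', dtype: 'q4' }, // dtype assumed from the quantized release
  );
  return generator(prompt, RECOMMENDED_GENERATION_OPTIONS);
}
```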

@TanmayThakur2209
Author

Thanks a lot for the detailed feedback and references — really appreciate you taking the time to review this 🙏

You’re right that in most official demos do_sample is set to false, and that use_cache: false can slow down decoding due to non-reuse of KV pairs. My motivation here was based on observing unstable and repetitive outputs during longer interactive sessions specifically with WebGPU + quantized ONNX models in the browser. Disabling cache and using conservative sampling settings helped improve coherence and prevent the repeated or corrupted outputs I was seeing locally.

That said, your point about aligning more closely with the official demo defaults makes sense, especially for performance and consistency. I’m happy to adjust the configuration to better match the recommended inference setup (e.g. do_sample: false and revisiting use_cache) while still keeping safeguards for browser stability.

I’ll experiment with these changes and update the PR accordingly so we strike a better balance between stability and performance.

Thanks again for the guidance and the helpful links!
