
Stabilize WebGPU text generation for Gemma-3 ONNX #289

Open
TanmayThakur2209 wants to merge 2 commits into google-gemini:main from TanmayThakur2209:fix-webgpu-generation-stability

Conversation

@TanmayThakur2209

This PR improves the stability and coherence of Gemma-3 (270M) text generation when running quantized ONNX models on WebGPU in the browser.
During testing of long and interactive sessions, I observed repeated, incoherent, or unstable outputs caused by aggressive generation settings and KV-cache reuse. This change introduces conservative, WebGPU-safe defaults to address those issues.

What’s changed

  • Added safe default generation parameters for WebGPU inference:

    • Reduced max_new_tokens to 256
    • Lowered temperature for more deterministic output
    • Enabled top_p nucleus sampling
    • Applied a mild repetition_penalty
  • Disabled KV-cache reuse (use_cache: false) to prevent instability with quantized ONNX models

  • Added a runtime warning when large max_new_tokens values are used

  • Centralized generation parameters into a documented configuration block for easier tuning
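As a sketch of the centralized configuration described above: `max_new_tokens: 256` and `use_cache: false` come from this PR, and the block name `SAFE_WEBGPU_GENERATION_CONFIG` is mentioned later in the review; the specific `temperature`, `top_p`, and `repetition_penalty` values, and the `warnIfTooManyTokens` helper, are illustrative assumptions, not the exact code in the PR.

```javascript
// Sketch of a centralized WebGPU-safe generation config.
// max_new_tokens = 256 and use_cache = false are from the PR description;
// the temperature, top_p, and repetition_penalty values are assumed examples.
const SAFE_WEBGPU_GENERATION_CONFIG = {
  max_new_tokens: 256,      // keep generations short for WebGPU stability
  do_sample: true,
  temperature: 0.7,         // lowered for more deterministic output (assumed value)
  top_p: 0.9,               // nucleus sampling (assumed value)
  repetition_penalty: 1.1,  // mild penalty against repeated spans (assumed value)
  use_cache: false,         // avoid KV-cache reuse with quantized ONNX models
};

// Hypothetical helper: warn when a caller requests more tokens than the stable limit.
function warnIfTooManyTokens(requested, limit = SAFE_WEBGPU_GENERATION_CONFIG.max_new_tokens) {
  if (requested > limit) {
    console.warn(
      `max_new_tokens=${requested} exceeds the recommended WebGPU limit of ${limit}; ` +
      `long generations may become unstable.`,
    );
    return true;
  }
  return false;
}
```

Keeping the values in one named object means later tuning (as discussed in the review below) touches a single place rather than scattered call sites.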

Why this matters

WebGPU + ONNX + quantized LLMs are particularly sensitive to long generations, cache reuse, and high-entropy sampling. These changes make browser-based Gemma demos more stable, predictable, and suitable for interactive use.

Scope

  • No changes to model weights or architecture
  • No backend or API changes
  • Browser-only behavior improvement

@gemini-code-assist
Contributor

Summary of Changes

Hello @TanmayThakur2209, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the stability and coherence of Gemma-3 (270M) text generation when utilizing quantized ONNX models within a WebGPU browser environment. By introducing conservative generation settings and explicitly disabling KV-cache reuse, the changes mitigate issues such as incoherent or unstable outputs observed during long, interactive sessions. The overall goal is to provide a more predictable and robust user experience for browser-based Gemma demos.

Highlights

  • WebGPU Generation Stability: Implemented conservative default generation parameters for WebGPU inference to enhance stability and coherence of Gemma-3 text generation.
  • KV-Cache Management: Disabled KV-cache reuse (use_cache: false) to prevent instability specifically with quantized ONNX models in browser environments.
  • Generation Parameter Configuration: Centralized generation parameters into a SAFE_WEBGPU_GENERATION_CONFIG block for improved readability and easier tuning.
  • Max New Tokens Warning: Added a runtime warning to alert users when max_new_tokens values exceed the recommended stable limit of 256 for WebGPU.

@gemini-code-assist
Contributor

Code Review

This pull request improves the stability of text generation on WebGPU by introducing a set of safe default generation parameters and disabling KV-cache reuse. The changes are well-motivated and effectively centralize the configuration, making it easier to manage. I have one suggestion to improve the maintainability of a new warning by replacing a hardcoded value with a constant.

@bebechien
Collaborator

@sitammeur to double check, just in case.

@sitammeur
Contributor

Yeah, I will check and let you know here.

@sitammeur
Contributor

Hi @TanmayThakur2209,

I reviewed your code and noticed that do_sample is set to true for inference. However, it is mostly set to false for any kind of inference, both for ONNX and Torch. For more information, you can refer to the official model release at this link: https://huggingface.co/onnx-community/gemma-3-270m-it-ONNX.

Regarding the maximum number of new tokens, setting this value too high can lead to issues, just as setting it too low can. I chose 512 tokens, which is a common choice in many Hugging Face demos.

Lastly, I’d like to mention the use_cache argument. If you set it to false, decoding will be slower because key-value pairs cannot be reused. You can see my approach, taken from the official demos, here: https://github.com/huggingface/transformers.js-examples.

I hope this helps!

cc: @bebechien
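Putting the reviewer's three points together (do_sample false, 512 new tokens, cache left enabled), a setup aligned with the official demos might look like the sketch below. The Transformers.js pipeline call, model id, and dtype follow the linked official release and examples, but this is an assumption about shape, not code from the PR, and it is not executed here since it needs a WebGPU browser and a model download.

```javascript
// Reviewer-recommended defaults, per the comment above: greedy decoding,
// 512 new tokens, and KV-cache left at its default (enabled) for faster decoding.
const RECOMMENDED_GENERATION_OPTIONS = {
  max_new_tokens: 512,
  do_sample: false, // greedy decoding, matching the official ONNX/Torch demos
  // use_cache is intentionally left at its default (true) so KV pairs are reused.
};

// Sketch of passing these options to a Transformers.js text-generation pipeline.
// Not invoked here: loading the model requires WebGPU and a network connection.
async function runDemo(prompt) {
  const { pipeline } = await import('@huggingface/transformers');
  const generator = await pipeline(
    'text-generation',
    'onnx-community/gemma-3-270m-it-ONNX',
    { device: 'webgpu', dtype: 'q4' }, // dtype assumed from the quantized release
  );
  return generator(prompt, RECOMMENDED_GENERATION_OPTIONS);
}
```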

@TanmayThakur2209
Author

Thanks a lot for the detailed feedback and references — really appreciate you taking the time to review this 🙏

You’re right that in most official demos do_sample is set to false, and that use_cache: false can slow down decoding due to non-reuse of KV pairs. My motivation here was based on observing unstable and repetitive outputs during longer interactive sessions specifically with WebGPU + quantized ONNX models in the browser. Disabling cache and using conservative sampling settings helped improve coherence and prevent the repeated or corrupted outputs I was seeing locally.

That said, your point about aligning more closely with the official demo defaults makes sense, especially for performance and consistency. I’m happy to adjust the configuration to better match the recommended inference setup (e.g. do_sample: false and revisiting use_cache) while still keeping safeguards for browser stability.

I’ll experiment with these changes and update the PR accordingly so we strike a better balance between stability and performance.

Thanks again for the guidance and the helpful links!
