fix: avoid Groq token-limit 413 for small prompts #449
CarlosAlexandredevv wants to merge 1 commit into Gitlawb:main
Conversation
Pull request overview
This PR addresses Groq HTTP 413 failures caused by token budget overflow (TPM) by shrinking Groq-bound OpenAI-compatible requests and dynamically reducing completion token budgets, plus improving how certain 413 responses are surfaced to users.
Changes:
- Add Groq detection and request payload compaction (strip tool schema descriptions, trim messages, and disable tools as needed).
- Dynamically clamp Groq `max_tokens` based on an estimated prompt token size.
- Improve 413 error mapping to show token/rate-limit guidance when the provider response indicates token budget overflow.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/services/api/openaiShim.ts | Adds Groq-specific payload compaction and max_tokens clamping before sending requests. |
| src/services/api/openaiShim.test.ts | Adds tests asserting Groq payload compaction and max_tokens clamping behavior. |
| src/services/api/errors.ts | Improves 413 handling to map token/rate-limit-style 413s to a more helpful message. |
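The 413 mapping described for `src/services/api/errors.ts` could look roughly like the following sketch. The names `ProviderError` and `mapProviderError`, and the exact wording matched in the error body, are illustrative assumptions, not the PR's actual identifiers:

```typescript
// Hypothetical sketch: distinguish a token-budget (TPM) 413 from a plain
// payload-size 413 by inspecting the provider's error body.
interface ProviderError {
  status: number
  body?: { error?: { message?: string } }
}

function mapProviderError(err: ProviderError): string {
  if (err.status === 413) {
    const detail = err.body?.error?.message ?? ''
    // Assumption: Groq's TPM overflow responses mention tokens / rate limits.
    if (/token|rate.?limit|tpm/i.test(detail)) {
      return 'Request exceeds the provider token budget (TPM); try a shorter prompt, fewer tools, or a smaller max_tokens.'
    }
    return 'Request payload too large for the provider.'
  }
  return `Provider error ${err.status}`
}
```

The key design point is that both failures arrive as HTTP 413, so only the response body tells the user whether to shrink the upload or the token budget.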
```typescript
while (promptTokens > GROQ_TARGET_PROMPT_TOKENS) {
  const firstNonSystemIndex = messages.findIndex(
    (message, index) =>
      message.role !== 'system' && index < messages.length - 1,
  )
  if (firstNonSystemIndex === -1) break

  messages.splice(firstNonSystemIndex, 1)
  body.messages = messages
```
The message-trimming loop removes the first non-system message without accounting for tool-call/message pairing (assistant tool_calls ↔ subsequent role:'tool' messages). This can leave orphaned tool results or tool calls in body.messages, which OpenAI-compatible APIs typically reject (400) and would negate the intended 413 mitigation. Consider trimming whole “turn” segments and preserving tool_call/tool_result adjacency (e.g., when removing an assistant message with tool_calls, also remove the following tool messages for those ids; or only remove complete user+assistant(+tool) groups from the front).
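A turn-aware removal along the lines this comment suggests might be sketched as follows. This is a minimal illustration, not the PR's code; `ChatMessage` and `removeFirstTurn` are hypothetical names, and the message shape is the usual OpenAI-compatible one:

```typescript
// Hypothetical sketch: remove the first non-system message, and if it is an
// assistant message carrying tool_calls, also remove the role:'tool' messages
// that answer those calls, so no orphaned tool results remain in the payload.
type ChatMessage = {
  role: 'system' | 'user' | 'assistant' | 'tool'
  content?: string
  tool_calls?: { id: string }[]
  tool_call_id?: string
}

function removeFirstTurn(messages: ChatMessage[]): ChatMessage[] {
  const start = messages.findIndex((m) => m.role !== 'system')
  // Mirror the original guard: never remove the final message.
  if (start === -1 || start >= messages.length - 1) return messages

  const removed = messages[start]
  let end = start + 1
  if (removed.role === 'assistant' && removed.tool_calls?.length) {
    const ids = new Set(removed.tool_calls.map((c) => c.id))
    // Drop the tool results paired with this assistant's tool_calls.
    while (
      end < messages.length &&
      messages[end].role === 'tool' &&
      ids.has(messages[end].tool_call_id ?? '')
    ) {
      end++
    }
  }
  return [...messages.slice(0, start), ...messages.slice(end)]
}
```

Calling this in the trimming loop instead of a bare `splice` keeps every remaining `tool_call` adjacent to its `tool` result, which OpenAI-compatible validators require.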
```typescript
function estimateJsonBytes(value: unknown): number {
  return new TextEncoder().encode(JSON.stringify(value)).length
}

function estimateGroqPromptTokens(value: unknown): number {
```
`estimateJsonBytes` allocates a new `TextEncoder` and stringifies the entire payload on every call, and `compactPayloadForGroq` calls this repeatedly (including inside a loop). For large payloads this can become a noticeable CPU/memory hotspot. Consider reusing a module-scoped `TextEncoder` and reducing full-body `JSON.stringify` calls (e.g., estimate only the prompt-bearing fields or cache the serialized form between compaction steps).
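The hoisting half of this suggestion is straightforward; a minimal sketch (the optional `cachedJson` parameter is an illustrative assumption for the caching half):

```typescript
// Hypothetical optimization sketch: the TextEncoder is created once at module
// scope, and callers that already hold the serialized body can pass it in to
// skip re-stringifying the whole payload on every compaction step.
const encoder = new TextEncoder()

function estimateJsonBytes(value: unknown, cachedJson?: string): number {
  const json = cachedJson ?? JSON.stringify(value)
  return encoder.encode(json).length
}
```

Since `TextEncoder` is stateless for UTF-8 encoding, sharing one instance across calls is safe.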
Summary
- Clamp `max_tokens` dynamically for Groq based on an estimated prompt size.

Why
A short prompt like `oi` could still fail with 413 on Groq because the outgoing request (tools + context + completion budget) exceeded token limits, not necessarily byte-size upload limits.

Validation
- `bun test src/services/api/openaiShim.test.ts`
- `bun run build`
- `bun run start --print "oi"` (reproduced the previously failing path; now succeeds)

Closes #337