
fix: avoid Groq token-limit 413 for small prompts#449

Open
CarlosAlexandredevv wants to merge 1 commit into Gitlawb:main from CarlosAlexandredevv:fix/groq-token-limit-337

Conversation

@CarlosAlexandredevv

Summary

  • Fixes Groq requests that failed with HTTP 413 due to token budget overflow (TPM), even for short prompts.
  • Adds Groq-specific payload compaction by estimated prompt tokens.
  • Caps max_tokens dynamically for Groq based on estimated prompt size.
  • Improves 413 error mapping: when provider response indicates token/rate-limit overflow, show a rate-limit guidance message instead of generic "Request too large".
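The dynamic `max_tokens` cap can be sketched roughly as below. This is an illustrative sketch, not the PR's actual code: `GROQ_TPM_BUDGET`, `GROQ_MIN_COMPLETION_TOKENS`, and the bytes-per-token heuristic are assumptions.

```typescript
// Hypothetical sketch of clamping the completion budget for Groq.
// Constants and the ~4-bytes-per-token estimate are assumptions,
// not the identifiers used in openaiShim.ts.
const GROQ_TPM_BUDGET = 6000 // assumed tokens-per-minute budget
const GROQ_MIN_COMPLETION_TOKENS = 256

function estimatePromptTokens(payload: unknown): number {
  // Rough heuristic: ~4 bytes of serialized JSON per token.
  const bytes = new TextEncoder().encode(JSON.stringify(payload)).length
  return Math.ceil(bytes / 4)
}

function clampMaxTokens(requested: number, promptTokens: number): number {
  // Leave whatever TPM budget remains after the prompt for the completion,
  // but never go below a small floor.
  const remaining = GROQ_TPM_BUDGET - promptTokens
  return Math.max(GROQ_MIN_COMPLETION_TOKENS, Math.min(requested, remaining))
}
```

With these assumed numbers, a 5000-token prompt would shrink a requested 4096-token completion down to 1000 tokens instead of letting the total exceed the budget.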

Why

A short prompt like "oi" could still fail with 413 on Groq because the outgoing request (tools + context + completion budget) exceeded token limits, not necessarily byte-size upload limits.

Validation

  • bun test src/services/api/openaiShim.test.ts
  • bun run build
  • bun run start --print "oi" (reproduced previously failing path; now succeeds)

Closes #337

Copilot AI review requested due to automatic review settings April 6, 2026 18:59

Copilot AI left a comment


Pull request overview

This PR addresses Groq HTTP 413 failures caused by token budget overflow (TPM) by shrinking Groq-bound OpenAI-compatible requests and dynamically reducing completion token budgets, plus improving how certain 413 responses are surfaced to users.

Changes:

  • Add Groq detection and request payload compaction (strip tool schema descriptions, trim messages, and disable tools as needed).
  • Dynamically clamp Groq max_tokens based on an estimated prompt token size.
  • Improve 413 error mapping to show token/rate-limit guidance when the provider response indicates token budget overflow.
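The improved 413 mapping might look something like the following sketch. The `ProviderError` shape, function name, and regex are assumptions; only the behavior (token-overflow 413s get rate-limit guidance, others keep the byte-size message) comes from the PR description.

```typescript
// Hypothetical sketch of the 413 mapping in errors.ts. The shapes and
// messages below are illustrative assumptions.
interface ProviderError {
  status: number
  body: string
}

function mapProviderError(err: ProviderError): string {
  if (err.status === 413) {
    // Groq's token-budget 413s mention tokens/TPM/rate limits in the body,
    // unlike plain byte-size upload rejections.
    const tokenOverflow = /token|TPM|rate[_ ]?limit/i.test(err.body)
    if (tokenOverflow) {
      return 'Request exceeded the provider token budget (TPM). Try a shorter prompt, fewer tools, or retry later.'
    }
    return 'Request too large (max 20MB). Try with a smaller file.'
  }
  return err.body
}
```

The key point is that the status code alone is ambiguous; the response body has to be inspected to pick the right guidance.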

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
src/services/api/openaiShim.ts Adds Groq-specific payload compaction and max_tokens clamping before sending requests.
src/services/api/openaiShim.test.ts Adds tests asserting Groq payload compaction and max_tokens clamping behavior.
src/services/api/errors.ts Improves 413 handling to map token/rate-limit-style 413s to a more helpful message.


Comment on lines +171 to +179
while (promptTokens > GROQ_TARGET_PROMPT_TOKENS) {
  const firstNonSystemIndex = messages.findIndex(
    (message, index) =>
      message.role !== 'system' && index < messages.length - 1,
  )
  if (firstNonSystemIndex === -1) break

  messages.splice(firstNonSystemIndex, 1)
  body.messages = messages

Copilot AI Apr 6, 2026


The message-trimming loop removes the first non-system message without accounting for tool-call/message pairing (assistant tool_calls ↔ subsequent role:'tool' messages). This can leave orphaned tool results or tool calls in body.messages, which OpenAI-compatible APIs typically reject (400) and would negate the intended 413 mitigation. Consider trimming whole “turn” segments and preserving tool_call/tool_result adjacency (e.g., when removing an assistant message with tool_calls, also remove the following tool messages for those ids; or only remove complete user+assistant(+tool) groups from the front).
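The pairing-preserving trim the reviewer suggests could be sketched as below. The `ChatMessage` type follows the OpenAI chat format; the function name and grouping logic are illustrative, not the PR's code.

```typescript
// Hypothetical sketch: remove the oldest non-system message, and if it is
// an assistant message with tool_calls, also remove the role:'tool'
// messages that answer those calls, so no orphans are left behind.
type ChatMessage = {
  role: 'system' | 'user' | 'assistant' | 'tool'
  content?: string
  tool_calls?: { id: string }[]
  tool_call_id?: string
}

function removeOldestTurn(messages: ChatMessage[]): ChatMessage[] {
  const start = messages.findIndex(
    (m, i) => m.role !== 'system' && i < messages.length - 1,
  )
  if (start === -1) return messages

  let end = start + 1
  const first = messages[start]
  if (first.role === 'assistant' && first.tool_calls?.length) {
    // Extend the removal window over the tool results paired with this turn.
    const ids = new Set(first.tool_calls.map((c) => c.id))
    while (
      end < messages.length &&
      messages[end].role === 'tool' &&
      ids.has(messages[end].tool_call_id ?? '')
    ) {
      end++
    }
  }
  return [...messages.slice(0, start), ...messages.slice(end)]
}
```

Calling this repeatedly from the compaction loop (instead of `splice`-ing a single message) would keep every remaining `tool_calls` entry matched to its `role:'tool'` reply.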

Comment on lines +132 to +136
function estimateJsonBytes(value: unknown): number {
return new TextEncoder().encode(JSON.stringify(value)).length
}

function estimateGroqPromptTokens(value: unknown): number {

Copilot AI Apr 6, 2026


estimateJsonBytes allocates a new TextEncoder and stringifies the entire payload on every call, and compactPayloadForGroq calls this repeatedly (including inside a loop). For large payloads this can become a noticeable CPU/memory hotspot. Consider reusing a module-scoped TextEncoder and reducing full-body JSON.stringify calls (e.g., estimate only the prompt-bearing fields or cache the serialized form between compaction steps).
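A minimal version of the suggested optimization, assuming the helper names from the snippet above; the per-message summation is one possible way to avoid re-stringifying the whole body, not the PR's implementation.

```typescript
// Module-scoped encoder: TextEncoder is stateless, so one instance can be
// shared instead of allocating a new one per call.
const encoder = new TextEncoder()

function estimateJsonBytes(value: unknown): number {
  return encoder.encode(JSON.stringify(value)).length
}

// Estimate only the prompt-bearing field (messages) and sum per message,
// so a trimming loop can subtract a removed message's size incrementally
// rather than re-serializing the entire request body on every iteration.
function estimateMessagesBytes(messages: unknown[]): number {
  return messages.reduce<number>((sum, m) => sum + estimateJsonBytes(m), 0)
}
```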



Development

Successfully merging this pull request may close these issues.

Request too large (max 20MB). Double press esc to go back and try with a smaller file.
