QVAC-19778 fix[api]: finish_reason=length + unified token accounting across chat-category routes#2477
Open
lauripiisang wants to merge 2 commits into
Open
QVAC-19778 fix[api]: finish_reason=length + unified token accounting across chat-category routes#2477lauripiisang wants to merge 2 commits into
lauripiisang wants to merge 2 commits into
Conversation
This comment was marked as outdated.
This comment was marked as outdated.
Contributor
Tier-based Approval Status |
2ac8659 to
e71da37
Compare
e71da37 to
f4d049c
Compare
Introduce drainCompletion (completion-result.ts) to consume the SDK event stream once: derives text, stats, stopReason, and finish_reason (tool_calls > length > stop). All three chat-category routes use it; per-route token-counting variants removed. Responses API: length truncation maps to status "incomplete" + incomplete_details.reason "max_output_tokens". Two e2e tests for finish_reason=length are known-failing pending SDK PR #2484 (stopReason="length" not yet emitted by the plugin). Six e2e max_tokens bumped 8-24 → 512 for Qwen3-600M <think> compat. Docs: /v1/videos* added to serve-openai.md and http-server mdx. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
34b780a to
5a79aa6
Compare
messages[].content now accepts OpenAI array format (text/image_url/ input_audio/file parts). Non-text parts are silently dropped; text parts are concatenated. Fixes 400 errors from Cline and Open WebUI. audio transcription/translation temperature was z.string() — OpenAI spec requires number. Fixes 400 for any client sending temperature:0.0. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
🎯 What problem does this PR solve?
finish_reasonwas hardcoded to"stop"on all chat-category routes, even when generation was cut off bymax_tokens(should be"length").completion_tokenswas counted three different ways: stats helper in/v1/chat/completions, whitespace split in/v1/completions(blocking), SSE-event counter in/v1/completions(streaming). The Responses route had its own inline fallback. All four could diverge for the same generation./v1/videos*endpoints were missing from both the in-repo and marketing-site docs.📝 How does it solve it?
drainCompletion(result, onToken?)inadapters/openai/completion-result.tsconsumes the SDKresult.eventsstream once and returns{ text, toolCalls, stats, stopReason, completionTokens, finishReason }.tool_calls > length > stop. All three chat routes (chat, completions, responses) now use it.stopReason === "length"maps tostatus: "incomplete"+incomplete_details.reason: "max_output_tokens"; the streaming path emitsresponse.incompleteinstead ofresponse.completed./v1/videos*topackages/cli/docs/serve-openai.mdanddocs/website/content/docs/cli/http-server/index.mdx.The two other QVAC-19778 follow-up items needed no code change — verified against the current tree:
req.on('close')→cancel({ requestId }).resolveAndCheckModel()/requireModel()(added in QVAC-19179 feat[bc]: rewrite serve HTTP layer on Fastify + Zod #2306) already centralise the preamble across all routes.🧪 How was it tested?
test/completion-result.test.ts: 20 cases coveringeos,length,stopSequence,tool_callsfinish-reason branches;stats.generatedTokensvs whitespace-fallback token counting; streamingonTokencallback ordering.test/responses.test.tsandtest/responses-streaming.test.ts: assertstatus: "incomplete"+incomplete_detailsonmax_tokenstruncation.test/e2e.batsagainst a real loaded LLM:max_tokens: 1with a long-output prompt, assertingfinish_reason == "length"andusage.completion_tokens == 1for both blocking and streaming paths.bun run test:unit). Lint and typecheck clean (bun run lint).npm run test:e2e, Qwen3-600M). Sixmax_tokensvalues bumped from 8–24 → 512 ine2e.batsfor tests that assertedfinish_reason: "stop": these were too tight for Qwen3's chain-of-thought<think>preamble after the fix correctly started returning"length"on budget exhaustion.🔌 API Changes
🔗 Dependencies
Depends on #2484 — SDK must emit
stopReason: "length"before the two known-failing e2e tests (94, 95) pass.