feat(llm): pack chunks by token budget, parallelise, retry on truncation #625

jasonm4130 wants to merge 2 commits into safishamsi:v5
Conversation
Token-budget chunking (safishamsi#625) cuts the truncation rate on extract calls but doesn't eliminate it. Output token cost scales with extractable concept density rather than input tokens — a chunk that lands on a directory of dense design docs can fit comfortably under the input budget while needing more than `max_completion_tokens=8192` to express every named concept, so the response is truncated mid-string and `_parse_llm_json` returns an empty fragment. Pre-tuning chunk size to be conservative enough that this never happens leaves throughput on the table for the common case, and adding a hard `max_files_per_chunk` cap on top of `token_budget` reintroduces the "tune a static constant" problem that safishamsi#625 set out to fix.

The fix uses the API's own truncation signal:

1. `_call_openai_compat` and `_call_claude` now expose `finish_reason` on the result dict (Anthropic's `stop_reason == "max_tokens"` is normalised to `"length"`).
2. `_extract_with_adaptive_retry` checks it: when truncated, it splits the chunk in half and recurses on each half. Recursion is bounded by `max_retry_depth` (default 3 → at most 8x fanout per top-level chunk).
3. Single-file chunks that truncate can't recover (we can't make a file smaller than itself) and surface a warning rather than infinite-looping.
4. `extract_corpus_parallel` routes every chunk through the retry wrapper. The `on_chunk_done` callback fires once per top-level chunk with the merged result — recursive splits are invisible to callers.

This is signal-driven: chunks too dense to fit in one response self-heal by splitting until they do, while well-sized chunks pay no extra cost.

6 new tests in tests/test_chunking.py cover pass-through when not truncated, single-level split, recursive split, depth cap, the single-file unrecoverable case, and integration with extract_corpus_parallel plus the on_chunk_done contract. Full suite: 459 passed.

Builds on safishamsi#625 — that PR's token-budget chunking and the adaptive retry here are complementary: chunking makes most chunks fit, retry recovers the ones that don't.
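A minimal sketch of that retry shape, assuming a single `call_llm` wrapper that returns `{"text", "finish_reason"}` and pluggable `parse`/`merge` helpers (the real signature in the PR may differ):

```python
from typing import Callable

def _extract_with_adaptive_retry(
    chunk: list[str],                       # file paths packed into this chunk
    call_llm: Callable[[list[str]], dict],  # returns {"text": ..., "finish_reason": ...}
    parse: Callable[[str], dict],           # _parse_llm_json-style parser
    merge: Callable[[dict, dict], dict],    # merges two extraction fragments
    depth: int = 0,
    max_retry_depth: int = 3,
) -> dict:
    """Extract a chunk; if the response was truncated, split it in half and recurse."""
    resp = call_llm(chunk)

    if resp.get("finish_reason") != "length":
        return parse(resp["text"])          # well-sized chunk: no extra cost

    if len(chunk) <= 1 or depth >= max_retry_depth:
        # A single file can't be split further, and the depth cap bounds fanout
        # (default 3 -> at most 8 sub-chunks per top-level chunk).
        print(f"warning: truncated chunk of {len(chunk)} file(s) at depth {depth}; "
              "parsing best-effort")
        return parse(resp["text"])

    mid = len(chunk) // 2
    left = _extract_with_adaptive_retry(chunk[:mid], call_llm, parse, merge,
                                        depth + 1, max_retry_depth)
    right = _extract_with_adaptive_retry(chunk[mid:], call_llm, parse, merge,
                                         depth + 1, max_retry_depth)
    return merge(left, right)
```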
Three independent improvements to `extract_corpus_parallel`:

1. Token-aware chunking. Replaces `chunk_size=20` static packing with a greedy packer keyed on `token_budget` (default 60_000), grouped by parent directory so related artefacts share a chunk. Pass `token_budget=None` to fall back to fixed-count packing.
2. Optional tiktoken (added to the [kimi] extra). When available, `_estimate_file_tokens` uses cl100k_base for accurate counts; without it, the existing chars/4 heuristic kicks in. Kimi-K2 ships a tiktoken-based tokenizer, so estimates against Moonshot are very close to truth.
3. True parallelism. The function name said "parallel" but the body was a sequential for-loop. Now uses ThreadPoolExecutor capped at `max_concurrency` (default 4 — conservative against provider rate limits). `on_chunk_done(idx, total, result)` still fires once per chunk with the original submission idx so progress UIs work unchanged. `max_concurrency=1` skips the pool to preserve sequential semantics.

Plus failure tolerance: a chunk raising is now caught, logged to stderr, and the run continues. Other chunks' results merge as normal.

On a 162-file repo (~125k words), the same work that took ~36 min sequential under the old code finishes in ~7 min.
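A minimal sketch of the estimation and packing from points 1 and 2. `_estimate_file_tokens` and `_pack_chunks_by_tokens` are named in the PR; the directory-grouping and file-reading details here are assumptions:

```python
from collections import defaultdict
from pathlib import Path

try:
    import tiktoken                          # optional: installed via the [kimi] extra
    _ENC = tiktoken.get_encoding("cl100k_base")
except ImportError:                          # fall back to the chars/4 heuristic
    _ENC = None

def _estimate_file_tokens(path: Path) -> int:
    text = path.read_text(errors="replace")
    if _ENC is not None:
        return len(_ENC.encode(text))
    return max(1, len(text) // 4)

def _pack_chunks_by_tokens(files: list[Path], token_budget: int = 60_000) -> list[list[Path]]:
    """Greedy packer: group files by parent directory, then fill chunks up to the budget."""
    by_dir: dict[Path, list[Path]] = defaultdict(list)
    for f in files:
        by_dir[f.parent].append(f)

    chunks: list[list[Path]] = []
    current: list[Path] = []
    used = 0
    for directory in sorted(by_dir):         # related artefacts stay adjacent
        for f in by_dir[directory]:
            cost = _estimate_file_tokens(f)
            if current and used + cost > token_budget:
                chunks.append(current)
                current, used = [], 0
            current.append(f)                # an oversize single file still gets its own chunk
            used += cost
    if current:
        chunks.append(current)
    return chunks
```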
Hi, `extract_corpus_parallel()` logs per-chunk exceptions and skips them, but the returned merged dict provides no structured indication that the run was partial. Callers cannot reliably detect missing chunks without scraping stderr.

Severity: remediation recommended | Category: reliability

How to fix: return failure metadata to the caller.
Found by Qodo code review.
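A hypothetical shape for that metadata (field names and merge structure are illustrative, not the repo's actual API):

```python
def merge_with_report(chunk_results: list[tuple[int, dict | None, str | None]]) -> dict:
    """chunk_results: (chunk_index, result_or_None, error_or_None) per submitted chunk.

    Attach a sidecar report to the merged dict so callers can tell a partial run
    from a complete one without reading stderr.
    """
    merged: dict = {}
    failed: list[dict] = []
    for idx, result, error in chunk_results:
        if error is not None:
            failed.append({"chunk_index": idx, "error": error})
        else:
            merged.update(result or {})
    merged["_extraction_meta"] = {
        "total_chunks": len(chunk_results),
        "failed_chunks": failed,          # empty list means the run was complete
    }
    return merged
```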
Hi, when tiktoken is available, token-budget packing reads and tokenizes each file, and extraction then reads the same files again to build the prompt. This increases I/O and can noticeably slow startup on large corpora.

Severity: informational | Category: performance

How to fix: cache file content during estimation.
Found by Qodo code review.
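A small sketch of the suggested caching, assuming a shared read helper that both estimation and prompt building call (helper names are hypothetical):

```python
from functools import lru_cache
from pathlib import Path

@lru_cache(maxsize=None)
def _read_cached(path: str) -> str:
    """Read each file once; token estimation and prompt building reuse the result."""
    return Path(path).read_text(errors="replace")

def _estimate_file_tokens_cached(path: str) -> int:
    text = _read_cached(path)
    return max(1, len(text) // 4)    # or tiktoken's cl100k_base when available

def _build_prompt(paths: list[str]) -> str:
    return "\n\n".join(f"## {p}\n{_read_cached(p)}" for p in paths)
```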
Co-Authored-By: Jason Matthew <jasonm4130@gmail.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two commits, both improvements to `extract_corpus_parallel`. Reviewable independently.

## Summary
**Commit 1: token-budget chunking, parallelism, optional tiktoken**

- Replaces `chunk_size=20` static packing with greedy `_pack_chunks_by_tokens(token_budget=60_000)`, grouped by parent directory
- Adds `tiktoken` to the `[kimi]` extra; `_estimate_file_tokens` uses `cl100k_base` when available, falls back to chars/4 when not
- `ThreadPoolExecutor` capped at `max_concurrency=4`. `on_chunk_done(idx, total, result)` fires in completion order with the original submission `idx` so progress UIs work unchanged. `max_concurrency=1` skips the pool to preserve sequential semantics (executor/callback contract sketched below)
- `token_budget=None` falls back to legacy `chunk_size`-based packing for backwards compatibility
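A minimal sketch of the executor and callback contract from Commit 1, with `extract_chunk` standing in for the real per-chunk extraction call:

```python
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed

def extract_corpus_parallel_sketch(chunks, extract_chunk, on_chunk_done=None,
                                   max_concurrency=4):
    """Merge per-chunk results; fire on_chunk_done(idx, total, result) as chunks finish."""
    total = len(chunks)
    merged: dict = {}

    def finish(idx, result):
        merged.update(result)
        if on_chunk_done:
            on_chunk_done(idx, total, result)   # idx is the original submission index

    if max_concurrency == 1:
        # Sequential path: skip the pool entirely so call order is preserved.
        for idx, chunk in enumerate(chunks):
            finish(idx, extract_chunk(chunk))
        return merged

    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        futures = {pool.submit(extract_chunk, chunk): idx
                   for idx, chunk in enumerate(chunks)}
        for fut in as_completed(futures):
            idx = futures[fut]
            try:
                finish(idx, fut.result())
            except Exception as exc:
                # Failure tolerance: log the bad chunk and keep merging the rest.
                print(f"chunk {idx} failed: {exc!r}", file=sys.stderr)
    return merged
```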
finish_reason == "length"finish_reasonout of_call_openai_compatand_call_claude(Anthropic'sstop_reason == "max_tokens"is normalised to"length")_extract_with_adaptive_retry: when a chunk's response is truncated, split in half and recurse on each half. Recursion bounded bymax_retry_depth(default 3)extract_corpus_parallelroutes every chunk through the retry wrapper; recursive splits are invisible to callers (callback still fires once per top-level chunk with merged result)Why
## Why

`extract_corpus_parallel` had three issues that compounded on real corpora:

- `chunk_size=20` static packing has unbounded per-chunk cost
- the body was a sequential `for` loop despite the name

After fixing 1-3, a fourth issue surfaces: chunks too dense to fit their JSON output in `max_completion_tokens=8192` are silently truncated and contribute nothing. Adding a hard `max_files_per_chunk` cap reintroduces the "tune a static constant" problem the chunking commit set out to fix. The `finish_reason` signal is what the API gives us — acting on it is the principled fix.

## Test plan
- `max_concurrency=4`
- `max_concurrency=1` preserves call order
- `token_budget=None` reverts to fixed-count chunking (45 files / `chunk_size=20` = `[20, 20, 5]`)
- `finish_reason="stop"` passes through untouched
- `finish_reason="length"` triggers a split (one such test sketched below)
- `max_depth` bounds recursion (no infinite loop)
- `extract_corpus_parallel` — `on_chunk_done` fires once per top-level chunk
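One illustrative test for the split-on-truncation behaviour, written against the `_extract_with_adaptive_retry` sketch earlier on this page rather than the actual function signature in tests/test_chunking.py:

```python
def test_truncated_chunk_splits_once():
    calls = []

    def fake_llm(chunk):
        calls.append(list(chunk))
        # The full two-file chunk reports truncation;
        # the two recursive single-file calls complete normally.
        truncated = len(chunk) > 1
        return {"text": '{"concepts": []}',
                "finish_reason": "length" if truncated else "stop"}

    result = _extract_with_adaptive_retry(
        ["docs/a.md", "docs/b.md"],
        call_llm=fake_llm,
        parse=lambda text: {"concepts": []},
        merge=lambda a, b: {"concepts": a.get("concepts", []) + b.get("concepts", [])},
    )

    assert result == {"concepts": []}
    assert calls == [["docs/a.md", "docs/b.md"], ["docs/a.md"], ["docs/b.md"]]
```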
## Companion

#623 — kimi-k2.6 reasoning fix. Independent, can land in either order.