
feat(llm): pack chunks by token budget, parallelise, retry on truncation #625

Open
jasonm4130 wants to merge 2 commits into safishamsi:v5 from jasonm4130:feat/token-aware-chunking-parallel

Conversation

jasonm4130 (Contributor) commented Apr 30, 2026

Two commits, both improvements to extract_corpus_parallel. Reviewable independently.

Summary

Commit 1: token-budget chunking, parallelism, optional tiktoken

  • Replace chunk_size=20 static packing with greedy _pack_chunks_by_tokens(token_budget=60_000), grouped by parent directory (a sketch of the packing path follows this list)
  • Add tiktoken to the [kimi] extra; _estimate_file_tokens uses cl100k_base when available, falls back to chars/4 when not
  • Run chunks via ThreadPoolExecutor capped at max_concurrency=4. on_chunk_done(idx, total, result) fires in completion order with the original submission idx so progress UIs work unchanged. max_concurrency=1 skips the pool to preserve sequential semantics
  • Catch per-chunk exceptions, log to stderr, continue. One bad chunk no longer aborts the run
  • token_budget=None falls back to legacy chunk_size-based packing for backwards compatibility
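
For orientation, a minimal sketch of the packing path, assuming tiktoken is imported lazily inside the estimator. The function names mirror the PR; the bodies here are illustrative, not the code in graphify/llm.py:

```python
# Illustrative sketch only; names mirror the PR, internals are assumptions.
from pathlib import Path

def _estimate_file_tokens(path: Path) -> int:
    """Token estimate: tiktoken's cl100k_base when installed, chars/4 otherwise."""
    text = path.read_text(errors="replace")
    try:
        import tiktoken
        return len(tiktoken.get_encoding("cl100k_base").encode(text))
    except ImportError:
        return max(1, len(text) // 4)

def _pack_chunks_by_tokens(files: list[Path], token_budget: int = 60_000) -> list[list[Path]]:
    """Greedy packing of files (sorted by parent directory) into chunks that stay
    under token_budget; an oversized single file ends up alone in its own chunk."""
    if token_budget <= 0:
        raise ValueError("token_budget must be positive")
    chunks: list[list[Path]] = []
    current: list[Path] = []
    used = 0
    for f in sorted(files, key=lambda p: (str(p.parent), p.name)):
        cost = _estimate_file_tokens(f)
        if current and used + cost > token_budget:
            chunks.append(current)
            current, used = [], 0
        current.append(f)
        used += cost
    if current:
        chunks.append(current)
    return chunks
```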

Commit 2: adaptive retry on finish_reason == "length"

  • Plumb finish_reason out of _call_openai_compat and _call_claude (Anthropic's stop_reason == "max_tokens" is normalised to "length")
  • Add _extract_with_adaptive_retry: when a chunk's response is truncated, split in half and recurse on each half. Recursion bounded by max_retry_depth (default 3); see the sketch after this list
  • Single-file chunks that truncate can't recover — surface a warning rather than infinite-loop
  • extract_corpus_parallel routes every chunk through the retry wrapper; recursive splits are invisible to callers (callback still fires once per top-level chunk with merged result)
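
The retry shape, again as a hedged sketch: llm_call and the result-dict layout are stand-ins for the real extraction call, but the recursion bound and the single-file warning follow the bullets above.

```python
# Sketch of the adaptive retry; llm_call and the result shape are placeholders.
import sys

def _extract_with_adaptive_retry(files, llm_call, max_retry_depth=3, _depth=0):
    result = llm_call(files)  # e.g. {"concepts": [...], "finish_reason": "stop" or "length"}
    if result.get("finish_reason") != "length":
        return result
    if len(files) == 1 or _depth >= max_retry_depth:
        # A single file can't be split further; warn and keep whatever parsed.
        print(f"warning: truncated output for {files} could not be recovered", file=sys.stderr)
        return result
    mid = len(files) // 2
    left = _extract_with_adaptive_retry(files[:mid], llm_call, max_retry_depth, _depth + 1)
    right = _extract_with_adaptive_retry(files[mid:], llm_call, max_retry_depth, _depth + 1)
    return {"concepts": left.get("concepts", []) + right.get("concepts", []),
            "finish_reason": "stop"}
```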

Why

extract_corpus_parallel had three issues that compounded on real corpora:

| # | Issue | Concrete failure |
|---|-------|------------------|
| 1 | `chunk_size=20` static packing has unbounded per-chunk cost | A 162-file mixed code/docs/images repo (~125k words) packed unevenly; one PNG-heavy chunk hit 282k input tokens and got a 400 from Moonshot's 262k context limit |
| 2 | Function name says "parallel" but the body is a sequential for loop | The same 162-file repo took ~36 minutes wall-clock |
| 3 | A single chunk raising aborts the whole run | All preceding chunks' work is lost to one transient API error |

After fixing 1-3, a fourth issue surfaces: chunks too dense to fit their JSON output in max_completion_tokens=8192 are silently truncated and contribute nothing. Adding a hard max_files_per_chunk cap reintroduces the "tune a static constant" problem the chunking commit set out to fix. The finish_reason signal is what the API gives us — acting on it is the principled fix.
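
As a rough illustration of that signal, assuming the standard OpenAI and Anthropic response shapes (choices[0].finish_reason and stop_reason respectively); the actual plumbing in _call_openai_compat / _call_claude may differ:

```python
# Illustration only; the PR surfaces this value on the call's result dict.
def _normalise_finish_reason(provider: str, response) -> str:
    if provider == "anthropic":
        # Anthropic reports stop_reason == "max_tokens" when output is cut off.
        return "length" if response.stop_reason == "max_tokens" else "stop"
    # OpenAI-compatible APIs (including Moonshot) already report "length".
    return response.choices[0].finish_reason
```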

Test plan

  • Packer: small-file packing into one chunk
  • Packer: starts new chunk when next file would exceed budget
  • Packer: groups files from same directory contiguously
  • Packer: oversized single file gets its own chunk
  • Packer: rejects non-positive budget
  • Tokenizer: uses tiktoken when available (mocked)
  • Tokenizer: falls back to chars/4 when tiktoken is absent (rough pytest shape after this list)
  • Parallel: 4 chunks × 0.3s sleep finishes in <1s with max_concurrency=4
  • Sequential: max_concurrency=1 preserves call order
  • Resilience: simulated chunk failure logs to stderr, other chunks still merge
  • Legacy: token_budget=None reverts to fixed-count chunking (45 files / chunk_size=20 = [20, 20, 5])
  • Token-budget default: 50 tiny files pack into 1 chunk
  • Adaptive retry: pass-through when finish_reason="stop"
  • Adaptive retry: single-level split on finish_reason="length"
  • Adaptive retry: recursive split when halves are still truncated (8 → 4+4 → 2+2+2+2)
  • Adaptive retry: max_depth bounds recursion (no infinite loop)
  • Adaptive retry: single-file truncation surfaces warning instead of recursing
  • Adaptive retry: integration with extract_corpus_parallel (on_chunk_done fires once per top-level chunk)
  • Full existing test suite (459 tests total) passes with zero regressions
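
For the tokenizer fallback bullet, roughly how such a test could look, assuming _estimate_file_tokens imports tiktoken lazily; the patch target and module path are guesses, not the repo's actual test code:

```python
# Hypothetical test shape; assumes a lazy `import tiktoken` inside the estimator.
import builtins

def test_estimate_tokens_falls_back_without_tiktoken(tmp_path, monkeypatch):
    f = tmp_path / "doc.md"
    f.write_text("x" * 400)  # 400 chars -> 100 tokens under the chars/4 heuristic

    real_import = builtins.__import__
    def no_tiktoken(name, *args, **kwargs):
        if name == "tiktoken":
            raise ImportError("tiktoken unavailable")
        return real_import(name, *args, **kwargs)
    monkeypatch.setattr(builtins, "__import__", no_tiktoken)

    from graphify.llm import _estimate_file_tokens
    assert _estimate_file_tokens(f) == 100
```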

Companion

#623 — kimi-k2.6 reasoning fix. Independent, can land in either order.

jasonm4130 force-pushed the feat/token-aware-chunking-parallel branch from 4d85968 to b7073ce on April 30, 2026 at 11:41
jasonm4130 changed the title from "feat(llm): token-aware chunking, true parallelism, optional tiktoken (PR #2 of 2)" to "pack chunks by token budget, run them in parallel, accept tiktoken" on Apr 30, 2026
jasonm4130 added a commit to jasonm4130/graphify that referenced this pull request Apr 30, 2026
Token-budget chunking (safishamsi#625) cuts the truncation rate on extract calls
but doesn't eliminate it. Output token cost scales with extractable
concept density rather than input tokens — a chunk that lands on a
directory of dense design docs can fit comfortably under the input
budget while needing more than `max_completion_tokens=8192` to express
every named concept, so the response is truncated mid-string and
`_parse_llm_json` returns an empty fragment.

Pre-tuning chunk size to be conservative enough that this never happens
leaves throughput on the table for the common case. Adding a hard
`max_files_per_chunk` cap on top of `token_budget` reintroduces the
"tune a static constant" problem that safishamsi#625 set out to fix.

The fix uses the API's own truncation signal:

1. `_call_openai_compat` and `_call_claude` now expose `finish_reason`
   on the result dict (Anthropic's `stop_reason == "max_tokens"` is
   normalised to `"length"`).
2. `_extract_with_adaptive_retry` checks it: when truncated, splits the
   chunk in half and recurses on each half. Recursion is bounded by
   `max_retry_depth` (default 3 → at most 8x fanout per top-level chunk).
3. Single-file chunks that truncate can't recover (we can't make a file
   smaller than itself) and surface a warning rather than infinite-loop.
4. `extract_corpus_parallel` routes every chunk through the retry
   wrapper. The `on_chunk_done` callback fires once per top-level chunk
   with the merged result — recursive splits are invisible to callers.

This is signal-driven: chunks too dense to fit in one response self-heal
by splitting until they do, while well-sized chunks pay no extra cost.

6 new tests in tests/test_chunking.py cover pass-through when not
truncated, single-level split, recursive split, depth cap,
single-file unrecoverable case, and integration with
extract_corpus_parallel + the on_chunk_done contract. Full suite at
459 passed.

Builds on safishamsi#625 — that PR's token-budget chunking and the adaptive
retry here are complementary: chunking makes most chunks fit, retry
recovers the ones that don't.
jasonm4130 changed the title from "pack chunks by token budget, run them in parallel, accept tiktoken" to "pack chunks by token budget, run them in parallel, retry on truncation" on Apr 30, 2026
Three independent improvements to extract_corpus_parallel:

1. Token-aware chunking. Replaces `chunk_size=20` static packing with
   a greedy packer keyed on `token_budget` (default 60_000), grouped
   by parent directory so related artefacts share a chunk. Pass
   `token_budget=None` to fall back to fixed-count packing.

2. Optional tiktoken (added to the [kimi] extra). When available,
   `_estimate_file_tokens` uses cl100k_base for accurate counts;
   without it, the existing chars/4 heuristic kicks in. Kimi-K2 ships
   a tiktoken-based tokenizer so estimates against Moonshot are very
   close to truth.

3. True parallelism. The function name said "parallel" but the body
   was a sequential for-loop. Now uses ThreadPoolExecutor capped at
   `max_concurrency` (default 4 — conservative against provider rate
   limits). `on_chunk_done(idx, total, result)` still fires once per
   chunk with the original submission idx so progress UIs work
   unchanged. `max_concurrency=1` skips the pool to preserve
   sequential semantics.

Plus failure tolerance: a chunk raising is now caught, logged to
stderr, and the run continues. Other chunks' results merge as normal.

On a 162-file repo (~125k words), the same work that took ~36 min
sequential under the old code finishes in ~7 min.
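
A sketch of the dispatch described in item 3 and the failure-tolerance paragraph above; extract_one stands in for the per-chunk extraction call, and the real function signature differs:

```python
# Illustrative dispatch loop; not the real extract_corpus_parallel signature.
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed

def _run_chunks(chunks, extract_one, on_chunk_done=None, max_concurrency=4):
    total = len(chunks)
    merged: dict = {}
    if max_concurrency == 1:
        for idx, chunk in enumerate(chunks):   # sequential path, original order
            result = extract_one(chunk)
            merged.update(result)
            if on_chunk_done:
                on_chunk_done(idx, total, result)
        return merged
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        futures = {pool.submit(extract_one, chunk): idx
                   for idx, chunk in enumerate(chunks)}
        for fut in as_completed(futures):      # completion order, original idx
            idx = futures[fut]
            try:
                result = fut.result()
            except Exception as exc:           # one bad chunk no longer aborts the run
                print(f"chunk {idx} failed: {exc}", file=sys.stderr)
                continue
            merged.update(result)
            if on_chunk_done:
                on_chunk_done(idx, total, result)
    return merged
```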

Token-budget chunking cuts the truncation rate but doesn't eliminate
it. Output token cost scales with extractable concept density rather
than input tokens — a chunk that lands on a directory of dense design
docs can pack under the input budget while needing more than
`max_completion_tokens=8192` to express every named concept, so the
response is truncated mid-string and `_parse_llm_json` returns an
empty fragment.

Pre-tuning chunk size to be conservative enough that this never
happens leaves throughput on the table for the common case. Adding a
hard `max_files_per_chunk` cap on top of `token_budget` reintroduces
the "tune a static constant" problem the previous commit set out to
fix.

The fix uses the API's own truncation signal:

1. `_call_openai_compat` and `_call_claude` now expose `finish_reason`
   on the result dict (Anthropic's `stop_reason == "max_tokens"` is
   normalised to `"length"`).
2. `_extract_with_adaptive_retry` checks it: when truncated, splits
   the chunk in half and recurses on each half. Recursion is bounded
   by `max_retry_depth` (default 3 → at most 8x fanout per top-level
   chunk).
3. Single-file chunks that truncate can't recover and surface a
   warning rather than infinite-loop.
4. `extract_corpus_parallel` routes every chunk through the retry
   wrapper. The `on_chunk_done` callback fires once per top-level
   chunk with the merged result — recursive splits are invisible to
   callers.
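
Building on the sketches earlier in this thread, the routing that point 4 describes might look roughly like this. The real extract_corpus_parallel takes a corpus path, token_budget, and so on; this shows only the wiring idea:

```python
# Hypothetical wiring; mirrors what point 4 describes, not the real signature.
def _route_chunks_through_retry(chunks, llm_call, on_chunk_done=None,
                                max_concurrency=4, max_retry_depth=3):
    def extract_one(chunk):
        # Splits triggered inside the retry wrapper stay internal:
        # each top-level chunk still yields exactly one merged result.
        return _extract_with_adaptive_retry(chunk, llm_call, max_retry_depth)
    return _run_chunks(chunks, extract_one, on_chunk_done, max_concurrency)
```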
jasonm4130 force-pushed the feat/token-aware-chunking-parallel branch from b6f154a to 2d13a17 on April 30, 2026 at 12:08
jasonm4130 changed the title from "pack chunks by token budget, run them in parallel, retry on truncation" to "feat(llm): pack chunks by token budget, parallelise, retry on truncation" on Apr 30, 2026
@Qodo-Free-For-OSS

Hi, extract_corpus_parallel() logs per-chunk exceptions and skips them, but the returned merged dict provides no structured indication that the run was partial. Callers cannot reliably detect missing chunks without scraping stderr.

Severity: remediation recommended | Category: reliability

How to fix: Return failure metadata to caller

Agent prompt to fix - you can give this to your LLM of choice:

Issue description

extract_corpus_parallel() continues after chunk errors but does not expose failures in the returned value.

Issue Context

This is a behavioral change from aborting-on-error to best-effort. Best-effort is fine, but callers need a programmatic signal that results may be incomplete.

Fix Focus Areas

  • graphify/llm.py[349-446]

Recommended change

Add structured failure reporting, e.g. (one possible shape is sketched below):

  • Maintain failed_chunks: list[dict] with {idx, error} (and maybe chunk_files), and include it in the returned dict.
  • Optionally add a fail_fast: bool = False parameter to restore old semantics when desired.
  • Consider incrementing/returning chunks_succeeded/chunks_failed counts for easy UI reporting.
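
One possible shape for that metadata; field names are illustrative, not an existing graphify API:

```python
# Example of what extract_corpus_parallel could return under this suggestion.
result = {
    "concepts": [...],          # merged output from the chunks that succeeded
    "chunks_succeeded": 11,
    "chunks_failed": 1,
    "failed_chunks": [
        {"idx": 7, "error": "RateLimitError: 429", "chunk_files": ["docs/design/auth.md"]},
    ],
}
```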

Found by Qodo. Free code review for open-source maintainers.

@Qodo-Free-For-OSS

Hi, when tiktoken is available, token-budget packing reads and tokenizes each file, and extraction then reads the same files again to build the prompt. This increases I/O and can noticeably slow startup on large corpora.

Severity: informational | Category: performance

How to fix: Cache content during estimation

Agent prompt to fix - you can give this to your LLM of choice:

Issue description

With tiktoken installed, files are read once to estimate tokens and again to build the prompt.

Issue Context

This can add significant overhead for large corpora.

Fix Focus Areas

  • graphify/llm.py[208-269]
  • graphify/llm.py[80-93]

Suggested approaches

  • Cache per-file truncated content during packing (e.g., dict[Path, str]) and allow extract_files_direct() / _read_files() to accept pre-read content (see the sketch below).
  • Or, estimate using st_size only as a fast heuristic even when tiktoken exists (but then accept less accurate packing).
  • Or, add a flag to disable tiktoken-based estimation when throughput matters.
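
A sketch of the first approach, assuming a module-level cache keyed by path; the real _read_files / extract_files_direct signatures may differ:

```python
# Illustrative cache; not graphify's actual helpers.
from pathlib import Path

_content_cache: dict[Path, str] = {}

def _estimate_file_tokens_cached(path: Path) -> int:
    # Read (and tokenize) each file once; the cached text is reused later.
    if path not in _content_cache:
        _content_cache[path] = path.read_text(errors="replace")
    return max(1, len(_content_cache[path]) // 4)  # swap in tiktoken when installed

def _read_files(paths: list[Path]) -> dict[Path, str]:
    # Prompt construction pulls from the cache instead of re-reading from disk.
    out: dict[Path, str] = {}
    for p in paths:
        if p not in _content_cache:
            _content_cache[p] = p.read_text(errors="replace")
        out[p] = _content_cache[p]
    return out
```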

Found by Qodo code review

safishamsi added a commit that referenced this pull request May 2, 2026
Co-Authored-By: Jason Matthew <jasonm4130@gmail.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>