
feat(llm): pack chunks by token budget, parallelise, retry on truncation #625

Open
jasonm4130 wants to merge 2 commits into safishamsi:v5 from jasonm4130:feat/token-aware-chunking-parallel

Conversation

jasonm4130 (Contributor) commented Apr 30, 2026

Two commits, both improvements to extract_corpus_parallel. Reviewable independently.

Summary

Commit 1: token-budget chunking, parallelism, optional tiktoken

  • Replace chunk_size=20 static packing with greedy _pack_chunks_by_tokens(token_budget=60_000), grouped by parent directory (a sketch of the packing path follows this list)
  • Add tiktoken to the [kimi] extra; _estimate_file_tokens uses cl100k_base when available, falls back to chars/4 when not
  • Run chunks via ThreadPoolExecutor capped at max_concurrency=4. on_chunk_done(idx, total, result) fires in completion order with the original submission idx so progress UIs work unchanged. max_concurrency=1 skips the pool to preserve sequential semantics
  • Catch per-chunk exceptions, log to stderr, continue. One bad chunk no longer aborts the run
  • token_budget=None falls back to legacy chunk_size-based packing for backwards compatibility
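
For orientation, a minimal sketch of the packing path, assuming tiktoken is imported lazily inside the estimator. The function names mirror the PR; the bodies here are illustrative, not the code in graphify/llm.py:

```python
# Illustrative sketch only; names mirror the PR, internals are assumptions.
from pathlib import Path

def _estimate_file_tokens(path: Path) -> int:
    """Token estimate: tiktoken's cl100k_base when installed, chars/4 otherwise."""
    text = path.read_text(errors="replace")
    try:
        import tiktoken
        return len(tiktoken.get_encoding("cl100k_base").encode(text))
    except ImportError:
        return max(1, len(text) // 4)

def _pack_chunks_by_tokens(files: list[Path], token_budget: int = 60_000) -> list[list[Path]]:
    """Greedy packing of files (sorted by parent directory) into chunks that stay
    under token_budget; an oversized single file ends up alone in its own chunk."""
    if token_budget <= 0:
        raise ValueError("token_budget must be positive")
    chunks: list[list[Path]] = []
    current: list[Path] = []
    used = 0
    for f in sorted(files, key=lambda p: (str(p.parent), p.name)):
        cost = _estimate_file_tokens(f)
        if current and used + cost > token_budget:
            chunks.append(current)
            current, used = [], 0
        current.append(f)
        used += cost
    if current:
        chunks.append(current)
    return chunks
```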

Commit 2: adaptive retry on finish_reason == "length"

  • Plumb finish_reason out of _call_openai_compat and _call_claude (Anthropic's stop_reason == "max_tokens" is normalised to "length")
  • Add _extract_with_adaptive_retry: when a chunk's response is truncated, split in half and recurse on each half. Recursion bounded by max_retry_depth (default 3); see the sketch after this list
  • Single-file chunks that truncate can't recover — surface a warning rather than infinite-loop
  • extract_corpus_parallel routes every chunk through the retry wrapper; recursive splits are invisible to callers (callback still fires once per top-level chunk with merged result)
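
The retry shape, again as a hedged sketch: llm_call and the result-dict layout are stand-ins for the real extraction call, but the recursion bound and the single-file warning follow the bullets above.

```python
# Sketch of the adaptive retry; llm_call and the result shape are placeholders.
import sys

def _extract_with_adaptive_retry(files, llm_call, max_retry_depth=3, _depth=0):
    result = llm_call(files)  # e.g. {"concepts": [...], "finish_reason": "stop" or "length"}
    if result.get("finish_reason") != "length":
        return result
    if len(files) == 1 or _depth >= max_retry_depth:
        # A single file can't be split further; warn and keep whatever parsed.
        print(f"warning: truncated output for {files} could not be recovered", file=sys.stderr)
        return result
    mid = len(files) // 2
    left = _extract_with_adaptive_retry(files[:mid], llm_call, max_retry_depth, _depth + 1)
    right = _extract_with_adaptive_retry(files[mid:], llm_call, max_retry_depth, _depth + 1)
    return {"concepts": left.get("concepts", []) + right.get("concepts", []),
            "finish_reason": "stop"}
```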

Why

extract_corpus_parallel had three issues that compounded on real corpora:

| # | Issue | Concrete failure |
|---|-------|------------------|
| 1 | `chunk_size=20` static packing has unbounded per-chunk cost | A 162-file mixed code/docs/images repo (~125k words) packed unevenly; one PNG-heavy chunk hit 282k input tokens and got a 400 from Moonshot's 262k context limit |
| 2 | Function name says "parallel" but the body is a sequential for loop | The same 162-file repo took ~36 minutes wall-clock |
| 3 | A single chunk raising aborts the whole run | All preceding chunks' work is lost to one transient API error |

After fixing 1-3, a fourth issue surfaces: chunks too dense to fit their JSON output in max_completion_tokens=8192 are silently truncated and contribute nothing. Adding a hard max_files_per_chunk cap reintroduces the "tune a static constant" problem the chunking commit set out to fix. The finish_reason signal is what the API gives us — acting on it is the principled fix.
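
As a rough illustration of that signal, assuming the standard OpenAI and Anthropic response shapes (choices[0].finish_reason and stop_reason respectively); the actual plumbing in _call_openai_compat / _call_claude may differ:

```python
# Illustration only; the PR surfaces this value on the call's result dict.
def _normalise_finish_reason(provider: str, response) -> str:
    if provider == "anthropic":
        # Anthropic reports stop_reason == "max_tokens" when output is cut off.
        return "length" if response.stop_reason == "max_tokens" else "stop"
    # OpenAI-compatible APIs (including Moonshot) already report "length".
    return response.choices[0].finish_reason
```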

Test plan

  • Packer: small-file packing into one chunk
  • Packer: starts new chunk when next file would exceed budget
  • Packer: groups files from same directory contiguously
  • Packer: oversized single file gets its own chunk
  • Packer: rejects non-positive budget
  • Tokenizer: uses tiktoken when available (mocked)
  • Tokenizer: falls back to chars/4 when tiktoken is absent (rough pytest shape after this list)
  • Parallel: 4 chunks × 0.3s sleep finishes in <1s with max_concurrency=4
  • Sequential: max_concurrency=1 preserves call order
  • Resilience: simulated chunk failure logs to stderr, other chunks still merge
  • Legacy: token_budget=None reverts to fixed-count chunking (45 files / chunk_size=20 = [20, 20, 5])
  • Token-budget default: 50 tiny files pack into 1 chunk
  • Adaptive retry: pass-through when finish_reason="stop"
  • Adaptive retry: single-level split on finish_reason="length"
  • Adaptive retry: recursive split when halves are still truncated (8 → 4+4 → 2+2+2+2)
  • Adaptive retry: max_depth bounds recursion (no infinite loop)
  • Adaptive retry: single-file truncation surfaces warning instead of recursing
  • Adaptive retry: integration with extract_corpus_parallel (on_chunk_done fires once per top-level chunk)
  • Full existing test suite (459 tests total) passes with zero regressions
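
For the tokenizer fallback bullet, roughly how such a test could look, assuming _estimate_file_tokens imports tiktoken lazily; the patch target and module path are guesses, not the repo's actual test code:

```python
# Hypothetical test shape; assumes a lazy `import tiktoken` inside the estimator.
import builtins

def test_estimate_tokens_falls_back_without_tiktoken(tmp_path, monkeypatch):
    f = tmp_path / "doc.md"
    f.write_text("x" * 400)  # 400 chars -> 100 tokens under the chars/4 heuristic

    real_import = builtins.__import__
    def no_tiktoken(name, *args, **kwargs):
        if name == "tiktoken":
            raise ImportError("tiktoken unavailable")
        return real_import(name, *args, **kwargs)
    monkeypatch.setattr(builtins, "__import__", no_tiktoken)

    from graphify.llm import _estimate_file_tokens
    assert _estimate_file_tokens(f) == 100
```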

Companion

#623 — kimi-k2.6 reasoning fix. Independent, can land in either order.

jasonm4130 force-pushed the feat/token-aware-chunking-parallel branch from 4d85968 to b7073ce on April 30, 2026 at 11:41
jasonm4130 changed the title from "feat(llm): token-aware chunking, true parallelism, optional tiktoken (PR #2 of 2)" to "pack chunks by token budget, run them in parallel, accept tiktoken" on Apr 30, 2026
jasonm4130 added a commit to jasonm4130/graphify that referenced this pull request Apr 30, 2026
Token-budget chunking (safishamsi#625) cuts the truncation rate on extract calls
but doesn't eliminate it. Output token cost scales with extractable
concept density rather than input tokens — a chunk that lands on a
directory of dense design docs can fit comfortably under the input
budget while needing more than `max_completion_tokens=8192` to express
every named concept, so the response is truncated mid-string and
`_parse_llm_json` returns an empty fragment.

Pre-tuning chunk size to be conservative enough that this never happens
leaves throughput on the table for the common case. Adding a hard
`max_files_per_chunk` cap on top of `token_budget` reintroduces the
"tune a static constant" problem that safishamsi#625 set out to fix.

The fix uses the API's own truncation signal:

1. `_call_openai_compat` and `_call_claude` now expose `finish_reason`
   on the result dict (Anthropic's `stop_reason == "max_tokens"` is
   normalised to `"length"`).
2. `_extract_with_adaptive_retry` checks it: when truncated, splits the
   chunk in half and recurses on each half. Recursion is bounded by
   `max_retry_depth` (default 3 → at most 8x fanout per top-level chunk).
3. Single-file chunks that truncate can't recover (we can't make a file
   smaller than itself) and surface a warning rather than infinite-loop.
4. `extract_corpus_parallel` routes every chunk through the retry
   wrapper. The `on_chunk_done` callback fires once per top-level chunk
   with the merged result — recursive splits are invisible to callers.

This is signal-driven: chunks too dense to fit in one response self-heal
by splitting until they do, while well-sized chunks pay no extra cost.

6 new tests in tests/test_chunking.py cover pass-through when not
truncated, single-level split, recursive split, depth cap,
single-file unrecoverable case, and integration with
extract_corpus_parallel + the on_chunk_done contract. Full suite at
459 passed.

Builds on safishamsi#625 — that PR's token-budget chunking and the adaptive
retry here are complementary: chunking makes most chunks fit, retry
recovers the ones that don't.
jasonm4130 changed the title from "pack chunks by token budget, run them in parallel, accept tiktoken" to "pack chunks by token budget, run them in parallel, retry on truncation" on Apr 30, 2026
Three independent improvements to extract_corpus_parallel:

1. Token-aware chunking. Replaces `chunk_size=20` static packing with
   a greedy packer keyed on `token_budget` (default 60_000), grouped
   by parent directory so related artefacts share a chunk. Pass
   `token_budget=None` to fall back to fixed-count packing.

2. Optional tiktoken (added to the [kimi] extra). When available,
   `_estimate_file_tokens` uses cl100k_base for accurate counts;
   without it, the existing chars/4 heuristic kicks in. Kimi-K2 ships
   a tiktoken-based tokenizer so estimates against Moonshot are very
   close to truth.

3. True parallelism. The function name said "parallel" but the body
   was a sequential for-loop. Now uses ThreadPoolExecutor capped at
   `max_concurrency` (default 4 — conservative against provider rate
   limits). `on_chunk_done(idx, total, result)` still fires once per
   chunk with the original submission idx so progress UIs work
   unchanged. `max_concurrency=1` skips the pool to preserve
   sequential semantics.

Plus failure tolerance: a chunk raising is now caught, logged to
stderr, and the run continues. Other chunks' results merge as normal.

On a 162-file repo (~125k words), the same work that took ~36 min
sequential under the old code finishes in ~7 min.
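
A sketch of the dispatch described in item 3 and the failure-tolerance paragraph above; extract_one stands in for the per-chunk extraction call, and the real function signature differs:

```python
# Illustrative dispatch loop; not the real extract_corpus_parallel signature.
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed

def _run_chunks(chunks, extract_one, on_chunk_done=None, max_concurrency=4):
    total = len(chunks)
    merged: dict = {}
    if max_concurrency == 1:
        for idx, chunk in enumerate(chunks):   # sequential path, original order
            result = extract_one(chunk)
            merged.update(result)
            if on_chunk_done:
                on_chunk_done(idx, total, result)
        return merged
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        futures = {pool.submit(extract_one, chunk): idx
                   for idx, chunk in enumerate(chunks)}
        for fut in as_completed(futures):      # completion order, original idx
            idx = futures[fut]
            try:
                result = fut.result()
            except Exception as exc:           # one bad chunk no longer aborts the run
                print(f"chunk {idx} failed: {exc}", file=sys.stderr)
                continue
            merged.update(result)
            if on_chunk_done:
                on_chunk_done(idx, total, result)
    return merged
```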

Token-budget chunking cuts the truncation rate but doesn't eliminate
it. Output token cost scales with extractable concept density rather
than input tokens — a chunk that lands on a directory of dense design
docs can pack under the input budget while needing more than
`max_completion_tokens=8192` to express every named concept, so the
response is truncated mid-string and `_parse_llm_json` returns an
empty fragment.

Pre-tuning chunk size to be conservative enough that this never
happens leaves throughput on the table for the common case. Adding a
hard `max_files_per_chunk` cap on top of `token_budget` reintroduces
the "tune a static constant" problem the previous commit set out to
fix.

The fix uses the API's own truncation signal:

1. `_call_openai_compat` and `_call_claude` now expose `finish_reason`
   on the result dict (Anthropic's `stop_reason == "max_tokens"` is
   normalised to `"length"`).
2. `_extract_with_adaptive_retry` checks it: when truncated, splits
   the chunk in half and recurses on each half. Recursion is bounded
   by `max_retry_depth` (default 3 → at most 8x fanout per top-level
   chunk).
3. Single-file chunks that truncate can't recover and surface a
   warning rather than infinite-loop.
4. `extract_corpus_parallel` routes every chunk through the retry
   wrapper. The `on_chunk_done` callback fires once per top-level
   chunk with the merged result — recursive splits are invisible to
   callers.
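
Building on the sketches earlier in this thread, the routing that point 4 describes might look roughly like this. The real extract_corpus_parallel takes a corpus path, token_budget, and so on; this shows only the wiring idea:

```python
# Hypothetical wiring; mirrors what point 4 describes, not the real signature.
def _route_chunks_through_retry(chunks, llm_call, on_chunk_done=None,
                                max_concurrency=4, max_retry_depth=3):
    def extract_one(chunk):
        # Splits triggered inside the retry wrapper stay internal:
        # each top-level chunk still yields exactly one merged result.
        return _extract_with_adaptive_retry(chunk, llm_call, max_retry_depth)
    return _run_chunks(chunks, extract_one, on_chunk_done, max_concurrency)
```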
jasonm4130 force-pushed the feat/token-aware-chunking-parallel branch from b6f154a to 2d13a17 on April 30, 2026 at 12:08
jasonm4130 changed the title from "pack chunks by token budget, run them in parallel, retry on truncation" to "feat(llm): pack chunks by token budget, parallelise, retry on truncation" on Apr 30, 2026
@Qodo-Free-For-OSS

Hi, extract_corpus_parallel() logs per-chunk exceptions and skips them, but the returned merged dict provides no structured indication that the run was partial. Callers cannot reliably detect missing chunks without scraping stderr.

Severity: remediation recommended | Category: reliability

How to fix: Return failure metadata to caller

Agent prompt to fix - you can give this to your LLM of choice:

Issue description

extract_corpus_parallel() continues after chunk errors but does not expose failures in the returned value.

Issue Context

This is a behavioral change from aborting-on-error to best-effort. Best-effort is fine, but callers need a programmatic signal that results may be incomplete.

Fix Focus Areas

  • graphify/llm.py[349-446]

Recommended change

Add structured failure reporting, e.g. (one possible shape is sketched below):

  • Maintain failed_chunks: list[dict] with {idx, error} (and maybe chunk_files), and include it in the returned dict.
  • Optionally add a fail_fast: bool = False parameter to restore old semantics when desired.
  • Consider incrementing/returning chunks_succeeded/chunks_failed counts for easy UI reporting.
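
One possible shape for that metadata; field names are illustrative, not an existing graphify API:

```python
# Example of what extract_corpus_parallel could return under this suggestion.
result = {
    "concepts": [...],          # merged output from the chunks that succeeded
    "chunks_succeeded": 11,
    "chunks_failed": 1,
    "failed_chunks": [
        {"idx": 7, "error": "RateLimitError: 429", "chunk_files": ["docs/design/auth.md"]},
    ],
}
```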

Found by Qodo. Free code review for open-source maintainers.

@Qodo-Free-For-OSS

Hi, when tiktoken is available, token-budget packing reads and tokenizes each file, and extraction then reads the same files again to build the prompt. This increases I/O and can noticeably slow startup on large corpora.

Severity: informational | Category: performance

How to fix: Cache content during estimation

Agent prompt to fix - you can give this to your LLM of choice:

Issue description

With tiktoken installed, files are read once to estimate tokens and again to build the prompt.

Issue Context

This can add significant overhead for large corpora.

Fix Focus Areas

  • graphify/llm.py[208-269]
  • graphify/llm.py[80-93]

Suggested approaches

  • Cache per-file truncated content during packing (e.g., dict[Path, str]) and allow extract_files_direct() / _read_files() to accept pre-read content (see the sketch below).
  • Or, estimate using st_size only as a fast heuristic even when tiktoken exists (but then accept less accurate packing).
  • Or, add a flag to disable tiktoken-based estimation when throughput matters.
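
A sketch of the first approach, assuming a module-level cache keyed by path; the real _read_files / extract_files_direct signatures may differ:

```python
# Illustrative cache; not graphify's actual helpers.
from pathlib import Path

_content_cache: dict[Path, str] = {}

def _estimate_file_tokens_cached(path: Path) -> int:
    # Read (and tokenize) each file once; the cached text is reused later.
    if path not in _content_cache:
        _content_cache[path] = path.read_text(errors="replace")
    return max(1, len(_content_cache[path]) // 4)  # swap in tiktoken when installed

def _read_files(paths: list[Path]) -> dict[Path, str]:
    # Prompt construction pulls from the cache instead of re-reading from disk.
    out: dict[Path, str] = {}
    for p in paths:
        if p not in _content_cache:
            _content_cache[p] = p.read_text(errors="replace")
        out[p] = _content_cache[p]
    return out
```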

Found by Qodo code review

safishamsi added a commit that referenced this pull request May 2, 2026
Co-Authored-By: Jason Matthew <jasonm4130@gmail.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>