Skip to content

QVAC-18395: Continuous Batching (text, single-job)#2327

Open
jesusmb1995 wants to merge 29 commits into
tetherto:mainfrom
jesusmb1995:continuousBatching6
Open

QVAC-18395: Continuous Batching (text, single-job)#2327
jesusmb1995 wants to merge 29 commits into
tetherto:mainfrom
jesusmb1995:continuousBatching6

Conversation

@jesusmb1995

@jesusmb1995 jesusmb1995 commented May 28, 2026

Copy link
Copy Markdown
Contributor

The problem

When a single request is in flight, the GPU sits idle between token generation steps while the sequence waits on sampling or I/O. Nothing batches work across concurrent callers, so each request effectively serialises the whole pipeline.

What this does

This PR introduces continuous batching: a native decode loop (ContinuousBatchScheduler) that, on every step, pulls tokens from all active sequences into a single shared batch. Instead of one sequence at a time, the GPU sees a full batch drawn from however many callers are currently in flight.

Two components handle the concurrency machinery. MultiRequestBatcher manages admission and maps each incoming sequence to a slot. SequenceDriver owns the per-slot generation state: KV cache position, token budget, stop conditions. Intake runs through a lock-free moodycamel/concurrentqueue port so admission doesn't block the decode loop.

Interface

On the JS side, run() now accepts either a single Message[] (unchanged) or an array of prompts as Message[][] or BatchPrompt[] wrappers, which carry optional ids and per-item runOptions. It returns a BatchResponse that
streams BatchOutputChunk events keyed by sequence id and resolves await() to an ordered BatchResult[]. RuntimeStats gains avgConcurrentSeq to report observed decode concurrency over a run. The level of parallelism can be enabled and controlled with parallel flag when building the model.

Limitations

Text-only. Vision will be handled on a separate PR.

For now JS interface limited to single-job. This means that to truly take advantage of parallelism multiple messages needs to be bundled in same run() call. Multiple separate consecutive run calls with their own messages will not take advantage of the parallelism. This will be addressed later.

Testing

Integration tests cover basic multi-prompt batching, per-item generationParams, streaming output correlated by id, and concurrent admission. On the C++ side, unit tests cover MultiRequestBatcher slot lifecycle, the ContinuousBatchScheduler decode loop and budget enforcement, and a regression for the n_predict single-prompt guard.

API changes

  • run(prompt) accepts Message[] (unchanged) or BatchPrompt[] /
    Message[][] (new batch path).
  • New interfaces: BatchPrompt, BatchOutputChunk, BatchResult,
    BatchResponse.
  • RuntimeStats.avgConcurrentSeq: number added.

See previous PR #1890

@jesusmb1995

This comment was marked as spam.

Comment thread packages/llm-llamacpp/test/unit/test_context_slider.cpp
Comment thread packages/llm-llamacpp/addon/src/model-interface/ContinuousBatchScheduler.hpp Outdated
@github-actions

github-actions Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ❌ PENDING

**Requirements:**
- 1 Team Member approval ❌ (0/1)
- 1 Team Lead OR Management approval ❌ (0/1)



---
*This comment is automatically updated when reviews change.*

@jesusmb1995 jesusmb1995 force-pushed the continuousBatching6 branch from 60bb8d0 to 6a5aa0a Compare May 29, 2026 12:39
@jesusmb1995

This comment was marked as resolved.

@jesusmb1995 jesusmb1995 force-pushed the continuousBatching6 branch 2 times, most recently from fb8fac2 to a793de2 Compare May 29, 2026 12:45
@jesusmb1995 jesusmb1995 self-assigned this May 29, 2026
@jesusmb1995 jesusmb1995 added verify verified Authorize secrets / label-gate in PR workflows labels May 29, 2026
@jesusmb1995 jesusmb1995 force-pushed the continuousBatching6 branch from a793de2 to 935dd3b Compare May 29, 2026 13:52
@jesusmb1995 jesusmb1995 force-pushed the continuousBatching6 branch from 935dd3b to 56444af Compare May 29, 2026 13:59
Comment thread packages/llm-llamacpp/index.d.ts
gianni-cor

This comment was marked as resolved.

Comment thread packages/llm-llamacpp/index.js
Comment thread packages/llm-llamacpp/index.d.ts Outdated
Comment thread packages/llm-llamacpp/addon/src/model-interface/ContinuousBatchScheduler.cpp Outdated
state_->batchScheduler_->runtimeStats();
constexpr double kMillisInSecond = 1000.0;
const double elapsedMs = stats.elapsedMs();
const double tokensPerSecond =

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we compute batch TPS and ppTPS from llama.cpp's aggregate context-level perf counters instead of the scheduler wall-clock timer? For batch we cannot get exact per-prompt compute time from llama.cpp, but the aggregate batch metrics can still mirror the single-prompt formulas: TPS = 1000 / perfData.t_eval_ms * perfData.n_eval and ppTPS = 1000 / perfData.t_p_eval_ms * perfData.n_p_eval. That also restores ppTPS for batch stats and keeps these fields based on llama.cpp timing rather than external elapsed time.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: this explanation was AI-generated (reviewed by me before posting).

I looked into this and I don't think the llama.cpp counters work for the batch path. Using them would make TPS and ppTPS worse than the wall-clock number we have now, so I'd push back on the suggestion.

Here's the issue. llama.cpp decides whether a decode step is prefill or generation purely from the batch size. In llama-context.cpp: n_queued_tokens == 1 goes to generation (t_eval/n_eval), > 1 goes to prompt processing (t_p_eval/n_p_eval). There's a FIXME right above that code warning the stats break when you evaluate multiple single tokens without a sync. That's continuous batching exactly. Every step here takes one token from each active sequence, so the batch is almost always > 1, and all our generation work gets logged as prompt eval.

So n_eval stays near zero. TPS = 1000 / t_eval_ms * n_eval collapses to roughly zero, and t_p_eval/n_p_eval hold prefill and generation mixed together, which makes ppTPS meaningless. The single-prompt formulas only work because llama.cpp can separate the two phases, and in batched mode it can't.

Better to measure it in the scheduler, which already knows the phase of every token. Time each llama_decode and classify the step. Pure-decode steps add their time and generated tokens to the decode totals. Pure-prefill steps add theirs to the prefill totals. For a mixed step (a newcomer prefilling while others generate), charge the whole step to decode: that pass had to run for the generators anyway, and the one prefill token rides along at near-zero cost. Then TPS = n_decode / t_decode and ppTPS = n_prefill / t_prefill, with each rate's tokens and time drawn from the same steps so they can't drift apart. If a run has no pure-prefill step, report ppTPS as N/A instead of inventing a value.

I'd keep the wall-clock number as the aggregate batch throughput, since that's the figure that actually shows what batching buys us. The per-step rates sit on top as the prefill/decode efficiency split.

Most of the data is already there: generatedTokens and promptTokens are tracked per step, so this is a per-step timer plus two accumulators next to the existing recordDecodeStep call.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went with scheduler-side per-step timing instead of the llama.cpp counters: TPS now comes from decode-step time and ppTPS is restored from pure-prefill-step time.

The counters can't separate the phases under continuous batching — llama.cpp buckets a step by batch size, and every batched step is size > 1, so generation gets filed as prompt eval. From llama-context.cpp#L352-L367:

// FIXME: if multiple single tokens are evaluated without a synchronization,
// the stats will be added to the prompt evaluation stats
// this should only happen when using batch size 1 to evaluate a batch

// add the evaluation to the stats
if (n_queued_tokens == 1) {
    if (!cparams.no_perf) {
        t_eval_us += ggml_time_us() - t_compute_start_us;
    }
    n_eval++;
} else if (n_queued_tokens > 1) {
    if (!cparams.no_perf) {
        t_p_eval_us += ggml_time_us() - t_compute_start_us;
    }
    n_p_eval += n_queued_tokens;
}

So the rates are measured in the scheduler.

Fixes:

  • 427c1af: specs (unit + a real-model batch run)
  • ffed148: per-step timing impl

Left open in case you want to push back on the mixed-step → decode attribution.

Comment thread packages/llm-llamacpp/addon/src/model-interface/ContinuousBatchScheduler.cpp Outdated
@gianni-cor

gianni-cor commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Can we add docs explaining the continuous-batching architecture? This PR introduces several new moving parts (ContinuousBatchScheduler, MultiRequestBatcher, per-slot TextLlmContext / SequenceDriver, queued vs active requests, per-sequence context caps, streaming ids, cache-per-slot behavior, and cancellation semantics). Architecture docs would make it much easier to understand how batch prompts flow from JS into native slots, what parallel controls, how outputs/stats are aggregated, and what is/isn't supported in batch mode (media, per-prompt cache files, tools, context shifting).

jesusmb1995 and others added 20 commits June 3, 2026 14:49
Move the prefill-advancement body into an anonymous-namespace free function
to reduce nesting depth in the advance() loop.
Pull the driver-finalize logic out of stepLocked to cut nesting and
dedupe the decode-error and success paths. finalizeFinishedSequences
collapses the error-path cleanup; getOutputCallback and hasValidDriverF
factor out the streaming callback and the valid-driver predicate, the
latter reused via std::views::filter. Behavior is unchanged: per-seqId
ordering of save-cache, kv-clear, and free-slot is preserved.
The single declaration `runJob(...): Promise<boolean>` did not match the
batch binding, where the C++ side resolves `{ accepted, ids }` so
BatchHandler.run() can expose `response.ids`. Split into two overloads:
single-input resolves the boolean accepted flag, batch-input resolves
AddonBatchRunResult.

Also make AddonBatchRunItem.id optional to match runtime behavior, where
the native binding auto-assigns an id when the caller omits it.
Add explicit typed field so the continuous-batching slot count is
IDE-discoverable. Previously fell through the catch-all index
signature undocumented. `NumericLike` matches the string-or-number
values the C++ layer accepts as `n_parallel`.
Two prompts in the same batch sharing a non-empty `cacheKey` with
`saveCacheToDisk` would silently clobber each other on disk
(last-writer-wins, no per-slot isolation). The scheduler has no way
to pick a winner, so `processPromptBatchImpl` now throws
`InvalidArgument` before any request is scheduled.

Read-only sharing of the same key (no `saveCacheToDisk`) is a valid
cache-warming pattern and remains allowed. Regression tests cover
both the rejection path and the allowed read-sharing path.
Regression test for ContinuousBatchScheduler. With `parallel = 2` and 6
prompts, 4 sit in `pending_`; the first emitted token triggers
`model.cancel()`. If any queued prompt then emits a token it was admitted
and run *after* the cancel, which is the leak this guards against.
`stepLocked` consumed `cancelRequested_` with `exchange(false)` while
cancelling the active slots, so the post-step `admitPendingIntoFreeSlots`
ran with the flag already cleared and admitted the overflow prompts from
`pending_` — they kept generating after `model.cancel()`.

Leave the flag set in `stepLocked` (active slots are still cancelled
in-step so onCancel/saveCache stay prompt) and let `workerLoop` consume it
after the step: when a cancel is in effect, drain `pending_` instead of
admitting it. Cancel-all now covers active and queued prompts atomically.
Same `parallel = 2` / 6-prompt setup as the leak test, but asserts the
contract: a prompt cancelled while still queued in `pending_` produced no
output because it never ran, so `processPromptBatch` must throw a
`Cancelled` StatusError rather than return empty strings that look like
successful completions.
A prompt drained from `pending_` on cancel never reached a slot, so it
produced nothing. Completing its group as a normal (empty) success hid the
cancellation. Fail those groups with a new `Cancelled` LLM error code
instead.

Only prompts that had no chance to run throw. An in-flight slot is still
cancelled gracefully — it keeps whatever it generated and the call returns
normally — mirroring how cancelling a single (non-batched) request behaves.

Updates the leak test to tolerate the new throw: its invariant is only
that queued prompts must not run, which holds whether the call throws or
returns.
    Cancelling a batch with more prompts than `parallel` slots now behaves in
    two ways: in-flight prompts cancel gracefully (partial output, no error),
    while queued prompts that never ran reject with a `Cancelled` StatusError.
    Spell this out next to the cancel state table so JS callers handle the
    rejection instead of expecting empty strings for the un-run prompts.

diff --git a/packages/llm-llamacpp/README.md b/packages/llm-llamacpp/README.md
index e80ca72..da00a13 100644
--- a/packages/llm-llamacpp/README.md
+++ b/packages/llm-llamacpp/README.md
@@ -264,6 +264,15 @@ The following table describes the expected behavior of `run` and `cancel` depend
 When `run()` is called while another job is active, the implementation first waits briefly for the previous job to settle. This preserves single-job behavior while still failing fast when the instance is busy. If the second run cannot be accepted (timeout or addon busy rejection), it throws:
 - `"Cannot set new job: a job is already set or being processed"`

+#### Cancelling a batch
+
+When more prompts are submitted in one batch than the configured `parallel` slots, the overflow prompts wait in an internal queue until a slot frees up. `cancel` treats the two groups differently, mirroring how cancelling a single request behaves:
+
+- **In-flight prompts** (already decoding in a slot) are cancelled gracefully: they keep whatever they generated so far and the call resolves normally — no error.
+- **Queued prompts** (still waiting, never admitted to a slot) had no chance to run and produced nothing. These are surfaced as an error rather than silent empty results: the batch call rejects with a `Cancelled` `StatusError`.
+
+So a cancelled batch that contained queued prompts rejects with `Cancelled`; callers should handle that rejection rather than expecting empty strings for the un-run prompts.
+

 ## Fine-tuning

diff --git a/packages/llm-llamacpp/addon/src/addon/AddonJs.hpp b/packages/llm-llamacpp/addon/src/addon/AddonJs.hpp
index f971abe..d3fffb7 100644
--- a/packages/llm-llamacpp/addon/src/addon/AddonJs.hpp
+++ b/packages/llm-llamacpp/addon/src/addon/AddonJs.hpp
@@ -28,9 +28,20 @@ namespace js = qvac_lib_inference_addon_cpp::js;
 /// required to use it as a `const char*` template arg in `PayloadHandler`.
 inline constexpr char kBatchOutputTypeName[] = "batch_output";

+inline LlamaModel*
+tryGetLlamaModel(qvac_lib_inference_addon_cpp::AddonCpp& addonCpp) {
+  return dynamic_cast<LlamaModel*>(&addonCpp.model.get());
+}
+
 inline LlamaModel*
 getLlamaModel(qvac_lib_inference_addon_cpp::AddonJs& instance) {
-  return static_cast<LlamaModel*>(&instance.addonCpp->model.get());
+  using namespace qvac_lib_inference_addon_cpp;
+  auto* llamaModel = tryGetLlamaModel(*instance.addonCpp);
+  if (llamaModel == nullptr) {
+    throw StatusError(
+        general_error::InternalError, "Model is not a LlamaModel");
+  }
+  return llamaModel;
 }

 inline std::function<void(const std::string&)>
@@ -532,6 +543,14 @@ inline js_value_t* runJob(js_env_t* env, js_callback_info_t* info) try {
                            .getOptionalProperty<js::Array>(env, "messages")
                            .has_value();
   if (isBatch) {
+    // Reject before admission: otherwise processPromptBatch throws the same
+    // error on the worker thread, surfaced as an async rejection.
+    if (!getLlamaModel(instance)->supportsBatching()) {
+      throw StatusError(
+          general_error::InvalidArgument,
+          "Batch run() requires the model loaded with parallel >= 2 "
+          "(continuous batching, text-only model with n_seq_max > 1)");
+    }
     // Static to recycle vector capacity across calls; safe only while
     // admissions stay serialized (one batch in flight). Demote to a local
     // if that changes.
@@ -568,7 +587,7 @@ inline js_value_t* cancel(js_env_t* env, js_callback_info_t* info) try {
   // an in-flight cancel and trip a destroyed-mutex UAF in JobRunner.
   auto addonCppRef = instance.addonCpp;
   return js::JsAsyncTask::run(env, [addonCppRef]() {
-    auto* llamaModel = static_cast<LlamaModel*>(&addonCppRef->model.get());
+    auto* llamaModel = tryGetLlamaModel(*addonCppRef);
     if (llamaModel && llamaModel->finetuner().isFinetuneRunning() &&
         llamaModel->finetuner().requestPause()) {
       llamaModel->finetuner().waitUntilFinetuningPauseComplete();
@@ -586,13 +605,6 @@ inline js_value_t* finetune(js_env_t* env, js_callback_info_t* info) try {
   JsArgsParser args(env, info);
   AddonJs& instance = JsInterface::getInstance(env, args.get(0, "instance"));

-  LlamaModel* llamaModel = getLlamaModel(instance);
-  if (llamaModel == nullptr) {
-    throw StatusError(
-        general_error::InvalidArgument,
-        "Model not available or not a LlamaModel");
-  }
-
   auto paramsOpt = args.tryGetObject<LlamaFinetuningParams>(
       1, "finetuningParams", [](js_env_t* e, js::Object& jsObj) {
         return parseLlamaFinetuningParams(e, jsObj);
diff --git a/packages/llm-llamacpp/addon/src/model-interface/LlamaModel.cpp b/packages/llm-llamacpp/addon/src/model-interface/LlamaModel.cpp
index 2eb75d8..25d28e5 100644
--- a/packages/llm-llamacpp/addon/src/model-interface/LlamaModel.cpp
+++ b/packages/llm-llamacpp/addon/src/model-interface/LlamaModel.cpp
@@ -717,6 +717,11 @@ LlamaModel::processPromptBatch(const std::vector<Prompt>& prompts) {
   return processPromptBatchImpl(prompts);
 }

+bool LlamaModel::supportsBatching() const {
+  std::shared_lock lock(stateMtx_);
+  return state_ && isMultiBatchActivated(*state_);
+}
+
 std::vector<std::string>
 LlamaModel::processPromptBatchImpl(const std::vector<Prompt>& prompts) {
   validateBitnetQuantization();
diff --git a/packages/llm-llamacpp/addon/src/model-interface/LlamaModel.hpp b/packages/llm-llamacpp/addon/src/model-interface/LlamaModel.hpp
index 3a82291..7166ed9 100644
--- a/packages/llm-llamacpp/addon/src/model-interface/LlamaModel.hpp
+++ b/packages/llm-llamacpp/addon/src/model-interface/LlamaModel.hpp
@@ -133,6 +133,10 @@ public:
   std::vector<std::string>
   processPromptBatch(const std::vector<Prompt>& prompts);

+  /// @brief True when the model was loaded with continuous batching active
+  /// (text-only context with `n_seq_max > 1`, i.e. `parallel >= 2`).
+  [[nodiscard]] bool supportsBatching() const;
+
   /**
    * The Reset method.
    */
diff --git a/packages/llm-llamacpp/test/integration/api-behavior.test.js b/packages/llm-llamacpp/test/integration/api-behavior.test.js
index f9a4efa..26aaf5b 100644
--- a/packages/llm-llamacpp/test/integration/api-behavior.test.js
+++ b/packages/llm-llamacpp/test/integration/api-behavior.test.js
@@ -141,6 +141,22 @@ safeTest('idle | run batch: returns ids, keyed chunks, ordered results', { timeo
   t.ok(toNumber(response?.stats?.avgConcurrentSeq) > 1.1, 'batch stats report concurrent sequence decoding')
 })

+safeTest('idle | run batch without parallel >= 2: rejects before admission', { timeout: 600_000 }, async t => {
+  // Default load (parallel = 1) leaves continuous batching inactive, so batch
+  // input must be rejected up front rather than reaching the worker thread.
+  const { model } = await setupModel(t)
+  const batchPrompts = [
+    [{ role: 'user', content: 'Say red.' }],
+    [{ role: 'user', content: 'Say blue.' }]
+  ]
+
+  await t.exception.all(
+    () => model.run(batchPrompts),
+    /parallel >= 2/,
+    'batch run rejects when the model was not loaded with parallel >= 2'
+  )
+})
+
 safeTest('idle | run with prefill: evaluates prompt without token generation', { timeout: 600_000 }, async t => {
   const { model } = await setupModel(t)
Batch run() only works when the model was loaded with continuous
batching (parallel >= 2). Without it the batch reached native code and
failed on the worker thread, surfaced as a late async rejection. Add a
supportsBatching() check in the binding so run() throws synchronously,
before the job is admitted.

Centralize the LlamaModel downcast in tryGetLlamaModel/getLlamaModel
(getLlamaModel now throws InternalError on a non-LlamaModel), which also
drops the dead null-check that static_cast left in finetune.
In batch mode the KV pool is partitioned into per-slot caps
(ctx / n_parallel) far smaller than the whole-context size. A cached
prompt can fit the full context yet overflow its slot, so the prefill
slide must trigger against the per-sequence cap to let n_discarded free
room before the scheduler rejects the prompt.

trySlidePrefill gains an effectiveCtx parameter for that per-sequence
ceiling. This patch only plumbs the parameter (still slides against the
whole-context size), so the new regression test fails: a prompt over the
2048 slot cap but under the 8192 full ctx returns NotNeeded instead of
Slid. The fix follows in the next patch.

Regression for PR tetherto#2327 review r3344885390.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
In batch mode the KV pool is partitioned into per-slot windows
(n_ctx / n_parallel). The admission check in ContinuousBatchScheduler
rejects a prompt against that per-sequence cap, but the prefill slide in
TextLlmContext only triggered against the whole-context size reported by
llama_n_ctx(). A cached prompt larger than its slot yet smaller than the
full context therefore slid against the wrong ceiling: trySlidePrefill
returned NotNeeded, the prompt stayed oversized, and the scheduler
rejected it. n_discarded was effectively dead for batch prompts.

Thread the per-sequence ceiling through:
- trySlidePrefill / trySlideGeneration take an effectiveCtx parameter and
  slide against it (falls back to ops.nCtx() when <= 0, i.e. single
  sequence), so the new regression test now reports Slid.
- TextLlmContext carries perSeqCtxCeiling_ (set by the scheduler to
  perSeqMaxTokens_, -1 for single-sequence) and uses ctxCeiling() for the
  prefill/generation slide thresholds, the overflow checks, and the
  n_discarded clamps in loadCache/onPrefillComplete. The physical cache
  size check still uses the full llama_n_ctx().
- ContinuousBatchScheduler passes perSeqMaxTokens_ to the slot driver and
  warns at construction when configuredNDiscarded >= the per-sequence cap,
  since it will be clamped below the per-slot window.

README documents the per-slot clamp on n_discarded in batch mode.

Fixes PR tetherto#2327 review r3344885390.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
RuntimeStatsSnapshot should expose per-phase throughput derived from
per-step timing: a decode rate (generation steps) and a prefill rate
(pure-prefill steps). llama.cpp's context counters can't produce this
split under continuous batching, since every batched step has size > 1
and its generation work is misfiled as prompt eval.

These specs pin the bucketing contract: pure-decode steps feed the
decode rate, pure-prefill steps feed the prefill rate, and a mixed step
(a newcomer prefilling while others generate) is charged wholly to
decode so the piggybacked prefill token never inflates ppTPS. They fail
to compile until the snapshot grows the timing-aware recordDecodeStep
overload and the two rate getters.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Time each llama_decode in the scheduler and split its batch into prompt
vs generated tokens via FillResult::numPrefillingSequences. A step that
carries any decode token is charged wholly to the decode bucket; only
pure-prefill steps feed the prefill bucket. RuntimeStatsSnapshot turns
those into a decode rate and a prefill rate.

LlamaModel batch stats now derive TPS from the decode rate and ppTPS
from the prefill rate, replacing the wall-clock TPS. This avoids
llama.cpp's context counters, which misfile batched generation as prompt
eval (every batched step is size > 1) and so cannot separate the two
phases. ppTPS, previously absent from batch stats, is restored.

Greens the specs added in the previous patch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Extract finalizeTerminalDriver (behavior-preserving) so the terminal
lifecycle-hook choice is unit-testable, and add a failing spec: a
DecodeError sequence that ran generation must pass through
onCancel/onGenerationFinished (which runs TextLlmContext::
onGenerationCompletePolicy, the tools_compact tool-region trim), not a
bare onSequenceEnd flush.
A decode-error finalization called only onSequenceEnd(), which flushes
the UTF-8 buffer but skips onGenerationFinished()/onCancel() and thus
TextLlmContext::onGenerationCompletePolicy() (the tools_compact
tool-region trim). Route decode-error terminations through onCancel like
the user-cancel and group-failure paths, and run the hook before the KV
is cleared so the trim operates on live KV state.
After decode-error finalization clears KV in the scheduler loop, no
caller passes a clear callback to extractFinished. Remove the dead
parameter; cancel()/clear() keep their own kvClear. Drop the test that
covered only the removed callback and trim the slot-reuse test to its
still-relevant reuse assertion.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants