QVAC-18395: Continuous Batching (text, single-job)#2327
Conversation
This comment was marked as spam.
This comment was marked as spam.
Tier-based Approval Status |
60bb8d0 to
6a5aa0a
Compare
This comment was marked as resolved.
This comment was marked as resolved.
fb8fac2 to
a793de2
Compare
a793de2 to
935dd3b
Compare
935dd3b to
56444af
Compare
| state_->batchScheduler_->runtimeStats(); | ||
| constexpr double kMillisInSecond = 1000.0; | ||
| const double elapsedMs = stats.elapsedMs(); | ||
| const double tokensPerSecond = |
There was a problem hiding this comment.
Can we compute batch TPS and ppTPS from llama.cpp's aggregate context-level perf counters instead of the scheduler wall-clock timer? For batch we cannot get exact per-prompt compute time from llama.cpp, but the aggregate batch metrics can still mirror the single-prompt formulas: TPS = 1000 / perfData.t_eval_ms * perfData.n_eval and ppTPS = 1000 / perfData.t_p_eval_ms * perfData.n_p_eval. That also restores ppTPS for batch stats and keeps these fields based on llama.cpp timing rather than external elapsed time.
There was a problem hiding this comment.
Note: this explanation was AI-generated (reviewed by me before posting).
I looked into this and I don't think the llama.cpp counters work for the batch path. Using them would make TPS and ppTPS worse than the wall-clock number we have now, so I'd push back on the suggestion.
Here's the issue. llama.cpp decides whether a decode step is prefill or generation purely from the batch size. In llama-context.cpp: n_queued_tokens == 1 goes to generation (t_eval/n_eval), > 1 goes to prompt processing (t_p_eval/n_p_eval). There's a FIXME right above that code warning the stats break when you evaluate multiple single tokens without a sync. That's continuous batching exactly. Every step here takes one token from each active sequence, so the batch is almost always > 1, and all our generation work gets logged as prompt eval.
So n_eval stays near zero. TPS = 1000 / t_eval_ms * n_eval collapses to roughly zero, and t_p_eval/n_p_eval hold prefill and generation mixed together, which makes ppTPS meaningless. The single-prompt formulas only work because llama.cpp can separate the two phases, and in batched mode it can't.
Better to measure it in the scheduler, which already knows the phase of every token. Time each llama_decode and classify the step. Pure-decode steps add their time and generated tokens to the decode totals. Pure-prefill steps add theirs to the prefill totals. For a mixed step (a newcomer prefilling while others generate), charge the whole step to decode: that pass had to run for the generators anyway, and the one prefill token rides along at near-zero cost. Then TPS = n_decode / t_decode and ppTPS = n_prefill / t_prefill, with each rate's tokens and time drawn from the same steps so they can't drift apart. If a run has no pure-prefill step, report ppTPS as N/A instead of inventing a value.
I'd keep the wall-clock number as the aggregate batch throughput, since that's the figure that actually shows what batching buys us. The per-step rates sit on top as the prefill/decode efficiency split.
Most of the data is already there: generatedTokens and promptTokens are tracked per step, so this is a per-step timer plus two accumulators next to the existing recordDecodeStep call.
There was a problem hiding this comment.
Went with scheduler-side per-step timing instead of the llama.cpp counters: TPS now comes from decode-step time and ppTPS is restored from pure-prefill-step time.
The counters can't separate the phases under continuous batching — llama.cpp buckets a step by batch size, and every batched step is size > 1, so generation gets filed as prompt eval. From llama-context.cpp#L352-L367:
// FIXME: if multiple single tokens are evaluated without a synchronization,
// the stats will be added to the prompt evaluation stats
// this should only happen when using batch size 1 to evaluate a batch
// add the evaluation to the stats
if (n_queued_tokens == 1) {
if (!cparams.no_perf) {
t_eval_us += ggml_time_us() - t_compute_start_us;
}
n_eval++;
} else if (n_queued_tokens > 1) {
if (!cparams.no_perf) {
t_p_eval_us += ggml_time_us() - t_compute_start_us;
}
n_p_eval += n_queued_tokens;
}So the rates are measured in the scheduler.
Fixes:
Left open in case you want to push back on the mixed-step → decode attribution.
|
Can we add docs explaining the continuous-batching architecture? This PR introduces several new moving parts ( |
Move the prefill-advancement body into an anonymous-namespace free function to reduce nesting depth in the advance() loop.
Pull the driver-finalize logic out of stepLocked to cut nesting and dedupe the decode-error and success paths. finalizeFinishedSequences collapses the error-path cleanup; getOutputCallback and hasValidDriverF factor out the streaming callback and the valid-driver predicate, the latter reused via std::views::filter. Behavior is unchanged: per-seqId ordering of save-cache, kv-clear, and free-slot is preserved.
The single declaration `runJob(...): Promise<boolean>` did not match the
batch binding, where the C++ side resolves `{ accepted, ids }` so
BatchHandler.run() can expose `response.ids`. Split into two overloads:
single-input resolves the boolean accepted flag, batch-input resolves
AddonBatchRunResult.
Also make AddonBatchRunItem.id optional to match runtime behavior, where
the native binding auto-assigns an id when the caller omits it.
Add explicit typed field so the continuous-batching slot count is IDE-discoverable. Previously fell through the catch-all index signature undocumented. `NumericLike` matches the string-or-number values the C++ layer accepts as `n_parallel`.
Two prompts in the same batch sharing a non-empty `cacheKey` with `saveCacheToDisk` would silently clobber each other on disk (last-writer-wins, no per-slot isolation). The scheduler has no way to pick a winner, so `processPromptBatchImpl` now throws `InvalidArgument` before any request is scheduled. Read-only sharing of the same key (no `saveCacheToDisk`) is a valid cache-warming pattern and remains allowed. Regression tests cover both the rejection path and the allowed read-sharing path.
Regression test for ContinuousBatchScheduler. With `parallel = 2` and 6 prompts, 4 sit in `pending_`; the first emitted token triggers `model.cancel()`. If any queued prompt then emits a token it was admitted and run *after* the cancel, which is the leak this guards against.
`stepLocked` consumed `cancelRequested_` with `exchange(false)` while cancelling the active slots, so the post-step `admitPendingIntoFreeSlots` ran with the flag already cleared and admitted the overflow prompts from `pending_` — they kept generating after `model.cancel()`. Leave the flag set in `stepLocked` (active slots are still cancelled in-step so onCancel/saveCache stay prompt) and let `workerLoop` consume it after the step: when a cancel is in effect, drain `pending_` instead of admitting it. Cancel-all now covers active and queued prompts atomically.
Same `parallel = 2` / 6-prompt setup as the leak test, but asserts the contract: a prompt cancelled while still queued in `pending_` produced no output because it never ran, so `processPromptBatch` must throw a `Cancelled` StatusError rather than return empty strings that look like successful completions.
A prompt drained from `pending_` on cancel never reached a slot, so it produced nothing. Completing its group as a normal (empty) success hid the cancellation. Fail those groups with a new `Cancelled` LLM error code instead. Only prompts that had no chance to run throw. An in-flight slot is still cancelled gracefully — it keeps whatever it generated and the call returns normally — mirroring how cancelling a single (non-batched) request behaves. Updates the leak test to tolerate the new throw: its invariant is only that queued prompts must not run, which holds whether the call throws or returns.
Cancelling a batch with more prompts than `parallel` slots now behaves in
two ways: in-flight prompts cancel gracefully (partial output, no error),
while queued prompts that never ran reject with a `Cancelled` StatusError.
Spell this out next to the cancel state table so JS callers handle the
rejection instead of expecting empty strings for the un-run prompts.
diff --git a/packages/llm-llamacpp/README.md b/packages/llm-llamacpp/README.md
index e80ca72..da00a13 100644
--- a/packages/llm-llamacpp/README.md
+++ b/packages/llm-llamacpp/README.md
@@ -264,6 +264,15 @@ The following table describes the expected behavior of `run` and `cancel` depend
When `run()` is called while another job is active, the implementation first waits briefly for the previous job to settle. This preserves single-job behavior while still failing fast when the instance is busy. If the second run cannot be accepted (timeout or addon busy rejection), it throws:
- `"Cannot set new job: a job is already set or being processed"`
+#### Cancelling a batch
+
+When more prompts are submitted in one batch than the configured `parallel` slots, the overflow prompts wait in an internal queue until a slot frees up. `cancel` treats the two groups differently, mirroring how cancelling a single request behaves:
+
+- **In-flight prompts** (already decoding in a slot) are cancelled gracefully: they keep whatever they generated so far and the call resolves normally — no error.
+- **Queued prompts** (still waiting, never admitted to a slot) had no chance to run and produced nothing. These are surfaced as an error rather than silent empty results: the batch call rejects with a `Cancelled` `StatusError`.
+
+So a cancelled batch that contained queued prompts rejects with `Cancelled`; callers should handle that rejection rather than expecting empty strings for the un-run prompts.
+
## Fine-tuning
diff --git a/packages/llm-llamacpp/addon/src/addon/AddonJs.hpp b/packages/llm-llamacpp/addon/src/addon/AddonJs.hpp
index f971abe..d3fffb7 100644
--- a/packages/llm-llamacpp/addon/src/addon/AddonJs.hpp
+++ b/packages/llm-llamacpp/addon/src/addon/AddonJs.hpp
@@ -28,9 +28,20 @@ namespace js = qvac_lib_inference_addon_cpp::js;
/// required to use it as a `const char*` template arg in `PayloadHandler`.
inline constexpr char kBatchOutputTypeName[] = "batch_output";
+inline LlamaModel*
+tryGetLlamaModel(qvac_lib_inference_addon_cpp::AddonCpp& addonCpp) {
+ return dynamic_cast<LlamaModel*>(&addonCpp.model.get());
+}
+
inline LlamaModel*
getLlamaModel(qvac_lib_inference_addon_cpp::AddonJs& instance) {
- return static_cast<LlamaModel*>(&instance.addonCpp->model.get());
+ using namespace qvac_lib_inference_addon_cpp;
+ auto* llamaModel = tryGetLlamaModel(*instance.addonCpp);
+ if (llamaModel == nullptr) {
+ throw StatusError(
+ general_error::InternalError, "Model is not a LlamaModel");
+ }
+ return llamaModel;
}
inline std::function<void(const std::string&)>
@@ -532,6 +543,14 @@ inline js_value_t* runJob(js_env_t* env, js_callback_info_t* info) try {
.getOptionalProperty<js::Array>(env, "messages")
.has_value();
if (isBatch) {
+ // Reject before admission: otherwise processPromptBatch throws the same
+ // error on the worker thread, surfaced as an async rejection.
+ if (!getLlamaModel(instance)->supportsBatching()) {
+ throw StatusError(
+ general_error::InvalidArgument,
+ "Batch run() requires the model loaded with parallel >= 2 "
+ "(continuous batching, text-only model with n_seq_max > 1)");
+ }
// Static to recycle vector capacity across calls; safe only while
// admissions stay serialized (one batch in flight). Demote to a local
// if that changes.
@@ -568,7 +587,7 @@ inline js_value_t* cancel(js_env_t* env, js_callback_info_t* info) try {
// an in-flight cancel and trip a destroyed-mutex UAF in JobRunner.
auto addonCppRef = instance.addonCpp;
return js::JsAsyncTask::run(env, [addonCppRef]() {
- auto* llamaModel = static_cast<LlamaModel*>(&addonCppRef->model.get());
+ auto* llamaModel = tryGetLlamaModel(*addonCppRef);
if (llamaModel && llamaModel->finetuner().isFinetuneRunning() &&
llamaModel->finetuner().requestPause()) {
llamaModel->finetuner().waitUntilFinetuningPauseComplete();
@@ -586,13 +605,6 @@ inline js_value_t* finetune(js_env_t* env, js_callback_info_t* info) try {
JsArgsParser args(env, info);
AddonJs& instance = JsInterface::getInstance(env, args.get(0, "instance"));
- LlamaModel* llamaModel = getLlamaModel(instance);
- if (llamaModel == nullptr) {
- throw StatusError(
- general_error::InvalidArgument,
- "Model not available or not a LlamaModel");
- }
-
auto paramsOpt = args.tryGetObject<LlamaFinetuningParams>(
1, "finetuningParams", [](js_env_t* e, js::Object& jsObj) {
return parseLlamaFinetuningParams(e, jsObj);
diff --git a/packages/llm-llamacpp/addon/src/model-interface/LlamaModel.cpp b/packages/llm-llamacpp/addon/src/model-interface/LlamaModel.cpp
index 2eb75d8..25d28e5 100644
--- a/packages/llm-llamacpp/addon/src/model-interface/LlamaModel.cpp
+++ b/packages/llm-llamacpp/addon/src/model-interface/LlamaModel.cpp
@@ -717,6 +717,11 @@ LlamaModel::processPromptBatch(const std::vector<Prompt>& prompts) {
return processPromptBatchImpl(prompts);
}
+bool LlamaModel::supportsBatching() const {
+ std::shared_lock lock(stateMtx_);
+ return state_ && isMultiBatchActivated(*state_);
+}
+
std::vector<std::string>
LlamaModel::processPromptBatchImpl(const std::vector<Prompt>& prompts) {
validateBitnetQuantization();
diff --git a/packages/llm-llamacpp/addon/src/model-interface/LlamaModel.hpp b/packages/llm-llamacpp/addon/src/model-interface/LlamaModel.hpp
index 3a82291..7166ed9 100644
--- a/packages/llm-llamacpp/addon/src/model-interface/LlamaModel.hpp
+++ b/packages/llm-llamacpp/addon/src/model-interface/LlamaModel.hpp
@@ -133,6 +133,10 @@ public:
std::vector<std::string>
processPromptBatch(const std::vector<Prompt>& prompts);
+ /// @brief True when the model was loaded with continuous batching active
+ /// (text-only context with `n_seq_max > 1`, i.e. `parallel >= 2`).
+ [[nodiscard]] bool supportsBatching() const;
+
/**
* The Reset method.
*/
diff --git a/packages/llm-llamacpp/test/integration/api-behavior.test.js b/packages/llm-llamacpp/test/integration/api-behavior.test.js
index f9a4efa..26aaf5b 100644
--- a/packages/llm-llamacpp/test/integration/api-behavior.test.js
+++ b/packages/llm-llamacpp/test/integration/api-behavior.test.js
@@ -141,6 +141,22 @@ safeTest('idle | run batch: returns ids, keyed chunks, ordered results', { timeo
t.ok(toNumber(response?.stats?.avgConcurrentSeq) > 1.1, 'batch stats report concurrent sequence decoding')
})
+safeTest('idle | run batch without parallel >= 2: rejects before admission', { timeout: 600_000 }, async t => {
+ // Default load (parallel = 1) leaves continuous batching inactive, so batch
+ // input must be rejected up front rather than reaching the worker thread.
+ const { model } = await setupModel(t)
+ const batchPrompts = [
+ [{ role: 'user', content: 'Say red.' }],
+ [{ role: 'user', content: 'Say blue.' }]
+ ]
+
+ await t.exception.all(
+ () => model.run(batchPrompts),
+ /parallel >= 2/,
+ 'batch run rejects when the model was not loaded with parallel >= 2'
+ )
+})
+
safeTest('idle | run with prefill: evaluates prompt without token generation', { timeout: 600_000 }, async t => {
const { model } = await setupModel(t)
Batch run() only works when the model was loaded with continuous batching (parallel >= 2). Without it the batch reached native code and failed on the worker thread, surfaced as a late async rejection. Add a supportsBatching() check in the binding so run() throws synchronously, before the job is admitted. Centralize the LlamaModel downcast in tryGetLlamaModel/getLlamaModel (getLlamaModel now throws InternalError on a non-LlamaModel), which also drops the dead null-check that static_cast left in finetune.
In batch mode the KV pool is partitioned into per-slot caps (ctx / n_parallel) far smaller than the whole-context size. A cached prompt can fit the full context yet overflow its slot, so the prefill slide must trigger against the per-sequence cap to let n_discarded free room before the scheduler rejects the prompt. trySlidePrefill gains an effectiveCtx parameter for that per-sequence ceiling. This patch only plumbs the parameter (still slides against the whole-context size), so the new regression test fails: a prompt over the 2048 slot cap but under the 8192 full ctx returns NotNeeded instead of Slid. The fix follows in the next patch. Regression for PR tetherto#2327 review r3344885390. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
In batch mode the KV pool is partitioned into per-slot windows (n_ctx / n_parallel). The admission check in ContinuousBatchScheduler rejects a prompt against that per-sequence cap, but the prefill slide in TextLlmContext only triggered against the whole-context size reported by llama_n_ctx(). A cached prompt larger than its slot yet smaller than the full context therefore slid against the wrong ceiling: trySlidePrefill returned NotNeeded, the prompt stayed oversized, and the scheduler rejected it. n_discarded was effectively dead for batch prompts. Thread the per-sequence ceiling through: - trySlidePrefill / trySlideGeneration take an effectiveCtx parameter and slide against it (falls back to ops.nCtx() when <= 0, i.e. single sequence), so the new regression test now reports Slid. - TextLlmContext carries perSeqCtxCeiling_ (set by the scheduler to perSeqMaxTokens_, -1 for single-sequence) and uses ctxCeiling() for the prefill/generation slide thresholds, the overflow checks, and the n_discarded clamps in loadCache/onPrefillComplete. The physical cache size check still uses the full llama_n_ctx(). - ContinuousBatchScheduler passes perSeqMaxTokens_ to the slot driver and warns at construction when configuredNDiscarded >= the per-sequence cap, since it will be clamped below the per-slot window. README documents the per-slot clamp on n_discarded in batch mode. Fixes PR tetherto#2327 review r3344885390. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
RuntimeStatsSnapshot should expose per-phase throughput derived from per-step timing: a decode rate (generation steps) and a prefill rate (pure-prefill steps). llama.cpp's context counters can't produce this split under continuous batching, since every batched step has size > 1 and its generation work is misfiled as prompt eval. These specs pin the bucketing contract: pure-decode steps feed the decode rate, pure-prefill steps feed the prefill rate, and a mixed step (a newcomer prefilling while others generate) is charged wholly to decode so the piggybacked prefill token never inflates ppTPS. They fail to compile until the snapshot grows the timing-aware recordDecodeStep overload and the two rate getters. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Time each llama_decode in the scheduler and split its batch into prompt vs generated tokens via FillResult::numPrefillingSequences. A step that carries any decode token is charged wholly to the decode bucket; only pure-prefill steps feed the prefill bucket. RuntimeStatsSnapshot turns those into a decode rate and a prefill rate. LlamaModel batch stats now derive TPS from the decode rate and ppTPS from the prefill rate, replacing the wall-clock TPS. This avoids llama.cpp's context counters, which misfile batched generation as prompt eval (every batched step is size > 1) and so cannot separate the two phases. ppTPS, previously absent from batch stats, is restored. Greens the specs added in the previous patch. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Extract finalizeTerminalDriver (behavior-preserving) so the terminal lifecycle-hook choice is unit-testable, and add a failing spec: a DecodeError sequence that ran generation must pass through onCancel/onGenerationFinished (which runs TextLlmContext:: onGenerationCompletePolicy, the tools_compact tool-region trim), not a bare onSequenceEnd flush.
A decode-error finalization called only onSequenceEnd(), which flushes the UTF-8 buffer but skips onGenerationFinished()/onCancel() and thus TextLlmContext::onGenerationCompletePolicy() (the tools_compact tool-region trim). Route decode-error terminations through onCancel like the user-cancel and group-failure paths, and run the hook before the KV is cleared so the trim operates on live KV state.
After decode-error finalization clears KV in the scheduler loop, no caller passes a clear callback to extractFinished. Remove the dead parameter; cancel()/clear() keep their own kvClear. Drop the test that covered only the removed callback and trim the slot-reuse test to its still-relevant reuse assertion.
The problem
When a single request is in flight, the GPU sits idle between token generation steps while the sequence waits on sampling or I/O. Nothing batches work across concurrent callers, so each request effectively serialises the whole pipeline.
What this does
This PR introduces continuous batching: a native decode loop (
ContinuousBatchScheduler) that, on every step, pulls tokens from all active sequences into a single shared batch. Instead of one sequence at a time, the GPU sees a full batch drawn from however many callers are currently in flight.Two components handle the concurrency machinery.
MultiRequestBatchermanages admission and maps each incoming sequence to a slot.SequenceDriverowns the per-slot generation state: KV cache position, token budget, stop conditions. Intake runs through a lock-freemoodycamel/concurrentqueueport so admission doesn't block the decode loop.Interface
On the JS side,
run()now accepts either a singleMessage[](unchanged) or an array of prompts asMessage[][]orBatchPrompt[]wrappers, which carry optional ids and per-itemrunOptions. It returns aBatchResponsethatstreams
BatchOutputChunkevents keyed by sequence id and resolvesawait()to an orderedBatchResult[].RuntimeStatsgainsavgConcurrentSeqto report observed decode concurrency over a run. The level of parallelism can be enabled and controlled withparallelflag when building the model.Limitations
Text-only. Vision will be handled on a separate PR.
For now JS interface limited to single-job. This means that to truly take advantage of parallelism multiple messages needs to be bundled in same
run()call. Multiple separate consecutive run calls with their own messages will not take advantage of the parallelism. This will be addressed later.Testing
Integration tests cover basic multi-prompt batching, per-item
generationParams, streaming output correlated by id, and concurrent admission. On the C++ side, unit tests coverMultiRequestBatcherslot lifecycle, theContinuousBatchSchedulerdecode loop and budget enforcement, and a regression for then_predictsingle-prompt guard.API changes
run(prompt)acceptsMessage[](unchanged) orBatchPrompt[]/Message[][](new batch path).BatchPrompt,BatchOutputChunk,BatchResult,BatchResponse.RuntimeStats.avgConcurrentSeq: numberadded.See previous PR #1890