diff --git a/.github/workflows/deploy-docs.yml b/.github/workflows/deploy-docs.yml
index 433e235..d8557d1 100644
--- a/.github/workflows/deploy-docs.yml
+++ b/.github/workflows/deploy-docs.yml
@@ -40,12 +40,12 @@ jobs:
     timeout-minutes: 10
     steps:
       - name: Checkout main repo
-        uses: actions/checkout@v4
+        uses: actions/checkout@v6
         with:
           persist-credentials: false
       - name: Set up Python
-        uses: actions/setup-python@v5
+        uses: actions/setup-python@v6
         with:
           python-version: '3.11'
           cache: pip
@@ -59,7 +59,7 @@ jobs:
         run: mkdocs build --strict
       - name: Checkout marketing site repo
-        uses: actions/checkout@v4
+        uses: actions/checkout@v6
         with:
           repository: cryptopoly/ChaosEngineAI-Site
           ssh-key: ${{ secrets.SITE_REPO_DEPLOY_KEY }}
diff --git a/.github/workflows/perf-gate.yml b/.github/workflows/perf-gate.yml
index 33561a0..f044c65 100644
--- a/.github/workflows/perf-gate.yml
+++ b/.github/workflows/perf-gate.yml
@@ -79,7 +79,7 @@ jobs:
       - name: Upload baseline JSON
         if: always()
-        uses: actions/upload-artifact@v5
+        uses: actions/upload-artifact@v7
         with:
           name: perf-baseline
           path: /tmp/perf-baseline.json
diff --git a/CLAUDE.md b/CLAUDE.md
index 061c310..f5767a1 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -152,7 +152,7 @@ no longer relevant.
 | FU-029 | KVTC (NVIDIA ICLR 2026) KV cache strategy | **Deferred 2026-05-10 — CUDA-only upstream, awaiting MLX/Metal port + PyPI release.** | Targeting [OnlyTerp/kvtc](https://github.com/OnlyTerp/kvtc) (Apache 2.0). PCA + adaptive quantization + entropy coding — 8–32× compression vs the dropped ChaosEngine's 3.7×, peer-reviewed at ICLR 2026, beats TurboQuant by 37% at comparable quality on long-context. 
Upstream blockers: (a) CUDA-only — repo's roadmap mentions MLX/Metal as "planned" but not yet implemented, so the Apple Silicon dev box cannot validate end-to-end; (b) not on PyPI — distributed as a `src.*` repo intended for `git clone`; (c) integration shape is a HuggingFace `DynamicCache` wrapper (not a llama.cpp cache type), so the existing GGUF lane has no path. Re-evaluate when either upstream ships MLX support or a Windows/Linux+CUDA development box becomes available. Apple Silicon users continue on TurboQuant-MLX (also ICLR 2026, native today). | | ~~FU-030~~ | ~~Drop ChaosEngine + RotorQuant strategy slots~~ | **Shipped 2026-05-10.** | ChaosEngine (cryptopoly/ChaosEngine — 1 commit upstream, eclipsed by KVTC at ICLR 2026 with the same PCA approach but 8–32× compression vs 3.7×) and RotorQuant (shipped as a misleading alias for TurboQuant — same ``--cache-type-k turbo{N}`` flags + same Python module marker) both removed from the registry. Persisted user configs that still reference these ids coerce silently to ``turboquant`` via a new ``CacheStrategyRegistry.resolve_legacy_id`` helper + module-level ``_LEGACY_STRATEGY_ALIASES`` map ([cache_compression/__init__.py](cache_compression/__init__.py)). Mirror coercion in frontend ([src/components/runtimeSupport.ts](src/components/runtimeSupport.ts) ``LEGACY_STRATEGY_ALIASES`` + ``canonicalStrategyId``). Two-level llama.cpp fallback chain (was three-level: requested → ChaosEngine → native; now requested → native) in [backend_service/inference/llama_cpp_engine.py](backend_service/inference/llama_cpp_engine.py). Vendored ChaosEngine bundling stripped from [scripts/stage-runtime.mjs](scripts/stage-runtime.mjs) (3 helper functions removed: ``stageVendoredChaosEngine`` + ``ensureSetuptoolsForPep639`` + ``resolveChaosEngineVendor``). Pre-build probe asserts the legacy-id coercion works in CI. ``[rotorquant]`` extra removed from [pyproject.toml](pyproject.toml). ``CHAOSENGINE_VENDOR_PATH`` env var dropped. 
Cache strategy speed/quality maps in [helpers/cache.py](backend_service/helpers/cache.py) trimmed to remaining strategies. | | ~~FU-031~~ | ~~Extend `DRAFT_MODEL_MAP` for new z-lab DFlash drafters + pin TriAttention~~ | **Shipped 2026-05-10.** | z-lab published draft checkpoints for several new families since the last `DRAFT_MODEL_MAP` audit; the upstream `dflash-mlx` 0.1.5 release also added the Gemma4 backend (commit 05cc456). Added entries for `google/gemma-4-31B-it`, `google/gemma-4-26B-A4B-it`, `Qwen/Qwen3.5-122B-A10B`, `MiniMaxAI/MiniMax-M2.5`, `MiniMaxAI/MiniMax-M2.7`, `moonshotai/Kimi-K2.6` (all in [dflash/__init__.py](dflash/__init__.py)) plus `mlx-community/...` aliases for each so Apple Silicon quants resolve. New 7 unit tests in [tests/test_dflash.py](tests/test_dflash.py) pin the mappings. **Same commit also pinned TriAttention** to `c3744ee6a50522a1559a577f85aef2b165a344f2` in [pyproject.toml](pyproject.toml) — previously the `[triattention]` and `[triattention-mlx]` extras pulled `git+...git` HEAD, which made fresh installs non-reproducible whenever the upstream landed unreleased work. Pin matches the v0.2.0 release surface plus the AMD GPU port. | -| FU-032 | TurboQuant+ ([TheTom/turboquant_plus](https://github.com/TheTom/turboquant_plus)) Apple Silicon Metal kernels (**watch-closely**) | Re-evaluate when upstream tags v1.0 release or beats `turboquant-mlx-full` 0.3.0 on a public M-series benchmark | Same author as our `llama-cpp-turboquant` fork. Adds Walsh-Hadamard rotation (improvement over base TurboQuant's Hadamard-only path) + a sparse-V optimization on M5 Max that achieves 0.93x of q8_0 decode speed at long context while saving 50–64% of KV memory. Reported numbers: turbo3 4.6× compression at +1.06% PPL, turbo4 3.8× compression at +0.23% PPL — comparable to our existing `turboquant-mlx-full` pin but with newer kernels. 326 commits + community tested across M1/M2/M3/M5. 
**Not on PyPI** (development install via `git clone` + `pip install -e .[dev]`), so adopting it means a vendored or git+url install pattern like dflash-mlx — re-evaluate when upstream publishes a wheel or tags a v1.0. Apple Silicon stays on `turboquant-mlx-full` for now; the underlying llama-server-turbo binary already exposes turbo2/3/4 cache types. | +| FU-032 | TurboQuant+ ([TheTom/turboquant_plus](https://github.com/TheTom/turboquant_plus)) Apple Silicon Metal kernels (**watch-closely**) | Re-evaluate when upstream tags v1.0 release or beats `turboquant-mlx-full` 0.8.0 on a public M-series benchmark | Same author as our `llama-cpp-turboquant` fork. Adds Walsh-Hadamard rotation (improvement over base TurboQuant's Hadamard-only path) + a sparse-V optimization on M5 Max that achieves 0.93x of q8_0 decode speed at long context while saving 50–64% of KV memory. Reported numbers: turbo3 4.6× compression at +1.06% PPL, turbo4 3.8× compression at +0.23% PPL — comparable to our existing `turboquant-mlx-full` pin but with newer kernels. **Not on PyPI** (development install via `git clone` + `pip install -e .[dev]`), so adopting it means a vendored or git+url install pattern like dflash-mlx — re-evaluate when upstream publishes a wheel or tags a v1.0. Apple Silicon stays on `turboquant-mlx-full` for now. **2026-06-15 scan:** latest tags are v0.3.2.1–v0.3.2.3 (HEAD `7f601a13`). Still no PyPI wheel, still no v1.0 tag. FU-032 trigger not met; updated comparison baseline from 0.3.0 to 0.8.0 since our floor advanced. | | ~~FU-033~~ | ~~dflash-mlx pin sync assert in pre-build-check~~ | **Shipped 2026-05-10.** | Caught a real bug: [pyproject.toml](pyproject.toml) and [scripts/stage-runtime.mjs](scripts/stage-runtime.mjs) had drifted to different `dflash-mlx` commit hashes (the dev `.venv` ran 0.1.5.1 while `npm run stage:runtime` was bundling 0.1.4.1 into release builds). 
Both files manually synced to `fada1eb`; new probe in [scripts/pre-build-check.mjs](scripts/pre-build-check.mjs) and [scripts/pre-build-check.sh](scripts/pre-build-check.sh) regex-extracts the commit hash from both files and fails the build when they diverge. Same probe also took the chance to drop the orphan `vendor/ChaosEngine` staleness check from both runners — that vendored path was dropped in FU-030 and would never resolve again. | | ~~FU-041~~ | ~~Qwen3-Coder-Next-MLX-4bit was mis-canonicalised as Qwen3.6-27B-4bit~~ | **Shipped 2026-05-10.** | User-spotted mismatch: their local install at `/Users/dan/AI_Models/lmstudio-community/Qwen3-Coder-Next-MLX-4bit` was surfacing as canonical repo `mlx-community/Qwen3.6-27B-4bit` in the diagnostics snapshot, picking up the wrong catalog row and the wrong DFlash drafter. Inspecting the on-disk `config.json` confirmed the model is **Qwen3-Next** (architectures `Qwen3NextForCausalLM`, `model_type: "qwen3_next"`, sparse MoE with 512 experts, hidden_size 2048, ~3B active per token) — fundamentally different from the dense Qwen3.6-27B (`qwen3` arch, hidden_size 5120). Root cause: there was no catalog variant for the lmstudio-community community MLX 4-bit conversion of Coder-Next, so the fuzzy matcher in `src/utils/library.ts::libraryVariantMatchScore` settled for the closest "MLX + 4-bit + Qwen3" entry, which happened to be the unrelated `mlx-community/Qwen3.6-27B-4bit` row. Fix: (1) added an explicit `lmstudio-community/Qwen3-Coder-Next-MLX-4bit` variant to the `qwen3-coder-next` family in `backend_service/catalog/text_models.py` with the correct params (80B sparse, ~45 GB on disk, qwen3_next family capabilities). (2) Reverted the FU-038 DFlash aliases that wrongly pointed `mlx-community/Qwen3.6-27B-4bit / bf16 / 8bit` at `Qwen/Qwen3-Coder-Next` — those quants are the dense 27B Coder and have no drafter today. 
(3) Replaced them with the correct `lmstudio-community/Qwen3-Coder-Next-MLX-4bit` alias plus an `-Instruct` sibling for completeness. New regression tests in `tests/test_dflash.py` pin both the new alias resolution and that the dense 27B-4bit MUST NOT alias to the MoE drafter. | | ~~FU-040~~ | ~~Tool-call parser misses open-only `` + Qwen3.6-27B false-positive vision tag~~ | **Shipped 2026-05-10.** | Surfaced by a Coder-Next chat session: tool calls rendered as raw `{"name": "web_search", ...}` text in the assistant bubble with no execution, while in a separate turn the "Attach image" affordance appeared even though Qwen3.6-27B is text-only. Three fixes. (1) **Tool-call parser widened.** Old regex `\s*(\{.*?\})\s*` required a closing tag and only matched objects. Coder-Next emitted three real-world shapes in a single session: canonical (closed + object), open-only (no ``), and array-shaped (model hallucinated a list of pseudo-results). The new parser uses `json.JSONDecoder.raw_decode` on each `` opener so it consumes the next valid JSON value regardless of close tag, dispatches objects with a `name`, drops list payloads silently, and continues scanning so a later well-formed call in the same message still lands. 7 new unit tests in `tests/test_agent.py` pin all three shapes plus the OpenAI-style stringified-arguments path. (2) **`_strip_tool_call_xml` helper** removes the JSON region the parser consumed from `result.text` before the streaming layer hands it to the chat bubble — fixes the "raw XML next to the ToolCallCard" duplication. Applied in both `run_agent_loop` and `run_agent_loop_streaming`. 6 new unit tests pin the strip behaviour. (3) **Qwen3.6-27B + Qwen3.5 catalog cleanup.** Dense Qwen3.6-27B (Coder-Next branding), Qwen3.6-27B-FP8, mlx-community/Qwen3.6-27B-4bit, and the family-level Qwen3.6 + Qwen3.5 entries all carried the `vision` capability — a copy-paste bug from when the catalog was scaffolded. 
Vision lives on a separate `Qwen3.6-27B-VL` variant we do not yet ship; the stale tag was promoting `supportsVision: true` for every community quant, making `ChatComposer` render the "Attach image" affordance for a text-only model. Dropped the tag from all five entries. | @@ -182,22 +182,22 @@ no longer relevant. | ~~FU-062~~ | ~~Bump `turboquant-mlx-full` floor `>=0.3.0` → `>=0.4.0`~~ | **Shipped 2026-05-25 (v0.9.3).** | Upstream `turboquant-mlx-full` 0.4.1 on PyPI (installed was 0.3.0, FU-001 pin). v0.4.0 added **expert streaming** — pages router-selected MoE experts from disk per token, runs models whose weights exceed available RAM. Live-validated upstream against `Qwen3.6-35B-A3B` (35B sparse) on a 16 GB Mac mini in under 4 GB RAM, output bit-identical to fully-resident model. Compounds with our existing Hadamard rotation + Lloyd-Max codebook K/V compression. Floor bump only — no API changes required, runtime continues to call `TurboQuantKVCache` with the same signature. Pin lives in [pyproject.toml](pyproject.toml) `[turboquant]` extra. Apple Silicon only (CUDA users stay on the `llama-server-turbo` binary path via FU-001's parallel track). | | ~~FU-063~~ | ~~Bump `mlx-vlm` floor `>=0.4.0` → `>=0.5.0`~~ | **Shipped 2026-05-25 (v0.9.3).** | Upstream `mlx-vlm` 0.5.0 on PyPI (installed was 0.4.4). Minor bump, no API breakage at our call surface (`mlx_vlm.load` + `mlx_vlm.generate` from [mlx_worker_multimodal.py](backend_service/mlx_worker_multimodal.py)). Floor bump in [pyproject.toml](pyproject.toml) `[mlx-vlm]` extra; loose `>=` semantics mean existing 0.4.x installs are still satisfied locally, but fresh installs pick up the newer wheel which carries the upstream Qwen3.5-VL + GLM-4.5V fixes. | | ~~FU-064~~ | ~~Add `ggml-org/Qwen3.6-{27B,35B-A3B}-GGUF` non-MTP catalog rows~~ | **Shipped 2026-05-25 (v0.9.3).** | ggml-org published canonical Q8_0 non-MTP companion packs on 2026-05-22 alongside the MTP variants we wired in FU-047. 
Two new rows in [text_models.py](backend_service/catalog/text_models.py) `qwen-3-6` family: `ggml-org/Qwen3.6-27B-GGUF` (Q8_0, 29 GB, dense) + `ggml-org/Qwen3.6-35B-A3B-GGUF` (Q8_0, 37 GB, MoE). Catalog note steers users at the MTP siblings when they want spec-dec. No runtime changes — direct `llama.cpp` lane, same as the lmstudio-community Q4_K_M variants already shipping. | -| FU-065 | Pin `llama-cpp-turboquant` to a commit hash instead of branch HEAD | Trigger: any user-reported build divergence between two install runs, OR a release-build gate where reproducibility matters more than tracking upstream. | [scripts/build-llama-turbo.sh](scripts/build-llama-turbo.sh) + [scripts/update-llama-turbo.sh](scripts/update-llama-turbo.sh) currently clone `TheTom/llama-cpp-turboquant` at branch `feature/turboquant-kv-cache` (`LLAMA_TURBO_BRANCH` env var), then `git reset --hard origin/$TURBO_BRANCH`. Two installs at different times can ship different binaries — the same drift problem FU-033 fixed for `dflash-mlx`. Today's branch HEAD is `2cbfdc62a1a047b01377948dfdede8cb6a744866`. Plan: add `LLAMA_TURBO_COMMIT="${LLAMA_TURBO_COMMIT:-2cbfdc62...}"` to both scripts, `git checkout "$LLAMA_TURBO_COMMIT"` after fetch, surface the hash in `llama-server-turbo.version`, and add a sync-assert to `pre-build-check` that compares the build-script pin to a value in [pyproject.toml](pyproject.toml) or a dedicated `UPSTREAM_PINS.md`. Defer because (a) branch is single-purpose with low churn — author is the same TheTom we already trust for `turboquant_plus`; (b) we already have the v0.9.2 → v0.9.3 release with this code path working. | -| FU-066 | Audit `cache-strategy-matrix` runner against bumped `turboquant-mlx-full` 0.4.x | When FU-062's bump lands in CI or when a user reports a TurboQuant regression. | The runner's TurboQuant cell (`mlx-community/Qwen3-0.6B-4bit × cacheStrategy=turboquant cacheBits=3`) passed against 0.3.0 with output hash `b4337bc07457` (FU-051 evidence). 
0.4.x's expert-streaming code path is a no-op for dense 0.6B but flips on for MoE models like `mlx-community/Qwen3.6-35B-A3B-4bit` — worth a one-time live capture of an MoE turboquant cell against the 0.4.x wheel to lock in a baseline hash. No code changes; just record the number once the bumped wheel is installed on the M4 Max box. | +| FU-065 | Pin `llama-cpp-turboquant` to a commit hash instead of branch HEAD | Trigger: any user-reported build divergence between two install runs, OR a release-build gate where reproducibility matters more than tracking upstream. | [scripts/build-llama-turbo.sh](scripts/build-llama-turbo.sh) + [scripts/update-llama-turbo.sh](scripts/update-llama-turbo.sh) currently clone `TheTom/llama-cpp-turboquant` at branch `feature/turboquant-kv-cache` (`LLAMA_TURBO_BRANCH` env var), then `git reset --hard origin/$TURBO_BRANCH`. Two installs at different times can ship different binaries — the same drift problem FU-033 fixed for `dflash-mlx`. Today's branch HEAD is `2cbfdc62a1a047b01377948dfdede8cb6a744866`. Plan: add `LLAMA_TURBO_COMMIT="${LLAMA_TURBO_COMMIT:-2cbfdc62...}"` to both scripts, `git checkout "$LLAMA_TURBO_COMMIT"` after fetch, surface the hash in `llama-server-turbo.version`, and add a sync-assert to `pre-build-check` that compares the build-script pin to a value in [pyproject.toml](pyproject.toml) or a dedicated `UPSTREAM_PINS.md`. Defer because (a) branch is single-purpose with low churn — author is the same TheTom we already trust for `turboquant_plus`; (b) we already have the v0.9.2 → v0.9.3 release with this code path working. **2026-06-11 release scan:** branch HEAD has drifted `2cbfdc62…` → `73eb521daebc85da7c91d37178940b99a5524cf6` — confirms the reproducibility risk this row tracks. Pin still deferred: pinning the *drifted* `73eb521d` is unsafe without a verified test-compile (could ship a broken turbo binary), and reverting-pinning to the known-good `2cbfdc62` drops upstream work. 
When picked up, pin to a commit that's been build-tested on the M4 Max box. **2026-06-15 release scan:** branch HEAD drifted again → `7985f6b90bf19881ab7c7a8444954e91cae36056`. Reproducibility risk continues to accumulate. Still deferred pending test-compile. | +| FU-066 | Audit `cache-strategy-matrix` runner against bumped `turboquant-mlx-full` 0.8.x | When 0.8.0 floor is installed on the M4 Max box or when a user reports a TurboQuant regression. | The runner's TurboQuant cell (`mlx-community/Qwen3-0.6B-4bit × cacheStrategy=turboquant cacheBits=3`) passed against 0.3.0 with output hash `b4337bc07457` (FU-051 evidence). 0.4.x expert-streaming + 0.5.x parallel prefetch + 0.8.x Mamba/hybrid arch support are all no-ops for dense 0.6B but may affect MoE models. **2026-06-15:** floor bumped `>=0.6.2` → `>=0.8.0` in [pyproject.toml](pyproject.toml). Worth a one-time live capture of the TurboQuant cell against 0.8.0 once the wheel is installed locally. Bumped threshold from "0.4.x" to "0.8.x" to track the current floor. | | ~~FU-072~~ | ~~Restore `vision` capability to Qwen3.5 + Qwen3.6 families (reverse FU-040)~~ | **Shipped 2026-05-28.** | FU-040 (2026-05-10) removed `vision` from Qwen3.6-27B + family, asserting the dense model was text-only with vision on "a separate `Qwen3.6-27B-VL` we don't ship." Re-checking upstream on 2026-05-28: **every** Qwen3.5/3.6 `config.json` now ships `architectures: [Qwen3_5ForConditionalGeneration]` / `[Qwen3_5MoeForConditionalGeneration]` with `vision_config` + `image_token_id` + `vision_start/end_token_id` — the base models are natively multimodal. `mlx-vlm` ships `qwen3_5` + `qwen3_5_moe` model support, and the `ggml-org/*-GGUF` packs include an `mmproj-*.gguf` sibling (auto-wired by `llama_cpp_engine._resolve_mmproj_path` → `--mmproj`). The catalog was also internally inconsistent (Qwen3.5-9B tagged vision, Qwen3.5-4B not, same arch). 
Re-added `vision` across both families in [text_models.py](backend_service/catalog/text_models.py): qwen-3-6 family-level + all 11 variants; qwen-3-5 family-level + `Qwen3.5-4B` (vision+video, matching its 9B sibling) + `lmstudio-community/Qwen3.5-9B-GGUF`. **Safety net (why this can't resurrect the FU-040 broken-button bug):** the composer "Attach image" affordance ([ChatComposer.tsx:129](src/features/chat/ChatComposer.tsx)) reads the *runtime* `supportsVision`, which [catalog/capabilities.py](backend_service/catalog/capabilities.py) demotes to False for the MLX worker (carries no images today) and gates on actual `--mmproj` resolution for GGUF ([llama_cpp_engine.py:737](backend_service/inference/llama_cpp_engine.py) `visionEnabled=attempt_mmproj_path is not None`). So the catalog `vision` tag now drives only the variant-picker / discover badges (capability-in-principle), while the functional button stays runtime-accurate. `gemma-4` was already correctly vision-tagged (mlx-vlm `gemma4` support) — left untouched. Catalog parses + `test_capabilities` / `test_mmproj_vision` green. | | ~~FU-075~~ | ~~MLX spec-dec silently broken — stale `configure_full_attention_split` import~~ | **Shipped 2026-05-29.** | **Highest-impact bug this sweep.** Inspecting the matrix runtimeNotes (not just pass/fail) revealed the MLX DFlash / DDTree / MTPLX cells were *passing the weak non-empty-output check while NOT actually running spec-dec* — `actual_strategy: native`, note `dflash-mlx could not be imported (cannot import name 'configure_full_attention_split' from 'dflash_mlx.runtime')`. Root cause: dflash-mlx 0.1.5 moved the pre-0.1.5 top-level `configure_full_attention_split` onto the per-family `target_ops` adapter (the FU-006 migration that rewrote `ddtree.py` — but [mlx_worker_lifecycle.py:153](backend_service/mlx_worker_lifecycle.py) was missed). 
Python evaluates the whole `from … import a, b` line, so the failed `configure_full_attention_split` symbol killed the co-imported `load_draft_bundle` too → `_dflash_generator` never loaded → **every** MLX spec-dec path fell back to standard generation for all users. Fix: import `load_draft_bundle` + `resolve_target_ops` (both still top-level), resolve the adapter, and call `target_ops.configure_full_attention_split(...)` only for the `hybrid_gdn` family (it's a no-op for pure-attention Qwen3/3.5/3.6 — upstream only calls it there). Live-verified after fix: DFlash note "DFLASH speculative decoding active (draft: z-lab/Qwen3-4B-DFlash-b16)", DDTree "DDTree active (budget=16)". | | ~~FU-076~~ | ~~MTP tensor probe missed top-level `mtp.` keys → MTPLX never selected~~ | **Shipped 2026-05-29.** | The matrix MTPLX cell routed to the DFlash path instead of `MtplxEngine`. `RuntimeController._select_engine` gates MTPLX on `has_mtp_heads_strict(repo, path)`, which calls `model_has_mtp_tensors(path)` → scans the safetensors index against `_MTP_TENSOR_HINTS = ('mtp_heads.', 'mtp_decoder.', 'mtp_emb.', 'model.mtp.', '.mtp.')`. Every hint assumes a *nested* key, but Qwen3.5 / Qwen3.6 ship the MTP head as **top-level** `mtp.layers.*` / `mtp.fc.weight` (no leading prefix) — so the probe returned False on a genuinely MTP-bearing model and MTPLX was skipped. Live-confirmed: `model_has_mtp_tensors` returned False on the real `Qwen/Qwen3.5-4B` snapshot. Fix in [_mtp.py](backend_service/inference/_mtp.py): also match `tensor_name.startswith("mtp.")`. New `test_safetensors_index_with_top_level_mtp_keys` in [tests/test_inference.py](tests/test_inference.py). 
| | ~~FU-077~~ | ~~MTPLX isolated venv had a truncated install (missing server deps)~~ | **Shipped 2026-05-29.** | After FU-076 routed correctly, `MtplxEngine` startup died: `ModuleNotFoundError: No module named 'numpy'` — and then `safetensors`, `uvicorn`, `fastapi`, `pydantic`, `mlx-lm`, `rich`… The `~/.chaosengine/mtplx-venv` was a *truncated* install (interrupted `pip install mtplx`), but the installer's verify only ran `import mtplx`, which succeeds because the server deps are imported lazily by `mtplx.server.openai` (not at package top level). Fixed the live venv with a full `pip install --upgrade mtplx` (0.3.5 → 0.3.7, pulled all deps). Hardened [scripts/install-mtplx.sh](scripts/install-mtplx.sh): the verify now imports `mtplx.server.openai` (the real server entrypoint) and auto-retries a full dependency install once before failing loudly, so a truncated install can't pass silently again. | | ~~FU-078~~ | ~~MtplxEngine handed MTPLX a bare repo id instead of the local snapshot path~~ | **Shipped 2026-05-29.** | Final MTPLX blocker: `mtplx quickstart` died with "model is not available locally. Run: mtplx pull Qwen/Qwen3.5-4B" — it resolves a model *id* against its own registry/cache, not the HF hub cache. [mtplx_engine.py](backend_service/inference/mtplx_engine.py) set `model_arg = path or runtime_target or model_ref`, and for raw HF-org repos `path` is None while `runtime_target` is the *repo id* (`Qwen/Qwen3.5-4B`), so MTPLX got an id it couldn't find. Fix: whenever the candidate isn't an existing local directory, resolve the already-downloaded HF snapshot dir via `snapshot_download(model_ref, local_files_only=True)` (no network) and pass that. Live-verified: MTPLX now **loads + engages** (note "MTPLX MTP speculative decoding active (draft tokens: 1, model: Qwen3.5-4B)", reports 17.8 tok/s) instead of failing to start. 
Also fixed the matrix runner's `0.0 tok/s` (read `done.assistant.metrics.tokS`, not a non-existent top-level `tokensPerSecond`) + captured `dflashAcceptanceRate`. **Verified-genuine after these fixes: DFlash (33.2 tok/s), DDTree (31.4 tok/s), GGUF-MTP (14.7 tok/s), turboquant MLX/GGUF, triattention, native** — all stream real output with real throughput. MTPLX still has one remaining issue → FU-079. | | ~~FU-080~~ | ~~Backend cold start dragged in torch via cache-strategy availability probes~~ | **Shipped 2026-05-29.** | `python -X importtime backend_service.app` measured **2.6 s**, of which **1.64 s was `diffusers.hooks`** (→ `torch` → `torch._dynamo` → `sympy`) — blowing the CLAUDE.md "< 2 s backend startup" target. Traced the chain: state init → system snapshot → `_get_cache_strategies()` → `registry.available()` instantiates every strategy and calls `is_available()`, and the 5 diffusion strategies (fbcache / taylorseer / magcache / pab / fastercache) answered availability by **actually importing `diffusers.hooks`** — pulling the whole torch stack onto the cold-start path on every launch. Fix: new [cache_compression/_diffusers_probe.py](cache_compression/_diffusers_probe.py) `diffusers_at_least(major, minor)` reads the installed version via `importlib.metadata.version` (metadata only — never executes `diffusers.__init__`, so no torch). Each `is_available()` now gates on the version (fbcache ≥0.36, the other four ≥0.38); the real `diffusers.hooks` import stays lazy inside each `apply_*` method (still raises a clean NotImplementedError on a broken install). Result: `diffusers` / `torch` / `mlx` are **no longer in `sys.modules` after `import backend_service.app`**, import time dropped **2.6 s → ~0.85 s**, and cold-start → first `/api/health` 200 is **2.34 s** (the native-backend MLX subprocess probe was already async — "detection still running" on first health, never blocked startup). 
Two subprocess-isolated regression guards in [tests/test_cache_strategies.py](tests/test_cache_strategies.py) (`StartupImportPurityTests`) assert neither `registry.available()` nor `import backend_service.app` pulls torch/diffusers, so this can't silently regress. All 5 diffusion strategies still report `available=True` against the installed diffusers 0.38. | -| FU-079 | MTPLX proxy doesn't surface incremental tokens to the chat stream (empty output) | Active — MTPLX-specific, lower priority (FU-048: MTPLX is ~flat-to-slower vs the alternatives, which all work). | After FU-075–078, the matrix MTPLX cell flipped from "fake pass via DFlash fallback" to **engine genuinely engaged but `FAIL — empty output`**: the loaded-model note confirms "MTPLX MTP active (draft tokens: 1)" and the done event carries a real `tokS` (17.8), but the streamed assistant text is empty (output SHA `e3b0c44298fc` = the empty-string hash). Confirmed the chat stream's incremental token field IS `{"token": "..."}` (DFlash/DDTree/GGUF-MTP/native all stream through it fine on the same `/api/chat/generate/stream` endpoint), so the gap is in `MtplxEngine`'s OpenAI-`/v1`-proxy → SSE adapter: it surfaces final metrics but not per-token deltas, leaving `full_text` empty for both the matrix runner AND the real Chat UI. Plan: inspect `MtplxEngine.generate` / its streaming proxy in [mtplx_engine.py](backend_service/inference/mtplx_engine.py), map the mtplx server's `/v1/chat/completions` SSE `choices[].delta.content` chunks onto our `{"token": ...}` event shape. Until fixed, MTPLX loads but produces no visible output — DFlash is the working MLX spec-dec lane for the same models (and faster per FU-048). | +| FU-079 | MTPLX proxy doesn't surface incremental tokens to the chat stream (empty output) | Active — MTPLX-specific, lower priority (FU-048: MTPLX is ~flat-to-slower vs the alternatives, which all work). 
| After FU-075–078, the matrix MTPLX cell flipped from "fake pass via DFlash fallback" to **engine genuinely engaged but `FAIL — empty output`**: the loaded-model note confirms "MTPLX MTP active (draft tokens: 1)" and the done event carries a real `tokS` (17.8), but the streamed assistant text is empty (output SHA `e3b0c44298fc` = the empty-string hash). Confirmed the chat stream's incremental token field IS `{"token": "..."}` (DFlash/DDTree/GGUF-MTP/native all stream through it fine on the same `/api/chat/generate/stream` endpoint), so the gap is in `MtplxEngine`'s OpenAI-`/v1`-proxy → SSE adapter: it surfaces final metrics but not per-token deltas, leaving `full_text` empty for both the matrix runner AND the real Chat UI. Plan: inspect `MtplxEngine.generate` / its streaming proxy in [mtplx_engine.py](backend_service/inference/mtplx_engine.py), map the mtplx server's `/v1/chat/completions` SSE `choices[].delta.content` chunks onto our `{"token": ...}` event shape. Until fixed, MTPLX loads but produces no visible output — DFlash is the working MLX spec-dec lane for the same models (and faster per FU-048). **2026-06-11 release scan:** MTPLX reached **v1.0.0 + v1.0.1** (PyPI; was 0.3.5 on this box). The installer ([scripts/install-mtplx.sh](scripts/install-mtplx.sh)) is unpinned (`pip install --upgrade mtplx`), so a fresh install now auto-pulls v1.0.1 — no code change needed. v1.0.0 release notes claim `/v1/completions` now "streams tokens as they are generated, with real finish reasons and usage", which **may resolve this empty-output** at the source. Still HTTP-server-only (the FU-048 in-process-API root persists). **Action: re-test FU-079 against v1.0.1 with a live MTPLX run** (reinstall the mtplx venv → load an MTP model → confirm the chat stream surfaces per-token `{"token": …}` deltas). If v1.0.0's streaming fixed it, this row closes with no adapter change. **2026-06-15 release scan:** MTPLX now at **v1.0.4** (was v1.0.1). 
Installer remains unpinned so fresh installs pick up 1.0.4 automatically. Re-test action unchanged — priority to validate before next release. | | ~~FU-074~~ | ~~GGUF MTP speculative decoding had no UI toggle~~ | **Shipped 2026-05-28.** | FU-047 wired the GGUF MTP backend (`--spec-type draft-mtp`, gated on the `speculativeDecoding` request flag in [llama_cpp_engine.py:531](backend_service/inference/llama_cpp_engine.py)) + the `ggufMtpAvailable` capability flag, but never surfaced a UI control. The launch modal's only spec-dec toggles are DFlash (hidden for GGUF — "not supported with llama.cpp models") and MTPLX (Apple-Silicon MLX only), so a user loading `ggml-org/Qwen3.6-27B-MTP-GGUF` had **no way to enable** the lane — only the matrix runner could, by POSTing `speculativeDecoding=true` directly. The button audit (this turn) caught it. Added an `isMtpGgufRepo(repo)` helper in [runtimeSupport.ts](src/components/runtimeSupport.ts) (mirrors backend `is_mtp_gguf_repo`: MTP-flavoured name on a GGUF repo) + a "GGUF MTP" toggle in [RuntimeControls.tsx](src/components/RuntimeControls.tsx), shown only when `isGgufBackend && isMtpGgufRepo(selectedCanonicalRepo)` (FU-034 hide-when-not-applicable). It binds to the same `speculativeDecoding` flag the backend reads; no cache-strategy lock (GGUF KV cache is orthogonal to MTP draft decode, unlike MLX DFlash which forces native). Also patched the DFlash-availability reset effect (was clearing `speculativeDecoding` for any non-DFlash model — would have instantly un-ticked the GGUF-MTP box) to keep it on for `ggufMtpModelSupported`. Old binaries without `--spec-type` fall back to standard decode + a runtimeNote (backend FU-047 path) — acceptable since the bundled llama-server is current; a future refinement could additionally gate the toggle on the `ggufMtpAvailable` capability for old-binary boxes (needs the flag threaded through the ~8 RuntimeControls call sites). 
8 new `isMtpGgufRepo` unit tests in [runtimeSupport.test.ts](src/components/__tests__/runtimeSupport.test.ts). Verified live: matrix `gguf MTP (Qwen3.6-27B)` cell PASS (sha 74a1eca8b3b4). | | ~~FU-073~~ | ~~Matrix MTPLX cell targeted a non-MTP VL model~~ | **Shipped 2026-05-28.** | `scripts/cache-strategy-matrix.py` `MID_MLX_MTPLX_CAPABLE` was `mlx-community/Qwen3.5-4B-bf16` — a VL conversion (ships `video_preprocessor_config.json`) that carries no MTP heads and is absent from both `MTP_MODEL_MAP` and `_MTP_ALIASES`, so the MTPLX cell could never have exercised MTP even with the model on disk (it'd fail the `has_mtp_heads_strict` tensor probe). Switched to the canonical `Qwen/Qwen3.5-4B`, which is a direct `MTP_MODEL_MAP` key (verified `mtp.layers.*` + `mtp.fc.weight` in its safetensors index), a catalog variant (so it passes the `library_refs` check), and downloaded to exercise the lane. Pairs with the FU-070 download-skip classifier so the cell reports honestly on boxes without the model. | | ~~FU-071~~ | ~~DDTree availability probe checks pre-0.1.5 symbol names~~ | **Shipped 2026-05-28.** | The cache-strategy matrix `ddtree spec-dec` cell skipped with *DDTree runtime not available* even though `dflash_mlx` 0.1.5.1 is installed and `backend_service/ddtree.py` works. Root cause: `dflash.is_ddtree_available()` ([dflash/__init__.py](dflash/__init__.py)) source-greps the installed `dflash_mlx.runtime` for three required symbols and the list was stale — it required `target_forward_with_hidden_states`, which dflash-mlx 0.1.5 **renamed** to the per-family adapter `target_ops.forward_with_hidden_capture` (the same FU-006 migration that rewrote our `ddtree.py` to call `resolve_target_ops(target_model)`). The probe was never updated alongside that rewrite, so it required a symbol that (a) no longer exists in any modern dflash-mlx build (`grep -c` = 0 in the installed `runtime.py`) and (b) our own code no longer uses. 
Confirmed the real contract our DDTree path imports: `resolve_target_ops` (ddtree.py adapter entry), `load_draft_bundle` (worker lifecycle), `stream_dflash_generate` (speculative). Updated `required_symbols` to those three; dropped the obsolete name + the unused `load_target_bundle`. `dflash.is_ddtree_available()` now returns `True` on this M4 Max box. 4 new `DDTreeAvailabilityProbeTests` in [tests/test_dflash.py](tests/test_dflash.py) mock the runtime source so a future rename can't silently regress the probe again. Note: when FU-057 bumps dflash-mlx to 0.1.7 (which removes `configure_full_attention_split` and reshapes `stream_dflash_generate`), this probe + the lifecycle import need re-checking in lockstep. | | ~~FU-070~~ | ~~Matrix runner: classify missing-download as SKIP, not FAIL~~ | **Shipped 2026-05-28.** | The full `scripts/cache-strategy-matrix.py` sweep on 2026-05-28 reported the `gguf MTP (Qwen3.6-27B)` cell as **FAIL** — `POST /api/models/load -> 500: Cannot load 'ggml-org/Qwen3.6-27B-MTP-GGUF': No .gguf, .safetensors, or pytorch weights found in HF cache entry.` Root cause: the repo had an empty `~/.cache/huggingface/hub/models--ggml-org--Qwen3.6-27B-MTP-GGUF/` dir (4.0 KB, only `refs/main`, dated May 16 — an interrupted pull), and the runner's `skip_reason` library check uses `caps.library_refs`, which is built from the **catalog** (every variant repo from `/api/workspace`), not from what's actually downloaded. So a catalogued-but-undownloaded model passes the library check and only errors at load — reported as a product FAIL when it's really a missing download (same false-positive class as FU-053). Fix: new pure helper `classify_load_skip(msg)` in [scripts/cache-strategy-matrix.py](scripts/cache-strategy-matrix.py) matches the backend's 'no weights found in HF cache entry' markers; `run_cell` now wraps the load call separately and converts that specific error into `skipped=True, skip_reason="weights not downloaded ()"` instead of a failure. 
Genuine load errors (OOM, etc.) still surface as fails. 4 unit tests in [tests/test_cache_strategy_matrix_runner.py](tests/test_cache_strategy_matrix_runner.py) (`ClassifyLoadSkipTests`) pin the classification. The dflash/mtplx cells already skipped correctly because their target models (`mlx-community/Qwen3-4B-bf16` / `Qwen3.5-4B-bf16`) aren't catalog variants so they never entered `library_refs`. **To actually exercise the GGUF-MTP lane (FU-047/FU-052 trip-wire), download `ggml-org/Qwen3.6-27B-MTP-GGUF` first**, then re-run full. | | ~~FU-069~~ | ~~Bump `turboquant-mlx-full` floor `>=0.4.0` → `>=0.5.0`~~ | **Shipped 2026-05-28.** | Upstream `turboquant-mlx-full` 0.5.0 on PyPI (FU-062 had just floored at 0.4.0 on 2026-05-25). v0.5.0 builds on the v0.4.0 expert-streaming path (FU-062) with **parallel expert prefetch** — the missing MoE experts for each layer are read on a thread pool (`--prefetch-workers`, default `8`) so SSD latency hides behind compute. Upstream-reported **~1.9× faster decode** at a tight cache budget, still bit-identical output. `--prefetch-workers 1` restores the serial v0.4.0 behaviour. No API change at our call surface — runtime still constructs `TurboQuantKVCache` with the same signature; the new flag is converter/runtime-side. Floor bump only in [pyproject.toml](pyproject.toml) `[turboquant]` extra; loose `>=` so existing 0.4.x installs stay satisfied locally. Apple Silicon only. Folds in the spirit of FU-066 (the matrix MoE-turboquant baseline should be captured against 0.5.0 once the wheel is installed on the M4 Max box). | | ~~FU-068~~ | ~~MLX probe timeout 12 s → 20 s~~ | **Shipped 2026-05-25 (v0.9.3).** | E2E full-sweep Phase 1 surfaced three intermittent fails on a freshly-booted backend — `MLX native cache` / `MLX TurboQuant cache` / `fused attention flag` all returned `MLX backend requested but unavailable: ...mlx_worker probe timed out after 12.0 seconds`. 
Measured cold-start: `time .venv/bin/python -m backend_service.mlx_worker probe` = **12.43 s** on M4 Max / Python 3.11 against current `mlx 0.31.2` + `mlx-lm 0.31.3` + `mlx-vlm 0.4.4` — 0.4 s past the 12.0 s ceiling. The 12.0 s value was an arbitrary default from the v0.8.0 `capabilities.py` extract (commit `f91709e`), never tuned. Bumped to **20.0 s** in [backend_service/inference/capabilities.py](backend_service/inference/capabilities.py) `_probe_native_backends` — ~60% headroom over today's envelope. Phase 5 video gen + Phase 1 GGUF / DFlash / cache-preview already passed (proves MLX itself works once the probe lands), so this was a pure cold-boot probe timing issue, not a regression from the FU-062 / FU-063 floor bumps (which are loose `>=`, no installed package changed). | -| FU-067 | Watch dflash-mlx for v0.1.8+ migration guide (FU-057 is multi-hour, deferred) | Trigger: (a) upstream publishes v0.1.8 with a stability commitment + migration guide, OR (b) we hit a concrete user-visible bug on the orphan `fada1eb` pin, OR (c) a shipped catalog model needs a v0.1.6+ feature (adaptive verify / Gemma4 backend / Qwen3-Next GDN). | Dup of FU-057's trigger but resurfaced after the v0.9.3 upstream scan confirmed v0.1.7 is now on PyPI (`pip install dflash-mlx==0.1.7` resolves) and tagged at commit `210a0fc1`. Plan-of-record stays FU-057's six-step migration. Re-checking quarterly via `git ls-remote --tags` for `v0.1.8` / `v0.2.0` release tags — if upstream publishes a migration guide alongside, the cost drops dramatically. | +| FU-067 | Watch dflash-mlx for v0.1.8+ migration guide (FU-057 is multi-hour, deferred) | Trigger: (a) upstream publishes v0.1.8 with a stability commitment + migration guide, OR (b) we hit a concrete user-visible bug on the orphan `fada1eb` pin, OR (c) a shipped catalog model needs a v0.1.6+ feature (adaptive verify / Gemma4 backend / Qwen3-Next GDN). 
| Dup of FU-057's trigger but resurfaced after the v0.9.3 upstream scan confirmed v0.1.7 is now on PyPI (`pip install dflash-mlx==0.1.7` resolves) and tagged at commit `210a0fc1`. Plan-of-record stays FU-057's six-step migration. Re-checking quarterly via `git ls-remote --tags` for `v0.1.8` / `v0.2.0` release tags — if upstream publishes a migration guide alongside, the cost drops dramatically. **2026-06-11 release scan:** **v0.1.9** is now tagged (branch HEAD `7f884380`; tags `v0.1.5.1…v0.1.9`). Still no published migration guide, so FU-057's six-step rewrite stays the plan of record and remains deferred. Newest migration target is now v0.1.9 (was v0.1.7/v0.1.8). **2026-06-15 release scan:** **v0.1.10** now tagged (branch HEAD `9ca00289`). One more release since last scan; migration target advances to v0.1.10. No migration guide published. FU-057 deferred. | | ~~FU-061~~ | ~~"Watching upstream" badge + disabled download for tracked-only image seeds~~ | **Shipped 2026-05-18.** | User-reported gap: downloaded `baidu/ERNIE-Image-Turbo` from Image Discover (it sits in `LATEST_IMAGE_TRACKED_SEEDS`), expected it in the Studio dropdown, didn't appear. Root cause: tracked seeds are discovery-only — Studio's dropdown is fed by `IMAGE_MODEL_FAMILIES` which requires explicit pipeline routing (flow-match flags, sampler registry, scheduler defaults). ERNIE-Image (+ Nucleus-Image, Z-Image, HiDream, GLM-Image, FLUX.2 family) has no diffusers-routable Studio variant yet. Fix path A picked over path B (full per-family pipeline wiring) — surgical UX disambiguation. **Backend:** new `_is_launchable_image_repo(repo_id)` helper in [backend_service/helpers/images.py](backend_service/helpers/images.py) returns True only when `repo_id` resolves to a curated `IMAGE_MODEL_FAMILIES` variant. Wired into both payload sites — `_tracked_latest_seed_payloads` (line 411) + the live-HF lane (line 622) — so every Discover row carries `trackedOnly: bool`. 
**Frontend:** new `trackedOnly?: boolean` field on `ImageModelVariant` ([src/types/image.ts](src/types/image.ts)). [ImageDiscoverTab.tsx](src/features/images/ImageDiscoverTab.tsx) chip row gains a "Watching upstream" badge + tooltip when `trackedOnly`. Action column branches first on `trackedOnly` → renders a disabled `IconActionButton` with tooltip "Watching upstream — Studio playback for this family isn't wired yet. Catalog entry is for awareness; download won't unlock Studio." instead of the Generate / Download / Resume CTAs. Backward-compat: existing curated families have `trackedOnly: undefined` → falsy → no UX change. **Tests:** new `TrackedOnlyFlagTests` in [tests/test_image_discover.py](tests/test_image_discover.py) — 5 cases covering `_is_launchable_image_repo` (FLUX.1-dev + SDXL = true; ERNIE-Image / Nucleus-Image = false; empty = false), `trackedOnly: True` on ERNIE seed payload, and the negative case where a tracked seed that IS in IMAGE_MODEL_FAMILIES must NOT carry the flag (forward-compat for catalog evolution). **Follow-up path B (deferred):** wire ERNIE-Image / Nucleus-Image / Z-Image / HiDream / GLM-Image / FLUX.2 family as real launchable families via per-family pipeline detection in `image_runtime`. Multi-hour per family, gated on diffusers' upstream support landing for each architecture. | --- diff --git a/backend_service/agent.py b/backend_service/agent.py index 277380e..8050384 100644 --- a/backend_service/agent.py +++ b/backend_service/agent.py @@ -485,7 +485,11 @@ def run_agent_loop_streaming( # consumed so the assistant bubble doesn't show raw call # JSON next to the rendered ToolCallCard (FU-040). text = _strip_tool_call_xml(result.text) - chunk_size = 4 + # The final answer is already fully computed (tool-calling turns + # are non-streaming), so the old 4-char dribble just added fake + # latency + yields. Emit in larger chunks; the SSE layer coalesces + # these further and the user sees the answer near-instantly. 
+ chunk_size = 48 for i in range(0, len(text), chunk_size): yield {"token": text[i:i + chunk_size]} diff --git a/backend_service/catalog/text_models.py b/backend_service/catalog/text_models.py index 5fbb153..d27f5c1 100644 --- a/backend_service/catalog/text_models.py +++ b/backend_service/catalog/text_models.py @@ -881,6 +881,403 @@ "Co-developed with NVIDIA for efficient local deployment.", ], }, + { + # Frontier sparse-MoE family (DeepseekV4ForCausalLM, 256 routed experts + # / 6 active, 1M context via YaRN, baked-in MTP head -> speculative + # decoding). Text-only. Listed for discovery awareness — even the + # "small" Flash variant is 154 GB at 4-bit, so these target top-end + # desktops / workstations, not laptops. + "id": "deepseek-v4", + "name": "DeepSeek V4", + "provider": "DeepSeek", + "headline": "Frontier MoE reasoning + agentic coding; the Flash variant is the local-viable one.", + "summary": "DeepSeek V4 — Flash (284B / ~13B active) for top-end desktops, Pro (1.6T) for the frontier.", + "description": ( + "DeepSeek V4 is a sparse Mixture-of-Experts family (256 routed experts, ~6 active per token) " + "with 1M-token context via YaRN and a baked-in MTP head for speculative decoding. V4-Flash " + "activates ~13B of 284B total parameters; V4-Pro is the 1.6T flagship. Text-only, MIT-licensed." 
+ ), + "updatedLabel": "Released 2026", + "popularityLabel": "Frontier family", + "likesLabel": "DeepSeek official", + "badges": ["Reasoning", "Coding", "Agents", "Long context"], + "capabilities": ["reasoning", "coding", "agents", "tool-use"], + "defaultVariantId": "mlx-community/DeepSeek-V4-Flash-4bit", + "variants": [ + { + "id": "mlx-community/DeepSeek-V4-Flash-4bit", + "name": "DeepSeek V4 Flash MLX 4-bit", + "repo": "mlx-community/DeepSeek-V4-Flash-4bit", + "link": "https://huggingface.co/mlx-community/DeepSeek-V4-Flash-4bit", + "paramsB": 284.0, + "sizeGb": 154.0, + "format": "MLX", + "quantization": "4-bit", + "capabilities": ["reasoning", "coding", "agents", "tool-use"], + "note": "MoE 284B / ~13B active. 4-bit MLX needs ~160 GB unified memory (M3/M4 Ultra). MTP head enables speculative decoding.", + "contextWindow": "1M", + "launchMode": "direct", + "backend": "mlx", + "releaseDate": "2026-04", + }, + { + "id": "mlx-community/DeepSeek-V4-Flash-8bit", + "name": "DeepSeek V4 Flash MLX 8-bit", + "repo": "mlx-community/DeepSeek-V4-Flash-8bit", + "link": "https://huggingface.co/mlx-community/DeepSeek-V4-Flash-8bit", + "paramsB": 284.0, + "sizeGb": 284.0, + "format": "MLX", + "quantization": "8-bit", + "capabilities": ["reasoning", "coding", "agents", "tool-use"], + "note": "8-bit MLX conversion — higher fidelity, ~290 GB unified memory.", + "contextWindow": "1M", + "launchMode": "direct", + "backend": "mlx", + "releaseDate": "2026-04", + }, + { + "id": "deepseek-ai/DeepSeek-V4-Flash", + "name": "DeepSeek V4 Flash (BF16)", + "repo": "deepseek-ai/DeepSeek-V4-Flash", + "link": "https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash", + "paramsB": 284.0, + "sizeGb": 568.0, + "format": "Transformers", + "quantization": "BF16", + "capabilities": ["reasoning", "coding", "agents", "tool-use"], + "note": "Official BF16 weights — convert to MLX/GGUF locally or run on a multi-GPU box.", + "contextWindow": "1M", + "launchMode": "convert", + "backend": "mlx", + 
"releaseDate": "2026-04", + }, + { + "id": "deepseek-ai/DeepSeek-V4-Pro", + "name": "DeepSeek V4 Pro (frontier)", + "repo": "deepseek-ai/DeepSeek-V4-Pro", + "link": "https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro", + "paramsB": 1600.0, + "sizeGb": 3200.0, + "format": "Transformers", + "quantization": "BF16", + "capabilities": ["reasoning", "coding", "agents", "tool-use"], + "note": "1.6T flagship (~49B active). Frontier / awareness — needs a GPU cluster; not a local launch path.", + "contextWindow": "1M", + "launchMode": "convert", + "backend": "mlx", + "releaseDate": "2026-04", + }, + ], + "readme": [ + "DeepSeek V4 is a sparse-MoE family with 1M-token context and baked-in MTP heads for speculative decoding.", + "V4-Flash (284B / ~13B active) is the local-viable variant: the mlx-community 4-bit conversion is ~154 GB and runs on M3/M4 Ultra-class unified memory.", + "V4-Pro (1.6T) is listed for awareness; it targets multi-GPU clusters rather than a single desktop.", + ], + }, + { + # Frontier sparse-MoE family (GlmMoeDsa arch, 256 routed experts / 8 + # active, ~200K context). Text-only. Z.ai / Tsinghua. Listed for + # discovery awareness — even 4-bit GGUF is ~515 GB, so this is a + # cluster / very-high-end-workstation family, not a laptop one. + "id": "glm-5", + "name": "GLM-5", + "provider": "Z.ai", + "headline": "Z.ai / Tsinghua frontier MoE — agentic coding rivaling closed frontier models.", + "summary": "GLM-5 / GLM-5.1 sparse MoE (256 experts), ~200K context. Frontier-scale — top-end hardware only.", + "description": ( + "GLM-5 is a large sparse Mixture-of-Experts model (GlmMoeDsa architecture, 256 routed experts, " + "8 active per token) with ~200K context. GLM-5.1 is the refined release. Strong agentic coding " + "and reasoning. Text-only, open weights — frontier-scale, so even a 4-bit GGUF is ~500 GB." 
+ ), + "updatedLabel": "Released 2026", + "popularityLabel": "Frontier family", + "likesLabel": "Z.ai official", + "badges": ["Coding", "Reasoning", "Agents", "Long context"], + "capabilities": ["reasoning", "coding", "agents", "tool-use"], + "defaultVariantId": "unsloth/GLM-5.1-GGUF", + "variants": [ + { + "id": "unsloth/GLM-5.1-GGUF", + "name": "GLM-5.1 GGUF", + "repo": "unsloth/GLM-5.1-GGUF", + "link": "https://huggingface.co/unsloth/GLM-5.1-GGUF", + "paramsB": 735.0, + "sizeGb": 515.0, + "format": "GGUF", + "quantization": "Q4_K_M", + "capabilities": ["reasoning", "coding", "agents", "tool-use"], + "note": "Q4_K_M ~515 GB; the same repo ships smaller UD-IQ2 quants down to ~250 GB. Frontier-scale llama.cpp run.", + "contextWindow": "200K", + "launchMode": "direct", + "backend": "llama.cpp", + "releaseDate": "2026-05", + }, + { + "id": "mlx-community/GLM-5.1-MXFP4-Q8", + "name": "GLM-5.1 MLX MXFP4", + "repo": "mlx-community/GLM-5.1-MXFP4-Q8", + "link": "https://huggingface.co/mlx-community/GLM-5.1-MXFP4-Q8", + "paramsB": 735.0, + "sizeGb": 449.0, + "format": "MLX", + "quantization": "MXFP4", + "capabilities": ["reasoning", "coding", "agents", "tool-use"], + "note": "MXFP4 MoE quant for Apple Silicon — ~450 GB unified memory.", + "contextWindow": "200K", + "launchMode": "direct", + "backend": "mlx", + "releaseDate": "2026-05", + }, + { + "id": "zai-org/GLM-5.1", + "name": "GLM-5.1 (BF16)", + "repo": "zai-org/GLM-5.1", + "link": "https://huggingface.co/zai-org/GLM-5.1", + "paramsB": 735.0, + "sizeGb": 1507.0, + "format": "Transformers", + "quantization": "BF16", + "capabilities": ["reasoning", "coding", "agents", "tool-use"], + "note": "Official BF16 weights — convert / quantize locally or run on a multi-GPU box.", + "contextWindow": "200K", + "launchMode": "convert", + "backend": "mlx", + "releaseDate": "2026-05", + }, + { + "id": "zai-org/GLM-5", + "name": "GLM-5 (BF16)", + "repo": "zai-org/GLM-5", + "link": "https://huggingface.co/zai-org/GLM-5", + "paramsB": 
735.0, + "sizeGb": 1507.0, + "format": "Transformers", + "quantization": "BF16", + "capabilities": ["reasoning", "coding", "agents", "tool-use"], + "note": "Initial GLM-5 release; GLM-5.1 is the refined follow-up — prefer it unless reproducing a baseline.", + "contextWindow": "200K", + "launchMode": "convert", + "backend": "mlx", + "releaseDate": "2026-04", + }, + ], + "readme": [ + "GLM-5 is Z.ai / Tsinghua's frontier sparse-MoE family (GlmMoeDsa, 256 experts / 8 active), strong on agentic coding.", + "GLM-5.1 is the refined release; unsloth + mlx-community publish GGUF and MXFP4 quants.", + "Frontier-scale: a 4-bit GGUF is ~515 GB, so this family targets clusters and very-high-end workstations.", + ], + }, + { + # Google Gemma 4 — multimodal (image+text) family. Gemma4ForConditionalGeneration + # architecture with vision_config baked in; all sizes accept image inputs. + # E2B = Embedded 2B (128K ctx, ~4 GB BF16) — edge/mobile target. + # 31B = full model (256K ctx, 62.5 GB BF16) — desktop / workstation target. + # Both carry a baked-in mmproj; llama_cpp_engine wires --mmproj automatically + # when the repo has an mmproj shard (FU-072 pattern). + "id": "gemma-4", + "name": "Gemma 4", + "provider": "Google", + "headline": "Google's multimodal open model family — from edge-optimised 2B to capable 31B.", + "summary": "Gemma 4 E2B (2B, 128K) and 31B (256K) — both natively multimodal with vision_config.", + "description": ( + "Gemma 4 is Google's multimodal open-weight family (Gemma4ForConditionalGeneration). " + "The Embedded 2B (E2B) targets on-device and mobile deployment with 128K context; " + "the 31B is the full desktop/workstation variant with 256K context. " + "Both accept image + text inputs natively. Apache-2.0 licensed. " + "Google publishes QAT Q4_0 GGUFs; mlx-community and unsloth publish 4-bit and 8-bit quants." 
+ ), + "updatedLabel": "Released 2025", + "popularityLabel": "Google official", + "likesLabel": "Google official", + "badges": ["Multimodal", "Vision", "Coding", "Long context"], + "capabilities": ["vision", "coding"], + "defaultVariantId": "mlx-community/gemma-4-31b-8bit", + "variants": [ + { + "id": "mlx-community/gemma-4-31b-8bit", + "name": "Gemma 4 31B MLX 8-bit", + "repo": "mlx-community/gemma-4-31b-8bit", + "link": "https://huggingface.co/mlx-community/gemma-4-31b-8bit", + "paramsB": 31.0, + "sizeGb": 32.0, + "format": "MLX", + "quantization": "8-bit", + "capabilities": ["vision", "coding"], + "note": "8-bit MLX quant — good balance of fidelity and VRAM. Needs ~34 GB unified memory.", + "contextWindow": "256K", + "launchMode": "direct", + "backend": "mlx", + "releaseDate": "2025-05", + }, + { + "id": "unsloth/gemma-4-31B-it-GGUF", + "name": "Gemma 4 31B GGUF (Q4_K_M)", + "repo": "unsloth/gemma-4-31B-it-GGUF", + "link": "https://huggingface.co/unsloth/gemma-4-31B-it-GGUF", + "paramsB": 31.0, + "sizeGb": 19.0, + "format": "GGUF", + "quantization": "Q4_K_M", + "capabilities": ["vision", "coding"], + "note": "Q4_K_M GGUF with mmproj shard for vision. 
Runs on 24 GB VRAM or Apple Silicon.", + "contextWindow": "256K", + "launchMode": "direct", + "backend": "llama.cpp", + "releaseDate": "2025-05", + }, + { + "id": "google/gemma-4-31B-it-qat-q4_0-gguf", + "name": "Gemma 4 31B Official QAT GGUF", + "repo": "google/gemma-4-31B-it-qat-q4_0-gguf", + "link": "https://huggingface.co/google/gemma-4-31B-it-qat-q4_0-gguf", + "paramsB": 31.0, + "sizeGb": 17.0, + "format": "GGUF", + "quantization": "Q4_0 (QAT)", + "capabilities": ["vision", "coding"], + "note": "Google's official QAT (Quantization-Aware Training) Q4_0 — higher fidelity than post-hoc Q4 at the same size.", + "contextWindow": "256K", + "launchMode": "direct", + "backend": "llama.cpp", + "releaseDate": "2025-05", + }, + { + "id": "google/gemma-4-31B-it", + "name": "Gemma 4 31B (BF16)", + "repo": "google/gemma-4-31B-it", + "link": "https://huggingface.co/google/gemma-4-31B-it", + "paramsB": 31.0, + "sizeGb": 62.5, + "format": "Transformers", + "quantization": "BF16", + "capabilities": ["vision", "coding"], + "note": "Official BF16 weights — convert to MLX/GGUF locally or run on an 80 GB+ GPU.", + "contextWindow": "256K", + "launchMode": "convert", + "backend": "mlx", + "releaseDate": "2025-05", + }, + { + "id": "google/gemma-4-E2B-it-qat-q4_0-gguf", + "name": "Gemma 4 E2B Official QAT GGUF", + "repo": "google/gemma-4-E2B-it-qat-q4_0-gguf", + "link": "https://huggingface.co/google/gemma-4-E2B-it-qat-q4_0-gguf", + "paramsB": 2.0, + "sizeGb": 1.5, + "format": "GGUF", + "quantization": "Q4_0 (QAT)", + "capabilities": ["vision", "coding"], + "note": "Embedded 2B — edge/mobile optimised. QAT Q4_0 is ~1.5 GB; runs on CPU or any GPU. 
128K context.", + "contextWindow": "128K", + "launchMode": "direct", + "backend": "llama.cpp", + "releaseDate": "2025-05", + }, + { + "id": "google/gemma-4-E2B-it", + "name": "Gemma 4 E2B (BF16)", + "repo": "google/gemma-4-E2B-it", + "link": "https://huggingface.co/google/gemma-4-E2B-it", + "paramsB": 2.0, + "sizeGb": 4.0, + "format": "Transformers", + "quantization": "BF16", + "capabilities": ["vision", "coding"], + "note": "Official BF16 Embedded 2B — convert to GGUF/MLX. Small enough to run on any modern GPU.", + "contextWindow": "128K", + "launchMode": "convert", + "backend": "mlx", + "releaseDate": "2025-05", + }, + ], + "readme": [ + "Gemma 4 is Google's multimodal open-weight family — all sizes accept image + text inputs.", + "E2B (Embedded 2B, 128K context) targets edge and mobile deployment; the QAT Q4_0 GGUF is ~1.5 GB.", + "The 31B (256K context) is the full-capability variant: mlx-community's 8-bit quant at ~32 GB is the recommended desktop path.", + "Google ships official QAT GGUFs for both sizes — quantization-aware training gives better quality than post-hoc quant at the same file size.", + ], + }, + { + # MiniMax M2.7 — frontier-scale sparse MoE (MiniMaxM2ForCausalLM, 256 routed + # experts / 8 active, 200K context). BF16 total ~480 GB; text-only. + # Strong on long-context reasoning and character consistency. + # Community GGUF: unsloth/MiniMax-M2.7-GGUF. MLX: mlx-community/MiniMax-M2.7-4bit-mxfp4. + "id": "minimax-m2", + "name": "MiniMax M2", + "provider": "MiniMax", + "headline": "MiniMax frontier MoE — 200K context, strong character consistency and long-context reasoning.", + "summary": "MiniMax M2.7 — 256-expert sparse MoE, 200K context. Frontier-scale, top-end hardware only.", + "description": ( + "MiniMax M2.7 is MiniMax's frontier sparse Mixture-of-Experts model (MiniMaxM2ForCausalLM, " + "256 routed experts / 8 active per token) with 200K token context. 
" + "Compared with M2.5, M2.7 adds strengthened character consistency and emotional intelligence. " + "Text-only. Recommended inference params: temperature=1.0, top_p=0.95, top_k=40. " + "Frontier-scale: BF16 is ~480 GB; even a 4-bit GGUF is ~130 GB." + ), + "updatedLabel": "Released 2026", + "popularityLabel": "Frontier family", + "likesLabel": "MiniMax official", + "badges": ["Reasoning", "Long context", "Agents", "Coding"], + "capabilities": ["reasoning", "coding", "agents", "tool-use"], + "defaultVariantId": "mlx-community/MiniMax-M2.7-4bit-mxfp4", + "variants": [ + { + "id": "mlx-community/MiniMax-M2.7-4bit-mxfp4", + "name": "MiniMax M2.7 MLX MXFP4", + "repo": "mlx-community/MiniMax-M2.7-4bit-mxfp4", + "link": "https://huggingface.co/mlx-community/MiniMax-M2.7-4bit-mxfp4", + "paramsB": 240.0, + "sizeGb": 120.0, + "format": "MLX", + "quantization": "MXFP4", + "capabilities": ["reasoning", "coding", "agents", "tool-use"], + "note": "MoE ~240B / ~10B active. MXFP4 MLX quant — ~120 GB unified memory (M3/M4 Ultra-class).", + "contextWindow": "200K", + "launchMode": "direct", + "backend": "mlx", + "releaseDate": "2026-05", + }, + { + "id": "unsloth/MiniMax-M2.7-GGUF", + "name": "MiniMax M2.7 GGUF", + "repo": "unsloth/MiniMax-M2.7-GGUF", + "link": "https://huggingface.co/unsloth/MiniMax-M2.7-GGUF", + "paramsB": 240.0, + "sizeGb": 130.0, + "format": "GGUF", + "quantization": "Q4_K_M", + "capabilities": ["reasoning", "coding", "agents", "tool-use"], + "note": "Q4_K_M ~130 GB — needs a large-RAM workstation or multi-GPU box with NVLink.", + "contextWindow": "200K", + "launchMode": "direct", + "backend": "llama.cpp", + "releaseDate": "2026-05", + }, + { + "id": "MiniMaxAI/MiniMax-M2.7", + "name": "MiniMax M2.7 (BF16)", + "repo": "MiniMaxAI/MiniMax-M2.7", + "link": "https://huggingface.co/MiniMaxAI/MiniMax-M2.7", + "paramsB": 240.0, + "sizeGb": 481.0, + "format": "Transformers", + "quantization": "BF16", + "capabilities": ["reasoning", "coding", "agents", "tool-use"], + 
"note": "Official BF16 weights — convert to GGUF/MLX. Frontier-scale, ~480 GB.", + "contextWindow": "200K", + "launchMode": "convert", + "backend": "mlx", + "releaseDate": "2026-05", + }, + ], + "readme": [ + "MiniMax M2.7 is MiniMax's frontier sparse-MoE model (256 experts / 8 active), with 200K context.", + "M2.7 improves on M2.5 with stronger character consistency and long-context reasoning.", + "The mlx-community MXFP4 quant (~120 GB) is the Apple Silicon path; unsloth Q4_K_M GGUF (~130 GB) targets high-RAM Linux workstations.", + "Frontier-scale — even 4-bit quantization requires 120+ GB of memory.", + ], + }, ] diff --git a/backend_service/inference/_constants.py b/backend_service/inference/_constants.py index dff90a5..d91aab5 100644 --- a/backend_service/inference/_constants.py +++ b/backend_service/inference/_constants.py @@ -15,4 +15,15 @@ # especially on a first-time pull from Hugging Face. Allow a generous ceiling. MLX_LOAD_TIMEOUT_SECONDS = 1800.0 DEFAULT_LLAMA_TIMEOUT_SECONDS = 120.0 -CAPABILITY_CACHE_TTL_SECONDS = 10.0 +# Native-backend capabilities (mlx/llama-server/vLLM/accelerator presence) +# only change when the user installs something — and every install path +# (pip / system pkg / cuda-torch / convert / the /api/setup/refresh- +# capabilities endpoint) calls refresh_capabilities(force=True), which +# invalidates this cache immediately. So the TTL only governs ambient +# staleness, not correctness. The old 10 s value was shorter than a single +# model load+generate (40-70 s), so load_model's refresh_capabilities() +# re-probed on *every* load — a blocking 17-31 s mlx_lm+mlx+mlx_vlm import +# subprocess each time (the creep behind the FU-068 probe-timeout bumps). +# 300 s comfortably spans back-to-back loads in a session while staying +# fresh enough for the capability UI; installs force-refresh regardless. 
+CAPABILITY_CACHE_TTL_SECONDS = 300.0
diff --git a/backend_service/inference/binaries.py b/backend_service/inference/binaries.py
index df714de..e20435e 100644
--- a/backend_service/inference/binaries.py
+++ b/backend_service/inference/binaries.py
@@ -33,6 +33,17 @@ def _json_subprocess(
             check=False,
             capture_output=True,
             timeout=timeout,
+            # Own session/process group: these short-lived JSON probes
+            # (mlx_worker probe, GGUF metadata read) must NOT be collateral
+            # of ``app._watch_parent_and_exit``'s killpg(SIGTERM) when the
+            # backend's parent dies. Without this, a non-Tauri launch (e.g.
+            # a bare ``python -m backend_service.app`` whose launch shell
+            # exits) reparents the app, the watchdog fires, and the probe —
+            # sharing the group — dies with "probe exited with code -15"
+            # mid-run. The probe is a few-second transient, so escaping the
+            # parent-death cleanup leaks nothing (the cleanup exists for the
+            # long-lived llama-server children, which are spawned elsewhere).
+            start_new_session=True,
         )
     except (OSError, subprocess.TimeoutExpired) as exc:
         return (-1, None, str(exc))
diff --git a/backend_service/inference/capabilities.py b/backend_service/inference/capabilities.py
index 8030035..9f551df 100644
--- a/backend_service/inference/capabilities.py
+++ b/backend_service/inference/capabilities.py
@@ -126,12 +126,17 @@ def _probe_native_backends() -> BackendCapabilities:

         code, payload, message = _json_subprocess(
             [python_executable, "-m", "backend_service.mlx_worker", "probe"],
-            # FU-068: cold ``mlx_lm + mlx + mlx_vlm`` import has crept to
-            # ~12.4 s on M4 Max / Python 3.11 (measured 2026-05-25 v0.9.3),
-            # blowing the original 12.0 s ceiling and causing intermittent
-            # E2E Phase 1 fails on a freshly-booted backend. Bump to 20 s
-            # for ~60% headroom over today's cold-boot envelope.
-            timeout=20.0,
+            # FU-068: cold ``mlx_lm + mlx + mlx_vlm`` import keeps creeping —
+            # 12.0 s (orig) → 12.4 s (2026-05-25 v0.9.3, → 20 s) → 17.5 s solo
+            # on M4 Max / Python 3.11 (2026-06-02). Under a sustained E2E run
+            # (whole suite ~3x slower from concurrent model loads + thermal
+            # throttle) the probe is re-issued per MLX cell and measured
+            # ~31 s, blowing both the 20 s and 30 s ceilings (different cell
+            # each time). 45 s clears the ~31 s loaded peak with headroom and
+            # is still bounded enough to surface a genuinely wedged worker.
+            # Follow-up: cache the capability probe so it isn't re-run per
+            # load under load (the real inefficiency behind the creep).
+            timeout=45.0,
         )

         if payload is None:
diff --git a/backend_service/inference/llama_cpp_engine.py b/backend_service/inference/llama_cpp_engine.py
index d62af35..3d9e884 100644
--- a/backend_service/inference/llama_cpp_engine.py
+++ b/backend_service/inference/llama_cpp_engine.py
@@ -92,6 +92,17 @@
     "frequency_penalty",
     "presence_penalty",
     "stop",
+    # Modern anti-repetition / quality samplers llama-server supports
+    # natively. Forward-only: builds that don't recognise them ignore the
+    # field, so old binaries are unaffected. DRY beats plain repeat_penalty
+    # at killing verbatim loops; XTC adds creative variety; top-n-sigma is
+    # a temperature-stable truncator.
+    "dry_multiplier",
+    "dry_base",
+    "dry_allowed_length",
+    "xtc_probability",
+    "xtc_threshold",
+    "top_n_sigma",
     # Phase 3.3: per-token confidence info. llama-server returns
     # top-k alternatives with their logprobs in each delta when
     # `logprobs: true` + `top_logprobs: N` are set.
@@ -421,6 +432,7 @@ def _build_command( fit_enabled: bool, is_fallback: bool, speculative_decoding: bool = False, + fused_attention: bool = False, canonical_repo: str | None = None, model_ref: str = "", ) -> tuple[list[str], str | None, bool, str | None]: @@ -449,6 +461,19 @@ def _build_command( str(max(256, context_tokens)), "--jinja", ] + # Reuse the single slot's KV cache across chat turns: a growing + # conversation re-prefills only the new suffix instead of the whole + # history (turn-2+ TTFT drops sharply on long chats). Forward-gated + # on binary support so older llama-server builds are unaffected. + if _llama_server_supports(binary, "--cache-reuse"): + command.extend(["--cache-reuse", "256"]) + # Honour the user's fused-attention toggle. It was plumbed into + # load_model + stored on LoadedModelInfo but never emitted as a + # flag. Flash attention is a large decode + KV-memory win on Metal + # and is required by the quantized KV cache types. Opt-in via the + # existing flag so a model/quant combo that dislikes it can disable. + if fused_attention and _llama_server_supports(binary, "--flash-attn"): + command.extend(["--flash-attn", "on"]) if _llama_server_supports(binary, "--reasoning-format"): command.extend(["--reasoning-format", "deepseek"]) if _llama_server_supports(binary, "--reasoning"): @@ -660,6 +685,7 @@ def load_model( fit_enabled=fit_enabled, is_fallback=is_fallback, speculative_decoding=speculative_decoding, + fused_attention=fused_attention, canonical_repo=canonical_repo, model_ref=model_ref, ) @@ -791,6 +817,9 @@ def generate( "temperature": temperature, "max_tokens": max_tokens, "stream": False, + # Reuse the slot's cached prompt prefix across turns (pairs with + # the server's --cache-reuse) so unchanged history isn't reprocessed. 
+ "cache_prompt": True, } if tools: payload["tools"] = tools @@ -884,6 +913,9 @@ def stream_generate( "temperature": temperature, "max_tokens": max_tokens, "stream": True, + # Reuse the slot's cached prompt prefix across turns (pairs with + # the server's --cache-reuse) so unchanged history isn't reprocessed. + "cache_prompt": True, } if tools: payload["tools"] = tools diff --git a/backend_service/mlx_worker.py b/backend_service/mlx_worker.py index c7a0e52..f3acfc6 100644 --- a/backend_service/mlx_worker.py +++ b/backend_service/mlx_worker.py @@ -59,6 +59,7 @@ from backend_service import mlx_worker_lifecycle as _lifecycle from backend_service import mlx_worker_speculative as _speculative from backend_service import mlx_worker_generate as _generate +from backend_service import mlx_worker_prompt_cache as _prompt_cache # Phase 1f-4: model + runtime introspection helpers now live in # ``backend_service.mlx_worker_diagnostics``. Re-export so existing imports @@ -127,6 +128,13 @@ def __init__(self) -> None: # delimiters via ``reasoning_delimiters_for``. Default # (``<think>...</think>``) still applies when ``None``. self._loaded_model_ref: str | None = None + # Tier 4: persistent single-slot prompt cache for native-strategy chat + # so follow-up turns prefill only the new suffix. Managed by + # backend_service.mlx_worker_prompt_cache; invalidated on any model + # load / unload / profile change. 
+ self._persist_cache: Any | None = None + self._persist_tokens: list[int] = [] + self._persist_cache_model_ref: str | None = None def handle(self, request: dict[str, Any]) -> dict[str, Any] | None: op = request.get("op") @@ -148,12 +156,15 @@ def handle(self, request: dict[str, Any]) -> dict[str, Any] | None: raise ValueError(f"Unsupported worker operation: {op}") def load_model(self, request: dict[str, Any]) -> dict[str, Any]: + _prompt_cache.invalidate(self) return _lifecycle.load_model(self, request) def unload_model(self) -> dict[str, Any]: + _prompt_cache.invalidate(self) return _lifecycle.unload_model(self) def update_profile(self, request: dict[str, Any]) -> dict[str, Any]: + _prompt_cache.invalidate(self) return _lifecycle.update_profile(self, request) def _apply_cache_profile( diff --git a/backend_service/mlx_worker_generate.py b/backend_service/mlx_worker_generate.py index 7157631..2d7a65d 100644 --- a/backend_service/mlx_worker_generate.py +++ b/backend_service/mlx_worker_generate.py @@ -34,6 +34,7 @@ ) from backend_service.mlx_worker_request import ( _apply_mlx_seed, + _build_mlx_logits_processors, _build_mlx_sampler, _extract_top_logprobs, _format_tools_for_prompt, @@ -46,6 +47,7 @@ strip_harmony_boilerplate, ) from backend_service.runaway_guard import RunawayGuard +from backend_service import mlx_worker_prompt_cache as _prompt_cache if TYPE_CHECKING: @@ -109,24 +111,32 @@ def generate_standard(state: WorkerState, request: dict[str, Any]) -> dict[str, system_prompt=system_prompt, ) sampler = _build_mlx_sampler(request) - prompt_cache, runtime_note = state._make_cache() - runtime_note = _merge_runtime_notes(runtime_note, prompt_note) - runtime_fields = state._runtime_fields(prompt_cache=prompt_cache) + acq = _prompt_cache.acquire(state, prompt_text) + prompt_cache = acq.cache + prompt_feed = acq.prompt_feed + managed = acq.managed + runtime_note = _merge_runtime_notes(acq.note, prompt_note) + runtime_fields = 
state._runtime_fields(prompt_cache=acq.fields_cache) transcript_fallback = _plain_chat_fallback_active(prompt_note) runaway_guard = RunawayGuard() runaway_stopped = False + generated_ids: list[int] = [] try: text_parts: list[str] = [] last_response = None for response in stream_generate( state.model, state.tokenizer, - prompt_text, + prompt_feed, max_tokens=int(request.get("maxTokens") or 256), sampler=sampler, + logits_processors=_build_mlx_logits_processors(request), prompt_cache=prompt_cache, ): + _tok = getattr(response, "token", None) + if isinstance(_tok, int): + generated_ids.append(_tok) if response.text: text_parts.append(response.text) try: @@ -135,8 +145,20 @@ def generate_standard(state: WorkerState, request: dict[str, Any]) -> dict[str, runaway_stopped = True break last_response = response + if managed: + _prompt_cache.commit( + state, + cache=prompt_cache, + commit_tokens=acq.commit_tokens, + generated_ids=generated_ids, + model_ref=state._loaded_model_ref, + ) except (ValueError, RuntimeError, TypeError, AttributeError) as exc: - _should_retry = ( + was_managed = managed + if managed: + _prompt_cache.invalidate(state) + managed = False + _should_retry = was_managed or ( prompt_cache is not None and _should_retry_cache_failure(exc) ) @@ -319,10 +341,13 @@ def stream_generate(state: WorkerState, request: dict[str, Any]) -> None: system_prompt=system_prompt, ) sampler = _build_mlx_sampler(request) - prompt_cache, runtime_note = state._make_cache() - runtime_note = _merge_runtime_notes(runtime_note, prompt_note) + acq = _prompt_cache.acquire(state, prompt_text) + prompt_cache = acq.cache + prompt_feed = acq.prompt_feed + managed = acq.managed + runtime_note = _merge_runtime_notes(acq.note, prompt_note) runtime_note = _merge_runtime_notes(runtime_note, speculative_stream_fallback_note) - runtime_fields = state._runtime_fields(prompt_cache=prompt_cache) + runtime_fields = state._runtime_fields(prompt_cache=acq.fields_cache) transcript_fallback = 
_plain_chat_fallback_active(prompt_note) thinking_mode = request.get("thinkingMode") or "off" @@ -336,6 +361,7 @@ def stream_generate(state: WorkerState, request: dict[str, Any]) -> None: transcript_trimmed = False runaway_guard = RunawayGuard() runaway_stopped = False + generated_ids: list[int] = [] # Phase 3.3 follow-up: when the request opted into logprobs, # extract top-k per token via the helper and forward inline # with each text chunk. @@ -346,11 +372,15 @@ def stream_generate(state: WorkerState, request: dict[str, Any]) -> None: for response in mlx_stream_generate( state.model, state.tokenizer, - prompt_text, + prompt_feed, max_tokens=int(request.get("maxTokens") or 256), sampler=sampler, + logits_processors=_build_mlx_logits_processors(request), prompt_cache=prompt_cache, ): + _tok = getattr(response, "token", None) + if isinstance(_tok, int): + generated_ids.append(_tok) if response.text: # Check for runaway loops before emitting try: @@ -392,8 +422,20 @@ def stream_generate(state: WorkerState, request: dict[str, Any]) -> None: transcript_trimmed = transcript_trimmed or transcript_filter.stopped if visible_text: _emit({"ok": True, "chunk": {"text": visible_text}}) + if managed: + _prompt_cache.commit( + state, + cache=prompt_cache, + commit_tokens=acq.commit_tokens, + generated_ids=generated_ids, + model_ref=state._loaded_model_ref, + ) except (ValueError, RuntimeError, TypeError, AttributeError) as exc: - _should_retry = ( + was_managed = managed + if managed: + _prompt_cache.invalidate(state) + managed = False + _should_retry = was_managed or ( prompt_cache is not None and _should_retry_cache_failure(exc) ) diff --git a/backend_service/mlx_worker_prompt_cache.py b/backend_service/mlx_worker_prompt_cache.py new file mode 100644 index 0000000..4ccfbea --- /dev/null +++ b/backend_service/mlx_worker_prompt_cache.py @@ -0,0 +1,122 @@ +"""Per-session MLX prompt-cache reuse (tier 4 of the chat-LLM review). 
+ +Native-strategy chat turns re-prefill the *entire* conversation every time +(`prompt_cache=None` → mlx-lm builds a fresh cache + processes the whole +prompt). This module keeps one persistent mlx-lm prompt cache on the +worker and reuses the longest matching token prefix across turns: trim the +divergent tail off the cache, prefill only the new suffix, then re-commit +the cache keyed by ``prompt_tokens + generated_tokens``. A single-slot port +of mlx-lm's server reuse logic (``LRUPromptCache.fetch_nearest_cache``). + +Correctness invariant: the persisted token list ALWAYS equals the cache's +positional contents (prompt + generated), so the next turn's common-prefix +trim is exact. Any uncertainty — compression strategy active, model +changed, cache not trimmable (SSM/Mamba/rotating-full, mlx-lm #980), +tokenisation failure, no common prefix, partial trim — falls back to a +fresh full prefill, i.e. identical output to the pre-cache path, just +without the speedup. Gated to the ``native`` strategy; compression caches +(turboquant / triattention) keep their existing per-call path untouched. 
+""" + +from __future__ import annotations + +from collections import namedtuple +from typing import Any + +# cache: object passed to stream_generate as prompt_cache +# prompt_feed: what to pass as the `prompt` arg (suffix token list on a +# reuse hit, full token list on a fresh native cache, or the +# original prompt_text string for the compression / fallback path) +# note: runtime note from _make_cache (compression fallback msgs) +# commit_tokens: full prompt token list to re-key after generation (None when +# not managing a native cache) +# fields_cache: value to feed _runtime_fields (None for native, the +# compression cache otherwise) so the strategy badge stays right +# managed: True only when we own a native persistent cache to commit +Acquired = namedtuple( + "Acquired", "cache prompt_feed note commit_tokens fields_cache managed" +) + + +def _common_prefix_len(a: list[int], b: list[int]) -> int: + n = 0 + for x, y in zip(a, b): + if x != y: + break + n += 1 + return n + + +def _native_result(cache: Any | None, full_tokens: list[int], prompt_text: str, note: str | None) -> Acquired: + """Wrap a fresh-native-cache outcome (or a give-up fallback).""" + if cache is not None: + return Acquired(cache, full_tokens, note, full_tokens, None, True) + # Couldn't build a managed cache → behave exactly like before. + return Acquired(None, prompt_text, note, None, None, False) + + +def acquire(state: Any, prompt_text: str) -> Acquired: + base_cache, note = state._make_cache() + if base_cache is not None: + # Compression strategy: unchanged behaviour, no persistence. + return Acquired(base_cache, prompt_text, note, None, base_cache, False) + + # Native strategy — manage a persistent single-slot cache. 
+ try: + from mlx_lm.models.cache import ( # noqa: PLC0415 + can_trim_prompt_cache, + make_prompt_cache, + trim_prompt_cache, + ) + + full_tokens = list(state.tokenizer.encode(prompt_text)) + except Exception: # noqa: BLE001 — any failure → safe full-reprocess fallback + return Acquired(None, prompt_text, note, None, None, False) + + def _fresh() -> Any | None: + try: + return make_prompt_cache(state.model) + except Exception: # noqa: BLE001 + return None + + model_ref = getattr(state, "_loaded_model_ref", None) + persist = getattr(state, "_persist_cache", None) + persist_tokens = getattr(state, "_persist_tokens", None) or [] + persist_ref = getattr(state, "_persist_cache_model_ref", None) + + # Reset conditions: nothing cached, different model, empty history. + if persist is None or persist_ref != model_ref or not persist_tokens: + return _native_result(_fresh(), full_tokens, prompt_text, note) + + try: + if not can_trim_prompt_cache(persist): + return _native_result(_fresh(), full_tokens, prompt_text, note) + # Always leave >=1 token to process live (mlx-lm does the same). + common = min(_common_prefix_len(persist_tokens, full_tokens), len(full_tokens) - 1) + if common <= 0: + return _native_result(_fresh(), full_tokens, prompt_text, note) + num_to_trim = len(persist_tokens) - common + if num_to_trim > 0: + trimmed = trim_prompt_cache(persist, num_to_trim) + if trimmed != num_to_trim: + # Couldn't roll back cleanly — don't risk a spliced mismatch. + return _native_result(_fresh(), full_tokens, prompt_text, note) + # Reuse hit: cache now holds exactly the common prefix; prefill suffix. 
+ return Acquired(persist, full_tokens[common:], note, full_tokens, None, True) + except Exception: # noqa: BLE001 + return _native_result(_fresh(), full_tokens, prompt_text, note) + + +def commit(state: Any, *, cache: Any, commit_tokens: list[int] | None, generated_ids: list[int], model_ref: str | None) -> None: + """Persist the cache keyed by prompt + generated tokens (positional truth).""" + if cache is None or commit_tokens is None: + return + state._persist_cache = cache + state._persist_tokens = list(commit_tokens) + [t for t in generated_ids if isinstance(t, int)] + state._persist_cache_model_ref = model_ref + + +def invalidate(state: Any) -> None: + state._persist_cache = None + state._persist_tokens = [] + state._persist_cache_model_ref = None diff --git a/backend_service/mlx_worker_request.py b/backend_service/mlx_worker_request.py index 6bb1ab7..5c2112e 100644 --- a/backend_service/mlx_worker_request.py +++ b/backend_service/mlx_worker_request.py @@ -133,7 +133,10 @@ def _build_mlx_sampler(request: dict[str, Any]) -> Any: kwargs: dict[str, Any] = {"temp": float(request.get("temperature") or 0.0)} samplers = request.get("samplers") or {} if isinstance(samplers, dict): - for src in ("top_p", "top_k", "min_p"): + # XTC (xtc_probability/xtc_threshold) is supported by current + # make_sampler and adds creative variety; it survives the signature + # filter below on builds that have it and is dropped on older ones. + for src in ("top_p", "top_k", "min_p", "xtc_probability", "xtc_threshold"): value = samplers.get(src) if value is not None: kwargs[src] = value @@ -147,6 +150,47 @@ def _build_mlx_sampler(request: dict[str, Any]) -> Any: return make_sampler(**filtered) +def _build_mlx_logits_processors(request: dict[str, Any]) -> Any: + """Build mlx-lm logits processors (repetition penalty) from the request. 
+ + mlx-lm applies repetition penalty via ``logits_processors``, NOT through + ``make_sampler`` — so the UI's ``repeat_penalty`` was silently dropped + when only the sampler was wired. Returns None when no (or a no-op 1.0) + penalty is requested, so callers can pass ``logits_processors=None`` (the + mlx-lm default). Signature-filtered like the sampler for cross-version + robustness. + """ + import inspect + + samplers = request.get("samplers") or {} + if not isinstance(samplers, dict): + return None + raw = samplers.get("repeat_penalty", samplers.get("repetition_penalty")) + try: + penalty = float(raw) if raw is not None else None + except (TypeError, ValueError): + penalty = None + if penalty is None or abs(penalty - 1.0) < 1e-6: + return None + + try: + from mlx_lm.sample_utils import make_logits_processors + + kwargs: dict[str, Any] = {"repetition_penalty": penalty} + ctx = samplers.get("repeat_penalty_context") or samplers.get("repetition_context_size") + if ctx is not None: + try: + kwargs["repetition_context_size"] = int(ctx) + except (TypeError, ValueError): + pass + sig = inspect.signature(make_logits_processors) + allowed = set(sig.parameters.keys()) + filtered = {k: v for k, v in kwargs.items() if k in allowed} + return make_logits_processors(**filtered) + except Exception: + return None + + def _sampler_seed(request: dict[str, Any]) -> int | None: samplers = request.get("samplers") or {} if not isinstance(samplers, dict): diff --git a/backend_service/models/__init__.py b/backend_service/models/__init__.py index 4c43b62..e2f9414 100644 --- a/backend_service/models/__init__.py +++ b/backend_service/models/__init__.py @@ -151,6 +151,14 @@ class GenerateRequest(BaseModel): mirostatMode: Literal[0, 1, 2] | None = None mirostatTau: float | None = Field(default=None, ge=0.0, le=10.0) mirostatEta: float | None = Field(default=None, ge=0.0, le=1.0) + # Modern samplers (tier 2). XTC drops top tokens for variety; DRY + # penalises repeated multi-token sequences. 
llama-server applies all; + # mlx-lm applies XTC via make_sampler and ignores DRY (llama-only). + xtcProbability: float | None = Field(default=None, ge=0.0, le=1.0) + xtcThreshold: float | None = Field(default=None, ge=0.0, le=1.0) + dryMultiplier: float | None = Field(default=None, ge=0.0, le=4.0) + dryBase: float | None = Field(default=None, ge=0.0, le=8.0) + dryAllowedLength: int | None = Field(default=None, ge=0, le=64) seed: int | None = Field(default=None, ge=0, le=2**31 - 1) # Constrained decoding: when set, llama-server enforces a JSON schema # via its `response_format: {type: "json_schema", json_schema: {...}}` @@ -268,6 +276,15 @@ class OpenAIChatCompletionRequest(BaseModel): presence_penalty: float | None = Field(default=None, ge=-2.0, le=2.0) seed: int | None = Field(default=None, ge=0, le=2**31 - 1) stop: list[str] | str | None = None + # Non-standard but widely-accepted local-server sampler fields. Mapped + # into the runtime sampler dict in state/openai_compat.py for parity with + # the native chat route (llama-server takes these natively; the MLX worker + # consumes min_p + repeat_penalty). 
+ min_p: float | None = Field(default=None, ge=0.0, le=1.0) + repeat_penalty: float | None = Field(default=None, ge=0.0, le=2.0) + mirostat: int | None = Field(default=None, ge=0, le=2) + mirostat_tau: float | None = Field(default=None, ge=0.0) + mirostat_eta: float | None = Field(default=None, ge=0.0) response_format: dict[str, Any] | None = None diff --git a/backend_service/state/__init__.py b/backend_service/state/__init__.py index 57b3931..248bbcc 100644 --- a/backend_service/state/__init__.py +++ b/backend_service/state/__init__.py @@ -35,6 +35,7 @@ _build_sampler_overrides, _clean_prompt_for_title, _compose_chat_system_prompt, + _history_token_budget, _legacy_title_from_prompt, _normalize_remote_provider_api_base, _read_text_tail, diff --git a/backend_service/state/_helpers.py b/backend_service/state/_helpers.py index fee56df..b38e597 100644 --- a/backend_service/state/_helpers.py +++ b/backend_service/state/_helpers.py @@ -57,6 +57,14 @@ def _put(dst: str, value: Any) -> None: overrides["mirostat"] = mirostat_mode _put("mirostat_tau", getattr(request, "mirostatTau", None)) _put("mirostat_eta", getattr(request, "mirostatEta", None)) + # Modern samplers (tier 2): XTC (both engines) + DRY (llama only). + # Engine-side key names; llama-server forwards them via + # _LLAMA_SAMPLER_KEYS, mlx-lm reads xtc_* in _build_mlx_sampler. 
+ _put("xtc_probability", getattr(request, "xtcProbability", None)) + _put("xtc_threshold", getattr(request, "xtcThreshold", None)) + _put("dry_multiplier", getattr(request, "dryMultiplier", None)) + _put("dry_base", getattr(request, "dryBase", None)) + _put("dry_allowed_length", getattr(request, "dryAllowedLength", None)) # Phase 3.3: when the user enables logprobs on a request the # frontend sends a top-k count; map it onto llama-server's # `logprobs` + `top_logprobs` parameters so the response delta @@ -68,10 +76,43 @@ def _put(dst: str, value: Any) -> None: return overrides +def _estimate_tokens(text: str) -> int: + """Cheap, deliberately CONSERVATIVE token estimate (no tokenizer here). + + Assumes ~3 chars/token vs the ~4 typical for English so the history + window UNDER-fills the context rather than risking an overflow the MLX + path can't recover from. Code and CJK are denser than English, so + erring small protects them too. Off by a constant factor — fine for a + safety budget, not for billing. + """ + return (len(text) // 3) + 1 + + +def _history_token_budget( + *, + context_tokens: int, + max_tokens: int, + system_prompt: str | None, + prompt: str | None, +) -> int: + """Token budget left for *prior* history after reserving room for the + system prompt, the current user prompt, the generation, and chat-template + overhead. Floors at 512 so a single recent turn is always kept. + """ + reserved = ( + _estimate_tokens(system_prompt or "") + + _estimate_tokens(prompt or "") + + int(max_tokens or 0) + + 512 # chat-template + role-tag + tool-schema overhead headroom + ) + return max(512, int(context_tokens or 0) - reserved) + + def _build_history_with_reasoning( messages: list[dict[str, Any]], *, preserve_reasoning: bool, + token_budget: int | None = None, ) -> list[dict[str, Any]]: """Project a session's stored messages into the history list passed to the inference layer. 
@@ -79,10 +120,17 @@ def _build_history_with_reasoning( When `preserve_reasoning` is true and an assistant message has a `reasoning` field captured by ThinkingTokenFilter on a previous turn, the reasoning is re-emitted inside `<think>...</think>` tags ahead of - the visible answer. Reasoning-capable models (Qwen3, DeepSeek R1, etc.) - consume this naturally on follow-up turns; non-reasoning models will - treat it as inline text. Falsy / missing reasoning is skipped, so this - is safe to call unconditionally. + the visible answer. (Upstream chat templates for Qwen3 / DeepSeek-R1 + actually strip prior reasoning, so the live chat path now passes + `preserve_reasoning=False`; the option is kept for callers that want it.) + Falsy / missing reasoning is skipped, so this is safe to call + unconditionally. + + When `token_budget` is set, a sliding window keeps every system message + plus the NEWEST conversation turns that fit the budget (estimated, no + tokenizer), dropping the oldest. This bounds prompt growth across a long + chat — preventing silent truncation on llama.cpp and out-of-context + errors on MLX. ``None`` disables windowing (unchanged behaviour). """ history: list[dict[str, Any]] = [] for message in messages: @@ -97,7 +145,26 @@ def _build_history_with_reasoning( if reasoning_str: text = f"<think>\n{reasoning_str}\n</think>\n\n{text}" history.append({"role": role, "text": text}) - return history + + if token_budget is None or token_budget <= 0: + return history + + # System messages are always kept; window the conversation tail. + system_msgs = [m for m in history if m["role"] == "system"] + convo = [m for m in history if m["role"] != "system"] + used = sum(_estimate_tokens(m["text"]) for m in system_msgs) + kept_tail: list[dict[str, Any]] = [] + for message in reversed(convo): + cost = _estimate_tokens(message["text"]) + # Always keep the most recent turn even if it alone blows the budget; + # dropping the latest context is worse than a small overflow the + # engine can still truncate. 
+ if kept_tail and used + cost > token_budget: + break + used += cost + kept_tail.append(message) + kept_tail.reverse() + return system_msgs + kept_tail _TITLE_LEADING_PATTERNS = [ diff --git a/backend_service/state/generation.py b/backend_service/state/generation.py index 15098f4..1ace636 100644 --- a/backend_service/state/generation.py +++ b/backend_service/state/generation.py @@ -35,6 +35,7 @@ _build_history_with_reasoning, _build_sampler_overrides, _compose_chat_system_prompt, + _history_token_budget, ) @@ -144,7 +145,17 @@ def generate(state: ChaosEngineState, request: GenerateRequest) -> dict[str, Any history = _build_history_with_reasoning( session["messages"], - preserve_reasoning=(effective_thinking_mode == "auto"), + # Don't replay prior reasoning — upstream chat templates + # (Qwen3 / DeepSeek-R1) strip it, and re-feeding it bloats the + # prompt every turn. token_budget windows the oldest turns out so + # a long chat can't silently overflow the context. + preserve_reasoning=False, + token_budget=_history_token_budget( + context_tokens=desired_context_tokens, + max_tokens=request.maxTokens, + system_prompt=request.systemPrompt, + prompt=request.prompt, + ), ) session["messages"].append({"role": "user", "text": request.prompt, "metrics": None}) session["updatedAt"] = state._time_label() @@ -393,7 +404,17 @@ def generate_stream(state: ChaosEngineState, request: GenerateRequest): history = _build_history_with_reasoning( session["messages"], - preserve_reasoning=(effective_thinking_mode == "auto"), + # Don't replay prior reasoning — upstream chat templates + # (Qwen3 / DeepSeek-R1) strip it, and re-feeding it bloats the + # prompt every turn. token_budget windows the oldest turns out so + # a long chat can't silently overflow the context. 
+ preserve_reasoning=False, + token_budget=_history_token_budget( + context_tokens=desired_context_tokens, + max_tokens=request.maxTokens, + system_prompt=request.systemPrompt, + prompt=request.prompt, + ), ) session["messages"].append({"role": "user", "text": request.prompt, "metrics": None}) session["updatedAt"] = state._time_label() @@ -599,6 +620,24 @@ def _maybe_emit_generating_phase() -> str: ttft_seconds = round(time.perf_counter() - gen_start, 3) return f"data: {json.dumps({'phase': 'generating', 'ttftSeconds': ttft_seconds})}\n\n" + # Token coalescing: batch visible token frames so a fast decoder + # doesn't pay a json.dumps + SSE frame per token. Flush on size, a + # short time window, any non-token event, or stream end. Disabled + # when per-token logprobs are requested (they must stay 1:1 aligned). + _COALESCE_CHARS = 24 + _COALESCE_SECS = 0.05 + _coalesce_tokens = not (request.logprobs and int(request.logprobs) > 0) + _tok: dict[str, Any] = {"buf": [], "chars": 0, "started": 0.0} + + def _flush_tokens() -> str: + if not _tok["buf"]: + return "" + merged = "".join(_tok["buf"]) + _tok["buf"] = [] + _tok["chars"] = 0 + _tok["started"] = 0.0 + return f"data: {json.dumps({'token': merged})}\n\n" + try: if enable_tools: from backend_service.agent import run_agent_loop_streaming @@ -619,7 +658,20 @@ def _maybe_emit_generating_phase() -> str: if phase_event: yield phase_event full_text += event["token"] - yield f"data: {json.dumps({'token': event['token']})}\n\n" + if _coalesce_tokens: + if not _tok["buf"]: + _tok["started"] = time.perf_counter() + _tok["buf"].append(event["token"]) + _tok["chars"] += len(event["token"]) + if ( + _tok["chars"] >= _COALESCE_CHARS + or time.perf_counter() - _tok["started"] >= _COALESCE_SECS + ): + _f = _flush_tokens() + if _f: + yield _f + else: + yield f"data: {json.dumps({'token': event['token']})}\n\n" if len(full_text) > runaway_char_budget: runaway_triggered = True cancelled = True @@ -628,8 +680,14 @@ def 
_maybe_emit_generating_phase() -> str: phase_event = _maybe_emit_generating_phase() if phase_event: yield phase_event + _f = _flush_tokens() + if _f: + yield _f yield f"data: {json.dumps({'toolCallStart': event['tool_call_start']})}\n\n" elif "tool_call_result" in event: + _f = _flush_tokens() + if _f: + yield _f agent_tool_calls.append(event["tool_call_result"]) yield f"data: {json.dumps({'toolCallResult': event['tool_call_result']})}\n\n" elif event.get("done"): @@ -653,16 +711,35 @@ def _maybe_emit_generating_phase() -> str: phase_event = _maybe_emit_generating_phase() if phase_event: yield phase_event + _f = _flush_tokens() + if _f: + yield _f full_reasoning += chunk.reasoning yield f"data: {json.dumps({'reasoning': chunk.reasoning})}\n\n" if chunk.reasoning_done: + _f = _flush_tokens() + if _f: + yield _f yield f"data: {json.dumps({'reasoningDone': True})}\n\n" if chunk.text: phase_event = _maybe_emit_generating_phase() if phase_event: yield phase_event full_text += chunk.text - yield f"data: {json.dumps({'token': chunk.text})}\n\n" + if _coalesce_tokens: + if not _tok["buf"]: + _tok["started"] = time.perf_counter() + _tok["buf"].append(chunk.text) + _tok["chars"] += len(chunk.text) + if ( + _tok["chars"] >= _COALESCE_CHARS + or time.perf_counter() - _tok["started"] >= _COALESCE_SECS + ): + _f = _flush_tokens() + if _f: + yield _f + else: + yield f"data: {json.dumps({'token': chunk.text})}\n\n" # Phase 3.3: forward per-token logprobs when # the inference layer captured them. 
if chunk.token_logprobs: @@ -730,6 +807,9 @@ def _maybe_emit_generating_phase() -> str: f"{p_avail:.1f} GB, " f"pressure={p_pressure:.0f}%.", ) + _f = _flush_tokens() + if _f: + yield _f yield ( "data: " + json.dumps({ @@ -762,6 +842,9 @@ def _maybe_emit_generating_phase() -> str: "chat", "warning", f"[{model_tag}] Thermal warning: critical.", ) + _f = _flush_tokens() + if _f: + yield _f yield ( "data: " + json.dumps({ @@ -794,11 +877,20 @@ def _maybe_emit_generating_phase() -> str: chaosengine.active_requests = max(0, chaosengine.active_requests - 1) chaosengine.add_log("chat", "error", f"[{model_tag}] Streaming failed: {exc}") chaosengine.clear_chat_cancel(session_id_for_cancel) + _f = _flush_tokens() + if _f: + yield _f yield f"data: {json.dumps({'error': str(exc)})}\n\n" return finally: chaosengine.clear_chat_cancel(session_id_for_cancel) + # Flush any tokens still buffered by the coalescer before the + # terminal done / cancelled events (covers normal end + all breaks). + _f = _flush_tokens() + if _f: + yield _f + if cancelled: yield f"data: {json.dumps({'cancelled': True})}\n\n" if runaway_loop_reason is not None: diff --git a/backend_service/state/openai_compat.py b/backend_service/state/openai_compat.py index b25dedd..a5a3cb0 100644 --- a/backend_service/state/openai_compat.py +++ b/backend_service/state/openai_compat.py @@ -236,6 +236,19 @@ def openai_chat_completion( oai_samplers["seed"] = request.seed if request.stop is not None: oai_samplers["stop"] = request.stop if isinstance(request.stop, list) else [request.stop] + # Parity with the native chat route's sampler set: min_p, repeat_penalty, + # and mirostat were silently dropped on the /v1 path. llama-server takes + # these key names natively; the MLX worker consumes min_p + repeat_penalty. 
+ if request.min_p is not None: + oai_samplers["min_p"] = request.min_p + if request.repeat_penalty is not None: + oai_samplers["repeat_penalty"] = request.repeat_penalty + if request.mirostat is not None: + oai_samplers["mirostat"] = request.mirostat + if request.mirostat_tau is not None: + oai_samplers["mirostat_tau"] = request.mirostat_tau + if request.mirostat_eta is not None: + oai_samplers["mirostat_eta"] = request.mirostat_eta # Phase 2.13: pull a JSON schema out of OpenAI's response_format # envelope so the constrained-decode path lights up. Anything diff --git a/pyproject.toml b/pyproject.toml index 3b4a4dd..f00e899 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta:__legacy__" [project] name = "chaosengine-ai" -version = "0.9.3" +version = "0.9.4" description = "Local AI model runner with pluggable cache/compression strategies" readme = "README.md" license = {text = "Apache-2.0"} @@ -35,13 +35,13 @@ mlx-lm = [ # AutoProcessor); without it ``mlx_vlm.load`` raises ImportError on # the Qwen2.5-VL family during processor build. 
mlx-vlm = [ - "mlx-vlm>=0.5.0", + "mlx-vlm>=0.6.3", "torchvision>=0.20", ] -triattention = ["triattention @ git+https://github.com/WeianMao/triattention.git@c3744ee6a50522a1559a577f85aef2b165a344f2", "vllm>=0.21.0"] +triattention = ["triattention @ git+https://github.com/WeianMao/triattention.git@c3744ee6a50522a1559a577f85aef2b165a344f2", "vllm>=0.23.0"] triattention-mlx = ["triattention @ git+https://github.com/WeianMao/triattention.git@c3744ee6a50522a1559a577f85aef2b165a344f2", "mlx-lm>=0.22.0"] -turboquant = ["turboquant-mlx-full>=0.5.0"] -vllm = ["vllm>=0.21.0"] +turboquant = ["turboquant-mlx-full>=0.8.0"] +vllm = ["vllm>=0.23.0"] dflash-mlx = ["dflash-mlx @ git+https://github.com/bstnxbt/dflash-mlx.git@fada1eb2b75cd1c875ca6547b6518783fd3d2956"] dflash = ["dflash>=0.1.0"] desktop = [ diff --git a/requirements-docs.txt b/requirements-docs.txt index fbf59a3..6c7cbfe 100644 --- a/requirements-docs.txt +++ b/requirements-docs.txt @@ -2,6 +2,6 @@ # Install with: .venv/bin/pip install -r requirements-docs.txt # Build the site with: .venv/bin/mkdocs build --strict -mkdocs>=1.6 -mkdocs-material>=9.5 -pymdown-extensions>=10.7 +mkdocs>=1.6.1 +mkdocs-material>=9.7.6 +pymdown-extensions>=10.21.3 diff --git a/scripts/e2e_test_suite.py b/scripts/e2e_test_suite.py index 8126505..6a962c4 100755 --- a/scripts/e2e_test_suite.py +++ b/scripts/e2e_test_suite.py @@ -295,6 +295,25 @@ def _resolve_hf_guard(): ok = ("owner/name" in blob) or ("400" in blob) return ("pass" if ok else "fail"), ("" if ok else f"unexpected: {err[:160]}"), {} + # New-feature gate for the frontier families added this release. Asserts + # they surface in the live Discover catalog (/api/workspace) with their + # full variant set — a shape check, no model load (these are 150 GB+). 
+ def _new_model_families(): + rc, payload, err = _cli_json("call", "GET", "/api/workspace", timeout=15.0) + if rc != 0 or not isinstance(payload, dict): + return "fail", f"workspace fetch failed: {err[:160]}", {} + fams = {f.get("id"): f for f in (payload.get("featuredModels") or [])} + missing = [] + for fid in ("deepseek-v4", "glm-5"): + fam = fams.get(fid) + if fam is None: + missing.append(f"{fid}: absent") + elif len(fam.get("variants") or []) < 4: + missing.append(f"{fid}: only {len(fam.get('variants') or [])} variants") + if missing: + return "fail", "; ".join(missing)[:200], {"missing": missing} + return "pass", "", {"families": ["deepseek-v4", "glm-5"]} + for name, fn in [ ("health", _health), ("routes", _routes), ("gpu-status", _gpu), ("mtplx-status", _mtplx), ("inventory", _inventory), @@ -303,6 +322,7 @@ def _resolve_hf_guard(): ("ollama-compat (#3)", _ollama_compat), ("model import scan (#4)", _model_import_scan), ("run-from-hf guard (#5)", _resolve_hf_guard), + ("new model families (DeepSeek V4 / GLM-5)", _new_model_families), ]: phase.checks.append(_check(name, fn)) phase.status = "fail" if any(c.status == "fail" for c in phase.checks) else "pass" @@ -616,6 +636,86 @@ def _fused_attention(): return _load_unload_prompt(ref, path=path, backend="mlx", fused=True, cache_strategy="native", context=8192, max_tokens=16) + # 1h. Modern samplers reachable end-to-end (DRY + XTC). New-feature gate + # for the tier-2 / SamplerPanel work: a chat generate carrying + # xtcProbability + dryMultiplier must be accepted and still produce text + # (request fields -> _build_sampler_overrides -> engine plumbing). 
+ def _modern_samplers(): + pick = _pick_fast_mlx() + if not pick: + return "skip", "no MLX text model on disk", {} + ref, path = pick + rc, loaded, err = _cli_json( + "load", ref, "--backend", "mlx", "--cache-strategy", "native", + "--context", "8192", "--path", path, "--timeout", "1800", timeout=1860.0, + ) + if rc != 0 or not isinstance(loaded, dict) or loaded.get("state") != "loaded": + return "fail", f"load failed: {err[:160] if err else loaded}", {} + body = json.dumps({ + "sessionId": "e2e-samplers", "prompt": "Say hello in one short sentence.", + "modelRef": ref, "backend": "mlx", "cacheStrategy": "native", + "maxTokens": 24, "thinkingMode": "off", + "xtcProbability": 0.3, "xtcThreshold": 0.1, "dryMultiplier": 0.8, + }) + rc, gen, err = _cli_json("call", "POST", "/api/chat/generate", "--body", body, "--timeout", "300") + _cli("unload", timeout=60.0) + if rc != 0 or not isinstance(gen, dict): + return "fail", f"generate with xtc/dry rc={rc}: {err[:160]}", {} + # Assert generation actually RAN with the new sampler params accepted + # (completionTokens > 0) — robust to reasoning models that spend the + # budget in a hidden block and emit no visible answer text. + metrics = (gen.get("assistant") or {}).get("metrics") or {} + ctoks = metrics.get("completionTokens") or 0 + return ("pass" if ctoks > 0 else "fail"), f"completionTokens={ctoks}", {"completionTokens": ctoks} + + # 1i. MLX persistent prompt-cache reuse (tier 4). New-feature gate + + # regression guard: two same-session turns; turn-2 must reprocess far + # fewer prompt tokens than turn-1 (the cache reuses the prefix + prefills + # only the new suffix). Without reuse, turn-2 promptTokens would EXCEED + # turn-1 because the conversation grows. 
+ def _mlx_prompt_cache_reuse(): + pick = _pick_fast_mlx() + if not pick: + return "skip", "no MLX text model on disk", {} + ref, path = pick + rc, loaded, err = _cli_json( + "load", ref, "--backend", "mlx", "--cache-strategy", "native", + "--context", "8192", "--path", path, "--timeout", "1800", timeout=1860.0, + ) + if rc != 0 or not isinstance(loaded, dict) or loaded.get("state") != "loaded": + return "fail", f"load failed: {err[:160] if err else loaded}", {} + + def _turn(prompt: str): + body = json.dumps({ + "sessionId": "e2e-cache-reuse", "prompt": prompt, "modelRef": ref, + "backend": "mlx", "cacheStrategy": "native", "maxTokens": 24, + "thinkingMode": "off", + }) + rc, g, err = _cli_json("call", "POST", "/api/chat/generate", "--body", body, "--timeout", "300") + pt = None + if isinstance(g, dict): + pt = ((g.get("assistant") or {}).get("metrics") or {}).get("promptTokens") + return rc, pt + + rc1, pt1 = _turn("List three primary colors.") + rc2, pt2 = _turn("Now list two more colors.") + _cli("unload", timeout=60.0) + if rc1 != 0 or rc2 != 0 or pt1 is None or pt2 is None: + return "fail", f"turns rc={rc1},{rc2} promptTokens={pt1},{pt2}", {} + # turn-2 reprocessing fewer prompt tokens than turn-1 means the + # persistent cache reused the prefix. When it doesn't engage (a + # model whose generated tokens don't round-trip at the answer + # boundary, or a reasoning model) the cache correctly DEGRADES to a + # full reprocess — correct output, just no speedup — so that's an + # honest skip, not a fail. The reuse/trim logic is unit-tested in + # tests/test_mlx_prompt_cache.py regardless of this live signal. 
+ if pt2 < pt1: + return "pass", f"cache reused: promptTokens {pt1} -> {pt2}", {"pt1": pt1, "pt2": pt2} + return "skip", ( + f"reuse did not engage for this model (turn1={pt1} turn2={pt2}); " + "graceful full-reprocess degradation, logic unit-tested separately" + ), {"pt1": pt1, "pt2": pt2} + for name, fn in [ ("MLX native cache", _mlx_native), ("MLX TurboQuant cache", _mlx_turboquant), @@ -626,6 +726,8 @@ def _fused_attention(): ("GGUF MTP speculative", _gguf_mtp), ("long context cache-preview", _long_context_preview), ("fused attention flag", _fused_attention), + ("modern samplers (DRY+XTC)", _modern_samplers), + ("MLX prompt-cache reuse", _mlx_prompt_cache_reuse), ]: phase.checks.append(_check(name, fn)) fails = [c for c in phase.checks if c.status == "fail"] diff --git a/src-tauri/Cargo.lock b/src-tauri/Cargo.lock index 4159e6a..1c1316e 100644 --- a/src-tauri/Cargo.lock +++ b/src-tauri/Cargo.lock @@ -480,7 +480,7 @@ checksum = "9330f8b2ff13f34540b44e946ef35111825727b38d33286ef986142615121801" [[package]] name = "chaosengineai" -version = "0.9.3" +version = "0.9.4" dependencies = [ "flate2", "fluent-bundle", @@ -832,7 +832,7 @@ dependencies = [ "libc", "option-ext", "redox_users", - "windows-sys 0.61.2", + "windows-sys 0.59.0", ] [[package]] @@ -1024,7 +1024,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "39cab71617ae0d63f51a36d69f866391735b51691dbda63cf6f96d042b63efeb" dependencies = [ "libc", - "windows-sys 0.61.2", + "windows-sys 0.52.0", ] [[package]] @@ -1510,7 +1510,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "0bb0228f477c0900c880fd78c8759b95c7636dbd7842707f49e132378aa2acdc" dependencies = [ "heck 0.4.1", - "proc-macro-crate 2.0.2", + "proc-macro-crate 2.0.0", "proc-macro-error", "proc-macro2", "quote", @@ -2174,12 +2174,6 @@ dependencies = [ "selectors 0.24.0", ] -[[package]] -name = "lazy_static" -version = "1.5.0" -source = "registry+https://github.com/rust-lang/crates.io-index" 
-checksum = "bbd2bcb4c963f2ddae06a2efc7e9f3591312473c50c6685e1f298068316e66fe" - [[package]] name = "leb128fmt" version = "0.1.0" @@ -2247,12 +2241,6 @@ dependencies = [ "redox_syscall 0.7.4", ] -[[package]] -name = "libyml" -version = "0.0.4" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "64804cc6a5042d4f05379909ba25b503ec04e2c082151d62122d5dcaa274b961" - [[package]] name = "linux-raw-sys" version = "0.12.1" @@ -2394,7 +2382,7 @@ dependencies = [ "png 0.18.1", "serde", "thiserror 2.0.18", - "windows-sys 0.61.2", + "windows-sys 0.60.2", ] [[package]] @@ -2833,7 +2821,6 @@ version = "0.11.3" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "1fd6780a80ae0c52cc120a26a1a42c1ae51b247a253e4e06113d23d2c2edd078" dependencies = [ - "phf_macros 0.11.3", "phf_shared 0.11.3", ] @@ -2932,19 +2919,6 @@ dependencies = [ "syn 1.0.109", ] -[[package]] -name = "phf_macros" -version = "0.11.3" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "f84ac04429c13a7ff43785d75ad27569f2951ce0ffd30a3321230db2fc727216" -dependencies = [ - "phf_generator 0.11.3", - "phf_shared 0.11.3", - "proc-macro2", - "quote", - "syn 2.0.117", -] - [[package]] name = "phf_macros" version = "0.13.1" @@ -3128,11 +3102,10 @@ dependencies = [ [[package]] name = "proc-macro-crate" -version = "2.0.2" +version = "2.0.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "b00f26d3400549137f92511a46ac1cd8ce37cb5598a96d382381458b992a5d24" +checksum = "7e8366a6159044a37876a2b9817124296703c586a5c92e2c53751fa06d8d43e8" dependencies = [ - "toml_datetime 0.6.3", "toml_edit 0.20.2", ] @@ -3458,12 +3431,11 @@ dependencies = [ [[package]] name = "rust-i18n" -version = "3.1.2" +version = "4.1.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "039f57d22229db401af3458ca939300178e99e88b938573cea12b7c2b0f09724" +checksum = "55691a65892c33ee2de49c15ea5600c6f4a70e8eeb8e6c3cd96d2a231d230c40" 
dependencies = [ "globwalk", - "once_cell", "regex", "rust-i18n-macro", "rust-i18n-support", @@ -3472,41 +3444,36 @@ dependencies = [ [[package]] name = "rust-i18n-macro" -version = "3.1.2" +version = "4.1.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "dde5c022360a2e54477882843d56b6f9bcb4bc62f504b651a2f497f0028d174f" +checksum = "30de488acadcf767d97cd48518a8da8ea9777b1c9a5beca4eab78bbf77d07309" dependencies = [ "glob", - "once_cell", "proc-macro2", "quote", "rust-i18n-support", "serde", "serde_json", - "serde_yml", + "serde_yaml", "syn 2.0.117", ] [[package]] name = "rust-i18n-support" -version = "3.1.2" +version = "4.1.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "75d2844d36f62b5d6b66f9cf8f8cbdbbbdcdb5fd37a473a9cc2fb45fdcf485d2" +checksum = "aea0fef8a93c06326b66392c95a115120e609674cb2132d37d276a6b05b545b4" dependencies = [ "arc-swap", "base62", "globwalk", "itertools", - "lazy_static", "normpath", - "once_cell", - "proc-macro2", - "regex", "serde", "serde_json", - "serde_yml", + "serde_yaml", "siphasher 1.0.2", - "toml 0.7.8", + "toml 0.8.23", "triomphe", ] @@ -3535,7 +3502,7 @@ dependencies = [ "errno", "libc", "linux-raw-sys", - "windows-sys 0.61.2", + "windows-sys 0.52.0", ] [[package]] @@ -3591,7 +3558,7 @@ dependencies = [ "security-framework", "security-framework-sys", "webpki-root-certs", - "windows-sys 0.61.2", + "windows-sys 0.52.0", ] [[package]] @@ -3829,9 +3796,9 @@ dependencies = [ [[package]] name = "serde_json" -version = "1.0.149" +version = "1.0.150" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "83fc039473c5595ace860d8c4fafa220ff474b3fc6bfdb4293327f1a37e94d86" +checksum = "e8014e44b4736ed0538adeecded0fce2a272f22dc9578a7eb6b2d9993c74cfb9" dependencies = [ "itoa", "memchr", @@ -3901,20 +3868,16 @@ dependencies = [ ] [[package]] -name = "serde_yml" -version = "0.0.11" +name = "serde_yaml" +version = "0.9.34+deprecated" source = 
"registry+https://github.com/rust-lang/crates.io-index" -checksum = "48e76bab63c3fd98d27c17f9cbce177f64a91f5e69ac04cafe04e1bb25d1dc3c" +checksum = "6a8b1a1a2ebf674015cc02edccce75287f1a0130d394307b36743c2f5d504b47" dependencies = [ "indexmap 2.14.0", "itoa", - "libyml", - "log", - "memchr", "ryu", "serde", - "serde_json", - "tempfile", + "unsafe-libyaml", ] [[package]] @@ -4022,7 +3985,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "3a766e1110788c36f4fa1c2b71b387a7815aa65f88ce0229841826633d93723e" dependencies = [ "libc", - "windows-sys 0.61.2", + "windows-sys 0.60.2", ] [[package]] @@ -4202,7 +4165,7 @@ dependencies = [ "cfg-expr", "heck 0.5.0", "pkg-config", - "toml 0.8.2", + "toml 0.8.23", "version-compare", ] @@ -4259,9 +4222,9 @@ dependencies = [ [[package]] name = "tar" -version = "0.4.45" +version = "0.4.46" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "22692a6476a21fa75fdfc11d452fda482af402c008cdbaf3476414e122040973" +checksum = "3f6221d9a6003c78398e3b239969f352578258df48c8eb051caadae0015bc840" dependencies = [ "filetime", "libc", @@ -4276,9 +4239,9 @@ checksum = "61c41af27dd6d1e27b1b16b489db798443478cef1f06a660c96db617ba5de3b1" [[package]] name = "tauri" -version = "2.11.0" +version = "2.11.2" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d059f2527558d9dba6f186dec4772610e1aecfd3f94002397613e7e648752b66" +checksum = "437404997acf375d85f1177afa7e11bb971f274ed6a7b83a2a3e339015f4cc28" dependencies = [ "anyhow", "bytes", @@ -4327,9 +4290,9 @@ dependencies = [ [[package]] name = "tauri-build" -version = "2.6.0" +version = "2.6.2" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "be9aa8c59a894f76c29a002501c589de5eb4987a5913d62a6e0a47f320901988" +checksum = "4aa1f9055fc23919a54e4e125052bed16ed04aef0487086e758fe01a67b451c7" dependencies = [ "anyhow", "cargo_toml", @@ -4348,9 +4311,9 @@ dependencies = [ [[package]] name = "tauri-codegen" 
-version = "2.6.0" +version = "2.6.2" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d3e4e8230d565106aa19dfbaa01a7ed01abf78047fe0577a83377224bd1bf20e" +checksum = "e4a0319528a025a38c4078e7dae2c446f4e63620ddb0659a643ede1cb38f90e9" dependencies = [ "base64 0.22.1", "brotli", @@ -4375,9 +4338,9 @@ dependencies = [ [[package]] name = "tauri-macros" -version = "2.6.0" +version = "2.6.2" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "bc8de2cddbbc33dbdf4c84f170121886595efdbcc9cb4b3d76342b79d082cedc" +checksum = "ae6cb4e3896c21d2f6da5b31251d2faea0153bba56ed0e970f918115dbee4924" dependencies = [ "heck 0.5.0", "proc-macro2", @@ -4406,9 +4369,9 @@ dependencies = [ [[package]] name = "tauri-plugin-dialog" -version = "2.7.0" +version = "2.7.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a1fa4150c95ae391946cc8b8f905ab14797427caba3a8a2f79628e956da91809" +checksum = "65981abb771e74e571a38196c3baa11c459379164791eba0e67abc1a5fac9884" dependencies = [ "log", "raw-window-handle", @@ -4424,9 +4387,9 @@ dependencies = [ [[package]] name = "tauri-plugin-fs" -version = "2.5.0" +version = "2.5.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "36e1ec28b79f3d0683f4507e1615c36292c0ea6716668770d4396b9b39871ed8" +checksum = "b7ecc274121aca0c036a2b42d1cbe83d368d348f54e0bb8a735c2b1548e8f371" dependencies = [ "anyhow", "dunce", @@ -4442,15 +4405,15 @@ dependencies = [ "tauri-plugin", "tauri-utils", "thiserror 2.0.18", - "toml 0.9.12+spec-1.1.0", + "toml 1.1.2+spec-1.1.0", "url", ] [[package]] name = "tauri-plugin-opener" -version = "2.5.3" +version = "2.5.4" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "fc624469b06f59f5a29f874bbc61a2ed737c0f9c23ef09855a292c389c42e83f" +checksum = "17e1bea14edce6b793a04e2417e3fd924b9bc4faae83cdee7d714156cceeed29" dependencies = [ "dunce", "glob", @@ -4513,9 +4476,9 @@ dependencies = [ [[package]] name = 
"tauri-runtime" -version = "2.11.0" +version = "2.11.2" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1e42bbcb76237351fbaa02f08d808c537dc12eb5a6eabbf3e517b50056334d95" +checksum = "48222d7116c8807eaa6fe2f372e023fae125084e61e6eca6d70b7961cdf129ef" dependencies = [ "cookie", "dpi", @@ -4538,9 +4501,9 @@ dependencies = [ [[package]] name = "tauri-runtime-wry" -version = "2.11.0" +version = "2.11.2" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "2cadb13dad0c681e1e0a2c49ae488f0e2906ded3d57e7a0017f4aaf46e387117" +checksum = "b83849ee63ecb27a8e8d0fe51915ca215076914aca43f96db1179f0f415f6cd9" dependencies = [ "gtk", "http", @@ -4564,9 +4527,9 @@ dependencies = [ [[package]] name = "tauri-utils" -version = "2.9.0" +version = "2.9.2" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "55f61d2bf7188fbcf2b0ed095b67a6bc498f713c939314bb19eb700118a573b7" +checksum = "092379df9a707631978e6c56b1bc2401d387f01e2d4a3c123360d167bbb9aa95" dependencies = [ "anyhow", "brotli", @@ -4582,7 +4545,7 @@ dependencies = [ "kuchikiki", "log", "memchr", - "phf 0.11.3", + "phf 0.13.1", "plist", "proc-macro2", "quote", @@ -4623,7 +4586,7 @@ dependencies = [ "getrandom 0.4.2", "once_cell", "rustix", - "windows-sys 0.61.2", + "windows-sys 0.52.0", ] [[package]] @@ -4768,48 +4731,51 @@ dependencies = [ [[package]] name = "toml" -version = "0.7.8" +version = "0.8.23" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "dd79e69d3b627db300ff956027cc6c3798cef26d22526befdfcd12feeb6d2257" +checksum = "dc1beb996b9d83529a9e75c17a1686767d148d70663143c7854d8b4a09ced362" dependencies = [ "serde", "serde_spanned 0.6.9", - "toml_datetime 0.6.3", - "toml_edit 0.19.15", + "toml_datetime 0.6.11", + "toml_edit 0.22.27", ] [[package]] name = "toml" -version = "0.8.2" +version = "0.9.12+spec-1.1.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = 
"185d8ab0dfbb35cf1399a6344d8484209c088f75f8f68230da55d48d95d43e3d" +checksum = "cf92845e79fc2e2def6a5d828f0801e29a2f8acc037becc5ab08595c7d5e9863" dependencies = [ - "serde", - "serde_spanned 0.6.9", - "toml_datetime 0.6.3", - "toml_edit 0.20.2", + "indexmap 2.14.0", + "serde_core", + "serde_spanned 1.1.1", + "toml_datetime 0.7.5+spec-1.1.0", + "toml_parser", + "toml_writer", + "winnow 0.7.15", ] [[package]] name = "toml" -version = "0.9.12+spec-1.1.0" +version = "1.1.2+spec-1.1.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "cf92845e79fc2e2def6a5d828f0801e29a2f8acc037becc5ab08595c7d5e9863" +checksum = "81f3d15e84cbcd896376e6730314d59fb5a87f31e4b038454184435cd57defee" dependencies = [ "indexmap 2.14.0", "serde_core", "serde_spanned 1.1.1", - "toml_datetime 0.7.5+spec-1.1.0", + "toml_datetime 1.1.1+spec-1.1.0", "toml_parser", "toml_writer", - "winnow 0.7.15", + "winnow 1.0.2", ] [[package]] name = "toml_datetime" -version = "0.6.3" +version = "0.6.11" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "7cda73e2f1397b1262d6dfdcef8aafae14d1de7748d66822d3bfeeb6d03e5e4b" +checksum = "22cddaf88f4fbc13c51aebbf5f8eceb5c7c5a9da2ac40a13519eb5b0a0e8f11c" dependencies = [ "serde", ] @@ -4839,9 +4805,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "1b5bb770da30e5cbfde35a2d7b9b8a2c4b8ef89548a7a6aeab5c9a576e3e7421" dependencies = [ "indexmap 2.14.0", - "serde", - "serde_spanned 0.6.9", - "toml_datetime 0.6.3", + "toml_datetime 0.6.11", "winnow 0.5.40", ] @@ -4850,12 +4814,24 @@ name = "toml_edit" version = "0.20.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "396e4d48bbb2b7554c944bde63101b5ae446cff6ec4a24227428f15eb72ef338" +dependencies = [ + "indexmap 2.14.0", + "toml_datetime 0.6.11", + "winnow 0.5.40", +] + +[[package]] +name = "toml_edit" +version = "0.22.27" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = 
"41fe8c660ae4257887cf66394862d21dbca4a6ddd26f04a3560410406a2f819a" dependencies = [ "indexmap 2.14.0", "serde", "serde_spanned 0.6.9", - "toml_datetime 0.6.3", - "winnow 0.5.40", + "toml_datetime 0.6.11", + "toml_write", + "winnow 0.7.15", ] [[package]] @@ -4879,6 +4855,12 @@ dependencies = [ "winnow 1.0.2", ] +[[package]] +name = "toml_write" +version = "0.1.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "5d99f8c9a7727884afe522e9bd5edbfc91a3312b36a77b5fb8926e4c31a41801" + [[package]] name = "toml_writer" version = "1.1.1+spec-1.1.0" @@ -4980,7 +4962,7 @@ dependencies = [ "png 0.18.1", "serde", "thiserror 2.0.18", - "windows-sys 0.61.2", + "windows-sys 0.60.2", ] [[package]] @@ -5029,7 +5011,7 @@ checksum = "f2f6fb2847f6742cd76af783a2a2c49e9375d0a111c7bef6f71cd9e738c72d6e" dependencies = [ "memoffset", "tempfile", - "windows-sys 0.61.2", + "windows-sys 0.60.2", ] [[package]] @@ -5109,6 +5091,12 @@ version = "0.2.6" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "ebc1c04c71510c7f702b52b7c350734c9ff1295c464a03335b00bb84fc54f853" +[[package]] +name = "unsafe-libyaml" +version = "0.2.11" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "673aac59facbab8a9007c7f6108d11f63b603f7cabff99fabf650fea5c32b861" + [[package]] name = "untrusted" version = "0.9.0" @@ -5480,7 +5468,7 @@ version = "0.1.11" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "c2a7b1c03c876122aa43f3020e6c3c3ee5c05081c9a00739faf7503aeba10d22" dependencies = [ - "windows-sys 0.61.2", + "windows-sys 0.52.0", ] [[package]] diff --git a/src-tauri/Cargo.toml b/src-tauri/Cargo.toml index 8ea418a..31cb675 100644 --- a/src-tauri/Cargo.toml +++ b/src-tauri/Cargo.toml @@ -1,6 +1,6 @@ [package] name = "chaosengineai" -version = "0.9.3" +version = "0.9.4" description = "ChaosEngineAI desktop shell for local AI model inference" authors = ["OpenAI Codex"] edition = "2021" @@ -28,7 +28,7 @@ tar = "0.4" # 
complements it for runtime-composed strings that need ICU-style plurals # / select / select-ordinal — e.g. updater progress "{n, plural, one {# minute # remaining} other {# minutes remaining}}" where ``n`` only exists at runtime. -rust-i18n = "3" +rust-i18n = "4" fluent-bundle = "0.16" unic-langid = "0.9" # FU-037 (2026-05-10): ``devtools`` flips on the WebKit inspector in @@ -37,7 +37,7 @@ unic-langid = "0.9" # without rebuilding the app with ``cargo tauri dev``. We pair this # with the per-tab ``ErrorBoundary`` so JS exceptions stay recoverable # AND inspectable. -tauri = { version = "~2.11.0", features = ["devtools"] } +tauri = { version = "~2.11.2", features = ["devtools"] } tauri-plugin-dialog = "2.7" tauri-plugin-opener = "2" tauri-plugin-updater = "2" @@ -54,7 +54,7 @@ libc = "0.2" # add an explicit dep so we can name the features we need. [target.'cfg(windows)'.dependencies] windows-sys = { version = "0.61", features = [ - "Win32_Foundation", # HANDLE, CloseHandle - "Win32_System_JobObjects", # CreateJobObjectW, SetInformationJobObject, - # AssignProcessToJobObject, JOBOBJECT_EXTENDED_LIMIT_INFORMATION + "Win32_Foundation", # HANDLE, CloseHandle + "Win32_System_JobObjects", # CreateJobObjectW, SetInformationJobObject, AssignProcessToJobObject + "Win32_System_Threading", # JOBOBJECT_EXTENDED_LIMIT_INFORMATION (gated here in 0.61.2+) ] } diff --git a/src/components/SamplerPanel.tsx b/src/components/SamplerPanel.tsx index 9df721e..7f58c33 100644 --- a/src/components/SamplerPanel.tsx +++ b/src/components/SamplerPanel.tsx @@ -194,6 +194,39 @@ export function SamplerPanel({ overrides, onChange, disabled }: SamplerPanelProp disabled={disabled} onChange={(v) => patch("repeatPenalty", v)} /> + patch("xtcProbability", v)} + /> + patch("xtcThreshold", v)} + /> + patch("dryMultiplier", v)} + /> { expect(samplerPayload({ topP: 0.9, topK: null, seed: null })).toEqual({ topP: 0.9 }); }); + it("projects modern samplers (xtc + dry)", () => { + expect( + samplerPayload({ 
xtcProbability: 0.5, xtcThreshold: 0.1, dryMultiplier: 0.8 }), + ).toEqual({ xtcProbability: 0.5, xtcThreshold: 0.1, dryMultiplier: 0.8 }); + }); + + it("round-trips modern samplers through storage", () => { + writeSamplerOverrides("sx", { xtcProbability: 0.5, dryMultiplier: 0.8 }); + expect(readSamplerOverrides("sx")).toEqual({ xtcProbability: 0.5, dryMultiplier: 0.8 }); + }); + it("parses jsonSchemaText into jsonSchema when valid", () => { const schemaText = '{"type":"object","properties":{"answer":{"type":"string"}}}'; expect(samplerPayload({ jsonSchemaText: schemaText })).toEqual({ diff --git a/src/features/chat/samplerOverrides.ts b/src/features/chat/samplerOverrides.ts index 4bcf226..07007e1 100644 --- a/src/features/chat/samplerOverrides.ts +++ b/src/features/chat/samplerOverrides.ts @@ -21,6 +21,9 @@ const NUMERIC_KEYS = [ "seed", "mirostatTau", "mirostatEta", + "xtcProbability", + "xtcThreshold", + "dryMultiplier", ] as const; function storageKey(sessionId: string): string { @@ -95,6 +98,9 @@ export function samplerPayload(overrides: SamplerOverrides): Record; /** * Phase 3.3: when set, asks llama-server to return top-k logprobs @@ -255,6 +260,9 @@ export interface SamplerOverrides { mirostatMode?: 0 | 1 | 2 | null; mirostatTau?: number | null; mirostatEta?: number | null; + xtcProbability?: number | null; + xtcThreshold?: number | null; + dryMultiplier?: number | null; /** * Phase 2.2: opt-in constrained decoding. Raw JSON-schema text the * user typed in the SamplerPanel. 
Parsed at send-time and forwarded diff --git a/tests/test_backend_service.py b/tests/test_backend_service.py index ff1c6af..c5b04a1 100644 --- a/tests/test_backend_service.py +++ b/tests/test_backend_service.py @@ -1350,6 +1350,30 @@ def test_openai_completion_forwards_sampler_fields(self): self.assertEqual(runtime_kwargs["samplers"]["stop"], ["END"]) self.assertIn("properties", runtime_kwargs["json_schema"]) + def test_openai_completion_forwards_extended_samplers(self): + # Parity fix: min_p / repeat_penalty / mirostat were dropped on the + # /v1 path. They must now reach the runtime sampler dict. + response = self.client.post( + "/v1/chat/completions", + json={ + "model": "google/gemma-4-E4B-it", + "messages": [{"role": "user", "content": "test"}], + "max_tokens": 16, + "min_p": 0.05, + "repeat_penalty": 1.15, + "mirostat": 2, + "mirostat_tau": 5.0, + "mirostat_eta": 0.1, + }, + ) + self.assertEqual(response.status_code, 200) + samplers = self.client.app.state.chaosengine.runtime.last_generate_kwargs["samplers"] + self.assertEqual(samplers["min_p"], 0.05) + self.assertEqual(samplers["repeat_penalty"], 1.15) + self.assertEqual(samplers["mirostat"], 2) + self.assertEqual(samplers["mirostat_tau"], 5.0) + self.assertEqual(samplers["mirostat_eta"], 0.1) + def test_openai_completion_omits_sampler_dict_when_none_set(self): response = self.client.post( "/v1/chat/completions", diff --git a/tests/test_catalog_text_families.py b/tests/test_catalog_text_families.py new file mode 100644 index 0000000..633f5cb --- /dev/null +++ b/tests/test_catalog_text_families.py @@ -0,0 +1,87 @@ +"""Catalog gate for the frontier text families added for the release +(DeepSeek V4, GLM-5, Gemma 4, MiniMax M2). Asserts they parse, carry every +field the discover payload builder reads, and surface in the family payloads +— so a malformed entry can't ship a broken Discover tab. 
+""" + +import unittest + +from backend_service.catalog.text_models import MODEL_FAMILIES + +_REQUIRED_FAMILY_FIELDS = { + "id", "name", "provider", "headline", "summary", "description", + "updatedLabel", "popularityLabel", "likesLabel", "badges", "capabilities", + "defaultVariantId", "variants", "readme", +} +_REQUIRED_VARIANT_FIELDS = { + "id", "name", "repo", "link", "paramsB", "sizeGb", "format", + "quantization", "capabilities", "note", "contextWindow", "launchMode", "backend", +} + + +class NewTextFamiliesTests(unittest.TestCase): + def setUp(self): + self.by_id = {f["id"]: f for f in MODEL_FAMILIES} + + _ALL_NEW_FAMILIES = ("deepseek-v4", "glm-5", "gemma-4", "minimax-m2") + + def test_all_new_families_present(self): + for fid in self._ALL_NEW_FAMILIES: + self.assertIn(fid, self.by_id, f"{fid} missing from MODEL_FAMILIES") + + def test_new_families_have_required_shape(self): + for fid in self._ALL_NEW_FAMILIES: + fam = self.by_id[fid] + self.assertEqual(_REQUIRED_FAMILY_FIELDS - set(fam), set(), f"{fid} family fields") + self.assertTrue(fam["variants"], f"{fid} has variants") + variant_ids = [v["id"] for v in fam["variants"]] + self.assertIn(fam["defaultVariantId"], variant_ids, f"{fid} default variant valid") + for v in fam["variants"]: + self.assertEqual(_REQUIRED_VARIANT_FIELDS - set(v), set(), f"{fid}/{v['id']} variant fields") + self.assertEqual(v["link"], f"https://huggingface.co/{v['repo']}", f"{fid}/{v['id']} link") + self.assertIn(v["backend"], ("mlx", "llama.cpp", "vllm")) + self.assertIn(v["launchMode"], ("direct", "convert")) + + def test_text_only_families_have_no_vision(self): + # DeepSeek V4 / GLM-5 / MiniMax M2 carry no vision_config in their HF + # configs — must not advertise vision (broken composer affordance if so). 
+ for fid in ("deepseek-v4", "glm-5", "minimax-m2"): + fam = self.by_id[fid] + self.assertNotIn("vision", fam["capabilities"], f"{fid} family vision tag") + for v in fam["variants"]: + self.assertNotIn("vision", v["capabilities"], f"{fid}/{v['id']} vision tag") + + def test_gemma4_carries_vision_capability(self): + # All Gemma 4 sizes are multimodal (Gemma4ForConditionalGeneration + vision_config). + fam = self.by_id["gemma-4"] + self.assertIn("vision", fam["capabilities"]) + for v in fam["variants"]: + self.assertIn("vision", v["capabilities"], f"gemma-4/{v['id']} missing vision tag") + + def test_gemma4_contexts(self): + # E2B = 128K, 31B = 256K — verify the catalog reflects the config.json values. + e2b_variants = [v for v in self.by_id["gemma-4"]["variants"] if "E2B" in v["repo"]] + b31_variants = [v for v in self.by_id["gemma-4"]["variants"] if "31B" in v["repo"] or "31b" in v["repo"]] + self.assertTrue(e2b_variants, "no E2B variants found") + self.assertTrue(b31_variants, "no 31B variants found") + for v in e2b_variants: + self.assertEqual(v["contextWindow"], "128K", f"{v['id']} E2B context wrong") + for v in b31_variants: + self.assertEqual(v["contextWindow"], "256K", f"{v['id']} 31B context wrong") + + def test_minimax_m27_context(self): + fam = self.by_id["minimax-m2"] + for v in fam["variants"]: + self.assertEqual(v["contextWindow"], "200K", f"minimax-m2/{v['id']} context wrong") + + def test_new_families_surface_in_discover_payloads(self): + from backend_service.helpers.discovery import _model_family_payloads + + payloads = _model_family_payloads({"totalMemoryGb": 64, "availableMemoryGb": 32}, []) + ids = {p.get("id") for p in payloads} + for fid in self._ALL_NEW_FAMILIES: + self.assertIn(fid, ids, f"{fid} missing from discover payloads") + + +if __name__ == "__main__": + unittest.main() diff --git a/tests/test_history_with_reasoning.py b/tests/test_history_with_reasoning.py index 74f8da4..0f45e97 100644 --- a/tests/test_history_with_reasoning.py +++ 
b/tests/test_history_with_reasoning.py @@ -9,6 +9,7 @@ import unittest from backend_service.state import _build_history_with_reasoning +from backend_service.state._helpers import _estimate_tokens, _history_token_budget class BuildHistoryWithReasoningTests(unittest.TestCase): @@ -65,5 +66,61 @@ def test_preserves_message_order(self): self.assertIn("R2", history[3]["text"]) +class HistoryTokenWindowTests(unittest.TestCase): + def test_token_budget_none_keeps_all(self): + messages = [{"role": "user", "text": "x" * 300} for _ in range(6)] + history = _build_history_with_reasoning(messages, preserve_reasoning=False, token_budget=None) + self.assertEqual(len(history), 6) + + def test_windows_oldest_turns_out(self): + # Each 30-char text ~= 11 estimated tokens; budget 25 keeps 2 newest. + messages = [ + {"role": "user", "text": "a" * 30}, + {"role": "assistant", "text": "b" * 30}, + {"role": "user", "text": "c" * 30}, + {"role": "assistant", "text": "d" * 30}, + ] + history = _build_history_with_reasoning(messages, preserve_reasoning=False, token_budget=25) + self.assertEqual([h["text"] for h in history], ["c" * 30, "d" * 30]) + + def test_always_keeps_latest_turn_even_if_over_budget(self): + messages = [{"role": "user", "text": "z" * 300}] + history = _build_history_with_reasoning(messages, preserve_reasoning=False, token_budget=10) + self.assertEqual(len(history), 1) + self.assertEqual(history[0]["text"], "z" * 300) + + def test_system_messages_always_kept(self): + messages = [ + {"role": "system", "text": "s" * 30}, + {"role": "user", "text": "u" * 300}, + {"role": "assistant", "text": "a" * 300}, + {"role": "user", "text": "n" * 9}, + ] + history = _build_history_with_reasoning(messages, preserve_reasoning=False, token_budget=20) + roles = [h["role"] for h in history] + self.assertIn("system", roles) + self.assertEqual(history[-1]["text"], "n" * 9) + self.assertNotIn("u" * 300, [h["text"] for h in history]) + + def test_estimate_tokens_is_conservative(self): + # ~3 
chars/token (over-estimates English so the window stays safe). + self.assertEqual(_estimate_tokens(""), 1) + self.assertEqual(_estimate_tokens("abc"), 2) + self.assertEqual(_estimate_tokens("a" * 30), 11) + + def test_history_token_budget_reserves_and_floors(self): + budget = _history_token_budget( + context_tokens=2000, max_tokens=256, system_prompt="x" * 30, prompt="y" * 30 + ) + # 2000 - (11 + 11 + 256 + 512) = 1210 + self.assertEqual(budget, 1210) + + def test_history_token_budget_floor_512(self): + budget = _history_token_budget( + context_tokens=100, max_tokens=256, system_prompt=None, prompt=None + ) + self.assertEqual(budget, 512) + + if __name__ == "__main__": unittest.main() diff --git a/tests/test_mlx_prompt_cache.py b/tests/test_mlx_prompt_cache.py new file mode 100644 index 0000000..593e419 --- /dev/null +++ b/tests/test_mlx_prompt_cache.py @@ -0,0 +1,180 @@ +"""Tests for the MLX per-session prompt-cache reuse logic (tier 4). + +Exercises backend_service/mlx_worker_prompt_cache.py with a fake worker +state and patched mlx-lm cache primitives — no real model load. The +correctness contract under test: the persisted token list always equals +the cache's positional contents, and any uncertainty falls back to a fresh +full prefill. 
+""" + +import unittest +from unittest import mock + +from backend_service import mlx_worker_prompt_cache as pc + +CACHE_MOD = "mlx_lm.models.cache" + + +class FakeCache: + """Sentinel standing in for an mlx-lm prompt cache.""" + + def __init__(self, label): + self.label = label + + +class FakeState: + def __init__(self, *, base_cache=None, base_note=None, tokens=None, model_ref="m"): + self._base = (base_cache, base_note) + self._tokens = list(tokens or []) + self.model = object() + self._loaded_model_ref = model_ref + self.tokenizer = self + self._persist_cache = None + self._persist_tokens = [] + self._persist_cache_model_ref = None + + def _make_cache(self): + return self._base + + def encode(self, _text): # stands in for tokenizer.encode + return list(self._tokens) + + +class CommonPrefixTests(unittest.TestCase): + def test_common_prefix_len(self): + self.assertEqual(pc._common_prefix_len([1, 2, 3], [1, 2, 9]), 2) + self.assertEqual(pc._common_prefix_len([1, 2], [9]), 0) + self.assertEqual(pc._common_prefix_len([1, 2, 3], [1, 2, 3, 4]), 3) + + +class AcquireCompressionTests(unittest.TestCase): + def test_compression_strategy_passthrough(self): + comp = FakeCache("compression") + state = FakeState(base_cache=comp, base_note="cn") + acq = pc.acquire(state, "p-text") + self.assertIs(acq.cache, comp) + self.assertEqual(acq.prompt_feed, "p-text") # string, unchanged + self.assertFalse(acq.managed) + self.assertIs(acq.fields_cache, comp) + self.assertIsNone(acq.commit_tokens) + + +class AcquireNativeTests(unittest.TestCase): + def _patches(self, *, can_trim=True, trim=lambda c, n: n, fresh_label="fresh"): + return ( + mock.patch(f"{CACHE_MOD}.make_prompt_cache", return_value=FakeCache(fresh_label)), + mock.patch(f"{CACHE_MOD}.can_trim_prompt_cache", return_value=can_trim), + mock.patch(f"{CACHE_MOD}.trim_prompt_cache", side_effect=trim), + ) + + def test_fresh_native_cache_full_prefill(self): + state = FakeState(base_cache=None, tokens=[1, 2, 3]) + with 
self._patches()[0], self._patches()[1], self._patches()[2]: + acq = pc.acquire(state, "ignored") + self.assertTrue(acq.managed) + self.assertIsInstance(acq.cache, FakeCache) + self.assertEqual(acq.prompt_feed, [1, 2, 3]) # full token list + self.assertEqual(acq.commit_tokens, [1, 2, 3]) + self.assertIsNone(acq.fields_cache) + + def test_reuse_hit_feeds_only_suffix_no_trim(self): + persist = FakeCache("persist") + state = FakeState(base_cache=None, tokens=[1, 2, 3, 4, 5], model_ref="m") + state._persist_cache = persist + state._persist_tokens = [1, 2, 3] + state._persist_cache_model_ref = "m" + m1, m2, m3 = self._patches() + with m1, m2, m3 as trim: + acq = pc.acquire(state, "ignored") + self.assertIs(acq.cache, persist) # reused, not fresh + self.assertEqual(acq.prompt_feed, [4, 5]) # suffix only + self.assertEqual(acq.commit_tokens, [1, 2, 3, 4, 5]) + trim.assert_not_called() # num_to_trim == 0 + + def test_reuse_with_divergence_trims_tail(self): + persist = FakeCache("persist") + state = FakeState(base_cache=None, tokens=[1, 2, 3, 4], model_ref="m") + state._persist_cache = persist + state._persist_tokens = [1, 2, 3, 9, 9] # diverges at index 3 (shared prefix: 3 tokens) + state._persist_cache_model_ref = "m" + m1, m2, m3 = self._patches() + with m1, m2, m3 as trim: + acq = pc.acquire(state, "ignored") + self.assertIs(acq.cache, persist) + trim.assert_called_once_with(persist, 2) # 5 cached - 3 common + self.assertEqual(acq.prompt_feed, [4]) # full[3:] + + def test_reset_on_model_change(self): + state = FakeState(base_cache=None, tokens=[1, 2, 3], model_ref="new") + state._persist_cache = FakeCache("stale") + state._persist_tokens = [1, 2, 3] + state._persist_cache_model_ref = "old" + m1, m2, m3 = self._patches() + with m1, m2, m3: + acq = pc.acquire(state, "ignored") + self.assertEqual(acq.prompt_feed, [1, 2, 3]) # fresh → full prefill + self.assertEqual(acq.cache.label, "fresh") + + def test_reset_when_cache_not_trimmable(self): + state = FakeState(base_cache=None, tokens=[1, 2, 3,
4], model_ref="m") + state._persist_cache = FakeCache("persist") + state._persist_tokens = [1, 2, 3] + state._persist_cache_model_ref = "m" + m1, m2, m3 = self._patches(can_trim=False) + with m1, m2, m3: + acq = pc.acquire(state, "ignored") + self.assertEqual(acq.cache.label, "fresh") + self.assertEqual(acq.prompt_feed, [1, 2, 3, 4]) + + def test_reset_when_no_common_prefix(self): + state = FakeState(base_cache=None, tokens=[7, 8, 9], model_ref="m") + state._persist_cache = FakeCache("persist") + state._persist_tokens = [1, 2, 3] + state._persist_cache_model_ref = "m" + m1, m2, m3 = self._patches() + with m1, m2, m3: + acq = pc.acquire(state, "ignored") + self.assertEqual(acq.cache.label, "fresh") + self.assertEqual(acq.prompt_feed, [7, 8, 9]) + + def test_partial_trim_falls_back_to_fresh(self): + state = FakeState(base_cache=None, tokens=[1, 2, 3, 4], model_ref="m") + state._persist_cache = FakeCache("persist") + state._persist_tokens = [1, 2, 3, 9, 9] + state._persist_cache_model_ref = "m" + # trim returns fewer than requested → unsafe → fresh + m1, m2, m3 = self._patches(trim=lambda c, n: n - 1) + with m1, m2, m3: + acq = pc.acquire(state, "ignored") + self.assertEqual(acq.cache.label, "fresh") + self.assertEqual(acq.prompt_feed, [1, 2, 3, 4]) + + +class CommitInvalidateTests(unittest.TestCase): + def test_commit_accounting_is_prompt_plus_generated(self): + state = FakeState() + cache = FakeCache("c") + pc.commit(state, cache=cache, commit_tokens=[1, 2, 3], generated_ids=[4, 5], model_ref="m") + self.assertIs(state._persist_cache, cache) + self.assertEqual(state._persist_tokens, [1, 2, 3, 4, 5]) + self.assertEqual(state._persist_cache_model_ref, "m") + + def test_commit_noop_when_not_managed(self): + state = FakeState() + pc.commit(state, cache=None, commit_tokens=None, generated_ids=[4], model_ref="m") + self.assertIsNone(state._persist_cache) + self.assertEqual(state._persist_tokens, []) + + def test_invalidate_clears(self): + state = FakeState() + 
state._persist_cache = FakeCache("c") + state._persist_tokens = [1, 2] + state._persist_cache_model_ref = "m" + pc.invalidate(state) + self.assertIsNone(state._persist_cache) + self.assertEqual(state._persist_tokens, []) + self.assertIsNone(state._persist_cache_model_ref) + + +if __name__ == "__main__": + unittest.main() diff --git a/tests/test_mlx_worker.py b/tests/test_mlx_worker.py index 7212104..d1f79d0 100644 --- a/tests/test_mlx_worker.py +++ b/tests/test_mlx_worker.py @@ -875,5 +875,41 @@ def test_unload_clears_multimodal_state(self): self.assertFalse(worker.is_multimodal) +class MlxLogitsProcessorTests(unittest.TestCase): + """_build_mlx_logits_processors wires repeat_penalty (mlx-lm applies it + via logits_processors, not the sampler — it was being dropped).""" + + def setUp(self): + from backend_service.mlx_worker_request import _build_mlx_logits_processors + + self._build = _build_mlx_logits_processors + + def test_none_when_no_samplers(self): + self.assertIsNone(self._build({})) + self.assertIsNone(self._build({"samplers": None})) + + def test_none_when_penalty_absent_or_neutral(self): + self.assertIsNone(self._build({"samplers": {"top_p": 0.9}})) + self.assertIsNone(self._build({"samplers": {"repeat_penalty": 1.0}})) + + def test_none_when_penalty_non_numeric(self): + self.assertIsNone(self._build({"samplers": {"repeat_penalty": "oops"}})) + + @unittest.skipUnless( + __import__("importlib").util.find_spec("mlx_lm") is not None, + "mlx-lm not installed", + ) + def test_builds_processors_for_real_penalty(self): + result = self._build({"samplers": {"repeat_penalty": 1.3}}) + self.assertIsNotNone(result) + self.assertTrue(len(result) >= 1) + + def test_accepts_repetition_penalty_alias_without_raising(self): + try: + self._build({"samplers": {"repetition_penalty": 1.2}}) + except Exception as exc: # noqa: BLE001 + self.fail(f"alias parse raised: {exc}") + + if __name__ == "__main__": unittest.main() diff --git a/tests/test_sampler_payload.py 
b/tests/test_sampler_payload.py index 4f63b15..a79f3bd 100644 --- a/tests/test_sampler_payload.py +++ b/tests/test_sampler_payload.py @@ -55,6 +55,30 @@ def test_merges_all_supported_sampler_keys(self): self.assertEqual(payload["mirostat_tau"], 5.0) self.assertEqual(payload["mirostat_eta"], 0.1) + def test_merges_modern_quality_samplers(self): + # DRY / XTC / top-n-sigma were added to _LLAMA_SAMPLER_KEYS; they + # must now flow through to the llama-server payload. + payload: dict = {} + _apply_sampler_kwargs( + payload, + samplers={ + "dry_multiplier": 0.8, + "dry_base": 1.75, + "dry_allowed_length": 2, + "xtc_probability": 0.5, + "xtc_threshold": 0.1, + "top_n_sigma": 1.0, + }, + reasoning_effort=None, + json_schema=None, + ) + self.assertEqual(payload["dry_multiplier"], 0.8) + self.assertEqual(payload["dry_base"], 1.75) + self.assertEqual(payload["dry_allowed_length"], 2) + self.assertEqual(payload["xtc_probability"], 0.5) + self.assertEqual(payload["xtc_threshold"], 0.1) + self.assertEqual(payload["top_n_sigma"], 1.0) + def test_none_values_in_samplers_skip_merge(self): # The frontend may send the union of fields with most set to null — # explicit nulls must not override server defaults. @@ -131,6 +155,19 @@ def test_emits_llama_field_names(self): self.assertEqual(overrides["mirostat_tau"], 5.0) self.assertEqual(overrides["mirostat_eta"], 0.1) + def test_emits_modern_sampler_field_names(self): + # XTC + DRY map to llama/mlx engine-side snake_case keys. 
+ request = SimpleNamespace( + xtcProbability=0.5, xtcThreshold=0.1, + dryMultiplier=0.8, dryBase=1.75, dryAllowedLength=2, + ) + overrides = _build_sampler_overrides(request) + self.assertEqual(overrides["xtc_probability"], 0.5) + self.assertEqual(overrides["xtc_threshold"], 0.1) + self.assertEqual(overrides["dry_multiplier"], 0.8) + self.assertEqual(overrides["dry_base"], 1.75) + self.assertEqual(overrides["dry_allowed_length"], 2) + def test_partial_override_keeps_only_set_fields(self): request = SimpleNamespace( topP=0.9, topK=None, minP=None, repeatPenalty=None,