
Add Mistral Small 4 benchmark results #6

Open

meaningfool wants to merge 1 commit into kwindla:main from meaningfool:meaningfool/mistral-bench

Conversation

meaningfool commented Apr 9, 2026

Summary

  • add mistral-small-2603 to the aiwf_medium_context text-model results table
  • use the latest 10 full 30-turn Opus-judged runs for the row
  • keep all 13 collected runs available in local run artifacts so the earlier degraded-latency window can still be investigated

Latest 10 Runs Used For README Row

  • Pass Rate: 91.0%
  • Turn Pass: 273/300
  • Tool Use: 283/300
  • Instruction: 276/300
  • KB Ground: 296/300
  • TTFT Med / P95 / Max: 1192ms / 8838ms / 21670ms
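The per-category figures above are simple pass counts over 300 judged turns. As a sanity check, the percentages can be recomputed from the raw tallies; a minimal sketch (the helper name is illustrative, not part of the benchmark harness):

```python
def pass_rate(passed: int, total: int) -> float:
    """Pass rate as a percentage, rounded to one decimal place."""
    return round(100 * passed / total, 1)

# Tallies from the README row above.
assert pass_rate(273, 300) == 91.0  # Turn Pass
assert pass_rate(283, 300) == 94.3  # Tool Use
assert pass_rate(276, 300) == 92.0  # Instruction
assert pass_rate(296, 300) == 98.7  # KB Ground
```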

Volatility Within The Kept 10 Runs

The latest 10 are clearly better than the earlier degraded window, but latency is still volatile:

  • 29/300 benchmark turns took more than 5s TTFT
  • 10/300 benchmark turns took more than 10s TTFT
  • 7/10 kept runs had at least one >5s turn
  • 4/10 kept runs had at least one >10s turn
  • worst kept turn: 21.7s TTFT on turn 28 ("One last detail: when is continental breakfast on June 4th?")
  • by contrast, the newest 3-run sanity check had TTFT 595ms / 1095ms / 2193ms
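The threshold counts above are straightforward to derive from the per-turn TTFT samples in the run artifacts; a sketch with toy data (the values below are made up for illustration, not actual benchmark samples):

```python
from statistics import median

def ttft_summary(ttfts_ms, thresholds=(5000, 10000)):
    """Return the median TTFT and, per threshold, how many turns exceeded it."""
    counts = {t: sum(1 for x in ttfts_ms if x > t) for t in thresholds}
    return median(ttfts_ms), counts

# Toy sample of four turns (ms): two exceed 5s, one of those exceeds 10s.
med, counts = ttft_summary([900, 1200, 6000, 21700])
assert counts[5000] == 2 and counts[10000] == 1
```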

Turn-By-Turn TTFT Table For The Kept 10 Runs

| Turn | Median TTFT | P95 TTFT | Max TTFT | >5s Count |
|-----:|------------:|---------:|---------:|----------:|
| 1 | 903ms | 15314ms | 15314ms | 1/10 |
| 2 | 1339ms | 5108ms | 5108ms | 1/10 |
| 3 | 1304ms | 10263ms | 10263ms | 2/10 |
| 4 | 1720ms | 9823ms | 9823ms | 2/10 |
| 5 | 1853ms | 14493ms | 14493ms | 2/10 |
| 6 | 1344ms | 13987ms | 13987ms | 1/10 |
| 7 | 1361ms | 3131ms | 3131ms | 0/10 |
| 8 | 1074ms | 2862ms | 2862ms | 0/10 |
| 9 | 766ms | 5444ms | 5444ms | 1/10 |
| 10 | 813ms | 2381ms | 2381ms | 0/10 |
| 11 | 825ms | 13923ms | 13923ms | 2/10 |
| 12 | 855ms | 3204ms | 3204ms | 0/10 |
| 13 | 1192ms | 8784ms | 8784ms | 2/10 |
| 14 | 1188ms | 6348ms | 6348ms | 1/10 |
| 15 | 1421ms | 5338ms | 5338ms | 1/10 |
| 16 | 779ms | 6779ms | 6779ms | 1/10 |
| 17 | 839ms | 6835ms | 6835ms | 1/10 |
| 18 | 1683ms | 5686ms | 5686ms | 2/10 |
| 19 | 785ms | 3413ms | 3413ms | 0/10 |
| 20 | 866ms | 11642ms | 11642ms | 1/10 |
| 21 | 631ms | 3304ms | 3304ms | 0/10 |
| 22 | 1011ms | 7122ms | 7122ms | 1/10 |
| 23 | 921ms | 9762ms | 9762ms | 1/10 |
| 24 | 919ms | 12989ms | 12989ms | 3/10 |
| 25 | 1557ms | 7687ms | 7687ms | 1/10 |
| 26 | 972ms | 2767ms | 2767ms | 0/10 |
| 27 | 924ms | 9327ms | 9327ms | 1/10 |
| 28 | 832ms | 2734ms | 2734ms | 0/10 |
| 29 | 1706ms | 21670ms | 21670ms | 1/10 |
| 30 | 906ms | 2795ms | 2795ms | 0/10 |
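In every row, P95 TTFT equals Max TTFT. That is expected with only 10 samples per turn if the percentile uses a nearest-rank-above convention: the p95 index lands between the 9th and 10th order statistics and snaps up to the largest sample. A sketch of that assumption (I don't know which percentile method the harness actually uses):

```python
import math

def p95_higher(samples):
    """Nearest-rank-above 95th percentile: sorted index ceil((n-1) * 0.95)."""
    s = sorted(samples)
    return s[math.ceil((len(s) - 1) * 0.95)]

# Ten hypothetical TTFT samples (ms) for one turn; with n = 10 the p95 index
# is ceil(8.55) = 9, i.e. the largest sample -- matching P95 == Max above.
samples = [903, 950, 1000, 1100, 1200, 1300, 1400, 2000, 3000, 15314]
assert p95_higher(samples) == max(samples) == 15314
```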

Earlier Degraded Window

Across the 13 complete runs, the first 3 accepted runs fell in a degraded latency window, while later runs were much healthier:

  • first 3 accepted runs: TTFT median 22.2s, p95 42.1s, max 96.9s
  • later 7 accepted runs from the same stabilized setup: TTFT median 1.7s, p95 9.8s, max 21.7s
  • extra 3-run sanity check after the README update decision: TTFT median 595ms, p95 1095ms, max 2193ms

Config Used

See PR #5 for the exact provider wiring. Benchmark runs used mistral-small-2603 on Mistral's OpenAI-compatible chat completions API, with default sampling/tool settings and no explicit temperature or tool_choice override.
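For reference, a minimal sketch of the request shape implied by that config: the model name is from this PR, but the helper is a generic illustration of an OpenAI-compatible chat completions body, not the benchmark harness's actual wiring.

```python
# Build the request body only; no network call. Sampling defaults are left
# unset, matching the "no explicit temperature or tool_choice override" note.
def build_request(messages, tools=None):
    body = {
        "model": "mistral-small-2603",
        "messages": messages,
    }
    if tools:
        body["tools"] = tools  # tool schemas pass through unchanged
    return body

req = build_request([{"role": "user", "content": "When is breakfast on June 4th?"}])
assert "temperature" not in req and "tool_choice" not in req
```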
