
Add Mistral Small 4 benchmark results #6

Open

meaningfool wants to merge 1 commit into kwindla:main from meaningfool:meaningfool/mistral-bench

Conversation

meaningfool commented Apr 9, 2026

Summary

  • add mistral-small-2603 to the aiwf_medium_context text-model results table
  • use the latest 10 full 30-turn Opus-judged runs for the row
  • keep all 13 collected runs available in local run artifacts so the earlier degraded-latency window can still be investigated

Latest 10 Runs Used For README Row

  • Pass Rate: 91.0%
  • Turn Pass: 273/300
  • Tool Use: 283/300
  • Instruction: 276/300
  • KB Ground: 296/300
  • TTFT Med / P95 / Max: 1192ms / 8838ms / 21670ms
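The per-category figures above are simple pass counts over 300 judged turns. As a sanity check, the percentages can be recomputed from the raw tallies; a minimal sketch (the helper name is illustrative, not part of the benchmark harness):

```python
def pass_rate(passed: int, total: int) -> float:
    """Pass rate as a percentage, rounded to one decimal place."""
    return round(100 * passed / total, 1)

# Tallies from the README row above.
assert pass_rate(273, 300) == 91.0  # Turn Pass
assert pass_rate(283, 300) == 94.3  # Tool Use
assert pass_rate(276, 300) == 92.0  # Instruction
assert pass_rate(296, 300) == 98.7  # KB Ground
```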

Volatility Within The Kept 10 Runs

The latest 10 are clearly better than the earlier degraded window, but latency is still volatile:

  • 29/300 benchmark turns took more than 5s TTFT
  • 10/300 benchmark turns took more than 10s TTFT
  • 7/10 kept runs had at least one >5s turn
  • 4/10 kept runs had at least one >10s turn
  • worst kept turn: 21.7s TTFT on turn 28 ("One last detail: when is continental breakfast on June 4th?")
  • by contrast, the newest 3-run sanity check had TTFT 595ms / 1095ms / 2193ms
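The threshold counts above are straightforward to derive from the per-turn TTFT samples in the run artifacts; a sketch with toy data (the values below are made up for illustration, not actual benchmark samples):

```python
from statistics import median

def ttft_summary(ttfts_ms, thresholds=(5000, 10000)):
    """Return the median TTFT and, per threshold, how many turns exceeded it."""
    counts = {t: sum(1 for x in ttfts_ms if x > t) for t in thresholds}
    return median(ttfts_ms), counts

# Toy sample of four turns (ms): two exceed 5s, one of those exceeds 10s.
med, counts = ttft_summary([900, 1200, 6000, 21700])
assert counts[5000] == 2 and counts[10000] == 1
```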

Turn-By-Turn TTFT Table For The Kept 10 Runs

| Turn | Median TTFT | P95 TTFT | Max TTFT | >5s Count |
|-----:|------------:|---------:|---------:|----------:|
| 1 | 903ms | 15314ms | 15314ms | 1/10 |
| 2 | 1339ms | 5108ms | 5108ms | 1/10 |
| 3 | 1304ms | 10263ms | 10263ms | 2/10 |
| 4 | 1720ms | 9823ms | 9823ms | 2/10 |
| 5 | 1853ms | 14493ms | 14493ms | 2/10 |
| 6 | 1344ms | 13987ms | 13987ms | 1/10 |
| 7 | 1361ms | 3131ms | 3131ms | 0/10 |
| 8 | 1074ms | 2862ms | 2862ms | 0/10 |
| 9 | 766ms | 5444ms | 5444ms | 1/10 |
| 10 | 813ms | 2381ms | 2381ms | 0/10 |
| 11 | 825ms | 13923ms | 13923ms | 2/10 |
| 12 | 855ms | 3204ms | 3204ms | 0/10 |
| 13 | 1192ms | 8784ms | 8784ms | 2/10 |
| 14 | 1188ms | 6348ms | 6348ms | 1/10 |
| 15 | 1421ms | 5338ms | 5338ms | 1/10 |
| 16 | 779ms | 6779ms | 6779ms | 1/10 |
| 17 | 839ms | 6835ms | 6835ms | 1/10 |
| 18 | 1683ms | 5686ms | 5686ms | 2/10 |
| 19 | 785ms | 3413ms | 3413ms | 0/10 |
| 20 | 866ms | 11642ms | 11642ms | 1/10 |
| 21 | 631ms | 3304ms | 3304ms | 0/10 |
| 22 | 1011ms | 7122ms | 7122ms | 1/10 |
| 23 | 921ms | 9762ms | 9762ms | 1/10 |
| 24 | 919ms | 12989ms | 12989ms | 3/10 |
| 25 | 1557ms | 7687ms | 7687ms | 1/10 |
| 26 | 972ms | 2767ms | 2767ms | 0/10 |
| 27 | 924ms | 9327ms | 9327ms | 1/10 |
| 28 | 832ms | 2734ms | 2734ms | 0/10 |
| 29 | 1706ms | 21670ms | 21670ms | 1/10 |
| 30 | 906ms | 2795ms | 2795ms | 0/10 |
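In every row, P95 TTFT equals Max TTFT. That is expected with only 10 samples per turn if the percentile uses a nearest-rank-above convention: the p95 index lands between the 9th and 10th order statistics and snaps up to the largest sample. A sketch of that assumption (I don't know which percentile method the harness actually uses):

```python
import math

def p95_higher(samples):
    """Nearest-rank-above 95th percentile: sorted index ceil((n-1) * 0.95)."""
    s = sorted(samples)
    return s[math.ceil((len(s) - 1) * 0.95)]

# Ten hypothetical TTFT samples (ms) for one turn; with n = 10 the p95 index
# is ceil(8.55) = 9, i.e. the largest sample -- matching P95 == Max above.
samples = [903, 950, 1000, 1100, 1200, 1300, 1400, 2000, 3000, 15314]
assert p95_higher(samples) == max(samples) == 15314
```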

Earlier Degraded Window

Across the 13 complete runs, the first 3 accepted runs fell in a degraded latency window, while later runs were much healthier:

  • first 3 accepted runs: TTFT median 22.2s, p95 42.1s, max 96.9s
  • later 7 accepted runs from the same stabilized setup: TTFT median 1.7s, p95 9.8s, max 21.7s
  • extra 3-run sanity check after the README update decision: TTFT median 595ms, p95 1095ms, max 2193ms

Config Used

See PR #5 for the exact provider wiring. Benchmark runs used mistral-small-2603 on Mistral's OpenAI-compatible chat completions API, with default sampling/tool settings and no explicit temperature or tool_choice override.
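For reference, a minimal sketch of the request shape implied by that config: the model name is from this PR, but the helper is a generic illustration of an OpenAI-compatible chat completions body, not the benchmark harness's actual wiring.

```python
# Build the request body only; no network call. Sampling defaults are left
# unset, matching the "no explicit temperature or tool_choice override" note.
def build_request(messages, tools=None):
    body = {
        "model": "mistral-small-2603",
        "messages": messages,
    }
    if tools:
        body["tools"] = tools  # tool schemas pass through unchanged
    return body

req = build_request([{"role": "user", "content": "When is breakfast on June 4th?"}])
assert "temperature" not in req and "tool_choice" not in req
```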
