
Conversation

@ehsan2003

I just ran this. I believe I haven't enabled reasoning, but I saw some *Thinking* in the responses.

@CLAassistant

CLAassistant commented Oct 15, 2025

CLA assistant check
All committers have signed the CLA.

@ehsan2003
Author

ehsan2003 commented Oct 15, 2025

This is for reference, in case anyone comes here from Google:
─────────────────────────────────────────────────────── /benchmarks/2025-10-15-17-41-04--glm-4.6 ────────────────────────────────────────────────────────

- dirname: 2025-10-15-17-41-04--glm-4.6
  test_cases: 225
  model: openrouter/z-ai/glm-4.6
  edit_format: diff
  commit_hash: 11516d6-dirty
  pass_rate_1: 11.6
  pass_rate_2: 36.4
  pass_num_1: 26
  pass_num_2: 82
  percent_cases_well_formed: 93.8
  error_outputs: 26
  num_malformed_responses: 17
  num_with_malformed_responses: 14
  user_asks: 88
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  prompt_tokens: 2493948
  completion_tokens: 347138
  test_timeouts: 5
  total_tests: 225
  command: aider --model openrouter/z-ai/glm-4.6
  date: 2025-10-15
  versions: 0.86.2.dev
  seconds_per_case: 49.8
  total_cost: 1.8545

costs: $0.0082/test-case, $1.85 total, $1.85 projected

@MotherSoraka

MotherSoraka commented Oct 16, 2025

36% at pass 2?
That's worse than Qwen3 32B.

@ehsan2003
Author

36% at pass 2? That's worse than Qwen3 32B.

I was surprised as well, but that's what I got.

@bayorm

bayorm commented Oct 19, 2025

Seems like something is wrong. Have you used "enable_thinking": True?

@ehsan2003
Author

Seems like something is wrong. Have you used "enable_thinking": True?

No, I haven't. That's the benchmark for the non-thinking variant.
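
For anyone who wants to pin this down explicitly rather than rely on defaults, here is a minimal sketch of a `.aider.model.settings.yml` entry in the same format used later in this thread. The exact request field that disables thinking for GLM is provider-dependent and is an assumption here (some hosts use a chat-template switch, OpenRouter has its own reasoning control), so verify against your provider's docs:

- name: openrouter/z-ai/glm-4.6
  edit_format: diff
  extra_params:
    # Assumption: some GLM hosts read a chat-template switch like this.
    chat_template_kwargs:
      enable_thinking: false
    # Assumption: OpenRouter's own reasoning control would look roughly like this instead:
    # reasoning:
    #   enabled: false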

In my experience, glm-4.6 + DeepSeek 3.2 Exp (as the weak model & editor model) works surprisingly well.

Sometimes GLM itself makes mistakes when editing files.
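
A sketch of that split in the same model-settings format; the OpenRouter slug for DeepSeek 3.2 Exp is an assumption and should be replaced with whatever slug you actually use:

- name: openrouter/z-ai/glm-4.6
  edit_format: diff
  # Assumption: placeholder slug for DeepSeek 3.2 Exp on OpenRouter; adjust to the real one.
  editor_model_name: openrouter/deepseek/deepseek-v3.2-exp
  editor_edit_format: editor-diff
  weak_model_name: openrouter/deepseek/deepseek-v3.2-exp

In aider, the weak model handles commit messages and chat summaries, while the editor model writes the actual file edits in architect mode, which can mask GLM's occasional diff mistakes.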

@cperion

cperion commented Oct 22, 2025

Benchmarks run through OpenRouter must be taken with a grain of salt, because we do not always know which provider and which model quantization sit behind each request; it could be a mix. That may explain the surprisingly low score.
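
One way to reduce that variance is to pin OpenRouter to a single provider via its provider-routing preferences. Whether aider's extra_params forwards this object untouched into the request body is an assumption here, and the provider and quantization names are placeholders:

- name: openrouter/z-ai/glm-4.6
  extra_params:
    # Assumption: this object reaches OpenRouter unchanged via extra_params.
    provider:
      order: ["z-ai"]          # placeholder: preferred provider(s), in order
      allow_fallbacks: false   # fail rather than silently switching providers
      quantizations: ["fp8"]   # placeholder: only accept this quantization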

@janus-reith

These are my results:

- dirname: 2025-10-29-08-39-00--glm-4.6
  test_cases: 225
  model: openrouter/z-ai/glm-4.6
  edit_format: diff
  commit_hash: 11516d6
  pass_rate_1: 16.0
  pass_rate_2: 44.4
  pass_num_1: 36
  pass_num_2: 100
  percent_cases_well_formed: 93.3
  error_outputs: 22
  num_malformed_responses: 16
  num_with_malformed_responses: 15
  user_asks: 33
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  prompt_tokens: 2262299
  completion_tokens: 366044
  test_timeouts: 9
  total_tests: 225
  command: aider --model openrouter/z-ai/glm-4.6
  date: 2025-10-29
  versions: 0.86.2.dev
  seconds_per_case: 35.3
  total_cost: 1.5455

costs: $0.0069/test-case, $1.55 total, $1.55 projected

pass_rate_2 is 44.4, which is better than the 36.4 you got, but still worse than I would expect.

I'm trying with the openrouter glm-4.6:exacto endpoint now to see if it makes a difference.

@janus-reith

I'm trying with the openrouter glm-4.6:exacto endpoint now to see if it makes a difference.

- dirname: 2025-10-29-08-39-00--glm-4.6-exacto
  test_cases: 225
  model: openrouter/z-ai/glm-4.6:exacto
  edit_format: diff
  commit_hash: 11516d6
  pass_rate_1: 13.8
  pass_rate_2: 47.6
  pass_num_1: 31
  pass_num_2: 107
  percent_cases_well_formed: 91.6
  error_outputs: 23
  num_malformed_responses: 19
  num_with_malformed_responses: 19
  user_asks: 93
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  prompt_tokens: 2618559
  completion_tokens: 397826
  test_timeouts: 9
  total_tests: 225
  command: aider --model openrouter/z-ai/glm-4.6:exacto
  date: 2025-10-29
  versions: 0.86.2.dev
  seconds_per_case: 43.8
  total_cost: 1.7436

costs: $0.0077/test-case, $1.74 total, $1.74 projected

Slightly better, still below expectations. It might need an adjusted reasoning effort or other parameters; I'm not sure which defaults are used.

@janus-reith

Tried with reasoning_effort: high and got slightly worse results:

- dirname: 2025-10-29-08-39-00--glm-4.6-exacto-reasoning-high
  test_cases: 225
  model: openrouter/z-ai/glm-4.6:exacto
  edit_format: diff
  commit_hash: 11516d6
  reasoning_effort: high
  pass_rate_1: 12.0
  pass_rate_2: 41.3
  pass_num_1: 27
  pass_num_2: 93
  percent_cases_well_formed: 90.7
  error_outputs: 27
  num_malformed_responses: 23
  num_with_malformed_responses: 21
  user_asks: 85
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  prompt_tokens: 3086467
  completion_tokens: 412182
  test_timeouts: 7
  total_tests: 225
  command: aider --model openrouter/z-ai/glm-4.6:exacto
  date: 2025-10-29
  versions: 0.86.2.dev
  seconds_per_case: 44.0
  total_cost: 1.9559

costs: $0.0087/test-case, $1.96 total, $1.96 projected

But I'm not sure whether the parameter is forwarded and respected properly. One indicator is that prompt_tokens and cost are a bit higher, although seconds_per_case is only slightly increased.
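
One way to take the CLI flag out of the equation is to set the effort directly in the model settings. Whether this GLM endpoint honours a flat reasoning_effort field or expects OpenRouter's nested reasoning object is an assumption here; running aider with --verbose should show whether the field survives into the outgoing request:

- name: openrouter/z-ai/glm-4.6:exacto
  edit_format: diff
  extra_params:
    # Assumption: the endpoint accepts a flat reasoning_effort value.
    reasoning_effort: high
    # Assumption: OpenRouter's nested form would look like this instead:
    # reasoning:
    #   effort: high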

@Kreijstal

Time to cancel the glm-4.6 subscription.

@Kreijstal

Let's see the Kimi K2 Thinking benchmarks now.

@avtc

avtc commented Dec 18, 2025

Here are results for the whole edit format with 6 tries (the default is 2, i.e. pass_rate_2) for GLM-4.6 on CodingPlan, with these params:

- name: openai/glm-4.6
  extra_params:
    max_tokens: 50000
    top_p: 0.95
    min_p: 0.0
    top_k: 40
    temperature: 1.0
  use_temperature: true
  examples_as_sys_msg: true
- dirname: 2025-12-10-13-53-16--GLM-4.6-T1.0-WHOLE
  test_cases: 225
  model: openai/glm-4.6
  edit_format: whole
  commit_hash: c74f5ef-dirty
  pass_rate_1: 23.1
  pass_rate_2: 61.8
  pass_rate_3: 74.2
  pass_rate_4: 80.0
  pass_rate_5: 85.8
  pass_rate_6: 88.0
  pass_num_1: 52
  pass_num_2: 139
  pass_num_3: 167
  pass_num_4: 180
  pass_num_5: 193
  pass_num_6: 198
  percent_cases_well_formed: 98.2
  error_outputs: 34
  num_malformed_responses: 6
  num_with_malformed_responses: 4
  user_asks: 227
  lazy_comments: 2
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  prompt_tokens: 9905090
  completion_tokens: 3666817
  test_timeouts: 2
  total_tests: 225
  command: aider --model openai/glm-4.6
  date: 2025-12-10
  versions: 0.86.2.dev
  seconds_per_case: 381.9
  total_cost: 14.0101 

@avtc

avtc commented Dec 25, 2025

Here are results for the whole edit format at temperature=0.7 with 6 tries (the default is 2, i.e. pass_rate_2) for GLM-4.7 on CodingPlan, with these params:

- name: openai/glm-4.7
  extra_params:
    max_tokens: 50000
    top_p: 1.0
    min_p: 0.0
    top_k: 40
    temperature: 0.7
  use_temperature: true
  examples_as_sys_msg: true
- dirname: 2025-12-10-13-53-16--GLM-4.7-T0.7-WHOLE
  test_cases: 225
  model: openai/glm-4.7
  edit_format: whole
  commit_hash: c74f5ef-dirty
  pass_rate_1: 28.4
  pass_rate_2: 68.9
  pass_rate_3: 81.3
  pass_rate_4: 86.2
  pass_rate_5: 88.9
  pass_rate_6: 89.3
  pass_num_1: 64
  pass_num_2: 155
  pass_num_3: 183
  pass_num_4: 194
  pass_num_5: 200
  pass_num_6: 201
  percent_cases_well_formed: 99.6
  error_outputs: 14
  num_malformed_responses: 1
  num_with_malformed_responses: 1
  user_asks: 201
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  prompt_tokens: 6364230
  completion_tokens: 4844956
  test_timeouts: 1
  total_tests: 225
  command: aider --model openai/glm-4.7
  date: 2025-12-10
  versions: 0.86.2.dev
  seconds_per_case: 778.9
  total_cost: 14.4774

Note: the date is incorrect; it comes from the copy-pasted folder name.

@Kreijstal

I mean, it's not fair that GLM gets 6 tries while all other models get only 2; maybe we should give all models 6 tries as well for a fair comparison.

@avtc

avtc commented Dec 25, 2025

I mean, it's not fair that GLM gets 6 tries while all other models get only 2; maybe we should give all models 6 tries as well for a fair comparison.

You can take pass_rate_2, and that will be comparable to other models' pass_rate_2.
I give more passes to see whether the model can use the test feedback to fix the logic.
I have observed that some tasks do not provide enough detail, and the first test run gives very little feedback for passing all test cases on the second attempt.
