
Conversation

@ehsan2003

I just ran this. I believe I haven't enabled reasoning, but I saw some *Thinking* in the responses.

@CLAassistant

CLAassistant commented Oct 15, 2025

CLA assistant check
All committers have signed the CLA.

@ehsan2003
Author

ehsan2003 commented Oct 15, 2025

This is for reference, in case anyone comes here from Google:
─────────────────────────────────────────────────────── /benchmarks/2025-10-15-17-41-04--glm-4.6 ────────────────────────────────────────────────────────

- dirname: 2025-10-15-17-41-04--glm-4.6
  test_cases: 225
  model: openrouter/z-ai/glm-4.6
  edit_format: diff
  commit_hash: 11516d6-dirty
  pass_rate_1: 11.6
  pass_rate_2: 36.4
  pass_num_1: 26
  pass_num_2: 82
  percent_cases_well_formed: 93.8
  error_outputs: 26
  num_malformed_responses: 17
  num_with_malformed_responses: 14
  user_asks: 88
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  prompt_tokens: 2493948
  completion_tokens: 347138
  test_timeouts: 5
  total_tests: 225
  command: aider --model openrouter/z-ai/glm-4.6
  date: 2025-10-15
  versions: 0.86.2.dev
  seconds_per_case: 49.8
  total_cost: 1.8545

costs: $0.0082/test-case, $1.85 total, $1.85 projected

@MotherSoraka

MotherSoraka commented Oct 16, 2025

36% at pass 2?
That's worse than Qwen3 32B.

@ehsan2003
Author

36% at pass 2? That's worse than Qwen3 32B.

I was surprised as well, but that's what I got.

@bayorm

bayorm commented Oct 19, 2025

Seems like something is wrong. Have you used "enable_thinking": True?

@ehsan2003
Author

Seems like something is wrong. Have you used "enable_thinking": True?

No, I haven't. That's the benchmark for the non-thinking variant.
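
For anyone who wants to pin this down explicitly rather than rely on defaults, here is a minimal sketch of a `.aider.model.settings.yml` entry in the same format used later in this thread. The exact request field that disables thinking for GLM is provider-dependent and is an assumption here (some hosts use a chat-template switch, OpenRouter has its own reasoning control), so verify against your provider's docs:

- name: openrouter/z-ai/glm-4.6
  edit_format: diff
  extra_params:
    # Assumption: some GLM hosts read a chat-template switch like this.
    chat_template_kwargs:
      enable_thinking: false
    # Assumption: OpenRouter's own reasoning control would look roughly like this instead:
    # reasoning:
    #   enabled: false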

In my experience, glm-4.6 + DeepSeek 3.2 Exp (as the weak model & editor model) works surprisingly well.

Sometimes GLM itself makes mistakes when editing files.
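
A sketch of that split in the same model-settings format; the OpenRouter slug for DeepSeek 3.2 Exp is an assumption and should be replaced with whatever slug you actually use:

- name: openrouter/z-ai/glm-4.6
  edit_format: diff
  # Assumption: placeholder slug for DeepSeek 3.2 Exp on OpenRouter; adjust to the real one.
  editor_model_name: openrouter/deepseek/deepseek-v3.2-exp
  editor_edit_format: editor-diff
  weak_model_name: openrouter/deepseek/deepseek-v3.2-exp

In aider, the weak model handles commit messages and chat summaries, while the editor model writes the actual file edits in architect mode, which can mask GLM's occasional diff mistakes.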

@cperion

cperion commented Oct 22, 2025

Benchmarks run through OpenRouter must be taken with a grain of salt, because we do not always know which provider and which model quantization sit behind each request; it could be a mix. That may explain the surprisingly low score.
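
One way to reduce that variance is to pin OpenRouter to a single provider via its provider-routing preferences. Whether aider's extra_params forwards this object untouched into the request body is an assumption here, and the provider and quantization names are placeholders:

- name: openrouter/z-ai/glm-4.6
  extra_params:
    # Assumption: this object reaches OpenRouter unchanged via extra_params.
    provider:
      order: ["z-ai"]          # placeholder: preferred provider(s), in order
      allow_fallbacks: false   # fail rather than silently switching providers
      quantizations: ["fp8"]   # placeholder: only accept this quantization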

@janus-reith

These are my results:

- dirname: 2025-10-29-08-39-00--glm-4.6
  test_cases: 225
  model: openrouter/z-ai/glm-4.6
  edit_format: diff
  commit_hash: 11516d6
  pass_rate_1: 16.0
  pass_rate_2: 44.4
  pass_num_1: 36
  pass_num_2: 100
  percent_cases_well_formed: 93.3
  error_outputs: 22
  num_malformed_responses: 16
  num_with_malformed_responses: 15
  user_asks: 33
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  prompt_tokens: 2262299
  completion_tokens: 366044
  test_timeouts: 9
  total_tests: 225
  command: aider --model openrouter/z-ai/glm-4.6
  date: 2025-10-29
  versions: 0.86.2.dev
  seconds_per_case: 35.3
  total_cost: 1.5455

costs: $0.0069/test-case, $1.55 total, $1.55 projected

pass_rate_2 is 44.4, which is better than the 36.4 you got, but still worse than I would expect.

I'm trying with the openrouter glm-4.6:exacto endpoint now to see if it makes a difference.

@janus-reith

I'm trying with the openrouter glm-4.6:exacto endpoint now to see if it makes a difference.

- dirname: 2025-10-29-08-39-00--glm-4.6-exacto
  test_cases: 225
  model: openrouter/z-ai/glm-4.6:exacto
  edit_format: diff
  commit_hash: 11516d6
  pass_rate_1: 13.8
  pass_rate_2: 47.6
  pass_num_1: 31
  pass_num_2: 107
  percent_cases_well_formed: 91.6
  error_outputs: 23
  num_malformed_responses: 19
  num_with_malformed_responses: 19
  user_asks: 93
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  prompt_tokens: 2618559
  completion_tokens: 397826
  test_timeouts: 9
  total_tests: 225
  command: aider --model openrouter/z-ai/glm-4.6:exacto
  date: 2025-10-29
  versions: 0.86.2.dev
  seconds_per_case: 43.8
  total_cost: 1.7436

costs: $0.0077/test-case, $1.74 total, $1.74 projected

Slightly better, still below expectations. It might need an adjusted reasoning effort or other parameters; I'm not sure which defaults are used.

@janus-reith

Tried with reasoning_effort: high and got slightly worse results:

- dirname: 2025-10-29-08-39-00--glm-4.6-exacto-reasoning-high
  test_cases: 225
  model: openrouter/z-ai/glm-4.6:exacto
  edit_format: diff
  commit_hash: 11516d6
  reasoning_effort: high
  pass_rate_1: 12.0
  pass_rate_2: 41.3
  pass_num_1: 27
  pass_num_2: 93
  percent_cases_well_formed: 90.7
  error_outputs: 27
  num_malformed_responses: 23
  num_with_malformed_responses: 21
  user_asks: 85
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  prompt_tokens: 3086467
  completion_tokens: 412182
  test_timeouts: 7
  total_tests: 225
  command: aider --model openrouter/z-ai/glm-4.6:exacto
  date: 2025-10-29
  versions: 0.86.2.dev
  seconds_per_case: 44.0
  total_cost: 1.9559

costs: $0.0087/test-case, $1.96 total, $1.96 projected

But I'm not sure whether the parameter is forwarded and respected properly. One indicator is that prompt_tokens and cost are a bit higher, although seconds_per_case is only slightly increased.
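
One way to take the CLI flag out of the equation is to set the effort directly in the model settings. Whether this GLM endpoint honours a flat reasoning_effort field or expects OpenRouter's nested reasoning object is an assumption here; running aider with --verbose should show whether the field survives into the outgoing request:

- name: openrouter/z-ai/glm-4.6:exacto
  edit_format: diff
  extra_params:
    # Assumption: the endpoint accepts a flat reasoning_effort value.
    reasoning_effort: high
    # Assumption: OpenRouter's nested form would look like this instead:
    # reasoning:
    #   effort: high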

@Kreijstal

Time to cancel the glm-4.6 subscription.

@Kreijstal

Let's see the Kimi K2 Thinking benchmarks now.

@avtc

avtc commented Dec 18, 2025

Here are results for the whole edit format with 6 tries (the default is 2, i.e. pass_rate_2) for GLM-4.6 on CodingPlan, with these params:

- name: openai/glm-4.6
  extra_params:
    max_tokens: 50000
    top_p: 0.95
    min_p: 0.0
    top_k: 40
    temperature: 1.0
  use_temperature: true
  examples_as_sys_msg: true
- dirname: 2025-12-10-13-53-16--GLM-4.6-T1.0-WHOLE
  test_cases: 225
  model: openai/glm-4.6
  edit_format: whole
  commit_hash: c74f5ef-dirty
  pass_rate_1: 23.1
  pass_rate_2: 61.8
  pass_rate_3: 74.2
  pass_rate_4: 80.0
  pass_rate_5: 85.8
  pass_rate_6: 88.0
  pass_num_1: 52
  pass_num_2: 139
  pass_num_3: 167
  pass_num_4: 180
  pass_num_5: 193
  pass_num_6: 198
  percent_cases_well_formed: 98.2
  error_outputs: 34
  num_malformed_responses: 6
  num_with_malformed_responses: 4
  user_asks: 227
  lazy_comments: 2
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  prompt_tokens: 9905090
  completion_tokens: 3666817
  test_timeouts: 2
  total_tests: 225
  command: aider --model openai/glm-4.6
  date: 2025-12-10
  versions: 0.86.2.dev
  seconds_per_case: 381.9
  total_cost: 14.0101 

@avtc

avtc commented Dec 25, 2025

Here are results for the whole edit format at temperature=0.7 with 6 tries (the default is 2, i.e. pass_rate_2) for GLM-4.7 on CodingPlan, with these params:

- name: openai/glm-4.7
  extra_params:
    max_tokens: 50000
    top_p: 1.0
    min_p: 0.0
    top_k: 40
    temperature: 0.7
  use_temperature: true
  examples_as_sys_msg: true
- dirname: 2025-12-10-13-53-16--GLM-4.7-T0.7-WHOLE
  test_cases: 225
  model: openai/glm-4.7
  edit_format: whole
  commit_hash: c74f5ef-dirty
  pass_rate_1: 28.4
  pass_rate_2: 68.9
  pass_rate_3: 81.3
  pass_rate_4: 86.2
  pass_rate_5: 88.9
  pass_rate_6: 89.3
  pass_num_1: 64
  pass_num_2: 155
  pass_num_3: 183
  pass_num_4: 194
  pass_num_5: 200
  pass_num_6: 201
  percent_cases_well_formed: 99.6
  error_outputs: 14
  num_malformed_responses: 1
  num_with_malformed_responses: 1
  user_asks: 201
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  prompt_tokens: 6364230
  completion_tokens: 4844956
  test_timeouts: 1
  total_tests: 225
  command: aider --model openai/glm-4.7
  date: 2025-12-10
  versions: 0.86.2.dev
  seconds_per_case: 778.9
  total_cost: 14.4774

Note: the date is incorrect; it comes from the copy-pasted folder name.

@Kreijstal

I mean, it's not fair that GLM gets 6 tries while all other models get only 2; maybe we should give all models 6 tries as well for a fair comparison.

@avtc

avtc commented Dec 25, 2025

I mean, it's not fair that GLM gets 6 tries while all other models get only 2; maybe we should give all models 6 tries as well for a fair comparison.

You can take pass_rate_2, and that will be comparable to other models' pass_rate_2.
I give more passes to see whether the model can use the test feedback to fix the logic.
I have observed that some tasks do not provide enough detail, and the first test run gives very little feedback for passing all test cases on the second attempt.
