Added GLM-4.6 polyglot benchmark #4590
Conversation
This is for reference in case anyone comes from Google:

```yaml
- dirname: 2025-10-15-17-41-04--glm-4.6
  test_cases: 225
  model: openrouter/z-ai/glm-4.6
  edit_format: diff
  commit_hash: 11516d6-dirty
  pass_rate_1: 11.6
  pass_rate_2: 36.4
  pass_num_1: 26
  pass_num_2: 82
  percent_cases_well_formed: 93.8
  error_outputs: 26
  num_malformed_responses: 17
  num_with_malformed_responses: 14
  user_asks: 88
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  prompt_tokens: 2493948
  completion_tokens: 347138
  test_timeouts: 5
  total_tests: 225
  command: aider --model openrouter/z-ai/glm-4.6
  date: 2025-10-15
  versions: 0.86.2.dev
  seconds_per_case: 49.8
  total_cost: 1.8545
```

costs: $0.0082/test-case, $1.85 total, $1.85 projected
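The per-case figure on the costs line is just the total cost divided by the number of test cases; a quick sketch confirming the reported numbers:

```python
# Reproduce the reported per-test-case cost from the YAML values above.
total_cost = 1.8545  # from total_cost
test_cases = 225     # from test_cases

print(f"${total_cost / test_cases:.4f}/test-case")  # -> $0.0082/test-case
```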
36% pass@2?

I was surprised as well, but that's what I got.
Something seems wrong. Have you used `"enable_thinking": True`?

No, I haven't; that's the benchmark for the non-thinking variant. In my experience, glm-4.6 + deepseek 3.2 exp (as the weak model & editor model) works surprisingly well. Sometimes glm itself makes mistakes when editing files.
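For anyone who wants to try the thinking variant: extra provider parameters like `enable_thinking` can be passed through aider's per-model settings file. A minimal sketch, mirroring the `extra_params` format used in the configs later in this thread; whether OpenRouter actually forwards the flag to the underlying provider is an open question:

```yaml
# .aider.model.settings.yml -- a sketch, not a verified configuration
- name: openrouter/z-ai/glm-4.6
  extra_params:
    enable_thinking: true
```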
Benchmarks on OpenRouter must be taken with a grain of salt because we do not always know the provider and the model quantization behind each request. It could be mixed. That may explain the surprisingly low score.
These are my results:

I'm trying with the OpenRouter glm-4.6:exacto endpoint now to see if it makes a difference.

Slightly better, but still below expectations. I might need to adjust the reasoning effort or other parameters; I'm not sure which defaults are used.
Tried it, but I'm not sure if the parameter is forwarded and respected properly. An indicator could be that
Time to cancel the glm-4.6 subscription.

Let's see the kimi k2 thinking benchmarks now.
Here are the results for the following model settings:

```yaml
- name: openai/glm-4.6
  extra_params:
    max_tokens: 50000
    top_p: 0.95
    min_p: 0.0
    top_k: 40
    temperature: 1.0
  use_temperature: true
  examples_as_sys_msg: true
```

```yaml
- dirname: 2025-12-10-13-53-16--GLM-4.6-T1.0-WHOLE
  test_cases: 225
  model: openai/glm-4.6
  edit_format: whole
  commit_hash: c74f5ef-dirty
  pass_rate_1: 23.1
  pass_rate_2: 61.8
  pass_rate_3: 74.2
  pass_rate_4: 80.0
  pass_rate_5: 85.8
  pass_rate_6: 88.0
  pass_num_1: 52
  pass_num_2: 139
  pass_num_3: 167
  pass_num_4: 180
  pass_num_5: 193
  pass_num_6: 198
  percent_cases_well_formed: 98.2
  error_outputs: 34
  num_malformed_responses: 6
  num_with_malformed_responses: 4
  user_asks: 227
  lazy_comments: 2
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  prompt_tokens: 9905090
  completion_tokens: 3666817
  test_timeouts: 2
  total_tests: 225
  command: aider --model openai/glm-4.6
  date: 2025-12-10
  versions: 0.86.2.dev
  seconds_per_case: 381.9
  total_cost: 14.0101
```
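Since later comments compare runs at different numbers of attempts, note that each `pass_rate_k` is just `pass_num_k / total_tests`. A quick check against the numbers above:

```python
# Verify the reported pass rates from the raw counts in the GLM-4.6 run above.
total_tests = 225
pass_nums = {1: 52, 2: 139, 3: 167, 4: 180, 5: 193, 6: 198}

for k, n in pass_nums.items():
    print(f"pass_rate_{k}: {100 * n / total_tests:.1f}")
# -> 23.1, 61.8, 74.2, 80.0, 85.8, 88.0, matching the YAML
```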
Here are the results for the following model settings:

```yaml
- name: openai/glm-4.7
  extra_params:
    max_tokens: 50000
    top_p: 1.0
    min_p: 0.0
    top_k: 40
    temperature: 0.7
  use_temperature: true
  examples_as_sys_msg: true
```

```yaml
- dirname: 2025-12-10-13-53-16--GLM-4.7-T0.7-WHOLE
  test_cases: 225
  model: openai/glm-4.7
  edit_format: whole
  commit_hash: c74f5ef-dirty
  pass_rate_1: 28.4
  pass_rate_2: 68.9
  pass_rate_3: 81.3
  pass_rate_4: 86.2
  pass_rate_5: 88.9
  pass_rate_6: 89.3
  pass_num_1: 64
  pass_num_2: 155
  pass_num_3: 183
  pass_num_4: 194
  pass_num_5: 200
  pass_num_6: 201
  percent_cases_well_formed: 99.6
  error_outputs: 14
  num_malformed_responses: 1
  num_with_malformed_responses: 1
  user_asks: 201
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  prompt_tokens: 6364230
  completion_tokens: 4844956
  test_timeouts: 1
  total_tests: 225
  command: aider --model openai/glm-4.7
  date: 2025-12-10
  versions: 0.86.2.dev
  seconds_per_case: 778.9
  total_cost: 14.4774
```

Note: the date in the entry is incorrect; it comes from a copy-pasted folder name.
I mean, it's not fair that glm gets 6 tries while all other models get only 2; maybe we should give all models 6 tries as well, for a fair comparison.
You can take |
I just ran this. I believe I haven't enabled reasoning, but I saw some *Thinking* in the responses.