[Benchmark] Fix problems found in XSTest, M3oralbench, MSSBench, MMSafetyBench, Flames and SIUO_GEN. by Gugugugugutian · Pull Request #1503 · open-compass/VLMEvalKit

Gugugugugutian · 2026-04-01T03:37:34Z

This pull request adds support for several new datasets to the evaluation framework and aligns their evaluation logic and default judge models with the SafeWork-R1 standards. The most significant changes are the addition of new dataset classes, updates to dataset registration, and modifications to the default judge model selection logic.

Better dataset support and evaluation logic:

Added new dataset classes: MMSafetyBenchDataset, MSSBenchDataset, SIUODataset, SIUOGenDataset, SIUOMCQDataset, XSTestDataset, FlamesDataset, and M3oralBenchDataset. Each has custom prompt construction and evaluation logic that matches its benchmark and the SafeWork-R1 conventions.
More detailed evaluation and scoring for the new datasets, including per-category and overall metrics, and judge logic that covers both rule-based and model-based scoring.
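
As a rough illustration of the per-category and overall metrics mentioned above, here is a minimal sketch. It is not the code from this PR: the function name `summarize_scores`, the `(category, score)` record shape, and the `"Overall"` key are assumptions made for the example.

```python
from collections import defaultdict


def summarize_scores(records):
    """Average scores per category and across all records.

    `records` is an iterable of (category, score) pairs. This shape is an
    assumption for illustration, not the format used by the dataset classes.
    """
    per_category = defaultdict(list)
    for category, score in records:
        per_category[category].append(score)

    # Mean score within each category.
    summary = {cat: sum(vals) / len(vals) for cat, vals in per_category.items()}

    # Overall mean is taken over individual records, not over category means,
    # so large categories weigh more than small ones.
    all_scores = [s for vals in per_category.values() for s in vals]
    summary["Overall"] = sum(all_scores) / len(all_scores) if all_scores else 0.0
    return summary
```

Averaging over records rather than over category means is one possible choice; a benchmark that reports a macro average would instead take the mean of the per-category values.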

Judge model selection improvements:

Updated the default judge model selection logic in run.py to assign the appropriate SafeWork-R1 judge models for the new datasets (gpt-4o-2024-11-20, gpt-4o, and gpt-4o-mini as appropriate).
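
The selection logic can be pictured as a lookup with a user override, sketched below. This is a hypothetical illustration, not the code in run.py: the function `pick_judge`, the table `JUDGE_BY_DATASET_PREFIX`, the fallback value, and in particular which dataset maps to which judge are all assumptions, since the description above does not spell out the assignment.

```python
# Assumed fallback when no dataset-specific judge is configured.
DEFAULT_JUDGE = "gpt-4o-mini"

# Hypothetical assignment for illustration only. The PR description lists the
# three judge models but not which dataset uses which one.
JUDGE_BY_DATASET_PREFIX = {
    "XSTest": "gpt-4o-2024-11-20",
    "MMSafetyBench": "gpt-4o",
    "SIUO": "gpt-4o-mini",
}


def pick_judge(dataset_name, user_judge=None):
    """Return the judge model for a dataset, honoring an explicit user choice."""
    if user_judge is not None:
        # An explicitly requested judge always wins over the defaults.
        return user_judge
    for prefix, judge in JUDGE_BY_DATASET_PREFIX.items():
        if dataset_name.startswith(prefix):
            return judge
    return DEFAULT_JUDGE
```

Under these assumptions, `pick_judge("SIUO_GEN")` resolves through the `"SIUO"` prefix, and passing `user_judge` bypasses the table entirely.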


gutian and others added 12 commits March 19, 2026 17:30

m3oralbench, mmsafetybench, siuo, xstest, flames added.

b6fe9fe

mssbench added.

3e06503

Merge branch 'open-compass:main' into main

e93f905

Merge remote-tracking branch 'upstream/main'

026b9ba

Update dataset link.

5d4fcf4

Merge branch 'main' into main

6bb0087

fix lint

9d89ffa

fix problems found during previous test and align precision with Safework-R1 test.

fe83028

Merge branch 'main' of https://github.com/gugugugugutian/VLMEvalKit

72d33e0

Merge remote-tracking branch 'upstream/main'

30a6284

fix lint

3698c38

fix lint 2

eb69513

Gugugugugutian wants to merge 12 commits into open-compass:main from Gugugugugutian:main
