[Benchmark] Fix problems found in XSTest, M3oralbench, MSSBench, MMSafetyBench, Flames and SIUO_GEN. #1503

Open
Gugugugugutian wants to merge 12 commits into open-compass:main from Gugugugugutian:main
Conversation

@Gugugugugutian
Contributor

This pull request adds support for several new datasets to the evaluation framework and aligns their evaluation logic and default judge models with the SafeWork-R1 standards. The most significant changes are the new dataset classes, updated dataset registration, and modified default judge model selection logic.

New dataset support and evaluation logic:

  • Added new dataset classes: MMSafetyBenchDataset, MSSBenchDataset, SIUODataset, SIUOGenDataset, SIUOMCQDataset, XSTestDataset, FlamesDataset, and M3oralBenchDataset, each with custom prompt construction and evaluation logic to match their respective benchmarks and SafeWork-R1 conventions.

  • More detailed evaluation and scoring for the new datasets, including category-wise and overall metrics, and robust judge logic for both rule-based and model-based scoring.
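The category-wise plus overall reporting described above can be sketched roughly as follows. This is an illustrative aggregation helper, not code from this PR: the function name `aggregate_scores` and the `{"category", "score"}` record schema are assumptions.

```python
# Illustrative sketch of category-wise + overall metric aggregation for a
# safety benchmark. Assumes each judged record carries a category label and
# a binary score; neither the names nor the schema come from this PR.
from collections import defaultdict


def aggregate_scores(records):
    """Return per-category accuracy plus an 'Overall' entry.

    records: list of dicts with keys 'category' and 'score' (0 or 1).
    """
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec["category"]].append(rec["score"])
    # Per-category mean score.
    result = {cat: sum(scores) / len(scores) for cat, scores in buckets.items()}
    # Overall mean across all records, regardless of category.
    all_scores = [rec["score"] for rec in records]
    result["Overall"] = sum(all_scores) / len(all_scores)
    return result
```

Reporting both levels lets a regression in one category (e.g. a single unsafe-prompt split) stay visible even when the overall number barely moves.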

Judge model selection improvements:

  • Updated the default judge model selection logic in run.py to assign the appropriate SafeWork-R1 judge models for the new datasets (gpt-4o-2024-11-20, gpt-4o, and gpt-4o-mini as appropriate).
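The judge-selection change above amounts to a dataset-to-judge mapping with a fallback. The sketch below is illustrative only: which judge each dataset actually gets, the fallback value, and the prefix-matching helper are assumptions, so consult the run.py diff for the authoritative assignments.

```python
# Illustrative sketch of default judge selection, in the spirit of the run.py
# change described above. The specific dataset-to-judge pairs and the
# fallback are placeholders, not the PR's actual assignments.
JUDGE_DEFAULTS = {
    "MMSafetyBench": "gpt-4o",
    "MSSBench": "gpt-4o",
    "SIUO_GEN": "gpt-4o-2024-11-20",
    "XSTest": "gpt-4o-mini",
    "Flames": "gpt-4o",
    "M3oralBench": "gpt-4o-mini",
}


def default_judge(dataset_name, fallback="gpt-4o-mini"):
    # Match by prefix so dataset variants (e.g. split suffixes) resolve to
    # the same judge as their base benchmark.
    for prefix, judge in JUDGE_DEFAULTS.items():
        if dataset_name.startswith(prefix):
            return judge
    return fallback
```

A prefix-keyed table like this keeps the selection logic in one place, so adding a benchmark means adding one mapping entry rather than another `elif` branch.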
