Problem
Relying on a single LLM judge can introduce model-specific biases and scoring variability, which undermines the reliability and robustness of benchmark results.
Reference
This paper shows that using a panel of LLMs as judges outperforms a single LLM judge: https://arxiv.org/abs/2404.18796
Solution
Add support for multiple LLM judges (e.g. DeepSeek, Qwen, Gemma, Granite) and aggregate their scores: a majority vote when the score is binary, an average when the score is continuous. Also report an inter-judge agreement metric such as Cohen's kappa (two judges) or Fleiss' kappa (three or more) to give a measure of confidence in the aggregated score.
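A minimal sketch of the proposed aggregation and agreement reporting, assuming each judge returns either a binary verdict (0/1) or a continuous score. The function names (`aggregate_binary`, `aggregate_continuous`, `fleiss_kappa`) are hypothetical, not an existing API, and Fleiss' kappa is computed from the standard formula rather than pulled from a library:

```python
from collections import Counter
from statistics import mean

def aggregate_binary(votes: list[int]) -> int:
    """Majority vote over binary judge verdicts (use an odd-sized panel to avoid ties)."""
    return Counter(votes).most_common(1)[0][0]

def aggregate_continuous(scores: list[float]) -> float:
    """Average over continuous judge scores."""
    return mean(scores)

def fleiss_kappa(counts: list[list[int]]) -> float:
    """Fleiss' kappa, where counts[i][j] is the number of judges that
    assigned item i to category j (every row sums to the panel size)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    total = n_items * n_raters
    # Proportion of all assignments that fall into each category.
    p_j = [sum(row[j] for row in counts) / total for j in range(len(counts[0]))]
    # Per-item agreement: fraction of judge pairs that agree on item i.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = mean(p_i)              # observed agreement
    p_e = sum(p * p for p in p_j)  # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)

# Three hypothetical judges (e.g. DeepSeek, Qwen, Granite) on four items.
print(aggregate_binary([1, 1, 0]))            # -> 1
print(aggregate_continuous([0.8, 0.6, 0.7]))  # -> ~0.7

# counts[i] = [judges voting 0 on item i, judges voting 1 on item i]
print(fleiss_kappa([[0, 3], [1, 2], [3, 0], [2, 1]]))  # -> ~0.33
```

Cohen's kappa covers the two-judge case; Fleiss' kappa generalizes the same chance-corrected agreement to panels of three or more judges, which matches the multi-judge setup above.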