Data Quality Metrics

This document provides comprehensive information about all quality metrics used in Dingo.

Note: Metrics cite an academic source where one exists, to support objectivity and scientific rigor; the rest are listed as Internal Implementation.

RAG Evaluation Metrics

Type Metric Description Paper Source Evaluation Results Examples
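LLMRAGAnswerRelevancy LLMRAGAnswerRelevancy Evaluates whether the answer directly addresses the question, detecting irrelevant and redundant information RAGAS: Automated Evaluation of Retrieval Augmented Generation N/A 📝 View Example
LLMRAGContextPrecision LLMRAGContextPrecision Evaluates the precision of the retrieved context, including relevance and ranking quality RAGAS: Automated Evaluation of Retrieval Augmented Generation N/A 📝 View Example
LLMRAGContextRecall LLMRAGContextRecall Evaluates the completeness of the retrieved context, judging whether the context supports every statement in the answer RAGAS: Automated Evaluation of Retrieval Augmented Generation N/A 📝 View Example
LLMRAGContextRelevancy LLMRAGContextRelevancy Evaluates how relevant the retrieved context is to the question, detecting noisy information RAGAS: Automated Evaluation of Retrieval Augmented Generation N/A 📝 View Example
LLMRAGFaithfulness LLMRAGFaithfulness Evaluates whether the generated answer is faithful to the given context, detecting hallucinations and fabricated information RAGAS: Automated Evaluation of Retrieval Augmented Generation N/A 📝 View Example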
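
As an illustration of how "relevance and ranking quality" can be combined into one score, here is a minimal sketch of the average-precision style formulation documented for the RAGAS library. It is not confirmed to be what Dingo computes. The function name `context_precision` is hypothetical, and the per-chunk relevance verdicts, which an LLM judge would normally produce, are hard-coded here.

```python
from typing import Sequence


def context_precision(relevance: Sequence[bool]) -> float:
    """Mean of precision@k taken at the ranks where a relevant chunk appears.

    `relevance[i]` is the (assumed, judge-provided) verdict for the chunk
    retrieved at rank i + 1.
    """
    hits = 0
    weighted_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted_sum += hits / rank  # precision@k at this relevant rank
    return weighted_sum / hits if hits else 0.0


# The same two relevant chunks score higher when they are ranked first.
good_order = context_precision([True, True, False, False])
bad_order = context_precision([False, False, True, True])
```

With relevant chunks at ranks 1 and 2 the score is 1.0; pushing them to ranks 3 and 4 lowers it, which is the ranking sensitivity the description refers to.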

Pretrain Text Quality Assessment Metrics

Type Metric Description Paper Source Evaluation Results Examples
LLMCodeCompare LLMCodeCompare Compares the effectiveness of two tools in extracting code blocks from HTML to Markdown format by evaluating recognit... Internal Implementation N/A N/A
LLMDatamanAssessment LLMDatamanAssessment Evaluates pre-training data quality using the DataMan methodology (14 standards, 15 domains). Assigns a score (0/1), ... DataMan: Data Manager for Pre-training Large Language Models (Peng et al., 2025) N/A N/A
LLMMathCompare LLMMathCompare Compares the effectiveness of two tools in extracting mathematical formulas from HTML to Markdown format by evaluatin... Internal Implementation N/A N/A
LLMSecurityPolitics LLMSecurityPolitics Evaluates whether the text contains politics-related content Internal Implementation N/A N/A
LLMTableCompare LLMTableCompare Compares the effectiveness of two tools in extracting tables from HTML to Markdown format by evaluating recognition r... Internal Implementation N/A N/A
LLMTextQualityV4 LLMTextQualityV4 Enhanced text quality evaluation covering completeness (formulas, tables, code), effectiveness (garbled text, spacing... WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages (Yu et al., 2025) 📊 See Results N/A
LLMTextQualityV5 LLMTextQualityV5 Impact-driven text quality evaluation for LLM pretraining, focusing on structural completeness, readability, diversit... WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages (Yu et al., 2025) 📊 See Results 📝 View Example

SFT Data Assessment Metrics

Type Metric Description Paper Source Evaluation Results Examples
LLMFactCheckPublic LLMFactCheckPublic Two-stage factuality evaluation pipeline from GPT-5 GPT-5 System Card (OpenAI) N/A N/A
LLMHallucination LLMHallucination Evaluates whether the response contains factual contradictions or hallucinations against provided context information TruthfulQA: Measuring How Models Mimic Human Falsehoods (Lin et al., 2021) N/A N/A
LLMInstructionClarity LLMInstructionClarity Evaluates instruction clarity across four dimensions: self-descriptiveness, consistency, specificity, and completeness Internal Implementation 📊 See Results (returns a clarity score (0-10) and detailed analysis) 📝 View Example
LLMTaskDifficulty LLMTaskDifficulty Evaluates task difficulty across cognitive complexity, step complexity, domain knowledge, and constraint density Internal Implementation 📊 See Results (returns a difficulty level (1-10) with detailed breakdown) 📝 View Example
LLMText3HHarmless LLMText3HHarmless Checks if responses avoid harmful content, discriminatory language, and dangerous assistance Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (Bai et al., 2022) 📊 See Results N/A
LLMText3HHelpful LLMText3HHelpful Assesses if responses address questions directly and follow instructions appropriately Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (Bai et al., 2022) 📊 See Results N/A
LLMText3HHonest LLMText3HHonest Evaluates if responses provide accurate information without fabrication or deception Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (Bai et al., 2022) 📊 See Results N/A
QUALITY_BAD_HALLUCINATION RuleHallucinationHHEM Uses Vectara's HHEM-2.1-Open model for local hallucination detection by evaluating consistency between response and c... HHEM-2.1-Open (Forrest Bao, Miaoran Li, Rogger Luo, Ofer Mendelevitch) N/A N/A

Classification Metrics

Type Metric Description Paper Source Evaluation Results Examples
LLMClassifyTopic LLMClassifyTopic Classifies text into categories like language processing, writing, code, mathematics, role-play, or knowledge Q&A. Ba... BERTopic & INSTAG (Grootendorst, 2022; Wei et al., 2023) 📊 See Results N/A

Multimodality Assessment Metrics

Type Metric Description Paper Source Evaluation Results Examples
LLMClassifyQR LLMClassifyQR Identifies images as CAPTCHA, QR code, or normal images Internal Implementation N/A N/A
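VLMOCRUnderstanding VLMOCRUnderstanding Evaluates a multimodal model's ability to recognize and understand text in images, using DeepSeek-OCR output as ground truth DeepSeek-OCR: Contexts Optical Compression 📊 See Results (compares VLM output with the OCR ground truth to identify missing text, errors, hallucinations, and similar issues) N/A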

Rule-Based TEXT Quality Metrics

Type Metric Description Paper Source Evaluation Results Examples
QUALITY_BAD_COMPLETENESS RuleLineEndWithEllipsis, RuleLineEndWithTerminal, RuleSentenceNumber, RuleWordNumber Checks whether the ratio of lines ending with ellipsis is below threshold; Checks whether the ratio of lines ending w... RedPajama: an Open Dataset for Training Large Language Models (Together Computer, 2023) 📊 See Results N/A
QUALITY_BAD_EFFECTIVENESS RuleDoi, RuleIsbn, RuleAbnormalChar, RuleAbnormalHtml, RuleAlphaWords, RuleAudioDataFormat, RuleCharNumber, RuleColonEnd, RuleContentNull, RuleContentShort, RuleContentShortMultiLan, RuleEnterAndSpace, RuleEnterMore, RuleEnterRatioMore, RuleHtmlEntity, RuleHtmlTag, RuleInvisibleChar, RuleImageDataFormat, RuleLatexSpecialChar, RuleLineJavascriptCount, RuleLoremIpsum, RuleMeanWordLength, RuleNlpDataFormat, RuleSftDataFormat, RuleSpaceMore, RuleSpecialCharacter, RuleStopWord, RuleSymbolWordRatio, RuleVedioDataFormat, RuleOnlyUrl Check whether the string is in the correct format of the doi; Check whether the string is in the correct format of th... Internal Implementation N/A N/A
QUALITY_BAD_FLUENCY RuleAbnormalNumber, RuleCharSplit, RuleNoPunc, RuleWordSplit, RuleWordStuck Checks PDF content for abnormal book page or index numbers that disrupt text flow; Checks PDF content for abnormal ch... RedPajama: an Open Dataset for Training Large Language Models (Together Computer, 2023) 📊 See Results N/A
QUALITY_BAD_RELEVANCE RuleHeadWordAr, RuleHeadWordCs, RuleHeadWordHu, RuleHeadWordKo, RuleHeadWordRu, RuleHeadWordSr, RuleHeadWordTh, RuleHeadWordVi, RulePatternSearch, RuleWatermark Checks whether Arabic content contains irrelevant tail source information; Checks whether Czech content contains irre... RedPajama: an Open Dataset for Training Large Language Models (Together Computer, 2023) 📊 See Results N/A
QUALITY_BAD_SECURITY RuleIDCard, RuleUnsafeWords, RulePIIDetection Checks whether content contains ID card information; Checks whether content contains unsafe words; Detects Personal I... RedPajama: an Open Dataset for Training Large Language Models (Together Computer, 2023) 📊 See Results N/A
QUALITY_BAD_SIMILARITY RuleDocRepeat, RuleDocFormulaRepeat Evaluates text for consecutive repeated content and multiple occurrences of special characters; Evaluates text for co... RedPajama: an Open Dataset for Training Large Language Models (Together Computer, 2023) 📊 See Results N/A
QUALITY_BAD_UNDERSTANDABILITY RuleCapitalWords, RuleCurlyBracket, RuleLineStartWithBulletpoint, RuleUniqueWords Checks whether the ratio of capital words is above threshold, indicating poor readability; Checks whether the ratio o... RedPajama: an Open Dataset for Training Large Language Models (Together Computer, 2023) 📊 See Results N/A
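
Many of the rules above share one shape: compute a ratio over lines or words and compare it with a threshold. The sketch below illustrates that shape for the ellipsis check described under QUALITY_BAD_COMPLETENESS. The function names and the 0.3 threshold are assumptions made for illustration, not Dingo's actual implementation or defaults.

```python
def ellipsis_line_ratio(text: str) -> float:
    """Fraction of non-empty lines that end with an ellipsis ('...' or '…')."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    ending = sum(line.endswith(("...", "…")) for line in lines)
    return ending / len(lines)


def passes_ellipsis_rule(text: str, threshold: float = 0.3) -> bool:
    """True when the ratio stays below the (hypothetical) threshold."""
    return ellipsis_line_ratio(text) < threshold


sample = "First line...\nSecond line.\nThird line…\nFourth line."
ratio = ellipsis_line_ratio(sample)
ok = passes_ellipsis_rule(sample)
```

In this sample half of the lines end with an ellipsis, so the text fails the check under the assumed threshold.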

Rule-Based IMG Quality Metrics

Type Metric Description Paper Source Evaluation Results Examples
QUALITY_BAD_IMG_ARTIMUSE RuleImageArtimuse Evaluates image quality in the field of aesthetics using artimuse Internal Implementation N/A N/A
QUALITY_BAD_IMG_EFFECTIVENESS RuleImageValid, RuleImageSizeValid, RuleImageQuality Checks whether image is not all white or black, ensuring visual content validity; Checks whether image ratio of width... Internal Implementation N/A N/A
QUALITY_BAD_IMG_LABEL_OVERLAP RuleImageLabelOverlap Detects overlapping bounding boxes in image annotations, marks full/partial overlap and generates visualization images Internal Implementation N/A N/A
QUALITY_BAD_IMG_LABEL_VISUALIZATION RuleImageLabelVisualization Generates visualization images with bounding boxes and category labels, helping manual check of annotation accuracy Internal Implementation N/A N/A
QUALITY_BAD_IMG_RELEVANCE RuleImageTextSimilarity Evaluates semantic similarity between image and text content using CLIP model Learning Transferable Visual Representations with Natural Language Supervision (Radford et al., 2021) N/A N/A
QUALITY_BAD_IMG_SIMILARITY RuleImageRepeat Detects duplicate images using PHash and CNN methods to ensure data diversity ImageNet Classification with Deep Convolutional Neural Networks (Krizhevsky et al., 2012) N/A N/A
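
The "not all white or black" check listed under QUALITY_BAD_IMG_EFFECTIVENESS can be pictured with a small sketch. It assumes an 8-bit grayscale image given as rows of integer pixel values; the function name `is_blank_image` and the decision to treat an empty image as blank are assumptions, and Dingo's own rule may work on a different image representation.

```python
from typing import Sequence


def is_blank_image(pixels: Sequence[Sequence[int]]) -> bool:
    """True if every pixel is pure black (0) or every pixel is pure white (255)."""
    flat = [value for row in pixels for value in row]
    if not flat:
        return True  # assumption: an image with no pixels carries no content
    return all(v == 0 for v in flat) or all(v == 255 for v in flat)


all_black = [[0, 0], [0, 0]]
all_white = [[255, 255], [255, 255]]
mixed = [[0, 255], [128, 64]]
```

Only the `mixed` image would count as having visual content here; an image that is half black and half white also passes, because neither color covers every pixel.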

Audio Quality Metrics

Type Metric Description Paper Source Evaluation Results Examples
QUALITY_BAD_EFFECTIVENESS RuleAudioDuration Checks whether the audio duration meets the standard Internal Implementation N/A N/A
QUALITY_BAD_EFFECTIVENESS RuleAudioSnrQuality Checks whether the audio signal-to-noise ratio meets the standard Internal Implementation N/A N/A
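
For reference, a signal-to-noise ratio in decibels is 10·log10(P_signal / P_noise), where P is mean squared amplitude. The sketch below assumes separate signal and noise sample sequences are already available. How Dingo estimates noise from a single recording, and which threshold counts as "the standard", is not stated here. The function names are hypothetical.

```python
import math
from typing import Sequence


def mean_power(samples: Sequence[float]) -> float:
    """Mean squared amplitude of a non-empty sample sequence."""
    if not samples:
        raise ValueError("samples must not be empty")
    return sum(s * s for s in samples) / len(samples)


def snr_db(signal: Sequence[float], noise: Sequence[float]) -> float:
    """Signal-to-noise ratio in decibels."""
    noise_power = mean_power(noise)
    if noise_power == 0:
        return math.inf  # no measurable noise
    signal_power = mean_power(signal)
    if signal_power == 0:
        return -math.inf  # silent signal
    return 10 * math.log10(signal_power / noise_power)


# Signal amplitude is ten times the noise amplitude, i.e. 100x the power.
value = snr_db([1.0, -1.0, 1.0, -1.0], [0.1, -0.1, 0.1, -0.1])
```

A tenfold amplitude ratio corresponds to roughly 20 dB, which a rule of this kind would then compare against its configured minimum.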

Meta Rater Evaluation Metrics

Type Metric Description Paper Source Evaluation Results Examples
LLMMetaRaterCleanliness LLMMetaRaterCleanliness Evaluates text formatting, content appropriateness, and completeness, assessing whether text appears human-edited and... Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models (Zhuang et al., 2025) N/A N/A
LLMMetaRaterProfessionalism LLMMetaRaterProfessionalism Evaluates the degree of expertise and prerequisite knowledge required to comprehend text on a 5-point scale Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models (Zhuang et al., 2025) N/A N/A
LLMMetaRaterReadability LLMMetaRaterReadability Evaluates the clarity and coherence of text using appropriate vocabulary and sentence structures on a 5-point scale Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models (Zhuang et al., 2025) N/A N/A
LLMMetaRaterReasoning LLMMetaRaterReasoning Evaluates the reasoning complexity and logical depth of text content, from simple logical judgments to complex multid... Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models (Zhuang et al., 2025) N/A N/A

OCR Eval Metric

Type Metric Description Paper Source Evaluation Results Examples
LLMMinerURecognizeQuality LLMMinerURecognizeQuality Evaluates the quality of MinerU recognition Internal Implementation 📊 See Results (error_category and error_label) N/A
VLMDocumentParsingOCRTrain VLMDocumentParsingOCRTrain Evaluates the quality of MinerU recognition Internal Implementation 📊 See Results (error_category and error_label) N/A

Resume Quality Assessment Metrics

Type Metric Description Paper Source Evaluation Results Examples
LLMKeywordMatcher LLMKeywordMatcher Semantic keyword matching between resume and job description Internal Implementation N/A N/A
LLMResumeOptimizer LLMResumeOptimizer ATS-focused resume optimization with keyword injection and STAR polishing Internal Implementation N/A N/A
LLMResumeQuality LLMResumeQuality Comprehensive resume quality evaluation covering privacy, contact, format, structure, professionalism, date, and comp... Internal Implementation N/A N/A

Rule-Based RESUME Quality Metrics

Type Metric Description Paper Source Evaluation Results Examples
RESUME_QUALITY_BAD_COMPLETENESS RuleResumeEducationMissing, RuleResumeExperienceMissing Checks if resume contains education background information; Checks if resume contains work experience information Internal Implementation N/A N/A
RESUME_QUALITY_BAD_CONTACT RuleResumeEmailMissing, RuleResumePhoneMissing, RuleResumePhoneFormat Checks if resume contains a valid email address; Checks if resume contains a valid phone number; Validates phone numb... Internal Implementation N/A N/A
RESUME_QUALITY_BAD_DATE RuleResumeDateFormat Detects inconsistent date format usage in resume Internal Implementation N/A N/A
RESUME_QUALITY_BAD_FORMAT RuleResumeExcessiveWhitespace, RuleResumeMarkdown Detects excessive consecutive spaces in resume; Detects common Markdown syntax errors in resume Internal Implementation N/A N/A
RESUME_QUALITY_BAD_PRIVACY RuleResumeIDCard, RuleResumeDetailedAddress Detects 18-digit Chinese ID card numbers in resume content; Detects detailed address patterns that may leak privacy Internal Implementation N/A N/A
RESUME_QUALITY_BAD_PROFESSIONALISM RuleResumeEmoji, RuleResumeInformal Detects emoji usage in resume which reduces professionalism; Detects informal or colloquial expressions in resume Internal Implementation N/A N/A
RESUME_QUALITY_BAD_STRUCTURE RuleResumeNameMissing, RuleResumeSectionMissing Checks if resume contains a name in the first 200 characters; Checks if resume contains required sections like educat... Internal Implementation N/A N/A
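
As an illustration of the privacy check for 18-digit Chinese ID card numbers, the sketch below pairs a regular expression with the check-character computation defined for those numbers (weighted sum of the first 17 digits, modulo 11). The checksum step is an addition made here to reduce false positives; the table only says the rule detects 18-digit numbers, so Dingo's rule may rely on a pattern alone. All names are hypothetical, and the sample value is a commonly used illustrative number.

```python
import re
from typing import List

# 17 digits followed by a digit or X, not embedded in a longer digit run.
ID_PATTERN = re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)")
WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
CHECK_CHARS = "10X98765432"  # indexed by (weighted sum mod 11)


def has_valid_checksum(candidate: str) -> bool:
    """Verify the 18th character against the weighted sum of the first 17 digits."""
    total = sum(int(digit) * weight for digit, weight in zip(candidate[:17], WEIGHTS))
    return CHECK_CHARS[total % 11] == candidate[17].upper()


def find_id_numbers(text: str) -> List[str]:
    """Return 18-character candidates in `text` whose check character is valid."""
    return [match for match in ID_PATTERN.findall(text) if has_valid_checksum(match)]


found = find_id_numbers("ID: 11010519491231002X, phone 13800138000")
```

The 11-digit phone number is ignored because it does not match the pattern, and an 18-digit string with the wrong final character is rejected by the checksum.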

SFT Data Assessment Metrics - Agent-Enhanced

Type Metric Description Paper Source Evaluation Results Examples
AgentHallucination AgentHallucination Agent-based hallucination detection with automatic web search for missing context Internal Implementation N/A N/A

Text Generation

Type Metric Description Paper Source Evaluation Results Examples
LLMLongVideoQa LLMLongVideoQa Generate video-related question-answer pairs based on the summarized information of the input long video. VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos (Jiashuo Yu et al., 2025) N/A N/A