RAG Evaluation Metrics - Complete Guide

🎯 Overview

Dingo's RAG evaluation metrics are based on best practices from the RAGAS paper, DeepEval, and TruLens, and provide comprehensive evaluation of RAG systems.

✅ Supported Metrics (5/5)

| Metric | Evaluation Dimension | Required Fields | Source |
| --- | --- | --- | --- |
| Faithfulness | Answer Faithfulness | user_input, response, retrieved_contexts | RAGAS |
| Answer Relevancy | Answer Relevance | user_input, response | RAGAS |
| Context Relevancy | Context Relevance | user_input, retrieved_contexts | RAGAS + DeepEval + TruLens |
| Context Recall | Context Recall | user_input, retrieved_contexts, reference | RAGAS |
| Context Precision | Context Precision | user_input, retrieved_contexts, reference | RAGAS |

🚀 Quick Start

1. Run Examples

# Dataset mode - batch evaluation (recommended)
python examples/rag/dataset_rag_eval_baseline.py

# SDK mode - single evaluation
python examples/rag/sdk_rag_eval.py

# Simulate RAG system and evaluate
python examples/rag/e2e_RAG_eval_with_mockRAG_fiqa.py

2. SDK Mode - Single Evaluation

import os
from dingo.config.input_args import EvaluatorLLMArgs, EmbeddingConfigArgs
from dingo.io.input import Data
from dingo.model.llm.rag.llm_rag_faithfulness import LLMRAGFaithfulness

# Configure LLM
LLMRAGFaithfulness.dynamic_config = EvaluatorLLMArgs(
    key=os.getenv("OPENAI_API_KEY"),
    api_url=os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1"),
    model=os.getenv("OPENAI_MODEL", "deepseek-chat"),
)

# Prepare data
data = Data(
    data_id="example_1",
    prompt="What is machine learning?",
    content="Machine learning is a branch of AI that enables computers to learn from data.",
    context=[
        "Machine learning is a subfield of AI.",
        "ML systems learn from data without explicit programming."
    ]
)

# Evaluate
result = LLMRAGFaithfulness.eval(data)

# View results
print(f"Score: {result.score}/10")
print(f"Passed: {not result.status}")  # status=False means passed
print(f"Reason: {result.reason[0]}")

3. Dataset Mode - Batch Evaluation

from dingo.config import InputArgs
from dingo.exec import Executor

# Configuration
llm_config = {
    "model": "gpt-4o-mini",
    "key": "YOUR_API_KEY",
    "api_url": "https://api.openai.com/v1",
}

llm_config_embedding = {
    "model": "gpt-4o-mini",
    "key": "YOUR_API_KEY",
    "api_url": "https://api.openai.com/v1",
    "embedding_config": {  # ⭐ Required for Answer Relevancy
        "model": "text-embedding-3-large",
        "api_url": "https://api.openai.com/v1",
        "key": "YOUR_API_KEY"
    },
    "parameters": {
        "strictness": 3,
        "threshold": 5
    }
}

input_data = {
    "task_name": "rag_evaluation",
    "input_path": "test/data/fiqa.jsonl",
    "output_path": "outputs/",
    "dataset": {"source": "local", "format": "jsonl"},
    "executor": {
        "max_workers": 10,
        "result_save": {"good": True, "bad": True, "all_labels": True}
    },
    "evaluator": [
        {
            "fields": {
                "prompt": "user_input",
                "content": "response",
                "context": "retrieved_contexts",
                "reference": "reference"
            },
            "evals": [
                {"name": "LLMRAGFaithfulness", "config": llm_config},
                {"name": "LLMRAGAnswerRelevancy", "config": llm_config_embedding},
                {"name": "LLMRAGContextRelevancy", "config": llm_config},
                {"name": "LLMRAGContextRecall", "config": llm_config},
                {"name": "LLMRAGContextPrecision", "config": llm_config}
            ]
        }
    ]
}

input_args = InputArgs(**input_data)
executor = Executor.exec_map["local"](input_args)
summary = executor.execute()

📋 Data Format

Required Fields

| Metric | user_input | response | retrieved_contexts | reference | Notes |
| --- | --- | --- | --- | --- | --- |
| Faithfulness | ✅ | ✅ | ✅ | - | Measures if answer is based on context |
| Answer Relevancy | ✅ | ✅ | - | - | Measures if answer addresses the question |
| Context Relevancy | ✅ | - | ✅ | - | Measures if retrieved contexts are relevant |
| Context Recall | ✅ | - | ✅ | ✅ | Measures if all needed info is retrieved |
| Context Precision | ✅ | - | ✅ | ✅ | Measures ranking quality of retrieved contexts |

Data Example (JSONL)

{"user_input": "What is deep learning?", "response": "Deep learning uses neural networks...", "retrieved_contexts": ["Deep learning is a subset of ML...", "Deep learning is used for image recognition..."]}
{"user_input": "Python features?", "response": "Python is concise and has rich libraries.", "retrieved_contexts": ["Python has clean syntax.", "Python has NumPy and other libraries."], "reference": "Python has clean syntax and a rich ecosystem."}

⚙️ Configuration

Configurable Parameters

| Parameter | Applicable Metrics | Default | Description |
| --- | --- | --- | --- |
| threshold | All metrics | 5.0 | Pass threshold (0-10) |
| strictness | Answer Relevancy | 3 | Number of questions to generate (1-5) |
| embedding_config | Answer Relevancy | - | Required: includes model, api_url, key |

Embedding Configuration (Answer Relevancy)

LLMRAGAnswerRelevancy requires embedding_config:

Option 1: Cloud LLM + Cloud Embedding

"config": {
    "model": "deepseek-chat",
    "key": "YOUR_API_KEY",
    "api_url": "https://api.deepseek.com",
    "embedding_config": {  # ⭐ Required
        "model": "text-embedding-3-large",
        "api_url": "https://api.deepseek.com",
        "key": "YOUR_API_KEY"
    },
    "parameters": {"strictness": 3, "threshold": 5}
}

Option 2: Cloud LLM + Local Embedding (Recommended: Cost-effective)

"config": {
    "model": "deepseek-chat",
    "key": "YOUR_API_KEY",
    "api_url": "https://api.deepseek.com",
    "embedding_config": {  # ⭐ Independent embedding service
        "model": "BAAI/bge-m3",
        "api_url": "http://localhost:8000/v1",  # Local vLLM/Xinference
        "key": "dummy-key"
    },
    "parameters": {"strictness": 3, "threshold": 5}
}

Deploy Local Embedding (vLLM):

pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model BAAI/bge-m3 \
  --port 8000 \
  --host 0.0.0.0
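
This exposes an OpenAI-compatible endpoint at http://localhost:8000/v1, matching the api_url used in the local embedding_config example above.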

What happens if not configured?

Runtime exception:

ValueError: Embedding model not initialized. Please configure 'embedding_config' in your LLM config with:
  - model: embedding model name (e.g., 'BAAI/bge-m3')
  - api_url: embedding service URL
  - key: API key (optional for local services)

📊 Metric Details

1️⃣ Faithfulness (Answer Faithfulness)

Evaluation Goal: Measure if the answer is entirely based on retrieved context, avoiding hallucinations

Calculation:

  1. Break down answer into independent statements (claims)
  2. Judge if each statement is supported by context
  3. Faithfulness score = (Supported statements / Total statements) × 10

Formula:

Faithfulness = (Context-supported claims / Total claims) × 10
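
A minimal sketch of this arithmetic in Python, assuming the LLM judge has already returned per-claim verdicts (the verdicts below are hypothetical):

# Hypothetical verdicts from the LLM judge for an answer broken into 4 claims
claim_verdicts = [True, True, True, False]

# 3 of 4 claims supported by the context -> 3/4 × 10 = 7.5
faithfulness = sum(claim_verdicts) / len(claim_verdicts) * 10
print(faithfulness)  # 7.5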

Recommended Threshold: 7 (out of 10)


2️⃣ Answer Relevancy (Answer Relevance)

Evaluation Goal: Measure if the answer directly addresses the user question

Calculation:

  1. Generate N reverse questions from the answer (questions the LLM infers the answer could be responding to)
  2. Calculate cosine similarity between embeddings of generated questions and original question
  3. Answer Relevancy = Average of all similarities

Formula:

Answer Relevancy = (1/N) × Σ cosine_sim(E_gi, E_o)

Where:
- N: Number of generated questions, default 3 (adjustable via strictness parameter)
- E_gi: Embedding of the i-th generated question
- E_o: Embedding of the original question
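
A minimal sketch of this formula in Python (using numpy; the embedding vectors below are hypothetical stand-ins for what the model named in embedding_config would return):

import numpy as np

def cosine_sim(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Embedding of the original question and of 3 questions generated from the answer
e_original = [0.9, 0.1, 0.0]
e_generated = [[0.8, 0.2, 0.1], [0.7, 0.3, 0.0], [0.9, 0.0, 0.2]]

# Average cosine similarity per the formula; the 0-1 average is presumably
# scaled to the 0-10 range used by the threshold
answer_relevancy = sum(cosine_sim(e, e_original) for e in e_generated) / len(e_generated)
print(round(answer_relevancy, 3))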

⚠️ Important: This metric requires embedding_config:

  • model: Embedding model name (e.g., text-embedding-3-large, BAAI/bge-m3)
  • api_url: Embedding service address
  • key: API key (optional for local services)

Recommended Threshold: 5 (out of 10)


3️⃣ Context Relevancy (Context Relevance)

Evaluation Goal: Measure if retrieved contexts are relevant to the question

Calculation: Uses a Dual-Judge System from NVIDIA research:

Judge 1 Scoring:

  • 0 = Context completely irrelevant
  • 1 = Context partially relevant
  • 2 = Context fully relevant

Judge 2 Scoring:

  • Uses different prompt wording for another perspective
  • Same 0-2 scoring standard
  • Purpose: Reduce single-prompt bias

Final Score:

Context Relevancy = (Relevant contexts / Total contexts) × 10

Where:
- Relevant context: Average score from both judges ≥ threshold (default 1.0)
- Irrelevant context: Average score < threshold
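
A minimal sketch of the dual-judge aggregation, with hypothetical 0-2 scores for three retrieved contexts:

# Hypothetical 0-2 scores from the two judges for 3 retrieved contexts
judge_1 = [2, 1, 0]
judge_2 = [2, 1, 0]
relevance_threshold = 1.0  # average score >= 1.0 counts as relevant (default)

avg_scores = [(a + b) / 2 for a, b in zip(judge_1, judge_2)]
relevant = sum(score >= relevance_threshold for score in avg_scores)
context_relevancy = relevant / len(avg_scores) * 10
print(round(context_relevancy, 2))  # 2 of 3 contexts relevant -> 6.67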

Recommended Threshold: 5 (out of 10)


4️⃣ Context Recall

Evaluation Goal: Measure if all needed information is retrieved (requires reference answer)

Calculation:

  1. Extract independent statements from reference answer
  2. Judge if each statement can be attributed to the retrieved contexts
  3. Recall = (Context-supported reference statements / Total reference statements) × 10

Formula:

Context Recall = (Context-supported reference claims / Total reference claims) × 10
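
For example, if the reference answer yields 5 statements and 4 of them can be attributed to the retrieved contexts, Context Recall = 4/5 × 10 = 8.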

Note: Requires a reference answer (reference); typically used in the evaluation phase

Recommended Threshold: 5 (out of 10)


5️⃣ Context Precision

Evaluation Goal: Measure the ranking quality of retrieval results, i.e., whether relevant documents are ranked at the top (requires a reference answer)

Calculation:

  1. For each position k, judge if the context is relevant (supports reference answer)
  2. Calculate Precision@k for each position
  3. Use relevance indicator (v_k) for weighted sum

Formula:

Context Precision = Σ(Precision@k × v_k) / Total relevant items in top K

Where:
- K: Total retrieved documents, e.g., 5 documents
- k: Current position (1st, 2nd, 3rd, ..., K-th)
- v_k: Relevance indicator, 0 (irrelevant) or 1 (relevant)
- Precision@k: Precision in first k documents, 0.0 to 1.0
- Precision@k = Relevant count in first k / k
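
A minimal sketch of this weighted sum in Python, with hypothetical relevance indicators for K = 5 retrieved documents:

# Hypothetical relevance indicators v_k for K = 5 retrieved documents (1 = relevant)
v = [1, 0, 1, 1, 0]

precision_at_k = [sum(v[:k]) / k for k in range(1, len(v) + 1)]  # [1.0, 0.5, 0.67, 0.75, 0.6]
context_precision = sum(p * vk for p, vk in zip(precision_at_k, v)) / sum(v)
print(round(context_precision, 3))  # 0.806 on a 0-1 scale (~8.06 if scaled to 0-10)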

Note: Requires reference answer (reference) to judge which contexts are relevant

Recommended Threshold: 5 (out of 10)

🌟 Best Practices

1. Metric Combinations

Complete Evaluation (5 metrics):

"evals": [
    {"name": "LLMRAGFaithfulness"},       # Detect hallucinations
    {"name": "LLMRAGAnswerRelevancy"},    # Check answer relevance
    {"name": "LLMRAGContextRelevancy"},   # Check context noise
    {"name": "LLMRAGContextRecall"},      # Evaluate retrieval completeness
    {"name": "LLMRAGContextPrecision"}    # Evaluate retrieval ranking
]

Production Environment (no reference needed):

"evals": [
    {"name": "LLMRAGFaithfulness"},       # ⭐ Most important: prevent hallucinations
    {"name": "LLMRAGAnswerRelevancy"},    # Ensure direct answers
    {"name": "LLMRAGContextRelevancy"}    # Check retrieval noise
]

Evaluation Phase (requires reference):

"evals": [
    {"name": "LLMRAGContextRecall"},      # Evaluate retrieval completeness
    {"name": "LLMRAGContextPrecision"}    # Evaluate retrieval ranking
]

2. Threshold Adjustment

Adjust thresholds (default 5) based on your scenario (see the configuration snippet after this list):

  • Strict scenarios (finance, medical): threshold 7-8
  • General scenarios (Q&A systems): threshold 5-6
  • Loose scenarios (exploratory search): threshold 3-4
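
For example, a stricter finance or medical deployment could raise the pass bar by reusing the parameters block from the configuration examples above (the values here are illustrative):

"config": {
    "model": "deepseek-chat",
    "key": "YOUR_API_KEY",
    "api_url": "https://api.deepseek.com",
    "parameters": {"strictness": 3, "threshold": 7}
}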

3. Iterative Optimization

  1. Initial Evaluation: Evaluate current system with all 5 metrics
  2. Identify Issues:
    • Low Faithfulness → Generation model produces hallucinations
      • Optimize: Adjust generation prompts, use stronger models, enhance fact-checking
    • Low Answer Relevancy → Answer off-topic or contains irrelevant info
      • Optimize: Improve generation prompts, limit answer length, enhance question understanding
    • Low Context Relevancy → Retrieval introduces noise
      • Optimize: Improve retrieval algorithm, adjust similarity threshold, improve embedding model
    • Low Context Recall → Retrieval misses important info
      • Optimize: Increase Top-K, improve query rewriting, expand knowledge base
    • Low Context Precision → Relevant docs ranked lower
      • Optimize: Improve ranking algorithm, adjust reranker, improve relevance calculation
  3. Targeted Optimization: Adjust components based on issues
  4. Re-evaluate: Verify optimization effects
  5. Continuous Monitoring: Monitor key metrics in production

4. Important Notes

  • LLM Dependency: All metrics depend on an LLM API and require a valid API key and endpoint
  • Embedding Dependency:
    • Answer Relevancy requires embedding_config: model, api_url, key
    • Can use cloud services (OpenAI, DeepSeek) or local deployment (vLLM, Xinference)
    • Omitting it raises an exception: ValueError: Embedding model not initialized...
  • Cost Considerations: Evaluation incurs API costs; recommendations:
    • Development: Sample evaluation (50-100 samples)
    • Production: Use key metrics only (Faithfulness, Answer Relevancy, Context Relevancy)
    • Evaluation: Full evaluation of all metrics
  • Reference Requirements:
    • Context Recall and Context Precision require reference
    • The other three metrics don't need it
    • Reference is mainly used in the evaluation phase; production usually doesn't need it

📖 For More Details

See the Chinese version for comprehensive examples and detailed explanations.