A comprehensive framework for evaluating GenAI applications.
This is a WIP. We're actively adding features, fixing issues, and expanding examples. Please give it a try, share feedback, and report bugs.
- Multi-Framework Support: Seamlessly use metrics from Ragas, DeepEval, and custom implementations
- Turn & Conversation-Level Evaluation: Support for both individual queries and multi-turn conversations
- Evaluation types: Response, Context, Tool Call, Overall Conversation evaluation & Script-based evaluation
- LLM Provider Flexibility: OpenAI, Watsonx, Gemini, vLLM and others
- Panel of Judges: Use multiple LLMs as judges to reduce bias and improve evaluation accuracy with configurable aggregation strategies
- API Integration: Direct integration with external API for real-time data generation (if enabled)
- Setup/Cleanup Scripts: Support for running setup and cleanup scripts before/after each conversation evaluation (applicable when API is enabled)
- Token Usage Tracking: Track input/output tokens for both API calls and Judge LLM evaluations (per-judge tracking for panel mode)
- Streaming Performance Metrics: Capture time-to-first-token (TTFT), streaming duration, and tokens/second when using streaming endpoint
- Statistical Analysis: Statistics for every metric with score distribution analysis
- Rich Output: CSV, JSON, TXT reports + visualization graphs (pass rates, distributions, heatmaps)
- Flexible Configuration: Configurable environment & metric metadata, Global defaults with per-conversation/per-turn metric overrides
- Early Validation: Catch configuration errors before expensive LLM calls
- Concurrent Evaluation: Multi-threaded evaluation with configurable thread count
- Caching: LLM, embedding, and API response caching for faster re-runs
- Skip on Failure: Optionally skip the remaining evaluations in a conversation when a turn evaluation fails (configurable globally or per conversation). When an API call or setup script fails, metrics are always marked as ERROR.
- Usage Modes: CLI for batch evaluation and programmatic API for real-time integration with Python applications
Note: use either pip or uv pip.
Replace TAG below with a release tag like v0.5.0, or use main for latest (not recommended).
# Set your desired tag
TAG=v0.5.0
# Install package (no dependencies)
pip install --no-deps git+https://github.com/lightspeed-core/lightspeed-evaluation.git@${TAG}
# Install dependencies (choose one variant)
pip install -r https://raw.githubusercontent.com/lightspeed-core/lightspeed-evaluation/${TAG}/requirements.txt # Runtime only
pip install -r https://raw.githubusercontent.com/lightspeed-core/lightspeed-evaluation/${TAG}/requirements-nlp-metrics.txt # + nlp-metrics
pip install -r https://raw.githubusercontent.com/lightspeed-core/lightspeed-evaluation/${TAG}/requirements-local-embeddings.txt # + local-embeddings (torch excluded, see below)
pip install -r https://raw.githubusercontent.com/lightspeed-core/lightspeed-evaluation/${TAG}/requirements-all-extras.txt # + all extras (torch excluded, see below)

From main branch:
pip install --no-deps git+https://github.com/lightspeed-core/lightspeed-evaluation.git
pip install -r https://raw.githubusercontent.com/lightspeed-core/lightspeed-evaluation/main/requirements.txt

CPU torch + local embeddings:
TAG=v0.5.0
# 1. Install package
pip install --no-deps git+https://github.com/lightspeed-core/lightspeed-evaluation.git@${TAG}
# 2. Install CPU torch
pip install torch==2.10.0 --index-url https://download.pytorch.org/whl/cpu
# 3. Install other dependencies
pip install -r https://raw.githubusercontent.com/lightspeed-core/lightspeed-evaluation/${TAG}/requirements-local-embeddings.txt

GPU torch + local embeddings:
TAG=v0.5.0
# 1. Install package
pip install --no-deps git+https://github.com/lightspeed-core/lightspeed-evaluation.git@${TAG}
# 2. Install GPU torch (CUDA version from PyPI)
pip install torch==2.10.0
# 3. Install other dependencies
pip install -r https://raw.githubusercontent.com/lightspeed-core/lightspeed-evaluation/${TAG}/requirements-local-embeddings.txt

Prerequisites: Install uv (fast Python package installer):
pip install uv

Clone and install:
git clone https://github.com/lightspeed-core/lightspeed-evaluation.git
cd lightspeed-evaluation
# Install (choose one)
uv sync # Core only
uv sync --extra nlp-metrics # + nlp-metrics
uv sync --extra local-embeddings # + local-embeddings (CPU, ~2GB)
uv sync --all-extras # + all extras
uv sync --all-extras --group dev # + dev tools (for contributors)
# GPU local embeddings (~6GB)
cp uv-gpu.lock uv.lock && uv sync --extra local-embeddings --frozen

After changing pyproject.toml:
make sync-lock-and-requirements # Regenerate uv.lock, uv-gpu.lock, requirements-*.txt

# Set required environment variable(s) for Judge-LLM
export OPENAI_API_KEY="your-key"
# Optional: For script-based evaluations requiring Kubernetes access
export KUBECONFIG="/path/to/your/kubeconfig"
# Run evaluation
lightspeed-eval --system-config <CONFIG.yaml> --eval-data <EVAL_DATA.yaml> --output-dir <OUTPUT_DIR>
# Run subset of evaluations (filter by tag or conversation ID)
lightspeed-eval --system-config <CONFIG.yaml> --eval-data <EVAL_DATA.yaml> --tags basic advanced
lightspeed-eval --system-config <CONFIG.yaml> --eval-data <EVAL_DATA.yaml> --conv-ids conv_1 conv_2
# Filter by either (OR logic)
lightspeed-eval --system-config <CONFIG.yaml> --eval-data <EVAL_DATA.yaml> --tags basic --conv-ids special
# Clear and rebuild caches
lightspeed-eval --system-config <CONFIG.yaml> --eval-data <EVAL_DATA.yaml> --cache-warmup

Use the framework as a Python library for real-time integration with Python applications:
from lightspeed_evaluation import evaluate, SystemConfig, LLMConfig, EvaluationData, TurnData
# Configure
config = SystemConfig(llm=LLMConfig(provider="openai", model="gpt-4o-mini"))
# Create evaluation data
data = EvaluationData(
    conversation_group_id="my_eval",
    turns=[TurnData(turn_id="t1", query="What is OCP?", response="OpenShift...")],
)
# Run evaluation
results = evaluate(config, [data])

See Evaluation Guide - Programmatic API for detailed examples.
Please make any necessary modifications to system.yaml and evaluation_data.yaml. The evaluation_data.yaml file includes sample data for guidance.
# Set required environment variable(s) for both Judge-LLM and API authentication (for MCP)
export OPENAI_API_KEY="your-evaluation-llm-key"
export API_KEY="your-api-endpoint-key"
# Ensure API is running at configured endpoint
# Default: http://localhost:8080/v1/
# Run with API-enabled configuration
lightspeed-eval --system-config config/system.yaml --eval-data config/evaluation_data.yaml

# Set required environment variable(s) for Judge-LLM
export OPENAI_API_KEY="your-key"
# Use system configuration with api.enabled: false
# You have to pre-generate response, contexts & tool_calls data in the input evaluation data file
lightspeed-eval --system-config config/system_api_disabled.yaml --eval-data config/evaluation_data.yaml

- Ragas -- docs on Ragas website
  - Response Evaluation
  - Context Evaluation
- Custom
  - Response Evaluation
    - answer_correctness - Compares the response against the expected response
    - intent_eval - Evaluates whether the response demonstrates the expected intent or purpose
    - keywords_eval - Keywords evaluation with alternatives (ALL keywords must match, case insensitive)
  - Tool Evaluation
    - tool_eval - Validates tool calls, arguments, and optional results with regex pattern matching
- Script-based
  - Action Evaluation
    - script:action_eval - Executes verification scripts to validate actions (e.g., infrastructure changes)
- NLP (No LLM required)
- DeepEval -- docs on DeepEval website
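As an illustration of the documented keywords_eval semantics, here is a sketch (not the framework's implementation): expected_keywords is a list of groups, every group must match, and a group matches if any of its alternatives appears in the response, case-insensitively.

```python
def keywords_pass(response: str, expected_keywords: list[list[str]]) -> bool:
    """Sketch of keywords_eval semantics: every group must match, where a
    group matches if ANY of its alternatives occurs in the response
    (case insensitive)."""
    text = response.lower()
    return all(
        any(alternative.lower() in text for alternative in group)
        for group in expected_keywords
    )

# Using the alternatives format from the sample data below
print(keywords_pass(
    "OpenShift Virtualization runs VMs alongside containers.",
    [["virtualization"], ["openshift", "ocp"]],
))  # True: both groups matched
```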
Define custom evaluation metrics in system.yaml under metrics_metadata. Criteria is required; evaluation_steps and rubrics are optional. Scores range from 0 to 1.
metrics_metadata:
  turn_level:
    "geval:custom_metric_name":
      criteria: |
        What to evaluate (required).
      evaluation_params: [query, response, expected_response]
      threshold: 0.7
      description: "Metric description"

See Configuration → Metrics for GEval options (evaluation_steps, rubrics) and config/system.yaml for full examples.
The default system config file is config/system.yaml.
See docs/configuration.md for the detailed description.
- conversation_group_id: "test_conversation"
description: "Sample evaluation"
tag: "basic" # Optional: Tag for grouping eval conversations (default: "eval")
# Optional: Environment setup/cleanup scripts, when API is enabled
setup_script: "scripts/setup_env.sh" # Run before conversation
cleanup_script: "scripts/cleanup_env.sh" # Run after conversation
# Conversation-level metrics
conversation_metrics:
- "deepeval:conversation_completeness"
conversation_metrics_metadata:
"deepeval:conversation_completeness":
threshold: 0.8
turns:
- turn_id: id1
query: What is OpenShift Virtualization?
response: null # Populated by API if enabled, otherwise provide
contexts:
- OpenShift Virtualization is an extension of the OpenShift ...
attachments: [] # Attachments (Optional)
expected_keywords: [["virtualization"], ["openshift"]] # For keywords_eval evaluation
expected_response: OpenShift Virtualization is an extension of the OpenShift Container Platform that allows running virtual machines alongside containers
expected_intent: "explain a concept" # Expected intent for intent evaluation
# Per-turn metrics (overrides system defaults)
turn_metrics:
- "ragas:faithfulness"
- "custom:keywords_eval"
- "custom:answer_correctness"
- "custom:intent_eval"
# Per-turn metric configuration
turn_metrics_metadata:
"ragas:faithfulness":
threshold: 0.9 # Override system default
# turn_metrics: null (omitted) β Use system defaults (metrics with default=true)
- turn_id: id2
query: Skip this turn evaluation
turn_metrics: [] # Skip evaluation for this turn
- turn_id: id3
query: Create a namespace called test-ns
verify_script: "scripts/verify_namespace.sh" # Script-based verification
turn_metrics:
- "script:action_eval" # Script-based evaluation (if API is enabled)| Field | Type | Required | Description |
|---|---|---|---|
conversation_group_id |
string | β | Unique identifier for conversation |
description |
string | β | Optional description |
tag |
string | β | Tag for grouping eval conversations (default: "eval") |
setup_script |
string | β | Path to setup script (Optional, used when API is enabled) |
cleanup_script |
string | β | Path to cleanup script (Optional, used when API is enabled) |
conversation_metrics |
list[string] | β | Conversation-level metrics (Optional, if override is required) |
conversation_metrics_metadata |
dict | β | Conversation-level metric config (Optional, if override is required) |
turns |
list[TurnData] | β | List of conversation turns |
| Field | Type | Required | Description | API Populated |
|---|---|---|---|---|
| turn_id | string | ✅ | Unique identifier for the turn | ❌ |
| query | string | ✅ | The question/prompt to evaluate | ❌ |
| response | string | 🔶 | Actual response from system | ✅ (if API enabled) |
| contexts | list[string] | 🔶 | Context information for evaluation | ✅ (if API enabled) |
| attachments | list[string] | ❌ | Attachments | ❌ |
| expected_keywords | list[list[string]] | 🔶 | Expected keywords for keyword evaluation (list of alternatives) | ❌ |
| expected_response | string or list[string] | 🔶 | Expected response for comparison | ❌ |
| expected_intent | string | 🔶 | Expected intent for intent evaluation | ❌ |
| expected_tool_calls | list[list[list[dict]]] | 🔶 | Expected tool call sequences (multiple alternative sets) | ❌ |
| tool_calls | list[list[dict]] | ❌ | Actual tool calls from API | ✅ (if API enabled) |
| verify_script | string | 🔶 | Path to verification script | ❌ |
| turn_metrics | list[string] | ❌ | Turn-specific metrics to evaluate | ❌ |
| turn_metrics_metadata | dict | ❌ | Turn-specific metric configuration | ❌ |

🔶 Required based on metrics: Some fields are required only when using specific metrics
Examples
- expected_keywords: Required for custom:keywords_eval (case insensitive matching)
- expected_response: Required for custom:answer_correctness
- expected_intent: Required for custom:intent_eval
- expected_tool_calls: Required for custom:tool_eval (multiple alternative sets format)
- verify_script: Required for script:action_eval (used when API is enabled)
- response: Required for most metrics (auto-populated if API enabled)
Multiple expected responses: For metrics that include expected_response in their required_fields (defined in METRIC_REQUIREMENTS), you can provide expected_response as a list of strings. The evaluator will test each expected response until one passes. If all fail, it returns the maximum score from all attempts and logs all scores with their reasons into reason. Note: This feature only works for metrics explicitly listed in METRIC_REQUIREMENTS. For other metrics (e.g. GEval), only the first item in the list will be used. See example config for multiple expected responses (evaluation_data_multiple_expected_responses.yaml).
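A minimal sketch of a turn using this feature, based on the description above (the authoritative example is evaluation_data_multiple_expected_responses.yaml; the content here is made up for illustration):

```yaml
- conversation_group_id: multi_expected_demo
  turns:
    - turn_id: t1
      query: What is OCP?
      # Each entry is tried in order until one passes; if all fail,
      # the maximum score across attempts is reported.
      expected_response:
        - OpenShift Container Platform is Red Hat's Kubernetes platform
        - OCP stands for OpenShift Container Platform
      turn_metrics:
        - "custom:answer_correctness"
```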
| Override Value | Behavior |
|---|---|
| null (or omitted) | Use system global metrics (metrics with default: true) |
| [] (empty list) | Skip evaluation for this turn |
| ["metric1", ...] | Use specified metrics only, ignore global metrics |
The custom:tool_eval metric supports flexible matching with multiple alternative patterns:
- Format: [[[tool_calls, ...]], [[tool_calls]], ...] (list of list of list)
- Matching: Tries each alternative until one matches
- Use Cases: Optional tools, multiple approaches, default arguments, skip scenarios, result validation
- Empty Sets: [] represents "no tools" and must come after primary alternatives
- Result Validation: Optionally validate tool outputs with regex patterns via the result field
- Options: ordered (default: true) - sequence order must match when true, ignored when false; full_match (default: true) - exact 1:1 match when true, partial match when false
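As a hypothetical sketch of setting the ordered and full_match options (the exact placement is an assumption here; consult config/system.yaml or the Configuration docs for the authoritative location), they might be configured via the metric's metadata:

```yaml
turn_metrics_metadata:
  "custom:tool_eval":
    ordered: false    # tool calls may arrive in any order
    full_match: false # partial match: extra actual tool calls are tolerated
```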
# Multiple alternative sets format: [[[tool_calls, ...]], [[tool_calls]], ...]
expected_tool_calls:
  - # Alternative 1: Primary approach
    - # Sequence 1
      - tool_name: oc_get
        arguments:
          kind: pod
          name: openshift-light* # Regex patterns supported
        result: ".*Running.*" # Optional: validate tool output (regex)
    - # Sequence 2 (if multiple parallel tool calls needed)
      - tool_name: oc_describe
        arguments:
          kind: pod
  - # Alternative 2: Different approach
    - # Sequence 1
      - tool_name: kubectl_get
        arguments:
          resource: pods
  - # Alternative 3: Skip scenario (optional)
    [] # When model has information from previous conversation

The result field allows validating the output returned by a tool call. This is useful for verifying that tools not only received the correct inputs but also produced the expected outputs.
- Optional: If result is not specified, result validation is skipped (only name and arguments are checked)
- Regex Support: Uses regex pattern matching for flexible validation
- Use Cases: Verify command outputs, check success/failure states, validate returned data
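The behavior described above can be sketched with Python's re module (illustrative only; result_matches is a hypothetical helper, not the framework's API, and the framework's exact regex anchoring may differ):

```python
import re

def result_matches(expected_pattern, actual_output):
    """Sketch of result validation: if no pattern is given,
    validation is skipped (treated as a pass)."""
    if expected_pattern is None:
        return True
    # re.search used here; patterns like ".*Running.*" work with
    # either search or fullmatch semantics
    return re.search(expected_pattern, actual_output) is not None

print(result_matches(".*Running.*", "pod/nginx-abc123  1/1  Running  0  5m"))  # True
print(result_matches(None, "anything"))  # True (validation skipped)
```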
# Example: Validate that a pod is in Running state
expected_tool_calls:
  - - tool_name: oc_get
      arguments:
        kind: pod
        name: nginx-.*
        namespace: default
      result: ".*Running.*" # Verify pod status contains "Running"

# Example: Validate resource creation succeeded
expected_tool_calls:
  - - tool_name: oc_create
      arguments:
        kind: namespace
        name: test-ns
      result: ".*created" # Verify creation was successful

The framework supports script-based evaluations. Note: Scripts only execute when the API is enabled - they're designed to test against actual environment changes.
- Setup scripts: Run before conversation evaluation (e.g., create failed deployment for troubleshoot query)
- Cleanup scripts: Run after conversation evaluation (e.g., cleanup failed deployment)
- Verify scripts: Run per turn for the script:action_eval metric (e.g., validate whether a pod has been created)
# Example: evaluation_data.yaml
- conversation_group_id: infrastructure_test
  setup_script: ./scripts/setup_cluster.sh
  cleanup_script: ./scripts/cleanup_cluster.sh
  turns:
    - turn_id: turn_id
      query: Create a new cluster
      verify_script: ./scripts/verify_cluster.sh
      turn_metrics:
        - script:action_eval

Script Path Resolution
Script paths in evaluation data can be specified in multiple ways:
- Relative Paths: Resolved relative to the evaluation data YAML file location, not the current working directory
- Absolute Paths: Used as-is
- Home Directory Paths: Expands to user's home directory
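These resolution rules can be sketched as follows (illustrative; resolve_script_path is a hypothetical helper, not the framework's API):

```python
from pathlib import Path

def resolve_script_path(script: str, eval_data_file: str) -> Path:
    """Sketch of the script path resolution rules described above."""
    p = Path(script).expanduser()  # ~/... expands to the home directory
    if p.is_absolute():
        return p                   # absolute paths are used as-is
    # Relative paths resolve against the evaluation data YAML's directory,
    # not the current working directory
    return (Path(eval_data_file).parent / p).resolve()

print(resolve_script_path("scripts/setup_env.sh", "/data/eval/evaluation_data.yaml"))
# /data/eval/scripts/setup_env.sh
```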
# Hosted vLLM (provider: hosted_vllm)
export HOSTED_VLLM_API_KEY="your-key"
export HOSTED_VLLM_API_BASE="https://your-vllm-endpoint/v1"
# OpenAI (provider: openai)
export OPENAI_API_KEY="your-openai-key"
# IBM Watsonx (provider: watsonx)
export WATSONX_API_KEY="your-key"
export WATSONX_API_BASE="https://us-south.ml.cloud.ibm.com"
export WATSONX_PROJECT_ID="your-project-id"
# Gemini (provider: gemini)
export GEMINI_API_KEY="your-key"
# Azure OpenAI (provider: azure)
export AZURE_API_KEY="your-azure-key"
export AZURE_API_BASE="https://your-resource.openai.azure.com/"
# AZURE_API_VERSION is optional

Note for Azure: The model field should be the Azure deployment name, not the model name (when these are different).
# API authentication for external system (MCP)
export API_KEY="your-api-endpoint-key"

- CSV: Detailed results with status, scores, reasons
- JSON: Summary statistics with score distributions
- TXT: Human-readable summary
- PNG: 4 visualization types (pass rates, score distributions, heatmaps, status breakdown)
- Status: PASS/FAIL/ERROR/SKIPPED
- Actual Reasons: Reason for evaluation status/result
- Score Statistics: Mean, median, standard deviation, min/max for every metric
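As a quick illustration of the per-metric statistics the reports contain (computed here with Python's statistics module; the scores are made up):

```python
import statistics

# Hypothetical per-turn scores for one metric
scores = [0.82, 0.91, 0.77, 0.95, 0.88]

summary = {
    "mean": statistics.mean(scores),
    "median": statistics.median(scores),
    "stdev": statistics.stdev(scores),  # sample standard deviation
    "min": min(scores),
    "max": max(scores),
}
print(summary)
```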
When using the streaming endpoint (api.endpoint_type: streaming), the framework captures additional performance metrics:
| Metric | Description |
|---|---|
| time_to_first_token | Time in seconds from request start to first content token received |
| streaming_duration | Total time in seconds to receive all tokens |
| tokens_per_second | Output throughput (tokens generated per second, excluding TTFT) |
These metrics are included in:
- CSV output: Per-result columns for each metric
- JSON output: Per-result fields and aggregate statistics in streaming_performance
- TXT output: Aggregate statistics (mean, median, min/max) in the summary
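The relationship between the three metrics can be sketched as follows (illustrative arithmetic, not framework code; assumes streaming_duration includes TTFT, so throughput is computed over the remaining generation time):

```python
def tokens_per_second(output_tokens: int, streaming_duration: float,
                      time_to_first_token: float) -> float:
    """Throughput over the generation phase only, excluding TTFT,
    per the metric definitions above (illustrative sketch)."""
    generation_time = streaming_duration - time_to_first_token
    return output_tokens / generation_time

# e.g. 500 output tokens, 10.5 s total streaming, 0.5 s to first token
print(tokens_per_second(500, 10.5, 0.5))  # 50.0
```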
# Install dev dependencies and git hooks
make install-deps-test
# Format code
make black-format
# Run all pre-commit checks at once (same as CI)
make pre-commit # Runs: bandit, check-types, pyright, docstyle, ruff, pylint, black-check
# Or run each quality check individually:
make bandit # Security scan
make check-types # Type check
make pyright # Type check
make docstyle # Docstring style
make ruff # Lint check
make pylint # Lint check
make black-check # Check formatting
# Run tests
make test # Or: uv run pytest tests --cov=src

| Issue | Solution |
|---|---|
| Parsing error with context-related metrics (e.g., faithfulness) | Increase max_tokens (e.g., 2048 or higher, depending on the number and size of contexts) |
| Expected changes not reflected in results | Clear caches with the --cache-warmup flag, set cache_enabled: false in config, or manually delete .caches/ folders |
For comprehensive troubleshooting, see Evaluation Guide - Troubleshooting
An interactive web-based dashboard for visualizing and comparing evaluation results is available in the dashboard/ directory. This is a PoC implementation built with React and Vite.
See dashboard/README.md for setup and usage instructions.
For generating answers (optional), refer to README-generate-answers.
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
Contributions welcome - see development setup above for code quality tools.