A comprehensive framework for evaluating GenAI applications.
This is a WIP. We're actively adding features, fixing issues, and expanding examples. Please give it a try, share feedback, and report bugs.
- Multi-Framework Support: Seamlessly use metrics from Ragas, DeepEval, and custom implementations
- Turn & Conversation-Level Evaluation: Support for both individual queries and multi-turn conversations
- Evaluation types: Response, Context, Tool Call, Overall Conversation evaluation & Script-based evaluation
- LLM Provider Flexibility: OpenAI, Watsonx, Gemini, vLLM and others
- Panel of Judges: Use multiple LLMs as judges to reduce bias and improve evaluation accuracy with configurable aggregation strategies
- API Integration: Direct integration with external API for real-time data generation (if enabled)
- Setup/Cleanup Scripts: Support for running setup and cleanup scripts before/after each conversation evaluation (applicable when API is enabled)
- Token Usage Tracking: Track input/output tokens for both API calls and Judge LLM evaluations (per-judge tracking for panel mode)
- Streaming Performance Metrics: Capture time-to-first-token (TTFT), streaming duration, and tokens/second when using streaming endpoint
- Statistical Analysis: Statistics for every metric with score distribution analysis
- Rich Output: CSV, JSON, TXT reports + visualization graphs (pass rates, distributions, heatmaps)
- Flexible Configuration: Configurable environment & metric metadata, Global defaults with per-conversation/per-turn metric overrides
- Early Validation: Catch configuration errors before expensive LLM calls
- Concurrent Evaluation: Multi-threaded evaluation with configurable thread count
- Caching: LLM, embedding, and API response caching for faster re-runs
- Skip on Failure: Optionally skip the remaining evaluations in a conversation when a turn evaluation fails (configurable globally or per conversation). When an API call or setup script fails, metrics are always marked as ERROR.
- Usage Modes: CLI for batch evaluation and programmatic API for real-time integration with Python applications
Note: use either pip or uv pip.
Replace TAG below with a release tag like v0.5.0, or use main for latest (not recommended).
# Set your desired tag
TAG=v0.5.0
# Install package (no dependencies)
pip install --no-deps git+https://github.com/lightspeed-core/lightspeed-evaluation.git@${TAG}
# Install dependencies (choose one variant)
pip install -r https://raw.githubusercontent.com/lightspeed-core/lightspeed-evaluation/${TAG}/requirements.txt # Runtime only
pip install -r https://raw.githubusercontent.com/lightspeed-core/lightspeed-evaluation/${TAG}/requirements-nlp-metrics.txt # + nlp-metrics
pip install -r https://raw.githubusercontent.com/lightspeed-core/lightspeed-evaluation/${TAG}/requirements-local-embeddings.txt # + local-embeddings (torch excluded, see below)
pip install -r https://raw.githubusercontent.com/lightspeed-core/lightspeed-evaluation/${TAG}/requirements-all-extras.txt # + all extras (torch excluded, see below)

From main branch:
pip install --no-deps git+https://github.com/lightspeed-core/lightspeed-evaluation.git
pip install -r https://raw.githubusercontent.com/lightspeed-core/lightspeed-evaluation/main/requirements.txt

CPU torch + local embeddings:
TAG=v0.5.0
# 1. Install package
pip install --no-deps git+https://github.com/lightspeed-core/lightspeed-evaluation.git@${TAG}
# 2. Install CPU torch
pip install torch==2.10.0 --index-url https://download.pytorch.org/whl/cpu
# 3. Install other dependencies
pip install -r https://raw.githubusercontent.com/lightspeed-core/lightspeed-evaluation/${TAG}/requirements-local-embeddings.txt

GPU torch + local embeddings:
TAG=v0.5.0
# 1. Install package
pip install --no-deps git+https://github.com/lightspeed-core/lightspeed-evaluation.git@${TAG}
# 2. Install GPU torch (CUDA version from PyPI)
pip install torch==2.10.0
# 3. Install other dependencies
pip install -r https://raw.githubusercontent.com/lightspeed-core/lightspeed-evaluation/${TAG}/requirements-local-embeddings.txt

Prerequisites: Install uv (fast Python package installer):
pip install uv

Clone and install:
git clone https://github.com/lightspeed-core/lightspeed-evaluation.git
cd lightspeed-evaluation
# Install (choose one)
uv sync # Core only
uv sync --extra nlp-metrics # + nlp-metrics
uv sync --extra local-embeddings # + local-embeddings (CPU, ~2GB)
uv sync --all-extras # + all extras
uv sync --all-extras --group dev # + dev tools (for contributors)
# GPU local embeddings (~6GB)
cp uv-gpu.lock uv.lock && uv sync --extra local-embeddings --frozen

After changing pyproject.toml:
make sync-lock-and-requirements # Regenerate uv.lock, uv-gpu.lock, requirements-*.txt

# Set required environment variable(s) for Judge-LLM
export OPENAI_API_KEY="your-key"
# Optional: For script-based evaluations requiring Kubernetes access
export KUBECONFIG="/path/to/your/kubeconfig"
# Run evaluation
lightspeed-eval --system-config <CONFIG.yaml> --eval-data <EVAL_DATA.yaml> --output-dir <OUTPUT_DIR>
# Run subset of evaluations (filter by tag or conversation ID)
lightspeed-eval --system-config <CONFIG.yaml> --eval-data <EVAL_DATA.yaml> --tags basic advanced
lightspeed-eval --system-config <CONFIG.yaml> --eval-data <EVAL_DATA.yaml> --conv-ids conv_1 conv_2
# Filter by either (OR logic)
lightspeed-eval --system-config <CONFIG.yaml> --eval-data <EVAL_DATA.yaml> --tags basic --conv-ids special
# Clear and rebuild caches
lightspeed-eval --system-config <CONFIG.yaml> --eval-data <EVAL_DATA.yaml> --cache-warmup

Use the framework as a Python library for real-time integration with Python applications:
from lightspeed_evaluation import evaluate, SystemConfig, LLMConfig, EvaluationData, TurnData
# Configure
config = SystemConfig(llm=LLMConfig(provider="openai", model="gpt-4o-mini"))
# Create evaluation data
data = EvaluationData(
    conversation_group_id="my_eval",
    turns=[TurnData(turn_id="t1", query="What is OCP?", response="OpenShift...")],
)
# Run evaluation
results = evaluate(config, [data])

See Evaluation Guide - Programmatic API for detailed examples.
Please make any necessary modifications to system.yaml and evaluation_data.yaml. The evaluation_data.yaml file includes sample data for guidance.
# Set required environment variable(s) for both Judge-LLM and API authentication (for MCP)
export OPENAI_API_KEY="your-evaluation-llm-key"
export API_KEY="your-api-endpoint-key"
# Ensure API is running at configured endpoint
# Default: http://localhost:8080/v1/
# Run with API-enabled configuration
lightspeed-eval --system-config config/system.yaml --eval-data config/evaluation_data.yaml

# Set required environment variable(s) for Judge-LLM
export OPENAI_API_KEY="your-key"
# Use system configuration with api.enabled: false
# You have to pre-generate response, contexts & tool_calls data in the input evaluation data file
lightspeed-eval --system-config config/system_api_disabled.yaml --eval-data config/evaluation_data.yaml

- Ragas -- docs on Ragas website
  - Response Evaluation
  - Context Evaluation
- Custom
  - Response Evaluation
    - answer_correctness - Compares the response against the expected response
    - intent_eval - Evaluates whether the response demonstrates the expected intent or purpose
    - keywords_eval - Keywords evaluation with alternatives (ALL keywords must match, case insensitive)
  - Tool Evaluation
    - tool_eval - Validates tool calls, arguments, and optional results with regex pattern matching
- Script-based
  - Action Evaluation
    - script:action_eval - Executes verification scripts to validate actions (e.g., infrastructure changes)
- NLP (No LLM required)
- DeepEval -- docs on DeepEval website
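As an illustration of the documented keywords_eval semantics, here is a sketch (not the framework's implementation): expected_keywords is a list of groups, every group must match, and a group matches if any of its alternatives appears in the response, case-insensitively.

```python
def keywords_pass(response: str, expected_keywords: list[list[str]]) -> bool:
    """Sketch of keywords_eval semantics: every group must match, where a
    group matches if ANY of its alternatives occurs in the response
    (case insensitive)."""
    text = response.lower()
    return all(
        any(alternative.lower() in text for alternative in group)
        for group in expected_keywords
    )

# Using the alternatives format from the sample data below
print(keywords_pass(
    "OpenShift Virtualization runs VMs alongside containers.",
    [["virtualization"], ["openshift", "ocp"]],
))  # True: both groups matched
```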
Define custom evaluation metrics in system.yaml under metrics_metadata. Criteria is required; evaluation_steps and rubrics are optional. Scores range from 0 to 1.
metrics_metadata:
  turn_level:
    "geval:custom_metric_name":
      criteria: |
        What to evaluate (required).
      evaluation_params: [query, response, expected_response]
      threshold: 0.7
      description: "Metric description"

See Configuration → Metrics for GEval options (evaluation_steps, rubrics) and config/system.yaml for full examples.
The default system config file is config/system.yaml.
See docs/configuration.md for the detailed description.
- conversation_group_id: "test_conversation"
description: "Sample evaluation"
tag: "basic" # Optional: Tag for grouping eval conversations (default: "eval")
# Optional: Environment setup/cleanup scripts, when API is enabled
setup_script: "scripts/setup_env.sh" # Run before conversation
cleanup_script: "scripts/cleanup_env.sh" # Run after conversation
# Conversation-level metrics
conversation_metrics:
- "deepeval:conversation_completeness"
conversation_metrics_metadata:
"deepeval:conversation_completeness":
threshold: 0.8
turns:
- turn_id: id1
query: What is OpenShift Virtualization?
response: null # Populated by API if enabled, otherwise provide
contexts:
- OpenShift Virtualization is an extension of the OpenShift ...
attachments: [] # Attachments (Optional)
expected_keywords: [["virtualization"], ["openshift"]] # For keywords_eval evaluation
expected_response: OpenShift Virtualization is an extension of the OpenShift Container Platform that allows running virtual machines alongside containers
expected_intent: "explain a concept" # Expected intent for intent evaluation
# Per-turn metrics (overrides system defaults)
turn_metrics:
- "ragas:faithfulness"
- "custom:keywords_eval"
- "custom:answer_correctness"
- "custom:intent_eval"
# Per-turn metric configuration
turn_metrics_metadata:
"ragas:faithfulness":
threshold: 0.9 # Override system default
# turn_metrics: null (omitted) β Use system defaults (metrics with default=true)
- turn_id: id2
query: Skip this turn evaluation
turn_metrics: [] # Skip evaluation for this turn
- turn_id: id3
query: Create a namespace called test-ns
verify_script: "scripts/verify_namespace.sh" # Script-based verification
turn_metrics:
- "script:action_eval" # Script-based evaluation (if API is enabled)| Field | Type | Required | Description |
|---|---|---|---|
conversation_group_id |
string | β | Unique identifier for conversation |
description |
string | β | Optional description |
tag |
string | β | Tag for grouping eval conversations (default: "eval") |
setup_script |
string | β | Path to setup script (Optional, used when API is enabled) |
cleanup_script |
string | β | Path to cleanup script (Optional, used when API is enabled) |
conversation_metrics |
list[string] | β | Conversation-level metrics (Optional, if override is required) |
conversation_metrics_metadata |
dict | β | Conversation-level metric config (Optional, if override is required) |
turns |
list[TurnData] | β | List of conversation turns |
| Field | Type | Required | Description | API Populated |
|---|---|---|---|---|
| turn_id | string | ✅ | Unique identifier for the turn | ❌ |
| query | string | ✅ | The question/prompt to evaluate | ❌ |
| response | string | 🔶 | Actual response from system | ✅ (if API enabled) |
| contexts | list[string] | 🔶 | Context information for evaluation | ✅ (if API enabled) |
| attachments | list[string] | ❌ | Attachments | ❌ |
| expected_keywords | list[list[string]] | 🔶 | Expected keywords for keyword evaluation (list of alternatives) | ❌ |
| expected_response | string or list[string] | 🔶 | Expected response for comparison | ❌ |
| expected_intent | string | 🔶 | Expected intent for intent evaluation | ❌ |
| expected_tool_calls | list[list[list[dict]]] | 🔶 | Expected tool call sequences (multiple alternative sets) | ❌ |
| tool_calls | list[list[dict]] | ❌ | Actual tool calls from API | ✅ (if API enabled) |
| verify_script | string | 🔶 | Path to verification script | ❌ |
| turn_metrics | list[string] | ❌ | Turn-specific metrics to evaluate | ❌ |
| turn_metrics_metadata | dict | ❌ | Turn-specific metric configuration | ❌ |

🔶 Required based on metrics: Some fields are required only when using specific metrics
Examples
- expected_keywords: Required for custom:keywords_eval (case insensitive matching)
- expected_response: Required for custom:answer_correctness
- expected_intent: Required for custom:intent_eval
- expected_tool_calls: Required for custom:tool_eval (multiple alternative sets format)
- verify_script: Required for script:action_eval (used when API is enabled)
- response: Required for most metrics (auto-populated if API enabled)
Multiple expected responses: For metrics that include expected_response in their required_fields (defined in METRIC_REQUIREMENTS), you can provide expected_response as a list of strings. The evaluator will test each expected response until one passes. If all fail, it returns the maximum score from all attempts and logs all scores with their reasons into reason. Note: This feature only works for metrics explicitly listed in METRIC_REQUIREMENTS. For other metrics (e.g. GEval), only the first item in the list will be used. See example config for multiple expected responses (evaluation_data_multiple_expected_responses.yaml).
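A minimal sketch of a turn using this feature, based on the description above (the authoritative example is evaluation_data_multiple_expected_responses.yaml; the content here is made up for illustration):

```yaml
- conversation_group_id: multi_expected_demo
  turns:
    - turn_id: t1
      query: What is OCP?
      # Each entry is tried in order until one passes; if all fail,
      # the maximum score across attempts is reported.
      expected_response:
        - OpenShift Container Platform is Red Hat's Kubernetes platform
        - OCP stands for OpenShift Container Platform
      turn_metrics:
        - "custom:answer_correctness"
```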
| Override Value | Behavior |
|---|---|
| null (or omitted) | Use system global metrics (metrics with default: true) |
| [] (empty list) | Skip evaluation for this turn |
| ["metric1", ...] | Use specified metrics only, ignore global metrics |
The custom:tool_eval metric supports flexible matching with multiple alternative patterns:
- Format: [[[tool_calls, ...]], [[tool_calls]], ...] (list of list of list)
- Matching: Tries each alternative until one matches
- Use Cases: Optional tools, multiple approaches, default arguments, skip scenarios, result validation
- Empty Sets: [] represents "no tools" and must come after primary alternatives
- Result Validation: Optionally validate tool outputs with regex patterns via the result field
- Options: ordered (default: true) - sequence order must match when true, ignored when false; full_match (default: true) - exact 1:1 match when true, partial match when false
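As a hypothetical sketch of setting the ordered and full_match options (the exact placement is an assumption here; consult config/system.yaml or the Configuration docs for the authoritative location), they might be configured via the metric's metadata:

```yaml
turn_metrics_metadata:
  "custom:tool_eval":
    ordered: false    # tool calls may arrive in any order
    full_match: false # partial match: extra actual tool calls are tolerated
```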
# Multiple alternative sets format: [[[tool_calls, ...]], [[tool_calls]], ...]
expected_tool_calls:
  - # Alternative 1: Primary approach
    - # Sequence 1
      - tool_name: oc_get
        arguments:
          kind: pod
          name: openshift-light* # Regex patterns supported
        result: ".*Running.*" # Optional: validate tool output (regex)
    - # Sequence 2 (if multiple parallel tool calls needed)
      - tool_name: oc_describe
        arguments:
          kind: pod
  - # Alternative 2: Different approach
    - # Sequence 1
      - tool_name: kubectl_get
        arguments:
          resource: pods
  - # Alternative 3: Skip scenario (optional)
    [] # When model has information from previous conversation

The result field allows validating the output returned by a tool call. This is useful for verifying that tools not only received the correct inputs but also produced the expected outputs.
- Optional: If result is not specified, result validation is skipped (only name and arguments are checked)
- Regex Support: Uses regex pattern matching for flexible validation
- Use Cases: Verify command outputs, check success/failure states, validate returned data
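The behavior described above can be sketched with Python's re module (illustrative only; result_matches is a hypothetical helper, not the framework's API, and the framework's exact regex anchoring may differ):

```python
import re

def result_matches(expected_pattern, actual_output):
    """Sketch of result validation: if no pattern is given,
    validation is skipped (treated as a pass)."""
    if expected_pattern is None:
        return True
    # re.search used here; patterns like ".*Running.*" work with
    # either search or fullmatch semantics
    return re.search(expected_pattern, actual_output) is not None

print(result_matches(".*Running.*", "pod/nginx-abc123  1/1  Running  0  5m"))  # True
print(result_matches(None, "anything"))  # True (validation skipped)
```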
# Example: Validate that a pod is in Running state
expected_tool_calls:
  - - tool_name: oc_get
      arguments:
        kind: pod
        name: nginx-.*
        namespace: default
      result: ".*Running.*" # Verify pod status contains "Running"

# Example: Validate resource creation succeeded
expected_tool_calls:
  - - tool_name: oc_create
      arguments:
        kind: namespace
        name: test-ns
      result: ".*created" # Verify creation was successful

The framework supports script-based evaluations. Note: Scripts only execute when the API is enabled - they're designed to test against actual environment changes.
- Setup scripts: Run before conversation evaluation (e.g., create failed deployment for troubleshoot query)
- Cleanup scripts: Run after conversation evaluation (e.g., cleanup failed deployment)
- Verify scripts: Run per turn for the script:action_eval metric (e.g., validate whether a pod has been created)
# Example: evaluation_data.yaml
- conversation_group_id: infrastructure_test
  setup_script: ./scripts/setup_cluster.sh
  cleanup_script: ./scripts/cleanup_cluster.sh
  turns:
    - turn_id: turn_id
      query: Create a new cluster
      verify_script: ./scripts/verify_cluster.sh
      turn_metrics:
        - script:action_eval

Script Path Resolution
Script paths in evaluation data can be specified in multiple ways:
- Relative Paths: Resolved relative to the evaluation data YAML file location, not the current working directory
- Absolute Paths: Used as-is
- Home Directory Paths: Expands to user's home directory
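These resolution rules can be sketched as follows (illustrative; resolve_script_path is a hypothetical helper, not the framework's API):

```python
from pathlib import Path

def resolve_script_path(script: str, eval_data_file: str) -> Path:
    """Sketch of the script path resolution rules described above."""
    p = Path(script).expanduser()  # ~/... expands to the home directory
    if p.is_absolute():
        return p                   # absolute paths are used as-is
    # Relative paths resolve against the evaluation data YAML's directory,
    # not the current working directory
    return (Path(eval_data_file).parent / p).resolve()

print(resolve_script_path("scripts/setup_env.sh", "/data/eval/evaluation_data.yaml"))
# /data/eval/scripts/setup_env.sh
```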
# Hosted vLLM (provider: hosted_vllm)
export HOSTED_VLLM_API_KEY="your-key"
export HOSTED_VLLM_API_BASE="https://your-vllm-endpoint/v1"
# OpenAI (provider: openai)
export OPENAI_API_KEY="your-openai-key"
# IBM Watsonx (provider: watsonx)
export WATSONX_API_KEY="your-key"
export WATSONX_API_BASE="https://us-south.ml.cloud.ibm.com"
export WATSONX_PROJECT_ID="your-project-id"
# Gemini (provider: gemini)
export GEMINI_API_KEY="your-key"
# Azure OpenAI (provider: azure)
export AZURE_API_KEY="your-azure-key"
export AZURE_API_BASE="https://your-resource.openai.azure.com/"
# AZURE_API_VERSION is optional

Note for Azure: The model field should be the Azure deployment name, not the model name (when these are different).
# API authentication for external system (MCP)
export API_KEY="your-api-endpoint-key"

- CSV: Detailed results with status, scores, reasons
- JSON: Summary statistics with score distributions
- TXT: Human-readable summary
- PNG: 4 visualization types (pass rates, score distributions, heatmaps, status breakdown)
- Status: PASS/FAIL/ERROR/SKIPPED
- Actual Reasons: Reason for evaluation status/result
- Score Statistics: Mean, median, standard deviation, min/max for every metric
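As a quick illustration of the per-metric statistics the reports contain (computed here with Python's statistics module; the scores are made up):

```python
import statistics

# Hypothetical per-turn scores for one metric
scores = [0.82, 0.91, 0.77, 0.95, 0.88]

summary = {
    "mean": statistics.mean(scores),
    "median": statistics.median(scores),
    "stdev": statistics.stdev(scores),  # sample standard deviation
    "min": min(scores),
    "max": max(scores),
}
print(summary)
```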
When using the streaming endpoint (api.endpoint_type: streaming), the framework captures additional performance metrics:
| Metric | Description |
|---|---|
| time_to_first_token | Time in seconds from request start to first content token received |
| streaming_duration | Total time in seconds to receive all tokens |
| tokens_per_second | Output throughput (tokens generated per second, excluding TTFT) |
These metrics are included in:
- CSV output: Per-result columns for each metric
- JSON output: Per-result fields and aggregate statistics in streaming_performance
- TXT output: Aggregate statistics (mean, median, min/max) in the summary
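The relationship between the three metrics can be sketched as follows (illustrative arithmetic, not framework code; assumes streaming_duration includes TTFT, so throughput is computed over the remaining generation time):

```python
def tokens_per_second(output_tokens: int, streaming_duration: float,
                      time_to_first_token: float) -> float:
    """Throughput over the generation phase only, excluding TTFT,
    per the metric definitions above (illustrative sketch)."""
    generation_time = streaming_duration - time_to_first_token
    return output_tokens / generation_time

# e.g. 500 output tokens, 10.5 s total streaming, 0.5 s to first token
print(tokens_per_second(500, 10.5, 0.5))  # 50.0
```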
# Install dev dependencies and git hooks
make install-deps-test
# Format code
make black-format
# Run all pre-commit checks at once (same as CI)
make pre-commit # Runs: bandit, check-types, pyright, docstyle, ruff, pylint, black-check
# Or run each quality check individually:
make bandit # Security scan
make check-types # Type check
make pyright # Type check
make docstyle # Docstring style
make ruff # Lint check
make pylint # Lint check
make black-check # Check formatting
# Run tests
make test # Or: uv run pytest tests --cov=src

| Issue | Solution |
|---|---|
| Parsing error with context-related metrics (e.g., faithfulness) | Increase max_tokens (e.g., 2048 or higher, depending on the number and size of contexts) |
| Expected changes not reflected in results | Clear caches with the --cache-warmup flag, set cache_enabled: false in config, or manually delete .caches/ folders |
For comprehensive troubleshooting, see Evaluation Guide - Troubleshooting
An interactive web-based dashboard for visualizing and comparing evaluation results is available in the dashboard/ directory. This is a PoC implementation built with React and Vite.
See dashboard/README.md for setup and usage instructions.
For generating answers (optional), refer to README-generate-answers.
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
Contributions welcome - see development setup above for code quality tools.