Agent Health is an evaluation and observability framework for AI agents. It helps you measure agent performance through "Golden Path" trajectory comparison—where an LLM judge evaluates agent actions against expected outcomes.
Who uses Agent Health:
- AI teams building autonomous agents (RCA, customer support, data analysis)
- QA engineers testing agent behavior across scenarios
- Platform teams monitoring agent performance in production
Key capabilities:
- Real-time agent execution streaming and visualization
- LLM-based evaluation with pass/fail scoring
- Batch experiments comparing agents and models
- OpenTelemetry trace integration for performance analysis
- Pluggable connectors for different agent types (REST, SSE, CLI)
```bash
# Start Agent Health with demo data (no configuration needed)
npx @opensearch-project/agent-health
```

Opens http://localhost:4001 with pre-loaded sample data for exploration.
Next steps:
- Getting Started Guide - Step-by-step walkthrough
- Connect Your Agent - Configure your own agent
Features:
- Evals: Real-time agent evaluation with trajectory streaming
- Experiments: Batch evaluation runs with configurable parameters
- Compare: Side-by-side trace comparison with aligned and merged views
- Agent Traces: Table-based trace view with latency histogram, filtering, and detailed flyout with input/output display
- Live Traces: Real-time trace monitoring with auto-refresh and filtering
- Trace Views: Timeline and Flow visualizations for debugging
- Reports: Evaluation reports with LLM judge reasoning
- Connectors: Pluggable protocol adapters for different agent types
| Connector | Protocol | Description |
|---|---|---|
| `agui-streaming` | AG-UI SSE | ML-Commons agents (default) |
| `rest` | HTTP POST | Non-streaming REST APIs |
| `subprocess` | CLI | Command-line tools |
| `claude-code` | Claude CLI | Claude Code agent comparison |
| `mock` | In-memory | Demo and testing |
For creating custom connectors, see docs/CONNECTORS.md.
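The actual connector contract is defined in docs/CONNECTORS.md; the types below are purely an invented sketch of the idea, loosely in the spirit of the built-in `mock` connector: a connector adapts one protocol into a normalized stream of agent events.

```typescript
// Hypothetical sketch only - every name here is invented; see docs/CONNECTORS.md
// for the real connector contract.
interface AgentEvent {
  type: "message" | "tool_call" | "done"; // invented event kinds
  payload: unknown;
}

interface Connector {
  // Send a prompt to an agent endpoint and stream back normalized events.
  run(input: { prompt: string; endpoint: string }): AsyncIterable<AgentEvent>;
}

// A trivial in-memory connector for illustration.
const echoConnector: Connector = {
  async *run({ prompt }) {
    yield { type: "message", payload: `echo: ${prompt}` };
    yield { type: "done", payload: null };
  },
};
```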
Agent Health uses a client-server architecture where all clients (UI, CLI) access OpenSearch through a unified HTTP API. The server handles agent communication via pluggable connectors and proxies LLM judge calls to AWS Bedrock.
For detailed architecture documentation, see docs/ARCHITECTURE.md.
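Because every client goes through that same HTTP API, you can script against it directly. A minimal sketch, assuming the server is running on the default port 4001 and using only the two endpoints shown elsewhere in this README (`/health` and `/api/debug`); the full API surface is in docs/ARCHITECTURE.md:

```typescript
// api-smoke.ts - minimal sketch; assumes Node 18+ (global fetch) and a server on :4001.
const BASE_URL = "http://localhost:4001";

async function main() {
  // /health reports whether the backend is reachable.
  const health = await fetch(`${BASE_URL}/health`);
  console.log("health:", health.status, await health.text());

  // /api/debug toggles verbose logging at runtime (see Troubleshooting below).
  const debug = await fetch(`${BASE_URL}/api/debug`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ enabled: true }),
  });
  console.log("debug toggled:", debug.status);
}

main().catch(console.error);
```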
```bash
# Start server (default action)
npx @opensearch-project/agent-health

# Initialize a new project (creates agent-health.config.ts and .env.example)
npx @opensearch-project/agent-health init

# Check configuration and connectivity
npx @opensearch-project/agent-health doctor

# List resources (agents, connectors, models, test-cases, benchmarks)
npx @opensearch-project/agent-health list agents
npx @opensearch-project/agent-health list connectors

# Run a single test case against an agent
npx @opensearch-project/agent-health run -t demo-otel-001 -a demo

# Run a benchmark (batch of test cases)
npx @opensearch-project/agent-health benchmark -f ./test-cases.json -a my-agent
npx @opensearch-project/agent-health benchmark -n "My Benchmark" -a my-agent --export results.json

# Export benchmark test cases as JSON
npx @opensearch-project/agent-health export -b "My Benchmark" -o test-cases.json

# Generate reports (HTML, PDF, JSON)
npx @opensearch-project/agent-health report -b "My Benchmark"
npx @opensearch-project/agent-health report -b "My Benchmark" -f pdf -o report.pdf

# One-time migration for existing benchmark runs
npx @opensearch-project/agent-health migrate --dry-run
```

For full CLI documentation, see docs/CLI.md.
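The authoritative test-case schema lives in docs/CLI.md; the field names below are invented purely to illustrate the Golden Path idea: a test case pairs an input with the trajectory the LLM judge scores the agent against.

```typescript
// Hypothetical shape only - the real schema is documented in docs/CLI.md.
interface TestCase {
  id: string;           // e.g. "demo-otel-001", as used with `run -t` above
  input: string;        // prompt sent to the agent (invented field name)
  goldenPath: string[]; // expected trajectory the LLM judge scores against
}

const example: TestCase = {
  id: "demo-otel-001",
  input: "Why is checkout latency spiking?",
  goldenPath: [
    "query recent traces for the checkout service",
    "identify the slow downstream dependency",
    "report the root cause with supporting spans",
  ],
};
```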
Agent Health works out-of-the-box with demo data. Configure when you're ready to connect your own agent.
The `agent-health.config.ts` file is the primary way to configure custom agents, models, and hooks. Create it in your working directory (the directory you run `npx` or `agent-health` from):
```bash
# Generate a config file with examples
npx @opensearch-project/agent-health init
```

Or create it manually:
```typescript
// agent-health.config.ts
export default {
  agents: [
    {
      key: "my-agent",
      name: "My Agent",
      endpoint: "http://localhost:8000/agent",
      connectorType: "rest", // or "agui-streaming", "subprocess"
      models: ["claude-sonnet-4"],
      useTraces: true, // Enable OpenTelemetry trace collection
    },
  ],
};
```

The config file is auto-detected from the current working directory. Supported file names (in priority order): `agent-health.config.ts`, `agent-health.config.js`, `agent-health.config.mjs`. See `agent-health.config.example.ts` for all available options, including authentication hooks.
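The hook option names below are invented for illustration only (check `agent-health.config.example.ts` for the real ones); conceptually, an authentication hook decorates each outgoing agent request:

```typescript
// agent-health.config.ts - hook names here are hypothetical; see
// agent-health.config.example.ts for the actual option names.
export default {
  agents: [
    {
      key: "my-agent",
      name: "My Agent",
      endpoint: "http://localhost:8000/agent",
      connectorType: "rest",
      models: ["claude-sonnet-4"],
      hooks: {
        // Invented hook: attach a bearer token to every request to this agent.
        beforeRequest: async (req: { headers: Record<string, string> }) => {
          req.headers["Authorization"] = `Bearer ${process.env.MY_AGENT_TOKEN}`;
          return req;
        },
      },
    },
  ],
};
```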
Tip: Run `npx @opensearch-project/agent-health doctor` to verify your configuration is loaded correctly.
For LLM Judge evaluation (uses AWS Bedrock):
```bash
# Create .env file
cp .env.example .env

# Add AWS credentials
AWS_REGION=us-east-1
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
```

Full configuration guide: CONFIGURATION.md
| Command | Description |
|---|---|
| `npm install` | Install dependencies |
| `npm run dev` | Start frontend dev server (port 4000) |
| `npm run dev:server` | Start backend server (port 4001) |
| `npm run build` | TypeScript compile + Vite production build |
| `npm test` | Run all tests |
| `npm run test:unit` | Run unit tests only |
| `npm run test:integration` | Run integration tests only |
| `npm run test:e2e` | Run E2E tests with Playwright |
| `npm run test:e2e:ui` | Run E2E tests with Playwright UI |
| `npm run test:all` | Run all tests (unit + integration + e2e) |
| `npm test -- --coverage` | Run tests with coverage report |
| `npm run build:all` | Build UI + server + CLI |
| `npm run build:cli` | Build CLI only |

```bash
npm run server   # Build UI + start single server on port 4001
```

After publishing, run directly with npx:
```bash
npx @opensearch-project/agent-health                  # Start server on port 4001
npx @opensearch-project/agent-health --port 8080
npx @opensearch-project/agent-health --env-file .env
```

| Mode | Command | Port(s) |
|---|---|---|
| Dev (frontend) | `npm run dev` | 4000 |
| Dev (backend) | `npm run dev:server` | 4001 |
| Production | `npm run server` | 4001 |
| NPX | `npx @opensearch-project/agent-health` | 4001 (default) |
In development, the Vite dev server (4000) proxies /api requests to the backend (4001).
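That proxy is plain Vite configuration; the repository's actual vite.config.ts is authoritative, but the relevant option generally looks like this:

```typescript
// vite.config.ts - illustrative only; the project's real config may differ.
import { defineConfig } from "vite";

export default defineConfig({
  server: {
    port: 4000,
    proxy: {
      // Forward /api/* from the dev server (4000) to the backend (4001).
      "/api": "http://localhost:4001",
    },
  },
});
```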
Agent Health uses a comprehensive test suite with three layers:
| Type | Location | Command | Description |
|---|---|---|---|
| Unit | `tests/unit/` | `npm run test:unit` | Fast, isolated function tests |
| Integration | `tests/integration/` | `npm run test:integration` | Tests with real backend server |
| E2E | `tests/e2e/` | `npm run test:e2e` | Browser-based UI tests with Playwright |
```bash
# All tests
npm test              # Unit + integration
npm run test:all      # Unit + integration + E2E

# By type
npm run test:unit         # Unit tests only
npm run test:integration  # Integration tests (starts server)
npm run test:e2e          # E2E tests (starts servers)
npm run test:e2e:ui       # E2E with Playwright UI for debugging

# With coverage
npm run test:unit -- --coverage

# Specific file
npm test -- path/to/file.test.ts
npx playwright test tests/e2e/dashboard.spec.ts
```

E2E tests use Playwright to test the UI in a real browser.
```bash
# First time: install browsers
npx playwright install

# Run all E2E tests
npm run test:e2e

# Interactive UI mode (recommended for debugging)
npm run test:e2e:ui

# View test report
npm run test:e2e:report
```

Writing E2E Tests (a short sketch follows the list):
- Place tests in `tests/e2e/*.spec.ts`
- Use `data-testid` attributes for reliable selectors
- Handle empty states gracefully (check if data exists before asserting)
- See existing tests for patterns
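A sketch of those conventions (the `data-testid` values here are made up; copy the real ones from existing specs):

```typescript
// tests/e2e/example.spec.ts - illustrative pattern; testid names are hypothetical.
// Assumes a baseURL is set in playwright.config.ts.
import { test, expect } from "@playwright/test";

test("trace table renders or shows an empty state", async ({ page }) => {
  await page.goto("/");

  // Prefer data-testid selectors over text or CSS classes.
  const rows = page.getByTestId("trace-row");

  // Handle empty states gracefully: only assert on rows if data exists.
  if ((await rows.count()) > 0) {
    await expect(rows.first()).toBeVisible();
  } else {
    await expect(page.getByTestId("empty-state")).toBeVisible();
  }
});
```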
All PRs must pass these CI checks:
| Job | What it checks |
|---|---|
| `build-and-test` | Build + unit tests + 90% coverage |
| `lint-and-typecheck` | TypeScript compilation |
| `license-check` | SPDX headers on all source files |
| `integration-tests` | Backend integration tests with coverage |
| `e2e-tests` | Playwright browser tests with pass/fail tracking |
| `security-scan` | npm audit for vulnerabilities |
| `test-summary` | Consolidated test results summary |
| Test Type | Metric | Threshold |
|---|---|---|
| Unit | Lines | ≥ 90% |
| Unit | Branches | ≥ 80% |
| Unit | Functions | ≥ 80% |
| Unit | Statements | ≥ 90% |
| Integration | Lines | Informational (no threshold) |
| E2E | Pass Rate | 100% |
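If the unit suite runs on Vitest (an assumption; this README only confirms the Vite build, so check the repository's actual config), thresholds like these are typically encoded as:

```typescript
// vitest.config.ts - illustrative only; assumes Vitest, which this README
// does not confirm.
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    coverage: {
      thresholds: {
        lines: 90,
        branches: 80,
        functions: 80,
        statements: 90,
      },
    },
  },
});
```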
Each CI run produces these artifacts (downloadable from Actions tab):
| Artifact | Contents |
|---|---|
| `coverage-report` | Unit test coverage (HTML, LCOV) |
| `integration-coverage-report` | Integration test coverage |
| `playwright-report` | E2E test report with screenshots/traces |
| `test-badges` | Badge data JSON for coverage visualization |
The E2E test suite includes tests for the complete evaluation flow using mock modes:
- Demo Agent (`mock://demo`) - Simulated AG-UI streaming responses
- Demo Model (`provider: "demo"`) - Simulated LLM judge evaluation
This allows testing the full Create Test Case → Create Benchmark → Run Evaluation → View Results flow without requiring AWS credentials or a live agent in CI.
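To exercise the same mock modes from your own config, a sketch along these lines should be close (how judge models with `provider: "demo"` are declared may differ; the agent fields mirror the config example above):

```typescript
// agent-health.config.ts - sketch reusing the demo/mock modes described above.
export default {
  agents: [
    {
      key: "demo",
      name: "Demo Agent",
      endpoint: "mock://demo", // simulated AG-UI streaming responses
      connectorType: "mock",   // in-memory connector from the table above
      models: ["demo"],        // assumption: the demo judge may be wired up differently
    },
  ],
};
```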
Agent Health supports multiple agent types:
| Agent | Endpoint Variable | Setup |
|---|---|---|
| Langgraph | `LANGGRAPH_ENDPOINT` | Simple localhost agent |
| HolmesGPT | `HOLMESGPT_ENDPOINT` | AG-UI compatible RCA agent |
| ML-Commons | `MLCOMMONS_ENDPOINT` | See ML-Commons Setup |
Enable verbose debug logging to diagnose issues:
```bash
# Via environment variable
DEBUG=true npx @opensearch-project/agent-health

# Or toggle at runtime via API
curl -X POST http://localhost:4001/api/debug -H 'Content-Type: application/json' -d '{"enabled":true}'
```

Debug logging can also be toggled from the Settings page using the "Verbose Logging" switch, which syncs to both the browser console and server terminal.
| Issue | Solution |
|---|---|
| Cannot connect to backend | Run `npm run dev:server`, check `curl http://localhost:4001/health` |
| AWS credentials expired | Refresh credentials in `.env` |
| Storage/Traces not working | Check OpenSearch endpoint and credentials in `.env` |
| Need verbose logs | Set `DEBUG=true` in `.env` or toggle in Settings page |
We welcome contributions! See CONTRIBUTING.md for guidelines.
- Fork and clone the repository
- Install dependencies: `npm install`
- Create a feature branch: `git checkout -b feature/your-feature`
- Make changes and add tests
- Run tests: `npm test`
- Commit with DCO signoff: `git commit -s -m "feat: your message"`
- Push and create a Pull Request
All commits require DCO signoff and all PRs must pass CI checks (tests, coverage, linting).
- Getting Started - Step-by-step walkthrough from install to first evaluation
- Configuration - Connect your agent and configure the environment
- CLI Reference - Command-line interface documentation
- Development Guide - Architecture, coding conventions, and contributing
- Connectors Guide - Create custom connectors for your agent type
- ML-Commons Setup - OpenSearch ML-Commons integration
- Architecture - System design and patterns
