llm-evals

Star

Here are 109 public repositories matching this topic...

darkrishabh / agent-skills-eval

Star

A test runner for agentskills.io-style AI agent skills

cli yaml typescript ai-agents jsonl llm-evaluation llm-evals agent-evals agent-skills openai-compatible agentskills

Updated Jun 24, 2026
TypeScript

samarailly51-pixel / claimpilot-harness

Star

Crash-test insurance claim AI agents before production.

python testing insurance ai-agents prompt-injection llm-evals agent-evaluation

Updated Jun 20, 2026
Python

fastxyz / skill-optimizer

Star

Benchmark, evaluate, and optimize skills to ensure reliable performance across all LLMs

cli benchmark sdk ai mcp evaluation optimizer eval evaluation-framework ai-agent llm llm-eval evals openrouter llm-evaluation-framework tool-calling llm-evals ai-skill

Updated May 28, 2026
TypeScript

Turn feature specs into merged PRs with a self-supervising swarm of coding agents — parallel execution, isolated sandboxes, DAG dependencies. Open-source, self-hostable, model-agnostic (Claude / Gemini / Codex).

Updated Jun 18, 2026
Python

ALucek / evaluizer

Star

Visualize LLM outputs against datasets, manually annotate results, and run automated evaluations to algorithmically optimize prompts.

llm-optimizer llm-evals prompt-annotation prompt-optimizer

Updated Nov 22, 2025
TypeScript

The-Swarm-Corporation / StatisticalModelEvaluator

Star

An implementation of the Anthropic's paper and essay on "A statistical approach to model evaluations"

ai ml multiagent agents llms evals llm-evals agent-evals multi-agent-eval

Updated Oct 6, 2025
Python

pyladiesams / eval-llm-based-apps-jan2025

Star

Create an evaluation framework for your LLM based app. Incorporate it into your test suite. Lay the monitoring foundation.

workshop llm llms llmops llm-eval llm-test llm-evaluation-framework llm-evaluation-metrics llm-monitoring llm-testing llm-evals

Updated May 6, 2025
Jupyter Notebook

tiramitree / fde-ai-systems-portfolio

Star

Three runnable enterprise AI systems showing secure RAG, governed agents, AI release reliability, evals, traces, audit logs, and approval gates.

python openai ai-safety human-in-the-loop ai-agents rag enterprise-ai agentic-workflows tool-calling llm-evals forward-deployed-engineering responses-api

Updated Jun 22, 2026
Python

tpertner / squeeze

Star

Squeeze your model with pressure prompts to see if its behavior leaks.

reliability evaluation calibration alignment quality-assurance metamorphic-testing ai-safety trustworthiness hallucinations prompt-engineering llm-eval llm-evals

Updated Mar 1, 2026
Python

SaiTeja-Erukude / agentdog

Star

AgentDog helps developers inspect, test, score, and monitor AI agent runs locally.

testing ai-safety ai-agents tool-use llm-evals agent-evaluation

Updated May 23, 2026
Python

tpertner / confess

Star

Detecting Relational Boundary Erosion in AI systems. A framework for testing whether models maintain honest, calibrated, and appropriate boundaries.

python yaml calibration alignment metamorphic-testing model-evaluation ai-safety red-teaming prompt-injection hallucination-detection llm-evals evaluation-harness

Updated Feb 22, 2026
Python

keez97 / claude-architecture-skills

Star

7 Claude Code skills for software architecture review (Python, web, cloud, microservices). Includes A/B benchmarks against unskilled baseline, assertion-graded eval suite, and interactive dashboards.

python tdd architecture cloud-infrastructure code-review software-architecture ai-agents fastapi architecture-review prompt-engineering anthropic llm-evals claude-code claude-code-skills llm-benchmarks

Updated May 8, 2026
HTML

kevinschaul / llm-evals

Star

Because we should all have our own set of LLM evals.

llm llm-evals

Updated Apr 28, 2026
Python

sx4im / skillcheck

Star

🧪 A/B test your agent skills — skillcheck measures whether a SKILL.md actually improves an LLM's task performance, with blind grading, bootstrap confidence intervals, and a 0–100 score. CLI for Claude Code, Codex & Cursor skills.

cli typescript ab-testing ai-agents anthropic agentic-workflow llm-evals claude-code benchamrk llm-evaluation-benchmark claude-skill claude-code-skill codex-skills agent-skill ai-agent-skills skill-library openclaw-skills skill-testing skill-eval

Updated Jun 23, 2026
TypeScript

aelaguiz / codex-autoresearcher

Star

Codex-native autoresearch harness with structured worker/judge turns for optimizing anything you can measure.

python research optimization codex ai-agents llm-evals experiment-runner autoresearch

Updated Mar 21, 2026
Python

spences10 / ralph-town

Star

Disposable Daytona sandboxes for LLM evals and isolated command execution

cli typescript mcp sandbox daytona llm evals llm-evals sandbox-orchestration

Updated Jun 25, 2026
TypeScript

itseffi / ai-product-evals

Star

End-to-end LLM eval framework for AI products. Provider-agnostic and agent-native, with skills, traces, judge validation, and a human review interface.