llm-evals

Here are 109 public repositories matching this topic...

OmoiOS

Turn feature specs into merged PRs with a self-supervising swarm of coding agents — parallel execution, isolated sandboxes, DAG dependencies. Open-source, self-hostable, model-agnostic (Claude / Gemini / Codex).

  • Updated Jun 18, 2026
  • Python

🧪 A/B test your agent skills — skillcheck measures whether a SKILL.md actually improves an LLM's task performance, with blind grading, bootstrap confidence intervals, and a 0–100 score. CLI for Claude Code, Codex & Cursor skills.

  • Updated Jun 23, 2026
  • TypeScript
