llm-benchmark

Here are 193 public repositories matching this topic...

DeepTempo / socbench

An Open Harness and Benchmark for AI in Cybersecurity Operations.

cybersecurity threat-hunting detection-engineering soc-analyst llm-evaluation llm-agents agentic-ai reasoning-models llm-benchmark

Updated Jul 14, 2026
Python

enescingoz / mac-llm-bench

Star

Community benchmark database for running LLMs on Apple Silicon Macs

benchmark inference apple-silicon llm llama-cpp local-llm llm-benchmark tokens-per-second

Updated Apr 22, 2026
Shell

TsinghuaC3I / AdsQA

Star

[ICCV 2025] AdsQA: Towards Advertisement Video Understanding Arxiv: https://arxiv.org/abs/2509.08621

benchmark video-understanding video-question-answering advertisement-dataset video-llms llm-benchmark

Updated Oct 30, 2025
Python

weiber2002 / ICRTL-Benchmark

Star

ICRTL Benchmark: Industrial-level RTL design challenges for evaluating PPA optimization, code generation, and LLM applications in EDA.

benchmark evolution natural-language ppa eda rtl verilog system-verilog yosys ic-design vlsi-design rtl-design hardware-accelerator ai-agent verilogeval rtllm llm-benchmark spec-to-rtl

Updated Jul 3, 2026
SystemVerilog

blackwell-systems / gcf

Sponsor

Star

The AI-native wire format for structured data. 100% comprehension on every frontier model. 50-92% fewer tokens than JSON. 43B+ lossless round-trips across 17 formats. Spec v3.2 Stable.

Updated Jul 19, 2026

genlab-1c / prism

Star

Открытый бенчмарк LLM: какая нейросеть лучше пишет код 1С:Предприятие (BSL). Объективная оценка LLM по методике SMOP с реальным исполнением в 1С — Claude, GPT, Gemini, DeepSeek, YandexGPT, GigaChat.

benchmark ai evaluation code-generation code-quality onescript bsl 1c 1c-enterprise llm ai-code-generation agentic function-calling llm-evaluation smop llm-benchmark

Updated Jul 15, 2026
Python

hyeonsangjeon / gdpval-realworks

Sponsor

Star

Benchmark LLMs on real professional tasks, not academic puzzles. YAML-driven experiment pipeline + live React dashboard for GDPVal Gold Subset (220 tasks across 11 industries).

Updated Jul 18, 2026
Python

ARTPARK-SAHAI-ORG / calibrate

Star

Core engine behind Calibrate, a framework for evaluating AI agents: speech-to-text, text-to-speech, LLM evaluation, end-to-end simulations

Updated Jul 18, 2026
JavaScript

ishida-lab / capbencher

Star

[ICML 2026] CapBencher toolkit: Give your LLM benchmark a built-in alarm for leakage and gaming

contamination-detection llm llm-datasets llm-benchmark leaderboard-hacking

Updated May 29, 2026
Python

MarkIvor / officeiq

Star

Исследовательский вопрос: можно ли измерить «офисный интеллект» LLM? Попытка — здесь. 100 сценариев, 10 критериев, русский корпоративный контекст.

benchmark ai-assistants evaluation-framework russian-nlp corporate-ai llm-evaluation llm-as-a-judge llm-benchmark

Updated May 11, 2026
HTML

MiaAI-Lab / Ternary-Bonsai-27B-tool-eval-bench-results

Star

tool-eval-bench results for Prism ML Ternary-Bonsai-27B (Q2_0) — 8 trials, score 85, deployability 80

ternary bonsai agentic gguf tool-calling llm-benchmark tool-eval-bench

Updated Jul 15, 2026
HTML

BennettSchwartz / ERR-EVAL

Sponsor

Star

Benchmark for evaluating AI epistemic reliability - testing how well LLMs handle uncertainty, avoid hallucinations, and acknowledge what they don't know.

python nlp benchmark machine-learning ai evaluation collaborate ai-safety llm llm-evaluation hallucination-detection ai-benchmark llm-benchmark

Updated Mar 22, 2026
Python

yyh-001 / llm-value-rankings

Star

Daily LLM value rankings - compare 300+ models by intelligence, speed and price. OpenRouter + Artificial Analysis. 大模型性价比排行榜

Updated Jul 19, 2026
CSS

GalenChen320 / Otter

Star

An agent evaluation framework with native multi-turn feedback iteration.

agent code-evaluation feedback-driven llm-benchmark

Updated Apr 29, 2026
Python

The open-source benchmark for LLM memory decay. Measure how Naive, RAG, Chunked RAG, Cascading, and SummaryMemory degrade over 100 conversation turns. Ebbinghaus forgetting curves, 5-provider LLM eval, multi-seed CI. No API key needed.

Updated Jul 4, 2026
Python

joeseesun / llm-case-benchmark

Star

同提示词多模型并排评测，支持写作与 HTML/SVG 预览 | Compare multiple LLMs on one prompt with writing and frontend previews.

nodejs sqlite html-preview model-comparison prompt-engineering deepseek openai-compatible llm-benchmark

Updated Jul 14, 2026
JavaScript

ctala / ai-benchmarks-alternativos

Sponsor

Star

Benchmark abierto en español de 170 modelos de IA (118 con 20+ runs, 69 rankeados, juez Phi-4 independiente). Calidad, costo, velocidad, long-context y fuga de credenciales como dimensiones separadas. Alternativas a Claude, GPT y Gemini para agentes n8n/Hermes. Calculadora interactiva con tus propios pesos.

ai-agents startup-tools n8n ai-models emprendedores openrouter ollama llm-evaluation llm-comparison ai-benchmark llm-benchmark spanish-ai openclaw hermes-agent claude-alternatives gpt-alternatives phi4-judge benchmark-en-espanol emprendedores-ia

Updated Jul 18, 2026
Python

Basaltlabs-app / Gauntlet

Star

Community-driven behavioral reliability benchmark for LLMs. 231 probes across 19 modules, deterministic scoring, perplexity correlation, layer sensitivity mapping, quant method capture, hardware-stratified community rankings. Every test contributes to the community dataset.

benchmark mcp community-driven model-evaluation ai-evaluation llm ollama sycophancy hallucination-detection llm-testing hardware-benchmark ai-trust trust-scoring behavioral-testing llm-benchmark deterministic-scoring

Updated May 4, 2026
Python

filipbasara0 / llm-jigsaw

Star

Testing how well LLMs can solve jigsaw puzzles

benchmark vision-language-model llm-reasoning llm-benchmark

Updated Jan 8, 2026
Python

idemerge / llm-api-bench

Star

Self-hosted LLM API benchmark, monitoring & playground. Compare latency, TTFT, throughput across OpenAI, Anthropic, Gemini & any OpenAI-compatible endpoint. Deploy with one command via Docker. | 自托管 LLM API 性能测试、监控与调试平台，一键 Docker 部署，支持多家服务商对比。

docker latency self-hosted gemini openai performance-testing api-monitoring anthropic llm-playground llm-api llm-benchmarking llm-monitoring api-benchmark llm-benchmark

Updated May 15, 2026
TypeScript

Improve this page

Add a description, image, and links to the llm-benchmark topic page so that developers can more easily learn about it.

Curate this topic

Add this topic to your repo

To associate your repository with the llm-benchmark topic, visit your repo's landing page and select "manage topics."

Learn more

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

llm-benchmark

Here are 193 public repositories matching this topic...

DeepTempo / socbench

enescingoz / mac-llm-bench

TsinghuaC3I / AdsQA

weiber2002 / ICRTL-Benchmark

blackwell-systems / gcf

genlab-1c / prism

hyeonsangjeon / gdpval-realworks

ARTPARK-SAHAI-ORG / calibrate

ishida-lab / capbencher

MarkIvor / officeiq

MiaAI-Lab / Ternary-Bonsai-27B-tool-eval-bench-results

BennettSchwartz / ERR-EVAL

yyh-001 / llm-value-rankings

GalenChen320 / Otter

Neal006 / memorylens-bench

joeseesun / llm-case-benchmark

ctala / ai-benchmarks-alternativos

Basaltlabs-app / Gauntlet

filipbasara0 / llm-jigsaw

idemerge / llm-api-bench

Improve this page

Add this topic to your repo