Awesome Agentic Engineering Resources

A curated list of high-signal resources — articles, books, courses, cookbooks, papers, playbooks, benchmarks, talks, podcasts, and newsletters — for agentic engineering and AI engineering.

This is a resources list, not a tools list. Open-source tools for building agentic systems live in the sister list awesome-production-agentic-systems; production ML tooling lives in awesome-production-machine-learning. This list covers the learning, design, and operational resources that sit alongside those tools — including both:

  • Agentic engineering — using AI agents to do software engineering (Copilot, Cursor, Claude Code, Aider, Cline, Windsurf, Codex; spec-driven development; context engineering; agent IDE rules and memory files; SWE benchmarks).
  • AI / agentic systems engineering — building agentic and LLM-powered systems (architecture, RAG, memory, tool use & MCP, orchestration, multi-agent coordination, evaluation, observability, guardrails, safety, fine-tuning, inference, product/UX, economics, teams).

You can keep up to date by watching this repo for the monthly releases summarising newly added resources 🤩

This list was proposed in EthicalML/awesome-production-machine-learning#709 as a sister list focused on resources rather than tools.

Legend

Resources are tagged with icons so you can scan and filter at a glance:

Icon Meaning
⭐ Editors' pick — start here
🆓 Free to access
💰 Paid
📘 Book
🧑‍🎓 Course
🎥 Video / talk
🎧 Audio / podcast
📄 Paper
🛠️ Hands-on cookbook / tutorial
📋 Playbook / design-pattern catalog
🧪 Benchmark / leaderboard
🏗️ Reference implementation / case study
📰 Newsletter

Quick links to sections on this page

⭐ Trending / What's New 🧭 Core & Foundations 🗓️ Milestones Timeline
👥 Communities 🧑‍🎓 Courses 📘 Books
✍️ Articles & Essays 🛠️ Tutorials & Cookbooks 📋 Playbooks & Patterns
📄 Papers & Research 🧪 Benchmarks & Leaderboards 🏗️ Reference Implementations
🎥 Talks & Conferences 🎧 Podcasts 📰 Newsletters
🛡️ Governance, Safety & Responsible AI 🎨 Product, UX & Economics of AI 🧑‍🤝‍🧑 Teams, Hiring & Org Design

Topic Coverage Matrix

Resources are organised as a matrix: the top-level sections above (rows) are resource types, and each section is sub-divided by topic. The 21 topics, T1–T21, are shared across sections. This lets you read vertically ("what papers exist on RAG?") or horizontally ("where do I find resources on Coding Agents?").

Topics:

# Topic
T1 Coding Agents & AI-Assisted Development (Copilot, Cursor, Claude Code, Aider, Cline, Windsurf, Codex)
T2 Spec-Driven Development & Context Engineering (AGENTS.md, spec-kit, rules files)
T3 Agent IDE Rules, Memory Files & Developer Workflows
T4 SWE Benchmarks & Coding Evaluation
T5 Autonomous Software Agents & Long-Horizon Engineering Tasks
T6 LLM Application Architecture & System Design
T7 Prompt Engineering
T8 Retrieval-Augmented Generation (RAG)
T9 Memory Systems & Long-Context
T10 Tool Use, Function Calling & MCP
T11 Orchestration, Planning & Design Patterns
T12 Multi-Agent Systems & Coordination
T13 Evaluation & Testing
T14 Observability, Tracing & Debugging
T15 Guardrails & Security (prompt injection, jailbreaks, red-teaming)
T16 Safety, Alignment & Responsible AI
T17 Fine-tuning, Post-training, RLHF & Reasoning Training
T18 Inference, Serving, Cost & Latency
T19 Voice, Multi-modal & Embodied Agents
T20 Product, UX & Human-AI Interaction Design
T21 Economics, Teams, Hiring & Org Design

Coverage (● = populated, ○ = opportunistic / partial, — = out of scope for that row):

Row \ Topic T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12 T13 T14 T15 T16 T17 T18 T19 T20 T21
Core & Foundations ● ● ○ ○ ○ ● ● ● ○ ● ● ○ ● ○ ○ ○ ○ ○ ○ ○ ○
Communities ● ○ ○ ○ ○ ● ● ● ○ ● ● ○ ● ● ○ ● ● ● ○ ● ●
Courses ● ○ ○ ● ○ ● ● ● ○ ● ● ● ● ● ● ● ● ● ○ ○ ○
Books ● ○ ○ — ○ ● ● ● ○ ● ● ○ ● ○ ● ● ● ● ○ ● ●
Articles & Essays ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
Tutorials & Cookbooks ● ● ● ○ ● ● ● ● ● ● ● ● ● ● ● ○ ● ● ● ○ —
Playbooks & Patterns ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ○ ● ○ ● ●
Papers & Research ● ○ — ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ○
Benchmarks ● — — ● ● ○ ○ ● ○ ● ● ● ● ○ ● ● ○ ● ● ○ —
Reference Impls ● ● ● ● ● ● ○ ● ● ● ● ● ● ● ● ○ ● ● ● ● ●
Talks & Conferences ● ● ○ ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
Podcasts ● ○ ○ ○ ● ● ● ● ○ ● ● ● ● ● ● ● ● ● ○ ● ●
Newsletters ● ○ ○ ○ ○ ● ● ● ○ ● ● ○ ● ● ● ● ● ● ○ ● ●

The Trending / What's New, Milestones Timeline, Governance & Responsible AI, Product / UX / Economics, and Teams, Hiring & Org Design sections collapse across topics and are presented as curated lists rather than matrix cells.

Contributing to the list

Please review our CONTRIBUTING.md before submitting a PR — it explains the one-line description style, how to pick the right row/topic cell, and the quality bar for inclusion. Thank you to the community for supporting the list's growth 🚀

Want to receive recurring updates on this repo and other advancements?

You can join the Machine Learning Engineer newsletter, where over 70,000 ML professionals and enthusiasts receive weekly curated articles & tutorials on production Machine Learning.
Also check out Awesome Production Agentic Systems and Awesome Production Machine Learning, the sister lists of open-source tools for agentic systems and production ML respectively.

Main Content

⭐ Trending / What's New

Rotating pinned items: the most-discussed agentic & AI-engineering resources of the current cycle. Refreshed regularly — see CONTRIBUTING.md for nomination criteria.

  • ⭐ 🆓 Building effective agents — Anthropic (2024). The most-cited reference for agent design patterns (augmented LLM, prompt chaining, routing, parallelisation, orchestrator-workers, evaluator-optimiser, autonomous agents). Start here before any other agent reading.
  • ⭐ 🆓 How we built our multi-agent research system — Anthropic (2025). Production retrospective on Claude's multi-agent research mode: orchestrator/subagent split, prompt engineering for agents, evaluation and failure modes.
  • ⭐ 🆓 A practical guide to building agents — OpenAI (2025). 30-page PDF covering when (and when not) to build agents, tool design, guardrails, and human-in-the-loop patterns.
  • ⭐ 🆓 The bitter lesson of AI agents / Agentic Coding: The Future of Software Development with Agents — Armin Ronacher (2025). Widely shared essays on what it actually feels like to ship with agentic coding tools day-to-day.
  • 🆓 Claude Code: Best practices for agentic coding — Anthropic (2025). CLAUDE.md, slash-commands, headless mode, custom permissions — the canonical how-to-use-Claude-Code reference.
  • 🆓 How to build an agent — Thorsten Ball / Amp (2025). Viral step-by-step implementation of a tool-using coding agent in ~400 lines of Go, demystifying "what is an agent" in code.
  • 🆓 The new code — Sean Grove / OpenAI on Latent Space (2025). Specs-as-code: the spec is the new artefact, models are the compiler. Heavily cited in the AGENTS.md / spec-kit discussion.
  • 🆓 AGENTS.md — Community standard (2025) for per-repo agent instructions, now read by Claude Code, Codex, Aider, Cursor, Cline, Windsurf and others.
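
The loop these resources keep describing, a model that alternates between requesting tool calls and returning a final answer, is small enough to sketch. A minimal, hypothetical version in Python (`model` stands in for any LLM API call; all names are illustrative, not taken from any resource above):

```python
# Minimal tool-using agent loop, in the spirit of "How to build an agent":
# each turn, the model either requests a tool call or returns an answer.
# `model` is any callable mapping the conversation so far to a dict.

def run_agent(model, tools, user_message, max_steps=10):
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        reply = model(messages)           # e.g. {"tool": "add", "args": {...}}
        if "answer" in reply:             # model is done: return final text
            return reply["answer"]
        result = tools[reply["tool"]](**reply["args"])  # execute requested tool
        messages.append({"role": "tool", "content": str(result)})
    raise RuntimeError("agent did not finish within max_steps")
```

Swap in a real LLM client for `model` and a dict of real functions for `tools`; everything else in the published agent designs (routing, orchestrator-workers) is elaboration on this loop.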

🧭 Core & Foundations

Canonical "what is agentic engineering / AI engineering" reading. Start here.

T1 · Coding Agents & AI-Assisted Development

T2 · Spec-Driven Development & Context Engineering

  • ⭐ 🆓 The new code — Sean Grove (OpenAI) on Latent Space. The canonical "specs are the new code" essay.
  • 🆓 AGENTS.md — Community standard for per-repo agent instructions.
  • 🆓 spec-kit — GitHub's toolkit and essay set on spec-driven development with coding agents.
  • 🆓 The rise of "context engineering" — LangChain. Why prompt engineering became context engineering.
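
For orientation, an AGENTS.md is plain markdown that agents read before touching a repo. A hypothetical minimal example (contents invented for illustration, not from the standard itself):

```markdown
# AGENTS.md

## Build & test
- Install dependencies: `npm install`
- Run the test suite before every commit: `npm test`

## Conventions
- TypeScript strict mode; avoid `any`.
- Keep changes small; one concern per pull request.
```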

T6 · LLM Application Architecture & System Design

T7 · Prompt Engineering

T8 · Retrieval-Augmented Generation (RAG)

T10 · Tool Use, Function Calling & MCP

T11 · Orchestration, Planning & Design Patterns

T13 · Evaluation & Testing

πŸ—“οΈ Milestones Timeline

Dated, field-defining events that shaped agentic & AI engineering.

Date Event Reference
2017-06 Transformer architecture introduced Attention Is All You Need
2020-05 GPT-3 shows in-context learning at scale Language Models are Few-Shot Learners
2020-05 RAG framework introduced RAG for Knowledge-Intensive NLP
2021-06 GitHub Copilot preview launches — first mainstream AI coding assistant GitHub blog
2022-01 Chain-of-Thought prompting Wei et al.
2022-03 InstructGPT / RLHF Ouyang et al.
2022-09 Simon Willison coins "prompt injection" as a durable threat category SW blog
2022-10 ReAct: reasoning + acting agent loop Yao et al.
2022-11 ChatGPT release — mainstream adoption inflection OpenAI
2023-03 GPT-4 release OpenAI
2023-03 HuggingGPT / Toolformer-era tool use Toolformer
2023-03 LangChain & LlamaIndex hit mainstream —
2023-05 Voyager: open-ended agents in Minecraft Voyager
2023-10 SWE-bench released — real-world coding eval SWE-bench
2023-12 Mixture-of-experts open models (Mixtral) Mistral
2024-03 Devin demo — autonomous software agent pitch Cognition
2024-05 GPT-4o: native multi-modal + realtime voice OpenAI
2024-07 SWE-bench Verified launched OpenAI
2024-09 o1 launches the reasoning-model era OpenAI
2024-11 Model Context Protocol (MCP) announced Anthropic
2024-12 Anthropic's "Building effective agents" published Anthropic
2025-02 Claude Code launches (research preview) Anthropic
2025-05 AGENTS.md published as cross-agent standard agents.md
2025-06 GitHub spec-kit / "new code" essays formalise spec-driven dev spec-kit

👥 Communities

Discords, Slacks, forums, and meetups where practitioners gather.

  • 🆓 MLOps Community — Slack + podcast + meetups; the biggest practitioner community at the ops/engineering intersection. Active agent and LLM-ops channels.
  • 🆓 LangChain Discord — Heavy day-to-day Q&A on agent orchestration, RAG, evaluation, MCP.
  • 🆓 LlamaIndex Discord — RAG-centric builder community with active reference-impl discussion.
  • 🆓 r/LocalLLaMA — The definitive open-weights / local-inference forum; fastest signal for new models, quantisation, and serving.
  • 🆓 r/MachineLearning — Academic and practitioner mix; where new papers and threads get dissected.
  • 🆓 Hacker News — Filter for "LLM", "agent", "Claude", "Cursor" — where engineering-side essays trend.
  • 🆓 EleutherAI Discord — Open research community; strong training/interpretability discussion.
  • 🆓 Hugging Face Discord & Forums — Transformers, TRL, PEFT, model-hub discussions.
  • 🆓 AI Engineer World's Fair / Latent Space Discord — Practitioner community anchoring the AI Engineer conference series.
  • 🆓 AI Dev Board — Community-curated hub for AI engineering resources and discussions.
  • 🆓 Cursor Community Forum — User-driven forum for Cursor rules, MCP, and workflows.
  • 🆓 Anthropic Discord — Official Claude / Claude Code / MCP community.

πŸ§‘β€πŸŽ“ Courses

Structured courses β€” free and paid, university and industry.

T1 · Coding Agents & AI-Assisted Development

T4 · SWE Benchmarks & Coding Evaluation

  • 🧑‍🎓 🆓 Evaluating and Debugging Generative AI — DeepLearning.AI + W&B. Covers coding-eval mechanics.
  • 🧑‍🎓 🆓 Mastering LLMs: Evals — Hamel Husain & Shreya Shankar (Maven). Companion evals-for-LLMs curriculum.
  • 🧑‍🎓 🆓 SWE-bench tutorial — Princeton NLP. Free, self-paced walk-through of running and scoring coding evals.

T6 · LLM Application Architecture & System Design

T7 · Prompt Engineering

T8 · Retrieval-Augmented Generation (RAG)

T10 · Tool Use, Function Calling & MCP

T11 · Orchestration, Planning & Design Patterns

T12 · Multi-Agent Systems

T13 · Evaluation & Testing

T14 · Observability, Tracing & Debugging

  • 🧑‍🎓 🆓 LLMOps — DeepLearning.AI + Google Cloud.
  • 🧑‍🎓 🆓 Evaluating LLMs with Arize — Arize course hub.
  • 🧑‍🎓 🆓 LangSmith Academy — LangChain. Free self-paced LangSmith courses covering tracing and evals.

T15 · Guardrails & Security

T16 · Safety, Alignment & Responsible AI

T17 · Fine-tuning, Post-training & RLHF

T18 · Inference, Serving, Cost & Latency

📘 Books

Published and in-progress books covering agentic & AI engineering.

T1 · Coding Agents & AI-Assisted Development

  • ⭐ 📘 💰 AI-Assisted Programming — Tom Taulli (O'Reilly, 2024). Practical coverage of Copilot/Cursor/Claude workflows.
  • 📘 💰 Prompt Engineering for Generative AI — James Phoenix & Mike Taylor (O'Reilly, 2024). Includes heavy coverage of code-generation prompting patterns.

T6 · LLM Application Architecture & System Design

T7 · Prompt Engineering

  • 📘 💰 Prompt Engineering for LLMs — John Berryman & Albert Ziegler (O'Reilly, 2024). From Copilot's original tech lead.
  • 📘 🆓 The Prompt Report — Schulhoff et al. (2024). A 76-page survey that effectively functions as a book-length prompting reference.

T8 · RAG

T10 · Tool Use & MCP

T11 · Orchestration & Design Patterns

T13 · Evaluation

T15 · Guardrails & Security

T16 · Safety, Alignment & Responsible AI

  • 📘 💰 Human Compatible — Stuart Russell (2019). The foundational alignment argument.
  • 📘 💰 The Alignment Problem — Brian Christian (2020). The canonical popular-press primer.

T17 · Fine-tuning & Post-training

T18 · Inference & Serving

T20 · Product & UX

T21 · Economics, Teams & Org

✍️ Articles & Essays

Long-form writing from canonical authors and engineering teams.

T1 · Coding Agents & AI-Assisted Development

T2 · Spec-Driven Development & Context Engineering

T3 · Agent IDE Rules, Memory Files & Workflows

T4 · SWE Benchmarks & Coding Evaluation

T5 · Autonomous Software Agents

T6 · LLM Application Architecture

T7 · Prompt Engineering

T8 · Retrieval-Augmented Generation (RAG)

T9 · Memory Systems & Long-Context

T10 · Tool Use, Function Calling & MCP

T11 · Orchestration & Design Patterns

T12 · Multi-Agent Systems & Coordination

T13 · Evaluation & Testing

T14 · Observability, Tracing & Debugging

T15 · Guardrails & Security

T16 · Safety, Alignment & Responsible AI

T17 · Fine-tuning, Post-training & RLHF

T18 · Inference, Serving, Cost & Latency

T19 · Voice, Multi-modal & Embodied Agents

T20 · Product, UX & Human-AI Interaction

T21 · Economics, Teams, Hiring & Org Design

πŸ› οΈ Tutorials & Cookbooks

Hands-on, code-first guides and official cookbooks from model providers and framework authors.

T1 · Coding Agents & AI-Assisted Development

T2 · Spec-Driven Development

  • 🛠️ 🆓 GitHub spec-kit — The official spec-driven-development toolkit.
  • 🛠️ 🆓 AGENTS.md examples — Example AGENTS.md files for common stacks.

T3 · Agent IDE Rules & Workflows

T5 · Autonomous Software Agents

T6 · LLM Application Architecture

T7 · Prompt Engineering

T8 · Retrieval-Augmented Generation (RAG)

T9 · Memory Systems

T10 · Tool Use & MCP

T11 · Orchestration & Patterns

T12 · Multi-Agent Systems

T13 · Evaluation & Testing

T14 · Observability

T15 · Guardrails & Security

T17 · Fine-tuning & Post-training

T18 · Inference & Serving

T19 · Voice & Multimodal

📋 Playbooks & Design-Pattern Catalogs

Opinionated, prescriptive guides distilling design patterns and operational practices.

📄 Papers & Research

Foundational papers, surveys, and benchmark papers. Includes a dated milestone-papers table.

Milestone Papers

Date Keywords Institution Paper
2017-06 Transformer Google Attention Is All You Need
2018-10 BERT Google BERT: Pre-training of Deep Bidirectional Transformers
2020-05 GPT-3, ICL OpenAI Language Models are Few-Shot Learners
2020-05 RAG Meta RAG for Knowledge-Intensive NLP Tasks
2021-06 LoRA Microsoft LoRA: Low-Rank Adaptation of LLMs
2022-01 CoT Google Chain-of-Thought Prompting
2022-03 InstructGPT / RLHF OpenAI Training LMs to follow instructions with human feedback
2022-10 ReAct Princeton / Google ReAct: Synergizing Reasoning and Acting
2022-12 Constitutional AI Anthropic Constitutional AI
2023-02 Toolformer Meta Toolformer: LMs Can Teach Themselves to Use Tools
2023-03 Reflexion Northeastern Reflexion
2023-03 Self-Refine CMU Self-Refine: Iterative Refinement
2023-05 Tree of Thoughts Princeton Tree of Thoughts
2023-05 QLoRA UW QLoRA: Efficient Finetuning of Quantized LLMs
2023-05 Voyager NVIDIA / Caltech Voyager: Open-Ended Embodied Agent
2023-05 DPO Stanford DPO: Your LM Is Secretly a Reward Model
2023-06 LLM-as-Judge UC Berkeley Judging LLM-as-a-Judge
2023-07 Generative Agents Stanford / Google Generative Agents: Interactive Simulacra
2023-07 Lost in the Middle Stanford Lost in the Middle
2023-07 GCG CMU Universal and Transferable Adversarial Attacks
2023-09 Agent survey Fudan The Rise and Potential of LLM-based Agents
2023-10 SWE-bench Princeton SWE-bench: Can LMs Resolve Real-World Issues?
2023-10 AutoGen Microsoft AutoGen: Enabling Multi-Agent Conversations
2023-11 GAIA Meta / HF GAIA: Benchmark for General AI Assistants
2023-12 RAG Survey Tongji RAG for LLMs: A Survey
2024-02 SWE-agent Princeton SWE-agent: Agent-Computer Interfaces
2024-05 Many-shot jailbreaking Anthropic Many-shot Jailbreaking
2024-06 Prompt Report Maryland The Prompt Report
2024-06 τ-bench Sierra τ-bench: Tool-Agent-User benchmark
2024-09 o1 / reasoning OpenAI Learning to Reason with LLMs

T1 · Coding Agents & T4 · SWE Benchmarks

T5 · Autonomous SWE Agents

T6 · App Architecture

T7 · Prompt Engineering

T8 · RAG

T9 · Memory

T10 · Tool Use & MCP

T11 · Orchestration & Patterns

T12 · Multi-Agent

T13 · Evaluation

T14 · Observability

T15 · Guardrails & Security

T16 · Safety & Alignment

T17 · Fine-tuning & Post-training

T18 · Inference & Serving

T19 · Voice & Multimodal

T20 · Product & UX

🧪 Benchmarks & Leaderboards

Public benchmarks and leaderboards for coding agents, tool use, RAG, evaluation, and more.

T1 / T4 · Coding Agents & SWE Benchmarks

  • ⭐ 🧪 🆓 SWE-bench — Real-world GitHub-issue resolution benchmark; Verified subset is the de-facto industry standard.
  • 🧪 🆓 Terminal-Bench — Stanford / Laude. Long-horizon terminal task benchmark.
  • 🧪 🆓 LiveCodeBench — Rolling contamination-free coding benchmark.
  • 🧪 🆓 BigCodeBench — Practical programming with diverse function calls.
  • 🧪 🆓 HumanEval+ / EvalPlus — Strengthened HumanEval.
  • 🧪 🆓 MLE-bench — OpenAI. Kaggle-style ML engineering benchmark.

T5 · Autonomous Agents

  • 🧪 🆓 GAIA — General AI Assistants benchmark.
  • 🧪 🆓 AgentBench — Tsinghua. Broad agent capability benchmark.
  • 🧪 🆓 WebArena / VisualWebArena — Web-navigation agents.
  • 🧪 🆓 OSWorld — Desktop OS-controlling agents.
  • 🧪 🆓 MLE-bench — ML-engineering agents.

T8 · RAG

  • 🧪 🆓 RAGAS — Framework and leaderboard for RAG eval.
  • 🧪 🆓 MTEB — Massive Text Embedding Benchmark.
  • 🧪 🆓 BEIR — Zero-shot IR benchmark.
  • 🧪 🆓 ARES — Automated RAG evaluation.
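
Retrieval benchmarks such as BEIR and MTEB ultimately aggregate ranking metrics over labelled query-document pairs; recall@k is the simplest of these. A self-contained sketch:

```python
# recall@k: the fraction of a query's relevant documents that appear in
# the top-k retrieved results.

def recall_at_k(retrieved, relevant, k):
    """retrieved: doc ids in ranked order; relevant: set of relevant doc ids."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)
```

Benchmark suites average this (and related metrics like nDCG@k) over thousands of queries per dataset.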

T10 · Tool Use & Function Calling

T11 · Orchestration / T12 · Multi-Agent

  • 🧪 🆓 AgentBench — General agent-capability.
  • 🧪 🆓 AgentBoard — HKUST. Analytic, fine-grained agent eval.

T13 · Evaluation

  • 🧪 🆓 HELM — Stanford CRFM. Holistic evaluation.
  • 🧪 🆓 Chatbot Arena / LMSYS Arena — Human-preference leaderboard.
  • 🧪 🆓 MMLU-Pro — Harder MMLU.
  • 🧪 🆓 MT-Bench — LLM-as-judge multi-turn.
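
Chatbot Arena builds its leaderboard from pairwise human preferences; the classic Elo update it popularised (the live leaderboard has since moved to related Bradley-Terry fitting) is only a few lines. A sketch, with the K-factor chosen for illustration:

```python
# One Elo update from a single pairwise comparison: `score` is 1.0 if
# model A won, 0.0 if it lost, 0.5 for a tie. K controls step size.

def elo_update(rating_a, rating_b, score, k=32):
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    new_a = rating_a + k * (score - expected_a)
    new_b = rating_b + k * ((1 - score) - (1 - expected_a))
    return new_a, new_b
```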

T15 · Guardrails & Security

T16 · Safety & Alignment

  • 🧪 🆓 TruthfulQA — Truthfulness benchmark.
  • 🧪 🆓 BBQ — Bias benchmark.
  • 🧪 🆓 ToxiGen — Toxicity.

T18 · Inference

  • 🧪 🆓 MLPerf Inference — MLCommons. Industry-standard serving benchmark.
  • 🧪 🆓 LLMPerf — Anyscale. Throughput/latency tool.

T19 · Voice & Multimodal

  • 🧪 🆓 MMMU — Multimodal multidiscipline benchmark.
  • 🧪 🆓 VideoMME — Video understanding.
  • 🧪 🆓 Dynabench speech — Live speech-model benchmarks.

πŸ—οΈ Reference Implementations & Case Studies

Public production write-ups and canonical reference repositories that teach by example.

T1 / T3 · Coding Agents & IDE Rules

  • ⭐ 🏗️ 🆓 Claude Code — Anthropic's reference agentic CLI.
  • 🏗️ 🆓 Aider — Reference terminal coding agent with detailed engineering blog.
  • 🏗️ 🆓 Cline — Open-source autonomous coding agent.
  • 🏗️ 🆓 OpenHands — All Hands AI. Open-source autonomous SWE agent.

T2 · Spec-Driven Dev

  • 🏗️ 🆓 GitHub spec-kit — Reference spec-driven toolkit.

T5 · Autonomous SWE Agents

  • 🏗️ 🆓 SWE-agent — Princeton NLP. Reference agent for SWE-bench.
  • 🏗️ 🆓 AutoCodeRover — NUS.
  • 🏗️ 🆓 Agentless — Minimal agentless baseline that beat prior agents on SWE-bench Lite.

T6 · App Architecture

  • 🏗️ 🆓 Open Interpreter — Reference local code-execution agent.
  • 🏗️ 🆓 Quivr — Reference full-stack RAG assistant.
  • 🏗️ 🆓 LangChain templates — Reference app scaffolds.

T8 · RAG

  • ⭐ 🏗️ 🆓 LlamaIndex — Reference RAG framework; docs double as case studies.
  • 🏗️ 🆓 RAGFlow — Production-grade RAG reference.
  • 🏗️ 🆓 Verba — Weaviate reference RAG app.
  • 🏗️ 🆓 GraphRAG — Microsoft Research.

T9 · Memory

  • 🏗️ 🆓 Letta (MemGPT) — Reference agentic-memory implementation.
  • 🏗️ 🆓 Mem0 — Reference memory layer.
  • 🏗️ 🆓 Zep — Long-term memory store.

T10 · Tool Use & MCP

T11 / T12 · Orchestration & Multi-Agent

  • 🏗️ 🆓 LangGraph — Reference graph-based orchestration.
  • 🏗️ 🆓 AutoGen — Microsoft.
  • 🏗️ 🆓 CrewAI — Reference role-based multi-agent.
  • 🏗️ 🆓 Pydantic AI — Type-safe agent framework.

T13 · Evaluation

T14 · Observability

  • 🏗️ 🆓 Langfuse — Open-source LLM observability.
  • 🏗️ 🆓 Arize Phoenix — Open-source tracing + evals.
  • 🏗️ 🆓 OpenLLMetry — OTel-based LLM instrumentation.

T15 · Guardrails & Security

  • 🏗️ 🆓 Guardrails AI — Reference guardrails framework.
  • 🏗️ 🆓 NVIDIA NeMo Guardrails — Programmable guardrails.
  • 🏗️ 🆓 Rebuff — Prompt-injection defence reference.

T17 · Fine-tuning

  • 🏗️ 🆓 Unsloth — Fast LoRA/QLoRA reference.
  • 🏗️ 🆓 Axolotl — Reference fine-tuning framework.
  • 🏗️ 🆓 LLaMA-Factory — Unified fine-tuning toolkit.
  • 🏗️ 🆓 Hugging Face alignment-handbook — Reference RLHF/DPO recipes.

T18 · Inference & Serving

  • ⭐ 🏗️ 🆓 vLLM — Reference high-throughput LLM serving.
  • 🏗️ 🆓 SGLang — Structured generation serving.
  • 🏗️ 🆓 llama.cpp — Reference CPU/GPU local inference.
  • 🏗️ 🆓 TensorRT-LLM — NVIDIA reference optimised serving.

T19 · Voice & Multimodal

  • 🏗️ 🆓 LiveKit Agents — Voice-agent reference.
  • 🏗️ 🆓 Pipecat — Daily's voice-agent framework.
  • 🏗️ 🆓 Ultravox — Real-time speech LM.

T20 · Product & UX

  • 🏗️ 🆓 Vercel AI SDK — Reference AI-UI patterns and streaming.
  • 🏗️ 🆓 Open WebUI — Reference local chat UI.
  • 🏗️ 🆓 assistant-ui — Reference React components for AI chat.

🎥 Talks, Workshops & Conferences

Recorded talks, workshops, and conference series worth watching.

Conference series

  • ⭐ 🎥 🆓 AI Engineer Summit / World's Fair — The definitive practitioner conference; full talks on YouTube.
  • 🎥 🆓 NeurIPS / ICML / ICLR — Core ML research venues; most papers include recorded talks.
  • 🎥 🆓 COLM — Conference on Language Modeling. New dedicated LM venue.
  • 🎥 🆓 MLSys — Core ML-systems conference (inference, serving).
  • 🎥 🆓 LlamaCon — Meta's open-source LLM conference.

Canonical talks

T1 · Coding Agents

T4 · SWE Benchmarks

T6 · App Architecture

T7 · Prompt Engineering

T8 · RAG

T10 · MCP

T11 / T12 · Orchestration & Multi-Agent

T13 · Evaluation

T14 · Observability

T15 / T16 · Security & Safety

T17 · Fine-tuning

T18 · Inference

T19 · Voice & Multimodal

T20 · Product & UX

T21 · Economics & Teams

🎧 Podcasts

Recurring podcasts with strong agentic & AI-engineering coverage.

  • ⭐ 🎧 🆓 Latent Space — swyx & Alessio. The AI-engineering podcast of record; guests include most major AI-lab engineers.
  • ⭐ 🎧 🆓 Practical AI — Daniel Whitenack & Chris Benson. Long-running, practitioner-first.
  • 🎧 🆓 MLOps Community podcast — Demetrios Brinkmann. Ops-side operationalisation case studies.
  • 🎧 🆓 Gradient Dissent — Weights & Biases. Applied-ML interviews.
  • 🎧 🆓 The TWIML AI Podcast — Sam Charrington. Longest-running ML interview series.
  • 🎧 🆓 No Priors — Sarah Guo & Elad Gil. Founders / researchers.
  • 🎧 🆓 Cognitive Revolution — Nathan Labenz. Weekly AI engineering + strategy.
  • 🎧 🆓 Dwarkesh Podcast — Dwarkesh Patel. Deep interviews with top researchers.
  • 🎧 🆓 Machine Learning Street Talk — Tim Scarfe. Technical deep-dives.
  • 🎧 🆓 Lex Fridman Podcast — Long-form interviews with AI-lab CEOs and researchers.
  • 🎧 🆓 Unsupervised Learning — Redpoint. AI-founder / operator conversations.
  • 🎧 🆓 Interconnects — Nathan Lambert. RLHF / post-training focus.
  • 🎧 🆓 Pragmatic Engineer — Gergely Orosz. AI-engineering org/hiring coverage.

📰 Newsletters

Weekly and monthly curated newsletters.

  • ⭐ 📰 🆓 The Batch — Andrew Ng / DeepLearning.AI. Weekly AI-engineering digest.
  • ⭐ 📰 🆓 Import AI — Jack Clark (Anthropic co-founder). Policy + research.
  • ⭐ 📰 🆓 Latent Space — swyx. The AI-engineering newsletter of record.
  • 📰 🆓 Simon Willison's Weblog — RSS/email. Daily real-time coverage of tools and agents.
  • 📰 🆓 Ahead of AI — Sebastian Raschka. LLM research + fine-tuning deep-dives.
  • 📰 🆓 The Pragmatic Engineer — Gergely Orosz. AI-engineering hiring/org coverage.
  • 📰 🆓 Interconnects — Nathan Lambert. RLHF / post-training.
  • 📰 🆓 Last Week in AI — Weekly recap.
  • 📰 🆓 TLDR AI — Daily headlines.
  • 📰 🆓 Ben's Bites — Daily digest; founder-friendly.
  • 📰 🆓 Chip Huyen's Blog — Occasional long-form on AI engineering.
  • 📰 🆓 Eugene Yan — Pattern / eval / RAG deep-dives.
  • 📰 🆓 Hamel's Blog — Evals + applied LLMs.
  • 📰 🆓 Machine Learning Engineer Newsletter — Alejandro Saucedo. Weekly production-ML curation.
  • 📰 🆓 MLOps Community newsletter — MLOps Community.
  • 📰 🆓 The Data Exchange — Ben Lorica.

πŸ›‘οΈ Governance, Safety & Responsible AI

Policy frameworks, safety research, red-teaming resources, and responsible-AI guidance.

Policy & frameworks

Lab safety & responsible scaling

Security & red-teaming

Responsible AI practice

Papers & research

🎨 Product, UX & Economics of AI

Going beyond engineering: designing for AI, human-AI interaction, and the economics of LLM applications.

Design & UX

Economics & business

Product strategy

πŸ§‘β€πŸ€β€πŸ§‘ Teams, Hiring & Org Design

How organisations structure AI-engineering work, hire for it, and operate sustainably.


How to suggest a resource

Please use one of the issue templates (resource suggestion, broken link, or trending nomination) or open a pull request following the guidance in CONTRIBUTING.md. The curation methodology and update cadence are documented in NOTES.md.

Update cadence

Weekly: PR triage and broken-link fixes. Monthly: trending rotation and new-resource batches. Quarterly: full thoroughness pass against the checklist in NOTES.md.

License

CC0 — To the extent possible under law, the contributors have waived all copyright and related or neighboring rights to this work.
