Agentic RAG Legal Challenge 2026

Evidence-first legal question answering over DIFC regulations, case law, and contracts.

Built for the Agentic RAG Legal Challenge at Dubai AI Week & Machines Can See 2026.

A staged LangGraph pipeline that answers 900 legal questions over 300 DIFC documents with page-level retrieval, anchor-based filtering, and faithfulness guardrails. Developed over a 12-day competition sprint (March 10--22, 2026) with 1,598 commits, culminating in a multi-agent coordination phase where six AI agents worked in parallel on the final push.

Tip

For the full write-up on what worked, what didn't, and the infrastructure-over-retrieval finding, see the blog article.

Results

Metric	Value
Questions answered	900
Documents processed	300
Total pages indexed	10,109 (+209% via registry enrichment)
Null answers	3
No-page-grounding answers	3
Corrections applied	103+ (56 DOI dates, 11 comparisons, 32 booleans, 16 numbers)
Eval versions completed	V9.1 through V17
Total commits	1,598

Key Findings

Infrastructure beats retrieval. The single biggest insight from the competition: investing in evaluation infrastructure, submission tooling, and automated correction pipelines produced more score improvement than any retrieval technique change.

87.5% ablation rejection rate. Seven of eight retrieval enhancements tested during the competition were rejected after ablation testing. BM25-only retrieval, RAG Fusion, HyDE, FlashRank, Isaacus EQA, step-back prompting, and citation-first retrieval all failed to improve over the baseline. Only the anchor-based page filtering survived.

TTFT was the biggest remaining lever. Late in the competition, time-to-first-token analysis revealed a +23.5% potential swing -- larger than any retrieval optimization still on the table.

DB short-circuit for metadata questions. 167 of 900 questions were answerable from corpus metadata alone (document titles, dates, registry entries), resolved in <50ms without an LLM call.

Architecture

Every query flows through a staged LangGraph pipeline with page-first retrieval and anchor-based filtering:

graph LR
    Q([Query]) --> Classify
    Classify --> DB["DB Answerer<br/>(short-circuit)"]
    Classify --> PageRetrieval["Page-Level Retrieval"]
    PageRetrieval --> AnchorFilter["Anchor Filter"]
    AnchorFilter --> Rerank
    Rerank --> Generate
    Generate --> Verify
    Verify --> Emit
    Emit --> Finalize
    DB --> Finalize
    Finalize --> A([Answer + Telemetry])

graph TD
    subgraph Retrieval
        Hybrid["Hybrid Search<br/>Dense + BM25"]
        Kanon["Kanon 2 Embedder<br/>1792 dims"]
        Qdrant["Qdrant<br/>10,109 points"]
        Hybrid --> Kanon
        Kanon --> Qdrant
    end

    subgraph Filtering
        Anchor["Anchor Filter<br/>Legal refs, entities,<br/>cross-references"]
        Rerank["Zerank 2<br/>Cohere fallback"]
        Anchor --> Rerank
    end

    subgraph Generation
        Router["Model Router"]
        Mini["gpt-4.1-mini<br/>Strict types"]
        Full["gpt-4.1<br/>Free text"]
        Router --> Mini
        Router --> Full
    end

    subgraph Verification
        Premise["Premise Guard"]
        Conflict["Conflict Detection"]
        Citation["Citation Verification"]
        PostProc["Post-processing"]
    end

    Qdrant --> Anchor
    Rerank --> Router
    Full --> Premise
    Mini --> Premise
    Premise --> Conflict --> Citation --> PostProc

Design Choices

DB answerer short-circuit: 167 metadata-answerable questions resolved via corpus registry lookup in <50ms
Page-first retrieval: Hybrid search (dense + BM25) at page level, then anchor-based filtering for precision
Legal-domain embeddings: Kanon 2 Embedder for DIFC-heavy terminology and statute references
Anchor filtering: Extracts legal references, entity names, and cross-references to localize relevant pages
Reranking: Zerank 2 with Cohere Rerank fallback for final page ordering
Model routing: gpt-4.1-mini for strict answer types, gpt-4.1 for complex free-text reasoning
Faithfulness guardrails: Premise guard, conflict detection, citation verification, and final post-processing bound to answer_final
Telemetry bound to the final answer: used_page_ids, cited_page_ids, per-stage timings, and model metadata in every response

Private phase corpus: 300 documents, 900 questions, 10,307 Qdrant points at 1,792 dims (Kanon-2).

Quick Start

# 1. Clone and configure
git clone https://github.com/chernistry/shafi && cd shafi
cp .env.example .env
# Put machine-specific secrets in .env.local

# 2. Start the local stack (API + Qdrant)
docker compose up --build -d

# 3. Ingest the DIFC corpus
docker compose --profile tools run --rm ingest

# 4. Ask a question
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "Which laws are administered by the Registrar?"}'

The default docker compose setup brings up:

qdrant locally, with health checks
api with warmup enabled by default
ingest and eval as tool-profile services inside the same Docker network

No extra QDRANT_URL juggling is required for the default local workflow.

Environment Contract

Host-local uv run ... and other shell commands use QDRANT_URL=http://localhost:6333
Docker Compose service containers always use QDRANT_URL=http://qdrant:6333 internally
Local config precedence: process env > .env.local > .env > code defaults
Keep shared defaults in .env; keep workstation-specific overrides and secrets in .env.local

Stack

Layer	Choice
API	FastAPI + SSE
Orchestration	LangGraph
Embeddings	Kanon 2 Embedder (1,792 dims)
Vector DB	Qdrant hybrid search
Reranker	Zerank 2, Cohere fallback
Complex LLM	`gpt-4.1`
Strict/simple LLM	`gpt-4.1-mini`
Validation	Pydantic v2 + Pyright strict
Tooling	`uv`, `ruff`, `pytest`, Docker Compose

Evaluation

Run the harness locally against the Dockerized API:

docker compose --profile tools run --rm eval \
  python -m shafi.eval.harness \
  --golden dataset/public_dataset.json \
  --endpoint http://api:8000/query \
  --concurrency 4 \
  --emit-cases \
  --judge \
  --judge-scope free_text \
  --judge-docs-dir dataset/dataset_documents \
  --judge-out data/judge_run.jsonl \
  --out data/eval_run.json

Tracked checks:

Full public-set correctness
free_text judge pass rate, accuracy, grounding, clarity
Citation coverage
Answer-type format compliance
Document-retrieval diagnostics
TTFT distribution
Per-stage latency (classify, embed, qdrant, rerank, llm, verify)
Platform submission projection and archive compliance

Platform Submission

The repository has a separate platform-native submission path for warm-up and final runs.

Submission workflow

# 1. Start local infrastructure
docker compose up --build -d qdrant

# 2. Preflight: build and audit the curated code archive
docker compose --profile tools run --rm eval \
  python -m shafi.submission.platform --archive-only

# 3. Build a warm-up submission package
docker compose --profile tools run --rm eval \
  python -m shafi.submission.platform

# 4. Inspect generated artifacts
#   - platform_runs/<phase>/submission.json
#   - platform_runs/<phase>/preflight_summary.json
#   - platform_runs/<phase>/code_archive.zip
#   - platform_runs/<phase>/code_archive_audit.json

# 5. Submit the inspected artifact and poll status
docker compose --profile tools run --rm eval \
  python -m shafi.submission.platform \
    --submit-existing \
    --submission-path platform_runs/warmup/submission.json \
    --code-archive-path platform_runs/warmup/code_archive.zip \
    --poll

The flow downloads phase-specific documents, ingests them into a phase-specific Qdrant collection, then downloads questions and runs the RAG pipeline. Documents are ingested before questions to comply with the "no question-aware indexing" rule.

Development

uv sync --extra dev       # install dev dependencies
make lint                 # ruff check src tests scripts
make typecheck            # pyright strict mode
make test                 # pytest tests/unit
make all                  # lint + typecheck + test
make format               # auto-fix with ruff

Script index: docs/scripts_index.md

See CONTRIBUTING.md for the full development guide.

Building Legal Answering Systems -- Full write-up covering the competition experience, architecture decisions, the infrastructure-over-retrieval finding, and the multi-agent sprint
Bernstein -- AI coding agent orchestrator born from this competition's multi-agent coordination patterns
Competition Page -- Agentic RAG Legal Challenge at Dubai AI Week & Machines Can See 2026

License

AGPL-3.0 1.0.0 -- Free for non-commercial use. Commercial licensing: alex@alexchernysh.com. See COMMERCIAL-LICENSING.md.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
.github		.github
dashboard		dashboard
data		data
docs		docs
models/page_scorer		models/page_scorer
profiles		profiles
scripts		scripts
src/shafi		src/shafi
tests		tests
typings		typings
.dockerignore		.dockerignore
.env.example		.env.example
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
CHANGELOG.md		CHANGELOG.md
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
Dockerfile		Dockerfile
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
SECURITY.md		SECURITY.md
docker-compose.yml		docker-compose.yml
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Agentic RAG Legal Challenge 2026

Evidence-first legal question answering over DIFC regulations, case law, and contracts.

Results

Key Findings

Architecture

Design Choices

Quick Start

Environment Contract

Stack

Evaluation

Platform Submission

Development

Related

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Agentic RAG Legal Challenge 2026

Evidence-first legal question answering over DIFC regulations, case law, and contracts.

Results

Key Findings

Architecture

Design Choices

Quick Start

Environment Contract

Stack

Evaluation

Platform Submission

Development

Related

License

About

Topics

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages