Description
Add run identity, timestamps, and comparison capability to the eval pipeline and viewer. Currently, eval results are flat JSON arrays with no run metadata, so users can't tell when an eval was run or what changed between runs, and can't track coverage improvement over time.
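To make the intent concrete, here is a minimal sketch of one possible shape for this, not a proposed final schema. It wraps the existing flat result array in a run envelope and compares two runs. The field names `run_id`, `started_at`, `results`, `id`, and `covered`, and the helpers `wrap_run`, `coverage`, and `compare_runs`, are all hypothetical; the real result records may use different keys.

```python
import uuid
from datetime import datetime, timezone


def wrap_run(results, run_id=None, started_at=None):
    """Wrap a flat list of eval results in a run envelope (hypothetical schema)."""
    return {
        "run_id": run_id or uuid.uuid4().hex,
        "started_at": started_at or datetime.now(timezone.utc).isoformat(),
        "results": results,
    }


def coverage(run):
    """Fraction of results flagged as covered; 0.0 for an empty run."""
    results = run["results"]
    if not results:
        return 0.0
    return sum(1 for r in results if r.get("covered")) / len(results)


def compare_runs(old, new):
    """Report the coverage change and which result ids flipped between two runs."""
    # Assumes each result carries a stable "id" so runs can be aligned.
    old_by_id = {r["id"]: r.get("covered", False) for r in old["results"]}
    new_by_id = {r["id"]: r.get("covered", False) for r in new["results"]}
    shared = old_by_id.keys() & new_by_id.keys()
    return {
        "coverage_delta": coverage(new) - coverage(old),
        "gained": sorted(i for i in shared if new_by_id[i] and not old_by_id[i]),
        "lost": sorted(i for i in shared if old_by_id[i] and not new_by_id[i]),
    }
```

Comparison only works if results can be aligned across runs, so a stable per-result identifier is assumed here; if the current records lack one, that would need to be part of this work.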
Acceptance Criteria
Out of Scope
- Multi-domain support (separate issue)
- Retrieval quality vs KB quality distinction (separate issue)
- Automated scheduled eval runs