feat: eval pipeline run metadata and comparison #29

Description

@rajnavakoti

Add run identity, timestamps, and comparison capability to the eval pipeline and viewer. Currently, eval results are flat JSON arrays with no run metadata, so users can't tell when an eval was run, see what changed between runs, or track coverage improvement over time.

Acceptance Criteria

  • Eval pipeline writes run metadata (run_id, timestamp, domain, source_mix, entity_count_at_time) to eval output
  • Viewer loads multiple eval result files and presents them as named runs
  • Run comparison mode shows CLEAN/INCOMPLETE/MISSING deltas between two runs
  • Coverage trend visualization (sparkline or small chart) when 3+ runs exist
  • Run selector dropdown in Evals page header
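
As a rough illustration of the first three criteria, here is a minimal Python sketch of a run envelope and a two-run comparison. The metadata field names come from the criteria above; everything else is an assumption made for the example: the helper names (`wrap_run`, `status_counts`, `compare_runs`, `coverage`), the `results` key, a per-result `status` field holding CLEAN/INCOMPLETE/MISSING, and `source_mix` being a dict. The actual pipeline schema may differ.

```python
import uuid
from collections import Counter
from datetime import datetime, timezone

STATUSES = ("CLEAN", "INCOMPLETE", "MISSING")


def wrap_run(results, domain, source_mix, entity_count):
    """Wrap a flat list of eval results in a run envelope (hypothetical schema)."""
    return {
        "run_id": uuid.uuid4().hex,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "domain": domain,
        "source_mix": source_mix,  # assumed to be a dict of source -> share
        "entity_count_at_time": entity_count,
        "results": results,  # assumed: each result carries a "status" field
    }


def status_counts(run):
    """Count results per status, including statuses with zero results."""
    counts = Counter(r["status"] for r in run["results"])
    return {s: counts.get(s, 0) for s in STATUSES}


def compare_runs(base, candidate):
    """Per-status delta: candidate count minus base count."""
    before, after = status_counts(base), status_counts(candidate)
    return {s: after[s] - before[s] for s in STATUSES}


def coverage(run):
    """Fraction of CLEAN results; 0.0 for an empty run."""
    total = len(run["results"])
    return status_counts(run)["CLEAN"] / total if total else 0.0


if __name__ == "__main__":
    run_a = wrap_run(
        [{"id": "q1", "status": "MISSING"}, {"id": "q2", "status": "CLEAN"}],
        domain="demo", source_mix={"docs": 1.0}, entity_count=10,
    )
    run_b = wrap_run(
        [{"id": "q1", "status": "CLEAN"}, {"id": "q2", "status": "CLEAN"}],
        domain="demo", source_mix={"docs": 1.0}, entity_count=12,
    )
    print(compare_runs(run_a, run_b))  # {'CLEAN': 1, 'INCOMPLETE': 0, 'MISSING': -1}
    print([coverage(r) for r in (run_a, run_b)])  # [0.5, 1.0]
```

Computing `coverage` for each run in timestamp order would give the points for the trend sparkline once three or more runs exist. Wrapping the results in an object, rather than appending metadata to the array, keeps existing flat result items unchanged, though the viewer would still need to handle older files that lack the envelope.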

Out of Scope

  • Multi-domain support (separate issue)
  • Retrieval quality vs KB quality distinction (separate issue)
  • Automated scheduled eval runs
