Research Spike: Prime Radiant & Validation Approaches for Autonomous XP Agents
Executive Summary
Goal: Identify and evaluate validation approaches for autonomous Extreme Programming agents that ensure domain depth and alignment to requirements/acceptance criteria — not just code that compiles.
Context: Forge currently uses 8 agents in parallel with 7 quality gates. This research explores advanced validation mechanisms to ensure agents produce code with genuine domain understanding and behavioral correctness.
1. Prime Radiant Concept
Origin & Metaphor
Prime Radiant comes from Isaac Asimov's Foundation series — a device used by psychohistorians to:
- Predict future societal trends based on mathematical models
- Validate predictions against actual outcomes
- Continuously update models as new data arrives
- Display complex multi-dimensional data for human review
Application to Autonomous Development
In software development context, a "Prime Radiant" validation system would:
- Predict Expected Behaviors from requirements/specs
- Validate Implementations against predicted behaviors
- Learn from Deviations to improve future predictions
- Visualize Domain Models for human review
- Detect Drift between specification and implementation over time
Key Principles
- ✅ Predictive validation — Know what should exist before checking if it does
- ✅ Multi-dimensional verification — Code, behavior, domain, contracts
- ✅ Continuous learning — Update understanding based on outcomes
- ✅ Human-readable — Domain experts can review and validate
- ✅ Drift detection — Catch spec/implementation divergence early
2. Current State: Forge's Validation Approach
Existing Mechanisms
- Gherkin Behavioral Specs — Human-readable acceptance criteria
- 7 Quality Gates — Functional, behavioral, coverage, security, a11y, resilience, contract
- Confidence-Tiered Fixes — Platinum/Gold/Silver/Bronze patterns from experience
- Defect Prediction — Historical failure data + file changes
- LLM-as-Judge (Implicit) — Agents evaluate each other's work
Gaps & Limitations
❌ Domain model validation — No explicit check that code reflects domain concepts
❌ Requirement traceability — No systematic mapping: requirement → implementation → test
❌ Intent preservation — Can't verify "why" behind implementation choices
❌ Cross-cutting concerns — Limited validation of architectural principles
❌ Semantic drift — No ongoing validation that implementation stays aligned with domain
3. Alternative Validation Approaches
3.1 Domain-Driven Design Validation
Concept: Validate that code implements domain concepts correctly, not just passes tests.
Ubiquitous Language Checker
```yaml
domain_model:
  entities:
    - Trip (aggregate root)
    - Booking (value object)
    - Seat (value object)
  invariants:
    - Trip.availableSeats >= 0
    - SUM(Booking.seats WHERE status='accepted') <= Trip.capacity
  bounded_contexts:
    - identity
    - payments
    - logistics
  validation:
    - code_uses_domain_terms: true      # "Trip" not "Journey", "Booking" not "Reservation"
    - invariants_enforced: true         # Check runtime + tests enforce invariants
    - bounded_context_isolation: true   # No cross-context coupling
```
Implementation:
- Extract domain model from Gherkin + ADRs
- Parse code to find classes/types
- Validate: naming alignment, invariant enforcement, context boundaries
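The naming-alignment step can be sketched in a few lines. The domain terms, the forbidden synonyms, and the `class`-name regex heuristic below are illustrative assumptions, not Forge internals:

```python
import re

# Hypothetical ubiquitous language extracted from Gherkin + ADRs.
FORBIDDEN_SYNONYMS = {"Journey": "Trip", "Reservation": "Booking"}

def check_ubiquitous_language(source: str) -> list[str]:
    """Flag class names that use synonyms instead of domain terms."""
    violations = []
    for match in re.finditer(r"class\s+(\w+)", source):
        name = match.group(1)
        for synonym, term in FORBIDDEN_SYNONYMS.items():
            if synonym in name:
                violations.append(f"{name}: use '{term}' instead of '{synonym}'")
    return violations

code = "class JourneyService: ...\nclass BookingRepository: ..."
print(check_ubiquitous_language(code))
# JourneyService is flagged; BookingRepository uses the domain term and passes
```

A real checker would use a language-aware parser rather than a regex, which is exactly the "language-dependent parsing" weakness noted below.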
Strengths:
- ✅ Ensures code reflects domain thinking
- ✅ Catches semantic drift (wrong abstractions)
- ✅ Validates business rules, not just behavior
Weaknesses:
- ❌ Requires explicit domain model
- ❌ Hard to automate (language-dependent parsing)
Rating: ⭐⭐⭐⭐⭐ (5/5) — Essential for domain-rich applications
3.2 Specification-by-Example Validation
Concept: Generate executable examples from requirements, then verify code satisfies them.
Example-Driven Verification
```gherkin
# From Gherkin spec
Given I have a trip with 4 available seats
When a passenger requests 2 seats
Then available seats should be 2
```

```text
# Auto-generated property test
Property: forAll trips, forAll valid requests:
  approveBooking(trip, request) =>
    trip.availableSeats == original - request.seats
```
Process:
- Parse Gherkin scenarios
- Generate property-based tests from scenarios
- Run 1000+ random examples per property
- Validate: all scenarios hold for ALL inputs, not just the happy path
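The generated property can be approximated with a hand-rolled random check (a library such as Hypothesis would normally generate the inputs); `approve_booking` here is a stand-in for the implementation under test:

```python
import random

def approve_booking(available_seats: int, requested: int) -> int:
    """Stand-in implementation under test: reduce available seats."""
    if requested > available_seats:
        raise ValueError("insufficient seats")
    return available_seats - requested

# Property derived from the Gherkin scenario: for ALL valid requests,
# seats decrease by exactly the requested amount and never go negative.
rng = random.Random(42)  # seeded for reproducibility
for _ in range(1000):
    seats = rng.randint(0, 100)
    request = rng.randint(0, seats)  # valid requests only
    remaining = approve_booking(seats, request)
    assert remaining == seats - request
    assert remaining >= 0
print("1000 random cases hold")
```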
Strengths:
- ✅ Comprehensive (thousands of test cases from one spec)
- ✅ Finds edge cases
- ✅ Validates intent, not just examples
Weaknesses:
- ❌ Slow execution
- ❌ Requires property formulation skill
Rating: ⭐⭐⭐⭐⭐ (5/5) — Already proven effective (Forge #5)
3.3 Contract-First Validation
Concept: Define contracts upfront, validate both frontend/backend implement them correctly.
Contract Registry
```yaml
contract: CreateTripRequest
fields:
  - origin: {type: Location, required: true}
  - destination: {type: Location, required: true}
  - departureTime: {type: DateTime, required: true}
  - availableSeats: {type: PositiveInt, required: true}
frontend_model: mobile/lib/models/trip.dart
backend_handler: backend/src/api/trips.rs
validation:
  - frontend_can_serialize: true
  - backend_can_deserialize: true
  - field_names_match: true
  - types_compatible: true
```
Implementation:
- Define OpenAPI/AsyncAPI contracts
- Generate types for frontend + backend
- Validate: both sides implement the contract correctly
- Test: real API calls match the contract
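A minimal sketch of the registry check, assuming the contract and each side's generated model are available as simple field-to-type maps (the names mirror the registry above and are illustrative):

```python
# Hypothetical contract registry entry for CreateTripRequest.
CONTRACT = {
    "origin": "Location",
    "destination": "Location",
    "departureTime": "DateTime",
    "availableSeats": "PositiveInt",
}

def validate_model(model_fields: dict[str, str]) -> list[str]:
    """Compare one side's model (frontend or backend) against the contract."""
    issues = []
    for field, ftype in CONTRACT.items():
        if field not in model_fields:
            issues.append(f"missing field: {field}")
        elif model_fields[field] != ftype:
            issues.append(f"type mismatch on {field}: {model_fields[field]} != {ftype}")
    issues += [f"extra field: {f}" for f in model_fields if f not in CONTRACT]
    return issues

frontend = {"origin": "Location", "destination": "Location",
            "departureTime": "DateTime", "availableSeats": "int"}
print(validate_model(frontend))  # flags the availableSeats type mismatch
```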
Strengths:
- ✅ Prevents frontend/backend mismatches
- ✅ Single source of truth
- ✅ Catches integration issues early
Weaknesses:
- ❌ Upfront design overhead
- ❌ Contract changes require coordination
Rating: ⭐⭐⭐⭐⭐ (5/5) — Critical for microservices/API-driven apps
3.4 Architectural Decision Records (ADR) Enforcement
Concept: Encode architectural constraints as executable rules, block violations automatically.
ADR Validator
```yaml
# ADR: No direct database access from frontend
adr_001:
  title: Separate frontend/backend data access
  decision: Frontend uses only API endpoints, never direct DB
  validation_command: |
    find mobile/ -type f -name "*.dart" -exec grep -l "DatabaseConnection\|executeQuery" {} \;
    # Should return 0 results
  enforcement: blocking
  severity: critical

# ADR: All API calls must have error handling
adr_002:
  title: Robust error handling
  decision: All async API calls must have try-catch
  validation_command: |
    grep -r "await api\." mobile/lib/services/ | grep -v "try"
    # Should return 0 matches
  enforcement: blocking
  severity: high
```
Implementation:
- Extract ADRs from documentation
- Define a validation command for each constraint
- Run validators on every commit
- Block merge if violations are found
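The enforcement loop might look like the sketch below. The rule shape and the in-memory `files` map are assumptions for illustration; a real validator would shell out to the `validation_command`s above:

```python
import re

# Hypothetical ADR rule: a pattern that must NOT appear under a path prefix.
ADR_RULES = [
    {"id": "adr_001", "path": "mobile/", "forbidden": r"DatabaseConnection|executeQuery"},
]

def enforce_adrs(files: dict[str, str]) -> list[str]:
    """files maps path -> source text; returns blocking violations."""
    violations = []
    for rule in ADR_RULES:
        pattern = re.compile(rule["forbidden"])
        for path, text in files.items():
            if path.startswith(rule["path"]) and pattern.search(text):
                violations.append(f"{rule['id']}: {path}")
    return violations

repo = {
    "mobile/lib/db.dart": "final conn = DatabaseConnection();",
    "backend/src/db.rs": "let conn = DatabaseConnection::new();",  # out of scope: backend
}
print(enforce_adrs(repo))  # only the mobile/ file violates adr_001
```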
Strengths:
- ✅ Enforces architectural principles automatically
- ✅ Prevents technical debt accumulation
- ✅ Documents decisions in executable form
Weaknesses:
- ❌ Requires upfront ADR creation
- ❌ Validation commands can be brittle
Rating: ⭐⭐⭐⭐⭐ (5/5) — Essential for large codebases
Note: Forge already uses this! (See README — "Agent-optimized ADRs")
3.5 Intent Preservation Validation
Concept: Record WHY a decision was made, validate future changes preserve original intent.
Intent Tracker
```yaml
# When fixing Issue #432: RadioGroup bug
intent:
  context: User needs to decline ride requests
  requirement: UI must show radio options for decline reasons
  constraint: Must use Flutter built-in widgets only
  original_approach: RadioGroup<T> (doesn't exist)
  corrected_approach: RadioListTile<String>
  lesson: Always verify widget exists in Flutter SDK before using
validation:
  future_changes_to_this_file:
    - preserve: "User can select one decline reason"
    - preserve: "Uses standard Flutter Radio pattern"
    - detect_regression: "Don't reintroduce RadioGroup"
```
Implementation:
- When fixing bugs, record: context, requirement, constraint, lesson
- On future changes to that file, check intent preservation
- Ask an LLM: "Does this change preserve the original intent?"
- Warn if the intent is violated
Strengths:
- ✅ Prevents regressions
- ✅ Documents design rationale
- ✅ Helps future developers understand context
Weaknesses:
- ❌ Manual intent capture
- ❌ Hard to validate programmatically
Rating: ⭐⭐⭐⭐ (4/5) — Valuable but labor-intensive
3.6 Multi-Model Ensemble Validation (Prime Radiant Implementation)
Concept: Multiple AI models independently evaluate implementation from different perspectives, aggregate verdicts.
Validation Ensemble
```yaml
validation_ensemble:
  perspectives:
    - perspective: domain_expert
      model: opus
      prompt: "Does this code correctly implement the Trip/Booking domain model?"
    - perspective: security_auditor
      model: gpt-4
      prompt: "Are there any security vulnerabilities in this code?"
    - perspective: performance_engineer
      model: gemini-pro
      prompt: "Are there performance issues or inefficiencies?"
    - perspective: ux_designer
      model: sonnet
      prompt: "Is the user experience intuitive? Are loading states handled?"
    - perspective: test_engineer
      model: sonnet
      prompt: "Is this code adequately tested? Are edge cases covered?"
  aggregation:
    method: consensus      # Require 80% agreement
    threshold: 0.8
    on_disagreement: escalate_to_human
```
Implementation:
- Spawn N models in parallel
- Each evaluates the code from its perspective
- Collect verdicts: PASS/FAIL + reasoning
- Aggregate: if consensus → merge; if disagreement → human review
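The aggregation step reduces to a small function; the verdict strings and the 0.8 threshold follow the ensemble config above, while the exact escalation policy is an assumption:

```python
def aggregate(verdicts: list[str], threshold: float = 0.8) -> str:
    """Consensus aggregation over PASS/FAIL verdicts from N validators."""
    passes = sum(v == "PASS" for v in verdicts)
    ratio = passes / len(verdicts)
    if ratio >= threshold:
        return "MERGE"
    if passes > 0:
        return "HUMAN_REVIEW"  # disagreement -> escalate_to_human
    return "BLOCK"

print(aggregate(["PASS"] * 4 + ["FAIL"]))  # 4/5 = 0.8 meets the threshold
print(aggregate(["PASS", "FAIL", "FAIL", "FAIL", "FAIL"]))  # disagreement
```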
Strengths:
- ✅ Multi-perspective validation (like Prime Radiant's multi-dimensional view)
- ✅ Catches issues one model might miss
- ✅ Reduces false positives/negatives
Weaknesses:
- ❌ Expensive (N× model costs)
- ❌ Slower (parallel but still multiple calls)
- ❌ Disagreement resolution overhead
Rating: ⭐⭐⭐⭐⭐ (5/5) — Closest to "Prime Radiant" concept
Note: This is essentially Forge #13 (Ensemble Multi-Agent) scaled to validation phase.
3.7 Semantic Diff Validation
Concept: When code changes, verify semantics haven't changed unintentionally.
Semantic Change Detector
```text
# Before change
function approveBooking(trip, booking):
    trip.availableSeats -= booking.seats
    booking.status = "accepted"

# After change
function approveBooking(trip, booking):
    if trip.availableSeats >= booking.seats:   # NEW GUARD!
        trip.availableSeats -= booking.seats
        booking.status = "accepted"
    else:
        throw InsufficientSeatsError()
```

```yaml
validation:
  semantic_diff:
    - added: "Guard clause prevents negative seats"
    - preserved: "Seat reduction logic unchanged"
    - impact: "Safer (prevents invariant violation)"
    - risk: LOW
    - verdict: APPROVE (improves correctness)
```
Implementation:
- An LLM analyzes the code before/after
- It describes the semantic changes
- It evaluates: is the change intentional? does it align with requirements?
- Flag when semantics change but the spec did not
Strengths:
- ✅ Catches unintended behavior changes
- ✅ Documents evolution
- ✅ Validates alignment with intent
Weaknesses:
- ❌ Hard to detect all semantic changes
- ❌ False positives (safe changes flagged)
Rating: ⭐⭐⭐⭐ (4/5) — Valuable for critical code
3.8 Requirement Traceability Matrix
Concept: Explicit mapping from requirements → code → tests, validate completeness.
Traceability Matrix
```yaml
requirement: REQ-001
description: "User can decline ride requests with reason"
acceptance_criteria:
  - AC1: "UI shows list of decline reasons"
  - AC2: "User can select one reason"
  - AC3: "Selection is sent to backend"
implementation:
  - file: booking_request_screen.dart
    lines: 247-260
    implements: [AC1, AC2]
  - file: api_service.dart
    lines: 89-102
    implements: [AC3]
tests:
  - file: booking_request_screen_test.dart
    scenario: "Declining request with reason"
    covers: [AC1, AC2, AC3]
validation:
  - all_acs_implemented: true   # ✅
  - all_acs_tested: true        # ✅
  - no_orphaned_code: true      # ✅ all code maps to a requirement
```
Implementation:
- Parse requirements from Gherkin/user stories
- Tag code with requirement IDs (comments or annotations)
- Generate the traceability matrix
- Validate: every requirement has an implementation + tests
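A rough sketch of matrix generation, assuming requirements are tagged in source comments as `REQ-nnn` (the tag format, file contents, and requirement set are illustrative):

```python
import re
from collections import defaultdict

# Hypothetical tagged sources; legacy_helper.dart has no requirement tag.
SOURCES = {
    "booking_request_screen.dart": "// REQ-001\nWidget build() {...}",
    "api_service.dart": "// REQ-001\nFuture decline() {...}",
    "legacy_helper.dart": "int unused() {...}",
}
REQUIREMENTS = {"REQ-001", "REQ-002"}

def build_matrix(sources: dict[str, str]):
    """Return (requirement -> files, orphaned files, unimplemented requirements)."""
    matrix = defaultdict(list)
    orphans = []
    for path, text in sources.items():
        tags = re.findall(r"REQ-\d+", text)
        if not tags:
            orphans.append(path)
        for tag in tags:
            matrix[tag].append(path)
    unimplemented = sorted(REQUIREMENTS - matrix.keys())
    return dict(matrix), orphans, unimplemented

matrix, orphans, missing = build_matrix(SOURCES)
print(matrix)    # REQ-001 maps to both implementing files
print(orphans)   # legacy_helper.dart has no requirement
print(missing)   # REQ-002 has no implementation
```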
Strengths:
- ✅ Complete coverage visibility
- ✅ Detects orphaned code (no requirement)
- ✅ Audit trail for compliance
Weaknesses:
- ❌ Manual tagging overhead
- ❌ Stale annotations
Rating: ⭐⭐⭐⭐ (4/5) — Essential for regulated domains
3.9 Behavior-Preserving Refactoring Validation
Concept: When refactoring, verify behavior hasn't changed.
Refactoring Validator
```yaml
refactoring:
  before_snapshot:
    - run all tests
    - capture: test results, coverage, performance metrics
    - save: behavioral signature
  after_refactoring:
    - run all tests
    - capture: test results, coverage, performance metrics
    - compare: behavioral signature
validation:
  - same_tests_pass: true
  - same_tests_fail: true   # if any
  - coverage_unchanged_or_improved: true
  - performance_unchanged_or_improved: true
  - api_contracts_unchanged: true
```
Implementation:
- Before refactoring: snapshot test results + behavior
- Refactor
- After refactoring: re-run the tests
- Diff: if behavior changed → flag (unless intentional)
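If test results are captured as name-to-passed maps (a shape assumed here), the signature comparison reduces to a dict diff:

```python
def diff_signature(before: dict[str, bool], after: dict[str, bool]) -> list[str]:
    """List every test whose outcome changed between the two snapshots."""
    changes = []
    for name in sorted(before.keys() | after.keys()):
        b, a = before.get(name), after.get(name)
        if b != a:
            changes.append(f"{name}: {b} -> {a}")
    return changes

before = {"test_approve": True, "test_decline": True, "test_overbook": False}
after = {"test_approve": True, "test_decline": False, "test_overbook": False}
print(diff_signature(before, after))  # test_decline regressed -> flag the refactor
```

An empty diff means the behavioral signature is preserved; any entry flags the refactoring for review.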
Strengths:
- ✅ Safe refactoring (behavior locked)
- ✅ Detects accidental changes
- ✅ Builds confidence
Weaknesses:
- ❌ Requires good existing tests
- ❌ Can't validate if tests are wrong
Rating: ⭐⭐⭐⭐ (4/5) — Standard practice for refactoring
3.10 Runtime Invariant Checking (Production Validation)
Concept: Monitor production to validate code behaves as designed.
Invariant Monitor
```yaml
invariants:
  - name: seats_non_negative
    expression: trip.availableSeats >= 0
    scope: production
    action: alert + rollback
  - name: capacity_not_exceeded
    expression: |
      SUM(booking.seats WHERE trip_id = {id} AND status = 'accepted')
      <= trip.capacity
    scope: production
    action: alert + block_new_bookings
monitoring:
  on_violation:
    - log_event
    - send_alert: ops_team
    - auto_remediate: true   # if safe
    - create_issue: github
```
Implementation:
- Define invariants from the domain model
- Instrument the code to check invariants at runtime
- Monitor violations in production
- Alert + auto-fix where possible
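Instrumentation can be wired in with a decorator. The `Trip` type, the simplified in-memory invariants, and the `AssertionError` standing in for the alert/rollback hook are all illustrative:

```python
from dataclasses import dataclass

@dataclass
class Trip:
    capacity: int
    available_seats: int

# Simplified versions of the configured invariants (the real capacity check
# would sum accepted bookings, as in the SQL expression above).
INVARIANTS = [
    ("seats_non_negative", lambda t: t.available_seats >= 0),
    ("capacity_not_exceeded", lambda t: t.available_seats <= t.capacity),
]

def checked(fn):
    """Re-check domain invariants after every state-mutating call."""
    def wrapper(trip, *args, **kwargs):
        result = fn(trip, *args, **kwargs)
        for name, predicate in INVARIANTS:
            if not predicate(trip):
                raise AssertionError(f"invariant violated: {name}")  # alert/rollback hook
        return result
    return wrapper

@checked
def approve_booking(trip: Trip, seats: int) -> None:
    trip.available_seats -= seats

trip = Trip(capacity=4, available_seats=4)
approve_booking(trip, 2)      # fine: 2 seats remain
# approve_booking(trip, 5)    # would raise: seats_non_negative violated
```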
Strengths:
- ✅ Real-world validation
- ✅ Catches bugs tests miss
- ✅ Continuous verification
Weaknesses:
- ❌ Performance overhead
- ❌ Only catches after deployment
Rating: ⭐⭐⭐⭐⭐ (5/5) — Essential for critical systems
Note: Forge already supports this! (Approach #11 - Runtime Verification)
4. Recommended Prime Radiant Implementation for Forge
Vision
A "Prime Radiant" for Forge would be a multi-dimensional validation dashboard that:
- Predicts what should exist from requirements
- Validates implementations against predictions
- Visualizes domain models, contracts, and dependencies
- Learns from deviations to improve future validations
- Alerts on drift between spec and implementation
Architecture
┌─────────────────────────────────────────────────────────────────┐
│ PRIME RADIANT │
│ Multi-Dimensional Validation System │
├─────────────────────────────────────────────────────────────────┤
│ │
│ INPUT LAYER │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Gherkin │ │ ADRs │ │ Domain │ │
│ │ Specs │ │ │ │ Model │ │
│ └──────┬─────┘ └──────┬─────┘ └──────┬─────┘ │
│ │ │ │ │
│ └────────────────┴────────────────┘ │
│ │ │
│ ▼ │
│ PREDICTION ENGINE │
│ ┌───────────────────────────────────────────────┐ │
│ │ "From specs, what SHOULD exist?" │ │
│ │ - Expected classes/functions │ │
│ │ - Expected invariants │ │
│ │ - Expected tests │ │
│ │ - Expected API contracts │ │
│ └───────────────┬───────────────────────────────┘ │
│ │ │
│ ▼ │
│ VALIDATION ENSEMBLE (Multi-Model) │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Domain │ │ Contract │ │ Intent │ │
│ │ Validator │ │ Validator │ │ Validator │ │
│ │ (Opus) │ │ (Sonnet) │ │ (GPT-4) │ │
│ └──────┬─────┘ └──────┬─────┘ └──────┬─────┘ │
│ │ │ │ │
│ └────────────────┴────────────────┘ │
│ │ │
│ ▼ │
│ AGGREGATION & DECISION │
│ ┌───────────────────────────────────────────────┐ │
│ │ Consensus: 3/3 models agree → PASS │ │
│ │ Disagreement: 2/3 → WARN + human review │ │
│ │ Failure: 0/3 or 1/3 → BLOCK │ │
│ └───────────────┬───────────────────────────────┘ │
│ │ │
│ ▼ │
│ LEARNING & FEEDBACK │
│ ┌───────────────────────────────────────────────┐ │
│ │ - Update confidence tiers │ │
│ │ - Record patterns (correct implementations) │ │
│ │ - Improve predictions for next iteration │ │
│ └───────────────────────────────────────────────┘ │
│ │
│ OUTPUT │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Verdict │ │ Traceability│ │ Drift │ │
│ │ Report │ │ Matrix │ │ Alerts │ │
│ └────────────┘ └────────────┘ └────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Implementation Phases
Phase 1: Prediction Engine (Week 1-2)
Goal: From Gherkin + ADRs, predict what the code should look like
- Input parsing:
  - Parse all Gherkin scenarios
  - Parse all ADRs
  - Extract domain model concepts
- Prediction generation:
  ```yaml
  # From Gherkin: "Given I have a trip with 4 available seats"
  predictions:
    - entity: Trip
      fields:
        - availableSeats: integer (positive)
      methods:
        - approveBooking(booking): void
      invariants:
        - availableSeats >= 0
    # From ADR: "No direct DB access from frontend"
    - constraint: no_direct_db_access
      scope: mobile/*
      validation: grep -r "DatabaseConnection" mobile/ == 0 results
  ```
- Deliverable: `.forge/predictions.yaml` generated from specs
Phase 2: Multi-Model Validation (Week 3-4)
Goal: 3+ models validate the implementation independently
- Validator agents (spawned in parallel):
  - Domain Validator (Opus): "Does code correctly implement domain model?"
  - Contract Validator (Sonnet): "Do frontend/backend contracts align?"
  - Intent Validator (GPT-4): "Does implementation preserve intent from specs?"
- Verdict aggregation:
  ```typescript
  const verdicts = await Promise.all([
    domainValidator.validate(code, predictions),
    contractValidator.validate(code, predictions),
    intentValidator.validate(code, predictions),
  ]);
  if (verdicts.every(v => v === 'PASS')) return 'APPROVED';
  if (verdicts.filter(v => v === 'PASS').length >= 2) return 'WARN';
  return 'BLOCKED';
  ```
- Deliverable: `forge --prime-radiant` command
Phase 3: Traceability Matrix (Week 5)
Goal: Explicit requirement → code → test mapping
- Mapping generation:
  - Scan code for `// REQ-001` annotations
  - Build the matrix: which files implement which requirements
  - Validate: every requirement has an implementation + tests
- Orphan detection:
  - Find code with no requirement mapping
  - Find requirements with no implementation
  - Alert on gaps
- Deliverable: `.forge/traceability.html` visual matrix
Phase 4: Drift Detection (Week 6)
Goal: Continuous monitoring for spec/implementation divergence
- Baseline capture:
  - On the first run, capture: code structure, API contracts, domain model
  - Save: `.forge/baseline.json`
- Drift monitoring:
  - On each run, compare the current state against the baseline
  - Detect: new APIs not in the spec, removed features still in the spec, changed invariants
- Alerts:
  ```yaml
  drift_detected:
    - type: spec_drift
      message: "Gherkin says 'User can decline', but DeclineButton removed from code"
      severity: high
      action: block_merge
  ```
- Deliverable: `forge --drift-check` command
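The baseline comparison might be sketched as a set diff over the captured API lists; the baseline shape and the endpoint names are assumptions:

```python
# Hypothetical captured state: each snapshot lists the API endpoints it found.
def detect_drift(baseline: dict, current: dict) -> list[dict]:
    """Flag APIs added without a spec and spec'd APIs missing from code."""
    alerts = []
    for api in sorted(set(current["apis"]) - set(baseline["apis"])):
        alerts.append({"type": "impl_drift", "message": f"new API not in spec: {api}"})
    for api in sorted(set(baseline["apis"]) - set(current["apis"])):
        alerts.append({"type": "spec_drift", "message": f"spec'd API removed from code: {api}"})
    return alerts

baseline = {"apis": ["POST /trips", "POST /bookings/decline"]}
current = {"apis": ["POST /trips", "GET /admin/stats"]}
for alert in detect_drift(baseline, current):
    print(alert)
```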
Phase 5: Learning & Feedback (Week 7-8)
Goal: Improve predictions based on actual outcomes
- Pattern mining:
  - Analyze which predicted structures actually appeared in the code
  - Record successful implementations (for future reference)
- Confidence updating:
  - If a prediction was correct → increase confidence in that pattern
  - If a prediction was wrong → update the model
- Feedback loop:
  ```yaml
  # After a successful implementation
  learning:
    - pattern: Trip entity with availableSeats field
      confidence: platinum   # 5/5 times correct
    - pattern: RadioGroup widget in Flutter
      confidence: bronze     # was wrong; widget doesn't exist
      lesson: Always verify widget exists in Flutter SDK
  ```
- Deliverable: `.forge/patterns.yaml`, continuously updated
5. Comparison Table
| Approach | Domain Depth | Req Alignment | Complexity | Cost | Rating |
|---|---|---|---|---|---|
| Domain Model Validation | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Medium | Low | ⭐⭐⭐⭐⭐ |
| Specification-by-Example | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Medium | Medium | ⭐⭐⭐⭐⭐ |
| Contract-First | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Low | Low | ⭐⭐⭐⭐⭐ |
| ADR Enforcement | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Low | Low | ⭐⭐⭐⭐⭐ |
| Intent Preservation | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | High | Medium | ⭐⭐⭐⭐ |
| Multi-Model Ensemble | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | High | High | ⭐⭐⭐⭐⭐ |
| Semantic Diff | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Medium | Medium | ⭐⭐⭐⭐ |
| Traceability Matrix | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Medium | Low | ⭐⭐⭐⭐ |
| Behavior-Preserving | ⭐⭐⭐ | ⭐⭐⭐ | Low | Low | ⭐⭐⭐⭐ |
| Runtime Invariants | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Medium | High | ⭐⭐⭐⭐⭐ |
Legend:
- Domain Depth: Does it validate code reflects domain understanding?
- Req Alignment: Does it ensure implementation matches requirements?
- Complexity: Implementation difficulty
- Cost: Computational/time cost
- Rating: Overall effectiveness
6. Recommendations for Forge
Immediate (Next Sprint)
- ✅ Already have: ADR enforcement, runtime invariants (Approach #11)
- 🆕 Add: Domain Model Validation
  - Extract domain entities from Gherkin
  - Validate the code uses correct terminology
  - Check that invariants are enforced
- 🆕 Add: Contract-First Validation
  - Define OpenAPI contracts for all APIs
  - Validate frontend/backend alignment
  - Auto-generate types from contracts
Short-term (This Month)
- 🆕 Implement: Multi-Model Ensemble Validation (Prime Radiant v1)
  - 3 models evaluate from different perspectives
  - Consensus-based approval
  - Human escalation on disagreement
- 🆕 Implement: Traceability Matrix
  - Requirement → code → test mapping
  - Orphan detection
  - Visual HTML report
Long-term (This Quarter)
- 🆕 Full Prime Radiant: all 5 phases
  - Prediction engine
  - Multi-model validation
  - Traceability matrix
  - Drift detection
  - Learning & feedback
- 🆕 Prime Radiant Dashboard:
  - Real-time validation status
  - Drift alerts
  - Confidence scores
  - Pattern evolution over time
7. Success Metrics
Current State (Forge Today)
- ✅ Behavioral verification (Gherkin)
- ✅ 7 quality gates
- ✅ Defect prediction
- ✅ Confidence-tiered fixes
With Prime Radiant (Target)
- ✅ Domain alignment verified (not just behavior)
- ✅ Requirement traceability (100% coverage)
- ✅ Intent preservation (no accidental regressions)
- ✅ Drift detection (spec/impl alignment monitored)
- ✅ Multi-perspective validation (consensus-based approval)
Expected Outcomes:
- 🎯 First-pass quality improvement: 90% → 98%
- 🎯 Domain depth score: NEW (0% → 95%)
- 🎯 Requirement alignment: NEW (0% → 100%)
- 🎯 Production bugs from shallow implementations: Near zero
- 🎯 Developer confidence: Higher (validated against domain model)
8. Conclusion
Prime Radiant as a metaphor represents a multi-dimensional validation system that:
- Predicts what should exist from requirements
- Validates implementations from multiple perspectives
- Continuously learns and improves
- Visualizes complex relationships for human review
For Forge, implementing a Prime Radiant system would mean:
- Domain Model Validation — Code reflects domain thinking
- Multi-Model Ensemble — Consensus-based quality gates
- Traceability Matrix — Complete req → code → test mapping
- Drift Detection — Continuous spec/impl alignment monitoring
- Learning Loop — Patterns improve over time
This goes beyond "tests pass" to ensure domain depth and requirement alignment — the true goal of autonomous XP agents.
Next Step: Choose 2-3 approaches from this research to prototype in Forge, starting with Domain Model Validation + Multi-Model Ensemble (the core "Prime Radiant" concept).
References:
- Asimov, I. (1951). Foundation. (Prime Radiant concept)
- Evans, E. (2003). Domain-Driven Design. (Domain model validation)
- Beck, K. (1999). Extreme Programming Explained. (XP practices)
- Forge issues #4–22 — autonomous QA research (incl. "Multi-Agent Quality Assurance: Preventing Shallow AI Implementations")