feat: Schema Versioning and Migration for Rolling Upgrades #9

Merged
wongfei2009 merged 5 commits into master from feature/schema-versioning on Jan 20, 2026

@wongfei2009
Owner

Summary

This PR implements Schema Versioning and Migration support for HarmonyLite, enabling safe rolling upgrades in production clusters where nodes may temporarily have different database schemas.

Problem Solved

During rolling upgrades, nodes temporarily have schema mismatches. Without this feature, replication could:

  • Apply changes to columns that don't exist
  • Corrupt data due to schema incompatibility
  • Cause unpredictable failures

Solution

Phase 1-3: Schema Hashing and Validation

  • Schema Manager: Uses Atlas to compute deterministic schema hashes
  • Schema Cache: Caches schema hashes per table for performance
  • Change Log Enhancement: Schema hash included in every CDC message
  • Schema Validation: Validates incoming messages against local schema
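
To illustrate what a deterministic schema hash means in practice, here is a minimal sketch. It does not use Atlas: the `Column` struct and `HashTable` function are hypothetical stand-ins for whatever the Schema Manager derives from Atlas introspection, and sorting columns by name is an assumption made here so that declaration order does not change the digest.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// Column is a hypothetical, simplified column description.
// The real implementation inspects the schema through Atlas.
type Column struct {
	Name       string
	Type       string
	NotNull    bool
	PrimaryKey bool
}

// HashTable returns a SHA-256 hex digest of a table definition.
// Columns are sorted by name and types are upper-cased so that
// equivalent definitions always produce the same digest.
func HashTable(table string, cols []Column) string {
	sorted := make([]Column, len(cols))
	copy(sorted, cols)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Name < sorted[j].Name })

	var b strings.Builder
	b.WriteString(table)
	for _, c := range sorted {
		fmt.Fprintf(&b, "|%s:%s:%t:%t", c.Name, strings.ToUpper(c.Type), c.NotNull, c.PrimaryKey)
	}
	sum := sha256.Sum256([]byte(b.String()))
	return hex.EncodeToString(sum[:])
}

func main() {
	v1 := []Column{
		{Name: "id", Type: "integer", PrimaryKey: true},
		{Name: "title", Type: "text"},
	}
	// v2 simulates the schema after an ALTER TABLE ... ADD COLUMN.
	v2 := append(append([]Column{}, v1...), Column{Name: "rating", Type: "integer"})

	fmt.Println(HashTable("books", v1) == HashTable("books", v2)) // false
}
```

Because the digest depends only on the normalized definition, two nodes with the same schema compute the same hash independently, which is what lets a hash carried in a CDC message be compared against the local one.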

Phase 4: Cluster-wide Visibility

  • Schema Registry: NATS KV-backed registry showing all nodes' schema versions
  • Enables operators to monitor schema rollout progress across the cluster

Pause/Resume Behavior

When a schema mismatch is detected:

  1. Replication pauses (incoming messages are NAKed with a 30s delay instead of being acknowledged)
  2. A warning is logged with upgrade instructions
  3. After local schema is upgraded and node restarted, replication resumes automatically
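
The pause decision can be sketched as a single comparison in the replication path. This is an illustrative sketch only: `Action`, `Decide`, and `nakDelay` are hypothetical names, while the 30-second delay and the acceptance of events without a hash follow the commit notes in this PR.

```go
package main

import (
	"fmt"
	"time"
)

// Action is a hypothetical result type for the validation step.
type Action int

const (
	Apply Action = iota // apply the change and acknowledge the message
	Pause               // negative-acknowledge so the message is redelivered later
)

// nakDelay mirrors the 30s redelivery delay mentioned in the commit notes.
const nakDelay = 30 * time.Second

// Decide compares the hash carried by an event with the local schema hash.
// Events without a hash (from older nodes) are applied for backward compatibility.
func Decide(eventHash, localHash string) (Action, time.Duration) {
	if eventHash == "" || eventHash == localHash {
		return Apply, 0
	}
	return Pause, nakDelay
}

func main() {
	action, delay := Decide("abc", "def")
	fmt.Println(action == Pause, delay) // true 30s
}
```

Since a mismatched message is never acknowledged, it stays in the stream and is redelivered until the local hash matches, which is why replication can pick up where it left off after the upgrade.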

Testing

  • Unit tests for schema manager, registry, validation logic
  • E2E test for full rolling upgrade scenario with 3 nodes
  • Shell script for manual testing

Rolling Upgrade Workflow

  1. Stop node, upgrade schema with ALTER TABLE
  2. Restart node - it now has new schema hash
  3. Repeat for remaining nodes
  4. Replication automatically resumes as each node is upgraded

Implements a schema versioning and migration handling system, following the
design document, to address scenarios where database instances have
different schema versions during rolling upgrades.

Phase 1: Foundation (Schema Tracking with Atlas)
- Add ariga.io/atlas dependency for SQLite introspection
- Create SchemaManager using Atlas for deterministic schema hashing
- Create SchemaCache for O(1) schema hash lookups
- Add __harmonylite__schema_version table to track schema state
- Initialize schema cache on startup and store hash in database
- Add UpdateSchemaState() for recomputing schema hash

Phase 2: Event Enhancement
- Add SchemaHash field to ChangeLogEvent with CBOR omitempty tag
- Populate SchemaHash during event creation (backward compatible)

Phase 3: Validation and Pause-on-Mismatch
- Add schema mismatch tracking fields to Replicator
- Implement handleSchemaMismatch() with 5-minute periodic recompute
- Create ListenWithDB() for schema-aware replication
- Add O(1) hash comparison in replication hot path
- NAK with 30s delay when schema mismatches (pauses replication)
- Implement checkStreamGap() to detect message truncation
- Exit process when stream gap detected (triggers snapshot restore)
- Auto-resume when schema matches after recompute (no restart needed)
- Add harmonylite_schema_mismatch_paused gauge metric
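
One plausible way to detect a stream gap while paused is to compare sequence numbers. This is an assumption about the approach, not a description of `checkStreamGap()` itself: `HasStreamGap` and its parameters are hypothetical, with `firstAvailable` standing for the oldest sequence the stream still retains.

```go
package main

import "fmt"

// HasStreamGap reports whether messages the node still needs have been
// dropped from the stream. lastApplied is the sequence of the last message
// the node applied; firstAvailable is the oldest sequence still retained.
func HasStreamGap(lastApplied, firstAvailable uint64) bool {
	return firstAvailable > lastApplied+1
}

func main() {
	// The node paused at sequence 100, but the stream now starts at 250:
	// messages 101..249 are gone, so only a snapshot restore can catch up.
	fmt.Println(HasStreamGap(100, 250)) // true
	fmt.Println(HasStreamGap(100, 101)) // false
}
```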

Key Features:
- Deterministic SHA-256 schema hashing using Atlas introspection
- Self-healing: auto-detects schema changes and resumes replication
- Stream gap detection prevents nodes from getting stuck
- Backward compatible with events lacking SchemaHash
- Observable via Prometheus metrics

Phase 4 (cluster visibility via NATS KV) is deferred from this commit and added by a later commit in this PR.

Ref: docs/docs/design/schema-versioning.md
- Add SchemaRegistry with NATS KeyValue integration for cluster-wide schema state
- Implement PublishSchemaState() to broadcast node schema hash to registry
- Implement GetClusterSchemaState() to retrieve state from all nodes
- Implement CheckClusterSchemaConsistency() to validate schema across cluster
- Add CLI flags: -schema-status and -schema-status-cluster
- Add printLocalSchemaStatus() to display local schema information
- Add printClusterSchemaStatus() to display cluster-wide schema status with hash groups
- Integrate schema publishing into harmonylite.go startup after CDC installation
- Add comprehensive unit tests for schema registry functionality
- All tests pass: unit tests (db/logstream) and E2E tests (10/10 specs)

Phase 4 provides operators visibility into schema state across the cluster,
making it easy to diagnose schema mismatches during rolling upgrades.
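
The "hash groups" idea can be sketched as grouping nodes by their published hash. This is a minimal sketch under assumptions: `GroupByHash` and `IsConsistent` are hypothetical helpers, the input map stands in for whatever the registry returns from NATS KV, and the node IDs and hashes are made up.

```go
package main

import (
	"fmt"
	"sort"
)

// GroupByHash maps each schema hash to the sorted list of node IDs reporting it.
func GroupByHash(nodeHashes map[string]string) map[string][]string {
	groups := make(map[string][]string)
	for node, hash := range nodeHashes {
		groups[hash] = append(groups[hash], node)
	}
	for _, nodes := range groups {
		sort.Strings(nodes) // sorts in place; stable output for display
	}
	return groups
}

// IsConsistent reports whether all nodes share a single schema hash.
func IsConsistent(nodeHashes map[string]string) bool {
	return len(GroupByHash(nodeHashes)) <= 1
}

func main() {
	// Hypothetical registry contents midway through a rolling upgrade.
	state := map[string]string{"node-1": "a1b2", "node-2": "a1b2", "node-3": "c3d4"}
	fmt.Println(IsConsistent(state))        // false
	fmt.Println(GroupByHash(state)["a1b2"]) // [node-1 node-2]
}
```

More than one group means the rollout is still in progress (or stuck); a single group means the cluster has converged.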
- Add run-schema-migration-test.sh for automated testing of schema versioning
- Test verifies: hash computation, registry publishing, mismatch detection,
  rolling upgrade workflow, and schema convergence
- Update README.md with comprehensive schema versioning documentation
- Include troubleshooting guide for schema-related issues
…ades

- Add Schema Mismatch Pause and Resume test context in e2e_test.go
- Add schema migration helpers: alterTableAddColumn, hasColumn, insertBookWithRating, waitForCDCReady
- Test validates rolling upgrade workflow: pause on mismatch, resume after upgrade
- Covers a key discovery: the change_log table must be dropped and recreated after schema changes
- Add schema versioning to README.md features list
- Update introduction.md with schema versioning as key feature
- Update architecture.md with Schema Versioning section and flow diagram
- Update production-deployment.md with rolling upgrade workflow
- Add schema mismatch to replication.md failure modes and troubleshooting
- Mark design/schema-versioning.md status as Implemented
@wongfei2009 wongfei2009 merged commit da92828 into master Jan 20, 2026
2 checks passed
@wongfei2009 wongfei2009 deleted the feature/schema-versioning branch January 20, 2026 09:01