A PostgreSQL cluster health monitoring tool that scans, analyzes, and reports on the health status of PostgreSQL clusters with streaming replication.
This project is tightly coupled to our current setup, so it's probably mostly useless to outsiders. That said, it's open source, so do with it what thou wilt.
Large parts of this readme (but not the code) were AI-summarized. Tread carefully.
- Cluster Health Detection: Identifies healthy, degraded, critical, and unknown cluster states
- Split-Brain Detection: Detects and resolves split-brain scenarios using timeline and replica evidence
- Replication Monitoring: Tracks replication lag, replica status, and synchronization state
- Failover Detection: Identifies clusters that have experienced failover
- Backup Progress Tracking (optional): Estimates pg_basebackup progress via Prometheus filesystem metrics
- Concurrent Scanning: Parallel health checks across multiple clusters and nodes
- Multiple Output Formats: Terminal output (with colors) and CSV export
- Structured Logging: Full tracing support with spans and structured fields
- Rust 1.70+ (for building from source)
- PostgreSQL credentials and SSL certificates
- Network access to PostgreSQL nodes
- A Database Portal the tool can connect to for an initial list of nodes to scan
The tool expects a REST API endpoint that returns PostgreSQL node information. The API must:
Endpoint: `GET https://database.example.com/api/v1/nodes` (default)
You can configure a custom URL at compile time:
```bash
DATABASE_PORTAL_URL=https://your-api.com/api/v1/nodes cargo build --release
```

Cache Behavior:

- Responses are cached in `/tmp/nodes_response.json`
- Cache is valid for 24 hours
- A stale cache triggers an automatic re-fetch
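The freshness check can be sketched as a file-age comparison. This is an illustrative sketch, not the tool's actual implementation; the function name and error handling are assumptions.

```rust
use std::fs;
use std::time::{Duration, SystemTime};

/// Hypothetical sketch of the cache check: the portal response is cached
/// in /tmp/nodes_response.json and considered fresh for 24 hours.
fn cache_is_fresh(path: &str, max_age: Duration) -> bool {
    fs::metadata(path)
        .and_then(|m| m.modified())
        .ok()
        .and_then(|mtime| SystemTime::now().duration_since(mtime).ok())
        .map(|age| age < max_age)
        // A missing or unreadable cache counts as stale, triggering a re-fetch.
        .unwrap_or(false)
}

fn main() {
    assert!(!cache_is_fresh(
        "/nonexistent/nodes_response.json",
        Duration::from_secs(24 * 60 * 60),
    ));
}
```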
Response Format:
```json
{
  "items": [
    {
      "id": 1,
      "cluster_id": 33,
      "node_name": "dev-pg-app001-db001.sto1.example.com",
      "pg_version": "15.14",
      "ip_address": "127.1.12.151"
    },
    {
      "id": 2,
      "cluster_id": 33,
      "node_name": "dev-pg-app001-db002.sto2.example.com",
      "pg_version": "15.14",
      "ip_address": "127.2.12.151"
    }
  ],
  "count": 2
}
```

Required Fields:

- `items`: Array of node objects
- `count`: Total number of nodes returned
- Each node must have: `id`, `cluster_id`, `node_name`, `pg_version`, `ip_address`
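The required fields map naturally onto a record type. A minimal sketch of that shape follows; the real tool presumably derives serde's `Deserialize` for it, which is omitted here to keep the sketch dependency-free, and the type names are assumptions.

```rust
/// Hypothetical mirror of one entry in the portal's `items` array.
#[derive(Debug, Clone)]
struct PortalNode {
    id: u64,
    cluster_id: u64,
    node_name: String,
    pg_version: String,
    ip_address: String,
}

/// Hypothetical mirror of the full response body.
struct PortalResponse {
    items: Vec<PortalNode>,
    count: usize,
}

fn main() {
    let resp = PortalResponse {
        items: vec![PortalNode {
            id: 1,
            cluster_id: 33,
            node_name: "dev-pg-app001-db001.sto1.example.com".into(),
            pg_version: "15.14".into(),
            ip_address: "127.1.12.151".into(),
        }],
        count: 1,
    };
    // `count` should agree with the number of items returned.
    assert_eq!(resp.items.len(), resp.count);
}
```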
Node Naming Convention:
Nodes must follow the naming pattern: `{env}-pg-{app}-{db}.{zone}.{domain}`

- Example: `dev-pg-app001-db001.sto1.example.com`
- The tool uses this pattern to extract the environment, cluster name, and database number (GOOD FIRST PR IF YOU EXTRACT THIS INTO CONFIG)
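The extraction can be sketched with plain string splitting. This is an illustrative sketch under the stated naming convention; the function name and return shape are assumptions, not the tool's API.

```rust
/// Hypothetical parse of {env}-pg-{app}-{db}.{zone}.{domain} into
/// (environment, cluster name, database number).
fn parse_node_name(fqdn: &str) -> Option<(String, String, u32)> {
    let host = fqdn.split('.').next()?;            // "dev-pg-app001-db001"
    let (cluster, db) = host.rsplit_once("-db")?;  // ("dev-pg-app001", "001")
    let env = host.split('-').next()?;             // "dev"
    Some((env.to_string(), cluster.to_string(), db.parse().ok()?))
}

fn main() {
    let (env, cluster, db) =
        parse_node_name("dev-pg-app001-db001.sto1.example.com").unwrap();
    assert_eq!(env, "dev");
    assert_eq!(cluster, "dev-pg-app001");
    assert_eq!(db, 1);
    // Names that don't follow the convention are rejected.
    assert!(parse_node_name("not-a-postgres-host").is_none());
}
```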
```bash
# Basic build
cargo build --release

# Build with Prometheus integration for backup progress tracking
PROMETHEUS_URL=https://prometheus.example.com DATABASE_PORTAL_URL=https://database.example.com cargo build --release --features prometheus
```

The binary will be available at `target/release/db-scan`.
```bash
export PGUSER="your-username"
export PGPASSWORD="your-password"
export PGSSLKEY="/path/to/ssl.key"
export PGSSLCERT="/path/to/ssl.crt"
export PGSSLROOTCERT="/path/to/ca.crt"
export RUST_LOG="info"  # or debug, trace, warn, error
```
Configure the Prometheus URL (compile-time):

```bash
PROMETHEUS_URL=https://prometheus.example.com cargo build --release --features prometheus
```

Requirements:
- Prometheus server with `node_exporter` metrics
- Metrics: `node_filesystem_size_bytes` and `node_filesystem_avail_bytes`
- Labels: `host` (hostname) and `mountpoint` (e.g., `/var/lib/pgsql`)
How it works:
- Fetches filesystem metrics for all nodes in the cluster at startup using a single batch query (e.g., `host=~"dev-pg-app001.*"`)
- For replicas using pg_basebackup, compares the replica's filesystem used bytes against the primary's filesystem used bytes
- Estimates progress: `(replica_used_bytes / primary_used_bytes) * 100`
- Progress is stored as percentage * 100 (e.g., 4156 = 41.56%), keyed by replica IP address
Note: This is a filesystem-to-filesystem comparison (not database size), chosen for performance: querying database size is too slow on large clusters. It is a rough estimate that assumes both filesystems primarily contain PostgreSQL data, and it will be inaccurate if they hold significantly different amounts of other data. As of this writing, the tool queries the pgsql mount point, so the estimate shouldn't drift too far. Except for WAL, logs, etc...
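The estimate described above reduces to simple integer arithmetic. A minimal sketch, assuming the "percentage * 100" storage format; the function name and zero-handling are illustrative, not the tool's actual code.

```rust
/// Estimate pg_basebackup progress from filesystem used bytes,
/// stored as percentage * 100 (e.g., 4156 = 41.56%).
fn backup_progress(replica_used_bytes: u64, primary_used_bytes: u64) -> Option<u64> {
    if primary_used_bytes == 0 {
        return None; // avoid division by zero; no meaningful estimate
    }
    Some(replica_used_bytes * 10_000 / primary_used_bytes)
}

fn main() {
    // 41.56% of the primary's bytes copied -> stored as 4156.
    assert_eq!(backup_progress(4_156, 10_000), Some(4_156));
    // A finished copy reads as 100.00%.
    assert_eq!(backup_progress(10_000, 10_000), Some(10_000));
    assert_eq!(backup_progress(1, 0), None);
}
```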
```bash
export PGUSER="your-username"
export PGPASSWORD="your-password"
export PGSSLKEY="/path/to/ssl.key"
export PGSSLCERT="/path/to/ssl.crt"
export PGSSLROOTCERT="/path/to/ca.crt"
export RUST_LOG="info"  # or debug, trace, warn, error
```

```
db-scan [OPTIONS]

Options:
      --pguser <PGUSER>                Po3stgreSQL username
      --pgpassword <PGPASSWORD>        PostgreSQL password
      --pgsslkey <PGSSLKEY>            Path to SSL key file
      --pgsslcert <PGSSLCERT>          Path to SSL certificate file
      --pgsslrootcert <PGSSLROOTCERT>  Path to SSL root certificate file
  -c, --cluster <CLUSTER>              Filter by cluster name
  -l, --log-level <LOG_LEVEL>          Log level [default: info]
      --show-healthy                   Show healthy clusters in output
      --show-failover                  Show clusters that have experienced failover
      --csv <PATH>                     Write results to CSV file
      --no-color                       Disable terminal colors
  -h, --help                           Print help
  -V, --version                        Print version
```

```bash
# Scan all clusters
db-scan

# Scan specific cluster
db-scan --cluster dev-pg-app001

# Show healthy clusters
db-scan --show-healthy

# Export to CSV
db-scan --csv results.csv

# Increase logging verbosity
db-scan --log-level debug
```

Terminal output shows cluster health in a table format:
```
STATUS    CLUSTER         PRIMARY  REPLICAS     LAG   REASON
CRITICAL  prod-pg-app007  db001    -            -     NoPrimary
DEGRADED  dev-pg-app001   db001    db002,db003  50MB  HighReplicationLag
HEALTHY   prod-pg-app123  db002    db001,db003  0B    Failover
```
- One primary and two replicas online
- Replication lag < 5 seconds (80MB at 16MB/s WAL generation)
- All replicas streaming
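The lag threshold above can be expressed in bytes. A sketch using only the numbers stated in this README (5 seconds at a 16 MB/s WAL generation rate, i.e., 80 MB); the constant names and function are illustrative, not the tool's.

```rust
/// 5 seconds of lag at an assumed 16 MB/s WAL generation rate = 80 MB.
const WAL_RATE_BYTES_PER_SEC: u64 = 16 * 1024 * 1024;
const LAG_THRESHOLD_SECS: u64 = 5;

/// True when byte lag exceeds the 5-second threshold (HighReplicationLag).
fn lag_exceeds_threshold(lag_bytes: u64) -> bool {
    lag_bytes > WAL_RATE_BYTES_PER_SEC * LAG_THRESHOLD_SECS
}

fn main() {
    // 80 MB sits exactly on the threshold; anything above it is flagged.
    assert!(!lag_exceeds_threshold(80 * 1024 * 1024));
    assert!(lag_exceeds_threshold(81 * 1024 * 1024));
}
```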
- OneReplicaDown: One replica unavailable
- HighReplicationLag: Lag exceeds 5 second threshold
- RebuildingReplica: Replica is rebuilding (no WAL receiver)
- ChainedReplica: Cascading replication detected
- NoPrimary: No primary node found
- SplitBrain: Multiple primaries detected
- WritesBlocked: Primary with sync_commit=on but no replicas
- WritesUnprotected: Primary with sync_commit=off and no replicas
- NoNodesReachable: Cannot connect to any nodes
- UnexpectedTopology: Cluster has unexpected node count
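The reasons listed above suggest an enum split into degraded and critical severities. This is a sketch of one plausible shape; the actual type lives in `v2/analyze/` and may group or name things differently.

```rust
/// Hypothetical model of the health reasons listed above; the
/// degraded/critical grouping follows the README's own lists.
#[derive(Debug, PartialEq)]
enum HealthReason {
    // Degraded
    OneReplicaDown,
    HighReplicationLag,
    RebuildingReplica,
    ChainedReplica,
    // Critical
    NoPrimary,
    SplitBrain,
    WritesBlocked,
    WritesUnprotected,
    NoNodesReachable,
    UnexpectedTopology,
}

fn is_critical(r: &HealthReason) -> bool {
    use HealthReason::*;
    matches!(
        r,
        NoPrimary | SplitBrain | WritesBlocked
            | WritesUnprotected | NoNodesReachable | UnexpectedTopology
    )
}

fn main() {
    assert!(is_critical(&HealthReason::SplitBrain));
    assert!(!is_critical(&HealthReason::HighReplicationLag));
}
```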
- Scanner (`v2/scan/`): Connects to PostgreSQL nodes and executes health checks
- Cluster Builder (`v2/cluster.rs`): Groups nodes into clusters
- Analyzer (`v2/analyze/`): Evaluates cluster health and detects issues
- Writer (`v2/writer.rs`): Formats and outputs results
Nodes API → Scanner → Analyzed Nodes → Cluster Builder → Clusters → Analyzer → Health Status → Writer → Output
When multiple primaries are detected, the tool uses multiple strategies to determine the true primary:
- Timeline Analysis: Higher timeline ID indicates more recent promotion
- Replica Evidence: Which primary are the replicas following?
- Combined Evidence: Both timeline and replica data agree
- Override Case: Replicas override timeline (isolated failed promotion)
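The strategies above can be sketched as a comparison between two candidate primaries, where replica evidence outranks timeline evidence (the "override case" for an isolated failed promotion). The types and field names are illustrative assumptions, not the tool's.

```rust
use std::cmp::Ordering;

/// Hypothetical candidate primary: its timeline ID and how many
/// replicas are currently streaming from it.
struct Candidate {
    name: &'static str,
    timeline: u32,
    followers: usize,
}

/// Pick the likely true primary: replica evidence decides first,
/// timeline analysis breaks ties.
fn resolve<'a>(a: &'a Candidate, b: &'a Candidate) -> &'a Candidate {
    match (a.timeline.cmp(&b.timeline), a.followers.cmp(&b.followers)) {
        // Replicas override timeline (isolated failed promotion).
        (_, Ordering::Greater) => a,
        (_, Ordering::Less) => b,
        // Follower counts tie: higher timeline = more recent promotion.
        (Ordering::Greater, _) => a,
        _ => b,
    }
}

fn main() {
    let old = Candidate { name: "db001", timeline: 3, followers: 2 };
    let promoted = Candidate { name: "db002", timeline: 4, followers: 0 };
    // Replicas still follow db001, so it wins despite the lower timeline.
    assert_eq!(resolve(&old, &promoted).name, "db001");
}
```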
```bash
cargo test
```

```
src/
├── main.rs              # Entry point
├── database_portal.rs   # Node API client
├── logging.rs           # Tracing setup
└── v2/
    ├── mod.rs
    ├── node.rs          # Node data structure
    ├── db.rs            # Database connection
    ├── cluster.rs       # Cluster builder
    ├── writer.rs        # Output formatting
    ├── scan/
    │   ├── mod.rs       # Scan orchestration
    │   ├── health_check_primary.rs
    │   └── health_check_replica.rs
    └── analyze/
        └── mod.rs       # Health analysis logic
```
The tool uses structured logging with tracing. All logs include relevant context fields:
```rust
tracing::info!(
    node_name = %node.node_name,
    attempt = attempt,
    "successfully connected after retry"
);
```

Spans are used for major operations to provide hierarchical context.
lol, no