# Start one or more services interactively
underwrite run mechanism audit risk
# Start as HTTP daemon (default: mechanism,audit)
underwrite serve
underwrite serve --host 0.0.0.0 --port 8080
underwrite serve --services "mechanism,audit,fraud" --rate-limit 200
# With auth
UNDERWRITE_API_TOKEN=prod-token underwrite serve --require-auth
# Init default config
underwrite init
underwrite init config.production.json
The run command starts services in the foreground with a synchronous event loop. The serve command wraps them in a FastAPI/uvicorn HTTP server.
On startup, the Runtime loads configuration in this order:
- underwrite.json in working directory (if it exists)
- config.<UNDERWRITE_ENV>.json (if UNDERWRITE_ENV is set)
- Environment variable overrides (UNDERWRITE_*)
Create a default config with underwrite init — it enables mechanism and audit by default.
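The load order can be pictured as a layered merge in which later sources win. The following is a minimal sketch, not the Runtime's actual loader: `deep_merge` and `load_layered_config` are hypothetical names, and the mapping of `UNDERWRITE_<SECTION>_<KEY>` to `config[section][key]` is an assumption inferred from variables such as `UNDERWRITE_RECOVERY_AUTO_RESTART`.

```python
import json
from pathlib import Path
from typing import Any


def deep_merge(base: dict, override: dict) -> dict:
    """Return a copy of base with override merged in, recursing into nested dicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def load_layered_config(workdir: Path, environ: dict[str, str]) -> dict[str, Any]:
    """Merge underwrite.json, config.<env>.json and UNDERWRITE_* overrides, in that order."""
    config: dict[str, Any] = {}
    # Layers 1 and 2: the base file, then the environment-specific file.
    names = ["underwrite.json"]
    if environ.get("UNDERWRITE_ENV"):
        names.append(f"config.{environ['UNDERWRITE_ENV']}.json")
    for name in names:
        path = workdir / name
        if path.exists():
            config = deep_merge(config, json.loads(path.read_text()))
    # Layer 3 (assumed mapping): UNDERWRITE_<SECTION>_<KEY> -> config[section][key].
    for var, raw in environ.items():
        if not var.startswith("UNDERWRITE_") or var == "UNDERWRITE_ENV":
            continue
        section, _, key = var[len("UNDERWRITE_"):].lower().partition("_")
        if not key:
            continue
        try:
            value = json.loads(raw)  # "false" -> False, "5" -> 5
        except json.JSONDecodeError:
            value = raw  # keep plain strings as they are
        config = deep_merge(config, {section: {key: value}})
    return config
```

Calling `load_layered_config(Path.cwd(), dict(os.environ))` (after `import os`) would apply the three layers to the current working directory.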
Send SIGTERM or SIGINT (Ctrl+C):
kill <pid>
docker stop underwrite
The Runtime performs an orderly shutdown:
- Stops the metrics export loop
- Stops all registered services (unsubscribes from bus)
- Stops the event bus (flushes remaining events, waits for pending futures up to 5s)
- Shuts down store backends (closes connection pools, shuts down thread pools)
The serve command supports --shutdown-timeout 30 (default 30s) for the HTTP server's graceful drain period.
SIGKILL (kill -9) is safe—saga idempotency keys in the store prevent duplicate event processing on restart.
The ServiceSupervisor monitors service handler failures and auto-restarts crashed services.
Enable/disable with:
UNDERWRITE_RECOVERY_AUTO_RESTART=true # enable (default)
UNDERWRITE_RECOVERY_AUTO_RESTART=false # disable
Configured via:
{
"recovery": {
"auto_restart": true,
"max_restarts": 5,
"backoff_seconds": 2.0
}
}
- After a handler exception, the supervisor records a failure
- Exponential backoff before restart: backoff_seconds * 2^(failure-1), capped at 60s
- After max_restarts consecutive failures, the service is marked permanently unhealthy
- On successful handler execution, the failure count resets
- The supervisor health check reports restarting services and total failures
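To make the backoff schedule concrete, here is a small sketch of the formula above. `restart_delay` is a hypothetical helper written for illustration, not a function exported by the supervisor.

```python
def restart_delay(failures: int, backoff_seconds: float = 2.0, cap: float = 60.0) -> float:
    """Delay before the restart that follows the given consecutive failure count (1-based)."""
    if failures < 1:
        raise ValueError("failures must be >= 1")
    return min(backoff_seconds * 2 ** (failures - 1), cap)


# Schedule for backoff_seconds=2.0 and max_restarts=5.
delays = [restart_delay(n) for n in range(1, 6)]
print(delays)  # [2.0, 4.0, 8.0, 16.0, 32.0]
```

With these values the 60s cap is never reached before the service is marked unhealthy; it only matters for a larger `backoff_seconds` or `max_restarts`.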
The Runtime's restart_failing_services() method re-registers, re-wires, and re-starts failing services:
restarted = rt.restart_failing_services()
print(f"Restarted: {restarted}") # ['fee', 'risk']
The platform uses two layers of circuit breakers:
The bus-level breaker tracks failures per subscriber ID (not per service). After 5 consecutive failures, the circuit opens for 60 seconds. While open, events are sent directly to the DLQ without invoking the handler. A successful call in the half-open state closes the circuit again. The store-level breaker uses per-backend thresholds:
| Store | Failure Threshold | Recovery Timeout |
|---|---|---|
| PostgresStore | 3 | 15 seconds |
| FileStore | 3 | 30 seconds |
When tripped, all store operations raise CircuitBreakerOpenError. Check state via health:
underwrite health
# [OK] store — circuit=closed
Or programmatically:
from underwrite.__circuit__ import CircuitBreaker, CircuitState
cb = CircuitBreaker(failure_threshold=3, recovery_timeout=15.0)
# cb.state → CircuitState.CLOSED / OPEN / HALF_OPEN
Recovery is automatic after the cooldown period. No manual reset required.
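The closed, open and half-open cycle can be sketched as a small state machine. This is an illustration only: `SketchBreaker` and `State` are hypothetical names and are not the `CircuitBreaker` class shipped with the platform. The injectable `clock` argument is an assumption added so the cooldown can be exercised without sleeping.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class SketchBreaker:
    """Illustrative breaker: opens after N consecutive failures, recovers after a cooldown."""

    def __init__(self, failure_threshold=3, recovery_timeout=15.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    @property
    def state(self) -> State:
        if self._opened_at is None:
            return State.CLOSED
        if self._clock() - self._opened_at >= self.recovery_timeout:
            return State.HALF_OPEN  # cooldown elapsed: a trial call is allowed
        return State.OPEN

    def record_success(self) -> None:
        # Any success, including the half-open trial, closes the circuit.
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self.state is State.HALF_OPEN or self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # (re)open and restart the cooldown
```

A caller would check `state` before each operation, skip the call while it is `OPEN`, and report the outcome through `record_success()` or `record_failure()`.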
The DLQ captures events that failed processing (handler exceptions, rate limiting, open circuits).
underwrite dlq
Output:
Dead-letter queue: 3 entries
[1717785600.0] subscriber-id: fee.assess — ProtocolError: must be finite
[1717785601.5] subscriber-id: risk.scored — RateLimitError: rate limit exceeded
[1717785602.0] subscriber-id: fraud.alert — CircuitBreakerOpenError: circuit is open
underwrite dlq --replay # replay all
underwrite dlq --replay --max 10 # replay at most 10
Replayed events are re-published to the bus. Services with idempotency guards skip duplicate events. The DLQ is bounded at 10,000 records (oldest evicted first).
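A bounded queue with oldest-first eviction and capped replay might look like the sketch below. `BoundedDLQ` and `DeadLetter` are hypothetical names, and removing a record from the queue once it has been replayed is an assumption; the platform's DLQ may retain replayed records.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DeadLetter:
    subscriber_id: str
    error: str


class BoundedDLQ:
    def __init__(self, max_records: int = 10_000):
        # A deque with maxlen silently drops the oldest item when full.
        self._records = deque(maxlen=max_records)

    def add(self, record: DeadLetter) -> None:
        self._records.append(record)

    @property
    def count(self) -> int:
        return len(self._records)

    def replay(self, publish: Callable[[DeadLetter], None],
               max_events: Optional[int] = None) -> int:
        """Re-publish up to max_events records, oldest first; return how many were sent."""
        limit = self.count if max_events is None else min(max_events, self.count)
        for _ in range(limit):
            publish(self._records.popleft())
        return limit
```

Passing `max_events=10` corresponds to the `--max 10` flag; omitting it replays everything currently queued.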
When FileStore or PostgresStore is used, the DLQ persists across restarts:
- FileStore: data/bus/dlq.json
- PostgresStore: dead_letters table
from underwrite.__runtime__ import Runtime
rt = Runtime()
dlq = rt.bus.dlq
print(dlq.count)  # number of dead-lettered events
for record in dlq.records:
    print(record.event.event_type, record.error)
dlq.clear()  # discard all records
The migration engine applies pending schema changes on startup (auto-migrate enabled by default).
# Run pending migrations manually
underwrite migrate
# Check applied versions (Postgres)
psql $DATABASE_URL -c "SELECT * FROM migrations ORDER BY version;"
Current migrations (defined in __migrate__.py):
| Version | Description |
|---|---|
| 1 | Initial store schema — key-value table, migrations table |
| 2 | Dead-letter queue table |
| 3 | Metrics snapshot table |
-- Rollback version 3
DROP TABLE IF EXISTS metrics_snapshots;
DELETE FROM migrations WHERE version = 3;
-- Rollback version 2
DROP TABLE IF EXISTS dead_letters;
DELETE FROM migrations WHERE version = 2;
After rollback, underwrite migrate re-applies the migration.
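The apply, roll back and re-apply cycle can be sketched with a version table. This example uses an in-memory SQLite database as a stand-in for Postgres; the `MIGRATIONS` list and its DDL are invented for illustration and do not reflect the real contents of `__migrate__.py`.

```python
import sqlite3

# Hypothetical migrations mirroring the version table above; the DDL is illustrative.
MIGRATIONS = [
    (1, "CREATE TABLE store (key TEXT PRIMARY KEY, value TEXT)"),
    (2, "CREATE TABLE dead_letters (id INTEGER PRIMARY KEY, payload TEXT)"),
    (3, "CREATE TABLE metrics_snapshots (id INTEGER PRIMARY KEY, payload TEXT)"),
]


def migrate(conn: sqlite3.Connection) -> list[int]:
    """Apply migrations whose version is not yet recorded; return the versions applied."""
    conn.execute("CREATE TABLE IF NOT EXISTS migrations (version INTEGER PRIMARY KEY)")
    applied = {row[0] for row in conn.execute("SELECT version FROM migrations")}
    newly_applied = []
    for version, ddl in MIGRATIONS:
        if version in applied:
            continue
        with conn:  # commit the bookkeeping row once the DDL has succeeded
            conn.execute(ddl)
            conn.execute("INSERT INTO migrations (version) VALUES (?)", (version,))
        newly_applied.append(version)
    return newly_applied
```

Because only unrecorded versions run, deleting a row from `migrations` (as in the rollback SQL above) is what makes the next run re-apply that version.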
The platform tracks asset quality per RBI Master Circular on Income Recognition and Asset Classification (IRAC):
| Classification | Trigger | Event | Action |
|---|---|---|---|
| SMA-0 | 30 days past due | sma.classified | Alert relationship manager |
| SMA-1 | 60 days past due | sma.classified | Initiate collection |
| SMA-2 | 90 days past due | sma.classified | Prepare NPA report |
| NPA (Substandard) | 91-180 days | npa.bucket.changed | Provision at 15%, suspend income recognition |
| NPA (Doubtful) | 181-360 days | npa.bucket.changed | Provision at 25% (secured) |
| NPA (Loss) | >360 days | npa.bucket.changed | Provision at 100% |
| DLG Trigger | 120+ days | npa.dlg.triggered | Invoke default loss guarantee |
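As a rough illustration of how the table could translate into code, the sketch below maps days past due to a bucket and a provision amount. It assumes the SMA day counts are inclusive upper bounds (SMA-0 covers 1 to 30 days), which keeps the buckets contiguous with the 91-180 day Substandard range. The `Standard` label and all function names are hypothetical and not part of the platform's API.

```python
import bisect

# Inclusive upper bound, in days past due, of each bucket below Loss.
# Assumption: SMA triggers are read as upper bounds (SMA-0 = 1-30, SMA-1 = 31-60, SMA-2 = 61-90).
_BOUNDS = [0, 30, 60, 90, 180, 360]
_LABELS = [
    "Standard",  # hypothetical label for loans that are not overdue
    "SMA-0", "SMA-1", "SMA-2",
    "NPA (Substandard)", "NPA (Doubtful)", "NPA (Loss)",
]
# Provision rates from the table; the Doubtful rate is the secured-loan figure.
PROVISION_RATE = {"NPA (Substandard)": 0.15, "NPA (Doubtful)": 0.25, "NPA (Loss)": 1.00}


def classify(days_past_due: int) -> str:
    """Return the bucket label for a loan that is days_past_due days overdue."""
    return _LABELS[bisect.bisect_left(_BOUNDS, days_past_due)]


def provision_amount(outstanding: float, days_past_due: int) -> float:
    """Provision required for the outstanding amount; zero for non-NPA buckets."""
    return outstanding * PROVISION_RATE.get(classify(days_past_due), 0.0)


def dlg_triggered(days_past_due: int) -> bool:
    """The default loss guarantee row applies from 120 days past due."""
    return days_past_due >= 120
```

For example, `classify(95)` falls in the Substandard bucket and `provision_amount(100_000, 95)` applies the 15% rate, while the DLG check stays false until day 120.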
Monitor NPA ratios:
# Check current NPA classification counts
underwrite health | grep npa
# Expected output:
# service:npa — events_handled=42 sma0=5 sma1=2 sma2=1 npa_substandard=1
Monitor that all loans fall within RBI-mandated rate caps:
underwrite metrics | grep -E "pricing|caps|penal"
Key metrics:
- pricing.rate_caps: count of rate cap applications
- pricing.penal_interest: penal interest assessments (should be ≤24% p.a.)
- pricing.foreclosure: foreclosure charge computations (0% for personal/home loans)
All consent lifecycle events are recorded in the audit ledger:
# Query consent events (requires store inspection)
underwrite dlq | grep consent
Monitor consent expiry and withdrawal rates to ensure DPDPA compliance.
The platform tracks DSR response times. If a DSR exceeds 30 days:
- dsr.fulfilled is not emitted; check underwrite dlq for pending requests
- Escalate via grievance.logged event
- Manually verify DPO notification (configured via dpdpa.dsr.dpo_email)
When breach.detected fires:
- Identify scope via audit log: underwrite health and check store
- Notify Data Protection Board within 72 hours (configurable)
- Record breach closure via breach.closed event
- Document in breach register
| Report | Frequency | Data Source | Notes |
|---|---|---|---|
| NPA classification | Monthly | npa.bucket.changed events | RBI return on asset quality |
| Capital adequacy | Quarterly | Store aggregation | Leverage ratio monitoring |
| Interest rate disclosure | Monthly | Pricing service | Rate cap compliance report |
| KFS issuance log | Daily | KFS service | Cooling-off period tracking |
| Consent register | Monthly | Consent service | DPDPA compliance audit |
| Grievance register | Monthly | DSR service | DPDPA Section 13 compliance |
| Credit bureau data submission | Weekly | Credit bureau service | CIBIL/Experian/Equifax data refresh |
underwrite health
HTTP health endpoints (requires underwrite serve):
| Endpoint | Path |
|---|---|
| Liveness probe | GET /healthz |
| Readiness probe | GET /readyz |
| Full status | GET /v1/health |
| Legacy | GET /health |
underwrite metrics
Counters: events.emitted, events.handled, events.failed, store.corruption, store.io_error, authz.failures
Timers: handle.duration (per-service, per-event-type with count/avg/min/max)
HTTP: GET /v1/metrics returns Prometheus text format (requires underwrite[serve]).
Configure via environment:
export UNDERWRITE_LOG_LEVEL=DEBUG
export UNDERWRITE_LOG_FORMAT=json
JSON format includes timestamp, level, logger, message, module, line, correlation_id, trace_id. Sensitive fields (SSN, PAN, tokens, passwords) are automatically redacted.
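For readers who want a feel for what JSON logging with redaction involves, here is a minimal sketch built on the standard `logging` module. It is not the platform's formatter: `JsonFormatter`, `redact`, the `context` attribute and the exact set of sensitive key names are all assumptions made for illustration.

```python
import json
import logging

# Assumed key names; the platform's actual redaction rules are not documented here.
SENSITIVE_KEYS = {"ssn", "pan", "token", "api_token", "password"}


def redact(fields: dict) -> dict:
    """Replace the values of sensitive keys with a fixed placeholder."""
    return {
        key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        for key, value in fields.items()
    }


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "module": record.module,
            "line": record.lineno,
        }
        # Extra context (e.g. correlation_id) is expected on a `context` attribute,
        # which callers can set with logging's `extra={"context": {...}}`.
        payload.update(redact(getattr(record, "context", {})))
        return json.dumps(payload)
```

Attaching the formatter to a handler with `handler.setFormatter(JsonFormatter())` yields one JSON object per line, which is the shape tools such as `jq` expect.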
OpenTelemetry distributed tracing:
{
"tracing": {
"enabled": true,
"exporter": "otlp"
}
}
Requires underwrite[otlp] extra. Console exporter is also available for development.
Data is stored as individual JSON files in data/:
# Backup
tar czf underwrite-data-$(date +%Y%m%d).tar.gz data/
# Restore
tar xzf underwrite-data-20260608.tar.gz
Keys map to file paths: saga:<id> → data/saga/<id>.json.
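The key-to-path mapping is simple enough to sketch, which can help when locating a specific record inside a backup. `key_to_path` is a hypothetical helper, and the rejection of path separators in the ID is an added safety assumption rather than documented FileStore behaviour.

```python
from pathlib import Path


def key_to_path(key: str, root: Path = Path("data")) -> Path:
    """Map '<namespace>:<id>' to '<root>/<namespace>/<id>.json'."""
    namespace, sep, identifier = key.partition(":")
    if not sep or not namespace or not identifier:
        raise ValueError(f"expected '<namespace>:<id>', got {key!r}")
    if "/" in identifier or "\\" in identifier:
        # Assumed guard: keep every record inside its namespace directory.
        raise ValueError(f"identifier must not contain path separators: {identifier!r}")
    return root / namespace / f"{identifier}.json"
```

For example, `key_to_path("saga:abc-123")` resolves to `data/saga/abc-123.json` relative to the working directory.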
pg_dump $DATABASE_URL -t store -t migrations -t dead_letters -t metrics_snapshots > underwrite-backup.sql
Sagas that were interrupted by a crash can be replayed:
from underwrite.__runtime__ import Runtime
rt = Runtime()
success = rt.replay_saga("saga-id-here")
replay_saga() finds the next unexecuted step after the last completed one and executes all remaining steps. Idempotency keys ensure no step is executed twice.
Saga status values: started → completed (success), or compensating → rolled_back (failure).
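The idea behind idempotent replay can be sketched as follows. This is not the Runtime's `replay_saga()` implementation: `replay_steps`, the step tuples and the in-memory `completed` set are hypothetical stand-ins for the saga definition and the idempotency keys persisted in the store.

```python
from typing import Callable


def replay_steps(steps: list[tuple[str, Callable[[], None]]], completed: set[str]) -> list[str]:
    """Run every step whose idempotency key is not yet recorded; return the keys executed."""
    executed = []
    for key, action in steps:
        if key in completed:
            continue  # finished before the crash, so skip it
        action()
        completed.add(key)  # record the key only after the step succeeds
        executed.append(key)
    return executed
```

Calling it a second time with the same `completed` set executes nothing, which is the property that makes replay safe to repeat.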
After fixing the root cause (e.g., misconfiguration, missing env var), replay failed events:
underwrite dlq --replay
Manually restart a failing service via the Runtime:
rt.restart_failing_services()
underwrite health
If degraded, check individual checks: bus, store, service:<name>, supervisor.
underwrite dlq
Look for patterns: all errors from one service, rate limiting, circuit open.
UNDERWRITE_LOG_LEVEL=DEBUG underwrite run <service>
With JSON logging:
underwrite serve --port 8080 | jq 'select(.level == "ERROR")'
underwrite health | grep circuit
If circuits are open, wait for automatic recovery (15–60s depending on component).
| Issue | Action |
|---|---|
| Circuit breaker open | Wait for cooldown, or check store connectivity |
| DLQ accumulating | Fix handler error, then underwrite dlq --replay |
| Service crash-looping | Check logs, increase max_restarts or disable auto_restart |
| Saga stuck in started | rt.replay_saga(id) to retry |
| Migration failed | SELECT * FROM migrations, rollback failed version, fix and re-migrate |
| Store connection lost | Check DB endpoint, credentials, network policy |
| Signature verification failures | Check authz policy file and service identities |
| Command | Description |
|---|---|
| underwrite init [path] | Create default config file |
| underwrite run <service>... | Start services in foreground |
| underwrite serve | Start HTTP daemon with health/metrics endpoints |
| underwrite list | List all available services |
| underwrite health | Show health status |
| underwrite metrics | Show metrics snapshot |
| underwrite dlq | Show dead-letter queue |
| underwrite dlq --replay | Replay dead-letter events |
| underwrite migrate | Run pending migrations |
| underwrite identity <service> | Generate Ed25519 identity for a service |
Install extras with pip install underwrite[<extra>]:
| Extra | Provides |
|---|---|
| serve | FastAPI + uvicorn HTTP server |
| postgres | PostgreSQL store backend |
| otlp | OpenTelemetry distributed tracing |
| risk | NumPy + scikit-learn for risk scoring |
| vault | HashiCorp Vault secrets backend |
| aws | AWS Secrets Manager / S3 / SQS backends |
| gcs | Google Cloud Storage backend |
| dev | Pytest, ruff, mypy, bandit, testcontainers |
| mutation | Mutation testing (mutmut) |
| security | Bandit + pip-audit |
| all | All extras combined |
When the serve extra is installed, GET /v1/metrics exposes runtime and service metrics in Prometheus text format (content type text/plain; version=0.0.4). FastAPI can also be instrumented with OpenTelemetry via opentelemetry-instrumentation-fastapi.
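For orientation, the Prometheus text format is line-oriented: a `# TYPE` comment followed by `name value` samples. The sketch below renders a dictionary of counters in that shape. `render_counters`, the `underwrite_` prefix and the `_total` suffix are assumptions for illustration; the metric names the endpoint actually emits may differ.

```python
def render_counters(counters: dict[str, int], prefix: str = "underwrite") -> str:
    """Render counters as Prometheus text exposition lines, sorted by name."""
    lines = []
    for name, value in sorted(counters.items()):
        # Prometheus metric names cannot contain dots, so map them to underscores.
        metric = f"{prefix}_{name.replace('.', '_')}_total"
        lines.append(f"# TYPE {metric} counter")
        lines.append(f"{metric} {value}")
    return "\n".join(lines) + "\n"


print(render_counters({"events.emitted": 3, "events.failed": 1}), end="")
```

Under these naming assumptions, the counter `events.failed` would appear to a scraper as `underwrite_events_failed_total`.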