A practical SRE tool for tracking SLO error budgets, burn rates, and incident impact.
Built with Python, featuring both CLI and web dashboard interfaces.
- Error Budget Calculator: Calculate remaining budget based on SLI data
- Multi-Window Burn Rate Detection: Fast burn (1h), medium (6h), slow (24h), trend (3d)
- Incident Impact Analysis: Correlate incidents with budget consumption
- CLI Interface: Rich terminal output with colored tables
- Web Dashboard: Interactive charts with Plotly
```bash
# Install dependencies
cd error-budget-tracker
pip install -r requirements.txt

# View SLO status
python main.py status

# View detailed info for a specific SLO
python main.py detail api-availability

# View active alerts
python main.py alerts

# View incident impact
python main.py incidents

# List available SLOs
python main.py list

# Launch web dashboard
python main.py dashboard
```

Project structure:

```
error-budget-tracker/
├── main.py               # Entry point
├── requirements.txt      # Dependencies
├── data/
│   └── mock_data.py      # Mock data generator (replace with Datadog API)
├── src/
│   ├── calculator.py     # Error budget calculation logic
│   ├── burn_rate.py      # Multi-window burn rate detection
│   ├── cli.py            # Command-line interface
│   └── web.py            # Flask web dashboard
└── templates/
    └── dashboard.html    # Web dashboard template
```
Error Budget = 100% - SLO Target
For example:
- 99.9% SLO → 0.1% error budget → 43.2 minutes/month
- 99.5% SLO → 0.5% error budget → 3.6 hours/month
- 99.0% SLO → 1.0% error budget → 7.2 hours/month
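The conversion above can be sketched in a few lines (a minimal example assuming a 30-day month; `error_budget_minutes` is an illustrative name, not a function in this repo):

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unreliability for a given SLO target over the window."""
    budget_fraction = (100.0 - slo_target) / 100.0
    return budget_fraction * window_days * 24 * 60

print(round(error_budget_minutes(99.9), 1))       # 43.2 (minutes/month)
print(round(error_budget_minutes(99.5) / 60, 1))  # 3.6 (hours/month)
```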
Burn rate measures how fast you're consuming your error budget:
- 1.0x: Consuming at expected rate (will use 100% by end of window)
- 2.0x: Consuming twice as fast (will exhaust in half the time)
- 10.0x: Critical - will exhaust budget in 1/10th of the time
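A burn rate is just actual consumption divided by the even-spend rate for the window. A minimal sketch (illustrative names, assuming a 30-day SLO window; not the tool's internal API):

```python
def burn_rate(budget_consumed_pct: float, window_hours: float,
              slo_window_hours: float = 30 * 24) -> float:
    """How many times faster than 'even spend' the budget is being consumed."""
    # With even spend, a 1-hour slice of a 720-hour month uses 100/720 ≈ 0.14%.
    expected_pct = 100.0 * window_hours / slo_window_hours
    return budget_consumed_pct / expected_pct

# Burning 2% of the monthly budget in a single hour is a ~14.4x burn rate.
print(round(burn_rate(2.0, 1.0), 1))  # 14.4
```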
Based on Google SRE principles, this tool uses multiple windows to catch different scenarios:
| Window | Budget Threshold | Severity | Use Case |
|---|---|---|---|
| 1 hour | 2% | Critical | Severe outages |
| 6 hours | 5% | Warning | Significant issues |
| 24 hours | 10% | Warning | Gradual degradation |
| 3 days | 20% | Info | Long-term trends |
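The table above amounts to a simple threshold check per window. A hypothetical sketch (`WINDOWS` and `check_alerts` are illustrative names, not this repo's actual `BURN_RATE_WINDOWS` structure):

```python
# (window hours, budget-consumed threshold in %, severity), mirroring the table
WINDOWS = [
    (1, 2.0, "critical"),
    (6, 5.0, "warning"),
    (24, 10.0, "warning"),
    (72, 20.0, "info"),
]

def check_alerts(consumed_pct_by_window):
    """Return (window_hours, severity) for every window over its threshold."""
    return [
        (hours, severity)
        for hours, threshold, severity in WINDOWS
        if consumed_pct_by_window.get(hours, 0.0) >= threshold
    ]

# 2.5% burned in the last hour and 12% in the last day trips two alerts.
print(check_alerts({1: 2.5, 6: 1.0, 24: 12.0, 72: 5.0}))
# [(1, 'critical'), (24, 'warning')]
```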
To use real data, modify `data/mock_data.py`:

```python
from datadog_api_client import ApiClient, Configuration
from datadog_api_client.v1.api.service_level_objectives_api import ServiceLevelObjectivesApi

def get_real_slo_data():
    configuration = Configuration()
    configuration.api_key["apiKeyAuth"] = "YOUR_API_KEY"
    configuration.api_key["appKeyAuth"] = "YOUR_APP_KEY"
    with ApiClient(configuration) as api_client:
        api = ServiceLevelObjectivesApi(api_client)
        response = api.list_slos()
        return response.data
```

The mock data simulates:
- 6 SLOs (API, Ticket Processing, Webhook, Search, Chat)
- Realistic traffic patterns (higher during business hours)
- Incident-like bad periods with increased error rates
- Various incident severities and root causes
- Deploy history
- `status`: Shows all SLOs with budget remaining, burn rate, and status.
- `detail <slo-id>`: Deep dive into a specific SLO with all metrics.
- `alerts`: Shows all active burn rate alerts across windows.
- `incidents`: Analyzes incidents and their error budget impact.
- `dashboard`: Launches the web UI at http://localhost:5000.
- To add a new SLO, edit `data/mock_data.py` and add an entry to the `SLOS` dict.
- To change the burn rate windows, edit `src/burn_rate.py` and modify `BURN_RATE_WINDOWS`.
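For illustration, a new entry might look like this (the exact keys are an assumption about the mock-data schema, not the repo's actual fields):

```python
# Hypothetical shape of an SLOS entry in data/mock_data.py; keys are assumed.
SLOS = {
    "report-export": {
        "name": "Report Export Availability",
        "target": 99.5,      # SLO target in percent
        "window_days": 30,   # rolling SLO window
    },
}

print(SLOS["report-export"]["target"])  # 99.5
```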
- Install the client: `pip install datadog-api-client`
- Replace the mock data functions with Datadog API calls
- Map the Datadog SLO response to the expected format
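The mapping step might look roughly like this (field names on the tracker side are assumptions about the expected format, and the Datadog response shape should be checked against the client docs):

```python
from types import SimpleNamespace

def map_slo(dd_slo):
    """Map one Datadog SLO object to the tracker's assumed dict format."""
    return {
        "id": dd_slo.id,
        "name": dd_slo.name,
        "target": dd_slo.thresholds[0].target,  # first threshold's target %
    }

# Stubbed Datadog-like object for illustration:
slo = SimpleNamespace(id="abc123", name="API Availability",
                      thresholds=[SimpleNamespace(target=99.9)])
print(map_slo(slo))  # {'id': 'abc123', 'name': 'API Availability', 'target': 99.9}
```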
This tool is designed to be portable. The prompts used to build it can be reused to:
- Connect to your real Datadog instance
- Pull SLO definitions from your environment
- Integrate with your incident management system
- Add Spinnaker/GitHub Actions deploy correlation