Error Budget Tracker

A practical SRE tool for tracking SLO error budgets, burn rates, and incident impact.

Built with Python, featuring both CLI and web dashboard interfaces.

Features

Error Budget Calculator: Calculate remaining budget based on SLI data
Multi-Window Burn Rate Detection: Fast burn (1h), medium (6h), slow (24h), trend (3d)
Incident Impact Analysis: Correlate incidents with budget consumption
CLI Interface: Rich terminal output with colored tables
Web Dashboard: Interactive charts with Plotly

Quick Start

# Install dependencies
cd error-budget-tracker
pip install -r requirements.txt

# View SLO status
python main.py status

# View detailed info for a specific SLO
python main.py detail api-availability

# View active alerts
python main.py alerts

# View incident impact
python main.py incidents

# List available SLOs
python main.py list

# Launch web dashboard
python main.py dashboard

Project Structure

error-budget-tracker/
├── main.py              # Entry point
├── requirements.txt     # Dependencies
├── data/
│   └── mock_data.py     # Mock data generator (replace with Datadog API)
├── src/
│   ├── calculator.py    # Error budget calculation logic
│   ├── burn_rate.py     # Multi-window burn rate detection
│   ├── cli.py           # Command-line interface
│   └── web.py           # Flask web dashboard
└── templates/
    └── dashboard.html   # Web dashboard template

Understanding Error Budgets

What is an Error Budget?

Error Budget = 100% - SLO Target

For example:

99.9% SLO → 0.1% error budget → 43.2 minutes/month
99.5% SLO → 0.5% error budget → 3.6 hours/month
99.0% SLO → 1.0% error budget → 7.2 hours/month
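The figures above assume a 30-day month (43,200 minutes). A minimal sketch of that arithmetic follows; `allowed_downtime_minutes` is an illustrative helper name and is not taken from this repository.

```python
def allowed_downtime_minutes(slo_target_pct: float, period_days: int = 30) -> float:
    """Return the downtime, in minutes, that an SLO target permits over the period."""
    if not 0 < slo_target_pct <= 100:
        raise ValueError("slo_target_pct must be in the range (0, 100]")
    # Error budget = 100% - SLO target, expressed here as a fraction.
    error_budget_fraction = (100.0 - slo_target_pct) / 100.0
    return error_budget_fraction * period_days * 24 * 60


for target in (99.9, 99.5, 99.0):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% SLO -> {minutes:.1f} minutes ({minutes / 60:.2f} hours)")
```

Changing `period_days` shows how the same target yields a different absolute budget for a 7-day or 90-day SLO period.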

Burn Rate

Burn rate measures how fast you're consuming your error budget:

1.0x: Consuming at exactly the expected rate (will use 100% of the budget by the end of the SLO period)
2.0x: Consuming twice as fast (will exhaust the budget in half the SLO period)
10.0x: Critical (will exhaust the budget in 1/10th of the SLO period)
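One common way to express this is the observed error rate divided by the error rate the SLO allows. The sketch below assumes event-based SLIs (good and bad request counts) and a 30-day SLO period; the function names are illustrative and may not match what `src/burn_rate.py` does.

```python
def burn_rate(bad_events: int, total_events: int, slo_target_pct: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if total_events <= 0:
        return 0.0  # no traffic means no budget was consumed
    allowed_error_rate = (100.0 - slo_target_pct) / 100.0
    if allowed_error_rate <= 0:
        raise ValueError("a 100% SLO target leaves no error budget")
    observed_error_rate = bad_events / total_events
    return observed_error_rate / allowed_error_rate


def days_to_exhaustion(rate: float, period_days: float = 30.0) -> float:
    """Days until a full budget is spent if the burn rate stays constant."""
    return float("inf") if rate <= 0 else period_days / rate


# 20 failures out of 10,000 requests against a 99.9% target.
rate = burn_rate(bad_events=20, total_events=10_000, slo_target_pct=99.9)
print(f"burn rate: {rate:.1f}x, budget lasts {days_to_exhaustion(rate):.1f} days")
```

With these inputs the observed error rate (0.2%) is twice the allowed rate (0.1%), which matches the 2.0x case in the list above.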

Multi-Window Alerting

Based on Google SRE principles, this tool uses multiple windows to catch different scenarios:

Window	Budget Consumed (Threshold)	Severity	Use Case
1 hour	2%	Critical	Severe outages
6 hours	5%	Warning	Significant issues
24 hours	10%	Warning	Gradual degradation
3 days	20%	Info	Long-term trends
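Each row in the table can be converted into an equivalent burn-rate threshold. The sketch below assumes that the percentage is the share of the total budget consumed within the window and that the SLO period is 30 days; `WINDOWS` and `burn_rate_threshold` are illustrative names, not the contents of `BURN_RATE_WINDOWS`.

```python
from datetime import timedelta

SLO_PERIOD = timedelta(days=30)  # assumption: 30-day SLO period

# (window, fraction of budget consumed within the window, severity)
WINDOWS = [
    (timedelta(hours=1), 0.02, "critical"),
    (timedelta(hours=6), 0.05, "warning"),
    (timedelta(hours=24), 0.10, "warning"),
    (timedelta(days=3), 0.20, "info"),
]


def burn_rate_threshold(window: timedelta, budget_fraction: float,
                        period: timedelta = SLO_PERIOD) -> float:
    """Burn rate that consumes `budget_fraction` of the budget within `window`."""
    return budget_fraction * (period / window)


for window, fraction, severity in WINDOWS:
    threshold = burn_rate_threshold(window, fraction)
    print(f"{window} window, {fraction:.0%} of budget -> "
          f"alert above {threshold:.1f}x ({severity})")
```

Under these assumptions the four windows correspond to burn rates of roughly 14.4x, 6x, 3x, and 2x, so shorter windows only fire on much faster consumption.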

Connecting to Datadog

To use real data, modify data/mock_data.py:

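import os

from datadog_api_client import ApiClient, Configuration
from datadog_api_client.v1.api.service_level_objectives_api import ServiceLevelObjectivesApi

def get_real_slo_data():
    """Fetch SLO definitions from Datadog (requires datadog-api-client)."""
    configuration = Configuration()
    # Read credentials from the environment instead of hardcoding them.
    configuration.api_key["apiKeyAuth"] = os.environ["DD_API_KEY"]
    configuration.api_key["appKeyAuth"] = os.environ["DD_APP_KEY"]

    with ApiClient(configuration) as api_client:
        api = ServiceLevelObjectivesApi(api_client)
        response = api.list_slos()
        # response.data holds the SLO objects returned by the API.
        return response.data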

Mock Data

The mock data simulates:

6 SLOs covering services such as API, Ticket Processing, Webhook, Search, and Chat
Realistic traffic patterns (higher during business hours)
Incident-like bad periods with increased error rates
Various incident severities and root causes
Deploy history

CLI Commands

`status`

Shows all SLOs with budget remaining, burn rate, and status.

`detail <slo_id>`

Deep dive into a specific SLO with all metrics.

`alerts`

Shows all active burn rate alerts across windows.

`incidents`

Analyzes incidents and their error budget impact.
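As a rough illustration of what "error budget impact" can mean, the sketch below expresses one incident's failed events as a percentage of the bad events the SLO allows over the period. The `Incident` dataclass and its fields are hypothetical and may differ from the data model the tool uses.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    incident_id: str   # hypothetical identifier
    bad_events: int    # failed requests attributed to the incident


def budget_impact_pct(incident: Incident, total_events: int,
                      slo_target_pct: float) -> float:
    """Share of the period's error budget consumed by one incident, in percent."""
    allowed_bad_events = total_events * (100.0 - slo_target_pct) / 100.0
    if allowed_bad_events <= 0:
        raise ValueError("no error budget available for this SLO and traffic")
    return 100.0 * incident.bad_events / allowed_bad_events


# 250 failed requests in a period with 1,000,000 requests and a 99.9% target.
incident = Incident(incident_id="INC-EXAMPLE", bad_events=250)
impact = budget_impact_pct(incident, total_events=1_000_000, slo_target_pct=99.9)
print(f"{incident.incident_id} consumed {impact:.1f}% of the error budget")
```

Here the SLO allows about 1,000 bad events, so an incident with 250 failures accounts for about a quarter of the budget.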

`dashboard`

Launches the web dashboard at http://localhost:5000

Extending the Tool

Add New SLOs

Edit data/mock_data.py and add to the SLOS dict.

Custom Burn Rate Windows

Edit src/burn_rate.py and modify BURN_RATE_WINDOWS.

Add Datadog Integration

Install: pip install datadog-api-client
Replace mock data functions with Datadog API calls
Map Datadog SLO response to the expected format
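The mapping step might look like the sketch below. It assumes each Datadog SLO is available as a plain dict with `id`, `name`, and a `thresholds` list of `timeframe`/`target` pairs, as in Datadog's SLO API payload. The output keys are hypothetical, because the format the tracker expects is not documented here; adjust them to match the entries in the SLOS dict.

```python
def map_datadog_slo(slo: dict, timeframe: str = "30d") -> dict:
    """Pick the target for one timeframe and return a tracker-style record."""
    for threshold in slo.get("thresholds", []):
        if threshold.get("timeframe") == timeframe:
            target = float(threshold["target"])
            break
    else:
        raise ValueError(f"SLO {slo.get('id')!r} has no {timeframe} threshold")

    # Output keys are illustrative, not the tool's confirmed schema.
    return {
        "id": slo["id"],
        "name": slo["name"],
        "target": target,
        "window_days": int(timeframe.rstrip("d")),
        "error_budget_pct": round(100.0 - target, 6),
    }


sample = {
    "id": "abc123",
    "name": "API availability",
    "thresholds": [
        {"timeframe": "7d", "target": 99.5},
        {"timeframe": "30d", "target": 99.9},
    ],
}
print(map_datadog_slo(sample))
```

Raising on a missing timeframe keeps a misconfigured SLO from silently showing up with a wrong target.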

Use at Work

This tool is designed to be portable. The prompts used to build it can be reused to:

Connect to your real Datadog instance
Pull SLO definitions from your environment
Integrate with your incident management system
Add Spinnaker/GitHub Actions deploy correlation
