anandibhat/error-budget-tracker

Error Budget Tracker

A practical SRE tool for tracking SLO error budgets, burn rates, and incident impact.

Built with Python, featuring both CLI and web dashboard interfaces.

Features

  • Error Budget Calculator: Calculate remaining budget based on SLI data
  • Multi-Window Burn Rate Detection: Fast burn (1h), medium (6h), slow (24h), trend (3d)
  • Incident Impact Analysis: Correlate incidents with budget consumption
  • CLI Interface: Rich terminal output with colored tables
  • Web Dashboard: Interactive charts with Plotly

Quick Start

# Install dependencies
cd error-budget-tracker
pip install -r requirements.txt

# View SLO status
python main.py status

# View detailed info for a specific SLO
python main.py detail api-availability

# View active alerts
python main.py alerts

# View incident impact
python main.py incidents

# List available SLOs
python main.py list

# Launch web dashboard
python main.py dashboard

Project Structure

error-budget-tracker/
├── main.py              # Entry point
├── requirements.txt     # Dependencies
├── data/
│   └── mock_data.py     # Mock data generator (replace with Datadog API)
├── src/
│   ├── calculator.py    # Error budget calculation logic
│   ├── burn_rate.py     # Multi-window burn rate detection
│   ├── cli.py           # Command-line interface
│   └── web.py           # Flask web dashboard
└── templates/
    └── dashboard.html   # Web dashboard template

Understanding Error Budgets

What is an Error Budget?

Error Budget = 100% - SLO Target

For example, over a 30-day month:

  • 99.9% SLO → 0.1% error budget → 43.2 minutes/month
  • 99.5% SLO → 0.5% error budget → 3.6 hours/month
  • 99.0% SLO → 1.0% error budget → 7.2 hours/month
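The arithmetic behind these figures can be sketched in a few lines. This is an illustration only, assuming a 30-day month; `error_budget_minutes` is a hypothetical helper and not a function from this repository.

```python
def error_budget_minutes(slo_target_pct: float, window_days: int = 30) -> float:
    """Allowed 'bad' time in minutes for an SLO target over a window."""
    # Error budget = 100% - SLO target, expressed as a fraction.
    budget_fraction = (100.0 - slo_target_pct) / 100.0
    return budget_fraction * window_days * 24 * 60


for target in (99.9, 99.5, 99.0):
    minutes = error_budget_minutes(target)
    print(f"{target}% SLO -> {minutes:.1f} minutes ({minutes / 60:.1f} hours)")
```

Changing `window_days` shows how the same target translates to a different allowance for a 7-day or 90-day window.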

Burn Rate

Burn rate measures how fast you're consuming your error budget:

  • 1.0x: Consuming at exactly the budgeted rate (the budget runs out right at the end of the SLO window)
  • 2.0x: Consuming twice as fast (the budget runs out in half the window)
  • 10.0x: Critical; the budget runs out in 1/10th of the window (3 days of a 30-day window)
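One common way to compute a burn rate is to divide the observed error rate by the error rate the SLO allows. The sketch below assumes that definition and a 30-day window; the function names are hypothetical and may not match what src/burn_rate.py does.

```python
def burn_rate(observed_error_rate: float, slo_target_pct: float) -> float:
    """Ratio of the observed error rate to the error rate the SLO allows."""
    allowed_error_rate = (100.0 - slo_target_pct) / 100.0
    return observed_error_rate / allowed_error_rate


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until a full budget is spent if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")  # nothing is being consumed
    return window_days / rate


# 0.2% of requests failing against a 99.9% SLO (0.1% allowed).
rate = burn_rate(observed_error_rate=0.002, slo_target_pct=99.9)
print(f"burn rate: {rate:.1f}x, budget lasts {days_to_exhaustion(rate):.1f} days")
```

With these inputs the burn rate is about 2x, so a full 30-day budget would last roughly 15 days.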

Multi-Window Alerting

Based on Google SRE principles, this tool uses multiple windows to catch different scenarios:

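| Window   | Budget Threshold | Severity | Use Case            |
|----------|------------------|----------|---------------------|
| 1 hour   | 2%               | Critical | Severe outages      |
| 6 hours  | 5%               | Warning  | Significant issues  |
| 24 hours | 10%              | Warning  | Gradual degradation |
| 3 days   | 20%              | Info     | Long-term trends    |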
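One way to read the table above is "alert when a window consumes at least this share of the total budget". Under that reading, and assuming a 30-day SLO period, each row maps to a burn rate threshold. The sketch below is illustrative: `Window`, `WINDOWS`, and both functions are hypothetical, and the real `BURN_RATE_WINDOWS` structure in src/burn_rate.py may differ.

```python
from dataclasses import dataclass

SLO_PERIOD_HOURS = 30 * 24  # assumption: 30-day SLO period


@dataclass(frozen=True)
class Window:
    name: str
    hours: float
    budget_threshold: float  # share of the total budget consumed in this window
    severity: str


# Values copied from the table above; the layout itself is hypothetical.
WINDOWS = [
    Window("1h", 1, 0.02, "critical"),
    Window("6h", 6, 0.05, "warning"),
    Window("24h", 24, 0.10, "warning"),
    Window("3d", 72, 0.20, "info"),
]


def burn_rate_threshold(window: Window) -> float:
    """Burn rate at which the window consumes exactly its budget threshold."""
    return window.budget_threshold * SLO_PERIOD_HOURS / window.hours


def firing_alerts(burn_rates: dict) -> list:
    """Return (window, severity) for every window at or above its threshold."""
    return [
        (w.name, w.severity)
        for w in WINDOWS
        if burn_rates.get(w.name, 0.0) >= burn_rate_threshold(w)
    ]


# Measured burn rates per window (made-up numbers for illustration).
print(firing_alerts({"1h": 16.0, "6h": 7.5, "24h": 2.0, "3d": 1.2}))
```

Under these assumptions the four rows correspond to burn rates of 14.4x, 6x, 3x, and 2x, so the example input fires only the 1-hour and 6-hour alerts.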

Connecting to Datadog

To use real data, modify data/mock_data.py:

import os

from datadog_api_client import ApiClient, Configuration
from datadog_api_client.v1.api.service_level_objectives_api import ServiceLevelObjectivesApi

def get_real_slo_data():
    """Fetch all SLO definitions from Datadog."""
    configuration = Configuration()
    # Read credentials from the environment instead of hardcoding them.
    configuration.api_key["apiKeyAuth"] = os.environ["DD_API_KEY"]
    configuration.api_key["appKeyAuth"] = os.environ["DD_APP_KEY"]

    with ApiClient(configuration) as api_client:
        api = ServiceLevelObjectivesApi(api_client)
        response = api.list_slos()
        return response.data

Mock Data

The mock data simulates:

  • 6 SLOs (including API, Ticket Processing, Webhook, Search, and Chat)
  • Realistic traffic patterns (higher during business hours)
  • Incident-like bad periods with increased error rates
  • Various incident severities and root causes
  • Deploy history
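To give a feel for how such data could be produced, here is a minimal, seeded generator for hourly request counts with a business-hours peak and one incident-like bad hour. It is a sketch under stated assumptions, not the code in data/mock_data.py: the function name, traffic volumes, error rates, and output keys are all invented for illustration.

```python
import random


def mock_hourly_sli(hours: int = 24, seed: int = 0, incident_hours=frozenset({14})):
    """Generate hourly (total, bad) request counts with a daytime traffic peak."""
    rng = random.Random(seed)  # seeded so repeated runs give identical data
    rows = []
    for hour in range(hours):
        hour_of_day = hour % 24
        # Assumption: traffic is roughly 5x higher between 09:00 and 17:00.
        base = 10_000 if 9 <= hour_of_day < 17 else 2_000
        total = base + rng.randint(0, 500)
        # Assumption: 5% errors during an incident hour, 0.05% otherwise.
        error_rate = 0.05 if hour in incident_hours else 0.0005
        rows.append({"hour": hour, "total": total, "bad": round(total * error_rate)})
    return rows


for row in mock_hourly_sli()[12:16]:
    print(row)
```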

CLI Commands

status

Shows all SLOs with budget remaining, burn rate, and status.

detail <slo_id>

Deep dive into a specific SLO with all metrics.

alerts

Shows all active burn rate alerts across windows.

incidents

Analyzes incidents and their error budget impact.
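As a rough illustration of what "budget impact" means, the share of the budget consumed by one incident can be estimated from its duration. The sketch assumes a time-based SLO, a 30-day window, and that the service is fully unavailable for the whole incident; `incident_budget_impact` is a hypothetical helper and the command's actual method may differ.

```python
def incident_budget_impact(
    bad_minutes: float, slo_target_pct: float, window_days: int = 30
) -> float:
    """Percentage of the window's error budget consumed by one incident."""
    budget_minutes = (100.0 - slo_target_pct) / 100.0 * window_days * 24 * 60
    return 100.0 * bad_minutes / budget_minutes


# A 10-minute full outage against a 99.9% SLO (43.2 minutes of budget).
print(f"{incident_budget_impact(10, 99.9):.1f}% of the monthly budget")
```

Here a single 10-minute outage uses a little under a quarter of the monthly budget.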

dashboard

Launches the web UI at http://localhost:5000.

Extending the Tool

Add New SLOs

Edit data/mock_data.py and add to the SLOS dict.

Custom Burn Rate Windows

Edit src/burn_rate.py and modify BURN_RATE_WINDOWS.

Add Datadog Integration

  1. Install: pip install datadog-api-client
  2. Replace mock data functions with Datadog API calls
  3. Map Datadog SLO response to the expected format
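Step 3 might look like the sketch below. It assumes each Datadog SLO has already been converted to a plain dict with `id`, `name`, and a `thresholds` list of `timeframe`/`target` entries. The output keys (`slo_id`, `target`, `window_days`) are hypothetical, because the format this tool expects is defined in data/mock_data.py and is not shown here.

```python
def map_datadog_slo(slo: dict) -> dict:
    """Convert one Datadog SLO (as a plain dict) into a flat record."""
    thresholds = slo.get("thresholds") or []
    if not thresholds:
        raise ValueError(f"SLO {slo.get('id')!r} has no thresholds")

    # Prefer the 30-day threshold; otherwise fall back to the first one listed.
    chosen = next((t for t in thresholds if t.get("timeframe") == "30d"), thresholds[0])

    timeframe = chosen["timeframe"]
    # This sketch only handles day-based timeframes such as "7d" or "30d".
    if not (timeframe.endswith("d") and timeframe[:-1].isdigit()):
        raise ValueError(f"Unsupported timeframe: {timeframe!r}")

    return {
        "slo_id": slo["id"],
        "name": slo["name"],
        "target": float(chosen["target"]),
        "window_days": int(timeframe[:-1]),
    }


example = {
    "id": "abc123",
    "name": "API availability",
    "thresholds": [
        {"timeframe": "7d", "target": 99.9},
        {"timeframe": "30d", "target": 99.9},
    ],
}
print(map_datadog_slo(example))
```

Rename the output keys to match whatever the entries in the `SLOS` dict look like.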

Use at Work

This tool is designed to be portable. The prompts used to build it can be reused to:

  1. Connect to your real Datadog instance
  2. Pull SLO definitions from your environment
  3. Integrate with your incident management system
  4. Add Spinnaker/GitHub Actions deploy correlation
