AI-Powered Kubernetes Debugging Agent
KubeSage is an AI-driven Kubernetes debugging assistant that automatically analyzes failing pods and identifies root causes using Amazon Bedrock LLMs.
Instead of manually digging through logs, events, and YAML manifests, KubeSage performs automated root cause analysis and provides actionable remediation suggestions directly from your terminal.
Detects common Kubernetes failures such as:
- 🔁 CrashLoopBackOff
- 💥 OOMKilled
- 📦 ImagePullBackOff
- ⚙️ Misconfigured resources
- 📉 Resource starvation
Powered by Amazon Bedrock using:
- Claude 3 Haiku – fast classification
KubeSage provides:
- 🔍 Root cause
- ⚠ Risk level
- 💡 Suggested fix
- 📈 Confidence score
Every analysis is automatically persisted to Amazon DynamoDB:
- Stores pod name, namespace, risk level, root cause, and timestamp
- Query past incidents by pod name or risk severity
- Enables post-mortem analysis and trend detection
- Full incident record retained for audit and compliance
KubeSage automatically notifies your team on critical failures:
- Real-time email alerts triggered for HIGH risk incidents
- Incident summary with root cause and suggested fix delivered to inbox
- No manual monitoring required — KubeSage alerts you before you notice
- Extensible to SMS, Slack, and PagerDuty via SNS subscriptions
python cli.py --pod oom-deployment-86b87cc56-45flmExample output:
Below is the high-level architecture of KubeSage.
KubeSage integrates with several AWS services:
| Service | Purpose |
|---|---|
| Amazon Bedrock | LLM inference for debugging reasoning (Claude 3 Haiku) |
| Amazon Elastic Kubernetes Service | Managed Kubernetes cluster |
| Amazon CloudWatch | Pod logs and container metrics collection |
| Amazon DynamoDB | Persistent incident storage and history querying |
| Amazon SNS | Real-time email/SMS alerts for HIGH risk incidents |
| AWS Lambda (optional) | Event-driven autonomous debugging |
| Amazon S3 (optional) | Long-term log archival |
- Python 3.10
- Typer (CLI framework)
- Rich (terminal UI)
- Pydantic (structured outputs)
- Kubernetes Python Client
- Amazon Bedrock
- Claude models from Anthropic
- strands – Bedrock invocation
- boto3 – DynamoDB and SNS integration
KubeSage helps engineers:
- ⏱ Reduce Mean Time To Resolution (MTTR) from 30+ minutes to under 1 minute
- 🔍 Automatically identify failure causes without manual log digging
- 🧠 Democratize SRE expertise — junior engineers debug like seniors
- 🔔 Get alerted on critical failures before manual discovery
- 📚 Build an incident knowledge base for post-mortems and trend analysis
- ⚡ Respond faster to production incidents
git clone https://github.com/naman22a/kubesage.git
cd kubesage
conda env create -f environment.yml
conda activate kubesagegit clone https://github.com/naman22a/kubesage.git
cd kubesage
pip install -r requirements.txtFor complete setup instructions including:
- AWS IAM configuration
- AWS CLI installation
- Docker installation
- kubectl installation
- eksctl installation
- Creating an EKS cluster
- Running test workloads
- Enable Amazon Bedrock in your AWS region.
- Ensure model access is granted (Claude model).
- Configure credentials:
aws configure- Set environment variables (if needed):
export AWS_REGION=us-east-1- Create a DynamoDB table named
kubesage-analysiswithpod_nameas the partition key. - Ensure your IAM role has
dynamodb:PutItemanddynamodb:Querypermissions. - Set environment variable:
export DYNAMODB_TABLE=kubesage-analysis- Create an SNS topic named
kubesage-alertsin your AWS region. - Subscribe your email or phone number to the topic.
- Confirm the subscription from your inbox.
- Set environment variable:
export SNS_TOPIC_ARN=arn:aws:sns:us-east-1:YOUR_ACCOUNT_ID:kubesage-alertsAlerts are automatically triggered when
risk_assessmentis HIGH.
kubesage/
├── k8s/ # K8s manifest files for testing
├── src/ # Main source code
│ ├── agent.py
│ ├── aws_utils.py
│ ├── constants.py
│ ├── custom_types.py
│ ├── fns.py
│ └── k8s.py
├── cli.py # CLI entry point
├── environment.yml # Conda environment
├── requirements.txt # Python dependencies
KubeSage is GPL V3




