feat: Add observability with Prometheus metrics and OpenTelemetry tracing by elmorem · Pull Request #34 · elmorem/ContextIQ

elmorem · 2025-12-11T21:46:53Z

Summary

Add comprehensive observability infrastructure with Prometheus metrics and OpenTelemetry distributed tracing
Create shared observability package with metrics, tracing, and middleware utilities
Add /metrics endpoint to all services for Prometheus scraping
Automatically collect HTTP request metrics via middleware
Instrument FastAPI applications with OpenTelemetry for distributed tracing

Implementation Details

Shared Observability Package (shared/observability/)

metrics.py: Prometheus metrics definitions for HTTP, database, message queues, cache, and business events
tracing.py: OpenTelemetry setup with resource configuration and FastAPI instrumentation
middleware.py: MetricsMiddleware for automatic HTTP metrics collection

Prometheus Metrics (metrics.py:16-116)

HTTP Metrics:

http_requests_total - Counter by service, method, endpoint, status_code
http_request_duration_seconds - Histogram with 11 buckets (5ms - 10s)
http_requests_in_progress - Gauge by service, method, endpoint
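
To make the histogram's reporting concrete, here is a minimal stdlib-only sketch of how cumulative `le` buckets are counted. The boundary values in `BUCKETS` are an assumption: the PR states only that there are 11 buckets spanning 5ms to 10s, not the exact edges. `cumulative_bucket_counts` is a hypothetical helper, not part of metrics.py.

```python
from bisect import bisect_left

# Assumed bucket upper bounds in seconds: 11 values spanning 5 ms to 10 s.
# The PR gives only the count and the range; these exact edges are a guess.
BUCKETS = (0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0)


def cumulative_bucket_counts(durations, buckets=BUCKETS):
    """Return {le_label: count} the way a Prometheus histogram reports it.

    Each bucket counts every observation less than or equal to its upper
    bound, so counts never decrease from one bucket to the next, and the
    implicit "+Inf" bucket equals the total number of observations.
    """
    per_bucket = [0] * (len(buckets) + 1)  # last slot is the +Inf overflow
    for d in durations:
        # bisect_left finds the first bound >= d, i.e. the smallest bucket
        # whose "le" condition the observation satisfies.
        per_bucket[bisect_left(buckets, d)] += 1

    counts, running = {}, 0
    for bound, n in zip(buckets, per_bucket):
        running += n
        counts[str(bound)] = running
    counts["+Inf"] = running + per_bucket[-1]
    return counts


if __name__ == "__main__":
    sample = [0.003, 0.004, 0.02, 0.3, 12.0]
    for le, count in cumulative_bucket_counts(sample).items():
        print(f'le="{le}" {count}')
```

Because the buckets are cumulative, a request slower than the largest bound (12s in the sample) appears only in `+Inf`, which is why the top bucket should sit above the slowest latency worth distinguishing.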

Database Metrics:

db_operations_total - Counter by service, operation, table, status
db_operation_duration_seconds - Histogram with 9 buckets (1ms - 1s)
db_connections_active - Gauge by service
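
The database metrics are defined here but not yet wired into queries (see Future Enhancements). One possible shape for that wiring is a timing context manager, sketched below with in-memory stand-ins so it runs without dependencies. `track_db_operation` and the status values "success" and "error" are hypothetical; a real integration would update the metric objects from metrics.py instead.

```python
import time
from collections import Counter, defaultdict
from contextlib import contextmanager

# In-memory stand-ins for db_operations_total and
# db_operation_duration_seconds; a real integration would update the
# Prometheus metric objects defined in metrics.py instead.
db_operations_total = Counter()
db_operation_durations = defaultdict(list)


@contextmanager
def track_db_operation(service, operation, table):
    """Time the wrapped block and record it under success or error."""
    start = time.perf_counter()
    status = "success"  # assumed label value
    try:
        yield
    except Exception:
        status = "error"  # assumed label value
        raise  # never swallow the caller's exception
    finally:
        elapsed = time.perf_counter() - start
        db_operations_total[(service, operation, table, status)] += 1
        db_operation_durations[(service, operation, table)].append(elapsed)
```

Usage would look like `with track_db_operation("sessions-service", "select", "sessions"): ...` around each query, so failures are counted and timed without changing how exceptions propagate.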

Message Queue Metrics:

mq_messages_published_total / mq_messages_consumed_total
mq_message_processing_duration_seconds - Histogram (100ms - 2min)

Cache Metrics:

cache_operations_total - Counter by operation and status
cache_hit_rate - Gauge (0-1)
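
As an illustration of the value a 0-1 gauge like cache_hit_rate would hold, the sketch below derives it from hit and miss counts. `CacheStats` is a hypothetical helper; the PR does not show how the gauge is actually updated.

```python
class CacheStats:
    """Tracks hits and misses and derives a hit rate in the range 0 to 1."""

    def __init__(self):
        self.hits = 0
        self.misses = 0

    def record(self, hit):
        """Record one cache lookup; `hit` is True when the key was found."""
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        # Report 0.0 before any lookups rather than dividing by zero.
        return self.hits / total if total else 0.0
```

A gauge computed this way reflects the whole process lifetime. If a windowed rate is wanted, it can instead be derived at query time from the cache_operations_total counter.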

Business Metrics:

memories_created_total, memories_retrieved_total
sessions_created_total, events_added_total

OpenTelemetry Tracing (tracing.py:17-75)

Service-level tracing with resource attributes (service.name, service.version, service.namespace)
Console span exporter for development
OTLP span exporter for production (configurable endpoint)
FastAPI automatic instrumentation for HTTP request/response traces
Batch span processing for performance

Service Integration

All services now include:

MetricsMiddleware for automatic HTTP metrics
OpenTelemetry FastAPI instrumentation
/metrics endpoint returning Prometheus-formatted metrics
Tracing setup in lifespan with service-specific configuration
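
For readers unfamiliar with how middleware can collect these metrics automatically, here is a dependency-free ASGI sketch of the idea. It is not the code in middleware.py: the in-memory counters stand in for the Prometheus metrics, and `demo_app`, `run_demo`, and the fallback status of 500 are assumptions made for illustration.

```python
import asyncio
import time
from collections import Counter, defaultdict


class MetricsMiddleware:
    """Dependency-free ASGI middleware that records HTTP request metrics."""

    def __init__(self, app, service_name):
        self.app = app
        self.service_name = service_name
        self.requests_total = Counter()     # stands in for http_requests_total
        self.in_progress = Counter()        # stands in for http_requests_in_progress
        self.durations = defaultdict(list)  # stands in for the duration histogram

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            # Pass lifespan and websocket traffic through untouched.
            await self.app(scope, receive, send)
            return

        labels = (self.service_name, scope["method"], scope["path"])
        status = {"code": 500}  # reported if the app fails before responding

        async def send_wrapper(message):
            # The status code is only visible on the response-start message.
            if message["type"] == "http.response.start":
                status["code"] = message["status"]
            await send(message)

        self.in_progress[labels] += 1
        start = time.perf_counter()
        try:
            await self.app(scope, receive, send_wrapper)
        finally:
            self.in_progress[labels] -= 1
            self.durations[labels].append(time.perf_counter() - start)
            self.requests_total[labels + (str(status["code"]),)] += 1


async def demo_app(scope, receive, send):
    """Tiny ASGI app used only to exercise the middleware."""
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


async def _noop_receive():
    return {"type": "http.request", "body": b"", "more_body": False}


async def _noop_send(message):
    pass


def run_demo():
    """Send one fake GET /health request through the middleware."""
    app = MetricsMiddleware(demo_app, "sessions-service")
    scope = {"type": "http", "method": "GET", "path": "/health"}
    asyncio.run(app(scope, _noop_receive, _noop_send))
    return app
```

The sketch labels requests by raw path. With path parameters (for example a session ID in the URL) that creates one time series per distinct value, so a production middleware would normally label by the matched route template instead.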

Sessions Service (services/sessions/app/main.py):

Service name: "sessions-service"
Metrics exposed on /metrics
Tracing with console export in debug mode

Memory Service (services/memory/app/main.py):

Service name: "memory-service"
Metrics exposed on /metrics
Tracing with console export in debug mode

API Gateway (services/gateway/app/main.py):

Service name: "api-gateway"
Metrics exposed on /metrics
Tracing without console export (production mode)
Updated root endpoint to include /metrics link

Metrics Endpoint Usage

All services now expose Prometheus metrics:

# Sessions Service
curl http://localhost:8001/metrics

# Memory Service  
curl http://localhost:8002/metrics

# API Gateway
curl http://localhost:8000/metrics

Example metrics output:

# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{service="sessions-service",method="GET",endpoint="/health",status_code="200"} 15.0

# HELP http_request_duration_seconds HTTP request latency
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{service="sessions-service",method="GET",endpoint="/api/v1/sessions",le="0.005"} 120.0
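
When verifying the endpoint by script rather than by eye, output like the above can be read with a small parser. The following is a minimal sketch that handles the sample lines shown here; `parse_samples` is a hypothetical helper, and it deliberately ignores optional timestamps and does not unescape label values.

```python
import re

SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'  # metric name
    r'(?:\{(?P<labels>.*)\})?'              # optional {label="value",...}
    r'\s+(?P<value>\S+)$'                   # sample value
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (name, labels, value) for each sample line, skipping comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or # HELP / # TYPE metadata
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not a sample this simple parser understands
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        yield match.group("name"), labels, float(match.group("value"))
```

Scraping a service twice and comparing the parsed `http_requests_total` values is one way to script the "metrics increment with HTTP requests" check from the test plan.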

Distributed Tracing

OpenTelemetry traces are automatically generated for:

All HTTP requests (via FastAPI instrumentation)
Request/response timing
Service-to-service correlation (via correlation IDs)
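
The PR does not show how correlation IDs travel between services, so the following is only a sketch of the usual pattern: reuse the incoming ID if there is one, otherwise mint a new one, and forward it on every downstream call. The header name and `ensure_correlation_id` are hypothetical.

```python
import uuid

# Hypothetical header name; the PR does not say which header carries the ID.
CORRELATION_HEADER = "x-correlation-id"


def ensure_correlation_id(headers):
    """Return (correlation_id, outgoing_headers) for a downstream call.

    Reuses the caller's ID when present so every hop logs the same value,
    and mints a new random ID when the request is the first in the chain.
    """
    # HTTP header names are case-insensitive, so compare in lowercase.
    normalized = {k.lower(): v for k, v in headers.items()}
    correlation_id = normalized.get(CORRELATION_HEADER) or uuid.uuid4().hex
    outgoing = dict(normalized)
    outgoing[CORRELATION_HEADER] = correlation_id
    return correlation_id, outgoing
```

Separately from any application-level correlation ID, OpenTelemetry links spans across services through its own trace context propagation, which by default uses the W3C `traceparent` header.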

In development, traces are exported to the console. In production, configure the OTLP endpoint:

setup_tracing(
    service_name="my-service",
    service_version="1.0.0",
    otlp_endpoint="http://localhost:4317",  # Jaeger/Tempo/etc
)
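
To avoid hard-coding the endpoint per environment, the arguments can be built from an environment variable. This is a sketch under two assumptions: that `otlp_endpoint` is optional in `setup_tracing`, and that the deployment sets the standard OpenTelemetry variable `OTEL_EXPORTER_OTLP_ENDPOINT`. `tracing_kwargs` is a hypothetical helper, not part of tracing.py.

```python
import os


def tracing_kwargs(service_name, service_version, environ=None):
    """Build keyword arguments for setup_tracing from the environment."""
    environ = os.environ if environ is None else environ
    kwargs = {"service_name": service_name, "service_version": service_version}
    endpoint = environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "").strip()
    if endpoint:
        # Only pass otlp_endpoint when one is configured, so development
        # runs keep whatever default setup_tracing applies (assumed optional).
        kwargs["otlp_endpoint"] = endpoint
    return kwargs
```

A service would then call `setup_tracing(**tracing_kwargs("my-service", "1.0.0"))` in its lifespan, and the same code would run unchanged in development and production.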

Future Enhancements

This PR provides the foundation for observability. Future PRs can add:

Worker process metrics (memory generation, consolidation)
Database operation metrics integration
Message queue metrics integration
Grafana dashboards for visualization
Alert rules for Prometheus
Jaeger/Tempo deployment for trace visualization

Test Plan

Type checks pass (mypy)
Code formatting applied (black)
Manual testing of metrics endpoints on all services
Verify metrics increment with HTTP requests
Verify traces appear in console (debug mode)
Test metrics scraping with Prometheus
Verify histogram buckets capture request latencies

🤖 Generated with Claude Code

…cing

Implement comprehensive observability infrastructure for all services with Prometheus metrics collection, OpenTelemetry distributed tracing, and standardized monitoring endpoints.

Changes:
- Create shared/observability package with metrics, tracing, and middleware
- Add Prometheus metrics for HTTP requests, database operations, message queues, cache, and business events
- Add OpenTelemetry distributed tracing with console and OTLP exporters
- Create MetricsMiddleware for automatic HTTP metrics collection
- Add /metrics endpoint to all services (Sessions, Memory, API Gateway)
- Instrument FastAPI applications with OpenTelemetry
- Configure tracing in service lifespan with service name and version

Metrics collected:
- http_requests_total (counter by service, method, endpoint, status)
- http_request_duration_seconds (histogram by service, method, endpoint)
- http_requests_in_progress (gauge by service, method, endpoint)
- db_operations_total, db_operation_duration_seconds, db_connections_active
- mq_messages_published/consumed_total, mq_message_processing_duration_seconds
- cache_operations_total, cache_hit_rate
- memories_created/retrieved_total, sessions_created_total, events_added_total

Tracing features:
- Service-level tracing with resource attributes (service.name, service.version)
- FastAPI automatic instrumentation for request/response traces
- Console export for development, OTLP export for production

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

elmorem merged commit e4f2506 into main Dec 11, 2025
12 checks passed

elmorem merged 1 commit into main from feat/observability-and-monitoring
