feat: Add observability with Prometheus metrics and OpenTelemetry tracing #34

Merged
elmorem merged 1 commit into main from feat/observability-and-monitoring
Dec 11, 2025

@elmorem elmorem commented Dec 11, 2025

Summary

  • Add comprehensive observability infrastructure with Prometheus metrics and OpenTelemetry distributed tracing
  • Create shared observability package with metrics, tracing, and middleware utilities
  • Add /metrics endpoint to all services for Prometheus scraping
  • Automatically collect HTTP request metrics via middleware
  • Instrument FastAPI applications with OpenTelemetry for distributed tracing

Implementation Details

Shared Observability Package (shared/observability/)

  • metrics.py: Prometheus metrics definitions for HTTP, database, message queues, cache, and business events
  • tracing.py: OpenTelemetry setup with resource configuration and FastAPI instrumentation
  • middleware.py: MetricsMiddleware for automatic HTTP metrics collection

Prometheus Metrics (metrics.py:16-116)

HTTP Metrics:

  • http_requests_total - Counter by service, method, endpoint, status_code
  • http_request_duration_seconds - Histogram with 11 buckets (5ms - 10s)
  • http_requests_in_progress - Gauge by service, method, endpoint
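For readers who have not opened metrics.py, the three HTTP metrics above could be declared with `prometheus_client` roughly as follows. This is a minimal sketch, not the code from the PR: the bucket boundaries are an assumption (11 values chosen to span 5 ms to 10 s), and the dedicated `registry` object is only there to keep the example isolated.

```python
from prometheus_client import CollectorRegistry, Counter, Gauge, Histogram

# A dedicated registry keeps this sketch separate from the global default registry.
registry = CollectorRegistry()

# Assumed boundaries: 11 buckets from 5 ms to 10 s (the PR only states the range).
HTTP_BUCKETS = (0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0)

http_requests_total = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["service", "method", "endpoint", "status_code"],
    registry=registry,
)
http_request_duration_seconds = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency",
    ["service", "method", "endpoint"],
    buckets=HTTP_BUCKETS,
    registry=registry,
)
http_requests_in_progress = Gauge(
    "http_requests_in_progress",
    "HTTP requests currently in progress",
    ["service", "method", "endpoint"],
    registry=registry,
)

# Record one 3 ms request against the health endpoint.
labels = {"service": "sessions-service", "method": "GET", "endpoint": "/health"}
http_requests_total.labels(status_code="200", **labels).inc()
http_request_duration_seconds.labels(**labels).observe(0.003)
```

`prometheus_client` appends the `+Inf` bucket on its own, so only the finite boundaries need to be listed.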

Database Metrics:

  • db_operations_total - Counter by service, operation, table, status
  • db_operation_duration_seconds - Histogram with 9 buckets (1ms - 1s)
  • db_connections_active - Gauge by service
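The database metrics are defined here but not yet wired into queries (see Future Enhancements). One possible way to wire them in is a small context manager like the sketch below. Everything in it is an assumption for illustration: the `track_db_operation` helper is hypothetical, the `"success"`/`"error"` status values and the histogram label set are guesses, and the 9 bucket boundaries were chosen only to match the stated 1 ms to 1 s range.

```python
import time
from contextlib import contextmanager

from prometheus_client import CollectorRegistry, Counter, Histogram

registry = CollectorRegistry()

# Assumed boundaries: 9 buckets from 1 ms to 1 s.
DB_BUCKETS = (0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0)

db_operations_total = Counter(
    "db_operations_total",
    "Total database operations",
    ["service", "operation", "table", "status"],
    registry=registry,
)
db_operation_duration_seconds = Histogram(
    "db_operation_duration_seconds",
    "Database operation latency",
    ["service", "operation", "table"],  # assumed label set
    buckets=DB_BUCKETS,
    registry=registry,
)


@contextmanager
def track_db_operation(service: str, operation: str, table: str):
    """Hypothetical helper: time the wrapped block and count its outcome."""
    start = time.perf_counter()
    status = "success"
    try:
        yield
    except Exception:
        status = "error"
        raise
    finally:
        elapsed = time.perf_counter() - start
        db_operation_duration_seconds.labels(
            service=service, operation=operation, table=table
        ).observe(elapsed)
        db_operations_total.labels(
            service=service, operation=operation, table=table, status=status
        ).inc()


# Usage: wrap the code that talks to the database.
with track_db_operation("memory-service", "select", "memories"):
    pass  # a real query would run here
```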

Message Queue Metrics:

  • mq_messages_published_total / mq_messages_consumed_total
  • mq_message_processing_duration_seconds - Histogram (100ms - 2min)

Cache Metrics:

  • cache_operations_total - Counter by operation and status
  • cache_hit_rate - Gauge (0-1)
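Since `cache_hit_rate` is a gauge between 0 and 1, something has to compute the ratio before setting it. The PR does not say how that is done; a minimal, dependency-free sketch of the arithmetic, using a hypothetical `CacheStats` helper, could be:

```python
class CacheStats:
    """Hypothetical helper that tracks hits and misses for one cache."""

    def __init__(self) -> None:
        self.hits = 0
        self.misses = 0

    def record(self, hit: bool) -> None:
        """Count one lookup as a hit or a miss."""
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_rate(self) -> float:
        """Fraction of lookups that were hits; 0.0 before any lookup."""
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


stats = CacheStats()
for was_hit in (True, True, True, False):
    stats.record(was_hit)
# stats.hit_rate is what would be passed to the gauge's set() method.
```

An alternative worth considering is to export only the hit and miss counters and compute the ratio at query time in PromQL, which avoids keeping a derived value in sync.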

Business Metrics:

  • memories_created_total, memories_retrieved_total
  • sessions_created_total, events_added_total

OpenTelemetry Tracing (tracing.py:17-75)

  • Service-level tracing with resource attributes (service.name, service.version, service.namespace)
  • Console span exporter for development
  • OTLP span exporter for production (configurable endpoint)
  • FastAPI automatic instrumentation for HTTP request/response traces
  • Batch span processing for performance

Service Integration

All services now include:

  • MetricsMiddleware for automatic HTTP metrics
  • OpenTelemetry FastAPI instrumentation
  • /metrics endpoint returning Prometheus-formatted metrics
  • Tracing setup in lifespan with service-specific configuration
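To make the middleware bullet concrete, here is a rough sketch of how such a middleware can work, written as plain ASGI so it runs without FastAPI installed. It is not the `MetricsMiddleware` from middleware.py: the constructor signature, the fallback status of `"500"`, and the use of the raw request path as the `endpoint` label are all assumptions.

```python
import asyncio
import time

from prometheus_client import CollectorRegistry, Counter, Gauge, Histogram


class MetricsMiddleware:
    """Sketch of an ASGI middleware that records HTTP metrics per request."""

    def __init__(self, app, service_name, registry=None):
        self.app = app
        self.service = service_name
        self.registry = CollectorRegistry() if registry is None else registry
        base = ["service", "method", "endpoint"]
        self.requests = Counter("http_requests_total", "Total HTTP requests",
                                base + ["status_code"], registry=self.registry)
        self.duration = Histogram("http_request_duration_seconds",
                                  "HTTP request latency", base, registry=self.registry)
        self.in_progress = Gauge("http_requests_in_progress",
                                 "HTTP requests currently in progress", base,
                                 registry=self.registry)

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":  # pass websockets and lifespan events through
            await self.app(scope, receive, send)
            return

        labels = {"service": self.service, "method": scope["method"],
                  "endpoint": scope["path"]}
        status = {"code": "500"}  # assumed fallback if the app fails before responding

        async def send_wrapper(message):
            # The status code is only known once the response starts.
            if message["type"] == "http.response.start":
                status["code"] = str(message["status"])
            await send(message)

        self.in_progress.labels(**labels).inc()
        start = time.perf_counter()
        try:
            await self.app(scope, receive, send_wrapper)
        finally:
            self.duration.labels(**labels).observe(time.perf_counter() - start)
            self.in_progress.labels(**labels).dec()
            self.requests.labels(status_code=status["code"], **labels).inc()


# Demo: drive the middleware once with a tiny fake ASGI app.
async def demo_app(scope, receive, send):
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


async def demo_receive():
    return {"type": "http.request", "body": b"", "more_body": False}


sent_messages = []


async def demo_send(message):
    sent_messages.append(message)


demo_registry = CollectorRegistry()
middleware = MetricsMiddleware(demo_app, "sessions-service", registry=demo_registry)
asyncio.run(middleware({"type": "http", "method": "GET", "path": "/health"},
                       demo_receive, demo_send))
```

Using the raw path as a label can create one time series per distinct URL (for example, per session ID), so a real implementation would likely want the route template instead. The `/metrics` endpoint itself can then return `prometheus_client.generate_latest(registry)` with the Prometheus text content type.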

Sessions Service (services/sessions/app/main.py):

  • Service name: "sessions-service"
  • Metrics exposed on /metrics
  • Tracing with console export in debug mode

Memory Service (services/memory/app/main.py):

  • Service name: "memory-service"
  • Metrics exposed on /metrics
  • Tracing with console export in debug mode

API Gateway (services/gateway/app/main.py):

  • Service name: "api-gateway"
  • Metrics exposed on /metrics
  • Tracing without console export (production mode)
  • Updated root endpoint to include /metrics link

Metrics Endpoint Usage

All services now expose Prometheus metrics:

# Sessions Service
curl http://localhost:8001/metrics

# Memory Service  
curl http://localhost:8002/metrics

# API Gateway
curl http://localhost:8000/metrics

Example metrics output:

# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{service="sessions-service",method="GET",endpoint="/health",status_code="200"} 15.0

# HELP http_request_duration_seconds HTTP request latency
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{service="sessions-service",method="GET",endpoint="/api/v1/sessions",le="0.005"} 120.0
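One thing that is easy to misread in the output above: histogram buckets are cumulative, so the `le="0.005"` line counts every request that took 5 ms or less, and each larger bucket includes all smaller ones. The following dependency-free sketch reproduces that counting rule. The bucket boundaries are assumed (the PR only states 11 buckets from 5 ms to 10 s), and `cumulative_bucket_counts` is a hypothetical helper, not part of the PR.

```python
from bisect import bisect_left

# Assumed boundaries: 11 buckets from 5 ms to 10 s.
BUCKETS = (0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0)


def cumulative_bucket_counts(latencies, buckets=BUCKETS):
    """Return {le_label: count} using Prometheus-style cumulative buckets."""
    per_bucket = [0] * (len(buckets) + 1)  # the extra slot is the +Inf bucket
    for value in latencies:
        # bisect_left finds the first boundary that is >= value (i.e. value <= le).
        per_bucket[bisect_left(buckets, value)] += 1

    counts = {}
    running = 0
    for bound, count in zip(buckets, per_bucket):
        running += count
        counts[str(bound)] = running
    counts["+Inf"] = running + per_bucket[-1]  # +Inf always equals the total
    return counts


# Four requests: 3 ms, exactly 5 ms, 20 ms, and a 12 s outlier.
counts = cumulative_bucket_counts([0.003, 0.005, 0.02, 12.0])
```

With these four samples, the 5 ms bucket holds 2 requests, the 25 ms bucket holds 3, and only `+Inf` sees all 4.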

Distributed Tracing

OpenTelemetry traces are automatically generated for:

  • All HTTP requests (via FastAPI instrumentation)
  • Request/response timing
  • Service-to-service correlation (via correlation IDs)
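The PR does not show how correlation IDs travel between services. A common pattern, sketched here with only the standard library, is to read the ID from an incoming header (or mint one), keep it in a `contextvars.ContextVar` for the duration of the request, and attach it to outgoing calls. The header name `x-correlation-id` and both helper functions are assumptions for illustration.

```python
import contextvars
import uuid

# Assumed header name; the PR does not specify one.
CORRELATION_HEADER = "x-correlation-id"

# Holds the current request's correlation ID; safe across async tasks.
correlation_id = contextvars.ContextVar("correlation_id", default=None)


def extract_or_create(headers: dict) -> str:
    """Reuse the caller's correlation ID if present, otherwise generate one."""
    normalized = {key.lower(): value for key, value in headers.items()}
    cid = normalized.get(CORRELATION_HEADER) or uuid.uuid4().hex
    correlation_id.set(cid)
    return cid


def outgoing_headers(base=None) -> dict:
    """Copy the given headers and add the current correlation ID, if any."""
    headers = dict(base or {})
    cid = correlation_id.get()
    if cid:
        headers[CORRELATION_HEADER] = cid
    return headers


# An incoming request that already carries an ID keeps it on downstream calls.
incoming_id = extract_or_create({"X-Correlation-ID": "abc123"})
downstream = outgoing_headers({"accept": "application/json"})
```

With OpenTelemetry instrumentation on both sides, W3C trace context headers give a similar link between spans, so the two mechanisms can coexist.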

In development, traces are exported to the console. In production, configure an OTLP endpoint:

setup_tracing(
    service_name="my-service",
    service_version="1.0.0",
    otlp_endpoint="http://localhost:4317",  # Jaeger/Tempo/etc
)

Future Enhancements

This PR provides the foundation for observability. Future PRs can add:

  • Worker process metrics (memory generation, consolidation)
  • Database operation metrics integration
  • Message queue metrics integration
  • Grafana dashboards for visualization
  • Alert rules for Prometheus
  • Jaeger/Tempo deployment for trace visualization

Test Plan

  • Type checks pass (mypy)
  • Code formatting applied (black)
  • Manual testing of metrics endpoints on all services
  • Verify metrics increment with HTTP requests
  • Verify traces appear in console (debug mode)
  • Test metrics scraping with Prometheus
  • Verify histogram buckets capture request latencies

🤖 Generated with Claude Code

…cing

Implement comprehensive observability infrastructure for all services with Prometheus metrics collection, OpenTelemetry distributed tracing, and standardized monitoring endpoints.

Changes:
- Create shared/observability package with metrics, tracing, and middleware
- Add Prometheus metrics for HTTP requests, database operations, message queues, cache, and business events
- Add OpenTelemetry distributed tracing with console and OTLP exporters
- Create MetricsMiddleware for automatic HTTP metrics collection
- Add /metrics endpoint to all services (Sessions, Memory, API Gateway)
- Instrument FastAPI applications with OpenTelemetry
- Configure tracing in service lifespan with service name and version

Metrics collected:
- http_requests_total (counter by service, method, endpoint, status_code)
- http_request_duration_seconds (histogram by service, method, endpoint)
- http_requests_in_progress (gauge by service, method, endpoint)
- db_operations_total, db_operation_duration_seconds, db_connections_active
- mq_messages_published/consumed_total, mq_message_processing_duration_seconds
- cache_operations_total, cache_hit_rate
- memories_created/retrieved_total, sessions_created_total, events_added_total

Tracing features:
- Service-level tracing with resource attributes (service.name, service.version)
- FastAPI automatic instrumentation for request/response traces
- Console export for development, OTLP export for production

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@elmorem elmorem merged commit e4f2506 into main Dec 11, 2025
12 checks passed