Production-grade HTTP gateway for LLM providers with reliability controls
Fastify-based TypeScript service providing a unified interface for OpenAI and Anthropic with:
- ⚡ Exponential backoff retries
- 🔌 Circuit breakers (Opossum)
- 🚦 Rate limiting per IP
- 🔑 Idempotency key support
- 📊 Pino structured logging
- ✅ Zod schema validation
- 🛡️ Configurable timeouts
- 📖 OpenAPI 3.0 spec
- 📈 Prometheus metrics - request latency, error rates, cache hit ratio
- 🔄 Request coalescing - dedupe concurrent identical requests
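Request coalescing can be sketched as below — a minimal illustration, not the gateway's actual code. The `fetchCompletion` helper is a hypothetical stand-in for a real provider call:

```typescript
// Minimal request-coalescing sketch: concurrent callers with the same key
// share one in-flight promise instead of each triggering an upstream call.
const inFlight = new Map<string, Promise<string>>();

let upstreamCalls = 0; // for demonstration only

// Hypothetical stand-in for an actual provider request.
async function fetchCompletion(prompt: string): Promise<string> {
  upstreamCalls++;
  await new Promise((r) => setTimeout(r, 50)); // simulate network latency
  return `reply to: ${prompt}`;
}

async function coalesced(key: string, prompt: string): Promise<string> {
  const existing = inFlight.get(key);
  if (existing) return existing; // join the request already in flight
  const p = fetchCompletion(prompt).finally(() => inFlight.delete(key));
  inFlight.set(key, p);
  return p;
}

// e.g. awaiting Promise.all of three coalesced("k1", "Hello!") calls
// issues a single upstream request and hands all callers the same result.
```

The map entry is removed in `finally` so a later identical request (after the first settles) makes a fresh upstream call rather than reusing a stale promise.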
```bash
# Install
npm install

# Configure
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...

# Run
npm run dev

# Test
npm test
```

Example request:

```bash
curl -X POST http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Idempotency-Key: my-request-123" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

Configuration is via environment variables:

| Env Var | Default | Description |
|---|---|---|
| `PORT` | `3000` | Server port |
| `OPENAI_API_KEY` | - | OpenAI API key |
| `ANTHROPIC_API_KEY` | - | Anthropic API key |
| `DEFAULT_PROVIDER` | `openai` | Default LLM provider |
| `RATE_LIMIT_MAX` | `100` | Requests per window |
| `RATE_LIMIT_WINDOW_MS` | `60000` | Window duration (ms) |
| `RETRY_MAX_ATTEMPTS` | `3` | Max retry attempts |
| `IDEMPOTENCY_TTL_MS` | `3600000` | Cache TTL (1 hour) |
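The defaults above can be applied in a small loader like the one below — a plain-TypeScript sketch with a hypothetical `loadConfig` function, not the service's actual config code (which presumably validates env vars with Zod):

```typescript
// Sketch of loading gateway settings from the table above, with defaults.
interface GatewayConfig {
  port: number;
  defaultProvider: "openai" | "anthropic";
  rateLimitMax: number;
  rateLimitWindowMs: number;
  retryMaxAttempts: number;
  idempotencyTtlMs: number;
}

function loadConfig(env: Record<string, string | undefined>): GatewayConfig {
  // Parse a numeric env var, falling back to the documented default.
  const num = (v: string | undefined, dflt: number): number => {
    const n = v === undefined ? dflt : Number(v);
    if (!Number.isInteger(n) || n < 0) throw new Error(`invalid value: ${v}`);
    return n;
  };
  const provider = env.DEFAULT_PROVIDER ?? "openai";
  if (provider !== "openai" && provider !== "anthropic") {
    throw new Error(`unknown provider: ${provider}`);
  }
  return {
    port: num(env.PORT, 3000),
    defaultProvider: provider,
    rateLimitMax: num(env.RATE_LIMIT_MAX, 100),
    rateLimitWindowMs: num(env.RATE_LIMIT_WINDOW_MS, 60_000),
    retryMaxAttempts: num(env.RETRY_MAX_ATTEMPTS, 3),
    idempotencyTtlMs: num(env.IDEMPOTENCY_TTL_MS, 3_600_000),
  };
}
```

Failing fast on malformed values keeps misconfiguration visible at startup instead of surfacing as odd runtime behavior.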
Access Prometheus metrics at `/metrics`:

```bash
curl http://localhost:3000/metrics
```

Available metrics:

- `llm_gateway_http_request_duration_seconds` - HTTP request latency histogram
- `llm_gateway_http_requests_total` - Total HTTP requests counter
- `llm_gateway_provider_latency_seconds` - LLM provider latency
- `llm_gateway_tokens_total` - Token usage by provider/model
- `llm_gateway_cache_hits_total` / `llm_gateway_cache_misses_total` - Cache hit ratio
- `llm_gateway_circuit_breaker_state` - Circuit breaker state gauge
- `llm_gateway_rate_limit_exceeded_total` - Rate limit events
```
┌─────────────┐      ┌──────────────┐      ┌─────────────┐
│    HTTP     │────▶│   Service    │────▶│  Provider   │
│  (Fastify)  │      │    Layer     │      │    Layer    │
└─────────────┘      └──────────────┘      └─────────────┘
       │                    │                    │
       ▼                    ▼                    ▼
 Zod Validation        Idempotency       OpenAI/Anthropic
 Request ID            Rate Limit         SDK Interface
 Logging               Circuit Breaker
                       Retries
```
See openapi.yaml for the full OpenAPI 3.0 specification.
| Method | Path | Description |
|---|---|---|
| GET | `/health` | Health check |
| POST | `/v1/chat/completions` | Create chat completion |
All responses include:
- `X-Request-Id`: Unique request identifier for tracing
- `X-RateLimit-Limit`: Requests per rate limit window
- `X-RateLimit-Remaining`: Remaining requests in current window
- `X-RateLimit-Reset`: ISO timestamp when the rate limit resets
- `X-Cached`: `true` if the response was served from the idempotency cache
```bash
# Run all tests
npm test

# Run with coverage
npm run test:coverage

# Type check
npm run typecheck

# Run specific test suites
npm test -- chaos
npm test -- benchmark
```

- Unit tests: Provider mocks, retry logic, circuit breaker, rate limiter, idempotency cache, schema validation
- Integration tests: End-to-end API tests
- Chaos tests: Malformed input, rate limiting, idempotency behavior
- Benchmarks: Latency percentiles (p50/p95/p99) under load
**Retries**

- Exponential backoff with configurable max attempts
- Smart retry detection (429 and 5xx errors)
- Configurable initial delay and max delay
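The policy above can be sketched as a small helper — an illustrative `withRetries` function (a hypothetical name, not the service's actual implementation):

```typescript
// Exponential backoff sketch: retry on 429/5xx, doubling the delay each
// attempt and capping it at maxDelayMs.
interface RetryOpts {
  maxAttempts: number;    // e.g. RETRY_MAX_ATTEMPTS
  initialDelayMs: number;
  maxDelayMs: number;
}

// 429 (rate limited) and 5xx responses are worth retrying; 4xx client
// errors are not, since resending the same request cannot succeed.
const isRetryable = (status: number): boolean => status === 429 || status >= 500;

async function withRetries<T>(
  fn: () => Promise<{ status: number; value?: T }>,
  opts: RetryOpts,
): Promise<T> {
  let delay = opts.initialDelayMs;
  for (let attempt = 1; ; attempt++) {
    const res = await fn();
    if (res.status < 400) return res.value as T;
    if (!isRetryable(res.status) || attempt >= opts.maxAttempts) {
      throw new Error(`failed with status ${res.status} after ${attempt} attempt(s)`);
    }
    await new Promise((r) => setTimeout(r, delay));
    delay = Math.min(delay * 2, opts.maxDelayMs); // exponential backoff, capped
  }
}
```

Production backoff usually also adds jitter so synchronized clients don't retry in lockstep; that refinement is omitted here for brevity.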
**Circuit breaker**

- Automatic circuit opening after the error threshold is crossed
- Half-open state for recovery testing
- Fallback responses while the circuit is open
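The transitions above form a small state machine. The service delegates this to Opossum; the `Breaker` class below is only an illustrative sketch of the closed → open → half-open cycle:

```typescript
// Minimal circuit-breaker state machine sketch (illustrative; the gateway
// itself uses Opossum rather than this class).
type State = "closed" | "open" | "half-open";

class Breaker {
  private state: State = "closed";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private threshold: number,      // consecutive failures before opening
    private resetTimeoutMs: number, // how long to stay open
    private now: () => number = Date.now, // injectable clock for testing
  ) {}

  current(): State {
    if (this.state === "open" && this.now() - this.openedAt >= this.resetTimeoutMs) {
      this.state = "half-open"; // allow one trial request through
    }
    return this.state;
  }

  record(success: boolean): void {
    const s = this.current();
    if (success) {
      this.state = "closed"; // any success (incl. half-open trial) closes it
      this.failures = 0;
    } else {
      this.failures++;
      if (s === "half-open" || this.failures >= this.threshold) {
        this.state = "open"; // trip, or re-trip after a failed trial
        this.openedAt = this.now();
        this.failures = 0;
      }
    }
  }
}
```

While `current()` reports `open`, the caller would skip the provider entirely and serve a fallback response.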
**Rate limiting**

- Per-IP sliding-window rate limiting
- Configurable window size and request limit
- Standard rate-limit headers in responses
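A sliding window along these lines can be sketched by keeping recent request timestamps per client (an illustrative `SlidingWindowLimiter` with an injectable clock, not the gateway's actual limiter):

```typescript
// Per-IP sliding-window rate limiter sketch: remember the timestamps of
// recent requests per client and reject once the window holds `max`.
class SlidingWindowLimiter {
  private hits = new Map<string, number[]>();

  constructor(
    private max: number,      // e.g. RATE_LIMIT_MAX
    private windowMs: number, // e.g. RATE_LIMIT_WINDOW_MS
    private now: () => number = Date.now, // injectable clock for testing
  ) {}

  /** Returns true if the request is allowed, false if rate-limited. */
  allow(ip: string): boolean {
    const t = this.now();
    // Drop timestamps that have slid out of the window.
    const recent = (this.hits.get(ip) ?? []).filter((ts) => t - ts < this.windowMs);
    if (recent.length >= this.max) {
      this.hits.set(ip, recent);
      return false;
    }
    recent.push(t);
    this.hits.set(ip, recent);
    return true;
  }

  /** Feeds the X-RateLimit-Remaining response header. */
  remaining(ip: string): number {
    const t = this.now();
    const recent = (this.hits.get(ip) ?? []).filter((ts) => t - ts < this.windowMs);
    return Math.max(0, this.max - recent.length);
  }
}
```

Unlike a fixed window, the sliding window avoids the burst of 2× the limit that can occur at a window boundary.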
**Idempotency**

- Responses cached by idempotency key
- 1-hour TTL (configurable)
- Prevents duplicate processing of retried requests
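The cache can be sketched as a TTL map keyed by the `X-Idempotency-Key` value (a hypothetical `IdempotencyCache`, not the service's actual implementation):

```typescript
// TTL cache sketch for idempotency: a repeated request within the TTL gets
// the stored response back (which the gateway marks with X-Cached: true).
class IdempotencyCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();

  constructor(
    private ttlMs: number, // e.g. IDEMPOTENCY_TTL_MS (default 1 hour)
    private now: () => number = Date.now, // injectable clock for testing
  ) {}

  get(key: string): V | undefined {
    const e = this.entries.get(key);
    if (!e) return undefined;
    if (this.now() >= e.expiresAt) {
      this.entries.delete(key); // lazily evict the expired entry
      return undefined;
    }
    return e.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}
```

Expired entries are evicted lazily on lookup; a background sweep would be needed in addition if memory growth between lookups mattered.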
MIT