Skip to content

themoah/klag

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

110 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Klag license-badge

Kafka Consumer Lag Exporter β€” Know when your consumers fall behind, before it becomes a problem.

Inspired by kafka-lag-exporter (archived 2024). Built with Vert.x and Micrometer.

πŸ“– Documentation lives at klag.dev

Full guides β€” configuration, Kafka ACLs, Helm/Strimzi deployment, integrations, metrics reference, and development β€” are at klag.dev. AI agents: a machine-readable corpus is at klag.dev/llms.txt. The docs source lives in website/ (Astro + Starlight).

Scales to large clusters: Monitors thousands of consumer groups in ~50MB heap. Request batching with configurable delays prevents overwhelming brokers β€” fetch offsets for 500+ groups without spiking cluster CPU.

Why Klag?

Consumer lag is the gap between what Kafka has produced and what your consumers have processed. Left unmonitored, growing lag leads to stale downstream data, memory pressure as consumers struggle to catch up, and silent failures when groups die without alerts. Klag continuously monitors all consumer groups and exposes metrics to your observability stack.

Key Features

Feature Why It Matters
Lag velocity Know if lag is growing or shrinking β€” catch problems before they escalate
Time-based lag estimation See lag in seconds/minutes, not just message counts
Hot partition detection Find partitions with uneven load causing bottlenecks
Consumer group state tracking Alert on Rebalancing, Dead, or Empty states
Data loss prevention Alert before lag exceeds retention and data is lost
Request batching Safely monitor large clusters without overwhelming brokers
AI-native Opt-in read-only MCP endpoint for SRE/dev agents

Sinks: Prometheus, Datadog, OTLP (Grafana Cloud, New Relic, etc.). See integrations.

How Klag compares

Klag, Burrow, and KMinion all monitor Kafka consumer lag. They differ in what they measure and where they send it.

Feature Klag Burrow KMinion
Lag in messages βœ… βœ… βœ…
Consumer group state βœ… βœ… βœ…
Lag velocity (growing/shrinking) βœ… ⚠️ status only ❌
Time-based lag + time-to-catch-up βœ… ❌ ❌
Hot partition detection βœ… ❌ ❌
Data loss / retention alerting βœ… ❌ ❌
Lag status rules + notifiers ❌ βœ… ❌
Prometheus / Datadog / OTLP native βœ… ⚠️ exporter / ❌ / ❌ βœ… / ❌ / ❌
AI agent endpoint (MCP) βœ… ❌ ❌
Read-only ACLs (DESCRIBE only) βœ… βœ… ⚠️ writes for e2e

Quick Start

docker run -e KAFKA_BOOTSTRAP_SERVERS=kafka:9092 \
           -e METRICS_REPORTER=prometheus \
           -p 8888:8888 \
           themoah/klag:latest

Metrics available at http://localhost:8888/metrics.

A GraalVM native image (themoah/klag:native) starts in ~70-100 ms using ~44 MB RSS, versus ~500 ms / ~119 MB for the JVM image β€” same config, endpoints, and metrics. See Native Image.

Helm

helm repo add klag https://themoah.github.io/klag
helm repo update
helm install klag klag/klag --set kafka.bootstrapServers="kafka-broker:9092"

Published to Artifact Hub. Full chart config, SASL, and Strimzi: Kubernetes deployment and charts/klag/README.md.

Metrics

Metric Description
klag.consumer.lag Current lag per partition (also .sum, .max, .min)
klag.consumer.lag.velocity Rate of change β€” positive means falling behind
klag.consumer.lag.ms Lag in ms from Kafka log timestamps
klag.consumer.lag.time_to_close_seconds Estimated seconds until lag reaches zero
klag.consumer.lag.retention_percent Lag as % of available messages (data loss alerting)
klag.consumer.group.state Group health: Stable, Rebalancing, Dead, Empty
klag.hot_partition[.lag] Partitions with statistically abnormal throughput

Full metrics reference and the pre-built Grafana dashboard are documented at klag.dev/metrics.

Grafana Dashboard

Blogpost: Introducing Klag

Configuration

Configure via src/main/resources/application.properties or environment variables. The most common:

Variable Default Description
KAFKA_BOOTSTRAP_SERVERS localhost:9092 Kafka broker addresses
METRICS_REPORTER none prometheus, datadog, or otlp
METRICS_INTERVAL_MS 60000 How often to collect metrics
METRICS_GROUP_FILTER * Comma-separated glob patterns. A group is included if it matches any segment (e.g. ingest*,categorize*).
METRICS_GROUP_EXCLUDE (empty) Comma-separated glob patterns to exclude even if included by the filter (e.g. debug-*,canary-*,*-shadow).

See CLAUDE.md for the complete configuration reference.


Broker Compatibility

Klag works with Apache Kafka 2.x and 3.x brokers.

Running against Kafka 2.x

When klag first talks to a 2.x broker, you'll see a single WARN log line β€” emitted once per process, not per topic and not per scrape:

MAX_TIMESTAMP listOffsets unsupported by broker (likely pre-Kafka 3.0);
falling back to LATEST for logEndTimestamp. Logged once per process;
further occurrences at DEBUG. Cause: ...

This is expected, and safe to ignore. All klag metrics remain accurate. Subsequent per-topic fallback events are emitted at DEBUG; flip LOG_LEVEL_KLAG=DEBUG if you want to see them.

Why this happens, and why the fallback exists at all

Klag asks the broker for partition end-offsets using OffsetSpec.MAX_TIMESTAMP, an API added in Kafka 3.0 (KIP-734). 2.x brokers don't support it and respond with UnsupportedVersionException. Klag catches that and falls back to OffsetSpec.LATEST, which every Kafka version supports. The fallback returns the same partition end-offset that drives every lag metric, so accuracy is preserved.

Why the fallback is necessary. Without it, klag refuses to start on a 2.x cluster. The first metrics scrape runs synchronously during klag's startup, and a failure there propagates back up to the process launcher, which exits with code 1 β€” CrashLoopBackOff under Kubernetes. One consumer group on a 2.x cluster was enough to brick startup.

Note for future contributors. On 2.x brokers, both logEndTimestamp and maxTimestampOffset fall back to the LATEST offset's timestamp/offset, so time-based lag interpolation degrades gracefully (the anchor becomes the broker-side append time of the last record rather than the highest-timestamp record).


Kafka ACL Permissions

Klag requires read-only access to monitor consumer lag. It uses only the Kafka Admin Client API with DESCRIBE permissionsβ€”no write or alter access needed.

Required Permissions

Resource Name Permission Operations
CLUSTER kafka-cluster DESCRIBE Health check, list consumer groups
TOPIC * or prefixed DESCRIBE Get partition info and offsets
GROUP * or prefixed DESCRIBE Get group state and committed offsets

Self-Managed Kafka

Monitor all groups and topics
# Cluster permissions (required)
kafka-acls --bootstrap-server <broker> \
  --add --allow-principal User:<klag-user> \
  --operation Describe --cluster

# All topics
kafka-acls --bootstrap-server <broker> \
  --add --allow-principal User:<klag-user> \
  --operation Describe --topic '*'

# All consumer groups
kafka-acls --bootstrap-server <broker> \
  --add --allow-principal User:<klag-user> \
  --operation Describe --group '*'

Klag needs read-only Kafka access (Admin Client DESCRIBE on cluster, topics, groups). Full configuration reference and ACL setup (self-managed + Confluent Cloud) are at klag.dev/configuration and klag.dev/kafka/acl-permissions.

Development

Requires Java 21.

./gradlew clean test      # Run tests
./gradlew clean assemble  # Build fat JAR
./gradlew clean run       # Run with hot-reload

End-to-end tests (k3d + real Kafka, Strimzi matrix) live in scripts/. See Build from Source and Contributing.


vert.x

Some parts of the code were written with Claude Claude

About

new kafka lag exporter: lightweight and extendable.

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors