Kafka Consumer Lag Exporter β Know when your consumers fall behind, before it becomes a problem.
Inspired by kafka-lag-exporter (archived 2024). Built with Vert.x and Micrometer.
π Documentation lives at klag.dev
Full guides β configuration, Kafka ACLs, Helm/Strimzi deployment, integrations, metrics reference, and development β are at klag.dev. AI agents: a machine-readable corpus is at klag.dev/llms.txt. The docs source lives in
website/(Astro + Starlight).
Scales to large clusters: Monitors thousands of consumer groups in ~50MB heap. Request batching with configurable delays prevents overwhelming brokers β fetch offsets for 500+ groups without spiking cluster CPU.
Consumer lag is the gap between what Kafka has produced and what your consumers have processed. Left unmonitored, growing lag leads to stale downstream data, memory pressure as consumers struggle to catch up, and silent failures when groups die without alerts. Klag continuously monitors all consumer groups and exposes metrics to your observability stack.
| Feature | Why It Matters |
|---|---|
| Lag velocity | Know if lag is growing or shrinking β catch problems before they escalate |
| Time-based lag estimation | See lag in seconds/minutes, not just message counts |
| Hot partition detection | Find partitions with uneven load causing bottlenecks |
| Consumer group state tracking | Alert on Rebalancing, Dead, or Empty states |
| Data loss prevention | Alert before lag exceeds retention and data is lost |
| Request batching | Safely monitor large clusters without overwhelming brokers |
| AI-native | Opt-in read-only MCP endpoint for SRE/dev agents |
Sinks: Prometheus, Datadog, OTLP (Grafana Cloud, New Relic, etc.). See integrations.
Klag, Burrow, and KMinion all monitor Kafka consumer lag. They differ in what they measure and where they send it.
| Feature | Klag | Burrow | KMinion |
|---|---|---|---|
| Lag in messages | β | β | β |
| Consumer group state | β | β | β |
| Lag velocity (growing/shrinking) | β | β | |
| Time-based lag + time-to-catch-up | β | β | β |
| Hot partition detection | β | β | β |
| Data loss / retention alerting | β | β | β |
| Lag status rules + notifiers | β | β | β |
| Prometheus / Datadog / OTLP native | β | β / β / β | |
| AI agent endpoint (MCP) | β | β | β |
| Read-only ACLs (DESCRIBE only) | β | β |
docker run -e KAFKA_BOOTSTRAP_SERVERS=kafka:9092 \
-e METRICS_REPORTER=prometheus \
-p 8888:8888 \
themoah/klag:latestMetrics available at http://localhost:8888/metrics.
A GraalVM native image (themoah/klag:native) starts in ~70-100 ms using ~44 MB RSS,
versus ~500 ms / ~119 MB for the JVM image β same config, endpoints, and metrics. See
Native Image.
helm repo add klag https://themoah.github.io/klag
helm repo update
helm install klag klag/klag --set kafka.bootstrapServers="kafka-broker:9092"Published to Artifact Hub. Full chart config, SASL, and Strimzi: Kubernetes deployment and charts/klag/README.md.
| Metric | Description |
|---|---|
klag.consumer.lag |
Current lag per partition (also .sum, .max, .min) |
klag.consumer.lag.velocity |
Rate of change β positive means falling behind |
klag.consumer.lag.ms |
Lag in ms from Kafka log timestamps |
klag.consumer.lag.time_to_close_seconds |
Estimated seconds until lag reaches zero |
klag.consumer.lag.retention_percent |
Lag as % of available messages (data loss alerting) |
klag.consumer.group.state |
Group health: Stable, Rebalancing, Dead, Empty |
klag.hot_partition[.lag] |
Partitions with statistically abnormal throughput |
Full metrics reference and the pre-built Grafana dashboard are documented at klag.dev/metrics.
Configure via src/main/resources/application.properties or environment variables. The
most common:
| Variable | Default | Description |
|---|---|---|
KAFKA_BOOTSTRAP_SERVERS |
localhost:9092 |
Kafka broker addresses |
METRICS_REPORTER |
none |
prometheus, datadog, or otlp |
METRICS_INTERVAL_MS |
60000 |
How often to collect metrics |
METRICS_GROUP_FILTER |
* |
Comma-separated glob patterns. A group is included if it matches any segment (e.g. ingest*,categorize*). |
METRICS_GROUP_EXCLUDE |
(empty) | Comma-separated glob patterns to exclude even if included by the filter (e.g. debug-*,canary-*,*-shadow). |
See CLAUDE.md for the complete configuration reference.
Klag works with Apache Kafka 2.x and 3.x brokers.
When klag first talks to a 2.x broker, you'll see a single WARN log line β emitted once per process, not per topic and not per scrape:
MAX_TIMESTAMP listOffsets unsupported by broker (likely pre-Kafka 3.0);
falling back to LATEST for logEndTimestamp. Logged once per process;
further occurrences at DEBUG. Cause: ...
This is expected, and safe to ignore. All klag metrics remain accurate. Subsequent per-topic fallback events are emitted at DEBUG; flip LOG_LEVEL_KLAG=DEBUG if you want to see them.
Why this happens, and why the fallback exists at all
Klag asks the broker for partition end-offsets using OffsetSpec.MAX_TIMESTAMP, an API added in Kafka 3.0 (KIP-734). 2.x brokers don't support it and respond with UnsupportedVersionException. Klag catches that and falls back to OffsetSpec.LATEST, which every Kafka version supports. The fallback returns the same partition end-offset that drives every lag metric, so accuracy is preserved.
Why the fallback is necessary. Without it, klag refuses to start on a 2.x cluster. The first metrics scrape runs synchronously during klag's startup, and a failure there propagates back up to the process launcher, which exits with code 1 β CrashLoopBackOff under Kubernetes. One consumer group on a 2.x cluster was enough to brick startup.
Note for future contributors. On 2.x brokers, both logEndTimestamp and maxTimestampOffset fall back to the LATEST offset's timestamp/offset, so time-based lag interpolation degrades gracefully (the anchor becomes the broker-side append time of the last record rather than the highest-timestamp record).
Klag requires read-only access to monitor consumer lag. It uses only the Kafka Admin Client API with DESCRIBE permissionsβno write or alter access needed.
| Resource | Name | Permission | Operations |
|---|---|---|---|
| CLUSTER | kafka-cluster | DESCRIBE | Health check, list consumer groups |
| TOPIC | * or prefixed |
DESCRIBE | Get partition info and offsets |
| GROUP | * or prefixed |
DESCRIBE | Get group state and committed offsets |
Monitor all groups and topics
# Cluster permissions (required)
kafka-acls --bootstrap-server <broker> \
--add --allow-principal User:<klag-user> \
--operation Describe --cluster
# All topics
kafka-acls --bootstrap-server <broker> \
--add --allow-principal User:<klag-user> \
--operation Describe --topic '*'
# All consumer groups
kafka-acls --bootstrap-server <broker> \
--add --allow-principal User:<klag-user> \
--operation Describe --group '*'Klag needs read-only Kafka access (Admin Client DESCRIBE on cluster, topics, groups). Full configuration reference and ACL setup (self-managed + Confluent Cloud) are at klag.dev/configuration and klag.dev/kafka/acl-permissions.
Requires Java 21.
./gradlew clean test # Run tests
./gradlew clean assemble # Build fat JAR
./gradlew clean run # Run with hot-reloadEnd-to-end tests (k3d + real Kafka, Strimzi matrix) live in scripts/. See
Build from Source and
Contributing.
