Purge offline workers and clean up Prometheus metric labels #1492
Open
ShubhAtWork wants to merge 1 commit into mher:master
Conversation
When a Celery worker goes offline permanently (e.g. a scaled-down container, a crashed node), its state persists in three places:

1. `EventsState.counter` — the per-worker event counter dict
2. `Inspector.workers` — the cached inspect results
3. Prometheus label series — every `labels(worker_name, ...)` call creates a time series that persists for the lifetime of the process

In long-running deployments with worker churn (autoscaling, rolling deploys, spot instances), these accumulate without bound:

- `state.counter` grows by ~1 KB per worker — 10,000 ephemeral workers over a month means ~10 MB of dead entries
- Prometheus label cardinality grows linearly with every unique worker name ever seen, degrading scrape performance and increasing storage costs in monitoring systems
- The `/workers` page slows down as it iterates all known workers

The existing `--purge_offline_workers` option only filters workers from the UI view (in `WorkersView.get`) — it does NOT actually remove them from memory.

This change implements actual purging:

- Add a periodic timer that runs every `purge_offline_workers` seconds (minimum 10s) to scan for workers that are offline beyond the configured threshold
- Purge dead workers from `state.counter`, `inspector.workers`, and all Prometheus metric label series
- Handle orphaned entries — workers that appear in counter/inspector but not in `state.workers` (e.g. from incomplete event sequences)
- Add `PrometheusMetrics.remove_worker_metrics()` to clean up label series across all 6 metric families
- Add `Inspector.purge_worker()` to remove cached inspect data

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
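The purge pass described above (a dead-worker scan plus orphan handling) can be sketched as a plain function over simplified per-worker dicts. The shapes used here (`state_workers` as a name-to-last-heartbeat map, `counter` and `inspector_workers` as plain dicts) are assumptions for illustration, not Flower's actual `EventsState`/`Inspector` structures:

```python
import time


def purge_offline_workers(state_workers, counter, inspector_workers,
                          threshold, now=None):
    """Remove workers offline longer than `threshold` seconds; return their names.

    Simplified sketch: `state_workers` maps worker name to last-heartbeat
    timestamp, while `counter` and `inspector_workers` are per-worker dicts
    (stand-ins for EventsState.counter and Inspector.workers).
    """
    now = now if now is not None else time.time()

    # Workers whose last heartbeat is older than the configured threshold.
    dead = {w for w, ts in state_workers.items() if now - ts > threshold}

    # Orphans: entries in the counter or inspector cache that the event
    # state no longer knows about (e.g. from incomplete event sequences).
    known = set(state_workers)
    orphans = (set(counter) | set(inspector_workers)) - known

    for w in dead:
        state_workers.pop(w, None)
    for w in dead | orphans:
        counter.pop(w, None)
        inspector_workers.pop(w, None)
    return sorted(dead | orphans)
```

In the real change this scan would run from the periodic timer, followed by a `remove_worker_metrics()` call for each purged name so the Prometheus label series are dropped in the same pass.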
Summary
When a Celery worker goes offline permanently (e.g. a scaled-down container, a crashed node, a spot instance termination), its state persists indefinitely in three places:

- `EventsState.counter` — per-worker event counter dict (~1 KB per worker)
- `Inspector.workers` — cached inspect results
- Prometheus label series — every `labels(worker_name, ...)` call creates a time series that persists for the lifetime of the process

In long-running deployments with worker churn (autoscaling, rolling deploys, spot instances), these accumulate without bound, causing monotonically growing memory usage, degraded Prometheus scrape performance, and a slow `/workers` page.

### Problem with the existing `--purge_offline_workers` option

The option already exists in `flower/options.py`, and `WorkersView.get()` uses it to filter workers from the UI. But it does NOT actually remove them from memory — the counter, inspector, and Prometheus data continue to grow.

### What this PR does

- A periodic timer (every `purge_offline_workers` seconds, minimum 10s) that scans for workers offline beyond the configured threshold
- Purge dead workers from `state.counter`, `inspector.workers`, and all Prometheus metric label series
- Handle orphaned entries — workers that appear in counter/inspector but not in `state.workers` (e.g. from incomplete event sequences)
- `PrometheusMetrics.remove_worker_metrics()` — iterates all 6 metric families and removes label series matching the worker name
- `Inspector.purge_worker()` — removes cached inspect data for a worker

### Test plan

- `pytest tests/unit/test_app.py tests/unit/test_events.py tests/unit/test_inspector.py` — 14 tests pass
- Run with `--purge_offline_workers=600`, verify memory stabilizes after workers scale down
- Verify the `/metrics` endpoint no longer shows labels for purged workers

🤖 Generated with Claude Code
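The label-series cleanup that `PrometheusMetrics.remove_worker_metrics()` performs can be sketched with `prometheus_client`, whose labelled metrics support `remove(*labelvalues)` to delete one label combination. This is a minimal sketch with a reduced, invented set of metric families (not Flower's actual 6), and it scans the metric's internal `_metrics` dict, a private `prometheus_client` detail that may change between versions:

```python
from prometheus_client import CollectorRegistry, Counter

# Hypothetical metric families standing in for Flower's; names are invented.
registry = CollectorRegistry()
events = Counter('flower_events', 'Worker events',
                 ['worker', 'type'], registry=registry)
online = Counter('flower_worker_online', 'Worker online events',
                 ['worker'], registry=registry)


def remove_worker_metrics(worker_name):
    """Drop every label series that references `worker_name`.

    prometheus_client keeps one child per unique label-value tuple in the
    metric's internal `_metrics` dict; `remove(*labelvalues)` deletes one
    tuple, so we collect every tuple containing the worker name first.
    """
    for metric in (events, online):
        stale = [vals for vals in list(metric._metrics)
                 if worker_name in vals]
        for vals in stale:
            metric.remove(*vals)
```

After calling `remove_worker_metrics('celery@old-host')`, a scrape of the registry would no longer emit any sample labelled with that worker name, which is the cardinality fix the PR describes.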