Purge offline workers and clean up Prometheus metric labels #1492
Open
ShubhAtWork wants to merge 1 commit into mher:master
Conversation
When a Celery worker goes offline permanently (e.g. a scaled-down container, a crashed node), its state persists in three places:

1. `EventsState.counter` — the per-worker event counter dict
2. `Inspector.workers` — the cached inspect results
3. Prometheus label series — every `labels(worker_name, ...)` call creates a time series that persists for the lifetime of the process

In long-running deployments with worker churn (autoscaling, rolling deploys, spot instances), these accumulate without bound:

- `state.counter` grows by ~1 KB per worker — 10,000 ephemeral workers over a month means ~10 MB of dead entries
- Prometheus label cardinality grows linearly with every unique worker name ever seen, degrading scrape performance and increasing storage costs in monitoring systems
- The `/workers` page slows down as it iterates all known workers

The existing `--purge_offline_workers` option only filters workers from the UI view (in `WorkersView.get`) — it does NOT actually remove them from memory.

This change implements actual purging:

- Add a periodic timer that runs every `purge_offline_workers` seconds (minimum 10s) to scan for workers that are offline beyond the configured threshold
- Purge dead workers from `state.counter`, `inspector.workers`, and all Prometheus metric label series
- Handle orphaned entries — workers that appear in counter/inspector but not in `state.workers` (e.g. from incomplete event sequences)
- Add `PrometheusMetrics.remove_worker_metrics()` to clean up label series across all 6 metric families
- Add `Inspector.purge_worker()` to remove cached inspect data

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
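The purge pass described above (a dead-worker scan plus orphan handling) can be sketched as a plain function over simplified per-worker dicts. The shapes used here (`state_workers` as a name-to-last-heartbeat map, `counter` and `inspector_workers` as plain dicts) are assumptions for illustration, not Flower's actual `EventsState`/`Inspector` structures:

```python
import time


def purge_offline_workers(state_workers, counter, inspector_workers,
                          threshold, now=None):
    """Remove workers offline longer than `threshold` seconds; return their names.

    Simplified sketch: `state_workers` maps worker name to last-heartbeat
    timestamp, while `counter` and `inspector_workers` are per-worker dicts
    (stand-ins for EventsState.counter and Inspector.workers).
    """
    now = now if now is not None else time.time()

    # Workers whose last heartbeat is older than the configured threshold.
    dead = {w for w, ts in state_workers.items() if now - ts > threshold}

    # Orphans: entries in the counter or inspector cache that the event
    # state no longer knows about (e.g. from incomplete event sequences).
    known = set(state_workers)
    orphans = (set(counter) | set(inspector_workers)) - known

    for w in dead:
        state_workers.pop(w, None)
    for w in dead | orphans:
        counter.pop(w, None)
        inspector_workers.pop(w, None)
    return sorted(dead | orphans)
```

In the real change this scan would run from the periodic timer, followed by a `remove_worker_metrics()` call for each purged name so the Prometheus label series are dropped in the same pass.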
Summary
When a Celery worker goes offline permanently (e.g. a scaled-down container, a crashed node, a spot instance termination), its state persists indefinitely in three places:

- `EventsState.counter` — per-worker event counter dict (~1 KB per worker)
- `Inspector.workers` — cached inspect results
- Prometheus label series — every `labels(worker_name, ...)` call creates a time series that persists for the lifetime of the process

In long-running deployments with worker churn (autoscaling, rolling deploys, spot instances), these accumulate without bound, causing monotonically growing memory usage, degraded Prometheus scrape performance, and a slow `/workers` page.

### Problem with the existing `--purge_offline_workers` option

The option already exists in `flower/options.py`, and `WorkersView.get()` uses it to filter workers from the UI. But it does NOT actually remove them from memory — the counter, inspector, and Prometheus data continue to grow.

### What this PR does

- A periodic timer (every `purge_offline_workers` seconds, minimum 10s) that scans for workers offline beyond the configured threshold
- Purge dead workers from `state.counter`, `inspector.workers`, and all Prometheus metric label series
- Handle orphaned entries — workers that appear in counter/inspector but not in `state.workers` (e.g. from incomplete event sequences)
- `PrometheusMetrics.remove_worker_metrics()` — iterates all 6 metric families and removes label series matching the worker name
- `Inspector.purge_worker()` — removes cached inspect data for a worker

### Test plan

- `pytest tests/unit/test_app.py tests/unit/test_events.py tests/unit/test_inspector.py` — 14 tests pass
- Run with `--purge_offline_workers=600`, verify memory stabilizes after workers scale down
- Verify the `/metrics` endpoint no longer shows labels for purged workers

🤖 Generated with Claude Code
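The label-series cleanup that `PrometheusMetrics.remove_worker_metrics()` performs can be sketched with `prometheus_client`, whose labelled metrics support `remove(*labelvalues)` to delete one label combination. This is a minimal sketch with a reduced, invented set of metric families (not Flower's actual 6), and it scans the metric's internal `_metrics` dict, a private `prometheus_client` detail that may change between versions:

```python
from prometheus_client import CollectorRegistry, Counter

# Hypothetical metric families standing in for Flower's; names are invented.
registry = CollectorRegistry()
events = Counter('flower_events', 'Worker events',
                 ['worker', 'type'], registry=registry)
online = Counter('flower_worker_online', 'Worker online events',
                 ['worker'], registry=registry)


def remove_worker_metrics(worker_name):
    """Drop every label series that references `worker_name`.

    prometheus_client keeps one child per unique label-value tuple in the
    metric's internal `_metrics` dict; `remove(*labelvalues)` deletes one
    tuple, so we collect every tuple containing the worker name first.
    """
    for metric in (events, online):
        stale = [vals for vals in list(metric._metrics)
                 if worker_name in vals]
        for vals in stale:
            metric.remove(*vals)
```

After calling `remove_worker_metrics('celery@old-host')`, a scrape of the registry would no longer emit any sample labelled with that worker name, which is the cardinality fix the PR describes.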