Purge offline workers and clean up Prometheus metric labels#1492

Open
ShubhAtWork wants to merge 1 commit into mher:master from twofourlabs:fix/purge-offline-workers

Conversation


ShubhAtWork commented Mar 3, 2026

Summary

When a Celery worker goes offline permanently (e.g. a scaled-down container, a crashed node, a spot instance termination), its state persists indefinitely in three places:

  1. EventsState.counter — per-worker event counter dict (~1 KB per worker)
  2. Inspector.workers — cached inspect results
  3. Prometheus label series — every labels(worker_name, ...) call creates a time series that persists for the lifetime of the process

In long-running deployments with worker churn (autoscaling, rolling deploys, spot instances), these accumulate without bound: 10,000 ephemeral workers over a month leaves roughly 10 MB of dead counter entries alone. The result is monotonically growing memory usage, degraded Prometheus scrape performance (label cardinality grows with every unique worker name ever seen), and a steadily slower /workers page.
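
To make the growth concrete, here is a minimal self-contained simulation of how a per-worker counter dict accumulates dead entries under churn. All names here (`state_counter`, `record_event`) are illustrative stand-ins, not flower's actual code:

```python
from collections import Counter, defaultdict

# Stand-in for EventsState.counter: one Counter of event types per
# worker name, kept forever once a worker has been seen.
state_counter = defaultdict(Counter)

def record_event(worker_name, event_type):
    state_counter[worker_name][event_type] += 1

# Simulate autoscaling churn: each "deploy" brings up fresh worker
# names that later go offline but are never removed from the dict.
for deploy in range(100):
    for i in range(10):
        name = f"celery@worker-{deploy}-{i}"
        record_event(name, "worker-online")
        record_event(name, "task-succeeded")
        record_event(name, "worker-offline")

# 1,000 distinct workers ever seen -> 1,000 live entries, even though
# every one of them is offline.
print(len(state_counter))  # 1000
```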

Problem with the existing --purge_offline_workers option

The option already exists in flower/options.py, and WorkersView.get() uses it to filter workers from the UI. But it does NOT actually remove them from memory — the counter, inspector, and Prometheus data continue to grow.

What this PR does

  • Add a periodic timer (runs every purge_offline_workers seconds, minimum 10s) that scans for workers offline beyond the configured threshold
  • Actually purge dead workers from state.counter, inspector.workers, and all Prometheus metric label series
  • Handle orphaned entries — workers that appear in counter/inspector but not in state.workers (e.g. from incomplete event sequences)
  • Add PrometheusMetrics.remove_worker_metrics() — iterates all 6 metric families and removes label series matching the worker name
  • Add Inspector.purge_worker() — removes cached inspect data for a worker
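
The steps above can be sketched roughly as follows, using plain dicts as stand-ins for flower's state, inspector, and metrics objects. Every name in this sketch is illustrative; it is not the PR's actual implementation:

```python
import time

def purge_offline_workers(state_workers, state_counter, inspector_workers,
                          remove_worker_metrics, threshold_seconds):
    """One pass of the periodic purge timer (illustrative sketch)."""
    now = time.time()
    # Workers whose last heartbeat is older than the threshold.
    dead = [
        name for name, w in state_workers.items()
        if now - w.get("heartbeat", 0) > threshold_seconds
    ]
    # Orphans: present in counter/inspector but unknown to state_workers
    # (e.g. from an incomplete event sequence).
    orphans = (set(state_counter) | set(inspector_workers)) - set(state_workers)
    for name in dead + sorted(orphans):
        state_workers.pop(name, None)
        state_counter.pop(name, None)      # EventsState.counter entry
        inspector_workers.pop(name, None)  # what Inspector.purge_worker() covers
        remove_worker_metrics(name)        # Prometheus label-series cleanup
    return dead + sorted(orphans)
```

Running one pass with a live worker "a", a stale worker "b", and two orphaned entries would purge "b" plus both orphans while leaving "a" untouched.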

Test plan

  • pytest tests/unit/test_app.py tests/unit/test_events.py tests/unit/test_inspector.py — 14 tests pass
  • Deploy with --purge_offline_workers=600, verify memory stabilizes after workers scale down
  • Verify Prometheus /metrics endpoint no longer shows labels for purged workers
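
The label cleanup can be emulated generically: treat each metric family as a map from label-value tuples to sample values, and drop every series whose worker_name label matches the purged worker. This is a dict-based sketch of the idea, not the prometheus_client API itself:

```python
class FakeMetric:
    """Emulates a labeled metric family: label names plus a dict
    mapping label-value tuples to sample values."""
    def __init__(self, label_names):
        self.label_names = label_names
        self.series = {}  # tuple of label values -> value

    def labels(self, **labels):
        key = tuple(labels[n] for n in self.label_names)
        self.series.setdefault(key, 0.0)
        return key

    def remove_where(self, label_name, value):
        """Drop every series whose `label_name` equals `value`."""
        idx = self.label_names.index(label_name)
        self.series = {k: v for k, v in self.series.items() if k[idx] != value}

def remove_worker_metrics(metrics, worker_name):
    # Mirrors the PR's idea: iterate all metric families and drop the
    # series carrying the purged worker's label.
    for m in metrics:
        if "worker_name" in m.label_names:
            m.remove_where("worker_name", worker_name)

events = FakeMetric(["worker_name", "type"])
events.labels(worker_name="celery@old", type="task-succeeded")
events.labels(worker_name="celery@live", type="task-succeeded")
remove_worker_metrics([events], "celery@old")
print(sorted(events.series))  # [('celery@live', 'task-succeeded')]
```

After the purge, a scrape of such a family would expose only the surviving worker's series, which is what the /metrics check in the test plan verifies.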

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>