fix(celery): bump worker concurrency default to 16 (#1228)
The default celery worker concurrency (`os.cpu_count()`) underutilises the worker pool for `process_nats_pipeline_result` and `create_detection_images`, which are DB/Redis-bound rather than CPU-bound. On a prefork pool sized to CPU count, the pool is idle most of the time while the `antenna` queue backlogs during high-throughput NATS `async_api` jobs. Override via the `CELERY_WORKER_CONCURRENCY` env var per deployment; 16 is the new default.
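The change itself is a one-liner in the Django settings module. A minimal stdlib sketch of the intended fallback behaviour — the actual code uses django-environ's `env.int`, so `os.environ` here is only a stand-in:

```python
import os

# Stand-in for django-environ's env.int("CELERY_WORKER_CONCURRENCY", default=16):
# use the env var when set, otherwise fall back to a 16-process prefork pool.
CELERY_WORKER_CONCURRENCY = int(os.environ.get("CELERY_WORKER_CONCURRENCY", "16"))

print(CELERY_WORKER_CONCURRENCY)
```

In the common `app.config_from_object("django.conf:settings", namespace="CELERY")` setup, this maps to Celery's `worker_concurrency` option, equivalent to starting the worker with `--concurrency=16`.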
📝 Walkthrough
A new Celery worker concurrency setting was added to the base configuration, enabling control over the prefork pool size via an environment variable, with a default value of 16.
Pull request overview
This PR adjusts the default Celery worker prefork pool size by introducing an explicit CELERY_WORKER_CONCURRENCY setting in the Django base settings, while keeping it overridable per deployment via an environment variable.
Changes:
- Add `CELERY_WORKER_CONCURRENCY = env.int("CELERY_WORKER_CONCURRENCY", default=16)` to `config/settings/base.py`.
- Document rationale and override behavior inline next to existing worker prefetch settings.
🧹 Nitpick comments (1)
config/settings/base.py (1)
Line 401: Consider documenting `CELERY_WORKER_CONCURRENCY` in env templates/runbooks. Optional, but adding it to `.env.example`/deployment docs will make per-environment tuning easier (especially smaller staging/demo stacks).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@config/settings/base.py` at line 401, Add documentation for the CELERY_WORKER_CONCURRENCY environment variable (used where CELERY_WORKER_CONCURRENCY = env.int("CELERY_WORKER_CONCURRENCY", default=16)) to the project's environment templates and deployment/runbook, e.g., update .env.example and relevant runbooks to include the variable name, its purpose (controls Celery worker concurrency), allowed values, and the default of 16, plus a note recommending smaller values for staging/demo and guidance for tuning per-environment.
Summary
- Add `CELERY_WORKER_CONCURRENCY = env.int("CELERY_WORKER_CONCURRENCY", default=16)` to `config/settings/base.py`, next to the existing `CELERY_WORKER_PREFETCH_MULTIPLIER` / `CELERY_WORKER_ENABLE_PREFETCH_COUNT_REDUCTION` block.
- Overridable per deployment via the `CELERY_WORKER_CONCURRENCY` env var.
Why
The default celery worker concurrency when the setting is unset is `os.cpu_count()`. On the current production celery worker host (8 cores) this means an 8-process prefork pool. The dominant tasks on the `antenna` queue — `process_nats_pipeline_result` and `create_detection_images` — are DB/Redis-bound rather than CPU-bound: each task spends most of its time waiting on postgres/pgbouncer and Redis round-trips, not crunching numbers.

Direct observation during a high-throughput `async_api` job:
- `async_api` jobs killed by transient Redis errors during `update_state` (the #1219 / #1221 trigger).
- `consumer_utilisation` on the antenna queue: ~0.0016, i.e. the single AMQP consumer's prefetch window is fully occupied essentially all the time. This is the "worker pool too small" signature, not a broker-side issue.

Raising the prefork pool size directly addresses the bottleneck. 16 is a conservative first step (2× cpu_count, roughly matching the observed room on the DB/pgbouncer side). A hotfix override of 16 was applied in production via the env var ahead of this PR and confirmed to drain the backlog on the active jobs.
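The `consumer_utilisation` figure comes from RabbitMQ's management HTTP API (newer RabbitMQ versions also expose it as `consumer_capacity`). A hedged sketch of how such a check might look — the endpoint shape is the standard `/api/queues/<vhost>/<queue>` one, but the helper names and credentials here are illustrative:

```python
import json
from urllib.request import Request, urlopen


def parse_queue_health(payload: dict) -> dict:
    """Extract the fields relevant to the 'worker pool too small' diagnosis
    from a RabbitMQ management-API queue object."""
    return {
        "consumers": payload.get("consumers", 0),
        # Fraction of time consumers could accept new messages; a value near 0
        # means the prefetch window is always full and the pool is the bottleneck.
        "consumer_utilisation": payload.get("consumer_utilisation"),
        "messages_ready": payload.get("messages_ready", 0),
    }


def fetch_queue(host: str, vhost: str, queue: str, auth_header: str) -> dict:
    # Hypothetical host/credentials; GET /api/queues/<vhost>/<queue>.
    req = Request(
        f"http://{host}:15672/api/queues/{vhost}/{queue}",
        headers={"Authorization": auth_header},
    )
    with urlopen(req) as resp:
        return parse_queue_health(json.load(resp))


# Illustrative payload (only the 0.0016 utilisation is from the incident above):
sample = {"consumers": 1, "consumer_utilisation": 0.0016, "messages_ready": 50000}
print(parse_queue_health(sample))
```

A utilisation this low with a single consumer and a growing `messages_ready` is exactly the signature described above: the broker has work ready, but the worker pool cannot take it.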
Why 16 specifically
It is the smallest power-of-2 step that roughly matches the empirical gap between ingress and drain on the production incident that motivated this PR, without risking pgbouncer saturation. Deployments with different DB/pgbouncer capacity can override via env var. A larger default can be considered once we have measured postgres connection-pool headroom (see "what we still need to verify" below).
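The pgbouncer-saturation concern can be sanity-checked with simple arithmetic — a hedged sketch, where the worker/host counts and pool size are illustrative placeholders, not measured values from this deployment:

```python
def pg_connections_needed(worker_hosts: int, concurrency: int,
                          conns_per_process: int = 1) -> int:
    """Upper bound on server-side connections the prefork pools can demand.
    Each prefork child typically holds at most one Django DB connection."""
    return worker_hosts * concurrency * conns_per_process


# Illustrative numbers (not from the PR): one worker host at the new default
# concurrency of 16, against a hypothetical pgbouncer pool of 20.
demand = pg_connections_needed(worker_hosts=1, concurrency=16)
pgbouncer_pool = 20  # hypothetical default_pool_size
print(f"demand={demand}, headroom={pgbouncer_pool - demand}")  # demand=16, headroom=4
```

The same arithmetic shows why a larger default is riskier: doubling concurrency again, or adding a second worker host, would exceed this hypothetical pool and queue inside pgbouncer instead of inside Celery.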
What this does not change
- The prefetch multiplier stays at `1` — that was already set and fairness behaviour is unchanged.
- Splitting the `antenna` queue into a dedicated "ingest fast path" vs "housekeeping / status-check" queue is a larger follow-up, filed separately.
- The pool implementation stays `prefork`. Switching to `gevent` for this queue may give much higher effective concurrency on an IO-bound workload, but every task on this queue would need to be audited for gevent-safety (blocking C extensions, thread-locals in PyTorch paths, etc.) first. Out of scope here.
What we still need to verify
- Postgres/pgbouncer connection headroom — 16 processes should fit within `default_pool_size`, but worth confirming under load.
- Memory growth at higher concurrency (`--max-tasks-per-child=100` / `--max-memory-per-child=2 GiB` already bound each process).
Related
- #1219 — `async_api` jobs killed by transient Redis errors during `update_state`; `RedisError` and "state actually missing" are conflated into a single fatal path. Code-path brittleness that lets a single transient Redis error mark an active job FAILURE and delete state (independent of this PR).
- #1221 — fix(cache): enable SO_KEEPALIVE on django-redis cache connections; reduces how often the #1219 path triggers (independent of this PR).