fix(celery): bump worker concurrency default to 16 (#1228)
The default celery worker concurrency (`os.cpu_count()`) underutilises the worker pool for `process_nats_pipeline_result` and `create_detection_images`, which are DB/Redis-bound rather than CPU-bound. On a prefork pool sized to CPU count, the pool is idle most of the time while the `antenna` queue backlogs during high-throughput NATS `async_api` jobs. Override via the `CELERY_WORKER_CONCURRENCY` env var per deployment; 16 is the new default.
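The change itself is a one-liner in the Django settings module. A minimal stdlib sketch of the intended fallback behaviour — the actual code uses django-environ's `env.int`, so `os.environ` here is only a stand-in:

```python
import os

# Stand-in for django-environ's env.int("CELERY_WORKER_CONCURRENCY", default=16):
# use the env var when set, otherwise fall back to a 16-process prefork pool.
CELERY_WORKER_CONCURRENCY = int(os.environ.get("CELERY_WORKER_CONCURRENCY", "16"))

print(CELERY_WORKER_CONCURRENCY)
```

In the common `app.config_from_object("django.conf:settings", namespace="CELERY")` setup, this maps to Celery's `worker_concurrency` option, equivalent to starting the worker with `--concurrency=16`.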
📝 Walkthrough
A new Celery worker concurrency setting was added to the base configuration, enabling control over the prefork pool size via an environment variable, with a default value of 16.
Pull request overview
This PR adjusts the default Celery worker prefork pool size by introducing an explicit CELERY_WORKER_CONCURRENCY setting in the Django base settings, while keeping it overridable per deployment via an environment variable.
Changes:
- Add `CELERY_WORKER_CONCURRENCY = env.int("CELERY_WORKER_CONCURRENCY", default=16)` to `config/settings/base.py`.
- Document rationale and override behavior inline next to existing worker prefetch settings.
🧹 Nitpick comments (1)
config/settings/base.py (1)
Line 401: Consider documenting `CELERY_WORKER_CONCURRENCY` in env templates/runbooks. Optional, but adding it to `.env.example`/deployment docs will make per-environment tuning easier (especially smaller staging/demo stacks).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@config/settings/base.py` at line 401, Add documentation for the CELERY_WORKER_CONCURRENCY environment variable (used where CELERY_WORKER_CONCURRENCY = env.int("CELERY_WORKER_CONCURRENCY", default=16)) to the project's environment templates and deployment/runbook, e.g., update .env.example and relevant runbooks to include the variable name, its purpose (controls Celery worker concurrency), allowed values, and the default of 16, plus a note recommending smaller values for staging/demo and guidance for tuning per-environment.
Summary
- Add `CELERY_WORKER_CONCURRENCY = env.int("CELERY_WORKER_CONCURRENCY", default=16)` to `config/settings/base.py`, next to the existing `CELERY_WORKER_PREFETCH_MULTIPLIER` / `CELERY_WORKER_ENABLE_PREFETCH_COUNT_REDUCTION` block.
- Overridable per deployment via the `CELERY_WORKER_CONCURRENCY` env var.
Why
The default celery worker concurrency when the setting is unset is `os.cpu_count()`. On the current production celery worker host (8 cores) this means an 8-process prefork pool. The dominant tasks on the `antenna` queue — `process_nats_pipeline_result` and `create_detection_images` — are DB/Redis-bound rather than CPU-bound: each task spends most of its time waiting on postgres/pgbouncer and Redis round-trips, not crunching numbers.

Direct observation during a high-throughput `async_api` job:
- `async_api` jobs killed by transient Redis errors during `update_state` (the #1219 / #1221 trigger).
- `consumer_utilisation` on the antenna queue: ~0.0016, i.e. the single AMQP consumer's prefetch window is fully occupied essentially all the time. This is the "worker pool too small" signature, not a broker-side issue.

Raising the prefork pool size directly addresses the bottleneck. 16 is a conservative first step (2× cpu_count, roughly matching the observed room on the DB/pgbouncer side). A hotfix override of 16 was applied in production via the env var ahead of this PR and confirmed to drain the backlog on the active jobs.
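The `consumer_utilisation` figure comes from RabbitMQ's management HTTP API (newer RabbitMQ versions also expose it as `consumer_capacity`). A hedged sketch of how such a check might look — the endpoint shape is the standard `/api/queues/<vhost>/<queue>` one, but the helper names and credentials here are illustrative:

```python
import json
from urllib.request import Request, urlopen


def parse_queue_health(payload: dict) -> dict:
    """Extract the fields relevant to the 'worker pool too small' diagnosis
    from a RabbitMQ management-API queue object."""
    return {
        "consumers": payload.get("consumers", 0),
        # Fraction of time consumers could accept new messages; a value near 0
        # means the prefetch window is always full and the pool is the bottleneck.
        "consumer_utilisation": payload.get("consumer_utilisation"),
        "messages_ready": payload.get("messages_ready", 0),
    }


def fetch_queue(host: str, vhost: str, queue: str, auth_header: str) -> dict:
    # Hypothetical host/credentials; GET /api/queues/<vhost>/<queue>.
    req = Request(
        f"http://{host}:15672/api/queues/{vhost}/{queue}",
        headers={"Authorization": auth_header},
    )
    with urlopen(req) as resp:
        return parse_queue_health(json.load(resp))


# Illustrative payload (only the 0.0016 utilisation is from the incident above):
sample = {"consumers": 1, "consumer_utilisation": 0.0016, "messages_ready": 50000}
print(parse_queue_health(sample))
```

A utilisation this low with a single consumer and a growing `messages_ready` is exactly the signature described above: the broker has work ready, but the worker pool cannot take it.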
Why 16 specifically
It is the smallest power-of-2 step that roughly matches the empirical gap between ingress and drain on the production incident that motivated this PR, without risking pgbouncer saturation. Deployments with different DB/pgbouncer capacity can override via env var. A larger default can be considered once we have measured postgres connection-pool headroom (see "what we still need to verify" below).
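The pgbouncer-saturation concern can be sanity-checked with simple arithmetic — a hedged sketch, where the worker/host counts and pool size are illustrative placeholders, not measured values from this deployment:

```python
def pg_connections_needed(worker_hosts: int, concurrency: int,
                          conns_per_process: int = 1) -> int:
    """Upper bound on server-side connections the prefork pools can demand.
    Each prefork child typically holds at most one Django DB connection."""
    return worker_hosts * concurrency * conns_per_process


# Illustrative numbers (not from the PR): one worker host at the new default
# concurrency of 16, against a hypothetical pgbouncer pool of 20.
demand = pg_connections_needed(worker_hosts=1, concurrency=16)
pgbouncer_pool = 20  # hypothetical default_pool_size
print(f"demand={demand}, headroom={pgbouncer_pool - demand}")  # demand=16, headroom=4
```

The same arithmetic shows why a larger default is riskier: doubling concurrency again, or adding a second worker host, would exceed this hypothetical pool and queue inside pgbouncer instead of inside Celery.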
What this does not change
- The prefetch multiplier stays at `1` — that was already set and fairness behaviour is unchanged.
- Splitting the `antenna` queue into a dedicated "ingest fast path" vs "housekeeping / status-check" queue is a larger follow-up, filed separately.
- The pool implementation stays `prefork`. Switching to `gevent` for this queue may give much higher effective concurrency on an IO-bound workload, but every task on this queue would need to be audited for gevent-safety (blocking C extensions, thread-locals in PyTorch paths, etc.) first. Out of scope here.
What we still need to verify
- Postgres/pgbouncer connection headroom — 16 processes should fit within `default_pool_size`, but worth confirming under load.
- Memory growth at higher concurrency (`--max-tasks-per-child=100` / `--max-memory-per-child=2 GiB` already bound each process).
Related
- #1219 — `async_api` jobs killed by transient Redis errors during `update_state`; `RedisError` and "state actually missing" are conflated into a single fatal path. Code-path brittleness that lets a single transient Redis error mark an active job FAILURE and delete state (independent of this PR).
- #1221 — fix(cache): enable SO_KEEPALIVE on django-redis cache connections; reduces how often the #1219 path triggers (independent of this PR).