fix(cache): enable SO_KEEPALIVE on django-redis cache connections #1221
Without `SOCKET_KEEPALIVE`, redis-py does not call `setsockopt(SO_KEEPALIVE, 1)` on the Redis cache socket, so the kernel never sends TCP keepalive probes regardless of host-level sysctl tuning. Pooled connections can sit idle long enough for stateful cloud firewalls to silently drop them, at which point the next task to borrow one from the pool fails with `ECONNRESET`. This mirrors what `CELERY_BROKER_TRANSPORT_OPTIONS["socket_settings"]` has done for the Celery broker path, which has been protected for some time. Closes #1218.
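At the socket level the fix amounts to the following (an illustrative stdlib sketch, not redis-py's actual internals; `apply_keepalive` is a hypothetical helper, and the `TCP_KEEP*` constants are Linux-specific):

```python
import socket

def apply_keepalive(sock, idle=60, intvl=10, cnt=9):
    # Enable keepalive probes on this socket (what redis-py does for
    # socket_keepalive=True / django-redis SOCKET_KEEPALIVE).
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket probe timing (socket_keepalive_options / SOCKET_KEEPALIVE_OPTIONS).
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)    # idle secs before first probe
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, intvl)  # secs between probes
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, cnt)      # failed probes before reset

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
apply_keepalive(sock)
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))  # non-zero once enabled
sock.close()
```

With the default 60/10/9 tuning, a dead peer is detected after roughly 60s of idle plus 9 probes at 10s intervals (~150s worst case), instead of never.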
Pull request overview
This PR updates Django’s django-redis cache connection options to enable TCP keepalive, matching the existing Celery broker keepalive configuration, to reduce idle-connection drops by stateful firewalls (per #1218).
Changes:
- Enable `SOCKET_KEEPALIVE` for the default Django Redis cache connection pool.
- Configure `SOCKET_KEEPALIVE_OPTIONS` (TCP_KEEPIDLE/INTVL/KEEPCNT) to align with existing Celery transport settings.
- Add a Redis cache socket connect timeout option.
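Under those assumptions, the resulting cache block plausibly looks like this (a sketch only — the committed code reads `LOCATION` from `env("REDIS_URL")`; the URL below is a placeholder):

```python
import socket

CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": "redis://localhost:6379/1",  # placeholder for env("REDIS_URL")
        "OPTIONS": {
            "SOCKET_KEEPALIVE": True,        # opt in to SO_KEEPALIVE on pooled connections
            "SOCKET_KEEPALIVE_OPTIONS": {    # match the Celery broker tuning
                socket.TCP_KEEPIDLE: 60,
                socket.TCP_KEEPINTVL: 10,
                socket.TCP_KEEPCNT: 9,
            },
            "SOCKET_CONNECT_TIMEOUT": 5,     # fail fast if Redis is briefly unreachable
        },
    }
}
```

django-redis forwards these `OPTIONS` to the underlying redis-py connection pool, which is where the actual `setsockopt` calls happen.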
🧹 Nitpick comments (1)
config/settings/base.py (1)
267-275: Consider extracting shared keepalive tunables to avoid config drift. The same keepalive values are now duplicated between cache and Celery broker settings. A shared constant would keep them synchronized over time.
♻️ Suggested refactor
```diff
+# Shared TCP keepalive tuning (seconds / probe count)
+REDIS_TCP_KEEPALIVE_OPTIONS = {
+    socket.TCP_KEEPIDLE: 60,
+    socket.TCP_KEEPINTVL: 10,
+    socket.TCP_KEEPCNT: 9,
+}
+
 CACHES = {
     "default": {
         "BACKEND": "django_redis.cache.RedisCache",
         "LOCATION": env("REDIS_URL", default=None),
         "OPTIONS": {
@@
             "IGNORE_EXCEPTIONS": True,
             "SOCKET_KEEPALIVE": True,
-            "SOCKET_KEEPALIVE_OPTIONS": {
-                socket.TCP_KEEPIDLE: 60,
-                socket.TCP_KEEPINTVL: 10,
-                socket.TCP_KEEPCNT: 9,
-            },
+            "SOCKET_KEEPALIVE_OPTIONS": REDIS_TCP_KEEPALIVE_OPTIONS,
             "SOCKET_CONNECT_TIMEOUT": 5,
         },
     }
 }
@@
 CELERY_BROKER_TRANSPORT_OPTIONS = {
     "socket_keepalive": True,
-    "socket_settings": {
-        socket.TCP_KEEPIDLE: 60,
-        socket.TCP_KEEPINTVL: 10,
-        socket.TCP_KEEPCNT: 9,
-    },
+    "socket_settings": REDIS_TCP_KEEPALIVE_OPTIONS,
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@config/settings/base.py` around lines 267 - 275, Extract the duplicated keepalive tunables into a single shared constant (e.g., KEEPALIVE_SETTINGS or SOCKET_KEEPALIVE_OPTIONS_DEFAULT) and replace the inline dicts currently used for "SOCKET_KEEPALIVE_OPTIONS" in both the cache and Celery broker config with a reference to that constant; also consider extracting related flags like "SOCKET_KEEPALIVE" and "SOCKET_CONNECT_TIMEOUT" into shared constants if they are duplicated elsewhere so that codepaths that set socket.TCP_KEEPIDLE, socket.TCP_KEEPINTVL, socket.TCP_KEEPCNT (referenced in the current SOCKET_KEEPALIVE_OPTIONS) and any connect timeout values all point to the single source of truth.
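Applied, the suggested single-source-of-truth shape would be roughly as follows (a sketch, not the committed settings module; the placeholder URL stands in for `env("REDIS_URL", default=None)`):

```python
import socket

# Shared TCP keepalive tuning (seconds / probe count), per the suggestion above.
REDIS_TCP_KEEPALIVE_OPTIONS = {
    socket.TCP_KEEPIDLE: 60,
    socket.TCP_KEEPINTVL: 10,
    socket.TCP_KEEPCNT: 9,
}

CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": "redis://localhost:6379/1",  # real code: env("REDIS_URL", default=None)
        "OPTIONS": {
            "IGNORE_EXCEPTIONS": True,
            "SOCKET_KEEPALIVE": True,
            "SOCKET_KEEPALIVE_OPTIONS": REDIS_TCP_KEEPALIVE_OPTIONS,
            "SOCKET_CONNECT_TIMEOUT": 5,
        },
    }
}

CELERY_BROKER_TRANSPORT_OPTIONS = {
    "socket_keepalive": True,
    "socket_settings": REDIS_TCP_KEEPALIVE_OPTIONS,
}
```

Because both consumers reference the same dict object, the cache and broker tunings cannot drift apart.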
Extracts TCP_KEEPALIVE_OPTIONS to a module-level constant with env-var overrides (TCP_KEEPALIVE_IDLE/INTVL/CNT) and reuses it in both the django-redis CACHES block and CELERY_BROKER_TRANSPORT_OPTIONS. Also exposes REDIS_CACHE_SOCKET_KEEPALIVE and REDIS_CACHE_SOCKET_CONNECT_TIMEOUT as env vars for production tuning. Defaults match the values introduced for the broker in #1073 (60/10/9), which are safe for local/staging/demo and correct for deployments behind stateful firewalls.
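A hedged sketch of that constant with env-var overrides (`os.environ` and the `_env_int` helper stand in for the settings module's actual `env()` reader):

```python
import os
import socket

def _env_int(name, default):
    # Stand-in for the settings module's env() reader (assumption).
    return int(os.environ.get(name, default))

# Shared TCP keepalive tuning, overridable per deployment.
TCP_KEEPALIVE_OPTIONS = {
    socket.TCP_KEEPIDLE: _env_int("TCP_KEEPALIVE_IDLE", 60),    # idle secs before first probe
    socket.TCP_KEEPINTVL: _env_int("TCP_KEEPALIVE_INTVL", 10),  # secs between probes
    socket.TCP_KEEPCNT: _env_int("TCP_KEEPALIVE_CNT", 9),       # probes before giving up
}

# Reused verbatim in both consumers so they cannot drift apart:
#   CACHES["default"]["OPTIONS"]["SOCKET_KEEPALIVE_OPTIONS"] = TCP_KEEPALIVE_OPTIONS
#   CELERY_BROKER_TRANSPORT_OPTIONS["socket_settings"] = TCP_KEEPALIVE_OPTIONS
```

With the defaults unset in the environment, the constant reproduces the 60/10/9 broker tuning.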
Summary
Adds TCP keepalive (`SOCKET_KEEPALIVE` + `SOCKET_KEEPALIVE_OPTIONS`) and a connect timeout to `CACHES["default"]["OPTIONS"]` in `config/settings/base.py`, extracts a shared `TCP_KEEPALIVE_OPTIONS` constant so the same tuning is used for both the django-redis cache and the Celery/RabbitMQ broker, and exposes the knobs as env vars so production deployments can tune without a code change.

Why
redis-py does not set `SO_KEEPALIVE` unless the client explicitly opts in via `socket_keepalive=True`. Without it, the kernel never sends TCP keepalive probes on the socket — regardless of host-level sysctl tuning. Pooled Redis connections can sit idle long enough for a stateful cloud firewall to silently drop them; the next task to borrow one from the pool fails with `ECONNRESET`.

Direct observation on a production worker: `ss -ton state established dport = :6379` showed only some Redis connections with a `timer:(keepalive,...)` entry. The bare ones are the django-redis cache connections this PR fixes.

Context: how this gap got missed
The TCP-keepalive pattern for Redis/AMQP connections is not new in this repo. Three earlier PRs set it up for Celery-side paths but did not touch the django-redis cache block:

- One touched `CELERY_BROKER_TRANSPORT_OPTIONS` only.
- One added `CELERY_REDIS_SOCKET_KEEPALIVE` for the Redis result-backend connection pool (distinct from the django-redis cache pool). `CACHES` untouched.
- One added `socket_keepalive` + `socket_settings` (TCP_KEEPIDLE=60, TCP_KEEPINTVL=10, TCP_KEEPCNT=9) to `CELERY_BROKER_TRANSPORT_OPTIONS` to fix "zombie Celery tasks" (Fix dangling processing jobs by adding TCP keepalives #1072), coordinated with host-level sysctl tuning. `CACHES` untouched; the django-redis path was never discussed.

None of the three PR discussions mention the `CACHES` block or any reason not to apply the same tuning there. The omission was by oversight, not a conscious decision — the django-redis cache path uses its own `OPTIONS` dict that doesn't inherit from Celery config.

What this PR adds
In `config/settings/base.py`:

- A shared `TCP_KEEPALIVE_OPTIONS` constant (60/10/9 defaults, overridable via `TCP_KEEPALIVE_IDLE`, `TCP_KEEPALIVE_INTVL`, `TCP_KEEPALIVE_CNT` env vars) used by both:
  - `CACHES["default"]["OPTIONS"]["SOCKET_KEEPALIVE_OPTIONS"]` (new)
  - `CELERY_BROKER_TRANSPORT_OPTIONS["socket_settings"]` (replacing the inline duplicate)
- `SOCKET_KEEPALIVE: True` on the cache, overridable via `REDIS_CACHE_SOCKET_KEEPALIVE`.
- `SOCKET_CONNECT_TIMEOUT: 5` on the cache, overridable via `REDIS_CACHE_SOCKET_CONNECT_TIMEOUT`. This caps connect time so that if the Redis host is briefly unreachable we fail fast rather than blocking pooled-connection acquisition indefinitely. 5s is well above normal connect latency (noted explicitly because Copilot flagged this as out-of-scope relative to the PR title — it's a small scope addition but intentional and documented in the code).

Defaults vs. production tuning
The defaults are safe for every environment: they match `CELERY_BROKER_TRANSPORT_OPTIONS` and are correct for deployments behind a stateful firewall that drops idle connections after ~1h.

If a deployment has different network behavior (e.g., a more aggressive NAT that drops sooner, or a high-latency link where 5s connect is too tight), operators can override via env:

- `TCP_KEEPALIVE_IDLE`
- `TCP_KEEPALIVE_INTVL`
- `TCP_KEEPALIVE_CNT`
- `REDIS_CACHE_SOCKET_KEEPALIVE`
- `REDIS_CACHE_SOCKET_CONNECT_TIMEOUT`

Recommended production baseline is the defaults (first dead-connection detection at ~60s + 9×10s ≈ 150s). Deployments behind a firewall that drops at <60s idle should lower `TCP_KEEPALIVE_IDLE` to roughly half the firewall timeout.

Incident this addresses
A production `async_api` job was processing normally (hundreds of images processed, results flowing into the database) when one `process_nats_pipeline_result` task hit:

(error output elided)

153 ms later the job was marked FAILURE with "Redis state missing for job" and its NATS stream + Redis state were deleted. Other tasks on the same Celery worker reached Redis successfully milliseconds before and after — a single pooled connection had been silently dropped.
The code-path brittleness that escalates a transient Redis blip into a fatal FAILURE is tracked separately in #1219 (reopened). This PR fixes the config gap that lets the transient happen in the first place. Both should land.
Verification after deploy
Inside a celery worker container, run the same check as above: `ss -ton state established dport = :6379`.

Every established Redis connection should show `timer:(keepalive,...)`. Before this change, roughly half were bare.

What still needs verifying
- `ss -ton` on production workers after deploy, to confirm all connections now have keepalive timers.
- Whether `async_api` jobs stop being killed by transient Redis errors during `update_state` once this PR and #1219 (`RedisError` and "state actually missing" are conflated into a single fatal path) are both in.
- Surfacing the effective redis-py connection options so config drift is visible in ops.

Scope
Hypothesis-driven fix based on code reading + direct `ss` observation. The constant extraction + env tunables also close a CodeRabbit nit (duplicate dicts between cache and broker keepalive) and the Copilot concern about documenting the connect-timeout choice.

Closes #1218. Related: #1073 (broker precursor), #1219 (code-path fix that pairs with this config fix), #1220.