Skip to content

fix(cache): enable SO_KEEPALIVE on django-redis cache connections#1221

Merged
mihow merged 2 commits intomainfrom
fix/redis-cache-socket-keepalive
Apr 14, 2026
Merged

fix(cache): enable SO_KEEPALIVE on django-redis cache connections#1221
mihow merged 2 commits intomainfrom
fix/redis-cache-socket-keepalive

Conversation

@mihow
Copy link
Copy Markdown
Collaborator

@mihow mihow commented Apr 11, 2026

Summary

Adds TCP keepalive (SOCKET_KEEPALIVE + SOCKET_KEEPALIVE_OPTIONS) and a connect timeout to CACHES["default"]["OPTIONS"] in config/settings/base.py, extracts a shared TCP_KEEPALIVE_OPTIONS constant so the same tuning is used for both the django-redis cache and the Celery/RabbitMQ broker, and exposes the knobs as env vars so production deployments can tune without a code change.

Why

redis-py does not set SO_KEEPALIVE unless the client explicitly opts in via socket_keepalive=True. Without it, the kernel never sends TCP keepalive probes on the socket — regardless of host-level sysctl tuning. Pooled Redis connections can sit idle long enough for a stateful cloud firewall to silently drop them; the next task to borrow one from the pool fails with ECONNRESET.

Direct observation on a production worker: ss -ton state established dport = :6379 showed only some Redis connections with a timer:(keepalive,...) entry. The bare ones are the django-redis cache connections this PR fixes.

Context: how this gap got missed

The TCP-keepalive pattern for Redis/AMQP connections is not new in this repo. Three earlier PRs set it up for Celery-side paths but did not touch the django-redis cache block:

None of the three PR discussions mention the CACHES block or any reason not to apply the same tuning there. The omission was by oversight, not a conscious decision — the django-redis cache path uses its own OPTIONS dict that doesn't inherit from Celery config.

What this PR adds

In config/settings/base.py:

  1. A shared TCP_KEEPALIVE_OPTIONS constant (60/10/9 defaults, overridable via TCP_KEEPALIVE_IDLE, TCP_KEEPALIVE_INTVL, TCP_KEEPALIVE_CNT env vars) used by both:

    • CACHES["default"]["OPTIONS"]["SOCKET_KEEPALIVE_OPTIONS"] (new)
    • CELERY_BROKER_TRANSPORT_OPTIONS["socket_settings"] (replacing the inline duplicate)
  2. SOCKET_KEEPALIVE: True on the cache, overridable via REDIS_CACHE_SOCKET_KEEPALIVE.

  3. SOCKET_CONNECT_TIMEOUT: 5 on the cache, overridable via REDIS_CACHE_SOCKET_CONNECT_TIMEOUT. This caps connect-time so that if the Redis host is briefly unreachable we fail fast rather than blocking the pooled-connection acquisition indefinitely. 5s is well above normal connect latency (noted explicitly because Copilot flagged this as out-of-scope relative to the PR title — it's a small scope addition but intentional and documented in the code).

Defaults vs. production tuning

The defaults are safe for every environment:

  • local dev, docker-compose, staging, demo — on localhost the keepalive probes are effectively no-ops; the settings cost nothing.
  • production — the defaults match what's already in CELERY_BROKER_TRANSPORT_OPTIONS and are correct for deployments behind a stateful firewall that drops idle connections after ~1h.

If a deployment has different network behavior (e.g., a more aggressive NAT that drops sooner, or a high-latency link where 5s connect is too tight), operators can override via env:

Env var Default What it does
TCP_KEEPALIVE_IDLE 60 Seconds of idle before first keepalive probe
TCP_KEEPALIVE_INTVL 10 Seconds between probes
TCP_KEEPALIVE_CNT 9 Probes before giving up
REDIS_CACHE_SOCKET_KEEPALIVE True Enable/disable keepalive on the cache socket
REDIS_CACHE_SOCKET_CONNECT_TIMEOUT 5 Seconds to wait for connect before failing

Recommended production baseline is the defaults (first dead-connection detection at ~60s + 9×10s ≈ 150s). Deployments behind a firewall that drops at <60s idle should lower TCP_KEEPALIVE_IDLE to roughly half the firewall timeout.

Incident this addresses

A production async_api job was processing normally (hundreds of images processed, results flowing into the database) when one process_nats_pipeline_result task hit:

Redis error updating job <id> state:
    Error while reading from redis:6379 : (104, 'Connection reset by peer')

153 ms later the job was marked FAILURE with "Redis state missing for job" and its NATS stream + Redis state were deleted. Other tasks on the same celery worker reached Redis successfully milliseconds before and after — a single pooled connection had been silently dropped.

The code-path brittleness that escalates a transient Redis blip into a fatal FAILURE is tracked separately in #1219 (reopened). This PR fixes the config gap that lets the transient happen in the first place. Both should land.

Verification after deploy

Inside a celery worker container:

ss -ton state established dport = :6379

Every established Redis connection should show timer:(keepalive,...). Before this change, roughly half were bare.

What still needs verifying

Scope

Hypothesis-driven fix based on code reading + direct ss observation. The constant extraction + env tunables also closes a CodeRabbit nit (duplicate dicts between cache and broker keepalive) and the Copilot concern about documenting the connect-timeout choice.

Closes #1218. Related: #1073 (broker precursor), #1219 (code-path fix that pairs with this config fix), #1220.

Without SOCKET_KEEPALIVE, redis-py does not call setsockopt(SO_KEEPALIVE, 1)
on the Redis cache socket, so the kernel never sends TCP keepalive probes
regardless of host-level sysctl tuning. Pooled connections can sit idle long
enough for stateful cloud firewalls to silently drop them, at which point
the next task to borrow one from the pool fails with ECONNRESET.

This mirrors what CELERY_BROKER_TRANSPORT_OPTIONS.socket_settings does for
the celery broker path, which has been protected for some time.

Closes #1218.
Copilot AI review requested due to automatic review settings April 11, 2026 01:17
@netlify
Copy link
Copy Markdown

netlify bot commented Apr 11, 2026

Deploy Preview for antenna-ssec canceled.

Name Link
🔨 Latest commit 5f71440
🔍 Latest deploy log https://app.netlify.com/projects/antenna-ssec/deploys/69de9042e37fea0008f22e79

@netlify
Copy link
Copy Markdown

netlify bot commented Apr 11, 2026

Deploy Preview for antenna-preview canceled.

Name Link
🔨 Latest commit 5f71440
🔍 Latest deploy log https://app.netlify.com/projects/antenna-preview/deploys/69de90423106db0008259efa

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 11, 2026

📝 Walkthrough

Walkthrough

Added TCP keepalive socket configuration to Redis cache settings in config/settings/base.py. Configuration includes enabling SOCKET_KEEPALIVE, setting keepalive options (TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT), and a socket connection timeout. This mirrors the existing Celery broker configuration pattern to prevent idle connection resets.

Changes

Cohort / File(s) Summary
Redis Cache Socket Configuration
config/settings/base.py
Added SOCKET_KEEPALIVE, SOCKET_KEEPALIVE_OPTIONS (TCP_KEEPIDLE=60, TCP_KEEPINTVL=10, TCP_KEEPCNT=9), and SOCKET_CONNECT_TIMEOUT=5 to CACHES default OPTIONS. Aligns with existing Celery broker socket_settings pattern to prevent cloud firewalls from silently dropping idle pooled connections.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

🐰 A firewall dropped our Redis thread,
After one hour of rest in bed,
But now with keepalive TCP,
Our pooled connections dance so free!
No more resets, no more pain,
Hops and bounds through the data lane! 🎉

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: enabling SO_KEEPALIVE socket option on django-redis cache connections to prevent idle connection drops.
Linked Issues check ✅ Passed The PR implementation adds SOCKET_KEEPALIVE, SOCKET_KEEPALIVE_OPTIONS, and SOCKET_CONNECT_TIMEOUT to CACHES configuration, directly addressing the requirements in issue #1218 to prevent cloud firewall idle connection drops.
Out of Scope Changes check ✅ Passed All changes are scoped to adding socket keepalive options to the django-redis cache configuration in config/settings/base.py as required by issue #1218. No extraneous modifications present.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description check ✅ Passed The PR description is comprehensive and well-structured, covering all required template sections with detailed context, rationale, implementation details, and verification instructions.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/redis-cache-socket-keepalive

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates Django’s django-redis cache connection options to enable TCP keepalive, matching the existing Celery broker keepalive configuration, to reduce idle-connection drops by stateful firewalls (per #1218).

Changes:

  • Enable SOCKET_KEEPALIVE for the default Django Redis cache connection pool.
  • Configure SOCKET_KEEPALIVE_OPTIONS (TCP_KEEPIDLE/INTVL/KEEPCNT) to align with existing Celery transport settings.
  • Add a Redis cache socket connect timeout option.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
config/settings/base.py (1)

267-275: Consider extracting shared keepalive tunables to avoid config drift.

The same keepalive values are now duplicated between cache and Celery broker settings. A shared constant would keep them synchronized over time.

♻️ Suggested refactor
+# Shared TCP keepalive tuning (seconds / probe count)
+REDIS_TCP_KEEPALIVE_OPTIONS = {
+    socket.TCP_KEEPIDLE: 60,
+    socket.TCP_KEEPINTVL: 10,
+    socket.TCP_KEEPCNT: 9,
+}
+
 CACHES = {
     "default": {
         "BACKEND": "django_redis.cache.RedisCache",
         "LOCATION": env("REDIS_URL", default=None),
         "OPTIONS": {
@@
             "IGNORE_EXCEPTIONS": True,
             "SOCKET_KEEPALIVE": True,
-            "SOCKET_KEEPALIVE_OPTIONS": {
-                socket.TCP_KEEPIDLE: 60,
-                socket.TCP_KEEPINTVL: 10,
-                socket.TCP_KEEPCNT: 9,
-            },
+            "SOCKET_KEEPALIVE_OPTIONS": REDIS_TCP_KEEPALIVE_OPTIONS,
             "SOCKET_CONNECT_TIMEOUT": 5,
         },
     }
 }
@@
 CELERY_BROKER_TRANSPORT_OPTIONS = {
     "socket_keepalive": True,
-    "socket_settings": {
-        socket.TCP_KEEPIDLE: 60,
-        socket.TCP_KEEPINTVL: 10,
-        socket.TCP_KEEPCNT: 9,
-    },
+    "socket_settings": REDIS_TCP_KEEPALIVE_OPTIONS,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/settings/base.py` around lines 267 - 275, Extract the duplicated
keepalive tunables into a single shared constant (e.g., KEEPALIVE_SETTINGS or
SOCKET_KEEPALIVE_OPTIONS_DEFAULT) and replace the inline dicts currently used
for "SOCKET_KEEPALIVE_OPTIONS" in both the cache and Celery broker config with a
reference to that constant; also consider extracting related flags like
"SOCKET_KEEPALIVE" and "SOCKET_CONNECT_TIMEOUT" into shared constants if they
are duplicated elsewhere so that codepaths that set socket.TCP_KEEPIDLE,
socket.TCP_KEEPINTVL, socket.TCP_KEEPCNT (referenced in the current
SOCKET_KEEPALIVE_OPTIONS) and any connect timeout values all point to the single
source of truth.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@config/settings/base.py`:
- Around line 267-275: Extract the duplicated keepalive tunables into a single
shared constant (e.g., KEEPALIVE_SETTINGS or SOCKET_KEEPALIVE_OPTIONS_DEFAULT)
and replace the inline dicts currently used for "SOCKET_KEEPALIVE_OPTIONS" in
both the cache and Celery broker config with a reference to that constant; also
consider extracting related flags like "SOCKET_KEEPALIVE" and
"SOCKET_CONNECT_TIMEOUT" into shared constants if they are duplicated elsewhere
so that codepaths that set socket.TCP_KEEPIDLE, socket.TCP_KEEPINTVL,
socket.TCP_KEEPCNT (referenced in the current SOCKET_KEEPALIVE_OPTIONS) and any
connect timeout values all point to the single source of truth.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5c328c27-5e51-486b-8ce1-1c59dc4e02b4

📥 Commits

Reviewing files that changed from the base of the PR and between 8f8b177 and 29130d4.

📒 Files selected for processing (1)
  • config/settings/base.py

Extracts TCP_KEEPALIVE_OPTIONS to a module-level constant with env-var
overrides (TCP_KEEPALIVE_IDLE/INTVL/CNT) and reuses it in both the
django-redis CACHES block and CELERY_BROKER_TRANSPORT_OPTIONS. Also
exposes REDIS_CACHE_SOCKET_KEEPALIVE and REDIS_CACHE_SOCKET_CONNECT_TIMEOUT
as env vars for production tuning.

Defaults match the values introduced for the broker in #1073 (60/10/9),
which are safe for local/staging/demo and correct for deployments behind
stateful firewalls.
@mihow mihow merged commit 71f153a into main Apr 14, 2026
7 checks passed
@mihow mihow deleted the fix/redis-cache-socket-keepalive branch April 14, 2026 19:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

2 participants