fix(cache): enable SO_KEEPALIVE on django-redis cache connections #1221
Without `SOCKET_KEEPALIVE`, redis-py does not call `setsockopt(SO_KEEPALIVE, 1)` on the Redis cache socket, so the kernel never sends TCP keepalive probes regardless of host-level sysctl tuning. Pooled connections can sit idle long enough for stateful cloud firewalls to silently drop them, at which point the next task to borrow one from the pool fails with `ECONNRESET`. This mirrors what `CELERY_BROKER_TRANSPORT_OPTIONS["socket_settings"]` has done for the Celery broker path, which has been protected for some time. Closes #1218.
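At the socket level the fix amounts to the following (an illustrative stdlib sketch, not redis-py's actual internals; `apply_keepalive` is a hypothetical helper, and the `TCP_KEEP*` constants are Linux-specific):

```python
import socket

def apply_keepalive(sock, idle=60, intvl=10, cnt=9):
    # Enable keepalive probes on this socket (what redis-py does for
    # socket_keepalive=True / django-redis SOCKET_KEEPALIVE).
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket probe timing (socket_keepalive_options / SOCKET_KEEPALIVE_OPTIONS).
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)    # idle secs before first probe
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, intvl)  # secs between probes
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, cnt)      # failed probes before reset

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
apply_keepalive(sock)
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))  # non-zero once enabled
sock.close()
```

With the default 60/10/9 tuning, a dead peer is detected after roughly 60s of idle plus 9 probes at 10s intervals (~150s worst case), instead of never.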
Pull request overview
This PR updates Django’s django-redis cache connection options to enable TCP keepalive, matching the existing Celery broker keepalive configuration, to reduce idle-connection drops by stateful firewalls (per #1218).
Changes:
- Enable `SOCKET_KEEPALIVE` for the default Django Redis cache connection pool.
- Configure `SOCKET_KEEPALIVE_OPTIONS` (TCP_KEEPIDLE/INTVL/KEEPCNT) to align with existing Celery transport settings.
- Add a Redis cache socket connect timeout option.
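Under those assumptions, the resulting cache block plausibly looks like this (a sketch only — the committed code reads `LOCATION` from `env("REDIS_URL")`; the URL below is a placeholder):

```python
import socket

CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": "redis://localhost:6379/1",  # placeholder for env("REDIS_URL")
        "OPTIONS": {
            "SOCKET_KEEPALIVE": True,        # opt in to SO_KEEPALIVE on pooled connections
            "SOCKET_KEEPALIVE_OPTIONS": {    # match the Celery broker tuning
                socket.TCP_KEEPIDLE: 60,
                socket.TCP_KEEPINTVL: 10,
                socket.TCP_KEEPCNT: 9,
            },
            "SOCKET_CONNECT_TIMEOUT": 5,     # fail fast if Redis is briefly unreachable
        },
    }
}
```

django-redis forwards these `OPTIONS` to the underlying redis-py connection pool, which is where the actual `setsockopt` calls happen.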
🧹 Nitpick comments (1)
config/settings/base.py (1)
267-275: Consider extracting shared keepalive tunables to avoid config drift. The same keepalive values are now duplicated between cache and Celery broker settings. A shared constant would keep them synchronized over time.
♻️ Suggested refactor
```diff
+# Shared TCP keepalive tuning (seconds / probe count)
+REDIS_TCP_KEEPALIVE_OPTIONS = {
+    socket.TCP_KEEPIDLE: 60,
+    socket.TCP_KEEPINTVL: 10,
+    socket.TCP_KEEPCNT: 9,
+}
+
 CACHES = {
     "default": {
         "BACKEND": "django_redis.cache.RedisCache",
         "LOCATION": env("REDIS_URL", default=None),
         "OPTIONS": {
@@
             "IGNORE_EXCEPTIONS": True,
             "SOCKET_KEEPALIVE": True,
-            "SOCKET_KEEPALIVE_OPTIONS": {
-                socket.TCP_KEEPIDLE: 60,
-                socket.TCP_KEEPINTVL: 10,
-                socket.TCP_KEEPCNT: 9,
-            },
+            "SOCKET_KEEPALIVE_OPTIONS": REDIS_TCP_KEEPALIVE_OPTIONS,
             "SOCKET_CONNECT_TIMEOUT": 5,
         },
     }
 }
@@
 CELERY_BROKER_TRANSPORT_OPTIONS = {
     "socket_keepalive": True,
-    "socket_settings": {
-        socket.TCP_KEEPIDLE: 60,
-        socket.TCP_KEEPINTVL: 10,
-        socket.TCP_KEEPCNT: 9,
-    },
+    "socket_settings": REDIS_TCP_KEEPALIVE_OPTIONS,
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@config/settings/base.py` around lines 267 - 275, Extract the duplicated keepalive tunables into a single shared constant (e.g., KEEPALIVE_SETTINGS or SOCKET_KEEPALIVE_OPTIONS_DEFAULT) and replace the inline dicts currently used for "SOCKET_KEEPALIVE_OPTIONS" in both the cache and Celery broker config with a reference to that constant; also consider extracting related flags like "SOCKET_KEEPALIVE" and "SOCKET_CONNECT_TIMEOUT" into shared constants if they are duplicated elsewhere so that codepaths that set socket.TCP_KEEPIDLE, socket.TCP_KEEPINTVL, socket.TCP_KEEPCNT (referenced in the current SOCKET_KEEPALIVE_OPTIONS) and any connect timeout values all point to the single source of truth.
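Applied, the suggested single-source-of-truth shape would be roughly as follows (a sketch, not the committed settings module; the placeholder URL stands in for `env("REDIS_URL", default=None)`):

```python
import socket

# Shared TCP keepalive tuning (seconds / probe count), per the suggestion above.
REDIS_TCP_KEEPALIVE_OPTIONS = {
    socket.TCP_KEEPIDLE: 60,
    socket.TCP_KEEPINTVL: 10,
    socket.TCP_KEEPCNT: 9,
}

CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": "redis://localhost:6379/1",  # real code: env("REDIS_URL", default=None)
        "OPTIONS": {
            "IGNORE_EXCEPTIONS": True,
            "SOCKET_KEEPALIVE": True,
            "SOCKET_KEEPALIVE_OPTIONS": REDIS_TCP_KEEPALIVE_OPTIONS,
            "SOCKET_CONNECT_TIMEOUT": 5,
        },
    }
}

CELERY_BROKER_TRANSPORT_OPTIONS = {
    "socket_keepalive": True,
    "socket_settings": REDIS_TCP_KEEPALIVE_OPTIONS,
}
```

Because both consumers reference the same dict object, the cache and broker tunings cannot drift apart.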
Extracts TCP_KEEPALIVE_OPTIONS to a module-level constant with env-var overrides (TCP_KEEPALIVE_IDLE/INTVL/CNT) and reuses it in both the django-redis CACHES block and CELERY_BROKER_TRANSPORT_OPTIONS. Also exposes REDIS_CACHE_SOCKET_KEEPALIVE and REDIS_CACHE_SOCKET_CONNECT_TIMEOUT as env vars for production tuning. Defaults match the values introduced for the broker in #1073 (60/10/9), which are safe for local/staging/demo and correct for deployments behind stateful firewalls.
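A hedged sketch of that constant with env-var overrides (`os.environ` and the `_env_int` helper stand in for the settings module's actual `env()` reader):

```python
import os
import socket

def _env_int(name, default):
    # Stand-in for the settings module's env() reader (assumption).
    return int(os.environ.get(name, default))

# Shared TCP keepalive tuning, overridable per deployment.
TCP_KEEPALIVE_OPTIONS = {
    socket.TCP_KEEPIDLE: _env_int("TCP_KEEPALIVE_IDLE", 60),    # idle secs before first probe
    socket.TCP_KEEPINTVL: _env_int("TCP_KEEPALIVE_INTVL", 10),  # secs between probes
    socket.TCP_KEEPCNT: _env_int("TCP_KEEPALIVE_CNT", 9),       # probes before giving up
}

# Reused verbatim in both consumers so they cannot drift apart:
#   CACHES["default"]["OPTIONS"]["SOCKET_KEEPALIVE_OPTIONS"] = TCP_KEEPALIVE_OPTIONS
#   CELERY_BROKER_TRANSPORT_OPTIONS["socket_settings"] = TCP_KEEPALIVE_OPTIONS
```

With the defaults unset in the environment, the constant reproduces the 60/10/9 broker tuning.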
Summary
Adds TCP keepalive (`SOCKET_KEEPALIVE` + `SOCKET_KEEPALIVE_OPTIONS`) and a connect timeout to `CACHES["default"]["OPTIONS"]` in `config/settings/base.py`, extracts a shared `TCP_KEEPALIVE_OPTIONS` constant so the same tuning is used for both the django-redis cache and the Celery/RabbitMQ broker, and exposes the knobs as env vars so production deployments can tune without a code change.

Why
redis-py does not set `SO_KEEPALIVE` unless the client explicitly opts in via `socket_keepalive=True`. Without it, the kernel never sends TCP keepalive probes on the socket — regardless of host-level sysctl tuning. Pooled Redis connections can sit idle long enough for a stateful cloud firewall to silently drop them; the next task to borrow one from the pool fails with `ECONNRESET`.

Direct observation on a production worker: `ss -ton state established dport = :6379` showed only some Redis connections with a `timer:(keepalive,...)` entry. The bare ones are the django-redis cache connections this PR fixes.

Context: how this gap got missed
The TCP-keepalive pattern for Redis/AMQP connections is not new in this repo. Three earlier PRs set it up for Celery-side paths but did not touch the django-redis cache block:

- One touched `CELERY_BROKER_TRANSPORT_OPTIONS` only.
- One added `CELERY_REDIS_SOCKET_KEEPALIVE` for the Redis result-backend connection pool (distinct from the django-redis cache pool). `CACHES` untouched.
- One added `socket_keepalive` + `socket_settings` (TCP_KEEPIDLE=60, TCP_KEEPINTVL=10, TCP_KEEPCNT=9) to `CELERY_BROKER_TRANSPORT_OPTIONS` to fix "zombie Celery tasks" (Fix dangling processing jobs by adding TCP keepalives #1072), coordinated with host-level sysctl tuning. `CACHES` untouched; the django-redis path was never discussed.

None of the three PR discussions mention the `CACHES` block or any reason not to apply the same tuning there. The omission was by oversight, not a conscious decision — the django-redis cache path uses its own `OPTIONS` dict that doesn't inherit from Celery config.

What this PR adds
In `config/settings/base.py`:

- A shared `TCP_KEEPALIVE_OPTIONS` constant (60/10/9 defaults, overridable via `TCP_KEEPALIVE_IDLE`, `TCP_KEEPALIVE_INTVL`, `TCP_KEEPALIVE_CNT` env vars) used by both:
  - `CACHES["default"]["OPTIONS"]["SOCKET_KEEPALIVE_OPTIONS"]` (new)
  - `CELERY_BROKER_TRANSPORT_OPTIONS["socket_settings"]` (replacing the inline duplicate)
- `SOCKET_KEEPALIVE: True` on the cache, overridable via `REDIS_CACHE_SOCKET_KEEPALIVE`.
- `SOCKET_CONNECT_TIMEOUT: 5` on the cache, overridable via `REDIS_CACHE_SOCKET_CONNECT_TIMEOUT`. This caps connect time so that if the Redis host is briefly unreachable we fail fast rather than blocking pooled-connection acquisition indefinitely. 5s is well above normal connect latency (noted explicitly because Copilot flagged this as out-of-scope relative to the PR title — it's a small scope addition but intentional and documented in the code).

Defaults vs. production tuning
The defaults are safe for every environment: they match `CELERY_BROKER_TRANSPORT_OPTIONS` and are correct for deployments behind a stateful firewall that drops idle connections after ~1h.

If a deployment has different network behavior (e.g., a more aggressive NAT that drops sooner, or a high-latency link where 5s connect is too tight), operators can override via env:

- `TCP_KEEPALIVE_IDLE`
- `TCP_KEEPALIVE_INTVL`
- `TCP_KEEPALIVE_CNT`
- `REDIS_CACHE_SOCKET_KEEPALIVE`
- `REDIS_CACHE_SOCKET_CONNECT_TIMEOUT`

Recommended production baseline is the defaults (first dead-connection detection at ~60s + 9×10s ≈ 150s). Deployments behind a firewall that drops at <60s idle should lower `TCP_KEEPALIVE_IDLE` to roughly half the firewall timeout.

Incident this addresses
A production `async_api` job was processing normally (hundreds of images processed, results flowing into the database) when one `process_nats_pipeline_result` task hit:

(error output elided)

153 ms later the job was marked FAILURE with "Redis state missing for job" and its NATS stream + Redis state were deleted. Other tasks on the same Celery worker reached Redis successfully milliseconds before and after — a single pooled connection had been silently dropped.
The code-path brittleness that escalates a transient Redis blip into a fatal FAILURE is tracked separately in #1219 (reopened). This PR fixes the config gap that lets the transient happen in the first place. Both should land.
Verification after deploy
Inside a celery worker container, run the same check as above: `ss -ton state established dport = :6379`.

Every established Redis connection should show `timer:(keepalive,...)`. Before this change, roughly half were bare.

What still needs verifying
- `ss -ton` on production workers after deploy, to confirm all connections now have keepalive timers.
- Whether `async_api` jobs stop being killed by transient Redis errors during `update_state` once this PR and #1219 (`RedisError` and "state actually missing" are conflated into a single fatal path) are both in.
- Surfacing the effective redis-py connection options so config drift is visible in ops.

Scope
Hypothesis-driven fix based on code reading + direct `ss` observation. The constant extraction + env tunables also close a CodeRabbit nit (duplicate dicts between cache and broker keepalive) and the Copilot concern about documenting the connect-timeout choice.

Closes #1218. Related: #1073 (broker precursor), #1219 (code-path fix that pairs with this config fix), #1220.