
[13.x] Fix queue worker entering infinite loop on persistent pop exceptions#59579

Closed
Avnsh1111 wants to merge 1 commit into laravel:13.x from Avnsh1111:fix/queue-worker-infinite-pop-loop

Conversation

@Avnsh1111

The Problem

When Worker::getNextJob() catches a Throwable that is not a database connection error (e.g. SQS SDK timeout, Redis auth failure, Beanstalkd connection drop), the worker enters an infinite silent loop:

catch → report → stopWorkerIfLostConnection (no match) → sleep(1) → retry → catch → forever

stopWorkerIfLostConnection() uses the DetectsLostConnections trait which only matches database error strings (MySQL, PostgreSQL, SQLite). Queue infrastructure failures from SQS, Redis, or HTTP-based drivers never trigger a stop condition. The --timeout flag also cannot help because pcntl_alarm is only registered after getNextJob() returns.

The result: workers appear healthy to process supervisors (Docker, Supervisor) but silently process zero messages — potentially for days (as reported in #59517).
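The failure path above can be modeled in isolation. This is a simplified sketch, not the actual `Worker` code: `causedByLostConnection()` here stands in for the `DetectsLostConnections` trait, which only matches database error strings, so a queue-infrastructure exception never sets the quit flag.

```php
<?php
// Simplified model of the current Worker::getNextJob() failure path.
// The real loop is unbounded; it is capped at 5 iterations here so the
// sketch terminates.
function causedByLostConnection(Throwable $e): bool
{
    // Stand-in for DetectsLostConnections: matches only DB error strings.
    return str_contains($e->getMessage(), 'server has gone away');
}

$shouldQuit = false;
$attempts = 0;

while (! $shouldQuit && $attempts < 5) {
    try {
        $attempts++;
        // A non-database failure, e.g. an SQS SDK timeout.
        throw new RuntimeException('cURL error 28: SQS request timed out');
    } catch (Throwable $e) {
        // report($e) would run here.
        if (causedByLostConnection($e)) {
            $shouldQuit = true; // never reached for SQS/Redis/Beanstalkd errors
        }
        // sleep(1), then retry — forever, in the real worker.
    }
}

echo $shouldQuit ? 'quit' : 'looped forever', "\n"; // prints "looped forever"
```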

The Fix

Adds a lightweight consecutive pop failure counter ($popFailures) directly in getNextJob():

  • Increment on each caught exception
  • Reset to 0 on any successful pop (including empty queue returns)
  • Set $this->shouldQuit = true after 100 consecutive failures
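The three rules above can be sketched as a minimal, self-contained model (this is illustrative, not the PR's actual diff; only `$popFailures`, `shouldQuit`, and the threshold of 100 come from the description — the closures and the fake connection are assumptions):

```php
<?php
// Minimal model of the proposed pop-failure counter in getNextJob().
const MAX_POP_FAILURES = 100; // threshold from the PR

$popFailures = 0;
$shouldQuit = false;

// One pop attempt with counter bookkeeping.
$getNextJob = function ($connection) use (&$popFailures, &$shouldQuit) {
    try {
        $job = $connection->pop();   // may return null on an empty queue
        $popFailures = 0;            // reset on ANY successful pop, even empty
        return $job;
    } catch (Throwable $e) {
        // report() and stopWorkerIfLostConnection() would run first, as today.
        if (++$popFailures >= MAX_POP_FAILURES) {
            $shouldQuit = true;      // picked up by the existing stop path
        }
        return null;
    }
};

// A fake connection that always throws, like a persistent SQS timeout.
$broken = new class {
    public function pop() { throw new RuntimeException('SQS timeout'); }
};

for ($i = 0; $i < 100; $i++) {
    $getNextJob($broken);
}

var_dump($shouldQuit); // bool(true) after 100 consecutive failures
```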

This uses the existing shouldQuit → stopIfNecessary() → graceful exit path with no new CLI options, no new enum cases, and no new public API surface. The worker exits cleanly with status 0, allowing Supervisor/Docker to restart it.

100 consecutive failures (with a 1s sleep between each) means the worker tolerates roughly 100 seconds of transient errors before giving up — enough to ride out brief network blips while still catching persistent infrastructure failures.

Benefit to end users

Queue workers will no longer silently become zombies when the queue backend (SQS, Redis, etc.) has a persistent connection issue. They will exit gracefully after sustained failures, allowing process managers to restart them with a fresh connection.

Why this doesn't break existing features

  • The counter only increments on exceptions inside getNextJob() — normal job processing is completely unaffected
  • The counter resets on every successful pop (including when the queue is empty), so transient/intermittent errors never accumulate
  • Workers that hit a database connection error still take the existing lostConnection exit path (which fires first)
  • The 100-failure threshold is deliberately high to avoid false positives
  • No changes to WorkerOptions, CLI arguments, or public API

Tests

Two new tests in QueueWorkerTest:

  1. testWorkerQuitsAfterConsecutivePopFailures — verifies the worker exits after sustained pop exceptions using BrokenQueueConnection
  2. testWorkerPopFailureCounterResetsOnSuccess — verifies intermittent failures followed by success don't trigger a quit, using a new IntermittentBrokenQueueConnection fake

All 27 existing queue worker tests continue to pass.
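The reset scenario exercised by the second test can be modeled standalone (the fake names and loop shape below are illustrative assumptions, not the PR's actual test code): repeated bursts of failures just under the threshold, each followed by one successful pop, never trigger a quit.

```php
<?php
// Model of the scenario behind testWorkerPopFailureCounterResetsOnSuccess:
// 99 failures, then one successful (empty) pop, repeated — the counter
// never reaches the threshold, so the worker keeps running.
$popFailures = 0;
$shouldQuit = false;
$threshold = 100;

for ($round = 0; $round < 5; $round++) {
    for ($i = 0; $i < 99; $i++) {
        // Simulated pop exception.
        if (++$popFailures >= $threshold) {
            $shouldQuit = true;
        }
    }
    // Simulated successful pop returning null (empty queue): full reset.
    $popFailures = 0;
}

var_dump($shouldQuit); // bool(false): intermittent errors never accumulate
```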

Fixes #59517

…ptions

When getNextJob() catches a non-database Throwable (e.g. SQS timeout,
Redis auth failure), stopWorkerIfLostConnection() never matches because
it only checks database error strings. The worker silently loops forever:
catch → sleep(1) → retry → catch → forever. Workers appear healthy to
process supervisors but process zero messages.

Adds a consecutive pop failure counter that triggers shouldQuit after
100 failures, allowing the worker to exit gracefully and be restarted
by Supervisor/Docker. The counter resets on any successful pop, so
transient errors do not accumulate.

Fixes laravel#59517

Development

Successfully merging this pull request may close these issues.

queue:work daemon enters infinite silent loop when getNextJob() throws non-database exceptions

2 participants