Clarify worker cancellation requirements for stuck jobs #1264
brandur left a comment:
@peter941221 I think some of this is a little too much detail for some of the areas like README and general Godoc intro — these sections are arguably already a bit too long and we should be really prescriptive about what is allowed to go in there.
I think the piece in worker.go is a nice improvement — do we want to revert the other stuff and we can keep that?
In terms of general documentation — I think what I'll do is put together a specific "stuck jobs" advanced topics page. We have a little already at https://riverqueue.com/docs/graceful-shutdown#stuck-programs, but this could stand to be expanded a bit.
@brandur sure, will do that.

Reverted the broader README, Godoc, and example additions and kept the worker.go clarification only. We can move the deeper stuck-jobs guidance to a dedicated advanced topics page later.
This follows up the stuck-job discussion in #1258.
The current docs already mention that workers should respect context cancellation, but they don't quite connect that requirement to the failure mode users are likely to see in practice: jobs that remain in `running` because worker code is blocked without also observing `ctx.Done()`.

- Updates `Worker.Work`'s contract to call out blocking operations like channels, timers, and network work explicitly, and recommends a `select` that also watches `ctx.Done()`.
- Points to `num_jobs_stuck` in the producer job counts log line when investigating jobs that appear stuck in `running`.

The goal here isn't to change rescue semantics, only to make the existing behavior easier to understand and diagnose.
Verification:
- `gofmt`.
- `go test ./... -run TestDoesNotExist -count=1` to confirm the repo still loads and compiles after the docs/example changes.

Caveat: