Skip to content

fix(prefect-shell): kill whole process tree on ShellOperation cleanup#21586

Open
zzstoatzz wants to merge 2 commits intomainfrom
fix/shell-operation-orphan-subprocess-20979
Open

fix(prefect-shell): kill whole process tree on ShellOperation cleanup#21586
zzstoatzz wants to merge 2 commits intomainfrom
fix/shell-operation-orphan-subprocess-20979

Conversation

@zzstoatzz
Copy link
Copy Markdown
Collaborator

@zzstoatzz zzstoatzz commented Apr 17, 2026

closes #20979

summary

on posix, ShellOperation spawned its subprocess in the same process group as the flow run. when the worker cancelled the flow run, ProcessManager.kill() signalled only the flow process's pid, and the shell's subprocess (and any child it spawned, like sleep 9999) outlived the flow, reparented to init.

scope

  • posix: targeted fix. isolate the shell with start_new_session=True and signal the whole process group on cleanup so descendants die with it.
  • windows: existing paths (run(), arun(), atrigger()) are intentionally unchanged — the bug as reported is a posix signal/process-group concept and i couldn't verify an equivalent on windows. the only windows-visible change is that sync trigger() + close() now calls process.kill() instead of leaking the subprocess. standard subprocess api, no new windows-specific logic.

the fix

  • _process_isolation_kwargs()start_new_session=True on posix, {} on windows
  • _signal_process_tree(process, sig) → sends sig to the process group on posix, no-op on windows
  • _close_sync_process_tree(process) → sync cleanup. posix: SIGTERM to group → subprocess.wait(timeout=5) → SIGKILL to group if needed. windows: process.kill() + process.wait() (same as pre-PR run() finally)
  • applied to sync run() finally, async arun() finally, async atrigger() via _exit_stack, and sync trigger() via _sync_exit_stack

no internal polling loop — subprocess.wait(timeout=) on the sync side and open_process's process.aclose() on the async side handle the wait. mirrors the pattern in core ProcessManager.kill.

repro (linux)

from prefect import flow
from prefect_shell import ShellOperation

@flow
def sleepy():
    ShellOperation(commands=["sleep 9999"]).run()

deploy to a process work pool, run, cancel via ui.

end-to-end repro on main shell code (bug present)
$ TOKEN="unfixed-$RANDOM"
$ RUN_ID=$(prefect deployment run 'sleepy-20979/sleepy-20979' --param token=$TOKEN)
  flow run id: ffcf0306-7f90-4086-a562-f882f734c5aa

$ ps -Ao pid,ppid,pgid,command | awk '/sleep 9999$/'
  98942 98941 98774 sleep 9999

$ ps -o pgid= -p 98941   # bash pgid
  98774
$ pgrep -f 'python -m prefect.engine' | xargs -I {} ps -o pgid= -p {}
  98774                  # flow pgid — same group as bash/sleep

$ prefect flow-run cancel $RUN_ID
  Flow run was successfully scheduled for cancellation.

# wait 35s for the grace period
$ prefect flow-run inspect $RUN_ID | grep state_name
      state_name='Cancelled'

$ ps -p 98942 >/dev/null && echo "BUG REPRODUCED"
  BUG REPRODUCED: sleep 98942 still alive after cancel
end-to-end repro on this branch (fix present)
$ TOKEN="v2-$RANDOM"
$ RUN_ID=$(prefect deployment run 'sleepy-20979/sleepy-20979' --param token=$TOKEN)
  flow run id: 3223e2d6-09c8-4954-bddc-75b7b5c07e6a

$ ps -Ao pid,ppid,command | awk '/sleep 9999$/'
  21327 21326 sleep 9999

$ prefect flow-run cancel $RUN_ID
  Flow run was successfully scheduled for cancellation.

$ prefect flow-run inspect $RUN_ID | grep state_name
      state_name='Cancelled'

$ ps -p 21327 >/dev/null && echo "sleep alive" || echo "sleep cleaned up"
  sleep cleaned up
$ ps -p 21326 >/dev/null && echo "bash alive"  || echo "bash cleaned up"
  bash cleaned up

verification

posix (linux):

  • reproduced the orphan bug on main's shell code with a deployed flow + real process worker + cancel via cli — sleep orphaned after the flow transitions to Cancelled
  • fix eliminates the orphan in the same scenario
  • direct subprocess repro (simulates the worker's SIGTERM path): 3/3 pre-fix reproduces, 3/3 post-fix clean
  • 4 new tests in TestShellOperationProcessGroup cover sync/async spawn (own session leader) and sync/async context-manager exit (descendants reclaimed)
  • full prefect-shell suite: 61 passed, 3 skipped, no regressions

windows:

  • ran the prefect-shell suite on windows-latest via a temporary CI workflow on this branch; all 31 tests that exercise modified code paths pass. one unrelated test (test_shell_run_command_error_windows) fails on a stale log-count assertion that has nothing to do with ShellOperation — same failure reproduces with this PR's code reverted. note: no main-branch CI currently runs this test, so it has silently drifted.
  • the bug itself was not reproduced or verified on windows. the reported symptom is a posix signal/process-group concept without a direct windows analog. this PR does not claim to fix an orphan-subprocess bug on windows; it preserves existing windows behavior on the affected paths and incidentally fixes a sync trigger() + close() leak using standard subprocess api.

@github-actions github-actions bot added the bug Something isn't working label Apr 17, 2026
zzstoatzz added a commit that referenced this pull request Apr 17, 2026
delete before merging.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
closes #20979

`ShellOperation` spawned its subprocess in the same process group as the
flow run. when the worker cancelled the flow run, `ProcessManager.kill()`
signalled only the flow process's pid, and the shell's subprocess (and
any child it spawned, like `sleep 9999`) outlived the flow, reparented
to init.

isolate the shell in a new posix session via `start_new_session=True`
and kill the whole process group on cleanup (sync `run()` finally,
async `arun()` finally, and both context-manager paths). sync
`trigger()` now registers its subprocess with `_sync_exit_stack` so
`close()` reclaims the tree — previously it leaked.

cleanup helpers mirror the pattern in core `ProcessManager.kill`:
`_signal_process_tree` for non-blocking signal delivery and
`_close_sync_process_tree` for sync paths that need
SIGTERM → `subprocess.wait(timeout=)` → SIGKILL escalation. no
internal polling loop.

windows behavior for existing paths (`run()`, `arun()`, `atrigger()`)
is unchanged — the helpers short-circuit the POSIX group logic and
fall back to `subprocess.Popen.kill` where needed. the only windows
change is that sync `trigger()` + `close()` now calls `process.kill()`
instead of leaking.

verified against a real process worker on linux (flow cancelled cleanly,
no orphaned bash/sleep) and against windows ci (all 31 exercised
tests pass; one pre-existing unrelated flake remains, also reproducible
on main).

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
@zzstoatzz zzstoatzz force-pushed the fix/shell-operation-orphan-subprocess-20979 branch from 2af380f to 9266ea4 Compare April 17, 2026 05:01
@zzstoatzz zzstoatzz marked this pull request as ready for review April 17, 2026 15:45
Copy link
Copy Markdown
Member

@desertaxle desertaxle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of nits, but overall LGTM!

`_process_isolation_kwargs` is POSIX-only, so there is no matching tree
to signal on Windows. Windows cleanup goes through the direct
`subprocess.Popen.kill` / `terminate` on the caller side and is
unchanged from pre-PR behavior.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "pre-PR" references in docstrings are a bit wonky.

Comment on lines +91 to +92
if process.returncode is not None:
return
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codex says there's an issue if the parent exits and has a detached child like this:

with ShellOperation(commands=["sleep 120 & echo $! > /tmp/child.pid"]) as op:
    proc = op.trigger()
    ...
    proc._process.poll()  # shell exits, returncode becomes 0
# child is still alive here

That's a bit of an edge case, but is there any harm in sending a SIGTERM to the process tree even if the parent has already exited?

… exit

- rewrote `_signal_process_tree` and `_close_sync_process_tree` docstrings
  to stop referencing "pre-PR behavior" (per alex's nit)
- `_close_sync_process_tree` now sends SIGTERM to the process group on
  POSIX even when `process.returncode` is already set. detached descendants
  (e.g. `sleep 120 &`) stay in the same session as the shell and can
  outlive it, so returning early left them orphaned. the wait/SIGKILL
  escalation is still gated on the shell still running.
- windows path unchanged in effect (kill+wait only when still running).
- added `test_sync_close_terminates_detached_child_after_shell_exits`
  which reproduces the codex-flagged edge case (shell exits before close).

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ShellOperation continues running after cancelling a flow run when using process type work pool

2 participants