fix(prefect-shell): kill whole process tree on ShellOperation cleanup#21586
Open
fix(prefect-shell): kill whole process tree on ShellOperation cleanup#21586
Conversation
zzstoatzz
added a commit
that referenced
this pull request
Apr 17, 2026
delete before merging. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
closes #20979 `ShellOperation` spawned its subprocess in the same process group as the flow run. when the worker cancelled the flow run, `ProcessManager.kill()` signalled only the flow process's pid, and the shell's subprocess (and any child it spawned, like `sleep 9999`) outlived the flow, reparented to init. isolate the shell in a new posix session via `start_new_session=True` and kill the whole process group on cleanup (sync `run()` finally, async `arun()` finally, and both context-manager paths). sync `trigger()` now registers its subprocess with `_sync_exit_stack` so `close()` reclaims the tree — previously it leaked. cleanup helpers mirror the pattern in core `ProcessManager.kill`: `_signal_process_tree` for non-blocking signal delivery and `_close_sync_process_tree` for sync paths that need SIGTERM → `subprocess.wait(timeout=)` → SIGKILL escalation. no internal polling loop. windows behavior for existing paths (`run()`, `arun()`, `atrigger()`) is unchanged — the helpers short-circuit the POSIX group logic and fall back to `subprocess.Popen.kill` where needed. the only windows change is that sync `trigger()` + `close()` now calls `process.kill()` instead of leaking. verified against a real process worker on linux (flow cancelled cleanly, no orphaned bash/sleep) and against windows ci (all 31 exercised tests pass; one pre-existing unrelated flake remains, also reproducible on main). Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
2af380f to
9266ea4
Compare
desertaxle
approved these changes
Apr 17, 2026
Member
desertaxle
left a comment
There was a problem hiding this comment.
Couple of nits, but overall LGTM!
| `_process_isolation_kwargs` is POSIX-only, so there is no matching tree | ||
| to signal on Windows. Windows cleanup goes through the direct | ||
| `subprocess.Popen.kill` / `terminate` on the caller side and is | ||
| unchanged from pre-PR behavior. |
Member
There was a problem hiding this comment.
The "pre-PR" references in docstrings are a bit wonky.
Comment on lines
+91
to
+92
| if process.returncode is not None: | ||
| return |
Member
There was a problem hiding this comment.
Codex says there's an issue if the parent exits and has a detached child like this:
with ShellOperation(commands=["sleep 120 & echo $! > /tmp/child.pid"]) as op:
proc = op.trigger()
...
proc._process.poll() # shell exits, returncode becomes 0
# child is still alive hereThat's a bit of an edge case, but is there any harm in sending a SIGTERM to the process tree even if the parent has already exited?
… exit - rewrote `_signal_process_tree` and `_close_sync_process_tree` docstrings to stop referencing "pre-PR behavior" (per alex's nit) - `_close_sync_process_tree` now sends SIGTERM to the process group on POSIX even when `process.returncode` is already set. detached descendants (e.g. `sleep 120 &`) stay in the same session as the shell and can outlive it, so returning early left them orphaned. the wait/SIGKILL escalation is still gated on the shell still running. - windows path unchanged in effect (kill+wait only when still running). - added `test_sync_close_terminates_detached_child_after_shell_exits` which reproduces the codex-flagged edge case (shell exits before close). Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
closes #20979
summary
on posix,
ShellOperationspawned its subprocess in the same process group as the flow run. when the worker cancelled the flow run,ProcessManager.kill()signalled only the flow process's pid, and the shell's subprocess (and any child it spawned, likesleep 9999) outlived the flow, reparented to init.scope
start_new_session=Trueand signal the whole process group on cleanup so descendants die with it.run(),arun(),atrigger()) are intentionally unchanged — the bug as reported is a posix signal/process-group concept and i couldn't verify an equivalent on windows. the only windows-visible change is that synctrigger()+close()now callsprocess.kill()instead of leaking the subprocess. standardsubprocessapi, no new windows-specific logic.the fix
_process_isolation_kwargs()→start_new_session=Trueon posix,{}on windows_signal_process_tree(process, sig)→ sendssigto the process group on posix, no-op on windows_close_sync_process_tree(process)→ sync cleanup. posix: SIGTERM to group →subprocess.wait(timeout=5)→ SIGKILL to group if needed. windows:process.kill()+process.wait()(same as pre-PRrun()finally)run()finally, asyncarun()finally, asyncatrigger()via_exit_stack, and synctrigger()via_sync_exit_stackno internal polling loop —
subprocess.wait(timeout=)on the sync side andopen_process'sprocess.aclose()on the async side handle the wait. mirrors the pattern in coreProcessManager.kill.repro (linux)
deploy to a process work pool, run, cancel via ui.
end-to-end repro on
mainshell code (bug present)end-to-end repro on this branch (fix present)
verification
posix (linux):
main's shell code with a deployed flow + real process worker + cancel via cli —sleeporphaned after the flow transitions toCancelledTestShellOperationProcessGroupcover sync/async spawn (own session leader) and sync/async context-manager exit (descendants reclaimed)windows:
windows-latestvia a temporary CI workflow on this branch; all 31 tests that exercise modified code paths pass. one unrelated test (test_shell_run_command_error_windows) fails on a stale log-count assertion that has nothing to do withShellOperation— same failure reproduces with this PR's code reverted. note: no main-branch CI currently runs this test, so it has silently drifted.trigger()+close()leak using standardsubprocessapi.