Fix race condition in GitRepository.pull_code() with file-based locking#21388
Fix race condition in GitRepository.pull_code() with file-based locking#21388devin-ai-integration[bot] wants to merge 13 commits intomainfrom
GitRepository.pull_code() with file-based locking#21388Conversation
Add file-based locking around GitRepository.pull_code() to prevent race conditions when multiple concurrent flow runs use the same git repository. Uses asyncio.Lock for in-process async coordination between concurrent tasks and FileLock for cross-process coordination. The lock file is created adjacent to the destination directory (e.g., dest.lock). Closes #11187 Co-authored-by: alex.s <alex.s@prefect.io> Co-Authored-By: alex.s <ajstreed1@gmail.com>
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
filelock is a transitive dependency not available in the prefect-client package. Fall back to asyncio.Lock only when filelock is not installed. Co-Authored-By: alex.s <ajstreed1@gmail.com>
- Create prefect/locking/_filelock.py with cross-platform file lock using
OS-level locking (fcntl.flock on Unix, msvcrt.locking on Windows)
- Use async-aware aacquire() method that polls with asyncio.sleep() to
avoid blocking the event loop during cross-process lock contention
- Fix lock path derivation: use parent/(name + '.lock') instead of
with_suffix('.lock') which incorrectly replaces existing suffixes
- Remove filelock transitive dependency usage entirely
- Update tests to work with new internal FileLock
Co-authored-by: Alexander Streed <desertaxle@users.noreply.github.com>
Co-Authored-By: alex.s <ajstreed1@gmail.com>
- Wrap acquire/aacquire polling loops in try/except BaseException to close the fd on CancelledError or any other unexpected exception - Wrap _unlock_fd in release() with try/finally to ensure os.close() runs even if unlock raises Co-authored-by: Alexander Streed <desertaxle@users.noreply.github.com> Co-Authored-By: alex.s <ajstreed1@gmail.com>
Co-Authored-By: alex.s <ajstreed1@gmail.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 13e3555258
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
- Handle ImportError for fcntl/msvcrt so _filelock.py loads on any OS - FileLock.acquire/aacquire silently no-op when locking is unavailable - pull_code() catches lock acquisition failures at runtime and falls back to asyncio.Lock only, logging a debug message Co-authored-by: Alexander Streed <desertaxle@users.noreply.github.com> Co-Authored-By: alex.s <ajstreed1@gmail.com>
Rewrites _filelock.py to use Path.touch(exist_ok=False) / Path.unlink() instead of fcntl/msvcrt, following the same pattern as FileSystemLockManager. No OS-specific imports needed — works on any platform that supports basic filesystem operations. Co-authored-by: Alexander Streed <desertaxle@users.noreply.github.com> Co-Authored-By: alex.s <ajstreed1@gmail.com>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5a86567757
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Write the owning process's PID to the lock file on acquisition. When a lock file already exists, read the PID and check if the process is still alive via os.kill(pid, 0). If the process is dead, remove the stale lock and retry immediately — no need to wait for timeout. Co-authored-by: Alexander Streed <desertaxle@users.noreply.github.com> Co-Authored-By: alex.s <ajstreed1@gmail.com>
P1: Replace os.kill(pid, 0) with ctypes.windll.kernel32.OpenProcess on Windows — os.kill terminates the target process on Windows instead of probing liveness. P2: Treat empty or malformed lock files as stale and remove them immediately. Previously, a process crashing between os.open(O_EXCL) and os.write() left an empty lock file that could never be recovered, causing repeated 300s timeout failures. Add 24 unit tests for the _filelock module covering helpers, sync/async acquire/release, timeout, contention, context manager, and stale lock recovery (dead PID, empty file, malformed content). Co-authored-by: Alexander Streed <desertaxle@users.noreply.github.com> Co-Authored-By: alex.s <ajstreed1@gmail.com>
P1: Replace two-step os.open(O_EXCL) + os.write() with atomic temp-file-then-hard-link pattern. The PID is written to a temp file first and then hard-linked to the lock path, so the lock file is never visible in an empty state. This eliminates the race where a concurrent process could read an empty lock file and falsely treat it as stale. P2: Remove the fixed 300s timeout from pull_code(). FileLock now waits indefinitely (timeout=-1) so that legitimate long-running clones on large repos or slow networks are never failed by a timeout. Stale lock recovery via PID check handles crashed processes. Co-authored-by: Alexander Streed <desertaxle@users.noreply.github.com> Co-Authored-By: alex.s <ajstreed1@gmail.com>
- Remove in-process asyncio.Lock/threading.Lock; rely solely on FileLock which polls with asyncio.sleep and is safe across event loops - Replace unsafe check+rename fallback with os.open(O_CREAT|O_EXCL) for atomic lock creation on filesystems without hard-link support - Create parent directory before mkstemp so fresh base paths work - Add test for parent directory creation Co-authored-by: Alexander Streed <desertaxle@users.noreply.github.com> Co-Authored-By: alex.s <ajstreed1@gmail.com>
…ormed locks as stale The O_CREAT|O_EXCL fallback leaves the lock file briefly empty before the PID is written. _remove_stale_lock() was deleting these empty files, allowing a second writer to acquire the lock while the first still held it. Now empty and malformed lock files are left alone -- the poll loop retries until the PID is written or the writer crashes and is detected via PID liveness check. Co-authored-by: Alexander Streed <desertaxle@users.noreply.github.com> Co-Authored-By: alex.s <ajstreed1@gmail.com>
Empty or malformed lock files (from a writer that crashed between os.open(O_CREAT|O_EXCL) and os.write()) are now recovered by checking the file's modification time. If the mtime is older than 5 seconds the writer must have crashed, so the file is removed as stale. If the mtime is recent the writer may still be running, so the file is left alone and the poll loop retries. This fixes the permanent unrecoverable lock issue where pull_code() would hang indefinitely on the O_CREAT|O_EXCL fallback path after a crash, without reintroducing the double-entry race (recent empty files are still left alone). Co-authored-by: Alexander Streed <desertaxle@users.noreply.github.com> Co-Authored-By: alex.s <ajstreed1@gmail.com>
Fixes a race condition in
GitRepository.pull_code()where multiple concurrent flow runs sharing the same clone destination directory can race on thegit_dir.exists()check-then-act logic, causingFileNotFoundErroror corrupt clones. Theshutil.rmtree()calls on failure paths can also delete a directory out from under a concurrent run.Closes #11187
Changes
Adds file-based locking around the entire
pull_code()method using an internalFileLock(prefect/locking/_filelock.py). A.lockfile adjacent to the destination directory (e.g.,repo.lock) coordinates both concurrent async tasks within a process and across separate processes. No in-processasyncio.Lockorthreading.Lockis used —FileLock.aacquire()polls withasyncio.sleep, which is non-blocking and safe across different event loops (e.g. callers viarun_coro_as_sync).Atomic lock creation:
_write_lock()writes the PID to a temporary file first, then usesos.link()(hard link) to atomically place it at the lock path. This guarantees the lock file is never visible in an empty or partial state — eliminating the race where a concurrent_remove_stale_lockcould read an empty file and falsely delete a live lock. On filesystems that don't support hard links, it falls back toos.open(O_CREAT | O_EXCL)which is atomic on all platforms.Parent directory creation:
_write_lock()callspath.parent.mkdir(parents=True, exist_ok=True)before creating temp/lock files, so fresh base paths that haven't beengit cloned yet don't fail withFileNotFoundError.Stale lock recovery: The owning process's PID is written to the lock file on acquisition. When contention is detected (
FileExistsError), the lock file's PID is read and checked for liveness. If the owning process is dead, the stale lock is removed and acquisition retries immediately. For empty or malformed lock files (e.g. from a writer that crashed betweenos.open(O_CREAT|O_EXCL)andos.write()in the fallback path), the file's modification time is checked: if the mtime is older than_STALE_EMPTY_THRESHOLD(5 seconds), the writer must have crashed and the file is removed as stale; if the mtime is recent, the writer may still be in the brief create-then-write window, so the file is left alone and the poll loop retries.Indefinite wait by default:
FileLockdefaults totimeout=-1(wait forever). Since stale lock recovery handles crashed processes (via PID liveness checks and mtime-based detection for empty/malformed files), a fixed timeout would only cause legitimate long-running pulls (large repos, slow networks, submodule-heavy checkouts) to fail unnecessarily.Windows-safe PID checks: On Unix,
os.kill(pid, 0)is used as a harmless signal-0 probe. On Windows,os.killterminates the target process, soctypes.windll.kernel32.OpenProcessis used instead.The existing
pull_codebody is extracted into_pull_code_locked()with no logic changes.Tests
tests/locking/test_filelock.py(30 tests) — covers_is_pid_alive,_remove_stale_lock(including mtime-based recovery of old empty/malformed files and preservation of recent ones),_write_lock(including temp file cleanup and parent directory creation), syncFileLock(acquire, release, context manager, timeout, thread contention, stale lock recovery for dead PID / old empty / old malformed / recent empty blocking), asyncFileLock(aacquire, timeout, async contention, stale lock recovery for dead PID / old empty / old malformed)tests/runner/test_storage.py(TestGitRepositoryConcurrency) — tests concurrentpull_code()calls don't race, lock file placement, and failed clone cleanup under lockItems for reviewer attention
_STALE_EMPTY_THRESHOLD). This is generous for theos.open→os.writewindow (typically microseconds), but if a system were so heavily loaded that a process stalled for 5+ seconds between those two syscalls, a concurrent acquirer could incorrectly remove the live lock file and cause double-entry. In practice this is extremely unlikely, and it eliminates the previous permanent-block failure mode where a crash in theO_CREAT|O_EXCLfallback path would hangpull_code()indefinitely._remove_if_old: There is a small window between reading mtime and callingunlinkwhere another process could claim the lock. In practice this requires the file to be >5s old AND another process to claim it in the same instant, making it negligible.except OSErroronos.link: The fallback fromos.linktoO_CREAT | O_EXCLtriggers on anyOSError, not just "hard links not supported." Ifos.linkfails for an unexpected reason (e.g. permissions), theO_CREAT | O_EXCLpath runs instead. This is conservative but could mask unexpected filesystem errors.asyncio.Lockapproach deadlocked when the same repo was pulled from different loops viarun_coro_as_sync.OpenProcesspath is not tested in CI (CI runs on Linux). Thectypes.windll.kernel32.OpenProcesscode path relies on correct constant values and API behavior.Checklist
<link to issue>"mint.json.Link to Devin session: https://app.devin.ai/sessions/1e3a05b708534a449af4acc4a2d76cc1
Requested by: @desertaxle