Skip to content

Fix race condition in GitRepository.pull_code() with file-based locking#21388

Open
devin-ai-integration[bot] wants to merge 13 commits intomainfrom
devin/1775076072-fix-pull-code-race-condition
Open

Fix race condition in GitRepository.pull_code() with file-based locking#21388
devin-ai-integration[bot] wants to merge 13 commits intomainfrom
devin/1775076072-fix-pull-code-race-condition

Conversation

@devin-ai-integration
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration bot commented Apr 1, 2026

Fixes a race condition in GitRepository.pull_code() where multiple concurrent flow runs sharing the same clone destination directory can race on the git_dir.exists() check-then-act logic, causing FileNotFoundError or corrupt clones. The shutil.rmtree() calls on failure paths can also delete a directory out from under a concurrent run.

Closes #11187

Changes

Adds file-based locking around the entire pull_code() method using an internal FileLock (prefect/locking/_filelock.py). A .lock file adjacent to the destination directory (e.g., repo.lock) coordinates both concurrent async tasks within a process and across separate processes. No in-process asyncio.Lock or threading.Lock is used — FileLock.aacquire() polls with asyncio.sleep, which is non-blocking and safe across different event loops (e.g. callers via run_coro_as_sync).

Atomic lock creation: _write_lock() writes the PID to a temporary file first, then uses os.link() (hard link) to atomically place it at the lock path. This guarantees the lock file is never visible in an empty or partial state — eliminating the race where a concurrent _remove_stale_lock could read an empty file and falsely delete a live lock. On filesystems that don't support hard links, it falls back to os.open(O_CREAT | O_EXCL) which is atomic on all platforms.

Parent directory creation: _write_lock() calls path.parent.mkdir(parents=True, exist_ok=True) before creating temp/lock files, so fresh base paths that haven't been git cloned yet don't fail with FileNotFoundError.

Stale lock recovery: The owning process's PID is written to the lock file on acquisition. When contention is detected (FileExistsError), the lock file's PID is read and checked for liveness. If the owning process is dead, the stale lock is removed and acquisition retries immediately. For empty or malformed lock files (e.g. from a writer that crashed between os.open(O_CREAT|O_EXCL) and os.write() in the fallback path), the file's modification time is checked: if the mtime is older than _STALE_EMPTY_THRESHOLD (5 seconds), the writer must have crashed and the file is removed as stale; if the mtime is recent, the writer may still be in the brief create-then-write window, so the file is left alone and the poll loop retries.

Indefinite wait by default: FileLock defaults to timeout=-1 (wait forever). Since stale lock recovery handles crashed processes (via PID liveness checks and mtime-based detection for empty/malformed files), a fixed timeout would only cause legitimate long-running pulls (large repos, slow networks, submodule-heavy checkouts) to fail unnecessarily.

Windows-safe PID checks: On Unix, os.kill(pid, 0) is used as a harmless signal-0 probe. On Windows, os.kill terminates the target process, so ctypes.windll.kernel32.OpenProcess is used instead.

The existing pull_code body is extracted into _pull_code_locked() with no logic changes.

Tests

  • tests/locking/test_filelock.py (30 tests) — covers _is_pid_alive, _remove_stale_lock (including mtime-based recovery of old empty/malformed files and preservation of recent ones), _write_lock (including temp file cleanup and parent directory creation), sync FileLock (acquire, release, context manager, timeout, thread contention, stale lock recovery for dead PID / old empty / old malformed / recent empty blocking), async FileLock (aacquire, timeout, async contention, stale lock recovery for dead PID / old empty / old malformed)
  • tests/runner/test_storage.py (TestGitRepositoryConcurrency) — tests concurrent pull_code() calls don't race, lock file placement, and failed clone cleanup under lock

Items for reviewer attention

  • Mtime-based crash detection threshold: Empty/malformed lock files are treated as stale after 5 seconds (_STALE_EMPTY_THRESHOLD). This is generous for the os.openos.write window (typically microseconds), but if a system were so heavily loaded that a process stalled for 5+ seconds between those two syscalls, a concurrent acquirer could incorrectly remove the live lock file and cause double-entry. In practice this is extremely unlikely, and it eliminates the previous permanent-block failure mode where a crash in the O_CREAT|O_EXCL fallback path would hang pull_code() indefinitely.
  • TOCTOU in _remove_if_old: There is a small window between reading mtime and calling unlink where another process could claim the lock. In practice this requires the file to be >5s old AND another process to claim it in the same instant, making it negligible.
  • Broad except OSError on os.link: The fallback from os.link to O_CREAT | O_EXCL triggers on any OSError, not just "hard links not supported." If os.link fails for an unexpected reason (e.g. permissions), the O_CREAT | O_EXCL path runs instead. This is conservative but could mask unexpected filesystem errors.
  • No in-process fast path: All contention (including same-process async tasks) goes through filesystem polling. This trades a small amount of efficiency for cross-event-loop safety — the previous asyncio.Lock approach deadlocked when the same repo was pulled from different loops via run_coro_as_sync.
  • PID reuse edge case: If the OS reuses a crashed process's PID before another process checks staleness, the stale lock won't be cleaned up via PID check. With the indefinite timeout, this would block forever. In practice PID reuse is rare enough to be acceptable.
  • Windows OpenProcess path is not tested in CI (CI runs on Linux). The ctypes.windll.kernel32.OpenProcess code path relies on correct constant values and API behavior.
  • Cross-process locking is only tested via mocks — the concurrency tests run within a single process using asyncio. Actual multi-process file contention is not exercised by the test suite.

Checklist

  • This pull request references any related issue by including "closes <link to issue>"
  • If this pull request adds new functionality, it includes unit tests that cover the changes
  • If this pull request removes docs files, it includes redirect settings in mint.json.
  • If this pull request adds functions or classes, it includes helpful docstrings.

Link to Devin session: https://app.devin.ai/sessions/1e3a05b708534a449af4acc4a2d76cc1
Requested by: @desertaxle

Add file-based locking around GitRepository.pull_code() to prevent race
conditions when multiple concurrent flow runs use the same git repository.

Uses asyncio.Lock for in-process async coordination between concurrent
tasks and FileLock for cross-process coordination. The lock file is
created adjacent to the destination directory (e.g., dest.lock).

Closes #11187

Co-authored-by: alex.s <alex.s@prefect.io>
Co-Authored-By: alex.s <ajstreed1@gmail.com>
@devin-ai-integration
Copy link
Copy Markdown
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@github-actions github-actions bot added arch:windows Related to the Windows OS bug Something isn't working labels Apr 1, 2026
filelock is a transitive dependency not available in the prefect-client
package. Fall back to asyncio.Lock only when filelock is not installed.

Co-Authored-By: alex.s <ajstreed1@gmail.com>
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq bot commented Apr 1, 2026

Merging this PR will not alter performance

✅ 2 untouched benchmarks


Comparing devin/1775076072-fix-pull-code-race-condition (4e4f323) with main (17308b1)1

Open in CodSpeed

Footnotes

  1. No successful run was found on main (a738e75) during the generation of this report, so 17308b1 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

devin-ai-integration[bot]

This comment was marked as resolved.

- Create prefect/locking/_filelock.py with cross-platform file lock using
  OS-level locking (fcntl.flock on Unix, msvcrt.locking on Windows)
- Use async-aware aacquire() method that polls with asyncio.sleep() to
  avoid blocking the event loop during cross-process lock contention
- Fix lock path derivation: use parent/(name + '.lock') instead of
  with_suffix('.lock') which incorrectly replaces existing suffixes
- Remove filelock transitive dependency usage entirely
- Update tests to work with new internal FileLock

Co-authored-by: Alexander Streed <desertaxle@users.noreply.github.com>
Co-Authored-By: alex.s <ajstreed1@gmail.com>
Copy link
Copy Markdown
Contributor Author

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 2 new potential issues.

View 7 additional findings in Devin Review.

Open in Devin Review

Comment thread src/prefect/locking/_filelock.py Outdated
Comment thread src/prefect/locking/_filelock.py
devin-ai-integration bot and others added 2 commits April 2, 2026 15:16
- Wrap acquire/aacquire polling loops in try/except BaseException to
  close the fd on CancelledError or any other unexpected exception
- Wrap _unlock_fd in release() with try/finally to ensure os.close()
  runs even if unlock raises

Co-authored-by: Alexander Streed <desertaxle@users.noreply.github.com>
Co-Authored-By: alex.s <ajstreed1@gmail.com>
@desertaxle desertaxle marked this pull request as ready for review April 2, 2026 19:00
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 13e3555258

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/prefect/runner/storage.py Outdated
devin-ai-integration bot and others added 2 commits April 2, 2026 19:40
- Handle ImportError for fcntl/msvcrt so _filelock.py loads on any OS
- FileLock.acquire/aacquire silently no-op when locking is unavailable
- pull_code() catches lock acquisition failures at runtime and falls
  back to asyncio.Lock only, logging a debug message

Co-authored-by: Alexander Streed <desertaxle@users.noreply.github.com>
Co-Authored-By: alex.s <ajstreed1@gmail.com>
Rewrites _filelock.py to use Path.touch(exist_ok=False) / Path.unlink()
instead of fcntl/msvcrt, following the same pattern as
FileSystemLockManager. No OS-specific imports needed — works on any
platform that supports basic filesystem operations.

Co-authored-by: Alexander Streed <desertaxle@users.noreply.github.com>
Co-Authored-By: alex.s <ajstreed1@gmail.com>
@desertaxle
Copy link
Copy Markdown
Member

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5a86567757

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/prefect/locking/_filelock.py Outdated
Comment thread src/prefect/runner/storage.py Outdated
Write the owning process's PID to the lock file on acquisition. When a
lock file already exists, read the PID and check if the process is still
alive via os.kill(pid, 0). If the process is dead, remove the stale lock
and retry immediately — no need to wait for timeout.

Co-authored-by: Alexander Streed <desertaxle@users.noreply.github.com>
Co-Authored-By: alex.s <ajstreed1@gmail.com>
Comment thread src/prefect/locking/_filelock.py
devin-ai-integration bot and others added 5 commits April 7, 2026 16:47
P1: Replace os.kill(pid, 0) with ctypes.windll.kernel32.OpenProcess on
Windows — os.kill terminates the target process on Windows instead of
probing liveness.

P2: Treat empty or malformed lock files as stale and remove them
immediately. Previously, a process crashing between os.open(O_EXCL) and
os.write() left an empty lock file that could never be recovered, causing
repeated 300s timeout failures.

Add 24 unit tests for the _filelock module covering helpers, sync/async
acquire/release, timeout, contention, context manager, and stale lock
recovery (dead PID, empty file, malformed content).

Co-authored-by: Alexander Streed <desertaxle@users.noreply.github.com>
Co-Authored-By: alex.s <ajstreed1@gmail.com>
P1: Replace two-step os.open(O_EXCL) + os.write() with atomic
temp-file-then-hard-link pattern. The PID is written to a temp file
first and then hard-linked to the lock path, so the lock file is never
visible in an empty state. This eliminates the race where a concurrent
process could read an empty lock file and falsely treat it as stale.

P2: Remove the fixed 300s timeout from pull_code(). FileLock now waits
indefinitely (timeout=-1) so that legitimate long-running clones on
large repos or slow networks are never failed by a timeout. Stale lock
recovery via PID check handles crashed processes.

Co-authored-by: Alexander Streed <desertaxle@users.noreply.github.com>
Co-Authored-By: alex.s <ajstreed1@gmail.com>
- Remove in-process asyncio.Lock/threading.Lock; rely solely on FileLock
  which polls with asyncio.sleep and is safe across event loops
- Replace unsafe check+rename fallback with os.open(O_CREAT|O_EXCL) for
  atomic lock creation on filesystems without hard-link support
- Create parent directory before mkstemp so fresh base paths work
- Add test for parent directory creation

Co-authored-by: Alexander Streed <desertaxle@users.noreply.github.com>
Co-Authored-By: alex.s <ajstreed1@gmail.com>
…ormed locks as stale

The O_CREAT|O_EXCL fallback leaves the lock file briefly empty before
the PID is written. _remove_stale_lock() was deleting these empty files,
allowing a second writer to acquire the lock while the first still held
it. Now empty and malformed lock files are left alone -- the poll loop
retries until the PID is written or the writer crashes and is detected
via PID liveness check.

Co-authored-by: Alexander Streed <desertaxle@users.noreply.github.com>
Co-Authored-By: alex.s <ajstreed1@gmail.com>
Empty or malformed lock files (from a writer that crashed between
os.open(O_CREAT|O_EXCL) and os.write()) are now recovered by checking
the file's modification time.  If the mtime is older than 5 seconds
the writer must have crashed, so the file is removed as stale.  If the
mtime is recent the writer may still be running, so the file is left
alone and the poll loop retries.

This fixes the permanent unrecoverable lock issue where pull_code()
would hang indefinitely on the O_CREAT|O_EXCL fallback path after a
crash, without reintroducing the double-entry race (recent empty files
are still left alone).

Co-authored-by: Alexander Streed <desertaxle@users.noreply.github.com>
Co-Authored-By: alex.s <ajstreed1@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arch:windows Related to the Windows OS bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

can’t run multiple flows stored remotely concurrently on windows

1 participant