Make asset downloading concurrent-safe #1989
shi-eric wants to merge 3 commits into newton-physics:main from
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough

Clones assets into per-process/thread temporary directories, atomically renames them into the shared cache, cleans up stale temp artifacts, adds concurrency-safe helpers, expands unit tests for rename/temp/cleanup/concurrency, and removes the test-suite parallel pre-download orchestration.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client as Client
    participant Download as download_git_folder()
    participant TempFS as TempDir
    participant Git as Git
    participant Cache as CacheFS
    participant Cleanup as Cleanup
    Client->>Download: request download_git_folder(url, cache_folder, force_refresh)
    Download->>TempFS: _temp_cache_path() -> create per-process/thread temp_dir
    TempFS-->>Download: temp_dir created
    Download->>Git: clone repository into temp_dir
    Git-->>Download: clone complete
    Download->>TempFS: write temp stamp (.newton_last_check) in temp_dir
    alt existing cache present & new version detected
        Download->>Cache: rename existing cache -> stale path (atomic)
        Cache-->>Download: stale cache staged
    end
    Download->>Cache: _safe_rename(temp_dir -> cache_folder) (atomic)
    Cache-->>Download: new cache in place
    Download->>Cleanup: _cleanup_stale_temp_dirs(cache_folder)
    Cleanup-->>Download: stale/temp artifacts removed
    Download-->>Client: return final cached path
```
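The clone-into-temp-then-rename flow described above can be sketched as a standalone pattern. This is a sketch only: the helper names mirror the diagram, `populate` stands in for the git clone step, and none of this is the PR's actual implementation.

```python
import os
import shutil
import threading
from pathlib import Path


def _temp_cache_path(cache_folder: Path) -> Path:
    # Unique per process and thread, so concurrent callers never collide.
    return Path(f"{cache_folder}_p{os.getpid()}_t{threading.get_ident()}")


def place_in_cache(cache_folder: Path, populate) -> Path:
    """Populate a temp dir, then atomically rename it into the shared cache."""
    temp_dir = _temp_cache_path(cache_folder)
    temp_dir.mkdir(parents=True, exist_ok=True)
    try:
        populate(temp_dir)  # e.g. git clone into temp_dir
        try:
            os.rename(temp_dir, cache_folder)  # atomic on the same filesystem
        except OSError:
            # Another caller already published the cache; ours is discarded.
            pass
        return cache_folder
    finally:
        if temp_dir.exists():
            shutil.rmtree(temp_dir, ignore_errors=True)
```

The key property: whichever caller wins the `os.rename` race publishes a fully populated directory, and every loser cleans up its own temp dir, so readers never observe a half-written cache.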
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Codecov Report: ❌ Patch coverage is
Actionable comments posted: 2
🧹 Nitpick comments (1)
newton/tests/test_download_assets.py (1)
121-132: The committed regression test still misses the multiprocessing case.

This only exercises thread contention. The behavior being fixed here depends on PID-scoped temp directories and cross-process rename handoff, so the original #1987 multiprocessing failure mode is still unguarded in CI. Please add a small multi-process smoke test as well.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@newton/tests/test_download_assets.py` around lines 121 - 132, The test only covers threads; add a multiprocessing smoke test to exercise cross-process temp dir and rename handoff: create a new test (e.g., test_concurrent_download_multiprocess) that spawns multiple processes which each call the same helper that invokes download_git_folder(self.remote_dir, self.asset_rel, cache_dir=self.cache_dir, branch="main"), then in each process assert the returned path exists and the file content equals "v1\n"; use multiprocessing.Process or concurrent.futures.ProcessPoolExecutor to run ~2–4 worker processes and join/collect results so CI exercises PID-scoped temp dirs and cross-process rename behavior for download_git_folder.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@newton/_src/utils/download_assets.py`:
- Around line 82-111: The cleanup currently deletes any matching temp dir older
than max_age in _cleanup_stale_temp_dirs, which can remove a still-active
download; instead parse the owner PID from the temp-dir suffix (the existing
_TEMP_DIR_RE contains the p{pid} token) and check that process liveness before
deleting: extract pid from the dir name, then test whether that pid is alive
(e.g. try os.kill(pid, 0) on POSIX, or use psutil.pid_exists(pid) if available /
fall back appropriately) and only call _safe_rmtree(entry) if the owner process
is not alive; keep the existing age check as an additional guard.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 3b097046-05e8-42a5-b8e9-48dd416f5bb8
📒 Files selected for processing (3)
- newton/_src/utils/download_assets.py
- newton/tests/test_download_assets.py
- newton/tests/thirdparty/unittest_parallel.py
💤 Files with no reviewable changes (1)
- newton/tests/thirdparty/unittest_parallel.py
Force-pushed 50f1f82 to f0e2bd9
🧹 Nitpick comments (2)
newton/tests/test_download_assets.py (1)
121-132: Thread-level concurrency test is good; consider documenting process-level coverage.

This test validates thread-safety using ThreadPoolExecutor, which is valuable. However, the PR's primary goal is multiprocess-safe downloads (as used by the parallel test runner). The temp path includes both PID and thread ID, so thread-level testing alone doesn't fully exercise the PID-based uniqueness. The PR description mentions a standalone stress test with 8 worker processes was run successfully. Consider either:

- Adding a brief docstring note that process-level concurrency was validated externally, or
- Adding a ProcessPoolExecutor variant (though this may be overkill for unit tests)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@newton/tests/test_download_assets.py` around lines 121 - 132, The thread-level test test_concurrent_download verifies thread-safety but doesn't exercise PID-based uniqueness for process-level concurrency; update the test by either (A) adding a one-line docstring/note to test_concurrent_download stating that process-level concurrency was validated externally (and reference download_git_folder and cache_dir/temp path PID+TID behavior), or (B) add an additional test function that uses concurrent.futures.ProcessPoolExecutor to spawn multiple processes calling download_git_folder(self.remote_dir, self.asset_rel, cache_dir=self.cache_dir, branch="main") and assert the downloaded file exists and contains "v1\n" to validate multiprocess safety; include the new test name (e.g., test_concurrent_download_processes) so reviewers can find it easily.

newton/_src/utils/download_assets.py (1)
333-344: Consider exception safety for stale_dir rename.

The stale-cache rename at line 337 catches FileNotFoundError, but if os.rename raises a different OSError (e.g., permission error), it will propagate up and skip the subsequent _safe_rename(temp_dir, cache_folder) call. The temp_dir cleanup in finally will handle the temp directory, but the stale_dir may be left in an inconsistent state. Consider also catching and logging (or ignoring) other OSError variants to ensure the workflow proceeds:

🛡️ Proposed robustness improvement

```diff
 if cache_folder.exists():
     try:
         os.rename(cache_folder, stale_dir)
-    except FileNotFoundError:
+    except OSError:
         # Another thread already moved/removed it
         pass
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@newton/_src/utils/download_assets.py` around lines 333 - 344, The stale-cache rename block should catch broader OSError variants so other OS errors (e.g., PermissionError) don't abort the sequence; update the try/except around os.rename(cache_folder, stale_dir) in this section to catch OSError (in addition to FileNotFoundError), log or warn the exception with its details, then allow execution to continue to _safe_rename(temp_dir, cache_folder) and the subsequent stale_dir cleanup via _safe_rmtree; keep the original FileNotFoundError handling semantics (i.e., treat as already moved) but ensure any caught OSError is surfaced in the logs for debugging while not preventing cleanup.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: f0bd22c6-76e1-430b-b01e-fb4ca96412fa
📒 Files selected for processing (3)
- newton/_src/utils/download_assets.py
- newton/tests/test_download_assets.py
- newton/tests/thirdparty/unittest_parallel.py
💤 Files with no reviewable changes (1)
- newton/tests/thirdparty/unittest_parallel.py
Force-pushed f0e2bd9 to 68231ac
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@newton/_src/utils/download_assets.py`:
- Around line 329-337: The current logic renames an existing cache_folder to
stale_dir before calling _safe_rename(temp_dir, cache_folder), which can move a
winner's freshly published cache; fix this by recording the identity/state of
cache_folder (e.g., os.stat() metadata such as st_ino/st_mtime or a hash of its
contents) into a variable before cloning, and before creating stale_dir only
proceed with os.rename(cache_folder, stale_dir) if the current cache_folder
still matches that recorded identity; if it does not match (meaning another
caller published a new cache), skip the stale swap and simply discard the
temp_dir (or let _safe_rename handle no-op), and add a deterministic regression
test that simulates the "winner already published" race to assert that the
published cache is not moved/removed (reference cache_folder, temp_dir,
stale_dir, and _safe_rename in your changes).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 9d4bdb29-2c6d-495a-beef-44bea4b83469
📒 Files selected for processing (3)
- newton/_src/utils/download_assets.py
- newton/tests/test_download_assets.py
- newton/tests/thirdparty/unittest_parallel.py
💤 Files with no reviewable changes (1)
- newton/tests/thirdparty/unittest_parallel.py
adenzler-nvidia left a comment
Review Summary
The core temp-dir + safe-rename pattern is well-designed and the removal of the fragile pre-download list is a clear improvement. A few concurrency issues worth addressing before merge.
Critical (2): TOCTOU race on the post-rename existence check; finally block can mask the original exception.
Important (3): Bare os.rename for stale-dir move misses race errors; orphaned stale_dir not cleaned on failure; missing test coverage for _safe_rename retry/exhaustion paths.
Nits (2): Lone # 3. numbering; force_refresh semantics subtly changed under concurrency.
newton/_src/utils/download_assets.py (outdated)

```python
        return cache_folder / folder_path

    # Should not happen, but handle gracefully
    raise RuntimeError(f"Failed to place cache folder at {cache_folder}")
```
Critical — TOCTOU race makes this path reachable under concurrency
This is not actually unreachable. Consider:
- Thread A completes _safe_rename(temp_dir, cache_folder) — succeeds.
- Thread B calls os.rename(cache_folder, stale_dir_B) — moves A's result away.
- Thread A checks if cache_folder.exists() — returns False.
- Thread A raises this RuntimeError even though the download succeeded.
This is a TOCTOU race in the exact concurrent scenario this PR targets.
Suggestion: return cache_folder / folder_path unconditionally after _safe_rename (the path will be valid by the time the caller actually uses it, since whichever thread won the race placed valid content there). Or re-check with a brief retry.
Fixed — removed the post-rename existence check entirely. After _safe_rename either succeeds or returns silently (another thread won), cache_folder / folder_path is valid regardless. The check + RuntimeError was dead weight that introduced a needless race.
newton/_src/utils/download_assets.py (outdated)

```python
    finally:
        # Always clean up temp dir
        if temp_dir.exists():
            _safe_rmtree(temp_dir)
```
Critical — finally cleanup can mask the original exception
If the try block raises (e.g. GitCommandError) and then _safe_rmtree(temp_dir) also raises (e.g. PermissionError on Windows where git processes hold locks), Python replaces the original exception with the cleanup exception. The root cause is completely lost.
```python
finally:
    try:
        if temp_dir.exists():
            _safe_rmtree(temp_dir)
    except OSError as cleanup_err:
        print(f"Warning: failed to clean up temp dir {temp_dir}: {cleanup_err}")
```
Fixed — the finally block now wraps each cleanup call in try/except OSError: pass so a cleanup failure can't mask the original exception.
newton/_src/utils/download_assets.py (outdated)

```python
        stale_dir = Path(f"{cache_folder}_stale_p{os.getpid()}_t{threading.get_ident()}")
        if cache_folder.exists():
            try:
                os.rename(cache_folder, stale_dir)
```
Important — bare os.rename misses race conditions that _safe_rename already handles
Only FileNotFoundError is caught here. If stale_dir already exists (e.g. same thread runs this path twice without cleanup completing), os.rename raises FileExistsError (Windows) or OSError(ENOTEMPTY) (Linux). This propagates as a confusing "Failed to download git folder: [Errno 39] Directory not empty" even though the download itself succeeded.
Consider using _safe_rename(cache_folder, stale_dir) here — it already handles these exact races.
Added a _safe_rmtree(stale_dir) pre-cleanup before the os.rename call to handle leftover stale dirs from a previous same-thread invocation.
Note: using _safe_rename here (as suggested) would be wrong — if stale_dir already exists, _safe_rename returns silently without moving cache_folder, which would skip the stale-dir eviction entirely. The bare os.rename is intentional since we want the move to actually happen; the pre-cleanup addresses the collision case.
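The eviction order described in this reply can be sketched as follows. This is a hypothetical standalone version (the real code uses a `_safe_rmtree` helper rather than calling `shutil.rmtree` directly):

```python
import os
import shutil
import threading
from pathlib import Path


def evict_to_stale(cache_folder: Path) -> Path:
    """Move the current cache aside so a fresh one can take its place."""
    stale = Path(f"{cache_folder}_stale_p{os.getpid()}_t{threading.get_ident()}")
    # Pre-clean: a leftover stale dir from a previous same-thread call would
    # make os.rename fail (FileExistsError / ENOTEMPTY) instead of moving.
    if stale.exists():
        shutil.rmtree(stale, ignore_errors=True)
    try:
        # Bare rename on purpose: the move must actually happen, so a
        # silently-no-op helper would be wrong here.
        os.rename(cache_folder, stale)
    except FileNotFoundError:
        pass  # another caller already moved/removed the cache
    return stale
```

The pre-clean is what distinguishes this from a rename helper that treats an existing destination as success: here an existing destination is stale garbage to discard, not a winner to defer to.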
newton/_src/utils/download_assets.py (outdated)

```python
        _safe_rename(temp_dir, cache_folder)
        # Clean up stale dir
        if stale_dir.exists():
            _safe_rmtree(stale_dir)
```
Important — orphaned stale_dir if _safe_rename(temp_dir, cache_folder) raises
If _safe_rename(temp_dir, cache_folder) on line 337 raises after exhausting retries, the exception propagates to the except handlers which don't reference stale_dir. The finally block only cleans temp_dir. The stale dir lingers until _cleanup_stale_temp_dirs runs (up to 1 hour).
Consider adding stale_dir cleanup to the finally block (define stale_dir before the try, initialize to None, and clean up if it exists).
Fixed — stale_dir is now initialized to None before the try block and cleaned up in the finally block alongside temp_dir.
newton/_src/utils/download_assets.py (outdated)

```diff
         _safe_rmtree(cache_folder)

-    # 3. Download if not cached (or if cache was just cleared)
+    # 3. Download into a process/thread-unique temp directory, then rename
```
Nit: the old code had # 1., # 2., # 3. comments providing a visual outline. The PR removes 1 and 2 but keeps # 3. here. A lone # 3. with no predecessors is confusing — drop the number.
Fixed — dropped the number.
```python
    def test_rename_destination_exists(self):
        """Rename is a no-op when destination already exists."""
        src = os.path.join(self.base, "src_dir")
```
Important — missing test coverage for _safe_rename retry and exhaustion paths
The retry loop (transient OSError succeeds on retry) and exhaustion path (raises after N attempts) are the mechanical heart of the concurrency safety but have no tests. Also missing: ENOTEMPTY path. These are cheap to add via unittest.mock.patch("os.rename", ...):
- Mock os.rename to raise a generic OSError on the first call and succeed on the second → verify the rename succeeds.
- Mock os.rename to raise OSError on all attempts → verify it raises after exhausting retries.
- Mock os.rename to raise OSError(errno.ENOTEMPTY, ...) → verify it returns silently (same as FileExistsError).
Added three tests:
- test_rename_retries_on_transient_error — mock os.rename to raise on first call, succeed on second
- test_rename_raises_after_exhausting_retries — mock to always raise, verify exception after all attempts
- test_rename_enotempty_returns_silently — mock ENOTEMPTY, verify silent return
Force-pushed 68231ac to ac3bf35
🧹 Nitpick comments (1)
newton/tests/test_download_assets.py (1)
123-134: Consider asserting no orphaned temp directories remain.

The test validates cache integrity (file content is correct) but doesn't verify that the concurrency mechanism properly cleans up all temporary artifacts. Leftover _p*_t* directories would indicate a cleanup bug.

♻️ Suggested enhancement

```diff
         with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
             futures = [executor.submit(download) for _ in range(4)]
             for f in concurrent.futures.as_completed(futures):
                 f.result()
+
+        # Verify no orphaned temp/stale directories remain
+        import re
+        temp_pattern = re.compile(r"_(?:stale_)?p\d+_t\d+$")
+        for entry in Path(self.cache_dir).iterdir():
+            self.assertIsNone(
+                temp_pattern.search(entry.name),
+                f"Orphaned temp/stale directory found: {entry.name}",
+            )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@newton/tests/test_download_assets.py` around lines 123 - 134, The test_concurrent_download currently checks cache integrity but not cleanup; after all futures complete, add an assertion that no orphaned temporary directories remain in the cache directory by scanning cache_dir (the same cache_dir passed to download_git_folder) for any entries matching the temp pattern (e.g., names like "_p*_t*" or whatever temp-prefix your implementation uses) and assert the glob/list is empty; implement this by using pathlib.Path(self.cache_dir).glob(...) or similar and failing the test if any matches are found to ensure download_git_folder cleans up temp artifacts under concurrent runs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 33e35848-d611-4620-bc48-c488741d7bb1
📒 Files selected for processing (3)
- newton/_src/utils/download_assets.py
- newton/tests/test_download_assets.py
- newton/tests/thirdparty/unittest_parallel.py
💤 Files with no reviewable changes (1)
- newton/tests/thirdparty/unittest_parallel.py
adenzler-nvidia left a comment
Nice work — the temp-dir + safe-rename pattern is sound and the test coverage on the helpers is thorough. Two issues in the error handling that I think should be addressed before merge:
- Successful download reported as failure — After _safe_rename succeeds, if _safe_rmtree(stale_dir) raises, the exception falls through to except Exception and wraps it in RuntimeError("Failed to download git folder: ..."). The cache is valid at that point, but the caller sees a failure.
- Silent cleanup failures can cause unbounded disk growth — The finally block uses bare except OSError: pass, and _cleanup_stale_temp_dirs has the same pattern on its outer parent.iterdir() call. If both cleanup paths fail (e.g., permissions misconfiguration on CI), orphaned temp/stale directories accumulate with zero diagnostic output. Replacing pass with warnings.warn(...) preserves the "don't mask the original exception" intent while leaving a debugging trail.
newton/_src/utils/download_assets.py (outdated)

```python
        _safe_rename(temp_dir, cache_folder)
        # Clean up stale dir
        if stale_dir.exists():
            _safe_rmtree(stale_dir)
```
If _safe_rmtree(stale_dir) raises here (permission error, NFS I/O error, etc.), the exception propagates to the except Exception handler below, which wraps it as RuntimeError("Failed to download git folder: ..."). But _safe_rename already succeeded — the cache is valid and the download worked.
Suggestion: wrap this in its own try/except so a cleanup failure can't cause a successful download to be reported as failed:
```python
try:
    if stale_dir.exists():
        _safe_rmtree(stale_dir)
except OSError as e:
    import warnings
    warnings.warn(f"Failed to remove stale directory {stale_dir}: {e}", stacklevel=2)
```
Fixed — wrapped the stale-dir cleanup in try/except so a cleanup failure after a successful rename doesn't propagate to the except handler and get misreported as a download failure. Orphaned stale dirs get cleaned up by _cleanup_stale_temp_dirs on the next call.
Also switched the stale-dir move itself to use _safe_rename(cache_folder, stale_dir) which retries on transient PermissionError (Windows file locks) instead of failing immediately.
```python
        try:
            if d is not None and d.exists():
                _safe_rmtree(d)
        except OSError:
```
Bare except OSError: pass here means that if cleanup fails (e.g., wrong permissions on the cache dir), orphaned temp/stale directories accumulate silently. Combined with the same pattern in _cleanup_stale_temp_dirs (the outer except OSError: pass on parent.iterdir()), both cleanup mechanisms can fail simultaneously with zero diagnostics.
Consider replacing pass with warnings.warn(...) — this still avoids masking the original exception but leaves a trail:
```python
except OSError as cleanup_err:
    import warnings
    warnings.warn(
        f"Failed to clean up temporary directory {d}: {cleanup_err}",
        stacklevel=2,
    )
```
Skipping this one. The silent catch here is intentional — this is the finally block whose sole purpose is to avoid masking the original exception (or losing a successful return). Orphaned dirs are already handled by _cleanup_stale_temp_dirs on the next call, so there's no diagnostic gap. Adding warnings would create noise for a transient condition that's already covered.
Each caller now clones into a PID/TID-unique temp directory, then atomically renames it to the final cache location. If another process already placed the cache, the rename is a no-op and the temp directory is discarded. Orphaned temp directories from crashed processes are cleaned up automatically after 1 hour. This eliminates the need for the manually maintained pre-download list in unittest_parallel.py, which was a fragile workaround for the lack of concurrency safety. Closes newton-physics#1987
- Remove TOCTOU race: return unconditionally after _safe_rename instead of re-checking cache_folder.exists()
- Wrap finally cleanup in try/except to avoid masking the original exception when _safe_rmtree fails
- Clean up stale_dir in finally block to prevent orphaned directories when _safe_rename exhausts retries
- Pre-clean stale_dir before os.rename to handle leftover from previous same-thread calls
- Drop lone "# 3." comment numbering
- Add tests for _safe_rename retry, exhaustion, and ENOTEMPTY paths
- Use _safe_rename for stale-dir move so transient PermissionError (Windows file locks) is retried instead of propagating as a download failure
- Wrap stale-dir cleanup in try/except so a cleanup failure after a successful rename doesn't get misreported as a download failure
Force-pushed ac3bf35 to c23c51f
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@newton/_src/utils/download_assets.py`:
- Around line 334-336: The current stale-swap has a TOCTOU race: between
checking cache_folder.exists() and calling _safe_rename(cache_folder, stale_dir)
another process may have moved/created the folder; wrap the stale rename and the
final temp->cache swap in try/except to handle missing/exists races atomically:
when calling _safe_rename(cache_folder, stale_dir) catch and ignore
FileNotFoundError (or OSError indicating the source is gone) because a
concurrent publisher likely succeeded, and when calling _safe_rename(temp_dir,
cache_folder) treat EEXIST/AlreadyExists as success (or re-check cache_folder
afterwards and skip replace if it now exists), keeping the use of cache_folder,
stale_dir, temp_dir and _safe_rename as the referenced symbols.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: df97548d-cc7a-444e-b37e-0e23625adeff
📒 Files selected for processing (1)
newton/_src/utils/download_assets.py
```python
        if cache_folder.exists():
            _safe_rename(cache_folder, stale_dir)
        _safe_rename(temp_dir, cache_folder)
```
Handle TOCTOU on stale-swap to avoid false failures under concurrent refresh.
Between Line 334 (cache_folder.exists()) and Line 335 (_safe_rename(cache_folder, stale_dir)), another caller can move/remove cache_folder. In that case, this caller can raise at Line 352-353 even though a concurrent caller successfully published a valid cache.
🔧 Proposed fix

```diff
 if cache_folder.exists():
-    _safe_rename(cache_folder, stale_dir)
+    try:
+        _safe_rename(cache_folder, stale_dir)
+    except FileNotFoundError:
+        # Lost race: another concurrent caller already moved/replaced it.
+        pass
 _safe_rename(temp_dir, cache_folder)
```

Also applies to: 352-353
Description
Make download_git_folder() safe under concurrent multi-process/thread access using a temp-directory + safe-rename pattern, inspired by Warp's kernel cache approach.

Each caller now clones into a PID/TID-unique temp directory, then atomically renames it to the final cache location. If another process already placed the cache, the rename is a no-op and the temp directory is discarded. Orphaned temp directories from crashed processes are cleaned up automatically after 1 hour.
This eliminates the need for the manually maintained pre-download list in unittest_parallel.py, which was a fragile workaround for the lack of concurrency safety (forgetting to update it caused flaky test failures).

Closes #1987
Changes
- newton/_src/utils/download_assets.py:
  - _safe_rename() — tolerates races where another process creates the destination first
  - _temp_cache_path() — returns a PID/TID-unique temp path
  - _cleanup_stale_temp_dirs() — removes orphaned temp dirs older than 1 hour
  - download_git_folder() — clone into temp dir, then safe-rename into final cache location
- newton/tests/test_download_assets.py — add tests for all new helpers + concurrent download test
- newton/tests/thirdparty/unittest_parallel.py — remove _parallel_download(), asset lists, menagerie folder lists, and related imports

Checklist
- CHANGELOG.md has been updated (if user-facing change)
Unit tests:
All 10 tests pass:
- TestSafeRename — rename success and destination-exists cases
- TestTempCachePath — PID/TID uniqueness across threads
- TestCleanupStaleTempDirs — removes old orphans, preserves recent, ignores unrelated
- TestDownloadAssets.test_download_and_refresh — basic download + stale refresh + force refresh
- TestDownloadAssets.test_concurrent_download — 4 threads racing to download the same asset

Additionally, a standalone stress test (not committed) was used to validate with real assets and spawned processes. It clears the cache, spawns 8 worker processes that all race to download anybotics_anymal_d, verifies no leftover temp directories, and loads the asset into a Newton model to verify integrity. Passed across multiple rounds:

Stress test script (not committed)
Stress test output (8 workers, 2 rounds)