System Info
- transformers: 5.5.3
- huggingface_hub: 1.12.0
- Python: 3.13
- OS: Linux
Who can help?
No response
Information
Tasks
Reproduction
Source files:
Save two models with different source, then load each one:
$ python main.py save --path=pretrained_a/subdir --magic="Magic A"
$ python main.py save --path=pretrained_b/subdir --magic="Magic B"
$ python main.py load --path=pretrained_a/subdir
Load "pretrained_a/subdir"
Model says: Magic A
Source path: HF_MODULES_CACHE/transformers_modules/subdir/custom_model.py
Source says: Magic A
$ python main.py load --path=pretrained_b/subdir
Load "pretrained_b/subdir"
Model says: Magic B
Source path: HF_MODULES_CACHE/transformers_modules/subdir/custom_model.py
Source says: Magic B
Both models end up cached at the same path in HF_MODULES_CACHE, even though their source differs.
Expected behavior
Two models with different source on disk should get separate cache entries, or not be cached at all.
Actual behavior
The cache subdirectory is named after the basename of the local path (subdir), so the two models share a cache location and overwrite each other. The sequential case above happens to produce correct output only because each load rewrites the cached file before importing it.
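The collision can be illustrated with a minimal sketch (hypothetical, not the actual transformers code): keying the cache entry by the basename of the local path collapses distinct model directories onto one cache location.

```python
import os

def cache_subdir(local_path):
    # Hypothetical illustration of the buggy keying: only the basename of
    # the local path distinguishes cache entries.
    return os.path.join("transformers_modules", os.path.basename(local_path))

a = cache_subdir("pretrained_a/subdir")
b = cache_subdir("pretrained_b/subdir")
# Both resolve to "transformers_modules/subdir", so the second load
# overwrites the first model's cached source.
assert a == b
```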
Consequences
- Breaks in parallel environments such as Slurm clusters, where multiple jobs try to use the same cache directory.
- Parallel loads race on the shared file. Two processes loading these models at the same time will write to the same path with no coordination, and the imported module can end up with arbitrary contents. "Don't load in parallel" is not a workable answer: HF_MODULES_CACHE is a shared directory used by other transformers code, and there are legitimate cases where multiple processes need to load different trust_remote_code models concurrently.
- The cache grows without need. The source already exists on local disk; it could be loaded directly.
Suggested fix
Key the local-path cache subdirectory by a content hash of the source file(s), computed at the point the bytes are being read.
- Different source produces different cache dirs, so parallel loads of distinct models do not collide.
- Identical source is populated once, regardless of how many local paths reference it.
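A minimal sketch of the suggested fix (hypothetical naming, not a patch against the actual transformers internals): derive the cache subdirectory from a digest of the source bytes at the point they are read.

```python
import hashlib

def cache_subdir_for(source_bytes):
    # Hypothetical sketch: key the cache entry by a content hash of the
    # source file, so distinct source never collides and identical source
    # deduplicates naturally.
    digest = hashlib.sha256(source_bytes).hexdigest()[:16]
    return f"transformers_modules/{digest}"

a = cache_subdir_for(b'MAGIC = "Magic A"\n')
b = cache_subdir_for(b'MAGIC = "Magic B"\n')
same = cache_subdir_for(b'MAGIC = "Magic A"\n')
assert a != b     # different source -> different cache dirs
assert a == same  # identical source -> one shared entry
```

For multiple source files, the digest would need to cover all of them (e.g. hashing file names and contents in a deterministic order) so that a change to any one file produces a new cache directory.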