trust_remote_code cache path collides for local models sharing a leaf directory name #45632

@nurpax

Description

System Info

  • transformers: 5.5.3
  • huggingface_hub: 1.12.0
  • Python: 3.13
  • OS: Linux

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Source files:

Save two models with different source, then load each one:

$ python main.py save --path=pretrained_a/subdir --magic="Magic A"
$ python main.py save --path=pretrained_b/subdir --magic="Magic B"

$ python main.py load --path=pretrained_a/subdir
Load "pretrained_a/subdir"
Model says:  Magic A
Source path: HF_MODULES_CACHE/transformers_modules/subdir/custom_model.py
Source says: Magic A

$ python main.py load --path=pretrained_b/subdir
Load "pretrained_b/subdir"
Model says:  Magic B
Source path: HF_MODULES_CACHE/transformers_modules/subdir/custom_model.py
Source says: Magic B

Both models end up cached at the same path in HF_MODULES_CACHE, even though their source differs.

Expected behavior

Two models with different source on disk should get separate cache entries, or not be cached at all.

Actual behavior

The cache subdirectory is named after the basename of the local path (subdir), so the two models share a cache location and overwrite each other. The sequential case above happens to produce correct output only because each load rewrites the cached file before importing it.
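The collision can be illustrated with the standard library alone. This is a hypothetical sketch (the `cache_subdir` helper is not a transformers function) of the naming behavior the repro above observes: only the leaf name of the local path is used as the key.

```python
from pathlib import Path

def cache_subdir(local_path: str) -> str:
    # Hypothetical illustration: key the cache entry by the basename of the
    # local model path, as the observed cache layout suggests.
    return Path(local_path).name

a = cache_subdir("pretrained_a/subdir")
b = cache_subdir("pretrained_b/subdir")
# Both resolve to "subdir", so the two models share one cache location
# and each load overwrites whatever the previous load put there.
print(a, b, a == b)
```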

Consequences

This breaks in parallel environments such as Slurm clusters, where multiple jobs share the same cache directory.

  1. Parallel loads race on the shared file. Two processes loading these models at the same time will write to the same path with no coordination, and the imported module can end up with arbitrary contents. "Don't load in parallel" is not a workable answer: HF_MODULES_CACHE is a shared directory used by other transformers code, and there are legitimate cases where multiple processes need to load different trust_remote_code models concurrently.

  2. The cache grows needlessly. The source already exists on local disk and could be imported directly.

Suggested fix

Key the cache subdirectory for local paths by a content hash of the source file(s), computed at the point the bytes are read from disk.

  • Different source produces different cache dirs, so parallel loads of distinct models do not collide.
  • Identical source is populated once, regardless of how many local paths reference it.
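A minimal sketch of this keying scheme, assuming the set of module source files is known at load time (the `content_keyed_subdir` helper is hypothetical, not existing transformers API):

```python
import hashlib
from pathlib import Path

def content_keyed_subdir(source_files: list[Path]) -> str:
    """Derive a cache subdirectory name from a hash of the module source.

    Hypothetical sketch of the suggested fix: the key depends only on file
    contents, so distinct sources get distinct cache directories while
    identical sources share one, regardless of the local path they came from.
    """
    h = hashlib.sha256()
    for f in sorted(source_files):
        h.update(f.read_bytes())  # hash bytes as they are read from disk
    return h.hexdigest()[:16]  # truncated digest keeps directory names short
```

Because the digest is deterministic, concurrent processes loading the same source race only toward writing identical bytes, and an atomic rename into the hashed directory would make the cache population idempotent.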
