feat(jobs): add volume mounting support for buckets and repos #3936
Conversation
Add a `volumes` parameter to `run_job`, `create_scheduled_job`, `run_uv_job`, and `create_scheduled_uv_job` to mount HuggingFace Buckets and Repos (models, datasets, spaces) as volumes in job containers.

- Add `JobVolume` dataclass and `JobVolumeType` enum
- Add `volumes` field to `JobInfo` and `JobSpec` responses
- Add `-v`/`--volume` CLI option with Docker-like syntax (e.g. `-v models/gpt2:/data` or `-v buckets/org/bucket:/mnt:ro`)
- Serialize volumes to camelCase for the Hub API
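As a minimal sketch of the pieces described above (the real dataclass in `_jobs_api.py` may have extra fields such as `revision` and `path`; the `to_api_payload` helper name is hypothetical, illustrating only the camelCase serialization mentioned in the description):

```python
from dataclasses import dataclass
from enum import Enum


class JobVolumeType(str, Enum):
    # Repo types plus buckets, per the PR description
    MODEL = "model"
    DATASET = "dataset"
    SPACE = "space"
    BUCKET = "bucket"


@dataclass
class JobVolume:
    type: str
    source: str
    mount_path: str
    read_only: bool = False

    def to_api_payload(self) -> dict:
        # Serialize snake_case fields to camelCase for the Hub API
        return {
            "type": self.type,
            "source": self.source,
            "mountPath": self.mount_path,
            "readOnly": self.read_only,
        }


vol = JobVolume(type=JobVolumeType.DATASET.value, source="username/my-dataset", mount_path="/data")
print(vol.to_api_payload())
```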
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
- Remove dead isinstance check in `_create_job_spec` serialization
- Add `volumes` field to `JobInfo` docstring
- Preserve original input in `_parse_volumes` error messages
- Restructure tests: parametrize, merge into existing classes, top-level imports
```python
# Parse type from source_part (first segment before /)
slash_idx = source_part.find("/")
if slash_idx == -1:
    # No slash: bare source like "gpt2:/data" -> model type
    vol_type_str = JobVolumeType.MODEL.value
    source = source_part
else:
    vol_type_str = source_part[:slash_idx]
    source = source_part[slash_idx + 1 :]
    # If the first segment isn't a known type, treat the whole thing as a model source
    # e.g. "org/my-model:/data" -> type=model, source="org/my-model"
    if vol_type_str not in _VOLUME_TYPES:
        vol_type_str = JobVolumeType.MODEL.value
        source = source_part

result.append(
    JobVolume(
        type=vol_type_str,
        source=source,
        mount_path=mount_path,
        read_only=read_only,
    )
)
```
With this change you can support revisions (including special refs) and paths in repos/buckets:
```diff
-    # Parse type from source_part (first segment before /)
-    slash_idx = source_part.find("/")
-    if slash_idx == -1:
-        # No slash: bare source like "gpt2:/data" -> model type
-        vol_type_str = JobVolumeType.MODEL.value
-        source = source_part
-    else:
-        vol_type_str = source_part[:slash_idx]
-        source = source_part[slash_idx + 1 :]
-        # If the first segment isn't a known type, treat the whole thing as a model source
-        # e.g. "org/my-model:/data" -> type=model, source="org/my-model"
-        if vol_type_str not in _VOLUME_TYPES:
-            vol_type_str = JobVolumeType.MODEL.value
-            source = source_part
-    result.append(
-        JobVolume(
-            type=vol_type_str,
-            source=source,
-            mount_path=mount_path,
-            read_only=read_only,
-        )
-    )
+    resolved_path = hffs.resolve_path(source_part)
+    if isinstance(resolved_path, HfFileSystemResolvedRepositoryPath):
+        result.append(
+            JobVolume(
+                type=resolved_path.repo_type,
+                source=resolved_path.repo_id,
+                mount_path=mount_path,
+                revision=resolved_path.revision,
+                read_only=read_only,
+                path=resolved_path.path_in_repo,
+            )
+        )
+    else:
+        result.append(
+            JobVolume(
+                type=JobVolumeType.BUCKET.value,
+                source=resolved_path.bucket_id,
+                mount_path=mount_path,
+                read_only=read_only,
+                path=resolved_path.path,
+            )
+        )
```
For example, here are supported paths:
```python
# buckets
"hf://buckets/username/bucket"
"hf://buckets/username/bucket/path"
# repos
"hf://gpt2"
"hf://user/model"
"hf://datasets/user/dataset"
"hf://user/model/path/in/repo"
"hf://user/model@revision"
"hf://user/model@refs/pr/1"
```

(it works with and without the `hf://` prefix)
You will need these imports:

```python
from huggingface_hub import hffs
from huggingface_hub.hf_file_system import HfFileSystemResolvedBucketPath, HfFileSystemResolvedRepositoryPath
```

It will also raise an error if the repo / bucket doesn't exist.
Love it! Quick question for the CLI: should we require the `hf://` prefix for the source path? To make sure it doesn't look like a local path (and in case we want to support local paths at some point).
Think this makes sense IMO. For Jobs I have quite a lot of use cases in mind where you do something like `hf jobs uv run whisper-transcribe.py some-local-dir/audiofiles.mp3`.
Summary
Add support for mounting HuggingFace Buckets and Repos (models, datasets, spaces) as volumes in Job containers.
Python API
CLI
```shell
hf jobs run -v datasets/username/my-dataset:/data -v buckets/username/my-bucket:/output python:3.12 python script.py
```

Changes

- `_jobs_api.py`: new `JobVolume` dataclass and `JobVolumeType` enum; `volumes` field added to `JobInfo`/`JobSpec`/`_create_job_spec`
- `hf_api.py`: `volumes` parameter added to `run_job`, `run_uv_job`, `create_scheduled_job`, `create_scheduled_uv_job`
- `cli/jobs.py`: `--volume`/`-v` CLI option with Docker-like syntax (`TYPE/SOURCE:/MOUNT_PATH[:ro]`)
- `__init__.py`: export `JobVolume`, `JobVolumeType`