The goal of this project is to download Hugging Face datasets and train on them locally using GPU hardware in Docker.
This project is not responsible for collecting the data from the hardware, but assumes the data has already been collected and uploaded to Hugging Face.
The host machine must have the NVIDIA drivers installed and the NVIDIA Container Toolkit configured so Docker can access the GPU. Without this, Docker only knows about the default runc runtime and will fail when your compose file specifies runtime: nvidia.
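As an illustration, the relevant compose keys might look like this (the service name `train` is an assumption; only `runtime: nvidia` is referenced above):

```yaml
services:
  train:
    # Requires the NVIDIA Container Toolkit to be registered with Docker
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
```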
```shell
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```

You should see your GPU listed. If you already have the toolkit installed but still get a runtime error, the `nvidia-ctk runtime configure` step may have been skipped, or Docker wasn't restarted after installation.
This project is intended to produce a container that can run on different types of machines (x86_64, aarch64) and requires configuration to select the right target before building. The user will need to select the architecture, the number and type of GPUs, and the dataset size.
The user must also define the policy type, user account, dataset path, policy name, training parameters, and other required inputs.
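For example, the target architecture can be selected at build time with `docker buildx`; the image tags below are illustrative assumptions, not names defined by this project:

```shell
# Build for DGX Spark (aarch64); tag name is an assumption
docker buildx build --platform linux/arm64 -t lerobot-train:aarch64 .

# Build for a standard x86_64 GPU server
docker buildx build --platform linux/amd64 -t lerobot-train:x86_64 .
```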
Use the local .env file to store all configuration items such as Hugging Face and GitHub accounts, API keys, database URLs, policy URLs, and training parameters.
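A minimal `.env` sketch might look like the following; aside from `VIDEO_BACKEND` and `NVIDIA_VISIBLE_DEVICES`, which appear later in this document, every variable name and value here is an illustrative assumption, not a key this project necessarily defines:

```shell
# Hypothetical .env sketch -- adjust names to match your compose file
HF_USER=useraccount
HF_TOKEN=hf_xxxx              # Hugging Face API token (placeholder)
DATASET_REPO_ID=useraccount/awesome_demo
POLICY_TYPE=act
TRAIN_STEPS=20000
VIDEO_BACKEND=pyav
NVIDIA_VISIBLE_DEVICES=0
```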
There are specific challenges with training on the DGX Spark due to the aarch64 architecture and the incompatibility of specific packages:
- PyTorch: The standard PyTorch pip indexes (`cu124`, etc.) do not publish aarch64 CUDA wheels. The DGX Spark's GB10 GPU (Blackwell, sm_121 compute capability) requires PyTorch 2.9.0+ from the `cu130` index (https://download.pytorch.org/whl/cu130). The Dockerfile handles this automatically based on the target architecture.
- torchcodec: Not available on aarch64. The system falls back to PyAV (`av`) for video decoding.
- flash-attn: No prebuilt aarch64 wheels. Skipped on ARM builds.
- decord: x86_64 only, not installed on ARM.
Note: lerobot pins torch<2.8.0, but DGX Spark requires torch>=2.9.0 for GPU support. The Dockerfile installs lerobot first, then overrides torch with the CUDA-enabled build. This version mismatch is a known trade-off for aarch64 GPU support.
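The install-then-override sequence could be sketched as follows (a simplification of what the Dockerfile does; exact version pins and flags may differ):

```shell
# 1. Install lerobot first (pulls in its pinned torch<2.8.0)
pip install lerobot

# 2. Override torch with the aarch64 CUDA build from the cu130 index
pip install --force-reinstall --index-url https://download.pytorch.org/whl/cu130 "torch>=2.9.0"
```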
On standard x86_64 NVIDIA GPU servers, PyTorch CUDA wheels install normally from the cu124 index (PyTorch 2.6.x).
- FFmpeg: Runtime libraries are installed for torchcodec support.
- torchcodec: Pre-built wheels have ABI compatibility issues with cu124 PyTorch builds. The system defaults to PyAV (`av`) for video decoding.
- flash-attn: Installed on x86_64 with a fallback if compilation fails.
Both architectures use PyAV (av) as the default video backend for reliable cross-platform video decoding. This can be overridden via the VIDEO_BACKEND environment variable in .env:
```shell
# Use pyav (default, recommended)
VIDEO_BACKEND=pyav

# Use torchcodec (x86_64 only, may have compatibility issues)
VIDEO_BACKEND=torchcodec
```

The Dockerfile sets the following environment variables by default to reduce CUDA OOM errors and CPU memory pressure during training:
| Variable | Default | Purpose |
|---|---|---|
| `PYTORCH_CUDA_ALLOC_CONF` | `expandable_segments:True` | Reduces CUDA allocator fragmentation; allows the allocator to grow segments dynamically instead of failing when free memory is fragmented |
| `OMP_NUM_THREADS` | `1` | Prevents CPU thread over-subscription from DataLoader workers each spawning a full OpenMP thread pool |
| `TOKENIZERS_PARALLELISM` | `false` | Suppresses Hugging Face tokenizer parallelism warnings and potential deadlocks inside forked DataLoader workers |
If you encounter `torch.OutOfMemoryError: CUDA out of memory`, try these steps in order:
- Pin training to one GPU: on a multi-GPU host, set `NVIDIA_VISIBLE_DEVICES=0` (or `1`) in `.env` to avoid cross-GPU memory contention.
- Tune the allocator: extend `PYTORCH_CUDA_ALLOC_CONF` with `max_split_size_mb` to cap block sizes, e.g. `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512`.
- Reduce batch size: lower `--policy.batch_size` in your training command.
- Check fragmentation: run `nvidia-smi` to compare memory reserved vs. memory used; a large gap indicates fragmentation that `expandable_segments` should mitigate.
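Combined in `.env`, the first two mitigations look like this:

```shell
# Pin to one GPU and cap allocator block sizes
NVIDIA_VISIBLE_DEVICES=0
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512
```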
Here is an example command to invoke `lerobot-train` inside the container on a machine with an NVIDIA GPU:

```shell
lerobot-train \
  --dataset.repo_id=useraccount/awesome_demo \
  --policy.type=act \
  --output_dir=outputs/train/awesome_demo_spark \
  --job_name=awesome_demo__spark \
  --policy.device=cuda \
  --wandb.enable=true \
  --policy.repo_id=useraccount/awesome_demo_spark \
  --steps=20000
```