
Docker Lerobot Train README

The goal of this project is to download Hugging Face datasets and train policies on them locally using GPU hardware in Docker.

This project is not responsible for collecting the data from the hardware, but assumes the data has already been collected and uploaded to Hugging Face.

Prerequisites: NVIDIA Container Toolkit

The host machine must have the NVIDIA drivers installed and the NVIDIA Container Toolkit configured so Docker can access the GPU. Without this, Docker only knows about the default runc runtime and will fail when your compose file specifies runtime: nvidia.

1. Install the NVIDIA Container Toolkit

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

2. Configure Docker to use the NVIDIA runtime

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

3. Verify it works

docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

You should see your GPU listed. If you already have the toolkit installed but still get a runtime error, the configure step (step 2) may have been skipped, or Docker wasn't restarted after installation.
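With the toolkit configured, a compose file can request the NVIDIA runtime as mentioned above. A minimal sketch (the service name and build context are illustrative, not taken from this repository):

```yaml
services:
  train:                       # illustrative service name
    build: .
    runtime: nvidia            # requires the NVIDIA Container Toolkit
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
```

If Docker reports an unknown runtime here, revisit steps 2 and 3 above.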

Configuration Options

This project is intended to produce a container that can run on different types of machines (x86_64, aarch64), so the right target must be selected before building. The user will need to select the architecture, the number and type of GPUs, and the dataset size.

The user must also define the policy type, user account, dataset path, policy name, training parameters, and other required inputs.

Use the local .env file to store all configuration items, such as Hugging Face and GitHub accounts, API keys, database URLs, policy URLs, and training parameters.
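A sketch of what such a .env file might look like. Only VIDEO_BACKEND and NVIDIA_VISIBLE_DEVICES appear elsewhere in this README; the remaining variable names and values are illustrative placeholders, not the project's actual keys:

```shell
# Accounts and credentials (names illustrative)
HF_USER=useraccount
HF_TOKEN=hf_xxxxxxxxxxxx
WANDB_API_KEY=xxxxxxxxxxxx

# Hardware selection
NVIDIA_VISIBLE_DEVICES=0

# Video decoding backend (see "Video Backend" below)
VIDEO_BACKEND=pyav
```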

DGX Spark (aarch64)

There are specific challenges with training on the DGX Spark due to the aarch64 architecture and the incompatibility of specific packages:

  • PyTorch: The standard PyTorch pip indexes (cu124, etc.) do not publish aarch64 CUDA wheels. The DGX Spark's GB10 GPU (Blackwell, sm_121 compute capability) requires PyTorch 2.9.0+ from the cu130 index (https://download.pytorch.org/whl/cu130). The Dockerfile handles this automatically based on the target architecture.
  • torchcodec: Not available on aarch64. The system falls back to PyAV (av) for video decoding.
  • flash-attn: No prebuilt aarch64 wheels. Skipped on ARM builds.
  • decord: x86_64 only, not installed on ARM.

Note: lerobot pins torch<2.8.0, but DGX Spark requires torch>=2.9.0 for GPU support. The Dockerfile installs lerobot first, then overrides torch with the CUDA-enabled build. This version mismatch is a known trade-off for aarch64 GPU support.
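The install-then-override pattern described in the note can be sketched as a Dockerfile fragment. This is a hedged illustration of the approach, not the project's actual Dockerfile; index URLs and version pins follow the text above, and TARGETARCH is supplied by Docker BuildKit during multi-platform builds:

```dockerfile
ARG TARGETARCH
# Install lerobot first (it pins torch<2.8.0), then override torch
# with the CUDA-enabled build appropriate for the target architecture.
RUN pip install lerobot && \
    if [ "$TARGETARCH" = "arm64" ]; then \
        # DGX Spark (GB10/Blackwell) needs the cu130 aarch64 wheels
        pip install --index-url https://download.pytorch.org/whl/cu130 "torch>=2.9.0"; \
    else \
        pip install --index-url https://download.pytorch.org/whl/cu124 "torch==2.6.*"; \
    fi
```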

x86_64 Servers

On standard x86_64 NVIDIA GPU servers, PyTorch CUDA wheels install normally from the cu124 index (PyTorch 2.6.x).

  • FFmpeg: Runtime libraries are installed for torchcodec support.
  • torchcodec: Pre-built wheels have ABI compatibility issues with cu124 PyTorch builds. The system defaults to PyAV (av) for video decoding.
  • flash-attn: Installed on x86_64 with a fallback if compilation fails.

Video Backend

Both architectures use PyAV (av) as the default video backend for reliable cross-platform video decoding. This can be overridden via the VIDEO_BACKEND environment variable in .env:

# Use pyav (default, recommended)
VIDEO_BACKEND=pyav

# Use torchcodec (x86_64 only, may have compatibility issues)
VIDEO_BACKEND=torchcodec
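The fallback behavior described above can be sketched in Python. This helper is illustrative (it is not part of lerobot): it honors VIDEO_BACKEND if set, and falls back to pyav when torchcodec is requested but not importable:

```python
import importlib.util
import os


def select_video_backend() -> str:
    """Pick a video decoding backend, honoring VIDEO_BACKEND if set.

    Falls back to "pyav" when torchcodec is requested but not installed,
    since torchcodec wheels are x86_64-only and may be ABI-incompatible.
    """
    requested = os.environ.get("VIDEO_BACKEND", "pyav")
    if requested == "torchcodec" and importlib.util.find_spec("torchcodec") is None:
        return "pyav"
    return requested
```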

GPU Memory Management

The Dockerfile sets the following environment variables by default to reduce CUDA OOM errors and CPU memory pressure during training:

  • PYTORCH_CUDA_ALLOC_CONF (default: expandable_segments:True): Reduces CUDA allocator fragmentation by allowing the allocator to grow segments dynamically instead of failing when free memory is fragmented.
  • OMP_NUM_THREADS (default: 1): Prevents CPU thread over-subscription from DataLoader workers each spawning a full OpenMP thread pool.
  • TOKENIZERS_PARALLELISM (default: false): Suppresses Hugging Face tokenizer parallelism warnings and potential deadlocks inside forked DataLoader workers.
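In Dockerfile form, setting these defaults looks like the following (a sketch consistent with the values above, not necessarily verbatim from the project's Dockerfile):

```dockerfile
# Memory and threading defaults for training
ENV PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
    OMP_NUM_THREADS=1 \
    TOKENIZERS_PARALLELISM=false
```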

If you encounter torch.OutOfMemoryError: CUDA out of memory, try these steps in order:

  1. Pin training to one GPU — on a multi-GPU host, set NVIDIA_VISIBLE_DEVICES=0 (or 1) in .env to avoid cross-GPU memory contention.
  2. Tune the allocator — extend PYTORCH_CUDA_ALLOC_CONF with max_split_size_mb to cap block sizes, e.g.:
    PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512
  3. Reduce batch size — lower --policy.batch_size in your training command.
  4. Check fragmentation — run nvidia-smi to compare memory reserved vs. memory used; a large gap indicates fragmentation that expandable_segments should mitigate.
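PYTORCH_CUDA_ALLOC_CONF is a comma-separated list of key:value pairs, so extending it as in step 2 just appends another pair. A small illustrative parser (this helper is hypothetical, not a PyTorch API) makes the format concrete:

```python
def parse_cuda_alloc_conf(conf: str) -> dict:
    """Parse a PYTORCH_CUDA_ALLOC_CONF string into an options dict.

    The format is comma-separated key:value pairs, e.g.
    "expandable_segments:True,max_split_size_mb:512".
    """
    options = {}
    for pair in conf.split(","):
        key, _, value = pair.partition(":")
        options[key.strip()] = value.strip()
    return options
```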

Example Command

Here is an example command to launch lerobot-train inside the container on a machine with an NVIDIA GPU:

lerobot-train \
  --dataset.repo_id=useraccount/awesome_demo \
  --policy.type=act \
  --output_dir=outputs/train/awesome_demo_spark \
  --job_name=awesome_demo__spark \
  --policy.device=cuda \
  --wandb.enable=true \
  --policy.repo_id=useraccount/awesome_demo_spark \
  --steps=20000

References

lerobot repository on GitHub; imitation learning page on Hugging Face
