Context
Training jobs are submitted to HPC clusters via SLURM scripts in scripts/ and research/. There is no way to test these job scripts locally before scheduling real jobs, which makes iteration slow and wastes cluster resources on configuration errors.
Proposed Changes
Add a Docker Compose setup that simulates a minimal SLURM environment locally:
- Review the DRAC / Compute Canada SLURM documentation (https://docs.alliancecan.ca/wiki/Running_jobs) to match their configuration as closely as possible. The bash scripts in the research/ folder of this repo are real examples of how the pipeline is scheduled on DRAC's SLURM, for example research/order_level_classifier/job_train_classifier.sh
- A container with a SLURM controller and a single compute node (e.g., using giovtorres/slurm-docker-cluster or similar)
- The project mounted as a volume so job scripts can be submitted with sbatch
- Optional GPU passthrough (CPU-only smoke tests can run training for 1-2 epochs)
- A README explaining how to start the environment, submit jobs, and check output
- A GitHub workflow that tests SLURM jobs
This would allow developers to validate SLURM scripts, environment setup, and pipeline orchestration before submitting to the real cluster.
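As a rough illustration of what "submit jobs, and check output" could look like inside the controller container, here is a minimal sketch of a smoke-test helper. The function names submit_job and wait_for_job and the TIMEOUT_SECONDS / POLL_SECONDS variables are hypothetical and not part of this repo. It only assumes that sbatch and squeue are on the PATH, as they would be in a working SLURM container.

```shell
# Sketch only: meant to be sourced in a shell inside the SLURM controller container.
# submit_job, wait_for_job, TIMEOUT_SECONDS and POLL_SECONDS are hypothetical names.

# Submit a job script and print the numeric job ID.
# `sbatch --parsable` prints "jobid" or "jobid;cluster", so strip any ";cluster" suffix.
submit_job() {
  local script="$1"
  local out
  out="$(sbatch --parsable "$script")" || return 1
  echo "${out%%;*}"
}

# Poll squeue until the job is no longer pending or running.
# Returns 1 if the job is still queued after TIMEOUT_SECONDS (default 600).
wait_for_job() {
  local job_id="$1"
  local timeout="${TIMEOUT_SECONDS:-600}"
  local poll="${POLL_SECONDS:-5}"
  local waited=0
  # `squeue -h` suppresses the header, so empty output means the job left the queue.
  while [ -n "$(squeue -h -j "$job_id" 2>/dev/null)" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out waiting for job $job_id" >&2
      return 1
    fi
    sleep "$poll"
    waited=$((waited + poll))
  done
}
```

A smoke test could then run `job_id="$(submit_job research/order_level_classifier/job_train_classifier.sh)"`, followed by `wait_for_job "$job_id"`, and inspect the job's log afterwards. SLURM writes to slurm-<jobid>.out in the submission directory by default, but the real job scripts may override that with their own #SBATCH --output setting. Note that wait_for_job only detects that the job left the queue, not whether it succeeded, so the log or exit state still needs to be checked.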
Related
- research/order_level_classifier/job_*.sh: existing SLURM job scripts
- scripts/train_species_classifier.sh: local equivalent (PR feat: add species classifier training pipeline #69)