GPU Orchestration

A deep learning training job requires GPU resources that are expensive and scarce. The orchestration pipeline needs to check GPU availability, allocate the right GPU type for the model architecture, launch the training job, monitor utilization, and release the GPU immediately when training completes.

Pipeline

[gpu_check_availability]
     |
     v
[gpu_allocate]
     |
     v
[gpu_submit_job]
     |
     v
[gpu_collect_results]
     |
     v
[gpu_release]

Workflow inputs: jobId, gpuType, modelPath
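The pipeline above is a strict sequence: each task consumes the workflow inputs plus whatever earlier tasks wrote. The plain-Java sketch below runs the five task names in diagram order and merges each task's outputs into one shared context map. The Map-in/Map-out task shape, the `GpuPipelineSketch` class, and every value returned by `demoTasks()` are hypothetical placeholders. This is not the repository's code and not any workflow engine's worker API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch: runs the five tasks in the order of the pipeline diagram,
// merging each task's outputs into one shared context. Not an engine API.
class GpuPipelineSketch {

    // Task names, in pipeline order.
    static final List<String> ORDER = List.of(
            "gpu_check_availability", "gpu_allocate", "gpu_submit_job",
            "gpu_collect_results", "gpu_release");

    // Executes each task in ORDER and records the names it ran in `executed`.
    static Map<String, Object> run(Map<String, Object> workflowInput,
                                   Map<String, UnaryOperator<Map<String, Object>>> tasks,
                                   List<String> executed) {
        Map<String, Object> context = new LinkedHashMap<>(workflowInput);
        for (String name : ORDER) {
            UnaryOperator<Map<String, Object>> task = tasks.get(name);
            if (task == null) {
                throw new IllegalStateException("no worker registered for " + name);
            }
            context.putAll(task.apply(context)); // later tasks see earlier outputs
            executed.add(name);
        }
        return context;
    }

    // Placeholder workers returning fixed, made-up values.
    static Map<String, UnaryOperator<Map<String, Object>>> demoTasks() {
        Map<String, UnaryOperator<Map<String, Object>>> tasks = new LinkedHashMap<>();
        tasks.put("gpu_check_availability", ctx -> Map.of(
                "available", true, "gpuType", ctx.getOrDefault("gpuType", "T4"), "cluster", "cluster-a"));
        tasks.put("gpu_allocate", ctx -> Map.of("gpuId", "gpu-0", "allocated", true, "memoryGb", 16));
        tasks.put("gpu_submit_job", ctx -> Map.of(
                "outputPath", "/results/" + ctx.get("jobId"), "epochs", 10, "lossVal", 0.1));
        tasks.put("gpu_collect_results", ctx -> Map.of("collected", true, "artifacts", List.of("model.pt")));
        tasks.put("gpu_release", ctx -> Map.of("released", true));
        return tasks;
    }

    public static void main(String[] args) {
        List<String> executed = new ArrayList<>();
        Map<String, Object> result = run(
                Map.of("jobId", "job-1", "gpuType", "T4", "modelPath", "/models/resnet"),
                demoTasks(), executed);
        System.out.println(executed);
        System.out.println(result);
    }
}
```

Merging outputs into a single context is what lets `gpu_release`, the last step, see the `gpuId` written by `gpu_allocate` earlier in the run. A missing worker fails the run immediately rather than silently skipping a step.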

Workers

GpuAllocateWorker (task: gpu_allocate)

Uses randomization
Writes gpuId, allocated, memoryGb
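A minimal sketch of what an allocation step with randomization could look like, assuming the randomness stands in for a scheduler picking a free device slot. The per-type memory table, the slot count, the `gpuId` format, and the `GpuAllocateSketch` class are all hypothetical; the real worker's logic is not described here.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch of the gpu_allocate task: picks a random device slot for the
// requested GPU type and reports its memory. Not the repository's actual worker.
class GpuAllocateSketch {

    // Assumed memory sizes (GB) per GPU type, for illustration only.
    static final Map<String, Integer> MEMORY_GB = Map.of("T4", 16, "V100", 32, "A100", 80);

    // Assumed number of device slots per node.
    static final int SLOTS = 8;

    private final Random random;

    GpuAllocateSketch(Random random) {
        this.random = random; // injectable so tests can use a fixed seed
    }

    Map<String, Object> execute(Map<String, Object> input) {
        String gpuType = String.valueOf(input.get("gpuType"));
        Map<String, Object> out = new HashMap<>();
        Integer memoryGb = MEMORY_GB.get(gpuType);
        if (memoryGb == null) {
            // Unknown type: report a failed allocation instead of guessing.
            out.put("gpuId", null);
            out.put("allocated", false);
            out.put("memoryGb", 0);
            return out;
        }
        int slot = random.nextInt(SLOTS); // randomization stands in for a scheduler's choice
        out.put("gpuId", gpuType.toLowerCase(Locale.ROOT) + "-" + slot);
        out.put("allocated", true);
        out.put("memoryGb", memoryGb);
        return out;
    }

    public static void main(String[] args) {
        GpuAllocateSketch worker = new GpuAllocateSketch(new Random(42));
        System.out.println(worker.execute(Map.of("gpuType", "A100")));
    }
}
```

Injecting the `Random` keeps the randomized worker testable: a fixed seed makes the chosen slot reproducible.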

GpuCheckAvailabilityWorker (task: gpu_check_availability)

Writes available, gpuType, cluster
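The availability check produces three fields: whether a GPU is free, which type was requested, and where it was found. The sketch below assumes an in-memory inventory of free device counts per cluster and returns the first cluster with capacity. The inventory shape, cluster names, and `GpuCheckAvailabilitySketch` class are hypothetical; a real check would query the cluster scheduler.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of gpu_check_availability: scans an in-memory inventory and
// reports the first cluster with a free GPU of the requested type.
class GpuCheckAvailabilitySketch {

    // Assumed inventory shape: cluster name -> (GPU type -> free device count).
    private final Map<String, Map<String, Integer>> freeByCluster;

    GpuCheckAvailabilitySketch(Map<String, Map<String, Integer>> freeByCluster) {
        this.freeByCluster = freeByCluster;
    }

    Map<String, Object> execute(Map<String, Object> input) {
        String gpuType = String.valueOf(input.get("gpuType"));
        Map<String, Object> out = new HashMap<>();
        out.put("gpuType", gpuType);
        out.put("available", false);
        out.put("cluster", null);
        // Clusters are checked in the map's iteration order; pass a LinkedHashMap for a fixed preference.
        for (Map.Entry<String, Map<String, Integer>> entry : freeByCluster.entrySet()) {
            if (entry.getValue().getOrDefault(gpuType, 0) > 0) {
                out.put("available", true);
                out.put("cluster", entry.getKey());
                break;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> inventory = new LinkedHashMap<>();
        inventory.put("cluster-a", Map.of("A100", 0, "T4", 2));
        inventory.put("cluster-b", Map.of("A100", 3));
        GpuCheckAvailabilitySketch worker = new GpuCheckAvailabilitySketch(inventory);
        System.out.println(worker.execute(Map.of("gpuType", "A100")));
    }
}
```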

GpuCollectResultsWorker (task: gpu_collect_results)

Writes collected, artifacts
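A sketch of result collection, assuming the step receives the `outputPath` written by the training step and turns it into a list of artifact locations. The artifact file names and the `GpuCollectResultsSketch` class are hypothetical, and the sketch does not touch the filesystem; a real collector would at least verify that the files exist before reporting `collected`.

```java
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch of gpu_collect_results: turns the job's outputPath into a list
// of artifact locations. The artifact names are assumptions, not the real worker's.
class GpuCollectResultsSketch {

    static final List<String> ARTIFACT_NAMES = List.of("model.pt", "metrics.json");

    Map<String, Object> execute(Map<String, Object> input) {
        Map<String, Object> out = new HashMap<>();
        Object outputPath = input.get("outputPath");
        if (outputPath == null) {
            // Nothing to collect if the training step produced no output location.
            out.put("collected", false);
            out.put("artifacts", List.of());
            return out;
        }
        Path base = Path.of(outputPath.toString());
        List<String> artifacts = ARTIFACT_NAMES.stream()
                .map(name -> base.resolve(name).toString())
                .collect(Collectors.toList());
        out.put("collected", true);
        out.put("artifacts", artifacts);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(new GpuCollectResultsSketch().execute(Map.of("outputPath", "/results/job-1")));
    }
}
```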

GpuReleaseWorker (task: gpu_release)

Writes released
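Because the point of the pipeline is to release the GPU as soon as training completes, the release step should be safe to retry. The sketch below assumes a hypothetical in-memory registry of held GPU ids and makes release idempotent: freeing an already-free GPU still reports `released`, so a retried task does not fail. The registry and the `GpuReleaseSketch` class are illustrative only.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of gpu_release: removes the GPU id from an in-memory
// registry of held devices. Not the repository's actual worker.
class GpuReleaseSketch {

    // Assumed registry of GPU ids currently held by jobs (thread-safe set).
    private final Set<String> held = ConcurrentHashMap.newKeySet();

    void hold(String gpuId) {
        held.add(gpuId);
    }

    boolean isHeld(String gpuId) {
        return held.contains(gpuId);
    }

    Map<String, Object> execute(Map<String, Object> input) {
        Map<String, Object> out = new HashMap<>();
        Object gpuId = input.get("gpuId");
        if (gpuId == null) {
            out.put("released", false); // nothing identifies which GPU to free
            return out;
        }
        held.remove(gpuId.toString());
        // Idempotent: releasing an already-free GPU still succeeds, so retries are safe.
        out.put("released", true);
        return out;
    }

    public static void main(String[] args) {
        GpuReleaseSketch worker = new GpuReleaseSketch();
        worker.hold("a100-3");
        System.out.println(worker.execute(Map.of("gpuId", "a100-3")));
        System.out.println(worker.isHeld("a100-3"));
    }
}
```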

GpuSubmitJobWorker (task: gpu_submit_job)

Writes outputPath, epochs, lossVal
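A sketch of the submission step's contract: it requires `jobId` and `modelPath` and writes `outputPath`, `epochs`, and `lossVal`. The training run here is simulated. The output root, the fixed epoch budget, the halving loss curve, and the `GpuSubmitJobSketch` class are all hypothetical; a real worker would launch the job on the allocated GPU and read the final validation loss from its logs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of gpu_submit_job: validates inputs and simulates a training run.
// The loss curve is a placeholder, not real training.
class GpuSubmitJobSketch {

    private final String outputRoot; // assumed results directory
    private final int epochs;        // assumed fixed epoch budget

    GpuSubmitJobSketch(String outputRoot, int epochs) {
        this.outputRoot = outputRoot;
        this.epochs = epochs;
    }

    Map<String, Object> execute(Map<String, Object> input) {
        Object jobId = input.get("jobId");
        Object modelPath = input.get("modelPath");
        if (jobId == null || modelPath == null) {
            throw new IllegalArgumentException("jobId and modelPath are required");
        }
        double lossVal = 1.0;
        for (int epoch = 0; epoch < epochs; epoch++) {
            lossVal *= 0.5; // placeholder: pretend each epoch halves the validation loss
        }
        Map<String, Object> out = new HashMap<>();
        out.put("outputPath", outputRoot + "/" + jobId);
        out.put("epochs", epochs);
        out.put("lossVal", lossVal);
        return out;
    }

    public static void main(String[] args) {
        GpuSubmitJobSketch worker = new GpuSubmitJobSketch("/results", 3);
        System.out.println(worker.execute(Map.of("jobId", "job-1", "modelPath", "/models/resnet")));
    }
}
```

With 3 epochs the placeholder loss is 1.0 × 0.5³ = 0.125.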

20 tests | Workflow: gpu_orchestration_demo | Timeout: 60s

See RUNNING.md for setup and usage.
