llm-d-incubation/hermes

Hermes

Note: This project is still incubating and is a proof of concept.

Hermes is a Kubernetes cluster analyzer for RDMA-capable GPU infrastructure. It scans a cluster to detect RDMA networking capabilities and GPU topology, then selects optimal node pairs for high-speed interconnect testing.

Supports CoreWeave, GKE, OpenShift, and generic Kubernetes environments.

Installation

From Release Binary

Download the latest release for your platform:

# macOS (Apple Silicon)
curl -LO https://github.com/llm-d-incubation/hermes/releases/latest/download/hermes-darwin-arm64.tar.gz
tar xzf hermes-darwin-arm64.tar.gz
sudo mv hermes /usr/local/bin/

# macOS (Intel)
curl -LO https://github.com/llm-d-incubation/hermes/releases/latest/download/hermes-darwin-amd64.tar.gz
tar xzf hermes-darwin-amd64.tar.gz
sudo mv hermes /usr/local/bin/

# Linux (x86_64)
curl -LO https://github.com/llm-d-incubation/hermes/releases/latest/download/hermes-linux-amd64.tar.gz
tar xzf hermes-linux-amd64.tar.gz
sudo mv hermes hca-probe /usr/local/bin/
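
The three downloads above differ only in the asset name, so a script can pick one from `uname`. The following is a minimal sketch, not part of Hermes itself: `hermes_asset` is a hypothetical helper, and it covers only the three assets listed above.

```shell
# Map an OS/architecture pair to one of the release asset names listed above.
# With no arguments it inspects the current machine via uname.
# hermes_asset is a hypothetical helper, not something Hermes ships.
hermes_asset() {
  os="${1:-$(uname -s)}"
  arch="${2:-$(uname -m)}"
  case "$os/$arch" in
    Darwin/arm64)  echo "hermes-darwin-arm64.tar.gz" ;;
    Darwin/x86_64) echo "hermes-darwin-amd64.tar.gz" ;;
    Linux/x86_64)  echo "hermes-linux-amd64.tar.gz" ;;
    *)
      # Any other platform is not covered by the assets named in this README.
      echo "no release asset listed for $os/$arch" >&2
      return 1
      ;;
  esac
}
```

Usage: `asset="$(hermes_asset)" && curl -LO "https://github.com/llm-d-incubation/hermes/releases/latest/download/$asset"`, followed by the matching `tar` and `mv` steps above.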

From Source

cargo install --path .

Quick Start

# scan cluster
hermes scan

# filter RDMA-capable nodes
hermes scan --ib-only

# preview RDMA test manifests
hermes self-test --dry-run

# run RDMA self-test
hermes self-test --namespace default

Platform Examples

# CoreWeave
KUBECONFIG=~/path/to/cwconfig hermes scan

# GKE
gcloud container clusters get-credentials CLUSTER_NAME && hermes scan

# OpenShift (with proxy)
HTTPS_PROXY=http://proxy-ip:port hermes scan

Self-Test Framework

Prerequisite: Most self-tests require JobSet to be installed on the cluster:

kubectl apply --server-side -f https://github.com/kubernetes-sigs/jobset/releases/download/v0.10.1/manifests.yaml
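
Before running a self-test, it can help to confirm that JobSet is actually registered with the API server. A minimal sketch, assuming the CRD carries the upstream name `jobsets.jobset.x-k8s.io`; `jobset_installed` is a hypothetical helper, not a Hermes command:

```shell
# Return 0 if the JobSet CRD is registered with the API server, non-zero otherwise.
# Assumption: the CRD is named jobsets.jobset.x-k8s.io, as in upstream JobSet releases.
# jobset_installed is a hypothetical helper, not part of Hermes.
jobset_installed() {
  kubectl get crd jobsets.jobset.x-k8s.io >/dev/null 2>&1
}
```

Usage: `jobset_installed || echo "JobSet is not installed; apply the manifest above first" >&2`.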

The self-test automatically deploys RDMA workloads on the node pairs Hermes selects:

# preview what will be deployed
hermes self-test --dry-run

# run UCX-based data transfer test
hermes self-test --namespace default

# OpenShift RoCE (auto-detects SR-IOV network or use --sriov-network)
hermes self-test --namespace test-ns

# keep resources after test
hermes self-test --no-cleanup

How it works: scans the cluster → selects an optimal node pair (same fabric/zone) → renders test manifests → deploys jobs → monitors completion → cleans up resources (unless --no-cleanup is set)
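
One way to use this flow safely is to preview the manifests first and deploy only if the preview succeeds. The sketch below uses only the flags shown in this README; `preview_then_run` is a hypothetical wrapper, and it invokes the dry run and the real run as two separate commands because combining the flags is not documented here.

```shell
# Preview the rendered manifests, then run the self-test in the given namespace.
# Stops before deploying anything if the dry run fails.
# preview_then_run is a hypothetical wrapper, not a Hermes subcommand.
preview_then_run() {
  ns="${1:-default}"
  hermes self-test --dry-run || return 1
  hermes self-test --namespace "$ns"
}
```

Usage: `preview_then_run test-ns`.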

Available workloads: nixl-transfer-test (default), deepgemm-minimal-test

Output Formats

hermes scan --format json    # JSON output
hermes scan --format table   # table view (default)
hermes scan --save-to report.json   # save the report to a file
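
To compare scans over time, each report can be saved under a timestamped name. A minimal sketch, assuming `--save-to` writes the report to the given path; `save_scan_report` and the `hermes-scan-` file prefix are hypothetical names chosen for illustration:

```shell
# Save a scan report to <dir>/hermes-scan-<timestamp>.json and print its path.
# Assumption: `hermes scan --save-to PATH` writes the report to PATH.
# save_scan_report is a hypothetical helper, not part of Hermes.
save_scan_report() {
  dir="${1:-.}"
  out="$dir/hermes-scan-$(date +%Y%m%d-%H%M%S).json"
  mkdir -p "$dir" || return 1
  hermes scan --save-to "$out" || return 1
  echo "$out"
}
```

Usage: `save_scan_report ./reports` prints the path of the new report, which can then be diffed against an earlier one.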

License

MIT

About

Hermes is a cluster configuration scanning and self-test generation tool for llm-d inference workloads
