Setting up Kind Cluster with NVIDIA GPU Support: Makefile Guide

Overview

This guide explains how to use the Makefile to set up a Kind (Kubernetes in Docker) cluster with NVIDIA GPU support, including monitoring capabilities.

For detailed configuration information, refer to the NVIDIA and Kind Configuration Guide and the DCGM Monitoring Setup Guide.

Prerequisites

The Makefile automatically installs the following requirements:

- Go
- kubectl (latest stable version)
- Kind (v0.20.0)
- Helm

Main Targets

Complete Setup

make all

This runs the complete setup process in the following order:

- Installs prerequisites
- Creates Kind cluster
- Sets up NVIDIA support
- Installs GPU operator
- Tests GPU access
- Sets up monitoring
- Configures port forwarding
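
If you prefer to run the steps one at a time and stop at the first failure, the sequence above can be sketched as a small shell function. This is a minimal sketch that assumes each step is an independent Makefile target named as in the Individual Steps section; MAKE_CMD is a hypothetical override, not something the Makefile defines.

```shell
#!/bin/sh
# Run each Makefile target in order; stop at the first one that fails.
# MAKE_CMD is a hypothetical override for substituting another command.
run_targets() {
    for target in "$@"; do
        echo "==> make $target"
        "${MAKE_CMD:-make}" "$target" || {
            echo "Target '$target' failed; stopping." >&2
            return 1
        }
    done
}

# Usage (same order as the list above):
# run_targets prerequisites cluster setup-nvidia install-gpu-operator \
#     test-gpu setup-monitoring port-forward
```

Running the targets individually makes it easier to see which step failed before reaching for make debug.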

Individual Steps

1. Install Prerequisites

make prerequisites

Installs all required tools and dependencies.
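
To see which of the listed tools are still missing before or after running this target, a check along the following lines may help. It is a sketch that assumes the tools are expected on PATH; the function name missing_tools is illustrative and not part of the Makefile.

```shell
#!/bin/sh
# Print the name of every tool from the argument list that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tools this guide lists as prerequisites.
missing=$(missing_tools go kubectl kind helm)
if [ -n "$missing" ]; then
    echo "Not found on PATH (try 'make prerequisites'):"
    echo "$missing"
else
    echo "All prerequisite tools are on PATH."
fi
```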

2. Create Cluster

make cluster

Creates a Kind cluster using the configuration from kind-config.yaml. For detailed configuration information, see the NVIDIA and Kind Configuration Guide.

You can use a different Kind configuration file by setting the KIND_CONFIG environment variable:

# Use default config (kind-config.yaml)
make cluster

# Use 8 GPU configuration
KIND_CONFIG=kind-config-8GPU.yaml make cluster

# Use mount configuration
KIND_CONFIG=kind-config-mnt.yaml make cluster

Available configuration files:

- kind-config.yaml: Default configuration with basic GPU support
- kind-config-8GPU.yaml: Configuration for systems with 8 GPUs
- kind-config-mnt.yaml: Configuration with additional mount points for models, data, templates, and requests
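
The fallback from KIND_CONFIG to the default file can be illustrated with a short sketch. It assumes the Makefile ultimately passes the file to kind create cluster --config; the function name resolve_kind_config is hypothetical, and the actual Makefile logic may differ.

```shell
#!/bin/sh
# Echo the Kind config path to use: $KIND_CONFIG if set, else the default.
# Fail early when the file does not exist.
resolve_kind_config() {
    config="${KIND_CONFIG:-kind-config.yaml}"
    if [ ! -f "$config" ]; then
        echo "Kind config not found: $config" >&2
        return 1
    fi
    printf '%s\n' "$config"
}

# Usage:
# config=$(resolve_kind_config) && kind create cluster --config "$config"
```

Checking that the file exists first gives a clearer error than letting cluster creation fail on a mistyped file name.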

3. Set Up NVIDIA Support

make setup-nvidia

Runs the setup-nvidia-kind.sh script to configure NVIDIA container support. See the NVIDIA and Kind Configuration Guide for a detailed explanation of the setup process.

4. Install GPU Operator

make install-gpu-operator

Installs the NVIDIA GPU operator with the following configuration:

- Driver disabled (uses host driver)
- Toolkit enabled
- Device plugin enabled
- MIG manager disabled
- Host mounts enabled
- Specific toolkit and device plugin versions
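
For orientation, the first four settings above correspond to values in the GPU operator Helm chart. The sketch below prints a Helm command that sets them. It assumes the nvidia Helm repository has been added (helm repo add nvidia https://helm.ngc.nvidia.com/nvidia) and uses gpu-operator as both release name and namespace; the host mount settings and pinned versions are left out because this guide does not state them, so the Makefile's real command will contain more flags.

```shell
#!/bin/sh
# Print a Helm command that mirrors the enable/disable settings listed above.
# Assumptions: release name and namespace "gpu-operator", chart
# "nvidia/gpu-operator". Host mounts and version pins are not included.
gpu_operator_install_cmd() {
    printf '%s \\\n' \
        "helm install gpu-operator nvidia/gpu-operator" \
        "  --namespace gpu-operator --create-namespace" \
        "  --set driver.enabled=false" \
        "  --set toolkit.enabled=true" \
        "  --set devicePlugin.enabled=true" \
        "  --set migManager.enabled=false"
    echo "  --wait"
}

gpu_operator_install_cmd
```

The function only prints the command, so you can review it against the Makefile before running anything.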

5. Test GPU Access

make test-gpu

Runs a test pod with nvidia-smi to verify GPU access.
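
A test pod of this kind typically requests one GPU through the nvidia.com/gpu resource and runs nvidia-smi once. The sketch below generates such a manifest. The pod name gpu-test and the default image tag are assumptions chosen for illustration; the Makefile's own test pod may use different values.

```shell
#!/bin/sh
# Emit a one-shot pod manifest that requests one GPU and runs nvidia-smi.
# The pod name and the default image are illustrative assumptions.
gpu_test_pod_manifest() {
    image="${1:-nvidia/cuda:12.2.0-base-ubuntu22.04}"
    cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: nvidia-smi
      image: $image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
}

# Usage:
# gpu_test_pod_manifest | kubectl apply -f -
# kubectl logs gpu-test
```

If the pod stays Pending, the node is probably not advertising nvidia.com/gpu yet, which points back to the GPU operator step.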

6. Set Up Monitoring

make setup-monitoring

Sets up the monitoring stack:

- Installs kube-prometheus-stack
- Configures DCGM monitoring
- Sets up custom service monitors

For detailed information about DCGM monitoring setup, refer to the DCGM Monitoring Setup Guide.

7. Port Forwarding

make port-forward

Sets up port forwarding for monitoring services:

- Prometheus: 9090
- Grafana: 3000
- Alertmanager: 9093
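
To forward a single service by hand, kubectl port-forward takes a service and a LOCAL:REMOTE port pair. The sketch below prints one command per service. The service names assume a Helm release called kube-prometheus-stack, and the Grafana service port of 80 is also an assumption; check kubectl get svc -n monitoring for the real names and ports.

```shell
#!/bin/sh
# Print one "kubectl port-forward" command per SERVICE:LOCAL:REMOTE spec.
port_forward_cmds() {
    namespace="$1"
    shift
    for spec in "$@"; do
        svc=${spec%%:*}     # text before the first colon
        ports=${spec#*:}    # LOCAL:REMOTE after the first colon
        echo "kubectl port-forward -n $namespace svc/$svc $ports"
    done
}

# Service names and the Grafana service port are assumptions.
port_forward_cmds monitoring \
    kube-prometheus-stack-prometheus:9090:9090 \
    kube-prometheus-stack-grafana:3000:80 \
    kube-prometheus-stack-alertmanager:9093:9093
```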

Maintenance Commands

Clean Up

make clean

Deletes the Kind cluster.

Debug

make debug

Shows debug information, including:

- Pod status in gpu-operator namespace
- Pod descriptions
- GPU operator logs
- NVIDIA container information
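
The first three items can also be collected by hand with kubectl. The following is a rough sketch, not the contents of the debug target: the label selector app=gpu-operator and the KUBECTL override are assumptions, and the NVIDIA container information is omitted because this guide does not say how it is gathered.

```shell
#!/bin/sh
# Collect pod status, pod descriptions, and operator logs by hand.
# KUBECTL is a hypothetical override (defaults to kubectl).
debug_gpu_operator() {
    kubectl_cmd="${KUBECTL:-kubectl}"
    ns="gpu-operator"
    "$kubectl_cmd" get pods -n "$ns" -o wide
    "$kubectl_cmd" describe pods -n "$ns"
    # The label selector is an assumption; adjust it to match your pods.
    "$kubectl_cmd" logs -n "$ns" -l app=gpu-operator --tail=100
}

# Usage:
# debug_gpu_operator > gpu-operator-debug.txt 2>&1
```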

Reinstall NVIDIA Runtime

make reinstall-nvidia-runtime

Completely reinstalls the NVIDIA runtime:

- Uninstalls GPU operator
- Deletes gpu-operator namespace
- Recreates cluster
- Reinstalls NVIDIA support
- Reinstalls GPU operator

Common Issues and Troubleshooting

If port forwarding fails:
- Check whether the ports are already in use
- Verify that the services are running in the monitoring namespace

If GPU operator installation fails:
- Use make debug to check the operator logs
- Verify NVIDIA driver compatibility
- Check that all required mounts are properly configured
- See the NVIDIA and Kind Configuration Guide for setup requirements

If monitoring setup fails:
- Ensure the CustomResourceDefinitions are established
- Check that the prometheus-operator is running
- Verify that RBAC permissions are correctly configured
- Refer to the DCGM Monitoring Setup Guide for detailed monitoring configuration
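
The first two monitoring checks can be scripted roughly as follows. This sketch assumes the ServiceMonitor CRD installed by kube-prometheus-stack (servicemonitors.monitoring.coreos.com) is the one that matters here; KUBECTL is a hypothetical override that defaults to kubectl.

```shell
#!/bin/sh
# Check that the ServiceMonitor CRD is established, then list monitoring pods.
# KUBECTL is a hypothetical override (defaults to kubectl).
check_monitoring_prereqs() {
    kubectl_cmd="${KUBECTL:-kubectl}"
    # Block until the CRD reports the Established condition (or time out).
    "$kubectl_cmd" wait --for condition=established --timeout=60s \
        crd/servicemonitors.monitoring.coreos.com || return 1
    # List pods so you can confirm the prometheus-operator pod is Running.
    "$kubectl_cmd" get pods -n monitoring
}

# Usage:
# check_monitoring_prereqs
```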
