This guide explains how to use the Makefile to set up a Kind (Kubernetes in Docker) cluster with NVIDIA GPU support, including monitoring capabilities.
For detailed configuration information, refer to the NVIDIA and Kind Configuration Guide and the DCGM Monitoring Setup Guide.
The Makefile automatically installs the following requirements:
- Go
- kubectl (latest stable version)
- Kind (v0.20.0)
- Helm
make all
This runs the complete setup process in the following order:
- Installs prerequisites
- Creates Kind cluster
- Sets up NVIDIA support
- Installs GPU operator
- Tests GPU access
- Sets up monitoring
- Configures port forwarding
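When one step needs to be retried, the same order can be walked through by hand. The following is a minimal sketch, assuming each step of `make all` corresponds to one of the individual targets described in this guide; `run_setup_steps` and the `MAKE_CMD` variable are hypothetical names and are not part of the Makefile.

```shell
#!/bin/sh
# Hypothetical helper: run the documented targets one by one, in the order
# listed for `make all`, and stop at the first failure.
# MAKE_CMD is an illustrative override; MAKE_CMD=echo gives a dry run.
MAKE_CMD="${MAKE_CMD:-make}"

run_setup_steps() {
  for target in prerequisites cluster setup-nvidia install-gpu-operator \
                test-gpu setup-monitoring port-forward; do
    echo "==> make $target"
    "$MAKE_CMD" "$target" || {
      echo "step failed: $target" >&2
      return 1
    }
  done
}

# Usage (commented out so that sourcing this file has no side effects):
# run_setup_steps
```

Stopping at the first failing target keeps the later steps from running against a half-configured cluster.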
make prerequisites
Installs all required tools and dependencies.
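To confirm that the tools ended up on the PATH, a small check such as the one below may help. It is only a sketch: `missing_tools` is a hypothetical helper, and the binary names (`go`, `kubectl`, `kind`, `helm`) are assumed to match the tools listed in this guide.

```shell
#!/bin/sh
# Print the name of every listed tool that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Usage: report which of the tools named in this guide are still missing.
# missing_tools go kubectl kind helm
```

An empty result means every listed tool was found.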
make cluster
Creates a Kind cluster using the configuration from kind-config.yaml. For detailed configuration information, see the NVIDIA and Kind Configuration Guide.
You can use a different Kind configuration file by setting the KIND_CONFIG environment variable:
# Use default config (kind-config.yaml)
make cluster
# Use 8 GPU configuration
KIND_CONFIG=kind-config-8GPU.yaml make cluster
# Use mount configuration
KIND_CONFIG=kind-config-mnt.yaml make cluster
Available configuration files:
- kind-config.yaml: Default configuration with basic GPU support
- kind-config-8GPU.yaml: Configuration for systems with 8 GPUs
- kind-config-mnt.yaml: Configuration with additional mount points for models, data, templates, and requests
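The override behaves like a default-with-fallback lookup. The sketch below mirrors that behaviour in plain shell and adds an existence check; `resolve_kind_config` is a hypothetical helper, and the `kind create cluster --config` call in the usage comment is an assumption about what the Makefile runs.

```shell
#!/bin/sh
# Hypothetical helper: choose the Kind config file the way the examples do
# (KIND_CONFIG if set, otherwise kind-config.yaml) and check that it exists.
resolve_kind_config() {
  cfg="${KIND_CONFIG:-kind-config.yaml}"
  if [ ! -f "$cfg" ]; then
    echo "config file not found: $cfg" >&2
    return 1
  fi
  echo "$cfg"
}

# Usage (assumption: the Makefile ends up running something similar):
# kind create cluster --config "$(resolve_kind_config)"
```

Checking for the file first gives a clearer error than a failed cluster creation when the variable contains a typo.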
make setup-nvidia
Runs the setup-nvidia-kind.sh script to configure NVIDIA container support. See the NVIDIA and Kind Configuration Guide for a detailed explanation of the setup process.
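This guide does not show what the script changes. A common precondition for GPU access inside Kind nodes is that Docker uses the `nvidia` runtime by default, so a check along these lines may be useful afterwards. It is a sketch under that assumption; `nvidia_is_default_runtime` and the `DOCKER` override are hypothetical names.

```shell
#!/bin/sh
# DOCKER is an illustrative override so the check can be exercised
# without a Docker daemon.
DOCKER="${DOCKER:-docker}"

# Succeeds if Docker reports "nvidia" as its default runtime.
nvidia_is_default_runtime() {
  runtime=$("$DOCKER" info --format '{{.DefaultRuntime}}' 2>/dev/null) || return 1
  [ "$runtime" = "nvidia" ]
}

# Usage:
# nvidia_is_default_runtime && echo "nvidia runtime is the default"
```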
make install-gpu-operator
Installs the NVIDIA GPU operator with the following configurations:
- Driver disabled (uses host driver)
- Toolkit enabled
- Device plugin enabled
- MIG manager disabled
- Host mounts enabled
- Specific toolkit and device plugin versions
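For orientation, the first four settings map onto values of the NVIDIA GPU operator Helm chart roughly as sketched below. The release name, the `install_gpu_operator` function, and the `HELM` override are assumptions for illustration; the Makefile's actual command may differ, and the host-mount and version settings are left out because this guide does not state their values.

```shell
#!/bin/sh
# HELM is an illustrative override; HELM=echo prints the command
# instead of running it.
HELM="${HELM:-helm}"

install_gpu_operator() {
  # Release name "gpu-operator" is an assumption; the namespace matches
  # the one named elsewhere in this guide.
  "$HELM" upgrade --install gpu-operator nvidia/gpu-operator \
    --namespace gpu-operator --create-namespace \
    --set driver.enabled=false \
    --set toolkit.enabled=true \
    --set devicePlugin.enabled=true \
    --set migManager.enabled=false
}

# Usage (the chart repository has to be added first):
# helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
# install_gpu_operator
```

Disabling the driver component means the operator relies on the driver already installed on the host.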
make test-gpu
Runs a test pod with nvidia-smi to verify GPU access.
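A test pod of this kind can be as small as the manifest sketched below. The pod name, the image tag, and the `gpu_test_manifest` helper are illustrative assumptions; the pod used by the Makefile is not shown in this guide.

```shell
#!/bin/sh
# Print a one-shot pod manifest that requests a single GPU and runs nvidia-smi.
# Pod name and image tag are illustrative assumptions.
gpu_test_manifest() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: nvidia-smi
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
}

# Usage:
# gpu_test_manifest | kubectl apply -f -
# kubectl logs gpu-smoke-test
```

If the device plugin is working, the pod is scheduled and its log contains the usual nvidia-smi table; if it stays Pending, no node is advertising the `nvidia.com/gpu` resource.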
make setup-monitoring
Sets up the monitoring stack:
- Installs kube-prometheus-stack
- Configures DCGM monitoring
- Sets up custom service monitors
For detailed information about DCGM monitoring setup, refer to the DCGM Monitoring Setup Guide.
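As a rough illustration of what a custom service monitor for DCGM can look like, the sketch below prints a ServiceMonitor manifest. Everything specific in it is an assumption to verify against your cluster (for example with `kubectl get svc -n gpu-operator --show-labels`): the `release` label, the `app: nvidia-dcgm-exporter` selector, and the `gpu-metrics` port name. `dcgm_service_monitor` is a hypothetical helper.

```shell
#!/bin/sh
# Print a ServiceMonitor that asks Prometheus to scrape the DCGM exporter.
# Labels, port name, and release name are assumptions; adjust them to
# match the services actually present in the cluster.
dcgm_service_monitor() {
  cat <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  namespaceSelector:
    matchNames: ["gpu-operator"]
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
    - port: gpu-metrics
      interval: 15s
EOF
}

# Usage (assumes kube-prometheus-stack is installed under that release name):
# helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
# helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack -n monitoring --create-namespace
# dcgm_service_monitor | kubectl apply -f -
```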
make port-forward
Sets up port forwarding for monitoring services:
- Prometheus: 9090
- Grafana: 3000
- Alertmanager: 9093
make clean
Deletes the Kind cluster.
make debug
Shows debug information including:
- Pod status in gpu-operator namespace
- Pod descriptions
- GPU operator logs
- NVIDIA container information
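The first three items can also be gathered by hand with kubectl, as sketched below. The `app=gpu-operator` label selector is an assumption about how the operator pod is labelled, and `gpu_operator_debug` is a hypothetical helper; the NVIDIA container information is left out because this guide does not say how the Makefile collects it.

```shell
#!/bin/sh
# KUBECTL=echo gives a dry run that prints the commands' arguments.
KUBECTL="${KUBECTL:-kubectl}"

gpu_operator_debug() {
  "$KUBECTL" get pods -n gpu-operator -o wide
  "$KUBECTL" describe pods -n gpu-operator
  # Label selector is an assumption; check with `kubectl get pods --show-labels`.
  "$KUBECTL" logs -n gpu-operator -l app=gpu-operator --tail=100
}

# Usage:
# gpu_operator_debug > gpu-operator-debug.txt 2>&1
```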
make reinstall-nvidia-runtime
Completely reinstalls the NVIDIA runtime:
- Uninstalls GPU operator
- Deletes gpu-operator namespace
- Recreates cluster
- Reinstalls NVIDIA support
- Reinstalls GPU operator
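Spelled out as individual commands, the sequence might look like the sketch below. It assumes the Helm release is named `gpu-operator` and that recreating the cluster amounts to `make clean` followed by `make cluster`; the function name and the override variables are hypothetical.

```shell
#!/bin/sh
# Setting all three overrides to `echo` gives a dry run.
HELM="${HELM:-helm}"
KUBECTL="${KUBECTL:-kubectl}"
MAKE_CMD="${MAKE_CMD:-make}"

reinstall_nvidia_runtime() {
  # Ignore a failed uninstall: the release may already be gone.
  "$HELM" uninstall gpu-operator -n gpu-operator || true
  "$KUBECTL" delete namespace gpu-operator --ignore-not-found
  # Stop at the first failing make target.
  "$MAKE_CMD" clean &&
    "$MAKE_CMD" cluster &&
    "$MAKE_CMD" setup-nvidia &&
    "$MAKE_CMD" install-gpu-operator
}

# Usage:
# reinstall_nvidia_runtime
```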
- If port forwarding fails:
  - Check if ports are already in use
  - Verify the services are running in the monitoring namespace
- If GPU operator installation fails:
  - Use make debug to check the operator logs
  - Verify NVIDIA driver compatibility
  - Check if all required mounts are properly configured
  - See the NVIDIA and Kind Configuration Guide for proper setup requirements
- If monitoring setup fails:
  - Ensure CustomResourceDefinitions are properly established
  - Check if the prometheus-operator is running
  - Verify RBAC permissions are correctly configured
  - Refer to the DCGM Monitoring Setup Guide for detailed monitoring configuration
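Several of these checks can be scripted. The sketch below covers one check per failure case: whether a local port is already taken, which driver version the host reports, and whether the ServiceMonitor CRD has been established. The function names and the `KUBECTL` override are hypothetical, `port_in_use` relies on bash's `/dev/tcp` feature, and the CRD name assumes the prometheus-operator CRDs shipped with kube-prometheus-stack.

```shell
#!/usr/bin/env bash
# KUBECTL=echo gives a dry run for the kubectl-based check.
KUBECTL="${KUBECTL:-kubectl}"

# Succeeds if something accepts TCP connections on localhost:<port>.
# Uses bash's /dev/tcp pseudo-device; the subshell closes the socket again.
port_in_use() {
  (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

# Print the host driver version, or fail if nvidia-smi is not on PATH.
host_driver_version() {
  command -v nvidia-smi >/dev/null 2>&1 || {
    echo "nvidia-smi not found on PATH" >&2
    return 1
  }
  nvidia-smi --query-gpu=driver_version --format=csv,noheader
}

# Block until the ServiceMonitor CRD reports the Established condition.
wait_for_servicemonitor_crd() {
  "$KUBECTL" wait --for=condition=Established \
    crd/servicemonitors.monitoring.coreos.com --timeout=60s
}

# Usage:
# for p in 9090 3000 9093; do port_in_use "$p" && echo "port $p is busy"; done
# host_driver_version
# wait_for_servicemonitor_crd
# kubectl get pods -n monitoring
```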