diff --git a/justfile b/justfile
index c7b77b5..0c81c34 100644
--- a/justfile
+++ b/justfile
@@ -25,6 +25,7 @@ mod longhorn
 mod metabase
 mod mlflow
 mod minio
+mod nvidia-device-plugin
 mod fairwinds-polaris
 mod oauth2-proxy
 mod postgres
diff --git a/nvidia-device-plugin/README.md b/nvidia-device-plugin/README.md
new file mode 100644
index 0000000..c3ddfa7
--- /dev/null
+++ b/nvidia-device-plugin/README.md
@@ -0,0 +1,435 @@
+# NVIDIA Device Plugin for Kubernetes
+
+Enables GPU support in Kubernetes clusters by exposing NVIDIA GPUs as schedulable resources.
+
+## Overview
+
+This module deploys the NVIDIA device plugin using the official Helm chart with:
+
+- **NVIDIA Device Plugin** - Exposes GPUs as `nvidia.com/gpu` resources
+- **Node Feature Discovery (NFD)** - Automatically detects GPU hardware on nodes
+- **GPU Feature Discovery (GFD)** - Discovers and labels GPU capabilities
+- **k3s Integration** - Automatic nvidia runtime detection for k3s clusters
+
+## Prerequisites
+
+### Host System Requirements
+
+The following components must be installed on each GPU node **before** deploying the device plugin:
+
+#### 1. NVIDIA GPU Driver
+
+Install the appropriate NVIDIA GPU driver for your system.
+
+**Arch Linux:**
+
+```bash
+# Install NVIDIA driver
+sudo pacman -S nvidia nvidia-utils
+```
+
+**Ubuntu/Debian:**
+
+```bash
+# Install NVIDIA driver (replace <version> with the desired driver branch)
+sudo apt-get update
+sudo apt-get install -y nvidia-driver-<version>
+```
+
+Verify driver installation:
+
+```bash
+nvidia-smi
+```
+
+#### 2. NVIDIA Container Toolkit
+
+The NVIDIA Container Toolkit allows containers to access GPU devices.
+
+**Arch Linux:**
+
+```bash
+# Install NVIDIA Container Toolkit
+sudo pacman -S nvidia-container-toolkit
+
+# Configure containerd runtime
+sudo nvidia-ctk runtime configure --runtime=containerd
+
+# Restart containerd
+sudo systemctl restart containerd
+```
+
+**Ubuntu/Debian:**
+
+```bash
+# Add NVIDIA repository
+distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
+curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
+curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
+  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
+
+# Install NVIDIA Container Toolkit
+sudo apt-get update
+sudo apt-get install -y nvidia-container-toolkit
+
+# Configure containerd runtime
+sudo nvidia-ctk runtime configure --runtime=containerd
+
+# Restart containerd
+sudo systemctl restart containerd
+```
+
+#### 3. k3s Runtime Configuration
+
+**Important**: k3s automatically detects the nvidia runtime if the NVIDIA Container Toolkit is installed. **No manual configuration is required**.
+
+After installing the NVIDIA Container Toolkit, restart k3s and verify that the runtime was detected:
+
+```bash
+# Restart k3s to pick up the nvidia runtime
+sudo systemctl restart k3s
+
+# Verify the nvidia runtime is detected
+sudo grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
+```
+
+Expected output:
+
+```toml
+[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia']
+[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia'.options]
+  BinaryName = "/usr/bin/nvidia-container-runtime"
+[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia-cdi']
+[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia-cdi'.options]
+  BinaryName = "/usr/bin/nvidia-container-runtime.cdi"
+```
+
+**Note**: Do **NOT** create `/var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl` manually, as this can break k3s networking. k3s handles runtime detection automatically.
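+
+As a quick preflight check before installing the chart, the snippet below verifies all three host prerequisites in one pass. It is a sketch that assumes k3s is running directly on the GPU node; adjust the containerd config path for other setups.
+
+```bash
+# Preflight check on a GPU node (sketch; path assumes k3s on the node itself)
+echo "--- NVIDIA driver ---"
+nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
+
+echo "--- NVIDIA Container Toolkit ---"
+command -v nvidia-container-runtime || echo "nvidia-container-runtime not found"
+
+echo "--- k3s containerd runtime detection ---"
+sudo grep -q nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml \
+  && echo "nvidia runtime entries found" \
+  || echo "no nvidia runtime entries; restart k3s after installing the toolkit"
+```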
+
+## Installation
+
+### Deploy NVIDIA Device Plugin
+
+```bash
+just nvidia-device-plugin::install
+```
+
+This installs:
+
+- NVIDIA Device Plugin DaemonSet
+- Node Feature Discovery (NFD) for GPU hardware detection
+- GPU Feature Discovery (GFD) for GPU capability labeling
+
+### Verify Installation
+
+```bash
+just nvidia-device-plugin::verify
+```
+
+Expected output:
+
+```plain
+=== GPU Resources per Node ===
+node1: 1 GPUs
+
+=== Device Plugin Pods ===
+NAME                                               READY   STATUS    RESTARTS   AGE
+nvidia-device-plugin-xxxxx                         1/1     Running   0          1m
+nvidia-device-plugin-gpu-feature-discovery-xxxxx   1/1     Running   0          1m
+```
+
+### Test GPU Access
+
+```bash
+just nvidia-device-plugin::test
+```
+
+This creates a test pod that runs `nvidia-smi` and displays GPU information.
+
+Expected output:
+
+```plain
+=== GPU Test Output ===
++-----------------------------------------------------------------------------------------+
+| NVIDIA-SMI 580.105.08             Driver Version: 580.105.08     CUDA Version: 13.0     |
++-----------------------------------------+------------------------+----------------------+
+| GPU  Name                Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
+...
+```
+
+## Configuration
+
+Environment variables (set in `.env.local` or override):
+
+```bash
+NVIDIA_DEVICE_PLUGIN_NAMESPACE=nvidia-device-plugin  # Kubernetes namespace
+NVIDIA_DEVICE_PLUGIN_VERSION=0.18.0                  # Helm chart version
+```
+
+## Usage
+
+### Using GPUs in Pods
+
+To use GPUs in your pods, specify two things:
+
+1. **runtimeClassName**: `nvidia` - Uses NVIDIA Container Runtime
+2. **resources.limits**: `nvidia.com/gpu: 1` - Requests GPU allocation
+
+Example pod configuration:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: gpu-pod
+spec:
+  runtimeClassName: nvidia
+  containers:
+    - name: cuda-container
+      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
+      command: ["nvidia-smi"]
+      resources:
+        limits:
+          nvidia.com/gpu: 1
+```
+
+### Using GPUs in JupyterHub
+
+Configure JupyterHub to allow GPU access for notebook servers:
+
+```yaml
+# jupyterhub values.yaml
+singleuser:
+  runtimeClassName: nvidia
+  extraResource:
+    limits:
+      nvidia.com/gpu: "1"
+```
+
+After deploying JupyterHub with this configuration, users can access GPUs in their notebooks:
+
+```python
+import torch
+
+# Check GPU availability
+print(torch.cuda.is_available())      # True
+print(torch.cuda.device_count())      # 1
+print(torch.cuda.get_device_name(0))  # NVIDIA GeForce RTX 4070 Ti
+```
+
+### Multiple GPUs
+
+To request multiple GPUs:
+
+```yaml
+resources:
+  limits:
+    nvidia.com/gpu: 2
+```
+
+### GPU Sharing (Time-Slicing)
+
+The device plugin supports GPU time-slicing for sharing GPUs across multiple pods. See the [official documentation](https://github.com/NVIDIA/k8s-device-plugin#shared-access-to-gpus-with-cuda-time-slicing) for configuration details.
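+
+As a starting point, the sketch below enables 4-way time-slicing by passing an embedded config to the Helm chart. The `config.map`/`config.default` values keys and the config schema are taken from the upstream documentation and are assumptions here; verify them against chart version 0.18.0 before relying on this.
+
+```bash
+# Sketch: advertise each physical GPU as 4 shareable replicas via time-slicing.
+# The values keys below (config.map, config.default) are assumed from the
+# upstream chart docs; confirm them for the pinned chart version.
+cat > time-slicing-values.yaml <<'EOF'
+config:
+  default: "default"
+  map:
+    default: |-
+      version: v1
+      sharing:
+        timeSlicing:
+          resources:
+            - name: nvidia.com/gpu
+              replicas: 4
+EOF
+
+helm upgrade --install nvidia-device-plugin nvdp/nvidia-device-plugin \
+  --namespace nvidia-device-plugin \
+  --values values.yaml \
+  --values time-slicing-values.yaml
+```
+
+With `replicas: 4`, a node with one physical GPU advertises `nvidia.com/gpu: 4`; note that the sharing pods get no memory or fault isolation from each other.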
+
+## Architecture
+
+```plain
+GPU Node (Arch Linux)
+ ├─ NVIDIA Driver (nvidia, nvidia-utils)
+ ├─ NVIDIA Container Toolkit (nvidia-container-runtime)
+ └─ k3s with containerd
+     ├─ Auto-detected nvidia runtime
+     └─ NVIDIA Device Plugin (DaemonSet)
+         ├─ Discovers GPUs on node
+         ├─ Exposes nvidia.com/gpu resource
+         └─ Manages GPU allocation to pods
+              ↓
+       User Pods (with runtimeClassName: nvidia)
+         ├─ GPU device access (/dev/nvidia*)
+         ├─ CUDA libraries mounted
+         └─ nvidia-smi available
+```
+
+**Key Components**:
+
+- **NVIDIA Driver**: Kernel module for GPU hardware access
+- **NVIDIA Container Toolkit**: Container runtime hooks for GPU access
+- **k3s containerd**: Automatically detects nvidia runtime
+- **Device Plugin**: Kubernetes plugin that advertises GPU resources
+- **NFD**: Detects GPU hardware and labels nodes
+- **GFD**: Discovers GPU capabilities and features
+
+## Management
+
+### Check GPU Resources
+
+```bash
+# View GPU resources per node
+just nvidia-device-plugin::gpu-info
+```
+
+### Upgrade Device Plugin
+
+```bash
+# Bump NVIDIA_DEVICE_PLUGIN_VERSION, then re-run the install recipe
+just nvidia-device-plugin::install
+```
+
+### Uninstall
+
+```bash
+just nvidia-device-plugin::uninstall
+```
+
+This removes:
+
+- NVIDIA Device Plugin DaemonSet
+- Node Feature Discovery components
+- GPU Feature Discovery components
+- Helm release (the namespace itself is left in place)
+
+**Note**: Host-level components (NVIDIA driver, Container Toolkit) are NOT removed and must be uninstalled manually if needed.
+
+## Troubleshooting
+
+### Check Device Plugin Pods
+
+```bash
+kubectl get pods -n nvidia-device-plugin
+```
+
+Expected pods:
+
+- `nvidia-device-plugin-*` - Device plugin daemon (one per GPU node)
+- `nvidia-device-plugin-gpu-feature-discovery-*` - GPU feature discovery (one per GPU node)
+- `nvidia-device-plugin-node-feature-discovery-master-*` - NFD master
+- `nvidia-device-plugin-node-feature-discovery-gc-*` - NFD garbage collector
+
+### GPU Not Detected
+
+**Symptom**: `just nvidia-device-plugin::verify` shows `0 GPUs`
+
+**Possible Causes**:
+
+1. **NVIDIA driver not installed**
+
+   ```bash
+   # Check if driver is loaded
+   nvidia-smi
+   ```
+
+   If this fails, install the NVIDIA driver on the host.
+
+2. **NVIDIA Container Toolkit not installed**
+
+   ```bash
+   # Check if nvidia-container-runtime exists
+   which nvidia-container-runtime
+   ```
+
+   If not found, install the NVIDIA Container Toolkit.
+
+3. **k3s did not detect the nvidia runtime**
+
+   ```bash
+   # Check containerd config
+   sudo grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
+   ```
+
+   If empty, restart k3s:
+
+   ```bash
+   sudo systemctl restart k3s
+   ```
+
+### Device Plugin Pod CrashLoopBackOff
+
+**Symptom**: Device plugin pod shows `CrashLoopBackOff` status
+
+**Check logs**:
+
+```bash
+kubectl logs -n nvidia-device-plugin <pod-name>
+```
+
+**Common errors**:
+
+1. **"invalid device discovery strategy"**
+
+   - Cause: NVIDIA Container Toolkit not configured properly
+   - Solution: Run `sudo nvidia-ctk runtime configure --runtime=containerd` and restart containerd
+
+2. **"failed to create containerd task"**
+
+   - Cause: containerd cannot find the nvidia runtime
+   - Solution: Verify `/usr/bin/nvidia-container-runtime` exists and restart k3s
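+
+If neither error matches, the following commands usually surface the underlying cause (`<pod-name>` is a placeholder for the failing device plugin pod):
+
+```bash
+# Events and container status for the failing pod
+kubectl describe pod -n nvidia-device-plugin <pod-name>
+
+# On the GPU node: recent k3s/containerd log lines mentioning nvidia
+sudo journalctl -u k3s --since "10 minutes ago" | grep -i nvidia
+```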
+
+### Pod Cannot Access GPU
+
+**Symptom**: Pod starts but `nvidia-smi` fails with "executable file not found"
+
+**Cause**: Pod does not have `runtimeClassName: nvidia` specified
+
+**Solution**: Add `runtimeClassName: nvidia` to the pod spec:
+
+```yaml
+spec:
+  runtimeClassName: nvidia  # Required!
+  containers:
+    - name: gpu-container
+      resources:
+        limits:
+          nvidia.com/gpu: 1
+```
+
+### k3s Node NotReady After Configuration
+
+**Symptom**: Node shows `NotReady` status with "cni plugin not initialized" error
+
+**Cause**: Invalid `/var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl` file
+
+**Solution**: Remove the file and restart k3s:
+
+```bash
+sudo rm /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
+sudo systemctl restart k3s
+```
+
+k3s will automatically detect the nvidia runtime without manual configuration.
+
+### Check NVIDIA Runtime in Pods
+
+```bash
+# Run a test pod to verify GPU access
+kubectl apply -f nvidia-device-plugin/gpu-test-pod.yaml
+
+# Check logs
+kubectl logs gpu-test
+
+# Clean up
+kubectl delete pod gpu-test
+```
+
+## Configuration Files
+
+Key configuration files:
+
+- `values.yaml` - Helm chart values with NFD and GFD enabled
+- `gpu-test-pod.yaml` - Test pod for verifying GPU access
+- `justfile` - Task recipes for installation and management
+
+## Security Considerations
+
+- **Privileged Access**: Device plugin pods run with privileged access to manage GPU devices
+- **Host Path Mounts**: Pods mount `/dev` and other host paths for GPU access
+- **Runtime Security**: The nvidia runtime is only used by pods that explicitly set `runtimeClassName: nvidia`; all other workloads keep the default runc runtime
+- **Resource Limits**: GPUs are allocated exclusively to pods (no overcommit by default)
+- **Driver Compatibility**: Ensure the NVIDIA driver version is compatible with the CUDA version in containers (see the label check below)
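+
+One way to check driver and CUDA compatibility from the cluster side is to dump the GPU-related node labels published by GFD/NFD. This is a sketch; the exact label names under the `nvidia.com/` prefix vary by GFD version.
+
+```bash
+# List GFD/NFD-published GPU labels (driver, CUDA, product, memory, ...) per node
+kubectl get nodes -o json | jq -r '
+  .items[]
+  | "\(.metadata.name):",
+    (.metadata.labels
+     | to_entries[]
+     | select(.key | startswith("nvidia.com/"))
+     | "  \(.key)=\(.value)")'
+```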
+
+## References
+
+- [NVIDIA Device Plugin GitHub](https://github.com/NVIDIA/k8s-device-plugin)
+- [NVIDIA Container Toolkit Documentation](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)
+- [k3s Advanced Configuration](https://docs.k3s.io/advanced)
+- [Kubernetes Device Plugins](https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/)
+- [Node Feature Discovery](https://github.com/kubernetes-sigs/node-feature-discovery)
diff --git a/nvidia-device-plugin/gpu-test-pod.yaml b/nvidia-device-plugin/gpu-test-pod.yaml
new file mode 100644
index 0000000..ea39769
--- /dev/null
+++ b/nvidia-device-plugin/gpu-test-pod.yaml
@@ -0,0 +1,14 @@
+apiVersion: v1
+kind: Pod
+metadata:
+  name: gpu-test
+spec:
+  restartPolicy: OnFailure
+  runtimeClassName: nvidia
+  containers:
+    - name: cuda-container
+      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
+      command: ["nvidia-smi"]
+      resources:
+        limits:
+          nvidia.com/gpu: 1
diff --git a/nvidia-device-plugin/justfile b/nvidia-device-plugin/justfile
new file mode 100644
index 0000000..9a77ca1
--- /dev/null
+++ b/nvidia-device-plugin/justfile
@@ -0,0 +1,75 @@
+set fallback := true
+
+export NVIDIA_DEVICE_PLUGIN_NAMESPACE := env("NVIDIA_DEVICE_PLUGIN_NAMESPACE", "nvidia-device-plugin")
+export NVIDIA_DEVICE_PLUGIN_VERSION := env("NVIDIA_DEVICE_PLUGIN_VERSION", "0.18.0")
+
+[private]
+default:
+    @just --list --unsorted --list-submodules
+
+# Install NVIDIA device plugin for Kubernetes
+install:
+    #!/bin/bash
+    set -euo pipefail
+
+    if ! helm repo list | grep -q "^nvdp"; then
+        helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
+    fi
+    helm repo update nvdp
+
+    helm upgrade --install nvidia-device-plugin nvdp/nvidia-device-plugin \
+        --namespace ${NVIDIA_DEVICE_PLUGIN_NAMESPACE} \
+        --create-namespace \
+        --version ${NVIDIA_DEVICE_PLUGIN_VERSION} \
+        --values values.yaml \
+        --wait
+
+    echo "✓ NVIDIA device plugin installed successfully"
+    echo ""
+    echo "Verify GPU availability with:"
+    echo "  just nvidia-device-plugin::verify"
+
+# Verify GPU resources are available
+verify:
+    #!/bin/bash
+    set -euo pipefail
+
+    echo "=== GPU Resources per Node ==="
+    kubectl get nodes -o json | jq -r '.items[] | "\(.metadata.name): \(.status.capacity["nvidia.com/gpu"] // "0") GPUs"'
+
+    echo ""
+    echo "=== Device Plugin Pods ==="
+    kubectl get pods -n ${NVIDIA_DEVICE_PLUGIN_NAMESPACE} -l app.kubernetes.io/name=nvidia-device-plugin
+
+    echo ""
+    echo "Test GPU access with:"
+    echo "  just nvidia-device-plugin::test"
+
+# Show detailed GPU information
+gpu-info:
+    kubectl get nodes -o json | jq -r '.items[] | select(.status.capacity["nvidia.com/gpu"] != null) | {name: .metadata.name, gpus: .status.capacity["nvidia.com/gpu"], allocatable: .status.allocatable["nvidia.com/gpu"]}'
+
+# Test GPU access by running nvidia-smi in a pod
+test:
+    #!/bin/bash
+    set -euo pipefail
+
+    kubectl delete pod gpu-test --ignore-not-found=true
+    kubectl apply -f gpu-test-pod.yaml
+
+    echo "Waiting for pod to complete..."
+    kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/gpu-test --timeout=60s || true
+
+    echo ""
+    echo "=== GPU Test Output ==="
+    kubectl logs gpu-test
+
+    kubectl delete pod gpu-test
+
+# Uninstall NVIDIA device plugin
+uninstall:
+    #!/bin/bash
+    set -euo pipefail
+
+    helm uninstall nvidia-device-plugin -n ${NVIDIA_DEVICE_PLUGIN_NAMESPACE} || true
+    echo "✓ NVIDIA device plugin uninstalled"
diff --git a/nvidia-device-plugin/values.yaml b/nvidia-device-plugin/values.yaml
new file mode 100644
index 0000000..3514ee7
--- /dev/null
+++ b/nvidia-device-plugin/values.yaml
@@ -0,0 +1,10 @@
+# Enable GPU Feature Discovery
+gfd:
+  enabled: true
+
+# Enable Node Feature Discovery (dependency)
+nfd:
+  enabled: true
+
+# Configure runtime for k3s
+runtimeClassName: "nvidia"