# NVIDIA Device Plugin for Kubernetes

Enables GPU support in Kubernetes clusters by exposing NVIDIA GPUs as schedulable resources.

## Overview

This module deploys the NVIDIA device plugin using the official Helm chart with:

- **NVIDIA Device Plugin** - Exposes GPUs as `nvidia.com/gpu` resources
- **Node Feature Discovery (NFD)** - Automatically detects GPU hardware on nodes
- **GPU Feature Discovery (GFD)** - Discovers and labels GPU capabilities
- **k3s Integration** - Automatic nvidia runtime detection for k3s clusters

## Prerequisites

### Host System Requirements

The following components must be installed on each GPU node **before** deploying the device plugin:

#### 1. NVIDIA GPU Driver

Install the appropriate NVIDIA GPU driver for your system.

**Arch Linux:**

```bash
# Install NVIDIA driver
sudo pacman -S nvidia nvidia-utils
```

**Ubuntu/Debian:**

```bash
# Install NVIDIA driver
sudo apt-get update
sudo apt-get install -y nvidia-driver-<version>  # replace <version> with the driver series for your GPU
```

Verify driver installation:

```bash
nvidia-smi
```

#### 2. NVIDIA Container Toolkit

The NVIDIA Container Toolkit allows containers to access GPU devices.

**Arch Linux:**

```bash
# Install NVIDIA Container Toolkit
sudo pacman -S nvidia-container-toolkit

# Configure containerd runtime
sudo nvidia-ctk runtime configure --runtime=containerd

# Restart containerd
sudo systemctl restart containerd
```

**Ubuntu/Debian:**

```bash
# Add NVIDIA repository
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install NVIDIA Container Toolkit
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Configure containerd runtime
sudo nvidia-ctk runtime configure --runtime=containerd

# Restart containerd
sudo systemctl restart containerd
```

#### 3. k3s Runtime Configuration

**Important**: k3s automatically detects the nvidia runtime if the NVIDIA Container Toolkit is installed. **No manual configuration is required.**

After installing the NVIDIA Container Toolkit, restart k3s and verify automatic detection:

```bash
# Restart k3s to detect the nvidia runtime
sudo systemctl restart k3s

# Verify the nvidia runtime is detected
sudo grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
```

Expected output:

```toml
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia']
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia'.options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia-cdi']
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia-cdi'.options]
  BinaryName = "/usr/bin/nvidia-container-runtime.cdi"
```

**Note**: Do **NOT** create `/var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl` manually, as this can break k3s networking. k3s handles runtime detection automatically.
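The pod examples later in this document set `runtimeClassName: nvidia`, which requires a `nvidia` RuntimeClass object in the cluster. Whether that object already exists depends on your k3s version and the chart values, so treat the manifest below as a minimal sketch to apply only if `kubectl get runtimeclass nvidia` reports it missing; the `handler` value must match the runtime name shown in the containerd config above.

```yaml
# Minimal RuntimeClass mapping pods to the nvidia containerd runtime.
# Apply only if "kubectl get runtimeclass nvidia" shows it does not exist.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia  # must match the runtime name k3s detected in containerd
```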
## Installation

### Deploy NVIDIA Device Plugin

```bash
just nvidia-device-plugin::install
```

This installs:

- NVIDIA Device Plugin DaemonSet
- Node Feature Discovery (NFD) for GPU hardware detection
- GPU Feature Discovery (GFD) for GPU capability labeling

### Verify Installation

```bash
just nvidia-device-plugin::verify
```

Expected output:

```plain
=== GPU Resources per Node ===
node1: 1 GPUs

=== Device Plugin Pods ===
NAME                                               READY   STATUS    RESTARTS   AGE
nvidia-device-plugin-xxxxx                         1/1     Running   0          1m
nvidia-device-plugin-gpu-feature-discovery-xxxxx   1/1     Running   0          1m
```

### Test GPU Access

```bash
just nvidia-device-plugin::test
```

This creates a test pod that runs `nvidia-smi` and displays GPU information.

Expected output:

```plain
=== GPU Test Output ===
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.08             Driver Version: 580.105.08     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
...
```

## Configuration

Environment variables (set in `.env.local` or override):

```bash
NVIDIA_DEVICE_PLUGIN_NAMESPACE=nvidia-device-plugin  # Kubernetes namespace
NVIDIA_DEVICE_PLUGIN_VERSION=0.18.0                  # Helm chart version
```

## Usage

### Using GPUs in Pods

To use GPUs in your pods, specify two things:

1. **runtimeClassName**: `nvidia` - Uses the NVIDIA Container Runtime
2. **resources.limits**: `nvidia.com/gpu: 1` - Requests GPU allocation

Example pod configuration:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  runtimeClassName: nvidia
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```

### Using GPUs in JupyterHub

Configure JupyterHub to allow GPU access for notebook servers:

```yaml
# jupyterhub values.yaml
singleuser:
  runtimeClassName: nvidia
  extraResource:
    limits:
      nvidia.com/gpu: "1"
```

After deploying JupyterHub with this configuration, users can access GPUs in their notebooks:

```python
import torch

# Check GPU availability
print(torch.cuda.is_available())      # True
print(torch.cuda.device_count())      # 1
print(torch.cuda.get_device_name(0))  # NVIDIA GeForce RTX 4070 Ti
```

### Multiple GPUs

To request multiple GPUs:

```yaml
resources:
  limits:
    nvidia.com/gpu: 2
```

### GPU Sharing (Time-Slicing)

The device plugin supports GPU time-slicing for sharing a GPU across multiple pods. See the [official documentation](https://github.com/NVIDIA/k8s-device-plugin#shared-access-to-gpus-with-cuda-time-slicing) for configuration details; a sketch of the configuration format follows.
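The exact wiring depends on this module's `values.yaml`, which is not reproduced here, but the upstream chart's time-slicing configuration follows the shape sketched below: each physical GPU is advertised as several `nvidia.com/gpu` replicas that pods can request independently. The `replicas: 4` value is an assumption to tune for your workloads.

```yaml
# Sketch of the upstream device plugin's time-slicing config
# (typically supplied to the chart as a named config / ConfigMap).
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4  # one physical GPU appears as 4 schedulable GPUs
```

Note that time-sliced replicas share GPU memory and compute without isolation, so this is best suited to bursty or interactive workloads such as notebooks.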
## Architecture

```plain
GPU Node (Arch Linux)
├─ NVIDIA Driver (nvidia, nvidia-utils)
├─ NVIDIA Container Toolkit (nvidia-container-runtime)
└─ k3s with containerd
   ├─ Auto-detected nvidia runtime
   └─ NVIDIA Device Plugin (DaemonSet)
      ├─ Discovers GPUs on node
      ├─ Exposes nvidia.com/gpu resource
      └─ Manages GPU allocation to pods
         ↓
User Pods (with runtimeClassName: nvidia)
├─ GPU device access (/dev/nvidia*)
├─ CUDA libraries mounted
└─ nvidia-smi available
```

**Key Components**:

- **NVIDIA Driver**: Kernel module for GPU hardware access
- **NVIDIA Container Toolkit**: Container runtime hooks for GPU access
- **k3s containerd**: Automatically detects the nvidia runtime
- **Device Plugin**: Kubernetes plugin that advertises GPU resources
- **NFD**: Detects GPU hardware and labels nodes
- **GFD**: Discovers GPU capabilities and features

## Management

### Check GPU Resources

```bash
# View GPU resources per node
just nvidia-device-plugin::gpu-info
```

### Upgrade Device Plugin

```bash
# Update to the latest version
just nvidia-device-plugin::install
```

### Uninstall

```bash
just nvidia-device-plugin::uninstall
```

This removes:

- NVIDIA Device Plugin DaemonSet
- Node Feature Discovery components
- GPU Feature Discovery components
- Helm release and namespace

**Note**: Host-level components (NVIDIA driver, Container Toolkit) are NOT removed and must be uninstalled manually if needed.

## Troubleshooting

### Check Device Plugin Pods

```bash
kubectl get pods -n nvidia-device-plugin
```

Expected pods:

- `nvidia-device-plugin-*` - Device plugin daemon (one per GPU node)
- `nvidia-device-plugin-gpu-feature-discovery-*` - GPU feature discovery (one per GPU node)
- `nvidia-device-plugin-node-feature-discovery-master-*` - NFD master
- `nvidia-device-plugin-node-feature-discovery-gc-*` - NFD garbage collector

### GPU Not Detected

**Symptom**: `just nvidia-device-plugin::verify` shows `0 GPUs`

**Possible Causes**:

1. **NVIDIA driver not installed**

   ```bash
   # Check if the driver is loaded
   nvidia-smi
   ```

   If this fails, install the NVIDIA driver on the host.

2. **NVIDIA Container Toolkit not installed**

   ```bash
   # Check if nvidia-container-runtime exists
   which nvidia-container-runtime
   ```

   If not found, install the NVIDIA Container Toolkit.

3. **k3s did not detect the nvidia runtime**

   ```bash
   # Check containerd config
   sudo grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
   ```

   If empty, restart k3s:

   ```bash
   sudo systemctl restart k3s
   ```

### Device Plugin Pod CrashLoopBackOff

**Symptom**: Device plugin pod shows `CrashLoopBackOff` status

**Check logs**:

```bash
kubectl logs -n nvidia-device-plugin <pod-name>
```

**Common errors**:

1. **"invalid device discovery strategy"**
   - Cause: NVIDIA Container Toolkit not configured properly
   - Solution: Run `sudo nvidia-ctk runtime configure --runtime=containerd` and restart containerd

2. **"failed to create containerd task"**
   - Cause: containerd cannot find the nvidia runtime
   - Solution: Verify `/usr/bin/nvidia-container-runtime` exists and restart k3s

### Pod Cannot Access GPU

**Symptom**: Pod starts but `nvidia-smi` fails with "executable file not found"

**Cause**: Pod does not have `runtimeClassName: nvidia` specified

**Solution**: Add `runtimeClassName: nvidia` to the pod spec:

```yaml
spec:
  runtimeClassName: nvidia  # Required!
  containers:
    - name: gpu-container
      resources:
        limits:
          nvidia.com/gpu: 1
```
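If the field is already present and the error persists, two quick checks can help narrow things down. Here `<node-name>` is a placeholder for one of your GPU nodes, and the `nvidia.com/` label names come from GFD/NFD defaults, so they may differ slightly in your setup:

```bash
# Confirm the RuntimeClass referenced by the pod spec is registered
kubectl get runtimeclass nvidia

# Check that the node advertises GPU capacity and GFD/NFD labels
kubectl describe node <node-name> | grep 'nvidia.com/'
```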
### k3s Node NotReady After Configuration

**Symptom**: Node shows `NotReady` status with a "cni plugin not initialized" error

**Cause**: Invalid `/var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl` file

**Solution**: Remove the file and restart k3s:

```bash
sudo rm /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
sudo systemctl restart k3s
```

k3s will automatically detect the nvidia runtime without manual configuration.

### Check NVIDIA Runtime in Pods

```bash
# Run a test pod to verify GPU access
kubectl apply -f nvidia-device-plugin/gpu-test-pod.yaml

# Check logs
kubectl logs gpu-test

# Clean up
kubectl delete pod gpu-test
```

## Configuration Files

Key configuration files:

- `values.yaml` - Helm chart values with NFD and GFD enabled
- `gpu-test-pod.yaml` - Test pod for verifying GPU access
- `justfile` - Task recipes for installation and management

## Security Considerations

- **Privileged Access**: Device plugin pods run with privileged access to manage GPU devices
- **Host Path Mounts**: Pods mount `/dev` and other host paths for GPU access
- **Runtime Security**: The NVIDIA runtime is isolated from the default runc runtime
- **Resource Limits**: GPUs are allocated exclusively to pods (no overcommit by default)
- **Driver Compatibility**: Ensure the NVIDIA driver version is compatible with the CUDA version in containers

## References

- [NVIDIA Device Plugin GitHub](https://github.com/NVIDIA/k8s-device-plugin)
- [NVIDIA Container Toolkit Documentation](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)
- [k3s Advanced Configuration](https://docs.k3s.io/advanced)
- [Kubernetes Device Plugins](https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/)
- [Node Feature Discovery](https://github.com/kubernetes-sigs/node-feature-discovery)