feat(nvidia-device-plugin): install nvidia-device-plugin

This commit is contained in:
Masaki Yatsu
2025-11-21 00:30:02 +09:00
parent b958a13c76
commit 71b41c6dbf
5 changed files with 535 additions and 0 deletions


@@ -25,6 +25,7 @@ mod longhorn
mod metabase
mod mlflow
mod minio
mod nvidia-device-plugin
mod fairwinds-polaris
mod oauth2-proxy
mod postgres


@@ -0,0 +1,435 @@
# NVIDIA Device Plugin for Kubernetes
Enables GPU support in Kubernetes clusters by exposing NVIDIA GPUs as schedulable resources.
## Overview
This module deploys the NVIDIA device plugin using the official Helm chart with:
- **NVIDIA Device Plugin** - Exposes GPUs as `nvidia.com/gpu` resources
- **Node Feature Discovery (NFD)** - Automatically detects GPU hardware on nodes
- **GPU Feature Discovery (GFD)** - Discovers and labels GPU capabilities
- **k3s Integration** - Automatic nvidia runtime detection for k3s clusters
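Once deployed, NFD/GFD surface these discoveries as node labels under the `nvidia.com/` prefix. A quick way to inspect them (the node name is illustrative; replace `node1` with your GPU node):
```bash
# List the nvidia.com/* labels that NFD/GFD attach to a GPU node
kubectl get node node1 -o json \
  | jq '.metadata.labels | with_entries(select(.key | startswith("nvidia.com")))'
```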
## Prerequisites
### Host System Requirements
The following components must be installed on each GPU node **before** deploying the device plugin:
#### 1. NVIDIA GPU Driver
Install the appropriate NVIDIA GPU driver for your system.
**Arch Linux:**
```bash
# Install NVIDIA driver
sudo pacman -S nvidia nvidia-utils
```
**Ubuntu/Debian:**
```bash
# Install NVIDIA driver
sudo apt-get update
sudo apt-get install -y nvidia-driver-<version>
```
Verify driver installation:
```bash
nvidia-smi
```
#### 2. NVIDIA Container Toolkit
The NVIDIA Container Toolkit allows containers to access GPU devices.
**Arch Linux:**
```bash
# Install NVIDIA Container Toolkit
sudo pacman -S nvidia-container-toolkit
# Configure containerd runtime
sudo nvidia-ctk runtime configure --runtime=containerd
# Restart containerd
sudo systemctl restart containerd
```
**Ubuntu/Debian:**
```bash
# Add NVIDIA repository
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# Install NVIDIA Container Toolkit
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
# Configure containerd runtime
sudo nvidia-ctk runtime configure --runtime=containerd
# Restart containerd
sudo systemctl restart containerd
```
#### 3. k3s Runtime Configuration
**Important**: k3s automatically detects the nvidia runtime if NVIDIA Container Toolkit is installed. **No manual configuration is required**.
After installing NVIDIA Container Toolkit and restarting k3s, verify automatic detection:
```bash
# Restart k3s to detect nvidia runtime
sudo systemctl restart k3s
# Verify nvidia runtime is detected
sudo grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
```
Expected output:
```toml
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia']
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia'.options]
BinaryName = "/usr/bin/nvidia-container-runtime"
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia-cdi']
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia-cdi'.options]
BinaryName = "/usr/bin/nvidia-container-runtime.cdi"
```
**Note**: Do **NOT** create `/var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl` manually, as this can break k3s networking. k3s handles runtime detection automatically.
## Installation
### Deploy NVIDIA Device Plugin
```bash
just nvidia-device-plugin::install
```
This installs:
- NVIDIA Device Plugin DaemonSet
- Node Feature Discovery (NFD) for GPU hardware detection
- GPU Feature Discovery (GFD) for GPU capability labeling
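To see exactly what the Helm release created before running the verify recipe, a quick look (release name and namespace are this module's defaults):
```bash
# Helm release plus the DaemonSets and pods it created
helm list -n nvidia-device-plugin
kubectl get daemonsets,pods -n nvidia-device-plugin
```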
### Verify Installation
```bash
just nvidia-device-plugin::verify
```
Expected output:
```plain
=== GPU Resources per Node ===
node1: 1 GPUs
=== Device Plugin Pods ===
NAME                                               READY   STATUS    RESTARTS   AGE
nvidia-device-plugin-xxxxx                         1/1     Running   0          1m
nvidia-device-plugin-gpu-feature-discovery-xxxxx   1/1     Running   0          1m
```
### Test GPU Access
```bash
just nvidia-device-plugin::test
```
This creates a test pod that runs `nvidia-smi` and displays GPU information.
Expected output:
```plain
=== GPU Test Output ===
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.08 Driver Version: 580.105.08 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
...
```
## Configuration
Environment variables (set them in `.env.local` or override them in the environment when invoking `just`):
```bash
NVIDIA_DEVICE_PLUGIN_NAMESPACE=nvidia-device-plugin # Kubernetes namespace
NVIDIA_DEVICE_PLUGIN_VERSION=0.18.0 # Helm chart version
```
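The justfile reads these via `env()` with the defaults shown above, so a one-off override in the invoking environment also works; how `.env.local` is sourced depends on your shell setup. For example:
```bash
# One-off override without editing .env.local (values shown are the defaults)
NVIDIA_DEVICE_PLUGIN_NAMESPACE=nvidia-device-plugin \
NVIDIA_DEVICE_PLUGIN_VERSION=0.18.0 \
just nvidia-device-plugin::install
```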
## Usage
### Using GPUs in Pods
To use GPUs in your pods, specify two things:
1. **runtimeClassName**: `nvidia` - Uses NVIDIA Container Runtime
2. **resources.limits**: `nvidia.com/gpu: 1` - Requests GPU allocation
Example pod configuration:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  runtimeClassName: nvidia
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
```
### Using GPUs in JupyterHub
Configure JupyterHub to allow GPU access for notebook servers:
```yaml
# jupyterhub values.yaml
singleuser:
  runtimeClassName: nvidia
  extraResource:
    limits:
      nvidia.com/gpu: "1"
```
After deploying JupyterHub with this configuration, users can access GPUs in their notebooks:
```python
import torch
# Check GPU availability
print(torch.cuda.is_available()) # True
print(torch.cuda.device_count()) # 1
print(torch.cuda.get_device_name(0)) # NVIDIA GeForce RTX 4070 Ti
```
### Multiple GPUs
To request multiple GPUs:
```yaml
resources:
  limits:
    nvidia.com/gpu: 2
```
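A multi-GPU request is only schedulable if a single node has that many allocatable GPUs; the same `jq` query the `gpu-info` recipe uses shows the per-node count:
```bash
# Allocatable GPU count per node; a pod requesting 2 GPUs needs one node with >= 2
kubectl get nodes -o json \
  | jq -r '.items[] | "\(.metadata.name): \(.status.allocatable["nvidia.com/gpu"] // "0") allocatable GPUs"'
```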
### GPU Sharing (Time-Slicing)
The device plugin supports GPU time-slicing for sharing GPUs across multiple pods. See the [official documentation](https://github.com/NVIDIA/k8s-device-plugin#shared-access-to-gpus-with-cuda-time-slicing) for configuration details.
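As a hedged sketch only: the upstream time-slicing config format looks like the one below, wired in via the chart's `config.*` values. The ConfigMap name here is made up, and the exact chart options should be confirmed against the documentation linked above for the chart version you deploy.
```bash
# Sketch: advertise each physical GPU as 4 nvidia.com/gpu resources via time-slicing.
# ConfigMap name "time-slicing-config" is illustrative; config.name is a chart value
# documented upstream; verify it for your chart version.
kubectl create configmap time-slicing-config -n nvidia-device-plugin --from-literal=config.yaml='
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4
'
helm upgrade --install nvidia-device-plugin nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin --version 0.18.0 --values values.yaml \
  --set config.name=time-slicing-config --wait
```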
## Architecture
```plain
GPU Node (Arch Linux)
├─ NVIDIA Driver (nvidia, nvidia-utils)
├─ NVIDIA Container Toolkit (nvidia-container-runtime)
└─ k3s with containerd
   ├─ Auto-detected nvidia runtime
   └─ NVIDIA Device Plugin (DaemonSet)
      ├─ Discovers GPUs on node
      ├─ Exposes nvidia.com/gpu resource
      └─ Manages GPU allocation to pods

User Pods (with runtimeClassName: nvidia)
├─ GPU device access (/dev/nvidia*)
├─ CUDA libraries mounted
└─ nvidia-smi available
```
**Key Components**:
- **NVIDIA Driver**: Kernel module for GPU hardware access
- **NVIDIA Container Toolkit**: Container runtime hooks for GPU access
- **k3s containerd**: Automatically detects nvidia runtime
- **Device Plugin**: Kubernetes plugin that advertises GPU resources
- **NFD**: Detects GPU hardware and labels nodes
- **GFD**: Discovers GPU capabilities and features
## Management
### Check GPU Resources
```bash
# View GPU resources per node
just nvidia-device-plugin::gpu-info
```
### Upgrade Device Plugin
```bash
# Update to latest version
just nvidia-device-plugin::install
```
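To see which chart versions are available before bumping `NVIDIA_DEVICE_PLUGIN_VERSION`:
```bash
# The nvdp repo is added by the install recipe
helm repo update nvdp
helm search repo nvdp/nvidia-device-plugin --versions | head -n 5
```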
### Uninstall
```bash
just nvidia-device-plugin::uninstall
```
This removes:
- NVIDIA Device Plugin DaemonSet
- Node Feature Discovery components
- GPU Feature Discovery components
- Helm release and namespace
**Note**: Host-level components (NVIDIA driver, Container Toolkit) are NOT removed and must be uninstalled manually if needed.
## Troubleshooting
### Check Device Plugin Pods
```bash
kubectl get pods -n nvidia-device-plugin
```
Expected pods:
- `nvidia-device-plugin-*` - Device plugin daemon (one per GPU node)
- `nvidia-device-plugin-gpu-feature-discovery-*` - GPU feature discovery (one per GPU node)
- `nvidia-device-plugin-node-feature-discovery-master-*` - NFD master
- `nvidia-device-plugin-node-feature-discovery-gc-*` - NFD garbage collector
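The device plugin and GFD pods come from DaemonSets, so each GPU node should run one of each; adding `-o wide` shows the node placement:
```bash
# Confirm one device plugin pod and one GFD pod per GPU node
kubectl get pods -n nvidia-device-plugin -o wide
```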
### GPU Not Detected
**Symptom**: `just nvidia-device-plugin::verify` shows `0 GPUs`
**Possible Causes**:
1. **NVIDIA driver not installed**
```bash
# Check if driver is loaded
nvidia-smi
```
If this fails, install NVIDIA driver on the host.
2. **NVIDIA Container Toolkit not installed**
```bash
# Check if nvidia-container-runtime exists
which nvidia-container-runtime
```
If not found, install NVIDIA Container Toolkit.
3. **k3s did not detect nvidia runtime**
```bash
# Check containerd config
sudo grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
```
If empty, restart k3s:
```bash
sudo systemctl restart k3s
```
### Device Plugin Pod CrashLoopBackOff
**Symptom**: Device plugin pod shows `CrashLoopBackOff` status
**Check logs**:
```bash
kubectl logs -n nvidia-device-plugin <pod-name>
```
**Common errors**:
1. **"invalid device discovery strategy"**
- Cause: NVIDIA Container Toolkit not configured properly
- Solution: Run `sudo nvidia-ctk runtime configure --runtime=containerd` and restart containerd
2. **"failed to create containerd task"**
- Cause: containerd cannot find nvidia runtime
- Solution: Verify `/usr/bin/nvidia-container-runtime` exists and restart k3s
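To pull the relevant logs from all device plugin pods at once, the same label selector used by the `verify` recipe works here:
```bash
# Tail recent logs from every device plugin pod, prefixed with the pod name
kubectl logs -n nvidia-device-plugin -l app.kubernetes.io/name=nvidia-device-plugin --tail=50 --prefix
```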
### Pod Cannot Access GPU
**Symptom**: Pod starts but `nvidia-smi` fails with "executable file not found"
**Cause**: Pod does not have `runtimeClassName: nvidia` specified
**Solution**: Add `runtimeClassName: nvidia` to pod spec:
```yaml
spec:
  runtimeClassName: nvidia  # Required!
  containers:
  - name: gpu-container
    resources:
      limits:
        nvidia.com/gpu: 1
```
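The `nvidia` RuntimeClass must also exist for `runtimeClassName: nvidia` to resolve. Recent k3s releases create it automatically once the runtime is detected; if yours does not, the sketch below creates it manually (the handler must match the containerd runtime name):
```bash
# Check whether the RuntimeClass exists
kubectl get runtimeclass nvidia
# If it is missing, create it (handler "nvidia" matches the auto-detected containerd runtime)
kubectl apply -f - <<'EOF'
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF
```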
### k3s Node NotReady After Configuration
**Symptom**: Node shows `NotReady` status with "cni plugin not initialized" error
**Cause**: Invalid `/var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl` file
**Solution**: Remove the file and restart k3s:
```bash
sudo rm /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
sudo systemctl restart k3s
```
k3s will automatically detect nvidia runtime without manual configuration.
### Check NVIDIA Runtime in Pods
```bash
# Run a test pod to verify GPU access
kubectl apply -f nvidia-device-plugin/gpu-test-pod.yaml
# Check logs
kubectl logs gpu-test
# Clean up
kubectl delete pod gpu-test
```
## Configuration Files
Key configuration files:
- `values.yaml` - Helm chart values with NFD and GFD enabled
- `gpu-test-pod.yaml` - Test pod for verifying GPU access
- `justfile` - Task recipes for installation and management
## Security Considerations
- **Privileged Access**: Device plugin pods run with privileged access to manage GPU devices
- **Host Path Mounts**: Pods mount `/dev` and other host paths for GPU access
- **Runtime Security**: NVIDIA runtime is isolated from default runc runtime
- **Resource Limits**: GPUs are allocated exclusively to pods (no overcommit by default)
- **Driver Compatibility**: Ensure NVIDIA driver version is compatible with CUDA version in containers
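For the driver/CUDA compatibility point, the host side can be checked directly on the GPU node; the CUDA runtime inside a container image must not require a newer driver than what the node reports:
```bash
# Installed driver version on the GPU node
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# Highest CUDA version this driver supports (also shown in the nvidia-smi banner)
nvidia-smi | grep "CUDA Version"
```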
## References
- [NVIDIA Device Plugin GitHub](https://github.com/NVIDIA/k8s-device-plugin)
- [NVIDIA Container Toolkit Documentation](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)
- [k3s Advanced Configuration](https://docs.k3s.io/advanced)
- [Kubernetes Device Plugins](https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/)
- [Node Feature Discovery](https://github.com/kubernetes-sigs/node-feature-discovery)


@@ -0,0 +1,14 @@
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1


@@ -0,0 +1,75 @@
set fallback := true

export NVIDIA_DEVICE_PLUGIN_NAMESPACE := env("NVIDIA_DEVICE_PLUGIN_NAMESPACE", "nvidia-device-plugin")
export NVIDIA_DEVICE_PLUGIN_VERSION := env("NVIDIA_DEVICE_PLUGIN_VERSION", "0.18.0")

[private]
default:
    @just --list --unsorted --list-submodules

# Install NVIDIA device plugin for Kubernetes
install:
    #!/bin/bash
    set -euo pipefail
    if ! helm repo list | grep -q "^nvdp"; then
        helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
    fi
    helm repo update nvdp
    helm upgrade --install nvidia-device-plugin nvdp/nvidia-device-plugin \
        --namespace ${NVIDIA_DEVICE_PLUGIN_NAMESPACE} \
        --create-namespace \
        --version ${NVIDIA_DEVICE_PLUGIN_VERSION} \
        --values values.yaml \
        --wait
    echo "✓ NVIDIA device plugin installed successfully"
    echo ""
    echo "Verify GPU availability with:"
    echo " just nvidia-device-plugin::verify"

# Verify GPU resources are available
verify:
    #!/bin/bash
    set -euo pipefail
    echo "=== GPU Resources per Node ==="
    kubectl get nodes -o json | jq -r '.items[] | "\(.metadata.name): \(.status.capacity["nvidia.com/gpu"] // "0") GPUs"'
    echo ""
    echo "=== Device Plugin Pods ==="
    kubectl get pods -n ${NVIDIA_DEVICE_PLUGIN_NAMESPACE} -l app.kubernetes.io/name=nvidia-device-plugin
    echo ""
    echo "Test GPU access with:"
    echo " just nvidia-device-plugin::test"

# Show detailed GPU information
gpu-info:
    kubectl get nodes -o json | jq -r '.items[] | select(.status.capacity["nvidia.com/gpu"] != null) | {name: .metadata.name, gpus: .status.capacity["nvidia.com/gpu"], allocatable: .status.allocatable["nvidia.com/gpu"]}'

# Test GPU access by running nvidia-smi in a pod
test:
    #!/bin/bash
    set -euo pipefail
    kubectl delete pod gpu-test --ignore-not-found=true
    kubectl apply -f gpu-test-pod.yaml
    echo "Waiting for pod to complete..."
    kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/gpu-test --timeout=60s || true
    echo ""
    echo "=== GPU Test Output ==="
    kubectl logs gpu-test
    kubectl delete pod gpu-test

# Uninstall NVIDIA device plugin
uninstall:
    #!/bin/bash
    set -euo pipefail
    helm uninstall nvidia-device-plugin -n ${NVIDIA_DEVICE_PLUGIN_NAMESPACE} || true
    echo "✓ NVIDIA device plugin uninstalled"


@@ -0,0 +1,10 @@
# Enable GPU Feature Discovery
gfd:
  enabled: true

# Enable Node Feature Discovery (dependency)
nfd:
  enabled: true

# Configure runtime for k3s
runtimeClassName: "nvidia"