NVIDIA Device Plugin for Kubernetes
Enables GPU support in Kubernetes clusters by exposing NVIDIA GPUs as schedulable resources.
Overview
This module deploys the NVIDIA device plugin using the official Helm chart with:
- NVIDIA Device Plugin - Exposes GPUs as nvidia.com/gpu resources
- Node Feature Discovery (NFD) - Automatically detects GPU hardware on nodes
- GPU Feature Discovery (GFD) - Discovers and labels GPU capabilities
- k3s Integration - Automatic nvidia runtime detection for k3s clusters
Prerequisites
Host System Requirements
The following components must be installed on each GPU node before deploying the device plugin:
1. NVIDIA GPU Driver
Install the appropriate NVIDIA GPU driver for your system.
Arch Linux:
# Install NVIDIA driver
sudo pacman -S nvidia nvidia-utils
Ubuntu/Debian:
# Install NVIDIA driver
sudo apt-get update
sudo apt-get install -y nvidia-driver-<version>
Verify driver installation:
nvidia-smi
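If nvidia-smi is not found or reports no devices, a quick sanity check that the kernel module is actually loaded can help (standard Linux commands, offered here as a suggestion rather than part of this module):
# Check that the nvidia kernel module is loaded
lsmod | grep ^nvidia
# Show the driver version reported by the kernel module
modinfo nvidia | grep ^version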
2. NVIDIA Container Toolkit
The NVIDIA Container Toolkit allows containers to access GPU devices.
Arch Linux:
# Install NVIDIA Container Toolkit
sudo pacman -S nvidia-container-toolkit
# Configure containerd runtime
sudo nvidia-ctk runtime configure --runtime=containerd
# Restart containerd
sudo systemctl restart containerd
Ubuntu/Debian:
# Add NVIDIA repository
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# Install NVIDIA Container Toolkit
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
# Configure containerd runtime
sudo nvidia-ctk runtime configure --runtime=containerd
# Restart containerd
sudo systemctl restart containerd
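On either distribution, a quick check that the toolkit is actually present on the host (these commands come from the toolkit itself, not from this module):
# Confirm the runtime wrapper and CLI are installed
which nvidia-container-runtime
nvidia-ctk --version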
3. k3s Runtime Configuration
Important: k3s automatically detects the nvidia runtime if the NVIDIA Container Toolkit is installed. No manual configuration is required.
After installing the NVIDIA Container Toolkit, restart k3s and verify automatic detection:
# Restart k3s to detect nvidia runtime
sudo systemctl restart k3s
# Verify nvidia runtime is detected
sudo grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
Expected output:
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia']
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia'.options]
BinaryName = "/usr/bin/nvidia-container-runtime"
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia-cdi']
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia-cdi'.options]
BinaryName = "/usr/bin/nvidia-container-runtime.cdi"
Note: Do NOT create /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl manually, as this can break k3s networking. k3s handles runtime detection automatically.
Installation
Deploy NVIDIA Device Plugin
just nvidia-device-plugin::install
This installs:
- NVIDIA Device Plugin DaemonSet
- Node Feature Discovery (NFD) for GPU hardware detection
- GPU Feature Discovery (GFD) for GPU capability labeling
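For reference, the recipe wraps the official Helm chart. A rough equivalent of what it runs, assuming the upstream nvdp chart repository, a release named nvidia-device-plugin, and this module's values file path, is:
# Add the official device plugin chart repository
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
# Install the chart with NFD and GFD enabled via this module's values.yaml (path assumed)
helm upgrade --install nvidia-device-plugin nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin --create-namespace \
  --version 0.18.0 \
  --values nvidia-device-plugin/values.yaml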
Verify Installation
just nvidia-device-plugin::verify
Expected output:
=== GPU Resources per Node ===
node1: 1 GPUs
=== Device Plugin Pods ===
NAME READY STATUS RESTARTS AGE
nvidia-device-plugin-xxxxx 1/1 Running 0 1m
nvidia-device-plugin-gpu-feature-discovery-xxxxx 1/1 Running 0 1m
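The same information can be pulled straight from the Kubernetes API with plain kubectl (standard commands, not specific to this module):
# GPU capacity advertised by each node
kubectl get nodes -o custom-columns=NAME:.metadata.name,GPUS:.status.capacity.'nvidia\.com/gpu'
# Device plugin pods
kubectl get pods -n nvidia-device-plugin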
Test GPU Access
just nvidia-device-plugin::test
This creates a test pod that runs nvidia-smi and displays GPU information.
Expected output:
=== GPU Test Output ===
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.08 Driver Version: 580.105.08 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
...
Configuration
Environment variables (set in .env.local or override):
NVIDIA_DEVICE_PLUGIN_NAMESPACE=nvidia-device-plugin # Kubernetes namespace
NVIDIA_DEVICE_PLUGIN_VERSION=0.18.0 # Helm chart version
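For example, a .env.local override might look like this (hypothetical namespace value, shown only to illustrate the pattern):
# .env.local - example override of the defaults above
NVIDIA_DEVICE_PLUGIN_NAMESPACE=gpu-system
NVIDIA_DEVICE_PLUGIN_VERSION=0.18.0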
Usage
Using GPUs in Pods
To use GPUs in your pods, specify two things:
- runtimeClassName: nvidia - Uses NVIDIA Container Runtime
- resources.limits with nvidia.com/gpu: 1 - Requests GPU allocation
Example pod configuration:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  runtimeClassName: nvidia
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
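To try it out, something like the following works (standard kubectl, pod name taken from the manifest above):
# Save the manifest as gpu-pod.yaml, then:
kubectl apply -f gpu-pod.yaml
# nvidia-smi output appears in the pod logs once the command completes
kubectl logs gpu-pod
kubectl delete pod gpu-pod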
Multiple GPUs
To request multiple GPUs:
resources:
  limits:
    nvidia.com/gpu: 2
GPU Sharing (Time-Slicing)
The device plugin supports GPU time-slicing for sharing GPUs across multiple pods. See the official documentation for configuration details.
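As a rough illustration of the upstream config format (consult the official documentation for the authoritative schema; the replica count below is arbitrary), a time-slicing config passed to the device plugin looks approximately like:
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4  # each physical GPU is advertised as 4 schedulable replicas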
Architecture
GPU Node (Arch Linux)
├─ NVIDIA Driver (nvidia, nvidia-utils)
├─ NVIDIA Container Toolkit (nvidia-container-runtime)
└─ k3s with containerd
   ├─ Auto-detected nvidia runtime
   └─ NVIDIA Device Plugin (DaemonSet)
      ├─ Discovers GPUs on node
      ├─ Exposes nvidia.com/gpu resource
      └─ Manages GPU allocation to pods
             ↓
User Pods (with runtimeClassName: nvidia)
├─ GPU device access (/dev/nvidia*)
├─ CUDA libraries mounted
└─ nvidia-smi available
Key Components:
- NVIDIA Driver: Kernel module for GPU hardware access
- NVIDIA Container Toolkit: Container runtime hooks for GPU access
- k3s containerd: Automatically detects nvidia runtime
- Device Plugin: Kubernetes plugin that advertises GPU resources
- NFD: Detects GPU hardware and labels nodes
- GFD: Discovers GPU capabilities and features
Management
Check GPU Resources
# View GPU resources per node
just nvidia-device-plugin::gpu-info
Upgrade Device Plugin
# Update to latest version
just nvidia-device-plugin::install
Uninstall
just nvidia-device-plugin::uninstall
This removes:
- NVIDIA Device Plugin DaemonSet
- Node Feature Discovery components
- GPU Feature Discovery components
- Helm release and namespace
Note: Host-level components (NVIDIA driver, Container Toolkit) are NOT removed and must be uninstalled manually if needed.
Troubleshooting
Check Device Plugin Pods
kubectl get pods -n nvidia-device-plugin
Expected pods:
- nvidia-device-plugin-* - Device plugin daemon (one per GPU node)
- nvidia-device-plugin-gpu-feature-discovery-* - GPU feature discovery (one per GPU node)
- nvidia-device-plugin-node-feature-discovery-master-* - NFD master
- nvidia-device-plugin-node-feature-discovery-gc-* - NFD garbage collector
GPU Not Detected
Symptom: just nvidia-device-plugin::verify shows 0 GPUs
Possible Causes:
- NVIDIA driver not installed
  # Check if driver is loaded
  nvidia-smi
  If this fails, install the NVIDIA driver on the host.
- NVIDIA Container Toolkit not installed
  # Check if nvidia-container-runtime exists
  which nvidia-container-runtime
  If not found, install the NVIDIA Container Toolkit.
- k3s did not detect nvidia runtime
  # Check containerd config
  sudo grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
  If empty, restart k3s:
  sudo systemctl restart k3s
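If all three checks pass but the node still advertises 0 GPUs, the device plugin's own logs usually explain why (the label selector below is the chart's default and may differ in your deployment):
# Logs from the device plugin DaemonSet pods
kubectl logs -n nvidia-device-plugin -l app.kubernetes.io/name=nvidia-device-plugin
# GPU capacity as seen by the scheduler
kubectl describe node <node-name> | grep -A 8 'Capacity:'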
Device Plugin Pod CrashLoopBackOff
Symptom: Device plugin pod shows CrashLoopBackOff status
Check logs:
kubectl logs -n nvidia-device-plugin <pod-name>
Common errors:
- "invalid device discovery strategy"
  - Cause: NVIDIA Container Toolkit not configured properly
  - Solution: Run sudo nvidia-ctk runtime configure --runtime=containerd and restart containerd
- "failed to create containerd task"
  - Cause: containerd cannot find nvidia runtime
  - Solution: Verify /usr/bin/nvidia-container-runtime exists and restart k3s
Pod Cannot Access GPU
Symptom: Pod starts but nvidia-smi fails with "executable file not found"
Cause: Pod does not have runtimeClassName: nvidia specified
Solution: Add runtimeClassName: nvidia to pod spec:
spec:
  runtimeClassName: nvidia  # Required!
  containers:
    - name: gpu-container
      resources:
        limits:
          nvidia.com/gpu: 1
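It is also worth confirming that a RuntimeClass named nvidia exists in the cluster. Depending on the k3s version it may be created automatically; if it is missing, it can be added by hand (sketch using the standard node.k8s.io/v1 API; the handler must match the containerd runtime name):
# List RuntimeClasses known to the cluster
kubectl get runtimeclass
# If 'nvidia' is missing, create it
kubectl apply -f - <<EOF
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF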
k3s Node NotReady After Configuration
Symptom: Node shows NotReady status with "cni plugin not initialized" error
Cause: Invalid /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl file
Solution: Remove the file and restart k3s:
sudo rm /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
sudo systemctl restart k3s
k3s will automatically detect nvidia runtime without manual configuration.
Check NVIDIA Runtime in Pods
# Run a test pod to verify GPU access
kubectl apply -f nvidia-device-plugin/gpu-test-pod.yaml
# Check logs
kubectl logs gpu-test
# Clean up
kubectl delete pod gpu-test
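The bundled gpu-test-pod.yaml is not reproduced in this README; a minimal equivalent, assuming the pod name gpu-test used above and the same CUDA base image as the earlier example, would look like:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  runtimeClassName: nvidia
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1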
Configuration Files
Key configuration files:
- values.yaml - Helm chart values with NFD and GFD enabled
- gpu-test-pod.yaml - Test pod for verifying GPU access
- justfile - Task recipes for installation and management
Security Considerations
- Privileged Access: Device plugin pods run with privileged access to manage GPU devices
- Host Path Mounts: Pods mount /dev and other host paths for GPU access
- Runtime Security: NVIDIA runtime is isolated from the default runc runtime
- Resource Limits: GPUs are allocated exclusively to pods (no overcommit by default)
- Driver Compatibility: Ensure NVIDIA driver version is compatible with CUDA version in containers