feat(nvidia-device-plugin): install nvidia-device-plugin
nvidia-device-plugin/README.md (new file, +435 lines)
# NVIDIA Device Plugin for Kubernetes

Enables GPU support in Kubernetes clusters by exposing NVIDIA GPUs as schedulable resources.

## Overview

This module deploys the NVIDIA device plugin using the official Helm chart with:

- **NVIDIA Device Plugin** - Exposes GPUs as `nvidia.com/gpu` resources
- **Node Feature Discovery (NFD)** - Automatically detects GPU hardware on nodes
- **GPU Feature Discovery (GFD)** - Discovers and labels GPU capabilities
- **k3s Integration** - Automatic nvidia runtime detection for k3s clusters

## Prerequisites

### Host System Requirements

The following components must be installed on each GPU node **before** deploying the device plugin:

#### 1. NVIDIA GPU Driver

Install the appropriate NVIDIA GPU driver for your system.

**Arch Linux:**

```bash
# Install NVIDIA driver
sudo pacman -S nvidia nvidia-utils
```

**Ubuntu/Debian:**

```bash
# Install NVIDIA driver
sudo apt-get update
sudo apt-get install -y nvidia-driver-<version>
```

Verify driver installation:

```bash
nvidia-smi
```
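If `nvidia-smi` is not found or reports no devices, it can also help to confirm that the kernel module is loaded; a quick check (module names vary between the proprietary and open drivers):

```bash
# List loaded NVIDIA kernel modules
lsmod | grep -i nvidia

# Driver version as reported by the loaded kernel module
cat /proc/driver/nvidia/version
```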
#### 2. NVIDIA Container Toolkit

The NVIDIA Container Toolkit allows containers to access GPU devices.

**Arch Linux:**

```bash
# Install NVIDIA Container Toolkit
sudo pacman -S nvidia-container-toolkit

# Configure containerd runtime
sudo nvidia-ctk runtime configure --runtime=containerd

# Restart containerd
sudo systemctl restart containerd
```

**Ubuntu/Debian:**

```bash
# Add NVIDIA repository
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install NVIDIA Container Toolkit
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Configure containerd runtime
sudo nvidia-ctk runtime configure --runtime=containerd

# Restart containerd
sudo systemctl restart containerd
```
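Regardless of distribution, a quick sanity check that the toolkit is installed and the runtime wrapper is on the PATH:

```bash
# Confirm the toolkit CLI and the runtime wrapper are present
nvidia-ctk --version
which nvidia-container-runtime
```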
#### 3. k3s Runtime Configuration

**Important**: k3s automatically detects the nvidia runtime if NVIDIA Container Toolkit is installed. **No manual configuration is required**.

After installing NVIDIA Container Toolkit and restarting k3s, verify automatic detection:

```bash
# Restart k3s to detect nvidia runtime
sudo systemctl restart k3s

# Verify nvidia runtime is detected
sudo grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
```

Expected output:

```toml
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia']
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia'.options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia-cdi']
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia-cdi'.options]
  BinaryName = "/usr/bin/nvidia-container-runtime.cdi"
```

**Note**: Do **NOT** create `/var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl` manually, as this can break k3s networking. k3s handles runtime detection automatically.
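Pods later in this guide select the runtime via `runtimeClassName: nvidia`, so the cluster must have a RuntimeClass with that name. This module's Helm values may already create it; if it is missing, a minimal definition looks like this (sketch):

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia   # must match the runtime name in the containerd config above
```

Check with `kubectl get runtimeclass nvidia`.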
## Installation

### Deploy NVIDIA Device Plugin

```bash
just nvidia-device-plugin::install
```

This installs:

- NVIDIA Device Plugin DaemonSet
- Node Feature Discovery (NFD) for GPU hardware detection
- GPU Feature Discovery (GFD) for GPU capability labeling
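Under the hood the recipe wraps a Helm install of the upstream chart. A roughly equivalent manual invocation (a sketch; the exact flags and values file live in this module's `justfile`):

```bash
# Add the upstream chart repository and install/upgrade the release
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade --install nvidia-device-plugin nvdp/nvidia-device-plugin \
  --namespace "${NVIDIA_DEVICE_PLUGIN_NAMESPACE:-nvidia-device-plugin}" \
  --create-namespace \
  --version "${NVIDIA_DEVICE_PLUGIN_VERSION:-0.18.0}" \
  --values nvidia-device-plugin/values.yaml
```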
### Verify Installation

```bash
just nvidia-device-plugin::verify
```

Expected output:

```plain
=== GPU Resources per Node ===
node1: 1 GPUs

=== Device Plugin Pods ===
NAME                                               READY   STATUS    RESTARTS   AGE
nvidia-device-plugin-xxxxx                         1/1     Running   0          1m
nvidia-device-plugin-gpu-feature-discovery-xxxxx   1/1     Running   0          1m
```
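The same information is available directly from `kubectl`, which is handy when the `just` recipes are not on the machine you are using:

```bash
# Allocatable GPUs per node
kubectl get nodes -o custom-columns="NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
```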
### Test GPU Access

```bash
just nvidia-device-plugin::test
```

This creates a test pod that runs `nvidia-smi` and displays GPU information.

Expected output:

```plain
=== GPU Test Output ===
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.08              Driver Version: 580.105.08      CUDA Version: 13.0   |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
...
```

## Configuration

Environment variables (set in `.env.local` or override):

```bash
NVIDIA_DEVICE_PLUGIN_NAMESPACE=nvidia-device-plugin  # Kubernetes namespace
NVIDIA_DEVICE_PLUGIN_VERSION=0.18.0                  # Helm chart version
```
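To see which chart versions are available before changing `NVIDIA_DEVICE_PLUGIN_VERSION` (assuming the `nvdp` Helm repository from the installation sketch above has been added):

```bash
# List published chart versions
helm search repo nvdp/nvidia-device-plugin --versions | head
```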
## Usage

### Using GPUs in Pods

To use GPUs in your pods, specify two things:

1. **runtimeClassName**: `nvidia` - Uses NVIDIA Container Runtime
2. **resources.limits**: `nvidia.com/gpu: 1` - Requests GPU allocation

Example pod configuration:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  runtimeClassName: nvidia
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```
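To try it out, save the manifest as `gpu-pod.yaml` (any filename works), apply it, and read the `nvidia-smi` output from the pod logs:

```bash
kubectl apply -f gpu-pod.yaml
# The container runs nvidia-smi once and exits; read its output from the logs
kubectl logs gpu-pod
# Clean up
kubectl delete pod gpu-pod
```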
### Using GPUs in JupyterHub

Configure JupyterHub to allow GPU access for notebook servers:

```yaml
# jupyterhub values.yaml
singleuser:
  runtimeClassName: nvidia
  extraResource:
    limits:
      nvidia.com/gpu: "1"
```

After deploying JupyterHub with this configuration, users can access GPUs in their notebooks:

```python
import torch

# Check GPU availability
print(torch.cuda.is_available())      # True
print(torch.cuda.device_count())      # 1
print(torch.cuda.get_device_name(0))  # NVIDIA GeForce RTX 4070 Ti
```

### Multiple GPUs

To request multiple GPUs:

```yaml
resources:
  limits:
    nvidia.com/gpu: 2
```

### GPU Sharing (Time-Slicing)

The device plugin supports GPU time-slicing for sharing GPUs across multiple pods. See the [official documentation](https://github.com/NVIDIA/k8s-device-plugin#shared-access-to-gpus-with-cuda-time-slicing) for configuration details.
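For orientation, a time-slicing configuration has this general shape (a sketch based on the upstream documentation; how the config is passed to the chart is described there):

```yaml
# Advertise each physical GPU as 4 schedulable replicas
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4
```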
## Architecture

```plain
GPU Node (Arch Linux)
├─ NVIDIA Driver (nvidia, nvidia-utils)
├─ NVIDIA Container Toolkit (nvidia-container-runtime)
└─ k3s with containerd
   ├─ Auto-detected nvidia runtime
   └─ NVIDIA Device Plugin (DaemonSet)
      ├─ Discovers GPUs on node
      ├─ Exposes nvidia.com/gpu resource
      └─ Manages GPU allocation to pods
         ↓
   User Pods (with runtimeClassName: nvidia)
   ├─ GPU device access (/dev/nvidia*)
   ├─ CUDA libraries mounted
   └─ nvidia-smi available
```

**Key Components**:

- **NVIDIA Driver**: Kernel module for GPU hardware access
- **NVIDIA Container Toolkit**: Container runtime hooks for GPU access
- **k3s containerd**: Automatically detects nvidia runtime
- **Device Plugin**: Kubernetes plugin that advertises GPU resources
- **NFD**: Detects GPU hardware and labels nodes
- **GFD**: Discovers GPU capabilities and features
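The labels that NFD and GFD attach to GPU nodes can be inspected directly, which is a quick way to confirm both discovery components are working (replace `<node-name>` with a real node):

```bash
# GPU-related labels, capacity, and allocatable resources on a node
kubectl describe node <node-name> | grep -i nvidia.com/
```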
## Management

### Check GPU Resources

```bash
# View GPU resources per node
just nvidia-device-plugin::gpu-info
```

### Upgrade Device Plugin

```bash
# Update to latest version
just nvidia-device-plugin::install
```
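To move to a specific chart release rather than the current default, bump `NVIDIA_DEVICE_PLUGIN_VERSION` first; shown here as an environment override, assuming the `justfile` reads the variable from the environment as well as from `.env.local`:

```bash
# <new-version> is a placeholder for the target chart version
NVIDIA_DEVICE_PLUGIN_VERSION=<new-version> just nvidia-device-plugin::install
```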
### Uninstall

```bash
just nvidia-device-plugin::uninstall
```

This removes:

- NVIDIA Device Plugin DaemonSet
- Node Feature Discovery components
- GPU Feature Discovery components
- Helm release and namespace

**Note**: Host-level components (NVIDIA driver, Container Toolkit) are NOT removed and must be uninstalled manually if needed.
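To confirm the cleanup, check that neither the Helm release nor the namespace is left behind:

```bash
# Both should come back empty / "NotFound" after a successful uninstall
helm list -A | grep nvidia-device-plugin
kubectl get namespace nvidia-device-plugin
```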
## Troubleshooting

### Check Device Plugin Pods

```bash
kubectl get pods -n nvidia-device-plugin
```

Expected pods:

- `nvidia-device-plugin-*` - Device plugin daemon (one per GPU node)
- `nvidia-device-plugin-gpu-feature-discovery-*` - GPU feature discovery (one per GPU node)
- `nvidia-device-plugin-node-feature-discovery-master-*` - NFD master
- `nvidia-device-plugin-node-feature-discovery-gc-*` - NFD garbage collector

### GPU Not Detected

**Symptom**: `just nvidia-device-plugin::verify` shows `0 GPUs`

**Possible Causes**:

1. **NVIDIA driver not installed**

   ```bash
   # Check if driver is loaded
   nvidia-smi
   ```

   If this fails, install the NVIDIA driver on the host.

2. **NVIDIA Container Toolkit not installed**

   ```bash
   # Check if nvidia-container-runtime exists
   which nvidia-container-runtime
   ```

   If not found, install the NVIDIA Container Toolkit.

3. **k3s did not detect nvidia runtime**

   ```bash
   # Check containerd config
   sudo grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
   ```

   If empty, restart k3s:

   ```bash
   sudo systemctl restart k3s
   ```
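If all three checks pass and GPUs are still not advertised, the device plugin's own logs usually explain why (a sketch, assuming the Helm release is named `nvidia-device-plugin`):

```bash
# Logs from the device plugin DaemonSet
kubectl logs -n nvidia-device-plugin daemonset/nvidia-device-plugin
```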
### Device Plugin Pod CrashLoopBackOff

**Symptom**: Device plugin pod shows `CrashLoopBackOff` status

**Check logs**:

```bash
kubectl logs -n nvidia-device-plugin <pod-name>
```

**Common errors**:

1. **"invalid device discovery strategy"**

   - Cause: NVIDIA Container Toolkit not configured properly
   - Solution: Run `sudo nvidia-ctk runtime configure --runtime=containerd` and restart containerd

2. **"failed to create containerd task"**

   - Cause: containerd cannot find the nvidia runtime
   - Solution: Verify `/usr/bin/nvidia-container-runtime` exists and restart k3s

### Pod Cannot Access GPU

**Symptom**: Pod starts but `nvidia-smi` fails with "executable file not found"

**Cause**: Pod does not have `runtimeClassName: nvidia` specified

**Solution**: Add `runtimeClassName: nvidia` to the pod spec:

```yaml
spec:
  runtimeClassName: nvidia  # Required!
  containers:
    - name: gpu-container
      resources:
        limits:
          nvidia.com/gpu: 1
```
### k3s Node NotReady After Configuration

**Symptom**: Node shows `NotReady` status with a "cni plugin not initialized" error

**Cause**: Invalid `/var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl` file

**Solution**: Remove the file and restart k3s:

```bash
sudo rm /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
sudo systemctl restart k3s
```

k3s will automatically detect the nvidia runtime without manual configuration.

### Check NVIDIA Runtime in Pods

```bash
# Run a test pod to verify GPU access
kubectl apply -f nvidia-device-plugin/gpu-test-pod.yaml

# Check logs
kubectl logs gpu-test

# Clean up
kubectl delete pod gpu-test
```
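For reference, the test pod is along these lines (a sketch; the actual `gpu-test-pod.yaml` in this module may differ in image or details):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  runtimeClassName: nvidia      # required for GPU access
  containers:
    - name: gpu-test
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```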
## Configuration Files

Key configuration files:

- `values.yaml` - Helm chart values with NFD and GFD enabled
- `gpu-test-pod.yaml` - Test pod for verifying GPU access
- `justfile` - Task recipes for installation and management

## Security Considerations

- **Privileged Access**: Device plugin pods run with privileged access to manage GPU devices
- **Host Path Mounts**: Pods mount `/dev` and other host paths for GPU access
- **Runtime Security**: The NVIDIA runtime is isolated from the default runc runtime
- **Resource Limits**: GPUs are allocated exclusively to pods (no overcommit by default)
- **Driver Compatibility**: Ensure the NVIDIA driver version is compatible with the CUDA version in containers

## References

- [NVIDIA Device Plugin GitHub](https://github.com/NVIDIA/k8s-device-plugin)
- [NVIDIA Container Toolkit Documentation](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)
- [k3s Advanced Configuration](https://docs.k3s.io/advanced)
- [Kubernetes Device Plugins](https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/)
- [Node Feature Discovery](https://github.com/kubernetes-sigs/node-feature-discovery)