# Ollama

Local LLM inference server for running open-source models:
- Local Inference: Run LLMs locally without external API dependencies
- GPU Acceleration: NVIDIA GPU support with automatic runtime configuration
- Model Library: Access to thousands of open-source models (Llama, Qwen, DeepSeek, Mistral, etc.)
- OpenAI-Compatible API: Drop-in replacement for OpenAI API endpoints
- Persistent Storage: Models stored persistently across restarts
## Prerequisites

For GPU support, ensure NVIDIA Device Plugin is installed:

```bash
just nvidia-device-plugin::install
```

See `nvidia-device-plugin/README.md` for host system requirements.
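
Before installing with GPU support, you can optionally confirm that a node advertises the `nvidia.com/gpu` resource. This is a generic check, not a recipe from this repo:

```bash
# Each GPU-capable node should list a non-zero nvidia.com/gpu capacity
kubectl describe nodes | grep -i 'nvidia.com/gpu'
```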
## Installation

```bash
just ollama::install
```
During installation, you will be prompted for:
- GPU support: Enable/disable NVIDIA GPU acceleration
- Models to pull: Comma-separated list of models to download (e.g., `qwen3:8b,deepseek-r1:8b`)
## Environment Variables

| Variable | Default | Description |
|---|---|---|
| `OLLAMA_NAMESPACE` | `ollama` | Kubernetes namespace |
| `OLLAMA_CHART_VERSION` | `1.35.0` | Helm chart version |
| `OLLAMA_GPU_ENABLED` | (prompt) | Enable GPU support (`true`/`false`) |
| `OLLAMA_GPU_TYPE` | `nvidia` | GPU type (`nvidia` or `amd`) |
| `OLLAMA_GPU_COUNT` | `1` | Number of GPUs to allocate |
| `OLLAMA_MODELS` | (prompt) | Comma-separated list of models |
| `OLLAMA_STORAGE_SIZE` | `30Gi` | Persistent volume size for models |
### Example with Environment Variables

```bash
OLLAMA_GPU_ENABLED=true OLLAMA_MODELS="qwen3:8b,llama3.2:3b" just ollama::install
```
## Model Management

### List Models

```bash
just ollama::list
```

### Pull a Model

```bash
just ollama::pull qwen3:8b
```

### Run Interactive Chat

```bash
just ollama::run qwen3:8b
```

### Check Status

```bash
just ollama::status
```

### View Logs

```bash
just ollama::logs
```

Browse available models at [ollama.com/library](https://ollama.com/library).
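
If you need to delete a model, there is no dedicated recipe shown above, so the sketch below assumes you exec into the deployment and use Ollama's standard `ollama rm` command (the `deploy/ollama` target matches the one used elsewhere in this README):

```bash
# Remove a pulled model to free space on the persistent volume
kubectl exec -n ollama deploy/ollama -- ollama rm qwen3:8b
```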
## API Usage

Ollama exposes an OpenAI-compatible API at `http://ollama.ollama.svc.cluster.local:11434`.
### OpenAI-Compatible Endpoint

```bash
curl http://ollama.ollama.svc.cluster.local:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:8b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
### Native Ollama API

```bash
# Generate completion
curl http://ollama.ollama.svc.cluster.local:11434/api/generate \
  -d '{"model": "qwen3:8b", "prompt": "Hello!"}'

# Chat completion
curl http://ollama.ollama.svc.cluster.local:11434/api/chat \
  -d '{"model": "qwen3:8b", "messages": [{"role": "user", "content": "Hello!"}]}'

# List models
curl http://ollama.ollama.svc.cluster.local:11434/api/tags
```
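
The native `/api/generate` and `/api/chat` endpoints stream JSON lines by default; to get a single JSON response instead, set the standard `stream` flag to `false`:

```bash
# Non-streaming chat request against the native API
curl http://ollama.ollama.svc.cluster.local:11434/api/chat \
  -d '{
    "model": "qwen3:8b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false
  }'
```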
### Python Client

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://ollama.ollama.svc.cluster.local:11434/v1",
    api_key="ollama",  # Required by the client but ignored by Ollama
)

response = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```
## GPU Verification

Check if the GPU is being used:

```bash
kubectl exec -n ollama deploy/ollama -- ollama ps
```

Expected output with GPU:

```
NAME        ID              SIZE      PROCESSOR    CONTEXT    UNTIL
qwen3:8b    500a1f067a9f    6.0 GB    100% GPU     4096       4 minutes from now
```

If PROCESSOR shows `100% CPU`, see the Troubleshooting section below.
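
For a second check you can query the GPU driver from inside the pod. This assumes the NVIDIA container runtime makes `nvidia-smi` available in the container, which is typically the case when `runtimeClassName: nvidia` is applied:

```bash
# Show GPU memory usage and running processes from inside the Ollama pod
kubectl exec -n ollama deploy/ollama -- nvidia-smi
```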
## Integration with LibreChat

Ollama integrates with LibreChat for a web-based chat interface:

```bash
just librechat::install
```
LibreChat automatically connects to Ollama using the internal Kubernetes service URL.
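
If LibreChat does not list any Ollama models, a quick way to isolate networking from configuration is to hit the service URL from a throwaway pod; the `librechat` namespace below is an assumption based on the recipe prefix:

```bash
# Verify the Ollama service is reachable from another namespace (returns the pulled models)
kubectl run curl-test --rm -it --restart=Never -n librechat \
  --image=curlimages/curl -- \
  curl -s http://ollama.ollama.svc.cluster.local:11434/api/tags
```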
## GPU Time-Slicing

To share a single GPU across multiple pods, enable time-slicing in NVIDIA Device Plugin:

```bash
GPU_TIME_SLICING_REPLICAS=4 just nvidia-device-plugin::install
```
This allows up to 4 pods to share the same GPU (e.g., Ollama + JupyterHub notebooks).
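
After enabling time-slicing, the node's allocatable `nvidia.com/gpu` count should equal the configured replica count (4 in this example). A generic way to check, assuming the standard resource name:

```bash
# Allocatable GPU count per node; expect it to match GPU_TIME_SLICING_REPLICAS
kubectl get nodes -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```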
## Upgrade

```bash
just ollama::upgrade
```

## Uninstall

```bash
just ollama::uninstall
```
This removes the Helm release and namespace. Pulled models are deleted with the PVC.
## Troubleshooting

### Model Running on CPU Instead of GPU

Symptom: `ollama ps` shows `100% CPU` instead of `100% GPU`

Cause: Missing `runtimeClassName: nvidia` in the pod spec

Solution: Ensure `OLLAMA_GPU_ENABLED=true` and upgrade:

```bash
OLLAMA_GPU_ENABLED=true just ollama::upgrade
```

The Helm values include `runtimeClassName: nvidia` when GPU is enabled.
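
To confirm the setting actually reached the workload, read it back from the deployment (a generic kubectl query, not a recipe from this repo):

```bash
# Should print "nvidia" when GPU support is enabled
kubectl get deployment ollama -n ollama \
  -o jsonpath='{.spec.template.spec.runtimeClassName}{"\n"}'
```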
### GPU Not Detected in Pod

Check GPU devices in the pod:

```bash
kubectl exec -n ollama deploy/ollama -- ls -la /dev/nvidia*
```

If no devices are found:

- Verify NVIDIA Device Plugin is running:

  ```bash
  just nvidia-device-plugin::verify
  ```

- Check the RuntimeClass exists:

  ```bash
  kubectl get runtimeclass nvidia
  ```

- Restart Ollama to pick up the GPU:

  ```bash
  kubectl rollout restart deployment/ollama -n ollama
  ```
### Model Download Slow or Failing

Check pod logs:

```bash
just ollama::logs
```

If the volume is too small, increase `OLLAMA_STORAGE_SIZE` and upgrade:

```bash
OLLAMA_STORAGE_SIZE=50Gi just ollama::upgrade
```
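
To see whether the volume is actually full, check disk usage inside the pod. The `/root/.ollama` path below is Ollama's default model directory and may differ if the chart mounts the volume elsewhere:

```bash
# Free space on the volume holding pulled models (assumes the default model directory)
kubectl exec -n ollama deploy/ollama -- df -h /root/.ollama
```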
### Out of Memory Errors

Symptom: Model fails to load with an OOM error

Solutions:

- Use a smaller quantized model (e.g., `qwen3:8b` instead of `qwen3:14b`)
- Reduce the context size in your API requests (see the example below)
- Upgrade to a GPU with more VRAM
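
As an example of reducing context size, the native Ollama API accepts a `num_ctx` option per request; smaller context windows need less memory:

```bash
# Request a smaller context window (2048 tokens) to reduce memory usage
curl http://ollama.ollama.svc.cluster.local:11434/api/chat \
  -d '{
    "model": "qwen3:8b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "options": {"num_ctx": 2048}
  }'
```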