# Ollama

Local LLM inference server for running open-source models:

- **Local Inference**: Run LLMs locally without external API dependencies
- **GPU Acceleration**: NVIDIA GPU support with automatic runtime configuration
- **Model Library**: Access to thousands of open-source models (Llama, Qwen, DeepSeek, Mistral, etc.)
- **OpenAI-Compatible API**: Drop-in replacement for OpenAI API endpoints
- **Persistent Storage**: Models stored persistently across restarts

## Prerequisites

For GPU support, ensure the NVIDIA Device Plugin is installed:

```bash
just nvidia-device-plugin::install
```

See [nvidia-device-plugin/README.md](../nvidia-device-plugin/README.md) for host system requirements.

## Installation

```bash
just ollama::install
```

During installation, you will be prompted for:

- **GPU support**: Enable/disable NVIDIA GPU acceleration
- **Models to pull**: Comma-separated list of models to download (e.g., `qwen3:8b,deepseek-r1:8b`)

### Environment Variables

| Variable | Default | Description |
| -------- | ------- | ----------- |
| `OLLAMA_NAMESPACE` | `ollama` | Kubernetes namespace |
| `OLLAMA_CHART_VERSION` | `1.35.0` | Helm chart version |
| `OLLAMA_GPU_ENABLED` | (prompt) | Enable GPU support (`true`/`false`) |
| `OLLAMA_GPU_TYPE` | `nvidia` | GPU type (`nvidia` or `amd`) |
| `OLLAMA_GPU_COUNT` | `1` | Number of GPUs to allocate |
| `OLLAMA_MODELS` | (prompt) | Comma-separated list of models |
| `OLLAMA_STORAGE_SIZE` | `30Gi` | Persistent volume size for models |

### Example with Environment Variables

```bash
OLLAMA_GPU_ENABLED=true OLLAMA_MODELS="qwen3:8b,llama3.2:3b" just ollama::install
```

## Model Management

### List Models

```bash
just ollama::list
```

### Pull a Model

```bash
just ollama::pull qwen3:8b
```

### Run Interactive Chat

```bash
just ollama::run qwen3:8b
```

### Check Status

```bash
just ollama::status
```

### View Logs

```bash
just ollama::logs
```

Browse available models at [ollama.com/library](https://ollama.com/library).

## API Usage

Ollama exposes an OpenAI-compatible API at `http://ollama.ollama.svc.cluster.local:11434`.

### OpenAI-Compatible Endpoint

```bash
curl http://ollama.ollama.svc.cluster.local:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:8b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

### Native Ollama API

```bash
# Generate completion
curl http://ollama.ollama.svc.cluster.local:11434/api/generate \
  -d '{"model": "qwen3:8b", "prompt": "Hello!"}'

# Chat completion
curl http://ollama.ollama.svc.cluster.local:11434/api/chat \
  -d '{"model": "qwen3:8b", "messages": [{"role": "user", "content": "Hello!"}]}'

# List models
curl http://ollama.ollama.svc.cluster.local:11434/api/tags
```

### Python Client

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://ollama.ollama.svc.cluster.local:11434/v1",
    api_key="ollama",  # Required by the client but ignored by Ollama
)

response = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```

## GPU Verification

Check whether the GPU is being used:

```bash
kubectl exec -n ollama deploy/ollama -- ollama ps
```

Expected output with GPU:

```text
NAME        ID              SIZE      PROCESSOR    CONTEXT    UNTIL
qwen3:8b    500a1f067a9f    6.0 GB    100% GPU     4096       4 minutes from now
```

If `PROCESSOR` shows `100% CPU`, see the Troubleshooting section.
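As a complementary check, you can run `nvidia-smi` inside the pod. This sketch assumes the NVIDIA container runtime injects the `nvidia-smi` binary into GPU-enabled containers, so a "command not found" error is itself a useful signal:

```bash
# Run nvidia-smi inside the Ollama pod; the binary is injected by the
# NVIDIA container runtime, so if it is missing the pod was likely not
# started under the nvidia RuntimeClass
kubectl exec -n ollama deploy/ollama -- nvidia-smi
```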
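To confirm the scheduler actually sees GPU capacity on your nodes, you can inspect the extended resource advertised by the NVIDIA Device Plugin. A minimal sketch, assuming the standard `nvidia.com/gpu` resource name:

```bash
# Show the nvidia.com/gpu capacity/allocatable entries per node; with
# GPU time-slicing enabled this reports the replica count rather than
# the number of physical GPUs
kubectl describe nodes | grep -i 'nvidia.com/gpu'
```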
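## Local Access via Port-Forward

The service URL above only resolves inside the cluster. For quick tests from a workstation, you can port-forward the service; this sketch assumes the service is named `ollama` in the `ollama` namespace, as implied by the cluster DNS name used throughout this README:

```bash
# Forward the in-cluster service to localhost (Ctrl-C to stop)
kubectl port-forward -n ollama svc/ollama 11434:11434

# In another terminal: list models through the forwarded port
curl http://localhost:11434/api/tags

# Streaming chat completion via the OpenAI-compatible endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:8b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```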
## Integration with LibreChat

Ollama integrates with [LibreChat](../librechat/README.md) for a web-based chat interface:

```bash
just librechat::install
```

LibreChat automatically connects to Ollama using the internal Kubernetes service URL.

## GPU Time-Slicing

To share a single GPU across multiple pods, enable time-slicing in the NVIDIA Device Plugin:

```bash
GPU_TIME_SLICING_REPLICAS=4 just nvidia-device-plugin::install
```

This allows up to 4 pods to share the same GPU (e.g., Ollama + JupyterHub notebooks); the node capacity check in the GPU Verification section above will then report the replica count rather than the physical GPU count.

## Upgrade

```bash
just ollama::upgrade
```

## Uninstall

```bash
just ollama::uninstall
```

This removes the Helm release and namespace. Pulled models are deleted along with the PVC.

## Troubleshooting

### Model Running on CPU Instead of GPU

**Symptom**: `ollama ps` shows `100% CPU` instead of `100% GPU`

**Cause**: Missing `runtimeClassName: nvidia` in the pod spec

**Solution**: Ensure `OLLAMA_GPU_ENABLED=true` and upgrade:

```bash
OLLAMA_GPU_ENABLED=true just ollama::upgrade
```

The Helm values include `runtimeClassName: nvidia` when GPU support is enabled.

### GPU Not Detected in Pod

**Check GPU devices in the pod**:

```bash
kubectl exec -n ollama deploy/ollama -- ls -la /dev/nvidia*
```

If no devices are found:

1. Verify the NVIDIA Device Plugin is running:

   ```bash
   just nvidia-device-plugin::verify
   ```

2. Check that the RuntimeClass exists:

   ```bash
   kubectl get runtimeclass nvidia
   ```

3. Restart Ollama to pick up the GPU:

   ```bash
   kubectl rollout restart deployment/ollama -n ollama
   ```

### Model Download Slow or Failing

**Check pod logs**:

```bash
just ollama::logs
```

**Increase storage if needed** by setting `OLLAMA_STORAGE_SIZE`:

```bash
OLLAMA_STORAGE_SIZE=50Gi just ollama::upgrade
```

### Out of Memory Errors

**Symptom**: Model fails to load with an OOM error

**Solutions**:

1. Use a smaller quantized model (e.g., `qwen3:8b` instead of `qwen3:14b`)
2. Reduce the context size in your API requests (the native API accepts an `options.num_ctx` parameter)
3. Upgrade to a GPU with more VRAM

## References

- [Ollama Website](https://ollama.com/)
- [Ollama Model Library](https://ollama.com/library)
- [Ollama GitHub](https://github.com/ollama/ollama)
- [Ollama Helm Chart](https://github.com/otwld/ollama-helm)
- [OpenAI API Compatibility](https://ollama.com/blog/openai-compatibility)