diff --git a/ollama/README.md b/ollama/README.md
new file mode 100644
index 0000000..ee5c7a5
--- /dev/null
+++ b/ollama/README.md
@@ -0,0 +1,256 @@

# Ollama

Local LLM inference server for running open-source models:

- **Local Inference**: Run LLMs locally without external API dependencies
- **GPU Acceleration**: NVIDIA GPU support with automatic runtime configuration
- **Model Library**: Access to thousands of open-source models (Llama, Qwen, DeepSeek, Mistral, etc.)
- **OpenAI-Compatible API**: Drop-in replacement for OpenAI API endpoints
- **Persistent Storage**: Models are kept on a persistent volume across restarts

## Prerequisites

For GPU support, ensure the NVIDIA Device Plugin is installed:

```bash
just nvidia-device-plugin::install
```

See [nvidia-device-plugin/README.md](../nvidia-device-plugin/README.md) for host system requirements.

## Installation

```bash
just ollama::install
```

During installation, you will be prompted for:

- **GPU support**: Enable or disable NVIDIA GPU acceleration
- **Models to pull**: Comma-separated list of models to download (e.g., `qwen3:8b,deepseek-r1:8b`)

### Environment Variables

| Variable | Default | Description |
| -------- | ------- | ----------- |
| `OLLAMA_NAMESPACE` | `ollama` | Kubernetes namespace |
| `OLLAMA_CHART_VERSION` | `1.35.0` | Helm chart version |
| `OLLAMA_GPU_ENABLED` | (prompt) | Enable GPU support (`true`/`false`) |
| `OLLAMA_GPU_TYPE` | `nvidia` | GPU type (`nvidia` or `amd`) |
| `OLLAMA_GPU_COUNT` | `1` | Number of GPUs to allocate |
| `OLLAMA_MODELS` | (prompt) | Comma-separated list of models |
| `OLLAMA_STORAGE_SIZE` | `30Gi` | Persistent volume size for models |

### Example with Environment Variables

```bash
OLLAMA_GPU_ENABLED=true OLLAMA_MODELS="qwen3:8b,llama3.2:3b" just ollama::install
```

## Model Management

### List Models

```bash
just ollama::list
```

### Pull a Model

```bash
just ollama::pull qwen3:8b
```

### Run Interactive Chat

```bash
just ollama::run qwen3:8b
```

### Check Status

```bash
just ollama::status
```

### View Logs

```bash
just ollama::logs
```

Browse available models at [ollama.com/library](https://ollama.com/library).

## API Usage

Ollama exposes an OpenAI-compatible API at `http://ollama.ollama.svc.cluster.local:11434`.
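The service URL above resolves only from inside the cluster. For quick testing from a workstation, one option is `kubectl port-forward`; a minimal sketch, assuming the chart exposes a `Service` named `ollama` in the `ollama` namespace (as the DNS name above implies):

```bash
# Forward the in-cluster service to localhost (leave this running in one terminal)
kubectl port-forward -n ollama svc/ollama 11434:11434

# In another terminal, the same endpoints are reachable via localhost
curl http://localhost:11434/api/tags
```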
### OpenAI-Compatible Endpoint

```bash
curl http://ollama.ollama.svc.cluster.local:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:8b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

### Native Ollama API

```bash
# Generate completion
curl http://ollama.ollama.svc.cluster.local:11434/api/generate \
  -d '{"model": "qwen3:8b", "prompt": "Hello!"}'

# Chat completion
curl http://ollama.ollama.svc.cluster.local:11434/api/chat \
  -d '{"model": "qwen3:8b", "messages": [{"role": "user", "content": "Hello!"}]}'

# List models
curl http://ollama.ollama.svc.cluster.local:11434/api/tags
```

### Python Client

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://ollama.ollama.svc.cluster.local:11434/v1",
    api_key="ollama"  # Required by the client but ignored by Ollama
)

response = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
```

## GPU Verification

Check whether the GPU is being used:

```bash
kubectl exec -n ollama deploy/ollama -- ollama ps
```

Expected output with GPU:

```text
NAME        ID              SIZE      PROCESSOR    CONTEXT    UNTIL
qwen3:8b    500a1f067a9f    6.0 GB    100% GPU     4096       4 minutes from now
```

If `PROCESSOR` shows `100% CPU`, see the Troubleshooting section.

## Integration with LibreChat

Ollama integrates with [LibreChat](../librechat/README.md) for a web-based chat interface:

```bash
just librechat::install
```

LibreChat automatically connects to Ollama using the internal Kubernetes service URL.

## GPU Time-Slicing

To share a single GPU across multiple pods, enable time-slicing in the NVIDIA Device Plugin:

```bash
GPU_TIME_SLICING_REPLICAS=4 just nvidia-device-plugin::install
```

This allows up to 4 pods to share the same GPU (e.g., Ollama + JupyterHub notebooks).

## Upgrade

```bash
just ollama::upgrade
```

## Uninstall

```bash
just ollama::uninstall
```

This removes the Helm release and namespace. Pulled models are deleted along with the PVC.

## Troubleshooting

### Model Running on CPU Instead of GPU

**Symptom**: `ollama ps` shows `100% CPU` instead of `100% GPU`

**Cause**: Missing `runtimeClassName: nvidia` in the pod spec

**Solution**: Ensure `OLLAMA_GPU_ENABLED=true` and upgrade:

```bash
OLLAMA_GPU_ENABLED=true just ollama::upgrade
```

The Helm values include `runtimeClassName: nvidia` when GPU support is enabled.

### GPU Not Detected in Pod

**Check GPU devices in the pod**:

```bash
kubectl exec -n ollama deploy/ollama -- ls -la /dev/nvidia*
```

If no devices are found:

1. Verify the NVIDIA Device Plugin is running:

   ```bash
   just nvidia-device-plugin::verify
   ```

2. Check that the RuntimeClass exists:

   ```bash
   kubectl get runtimeclass nvidia
   ```

3. Restart Ollama to pick up the GPU:

   ```bash
   kubectl rollout restart deployment/ollama -n ollama
   ```

### Model Download Slow or Failing

**Check pod logs**:

```bash
just ollama::logs
```

**Increase storage if needed** by setting `OLLAMA_STORAGE_SIZE`:

```bash
OLLAMA_STORAGE_SIZE=50Gi just ollama::upgrade
```

### Out of Memory Errors

**Symptom**: Model fails to load with an OOM error

**Solutions**:

1. Use a smaller quantized model (e.g., `qwen3:8b` instead of `qwen3:14b`)
2. Reduce the context size in your API requests (see the sketch after this list)
3. Upgrade to a GPU with more VRAM
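For solution 2, the native Ollama API accepts an `options` object on `/api/chat` and `/api/generate`, where `num_ctx` sets the context window. A minimal sketch, with `2048` as an illustrative value rather than a recommendation:

```bash
# Request a smaller context window to lower memory use when loading the model
curl http://ollama.ollama.svc.cluster.local:11434/api/chat \
  -d '{
    "model": "qwen3:8b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "options": {"num_ctx": 2048}
  }'
```

This combines well with solution 1, since the model weights and the KV cache compete for the same VRAM.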
## References

- [Ollama Website](https://ollama.com/)
- [Ollama Model Library](https://ollama.com/library)
- [Ollama GitHub](https://github.com/ollama/ollama)
- [Ollama Helm Chart](https://github.com/otwld/ollama-helm)
- [OpenAI API Compatibility](https://ollama.com/blog/openai-compatibility)