docs(ollama): add ollama/README.md
# Ollama

Local LLM inference server for running open-source models:

- **Local Inference**: Run LLMs locally without external API dependencies
- **GPU Acceleration**: NVIDIA GPU support with automatic runtime configuration
- **Model Library**: Access to thousands of open-source models (Llama, Qwen, DeepSeek, Mistral, etc.)
- **OpenAI-Compatible API**: Drop-in replacement for OpenAI API endpoints
- **Persistent Storage**: Models are kept on a persistent volume and survive restarts

## Prerequisites

For GPU support, ensure the NVIDIA Device Plugin is installed:

```bash
just nvidia-device-plugin::install
```

See [nvidia-device-plugin/README.md](../nvidia-device-plugin/README.md) for host system requirements.

## Installation

```bash
just ollama::install
```

During installation, you will be prompted for:

- **GPU support**: Enable/disable NVIDIA GPU acceleration
- **Models to pull**: Comma-separated list of models to download (e.g., `qwen3:8b,deepseek-r1:8b`)

### Environment Variables

| Variable | Default | Description |
| -------- | ------- | ----------- |
| `OLLAMA_NAMESPACE` | `ollama` | Kubernetes namespace |
| `OLLAMA_CHART_VERSION` | `1.35.0` | Helm chart version |
| `OLLAMA_GPU_ENABLED` | (prompt) | Enable GPU support (`true`/`false`) |
| `OLLAMA_GPU_TYPE` | `nvidia` | GPU type (`nvidia` or `amd`) |
| `OLLAMA_GPU_COUNT` | `1` | Number of GPUs to allocate |
| `OLLAMA_MODELS` | (prompt) | Comma-separated list of models |
| `OLLAMA_STORAGE_SIZE` | `30Gi` | Persistent volume size for models |

### Example with Environment Variables

```bash
OLLAMA_GPU_ENABLED=true OLLAMA_MODELS="qwen3:8b,llama3.2:3b" just ollama::install
```

## Model Management

### List Models

```bash
just ollama::list
```

### Pull a Model

```bash
just ollama::pull qwen3:8b
```

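The recipe pulls the model inside the running Ollama pod. If you prefer not to go through `just`, the same effect can be achieved directly with `kubectl exec` (using the deployment and namespace referenced elsewhere in this README):

```bash
# Run "ollama pull" inside the Ollama pod, equivalent to the recipe above
kubectl exec -n ollama deploy/ollama -- ollama pull qwen3:8b
```
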
### Run Interactive Chat

```bash
just ollama::run qwen3:8b
```

### Check Status

```bash
just ollama::status
```

### View Logs

```bash
just ollama::logs
```

Browse available models at [ollama.com/library](https://ollama.com/library).

## API Usage

Ollama is reachable inside the cluster at `http://ollama.ollama.svc.cluster.local:11434` and exposes an OpenAI-compatible API alongside its native API.

### OpenAI-Compatible Endpoint

```bash
curl http://ollama.ollama.svc.cluster.local:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:8b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

### Native Ollama API

```bash
# Generate completion
curl http://ollama.ollama.svc.cluster.local:11434/api/generate \
  -d '{"model": "qwen3:8b", "prompt": "Hello!"}'

# Chat completion
curl http://ollama.ollama.svc.cluster.local:11434/api/chat \
  -d '{"model": "qwen3:8b", "messages": [{"role": "user", "content": "Hello!"}]}'

# List models
curl http://ollama.ollama.svc.cluster.local:11434/api/tags
```

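Note that the native `/api/generate` and `/api/chat` endpoints stream newline-delimited JSON objects by default; set `"stream": false` to receive a single JSON response instead:

```bash
# Ask for one complete JSON object rather than a stream of partial responses
curl http://ollama.ollama.svc.cluster.local:11434/api/generate \
  -d '{"model": "qwen3:8b", "prompt": "Hello!", "stream": false}'
```
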
### Python Client

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://ollama.ollama.svc.cluster.local:11434/v1",
    api_key="ollama"  # Required by the client but ignored by Ollama
)

response = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
```

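For token-by-token output, the same client supports streaming via `stream=True`; a brief self-contained sketch using the standard `openai` client:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://ollama.ollama.svc.cluster.local:11434/v1",
    api_key="ollama",  # Required by the client but ignored by Ollama
)

stream = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental piece of the assistant's reply
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()
```
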
## GPU Verification

Check whether the GPU is being used:

```bash
kubectl exec -n ollama deploy/ollama -- ollama ps
```

Expected output with GPU:

```text
NAME        ID              SIZE      PROCESSOR    CONTEXT    UNTIL
qwen3:8b    500a1f067a9f    6.0 GB    100% GPU     4096       4 minutes from now
```

If `PROCESSOR` shows `100% CPU`, see the Troubleshooting section.

## Integration with LibreChat

Ollama integrates with [LibreChat](../librechat/README.md) for a web-based chat interface:

```bash
just librechat::install
```

LibreChat automatically connects to Ollama using the internal Kubernetes service URL.

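For reference, a custom endpoint for Ollama in LibreChat's `librechat.yaml` typically looks roughly like the sketch below (field names follow LibreChat's custom-endpoint convention; the `just` recipe may generate an equivalent configuration for you, so treat this as illustrative only):

```yaml
endpoints:
  custom:
    - name: "Ollama"
      baseURL: "http://ollama.ollama.svc.cluster.local:11434/v1"
      apiKey: "ollama"        # required field, ignored by Ollama
      models:
        default: ["qwen3:8b"]
        fetch: true           # discover installed models automatically
```
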
## GPU Time-Slicing

To share a single GPU across multiple pods, enable time-slicing in the NVIDIA Device Plugin:

```bash
GPU_TIME_SLICING_REPLICAS=4 just nvidia-device-plugin::install
```

This allows up to 4 pods to share the same GPU (e.g., Ollama + JupyterHub notebooks).

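Under the hood, the device plugin reads a time-slicing configuration along these lines (a sketch of what the recipe presumably renders; the actual ConfigMap is managed by `nvidia-device-plugin::install`):

```yaml
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4   # advertise 4 schedulable "GPUs" backed by one physical GPU
```
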
## Upgrade

```bash
just ollama::upgrade
```

## Uninstall

```bash
just ollama::uninstall
```

This removes the Helm release and namespace. Pulled models are deleted along with the PVC.

## Troubleshooting

### Model Running on CPU Instead of GPU

**Symptom**: `ollama ps` shows `100% CPU` instead of `100% GPU`.

**Cause**: Missing `runtimeClassName: nvidia` in the pod spec.

**Solution**: Ensure `OLLAMA_GPU_ENABLED=true` and upgrade:

```bash
OLLAMA_GPU_ENABLED=true just ollama::upgrade
```

The Helm values include `runtimeClassName: nvidia` when GPU support is enabled.

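For orientation, the GPU-related values passed to the chart look roughly like the following (a sketch based on the otwld `ollama-helm` value layout; the recipe fills these in from the environment variables above, so consult the chart's `values.yaml` for the authoritative keys):

```yaml
# Illustrative only: values the install/upgrade recipe is expected to set
runtimeClassName: nvidia
ollama:
  gpu:
    enabled: true
    type: nvidia   # OLLAMA_GPU_TYPE
    number: 1      # OLLAMA_GPU_COUNT
```
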
### GPU Not Detected in Pod

**Check GPU devices in the pod**:

```bash
kubectl exec -n ollama deploy/ollama -- ls -la /dev/nvidia*
```

If no devices are found:

1. Verify the NVIDIA Device Plugin is running:

   ```bash
   just nvidia-device-plugin::verify
   ```

2. Check that the RuntimeClass exists:

   ```bash
   kubectl get runtimeclass nvidia
   ```

3. Restart Ollama to pick up the GPU:

   ```bash
   kubectl rollout restart deployment/ollama -n ollama
   ```

### Model Download Slow or Failing

**Check pod logs**:

```bash
just ollama::logs
```

**Increase storage if needed** by setting `OLLAMA_STORAGE_SIZE`:

```bash
OLLAMA_STORAGE_SIZE=50Gi just ollama::upgrade
```

### Out of Memory Errors

**Symptom**: A model fails to load with an OOM error.

**Solutions**:

1. Use a smaller quantized model (e.g., `qwen3:8b` instead of `qwen3:14b`)
2. Reduce the context size in your API requests (see the example below)
3. Upgrade to a GPU with more VRAM

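For item 2, the native API accepts an `options` object per request; `num_ctx` caps the context window (the value below is only an example):

```bash
# Request a smaller context window to reduce memory usage
curl http://ollama.ollama.svc.cluster.local:11434/api/chat \
  -d '{
    "model": "qwen3:8b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "options": {"num_ctx": 2048}
  }'
```
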
## References

- [Ollama Website](https://ollama.com/)
- [Ollama Model Library](https://ollama.com/library)
- [Ollama GitHub](https://github.com/ollama/ollama)
- [Ollama Helm Chart](https://github.com/otwld/ollama-helm)
- [OpenAI API Compatibility](https://ollama.com/blog/openai-compatibility)