# Ollama

Local LLM inference server for running open-source models:

- **Local Inference**: Run LLMs locally without external API dependencies
- **GPU Acceleration**: NVIDIA GPU support with automatic runtime configuration
- **Model Library**: Access to thousands of open-source models (Llama, Qwen, DeepSeek, Mistral, etc.)
- **OpenAI-Compatible API**: Drop-in replacement for OpenAI API endpoints
- **Persistent Storage**: Models stored persistently across restarts

## Prerequisites

For GPU support, ensure the NVIDIA Device Plugin is installed:

```bash
just nvidia-device-plugin::install
```

See [nvidia-device-plugin/README.md](../nvidia-device-plugin/README.md) for host system requirements.
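
To confirm the plugin is advertising GPU resources to the scheduler, you can list each node's allocatable `nvidia.com/gpu` count (an optional sanity check using plain `kubectl`):

```bash
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
```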

## Installation

```bash
just ollama::install
```

During installation, you will be prompted for:

- **GPU support**: Enable/disable NVIDIA GPU acceleration
- **Models to pull**: Comma-separated list of models to download (e.g., `qwen3:8b,deepseek-r1:8b`)

### Environment Variables

| Variable | Default | Description |
| -------- | ------- | ----------- |
| `OLLAMA_NAMESPACE` | `ollama` | Kubernetes namespace |
| `OLLAMA_CHART_VERSION` | `1.35.0` | Helm chart version |
| `OLLAMA_GPU_ENABLED` | (prompt) | Enable GPU support (`true`/`false`) |
| `OLLAMA_GPU_TYPE` | `nvidia` | GPU type (`nvidia` or `amd`) |
| `OLLAMA_GPU_COUNT` | `1` | Number of GPUs to allocate |
| `OLLAMA_MODELS` | (prompt) | Comma-separated list of models |
| `OLLAMA_STORAGE_SIZE` | `30Gi` | Persistent volume size for models |

### Example with Environment Variables

```bash
OLLAMA_GPU_ENABLED=true OLLAMA_MODELS="qwen3:8b,llama3.2:3b" just ollama::install
```

## Model Management

### List Models

```bash
just ollama::list
```

### Pull a Model

```bash
just ollama::pull qwen3:8b
```

### Run Interactive Chat

```bash
just ollama::run qwen3:8b
```

### Check Status

```bash
just ollama::status
```

### View Logs

```bash
just ollama::logs
```

Browse available models at [ollama.com/library](https://ollama.com/library).

## API Usage

Ollama exposes an OpenAI-compatible API at `http://ollama.ollama.svc.cluster.local:11434`.
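
To try the API from outside the cluster, you can forward the service to your workstation first (a quick sketch, assuming the Service is named `ollama` in the `ollama` namespace, as in the URL above):

```bash
kubectl port-forward -n ollama svc/ollama 11434:11434
# The API is then reachable at http://localhost:11434
```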

### OpenAI-Compatible Endpoint

```bash
curl http://ollama.ollama.svc.cluster.local:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:8b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

### Native Ollama API

```bash
# Generate completion
curl http://ollama.ollama.svc.cluster.local:11434/api/generate \
  -d '{"model": "qwen3:8b", "prompt": "Hello!"}'

# Chat completion
curl http://ollama.ollama.svc.cluster.local:11434/api/chat \
  -d '{"model": "qwen3:8b", "messages": [{"role": "user", "content": "Hello!"}]}'

# List models
curl http://ollama.ollama.svc.cluster.local:11434/api/tags
```

### Python Client

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://ollama.ollama.svc.cluster.local:11434/v1",
    api_key="ollama",  # Required but ignored
)

response = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
```
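
For longer generations, the same client can stream tokens as they arrive (a small sketch reusing the `client` from above via the standard OpenAI SDK streaming interface):

```python
stream = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,  # yield chunks instead of waiting for the full response
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```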

## GPU Verification

Check whether the GPU is being used:

```bash
kubectl exec -n ollama deploy/ollama -- ollama ps
```

Expected output with GPU:

```text
NAME        ID              SIZE      PROCESSOR    CONTEXT    UNTIL
qwen3:8b    500a1f067a9f    6.0 GB    100% GPU     4096       4 minutes from now
```

If `PROCESSOR` shows `100% CPU`, see the [Troubleshooting](#troubleshooting) section below.

## Integration with LibreChat

Ollama integrates with [LibreChat](../librechat/README.md) for a web-based chat interface:

```bash
just librechat::install
```

LibreChat automatically connects to Ollama using the internal Kubernetes service URL.

## GPU Time-Slicing

To share a single GPU across multiple pods, enable time-slicing in the NVIDIA Device Plugin:

```bash
GPU_TIME_SLICING_REPLICAS=4 just nvidia-device-plugin::install
```

This allows up to 4 pods to share the same GPU (e.g., Ollama + JupyterHub notebooks).

## Upgrade

```bash
just ollama::upgrade
```

## Uninstall

```bash
just ollama::uninstall
```

This removes the Helm release and namespace. Pulled models are deleted with the PVC.
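
To confirm what will be removed, you can list the persistent volume claims in the namespace before uninstalling (plain `kubectl`, assuming only the `ollama` namespace):

```bash
kubectl get pvc -n ollama
```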

## Troubleshooting

### Model Running on CPU Instead of GPU

**Symptom**: `ollama ps` shows `100% CPU` instead of `100% GPU`

**Cause**: Missing `runtimeClassName: nvidia` in the pod spec

**Solution**: Ensure `OLLAMA_GPU_ENABLED=true` and upgrade:

```bash
OLLAMA_GPU_ENABLED=true just ollama::upgrade
```

The Helm values include `runtimeClassName: nvidia` when GPU is enabled.
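
To verify the setting landed on the Deployment, you can inspect the rendered pod template (assuming the Deployment is named `ollama`, as used elsewhere in this document):

```bash
kubectl get deploy ollama -n ollama -o jsonpath='{.spec.template.spec.runtimeClassName}'
# Expected output: nvidia
```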

### GPU Not Detected in Pod

**Check GPU devices in pod**:

```bash
kubectl exec -n ollama deploy/ollama -- ls -la /dev/nvidia*
```

If no devices are found:

1. Verify the NVIDIA Device Plugin is running:

   ```bash
   just nvidia-device-plugin::verify
   ```

2. Check that the `nvidia` RuntimeClass exists:

   ```bash
   kubectl get runtimeclass nvidia
   ```

3. Restart Ollama to pick up the GPU:

   ```bash
   kubectl rollout restart deployment/ollama -n ollama
   ```

### Model Download Slow or Failing

**Check pod logs**:

```bash
just ollama::logs
```
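
You can also check whether the model volume is running low on space (a rough check; it assumes the chart mounts the model store at the Ollama default path `/root/.ollama`):

```bash
kubectl exec -n ollama deploy/ollama -- df -h /root/.ollama
```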

**Increase storage if needed** by setting `OLLAMA_STORAGE_SIZE`:

```bash
OLLAMA_STORAGE_SIZE=50Gi just ollama::upgrade
```

### Out of Memory Errors

**Symptom**: Model fails to load with an out-of-memory (OOM) error

**Solutions**:

1. Use a smaller quantized model (e.g., `qwen3:8b` instead of `qwen3:14b`)
2. Reduce the context size in your API requests (see the sketch below)
3. Upgrade to a GPU with more VRAM
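
As an example of the second option, the native Ollama API accepts a `num_ctx` option that caps the context window for a request; a smaller value reduces memory use (the value below is only illustrative):

```bash
curl http://ollama.ollama.svc.cluster.local:11434/api/generate \
  -d '{"model": "qwen3:8b", "prompt": "Hello!", "options": {"num_ctx": 2048}}'
```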

## References

- [Ollama Website](https://ollama.com/)
- [Ollama Model Library](https://ollama.com/library)
- [Ollama GitHub](https://github.com/ollama/ollama)
- [Ollama Helm Chart](https://github.com/otwld/ollama-helm)
- [OpenAI API Compatibility](https://ollama.com/blog/openai-compatibility)