# MLflow

An open-source platform for managing the end-to-end machine learning lifecycle, deployed here with Keycloak OIDC authentication.

## Overview

This module deploys MLflow using the Community Charts Helm chart with:

- **Keycloak OIDC authentication** for user login
- **Custom Docker image** with the mlflow-oidc-auth plugin
- **PostgreSQL backend** for tracking server and auth databases
- **MinIO/S3 artifact storage** with proxied access
- **FastAPI/ASGI server** with Uvicorn for production
- **HTTPS reverse proxy support** via Traefik
- **Group-based access control** via Keycloak groups
- **Prometheus metrics** for monitoring

## Prerequisites

- Kubernetes cluster (k3s)
- Keycloak installed and configured
- PostgreSQL cluster (CloudNativePG)
- MinIO object storage
- External Secrets Operator (optional, for Vault integration)
- Docker registry (local or remote)

## Installation

### Basic Installation

1. **Build and Push Custom MLflow Image**:

   Set `DOCKER_HOST` to your remote Docker host (where k3s is running):

   ```bash
   export DOCKER_HOST=ssh://yourhost.com
   just mlflow::build-and-push-image
   ```

   This builds a custom MLflow image with the OIDC auth plugin and pushes it to your k3s registry.

2. **Install MLflow**:

   ```bash
   just mlflow::install
   ```

   You will be prompted for:

   - **MLflow host (FQDN)**: e.g., `mlflow.example.com`

### What Gets Installed

- MLflow tracking server (FastAPI with OIDC)
- PostgreSQL databases:
  - `mlflow` - Experiment tracking, models, and runs
  - `mlflow_auth` - User authentication and permissions
- PostgreSQL user `mlflow` with access to both databases
- MinIO bucket `mlflow` for artifact storage
- Custom MLflow Docker image with OIDC auth plugin
- Keycloak OAuth client (confidential client)
- Keycloak groups:
  - `mlflow-admins` - Full administrative access
  - `mlflow-users` - Basic user access

## Configuration

### Docker Build Environment

For building and pushing the custom MLflow image:

```bash
DOCKER_HOST=ssh://yourhost.com           # Remote Docker host (where k3s is running)
IMAGE_REGISTRY=localhost:30500           # k3s local registry
```

### Deployment Configuration

Environment variables (set in `.env.local` or override):

```bash
MLFLOW_NAMESPACE=mlflow                  # Kubernetes namespace
MLFLOW_CHART_VERSION=1.8.0               # Helm chart version
MLFLOW_HOST=mlflow.example.com           # External hostname
MLFLOW_IMAGE_TAG=3.6.0-oidc              # Custom image tag
MLFLOW_IMAGE_PULL_POLICY=IfNotPresent    # Image pull policy
KEYCLOAK_HOST=auth.example.com           # Keycloak hostname
KEYCLOAK_REALM=buunstack                 # Keycloak realm name
```

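As a rough illustration of how these settings relate, the following Python sketch reads three of the variables with the defaults shown above and derives the URLs a client or operator would use. `MlflowSettings` and `load_settings` are hypothetical names introduced for this example, and the discovery URL assumes a Keycloak version that serves realms under `/realms/<realm>` (older releases used an additional `/auth` prefix).

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class MlflowSettings:
    """Hypothetical container for some of the deployment variables above."""

    mlflow_host: str
    keycloak_host: str
    keycloak_realm: str

    @property
    def tracking_uri(self) -> str:
        # External URL that clients pass to mlflow.set_tracking_uri().
        return f"https://{self.mlflow_host}"

    @property
    def discovery_url(self) -> str:
        # Assumption: Keycloak serves realms under /realms/<realm>.
        return (
            f"https://{self.keycloak_host}/realms/{self.keycloak_realm}"
            "/.well-known/openid-configuration"
        )


def load_settings(env=None) -> MlflowSettings:
    """Read settings from a mapping; defaults to the process environment."""
    env = os.environ if env is None else env
    return MlflowSettings(
        mlflow_host=env.get("MLFLOW_HOST", "mlflow.example.com"),
        keycloak_host=env.get("KEYCLOAK_HOST", "auth.example.com"),
        keycloak_realm=env.get("KEYCLOAK_REALM", "buunstack"),
    )
```

Passing an explicit mapping instead of relying on `os.environ` keeps the helper easy to exercise without touching the real environment.
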
### Architecture Notes

**MLflow 3.6.0 with OIDC**:

- Uses `mlflow-oidc-auth[full]==5.6.1` plugin
- FastAPI/ASGI server with Uvicorn (not Gunicorn)
- Server type: `oidc-auth-fastapi` for ASGI compatibility
- Session management: `cachelib` with filesystem backend
- Custom Docker image built from `burakince/mlflow:3.6.0`

**Authentication Flow**:

- OIDC Discovery: `/.well-known/openid-configuration`
- Redirect URI: `/callback` (not `/oidc/callback`)
- Required scopes: `openid profile email groups`
- Group attribute: `groups` from UserInfo

**Database Structure**:

- `mlflow` database: Experiment tracking, models, parameters, metrics
- `mlflow_auth` database: User accounts, groups, permissions

## Usage

### Access MLflow

1. Navigate to `https://your-mlflow-host/`
2. Click the "Keycloak" button to authenticate
3. After successful login:
   - You are first redirected to the Permissions Management UI (`/oidc/ui/`)
   - Click the "MLflow" button to open the main MLflow UI

### Grant Admin Access

Add users to the `mlflow-admins` group:

```bash
just keycloak::add-user-to-group <username> mlflow-admins
```

Admin users have full privileges including:

- Experiment and model management
- User and permission management
- Access to all experiments and models

### Log Experiments

#### Using Python Client

```python
import mlflow

# Set tracking URI
mlflow.set_tracking_uri("https://mlflow.example.com")

# Select the experiment (created if it does not exist yet)
mlflow.set_experiment("my-experiment")

# Log parameters, metrics, and artifacts
with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_metric("accuracy", 0.95)
    mlflow.log_artifact("model.pkl")  # path to an existing local file
```

#### Authentication for API Access

For programmatic access, create an access token:

1. Log in to MLflow UI
2. Navigate to Permissions UI → Create access token
3. Use token in your code:

   ```python
   import os
   os.environ["MLFLOW_TRACKING_TOKEN"] = "your-token"
   ```

### Model Registry

Register and manage models:

```python
import mlflow
from mlflow.tracking import MlflowClient

# Register model
mlflow.register_model(
    model_uri="runs:/<run-id>/model",
    name="my-model"
)

# Transition model stage
# Note: model stages are deprecated in recent MLflow releases in favor of
# aliases (see MlflowClient.set_registered_model_alias).
client = MlflowClient()
client.transition_model_version_stage(
    name="my-model",
    version=1,
    stage="Production"
)
```

## Features

- **Experiment Tracking**: Log parameters, metrics, and artifacts
- **Model Registry**: Version and manage ML models
- **Model Serving**: Deploy models as REST APIs
- **Project Reproducibility**: Package code, data, and environment
- **Remote Execution**: Run experiments on remote platforms
- **UI Dashboard**: Visual experiment comparison and analysis
- **LLM Tracking**: Track LLM applications with traces
- **Prompt Registry**: Manage and version prompts

## Architecture

```plain
External Users
      ↓
Cloudflare Tunnel (HTTPS)
      ↓
Traefik Ingress (HTTPS)
      ↓
MLflow Server (HTTP inside cluster)
  ├─ FastAPI/ASGI (Uvicorn)
  ├─ mlflow-oidc-auth plugin
  │   ├─ OAuth → Keycloak (authentication)
  │   └─ Session → FileSystemCache
  ├─ PostgreSQL (metadata)
  │   ├─ mlflow (tracking)
  │   └─ mlflow_auth (users/groups)
  └─ MinIO (artifacts via proxied access)
```

**Key Components**:

- **Server Type**: `oidc-auth-fastapi` for FastAPI/ASGI compatibility
- **Allowed Hosts**: Validates `Host` header for security
- **Session Backend**: Cachelib with filesystem storage
- **Artifact Storage**: Proxied through MLflow server (no direct S3 access needed)

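The allowed-hosts check can be pictured as comparing the request's `Host` header against a configured list. The sketch below is a minimal illustration of that idea, not the server's actual implementation; `is_allowed_host` and the host list passed to it are hypothetical.

```python
def is_allowed_host(host_header: str, allowed_hosts: list[str]) -> bool:
    """Return True when the Host header names one of the allowed hosts."""
    # Hostnames are case-insensitive, so compare in lower case.
    hostname = host_header.strip().lower()
    if hostname.startswith("["):
        # Bracketed IPv6 literal such as "[::1]:5000": keep only "[::1]".
        hostname = hostname.split("]", 1)[0] + "]"
    else:
        # Drop an optional ":port" suffix.
        hostname = hostname.rsplit(":", 1)[0]
    return hostname in {allowed.lower() for allowed in allowed_hosts}
```

A request carrying `Host: mlflow.example.com:443` would pass when `mlflow.example.com` is in the list, while an unexpected host would be rejected before reaching the application.
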
## Authentication

### User Login (OIDC)

- Users authenticate via Keycloak
- Standard OIDC flow with Authorization Code grant
- Group membership retrieved from `groups` claim in UserInfo
- Users automatically created on first login

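For orientation, the first leg of the Authorization Code flow is a browser redirect to Keycloak's authorization endpoint. The sketch below builds such a URL with the standard library. It illustrates the standard OIDC request parameters rather than the plugin's internals: `build_authorization_url` is a hypothetical helper, the client ID `mlflow` is a placeholder, and the `/realms/<realm>/protocol/openid-connect/auth` path is an assumption about the Keycloak version in use (the authoritative value is `authorization_endpoint` in the discovery document).

```python
import secrets
from urllib.parse import urlencode


def build_authorization_url(keycloak_host, realm, client_id, mlflow_host, state):
    """Build the URL the browser is redirected to at the start of login."""
    params = {
        "response_type": "code",  # Authorization Code grant
        "client_id": client_id,
        "redirect_uri": f"https://{mlflow_host}/callback",
        "scope": "openid profile email groups",
        "state": state,  # opaque value echoed back to the callback
    }
    base = f"https://{keycloak_host}/realms/{realm}/protocol/openid-connect/auth"
    return f"{base}?{urlencode(params)}"


url = build_authorization_url(
    keycloak_host="auth.example.com",
    realm="buunstack",
    client_id="mlflow",  # placeholder client ID
    mlflow_host="mlflow.example.com",
    state=secrets.token_urlsafe(16),
)
```

After the user authenticates, Keycloak redirects back to `/callback` with a `code` parameter, which the server exchanges for tokens.
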
### Access Control

**Group-based Permissions**:

```python
OIDC_ADMIN_GROUP_NAME = "mlflow-admins"
OIDC_GROUP_NAME = "mlflow-admins,mlflow-users"
```

**Default Permissions**:

- New resources: `MANAGE` permission for creator
- Admins: Full access to all resources
- Users: Access based on explicit permissions

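The effect of the two group settings can be sketched as a small decision function. This is an illustrative model of the rules described in this section, not code from mlflow-oidc-auth; `resolve_access` and its return values are hypothetical.

```python
OIDC_ADMIN_GROUP_NAME = "mlflow-admins"
OIDC_GROUP_NAME = "mlflow-admins,mlflow-users"


def resolve_access(user_groups):
    """Return "admin", "user", or "denied" for a user's Keycloak groups."""
    groups = set(user_groups)
    # OIDC_GROUP_NAME is a comma-separated list of groups allowed to log in.
    allowed = {name.strip() for name in OIDC_GROUP_NAME.split(",") if name.strip()}
    if OIDC_ADMIN_GROUP_NAME in groups:
        return "admin"
    if groups & allowed:
        return "user"
    return "denied"
```

A user in neither group falls through to `"denied"`, which corresponds to the login being rejected.
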
### Permission Management

Access the Permissions UI at `/oidc/ui/`:

- View and manage user permissions
- Assign permissions to experiments, models, and prompts
- Create and manage groups
- View audit logs

## Management

### Rebuild Custom Image

If you need to update the custom MLflow image:

```bash
export DOCKER_HOST=ssh://yourhost.com
just mlflow::build-and-push-image
```

After rebuilding, restart MLflow to use the new image:

```bash
kubectl rollout restart deployment/mlflow -n mlflow
```

### Upgrade MLflow

```bash
just mlflow::upgrade
```

This updates the Helm deployment with the current configuration.

### Uninstall

```bash
# Keep PostgreSQL databases
just mlflow::uninstall false

# Delete PostgreSQL databases and user
just mlflow::uninstall true
```

### Clean Up All Resources

```bash
just mlflow::cleanup
```

This deletes databases, users, secrets, and the Keycloak client (with confirmation).

## Troubleshooting

### Check Pod Status

```bash
kubectl get pods -n mlflow
```

Expected pods:

- `mlflow-*` - Main application (1 replica)
- `mlflow-db-migration-*` - Database migration (Completed)
- `mlflow-dbchecker-*` - Database connection check (Completed)

### OAuth Login Fails

#### Redirect Loop (Returns to Login Page)

**Symptoms**: The user authenticates with Keycloak but is returned to the login page

**Common Causes**:

1. **Redirect URI Mismatch**:
   - Check that the Keycloak client redirect URI matches `/callback`
   - Verify `OIDC_REDIRECT_URI` is `https://{host}/callback`

2. **Missing Groups Scope**:
   - Ensure the `groups` scope is added to the Keycloak client
   - Check that the groups mapper is configured in Keycloak

3. **Group Membership**:
   - User must be in the `mlflow-admins` or `mlflow-users` group
   - Add user to group: `just keycloak::add-user-to-group <user> mlflow-admins`

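When diagnosing the first cause, it can help to compare the URI MLflow sends with the patterns registered on the Keycloak client. The sketch below assumes Keycloak's convention that a registered redirect URI may end in a `*` wildcard; `redirect_uri_matches` is a hypothetical helper, and Keycloak's own matching may differ in edge cases.

```python
def redirect_uri_matches(registered_uris, actual_uri):
    """Return True if actual_uri matches any registered redirect URI."""
    for pattern in registered_uris:
        if pattern.endswith("*"):
            # Trailing wildcard: everything before "*" must be a prefix.
            if actual_uri.startswith(pattern[:-1]):
                return True
        elif actual_uri == pattern:
            return True
    return False


# A client registered with the old /oidc/callback path would not match.
registered = ["https://mlflow.example.com/oidc/callback"]
matches = redirect_uri_matches(registered, "https://mlflow.example.com/callback")
```

Here `matches` is `False`, which is the mismatch described in the first cause.
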
#### Session Errors

**Error**: `Session module for filesystem could not be imported`

**Solution**: Ensure the session configuration is correct:

```yaml
SESSION_TYPE: "cachelib"
SESSION_CACHE_DIR: "/tmp/session"
```

#### Group Detection Errors

**Error**: `Group detection error: No module named 'oidc'`

**Solution**: Remove the `OIDC_GROUP_DETECTION_PLUGIN` setting so that it is unset

### Server Type Errors

**Error**: `TypeError: Flask.__call__() missing 1 required positional argument: 'start_response'`

**Cause**: The Flask (WSGI) server type is being used with Uvicorn (ASGI)

**Solution**: Ensure `appName: "oidc-auth-fastapi"` is set in the Helm values

### Database Connection Issues

Check database credentials:

```bash
kubectl get secret mlflow-db-secret -n mlflow -o yaml
```

Test database connectivity:

```bash
kubectl exec -n mlflow deployment/mlflow -- \
  psql -h postgres-cluster-rw.postgres -U mlflow -d mlflow -c "SELECT 1"
```

### Artifact Storage Issues

Check MinIO credentials:

```bash
kubectl get secret mlflow-s3-secret -n mlflow -o yaml
```

Test MinIO connectivity:

```bash
kubectl exec -n mlflow deployment/mlflow -- \
  python -c "import boto3; import os; \
    client = boto3.client('s3', \
      endpoint_url=os.getenv('MLFLOW_S3_ENDPOINT_URL'), \
      aws_access_key_id=os.getenv('AWS_ACCESS_KEY_ID'), \
      aws_secret_access_key=os.getenv('AWS_SECRET_ACCESS_KEY')); \
    print(client.list_buckets())"
```

### Check Logs

```bash
# Application logs
kubectl logs -n mlflow deployment/mlflow --tail=100

# Database migration logs
kubectl logs -n mlflow job/mlflow-db-migration

# Real-time logs
kubectl logs -n mlflow deployment/mlflow -f
```

### Common Log Messages

**Normal**:

- `Successfully created FastAPI app with OIDC integration`
- `OIDC routes, authentication, and UI should now be available`
- `Session module for cachelib imported`
- `Redirect URI for OIDC login: https://{host}/callback`

**Issues**:

- `Group detection error` - Check OIDC configuration
- `Authorization error: User is not allowed to login` - User not in required group
- `Session error` - Session configuration issue

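A quick way to triage a log dump is to match each line against the messages listed above. The following sketch does this with plain substring checks; `classify_log_line` is a hypothetical helper, and the patterns are copied from this section rather than taken from the plugin's source.

```python
NORMAL_MESSAGES = (
    "Successfully created FastAPI app with OIDC integration",
    "OIDC routes, authentication, and UI should now be available",
    "Session module for cachelib imported",
    "Redirect URI for OIDC login:",
)

ISSUE_HINTS = {
    "Group detection error": "Check OIDC configuration",
    "Authorization error: User is not allowed to login": "User not in required group",
    "Session error": "Session configuration issue",
}


def classify_log_line(line):
    """Return ("issue", hint), ("normal", None), or ("unknown", None)."""
    # Check issues first so a problem is never masked by a normal message.
    for needle, hint in ISSUE_HINTS.items():
        if needle in line:
            return ("issue", hint)
    if any(message in line for message in NORMAL_MESSAGES):
        return ("normal", None)
    return ("unknown", None)
```

For example, the output of `kubectl logs -n mlflow deployment/mlflow` could be saved to a file and passed through `classify_log_line` line by line to surface only the entries tagged `"issue"`.
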
### Image Build Issues

If the custom image build fails:

```bash
# Set Docker host
export DOCKER_HOST=ssh://yourhost.com

# Rebuild image manually
cd /path/to/buun-stack/mlflow
just mlflow::build-and-push-image

# Check image exists on remote host
docker images localhost:30500/mlflow:3.6.0-oidc

# Test image on remote host
docker run --rm localhost:30500/mlflow:3.6.0-oidc mlflow --version
```

**Note**: All Docker commands run on the remote host specified by `DOCKER_HOST`.

## Custom Image

### Dockerfile

Located at `mlflow/image/Dockerfile`:

```dockerfile
FROM burakince/mlflow:3.6.0

# Install mlflow-oidc-auth plugin with filesystem session support
RUN pip install --no-cache-dir \
    mlflow-oidc-auth[full]==5.6.1 \
    cachelib[filesystem]
```

### Building Custom Image

**Important**: Set `DOCKER_HOST` to build on the remote k3s host:

```bash
export DOCKER_HOST=ssh://yourhost.com

just mlflow::build-image           # Build only
just mlflow::push-image            # Push only (requires prior build)
just mlflow::build-and-push-image  # Build and push
```

The image is built on the remote Docker host and pushed to the k3s local registry (`localhost:30500`).

## References

- [MLflow Documentation](https://mlflow.org/docs/latest/index.html)
- [MLflow GitHub](https://github.com/mlflow/mlflow)
- [mlflow-oidc-auth Plugin](https://github.com/mlflow-oidc/mlflow-oidc-auth)
- [mlflow-oidc-auth Documentation](https://mlflow-oidc.github.io/mlflow-oidc-auth/)
- [Community Charts MLflow](https://github.com/community-charts/helm-charts/tree/main/charts/mlflow)
- [Keycloak OIDC](https://www.keycloak.org/docs/latest/securing_apps/#_oidc)