feat(mlflow): enable authn
This commit is contained in:
484
mlflow/README.md
Normal file
484
mlflow/README.md
Normal file
@@ -0,0 +1,484 @@
|
||||
# MLflow
|
||||
|
||||
Open source platform for managing the end-to-end machine learning lifecycle with Keycloak OIDC authentication.
|
||||
|
||||
## Overview
|
||||
|
||||
This module deploys MLflow using the Community Charts Helm chart with:
|
||||
|
||||
- **Keycloak OIDC authentication** for user login
|
||||
- **Custom Docker image** with mlflow-oidc-auth plugin
|
||||
- **PostgreSQL backend** for tracking server and auth databases
|
||||
- **MinIO/S3 artifact storage** with proxied access
|
||||
- **FastAPI/ASGI server** with Uvicorn for production
|
||||
- **HTTPS reverse proxy support** via Traefik
|
||||
- **Group-based access control** via Keycloak groups
|
||||
- **Prometheus metrics** for monitoring
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- Kubernetes cluster (k3s)
|
||||
- Keycloak installed and configured
|
||||
- PostgreSQL cluster (CloudNativePG)
|
||||
- MinIO object storage
|
||||
- External Secrets Operator (optional, for Vault integration)
|
||||
- Docker registry (local or remote)
|
||||
|
||||
## Installation
|
||||
|
||||
### Basic Installation
|
||||
|
||||
1. **Build and Push Custom MLflow Image**:
|
||||
|
||||
Set `DOCKER_HOST` to your remote Docker host (where k3s is running):
|
||||
|
||||
```bash
|
||||
export DOCKER_HOST=ssh://yourhost.com
|
||||
just mlflow::build-and-push-image
|
||||
```
|
||||
|
||||
This builds a custom MLflow image with OIDC auth plugin and pushes it to your k3s registry.
|
||||
|
||||
2. **Install MLflow**:
|
||||
|
||||
```bash
|
||||
just mlflow::install
|
||||
```
|
||||
|
||||
You will be prompted for:
|
||||
|
||||
- **MLflow host (FQDN)**: e.g., `mlflow.example.com`
|
||||
|
||||
### What Gets Installed
|
||||
|
||||
- MLflow tracking server (FastAPI with OIDC)
|
||||
- PostgreSQL databases:
|
||||
- `mlflow` - Experiment tracking, models, and runs
|
||||
- `mlflow_auth` - User authentication and permissions
|
||||
- PostgreSQL user `mlflow` with access to both databases
|
||||
- MinIO bucket `mlflow` for artifact storage
|
||||
- Custom MLflow Docker image with OIDC auth plugin
|
||||
- Keycloak OAuth client (confidential client)
|
||||
- Keycloak groups:
|
||||
- `mlflow-admins` - Full administrative access
|
||||
- `mlflow-users` - Basic user access
|
||||
|
||||
## Configuration
|
||||
|
||||
### Docker Build Environment
|
||||
|
||||
For building and pushing the custom MLflow image:
|
||||
|
||||
```bash
|
||||
DOCKER_HOST=ssh://yourhost.com # Remote Docker host (where k3s is running)
|
||||
IMAGE_REGISTRY=localhost:30500 # k3s local registry
|
||||
```
|
||||
|
||||
### Deployment Configuration
|
||||
|
||||
Environment variables (set in `.env.local` or override):
|
||||
|
||||
```bash
|
||||
MLFLOW_NAMESPACE=mlflow # Kubernetes namespace
|
||||
MLFLOW_CHART_VERSION=1.8.0 # Helm chart version
|
||||
MLFLOW_HOST=mlflow.example.com # External hostname
|
||||
MLFLOW_IMAGE_TAG=3.6.0-oidc # Custom image tag
|
||||
MLFLOW_IMAGE_PULL_POLICY=IfNotPresent # Image pull policy
|
||||
KEYCLOAK_HOST=auth.example.com # Keycloak hostname
|
||||
KEYCLOAK_REALM=buunstack # Keycloak realm name
|
||||
```
|
||||
|
||||
### Architecture Notes
|
||||
|
||||
**MLflow 3.6.0 with OIDC**:
|
||||
|
||||
- Uses `mlflow-oidc-auth[full]==5.6.1` plugin
|
||||
- FastAPI/ASGI server with Uvicorn (not Gunicorn)
|
||||
- Server type: `oidc-auth-fastapi` for ASGI compatibility
|
||||
- Session management: `cachelib` with filesystem backend
|
||||
- Custom Docker image built from `burakince/mlflow:3.6.0`
|
||||
|
||||
**Authentication Flow**:
|
||||
|
||||
- OIDC Discovery: `/.well-known/openid-configuration`
|
||||
- Redirect URI: `/callback` (not `/oidc/callback`)
|
||||
- Required scopes: `openid profile email groups`
|
||||
- Group attribute: `groups` from UserInfo
|
||||
|
||||
**Database Structure**:
|
||||
|
||||
- `mlflow` database: Experiment tracking, models, parameters, metrics
|
||||
- `mlflow_auth` database: User accounts, groups, permissions
|
||||
|
||||
## Usage
|
||||
|
||||
### Access MLflow
|
||||
|
||||
1. Navigate to `https://your-mlflow-host/`
|
||||
2. Click "Keycloak" button to authenticate
|
||||
3. After successful login:
|
||||
- First redirect: Permissions Management UI (`/oidc/ui/`)
|
||||
- Click "MLflow" button: Main MLflow UI
|
||||
|
||||
### Grant Admin Access
|
||||
|
||||
Add users to the `mlflow-admins` group:
|
||||
|
||||
```bash
|
||||
just keycloak::add-user-to-group <username> mlflow-admins
|
||||
```
|
||||
|
||||
Admin users have full privileges including:
|
||||
|
||||
- Experiment and model management
|
||||
- User and permission management
|
||||
- Access to all experiments and models
|
||||
|
||||
### Log Experiments
|
||||
|
||||
#### Using Python Client
|
||||
|
||||
```python
|
||||
import mlflow
|
||||
|
||||
# Set tracking URI
|
||||
mlflow.set_tracking_uri("https://mlflow.example.com")
|
||||
|
||||
# Start experiment
|
||||
mlflow.set_experiment("my-experiment")
|
||||
|
||||
# Log parameters, metrics, and artifacts
|
||||
with mlflow.start_run():
|
||||
mlflow.log_param("learning_rate", 0.01)
|
||||
mlflow.log_metric("accuracy", 0.95)
|
||||
mlflow.log_artifact("model.pkl")
|
||||
```
|
||||
|
||||
#### Authentication for API Access
|
||||
|
||||
For programmatic access, create an access token:
|
||||
|
||||
1. Log in to MLflow UI
|
||||
2. Navigate to Permissions UI → Create access token
|
||||
3. Use token in your code:
|
||||
|
||||
```python
|
||||
import os
|
||||
os.environ["MLFLOW_TRACKING_TOKEN"] = "your-token"
|
||||
```
|
||||
|
||||
### Model Registry
|
||||
|
||||
Register and manage models:
|
||||
|
||||
```python
|
||||
# Register model
|
||||
mlflow.register_model(
|
||||
model_uri="runs:/<run-id>/model",
|
||||
name="my-model"
|
||||
)
|
||||
|
||||
# Transition model stage
|
||||
from mlflow.tracking import MlflowClient
|
||||
client = MlflowClient()
|
||||
client.transition_model_version_stage(
|
||||
name="my-model",
|
||||
version=1,
|
||||
stage="Production"
|
||||
)
|
||||
```
|
||||
|
||||
## Features
|
||||
|
||||
- **Experiment Tracking**: Log parameters, metrics, and artifacts
|
||||
- **Model Registry**: Version and manage ML models
|
||||
- **Model Serving**: Deploy models as REST APIs
|
||||
- **Project Reproducibility**: Package code, data, and environment
|
||||
- **Remote Execution**: Run experiments on remote platforms
|
||||
- **UI Dashboard**: Visual experiment comparison and analysis
|
||||
- **LLM Tracking**: Track LLM applications with traces
|
||||
- **Prompt Registry**: Manage and version prompts
|
||||
|
||||
## Architecture
|
||||
|
||||
```plain
|
||||
External Users
|
||||
↓
|
||||
Cloudflare Tunnel (HTTPS)
|
||||
↓
|
||||
Traefik Ingress (HTTPS)
|
||||
↓
|
||||
MLflow Server (HTTP inside cluster)
|
||||
├─ FastAPI/ASGI (Uvicorn)
|
||||
├─ mlflow-oidc-auth plugin
|
||||
│ ├─ OAuth → Keycloak (authentication)
|
||||
│ └─ Session → FileSystemCache
|
||||
├─ PostgreSQL (metadata)
|
||||
│ ├─ mlflow (tracking)
|
||||
│ └─ mlflow_auth (users/groups)
|
||||
└─ MinIO (artifacts via proxied access)
|
||||
```
|
||||
|
||||
**Key Components**:
|
||||
|
||||
- **Server Type**: `oidc-auth-fastapi` for FastAPI/ASGI compatibility
|
||||
- **Allowed Hosts**: Validates `Host` header for security
|
||||
- **Session Backend**: Cachelib with filesystem storage
|
||||
- **Artifact Storage**: Proxied through MLflow server (no direct S3 access needed)
|
||||
|
||||
## Authentication
|
||||
|
||||
### User Login (OIDC)
|
||||
|
||||
- Users authenticate via Keycloak
|
||||
- Standard OIDC flow with Authorization Code grant
|
||||
- Group membership retrieved from `groups` claim in UserInfo
|
||||
- Users automatically created on first login
|
||||
|
||||
### Access Control
|
||||
|
||||
**Group-based Permissions**:
|
||||
|
||||
```python
|
||||
OIDC_ADMIN_GROUP_NAME = "mlflow-admins"
|
||||
OIDC_GROUP_NAME = "mlflow-admins,mlflow-users"
|
||||
```
|
||||
|
||||
**Default Permissions**:
|
||||
|
||||
- New resources: `MANAGE` permission for creator
|
||||
- Admins: Full access to all resources
|
||||
- Users: Access based on explicit permissions
|
||||
|
||||
### Permission Management
|
||||
|
||||
Access the Permissions UI at `/oidc/ui/`:
|
||||
|
||||
- View and manage user permissions
|
||||
- Assign permissions to experiments, models, and prompts
|
||||
- Create and manage groups
|
||||
- View audit logs
|
||||
|
||||
## Management
|
||||
|
||||
### Rebuild Custom Image
|
||||
|
||||
If you need to update the custom MLflow image:
|
||||
|
||||
```bash
|
||||
export DOCKER_HOST=ssh://yourhost.com
|
||||
just mlflow::build-and-push-image
|
||||
```
|
||||
|
||||
After rebuilding, restart MLflow to use the new image:
|
||||
|
||||
```bash
|
||||
kubectl rollout restart deployment/mlflow -n mlflow
|
||||
```
|
||||
|
||||
### Upgrade MLflow
|
||||
|
||||
```bash
|
||||
just mlflow::upgrade
|
||||
```
|
||||
|
||||
Updates the Helm deployment with current configuration.
|
||||
|
||||
### Uninstall
|
||||
|
||||
```bash
|
||||
# Keep PostgreSQL databases
|
||||
just mlflow::uninstall false
|
||||
|
||||
# Delete PostgreSQL databases and user
|
||||
just mlflow::uninstall true
|
||||
```
|
||||
|
||||
### Clean Up All Resources
|
||||
|
||||
```bash
|
||||
just mlflow::cleanup
|
||||
```
|
||||
|
||||
Deletes databases, users, secrets, and Keycloak client (with confirmation).
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Check Pod Status
|
||||
|
||||
```bash
|
||||
kubectl get pods -n mlflow
|
||||
```
|
||||
|
||||
Expected pods:
|
||||
|
||||
- `mlflow-*` - Main application (1 replica)
|
||||
- `mlflow-db-migration-*` - Database migration (Completed)
|
||||
- `mlflow-dbchecker-*` - Database connection check (Completed)
|
||||
|
||||
### OAuth Login Fails
|
||||
|
||||
#### Redirect Loop (Returns to Login Page)
|
||||
|
||||
**Symptoms**: User authenticates with Keycloak but returns to login page
|
||||
|
||||
**Common Causes**:
|
||||
|
||||
1. **Redirect URI Mismatch**:
|
||||
- Check Keycloak client redirect URI matches `/callback`
|
||||
- Verify `OIDC_REDIRECT_URI` is `https://{host}/callback`
|
||||
|
||||
2. **Missing Groups Scope**:
|
||||
- Ensure `groups` scope is added to Keycloak client
|
||||
- Check groups mapper is configured in Keycloak
|
||||
|
||||
3. **Group Membership**:
|
||||
- User must be in `mlflow-admins` or `mlflow-users` group
|
||||
- Add user to group: `just keycloak::add-user-to-group <user> mlflow-admins`
|
||||
|
||||
#### Session Errors
|
||||
|
||||
**Error**: `Session module for filesystem could not be imported`
|
||||
|
||||
**Solution**: Ensure session configuration is correct:
|
||||
|
||||
```yaml
|
||||
SESSION_TYPE: "cachelib"
|
||||
SESSION_CACHE_DIR: "/tmp/session"
|
||||
```
|
||||
|
||||
#### Group Detection Errors
|
||||
|
||||
**Error**: `Group detection error: No module named 'oidc'`
|
||||
|
||||
**Solution**: Remove `OIDC_GROUP_DETECTION_PLUGIN` setting (should be unset or removed)
|
||||
|
||||
### Server Type Errors
|
||||
|
||||
**Error**: `TypeError: Flask.__call__() missing 1 required positional argument: 'start_response'`
|
||||
|
||||
**Cause**: Using Flask server type with Uvicorn (ASGI)
|
||||
|
||||
**Solution**: Ensure `appName: "oidc-auth-fastapi"` in values
|
||||
|
||||
### Database Connection Issues
|
||||
|
||||
Check database credentials:
|
||||
|
||||
```bash
|
||||
kubectl get secret mlflow-db-secret -n mlflow -o yaml
|
||||
```
|
||||
|
||||
Test database connectivity:
|
||||
|
||||
```bash
|
||||
kubectl exec -n mlflow deployment/mlflow -- \
|
||||
psql -h postgres-cluster-rw.postgres -U mlflow -d mlflow -c "SELECT 1"
|
||||
```
|
||||
|
||||
### Artifact Storage Issues
|
||||
|
||||
Check MinIO credentials:
|
||||
|
||||
```bash
|
||||
kubectl get secret mlflow-s3-secret -n mlflow -o yaml
|
||||
```
|
||||
|
||||
Test MinIO connectivity:
|
||||
|
||||
```bash
|
||||
kubectl exec -n mlflow deployment/mlflow -- \
|
||||
python -c "import boto3; import os; \
|
||||
client = boto3.client('s3', \
|
||||
endpoint_url=os.getenv('MLFLOW_S3_ENDPOINT_URL'), \
|
||||
aws_access_key_id=os.getenv('AWS_ACCESS_KEY_ID'), \
|
||||
aws_secret_access_key=os.getenv('AWS_SECRET_ACCESS_KEY')); \
|
||||
print(client.list_buckets())"
|
||||
```
|
||||
|
||||
### Check Logs
|
||||
|
||||
```bash
|
||||
# Application logs
|
||||
kubectl logs -n mlflow deployment/mlflow --tail=100
|
||||
|
||||
# Database migration logs
|
||||
kubectl logs -n mlflow job/mlflow-db-migration
|
||||
|
||||
# Real-time logs
|
||||
kubectl logs -n mlflow deployment/mlflow -f
|
||||
```
|
||||
|
||||
### Common Log Messages
|
||||
|
||||
**Normal**:
|
||||
|
||||
- `Successfully created FastAPI app with OIDC integration`
|
||||
- `OIDC routes, authentication, and UI should now be available`
|
||||
- `Session module for cachelib imported`
|
||||
- `Redirect URI for OIDC login: https://{host}/callback`
|
||||
|
||||
**Issues**:
|
||||
|
||||
- `Group detection error` - Check OIDC configuration
|
||||
- `Authorization error: User is not allowed to login` - User not in required group
|
||||
- `Session error` - Session configuration issue
|
||||
|
||||
### Image Build Issues
|
||||
|
||||
If custom image build fails:
|
||||
|
||||
```bash
|
||||
# Set Docker host
|
||||
export DOCKER_HOST=ssh://yourhost.com
|
||||
|
||||
# Rebuild image manually
|
||||
cd /path/to/buun-stack/mlflow
|
||||
just mlflow::build-and-push-image
|
||||
|
||||
# Check image exists on remote host
|
||||
docker images localhost:30500/mlflow:3.6.0-oidc
|
||||
|
||||
# Test image on remote host
|
||||
docker run --rm localhost:30500/mlflow:3.6.0-oidc mlflow --version
|
||||
```
|
||||
|
||||
**Note**: All Docker commands run on the remote host specified by `DOCKER_HOST`.
|
||||
|
||||
## Custom Image
|
||||
|
||||
### Dockerfile
|
||||
|
||||
Located at `mlflow/image/Dockerfile`:
|
||||
|
||||
```dockerfile
|
||||
FROM burakince/mlflow:3.6.0
|
||||
|
||||
# Install mlflow-oidc-auth plugin with filesystem session support
|
||||
RUN pip install --no-cache-dir \
|
||||
mlflow-oidc-auth[full]==5.6.1 \
|
||||
cachelib[filesystem]
|
||||
```
|
||||
|
||||
### Building Custom Image
|
||||
|
||||
**Important**: Set `DOCKER_HOST` to build on the remote k3s host:
|
||||
|
||||
```bash
|
||||
export DOCKER_HOST=ssh://yourhost.com
|
||||
|
||||
just mlflow::build-image # Build only
|
||||
just mlflow::push-image # Push only (requires prior build)
|
||||
just mlflow::build-and-push-image # Build and push
|
||||
```
|
||||
|
||||
The image is built on the remote Docker host and pushed to the k3s local registry (`localhost:30500`).
|
||||
|
||||
## References
|
||||
|
||||
- [MLflow Documentation](https://mlflow.org/docs/latest/index.html)
|
||||
- [MLflow GitHub](https://github.com/mlflow/mlflow)
|
||||
- [mlflow-oidc-auth Plugin](https://github.com/mlflow-oidc/mlflow-oidc-auth)
|
||||
- [mlflow-oidc-auth Documentation](https://mlflow-oidc.github.io/mlflow-oidc-auth/)
|
||||
- [Community Charts MLflow](https://github.com/community-charts/helm-charts/tree/main/charts/mlflow)
|
||||
- [Keycloak OIDC](https://www.keycloak.org/docs/latest/securing_apps/#_oidc)
|
||||
Reference in New Issue
Block a user