feat(mlflow): enable authn

Masaki Yatsu
2025-11-09 15:48:02 +09:00
parent 0142034535
commit 995abfe4d2
6 changed files with 727 additions and 28 deletions

mlflow/README.md Normal file

@@ -0,0 +1,484 @@
# MLflow
MLflow is an open source platform for managing the end-to-end machine learning lifecycle. This module deploys it with Keycloak OIDC authentication.
## Overview
This module deploys MLflow using the Community Charts Helm chart with:
- **Keycloak OIDC authentication** for user login
- **Custom Docker image** with mlflow-oidc-auth plugin
- **PostgreSQL backend** for tracking server and auth databases
- **MinIO/S3 artifact storage** with proxied access
- **FastAPI/ASGI server** with Uvicorn for production
- **HTTPS reverse proxy support** via Traefik
- **Group-based access control** via Keycloak groups
- **Prometheus metrics** for monitoring
## Prerequisites
- Kubernetes cluster (k3s)
- Keycloak installed and configured
- PostgreSQL cluster (CloudNativePG)
- MinIO object storage
- External Secrets Operator (optional, for Vault integration)
- Docker registry (local or remote)
## Installation
### Basic Installation
1. **Build and Push Custom MLflow Image**:
   Set `DOCKER_HOST` to your remote Docker host (where k3s is running):
   ```bash
   export DOCKER_HOST=ssh://yourhost.com
   just mlflow::build-and-push-image
   ```
   This builds a custom MLflow image with the OIDC auth plugin and pushes it to your k3s registry.
2. **Install MLflow**:
   ```bash
   just mlflow::install
   ```
   You will be prompted for:
   - **MLflow host (FQDN)**: e.g., `mlflow.example.com`
### What Gets Installed
- MLflow tracking server (FastAPI with OIDC)
- PostgreSQL databases:
  - `mlflow` - Experiment tracking, models, and runs
  - `mlflow_auth` - User authentication and permissions
- PostgreSQL user `mlflow` with access to both databases
- MinIO bucket `mlflow` for artifact storage
- Custom MLflow Docker image with OIDC auth plugin
- Keycloak OAuth client (confidential client)
- Keycloak groups:
  - `mlflow-admins` - Full administrative access
  - `mlflow-users` - Basic user access
## Configuration
### Docker Build Environment
For building and pushing the custom MLflow image:
```bash
DOCKER_HOST=ssh://yourhost.com # Remote Docker host (where k3s is running)
IMAGE_REGISTRY=localhost:30500 # k3s local registry
```
### Deployment Configuration
Environment variables (set in `.env.local` or override):
```bash
MLFLOW_NAMESPACE=mlflow # Kubernetes namespace
MLFLOW_CHART_VERSION=1.8.0 # Helm chart version
MLFLOW_HOST=mlflow.example.com # External hostname
MLFLOW_IMAGE_TAG=3.6.0-oidc # Custom image tag
MLFLOW_IMAGE_PULL_POLICY=IfNotPresent # Image pull policy
KEYCLOAK_HOST=auth.example.com # Keycloak hostname
KEYCLOAK_REALM=buunstack # Keycloak realm name
```
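If these variables need to be read from a script (for example, before templating Helm values), they could be collected in one place. The following is a minimal sketch that assumes the defaults listed above; `MlflowSettings`, `ENV_TO_FIELD`, and `load_settings` are hypothetical names and are not part of this module.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class MlflowSettings:
    """Deployment settings; defaults mirror the values listed above."""
    namespace: str = "mlflow"
    chart_version: str = "1.8.0"
    host: str = "mlflow.example.com"
    image_tag: str = "3.6.0-oidc"
    image_pull_policy: str = "IfNotPresent"
    keycloak_host: str = "auth.example.com"
    keycloak_realm: str = "buunstack"


# Maps each environment variable to the dataclass field it overrides.
ENV_TO_FIELD = {
    "MLFLOW_NAMESPACE": "namespace",
    "MLFLOW_CHART_VERSION": "chart_version",
    "MLFLOW_HOST": "host",
    "MLFLOW_IMAGE_TAG": "image_tag",
    "MLFLOW_IMAGE_PULL_POLICY": "image_pull_policy",
    "KEYCLOAK_HOST": "keycloak_host",
    "KEYCLOAK_REALM": "keycloak_realm",
}


def load_settings(env=None):
    """Build settings from an environment mapping, falling back to defaults.

    Unset or empty variables keep their default value.
    """
    env = os.environ if env is None else env
    overrides = {
        field: env[var] for var, field in ENV_TO_FIELD.items() if env.get(var)
    }
    return MlflowSettings(**overrides)
```

Calling `load_settings()` with no argument reads from the process environment; passing a plain dict makes the function easy to exercise in isolation.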
### Architecture Notes
**MLflow 3.6.0 with OIDC**:
- Uses `mlflow-oidc-auth[full]==5.6.1` plugin
- FastAPI/ASGI server with Uvicorn (not Gunicorn)
- Server type: `oidc-auth-fastapi` for ASGI compatibility
- Session management: `cachelib` with filesystem backend
- Custom Docker image built from `burakince/mlflow:3.6.0`
**Authentication Flow**:
- OIDC Discovery: `/.well-known/openid-configuration`
- Redirect URI: `/callback` (not `/oidc/callback`)
- Required scopes: `openid profile email groups`
- Group attribute: `groups` from UserInfo
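The two URLs in this flow can be derived from the hostnames configured earlier. The sketch below is an illustration only: `oidc_endpoints` is a hypothetical helper, and it assumes Keycloak serves the realm under `/realms/<realm>` (without the legacy `/auth` prefix used by older Keycloak versions).

```python
from urllib.parse import urlunsplit


def oidc_endpoints(mlflow_host, keycloak_host, realm):
    """Return the OIDC discovery URL and the MLflow redirect URI."""
    # Assumption: the realm base path is /realms/<realm> (no /auth prefix).
    discovery_path = f"/realms/{realm}/.well-known/openid-configuration"
    return {
        "discovery_url": urlunsplit(("https", keycloak_host, discovery_path, "", "")),
        # The plugin expects /callback, not /oidc/callback.
        "redirect_uri": urlunsplit(("https", mlflow_host, "/callback", "", "")),
    }
```

The `redirect_uri` value is what the Keycloak client's valid redirect URIs must contain; a mismatch here is a common cause of login loops.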
**Database Structure**:
- `mlflow` database: Experiment tracking, models, parameters, metrics
- `mlflow_auth` database: User accounts, groups, permissions
## Usage
### Access MLflow
1. Navigate to `https://your-mlflow-host/`
2. Click the "Keycloak" button to authenticate
3. After successful login:
   - First redirect: Permissions Management UI (`/oidc/ui/`)
   - Click the "MLflow" button to open the main MLflow UI
### Grant Admin Access
Add users to the `mlflow-admins` group:
```bash
just keycloak::add-user-to-group <username> mlflow-admins
```
Admin users have full privileges including:
- Experiment and model management
- User and permission management
- Access to all experiments and models
### Log Experiments
#### Using Python Client
```python
import mlflow
# Set tracking URI
mlflow.set_tracking_uri("https://mlflow.example.com")
# Start experiment
mlflow.set_experiment("my-experiment")
# Log parameters, metrics, and artifacts
with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_metric("accuracy", 0.95)
    mlflow.log_artifact("model.pkl")  # model.pkl must already exist locally
```
#### Authentication for API Access
For programmatic access, create an access token:
1. Log in to MLflow UI
2. Navigate to Permissions UI → Create access token
3. Use token in your code:
```python
import os
os.environ["MLFLOW_TRACKING_TOKEN"] = "your-token"
```
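The same token can also be used outside the MLflow client, for example to call the REST API directly. The sketch below assumes the server accepts the token as a bearer credential, which is how the MLflow client sends `MLFLOW_TRACKING_TOKEN`; `build_search_request` and `fetch_experiments` are hypothetical helper names.

```python
import json
import urllib.request


def build_search_request(base_url, token, max_results=10):
    """Prepare (but do not send) an authenticated experiment search request."""
    body = json.dumps({"max_results": max_results}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/2.0/mlflow/experiments/search",
        data=body,
        method="POST",
        headers={
            # Assumption: the token is accepted as a bearer credential.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def fetch_experiments(base_url, token):
    """Send the request and return the decoded JSON response."""
    request = build_search_request(base_url, token)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Keeping request construction separate from sending makes it possible to inspect the headers without contacting the server.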
### Model Registry
Register and manage models:
```python
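import mlflow
from mlflow.tracking import MlflowClient

# Register a model from a finished run (replace <run-id> with a real run ID)
mlflow.register_model(
    model_uri="runs:/<run-id>/model",
    name="my-model",
)

# Transition model stage
# Note: model stages are deprecated in recent MLflow releases in favor of
# model version aliases (see MlflowClient.set_registered_model_alias).
client = MlflowClient()
client.transition_model_version_stage(
    name="my-model",
    version=1,
    stage="Production",
)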
```
## Features
- **Experiment Tracking**: Log parameters, metrics, and artifacts
- **Model Registry**: Version and manage ML models
- **Model Serving**: Deploy models as REST APIs
- **Project Reproducibility**: Package code, data, and environment
- **Remote Execution**: Run experiments on remote platforms
- **UI Dashboard**: Visual experiment comparison and analysis
- **LLM Tracking**: Track LLM applications with traces
- **Prompt Registry**: Manage and version prompts
## Architecture
```plain
External Users
      ↓
Cloudflare Tunnel (HTTPS)
      ↓
Traefik Ingress (HTTPS)
      ↓
MLflow Server (HTTP inside cluster)
  ├─ FastAPI/ASGI (Uvicorn)
  ├─ mlflow-oidc-auth plugin
  │   ├─ OAuth → Keycloak (authentication)
  │   └─ Session → FileSystemCache
  ├─ PostgreSQL (metadata)
  │   ├─ mlflow (tracking)
  │   └─ mlflow_auth (users/groups)
  └─ MinIO (artifacts via proxied access)
```
**Key Components**:
- **Server Type**: `oidc-auth-fastapi` for FastAPI/ASGI compatibility
- **Allowed Hosts**: Validates `Host` header for security
- **Session Backend**: Cachelib with filesystem storage
- **Artifact Storage**: Proxied through MLflow server (no direct S3 access needed)
## Authentication
### User Login (OIDC)
- Users authenticate via Keycloak
- Standard OIDC flow with Authorization Code grant
- Group membership retrieved from `groups` claim in UserInfo
- Users automatically created on first login
### Access Control
**Group-based Permissions**:
```python
OIDC_ADMIN_GROUP_NAME = "mlflow-admins"
OIDC_GROUP_NAME = "mlflow-admins,mlflow-users"
```
**Default Permissions**:
- New resources: `MANAGE` permission for creator
- Admins: Full access to all resources
- Users: Access based on explicit permissions
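To make the group rules concrete, the following sketch shows one way the two settings above could map a user's Keycloak groups to an access level. It illustrates the rule as described here and is not the plugin's actual implementation; `parse_groups` and `resolve_access` are hypothetical names.

```python
def parse_groups(value):
    """Split a comma-separated group setting into a set of group names."""
    return {name.strip() for name in value.split(",") if name.strip()}


def resolve_access(
    user_groups,
    allowed="mlflow-admins,mlflow-users",  # mirrors OIDC_GROUP_NAME
    admin="mlflow-admins",                 # mirrors OIDC_ADMIN_GROUP_NAME
):
    """Return 'admin', 'user', or 'denied' for the given group memberships."""
    groups = set(user_groups)
    if groups & parse_groups(admin):
        return "admin"
    if groups & parse_groups(allowed):
        return "user"
    return "denied"
```

Under these assumptions, a user in neither group is rejected at login, which matches the `User is not allowed to login` message in the server logs.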
### Permission Management
Access the Permissions UI at `/oidc/ui/`:
- View and manage user permissions
- Assign permissions to experiments, models, and prompts
- Create and manage groups
- View audit logs
## Management
### Rebuild Custom Image
If you need to update the custom MLflow image:
```bash
export DOCKER_HOST=ssh://yourhost.com
just mlflow::build-and-push-image
```
After rebuilding, restart MLflow to use the new image:
```bash
kubectl rollout restart deployment/mlflow -n mlflow
```
### Upgrade MLflow
```bash
just mlflow::upgrade
```
Updates the Helm deployment with current configuration.
### Uninstall
```bash
# Keep PostgreSQL databases
just mlflow::uninstall false
# Delete PostgreSQL databases and user
just mlflow::uninstall true
```
### Clean Up All Resources
```bash
just mlflow::cleanup
```
Deletes databases, users, secrets, and Keycloak client (with confirmation).
## Troubleshooting
### Check Pod Status
```bash
kubectl get pods -n mlflow
```
Expected pods:
- `mlflow-*` - Main application (1 replica)
- `mlflow-db-migration-*` - Database migration (Completed)
- `mlflow-dbchecker-*` - Database connection check (Completed)
### OAuth Login Fails
#### Redirect Loop (Returns to Login Page)
**Symptoms**: User authenticates with Keycloak but returns to login page
**Common Causes**:
1. **Redirect URI Mismatch**:
   - Check Keycloak client redirect URI matches `/callback`
   - Verify `OIDC_REDIRECT_URI` is `https://{host}/callback`
2. **Missing Groups Scope**:
   - Ensure `groups` scope is added to Keycloak client
   - Check groups mapper is configured in Keycloak
3. **Group Membership**:
   - User must be in `mlflow-admins` or `mlflow-users` group
   - Add user to group: `just keycloak::add-user-to-group <user> mlflow-admins`
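The three causes above can be checked mechanically once the relevant values are known. The sketch below is a hypothetical checklist helper (`diagnose_login` is not part of this module); the caller is assumed to have gathered the configured redirect URI, the client's scopes, and the user's groups from Keycloak.

```python
def diagnose_login(
    redirect_uri,
    host,
    scopes,
    user_groups,
    allowed_groups=("mlflow-admins", "mlflow-users"),
):
    """Return a list of likely causes for a login redirect loop."""
    problems = []
    expected = f"https://{host}/callback"
    if redirect_uri != expected:
        problems.append(f"redirect URI is {redirect_uri!r}, expected {expected!r}")
    if "groups" not in scopes.split():
        problems.append("'groups' scope is missing from the client scopes")
    if not set(user_groups) & set(allowed_groups):
        problems.append("user is not in any allowed group")
    return problems
```

An empty list means none of the three listed causes applies, so the server logs are the next place to look.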
#### Session Errors
**Error**: `Session module for filesystem could not be imported`
**Solution**: Ensure session configuration is correct:
```yaml
SESSION_TYPE: "cachelib"
SESSION_CACHE_DIR: "/tmp/session"
```
#### Group Detection Errors
**Error**: `Group detection error: No module named 'oidc'`
**Solution**: Unset the `OIDC_GROUP_DETECTION_PLUGIN` setting; it should not be defined.
### Server Type Errors
**Error**: `TypeError: Flask.__call__() missing 1 required positional argument: 'start_response'`
**Cause**: Using Flask server type with Uvicorn (ASGI)
**Solution**: Ensure `appName: "oidc-auth-fastapi"` is set in the Helm values
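This error comes from a calling-convention mismatch: a WSGI application (such as a Flask app) is a synchronous callable taking `environ` and `start_response`, while an ASGI application is an async callable taking `scope`, `receive`, and `send`. The sketch below imitates, in simplified form, how an ASGI server may treat a non-async application as a legacy double callable. It is not Uvicorn's actual code, and `wsgi_app`, `asgi_app`, and `serve_once` are hypothetical names.

```python
import asyncio
import inspect


def wsgi_app(environ, start_response):
    """WSGI callable: synchronous, two arguments (the shape of a Flask app)."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]


async def asgi_app(scope, receive, send):
    """ASGI callable: asynchronous, three arguments (the shape of a FastAPI app)."""
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


async def serve_once(app):
    """Drive one request through `app` and return the messages it sent."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    scope = {"type": "http", "method": "GET", "path": "/"}
    if inspect.iscoroutinefunction(app):
        # Single async callable: call it with all three arguments.
        await app(scope, receive, send)
    else:
        # Legacy double callable: app(scope) returns the coroutine function.
        # A WSGI app fails here because start_response is never supplied.
        instance = app(scope)
        await instance(receive, send)
    return sent
```

Running `asyncio.run(serve_once(wsgi_app))` raises a `TypeError` that names the missing `start_response` argument, which is the same kind of failure as the error shown above.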
### Database Connection Issues
Check database credentials:
```bash
kubectl get secret mlflow-db-secret -n mlflow -o yaml
```
Test database connectivity:
```bash
kubectl exec -n mlflow deployment/mlflow -- \
  psql -h postgres-cluster-rw.postgres -U mlflow -d mlflow -c "SELECT 1"
```
### Artifact Storage Issues
Check MinIO credentials:
```bash
kubectl get secret mlflow-s3-secret -n mlflow -o yaml
```
Test MinIO connectivity:
```bash
kubectl exec -n mlflow deployment/mlflow -- \
python -c "import boto3; import os; \
client = boto3.client('s3', \
endpoint_url=os.getenv('MLFLOW_S3_ENDPOINT_URL'), \
aws_access_key_id=os.getenv('AWS_ACCESS_KEY_ID'), \
aws_secret_access_key=os.getenv('AWS_SECRET_ACCESS_KEY')); \
print(client.list_buckets())"
```
### Check Logs
```bash
# Application logs
kubectl logs -n mlflow deployment/mlflow --tail=100
# Database migration logs
kubectl logs -n mlflow job/mlflow-db-migration
# Real-time logs
kubectl logs -n mlflow deployment/mlflow -f
```
### Common Log Messages
**Normal**:
- `Successfully created FastAPI app with OIDC integration`
- `OIDC routes, authentication, and UI should now be available`
- `Session module for cachelib imported`
- `Redirect URI for OIDC login: https://{host}/callback`
**Issues**:
- `Group detection error` - Check OIDC configuration
- `Authorization error: User is not allowed to login` - User not in required group
- `Session error` - Session configuration issue
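When scanning long logs, the messages above can be matched automatically. The following is a minimal sketch using only the strings listed in this section; `NORMAL_MARKERS`, `ISSUE_HINTS`, and `classify_line` are hypothetical names, and real log lines may carry extra prefixes such as timestamps, which substring matching tolerates.

```python
# Substrings that indicate a healthy startup.
NORMAL_MARKERS = (
    "Successfully created FastAPI app with OIDC integration",
    "OIDC routes, authentication, and UI should now be available",
    "Session module for cachelib imported",
    "Redirect URI for OIDC login:",
)

# Substrings that indicate a problem, mapped to the suggested check.
ISSUE_HINTS = {
    "Group detection error": "check OIDC configuration",
    "User is not allowed to login": "user not in required group",
    "Session error": "session configuration issue",
}


def classify_line(line):
    """Return ('issue', hint), ('normal', None), or ('other', None)."""
    for marker, hint in ISSUE_HINTS.items():
        if marker in line:
            return ("issue", hint)
    for marker in NORMAL_MARKERS:
        if marker in line:
            return ("normal", None)
    return ("other", None)
```

The function could be applied to each line of `kubectl logs` output saved to a file, keeping only the lines classified as `issue`.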
### Image Build Issues
If custom image build fails:
```bash
# Set Docker host
export DOCKER_HOST=ssh://yourhost.com
# Rebuild image manually
cd /path/to/buun-stack/mlflow
just mlflow::build-and-push-image
# Check image exists on remote host
docker images localhost:30500/mlflow:3.6.0-oidc
# Test image on remote host
docker run --rm localhost:30500/mlflow:3.6.0-oidc mlflow --version
```
**Note**: All Docker commands run on the remote host specified by `DOCKER_HOST`.
## Custom Image
### Dockerfile
Located at `mlflow/image/Dockerfile`:
```dockerfile
FROM burakince/mlflow:3.6.0
# Install mlflow-oidc-auth plugin with filesystem session support
RUN pip install --no-cache-dir \
    "mlflow-oidc-auth[full]==5.6.1" \
    "cachelib[filesystem]"
```
### Building Custom Image
**Important**: Set `DOCKER_HOST` to build on the remote k3s host:
```bash
export DOCKER_HOST=ssh://yourhost.com
just mlflow::build-image # Build only
just mlflow::push-image # Push only (requires prior build)
just mlflow::build-and-push-image # Build and push
```
The image is built on the remote Docker host and pushed to the k3s local registry (`localhost:30500`).
## References
- [MLflow Documentation](https://mlflow.org/docs/latest/index.html)
- [MLflow GitHub](https://github.com/mlflow/mlflow)
- [mlflow-oidc-auth Plugin](https://github.com/mlflow-oidc/mlflow-oidc-auth)
- [mlflow-oidc-auth Documentation](https://mlflow-oidc.github.io/mlflow-oidc-auth/)
- [Community Charts MLflow](https://github.com/community-charts/helm-charts/tree/main/charts/mlflow)
- [Keycloak OIDC](https://www.keycloak.org/docs/latest/securing_apps/#_oidc)