Dagster Documentation
Overview
This document covers Dagster installation, deployment, and debugging in the buun-stack environment.
Installation
Prerequisites
- Kubernetes cluster with buun-stack components
- PostgreSQL database cluster
- MinIO object storage (optional, for MinIO-based storage)
- External Secrets Operator (optional, for Vault integration)
- Keycloak (for authentication)
Installation Steps
- Setup Environment Secrets (if needed):
  - See Environment Variables Setup section below for configuration options
  - Create ExternalSecret or Secret before installation if you want environment variables available immediately
- Install Dagster:
# Interactive installation with configuration prompts
just dagster::install
- Access Dagster Web UI:
  - Navigate to your Dagster instance (e.g., https://dagster.yourdomain.com)
  - Login with your Keycloak credentials
Uninstalling
# Remove Dagster (keeps database by default)
just dagster::uninstall false
# Remove Dagster and delete database
just dagster::uninstall true
Project Deployment
Overview
Dagster's official Helm chart only supports adding projects during installation. To work around this limitation, just dagster::deploy-project was implemented to dynamically add projects to a running Dagster instance.
This solution:
- Copies project files to a shared PVC (ReadWriteMany with Longhorn, or ReadWriteOnce fallback)
- Updates the workspace ConfigMap to register the new project
- Reloads the updated workspace (automatically within ~1 minute, or immediately via just dagster::reload-workspace, which restarts the Dagster components)
Deploy Projects to Shared PVC
1. Prepare Project Directory
Requirements:
- Project must have a definitions.py file (typically in src/<project_name>/definitions.py)
- Project directory name must use underscores, not hyphens (e.g., my_project, not my-project)
- Project should follow standard Python package structure:
my_project/
├── pyproject.toml           # Optional: for dependencies
├── requirements.txt         # Optional: for dependencies
└── src/
    └── my_project/
        ├── __init__.py
        ├── definitions.py   # Required: Dagster definitions entry point
        └── defs/
            └── assets/
                └── my_assets.py
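The requirements above can be checked locally before deploying. The following is a minimal sketch using only the Python standard library; the function name find_definitions is hypothetical, and the fallback to a package at the project root (when there is no src/ directory) is an assumption based on the working-directory note in the deployment steps, not a description of what just dagster::deploy-project does internally.

```python
import re
from pathlib import Path


def find_definitions(project_dir: Path) -> Path:
    """Return the path to definitions.py, or raise ValueError if the layout looks wrong."""
    name = project_dir.name
    # The directory name is used as the Python module name, so hyphens are not allowed.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
        raise ValueError(f"{name!r} is not a valid module name (use underscores, not hyphens)")
    # Prefer the src/ layout; fall back to a package at the project root (assumption).
    candidates = [
        project_dir / "src" / name / "definitions.py",
        project_dir / name / "definitions.py",
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    raise ValueError(f"no definitions.py found for project {name!r}")
```

Calling find_definitions(Path("/path/to/your/project")) before just dagster::deploy-project surfaces the two most common layout mistakes (a hyphenated name or a missing definitions.py) without touching the cluster.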
2. Deploy Project
# Deploy a local project directory
just dagster::deploy-project /path/to/your/project
# Interactive deployment (will prompt for project path)
just dagster::deploy-project
What happens during deployment:
- File Copy: Project files are copied to /opt/dagster/user-code/<project_name>/ in the shared PVC
  - Excludes: .venv, __pycache__, .git, and other build artifacts
  - Uses tar with --no-xattrs to avoid macOS extended attributes issues
- Workspace Update: The dagster-workspace-yaml ConfigMap is updated to include:
load_from:
  - python_module:
      module_name: <project_name>.definitions
      working_directory: /opt/dagster/user-code/<project_name>/src  # or project root if no src/
- Automatic Reload: Dagster automatically detects the workspace.yaml changes and reloads within ~1 minute
  - No manual restart required
  - The new project will appear in the Deployment tab once loaded
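To make the workspace update step concrete, here is a minimal sketch of how such an entry could be built and appended. It uses only the standard library and models the workspace as a plain dictionary; the function names workspace_entry and add_entry are hypothetical, and the actual recipe may construct the ConfigMap differently.

```python
from pathlib import PurePosixPath

# Mount path of the shared PVC, as given in the deployment description.
USER_CODE_ROOT = PurePosixPath("/opt/dagster/user-code")


def workspace_entry(project_name: str, has_src_dir: bool) -> dict:
    """Build one load_from entry for a deployed project."""
    project_root = USER_CODE_ROOT / project_name
    # Use <project>/src when the project has a src/ layout, else the project root.
    working_directory = project_root / "src" if has_src_dir else project_root
    return {
        "python_module": {
            "module_name": f"{project_name}.definitions",
            "working_directory": str(working_directory),
        }
    }


def add_entry(workspace: dict, entry: dict) -> dict:
    """Return a copy of the workspace with the entry appended at most once."""
    entries = list(workspace.get("load_from", []))
    if entry not in entries:
        entries.append(entry)
    return {**workspace, "load_from": entries}
```

Appending only when the entry is absent keeps the operation idempotent, so redeploying the same project does not register its code location twice.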
3. Verify Deployment
- Access Dagster Web UI
- Navigate to "Deployment" tab to see registered code locations
- Wait ~1 minute for automatic workspace reload
- Click "Reload all" in the UI to refresh the view if needed
- Check the Asset Catalog for your assets
Remove Projects
# Remove a deployed project
just dagster::remove-project project_name
# Interactive removal (will prompt for project name)
just dagster::remove-project
What happens during removal:
- Files are deleted from the shared PVC
- The project module is removed from workspace.yaml
- Dagster automatically detects the change and reloads within ~1 minute
Manual Workspace Reload
If automatic reload doesn't work or you need immediate reload:
just dagster::reload-workspace
This command restarts both dagster-webserver and dagster-daemon to force an immediate workspace reload.
Storage Configuration
Local PVC Storage (Default)
Uses Kubernetes PersistentVolumeClaims for storage:
- dagster-storage-pvc: Main Dagster storage (ReadWriteOnce)
- dagster-user-code-pvc: Shared user code storage (ReadWriteMany with Longhorn)
MinIO Storage (Optional)
When MinIO is available, Dagster can use S3-compatible storage:
- dagster-data: Data files bucket
- dagster-logs: Compute logs bucket
The storage type is selected during installation via interactive prompt.
Environment Variables Setup
Environment variables are provided to Dagster through Kubernetes Secrets. You have several options:
Option 1: Customize the Example Template
- Create the example environment secrets template:
just dagster::create-env-secrets-example
- Important: This creates a template with sample values. You must customize it:
  - If using External Secrets: Edit dagster-env-external-secret.gomplate.yaml to reference your actual Vault paths
  - If using Direct Secrets: Update the created dagster-env-secret with your actual credentials
Option 2: Create ExternalSecret Manually
Create an ExternalSecret that references your Vault credentials:
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: dagster-env-external-secret
  namespace: dagster
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-secret-store
    kind: ClusterSecretStore
  target:
    name: dagster-env-secret
  data:
    - secretKey: AWS_ACCESS_KEY_ID
      remoteRef:
        key: minio/credentials
        property: access_key
    - secretKey: AWS_SECRET_ACCESS_KEY
      remoteRef:
        key: minio/credentials
        property: secret_key
    - secretKey: POSTGRES_URL
      remoteRef:
        key: postgres/admin
        property: connection_string
    # Add more variables as needed
Option 3: Create Kubernetes Secret Directly
kubectl create secret generic dagster-env-secret -n dagster \
--from-literal=AWS_ACCESS_KEY_ID="your-access-key" \
--from-literal=AWS_SECRET_ACCESS_KEY="your-secret-key" \
--from-literal=AWS_ENDPOINT_URL="http://minio.minio.svc.cluster.local:9000" \
--from-literal=POSTGRES_URL="postgresql://user:pass@postgres-cluster-rw.postgres:5432"
After creating the environment secrets, redeploy Dagster to pick up the new configuration.
Example Projects
CSV to PostgreSQL Project
The examples/csv_to_postgres project demonstrates a complete ETL pipeline that loads data from MinIO object storage into PostgreSQL using dlt (data load tool).
Dataset Information
MovieLens 20M Dataset
This project processes the MovieLens 20M dataset from GroupLens Research. The dataset contains:
- 27,278 movies with metadata
- 20 million ratings from 138,493 users
- 465,564 tags applied by users
- Additional genome data for content-based filtering
MinIO Storage Structure
The dataset files are stored in MinIO under the movie-lens bucket:
mc alias set buun https://minio.your-domain.com access-key secret-key
mc ls buun/movie-lens
[2025-09-14 12:13:09 JST] 309MiB STANDARD genome-scores.csv
[2025-09-14 12:12:37 JST] 18KiB STANDARD genome-tags.csv
[2025-09-14 12:12:38 JST] 557KiB STANDARD links.csv
[2025-09-14 12:12:38 JST] 1.3MiB STANDARD movies.csv
[2025-09-14 12:13:15 JST] 509MiB STANDARD ratings.csv
[2025-09-14 12:12:42 JST] 16MiB STANDARD tags.csv
The project processes:
- movies.csv (1.3MiB) - Movie metadata
- tags.csv (16MiB) - User-generated tags
- ratings.csv (509MiB) - User ratings
Project Features
Assets Processed
- movies_pipeline: MovieLens movies data with primary key movieId
- ratings_pipeline: User ratings with composite primary key [userId, movieId]
- tags_pipeline: User tags with composite primary key [userId, movieId, timestamp]
- movielens_summary: Generates metadata summary of all processed assets
Smart Processing
- Table Existence Check: Uses DuckDB PostgreSQL scanner to check if tables already exist
- Skip Logic: If a table already contains data, the asset will skip processing to avoid reprocessing large files
- Write Disposition: Uses replace mode for initial loads
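The skip logic can be illustrated with a small, self-contained sketch. Note the substitution: the project checks PostgreSQL through the DuckDB PostgreSQL scanner, whereas this example uses the standard library's sqlite3 as a stand-in so it runs anywhere. The function name table_has_rows is hypothetical; only the decision (skip when the table exists and already holds data) mirrors the description above.

```python
import sqlite3


def table_has_rows(conn: sqlite3.Connection, table: str) -> bool:
    """Return True if the table exists and contains at least one row."""
    exists = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (table,)
    ).fetchone()
    if exists is None:
        return False  # No table yet: the asset should run the initial load.
    # Identifiers cannot be bound as parameters, so quote the name explicitly.
    quoted = '"' + table.replace('"', '""') + '"'
    return conn.execute(f"SELECT 1 FROM {quoted} LIMIT 1").fetchone() is not None
```

An asset following this pattern would return early when table_has_rows(...) is true and run the load with replace otherwise, which is what keeps large files such as ratings.csv from being reprocessed on every materialization.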
Dependencies
- dlt[duckdb,filesystem,postgres,s3]>=1.12.1
- dagster and related libraries
- duckdb (for table existence checking)
Environment Variables Required
The project expects the following environment variables to be set:
- POSTGRES_URL: PostgreSQL connection string (format: postgresql://user:password@host:port/database)
- AWS_ACCESS_KEY_ID: MinIO/S3 access key
- AWS_SECRET_ACCESS_KEY: MinIO/S3 secret key
- AWS_ENDPOINT_URL: MinIO endpoint URL
- Additional dlt-specific environment variables for advanced configuration
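A quick way to confirm that these variables are present is a fail-fast check at the top of a project. The sketch below is one possible approach using only the standard library; the helper name missing_env_vars is hypothetical and is not part of the example project.

```python
import os
from typing import List, Mapping

# The four variables the project expects, as listed above.
REQUIRED_VARS = (
    "POSTGRES_URL",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_ENDPOINT_URL",
)


def missing_env_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
    print("All required environment variables are set")
```

Reporting only the variable names, never their values, keeps credentials out of pod logs.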
Debugging and Troubleshooting
Debug Commands
Check Dagster component logs using kubectl:
Pod Status and Logs
# Check Dagster pods status
kubectl get pods -n dagster
# View webserver logs
kubectl logs -n dagster deployment/dagster-dagster-webserver -c dagster-webserver --tail=100
# View daemon logs
kubectl logs -n dagster deployment/dagster-daemon -c dagster-daemon --tail=100
# View user code deployment logs (if using code servers)
kubectl logs -n dagster deployment/dagster-user-code -c dagster --tail=100
Configuration and Secrets
# Check workspace configuration
kubectl get configmap dagster-workspace-yaml -n dagster -o yaml
# Check database secret
kubectl describe secret dagster-database-secret -n dagster
# Check environment secrets (if configured)
kubectl describe secret dagster-env-secret -n dagster
# Check OAuth secrets
kubectl describe secret dagster-oauth-secret -n dagster
Common Issues
Assets Not Appearing
Symptoms: Project deployed but assets not visible in Dagster UI
Debugging Steps:
- Check webserver logs for import errors:
kubectl logs -n dagster deployment/dagster-dagster-webserver -c dagster-webserver --tail=100 | grep -i error
- Verify workspace configuration:
kubectl get configmap dagster-workspace-yaml -n dagster -o jsonpath='{.data.workspace\.yaml}'
- Check project files in PVC:
WEBSERVER_POD=$(kubectl get pods -n dagster -l component=dagster-webserver -o jsonpath='{.items[0].metadata.name}')
kubectl exec $WEBSERVER_POD -n dagster -- ls -la /opt/dagster/user-code/
Common Causes:
- Python syntax errors in project files
- Missing definitions.py file
- Incorrect module structure
- Project name contains hyphens
Asset Execution Failures
Symptoms: Assets appear but fail during materialization
Debugging Steps:
- Check daemon logs for execution errors:
kubectl logs -n dagster deployment/dagster-daemon -c dagster-daemon --tail=100
- Check environment variables in webserver:
WEBSERVER_POD=$(kubectl get pods -n dagster -l component=dagster-webserver -o jsonpath='{.items[0].metadata.name}')
kubectl exec $WEBSERVER_POD -n dagster -- env | grep -E "(AWS|POSTGRES|DLT)"
- Test connectivity from pods:
# Test MinIO connectivity
kubectl exec $WEBSERVER_POD -n dagster -- ping minio.minio.svc.cluster.local
# Test PostgreSQL connectivity
kubectl exec $WEBSERVER_POD -n dagster -- nc -zv postgres-cluster-rw.postgres 5432
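If ping or nc is not installed in the container image, the same reachability check can be done with Python's standard library, which is available in the Dagster pods. This is a minimal sketch; the function name can_connect is hypothetical. Unlike ping, it tests the actual TCP port, so it also catches a service that resolves in DNS but is not listening.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

For example, can_connect("minio.minio.svc.cluster.local", 9000) and can_connect("postgres-cluster-rw.postgres", 5432) cover the two endpoints used elsewhere in this document.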
Environment Variables Issues
Symptoms: Assets fail with authentication or connection errors
Debugging Steps:
- Verify secret exists and contains data:
kubectl describe secret dagster-env-secret -n dagster
- Check if ExternalSecret is syncing (if using External Secrets):
kubectl get externalsecret dagster-env-external-secret -n dagster
kubectl describe externalsecret dagster-env-external-secret -n dagster
- Verify environment variables are loaded in pods:
WEBSERVER_POD=$(kubectl get pods -n dagster -l component=dagster-webserver -o jsonpath='{.items[0].metadata.name}')
kubectl exec $WEBSERVER_POD -n dagster -- printenv | grep -E "(AWS|POSTGRES|DLT)"
Authentication Issues
Symptoms: Cannot access Dagster UI or authentication failures
Debugging Steps:
- Check OAuth2 proxy status:
kubectl get pods -n dagster -l app=oauth2-proxy
kubectl logs -n dagster deployment/oauth2-proxy-dagster --tail=100
- Verify OAuth client configuration in Keycloak:
  - Ensure client dagster exists in the realm
  - Check redirect URIs are correctly configured
  - Verify client secret matches
- Check OAuth secret:
kubectl describe secret dagster-oauth-secret -n dagster
Database Connection Issues
Symptoms: Database-related errors or connection failures
Debugging Steps:
- Test database connectivity:
WEBSERVER_POD=$(kubectl get pods -n dagster -l component=dagster-webserver -o jsonpath='{.items[0].metadata.name}')
kubectl exec $WEBSERVER_POD -n dagster -- python3 -c "
import os
import psycopg2
conn = psycopg2.connect(
    host='postgres-cluster-rw.postgres',
    port=5432,
    database='dagster',
    user=os.getenv('POSTGRES_USER', 'dagster'),
    password=os.getenv('POSTGRES_PASSWORD', '')
)
print('Database connection successful')
conn.close()
"
- Check database secret:
kubectl describe secret dagster-database-secret -n dagster
- Verify database exists:
just postgres::psql -c "\l" | grep dagster
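When a project connects through POSTGRES_URL rather than separate user and password variables, it helps to see which host, port, and database the URL actually targets. The sketch below splits the URL with the standard library and deliberately omits the password from its result; the function name parse_postgres_url is hypothetical.

```python
from urllib.parse import urlsplit


def parse_postgres_url(url: str) -> dict:
    """Split a postgresql:// URL into its parts without exposing the password."""
    parts = urlsplit(url)
    if parts.scheme not in ("postgresql", "postgres"):
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")
    return {
        "host": parts.hostname,
        "port": parts.port or 5432,  # 5432 is the PostgreSQL default port
        "database": parts.path.lstrip("/") or None,  # None if the URL names no database
        "user": parts.username,
        "has_password": parts.password is not None,
    }
```

A result with "database": None indicates that the URL omits the database segment described in the expected format (postgresql://user:password@host:port/database), which is a frequent cause of connection errors.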
Connection Testing
MinIO Connectivity
# Test MinIO access from Dagster pod
WEBSERVER_POD=$(kubectl get pods -n dagster -l component=dagster-webserver -o jsonpath='{.items[0].metadata.name}')
kubectl exec $WEBSERVER_POD -n dagster -- python3 -c "
import boto3
import os
client = boto3.client(
    's3',
    endpoint_url=os.getenv('AWS_ENDPOINT_URL'),
    aws_access_key_id=os.getenv('AWS_ACCESS_KEY_ID'),
    aws_secret_access_key=os.getenv('AWS_SECRET_ACCESS_KEY')
)
print('Buckets:', [b['Name'] for b in client.list_buckets()['Buckets']])
"
Log Analysis Tips
- Filter logs by timestamp:
kubectl logs -n dagster deployment/dagster-dagster-webserver --since=10m
- Search for specific errors:
kubectl logs -n dagster deployment/dagster-daemon | grep -i "error\|exception\|failed"
- Monitor logs in real-time:
kubectl logs -n dagster deployment/dagster-dagster-webserver -f
- Check resource usage:
kubectl top pods -n dagster