Dagster Documentation

Overview

This document covers Dagster installation, deployment, and debugging in the buun-stack environment.

Installation

Prerequisites

  • Kubernetes cluster with buun-stack components
  • PostgreSQL database cluster
  • MinIO object storage (optional, for S3-compatible storage)
  • External Secrets Operator (optional, for Vault integration)
  • Keycloak (for authentication)

Installation Steps

  1. Set Up Environment Secrets (if needed):

    • See Environment Variables Setup section below for configuration options
    • Create ExternalSecret or Secret before installation if you want environment variables available immediately
  2. Install Dagster:

    # Interactive installation with configuration prompts
    just dagster::install
    
  3. Access Dagster Web UI:

    • Navigate to your Dagster instance (e.g., https://dagster.yourdomain.com)
    • Login with your Keycloak credentials

Uninstalling

# Remove Dagster (keeps database by default)
just dagster::uninstall false

# Remove Dagster and delete database
just dagster::uninstall true

Project Deployment

Overview

Dagster's official Helm chart only supports registering projects at installation time. To work around this limitation, buun-stack provides just dagster::deploy-project, which dynamically adds projects to a running Dagster instance.

This solution:

  • Copies project files to a shared PVC (ReadWriteMany with Longhorn, or ReadWriteOnce fallback)
  • Updates the workspace ConfigMap to register the new project
  • Relies on Dagster's automatic workspace reload to pick up the updated configuration

Deploy Projects to Shared PVC

1. Prepare Project Directory

Requirements:

  • Project must have a definitions.py file (typically in src/<project_name>/definitions.py)

  • Project directory name must use underscores, not hyphens (e.g., my_project, not my-project)

  • Project should follow standard Python package structure:

    my_project/
    ├── pyproject.toml          # Optional: for dependencies
    ├── requirements.txt        # Optional: for dependencies
    └── src/
        └── my_project/
            ├── __init__.py
            ├── definitions.py  # Required: Dagster definitions entry point
            └── defs/
                └── assets/
                    └── my_assets.py
    

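These layout rules can be checked programmatically before deploying. The helper below is an illustrative sketch (not part of the buun-stack recipes) that mirrors the two requirements above:

```python
from pathlib import Path

def check_project_layout(project_dir: str) -> list[str]:
    """Return a list of problems with a Dagster project directory.

    Illustrative only: mirrors the deploy-project requirements, but is
    not the validation actually performed by just dagster::deploy-project.
    """
    root = Path(project_dir)
    name = root.name
    problems = []
    if "-" in name:
        problems.append(f"directory name {name!r} uses hyphens; use underscores")
    # definitions.py may live under src/<name>/ or directly under <name>/
    candidates = (root / "src" / name / "definitions.py",
                  root / name / "definitions.py")
    if not any(p.is_file() for p in candidates):
        problems.append("no definitions.py found under src/ or the project root")
    return problems
```

Running it against a project directory before deployment catches the two most common mistakes early.
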
2. Deploy Project

# Deploy a local project directory
just dagster::deploy-project /path/to/your/project

# Interactive deployment (will prompt for project path)
just dagster::deploy-project

What happens during deployment:

  1. File Copy: Project files are copied to /opt/dagster/user-code/<project_name>/ in the shared PVC

    • Excludes: .venv, __pycache__, .git, and other build artifacts
    • Uses tar with --no-xattrs to avoid macOS extended attributes issues
  2. Workspace Update: The dagster-workspace-yaml ConfigMap is updated to include:

    load_from:
      - python_module:
          module_name: <project_name>.definitions
          working_directory: /opt/dagster/user-code/<project_name>/src  # or project root if no src/
    
  3. Automatic Reload: Dagster automatically detects the workspace.yaml changes and reloads within ~1 minute

    • No manual restart required
    • The new project will appear in the Deployment tab once loaded
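The working_directory choice in step 2 ("src or project root if no src/") can be expressed as a small helper. This is a hypothetical sketch of that logic, not the actual implementation (the recipe performs the ConfigMap update in shell):

```python
from pathlib import Path

def workspace_entry(project_name: str,
                    user_code_root: str = "/opt/dagster/user-code") -> dict:
    """Build the load_from entry for a deployed project.

    Points working_directory at src/ when the project uses a src layout,
    otherwise at the project root, matching the workspace update above.
    """
    project_dir = Path(user_code_root) / project_name
    src_dir = project_dir / "src"
    working_dir = src_dir if src_dir.is_dir() else project_dir
    return {
        "python_module": {
            "module_name": f"{project_name}.definitions",
            "working_directory": str(working_dir),
        }
    }
```
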

3. Verify Deployment

  • Access Dagster Web UI
  • Navigate to "Deployment" tab to see registered code locations
  • Wait ~1 minute for automatic workspace reload
  • Click "Reload all" in the UI to refresh the view if needed
  • Check the Asset Catalog for your assets

Remove Projects

# Remove a deployed project
just dagster::remove-project project_name

# Interactive removal (will prompt for project name)
just dagster::remove-project

What happens during removal:

  1. Files are deleted from the shared PVC
  2. The project module is removed from workspace.yaml
  3. Dagster automatically detects the change and reloads within ~1 minute

Manual Workspace Reload

If automatic reload doesn't work or you need immediate reload:

just dagster::reload-workspace

This command restarts both dagster-webserver and dagster-daemon to force an immediate workspace reload.

Storage Configuration

Local PVC Storage (Default)

Uses Kubernetes PersistentVolumeClaims for storage:

  • dagster-storage-pvc: Main Dagster storage (ReadWriteOnce)
  • dagster-user-code-pvc: Shared user code storage (ReadWriteMany with Longhorn)

MinIO Storage (Optional)

When MinIO is available, Dagster can use S3-compatible storage:

  • dagster-data: Data files bucket
  • dagster-logs: Compute logs bucket

The storage type is selected during installation via interactive prompt.

Environment Variables Setup

Environment variables are provided to Dagster through Kubernetes Secrets. You have several options:

Option 1: Customize the Example Template

  1. Create the example environment secrets template:

    just dagster::create-env-secrets-example
    
  2. Important: This creates a template with sample values. You must customize it:

    • If using External Secrets: Edit dagster-env-external-secret.gomplate.yaml to reference your actual Vault paths
    • If using Direct Secrets: Update the created dagster-env-secret with your actual credentials

Option 2: Create ExternalSecret Manually

Create an ExternalSecret that references your Vault credentials:

apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: dagster-env-external-secret
  namespace: dagster
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-secret-store
    kind: ClusterSecretStore
  target:
    name: dagster-env-secret
  data:
  - secretKey: AWS_ACCESS_KEY_ID
    remoteRef:
      key: minio/credentials
      property: access_key
  - secretKey: AWS_SECRET_ACCESS_KEY
    remoteRef:
      key: minio/credentials
      property: secret_key
  - secretKey: POSTGRES_URL
    remoteRef:
      key: postgres/admin
      property: connection_string
  # Add more variables as needed

Option 3: Create Kubernetes Secret Directly

kubectl create secret generic dagster-env-secret -n dagster \
  --from-literal=AWS_ACCESS_KEY_ID="your-access-key" \
  --from-literal=AWS_SECRET_ACCESS_KEY="your-secret-key" \
  --from-literal=AWS_ENDPOINT_URL="http://minio.minio.svc.cluster.local:9000" \
  --from-literal=POSTGRES_URL="postgresql://user:pass@postgres-cluster-rw.postgres:5432"

After creating the environment secrets, redeploy Dagster to pick up the new configuration.

Example Projects

CSV to PostgreSQL Project

The examples/csv_to_postgres project demonstrates a complete ETL pipeline that loads data from MinIO object storage into PostgreSQL using dlt (data load tool).

Dataset Information

MovieLens 20M Dataset

This project processes the MovieLens 20M dataset from GroupLens Research. The dataset contains:

  • 27,278 movies with metadata
  • 20 million ratings from 138,493 users
  • 465,564 tags applied by users
  • Additional genome data for content-based filtering
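
For a sense of scale, these figures imply a very sparse user-item matrix:

```python
# Sparsity of the MovieLens 20M user-item matrix, using the figures above
ratings = 20_000_000
users = 138_493
movies = 27_278

possible = users * movies        # every possible (user, movie) pair
density = ratings / possible     # fraction of pairs that actually have a rating
print(f"{density:.2%} of possible ratings are present")  # roughly 0.53%
```
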

MinIO Storage Structure

The dataset files are stored in MinIO under the movie-lens bucket:

mc alias set buun https://minio.your-domain.com access-key secret-key
mc ls buun/movie-lens

[2025-09-14 12:13:09 JST] 309MiB STANDARD genome-scores.csv
[2025-09-14 12:12:37 JST]  18KiB STANDARD genome-tags.csv
[2025-09-14 12:12:38 JST] 557KiB STANDARD links.csv
[2025-09-14 12:12:38 JST] 1.3MiB STANDARD movies.csv
[2025-09-14 12:13:15 JST] 509MiB STANDARD ratings.csv
[2025-09-14 12:12:42 JST]  16MiB STANDARD tags.csv

The project processes:

  • movies.csv (1.3MiB) - Movie metadata
  • tags.csv (16MiB) - User-generated tags
  • ratings.csv (509MiB) - User ratings

Project Features

Assets Processed

  • movies_pipeline: MovieLens movies data with primary key movieId
  • ratings_pipeline: User ratings with composite primary key [userId, movieId]
  • tags_pipeline: User tags with composite primary key [userId, movieId, timestamp]
  • movielens_summary: Generates metadata summary of all processed assets

Smart Processing

  • Table Existence Check: Uses DuckDB PostgreSQL scanner to check if tables already exist
  • Skip Logic: If a table already contains data, the asset will skip processing to avoid reprocessing large files
  • Write Disposition: Uses replace mode for initial loads

Dependencies

  • dlt[duckdb,filesystem,postgres,s3]>=1.12.1
  • dagster and related libraries
  • duckdb (for table existence checking)
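
The table-existence skip logic can be illustrated with a generic DB-API connection. The sketch below uses sqlite3 only so it is self-contained; the actual project runs an equivalent query against PostgreSQL through DuckDB's scanner:

```python
import sqlite3

def table_has_data(conn, table: str) -> bool:
    """True if `table` exists and contains at least one row.

    Illustrative skip-logic check; table names are assumed trusted
    (they come from the pipeline, not user input).
    """
    try:
        cur = conn.execute(f"SELECT 1 FROM {table} LIMIT 1")
    except Exception:
        return False  # table does not exist yet
    return cur.fetchone() is not None

# An asset would call this first and return early when data is present,
# avoiding a re-load of the large CSV files.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (userId INTEGER, movieId INTEGER, rating REAL)")
print(table_has_data(conn, "ratings"))  # False: table exists but is empty
conn.execute("INSERT INTO ratings VALUES (1, 2, 4.5)")
print(table_has_data(conn, "ratings"))  # True: at least one row
```
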

Environment Variables Required

The project expects the following environment variables to be set:

  • POSTGRES_URL: PostgreSQL connection string (format: postgresql://user:password@host:port/database)
  • AWS_ACCESS_KEY_ID: MinIO/S3 access key
  • AWS_SECRET_ACCESS_KEY: MinIO/S3 secret key
  • AWS_ENDPOINT_URL: MinIO endpoint URL
  • Additional dlt-specific environment variables for advanced configuration
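
A quick fail-fast check for these variables (a hypothetical helper, not shipped with the project):

```python
import os
from urllib.parse import urlparse

REQUIRED = ("POSTGRES_URL", "AWS_ACCESS_KEY_ID",
            "AWS_SECRET_ACCESS_KEY", "AWS_ENDPOINT_URL")

def check_env(env=None) -> list[str]:
    """Return human-readable problems with the required variables."""
    env = os.environ if env is None else env
    problems = [f"{name} is not set" for name in REQUIRED if not env.get(name)]
    url = env.get("POSTGRES_URL", "")
    if url and urlparse(url).scheme not in ("postgresql", "postgres"):
        problems.append("POSTGRES_URL must use the postgresql:// scheme")
    return problems
```

Calling a check like this at import time can turn a cryptic mid-run failure into an immediate, readable error.
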

Debugging and Troubleshooting

Debug Commands

Check Dagster component logs using kubectl:

Pod Status and Logs

# Check Dagster pods status
kubectl get pods -n dagster

# View webserver logs
kubectl logs -n dagster deployment/dagster-dagster-webserver -c dagster-webserver --tail=100

# View daemon logs
kubectl logs -n dagster deployment/dagster-daemon -c dagster-daemon --tail=100

# View user code deployment logs (if using code servers)
kubectl logs -n dagster deployment/dagster-user-code -c dagster --tail=100

Configuration and Secrets

# Check workspace configuration
kubectl get configmap dagster-workspace-yaml -n dagster -o yaml

# Check database secret
kubectl describe secret dagster-database-secret -n dagster

# Check environment secrets (if configured)
kubectl describe secret dagster-env-secret -n dagster

# Check OAuth secrets
kubectl describe secret dagster-oauth-secret -n dagster

Common Issues

Assets Not Appearing

Symptoms: Project deployed but assets not visible in Dagster UI

Debugging Steps:

  1. Check webserver logs for import errors:

    kubectl logs -n dagster deployment/dagster-dagster-webserver -c dagster-webserver --tail=100 | grep -i error
    
  2. Verify workspace configuration:

    kubectl get configmap dagster-workspace-yaml -n dagster -o jsonpath='{.data.workspace\.yaml}'
    
  3. Check project files in PVC:

    WEBSERVER_POD=$(kubectl get pods -n dagster -l component=dagster-webserver -o jsonpath='{.items[0].metadata.name}')
    kubectl exec $WEBSERVER_POD -n dagster -- ls -la /opt/dagster/user-code/
    

Common Causes:

  • Python syntax errors in project files
  • Missing definitions.py file
  • Incorrect module structure
  • Project name contains hyphens
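
To distinguish these causes, you can reproduce the import Dagster performs. The helper below is a hypothetical debugging aid (run it inside the webserver pod against the project's working directory, e.g. /opt/dagster/user-code/<project>/src):

```python
import importlib.util
import sys

def can_import_definitions(project_name: str, working_dir: str) -> bool:
    """Check whether <project_name>.definitions is importable from working_dir.

    Mimics what Dagster's workspace loader does; hypothetical helper,
    not part of buun-stack.
    """
    sys.path.insert(0, working_dir)
    try:
        return importlib.util.find_spec(f"{project_name}.definitions") is not None
    except Exception as exc:  # syntax errors, missing deps, absent package
        print(f"import failed: {exc!r}")
        return False
    finally:
        sys.path.pop(0)

# Example: a missing package reports failure instead of raising
print(can_import_definitions("no_such_project", "/opt/dagster/user-code"))  # False
```
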

Asset Execution Failures

Symptoms: Assets appear but fail during materialization

Debugging Steps:

  1. Check daemon logs for execution errors:

    kubectl logs -n dagster deployment/dagster-daemon -c dagster-daemon --tail=100
    
  2. Check environment variables in webserver:

    WEBSERVER_POD=$(kubectl get pods -n dagster -l component=dagster-webserver -o jsonpath='{.items[0].metadata.name}')
    kubectl exec $WEBSERVER_POD -n dagster -- env | grep -E "(AWS|POSTGRES|DLT)"
    
  3. Test connectivity from pods:

    # Test MinIO connectivity (Service ClusterIPs usually don't answer ICMP ping, so check the port)
    kubectl exec $WEBSERVER_POD -n dagster -- nc -zv minio.minio.svc.cluster.local 9000
    
    # Test PostgreSQL connectivity
    kubectl exec $WEBSERVER_POD -n dagster -- nc -zv postgres-cluster-rw.postgres 5432
    

Environment Variables Issues

Symptoms: Assets fail with authentication or connection errors

Debugging Steps:

  1. Verify secret exists and contains data:

    kubectl describe secret dagster-env-secret -n dagster
    
  2. Check if ExternalSecret is syncing (if using External Secrets):

    kubectl get externalsecret dagster-env-external-secret -n dagster
    kubectl describe externalsecret dagster-env-external-secret -n dagster
    
  3. Verify environment variables are loaded in pods:

    WEBSERVER_POD=$(kubectl get pods -n dagster -l component=dagster-webserver -o jsonpath='{.items[0].metadata.name}')
    kubectl exec $WEBSERVER_POD -n dagster -- printenv | grep -E "(AWS|POSTGRES|DLT)"
    

Authentication Issues

Symptoms: Cannot access Dagster UI or authentication failures

Debugging Steps:

  1. Check OAuth2 proxy status:

    kubectl get pods -n dagster -l app=oauth2-proxy
    kubectl logs -n dagster deployment/oauth2-proxy-dagster --tail=100
    
  2. Verify OAuth client configuration in Keycloak:

    • Ensure client dagster exists in the realm
    • Check redirect URIs are correctly configured
    • Verify client secret matches
  3. Check OAuth secret:

    kubectl describe secret dagster-oauth-secret -n dagster
    

Database Connection Issues

Symptoms: Database-related errors or connection failures

Debugging Steps:

  1. Test database connectivity:

    WEBSERVER_POD=$(kubectl get pods -n dagster -l component=dagster-webserver -o jsonpath='{.items[0].metadata.name}')
    # Single line so it can be pasted into a shell verbatim
    kubectl exec $WEBSERVER_POD -n dagster -- python3 -c "import os, psycopg2; conn = psycopg2.connect(host='postgres-cluster-rw.postgres', port=5432, dbname='dagster', user=os.getenv('POSTGRES_USER', 'dagster'), password=os.getenv('POSTGRES_PASSWORD', '')); print('Database connection successful'); conn.close()"
    
  2. Check database secret:

    kubectl describe secret dagster-database-secret -n dagster
    
  3. Verify database exists:

    just postgres::psql -c "\l" | grep dagster
    

Connection Testing

MinIO Connectivity

# Test MinIO access from Dagster pod
WEBSERVER_POD=$(kubectl get pods -n dagster -l component=dagster-webserver -o jsonpath='{.items[0].metadata.name}')
kubectl exec $WEBSERVER_POD -n dagster -- python3 -c "
import boto3
import os
client = boto3.client('s3',
    endpoint_url=os.getenv('AWS_ENDPOINT_URL'),
    aws_access_key_id=os.getenv('AWS_ACCESS_KEY_ID'),
    aws_secret_access_key=os.getenv('AWS_SECRET_ACCESS_KEY')
)
print('Buckets:', [b['Name'] for b in client.list_buckets()['Buckets']])
"

Log Analysis Tips

  1. Filter logs by timestamp:

    kubectl logs -n dagster deployment/dagster-dagster-webserver --since=10m
    
  2. Search for specific errors:

    kubectl logs -n dagster deployment/dagster-daemon | grep -i "error\|exception\|failed"
    
  3. Monitor logs in real-time:

    kubectl logs -n dagster deployment/dagster-dagster-webserver -f
    
  4. Check resource usage:

    kubectl top pods -n dagster