From 1b807d7f32754a626bd32d44b8c16fc368641963 Mon Sep 17 00:00:00 2001 From: Masaki Yatsu Date: Tue, 16 Sep 2025 14:21:50 +0900 Subject: [PATCH] docs(airflow): write more about Airflow debug --- airflow/README.md | 458 +++++++++++++++++++++++++++++++++++++ airflow/examples/README.md | 190 --------------- 2 files changed, 458 insertions(+), 190 deletions(-) create mode 100644 airflow/README.md delete mode 100644 airflow/examples/README.md diff --git a/airflow/README.md b/airflow/README.md new file mode 100644 index 0000000..5c29e6a --- /dev/null +++ b/airflow/README.md @@ -0,0 +1,458 @@ +# Airflow Documentation + +## Overview + +This document covers Airflow installation, deployment, and debugging in the buun-stack environment. + +## Installation + +### Prerequisites + +- Kubernetes cluster with buun-stack components +- PostgreSQL database cluster +- MinIO object storage +- External Secrets Operator (optional, for Vault integration) +- JupyterHub (optional, for DAG deployment via web interface) + +### Installation Steps + +1. **Setup Environment Secrets** (if needed): + - See Environment Variables Setup section below for configuration options + - Create ExternalSecret or Secret before installation if you want environment variables available immediately + +2. **Install Airflow**: + + ```bash + # Interactive installation with configuration prompts + just airflow::install + ``` + +3. **Access Airflow Web UI**: + - Navigate to your Airflow instance (e.g., `https://airflow.buun.dev`) + - Login with your Keycloak credentials + +4. **Assign User Roles** (if needed): + + ```bash + # Add user role for DAG execution permissions + just airflow::assign-role airflow_user + + # Available roles: + # - airflow_admin: Full administrative access + # - airflow_op: Operator access (can trigger DAGs) + # - airflow_user: User access (read/write access to DAGs) + # - airflow_viewer: Viewer access (read-only) + ``` + +### Uninstalling + +```bash +# Remove Airflow (keeps database by default) +just airflow::uninstall + +# Remove Airflow and delete database +just airflow::uninstall true +``` + +## DAG Deployment + +### 1. Access JupyterHub + +- Navigate to your JupyterHub instance (e.g., `https://jupyter.buun.dev`) +- Login with your credentials + +### 2. Navigate to Airflow DAGs Directory + +In JupyterHub, the Airflow DAGs directory is mounted at: + +``` +/home/jovyan/airflow-dags/ +``` + +### 3. Upload the DAG File + +1. Open JupyterHub file browser +2. Navigate to `/home/jovyan/airflow-dags/` +3. Upload or copy `csv_to_postgres_dag.py` to this directory + +### 4. Verify Deployment + +1. Access Airflow Web UI (e.g., `https://airflow.buun.dev`) +2. Check that the DAG `csv_to_postgres` appears in the DAGs list +3. 
If the DAG doesn't appear immediately, wait 1-2 minutes for Airflow to detect the new file + +## DAG Features + +### Tables Processed + +- **movies**: MovieLens movies data with primary key `movieId` +- **ratings**: User ratings with composite primary key `[userId, movieId]` +- **tags**: User tags with composite primary key `[userId, movieId, timestamp]` +- **summary**: Generates metadata summary of all processed tables + +### Smart Processing + +- **Table Existence Check**: Uses DuckDB PostgreSQL scanner to check if tables already exist +- **Skip Logic**: If a table already contains data, the task will skip processing to avoid reprocessing large files +- **Write Disposition**: Uses `replace` mode for initial loads + +### Environment Variables Required + +The DAG expects the following environment variables to be set: + +- `POSTGRES_URL`: PostgreSQL connection string (format: `postgresql://user:password@host:port/database`) +- `AWS_ACCESS_KEY_ID`: MinIO/S3 access key +- `AWS_SECRET_ACCESS_KEY`: MinIO/S3 secret key +- `AWS_ENDPOINT_URL`: MinIO endpoint URL +- Additional dlt-specific environment variables for advanced configuration + +### Environment Variables Setup + +Environment variables are provided to Airflow through Kubernetes Secrets. You have several options: + +#### Option 1: Customize the Example Template + +1. Create the example environment secrets template: + + ```bash + just airflow::create-env-secrets-example + ``` + +2. **Important**: This creates a template with sample values. You must customize it: + - If using **External Secrets**: Edit `airflow-env-external-secret.gomplate.yaml` to reference your actual Vault paths + - If using **Direct Secrets**: Update the created `airflow-env-secret` with your actual credentials + +#### Option 2: Create ExternalSecret Manually + +Create an ExternalSecret that references your Vault credentials: + +```yaml +apiVersion: external-secrets.io/v1 +kind: ExternalSecret +metadata: + name: airflow-env-external-secret + namespace: datastack +spec: + refreshInterval: 1h + secretStoreRef: + name: vault-secret-store + kind: ClusterSecretStore + target: + name: airflow-env-secret + data: + - secretKey: AWS_ACCESS_KEY_ID + remoteRef: + key: minio/credentials + property: access_key + - secretKey: AWS_SECRET_ACCESS_KEY + remoteRef: + key: minio/credentials + property: secret_key + # Add more variables as needed +``` + +#### Option 3: Create Kubernetes Secret Directly + +```bash +kubectl create secret generic airflow-env-secret -n datastack \ + --from-literal=AWS_ACCESS_KEY_ID="your-access-key" \ + --from-literal=AWS_SECRET_ACCESS_KEY="your-secret-key" \ + --from-literal=AWS_ENDPOINT_URL="http://minio.minio.svc.cluster.local:9000" \ + --from-literal=POSTGRES_URL="postgresql://user:pass@postgres-cluster-rw.postgres:5432" +``` + +After creating the environment secrets, redeploy Airflow to pick up the new configuration. + +### Manual Execution + +The DAG is configured for manual execution only (`schedule_interval=None`). To run: + +1. Go to Airflow Web UI +2. Find the `csv_to_postgres` DAG +3. Click "Trigger DAG" to start execution + +## Example DAGs + +### CSV to PostgreSQL DAG + +The `csv_to_postgres_dag.py` demonstrates a complete ETL pipeline that loads data from MinIO object storage into PostgreSQL using dlt (data load tool). + +#### Dataset Information + +##### MovieLens 20M Dataset + +This DAG processes the [MovieLens 20M dataset](https://grouplens.org/datasets/movielens/20m/) from GroupLens Research. 
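If the dataset still needs to be staged in MinIO, it can be downloaded from GroupLens and uploaded with the `mc` client. The following is a minimal sketch, not part of the DAG: the download URL, the `buun` alias, the endpoint, and the `movie-lens` bucket name are assumptions and should be adjusted to your environment (the same names are used in the MinIO Storage Structure section below).

```bash
# Download and unpack the MovieLens 20M archive (roughly 190 MB compressed)
curl -LO https://files.grouplens.org/datasets/movielens/ml-20m.zip
unzip ml-20m.zip

# Create the bucket and upload the CSV files (alias, endpoint, and bucket names are assumptions)
mc alias set buun https://minio.your-domain.com access-key secret-key
mc mb buun/movie-lens
mc cp ml-20m/*.csv buun/movie-lens/
```
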
The dataset contains:

- **27,278 movies** with metadata
- **20 million ratings** from 138,493 users
- **465,564 tags** applied by users
- Additional genome data for content-based filtering

##### MinIO Storage Structure

The dataset files are stored in MinIO under the `movie-lens` bucket:

```bash
mc alias set buun https://minio.your-domain.com access-key secret-key
mc ls buun/movie-lens

[2025-09-14 12:13:09 JST] 309MiB STANDARD genome-scores.csv
[2025-09-14 12:12:37 JST] 18KiB STANDARD genome-tags.csv
[2025-09-14 12:12:38 JST] 557KiB STANDARD links.csv
[2025-09-14 12:12:38 JST] 1.3MiB STANDARD movies.csv
[2025-09-14 12:13:15 JST] 509MiB STANDARD ratings.csv
[2025-09-14 12:12:42 JST] 16MiB STANDARD tags.csv
```

The DAG currently processes:

- **movies.csv** (1.3MiB) - Movie metadata
- **tags.csv** (16MiB) - User-generated tags
- **ratings.csv** (509MiB) - User ratings (available but currently disabled in the DAG)

#### DAG Features

##### Tables Processed

- **movies**: MovieLens movies data with primary key `movieId`
- **ratings**: User ratings with composite primary key `[userId, movieId]`
- **tags**: User tags with composite primary key `[userId, movieId, timestamp]`
- **summary**: Generates metadata summary of all processed tables

##### Smart Processing

- **Table Existence Check**: Uses the DuckDB PostgreSQL scanner to check if tables already exist
- **Skip Logic**: If a table already contains data, the task will skip processing to avoid reprocessing large files
- **Write Disposition**: Uses `replace` mode for initial loads

##### Dependencies

- `dlt[duckdb,filesystem,postgres,s3]>=1.12.1`
- duckdb (for table existence checking)
- Standard Airflow libraries

## Debugging and Troubleshooting

### Debug Commands

The Airflow justfile provides several debugging recipes:

#### DAG Import and Processing Logs

```bash
# Check DAG import errors from processor logs
just airflow::logs-dag-errors

# Check DAG import errors for a specific file
just airflow::logs-dag-errors csv_to_postgres_dag.py

# Test DAG file import manually
just airflow::logs-test-import csv_to_postgres_dag.py

# Monitor DAG processing in real-time
just airflow::logs-dag-processor
```

#### Worker and Task Logs

```bash
# View worker logs (where tasks execute)
just airflow::logs-worker

# View scheduler logs
just airflow::logs-scheduler

# View API server logs (Airflow 3.0)
just airflow::logs-api-server

# View all Airflow component logs
just airflow::logs-all
```

#### Specific Component Debugging

```bash
# Check logs for a specific pod and container
kubectl logs -n datastack <pod-name> -c <container-name>

# Examples:
kubectl logs -n datastack airflow-worker-0 -c worker --tail=100
kubectl logs -n datastack airflow-scheduler-xxx -c scheduler --tail=100
kubectl logs -n datastack airflow-dag-processor-xxx -c dag-processor --tail=100
```

### Common Issues

#### DAG Not Appearing

**Symptoms**: DAG file uploaded but not visible in the Airflow UI

**Debugging Steps**:

1. Check DAG processor logs:

   ```bash
   just airflow::logs-dag-errors
   ```

2. Test DAG import manually:

   ```bash
   just airflow::logs-test-import your-dag-file.py
   ```

3. Verify file location and permissions:

   ```bash
   kubectl exec -n datastack airflow-dag-processor-xxx -c dag-processor -- ls -la /opt/airflow/dags/
   ```
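As a cross-check independent of the justfile recipes, the Airflow CLI can list registered DAGs and recorded import errors. This is a minimal sketch: the pod-name placeholder follows the examples above, and the exact `airflow dags` subcommands available can vary between Airflow versions.

```bash
# Confirm the DAG is registered with the scheduler
kubectl exec -n datastack airflow-scheduler-xxx -c scheduler -- airflow dags list

# Show any import errors recorded for files under /opt/airflow/dags
kubectl exec -n datastack airflow-scheduler-xxx -c scheduler -- airflow dags list-import-errors
```
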
**Common Causes**:

- Python syntax errors in the DAG file
- Missing Python package imports
- Duplicate DAG IDs
- File permission issues

#### Task Execution Failures

**Symptoms**: DAG appears but tasks fail during execution

**Debugging Steps**:

1. Check worker logs for the specific task:

   ```bash
   just airflow::logs-worker | grep -A 10 -B 10 "task_id"
   ```

2. Check environment variables in the worker:

   ```bash
   kubectl exec -n datastack airflow-worker-0 -c worker -- env | grep -E "(AWS|POSTGRES)"
   ```

3. Test connectivity from the worker:

   ```bash
   # Test MinIO connectivity
   kubectl exec -n datastack airflow-worker-0 -c worker -- ping minio.minio.svc.cluster.local

   # Test PostgreSQL connectivity
   kubectl exec -n datastack airflow-worker-0 -c worker -- nc -zv postgres-cluster-rw.postgres 5432
   ```

#### Environment Variables Issues

**Symptoms**: Tasks fail with authentication or connection errors

**Debugging Steps**:

1. Verify the secret exists and contains data:

   ```bash
   kubectl describe secret airflow-env-secret -n datastack
   ```

2. Check if the ExternalSecret is syncing (if using External Secrets):

   ```bash
   kubectl get externalsecret airflow-env-external-secret -n datastack
   kubectl describe externalsecret airflow-env-external-secret -n datastack
   ```

3. Verify environment variables are loaded in the pods:

   ```bash
   kubectl exec -n datastack airflow-worker-0 -c worker -- printenv | grep -E "(AWS|POSTGRES|DLT)"
   ```

#### Authentication and Permissions

**Symptoms**: 403 Forbidden errors when triggering DAGs

**Debugging Steps**:

1. Check user roles in Airflow:

   ```bash
   kubectl exec -n datastack airflow-scheduler-xxx -c scheduler -- airflow users list
   ```

2. Assign the proper role if needed:

   ```bash
   just airflow::assign-role airflow_user
   ```

3. Check Keycloak client roles:
   - Ensure the user has the appropriate Keycloak client role
   - Re-login to Airflow to sync roles

#### Package Installation Issues

**Symptoms**: Import errors for packages like `dlt`, `duckdb`

**Debugging Steps**:

1. Check if the packages are installed correctly:

   ```bash
   kubectl exec -n datastack airflow-worker-0 -c worker -- pip list | grep -E "(dlt|duckdb)"
   ```

2. Verify the init container logs:

   ```bash
   kubectl logs -n datastack airflow-worker-0 -c install-packages
   ```

3. Check the PYTHONPATH configuration:

   ```bash
   kubectl exec -n datastack airflow-worker-0 -c worker -- printenv PYTHONPATH
   ```

### Connection Testing

#### MinIO Connectivity

```bash
# Test MinIO access from the worker
kubectl exec -n datastack airflow-worker-0 -c worker -- python3 -c "
import boto3
import os
client = boto3.client('s3',
    endpoint_url=os.getenv('AWS_ENDPOINT_URL'),
    aws_access_key_id=os.getenv('AWS_ACCESS_KEY_ID'),
    aws_secret_access_key=os.getenv('AWS_SECRET_ACCESS_KEY')
)
print('Buckets:', [b['Name'] for b in client.list_buckets()['Buckets']])
"
```

### Log Analysis Tips

1. **Filter logs by timestamp**:

   ```bash
   kubectl logs -n datastack airflow-worker-0 -c worker --since=10m
   ```

2. **Search for specific errors**:

   ```bash
   just airflow::logs-worker | grep -i "error\|exception\|failed"
   ```

3. **Monitor logs in real-time**:

   ```bash
   kubectl logs -n datastack airflow-worker-0 -c worker -f
   ```

4. 
**Check resource usage**: + + ```bash + kubectl top pods -n datastack | grep airflow + ``` diff --git a/airflow/examples/README.md b/airflow/examples/README.md deleted file mode 100644 index 8f80a79..0000000 --- a/airflow/examples/README.md +++ /dev/null @@ -1,190 +0,0 @@ -# CSV to PostgreSQL Airflow DAG Deployment - -## Overview - -This document describes how to deploy the `csv_to_postgres_dag.py` to Airflow using JupyterHub interface. The DAG processes MovieLens dataset files stored in MinIO and loads them into PostgreSQL. - -## Dataset Information - -### MovieLens 20M Dataset - -This DAG processes the [MovieLens 20M dataset](https://grouplens.org/datasets/movielens/20m/) from GroupLens Research. The dataset contains: - -- **27,278 movies** with metadata -- **20 million ratings** from 138,493 users -- **465,564 tags** applied by users -- Additional genome data for content-based filtering - -### MinIO Storage Structure - -The dataset files are stored in MinIO under the `movie-lens` bucket: - -```bash -mc alias set buun https://minio.your-domain.com access-key secret-key -mc ls buun/movie-lens - -[2025-09-14 12:13:09 JST] 309MiB STANDARD genome-scores.csv -[2025-09-14 12:12:37 JST] 18KiB STANDARD genome-tags.csv -[2025-09-14 12:12:38 JST] 557KiB STANDARD links.csv -[2025-09-14 12:12:38 JST] 1.3MiB STANDARD movies.csv -[2025-09-14 12:13:15 JST] 509MiB STANDARD ratings.csv -[2025-09-14 12:12:42 JST] 16MiB STANDARD tags.csv -``` - -The DAG currently processes: - -- **movies.csv** (1.3MiB) - Movie metadata -- **tags.csv** (16MiB) - User-generated tags -- **ratings.csv** (509MiB) - User ratings (available but currently disabled in DAG) - -## Deployment Steps - -### 1. Access JupyterHub - -- Navigate to your JupyterHub instance (e.g., `https://jupyter.buun.dev`) -- Login with your credentials - -### 2. Navigate to Airflow DAGs Directory - -In JupyterHub, the Airflow DAGs directory is mounted at: - -``` -/home/jovyan/airflow-dags/ -``` - -### 3. Upload the DAG File - -1. Open JupyterHub file browser -2. Navigate to `/home/jovyan/airflow-dags/` -3. Upload or copy `csv_to_postgres_dag.py` to this directory - -### 4. Verify Deployment - -1. Access Airflow Web UI (e.g., `https://airflow.buun.dev`) -2. Check that the DAG `csv_to_postgres` appears in the DAGs list -3. If the DAG doesn't appear immediately, wait 1-2 minutes for Airflow to detect the new file - -## DAG Features - -### Tables Processed - -- **movies**: MovieLens movies data with primary key `movieId` -- **ratings**: User ratings with composite primary key `[userId, movieId]` -- **tags**: User tags with composite primary key `[userId, movieId, timestamp]` -- **summary**: Generates metadata summary of all processed tables - -### Smart Processing - -- **Table Existence Check**: Uses DuckDB PostgreSQL scanner to check if tables already exist -- **Skip Logic**: If a table already contains data, the task will skip processing to avoid reprocessing large files -- **Write Disposition**: Uses `replace` mode for initial loads - -### Environment Variables Required - -The DAG expects the following environment variables to be set: - -- `POSTGRES_URL`: PostgreSQL connection string (format: `postgresql://user:password@host:port/database`) -- `AWS_ACCESS_KEY_ID`: MinIO/S3 access key -- `AWS_SECRET_ACCESS_KEY`: MinIO/S3 secret key -- `AWS_ENDPOINT_URL`: MinIO endpoint URL -- Additional dlt-specific environment variables for advanced configuration - -### Environment Variables Setup - -Environment variables are provided to Airflow through Kubernetes Secrets. 
You have several options: - -#### Option 1: Customize the Example Template - -1. Create the example environment secrets template: - - ```bash - just airflow::create-env-secrets-example - ``` - -2. **Important**: This creates a template with sample values. You must customize it: - - If using **External Secrets**: Edit `airflow-env-external-secret.gomplate.yaml` to reference your actual Vault paths - - If using **Direct Secrets**: Update the created `airflow-env-secret` with your actual credentials - -#### Option 2: Create ExternalSecret Manually - -Create an ExternalSecret that references your Vault credentials: - -```yaml -apiVersion: external-secrets.io/v1 -kind: ExternalSecret -metadata: - name: airflow-env-external-secret - namespace: datastack -spec: - refreshInterval: 1h - secretStoreRef: - name: vault-secret-store - kind: ClusterSecretStore - target: - name: airflow-env-secret - data: - - secretKey: AWS_ACCESS_KEY_ID - remoteRef: - key: minio/credentials - property: access_key - - secretKey: AWS_SECRET_ACCESS_KEY - remoteRef: - key: minio/credentials - property: secret_key - # Add more variables as needed -``` - -#### Option 3: Create Kubernetes Secret Directly - -```bash -kubectl create secret generic airflow-env-secret -n datastack \ - --from-literal=AWS_ACCESS_KEY_ID="your-access-key" \ - --from-literal=AWS_SECRET_ACCESS_KEY="your-secret-key" \ - --from-literal=AWS_ENDPOINT_URL="http://minio.minio.svc.cluster.local:9000" \ - --from-literal=POSTGRES_URL="postgresql://user:pass@postgres-cluster-rw.postgres:5432" -``` - -After creating the environment secrets, redeploy Airflow to pick up the new configuration. - -### Manual Execution - -The DAG is configured for manual execution only (`schedule_interval=None`). To run: - -1. Go to Airflow Web UI -2. Find the `csv_to_postgres` DAG -3. Click "Trigger DAG" to start execution - -## Dependencies - -- dlt[duckdb,filesystem,postgres,s3]>=1.12.1 -- duckdb (for table existence checking) -- Standard Airflow libraries - -## Troubleshooting - -### DAG Not Appearing - -- Check file permissions in `/home/jovyan/airflow-dags/` -- Verify the Python syntax is correct -- Check Airflow logs for import errors - -### Environment Variables - -- Ensure the `airflow-env-secret` Kubernetes Secret exists in the datastack namespace -- Verify secret contains all required environment variables: - - ```bash - kubectl describe secret airflow-env-secret -n datastack - ``` - -- If using External Secrets, check that the ExternalSecret is syncing properly: - - ```bash - kubectl get externalsecret airflow-env-external-secret -n datastack - ``` - -### Connection Issues - -- Verify MinIO and PostgreSQL connectivity from Airflow workers -- Check that the `movielens_af` database exists in PostgreSQL -- Ensure MinIO bucket `movie-lens` is accessible with proper credentials