From 1b807d7f32754a626bd32d44b8c16fc368641963 Mon Sep 17 00:00:00 2001 From: Masaki Yatsu Date: Tue, 16 Sep 2025 14:21:50 +0900 Subject: [PATCH] docs(airflow): write more about Airflow debug --- airflow/README.md | 458 +++++++++++++++++++++++++++++++++++++ airflow/examples/README.md | 190 --------------- 2 files changed, 458 insertions(+), 190 deletions(-) create mode 100644 airflow/README.md delete mode 100644 airflow/examples/README.md diff --git a/airflow/README.md b/airflow/README.md new file mode 100644 index 0000000..5c29e6a --- /dev/null +++ b/airflow/README.md @@ -0,0 +1,458 @@ +# Airflow Documentation + +## Overview + +This document covers Airflow installation, deployment, and debugging in the buun-stack environment. + +## Installation + +### Prerequisites + +- Kubernetes cluster with buun-stack components +- PostgreSQL database cluster +- MinIO object storage +- External Secrets Operator (optional, for Vault integration) +- JupyterHub (optional, for DAG deployment via web interface) + +### Installation Steps + +1. **Setup Environment Secrets** (if needed): + - See Environment Variables Setup section below for configuration options + - Create ExternalSecret or Secret before installation if you want environment variables available immediately + +2. **Install Airflow**: + + ```bash + # Interactive installation with configuration prompts + just airflow::install + ``` + +3. **Access Airflow Web UI**: + - Navigate to your Airflow instance (e.g., `https://airflow.buun.dev`) + - Login with your Keycloak credentials + +4. **Assign User Roles** (if needed): + + ```bash + # Add user role for DAG execution permissions + just airflow::assign-role airflow_user + + # Available roles: + # - airflow_admin: Full administrative access + # - airflow_op: Operator access (can trigger DAGs) + # - airflow_user: User access (read/write access to DAGs) + # - airflow_viewer: Viewer access (read-only) + ``` + +### Uninstalling + +```bash +# Remove Airflow (keeps database by default) +just airflow::uninstall + +# Remove Airflow and delete database +just airflow::uninstall true +``` + +## DAG Deployment + +### 1. Access JupyterHub + +- Navigate to your JupyterHub instance (e.g., `https://jupyter.buun.dev`) +- Login with your credentials + +### 2. Navigate to Airflow DAGs Directory + +In JupyterHub, the Airflow DAGs directory is mounted at: + +``` +/home/jovyan/airflow-dags/ +``` + +### 3. Upload the DAG File + +1. Open JupyterHub file browser +2. Navigate to `/home/jovyan/airflow-dags/` +3. Upload or copy `csv_to_postgres_dag.py` to this directory + +### 4. Verify Deployment + +1. Access Airflow Web UI (e.g., `https://airflow.buun.dev`) +2. Check that the DAG `csv_to_postgres` appears in the DAGs list +3. 
If the DAG doesn't appear immediately, wait 1-2 minutes for Airflow to detect the new file + +## DAG Features + +### Tables Processed + +- **movies**: MovieLens movies data with primary key `movieId` +- **ratings**: User ratings with composite primary key `[userId, movieId]` +- **tags**: User tags with composite primary key `[userId, movieId, timestamp]` +- **summary**: Generates metadata summary of all processed tables + +### Smart Processing + +- **Table Existence Check**: Uses DuckDB PostgreSQL scanner to check if tables already exist +- **Skip Logic**: If a table already contains data, the task will skip processing to avoid reprocessing large files +- **Write Disposition**: Uses `replace` mode for initial loads + +### Environment Variables Required + +The DAG expects the following environment variables to be set: + +- `POSTGRES_URL`: PostgreSQL connection string (format: `postgresql://user:password@host:port/database`) +- `AWS_ACCESS_KEY_ID`: MinIO/S3 access key +- `AWS_SECRET_ACCESS_KEY`: MinIO/S3 secret key +- `AWS_ENDPOINT_URL`: MinIO endpoint URL +- Additional dlt-specific environment variables for advanced configuration + +### Environment Variables Setup + +Environment variables are provided to Airflow through Kubernetes Secrets. You have several options: + +#### Option 1: Customize the Example Template + +1. Create the example environment secrets template: + + ```bash + just airflow::create-env-secrets-example + ``` + +2. **Important**: This creates a template with sample values. You must customize it: + - If using **External Secrets**: Edit `airflow-env-external-secret.gomplate.yaml` to reference your actual Vault paths + - If using **Direct Secrets**: Update the created `airflow-env-secret` with your actual credentials + +#### Option 2: Create ExternalSecret Manually + +Create an ExternalSecret that references your Vault credentials: + +```yaml +apiVersion: external-secrets.io/v1 +kind: ExternalSecret +metadata: + name: airflow-env-external-secret + namespace: datastack +spec: + refreshInterval: 1h + secretStoreRef: + name: vault-secret-store + kind: ClusterSecretStore + target: + name: airflow-env-secret + data: + - secretKey: AWS_ACCESS_KEY_ID + remoteRef: + key: minio/credentials + property: access_key + - secretKey: AWS_SECRET_ACCESS_KEY + remoteRef: + key: minio/credentials + property: secret_key + # Add more variables as needed +``` + +#### Option 3: Create Kubernetes Secret Directly + +```bash +kubectl create secret generic airflow-env-secret -n datastack \ + --from-literal=AWS_ACCESS_KEY_ID="your-access-key" \ + --from-literal=AWS_SECRET_ACCESS_KEY="your-secret-key" \ + --from-literal=AWS_ENDPOINT_URL="http://minio.minio.svc.cluster.local:9000" \ + --from-literal=POSTGRES_URL="postgresql://user:pass@postgres-cluster-rw.postgres:5432" +``` + +After creating the environment secrets, redeploy Airflow to pick up the new configuration. + +### Manual Execution + +The DAG is configured for manual execution only (`schedule_interval=None`). To run: + +1. Go to Airflow Web UI +2. Find the `csv_to_postgres` DAG +3. Click "Trigger DAG" to start execution + +## Example DAGs + +### CSV to PostgreSQL DAG + +The `csv_to_postgres_dag.py` demonstrates a complete ETL pipeline that loads data from MinIO object storage into PostgreSQL using dlt (data load tool). + +#### Dataset Information + +##### MovieLens 20M Dataset + +This DAG processes the [MovieLens 20M dataset](https://grouplens.org/datasets/movielens/20m/) from GroupLens Research. 
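If the dataset still needs to be staged in MinIO, it can be downloaded from GroupLens and uploaded with the `mc` client. The following is a minimal sketch, not part of the DAG: the download URL, the `buun` alias, the endpoint, and the `movie-lens` bucket name are assumptions and should be adjusted to your environment (the same names are used in the MinIO Storage Structure section below).

```bash
# Download and unpack the MovieLens 20M archive (roughly 190 MB compressed)
curl -LO https://files.grouplens.org/datasets/movielens/ml-20m.zip
unzip ml-20m.zip

# Create the bucket and upload the CSV files (alias, endpoint, and bucket names are assumptions)
mc alias set buun https://minio.your-domain.com access-key secret-key
mc mb buun/movie-lens
mc cp ml-20m/*.csv buun/movie-lens/
```
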
The dataset contains:

- **27,278 movies** with metadata
- **20 million ratings** from 138,493 users
- **465,564 tags** applied by users
- Additional genome data for content-based filtering

##### MinIO Storage Structure

The dataset files are stored in MinIO under the `movie-lens` bucket:

```bash
mc alias set buun https://minio.your-domain.com access-key secret-key
mc ls buun/movie-lens

[2025-09-14 12:13:09 JST] 309MiB STANDARD genome-scores.csv
[2025-09-14 12:12:37 JST] 18KiB STANDARD genome-tags.csv
[2025-09-14 12:12:38 JST] 557KiB STANDARD links.csv
[2025-09-14 12:12:38 JST] 1.3MiB STANDARD movies.csv
[2025-09-14 12:13:15 JST] 509MiB STANDARD ratings.csv
[2025-09-14 12:12:42 JST] 16MiB STANDARD tags.csv
```

The DAG currently processes:

- **movies.csv** (1.3MiB) - Movie metadata
- **tags.csv** (16MiB) - User-generated tags
- **ratings.csv** (509MiB) - User ratings (available but currently disabled in the DAG)

#### DAG Features

##### Tables Processed

- **movies**: MovieLens movies data with primary key `movieId`
- **ratings**: User ratings with composite primary key `[userId, movieId]`
- **tags**: User tags with composite primary key `[userId, movieId, timestamp]`
- **summary**: Generates metadata summary of all processed tables

##### Smart Processing

- **Table Existence Check**: Uses the DuckDB PostgreSQL scanner to check if tables already exist
- **Skip Logic**: If a table already contains data, the task will skip processing to avoid reprocessing large files
- **Write Disposition**: Uses `replace` mode for initial loads

##### Dependencies

- `dlt[duckdb,filesystem,postgres,s3]>=1.12.1`
- duckdb (for table existence checking)
- Standard Airflow libraries

## Debugging and Troubleshooting

### Debug Commands

The Airflow justfile provides several debugging recipes:

#### DAG Import and Processing Logs

```bash
# Check DAG import errors from processor logs
just airflow::logs-dag-errors

# Check DAG import errors for a specific file
just airflow::logs-dag-errors csv_to_postgres_dag.py

# Test DAG file import manually
just airflow::logs-test-import csv_to_postgres_dag.py

# Monitor DAG processing in real-time
just airflow::logs-dag-processor
```

#### Worker and Task Logs

```bash
# View worker logs (where tasks execute)
just airflow::logs-worker

# View scheduler logs
just airflow::logs-scheduler

# View API server logs (Airflow 3.0)
just airflow::logs-api-server

# View all Airflow component logs
just airflow::logs-all
```

#### Specific Component Debugging

```bash
# Check logs for a specific pod and container
kubectl logs -n datastack <pod-name> -c <container-name>

# Examples:
kubectl logs -n datastack airflow-worker-0 -c worker --tail=100
kubectl logs -n datastack airflow-scheduler-xxx -c scheduler --tail=100
kubectl logs -n datastack airflow-dag-processor-xxx -c dag-processor --tail=100
```

### Common Issues

#### DAG Not Appearing

**Symptoms**: DAG file uploaded but not visible in the Airflow UI

**Debugging Steps**:

1. Check DAG processor logs:

   ```bash
   just airflow::logs-dag-errors
   ```

2. Test DAG import manually:

   ```bash
   just airflow::logs-test-import your-dag-file.py
   ```

3. Verify file location and permissions:

   ```bash
   kubectl exec -n datastack airflow-dag-processor-xxx -c dag-processor -- ls -la /opt/airflow/dags/
   ```
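As a cross-check independent of the justfile recipes, the Airflow CLI can list registered DAGs and recorded import errors. This is a minimal sketch: the pod-name placeholder follows the examples above, and the exact `airflow dags` subcommands available can vary between Airflow versions.

```bash
# Confirm the DAG is registered with the scheduler
kubectl exec -n datastack airflow-scheduler-xxx -c scheduler -- airflow dags list

# Show any import errors recorded for files under /opt/airflow/dags
kubectl exec -n datastack airflow-scheduler-xxx -c scheduler -- airflow dags list-import-errors
```
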
**Common Causes**:

- Python syntax errors in the DAG file
- Missing Python package imports
- Duplicate DAG IDs
- File permission issues

#### Task Execution Failures

**Symptoms**: DAG appears but tasks fail during execution

**Debugging Steps**:

1. Check worker logs for the specific task:

   ```bash
   just airflow::logs-worker | grep -A 10 -B 10 "task_id"
   ```

2. Check environment variables in the worker:

   ```bash
   kubectl exec -n datastack airflow-worker-0 -c worker -- env | grep -E "(AWS|POSTGRES)"
   ```

3. Test connectivity from the worker:

   ```bash
   # Test MinIO connectivity
   kubectl exec -n datastack airflow-worker-0 -c worker -- ping minio.minio.svc.cluster.local

   # Test PostgreSQL connectivity
   kubectl exec -n datastack airflow-worker-0 -c worker -- nc -zv postgres-cluster-rw.postgres 5432
   ```

#### Environment Variables Issues

**Symptoms**: Tasks fail with authentication or connection errors

**Debugging Steps**:

1. Verify the secret exists and contains data:

   ```bash
   kubectl describe secret airflow-env-secret -n datastack
   ```

2. Check if the ExternalSecret is syncing (if using External Secrets):

   ```bash
   kubectl get externalsecret airflow-env-external-secret -n datastack
   kubectl describe externalsecret airflow-env-external-secret -n datastack
   ```

3. Verify environment variables are loaded in the pods:

   ```bash
   kubectl exec -n datastack airflow-worker-0 -c worker -- printenv | grep -E "(AWS|POSTGRES|DLT)"
   ```

#### Authentication and Permissions

**Symptoms**: 403 Forbidden errors when triggering DAGs

**Debugging Steps**:

1. Check user roles in Airflow:

   ```bash
   kubectl exec -n datastack airflow-scheduler-xxx -c scheduler -- airflow users list
   ```

2. Assign the proper role if needed:

   ```bash
   just airflow::assign-role airflow_user
   ```

3. Check Keycloak client roles:
   - Ensure the user has the appropriate Keycloak client role
   - Re-login to Airflow to sync roles

#### Package Installation Issues

**Symptoms**: Import errors for packages like `dlt`, `duckdb`

**Debugging Steps**:

1. Check if the packages are installed correctly:

   ```bash
   kubectl exec -n datastack airflow-worker-0 -c worker -- pip list | grep -E "(dlt|duckdb)"
   ```

2. Verify the init container logs:

   ```bash
   kubectl logs -n datastack airflow-worker-0 -c install-packages
   ```

3. Check the PYTHONPATH configuration:

   ```bash
   kubectl exec -n datastack airflow-worker-0 -c worker -- printenv PYTHONPATH
   ```

### Connection Testing

#### MinIO Connectivity

```bash
# Test MinIO access from the worker
kubectl exec -n datastack airflow-worker-0 -c worker -- python3 -c "
import boto3
import os
client = boto3.client('s3',
    endpoint_url=os.getenv('AWS_ENDPOINT_URL'),
    aws_access_key_id=os.getenv('AWS_ACCESS_KEY_ID'),
    aws_secret_access_key=os.getenv('AWS_SECRET_ACCESS_KEY')
)
print('Buckets:', [b['Name'] for b in client.list_buckets()['Buckets']])
"
```

### Log Analysis Tips

1. **Filter logs by timestamp**:

   ```bash
   kubectl logs -n datastack airflow-worker-0 -c worker --since=10m
   ```

2. **Search for specific errors**:

   ```bash
   just airflow::logs-worker | grep -i "error\|exception\|failed"
   ```

3. **Monitor logs in real-time**:

   ```bash
   kubectl logs -n datastack airflow-worker-0 -c worker -f
   ```

4. 
**Check resource usage**: + + ```bash + kubectl top pods -n datastack | grep airflow + ``` diff --git a/airflow/examples/README.md b/airflow/examples/README.md deleted file mode 100644 index 8f80a79..0000000 --- a/airflow/examples/README.md +++ /dev/null @@ -1,190 +0,0 @@ -# CSV to PostgreSQL Airflow DAG Deployment - -## Overview - -This document describes how to deploy the `csv_to_postgres_dag.py` to Airflow using JupyterHub interface. The DAG processes MovieLens dataset files stored in MinIO and loads them into PostgreSQL. - -## Dataset Information - -### MovieLens 20M Dataset - -This DAG processes the [MovieLens 20M dataset](https://grouplens.org/datasets/movielens/20m/) from GroupLens Research. The dataset contains: - -- **27,278 movies** with metadata -- **20 million ratings** from 138,493 users -- **465,564 tags** applied by users -- Additional genome data for content-based filtering - -### MinIO Storage Structure - -The dataset files are stored in MinIO under the `movie-lens` bucket: - -```bash -mc alias set buun https://minio.your-domain.com access-key secret-key -mc ls buun/movie-lens - -[2025-09-14 12:13:09 JST] 309MiB STANDARD genome-scores.csv -[2025-09-14 12:12:37 JST] 18KiB STANDARD genome-tags.csv -[2025-09-14 12:12:38 JST] 557KiB STANDARD links.csv -[2025-09-14 12:12:38 JST] 1.3MiB STANDARD movies.csv -[2025-09-14 12:13:15 JST] 509MiB STANDARD ratings.csv -[2025-09-14 12:12:42 JST] 16MiB STANDARD tags.csv -``` - -The DAG currently processes: - -- **movies.csv** (1.3MiB) - Movie metadata -- **tags.csv** (16MiB) - User-generated tags -- **ratings.csv** (509MiB) - User ratings (available but currently disabled in DAG) - -## Deployment Steps - -### 1. Access JupyterHub - -- Navigate to your JupyterHub instance (e.g., `https://jupyter.buun.dev`) -- Login with your credentials - -### 2. Navigate to Airflow DAGs Directory - -In JupyterHub, the Airflow DAGs directory is mounted at: - -``` -/home/jovyan/airflow-dags/ -``` - -### 3. Upload the DAG File - -1. Open JupyterHub file browser -2. Navigate to `/home/jovyan/airflow-dags/` -3. Upload or copy `csv_to_postgres_dag.py` to this directory - -### 4. Verify Deployment - -1. Access Airflow Web UI (e.g., `https://airflow.buun.dev`) -2. Check that the DAG `csv_to_postgres` appears in the DAGs list -3. If the DAG doesn't appear immediately, wait 1-2 minutes for Airflow to detect the new file - -## DAG Features - -### Tables Processed - -- **movies**: MovieLens movies data with primary key `movieId` -- **ratings**: User ratings with composite primary key `[userId, movieId]` -- **tags**: User tags with composite primary key `[userId, movieId, timestamp]` -- **summary**: Generates metadata summary of all processed tables - -### Smart Processing - -- **Table Existence Check**: Uses DuckDB PostgreSQL scanner to check if tables already exist -- **Skip Logic**: If a table already contains data, the task will skip processing to avoid reprocessing large files -- **Write Disposition**: Uses `replace` mode for initial loads - -### Environment Variables Required - -The DAG expects the following environment variables to be set: - -- `POSTGRES_URL`: PostgreSQL connection string (format: `postgresql://user:password@host:port/database`) -- `AWS_ACCESS_KEY_ID`: MinIO/S3 access key -- `AWS_SECRET_ACCESS_KEY`: MinIO/S3 secret key -- `AWS_ENDPOINT_URL`: MinIO endpoint URL -- Additional dlt-specific environment variables for advanced configuration - -### Environment Variables Setup - -Environment variables are provided to Airflow through Kubernetes Secrets. 
You have several options: - -#### Option 1: Customize the Example Template - -1. Create the example environment secrets template: - - ```bash - just airflow::create-env-secrets-example - ``` - -2. **Important**: This creates a template with sample values. You must customize it: - - If using **External Secrets**: Edit `airflow-env-external-secret.gomplate.yaml` to reference your actual Vault paths - - If using **Direct Secrets**: Update the created `airflow-env-secret` with your actual credentials - -#### Option 2: Create ExternalSecret Manually - -Create an ExternalSecret that references your Vault credentials: - -```yaml -apiVersion: external-secrets.io/v1 -kind: ExternalSecret -metadata: - name: airflow-env-external-secret - namespace: datastack -spec: - refreshInterval: 1h - secretStoreRef: - name: vault-secret-store - kind: ClusterSecretStore - target: - name: airflow-env-secret - data: - - secretKey: AWS_ACCESS_KEY_ID - remoteRef: - key: minio/credentials - property: access_key - - secretKey: AWS_SECRET_ACCESS_KEY - remoteRef: - key: minio/credentials - property: secret_key - # Add more variables as needed -``` - -#### Option 3: Create Kubernetes Secret Directly - -```bash -kubectl create secret generic airflow-env-secret -n datastack \ - --from-literal=AWS_ACCESS_KEY_ID="your-access-key" \ - --from-literal=AWS_SECRET_ACCESS_KEY="your-secret-key" \ - --from-literal=AWS_ENDPOINT_URL="http://minio.minio.svc.cluster.local:9000" \ - --from-literal=POSTGRES_URL="postgresql://user:pass@postgres-cluster-rw.postgres:5432" -``` - -After creating the environment secrets, redeploy Airflow to pick up the new configuration. - -### Manual Execution - -The DAG is configured for manual execution only (`schedule_interval=None`). To run: - -1. Go to Airflow Web UI -2. Find the `csv_to_postgres` DAG -3. Click "Trigger DAG" to start execution - -## Dependencies - -- dlt[duckdb,filesystem,postgres,s3]>=1.12.1 -- duckdb (for table existence checking) -- Standard Airflow libraries - -## Troubleshooting - -### DAG Not Appearing - -- Check file permissions in `/home/jovyan/airflow-dags/` -- Verify the Python syntax is correct -- Check Airflow logs for import errors - -### Environment Variables - -- Ensure the `airflow-env-secret` Kubernetes Secret exists in the datastack namespace -- Verify secret contains all required environment variables: - - ```bash - kubectl describe secret airflow-env-secret -n datastack - ``` - -- If using External Secrets, check that the ExternalSecret is syncing properly: - - ```bash - kubectl get externalsecret airflow-env-external-secret -n datastack - ``` - -### Connection Issues - -- Verify MinIO and PostgreSQL connectivity from Airflow workers -- Check that the `movielens_af` database exists in PostgreSQL -- Ensure MinIO bucket `movie-lens` is accessible with proper credentials