diff --git a/CLAUDE.md b/CLAUDE.md index 5e9ff26..588b0a3 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -159,6 +159,7 @@ install: ``` ServiceMonitor template (`servicemonitor.gomplate.yaml`): + ```yaml {{- if eq .Env.MONITORING_ENABLED "true" }} apiVersion: monitoring.coreos.com/v1 @@ -366,3 +367,36 @@ receiving - Only write code comments when necessary, as the code should be self-explanatory (Avoid trivial comment for each code block) - Write output messages and code comments in English + +### Markdown Style + +When writing Markdown documentation: + +1. **NEVER use ordered lists as section headers**: + - Ordered lists indent content and are not suitable for headings + - Use proper heading levels (####) instead of numbered lists for section titles + + ```markdown + + 1. **Setup Instructions:** + + Details here... + + 2. **Next Step:** + + More details... + + + #### Setup Instructions + + Details here... + + #### Next Step + + More details... + ``` + +2. **Always validate with markdownlint-cli2**: + - Run `markdownlint-cli2 ` before committing any Markdown files + - Fix all linting errors to ensure consistent formatting + - Pay attention to code block language specifications (MD040) and list formatting (MD029) diff --git a/airflow/README.md b/airflow/README.md index bdaddf8..781245d 100644 --- a/airflow/README.md +++ b/airflow/README.md @@ -46,7 +46,7 @@ This document covers Airflow installation, deployment, and debugging in the buun **Note**: New users have only Viewer access by default and cannot execute DAGs without role assignment. 4. **Access Airflow Web UI**: - - Navigate to your Airflow instance (e.g., `https://airflow.buun.dev`) + - Navigate to your Airflow instance (e.g., `https://airflow.yourdomain.com`) - Login with your Keycloak credentials ### Uninstalling @@ -63,7 +63,7 @@ just airflow::uninstall true ### 1. Access JupyterHub -- Navigate to your JupyterHub instance (e.g., `https://jupyter.buun.dev`) +- Navigate to your JupyterHub instance (e.g., `https://jupyter.yourdomain.com`) - Login with your credentials ### 2. Navigate to Airflow DAGs Directory @@ -82,7 +82,7 @@ In JupyterHub, the Airflow DAGs directory is mounted at: ### 4. Verify Deployment -1. Access Airflow Web UI (e.g., `https://airflow.buun.dev`) +1. Access Airflow Web UI (e.g., `https://airflow.yourdomain.com`) 2. Check that the DAG `csv_to_postgres` appears in the DAGs list 3. If the DAG doesn't appear immediately, wait 1-2 minutes for Airflow to detect the new file diff --git a/dagster/README.md b/dagster/README.md index 2904d14..8956168 100644 --- a/dagster/README.md +++ b/dagster/README.md @@ -28,7 +28,7 @@ This document covers Dagster installation, deployment, and debugging in the buun ``` 3. **Access Dagster Web UI**: - - Navigate to your Dagster instance (e.g., `https://dagster.buun.dev`) + - Navigate to your Dagster instance (e.g., `https://dagster.yourdomain.com`) - Login with your Keycloak credentials ### Uninstalling diff --git a/docs/jupyterhub.md b/docs/jupyterhub.md index ab25f5c..a8e6135 100644 --- a/docs/jupyterhub.md +++ b/docs/jupyterhub.md @@ -1,577 +1,5 @@ -# JupyterHub +# JupyterHub Documentation -JupyterHub provides a multi-user Jupyter notebook environment with Keycloak OIDC authentication, Vault integration for secure secrets management, and custom kernel images for data science workflows. +This documentation has been moved to [jupyterhub/README.md](../jupyterhub/README.md). 
-## Installation - -Install JupyterHub with interactive configuration: - -```bash -just jupyterhub::install -``` - -This will prompt for: - -- JupyterHub host (FQDN) -- NFS PV usage (if Longhorn is installed) -- NFS server details (if NFS is enabled) -- Vault integration setup (requires root token for initial setup) - -### Prerequisites - -- Keycloak must be installed and configured -- For NFS storage: Longhorn must be installed -- For Vault integration: Vault and External Secrets Operator must be installed -- Helm repository must be accessible - -## Kernel Images - -### Important Note - -Building and using custom buun-stack images requires building the `buunstack` Python package first. The package wheel file will be included in the Docker image during build. - -JupyterHub supports multiple kernel image profiles: - -### Standard Profiles - -- **minimal**: Basic Python environment -- **base**: Python with common data science packages -- **datascience**: Full data science stack (default) -- **pyspark**: PySpark for big data processing -- **pytorch**: PyTorch for machine learning -- **tensorflow**: TensorFlow for machine learning - -### Buun-Stack Profiles - -- **buun-stack**: Comprehensive data science environment with Vault integration -- **buun-stack-cuda**: CUDA-enabled version with GPU support - -## Profile Configuration - -Enable/disable profiles using environment variables: - -```bash -# Enable buun-stack profile (CPU version) -JUPYTER_PROFILE_BUUN_STACK_ENABLED=true - -# Enable buun-stack CUDA profile (GPU version) -JUPYTER_PROFILE_BUUN_STACK_CUDA_ENABLED=true - -# Disable default datascience profile -JUPYTER_PROFILE_DATASCIENCE_ENABLED=false -``` - -Available profile variables: - -- `JUPYTER_PROFILE_MINIMAL_ENABLED` -- `JUPYTER_PROFILE_BASE_ENABLED` -- `JUPYTER_PROFILE_DATASCIENCE_ENABLED` -- `JUPYTER_PROFILE_PYSPARK_ENABLED` -- `JUPYTER_PROFILE_PYTORCH_ENABLED` -- `JUPYTER_PROFILE_TENSORFLOW_ENABLED` -- `JUPYTER_PROFILE_BUUN_STACK_ENABLED` -- `JUPYTER_PROFILE_BUUN_STACK_CUDA_ENABLED` - -Only `JUPYTER_PROFILE_DATASCIENCE_ENABLED` is true by default. - -## Buun-Stack Images - -Buun-stack images provide comprehensive data science environments with: - -- All standard data science packages (NumPy, Pandas, Scikit-learn, etc.) -- Deep learning frameworks (PyTorch, TensorFlow, Keras) -- Big data tools (PySpark, Apache Arrow) -- NLP and ML libraries (LangChain, Transformers, spaCy) -- Database connectors and tools -- **Vault integration** with `buunstack` Python package - -### Building Custom Images - -Build and push buun-stack images to your registry: - -```bash -# Build images (includes building the buunstack Python package) -just jupyterhub::build-kernel-images - -# Push to registry -just jupyterhub::push-kernel-images -``` - -The build process: - -1. Builds the `buunstack` Python package wheel -2. Copies the wheel into the Docker build context -3. Installs the wheel in the Docker image -4. Cleans up temporary files - -⚠️ **Note**: Buun-stack images are comprehensive and large (~13GB). Initial image pulls and deployments take significant time due to the extensive package set. - -### Image Configuration - -Configure image settings in `.env.local`: - -```bash -# Image registry -IMAGE_REGISTRY=localhost:30500 - -# Image tag (current default) -JUPYTER_PYTHON_KERNEL_TAG=python-3.12-28 -``` - -## Vault Integration - -### Overview - -Vault integration enables secure secrets management directly from Jupyter notebooks. 
The system uses: - -- **ExternalSecret** to fetch the admin token from Vault -- **Renewable tokens** with unlimited Max TTL to avoid 30-day system limitations -- **Token renewal script** that automatically renews tokens at TTL/2 intervals (minimum 30 seconds) -- **User-specific tokens** created during notebook spawn with isolated access - -### Architecture - -```plain -┌────────────────────────────────────────────────────────────────┐ -│ JupyterHub Hub Pod │ -│ │ -│ ┌──────────────┐ ┌────────────────┐ ┌────────────────────┐ │ -│ │ Hub │ │ Token Renewer │ │ ExternalSecret │ │ -│ │ Container │◄─┤ Sidecar │◄─┤ (mounted as │ │ -│ │ │ │ │ │ Secret) │ │ -│ └──────────────┘ └────────────────┘ └────────────────────┘ │ -│ │ │ ▲ │ -│ │ │ │ │ -│ ▼ ▼ │ │ -│ ┌──────────────────────────────────┐ │ │ -│ │ /vault/secrets/vault-token │ │ │ -│ │ (Admin token for user creation) │ │ │ -│ └──────────────────────────────────┘ │ │ -└────────────────────────────────────────────────────┼───────────┘ - │ - ┌───────────▼──────────┐ - │ Vault │ - │ secret/jupyterhub/ │ - │ vault-token │ - └──────────────────────┘ -``` - -### Prerequisites - -Vault integration requires: - -- Vault server installed and configured -- External Secrets Operator installed -- ClusterSecretStore configured for Vault -- Buun-stack kernel images (standard images don't include Vault integration) - -### Setup - -Vault integration is configured during JupyterHub installation: - -```bash -just jupyterhub::install -# Answer "yes" when prompted about Vault integration -# Provide Vault root token when prompted -``` - -The setup process: - -1. Creates `jupyterhub-admin` policy with necessary permissions including `sudo` for orphan token creation -2. Creates renewable admin token with 24h TTL and unlimited Max TTL -3. Stores token in Vault at `secret/jupyterhub/vault-token` -4. Creates ExternalSecret to fetch token from Vault -5. Deploys token renewal sidecar for automatic renewal - -### Usage in Notebooks - -With Vault integration enabled, use the `buunstack` package in notebooks: - -```python -from buunstack import SecretStore - -# Initialize (uses pre-acquired user-specific token) -secrets = SecretStore() - -# Store secrets -secrets.put('api-keys', - openai='sk-...', - github='ghp_...', - database_url='postgresql://...') - -# Retrieve secrets -api_keys = secrets.get('api-keys') -openai_key = secrets.get('api-keys', field='openai') - -# List all secrets -secret_names = secrets.list() - -# Delete secrets or specific fields -secrets.delete('old-api-key') # Delete entire secret -secrets.delete('api-keys', field='github') # Delete only github field -``` - -### Security Features - -- **User isolation**: Each user receives an orphan token with access only to their namespace -- **Automatic renewal**: Token renewal script renews admin token at TTL/2 intervals (minimum 30 seconds) -- **ExternalSecret integration**: Admin token fetched securely from Vault -- **Orphan tokens**: User tokens are orphan tokens, not limited by parent policy restrictions -- **Audit trail**: All secret access is logged in Vault - -### Token Management - -#### Admin Token - -The admin token is managed through: - -1. **Creation**: `just jupyterhub::create-jupyterhub-vault-token` creates renewable token -2. **Storage**: Stored in Vault at `secret/jupyterhub/vault-token` -3. **Retrieval**: ExternalSecret fetches and mounts as Kubernetes Secret -4. **Renewal**: `vault-token-renewer.sh` script renews at TTL/2 intervals - -#### User Tokens - -User tokens are created dynamically: - -1. 
**Pre-spawn hook** reads admin token from `/vault/secrets/vault-token` -2. **Creates user policy** `jupyter-user-{username}` with restricted access -3. **Creates orphan token** with user policy (requires `sudo` permission) -4. **Sets environment variable** `NOTEBOOK_VAULT_TOKEN` in notebook container - -## Token Renewal Implementation - -### Admin Token Renewal - -The admin token renewal is handled by a sidecar container (`vault-token-renewer`) running alongside the JupyterHub hub: - -**Implementation Details:** - -1. **Renewal Script**: `/vault/config/vault-token-renewer.sh` - - Runs in the `vault-token-renewer` sidecar container - - Uses Vault 1.17.5 image with HashiCorp Vault CLI - -2. **Environment-Based TTL Configuration**: - - ```bash - # Reads TTL from environment variable (set in .env.local) - TTL_RAW="${JUPYTERHUB_VAULT_TOKEN_TTL}" # e.g., "5m", "24h" - - # Converts to seconds and calculates renewal interval - RENEWAL_INTERVAL=$((TTL_SECONDS / 2)) # TTL/2 with minimum 30s - ``` - -3. **Token Source**: ExternalSecret → Kubernetes Secret → mounted file - - ```bash - # Token retrieved from ExternalSecret-managed mount - ADMIN_TOKEN=$(cat /vault/admin-token/token) - ``` - -4. **Renewal Loop**: - - ```bash - while true; do - vault token renew >/dev/null 2>&1 - sleep $RENEWAL_INTERVAL - done - ``` - -5. **Error Handling**: If renewal fails, re-retrieves token from ExternalSecret mount - -**Key Files:** - -- `vault-token-renewer.sh`: Main renewal script -- `jupyterhub-vault-token-external-secret.gomplate.yaml`: ExternalSecret configuration -- `vault-token-renewer-config` ConfigMap: Contains the renewal script - -### User Token Renewal - -User token renewal is handled within the notebook environment by the `buunstack` Python package: - -**Implementation Details:** - -1. **Token Source**: Environment variable set by pre-spawn hook - - ```python - # In pre_spawn_hook.gomplate.py - spawner.environment["NOTEBOOK_VAULT_TOKEN"] = user_vault_token - ``` - -2. **Automatic Renewal**: Built into `SecretStore` class operations - - ```python - # In buunstack/secrets.py - def _ensure_authenticated(self): - token_info = self.client.auth.token.lookup_self() - ttl = token_info.get("data", {}).get("ttl", 0) - renewable = token_info.get("data", {}).get("renewable", False) - - # Renew if TTL < 10 minutes and renewable - if renewable and ttl > 0 and ttl < 600: - self.client.auth.token.renew_self() - ``` - -3. **Renewal Trigger**: Every `SecretStore` operation (get, put, delete, list) - - Checks token validity before operation - - Automatically renews if TTL < 10 minutes - - Transparent to user code - -4. **Token Configuration** (set during creation): - - **TTL**: `NOTEBOOK_VAULT_TOKEN_TTL` (default: 24h = 1 day) - - **Max TTL**: `NOTEBOOK_VAULT_TOKEN_MAX_TTL` (default: 168h = 7 days) - - **Policy**: User-specific `jupyter-user-{username}` - - **Type**: Orphan token (independent of parent token lifecycle) - -5. 
**Expiry Handling**: When token reaches Max TTL: - - Cannot be renewed further - - User must restart notebook server (triggers new token creation) - - Prevented by `JUPYTERHUB_CULL_MAX_AGE` setting (6 days < 7 day Max TTL) - -**Key Files:** - -- `pre_spawn_hook.gomplate.py`: User token creation logic -- `buunstack/secrets.py`: Token renewal implementation -- `user_policy.hcl`: User token permissions template - -### Token Lifecycle Summary - -``` -┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐ -│ Admin Token │ │ User Token │ │ Pod Lifecycle │ -│ │ │ │ │ │ -│ Created: Manual │ │ Created: Spawn │ │ Max Age: 7 days │ -│ TTL: 5m-24h │ │ TTL: 1 day │ │ Auto-restart │ -│ Max TTL: ∞ │ │ Max TTL: 7 days │ │ at Max TTL │ -│ Renewal: Auto │ │ Renewal: Auto │ │ │ -│ Interval: TTL/2 │ │ Trigger: Usage │ │ │ -└─────────────────┘ └──────────────────┘ └─────────────────┘ - │ │ │ - ▼ ▼ ▼ - vault-token-renewer buunstack.py cull.maxAge - sidecar SecretStore pod restart -``` - -## Storage Options - -### Default Storage - -Uses Kubernetes PersistentVolumes for user home directories. - -### NFS Storage - -For shared storage across nodes, configure NFS: - -```bash -JUPYTERHUB_NFS_PV_ENABLED=true -JUPYTER_NFS_IP=192.168.10.1 -JUPYTER_NFS_PATH=/volume1/drive1/jupyter -``` - -NFS storage requires: - -- Longhorn storage system installed -- NFS server accessible from cluster nodes -- Proper NFS export permissions configured - -## Configuration - -### Environment Variables - -Key configuration variables: - -```bash -# Basic settings -JUPYTERHUB_NAMESPACE=jupyter -JUPYTERHUB_CHART_VERSION=4.2.0 -JUPYTERHUB_OIDC_CLIENT_ID=jupyterhub - -# Keycloak integration -KEYCLOAK_REALM=buunstack - -# Storage -JUPYTERHUB_NFS_PV_ENABLED=false - -# Vault integration -JUPYTERHUB_VAULT_INTEGRATION_ENABLED=false -VAULT_ADDR=https://vault.example.com - -# Image settings -JUPYTER_PYTHON_KERNEL_TAG=python-3.12-28 -IMAGE_REGISTRY=localhost:30500 - -# Vault token TTL settings -JUPYTERHUB_VAULT_TOKEN_TTL=24h # Admin token: renewed at TTL/2 intervals -NOTEBOOK_VAULT_TOKEN_TTL=24h # User token: 1 day (renewed on usage) -NOTEBOOK_VAULT_TOKEN_MAX_TTL=168h # User token: 7 days max - -# Server pod lifecycle settings -JUPYTERHUB_CULL_MAX_AGE=604800 # Max pod age in seconds (7 days = 604800s) - # Should be <= NOTEBOOK_VAULT_TOKEN_MAX_TTL - -# Logging -JUPYTER_BUUNSTACK_LOG_LEVEL=warning # Options: debug, info, warning, error -``` - -### Advanced Configuration - -Customize JupyterHub behavior by editing `jupyterhub-values.gomplate.yaml` template before installation. 
- -## Management - -### Uninstall - -```bash -just jupyterhub::uninstall -``` - -This removes: - -- JupyterHub deployment -- User pods -- PVCs -- ExternalSecret - -### Update - -Upgrade to newer versions: - -```bash -# Update image tag in .env.local -export JUPYTER_PYTHON_KERNEL_TAG=python-3.12-29 - -# Rebuild and push images -just jupyterhub::build-kernel-images -just jupyterhub::push-kernel-images - -# Upgrade JupyterHub deployment -just jupyterhub::install -``` - -### Manual Token Refresh - -If needed, manually refresh the admin token: - -```bash -# Create new renewable token -just jupyterhub::create-jupyterhub-vault-token - -# Restart JupyterHub to pick up new token -kubectl rollout restart deployment/hub -n jupyter -``` - -## Troubleshooting - -### Image Pull Issues - -Buun-stack images are large and may timeout: - -```bash -# Check pod status -kubectl get pods -n jupyter - -# Check image pull progress -kubectl describe pod -n jupyter - -# Increase timeout if needed -helm upgrade jupyterhub jupyterhub/jupyterhub --timeout=30m -f jupyterhub-values.yaml -``` - -### Vault Integration Issues - -Check token and authentication: - -```bash -# Check ExternalSecret status -kubectl get externalsecret -n jupyter jupyterhub-vault-token - -# Check if Secret was created -kubectl get secret -n jupyter jupyterhub-vault-token - -# Check token renewal logs -kubectl logs -n jupyter -l app.kubernetes.io/component=hub -c vault-token-renewer - -# In a notebook, verify environment -%env NOTEBOOK_VAULT_TOKEN -``` - -Common issues: - -1. **"child policies must be subset of parent"**: Admin policy needs `sudo` permission for orphan tokens -2. **Token not found**: Check ExternalSecret and ClusterSecretStore configuration -3. **Permission denied**: Verify `jupyterhub-admin` policy has all required permissions - -### Authentication Issues - -Verify Keycloak client configuration: - -```bash -# Check client exists -just keycloak::get-client buunstack jupyterhub - -# Check redirect URIs -just keycloak::update-client buunstack jupyterhub \ - "https://your-jupyter-host/hub/oauth_callback" -``` - -## Technical Implementation Details - -### Helm Chart Version - -JupyterHub uses the official Zero to JupyterHub (Z2JH) Helm chart: - -- Chart: `jupyterhub/jupyterhub` -- Version: `4.2.0` (configurable via `JUPYTERHUB_CHART_VERSION`) -- Documentation: https://z2jh.jupyter.org/ - -### Token System Architecture - -The system uses a three-tier token approach: - -1. **Renewable Admin Token**: - - Created with `explicit-max-ttl=0` (unlimited Max TTL) - - Renewed automatically at TTL/2 intervals (minimum 30 seconds) - - Stored in Vault and fetched via ExternalSecret - -2. **Orphan User Tokens**: - - Created with `create_orphan()` API call - - Not limited by parent token policies - - Individual TTL and Max TTL settings - -3. 
**Token Renewal Script**: - - Runs as sidecar container - - Reads token from ExternalSecret mount - - Handles renewal and re-retrieval on failure - -### Key Files - -- `jupyterhub-admin-policy.hcl`: Vault policy with admin permissions -- `user_policy.hcl`: Template for user-specific policies -- `vault-token-renewer.sh`: Token renewal script -- `jupyterhub-vault-token-external-secret.gomplate.yaml`: ExternalSecret configuration - -## Performance Considerations - -- **Image Size**: Buun-stack images are ~13GB, plan storage accordingly -- **Pull Time**: Initial pulls take 5-15 minutes depending on network -- **Resource Usage**: Data science workloads require adequate CPU/memory -- **Token Renewal**: Minimal overhead (renewal at TTL/2 intervals) - -For production deployments, consider: - -- Pre-pulling images to all nodes -- Using faster storage backends -- Configuring resource limits per user -- Setting up monitoring and alerts - -## Known Limitations - -1. **Annual Token Recreation**: While tokens have unlimited Max TTL, best practice suggests recreating them annually - -2. **Token Expiry and Pod Lifecycle**: User tokens have a TTL of 1 day (`NOTEBOOK_VAULT_TOKEN_TTL=24h`) and maximum TTL of 7 days (`NOTEBOOK_VAULT_TOKEN_MAX_TTL=168h`). Daily usage extends the token for another day, allowing up to 7 days of continuous use. Server pods are automatically restarted after 7 days (`JUPYTERHUB_CULL_MAX_AGE=604800s`) to refresh tokens. - -3. **Cull Settings**: Server idle timeout is set to 2 hours by default. Adjust `cull.timeout` and `cull.every` in the Helm values for different requirements - -4. **NFS Storage**: When using NFS storage, ensure proper permissions are set on the NFS server. The default `JUPYTER_FSGID` is 100 - -5. **ExternalSecret Dependency**: Requires External Secrets Operator to be installed and configured +Please refer to the new location for complete JupyterHub setup, configuration, and usage documentation. diff --git a/docs/resource-management.md b/docs/resource-management.md new file mode 100644 index 0000000..9eebf0a --- /dev/null +++ b/docs/resource-management.md @@ -0,0 +1,538 @@ +# Resource Management + +This document describes how to configure resource requests and limits for components in the buun-stack. + +## Table of Contents + +- [Overview](#overview) +- [QoS Classes](#qos-classes) +- [Using Goldilocks](#using-goldilocks) +- [Configuring Resources](#configuring-resources) +- [Best Practices](#best-practices) +- [Troubleshooting](#troubleshooting) + +## Overview + +Kubernetes uses resource requests and limits to: + +- **Schedule pods** on nodes with sufficient resources +- **Ensure quality of service** through QoS classes +- **Prevent resource exhaustion** by limiting resource consumption + +All critical components in buun-stack should have resource requests and limits configured.
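To see how this plays out on a running cluster, you can compare what each node has left to offer against what pods have already requested, and check which QoS class Kubernetes assigned to a given pod. A minimal sketch using only `kubectl` (the pod and namespace below are examples from this stack; `kubectl top` assumes a metrics server is available, as elsewhere in this document):

```bash
# Per-node summary of allocatable capacity vs. already-requested CPU/memory;
# the scheduler only places a pod on a node whose remaining headroom covers its requests
kubectl describe nodes | grep -A 8 "Allocated resources"

# QoS class Kubernetes derived from a pod's requests and limits (example pod shown)
kubectl get pod postgres-cluster-1 -n postgres -o jsonpath='{.status.qosClass}'

# Actual current usage, for comparison against the configured requests
kubectl top pod -n postgres
```

The QoS classes referenced above are described next.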
+ +## QoS Classes + +Kubernetes assigns one of three QoS classes to each pod based on its resource configuration: + +### Guaranteed QoS (Highest Priority) + +**Requirements:** + +- Every container must have CPU and memory requests +- Every container must have CPU and memory limits +- Requests and limits must be **equal** for both CPU and memory + +**Characteristics:** + +- Highest priority during resource contention +- Last to be evicted when node runs out of resources +- Predictable performance + +**Example:** + +```yaml +resources: + requests: + cpu: 200m + memory: 1Gi + limits: + cpu: 200m # Same as requests + memory: 1Gi # Same as requests +``` + +**Use for:** Critical data stores (PostgreSQL, Vault) + +### Burstable QoS (Medium Priority) + +**Requirements:** + +- At least one container has requests or limits +- Does not meet Guaranteed QoS criteria +- Typically `requests < limits` + +**Characteristics:** + +- Medium priority during resource contention +- Can burst to limits when resources are available +- More resource-efficient than Guaranteed + +**Example:** + +```yaml +resources: + requests: + cpu: 50m + memory: 128Mi + limits: + cpu: 100m # Can burst up to this + memory: 256Mi # Can burst up to this +``` + +**Use for:** Operators, auxiliary services, variable workloads + +### BestEffort QoS (Lowest Priority) + +**Requirements:** + +- No resource requests or limits configured + +**Characteristics:** + +- Lowest priority during resource contention +- First to be evicted when node runs out of resources +- **Not recommended for production** + +## Using Goldilocks + +Goldilocks uses Vertical Pod Autoscaler (VPA) to recommend resource settings based on actual usage. + +### Setup + +For installation and detailed setup instructions, see: + +- [VPA Installation and Configuration](../vpa/README.md) +- [Goldilocks Installation and Configuration](../goldilocks/README.md) + +Quick start: + +```bash +# Install VPA +just vpa::install + +# Install Goldilocks +just goldilocks::install + +# Enable monitoring for a namespace +just goldilocks::enable-namespace <namespace> +``` + +Access the dashboard at your configured Goldilocks host (e.g., `https://goldilocks.example.com`). + +### Using the Dashboard + +- Navigate to the namespace +- Expand "Containers" section for each workload +- Review both "Guaranteed QoS" and "Burstable QoS" recommendations + +### Limitations + +Goldilocks only monitors **standard Kubernetes workloads** (Deployment, StatefulSet, DaemonSet). It **does not** automatically create VPAs for: + +- Custom Resource Definitions (CRDs) +- Resources managed by operators (e.g., CloudNativePG Cluster) + +For CRDs, use alternative methods: + +- Check actual usage: `kubectl top pod -n <namespace>` +- Use Grafana dashboards: `Kubernetes / Compute Resources / Pod` +- Monitor over time and adjust based on observed patterns + +### Working with Recommendations + +#### For Standard Workloads (Supported by Goldilocks) + +Review Goldilocks recommendations in the dashboard, then configure resources based on your testing status: + +**With load testing:** + +- Use Goldilocks recommended values with minimal headroom (1.5-2x) +
+ +**Without load testing:** + +- Add more headroom to handle unexpected load (3-5x) +- Round to clean values + +**Example:** + +Goldilocks recommendation: 50m CPU, 128Mi Memory + +- With load testing: 100m CPU, 256Mi Memory (2x, rounded) +- Without load testing: 200m CPU, 512Mi Memory (4x, rounded) + +#### For CRDs and Unsupported Workloads + +Use Grafana to check actual resource usage: + +1. **Navigate to Grafana dashboard**: `Kubernetes / Compute Resources / Pod` +2. **Select namespace and pod** +3. **Review usage over 24+ hours** to identify peak values + +Then apply the same approach: + +**With load testing:** + +- Use observed peak values with minimal headroom (1.5-2x) + +**Without load testing:** + +- Add significant headroom (3-5x) for safety + +**Example:** + +Grafana shows peak: 40m CPU, 207Mi Memory + +- With load testing: 100m CPU, 512Mi Memory (2.5x/2.5x, rounded) +- Without load testing: 200m CPU, 1Gi Memory (5x/5x, rounded, Guaranteed QoS) + +## Configuring Resources + +### Helm-Managed Components + +For components installed via Helm, configure resources in the values file. + +#### Example: PostgreSQL Operator (CNPG) + +**File:** `postgres/cnpg-values.yaml` + +```yaml +resources: + requests: + cpu: 50m + memory: 128Mi + limits: + cpu: 100m + memory: 256Mi +``` + +**Apply:** + +```bash +cd postgres +helm upgrade --install cnpg cnpg/cloudnative-pg --version ${CNPG_CHART_VERSION} \ + -n ${CNPG_NAMESPACE} -f cnpg-values.yaml +``` + +#### Example: Vault + +**File:** `vault/vault-values.gomplate.yaml` + +```yaml +server: + resources: + requests: + cpu: 50m + memory: 512Mi + limits: + cpu: 50m + memory: 512Mi + +injector: + resources: + requests: + cpu: 50m + memory: 128Mi + limits: + cpu: 50m + memory: 128Mi + +csi: + enabled: true + agent: + resources: + requests: + cpu: 50m + memory: 128Mi + limits: + cpu: 50m + memory: 128Mi + resources: + requests: + cpu: 50m + memory: 64Mi + limits: + cpu: 50m + memory: 128Mi +``` + +**Apply:** + +```bash +cd vault +gomplate -f vault-values.gomplate.yaml -o vault-values.yaml +helm upgrade vault hashicorp/vault --version ${VAULT_CHART_VERSION} \ + -n vault -f vault-values.yaml +``` + +**Note:** After updating StatefulSet resources, delete the pod to apply changes: + +```bash +kubectl delete pod vault-0 -n vault +# Unseal Vault after restart +kubectl exec -n vault vault-0 -- vault operator unseal +``` + +### CRD-Managed Components + +For components managed by Custom Resource Definitions, patch the CRD directly. + +#### Example: PostgreSQL Cluster (CloudNativePG) + +**Update values file** + +**File:** `postgres/postgres-cluster-values.gomplate.yaml` + +```yaml +cluster: + instances: 1 + + # Resource configuration (Guaranteed QoS) + resources: + requests: + cpu: 200m + memory: 1Gi + limits: + cpu: 200m + memory: 1Gi + + storage: + size: {{ .Env.POSTGRES_STORAGE_SIZE }} +``` + +**Apply via justfile:** + +```bash +just postgres::create-cluster +``` + +**Restart pod to apply changes:** + +```bash +kubectl delete pod postgres-cluster-1 -n postgres +kubectl wait --for=condition=Ready pod/postgres-cluster-1 -n postgres --timeout=180s +``` + +**Data Safety:** PostgreSQL data is stored in PersistentVolumeClaim (PVC) and will be preserved during pod restart. + +### Verification + +After applying resource configurations: + +**1. 
Check resource settings:** + +```bash +# For standard workloads +kubectl get deployment -n -o jsonpath='{.spec.template.spec.containers[0].resources}' | jq + +# For pods +kubectl get pod -n -o jsonpath='{.spec.containers[0].resources}' | jq +``` + +**2. Verify QoS Class:** + +```bash +kubectl get pod -n -o jsonpath='{.status.qosClass}' +``` + +**3. Check actual usage:** + +```bash +kubectl top pod -n +``` + +## Best Practices + +### Choosing QoS Class + +| Component Type | Recommended QoS | Rationale | +|---------------|-----------------|-----------| +| **Data stores** (PostgreSQL, Vault) | Guaranteed | Critical services, data integrity, predictable performance | +| **Operators** (CNPG, etc.) | Burstable | Lightweight controllers, occasional spikes | +| **Auxiliary services** (Injectors, CSI providers) | Burstable | Support services, variable load | + +### Setting Resource Values + +**1. Start with actual usage:** + +```bash +# Check current usage +kubectl top pod -n + +# Check historical usage in Grafana +# Dashboard: Kubernetes / Compute Resources / Pod +``` + +**2. Add appropriate headroom:** + +| Scenario | Recommended Multiplier | Example | +|----------|----------------------|---------| +| Stable, predictable load | 2-3x current usage | Current: 40m → Set: 100m | +| Variable load | 5-10x current usage | Current: 40m → Set: 200m | +| Growth expected | 5-10x current usage | Current: 200Mi → Set: 1Gi | + +**3. Use round numbers:** + +- CPU: 50m, 100m, 200m, 500m, 1000m (1 core) +- Memory: 64Mi, 128Mi, 256Mi, 512Mi, 1Gi, 2Gi + +**4. Monitor and adjust:** + +- Check usage patterns after 1-2 weeks +- Adjust based on observed peak usage +- Iterate as workload changes + +### Resource Configuration Examples + +Based on actual deployments in buun-stack: + +```yaml +# PostgreSQL Operator (Burstable) +resources: + requests: + cpu: 50m + memory: 128Mi + limits: + cpu: 100m + memory: 256Mi + +# PostgreSQL Cluster (Guaranteed) +resources: + requests: + cpu: 200m + memory: 1Gi + limits: + cpu: 200m + memory: 1Gi + +# Vault Server (Guaranteed) +resources: + requests: + cpu: 50m + memory: 512Mi + limits: + cpu: 50m + memory: 512Mi + +# Vault Agent Injector (Guaranteed) +resources: + requests: + cpu: 50m + memory: 128Mi + limits: + cpu: 50m + memory: 128Mi +``` + +## Troubleshooting + +### Pod Stuck in Pending State + +**Symptom:** + +```plain +NAME READY STATUS RESTARTS AGE +my-pod 0/1 Pending 0 5m +``` + +**Check events:** + +```bash +kubectl describe pod -n | tail -20 +``` + +**Common causes:** + +#### Insufficient resources + +```plain +FailedScheduling: 0/1 nodes are available: 1 Insufficient cpu/memory +``` + +**Solution:** Reduce resource requests or add more nodes + +#### Pod anti-affinity + +```plain +FailedScheduling: 0/1 nodes are available: 1 node(s) didn't match pod anti-affinity rules +``` + +**Solution:** Delete old pod to allow new pod to schedule + +```bash +kubectl delete pod -n +``` + +### OOMKilled (Out of Memory) + +**Symptom:** + +```plain +NAME READY STATUS RESTARTS AGE +my-pod 0/1 OOMKilled 1 5m +``` + +**Solution:** + +#### Check memory limit is sufficient + +```bash +kubectl top pod -n +``` + +#### Increase memory limits + +```yaml +resources: + limits: + memory: 2Gi # Increase from 1Gi +``` + +### Helm Stuck in pending-upgrade + +**Symptom:** + +```bash +helm status -n +# STATUS: pending-upgrade +``` + +**Solution:** + +```bash +# Remove pending release secret +kubectl get secrets -n -l owner=helm,name= --sort-by=.metadata.creationTimestamp +kubectl delete secret 
sh.helm.release.v1..v -n + +# Verify status is back to deployed +helm status -n + +# Re-run upgrade +helm upgrade -n -f values.yaml +``` + +### VPA Not Providing Recommendations + +**Symptom:** + +- VPA shows "NoPodsMatched" or "ConfigUnsupported" +- Goldilocks shows empty containers section + +**Cause:** +VPA cannot monitor Custom Resource Definitions (CRDs) directly + +**Solution:** +Use alternative monitoring methods: + +1. kubectl top pod +2. Grafana dashboards +3. Prometheus queries + +For CRDs, configure resources manually based on observed usage patterns. + +## References + +- [Kubernetes Resource Management](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/) +- [Kubernetes QoS Classes](https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/) +- [Goldilocks Documentation](https://goldilocks.docs.fairwinds.com/) +- [CloudNativePG Resource Management](https://cloudnative-pg.io/documentation/current/resource_management/) diff --git a/jupyterhub/README.md b/jupyterhub/README.md index a93dad4..973b1db 100644 --- a/jupyterhub/README.md +++ b/jupyterhub/README.md @@ -1,24 +1,146 @@ # JupyterHub -Multi-user platform for interactive computing: +JupyterHub provides a multi-user Jupyter notebook environment with Keycloak OIDC authentication, Vault integration for secure secrets management, and custom kernel images for data science workflows. -- Collaborative Jupyter notebook environment -- Integrated with Keycloak for OIDC authentication -- Persistent storage for user workspaces -- Support for multiple kernels and environments -- Vault integration for secure secrets management +## Table of Contents -See [JupyterHub Documentation](../docs/jupyterhub.md) for detailed setup and configuration. +- [Installation](#installation) +- [Prerequisites](#prerequisites) +- [Access](#access) +- [Kernel Images](#kernel-images) +- [Profile Configuration](#profile-configuration) +- [Buun-Stack Images](#buun-stack-images) +- [buunstack Package & SecretStore](#buunstack-package--secretstore) +- [Vault Integration](#vault-integration) +- [Token Renewal Implementation](#token-renewal-implementation) +- [Storage Options](#storage-options) +- [Configuration](#configuration) +- [Custom Container Images](#custom-container-images) +- [Management](#management) +- [Troubleshooting](#troubleshooting) +- [Technical Implementation Details](#technical-implementation-details) +- [Performance Considerations](#performance-considerations) +- [Known Limitations](#known-limitations) ## Installation +Install JupyterHub with interactive configuration: + ```bash just jupyterhub::install ``` +This will prompt for: + +- JupyterHub host (FQDN) +- NFS PV usage (if Longhorn is installed) +- NFS server details (if NFS is enabled) +- Vault integration setup (requires root token for initial setup) + +## Prerequisites + +- Keycloak must be installed and configured +- For NFS storage: Longhorn must be installed +- For Vault integration: Vault and External Secrets Operator must be installed +- Helm repository must be accessible + ## Access -Access JupyterHub at `https://jupyter.yourdomain.com` and authenticate via Keycloak. +Access JupyterHub at your configured host (e.g., `https://jupyter.example.com`) and authenticate via Keycloak. + +## Kernel Images + +### Important Note + +Building and using custom buun-stack images requires building the `buunstack` Python package first. The package wheel file will be included in the Docker image during build. 
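After a build, you can confirm the wheel actually ended up in the kernel image before pushing it to the registry. A quick sketch, assuming Docker is available on the build host; the image reference is a placeholder and should be replaced with the registry, repository, and tag your build produced:

```bash
# Placeholder image reference: adjust to your registry, repository, and tag
IMAGE="localhost:30500/buun-stack-notebook:python-3.12-28"

# The build is expected to have installed the buunstack wheel into the image
docker run --rm "${IMAGE}" pip show buunstack
docker run --rm "${IMAGE}" python -c "import buunstack; print('buunstack OK')"
```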
+ +JupyterHub supports multiple kernel image profiles: + +### Standard Profiles + +- **minimal**: Basic Python environment +- **base**: Python with common data science packages +- **datascience**: Full data science stack (default) +- **pyspark**: PySpark for big data processing +- **pytorch**: PyTorch for machine learning +- **tensorflow**: TensorFlow for machine learning + +### Buun-Stack Profiles + +- **buun-stack**: Comprehensive data science environment with Vault integration +- **buun-stack-cuda**: CUDA-enabled version with GPU support + +## Profile Configuration + +Enable/disable profiles using environment variables: + +```bash +# Enable buun-stack profile (CPU version) +JUPYTER_PROFILE_BUUN_STACK_ENABLED=true + +# Enable buun-stack CUDA profile (GPU version) +JUPYTER_PROFILE_BUUN_STACK_CUDA_ENABLED=true + +# Disable default datascience profile +JUPYTER_PROFILE_DATASCIENCE_ENABLED=false +``` + +Available profile variables: + +- `JUPYTER_PROFILE_MINIMAL_ENABLED` +- `JUPYTER_PROFILE_BASE_ENABLED` +- `JUPYTER_PROFILE_DATASCIENCE_ENABLED` +- `JUPYTER_PROFILE_PYSPARK_ENABLED` +- `JUPYTER_PROFILE_PYTORCH_ENABLED` +- `JUPYTER_PROFILE_TENSORFLOW_ENABLED` +- `JUPYTER_PROFILE_BUUN_STACK_ENABLED` +- `JUPYTER_PROFILE_BUUN_STACK_CUDA_ENABLED` + +Only `JUPYTER_PROFILE_DATASCIENCE_ENABLED` is true by default. + +## Buun-Stack Images + +Buun-stack images provide comprehensive data science environments with: + +- All standard data science packages (NumPy, Pandas, Scikit-learn, etc.) +- Deep learning frameworks (PyTorch, TensorFlow, Keras) +- Big data tools (PySpark, Apache Arrow) +- NLP and ML libraries (LangChain, Transformers, spaCy) +- Database connectors and tools +- **Vault integration** with `buunstack` Python package + +### Building Custom Images + +Build and push buun-stack images to your registry: + +```bash +# Build images (includes building the buunstack Python package) +just jupyterhub::build-kernel-images + +# Push to registry +just jupyterhub::push-kernel-images +``` + +The build process: + +1. Builds the `buunstack` Python package wheel +2. Copies the wheel into the Docker build context +3. Installs the wheel in the Docker image +4. Cleans up temporary files + +⚠️ **Note**: Buun-stack images are comprehensive and large (~13GB). Initial image pulls and deployments take significant time due to the extensive package set. + +### Image Configuration + +Configure image settings in `.env.local`: + +```bash +# Image registry +IMAGE_REGISTRY=localhost:30500 + +# Image tag (current default) +JUPYTER_PYTHON_KERNEL_TAG=python-3.12-28 +``` ## buunstack Package & SecretStore @@ -60,6 +182,305 @@ For detailed documentation, usage examples, and API reference, see: [📖 buunstack Package Documentation](../python-package/README.md) +## Vault Integration + +### Overview + +Vault integration enables secure secrets management directly from Jupyter notebooks. 
The system uses: + +- **ExternalSecret** to fetch the admin token from Vault +- **Renewable tokens** with unlimited Max TTL to avoid 30-day system limitations +- **Token renewal script** that automatically renews tokens at TTL/2 intervals (minimum 30 seconds) +- **User-specific tokens** created during notebook spawn with isolated access + +### Architecture + +```plain +┌────────────────────────────────────────────────────────────────┐ +│ JupyterHub Hub Pod │ +│ │ +│ ┌──────────────┐ ┌────────────────┐ ┌────────────────────┐ │ +│ │ Hub │ │ Token Renewer │ │ ExternalSecret │ │ +│ │ Container │◄─┤ Sidecar │◄─┤ (mounted as │ │ +│ │ │ │ │ │ Secret) │ │ +│ └──────────────┘ └────────────────┘ └────────────────────┘ │ +│ │ │ ▲ │ +│ │ │ │ │ +│ ▼ ▼ │ │ +│ ┌──────────────────────────────────┐ │ │ +│ │ /vault/secrets/vault-token │ │ │ +│ │ (Admin token for user creation) │ │ │ +│ └──────────────────────────────────┘ │ │ +└────────────────────────────────────────────────────┼───────────┘ + │ + ┌───────────▼──────────┐ + │ Vault │ + │ secret/jupyterhub/ │ + │ vault-token │ + └──────────────────────┘ +``` + +### Prerequisites + +Vault integration requires: + +- Vault server installed and configured +- External Secrets Operator installed +- ClusterSecretStore configured for Vault +- Buun-stack kernel images (standard images don't include Vault integration) + +### Setup + +Vault integration is configured during JupyterHub installation: + +```bash +just jupyterhub::install +# Answer "yes" when prompted about Vault integration +# Provide Vault root token when prompted +``` + +The setup process: + +1. Creates `jupyterhub-admin` policy with necessary permissions including `sudo` for orphan token creation +2. Creates renewable admin token with 24h TTL and unlimited Max TTL +3. Stores token in Vault at `secret/jupyterhub/vault-token` +4. Creates ExternalSecret to fetch token from Vault +5. Deploys token renewal sidecar for automatic renewal + +### Usage in Notebooks + +With Vault integration enabled, use the `buunstack` package in notebooks: + +```python +from buunstack import SecretStore + +# Initialize (uses pre-acquired user-specific token) +secrets = SecretStore() + +# Store secrets +secrets.put('api-keys', + openai='sk-...', + github='ghp_...', + database_url='postgresql://...') + +# Retrieve secrets +api_keys = secrets.get('api-keys') +openai_key = secrets.get('api-keys', field='openai') + +# List all secrets +secret_names = secrets.list() + +# Delete secrets or specific fields +secrets.delete('old-api-key') # Delete entire secret +secrets.delete('api-keys', field='github') # Delete only github field +``` + +### Security Features + +- **User isolation**: Each user receives an orphan token with access only to their namespace +- **Automatic renewal**: Token renewal script renews admin token at TTL/2 intervals (minimum 30 seconds) +- **ExternalSecret integration**: Admin token fetched securely from Vault +- **Orphan tokens**: User tokens are orphan tokens, not limited by parent policy restrictions +- **Audit trail**: All secret access is logged in Vault + +### Token Management + +#### Admin Token + +The admin token is managed through: + +1. **Creation**: `just jupyterhub::create-jupyterhub-vault-token` creates renewable token +2. **Storage**: Stored in Vault at `secret/jupyterhub/vault-token` +3. **Retrieval**: ExternalSecret fetches and mounts as Kubernetes Secret +4. **Renewal**: `vault-token-renewer.sh` script renews at TTL/2 intervals + +#### User Tokens + +User tokens are created dynamically: + +1. 
**Pre-spawn hook** reads admin token from `/vault/secrets/vault-token` +2. **Creates user policy** `jupyter-user-{username}` with restricted access +3. **Creates orphan token** with user policy (requires `sudo` permission) +4. **Sets environment variable** `NOTEBOOK_VAULT_TOKEN` in notebook container + +## Token Renewal Implementation + +### Admin Token Renewal + +The admin token renewal is handled by a sidecar container (`vault-token-renewer`) running alongside the JupyterHub hub: + +**Implementation Details:** + +1. **Renewal Script**: `/vault/config/vault-token-renewer.sh` + - Runs in the `vault-token-renewer` sidecar container + - Uses Vault 1.17.5 image with HashiCorp Vault CLI + +2. **Environment-Based TTL Configuration**: + + ```bash + # Reads TTL from environment variable (set in .env.local) + TTL_RAW="${JUPYTERHUB_VAULT_TOKEN_TTL}" # e.g., "5m", "24h" + + # Converts to seconds and calculates renewal interval + RENEWAL_INTERVAL=$((TTL_SECONDS / 2)) # TTL/2 with minimum 30s + ``` + +3. **Token Source**: ExternalSecret → Kubernetes Secret → mounted file + + ```bash + # Token retrieved from ExternalSecret-managed mount + ADMIN_TOKEN=$(cat /vault/admin-token/token) + ``` + +4. **Renewal Loop**: + + ```bash + while true; do + vault token renew >/dev/null 2>&1 + sleep $RENEWAL_INTERVAL + done + ``` + +5. **Error Handling**: If renewal fails, re-retrieves token from ExternalSecret mount + +**Key Files:** + +- `vault-token-renewer.sh`: Main renewal script +- `jupyterhub-vault-token-external-secret.gomplate.yaml`: ExternalSecret configuration +- `vault-token-renewer-config` ConfigMap: Contains the renewal script + +### User Token Renewal + +User token renewal is handled within the notebook environment by the `buunstack` Python package: + +**Implementation Details:** + +1. **Token Source**: Environment variable set by pre-spawn hook + + ```python + # In pre_spawn_hook.gomplate.py + spawner.environment["NOTEBOOK_VAULT_TOKEN"] = user_vault_token + ``` + +2. **Automatic Renewal**: Built into `SecretStore` class operations + + ```python + # In buunstack/secrets.py + def _ensure_authenticated(self): + token_info = self.client.auth.token.lookup_self() + ttl = token_info.get("data", {}).get("ttl", 0) + renewable = token_info.get("data", {}).get("renewable", False) + + # Renew if TTL < 10 minutes and renewable + if renewable and ttl > 0 and ttl < 600: + self.client.auth.token.renew_self() + ``` + +3. **Renewal Trigger**: Every `SecretStore` operation (get, put, delete, list) + - Checks token validity before operation + - Automatically renews if TTL < 10 minutes + - Transparent to user code + +4. **Token Configuration** (set during creation): + - **TTL**: `NOTEBOOK_VAULT_TOKEN_TTL` (default: 24h = 1 day) + - **Max TTL**: `NOTEBOOK_VAULT_TOKEN_MAX_TTL` (default: 168h = 7 days) + - **Policy**: User-specific `jupyter-user-{username}` + - **Type**: Orphan token (independent of parent token lifecycle) + +5. 
**Expiry Handling**: When token reaches Max TTL: + - Cannot be renewed further + - User must restart notebook server (triggers new token creation) + - Prevented by `JUPYTERHUB_CULL_MAX_AGE` setting (6 days < 7 day Max TTL) + +**Key Files:** + +- `pre_spawn_hook.gomplate.py`: User token creation logic +- `buunstack/secrets.py`: Token renewal implementation +- `user_policy.hcl`: User token permissions template + +### Token Lifecycle Summary + +```plain +┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐ +│ Admin Token │ │ User Token │ │ Pod Lifecycle │ +│ │ │ │ │ │ +│ Created: Manual │ │ Created: Spawn │ │ Max Age: 7 days │ +│ TTL: 5m-24h │ │ TTL: 1 day │ │ Auto-restart │ +│ Max TTL: ∞ │ │ Max TTL: 7 days │ │ at Max TTL │ +│ Renewal: Auto │ │ Renewal: Auto │ │ │ +│ Interval: TTL/2 │ │ Trigger: Usage │ │ │ +└─────────────────┘ └──────────────────┘ └─────────────────┘ + │ │ │ + ▼ ▼ ▼ + vault-token-renewer buunstack.py cull.maxAge + sidecar SecretStore pod restart +``` + +## Storage Options + +### Default Storage + +Uses Kubernetes PersistentVolumes for user home directories. + +### NFS Storage + +For shared storage across nodes, configure NFS: + +```bash +JUPYTERHUB_NFS_PV_ENABLED=true +JUPYTER_NFS_IP=192.168.10.1 +JUPYTER_NFS_PATH=/volume1/drive1/jupyter +``` + +NFS storage requires: + +- Longhorn storage system installed +- NFS server accessible from cluster nodes +- Proper NFS export permissions configured + +## Configuration + +### Environment Variables + +Key configuration variables: + +```bash +# Basic settings +JUPYTERHUB_NAMESPACE=jupyter +JUPYTERHUB_CHART_VERSION=4.2.0 +JUPYTERHUB_OIDC_CLIENT_ID=jupyterhub + +# Keycloak integration +KEYCLOAK_REALM=buunstack + +# Storage +JUPYTERHUB_NFS_PV_ENABLED=false + +# Vault integration +JUPYTERHUB_VAULT_INTEGRATION_ENABLED=false +VAULT_ADDR=https://vault.example.com + +# Image settings +JUPYTER_PYTHON_KERNEL_TAG=python-3.12-28 +IMAGE_REGISTRY=localhost:30500 + +# Vault token TTL settings +JUPYTERHUB_VAULT_TOKEN_TTL=24h # Admin token: renewed at TTL/2 intervals +NOTEBOOK_VAULT_TOKEN_TTL=24h # User token: 1 day (renewed on usage) +NOTEBOOK_VAULT_TOKEN_MAX_TTL=168h # User token: 7 days max + +# Server pod lifecycle settings +JUPYTERHUB_CULL_MAX_AGE=604800 # Max pod age in seconds (7 days = 604800s) + # Should be <= NOTEBOOK_VAULT_TOKEN_MAX_TTL + +# Logging +JUPYTER_BUUNSTACK_LOG_LEVEL=warning # Options: debug, info, warning, error +``` + +### Advanced Configuration + +Customize JupyterHub behavior by editing `jupyterhub-values.gomplate.yaml` template before installation. + ## Custom Container Images JupyterHub uses custom container images with pre-installed data science tools and integrations: @@ -88,3 +509,156 @@ GPU-enabled notebook image based on `jupyter/pytorch-notebook:cuda12`: [📖 See Image Documentation](./images/datastack-cuda-notebook/README.md) Both images are based on the official [Jupyter Docker Stacks](https://github.com/jupyter/docker-stacks) and include all standard data science libraries (NumPy, pandas, scikit-learn, matplotlib, etc.). 
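Because these images are large, a user server can take a long time to start the first time it lands on a node that has never pulled the image. One way to avoid this is to warm each node's image cache ahead of time. A rough sketch, not a required step; the image reference and node name are placeholders, and a DaemonSet achieves the same effect more permanently:

```bash
# Placeholder values: adjust to your registry, tag, and node names
IMAGE="localhost:30500/buun-stack-notebook:python-3.12-28"
NODE="worker-1"

# Schedule a throwaway pod on the target node so the kubelet pulls the image,
# then remove the pod; the cached image layers remain on the node
kubectl run "prepull-${NODE}" --image="${IMAGE}" --restart=Never \
  --overrides="{\"spec\":{\"nodeName\":\"${NODE}\"}}" \
  --command -- sleep 1
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded "pod/prepull-${NODE}" --timeout=30m
kubectl delete pod "prepull-${NODE}"
```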
+ +## Management + +### Uninstall + +```bash +just jupyterhub::uninstall +``` + +This removes: + +- JupyterHub deployment +- User pods +- PVCs +- ExternalSecret + +### Update + +Upgrade to newer versions: + +```bash +# Update image tag in .env.local +export JUPYTER_PYTHON_KERNEL_TAG=python-3.12-29 + +# Rebuild and push images +just jupyterhub::build-kernel-images +just jupyterhub::push-kernel-images + +# Upgrade JupyterHub deployment +just jupyterhub::install +``` + +### Manual Token Refresh + +If needed, manually refresh the admin token: + +```bash +# Create new renewable token +just jupyterhub::create-jupyterhub-vault-token + +# Restart JupyterHub to pick up new token +kubectl rollout restart deployment/hub -n jupyter +``` + +## Troubleshooting + +### Image Pull Issues + +Buun-stack images are large and may timeout: + +```bash +# Check pod status +kubectl get pods -n jupyter + +# Check image pull progress +kubectl describe pod -n jupyter + +# Increase timeout if needed +helm upgrade jupyterhub jupyterhub/jupyterhub --timeout=30m -f jupyterhub-values.yaml +``` + +### Vault Integration Issues + +Check token and authentication: + +```bash +# Check ExternalSecret status +kubectl get externalsecret -n jupyter jupyterhub-vault-token + +# Check if Secret was created +kubectl get secret -n jupyter jupyterhub-vault-token + +# Check token renewal logs +kubectl logs -n jupyter -l app.kubernetes.io/component=hub -c vault-token-renewer + +# In a notebook, verify environment +%env NOTEBOOK_VAULT_TOKEN +``` + +Common issues: + +1. **"child policies must be subset of parent"**: Admin policy needs `sudo` permission for orphan tokens +2. **Token not found**: Check ExternalSecret and ClusterSecretStore configuration +3. **Permission denied**: Verify `jupyterhub-admin` policy has all required permissions + +### Authentication Issues + +Verify Keycloak client configuration: + +```bash +# Check client exists +just keycloak::get-client buunstack jupyterhub + +# Check redirect URIs +just keycloak::update-client buunstack jupyterhub \ + "https://your-jupyter-host/hub/oauth_callback" +``` + +## Technical Implementation Details + +### Helm Chart Version + +JupyterHub uses the official Zero to JupyterHub (Z2JH) Helm chart: + +- Chart: `jupyterhub/jupyterhub` +- Version: `4.2.0` (configurable via `JUPYTERHUB_CHART_VERSION`) +- Documentation: https://z2jh.jupyter.org/ + +### Token System Architecture + +The system uses a three-tier token approach: + +1. **Renewable Admin Token**: + - Created with `explicit-max-ttl=0` (unlimited Max TTL) + - Renewed automatically at TTL/2 intervals (minimum 30 seconds) + - Stored in Vault and fetched via ExternalSecret +2. **Orphan User Tokens**: + - Created with `create_orphan()` API call + - Not limited by parent token policies + - Individual TTL and Max TTL settings +3. 
**Token Renewal Script**: + - Runs as sidecar container + - Reads token from ExternalSecret mount + - Handles renewal and re-retrieval on failure + +### Key Files + +- `jupyterhub-admin-policy.hcl`: Vault policy with admin permissions +- `user_policy.hcl`: Template for user-specific policies +- `vault-token-renewer.sh`: Token renewal script +- `jupyterhub-vault-token-external-secret.gomplate.yaml`: ExternalSecret configuration + +## Performance Considerations + +- **Image Size**: Buun-stack images are ~13GB, plan storage accordingly +- **Pull Time**: Initial pulls take 5-15 minutes depending on network +- **Resource Usage**: Data science workloads require adequate CPU/memory +- **Token Renewal**: Minimal overhead (renewal at TTL/2 intervals) + +For production deployments, consider: + +- Pre-pulling images to all nodes +- Using faster storage backends +- Configuring resource limits per user +- Setting up monitoring and alerts + +## Known Limitations + +1. **Annual Token Recreation**: While tokens have unlimited Max TTL, best practice suggests recreating them annually +2. **Token Expiry and Pod Lifecycle**: User tokens have a TTL of 1 day (`NOTEBOOK_VAULT_TOKEN_TTL=24h`) and maximum TTL of 7 days (`NOTEBOOK_VAULT_TOKEN_MAX_TTL=168h`). Daily usage extends the token for another day, allowing up to 7 days of continuous use. Server pods are automatically restarted after 7 days (`JUPYTERHUB_CULL_MAX_AGE=604800s`) to refresh tokens. +3. **Cull Settings**: Server idle timeout is set to 2 hours by default. Adjust `cull.timeout` and `cull.every` in the Helm values for different requirements +4. **NFS Storage**: When using NFS storage, ensure proper permissions are set on the NFS server. The default `JUPYTER_FSGID` is 100 +5. **ExternalSecret Dependency**: Requires External Secrets Operator to be installed and configured diff --git a/trino/MCP.md b/trino/MCP.md index b82f30e..b5ea7bc 100644 --- a/trino/MCP.md +++ b/trino/MCP.md @@ -26,7 +26,7 @@ Create `.env.claude` with Trino connection settings: ```bash # Trino Connection (Password Authentication) -TRINO_HOST=trino.buun.dev +TRINO_HOST=trino.yourdomain.com TRINO_PORT=443 TRINO_SCHEME=https TRINO_SSL=true @@ -75,7 +75,7 @@ Create `~/.env.claude` in your home directory with 1Password references: ```bash # Trino Connection (Password Authentication) -TRINO_HOST=trino.buun.dev +TRINO_HOST=trino.yourdomain.com TRINO_PORT=443 TRINO_SCHEME=https TRINO_SSL=true diff --git a/trino/justfile b/trino/justfile index b016382..5ef2c34 100644 --- a/trino/justfile +++ b/trino/justfile @@ -392,7 +392,7 @@ cli user="": TRINO_HOST="${TRINO_HOST}" while [ -z "${TRINO_HOST}" ]; do TRINO_HOST=$(gum input --prompt="Trino host (FQDN): " --width=100 \ - --placeholder="e.g., trino.buun.dev") + --placeholder="e.g., trino.yourdomain.com") done TRINO_USER="{{ user }}" if [ -z "${TRINO_USER}" ]; then