docs: reconstruct docs

CLAUDE.md (34 lines changed)
@@ -159,6 +159,7 @@ install:

ServiceMonitor template (`servicemonitor.gomplate.yaml`):

```yaml
{{- if eq .Env.MONITORING_ENABLED "true" }}
apiVersion: monitoring.coreos.com/v1
```
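To sanity-check the template locally, it can be rendered the same way other gomplate templates in this repo are rendered (the output filename here is illustrative):

```bash
# Render the ServiceMonitor template with monitoring enabled;
# gomplate substitutes {{ .Env.* }} values from the current environment
MONITORING_ENABLED=true gomplate -f servicemonitor.gomplate.yaml -o servicemonitor.yaml
```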
@@ -366,3 +367,36 @@ receiving
- Only write code comments when necessary, as the code should be self-explanatory
  (avoid trivial comments for each code block)
- Write output messages and code comments in English

### Markdown Style

When writing Markdown documentation:

1. **NEVER use ordered lists as section headers**:
   - Ordered lists indent content and are not suitable for headings
   - Use proper heading levels (####) instead of numbered lists for section titles

   ```markdown
   <!-- INCORRECT: Ordered list used as headers -->
   1. **Setup Instructions:**

      Details here...

   2. **Next Step:**

      More details...

   <!-- CORRECT: Use headings instead -->
   #### Setup Instructions

   Details here...

   #### Next Step

   More details...
   ```

2. **Always validate with markdownlint-cli2**:
   - Run `markdownlint-cli2 <file>` before committing any Markdown files
   - Fix all linting errors to ensure consistent formatting
   - Pay attention to code block language specifications (MD040) and list formatting (MD029)
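A minimal sketch of that validation step (the glob and `--fix` usage are illustrative):

```bash
# Lint one file before committing
markdownlint-cli2 docs/airflow.md

# Lint the whole tree and auto-apply the safe fixes
markdownlint-cli2 --fix "**/*.md"
```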
@@ -46,7 +46,7 @@ This document covers Airflow installation, deployment, and debugging in the buun
**Note**: New users have only Viewer access by default and cannot execute DAGs without role assignment.

4. **Access Airflow Web UI**:
   - Navigate to your Airflow instance (e.g., `https://airflow.yourdomain.com`)
   - Log in with your Keycloak credentials

### Uninstalling
@@ -63,7 +63,7 @@ just airflow::uninstall true

### 1. Access JupyterHub

- Navigate to your JupyterHub instance (e.g., `https://jupyter.yourdomain.com`)
- Log in with your credentials

### 2. Navigate to Airflow DAGs Directory
@@ -82,7 +82,7 @@ In JupyterHub, the Airflow DAGs directory is mounted at:

### 4. Verify Deployment

1. Access the Airflow Web UI (e.g., `https://airflow.yourdomain.com`)
2. Check that the DAG `csv_to_postgres` appears in the DAGs list
3. If the DAG doesn't appear immediately, wait 1-2 minutes for Airflow to detect the new file
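If the DAG still does not appear, a hedged cluster-side check is to list DAGs as the scheduler sees them (the `airflow` namespace and `airflow-scheduler` deployment name are assumptions; adjust to your installation):

```bash
# Ask the scheduler which DAGs it has parsed
kubectl exec -n airflow deploy/airflow-scheduler -- airflow dags list | grep csv_to_postgres
```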
@@ -28,7 +28,7 @@ This document covers Dagster installation, deployment, and debugging in the buun

3. **Access Dagster Web UI**:
   - Navigate to your Dagster instance (e.g., `https://dagster.yourdomain.com`)
   - Log in with your Keycloak credentials

### Uninstalling
@@ -1,577 +1,5 @@
# JupyterHub Documentation

This documentation has been moved to [jupyterhub/README.md](../jupyterhub/README.md).

Please refer to the new location for complete JupyterHub setup, configuration, and usage documentation.

The removed content below now lives in the new location:

# JupyterHub

JupyterHub provides a multi-user Jupyter notebook environment with Keycloak OIDC authentication, Vault integration for secure secrets management, and custom kernel images for data science workflows.

## Installation

Install JupyterHub with interactive configuration:

```bash
just jupyterhub::install
```

This will prompt for:

- JupyterHub host (FQDN)
- NFS PV usage (if Longhorn is installed)
- NFS server details (if NFS is enabled)
- Vault integration setup (requires root token for initial setup)

### Prerequisites

- Keycloak must be installed and configured
- For NFS storage: Longhorn must be installed
- For Vault integration: Vault and External Secrets Operator must be installed
- Helm repository must be accessible

## Kernel Images

### Important Note

Building and using custom buun-stack images requires building the `buunstack` Python package first. The package wheel file will be included in the Docker image during build.

JupyterHub supports multiple kernel image profiles:

### Standard Profiles

- **minimal**: Basic Python environment
- **base**: Python with common data science packages
- **datascience**: Full data science stack (default)
- **pyspark**: PySpark for big data processing
- **pytorch**: PyTorch for machine learning
- **tensorflow**: TensorFlow for machine learning

### Buun-Stack Profiles

- **buun-stack**: Comprehensive data science environment with Vault integration
- **buun-stack-cuda**: CUDA-enabled version with GPU support

## Profile Configuration

Enable/disable profiles using environment variables:

```bash
# Enable buun-stack profile (CPU version)
JUPYTER_PROFILE_BUUN_STACK_ENABLED=true

# Enable buun-stack CUDA profile (GPU version)
JUPYTER_PROFILE_BUUN_STACK_CUDA_ENABLED=true

# Disable default datascience profile
JUPYTER_PROFILE_DATASCIENCE_ENABLED=false
```

Available profile variables:

- `JUPYTER_PROFILE_MINIMAL_ENABLED`
- `JUPYTER_PROFILE_BASE_ENABLED`
- `JUPYTER_PROFILE_DATASCIENCE_ENABLED`
- `JUPYTER_PROFILE_PYSPARK_ENABLED`
- `JUPYTER_PROFILE_PYTORCH_ENABLED`
- `JUPYTER_PROFILE_TENSORFLOW_ENABLED`
- `JUPYTER_PROFILE_BUUN_STACK_ENABLED`
- `JUPYTER_PROFILE_BUUN_STACK_CUDA_ENABLED`

Only `JUPYTER_PROFILE_DATASCIENCE_ENABLED` is true by default.

## Buun-Stack Images

Buun-stack images provide comprehensive data science environments with:

- All standard data science packages (NumPy, Pandas, Scikit-learn, etc.)
- Deep learning frameworks (PyTorch, TensorFlow, Keras)
- Big data tools (PySpark, Apache Arrow)
- NLP and ML libraries (LangChain, Transformers, spaCy)
- Database connectors and tools
- **Vault integration** with the `buunstack` Python package

### Building Custom Images

Build and push buun-stack images to your registry:

```bash
# Build images (includes building the buunstack Python package)
just jupyterhub::build-kernel-images

# Push to registry
just jupyterhub::push-kernel-images
```

The build process:

1. Builds the `buunstack` Python package wheel
2. Copies the wheel into the Docker build context
3. Installs the wheel in the Docker image
4. Cleans up temporary files

⚠️ **Note**: Buun-stack images are comprehensive and large (~13GB). Initial image pulls and deployments take significant time due to the extensive package set.

### Image Configuration

Configure image settings in `.env.local`:

```bash
# Image registry
IMAGE_REGISTRY=localhost:30500

# Image tag (current default)
JUPYTER_PYTHON_KERNEL_TAG=python-3.12-28
```

## Vault Integration

### Overview

Vault integration enables secure secrets management directly from Jupyter notebooks. The system uses:

- **ExternalSecret** to fetch the admin token from Vault
- **Renewable tokens** with unlimited Max TTL to avoid 30-day system limitations
- **Token renewal script** that automatically renews tokens at TTL/2 intervals (minimum 30 seconds)
- **User-specific tokens** created during notebook spawn with isolated access

### Architecture

```plain
┌────────────────────────────────────────────────────────────────┐
│                       JupyterHub Hub Pod                       │
│                                                                │
│ ┌──────────────┐  ┌────────────────┐  ┌────────────────────┐   │
│ │ Hub          │  │ Token Renewer  │  │ ExternalSecret     │   │
│ │ Container    │◄─┤ Sidecar        │◄─┤ (mounted as        │   │
│ │              │  │                │  │  Secret)           │   │
│ └──────────────┘  └────────────────┘  └────────────────────┘   │
│        │                  │                     ▲              │
│        │                  │                     │              │
│        ▼                  ▼                     │              │
│ ┌──────────────────────────────────┐            │              │
│ │ /vault/secrets/vault-token       │            │              │
│ │ (Admin token for user creation)  │            │              │
│ └──────────────────────────────────┘            │              │
└─────────────────────────────────────────────────┼──────────────┘
                                                  │
                                      ┌───────────▼──────────┐
                                      │        Vault         │
                                      │  secret/jupyterhub/  │
                                      │  vault-token         │
                                      └──────────────────────┘
```

### Prerequisites

Vault integration requires:

- Vault server installed and configured
- External Secrets Operator installed
- ClusterSecretStore configured for Vault
- Buun-stack kernel images (standard images don't include Vault integration)

### Setup

Vault integration is configured during JupyterHub installation:

```bash
just jupyterhub::install
# Answer "yes" when prompted about Vault integration
# Provide Vault root token when prompted
```

The setup process:

1. Creates the `jupyterhub-admin` policy with the necessary permissions, including `sudo` for orphan token creation
2. Creates a renewable admin token with 24h TTL and unlimited Max TTL
3. Stores the token in Vault at `secret/jupyterhub/vault-token`
4. Creates an ExternalSecret to fetch the token from Vault
5. Deploys the token renewal sidecar for automatic renewal

### Usage in Notebooks

With Vault integration enabled, use the `buunstack` package in notebooks:

```python
from buunstack import SecretStore

# Initialize (uses the pre-acquired user-specific token)
secrets = SecretStore()

# Store secrets
secrets.put('api-keys',
            openai='sk-...',
            github='ghp_...',
            database_url='postgresql://...')

# Retrieve secrets
api_keys = secrets.get('api-keys')
openai_key = secrets.get('api-keys', field='openai')

# List all secrets
secret_names = secrets.list()

# Delete secrets or specific fields
secrets.delete('old-api-key')                # Delete entire secret
secrets.delete('api-keys', field='github')   # Delete only the github field
```

### Security Features

- **User isolation**: Each user receives an orphan token with access only to their namespace
- **Automatic renewal**: The token renewal script renews the admin token at TTL/2 intervals (minimum 30 seconds)
- **ExternalSecret integration**: The admin token is fetched securely from Vault
- **Orphan tokens**: User tokens are orphan tokens, not limited by parent policy restrictions
- **Audit trail**: All secret access is logged in Vault

### Token Management

#### Admin Token

The admin token is managed through:

1. **Creation**: `just jupyterhub::create-jupyterhub-vault-token` creates a renewable token
2. **Storage**: Stored in Vault at `secret/jupyterhub/vault-token`
3. **Retrieval**: An ExternalSecret fetches it and mounts it as a Kubernetes Secret
4. **Renewal**: The `vault-token-renewer.sh` script renews it at TTL/2 intervals

#### User Tokens

User tokens are created dynamically:

1. **Pre-spawn hook** reads the admin token from `/vault/secrets/vault-token`
2. **Creates user policy** `jupyter-user-{username}` with restricted access
3. **Creates orphan token** with the user policy (requires `sudo` permission)
4. **Sets environment variable** `NOTEBOOK_VAULT_TOKEN` in the notebook container

## Token Renewal Implementation

### Admin Token Renewal

Admin token renewal is handled by a sidecar container (`vault-token-renewer`) running alongside the JupyterHub hub.

**Implementation Details:**

1. **Renewal Script**: `/vault/config/vault-token-renewer.sh`
   - Runs in the `vault-token-renewer` sidecar container
   - Uses the Vault 1.17.5 image with the HashiCorp Vault CLI

2. **Environment-Based TTL Configuration**:

   ```bash
   # Reads TTL from environment variable (set in .env.local)
   TTL_RAW="${JUPYTERHUB_VAULT_TOKEN_TTL}" # e.g., "5m", "24h"

   # Converts to seconds and calculates the renewal interval
   RENEWAL_INTERVAL=$((TTL_SECONDS / 2)) # TTL/2 with a 30s minimum
   ```

3. **Token Source**: ExternalSecret → Kubernetes Secret → mounted file

   ```bash
   # Token retrieved from the ExternalSecret-managed mount
   ADMIN_TOKEN=$(cat /vault/admin-token/token)
   ```

4. **Renewal Loop**:

   ```bash
   while true; do
     vault token renew >/dev/null 2>&1
     sleep $RENEWAL_INTERVAL
   done
   ```

5. **Error Handling**: If renewal fails, the script re-retrieves the token from the ExternalSecret mount

**Key Files:**

- `vault-token-renewer.sh`: Main renewal script
- `jupyterhub-vault-token-external-secret.gomplate.yaml`: ExternalSecret configuration
- `vault-token-renewer-config` ConfigMap: Contains the renewal script

### User Token Renewal

User token renewal is handled within the notebook environment by the `buunstack` Python package.

**Implementation Details:**

1. **Token Source**: Environment variable set by the pre-spawn hook

   ```python
   # In pre_spawn_hook.gomplate.py
   spawner.environment["NOTEBOOK_VAULT_TOKEN"] = user_vault_token
   ```

2. **Automatic Renewal**: Built into `SecretStore` class operations

   ```python
   # In buunstack/secrets.py
   def _ensure_authenticated(self):
       token_info = self.client.auth.token.lookup_self()
       ttl = token_info.get("data", {}).get("ttl", 0)
       renewable = token_info.get("data", {}).get("renewable", False)

       # Renew if TTL < 10 minutes and the token is renewable
       if renewable and ttl > 0 and ttl < 600:
           self.client.auth.token.renew_self()
   ```

3. **Renewal Trigger**: Every `SecretStore` operation (get, put, delete, list)
   - Checks token validity before the operation
   - Automatically renews if TTL < 10 minutes
   - Transparent to user code

4. **Token Configuration** (set during creation):
   - **TTL**: `NOTEBOOK_VAULT_TOKEN_TTL` (default: 24h = 1 day)
   - **Max TTL**: `NOTEBOOK_VAULT_TOKEN_MAX_TTL` (default: 168h = 7 days)
   - **Policy**: User-specific `jupyter-user-{username}`
   - **Type**: Orphan token (independent of the parent token lifecycle)

5. **Expiry Handling**: When a token reaches its Max TTL:
   - It cannot be renewed further
   - The user must restart the notebook server (triggers new token creation)
   - Prevented by the `JUPYTERHUB_CULL_MAX_AGE` setting (6 days < 7-day Max TTL)

**Key Files:**

- `pre_spawn_hook.gomplate.py`: User token creation logic
- `buunstack/secrets.py`: Token renewal implementation
- `user_policy.hcl`: User token permissions template

### Token Lifecycle Summary

```plain
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│ Admin Token     │    │ User Token       │    │ Pod Lifecycle   │
│                 │    │                  │    │                 │
│ Created: Manual │    │ Created: Spawn   │    │ Max Age: 7 days │
│ TTL: 5m-24h     │    │ TTL: 1 day       │    │ Auto-restart    │
│ Max TTL: ∞      │    │ Max TTL: 7 days  │    │ at Max TTL      │
│ Renewal: Auto   │    │ Renewal: Auto    │    │                 │
│ Interval: TTL/2 │    │ Trigger: Usage   │    │                 │
└─────────────────┘    └──────────────────┘    └─────────────────┘
         │                      │                       │
         ▼                      ▼                       ▼
  vault-token-renewer      buunstack.py            cull.maxAge
  sidecar                  SecretStore             pod restart
```

## Storage Options

### Default Storage

Uses Kubernetes PersistentVolumes for user home directories.

### NFS Storage

For shared storage across nodes, configure NFS:

```bash
JUPYTERHUB_NFS_PV_ENABLED=true
JUPYTER_NFS_IP=192.168.10.1
JUPYTER_NFS_PATH=/volume1/drive1/jupyter
```

NFS storage requires:

- Longhorn storage system installed
- NFS server accessible from cluster nodes
- Proper NFS export permissions configured
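A quick, hedged check that the export is reachable before enabling the PV (requires an NFS client with `showmount`; the IP matches the example above):

```bash
# List the exports offered by the NFS server
showmount -e 192.168.10.1
```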
## Configuration
|
|
||||||
|
|
||||||
### Environment Variables
|
|
||||||
|
|
||||||
Key configuration variables:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Basic settings
|
|
||||||
JUPYTERHUB_NAMESPACE=jupyter
|
|
||||||
JUPYTERHUB_CHART_VERSION=4.2.0
|
|
||||||
JUPYTERHUB_OIDC_CLIENT_ID=jupyterhub
|
|
||||||
|
|
||||||
# Keycloak integration
|
|
||||||
KEYCLOAK_REALM=buunstack
|
|
||||||
|
|
||||||
# Storage
|
|
||||||
JUPYTERHUB_NFS_PV_ENABLED=false
|
|
||||||
|
|
||||||
# Vault integration
|
|
||||||
JUPYTERHUB_VAULT_INTEGRATION_ENABLED=false
|
|
||||||
VAULT_ADDR=https://vault.example.com
|
|
||||||
|
|
||||||
# Image settings
|
|
||||||
JUPYTER_PYTHON_KERNEL_TAG=python-3.12-28
|
|
||||||
IMAGE_REGISTRY=localhost:30500
|
|
||||||
|
|
||||||
# Vault token TTL settings
|
|
||||||
JUPYTERHUB_VAULT_TOKEN_TTL=24h # Admin token: renewed at TTL/2 intervals
|
|
||||||
NOTEBOOK_VAULT_TOKEN_TTL=24h # User token: 1 day (renewed on usage)
|
|
||||||
NOTEBOOK_VAULT_TOKEN_MAX_TTL=168h # User token: 7 days max
|
|
||||||
|
|
||||||
# Server pod lifecycle settings
|
|
||||||
JUPYTERHUB_CULL_MAX_AGE=604800 # Max pod age in seconds (7 days = 604800s)
|
|
||||||
# Should be <= NOTEBOOK_VAULT_TOKEN_MAX_TTL
|
|
||||||
|
|
||||||
# Logging
|
|
||||||
JUPYTER_BUUNSTACK_LOG_LEVEL=warning # Options: debug, info, warning, error
|
|
||||||
```
|
|
||||||
|
|
||||||
### Advanced Configuration
|
|
||||||
|
|
||||||
Customize JupyterHub behavior by editing `jupyterhub-values.gomplate.yaml` template before installation.
|
|
||||||
|
|
||||||
## Management
|
|
||||||
|
|
||||||
### Uninstall
|
|
||||||
|
|
||||||
```bash
|
|
||||||
just jupyterhub::uninstall
|
|
||||||
```
|
|
||||||
|
|
||||||
This removes:
|
|
||||||
|
|
||||||
- JupyterHub deployment
|
|
||||||
- User pods
|
|
||||||
- PVCs
|
|
||||||
- ExternalSecret
|
|
||||||
|
|
||||||
### Update
|
|
||||||
|
|
||||||
Upgrade to newer versions:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Update image tag in .env.local
|
|
||||||
export JUPYTER_PYTHON_KERNEL_TAG=python-3.12-29
|
|
||||||
|
|
||||||
# Rebuild and push images
|
|
||||||
just jupyterhub::build-kernel-images
|
|
||||||
just jupyterhub::push-kernel-images
|
|
||||||
|
|
||||||
# Upgrade JupyterHub deployment
|
|
||||||
just jupyterhub::install
|
|
||||||
```
|
|
||||||
|
|
||||||
### Manual Token Refresh
|
|
||||||
|
|
||||||
If needed, manually refresh the admin token:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Create new renewable token
|
|
||||||
just jupyterhub::create-jupyterhub-vault-token
|
|
||||||
|
|
||||||
# Restart JupyterHub to pick up new token
|
|
||||||
kubectl rollout restart deployment/hub -n jupyter
|
|
||||||
```
|
|
||||||
|
|
||||||
## Troubleshooting
|
|
||||||
|
|
||||||
### Image Pull Issues
|
|
||||||
|
|
||||||
Buun-stack images are large and may timeout:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Check pod status
|
|
||||||
kubectl get pods -n jupyter
|
|
||||||
|
|
||||||
# Check image pull progress
|
|
||||||
kubectl describe pod <pod-name> -n jupyter
|
|
||||||
|
|
||||||
# Increase timeout if needed
|
|
||||||
helm upgrade jupyterhub jupyterhub/jupyterhub --timeout=30m -f jupyterhub-values.yaml
|
|
||||||
```
|
|
||||||
|
|
||||||
### Vault Integration Issues
|
|
||||||
|
|
||||||
Check token and authentication:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Check ExternalSecret status
|
|
||||||
kubectl get externalsecret -n jupyter jupyterhub-vault-token
|
|
||||||
|
|
||||||
# Check if Secret was created
|
|
||||||
kubectl get secret -n jupyter jupyterhub-vault-token
|
|
||||||
|
|
||||||
# Check token renewal logs
|
|
||||||
kubectl logs -n jupyter -l app.kubernetes.io/component=hub -c vault-token-renewer
|
|
||||||
|
|
||||||
# In a notebook, verify environment
|
|
||||||
%env NOTEBOOK_VAULT_TOKEN
|
|
||||||
```
|
|
||||||
|
|
||||||
Common issues:
|
|
||||||
|
|
||||||
1. **"child policies must be subset of parent"**: Admin policy needs `sudo` permission for orphan tokens
|
|
||||||
2. **Token not found**: Check ExternalSecret and ClusterSecretStore configuration
|
|
||||||
3. **Permission denied**: Verify `jupyterhub-admin` policy has all required permissions
|
|
||||||
|
|
||||||
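A few hedged CLI checks that map to these issues (they assume `VAULT_ADDR` and a sufficiently privileged `VAULT_TOKEN` are exported):

```bash
# Issues 1 and 3: inspect the current token and confirm the admin policy contents
vault token lookup
vault policy read jupyterhub-admin

# Issue 2: verify the backing store referenced by the ExternalSecret
kubectl get clustersecretstore
```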
### Authentication Issues

Verify the Keycloak client configuration:

```bash
# Check that the client exists
just keycloak::get-client buunstack jupyterhub

# Update the redirect URIs
just keycloak::update-client buunstack jupyterhub \
  "https://your-jupyter-host/hub/oauth_callback"
```

## Technical Implementation Details

### Helm Chart Version

JupyterHub uses the official Zero to JupyterHub (Z2JH) Helm chart:

- Chart: `jupyterhub/jupyterhub`
- Version: `4.2.0` (configurable via `JUPYTERHUB_CHART_VERSION`)
- Documentation: <https://z2jh.jupyter.org/>

### Token System Architecture

The system uses a three-tier token approach:

1. **Renewable Admin Token** (see the CLI sketch below):
   - Created with `explicit-max-ttl=0` (unlimited Max TTL)
   - Renewed automatically at TTL/2 intervals (minimum 30 seconds)
   - Stored in Vault and fetched via ExternalSecret

2. **Orphan User Tokens**:
   - Created with the `create_orphan()` API call
   - Not limited by parent token policies
   - Individual TTL and Max TTL settings

3. **Token Renewal Script**:
   - Runs as a sidecar container
   - Reads the token from the ExternalSecret mount
   - Handles renewal and re-retrieval on failure
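As a hedged illustration of the first two tiers using standard `vault token create` flags (the username `alice` is hypothetical):

```bash
# Tier 1: renewable admin token with unlimited Max TTL
vault token create -policy=jupyterhub-admin -ttl=24h -explicit-max-ttl=0

# Tier 2: orphan user token with a bounded lifetime (orphan creation requires sudo)
vault token create -orphan -policy=jupyter-user-alice -ttl=24h -explicit-max-ttl=168h
```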
### Key Files

- `jupyterhub-admin-policy.hcl`: Vault policy with admin permissions
- `user_policy.hcl`: Template for user-specific policies
- `vault-token-renewer.sh`: Token renewal script
- `jupyterhub-vault-token-external-secret.gomplate.yaml`: ExternalSecret configuration

## Performance Considerations

- **Image Size**: Buun-stack images are ~13GB; plan storage accordingly
- **Pull Time**: Initial pulls take 5-15 minutes depending on network
- **Resource Usage**: Data science workloads require adequate CPU/memory
- **Token Renewal**: Minimal overhead (renewal at TTL/2 intervals)

For production deployments, consider:

- Pre-pulling images to all nodes
- Using faster storage backends
- Configuring resource limits per user
- Setting up monitoring and alerts

## Known Limitations

1. **Annual Token Recreation**: While tokens have unlimited Max TTL, best practice suggests recreating them annually

2. **Token Expiry and Pod Lifecycle**: User tokens have a TTL of 1 day (`NOTEBOOK_VAULT_TOKEN_TTL=24h`) and a maximum TTL of 7 days (`NOTEBOOK_VAULT_TOKEN_MAX_TTL=168h`). Daily usage extends the token for another day, allowing up to 7 days of continuous use. Server pods are automatically restarted after 7 days (`JUPYTERHUB_CULL_MAX_AGE=604800`) to refresh tokens.

3. **Cull Settings**: Server idle timeout is set to 2 hours by default. Adjust `cull.timeout` and `cull.every` in the Helm values for different requirements.

4. **NFS Storage**: When using NFS storage, ensure proper permissions are set on the NFS server. The default `JUPYTER_FSGID` is 100.

5. **ExternalSecret Dependency**: Requires External Secrets Operator to be installed and configured.
docs/resource-management.md (new file, 538 lines)

@@ -0,0 +1,538 @@
# Resource Management

This document describes how to configure resource requests and limits for components in the buun-stack.

## Table of Contents

- [Overview](#overview)
- [QoS Classes](#qos-classes)
- [Using Goldilocks](#using-goldilocks)
- [Configuring Resources](#configuring-resources)
- [Best Practices](#best-practices)
- [Troubleshooting](#troubleshooting)

## Overview

Kubernetes uses resource requests and limits to:

- **Schedule pods** on nodes with sufficient resources
- **Ensure quality of service** through QoS classes
- **Prevent resource exhaustion** by limiting resource consumption

All critical components in buun-stack should have resource requests and limits configured.
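To see how much headroom a node has before adding new requests, its allocated totals can be inspected (the `grep` window is illustrative):

```bash
# Show allocatable capacity and the scheduled request/limit totals for a node
kubectl describe node <node-name> | grep -A 8 "Allocated resources"
```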
## QoS Classes

Kubernetes assigns one of three QoS classes to each pod based on its resource configuration:

### Guaranteed QoS (Highest Priority)

**Requirements:**

- Every container must have CPU and memory requests
- Every container must have CPU and memory limits
- Requests and limits must be **equal** for both CPU and memory

**Characteristics:**

- Highest priority during resource contention
- Last to be evicted when a node runs out of resources
- Predictable performance

**Example:**

```yaml
resources:
  requests:
    cpu: 200m
    memory: 1Gi
  limits:
    cpu: 200m    # Same as requests
    memory: 1Gi  # Same as requests
```

**Use for:** Critical data stores (PostgreSQL, Vault)

### Burstable QoS (Medium Priority)

**Requirements:**

- At least one container has requests or limits
- Does not meet the Guaranteed QoS criteria
- Typically `requests < limits`

**Characteristics:**

- Medium priority during resource contention
- Can burst up to its limits when resources are available
- More resource-efficient than Guaranteed

**Example:**

```yaml
resources:
  requests:
    cpu: 50m
    memory: 128Mi
  limits:
    cpu: 100m      # Can burst up to this
    memory: 256Mi  # Can burst up to this
```

**Use for:** Operators, auxiliary services, variable workloads

### BestEffort QoS (Lowest Priority)

**Requirements:**

- No resource requests or limits configured

**Characteristics:**

- Lowest priority during resource contention
- First to be evicted when a node runs out of resources
- **Not recommended for production**
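To audit which class each pod actually received, the `qosClass` status field (also used in the Verification section below) can be listed cluster-wide:

```bash
# List every pod with its assigned QoS class
kubectl get pods -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,QOS:.status.qosClass
```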
## Using Goldilocks

Goldilocks uses the Vertical Pod Autoscaler (VPA) to recommend resource settings based on actual usage.

### Setup

For installation and detailed setup instructions, see:

- [VPA Installation and Configuration](../vpa/README.md)
- [Goldilocks Installation and Configuration](../goldilocks/README.md)

Quick start:

```bash
# Install VPA
just vpa::install

# Install Goldilocks
just goldilocks::install

# Enable monitoring for a namespace
just goldilocks::enable-namespace <namespace>
```

Access the dashboard at your configured Goldilocks host (e.g., `https://goldilocks.example.com`).

### Using the Dashboard

- Navigate to the namespace
- Expand the "Containers" section for each workload
- Review both the "Guaranteed QoS" and "Burstable QoS" recommendations

### Limitations

Goldilocks only monitors **standard Kubernetes workloads** (Deployment, StatefulSet, DaemonSet). It **does not** automatically create VPAs for:

- Custom Resource Definitions (CRDs)
- Resources managed by operators (e.g., a CloudNativePG Cluster)

For CRDs, use alternative methods:

- Check actual usage: `kubectl top pod <pod-name> -n <namespace>`
- Use Grafana dashboards: `Kubernetes / Compute Resources / Pod`
- Monitor over time and adjust based on observed patterns

### Working with Recommendations

#### For Standard Workloads (Supported by Goldilocks)

Review the Goldilocks recommendations in the dashboard, then configure resources based on your testing status:

**With load testing:**

- Use the Goldilocks recommended values with minimal headroom (1.5-2x)
- Round to clean values (50m, 100m, 200m, 512Mi, 1Gi, etc.)

**Without load testing:**

- Add more headroom to handle unexpected load (3-5x)
- Round to clean values

**Example:**

Goldilocks recommendation: 50m CPU, 128Mi memory

- With load testing: 100m CPU, 256Mi memory (2x, rounded)
- Without load testing: 200m CPU, 512Mi memory (4x, rounded)

#### For CRDs and Unsupported Workloads

Use Grafana to check actual resource usage:

1. **Navigate to the Grafana dashboard**: `Kubernetes / Compute Resources / Pod`
2. **Select the namespace and pod**
3. **Review usage over 24+ hours** to identify peak values

Then apply the same approach:

**With load testing:**

- Use observed peak values with minimal headroom (1.5-2x)

**Without load testing:**

- Add significant headroom (3-5x) for safety

**Example:**

Grafana shows peak usage of 40m CPU, 207Mi memory:

- With load testing: 100m CPU, 512Mi memory (2.5x/2.5x, rounded)
- Without load testing: 200m CPU, 1Gi memory (5x/5x, rounded, Guaranteed QoS)
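One hedged way to capture that 24h+ window without Grafana is to sample `kubectl top` on a schedule and review the log for peaks afterwards (pod name and interval are illustrative):

```bash
# Append a usage sample every 5 minutes
while true; do
  date >> usage.log
  kubectl top pod postgres-cluster-1 -n postgres --no-headers >> usage.log
  sleep 300
done
```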
## Configuring Resources

### Helm-Managed Components

For components installed via Helm, configure resources in the values file.

#### Example: PostgreSQL Operator (CNPG)

**File:** `postgres/cnpg-values.yaml`

```yaml
resources:
  requests:
    cpu: 50m
    memory: 128Mi
  limits:
    cpu: 100m
    memory: 256Mi
```

**Apply:**

```bash
cd postgres
helm upgrade --install cnpg cnpg/cloudnative-pg --version ${CNPG_CHART_VERSION} \
  -n ${CNPG_NAMESPACE} -f cnpg-values.yaml
```

#### Example: Vault

**File:** `vault/vault-values.gomplate.yaml`

```yaml
server:
  resources:
    requests:
      cpu: 50m
      memory: 512Mi
    limits:
      cpu: 50m
      memory: 512Mi

injector:
  resources:
    requests:
      cpu: 50m
      memory: 128Mi
    limits:
      cpu: 50m
      memory: 128Mi

csi:
  enabled: true
  agent:
    resources:
      requests:
        cpu: 50m
        memory: 128Mi
      limits:
        cpu: 50m
        memory: 128Mi
  resources:
    requests:
      cpu: 50m
      memory: 64Mi
    limits:
      cpu: 50m
      memory: 128Mi
```

**Apply:**

```bash
cd vault
gomplate -f vault-values.gomplate.yaml -o vault-values.yaml
helm upgrade vault hashicorp/vault --version ${VAULT_CHART_VERSION} \
  -n vault -f vault-values.yaml
```

**Note:** After updating StatefulSet resources, delete the pod to apply the changes:

```bash
kubectl delete pod vault-0 -n vault
# Unseal Vault after restart
kubectl exec -n vault vault-0 -- vault operator unseal <UNSEAL_KEY>
```

### CRD-Managed Components

For components managed by Custom Resource Definitions, update the custom resource directly.

#### Example: PostgreSQL Cluster (CloudNativePG)

**Update the values file**

**File:** `postgres/postgres-cluster-values.gomplate.yaml`

```yaml
cluster:
  instances: 1

  # Resource configuration (Guaranteed QoS)
  resources:
    requests:
      cpu: 200m
      memory: 1Gi
    limits:
      cpu: 200m
      memory: 1Gi

  storage:
    size: {{ .Env.POSTGRES_STORAGE_SIZE }}
```

**Apply via justfile:**

```bash
just postgres::create-cluster
```

**Restart the pod to apply changes:**

```bash
kubectl delete pod postgres-cluster-1 -n postgres
kubectl wait --for=condition=Ready pod/postgres-cluster-1 -n postgres --timeout=180s
```

**Data Safety:** PostgreSQL data is stored in a PersistentVolumeClaim (PVC) and is preserved across the pod restart.
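A quick confirmation that the data volume survived the restart (the PVC name will vary with your cluster name):

```bash
# The PVC should still be Bound after the pod restart
kubectl get pvc -n postgres
```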
### Verification

After applying resource configurations:

**1. Check resource settings:**

```bash
# For standard workloads
kubectl get deployment <name> -n <namespace> -o jsonpath='{.spec.template.spec.containers[0].resources}' | jq

# For pods
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[0].resources}' | jq
```

**2. Verify the QoS class:**

```bash
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.qosClass}'
```

**3. Check actual usage:**

```bash
kubectl top pod <pod-name> -n <namespace>
```

## Best Practices

### Choosing a QoS Class

| Component Type | Recommended QoS | Rationale |
|----------------|-----------------|-----------|
| **Data stores** (PostgreSQL, Vault) | Guaranteed | Critical services, data integrity, predictable performance |
| **Operators** (CNPG, etc.) | Burstable | Lightweight controllers, occasional spikes |
| **Auxiliary services** (injectors, CSI providers) | Burstable | Support services, variable load |

### Setting Resource Values

**1. Start with actual usage:**

```bash
# Check current usage
kubectl top pod <pod-name> -n <namespace>

# Check historical usage in Grafana
# Dashboard: Kubernetes / Compute Resources / Pod
```

**2. Add appropriate headroom:**

| Scenario | Recommended Multiplier | Example |
|----------|------------------------|---------|
| Stable, predictable load | 2-3x current usage | Current: 40m → Set: 100m |
| Variable load | 5-10x current usage | Current: 40m → Set: 200m |
| Growth expected | 5-10x current usage | Current: 200Mi → Set: 1Gi |

**3. Use round numbers:**

- CPU: 50m, 100m, 200m, 500m, 1000m (1 core)
- Memory: 64Mi, 128Mi, 256Mi, 512Mi, 1Gi, 2Gi

**4. Monitor and adjust:**

- Check usage patterns after 1-2 weeks
- Adjust based on observed peak usage
- Iterate as the workload changes

### Resource Configuration Examples

Based on actual deployments in buun-stack:

```yaml
# PostgreSQL Operator (Burstable)
resources:
  requests:
    cpu: 50m
    memory: 128Mi
  limits:
    cpu: 100m
    memory: 256Mi

# PostgreSQL Cluster (Guaranteed)
resources:
  requests:
    cpu: 200m
    memory: 1Gi
  limits:
    cpu: 200m
    memory: 1Gi

# Vault Server (Guaranteed)
resources:
  requests:
    cpu: 50m
    memory: 512Mi
  limits:
    cpu: 50m
    memory: 512Mi

# Vault Agent Injector (Guaranteed)
resources:
  requests:
    cpu: 50m
    memory: 128Mi
  limits:
    cpu: 50m
    memory: 128Mi
```

## Troubleshooting

### Pod Stuck in Pending State

**Symptom:**

```plain
NAME     READY   STATUS    RESTARTS   AGE
my-pod   0/1     Pending   0          5m
```

**Check events:**

```bash
kubectl describe pod <pod-name> -n <namespace> | tail -20
```

**Common causes:**

#### Insufficient resources

```plain
FailedScheduling: 0/1 nodes are available: 1 Insufficient cpu/memory
```

**Solution:** Reduce resource requests or add more nodes.

#### Pod anti-affinity

```plain
FailedScheduling: 0/1 nodes are available: 1 node(s) didn't match pod anti-affinity rules
```

**Solution:** Delete the old pod to allow the new pod to schedule:

```bash
kubectl delete pod <old-pod-name> -n <namespace>
```

### OOMKilled (Out of Memory)

**Symptom:**

```plain
NAME     READY   STATUS      RESTARTS   AGE
my-pod   0/1     OOMKilled   1          5m
```

**Solution:**

#### Check whether the memory limit is sufficient

```bash
kubectl top pod <pod-name> -n <namespace>
```

#### Increase memory limits

```yaml
resources:
  limits:
    memory: 2Gi  # Increased from 1Gi
```
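Before raising the limit, it can help to confirm the kill reason from the container's last termination state (the jsonpath below assumes the first container was the one killed):

```bash
# Expect "OOMKilled" here if memory was the cause
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
```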
### Helm Stuck in pending-upgrade

**Symptom:**

```bash
helm status <release> -n <namespace>
# STATUS: pending-upgrade
```

**Solution:**

```bash
# Find and remove the pending release secret
kubectl get secrets -n <namespace> -l owner=helm,name=<release> --sort-by=.metadata.creationTimestamp
kubectl delete secret sh.helm.release.v1.<release>.v<pending-version> -n <namespace>

# Verify the status is back to deployed
helm status <release> -n <namespace>

# Re-run the upgrade
helm upgrade <release> <chart> -n <namespace> -f values.yaml
```

### VPA Not Providing Recommendations

**Symptom:**

- The VPA shows "NoPodsMatched" or "ConfigUnsupported"
- Goldilocks shows an empty containers section

**Cause:**

VPA cannot monitor Custom Resource Definitions (CRDs) directly.

**Solution:**

Use alternative monitoring methods:

1. `kubectl top pod`
2. Grafana dashboards
3. Prometheus queries

For CRDs, configure resources manually based on observed usage patterns.
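Before falling back to manual tuning, it is also worth reading the VPA object's own status, which reports these conditions (assumes a VPA object exists in the namespace):

```bash
# The Conditions section explains why no recommendation is produced
kubectl describe vpa <vpa-name> -n <namespace>
```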
## References

- [Kubernetes Resource Management](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/)
- [Kubernetes QoS Classes](https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/)
- [Goldilocks Documentation](https://goldilocks.docs.fairwinds.com/)
- [CloudNativePG Resource Management](https://cloudnative-pg.io/documentation/current/resource_management/)
jupyterhub/README.md

@@ -1,24 +1,146 @@
# JupyterHub

JupyterHub provides a multi-user Jupyter notebook environment with Keycloak OIDC authentication, Vault integration for secure secrets management, and custom kernel images for data science workflows.

## Table of Contents

- [Installation](#installation)
- [Prerequisites](#prerequisites)
- [Access](#access)
- [Kernel Images](#kernel-images)
- [Profile Configuration](#profile-configuration)
- [Buun-Stack Images](#buun-stack-images)
- [buunstack Package & SecretStore](#buunstack-package--secretstore)
- [Vault Integration](#vault-integration)
- [Token Renewal Implementation](#token-renewal-implementation)
- [Storage Options](#storage-options)
- [Configuration](#configuration)
- [Custom Container Images](#custom-container-images)
- [Management](#management)
- [Troubleshooting](#troubleshooting)
- [Technical Implementation Details](#technical-implementation-details)
- [Performance Considerations](#performance-considerations)
- [Known Limitations](#known-limitations)

## Installation

Install JupyterHub with interactive configuration:

```bash
just jupyterhub::install
```

This will prompt for:

- JupyterHub host (FQDN)
- NFS PV usage (if Longhorn is installed)
- NFS server details (if NFS is enabled)
- Vault integration setup (requires root token for initial setup)

## Prerequisites

- Keycloak must be installed and configured
- For NFS storage: Longhorn must be installed
- For Vault integration: Vault and External Secrets Operator must be installed
- Helm repository must be accessible

## Access

Access JupyterHub at your configured host (e.g., `https://jupyter.example.com`) and authenticate via Keycloak.

## Kernel Images

### Important Note

Building and using custom buun-stack images requires building the `buunstack` Python package first. The package wheel file will be included in the Docker image during build.

JupyterHub supports multiple kernel image profiles:

### Standard Profiles

- **minimal**: Basic Python environment
- **base**: Python with common data science packages
- **datascience**: Full data science stack (default)
- **pyspark**: PySpark for big data processing
- **pytorch**: PyTorch for machine learning
- **tensorflow**: TensorFlow for machine learning

### Buun-Stack Profiles

- **buun-stack**: Comprehensive data science environment with Vault integration
- **buun-stack-cuda**: CUDA-enabled version with GPU support

## Profile Configuration

Enable/disable profiles using environment variables:

```bash
# Enable buun-stack profile (CPU version)
JUPYTER_PROFILE_BUUN_STACK_ENABLED=true

# Enable buun-stack CUDA profile (GPU version)
JUPYTER_PROFILE_BUUN_STACK_CUDA_ENABLED=true

# Disable default datascience profile
JUPYTER_PROFILE_DATASCIENCE_ENABLED=false
```

Available profile variables:

- `JUPYTER_PROFILE_MINIMAL_ENABLED`
- `JUPYTER_PROFILE_BASE_ENABLED`
- `JUPYTER_PROFILE_DATASCIENCE_ENABLED`
- `JUPYTER_PROFILE_PYSPARK_ENABLED`
- `JUPYTER_PROFILE_PYTORCH_ENABLED`
- `JUPYTER_PROFILE_TENSORFLOW_ENABLED`
- `JUPYTER_PROFILE_BUUN_STACK_ENABLED`
- `JUPYTER_PROFILE_BUUN_STACK_CUDA_ENABLED`

Only `JUPYTER_PROFILE_DATASCIENCE_ENABLED` is true by default.
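Profile flags live in `.env.local`; a hedged way to apply a change is to re-run the install recipe, which re-renders the Helm values from the environment:

```bash
# Enable the CUDA profile, then re-render and upgrade the deployment
echo 'JUPYTER_PROFILE_BUUN_STACK_CUDA_ENABLED=true' >> .env.local
just jupyterhub::install
```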
## Buun-Stack Images

Buun-stack images provide comprehensive data science environments with:

- All standard data science packages (NumPy, Pandas, Scikit-learn, etc.)
- Deep learning frameworks (PyTorch, TensorFlow, Keras)
- Big data tools (PySpark, Apache Arrow)
- NLP and ML libraries (LangChain, Transformers, spaCy)
- Database connectors and tools
- **Vault integration** with the `buunstack` Python package

### Building Custom Images

Build and push buun-stack images to your registry:

```bash
# Build images (includes building the buunstack Python package)
just jupyterhub::build-kernel-images

# Push to registry
just jupyterhub::push-kernel-images
```

The build process:

1. Builds the `buunstack` Python package wheel
2. Copies the wheel into the Docker build context
3. Installs the wheel in the Docker image
4. Cleans up temporary files

⚠️ **Note**: Buun-stack images are comprehensive and large (~13GB). Initial image pulls and deployments take significant time due to the extensive package set.

### Image Configuration

Configure image settings in `.env.local`:

```bash
# Image registry
IMAGE_REGISTRY=localhost:30500

# Image tag (current default)
JUPYTER_PYTHON_KERNEL_TAG=python-3.12-28
```

## buunstack Package & SecretStore

@@ -60,6 +182,305 @@ For detailed documentation, usage examples, and API reference, see:

[📖 buunstack Package Documentation](../python-package/README.md)

## Vault Integration

### Overview

Vault integration enables secure secrets management directly from Jupyter notebooks. The system uses:

- **ExternalSecret** to fetch the admin token from Vault
- **Renewable tokens** with unlimited Max TTL to avoid 30-day system limitations
- **Token renewal script** that automatically renews tokens at TTL/2 intervals (minimum 30 seconds)
- **User-specific tokens** created during notebook spawn with isolated access

### Architecture

```plain
┌────────────────────────────────────────────────────────────────┐
│                       JupyterHub Hub Pod                       │
│                                                                │
│ ┌──────────────┐  ┌────────────────┐  ┌────────────────────┐   │
│ │ Hub          │  │ Token Renewer  │  │ ExternalSecret     │   │
│ │ Container    │◄─┤ Sidecar        │◄─┤ (mounted as        │   │
│ │              │  │                │  │  Secret)           │   │
│ └──────────────┘  └────────────────┘  └────────────────────┘   │
│        │                  │                     ▲              │
│        │                  │                     │              │
│        ▼                  ▼                     │              │
│ ┌──────────────────────────────────┐            │              │
│ │ /vault/secrets/vault-token       │            │              │
│ │ (Admin token for user creation)  │            │              │
│ └──────────────────────────────────┘            │              │
└─────────────────────────────────────────────────┼──────────────┘
                                                  │
                                      ┌───────────▼──────────┐
                                      │        Vault         │
                                      │  secret/jupyterhub/  │
                                      │  vault-token         │
                                      └──────────────────────┘
```

### Prerequisites

Vault integration requires:

- Vault server installed and configured
- External Secrets Operator installed
- ClusterSecretStore configured for Vault
- Buun-stack kernel images (standard images don't include Vault integration)

### Setup

Vault integration is configured during JupyterHub installation:

```bash
just jupyterhub::install
# Answer "yes" when prompted about Vault integration
# Provide Vault root token when prompted
```

The setup process:

1. Creates the `jupyterhub-admin` policy with the necessary permissions, including `sudo` for orphan token creation
2. Creates a renewable admin token with 24h TTL and unlimited Max TTL
3. Stores the token in Vault at `secret/jupyterhub/vault-token`
4. Creates an ExternalSecret to fetch the token from Vault
5. Deploys the token renewal sidecar for automatic renewal
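After setup completes, a hedged spot-check that step 3 stored the token (requires a Vault token with read access to that path):

```bash
# The admin token should be readable at the documented path
vault kv get secret/jupyterhub/vault-token
```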
### Usage in Notebooks

With Vault integration enabled, use the `buunstack` package in notebooks:

```python
from buunstack import SecretStore

# Initialize (uses the pre-acquired user-specific token)
secrets = SecretStore()

# Store secrets
secrets.put('api-keys',
            openai='sk-...',
            github='ghp_...',
            database_url='postgresql://...')

# Retrieve secrets
api_keys = secrets.get('api-keys')
openai_key = secrets.get('api-keys', field='openai')

# List all secrets
secret_names = secrets.list()

# Delete secrets or specific fields
secrets.delete('old-api-key')                # Delete entire secret
secrets.delete('api-keys', field='github')   # Delete only the github field
```

### Security Features

- **User isolation**: Each user receives an orphan token with access only to their namespace
- **Automatic renewal**: The token renewal script renews the admin token at TTL/2 intervals (minimum 30 seconds)
- **ExternalSecret integration**: The admin token is fetched securely from Vault
- **Orphan tokens**: User tokens are orphan tokens, not limited by parent policy restrictions
- **Audit trail**: All secret access is logged in Vault

### Token Management

#### Admin Token

The admin token is managed through:

1. **Creation**: `just jupyterhub::create-jupyterhub-vault-token` creates a renewable token
2. **Storage**: Stored in Vault at `secret/jupyterhub/vault-token`
3. **Retrieval**: An ExternalSecret fetches it and mounts it as a Kubernetes Secret
4. **Renewal**: The `vault-token-renewer.sh` script renews it at TTL/2 intervals
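Steps 3 and 4 can be spot-checked from the cluster side (resource names follow the ExternalSecret configuration referenced below):

```bash
# Confirm the ExternalSecret synced and the Kubernetes Secret exists
kubectl get externalsecret,secret -n jupyter | grep -i vault-token
```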
|

#### User Tokens

User tokens are created dynamically:

1. **Pre-spawn hook** reads the admin token from `/vault/secrets/vault-token`
2. **Creates user policy** `jupyter-user-{username}` with restricted access
3. **Creates orphan token** with the user policy (requires `sudo` permission); a CLI sketch follows this list
4. **Sets environment variable** `NOTEBOOK_VAULT_TOKEN` in the notebook container
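
A minimal sketch of steps 2 and 3 in Vault CLI terms (the real logic lives in `pre_spawn_hook.gomplate.py`; the policy body below is an assumption about what a restricted per-user policy looks like, the actual template being `user_policy.hcl`):

```bash
# Illustrative only: per-user policy and orphan token for user "alice"

# Assumed policy shape: access restricted to the user's own KV paths
vault policy write jupyter-user-alice - <<'EOF'
path "secret/data/jupyter/users/alice/*" {
  capabilities = ["create", "read", "update", "delete", "list"]
}
path "secret/metadata/jupyter/users/alice/*" {
  capabilities = ["list", "delete"]
}
EOF

# Orphan token creation requires sudo on the token-create endpoint
vault token create \
  -orphan \
  -policy=jupyter-user-alice \
  -ttl=24h \
  -explicit-max-ttl=168h
```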

## Token Renewal Implementation

### Admin Token Renewal

Admin token renewal is handled by a sidecar container (`vault-token-renewer`) running alongside the JupyterHub hub.

**Implementation Details:**

1. **Renewal Script**: `/vault/config/vault-token-renewer.sh`
   - Runs in the `vault-token-renewer` sidecar container
   - Uses the Vault 1.17.5 image, which ships with the HashiCorp Vault CLI

2. **Environment-Based TTL Configuration**:

   ```bash
   # Reads TTL from environment variable (set in .env.local)
   TTL_RAW="${JUPYTERHUB_VAULT_TOKEN_TTL}"   # e.g., "5m", "24h"

   # Converts TTL_RAW to seconds (suffix parsing omitted here),
   # then derives the renewal interval
   RENEWAL_INTERVAL=$((TTL_SECONDS / 2))     # TTL/2, with a 30s minimum
   ```

3. **Token Source**: ExternalSecret → Kubernetes Secret → mounted file

   ```bash
   # Token retrieved from the ExternalSecret-managed mount
   ADMIN_TOKEN=$(cat /vault/admin-token/token)
   ```

4. **Renewal Loop**:

   ```bash
   while true; do
     vault token renew >/dev/null 2>&1
     sleep "${RENEWAL_INTERVAL}"
   done
   ```

5. **Error Handling**: If renewal fails, the script re-reads the token from the ExternalSecret mount, as sketched below
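
Putting items 3-5 together, the core of the loop plausibly looks like the following (a sketch consistent with the behavior described above, not the verbatim script):

```bash
# Sketch: renew at TTL/2 intervals; on failure, fall back to re-reading
# the token that the ExternalSecret keeps mounted into the container
while true; do
  if ! vault token renew >/dev/null 2>&1; then
    # Renewal failed: the ExternalSecret may have delivered a fresh token
    export VAULT_TOKEN="$(cat /vault/admin-token/token)"
  fi
  sleep "${RENEWAL_INTERVAL}"
done
```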

**Key Files:**

- `vault-token-renewer.sh`: Main renewal script
- `jupyterhub-vault-token-external-secret.gomplate.yaml`: ExternalSecret configuration
- `vault-token-renewer-config` ConfigMap: Contains the renewal script

### User Token Renewal

User token renewal is handled within the notebook environment by the `buunstack` Python package.

**Implementation Details:**

1. **Token Source**: Environment variable set by the pre-spawn hook

   ```python
   # In pre_spawn_hook.gomplate.py
   spawner.environment["NOTEBOOK_VAULT_TOKEN"] = user_vault_token
   ```

2. **Automatic Renewal**: Built into `SecretStore` class operations

   ```python
   # In buunstack/secrets.py
   def _ensure_authenticated(self):
       token_info = self.client.auth.token.lookup_self()
       ttl = token_info.get("data", {}).get("ttl", 0)
       renewable = token_info.get("data", {}).get("renewable", False)

       # Renew if TTL < 10 minutes and renewable
       if renewable and ttl > 0 and ttl < 600:
           self.client.auth.token.renew_self()
   ```

3. **Renewal Trigger**: Every `SecretStore` operation (get, put, delete, list)
   - Checks token validity before the operation
   - Automatically renews if TTL < 10 minutes
   - Transparent to user code

4. **Token Configuration** (set during creation):
   - **TTL**: `NOTEBOOK_VAULT_TOKEN_TTL` (default: 24h = 1 day)
   - **Max TTL**: `NOTEBOOK_VAULT_TOKEN_MAX_TTL` (default: 168h = 7 days)
   - **Policy**: User-specific `jupyter-user-{username}`
   - **Type**: Orphan token (independent of the parent token lifecycle)

5. **Expiry Handling**: When a token reaches its Max TTL:
   - It cannot be renewed further
   - The user must restart the notebook server (which triggers new token creation)
   - In practice this is avoided by the `JUPYTERHUB_CULL_MAX_AGE` setting, which caps pod age at the 7-day Max TTL; a TTL check is sketched below
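
To see how much lifetime the current user token has left, you can query Vault's token lookup endpoint from a notebook terminal (a sketch: it assumes `curl` and `jq` are available in the kernel image and that `VAULT_ADDR` points at your Vault server):

```bash
# Check remaining TTL and renewability of the user token (illustrative)
curl -s -H "X-Vault-Token: ${NOTEBOOK_VAULT_TOKEN}" \
  "${VAULT_ADDR}/v1/auth/token/lookup-self" | jq '.data | {ttl, renewable}'
```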

**Key Files:**

- `pre_spawn_hook.gomplate.py`: User token creation logic
- `buunstack/secrets.py`: Token renewal implementation
- `user_policy.hcl`: User token permissions template

### Token Lifecycle Summary

```plain
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Admin Token   │    │    User Token    │    │  Pod Lifecycle  │
│                 │    │                  │    │                 │
│ Created: Manual │    │ Created: Spawn   │    │ Max Age: 7 days │
│ TTL: 5m-24h     │    │ TTL: 1 day       │    │ Auto-restart    │
│ Max TTL: ∞      │    │ Max TTL: 7 days  │    │ at Max TTL      │
│ Renewal: Auto   │    │ Renewal: Auto    │    │                 │
│ Interval: TTL/2 │    │ Trigger: Usage   │    │                 │
└─────────────────┘    └──────────────────┘    └─────────────────┘
         │                      │                      │
         ▼                      ▼                      ▼
 vault-token-renewer       buunstack.py           cull.maxAge
       sidecar              SecretStore           pod restart
```

## Storage Options

### Default Storage

Uses Kubernetes PersistentVolumes for user home directories.

### NFS Storage

For shared storage across nodes, configure NFS:

```bash
JUPYTERHUB_NFS_PV_ENABLED=true
JUPYTER_NFS_IP=192.168.10.1
JUPYTER_NFS_PATH=/volume1/drive1/jupyter
```

NFS storage requires:

- Longhorn storage system installed
- NFS server accessible from cluster nodes
- Proper NFS export permissions configured (a quick check follows this list)
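
Before enabling NFS, it's worth confirming the export is reachable from a cluster node (a quick sanity check; assumes the `showmount` utility from nfs-common/nfs-utils is installed on the node):

```bash
# List exports offered by the NFS server
showmount -e 192.168.10.1

# Optionally test-mount the export from a cluster node, then unmount
sudo mount -t nfs 192.168.10.1:/volume1/drive1/jupyter /mnt
sudo umount /mnt
```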

## Configuration

### Environment Variables

Key configuration variables:

```bash
# Basic settings
JUPYTERHUB_NAMESPACE=jupyter
JUPYTERHUB_CHART_VERSION=4.2.0
JUPYTERHUB_OIDC_CLIENT_ID=jupyterhub

# Keycloak integration
KEYCLOAK_REALM=buunstack

# Storage
JUPYTERHUB_NFS_PV_ENABLED=false

# Vault integration
JUPYTERHUB_VAULT_INTEGRATION_ENABLED=false
VAULT_ADDR=https://vault.example.com

# Image settings
JUPYTER_PYTHON_KERNEL_TAG=python-3.12-28
IMAGE_REGISTRY=localhost:30500

# Vault token TTL settings
JUPYTERHUB_VAULT_TOKEN_TTL=24h      # Admin token: renewed at TTL/2 intervals
NOTEBOOK_VAULT_TOKEN_TTL=24h        # User token: 1 day (renewed on usage)
NOTEBOOK_VAULT_TOKEN_MAX_TTL=168h   # User token: 7 days max

# Server pod lifecycle settings
JUPYTERHUB_CULL_MAX_AGE=604800      # Max pod age in seconds (7 days = 604800s)
                                    # Should be <= NOTEBOOK_VAULT_TOKEN_MAX_TTL

# Logging
JUPYTER_BUUNSTACK_LOG_LEVEL=warning # Options: debug, info, warning, error
```

### Advanced Configuration

Customize JupyterHub behavior by editing the `jupyterhub-values.gomplate.yaml` template before installation.

## Custom Container Images

JupyterHub uses custom container images with pre-installed data science tools and integrations:

GPU-enabled notebook image based on `jupyter/pytorch-notebook:cuda12`:

[📖 See Image Documentation](./images/datastack-cuda-notebook/README.md)

Both images are based on the official [Jupyter Docker Stacks](https://github.com/jupyter/docker-stacks) and include all standard data science libraries (NumPy, pandas, scikit-learn, matplotlib, etc.).

## Management

### Uninstall

```bash
just jupyterhub::uninstall
```

This removes:

- JupyterHub deployment
- User pods
- PVCs
- ExternalSecret

### Update

Upgrade to newer versions:

```bash
# Update the image tag in .env.local
export JUPYTER_PYTHON_KERNEL_TAG=python-3.12-29

# Rebuild and push images
just jupyterhub::build-kernel-images
just jupyterhub::push-kernel-images

# Upgrade JupyterHub deployment
just jupyterhub::install
```

### Manual Token Refresh

If needed, manually refresh the admin token:

```bash
# Create a new renewable token
just jupyterhub::create-jupyterhub-vault-token

# Restart JupyterHub to pick up the new token
kubectl rollout restart deployment/hub -n jupyter
```

## Troubleshooting

### Image Pull Issues

Buun-stack images are large, and initial pulls may time out:

```bash
# Check pod status
kubectl get pods -n jupyter

# Check image pull progress
kubectl describe pod <pod-name> -n jupyter

# Increase the Helm timeout if needed
helm upgrade jupyterhub jupyterhub/jupyterhub --timeout=30m -f jupyterhub-values.yaml
```

### Vault Integration Issues

Check token and authentication:

```bash
# Check ExternalSecret status
kubectl get externalsecret -n jupyter jupyterhub-vault-token

# Check if the Secret was created
kubectl get secret -n jupyter jupyterhub-vault-token

# Check token renewal logs
kubectl logs -n jupyter -l app.kubernetes.io/component=hub -c vault-token-renewer

# In a notebook cell, verify the environment variable is set
%env NOTEBOOK_VAULT_TOKEN
```

Common issues:

1. **"child policies must be subset of parent"**: The admin policy needs the `sudo` permission for orphan tokens
2. **Token not found**: Check the ExternalSecret and ClusterSecretStore configuration
3. **Permission denied**: Verify the `jupyterhub-admin` policy has all required permissions

### Authentication Issues

Verify the Keycloak client configuration:

```bash
# Check the client exists
just keycloak::get-client buunstack jupyterhub

# Check redirect URIs
just keycloak::update-client buunstack jupyterhub \
  "https://your-jupyter-host/hub/oauth_callback"
```

## Technical Implementation Details

### Helm Chart Version

JupyterHub uses the official Zero to JupyterHub (Z2JH) Helm chart:

- Chart: `jupyterhub/jupyterhub`
- Version: `4.2.0` (configurable via `JUPYTERHUB_CHART_VERSION`)
- Documentation: <https://z2jh.jupyter.org/>

### Token System Architecture

The system uses a three-tier token approach:

1. **Renewable Admin Token**:
   - Created with `explicit-max-ttl=0` (unlimited Max TTL)
   - Renewed automatically at TTL/2 intervals (minimum 30 seconds)
   - Stored in Vault and fetched via ExternalSecret
2. **Orphan User Tokens**:
   - Created with the `create_orphan()` API call
   - Not limited by parent token policies
   - Individual TTL and Max TTL settings
3. **Token Renewal Script**:
   - Runs as a sidecar container
   - Reads the token from the ExternalSecret mount
   - Handles renewal and re-retrieval on failure

### Key Files

- `jupyterhub-admin-policy.hcl`: Vault policy with admin permissions
- `user_policy.hcl`: Template for user-specific policies
- `vault-token-renewer.sh`: Token renewal script
- `jupyterhub-vault-token-external-secret.gomplate.yaml`: ExternalSecret configuration

## Performance Considerations

- **Image Size**: Buun-stack images are ~13GB; plan storage accordingly
- **Pull Time**: Initial pulls take 5-15 minutes depending on network speed
- **Resource Usage**: Data science workloads require adequate CPU and memory
- **Token Renewal**: Minimal overhead (renewal at TTL/2 intervals)

For production deployments, consider:

- Pre-pulling images to all nodes
- Using faster storage backends
- Configuring resource limits per user
- Setting up monitoring and alerts

## Known Limitations

1. **Annual Token Recreation**: While tokens have unlimited Max TTL, best practice suggests recreating them annually
2. **Token Expiry and Pod Lifecycle**: User tokens have a TTL of 1 day (`NOTEBOOK_VAULT_TOKEN_TTL=24h`) and a maximum TTL of 7 days (`NOTEBOOK_VAULT_TOKEN_MAX_TTL=168h`). Daily usage extends the token for another day, allowing up to 7 days of continuous use. Server pods are automatically restarted after 7 days (`JUPYTERHUB_CULL_MAX_AGE=604800`) to refresh tokens.
3. **Cull Settings**: The server idle timeout is 2 hours by default. Adjust `cull.timeout` and `cull.every` in the Helm values for different requirements; an example follows this list
4. **NFS Storage**: When using NFS storage, ensure proper permissions are set on the NFS server. The default `JUPYTER_FSGID` is 100
5. **ExternalSecret Dependency**: Requires the External Secrets Operator to be installed and configured
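
For the cull settings in particular, the values can be overridden at upgrade time (a sketch; `cull.timeout` and `cull.every` are standard Z2JH chart values, and the numbers below are examples, not recommendations):

```bash
# Example: cull idle servers after 4 hours, checking every 10 minutes
helm upgrade jupyterhub jupyterhub/jupyterhub \
  -n jupyter \
  --reuse-values \
  --set cull.timeout=14400 \
  --set cull.every=600
```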

Create `.env.claude` with Trino connection settings:

```bash
# Trino Connection (Password Authentication)
TRINO_HOST=trino.yourdomain.com
TRINO_PORT=443
TRINO_SCHEME=https
TRINO_SSL=true
```

Create `~/.env.claude` in your home directory with 1Password references:

```bash
# Trino Connection (Password Authentication)
TRINO_HOST=trino.yourdomain.com
TRINO_PORT=443
TRINO_SCHEME=https
TRINO_SSL=true
```

In the `cli` recipe of the justfile, missing connection values are prompted for interactively:

```bash
TRINO_HOST="${TRINO_HOST}"
while [ -z "${TRINO_HOST}" ]; do
    TRINO_HOST=$(gum input --prompt="Trino host (FQDN): " --width=100 \
        --placeholder="e.g., trino.yourdomain.com")
done
TRINO_USER="{{ user }}"
if [ -z "${TRINO_USER}" ]; then
    # ... (recipe continues)
```