# JupyterHub

JupyterHub provides a multi-user Jupyter notebook environment with Keycloak OIDC authentication, Vault integration for secure secrets management, and custom kernel images for data science workflows.

## Installation

Install JupyterHub with interactive configuration:

```bash
just jupyterhub::install
```

This will prompt for:

- JupyterHub host (FQDN)
- NFS PV usage (if Longhorn is installed)
- NFS server details (if NFS is enabled)
- Vault integration setup

### Prerequisites

- Keycloak must be installed and configured
- For NFS storage: Longhorn must be installed
- For Vault integration: Vault must be installed and configured

## Kernel Images

JupyterHub supports multiple kernel image profiles:

### Standard Profiles

- **minimal**: Basic Python environment
- **base**: Python with common data science packages
- **datascience**: Full data science stack (default)
- **pyspark**: PySpark for big data processing
- **pytorch**: PyTorch for machine learning
- **tensorflow**: TensorFlow for machine learning

### Buun-Stack Profiles

- **buun-stack**: Comprehensive data science environment with Vault integration
- **buun-stack-cuda**: CUDA-enabled version with GPU support

## Profile Configuration

Enable/disable profiles using environment variables:

```bash
# Enable buun-stack profile (CPU version)
export JUPYTER_PROFILE_BUUN_STACK_ENABLED=true

# Enable buun-stack CUDA profile (GPU version)
export JUPYTER_PROFILE_BUUN_STACK_CUDA_ENABLED=true

# Disable default datascience profile
export JUPYTER_PROFILE_DATASCIENCE_ENABLED=false
```

Available profile variables:

- `JUPYTER_PROFILE_MINIMAL_ENABLED`
- `JUPYTER_PROFILE_BASE_ENABLED`
- `JUPYTER_PROFILE_DATASCIENCE_ENABLED`
- `JUPYTER_PROFILE_PYSPARK_ENABLED`
- `JUPYTER_PROFILE_PYTORCH_ENABLED`
- `JUPYTER_PROFILE_TENSORFLOW_ENABLED`
- `JUPYTER_PROFILE_BUUN_STACK_ENABLED`
- `JUPYTER_PROFILE_BUUN_STACK_CUDA_ENABLED`

Only `JUPYTER_PROFILE_DATASCIENCE_ENABLED` is true by default.

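To preview which profiles a given set of flags would enable, the defaults described in this section can be sketched in Python. This is only an illustration: the function name `enabled_profiles` is hypothetical, and the assumption that only the literal value `true` (case-insensitive) switches a profile on may not match how the installer actually parses the variables.

```python
import os

# Profile flags and defaults as documented in this section:
# only the datascience profile is enabled by default.
PROFILE_DEFAULTS = {
    "MINIMAL": False,
    "BASE": False,
    "DATASCIENCE": True,
    "PYSPARK": False,
    "PYTORCH": False,
    "TENSORFLOW": False,
    "BUUN_STACK": False,
    "BUUN_STACK_CUDA": False,
}


def enabled_profiles(env=None):
    """Return profile names whose JUPYTER_PROFILE_<NAME>_ENABLED flag is on.

    Assumption: only the string "true" (case-insensitive) enables a profile;
    an unset variable falls back to the documented default.
    """
    env = os.environ if env is None else env
    result = []
    for name, default in PROFILE_DEFAULTS.items():
        raw = env.get(f"JUPYTER_PROFILE_{name}_ENABLED")
        enabled = default if raw is None else raw.strip().lower() == "true"
        if enabled:
            # BUUN_STACK_CUDA -> buun-stack-cuda, matching the profile names above
            result.append(name.lower().replace("_", "-"))
    return result
```

Calling `enabled_profiles()` with no argument reads the current shell environment, so it can be run just before `just jupyterhub::install` as a sanity check.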
## Buun-Stack Images

Buun-stack images provide comprehensive data science environments with:

- All standard data science packages (NumPy, Pandas, Scikit-learn, etc.)
- Deep learning frameworks (PyTorch, TensorFlow, Keras)
- Big data tools (PySpark, Apache Arrow)
- NLP and ML libraries (LangChain, Transformers, spaCy)
- Database connectors and tools
- **Vault integration** with `buunstack` Python package

### Building Custom Images

Build and push buun-stack images to your registry:

```bash
# Build images
just jupyterhub::build-kernel-images

# Push to registry
just jupyterhub::push-kernel-images
```

⚠️ **Note**: Buun-stack images are comprehensive and large (~13GB). Initial image pulls and deployments take significant time due to the extensive package set.

### Image Configuration

Configure image settings in `.env.local`:

```bash
# Image registry
IMAGE_REGISTRY=localhost:30500

# Image tag
JUPYTER_PYTHON_KERNEL_TAG=python-3.12-1
```

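As a rough illustration of how these two settings might combine into a full image reference, the sketch below parses `KEY=VALUE` lines in the style of `.env.local` and joins registry, image name, and tag. The helper names are hypothetical, the image name passed in is a placeholder (the real repository name is not stated here), and it is an assumption that the tag is used verbatim.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def image_reference(settings, image_name):
    """Compose <registry>/<image_name>:<tag>.

    image_name is a hypothetical placeholder supplied by the caller.
    """
    registry = settings["IMAGE_REGISTRY"]
    tag = settings["JUPYTER_PYTHON_KERNEL_TAG"]
    return f"{registry}/{image_name}:{tag}"
```

For example, `image_reference(parse_env(open(".env.local").read()), "my-kernel-image")` would yield a string of the form `localhost:30500/my-kernel-image:python-3.12-1` for the values shown in this section.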
## Vault Integration

### Overview

Vault integration enables secure secrets management directly from Jupyter notebooks without re-authentication. Users can store and retrieve API keys, database credentials, and other sensitive data securely.

### Prerequisites

Vault integration requires:

- Vault server installed and configured
- Keycloak OIDC authentication configured
- **Buun-stack kernel images** (standard images don't include Vault integration)

### Setup

Enable Vault integration during installation:

```bash
# Set environment variable before installation or answer yes to prompt during install
export JUPYTERHUB_VAULT_INTEGRATION_ENABLED=true
just jupyterhub::install
```

Or configure manually:

```bash
# Setup Vault JWT authentication for JupyterHub
just jupyterhub::setup-vault-jwt-auth
```

### Usage in Notebooks

With Vault integration enabled, use the `buunstack` package in notebooks:

```python
from buunstack import SecretStore

# Initialize (uses JupyterHub session authentication)
secrets = SecretStore()

# Store secrets
secrets.put('api-keys',
            openai='sk-...',
            github='ghp_...',
            database_url='postgresql://...')

# Retrieve secrets
api_keys = secrets.get('api-keys')
openai_key = secrets.get('api-keys', field='openai')

# List all secrets
secret_names = secrets.list()

# Delete secrets
secrets.delete('old-api-key')
```

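Notebooks that should also run outside JupyterHub (where Vault may be unreachable) can wrap the lookup in a small helper. The sketch below relies only on the `get(name, field=...)` call shown in this section; the helper name `get_secret_field` and the environment-variable fallback are illustrative additions, not part of `buunstack`, and the broad `except` is used because the exceptions `SecretStore` raises are not documented here.

```python
import os


def get_secret_field(store, name, field, env_fallback=None):
    """Fetch one field via store.get(name, field=...), else fall back to an env var.

    `store` is any object exposing get(name, field=...), for example a
    SecretStore instance. The fallback behavior is a hypothetical convenience.
    """
    try:
        value = store.get(name, field=field)
    except Exception:  # assumption: failure modes of SecretStore are unknown
        value = None
    if value is None and env_fallback:
        value = os.environ.get(env_fallback)
    if value is None:
        raise KeyError(f"secret {name!r} field {field!r} is not available")
    return value
```

In a notebook this might be called as `get_secret_field(SecretStore(), 'api-keys', 'openai', env_fallback='OPENAI_API_KEY')`, where the environment variable name is again only an example.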
### Security Features

- **User isolation**: Each user can only access their own secrets
- **Automatic token refresh**: Background token management prevents authentication failures
- **Audit trail**: All secret access is logged in Vault
- **No re-authentication**: Uses existing JupyterHub OIDC session

## Storage Options

### Default Storage

Uses Kubernetes PersistentVolumes for user home directories.

### NFS Storage

For shared storage across nodes, configure NFS:

```bash
export JUPYTERHUB_NFS_PV_ENABLED=true
export JUPYTER_NFS_IP=192.168.10.1
export JUPYTER_NFS_PATH=/volume1/drive1/jupyter
```

NFS storage requires:

- Longhorn storage system installed
- NFS server accessible from cluster nodes
- Proper NFS export permissions configured

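Before installing with NFS enabled, the three variables above can be sanity-checked with a short script. This is a minimal sketch under stated assumptions: the function name `check_nfs_settings` is hypothetical, it assumes `JUPYTER_NFS_IP` must be a literal IP address (a hostname may also be acceptable in practice), and it does not test whether the server is reachable or the export is mountable.

```python
import ipaddress
import posixpath


def check_nfs_settings(env):
    """Return a list of problems found in the NFS-related variables."""
    problems = []
    if env.get("JUPYTERHUB_NFS_PV_ENABLED", "false").lower() != "true":
        return problems  # NFS is disabled, nothing to validate

    ip = env.get("JUPYTER_NFS_IP", "")
    try:
        ipaddress.ip_address(ip)
    except ValueError:
        problems.append(f"JUPYTER_NFS_IP is not a valid IP address: {ip!r}")

    path = env.get("JUPYTER_NFS_PATH", "")
    if not posixpath.isabs(path):
        problems.append(f"JUPYTER_NFS_PATH must be an absolute path: {path!r}")
    return problems
```

Passing `os.environ` checks the current shell; an empty list means the values look well-formed.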
## Configuration

### Environment Variables

Key configuration variables:

```bash
# Basic settings
JUPYTERHUB_NAMESPACE=jupyter
JUPYTERHUB_CHART_VERSION=4.2.0
JUPYTERHUB_OIDC_CLIENT_ID=jupyterhub

# Keycloak integration
KEYCLOAK_REALM=buunstack

# Storage
JUPYTERHUB_NFS_PV_ENABLED=false

# Vault integration
JUPYTERHUB_VAULT_INTEGRATION_ENABLED=false
VAULT_ADDR=http://vault.vault.svc:8200

# Image settings
JUPYTER_PYTHON_KERNEL_TAG=python-3.12-6
IMAGE_REGISTRY=localhost:30500
```

### Advanced Configuration

Customize JupyterHub behavior by editing the `jupyterhub-values.gomplate.yaml` template before installation.

## Management

### Uninstall

```bash
just jupyterhub::uninstall
```

### Update

Upgrade to newer versions:

```bash
# Update image tag
export JUPYTER_PYTHON_KERNEL_TAG=python-3.12-2

# Rebuild and push images
just jupyterhub::build-kernel-images
just jupyterhub::push-kernel-images

# Upgrade JupyterHub deployment
just jupyterhub::install
```

## Troubleshooting

### Image Pull Issues

Buun-stack images are large, so pulls may time out:

```bash
# Check pod status
kubectl get pods -n jupyter

# Check image pull progress
kubectl describe pod <pod-name> -n jupyter

# Increase timeout if needed
helm upgrade jupyterhub jupyterhub/jupyterhub \
  -n jupyter --timeout=30m -f jupyterhub-values.yaml
```

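When many user pods are involved, scanning `kubectl describe` output by hand gets tedious. The sketch below takes the parsed JSON from `kubectl get pods -n jupyter -o json` and lists containers stuck in the standard Kubernetes waiting reasons `ErrImagePull` or `ImagePullBackOff`. The function name is hypothetical; note that a pull which is merely slow shows up as `ContainerCreating` and is not reported here.

```python
PULL_FAILURE_REASONS = {"ErrImagePull", "ImagePullBackOff"}


def pods_with_pull_errors(pod_list):
    """Return (pod name, image, reason) for containers failing to pull an image.

    pod_list is the dict obtained by parsing `kubectl get pods -o json`.
    """
    findings = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        container_statuses = (
            status.get("initContainerStatuses", [])
            + status.get("containerStatuses", [])
        )
        for cs in container_statuses:
            waiting = cs.get("state", {}).get("waiting") or {}
            reason = waiting.get("reason")
            if reason in PULL_FAILURE_REASONS:
                findings.append((pod["metadata"]["name"], cs.get("image"), reason))
    return findings
```

One way to use it is to save the `kubectl` output to a file, load it with `json.load`, and print each tuple returned.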
### Vault Integration Issues

Check Vault connectivity and authentication:

```python
# In a notebook
import os
print("Vault Address:", os.getenv('VAULT_ADDR'))
print("Access Token:", bool(os.getenv('JUPYTERHUB_OIDC_ACCESS_TOKEN')))

# Test SecretStore
from buunstack import SecretStore
secrets = SecretStore()
status = secrets.get_status()
print(status)
```

### Authentication Issues

Verify Keycloak client configuration:

```bash
# Check client exists
just keycloak::get-client buunstack jupyterhub

# Set the redirect URI if it is missing or wrong
just keycloak::update-client buunstack jupyterhub \
  "https://your-jupyter-host/hub/oauth_callback"
```

## Performance Considerations

- **Image Size**: Buun-stack images are ~13GB; plan node and registry storage accordingly
- **Pull Time**: Initial pulls take 5-15 minutes depending on network
- **Resource Usage**: Data science workloads require adequate CPU/memory
- **Storage**: NFS provides better performance for shared datasets

For production deployments, consider:

- Pre-pulling images to all nodes
- Using faster storage backends
- Configuring resource limits per user
- Setting up monitoring and alerts

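The pull-time figure above can be related to network bandwidth with simple arithmetic. The sketch below is a back-of-the-envelope estimate only: the function name is hypothetical, it uses decimal units (1 GB = 8000 Mbit), treats the ~13GB figure as the amount transferred, and ignores layer decompression and extraction, which add further time.

```python
def estimated_pull_minutes(image_gb, bandwidth_mbit_s):
    """Minutes needed to transfer image_gb gigabytes at bandwidth_mbit_s Mbit/s.

    Assumes decimal units (1 GB = 8000 Mbit) and counts transfer time only.
    """
    if bandwidth_mbit_s <= 0:
        raise ValueError("bandwidth must be positive")
    seconds = image_gb * 8000 / bandwidth_mbit_s
    return seconds / 60
```

Under these assumptions, `estimated_pull_minutes(13, 200)` gives about 8.7 minutes, and the 5-15 minute range corresponds to roughly 350 down to 115 Mbit/s of effective throughput per node.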