docs: write jupyterhub doc
@@ -1,6 +1,7 @@
default: true
no-bare-urls: false
line-length: false
no-duplicate-heading: false
no-inline-html: false
ul-indent:
  indent: 4
@@ -143,4 +143,5 @@ When adding new services:
- It must pass the command: `just --fmt --check --unstable`
- Follow existing Justfile patterns
- Only write code comments when necessary, as the code should be self-explanatory
  (avoid trivial comments for each code block)
- Write output messages and code comments in English
@@ -112,6 +112,9 @@ Multi-user platform for interactive computing:
- Integrated with Keycloak for OIDC authentication
- Persistent storage for user workspaces
- Support for multiple kernels and environments
- Vault integration for secure secrets management

See [JupyterHub Documentation](./docs/jupyterhub.md) for detailed setup and configuration.

## Common Operations
docs/jupyterhub.md (new file, 310 lines)
@@ -0,0 +1,310 @@
# JupyterHub

JupyterHub provides a multi-user Jupyter notebook environment with Keycloak OIDC authentication, Vault integration for secure secrets management, and custom kernel images for data science workflows.

## Installation

Install JupyterHub with interactive configuration:

```bash
just jupyterhub::install
```

This will prompt for:

- JupyterHub host (FQDN)
- NFS PV usage (if Longhorn is installed)
- NFS server details (if NFS is enabled)
- Vault integration setup

### Prerequisites

- Keycloak must be installed and configured
- For NFS storage: Longhorn must be installed
- For Vault integration: Vault must be installed and configured
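The prompt values can also be exported up front (the Vault and NFS sections below use this pattern). A minimal sketch, assuming the installer picks these up instead of prompting; `JUPYTERHUB_HOST` is an assumed name for the host variable, so check `.env.local` or the Justfile recipe for the exact one:

```bash
# Assumed variable names -- verify against .env.local / the jupyterhub Justfile module
export JUPYTERHUB_HOST=jupyter.example.com        # JupyterHub host (FQDN); name is an assumption
export JUPYTERHUB_NFS_PV_ENABLED=false            # NFS PV usage (documented below)
export JUPYTERHUB_VAULT_INTEGRATION_ENABLED=true  # Vault integration (documented below)

just jupyterhub::install
```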
## Kernel Images

JupyterHub supports multiple kernel image profiles:

### Standard Profiles

- **minimal**: Basic Python environment
- **base**: Python with common data science packages
- **datascience**: Full data science stack (default)
- **pyspark**: PySpark for big data processing
- **pytorch**: PyTorch for machine learning
- **tensorflow**: TensorFlow for machine learning

### Buun-Stack Profiles

- **buun-stack**: Comprehensive data science environment with Vault integration
- **buun-stack-cuda**: CUDA-enabled version with GPU support

## Profile Configuration

Enable/disable profiles using environment variables:

```bash
# Enable buun-stack profile (CPU version)
export JUPYTER_PROFILE_BUUN_STACK_ENABLED=true

# Enable buun-stack CUDA profile (GPU version)
export JUPYTER_PROFILE_BUUN_STACK_CUDA_ENABLED=true

# Disable default datascience profile
export JUPYTER_PROFILE_DATASCIENCE_ENABLED=false
```

Available profile variables:

- `JUPYTER_PROFILE_MINIMAL_ENABLED`
- `JUPYTER_PROFILE_BASE_ENABLED`
- `JUPYTER_PROFILE_DATASCIENCE_ENABLED`
- `JUPYTER_PROFILE_PYSPARK_ENABLED`
- `JUPYTER_PROFILE_PYTORCH_ENABLED`
- `JUPYTER_PROFILE_TENSORFLOW_ENABLED`
- `JUPYTER_PROFILE_BUUN_STACK_ENABLED`
- `JUPYTER_PROFILE_BUUN_STACK_CUDA_ENABLED`

Only `JUPYTER_PROFILE_DATASCIENCE_ENABLED` is true by default.
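To keep a profile selection across reinstalls, the same variables can be stored in `.env.local`. This is a sketch under the assumption that the installer reads profile variables from `.env.local`, as it does for the image settings shown below:

```bash
# Persist the profile selection (assumes .env.local is read by the installer)
cat >> .env.local <<'EOF'
JUPYTER_PROFILE_BUUN_STACK_ENABLED=true
JUPYTER_PROFILE_DATASCIENCE_ENABLED=false
EOF

# Re-run the install to apply the change
just jupyterhub::install
```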
## Buun-Stack Images

Buun-stack images provide comprehensive data science environments with:

- All standard data science packages (NumPy, Pandas, Scikit-learn, etc.)
- Deep learning frameworks (PyTorch, TensorFlow, Keras)
- Big data tools (PySpark, Apache Arrow)
- NLP and ML libraries (LangChain, Transformers, spaCy)
- Database connectors and tools
- **Vault integration** with the `buunstack` Python package

### Building Custom Images

Build and push buun-stack images to your registry:

```bash
# Build images
just jupyterhub::build-kernel-images

# Push to registry
just jupyterhub::push-kernel-images
```

⚠️ **Note**: Buun-stack images are comprehensive and large (~13GB). Initial image pulls and deployments take significant time due to the extensive package set.

### Image Configuration

Configure image settings in `.env.local`:

```bash
# Image registry
IMAGE_REGISTRY=localhost:30500

# Image tag
JUPYTER_PYTHON_KERNEL_TAG=python-3.12-1
```
## Vault Integration

### Overview

Vault integration enables secure secrets management directly from Jupyter notebooks without re-authentication. Users can store and retrieve API keys, database credentials, and other sensitive data securely.

### Prerequisites

Vault integration requires:

- Vault server installed and configured
- Keycloak OIDC authentication configured
- **Buun-stack kernel images** (standard images don't include Vault integration)

### Setup

Enable Vault integration during installation:

```bash
# Set the environment variable before installation, or answer yes at the prompt during install
export JUPYTERHUB_VAULT_INTEGRATION_ENABLED=true
just jupyterhub::install
```

Or configure manually:

```bash
# Setup Vault JWT authentication for JupyterHub
just jupyterhub::setup-vault-jwt-auth
```

### Usage in Notebooks

With Vault integration enabled, use the `buunstack` package in notebooks:

```python
from buunstack import SecretStore

# Initialize (uses JupyterHub session authentication)
secrets = SecretStore()

# Store secrets
secrets.put('api-keys',
            openai='sk-...',
            github='ghp_...',
            database_url='postgresql://...')

# Retrieve secrets
api_keys = secrets.get('api-keys')
openai_key = secrets.get('api-keys', field='openai')

# List all secrets
secret_names = secrets.list()

# Delete secrets
secrets.delete('old-api-key')
```
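Retrieved values are ordinary Python objects, so they can be handed straight to client libraries. A small sketch using only the secret names and fields from the example above (`OPENAI_API_KEY` is just an illustration of exporting a key for libraries that read environment variables):

```python
import os

from buunstack import SecretStore

secrets = SecretStore()

# Export a stored key so libraries that read environment variables can pick it up
os.environ['OPENAI_API_KEY'] = secrets.get('api-keys', field='openai')

# Or use a retrieved field directly as configuration
db_url = secrets.get('api-keys', field='database_url')
print('Database configured:', bool(db_url))
```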
### Security Features

- **User isolation**: Each user can only access their own secrets
- **Automatic token refresh**: Background token management prevents authentication failures
- **Audit trail**: All secret access is logged in Vault
- **No re-authentication**: Uses existing JupyterHub OIDC session

## Storage Options

### Default Storage

Uses Kubernetes PersistentVolumes for user home directories.

### NFS Storage

For shared storage across nodes, configure NFS:

```bash
export JUPYTERHUB_NFS_PV_ENABLED=true
export JUPYTER_NFS_IP=192.168.10.1
export JUPYTER_NFS_PATH=/volume1/drive1/jupyter
```

NFS storage requires:

- Longhorn storage system installed
- NFS server accessible from cluster nodes
- Proper NFS export permissions configured
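After installation it is worth confirming that the user-storage volumes actually bound, whichever backend is in use. A quick check with standard kubectl commands (the `jupyter` namespace matches `JUPYTERHUB_NAMESPACE` below):

```bash
# List claims created for user home directories and their bound volumes
kubectl get pvc -n jupyter

# Inspect the backing PersistentVolumes (storage class, NFS server/path, etc.)
kubectl get pv
```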
## Configuration

### Environment Variables

Key configuration variables:

```bash
# Basic settings
JUPYTERHUB_NAMESPACE=jupyter
JUPYTERHUB_CHART_VERSION=4.2.0
JUPYTERHUB_OIDC_CLIENT_ID=jupyterhub

# Keycloak integration
KEYCLOAK_REALM=buunstack

# Storage
JUPYTERHUB_NFS_PV_ENABLED=false

# Vault integration
JUPYTERHUB_VAULT_INTEGRATION_ENABLED=false
VAULT_ADDR=http://vault.vault.svc:8200

# Image settings
JUPYTER_PYTHON_KERNEL_TAG=python-3.12-6
IMAGE_REGISTRY=localhost:30500
```

### Advanced Configuration

Customize JupyterHub behavior by editing the `jupyterhub-values.gomplate.yaml` template before installation.
## Management

### Uninstall

```bash
just jupyterhub::uninstall
```

### Update

Upgrade to newer versions:

```bash
# Update image tag
export JUPYTER_PYTHON_KERNEL_TAG=python-3.12-2

# Rebuild and push images
just jupyterhub::push-kernel-images

# Upgrade JupyterHub deployment
just jupyterhub::install
```
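After the upgrade, a quick way to confirm that pods are running the updated image is to list the images in use in the namespace (standard kubectl; running user servers may need a restart to pick up the new tag):

```bash
# Show which images the pods in the jupyter namespace are currently running
kubectl get pods -n jupyter \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'
```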
## Troubleshooting

### Image Pull Issues

Buun-stack images are large and may time out:

```bash
# Check pod status
kubectl get pods -n jupyter

# Check image pull progress
kubectl describe pod <pod-name> -n jupyter

# Increase timeout if needed
helm upgrade jupyterhub jupyterhub/jupyterhub \
  -n jupyter --timeout=30m -f jupyterhub-values.yaml
```

### Vault Integration Issues

Check Vault connectivity and authentication:

```python
# In a notebook
import os
print("Vault Address:", os.getenv('VAULT_ADDR'))
print("Access Token:", bool(os.getenv('JUPYTERHUB_OIDC_ACCESS_TOKEN')))

# Test SecretStore
from buunstack import SecretStore
secrets = SecretStore()
status = secrets.get_status()
print(status)
```

### Authentication Issues

Verify Keycloak client configuration:

```bash
# Check client exists
just keycloak::get-client buunstack jupyterhub

# Update redirect URIs if needed
just keycloak::update-client buunstack jupyterhub \
  "https://your-jupyter-host/hub/oauth_callback"
```
## Performance Considerations

- **Image Size**: Buun-stack images are ~13GB; plan storage accordingly
- **Pull Time**: Initial pulls take 5-15 minutes depending on network
- **Resource Usage**: Data science workloads require adequate CPU/memory
- **Storage**: NFS provides better performance for shared datasets

For production deployments, consider:

- Pre-pulling images to all nodes (see the sketch after this list)
- Using faster storage backends
- Configuring resource limits per user
- Setting up monitoring and alerts
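As a starting point for pre-pulling, the kernel image can be pulled on every node before an upgrade. A rough sketch, assuming SSH access to the nodes and a CRI-compatible runtime; the node list and image reference are placeholders to adjust to your registry and tag settings:

```bash
# Placeholder node names and image reference -- adjust to your cluster and registry
NODES="node1 node2 node3"
IMAGE="localhost:30500/<kernel-image>:python-3.12-1"

for node in $NODES; do
  # crictl talks to the node's container runtime directly
  ssh "$node" "sudo crictl pull $IMAGE"
done
```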