# Querybook

Pinterest's big data querying UI with a notebook interface, Keycloak OAuth authentication, and Trino integration.

## Overview

This module deploys Querybook using the official Helm chart from Pinterest with:

- **Keycloak OAuth2 authentication** for user login
- **Trino integration** with user impersonation for query attribution
- **PostgreSQL backend** for metadata storage
- **Redis** for caching and session management
- **Traefik integration** with WebSocket support for real-time query execution
- **Group-based admin access** via Keycloak groups

## Prerequisites

- Kubernetes cluster (k3s)
- Keycloak installed and configured
- PostgreSQL cluster (CloudNativePG)
- Trino with access control configured
- External Secrets Operator (optional, for Vault integration)

## Installation

### Basic Installation

```bash
just querybook::install
```

You will be prompted for:

1. **Querybook host (FQDN)**: e.g., `querybook.example.com`
2. **Keycloak host (FQDN)**: e.g., `auth.example.com`

### What Gets Installed

- Querybook web service
- Querybook scheduler (background jobs)
- Querybook workers (query execution)
- PostgreSQL database for Querybook metadata
- Redis for caching and sessions
- Keycloak OAuth2 client (confidential client)
- `querybook-admin` group in Keycloak for admin access
- Traefik Middleware for WebSocket and header forwarding

## Configuration

Environment variables (set them in `.env.local` or override them in the shell):

```bash
QUERYBOOK_NAMESPACE=querybook                       # Kubernetes namespace
QUERYBOOK_HOST=querybook.example.com                # External hostname
KEYCLOAK_HOST=auth.example.com                      # Keycloak hostname
KEYCLOAK_REALM=buunstack                            # Keycloak realm name

# Optional: Use custom Docker image (for testing fixes/patches)
QUERYBOOK_CUSTOM_IMAGE=localhost:30500/querybook    # Custom image repository
QUERYBOOK_CUSTOM_IMAGE_TAG=buun-stack               # Custom image tag (default: latest)
QUERYBOOK_CUSTOM_IMAGE_PULL_POLICY=Always           # Image pull policy (default: Always)
```

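
As a quick pre-flight check, the sketch below lists which of the base variables are unset before the installer runs. The helper name `missing_querybook_env` is hypothetical and not part of the `just` recipes, and treating all four variables as required is an assumption, since the recipes may prompt for or default some of them.

```bash
# Hypothetical helper: print the names of unset or empty variables.
# Assumption: these four variables are treated as required here.
missing_querybook_env() {
    for name in QUERYBOOK_NAMESPACE QUERYBOOK_HOST KEYCLOAK_HOST KEYCLOAK_REALM; do
        # Indirect lookup that works in POSIX sh as well as bash.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            printf '%s\n' "$name"
        fi
    done
}

# Usage: load .env.local if present, then list anything missing.
# [ -f .env.local ] && . ./.env.local
# missing_querybook_env
```

An empty result means all four variables have values; any printed name still needs to be set.
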
### Using Custom Image

To use a custom Querybook image (e.g., with patches or fixes):

```bash
# Set environment variables
export QUERYBOOK_CUSTOM_IMAGE=localhost:30500/querybook
export QUERYBOOK_CUSTOM_IMAGE_TAG=buun-stack

# Install or upgrade Querybook
just querybook::install
# or
just querybook::upgrade
```

**When to use a custom image**:

- Testing bug fixes before they are merged upstream
- Applying patches for specific issues (e.g., WebSocket disconnect errors)
- Using modified versions with custom features

**Custom image includes** (`buun-stack` tag):

- Fix for WebSocket disconnect handler (python-socketio 5.12.0+ compatibility)
- Fix for datetime serialization in WebSocket emit
- `trino` Python client upgraded to 0.336.0, with Metastore support (table autocomplete, schema browser)

**Custom image behavior** (when `QUERYBOOK_CUSTOM_IMAGE` is set):

- Pull policy: `Always` (default, override with `QUERYBOOK_CUSTOM_IMAGE_PULL_POLICY`)
- Ensures the latest image is always pulled from the registry

**Default behavior** (when `QUERYBOOK_CUSTOM_IMAGE` is not set):

- Uses official image: `querybook/querybook:latest`
- Pull policy: `IfNotPresent`
- Note: Official image may encounter WebSocket disconnect errors with python-socketio 5.12.0+

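
The selection rules above can be summarized in a few lines of shell. This is only an illustration of the documented behavior, not the actual recipe logic; `resolve_querybook_image` is a hypothetical name.

```bash
# Sketch of the image selection rules described above.
# resolve_querybook_image is a hypothetical name, not a recipe in this module.
resolve_querybook_image() {
    if [ -n "${QUERYBOOK_CUSTOM_IMAGE:-}" ]; then
        # Custom image: tag defaults to "latest", pull policy to "Always".
        printf '%s:%s %s\n' \
            "$QUERYBOOK_CUSTOM_IMAGE" \
            "${QUERYBOOK_CUSTOM_IMAGE_TAG:-latest}" \
            "${QUERYBOOK_CUSTOM_IMAGE_PULL_POLICY:-Always}"
    else
        # Official image with the default pull policy.
        printf '%s %s\n' "querybook/querybook:latest" "IfNotPresent"
    fi
}
```

With `QUERYBOOK_CUSTOM_IMAGE=localhost:30500/querybook` and `QUERYBOOK_CUSTOM_IMAGE_TAG=buun-stack` exported, the function prints the image reference followed by the pull policy that would apply.
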
### Building Custom Image

To build a custom Querybook image with bug fixes and Metastore support:

1. **Clone Querybook repository**:

   ```bash
   git clone https://github.com/pinterest/querybook.git
   cd querybook
   ```

2. **Apply bug fix patch**:

   ```bash
   # Copy patch file from buun-stack repository
   # cp /path/to/buun-stack/querybook/querybook-trino-metastore.diff .

   # Apply the patch
   git apply querybook-trino-metastore.diff
   ```

   **Patch includes**:
   - Fix for WebSocket disconnect handler (python-socketio 5.12.0+ compatibility)
   - Fix for datetime serialization in WebSocket emit
   - `trino` Python client upgraded to 0.336.0, with a `TrinoCursor.poll()` compatibility fix
   - sqlalchemy-trino 0.5.0 for Metastore support

3. **Build the Docker image**:

   ```bash
   # For remote Docker host (e.g., k3s node)
   DOCKER_HOST=ssh://yourdomain.com docker build \
     --no-cache \
     --build-arg EXTRA_PIP_INSTALLS=extra.txt \
     -t localhost:30500/querybook:buun-stack .

   # For local Docker
   docker build \
     --no-cache \
     --build-arg EXTRA_PIP_INSTALLS=extra.txt \
     -t localhost:30500/querybook:buun-stack .
   ```

   **Important**: Use `--no-cache` to ensure pip installs the correct package versions. Docker layer caching can cause pip to reuse old dependency resolutions.

4. **Push to registry**:

   ```bash
   DOCKER_HOST=ssh://yourdomain.com docker push localhost:30500/querybook:buun-stack
   # or for local Docker
   docker push localhost:30500/querybook:buun-stack
   ```

5. **Deploy to Kubernetes**:

   ```bash
   export QUERYBOOK_CUSTOM_IMAGE=localhost:30500/querybook
   export QUERYBOOK_CUSTOM_IMAGE_TAG=buun-stack
   just querybook::upgrade
   ```

6. **Restart pods to use the new image**:

   ```bash
   # Delete all Querybook pods to force image pull
   kubectl delete pod -n querybook -l app=querybook

   # Wait for pods to be ready
   kubectl wait --for=condition=ready pod -l app=querybook -n querybook --timeout=120s

   # Verify the installed trino and sqlalchemy-trino versions
   kubectl exec -n querybook deployment/worker -- pip show trino | grep -E "Name:|Version:"
   kubectl exec -n querybook deployment/worker -- pip show sqlalchemy-trino | grep -E "Name:|Version:"
   ```

   Expected output:

   ```plain
   Name: trino
   Version: 0.336.0
   Name: sqlalchemy-trino
   Version: 0.5.0
   ```

**Notes**:

- `EXTRA_PIP_INSTALLS=extra.txt` ensures all query engines (Trino, BigQuery, Snowflake, etc.) are installed
- Metastore features are fully enabled with trino 0.336.0 and sqlalchemy-trino 0.5.0

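
To turn the manual version check into a pass/fail step, something like the following could be used. Both helper names are hypothetical; the sketch only parses the `Version:` line that `pip show` prints and compares it with the versions listed above.

```bash
# Extract the "Version:" field from `pip show` output read on stdin.
pip_show_version() {
    awk -F': ' '$1 == "Version" { print $2; exit }'
}

# Hypothetical helper: compare an installed version with the expected one.
check_version() {
    # $1 = package name, $2 = expected version, $3 = actual version
    if [ "$2" = "$3" ]; then
        printf 'OK: %s %s\n' "$1" "$3"
    else
        printf 'MISMATCH: %s expected %s, found %s\n' "$1" "$2" "${3:-none}"
        return 1
    fi
}

# Usage against the worker deployment (requires cluster access):
# actual=$(kubectl exec -n querybook deployment/worker -- pip show trino | pip_show_version)
# check_version trino 0.336.0 "$actual"
```

A non-zero exit status from `check_version` usually means the pods are still running an older image, or the build reused cached layers.
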
## Usage

### Access Querybook

1. Navigate to `https://your-querybook-host/`
2. Click "Login with OAuth" to authenticate with Keycloak
3. Create datadocs (notebooks) and execute queries

### Grant Admin Access

Add users to the `querybook-admin` group:

```bash
just keycloak::add-user-to-group <username> querybook-admin
```

Admin users can:

- Manage query engines
- Configure data sources
- Manage user permissions
- View all datadocs

### Configure Trino Query Engine

1. Log in as an admin user
2. Navigate to Admin → Query Engines
3. Click "Add Query Engine"
4. Configure basic settings:

   ```plain
   Name: Trino
   Language: Trino
   Executor: Trino (not SqlAlchemy)
   Environment: production (or your preferred environment name)
   ```

5. Configure connection settings:

   ```plain
   Connection String: trino://trino.example.com:443/iceberg?SSL=true
   Username: admin
   Password: [from just trino::admin-password]
   Proxy_user_id: (leave empty to use admin username)
   ```

   **Important Notes**:
   - **Catalog in Connection String**: Include `/iceberg` (or your catalog name) after the port
     - With catalog: `trino://host:443/iceberg?SSL=true` → queries work without `iceberg.` prefix
     - Without catalog: `trino://host:443?SSL=true` → queries fail with "Catalog 'hive' not found"
   - **Proxy_user_id**: Leave empty (defaults to the Username field, i.e. `admin`)
     - For user impersonation, configure Trino access control separately

6. Optional: Link to a Metastore for table autocompletion:
   - **Metastore**: Select the Metastore you created (see the Metastore Configuration section below)
   - Enables autocomplete for table and column names in the query editor

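
Because a missing catalog or a missing `SSL=true` only shows up later as a query error, it can help to check the connection string before pasting it into the admin UI. The following is a minimal sketch using plain parameter expansion; `trino_conn_catalog` and `check_trino_conn` are hypothetical helper names, and the parsing assumes the `trino://host:port/catalog?params` shape used in this README.

```bash
# Print the catalog segment of a trino:// connection string, or nothing if absent.
trino_conn_catalog() {
    rest=${1#trino://}      # strip the scheme
    rest=${rest%%\?*}       # strip the query string (e.g. ?SSL=true)
    case $rest in
        */*) path=${rest#*/}; printf '%s\n' "${path%%/*}" ;;
        *)   printf '\n' ;;
    esac
}

# Hypothetical check mirroring the notes above.
check_trino_conn() {
    if [ -z "$(trino_conn_catalog "$1")" ]; then
        echo "missing catalog: add e.g. /iceberg after the port"
        return 1
    fi
    case $1 in
        *SSL=true*) echo "ok" ;;
        *) echo "missing SSL=true"; return 1 ;;
    esac
}
```

For example, `check_trino_conn 'trino://trino.example.com:443/iceberg?SSL=true'` succeeds, while the same string without `/iceberg` returns a non-zero status.
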
### Metastore Configuration

**Status**: Metastore features are **fully enabled** in the custom image (`buun-stack` tag) with trino 0.336.0 and sqlalchemy-trino 0.5.0.

**How to configure**:

1. Log in as an admin user
2. Navigate to Admin → Metastores
3. Click "Add Metastore"
4. Configure settings:

   ```plain
   Name: Trino Iceberg
   Metastore Loader: SqlAlchemyMetastoreLoader
   Connection String: trino://trino.example.com:443/iceberg?SSL=true
   Username: admin
   Password: [from just trino::admin-password]
   ```

5. Link the Metastore to your Query Engine (Admin → Query Engines → Edit → Metastore)

**Features**:

- **Schema Browser**: Browse catalogs, schemas, and tables in Admin UI
- **Table Autocomplete**: Type table names in query editor, press Tab or Escape
- **Column Autocomplete**: Type column names after table name in query
- **Search**: Use search box in Tables sidebar to find tables by name

**Note**: Views are currently not displayed in the schema browser (only tables are shown)

## Features

- **Tables Sidebar**: Browse schemas and tables, view column details
- **Autocomplete**: Type table/column names in query editor, press Tab or Escape
- **Search**: Use search box in Tables sidebar to find tables by name

### User Impersonation

Querybook connects to Trino as `admin` but executes queries as the logged-in user via Trino's impersonation feature. This provides:

- **Query Attribution**: Queries are attributed to the actual user, not the admin account
- **Audit Logging**: Trino logs show the real user who executed each query
- **Access Control**: Per-user access policies can be enforced in the future

**How it Works**:

1. User logs into Querybook with Keycloak
2. Querybook connects to Trino using admin credentials
3. Querybook sends queries with `X-Trino-User: <username>` header
4. Trino impersonates the user (allowed by access control rules)
5. Query runs as if executed by the actual user

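
The same flow can be reproduced by hand against Trino's HTTP statement endpoint, which is useful when checking whether impersonation works independently of Querybook. This is a rough sketch, not what Querybook runs internally: `TRINO_HOST` and `TRINO_ADMIN_PASSWORD` are illustrative variable names, and the `CURL` override exists only so the arguments can be printed instead of sent.

```bash
# Sketch: submit a statement to Trino as admin while impersonating another user.
# Assumptions: TRINO_HOST and TRINO_ADMIN_PASSWORD are illustrative variable
# names; set CURL=echo to print the arguments instead of sending the request.
trino_query_as() {
    # $1 = Querybook username to impersonate, $2 = SQL text
    ${CURL:-curl} --silent --show-error \
        --user "admin:${TRINO_ADMIN_PASSWORD:-}" \
        --header "X-Trino-User: $1" \
        --data "$2" \
        "https://${TRINO_HOST:-trino.example.com}/v1/statement"
}

# Usage (requires network access to Trino; password from `just trino::admin-password`):
# TRINO_ADMIN_PASSWORD='...' trino_query_as alice 'SELECT current_user'
```

Trino answers with a JSON document containing a `nextUri` to poll for results. If the access control rules do not allow `admin` to impersonate the given user, the request is rejected with an access-denied error instead.
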
## Architecture

```plain
External Users
      ↓
Cloudflare Tunnel (HTTPS)
      ↓
Traefik Ingress (HTTPS)
  ├─ Traefik Middleware (X-Forwarded-*, WebSocket upgrade)
  └─ Backend: HTTP
      ↓
Querybook Web
  ├─ OAuth2 → Keycloak (authentication)
  ├─ PostgreSQL (metadata)
  ├─ Redis (cache/sessions)
  └─ WebSocket (real-time query updates)
      ↓
Querybook Workers
      ↓
Trino (HTTPS via external hostname)
  └─ Password auth + User impersonation
```

**Key Components**:

- **Traefik Middleware**: Handles WebSocket upgrade headers and X-Forwarded-* headers
- **OAuth2 Integration**: Uses standard OIDC scopes (openid, email, profile) with groups mapper
- **Trino Connection**: Must use external HTTPS hostname (not internal service name)
- **User Impersonation**: Admin credentials with X-Trino-User header for query attribution

## Authentication

### User Login (OAuth2)

- Users authenticate via Keycloak
- Standard OIDC flow with Authorization Code grant
- Group membership included in UserInfo endpoint response
- Session stored in Redis

### Admin Access

- Controlled by Keycloak group membership
- Users in `querybook-admin` group have full admin privileges
- Regular users can create and manage their own datadocs

### Trino Connection

- Uses password authentication (admin user)
- Connects via external HTTPS hostname (Traefik provides TLS)
- Python Trino client enforces HTTPS when authentication is used
- User impersonation via X-Trino-User header

## Management

### Upgrade Querybook

```bash
just querybook::upgrade
```

Updates the Helm deployment with the current configuration.

### Uninstall

```bash
# Keep PostgreSQL database
just querybook::uninstall false

# Delete PostgreSQL database too
just querybook::uninstall true
```

## Troubleshooting

### Check Pod Status

```bash
kubectl get pods -n querybook
```

### WebSocket Connection Fails

- Verify Traefik middleware exists: `kubectl get middleware querybook-headers -n querybook`
- Check WebSocket upgrade headers in middleware configuration
- Ensure Ingress annotation references middleware: `querybook-querybook-headers@kubernetescrd`

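
The doubled `querybook-querybook-headers` in the annotation value is not a typo: Traefik's Kubernetes CRD provider references a middleware as `<namespace>-<name>@kubernetescrd`. A small sketch of that naming rule (the function name `traefik_middleware_ref` is hypothetical):

```bash
# Build the middleware reference expected in the Ingress annotation.
# Traefik's Kubernetes CRD provider names middlewares <namespace>-<name>@kubernetescrd.
traefik_middleware_ref() {
    # $1 = namespace, $2 = Middleware resource name
    printf '%s-%s@kubernetescrd\n' "$1" "$2"
}

# Usage: compare against the annotation on the live Ingress (requires cluster access).
# Assumption: the Ingress uses the standard router.middlewares annotation.
# kubectl get ingress -n querybook -o yaml | grep 'router.middlewares'
# traefik_middleware_ref querybook querybook-headers
```

If Querybook is deployed to a namespace other than `querybook`, the first half of the reference changes accordingly.
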
### OAuth Login Fails

- Verify Keycloak client exists: `just keycloak::list-clients`
- Check redirect URL: `https://<querybook-host>/oauth2callback`
- Verify client secret matches: Compare Vault/K8s secret with Keycloak
- Check Keycloak is accessible from Querybook pods

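
When checking the redirect URL and Keycloak reachability, it helps to build both URLs from the same variables used at install time. The helpers below are hypothetical, and the discovery path assumes a Keycloak version that serves realms under `/realms/<realm>` without the legacy `/auth` prefix.

```bash
# Redirect URI that must be registered on the Keycloak client.
querybook_redirect_uri() {
    # $1 = Querybook host
    printf 'https://%s/oauth2callback\n' "$1"
}

# OIDC discovery document for a realm.
# Assumption: Keycloak serves realms under /realms/<realm> (no legacy /auth prefix).
keycloak_discovery_url() {
    # $1 = Keycloak host, $2 = realm
    printf 'https://%s/realms/%s/.well-known/openid-configuration\n' "$1" "$2"
}

# Usage (requires network access):
# curl -fsS "$(keycloak_discovery_url "$KEYCLOAK_HOST" "$KEYCLOAK_REALM")" >/dev/null && echo reachable
```

Running the `curl` check from inside a Querybook pod rather than from a workstation gives a more faithful answer to whether the pods can reach Keycloak.
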
### Trino Connection Fails

- **Error: "cannot use authentication with HTTP"**
  - Must use the external hostname with HTTPS: `trino://trino.example.com:443/iceberg?SSL=true`
  - Do NOT use the internal service name (e.g., `trino.trino.svc.cluster.local:8080`)
  - The Python Trino client enforces HTTPS when authentication is used

- **Error: "500 Internal Server Error"**
  - Verify Trino is accessible via the external hostname
  - Check Trino admin password: `just trino::admin-password`
  - Test the Trino connection manually with curl

- **Error: "Access Denied: User admin cannot impersonate user X"**
  - Verify Trino access control is configured
  - Check impersonation rules: `kubectl exec -n trino deployment/trino-coordinator -- cat /etc/trino/access-control/rules.json`
  - Ensure admin can impersonate all users

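
For the impersonation error, it helps to know what a permissive rule looks like in Trino's file-based access control format. The sketch below prints a minimal `impersonation` section that lets `admin` impersonate any user; the rules file actually deployed by the Trino module may contain additional sections and stricter patterns, so treat this as a reference shape rather than the deployed content.

```bash
# Print a minimal impersonation section for Trino's file-based access control.
# Assumption: the connecting account is named "admin", as in this README.
print_impersonation_rules() {
    cat <<'EOF'
{
  "impersonation": [
    {
      "original_user": "admin",
      "new_user": ".*",
      "allow": true
    }
  ]
}
EOF
}
```

Compare this with the output of the `kubectl exec ... cat /etc/trino/access-control/rules.json` command above. If no rule matches `admin` as `original_user`, or `new_user` does not match the Querybook username, Trino denies the impersonation.
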
### Query Execution Stuck

- Check worker pod logs: `just querybook::logs worker`
- Verify Redis is running: `kubectl get pods -n querybook | grep redis`
- Check Trino coordinator health: `kubectl get pods -n trino`

### Database Connection Issues

- Verify PostgreSQL cluster is running: `kubectl get cluster -n postgres`
- Check database exists: `just postgres::list-databases | grep querybook`
- Verify secret exists: `kubectl get secret querybook-config-secret -n querybook`

### Metastore Issues

**Note**: Metastore features are fully enabled in the `buun-stack` custom image with trino 0.336.0 and sqlalchemy-trino 0.5.0.

- **Metastore not loading tables**:
  - Verify Metastore configuration: Admin → Metastores → Edit
  - Check connection string includes catalog: `trino://host:443/iceberg?SSL=true`
  - Test Trino connection with admin credentials
  - Check worker pod logs for errors: `just querybook::logs worker`

- **Tables not appearing in sidebar**:
  - Wait for initial metadata sync (may take a few minutes)
  - Trigger manual sync: Admin → Metastores → Sync
  - Verify schemas exist in Trino: `SHOW SCHEMAS FROM iceberg`

- **Views not displayed**:
  - This is a known limitation: only tables are currently shown
  - Views can still be queried directly by typing the full name

## References

- [Querybook Documentation](https://www.querybook.org/)
- [Querybook GitHub](https://github.com/pinterest/querybook)
- [Trino Integration](../trino/README.md)
- [Keycloak OAuth2](https://www.keycloak.org/docs/latest/securing_apps/#_oidc)