Querybook
Pinterest's big data querying UI with notebook interface, Keycloak OAuth authentication, and Trino integration.
Overview
This module deploys Querybook using the official Helm chart from Pinterest with:
- Keycloak OAuth2 authentication for user login
- Trino integration with user impersonation for query attribution
- PostgreSQL backend for metadata storage
- Redis for caching and session management
- Traefik integration with WebSocket support for real-time query execution
- Group-based admin access via Keycloak groups
Prerequisites
- Kubernetes cluster (k3s)
- Keycloak installed and configured
- PostgreSQL cluster (CloudNativePG)
- Trino with access control configured
- External Secrets Operator (optional, for Vault integration)
Installation
Basic Installation
just querybook::install
You will be prompted for:
- Querybook host (FQDN): e.g., querybook.example.com
- Keycloak host (FQDN): e.g., auth.example.com
What Gets Installed
- Querybook web service
- Querybook scheduler (background jobs)
- Querybook workers (query execution)
- PostgreSQL database for Querybook metadata
- Redis for caching and sessions
- Keycloak OAuth2 client (confidential client)
- querybook-admin group in Keycloak for admin access
- Traefik Middleware for WebSocket and header forwarding
Configuration
Environment variables (set in .env.local or override):
QUERYBOOK_NAMESPACE=querybook # Kubernetes namespace
QUERYBOOK_HOST=querybook.example.com # External hostname
KEYCLOAK_HOST=auth.example.com # Keycloak hostname
KEYCLOAK_REALM=buunstack # Keycloak realm name
# Optional: Use custom Docker image (for testing fixes/patches)
QUERYBOOK_CUSTOM_IMAGE=localhost:30500/querybook # Custom image repository
QUERYBOOK_CUSTOM_IMAGE_TAG=buun-stack # Custom image tag (default: latest)
QUERYBOOK_CUSTOM_IMAGE_PULL_POLICY=Always # Image pull policy (default: Always)
Using Custom Image
To use a custom Querybook image (e.g., with patches or fixes):
# Set environment variables
export QUERYBOOK_CUSTOM_IMAGE=localhost:30500/querybook
export QUERYBOOK_CUSTOM_IMAGE_TAG=buun-stack
# Install or upgrade Querybook
just querybook::install
# or
just querybook::upgrade
When to use custom image:
- Testing bug fixes before they are merged upstream
- Applying patches for specific issues (e.g., WebSocket disconnect errors)
- Using modified versions with custom features
Custom image includes (buun-stack tag):
- Fix for WebSocket disconnect handler (python-socketio 5.12.0+ compatibility)
- Fix for datetime serialization in WebSocket emit
- trino Python client 0.336.0 upgrade with Metastore support (table autocomplete, schema browser)
Custom image behavior (when QUERYBOOK_CUSTOM_IMAGE is set):
- Pull policy: Always (default, override with QUERYBOOK_CUSTOM_IMAGE_PULL_POLICY)
- Ensures latest image is always pulled from registry
Default behavior (when QUERYBOOK_CUSTOM_IMAGE is not set):
- Uses official image: querybook/querybook:latest
- Pull policy: IfNotPresent
- Note: Official image may encounter WebSocket disconnect errors with python-socketio 5.12.0+
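The selection between the custom and the official image can be sketched as a small shell function. This is only an illustration of the rules listed above: `resolve_querybook_image` is a hypothetical helper, not a recipe in the justfile, and the actual install logic may differ.

```shell
# Hypothetical helper: prints "<image>:<tag> <pullPolicy>" following the
# documented defaults. Not part of the buun-stack justfile.
resolve_querybook_image() {
    if [ -n "${QUERYBOOK_CUSTOM_IMAGE:-}" ]; then
        # Custom image: tag defaults to "latest", pull policy to "Always".
        printf '%s:%s %s\n' \
            "$QUERYBOOK_CUSTOM_IMAGE" \
            "${QUERYBOOK_CUSTOM_IMAGE_TAG:-latest}" \
            "${QUERYBOOK_CUSTOM_IMAGE_PULL_POLICY:-Always}"
    else
        # No custom image set: official image with IfNotPresent.
        printf '%s %s\n' "querybook/querybook:latest" "IfNotPresent"
    fi
}
```

Calling the function before `just querybook::install` is a quick way to confirm which image the current environment variables would select.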
Building Custom Image
To build a custom Querybook image with bug fixes and Metastore support:
- Clone Querybook repository:
  git clone https://github.com/pinterest/querybook.git
  cd querybook
- Apply bug fix patch:
  # Copy patch file from buun-stack repository
  # cp /path/to/buun-stack/querybook/querybook-fixes.diff .
  # Apply the patch
  git apply querybook-fixes.diff
  Patch includes:
- Fix for WebSocket disconnect handler (python-socketio 5.12.0+ compatibility)
- Fix for datetime serialization in WebSocket emit
- trino Python client 0.336.0 upgrade with TrinoCursor.poll() compatibility fix
- sqlalchemy-trino 0.5.0 for Metastore support
- Build the Docker image:
  # For remote Docker host (e.g., k3s node)
  DOCKER_HOST=ssh://yourdomain.com docker build \
    --no-cache \
    --build-arg EXTRA_PIP_INSTALLS=extra.txt \
    -t localhost:30500/querybook:buun-stack .
  # For local Docker
  docker build \
    --no-cache \
    --build-arg EXTRA_PIP_INSTALLS=extra.txt \
    -t localhost:30500/querybook:buun-stack .
  Important: Use --no-cache to ensure pip installs the correct package versions. With layer caching, Docker can reuse a previously built pip install layer that still contains the old dependency versions.
- Push to registry:
  DOCKER_HOST=ssh://yourdomain.com docker push localhost:30500/querybook:buun-stack
  # or for local Docker
  docker push localhost:30500/querybook:buun-stack
- Deploy to Kubernetes:
  export QUERYBOOK_CUSTOM_IMAGE=localhost:30500/querybook
  export QUERYBOOK_CUSTOM_IMAGE_TAG=buun-stack
  just querybook::upgrade
- Restart Pods to use new image:
  # Delete all Querybook pods to force image pull
  kubectl delete pod -n querybook -l app=querybook
  # Wait for pods to be ready
  kubectl wait --for=condition=ready pod -l app=querybook -n querybook --timeout=120s
  # Verify the trino and sqlalchemy-trino versions
  kubectl exec -n querybook deployment/worker -- pip show trino | grep -E "Name:|Version:"
  kubectl exec -n querybook deployment/worker -- pip show sqlalchemy-trino | grep -E "Name:|Version:"
  Expected output:
  Name: trino
  Version: 0.336.0
  Name: sqlalchemy-trino
  Version: 0.5.0
Notes:
- EXTRA_PIP_INSTALLS=extra.txt ensures all query engines (Trino, BigQuery, Snowflake, etc.) are installed
- Metastore features are fully enabled with trino 0.336.0 and sqlalchemy-trino 0.5.0
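The version check from the restart step can be automated with a small filter over `pip show` output. The sketch below is an assumption about how one might script it; `check_pkg_version` is a hypothetical helper that reads `pip show` output on stdin and compares the `Version:` field with the expected value.

```shell
# Hypothetical helper: check_pkg_version NAME EXPECTED
# Reads `pip show NAME` output on stdin and compares the Version field.
check_pkg_version() {
    name="$1"
    expected="$2"
    # Pick the value of the first "Version: x.y.z" line.
    actual="$(awk -F': ' '$1 == "Version" { print $2; exit }')"
    if [ "$actual" = "$expected" ]; then
        echo "$name $actual OK"
    else
        echo "$name: expected $expected, found ${actual:-nothing}" >&2
        return 1
    fi
}

# Usage (requires cluster access, so it is left commented out here):
# kubectl exec -n querybook deployment/worker -- pip show trino \
#     | check_pkg_version trino 0.336.0
# kubectl exec -n querybook deployment/worker -- pip show sqlalchemy-trino \
#     | check_pkg_version sqlalchemy-trino 0.5.0
```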
Usage
Access Querybook
- Navigate to https://your-querybook-host/
- Click "Login with OAuth" to authenticate with Keycloak
- Create datadocs (notebooks) and execute queries
Grant Admin Access
Add users to the querybook-admin group:
just keycloak::add-user-to-group <username> querybook-admin
Admin users can:
- Manage query engines
- Configure data sources
- Manage user permissions
- View all datadocs
Configure Trino Query Engine
- Log in as an admin user
- Navigate to Admin → Query Engines
- Click "Add Query Engine"
- Configure basic settings:
  Name: Trino
  Language: Trino
  Executor: Trino (not SqlAlchemy)
  Environment: production (or your preferred environment name)
- Configure connection settings:
  Connection String: trino://trino.example.com:443/iceberg?SSL=true
  Username: admin
  Password: [from just trino::admin-password]
  Proxy_user_id: (leave empty to use admin username)
  Important Notes:
  - Catalog in Connection String: Include /iceberg (or your catalog name) after the port
    - With catalog: trino://host:443/iceberg?SSL=true → queries work without the iceberg. prefix
    - Without catalog: trino://host:443?SSL=true → queries fail with "Catalog 'hive' not found"
  - Proxy_user_id: Leave empty (defaults to Username field = admin)
    - For user impersonation, configure Trino access control separately
- Optional: Link to Metastore for table autocompletion:
  - Metastore: Select created Metastore (see Metastore Configuration section below)
  - Enables autocomplete for table and column names in query editor
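To avoid typos when composing the Connection String, it can be generated from its parts. The following is a minimal sketch; `trino_connection_string` is a hypothetical helper, and the host and catalog values are the example values used in this document.

```shell
# Hypothetical helper: trino_connection_string HOST [CATALOG] [PORT]
# Builds the query engine Connection String in the format shown above.
trino_connection_string() {
    host="$1"
    catalog="${2:-}"
    port="${3:-443}"
    if [ -n "$catalog" ]; then
        # With a catalog, queries can omit the "<catalog>." prefix.
        printf 'trino://%s:%s/%s?SSL=true\n' "$host" "$port" "$catalog"
    else
        # Without a catalog, unqualified queries fall back to a default
        # catalog that may not exist (see the note above).
        printf 'trino://%s:%s?SSL=true\n' "$host" "$port"
    fi
}

trino_connection_string trino.example.com iceberg
```

The final line prints trino://trino.example.com:443/iceberg?SSL=true, which can be pasted into the Connection String field.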
Metastore Configuration
Status: Metastore features are fully enabled in the custom image (buun-stack tag) with trino 0.336.0 and sqlalchemy-trino 0.5.0.
How to configure:
- Log in as an admin user
- Navigate to Admin → Metastores
- Click "Add Metastore"
- Configure settings:
  Name: Trino Metastore
  Loader: SqlAlchemyMetastoreLoader
  Connection String: trino://admin:<password>@trino.example.com:443/iceberg?SSL=true
  Important: The Connection String must embed the username and password in the URL:
  trino://username:password@host:port/catalog?SSL=true
- Configure Connect_args section:
  Key: http_scheme
  Value: https
  This setting ensures proper HTTPS connection handling for the Metastore loader.
- Enable Impersonate option:
  Impersonate: ON
  This ensures metadata is fetched as the logged-in user, consistent with query execution behavior. Each user will see tables and schemas they have access to.
- Link the Metastore to your Query Engine (Admin → Query Engines → Edit → Metastore)
Trino admin password can be retrieved with:
just trino::admin-password
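Because the Metastore Connection String embeds the password in a URL, characters such as @, / or : in the password need to be percent-encoded, assuming the loader hands the string to SQLAlchemy, which decodes it again. A minimal sketch for ASCII credentials follows; `urlencode` and `metastore_connection_string` are hypothetical helpers introduced only for illustration.

```shell
# Hypothetical helper: percent-encode every character outside the
# RFC 3986 unreserved set. Intended for ASCII input only.
# The body runs in a subshell so LC_ALL and the variables do not leak.
urlencode() (
    LC_ALL=C
    s="$1"
    out=""
    while [ -n "$s" ]; do
        c="${s%"${s#?}"}"   # first character of s
        s="${s#?}"          # remainder of s
        case "$c" in
            [A-Za-z0-9._~-]) out="$out$c" ;;
            *) out="$out$(printf '%%%02X' "'$c")" ;;
        esac
    done
    printf '%s\n' "$out"
)

# Hypothetical helper:
# metastore_connection_string USER PASSWORD HOST CATALOG [PORT]
metastore_connection_string() {
    printf 'trino://%s:%s@%s:%s/%s?SSL=true\n' \
        "$(urlencode "$1")" "$(urlencode "$2")" "$3" "${5:-443}" "$4"
}

# Usage (requires the just recipe from this repository):
# metastore_connection_string admin "$(just trino::admin-password)" \
#     trino.example.com iceberg
```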
Features:
- Schema Browser: Browse catalogs, schemas, and tables in Admin UI
- Table Autocomplete: Type table names in query editor, press Tab or Escape
- Column Autocomplete: Type column names after table name in query
- Search: Use search box in Tables sidebar to find tables by name
Note: Views are currently not displayed in the schema browser (only tables are shown)
Features
- Tables Sidebar: Browse schemas and tables, view column details
- Autocomplete: Type table/column names in query editor, press Tab or Escape
- Search: Use search box in Tables sidebar to find tables by name
User Impersonation
Querybook connects to Trino as admin but executes queries as the logged-in user via Trino's impersonation feature. This provides:
- Query Attribution: Queries are attributed to the actual user, not the admin account
- Audit Logging: Trino logs show the real user who executed each query
- Access Control: Future per-user access policies can be enforced
How it Works:
- User logs into Querybook with Keycloak
- Querybook connects to Trino using admin credentials
- Querybook sends queries with X-Trino-User: <username> header
- Trino impersonates the user (allowed by access control rules)
- Query runs as if executed by the actual user
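The flow above can be reproduced by hand against Trino's HTTP API to confirm that impersonation works outside Querybook. The sketch below assumes password authentication over HTTPS and a `TRINO_URL` variable such as https://trino.example.com; `trino_query_as` is a hypothetical helper, and the response is a JSON document whose nextUri must be followed to obtain results.

```shell
# Hypothetical helper: trino_query_as ADMIN_USER ADMIN_PASSWORD END_USER SQL
# Authenticates as the admin account and asks Trino to run the statement
# as END_USER via the X-Trino-User header.
# Assumes TRINO_URL is set, e.g. TRINO_URL=https://trino.example.com
trino_query_as() {
    curl --silent --show-error \
        --user "$1:$2" \
        --header "X-Trino-User: $3" \
        --data "$4" \
        "${TRINO_URL}/v1/statement"
}

# Usage (requires network access to Trino, so it is commented out here):
# TRINO_URL=https://trino.example.com
# trino_query_as admin "$(just trino::admin-password)" alice 'SELECT current_user'
```

If impersonation is allowed, the query is accepted and attributed to alice; if not, Trino rejects the request with an access denied error for the admin account.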
Architecture
External Users
↓
Cloudflare Tunnel (HTTPS)
↓
Traefik Ingress (HTTPS)
├─ Traefik Middleware (X-Forwarded-*, WebSocket upgrade)
└─ Backend: HTTP
↓
Querybook Web
├─ OAuth2 → Keycloak (authentication)
├─ PostgreSQL (metadata)
├─ Redis (cache/sessions)
└─ WebSocket (real-time query updates)
↓
Querybook Workers
↓
Trino (HTTPS via external hostname)
└─ Password auth + User impersonation
Key Components:
- Traefik Middleware: Handles WebSocket upgrade headers and X-Forwarded-* headers
- OAuth2 Integration: Uses standard OIDC scopes (openid, email, profile) with groups mapper
- Trino Connection: Must use external HTTPS hostname (not internal service name)
- User Impersonation: Admin credentials with X-Trino-User header for query attribution
Pod Security Standards
Current Configuration: privileged (enforce) / baseline (warn, audit)
Querybook namespace is configured with the following Pod Security Standards:
pod-security.kubernetes.io/enforce: privileged
pod-security.kubernetes.io/warn: baseline
pod-security.kubernetes.io/audit: baseline
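For reference, the same labels can be applied or re-applied manually with kubectl. This is a sketch of the equivalent command, not necessarily how the install recipe sets them; `apply_psa_labels` is a hypothetical wrapper.

```shell
# Hypothetical wrapper: apply_psa_labels [NAMESPACE]
# Sets the Pod Security Standards labels documented above.
apply_psa_labels() {
    ns="${1:-querybook}"
    kubectl label namespace "$ns" --overwrite \
        pod-security.kubernetes.io/enforce=privileged \
        pod-security.kubernetes.io/warn=baseline \
        pod-security.kubernetes.io/audit=baseline
}

# Usage (requires cluster access, so it is commented out here):
# apply_psa_labels querybook
```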
Why Not Restricted or Baseline?
Querybook's embedded Elasticsearch component requires privileged containers and special Linux capabilities that violate both restricted and baseline Pod Security Standards:
Elasticsearch Requirements:
- privileged: true - Container must run in privileged mode
- capabilities.add: [IPC_LOCK, SYS_RESOURCE] - Requires Linux capabilities for memory locking
- sysctl -w vm.max_map_count=262144 - Init container needs privileged mode to configure kernel parameters
These requirements are necessary for Elasticsearch to:
- Lock memory to prevent swapping (performance)
- Set virtual memory map count (stability)
- Configure ulimit for unlimited locked memory
Security Implications:
- Elasticsearch containers run with elevated privileges
- Init containers can modify kernel parameters
- Other components (web, worker, scheduler, redis) run without special privileges
Mitigation:
- warn and audit at baseline level to track violations
- Web init container (copy-keycloak-auth) uses restricted-level security context
- Future: Consider external Elasticsearch service to enable stricter Pod Security Standards
Component Security Status:
| Component | Privileges Required | Security Level |
|---|---|---|
| Elasticsearch | privileged=true, IPC_LOCK, SYS_RESOURCE | Violates baseline |
| Web | None (container), runAsNonRoot (initContainer) | Baseline-ready |
| Worker | None | Baseline-ready |
| Scheduler | None | Baseline-ready |
| Redis | None | Baseline-ready |
To check current Pod Security Standards configuration:
kubectl get namespace querybook -o jsonpath='{.metadata.labels}' | jq
Authentication
User Login (OAuth2)
- Users authenticate via Keycloak
- Standard OIDC flow with Authorization Code grant
- Group membership included in UserInfo endpoint response
- Session stored in Redis
Admin Access
- Controlled by Keycloak group membership
- Users in querybook-admin group have full admin privileges
- Regular users can create and manage their own datadocs
Trino Connection
- Uses password authentication (admin user)
- Connects via external HTTPS hostname (Traefik provides TLS)
- Python Trino client enforces HTTPS when authentication is used
- User impersonation via X-Trino-User header
Management
Upgrade Querybook
just querybook::upgrade
Updates the Helm deployment with current configuration.
Uninstall
# Keep PostgreSQL database
just querybook::uninstall false
# Delete PostgreSQL database too
just querybook::uninstall true
Troubleshooting
Check Pod Status
kubectl get pods -n querybook
WebSocket Connection Fails
- Verify Traefik middleware exists: kubectl get middleware querybook-headers -n querybook
- Check WebSocket upgrade headers in middleware configuration
- Ensure Ingress annotation references middleware: querybook-querybook-headers@kubernetescrd
OAuth Login Fails
- Verify Keycloak client exists: just keycloak::list-clients
- Check redirect URL: https://<querybook-host>/oauth2callback
- Verify client secret matches: Compare Vault/K8s secret with Keycloak
- Check Keycloak is accessible from Querybook pods
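One way to check reachability from inside the cluster is to fetch the realm's OIDC discovery document from a Querybook pod. The sketch below makes several assumptions: the Keycloak version serves realms under /realms/<realm> (older releases used an /auth prefix), the worker image contains curl, and KEYCLOAK_HOST and KEYCLOAK_REALM are set as in the Configuration section. Both function names are hypothetical.

```shell
# Hypothetical helper: prints the OIDC discovery URL for the configured realm.
keycloak_discovery_url() {
    printf 'https://%s/realms/%s/.well-known/openid-configuration\n' \
        "$KEYCLOAK_HOST" "$KEYCLOAK_REALM"
}

# Hypothetical helper: fetches the discovery document from inside a worker pod.
# Assumes curl is available in the image; fails with a non-zero exit code
# if Keycloak cannot be reached or returns an HTTP error.
check_keycloak_from_pod() {
    kubectl exec -n "${QUERYBOOK_NAMESPACE:-querybook}" deployment/worker -- \
        curl -fsS "$(keycloak_discovery_url)"
}

# Usage (requires cluster access, so it is commented out here):
# KEYCLOAK_HOST=auth.example.com KEYCLOAK_REALM=buunstack check_keycloak_from_pod
```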
Trino Connection Fails
- Error: "cannot use authentication with HTTP"
  - Must use external hostname with HTTPS: trino://trino.example.com:443?SSL=true
  - Do NOT use internal service name (e.g., trino.trino.svc.cluster.local:8080)
  - Python Trino client enforces HTTPS when authentication is used
- Error: "500 Internal Server Error"
  - Verify Trino is accessible via external hostname
  - Check Trino admin password: just trino::admin-password
  - Test Trino connection manually with curl
- Error: "Access Denied: User admin cannot impersonate user X"
  - Verify Trino access control is configured
  - Check impersonation rules: kubectl exec -n trino deployment/trino-coordinator -- cat /etc/trino/access-control/rules.json
  - Ensure admin can impersonate all users
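When inspecting rules.json, the part that matters for this error is the impersonation section of Trino's file-based access control. The snippet below prints a minimal example of such a section for comparison. It is a sketch under the assumption that file-based access control is in use, and the deployed file will normally contain additional sections; `print_impersonation_rules` is a hypothetical helper.

```shell
# Hypothetical helper: prints a minimal impersonation rule that lets the
# "admin" principal run queries as any user. For comparison only; the
# deployed rules.json will usually contain more sections.
print_impersonation_rules() {
    cat <<'EOF'
{
  "impersonation": [
    {
      "original_user": "admin",
      "new_user": ".*",
      "allow": true
    }
  ]
}
EOF
}

print_impersonation_rules
```

If the deployed file has no rule matching the admin account as original_user, Trino denies the impersonation request.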
Query Execution Stuck
- Check worker pod logs: just querybook::logs worker
- Verify Redis is running: kubectl get pods -n querybook | grep redis
- Check Trino coordinator health: kubectl get pods -n trino
Database Connection Issues
- Verify PostgreSQL cluster is running: kubectl get cluster -n postgres
- Check database exists: just postgres::list-databases | grep querybook
- Verify secret exists: kubectl get secret querybook-config-secret -n querybook
Metastore Issues
Note: Metastore features are fully enabled in the buun-stack custom image with trino 0.336.0 and sqlalchemy-trino 0.5.0.
- Metastore not loading tables:
  - Verify Metastore configuration: Admin → Metastores → Edit
  - Check connection string includes catalog: trino://admin:password@host:443/iceberg?SSL=true
  - Test Trino connection with admin credentials
  - Check worker pod logs for errors: just querybook::logs worker
- Tables not appearing in sidebar:
  - Wait for initial metadata sync (may take a few minutes)
  - Trigger manual sync: Admin → Metastores → Sync
  - Verify schemas exist in Trino: SHOW SCHEMAS FROM iceberg
- Views not displayed:
  - This is a known limitation - only tables are currently shown
  - Views can still be queried directly by typing the full name