buun-stack/querybook/README.md
2025-10-19 23:48:05 +09:00

Querybook

Pinterest's big data querying UI with a notebook interface, deployed here with Keycloak OAuth authentication and Trino integration.

Overview

This module deploys Querybook using the official Helm chart from Pinterest with:

  • Keycloak OAuth2 authentication for user login
  • Trino integration with user impersonation for query attribution
  • PostgreSQL backend for metadata storage
  • Redis for caching and session management
  • Traefik integration with WebSocket support for real-time query execution
  • Group-based admin access via Keycloak groups

Prerequisites

  • Kubernetes cluster (k3s)
  • Keycloak installed and configured
  • PostgreSQL cluster (CloudNativePG)
  • Trino with access control configured
  • External Secrets Operator (optional, for Vault integration)

Installation

Basic Installation

just querybook::install

You will be prompted for:

  1. Querybook host (FQDN): e.g., querybook.example.com
  2. Keycloak host (FQDN): e.g., auth.example.com

What Gets Installed

  • Querybook web service
  • Querybook scheduler (background jobs)
  • Querybook workers (query execution)
  • PostgreSQL database for Querybook metadata
  • Redis for caching and sessions
  • Keycloak OAuth2 client (confidential client)
  • querybook-admin group in Keycloak for admin access
  • Traefik Middleware for WebSocket and header forwarding

Configuration

Environment variables (set in .env.local or override):

QUERYBOOK_NAMESPACE=querybook          # Kubernetes namespace
QUERYBOOK_HOST=querybook.example.com   # External hostname
KEYCLOAK_HOST=auth.example.com         # Keycloak hostname
KEYCLOAK_REALM=buunstack               # Keycloak realm name

# Optional: Use custom Docker image (for testing fixes/patches)
QUERYBOOK_CUSTOM_IMAGE=localhost:30500/querybook        # Custom image repository
QUERYBOOK_CUSTOM_IMAGE_TAG=trino-metastore              # Custom image tag (default: latest)
QUERYBOOK_CUSTOM_IMAGE_PULL_POLICY=Always               # Image pull policy (default: Always)

Using Custom Image

To use a custom Querybook image (e.g., with patches or fixes):

# Set environment variables
export QUERYBOOK_CUSTOM_IMAGE=localhost:30500/querybook
export QUERYBOOK_CUSTOM_IMAGE_TAG=trino-metastore

# Install or upgrade Querybook
just querybook::install
# or
just querybook::upgrade

When to use a custom image:

  • Testing bug fixes before they are merged upstream
  • Applying patches for specific issues (e.g., datetime JSON serialization)
  • Using Trino Metastore integration (requires sqlalchemy-trino)
  • Using modified versions with custom features

Custom image includes (trino-metastore tag):

  • Datetime JSON serialization fixes for WebSocket communication
  • sqlalchemy-trino package for Metastore integration

Custom image behavior (when QUERYBOOK_CUSTOM_IMAGE is set):

  • Pull policy: Always (default, override with QUERYBOOK_CUSTOM_IMAGE_PULL_POLICY)
  • Ensures latest image is always pulled from registry

Default behavior (when QUERYBOOK_CUSTOM_IMAGE is not set):

  • Uses official image: querybook/querybook:latest
  • Pull policy: IfNotPresent
  • Note: Official image does not include sqlalchemy-trino, so Trino Metastore integration will not work
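
The two behaviors above amount to a small fallback rule. The following is a minimal sketch of that rule in shell, assuming only the defaults documented in this README; `resolve_querybook_image` is a hypothetical helper for illustration and is not taken from the module's justfile.

```shell
# Hypothetical helper: prints "<image>:<tag> <pull-policy>" using the
# defaults described in this README. Not part of the actual justfile.
resolve_querybook_image() {
    if [ -n "${QUERYBOOK_CUSTOM_IMAGE:-}" ]; then
        # Custom image: tag defaults to "latest", pull policy to "Always"
        echo "${QUERYBOOK_CUSTOM_IMAGE}:${QUERYBOOK_CUSTOM_IMAGE_TAG:-latest} ${QUERYBOOK_CUSTOM_IMAGE_PULL_POLICY:-Always}"
    else
        # Official image with the IfNotPresent pull policy
        echo "querybook/querybook:latest IfNotPresent"
    fi
}
```

With `QUERYBOOK_CUSTOM_IMAGE=localhost:30500/querybook` and `QUERYBOOK_CUSTOM_IMAGE_TAG=trino-metastore` set, the helper prints `localhost:30500/querybook:trino-metastore Always`.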

Building Custom Image

To build a custom Querybook image with bug fixes and sqlalchemy-trino support:

  1. Clone Querybook repository:

    git clone https://github.com/pinterest/querybook.git
    cd querybook
    
  2. Apply bug fix patch:

    # Copy patch file from buun-stack repository
    # cp /path/to/buun-stack/querybook/querybook-fix-socketio-disconnect.diff .
    
    # Apply the patch
    git apply querybook-fix-socketio-disconnect.diff
    

    Patch includes:

    • Fix for WebSocket disconnect handler signature (python-socketio 5.12.0+ compatibility)
  3. Create requirements/local.txt:

    cat > requirements/local.txt <<EOF
    # Local additional requirements for buun-stack
    # SQLAlchemy dialect for Trino (required for Metastore)
    # IMPORTANT: Pin both trino and sqlalchemy-trino versions to maintain compatibility
    # - trino must be 0.305.0 (what Querybook is tested with)
    # - sqlalchemy-trino 0.2.2 is compatible with trino ~=0.305
    # - sqlalchemy-trino >=0.3.0 requires trino>=0.310 (incompatible)
    # - Both must be explicitly pinned to prevent pip from upgrading them when extra.txt is installed
    trino==0.305.0
    sqlalchemy-trino==0.2.2
    EOF
    

    Critical: Both packages must be pinned:

    • trino==0.305.0 prevents pip from upgrading to 0.310+ when resolving dependencies
    • sqlalchemy-trino==0.2.2 is the only version compatible with trino 0.305
    • When EXTRA_PIP_INSTALLS=extra.txt is used, pip installs many packages which can trigger dependency upgrades
    • Without explicitly pinning trino, pip may upgrade it to satisfy other package requirements, breaking query execution
  4. Build the Docker image:

    # For remote Docker host (e.g., k3s node)
    DOCKER_HOST=ssh://yourdomain.com docker build \
        --no-cache \
        --build-arg EXTRA_PIP_INSTALLS=extra.txt \
        -t localhost:30500/querybook:trino-metastore .
    
    # For local Docker
    docker build \
        --no-cache \
        --build-arg EXTRA_PIP_INSTALLS=extra.txt \
        -t localhost:30500/querybook:trino-metastore .
    

    Important: Use --no-cache when changing requirements/local.txt to ensure pip installs the correct package versions. Without it, Docker may reuse a cached layer from an earlier pip install, leaving old package versions in the image.

  5. Push to registry:

    DOCKER_HOST=ssh://yourdomain.com docker push localhost:30500/querybook:trino-metastore
    # or for local Docker
    docker push localhost:30500/querybook:trino-metastore
    
  6. Deploy to Kubernetes:

    export QUERYBOOK_CUSTOM_IMAGE=localhost:30500/querybook
    export QUERYBOOK_CUSTOM_IMAGE_TAG=trino-metastore
    just querybook::upgrade
    
  7. Restart Pods to use new image:

    # Delete all Querybook pods to force image pull
    kubectl delete pod -n querybook -l app=querybook
    
    # Wait for pods to be ready
    kubectl wait --for=condition=ready pod -l app=querybook -n querybook --timeout=120s
    
    # Verify correct package versions
    kubectl exec -n querybook deployment/worker -- pip show trino sqlalchemy-trino | grep -E "Name:|Version:"
    

    Expected output:

    Name: trino
    Version: 0.305.0
    Name: sqlalchemy-trino
    Version: 0.2.2
    

Notes:

  • The Dockerfile automatically includes requirements/local.txt if it exists (lines 40-42)
  • EXTRA_PIP_INSTALLS=extra.txt ensures additional dependencies are installed during build
  • The custom image will have both the official Querybook packages and sqlalchemy-trino
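
Because a missing pin only shows up later as broken query execution, it can help to check requirements/local.txt before building. The sketch below is one way to do that; `check_trino_pins` is a hypothetical helper, and it assumes each pin sits on its own line with no leading whitespace, as the heredoc in step 3 writes it.

```shell
# Hypothetical pre-build check: fails unless both pins from step 3 are
# present in the given file (default: requirements/local.txt).
check_trino_pins() {
    local file="${1:-requirements/local.txt}"
    local pin
    for pin in "trino==0.305.0" "sqlalchemy-trino==0.2.2"; do
        # -x matches the whole line, -F treats the pin as a literal string
        if ! grep -qxF "$pin" "$file"; then
            echo "missing pin: $pin" >&2
            return 1
        fi
    done
    echo "pins OK"
}
```

Running `check_trino_pins && docker build ...` from the repository root would skip the build when either pin is absent.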

Usage

Access Querybook

  1. Navigate to https://your-querybook-host/
  2. Click "Login with OAuth" to authenticate with Keycloak
  3. Create datadocs (notebooks) and execute queries

Grant Admin Access

Add users to the querybook-admin group:

just keycloak::add-user-to-group <username> querybook-admin

Admin users can:

  • Manage query engines
  • Configure data sources
  • Manage user permissions
  • View all datadocs

Configure Trino Query Engine

  1. Log in as an admin user

  2. Navigate to Admin → Query Engines

  3. Click "Add Query Engine"

  4. Configure basic settings:

    Name: Trino
    Language: Trino
    Executor: Trino (not SqlAlchemy)
    Environment: production (or your preferred environment name)
    
  5. Configure connection settings:

    Connection String: trino://trino.example.com:443/iceberg?SSL=true
    Username: admin
    Password: [from just trino::admin-password]
    Proxy_user_id: (leave empty to use admin username)
    

    Important Notes:

    • Catalog in Connection String: Include /iceberg (or your catalog name) after the port
      • With catalog: trino://host:443/iceberg?SSL=true → queries work without iceberg. prefix
      • Without catalog: trino://host:443?SSL=true → queries fail with "Catalog 'hive' not found"
    • Proxy_user_id: Leave empty (defaults to Username field = admin)
    • For user impersonation, configure Trino access control separately
  6. Optional: Link to Metastore for table autocompletion:

    • Metastore: Select created Metastore (see Metastore Configuration section below)
    • Enables autocomplete for table and column names in query editor
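
The connection string format from step 5 can be assembled as in the sketch below. `trino_connection_string` is a hypothetical helper for illustration; the host, port, and catalog values are the examples used in this README.

```shell
# Hypothetical helper: builds a Querybook connection string for Trino.
# Usage: trino_connection_string <host> <port> [catalog]
trino_connection_string() {
    local host="$1" port="$2" catalog="${3:-}"
    if [ -n "$catalog" ]; then
        # With a catalog, queries work without the catalog prefix
        echo "trino://${host}:${port}/${catalog}?SSL=true"
    else
        # Without a catalog, no default catalog is passed in the URL
        echo "trino://${host}:${port}?SSL=true"
    fi
}

trino_connection_string trino.example.com 443 iceberg
# prints: trino://trino.example.com:443/iceberg?SSL=true
```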

Metastore Configuration

A Metastore enables table and column autocompletion and provides a browsable table catalog.

Prerequisites: Custom image with sqlalchemy-trino (official image does not include this package)

  1. Navigate to Admin → Metastores

  2. Click "Create Metastore"

  3. Configure:

    Name: Trino Iceberg
    Metastore Loader: SqlAlchemyMetastoreLoader
    Connection String: trino://admin:[password]@trino.example.com:443/iceberg?SSL=true
    Acct Info (Key-Value):
      http_scheme = https
    Impersonate: OFF (recommended for shared table catalog)
    

    Important Notes:

    • Include authentication in Connection String: admin:[password]@host
    • Include catalog in Connection String: /iceberg after port
    • http_scheme must be set to https in Acct Info
    • Keep Impersonate OFF unless you need per-user table filtering
  4. Click "Run Task" to sync table metadata

  5. Verify in Admin → Metastores that "Last Synced" timestamp is updated

  6. Check left sidebar "Tables" for table list

Scheduled Updates (recommended):

  • Navigate to Admin → Metastores → [your metastore] → Schedule
  • Set cron expression: 0 */6 * * * (sync every 6 hours)

Usage:

  • Tables Sidebar: Browse schemas and tables, view column details
  • Autocomplete: Type table/column names in query editor, press Tab or Escape
  • Search: Use search box in Tables sidebar to find tables by name

User Impersonation

Querybook connects to Trino as admin but executes queries as the logged-in user via Trino's impersonation feature. This provides:

  • Query Attribution: Queries are attributed to the actual user, not the admin account
  • Audit Logging: Trino logs show the real user who executed each query
  • Access Control: Future per-user access policies can be enforced

How it Works:

  1. User logs into Querybook with Keycloak
  2. Querybook connects to Trino using admin credentials
  3. Querybook sends queries with X-Trino-User: <username> header
  4. Trino impersonates the user (allowed by access control rules)
  5. Query runs as if executed by the actual user
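
The flow above can be reproduced by hand against Trino's `/v1/statement` REST endpoint. This is a rough sketch rather than what Querybook does internally: `trino_submit_as`, `TRINO_HOST`, `TRINO_ADMIN_PASSWORD`, and `DRY_RUN` are hypothetical names introduced for illustration, and the response is only the first page of the statement protocol (a real client follows `nextUri` to fetch results).

```shell
# Hypothetical sketch: authenticate as "admin" but ask Trino to run the
# statement as another user via the X-Trino-User header.
trino_submit_as() {
    local run_as="$1" sql="$2"
    local url="https://${TRINO_HOST:-trino.example.com}/v1/statement"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Show what would be sent without contacting Trino
        echo "POST ${url} X-Trino-User: ${run_as}"
        return 0
    fi
    curl -sS \
        -u "admin:${TRINO_ADMIN_PASSWORD}" \
        -H "X-Trino-User: ${run_as}" \
        --data "$sql" \
        "$url"
}
```

For example, `trino_submit_as alice 'SELECT current_user'` should be attributed to alice if the impersonation rules allow it, and should be rejected with an access denied error otherwise.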

Architecture

External Users
      ↓
Cloudflare Tunnel (HTTPS)
      ↓
Traefik Ingress (HTTPS)
      ├─ Traefik Middleware (X-Forwarded-*, WebSocket upgrade)
      └─ Backend: HTTP
      ↓
Querybook Web
      ├─ OAuth2 → Keycloak (authentication)
      ├─ PostgreSQL (metadata)
      ├─ Redis (cache/sessions)
      └─ WebSocket (real-time query updates)
      ↓
Querybook Workers
      ↓
Trino (HTTPS via external hostname)
      └─ Password auth + User impersonation

Key Components:

  • Traefik Middleware: Handles WebSocket upgrade headers and X-Forwarded-* headers
  • OAuth2 Integration: Uses standard OIDC scopes (openid, email, profile) with groups mapper
  • Trino Connection: Must use external HTTPS hostname (not internal service name)
  • User Impersonation: Admin credentials with X-Trino-User header for query attribution

Authentication

User Login (OAuth2)

  • Users authenticate via Keycloak
  • Standard OIDC flow with Authorization Code grant
  • Group membership included in UserInfo endpoint response
  • Session stored in Redis

Admin Access

  • Controlled by Keycloak group membership
  • Users in querybook-admin group have full admin privileges
  • Regular users can create and manage their own datadocs

Trino Connection

  • Uses password authentication (admin user)
  • Connects via external HTTPS hostname (Traefik provides TLS)
  • Python Trino client enforces HTTPS when authentication is used
  • User impersonation via X-Trino-User header

Management

Upgrade Querybook

just querybook::upgrade

Updates the Helm deployment with current configuration.

Uninstall

# Keep PostgreSQL database
just querybook::uninstall false

# Delete PostgreSQL database too
just querybook::uninstall true

Troubleshooting

Check Pod Status

kubectl get pods -n querybook

WebSocket Connection Fails

  • Verify Traefik middleware exists: kubectl get middleware querybook-headers -n querybook
  • Check WebSocket upgrade headers in middleware configuration
  • Ensure Ingress annotation references middleware: querybook-querybook-headers@kubernetescrd

OAuth Login Fails

  • Verify Keycloak client exists: just keycloak::list-clients
  • Check redirect URL: https://<querybook-host>/oauth2callback
  • Verify client secret matches: Compare Vault/K8s secret with Keycloak
  • Check Keycloak is accessible from Querybook pods

Trino Connection Fails

  • Error: "cannot use authentication with HTTP"

    • Must use external hostname with HTTPS: trino://trino.example.com:443?SSL=true
    • Do NOT use internal service name (e.g., trino.trino.svc.cluster.local:8080)
    • Python Trino client enforces HTTPS when authentication is used
  • Error: "500 Internal Server Error"

    • Verify Trino is accessible via external hostname
    • Check Trino admin password: just trino::admin-password
    • Test Trino connection manually with curl
  • Error: "Access Denied: User admin cannot impersonate user X"

    • Verify Trino access control is configured
    • Check impersonation rules: kubectl exec -n trino deployment/trino-coordinator -- cat /etc/trino/access-control/rules.json
    • Ensure admin can impersonate all users
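
The first error in this list can be caught before saving a Query Engine by linting the connection string. The sketch below encodes only the two rules stated above (external hostname, SSL enabled); `lint_trino_connection_string` is a hypothetical helper, and matching on `.svc.cluster.local` is an assumption about how internal service names look in the cluster.

```shell
# Hypothetical lint for a Trino connection string.
# Prints one warning per problem, or "ok" if none are found.
lint_trino_connection_string() {
    local url="$1" problems=0
    case "$url" in
        *.svc.cluster.local*)
            echo "warning: internal service name; use the external HTTPS hostname"
            problems=1 ;;
    esac
    case "$url" in
        *SSL=true*) ;;
        *)
            echo "warning: SSL=true is missing"
            problems=1 ;;
    esac
    if [ "$problems" -eq 0 ]; then
        echo "ok"
    fi
    return "$problems"
}
```

Calling it with `trino://trino.trino.svc.cluster.local:8080` prints both warnings and returns a non-zero status, while `trino://trino.example.com:443/iceberg?SSL=true` prints `ok`.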

Query Execution Stuck

  • Check worker pod logs: just querybook::logs worker
  • Verify Redis is running: kubectl get pods -n querybook | grep redis
  • Check Trino coordinator health: kubectl get pods -n trino

Database Connection Issues

  • Verify PostgreSQL cluster is running: kubectl get cluster -n postgres
  • Check database exists: just postgres::list-databases | grep querybook
  • Verify secret exists: kubectl get secret querybook-config-secret -n querybook

Metastore Issues

  • Tables sidebar is empty

    • Check Admin → Metastores for "Last Synced" timestamp
    • Click "Run Task" to manually sync
    • Verify Metastore is linked to Query Engine (Admin → Query Engines → Metastore field)
    • Check worker logs: kubectl logs -n querybook deployment/worker --tail=100 | grep metastore
  • Error: "Can't load plugin: sqlalchemy.dialects:trino"

    • Official Querybook image does not include sqlalchemy-trino
    • Use custom image with QUERYBOOK_CUSTOM_IMAGE_TAG=trino-metastore
    • See "Using Custom Image" section above
  • Error: "Connection.init() got an unexpected keyword argument 'password'"

    • Do not use password key in Acct Info
    • Embed authentication in Connection String: trino://admin:[password]@host:port/catalog?SSL=true
    • Set http_scheme = https in Acct Info
  • Only system.* schemas visible

    • Connection String is missing catalog specification
    • Add /iceberg (or your catalog) after port: trino://host:443/iceberg?SSL=true
  • Autocomplete not working

    • Verify Query Engine has Metastore linked (Admin → Query Engines → Metastore field)
    • Refresh DataDoc page (F5) after linking Metastore
    • Check Environment matches between DataDoc and Query Engine
    • Try Tab or Escape key instead of Ctrl+Space (macOS shortcut conflict)

References