diff --git a/trino/README.md b/trino/README.md index 60edb6c..2d1ef0a 100644 --- a/trino/README.md +++ b/trino/README.md @@ -7,7 +7,7 @@ Fast distributed SQL query engine for big data analytics with Keycloak authentic This module deploys Trino using the official Helm chart with: - **Keycloak OAuth2 authentication** for Web UI access -- **Password authentication** for JDBC clients (Metabase, Querybook, etc.) +- **Password authentication** for programmatic clients (Metabase, Querybook, etc.) - **Access control with user impersonation** for multi-user query attribution - **PostgreSQL catalog** for querying PostgreSQL databases - **Iceberg catalog** with Lakekeeper (optional) @@ -19,7 +19,8 @@ This module deploys Trino using the official Helm chart with: - Kubernetes cluster (k3s) - Keycloak installed and configured - PostgreSQL cluster (CloudNativePG) -- MinIO (optional, for Iceberg catalog) +- MinIO (optional, for Iceberg catalog storage backend) +- Lakekeeper (optional, required before enabling Iceberg catalog for Trino) - External Secrets Operator (optional, for Vault integration) ## Installation @@ -34,7 +35,7 @@ You will be prompted for: 1. **Trino host (FQDN)**: e.g., `trino.example.com` 2. **PostgreSQL catalog setup**: Recommended for production use -3. **MinIO storage setup**: Optional, for Iceberg/Hive catalogs +3. **Iceberg catalog setup**: Optional, enables Iceberg REST catalog via Lakekeeper with MinIO storage ### What Gets Installed @@ -45,10 +46,10 @@ You will be prompted for: - Traefik Middleware for X-Forwarded-* header injection - Access control with impersonation rules (admin can impersonate any user) - PostgreSQL catalog (if selected) -- Iceberg catalog with Lakekeeper (if MinIO selected) +- Iceberg catalog with Lakekeeper (if Iceberg catalog selected) - Keycloak service account enabled for OAuth2 client credentials flow - - `lakekeeper` client scope added - - `lakekeeper` audience mapper configured + - `lakekeeper` client scope added to Trino client + - MinIO credentials configured for storage backend - TPCH catalog with sample data **Note**: Trino runs HTTP-only internally. HTTPS is provided by Traefik Ingress, which handles TLS termination. @@ -94,7 +95,9 @@ See [MCP.md](./MCP.md) for detailed instructions on integrating Trino with Claud ### Metabase Integration -**Important**: The Python Trino client (used by Metabase) requires HTTPS when using authentication. You must use the external hostname which has TLS provided by Traefik Ingress. +Metabase connects to Trino using the JDBC driver (Starburst driver). You must use the external hostname with SSL/TLS for authenticated connections. + +#### Connection Configuration 1. In Metabase, go to Admin → Databases → Add database 2. Select **Database type**: Starburst @@ -109,17 +112,15 @@ See [MCP.md](./MCP.md) for detailed instructions on integrating Trino with Claud SSL: Yes ``` -**Catalog Selection**: +#### Catalog Selection - Use `postgresql` to query PostgreSQL database tables - Use `iceberg` to query Iceberg tables via Lakekeeper - You can create multiple Metabase connections, one for each catalog -**Note**: Do NOT use internal Kubernetes hostnames like `trino.trino.svc.cluster.local:8080`. Internal services do not have TLS, and the Python Trino client enforces HTTPS when authentication is used. Always use the external hostname with port 443. - ### Querybook Integration -**Connection Configuration**: +#### Connection Configuration 1. In Querybook, create a new Environment and Query Engine 2. Configure the Trino connection: @@ -133,11 +134,13 @@ See [MCP.md](./MCP.md) for detailed instructions on integrating Trino with Claud 3. Optional: Configure `Proxy_user_id` to enable user impersonation -**User Impersonation**: +#### User Impersonation -Trino is configured with file-based access control that allows the `admin` user to impersonate any user. This enables: +Querybook can execute queries as logged-in users via Trino's impersonation feature. Trino is configured with file-based access control that allows the `admin` user to impersonate any user. -- Querybook to connect as `admin` but execute queries as the logged-in Querybook user +**Benefits:** + +- Querybook connects as `admin` but executes queries as the actual logged-in user - Proper query attribution and audit logging - User-specific access control (when configured) @@ -155,12 +158,45 @@ The impersonation rules are defined in `trino-values.gomplate.yaml`: } ``` -**Why External Hostname is Required**: +See the [Access Control](#access-control) section for detailed impersonation configuration. -- The Python Trino client enforces HTTPS when authentication is used (client-side requirement) -- Trino runs HTTP-only internally; TLS is provided by Traefik Ingress -- Internal service names (e.g., `trino.trino.svc.cluster.local:8080`) do not have TLS termination -- Therefore, you must use the external hostname (e.g., `trino.example.com:443`) which has TLS from Traefik +### External Hostname Requirement + +Both Metabase and Querybook **require the external hostname with HTTPS** for authenticated connections to Trino. Internal Kubernetes service names will not work. + +**Why external hostname is required:** + +1. **Client-side HTTPS enforcement**: + - Metabase JDBC driver enforces HTTPS for authenticated connections + - Querybook Python Trino client enforces HTTPS when authentication is used + - Both clients validate SSL/TLS certificates + +2. **Trino runs HTTP-only internally**: + - Trino coordinator listens on HTTP port 8080 inside the cluster + - No TLS termination within the Trino pods + - Internal service names (e.g., `trino.trino.svc.cluster.local:8080`) do not provide HTTPS + +3. **Traefik provides TLS termination**: + - External hostname (e.g., `trino.example.com:443`) routes through Traefik Ingress + - Traefik handles SSL/TLS termination with valid certificates + - Traefik forwards to Trino's internal HTTP endpoint + +**Connection requirements:** + +```plain +✅ CORRECT: trino.example.com:443 (HTTPS via Traefik) +❌ WRONG: trino.trino.svc.cluster.local:8080 (HTTP, no TLS) +``` + +**Architecture:** + +```plain +Client (Metabase/Querybook) + ↓ HTTPS (port 443) +Traefik Ingress + ↓ HTTP (port 8080) +Trino Coordinator +``` ### Example Queries @@ -221,7 +257,7 @@ Queries your CloudNativePG cluster: - Default schema: `public` - Database: `trino` -### Iceberg (Optional) +### Iceberg (Lakekeeper) Queries Iceberg tables via Lakekeeper REST Catalog: @@ -230,24 +266,93 @@ Queries Iceberg tables via Lakekeeper REST Catalog: - **REST Catalog**: Lakekeeper (Apache Iceberg REST Catalog implementation) - **Authentication**: OAuth2 client credentials flow with Keycloak -**How It Works**: +#### How It Works 1. Trino authenticates to Lakekeeper using OAuth2 (client credentials flow) 2. Lakekeeper provides Iceberg table metadata from its catalog 3. Trino reads actual data files directly from MinIO using static S3 credentials 4. Vended credentials are disabled; Trino uses pre-configured MinIO access keys -**Configuration**: +#### Configuration -The following settings are automatically configured during installation when MinIO storage is enabled: +The following settings are automatically configured when enabling the Iceberg catalog (`just trino::enable-iceberg-catalog`): -- Service account enabled on Trino Keycloak client -- `lakekeeper` client scope added to Trino client -- Audience mapper configured to include `aud: lakekeeper` in JWT tokens +- Service account enabled on Trino Keycloak client (for OAuth2 Client Credentials Flow) +- `lakekeeper` Client Scope created in Keycloak with audience mapper +- `lakekeeper` scope added to Trino client as default scope +- Audience mapper in `lakekeeper` scope adds `aud: lakekeeper` to JWT tokens - S3 file system factory enabled (`fs.native-s3.enabled=true`) - Static MinIO credentials provided via Kubernetes secrets -**Example Usage**: +#### OAuth2 Scope and Audience + +The Iceberg catalog connection to Lakekeeper uses OAuth2 Client Credentials Flow with the following scope configuration: + +```properties +iceberg.rest-catalog.oauth2.scope=openid profile lakekeeper +``` + +#### Purpose of lakekeeper scope + +The `lakekeeper` scope controls whether the JWT token includes the audience claim required by Lakekeeper: + +1. **Scope-based Control**: + - The `lakekeeper` Client Scope contains an audience mapper + - When `scope=lakekeeper` is included in the token request, the mapper is applied + - Without this scope parameter, the audience claim is not added + +2. **Audience Claim**: + - The audience mapper adds `"aud": "lakekeeper"` to the JWT token + - This happens only when the `lakekeeper` scope is requested + +3. **Token Validation**: + - Lakekeeper validates incoming JWT tokens and requires `aud` to contain `"lakekeeper"` + - Tokens without this audience claim are rejected + +4. **Security**: + - Prevents tokens issued for other purposes from accessing Lakekeeper + - Enforces explicit authorization through scope parameter + - Defense against token leakage/misuse + +#### Authentication Flow + +```plain +1. Trino requests token from Keycloak (Client Credentials Flow) + POST /realms/buunstack/protocol/openid-connect/token + - client_id: trino + - client_secret: [from service account] + - grant_type: client_credentials + - scope: openid profile lakekeeper + +2. Keycloak validates client credentials and generates JWT token + - Checks that 'lakekeeper' is in the requested scopes + - Applies the 'lakekeeper' Client Scope + - Audience mapper (in lakekeeper scope) adds "aud": "lakekeeper" to JWT + - Includes 'lakekeeper' scope in response + +3. Trino sends JWT token to Lakekeeper REST Catalog + Authorization: Bearer [JWT token] + +4. Lakekeeper validates JWT token: + - Verifies signature using JWKS from Keycloak + - Checks issuer matches LAKEKEEPER__OPENID_PROVIDER_URI + - Validates aud claim contains "lakekeeper" + - Rejects token if audience doesn't match + +5. Lakekeeper returns Iceberg table metadata to Trino +``` + +#### Important Notes + +- This OAuth2 authentication is **completely separate** from Trino Web UI OAuth2 authentication +- Web UI OAuth2: User login via browser (Authorization Code Flow) +- Iceberg REST Catalog OAuth2: Service-to-service authentication (Client Credentials Flow) +- The `lakekeeper` scope controls the audience claim: + - With scope: `scope=openid profile lakekeeper` → JWT includes `"aud": "lakekeeper"` + - Without scope: `scope=openid profile` → JWT does not include Lakekeeper audience +- The `lakekeeper` scope is only used for Trino→Lakekeeper communication, not for user authentication + +#### Example Usage ```sql -- List all namespaces (schemas) @@ -317,21 +422,120 @@ Removes: - Stored in Vault at `trino/password` - Requires external hostname with SSL/TLS -### Access Control +## Access Control -Trino uses file-based system access control with the following configuration: +Trino uses file-based system access control managed via Kubernetes ConfigMap. The configuration is defined in Helm values and automatically deployed. -**Catalogs**: All users can access all catalogs +### Configuration Structure -**Impersonation**: The `admin` user can impersonate any user +```yaml +accessControl: + type: configmap # Store rules in Kubernetes ConfigMap + refreshPeriod: 60s # Check for rule changes every 60 seconds + configFile: "rules.json" # Rules file name + rules: + rules.json: |- + { + "catalogs": [ + { + "allow": "all" # All users can access all catalogs + } + ], + "impersonation": [ + { + "original_user": "admin", # User allowed to impersonate + "new_user": ".*" # Regex: can impersonate any user + } + ] + } +``` -This configuration enables: +### Catalog Access -- **Querybook Integration**: Admin user connects and executes queries as logged-in users -- **Audit Logging**: Queries are attributed to the actual user, not the admin account -- **Future Access Control**: Can be extended to add user-specific catalog/schema restrictions +```json +"catalogs": [{"allow": "all"}] +``` -The access control rules are defined in `/etc/trino/access-control/rules.json` (automatically generated from Helm values). +- All authenticated users can access all catalogs (postgresql, iceberg, tpch) +- No catalog-level restrictions are enforced +- Can be extended to add user/group-specific catalog access rules + +### User Impersonation + +```json +"impersonation": [ + { + "original_user": "admin", + "new_user": ".*" + } +] +``` + +#### What it does + +- The `admin` user can execute queries as any other user +- `original_user`: The user performing the impersonation (must be authenticated) +- `new_user`: Regex pattern for allowed target users (`.*` = any user) + +#### How it works + +1. Client authenticates as `admin` with password +2. Client sends `X-Trino-User: actual_username` header +3. Trino validates impersonation is allowed (admin → actual_username) +4. Query executes with `actual_username` as the principal +5. Audit logs show `actual_username`, not `admin` + +#### Example: Querybook Integration + +```python +# Querybook connects to Trino +connection = trino.dbapi.connect( + host="trino.example.com", + port=443, + user="admin", # Authenticate as admin + http_scheme="https", + auth=trino.auth.BasicAuthentication("admin", "password") +) + +# Execute query as logged-in user +cursor = connection.cursor() +cursor.execute("SELECT * FROM iceberg.sales", + http_headers={"X-Trino-User": "alice@example.com"}) +``` + +Result: Query runs as `alice@example.com`, appears in Trino logs as executed by `alice@example.com`. + +**Use Cases:** + +- **Querybook/BI Tools**: Single admin connection, multi-user attribution +- **Audit Logging**: Track which user executed which queries +- **Future Access Control**: Enable per-user data access policies +- **Query Attribution**: Correct usage statistics per user + +**Security Considerations:** + +- Only the `admin` user can impersonate others +- Regular users cannot impersonate anyone +- Impersonation targets can be restricted with specific regex patterns (e.g., `"new_user": ".*@company\\.com"`) +- Consider adding group-based impersonation rules for finer control + +### Configuration Management + +- **Storage**: Rules stored in ConfigMap `trino-coordinator-access-control` +- **Refresh**: Trino checks for changes every 60 seconds (no pod restart required) +- **Location**: Mounted at `/etc/trino/access-control/rules.json` in coordinator pod +- **Updates**: Modify Helm values and run `just trino::upgrade` to update rules + +### Verify Configuration + +```bash +# View current access control rules +kubectl exec -n trino deployment/trino-coordinator -- \ + cat /etc/trino/access-control/rules.json + +# Check ConfigMap +kubectl get configmap trino-coordinator-access-control -n trino -o yaml +``` ## Architecture @@ -363,7 +567,7 @@ Data Sources: └─ Static credentials ``` -**Key Components**: +### Key Components - **TLS Termination**: Traefik Ingress handles HTTPS, Trino runs HTTP-only internally - **Traefik Middleware**: Injects X-Forwarded-Proto, X-Forwarded-Host, X-Forwarded-Port headers for correct URL generation diff --git a/trino/justfile b/trino/justfile index 4d8670d..b04dd20 100644 --- a/trino/justfile +++ b/trino/justfile @@ -188,11 +188,16 @@ delete-postgres-secret: @kubectl delete externalsecret trino-postgres-external-secret -n ${TRINO_NAMESPACE} \ --ignore-not-found -# Setup MinIO storage for Trino (optional) -setup-minio-storage: +# Enable Iceberg catalog with Lakekeeper and MinIO (optional) +enable-iceberg-catalog: #!/bin/bash set -euo pipefail - echo "Setting up MinIO storage for Trino..." + echo "Enabling Iceberg catalog with Lakekeeper integration..." + + if ! kubectl get service lakekeeper -n ${LAKEKEEPER_NAMESPACE} &>/dev/null; then + echo "Error: Lakekeeper is not installed. Please install Lakekeeper first with 'just lakekeeper::install'" + exit 1 + fi if ! kubectl get service minio -n ${MINIO_NAMESPACE} &>/dev/null; then echo "Error: MinIO is not installed. Please install MinIO first with 'just minio::install'" @@ -206,12 +211,10 @@ setup-minio-storage: echo "Enabling service account for Trino client..." just keycloak::enable-service-account ${KEYCLOAK_REALM} trino - echo "Adding lakekeeper scope to Trino client..." + echo "Adding 'lakekeeper' scope to Trino client..." + echo "Note: The 'lakekeeper' client scope must be created by Lakekeeper installation first." just keycloak::add-scope-to-client ${KEYCLOAK_REALM} trino lakekeeper - echo "Adding lakekeeper audience mapper to Trino client..." - just keycloak::add-audience-mapper trino lakekeeper - echo "Keycloak configuration completed" if helm status external-secrets -n ${EXTERNAL_SECRETS_NAMESPACE} &>/dev/null; then @@ -236,7 +239,7 @@ setup-minio-storage: --from-literal=endpoint="http://minio.${MINIO_NAMESPACE}.svc.cluster.local:9000" echo "MinIO secret created directly in Kubernetes" fi - echo "MinIO storage setup completed" + echo "Iceberg catalog setup completed" # Delete MinIO secret delete-minio-secret: @@ -262,8 +265,8 @@ install: just setup-postgres-catalog if [ -z "${TRINO_MINIO_ENABLED}" ]; then - if gum confirm "Setup MinIO storage (for Iceberg catalogs)?"; then - just setup-minio-storage + if gum confirm "Enable Iceberg catalog with Lakekeeper and MinIO?"; then + just enable-iceberg-catalog TRINO_MINIO_ENABLED="true" else TRINO_MINIO_ENABLED="false"