docs: reconstruct READMEs

Masaki Yatsu
2025-10-12 15:24:28 +09:00
parent e19e2fa310
commit 29d6bb15c2
7 changed files with 560 additions and 170 deletions

README.md

@@ -1,27 +1,51 @@
# buun-stack

A remotely accessible Kubernetes home lab with OIDC authentication. Build a modern development environment with integrated data analytics and AI capabilities. Includes a complete open data stack for data ingestion, transformation, serving, and orchestration—built on open-source components you can run locally and port to any cloud.

- 📺 [Remote-Accessible Kubernetes Home Lab](https://www.youtube.com/playlist?list=PLbAvvJK22Y6vJPrUC6GrfNMXneYspckAo) (YouTube playlist)
- 📝 [Building a Remote-Accessible Kubernetes Home Lab with k3s](https://dev.to/buun-ch/building-a-remote-accessible-kubernetes-home-lab-with-k3s-5g05) (Dev.to article)

## Architecture

### Foundation

- **Kubernetes**: [k3s](https://k3s.io/) lightweight distribution
- **Automation**: [Just](https://just.systems/) task runner with templated configurations
- **Remote Access**: [Cloudflare Tunnel](https://developers.cloudflare.com/cloudflare-one/connections/connect-networks/) for secure internet connectivity

### Core Components (Required)

- **Database**: [PostgreSQL](https://www.postgresql.org/) cluster with pgvector extension
- **Identity & Access**: [Keycloak](https://www.keycloak.org/) for OIDC authentication

### Recommended Components

- **Secrets Management**: [HashiCorp Vault](https://www.vaultproject.io/) with [External Secrets Operator](https://external-secrets.io/)
  - Used by most stack modules for secure credential management
  - The stack can be deployed without it, but it is highly recommended

### Storage (Optional)

- **Block Storage**: [Longhorn](https://longhorn.io/) distributed block storage
- **Object Storage**: [MinIO](https://min.io/) S3-compatible storage

### Data & Analytics (Optional)

- **Interactive Computing**: [JupyterHub](https://jupyter.org/hub) for collaborative notebooks
- **Analytics Database**: [ClickHouse](https://clickhouse.com/) for high-performance analytics
- **Vector Database**: [Qdrant](https://qdrant.tech/) for vector search and AI/ML applications
- **Iceberg REST Catalog**: [Lakekeeper](https://lakekeeper.io/) for Apache Iceberg table management
- **Business Intelligence**: [Metabase](https://www.metabase.com/) for data visualization
- **Data Catalog**: [DataHub](https://datahubproject.io/) for metadata management

### Orchestration (Optional)

- **Data Orchestration**: [Dagster](https://dagster.io/) for modern data pipelines
- **Workflow Orchestration**: [Apache Airflow](https://airflow.apache.org/) for task scheduling

### Security (Optional)

- **Authentication Proxy**: [OAuth2 Proxy](https://oauth2-proxy.github.io/oauth2-proxy/) for adding Keycloak authentication

## Quick Start

@@ -55,7 +79,7 @@ For detailed step-by-step instructions, see the [Installation Guide](./INSTALLAT

```bash
just k8s::setup-oidc-auth
```

## Component Details

### k3s

@@ -114,187 +138,51 @@ S3-compatible object storage system providing:

### JupyterHub

Multi-user platform for interactive computing with Keycloak authentication and persistent storage.

[📖 See JupyterHub Documentation](./jupyterhub/README.md)

### Metabase

Business intelligence and data visualization platform with PostgreSQL integration.

[📖 See Metabase Documentation](./metabase/README.md)

### DataHub

Modern data catalog and metadata management platform with OIDC integration.

[📖 See DataHub Documentation](./datahub/README.md)

### ClickHouse

High-performance columnar OLAP database for analytics and data warehousing.

[📖 See ClickHouse Documentation](./clickhouse/README.md)

### Qdrant

High-performance vector database for AI/ML applications with similarity search and rich filtering.

[📖 See Qdrant Documentation](./qdrant/README.md)

### Lakekeeper

Apache Iceberg REST Catalog for managing data lake tables with OIDC authentication.

[📖 See Lakekeeper Documentation](./lakekeeper/README.md)

### Apache Airflow

Modern workflow orchestration platform for data pipelines with JupyterHub integration.

[📖 See Airflow Documentation](./airflow/README.md)

### Dagster

Modern data orchestration platform for building data pipelines and managing data assets.

[📖 See Dagster Documentation](./dagster/README.md)

## Common Operations

clickhouse/README.md (new file)

@@ -0,0 +1,28 @@
# ClickHouse
High-performance columnar OLAP database for analytics and data warehousing:
- Columnar storage for fast analytical queries
- Real-time data ingestion and processing
- Horizontal scaling for large datasets
- SQL interface with advanced analytics functions
- Integration with External Secrets for secure credential management
- Support for various data formats (CSV, JSON, Parquet, etc.)
## Installation
```bash
just clickhouse::install
```
## Access
Access ClickHouse at `https://clickhouse.yourdomain.com` using the admin credentials stored in Vault.
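For programmatic access from a notebook or script, a minimal sketch using the [clickhouse-connect](https://clickhouse.com/docs/integrations/python) Python client is shown below. The host, port, and credentials are placeholders, assuming the ingress FQDN above and the admin credentials retrieved from Vault.
```python
# A minimal sketch, assuming ClickHouse is exposed at your ingress host over TLS
# and that you have retrieved the admin credentials from Vault.
import clickhouse_connect

client = clickhouse_connect.get_client(
    host="clickhouse.yourdomain.com",  # placeholder FQDN
    port=443,                          # assumed TLS port behind the Traefik ingress
    username="admin",                  # placeholder; use the Vault-stored user
    password="YOUR_PASSWORD",          # placeholder; use the Vault-stored password
    secure=True,
)

# Run a simple query to verify connectivity
result = client.query("SELECT version(), now()")
print(result.result_rows)
```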
## CH-UI Web Interface
An optional web-based query interface for ClickHouse is available:
```bash
just ch-ui::install
```

datahub/README.md (new file)

@@ -0,0 +1,25 @@
# DataHub
Modern data catalog and metadata management platform:
- Centralized data discovery and documentation
- Data lineage tracking and impact analysis
- Schema evolution monitoring
- OIDC integration with Keycloak for secure access
- Elasticsearch-powered search and indexing
- Kafka-based real-time metadata streaming
- PostgreSQL backend for metadata storage
## Installation
```bash
just datahub::install
```
## Resource Requirements
> **⚠️ Resource Requirements:** DataHub is resource-intensive, requiring approximately **4-5GB of RAM** and 1+ CPU cores across multiple components (Elasticsearch, Kafka, Zookeeper, and DataHub services). Deployment typically takes 15-20 minutes to complete. Ensure your cluster has sufficient resources before installation.
## Access
Access DataHub at `https://datahub.yourdomain.com` and use "Sign in with SSO" to authenticate via Keycloak.
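Beyond the UI, metadata can also be pushed programmatically. The sketch below uses the `acryl-datahub` Python SDK; the GMS endpoint path, token, and dataset names are assumptions for illustration, not values provisioned by this stack.
```python
# A hedged sketch using the acryl-datahub SDK (pip install acryl-datahub).
# The GMS URL and token below are placeholders to adapt to your deployment.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

emitter = DatahubRestEmitter(
    gms_server="https://datahub.yourdomain.com/api/gms",  # assumed GMS path
    token="YOUR_ACCESS_TOKEN",                            # personal access token
)

# Describe a hypothetical PostgreSQL table so it appears in the catalog
mcp = MetadataChangeProposalWrapper(
    entityUrn=make_dataset_urn(platform="postgres", name="mydb.public.users", env="PROD"),
    aspect=DatasetPropertiesClass(description="User accounts table"),
)
emitter.emit(mcp)
```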

jupyterhub/README.md (new file)

@@ -0,0 +1,90 @@
# JupyterHub
Multi-user platform for interactive computing:
- Collaborative Jupyter notebook environment
- Integrated with Keycloak for OIDC authentication
- Persistent storage for user workspaces
- Support for multiple kernels and environments
- Vault integration for secure secrets management
See [JupyterHub Documentation](../docs/jupyterhub.md) for detailed setup and configuration.
## Installation
```bash
just jupyterhub::install
```
## Access
Access JupyterHub at `https://jupyter.yourdomain.com` and authenticate via Keycloak.
## buunstack Package & SecretStore
JupyterHub includes the **buunstack** Python package, which provides seamless integration with HashiCorp Vault for secure secrets management in your notebooks.
### Key Features
- 🔒 **Secure Secrets Management**: Store and retrieve secrets securely using HashiCorp Vault
- 🚀 **Pre-acquired Authentication**: Uses Vault tokens created automatically at notebook spawn
- 📱 **Simple API**: Easy-to-use interface similar to Google Colab's `userdata.get()`
- 🔄 **Automatic Token Renewal**: Built-in token refresh for long-running sessions
### Quick Example
```python
from buunstack import SecretStore

# Initialize with pre-acquired Vault token (automatic)
secrets = SecretStore()

# Store secrets
secrets.put('api-keys',
    openai_key='sk-your-key-here',
    github_token='ghp_your-token',
    database_url='postgresql://user:pass@host:5432/db'
)

# Retrieve secrets
api_keys = secrets.get('api-keys')
openai_key = api_keys['openai_key']

# Or get a specific field directly
openai_key = secrets.get('api-keys', field='openai_key')
```
### Learn More
For detailed documentation, usage examples, and API reference, see:
[📖 buunstack Package Documentation](../python-package/README.md)
## Custom Container Images
JupyterHub uses custom container images with pre-installed data science tools and integrations:
### datastack-notebook (CPU)
Standard notebook image based on `jupyter/pytorch-notebook`:
- **PyTorch**: Deep learning framework
- **PySpark**: Apache Spark integration for big data processing
- **ClickHouse Client**: Direct database access
- **Python 3.12**: Latest Python runtime
[📖 See Image Documentation](./images/datastack-notebook/README.md)
### datastack-cuda-notebook (GPU)
GPU-enabled notebook image based on `jupyter/pytorch-notebook:cuda12`:
- **CUDA 12**: GPU acceleration support
- **PyTorch with GPU**: Hardware-accelerated deep learning
- **PySpark**: Apache Spark integration
- **ClickHouse Client**: Direct database access
- **Python 3.12**: Latest Python runtime
[📖 See Image Documentation](./images/datastack-cuda-notebook/README.md)
Both images are based on the official [Jupyter Docker Stacks](https://github.com/jupyter/docker-stacks) and include all standard data science libraries (NumPy, pandas, scikit-learn, matplotlib, etc.).

lakekeeper/README.md (new file)

@@ -0,0 +1,72 @@
# Lakekeeper
Apache Iceberg REST Catalog implementation for managing data lake tables:
- **Iceberg REST Catalog**: Complete Apache Iceberg REST specification implementation
- **OIDC Authentication**: Integrated with Keycloak for secure access via PKCE flow
- **PostgreSQL Backend**: Reliable metadata storage with automatic migrations
- **Web UI**: Built-in web interface for catalog management
- **Secrets Management**: Vault/External Secrets integration for secure credentials
- **Multi-table Format**: Primarily designed for Apache Iceberg, with extensibility for other formats
## Installation
```bash
just lakekeeper::install
```
During installation, you will be prompted for:
- **Lakekeeper host (FQDN)**: The domain name for accessing Lakekeeper (e.g., `lakekeeper.yourdomain.com`)
The installation automatically:
- Creates PostgreSQL database and user
- Stores credentials in Vault (if External Secrets is available) or Kubernetes Secrets
- Creates Keycloak OIDC client with PKCE flow for secure authentication
- Configures audience mapper for JWT tokens
- Runs database migrations
- Configures Traefik ingress with TLS
## Access
Access Lakekeeper at `https://lakekeeper.yourdomain.com` and authenticate via Keycloak.
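Any client that speaks the Iceberg REST protocol can use the catalog directly. Below is a minimal PyIceberg sketch; the `/catalog` URI prefix, warehouse name, and token are assumptions to adapt to your setup.
```python
# A minimal sketch with PyIceberg (pip install pyiceberg).
# The URI path, warehouse name, and token are placeholders/assumptions.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "lakekeeper",
    **{
        "type": "rest",
        "uri": "https://lakekeeper.yourdomain.com/catalog",  # assumed REST prefix
        "warehouse": "my-warehouse",                         # placeholder warehouse
        "token": "YOUR_ACCESS_TOKEN",                        # bearer token from Keycloak
    },
)

# List namespaces to verify connectivity
print(catalog.list_namespaces())
```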
## Cleanup
To remove all Lakekeeper resources and secrets from Vault:
```bash
just lakekeeper::cleanup
```
This will prompt for confirmation before deleting:
- PostgreSQL database
- Vault secrets
- Keycloak client
## Uninstallation
```bash
# Keep database
just lakekeeper::uninstall false
# Delete database as well
just lakekeeper::uninstall true
```
This will:
- Uninstall the Lakekeeper Helm release
- Delete Kubernetes secrets
- Optionally delete PostgreSQL database
- Remove Keycloak OIDC client
## Documentation
For more information, see the official documentation:
- [Lakekeeper Documentation](https://docs.lakekeeper.io/)
- [Apache Iceberg Documentation](https://iceberg.apache.org/docs/latest/)
- [PyIceberg Documentation](https://py.iceberg.apache.org/)

metabase/README.md (new file)

@@ -0,0 +1,20 @@
# Metabase
Business intelligence and data visualization platform:
- Open-source analytics and dashboards
- Interactive data exploration
- PostgreSQL integration for data storage
- Automated setup with Helm
- Session management through Vault/External Secrets
- Simplified deployment (no OIDC dependency)
## Installation
```bash
just metabase::install
```
## Access
Access Metabase at `https://metabase.yourdomain.com` and complete the initial setup wizard to create an admin account.

qdrant/README.md (new file)

@@ -0,0 +1,267 @@
# Qdrant
High-performance vector database for AI/ML applications:
- **Vector Search**: Fast similarity search with multiple distance metrics (Cosine, Euclidean, Dot Product)
- **Rich Filtering**: Combine vector similarity with payload-based filtering
- **Scalable**: Horizontal scaling for large-scale vector collections
- **RESTful API**: Simple HTTP API for vector operations
- **Secure Authentication**: API key-based authentication with Vault integration
- **High Availability**: Built-in replication and fault tolerance
## Installation
```bash
just qdrant::install
```
During installation, you will be prompted for:
- **Qdrant host (FQDN)**: The domain name for accessing Qdrant (e.g., `qdrant.yourdomain.com`)
The installation automatically:
- Generates API keys (read-write and read-only)
- Stores keys in Vault (if External Secrets is available) or Kubernetes Secrets
- Configures Traefik ingress with TLS
## Access
Access Qdrant at `https://qdrant.yourdomain.com` using the API keys.
### Get API Keys
```bash
# Get read-write API key
just qdrant::get-api-key
# Get read-only API key
just qdrant::get-readonly-api-key
```
## Testing & Health Check
Qdrant includes built-in testing recipes that use telepresence to access the service from your local machine.
### Prerequisites
Ensure telepresence is connected:
```bash
telepresence connect
```
### Health Check
```bash
just qdrant::health-check
```
Checks if Qdrant is running and responding to requests.
### Vector Operations Test
```bash
just qdrant::test
```
Runs a complete test suite that:
1. Creates a test collection with 4-dimensional vectors
2. Adds sample points (cities with vector embeddings)
3. Performs similarity search
4. Cleans up the test collection
Example output:
```
Testing Qdrant at http://qdrant.qdrant.svc.cluster.local:6333
Using collection: test_collection_1760245249

1. Creating collection...
{
  "result": true,
  "status": "ok"
}

2. Adding test points...
{
  "result": {
    "operation_id": 0,
    "status": "completed"
  },
  "status": "ok"
}

3. Searching for similar vectors...
{
  "result": [
    {
      "id": 2,
      "score": 0.99,
      "payload": {"city": "London"}
    }
  ],
  "status": "ok"
}

Test completed successfully!
```
## Using Qdrant
### REST API
Qdrant provides a RESTful API for all operations. Here are some common examples:
#### Create a Collection
```bash
curl -X PUT "https://qdrant.yourdomain.com/collections/my_collection" \
  -H "api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "vectors": {
      "size": 384,
      "distance": "Cosine"
    }
  }'
```
#### Insert Vectors
```bash
curl -X PUT "https://qdrant.yourdomain.com/collections/my_collection/points" \
  -H "api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "points": [
      {
        "id": 1,
        "vector": [0.1, 0.2, ...],
        "payload": {"text": "example document"}
      }
    ]
  }'
```
#### Search Similar Vectors
```bash
curl -X POST "https://qdrant.yourdomain.com/collections/my_collection/points/search" \
  -H "api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "vector": [0.15, 0.25, ...],
    "limit": 10
  }'
```
### Python Client
```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

# Connect to Qdrant
client = QdrantClient(
    url="https://qdrant.yourdomain.com",
    api_key="YOUR_API_KEY",
)

# Create collection using the typed config models
client.create_collection(
    collection_name="my_collection",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

# Insert vectors as PointStruct objects
client.upsert(
    collection_name="my_collection",
    points=[
        PointStruct(
            id=1,
            vector=[0.1, 0.2, ...],
            payload={"text": "example document"},
        )
    ],
)

# Search
results = client.search(
    collection_name="my_collection",
    query_vector=[0.15, 0.25, ...],
    limit=10,
)
```
### JupyterHub Integration
Store your API key securely in Vault using the buunstack package:
```python
from buunstack import SecretStore
secrets = SecretStore()
secrets.put('qdrant', api_key='YOUR_API_KEY')
# Later, retrieve it
api_key = secrets.get('qdrant', field='api_key')
```
## Use Cases
### Vector Embeddings Search
Store and search document, image, or audio embeddings for:
- Semantic search
- Recommendation systems
- Duplicate detection
- Content-based filtering
### RAG (Retrieval-Augmented Generation)
Use Qdrant as the vector store for LLM applications (see the sketch after this list):
- Store document chunks with embeddings
- Retrieve relevant context for LLM prompts
- Build knowledge bases with semantic search
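A minimal retrieval sketch for this pattern is shown below. It is illustrative only: `embed()` is a stand-in for whatever embedding model you use, and `docs_chunks` is a hypothetical collection of document chunks with a `text` payload field.
```python
from qdrant_client import QdrantClient

client = QdrantClient(url="https://qdrant.yourdomain.com", api_key="YOUR_API_KEY")

def embed(text: str) -> list[float]:
    # Stand-in embedding for illustration only; replace with a real model
    # producing vectors that match your collection's dimension.
    import hashlib
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:32]]

question = "How do I rotate the admin credentials?"

# Retrieve the most similar chunks from the hypothetical 'docs_chunks' collection
hits = client.search(
    collection_name="docs_chunks",
    query_vector=embed(question),
    limit=5,
)

# Assemble the retrieved chunks into context for an LLM prompt
context = "\n\n".join(hit.payload["text"] for hit in hits)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```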
### Similarity Matching
Find similar items based on learned representations:
- Image similarity search
- Product recommendations
- Anomaly detection
- Clustering and classification
## Cleanup
To remove all Qdrant resources and secrets from Vault:
```bash
just qdrant::cleanup
```
This will prompt for confirmation before deleting the Vault secrets.
## Uninstallation
```bash
just qdrant::uninstall
```
This will:
- Uninstall the Qdrant Helm release
- Delete API keys secrets
- Remove the Qdrant namespace
## Documentation
For more information, see the official Qdrant documentation:
- [Qdrant Documentation](https://qdrant.tech/documentation/)
- [REST API Reference](https://qdrant.tech/documentation/api-reference/)
- [Python Client](https://github.com/qdrant/qdrant-client)