docs: reconstruct READMEs

README.md (228 changes)

@@ -1,27 +1,51 @@
 # buun-stack

-A Kubernetes development stack for self-hosted environments, designed to run on a Linux machine in your home or office that you can access from anywhere via the internet.
+A remotely accessible Kubernetes home lab with OIDC authentication. Build a modern development environment with integrated data analytics and AI capabilities. Includes a complete open data stack for data ingestion, transformation, serving, and orchestration—built on open-source components you can run locally and port to any cloud.
 
 - 📺 [Remote-Accessible Kubernetes Home Lab](https://www.youtube.com/playlist?list=PLbAvvJK22Y6vJPrUC6GrfNMXneYspckAo) (YouTube playlist)
 - 📝 [Building a Remote-Accessible Kubernetes Home Lab with k3s](https://dev.to/buun-ch/building-a-remote-accessible-kubernetes-home-lab-with-k3s-5g05) (Dev.to article)
 
-## Features
+## Architecture
+
+### Foundation
+
+- **Kubernetes**: [k3s](https://k3s.io/) lightweight distribution
+- **Automation**: [Just](https://just.systems/) task runner with templated configurations
+- **Remote Access**: [Cloudflare Tunnel](https://developers.cloudflare.com/cloudflare-one/connections/connect-networks/) for secure internet connectivity
+
+### Core Components (Required)
+
+- **Database**: [PostgreSQL](https://www.postgresql.org/) cluster with pgvector extension
+- **Identity & Access**: [Keycloak](https://www.keycloak.org/) for OIDC authentication
+
+### Recommended Components
+
+- **Secrets Management**: [HashiCorp Vault](https://www.vaultproject.io/) with [External Secrets Operator](https://external-secrets.io/)
+  - Used by most stack modules for secure credential management
+  - Can be deployed without it, but highly recommended
+
+### Storage (Optional)
 
-- **Kubernetes Distribution**: [k3s](https://k3s.io/) lightweight Kubernetes
 - **Block Storage**: [Longhorn](https://longhorn.io/) distributed block storage
 - **Object Storage**: [MinIO](https://min.io/) S3-compatible storage
-- **Identity & Access**: [Keycloak](https://www.keycloak.org/) for OIDC authentication
-- **Secrets Management**: [HashiCorp Vault](https://www.vaultproject.io/) with [External Secrets Operator](https://external-secrets.io/)
+
+### Data & Analytics (Optional)
 
 - **Interactive Computing**: [JupyterHub](https://jupyter.org/hub) for collaborative notebooks
-- **Business Intelligence**: [Metabase](https://www.metabase.com/) for business intelligence and data visualization
-- **Data Catalog**: [DataHub](https://datahubproject.io/) for metadata management and data discovery
-- **Database**: [PostgreSQL](https://www.postgresql.org/) cluster
-- **Analytics Engine/Database**: [ClickHouse](https://clickhouse.com/) for high-performance analytics and data warehousing
-- **Data Orchestration**: [Dagster](https://dagster.io/) for modern data pipelines and asset management
-- **Workflow Orchestration**: [Apache Airflow](https://airflow.apache.org/) for data pipeline automation and task scheduling
-- **Authentication Proxy**: [OAuth2 Proxy](https://oauth2-proxy.github.io/oauth2-proxy/) for adding Keycloak authentication to any application
-- **Remote Access**: [Cloudflare Tunnel](https://developers.cloudflare.com/cloudflare-one/connections/connect-networks/) for secure internet connectivity
-- **Automation**: [Just](https://just.systems/) task runner with templated configurations
+- **Analytics Database**: [ClickHouse](https://clickhouse.com/) for high-performance analytics
+- **Vector Database**: [Qdrant](https://qdrant.tech/) for vector search and AI/ML applications
+- **Iceberg REST Catalog**: [Lakekeeper](https://lakekeeper.io/) for Apache Iceberg table management
+- **Business Intelligence**: [Metabase](https://www.metabase.com/) for data visualization
+- **Data Catalog**: [DataHub](https://datahubproject.io/) for metadata management
+
+### Orchestration (Optional)
+
+- **Data Orchestration**: [Dagster](https://dagster.io/) for modern data pipelines
+- **Workflow Orchestration**: [Apache Airflow](https://airflow.apache.org/) for task scheduling
+
+### Security (Optional)
+
+- **Authentication Proxy**: [OAuth2 Proxy](https://oauth2-proxy.github.io/oauth2-proxy/) for adding Keycloak authentication
 
 ## Quick Start
 
@@ -55,7 +79,7 @@ For detailed step-by-step instructions, see the [Installation Guide](./INSTALLAT
 just k8s::setup-oidc-auth
 ```
 
-## Core Components
+## Component Details
 
 ### k3s
 
@@ -114,187 +138,51 @@ S3-compatible object storage system providing:
 ### JupyterHub
 
-Multi-user platform for interactive computing:
+Multi-user platform for interactive computing with Keycloak authentication and persistent storage.
 
-- Collaborative Jupyter notebook environment
-- Integrated with Keycloak for OIDC authentication
-- Persistent storage for user workspaces
-- Support for multiple kernels and environments
-- Vault integration for secure secrets management
-
-See [JupyterHub Documentation](./docs/jupyterhub.md) for detailed setup and configuration.
+[📖 See JupyterHub Documentation](./jupyterhub/README.md)
 
 ### Metabase
 
-Business intelligence and data visualization platform:
+Business intelligence and data visualization platform with PostgreSQL integration.
 
-- Open-source analytics and dashboards
-- Interactive data exploration
-- PostgreSQL integration for data storage
-- Automated setup with Helm
-- Session management through Vault/External Secrets
-- Simplified deployment (no OIDC dependency)
-
-Installation:
-
-```bash
-just metabase::install
-```
-
-Access Metabase at `https://metabase.yourdomain.com` and complete the initial setup wizard to create an admin account.
+[📖 See Metabase Documentation](./metabase/README.md)
 
 ### DataHub
 
-Modern data catalog and metadata management platform:
+Modern data catalog and metadata management platform with OIDC integration.
 
-- Centralized data discovery and documentation
-- Data lineage tracking and impact analysis
-- Schema evolution monitoring
-- OIDC integration with Keycloak for secure access
-- Elasticsearch-powered search and indexing
-- Kafka-based real-time metadata streaming
-- PostgreSQL backend for metadata storage
-
-Installation:
-
-```bash
-just datahub::install
-```
-
-> **⚠️ Resource Requirements:** DataHub is resource-intensive, requiring approximately **4-5GB of RAM** and 1+ CPU cores across multiple components (Elasticsearch, Kafka, Zookeeper, and DataHub services). Deployment typically takes 15-20 minutes to complete. Ensure your cluster has sufficient resources before installation.
-
-Access DataHub at `https://datahub.yourdomain.com` and use "Sign in with SSO" to authenticate via Keycloak.
+[📖 See DataHub Documentation](./datahub/README.md)
 
 ### ClickHouse
 
-High-performance columnar OLAP database for analytics and data warehousing:
+High-performance columnar OLAP database for analytics and data warehousing.
 
-- Columnar storage for fast analytical queries
-- Real-time data ingestion and processing
-- Horizontal scaling for large datasets
-- SQL interface with advanced analytics functions
-- Integration with External Secrets for secure credential management
-- Support for various data formats (CSV, JSON, Parquet, etc.)
-
-Installation:
-
-```bash
-just clickhouse::install
-```
-
-Access ClickHouse at `https://clickhouse.yourdomain.com` using the admin credentials stored in Vault.
-
-**CH-UI Web Interface**: An optional web-based query interface for ClickHouse is available:
-
-```bash
-just ch-ui::install
-```
+[📖 See ClickHouse Documentation](./clickhouse/README.md)
+
+### Qdrant
+
+High-performance vector database for AI/ML applications with similarity search and rich filtering.
+
+[📖 See Qdrant Documentation](./qdrant/README.md)
+
+### Lakekeeper
+
+Apache Iceberg REST Catalog for managing data lake tables with OIDC authentication.
+
+[📖 See Lakekeeper Documentation](./lakekeeper/README.md)
 
 ### Apache Airflow
 
-Modern workflow orchestration platform for data pipelines and task automation:
+Modern workflow orchestration platform for data pipelines with JupyterHub integration.
 
-- Airflow 3 with modern SDK components and FastAPI integration
-- DAG Development: Integrated with JupyterHub for seamless workflow creation and editing
-- OIDC Authentication: Secure access through Keycloak integration
-- Shared Storage: DAG files shared between JupyterHub and Airflow for direct editing
-- Role-based Access Control: Multiple user roles (Admin, Operator, User, Viewer)
-- REST API: Full API access for programmatic DAG management
-
-Installation:
-
-```bash
-just airflow::install
-```
-
-**JupyterHub Integration**: After installing both JupyterHub and Airflow, DAG files are automatically shared:
-
-- Edit DAG files directly in JupyterHub: `~/airflow-dags/*.py`
-- Changes appear in Airflow UI within 1-2 minutes
-- Full Python development environment with syntax checking
-- Template files available for quick DAG creation
-
-**User Management**:
-
-```bash
-# Assign roles to users
-just airflow::assign-role <username> <role>
-
-# Available roles: airflow_admin, airflow_op, airflow_user, airflow_viewer
-just airflow::assign-role myuser airflow_admin
-```
-
-**API Access**: Create API users for programmatic access:
-
-```bash
-just airflow::create-api-user <username> <role>
-```
-
-> **💡 Development Workflow**: Create DAGs in JupyterHub using `~/airflow-dags/dag_template.py` as a starting point. Use `.tmp` extension during development to avoid import errors, then rename to `.py` when ready.
-
-Access Airflow at `https://airflow.yourdomain.com` and authenticate via Keycloak.
+[📖 See Airflow Documentation](./airflow/README.md)
 
 ### Dagster
 
-Modern data orchestration platform for building data pipelines and managing data assets:
+Modern data orchestration platform for building data pipelines and managing data assets.
 
-- **Asset-Centric Development**: Define data assets with clear lineage and dependencies
-- **Dynamic Pipeline Deployment**: Deploy projects directly from local development environments
-- **Integrated Development**: Shared storage with PVC-based project deployment
-- **OAuth2 Authentication**: Secure access through Keycloak via OAuth2 Proxy
-- **Python-First**: Native Python development with comprehensive SDK
-
-Installation:
-
-```bash
-just dagster::install
-```
-
-**Project Development**: Deploy `dagster project scaffold` projects directly to Dagster:
-
-```bash
-# Create a new project locally
-dagster project scaffold my-project
-
-# Deploy to Dagster cluster
-just dagster::deploy-project my-project
-
-# Remove project when done
-just dagster::remove-project my-project
-```
-
-**Storage Configuration**:
-
-- **MinIO**: S3-compatible object storage for compute logs and staging
-- **Local**: Persistent volumes with automatic Longhorn RWX detection for shared development
-
-**Custom Dependencies**: For projects requiring additional Python packages:
-
-```bash
-# Build custom image with dependencies
-export DAGSTER_CONTAINER_IMAGE=myregistry/dagster-custom
-export DAGSTER_CONTAINER_TAG=latest
-just dagster::build-container-image
-just dagster::push-container-image
-just dagster::upgrade
-```
-
-**Project Structure**: Projects must follow naming conventions:
-
-- Directory names: Use underscores only (e.g., `my_project`, not `my-project`)
-- Python modules: Follow standard Python naming (snake_case)
-
-**Authentication**: Dagster uses OAuth2 Proxy for Keycloak integration:
-
-- During installation, OAuth2 authentication is automatically configured
-- Access control through Keycloak groups and roles
-- **Note**: All authenticated users share the same Dagster instance and workspace
-
-> **⚠️ Multi-user Limitation**: Dagster OSS does not support individual user workspaces or role-based permissions within the application. All users authenticated through Keycloak will share the same Dagster instance and have access to all assets, jobs, and configurations. Use naming conventions and team coordination for shared usage.
->
-> **💡 Development Workflow**: Create projects locally with `dagster project scaffold`, develop with local dependencies, then deploy to the cluster for execution. The shared PVC allows immediate access to deployed code.
-
-Access Dagster at `https://dagster.yourdomain.com` and authenticate via Keycloak.
+[📖 See Dagster Documentation](./dagster/README.md)
 
 ## Common Operations

clickhouse/README.md (new file, 28 lines)

# ClickHouse

High-performance columnar OLAP database for analytics and data warehousing:

- Columnar storage for fast analytical queries
- Real-time data ingestion and processing
- Horizontal scaling for large datasets
- SQL interface with advanced analytics functions
- Integration with External Secrets for secure credential management
- Support for various data formats (CSV, JSON, Parquet, etc.)

## Installation

```bash
just clickhouse::install
```

## Access

Access ClickHouse at `https://clickhouse.yourdomain.com` using the admin credentials stored in Vault.
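
From Python, connectivity can be verified with the `clickhouse-connect` driver. This is a minimal sketch, not part of the installed stack: the host matches the ingress above, while the `admin` username, the port, and the password placeholder are assumptions about your deployment:

```python
import clickhouse_connect

# Placeholders: adjust host, user, and password to your deployment.
# The password is the admin credential stored in Vault.
client = clickhouse_connect.get_client(
    host="clickhouse.yourdomain.com",
    port=443,            # HTTPS via the Traefik ingress
    username="admin",    # assumed admin user
    password="PASSWORD_FROM_VAULT",
    secure=True,
)

# Smoke test: print the server version
print(client.query("SELECT version()").result_rows)
```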

## CH-UI Web Interface

An optional web-based query interface for ClickHouse is available:

```bash
just ch-ui::install
```

datahub/README.md (new file, 25 lines)

# DataHub

Modern data catalog and metadata management platform:

- Centralized data discovery and documentation
- Data lineage tracking and impact analysis
- Schema evolution monitoring
- OIDC integration with Keycloak for secure access
- Elasticsearch-powered search and indexing
- Kafka-based real-time metadata streaming
- PostgreSQL backend for metadata storage

## Installation

```bash
just datahub::install
```

## Resource Requirements

> **⚠️ Resource Requirements:** DataHub is resource-intensive, requiring approximately **4-5GB of RAM** and 1+ CPU cores across multiple components (Elasticsearch, Kafka, Zookeeper, and DataHub services). Deployment typically takes 15-20 minutes to complete. Ensure your cluster has sufficient resources before installation.

## Access

Access DataHub at `https://datahub.yourdomain.com` and use "Sign in with SSO" to authenticate via Keycloak.
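
Beyond the UI, metadata can be pushed programmatically with the DataHub Python SDK (`acryl-datahub`). A hedged sketch: the GMS endpoint path, the personal access token, and the example dataset name are illustrative, not values fixed by this installation:

```python
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

# Endpoint and token are deployment-specific placeholders.
emitter = DatahubRestEmitter(
    gms_server="https://datahub.yourdomain.com/api/gms",
    token="PERSONAL_ACCESS_TOKEN",
)

# Register a description for a hypothetical PostgreSQL table
dataset_urn = make_dataset_urn(platform="postgres", name="mydb.public.orders", env="PROD")
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=dataset_urn,
        aspect=DatasetPropertiesClass(description="Orders table, emitted via the Python SDK"),
    )
)
```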

jupyterhub/README.md (new file, 90 lines)

# JupyterHub

Multi-user platform for interactive computing:

- Collaborative Jupyter notebook environment
- Integrated with Keycloak for OIDC authentication
- Persistent storage for user workspaces
- Support for multiple kernels and environments
- Vault integration for secure secrets management

See [JupyterHub Documentation](../docs/jupyterhub.md) for detailed setup and configuration.

## Installation

```bash
just jupyterhub::install
```

## Access

Access JupyterHub at `https://jupyter.yourdomain.com` and authenticate via Keycloak.

## buunstack Package & SecretStore

JupyterHub includes the **buunstack** Python package, which provides seamless integration with HashiCorp Vault for secure secrets management in your notebooks.

### Key Features

- 🔒 **Secure Secrets Management**: Store and retrieve secrets securely using HashiCorp Vault
- 🚀 **Pre-acquired Authentication**: Uses Vault tokens created automatically at notebook spawn
- 📱 **Simple API**: Easy-to-use interface similar to Google Colab's `userdata.get()`
- 🔄 **Automatic Token Renewal**: Built-in token refresh for long-running sessions

### Quick Example

```python
from buunstack import SecretStore

# Initialize with pre-acquired Vault token (automatic)
secrets = SecretStore()

# Store secrets
secrets.put('api-keys',
    openai_key='sk-your-key-here',
    github_token='ghp_your-token',
    database_url='postgresql://user:pass@host:5432/db'
)

# Retrieve secrets
api_keys = secrets.get('api-keys')
openai_key = api_keys['openai_key']

# Or get a specific field directly
openai_key = secrets.get('api-keys', field='openai_key')
```
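
A common follow-on pattern is exporting a stored secret into the notebook environment before initializing a client library. This sketch uses only the `SecretStore` calls shown above; the secret name and field mirror the example:

```python
import os

from buunstack import SecretStore

secrets = SecretStore()

# Many SDKs (e.g. OpenAI's) read their API key from the environment at startup
os.environ["OPENAI_API_KEY"] = secrets.get('api-keys', field='openai_key')
```
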
### Learn More

For detailed documentation, usage examples, and API reference, see:

[📖 buunstack Package Documentation](../python-package/README.md)

## Custom Container Images

JupyterHub uses custom container images with pre-installed data science tools and integrations:

### datastack-notebook (CPU)

Standard notebook image based on `jupyter/pytorch-notebook`:

- **PyTorch**: Deep learning framework
- **PySpark**: Apache Spark integration for big data processing
- **ClickHouse Client**: Direct database access
- **Python 3.12**: Latest Python runtime

[📖 See Image Documentation](./images/datastack-notebook/README.md)

### datastack-cuda-notebook (GPU)

GPU-enabled notebook image based on `jupyter/pytorch-notebook:cuda12`:

- **CUDA 12**: GPU acceleration support
- **PyTorch with GPU**: Hardware-accelerated deep learning
- **PySpark**: Apache Spark integration
- **ClickHouse Client**: Direct database access
- **Python 3.12**: Latest Python runtime

[📖 See Image Documentation](./images/datastack-cuda-notebook/README.md)

Both images are based on the official [Jupyter Docker Stacks](https://github.com/jupyter/docker-stacks) and include all standard data science libraries (NumPy, pandas, scikit-learn, matplotlib, etc.).
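
For example, the bundled PySpark can be exercised directly in a notebook without any external cluster. A minimal local-mode sketch (the app name is arbitrary):

```python
from pyspark.sql import SparkSession

# local[*] runs Spark inside the notebook container on all available cores
spark = SparkSession.builder.master("local[*]").appName("notebook-smoke-test").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.show()

spark.stop()
```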

lakekeeper/README.md (new file, 72 lines)

# Lakekeeper

Apache Iceberg REST Catalog implementation for managing data lake tables:

- **Iceberg REST Catalog**: Complete Apache Iceberg REST specification implementation
- **OIDC Authentication**: Integrated with Keycloak for secure access via PKCE flow
- **PostgreSQL Backend**: Reliable metadata storage with automatic migrations
- **Web UI**: Built-in web interface for catalog management
- **Secrets Management**: Vault/External Secrets integration for secure credentials
- **Multi-table Format**: Primarily designed for Apache Iceberg with extensibility

## Installation

```bash
just lakekeeper::install
```

During installation, you will be prompted for:

- **Lakekeeper host (FQDN)**: The domain name for accessing Lakekeeper (e.g., `lakekeeper.yourdomain.com`)

The installation automatically:

- Creates PostgreSQL database and user
- Stores credentials in Vault (if External Secrets is available) or Kubernetes Secrets
- Creates Keycloak OIDC client with PKCE flow for secure authentication
- Configures audience mapper for JWT tokens
- Runs database migrations
- Configures Traefik ingress with TLS

## Access

Access Lakekeeper at `https://lakekeeper.yourdomain.com` and authenticate via Keycloak.
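
From Python, the catalog can be reached with PyIceberg. A hedged sketch: the `/catalog` endpoint path, the warehouse name, and the way you obtain an OIDC bearer token from Keycloak are illustrative and depend on your deployment:

```python
from pyiceberg.catalog import load_catalog

# All values below are placeholders; adjust to your deployment.
catalog = load_catalog(
    "lakekeeper",
    **{
        "uri": "https://lakekeeper.yourdomain.com/catalog",  # REST catalog endpoint
        "warehouse": "demo",                                 # hypothetical warehouse name
        "token": "KEYCLOAK_BEARER_TOKEN",                    # OIDC access token
    },
)

print(catalog.list_namespaces())
```
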
## Cleanup

To remove all Lakekeeper resources and secrets from Vault:

```bash
just lakekeeper::cleanup
```

This will prompt for confirmation before deleting:

- PostgreSQL database
- Vault secrets
- Keycloak client

## Uninstallation

```bash
# Keep database
just lakekeeper::uninstall false

# Delete database as well
just lakekeeper::uninstall true
```

This will:

- Uninstall the Lakekeeper Helm release
- Delete Kubernetes secrets
- Optionally delete PostgreSQL database
- Remove Keycloak OIDC client

## Documentation

For more information, see the official documentation:

- [Lakekeeper Documentation](https://docs.lakekeeper.io/)
- [Apache Iceberg Documentation](https://iceberg.apache.org/docs/latest/)
- [PyIceberg Documentation](https://py.iceberg.apache.org/)

metabase/README.md (new file, 20 lines)

# Metabase

Business intelligence and data visualization platform:

- Open-source analytics and dashboards
- Interactive data exploration
- PostgreSQL integration for data storage
- Automated setup with Helm
- Session management through Vault/External Secrets
- Simplified deployment (no OIDC dependency)

## Installation

```bash
just metabase::install
```

## Access

Access Metabase at `https://metabase.yourdomain.com` and complete the initial setup wizard to create an admin account.

qdrant/README.md (new file, 267 lines)

# Qdrant

High-performance vector database for AI/ML applications:

- **Vector Search**: Fast similarity search with multiple distance metrics (Cosine, Euclidean, Dot Product)
- **Rich Filtering**: Combine vector similarity with payload-based filtering
- **Scalable**: Horizontal scaling for large-scale vector collections
- **RESTful API**: Simple HTTP API for vector operations
- **Secure Authentication**: API key-based authentication with Vault integration
- **High Availability**: Built-in replication and fault tolerance

## Installation

```bash
just qdrant::install
```

During installation, you will be prompted for:

- **Qdrant host (FQDN)**: The domain name for accessing Qdrant (e.g., `qdrant.yourdomain.com`)

The installation automatically:

- Generates API keys (read-write and read-only)
- Stores keys in Vault (if External Secrets is available) or Kubernetes Secrets
- Configures Traefik ingress with TLS

## Access

Access Qdrant at `https://qdrant.yourdomain.com` using the API keys.

### Get API Keys

```bash
# Get read-write API key
just qdrant::get-api-key

# Get read-only API key
just qdrant::get-readonly-api-key
```

## Testing & Health Check

Qdrant includes built-in testing recipes that use telepresence to access the service from your local machine.

### Prerequisites

Ensure telepresence is connected:

```bash
telepresence connect
```

### Health Check

```bash
just qdrant::health-check
```

Checks if Qdrant is running and responding to requests.

### Vector Operations Test

```bash
just qdrant::test
```

Runs a complete test suite that:

1. Creates a test collection with 4-dimensional vectors
2. Adds sample points (cities with vector embeddings)
3. Performs similarity search
4. Cleans up the test collection

Example output:

```
Testing Qdrant at http://qdrant.qdrant.svc.cluster.local:6333
Using collection: test_collection_1760245249

1. Creating collection...
{
  "result": true,
  "status": "ok"
}

2. Adding test points...
{
  "result": {
    "operation_id": 0,
    "status": "completed"
  },
  "status": "ok"
}

3. Searching for similar vectors...
{
  "result": [
    {
      "id": 2,
      "score": 0.99,
      "payload": {"city": "London"}
    }
  ],
  "status": "ok"
}

Test completed successfully!
```

## Using Qdrant

### REST API

Qdrant provides a RESTful API for all operations. Here are some common examples:

#### Create a Collection

```bash
curl -X PUT "https://qdrant.yourdomain.com/collections/my_collection" \
  -H "api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "vectors": {
      "size": 384,
      "distance": "Cosine"
    }
  }'
```

#### Insert Vectors

```bash
curl -X PUT "https://qdrant.yourdomain.com/collections/my_collection/points" \
  -H "api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "points": [
      {
        "id": 1,
        "vector": [0.1, 0.2, ...],
        "payload": {"text": "example document"}
      }
    ]
  }'
```

#### Search Similar Vectors

```bash
curl -X POST "https://qdrant.yourdomain.com/collections/my_collection/points/search" \
  -H "api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "vector": [0.15, 0.25, ...],
    "limit": 10
  }'
```

### Python Client

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

# Connect to Qdrant
client = QdrantClient(
    url="https://qdrant.yourdomain.com",
    api_key="YOUR_API_KEY"
)

# Create collection
client.create_collection(
    collection_name="my_collection",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE)
)

# Insert vectors
client.upsert(
    collection_name="my_collection",
    points=[
        PointStruct(
            id=1,
            vector=[0.1, 0.2, ...],
            payload={"text": "example document"}
        )
    ]
)

# Search
results = client.search(
    collection_name="my_collection",
    query_vector=[0.15, 0.25, ...],
    limit=10
)
```
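
Payload filtering composes with vector search in the same client. A short sketch reusing the 4-dimensional, city-payload shape from the test recipe above; the collection and field names are illustrative:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

client = QdrantClient(url="https://qdrant.yourdomain.com", api_key="YOUR_API_KEY")

# Vector similarity restricted to points whose payload has city == "London"
results = client.search(
    collection_name="test_collection",
    query_vector=[0.15, 0.25, 0.35, 0.45],
    query_filter=Filter(
        must=[FieldCondition(key="city", match=MatchValue(value="London"))]
    ),
    limit=5,
)
```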

### JupyterHub Integration

Store your API key securely in Vault using the buunstack package:

```python
from buunstack import SecretStore

secrets = SecretStore()
secrets.put('qdrant', api_key='YOUR_API_KEY')

# Later, retrieve it
api_key = secrets.get('qdrant', field='api_key')
```

## Use Cases

### Vector Embeddings Search

Store and search document, image, or audio embeddings for:

- Semantic search
- Recommendation systems
- Duplicate detection
- Content-based filtering

### RAG (Retrieval-Augmented Generation)

Use Qdrant as the vector store for LLM applications (see the sketch after this list):

- Store document chunks with embeddings
- Retrieve relevant context for LLM prompts
- Build knowledge bases with semantic search
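
A compact sketch of that flow. The `embed()` function here is a deterministic toy stand-in for a real embedding model (e.g. a sentence-transformers model), and the collection name is illustrative:

```python
import hashlib

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams


def embed(text: str) -> list[float]:
    # Toy 4-dimensional "embedding"; replace with a real model in practice.
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:4]]


client = QdrantClient(url="https://qdrant.yourdomain.com", api_key="YOUR_API_KEY")

client.create_collection(
    collection_name="rag_chunks",
    vectors_config=VectorParams(size=4, distance=Distance.COSINE),
)

chunks = ["Qdrant stores vectors.", "Lakekeeper manages Iceberg tables."]
client.upsert(
    collection_name="rag_chunks",
    points=[
        PointStruct(id=i, vector=embed(c), payload={"text": c})
        for i, c in enumerate(chunks)
    ],
)

# Retrieve context to splice into an LLM prompt
hits = client.search(
    collection_name="rag_chunks",
    query_vector=embed("What stores vectors?"),
    limit=2,
)
context = "\n".join(hit.payload["text"] for hit in hits)
print(context)
```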

### Similarity Matching

Find similar items based on learned representations:

- Image similarity search
- Product recommendations
- Anomaly detection
- Clustering and classification

## Cleanup

To remove all Qdrant resources and secrets from Vault:

```bash
just qdrant::cleanup
```

This will prompt for confirmation before deleting the Vault secrets.

## Uninstallation

```bash
just qdrant::uninstall
```

This will:

- Uninstall the Qdrant Helm release
- Delete API key secrets
- Remove the Qdrant namespace

## Documentation

For more information, see the official Qdrant documentation:

- [Qdrant Documentation](https://qdrant.tech/documentation/)
- [REST API Reference](https://qdrant.tech/documentation/api-reference/)
- [Python Client](https://github.com/qdrant/qdrant-client)