# Prometheus

A monitoring and observability stack for Kubernetes, built from the following components and features:

- **Prometheus Operator**: Manages Prometheus instances via CRDs
- **Prometheus**: Time-series database and metrics collection
- **Grafana**: Visualization and dashboarding
- **Alertmanager**: Alert routing and management
- **Node Exporter**: Hardware and OS metrics
- **Kube State Metrics**: Kubernetes cluster state metrics
- **Namespace-based monitoring**: Explicit control via labels
- **OIDC authentication**: Optional Keycloak integration for Grafana

## Prerequisites

- Kubernetes cluster (k3s)
- External Secrets Operator (optional, for Vault integration)
- Vault (optional, for credential storage)
- Keycloak (optional, for Grafana OIDC authentication)

## Installation

```bash
just prometheus::install
```

You will be prompted for:

1. **Grafana host (FQDN)**: e.g., `grafana.example.com`
2. **Grafana admin password**: Auto-generated if not provided

### What Gets Installed

- Prometheus Operator and CRDs
- Prometheus server with namespace selector
- Grafana with ingress
- Alertmanager
- Node Exporter (DaemonSet)
- Kube State Metrics
- Default ServiceMonitors for Kubernetes components

The stack uses the official [kube-prometheus-stack Helm chart](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack).

## Access

### Grafana

Access Grafana at `https://your-grafana-host/`

**Default Credentials**:

- Username: `admin`
- Password: Retrieved via `just prometheus::admin-password`

### Prometheus

By default, the Prometheus web UI is reachable only from inside the cluster. To open it from your workstation, set up port forwarding:

```bash
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
```

Then access at `http://localhost:9090`

### Alertmanager

By default, Alertmanager is reachable only from inside the cluster. To open it from your workstation, set up port forwarding:

```bash
kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093
```

Then access at `http://localhost:9093`
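
Prometheus and Alertmanager both expose a readiness endpoint at `/-/ready`, so a forwarded port can be checked from the command line before opening a browser. A minimal sketch, assuming `curl` is installed; the helper names `ready_url` and `check_ready` are illustrative and not part of buun-stack:

```bash
# Build the readiness URL for a locally forwarded port.
ready_url() {
  echo "http://localhost:$1/-/ready"
}

# Succeeds once the forwarded service reports ready.
# Run the matching `kubectl port-forward` in another terminal first.
check_ready() {
  curl --silent --fail --max-time 5 "$(ready_url "$1")" >/dev/null
}

# Usage (hypothetical):
#   check_ready 9090 && echo "Prometheus is ready"
#   check_ready 9093 && echo "Alertmanager is ready"
```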

## Pod Security Standards

The monitoring namespace enforces the **privileged** Pod Security Standard through the following namespace label:

```bash
pod-security.kubernetes.io/enforce=privileged
```

### Why Privileged Instead of Baseline or Restricted?

The `prometheus-node-exporter` component requires the following privileged access to collect hardware and OS-level metrics:

- `hostNetwork: true` - Access to host network namespace
- `hostPID: true` - Access to host process IDs
- `hostPath` volumes - Access to host filesystem paths (`/`, `/sys`, `/proc`)
- `hostPort: 9100` - Expose metrics on host port

These requirements are incompatible with both the `baseline` and `restricted` Pod Security Standards:

- **baseline** prohibits: `hostNetwork`, `hostPID`, `hostPath`, `hostPort`
- **restricted** builds on baseline and adds further requirements, such as running as non-root, dropping all capabilities, and setting a seccomp profile

While these settings may seem permissive, they are necessary for node-exporter to collect system-level metrics from the host.
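
To see which pods a stricter level would reject, a server-side dry run of the namespace label reports violations as warnings without changing anything. A sketch, assuming the default `monitoring` namespace when `PROMETHEUS_NAMESPACE` is unset; the function names `psa_label` and `psa_dry_run` are illustrative:

```bash
# Print the Pod Security enforce label for a given level
# (privileged, baseline, or restricted).
psa_label() {
  echo "pod-security.kubernetes.io/enforce=$1"
}

# Ask the API server which existing pods would violate the given level.
# --dry-run=server evaluates the change without persisting it.
psa_dry_run() {
  kubectl label --dry-run=server --overwrite namespace \
    "${PROMETHEUS_NAMESPACE:-monitoring}" "$(psa_label "$1")"
}

# Usage: psa_dry_run baseline
```

With node-exporter enabled, the warnings would be expected to name the node-exporter pods.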

### Security Measures

Although enforcement is privileged at the namespace level, every component except node-exporter runs with a restricted-level security context:

- **Grafana**: Non-root user (472), dropped capabilities, seccomp profile
- **Prometheus**: Non-root user (1000), read-only root filesystem, dropped capabilities
- **Alertmanager**: Non-root user (1000), read-only root filesystem, dropped capabilities
- **Prometheus Operator**: Non-root user (65534), read-only root filesystem, dropped capabilities
- **kube-state-metrics**: Non-root user (65534), read-only root filesystem, dropped capabilities

### Alternative: Restricted Mode Without Node Metrics

To use `restricted` Pod Security Standard, disable node-exporter:

1. Add to `values.gomplate.yaml`:
   ```yaml
   nodeExporter:
     enabled: false
   ```

2. Update justfile to use `restricted`:
   ```bash
   kubectl label namespace ${PROMETHEUS_NAMESPACE} \
       pod-security.kubernetes.io/enforce=restricted --overwrite
   ```

**Trade-off**: You will lose node-level metrics (CPU, memory, disk, network at the host level), though pod-level metrics remain available.

## Configuration

Set these environment variables in `.env.local`, or override them in your shell:

```bash
PROMETHEUS_NAMESPACE=monitoring                      # Kubernetes namespace
PROMETHEUS_CHART_VERSION=79.4.0                      # Helm chart version
GRAFANA_HOST=grafana.example.com                     # Grafana FQDN
PROMETHEUS_HOST=prometheus.example.com               # Prometheus FQDN (optional)
ALERTMANAGER_HOST=alertmanager.example.com           # Alertmanager FQDN (optional)
GRAFANA_ADMIN_PASSWORD=                              # Grafana admin password
GRAFANA_OIDC_ENABLED=false                           # Enable Keycloak OIDC
GRAFANA_OIDC_CLIENT_SECRET=                          # Keycloak client secret
KEYCLOAK_NAMESPACE=keycloak                          # Keycloak namespace
KEYCLOAK_REALM=                                      # Keycloak realm
KEYCLOAK_HOST=                                       # Keycloak host
```
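
When running `kubectl` or `helm` by hand, the same values can be loaded in the shell so that commands follow whatever `.env.local` sets. A sketch, assuming `.env.local` contains plain `KEY=value` lines and sits in the current directory; how the justfile itself loads the file is not shown here:

```bash
# Load overrides if the file exists, exporting every variable it defines.
if [ -f .env.local ]; then
  set -a
  . ./.env.local
  set +a
fi

# Fall back to the values listed above when a variable is unset or empty.
PROMETHEUS_NAMESPACE="${PROMETHEUS_NAMESPACE:-monitoring}"
PROMETHEUS_CHART_VERSION="${PROMETHEUS_CHART_VERSION:-79.4.0}"

echo "namespace=${PROMETHEUS_NAMESPACE} chart=${PROMETHEUS_CHART_VERSION}"
```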

## Features

### Namespace-Based Monitoring Control

By default, Prometheus only monitors namespaces with the label `buun.channel/enable-monitoring=true`. This provides explicit control over which resources are monitored.

**Enable monitoring for a namespace**:

```bash
kubectl label namespace <namespace> buun.channel/enable-monitoring=true
```

**Disable monitoring for a namespace**:

```bash
kubectl label namespace <namespace> buun.channel/enable-monitoring-
```

The monitoring namespace is automatically labeled during installation.
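
To check which namespaces are currently opted in, the label can be used as a selector. A small sketch; the variable and function names are illustrative:

```bash
# Label that opts a namespace into monitoring (from this README).
MONITORING_LABEL="buun.channel/enable-monitoring"

# List every namespace Prometheus is allowed to discover monitors in.
list_monitored_namespaces() {
  kubectl get namespaces -l "${MONITORING_LABEL}=true"
}

# Show the label's value as an extra column for all namespaces.
show_monitoring_column() {
  kubectl get namespaces -L "${MONITORING_LABEL}"
}
```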

### ServiceMonitor and PodMonitor

Prometheus Operator uses `ServiceMonitor` and `PodMonitor` CRDs to configure metric scraping.

**Requirements for automatic discovery**:

1. ServiceMonitor/PodMonitor must be in a namespace with label `buun.channel/enable-monitoring=true`
2. ServiceMonitor/PodMonitor must have label `release=kube-prometheus-stack`

**Example ServiceMonitor**:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-service
  namespace: my-namespace
  labels:
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app: my-service
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s
```

**Example PodMonitor**:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: my-pods
  namespace: my-namespace
  labels:
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app: my-app
  podMetricsEndpoints:
    - port: metrics
      path: /metrics
      interval: 30s
```
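
In both examples `port: metrics` refers to a port *name*, not a number, and `selector.matchLabels` is matched against the labels of the Service (for a ServiceMonitor) or of the pods (for a PodMonitor). A sketch of a Service that the example ServiceMonitor would select; the port number `8080` is a placeholder and `render_service` is an illustrative helper:

```bash
# Render a Service whose labels and named port line up with the
# example ServiceMonitor (selector app=my-service, port "metrics").
render_service() {
  cat <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: my-service
  namespace: my-namespace
  labels:
    app: my-service        # matched by the ServiceMonitor's selector
spec:
  selector:
    app: my-service
  ports:
    - name: metrics        # referenced by `port: metrics`
      port: 8080           # placeholder; use your application's metrics port
      targetPort: 8080
EOF
}

# Usage: render_service | kubectl apply -f -
```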

### Metric Relabeling

Use `metricRelabelings` to transform metric names and labels after a scrape and before the samples are stored in Prometheus.

**Example: Rename metrics**:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: keycloak
  namespace: keycloak
  labels:
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app: keycloak
  endpoints:
    - port: management
      path: /metrics
      interval: 30s
      metricRelabelings:
        - sourceLabels: [__name__]
          regex: 'vendor_(.*)'
          targetLabel: __name__
          replacement: 'keycloak_$1'
```

This configuration converts `vendor_*` metrics to `keycloak_*` for better discoverability.
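
Prometheus anchors relabeling regexes at both ends, so `vendor_(.*)` only matches names that start with `vendor_`; other metrics pass through unchanged. The rewrite can be previewed locally with `sed`, which needs the anchors written out. The metric names below are made up for illustration:

```bash
# Emulate the rule: regex 'vendor_(.*)' -> replacement 'keycloak_$1'.
# Prometheus implicitly anchors the regex at both ends; sed does not.
rename_metric() {
  printf '%s\n' "$1" | sed -E 's/^vendor_(.*)$/keycloak_\1/'
}

rename_metric vendor_cache_hits_total   # prints keycloak_cache_hits_total
rename_metric jvm_memory_used_bytes     # prints jvm_memory_used_bytes (no match)
```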

## OIDC Authentication

### Setup Keycloak OIDC for Grafana

```bash
just prometheus::setup-oidc
```

This will:

1. Create Keycloak client `grafana`
2. Create `grafana-admins` group in Keycloak
3. Update Grafana configuration to use Keycloak OIDC
4. Restart Grafana with new settings

**Grant admin access to a user**:

```bash
just keycloak::add-user-to-group <username> grafana-admins
```

Users in the `grafana-admins` group will have Grafana Admin role.

### Disable OIDC

```bash
just prometheus::disable-oidc
```

This will revert Grafana to local authentication.

## Management

### Get Grafana Admin Password

```bash
just prometheus::admin-password
```

### Upgrade Stack

```bash
# Render Helm values from the template, then upgrade the release
gomplate -f prometheus/values.gomplate.yaml -o prometheus/values.yaml
helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --version 79.4.0 \
  -n monitoring \
  -f prometheus/values.yaml
```
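
If the chart repository has not been added on the machine running the upgrade, Helm cannot resolve `prometheus-community/kube-prometheus-stack`. A sketch of the extra steps, with `--dry-run` to preview the rendered release first; the function names `add_chart_repo` and `preview_upgrade` are illustrative, and the defaults are taken from the Configuration section:

```bash
# Register (or refresh) the prometheus-community chart repository.
add_chart_repo() {
  helm repo add prometheus-community \
    https://prometheus-community.github.io/helm-charts --force-update
  helm repo update
}

# Render the upgrade without applying it.
preview_upgrade() {
  helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack \
    --version "${PROMETHEUS_CHART_VERSION:-79.4.0}" \
    -n "${PROMETHEUS_NAMESPACE:-monitoring}" \
    -f prometheus/values.yaml \
    --dry-run
}

# Usage: add_chart_repo && preview_upgrade
```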

### Uninstall

```bash
just prometheus::uninstall
```

This will remove:

- Helm release
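- All Prometheus Operator CRDs (deleting a CRD also deletes every object of that kind, such as `ServiceMonitor` and `PodMonitor` resources, in all namespaces)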
- Namespace
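
Because custom resources disappear together with their CRDs, it can be worth listing what exists across the cluster before uninstalling. A sketch using resource kinds defined by Prometheus Operator; the variable and function names are illustrative:

```bash
# Resource kinds that are removed together with the Prometheus Operator CRDs.
OPERATOR_KINDS="servicemonitors,podmonitors,prometheusrules"

# List these objects in all namespaces so they can be exported first if needed.
list_operator_resources() {
  kubectl get "${OPERATOR_KINDS}" --all-namespaces
}

# Usage: list_operator_resources
```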

## Monitoring Examples

### PostgreSQL (CloudNativePG)

Enable monitoring for PostgreSQL cluster:

```bash
just postgres::enable-monitoring
```

This creates a PodMonitor for the PostgreSQL cluster with proper labels.

### Keycloak

Enable monitoring for Keycloak:

```bash
just keycloak::enable-monitoring
```

This creates a ServiceMonitor that:

- Scrapes metrics from Keycloak management port (9000)
- Converts `vendor_*` metrics to `keycloak_*` for better discoverability

### Custom Services

For services not managed by buun-stack justfiles:

1. **Label the namespace**:

   ```bash
   kubectl label namespace <namespace> buun.channel/enable-monitoring=true
   ```

2. **Create a ServiceMonitor labeled `release: kube-prometheus-stack`**:

   ```yaml
   apiVersion: monitoring.coreos.com/v1
   kind: ServiceMonitor
   metadata:
     name: my-service
     namespace: my-namespace
     labels:
       release: kube-prometheus-stack
   spec:
     selector:
       matchLabels:
         app: my-service
     endpoints:
       - port: metrics
         path: /metrics
         interval: 30s
   ```

3. **Verify target is discovered**:

   ```bash
   kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
   # Open http://localhost:9090/targets in browser
   ```
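
The same check can be made from the command line through the Prometheus HTTP API while the port-forward is running. A sketch, assuming `curl` is available and that targets carry the `namespace` label that Prometheus Operator attaches; `up_query` and `query_up` are illustrative helper names:

```bash
# Build a PromQL selector for the `up` series of one namespace.
up_query() {
  printf 'up{namespace="%s"}' "$1"
}

# Run the query against the forwarded Prometheus. In the JSON response,
# a value of "1" means the last scrape succeeded and "0" means it failed.
query_up() {
  curl --silent --get http://localhost:9090/api/v1/query \
    --data-urlencode "query=$(up_query "$1")"
}

# Usage: query_up my-namespace
```

An empty result list suggests the target was never discovered, which points back to the namespace and ServiceMonitor labels.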

## Grafana Dashboards

The stack includes default dashboards for:

- Kubernetes cluster overview
- Node metrics
- Pod metrics
- Persistent volumes
- StatefulSets

**Import additional dashboards**:

1. Go to Grafana → Dashboards → Import
2. Enter dashboard ID from [Grafana Dashboard Library](https://grafana.com/grafana/dashboards/)
3. Select Prometheus data source
4. Click Import

**Popular dashboard IDs**:

- `15757` - Kubernetes / Views / Global
- `15758` - Kubernetes / Views / Namespaces
- `15759` - Kubernetes / Views / Nodes
- `15760` - Kubernetes / Views / Pods
- `3662` - Prometheus 2.0 Overview
- `12006` - Kubernetes API Server

## Troubleshooting

### ServiceMonitor Not Discovered

**Check namespace label**:

```bash
kubectl get namespace <namespace> --show-labels
```

The output should include `buun.channel/enable-monitoring=true`.

**Check ServiceMonitor labels**:

```bash
kubectl get servicemonitor <name> -n <namespace> --show-labels
```

The output should include `release=kube-prometheus-stack`.

**Check Prometheus targets**:

```bash
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Open http://localhost:9090/targets
```
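
If both labels look right but the target is still missing, it can help to compare them against the selectors configured on the Prometheus resource itself. A sketch that prints the `serviceMonitorNamespaceSelector` and `serviceMonitorSelector` fields; the function name is illustrative, and the namespace falls back to `monitoring`:

```bash
# Print the name and ServiceMonitor selectors of each Prometheus resource.
show_monitor_selectors() {
  kubectl get prometheus -n "${PROMETHEUS_NAMESPACE:-monitoring}" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{.spec.serviceMonitorNamespaceSelector}{"\n"}{.spec.serviceMonitorSelector}{"\n"}{end}'
}

# Usage: show_monitor_selectors
```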

### Metrics Not Appearing in Grafana

**Refresh Grafana metrics list**:

1. Hard refresh browser: Cmd+Shift+R (Mac) or Ctrl+Shift+R (Windows/Linux)
2. Wait a few minutes for Grafana's metric cache to update
3. Query metrics directly in Explore tab

**Verify metrics in Prometheus**:

```bash
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Open http://localhost:9090/graph
# Query your metrics directly
```

**Check metricRelabelings**:

```bash
# View Prometheus scrape config
kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 -- \
  cat /etc/prometheus/config_out/prometheus.env.yaml | grep -A 20 "job_name: serviceMonitor/<namespace>/<name>"
```

### OIDC Authentication Issues

**Verify Keycloak client exists**:

```bash
just keycloak::list-clients
```

The output should list the `grafana` client.

**Check redirect URL**:

The redirect URL should be `https://your-grafana-host/login/generic_oauth`.

**Ensure the user is in the `grafana-admins` group** (this command adds the user to the group rather than only checking membership):

```bash
just keycloak::add-user-to-group <username> grafana-admins
```

### Check Pod Status

```bash
kubectl get pods -n monitoring
```

### View Prometheus Logs

```bash
kubectl logs -n monitoring prometheus-kube-prometheus-stack-prometheus-0
```

### View Grafana Logs

```bash
kubectl logs -n monitoring deployment/kube-prometheus-stack-grafana
```

## References

- [kube-prometheus-stack Helm Chart](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack)
- [Prometheus Operator Documentation](https://prometheus-operator.dev/)
- [Prometheus Documentation](https://prometheus.io/docs/)
- [Grafana Documentation](https://grafana.com/docs/)
- [ServiceMonitor CRD](https://prometheus-operator.dev/docs/operator/api/#servicemonitor)
- [PodMonitor CRD](https://prometheus-operator.dev/docs/operator/api/#podmonitor)
- [Alertmanager Documentation](https://prometheus.io/docs/alerting/latest/alertmanager/)