fix(k8s): fix Longhorn errors on machine shutdown

Masaki Yatsu
2025-11-30 16:23:46 +09:00
parent b80b775dd5
commit 992b6ca8f8
2 changed files with 207 additions and 93 deletions


@@ -4,8 +4,103 @@ This document provides solutions to common issues encountered when working with
## Table of Contents
- [Longhorn Issues](#longhorn-issues)
- [Vault Issues](#vault-issues)
## Longhorn Issues
### EXT4 Errors on Machine Shutdown
#### Symptom
When shutting down the machine, you may see errors like:
```plain
EXT4-fs (sdf): failed to convert unwritten extents to written extents -- potential data loss! (inode 393220, error -30)
```
Or similar I/O errors in kernel logs:
```plain
blk_update_request: I/O error, dev sdf, sector XXXXX op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
Buffer I/O error on dev dm-X, logical block XXXXX, lost sync page write
```
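If you are unsure whether the previous shutdown hit these errors, you can search the kernel log from the previous boot. A minimal sketch, assuming systemd-journald keeps persistent logs on this machine:
```bash
# Kernel messages from the previous boot, filtered for EXT4 and I/O errors
journalctl -k -b -1 | grep -E 'EXT4-fs|I/O error'
```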
#### Cause
This occurs when the machine is shut down without properly detaching Longhorn volumes. The standard k3s shutdown procedure (`systemctl stop k3s` or `k3s-killall.sh`) does not gracefully handle Longhorn volume detachment.
When volumes are forcefully detached during shutdown:
- Dirty data may not be flushed to disk
- The filesystem encounters I/O errors trying to complete pending writes
- This can lead to data corruption or loss
Reference: <https://github.com/longhorn/longhorn/issues/7206>
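Before shutting down, you can check whether any Longhorn volume is still attached by listing the Longhorn `Volume` custom resources. A minimal sketch, assuming Longhorn is installed in the `longhorn` namespace (adjust to your installation):
```bash
# List Longhorn volumes with their attachment state and the node they are attached to
kubectl -n longhorn get volumes.longhorn.io \
  -o custom-columns=NAME:.metadata.name,STATE:.status.state,NODE:.status.currentNodeID

# Any volume still reported as "attached" has not been cleanly detached yet
```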
#### Solution
Always use `just k8s::stop` before shutting down the machine:
```bash
# Gracefully stop k3s with proper Longhorn volume detachment
just k8s::stop
# Now you can safely shutdown the machine
sudo shutdown -h now
```
The `just k8s::stop` recipe performs the following steps (a rough shell equivalent is sketched after the list):
1. **Drains the node** using `kubectl drain` to gracefully evict all pods
2. **Waits for Longhorn volumes** to be fully detached
3. **Stops k3s service** and cleans up container processes
4. **Terminates remaining containerd-shim processes**
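The sketch below shows roughly equivalent shell commands. It is not the actual recipe: the node name `my-node` and the `longhorn` namespace are assumptions, and the real recipe may differ in flags, ordering, and error handling.
```bash
# 1. Drain the node: evicts pods, ignoring DaemonSets; --force also removes
#    pods that are not managed by a controller
kubectl drain my-node --ignore-daemonsets --delete-emptydir-data --force

# 2. Wait until no Longhorn volume reports the "attached" state
while kubectl -n longhorn get volumes.longhorn.io \
        -o jsonpath='{.items[*].status.state}' | grep -q attached; do
  echo "Waiting for Longhorn volumes to detach..."
  sleep 5
done

# 3. Stop the k3s service
sudo systemctl stop k3s

# 4. Terminate any containerd-shim processes left behind
sudo pkill -f containerd-shim || true
```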
#### Expected Warnings During Drain
During the drain process, you may see warnings like:
```plain
error when evicting pods/"instance-manager-..." -n "longhorn" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
```
This is normal. Longhorn's instance-manager pods are protected by a PodDisruptionBudget (PDB). The drain command will retry and eventually evict them with the `--force` option.
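To see which PodDisruptionBudgets are currently blocking eviction, you can list them in the Longhorn namespace (assumed here to be `longhorn`):
```bash
# Show the PDBs protecting Longhorn's instance-manager pods
kubectl -n longhorn get pdb
```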
You may also see client-side throttling messages:
```plain
"Waited before sending request" delay="1.000769875s" reason="client-side throttling..."
```
This is also normal. The Kubernetes client automatically throttles requests when evicting many pods at once. These warnings do not indicate any problem.
#### Starting the Cluster After Reboot
After rebooting, start the cluster with:
```bash
just k8s::start
```
This will (a shell sketch of these steps follows the list):
1. Start the k3s service
2. Wait for the node to be ready
3. Automatically uncordon the node (which was cordoned during drain)
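Roughly equivalent shell commands, under the same assumptions as the stop sketch above (hypothetical node name `my-node`); the actual recipe may differ:
```bash
# 1. Start the k3s service
sudo systemctl start k3s

# 2. Wait for the node to report Ready
kubectl wait --for=condition=Ready node/my-node --timeout=300s

# 3. Lift the cordon that was placed by the earlier drain
kubectl uncordon my-node
```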
#### Quick Reference
```bash
# Before shutdown
just k8s::stop
sudo shutdown -h now
# After reboot
just k8s::start
just vault::unseal # If Vault is installed
```
## Vault Issues
### Vault is Sealed