Storage Problems¶
Troubleshooting guide for storage-related issues in Longhorn and other storage systems.
Quick Diagnostics¶
# Check Longhorn health
kubectl get pods -n longhorn-system
kubectl get volumes.longhorn.io -A
# Check PV/PVC status
kubectl get pv,pvc -A
# Check storage classes
kubectl get storageclass
# Check node disk space
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.capacity.ephemeral-storage}{"\n"}{end}'
PVC Stuck in Pending¶
See Common Issues - PVC Stuck in Pending
Longhorn Volume Degraded¶
Symptoms¶
- Longhorn dashboard shows volume in "Degraded" state
- Volume has fewer replicas than configured
- Performance degradation
Diagnosis¶
# Get volume status
kubectl get volumes.longhorn.io -A
# Describe specific volume
kubectl describe volume <volume-name> -n longhorn-system
# Check Longhorn UI
# Navigate to: Volume → <volume-name> → Check replica status
Common Causes¶
1. Node Down or Disconnected¶
Fix:
- Bring the node back online
- Or force detach and rebuild the replica
# Check node status
kubectl get nodes
# If node permanently dead, remove it
kubectl delete node <node-name>
# Longhorn will rebuild replica on healthy node
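To watch the rebuild progress, the volume's replica CRs can be listed directly; this sketch assumes Longhorn's default `longhornvolume` label on replica objects:
# Watch replicas for a specific volume during rebuild
kubectl get replicas.longhorn.io -n longhorn-system -l longhornvolume=<volume-name> -w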
2. Disk Full¶
Check disk space on nodes:
# SSH to each node
ssh <node>
df -h
# Check Longhorn disk usage
ls -lah /var/lib/longhorn/
du -sh /var/lib/longhorn/*
Fix:
- Clean up old data (see the example commands below)
- Expand the disk
- Add more nodes with storage
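Example cleanup commands; what is safe to remove depends on the node, so treat these as a starting point:
# Prune unused container images (containerd-based nodes)
sudo crictl rmi --prune
# Shrink systemd journal logs
sudo journalctl --vacuum-size=200M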
3. Network Issues Between Replicas¶
Check:
- Network connectivity between nodes
- Firewall rules
- CNI (Flannel) health
# Test connectivity between nodes
kubectl run -it --rm debug --image=nicolaka/netshoot --restart=Never -- ping <other-node-ip>
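If basic pings succeed but replication still stalls, an MTU mismatch (common with VXLAN overlays such as Flannel) is worth ruling out; this is a generic don't-fragment test, not a Longhorn-specific check:
# Large packet with the don't-fragment bit set; failure suggests an MTU problem
kubectl run -it --rm debug --image=nicolaka/netshoot --restart=Never -- ping -M do -s 1400 <other-node-ip>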
Recovery Steps¶
- Identify problematic replica in Longhorn UI
- Remove failed replica (if node is dead; see the commands below)
- Add new replica - Longhorn automatically creates new replica
- Wait for rebuild - Can take hours for large volumes
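If the UI is unavailable, the failed replica can also be removed via its CR; a sketch assuming Longhorn's default CRD names:
# Locate the replica on the dead node, then delete it
kubectl get replicas.longhorn.io -n longhorn-system -o wide
kubectl delete replicas.longhorn.io <replica-name> -n longhorn-system
# Longhorn then schedules a replacement on a healthy node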
Longhorn Volume Won't Attach¶
Symptoms¶
- Pod stuck in ContainerCreating
- Error: "Volume is not attached"
Diagnosis¶
kubectl describe pod <pod-name> -n <namespace>
# Look for volume attachment errors
kubectl describe volume <volume-name> -n longhorn-system
Fixes¶
1. Volume Stuck on Old Node¶
# Check where volume is attached
kubectl get volume <volume-name> -n longhorn-system -o jsonpath='{.status.currentNodeID}'
# If node is down, force detach via Longhorn UI
# Volume → <volume> → Detach → Force detach
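Without UI access, deleting the stale Kubernetes VolumeAttachment achieves a similar force detach at the CSI level; only do this once the old node is confirmed down:
# Find the attachment pinning the volume to the dead node
kubectl get volumeattachments | grep <pv-name>
kubectl delete volumeattachment <attachment-name>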
2. Multiple Pods Trying to Use Same Volume¶
Problem: Two pods on different nodes trying to use ReadWriteOnce volume
Fix:
# Find pods using the volume
kubectl get pods -A -o json | jq -r '.items[] | select(.spec.volumes[]?.persistentVolumeClaim.claimName=="<pvc-name>") | "\(.metadata.namespace)/\(.metadata.name)"'
# Delete one of the pods
kubectl delete pod <pod-name> -n <namespace>
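If this recurs on every Deployment rollout, the default RollingUpdate strategy briefly runs old and new pods side by side; switching to Recreate prevents two pods claiming the RWO volume at once:
# Deployment excerpt: stop the old pod before starting the new one
spec:
  strategy:
    type: Recreate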
Volume Performance Issues¶
Slow I/O¶
Causes:
- Node disk is slow (SD card vs NVMe)
- Network bottleneck between replicas
- Longhorn engine overloaded
Diagnosis:
# Check I/O performance on node
ssh <node>
sudo iostat -x 5
# Check Longhorn engine pods
kubectl get pods -n longhorn-system | grep engine
kubectl top pods -n longhorn-system | grep engine
Fixes:
- Use faster storage: Migrate to NVMe-backed nodes
- Reduce replica count (trade-off: less redundancy); see the example after this list
- Use local-path storage for non-critical data
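A sketch of lowering the replica count on an existing volume; spec.numberOfReplicas is the relevant field on the Longhorn volume CR:
# Reduce replicas for one volume (e.g. 3 → 2)
kubectl edit volumes.longhorn.io <volume-name> -n longhorn-system
# set: spec.numberOfReplicas: 2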
Disk Space Issues¶
Node Running Out of Space¶
Symptoms:
- Pods evicted
- "No space left on device" errors
- Longhorn won't create new replicas
Diagnosis:
# Check node disk usage
kubectl get nodes -o json | jq -r '.items[] | "\(.metadata.name): \(.status.allocatable["ephemeral-storage"])"'
# SSH to node and check
ssh <node>
df -h
du -sh /var/lib/longhorn/* | sort -h
Fixes:
- Clean up old data
- Delete old Longhorn backups via the Longhorn UI
- Remove unused PVs, as shown below
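A sketch of the PV cleanup; Released PVs still hold their backing storage, so confirm the data is disposable first:
# List PVs whose claims are gone but whose storage is still held
kubectl get pv | grep Released
kubectl delete pv <pv-name>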
Longhorn Exceeding Disk Reservation¶
Error: "Disk usage exceeded the threshold"
Fix:
# Via Longhorn UI: Node → <node> → Edit → Adjust Storage Reserved
# Or via kubectl
kubectl edit node.longhorn.io <node-name> -n longhorn-system
# Adjust: spec.disks.default.storageReserved
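The same change can be made non-interactively with a patch; this sketch assumes the disk is named default and takes the reservation in bytes:
# Reserve 10 GiB on the node's default disk
kubectl patch nodes.longhorn.io <node-name> -n longhorn-system --type merge \
  -p '{"spec":{"disks":{"default":{"storageReserved":10737418240}}}}'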
Backup Issues¶
Backup Failing¶
Symptoms: Longhorn backup shows failed status
Diagnosis:
# Check Longhorn backup target settings
kubectl get settings.longhorn.io backup-target -n longhorn-system -o yaml
# Check MinIO/S3 connectivity (supply credentials; MinIO also needs --endpoint-url)
kubectl run -it --rm aws-cli --image=amazon/aws-cli --restart=Never \
  --env AWS_ACCESS_KEY_ID=<key> --env AWS_SECRET_ACCESS_KEY=<secret> \
  -- s3 ls s3://<bucket-name>/ --endpoint-url <backup-target-endpoint>
Common Causes:
- Invalid backup target URL: Verify S3/NFS URL
- Network connectivity: Can't reach backup target
- Insufficient permissions: S3 credentials lack permissions
Fix:
# Update backup target
kubectl edit settings.longhorn.io backup-target -n longhorn-system
# Update backup credentials (the secret named in the backup-target-credential-secret setting)
kubectl edit secret <backup-credential-secret> -n longhorn-system
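For reference, Longhorn expects its S3 backup target in bucket@region form; a hypothetical example:
# Example backup-target value
s3://backup-bucket@us-east-1/longhorn/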
Restore Failing¶
Symptoms: Restore from backup fails
Diagnosis:
# Check volume status
kubectl get volumes.longhorn.io -A
# Check Longhorn manager logs
kubectl logs -n longhorn-system -l app=longhorn-manager --tail=200
Fix: Delete failed restore and retry via Longhorn UI
Migrating Data Between Volumes¶
Use a Job to copy data:
apiVersion: batch/v1
kind: Job
metadata:
  name: volume-migration
  namespace: <namespace>
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: migration
        image: ubuntu:latest
        # cp -a preserves ownership/permissions and copies dotfiles too
        command: ["/bin/sh", "-c", "cp -a /old/. /new/"]
        volumeMounts:
        - name: old-vol
          mountPath: /old
        - name: new-vol
          mountPath: /new
      volumes:
      - name: old-vol
        persistentVolumeClaim:
          claimName: old-pvc
      - name: new-vol
        persistentVolumeClaim:
          claimName: new-pvc
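To run the migration and wait for it to finish (names taken from the manifest above):
kubectl apply -f volume-migration.yaml
kubectl wait --for=condition=complete job/volume-migration -n <namespace> --timeout=30m
# Verify the copied data, then point the workload at new-pvc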
Quick Reference¶
# Longhorn status
kubectl get volumes.longhorn.io -A
kubectl describe volume <volume> -n longhorn-system
# Access Longhorn UI
kubectl port-forward -n longhorn-system svc/longhorn-frontend 8080:80
# Check all PVs
kubectl get pv,pvc -A
# Clean up Released PVs (destructive - confirm the data is no longer needed)
kubectl get pv | grep Released | awk '{print $1}' | xargs kubectl delete pv