Storage Problems¶
Troubleshooting guide for storage-related issues in Longhorn and other storage systems.
Quick Diagnostics¶
# Check Longhorn health
kubectl get pods -n longhorn-system
kubectl get volumes.longhorn.io -A
# Check PV/PVC status
kubectl get pv,pvc -A
# Check storage classes
kubectl get storageclass
# Check node disk space
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.capacity.ephemeral-storage}{"\n"}{end}'
PVC Stuck in Pending¶
See Common Issues - PVC Stuck in Pending
Longhorn Volume Degraded¶
Symptoms¶
- Longhorn dashboard shows volume in "Degraded" state
- Volume has fewer replicas than configured
- Performance degradation
Diagnosis¶
# Get volume status
kubectl get volumes.longhorn.io -A
# Describe specific volume
kubectl describe volume <volume-name> -n longhorn-system
# Check Longhorn UI
# Navigate to: Volume → <volume-name> → Check replica status
Common Causes¶
1. Node Down or Disconnected¶
Fix:
- Bring the node back online
- Or force detach and rebuild the replica
# Check node status
kubectl get nodes
# If node permanently dead, remove it
kubectl delete node <node-name>
# Longhorn will rebuild replica on healthy node
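To watch the rebuild progress, the volume's replica CRs can be listed directly; this sketch assumes Longhorn's default `longhornvolume` label on replica objects:
# Watch replicas for a specific volume during rebuild
kubectl get replicas.longhorn.io -n longhorn-system -l longhornvolume=<volume-name> -w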
2. Disk Full¶
Check disk space on nodes:
# SSH to each node
ssh <node>
df -h
# Check Longhorn disk usage
ls -lah /var/lib/longhorn/
du -sh /var/lib/longhorn/*
Fix:
- Clean up old data (see the example commands below)
- Expand the disk
- Add more nodes with storage
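Example cleanup commands; what is safe to remove depends on the node, so treat these as a starting point:
# Prune unused container images (containerd-based nodes)
sudo crictl rmi --prune
# Shrink systemd journal logs
sudo journalctl --vacuum-size=200M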
3. Network Issues Between Replicas¶
Check:
- Network connectivity between nodes
- Firewall rules
- CNI (Flannel) health
# Test connectivity between nodes
kubectl run -it --rm debug --image=nicolaka/netshoot --restart=Never -- ping <other-node-ip>
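If basic pings succeed but replication still stalls, an MTU mismatch (common with VXLAN overlays such as Flannel) is worth ruling out; this is a generic don't-fragment test, not a Longhorn-specific check:
# Large packet with the don't-fragment bit set; failure suggests an MTU problem
kubectl run -it --rm debug --image=nicolaka/netshoot --restart=Never -- ping -M do -s 1400 <other-node-ip>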
Recovery Steps¶
- Identify problematic replica in Longhorn UI
- Remove failed replica (if node is dead; see the commands below)
- Add new replica - Longhorn automatically creates new replica
- Wait for rebuild - Can take hours for large volumes
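If the UI is unavailable, the failed replica can also be removed via its CR; a sketch assuming Longhorn's default CRD names:
# Locate the replica on the dead node, then delete it
kubectl get replicas.longhorn.io -n longhorn-system -o wide
kubectl delete replicas.longhorn.io <replica-name> -n longhorn-system
# Longhorn then schedules a replacement on a healthy node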
Longhorn Volume Won't Attach¶
Symptoms¶
- Pod stuck in ContainerCreating
- Error: "Volume is not attached"
Diagnosis¶
kubectl describe pod <pod-name> -n <namespace>
# Look for volume attachment errors
kubectl describe volume <volume-name> -n longhorn-system
Fixes¶
1. Volume Stuck on Old Node¶
# Check where volume is attached
kubectl get volume <volume-name> -n longhorn-system -o jsonpath='{.status.currentNodeID}'
# If node is down, force detach via Longhorn UI
# Volume → <volume> → Detach → Force detach
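Without UI access, deleting the stale Kubernetes VolumeAttachment achieves a similar force detach at the CSI level; only do this once the old node is confirmed down:
# Find the attachment pinning the volume to the dead node
kubectl get volumeattachments | grep <pv-name>
kubectl delete volumeattachment <attachment-name>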
2. Multiple Pods Trying to Use Same Volume¶
Problem: Two pods on different nodes trying to use ReadWriteOnce volume
Fix:
# Find pods using the volume
kubectl get pods -A -o json | jq -r '.items[] | select(.spec.volumes[]?.persistentVolumeClaim.claimName=="<pvc-name>") | "\(.metadata.namespace)/\(.metadata.name)"'
# Delete one of the pods
kubectl delete pod <pod-name> -n <namespace>
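If this recurs on every Deployment rollout, the default RollingUpdate strategy briefly runs old and new pods side by side; switching to Recreate prevents two pods claiming the RWO volume at once:
# Deployment excerpt: stop the old pod before starting the new one
spec:
  strategy:
    type: Recreate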
Volume Performance Issues¶
Slow I/O¶
Causes:
- Node disk is slow (SD card vs NVMe)
- Network bottleneck between replicas
- Longhorn engine overloaded
Diagnosis:
# Check I/O performance on node
ssh <node>
sudo iostat -x 5
# Check Longhorn engine pods
kubectl get pods -n longhorn-system | grep engine
kubectl top pods -n longhorn-system | grep engine
Fixes:
- Use faster storage: Migrate to NVMe-backed nodes
- Reduce replica count (trade-off: less redundancy); see the example after this list
- Use local-path storage for non-critical data
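A sketch of lowering the replica count on an existing volume; spec.numberOfReplicas is the relevant field on the Longhorn volume CR:
# Reduce replicas for one volume (e.g. 3 → 2)
kubectl edit volumes.longhorn.io <volume-name> -n longhorn-system
# set: spec.numberOfReplicas: 2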
Disk Space Issues¶
Node Running Out of Space¶
Symptoms:
- Pods evicted
- "No space left on device" errors
- Longhorn won't create new replicas
Diagnosis:
# Check node disk usage
kubectl get nodes -o json | jq -r '.items[] | "\(.metadata.name): \(.status.allocatable["ephemeral-storage"])"'
# SSH to node and check
ssh <node>
df -h
du -sh /var/lib/longhorn/* | sort -h
Fixes:
- Clean up old data
- Delete old Longhorn backups via the Longhorn UI
- Remove unused PVs, as shown below
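A sketch of the PV cleanup; Released PVs still hold their backing storage, so confirm the data is disposable first:
# List PVs whose claims are gone but whose storage is still held
kubectl get pv | grep Released
kubectl delete pv <pv-name>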
Longhorn Exceeding Disk Reservation¶
Error: "Disk usage exceeded the threshold"
Fix:
# Via Longhorn UI: Node → <node> → Edit → Adjust Storage Reserved
# Or via kubectl
kubectl edit node.longhorn.io <node-name> -n longhorn-system
# Adjust: spec.disks.default.storageReserved
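The same change can be made non-interactively with a patch; this sketch assumes the disk is named default and takes the reservation in bytes:
# Reserve 10 GiB on the node's default disk
kubectl patch nodes.longhorn.io <node-name> -n longhorn-system --type merge \
  -p '{"spec":{"disks":{"default":{"storageReserved":10737418240}}}}'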
Backup Issues¶
Backup Failing¶
Symptoms: Longhorn backup shows failed status
Diagnosis:
# Check Longhorn backup target settings
kubectl get settings.longhorn.io backup-target -n longhorn-system -o yaml
# Check MinIO/S3 connectivity (supply credentials; MinIO also needs --endpoint-url)
kubectl run -it --rm aws-cli --image=amazon/aws-cli --restart=Never \
  --env AWS_ACCESS_KEY_ID=<key> --env AWS_SECRET_ACCESS_KEY=<secret> \
  -- s3 ls s3://<bucket-name>/ --endpoint-url <backup-target-endpoint>
Common Causes:
- Invalid backup target URL: Verify S3/NFS URL
- Network connectivity: Can't reach backup target
- Insufficient permissions: S3 credentials lack permissions
Fix:
# Update backup target
kubectl edit settings.longhorn.io backup-target -n longhorn-system
# Update backup credentials (the secret named in the backup-target-credential-secret setting)
kubectl edit secret <backup-credential-secret> -n longhorn-system
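For reference, Longhorn expects its S3 backup target in bucket@region form; a hypothetical example:
# Example backup-target value
s3://backup-bucket@us-east-1/longhorn/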
Restore Failing¶
Symptoms: Restore from backup fails
Diagnosis:
# Check volume status
kubectl get volumes.longhorn.io -A
# Check Longhorn manager logs
kubectl logs -n longhorn-system -l app=longhorn-manager --tail=200
Fix: Delete failed restore and retry via Longhorn UI
Migrating Data Between Volumes¶
Use a Job to copy data:
apiVersion: batch/v1
kind: Job
metadata:
  name: volume-migration
  namespace: <namespace>
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: migration
        image: ubuntu:latest
        # cp -a preserves ownership/permissions and copies dotfiles too
        command: ["/bin/sh", "-c", "cp -a /old/. /new/"]
        volumeMounts:
        - name: old-vol
          mountPath: /old
        - name: new-vol
          mountPath: /new
      volumes:
      - name: old-vol
        persistentVolumeClaim:
          claimName: old-pvc
      - name: new-vol
        persistentVolumeClaim:
          claimName: new-pvc
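To run the migration and wait for it to finish (names taken from the manifest above):
kubectl apply -f volume-migration.yaml
kubectl wait --for=condition=complete job/volume-migration -n <namespace> --timeout=30m
# Verify the copied data, then point the workload at new-pvc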
Quick Reference¶
# Longhorn status
kubectl get volumes.longhorn.io -A
kubectl describe volume <volume> -n longhorn-system
# Access Longhorn UI
kubectl port-forward -n longhorn-system svc/longhorn-frontend 8080:80
# Check all PVs
kubectl get pv,pvc -A
# Clean up Released PVs (destructive - confirm the data is no longer needed)
kubectl get pv | grep Released | awk '{print $1}' | xargs kubectl delete pv