Common Issues
This page covers the most common issues you'll encounter in day-to-day operations and their quick fixes.
Quick Diagnostic Commands
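A first-pass triage could look like the sketch below. The helper name `cluster_triage` and the choice of checks are assumptions made for illustration; it uses only standard kubectl subcommands.

```shell
# Hypothetical helper (name is an assumption): one-shot overview of cluster health.
cluster_triage() {
  # Node readiness, versions, and addresses
  kubectl get nodes -o wide
  # Pods in any namespace that are not in the Running phase
  # (this also lists Succeeded pods left behind by completed Jobs)
  kubectl get pods -A --field-selector 'status.phase!=Running'
  # Cluster events, most recent last
  kubectl get events -A --sort-by=.lastTimestamp
}
```

Run `cluster_triage` after sourcing the snippet, then drill into whichever section below matches what it shows.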
Pod Issues
Pod Stuck in Pending
Symptoms: Pod shows Pending status for an extended period
Common Causes:
- Insufficient Resources
Fix:
- Reduce resource requests in deployment
- Add more nodes
- Remove resource limits if too restrictive
- No Available PV for PVC
Fix:
- Check if Longhorn is healthy: kubectl get pods -n longhorn-system
- Ensure nodes have available storage
- Check StorageClass exists: kubectl get sc
- Node Selector Not Matching
Fix: Update deployment to remove or fix node selector
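To tell the three causes above apart, it may help to collect the evidence for each in one pass. The following is a minimal sketch; the helper name `pending_report` is hypothetical, and the 8 lines of context after "Allocated resources" is an arbitrary choice.

```shell
# Hypothetical helper (name is an assumption): gather the usual evidence for a Pending pod.
# Usage: pending_report <pod-name> <namespace>
pending_report() {
  pod=$1; ns=$2
  # Scheduler events say why no node was chosen (for example "Insufficient cpu")
  kubectl get events -n "$ns" --field-selector "involvedObject.name=$pod" --sort-by=.lastTimestamp
  # Requests and limits already committed on each node
  kubectl describe nodes | grep -A 8 'Allocated resources'
  # Node selector the pod asks for, to compare against the node labels below
  kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.nodeSelector}{"\n"}'
  kubectl get nodes --show-labels
  # Claims in the namespace and the StorageClasses they can bind to
  kubectl get pvc -n "$ns"
  kubectl get sc
}
```

A `FailedScheduling` event usually names the cause directly, so the first command is often enough on its own.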
Pod CrashLoopBackOff
Symptoms: Pod continuously restarting
Diagnosis:
# Check pod logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous # logs from crashed container
# Check events
kubectl describe pod <pod-name> -n <namespace>
Common Causes:
- Application Error: Check logs for error messages
- Missing ConfigMap/Secret: Confirm that every ConfigMap and Secret the pod references exists in its namespace
- Liveness Probe Failing: Adjust probe timing or fix health endpoint
- Init Container Failing: Check init container logs
Fix: Depends on root cause, but often:
- Fix application configuration
- Ensure required secrets exist
- Adjust probe configuration
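The causes listed above can be narrowed down with a few extra queries. This is a sketch under the assumption that the pod still exists; `crash_report` is a hypothetical helper name.

```shell
# Hypothetical helper (name is an assumption).
# Usage: crash_report <pod-name> <namespace>
crash_report() {
  pod=$1; ns=$2
  # Why each container last exited (for example Error or OOMKilled) and its exit code
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
  # Logs from every init container, in order
  for c in $(kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.initContainers[*].name}'); do
    echo "--- init container: $c"
    kubectl logs "$pod" -n "$ns" -c "$c"
  done
  # ConfigMaps and Secrets present in the namespace, to compare with what the pod references
  kubectl get configmap,secret -n "$ns"
}
```

An `OOMKilled` reason points at memory limits rather than at the application or its probes.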
Pod Stuck in Terminating
Symptoms: Pod won't delete, stuck in Terminating state
Diagnosis:
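Two things commonly hold a pod in Terminating: a finalizer that has not been cleared, or a node that is unreachable and cannot confirm the container stopped. A minimal sketch that checks both, with `terminating_report` as a hypothetical helper name:

```shell
# Hypothetical helper (name is an assumption).
# Usage: terminating_report <pod-name> <namespace>
terminating_report() {
  pod=$1; ns=$2
  # When deletion was requested, and which finalizers still block it
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.metadata.deletionTimestamp}{"\t"}{.metadata.finalizers}{"\n"}'
  # Status of the node the pod is scheduled on
  node=$(kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.nodeName}')
  kubectl get node "$node"
}
```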
Quick Fix (use cautiously):
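One possible form of the quick fix is sketched below. Both helper names are hypothetical. Force deletion removes the API object without waiting for the kubelet, so the container may keep running on the node; clearing finalizers skips whatever cleanup they were guarding.

```shell
# Hypothetical helpers (names are assumptions).

# Remove the pod object immediately, skipping graceful shutdown.
# Usage: force_delete_pod <pod-name> <namespace>
force_delete_pod() {
  kubectl delete pod "$1" -n "$2" --grace-period=0 --force
}

# Drop all finalizers from the pod so a pending deletion can complete.
# Usage: clear_pod_finalizers <pod-name> <namespace>
clear_pod_finalizers() {
  kubectl patch pod "$1" -n "$2" --type merge -p '{"metadata":{"finalizers":null}}'
}
```

For pods backed by volumes, confirm the workload has really stopped before forcing deletion, to avoid two writers on the same volume.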
Storage Issues
PVC Stuck in Pending
Symptoms: PersistentVolumeClaim won't bind
Diagnosis:
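A sketch of one way to diagnose an unbound claim: read its events, then check that the StorageClass it names exists. `pvc_report` is a hypothetical helper name.

```shell
# Hypothetical helper (name is an assumption).
# Usage: pvc_report <pvc-name> <namespace>
pvc_report() {
  pvc=$1; ns=$2
  # Events at the bottom of the output usually name the provisioning error
  kubectl describe pvc "$pvc" -n "$ns"
  # StorageClass the claim requests
  sc=$(kubectl get pvc "$pvc" -n "$ns" -o jsonpath='{.spec.storageClassName}')
  if [ -z "$sc" ]; then
    echo "PVC $pvc sets no storageClassName; the cluster default (if any) applies"
  elif kubectl get sc "$sc" >/dev/null 2>&1; then
    echo "StorageClass $sc exists"
  else
    echo "StorageClass $sc not found"
  fi
}
```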
Common Causes:
- No Available Storage
  - Check Longhorn dashboard for capacity
  - Check disk space on the nodes
- StorageClass Not Found
Fix: Ensure StorageClass exists and matches PVC
Longhorn Volume Degraded
Symptoms: Longhorn dashboard shows degraded volumes
Diagnosis:
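Longhorn exposes its state as custom resources, so the dashboard view can also be read with kubectl. The sketch below assumes Longhorn is installed in the `longhorn-system` namespace, as elsewhere on this page; `longhorn_report` is a hypothetical helper name.

```shell
# Hypothetical helper (name is an assumption); assumes the longhorn-system namespace.
longhorn_report() {
  # Volume state and robustness (healthy, degraded, faulted)
  kubectl get volumes.longhorn.io -n longhorn-system
  # Replicas and the node each one lives on
  kubectl get replicas.longhorn.io -n longhorn-system
  # Longhorn's own view of each node, including whether it is schedulable
  kubectl get nodes.longhorn.io -n longhorn-system
  # Kubernetes node readiness
  kubectl get nodes
}
```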
Common Causes:
- Node Down: Volume replica on failed node
- Disk Full: Node disk at capacity
- Network Issues: Replicas can't sync
Network Issues
Service Not Accessible
Symptoms: Can't reach service via ClusterIP or external IP
Diagnosis:
# Check service exists
kubectl get svc <service-name> -n <namespace>
# Check endpoints (pods backing the service)
kubectl get endpoints <service-name> -n <namespace>
Common Causes:
- No Pods Ready: Service has no endpoints
- Label Mismatch: Pod labels don't match service selector
- Wrong Port: Service port doesn't match container port
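Label mismatches and wrong ports can both be checked by reading the Service spec and querying pods with the Service's own selector. A minimal sketch, with `service_backends` as a hypothetical helper name:

```shell
# Hypothetical helper (name is an assumption).
# Usage: service_backends <service-name> <namespace>
service_backends() {
  svc=$1; ns=$2
  # Selector as comma-separated key=value pairs, the form `kubectl get -l` accepts
  sel=$(kubectl get svc "$svc" -n "$ns" \
    -o go-template='{{range $k, $v := .spec.selector}}{{$k}}={{$v}},{{end}}')
  sel=${sel%,}   # drop the trailing comma
  if [ -z "$sel" ]; then
    echo "service $svc has no selector"
    return 1
  fi
  echo "selector: $sel"
  # Pods matching that selector; an empty list means a label mismatch
  kubectl get pods -n "$ns" -l "$sel"
  # Service port -> targetPort, to compare with the container's port
  kubectl get svc "$svc" -n "$ns" \
    -o jsonpath='{range .spec.ports[*]}{.port}{" -> "}{.targetPort}{"\n"}{end}'
}
```

If the pod list is non-empty but the endpoints are still empty, the pods are most likely failing their readiness probes.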
Ingress Not Working
Symptoms: Can't access service via domain name
Diagnosis:
# Check Traefik is running
kubectl get pods -n traefik
# Check IngressRoute
kubectl get ingressroute -n <namespace>
kubectl describe ingressroute <name> -n <namespace>
Common Causes:
- DNS Not Resolving: Domain doesn't point to MetalLB IP
- IngressRoute Misconfigured: Wrong service name or port
- Traefik Not Healthy: Controller pods crashed
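The DNS cause above can be checked by comparing what the hostname resolves to with the address MetalLB assigned to Traefik. This sketch assumes the Traefik Service is named `traefik` in the `traefik` namespace (the Service name is a guess, so pass the real one as the second argument) and that `dig` is installed; `check_ingress_dns` is a hypothetical helper name.

```shell
# Hypothetical helper (name is an assumption).
# Usage: check_ingress_dns <hostname> [traefik-service-name]
check_ingress_dns() {
  host=$1; svc=${2:-traefik}   # default Service name is an assumption
  # External IP assigned to the Traefik Service
  lb_ip=$(kubectl get svc "$svc" -n traefik -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
  # First IPv4 address the hostname resolves to (CNAME lines are skipped)
  dns_ip=$(dig +short "$host" A | grep -E '^[0-9]+(\.[0-9]+){3}$' | head -n 1)
  if [ -n "$lb_ip" ] && [ "$dns_ip" = "$lb_ip" ]; then
    echo "OK: $host resolves to $lb_ip"
  else
    echo "MISMATCH: $host resolves to '$dns_ip', load balancer IP is '$lb_ip'"
  fi
}
```

A mismatch is expected if the domain sits behind a proxy or NAT in front of the cluster, so treat the result as a hint rather than a verdict.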
Certificate Issues
Certificate Not Issued
Symptoms: Certificate shows READY as False, or stays in Pending status
Diagnosis:
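Assuming certificates here are managed by cert-manager (the page does not say so explicitly, but the Certificate, Issuer, and ACME terms suggest it), issuance can be traced through the resources it creates. `cert_report` is a hypothetical helper name.

```shell
# Hypothetical helper (name is an assumption); assumes cert-manager CRDs are installed.
# Usage: cert_report <certificate-name> <namespace>
cert_report() {
  cert=$1; ns=$2
  # Conditions and events on the Certificate itself
  kubectl describe certificate "$cert" -n "$ns"
  # The chain created during issuance: request, then ACME order, then challenge
  kubectl get certificaterequests,orders,challenges -n "$ns"
  # Issuers the Certificate could reference
  kubectl get issuers -n "$ns"
  kubectl get clusterissuers
}
```

A challenge that stays pending points at DNS validation; a missing issuer shows up in the Certificate's events.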
Common Causes:
- DNS Validation Failing: ACME DNS-01 challenge can't complete
- Rate Limit: Let's Encrypt rate limits hit
- Wrong Issuer: Certificate references non-existent issuer
ArgoCD Issues
Application Out of Sync
Symptoms: ArgoCD shows application as "OutOfSync"
Common Causes:
- Manual Change: Someone ran kubectl apply manually
- Ignored Differences: Resource has expected drift
- Git Repo Not Accessible: ArgoCD can't fetch from Git
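The Application resource records which objects have drifted and any errors from fetching the repository. The sketch below reads them with kubectl, using the `argocd-system` namespace from the Quick Reference; the helper names are hypothetical.

```shell
# Hypothetical helpers (names are assumptions); namespace taken from this page.

# Sync and health status of one Application.
# Usage: app_sync_status <application-name>
app_sync_status() {
  kubectl get applications.argoproj.io "$1" -n argocd-system \
    -o jsonpath='{.status.sync.status}{"\t"}{.status.health.status}{"\n"}'
}

# Resources of the Application that are currently OutOfSync.
# Usage: app_out_of_sync_resources <application-name>
app_out_of_sync_resources() {
  kubectl get applications.argoproj.io "$1" -n argocd-system \
    -o jsonpath='{range .status.resources[?(@.status=="OutOfSync")]}{.kind}{"/"}{.name}{"\n"}{end}'
}

# Conditions such as comparison errors when the Git repo cannot be fetched.
# Usage: app_conditions <application-name>
app_conditions() {
  kubectl get applications.argoproj.io "$1" -n argocd-system \
    -o jsonpath='{range .status.conditions[*]}{.type}{": "}{.message}{"\n"}{end}'
}
```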
Quick Reference
# Pod status
kubectl get pods -n <namespace>
kubectl describe pod <pod> -n <namespace>
kubectl logs <pod> -n <namespace>
# Storage
kubectl get pv,pvc -A
kubectl get volumes.longhorn.io -A
# Network
kubectl get svc,endpoints -A
kubectl get ingressroute -A
# Certificates
kubectl get certificate -A
# ArgoCD
kubectl get applications -n argocd-system