Common Issues

This page covers the most common issues you'll encounter in day-to-day operations and their quick fixes.

Quick Diagnostic Commands

# Check cluster health
kubectl get nodes
kubectl get pods -A | grep -v Running

# Check ArgoCD sync status
kubectl get applications -n argocd-system

# Check certificates
kubectl get certificate -A

# Check storage
kubectl get pv,pvc -A

Pod Issues

Pod Stuck in Pending

Symptoms: Pod shows Pending status for an extended period

Common Causes:

Insufficient Resources

kubectl describe pod <pod-name> -n <namespace>
# Look for: "Insufficient cpu" or "Insufficient memory"

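Fix:
- Reduce resource requests in the deployment
- Add more nodes
- Lower resource limits that have no matching requests (a container with limits but no requests is scheduled as if its requests equal its limits)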
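To see why the scheduler rejected a pod and how much capacity each node offers, the checks above can be wrapped in two small helpers. This is a minimal sketch: the function names `scheduling_events` and `node_allocatable` are hypothetical, and it assumes read access to events and nodes.

```shell
# Hypothetical helpers; the names are illustrative, not part of any existing tooling.

# Show FailedScheduling events for one pod.
# Usage: scheduling_events <pod-name> <namespace>
scheduling_events() {
  kubectl get events -n "$2" \
    --field-selector "involvedObject.name=$1,reason=FailedScheduling"
}

# Show allocatable CPU and memory per node.
# Usage: node_allocatable
node_allocatable() {
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory'
}
```

Comparing the pod's requests against the allocatable values shows whether any single node could ever fit the pod.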

No Available PV for PVC

kubectl describe pvc <pvc-name> -n <namespace>
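# Look for: "ProvisioningFailed" or "no persistent volumes available"
# Note: "waiting for first consumer" is expected with WaitForFirstConsumer binding and clears once the pod is scheduled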

Fix:
- Check if Longhorn is healthy: kubectl get pods -n longhorn-system
- Ensure nodes have available storage
- Check that the StorageClass exists: kubectl get sc

Node Selector Not Matching

kubectl describe pod <pod-name> -n <namespace>
# Look for: "didn't match node selector"

Fix: Update the deployment to remove or correct the node selector, or label a node so that it matches
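A quick way to confirm a selector mismatch is to print the pod's selector and then ask which nodes carry that label. The sketch below assumes hypothetical helper names (`pod_node_selector`, `nodes_with_label`); the kubectl flags themselves are standard.

```shell
# Hypothetical helpers for comparing a pod's nodeSelector with node labels.

# Print the nodeSelector map of a pod.
# Usage: pod_node_selector <pod-name> <namespace>
pod_node_selector() {
  kubectl get pod "$1" -n "$2" -o jsonpath='{.spec.nodeSelector}'
}

# List nodes carrying a given label; no rows means nothing can match.
# Usage: nodes_with_label <key=value>
nodes_with_label() {
  kubectl get nodes -l "$1"
}
```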

Pod CrashLoopBackOff

Symptoms: Pod continuously restarting

Diagnosis:

# Check pod logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous  # logs from crashed container

# Check events
kubectl describe pod <pod-name> -n <namespace>

Common Causes:

Application Error: Check logs for error messages

Missing ConfigMap/Secret:

kubectl get configmap,secret -n <namespace>

Liveness Probe Failing: Adjust probe timing or fix health endpoint
Init Container Failing: Check init container logs

Fix: Depends on the root cause, but often:
- Fix the application configuration
- Ensure required secrets exist
- Adjust the probe configuration
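When the logs are inconclusive, the last termination state often narrows the cause: a reason of OOMKilled, for example, points at the memory limit rather than the application. The helpers below are a sketch with hypothetical names (`last_termination`, `init_logs`).

```shell
# Hypothetical helpers for inspecting a crash-looping pod.

# Print each container's name, last termination reason, and exit code.
# Usage: last_termination <pod-name> <namespace>
last_termination() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# Print the logs of one init container.
# Usage: init_logs <pod-name> <namespace> <init-container-name>
init_logs() {
  kubectl logs "$1" -n "$2" -c "$3"
}
```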

Pod Stuck in Terminating

Symptoms: Pod won't delete, stuck in Terminating state

Diagnosis:

kubectl describe pod <pod-name> -n <namespace>
# Look for finalizers or stuck processes

Quick Fix (use cautiously):

# Force delete (last resort): removes the pod from the API without confirming its containers have stopped
kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force
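Before forcing deletion, it can help to check whether a finalizer is holding the pod and which node it runs on, since a pod on an unreachable node also stays in Terminating. A minimal sketch, with hypothetical helper names:

```shell
# Hypothetical helpers for a pod stuck in Terminating.

# Print the deletion timestamp and any finalizers still set on the pod.
# Usage: pod_finalizers <pod-name> <namespace>
pod_finalizers() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'
}

# Print the node the pod is assigned to, so its status can be checked.
# Usage: pod_node <pod-name> <namespace>
pod_node() {
  kubectl get pod "$1" -n "$2" -o jsonpath='{.spec.nodeName}'
}
```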

Storage Issues

PVC Stuck in Pending

Symptoms: PersistentVolumeClaim won't bind

Diagnosis:

kubectl describe pvc <pvc-name> -n <namespace>
kubectl get storageclass

Common Causes:

No Available Storage
- Check the Longhorn dashboard for capacity
- Check free disk space on the nodes

StorageClass Not Found
```
kubectl get sc
```
Fix: Ensure the StorageClass exists and matches the PVC's storageClassName
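The StorageClass comparison can be scripted so that a missing class shows up as a failing exit code. This is a sketch under the assumption that the PVC names its class in `spec.storageClassName`; the function names are hypothetical.

```shell
# Hypothetical helpers for checking a PVC's StorageClass.

# Print the StorageClass name requested by a PVC.
# Usage: pvc_storage_class <pvc-name> <namespace>
pvc_storage_class() {
  kubectl get pvc "$1" -n "$2" -o jsonpath='{.spec.storageClassName}'
}

# Succeed only if the PVC names a StorageClass that exists in the cluster.
# Usage: pvc_class_exists <pvc-name> <namespace>
pvc_class_exists() {
  class=$(pvc_storage_class "$1" "$2")
  # An empty value means the PVC has no StorageClass set; treat that as a failure.
  [ -n "$class" ] && kubectl get storageclass "$class" >/dev/null 2>&1
}
```

For example, `pvc_class_exists <pvc-name> <namespace> || echo "StorageClass missing"` prints a message only when the lookup fails.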

Longhorn Volume Degraded

Symptoms: Longhorn dashboard shows degraded volumes

Diagnosis:

kubectl get volumes.longhorn.io -A
kubectl describe volumes.longhorn.io <volume-name> -n longhorn-system

Common Causes:

Node Down: Volume replica on failed node
Disk Full: Node disk at capacity
Network Issues: Replicas can't sync
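To narrow down which of these causes applies, volume robustness and Longhorn's view of the nodes can be listed side by side. The sketch below assumes the Longhorn Volume resource exposes `.status.state` and `.status.robustness` and that the `nodes.longhorn.io` resource is installed; verify both against the Longhorn version in use. The function names are hypothetical.

```shell
# Hypothetical helpers; field paths are assumptions about the Longhorn CRDs.

# List each volume with its state and robustness (for example healthy or degraded).
# Usage: longhorn_volume_health
longhorn_volume_health() {
  kubectl get volumes.longhorn.io -n longhorn-system \
    -o custom-columns='NAME:.metadata.name,STATE:.status.state,ROBUSTNESS:.status.robustness'
}

# List Longhorn's own node objects to spot nodes it considers unavailable.
# Usage: longhorn_nodes
longhorn_nodes() {
  kubectl get nodes.longhorn.io -n longhorn-system
}
```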

Network Issues

Service Not Accessible

Symptoms: Can't reach service via ClusterIP or external IP

Diagnosis:

# Check service exists
kubectl get svc <service-name> -n <namespace>

# Check endpoints (pods backing the service)
kubectl get endpoints <service-name> -n <namespace>

Common Causes:

No Pods Ready: Service has no endpoints
Label Mismatch: Pod labels don't match service selector
Wrong Port: Service port doesn't match container port
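Each of these causes can be checked directly: print the selector, list the pods it matches, and compare the service's port mapping with the container port. A minimal sketch with hypothetical helper names:

```shell
# Hypothetical helpers for debugging a Service with no working endpoints.

# Print the label selector of a service.
# Usage: service_selector <service-name> <namespace>
service_selector() {
  kubectl get svc "$1" -n "$2" -o jsonpath='{.spec.selector}'
}

# List pods matching a selector, with their labels; no rows means a label mismatch.
# Usage: pods_for_selector <namespace> <key=value[,key=value...]>
pods_for_selector() {
  kubectl get pods -n "$1" -l "$2" --show-labels
}

# Print each service port alongside the targetPort it forwards to.
# Usage: service_ports <service-name> <namespace>
service_ports() {
  kubectl get svc "$1" -n "$2" \
    -o jsonpath='{range .spec.ports[*]}{.port}{" -> "}{.targetPort}{"\n"}{end}'
}
```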

Ingress Not Working

Symptoms: Can't access service via domain name

Diagnosis:

# Check Traefik is running
kubectl get pods -n traefik

# Check IngressRoute
kubectl get ingressroute -n <namespace>
kubectl describe ingressroute <name> -n <namespace>

Common Causes:

DNS Not Resolving: Domain doesn't point to MetalLB IP
IngressRoute Misconfigured: Wrong service name or port
Traefik Not Healthy: Controller pods crashed
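The DNS cause can be verified by comparing what the domain resolves to with the address MetalLB assigned to Traefik. This sketch assumes Traefik's LoadBalancer Service lives in the `traefik` namespace (as in the diagnosis commands above) and that `dig` is installed; the function names are hypothetical.

```shell
# Hypothetical helpers for checking that DNS points at the ingress address.

# Print the LoadBalancer IPs of all services in the traefik namespace.
# Usage: traefik_lb_ips
traefik_lb_ips() {
  kubectl get svc -n traefik \
    -o jsonpath='{range .items[*]}{.status.loadBalancer.ingress[*].ip}{"\n"}{end}'
}

# Succeed if the domain resolves to the given IP address.
# Usage: domain_points_to <domain> <ip>
domain_points_to() {
  dig +short "$1" | grep -qxF "$2"
}
```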

Certificate Issues

Certificate Not Issued

Symptoms: Certificate shows READY False, or its CertificateRequest stays Pending

Diagnosis:

kubectl get certificate -n <namespace>
kubectl describe certificate <cert-name> -n <namespace>

Common Causes:

DNS Validation Failing: ACME DNS-01 challenge can't complete
Rate Limit: Let's Encrypt rate limits hit
Wrong Issuer: Certificate references non-existent issuer
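cert-manager records progress on a chain of resources (Certificate, CertificateRequest, Order, Challenge), so listing them together usually shows where issuance stopped. The helpers below are a sketch with hypothetical names; they assume cert-manager's ACME resources are installed.

```shell
# Hypothetical helpers for tracing a certificate that is not being issued.

# List the whole issuance chain in one namespace.
# Usage: cert_chain <namespace>
cert_chain() {
  kubectl get certificates.cert-manager.io,certificaterequests.cert-manager.io,orders.acme.cert-manager.io,challenges.acme.cert-manager.io -n "$1"
}

# Print the issuer a certificate references as kind/name.
# An empty kind means the default, a namespaced Issuer.
# Usage: cert_issuer <cert-name> <namespace>
cert_issuer() {
  kubectl get certificates.cert-manager.io "$1" -n "$2" \
    -o jsonpath='{.spec.issuerRef.kind}{"/"}{.spec.issuerRef.name}{"\n"}'
}

# List all issuers so the reference can be checked against what exists.
# Usage: list_issuers
list_issuers() {
  kubectl get clusterissuers.cert-manager.io,issuers.cert-manager.io -A
}
```

Describing a stuck Challenge typically surfaces the DNS-01 or rate-limit error message.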

ArgoCD Issues

Application Out of Sync

Symptoms: ArgoCD shows application as "OutOfSync"

Common Causes:

Manual Change: Someone ran kubectl apply manually
Ignored Differences: Resource has expected drift that is not covered by the application's ignoreDifferences setting
Git Repo Not Accessible: ArgoCD can't fetch from Git
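The Application resource itself records the sync state and any error conditions, which helps separate real drift from a repository access failure. A minimal sketch, assuming applications live in the `argocd-system` namespace used elsewhere on this page; the function names are hypothetical.

```shell
# Hypothetical helpers for inspecting an ArgoCD Application.

# Print the sync status and health status of an application.
# Usage: app_sync_status <app-name>
app_sync_status() {
  kubectl get applications.argoproj.io "$1" -n argocd-system \
    -o jsonpath='{.status.sync.status}{"\t"}{.status.health.status}{"\n"}'
}

# Print condition types and messages; failed Git fetches usually
# appear here as a ComparisonError.
# Usage: app_conditions <app-name>
app_conditions() {
  kubectl get applications.argoproj.io "$1" -n argocd-system \
    -o jsonpath='{range .status.conditions[*]}{.type}{": "}{.message}{"\n"}{end}'
}
```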

Quick Reference

# Pod status
kubectl get pods -n <namespace>
kubectl describe pod <pod> -n <namespace>
kubectl logs <pod> -n <namespace>

# Storage
kubectl get pv,pvc -A
kubectl get volumes.longhorn.io -A

# Network
kubectl get svc,endpoints -A
kubectl get ingressroute -A

# Certificates
kubectl get certificate -A

# ArgoCD
kubectl get applications -n argocd-system
