Disaster Recovery¶
This guide covers how to recover from various disaster scenarios, from single component failures to complete cluster loss.
Before Disaster Strikes - Critical Preparation¶
Ensure you have:
- Break-glass kubeconfig saved externally (from `/etc/rancher/k3s/k3s.yaml`)
- Vault unseal keys stored securely outside the cluster
- Git repository backed up (GitHub provides this)
- Longhorn backups enabled to TrueNAS
- PostgreSQL backups configured
- Ansible inventory and vault passwords backed up
Quick Recovery Decision Tree¶
```mermaid
graph TD
    A[Problem Detected] --> B{What Failed?}
    B -->|Single Worker| C[15-60 min: Automatic rescheduling]
    B -->|Control Plane Node| D[30-60 min: Replace node]
    B -->|Storage/Longhorn| E[1-4 hrs: Restore from backup]
    B -->|Vault Sealed| F[15 min: Unseal with keys]
    B -->|Complete Cluster| G[4-8 hrs: Full rebuild]
    B -->|Network/Traefik| H[15-30 min: Restart components]
```
Recovery Scenarios¶
Scenario 1: Single Worker Node Failure¶
Impact: Minimal - workloads reschedule automatically
Recovery Steps:
1. Verify node is down:
2. Check pod redistribution:
3. Fix or replace node:
    - If hardware failure: Replace hardware
    - If software issue: SSH to node and investigate
    - If unrecoverable: Remove node and add a new one
4. Remove dead node (if needed):
5. Add replacement node:
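The steps above can be sketched with kubectl, using the break-glass kubeconfig; `<node-name>` is a placeholder for the failed node:

```shell
# Step 1: check node status (NotReady indicates failure)
kubectl get nodes

# Step 2: watch workloads reschedule off the dead node
kubectl get pods -A -o wide

# Steps 4-5: if the node is unrecoverable, drain and remove it,
# then join a replacement node via the usual provisioning process
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --force
kubectl delete node <node-name>
```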
Recovery Time: 15-60 minutes
Scenario 2: Complete Cluster Loss¶
Impact: Catastrophic - everything down
This is why we use GitOps
With GitOps, rebuilding is straightforward because all configuration is in Git.
Phase 1: Rebuild Cluster (2-4 hours)¶
1. Rebuild nodes with Ansible:
2. Verify cluster is up:
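A minimal sketch of this phase; the playbook and inventory paths are assumptions for this repository, so adjust to match the backed-up Ansible inventory:

```shell
# Re-run the cluster playbook against the backed-up inventory
# (paths below are examples, not the repo's actual layout)
ansible-playbook -i inventory/hosts.yml site.yml --ask-vault-pass

# Using the break-glass kubeconfig, confirm every node is Ready
export KUBECONFIG=~/k3s-breakglass.yaml
kubectl get nodes
```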
Phase 2: Bootstrap ArgoCD (30 minutes)¶
1. Install ArgoCD manually (the only manual step):
2. Deploy root ApplicationSet:
3. Watch ArgoCD sync everything:
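These steps can be sketched as follows; the ApplicationSet manifest path is an assumption, the install manifest URL is ArgoCD's official stable release:

```shell
# Install ArgoCD from the upstream stable manifest
kubectl create namespace argocd
kubectl apply -n argocd \
  -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

# Apply the root ApplicationSet from the Git repo
# (the path below is an example, not the repo's actual layout)
kubectl apply -n argocd -f apps/root-appset.yaml

# Watch applications appear and sync
kubectl get applications -n argocd -w
```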
Phase 3: Restore Data (1-4 hours)¶
1. Restore Vault:
    - Restore the Longhorn volume from backup
    - Unseal Vault with saved keys
    - Or manually re-enter secrets
2. Restore Longhorn volumes:
    - Longhorn UI → Backup tab
    - Restore each critical volume
    - Update PVCs to use restored volumes
3. Restore PostgreSQL databases:
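A sketch of the PostgreSQL restore; the namespace, pod name, database name, and dump path are all assumptions, so substitute your actual values:

```shell
# Copy the latest dump from backup storage into the database pod
# (names and paths below are examples)
kubectl cp ./backup.dump databases/postgres-0:/tmp/backup.dump

# Restore it, dropping existing objects first
kubectl exec -n databases postgres-0 -- \
  pg_restore -U postgres -d appdb --clean /tmp/backup.dump
```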
Total Recovery Time: 4-8 hours
Scenario 3: Vault Sealed/Lost¶
Impact: Severe - secrets unavailable, pods using secrets fail
Recovery Steps:
1. Check Vault status:
2. Unseal Vault (you need the unseal keys):
3. If Vault data is lost completely:
    - With backup: Restore Vault data from backup
    - Without backup: Re-initialize Vault and re-enter all secrets (painful)
4. Restart pods using secrets:
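The unseal flow can be sketched as below; the `vault` namespace and `vault-0` pod name are assumptions based on a typical Helm install, and the consuming deployment is an example:

```shell
# Check seal status from inside the Vault pod
kubectl exec -n vault vault-0 -- vault status

# Unseal: repeat with different keys until the threshold is met
# (commonly 3 of 5)
kubectl exec -n vault vault-0 -- vault operator unseal <unseal-key>

# Restart workloads that consume Vault secrets
# (namespace/deployment names are examples)
kubectl rollout restart deployment/my-app -n my-app
```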
Recovery Time: 15 minutes (unseal), 2-8 hours (complete rebuild)
Scenario 4: Storage Loss (Longhorn)¶
Impact: Severe - all persistent data lost if no backups
Recovery Steps:
1. Check Longhorn status:
2. If volumes are corrupted or lost:
    - With backups: Restore from backup
    - Without backups: Data is lost - rebuild from scratch
3. Restore from Longhorn backup:
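A sketch of the health checks, assuming Longhorn runs in its default `longhorn-system` namespace:

```shell
# Check Longhorn components and volume health
kubectl get pods -n longhorn-system
kubectl get volumes.longhorn.io -n longhorn-system

# Restores are driven from the Longhorn UI (Backup tab); alternatively,
# a Volume resource can be created with spec.fromBackup pointing at the
# backup URL on the TrueNAS target.
```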
Recovery Time: 1-4 hours (depending on data size)
Scenario 5: Certificate Manager Down¶
Impact: Moderate - existing certificates work, but renewals fail
Recovery Steps:
1. Check cert-manager pods:
2. Restart cert-manager:
3. Force certificate renewal:
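These steps can be sketched as follows; the TLS secret name and its namespace are examples. Deleting a certificate's secret causes cert-manager to reissue it:

```shell
# Check cert-manager health
kubectl get pods -n cert-manager

# Restart all cert-manager deployments (controller, webhook, cainjector)
kubectl rollout restart deployment -n cert-manager

# Force renewal: delete the certificate's secret so it is reissued
# (secret/namespace names are examples)
kubectl delete secret my-app-tls -n my-app
```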
Recovery Time: 15 minutes
Scenario 6: Network Failure (Traefik/MetalLB Down)¶
Impact: Severe - can't access any services externally
Recovery Steps:
1. Check MetalLB:
2. Check Traefik:
3. Restart components:
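A sketch of the checks and restarts, assuming default namespaces (`metallb-system` for MetalLB; Traefik's namespace varies by install, so adjust as needed):

```shell
# MetalLB: controller and speakers should be Running,
# and LoadBalancer services should have external IPs
kubectl get pods -n metallb-system
kubectl get svc -A | grep LoadBalancer

# Traefik: check and restart (namespace is an assumption)
kubectl get pods -n traefik
kubectl rollout restart deployment/traefik -n traefik

# Restart MetalLB components
kubectl rollout restart deployment/controller -n metallb-system
kubectl rollout restart daemonset/speaker -n metallb-system
```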
Recovery Time: 15-30 minutes
Backup Verification¶
Test Backups Regularly¶
Monthly Task:
1. Verify Longhorn backups:
    - Longhorn UI → Backup tab
    - Check the last backup timestamp
    - Try restoring a test volume
2. Verify PostgreSQL backups:
3. Test the disaster recovery procedure (in a test cluster):
    - Destroy the test cluster
    - Rebuild from scratch
    - Restore data
    - Document any issues
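One way to sketch the PostgreSQL backup check; the backup mount path is an assumption:

```shell
# Confirm recent dumps exist and are not stale (path is an example)
ls -lh /mnt/backups/postgres/ | tail -n 3

# Verify a dump is readable without actually restoring it
pg_restore --list /mnt/backups/postgres/latest.dump | head
```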
Emergency Contacts¶
External Dependencies:
- Domain/DNS: CloudFlare (has separate login)
- Git Repository: GitHub (has separate login)
- Backup Storage: TrueNAS (admin credentials)
Critical Credentials (store in password manager):
- Break-glass kubeconfig
- Vault unseal keys
- Vault root token
- TrueNAS admin password
- Ansible vault password
- GitHub SSH keys
- CloudFlare API token
Post-Recovery Tasks¶
After recovering from a disaster:
- Update documentation: What worked? What didn't?
- Review root cause: Why did the failure occur?
- Improve automation: Can this be prevented or recovered faster?
- Test recovery process: Ensure it works next time
- Update backup strategy: Were backups sufficient?
Prevention¶
Best Practices:
- ✅ Regular backups (automated)
- ✅ Test restore procedures
- ✅ Monitor backup success/failures
- ✅ Keep credentials secure but accessible
- ✅ Document everything (like this!)
- ✅ Use GitOps (so configuration is always in Git)
- ✅ High availability for critical components
- ✅ Regular maintenance and updates

- [CSI]: Container Storage Interface
- [IOMMU]: Input-Output Memory Management Unit. Used to virtualize memory access for devices. See Wikipedia