Maintenance Tasks¶
Regular maintenance tasks to keep your k3s cluster healthy and running smoothly.
Daily Tasks¶
Check Cluster Health¶
# Quick cluster health check
kubectl get nodes
kubectl get pods -A | grep -v Running | grep -v Completed
kubectl get applications -n argocd-system | grep -v Synced
# Check ArgoCD sync status
kubectl get applications -n argocd-system -o custom-columns=NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status
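The `grep -v Running` check above misses pods that are in the `Running` phase but whose containers are not all ready (for example `1/2`). A minimal sketch of a stricter filter follows; the function name `filter_unhealthy_pods` is hypothetical, and it assumes the default column layout of `kubectl get pods -A --no-headers`.

```shell
#!/bin/bash
# Print pods that are neither fully ready nor completed.
# Reads `kubectl get pods -A --no-headers` output on stdin.
# Assumed columns: NAMESPACE NAME READY STATUS RESTARTS AGE
filter_unhealthy_pods() {
  awk '{
    if ($4 == "Completed") next          # finished job pods are fine
    split($3, ready, "/")                # READY looks like "1/2"
    if ($4 != "Running" || ready[1] != ready[2])
      print $1 "/" $2, $3, $4
  }'
}

# Usage (needs cluster access):
#   kubectl get pods -A --no-headers | filter_unhealthy_pods
```

An empty result means every pod is either fully ready or completed.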
Monitor Resource Usage¶
# Node resources
kubectl top nodes
# Top memory consumers
kubectl top pods -A --sort-by=memory | head -20
# Top CPU consumers
kubectl top pods -A --sort-by=cpu | head -20
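To turn the node listing into a quick alert, the percentages can be compared against a threshold. This is a sketch only: the function name `flag_busy_nodes` and the default threshold of 80 are assumptions, and it expects the `NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%` layout printed by `kubectl top nodes --no-headers`.

```shell
#!/bin/bash
# Flag nodes whose CPU% or MEMORY% exceeds a threshold (default 80).
# Reads `kubectl top nodes --no-headers` output on stdin.
flag_busy_nodes() {
  local threshold="${1:-80}"
  awk -v t="$threshold" '{
    cpu = $3; mem = $5
    sub(/%/, "", cpu); sub(/%/, "", mem)   # strip the percent sign
    if (cpu + 0 > t || mem + 0 > t)
      print $1, "cpu=" $3, "mem=" $5
  }'
}

# Usage (needs metrics-server, which k3s ships by default):
#   kubectl top nodes --no-headers | flag_busy_nodes 80
```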
Weekly Tasks¶
Review Storage Usage¶
# Check PV usage
kubectl get pv -o custom-columns=NAME:.metadata.name,CAPACITY:.spec.capacity.storage,STATUS:.status.phase
# Check Longhorn UI for disk usage
# Clean up Released PVs
kubectl get pv | grep Released
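The listing above only shows Released volumes; it does not remove them. One cautious approach is to generate the delete commands for review instead of running them directly. The sketch below assumes the default `kubectl get pv --no-headers` layout, where STATUS is the fifth column and CLAIM the sixth; `released_pv_commands` is a hypothetical name.

```shell
#!/bin/bash
# Print (do not run) a delete command for each Released PV.
# Reads `kubectl get pv --no-headers` output on stdin.
# Assumed columns: NAME CAPACITY ACCESS-MODES RECLAIM-POLICY STATUS CLAIM ...
released_pv_commands() {
  awk '$5 == "Released" {
    print "kubectl delete pv " $1 "  # was bound to " $6
  }'
}

# Usage (needs cluster access):
#   kubectl get pv --no-headers | released_pv_commands
```

Review the printed commands before running any of them: with a `Retain` reclaim policy, deleting the PV object does not necessarily delete the underlying volume data.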
Check Certificate Expiration¶
# List certificates and expiration
kubectl get certificate -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,READY:.status.conditions[?(@.type=="Ready")].status,EXPIRY:.status.notAfter'
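Reading expiry timestamps by eye is error-prone, so the list can be reduced to certificates that expire soon. This is a minimal sketch assuming GNU `date` and cert-manager's RFC 3339 `notAfter` values; the function name `expiring_certs` and the 30-day default are assumptions.

```shell
#!/bin/bash
# Print certificates that expire within N days (default 30).
# stdin lines: NAMESPACE NAME NOT_AFTER (e.g. 2025-01-31T12:00:00Z)
# The optional second argument overrides "now" (epoch seconds).
# Requires GNU date for `date -d`.
expiring_certs() {
  local days="${1:-30}" now="${2:-$(date +%s)}"
  local ns name not_after expiry left
  while read -r ns name not_after; do
    [ -n "$not_after" ] || continue
    # Skip rows without a parseable timestamp (e.g. "<none>")
    expiry=$(date -u -d "$not_after" +%s 2>/dev/null) || continue
    left=$(( (expiry - now) / 86400 ))
    if [ "$left" -le "$days" ]; then
      echo "$ns/$name expires in $left days"
    fi
  done
}

# Usage (needs cluster access):
#   kubectl get certificate -A --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,EXPIRY:.status.notAfter \
#     | expiring_certs 30
```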
Review Failed Backups¶
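One possible way to review backups from the command line, sketched under assumptions: Longhorn is installed in the `longhorn-system` namespace and exposes each backup as a `backups.longhorn.io` resource with a `.status.state` field that reads `Completed` on success. Both function names are hypothetical; check the resource and field names against your Longhorn version before relying on this.

```shell
#!/bin/bash
# List Longhorn backups with their state.
# Assumption: namespace longhorn-system, CRD backups.longhorn.io.
list_backup_states() {
  kubectl -n longhorn-system get backups.longhorn.io --no-headers \
    -o custom-columns=NAME:.metadata.name,STATE:.status.state
}

# Keep only backups whose state is not "Completed".
# stdin lines: NAME STATE
failed_backups() {
  awk '$2 != "Completed" { print $1, $2 }'
}

# Usage (needs cluster access):
#   list_backup_states | failed_backups
```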
Monthly Tasks¶
System Updates¶
OS Updates (via Ansible):
cd ansible-directory
# Check for updates
ansible all -m shell -a "apt update && apt list --upgradable"
# Apply updates (one node at a time for workers)
ansible-playbook upgrade.yml --limit <node-name>
# Reboot if needed
ansible-playbook reboot.yml --limit <node-name>
# Verify node comes back
kubectl get nodes --watch
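The one-node-at-a-time sequence above can be wrapped in a loop that drains each node first and stops at the first failure. This is a sketch, not the repository's actual tooling: the function name `rolling_upgrade` and the 10-minute timeout are assumptions, while `upgrade.yml` and `reboot.yml` are the playbooks named above.

```shell
#!/bin/bash
# Rolling OS upgrade: drain, upgrade, reboot, wait, uncordon.
# Stops at the first failing step and leaves that node cordoned
# so it can be inspected by hand.
rolling_upgrade() {
  local node
  for node in "$@"; do
    echo "== $node =="
    kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data || return 1
    ansible-playbook upgrade.yml --limit "$node" || return 1
    ansible-playbook reboot.yml --limit "$node" || return 1
    kubectl wait --for=condition=Ready "node/$node" --timeout=10m || return 1
    kubectl uncordon "$node" || return 1
  done
}

# Usage (run from the Ansible directory):
#   rolling_upgrade worker-1 worker-2
```

Processing nodes sequentially keeps the rest of the cluster available to absorb the evicted pods.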
k3s Updates:
# Check current version (client and server); --short was removed in kubectl 1.28, and the default output is now the short form
kubectl version
# Update via Ansible
ansible-playbook playbooks/06_k3s_secure.yaml
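After the playbook finishes, it is worth confirming that no node was left behind on the previous release. The sketch below compares the kubelet version reported by each node; `node_versions` and `check_version_skew` are hypothetical names.

```shell
#!/bin/bash
# Print "<node>\t<kubelet version>" for every node.
node_versions() {
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.kubeletVersion}{"\n"}{end}'
}

# Exit non-zero if more than one distinct version appears on stdin.
check_version_skew() {
  awk '
    { seen[$2]++ }
    END {
      n = 0
      for (v in seen) n++
      if (n > 1) {
        print "version skew: " n " distinct kubelet versions"
        exit 1
      }
      print "all nodes report the same kubelet version"
    }'
}

# Usage (needs cluster access):
#   node_versions | check_version_skew
```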
Backup Verification¶
Test Longhorn Restore:
- Choose non-critical volume
- Create backup
- Delete volume
- Restore from backup
- Verify data integrity
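For the last step, "verify data integrity" can be made concrete by comparing checksums taken before the delete and after the restore. This is a sketch assuming GNU coreutils and findutils and a volume mounted at a path you can read; the function name `manifest` and the paths in the usage comment are hypothetical.

```shell
#!/bin/bash
# Write a sorted SHA-256 manifest for every file under a directory.
# Paths are relative to the directory, so manifests taken from
# different mount points remain comparable.
manifest() {
  ( cd "$1" && find . -type f -print0 | LC_ALL=C sort -z | xargs -0 -r sha256sum )
}

# Usage:
#   manifest /mnt/data > /tmp/before.sha256        # before deleting the volume
#   manifest /mnt/data | diff /tmp/before.sha256 - # after the restore
```

An empty `diff` means every file came back byte-for-byte identical.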
Cleanup Tasks¶
Clean Up Old Docker Images:
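k3s runs containerd by default rather than Docker, so assuming the default runtime, unused images can be pruned with the `crictl` bundled in the `k3s` binary. The sketch below is meant to run as root on each node, not through `kubectl`; the function name `prune_unused_images` is hypothetical.

```shell
#!/bin/bash
# Remove images not used by any container and report the image count
# before and after. Assumes the default k3s containerd runtime.
prune_unused_images() {
  local before after
  before=$(( $(k3s crictl images -q | wc -l) ))
  k3s crictl rmi --prune >/dev/null
  after=$(( $(k3s crictl images -q | wc -l) ))
  echo "images: $before -> $after"
}

# Usage (as root on a node):
#   prune_unused_images
```

The kubelet also garbage-collects images on its own when disk usage is high, so this is mainly useful for reclaiming space ahead of time.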
Clean Up Old Snapshots via Longhorn UI
Clean Up Completed Jobs:
# List old jobs
kubectl get jobs -A --field-selector status.successful=1
# Delete a finished job by name (repeat for each job older than 7 days)
kubectl delete job -n <namespace> <job-name>
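Picking out jobs older than 7 days by hand gets tedious, so the age check can be scripted. The sketch below prints a delete command for each job whose completion time is at least N days old, leaving the actual deletion to you. It assumes GNU `date`; the function name `old_job_commands` is hypothetical.

```shell
#!/bin/bash
# Print (do not run) delete commands for jobs completed N+ days ago.
# stdin lines: NAMESPACE NAME COMPLETION_TIME (RFC 3339)
# The optional second argument overrides "now" (epoch seconds).
old_job_commands() {
  local days="${1:-7}" now="${2:-$(date +%s)}"
  local ns name done_at finished
  while read -r ns name done_at; do
    [ -n "$done_at" ] || continue
    # Jobs that have not completed show "<none>" and are skipped
    finished=$(date -u -d "$done_at" +%s 2>/dev/null) || continue
    if [ $(( (now - finished) / 86400 )) -ge "$days" ]; then
      echo "kubectl delete job -n $ns $name"
    fi
  done
}

# Usage (needs cluster access):
#   kubectl get jobs -A --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,DONE:.status.completionTime \
#     | old_job_commands 7
```

For jobs you control, setting `ttlSecondsAfterFinished` in the Job spec lets Kubernetes remove them automatically and makes this cleanup unnecessary.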
Security Audit¶
Check for Security Updates:
Review Secrets:
Quarterly Tasks¶
Major Version Updates¶
k3s Major Version Update:
- Review release notes
- Test in dev environment
- Backup cluster
- Update control plane nodes one at a time
- Update worker nodes
- Verify all applications
Disaster Recovery Test¶
Full Cluster Rebuild Test:
- Document current state
- Destroy test cluster
- Rebuild from scratch
- Restore from backups
- Verify all services
- Document any issues
Performance Review¶
Analyze resource usage trends over 3 months:
- CPU/memory trends
- Storage growth
- Network performance
- Identify bottlenecks
Maintenance Calendar¶
Daily¶
- Check cluster health
- Review critical alerts
Weekly¶
- Review storage usage
- Check certificate expiration
- Review failed backups
Monthly¶
- Apply OS updates
- Backup verification
- Cleanup tasks
- Security audit
Quarterly¶
- Major version updates
- Disaster recovery test
- Performance review
Maintenance Windows¶
Planning a Maintenance Window¶
- Schedule: Choose low-traffic time
- Notify: Inform users (if applicable)
- Backup: Ensure recent backups exist
- Test: Test changes in dev first
- Execute: Perform maintenance
- Verify: Check all services
Drain Node for Maintenance¶
# Drain node (evict pods)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# Perform maintenance
# ...
# Uncordon node (allow scheduling)
kubectl uncordon <node-name>
# Verify
kubectl get nodes
Useful Scripts¶
Health Check Script¶
#!/bin/bash
# cluster-health-check.sh
echo "=== Cluster Health Check ==="
echo ""
echo "Node Status:"
kubectl get nodes
echo ""
echo "Failed Pods:"
kubectl get pods -A | grep -v Running | grep -v Completed
echo ""
echo "ArgoCD Sync Status:"
kubectl get applications -n argocd-system | grep -v Synced
echo ""
echo "Certificate Status:"
kubectl get certificate -A | grep -v "True"
echo ""
echo "Resource Usage:"
kubectl top nodes
Cleanup Script¶
#!/bin/bash
# cleanup.sh
# Note: these commands delete ALL matching objects in every namespace, regardless of age
echo "Cleaning up completed jobs..."
kubectl delete jobs --field-selector status.successful=1 -A
echo "Cleaning up finished pods..."
kubectl delete pods --field-selector status.phase=Succeeded -A
kubectl delete pods --field-selector status.phase=Failed -A
echo "Done!"
Quick Reference¶
# Check cluster health
kubectl get nodes
kubectl get pods -A | grep -v Running | grep -v Completed
# Check resource usage
kubectl top nodes
kubectl top pods -A --sort-by=memory
# Check ArgoCD
kubectl get applications -n argocd-system
# Check certificates
kubectl get certificate -A
# Check storage
kubectl get pv,pvc -A
# Drain node
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
# Uncordon node
kubectl uncordon <node>
- [CSI]: Container Storage Interface
- [IOMMU]: Input-Output Memory Management Unit. Used to virtualize memory access for devices. See Wikipedia