Maintenance Tasks¶

Regular maintenance tasks to keep your k3s cluster healthy and running smoothly.

Daily Tasks¶

Check Cluster Health¶

# Quick cluster health check
kubectl get nodes
kubectl get pods -A | grep -Ev 'Running|Completed'
kubectl get applications -n argocd-system | grep -v Synced

# Check ArgoCD sync status
kubectl get applications -n argocd-system -o custom-columns=NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status
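
The `grep -v Running` filter above does not catch pods that are in the Running phase but have containers that are not ready (for example `1/2 Running`). As a minimal sketch, assuming the default column layout of `kubectl get pods -A --no-headers` (NAMESPACE NAME READY STATUS ...), a small filter can flag both cases. The function name `unhealthy_pods` is illustrative and not part of any tool.

```shell
#!/usr/bin/env bash
# unhealthy_pods: read `kubectl get pods -A --no-headers` output on stdin and
# print pods that are either not Running/Completed, or Running but not fully ready.
# Assumed columns: NAMESPACE NAME READY STATUS RESTARTS AGE
unhealthy_pods() {
  awk '{
    split($3, r, "/")                                  # READY column, e.g. "1/2"
    notready = ($4 == "Running" && r[1] != r[2])       # Running but not all containers ready
    if (($4 != "Running" && $4 != "Completed") || notready)
      print $1 "/" $2, $3, $4
  }'
}

# Usage (requires cluster access):
#   kubectl get pods -A --no-headers | unhealthy_pods
```

An empty result means every pod is either Completed or Running with all containers ready.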

Monitor Resource Usage¶

# Node resources
kubectl top nodes

# Top memory consumers
kubectl top pods -A --sort-by=memory | head -20

# Top CPU consumers
kubectl top pods -A --sort-by=cpu | head -20

Weekly Tasks¶

Review Storage Usage¶

# Check PV usage
kubectl get pv -o custom-columns=NAME:.metadata.name,CAPACITY:.spec.capacity.storage,STATUS:.status.phase

# Check Longhorn UI for disk usage
# List Released PVs (candidates for cleanup)
kubectl get pv | grep Released
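
Matching on the word "Released" with `grep` can also hit volume or claim names that contain it. A slightly stricter sketch filters on the phase column only. It assumes the input is produced with the `custom-columns` layout shown in the usage comment, and `released_pvs` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# released_pvs: read "NAME STATUS" pairs on stdin and print the names of
# PersistentVolumes whose phase is exactly "Released".
released_pvs() {
  awk '$2 == "Released" { print $1 }'
}

# Usage (requires cluster access):
#   kubectl get pv --no-headers \
#     -o custom-columns=NAME:.metadata.name,STATUS:.status.phase | released_pvs
#
# After confirming the data is no longer needed, delete each volume by hand:
#   kubectl delete pv <pv-name>
```

Deletion is left as a manual step on purpose, since a Released volume with a Retain policy may still hold data you want.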

Check Certificate Expiration¶

# List certificates and expiration
# The column spec is quoted so the shell does not treat [0] as a glob pattern
kubectl get certificate -A -o 'custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,READY:.status.conditions[0].status,EXPIRY:.status.notAfter'
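
Reading raw expiry timestamps is error-prone, so it can help to turn them into "days remaining". The following is a sketch, assuming GNU `date` (for `-d`) and input lines of the form `NAMESPACE NAME EXPIRY`. The helper name `expiring_certs` and the 30-day threshold in the usage comment are illustrative choices, not values from this cluster.

```shell
#!/usr/bin/env bash
# expiring_certs <threshold-days> [now-epoch-seconds]
# Reads "NAMESPACE NAME EXPIRY" lines on stdin (EXPIRY as an RFC 3339 timestamp)
# and prints certificates that expire in fewer than <threshold-days> days.
# The optional second argument overrides "now", which is useful for testing.
expiring_certs() {
  local threshold=$1 now=${2:-$(date -u +%s)}
  local ns name expiry expiry_epoch left
  while read -r ns name expiry; do
    # Certificates that were never issued have no notAfter value.
    if [ -z "$expiry" ] || [ "$expiry" = "<none>" ]; then
      continue
    fi
    expiry_epoch=$(date -u -d "$expiry" +%s) || continue   # GNU date assumed
    left=$(( (expiry_epoch - now) / 86400 ))
    if [ "$left" -lt "$threshold" ]; then
      echo "$ns/$name expires in $left days"
    fi
  done
}

# Usage (requires cluster access):
#   kubectl get certificate -A --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,EXPIRY:.status.notAfter \
#     | expiring_certs 30
```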

Review Failed Backups¶

# Check Longhorn backup status via UI
# Check PostgreSQL backup status
kubectl get backup -A

Monthly Tasks¶

System Updates¶

OS Updates (via Ansible):

cd ansible-directory

# Check for updates (apt update needs root, hence --become)
ansible all --become -m shell -a "apt update && apt list --upgradable"

# Apply updates (one node at a time for workers)
ansible-playbook upgrade.yml --limit <node-name>

# Reboot if needed
ansible-playbook reboot.yml --limit <node-name>

# Verify node comes back
kubectl get nodes --watch
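
The per-node steps above can be wrapped so that each worker is drained before the upgrade and uncordoned once it reports Ready again. This is a sketch under two assumptions: the Ansible inventory hostname matches the Kubernetes node name, and `upgrade.yml` and `reboot.yml` behave as described above. The `run` and `update_node` helpers and the `DRY_RUN` variable are hypothetical names introduced for illustration.

```shell
#!/usr/bin/env bash
# run: execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# update_node <node-name>
# Drain, upgrade, reboot, wait for Ready, then uncordon. Stops at the first failure
# so a node that did not come back is left cordoned for inspection.
update_node() {
  local node=$1
  run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data &&
  run ansible-playbook upgrade.yml --limit "$node" &&
  run ansible-playbook reboot.yml --limit "$node" &&
  run kubectl wait --for=condition=Ready "node/$node" --timeout=10m &&
  run kubectl uncordon "$node"
}

# Usage: preview the commands first, then run for real.
#   DRY_RUN=1 update_node <node-name>
#   update_node <node-name>
```

Running with `DRY_RUN=1` first prints the exact commands without touching the cluster.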

k3s Updates:

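# Check current version (the --short flag was removed in kubectl 1.28)
kubectl version

# The VERSION column shows the k3s version running on each node
kubectl get nodes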

# Update via Ansible
ansible-playbook playbooks/06_k3s_secure.yaml

Backup Verification¶

Test Longhorn Restore:

Choose non-critical volume
Create backup
Delete volume
Restore from backup
Verify data integrity

Cleanup Tasks¶

Clean Up Unused Container Images (k3s uses containerd, not Docker):

# On each node
ssh <node>
sudo crictl rmi --prune

Clean Up Old Snapshots via Longhorn UI

Clean Up Completed Jobs:

# List old jobs
kubectl get jobs -A --field-selector status.successful=1

# Delete a completed job once it is older than 7 days (check the AGE column above)
kubectl delete job -n <namespace> <job-name>
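
Picking out jobs older than 7 days by eye does not scale well, so the age check can be scripted. The sketch below assumes GNU `date` and input lines of the form `NAMESPACE NAME COMPLETION_TIME`. The helper name `old_jobs` is hypothetical.

```shell
#!/usr/bin/env bash
# old_jobs <days> [now-epoch-seconds]
# Reads "NAMESPACE NAME COMPLETION_TIME" lines on stdin and prints "NAMESPACE NAME"
# for every job that completed at least <days> days ago.
# The optional second argument overrides "now", which is useful for testing.
old_jobs() {
  local days=$1 now=${2:-$(date -u +%s)}
  local ns name done_at done_epoch age
  while read -r ns name done_at; do
    # Jobs that have not completed have no completionTime.
    if [ -z "$done_at" ] || [ "$done_at" = "<none>" ]; then
      continue
    fi
    done_epoch=$(date -u -d "$done_at" +%s) || continue   # GNU date assumed
    age=$(( (now - done_epoch) / 86400 ))
    if [ "$age" -ge "$days" ]; then
      echo "$ns $name"
    fi
  done
}

# Usage (requires cluster access): review the list, then delete.
#   kubectl get jobs -A --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,DONE:.status.completionTime \
#     | old_jobs 7 \
#     | while read -r ns name; do kubectl delete job -n "$ns" "$name"; done
```

For jobs you control, setting `spec.ttlSecondsAfterFinished` on the Job lets Kubernetes remove finished jobs automatically.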

Security Audit¶

Check for Security Updates:

# Check for CVEs in running images
# Use tool like Trivy

# Update base images via PRs
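
One way to feed Trivy is to build a de-duplicated list of the images currently running in the cluster. This is a sketch: `unique_images` is a hypothetical helper, the severity filter is an illustrative choice, and the jsonpath query covers regular containers only (not init containers).

```shell
#!/usr/bin/env bash
# unique_images: read whitespace-separated image references on stdin and
# print them one per line, sorted and de-duplicated.
unique_images() {
  tr -s '[:space:]' '\n' | sed '/^$/d' | sort -u
}

# Usage (requires cluster access and the trivy CLI on this machine):
#   kubectl get pods -A -o jsonpath='{.items[*].spec.containers[*].image}' \
#     | unique_images \
#     | while read -r image; do
#         trivy image --severity HIGH,CRITICAL "$image"
#       done
```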

Review Secrets:

# Check for secrets in namespaces
kubectl get secrets -A

# Ensure sensitive secrets are in Vault

Quarterly Tasks¶

Major Version Updates¶

k3s Major Version Update:

Review release notes
Test in dev environment
Backup cluster
Update control plane nodes one at a time
Update worker nodes
Verify all applications

Disaster Recovery Test¶

Full Cluster Rebuild Test:

Document current state
Destroy test cluster
Rebuild from scratch
Restore from backups
Verify all services
Document any issues

Performance Review¶

Analyze resource usage trends over 3 months:

CPU/memory trends
Storage growth
Network performance
Identify bottlenecks

Maintenance Calendar¶

Daily¶

Check cluster health
Review critical alerts

Weekly¶

Review storage usage
Check certificate expiration
Review failed backups

Monthly¶

Apply OS updates
Backup verification
Cleanup tasks
Security audit

Quarterly¶

Major version updates
Disaster recovery test
Performance review

Maintenance Windows¶

Planning a Maintenance Window¶

Schedule: Choose low-traffic time
Notify: Inform users (if applicable)
Backup: Ensure recent backups exist
Test: Test changes in dev first
Execute: Perform maintenance
Verify: Check all services

Drain Node for Maintenance¶

# Drain node (evict pods)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Perform maintenance
# ...

# Uncordon node (allow scheduling)
kubectl uncordon <node-name>

# Verify
kubectl get nodes

Useful Scripts¶

Health Check Script¶

#!/bin/bash
# cluster-health-check.sh

echo "=== Cluster Health Check ==="
echo ""

echo "Node Status:"
kubectl get nodes
echo ""

echo "Failed Pods:"
kubectl get pods -A | grep -v Running | grep -v Completed
echo ""

echo "ArgoCD Sync Status:"
kubectl get applications -n argocd-system | grep -v Synced
echo ""

echo "Certificate Status:"
kubectl get certificate -A | grep -v "True"
echo ""

echo "Resource Usage:"
kubectl top nodes

Cleanup Script¶

#!/bin/bash
# cleanup.sh

echo "Cleaning up old jobs..."
kubectl delete jobs --field-selector status.successful=1 -A

echo "Cleaning up old pods..."
kubectl delete pods --field-selector status.phase=Succeeded -A
kubectl delete pods --field-selector status.phase=Failed -A

echo "Done!"

Quick Reference¶

# Check cluster health
kubectl get nodes
kubectl get pods -A | grep -Ev 'Running|Completed'

# Check resource usage
kubectl top nodes
kubectl top pods -A --sort-by=memory

# Check ArgoCD
kubectl get applications -n argocd-system

# Check certificates
kubectl get certificate -A

# Check storage
kubectl get pv,pvc -A

# Drain node
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data

# Uncordon node
kubectl uncordon <node>
