K3s Homelab GitOps Stack
Welcome to the documentation for my production-ready k3s homelab cluster. This documentation serves three primary purposes (in priority order):
- Troubleshoot and fix issues - Quick access to solutions when things break
- Understand the architecture - Explain how everything is configured and why
- Replicate the setup - Step-by-step guide to build a similar cluster
Quick Navigation
I Need to Fix Something NOW
Go directly to Troubleshooting & Operations for:
- Common Issues & Quick Fixes
- Disaster Recovery Procedures
- Certificate Problems
- Storage Issues
- Network Debugging
- Application Debugging
I Want to Understand the Setup
Start with Getting Started:
- Overview - Purpose, philosophy, technology stack
- Architecture - Detailed design with diagrams
- Quick Start - Step-by-step setup guide
I Want to Build This
Follow the Quick Start Guide then:
About This Setup
This is my personal homelab k3s cluster, run with production-grade practices:
- Everything as code - GitOps with ArgoCD, no manual kubectl commands
- High availability - 3-node control plane, distributed storage
- Security first - Valid HTTPS certificates, SSO authentication, secrets management
- Network segmentation - Public/private VLANs with proper firewall rules
- Disaster recovery - Automated backups, documented recovery procedures
- Observable - Comprehensive monitoring, logging, and alerting
Although this is a personal educational project, I maintain production standards because the most valuable lessons come from handling real-world complexity and failure scenarios.
Technology Stack at a Glance
Hardware
- Control Plane: 3x Raspberry Pi 4 (8GB RAM)
- Workers: 4x Raspberry Pi 4 (8GB RAM) + 3x x86 servers (64GB RAM)
- Storage: HL15 with TrueNAS (40TB capacity)
Core Infrastructure
- Orchestration: k3s (lightweight Kubernetes)
- GitOps: ArgoCD
- Ingress: Traefik
- Load Balancer: MetalLB
- Storage: Longhorn (distributed block storage)
- Certificates: cert-manager + Let's Encrypt
- Secrets: HashiCorp Vault + External Secrets Operator
- Authentication: Authentik (SSO/OIDC)
- Monitoring: Prometheus + Grafana + Loki
- Database: CloudNativePG (PostgreSQL operator)
Hardware Setup
Current Configuration:
- 3x Raspberry Pi 4 (8GB) - Control plane nodes with Corsair USB sticks
- 4x Raspberry Pi 4 (8GB) - Worker nodes with USB boot + external NVMe (for Longhorn)
- Lenovo ThinkCentre M720q - 64GB RAM, Proxmox, large SSD
- Lenovo ThinkCentre M75q - 64GB RAM, Proxmox, large SSD
- Minisforum MS-01 - 64GB RAM, Proxmox, 3x large SSDs
- HL15 with TrueNAS - Backup target, 6x HDDs (~40TB)
The control plane runs on the first three Raspberry Pis for HA. Workers run on the remaining Raspberry Pis and Proxmox VMs. Some nodes have PoE hats for power. External NVMe drives provide fast storage for Longhorn.
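As a rough illustration of why three control plane nodes are used for HA: assuming the cluster uses an etcd-style, majority-based datastore (an assumption; the datastore is not stated on this page), a majority of server nodes must stay healthy for the control plane to keep accepting writes. The function names below are hypothetical and exist only for this sketch.

```python
def quorum(servers: int) -> int:
    """Smallest majority of `servers` voting control plane members."""
    if servers < 1:
        raise ValueError("need at least one server node")
    return servers // 2 + 1


def tolerated_failures(servers: int) -> int:
    """How many server nodes can be lost while a majority remains."""
    return servers - quorum(servers)


if __name__ == "__main__":
    # Compare a few control plane sizes side by side.
    for n in (1, 2, 3, 4, 5):
        print(
            f"{n} server(s): quorum={quorum(n)}, "
            f"tolerates {tolerated_failures(n)} failure(s)"
        )
```

Under that assumption, three servers tolerate the loss of one Raspberry Pi (for example during a reboot or an SD/USB failure), while a fourth server would add no extra tolerance, which is why odd server counts are the usual choice.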
Documentation Structure
This documentation is organized to prioritize troubleshooting and operations:
Documentation
├── Getting Started - Orientation and setup
│   ├── Overview - Purpose and technology stack
│   ├── Quick Start - Step-by-step setup guide
│   └── Architecture - Detailed design with diagrams
│
├── Troubleshooting & Operations - PRIORITY
│   ├── Common Issues - Daily problems & fixes
│   ├── Disaster Recovery - Complete recovery procedures
│   ├── Certificates - cert-manager troubleshooting
│   ├── Storage - Longhorn & PVC problems
│   ├── Network - DNS, ingress, connectivity
│   ├── Applications - Pod & app debugging
│   └── Maintenance - Regular maintenance tasks
│
├── Infrastructure Setup - Hardware and OS
├── Cluster Core - Critical services
├── Platform Services - Databases, secrets
├── Applications - User applications
├── How-To Guides - Task-focused procedures
└── Reference - Commands, templates, resources
Key Features of This Documentation
Troubleshooting First
Every page in the troubleshooting section includes:
- Quick diagnostic commands at the top
- Clear symptoms → diagnosis → fix workflow
- Common causes with step-by-step solutions
- Copy-paste ready code examples
- Quick reference commands at the bottom
Visual Diagrams
Architecture and workflow diagrams using Mermaid for clear understanding of system design and data flow.
Production Practices
Real-world lessons learned from running this cluster, including:
- Design decisions and trade-offs
- Why certain technologies were chosen
- Common pitfalls and how to avoid them
- Disaster recovery procedures tested in practice
Searchable
Full-text search enabled across all documentation. Use the search bar above to quickly find what you need.
Who Is This For?
Primarily for myself (Michael) as a reference when:
- Something breaks and I need to fix it quickly
- I need to remember how I configured something months ago
- I want to replicate or expand the setup
But also for anyone interested in building a similar homelab cluster with production-ready practices.
Inspiration & Credits
This setup is inspired by the excellent work of:
Get Started
New to this documentation? Start here:
- Read the Overview to understand the purpose and design
- Check out the Architecture to see how everything fits together
- Follow the Quick Start if you want to build something similar
Need to troubleshoot? Go directly to:
- Common Issues for daily problems
- Disaster Recovery for serious failures
- Component-specific guides for Certificates, Storage, Network, or Applications
Looking for specific information? Use the search bar above or browse the navigation menu.
Last updated: December 2025