
K3s Homelab GitOps Stack

Welcome to the documentation for my production-ready k3s homelab cluster. This documentation serves three primary purposes (in priority order):

  1. 🔧 Troubleshoot and fix issues - Quick access to solutions when things break
  2. 📖 Understand the architecture - Explain how everything is configured and why
  3. 🛠️ Replicate the setup - Step-by-step guide to build a similar cluster

Quick Navigation

🚨 I Need to Fix Something NOW

Go directly to Troubleshooting & Operations.

📚 I Want to Understand the Setup

Start with Getting Started.

🏗️ I Want to Build This

Follow the Quick Start Guide then:

  1. Hardware Setup
  2. OS Provisioning
  3. Ansible Automation
  4. Core Services

About This Setup

This is my personal homelab k3s cluster, run with production-grade practices:

  • ✅ Everything as code - GitOps with ArgoCD, no manual kubectl commands
  • ✅ High availability - 3-node control plane, distributed storage
  • ✅ Security first - Valid HTTPS certificates, SSO authentication, secrets management
  • ✅ Network segmentation - Public/private VLANs with proper firewall rules
  • ✅ Disaster recovery - Automated backups, documented recovery procedures
  • ✅ Observable - Comprehensive monitoring, logging, and alerting

Although this is a personal educational project, I maintain production standards because the biggest learnings come from handling real-world complexity and failure scenarios.
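To make the "everything as code" practice listed above concrete, here is a minimal sketch of what a Git-tracked ArgoCD Application definition could look like, built as a Python dict and printed as JSON (which `kubectl` accepts alongside YAML). The application name, repository URL, path, and namespaces are hypothetical placeholders, not values from this cluster; only the field names follow the ArgoCD `Application` resource.

```python
import json


def argocd_application(name, repo_url, path, namespace):
    """Build an Argo CD Application manifest as a plain dict.

    All argument values are supplied by the caller; nothing here reflects
    the real repository layout of this homelab.
    """
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        # Assumption: Argo CD itself is installed in the "argocd" namespace.
        "metadata": {"name": name, "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {
                "repoURL": repo_url,
                "path": path,
                "targetRevision": "HEAD",
            },
            "destination": {
                # In-cluster API server address used by Argo CD.
                "server": "https://kubernetes.default.svc",
                "namespace": namespace,
            },
            # Automated sync: Git is the source of truth, drift is reverted.
            "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
        },
    }


# Hypothetical example values, for illustration only.
app = argocd_application(
    name="example-app",
    repo_url="https://git.example.com/homelab/gitops.git",
    path="apps/example-app",
    namespace="example",
)
print(json.dumps(app, indent=2))
```

With automated sync plus `prune` and `selfHeal`, changes reach the cluster by committing to Git rather than by running `kubectl` by hand, which is the point of the first bullet.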

Technology Stack at a Glance

Hardware

  • Control Plane: 3x Raspberry Pi 4 (8GB RAM)
  • Workers: 4x Raspberry Pi 4 (8GB RAM) + 3x x86 servers (64GB RAM)
  • Storage: HL15 with TrueNAS (40TB capacity)

Core Infrastructure

  • Orchestration: k3s (lightweight Kubernetes)
  • GitOps: ArgoCD
  • Ingress: Traefik
  • Load Balancer: MetalLB
  • Storage: Longhorn (distributed block storage)
  • Certificates: cert-manager + Let's Encrypt
  • Secrets: HashiCorp Vault + External Secrets Operator
  • Authentication: Authentik (SSO/OIDC)
  • Monitoring: Prometheus + Grafana + Loki
  • Database: CloudNativePG (PostgreSQL operator)

Hardware Setup

Current Configuration:

  • 3x Raspberry Pi 4 (8GB) - Control plane nodes with Corsair USB sticks
  • 4x Raspberry Pi 4 (8GB) - Worker nodes with USB boot + external NVMe (for Longhorn)
  • Lenovo ThinkCentre M720q - 64GB RAM, Proxmox, large SSD
  • Lenovo ThinkCentre M75q - 64GB RAM, Proxmox, large SSD
  • Minisforum MS-01 - 64GB RAM, Proxmox, 3x large SSDs
  • HL15 with TrueNAS - Backup target, 6x HDDs (~40TB)

The control plane runs on the first three Raspberry Pis for HA. Workers run on the remaining four Raspberry Pis and on VMs hosted by the three Proxmox machines. Some nodes are powered through PoE HATs. External NVMe drives provide fast storage for Longhorn.
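As a rough sanity check on the numbers above, the sketch below tallies the physical RAM of the listed machines and works out how many control-plane failures a three-member group can survive. It assumes a majority-quorum datastore (such as the embedded etcd that k3s offers for HA); this page does not state which datastore the cluster uses. The TrueNAS box is left out because its RAM is not given, and the Proxmox figures are host RAM, not what the worker VMs are allocated.

```python
# Hardware as listed above: (description, machine count, RAM in GB per machine).
INVENTORY = [
    ("Raspberry Pi 4 control plane", 3, 8),
    ("Raspberry Pi 4 worker", 4, 8),
    ("x86 Proxmox host", 3, 64),
]


def total_ram_gb(inventory):
    """Sum physical RAM across all listed machines."""
    return sum(count * ram for _, count, ram in inventory)


def quorum(members):
    """Smallest majority of a voting group of the given size."""
    return members // 2 + 1


def tolerated_failures(members):
    """Members that can be lost while a majority still remains."""
    return members - quorum(members)


control_plane_nodes = 3
print(f"Total physical RAM: {total_ram_gb(INVENTORY)} GB")
print(f"Quorum: {quorum(control_plane_nodes)} of {control_plane_nodes}")
print(f"Tolerated control-plane failures: {tolerated_failures(control_plane_nodes)}")
```

Under that assumption, three control-plane nodes keep the cluster API available through the loss of any one node; a fourth node would not raise that tolerance, which is why odd member counts are the usual choice.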

Documentation Structure

This documentation is organized to prioritize troubleshooting and operations:

📖 Documentation
├── 🏠 Getting Started - Orientation and setup
│   ├── Overview - Purpose and technology stack
│   ├── Quick Start - Step-by-step setup guide
│   └── Architecture - Detailed design with diagrams
│
├── 🔧 Troubleshooting & Operations ⭐ PRIORITY
│   ├── Common Issues - Daily problems & fixes
│   ├── Disaster Recovery - Complete recovery procedures
│   ├── Certificates - cert-manager troubleshooting
│   ├── Storage - Longhorn & PVC problems
│   ├── Network - DNS, ingress, connectivity
│   ├── Applications - Pod & app debugging
│   └── Maintenance - Regular maintenance tasks
│
├── 🏗️ Infrastructure Setup - Hardware and OS
├── ⚙️ Cluster Core - Critical services
├── 🗄️ Platform Services - Databases, secrets
├── 📱 Applications - User applications
├── 📝 How-To Guides - Task-focused procedures
└── 📚 Reference - Commands, templates, resources

Key Features of This Documentation

🎯 Troubleshooting First

Every page in the troubleshooting section includes:

  • Quick diagnostic commands at the top
  • A clear symptoms → diagnosis → fix workflow
  • Common causes with step-by-step solutions
  • Copy-paste-ready code examples
  • Quick reference commands at the bottom

📊 Visual Diagrams

Architecture and workflow diagrams are drawn with Mermaid to make the system design and data flow easy to follow.

💡 Production Practices

Real-world lessons learned from running this cluster, including:

  • Design decisions and trade-offs
  • Why certain technologies were chosen
  • Common pitfalls and how to avoid them
  • Disaster recovery procedures tested in practice

🔍 Searchable

Full-text search is enabled across all documentation. Use the search bar above to quickly find what you need.

Who Is This For?

Primarily for myself (Michael), as a reference when:

  • Something breaks and I need to fix it quickly
  • I need to remember how I configured something months ago
  • I want to replicate or expand the setup

But also for anyone interested in building a similar homelab cluster with production-ready practices.

Inspiration & Credits

This setup is inspired by the excellent work of:

Get Started

New to this documentation? Start here:

  1. Read the Overview to understand the purpose and design
  2. Check out the Architecture to see how everything fits together
  3. Follow the Quick Start if you want to build something similar

Need to troubleshoot? Go directly to Troubleshooting & Operations.

Looking for specific information? Use the search bar above or browse the navigation menu.


Last updated: December 2025

  • [CSI]: Container Storage Interface
  • [IOMMU]: Input-Output Memory Management Unit. Used to virtualize memory access for devices.