🚀 Home Lab to the Moon and Back

Validating ARM64 Kubernetes in the cloud before committing to bare-metal
Smart validation strategy: Test first, buy hardware second

CozySummit Virtual 2025 License Built with TDG


🎯 The Mission

Transform a 128°F office space heater (aka home lab) into a cloud-validated, ARM64-first Kubernetes deployment that:

  • ✅ Validates ARM64 architecture on t4g instances before Raspberry Pi purchase
  • ✅ Runs experiments within reasonable budget (baseline: $0.08/month, validation: <$15/month)
  • ✅ Netboots Talos Linux with custom extensions (Spin + Tailscale subnet router)
  • ✅ Demonstrates WebAssembly on ARM64 in production-like conditions
  • ✅ Proves when cloud makes sense vs. efficient home lab hardware

Target: Live demo at CozySummit Virtual 2025 on December 3, 2025

🧪 TDG Test Status (Updated: November 19, 2025)

✅ Working Tests (4/5):

  • ✅ Patch validation (upstream conformance) - FIXED
  • ✅ GitHub Actions workflow syntax
  • ✅ Dependency verification (crane, skopeo, jq)
  • ✅ Patch directory cleanliness (3 patches) - FIXED

❌ Failing Tests (1/5):

  • ❌ ADR-003 documentation validation - Missing file expected by test

🚧 Image Build Tests (1/3 passing):

  • ❌ Container image pulls (need actual published images)
  • ❌ OCI manifest validation (images not yet published)
  • ✅ Cost tracking validation

🎯 Conformance Achieved:

  • Upstream CozyStack integration ✅
  • Separate repository strategy ✅
  • ARM64 native builds ✅
  • Test suite reality alignment ✅ NEW

🌑️ The Problem

Home Lab Status: πŸ”₯
Office Temperature: 93Β°F (ambient, with the door closed)
Electricity Bill: πŸ“ˆ
Wife's Patience: πŸ“‰

Running x86 workloads 24/7 in a home lab is:

  • HOT - Space heater in every season
  • EXPENSIVE - Power consumption adds up
  • LOUD - Fans, lots of fans
  • INFLEXIBLE - Can't easily scale down

The Solution? Validate in the cloud, then bring it home on ARM64 (Raspberry Pi CM3).


🏗️ The Architecture

Home Lab (Current)

Internet → DD-WRT Router (10.17.12.1)
           └─ Front Subnet (10.17.12.0/24)
              └─ Mikrotik Router (dual-homed)
                 └─ Inner Subnet (10.17.13.0/24)
                    ├─ Netboot Infrastructure
                    │  ├─ dnsmasq (DHCP)
                    │  ├─ matchbox (PXE)
                    │  ├─ 5x registry caches
                    │  └─ pi-hole (DNS)
                    └─ Talos Nodes
                       └─ CozyStack

AWS Cloud (✅ Design Complete)

VPC: 10.10.0.0/16 (eu-west-1)
└─ Public Subnet (10.10.0.0/24)
   ├─ Bastion: 10.10.0.100 (ENI + IPv6, t4g.small)
   │  └─ Services: registry caches, Wireguard NAT, Tailscale
   │
   ├─ Talos Gateway: 10.10.0.101 (t4g.medium)
   │  └─ Extensions: spin, tailscale (subnet router)
   │
   ├─ Talos Compute: 10.10.0.102 (t4g.medium)
   │  └─ Extensions: spin only
   │
   └─ Talos Compute: 10.10.0.103 (t4g.medium)
      └─ Extensions: spin only

Boot: boot-to-talos installs OCI images (no AMI management)
Cost: ~$16-20/month (mostly EBS, t4g free tier covers compute)

📋 AWS Design Summary - Ready for Stakpak agent
🏷️ Package Naming Cleanup - Fix those ugly package names!

Key Innovation: Exact replica of home lab topology in AWS, staying within free tier limits.


🌟 Core Stack Deep Dive

Talos Linux · CozyStack · WebAssembly (Spin) · Tailscale Subnet Router · AWS Graviton

🔌 Tailscale Subnet Router Architecture

Key insight: We use Tailscale's subnet router mode (not mesh!) to create clean network bridges between:

  • AWS VPC private networks (10.20.0.0/16)
  • Kubernetes pod CIDR (managed by CozyStack's CNI)
  • Service networks (MetalLB load balancers in ARP mode)
  • Home lab networks (10.17.13.0/24)

Architecture: A single privileged Talos node runs the subnet router; the other nodes use standard Kubernetes networking. This preserves the CNI while providing seamless VPC access.
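
As a rough sketch of the idea, the snippet below joins the two networks whose CIDRs are listed above into the comma-separated form that `tailscale up --advertise-routes` accepts. The helper name `join_routes` is hypothetical, and the pod and service CIDRs are left out because this README does not state them. On Talos the routes would presumably be supplied through the extension's configuration rather than by running the CLI by hand, so the command is only printed here.

```shell
# Join CIDR arguments into one comma-separated string.
# (join_routes is an illustrative helper, not part of this repo.)
join_routes() {
  local IFS=,
  echo "$*"
}

# The two networks with explicit CIDRs in the list above.
routes=$(join_routes "10.20.0.0/16" "10.17.13.0/24")

# Print (do not run) the command a subnet router would use.
echo "tailscale up --advertise-routes=${routes}"
```

Only the gateway node would advertise these routes; compute nodes keep their normal CNI networking.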

See landing page for complete technical implementation details.

🗿 Talos Linux: Security-First Immutability

Why Talos? It's CozySummit and CozyStack is built on it. End of justification! 🎯

What makes it compelling:

  • Immutable OS: Fewer binaries = smaller attack surface
  • Kubernetes-first: No SSH, no shell, just API-driven infrastructure
  • ARM64 native: First-class support, not an afterthought
  • Security by design: Minimal surface area, everything locked down

Real talk: We're not here to justify Talos vs. other distros. It's proven, it works, and it's what CozyStack uses. Moving on.

🏗️ CozyStack: Helm-First Platform Engineering

Why CozyStack over vanilla Kubernetes? Because it looks like something I'd build if I had unlimited time, and I want that to exist.

The compelling architecture:

  • Helm-first design: Platform built for teams that demand "Helm only"
  • Flux integration: GitOps workflows that actually work
  • Cloud-native foundation: CNCF projects with (hopefully) spectacular ARM64 support
  • Platform-as-code: Infrastructure that scales with your team, not against it

Author's note: As a Flux maintainer, I've seen enough infrastructure built on Helm to know this is the right abstraction level. CozyStack delivers that vision.

⚑ WebAssembly (Spin): Architecture-Independent Performance

Why WebAssembly? Faster, cheaper, architecture-independent. Perfect for ARM64 validation.

The Spin advantage:

  • Cold start performance: Sub-millisecond startup vs. container seconds
  • Scale-to-zero efficiency: Actually works, unlike most "serverless" promises
  • Local registry caching: Artifact caching that makes cold starts even faster
  • Architecture portability: Same binary runs on x86 home lab and ARM64 cloud

Real-world impact: We've been demoing Spin for years. The performance story is proven - now we're validating it on ARM64 at cloud scale before hardware investment.

🏔️ AWS Graviton: Free Tier ARM64 Validation

Why Graviton? It is ARM64 that is readily available in the cloud, and t4g usage is currently covered by the AWS free tier.

The pragmatic choice:

  • Virtualization extensions: Hopefully has what Raspberry Pi lacks for advanced CozyStack features
  • Known platform: AWS is familiar territory for cloud validation
  • Risk mitigation: Test architecture before $650+ hardware investment
  • Uncertain alternatives: Ampere? Chinese Raspberry Pi clones? Unknown landscape.

Honest assessment: We think Graviton has the virtualization support that consumer ARM64 hardware might lack. We'll find out! But we'd rather discover limitations in the cloud than after buying hardware.

🏗️ Role-Based Architecture: Real-World Discovery

The Problem: Adding Tailscale to ALL cluster nodes breaks everything.

What we learned (the hard way):

  • Kubernetes Ready condition: Nodes wait for ALL configured extensions to become active
  • Multiple subnet routers: Every node tries to configure as Tailscale subnet router
  • Configuration conflicts: Multiple nodes compete for same routing role
  • Cluster formation failure: Nodes hang indefinitely, never reach Ready state

The Solution: Role-based image architecture

  • Compute nodes (spin-only): WebAssembly runtime only, quick Ready state
  • Gateway nodes (spin-tailscale): WebAssembly + Tailscale subnet router, one per cluster
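
The role split above can be sketched as a small lookup plus a sanity check. This is a minimal illustration, not the project's tooling: the function names are hypothetical, and `spin-only` / `spin-tailscale` are used as plain labels taken from the list above rather than as full image references.

```shell
# Map a node role to the image variant named in the list above.
# (Hypothetical helper for illustration only.)
image_for_role() {
  case "$1" in
    gateway) echo "spin-tailscale" ;;
    compute) echo "spin-only" ;;
    *) echo "unknown role: $1" >&2; return 1 ;;
  esac
}

# Succeed only if exactly one of the given roles is "gateway",
# matching the "one per cluster" rule for the subnet router.
exactly_one_gateway() {
  local count=0 role
  for role in "$@"; do
    if [ "$role" = "gateway" ]; then
      count=$((count + 1))
    fi
  done
  [ "$count" -eq 1 ]
}

# Example: the three-node layout from the AWS design.
for role in gateway compute compute; do
  echo "$role -> $(image_for_role "$role")"
done
```

Running `exactly_one_gateway` over the planned roles before booting nodes would catch the "every node is a subnet router" mistake described above.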

Discovery method: "Walking the grounds and tilling the soil" - not systematic testing, but real-world cluster building experience on AMD64 that informed our ARM64 strategy.

Impact: This architectural insight is why our ARM64 validation will work. We've already solved the hard problems.


📊 The Economics

Cost Strategy

Baseline Infrastructure (no experiments):

Bastion (t4g.small, 5hrs/day):  $0.00 (free tier)
EBS volumes (during runtime):   $0.04/month  
NAT Gateway (minimal usage):    $0.04/month
-------------------------------------------------
Baseline cost:                  $0.08/month

Validation Phase (5 experiments, 2-3 hours each):

3x Talos nodes (t4g.small):     $0.00 (free tier < 750hrs/month)
4x EBS volumes (8GB each):      $0.25-0.50/session
NAT Gateway (active egress):    $0.15-0.35/session  
-------------------------------------------------
Per experiment session:         $0.40-0.85
Target validation budget:       <$15/month
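
To make the arithmetic behind these figures explicit, here is a small sketch that recomputes the totals from the line items above. It assumes `awk` is available, and the helper names `sum_usd` and `times_usd` are illustrative only.

```shell
export LC_ALL=C  # keep awk's decimal separator as "."

# Sum dollar amounts given as arguments; print the total with two decimals.
sum_usd() {
  printf '%s\n' "$@" | awk '{ total += $1 } END { printf "%.2f\n", total }'
}

# Multiply a per-session cost by a number of sessions.
times_usd() {
  awk -v cost="$1" -v n="$2" 'BEGIN { printf "%.2f\n", cost * n }'
}

baseline=$(sum_usd 0.04 0.04)        # EBS + NAT Gateway (bastion is free tier)
session_low=$(sum_usd 0.25 0.15)     # low end: EBS + NAT egress
session_high=$(sum_usd 0.50 0.35)    # high end: EBS + NAT egress
month_low=$(times_usd "$session_low" 5)    # 5 experiments
month_high=$(times_usd "$session_high" 5)

echo "Baseline:         \$${baseline}/month"
echo "Per session:      \$${session_low}-${session_high}"
echo "Five experiments: \$${month_low}-${month_high}/month"
```

Under these line items, five sessions come to $2.00-4.25 on top of the $0.08 baseline, which leaves considerable headroom under the $15/month target.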

Break-even Analysis:

  • Home lab power consumption: $30-50/month
  • Cloud validation phase: Target <$15/month
  • Production cloud cost: $25-70/month (estimated)
  • Decision point: When cloud exceeds $40/month, efficient ARM64 home lab wins
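
The decision point can be written down as a one-line rule. The sketch below is only an illustration of that rule; the function name `recommend` is hypothetical and the $40 threshold comes from the bullet above.

```shell
# Print "home-lab" when the monthly cloud cost exceeds the threshold,
# otherwise "cloud". The threshold defaults to the $40/month decision point.
recommend() {
  awk -v cost="$1" -v threshold="${2:-40}" \
    'BEGIN { if (cost + 0 > threshold + 0) print "home-lab"; else print "cloud" }'
}

# The two ends of the estimated production range ($25-70/month).
echo "At \$25/month: $(recommend 25)"
echo "At \$70/month: $(recommend 70)"
```

Because the estimated production range straddles the threshold, the outcome depends on where real usage lands, which is the point of measuring during the validation phase.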

Strategy: Validate in cloud for less than the cost of buying wrong hardware ($500+ Raspberry Pi mistake), then deploy with confidence.


🧪 Test-Driven Generation (TDG)

This project follows the Test-Driven Generation methodology created by Chanwit Kaewkasi.

Principle: Write tests FIRST, then generate code to make them pass.

Test Status: Two-Track Approach

✅ Patch & Image Validation (Current Suite)

| Test Category | Status | Details |
|---|---|---|
| Patch Validation | ✅ PASSING | 4/5 tests passing (validate-complete.sh) |
| Image Build Tests | 🚧 PARTIAL | 1/3 passing (need published images) |
| Cost Tracking | ✅ PASSING | AWS cost validation working |

🚧 Infrastructure TDG Suite (Planned)

| Phase | Tests | Status |
|---|---|---|
| Network Foundation | 1-3 | 📋 DEFINED (TDG-PLAN.md) |
| Bastion & Netboot | 4-6 | 📋 DEFINED (TDG-PLAN.md) |
| CozyStack Deployment | 7-9 | 📋 DEFINED (TDG-PLAN.md) |
| Integration Tests | 10-12 | 📋 DEFINED (SpinApp + KubeVirt + Moonlander) |

Run current tests: ./validate-complete.sh and ./tests/run-all-custom-image-tests.sh
Next: Implement TDG infrastructure tests from TDG-PLAN.md

Integration Test Highlights:

  • ✨ Test 10: SpinApp GitOps deployment with MetalLB external access
  • 🔄 Test 11: KubeVirt + Cluster-API nested Kubernetes clusters
  • 🌐 Test 12: Moonlander + Harvey cross-cluster management via Crossplane

📚 Documentation

Repository Constellation

This project integrates with 8+ repositories:

| Repo | Purpose | Status |
|---|---|---|
| urmanac/aws-accounts | Infrastructure Terraform | ✅ Active |
| kingdon-ci/cozy-fleet | Flux GitOps | ✅ Active |
| cozystack/talm | GitOps Talos Management | 🎯 Core Tool |
| kingdon-ci/kaniko-builder | Custom image builds | 🔧 Tool |
| kingdon-ci/time-tracker | Session tracking | ⚙️ Optional |
| kingdonb/mecris | MCP server patterns | Reference |
| kingdon-ci/noclaude | Self-hosted AI | 🤖 Future |
| chanwit/tdg | TDG Methodology | 📖 Methodology |

See: docs/REPO-OVERVIEW.md for full dependency graph.


🎬 The Demo

What You'll See (December 3)

  1. Home Lab Reality Check 🔥

    • Temperature monitoring
    • Power consumption
    • The space heater problem
  2. AWS Economics 💰

    • Live cost explorer query
    • $0.04/month current state
    • Free tier breakdown
  3. Netboot Magic ⚡

    • Launch t4g.small instance
    • Watch Talos netboot (< 5 min)
    • CozyStack dashboard
  4. SpinKube on ARM64 🎯

    • Deploy demo app
    • Show running workload
    • Verify ARM64 architecture
  5. The Exit 🚪

    • Terminate instance
    • Return to $0.04/month
    • Compare to home lab costs
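
For the "Verify ARM64 architecture" step, one possible check is sketched below. The helper names are hypothetical; the sketch assumes the architecture strings are `arm64` (as Kubernetes reports) or `aarch64` (as `uname -m` reports on Linux).

```shell
# Return success if the given architecture name means 64-bit ARM.
is_arm64() {
  case "$1" in
    arm64|aarch64) return 0 ;;
    *) return 1 ;;
  esac
}

# Return success only if at least one name is given and all are ARM64.
all_arm64() {
  [ "$#" -gt 0 ] || return 1
  local arch
  for arch in "$@"; do
    is_arm64 "$arch" || return 1
  done
  return 0
}

# Local check: what does this machine report?
if is_arm64 "$(uname -m)"; then
  echo "local machine: arm64"
else
  echo "local machine: not arm64"
fi
```

Against a running cluster, the node architectures could be fed in with something like `all_arm64 $(kubectl get nodes -o jsonpath='{.items[*].status.nodeInfo.architecture}')`, or inspected by eye with `kubectl get nodes -L kubernetes.io/arch`.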

Live Channels

  • 📺 YouTube: @yebyen/streams
  • 🎥 CozyStack Speed Runs: Previous demos and validation runs

🚀 Quick Start

Prerequisites

# AWS CLI with MFA-authenticated profile
aws configure --profile sb-terraform-mfa-session

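# OpenTofu (installs the `tofu` binary; install Terraform instead,
# or substitute `tofu` for `terraform` in the commands below)
brew install opentofu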

# kubectl + talosctl
brew install kubectl
brew install siderolabs/tap/talosctl

# Flux CLI
brew install fluxcd/tap/flux

Deploy Infrastructure

# Clone this repo
git clone https://github.com/urmanac/cozystack-moon-and-back.git
cd cozystack-moon-and-back

# Review TDG tests
./tests/run-all.sh --dry-run

# Deploy network foundation (Test 1)
cd terraform/network
terraform init
terraform plan
terraform apply

# Deploy bastion (Test 2-3)
cd ../bastion
terraform init
terraform apply

# Verify netboot infrastructure (Test 3)
ssh [email protected] "docker ps"

# Launch Talos node (Test 4)
# (Manual for now, see docs/BOOTSTRAP.md)

Bootstrap CozyStack

# Get talos config
talosctl -n 10.20.13.x config

# Bootstrap cluster
talosctl -n 10.20.13.x bootstrap

# Install CozyStack
# (See docs/COZYSTACK.md for detailed steps)

🎓 What You'll Learn

This project demonstrates:

  • ✨ Hybrid Cloud Economics - When cloud makes sense vs. home lab
  • 🏗️ Infrastructure Replication - Exact topology in AWS and home
  • 🔧 ARM64 Validation - Test before bare-metal deployment
  • 🌐 Network Architecture - Private-first, GDPR-safe design
  • 📦 Custom Talos Images - Extensions for Spin + Tailscale
  • 🔄 GitOps with Flux - Including new ExternalArtifact features
  • 💰 Cost Optimization - Free tier strategies and monitoring
  • 🧪 TDG Methodology - Test-driven infrastructure generation

🏆 Success Metrics

Demo Day (December 3)

  • Tests 1-6 passing (Network → Demo workload)
  • Live netboot < 5 minutes
  • SpinKube demo runs on ARM64
  • Cost stays under $0.10/month
  • Audience can replicate in their own AWS account

Post-Demo

  • Home lab transitions to Raspberry Pi CM3 modules
  • Office temperature drops 15°F
  • Power bill decreases measurably
  • Wife's approval rating improves 📈

👥 Credits

Speaker: Kingdon Barrett
Flux Maintainer, DevOps Engineer at Navteca, LLC
Working on Science Cloud for NASA Goddard Space Flight Center

Methodology: Chanwit Kaewkasi
TDG Innovator

Platform: Andrei Kvapil
CozyStack Creator

Built with:

  • 🤖 Claude (Anthropic) - Infrastructure design & TDG implementation
  • 🧰 CozyStack - Kubernetes platform for bare metal
  • 🐧 Talos Linux - Immutable Kubernetes OS
  • ☁️ AWS - Free tier cloud validation
  • 🔄 Flux - GitOps toolkit
  • SpinKube - WebAssembly on Kubernetes

📅 Timeline

| Date | Milestone |
|---|---|
| Nov 16 | 🎬 Project kickoff, TDG tests defined |
| Nov 23 | 🏗️ Network foundation + bastion deployed |
| Nov 30 | 🐧 First Talos node netboots successfully |
| Dec 3 | 🎀 Live demo at CozySummit Virtual 2025 |
| Dec 31 | 🏠 Home lab transitions to Raspberry Pi |

Free tier expires: December 2025 (t4g instances)


🀝 Contributing

This is a conference talk demo, but if you want to replicate or improve:

  1. Follow TDG - Write tests first
  2. Reference, don't duplicate - Reuse existing repos
  3. Document your journey - Others can learn from your experience
  4. Share costs - Transparency helps everyone

Open issues for questions, PRs for improvements!


📜 License

Apache 2.0 - See LICENSE for details.



"It's 2025 - If you're running a cluster, why not host it in the cloud first?"

🌙 → ☁️ → 🏠 → 🥧

From basement to cloud and back to Raspberry Pi
