Erhhung's Home Kubernetes Cluster

This Ansible-based project provisions Erhhung's high-availability Kubernetes homelab cluster using Rancher, and installs services for monitoring IoT appliances and for deploying other personal projects.

The top-level Ansible playbook main.yml, run by play.sh, provisions 5 VM hosts (rancher and k8s1..k8s4) in the existing XCP-ng Home pool, all running Ubuntu Server 24.04 Minimal with no customizations beyond basic networking and an authorized SSH key for the user erhhung.

A single-node K3s Kubernetes cluster will be installed on host rancher, with Rancher Server running on that cluster, and a 4-node RKE2 Kubernetes cluster with a high-availability control plane using virtual IPs will be installed on hosts k8s1..k8s4. Longhorn and NFS storage provisioners will be installed in each cluster: Longhorn to manage a pool of LVM logical volumes on each node, and NFS to expand overall storage capacity onto the QNAP NAS.

All cluster services will be provisioned with TLS certificates from Erhhung's private CA server at pki.fourteeners.local or its faster mirror at cosmos.fourteeners.local.
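
To verify that a given endpoint is serving a certificate issued by this CA, something like the following works (using one of the service endpoints listed below as an example):

# show the issuer and expiration date of a service certificate
openssl s_client -connect rancher.fourteeners.local:443 </dev/null 2>/dev/null |
  openssl x509 -noout -issuer -enddate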

Cluster Topology

topology.drawio.svg

Cluster Services

services.drawio.svg

Service Endpoints

Service Endpoint                           Description
https://rancher.fourteeners.local          Rancher Server console
https://harbor.fourteeners.local           Harbor OCI registry
https://minio.fourteeners.local            MinIO console
https://s3.fourteeners.local               MinIO S3 API
opensearch.fourteeners.local:9200          OpenSearch (HTTPS only)
https://kibana.fourteeners.local           OpenSearch Dashboards
postgres.fourteeners.local:5432            PostgreSQL via Pgpool (mTLS only)
https://sso.fourteeners.local              Keycloak IAM console
valkey.fourteeners.local:6379              Valkey cluster (mTLS only)
valkey{1..6}.fourteeners.local:6379
https://grafana.fourteeners.local          Grafana dashboards
https://metrics.fourteeners.local          Prometheus UI (Keycloak SSO)
https://alerts.fourteeners.local           Alertmanager UI (Keycloak SSO)
https://thanos.fourteeners.local           Thanos Query UI
https://rule.thanos.fourteeners.local      Thanos component status UIs
https://store.thanos.fourteeners.local
https://bucket.thanos.fourteeners.local
https://compact.thanos.fourteeners.local
https://kiali.fourteeners.local            Kiali console (Keycloak SSO)
https://argocd.fourteeners.local           Argo CD console

Installation Sources

Ansible Vault

The Ansible Vault password is stored in macOS Keychain under item "Home-K8s" for account "ansible-vault".

export ANSIBLE_CONFIG="./ansible.cfg"
VAULTFILE="group_vars/all/vault.yml"

ansible-vault create $VAULTFILE
ansible-vault edit   $VAULTFILE
ansible-vault view   $VAULTFILE
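
Since the password is in the Keychain, Ansible can be pointed at an executable vault password script so that vault-encrypted files are decrypted without a prompt. A minimal sketch, assuming the Keychain item and account named above (the script name and ansible.cfg wiring are illustrative, not necessarily how this repo is configured):

#!/usr/bin/env bash
# vault-pass.sh: print the Ansible Vault password from macOS Keychain;
# reference it from ansible.cfg via "vault_password_file = ./vault-pass.sh"
security find-generic-password -s Home-K8s -a ansible-vault -w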

Some variables stored in Ansible Vault (there are many more):

Infrastructure Secrets            User Passwords
ansible_become_pass               rancher_admin_pass
github_access_token               harbor_admin_pass
age_secret_key                    minio_root_pass
icloud_smtp.*                     minio_admin_pass
k3s_token                         opensearch_admin_pass
rke2_token                        keycloak_admin_pass
harbor_secret                     thanos_admin_pass
harbor_ca_key                     grafana_admin_pass
minio_client_pass                 argocd_admin_pass
dashboards_os_pass
fluent_os_pass
valkey_pass
postgresql_pass
keycloak_db_pass
keycloak_smtp_pass
monitoring_pass
monitoring_oidc_client_secret.*
alertmanager_smtp_pass
oauth2_proxy_cookie_secret
kiali_oidc_client_secret
argocd_signing_key

Connections

All managed hosts are running Ubuntu 24.04 with SSH key from https://github.com/erhhung.keys already authorized.

Ansible will authenticate as user erhhung using private key "~/.ssh/erhhung.pem";
however, all privileged operations using sudo will require the password stored in Vault.
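
A quick ad-hoc sanity check that both SSH access and privilege escalation work (all is Ansible's built-in group covering every host in the inventory):

export ANSIBLE_CONFIG="./ansible.cfg"

# confirm SSH connectivity to every managed host
ansible all -m ansible.builtin.ping

# confirm sudo works using the become password stored in Vault
ansible all -b -a 'id -un'    # should print "root" for every host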

Playbooks

Set the config variable first for the ansible-playbook commands below:

export ANSIBLE_CONFIG="./ansible.cfg"
  1. Install required packages

    1.1. Tools — emacs, jq, yq, git, and helm
    1.2. Python — Pip packages in user virtualenv
    1.3. Helm — Helm plugins: e.g. helm-diff

    ./play.sh packages
  2. Configure system settings

    2.1. Host — host name, time zone, and locale
    2.2. Kernel — sysctl params and pam_limits
    2.3. Network — DNS servers and search domains
    2.4. Login — customize login MOTD messages
    2.5. Certs — add CA certificates to trust store

    ./play.sh basics
  3. Set up admin user's home directory

    3.1. Dot files: .bash_aliases, etc.
    3.2. Config files: htop, fastfetch

    ./play.sh files
  4. Install Rancher Server on single-node K3s cluster

    ./play.sh rancher
  5. Provision Kubernetes cluster with RKE on 4 nodes

    Install RKE2 with a single control plane node and 3 worker nodes, all permitting workloads,
    or RKE2 in HA mode with 3 control plane nodes and 1 worker node, all permitting workloads
    (in HA mode, the cluster will be accessible through a virtual IP address courtesy of kube-vip).

    ./play.sh cluster
  6. Install Longhorn dynamic PV provisioner
    Install MinIO object storage in HA mode

    6.1. Create a pool of LVM logical volumes
    6.2. Install Longhorn storage components
    6.3. Install NFS dynamic PV provisioner
    6.4. Install MinIO tenant using NFS PVs

    ./play.sh storage minio
  7. Create resources from manifest files

    IMPORTANT: Resource manifests must specify the namespaces they are to be installed
    into, because the playbook simply applies each one without targeting a specific namespace.

    ./play.sh manifests
  8. Install Harbor private OCI registry

    ./play.sh harbor
  9. Install OpenSearch cluster in HA mode

    9.1. Configure the OpenSearch security plugin (users and roles) for downstream applications
    9.2. Install OpenSearch Dashboards UI

    ./play.sh opensearch
  10. Install Fluent Bit to ingest logs into OpenSearch

    ./play.sh logging
  11. Install PostgreSQL database in HA mode

    11.1. Run initialization SQL script to create roles and databases for downstream applications
    11.2. Create users in both PostgreSQL and Pgpool

    ./play.sh postgresql
  12. Install Keycloak IAM & OIDC provider

    12.1. Bootstrap PostgreSQL database with realm homelab, user erhhung, and OIDC clients

    ./play.sh keycloak
  13. Install Valkey key-value store in HA mode

    13.1. Deploy 6 nodes in total: 3 primaries and 3 replicas

    ./play.sh valkey
  14. Install Prometheus, Thanos, and Grafana in HA mode

    14.1. Expose Prometheus & Alertmanager UIs via oauth2-proxy integration with Keycloak
    14.2. Connect Thanos sidecars to MinIO to store scraped metrics in the metrics bucket
    14.3. Deploy and integrate other Thanos components with Prometheus and Alertmanager

    ./play.sh monitoring thanos
  15. Install Istio service mesh in ambient mode

    ./play.sh istio
  16. Install Argo CD GitOps delivery in HA mode

    16.1. Configure Argo CD components to use the Valkey cluster for their caching needs

    ./play.sh argocd
  17. Install Kubernetes Metacontroller add-on

    ./play.sh metacontroller
  18. Create virtual clusters in RKE running K0s

    ./play.sh vclusters

Alternatively, run all playbooks automatically in order:

# pass options like -v and --step
./play.sh [ansible-playbook-opts]

# run all playbooks starting from "storage"
# ("storage" is a playbook tag in main.yml)
./play.sh storage-

Output from play.sh will be logged in "ansible.log".

Multiple Passes Required

Because the Prometheus monitoring stack depends on Keycloak and Valkey, the monitoring.yml playbook must be run after most other playbooks. At the same time, the services it depends on also want to create ServiceMonitor resources, which require the Prometheus Operator CRDs. Therefore, a second pass through all playbooks, starting with storage.yml, is required to enable metrics collection on those services.
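
In practice, that amounts to something like:

# first pass: run all playbooks in order
./play.sh

# second pass: re-run from the "storage" playbook onward so the dependent
# services can create their ServiceMonitor resources against the now-present CRDs
./play.sh storage-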

Optional Playbooks

  1. Shut down all/specific VMs

    ansible-playbook shutdownvms.yml [-e targets={group|host|,...}]
  2. Create/revert/delete VM snapshots

    2.1. Create new snapshots (see the example after this list)

    ansible-playbook snapshotvms.yml [-e targets={group|host|,...}] \
                                      -e '{"desc":"text description"}'

    2.2. Revert to snapshots

    ansible-playbook snapshotvms.yml  -e do=revert \
                                     [-e targets={group|host|,...}]  \
                                      -e '{"desc":"text to search"}' \
                                     [-e '{"date":"YYYY-mm-dd prefix"}']

    2.3. Delete old snapshots

    ansible-playbook snapshotvms.yml  -e do=delete \
                                     [-e targets={group|host|,...}]  \
                                      -e '{"desc":"text to search"}' \
                                      -e '{"date":"YYYY-mm-dd prefix"}'
  3. Restart all/specific VMs

    ansible-playbook startvms.yml [-e targets={group|host|,...}]
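
For example, to snapshot only the RKE2 nodes before a risky change (the cluster group name matches the one used in the Troubleshooting section; adjust targets to your inventory):

ansible-playbook snapshotvms.yml -e targets=cluster \
                                 -e '{"desc":"before RKE2 upgrade"}'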

VM Storage

To expand the VM disk on a cluster node, the VM must be shut down (attempting to resize the disk from Xen Orchestra will fail with error: VDI in use).

Once the VM disk has been expanded, restart the VM and SSH into the node to resize the partition and LV.

$ sudo su

# verify new size
$ lsblk /dev/xvda

# resize partition
$ parted /dev/xvda
) print
Warning: Not all of the space available to /dev/xvda appears to be used...
Fix/Ignore? Fix

) resizepart 3 100%
# confirm new size
) print
) quit

# sync with kernel
$ partprobe

# confirm new size
$ lsblk /dev/xvda3

# resize VG volume
$ pvresize /dev/xvda3
Physical volume "/dev/xvda3" changed
1 physical volume(s) resized...

# confirm new size
$ pvdisplay

# show LV volumes
$ lvdisplay

# set exact LV size (G=GiB)
$ lvextend -vrL 50G /dev/ubuntu-vg/ubuntu-lv
# or grow LV by percentage
$ lvextend -vrl +90%FREE /dev/ubuntu-vg/ubuntu-lv
Extending logical volume ubuntu-vg/ubuntu-lv to up to...
fsadm: Executing resize2fs /dev/mapper/ubuntu--vg-ubuntu--lv
The filesystem on /dev/mapper/ubuntu--vg-ubuntu--lv is now...
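
If the cloud-guest-utils package is installed (an assumption, not part of the procedure above), the interactive parted and partprobe steps can be replaced with a single non-interactive command:

# grow partition 3 of /dev/xvda to fill the new space, then resize PV and LV
$ growpart /dev/xvda 3
$ pvresize /dev/xvda3
$ lvextend -rl +90%FREE /dev/ubuntu-vg/ubuntu-lv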

After expanding all desired disks, run ./diskfree.sh to verify available disk space on all cluster nodes:

$ ./diskfree.sh

rancher
-------
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda2       32G   18G   13G  60% /

k8s1
----
Filesystem                         Size  Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv   50G   21G   27G  44% /
/dev/mapper/ubuntu--vg-data--lv     30G  781M   30G   3% /data

k8s2
----
Filesystem                         Size  Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv   50G   22G   26G  47% /
/dev/mapper/ubuntu--vg-data--lv     30G  781M   30G   3% /data

k8s3
----
Filesystem                         Size  Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv   50G   23G   25G  48% /
/dev/mapper/ubuntu--vg-data--lv     30G  1.2G   29G   4% /data

k8s4
----
Filesystem                         Size  Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv   50G   27G   21G  57% /
/dev/mapper/ubuntu--vg-data--lv     30G  1.2G   29G   4% /data
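
A minimal sketch of what a script like diskfree.sh could do (the actual script in this repo may differ):

#!/usr/bin/env bash
# report root and /data filesystem usage on every cluster node
for host in rancher k8s1 k8s2 k8s3 k8s4; do
  echo "$host"
  echo "${host//?/-}"                 # dashed underline matching host name
  ssh "$host" 'df -h / /data 2>/dev/null'
  echo
done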

Troubleshooting

Ansible's ad-hoc commands are useful in these scenarios.

  1. Restart Kubernetes cluster services on all nodes

    ansible rancher          -m ansible.builtin.service -b -a "name=k3s         state=restarted"
    ansible control_plane_ha -m ansible.builtin.service -b -a "name=rke2-server state=restarted"
    ansible workers_ha       -m ansible.builtin.service -b -a "name=rke2-agent  state=restarted"

    NOTE: remove the _ha suffix from the target groups if the RKE cluster was deployed in non-HA mode.

  2. All kube-proxy static pods in continuous CrashLoopBackOff

    This turns out to be a Linux kernel bug in linux-image-6.8.0-56-generic and above (discovered on upgrade to linux-image-6.8.0-57-generic), causing this error in the container logs:

    ip6tables-restore v1.8.9 (nf_tables): unknown option "--xor-mark"
    

    Workaround is to downgrade to an earlier kernel:

    # list installed kernel images
    ansible -v k8s_all -a 'bash -c "dpkg -l | grep linux-image"'
    
    # install working kernel image
    ansible -v k8s_all -b -a 'apt-get install -y linux-image-6.8.0-55-generic'
    
    # point GRUB at working kernel image
    ansible -v rancher -m ansible.builtin.shell -b -a '
        kernel="6.8.0-55-generic"
        dvuuid=$(blkid -s UUID -o value /dev/xvda2)
        menuid="gnulinux-advanced-$dvuuid>gnulinux-$kernel-advanced-$dvuuid"
        sed -Ei "s/^(GRUB_DEFAULT=).+$/\\1\"$menuid\"/" /etc/default/grub
        grep GRUB_DEFAULT /etc/default/grub
    '
    ansible -v cluster -m ansible.builtin.shell -b -a '
        kernel="6.8.0-55-generic"
        dvuuid=$(blkid -s UUID -o value /dev/mapper/ubuntu--vg-ubuntu--lv)
        menuid="gnulinux-advanced-$dvuuid>gnulinux-$kernel-advanced-$dvuuid"
        sed -Ei "s/^(GRUB_DEFAULT=).+$/\\1\"$menuid\"/" /etc/default/grub
        grep GRUB_DEFAULT /etc/default/grub
    '
    # update /boot/grub/grub.cfg
    ansible -v k8s_all -b -a 'update-grub'
    
    # reboot nodes, one at a time
    ansible -v k8s_all -m ansible.builtin.reboot -b -a "post_reboot_delay=120" -f 1
    
    # confirm working kernel image
    ansible -v k8s_all -a 'uname -r'
    
    # remove old backup kernels only
    # (keep latest non-working kernel
    # so upgrade won't install again)
    ansible -v k8s_all -b -a 'apt-get autoremove -y --purge'
