Maintenance
These steps target bare metal clusters. For local VM testing, use Local Multipass Cluster.
Adding Nodes to the Cluster
This section covers expanding the cluster with additional worker nodes or control plane nodes.
You can also add nodes by updating ansible/inventory/hosts.yaml and rerunning ansible-playbook ansible/playbooks/provision-cpu.yaml.
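A minimal sketch of that flow, assuming the inventory has a workers group and a hypothetical host named new-worker-01 (adjust both to match your hosts.yaml):
# Hypothetical entry to add under the workers group in ansible/inventory/hosts.yaml:
#
#   workers:
#     hosts:
#       new-worker-01:
#         ansible_host: 192.168.1.50
#
# Then provision only the new host, leaving existing nodes untouched:
ansible-playbook -i ansible/inventory/hosts.yaml \
ansible/playbooks/provision-cpu.yaml --limit new-worker-01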
Prerequisites for New Nodes
Use Ansible to prepare new nodes so the baseline is consistent across the cluster.
ansible-playbook -i ansible/inventory/hosts.yaml \
ansible/playbooks/provision-cpu.yaml
For Nodes with GPUs
Use the GPU-specific playbooks to enable the runtime and device plugins:
ansible-playbook -i ansible/inventory/hosts.yaml \
ansible/playbooks/provision-nvidia-gpu.yaml
ansible-playbook -i ansible/inventory/hosts.yaml \
ansible/playbooks/provision-intel-gpu.yaml
Generate Join Token (On Existing Control Plane)
Tokens expire after 24 hours. Run this on an existing control plane node to generate a new join command:
kubeadm token create --print-join-command
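If you are not sure whether an earlier token is still valid, list existing tokens and their TTLs before creating a new one:
sudo kubeadm token list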
Add a Worker Node
Run the join command from above on the new worker node:
sudo kubeadm join <control-plane-ip>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>
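From any machine with cluster access, confirm the new node appears and reaches Ready:
kubectl get nodes -o wide --watch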
Add a Control Plane Node (HA Setup)
For an HA control plane, you need a load balancer in front of all control plane nodes, and the first control plane must have been initialized with --control-plane-endpoint=<load-balancer-ip>:6443.
Generate a certificate key on an existing control plane:
sudo kubeadm init phase upload-certs --upload-certs
Then join with the --control-plane flag:
sudo kubeadm join <load-balancer-ip>:6443 \
--token <token> \
--discovery-token-ca-cert-hash sha256:<hash> \
--control-plane \
--certificate-key <certificate-key>
After joining, configure kubectl on the new control plane:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
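To confirm the node joined as a control plane rather than a worker, list nodes carrying the control-plane role label:
kubectl get nodes -l node-role.kubernetes.io/control-plane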
Routine checks
Use these checks if the cluster runs unattended for long periods.
kubectl get nodes
kubectl get pods -A
kubectl get apps -n argocd
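To surface problems faster, it can also help to filter for pods that are not running and to summarize ArgoCD sync state; the selectors and column paths below are one way to do that:
# Pods that are neither Running nor Succeeded
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
# Sync and health status per ArgoCD application
kubectl get applications -n argocd \
-o custom-columns=NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status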
Namespace migration
Use this flow to move existing apps out of the default namespace.
Step 1: Add a namespace and update manifests
Create apps/<app-name>/namespace.yaml and apps/<app-name>/app.yaml, then update every manifest in the app folder to use namespace: <app-name>. If two apps must share storage, point both app.yaml files at the same namespace.
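As a minimal sketch, assuming a hypothetical app named my-app (the Application spec in app.yaml is repo-specific, so only the namespace manifest is shown):
# Create apps/my-app/namespace.yaml; ArgoCD applies it together with the rest of the app
cat > apps/my-app/namespace.yaml <<'EOF'
apiVersion: v1
kind: Namespace
metadata:
  name: my-app
EOF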
Step 2: Let ArgoCD sync
Once the changes are pushed, ArgoCD will reconcile the app into the new namespace.
Step 3: Remove old resources
After the new namespace is healthy, delete the old resources in default to avoid conflicts.
kubectl delete deployment,service,httproute -n default -l app=<app-name>
If the app owns PVCs, plan a data migration before deleting the old claims.
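Before deleting anything, list the claims that would be affected, assuming the PVCs carry the same app=<app-name> label used above:
kubectl get pvc -n default -l app=<app-name>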
Node maintenance window
Use this flow to patch or reboot a node safely.
Step 1: Drain the node
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
Step 2: Apply OS updates or reboot
Perform the host maintenance, then confirm the node is back online.
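One possible sequence for a routine Ubuntu patch cycle (adapt to whatever maintenance the host actually needs):
sudo apt-get update && sudo apt-get -y upgrade
sudo reboot
# From a machine with cluster access, confirm the node reports Ready again:
kubectl get nodes <node-name>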
Step 3: Uncordon the node
kubectl uncordon <node-name>
Upgrade Kubernetes (Bare Metal)
Upgrade control plane nodes first, then upgrade workers.
Check the latest stable Kubernetes release before setting k8s_version and TARGET_VERSION.
Step 1: Update versions and packages
Update k8s_version in ansible/group_vars/all.yaml, then run the provisioning playbook against all nodes.
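This is the same provisioning playbook used when adding nodes; running it against the full inventory converges every host to the new k8s_version:
ansible-playbook -i ansible/inventory/hosts.yaml \
ansible/playbooks/provision-cpu.yaml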
Step 2: Upgrade the control plane
Run these on each control plane node, one at a time:
sudo kubeadm upgrade plan
TARGET_VERSION="v1.35.x"
sudo kubeadm upgrade apply ${TARGET_VERSION}
sudo apt-get install -y kubelet kubeadm kubectl
sudo systemctl restart kubelet
Replace v1.35.x with the target patch version after checking the latest stable release.
Step 3: Upgrade each worker node
Drain, upgrade, then uncordon each worker:
kubectl drain <worker-name> --ignore-daemonsets --delete-emptydir-data
sudo kubeadm upgrade node
sudo apt-get install -y kubelet kubeadm kubectl
sudo systemctl restart kubelet
kubectl uncordon <worker-name>
Step 4: Verify the upgrade
kubectl get nodes
kubectl get pods -A
Version Management (Ansible + ArgoCD)
Use this flow to keep every node consistent while maintaining HA.
Step 1: Update the pinned versions
Host-level versions are pinned in ansible/group_vars/all.yaml:
- k8s_version for kubeadm/kubelet/kubectl
- containerd_version for the container runtime
- cilium_version for the CNI
Cluster-level components (Longhorn, Tailscale, Envoy Gateway, cert-manager, ExternalDNS, ArgoCD) are pinned in their ArgoCD templates or manifests:
- bootstrap/templates/longhorn.yaml
- infrastructure/tailscale/tailscale-operator.yaml
- infrastructure/envoy-gateway/envoy-gateway.yaml
- infrastructure/envoy-gateway-crds/
- infrastructure/gateway-api-crds/
- infrastructure/cert-manager/cert-manager.yaml
- infrastructure/external-dns/external-dns.yaml
- infrastructure/external-secrets-crds/
- bootstrap/templates/*-appset.yaml for repo references
Step 2: Apply the change consistently
- Host packages (Kubernetes, containerd): rerun the provisioning playbook on all nodes so every host converges to the same version.
- Cluster add-ons (Cilium, Longhorn, ArgoCD): update the version in Git and let ArgoCD sync. For Cilium, run cilium upgrade after updating cilium_version.
Step 3: Keep HA during updates
- Upgrade control planes one at a time, then workers.
- Drain nodes before upgrades and uncordon after, as described in the Kubernetes upgrade steps above.
- Verify health between nodes with kubectl get nodes and kubectl get pods -A.
Upgrade Cilium
Update cilium_version in ansible/group_vars/all.yaml, then run:
CILIUM_VERSION=$(grep -E "cilium_version:" ansible/group_vars/all.yaml | head -n 1 | awk -F'"' '{print $2}')
cilium upgrade --version ${CILIUM_VERSION}
cilium status --wait
Upgrade ArgoCD
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
kubectl wait --for=condition=available --timeout=600s deployment/argocd-server -n argocd
Upgrade Longhorn
Update targetRevision in bootstrap/templates/longhorn.yaml, then let ArgoCD sync the application.
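To watch the rollout, check the ArgoCD application and the Longhorn pods; the application name and namespace below are assumptions, so match them to your templates:
kubectl get application longhorn -n argocd
kubectl get pods -n longhorn-system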
Recreate From Scratch (Ubuntu 24.04)
Use this section when rebuilding a node from bare metal or VM images:
Base setup
Install Ubuntu 24.04 LTS and log in as a user with sudo. Clone this repo and review the pinned versions and paths in ansible/group_vars/all.yaml.
Automated setup
Run the Ansible provisioning playbook, then continue at Kubernetes.
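This is the same command shown under Prerequisites for New Nodes:
ansible-playbook -i ansible/inventory/hosts.yaml \
ansible/playbooks/provision-cpu.yaml
For hosts with GPUs, also run the matching GPU playbook from For Nodes with GPUs above.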
Manual setup
Follow the bare metal tutorial path in order, starting with Prerequisites.
If you are rebuilding with existing data disks for Longhorn, ensure the storage path in bootstrap/templates/longhorn.yaml points to the correct mount before applying bootstrap/root.yaml.
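A quick way to sanity-check the path before applying anything is to inspect the template and the mount; the grep pattern and mount point below are assumptions, so adjust them to your setup:
grep -in "datapath" bootstrap/templates/longhorn.yaml
df -h /var/lib/longhorn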