Provision and Bootstrap the Three Kubernetes Control Planes

This is the fourth infrastructure page. During a from-scratch rebuild, start it only after the preceding load-balancer page has provisioned and verified the shared API load balancer, HAProxy, Keepalived, and the API VIP. This page then provisions and bootstraps the three highly available Kubernetes control-plane VMs.

Load-balancer VM provisioning                 Required prerequisite
HAProxy configuration                         Required prerequisite
Keepalived configuration                      Required prerequisite
API VIP 192.168.8.200                         Required prerequisite
Control-plane Terraform definitions           Part A of this page
Control-plane VM provisioning                 Part A expected result
Control-plane Ubuntu baseline                 Part B of this page
Kubernetes prerequisites                      Part B of this page
First control-plane bootstrap                 Part B of this page
CNI installation                              Part B of this page
Additional control-plane joins                Part B of this page
Development workers                           Next page
QA workers                                    Later page
Production workers                            Later page
ARC controller                                Later page
Tenant runner scale sets                      Later page

1. Scope and execution order

This page is performed in two separate promotions:

  1. Terraform promotion: create only the three control-plane VMs.
  2. Ansible promotion: after all three VMs are reachable, configure Ubuntu, containerd, Kubernetes, the first control plane, Calico, and the remaining control planes.

Do not combine the Terraform and Ansible changes into the same production promotion. The Ansible workflow must not start until Terraform has created all three VMs.

The approved branch model is:

feature/*
    ↓ pull request
   dev
    ↓ Terraform validation and plan only
 dev → prod pull request
    ↓ review
   prod
    ↓ Terraform apply or Ansible configuration

Not used by this infrastructure execution flow:
local, qa, main

Branch rule

dev performs validation and Terraform plan only. prod performs Terraform apply or Ansible configuration. Do not use local or qa in this infrastructure execution flow.

2. Approved control-plane VM allocation

cicd-ac-k8s-cp-01
  VM ID:       3156202
  MAC:         aa:bb:cc:05:0e:01
  Reserved IP: 192.168.8.202
  CPU:         4 vCPU
  RAM:         8192 MB
  Disk:        scsi0, 100G, local-lvm

cicd-ac-k8s-cp-02
  VM ID:       3156203
  MAC:         aa:bb:cc:05:0e:02
  Reserved IP: 192.168.8.203
  CPU:         4 vCPU
  RAM:         8192 MB
  Disk:        scsi0, 100G, local-lvm

cicd-ac-k8s-cp-03
  VM ID:       3156204
  MAC:         aa:bb:cc:05:0e:03
  Reserved IP: 192.168.8.204
  CPU:         4 vCPU
  RAM:         8192 MB
  Disk:        scsi0, 100G, local-lvm

Shared values
  Template:    tmplt-ub-26-min-base / VM ID 90000
  Node:        pve
  Bridge:      vmbr0
  API VIP:     192.168.8.200
  API endpoint: cicd-ac-k8s-api.aspireclan.com:443

Confirm that the router already contains these DHCP reservations and that no existing device is using .202, .203, or .204.

3. Approved Kubernetes and CNI versions

Kubernetes minor repository: v1.36
Kubernetes release:         v1.36.1
Kubernetes DEB version:     1.36.1-1.1
kubeadm API:                kubeadm.k8s.io/v1beta4
Container runtime:          containerd
CRI socket:                 unix:///run/containerd/containerd.sock
CNI:                        Calico v3.32.0
Pod CIDR:                   10.244.0.0/16
Service CIDR:               10.96.0.0/12
Service DNS domain:         cluster.local
Calico encapsulation:       VXLAN
Calico BGP:                 Disabled

The Pod CIDR deliberately uses 10.244.0.0/16. Do not use Calico's example default 192.168.0.0/16, because that overlaps the Aspireclan home-lab network.

Do not change cluster-wide CIDRs after bootstrap

Changing the Pod CIDR or Service CIDR after the cluster is created is a disruptive redesign. Confirm these values before the first kubeadm init.
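
The overlap concern above can be checked mechanically before the first kubeadm init. The following is a minimal POSIX shell sketch, not part of the repository: the function names are hypothetical, and it assumes the home-lab LAN is the 192.168.8.0/22 range that the UFW rules later on this page allow.

```shell
# Sketch: test whether two IPv4 CIDR blocks share any address.

ip_to_int() {
  # Split the dotted quad on "." and combine the four octets into one integer.
  old_ifs=$IFS
  IFS=.
  set -- $1
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

cidr_range() {
  # Print "start end" (inclusive integer bounds) for a block such as 10.244.0.0/16.
  addr=${1%/*}
  prefix=${1#*/}
  base=$(ip_to_int "$addr")
  size=$(( 1 << (32 - $prefix) ))
  start=$(( $base - $base % $size ))
  echo "$start $(( $start + $size - 1 ))"
}

cidrs_overlap() {
  # After set: $1/$2 bound the first block, $3/$4 bound the second.
  set -- $(cidr_range "$1") $(cidr_range "$2")
  # Two inclusive ranges overlap when each one starts before the other ends.
  [ "$1" -le "$4" ] && [ "$3" -le "$2" ]
}

lan_cidr="192.168.8.0/22"   # assumption: taken from the UFW rules on this page

for candidate in 10.244.0.0/16 10.96.0.0/12 192.168.0.0/16; do
  if cidrs_overlap "$candidate" "$lan_cidr"; then
    echo "$candidate overlaps $lan_cidr"
  else
    echo "$candidate is clear of $lan_cidr"
  fi
done
```

Of the three candidates, only Calico's example default 192.168.0.0/16 should be reported as overlapping, which is the reason the approved Pod CIDR is 10.244.0.0/16.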


Part A — Provision the Control-Plane VMs with Terraform

4. Terraform files changed

terraform/modules/proxmox-vm-group/main.tf
terraform/modules/proxmox-vm-group/variables.tf
terraform/modules/proxmox-vm-group/outputs.tf
terraform/modules/proxmox-vm-group/versions.tf
terraform/stacks/shared-k8s/main.tf
terraform/stacks/shared-k8s/outputs.tf

The reusable proxmox-vm module from the load-balancer page remains unchanged. This page implements the group wrapper and adds three control-plane definitions to the intermediate shared stack. Later worker pages extend this same stack; they are intentionally absent here.

5. Create the Terraform feature branch from dev

Run from Windows PowerShell:

cd D:\code\ASPIRECLAN-LLC-Org\ac-cicd-infra

git switch dev
git pull --ff-only origin dev

git switch -c feature/provision-k8s-control-planes

6. Implement terraform/modules/proxmox-vm-group

6.1 Replace variables.tf

variable "vms" {
  description = "Map of Proxmox VM definitions keyed by a stable logical name."

  type = map(object({
    name          = string
    description   = string
    vmid          = number
    target_node   = string
    template_name = string
    cores         = number
    memory_mb     = number
    disk_size     = string
    storage       = string
    bridge        = string
    mac_address   = string
    tags          = list(string)
  }))

  validation {
    condition = alltrue([
      for vm in values(var.vms) :
      can(regex("^([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$", vm.mac_address))
    ])
    error_message = "Every VM must have a valid six-byte colon-separated MAC address."
  }
}
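
To see what the validation rule accepts before running terraform validate, the same pattern can be tried from a shell. This is only a sketch: is_valid_mac is a hypothetical helper, and it relies on the pattern behaving the same way under grep -E as it does in Terraform's regex function, which holds for this simple expression.

```shell
# Sketch: apply the MAC pattern from the Terraform validation block with grep -E.

is_valid_mac() {
  # Six two-digit hex bytes separated by colons, and nothing else on the line.
  printf '%s\n' "$1" | grep -Eq '^([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$'
}

for mac in aa:bb:cc:05:0e:01 aa:bb:cc:05:0e aa-bb-cc-05-0e-01; do
  if is_valid_mac "$mac"; then
    echo "valid:   $mac"
  else
    echo "invalid: $mac"
  fi
done
```

Only the first value has six colon-separated bytes. The five-byte and dash-separated forms are the kind of input the error message in the variable block is meant to catch.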

6.2 Replace main.tf

module "vm" {
  for_each = var.vms

  source = "../proxmox-vm"

  name          = each.value.name
  description   = each.value.description
  vmid          = each.value.vmid
  target_node   = each.value.target_node
  template_name = each.value.template_name
  cores         = each.value.cores
  memory_mb     = each.value.memory_mb
  disk_size     = each.value.disk_size
  storage       = each.value.storage
  bridge        = each.value.bridge
  mac_address   = each.value.mac_address
  tags          = each.value.tags
}

6.3 Replace outputs.tf

output "vms" {
  description = "Created Proxmox VMs keyed by their logical map key."

  value = {
    for key, vm in module.vm : key => {
      name        = vm.name
      vmid        = vm.vmid
      target_node = vm.target_node
    }
  }
}

6.4 Replace versions.tf

terraform {
  required_version = ">= 1.15.5"

  required_providers {
    proxmox = {
      source  = "Telmate/proxmox"
      version = "3.0.1-rc9"
    }
  }
}

7. Extend the shared Kubernetes Terraform stack

7.1 Replace terraform/stacks/shared-k8s/main.tf

This is the complete intermediate file for this checkpoint. It preserves the load-balancer definition from the preceding page and adds exactly three control-plane VMs. Later worker pages replace this file with extended versions.

module "api_load_balancer" {
  source = "../../modules/proxmox-vm"

  name        = "cicd-ac-k8s-lb-01"
  description = "Aspireclan shared Kubernetes API load balancer"

  vmid        = 3156201
  target_node = "pve"

  template_name = "tmplt-ub-26-min-base"

  cores     = 2
  memory_mb = 4096

  disk_size = "40G"
  storage   = "local-lvm"

  bridge      = "vmbr0"
  mac_address = "aa:bb:cc:05:14:01"

  tags = [
    "ac-cicd",
    "shared-k8s",
    "load-balancer",
    "terraform",
    "ansible",
  ]
}

locals {
  control_planes = {
    cp01 = {
      name          = "cicd-ac-k8s-cp-01"
      description   = "Aspireclan shared Kubernetes control plane 01"
      vmid          = 3156202
      target_node   = "pve"
      template_name = "tmplt-ub-26-min-base"
      cores         = 4
      memory_mb     = 8192
      disk_size     = "100G"
      storage       = "local-lvm"
      bridge        = "vmbr0"
      mac_address   = "aa:bb:cc:05:0e:01"
      tags = [
        "ac-cicd",
        "shared-k8s",
        "control-plane",
        "terraform",
        "ansible",
      ]
    }

    cp02 = {
      name          = "cicd-ac-k8s-cp-02"
      description   = "Aspireclan shared Kubernetes control plane 02"
      vmid          = 3156203
      target_node   = "pve"
      template_name = "tmplt-ub-26-min-base"
      cores         = 4
      memory_mb     = 8192
      disk_size     = "100G"
      storage       = "local-lvm"
      bridge        = "vmbr0"
      mac_address   = "aa:bb:cc:05:0e:02"
      tags = [
        "ac-cicd",
        "shared-k8s",
        "control-plane",
        "terraform",
        "ansible",
      ]
    }

    cp03 = {
      name          = "cicd-ac-k8s-cp-03"
      description   = "Aspireclan shared Kubernetes control plane 03"
      vmid          = 3156204
      target_node   = "pve"
      template_name = "tmplt-ub-26-min-base"
      cores         = 4
      memory_mb     = 8192
      disk_size     = "100G"
      storage       = "local-lvm"
      bridge        = "vmbr0"
      mac_address   = "aa:bb:cc:05:0e:03"
      tags = [
        "ac-cicd",
        "shared-k8s",
        "control-plane",
        "terraform",
        "ansible",
      ]
    }
  }
}

module "control_planes" {
  source = "../../modules/proxmox-vm-group"

  vms = local.control_planes
}

7.2 Replace terraform/stacks/shared-k8s/outputs.tf

output "api_load_balancer" {
  description = "Kubernetes API load-balancer VM."
  value = {
    name        = module.api_load_balancer.name
    vmid        = module.api_load_balancer.vmid
    target_node = module.api_load_balancer.target_node
    reserved_ip = "192.168.8.201"
    api_vip     = "192.168.8.200"
  }
}

output "control_planes" {
  description = "Shared Kubernetes control-plane VMs."

  value = {
    cp01 = merge(module.control_planes.vms["cp01"], {
      reserved_ip = "192.168.8.202"
      mac_address = "aa:bb:cc:05:0e:01"
    })
    cp02 = merge(module.control_planes.vms["cp02"], {
      reserved_ip = "192.168.8.203"
      mac_address = "aa:bb:cc:05:0e:02"
    })
    cp03 = merge(module.control_planes.vms["cp03"], {
      reserved_ip = "192.168.8.204"
      mac_address = "aa:bb:cc:05:0e:03"
    })
  }
}

8. Confirm the existing Terraform workflow contract

No branch expansion is required. The cleaned repository workflows use push-based validation and apply behavior and must continue to implement:

Terraform plan workflow
  push branch:           dev
  manual dispatch:       supported
  action:                fmt, init, validate, plan

Terraform apply workflow
  push branch:           prod
  manual dispatch:       supported only from prod
  action:                fmt, init, validate, saved plan, apply

Persistent state
  /var/lib/ac-cicd-infra/terraform-state/shared-k8s/terraform.tfstate

The plan workflow must use actions/checkout@v5 and run from dev. The apply workflow must run only from prod and must use the same persistent local state file introduced by the load-balancer page.

9. Review and commit the Terraform change

git status
git diff --check
git diff --stat

git diff -- `
  terraform/modules/proxmox-vm-group `
  terraform/stacks/shared-k8s/main.tf `
  terraform/stacks/shared-k8s/outputs.tf

Confirm all of the following:

  • The load-balancer module remains present and unchanged.
  • Exactly three new VM resources are proposed.
  • VM IDs are 3156202, 3156203, and 3156204.
  • MAC addresses end in 0e:01, 0e:02, and 0e:03.
  • Every control-plane disk is scsi0, 100G, and local-lvm.
  • No .tfstate, secret, token, or private key is staged.

Commit and push:

git add `
  terraform/modules/proxmox-vm-group `
  terraform/stacks/shared-k8s/main.tf `
  terraform/stacks/shared-k8s/outputs.tf

git commit -m "Provision shared Kubernetes control planes"

git push -u origin feature/provision-k8s-control-planes

10. Create the Terraform pull request into dev

gh pr create `
  --base dev `
  --head feature/provision-k8s-control-planes `
  --title "Provision shared Kubernetes control planes" `
  --body "Adds the three approved control-plane VMs to the shared Kubernetes Terraform stack."

After merge, the dev plan must end with:

Plan: 3 to add, 0 to change, 0 to destroy.

Terraform stop conditions

Do not promote to prod if the plan proposes any update, replacement, or deletion of cicd-ac-k8s-lb-01, or if it proposes anything other than the three approved control-plane VMs.
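
The stop conditions can also be checked against a saved copy of the plan text. The sketch below is an illustration rather than part of the workflow: both function names are hypothetical, it assumes the plan was rendered with -no-color so the summary line carries no escape codes, and it assumes changed resources appear under "# <address>" headers beginning with module.api_load_balancer.

```shell
# Sketch: evaluate the Terraform stop conditions against plan text.

plan_is_approved() {
  # $1: full plan output. Succeeds only if the exact approved summary line is present.
  printf '%s\n' "$1" | grep -Fxq 'Plan: 3 to add, 0 to change, 0 to destroy.'
}

plan_touches_load_balancer() {
  # $1: full plan output. Succeeds if any resource header names the load-balancer module.
  printf '%s\n' "$1" | grep -Eq '^ *# module\.api_load_balancer\.'
}

# The summary line is the one this page requires; a real plan has many more lines.
sample_plan='Plan: 3 to add, 0 to change, 0 to destroy.'

if plan_is_approved "$sample_plan" && ! plan_touches_load_balancer "$sample_plan"; then
  echo "plan matches the approved change"
else
  echo "STOP: plan differs from the approved change"
fi
```

A check like this supplements the pull-request review and does not replace reading the plan.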

11. Promote the Terraform change from dev to prod

gh pr create `
  --base prod `
  --head dev `
  --title "Provision shared Kubernetes control planes" `
  --body "Promotes the validated three-control-plane Terraform plan to prod."

After merge and environment approval, the production apply should create the three VMs and update the persistent state file.

12. Verify the VMs in Proxmox

Run on the Proxmox host:

qm status 3156202
qm config 3156202

qm status 3156203
qm config 3156203

qm status 3156204
qm config 3156204

Confirm each VM is running, has the approved VM ID, has four CPU cores, 8192 MB RAM, a 100G scsi0 disk on local-lvm, and the correct MAC address.
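
The comparison above can be scripted instead of read by eye. This is a minimal sketch under stated assumptions: check_control_plane_config is a hypothetical helper, it assumes qm config prints one "key: value" pair per line with the MAC inside the net0 value and the size inside the scsi0 value, and the sample text is hand-written for illustration (including the disk volume name), not captured output.

```shell
# Sketch: compare `qm config <vmid>` text on stdin with the approved allocation.
# $1: expected MAC address, compared case-insensitively.

check_control_plane_config() {
  awk -v want_mac="$1" '
    BEGIN { want_mac = tolower(want_mac) }
    $1 == "cores:"  && $2 == 4    { cores_ok = 1 }
    $1 == "memory:" && $2 == 8192 { memory_ok = 1 }
    # Wrap the option list in commas so the MAC and size match as whole options.
    $1 == "net0:"   && index(tolower($2) ",", "=" want_mac ",") { mac_ok = 1 }
    $1 == "scsi0:"  && index($2, "local-lvm:") == 1 &&
                       index("," $2 ",", ",size=100G,") { disk_ok = 1 }
    END { exit !(cores_ok && memory_ok && mac_ok && disk_ok) }
  '
}

# Hand-written sample in the assumed layout; the volume name is hypothetical.
sample_config='cores: 4
memory: 8192
net0: virtio=AA:BB:CC:05:0E:01,bridge=vmbr0
scsi0: local-lvm:vm-3156202-disk-0,size=100G'

if printf '%s\n' "$sample_config" | check_control_plane_config aa:bb:cc:05:0e:01; then
  echo "3156202 matches the approved allocation"
else
  echo "3156202 differs from the approved allocation"
fi
```

On the Proxmox host the same function could be fed directly, for example qm config 3156202 | check_control_plane_config aa:bb:cc:05:0e:01. The running state still has to be confirmed with qm status.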

13. Verify DHCP and SSH

Run from prod-terraform-deploy-02:

for ip in 202 203 204; do
  echo "=== 192.168.8.${ip} ==="

  ping -c 2 -W 2 "192.168.8.${ip}"

  ssh \
    -i ~/.ssh/id_ed25519_ansible \
    -o IdentitiesOnly=yes \
    -o BatchMode=yes \
    -o ConnectTimeout=10 \
    "acllc@192.168.8.${ip}" \
    'hostnamectl --static; ip -brief address; sudo -n whoami'
done

Expected before Ansible:

  • All three IP addresses respond.
  • The Ansible key authenticates as acllc.
  • Passwordless sudo returns root.
  • The Ubuntu hostname may still be the template hostname.

Stop here until all three nodes pass SSH and sudo checks.


Part B — Bootstrap the Highly Available Control Plane with Ansible

14. Ansible and Kubernetes files changed

ansible/ansible.cfg
ansible/inventories/shared-k8s/hosts.ini
ansible/inventories/shared-k8s/group_vars/all.yml
ansible/inventories/shared-k8s/group_vars/control_planes.yml
ansible/roles/common/tasks/main.yml
ansible/roles/haproxy/tasks/main.yml
ansible/roles/containerd/tasks/main.yml
ansible/roles/containerd/handlers/main.yml
ansible/roles/kubernetes-common/tasks/main.yml
ansible/roles/kubernetes-common/handlers/main.yml
ansible/roles/kubernetes-control-plane/defaults/main.yml
ansible/roles/kubernetes-control-plane/tasks/main.yml
ansible/roles/kubernetes-control-plane/tasks/bootstrap.yml
ansible/roles/kubernetes-control-plane/tasks/join.yml
ansible/roles/kubernetes-control-plane/templates/kubeadm-init-config.yaml.j2
ansible/playbooks/shared-k8s/01-common-baseline.yml
ansible/playbooks/shared-k8s/02-configure-load-balancer.yml
ansible/playbooks/shared-k8s/03-prepare-kubernetes-nodes.yml
ansible/playbooks/shared-k8s/04-bootstrap-first-control-plane.yml
ansible/playbooks/shared-k8s/05-join-control-planes.yml
ansible/playbooks/shared-k8s/06-install-cni.yml
kubernetes/common/cni/calico-custom-resources.yaml
.github/workflows/ansible-configure-control-planes.yml

This phase uses stacked etcd: each control-plane VM runs its own API server and etcd member. Join commands, bootstrap tokens, and the certificate key are generated only in memory during the workflow and are hidden from logs.

15. Create the Ansible feature branch from dev

Create this branch only after the production Terraform apply has completed successfully:

cd D:\code\ASPIRECLAN-LLC-Org\ac-cicd-infra

git switch dev
git pull --ff-only origin dev

git switch -c feature/bootstrap-k8s-control-planes

16. Replace ansible/ansible.cfg

[defaults]
inventory = inventories/shared-k8s/hosts.ini
roles_path = roles
host_key_checking = True
retry_files_enabled = False
interpreter_python = auto_silent
stdout_callback = default
inject_facts_as_vars = False
remote_tmp = /var/tmp/ansible-acllc
timeout = 30

[ssh_connection]
pipelining = True
ssh_args = -o IdentitiesOnly=yes -o ServerAliveInterval=30 -o ServerAliveCountMax=4

This keeps the permanent APT-cache fix, disables deprecated fact injection, and uses /var/tmp/ansible-acllc to avoid the remote temporary-directory warning.

17. Replace the shared inventory

Replace ansible/inventories/shared-k8s/hosts.ini with:

[load_balancers]
cicd-ac-k8s-lb-01 ansible_host=192.168.8.201 ansible_user=acllc node_primary_ip=192.168.8.201 node_interface=ens18

[first_control_plane]
cicd-ac-k8s-cp-01 ansible_host=192.168.8.202 ansible_user=acllc node_primary_ip=192.168.8.202 node_interface=ens18

[additional_control_planes]
cicd-ac-k8s-cp-02 ansible_host=192.168.8.203 ansible_user=acllc node_primary_ip=192.168.8.203 node_interface=ens18
cicd-ac-k8s-cp-03 ansible_host=192.168.8.204 ansible_user=acllc node_primary_ip=192.168.8.204 node_interface=ens18

[control_planes:children]
first_control_plane
additional_control_planes

[dev_workers]

[qa_workers]

[prod_workers]

[workers:children]
dev_workers
qa_workers
prod_workers

[k8s_cluster:children]
control_planes
workers

[all:vars]
ansible_python_interpreter=/usr/bin/python3

The first control plane is deliberately separated from the two additional control planes so bootstrap and join operations can target the correct machines.

18. Add cluster-wide variables

Replace ansible/inventories/shared-k8s/group_vars/all.yml with:

---
cluster_admin_user: acllc

kubernetes_version: "v1.36.1"
kubernetes_package_version: "1.36.1-1.1"
kubernetes_minor_repository: "v1.36"
kubernetes_cri_socket: "unix:///run/containerd/containerd.sock"
kubernetes_api_endpoint: "cicd-ac-k8s-api.aspireclan.com:443"
kubernetes_api_vip: "192.168.8.200"
kubernetes_api_backend_port: 6443
kubernetes_pod_cidr: "10.244.0.0/16"
kubernetes_service_cidr: "10.96.0.0/12"
kubernetes_dns_domain: "cluster.local"

calico_version: "v3.32.0"
calico_crd_url: "https://raw.githubusercontent.com/projectcalico/calico/v3.32.0/manifests/v1_crd_projectcalico_org.yaml"
calico_operator_url: "https://raw.githubusercontent.com/projectcalico/calico/v3.32.0/manifests/tigera-operator.yaml"

managed_hosts_entries:
  - ip: 192.168.8.200
    names:
      - cicd-ac-k8s-api.aspireclan.com
      - cicd-ac-k8s-api
  - ip: 192.168.8.201
    names:
      - cicd-ac-k8s-lb-01
  - ip: 192.168.8.202
    names:
      - cicd-ac-k8s-cp-01
  - ip: 192.168.8.203
    names:
      - cicd-ac-k8s-cp-02
  - ip: 192.168.8.204
    names:
      - cicd-ac-k8s-cp-03

19. Add control-plane variables

Replace ansible/inventories/shared-k8s/group_vars/control_planes.yml with:

---
kubernetes_node_tcp_ports:
  - "6443"
  - "2379:2380"
  - "10250"
  - "10257"
  - "10259"

calico_node_tcp_ports:
  - "5473"

calico_node_udp_ports:
  - "4789"

UFW is not enabled by this page, but the required Kubernetes and Calico rules are pre-created so a later firewall-hardening phase does not break the cluster.
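
To preview which rules these variables lead to, the commands can be printed without touching any firewall. This is a dry-run sketch only: print_ufw_rules is a hypothetical helper, and the source range and command shape are copied from the UFW tasks in the kubernetes-common role on this page.

```shell
# Sketch: print (do not run) the UFW commands implied by the port variables.

lan_cidr="192.168.8.0/22"   # source range used by the role's UFW tasks

print_ufw_rules() {
  # $1: protocol; remaining arguments: ports or port ranges.
  proto=$1
  shift
  for port in "$@"; do
    echo "ufw allow from $lan_cidr to any port $port proto $proto"
  done
}

# kubernetes_node_tcp_ports followed by calico_node_tcp_ports
print_ufw_rules tcp 6443 2379:2380 10250 10257 10259 5473
# calico_node_udp_ports
print_ufw_rules udp 4789
```

The output is one ufw allow line per port entry, which makes it easy to compare the planned rule set with ufw show added on a configured node.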

20. Generalize the common Ubuntu role

Replace ansible/roles/common/tasks/main.yml with:

---
- name: Confirm required host identity variables are defined
  ansible.builtin.assert:
    that:
      - node_primary_ip is defined
      - node_interface is defined
      - inventory_hostname | length > 0
    fail_msg: >-
      The inventory must define node_primary_ip and node_interface for every host.

- name: Confirm the target IP and interface match the approved inventory
  ansible.builtin.assert:
    that:
      - ansible_facts["default_ipv4"]["address"] == node_primary_ip
      - ansible_facts["default_ipv4"]["interface"] == node_interface
    fail_msg: >-
      The detected default IPv4 address or interface does not match the approved inventory.

- name: Force refresh the APT package cache
  ansible.builtin.apt:
    update_cache: true
  register: common_apt_cache_refresh
  retries: 5
  delay: 15
  until: common_apt_cache_refresh is succeeded

- name: Install common operating-system packages
  ansible.builtin.apt:
    name:
      - ca-certificates
      - curl
      - gpg
      - jq
      - qemu-guest-agent
      - ufw
    state: present
  register: common_package_install
  retries: 3
  delay: 10
  until: common_package_install is succeeded

- name: Ensure the Ansible remote temporary directory exists
  ansible.builtin.file:
    path: /var/tmp/ansible-acllc
    state: directory
    owner: "{{ cluster_admin_user }}"
    group: "{{ cluster_admin_user }}"
    mode: "0700"

- name: Write the permanent hostname file
  ansible.builtin.copy:
    dest: /etc/hostname
    content: "{{ inventory_hostname }}\n"
    owner: root
    group: root
    mode: "0644"

- name: Set the active system hostname
  ansible.builtin.hostname:
    name: "{{ inventory_hostname }}"

- name: Set the local hostname mapping
  ansible.builtin.lineinfile:
    path: /etc/hosts
    regexp: '^127\.0\.1\.1\s+'
    line: "127.0.1.1 {{ inventory_hostname }}"
    create: true
    owner: root
    group: root
    mode: "0644"

- name: Add shared Kubernetes host mappings
  ansible.builtin.blockinfile:
    path: /etc/hosts
    marker: "# {mark} ASPIRECLAN SHARED K8S"
    block: |
      {% for item in managed_hosts_entries %}
      {{ item.ip }} {{ item.names | join(' ') }}
      {% endfor %}
    owner: root
    group: root
    mode: "0644"

- name: Enable and start QEMU Guest Agent
  ansible.builtin.service:
    name: qemu-guest-agent
    enabled: true
    state: started

- name: Allow SSH through UFW when UFW is enabled later
  ansible.builtin.command:
    cmd: ufw allow 22/tcp
  register: common_ufw_ssh_rule
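  # UFW reports "Rules updated" while inactive and "Rule added" once it is enabled.
  changed_when: >-
    'Rule added' in common_ufw_ssh_rule.stdout or
    'Rules updated' in common_ufw_ssh_rule.stdout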

- name: Verify the resulting hostname
  ansible.builtin.command:
    cmd: hostnamectl --static
  register: common_configured_hostname
  changed_when: false
  failed_when: common_configured_hostname.stdout | trim != inventory_hostname

The common role is shared by all infrastructure nodes. It validates every host using node_primary_ip and node_interface from inventory and forces an APT refresh before package installation.

20.1 Reconcile the existing HAProxy role and install socat

The final control-plane verification reads the HAProxy Runtime API through /run/haproxy/admin.sock. The socat client must therefore be managed by the HAProxy role rather than installed manually.

Replace ansible/roles/haproxy/tasks/main.yml with:

---
- name: Force refresh the APT package cache before installing HAProxy
  ansible.builtin.apt:
    update_cache: true
  register: haproxy_apt_cache_refresh
  retries: 5
  delay: 15
  until: haproxy_apt_cache_refresh is succeeded

- name: Install HAProxy and the runtime statistics client
  ansible.builtin.apt:
    name:
      - haproxy
      - socat
    state: present
  register: haproxy_package_install
  retries: 3
  delay: 10
  until: haproxy_package_install is succeeded

- name: Render the Kubernetes API HAProxy configuration
  ansible.builtin.template:
    src: haproxy.cfg.j2
    dest: /etc/haproxy/haproxy.cfg
    owner: root
    group: root
    mode: "0644"
    validate: "haproxy -c -f %s"
  notify: Restart HAProxy

- name: Enable and start HAProxy
  ansible.builtin.service:
    name: haproxy
    enabled: true
    state: started

- name: Apply any pending HAProxy restart
  ansible.builtin.meta: flush_handlers

- name: Validate the active HAProxy configuration
  ansible.builtin.command:
    cmd: haproxy -c -f /etc/haproxy/haproxy.cfg
  changed_when: false

- name: Confirm the HAProxy runtime socket exists
  ansible.builtin.stat:
    path: /run/haproxy/admin.sock
  register: haproxy_runtime_socket

- name: Assert that the HAProxy runtime socket is available
  ansible.builtin.assert:
    that:
      - haproxy_runtime_socket.stat.exists
      - haproxy_runtime_socket.stat.issock
    fail_msg: >-
      The HAProxy runtime socket /run/haproxy/admin.sock is not available.

- name: Confirm that the HAProxy Runtime API responds
  ansible.builtin.shell:
    executable: /bin/bash
    cmd: |
      set -euo pipefail

      printf 'show info\n' |
        socat - UNIX-CONNECT:/run/haproxy/admin.sock
  register: haproxy_runtime_api_check
  changed_when: false

The preceding load-balancer page must already have applied this HAProxy role. The control-plane workflow does not rerun the load-balancer playbook; it only checks the HAProxy runtime socket at the end of the control-plane bootstrap. Keeping the repository copy of this role aligned with the listing above ensures that socat is installed and the runtime socket is verified whenever the load-balancer workflow runs.
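
Once the Runtime API answers, the same socket can report per-server health. The sketch below is one possible way to read it, not the verification this page's workflow runs: summarize_haproxy_servers is a hypothetical helper, it assumes the show stat command returns CSV whose header row starts with "# pxname" and includes a status column, and the sample is hand-written and trimmed to three columns with a hypothetical proxy name.

```shell
# Sketch: summarize HAProxy `show stat` CSV (on stdin) as "proxy/server status".

summarize_haproxy_servers() {
  awk -F, '
    # Header row: strip the leading "# " and find the status column by name.
    NR == 1 {
      sub(/^# /, "", $1)
      for (i = 1; i <= NF; i++) if ($i == "status") status_col = i
      next
    }
    # Print real servers only; FRONTEND and BACKEND are aggregate rows.
    status_col && NF > 1 && $2 != "FRONTEND" && $2 != "BACKEND" {
      print $1 "/" $2, $status_col
    }
  '
}

# Hand-written, trimmed sample; real output carries many more columns.
sample_stat='# pxname,svname,status
kubernetes-api,FRONTEND,OPEN
kubernetes-api,cicd-ac-k8s-cp-01,UP
kubernetes-api,BACKEND,UP'

printf '%s\n' "$sample_stat" | summarize_haproxy_servers
```

On the load balancer the function could be fed from the socket, for example printf 'show stat\n' | socat - UNIX-CONNECT:/run/haproxy/admin.sock | summarize_haproxy_servers. Looking the column up by name avoids depending on its position in the CSV.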

21. Implement the containerd role

21.1 Replace ansible/roles/containerd/tasks/main.yml

---
- name: Force refresh the APT package cache before configuring containerd
  ansible.builtin.apt:
    update_cache: true
  register: containerd_apt_cache_refresh
  retries: 5
  delay: 15
  until: containerd_apt_cache_refresh is succeeded

- name: Ensure containerd is installed
  ansible.builtin.apt:
    name:
      - containerd.io
    state: present
  register: containerd_package_install
  retries: 3
  delay: 10
  until: containerd_package_install is succeeded

- name: Ensure the containerd configuration directory exists
  ansible.builtin.file:
    path: /etc/containerd
    state: directory
    owner: root
    group: root
    mode: "0755"

- name: Check whether the containerd configuration already exists
  ansible.builtin.stat:
    path: /etc/containerd/config.toml
  register: containerd_config_file

- name: Generate the default containerd configuration when missing
  ansible.builtin.shell:
    cmd: containerd config default > /etc/containerd/config.toml
  when: not containerd_config_file.stat.exists
  notify: Restart containerd

- name: Ensure the CRI plugin is not disabled
  ansible.builtin.lineinfile:
    path: /etc/containerd/config.toml
    regexp: '^disabled_plugins\s*='
    line: 'disabled_plugins = []'
    insertbefore: BOF
    owner: root
    group: root
    mode: "0644"
  notify: Restart containerd

- name: Configure containerd to use the systemd cgroup driver
  ansible.builtin.replace:
    path: /etc/containerd/config.toml
    regexp: 'SystemdCgroup = false'
    replace: 'SystemdCgroup = true'
  notify: Restart containerd

- name: Write the crictl runtime configuration
  ansible.builtin.copy:
    dest: /etc/crictl.yaml
    owner: root
    group: root
    mode: "0644"
    content: |
      runtime-endpoint: {{ kubernetes_cri_socket }}
      image-endpoint: {{ kubernetes_cri_socket }}
      timeout: 10
      debug: false

- name: Enable and start containerd
  ansible.builtin.service:
    name: containerd
    enabled: true
    state: started

- name: Apply any pending containerd restart
  ansible.builtin.meta: flush_handlers

- name: Confirm the containerd CRI plugin is healthy
  ansible.builtin.shell:
    executable: /bin/bash
    cmd: |
      set -euo pipefail

      plugin_output="$(ctr plugins ls)"
      printf '%s\n' "${plugin_output}"

      printf '%s\n' "${plugin_output}" |
        awk '
          $1 == "io.containerd.grpc.v1" &&
          $2 == "cri" &&
          $NF == "ok" {
            legacy_cri_ok = 1
          }

          $1 == "io.containerd.cri.v1" &&
          $2 == "runtime" &&
          $NF == "ok" {
            runtime_cri_ok = 1
          }

          $1 == "io.containerd.cri.v1" &&
          $2 == "images" &&
          $NF == "ok" {
            images_cri_ok = 1
          }

          END {
            if (legacy_cri_ok || (runtime_cri_ok && images_cri_ok)) {
              exit 0
            }

            exit 1
          }
        '
  register: containerd_cri_plugin_check
  changed_when: false
  retries: 15
  delay: 2
  until: containerd_cri_plugin_check.rc == 0

21.2 Replace ansible/roles/containerd/handlers/main.yml

---
- name: Restart containerd
  ansible.builtin.service:
    name: containerd
    state: restarted

The role preserves the template's Docker installation but configures the shared containerd runtime for Kubernetes, enables the CRI plugin, and sets SystemdCgroup = true.

The CRI validation intentionally parses the TYPE and ID columns from ctr plugins ls separately. It supports both the legacy io.containerd.grpc.v1 / cri layout and the containerd 2.x io.containerd.cri.v1 / runtime plus images layout. The task-level register, retries, delay, until, and changed_when keywords must remain aligned with ansible.builtin.shell; placing them inside the module block causes an unsupported-parameter failure.
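
The two accepted layouts can be tried outside Ansible by feeding the same awk logic hand-written rows. This is a sketch for experimentation: cri_plugin_ok is a hypothetical wrapper, and the sample rows only imitate the TYPE, ID, PLATFORMS, STATUS column order assumed by the task; they are not captured ctr output.

```shell
# Sketch: the role's CRI check as a function reading `ctr plugins ls` text on stdin.

cri_plugin_ok() {
  awk '
    $1 == "io.containerd.grpc.v1" && $2 == "cri"     && $NF == "ok" { legacy = 1 }
    $1 == "io.containerd.cri.v1"  && $2 == "runtime" && $NF == "ok" { runtime = 1 }
    $1 == "io.containerd.cri.v1"  && $2 == "images"  && $NF == "ok" { images = 1 }
    # Healthy if the legacy plugin is ok, or both containerd 2.x plugins are ok.
    END { exit !(legacy || (runtime && images)) }
  '
}

legacy_rows='io.containerd.grpc.v1 cri linux/amd64 ok'

split_rows='io.containerd.cri.v1 images - ok
io.containerd.cri.v1 runtime linux/amd64 ok'

partial_rows='io.containerd.cri.v1 images - ok
io.containerd.cri.v1 runtime linux/amd64 error'

for name in legacy_rows split_rows partial_rows; do
  eval "rows=\$$name"
  if printf '%s\n' "$rows" | cri_plugin_ok; then
    echo "$name: CRI healthy"
  else
    echo "$name: CRI not healthy"
  fi
done
```

The partial case shows why the containerd 2.x branch requires both the runtime and images plugins: one healthy plugin alone is not treated as a working CRI.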

22. Implement the Kubernetes common role

22.1 Replace ansible/roles/kubernetes-common/tasks/main.yml

---
- name: Disable swap immediately
  ansible.builtin.command:
    cmd: swapoff -a
  changed_when: ansible_facts["swaptotal_mb"] | int > 0

- name: Disable swap entries permanently in fstab
  ansible.builtin.replace:
    path: /etc/fstab
    regexp: '^([^#].*\s+swap\s+.*)$'
    replace: '# \1'

- name: Configure Kubernetes kernel modules
  ansible.builtin.copy:
    dest: /etc/modules-load.d/k8s.conf
    owner: root
    group: root
    mode: "0644"
    content: |
      overlay
      br_netfilter

- name: Load Kubernetes kernel modules now
  ansible.builtin.command:
    cmd: "modprobe {{ item }}"
  loop:
    - overlay
    - br_netfilter
  changed_when: false

- name: Configure Kubernetes networking sysctls
  ansible.builtin.copy:
    dest: /etc/sysctl.d/99-kubernetes-cri.conf
    owner: root
    group: root
    mode: "0644"
    content: |
      net.bridge.bridge-nf-call-iptables = 1
      net.bridge.bridge-nf-call-ip6tables = 1
      net.ipv4.ip_forward = 1
  notify: Reload Kubernetes sysctls

- name: Apply pending sysctl changes
  ansible.builtin.meta: flush_handlers

- name: Force refresh the APT cache before Kubernetes repository setup
  ansible.builtin.apt:
    update_cache: true
  register: kubernetes_prerequisite_apt_refresh
  retries: 5
  delay: 15
  until: kubernetes_prerequisite_apt_refresh is succeeded

- name: Install Kubernetes repository prerequisites
  ansible.builtin.apt:
    name:
      - apt-transport-https
      - ca-certificates
      - curl
      - gpg
    state: present
  register: kubernetes_prerequisite_packages
  retries: 3
  delay: 10
  until: kubernetes_prerequisite_packages is succeeded

- name: Ensure the APT keyring directory exists
  ansible.builtin.file:
    path: /etc/apt/keyrings
    state: directory
    owner: root
    group: root
    mode: "0755"

- name: Install the Kubernetes repository signing key
  ansible.builtin.shell:
    executable: /bin/bash
    cmd: |
      set -euo pipefail

      # pipefail makes a failed download fail the task instead of being masked by gpg.
      curl -fsSL \
        "https://pkgs.k8s.io/core:/stable:/{{ kubernetes_minor_repository }}/deb/Release.key" |
        gpg --dearmor --yes --output /etc/apt/keyrings/kubernetes-apt-keyring.gpg
  changed_when: false

- name: Configure the Kubernetes minor-version repository
  ansible.builtin.copy:
    dest: /etc/apt/sources.list.d/kubernetes.list
    owner: root
    group: root
    mode: "0644"
    content: >-
      deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg]
      https://pkgs.k8s.io/core:/stable:/{{ kubernetes_minor_repository }}/deb/ /

- name: Force refresh the APT cache after adding the Kubernetes repository
  ansible.builtin.apt:
    update_cache: true
  register: kubernetes_repository_apt_refresh
  retries: 5
  delay: 15
  until: kubernetes_repository_apt_refresh is succeeded

- name: Install the approved Kubernetes packages
  ansible.builtin.apt:
    name:
      - "kubelet={{ kubernetes_package_version }}"
      - "kubeadm={{ kubernetes_package_version }}"
      - "kubectl={{ kubernetes_package_version }}"
      - cri-tools
    state: present
    allow_downgrade: true
  register: kubernetes_package_install
  retries: 3
  delay: 10
  until: kubernetes_package_install is succeeded

- name: Hold Kubernetes packages for controlled upgrades
  ansible.builtin.dpkg_selections:
    name: "{{ item }}"
    selection: hold
  loop:
    - kubelet
    - kubeadm
    - kubectl

- name: Configure the kubelet node IP
  ansible.builtin.copy:
    dest: /etc/default/kubelet
    owner: root
    group: root
    mode: "0644"
    content: |
      KUBELET_EXTRA_ARGS=--node-ip={{ node_primary_ip }}
  notify: Restart kubelet

- name: Enable kubelet at boot
  ansible.builtin.systemd:
    name: kubelet
    enabled: true
    daemon_reload: true

- name: Allow approved Kubernetes control-plane TCP ports through UFW
  ansible.builtin.command:
    cmd: >-
      ufw allow from 192.168.8.0/22 to any port {{ item }} proto tcp
  loop: "{{ kubernetes_node_tcp_ports | default([]) }}"
  register: kubernetes_ufw_tcp_rules
  changed_when: "'Rule added' in kubernetes_ufw_tcp_rules.stdout"

- name: Allow Calico node TCP ports through UFW
  ansible.builtin.command:
    cmd: >-
      ufw allow from 192.168.8.0/22 to any port {{ item }} proto tcp
  loop: "{{ calico_node_tcp_ports | default([]) }}"
  register: calico_ufw_tcp_rules
  changed_when: "'Rule added' in calico_ufw_tcp_rules.stdout"

- name: Allow Calico node UDP ports through UFW
  ansible.builtin.command:
    cmd: >-
      ufw allow from 192.168.8.0/22 to any port {{ item }} proto udp
  loop: "{{ calico_node_udp_ports | default([]) }}"
  register: calico_ufw_udp_rules
  changed_when: "'Rule added' in calico_ufw_udp_rules.stdout"

- name: Apply any pending kubelet restart
  ansible.builtin.meta: flush_handlers

- name: Pull the approved Kubernetes control-plane images
  ansible.builtin.command:
    cmd: >-
      kubeadm config images pull
      --kubernetes-version {{ kubernetes_version }}
      --cri-socket {{ kubernetes_cri_socket }}
  register: kubernetes_image_pull
  changed_when: "'pulled' in kubernetes_image_pull.stdout | lower"
  retries: 3
  delay: 15
  until: kubernetes_image_pull is succeeded

- name: Verify that swap is disabled
  ansible.builtin.command:
    cmd: swapon --show
  register: kubernetes_swap_status
  changed_when: false
  failed_when: kubernetes_swap_status.stdout | trim | length > 0

- name: Verify the container runtime through CRI
  ansible.builtin.command:
    cmd: >-
      crictl
      --runtime-endpoint {{ kubernetes_cri_socket }}
      --image-endpoint {{ kubernetes_cri_socket }}
      info
  register: kubernetes_cri_runtime_check
  changed_when: false
  retries: 15
  delay: 2
  until: kubernetes_cri_runtime_check.rc == 0

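- name: Verify installed Kubernetes versions
  ansible.builtin.shell:
    # set -e stops at the first failing version check; without it only
    # the last command's exit status would decide the task result.
    cmd: |
      set -e
      kubeadm version -o short
      kubelet --version
      kubectl version --client=true
  changed_when: false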

22.2 Replace ansible/roles/kubernetes-common/handlers/main.yml

---
- name: Reload Kubernetes sysctls
  ansible.builtin.command:
    cmd: sysctl --system

- name: Restart kubelet
  ansible.builtin.service:
    name: kubelet
    state: restarted

Until kubeadm init or kubeadm join writes the kubelet configuration, the kubelet exits and systemd restarts it every few seconds. This restart loop is expected before the node is bootstrapped or joined.
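As a rough way to tell the expected pre-bootstrap restart loop from a real fault, the sketch below checks for the kubelet configuration file that kubeadm writes during init or join. The function name kubelet_state and its optional root-prefix argument are illustrative assumptions and are not part of the role; /var/lib/kubelet/config.yaml is kubeadm's default location for that file.

```bash
# Hypothetical helper: report whether kubeadm has configured the kubelet yet.
# $1 is an optional filesystem root prefix (defaults to "/"), used for testing.
kubelet_state() {
  local root="${1:-/}"
  if [ -f "${root%/}/var/lib/kubelet/config.yaml" ]; then
    echo "configured"
  else
    # No kubeadm-written configuration yet: kubelet restarts are expected.
    echo "awaiting-kubeadm"
  fi
}

# Usage on a node:
#   kubelet_state
```

When the helper prints awaiting-kubeadm, repeated restarts in journalctl -u kubelet match the expected behavior described above.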

23. Implement the Kubernetes control-plane role

23.1 Replace defaults/main.yml

---
kubernetes_control_plane_action: bootstrap
control_plane_join_command: ""

23.2 Replace tasks/main.yml

---
- name: Include first-control-plane bootstrap tasks
  ansible.builtin.include_tasks: bootstrap.yml
  when: kubernetes_control_plane_action == "bootstrap"

- name: Include additional-control-plane join tasks
  ansible.builtin.include_tasks: join.yml
  when: kubernetes_control_plane_action == "join"

23.3 Create tasks/bootstrap.yml

---
- name: Render the kubeadm initialization configuration
  ansible.builtin.template:
    src: kubeadm-init-config.yaml.j2
    dest: /etc/kubernetes/kubeadm-init-config.yaml
    owner: root
    group: root
    mode: "0600"

- name: Validate the kubeadm initialization configuration
  ansible.builtin.command:
    cmd: kubeadm config validate --config /etc/kubernetes/kubeadm-init-config.yaml
  changed_when: false

- name: Bootstrap the first control plane and upload shared certificates
  ansible.builtin.command:
    cmd: >-
      kubeadm init
      --config /etc/kubernetes/kubeadm-init-config.yaml
      --upload-certs
    creates: /etc/kubernetes/admin.conf
  no_log: true

- name: Ensure the administrator kubeconfig directory exists
  ansible.builtin.file:
    path: "/home/{{ cluster_admin_user }}/.kube"
    state: directory
    owner: "{{ cluster_admin_user }}"
    group: "{{ cluster_admin_user }}"
    mode: "0700"

- name: Install the administrator kubeconfig
  ansible.builtin.copy:
    src: /etc/kubernetes/admin.conf
    dest: "/home/{{ cluster_admin_user }}/.kube/config"
    remote_src: true
    owner: "{{ cluster_admin_user }}"
    group: "{{ cluster_admin_user }}"
    mode: "0600"

- name: Wait for the Kubernetes API through the load-balancer endpoint
  ansible.builtin.command:
    cmd: kubectl --kubeconfig=/etc/kubernetes/admin.conf get --raw=/readyz
  register: first_control_plane_readyz
  changed_when: false
  retries: 30
  delay: 10
  until: first_control_plane_readyz.stdout | trim == "ok"

- name: Verify the first control-plane node registration
  ansible.builtin.command:
    cmd: >-
      kubectl --kubeconfig=/etc/kubernetes/admin.conf
      get node {{ inventory_hostname }} -o wide
  changed_when: false

23.4 Create tasks/join.yml

---
- name: Confirm a generated control-plane join command is available
  ansible.builtin.assert:
    that:
      - control_plane_join_command | length > 0
    fail_msg: "The first control plane did not provide a join command."
  no_log: true

- name: Join this node as an additional control plane
  ansible.builtin.command:
    cmd: >-
      {{ control_plane_join_command }}
      --apiserver-advertise-address {{ node_primary_ip }}
      --node-name {{ inventory_hostname }}
      --cri-socket {{ kubernetes_cri_socket }}
    creates: /etc/kubernetes/admin.conf
  no_log: true

- name: Ensure the administrator kubeconfig directory exists
  ansible.builtin.file:
    path: "/home/{{ cluster_admin_user }}/.kube"
    state: directory
    owner: "{{ cluster_admin_user }}"
    group: "{{ cluster_admin_user }}"
    mode: "0700"

- name: Install the administrator kubeconfig
  ansible.builtin.copy:
    src: /etc/kubernetes/admin.conf
    dest: "/home/{{ cluster_admin_user }}/.kube/config"
    remote_src: true
    owner: "{{ cluster_admin_user }}"
    group: "{{ cluster_admin_user }}"
    mode: "0600"

- name: Wait for the local API server to listen
  ansible.builtin.wait_for:
    host: "{{ node_primary_ip }}"
    port: 6443
    timeout: 300

23.5 Create templates/kubeadm-init-config.yaml.j2

apiVersion: kubeadm.k8s.io/v1beta4
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: "{{ node_primary_ip }}"
  bindPort: 6443
nodeRegistration:
  name: "{{ inventory_hostname }}"
  criSocket: "{{ kubernetes_cri_socket }}"
  kubeletExtraArgs:
    - name: node-ip
      value: "{{ node_primary_ip }}"
---
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
kubernetesVersion: "{{ kubernetes_version }}"
controlPlaneEndpoint: "{{ kubernetes_api_endpoint }}"
networking:
  podSubnet: "{{ kubernetes_pod_cidr }}"
  serviceSubnet: "{{ kubernetes_service_cidr }}"
  dnsDomain: "{{ kubernetes_dns_domain }}"
apiServer:
  certSANs:
    - "cicd-ac-k8s-api.aspireclan.com"
    - "cicd-ac-k8s-api"
    - "192.168.8.200"
    - "192.168.8.202"
    - "192.168.8.203"
    - "192.168.8.204"
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd
---
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: iptables

The global controlPlaneEndpoint exactly matches the completed HAProxy and Keepalived endpoint: cicd-ac-k8s-api.aspireclan.com:443.
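One way to confirm that claim on the first control plane is to read the endpoint back from the rendered file. This is a minimal sketch; the function names control_plane_endpoint and endpoint_matches are hypothetical, and the awk parsing assumes the single-line key: "value" layout produced by the template above.

```bash
# Hypothetical helper: print the controlPlaneEndpoint value from a rendered
# kubeadm configuration file ($1), with surrounding double quotes removed.
control_plane_endpoint() {
  awk '$1 == "controlPlaneEndpoint:" { gsub(/"/, "", $2); print $2; exit }' "$1"
}

# Succeeds only if the rendered endpoint equals the expected value ($2).
endpoint_matches() {
  [ "$(control_plane_endpoint "$1")" = "$2" ]
}

# Usage on the first control plane (the file is root-owned, mode 0600):
#   sudo bash -c '. ./endpoint-check.sh; endpoint_matches \
#     /etc/kubernetes/kubeadm-init-config.yaml cicd-ac-k8s-api.aspireclan.com:443'
```

The script name endpoint-check.sh in the usage comment is likewise only a placeholder.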

24. Add the Calico custom resources

Create kubernetes/common/cni/calico-custom-resources.yaml:

apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  calicoNetwork:
    bgp: Disabled
    ipPools:
      - name: default-ipv4-ippool
        blockSize: 26
        cidr: 10.244.0.0/16
        encapsulation: VXLAN
        natOutgoing: Enabled
        nodeSelector: all()

This file overrides Calico's example CIDR with the approved non-overlapping 10.244.0.0/16 Pod network and uses VXLAN with BGP disabled.
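The non-overlap claim can be checked arithmetically. The sketch below is an illustration only: the function names are hypothetical, and it handles plain IPv4 CIDR notation without input validation. It relies on the fact that two CIDR blocks overlap exactly when they agree on the shorter of the two prefixes.

```bash
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Print "overlap" or "disjoint" for two IPv4 CIDR blocks.
cidr_overlap() {
  local net1="${1%/*}" len1="${1#*/}" net2="${2%/*}" len2="${2#*/}"
  # Compare both networks under the shorter prefix length.
  local len=$(( len1 < len2 ? len1 : len2 ))
  local mask=$(( len == 0 ? 0 : (0xFFFFFFFF << (32 - len)) & 0xFFFFFFFF ))
  if (( ($(ip_to_int "$net1") & mask) == ($(ip_to_int "$net2") & mask) )); then
    echo overlap
  else
    echo disjoint
  fi
}

# Usage:
#   cidr_overlap 10.244.0.0/16 10.96.0.0/12     # Pod CIDR vs Service CIDR
#   cidr_overlap 10.244.0.0/16 192.168.8.0/22   # Pod CIDR vs node network
#   cidr_overlap 192.168.0.0/16 192.168.8.0/22  # Calico example vs node network
```

With the values used on this page, the first two comparisons print disjoint and the third prints overlap, which is why the example 192.168.0.0/16 pool is not used here.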

25. Replace the common-baseline playbook

Replace ansible/playbooks/shared-k8s/01-common-baseline.yml with:

---
- name: Apply the common Ubuntu baseline
  hosts: all
  become: true
  gather_facts: true

  roles:
    - role: common

The workflows use --limit when the baseline should target only one infrastructure group.

26. Replace the Kubernetes preparation playbook

Replace ansible/playbooks/shared-k8s/03-prepare-kubernetes-nodes.yml with:

---
- name: Prepare Kubernetes nodes
  hosts: k8s_cluster
  become: true
  gather_facts: true

  roles:
    - role: containerd
    - role: kubernetes-common

27. Replace the first-control-plane bootstrap playbook

Replace ansible/playbooks/shared-k8s/04-bootstrap-first-control-plane.yml with:

---
- name: Bootstrap the first Kubernetes control plane
  hosts: first_control_plane
  become: true
  gather_facts: true

  roles:
    - role: kubernetes-control-plane
      vars:
        kubernetes_control_plane_action: bootstrap

28. Replace the CNI installation playbook

Replace ansible/playbooks/shared-k8s/06-install-cni.yml with:

---
- name: Install Calico on the first control plane
  hosts: first_control_plane
  become: true
  gather_facts: false

  environment:
    KUBECONFIG: /etc/kubernetes/admin.conf

  tasks:
    - name: Install Calico custom resource definitions
      ansible.builtin.command:
        cmd: "kubectl apply --server-side -f {{ calico_crd_url }}"
      register: calico_crd_apply
      # Server-side apply prints "serverside-applied" for each object,
      # not "created" or "configured", so match that string instead.
      changed_when: "'serverside-applied' in calico_crd_apply.stdout"

    - name: Install the Tigera operator
      ansible.builtin.command:
        cmd: "kubectl apply -f {{ calico_operator_url }}"
      register: calico_operator_apply
      changed_when: "'configured' in calico_operator_apply.stdout or 'created' in calico_operator_apply.stdout"

    - name: Copy the approved Calico custom resources
      ansible.builtin.copy:
        src: "{{ playbook_dir }}/../../../kubernetes/common/cni/calico-custom-resources.yaml"
        dest: /etc/kubernetes/calico-custom-resources.yaml
        owner: root
        group: root
        mode: "0644"

    - name: Apply the approved Calico custom resources
      ansible.builtin.command:
        cmd: kubectl apply -f /etc/kubernetes/calico-custom-resources.yaml
      register: calico_custom_resources_apply
      changed_when: >-
        'configured' in calico_custom_resources_apply.stdout or
        'created' in calico_custom_resources_apply.stdout

    - name: Wait for the Tigera operator deployment
      ansible.builtin.command:
        cmd: >-
          kubectl -n tigera-operator
          rollout status deployment/tigera-operator
          --timeout=10m
      changed_when: false

    - name: Wait for Calico to report Available
      ansible.builtin.command:
        cmd: kubectl wait --for=condition=Available tigerastatus/calico --timeout=15m
      register: calico_available
      changed_when: false
      retries: 3
      delay: 20
      until: calico_available is succeeded

    - name: Wait for the first control-plane node to become Ready
      ansible.builtin.command:
        cmd: >-
          kubectl wait
          --for=condition=Ready
          node/{{ inventory_hostname }}
          --timeout=15m
      changed_when: false

The execution order intentionally runs playbook 06 before playbook 05. Kubernetes networking must be installed before the additional control planes are joined.

29. Replace the additional-control-plane join playbook

Replace ansible/playbooks/shared-k8s/05-join-control-planes.yml with:

---
- name: Generate a fresh control-plane join command
  hosts: first_control_plane
  become: true
  gather_facts: false

  tasks:
    - name: Generate a fresh bootstrap-token join command
      ansible.builtin.command:
        cmd: kubeadm token create --ttl 2h --print-join-command
      register: generated_base_join_command
      changed_when: true
      no_log: true

    - name: Re-upload control-plane certificates and generate a fresh key
      ansible.builtin.command:
        cmd: kubeadm init phase upload-certs --upload-certs
      register: generated_certificate_key
      changed_when: true
      no_log: true

    - name: Build the temporary control-plane join command
      ansible.builtin.set_fact:
        generated_control_plane_join_command: >-
          {{ generated_base_join_command.stdout }}
          --control-plane
          --certificate-key {{ generated_certificate_key.stdout_lines | last }}
      no_log: true

- name: Join the remaining control planes one at a time
  hosts: additional_control_planes
  serial: 1
  become: true
  gather_facts: true

  vars:
    control_plane_join_command: >-
      {{ hostvars[groups['first_control_plane'][0]].generated_control_plane_join_command }}

  roles:
    - role: kubernetes-control-plane
      vars:
        kubernetes_control_plane_action: join

- name: Verify the complete highly available control plane
  hosts: first_control_plane
  become: true
  gather_facts: false

  environment:
    KUBECONFIG: /etc/kubernetes/admin.conf

  tasks:
    - name: Wait for every control-plane node to become Ready
      ansible.builtin.command:
        cmd: kubectl wait --for=condition=Ready nodes --all --timeout=15m
      changed_when: false

    - name: Rebalance CoreDNS after additional control planes join
      ansible.builtin.command:
        cmd: kubectl -n kube-system rollout restart deployment/coredns
      changed_when: true

    - name: Wait for CoreDNS rollout completion
      ansible.builtin.command:
        cmd: kubectl -n kube-system rollout status deployment/coredns --timeout=10m
      changed_when: false

    - name: Display the final control-plane nodes
      ansible.builtin.command:
        cmd: kubectl get nodes -o wide
      register: final_control_plane_nodes
      changed_when: false

    - name: Print the final control-plane node table
      ansible.builtin.debug:
        var: final_control_plane_nodes.stdout_lines

The two additional control planes join serially. The temporary certificate key and bootstrap token are hidden with no_log and are never committed.
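The way the playbook assembles the temporary join command can be sketched in shell. This is an illustration under stated assumptions: the function name build_control_plane_join is hypothetical, and the 64-character hexadecimal check is an extra guard that the playbook itself does not perform. As in the playbook, the certificate key is taken from the last line of the upload-certs output.

```bash
# Hypothetical helper.
# $1: output of `kubeadm token create --print-join-command`
# $2: full output of `kubeadm init phase upload-certs --upload-certs`
build_control_plane_join() {
  local base="$1" upload_output="$2" key
  # The certificate key is printed as the last non-empty line.
  key="$(printf '%s\n' "$upload_output" | awk 'NF { last = $0 } END { print last }')"
  # Extra guard (not in the playbook): expect a 32-byte key in hex form.
  if [[ ! "$key" =~ ^[0-9a-f]{64}$ ]]; then
    echo "ERROR: no 64-character hex certificate key found" >&2
    return 1
  fi
  printf '%s --control-plane --certificate-key %s\n' "$base" "$key"
}
```

Because the result contains both the bootstrap token and the certificate key, treat its output exactly like the playbook does: keep it out of logs and never write it to the repository.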

30. Add the control-plane GitHub Actions workflow

Create .github/workflows/ansible-configure-control-planes.yml:

name: Ansible Configure - Kubernetes Control Planes

on:
  push:
    branches:
      - dev
      - prod
    paths:
      - "ansible/inventories/shared-k8s/group_vars/control_planes.yml"
      - "ansible/roles/kubernetes-control-plane/**"
      - "ansible/playbooks/shared-k8s/04-bootstrap-first-control-plane.yml"
      - "ansible/playbooks/shared-k8s/05-join-control-planes.yml"
      - "ansible/playbooks/shared-k8s/06-install-cni.yml"
      - "kubernetes/common/cni/**"
      - ".github/workflows/ansible-configure-control-planes.yml"

  workflow_dispatch:

permissions:
  contents: read

concurrency:
  group: shared-k8s-ansible
  cancel-in-progress: false

env:
  ANSIBLE_CONFIG: ${{ github.workspace }}/ansible/ansible.cfg

jobs:
  validate:
    name: Validate control-plane Ansible configuration
    runs-on:
      - self-hosted
      - Linux
      - X64
      - prod
      - terraform
      - deploy
      - ac-cicd-infra

    steps:
      - name: Checkout repository
        uses: actions/checkout@v5

      - name: Verify Ansible
        shell: bash
        run: |
          set -euo pipefail
          ansible --version
          ansible-playbook --version

      - name: Validate the shared inventory
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail
          ansible-inventory -i inventories/shared-k8s/hosts.ini --graph

      - name: Syntax-check the control-plane playbooks
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail

          for playbook in \
            playbooks/shared-k8s/01-common-baseline.yml \
            playbooks/shared-k8s/03-prepare-kubernetes-nodes.yml \
            playbooks/shared-k8s/04-bootstrap-first-control-plane.yml \
            playbooks/shared-k8s/06-install-cni.yml \
            playbooks/shared-k8s/05-join-control-planes.yml
          do
            ansible-playbook \
              -i inventories/shared-k8s/hosts.ini \
              "${playbook}" \
              --syntax-check
          done

  configure:
    name: Bootstrap the Kubernetes control planes
    needs:
      - validate

    if: >-
      (github.event_name == 'push' && github.ref_name == 'prod') ||
      (github.event_name == 'workflow_dispatch' && github.ref_name == 'prod')

    environment:
      name: shared-k8s

    runs-on:
      - self-hosted
      - Linux
      - X64
      - prod
      - terraform
      - deploy
      - ac-cicd-infra

    timeout-minutes: 150

    steps:
      - name: Checkout repository
        uses: actions/checkout@v5

      - name: Verify the production branch
        shell: bash
        run: |
          set -euo pipefail

          if [ "${GITHUB_REF_NAME}" != "prod" ]; then
            echo "ERROR: Control-plane configuration is permitted only from prod."
            exit 1
          fi

      - name: Prepare the existing Ansible SSH key
        shell: bash
        run: |
          set -euo pipefail

          KEY_PATH="${HOME}/.ssh/id_ed25519_ansible"

          if [ ! -f "${KEY_PATH}" ]; then
            echo "ERROR: Missing Ansible key: ${KEY_PATH}"
            exit 1
          fi

          chmod 600 "${KEY_PATH}"
          echo "ANSIBLE_PRIVATE_KEY_FILE=${KEY_PATH}" >> "${GITHUB_ENV}"

      - name: Refresh load-balancer and control-plane SSH host keys
        shell: bash
        run: |
          set -euo pipefail

          mkdir -p "${HOME}/.ssh"
          chmod 700 "${HOME}/.ssh"
          touch "${HOME}/.ssh/known_hosts"
          chmod 600 "${HOME}/.ssh/known_hosts"

          for ip in 192.168.8.201 192.168.8.202 192.168.8.203 192.168.8.204; do
            ssh-keygen -f "${HOME}/.ssh/known_hosts" -R "${ip}" || true

            captured=false
            for attempt in $(seq 1 30); do
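              # Require non-empty output; an exit status of 0 alone does
              # not prove that a host key was returned.
              if keys="$(ssh-keyscan -T 5 -H "${ip}" 2>/dev/null)" && [ -n "${keys}" ]; then
                printf '%s\n' "${keys}" >> "${HOME}/.ssh/known_hosts"
                echo "SSH host key captured for ${ip}."
                captured=true
                break
              fi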

              echo "Waiting for SSH on ${ip} (attempt ${attempt}/30)..."
              sleep 10
            done

            if [ "${captured}" != "true" ]; then
              echo "ERROR: Unable to capture SSH host key for ${ip}."
              exit 1
            fi
          done

      - name: Prepare the Ansible remote temporary directory
        shell: bash
        run: |
          set -euo pipefail

          for ip in 192.168.8.201 192.168.8.202 192.168.8.203 192.168.8.204; do
            ssh \
              -i "${ANSIBLE_PRIVATE_KEY_FILE}" \
              -o IdentitiesOnly=yes \
              -o BatchMode=yes \
              "acllc@${ip}" \
              'sudo install -d -m 0700 -o acllc -g acllc /var/tmp/ansible-acllc'
          done

      - name: Verify Ansible connectivity
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail

          ansible \
            -i inventories/shared-k8s/hosts.ini \
            control_planes \
            --private-key "${ANSIBLE_PRIVATE_KEY_FILE}" \
            -m ping

      - name: Apply the common Ubuntu baseline
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail

          ansible-playbook \
            -i inventories/shared-k8s/hosts.ini \
            --private-key "${ANSIBLE_PRIVATE_KEY_FILE}" \
            --limit control_planes \
            playbooks/shared-k8s/01-common-baseline.yml

      - name: Prepare containerd and Kubernetes prerequisites
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail

          ansible-playbook \
            -i inventories/shared-k8s/hosts.ini \
            --private-key "${ANSIBLE_PRIVATE_KEY_FILE}" \
            --limit control_planes \
            playbooks/shared-k8s/03-prepare-kubernetes-nodes.yml

      - name: Bootstrap the first control plane
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail

          ansible-playbook \
            -i inventories/shared-k8s/hosts.ini \
            --private-key "${ANSIBLE_PRIVATE_KEY_FILE}" \
            playbooks/shared-k8s/04-bootstrap-first-control-plane.yml

      - name: Install Calico before joining additional control planes
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail

          ansible-playbook \
            -i inventories/shared-k8s/hosts.ini \
            --private-key "${ANSIBLE_PRIVATE_KEY_FILE}" \
            playbooks/shared-k8s/06-install-cni.yml

      - name: Join the remaining control planes
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail

          ansible-playbook \
            -i inventories/shared-k8s/hosts.ini \
            --private-key "${ANSIBLE_PRIVATE_KEY_FILE}" \
            playbooks/shared-k8s/05-join-control-planes.yml

      - name: Verify the completed highly available control plane
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail

          ansible \
            -i inventories/shared-k8s/hosts.ini \
            first_control_plane \
            --private-key "${ANSIBLE_PRIVATE_KEY_FILE}" \
            -b \
            -m shell \
            -a '
              set -e
              export KUBECONFIG=/etc/kubernetes/admin.conf
              kubectl get nodes -o wide
              kubectl get pods -A
              kubectl get --raw=/readyz
            '

          ansible \
            -i inventories/shared-k8s/hosts.ini \
            load_balancers \
            --private-key "${ANSIBLE_PRIVATE_KEY_FILE}" \
            -b \
            -m shell \
            -a '
              set -eu

              command -v socat >/dev/null 2>&1 || {
                echo "ERROR: socat is not installed on the load balancer."
                exit 1
              }
              
              test -S /run/haproxy/admin.sock || {
                echo "ERROR: HAProxy runtime socket is missing."
                exit 1
              }
              
              echo "=== HAPROXY INFORMATION ==="
              
              printf "show info\n" |
                socat - UNIX-CONNECT:/run/haproxy/admin.sock
              
              echo
              echo "=== HAPROXY BACKEND STATISTICS ==="
              
              printf "show stat\n" |
                socat - UNIX-CONNECT:/run/haproxy/admin.sock
            '

Workflow behavior:

Event                       Result
Push to dev                 Inventory and syntax validation only
Push to prod                Validation and complete control-plane bootstrap
Manual dispatch from prod   Idempotent control-plane reconciliation
Pull request                No workflow trigger
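The trigger rules above can be modeled as a small decision function. This is a sketch for reasoning about the workflow, not something the workflow runs; the name workflow_jobs is hypothetical, and the model ignores the paths filter, which additionally has to match for push events.

```bash
# Hypothetical model of which jobs run for a given event and branch.
# $1: GitHub event name, $2: branch name (github.ref_name).
workflow_jobs() {
  local event="$1" ref="$2"
  case "$event" in
    push)
      case "$ref" in
        dev)  echo "validate" ;;
        prod) echo "validate configure" ;;
        *)    echo "none" ;;            # branch filter excludes other branches
      esac ;;
    workflow_dispatch)
      # Validation always runs; configure is gated on the prod ref.
      if [ "$ref" = "prod" ]; then
        echo "validate configure"
      else
        echo "validate"
      fi ;;
    *)
      echo "none" ;;                    # e.g. pull_request is not a trigger
  esac
}
```

For example, workflow_jobs push dev prints validate, while workflow_jobs pull_request dev prints none.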

31. Review and commit the Ansible change

git status
git diff --check
git diff --stat

git diff -- `
  ansible `
  kubernetes/common/cni `
  .github/workflows/ansible-configure-control-planes.yml

Confirm:

  • No private key, bootstrap token, certificate key, admin kubeconfig, or join command is committed.
  • All package-installing roles force an APT cache refresh first.
  • No ansible_default_ipv4 references remain.
  • Calico uses 10.244.0.0/16, not 192.168.0.0/16.
  • Kubernetes uses the VIP DNS endpoint on port 443.
  • Only .202, .203, and .204 are control planes.
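Part of this checklist can be automated with a text scan before committing. The sketch below is an assumption-laden aid, not a replacement for the review: the function name review_tree and the chosen patterns are illustrative, it does not detect a committed admin kubeconfig, and a clean result does not prove that no secret is present.

```bash
# Hypothetical pre-commit scan. Prints every match and returns 1 if any
# forbidden pattern is found under the directory given as $1.
review_tree() {
  local dir="$1" status=0 pattern
  local patterns=(
    'ansible_default_ipv4'                    # fact that must no longer be used
    '192\.168\.0\.0/16'                       # Calico example CIDR
    'BEGIN [A-Z ]*PRIVATE KEY'                # PEM or OpenSSH private key
    'certificate-key[ =]+[0-9a-f]{64}'        # literal kubeadm certificate key
    'token[ =]+[a-z0-9]{6}\.[a-z0-9]{16}'     # literal bootstrap token
  )
  for pattern in "${patterns[@]}"; do
    if grep -rnE -- "$pattern" "$dir"; then
      status=1
    fi
  done
  return "$status"
}

# Usage from the repository root:
#   review_tree ansible && review_tree kubernetes/common/cni
```

Templated references such as --certificate-key {{ ... }} do not match, because the patterns require literal key or token material.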

Commit and push:

git add `
  ansible `
  kubernetes/common/cni/calico-custom-resources.yaml `
  .github/workflows/ansible-configure-control-planes.yml

git commit -m "Bootstrap shared Kubernetes control planes"

git push -u origin feature/bootstrap-k8s-control-planes

32. Create the Ansible pull request into dev

gh pr create `
  --base dev `
  --head feature/bootstrap-k8s-control-planes `
  --title "Bootstrap shared Kubernetes control planes" `
  --body "Adds the Ubuntu baseline, containerd, Kubernetes prerequisites, kubeadm HA bootstrap, Calico CNI, and additional control-plane joins."

Merge only after inventory and all five playbook syntax checks succeed.

33. Promote the Ansible change from dev to prod

gh pr create `
  --base prod `
  --head dev `
  --title "Bootstrap shared Kubernetes control planes" `
  --body "Promotes the validated highly available Kubernetes control-plane configuration to prod."

After merge and environment approval, the workflow runs in this order:

  1. Refresh SSH host keys for the load balancer and all three control planes.
  2. Create the shared Ansible remote temporary directory.
  3. Verify Ansible connectivity to the three control planes.
  4. Apply the common Ubuntu baseline to the three control planes.
  5. Configure containerd and Kubernetes prerequisites.
  6. Bootstrap cicd-ac-k8s-cp-01.
  7. Install Calico.
  8. Join cicd-ac-k8s-cp-02.
  9. Join cicd-ac-k8s-cp-03.
  10. Wait for all three nodes to become Ready.
  11. Rebalance CoreDNS.
  12. Confirm /readyz returns ok.
  13. Display HAProxy runtime and backend statistics from the existing load balancer.

34. Manual cluster verification

Run from prod-terraform-deploy-02:

ssh \
  -i ~/.ssh/id_ed25519_ansible \
  -o IdentitiesOnly=yes \
  acllc@192.168.8.202 \
  'sudo bash -c "
    export KUBECONFIG=/etc/kubernetes/admin.conf

    echo === NODES ===
    kubectl get nodes -o wide

    echo === PODS ===
    kubectl get pods -A

    echo === API READINESS ===
    kubectl get --raw=/readyz

    echo === ETCD PODS ===
    kubectl -n kube-system get pods -l component=etcd -o wide
  "' 

Verify HAProxy separately:

ssh \
  -i ~/.ssh/id_ed25519_ansible \
  -o IdentitiesOnly=yes \
  acllc@192.168.8.201 \
  '
    command -v socat
    sudo test -S /run/haproxy/admin.sock

    printf "show stat\n" |
      sudo socat - UNIX-CONNECT:/run/haproxy/admin.sock |
      awk -F, "
        NR == 1 ||
        (\$1 == \"kubernetes_control_planes\" &&
         \$2 ~ /^cicd-ac-k8s-cp-0[123]$/)
      "
  '

The three HAProxy server rows must report UP after their API servers are healthy.
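That acceptance rule can also be turned into a pass or fail check. The following sketch assumes the standard HAProxy CSV layout, in which field 1 is the proxy name, field 2 the server name, and field 18 the status; the function name haproxy_backends_up is hypothetical. It requires the status to be exactly UP, so transitional values such as "UP 1/2" are reported as failures.

```bash
# Hypothetical check. Reads `show stat` CSV on stdin.
# $1: backend (proxy) name; remaining arguments: expected server names.
# Prints each server that is missing or not UP and returns 1 in that case.
haproxy_backends_up() {
  local backend="$1"
  shift
  awk -F, -v backend="$backend" -v expected="$*" '
    BEGIN {
      n = split(expected, names, " ")
      for (i = 1; i <= n; i++) want[names[i]] = 1
    }
    $1 == backend && ($2 in want) { seen[$2] = $18 }
    END {
      bad = 0
      for (name in want) {
        if (!(name in seen))          { print name ": missing"; bad = 1 }
        else if (seen[name] != "UP")  { print name ": " seen[name]; bad = 1 }
      }
      exit bad
    }
  '
}

# Usage on the load balancer:
#   printf "show stat\n" | sudo socat - UNIX-CONNECT:/run/haproxy/admin.sock |
#     haproxy_backends_up kubernetes_control_planes \
#       cicd-ac-k8s-cp-01 cicd-ac-k8s-cp-02 cicd-ac-k8s-cp-03
```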

35. Expected final state

Control-plane VMs:
  cicd-ac-k8s-cp-01  192.168.8.202  Ready
  cicd-ac-k8s-cp-02  192.168.8.203  Ready
  cicd-ac-k8s-cp-03  192.168.8.204  Ready

Kubernetes API endpoint:
  cicd-ac-k8s-api.aspireclan.com:443
  192.168.8.200:443

HAProxy:
  Service:        active and enabled
  Runtime socket: /run/haproxy/admin.sock
  Stats client:   socat installed
  Browser stats:  127.0.0.1:8404/stats
  Backends:
    cicd-ac-k8s-cp-01  UP
    cicd-ac-k8s-cp-02  UP
    cicd-ac-k8s-cp-03  UP

Kubernetes:
  Version:        v1.36.1
  Topology:       stacked etcd
  Control planes: 3
  CNI:            Calico v3.32.0
  Pod CIDR:       10.244.0.0/16
  Service CIDR:   10.96.0.0/12
  API readyz:     ok

Still pending:
  12 worker VMs
  worker joins, labels, and taints
  shared services
  ARC controller
  tenant runner scale sets

36. Failure handling

Terraform proposes a load-balancer change

Stop. Do not apply. Correct the control-plane map or module references until the plan is exactly three additions.

A control-plane VM has the wrong address

Check its Proxmox MAC and router reservation. Do not add a static Netplan address.

APT reports a package is unavailable

The roles already force apt update. Check DNS, internet access, Ubuntu sources, and the Kubernetes pkgs.k8s.io repository rather than modifying the template.

kubeadm init fails

Do not repeatedly run it manually. Inspect:

sudo journalctl -u kubelet -n 200 --no-pager
sudo crictl ps -a
sudo crictl logs <CONTAINER_ID>
sudo kubeadm config validate --config /etc/kubernetes/kubeadm-init-config.yaml

Correct the role or template in Git. Use kubeadm reset only as an explicitly reviewed recovery action.

The first node remains NotReady

Before Calico is installed, NotReady is expected. After the CNI workflow completes, inspect:

sudo KUBECONFIG=/etc/kubernetes/admin.conf kubectl get pods -A
sudo KUBECONFIG=/etc/kubernetes/admin.conf kubectl get tigerastatus
sudo journalctl -u kubelet -n 200 --no-pager

An additional control plane cannot join

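The uploaded certificates and their certificate key expire two hours after upload, and the bootstrap token created by the join playbook has a two-hour TTL. Rerun the approved join playbook; it generates a fresh token and re-uploads the certificates automatically.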

HAProxy still shows a backend as DOWN

On the affected control plane, verify:

sudo ss -lntp | grep ':6443'
sudo crictl ps | grep kube-apiserver
sudo journalctl -u kubelet -n 100 --no-pager

The containerd CRI health check fails even though containerd is active

Do not search for a combined dotted value such as io.containerd.cri.v1.runtime. ctr plugins ls prints plugin type and plugin ID in separate columns. Use the corrected AWK-based health check in the containerd role.

Inspect the real output:

sudo ctr plugins ls
sudo grep -nE '^disabled_plugins|SystemdCgroup' /etc/containerd/config.toml
sudo systemctl status containerd --no-pager
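To illustrate why the columns must be compared separately, here is a minimal sketch of a column-based check. It is not the health check from the containerd role, which is defined elsewhere; the function name ctr_plugin_ok is hypothetical, and the plugin type and ID to pass depend on the installed containerd version, so confirm them against the real ctr plugins ls output first.

```bash
# Hypothetical check. Reads `ctr plugins ls` output on stdin and succeeds
# only if the row whose TYPE is $1 and ID is $2 has "ok" in the last column.
ctr_plugin_ok() {
  local type="$1" id="$2"
  awk -v type="$type" -v id="$id" '
    $1 == type && $2 == id { found = 1; status = $NF }
    END { exit !(found && status == "ok") }
  '
}

# Usage (type and ID shown here are inferred from the dotted value above):
#   sudo ctr plugins ls | ctr_plugin_ok io.containerd.cri.v1 runtime
```

Passing the combined dotted value as the type never matches a row, which reproduces the failure described above.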

Ansible reports unsupported parameters for ansible.legacy.command

This means task keywords such as register, changed_when, retries, delay, or until were indented inside the module block. The error names ansible.legacy.command for both ansible.builtin.command and ansible.builtin.shell tasks. Align the keywords with the module name exactly as shown in this page.

The final HAProxy check reports socat: not found

Do not install it manually as the permanent fix. Rerun the updated 02-configure-load-balancer.yml playbook. The HAProxy role now installs both haproxy and socat, verifies the runtime socket, and calls the Runtime API.

ansible-playbook -i inventories/shared-k8s/hosts.ini --private-key ~/.ssh/id_ed25519_ansible playbooks/shared-k8s/02-configure-load-balancer.yml

37. Expected rebuild checkpoint after successful completion

Load-balancer VM provisioning                 Verified prerequisite
HAProxy configuration                         Verified prerequisite
Keepalived configuration                      Verified prerequisite
API VIP 192.168.8.200                         Verified prerequisite
Control-plane Terraform definitions           Applied
Control-plane VM provisioning                 Verified
Control-plane Ubuntu baseline                 Applied
Kubernetes prerequisites                      Applied
First control-plane bootstrap                 Verified
CNI installation                              Verified
Additional control-plane joins                Verified
Development workers                           Next page
QA workers                                    Later page
Production workers                            Later page
ARC controller                                Later page
Tenant runner scale sets                      Later page

The next page should provision the four development worker VMs, join them to this cluster, and apply the approved environment=dev and workload=github-runner labels and taints.

38. Source consistency and rebuild validation criteria

Source consistency review for a from-scratch rebuild:

- Control-plane VM IDs, MAC addresses, reserved IPs, memory, disks, tags, and four-core sizing match the cleaned Terraform source.
- The reusable proxmox-vm-group module matches terraform/modules/proxmox-vm-group.
- The control-plane inventory entries, cluster variables, and control-plane port variables match the cleaned Ansible source for this phase.
- The common, containerd, kubernetes-common, and kubernetes-control-plane role blocks match the cleaned repository.
- The Calico custom resources use 10.244.0.0/16, VXLAN, and BGP disabled.
- The control-plane workflow uses component-specific path filters, validates on dev and prod pushes, and configures only from prod.
- The workflow does not use pull_request triggers.
- The load-balancer configuration is a prerequisite from the preceding page; this workflow does not recreate the load balancer.
- Worker inventory entries are intentionally omitted at this checkpoint and are added by the later worker pages.
- No statement in this page assumes that a previous live Kubernetes cluster still exists.

A successful rebuild must independently demonstrate:

- All three control-plane VMs are reachable through their approved DHCP reservations.
- containerd and CRI validation succeed on all three nodes.
- kubeadm initializes cicd-ac-k8s-cp-01.
- Calico becomes Available.
- cicd-ac-k8s-cp-02 and cicd-ac-k8s-cp-03 join successfully.
- All three control-plane nodes report Ready.
- /readyz returns ok through the shared API endpoint.
- HAProxy shows the three control-plane backend rows through its runtime socket.
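For a manual acceptance run, the /readyz criterion can be polled in the same way the bootstrap role does with retries, delay, and until. The sketch below is illustrative; the function name wait_for_ok is hypothetical, and it treats any output other than exactly "ok" as not ready.

```bash
# Hypothetical helper: run a command until it succeeds and prints "ok".
# $1: maximum attempts, $2: seconds between attempts, rest: the command.
wait_for_ok() {
  local attempts="$1" delay="$2" i out
  shift 2
  for (( i = 1; i <= attempts; i++ )); do
    if out="$("$@" 2>/dev/null)" && [ "$out" = "ok" ]; then
      return 0
    fi
    # Sleep only when another attempt will follow.
    (( i < attempts )) && sleep "$delay"
  done
  return 1
}

# Usage on a control plane, mirroring the role's 30 retries at 10 seconds:
#   sudo bash -c '. ./wait-for-ok.sh; wait_for_ok 30 10 \
#     kubectl --kubeconfig=/etc/kubernetes/admin.conf get --raw=/readyz'
```

The script name wait-for-ok.sh in the usage comment is a placeholder. Because admin.conf points at the shared API endpoint, a successful result exercises the load balancer path as well as the API server.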

The source blocks on this page retain the corrected containerd plugin parsing, correct Ansible task-keyword indentation, and HAProxy management of socat. Treat the runtime checks as rebuild acceptance criteria rather than evidence that a previous cluster still exists.

Official references