Provision and Join the Four Kubernetes QA Workers

This is the sixth infrastructure page in the from-scratch build sequence. Follow it after the control-plane and development-worker pages have produced a healthy cluster with four development workers. It provisions the four QA ARC worker VMs, joins them to the cluster, and applies the approved QA labels and taints.

FROM-SCRATCH SEQUENCE CHECKPOINT

Required before this page
  Load-balancer VM, HAProxy, Keepalived, and API VIP configured
  Three control-plane nodes Ready
  Four development workers at 192.168.8.213-.216 Ready
  Development labels and taints verified

Implemented by this page
  QA worker Terraform definitions
  Four QA worker VM provisions
  Ubuntu baseline and Kubernetes node preparation
  QA worker joins
  QA worker labels and taints
  Verification that development workers remain intact

Implemented by later pages
  Production workers
  ARC controller
  Tenant runner scale sets

1. Scope and execution order

This page is performed in two separate promotions:

  1. Terraform promotion: create only the four QA worker VMs.
  2. Ansible promotion: after all four VMs are reachable, configure Ubuntu, containerd, Kubernetes prerequisites, join the workers, and apply the approved QA labels and taints.

Do not combine these two changes into the same production promotion. The Ansible workflow must not start until Terraform has created all four VMs and DHCP has assigned the approved addresses.

The approved branch model is:

feature/*
    ↓ pull request
   dev
    ↓ validation and Terraform plan only
 dev → prod pull request
    ↓ review and shared-k8s approval
   prod
    ↓ Terraform apply or Ansible configuration

Not used by this infrastructure execution flow:
local, qa, main

Branch rule

dev performs validation and Terraform plan only. prod performs Terraform apply or Ansible configuration. Do not use local, qa, or main in this infrastructure execution flow.

2. Approved QA worker allocation

cicd-ac-k8s-qa-wk-01
  VM ID:       3156209
  MAC:         aa:bb:cc:07:0f:01
  Reserved IP: 192.168.8.209
  CPU:         4 vCPU
  RAM:         16384 MB
  Disk:        scsi0, 250G, local-lvm

cicd-ac-k8s-qa-wk-02
  VM ID:       3156210
  MAC:         aa:bb:cc:07:0f:02
  Reserved IP: 192.168.8.210
  CPU:         4 vCPU
  RAM:         16384 MB
  Disk:        scsi0, 250G, local-lvm

cicd-ac-k8s-qa-wk-03
  VM ID:       3156211
  MAC:         aa:bb:cc:07:0f:03
  Reserved IP: 192.168.8.211
  CPU:         4 vCPU
  RAM:         16384 MB
  Disk:        scsi0, 250G, local-lvm

cicd-ac-k8s-qa-wk-04
  VM ID:       3156212
  MAC:         aa:bb:cc:07:0f:04
  Reserved IP: 192.168.8.212
  CPU:         4 vCPU
  RAM:         16384 MB
  Disk:        scsi0, 250G, local-lvm

Shared values
  Template:    tmplt-ub-26-min-base / VM ID 90000
  Node:        pve
  Bridge:      vmbr0
  Environment: qa
  Workload:    github-runner

The environment-specific address ranges are intentionally not ordered dev-first:

Approved worker address order
  Production workers:  192.168.8.205-.208
  QA workers:          192.168.8.209-.212
  Development workers: 192.168.8.213-.216

Each VM ID ends with the final octet of its reserved IP address, so the QA workers use VM IDs 3156209 through 3156212.
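The numbering pattern in the approved allocation can be sketched in shell. The helper name qa_worker_allocation is hypothetical and is not part of the repository. It assumes only the pattern visible in the table above: VM ID = 3156000 + final IP octet, and the last MAC byte equals the worker index.

```shell
# Hypothetical helper (not in the repository): print the approved allocation
# for QA worker N (1-4), assuming VM ID = 3156000 + final IP octet and
# last MAC byte = worker index.
qa_worker_allocation() {
  index="$1"
  octet=$((208 + index))
  printf 'cicd-ac-k8s-qa-wk-%02d vmid=%d mac=aa:bb:cc:07:0f:%02x ip=192.168.8.%d\n' \
    "$index" "$((3156000 + octet))" "$index" "$octet"
}

# Print all four QA workers for comparison against the approved table.
for i in 1 2 3 4; do
  qa_worker_allocation "$i"
done
```

Comparing this output against the router reservations and the Terraform locals is a quick way to spot a transposed VM ID or MAC address before any promotion.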

Confirm the router has all four DHCP reservations before running Terraform. Ubuntu must continue using DHCP; do not configure static Netplan addresses.

Resource and identity lock

Do not change the approved VM IDs, MAC addresses, IP reservations, scsi0 disk slot, local-lvm storage, or QA environment assignment while following this page.


Part A — Provision the QA Worker VMs with Terraform

3. Terraform files changed

terraform/stacks/shared-k8s/main.tf
terraform/stacks/shared-k8s/outputs.tf

The proxmox-vm and proxmox-vm-group modules created by the preceding pages remain unchanged. This page adds a new QA-worker map to the cumulative shared Kubernetes stack.

4. Create the Terraform feature branch from dev

Run from Windows PowerShell:

cd D:\code\ASPIRECLAN-LLC-Org\ac-cicd-infra

git switch dev
git pull --ff-only origin dev

git switch -c feature/provision-k8s-qa-workers

5. Extend terraform/stacks/shared-k8s/main.tf

Do not replace or edit the load-balancer, control-plane, or development-worker definitions created by the preceding pages. Append this QA block after the existing development-worker module:

locals {
  qa_workers = {
    qa_wk01 = {
      name          = "cicd-ac-k8s-qa-wk-01"
      description   = "Aspireclan shared Kubernetes QA ARC worker 01"
      vmid          = 3156209
      target_node   = "pve"
      template_name = "tmplt-ub-26-min-base"
      cores         = 4
      memory_mb     = 16384
      disk_size     = "250G"
      storage       = "local-lvm"
      bridge        = "vmbr0"
      mac_address   = "aa:bb:cc:07:0f:01"
      tags = [
        "ac-cicd",
        "shared-k8s",
        "worker",
        "qa",
        "arc-runner",
        "terraform",
        "ansible",
      ]
    }

    qa_wk02 = {
      name          = "cicd-ac-k8s-qa-wk-02"
      description   = "Aspireclan shared Kubernetes QA ARC worker 02"
      vmid          = 3156210
      target_node   = "pve"
      template_name = "tmplt-ub-26-min-base"
      cores         = 4
      memory_mb     = 16384
      disk_size     = "250G"
      storage       = "local-lvm"
      bridge        = "vmbr0"
      mac_address   = "aa:bb:cc:07:0f:02"
      tags = [
        "ac-cicd",
        "shared-k8s",
        "worker",
        "qa",
        "arc-runner",
        "terraform",
        "ansible",
      ]
    }

    qa_wk03 = {
      name          = "cicd-ac-k8s-qa-wk-03"
      description   = "Aspireclan shared Kubernetes QA ARC worker 03"
      vmid          = 3156211
      target_node   = "pve"
      template_name = "tmplt-ub-26-min-base"
      cores         = 4
      memory_mb     = 16384
      disk_size     = "250G"
      storage       = "local-lvm"
      bridge        = "vmbr0"
      mac_address   = "aa:bb:cc:07:0f:03"
      tags = [
        "ac-cicd",
        "shared-k8s",
        "worker",
        "qa",
        "arc-runner",
        "terraform",
        "ansible",
      ]
    }

    qa_wk04 = {
      name          = "cicd-ac-k8s-qa-wk-04"
      description   = "Aspireclan shared Kubernetes QA ARC worker 04"
      vmid          = 3156212
      target_node   = "pve"
      template_name = "tmplt-ub-26-min-base"
      cores         = 4
      memory_mb     = 16384
      disk_size     = "250G"
      storage       = "local-lvm"
      bridge        = "vmbr0"
      mac_address   = "aa:bb:cc:07:0f:04"
      tags = [
        "ac-cicd",
        "shared-k8s",
        "worker",
        "qa",
        "arc-runner",
        "terraform",
        "ansible",
      ]
    }
  }
}

module "qa_workers" {
  source = "../../modules/proxmox-vm-group"

  vms = local.qa_workers
}

The four workers use the approved modified sizing of 4 vCPU, 16 GB RAM, and 250 GB disk per VM.

6. Extend terraform/stacks/shared-k8s/outputs.tf

Append:

output "qa_workers" {
  description = "Shared Kubernetes QA worker VMs."

  value = {
    qa_wk01 = merge(module.qa_workers.vms["qa_wk01"], {
      reserved_ip = "192.168.8.209"
      mac_address = "aa:bb:cc:07:0f:01"
      environment = "qa"
    })
    qa_wk02 = merge(module.qa_workers.vms["qa_wk02"], {
      reserved_ip = "192.168.8.210"
      mac_address = "aa:bb:cc:07:0f:02"
      environment = "qa"
    })
    qa_wk03 = merge(module.qa_workers.vms["qa_wk03"], {
      reserved_ip = "192.168.8.211"
      mac_address = "aa:bb:cc:07:0f:03"
      environment = "qa"
    })
    qa_wk04 = merge(module.qa_workers.vms["qa_wk04"], {
      reserved_ip = "192.168.8.212"
      mac_address = "aa:bb:cc:07:0f:04"
      environment = "qa"
    })
  }
}

7. Confirm the existing Terraform workflow contract

The Terraform workflows created by the preceding pages must continue to implement:

Terraform plan workflow
  push branch:           dev
  manual dispatch:       supported
  pull_request trigger:  not used
  action:                fmt, init, validate, plan

Terraform apply workflow
  push branch:           prod
  manual dispatch:       supported from prod
  action:                fmt, init, validate, saved plan, apply

Persistent state
  /var/lib/ac-cicd-infra/terraform-state/shared-k8s/terraform.tfstate

No new workflow is required for this Terraform change. The existing path filters already cover the shared stack.

8. Review and commit the Terraform change

git status
git diff --check
git diff --stat

git diff -- \
  terraform/stacks/shared-k8s/main.tf \
  terraform/stacks/shared-k8s/outputs.tf

Confirm:

  • The load balancer, three control planes, and four development-worker definitions created by the preceding pages remain unchanged.
  • Exactly four VM additions are present.
  • VM IDs are 3156209 through 3156212.
  • MAC addresses are aa:bb:cc:07:0f:01 through aa:bb:cc:07:0f:04.
  • Each worker uses 4 cores, 16384 MB RAM, and a 250G scsi0 disk on local-lvm.
  • No Terraform state, Proxmox token, kubeconfig, join command, or private SSH key is staged.

Commit and push:

git add \
  terraform/stacks/shared-k8s/main.tf \
  terraform/stacks/shared-k8s/outputs.tf

git commit -m "Provision shared Kubernetes QA workers"

git push -u origin feature/provision-k8s-qa-workers

9. Create the Terraform pull request into dev

gh pr create \
  --base dev \
  --head feature/provision-k8s-qa-workers \
  --title "Provision shared Kubernetes QA workers" \
  --body "Adds the four approved QA worker VMs to the shared Kubernetes Terraform stack."

After merge, the dev plan must end with:

Plan: 4 to add, 0 to change, 0 to destroy.

Terraform stop conditions

Do not promote to prod if the plan proposes anything other than creating the four approved QA workers. In particular, stop if Terraform proposes any update, replacement, or deletion of the load balancer, the control planes, or the completed development workers.
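The stop condition on the plan summary could be checked mechanically with something like the following sketch. The function name check_plan_summary is hypothetical and is not an existing workflow step. It assumes the plan text was captured without ANSI color codes, for example from terraform plan -no-color.

```shell
# Hypothetical gate (not an existing workflow step): succeed only when the
# last "Plan:" line matches the approved summary exactly.
# Assumes the plan text was produced with `terraform plan -no-color`.
check_plan_summary() {
  summary=$(printf '%s\n' "$1" | grep -E '^Plan: ' | tail -n 1 || true)
  if [ "$summary" = "Plan: 4 to add, 0 to change, 0 to destroy." ]; then
    echo "OK"
  else
    echo "STOP: ${summary:-no plan summary found}"
    return 1
  fi
}
```

A matching summary is necessary but not sufficient: the counts do not prove that the four additions are the QA workers, so the resource addresses in the plan still need a manual review.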

10. Promote the Terraform change from dev to prod

gh pr create \
  --base prod \
  --head dev \
  --title "Provision shared Kubernetes QA workers" \
  --body "Promotes the validated four QA worker Terraform plan to prod."

After merge and shared-k8s environment approval, the production workflow applies the saved plan and writes the new worker resources to the existing persistent Terraform state.

11. Verify the worker VMs in Proxmox

Run on the Proxmox host:

qm status 3156209
qm config 3156209

qm status 3156210
qm config 3156210

qm status 3156211
qm config 3156211

qm status 3156212
qm config 3156212

For every VM, confirm:

  • Status is running.
  • CPU is 4 cores.
  • RAM is 16384 MB.
  • scsi0 is on local-lvm with size 250G.
  • The expected MAC address is present.
  • onboot remains enabled.
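The configuration items in this checklist can be compared against qm config output with a small filter. This is a sketch only: check_qm_config is a hypothetical name, and the patterns assume qm config prints key: value lines such as cores: 4 and scsi0: local-lvm:...,size=250G. The running status from qm status is not covered and still needs a separate look.

```shell
# Hypothetical helper (not in the repository).
# Usage: qm config <vmid> | check_qm_config <expected-mac>
# Assumes `qm config` prints "key: value" lines.
check_qm_config() {
  expected_mac="$1"
  config=$(cat)
  rc=0
  # Each pattern corresponds to one checklist item; matching ignores case
  # because Proxmox may print MAC addresses in upper case.
  for pattern in \
    '^cores: 4$' \
    '^memory: 16384$' \
    '^onboot: 1$' \
    '^scsi0: local-lvm:.*size=250G' \
    "^net0: .*=${expected_mac},"
  do
    if ! printf '%s\n' "$config" | grep -Eiq "$pattern"; then
      echo "MISMATCH: ${pattern}"
      rc=1
    fi
  done
  if [ "$rc" -eq 0 ]; then
    echo "OK"
  fi
  return "$rc"
}
```

On the Proxmox host this would be run once per VM, for example qm config 3156209 | check_qm_config aa:bb:cc:07:0f:01.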

12. Verify DHCP, SSH, and sudo

Run from prod-terraform-deploy-02:

for ip in 209 210 211 212; do
  echo "=== 192.168.8.${ip} ==="

  ping -c 2 -W 2 "192.168.8.${ip}"

  ssh \
    -i ~/.ssh/id_ed25519_ansible \
    -o IdentitiesOnly=yes \
    -o BatchMode=yes \
    -o ConnectTimeout=10 \
    "acllc@192.168.8.${ip}" \
    'hostnamectl --static; ip -brief address; sudo -n whoami'
done

Expected before Ansible:

  • .209 through .212 respond.
  • The Ansible automation key authenticates as acllc.
  • sudo -n whoami returns root.
  • The Ubuntu hostname may still show the base-template hostname.

Stop here until all four workers pass SSH and passwordless sudo checks.


Part B — Configure and Join the QA Workers with Ansible

13. QA-only Ansible files changed

ansible/inventories/shared-k8s/hosts.ini
ansible/inventories/shared-k8s/group_vars/qa_workers.yml
ansible/playbooks/shared-k8s/07-join-qa-workers.yml
ansible/playbooks/shared-k8s/08-label-and-taint-qa-workers.yml
.github/workflows/ansible-configure-qa-workers.yml

The development-worker implementation created by the preceding page must remain intact. This QA phase does not replace the development join playbook, development labels-and-taints playbook, development group variables, or development workflow.

Preserve without replacement
  ansible/inventories/shared-k8s/group_vars/dev_workers.yml
  ansible/roles/common/**
  ansible/roles/containerd/**
  ansible/roles/kubernetes-common/**
  ansible/roles/kubernetes-worker/**
  ansible/playbooks/shared-k8s/01-common-baseline.yml
  ansible/playbooks/shared-k8s/03-prepare-kubernetes-nodes.yml
  ansible/playbooks/shared-k8s/07-join-dev-workers.yml
  ansible/playbooks/shared-k8s/08-label-and-taint-workers.yml
  .github/workflows/ansible-configure-dev-workers.yml

The QA phase makes only these cumulative or new-file changes:

Update cumulatively
  ansible/inventories/shared-k8s/hosts.ini

Create new QA-only files
  ansible/inventories/shared-k8s/group_vars/qa_workers.yml
  ansible/playbooks/shared-k8s/07-join-qa-workers.yml
  ansible/playbooks/shared-k8s/08-label-and-taint-qa-workers.yml
  .github/workflows/ansible-configure-qa-workers.yml

The common, containerd, Kubernetes-common, and Kubernetes-worker roles created by the preceding pages contain the required permanent fixes and are reused without replacement:

  • Forced APT cache refresh before package installation.
  • Package-install retries.
  • ansible_facts access instead of deprecated injected variables.
  • /var/tmp/ansible-acllc as the remote temporary directory.
  • Containerd 2.x-compatible CRI validation.
  • Correct task-level indentation.
  • Explicit crictl runtime and image endpoints.

14. Create the Ansible feature branch from dev

Create this branch only after the production Terraform apply has completed successfully:

cd D:\code\ASPIRECLAN-LLC-Org\ac-cicd-infra

git switch dev
git pull --ff-only origin dev

git switch -c feature/configure-k8s-qa-workers

15. Update the shared inventory cumulatively

At this rebuild stage, the inventory must contain the load balancer, control planes, and development workers created by the preceding pages, plus the four new QA workers. Replace ansible/inventories/shared-k8s/hosts.ini with this complete cumulative version and no other:

[load_balancers]
cicd-ac-k8s-lb-01 ansible_host=192.168.8.201 ansible_user=acllc node_primary_ip=192.168.8.201 node_interface=ens18

[first_control_plane]
cicd-ac-k8s-cp-01 ansible_host=192.168.8.202 ansible_user=acllc node_primary_ip=192.168.8.202 node_interface=ens18

[additional_control_planes]
cicd-ac-k8s-cp-02 ansible_host=192.168.8.203 ansible_user=acllc node_primary_ip=192.168.8.203 node_interface=ens18
cicd-ac-k8s-cp-03 ansible_host=192.168.8.204 ansible_user=acllc node_primary_ip=192.168.8.204 node_interface=ens18

[control_planes:children]
first_control_plane
additional_control_planes

[dev_workers]
cicd-ac-k8s-dev-wk-01 ansible_host=192.168.8.213 ansible_user=acllc node_primary_ip=192.168.8.213 node_interface=ens18
cicd-ac-k8s-dev-wk-02 ansible_host=192.168.8.214 ansible_user=acllc node_primary_ip=192.168.8.214 node_interface=ens18
cicd-ac-k8s-dev-wk-03 ansible_host=192.168.8.215 ansible_user=acllc node_primary_ip=192.168.8.215 node_interface=ens18
cicd-ac-k8s-dev-wk-04 ansible_host=192.168.8.216 ansible_user=acllc node_primary_ip=192.168.8.216 node_interface=ens18

[qa_workers]
cicd-ac-k8s-qa-wk-01 ansible_host=192.168.8.209 ansible_user=acllc node_primary_ip=192.168.8.209 node_interface=ens18
cicd-ac-k8s-qa-wk-02 ansible_host=192.168.8.210 ansible_user=acllc node_primary_ip=192.168.8.210 node_interface=ens18
cicd-ac-k8s-qa-wk-03 ansible_host=192.168.8.211 ansible_user=acllc node_primary_ip=192.168.8.211 node_interface=ens18
cicd-ac-k8s-qa-wk-04 ansible_host=192.168.8.212 ansible_user=acllc node_primary_ip=192.168.8.212 node_interface=ens18

[prod_workers]

[workers:children]
dev_workers
qa_workers
prod_workers

[k8s_cluster:children]
control_planes
workers

[all:vars]
ansible_python_interpreter=/usr/bin/python3

Before continuing, confirm both worker groups remain present:

grep -nE 'cicd-ac-k8s-dev-wk-0[1-4]' ansible/inventories/shared-k8s/hosts.ini

grep -nE 'cicd-ac-k8s-qa-wk-0[1-4]' ansible/inventories/shared-k8s/hosts.ini
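The grep commands show which hosts are present but not which group they sit in. A stricter check could count the host lines per group, as in this sketch. The helper name count_group_hosts is hypothetical, and it assumes the simple INI layout shown above with one host per line under a [group] header.

```shell
# Hypothetical helper (not in the repository): count host lines in one
# INI inventory group.
# Usage: count_group_hosts <group> < hosts.ini
count_group_hosts() {
  awk -v group="[$1]" '
    # A section header switches counting on only for the requested group.
    /^\[/ { in_group = ($0 == group); next }
    # Count non-empty, non-comment lines inside the requested group.
    in_group && NF > 0 && $0 !~ /^[#;]/ { count++ }
    END { print count + 0 }
  '
}
```

Against the cumulative inventory, dev_workers and qa_workers should each report 4, and prod_workers should report 0 until the production-worker page is followed.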

16. Create QA worker variables without changing development variables

Do not edit or replace:

ansible/inventories/shared-k8s/group_vars/dev_workers.yml

Create the new file ansible/inventories/shared-k8s/group_vars/qa_workers.yml:

---
worker_environment: qa
worker_workload: github-runner

kubernetes_node_tcp_ports:
  - "10250"

calico_node_tcp_ports:
  - "5473"

calico_node_udp_ports:
  - "4789"

worker_labels:
  environment: qa
  workload: github-runner

worker_taints:
  - key: environment
    value: qa
    effect: NoSchedule

The existing control_planes.yml, generic Kubernetes firewall tasks, node-preparation playbook, and Kubernetes-worker role remain unchanged from the completed development-worker phase.

17. Preserve the existing development-worker playbooks

Do not edit, rename, or replace:

ansible/playbooks/shared-k8s/07-join-dev-workers.yml
ansible/playbooks/shared-k8s/08-label-and-taint-workers.yml

Those files continue to own development-worker reconciliation.

18. Add a QA-specific worker-join playbook

Create the new file:

ansible/playbooks/shared-k8s/07-join-qa-workers.yml

Use:

---
- name: Generate a fresh Kubernetes worker join command for QA workers
  hosts: first_control_plane
  become: true
  gather_facts: false

  tasks:
    - name: Confirm the Kubernetes API is ready
      ansible.builtin.command:
        cmd: kubectl --kubeconfig=/etc/kubernetes/admin.conf get --raw=/readyz
      register: qa_worker_join_api_ready
      changed_when: false
      retries: 12
      delay: 10
      until: qa_worker_join_api_ready.stdout | trim == "ok"

    - name: Generate a fresh QA worker bootstrap-token join command
      ansible.builtin.command:
        cmd: kubeadm token create --ttl 2h --print-join-command
      register: generated_qa_worker_join_command
      changed_when: true
      no_log: true

    - name: Store the temporary QA worker join command in memory
      ansible.builtin.set_fact:
        shared_qa_worker_join_command: "{{ generated_qa_worker_join_command.stdout }}"
      no_log: true

- name: Join the QA workers one at a time
  hosts: qa_workers
  serial: 1
  become: true
  gather_facts: true

  vars:
    worker_join_command: >-
      {{ hostvars[groups['first_control_plane'][0]].shared_qa_worker_join_command }}

  roles:
    - role: kubernetes-worker

- name: Verify all QA workers are Ready
  hosts: first_control_plane
  become: true
  gather_facts: false

  environment:
    KUBECONFIG: /etc/kubernetes/admin.conf

  tasks:
    - name: Wait for each QA worker to become Ready
      ansible.builtin.command:
        cmd: >-
          kubectl wait
          --for=condition=Ready
          node/{{ item }}
          --timeout=15m
      loop: "{{ groups['qa_workers'] }}"
      changed_when: false

    - name: Display the joined QA workers
      ansible.builtin.command:
        cmd: >-
          kubectl get nodes
          -l environment=qa,workload=github-runner
          -L environment,workload
          -o wide
      register: joined_qa_workers
      changed_when: false
      failed_when: false

    - name: Print the QA worker table
      ansible.builtin.debug:
        var: joined_qa_workers.stdout_lines

This playbook targets only qa_workers. It cannot join or modify development workers.

19. Add a QA-specific labels-and-taints playbook

Create the new file:

ansible/playbooks/shared-k8s/08-label-and-taint-qa-workers.yml

Use:

---
- name: Apply QA worker labels and taints
  hosts: first_control_plane
  become: true
  gather_facts: false

  environment:
    KUBECONFIG: /etc/kubernetes/admin.conf

  tasks:
    - name: Apply the approved QA worker labels
      ansible.builtin.command:
        cmd: >-
          kubectl label node {{ item }}
          environment={{ hostvars[item].worker_environment | default('qa') }}
          workload={{ hostvars[item].worker_workload | default('github-runner') }}
          --overwrite
      loop: "{{ groups['qa_workers'] }}"
      register: qa_worker_label_results
      changed_when: "'not labeled' not in qa_worker_label_results.stdout"

    - name: Apply the approved QA worker taint
      ansible.builtin.command:
        cmd: >-
          kubectl taint node {{ item }}
          environment=qa:NoSchedule
          --overwrite
      loop: "{{ groups['qa_workers'] }}"
      register: qa_worker_taint_results
      changed_when: "'not tainted' not in qa_worker_taint_results.stdout"

    - name: Display QA worker labels
      ansible.builtin.command:
        cmd: >-
          kubectl get nodes
          -l environment=qa,workload=github-runner
          -L environment,workload
          -o wide
      register: labeled_qa_workers
      changed_when: false

    - name: Print labeled QA workers
      ansible.builtin.debug:
        var: labeled_qa_workers.stdout_lines

    - name: Verify the QA taint on every worker
      ansible.builtin.shell:
        executable: /bin/bash
        cmd: |
          set -euo pipefail

          kubectl get node {{ item }} \
            -o jsonpath='{range .spec.taints[*]}{.key}={.value}:{.effect}{"\n"}{end}' |
            grep -Fx 'environment=qa:NoSchedule'
      loop: "{{ groups['qa_workers'] }}"
      changed_when: false

Every QA worker receives:

environment=qa
workload=github-runner

environment=qa:NoSchedule

The QA default is explicitly qa; it does not fall back to dev.
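The taint verification task depends on grep -Fx, which accepts only a whole line that is exactly environment=qa:NoSchedule. The sketch below isolates that match so its behavior can be tried against sample input. The function name has_qa_taint and the sample lines are illustrative and do not come from the cluster.

```shell
# Same match the verification task uses: -F treats the pattern as a fixed
# string and -x requires it to match the entire line.
has_qa_taint() {
  grep -Fxq 'environment=qa:NoSchedule'
}

# Sample taint lines in the "key=value:effect" form the jsonpath template prints.
printf 'environment=qa:NoSchedule\n' | has_qa_taint && echo "exact taint: pass"
printf 'environment=qa:NoExecute\n' | has_qa_taint || echo "wrong effect: fail"
printf 'environment=qa-old:NoSchedule\n' | has_qa_taint || echo "wrong value: fail"
```

Because the match is on the whole line, a node that carries the right key with a different value or effect fails the check instead of passing on a partial match.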

20. Add the QA-worker GitHub Actions workflow

Create .github/workflows/ansible-configure-qa-workers.yml:

name: Ansible Configure - Kubernetes QA Workers

on:
  push:
    branches:
      - dev
      - prod
    paths:
      - "ansible/inventories/shared-k8s/group_vars/qa_workers.yml"
      - "ansible/playbooks/shared-k8s/07-join-qa-workers.yml"
      - "ansible/playbooks/shared-k8s/08-label-and-taint-qa-workers.yml"
      - ".github/workflows/ansible-configure-qa-workers.yml"

  workflow_dispatch:

permissions:
  contents: read

concurrency:
  group: shared-k8s-ansible
  cancel-in-progress: false

env:
  ANSIBLE_CONFIG: ${{ github.workspace }}/ansible/ansible.cfg

jobs:
  validate:
    name: Validate QA-worker Ansible configuration
    runs-on:
      - self-hosted
      - Linux
      - X64
      - prod
      - terraform
      - deploy
      - ac-cicd-infra

    steps:
      - name: Checkout repository
        uses: actions/checkout@v5

      - name: Verify Ansible
        shell: bash
        run: |
          set -euo pipefail
          ansible --version
          ansible-playbook --version

      - name: Validate the cumulative shared inventory
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail

          ansible-inventory \
            -i inventories/shared-k8s/hosts.ini \
            --graph

          for host in \
            cicd-ac-k8s-dev-wk-01 \
            cicd-ac-k8s-dev-wk-02 \
            cicd-ac-k8s-dev-wk-03 \
            cicd-ac-k8s-dev-wk-04
          do
            grep -Fq "${host} " inventories/shared-k8s/hosts.ini
          done

          for host in \
            cicd-ac-k8s-qa-wk-01 \
            cicd-ac-k8s-qa-wk-02 \
            cicd-ac-k8s-qa-wk-03 \
            cicd-ac-k8s-qa-wk-04
          do
            grep -Fq "${host} " inventories/shared-k8s/hosts.ini
          done

      - name: Confirm completed development-worker files remain present
        shell: bash
        run: |
          set -euo pipefail

          required_dev_files=(
            "ansible/inventories/shared-k8s/group_vars/dev_workers.yml"
            "ansible/playbooks/shared-k8s/07-join-dev-workers.yml"
            "ansible/playbooks/shared-k8s/08-label-and-taint-workers.yml"
            ".github/workflows/ansible-configure-dev-workers.yml"
          )

          for file in "${required_dev_files[@]}"; do
            if [ ! -f "${file}" ]; then
              echo "ERROR: Completed development-worker file is missing: ${file}"
              exit 1
            fi
          done

      - name: Syntax-check the QA-worker playbooks
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail

          for playbook in \
            playbooks/shared-k8s/01-common-baseline.yml \
            playbooks/shared-k8s/03-prepare-kubernetes-nodes.yml \
            playbooks/shared-k8s/07-join-qa-workers.yml \
            playbooks/shared-k8s/08-label-and-taint-qa-workers.yml
          do
            ansible-playbook \
              -i inventories/shared-k8s/hosts.ini \
              "${playbook}" \
              --syntax-check
          done

  configure:
    name: Configure and join QA workers
    needs:
      - validate

    if: >-
      (github.event_name == 'push' && github.ref_name == 'prod') ||
      (github.event_name == 'workflow_dispatch' && github.ref_name == 'prod')

    environment:
      name: shared-k8s

    runs-on:
      - self-hosted
      - Linux
      - X64
      - prod
      - terraform
      - deploy
      - ac-cicd-infra

    timeout-minutes: 180

    steps:
      - name: Checkout repository
        uses: actions/checkout@v5

      - name: Verify the production branch
        shell: bash
        run: |
          set -euo pipefail

          if [ "${GITHUB_REF_NAME}" != "prod" ]; then
            echo "ERROR: QA-worker configuration is permitted only from prod."
            exit 1
          fi

      - name: Prepare the existing Ansible SSH key
        shell: bash
        run: |
          set -euo pipefail

          KEY_PATH="${HOME}/.ssh/id_ed25519_ansible"

          if [ ! -f "${KEY_PATH}" ]; then
            echo "ERROR: Missing Ansible key: ${KEY_PATH}"
            exit 1
          fi

          chmod 600 "${KEY_PATH}"
          echo "ANSIBLE_PRIVATE_KEY_FILE=${KEY_PATH}" >> "${GITHUB_ENV}"

      - name: Refresh QA-worker SSH host keys
        shell: bash
        run: |
          set -euo pipefail

          mkdir -p "${HOME}/.ssh"
          chmod 700 "${HOME}/.ssh"
          touch "${HOME}/.ssh/known_hosts"
          chmod 600 "${HOME}/.ssh/known_hosts"

          for ip in 192.168.8.209 192.168.8.210 192.168.8.211 192.168.8.212; do
            ssh-keygen -f "${HOME}/.ssh/known_hosts" -R "${ip}" || true

            captured=false
            for attempt in $(seq 1 60); do
              if ssh-keyscan -T 5 -H "${ip}" >> "${HOME}/.ssh/known_hosts" 2>/dev/null; then
                echo "SSH host key captured for ${ip}."
                captured=true
                break
              fi

              echo "Waiting for SSH on ${ip} (attempt ${attempt}/60)..."
              sleep 10
            done

            if [ "${captured}" != "true" ]; then
              echo "ERROR: Unable to capture SSH host key for ${ip}."
              exit 1
            fi
          done

      - name: Prepare the Ansible remote temporary directory
        shell: bash
        run: |
          set -euo pipefail

          for ip in 192.168.8.209 192.168.8.210 192.168.8.211 192.168.8.212; do
            ssh \
              -i "${ANSIBLE_PRIVATE_KEY_FILE}" \
              -o IdentitiesOnly=yes \
              -o BatchMode=yes \
              "acllc@${ip}" \
              'sudo install -d -m 0700 -o acllc -g acllc /var/tmp/ansible-acllc'
          done

      - name: Verify Ansible connectivity
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail

          ansible \
            -i inventories/shared-k8s/hosts.ini \
            qa_workers \
            --private-key "${ANSIBLE_PRIVATE_KEY_FILE}" \
            -m ping

      - name: Confirm the existing development workers remain Ready
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail

          ansible \
            -i inventories/shared-k8s/hosts.ini \
            first_control_plane \
            --private-key "${ANSIBLE_PRIVATE_KEY_FILE}" \
            -b \
            -m shell \
            -a '
              set -e
              export KUBECONFIG=/etc/kubernetes/admin.conf

              for node in cicd-ac-k8s-dev-wk-01 cicd-ac-k8s-dev-wk-02 cicd-ac-k8s-dev-wk-03 cicd-ac-k8s-dev-wk-04
              do
                kubectl wait --for=condition=Ready "node/${node}" --timeout=2m
              done
            '

      - name: Confirm the existing Kubernetes API is healthy
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail

          ansible \
            -i inventories/shared-k8s/hosts.ini \
            first_control_plane \
            --private-key "${ANSIBLE_PRIVATE_KEY_FILE}" \
            -b \
            -m command \
            -a 'kubectl --kubeconfig=/etc/kubernetes/admin.conf get --raw=/readyz'

      - name: Apply the common Ubuntu baseline
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail

          ansible-playbook \
            -i inventories/shared-k8s/hosts.ini \
            --private-key "${ANSIBLE_PRIVATE_KEY_FILE}" \
            --limit qa_workers \
            playbooks/shared-k8s/01-common-baseline.yml

      - name: Prepare containerd and Kubernetes prerequisites
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail

          ansible-playbook \
            -i inventories/shared-k8s/hosts.ini \
            --private-key "${ANSIBLE_PRIVATE_KEY_FILE}" \
            --limit qa_workers \
            playbooks/shared-k8s/03-prepare-kubernetes-nodes.yml

      - name: Join only the QA workers
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail

          ansible-playbook \
            -i inventories/shared-k8s/hosts.ini \
            --private-key "${ANSIBLE_PRIVATE_KEY_FILE}" \
            playbooks/shared-k8s/07-join-qa-workers.yml

      - name: Apply only QA labels and taints
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail

          ansible-playbook \
            -i inventories/shared-k8s/hosts.ini \
            --private-key "${ANSIBLE_PRIVATE_KEY_FILE}" \
            playbooks/shared-k8s/08-label-and-taint-qa-workers.yml

      - name: Verify QA workers and preserve development workers
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail

          ansible \
            -i inventories/shared-k8s/hosts.ini \
            first_control_plane \
            --private-key "${ANSIBLE_PRIVATE_KEY_FILE}" \
            -b \
            -m shell \
            -a '
              set -e
              export KUBECONFIG=/etc/kubernetes/admin.conf

              echo "=== QA WORKERS ==="
              kubectl get nodes -l environment=qa,workload=github-runner -L environment,workload -o wide

              echo "=== DEVELOPMENT WORKERS — MUST REMAIN PRESENT ==="
              kubectl get nodes -l environment=dev,workload=github-runner -L environment,workload -o wide

              kubectl get --raw=/readyz
            '

The QA workflow:

  • Uses QA-specific playbook filenames.
  • Targets only qa_workers for baseline and node preparation.
  • Verifies all completed development workers still exist in inventory.
  • Verifies completed development workers remain Ready.
  • Does not call the development join or labels playbooks.
  • Uses QA addresses .209-.212.

Workflow behavior:

Event                      Result
Push to dev                Cumulative inventory and QA playbook validation only
Push to prod               QA baseline, preparation, join, labels, taints, and verification
Manual dispatch from prod  Idempotent QA-worker reconciliation
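The gating behind this table follows from the workflow definition: the validate job has no condition, while the configure job runs only for a push or manual dispatch on prod. The sketch below models that decision. The function jobs_for_event is a hypothetical illustration, not part of the workflow, and it assumes a push has already matched the branch and path filters.

```shell
# Hypothetical model of the workflow gating (illustration only).
# $1: GitHub event name, $2: branch name.
# Assumes a push has already matched the branch and path filters.
jobs_for_event() {
  case "$1:$2" in
    push:prod | workflow_dispatch:prod) echo "validate configure" ;;
    push:dev | workflow_dispatch:*) echo "validate" ;;
    *) echo "none" ;;
  esac
}

jobs_for_event push dev
jobs_for_event push prod
jobs_for_event workflow_dispatch prod
```

A manual dispatch from any branch other than prod therefore stops after validation, which matches the branch rule that only prod performs configuration.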

21. Review the QA-only change and prove development files are unchanged

git status
git diff --check
git diff --stat

git diff -- \
  ansible/inventories/shared-k8s/hosts.ini \
  ansible/inventories/shared-k8s/group_vars/qa_workers.yml \
  ansible/playbooks/shared-k8s/07-join-qa-workers.yml \
  ansible/playbooks/shared-k8s/08-label-and-taint-qa-workers.yml \
  .github/workflows/ansible-configure-qa-workers.yml

git diff --exit-code -- \
  ansible/inventories/shared-k8s/group_vars/dev_workers.yml \
  ansible/playbooks/shared-k8s/07-join-dev-workers.yml \
  ansible/playbooks/shared-k8s/08-label-and-taint-workers.yml \
  .github/workflows/ansible-configure-dev-workers.yml

The final git diff --exit-code command must print nothing and return exit code 0. Exit code 1, or any diff output, means a completed development-worker file was modified; restore it before committing.
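
Plain git diff compares the working tree with the index, so it does not report changes that are already staged or committed on the feature branch. A possible complementary check, sketched below, filters a list of changed paths against the four development-worker files. The function name protected_dev_files_touched is hypothetical, and comparing against origin/dev in the usage comment is an assumption about how the remote branch is named.

```shell
# Hypothetical helper: reads changed paths (one per line) on stdin, prints any
# that are completed development-worker files, and returns 1 if one was found.
protected_dev_files_touched() {
  touched="$(grep -Fx \
    -e 'ansible/inventories/shared-k8s/group_vars/dev_workers.yml' \
    -e 'ansible/playbooks/shared-k8s/07-join-dev-workers.yml' \
    -e 'ansible/playbooks/shared-k8s/08-label-and-taint-workers.yml' \
    -e '.github/workflows/ansible-configure-dev-workers.yml' || true)"
  if [ -n "${touched}" ]; then
    printf '%s\n' "${touched}"
    return 1
  fi
  return 0
}

# Example usage (assumes the base branch is available as origin/dev):
#   git diff --name-only origin/dev...HEAD | protected_dev_files_touched
```

The match is on whole lines with fixed strings, so the QA files with similar names (for example 07-join-qa-workers.yml) are not flagged.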

Confirm:

  • All four development workers remain in hosts.ini.
  • dev_workers.yml is unchanged.
  • 07-join-dev-workers.yml is unchanged.
  • 08-label-and-taint-workers.yml is unchanged.
  • ansible-configure-dev-workers.yml is unchanged.
  • QA files use .209-.212, environment=qa, and environment=qa:NoSchedule.
  • No secret, kubeconfig, join token, Terraform state, or private key is staged.
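
The last item in the list can be partly automated by screening staged filenames. The sketch below is a deliberately broad filename filter, not a content scanner: the function name sensitive_paths and the pattern list are assumptions to adapt to the repository, and it cannot detect a join token or key pasted inside an otherwise ordinary file.

```shell
# Hypothetical helper: reads staged paths on stdin and prints any whose
# filename suggests a kubeconfig, private key, Terraform state, or env file.
# The patterns are intentionally conservative and may flag harmless files.
sensitive_paths() {
  grep -E '(^|/)(admin\.conf|kubeconfig[^/]*|id_(rsa|ed25519)[^/]*|[^/]*\.pem|[^/]*\.key|[^/]*\.tfstate(\.backup)?|\.env)$' || true
}

# Example usage before committing:
#   git diff --cached --name-only | sensitive_paths
```

Empty output means no staged filename matched; any printed path should be unstaged and reviewed by hand.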

Commit only the cumulative inventory and QA-specific files:

git add \
  ansible/inventories/shared-k8s/hosts.ini \
  ansible/inventories/shared-k8s/group_vars/qa_workers.yml \
  ansible/playbooks/shared-k8s/07-join-qa-workers.yml \
  ansible/playbooks/shared-k8s/08-label-and-taint-qa-workers.yml \
  .github/workflows/ansible-configure-qa-workers.yml

git commit -m "Configure shared Kubernetes QA workers without changing development workers"

git push -u origin feature/configure-k8s-qa-workers

22. Create the Ansible pull request into dev

gh pr create \
  --base dev \
  --head feature/configure-k8s-qa-workers \
  --title "Configure shared Kubernetes QA workers" \
  --body "Adds the Ubuntu baseline, Kubernetes prerequisites, worker joins, and approved QA labels and taints."

Merge only after cumulative inventory validation and all QA playbook syntax checks succeed.

23. Promote the Ansible change from dev to prod

gh pr create \
  --base prod \
  --head dev \
  --title "Configure shared Kubernetes QA workers" \
  --body "Promotes the validated four QA worker configuration to prod."

After merge and environment approval, the workflow runs in this order:

  1. Confirm all four completed development workers remain Ready.
  2. Verify the existing Kubernetes API.
  3. Apply the Ubuntu baseline only to QA workers.
  4. Configure containerd and Kubernetes prerequisites only on QA workers.
  5. Generate a temporary QA worker join command.
  6. Join the four QA workers serially.
  7. Wait for every QA worker to become Ready.
  8. Apply only QA labels and the QA taint.
  9. Display both the QA and preserved development worker sets.

24. Manual verification

Run from prod-terraform-deploy-02:

# The quoted heredoc passes ${node} and the jsonpath quotes to the remote bash
# unchanged, so they are not expanded locally or by the remote login shell.
ssh \
  -i ~/.ssh/id_ed25519_ansible \
  -o IdentitiesOnly=yes \
  acllc@192.168.8.202 \
  'sudo bash -s' <<'EOF'
export KUBECONFIG=/etc/kubernetes/admin.conf

echo "=== QA WORKERS ==="
kubectl get nodes \
  -l environment=qa,workload=github-runner \
  -L environment,workload \
  -o wide

echo "=== QA TAINTS ==="
for node in \
  cicd-ac-k8s-qa-wk-01 \
  cicd-ac-k8s-qa-wk-02 \
  cicd-ac-k8s-qa-wk-03 \
  cicd-ac-k8s-qa-wk-04
do
  echo "--- ${node} ---"
  kubectl get node "${node}" \
    -o jsonpath='{range .spec.taints[*]}{.key}={.value}:{.effect}{"\n"}{end}'
done

echo "=== DEVELOPMENT WORKERS PRESERVED ==="
kubectl get nodes \
  -l environment=dev,workload=github-runner \
  -L environment,workload \
  -o wide

echo "=== API READINESS ==="
kubectl get --raw=/readyz
EOF

Verify:

  • Four QA workers are Ready.
  • QA labels and taints are correct.
  • Four development workers are still present and Ready.
  • Development labels and taints remain unchanged.
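
The Ready checks in this list can also be scripted instead of read by eye. The following is a minimal sketch that assumes the default kubectl get nodes column layout, where the second column is STATUS; the function names ready_count and expect_ready are hypothetical.

```shell
# Hypothetical helper: counts nodes whose STATUS column is exactly "Ready".
# Input is `kubectl get nodes --no-headers` output on stdin.
ready_count() {
  awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Hypothetical helper: $1 = expected number of Ready nodes; listing on stdin.
expect_ready() {
  actual="$(ready_count)"
  if [ "${actual}" -ne "$1" ]; then
    echo "expected $1 Ready nodes, found ${actual}" >&2
    return 1
  fi
}

# Example usage on the first control plane:
#   kubectl get nodes -l environment=qa,workload=github-runner --no-headers | expect_ready 4
#   kubectl get nodes -l environment=dev,workload=github-runner --no-headers | expect_ready 4
```

A cordoned node reports Ready,SchedulingDisabled and is intentionally not counted, so it surfaces as a mismatch rather than passing silently.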

25. Expected final state

EXPECTED REBUILD ACCEPTANCE CHECKPOINT

QA worker VMs:
  cicd-ac-k8s-qa-wk-01  192.168.8.209  Ready
  cicd-ac-k8s-qa-wk-02  192.168.8.210  Ready
  cicd-ac-k8s-qa-wk-03  192.168.8.211  Ready
  cicd-ac-k8s-qa-wk-04  192.168.8.212  Ready

Kubernetes labels on every QA worker:
  environment=qa
  workload=github-runner

Kubernetes taint on every QA worker:
  environment=qa:NoSchedule

Cluster checkpoint after this page:
  3 control planes Ready
  4 development workers Ready and unchanged
  4 QA workers Ready
  Kubernetes API /readyz returns ok

Still implemented by later pages:
  4 production workers at 192.168.8.205-.208
  Shared ARC controller
  Repository and environment runner scale sets

This is a rebuild acceptance checkpoint, not a statement about the current live environment.
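
To check the taint line of the checkpoint mechanically, one option is to compare the taint listing for a node with the single expected value. This sketch assumes the key=value:effect line format printed by the jsonpath expression in step 24; the function name only_qa_taint is hypothetical.

```shell
# Hypothetical helper: succeeds only when stdin holds exactly one taint line,
# environment=qa:NoSchedule. Extra taints (for example a not-ready taint on a
# node that has not finished joining) or a missing taint make it fail.
only_qa_taint() {
  taints="$(cat)"
  [ "${taints}" = 'environment=qa:NoSchedule' ]
}

# Example usage on the first control plane:
#   kubectl get node cicd-ac-k8s-qa-wk-01 \
#     -o jsonpath='{range .spec.taints[*]}{.key}={.value}:{.effect}{"\n"}{end}' \
#     | only_qa_taint
```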

26. Failure handling

Terraform proposes changes to existing infrastructure

Stop. Do not apply. The plan must be exactly four additions and no changes or deletions.
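
The "exactly four additions" rule can be enforced as a gate on the plan summary. The sketch below assumes the usual human-readable summary line that terraform plan prints and that the plan was produced with -no-color; the function name plan_is_four_additions is hypothetical, and the exact wording of the summary may differ between Terraform versions.

```shell
# Hypothetical helper: reads `terraform plan -no-color` output on stdin and
# succeeds only if the summary reports 4 additions and nothing else.
plan_is_four_additions() {
  grep -Eq '^Plan: 4 to add, 0 to change, 0 to destroy\.$'
}

# Example usage:
#   terraform plan -no-color | plan_is_four_additions || echo "STOP: unexpected plan"
```

Matching the whole line means a summary that also mentions changes, deletions, or any other action fails the check. For a stricter gate, the saved plan could be inspected in JSON form with terraform show -json instead of matching text.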

A worker receives the wrong DHCP address

Check its Proxmox MAC and router reservation. Do not configure a static Netplan address.

APT reports a package is unavailable

The roles already force an APT refresh. Check DNS, internet access, Ubuntu repository files, and pkgs.k8s.io. Do not rebuild the template merely to preload packages.

Containerd CRI validation fails

Run on the affected worker:

sudo systemctl status containerd --no-pager
sudo ctr plugins ls
sudo grep -nE 'disabled_plugins|SystemdCgroup' /etc/containerd/config.toml
sudo crictl --runtime-endpoint unix:///run/containerd/containerd.sock info

The containerd role must continue using the corrected type-and-ID column checks for both legacy and containerd 2.x plugin layouts.
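
The grep command above shows the two settings that most often explain a CRI failure. As a rough illustration, and not a replacement for the role's own checks, the sketch below tests a config file for both. It assumes the cluster uses the systemd cgroup driver and that disabled_plugins is written on a single line; the function name containerd_config_ok is hypothetical.

```shell
# Hypothetical helper: $1 = path to a containerd config.toml.
# Fails if SystemdCgroup is not true or if a CRI plugin is disabled.
containerd_config_ok() {
  cfg="$1"
  # runc must use systemd cgroups when the kubelet uses the systemd driver.
  grep -Eq '^[[:space:]]*SystemdCgroup[[:space:]]*=[[:space:]]*true' "${cfg}" || {
    echo "SystemdCgroup is not set to true" >&2
    return 1
  }
  # Any disabled plugin whose name contains "cri" is treated as a failure.
  if grep -E '^[[:space:]]*disabled_plugins' "${cfg}" | grep -q 'cri'; then
    echo "cri appears in disabled_plugins" >&2
    return 1
  fi
}

# Example usage on the affected worker:
#   sudo sh -c '. ./check.sh; containerd_config_ok /etc/containerd/config.toml'
```

The file name check.sh in the usage comment is a placeholder for wherever the function is saved.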

A worker cannot join

Inspect:

sudo journalctl -u kubelet -n 200 --no-pager
sudo crictl ps -a
sudo test -f /etc/kubernetes/kubelet.conf && echo joined || echo not-joined

Rerun the approved worker-join workflow. It generates a fresh token automatically. Do not paste join commands into Git.

A worker remains NotReady

From cicd-ac-k8s-cp-01, inspect:

sudo kubectl --kubeconfig=/etc/kubernetes/admin.conf get nodes -o wide
sudo kubectl --kubeconfig=/etc/kubernetes/admin.conf get pods -A -o wide
sudo kubectl --kubeconfig=/etc/kubernetes/admin.conf describe node <WORKER_NAME>

Also check the status of the Calico pods scheduled on the affected worker and review that worker's kubelet logs.

ARC pods are pending after the controller is installed later

The QA runner values must include a matching selector and toleration:

nodeSelector:
  environment: qa
  workload: github-runner

tolerations:
- key: environment
  operator: Equal
  value: qa
  effect: NoSchedule

27. Project status after successful completion

Use this as the expected acceptance checkpoint for a fresh rebuild. It does not assert that any previously existing Kubernetes VM or cluster is still present.

FROM-SCRATCH STATUS AFTER THIS PAGE

Expected prerequisites retained
  Load balancer, API VIP, and three control planes healthy
  Development workers at 192.168.8.213-.216 Ready

Expected QA result
  QA Terraform definitions added
  QA VMs at 192.168.8.209-.212 provisioned
  QA Ubuntu baseline applied
  QA Kubernetes prerequisites configured
  QA workers joined and Ready
  QA labels and NoSchedule taints verified
  Development worker files and nodes preserved

Next page
  Production workers at 192.168.8.205-.208

Later pages
  Shared ARC controller
  Tenant runner scale sets

After this rebuild checkpoint passes, continue to the production-worker page to provision .205-.208 and apply environment=prod, workload=github-runner, and environment=prod:NoSchedule.