Provision and Join the Four Kubernetes Production Workers

This is the seventh infrastructure page in the from-scratch build sequence. Follow it after the development-worker and QA-worker pages have produced a healthy cluster. It provisions the four production ARC worker VMs, joins them to the cluster, and applies the production labels and taints.

FROM-SCRATCH SEQUENCE CHECKPOINT

Required before this page
  Load-balancer VM, HAProxy, Keepalived, and API VIP configured
  Three control-plane nodes Ready
  Four development workers at 192.168.8.213-.216 Ready
  Four QA workers at 192.168.8.209-.212 Ready
  Development and QA labels and taints verified

Implemented by this page
  Production worker Terraform definitions
  Four production worker VM provisions
  Ubuntu baseline and Kubernetes node preparation
  Production worker joins
  Production worker labels and taints
  Verification that development and QA workers remain intact

Implemented by later pages
  Shared ARC controller
  Tenant runner scale sets

1. Scope and execution order

This page is performed in two separate promotions:

  1. Terraform promotion: create only the four production worker VMs.
  2. Ansible promotion: after all four VMs are reachable, configure Ubuntu, containerd, Kubernetes prerequisites, join the workers, and apply the approved production labels and taints.

Do not combine these two changes into the same production promotion. The Ansible workflow must not start until Terraform has created all four VMs and DHCP has assigned the approved addresses.

The approved branch model is:

feature/*
    ↓ pull request
   dev
    ↓ validation and Terraform plan only
 dev → prod pull request
    ↓ review and shared-k8s approval
   prod
    ↓ Terraform apply or Ansible configuration

Not used by this infrastructure execution flow:
local, qa, main

Branch rule

dev performs validation and Terraform plan only. prod performs Terraform apply or Ansible configuration. Do not use local or qa in this infrastructure execution flow.
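
The branch rule can be sketched as a small guard function. This is an illustration only and is not part of the repository's workflows; the name allowed_action is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: maps a branch name to the action this page permits on it.
#   dev  -> validation and Terraform plan only
#   prod -> Terraform apply or Ansible configuration
# Every other branch (including local, qa, and main) is rejected because no
# infrastructure action runs from it in this flow.
allowed_action() {
  case "$1" in
    dev)  echo "plan" ;;
    prod) echo "apply" ;;
    *)
      echo "ERROR: branch '$1' is not an execution branch in this flow" >&2
      return 1
      ;;
  esac
}

allowed_action dev
allowed_action prod
```

A workflow step could call such a guard with the current branch name and stop when it returns non-zero.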

2. Approved production worker allocation

cicd-ac-k8s-prod-wk-01
  VM ID:       3156205
  MAC:         aa:bb:cc:08:0f:01
  Reserved IP: 192.168.8.205
  CPU:         4 vCPU
  RAM:         16384 MB
  Disk:        scsi0, 250G, local-lvm

cicd-ac-k8s-prod-wk-02
  VM ID:       3156206
  MAC:         aa:bb:cc:08:0f:02
  Reserved IP: 192.168.8.206
  CPU:         4 vCPU
  RAM:         16384 MB
  Disk:        scsi0, 250G, local-lvm

cicd-ac-k8s-prod-wk-03
  VM ID:       3156207
  MAC:         aa:bb:cc:08:0f:03
  Reserved IP: 192.168.8.207
  CPU:         4 vCPU
  RAM:         16384 MB
  Disk:        scsi0, 250G, local-lvm

cicd-ac-k8s-prod-wk-04
  VM ID:       3156208
  MAC:         aa:bb:cc:08:0f:04
  Reserved IP: 192.168.8.208
  CPU:         4 vCPU
  RAM:         16384 MB
  Disk:        scsi0, 250G, local-lvm

Shared values
  Template:    tmplt-ub-26-min-base / VM ID 90000
  Node:        pve
  Bridge:      vmbr0
  Environment: prod
  Workload:    github-runner

The environment-specific address ranges are intentionally not ordered dev-first:

Approved worker address order
  Production workers:  192.168.8.205-.208
  QA workers:          192.168.8.209-.212
  Development workers: 192.168.8.213-.216

Production workers therefore use VM IDs 3156205 through 3156208; the last three digits of each VM ID match the last octet of its reserved address.
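
The relationship between the VM IDs and the reserved addresses in the allocation above can be checked with a short helper. This is a minimal sketch that assumes every worker VM ID is the prefix 3156 followed by the last octet of the reserved address, as the four production entries suggest; the function name vmid_to_ip is hypothetical.

```shell
#!/usr/bin/env bash
# Assumption: a worker VM ID is "3156" followed by the three-digit last octet
# of its reserved 192.168.8.x address, as in the allocation on this page.
vmid_to_ip() {
  case "$1" in
    3156[0-9][0-9][0-9])
      # Strip the 3156 prefix to recover the last octet.
      echo "192.168.8.${1#3156}"
      ;;
    *)
      echo "ERROR: unexpected VM ID: $1" >&2
      return 1
      ;;
  esac
}

for vmid in 3156205 3156206 3156207 3156208; do
  printf '%s -> %s\n' "${vmid}" "$(vmid_to_ip "${vmid}")"
done
```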

Confirm the router has all four DHCP reservations before running Terraform. Ubuntu must continue using DHCP; do not configure static Netplan addresses.

Resource and identity lock

Do not change the approved VM IDs, MAC addresses, IP reservations, scsi0 disk slot, local-lvm storage, or production environment assignment while following this page.


Part A — Provision the Production Worker VMs with Terraform

3. Terraform files changed

terraform/stacks/shared-k8s/main.tf
terraform/stacks/shared-k8s/outputs.tf

The proxmox-vm and proxmox-vm-group modules created by the preceding pages remain unchanged. This page adds a new production-worker map to the cumulative shared Kubernetes stack.

4. Create the Terraform feature branch from dev

Run from Windows PowerShell:

cd D:\code\ASPIRECLAN-LLC-Org\ac-cicd-infra

git switch dev
git pull --ff-only origin dev

git switch -c feature/provision-k8s-prod-workers

5. Extend terraform/stacks/shared-k8s/main.tf

Do not replace or edit the load-balancer, control-plane, development-worker, or QA-worker definitions created by the preceding pages. Append this production block after the existing QA-worker module:

locals {
  prod_workers = {
    prod_wk01 = {
      name          = "cicd-ac-k8s-prod-wk-01"
      description   = "Aspireclan shared Kubernetes production ARC worker 01"
      vmid          = 3156205
      target_node   = "pve"
      template_name = "tmplt-ub-26-min-base"
      cores         = 4
      memory_mb     = 16384
      disk_size     = "250G"
      storage       = "local-lvm"
      bridge        = "vmbr0"
      mac_address   = "aa:bb:cc:08:0f:01"
      tags = [
        "ac-cicd",
        "shared-k8s",
        "worker",
        "prod",
        "arc-runner",
        "terraform",
        "ansible",
      ]
    }

    prod_wk02 = {
      name          = "cicd-ac-k8s-prod-wk-02"
      description   = "Aspireclan shared Kubernetes production ARC worker 02"
      vmid          = 3156206
      target_node   = "pve"
      template_name = "tmplt-ub-26-min-base"
      cores         = 4
      memory_mb     = 16384
      disk_size     = "250G"
      storage       = "local-lvm"
      bridge        = "vmbr0"
      mac_address   = "aa:bb:cc:08:0f:02"
      tags = [
        "ac-cicd",
        "shared-k8s",
        "worker",
        "prod",
        "arc-runner",
        "terraform",
        "ansible",
      ]
    }

    prod_wk03 = {
      name          = "cicd-ac-k8s-prod-wk-03"
      description   = "Aspireclan shared Kubernetes production ARC worker 03"
      vmid          = 3156207
      target_node   = "pve"
      template_name = "tmplt-ub-26-min-base"
      cores         = 4
      memory_mb     = 16384
      disk_size     = "250G"
      storage       = "local-lvm"
      bridge        = "vmbr0"
      mac_address   = "aa:bb:cc:08:0f:03"
      tags = [
        "ac-cicd",
        "shared-k8s",
        "worker",
        "prod",
        "arc-runner",
        "terraform",
        "ansible",
      ]
    }

    prod_wk04 = {
      name          = "cicd-ac-k8s-prod-wk-04"
      description   = "Aspireclan shared Kubernetes production ARC worker 04"
      vmid          = 3156208
      target_node   = "pve"
      template_name = "tmplt-ub-26-min-base"
      cores         = 4
      memory_mb     = 16384
      disk_size     = "250G"
      storage       = "local-lvm"
      bridge        = "vmbr0"
      mac_address   = "aa:bb:cc:08:0f:04"
      tags = [
        "ac-cicd",
        "shared-k8s",
        "worker",
        "prod",
        "arc-runner",
        "terraform",
        "ansible",
      ]
    }
  }
}

module "prod_workers" {
  source = "../../modules/proxmox-vm-group"

  vms = local.prod_workers
}

The four workers use the approved sizing of 4 vCPU, 16 GB RAM, and 250 GB disk per VM.

6. Extend terraform/stacks/shared-k8s/outputs.tf

Append:

output "prod_workers" {
  description = "Shared Kubernetes production worker VMs."

  value = {
    prod_wk01 = merge(module.prod_workers.vms["prod_wk01"], {
      reserved_ip = "192.168.8.205"
      mac_address = "aa:bb:cc:08:0f:01"
      environment = "prod"
    })
    prod_wk02 = merge(module.prod_workers.vms["prod_wk02"], {
      reserved_ip = "192.168.8.206"
      mac_address = "aa:bb:cc:08:0f:02"
      environment = "prod"
    })
    prod_wk03 = merge(module.prod_workers.vms["prod_wk03"], {
      reserved_ip = "192.168.8.207"
      mac_address = "aa:bb:cc:08:0f:03"
      environment = "prod"
    })
    prod_wk04 = merge(module.prod_workers.vms["prod_wk04"], {
      reserved_ip = "192.168.8.208"
      mac_address = "aa:bb:cc:08:0f:04"
      environment = "prod"
    })
  }
}

7. Confirm the existing Terraform workflow contract

The Terraform workflows created by the preceding pages must continue to implement:

Terraform plan workflow
  push branch:           dev
  manual dispatch:       supported
  pull_request trigger:  not used
  action:                fmt, init, validate, plan

Terraform apply workflow
  push branch:           prod
  manual dispatch:       supported from prod
  action:                fmt, init, validate, saved plan, apply

Persistent state
  /var/lib/ac-cicd-infra/terraform-state/shared-k8s/terraform.tfstate

No new workflow is required for this Terraform change. The existing path filters already cover the shared stack.

8. Review and commit the Terraform change

git status
git diff --check
git diff --stat

git diff -- terraform/stacks/shared-k8s/main.tf terraform/stacks/shared-k8s/outputs.tf

Confirm:

  • The load balancer, control planes, development workers, and QA workers created by the preceding pages remain unchanged.
  • Exactly four VM additions are present.
  • VM IDs are 3156205 through 3156208.
  • MAC addresses are aa:bb:cc:08:0f:01 through aa:bb:cc:08:0f:04.
  • Each worker uses 4 cores, 16384 MB RAM, and a 250G scsi0 disk on local-lvm.
  • No Terraform state, Proxmox token, kubeconfig, join command, or private SSH key is staged.
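
The last item in the checklist can be partly automated. The sketch below filters a list of paths for names that look like state files, kubeconfigs, or private keys. The function name find_sensitive_paths and the patterns are illustrative assumptions, not an approved or exhaustive list, and the check does not replace a manual review of the staged diff.

```shell
#!/usr/bin/env bash
# Hypothetical pre-commit check: reads file paths on stdin, one per line, and
# prints those whose names suggest Terraform state, kubeconfigs, or SSH keys.
# The patterns are illustrative only. The id_ed25519 pattern also flags
# public-key files, which is deliberately conservative.
find_sensitive_paths() {
  grep -E -i '(\.tfstate(\.backup)?$|kubeconfig|admin\.conf$|id_ed25519|\.pem$)' || true
}

# Demo with sample paths: only the state file is reported.
printf '%s\n' \
  'terraform/stacks/shared-k8s/main.tf' \
  'terraform/stacks/shared-k8s/terraform.tfstate' |
  find_sensitive_paths
```

Against the real index, the paths would come from git diff --cached --name-only piped into the function; empty output means none of the patterns matched.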

Commit and push:

git add terraform/stacks/shared-k8s/main.tf terraform/stacks/shared-k8s/outputs.tf

git commit -m "Provision shared Kubernetes production workers"

git push -u origin feature/provision-k8s-prod-workers

9. Create the Terraform pull request into dev

gh pr create \
  --base dev \
  --head feature/provision-k8s-prod-workers \
  --title "Provision shared Kubernetes production workers" \
  --body "Adds the four approved production worker VMs to the shared Kubernetes Terraform stack."

After merge, the dev plan must end with:

Plan: 4 to add, 0 to change, 0 to destroy.

Terraform stop conditions

Do not promote to prod when Terraform proposes any update, replacement, or deletion of existing infrastructure, or anything other than the four approved production workers.
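
The summary-line part of the stop condition can be checked mechanically. The sketch below succeeds only when the plan output contains exactly the expected summary line. The function name plan_is_approved is hypothetical, and the check assumes the plan text was produced without color codes (for example with Terraform's -no-color option). A matching summary does not prove that the four additions are the approved workers, so the plan body still needs review.

```shell
#!/usr/bin/env bash
# Hypothetical gate: reads Terraform plan output on stdin and succeeds only
# when one line is exactly the summary this page expects.
plan_is_approved() {
  grep -Fxq 'Plan: 4 to add, 0 to change, 0 to destroy.'
}

# Demo with the expected summary line.
if echo 'Plan: 4 to add, 0 to change, 0 to destroy.' | plan_is_approved; then
  echo "Plan summary matches the approved change."
fi
```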

10. Promote the Terraform change from dev to prod

gh pr create \
  --base prod \
  --head dev \
  --title "Provision shared Kubernetes production workers" \
  --body "Promotes the validated four production worker Terraform plan to prod."

After merge and shared-k8s environment approval, the production workflow applies the saved plan and writes the new worker resources to the existing persistent Terraform state.

11. Verify the worker VMs in Proxmox

Run on the Proxmox host:

qm status 3156205
qm config 3156205

qm status 3156206
qm config 3156206

qm status 3156207
qm config 3156207

qm status 3156208
qm config 3156208

For every VM, confirm:

  • Status is running.
  • CPU is 4 cores.
  • RAM is 16384 MB.
  • scsi0 is on local-lvm with size 250G.
  • The expected MAC address is present.
  • onboot remains enabled.
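
The configuration items in this checklist can be compared against qm config output with a small parser. This is a minimal sketch: the function name check_worker_config is hypothetical, and it assumes qm config prints one "key: value" line per setting, as in the illustrative sample inside the block. The running status still has to be confirmed separately with qm status.

```shell
#!/usr/bin/env bash
# Hypothetical checker: reads `qm config <vmid>` output on stdin and verifies
# the approved sizing, disk placement, MAC address, and onboot flag.
# Argument 1: the expected MAC address (compared case-insensitively).
check_worker_config() {
  mac="$1"
  config="$(cat)"
  failures=0

  for pattern in \
    '^cores: 4$' \
    '^memory: 16384$' \
    '^scsi0: local-lvm:.*size=250G(,|$)' \
    "^net0: .*=${mac}(,|\$)" \
    '^onboot: 1$'
  do
    if ! printf '%s\n' "${config}" | grep -Eiq -- "${pattern}"; then
      echo "FAIL: no line matching ${pattern}"
      failures=$((failures + 1))
    fi
  done

  [ "${failures}" -eq 0 ]
}

# Illustrative sample only; real output contains additional keys.
sample='cores: 4
memory: 16384
net0: virtio=AA:BB:CC:08:0F:01,bridge=vmbr0
onboot: 1
scsi0: local-lvm:vm-3156205-disk-0,size=250G'

printf '%s\n' "${sample}" | check_worker_config 'aa:bb:cc:08:0f:01' && echo "3156205 OK"
```

On the Proxmox host, the same function would read from qm config 3156205 instead of the sample text.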

12. Verify DHCP, SSH, and sudo

Run from prod-terraform-deploy-02:

for ip in 205 206 207 208; do
  echo "=== 192.168.8.${ip} ==="

  ping -c 2 -W 2 "192.168.8.${ip}"

  ssh \
    -i ~/.ssh/id_ed25519_ansible \
    -o IdentitiesOnly=yes \
    -o BatchMode=yes \
    -o ConnectTimeout=10 \
    "acllc@192.168.8.${ip}" \
    'hostnamectl --static; ip -brief address; sudo -n whoami'
done

Expected before Ansible:

  • .205 through .208 respond.
  • The Ansible automation key authenticates as acllc.
  • sudo -n whoami returns root.
  • The Ubuntu hostname may still show the base-template hostname.

Stop here until all four workers pass SSH and passwordless sudo checks.
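
The verification loop above prints results but does not fail when a worker is unreachable. The following sketch turns the same check into a pass/fail gate. The function names gate_workers and ssh_sudo_check are hypothetical; the SSH options are the ones already used on this page.

```shell
#!/usr/bin/env bash
# Hypothetical gate: runs a check command once per production worker address
# and succeeds only when every address passes.
# Argument 1: name of a command or function that takes one IP address.
gate_workers() {
  check="$1"
  failed=0
  for octet in 205 206 207 208; do
    if "${check}" "192.168.8.${octet}"; then
      echo "PASS 192.168.8.${octet}"
    else
      echo "FAIL 192.168.8.${octet}"
      failed=$((failed + 1))
    fi
  done
  [ "${failed}" -eq 0 ]
}

# Checks key-based SSH login and passwordless sudo for one address.
ssh_sudo_check() {
  ssh -n \
    -i ~/.ssh/id_ed25519_ansible \
    -o IdentitiesOnly=yes \
    -o BatchMode=yes \
    -o ConnectTimeout=10 \
    "acllc@$1" \
    'sudo -n true'
}
```

Running gate_workers ssh_sudo_check from prod-terraform-deploy-02 returns non-zero while any worker still fails, which makes the stop condition explicit before the Ansible branch is created.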


Part B — Configure and Join the Production Workers with Ansible

13. Production-only Ansible files changed

ansible/inventories/shared-k8s/hosts.ini
ansible/inventories/shared-k8s/group_vars/prod_workers.yml
ansible/playbooks/shared-k8s/07-join-prod-workers.yml
ansible/playbooks/shared-k8s/08-label-and-taint-prod-workers.yml
.github/workflows/ansible-configure-prod-workers.yml

The development-worker and QA-worker implementations created by the preceding pages must remain intact. This production phase does not replace their variables, join playbooks, labels-and-taints playbooks, or workflows.

Preserve without replacement
  ansible/inventories/shared-k8s/group_vars/dev_workers.yml
  ansible/inventories/shared-k8s/group_vars/qa_workers.yml
  ansible/roles/common/**
  ansible/roles/containerd/**
  ansible/roles/kubernetes-common/**
  ansible/roles/kubernetes-worker/**
  ansible/playbooks/shared-k8s/01-common-baseline.yml
  ansible/playbooks/shared-k8s/03-prepare-kubernetes-nodes.yml
  ansible/playbooks/shared-k8s/07-join-dev-workers.yml
  ansible/playbooks/shared-k8s/08-label-and-taint-workers.yml
  ansible/playbooks/shared-k8s/07-join-qa-workers.yml
  ansible/playbooks/shared-k8s/08-label-and-taint-qa-workers.yml
  .github/workflows/ansible-configure-control-planes.yml
  .github/workflows/ansible-configure-dev-workers.yml
  .github/workflows/ansible-configure-qa-workers.yml

The production phase makes only these cumulative or new-file changes:

Update cumulatively
  ansible/inventories/shared-k8s/hosts.ini

Create new production-only files
  ansible/inventories/shared-k8s/group_vars/prod_workers.yml
  ansible/playbooks/shared-k8s/07-join-prod-workers.yml
  ansible/playbooks/shared-k8s/08-label-and-taint-prod-workers.yml
  .github/workflows/ansible-configure-prod-workers.yml

The common, containerd, Kubernetes-common, and Kubernetes-worker roles created by the preceding pages contain the required permanent fixes and are reused without replacement:

  • Forced APT cache refresh before package installation.
  • Package-install retries.
  • ansible_facts access instead of deprecated injected variables.
  • /var/tmp/ansible-acllc as the remote temporary directory.
  • Containerd 2.x-compatible CRI validation.
  • Correct task-level indentation.
  • Explicit crictl runtime and image endpoints.

The control-plane workflow created by the preceding page must retain --limit control_planes for its common-baseline and Kubernetes-preparation steps. This prevents later inventory additions from targeting workers.

14. Create the Ansible feature branch from dev

Create this branch only after the production Terraform apply has completed successfully:

cd D:\code\ASPIRECLAN-LLC-Org\ac-cicd-infra

git switch dev
git pull --ff-only origin dev

git switch -c feature/configure-k8s-prod-workers

15. Update the shared inventory cumulatively

At this rebuild stage, the inventory must contain the load balancer, control planes, development workers, and QA workers created by the preceding pages, plus the four new production workers. Replace ansible/inventories/shared-k8s/hosts.ini only with this complete cumulative version:

[load_balancers]
cicd-ac-k8s-lb-01 ansible_host=192.168.8.201 ansible_user=acllc node_primary_ip=192.168.8.201 node_interface=ens18

[first_control_plane]
cicd-ac-k8s-cp-01 ansible_host=192.168.8.202 ansible_user=acllc node_primary_ip=192.168.8.202 node_interface=ens18

[additional_control_planes]
cicd-ac-k8s-cp-02 ansible_host=192.168.8.203 ansible_user=acllc node_primary_ip=192.168.8.203 node_interface=ens18
cicd-ac-k8s-cp-03 ansible_host=192.168.8.204 ansible_user=acllc node_primary_ip=192.168.8.204 node_interface=ens18

[control_planes:children]
first_control_plane
additional_control_planes

[dev_workers]
cicd-ac-k8s-dev-wk-01 ansible_host=192.168.8.213 ansible_user=acllc node_primary_ip=192.168.8.213 node_interface=ens18
cicd-ac-k8s-dev-wk-02 ansible_host=192.168.8.214 ansible_user=acllc node_primary_ip=192.168.8.214 node_interface=ens18
cicd-ac-k8s-dev-wk-03 ansible_host=192.168.8.215 ansible_user=acllc node_primary_ip=192.168.8.215 node_interface=ens18
cicd-ac-k8s-dev-wk-04 ansible_host=192.168.8.216 ansible_user=acllc node_primary_ip=192.168.8.216 node_interface=ens18

[qa_workers]
cicd-ac-k8s-qa-wk-01 ansible_host=192.168.8.209 ansible_user=acllc node_primary_ip=192.168.8.209 node_interface=ens18
cicd-ac-k8s-qa-wk-02 ansible_host=192.168.8.210 ansible_user=acllc node_primary_ip=192.168.8.210 node_interface=ens18
cicd-ac-k8s-qa-wk-03 ansible_host=192.168.8.211 ansible_user=acllc node_primary_ip=192.168.8.211 node_interface=ens18
cicd-ac-k8s-qa-wk-04 ansible_host=192.168.8.212 ansible_user=acllc node_primary_ip=192.168.8.212 node_interface=ens18

[prod_workers]
cicd-ac-k8s-prod-wk-01 ansible_host=192.168.8.205 ansible_user=acllc node_primary_ip=192.168.8.205 node_interface=ens18
cicd-ac-k8s-prod-wk-02 ansible_host=192.168.8.206 ansible_user=acllc node_primary_ip=192.168.8.206 node_interface=ens18
cicd-ac-k8s-prod-wk-03 ansible_host=192.168.8.207 ansible_user=acllc node_primary_ip=192.168.8.207 node_interface=ens18
cicd-ac-k8s-prod-wk-04 ansible_host=192.168.8.208 ansible_user=acllc node_primary_ip=192.168.8.208 node_interface=ens18

[workers:children]
dev_workers
qa_workers
prod_workers

[k8s_cluster:children]
control_planes
workers

[all:vars]
ansible_python_interpreter=/usr/bin/python3

Before continuing, confirm all three worker groups remain present:

grep -nE 'cicd-ac-k8s-dev-wk-0[1-4]' ansible/inventories/shared-k8s/hosts.ini

grep -nE 'cicd-ac-k8s-qa-wk-0[1-4]' ansible/inventories/shared-k8s/hosts.ini

grep -nE 'cicd-ac-k8s-prod-wk-0[1-4]' ansible/inventories/shared-k8s/hosts.ini
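
The grep commands above list the matching lines but leave the counting to the reader. A minimal sketch that counts the host lines per environment follows; the function name count_workers is hypothetical, and it assumes each host entry starts at the beginning of a line and is followed by a space, as in the inventory shown on this page.

```shell
#!/usr/bin/env bash
# Hypothetical helper: counts inventory host lines for one environment.
# Argument 1: inventory path. Argument 2: environment (dev, qa, or prod).
count_workers() {
  # grep -c prints 0 and exits non-zero when nothing matches, so the exit
  # status is ignored and only the printed count is used.
  grep -cE "^cicd-ac-k8s-$2-wk-0[1-4] " "$1" || true
}

inventory="ansible/inventories/shared-k8s/hosts.ini"
if [ -f "${inventory}" ]; then
  for env in dev qa prod; do
    echo "${env}: $(count_workers "${inventory}" "${env}") of 4 workers present"
  done
fi
```

Each environment should report 4 before continuing.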

16. Create production worker variables without changing development or QA variables

Do not edit or replace:

ansible/inventories/shared-k8s/group_vars/dev_workers.yml
ansible/inventories/shared-k8s/group_vars/qa_workers.yml

Create the new file ansible/inventories/shared-k8s/group_vars/prod_workers.yml:

---
worker_environment: prod
worker_workload: github-runner

kubernetes_node_tcp_ports:
  - "10250"

calico_node_tcp_ports:
  - "5473"

calico_node_udp_ports:
  - "4789"

worker_labels:
  environment: prod
  workload: github-runner

worker_taints:
  - key: environment
    value: prod
    effect: NoSchedule

The existing control_planes.yml, generic Kubernetes firewall tasks, node-preparation playbook, and Kubernetes-worker role remain unchanged.

17. Preserve the existing development and QA playbooks

Do not edit, rename, or replace:

ansible/playbooks/shared-k8s/07-join-dev-workers.yml
ansible/playbooks/shared-k8s/08-label-and-taint-workers.yml
ansible/playbooks/shared-k8s/07-join-qa-workers.yml
ansible/playbooks/shared-k8s/08-label-and-taint-qa-workers.yml

Those files continue to own development-worker and QA-worker reconciliation.

18. Add a production-specific worker-join playbook

Create the new file:

ansible/playbooks/shared-k8s/07-join-prod-workers.yml

Use:

---
- name: Generate a fresh Kubernetes worker join command for production workers
  hosts: first_control_plane
  become: true
  gather_facts: false

  tasks:
    - name: Confirm the Kubernetes API is ready
      ansible.builtin.command:
        cmd: kubectl --kubeconfig=/etc/kubernetes/admin.conf get --raw=/readyz
      register: prod_worker_join_api_ready
      changed_when: false
      retries: 12
      delay: 10
      until: prod_worker_join_api_ready.stdout | trim == "ok"

    - name: Generate a fresh production worker bootstrap-token join command
      ansible.builtin.command:
        cmd: kubeadm token create --ttl 2h --print-join-command
      register: generated_prod_worker_join_command
      changed_when: true
      no_log: true

    - name: Store the temporary production worker join command in memory
      ansible.builtin.set_fact:
        shared_prod_worker_join_command: "{{ generated_prod_worker_join_command.stdout }}"
      no_log: true

- name: Join the production workers one at a time
  hosts: prod_workers
  serial: 1
  become: true
  gather_facts: true

  vars:
    worker_join_command: >-
      {{ hostvars[groups['first_control_plane'][0]].shared_prod_worker_join_command }}

  roles:
    - role: kubernetes-worker

- name: Verify all production workers are Ready
  hosts: first_control_plane
  become: true
  gather_facts: false

  environment:
    KUBECONFIG: /etc/kubernetes/admin.conf

  tasks:
    - name: Wait for each production worker to become Ready
      ansible.builtin.command:
        cmd: >-
          kubectl wait
          --for=condition=Ready
          node/{{ item }}
          --timeout=15m
      loop: "{{ groups['prod_workers'] }}"
      changed_when: false

    - name: Display all nodes after the production worker joins
      ansible.builtin.command:
        cmd: kubectl get nodes -o wide
      register: joined_prod_workers
      changed_when: false

    - name: Print the node table
      ansible.builtin.debug:
        var: joined_prod_workers.stdout_lines

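The join play in this playbook targets only prod_workers; the other two plays run on first_control_plane to generate the join command and to verify readiness. The playbook cannot join or modify development or QA workers.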

19. Add a production-specific labels-and-taints playbook

Create the new file:

ansible/playbooks/shared-k8s/08-label-and-taint-prod-workers.yml

Use:

---
- name: Apply production worker labels and taints
  hosts: first_control_plane
  become: true
  gather_facts: false

  environment:
    KUBECONFIG: /etc/kubernetes/admin.conf

  tasks:
    - name: Apply the approved production worker labels
      ansible.builtin.command:
        cmd: >-
          kubectl label node {{ item }}
          environment={{ hostvars[item].worker_environment | default('prod') }}
          workload={{ hostvars[item].worker_workload | default('github-runner') }}
          --overwrite
      loop: "{{ groups['prod_workers'] }}"
      register: prod_worker_label_results
      changed_when: "'not labeled' not in prod_worker_label_results.stdout"

    - name: Apply the approved production worker taint
      ansible.builtin.command:
        cmd: >-
          kubectl taint node {{ item }}
          environment=prod:NoSchedule
          --overwrite
      loop: "{{ groups['prod_workers'] }}"
      register: prod_worker_taint_results
      changed_when: "'not tainted' not in prod_worker_taint_results.stdout"

    - name: Display production worker labels
      ansible.builtin.command:
        cmd: >-
          kubectl get nodes
          -l environment=prod,workload=github-runner
          -L environment,workload
          -o wide
      register: labeled_prod_workers
      changed_when: false

    - name: Print labeled production workers
      ansible.builtin.debug:
        var: labeled_prod_workers.stdout_lines

    - name: Verify the production taint on every worker
      ansible.builtin.shell:
        executable: /bin/bash
        cmd: |
          set -euo pipefail

          kubectl get node {{ item }} \
            -o jsonpath='{range .spec.taints[*]}{.key}={.value}:{.effect}{"\n"}{end}' |
            grep -Fx 'environment=prod:NoSchedule'
      loop: "{{ groups['prod_workers'] }}"
      changed_when: false

Every production worker receives:

environment=prod
workload=github-runner

environment=prod:NoSchedule

The production default is explicitly prod; it does not fall back to dev or qa.

20. Add the production-worker GitHub Actions workflow

Create .github/workflows/ansible-configure-prod-workers.yml:

name: Ansible Configure - Kubernetes Production Workers

on:
  push:
    branches:
      - dev
      - prod
    paths:
      - "ansible/inventories/shared-k8s/group_vars/prod_workers.yml"
      - "ansible/playbooks/shared-k8s/07-join-prod-workers.yml"
      - "ansible/playbooks/shared-k8s/08-label-and-taint-prod-workers.yml"
      - ".github/workflows/ansible-configure-prod-workers.yml"

  workflow_dispatch:

permissions:
  contents: read

concurrency:
  group: shared-k8s-ansible
  cancel-in-progress: false

env:
  ANSIBLE_CONFIG: ${{ github.workspace }}/ansible/ansible.cfg

jobs:
  validate:
    name: Validate production-worker Ansible configuration
    runs-on:
      - self-hosted
      - Linux
      - X64
      - prod
      - terraform
      - deploy
      - ac-cicd-infra

    steps:
      - name: Checkout repository
        uses: actions/checkout@v5

      - name: Verify Ansible
        shell: bash
        run: |
          set -euo pipefail
          ansible --version
          ansible-playbook --version

      - name: Validate the cumulative shared inventory
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail

          ansible-inventory \
            -i inventories/shared-k8s/hosts.ini \
            --graph

          for host in \
            cicd-ac-k8s-dev-wk-01 \
            cicd-ac-k8s-dev-wk-02 \
            cicd-ac-k8s-dev-wk-03 \
            cicd-ac-k8s-dev-wk-04 \
            cicd-ac-k8s-qa-wk-01 \
            cicd-ac-k8s-qa-wk-02 \
            cicd-ac-k8s-qa-wk-03 \
            cicd-ac-k8s-qa-wk-04 \
            cicd-ac-k8s-prod-wk-01 \
            cicd-ac-k8s-prod-wk-02 \
            cicd-ac-k8s-prod-wk-03 \
            cicd-ac-k8s-prod-wk-04
          do
            grep -Fq "${host} " inventories/shared-k8s/hosts.ini
          done

      - name: Confirm completed development and QA files remain present
        shell: bash
        run: |
          set -euo pipefail

          required_existing_files=(
            "ansible/inventories/shared-k8s/group_vars/dev_workers.yml"
            "ansible/inventories/shared-k8s/group_vars/qa_workers.yml"
            "ansible/playbooks/shared-k8s/07-join-dev-workers.yml"
            "ansible/playbooks/shared-k8s/08-label-and-taint-workers.yml"
            "ansible/playbooks/shared-k8s/07-join-qa-workers.yml"
            "ansible/playbooks/shared-k8s/08-label-and-taint-qa-workers.yml"
            ".github/workflows/ansible-configure-dev-workers.yml"
            ".github/workflows/ansible-configure-qa-workers.yml"
          )

          for file in "${required_existing_files[@]}"; do
            if [ ! -f "${file}" ]; then
              echo "ERROR: Completed worker file is missing: ${file}"
              exit 1
            fi
          done

      - name: Confirm each generic playbook resolves only production targets
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail

          for playbook in \
            playbooks/shared-k8s/01-common-baseline.yml \
            playbooks/shared-k8s/03-prepare-kubernetes-nodes.yml
          do
            output="$(
              ansible-playbook \
                -i inventories/shared-k8s/hosts.ini \
                --limit prod_workers \
                "${playbook}" \
                --list-hosts
            )"

            printf '%s\n' "${output}"

            for host in \
              cicd-ac-k8s-prod-wk-01 \
              cicd-ac-k8s-prod-wk-02 \
              cicd-ac-k8s-prod-wk-03 \
              cicd-ac-k8s-prod-wk-04
            do
              grep -Fq "${host}" <<< "${output}"
            done

            if grep -Eq 'cicd-ac-k8s-(dev|qa)-wk-' <<< "${output}"; then
              echo "ERROR: Production workflow target includes a non-production worker."
              exit 1
            fi
          done

      - name: Syntax-check the production-worker playbooks
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail

          for playbook in \
            playbooks/shared-k8s/01-common-baseline.yml \
            playbooks/shared-k8s/03-prepare-kubernetes-nodes.yml \
            playbooks/shared-k8s/07-join-prod-workers.yml \
            playbooks/shared-k8s/08-label-and-taint-prod-workers.yml
          do
            ansible-playbook \
              -i inventories/shared-k8s/hosts.ini \
              "${playbook}" \
              --syntax-check
          done

  configure:
    name: Configure and join production workers
    needs:
      - validate

    if: >-
      (github.event_name == 'push' && github.ref_name == 'prod') ||
      (github.event_name == 'workflow_dispatch' && github.ref_name == 'prod')

    environment:
      name: shared-k8s

    runs-on:
      - self-hosted
      - Linux
      - X64
      - prod
      - terraform
      - deploy
      - ac-cicd-infra

    timeout-minutes: 180

    steps:
      - name: Checkout repository
        uses: actions/checkout@v5

      - name: Verify the production branch
        shell: bash
        run: |
          set -euo pipefail

          if [ "${GITHUB_REF_NAME}" != "prod" ]; then
            echo "ERROR: Production-worker configuration is permitted only from prod."
            exit 1
          fi

      - name: Prepare the existing Ansible SSH key
        shell: bash
        run: |
          set -euo pipefail

          KEY_PATH="${HOME}/.ssh/id_ed25519_ansible"

          if [ ! -f "${KEY_PATH}" ]; then
            echo "ERROR: Missing Ansible key: ${KEY_PATH}"
            exit 1
          fi

          chmod 600 "${KEY_PATH}"
          echo "ANSIBLE_PRIVATE_KEY_FILE=${KEY_PATH}" >> "${GITHUB_ENV}"

      - name: Refresh production-worker SSH host keys
        shell: bash
        run: |
          set -euo pipefail

          mkdir -p "${HOME}/.ssh"
          chmod 700 "${HOME}/.ssh"
          touch "${HOME}/.ssh/known_hosts"
          chmod 600 "${HOME}/.ssh/known_hosts"

          for ip in 192.168.8.205 192.168.8.206 192.168.8.207 192.168.8.208; do
            ssh-keygen -f "${HOME}/.ssh/known_hosts" -R "${ip}" || true

            captured=false
            for attempt in $(seq 1 60); do
              if ssh-keyscan -T 5 -H "${ip}" >> "${HOME}/.ssh/known_hosts" 2>/dev/null; then
                echo "SSH host key captured for ${ip}."
                captured=true
                break
              fi

              echo "Waiting for SSH on ${ip} (attempt ${attempt}/60)..."
              sleep 10
            done

            if [ "${captured}" != "true" ]; then
              echo "ERROR: Unable to capture SSH host key for ${ip}."
              exit 1
            fi
          done

      - name: Prepare the Ansible remote temporary directory
        shell: bash
        run: |
          set -euo pipefail

          for ip in 192.168.8.205 192.168.8.206 192.168.8.207 192.168.8.208; do
            ssh \
              -i "${ANSIBLE_PRIVATE_KEY_FILE}" \
              -o IdentitiesOnly=yes \
              -o BatchMode=yes \
              "acllc@${ip}" \
              'sudo install -d -m 0700 -o acllc -g acllc /var/tmp/ansible-acllc'
          done

      - name: Verify Ansible connectivity
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail

          ansible \
            -i inventories/shared-k8s/hosts.ini \
            prod_workers \
            --private-key "${ANSIBLE_PRIVATE_KEY_FILE}" \
            -m ping

      - name: Confirm the existing development and QA workers remain Ready
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail

          ansible \
            -i inventories/shared-k8s/hosts.ini \
            first_control_plane \
            --private-key "${ANSIBLE_PRIVATE_KEY_FILE}" \
            -b \
            -m shell \
            -a '
              set -e
              export KUBECONFIG=/etc/kubernetes/admin.conf

              for node in \
                cicd-ac-k8s-dev-wk-01 \
                cicd-ac-k8s-dev-wk-02 \
                cicd-ac-k8s-dev-wk-03 \
                cicd-ac-k8s-dev-wk-04 \
                cicd-ac-k8s-qa-wk-01 \
                cicd-ac-k8s-qa-wk-02 \
                cicd-ac-k8s-qa-wk-03 \
                cicd-ac-k8s-qa-wk-04
              do
                kubectl wait --for=condition=Ready "node/${node}" --timeout=2m
              done
            '

      - name: Confirm the existing Kubernetes API is healthy
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail

          ansible \
            -i inventories/shared-k8s/hosts.ini \
            first_control_plane \
            --private-key "${ANSIBLE_PRIVATE_KEY_FILE}" \
            -b \
            -m command \
            -a 'kubectl --kubeconfig=/etc/kubernetes/admin.conf get --raw=/readyz'

      - name: Apply the common Ubuntu baseline only to production workers
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail

          ansible-playbook \
            -i inventories/shared-k8s/hosts.ini \
            --private-key "${ANSIBLE_PRIVATE_KEY_FILE}" \
            --limit prod_workers \
            playbooks/shared-k8s/01-common-baseline.yml

      - name: Prepare containerd and Kubernetes prerequisites only on production workers
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail

          ansible-playbook \
            -i inventories/shared-k8s/hosts.ini \
            --private-key "${ANSIBLE_PRIVATE_KEY_FILE}" \
            --limit prod_workers \
            playbooks/shared-k8s/03-prepare-kubernetes-nodes.yml

      - name: Join only the production workers
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail

          ansible-playbook \
            -i inventories/shared-k8s/hosts.ini \
            --private-key "${ANSIBLE_PRIVATE_KEY_FILE}" \
            playbooks/shared-k8s/07-join-prod-workers.yml

      - name: Apply only production labels and taints
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail

          ansible-playbook \
            -i inventories/shared-k8s/hosts.ini \
            --private-key "${ANSIBLE_PRIVATE_KEY_FILE}" \
            playbooks/shared-k8s/08-label-and-taint-prod-workers.yml

      - name: Verify production workers and confirm development and QA workers are preserved
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail

          ansible \
            -i inventories/shared-k8s/hosts.ini \
            first_control_plane \
            --private-key "${ANSIBLE_PRIVATE_KEY_FILE}" \
            -b \
            -m shell \
            -a '
              set -e
              export KUBECONFIG=/etc/kubernetes/admin.conf

              echo "=== PRODUCTION WORKERS ==="
              kubectl get nodes \
                -l environment=prod,workload=github-runner \
                -L environment,workload \
                -o wide

              echo "=== QA WORKERS — MUST REMAIN PRESENT ==="
              kubectl get nodes \
                -l environment=qa,workload=github-runner \
                -L environment,workload \
                -o wide

              echo "=== DEVELOPMENT WORKERS — MUST REMAIN PRESENT ==="
              kubectl get nodes \
                -l environment=dev,workload=github-runner \
                -L environment,workload \
                -o wide

              kubectl get --raw=/readyz
            '

The production workflow:

  • Uses production-specific playbook filenames.
  • Targets only prod_workers for the common baseline and node preparation.
  • Uses --list-hosts during validation to prove generic playbooks resolve only the four production workers.
  • Verifies all completed development and QA files remain present.
  • Verifies completed development and QA workers remain Ready.
  • Does not call the development or QA join playbooks or their label-and-taint playbooks.
  • Uses production addresses .205-.208.
  • Uses narrow path filters so a shared inventory-only change does not independently trigger this workflow.
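
The --list-hosts check described above can be sketched as follows. This is a minimal illustration, not the workflow's actual validation step. The function names are hypothetical, and the sketch assumes that the inventory host names equal the node names cicd-ac-k8s-prod-wk-01 through -04 and that the output uses the usual "hosts (N):" header followed by indented host names.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, not part of the repository.

# Read `ansible-playbook --list-hosts` output on stdin and print the
# unique host names found under each "hosts (N):" header, sorted.
extract_listed_hosts() {
  awk '
    /^[ \t]*hosts \([0-9]+\):/ { in_hosts = 1; next }  # header starts a host list
    /^[ \t]*$/                 { in_hosts = 0; next }  # blank line ends it
    /^[ \t]*play #/            { in_hosts = 0; next }  # next play ends it
    /^playbook:/               { in_hosts = 0; next }  # next playbook ends it
    in_hosts                   { print $1 }
  ' | sort -u
}

# Succeed only when stdin lists exactly the four production workers.
# Assumption: inventory host names equal the Kubernetes node names.
assert_only_prod_workers() {
  local expected actual
  expected="$(printf '%s\n' \
    cicd-ac-k8s-prod-wk-01 cicd-ac-k8s-prod-wk-02 \
    cicd-ac-k8s-prod-wk-03 cicd-ac-k8s-prod-wk-04)"
  actual="$(extract_listed_hosts)"
  [ "$actual" = "$expected" ]
}
```

A validation step could then pipe each generic playbook through the check, for example: ansible-playbook -i inventories/shared-k8s/hosts.ini --limit prod_workers --list-hosts playbooks/shared-k8s/01-common-baseline.yml | assert_only_prod_workers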

Workflow behavior:

Event                      Result
Push to dev                Cumulative inventory and production playbook validation only
Push to prod               Production baseline, preparation, join, labels, taints, and verification
Manual dispatch from prod  Idempotent production-worker reconciliation

21. Review the production-only change and prove existing files are unchanged

git status
git diff --check
git diff --stat

git diff -- \
  ansible/inventories/shared-k8s/hosts.ini \
  ansible/inventories/shared-k8s/group_vars/prod_workers.yml \
  ansible/playbooks/shared-k8s/07-join-prod-workers.yml \
  ansible/playbooks/shared-k8s/08-label-and-taint-prod-workers.yml \
  .github/workflows/ansible-configure-prod-workers.yml

git diff --exit-code -- \
  ansible/inventories/shared-k8s/group_vars/dev_workers.yml \
  ansible/inventories/shared-k8s/group_vars/qa_workers.yml \
  ansible/playbooks/shared-k8s/07-join-dev-workers.yml \
  ansible/playbooks/shared-k8s/08-label-and-taint-workers.yml \
  ansible/playbooks/shared-k8s/07-join-qa-workers.yml \
  ansible/playbooks/shared-k8s/08-label-and-taint-qa-workers.yml \
  .github/workflows/ansible-configure-dev-workers.yml \
  .github/workflows/ansible-configure-qa-workers.yml

The final git diff --exit-code command must return exit code 0. Any output means a completed development-worker or QA-worker file was modified; restore it before committing.
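
Note that a plain git diff compares the working tree with the index, so a modification that has already been staged with git add does not appear in it. A stricter guard, sketched here under the assumption that HEAD is the last commit containing the completed development and QA files, compares both the working tree and the index against HEAD. The helper name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: assert_paths_unchanged is a hypothetical helper name.

# Succeed only when none of the given paths differ from HEAD, either in
# the working tree (first check) or in the index (second check).
assert_paths_unchanged() {
  git diff --quiet HEAD -- "$@" &&
    git diff --quiet --cached HEAD -- "$@"
}

# Example, run from the repository root:
#   assert_paths_unchanged \
#     ansible/inventories/shared-k8s/group_vars/dev_workers.yml \
#     ansible/inventories/shared-k8s/group_vars/qa_workers.yml \
#     || echo "a completed development or QA file was modified"
```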

Confirm:

  • All development, QA, and production workers remain in hosts.ini.
  • Development and QA variables are unchanged.
  • Development and QA join playbooks are unchanged.
  • Development and QA labels-and-taints playbooks are unchanged.
  • Development and QA workflows are unchanged.
  • Production files use .205-.208, environment=prod, and environment=prod:NoSchedule.
  • The generic common-baseline and preparation commands use --limit prod_workers.
  • No secret, kubeconfig, join token, Terraform state, or private key is staged.
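
The last item can be partly automated by scanning the names of staged files. The sketch below is illustrative only: the helper name is hypothetical, the filename patterns are assumptions rather than an approved policy, and a name-based scan cannot detect a join token or secret pasted into an otherwise ordinary file.

```shell
#!/usr/bin/env bash
# Sketch only: the patterns are illustrative assumptions, not an
# exhaustive or approved secret-detection policy.

# Read file paths on stdin (one per line) and print those whose names
# suggest a kubeconfig, Terraform state file, or key material.
# Returns 0 when nothing suspicious is found, 1 otherwise.
flag_sensitive_paths() {
  local hits
  hits="$(grep -E '(^|/)(admin\.conf|kubeconfig[^/]*|id_(rsa|ed25519)[^/]*|[^/]*\.pem|[^/]*\.key|[^/]*\.tfstate(\.backup)?)$' || true)"
  if [ -n "$hits" ]; then
    printf '%s\n' "$hits"
    return 1
  fi
  return 0
}

# Example: git diff --cached --name-only | flag_sensitive_paths
```

The pattern is deliberately conservative and also flags public key files such as id_ed25519_ansible.pub; review any hit by hand rather than treating the scan as proof either way.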

Commit only the cumulative inventory and production-specific files:

git add \
  ansible/inventories/shared-k8s/hosts.ini \
  ansible/inventories/shared-k8s/group_vars/prod_workers.yml \
  ansible/playbooks/shared-k8s/07-join-prod-workers.yml \
  ansible/playbooks/shared-k8s/08-label-and-taint-prod-workers.yml \
  .github/workflows/ansible-configure-prod-workers.yml

git commit -m "Configure shared Kubernetes production workers without changing development or QA workers"

git push -u origin feature/configure-k8s-prod-workers

22. Create the Ansible pull request into dev

gh pr create \
  --base dev \
  --head feature/configure-k8s-prod-workers \
  --title "Configure shared Kubernetes production workers" \
  --body "Adds the Ubuntu baseline, Kubernetes prerequisites, worker joins, and approved production labels and taints while preserving development and QA workers."

Merge only after cumulative inventory validation, target-list validation, and all production playbook syntax checks succeed.

23. Promote the Ansible change from dev to prod

gh pr create \
  --base prod \
  --head dev \
  --title "Configure shared Kubernetes production workers" \
  --body "Promotes the validated four production worker configuration to prod."

After merge and environment approval, the workflow runs in this order:

  1. Confirm all completed development and QA workers remain Ready.
  2. Verify the existing Kubernetes API.
  3. Apply the Ubuntu baseline only to production workers.
  4. Configure containerd and Kubernetes prerequisites only on production workers.
  5. Generate a temporary production worker join command.
  6. Join the four production workers serially.
  7. Wait for every production worker to become Ready.
  8. Apply only production labels and the production taint.
  9. Display the production, QA, and development worker sets.

24. Manual verification

Run from prod-terraform-deploy-02:

ssh \
  -i ~/.ssh/id_ed25519_ansible \
  -o IdentitiesOnly=yes \
  acllc@192.168.8.202 \
  'sudo bash -s' <<'EOF'
# The quoted heredoc delimiter keeps ${node} and the jsonpath quotes
# intact until the script runs on the control plane.
export KUBECONFIG=/etc/kubernetes/admin.conf

echo "=== PRODUCTION WORKERS ==="
kubectl get nodes \
  -l environment=prod,workload=github-runner \
  -L environment,workload \
  -o wide

echo "=== PRODUCTION TAINTS ==="
for node in \
  cicd-ac-k8s-prod-wk-01 \
  cicd-ac-k8s-prod-wk-02 \
  cicd-ac-k8s-prod-wk-03 \
  cicd-ac-k8s-prod-wk-04
do
  echo "--- ${node} ---"
  kubectl get node "${node}" \
    -o jsonpath='{range .spec.taints[*]}{.key}={.value}:{.effect}{"\n"}{end}'
done

echo "=== QA WORKERS PRESERVED ==="
kubectl get nodes \
  -l environment=qa,workload=github-runner \
  -L environment,workload \
  -o wide

echo "=== DEVELOPMENT WORKERS PRESERVED ==="
kubectl get nodes \
  -l environment=dev,workload=github-runner \
  -L environment,workload \
  -o wide

echo "=== ALL NODES ==="
kubectl get nodes -o wide

echo "=== API READINESS ==="
kubectl get --raw=/readyz
EOF

Verify:

  • Four production workers are Ready.
  • Production labels and taints are correct.
  • Four QA workers remain present and Ready.
  • Four development workers remain present and Ready.
  • Development and QA labels and taints remain unchanged.
  • The cluster contains 15 Kubernetes nodes.
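
The Ready counts in this list can also be checked mechanically rather than by eye. The following is a minimal sketch with a hypothetical helper name. It assumes the default column layout of kubectl get nodes --no-headers, where the second column is STATUS.

```shell
#!/usr/bin/env bash
# Sketch only: count_ready_nodes is a hypothetical helper name.

# Read `kubectl get nodes --no-headers` output on stdin and print how
# many nodes report a STATUS of exactly "Ready". A cordoned node shows
# "Ready,SchedulingDisabled" and is deliberately not counted.
count_ready_nodes() {
  awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Example, run on the control plane with KUBECONFIG exported:
#   [ "$(kubectl get nodes --no-headers | count_ready_nodes)" -eq 15 ]
#   [ "$(kubectl get nodes --no-headers \
#        -l environment=prod,workload=github-runner | count_ready_nodes)" -eq 4 ]
```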

25. Expected final state

EXPECTED REBUILD ACCEPTANCE CHECKPOINT

Production worker VMs:
  cicd-ac-k8s-prod-wk-01  192.168.8.205  Ready
  cicd-ac-k8s-prod-wk-02  192.168.8.206  Ready
  cicd-ac-k8s-prod-wk-03  192.168.8.207  Ready
  cicd-ac-k8s-prod-wk-04  192.168.8.208  Ready

Kubernetes labels on every production worker:
  environment=prod
  workload=github-runner

Kubernetes taint on every production worker:
  environment=prod:NoSchedule

Cluster checkpoint after this page:
  3 control planes Ready
  4 development workers Ready and unchanged
  4 QA workers Ready and unchanged
  4 production workers Ready
  15 Kubernetes nodes total
  Kubernetes API /readyz returns ok

Still implemented by later pages:
  Shared ARC controller
  Repository and environment runner scale sets

This is a rebuild acceptance checkpoint, not a statement about the current live environment.

26. Failure handling

Terraform proposes changes to existing infrastructure

Stop. Do not apply. The plan must be exactly four additions and no changes or deletions.
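
One way to enforce this rule before an apply is to check the plan summary line. The sketch below uses a hypothetical helper name and assumes the plan text was produced with -no-color, so that the summary reads "Plan: 4 to add, 0 to change, 0 to destroy." Any other summary, including one that reports imports, fails the check.

```shell
#!/usr/bin/env bash
# Sketch only: assert_plan_adds_four is a hypothetical helper name.

# Read plain-text `terraform plan -no-color` output on stdin and succeed
# only when the summary reports exactly four additions, no changes, and
# no deletions.
assert_plan_adds_four() {
  grep -Eq '^Plan: 4 to add, 0 to change, 0 to destroy\.'
}

# Example: terraform plan -no-color | assert_plan_adds_four
```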

A worker receives the wrong DHCP address

Check its Proxmox MAC and router reservation. Do not configure a static Netplan address.
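
To confirm which worker has the wrong lease, the observed address can be compared with the approved allocation. This is a minimal sketch: the helper names are hypothetical, and the name-to-address table is the production allocation stated on this page.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical. The table below is the
# approved production allocation from this page.

# Print the approved address for a production worker name.
expected_prod_ip() {
  case "$1" in
    cicd-ac-k8s-prod-wk-01) echo 192.168.8.205 ;;
    cicd-ac-k8s-prod-wk-02) echo 192.168.8.206 ;;
    cicd-ac-k8s-prod-wk-03) echo 192.168.8.207 ;;
    cicd-ac-k8s-prod-wk-04) echo 192.168.8.208 ;;
    *) return 1 ;;
  esac
}

# Compare an observed address with the approved one for that worker.
# Returns 0 on a match, 1 on a mismatch, 2 for an unknown worker name.
check_prod_ip() {
  local name="$1" observed="$2" expected
  expected="$(expected_prod_ip "$name")" || {
    echo "unknown worker: $name" >&2
    return 2
  }
  if [ "$observed" = "$expected" ]; then
    echo "$name OK $observed"
  else
    echo "$name MISMATCH expected=$expected observed=$observed"
    return 1
  fi
}
```

On a worker, assuming the VM hostname equals the node name and the first reported address is the primary one, the check could be run as: check_prod_ip "$(hostname)" "$(hostname -I | awk '{print $1}')"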

An older workflow targets the new production workers unexpectedly

Confirm every generic playbook invocation is limited to the workflow's intended host group:

Control planes: --limit control_planes
Development:    --limit dev_workers
QA:             --limit qa_workers
Production:     --limit prod_workers

Do not remove the --limit arguments from environment-specific workflows.

APT reports a package is unavailable

The roles already force an APT refresh. Check DNS, internet access, Ubuntu repository files, and pkgs.k8s.io. Do not rebuild the template merely to preload packages.

Containerd CRI validation fails

Run on the affected worker:

sudo systemctl status containerd --no-pager
sudo ctr plugins ls
sudo grep -nE 'disabled_plugins|SystemdCgroup' /etc/containerd/config.toml
sudo crictl --runtime-endpoint unix:///run/containerd/containerd.sock info

The containerd role must continue using the corrected type-and-ID column checks for both legacy and containerd 2.x plugin layouts.

A worker cannot join

Inspect:

sudo journalctl -u kubelet -n 200 --no-pager
sudo crictl ps -a
sudo test -f /etc/kubernetes/kubelet.conf && echo joined || echo not-joined

Rerun the approved production-worker workflow. It generates a fresh token automatically. Do not paste join commands into Git.

A worker remains NotReady

From cicd-ac-k8s-cp-01, inspect:

sudo kubectl --kubeconfig=/etc/kubernetes/admin.conf get nodes -o wide
sudo kubectl --kubeconfig=/etc/kubernetes/admin.conf get pods -A -o wide
sudo kubectl --kubeconfig=/etc/kubernetes/admin.conf describe node <WORKER_NAME>

Also check the kubelet logs on the affected worker and the status of the Calico pod scheduled to it.

ARC pods are pending after the controller is installed later

The production runner values must include a matching selector and toleration:

nodeSelector:
  environment: prod
  workload: github-runner

tolerations:
- key: environment
  operator: Equal
  value: prod
  effect: NoSchedule

27. Project status after successful completion

Use this as the expected acceptance checkpoint for a fresh rebuild. It does not assert that any previously existing Kubernetes VM or cluster is still present.

FROM-SCRATCH STATUS AFTER THIS PAGE

Expected prerequisites retained
  Load balancer, API VIP, and three control planes healthy
  Development workers at 192.168.8.213-.216 Ready
  QA workers at 192.168.8.209-.212 Ready

Expected production result
  Production Terraform definitions added
  Production VMs at 192.168.8.205-.208 provisioned
  Production Ubuntu baseline applied
  Production Kubernetes prerequisites configured
  Production workers joined and Ready
  Production labels and NoSchedule taints verified
  Development and QA files and nodes preserved
  Full 15-node Kubernetes cluster verified

Next page
  Shared ARC controller

Later pages
  GitHub organization onboarding
  Repository and environment runner scale sets

After this rebuild checkpoint passes, continue to the shared ARC controller page and then the organization and repository runner-scale-set pages.