Provision and Join the Four Kubernetes Production Workers
This is the seventh infrastructure page in the from-scratch build sequence. Follow it after the development-worker and QA-worker pages have produced a healthy cluster; it provisions the four production ARC worker VMs.
FROM-SCRATCH SEQUENCE CHECKPOINT
Required before this page
Load-balancer VM, HAProxy, Keepalived, and API VIP configured
Three control-plane nodes Ready
Four development workers at 192.168.8.213-.216 Ready
Four QA workers at 192.168.8.209-.212 Ready
Development and QA labels and taints verified
Implemented by this page
Production worker Terraform definitions
Four production worker VM provisions
Ubuntu baseline and Kubernetes node preparation
Production worker joins
Production worker labels and taints
Verification that development and QA workers remain intact
Implemented by later pages
Shared ARC controller
Tenant runner scale sets
1. Scope and execution order
This page is performed in two separate promotions:
- Terraform promotion: create only the four production worker VMs.
- Ansible promotion: after all four VMs are reachable, configure Ubuntu, containerd, Kubernetes prerequisites, join the workers, and apply the approved production labels and taints.
Do not combine these two changes into the same production promotion. The Ansible workflow must not start until Terraform has created all four VMs and DHCP has assigned the approved addresses.
The approved branch model is:
feature/*
↓ pull request
dev
↓ validation and Terraform plan only
dev → prod pull request
↓ review and shared-k8s approval
prod
↓ Terraform apply or Ansible configuration
Not used by this infrastructure execution flow:
local, qa, main
dev performs validation and Terraform plan only. prod performs Terraform apply or Ansible configuration. Do not use local or qa in this infrastructure execution flow.
2. Approved production worker allocation
cicd-ac-k8s-prod-wk-01
VM ID: 3156205
MAC: aa:bb:cc:08:0f:01
Reserved IP: 192.168.8.205
CPU: 4 vCPU
RAM: 16384 MB
Disk: scsi0, 250G, local-lvm
cicd-ac-k8s-prod-wk-02
VM ID: 3156206
MAC: aa:bb:cc:08:0f:02
Reserved IP: 192.168.8.206
CPU: 4 vCPU
RAM: 16384 MB
Disk: scsi0, 250G, local-lvm
cicd-ac-k8s-prod-wk-03
VM ID: 3156207
MAC: aa:bb:cc:08:0f:03
Reserved IP: 192.168.8.207
CPU: 4 vCPU
RAM: 16384 MB
Disk: scsi0, 250G, local-lvm
cicd-ac-k8s-prod-wk-04
VM ID: 3156208
MAC: aa:bb:cc:08:0f:04
Reserved IP: 192.168.8.208
CPU: 4 vCPU
RAM: 16384 MB
Disk: scsi0, 250G, local-lvm
Shared values
Template: tmplt-ub-26-min-base / VM ID 90000
Node: pve
Bridge: vmbr0
Environment: prod
Workload: github-runner
The environment-specific address ranges are intentionally not ordered dev-first:
Approved worker address order
Production workers: 192.168.8.205-.208
QA workers: 192.168.8.209-.212
Development workers: 192.168.8.213-.216
Production workers therefore use VM IDs 3156205-3156208.
Confirm the router has all four DHCP reservations before running Terraform. Ubuntu must continue using DHCP; do not configure static Netplan addresses.
Do not change the approved VM IDs, MAC addresses, IP reservations, scsi0 disk slot, local-lvm storage, or production environment assignment while following this page.
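The approved allocation follows a regular pattern: for worker N (1 to 4), the VM ID is 3156204+N, the last MAC octet is N, and the last IP octet is 204+N. As a minimal sketch, the helper below (print_prod_allocation is a hypothetical name, not part of the repository) prints the four expected rows so they can be compared by eye with the router's DHCP reservation list. The pattern is inferred from the table above; the table remains the authority.

```shell
# Sketch only: print the approved production allocation, one worker per line.
# All values come from section 2; the function name is an assumption.
print_prod_allocation() {
  for i in 1 2 3 4; do
    # VM ID suffix and last IP octet track each other (205-208);
    # the last MAC octet is the worker index (01-04).
    printf 'cicd-ac-k8s-prod-wk-%02d vmid=%d mac=aa:bb:cc:08:0f:%02d ip=192.168.8.%d\n' \
      "${i}" "$((3156204 + i))" "${i}" "$((204 + i))"
  done
}
```

Calling print_prod_allocation on any workstation gives a four-line reference list; it does not contact Proxmox or the router.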
Part A — Provision the Production Worker VMs with Terraform
3. Terraform files changed
terraform/stacks/shared-k8s/main.tf
terraform/stacks/shared-k8s/outputs.tf
The proxmox-vm and proxmox-vm-group modules created by the preceding pages remain unchanged. This page adds a new production-worker map to the cumulative shared Kubernetes stack.
4. Create the Terraform feature branch from dev
Run from Windows PowerShell:
cd D:\code\ASPIRECLAN-LLC-Org\ac-cicd-infra
git switch dev
git pull --ff-only origin dev
git switch -c feature/provision-k8s-prod-workers
5. Extend terraform/stacks/shared-k8s/main.tf
Do not replace or edit the load-balancer, control-plane, development-worker, or QA-worker definitions created by the preceding pages. Append this production block after the existing QA-worker module:
locals {
  prod_workers = {
    prod_wk01 = {
      name          = "cicd-ac-k8s-prod-wk-01"
      description   = "Aspireclan shared Kubernetes production ARC worker 01"
      vmid          = 3156205
      target_node   = "pve"
      template_name = "tmplt-ub-26-min-base"
      cores         = 4
      memory_mb     = 16384
      disk_size     = "250G"
      storage       = "local-lvm"
      bridge        = "vmbr0"
      mac_address   = "aa:bb:cc:08:0f:01"
      tags = [
        "ac-cicd",
        "shared-k8s",
        "worker",
        "prod",
        "arc-runner",
        "terraform",
        "ansible",
      ]
    }
    prod_wk02 = {
      name          = "cicd-ac-k8s-prod-wk-02"
      description   = "Aspireclan shared Kubernetes production ARC worker 02"
      vmid          = 3156206
      target_node   = "pve"
      template_name = "tmplt-ub-26-min-base"
      cores         = 4
      memory_mb     = 16384
      disk_size     = "250G"
      storage       = "local-lvm"
      bridge        = "vmbr0"
      mac_address   = "aa:bb:cc:08:0f:02"
      tags = [
        "ac-cicd",
        "shared-k8s",
        "worker",
        "prod",
        "arc-runner",
        "terraform",
        "ansible",
      ]
    }
    prod_wk03 = {
      name          = "cicd-ac-k8s-prod-wk-03"
      description   = "Aspireclan shared Kubernetes production ARC worker 03"
      vmid          = 3156207
      target_node   = "pve"
      template_name = "tmplt-ub-26-min-base"
      cores         = 4
      memory_mb     = 16384
      disk_size     = "250G"
      storage       = "local-lvm"
      bridge        = "vmbr0"
      mac_address   = "aa:bb:cc:08:0f:03"
      tags = [
        "ac-cicd",
        "shared-k8s",
        "worker",
        "prod",
        "arc-runner",
        "terraform",
        "ansible",
      ]
    }
    prod_wk04 = {
      name          = "cicd-ac-k8s-prod-wk-04"
      description   = "Aspireclan shared Kubernetes production ARC worker 04"
      vmid          = 3156208
      target_node   = "pve"
      template_name = "tmplt-ub-26-min-base"
      cores         = 4
      memory_mb     = 16384
      disk_size     = "250G"
      storage       = "local-lvm"
      bridge        = "vmbr0"
      mac_address   = "aa:bb:cc:08:0f:04"
      tags = [
        "ac-cicd",
        "shared-k8s",
        "worker",
        "prod",
        "arc-runner",
        "terraform",
        "ansible",
      ]
    }
  }
}
module "prod_workers" {
  source = "../../modules/proxmox-vm-group"
  vms    = local.prod_workers
}
The four workers use the approved sizing of 4 vCPU, 16 GB RAM, and 250 GB disk per VM.
6. Extend terraform/stacks/shared-k8s/outputs.tf
Append:
output "prod_workers" {
  description = "Shared Kubernetes production worker VMs."
  value = {
    prod_wk01 = merge(module.prod_workers.vms["prod_wk01"], {
      reserved_ip = "192.168.8.205"
      mac_address = "aa:bb:cc:08:0f:01"
      environment = "prod"
    })
    prod_wk02 = merge(module.prod_workers.vms["prod_wk02"], {
      reserved_ip = "192.168.8.206"
      mac_address = "aa:bb:cc:08:0f:02"
      environment = "prod"
    })
    prod_wk03 = merge(module.prod_workers.vms["prod_wk03"], {
      reserved_ip = "192.168.8.207"
      mac_address = "aa:bb:cc:08:0f:03"
      environment = "prod"
    })
    prod_wk04 = merge(module.prod_workers.vms["prod_wk04"], {
      reserved_ip = "192.168.8.208"
      mac_address = "aa:bb:cc:08:0f:04"
      environment = "prod"
    })
  }
}
7. Confirm the existing Terraform workflow contract
The Terraform workflows created by the preceding pages must continue to implement:
Terraform plan workflow
push branch: dev
manual dispatch: supported
pull_request trigger: not used
action: fmt, init, validate, plan
Terraform apply workflow
push branch: prod
manual dispatch: supported from prod
action: fmt, init, validate, saved plan, apply
Persistent state
/var/lib/ac-cicd-infra/terraform-state/shared-k8s/terraform.tfstate
No new workflow is required for this Terraform change. The existing path filters already cover the shared stack.
8. Review and commit the Terraform change
git status
git diff --check
git diff --stat
git diff -- \
terraform/stacks/shared-k8s/main.tf \
terraform/stacks/shared-k8s/outputs.tf
Confirm:
- The load balancer, control planes, development workers, and QA workers created by the preceding pages remain unchanged.
- Exactly four VM additions are present.
- VM IDs are 3156205 through 3156208.
- MAC addresses are aa:bb:cc:08:0f:01 through aa:bb:cc:08:0f:04.
- Each worker uses 4 cores, 16384 MB RAM, and a 250G scsi0 disk on local-lvm.
- No Terraform state, Proxmox token, kubeconfig, join command, or private SSH key is staged.
Commit and push:
git add \
terraform/stacks/shared-k8s/main.tf \
terraform/stacks/shared-k8s/outputs.tf
git commit -m "Provision shared Kubernetes production workers"
git push -u origin feature/provision-k8s-prod-workers
9. Create the Terraform pull request into dev
gh pr create \
--base dev \
--head feature/provision-k8s-prod-workers \
--title "Provision shared Kubernetes production workers" \
--body "Adds the four approved production worker VMs to the shared Kubernetes Terraform stack."
After merge, the dev plan must end with:
Plan: 4 to add, 0 to change, 0 to destroy.
Do not promote to prod when Terraform proposes any update, replacement, or deletion of existing infrastructure, or anything other than the four approved production workers.
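The promotion rule above can also be checked mechanically rather than by eye. The sketch below assumes the plan text was captured with colour disabled (for example from `terraform plan -no-color`); plan_is_four_adds is a hypothetical helper name, not something the existing workflows are known to contain.

```shell
# Sketch only: succeed when the plan text on stdin contains the exact
# approved summary line, and fail for any other add/change/destroy count.
plan_is_four_adds() {
  # -F: fixed string, -x: the whole line must match.
  grep -Fx 'Plan: 4 to add, 0 to change, 0 to destroy.' >/dev/null
}
```

A reviewer could save the plan log from the dev run and call `plan_is_four_adds < plan.log`. A non-zero exit status means the summary differs and the change should not be promoted. The check covers only the counts, so the four resource addresses still need a manual read.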
10. Promote the Terraform change from dev to prod
gh pr create \
--base prod \
--head dev \
--title "Provision shared Kubernetes production workers" \
--body "Promotes the validated four production worker Terraform plan to prod."
After merge and shared-k8s environment approval, the production workflow applies the saved plan and writes the new worker resources to the existing persistent Terraform state.
11. Verify the worker VMs in Proxmox
Run on the Proxmox host:
qm status 3156205
qm config 3156205
qm status 3156206
qm config 3156206
qm status 3156207
qm config 3156207
qm status 3156208
qm config 3156208
For every VM, confirm:
- Status is running.
- CPU is 4 cores.
- RAM is 16384 MB.
- scsi0 is on local-lvm with size 250G.
- The expected MAC address is present.
- onboot remains enabled.
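The configuration checks in this list can be scripted against the `qm config` output. This is a minimal sketch that assumes the usual `key: value` layout of `qm config` (for example `cores: 4` and `net0: virtio=<MAC>,bridge=vmbr0`); check_worker_config is a hypothetical helper name. It does not check the running status, which still comes from `qm status`.

```shell
# Sketch only: read `qm config <vmid>` output on stdin and compare it with
# the approved sizing. $1 is the expected MAC address for that VM.
check_worker_config() {
  expected_mac="$1"
  cfg="$(cat)"
  printf '%s\n' "${cfg}" | grep -E '^cores: 4$' >/dev/null \
    || { echo "cores mismatch"; return 1; }
  printf '%s\n' "${cfg}" | grep -E '^memory: 16384$' >/dev/null \
    || { echo "memory mismatch"; return 1; }
  printf '%s\n' "${cfg}" | grep -E '^scsi0: local-lvm:.*size=250G' >/dev/null \
    || { echo "scsi0 mismatch"; return 1; }
  # Match the MAC case-insensitively, since Proxmox may print it in upper case.
  printf '%s\n' "${cfg}" | grep -Ei "^net[0-9]+: [^,]*=${expected_mac}(,|\$)" >/dev/null \
    || { echo "MAC mismatch"; return 1; }
  printf '%s\n' "${cfg}" | grep -E '^onboot: 1$' >/dev/null \
    || { echo "onboot disabled"; return 1; }
  echo "ok"
}
```

On the Proxmox host this could be used as `qm config 3156205 | check_worker_config aa:bb:cc:08:0f:01`, repeated for each VM ID with its approved MAC address.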
12. Verify DHCP, SSH, and sudo
Run from prod-terraform-deploy-02:
for ip in 205 206 207 208; do
echo "=== 192.168.8.${ip} ==="
ping -c 2 -W 2 "192.168.8.${ip}"
ssh \
-i ~/.ssh/id_ed25519_ansible \
-o IdentitiesOnly=yes \
-o BatchMode=yes \
-o ConnectTimeout=10 \
"acllc@192.168.8.${ip}" \
'hostnamectl --static; ip -brief address; sudo -n whoami'
done
Expected before Ansible:
- .205 through .208 respond.
- The Ansible automation key authenticates as acllc.
- sudo -n whoami returns root.
- The Ubuntu hostname may still show the base-template hostname.
Stop here until all four workers pass SSH and passwordless sudo checks.
Part B — Configure and Join the Production Workers with Ansible
13. Production-only Ansible files changed
ansible/inventories/shared-k8s/hosts.ini
ansible/inventories/shared-k8s/group_vars/prod_workers.yml
ansible/playbooks/shared-k8s/07-join-prod-workers.yml
ansible/playbooks/shared-k8s/08-label-and-taint-prod-workers.yml
.github/workflows/ansible-configure-prod-workers.yml
The development-worker and QA-worker implementations created by the preceding pages must remain intact. This production phase does not replace their variables, join playbooks, labels-and-taints playbooks, or workflows.
Preserve without replacement
ansible/inventories/shared-k8s/group_vars/dev_workers.yml
ansible/inventories/shared-k8s/group_vars/qa_workers.yml
ansible/roles/common/**
ansible/roles/containerd/**
ansible/roles/kubernetes-common/**
ansible/roles/kubernetes-worker/**
ansible/playbooks/shared-k8s/01-common-baseline.yml
ansible/playbooks/shared-k8s/03-prepare-kubernetes-nodes.yml
ansible/playbooks/shared-k8s/07-join-dev-workers.yml
ansible/playbooks/shared-k8s/08-label-and-taint-workers.yml
ansible/playbooks/shared-k8s/07-join-qa-workers.yml
ansible/playbooks/shared-k8s/08-label-and-taint-qa-workers.yml
.github/workflows/ansible-configure-control-planes.yml
.github/workflows/ansible-configure-dev-workers.yml
.github/workflows/ansible-configure-qa-workers.yml
The production phase makes only these cumulative or new-file changes:
Update cumulatively
ansible/inventories/shared-k8s/hosts.ini
Create new production-only files
ansible/inventories/shared-k8s/group_vars/prod_workers.yml
ansible/playbooks/shared-k8s/07-join-prod-workers.yml
ansible/playbooks/shared-k8s/08-label-and-taint-prod-workers.yml
.github/workflows/ansible-configure-prod-workers.yml
The common, containerd, Kubernetes-common, and Kubernetes-worker roles created by the preceding pages contain the required permanent fixes and are reused without replacement:
- Forced APT cache refresh before package installation.
- Package-install retries.
- ansible_facts access instead of deprecated injected variables.
- /var/tmp/ansible-acllc as the remote temporary directory.
- Containerd 2.x-compatible CRI validation.
- Correct task-level indentation.
- Explicit crictl runtime and image endpoints.
The control-plane workflow created by the preceding page must retain --limit control_planes for its common-baseline and Kubernetes-preparation steps. This prevents later inventory additions from targeting workers.
14. Create the Ansible feature branch from dev
Create this branch only after the production Terraform apply has completed successfully:
cd D:\code\ASPIRECLAN-LLC-Org\ac-cicd-infra
git switch dev
git pull --ff-only origin dev
git switch -c feature/configure-k8s-prod-workers
15. Update the shared inventory cumulatively
At this rebuild stage, the inventory must contain the control planes, development workers, and QA workers created by preceding pages plus four new production workers. Replace ansible/inventories/shared-k8s/hosts.ini only with this complete cumulative version:
[load_balancers]
cicd-ac-k8s-lb-01 ansible_host=192.168.8.201 ansible_user=acllc node_primary_ip=192.168.8.201 node_interface=ens18
[first_control_plane]
cicd-ac-k8s-cp-01 ansible_host=192.168.8.202 ansible_user=acllc node_primary_ip=192.168.8.202 node_interface=ens18
[additional_control_planes]
cicd-ac-k8s-cp-02 ansible_host=192.168.8.203 ansible_user=acllc node_primary_ip=192.168.8.203 node_interface=ens18
cicd-ac-k8s-cp-03 ansible_host=192.168.8.204 ansible_user=acllc node_primary_ip=192.168.8.204 node_interface=ens18
[control_planes:children]
first_control_plane
additional_control_planes
[dev_workers]
cicd-ac-k8s-dev-wk-01 ansible_host=192.168.8.213 ansible_user=acllc node_primary_ip=192.168.8.213 node_interface=ens18
cicd-ac-k8s-dev-wk-02 ansible_host=192.168.8.214 ansible_user=acllc node_primary_ip=192.168.8.214 node_interface=ens18
cicd-ac-k8s-dev-wk-03 ansible_host=192.168.8.215 ansible_user=acllc node_primary_ip=192.168.8.215 node_interface=ens18
cicd-ac-k8s-dev-wk-04 ansible_host=192.168.8.216 ansible_user=acllc node_primary_ip=192.168.8.216 node_interface=ens18
[qa_workers]
cicd-ac-k8s-qa-wk-01 ansible_host=192.168.8.209 ansible_user=acllc node_primary_ip=192.168.8.209 node_interface=ens18
cicd-ac-k8s-qa-wk-02 ansible_host=192.168.8.210 ansible_user=acllc node_primary_ip=192.168.8.210 node_interface=ens18
cicd-ac-k8s-qa-wk-03 ansible_host=192.168.8.211 ansible_user=acllc node_primary_ip=192.168.8.211 node_interface=ens18
cicd-ac-k8s-qa-wk-04 ansible_host=192.168.8.212 ansible_user=acllc node_primary_ip=192.168.8.212 node_interface=ens18
[prod_workers]
cicd-ac-k8s-prod-wk-01 ansible_host=192.168.8.205 ansible_user=acllc node_primary_ip=192.168.8.205 node_interface=ens18
cicd-ac-k8s-prod-wk-02 ansible_host=192.168.8.206 ansible_user=acllc node_primary_ip=192.168.8.206 node_interface=ens18
cicd-ac-k8s-prod-wk-03 ansible_host=192.168.8.207 ansible_user=acllc node_primary_ip=192.168.8.207 node_interface=ens18
cicd-ac-k8s-prod-wk-04 ansible_host=192.168.8.208 ansible_user=acllc node_primary_ip=192.168.8.208 node_interface=ens18
[workers:children]
dev_workers
qa_workers
prod_workers
[k8s_cluster:children]
control_planes
workers
[all:vars]
ansible_python_interpreter=/usr/bin/python3
Before continuing, confirm all three worker groups remain present:
grep -nE 'cicd-ac-k8s-dev-wk-0[1-4]' ansible/inventories/shared-k8s/hosts.ini
grep -nE 'cicd-ac-k8s-qa-wk-0[1-4]' ansible/inventories/shared-k8s/hosts.ini
grep -nE 'cicd-ac-k8s-prod-wk-0[1-4]' ansible/inventories/shared-k8s/hosts.ini
16. Create production worker variables without changing development or QA variables
Do not edit or replace:
ansible/inventories/shared-k8s/group_vars/dev_workers.yml
ansible/inventories/shared-k8s/group_vars/qa_workers.yml
Create the new file ansible/inventories/shared-k8s/group_vars/prod_workers.yml:
---
worker_environment: prod
worker_workload: github-runner
kubernetes_node_tcp_ports:
  - "10250"
calico_node_tcp_ports:
  - "5473"
calico_node_udp_ports:
  - "4789"
worker_labels:
  environment: prod
  workload: github-runner
worker_taints:
  - key: environment
    value: prod
    effect: NoSchedule
The existing control_planes.yml, generic Kubernetes firewall tasks, node-preparation playbook, and Kubernetes-worker role remain unchanged.
17. Preserve the existing development and QA playbooks
Do not edit, rename, or replace:
ansible/playbooks/shared-k8s/07-join-dev-workers.yml
ansible/playbooks/shared-k8s/08-label-and-taint-workers.yml
ansible/playbooks/shared-k8s/07-join-qa-workers.yml
ansible/playbooks/shared-k8s/08-label-and-taint-qa-workers.yml
Those files continue to own development-worker and QA-worker reconciliation.
18. Add a production-specific worker-join playbook
Create the new file:
ansible/playbooks/shared-k8s/07-join-prod-workers.yml
Use:
---
- name: Generate a fresh Kubernetes worker join command for production workers
  hosts: first_control_plane
  become: true
  gather_facts: false
  tasks:
    - name: Confirm the Kubernetes API is ready
      ansible.builtin.command:
        cmd: kubectl --kubeconfig=/etc/kubernetes/admin.conf get --raw=/readyz
      register: prod_worker_join_api_ready
      changed_when: false
      retries: 12
      delay: 10
      until: prod_worker_join_api_ready.stdout | trim == "ok"
    - name: Generate a fresh production worker bootstrap-token join command
      ansible.builtin.command:
        cmd: kubeadm token create --ttl 2h --print-join-command
      register: generated_prod_worker_join_command
      changed_when: true
      no_log: true
    - name: Store the temporary production worker join command in memory
      ansible.builtin.set_fact:
        shared_prod_worker_join_command: "{{ generated_prod_worker_join_command.stdout }}"
      no_log: true
- name: Join the production workers one at a time
  hosts: prod_workers
  serial: 1
  become: true
  gather_facts: true
  vars:
    worker_join_command: >-
      {{ hostvars[groups['first_control_plane'][0]].shared_prod_worker_join_command }}
  roles:
    - role: kubernetes-worker
- name: Verify all production workers are Ready
  hosts: first_control_plane
  become: true
  gather_facts: false
  environment:
    KUBECONFIG: /etc/kubernetes/admin.conf
  tasks:
    - name: Wait for each production worker to become Ready
      ansible.builtin.command:
        cmd: >-
          kubectl wait
          --for=condition=Ready
          node/{{ item }}
          --timeout=15m
      loop: "{{ groups['prod_workers'] }}"
      changed_when: false
    - name: Display all nodes after the production worker joins
      ansible.builtin.command:
        cmd: kubectl get nodes -o wide
      register: joined_prod_workers
      changed_when: false
    - name: Print the node table
      ansible.builtin.debug:
        var: joined_prod_workers.stdout_lines
This playbook targets only prod_workers. It cannot join or modify development or QA workers.
19. Add a production-specific labels-and-taints playbook
Create the new file:
ansible/playbooks/shared-k8s/08-label-and-taint-prod-workers.yml
Use:
---
- name: Apply production worker labels and taints
  hosts: first_control_plane
  become: true
  gather_facts: false
  environment:
    KUBECONFIG: /etc/kubernetes/admin.conf
  tasks:
    - name: Apply the approved production worker labels
      ansible.builtin.command:
        cmd: >-
          kubectl label node {{ item }}
          environment={{ hostvars[item].worker_environment | default('prod') }}
          workload={{ hostvars[item].worker_workload | default('github-runner') }}
          --overwrite
      loop: "{{ groups['prod_workers'] }}"
      register: prod_worker_label_results
      changed_when: "'not labeled' not in prod_worker_label_results.stdout"
    - name: Apply the approved production worker taint
      ansible.builtin.command:
        cmd: >-
          kubectl taint node {{ item }}
          environment=prod:NoSchedule
          --overwrite
      loop: "{{ groups['prod_workers'] }}"
      register: prod_worker_taint_results
      changed_when: "'not tainted' not in prod_worker_taint_results.stdout"
    - name: Display production worker labels
      ansible.builtin.command:
        cmd: >-
          kubectl get nodes
          -l environment=prod,workload=github-runner
          -L environment,workload
          -o wide
      register: labeled_prod_workers
      changed_when: false
    - name: Print labeled production workers
      ansible.builtin.debug:
        var: labeled_prod_workers.stdout_lines
    - name: Verify the production taint on every worker
      ansible.builtin.shell:
        executable: /bin/bash
        cmd: |
          set -euo pipefail
          kubectl get node {{ item }} \
            -o jsonpath='{range .spec.taints[*]}{.key}={.value}:{.effect}{"\n"}{end}' |
            grep -Fx 'environment=prod:NoSchedule'
      loop: "{{ groups['prod_workers'] }}"
      changed_when: false
Every production worker receives:
environment=prod
workload=github-runner
environment=prod:NoSchedule
The production default is explicitly prod; it does not fall back to dev or qa.
20. Add the production-worker GitHub Actions workflow
Create .github/workflows/ansible-configure-prod-workers.yml:
name: Ansible Configure - Kubernetes Production Workers
on:
  push:
    branches:
      - dev
      - prod
    paths:
      - "ansible/inventories/shared-k8s/group_vars/prod_workers.yml"
      - "ansible/playbooks/shared-k8s/07-join-prod-workers.yml"
      - "ansible/playbooks/shared-k8s/08-label-and-taint-prod-workers.yml"
      - ".github/workflows/ansible-configure-prod-workers.yml"
  workflow_dispatch:
permissions:
  contents: read
concurrency:
  group: shared-k8s-ansible
  cancel-in-progress: false
env:
  ANSIBLE_CONFIG: ${{ github.workspace }}/ansible/ansible.cfg
jobs:
  validate:
    name: Validate production-worker Ansible configuration
    runs-on:
      - self-hosted
      - Linux
      - X64
      - prod
      - terraform
      - deploy
      - ac-cicd-infra
    steps:
      - name: Checkout repository
        uses: actions/checkout@v5
      - name: Verify Ansible
        shell: bash
        run: |
          set -euo pipefail
          ansible --version
          ansible-playbook --version
      - name: Validate the cumulative shared inventory
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail
          ansible-inventory \
            -i inventories/shared-k8s/hosts.ini \
            --graph
          for host in \
            cicd-ac-k8s-dev-wk-01 \
            cicd-ac-k8s-dev-wk-02 \
            cicd-ac-k8s-dev-wk-03 \
            cicd-ac-k8s-dev-wk-04 \
            cicd-ac-k8s-qa-wk-01 \
            cicd-ac-k8s-qa-wk-02 \
            cicd-ac-k8s-qa-wk-03 \
            cicd-ac-k8s-qa-wk-04 \
            cicd-ac-k8s-prod-wk-01 \
            cicd-ac-k8s-prod-wk-02 \
            cicd-ac-k8s-prod-wk-03 \
            cicd-ac-k8s-prod-wk-04
          do
            grep -Fq "${host} " inventories/shared-k8s/hosts.ini
          done
      - name: Confirm completed development and QA files remain present
        shell: bash
        run: |
          set -euo pipefail
          required_existing_files=(
            "ansible/inventories/shared-k8s/group_vars/dev_workers.yml"
            "ansible/inventories/shared-k8s/group_vars/qa_workers.yml"
            "ansible/playbooks/shared-k8s/07-join-dev-workers.yml"
            "ansible/playbooks/shared-k8s/08-label-and-taint-workers.yml"
            "ansible/playbooks/shared-k8s/07-join-qa-workers.yml"
            "ansible/playbooks/shared-k8s/08-label-and-taint-qa-workers.yml"
            ".github/workflows/ansible-configure-dev-workers.yml"
            ".github/workflows/ansible-configure-qa-workers.yml"
          )
          for file in "${required_existing_files[@]}"; do
            if [ ! -f "${file}" ]; then
              echo "ERROR: Completed worker file is missing: ${file}"
              exit 1
            fi
          done
      - name: Confirm each generic playbook resolves only production targets
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail
          for playbook in \
            playbooks/shared-k8s/01-common-baseline.yml \
            playbooks/shared-k8s/03-prepare-kubernetes-nodes.yml
          do
            output="$(
              ansible-playbook \
                -i inventories/shared-k8s/hosts.ini \
                --limit prod_workers \
                "${playbook}" \
                --list-hosts
            )"
            printf '%s\n' "${output}"
            for host in \
              cicd-ac-k8s-prod-wk-01 \
              cicd-ac-k8s-prod-wk-02 \
              cicd-ac-k8s-prod-wk-03 \
              cicd-ac-k8s-prod-wk-04
            do
              grep -Fq "${host}" <<< "${output}"
            done
            if grep -Eq 'cicd-ac-k8s-(dev|qa)-wk-' <<< "${output}"; then
              echo "ERROR: Production workflow target includes a non-production worker."
              exit 1
            fi
          done
      - name: Syntax-check the production-worker playbooks
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail
          for playbook in \
            playbooks/shared-k8s/01-common-baseline.yml \
            playbooks/shared-k8s/03-prepare-kubernetes-nodes.yml \
            playbooks/shared-k8s/07-join-prod-workers.yml \
            playbooks/shared-k8s/08-label-and-taint-prod-workers.yml
          do
            ansible-playbook \
              -i inventories/shared-k8s/hosts.ini \
              "${playbook}" \
              --syntax-check
          done
  configure:
    name: Configure and join production workers
    needs:
      - validate
    if: >-
      (github.event_name == 'push' && github.ref_name == 'prod') ||
      (github.event_name == 'workflow_dispatch' && github.ref_name == 'prod')
    environment:
      name: shared-k8s
    runs-on:
      - self-hosted
      - Linux
      - X64
      - prod
      - terraform
      - deploy
      - ac-cicd-infra
    timeout-minutes: 180
    steps:
      - name: Checkout repository
        uses: actions/checkout@v5
      - name: Verify the production branch
        shell: bash
        run: |
          set -euo pipefail
          if [ "${GITHUB_REF_NAME}" != "prod" ]; then
            echo "ERROR: Production-worker configuration is permitted only from prod."
            exit 1
          fi
      - name: Prepare the existing Ansible SSH key
        shell: bash
        run: |
          set -euo pipefail
          KEY_PATH="${HOME}/.ssh/id_ed25519_ansible"
          if [ ! -f "${KEY_PATH}" ]; then
            echo "ERROR: Missing Ansible key: ${KEY_PATH}"
            exit 1
          fi
          chmod 600 "${KEY_PATH}"
          echo "ANSIBLE_PRIVATE_KEY_FILE=${KEY_PATH}" >> "${GITHUB_ENV}"
      - name: Refresh production-worker SSH host keys
        shell: bash
        run: |
          set -euo pipefail
          mkdir -p "${HOME}/.ssh"
          chmod 700 "${HOME}/.ssh"
          touch "${HOME}/.ssh/known_hosts"
          chmod 600 "${HOME}/.ssh/known_hosts"
          for ip in 192.168.8.205 192.168.8.206 192.168.8.207 192.168.8.208; do
            ssh-keygen -f "${HOME}/.ssh/known_hosts" -R "${ip}" || true
            captured=false
            for attempt in $(seq 1 60); do
              if ssh-keyscan -T 5 -H "${ip}" >> "${HOME}/.ssh/known_hosts" 2>/dev/null; then
                echo "SSH host key captured for ${ip}."
                captured=true
                break
              fi
              echo "Waiting for SSH on ${ip} (attempt ${attempt}/60)..."
              sleep 10
            done
            if [ "${captured}" != "true" ]; then
              echo "ERROR: Unable to capture SSH host key for ${ip}."
              exit 1
            fi
          done
      - name: Prepare the Ansible remote temporary directory
        shell: bash
        run: |
          set -euo pipefail
          for ip in 192.168.8.205 192.168.8.206 192.168.8.207 192.168.8.208; do
            ssh \
              -i "${ANSIBLE_PRIVATE_KEY_FILE}" \
              -o IdentitiesOnly=yes \
              -o BatchMode=yes \
              "acllc@${ip}" \
              'sudo install -d -m 0700 -o acllc -g acllc /var/tmp/ansible-acllc'
          done
      - name: Verify Ansible connectivity
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail
          ansible \
            -i inventories/shared-k8s/hosts.ini \
            prod_workers \
            --private-key "${ANSIBLE_PRIVATE_KEY_FILE}" \
            -m ping
      - name: Confirm the existing development and QA workers remain Ready
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail
          ansible \
            -i inventories/shared-k8s/hosts.ini \
            first_control_plane \
            --private-key "${ANSIBLE_PRIVATE_KEY_FILE}" \
            -b \
            -m shell \
            -a '
              set -e
              export KUBECONFIG=/etc/kubernetes/admin.conf
              for node in \
                cicd-ac-k8s-dev-wk-01 \
                cicd-ac-k8s-dev-wk-02 \
                cicd-ac-k8s-dev-wk-03 \
                cicd-ac-k8s-dev-wk-04 \
                cicd-ac-k8s-qa-wk-01 \
                cicd-ac-k8s-qa-wk-02 \
                cicd-ac-k8s-qa-wk-03 \
                cicd-ac-k8s-qa-wk-04
              do
                kubectl wait --for=condition=Ready "node/${node}" --timeout=2m
              done
            '
      - name: Confirm the existing Kubernetes API is healthy
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail
          ansible \
            -i inventories/shared-k8s/hosts.ini \
            first_control_plane \
            --private-key "${ANSIBLE_PRIVATE_KEY_FILE}" \
            -b \
            -m command \
            -a 'kubectl --kubeconfig=/etc/kubernetes/admin.conf get --raw=/readyz'
      - name: Apply the common Ubuntu baseline only to production workers
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail
          ansible-playbook \
            -i inventories/shared-k8s/hosts.ini \
            --private-key "${ANSIBLE_PRIVATE_KEY_FILE}" \
            --limit prod_workers \
            playbooks/shared-k8s/01-common-baseline.yml
      - name: Prepare containerd and Kubernetes prerequisites only on production workers
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail
          ansible-playbook \
            -i inventories/shared-k8s/hosts.ini \
            --private-key "${ANSIBLE_PRIVATE_KEY_FILE}" \
            --limit prod_workers \
            playbooks/shared-k8s/03-prepare-kubernetes-nodes.yml
      - name: Join only the production workers
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail
          ansible-playbook \
            -i inventories/shared-k8s/hosts.ini \
            --private-key "${ANSIBLE_PRIVATE_KEY_FILE}" \
            playbooks/shared-k8s/07-join-prod-workers.yml
      - name: Apply only production labels and taints
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail
          ansible-playbook \
            -i inventories/shared-k8s/hosts.ini \
            --private-key "${ANSIBLE_PRIVATE_KEY_FILE}" \
            playbooks/shared-k8s/08-label-and-taint-prod-workers.yml
      - name: Verify production workers and preserve development and QA workers
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail
          ansible \
            -i inventories/shared-k8s/hosts.ini \
            first_control_plane \
            --private-key "${ANSIBLE_PRIVATE_KEY_FILE}" \
            -b \
            -m shell \
            -a '
              set -e
              export KUBECONFIG=/etc/kubernetes/admin.conf
              echo "=== PRODUCTION WORKERS ==="
              kubectl get nodes \
                -l environment=prod,workload=github-runner \
                -L environment,workload \
                -o wide
              echo "=== QA WORKERS — MUST REMAIN PRESENT ==="
              kubectl get nodes \
                -l environment=qa,workload=github-runner \
                -L environment,workload \
                -o wide
              echo "=== DEVELOPMENT WORKERS — MUST REMAIN PRESENT ==="
              kubectl get nodes \
                -l environment=dev,workload=github-runner \
                -L environment,workload \
                -o wide
              kubectl get --raw=/readyz
            '
The production workflow:
- Uses production-specific playbook filenames.
- Targets only prod_workers for the common baseline and node preparation.
- Uses --list-hosts during validation to prove generic playbooks resolve only the four production workers.
- Verifies all completed development and QA files remain present.
- Verifies completed development and QA workers remain Ready.
- Does not call the development or QA join and labels playbooks.
- Uses production addresses .205-.208.
- Uses narrow path filters so a shared inventory-only change does not independently trigger this workflow.
Workflow behavior:
| Event | Result |
|---|---|
| Push to dev | Cumulative inventory and production playbook validation only |
| Push to prod | Production baseline, preparation, join, labels, taints, and verification |
| Manual dispatch from prod | Idempotent production-worker reconciliation |
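The table reflects the `if:` condition on the configure job: it runs only for a push or a manual dispatch whose ref is prod, and every other combination stops after validation. As an illustrative sketch of that gate (should_configure is a hypothetical name used only here, not part of the workflow):

```shell
# Sketch only: mirror the configure job's condition.
# $1 is the event name, $2 is the branch (ref) name.
should_configure() {
  event="$1"
  ref="$2"
  # Anything outside prod is validation-only.
  [ "${ref}" = "prod" ] || return 1
  case "${event}" in
    push|workflow_dispatch) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `should_configure push dev` fails (validation only), while `should_configure workflow_dispatch prod` succeeds (reconciliation runs after environment approval).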
21. Review the production-only change and prove existing files are unchanged
git status
git diff --check
git diff --stat
git diff -- \
ansible/inventories/shared-k8s/hosts.ini \
ansible/inventories/shared-k8s/group_vars/prod_workers.yml \
ansible/playbooks/shared-k8s/07-join-prod-workers.yml \
ansible/playbooks/shared-k8s/08-label-and-taint-prod-workers.yml \
.github/workflows/ansible-configure-prod-workers.yml
git diff --exit-code -- \
ansible/inventories/shared-k8s/group_vars/dev_workers.yml \
ansible/inventories/shared-k8s/group_vars/qa_workers.yml \
ansible/playbooks/shared-k8s/07-join-dev-workers.yml \
ansible/playbooks/shared-k8s/08-label-and-taint-workers.yml \
ansible/playbooks/shared-k8s/07-join-qa-workers.yml \
ansible/playbooks/shared-k8s/08-label-and-taint-qa-workers.yml \
.github/workflows/ansible-configure-dev-workers.yml \
.github/workflows/ansible-configure-qa-workers.yml
The final git diff --exit-code command must return exit code 0. Any output means a completed development-worker or QA-worker file was modified; restore it before committing.
Confirm:
- All development, QA, and production workers remain in hosts.ini.
- Development and QA variables are unchanged.
- Development and QA join playbooks are unchanged.
- Development and QA labels-and-taints playbooks are unchanged.
- Development and QA workflows are unchanged.
- Production files use .205-.208, environment=prod, and environment=prod:NoSchedule.
- The generic common-baseline and preparation commands use --limit prod_workers.
- No secret, kubeconfig, join token, Terraform state, or private key is staged.
Commit only the cumulative inventory and production-specific files:
git add \
ansible/inventories/shared-k8s/hosts.ini \
ansible/inventories/shared-k8s/group_vars/prod_workers.yml \
ansible/playbooks/shared-k8s/07-join-prod-workers.yml \
ansible/playbooks/shared-k8s/08-label-and-taint-prod-workers.yml \
.github/workflows/ansible-configure-prod-workers.yml
git commit -m "Configure shared Kubernetes production workers without changing development or QA workers"
git push -u origin feature/configure-k8s-prod-workers
22. Create the Ansible pull request into dev
gh pr create \
--base dev \
--head feature/configure-k8s-prod-workers \
--title "Configure shared Kubernetes production workers" \
--body "Adds the Ubuntu baseline, Kubernetes prerequisites, worker joins, and approved production labels and taints while preserving development and QA workers."
Merge only after cumulative inventory validation, target-list validation, and all production playbook syntax checks succeed.
23. Promote the Ansible change from dev to prod
gh pr create \
--base prod \
--head dev \
--title "Configure shared Kubernetes production workers" \
--body "Promotes the validated four production worker configuration to prod."
After merge and environment approval, the workflow runs in this order:
- Confirm all completed development and QA workers remain Ready.
- Verify the existing Kubernetes API.
- Apply the Ubuntu baseline only to production workers.
- Configure containerd and Kubernetes prerequisites only on production workers.
- Generate a temporary production worker join command.
- Join the four production workers serially.
- Wait for every production worker to become Ready.
- Apply only production labels and the production taint.
- Display the production, QA, and development worker sets.
24. Manual verification
Run from prod-terraform-deploy-02:
ssh \
-i ~/.ssh/id_ed25519_ansible \
-o IdentitiesOnly=yes \
acllc@192.168.8.202 \
'sudo bash -s' <<'REMOTE_SCRIPT'
# The quoted heredoc delimiter keeps ${node} and the JSONPath quotes intact
# until the remote root shell runs this script.
export KUBECONFIG=/etc/kubernetes/admin.conf
echo "=== PRODUCTION WORKERS ==="
kubectl get nodes \
-l environment=prod,workload=github-runner \
-L environment,workload \
-o wide
echo "=== PRODUCTION TAINTS ==="
for node in \
cicd-ac-k8s-prod-wk-01 \
cicd-ac-k8s-prod-wk-02 \
cicd-ac-k8s-prod-wk-03 \
cicd-ac-k8s-prod-wk-04
do
echo "--- ${node} ---"
kubectl get node "${node}" \
-o jsonpath='{range .spec.taints[*]}{.key}={.value}:{.effect}{"\n"}{end}'
done
echo "=== QA WORKERS PRESERVED ==="
kubectl get nodes \
-l environment=qa,workload=github-runner \
-L environment,workload \
-o wide
echo "=== DEVELOPMENT WORKERS PRESERVED ==="
kubectl get nodes \
-l environment=dev,workload=github-runner \
-L environment,workload \
-o wide
echo "=== ALL NODES ==="
kubectl get nodes -o wide
echo "=== API READINESS ==="
kubectl get --raw=/readyz
REMOTE_SCRIPT
Verify:
- Four production workers are Ready.
- Production labels and taints are correct.
- Four QA workers remain present and Ready.
- Four development workers remain present and Ready.
- Development and QA labels and taints remain unchanged.
- The cluster contains 15 Kubernetes nodes.
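The node-count item can be checked mechanically instead of by eye. This is a minimal sketch under stated assumptions: count_ready_nodes is a hypothetical helper, and it assumes the default column layout of kubectl get nodes --no-headers, where the second column is STATUS.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `kubectl get nodes --no-headers` output on stdin
# and print how many nodes have a STATUS of exactly "Ready".
# Cordoned nodes ("Ready,SchedulingDisabled") are deliberately not counted.
count_ready_nodes() {
  awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Usage on cicd-ac-k8s-cp-01:
#   ready=$(sudo kubectl --kubeconfig=/etc/kubernetes/admin.conf \
#     get nodes --no-headers | count_ready_nodes)
#   [ "$ready" -eq 15 ] || echo "expected 15 Ready nodes, found ${ready}" >&2
```

The count confirms only that 15 nodes are Ready; the labels and taints still need the per-environment checks from the verification command.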
25. Expected final state
EXPECTED REBUILD ACCEPTANCE CHECKPOINT
Production worker VMs:
cicd-ac-k8s-prod-wk-01 192.168.8.205 Ready
cicd-ac-k8s-prod-wk-02 192.168.8.206 Ready
cicd-ac-k8s-prod-wk-03 192.168.8.207 Ready
cicd-ac-k8s-prod-wk-04 192.168.8.208 Ready
Kubernetes labels on every production worker:
environment=prod
workload=github-runner
Kubernetes taint on every production worker:
environment=prod:NoSchedule
Cluster checkpoint after this page:
3 control planes Ready
4 development workers Ready and unchanged
4 QA workers Ready and unchanged
4 production workers Ready
15 Kubernetes nodes total
Kubernetes API /readyz returns ok
Still implemented by later pages:
Shared ARC controller
Repository and environment runner scale sets
This is a rebuild acceptance checkpoint, not a statement about the current live environment.
26. Failure handling
Terraform proposes changes to existing infrastructure
Stop. Do not apply. The plan must be exactly four additions and no changes or deletions.
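The "exactly four additions" rule can be enforced as a gate before anyone approves an apply. The sketch below is illustrative only: plan_is_four_additions_only is a hypothetical helper, and it assumes the plan was rendered with -no-color and ends with Terraform's usual one-line summary. A summary in any other shape, including one that reports imports, is rejected.

```shell
#!/usr/bin/env bash
# Hypothetical gate: succeed only if the plan text on stdin contains the
# exact summary line for four additions, zero changes, zero deletions.
# Assumes uncolored output (terraform plan -no-color).
plan_is_four_additions_only() {
  grep -qxE 'Plan: 4 to add, 0 to change, 0 to destroy\.'
}

# Usage:
#   terraform plan -no-color | tee plan.txt
#   plan_is_four_additions_only < plan.txt || echo "Stop. Do not apply." >&2
```

Passing this gate does not replace reading the plan; it only catches a summary that differs from the expected counts.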
A worker receives the wrong DHCP address
Check its Proxmox MAC and router reservation. Do not configure a static Netplan address.
An older workflow targets the new production workers unexpectedly
Confirm every generic playbook invocation is limited to the workflow's intended host group:
Control planes: --limit control_planes
Development: --limit dev_workers
QA: --limit qa_workers
Production: --limit prod_workers
Do not remove the --limit arguments from environment-specific workflows.
APT reports a package is unavailable
The roles already force an APT refresh. Check DNS, internet access, Ubuntu repository files, and pkgs.k8s.io. Do not rebuild the template merely to preload packages.
Containerd CRI validation fails
Run on the affected worker:
sudo systemctl status containerd --no-pager
sudo ctr plugins ls
sudo grep -nE 'disabled_plugins|SystemdCgroup' /etc/containerd/config.toml
sudo crictl --runtime-endpoint unix:///run/containerd/containerd.sock info
The containerd role must continue using the corrected type-and-ID column checks for both legacy and containerd 2.x plugin layouts.
A worker cannot join
Inspect:
sudo journalctl -u kubelet -n 200 --no-pager
sudo crictl ps -a
sudo test -f /etc/kubernetes/kubelet.conf && echo joined || echo not-joined
Rerun the approved production-worker workflow. It generates a fresh token automatically. Do not paste join commands into Git.
A worker remains NotReady
From cicd-ac-k8s-cp-01, inspect:
sudo kubectl --kubeconfig=/etc/kubernetes/admin.conf get nodes -o wide
sudo kubectl --kubeconfig=/etc/kubernetes/admin.conf get pods -A -o wide
sudo kubectl --kubeconfig=/etc/kubernetes/admin.conf describe node <WORKER_NAME>
Also verify Calico and kubelet logs on the affected worker.
ARC pods are pending after the controller is installed later
The production runner values must include a matching selector and toleration:
nodeSelector:
environment: prod
workload: github-runner
tolerations:
- key: environment
operator: Equal
value: prod
effect: NoSchedule
27. Project status after successful completion
Use this as the expected acceptance checkpoint for a fresh rebuild. It does not assert that any previously existing Kubernetes VM or cluster is still present.
FROM-SCRATCH STATUS AFTER THIS PAGE
Expected prerequisites retained
Load balancer, API VIP, and three control planes healthy
Development workers at 192.168.8.213-.216 Ready
QA workers at 192.168.8.209-.212 Ready
Expected production result
Production Terraform definitions added
Production VMs at 192.168.8.205-.208 provisioned
Production Ubuntu baseline applied
Production Kubernetes prerequisites configured
Production workers joined and Ready
Production labels and NoSchedule taints verified
Development and QA files and nodes preserved
Full 15-node Kubernetes cluster verified
Next page
Shared ARC controller
Later pages
GitHub organization onboarding
Repository and environment runner scale sets
After this rebuild checkpoint passes, continue to the shared ARC controller page and then the organization and repository runner-scale-set pages.