
Configure Repository ARC Runner Scale Set

1. 🧭 Scope Legend

The scale set is repository-specific and environment-specific, while the cluster, controller, and generic Ansible role remain shared.

1. COMMON · NO CHANGE NEEDED: Configure once and reuse across every GitHub organization, repository, and environment.
2. CHANGE PER GITHUB ORG: Repeat for each GitHub organization or separately isolated product.
3. CHANGE PER REPOSITORY: Repeat for each private repository that receives an isolated runner scale set.
4. CHANGE PER DEPLOYMENT BRANCH: Repeat for each dev, qa, or prod branch and worker-pool mapping.

2. 🎯 Page Purpose and Isolation Boundary

COMMON · NO CHANGE NEEDED | CHANGE PER GITHUB ORG | CHANGE PER REPOSITORY | CHANGE PER DEPLOYMENT BRANCH
One repository and one deployment environment form one isolated ARC runner scale set. Repeat the repository and branch sections for every additional combination.
THIS PAGE CREATES
  One environment-specific Kubernetes namespace
  One GitHub App Kubernetes Secret in that namespace
  One repository and environment ARC runner scale set
  One ARC listener resource and listener pod in the shared controller namespace
  One repository branch-to-scale-set mapping
  Ephemeral runner pods only when jobs are queued

THIS PAGE REUSES
  The shared ARC controller
  The GitHub App and protected credential environment from page 9
  The completed dev, QA, and production worker pools
  The generic exact-409 / Pod-UID-replacement recovery and one-retry logic

THIS PAGE DOES NOT
  Install another ARC controller
  Commit any private key or Kubernetes Secret manifest
  Create permanent runner VMs
  Share runner pods across environments
  Automatically grant unrelated repositories access

COMMON · NO CHANGE NEEDED
  Shared Kubernetes cluster and API VIP
  Shared ARC controller in arc-systems
  ARC custom resource definitions
  First-control-plane administration path
  Generic modular ARC runner-scale-set Ansible role
  Server-side rendered-manifest validation
  Exact 409 and rapid same-name listener replacement detection
  Scoped cleanup, 120-second lease wait, and one automatic retry
  Same-Pod-UID listener stability and diagnostic-log verification

CHANGE PER GITHUB ORG / PRODUCT
  Tenant short form
  GitHub organization
  Protected organization credential environment
  GitHub App identifiers and private-key secret
  Kubernetes GitHub App secret name
  Harbor project naming

CHANGE PER REPOSITORY
  GitHub repository name and URL
  Runner scale-set name
  Helm release name
  Repository runner group
  Repository-specific values and workflow

CHANGE PER DEPLOYMENT BRANCH
  dev, qa, or prod source branch
  Kubernetes runner namespace
  Worker node selector and taint toleration
  Minimum and maximum runner capacity
  GitHub repository Environment
  runs-on value used by application workflows
Branch-mapping rule: ARC does not inspect the Git branch before assigning a job. Isolation comes from the target repository workflow explicitly using the environment-specific runs-on value generated on this page.
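Because ARC routes purely on the runs-on value, a pre-merge check in the target repository can confirm that a workflow file names the expected scale set. The following is a minimal sketch, not part of this page's tooling: the function name check_runs_on, the sample workflow, and the billing-api repository name are hypothetical, and the check assumes the simple scalar form of runs-on (no arrays, expressions, or trailing comments).

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: succeed only when every runs-on line in a workflow
# file names the expected runner scale set.
check_runs_on() {
  local workflow_file="$1" expected_label="$2"
  local found=0 line value

  while IFS= read -r line; do
    found=1
    # Drop the key, then remove whitespace and optional quotes.
    value="$(printf '%s' "${line#*runs-on:}" | tr -d '[:space:]"'"'")"
    if [ "${value}" != "${expected_label}" ]; then
      echo "unexpected runs-on value: ${value}" >&2
      return 1
    fi
  done < <(grep -E '^[[:space:]]*runs-on:' "${workflow_file}" || true)

  if [ "${found}" -eq 0 ]; then
    echo "no runs-on key found in ${workflow_file}" >&2
    return 1
  fi
  echo "all jobs target ${expected_label}"
}

# Illustrative workflow with a placeholder repository name.
workflow="$(mktemp)"
cat > "${workflow}" <<'YAML'
on:
  push:
    branches: [dev]
jobs:
  build:
    runs-on: billing-api-dev-arc
    steps:
      - run: echo "building"
YAML

check_runs_on "${workflow}" billing-api-dev-arc
rm -f "${workflow}"
```

A workflow that uses the array form of runs-on would fail this check, which is the conservative outcome for an isolation guard.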

3. 🧾 Required Inputs

COMMON · NO CHANGE NEEDED | CHANGE PER GITHUB ORG | CHANGE PER REPOSITORY | CHANGE PER DEPLOYMENT BRANCH
Common values are prefilled. Organization, repository, and environment values determine every file path, namespace, release, and runs-on label.

Common cluster inputs

GitHub organization or product inputs

Repository inputs

Deployment-branch inputs

4. 🧮 Derived Names and Routing Values

CHANGE PER GITHUB ORG | CHANGE PER REPOSITORY | CHANGE PER DEPLOYMENT BRANCH
These generated values must remain identical across Kubernetes, Helm, Ansible, GitHub Actions, and the target repository workflow.
GitHub repository URL:
https://github.com/<<GITHUB_ORGANIZATION>>/<<REPOSITORY_NAME>>
Organization credential environment:
arc-org-<<APP_SHORT_FORM>>
Runner namespace:
arc-runners-<<APP_SHORT_FORM>>-dev
GitHub App Kubernetes Secret:
<<APP_SHORT_FORM>>-arc-ghapp-secret
Helm release:
<<REPOSITORY_NAME>>-dev-arc
Runner scale-set name / runs-on:
<<REPOSITORY_NAME>>-dev-arc
Source branch:
dev
GitHub repository Environment:
dev
Runner worker selector:
environment=dev, workload=github-runner
Runner taint toleration:
environment=dev:NoSchedule
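The derivation rules listed above can be sketched as one shell function so that every name comes from the same four inputs. This is an illustration only: derive_arc_names is a hypothetical helper, the acme, acme-org, and billing-api values are placeholders, and the sketch assumes that qa and prod follow the same naming pattern shown here for dev.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not part of the repository): prints the derived names
# for one organization, repository, and environment combination.
# Usage: derive_arc_names <app_short_form> <github_org> <repository> <environment>
derive_arc_names() {
  local app="$1" org="$2" repo="$3" environment="$4"

  # Assumption: qa and prod reuse the pattern this page shows for dev.
  case "${environment}" in
    dev|qa|prod) ;;
    *) echo "unsupported environment: ${environment}" >&2; return 1 ;;
  esac

  printf 'repository_url=https://github.com/%s/%s\n' "${org}" "${repo}"
  printf 'credential_environment=arc-org-%s\n' "${app}"
  printf 'runner_namespace=arc-runners-%s-%s\n' "${app}" "${environment}"
  printf 'github_app_secret=%s-arc-ghapp-secret\n' "${app}"
  printf 'helm_release=%s-%s-arc\n' "${repo}" "${environment}"
  printf 'runs_on=%s-%s-arc\n' "${repo}" "${environment}"
  printf 'source_branch=%s\n' "${environment}"
  printf 'github_environment=%s\n' "${environment}"
  printf 'node_selector=environment=%s,workload=github-runner\n' "${environment}"
  printf 'toleration=environment=%s:NoSchedule\n' "${environment}"
}

# Placeholder inputs for illustration only.
derive_arc_names acme acme-org billing-api dev
```

Generating the names in one place makes it harder for the Helm release, the scale-set name, and the runs-on label to drift apart.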

5. 📊 From-Scratch Sequence Checkpoint

COMMON · NO CHANGE NEEDED | CHANGE PER GITHUB ORG | CHANGE PER REPOSITORY | CHANGE PER DEPLOYMENT BRANCH
Confirm the prerequisite cluster, controller, and organization identity created by the earlier pages. This page then recreates and verifies one repository and environment runner scale set without relying on retained Kubernetes resources.
FROM-SCRATCH SEQUENCE CHECKPOINT

Required before this page
  Load balancer, API VIP, and three control planes operational
  Development, QA, and production worker pools Ready
  15 Kubernetes nodes Ready
  Shared ARC controller deployed as arc in arc-systems
  GitHub organization App and protected credential environment verified

Implemented by this page
  One environment-specific runner namespace
  One runtime-created GitHub App Kubernetes Secret
  One repository and environment runner scale set
  One repository branch-to-runs-on mapping
  ARC listener verification using an unchanged Kubernetes Pod UID
  Ephemeral Docker-in-Docker runner validation

Source consistency
  Generic ARC role files are created once and reused unchanged
  Repository/environment files must match the cleaned infrastructure repository
  dev validates only; prod reconciles Kubernetes and Helm resources

Not retained from an earlier installation
  Namespace, Secret, Helm release, AutoscalingRunnerSet, listener, and runner pods
  must all be recreated and verified during the clean rebuild.

6. 🔬 Verify the Shared Controller and Environment Worker Pool

COMMON · NO CHANGE NEEDED | CHANGE PER DEPLOYMENT BRANCH
The shared controller must be installed and healthy, and the selected environment must have four Ready workers labeled for GitHub runner workloads before the scale set is created.
ssh \
  -i ~/.ssh/id_ed25519_ansible \
  -o IdentitiesOnly=yes \
  acllc@192.168.8.202 \
  'sudo bash -s' <<'REMOTE'
set -euo pipefail
export KUBECONFIG=/etc/kubernetes/admin.conf
HELM=/usr/local/bin/helm

echo "=== SHARED ARC RELEASE ==="
"${HELM}" status arc -n arc-systems

echo
echo "=== CONTROLLER DEPLOYMENT ==="
kubectl get deployment,pods -n arc-systems -o wide

echo
echo "=== ARC CRDS ==="
kubectl get crd | grep actions.github.com

echo
echo "=== CONTROLLER SERVICE ACCOUNT ==="
kubectl get serviceaccount arc-gha-rs-controller -n arc-systems

echo
echo "=== ENVIRONMENT WORKER POOL ==="
kubectl get nodes -l environment=dev,workload=github-runner -o wide
REMOTE
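The command above lists the worker pool but leaves the four-worker count to visual inspection. A small filter over the kubectl get nodes --no-headers output can make that check explicit. This is a sketch under stated assumptions: count_ready_nodes and require_ready_workers are hypothetical helpers, and the node names, ages, and versions in the sample are invented placeholders rather than output from this cluster.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: reads `kubectl get nodes --no-headers` output on stdin
# and prints how many nodes report exactly "Ready" in the STATUS column.
# Cordoned nodes ("Ready,SchedulingDisabled") are not counted.
count_ready_nodes() {
  awk '$2 == "Ready" { ready++ } END { print ready + 0 }'
}

# Hypothetical helper: fail unless at least the expected number of workers
# is Ready.
require_ready_workers() {
  local expected="$1" actual
  actual="$(count_ready_nodes)"
  if [ "${actual}" -lt "${expected}" ]; then
    echo "expected at least ${expected} Ready workers, found ${actual}" >&2
    return 1
  fi
  echo "worker pool OK: ${actual} Ready"
}

# Invented sample standing in for:
#   kubectl get nodes -l environment=dev,workload=github-runner --no-headers
sample_nodes='dev-worker-01   Ready   <none>   10d   v1.30.0
dev-worker-02   Ready   <none>   10d   v1.30.0
dev-worker-03   Ready   <none>   10d   v1.30.0
dev-worker-04   Ready   <none>   10d   v1.30.0'

printf '%s\n' "${sample_nodes}" | require_ready_workers 4
```

On the first control plane, the same filter could be fed from the real kubectl command in place of the sample text.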

7. 🌿 Create the Feature Branch

CHANGE PER REPOSITORY | CHANGE PER DEPLOYMENT BRANCH
Use dev for non-mutating validation and prod for namespace, secret, and Helm reconciliation.
feature/configure-<<REPOSITORY_NAME>>-dev-arc
    ↓ merge or push
   dev
    ↓ validate YAML, Ansible targeting, and secret hygiene only
 dev → prod promotion
    ↓ the merge creates a prod push
   prod
    ↓ create namespace, reconcile GitHub App secret, install scale set, verify listener

There is no pull_request workflow trigger.
Pull requests may still be used for review.

The target application repository uses its own branch mapping:
  dev → runs-on: <<REPOSITORY_NAME>>-dev-arc → dev workers

cd D:\code\ASPIRECLAN-LLC-Org\ac-cicd-infra

git switch dev
git pull --ff-only origin dev

git switch -c feature/configure-<<REPOSITORY_NAME>>-dev-arc

8. 📁 Create the Environment Runner Namespace Manifest

CHANGE PER GITHUB ORG | CHANGE PER DEPLOYMENT BRANCH
One namespace is reused by repositories belonging to the same product and environment. The GitHub App secret must exist in the same namespace as each scale-set installation. The shared namespace path is intentionally excluded from repository-specific workflow triggers to prevent fan-out.

Create kubernetes/tenants/<<APP_SHORT_FORM>>/dev/namespace/namespace.yaml:

apiVersion: v1
kind: Namespace
metadata:
  name: arc-runners-<<APP_SHORT_FORM>>-dev
  labels:
    app.kubernetes.io/part-of: arc-runners
    aspireclan.com/tenant: <<APP_SHORT_FORM>>
    aspireclan.com/environment: dev

9. 🗂️ Create the Repository and Branch Definition

CHANGE PER GITHUB ORG | CHANGE PER REPOSITORY | CHANGE PER DEPLOYMENT BRANCH
This non-secret metadata file records the exact repository ownership, scale-set identity, worker placement, capacity, and branch routing.

Create kubernetes/tenants/<<APP_SHORT_FORM>>/dev/runner-scale-sets/<<REPOSITORY_NAME>>.yaml:

schemaVersion: 1

tenant:
  displayName: "<<TENANT_OR_PRODUCT_NAME>>"
  shortForm: "<<APP_SHORT_FORM>>"

github:
  organization: "<<GITHUB_ORGANIZATION>>"
  repository: "<<REPOSITORY_NAME>>"
  repositoryUrl: "https://github.com/<<GITHUB_ORGANIZATION>>/<<REPOSITORY_NAME>>"
  credentialEnvironment: "arc-org-<<APP_SHORT_FORM>>"
  githubAppSecretName: "<<APP_SHORT_FORM>>-arc-ghapp-secret"

runnerScaleSet:
  name: "<<REPOSITORY_NAME>>-dev-arc"
  helmRelease: "<<REPOSITORY_NAME>>-dev-arc"
  namespace: "arc-runners-<<APP_SHORT_FORM>>-dev"
  chartVersion: "0.14.2"
  runnerGroup: "Default"
  containerMode: "dind"
  minRunners: 0
  maxRunners: 4

branchMapping:
  sourceBranch: "dev"
  githubEnvironment: "dev"
  runsOn: "<<REPOSITORY_NAME>>-dev-arc"
  kubernetesEnvironment: "dev"

placement:
  nodeSelector:
    environment: "dev"
    workload: github-runner
  toleration:
    key: environment
    operator: Equal
    value: "dev"
    effect: NoSchedule

security:
  privateKeyCommittedToGit: false
  kubernetesSecretManifestCommitted: false
  githubAppSecretCreatedAtRuntime: true
  runnerPodsAreEphemeral: true
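In the definition above, runnerScaleSet.name, runnerScaleSet.helmRelease, and branchMapping.runsOn carry the same value, and routing depends on them staying identical. A line-oriented check can catch accidental drift before the file is committed. This is only a sketch: yaml_scalar and check_scale_set_identity are hypothetical helpers, the billing-api name is a placeholder, and the approach assumes each key appears once per file in the simple key: "value" form (it is not a YAML parser).

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: prints the first scalar found for a key, unquoted.
# Line-oriented sketch only; it does not understand YAML nesting.
yaml_scalar() {
  local key="$1" file="$2"
  sed -n -E "/^[[:space:]]*${key}:/ {
    s/^[[:space:]]*${key}:[[:space:]]*//
    s/^\"//
    s/\"[[:space:]]*\$//
    p
    q
  }" "${file}"
}

# Hypothetical check: name, helmRelease, and runsOn must all be equal.
check_scale_set_identity() {
  local file="$1" name release runs_on
  name="$(yaml_scalar name "${file}")"
  release="$(yaml_scalar helmRelease "${file}")"
  runs_on="$(yaml_scalar runsOn "${file}")"

  if [ -z "${name}" ] ||
    [ "${name}" != "${release}" ] ||
    [ "${name}" != "${runs_on}" ]
  then
    echo "mismatch: name=${name} helmRelease=${release} runsOn=${runs_on}" >&2
    return 1
  fi
  echo "consistent scale-set identity: ${name}"
}

# Illustrative definition fragment with a placeholder repository name.
definition="$(mktemp)"
cat > "${definition}" <<'YAML'
runnerScaleSet:
  name: "billing-api-dev-arc"
  helmRelease: "billing-api-dev-arc"
branchMapping:
  runsOn: "billing-api-dev-arc"
YAML

check_scale_set_identity "${definition}"
rm -f "${definition}"
```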

10. 🎛️ Create the Helm Values File

CHANGE PER GITHUB ORG | CHANGE PER REPOSITORY | CHANGE PER DEPLOYMENT BRANCH
The chart registers the scale set to the repository, references the runtime-created GitHub App secret, declares the required listener container, enables the selected container mode, and schedules runner pods only on the selected environment workers.

Create helm/tenants/<<APP_SHORT_FORM>>/dev/<<REPOSITORY_NAME>>-values.yaml:

githubConfigUrl: "https://github.com/<<GITHUB_ORGANIZATION>>/<<REPOSITORY_NAME>>"
githubConfigSecret: "<<APP_SHORT_FORM>>-arc-ghapp-secret"

runnerGroup: "Default"
runnerScaleSetName: "<<REPOSITORY_NAME>>-dev-arc"

minRunners: 0
maxRunners: 4

containerMode:
  type: "dind"

controllerServiceAccount:
  namespace: "arc-systems"
  name: "arc-gha-rs-controller"

listenerTemplate:
  spec:
    containers:
      - name: listener

    nodeSelector:
      node-role.kubernetes.io/control-plane: ""

    tolerations:
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule

template:
  spec:
    nodeSelector:
      environment: "dev"
      workload: "github-runner"

    tolerations:
      - key: environment
        operator: Equal
        value: "dev"
        effect: NoSchedule

    topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            actions.github.com/scale-set-name: "<<REPOSITORY_NAME>>-dev-arc"
Required ARC listener container: when listenerTemplate is customized, listenerTemplate.spec.containers must contain a container named listener. Without it, Kubernetes rejects the AutoscalingRunnerSet with spec.listenerTemplate.spec.containers: Required value.

11. 🧰 Create the Generic Runner Scale-Set Ansible Role

COMMON · NO CHANGE NEEDED
Create this modular role once and reuse it unchanged. After the first Helm attempt, the role immediately probes the listener by Kubernetes Pod UID.
Listener-session recovery: an installation attempt may create a listener, briefly report Ready, and then exit while a previous GitHub message-session lease is still active. The exact 409 message can disappear before logs are collected because ARC recreates the listener under the same name. The role therefore detects either the explicit conflict text or a same-name listener Pod UID replacement, removes only the affected scale-set resources, waits 120 seconds, retries exactly once, and then requires the replacement listener's Pod UID to remain unchanged and Ready for the stability interval.

Create ansible/roles/arc-runner-scale-set/defaults/main.yml:

---
arc_scale_set_admin_hostname: "cicd-ac-k8s-cp-01"
arc_scale_set_admin_ip: "192.168.8.202"

arc_scale_set_kubeconfig: "/etc/kubernetes/admin.conf"
arc_scale_set_helm_binary: "/usr/local/bin/helm"
arc_scale_set_chart: >-
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set

arc_scale_set_controller_namespace: "arc-systems"
arc_scale_set_namespace: ""
arc_scale_set_release_name: ""
arc_scale_set_name: ""
arc_scale_set_chart_version: ""

arc_scale_set_values_source: ""
arc_scale_set_namespace_manifest_source: ""

arc_scale_set_github_secret_name: ""
arc_scale_set_github_app_id: "{{ lookup('env', 'ARC_GITHUB_APP_ID') }}"
arc_scale_set_github_app_installation_id: >-
  {{ lookup('env', 'ARC_GITHUB_APP_INSTALLATION_ID') }}
arc_scale_set_github_app_private_key_source: >-
  {{ lookup('env', 'ARC_GITHUB_APP_PRIVATE_KEY_FILE') }}

arc_scale_set_work_root: "/etc/ac-cicd-infra/arc-runner-scale-sets"

# Recovery is narrowly scoped to the current runner scale set.
arc_scale_set_enable_session_conflict_recovery: true
arc_scale_set_session_release_wait_seconds: 120

# Discover a Ready listener without relying on a long kubectl wait against a
# name that ARC may delete and recreate with a different Kubernetes Pod UID.
arc_scale_set_listener_discovery_attempts: 60
arc_scale_set_listener_poll_seconds: 2

# The same listener Pod UID must remain Running and Ready for this interval.
arc_scale_set_listener_stability_seconds: 30
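The polling defaults above translate into concrete waiting budgets. The arithmetic below is a rough sketch that counts only the sleep intervals; it assumes the default values, and real runs take longer because each kubectl call and the Helm operation itself (up to its own timeout) add time that is not modeled here.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Default values copied from defaults/main.yml.
discovery_attempts=60
poll_seconds=2
stability_seconds=30
session_release_wait_seconds=120

# Discovery sleeps once per unsuccessful attempt.
max_discovery_sleep=$(( discovery_attempts * poll_seconds ))

# The stability loop polls until the elapsed counter reaches the interval.
stability_polls=$(( (stability_seconds + poll_seconds - 1) / poll_seconds ))

# Upper bound on sleep time for one install-and-probe attempt.
per_attempt_sleep=$(( max_discovery_sleep + stability_polls * poll_seconds ))

# First attempt, lease wait, then exactly one retry.
with_one_retry=$(( per_attempt_sleep * 2 + session_release_wait_seconds ))

echo "max discovery sleep:      ${max_discovery_sleep}s"
echo "stability polls:          ${stability_polls}"
echo "per-attempt sleep bound:  ${per_attempt_sleep}s"
echo "bound with one recovery:  ${with_one_retry}s"
```

With the defaults this gives 120 s of discovery sleep, 15 stability polls, 150 s per attempt, and 420 s across a run that needs the single recovery.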

Create ansible/roles/arc-runner-scale-set/tasks/main.yml:

---
- name: Confirm the runner scale set is managed from the first control plane
  ansible.builtin.assert:
    that:
      - inventory_hostname == groups['first_control_plane'][0]
      - inventory_hostname == arc_scale_set_admin_hostname
      - ansible_facts["default_ipv4"]["address"] == arc_scale_set_admin_ip
      - arc_scale_set_controller_namespace | length > 0
      - arc_scale_set_namespace | length > 0
      - arc_scale_set_release_name | length > 0
      - arc_scale_set_name | length > 0
      - arc_scale_set_chart_version | length > 0
      - arc_scale_set_values_source | length > 0
      - arc_scale_set_namespace_manifest_source | length > 0
      - arc_scale_set_github_secret_name | length > 0
      - arc_scale_set_github_app_id | length > 0
      - arc_scale_set_github_app_installation_id | length > 0
      - arc_scale_set_github_app_private_key_source | length > 0
      - (arc_scale_set_listener_discovery_attempts | int) > 0
      - (arc_scale_set_listener_poll_seconds | int) > 0
      - (arc_scale_set_listener_stability_seconds | int) > 0
    fail_msg: >-
      Runner scale-set variables or GitHub App credentials are missing.

- name: Confirm local runner scale-set source files exist
  ansible.builtin.stat:
    path: "{{ item }}"
  delegate_to: localhost
  become: false
  loop:
    - "{{ arc_scale_set_values_source }}"
    - "{{ arc_scale_set_namespace_manifest_source }}"
    - "{{ arc_scale_set_github_app_private_key_source }}"
  register: arc_scale_set_source_files

- name: Assert every local runner scale-set source file exists
  ansible.builtin.assert:
    that:
      - item.stat.exists
      - item.stat.isreg
  loop: "{{ arc_scale_set_source_files.results }}"
  loop_control:
    label: "{{ item.stat.path | default('unknown') }}"

- name: Create the remote runner scale-set working directory
  ansible.builtin.file:
    path: "{{ arc_scale_set_work_root }}/{{ arc_scale_set_release_name }}"
    state: directory
    owner: root
    group: root
    mode: "0700"

- name: Copy the namespace manifest
  ansible.builtin.copy:
    src: "{{ arc_scale_set_namespace_manifest_source }}"
    dest: >-
      {{ arc_scale_set_work_root }}/{{ arc_scale_set_release_name }}/namespace.yaml
    owner: root
    group: root
    mode: "0644"

- name: Copy the Git-managed runner scale-set values
  ansible.builtin.copy:
    src: "{{ arc_scale_set_values_source }}"
    dest: >-
      {{ arc_scale_set_work_root }}/{{ arc_scale_set_release_name }}/values.yaml
    owner: root
    group: root
    mode: "0644"

- name: Copy the temporary GitHub App private key
  ansible.builtin.copy:
    src: "{{ arc_scale_set_github_app_private_key_source }}"
    dest: >-
      {{ arc_scale_set_work_root }}/{{ arc_scale_set_release_name }}/github-app-private-key.pem
    owner: root
    group: root
    mode: "0600"
  no_log: true

- name: Reconcile the repository runner scale set
  block:
    - name: Apply the environment runner namespace
      ansible.builtin.command:
        argv:
          - kubectl
          - "--kubeconfig={{ arc_scale_set_kubeconfig }}"
          - apply
          - "--filename={{ arc_scale_set_work_root }}/{{ arc_scale_set_release_name }}/namespace.yaml"
      register: arc_scale_set_namespace_apply
      changed_when: >-
        'created' in arc_scale_set_namespace_apply.stdout or
        'configured' in arc_scale_set_namespace_apply.stdout

    - name: Reconcile the GitHub App Kubernetes Secret
      ansible.builtin.shell:
        executable: /bin/bash
        cmd: |
          set -euo pipefail

          kubectl \
            --kubeconfig={{ arc_scale_set_kubeconfig }} \
            create secret generic {{ arc_scale_set_github_secret_name }} \
            --namespace={{ arc_scale_set_namespace }} \
            --from-literal=github_app_id='{{ arc_scale_set_github_app_id }}' \
            --from-literal=github_app_installation_id='{{ arc_scale_set_github_app_installation_id }}' \
            --from-file=github_app_private_key={{ arc_scale_set_work_root }}/{{ arc_scale_set_release_name }}/github-app-private-key.pem \
            --dry-run=client \
            --output=yaml |
          kubectl \
            --kubeconfig={{ arc_scale_set_kubeconfig }} \
            apply \
            --filename=-
      register: arc_scale_set_secret_apply
      changed_when: >-
        'created' in arc_scale_set_secret_apply.stdout or
        'configured' in arc_scale_set_secret_apply.stdout
      no_log: true

    - name: Render the pinned runner scale-set chart
      ansible.builtin.shell:
        executable: /bin/bash
        cmd: |
          set -euo pipefail

          {{ arc_scale_set_helm_binary }} template \
            {{ arc_scale_set_release_name }} \
            {{ arc_scale_set_chart }} \
            --namespace {{ arc_scale_set_namespace }} \
            --version {{ arc_scale_set_chart_version }} \
            --values {{ arc_scale_set_work_root }}/{{ arc_scale_set_release_name }}/values.yaml \
            --kubeconfig {{ arc_scale_set_kubeconfig }} \
            > {{ arc_scale_set_work_root }}/{{ arc_scale_set_release_name }}/rendered.yaml
      changed_when: false

    - name: Validate rendered runner scale-set resources with the Kubernetes API
      ansible.builtin.command:
        argv:
          - kubectl
          - "--kubeconfig={{ arc_scale_set_kubeconfig }}"
          - apply
          - --dry-run=server
          - "--filename={{ arc_scale_set_work_root }}/{{ arc_scale_set_release_name }}/rendered.yaml"
      changed_when: false

    - name: Perform the first installation and listener probe
      ansible.builtin.include_tasks: install-and-probe.yml

    - name: Reject a non-recoverable first-attempt failure
      ansible.builtin.assert:
        that:
          - >-
            (arc_scale_set_attempt_requires_recovery | bool) or
            (
              (arc_scale_set_attempt_helm_rc | int) == 0 and
              (arc_scale_set_attempt_listener_rc | int) == 0
            )
        fail_msg: >-
          The ARC installation or listener failed without an exact 409 conflict
          or the rapid same-name listener replacement pattern that permits one
          scoped recovery attempt. Helm rc={{ arc_scale_set_attempt_helm_rc }},
          listener rc={{ arc_scale_set_attempt_listener_rc }}.
          Helm stderr={{ arc_scale_set_helm_attempt.stderr | default('') }}
          Listener output={{ arc_scale_set_listener_probe.stdout | default('') }}

    - name: Recover an active or suspected stale listener session
      ansible.builtin.include_tasks: recover-session-conflict.yml
      when:
        - arc_scale_set_enable_session_conflict_recovery | bool
        - arc_scale_set_attempt_requires_recovery | bool

    - name: Perform one clean installation retry after listener-session recovery
      ansible.builtin.include_tasks: install-and-probe.yml
      when:
        - arc_scale_set_enable_session_conflict_recovery | bool
        - arc_scale_set_attempt_requires_recovery | bool

    - name: Confirm the final Helm installation and listener are healthy
      ansible.builtin.assert:
        that:
          - (arc_scale_set_attempt_helm_rc | int) == 0
          - (arc_scale_set_attempt_listener_rc | int) == 0
          - not (arc_scale_set_attempt_session_conflict | bool)
          - not (arc_scale_set_attempt_requires_recovery | bool)
        fail_msg: >-
          The final ARC installation did not become stable.
          Helm rc={{ arc_scale_set_attempt_helm_rc }},
          listener rc={{ arc_scale_set_attempt_listener_rc }},
          exact session conflict={{ arc_scale_set_attempt_session_conflict }},
          additional recovery required={{ arc_scale_set_attempt_requires_recovery }}.
          Helm stderr={{ arc_scale_set_helm_attempt.stderr | default('') }}
          Listener output={{ arc_scale_set_listener_probe.stdout | default('') }}

    - name: Verify the AutoscalingRunnerSet exists in the runner namespace
      ansible.builtin.command:
        argv:
          - kubectl
          - "--kubeconfig={{ arc_scale_set_kubeconfig }}"
          - get
          - autoscalingrunnerset.actions.github.com
          - "{{ arc_scale_set_name }}"
          - "--namespace={{ arc_scale_set_namespace }}"
      changed_when: false

    - name: Verify the AutoscalingListener exists in the controller namespace
      ansible.builtin.shell:
        executable: /bin/bash
        cmd: |
          set -euo pipefail

          kubectl \
            --kubeconfig={{ arc_scale_set_kubeconfig }} \
            get autoscalinglisteners.actions.github.com \
            --namespace={{ arc_scale_set_controller_namespace }} \
            --output=name |
          grep -E \
            '^autoscalinglistener\.actions\.github\.com/{{ arc_scale_set_name }}-[a-z0-9]+-listener$'
      changed_when: false

    - name: Display the installed runner scale-set release
      ansible.builtin.command:
        argv:
          - "{{ arc_scale_set_helm_binary }}"
          - status
          - "{{ arc_scale_set_release_name }}"
          - --namespace
          - "{{ arc_scale_set_namespace }}"
          - --kubeconfig
          - "{{ arc_scale_set_kubeconfig }}"
      register: arc_scale_set_status
      changed_when: false

    - name: Print the runner scale-set release status
      ansible.builtin.debug:
        var: arc_scale_set_status.stdout_lines

  always:
    - name: Remove the remote temporary GitHub App private key
      ansible.builtin.file:
        path: >-
          {{ arc_scale_set_work_root }}/{{ arc_scale_set_release_name }}/github-app-private-key.pem
        state: absent
      no_log: true
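The listener verification task in this file matches resource names with the pattern <scale-set-name>-[a-z0-9]+-listener, interpolating the scale-set name directly into an extended regular expression. The sketch below shows how that pattern behaves and how the name could be escaped first, which would matter only if a scale-set name ever contained a regex metacharacter such as a dot. The helpers escape_ere and listener_name_matches, and all sample names, are hypothetical; the suffix format is taken from the pattern used on this page, not independently confirmed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: backslash-escape extended-regex metacharacters.
escape_ere() {
  printf '%s' "$1" | sed -E 's/[][\.^$*+?(){}|]/\\&/g'
}

# Hypothetical helper: succeed when the candidate is a listener name for
# exactly this scale set, using the same pattern shape as the role.
listener_name_matches() {
  local scale_set_name="$1" candidate="$2" escaped
  escaped="$(escape_ere "${scale_set_name}")"
  grep -Eq "^${escaped}-[a-z0-9]+-listener\$" <<< "${candidate}"
}

# Placeholder names for illustration only.
for candidate in \
  billing-api-dev-arc-6c8f7d9b-listener \
  billing-api-dev-arc-extra-6c8f7d9b-listener \
  billing-api-qa-arc-6c8f7d9b-listener
do
  if listener_name_matches billing-api-dev-arc "${candidate}"; then
    echo "match:    ${candidate}"
  else
    echo "no match: ${candidate}"
  fi
done
```

Because the middle segment excludes hyphens, a scale set whose name merely begins with another scale set's name is not matched, which keeps cleanup scoped to one scale set.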

Create ansible/roles/arc-runner-scale-set/tasks/install-and-probe.yml:

---
- name: Install or reconcile the repository runner scale set
  ansible.builtin.command:
    argv:
      - "{{ arc_scale_set_helm_binary }}"
      - upgrade
      - --install
      - "{{ arc_scale_set_release_name }}"
      - "{{ arc_scale_set_chart }}"
      - --namespace
      - "{{ arc_scale_set_namespace }}"
      - --version
      - "{{ arc_scale_set_chart_version }}"
      - --values
      - "{{ arc_scale_set_work_root }}/{{ arc_scale_set_release_name }}/values.yaml"
      - --kubeconfig
      - "{{ arc_scale_set_kubeconfig }}"
      - --atomic
      - --wait
      - --timeout
      - 10m
      - --history-max
      - "10"
      - --debug
  register: arc_scale_set_helm_attempt
  changed_when: >-
    'has been upgraded' in arc_scale_set_helm_attempt.stdout or
    'has been installed' in arc_scale_set_helm_attempt.stdout
  failed_when: false

- name: Probe the listener and capture exact or suspected stale sessions
  ansible.builtin.shell:
    executable: /bin/bash
    cmd: |
      set -euo pipefail

      listener_pod=""
      listener_uid=""
      last_seen_uid=""
      uid_replacements=0

      for attempt in $(seq 1 {{ arc_scale_set_listener_discovery_attempts }}); do
        listener_record="$(
          kubectl \
            --kubeconfig={{ arc_scale_set_kubeconfig }} \
            get pods \
            --namespace={{ arc_scale_set_controller_namespace }} \
            --output=jsonpath='{range .items[*]}{.metadata.name}{"|"}{.metadata.uid}{"|"}{.status.phase}{"|"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' \
            2>/dev/null |
          grep -E '^{{ arc_scale_set_name }}-[a-z0-9]+-listener\|' |
          head -n 1 || true
        )"

        if [ -n "${listener_record}" ]; then
          IFS='|' read -r candidate_pod candidate_uid candidate_phase candidate_ready \
            <<< "${listener_record}"

          if [ -n "${last_seen_uid}" ] &&
            [ -n "${candidate_uid}" ] &&
            [ "${candidate_uid}" != "${last_seen_uid}" ]
          then
            uid_replacements=$((uid_replacements + 1))
            echo "Observed listener Pod UID replacement: ${last_seen_uid} -> ${candidate_uid}"
          fi

          if [ -n "${candidate_uid}" ]; then
            last_seen_uid="${candidate_uid}"
          fi

          if [ "${candidate_phase}" = "Running" ] &&
            [ "${candidate_ready}" = "True" ]
          then
            listener_pod="${candidate_pod}"
            listener_uid="${candidate_uid}"
            break
          fi
        fi

        echo "Waiting for a Ready ARC listener (attempt ${attempt}/{{ arc_scale_set_listener_discovery_attempts }})..."
        sleep {{ arc_scale_set_listener_poll_seconds }}
      done

      if [ -z "${listener_pod}" ] || [ -z "${listener_uid}" ]; then
        echo "ERROR: A Ready ARC listener was not found."
        echo "Observed Pod UID replacements: ${uid_replacements}"

        kubectl \
          --kubeconfig={{ arc_scale_set_kubeconfig }} \
          get pods \
          --namespace={{ arc_scale_set_controller_namespace }} \
          -o wide |
        grep -F '{{ arc_scale_set_name }}' || true

        if [ "${uid_replacements}" -gt 0 ]; then
          echo "RECOVERABLE: The same logical listener was repeatedly recreated before becoming stable."
          exit 45
        fi

        exit 43
      fi

      echo "Listener became Ready: ${listener_pod}"
      echo "Listener Pod UID: ${listener_uid}"

      elapsed=0
      while [ "${elapsed}" -lt {{ arc_scale_set_listener_stability_seconds }} ]; do
        listener_record="$(
          kubectl \
            --kubeconfig={{ arc_scale_set_kubeconfig }} \
            get pod \
            "${listener_pod}" \
            --namespace={{ arc_scale_set_controller_namespace }} \
            --output=jsonpath='{.metadata.uid}{"|"}{.status.phase}{"|"}{.status.conditions[?(@.type=="Ready")].status}' \
            2>/dev/null || true
        )"

        listener_logs="$(
          kubectl \
            --kubeconfig={{ arc_scale_set_kubeconfig }} \
            logs \
            --namespace={{ arc_scale_set_controller_namespace }} \
            "${listener_pod}" \
            --container=listener \
            --tail=200 \
            2>&1 || true
        )"

        # A here-string avoids a SIGPIPE pipeline failure under pipefail
        # when grep -q exits on the first match.
        if grep -Eq \
          'RunnerScaleSetSessionConflictException|already has an active session' \
          <<< "${listener_logs}"
        then
          printf '%s\n' "${listener_logs}"
          exit 42
        fi

        if [ -z "${listener_record}" ]; then
          echo "RECOVERABLE: Listener disappeared during the stability interval."
          echo "Original Pod UID: ${listener_uid}"
          printf '%s\n' "${listener_logs}"
          exit 45
        fi

        IFS='|' read -r current_uid current_phase current_ready \
          <<< "${listener_record}"

        if [ "${current_uid}" != "${listener_uid}" ]; then
          echo "RECOVERABLE: Listener name was recreated with a different Pod UID."
          echo "Listener pod: ${listener_pod}"
          echo "Original Pod UID: ${listener_uid}"
          echo "Current Pod UID: ${current_uid:-absent}"
          printf '%s\n' "${listener_logs}"
          exit 45
        fi

        if [ "${current_phase}" != "Running" ] ||
          [ "${current_ready}" != "True" ]
        then
          echo "RECOVERABLE: Listener stopped being Running and Ready during the stability interval."
          echo "Listener pod: ${listener_pod}"
          echo "Pod UID: ${listener_uid}"
          echo "Phase: ${current_phase:-absent}"
          echo "Ready: ${current_ready:-absent}"
          printf '%s\n' "${listener_logs}"
          exit 45
        fi

        sleep {{ arc_scale_set_listener_poll_seconds }}
        elapsed=$((elapsed + {{ arc_scale_set_listener_poll_seconds }}))
      done

      echo "Listener is stable with unchanged Pod UID: ${listener_pod} (${listener_uid})"
  register: arc_scale_set_listener_probe
  changed_when: false
  failed_when: false

- name: Record the installation-attempt result
  ansible.builtin.set_fact:
    arc_scale_set_attempt_helm_rc: "{{ arc_scale_set_helm_attempt.rc | int }}"
    arc_scale_set_attempt_listener_rc: "{{ arc_scale_set_listener_probe.rc | int }}"
    arc_scale_set_attempt_session_conflict: >-
      {{
        (arc_scale_set_listener_probe.rc | int) == 42 or
        'RunnerScaleSetSessionConflictException' in
          (arc_scale_set_listener_probe.stdout | default('')) or
        'already has an active session' in
          (arc_scale_set_listener_probe.stdout | default('')) or
        'RunnerScaleSetSessionConflictException' in
          (arc_scale_set_helm_attempt.stdout | default('')) or
        'already has an active session' in
          (arc_scale_set_helm_attempt.stdout | default('')) or
        'RunnerScaleSetSessionConflictException' in
          (arc_scale_set_helm_attempt.stderr | default('')) or
        'already has an active session' in
          (arc_scale_set_helm_attempt.stderr | default(''))
      }}
    arc_scale_set_attempt_requires_recovery: >-
      {{
        (arc_scale_set_listener_probe.rc | int) in [42, 45] or
        'RunnerScaleSetSessionConflictException' in
          (arc_scale_set_helm_attempt.stdout | default('')) or
        'already has an active session' in
          (arc_scale_set_helm_attempt.stdout | default('')) or
        'RunnerScaleSetSessionConflictException' in
          (arc_scale_set_helm_attempt.stderr | default('')) or
        'already has an active session' in
          (arc_scale_set_helm_attempt.stderr | default(''))
      }}
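The probe script and the set_fact task above encode a small decision table: listener exit code 0 means stable, 42 means an exact session conflict, 45 means a suspected stale session (UID replacement, disappearance, or loss of readiness), and 43 means no Ready listener was found without any UID replacement. The sketch below restates that table as a standalone function. classify_attempt is a hypothetical name, and the sketch deliberately omits one input the role also considers, namely conflict text found in the Helm output.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: maps one attempt's return codes to an action.
# Prints one of: stable | recover | fail
classify_attempt() {
  local helm_rc="$1" listener_rc="$2"

  # 42 = exact 409 session conflict, 45 = suspected stale session.
  case "${listener_rc}" in
    42|45) echo "recover"; return 0 ;;
  esac

  if [ "${helm_rc}" -eq 0 ] && [ "${listener_rc}" -eq 0 ]; then
    echo "stable"
  else
    # Includes 43: no Ready listener and no Pod UID replacement observed.
    echo "fail"
  fi
}

# Walk through the combinations discussed on this page.
for pair in "0 0" "0 42" "1 45" "0 43" "1 0"; do
  read -r helm_rc listener_rc <<< "${pair}"
  printf 'helm rc=%s listener rc=%s -> %s\n' \
    "${helm_rc}" "${listener_rc}" "$(classify_attempt "${helm_rc}" "${listener_rc}")"
done
```

In the role, a recover result is honored only once: if the retry again requires recovery, the final assertion fails the run.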

Create ansible/roles/arc-runner-scale-set/tasks/recover-session-conflict.yml:

---
- name: Explain the scoped ARC listener-session recovery
  ansible.builtin.debug:
    msg:
      - "Recovering only runner scale set: {{ arc_scale_set_name }}"
      - "Runner namespace: {{ arc_scale_set_namespace }}"
      - "Listener namespace: {{ arc_scale_set_controller_namespace }}"
      - "Shared ARC controller and unrelated scale sets are preserved."
      - "Recovery applies to an exact 409 or rapid same-name listener Pod UID replacement."

- name: Uninstall the affected Helm release when present
  ansible.builtin.shell:
    executable: /bin/bash
    cmd: |
      set -euo pipefail

      if {{ arc_scale_set_helm_binary }} status \
        {{ arc_scale_set_release_name }} \
        --namespace {{ arc_scale_set_namespace }} \
        --kubeconfig {{ arc_scale_set_kubeconfig }} \
        >/dev/null 2>&1
      then
        {{ arc_scale_set_helm_binary }} uninstall \
          {{ arc_scale_set_release_name }} \
          --namespace {{ arc_scale_set_namespace }} \
          --kubeconfig {{ arc_scale_set_kubeconfig }} \
          --wait \
          --timeout 3m
      else
        echo "Helm release is not present."
      fi
  register: arc_scale_set_recovery_uninstall
  changed_when: "'uninstalled' in arc_scale_set_recovery_uninstall.stdout"

- name: Delete a lingering AutoscalingRunnerSet
  ansible.builtin.command:
    argv:
      - kubectl
      - "--kubeconfig={{ arc_scale_set_kubeconfig }}"
      - delete
      - autoscalingrunnerset.actions.github.com
      - "{{ arc_scale_set_name }}"
      - "--namespace={{ arc_scale_set_namespace }}"
      - --ignore-not-found=true
      - --wait=true
      - --timeout=3m
  register: arc_scale_set_recovery_runner_set_delete
  changed_when: "'deleted' in arc_scale_set_recovery_runner_set_delete.stdout"

- name: Delete lingering listener resources and listener pods
  ansible.builtin.shell:
    executable: /bin/bash
    cmd: |
      set -euo pipefail

      changed=false

      listener_resources="$(
        kubectl \
          --kubeconfig={{ arc_scale_set_kubeconfig }} \
          get autoscalinglisteners.actions.github.com \
          --namespace={{ arc_scale_set_controller_namespace }} \
          --output=name \
          2>/dev/null |
        grep -E \
          '^autoscalinglistener\.actions\.github\.com/{{ arc_scale_set_name }}-[a-z0-9]+-listener$' \
          || true
      )"

      if [ -n "${listener_resources}" ]; then
        while IFS= read -r listener_resource; do
          [ -n "${listener_resource}" ] || continue

          kubectl \
            --kubeconfig={{ arc_scale_set_kubeconfig }} \
            delete \
            --namespace={{ arc_scale_set_controller_namespace }} \
            "${listener_resource}" \
            --wait=true \
            --timeout=3m || true

          changed=true
        done <<< "${listener_resources}"
      fi

      listener_pods="$(
        kubectl \
          --kubeconfig={{ arc_scale_set_kubeconfig }} \
          get pods \
          --namespace={{ arc_scale_set_controller_namespace }} \
          --output=name \
          2>/dev/null |
        grep -E '^pod/{{ arc_scale_set_name }}-[a-z0-9]+-listener$' \
          || true
      )"

      if [ -n "${listener_pods}" ]; then
        while IFS= read -r listener_pod; do
          [ -n "${listener_pod}" ] || continue

          kubectl \
            --kubeconfig={{ arc_scale_set_kubeconfig }} \
            delete \
            --namespace={{ arc_scale_set_controller_namespace }} \
            "${listener_pod}" \
            --wait=true \
            --timeout=2m || true

          changed=true
        done <<< "${listener_pods}"
      fi

      printf 'changed=%s\n' "${changed}"
  register: arc_scale_set_recovery_listener_delete
  changed_when: "'changed=true' in arc_scale_set_recovery_listener_delete.stdout"

- name: Wait for the affected ARC resources to disappear
  ansible.builtin.shell:
    executable: /bin/bash
    cmd: |
      set -euo pipefail

      for attempt in $(seq 1 90); do
        runner_set="$(
          kubectl \
            --kubeconfig={{ arc_scale_set_kubeconfig }} \
            get autoscalingrunnerset.actions.github.com \
            {{ arc_scale_set_name }} \
            --namespace={{ arc_scale_set_namespace }} \
            --ignore-not-found \
            --output=name \
            2>/dev/null || true
        )"

        listeners="$(
          kubectl \
            --kubeconfig={{ arc_scale_set_kubeconfig }} \
            get autoscalinglisteners.actions.github.com \
            --namespace={{ arc_scale_set_controller_namespace }} \
            --output=name \
            2>/dev/null |
          grep -E \
            '^autoscalinglistener\.actions\.github\.com/{{ arc_scale_set_name }}-[a-z0-9]+-listener$' \
            || true
        )"

        listener_pods="$(
          kubectl \
            --kubeconfig={{ arc_scale_set_kubeconfig }} \
            get pods \
            --namespace={{ arc_scale_set_controller_namespace }} \
            --output=name \
            2>/dev/null |
          grep -E '^pod/{{ arc_scale_set_name }}-[a-z0-9]+-listener$' \
            || true
        )"

        if [ -z "${runner_set}" ] &&
          [ -z "${listeners}" ] &&
          [ -z "${listener_pods}" ]
        then
          echo "Affected ARC resources are absent."
          exit 0
        fi

        echo "Waiting for affected ARC resources to disappear (attempt ${attempt}/90)..."
        sleep 2
      done

      echo "ERROR: Affected ARC resources did not disappear."
      exit 1
  changed_when: false

- name: Wait for the GitHub Actions backend session lease to expire
  ansible.builtin.pause:
    seconds: "{{ arc_scale_set_session_release_wait_seconds }}"
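
The recovery stays scoped to one scale set because every listener lookup filters names with the same anchored pattern. A minimal Python sketch of that filter follows; the pod names in the usage note are hypothetical, and the `re.escape` call is an addition that the shell version does not perform.

```python
import re


def listener_pattern(scale_set_name):
    """Build the anchored pattern <scale-set-name>-<suffix>-listener."""
    # re.escape is an addition; the shell tasks interpolate the name unescaped.
    return re.compile(rf"^{re.escape(scale_set_name)}-[a-z0-9]+-listener$")


def select_listener_pods(scale_set_name, pod_names):
    """Return only the `kubectl --output=name` entries owned by this scale set."""
    pattern = listener_pattern(scale_set_name)
    selected = []
    for name in pod_names:
        bare_name = name.split("/", 1)[-1]  # drop the "pod/" prefix
        if pattern.match(bare_name):
            selected.append(name)
    return selected
```

With a scale set named `payments-api-dev-arc`, the entry `pod/payments-api-dev-arc-7c9f2b1d-listener` is selected, while `pod/payments-api-qa-arc-1a2b3c4d-listener` and `pod/payments-api-dev-arc-extra-5e6f7a8b-listener` are left alone because the suffix class `[a-z0-9]+` cannot span a hyphen.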

When the common role changes later, manually dispatch only the affected scale-set workflows. Do not add the common role path to the path filter of every repository-specific workflow, because a single role change would then trigger every scale-set workflow.

12. 📘 Create the Repository and Environment Playbook

CHANGE PER REPOSITORY | CHANGE PER DEPLOYMENT BRANCH
The playbook contains no credential values. It receives the GitHub App identifiers and the temporary private-key file through the protected organization environment at runtime.

Create ansible/playbooks/tenants/<<APP_SHORT_FORM>>/dev/install-<<REPOSITORY_NAME>>-runners.yml:

---
- name: Install the <<REPOSITORY_NAME>> dev ARC runner scale set
  hosts: first_control_plane
  become: true
  gather_facts: true

  vars:
    arc_scale_set_admin_hostname: "cicd-ac-k8s-cp-01"
    arc_scale_set_admin_ip: "192.168.8.202"

    arc_scale_set_kubeconfig: "/etc/kubernetes/admin.conf"
    arc_scale_set_helm_binary: "/usr/local/bin/helm"

    arc_scale_set_controller_namespace: "arc-systems"
    arc_scale_set_namespace: "arc-runners-<<APP_SHORT_FORM>>-dev"

    arc_scale_set_release_name: "<<REPOSITORY_NAME>>-dev-arc"
    arc_scale_set_name: "<<REPOSITORY_NAME>>-dev-arc"
    arc_scale_set_chart_version: "0.14.2"

    arc_scale_set_values_source: >-
      {{ lookup('env', 'GITHUB_WORKSPACE') }}/helm/tenants/<<APP_SHORT_FORM>>/dev/<<REPOSITORY_NAME>>-values.yaml

    arc_scale_set_namespace_manifest_source: >-
      {{ lookup('env', 'GITHUB_WORKSPACE') }}/kubernetes/tenants/<<APP_SHORT_FORM>>/dev/namespace/namespace.yaml

    arc_scale_set_github_secret_name: "<<APP_SHORT_FORM>>-arc-ghapp-secret"

    arc_scale_set_enable_session_conflict_recovery: true
    arc_scale_set_session_release_wait_seconds: 120
    arc_scale_set_listener_discovery_attempts: 60
    arc_scale_set_listener_poll_seconds: 2
    arc_scale_set_listener_stability_seconds: 30

  roles:
    - role: arc-runner-scale-set

13. 🔄 Create the Infrastructure Reconciliation Workflow

CHANGE PER GITHUB ORG | CHANGE PER REPOSITORY | CHANGE PER DEPLOYMENT BRANCH
A dev push validates only this scale set. A prod push, or a manual dispatch from prod, creates the namespace and secret, performs the install-probe-recover-retry sequence, verifies an unchanged listener Pod UID, and captures detailed diagnostics on failure. Narrow path filters keep the workflows of unrelated scale sets from running, while the shared Ansible concurrency group prevents simultaneous cluster mutations.
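
The branch gating described above can be sketched in Python as follows. The function name is hypothetical; the job names `validate` and `configure` and the event names come from the workflow in this section.

```python
def jobs_to_run(event_name, ref_name):
    """Return the workflow jobs that run for a given trigger and branch."""
    jobs = ["validate"]  # validation has no branch condition
    # The install job runs only for a prod push or a manual dispatch from prod.
    if event_name in ("push", "workflow_dispatch") and ref_name == "prod":
        jobs.append("configure")
    return jobs


for trigger in [("push", "dev"), ("push", "prod"), ("workflow_dispatch", "prod")]:
    print(trigger, jobs_to_run(*trigger))
```

The design keeps the cluster untouched on every dev run: only the prod branch can reach the job that reads the protected credential environment.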

Create .github/workflows/arc-<<REPOSITORY_NAME>>-dev.yml:

name: ARC Runner Scale Set - <<REPOSITORY_NAME>> - dev

on:
  push:
    branches:
      - dev
      - prod
    paths:
      - "kubernetes/tenants/<<APP_SHORT_FORM>>/dev/runner-scale-sets/<<REPOSITORY_NAME>>.yaml"
      - "helm/tenants/<<APP_SHORT_FORM>>/dev/<<REPOSITORY_NAME>>-values.yaml"
      - "ansible/playbooks/tenants/<<APP_SHORT_FORM>>/dev/install-<<REPOSITORY_NAME>>-runners.yml"
      - ".github/workflows/arc-<<REPOSITORY_NAME>>-dev.yml"

  workflow_dispatch:

permissions:
  contents: read

concurrency:
  group: shared-k8s-ansible
  cancel-in-progress: false

env:
  ANSIBLE_CONFIG: ${{ github.workspace }}/ansible/ansible.cfg

jobs:
  validate:
    name: Validate <<REPOSITORY_NAME>> dev runner scale set

    runs-on:
      - self-hosted
      - Linux
      - X64
      - prod
      - terraform
      - deploy
      - ac-cicd-infra

    timeout-minutes: 30

    steps:
      - name: Checkout repository
        uses: actions/checkout@v5

      - name: Verify required files
        shell: bash
        run: |
          set -euo pipefail

          required_files=(
            "kubernetes/tenants/<<APP_SHORT_FORM>>/dev/namespace/namespace.yaml"
            "kubernetes/tenants/<<APP_SHORT_FORM>>/dev/runner-scale-sets/<<REPOSITORY_NAME>>.yaml"
            "helm/tenants/<<APP_SHORT_FORM>>/dev/<<REPOSITORY_NAME>>-values.yaml"
            "ansible/roles/arc-runner-scale-set/defaults/main.yml"
            "ansible/roles/arc-runner-scale-set/tasks/main.yml"
            "ansible/roles/arc-runner-scale-set/tasks/install-and-probe.yml"
            "ansible/roles/arc-runner-scale-set/tasks/recover-session-conflict.yml"
            "ansible/playbooks/tenants/<<APP_SHORT_FORM>>/dev/install-<<REPOSITORY_NAME>>-runners.yml"
          )

          for file in "${required_files[@]}"; do
            if [ ! -f "${file}" ]; then
              echo "ERROR: Missing ${file}"
              exit 1
            fi
          done

      - name: Validate namespace, definition, and Helm values
        shell: bash
        run: |
          set -euo pipefail

          python3 - <<'PY'
          from pathlib import Path
          import yaml

          namespace = yaml.safe_load(
              Path("kubernetes/tenants/<<APP_SHORT_FORM>>/dev/namespace/namespace.yaml")
              .read_text(encoding="utf-8")
          )
          definition = yaml.safe_load(
              Path("kubernetes/tenants/<<APP_SHORT_FORM>>/dev/runner-scale-sets/<<REPOSITORY_NAME>>.yaml")
              .read_text(encoding="utf-8")
          )
          values = yaml.safe_load(
              Path("helm/tenants/<<APP_SHORT_FORM>>/dev/<<REPOSITORY_NAME>>-values.yaml")
              .read_text(encoding="utf-8")
          )

          assert namespace["kind"] == "Namespace"
          assert namespace["metadata"]["name"] == "arc-runners-<<APP_SHORT_FORM>>-dev"

          assert definition["schemaVersion"] == 1
          assert definition["github"]["organization"] == "<<GITHUB_ORGANIZATION>>"
          assert definition["github"]["repository"] == "<<REPOSITORY_NAME>>"
          assert definition["runnerScaleSet"]["name"] == "<<REPOSITORY_NAME>>-dev-arc"
          assert definition["runnerScaleSet"]["namespace"] == "arc-runners-<<APP_SHORT_FORM>>-dev"
          assert definition["branchMapping"]["sourceBranch"] == "dev"
          assert definition["branchMapping"]["runsOn"] == "<<REPOSITORY_NAME>>-dev-arc"
          assert definition["security"]["privateKeyCommittedToGit"] is False
          assert definition["security"]["kubernetesSecretManifestCommitted"] is False

          assert values["githubConfigUrl"] == "https://github.com/<<GITHUB_ORGANIZATION>>/<<REPOSITORY_NAME>>"
          assert values["githubConfigSecret"] == "<<APP_SHORT_FORM>>-arc-ghapp-secret"
          assert values["runnerScaleSetName"] == "<<REPOSITORY_NAME>>-dev-arc"
          assert values["minRunners"] == 0
          assert values["maxRunners"] == 4
          assert values["minRunners"] <= values["maxRunners"]
          assert values["containerMode"]["type"] == "dind"

          listener_spec = values["listenerTemplate"]["spec"]
          listener_containers = listener_spec["containers"]

          assert isinstance(listener_containers, list)
          assert len(listener_containers) >= 1
          assert listener_containers[0]["name"] == "listener"
          assert (
              listener_spec["nodeSelector"][
                  "node-role.kubernetes.io/control-plane"
              ]
              == ""
          )
          assert any(
              toleration["key"] == "node-role.kubernetes.io/control-plane"
              and toleration["operator"] == "Exists"
              and toleration["effect"] == "NoSchedule"
              for toleration in listener_spec["tolerations"]
          )

          assert values["template"]["spec"]["nodeSelector"]["environment"] == "dev"
          assert (
              values["template"]["spec"]["nodeSelector"]["workload"]
              == "github-runner"
          )

          print("Runner scale-set configuration is valid.")
          PY

      - name: Reject committed private keys and Kubernetes Secret manifests
        shell: bash
        run: |
          set -euo pipefail

          tracked_key_files="$(git ls-files | grep -E '\.(pem|key|p8)$' || true)"

          if [ -n "${tracked_key_files}" ]; then
            echo "ERROR: Private-key files are tracked by Git:"
            printf '%s\n' "${tracked_key_files}"
            exit 1
          fi

          private_key_matches="$(
            git grep \
              -n \
              -E \
              -- '-----BEGIN ([A-Z0-9]+ )?PRIVATE KEY-----' \
              -- . \
              ':!.github/workflows/arc-<<REPOSITORY_NAME>>-dev.yml' \
              ':!docs/**' || true
          )"

          if [ -n "${private_key_matches}" ]; then
            echo "ERROR: Private-key contents were found in tracked files:"
            printf '%s\n' "${private_key_matches}"
            exit 1
          fi

          secret_manifests="$(
            git grep -l -E '^kind:[[:space:]]*Secret$' -- \
              'kubernetes/tenants/**' || true
          )"

          if [ -n "${secret_manifests}" ]; then
            echo "ERROR: Kubernetes Secret manifests are committed:"
            printf '%s\n' "${secret_manifests}"
            exit 1
          fi

          echo "No private key or Kubernetes Secret manifest is tracked."

      - name: Validate the Ansible playbook target
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail

          output="$(
            ansible-playbook \
              -i inventories/shared-k8s/hosts.ini \
              playbooks/tenants/<<APP_SHORT_FORM>>/dev/install-<<REPOSITORY_NAME>>-runners.yml \
              --list-hosts
          )"

          printf '%s\n' "${output}"
          grep -Fq "cicd-ac-k8s-cp-01" <<< "${output}"

          if grep -Eq 'cicd-ac-k8s-(dev|qa|prod)-wk-' <<< "${output}"; then
            echo "ERROR: Runner scale-set installation playbook targets a worker node."
            exit 1
          fi

      - name: Syntax-check the runner scale-set playbook
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail

          ansible-playbook \
            -i inventories/shared-k8s/hosts.ini \
            playbooks/tenants/<<APP_SHORT_FORM>>/dev/install-<<REPOSITORY_NAME>>-runners.yml \
            --syntax-check

  configure:
    name: Install or reconcile <<REPOSITORY_NAME>> dev runner scale set

    needs:
      - validate

    if: >-
      (github.event_name == 'push' && github.ref_name == 'prod') ||
      (github.event_name == 'workflow_dispatch' && github.ref_name == 'prod')

    environment:
      name: arc-org-<<APP_SHORT_FORM>>

    runs-on:
      - self-hosted
      - Linux
      - X64
      - prod
      - terraform
      - deploy
      - ac-cicd-infra

    timeout-minutes: 90

    steps:
      - name: Checkout repository
        uses: actions/checkout@v5

      - name: Verify production branch and organization credentials
        shell: bash
        env:
          ARC_GITHUB_ORGANIZATION: ${{ vars.ARC_GITHUB_ORGANIZATION }}
          ARC_GITHUB_APP_CLIENT_ID: ${{ vars.ARC_GITHUB_APP_CLIENT_ID }}
          ARC_GITHUB_APP_ID: ${{ vars.ARC_GITHUB_APP_ID }}
          ARC_GITHUB_APP_INSTALLATION_ID: ${{ vars.ARC_GITHUB_APP_INSTALLATION_ID }}
          ARC_GITHUB_APP_PRIVATE_KEY: ${{ secrets.ARC_GITHUB_APP_PRIVATE_KEY }}
        run: |
          set -euo pipefail

          test "${GITHUB_REF_NAME}" = "prod"
          test "${ARC_GITHUB_ORGANIZATION}" = "<<GITHUB_ORGANIZATION>>"
          test -n "${ARC_GITHUB_APP_CLIENT_ID}"
          test -n "${ARC_GITHUB_APP_ID}"
          test -n "${ARC_GITHUB_APP_INSTALLATION_ID}"
          test -n "${ARC_GITHUB_APP_PRIVATE_KEY}"

      - name: Create a repository-scoped GitHub App token
        id: app-token
        uses: actions/create-github-app-token@v3.2.0
        with:
          client-id: ${{ vars.ARC_GITHUB_APP_CLIENT_ID }}
          private-key: ${{ secrets.ARC_GITHUB_APP_PRIVATE_KEY }}
          owner: "<<GITHUB_ORGANIZATION>>"
          repositories: "<<REPOSITORY_NAME>>"

      - name: Verify the GitHub App can access the repository
        shell: bash
        env:
          GH_TOKEN: ${{ steps.app-token.outputs.token }}
        run: |
          set -euo pipefail

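          full_name="$(gh api /repos/<<GITHUB_ORGANIZATION>>/<<REPOSITORY_NAME>> --jq .full_name)"
          expected_name="<<GITHUB_ORGANIZATION>>/<<REPOSITORY_NAME>>"
          # Lowercase both sides so a mixed-case organization or repository name still matches.
          test "${full_name,,}" = "${expected_name,,}"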
          echo "Verified GitHub App repository access: ${full_name}"

      - name: Prepare the existing Ansible SSH key
        shell: bash
        run: |
          set -euo pipefail

          key_path="${HOME}/.ssh/id_ed25519_ansible"
          test -f "${key_path}"
          chmod 600 "${key_path}"
          echo "ANSIBLE_PRIVATE_KEY_FILE=${key_path}" >> "${GITHUB_ENV}"

      - name: Refresh the first-control-plane SSH host key
        shell: bash
        run: |
          set -euo pipefail

          mkdir -p "${HOME}/.ssh"
          chmod 700 "${HOME}/.ssh"
          touch "${HOME}/.ssh/known_hosts"
          chmod 600 "${HOME}/.ssh/known_hosts"

          ssh-keygen -f "${HOME}/.ssh/known_hosts" -R "192.168.8.202" || true

          captured=false
          for attempt in $(seq 1 30); do
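            # ssh-keyscan can exit 0 without returning a key, so test the captured output instead.
            scanned_keys="$(ssh-keyscan -T 5 -H "192.168.8.202" 2>/dev/null || true)"
            if [ -n "${scanned_keys}" ]; then
              printf '%s\n' "${scanned_keys}" >> "${HOME}/.ssh/known_hosts"
              captured=true
              break
            fi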

            echo "Waiting for SSH on 192.168.8.202 (attempt ${attempt}/30)..."
            sleep 10
          done

          test "${captured}" = "true"

      - name: Prepare the Ansible remote temporary directory
        shell: bash
        run: |
          set -euo pipefail

          ssh \
            -i "${ANSIBLE_PRIVATE_KEY_FILE}" \
            -o IdentitiesOnly=yes \
            -o BatchMode=yes \
            "acllc@192.168.8.202" \
            'sudo install -d -m 0700 -o acllc -g acllc /var/tmp/ansible-acllc'

      - name: Create the temporary GitHub App private-key file
        shell: bash
        env:
          ARC_GITHUB_APP_PRIVATE_KEY: ${{ secrets.ARC_GITHUB_APP_PRIVATE_KEY }}
        run: |
          set -euo pipefail

          key_file="${RUNNER_TEMP}/<<APP_SHORT_FORM>>-github-app-private-key.pem"
          umask 077
          printf '%s' "${ARC_GITHUB_APP_PRIVATE_KEY}" > "${key_file}"
          chmod 600 "${key_file}"
          echo "ARC_GITHUB_APP_PRIVATE_KEY_FILE=${key_file}" >> "${GITHUB_ENV}"

      - name: Install or reconcile the repository runner scale set
        working-directory: ansible
        shell: bash
        env:
          ARC_GITHUB_APP_ID: ${{ vars.ARC_GITHUB_APP_ID }}
          ARC_GITHUB_APP_INSTALLATION_ID: ${{ vars.ARC_GITHUB_APP_INSTALLATION_ID }}
        run: |
          set -euo pipefail

          ansible-playbook \
            -i inventories/shared-k8s/hosts.ini \
            --private-key "${ANSIBLE_PRIVATE_KEY_FILE}" \
            playbooks/tenants/<<APP_SHORT_FORM>>/dev/install-<<REPOSITORY_NAME>>-runners.yml

      - name: Capture ARC diagnostics after a failed reconciliation
        if: failure()
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail

          ansible \
            -i inventories/shared-k8s/hosts.ini \
            first_control_plane \
            --private-key "${ANSIBLE_PRIVATE_KEY_FILE}" \
            --become \
            --extra-vars ansible_shell_executable=/bin/bash \
            -m ansible.builtin.shell \
            -a '
              set +e
              export KUBECONFIG=/etc/kubernetes/admin.conf

              echo "=== HELM RELEASES ==="
              /usr/local/bin/helm list -a -n arc-runners-<<APP_SHORT_FORM>>-dev
              /usr/local/bin/helm status <<REPOSITORY_NAME>>-dev-arc -n arc-runners-<<APP_SHORT_FORM>>-dev

              echo
              echo "=== AUTOSCALING RUNNER SET ==="
              kubectl get autoscalingrunnerset.actions.github.com \
                <<REPOSITORY_NAME>>-dev-arc \
                -n arc-runners-<<APP_SHORT_FORM>>-dev \
                -o yaml

              echo
              echo "=== AUTOSCALING LISTENERS ==="
              kubectl get autoscalinglisteners.actions.github.com \
                -n arc-systems \
                -o wide

              echo
              echo "=== LISTENER PODS ==="
              kubectl get pods -n arc-systems -o wide |
                grep "<<REPOSITORY_NAME>>-dev-arc" || true

              echo
              echo "=== LISTENER LOGS ==="
              for listener_pod in $(
                kubectl get pods \
                  -n arc-systems \
                  -o jsonpath="{range .items[*]}{.metadata.name}{\"\\n\"}{end}" |
                grep -E "^<<REPOSITORY_NAME>>-dev-arc-[a-z0-9]+-listener$" || true
              ); do
                echo "--- ${listener_pod} ---"
                kubectl logs \
                  "${listener_pod}" \
                  -n arc-systems \
                  -c listener \
                  --tail=200 || true
              done

              echo
              echo "=== RUNNER-NAMESPACE EVENTS ==="
              kubectl get events \
                -n arc-runners-<<APP_SHORT_FORM>>-dev \
                --sort-by=.lastTimestamp |
                tail -50

              echo
              echo "=== CONTROLLER-NAMESPACE EVENTS ==="
              kubectl get events \
                -n arc-systems \
                --sort-by=.lastTimestamp |
                tail -50
            ' || true

      - name: Verify the runner scale set and stable listener Pod UID
        working-directory: ansible
        shell: bash
        run: |
          set -euo pipefail

          ansible \
            -i inventories/shared-k8s/hosts.ini \
            first_control_plane \
            --private-key "${ANSIBLE_PRIVATE_KEY_FILE}" \
            --become \
            --extra-vars ansible_shell_executable=/bin/bash \
            -m ansible.builtin.shell \
            -a '
              set -euo pipefail
              export KUBECONFIG=/etc/kubernetes/admin.conf

              /usr/local/bin/helm status \
                <<REPOSITORY_NAME>>-dev-arc \
                -n arc-runners-<<APP_SHORT_FORM>>-dev

              kubectl get autoscalingrunnerset.actions.github.com \
                <<REPOSITORY_NAME>>-dev-arc \
                -n arc-runners-<<APP_SHORT_FORM>>-dev \
                -o wide

              kubectl get autoscalinglisteners.actions.github.com \
                -n arc-systems \
                -o wide

              listener_pod=""
              listener_uid=""

              for attempt in $(seq 1 60); do
                listener_record="$(
                  kubectl get pods \
                    -n arc-systems \
                    -o jsonpath="{range .items[*]}{.metadata.name}{\"|\"}{.metadata.uid}{\"|\"}{.status.phase}{\"|\"}{.status.conditions[?(@.type==\"Ready\")].status}{\"\\n\"}{end}" \
                    2>/dev/null |
                  grep -E "^<<REPOSITORY_NAME>>-dev-arc-[a-z0-9]+-listener\\|" |
                  head -n 1 || true
                )"

                if [ -n "${listener_record}" ]; then
                  IFS="|" read -r candidate_pod candidate_uid candidate_phase candidate_ready \
                    <<< "${listener_record}"

                  if [ "${candidate_phase}" = "Running" ] &&
                    [ "${candidate_ready}" = "True" ]
                  then
                    listener_pod="${candidate_pod}"
                    listener_uid="${candidate_uid}"
                    break
                  fi
                fi

                sleep 2
              done

              test -n "${listener_pod}"
              test -n "${listener_uid}"

              for attempt in $(seq 1 15); do
                listener_record="$(
                  kubectl get pod \
                    "${listener_pod}" \
                    -n arc-systems \
                    -o jsonpath="{.metadata.uid}{\"|\"}{.status.phase}{\"|\"}{.status.conditions[?(@.type==\"Ready\")].status}" \
                    2>/dev/null || true
                )"

                test -n "${listener_record}"
                IFS="|" read -r current_uid current_phase current_ready \
                  <<< "${listener_record}"

                test "${current_uid}" = "${listener_uid}"
                test "${current_phase}" = "Running"
                test "${current_ready}" = "True"

                listener_logs="$(
                  kubectl logs \
                    "${listener_pod}" \
                    -n arc-systems \
                    -c listener \
                    --tail=200 \
                    2>&1 || true
                )"

                if printf "%s\n" "${listener_logs}" |
                  grep -Eq \
                    "RunnerScaleSetSessionConflictException|already has an active session"
                then
                  printf "%s\n" "${listener_logs}"
                  exit 1
                fi

                sleep 2
              done

              echo "Stable listener pod and UID: ${listener_pod} (${listener_uid})"
              kubectl get pods -n arc-runners-<<APP_SHORT_FORM>>-dev -o wide
            '

      - name: Remove the temporary local private-key file
        if: always()
        shell: bash
        run: |
          set -euo pipefail

          if [ -n "${ARC_GITHUB_APP_PRIVATE_KEY_FILE:-}" ]; then
            rm -f "${ARC_GITHUB_APP_PRIVATE_KEY_FILE}"
          fi
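
The verification step near the end of the workflow records the listener pod's UID once it is Running and Ready, then samples the pod repeatedly and fails on any change. A minimal Python sketch of that stability rule follows. The function name and the sample tuple layout are hypothetical; the conflict markers are the ones used in the workflow.

```python
CONFLICT_MARKERS = (
    "RunnerScaleSetSessionConflictException",
    "already has an active session",
)


def is_listener_stable(baseline_uid, samples):
    """Check samples of (uid, phase, ready, logs); None means the pod vanished."""
    for sample in samples:
        if sample is None:
            return False  # the pod could not be read
        uid, phase, ready, logs = sample
        if uid != baseline_uid:
            return False  # same name, different pod: the listener was replaced
        if phase != "Running" or ready != "True":
            return False
        if any(marker in logs for marker in CONFLICT_MARKERS):
            return False  # the backend still reports an active session
    return True
```

Comparing the UID rather than the pod name is the point of the check: a listener that is recreated under the same name keeps its name but receives a new UID.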

14. 🛡️ Create the Target Repository Environment Mapping

CHANGE PER REPOSITORY | CHANGE PER DEPLOYMENT BRANCH
The GitHub repository Environment records the selected runs-on label and namespace. Deployment credentials can be added to this environment later without exposing them to other branches.
$TargetRepository = "<<GITHUB_ORGANIZATION>>/<<REPOSITORY_NAME>>"
$RepositoryEnvironment = "dev"
$RunsOn = "<<REPOSITORY_NAME>>-dev-arc"
$RunnerNamespace = "arc-runners-<<APP_SHORT_FORM>>-dev"

Write-Host "Creating or reconciling the repository deployment Environment..."
gh api --method PUT "repos/$TargetRepository/environments/$RepositoryEnvironment"

Write-Host "Setting non-secret branch mapping variables..."
gh variable set ARC_RUNS_ON --body $RunsOn --env $RepositoryEnvironment --repo $TargetRepository
gh variable set ARC_RUNNER_NAMESPACE --body $RunnerNamespace --env $RepositoryEnvironment --repo $TargetRepository
gh variable set ARC_DEPLOYMENT_ENVIRONMENT --body $RepositoryEnvironment --env $RepositoryEnvironment --repo $TargetRepository

Write-Host "Repository Environment variables:"
gh variable list --env $RepositoryEnvironment --repo $TargetRepository

Write-Host "Repository Environment details:"
gh api "repos/$TargetRepository/environments/$RepositoryEnvironment"

Restrict the dev environment to the dev branch and add required reviewers for QA or production where appropriate.

15. 🧪 Review and Commit the Infrastructure Files

COMMON · NO CHANGE NEEDED | CHANGE PER REPOSITORY | CHANGE PER DEPLOYMENT BRANCH
Commit the generic role only when it is first introduced, together with the files that belong to this repository and environment. Do not stage private keys or Kubernetes Secret manifests.
COMMON — create once when the first scale set is onboarded
  ansible/roles/arc-runner-scale-set/defaults/main.yml
  ansible/roles/arc-runner-scale-set/tasks/main.yml
  ansible/roles/arc-runner-scale-set/tasks/install-and-probe.yml
  ansible/roles/arc-runner-scale-set/tasks/recover-session-conflict.yml

REPOSITORY + DEPLOYMENT BRANCH — repeat for each scale set
  kubernetes/tenants/<<APP_SHORT_FORM>>/dev/namespace/namespace.yaml
  kubernetes/tenants/<<APP_SHORT_FORM>>/dev/runner-scale-sets/<<REPOSITORY_NAME>>.yaml
  helm/tenants/<<APP_SHORT_FORM>>/dev/<<REPOSITORY_NAME>>-values.yaml
  ansible/playbooks/tenants/<<APP_SHORT_FORM>>/dev/install-<<REPOSITORY_NAME>>-runners.yml
  .github/workflows/arc-<<REPOSITORY_NAME>>-dev.yml

TARGET APPLICATION REPOSITORY
  .github/workflows/arc-dev-smoke-test.yml

Preserve without replacement
  terraform/**
  ansible/inventories/shared-k8s/**
  ansible/roles/common/**
  ansible/roles/containerd/**
  ansible/roles/kubernetes-common/**
  ansible/roles/kubernetes-control-plane/**
  ansible/roles/kubernetes-worker/**
  ansible/roles/arc-controller/**
  ansible/playbooks/shared-k8s/01-*.yml through 09-install-arc-controller.yml
  helm/common/arc-controller/**
  kubernetes/common/**
  existing tenant organization configurations
  existing repository and environment scale sets

Never commit
  GitHub App private keys
  *.pem, *.key, or *.p8 files
  rendered Kubernetes Secret manifests
  GitHub App installation tokens
  Harbor passwords
  deployment SSH private keys

git status
git diff --check
git diff --stat

git diff -- \
  kubernetes/tenants/<<APP_SHORT_FORM>>/dev/namespace/namespace.yaml \
  kubernetes/tenants/<<APP_SHORT_FORM>>/dev/runner-scale-sets/<<REPOSITORY_NAME>>.yaml \
  helm/tenants/<<APP_SHORT_FORM>>/dev/<<REPOSITORY_NAME>>-values.yaml \
  ansible/roles/arc-runner-scale-set \
  ansible/playbooks/tenants/<<APP_SHORT_FORM>>/dev/install-<<REPOSITORY_NAME>>-runners.yml \
  .github/workflows/arc-<<REPOSITORY_NAME>>-dev.yml

git add \
  kubernetes/tenants/<<APP_SHORT_FORM>>/dev/namespace/namespace.yaml \
  kubernetes/tenants/<<APP_SHORT_FORM>>/dev/runner-scale-sets/<<REPOSITORY_NAME>>.yaml \
  helm/tenants/<<APP_SHORT_FORM>>/dev/<<REPOSITORY_NAME>>-values.yaml \
  ansible/roles/arc-runner-scale-set/defaults/main.yml \
  ansible/roles/arc-runner-scale-set/tasks/main.yml \
  ansible/roles/arc-runner-scale-set/tasks/install-and-probe.yml \
  ansible/roles/arc-runner-scale-set/tasks/recover-session-conflict.yml \
  ansible/playbooks/tenants/<<APP_SHORT_FORM>>/dev/install-<<REPOSITORY_NAME>>-runners.yml \
  .github/workflows/arc-<<REPOSITORY_NAME>>-dev.yml

git commit -m "Configure <<REPOSITORY_NAME>> dev ARC runner scale set"
git push -u origin feature/configure-<<REPOSITORY_NAME>>-dev-arc
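
Before committing, the staged paths can be compared with the file list above. The Python sketch below is one way to do that review, not part of the documented procedure: the function name is hypothetical, and the input is assumed to be the output of `git diff --cached --name-only` split into lines.

```python
import re

# File extensions listed under "Never commit" above.
FORBIDDEN_SUFFIX = re.compile(r"\.(pem|key|p8)$")


def review_staged(staged_paths, allowed_paths):
    """Split staged paths into forbidden key files and unexpected extras."""
    allowed = set(allowed_paths)
    forbidden = [path for path in staged_paths if FORBIDDEN_SUFFIX.search(path)]
    unexpected = [
        path
        for path in staged_paths
        if path not in allowed and path not in forbidden
    ]
    return forbidden, unexpected
```

An empty result for both lists means that the commit contains only files that belong to this scale set or to the generic role.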

16. Validate Through dev

CHANGE PER REPOSITORY | CHANGE PER DEPLOYMENT BRANCH
A dev push parses all configuration, rejects secrets, validates the exact Ansible target, and syntax-checks the playbook. It does not alter Kubernetes.
gh pr create --base dev --head feature/configure-<<REPOSITORY_NAME>>-dev-arc --title "Configure <<REPOSITORY_NAME>> dev ARC runner scale set" --body "Adds one isolated repository and environment runner scale set. The pull request is for review only; validation starts after the merge creates a dev push."

Expected dev result:

Validate <<REPOSITORY_NAME>> dev runner scale set — success
Install or reconcile <<REPOSITORY_NAME>> dev runner scale set — skipped
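
The secret rejection performed by the validate job combines three checks: key-file extensions, PEM private-key headers, and Kubernetes Secret manifests under the tenant tree. The Python sketch below restates those checks over an in-memory mapping of path to file content. The function name is hypothetical, and the sketch omits the exclusions that the workflow applies to its own file and to docs.

```python
import re

KEY_FILE_SUFFIX = re.compile(r"\.(pem|key|p8)$")
PEM_HEADER = re.compile(r"-----BEGIN ([A-Z0-9]+ )?PRIVATE KEY-----")
SECRET_KIND = re.compile(r"^kind:[ \t]*Secret$", re.MULTILINE)


def scan(files):
    """Return (path, reason) findings for a mapping of tracked path to text."""
    findings = []
    for path, text in sorted(files.items()):
        if KEY_FILE_SUFFIX.search(path):
            findings.append((path, "private-key file extension"))
        if PEM_HEADER.search(text):
            findings.append((path, "private-key contents"))
        # Secret manifests are only rejected under the tenant tree.
        if path.startswith("kubernetes/tenants/") and SECRET_KIND.search(text):
            findings.append((path, "Kubernetes Secret manifest"))
    return findings
```

Any finding corresponds to a validation failure, so the expected dev result above implies an empty list.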

17. 🚀 Promote and Install Through prod

CHANGE PER GITHUB ORG | CHANGE PER REPOSITORY | CHANGE PER DEPLOYMENT BRANCH
The prod push reads the protected GitHub App environment, validates the rendered chart, and performs an internal install-and-probe attempt. If that attempt hits an exact 409 or a rapid same-name listener Pod UID replacement, the role recovers automatically, retries once, and then verifies that the same listener Pod UID stays stable.
gh pr create --base prod --head dev --title "Install <<REPOSITORY_NAME>> dev ARC runner scale set" --body "Promotes the validated repository and environment runner scale set to prod. The prod push creates the runtime Kubernetes Secret and installs the Helm release."

Expected production sequence:

Validate configuration
Read GitHub App credentials from arc-org-<<APP_SHORT_FORM>>
Create or reconcile arc-runners-<<APP_SHORT_FORM>>-dev
Create or reconcile <<APP_SHORT_FORM>>-arc-ghapp-secret
Render runner chart 0.14.2
Validate rendered resources with kubectl --dry-run=server
Perform the first internal Helm install and immediate listener probe
If an exact 409 or same-name listener Pod UID replacement is detected, remove only <<REPOSITORY_NAME>>-dev-arc resources
Wait 120 seconds for the GitHub message-session lease to clear
Perform exactly one clean Helm retry in the same workflow run
Require the same replacement listener Pod UID to remain Running and Ready for 30 seconds
Verify AutoscalingRunnerSet in arc-runners-<<APP_SHORT_FORM>>-dev
Verify AutoscalingListener and the unchanged listener Pod UID in arc-systems
Capture Helm, ARC resource, listener-log, pod, and event diagnostics on failure
Confirm no 409 active-session conflict remains
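
The recovery portion of the sequence above can be sketched as a small control flow: install and probe, clean up once on a recoverable signal, wait for the lease, retry exactly once, then check stability. This is an illustrative sketch only, not the Ansible role itself. The hook names install_and_probe, scoped_cleanup, and verify_stability, the exit-code convention (2 means recoverable), and the LEASE_WAIT_SECONDS variable are all assumptions introduced here.

```shell
#!/usr/bin/env bash
# Sketch of the install, recover, wait, retry-once flow.
# The caller must define three hypothetical hook functions:
#   install_and_probe  returns 0 on success, 2 on a recoverable session
#                      conflict or Pod UID replacement, anything else is fatal
#   scoped_cleanup     removes only this scale set's resources
#   verify_stability   returns 0 when the same Pod UID stays Running and Ready
LEASE_WAIT_SECONDS="${LEASE_WAIT_SECONDS:-120}"

reconcile_with_one_retry() {
  local rc=0
  install_and_probe || rc=$?

  if [ "${rc}" -eq 2 ]; then
    echo "Recoverable listener conflict detected; cleaning up once."
    scoped_cleanup || return 1
    # Give the GitHub message-session lease time to clear.
    sleep "${LEASE_WAIT_SECONDS}"
    rc=0
    install_and_probe || rc=$?
  fi

  # A second failure of any kind is final; there is no unlimited retry loop.
  if [ "${rc}" -ne 0 ]; then
    echo "ERROR: install failed (rc=${rc}); no further retries." >&2
    return 1
  fi

  verify_stability
}
```

Keeping the retry outside a loop makes the "exactly one retry" rule visible in the code: a second recoverable signal falls through to the error branch.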

18. 🏷️ Add the Branch-to-Scale-Set Workflow in the Target Repository

CHANGE PER REPOSITORY | CHANGE PER DEPLOYMENT BRANCH

The application repository must explicitly target this scale-set name. That runs-on value is the actual branch-to-worker-pool routing mechanism.

Create .github/workflows/arc-dev-smoke-test.yml in <<GITHUB_ORGANIZATION>>/<<REPOSITORY_NAME>>:

name: ARC dev Runner Smoke Test

on:
  push:
    branches:
      - dev
    paths:
      - ".github/workflows/arc-dev-smoke-test.yml"

  workflow_dispatch:

permissions:
  contents: read

jobs:
  smoke-test:
    name: Verify dev ARC runner
    runs-on: <<REPOSITORY_NAME>>-dev-arc
    environment:
      name: dev

    timeout-minutes: 20

    steps:
      - name: Checkout repository
        uses: actions/checkout@v5

      - name: Display runner context
        shell: bash
        run: |
          set -euo pipefail

          echo "Repository: ${GITHUB_REPOSITORY}"
          echo "Branch: ${GITHUB_REF_NAME}"
          echo "Runner name: ${RUNNER_NAME}"
          echo "Runner architecture: ${RUNNER_ARCH}"
          echo "Expected runs-on: <<REPOSITORY_NAME>>-dev-arc"
          echo "Expected Kubernetes environment: dev"

      - name: Verify Docker-in-Docker
        shell: bash
        run: |
          set -euo pipefail

          docker version
          docker info
          docker run --rm hello-world

Commit this workflow to dev. The push queues the smoke-test job, and ARC starts an ephemeral runner pod on the dev worker pool to run it. With minRunners=0, no idle runner pod is expected before a job is queued.

19. 🔎 Verify the Listener and Ephemeral Runner

CHANGE PER REPOSITORY | CHANGE PER DEPLOYMENT BRANCH

The listener resource and pod run in the shared controller namespace. The same listener Pod UID must remain Running and Ready through the stability interval. Ephemeral runner pods appear in the environment runner namespace only while a job is queued or executing.

ssh \
  -i ~/.ssh/id_ed25519_ansible \
  -o IdentitiesOnly=yes \
  acllc@192.168.8.202 \
  'sudo bash -s' <<'REMOTE'
set -euo pipefail
export KUBECONFIG=/etc/kubernetes/admin.conf
HELM=/usr/local/bin/helm

echo "=== HELM RELEASE ==="
"${HELM}" status <<REPOSITORY_NAME>>-dev-arc -n arc-runners-<<APP_SHORT_FORM>>-dev

echo
echo "=== AUTOSCALING RUNNER SET ==="
kubectl get autoscalingrunnerset.actions.github.com \
  <<REPOSITORY_NAME>>-dev-arc \
  -n arc-runners-<<APP_SHORT_FORM>>-dev \
  -o wide

echo
echo "=== LISTENER RESOURCE IN THE CONTROLLER NAMESPACE ==="
kubectl get autoscalinglisteners.actions.github.com \
  -n arc-systems \
  -o wide

echo
echo "=== STABLE LISTENER POD AND UID ==="
listener_pod=""
listener_uid=""

for attempt in $(seq 1 60); do
  listener_record="$(
    kubectl get pods \
      -n arc-systems \
      -o jsonpath='{range .items[*]}{.metadata.name}{"|"}{.metadata.uid}{"|"}{.status.phase}{"|"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' \
      2>/dev/null |
    grep -E '^<<REPOSITORY_NAME>>-dev-arc-[a-z0-9]+-listener\|' |
    head -n 1 || true
  )"

  if [ -n "${listener_record}" ]; then
    IFS='|' read -r candidate_pod candidate_uid candidate_phase candidate_ready \
      <<< "${listener_record}"

    if [ "${candidate_phase}" = "Running" ] &&
      [ "${candidate_ready}" = "True" ]
    then
      listener_pod="${candidate_pod}"
      listener_uid="${candidate_uid}"
      break
    fi
  fi

  sleep 2
done

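# Fail with an explicit message if no listener became Running and Ready
# within the 60 polls above (about 120 seconds).
if [ -z "${listener_pod}" ] || [ -z "${listener_uid}" ]; then
  echo "ERROR: No Running and Ready listener pod was found in arc-systems."
  exit 1
fi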

for attempt in $(seq 1 15); do
  listener_record="$(
    kubectl get pod \
      "${listener_pod}" \
      -n arc-systems \
      -o jsonpath='{.metadata.uid}{"|"}{.status.phase}{"|"}{.status.conditions[?(@.type=="Ready")].status}' \
      2>/dev/null || true
  )"

  test -n "${listener_record}"
  IFS='|' read -r current_uid current_phase current_ready \
    <<< "${listener_record}"

  test "${current_uid}" = "${listener_uid}"
  test "${current_phase}" = "Running"
  test "${current_ready}" = "True"

  listener_logs="$(
    kubectl logs \
      "${listener_pod}" \
      -n arc-systems \
      -c listener \
      --tail=200 \
      2>&1 || true
  )"

  if printf '%s\n' "${listener_logs}" |
    grep -Eq \
      'RunnerScaleSetSessionConflictException|already has an active session'
  then
    printf '%s\n' "${listener_logs}"
    echo "ERROR: Listener has an active GitHub session conflict."
    exit 1
  fi

  sleep 2
done

echo "Stable listener pod and UID: ${listener_pod} (${listener_uid})"

echo
echo "=== EPHEMERAL RUNNER RESOURCES ==="
kubectl get \
  ephemeralrunnersets.actions.github.com,ephemeralrunners.actions.github.com \
  -n arc-runners-<<APP_SHORT_FORM>>-dev \
  -o wide || true

echo
echo "=== RUNNER PODS ==="
kubectl get pods -n arc-runners-<<APP_SHORT_FORM>>-dev -o wide

echo
echo "=== APPROVED ENVIRONMENT WORKERS ==="
kubectl get nodes \
  -l environment=dev,workload=github-runner \
  -o wide
REMOTE

Watch the stable listener and ephemeral runner pods in separate terminals:

sudo kubectl --kubeconfig=/etc/kubernetes/admin.conf get pods -n arc-systems -o wide --watch
sudo kubectl --kubeconfig=/etc/kubernetes/admin.conf get pods -n arc-runners-<<APP_SHORT_FORM>>-dev -o wide --watch

FROM-SCRATCH RUNNER SCALE-SET ACCEPTANCE CHECKPOINT

Repository runner scale set
  Tenant:                     <<TENANT_OR_PRODUCT_NAME>>
  GitHub organization:        <<GITHUB_ORGANIZATION>>
  Repository:                 <<REPOSITORY_NAME>>
  Deployment branch:          dev
  GitHub repository Environment: dev
  Runner namespace:           arc-runners-<<APP_SHORT_FORM>>-dev
  Kubernetes GitHub App secret: <<APP_SHORT_FORM>>-arc-ghapp-secret
  Helm release:               <<REPOSITORY_NAME>>-dev-arc
  Scale-set name / runs-on:   <<REPOSITORY_NAME>>-dev-arc
  Runner group:               Default
  Container mode:             dind
  Minimum idle runners:       0
  Maximum runners:            4
  AutoscalingRunnerSet:       present in arc-runners-<<APP_SHORT_FORM>>-dev
  AutoscalingListener:        present in arc-systems
  Listener pod:               same Pod UID remains Running and Ready in arc-systems
  Listener session conflict:  absent
  Idle runner pods at min=0:  none — expected
  Runner worker pool:         dev

Branch isolation
  dev workflows explicitly use runs-on: <<REPOSITORY_NAME>>-dev-arc
  The scale set itself does not inspect Git branches
  qa and prod require separate runner scale sets and different runs-on values

Recovery and validation
  Rendered Helm resources are checked with kubectl --dry-run=server
  The first internal install attempt is followed by an immediate listener probe
  An exact 409 or rapid same-name listener Pod UID replacement triggers scoped cleanup
  The role waits 120 seconds for the GitHub session lease to clear
  The role performs exactly one clean Helm retry in the same workflow run
  The same listener Pod UID must remain Running and Ready for at least 30 seconds
  The source workflow verifies the resulting scale set and listener after reconciliation

Security
  GitHub App private key committed: no
  Kubernetes Secret manifest committed: no
  GitHub App secret created at runtime in the runner namespace: yes
  Temporary private-key copies removed after reconciliation: yes
  Runner pods: ephemeral

Success rule
  Do not continue until the prod reconciliation succeeds, the same listener Pod UID remains
  Running and Ready without a 409 conflict, and the target repository smoke test completes on an ephemeral
  runner pod scheduled to the selected environment worker pool.

20. 🩺 Failure Handling

CHANGE PER GITHUB ORG | CHANGE PER REPOSITORY | CHANGE PER DEPLOYMENT BRANCH

Correct namespace, GitHub App, Helm, listener, branch mapping, or worker-placement problems without reinstalling the shared controller or rebuilding Kubernetes VMs.

No listener pod is created

sudo kubectl --kubeconfig=/etc/kubernetes/admin.conf get secret <<APP_SHORT_FORM>>-arc-ghapp-secret -n arc-runners-<<APP_SHORT_FORM>>-dev
sudo kubectl --kubeconfig=/etc/kubernetes/admin.conf get autoscalingrunnerset.actions.github.com <<REPOSITORY_NAME>>-dev-arc -n arc-runners-<<APP_SHORT_FORM>>-dev -o wide
sudo kubectl --kubeconfig=/etc/kubernetes/admin.conf get autoscalinglisteners.actions.github.com -n arc-systems -o wide
sudo kubectl --kubeconfig=/etc/kubernetes/admin.conf get pods -n arc-systems -o wide
sudo kubectl --kubeconfig=/etc/kubernetes/admin.conf get events -n arc-runners-<<APP_SHORT_FORM>>-dev --sort-by=.lastTimestamp
sudo kubectl --kubeconfig=/etc/kubernetes/admin.conf get events -n arc-systems --sort-by=.lastTimestamp
sudo /usr/local/bin/helm status <<REPOSITORY_NAME>>-dev-arc -n arc-runners-<<APP_SHORT_FORM>>-dev

Confirm the GitHub App secret exists in exactly the same namespace as the Helm release. The AutoscalingRunnerSet belongs to the runner namespace, while the AutoscalingListener and listener pod belong to the shared controller namespace.

The listener repeatedly becomes Running and then Error

sudo kubectl --kubeconfig=/etc/kubernetes/admin.conf logs \
  -n arc-systems \
  -l actions.github.com/scale-set-name=<<REPOSITORY_NAME>>-dev-arc \
  -c listener \
  --tail=200 \
  --prefix || true

sudo kubectl --kubeconfig=/etc/kubernetes/admin.conf get autoscalinglisteners.actions.github.com \
  -n arc-systems \
  -o wide

Look for RunnerScaleSetSessionConflictException or "already has an active session" in the listener logs. Even when those lines disappear before collection, repeated recreation of the same listener name with changing Pod UIDs is treated as the same recoverable stale-session pattern. The shared role removes only the affected scale set and listener resources, waits 120 seconds, performs exactly one clean retry, and requires the same replacement Pod UID to remain Running and Ready for 30 seconds. A second instability is a real failure and is not hidden by an unlimited retry loop.
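
The two recoverable signals described here, a session-conflict log line or a changed Pod UID under the same listener name, can be expressed as one classification step. The sketch below is illustrative and is not the role's code; the function name classify_listener_state and its output labels are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical classifier for the two recoverable listener signals.
# Arguments: previous Pod UID, current Pod UID.
# Listener log text is read from standard input.
# Prints one of: conflict, uid-replaced, stable.
classify_listener_state() {
  local previous_uid="$1" current_uid="$2" logs
  logs="$(cat)"

  if printf '%s\n' "${logs}" |
    grep -Eq 'RunnerScaleSetSessionConflictException|already has an active session'
  then
    # The log still shows the GitHub active-session conflict.
    echo "conflict"
  elif [ -n "${previous_uid}" ] && [ "${previous_uid}" != "${current_uid}" ]; then
    # Same listener name, different Pod UID: the pod was recreated.
    echo "uid-replaced"
  else
    echo "stable"
  fi
}
```

For example, piping the output of kubectl logs for the listener pod into classify_listener_state with the UID recorded at the previous probe and the UID read now would report which of the three states applies.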

The workflow remains queued with “Waiting for a runner”

Confirm the target workflow contains:
  runs-on: <<REPOSITORY_NAME>>-dev-arc

Then check:
  GitHub repository: <<GITHUB_ORGANIZATION>>/<<REPOSITORY_NAME>>
  Runner scale set:  <<REPOSITORY_NAME>>-dev-arc
  Source branch:     dev
  Namespace:         arc-runners-<<APP_SHORT_FORM>>-dev
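
The first check, that the workflow really targets the scale-set name, can be done locally before looking at the cluster. The following sketch assumes the runs-on value is written unquoted on a single line, as in the smoke-test workflow in Section 18; the function name workflow_targets_scale_set is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical check: succeed only when the workflow file has a line of the
# form "runs-on: <scale-set-name>" whose value is exactly the expected name.
# Limitation: quoted values or list-style runs-on entries are not recognised.
workflow_targets_scale_set() {
  local workflow_file="$1" scale_set="$2"
  awk -v want="${scale_set}" '
    $1 == "runs-on:" && $2 == want && NF == 2 { found = 1 }
    END { exit(found ? 0 : 1) }
  ' "${workflow_file}"
}
```

A non-zero result means the job is asking for a different label, which is enough on its own to leave the run waiting for a runner.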

Runner pods remain Pending

sudo kubectl --kubeconfig=/etc/kubernetes/admin.conf get pods -n arc-runners-<<APP_SHORT_FORM>>-dev -o wide
sudo kubectl --kubeconfig=/etc/kubernetes/admin.conf describe pods -n arc-runners-<<APP_SHORT_FORM>>-dev
sudo kubectl --kubeconfig=/etc/kubernetes/admin.conf get nodes -l environment=dev,workload=github-runner --show-labels
sudo kubectl --kubeconfig=/etc/kubernetes/admin.conf describe nodes -l environment=dev,workload=github-runner | grep -A5 -E 'Taints:|Allocated resources:'

Docker commands fail inside the runner

Confirm containerMode.type is dind, then inspect the runner pod and the Docker sidecar. Docker-in-Docker uses a privileged container and should be limited to these isolated CI workers.

The GitHub App secret is rejected

Re-run the prod workflow after confirming the organization environment contains ARC_GITHUB_APP_ID, ARC_GITHUB_APP_INSTALLATION_ID, and ARC_GITHUB_APP_PRIVATE_KEY. The role reconciles the Kubernetes Secret without committing it.

More than one scale-set workflow starts from one commit

Keep every scale-set workflow path filter limited to its namespace manifest, definition, values, playbook, and workflow file. Do not add broad helm/tenants/**, kubernetes/tenants/**, or ansible/** filters to repository-specific workflows.
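
A quick lint for the three broad filters named above can catch the problem before a commit fans out to several workflows. This is a sketch under the assumption that path filters are written one per line as YAML list items; the function name find_broad_path_filters is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical lint: print (with line numbers) any list item in a workflow
# file that is exactly one of the broad filters the page warns against.
# Returns 0 when a broad filter is found, 1 when the file is clean.
find_broad_path_filters() {
  local workflow_file="$1"
  local broad='(helm/tenants/\*\*|kubernetes/tenants/\*\*|ansible/\*\*)'
  # Match "- pattern" with optional single or double quotes around the value.
  grep -nE "^[[:space:]]*-[[:space:]]*[\"']?${broad}[\"']?[[:space:]]*\$" \
    "${workflow_file}"
}
```

Running it against each .github/workflows/arc-*.yml file and treating any output as a failure keeps repository-specific workflows narrowly scoped.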

21. 🧹 Uninstall One Repository Runner Scale Set and Recreate It

CHANGE PER REPOSITORY | CHANGE PER DEPLOYMENT BRANCH

Remove only the selected repository and environment scale set from Kubernetes. Preserve the shared ARC controller, the environment runner namespace, the shared GitHub App Secret, and unrelated scale sets.

Important shared-resource boundary: do not delete arc-runners-<<APP_SHORT_FORM>>-dev or <<APP_SHORT_FORM>>-arc-ghapp-secret. The environment namespace and GitHub App Secret may be shared by other repository scale sets. The command below removes only <<REPOSITORY_NAME>>-dev-arc, its listener, its ephemeral runner resources, and its remote working directory.

ssh \
  -i ~/.ssh/id_ed25519_ansible \
  -o IdentitiesOnly=yes \
  acllc@192.168.8.202 \
  'sudo bash -s' <<'REMOTE'
set -euo pipefail

export KUBECONFIG=/etc/kubernetes/admin.conf
HELM=/usr/local/bin/helm
RELEASE=<<REPOSITORY_NAME>>-dev-arc
SCALE_SET=<<REPOSITORY_NAME>>-dev-arc
RUNNER_NAMESPACE=arc-runners-<<APP_SHORT_FORM>>-dev
CONTROLLER_NAMESPACE=arc-systems
WORK_ROOT=/etc/ac-cicd-infra/arc-runner-scale-sets

echo "=== BEFORE: affected ${SCALE_SET} resources ==="
"${HELM}" status "${RELEASE}" -n "${RUNNER_NAMESPACE}" || true
kubectl get autoscalingrunnerset.actions.github.com \
  "${SCALE_SET}" \
  -n "${RUNNER_NAMESPACE}" \
  -o wide || true
kubectl get autoscalinglisteners.actions.github.com \
  -n "${CONTROLLER_NAMESPACE}" \
  -o name | grep -F "${SCALE_SET}" || true
kubectl get pods \
  -n "${CONTROLLER_NAMESPACE}" \
  -o name | grep -F "${SCALE_SET}" || true
kubectl get pods \
  -n "${RUNNER_NAMESPACE}" \
  -l "actions.github.com/scale-set-name=${SCALE_SET}" \
  -o name || true

echo
echo "=== Uninstall only the affected Helm release ==="
if "${HELM}" status "${RELEASE}" -n "${RUNNER_NAMESPACE}" >/dev/null 2>&1; then
  "${HELM}" uninstall \
    "${RELEASE}" \
    -n "${RUNNER_NAMESPACE}" \
    --wait \
    --timeout 5m
else
  echo "Helm release is already absent."
fi

echo
echo "=== Delete only lingering affected ARC resources ==="
kubectl delete autoscalingrunnerset.actions.github.com \
  "${SCALE_SET}" \
  -n "${RUNNER_NAMESPACE}" \
  --ignore-not-found=true \
  --wait=true \
  --timeout=3m

kubectl get autoscalinglisteners.actions.github.com \
  -n "${CONTROLLER_NAMESPACE}" \
  -o name 2>/dev/null | \
  grep -E "^autoscalinglistener\.actions\.github\.com/${SCALE_SET}-[a-z0-9]+-listener$" | \
  xargs -r kubectl delete \
    -n "${CONTROLLER_NAMESPACE}" \
    --wait=true \
    --timeout=3m || true

kubectl get pods \
  -n "${CONTROLLER_NAMESPACE}" \
  -o name 2>/dev/null | \
  grep -E "^pod/${SCALE_SET}-[a-z0-9]+-listener$" | \
  xargs -r kubectl delete \
    -n "${CONTROLLER_NAMESPACE}" \
    --wait=true \
    --timeout=2m || true

kubectl delete ephemeralrunnersets.actions.github.com \
  -n "${RUNNER_NAMESPACE}" \
  -l "actions.github.com/scale-set-name=${SCALE_SET}" \
  --ignore-not-found=true \
  --wait=true \
  --timeout=3m || true

kubectl delete ephemeralrunners.actions.github.com \
  -n "${RUNNER_NAMESPACE}" \
  -l "actions.github.com/scale-set-name=${SCALE_SET}" \
  --ignore-not-found=true \
  --wait=true \
  --timeout=3m || true

kubectl delete pods \
  -n "${RUNNER_NAMESPACE}" \
  -l "actions.github.com/scale-set-name=${SCALE_SET}" \
  --ignore-not-found=true \
  --wait=true \
  --timeout=2m || true

rm -rf "${WORK_ROOT}/${RELEASE}"

echo
echo "=== Verify the affected resources are absent ==="
for attempt in $(seq 1 90); do
  runner_set="$(
    kubectl get autoscalingrunnerset.actions.github.com \
      "${SCALE_SET}" \
      -n "${RUNNER_NAMESPACE}" \
      --ignore-not-found \
      -o name 2>/dev/null || true
  )"

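  # Use the same anchored patterns as the delete commands so that another
  # scale set whose name merely contains ${SCALE_SET} is not counted here.
  listeners="$(
    kubectl get autoscalinglisteners.actions.github.com \
      -n "${CONTROLLER_NAMESPACE}" \
      -o name 2>/dev/null | \
    grep -E "^autoscalinglistener\.actions\.github\.com/${SCALE_SET}-[a-z0-9]+-listener$" || true
  )"

  listener_pods="$(
    kubectl get pods \
      -n "${CONTROLLER_NAMESPACE}" \
      -o name 2>/dev/null | \
    grep -E "^pod/${SCALE_SET}-[a-z0-9]+-listener$" || true
  )"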

  runner_pods="$(
    kubectl get pods \
      -n "${RUNNER_NAMESPACE}" \
      -l "actions.github.com/scale-set-name=${SCALE_SET}" \
      -o name 2>/dev/null || true
  )"

  if [ -z "${runner_set}" ] && \
    [ -z "${listeners}" ] && \
    [ -z "${listener_pods}" ] && \
    [ -z "${runner_pods}" ]
  then
    echo "All ${SCALE_SET} Kubernetes resources are absent."
    break
  fi

  if [ "${attempt}" -eq 90 ]; then
    echo "ERROR: Affected resources still exist after the cleanup timeout."
    exit 1
  fi

  sleep 2
done

echo
echo "=== Preserve shared <<APP_SHORT_FORM>> dev resources ==="
kubectl get namespace "${RUNNER_NAMESPACE}"
kubectl get secret <<APP_SHORT_FORM>>-arc-ghapp-secret -n "${RUNNER_NAMESPACE}"
kubectl get autoscalingrunnerset.actions.github.com \
  -n "${RUNNER_NAMESPACE}" \
  -o wide || true

echo
echo "Do not delete ${RUNNER_NAMESPACE} or <<APP_SHORT_FORM>>-arc-ghapp-secret."
echo "They may be shared with other repository scale sets in the same product and environment."

echo
echo "Waiting 120 seconds for the GitHub message-session lease to clear..."
sleep 120

echo "Scoped uninstall and session-release wait completed."
REMOTE

After this cleanup, keep the Git-managed namespace manifest, scale-set definition, Helm values, playbook, workflow, target repository Environment, and target smoke-test workflow. They are the desired-state source used to recreate the scale set.
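
The delete commands in the cleanup script select listener resources with an anchored pattern rather than a substring match. The sketch below illustrates why that matters, using made-up resource names; the function name filter_scale_set_listeners is hypothetical, and it assumes scale-set names contain no regular-expression metacharacters.

```shell
#!/usr/bin/env bash
# Hypothetical filter: read "kubectl get ... -o name" style lines on standard
# input and keep only listener resources that belong to exactly the given
# scale set. A plain substring match would also keep names such as
# "pod/legacy-<scale-set>-...", which belong to a different scale set.
filter_scale_set_listeners() {
  local scale_set="$1"
  grep -E "^(pod|autoscalinglistener\.actions\.github\.com)/${scale_set}-[a-z0-9]+-listener$" || true
}
```

With input lines pod/api-dev-arc-6d8f9c7b-listener and pod/legacy-api-dev-arc-12ab34cd-listener, filtering for api-dev-arc keeps only the first, so a scoped cleanup cannot touch the second scale set's listener.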

To adopt the corrected common role now, update the files in Section 11 and commit them using Section 15. Because common role paths intentionally do not trigger every repository workflow, manually dispatch this repository workflow on dev for Section 16, promote the change to prod, and manually dispatch it again for Section 17. Once the corrected role and existing repository files are already present on prod, a future cluster-only uninstall requires restarting at Section 17. Sections 8 through 14 do not need to be repeated unless their source files or GitHub mappings were deleted.

22. 🏁 From-Scratch Rebuild Checkpoint After This Page

COMMON · NO CHANGE NEEDED | CHANGE PER GITHUB ORG | CHANGE PER REPOSITORY | CHANGE PER DEPLOYMENT BRANCH

Use this acceptance checkpoint after executing the page on a newly rebuilt cluster. It records the resources that must now exist before additional repositories, environments, or application workflows are added.

FROM-SCRATCH REBUILD CHECKPOINT AFTER THIS PAGE

Expected shared platform
  Load balancer, API VIP, control planes, and all worker pools operational
  Shared ARC controller deployed and healthy
  GitHub organization App and protected credential environment verified

Expected repository/environment resources
  Environment runner namespace created
  Runtime GitHub App Kubernetes Secret created in that namespace
  Repository/environment Helm release deployed
  AutoscalingRunnerSet present in the runner namespace
  AutoscalingListener and same-Pod-UID stable listener present in arc-systems
  minRunners=0 and maxRunners=4 applied
  Docker-in-Docker container mode applied
  Repository Environment and runs-on mapping created
  Target repository smoke test completed successfully

Clean-rebuild consistency
  No namespace, Secret, Helm release, listener, or runner pod is assumed to survive
  from a previous cluster. Every resource above must be recreated and verified by
  following this page after the Kubernetes VMs are rebuilt.

Next documentation step
  Repeat the repository and deployment-environment sections for additional scale sets,
  then add the application build-and-deploy workflow.