
TidyShelves – Configuration System Design (50M+ Ready)

Overview

This document defines the final configuration system design for TidyShelves, built to scale safely from 0 → 50M+ users.

The system is designed to:

  • Eliminate hardcoding across server and iOS
  • Be offline-first
  • Avoid refetch storms at scale
  • Serve reads from cache, not the database
  • Support per-device log escalation
  • Use POST-only APIs
  • Be incrementally adoptable (in-memory → Redis → event-driven)

This document specifies the design; it does not prescribe an implementation.


Design Principles

  • Server-driven configuration (the server decides values; clients only apply them)
  • Version-first, fetch-on-change
  • Cache-first reads
  • Sparse overrides (most devices have none)
  • POST-only APIs
  • Backward-compatible evolution
  • Operational safety built-in

Non-Negotiable Scale Requirements (50M+)

  1. Global config changes must not cause fleet-wide refetch storms
  2. Per-device changes must affect only that device
  3. Database must not be on the hot read path
  4. Clients fetch config only when versions change
  5. App must function fully using cached config when offline
  6. Config response shape must be versioned

Data Model

1. Config Key Catalog

config_keys

  • id (PK)
  • key (unique)
  • data_type (string | int | bool | enum | json)
  • allowed_values (JSON array string, nullable)
  • default_value (stored as a string; JSON-encoded when data_type is json)
  • description

This table is small and stable (hundreds of rows at most).
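A minimal sketch of the "typed and validated" rule, in Python for illustration: it decodes a stored string according to data_type and checks the result against allowed_values. The ConfigKey class, the parse_value function, and the convention that bools are stored as "true"/"false" are assumptions, not part of this design.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class ConfigKey:
    """One row of config_keys (hypothetical in-memory shape)."""
    key: str
    data_type: str                 # "string" | "int" | "bool" | "enum" | "json"
    allowed_values: Optional[str]  # JSON array string, or None
    default_value: str             # stored as a string; JSON-encoded for json


def parse_value(spec: ConfigKey, raw: str) -> Any:
    """Decode a stored string into a typed value; raise ValueError if invalid."""
    if spec.data_type in ("string", "enum"):
        value: Any = raw
    elif spec.data_type == "int":
        value = int(raw)  # ValueError on non-integers
    elif spec.data_type == "bool":
        if raw not in ("true", "false"):  # assumed encoding for bools
            raise ValueError(f"{spec.key}: expected 'true' or 'false', got {raw!r}")
        value = raw == "true"
    elif spec.data_type == "json":
        value = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
    else:
        raise ValueError(f"{spec.key}: unknown data_type {spec.data_type!r}")

    if spec.allowed_values is not None:
        allowed = json.loads(spec.allowed_values)
        if value not in allowed:
            raise ValueError(f"{spec.key}: {value!r} is not in {allowed}")
    elif spec.data_type == "enum":
        raise ValueError(f"{spec.key}: enum keys require allowed_values")
    return value
```

Running a check like this on every default_value at seed time, and on every override before it is written, keeps invalid values off the read path entirely.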


2. Global Overrides

global_config

  • config_key_id (unique)
  • value
  • updated_at

Overrides a key's default_value for all devices; keys without a row here fall back to config_keys.default_value.


3. Device Overrides (Sparse)

device_config

  • device_id
  • config_key_id
  • value
  • updated_at

Constraints:

  • UNIQUE(device_id, config_key_id)
  • Indexed by device_id (the UNIQUE(device_id, config_key_id) index can serve this, since device_id is its leading column)

Only a small percentage of devices will have overrides (e.g., log escalation).
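The three tables combine into one effective config per device. A minimal sketch, assuming the precedence default_value < global_config < device_config (implied by "overrides" but not spelled out here) and that overrides for keys missing from the catalog are ignored. Splitting the merge in two mirrors the GlobalSnapshot idea: the global part is computed once per globalVersion, and the per-device step is usually a plain copy because most devices have no overrides. Function names are hypothetical.

```python
from typing import Dict, Mapping, Optional


def build_global_snapshot(
    defaults: Mapping[str, str], global_overrides: Mapping[str, str]
) -> Dict[str, str]:
    """Catalog defaults with global_config applied; computed once per globalVersion."""
    merged = dict(defaults)
    for key, value in global_overrides.items():
        if key in merged:  # ignore overrides for keys not in config_keys
            merged[key] = value
    return merged


def effective_for_device(
    global_keys: Mapping[str, str], device_overrides: Optional[Mapping[str, str]]
) -> Dict[str, str]:
    """Apply sparse device_config rows on top of the global snapshot."""
    if not device_overrides:  # the common case: no overrides for this device
        return dict(global_keys)
    merged = dict(global_keys)
    for key, value in device_overrides.items():
        if key in merged:
            merged[key] = value
    return merged
```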


Versioning Strategy (Critical for Scale)

Global Version

conf_global_version

  • Single row
  • Bumped only when:
    • config_keys changes
    • global_config changes

Device Version (Sparse)

conf_device_version

  • At most one row per device, created only when that device needs one (e.g., on its first override)
  • Bumped only when that device’s effective output changes

❗ Do not bump for internal bookkeeping (e.g., decrementing counters).
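One way to enforce this rule is to compare the device's effective output before and after each write and bump conf_device_version only when they differ. A minimal sketch; DeviceState and set_device_override are hypothetical names, and the in-memory state stands in for the device_config and conf_device_version rows.

```python
from dataclasses import dataclass, field
from typing import Dict, Mapping, Optional


@dataclass
class DeviceState:
    overrides: Dict[str, str] = field(default_factory=dict)  # device_config rows
    version: int = 0                                          # conf_device_version


def _effective(global_keys: Mapping[str, str], overrides: Mapping[str, str]) -> Dict[str, str]:
    return {**global_keys, **overrides}


def set_device_override(
    state: DeviceState, global_keys: Mapping[str, str], key: str, value: Optional[str]
) -> bool:
    """Set (or clear, when value is None) one override; return True if the version was bumped."""
    before = _effective(global_keys, state.overrides)
    if value is None:
        state.overrides.pop(key, None)
    else:
        state.overrides[key] = value
    if _effective(global_keys, state.overrides) != before:
        state.version += 1  # effective output changed: the device must refetch
        return True
    return False  # same output (e.g. override equals the global value): no bump
```

In a real store the read, write, and bump would need to run in one transaction so that concurrent writers cannot skip or double-count a bump.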


(Future) Tenant Version

conf_tenant_version

  • Per-tenant versioning
  • Enables tenant-scoped config without waking all devices

Caching Strategy

Layer 1: In-Process Memory Cache

Each service instance caches:

  • GlobalSnapshot = { globalVersion, mergedKeys }

TTL: 30–120 seconds (or event-driven invalidation).
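A minimal sketch of this layer, assuming a loader callable that reads the snapshot from Redis or the database. SnapshotCache and its methods are hypothetical; the injectable clock exists only to make expiry testable.

```python
import threading
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class GlobalSnapshot:
    global_version: int
    merged_keys: Dict[str, str]


class SnapshotCache:
    """Per-process GlobalSnapshot cache with a TTL plus explicit invalidation."""

    def __init__(
        self,
        loader: Callable[[], GlobalSnapshot],  # reads from Redis or the DB
        ttl_seconds: float = 60.0,             # within the 30 to 120 s range above
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._snapshot: Optional[GlobalSnapshot] = None
        self._loaded_at = 0.0

    def get(self) -> GlobalSnapshot:
        # The lock ensures only one caller reloads on expiry; others wait for it.
        with self._lock:
            now = self._clock()
            if self._snapshot is None or now - self._loaded_at >= self._ttl:
                self._snapshot = self._loader()
                self._loaded_at = now
            return self._snapshot

    def invalidate(self) -> None:
        """Drop the snapshot so the next get() reloads (event-driven path)."""
        with self._lock:
            self._snapshot = None
```

Holding the lock during a reload makes readers wait briefly once per TTL; a variant could keep serving the stale snapshot while one thread reloads.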


Layer 2: Distributed Cache (Redis)

Redis becomes the primary read source.

Recommended keys:

  • conf:global:version
  • conf:global:snapshot
  • conf:device:version:<deviceId>
  • conf:device:overrides:<deviceId>
  • (future) conf:tenant:*

Database remains the system of record.
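A minimal read-through sketch for the per-device overrides key, assuming a client whose get/set behave like redis-py's Redis.get and Redis.set(..., ex=...) with decode_responses=True. KeyValueStore, DictStore, get_device_overrides, and the one-hour TTL are hypothetical; DictStore exists only so the sketch runs without a Redis server.

```python
import json
from typing import Callable, Dict, Optional, Protocol


class KeyValueStore(Protocol):
    def get(self, name: str) -> Optional[str]: ...
    def set(self, name: str, value: str, ex: Optional[int] = None) -> object: ...


class DictStore:
    """In-memory stand-in for Redis (ignores expiry); for local runs only."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}

    def get(self, name: str) -> Optional[str]:
        return self.data.get(name)

    def set(self, name: str, value: str, ex: Optional[int] = None) -> object:
        self.data[name] = value
        return True


def device_overrides_key(device_id: str) -> str:
    return f"conf:device:overrides:{device_id}"


def get_device_overrides(
    store: KeyValueStore,
    device_id: str,
    load_from_db: Callable[[str], Dict[str, str]],  # system-of-record query
    ttl_seconds: int = 3600,
) -> Dict[str, str]:
    cached = store.get(device_overrides_key(device_id))
    if cached is not None:
        return json.loads(cached)
    overrides = load_from_db(device_id)
    # Cache empty results too: most devices have no overrides, and without this
    # every one of them would fall through to the database on each read.
    store.set(device_overrides_key(device_id), json.dumps(overrides), ex=ttl_seconds)
    return overrides
```

Writers should refresh or delete conf:device:overrides:&lt;deviceId&gt; whenever they bump conf:device:version:&lt;deviceId&gt;; otherwise a device could see the new version but read stale overrides until the TTL expires.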


API Design (POST-Only)

1. Version Check

POST /conf/mobile/config/version

Returns:

  • configSchemaVersion
  • globalVersion
  • deviceVersion
  • tenantVersion
  • shouldFetchEffectiveConfig
  • useCachedConfigOnly (ops safety switch)

Used on:

  • app launch
  • foreground (debounced)
  • after sign-in
  • after error/fatal log upload
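The version check can answer without touching the database: it compares the versions the client reports with the server's cached versions. A minimal sketch; the request fields, the function name, and the convention that a device without a conf_device_version row is treated as version 0 are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

CONFIG_SCHEMA_VERSION = 1  # hypothetical current response-shape version


@dataclass(frozen=True)
class VersionCheckRequest:
    """Hypothetical request body: what the client has cached."""
    device_id: str
    config_schema_version: int
    global_version: int
    device_version: int
    tenant_version: Optional[int] = None


def version_check(
    req: VersionCheckRequest,
    *,
    global_version: int,                   # from the in-memory snapshot or Redis
    device_version: int,                   # 0 if the device has no version row (assumption)
    tenant_version: Optional[int] = None,
    use_cached_config_only: bool = False,  # ops safety switch
) -> Dict[str, Any]:
    changed = (
        req.config_schema_version != CONFIG_SCHEMA_VERSION
        or req.global_version != global_version
        or req.device_version != device_version
        or req.tenant_version != tenant_version
    )
    return {
        "configSchemaVersion": CONFIG_SCHEMA_VERSION,
        "globalVersion": global_version,
        "deviceVersion": device_version,
        "tenantVersion": tenant_version,
        # The safety switch wins: clients keep their cache even if versions moved.
        "shouldFetchEffectiveConfig": changed and not use_cached_config_only,
        "useCachedConfigOnly": use_cached_config_only,
    }
```

Because the comparison needs only a few integers, the common "nothing changed" response stays small and cheap to serve.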

2. Fetch Effective Config

POST /conf/mobile/config/effective

Returns:

  • configSchemaVersion and the current version values (global, device, tenant)
  • keys as a dictionary (key → value map)

Supports optional:

  • groups (e.g., brand, log, invite)
  • requestId (idempotency)
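A sketch of the response assembly, assuming a group name selects keys by prefix (brand selects brand.*) and that an absent or empty groups list means all keys. The function names are hypothetical, and requestId handling is left out.

```python
from typing import Any, Dict, Iterable, Mapping, Optional


def select_groups(
    keys: Mapping[str, str], groups: Optional[Iterable[str]]
) -> Dict[str, str]:
    """Keep only keys whose prefix matches a requested group (all keys if none)."""
    if not groups:
        return dict(keys)
    prefixes = tuple(f"{group}." for group in groups)  # "brand" -> "brand."
    return {k: v for k, v in keys.items() if k.startswith(prefixes)}


def effective_response(
    effective_keys: Mapping[str, str],
    *,
    config_schema_version: int,
    global_version: int,
    device_version: int,
    tenant_version: Optional[int] = None,
    groups: Optional[Iterable[str]] = None,
) -> Dict[str, Any]:
    return {
        "configSchemaVersion": config_schema_version,
        "globalVersion": global_version,
        "deviceVersion": device_version,
        "tenantVersion": tenant_version,
        "keys": select_groups(effective_keys, groups),
    }
```

Appending the dot to each prefix keeps a group such as brand from accidentally matching an unrelated key like branding.color.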

Client Behavior (Offline-First)

Local Storage

Clients store:

  • cached config map
  • version values
  • schema version
  • timestamp
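A minimal sketch of the cached record and its persistence, written in Python for illustration; an iOS client would store the same fields with its own storage APIs. The CachedConfig fields mirror the list above; the file layout, the function names, and the write-then-rename approach are assumptions.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Dict, Optional


@dataclass
class CachedConfig:
    keys: Dict[str, str]                 # cached config map
    config_schema_version: int           # schema version
    global_version: int                  # version values
    device_version: int
    tenant_version: Optional[int] = None
    fetched_at: float = field(default_factory=time.time)  # timestamp (Unix seconds)


def save_cached_config(cache: CachedConfig, path: Path) -> None:
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(asdict(cache)), encoding="utf-8")
    tmp.replace(path)  # rename over the old file (atomic on POSIX filesystems)


def load_cached_config(path: Path) -> Optional[CachedConfig]:
    """Return the cached record, or None if there is no usable cache yet."""
    try:
        return CachedConfig(**json.loads(path.read_text(encoding="utf-8")))
    except (FileNotFoundError, json.JSONDecodeError, TypeError):
        return None
```

Writing to a temporary file and renaming it means a crash mid-write leaves the previous cache intact, which matters because this file is the only config source while offline.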

Refresh Algorithm

  1. If offline → use cached config
  2. If online → POST version check
  3. If useCachedConfigOnly=true → stop
  4. If versions changed → fetch effective config

Most version checks are cheap and report no change.
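The refresh algorithm, sketched as a client class in Python for illustration (an iOS client would follow the same steps in Swift). It also applies the foreground debounce from the version-check triggers. The callables, request bodies, the 30-second debounce, and treating network errors like being offline are assumptions; persisting the updated cache is omitted.

```python
import time
from typing import Any, Callable, Dict

Json = Dict[str, Any]


class ConfigRefresher:
    """Client refresh loop: offline-first, version-first, fetch-on-change."""

    def __init__(
        self,
        is_online: Callable[[], bool],
        post_version_check: Callable[[Json], Json],  # POST /conf/mobile/config/version
        post_effective: Callable[[Json], Json],      # POST /conf/mobile/config/effective
        cache: Json,                                 # last good config + versions
        debounce_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._is_online = is_online
        self._post_version_check = post_version_check
        self._post_effective = post_effective
        self.cache = cache
        self._debounce = debounce_seconds
        self._clock = clock
        self._last_check = float("-inf")

    def _versions(self) -> Json:
        names = ("configSchemaVersion", "globalVersion", "deviceVersion", "tenantVersion")
        return {name: self.cache.get(name) for name in names}

    def refresh(self, trigger: str = "launch") -> Json:
        """Return the config map to use now; trigger is e.g. "launch" or "foreground"."""
        now = self._clock()
        if trigger == "foreground" and now - self._last_check < self._debounce:
            return self.cache["keys"]  # debounced foreground event
        if not self._is_online():
            return self.cache["keys"]  # 1. offline: use cached config
        self._last_check = now
        try:
            status = self._post_version_check(self._versions())  # 2. version check
            if status["useCachedConfigOnly"]:
                return self.cache["keys"]  # 3. ops switch: stop here
            if status["shouldFetchEffectiveConfig"]:  # 4. versions changed
                self.cache = self._post_effective(self._versions())
        except OSError:
            pass  # network failure: behave as if offline
        return self.cache["keys"]
```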


Device Log Escalation

Purpose

Temporarily increase logging for devices that encounter errors, without affecting others.


Config Keys

  • log.level.global
  • log.escalation.policy
  • log.level.device_override

Lifecycle

  1. Device sends error or fatal log
  2. Server applies escalation policy:
    • sets log.level.device_override
    • bumps deviceVersion only
  3. On its next version check, the device sees the new deviceVersion, fetches effective config, and enables debug logging
  4. Auto fallback when:
    • max launches reached or
    • time window expires
  5. Server clears override and bumps deviceVersion again

Important Rules

  • Decrement launch counters server-side
  • Do not bump deviceVersion on each decrement
  • Bump only when effective output changes
  • APNs is optional: escalation must work without push, because the device picks up the new deviceVersion on its next version check
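A minimal server-side sketch of this lifecycle, assuming the escalation policy carries a target level, a launch budget, and a time window, and that launches are counted from the version check made at app launch. EscalationPolicy, DeviceLogState, both handlers, and the default numbers are hypothetical; the state stands in for the device_config row, a bookkeeping record, and conf_device_version.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EscalationPolicy:
    """Hypothetical shape of log.escalation.policy."""
    level: str = "debug"
    max_launches: int = 3
    window_seconds: float = 24 * 3600.0


@dataclass
class DeviceLogState:
    override_level: Optional[str] = None  # log.level.device_override (None = no row)
    launches_left: int = 0                # bookkeeping only, not config output
    expires_at: float = 0.0               # bookkeeping only
    device_version: int = 0               # conf_device_version


def on_error_log(state: DeviceLogState, policy: EscalationPolicy, now: float) -> None:
    """Steps 1-2: apply the policy when an error or fatal log arrives."""
    previous = state.override_level
    state.override_level = policy.level
    state.launches_left = policy.max_launches
    state.expires_at = now + policy.window_seconds
    if state.override_level != previous:
        state.device_version += 1  # effective output changed
    # Re-escalating an already-escalated device only resets counters: no bump.


def on_launch(state: DeviceLogState, now: float) -> None:
    """Steps 4-5: called from the version check made at app launch."""
    if state.override_level is None:
        return
    if state.launches_left <= 0 or now >= state.expires_at:
        state.override_level = None  # auto fallback: clear the override
        state.device_version += 1    # effective output changed again
    else:
        state.launches_left -= 1     # bookkeeping: no version bump
```

Calling on_launch before building the version-check response lets the device learn about the fallback in the same request.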

Schema Versioning

configSchemaVersion is returned with every config response.

Purpose:

  • Allows evolving response shape safely
  • Prevents older clients from misinterpreting payloads
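On the client, configSchemaVersion can gate parsing: the app parses only shapes it knows and otherwise keeps its last good config, which the offline-first design already supports. A minimal sketch; the parser registry, the v1 shape, and the function names are assumptions.

```python
from typing import Any, Callable, Dict, Mapping

Json = Dict[str, Any]


def _parse_v1(payload: Mapping[str, Any]) -> Json:
    """Schema v1 (hypothetical): a flat "keys" dictionary."""
    return dict(payload["keys"])


# Response shapes this app build understands. Later builds add entries
# here; older builds never learn about newer versions and ignore them.
PARSERS: Dict[Any, Callable[[Mapping[str, Any]], Json]] = {1: _parse_v1}


def parse_config_payload(payload: Mapping[str, Any], last_good: Json) -> Json:
    """Parse a config response, or keep last_good if its shape is unknown."""
    parser = PARSERS.get(payload.get("configSchemaVersion"))
    if parser is None:
        return last_good  # unknown shape: never guess at the layout
    return parser(payload)
```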

Seed Keys (Summary)

Categories:

  • Branding (brand.*)
  • App & Links (app.*, web.*)
  • Feature flags (feature.*)
  • Invitation policy (invite.*)
  • Mobile & Sync (mobile.*, sync.*)
  • Logging (log.*)

All keys:

  • Typed
  • Validated
  • Server-driven
  • Nullable overrides

Implementation Order

  1. Seed config_keys with strict validation
  2. Add version tables (global + device)
  3. Add in-memory cache
  4. Add Redis cache
  5. Implement POST-only config APIs
  6. Refactor server services to consume config
  7. Implement log escalation
  8. Update iOS to consume config
  9. (Optional) Add event-driven invalidation

Azure Services Guidance

Redis (Azure Cache for Redis)

Used for:

  • global snapshot
  • device versions
  • device overrides

Recommended once services scale horizontally, so that all instances share one cache instead of each instance loading config from the database.


Kafka (Azure Event Hubs – Kafka Endpoint)

Optional. Used only for:

  • config change events
  • cache invalidation
  • audit streams

Not required for config reads.
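If change events are adopted, each service instance could consume them and drop its cached state. A minimal sketch of the message handler only, with no Kafka client code: the event types and fields (global_changed, device_changed, globalVersion, deviceId) and the class name are hypothetical. Because consumers of such streams can see the same message more than once, the handler ignores global events whose version it has already processed.

```python
import json
from typing import Callable


class InvalidationHandler:
    """Applies config change events to local caches; safe to call with duplicates."""

    def __init__(
        self,
        invalidate_global: Callable[[], None],     # e.g. drop the in-process snapshot
        invalidate_device: Callable[[str], None],  # e.g. drop cached device entries
    ) -> None:
        self._invalidate_global = invalidate_global
        self._invalidate_device = invalidate_device
        self._last_global_version = 0

    def handle(self, raw: bytes) -> None:
        event = json.loads(raw)  # json.loads accepts bytes on Python 3.6+
        if event.get("type") == "global_changed":
            version = int(event["globalVersion"])
            if version > self._last_global_version:  # skip redeliveries and stale events
                self._last_global_version = version
                self._invalidate_global()
        elif event.get("type") == "device_changed":
            self._invalidate_device(str(event["deviceId"]))
        # Unknown event types are ignored so producers can add new ones safely.
```

Keeping the TTL alongside events means a lost event only delays a change until the snapshot expires, rather than leaving it stale indefinitely.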


Final Sanity Check (50M+)

This system scales because:

  • Versions prevent storms
  • Device overrides are isolated
  • Cache handles reads
  • DB is protected
  • Offline clients remain functional
  • Ops controls exist for emergencies

This design is production-ready, evolvable, and safe for very large scale.