# The AntiAtropos Reward Function

A physics-grounded, multi-scale control signal for Autonomous SRE, built on Lyapunov stability theory, graph-aware energy, three-tier cost economics, and smooth preventive SLA gradients.

---

## Architecture at a Glance

The reward operates at three temporal scales:

| Scale | Purpose | Consumer |
|-------|---------|----------|
| **Per-step** | Immediate feedback on stability, cost, SLA, action quality | LLM agent |
| **Per-episode** | Holistic grading across uptime, cost, stability | Leaderboard |
| **Cross-episode** | Sustained-excellence bonuses | Leaderboard |

No single term dominates: a healthy cluster with wasted capacity scores poorly, as does a cheap cluster with burning queues.

---

## Layer 1: Lyapunov Graph Energy

```
V_graph(s) = Σ_i w_i · Q_i² + edge_weight · Σ_{(i,j)∈E} |Q_i − Q_j|
```

**Node energy**: Squared queue depths weighted by business importance (node-0 VIP = 2×). Squaring penalizes load concentration: one node at Q=100 is far worse than five at Q=20.

**Edge imbalance**: Penalizes flow mismatch across DAG edges. If a parent has deep queues but its child is idle, the edge term fires even though the child's individual energy is zero. This gives the agent gradient signal to **balance load across topology**, not just minimize individual queues.

---

## Layer 2: Reward Composition

```
R_t = −(α·ΔV + β·Cost + γ·SLA + δ·Barrier)
```

### ΔV: Lyapunov Drift (α = 0.002)

One-step change in cluster energy. Negative = stabilizing. Positive = destabilizing. The small weight makes it a **directional nudge**, not a sledgehammer. Grounded in Neely's Drift-Plus-Penalty framework, which guarantees queue stability with bounded average cost.
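
The graph energy and the drift term can be sketched in a few lines of Python. This is a minimal illustration rather than the project's implementation: the function names, the node names, the unit weights, and `edge_weight = 1.0` are assumptions made for the example (the text does not state the edge weight).

```python
from typing import Dict, List, Tuple

ALPHA = 0.002  # drift weight quoted in the text


def graph_energy(
    queues: Dict[str, float],
    weights: Dict[str, float],
    edges: List[Tuple[str, str]],
    edge_weight: float = 1.0,  # assumed value; the text does not specify it
) -> float:
    """V_graph = sum_i w_i * Q_i^2 + edge_weight * sum_(i,j) |Q_i - Q_j|."""
    node_term = sum(weights[n] * q * q for n, q in queues.items())
    edge_term = sum(abs(queues[i] - queues[j]) for i, j in edges)
    return node_term + edge_weight * edge_term


def lyapunov_drift(v_prev: float, v_next: float) -> float:
    """One-step change in energy; negative means the cluster is stabilizing."""
    return v_next - v_prev


# Hypothetical five-node cluster with unit weights and no edges.
nodes = [f"node-{i}" for i in range(5)]
unit = {n: 1.0 for n in nodes}
concentrated = {n: 0.0 for n in nodes}
concentrated["node-0"] = 100.0
spread = {n: 20.0 for n in nodes}

v_concentrated = graph_energy(concentrated, unit, [])  # 100^2 = 10000
v_spread = graph_energy(spread, unit, [])              # 5 * 20^2 = 2000

# A parent with a deep queue feeding an idle child: the edge term adds |50 - 0|.
parent_child = {n: 0.0 for n in nodes}
parent_child["node-1"] = 50.0
v_edge = graph_energy(parent_child, unit, [("node-1", "node-2")])  # 2500 + 50

# Moving from the concentrated state to the spread state is stabilizing.
drift = lyapunov_drift(v_concentrated, v_spread)  # -8000
drift_penalty = ALPHA * drift                     # enters R_t with a minus sign
```

Because the drift term enters `R_t` with a minus sign, a negative drift raises the reward, which is the "directional nudge" described above.
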
### Cost: Three-Tier Infrastructure Model (β = 1.5)

```
needed = ⌈incoming_rate / 15⌉
```

| Tier | Capacity range (units per node) | Rate | Rationale |
|------|-------|------|-----------|
| **Baseline** | ≤ DEFAULT_CAPACITY (3) | $0.05/unit/hr | Already provisioned: sunk cost, no penalty |
| **Justified** | >3, ≤ needed | $0.20/unit/hr (4×) | Extra capacity serving traffic: defensible spend |
| **Idle waste** | > needed | $1.00/unit/hr (20×) | Capacity sitting idle: pure waste |

A naive two-tier model (cheap ≤ needed, expensive > needed) penalizes baseline capacity as "overprovisioned" because `needed` can be 1 while DEFAULT_CAPACITY is 3. The three-tier model recognizes that **baseline infrastructure is already paid for**, so only agent-added capacity triggers premium pricing.

Baseline cost: 5 nodes × 3 units × $0.05 = **$0.75/hr**. Scaling one node to capacity=6 (3 justified + 0 idle) costs $0.75 + 3 × $0.20 = $0.75 + $0.60 = $1.35/hr.

### SLA: Smooth Preventive Penalty (γ = 4.0)

```
sla = max(σ(latency, threshold=0.20, temp=0.03),
          σ(errors,  threshold=0.05, temp=0.01))
```

Dual sigmoids with a **max** operator: the worst dimension dominates. Unlike binary penalties (0 or 1), the sigmoid provides **gradient before the violation**, enabling the agent to learn preventive scaling. Asymmetric temperatures reflect operational reality: latency degrades gradually (wide band), error rates spike sharply (narrow band).

### Barrier: Control-Barrier Function (δ = 0.1)

```
H(s) = Σ_i max(0, Q_i − 150)² / 10000
```

Zero below Q=150, quadratic above. Creates a **hard danger zone** near catastrophic failure (Q=200). Architecturally distinct from SLA: SLA says "pay attention," barrier says "act now or the node dies." Layered defense at different urgency levels.

---

## Layer 3: Sigmoid Normalization

```
reward_01 = σ(raw_reward, midpoint=−3.0, temperature=2.0)
```

Maps raw reward (always negative) to [0, 1] for the LLM. The **midpoint = −3.0** centers the sigmoid where rewards actually cluster (≈ −1 to −8).
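
The per-step pieces from Layers 2 and 3 can be combined in a short sketch. It assumes a standard logistic form for σ, latency measured in seconds, and that units above `max(needed, DEFAULT_CAPACITY)` count as idle; none of these are stated outright in the text, and all function names are hypothetical.

```python
import math

DEFAULT_CAPACITY = 3
ALPHA, BETA, GAMMA, DELTA = 0.002, 1.5, 4.0, 0.1  # weights quoted in the text


def logistic(z: float) -> float:
    """Numerically stable logistic function (assumed form of sigma)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def node_cost(capacity: int, incoming_rate: float) -> float:
    """Hourly cost of one node under the three-tier model."""
    needed = math.ceil(incoming_rate / 15)
    baseline = min(capacity, DEFAULT_CAPACITY)
    justified = max(0, min(capacity, needed) - DEFAULT_CAPACITY)
    # Assumption: units above max(needed, DEFAULT_CAPACITY) are idle waste.
    idle = max(0, capacity - max(needed, DEFAULT_CAPACITY))
    return 0.05 * baseline + 0.20 * justified + 1.00 * idle


def sla_penalty(latency: float, error_rate: float) -> float:
    """Worst of the two sigmoids; latency in seconds (assumed), errors as a fraction."""
    return max(
        logistic((latency - 0.20) / 0.03),
        logistic((error_rate - 0.05) / 0.01),
    )


def barrier(queues) -> float:
    """Zero below Q=150, quadratic above, scaled by 1/10000."""
    return sum(max(0.0, q - 150.0) ** 2 for q in queues) / 10000.0


def step_reward(delta_v: float, cost: float, sla: float, barrier_value: float) -> float:
    """R_t = -(alpha*dV + beta*Cost + gamma*SLA + delta*Barrier)."""
    return -(ALPHA * delta_v + BETA * cost + GAMMA * sla + DELTA * barrier_value)


def normalize(raw: float) -> float:
    """Sigmoid with midpoint -3.0 and temperature 2.0."""
    return logistic((raw - (-3.0)) / 2.0)


# Five nodes at default capacity, each needing one unit.
baseline_cost = sum(node_cost(3, 15.0) for _ in range(5))
# One node scaled to capacity 6 with all six units needed.
scaled_cost = node_cost(6, 90.0) + 4 * node_cost(3, 15.0)
# Quiet cluster: no drift, low latency, no errors, shallow queues.
raw = step_reward(0.0, baseline_cost, sla_penalty(0.05, 0.0), barrier([10.0] * 5))
score = normalize(raw)
```

Under these assumptions the quiet baseline cluster has a raw reward of about −1.15 and a normalized score of about 0.72. Scaling a node to capacity 6 when only one unit is needed would instead add three idle units at $1.00 each.
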
Temperature = 2.0 gives a visible per-action gradient:

| Action | Reward |
|--------|--------|
| Baseline NO_OP | ~0.72 |
| 1× SCALE_UP | ~0.54 |
| 2× SCALE_UP | ~0.29 |
| 3× SCALE_UP | ~0.14 |
| 4× SCALE_UP | ~0.04 |

The LLM can read the trend and adjust: each unnecessary scale-up is visibly worse.

---

## Layer 4: Action-Efficiency Penalties

**Cooldown**: Repeating the same action on the same node within 3 ticks applies `reward −= cooldown × 0.1`. The action still executes (emergencies aren't blocked), but the agent learns to wait for boot delay before re-scaling.

**Wasted action** (−0.05): Rejected/invalid actions reduce reward immediately. The LLM sees the consequence in its very next step, not at episode end.

---

## Layer 5: Episode Grader

```
composite = 0.4·Uptime + 0.2·Stability + 0.4·Cost − invalid_penalty + bonus
```

| Dimension | Formula | Notes |
|-----------|---------|-------|
| **Uptime** | Fraction of ticks with SLA met | Latency ≤ 200ms, errors ≤ 5% |
| **Cost** | `exp(−3.0 × over_ratio)` | Exponential decay from baseline; 2× spend → score 0.05 |
| **Stability** | `1 / (1 + (avg_energy/2000)²)` | Inverse Lyapunov, no early saturation |

**Task-3 coupling**: Cost score is zeroed if uptime < 50%. Prevents "cheap but dead" strategies.

**Prevention bonuses** (additive, no overlap with step reward):

- +0.10: zero VIP failures all episode
- +0.05: fewer than 3 SLA violations all episode
- +0.05: zero invalid actions all episode

---

## Why This Is Innovative

1. **Dynamical systems, not threshold monitoring**: Lyapunov drift measures *direction of travel*, not just current state. The agent learns whether its actions move the cluster toward or away from equilibrium.
2. **Topology-aware energy**: The edge imbalance term captures parent-child queue mismatch that flat per-node metrics miss entirely.
3. **Baseline-anchored cost**: The three-tier model separates "infrastructure you already pay for" from "capacity you chose to add," preventing the reward from penalizing the default cluster state.
4. **Preventive, not reactive**: Smooth SLA sigmoids give gradient *before* violation. The agent learns the pre-scale window that boot delay demands, rather than waiting for alarms.
5. **Layered safety**: SLA + barrier = two-tier defense at different thresholds and urgencies. Not all danger is equally urgent.
6. **Action quality as a first-class signal**: Wasted actions, rapid re-scaling, and invalid commands produce immediate penalties. Prevents "spam SCALE_UP and hope."
7. **Simulator-to-K8s parity**: Every parameter has a real-world counterpart. `DEFAULT_CAPACITY=3` corresponds to K8s replicas, boot delay to pod startup time, and cost tiers to cloud pricing. Trained policies transfer to live infrastructure.
8. **Theoretical guarantee**: The reward structure instantiates Neely's Drift-Plus-Penalty optimization, providing formal guarantees of queue stability with bounded average cost. The agent implements a theoretically grounded control policy, not ad-hoc heuristics.
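
The Layer 5 episode grader can also be sketched in Python. This is an illustration under stated assumptions, not the project's grader: `over_ratio` is taken to be `(spend − baseline) / baseline` and is clamped at zero, `invalid_penalty` is passed in because its formula is not given in the text, and all function names are hypothetical.

```python
import math


def cost_score(over_ratio: float, uptime: float) -> float:
    """exp(-3 * over_ratio), zeroed when uptime is below 50%."""
    if uptime < 0.5:
        return 0.0  # coupling rule: no credit for "cheap but dead"
    # Assumption: spending below baseline is clamped to over_ratio = 0.
    return math.exp(-3.0 * max(0.0, over_ratio))


def stability_score(avg_energy: float) -> float:
    """Inverse-Lyapunov score: 1 / (1 + (avg_energy / 2000)^2)."""
    return 1.0 / (1.0 + (avg_energy / 2000.0) ** 2)


def prevention_bonus(vip_failures: int, sla_violations: int, invalid_actions: int) -> float:
    """Additive bonuses listed under Layer 5."""
    bonus = 0.0
    if vip_failures == 0:
        bonus += 0.10
    if sla_violations < 3:
        bonus += 0.05
    if invalid_actions == 0:
        bonus += 0.05
    return bonus


def composite(
    uptime: float,
    avg_energy: float,
    over_ratio: float,
    invalid_penalty: float,  # formula not given in the text; supplied by caller
    vip_failures: int,
    sla_violations: int,
    invalid_actions: int,
) -> float:
    """0.4*Uptime + 0.2*Stability + 0.4*Cost - invalid_penalty + bonus."""
    return (
        0.4 * uptime
        + 0.2 * stability_score(avg_energy)
        + 0.4 * cost_score(over_ratio, uptime)
        - invalid_penalty
        + prevention_bonus(vip_failures, sla_violations, invalid_actions)
    )


# Twice the baseline spend with healthy uptime: cost score is about 0.05.
double_spend = cost_score(1.0, 0.9)
# A cheap episode with uptime under 50% gets no cost credit.
cheap_but_dead = cost_score(0.0, 0.4)
# A clean episode: full uptime, zero energy, baseline spend, no penalties.
clean = composite(1.0, 0.0, 0.0, 0.0, 0, 0, 0)
```

With `over_ratio = 1.0` the cost score is `exp(−3) ≈ 0.05`, which agrees with the "2× spend" note in the grader table. Whether the real grader clamps the composite to a fixed range is not stated in the text.
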