# The AntiAtropos Reward Function

A physics-grounded, multi-scale control signal for Autonomous SRE — built on Lyapunov stability theory, graph-aware energy, three-tier cost economics, and smooth preventive SLA gradients.

---

## Architecture at a Glance

The reward operates at three temporal scales:

| Scale | Purpose | Consumer |
|-------|---------|----------|
| **Per-step** | Immediate feedback on stability, cost, SLA, action quality | LLM agent |
| **Per-episode** | Holistic grading across uptime, cost, stability | Leaderboard |
| **Cross-episode** | Sustained-excellence bonuses | Leaderboard |

No single term dominates — a healthy cluster with wasted capacity scores poorly, as does a cheap cluster with burning queues.

---

## Layer 1: Lyapunov Graph Energy

```
V_graph(s) = Σ w_i · Q_i²  +  edge_weight · Σ_{(i,j)∈E} |Q_i − Q_j|
```

**Node energy** — Squared queue depths weighted by business importance (node-0 VIP = 2×). Squaring penalizes load concentration: one node at Q=100 is far worse than five at Q=20.

**Edge imbalance** — Penalizes flow mismatch across DAG edges. If a parent has deep queues but its child is idle, the edge term fires even though the child's individual energy is zero. This gives the agent gradient signal to **balance load across topology**, not just minimize individual queues.
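A minimal sketch of the energy function, to make the two terms concrete. The function name `graph_energy`, the dictionary layout, the default weight of 1.0 for non-VIP nodes, and the `edge_weight` value of 0.5 are all assumptions for illustration; the source gives only the formula and the 2× VIP weight.

```python
from typing import Dict, Iterable, Tuple


def graph_energy(
    queues: Dict[int, float],
    weights: Dict[int, float],
    edges: Iterable[Tuple[int, int]],
    edge_weight: float,
) -> float:
    """V_graph = sum_i w_i * Q_i^2 + edge_weight * sum_(i,j) |Q_i - Q_j|."""
    # Node term: squared queue depth, weighted by importance (assumed 1.0 if unlisted).
    node_term = sum(weights.get(i, 1.0) * q * q for i, q in queues.items())
    # Edge term: absolute queue mismatch across each DAG edge.
    edge_term = sum(abs(queues[i] - queues[j]) for i, j in edges)
    return node_term + edge_weight * edge_term


# Load concentration: one node at Q=100 versus five nodes at Q=20 (no edges).
concentrated = graph_energy({1: 100.0}, {}, [], edge_weight=0.0)
spread = graph_energy({i: 20.0 for i in range(1, 6)}, {}, [], edge_weight=0.0)

# Edge imbalance: parent node 1 is backed up, child node 2 is idle.
vip_weights = {0: 2.0}  # node-0 VIP = 2x, per the text
imbalanced = graph_energy(
    {0: 0.0, 1: 40.0, 2: 0.0}, vip_weights, [(1, 2)], edge_weight=0.5
)
```

Here the idle child contributes nothing to the node term, yet the edge `(1, 2)` still adds `0.5 * 40` to the total, which is the extra signal the edge term is meant to provide.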

---

## Layer 2: Reward Composition

```
R_t = −(α·ΔV + β·Cost + γ·SLA + δ·Barrier)
```

### ΔV — Lyapunov Drift (α = 0.002)

One-step change in cluster energy. Negative = stabilizing. Positive = destabilizing. The small weight makes it a **directional nudge**, not a sledgehammer. Grounded in Neely's Drift-Plus-Penalty framework, which guarantees queue stability with bounded average cost.
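As a rough illustration of the sign convention, the sketch below computes the weighted drift for a draining and a growing cluster. It assumes ΔV is the energy after the step minus the energy before it; the function name `drift_term` and the example energies are hypothetical.

```python
ALPHA = 0.002  # drift weight from the text


def drift_term(v_before: float, v_after: float, alpha: float = ALPHA) -> float:
    """Weighted one-step Lyapunov drift, alpha * (V(s') - V(s))."""
    return alpha * (v_after - v_before)


# Queues draining: energy falls, the drift term is negative (stabilizing).
draining = drift_term(v_before=5000.0, v_after=3000.0)

# Queues building: energy rises, the drift term is positive (destabilizing).
growing = drift_term(v_before=3000.0, v_after=5000.0)
```

Because the reward is the negative of the weighted sum, a negative drift term raises `R_t` and a positive one lowers it.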

### Cost — Three-Tier Infrastructure Model (β = 1.5)

```
needed = ⌈incoming_rate / 15⌉
```

| Tier | Range | Rate | Rationale |
|------|-------|------|-----------|
| **Baseline** | ≤ DEFAULT_CAPACITY (3) | $0.05/u/hr | Already provisioned: sunk cost, no premium |
| **Justified** | >3, ≤ needed | $0.20/u/hr (4×) | Extra capacity serving traffic — defensible spend |
| **Idle waste** | > needed | $1.00/u/hr (20×) | Capacity sitting idle — pure waste |

A naive two-tier model (cheap ≤ needed, expensive > needed) penalizes baseline capacity as "overprovisioned" because `needed` can be 1 while DEFAULT_CAPACITY is 3. The three-tier model recognizes that **baseline infrastructure is already paid for** — only agent-added capacity triggers premium pricing.

Baseline cost: 5 nodes × 3 units × $0.05 = **$0.75/hr**. Scaling one node to capacity=6 when its traffic needs all six units (3 justified + 0 idle) adds 3 × $0.20 = $0.60, for a total of $0.75 + $0.60 = $1.35/hr.
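The tier boundaries can be sketched as a per-node cost function. This is an illustrative reading of the table, not the project's implementation: the function name `node_cost` and the traffic figures in the examples are assumptions, while the rates, `DEFAULT_CAPACITY`, and the divisor 15 come from the text.

```python
import math

DEFAULT_CAPACITY = 3
RATE_BASELINE = 0.05   # $/unit/hr, units up to DEFAULT_CAPACITY
RATE_JUSTIFIED = 0.20  # $/unit/hr, units above baseline but within `needed`
RATE_IDLE = 1.00       # $/unit/hr, units beyond `needed`


def node_cost(capacity: int, incoming_rate: float) -> float:
    """Hourly cost of one node under the three-tier model."""
    needed = math.ceil(incoming_rate / 15)
    baseline_units = min(capacity, DEFAULT_CAPACITY)
    # Units above baseline that current traffic actually requires.
    justified_units = max(0, min(capacity, needed) - DEFAULT_CAPACITY)
    # Units above both the baseline and what traffic requires.
    idle_units = max(0, capacity - max(needed, DEFAULT_CAPACITY))
    return (
        baseline_units * RATE_BASELINE
        + justified_units * RATE_JUSTIFIED
        + idle_units * RATE_IDLE
    )


# Five default nodes under light traffic (needed = 1 each).
baseline_cluster = sum(node_cost(3, 10.0) for _ in range(5))

# One node scaled to 6 with enough traffic to need all six units.
scaled_cluster = 4 * node_cost(3, 10.0) + node_cost(6, 90.0)

# The same scale-up with light traffic: all three extra units are idle.
wasteful_node = node_cost(6, 10.0)
```

The first two values reproduce the $0.75/hr and $1.35/hr figures above; the third shows how the same capacity becomes far more expensive when traffic does not justify it.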

### SLA — Smooth Preventive Penalty (γ = 4.0)

```
sla = max(σ(latency, threshold=0.20, temp=0.03), σ(errors, threshold=0.05, temp=0.01))
```

Dual sigmoids with **max** operator — worst dimension dominates. Unlike binary penalties (0 or 1), the sigmoid provides **gradient before the violation**, enabling the agent to learn preventive scaling. Asymmetric temperatures reflect operational reality: latency degrades gradually (wide band), error rates spike sharply (narrow band).
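A sketch of the dual-sigmoid penalty, assuming `σ(x, threshold, temp)` means the logistic function `1 / (1 + exp(−(x − threshold) / temp))` with latency in seconds and errors as a fraction. That parameterization is an assumption; the source gives only the thresholds and temperatures. The names `soft_step` and `sla_penalty` are hypothetical.

```python
import math


def soft_step(x: float, threshold: float, temp: float) -> float:
    """Logistic curve that equals 0.5 at the threshold (assumed form)."""
    return 1.0 / (1.0 + math.exp(-(x - threshold) / temp))


def sla_penalty(latency_s: float, error_rate: float) -> float:
    """Worst of the latency and error penalties, each in (0, 1)."""
    latency_pen = soft_step(latency_s, threshold=0.20, temp=0.03)
    error_pen = soft_step(error_rate, threshold=0.05, temp=0.01)
    return max(latency_pen, error_pen)


# Penalty rises smoothly as latency approaches the 200 ms threshold.
early_warning = sla_penalty(latency_s=0.17, error_rate=0.0)
at_threshold = sla_penalty(latency_s=0.20, error_rate=0.0)

# A sharp error spike dominates even when latency is healthy.
error_spike = sla_penalty(latency_s=0.10, error_rate=0.07)
```

Under this form the penalty is already well above zero at 170 ms, which is the gradient before the violation that the text describes.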

### Barrier — Control-Barrier Function (δ = 0.1)

```
H(s) = Σ max(0, Q_i − 150)² / 10000
```

Zero below Q=150, quadratic above. Creates a **hard danger zone** near catastrophic failure (Q=200). Architecturally distinct from SLA: SLA says "pay attention," barrier says "act now or the node dies." Layered defense at different urgency levels.
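The barrier term, and how the four terms of Layer 2 combine, can be sketched as follows. The function names are hypothetical, and the barrier is read as the sum of squared excesses divided by 10000, which is one plausible reading of the formula above.

```python
from typing import Iterable

ALPHA, BETA, GAMMA, DELTA = 0.002, 1.5, 4.0, 0.1  # weights from the text


def barrier(queues: Iterable[float], threshold: float = 150.0) -> float:
    """H(s): zero below the threshold, quadratic in the excess above it."""
    return sum(max(0.0, q - threshold) ** 2 for q in queues) / 10000.0


def raw_reward(delta_v: float, cost: float, sla: float, barrier_value: float) -> float:
    """R_t = -(alpha*dV + beta*Cost + gamma*SLA + delta*Barrier)."""
    return -(ALPHA * delta_v + BETA * cost + GAMMA * sla + DELTA * barrier_value)


safe = barrier([0.0, 80.0, 150.0])   # every queue at or below 150
warning = barrier([175.0])           # 25 above the threshold
critical = barrier([200.0])          # at the catastrophic-failure depth

# Idle baseline cluster: no drift, no SLA pressure, no barrier, $0.75/hr cost.
idle_baseline = raw_reward(delta_v=0.0, cost=0.75, sla=0.0, barrier_value=0.0)
```

Doubling the excess over 150 quadruples the barrier value, so the term stays silent in normal operation and grows quickly as a queue approaches Q=200.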

---

## Layer 3: Sigmoid Normalization

```
reward_01 = σ(raw_reward, midpoint=−3.0, temperature=2.0)
```

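Maps the raw reward to the open interval (0, 1) for the LLM. The raw reward is usually negative, although a strongly stabilizing step (large negative ΔV) can push it above zero. The **midpoint = −3.0** centers the sigmoid where rewards actually cluster (≈ −1 to −8). Temperature = 2.0 gives a visible per-action gradient: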

| Action | Reward |
|--------|--------|
| Baseline NO_OP | ~0.72 |
| 1× SCALE_UP | ~0.54 |
| 2× SCALE_UP | ~0.29 |
| 3× SCALE_UP | ~0.14 |
| 4× SCALE_UP | ~0.04 |

The LLM can read the trend and adjust — each unnecessary scale-up is visibly worse.
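A sketch of the normalization step, assuming the same logistic form `1 / (1 + exp(−(x − midpoint) / temperature))`. The function name `normalize_reward` is hypothetical. The baseline input of −1.125 assumes an idle default cluster where only the cost term is non-zero (1.5 × $0.75).

```python
import math


def normalize_reward(raw: float, midpoint: float = -3.0, temperature: float = 2.0) -> float:
    """Squash a raw reward into (0, 1); equals 0.5 at the midpoint."""
    return 1.0 / (1.0 + math.exp(-(raw - midpoint) / temperature))


baseline_noop = normalize_reward(-1.125)  # close to the ~0.72 row in the table
at_midpoint = normalize_reward(-3.0)
poor_step = normalize_reward(-8.0)        # lower end of the typical range
```

Under these assumptions the baseline row of the table is reproduced; the other rows depend on per-action costs and penalties that are not spelled out here, so they are not recomputed.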

---

## Layer 4: Action-Efficiency Penalties

**Cooldown** — Same action on same node within 3 ticks: `reward −= cooldown × 0.1`. Action still executes (emergencies aren't blocked), but the agent learns to wait for boot delay before re-scaling.

**Wasted action** (−0.05) — Rejected/invalid actions reduce reward immediately. The LLM sees the consequence in its very next step, not at episode end.
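One way the two penalties might be computed is sketched below. The source does not define the `cooldown` quantity in `cooldown × 0.1`; this sketch assumes it is the number of cooldown ticks still remaining, and that "within 3 ticks" means fewer than 3 ticks since the same action on the same node. The function name and arguments are hypothetical.

```python
from typing import Optional

COOLDOWN_TICKS = 3      # window from the text
COOLDOWN_WEIGHT = 0.1   # multiplier from the text
WASTED_PENALTY = 0.05   # wasted-action penalty from the text


def action_penalty(ticks_since_same_action: Optional[int], rejected: bool) -> float:
    """Amount to subtract from the step reward for a low-quality action."""
    penalty = 0.0
    # Assumption: `cooldown` is the remaining ticks in the 3-tick window.
    if ticks_since_same_action is not None and ticks_since_same_action < COOLDOWN_TICKS:
        remaining = COOLDOWN_TICKS - ticks_since_same_action
        penalty += remaining * COOLDOWN_WEIGHT
    # Rejected or invalid actions are penalized on the same step.
    if rejected:
        penalty += WASTED_PENALTY
    return penalty


repeat_too_soon = action_penalty(ticks_since_same_action=1, rejected=False)
patient_repeat = action_penalty(ticks_since_same_action=5, rejected=False)
invalid_action = action_penalty(ticks_since_same_action=None, rejected=True)
```

With this reading, repeating an action sooner costs more, which matches the stated goal of teaching the agent to wait out the boot delay.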

---

## Layer 5: Episode Grader

```
composite = 0.4·Uptime + 0.2·Stability + 0.4·Cost − invalid_penalty + bonus
```

| Dimension | Formula | Notes |
|-----------|---------|-------|
| **Uptime** | Fraction of ticks with SLA met | Latency ≤ 200ms, errors ≤ 5% |
| **Cost** | `exp(−3.0 × over_ratio)` | Exponential decay from baseline; 2× spend → score 0.05 |
| **Stability** | `1 / (1 + (avg_energy/2000)²)` | Inverse Lyapunov, no early saturation |

**Task-3 coupling** — Cost score zeroed if uptime < 50%. Prevents "cheap but dead" strategies.

**Prevention bonuses** (additive, no overlap with step reward):
- +0.10 zero VIP failures all episode
- +0.05 < 3 SLA violations all episode
- +0.05 zero invalid actions all episode
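The grader can be sketched end to end as below. Several details are assumptions: `over_ratio` is taken to be spend above baseline divided by baseline (consistent with "2× spend → score 0.05"), negative ratios are clamped to zero, the uptime coupling is applied unconditionally, and `invalid_penalty` is passed in because its size is not given. All function and argument names are hypothetical.

```python
import math


def cost_score(over_ratio: float) -> float:
    """exp(-3.0 * over_ratio); clamping at zero is an assumption."""
    return math.exp(-3.0 * max(0.0, over_ratio))


def stability_score(avg_energy: float) -> float:
    """Inverse-Lyapunov score: 1 / (1 + (avg_energy / 2000)^2)."""
    return 1.0 / (1.0 + (avg_energy / 2000.0) ** 2)


def composite(
    uptime: float,
    avg_energy: float,
    over_ratio: float,
    invalid_penalty: float = 0.0,
    vip_failures: int = 0,
    sla_violations: int = 0,
    invalid_actions: int = 0,
) -> float:
    # Coupling: cost earns nothing when uptime is below 50%.
    cost = cost_score(over_ratio) if uptime >= 0.5 else 0.0
    # Prevention bonuses, each awarded independently.
    bonus = 0.0
    if vip_failures == 0:
        bonus += 0.10
    if sla_violations < 3:
        bonus += 0.05
    if invalid_actions == 0:
        bonus += 0.05
    return (
        0.4 * uptime
        + 0.2 * stability_score(avg_energy)
        + 0.4 * cost
        - invalid_penalty
        + bonus
    )


flawless = composite(uptime=1.0, avg_energy=0.0, over_ratio=0.0)
cheap_but_dead = composite(
    uptime=0.4, avg_energy=0.0, over_ratio=0.0,
    vip_failures=1, sla_violations=10, invalid_actions=2,
)
```

In this sketch a flawless episode scores above 1.0 because the bonuses are additive, and the "cheap but dead" episode gets no credit for its low spend.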

---

## Why This Is Innovative

1. **Dynamical systems, not threshold monitoring** — Lyapunov drift measures *direction of travel*, not just current state. The agent learns whether its actions move the cluster toward or away from equilibrium.

2. **Topology-aware energy** — The edge imbalance term captures parent-child queue mismatch that flat per-node metrics miss entirely.

3. **Baseline-anchored cost** — The three-tier model separates "infrastructure you already pay for" from "capacity you chose to add," preventing the reward from penalizing the default cluster state.

4. **Preventive, not reactive** — Smooth SLA sigmoids give gradient *before* violation. The agent learns the pre-scale window that boot delay demands, rather than waiting for alarms.

5. **Layered safety** — SLA + barrier = two-tier defense at different thresholds and urgencies. Not all danger is equally urgent.

6. **Action quality as a first-class signal** — Wasted actions, rapid re-scaling, and invalid commands produce immediate penalties. Prevents "spam SCALE_UP and hope."

7. **Simulator-to-K8s parity**: every parameter has a real-world counterpart. DEFAULT_CAPACITY=3 corresponds to K8s replicas, boot delay to pod startup time, and cost tiers to cloud pricing. The aim is for trained policies to transfer to live infrastructure.

8. **Theoretical grounding**: the reward structure follows Neely's Drift-Plus-Penalty optimization, which, under that framework's assumptions, guarantees queue stability with bounded average cost. The agent is trained toward a theoretically grounded control policy rather than ad-hoc heuristics.