
	# The AntiAtropos Reward Function

	A physics-grounded, multi-scale control signal for Autonomous SRE — built on Lyapunov stability theory, graph-aware energy, three-tier cost economics, and smooth preventive SLA gradients.

	---

	## Architecture at a Glance

	The reward operates at three temporal scales:

	| Scale | Purpose | Consumer |
	|-------|---------|----------|
	| Per-step | Immediate feedback on stability, cost, SLA, action quality | LLM agent |
	| Per-episode | Holistic grading across uptime, cost, stability | Leaderboard |
	| Cross-episode | Sustained-excellence bonuses | Leaderboard |

	No single term dominates — a healthy cluster with wasted capacity scores poorly, as does a cheap cluster with burning queues.

	---

	## Layer 1: Lyapunov Graph Energy

	```
	V_graph(s) = Σ w_i · Q_i² + edge_weight · Σ_{(i,j)∈E} |Q_i − Q_j|
	```

	Node energy: squared queue depths weighted by business importance (node-0 VIP = 2×). Squaring penalizes load concentration: at equal weights, one node at Q=100 contributes 10,000, five times the 2,000 contributed by five nodes at Q=20.

	Edge imbalance: penalizes flow mismatch across DAG edges. If a parent has deep queues but its child is idle, the edge term fires even though the child's individual energy is zero. This gives the agent a gradient signal to balance load across the topology, not just minimize individual queues.
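
The two terms above can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the function name `graph_energy`, the dictionary-based inputs, the default node weight of 1.0, and the value `edge_weight=1.0` are all assumptions, since the document does not state them.

```python
from typing import Dict, Iterable, Tuple


def graph_energy(
    queues: Dict[str, float],
    weights: Dict[str, float],
    edges: Iterable[Tuple[str, str]],
    edge_weight: float = 1.0,  # assumed value; not given in the document
) -> float:
    """V_graph = sum_i w_i * Q_i^2 + edge_weight * sum_{(i,j) in E} |Q_i - Q_j|."""
    # Node term: squared queue depth, scaled by business importance.
    # Nodes without an explicit weight default to 1.0 (an assumption).
    node_term = sum(weights.get(node, 1.0) * q * q for node, q in queues.items())
    # Edge term: absolute queue mismatch across each parent -> child edge.
    edge_term = sum(abs(queues[i] - queues[j]) for i, j in edges)
    return node_term + edge_weight * edge_term


# One hot node versus the same total load spread over five nodes.
concentrated = graph_energy({"node-0": 100.0}, {}, [])
spread = graph_energy({f"node-{k}": 20.0 for k in range(5)}, {}, [])

# Deep parent, idle child: the edge adds energy although the child's own term is zero.
imbalanced = graph_energy({"parent": 40.0, "child": 0.0}, {}, [("parent", "child")])
```

With these inputs the concentrated case evaluates to 10,000 and the spread case to 2,000, while the parent-child case is 1,600 from the parent's node term plus 40 from the edge.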

	---

	## Layer 2: Reward Composition

	```
	R_t = −(α·ΔV + β·Cost + γ·SLA + δ·Barrier)
	```

	### ΔV — Lyapunov Drift (α = 0.002)

	One-step change in cluster energy. Negative = stabilizing. Positive = destabilizing. The small weight makes it a directional nudge, not a sledgehammer. Grounded in Neely's Drift-Plus-Penalty framework, which guarantees queue stability with bounded average cost.
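
As a small illustration of the sign convention, the sketch below computes the weighted drift term from two consecutive energy readings. The helper name `drift_term` and the sample energies are hypothetical; only α = 0.002 comes from this section.

```python
ALPHA = 0.002  # drift weight stated in this section


def drift_term(v_prev: float, v_next: float, alpha: float = ALPHA) -> float:
    """Return alpha * ΔV, the quantity that R_t subtracts."""
    return alpha * (v_next - v_prev)


# Energy falls from 2000 to 1500: ΔV < 0, so the term is negative and the reward rises.
stabilizing = drift_term(2000.0, 1500.0)

# Energy climbs from 1500 to 2000: ΔV > 0, so the term is positive and the reward falls.
destabilizing = drift_term(1500.0, 2000.0)
```

A swing of 500 energy units moves the reward by about 1.0 in either direction, which shows how small the per-unit nudge is at this weight.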

	### Cost — Three-Tier Infrastructure Model (β = 1.5)

	```
	needed = ⌈incoming_rate / 15⌉
	```

	| Tier | Range (capacity units) | Rate | Rationale |
	|------|-------|------|-----------|
	| Baseline | ≤ DEFAULT_CAPACITY (3) | $0.05/u/hr | Already provisioned: sunk cost, no penalty |
	| Justified | > 3 and ≤ needed | $0.20/u/hr (4×) | Extra capacity serving traffic: defensible spend |
	| Idle waste | > max(3, needed) | $1.00/u/hr (20×) | Capacity sitting idle: pure waste |

	A naive two-tier model (cheap ≤ needed, expensive > needed) penalizes baseline capacity as "overprovisioned" because `needed` can be 1 while DEFAULT_CAPACITY is 3. The three-tier model recognizes that baseline infrastructure is already paid for — only agent-added capacity triggers premium pricing.

	Baseline cost: 5 nodes × 3 units × $0.05 = $0.75/hr. Scaling one node to capacity=6 when its traffic needs at least 6 units (3 justified + 0 idle) adds 3 × $0.20 = $0.60, for a total of $0.75 + $0.60 = $1.35/hr.
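
The tier arithmetic can be checked with a short sketch. The function `node_cost` and the constant names other than `DEFAULT_CAPACITY` are hypothetical, and the code assumes the idle tier starts above `max(DEFAULT_CAPACITY, needed)`, so that baseline units are never billed as waste, as the paragraph on the two-tier model describes.

```python
import math

DEFAULT_CAPACITY = 3
UNIT_SERVICE_RATE = 15  # divisor from the `needed` formula; the name is an assumption
RATE_BASELINE = 0.05    # $/unit/hr
RATE_JUSTIFIED = 0.20
RATE_IDLE = 1.00


def node_cost(capacity: int, incoming_rate: float) -> float:
    """Hourly cost of one node under the three-tier model."""
    needed = math.ceil(incoming_rate / UNIT_SERVICE_RATE)
    # Units up to DEFAULT_CAPACITY are always billed at the baseline rate.
    baseline_units = min(capacity, DEFAULT_CAPACITY)
    # Assumption: units above baseline are "justified" up to `needed`, idle beyond it.
    justified_ceiling = max(DEFAULT_CAPACITY, needed)
    justified_units = max(0, min(capacity, justified_ceiling) - DEFAULT_CAPACITY)
    idle_units = max(0, capacity - justified_ceiling)
    return (
        baseline_units * RATE_BASELINE
        + justified_units * RATE_JUSTIFIED
        + idle_units * RATE_IDLE
    )


# Five nodes at default capacity with light traffic (needed = 1 on each).
baseline_cluster = sum(node_cost(3, 15.0) for _ in range(5))

# One node scaled to 6 units while its traffic needs 6 (90 / 15); the other four unchanged.
scaled_cluster = 4 * node_cost(3, 15.0) + node_cost(6, 90.0)

# The same scale-up with no traffic to justify it: all three extra units are idle.
wasteful_cluster = 4 * node_cost(3, 15.0) + node_cost(6, 15.0)
```

Up to floating-point rounding, the first two totals reproduce the $0.75/hr and $1.35/hr figures above, and the unjustified scale-up comes to $3.75/hr.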

	### SLA — Smooth Preventive Penalty (γ = 4.0)

	```
	sla = max(σ(latency, threshold=0.20, temp=0.03), σ(errors, threshold=0.05, temp=0.01))
	```

	Dual sigmoids with max operator — worst dimension dominates. Unlike binary penalties (0 or 1), the sigmoid provides gradient before the violation, enabling the agent to learn preventive scaling. Asymmetric temperatures reflect operational reality: latency degrades gradually (wide band), error rates spike sharply (narrow band).
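
A minimal sketch of the dual-sigmoid penalty follows. It assumes σ is the standard logistic function `1 / (1 + exp(−(x − threshold) / temp))`, that latency is expressed in seconds (so 0.20 corresponds to 200 ms), and that the error rate is a fraction. The document does not confirm these conventions, and the function names are illustrative.

```python
import math


def sigmoid(x: float, threshold: float, temp: float) -> float:
    """Logistic curve equal to 0.5 at `threshold`; smaller `temp` means a sharper step."""
    return 1.0 / (1.0 + math.exp(-(x - threshold) / temp))


def sla_penalty(latency_s: float, error_rate: float) -> float:
    """Worst of the latency and error-rate penalties, each in (0, 1)."""
    latency_pen = sigmoid(latency_s, threshold=0.20, temp=0.03)
    error_pen = sigmoid(error_rate, threshold=0.05, temp=0.01)
    return max(latency_pen, error_pen)


# Penalty grows as latency approaches the threshold, before any violation occurs.
early_warning = sla_penalty(latency_s=0.17, error_rate=0.0)
at_threshold = sla_penalty(latency_s=0.20, error_rate=0.0)
error_spike = sla_penalty(latency_s=0.05, error_rate=0.07)
```

Under these assumptions the penalty at 170 ms is already about 0.27, and a 7% error rate dominates the result even when latency is healthy.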

	### Barrier — Control-Barrier Function (δ = 0.1)

	```
	H(s) = Σ max(0, Q_i − 150)² / 10000
	```

	Zero below Q=150, quadratic above. Creates a hard danger zone near catastrophic failure (Q=200). Architecturally distinct from SLA: SLA says "pay attention," barrier says "act now or the node dies." Layered defense at different urgency levels.
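
Putting the four terms together, the sketch below evaluates the barrier and the raw step reward. The weights and the 150 onset come from this section; the function names and the choice to pass each term in as a precomputed number are assumptions made for illustration.

```python
from typing import Iterable

ALPHA, BETA, GAMMA, DELTA = 0.002, 1.5, 4.0, 0.1  # weights from this section
BARRIER_ONSET = 150.0


def barrier(queues: Iterable[float]) -> float:
    """H(s): zero while every queue is at or below 150, quadratic in the excess above it."""
    return sum(max(0.0, q - BARRIER_ONSET) ** 2 for q in queues) / 10000.0


def raw_reward(delta_v: float, cost: float, sla: float, barrier_value: float) -> float:
    """R_t = -(alpha*ΔV + beta*Cost + gamma*SLA + delta*Barrier)."""
    return -(ALPHA * delta_v + BETA * cost + GAMMA * sla + DELTA * barrier_value)


safe = barrier([100.0, 150.0])  # no queue exceeds the onset
near_failure = barrier([200.0])  # one queue at the failure point

# Idle baseline cluster: no drift, $0.75/hr cost, no SLA pressure, no barrier.
idle_baseline = raw_reward(delta_v=0.0, cost=0.75, sla=0.0, barrier_value=0.0)
```

Here the barrier is 0 for the safe cluster and 0.25 for a single queue at Q=200, and the idle baseline reward is −1.125, driven entirely by the cost term.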

	---

	## Layer 3: Sigmoid Normalization

	```
	reward_01 = σ(raw_reward, midpoint=−3.0, temperature=2.0)
	```

	Maps the raw reward (typically negative) to (0, 1) for the LLM. The midpoint = −3.0 centers the sigmoid where rewards actually cluster (≈ −1 to −8). Temperature = 2.0 gives a visible per-action gradient:

	| Action | Reward |
	|--------|--------|
	| Baseline NO_OP | ~0.72 |
	| 1× SCALE_UP | ~0.54 |
	| 2× SCALE_UP | ~0.29 |
	| 3× SCALE_UP | ~0.14 |
	| 4× SCALE_UP | ~0.04 |

	The LLM can read the trend and adjust — each unnecessary scale-up is visibly worse.
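
The normalization step can be sketched as below, assuming σ is the standard logistic function applied to `(raw − midpoint) / temperature`. The function name `normalize_reward` is hypothetical.

```python
import math


def normalize_reward(raw: float, midpoint: float = -3.0, temperature: float = 2.0) -> float:
    """Squash a raw step reward into (0, 1); `midpoint` maps to exactly 0.5."""
    return 1.0 / (1.0 + math.exp(-(raw - midpoint) / temperature))


# A raw reward of -1.125 is beta * baseline cost (1.5 * 0.75) with every other term at zero.
baseline_noop = normalize_reward(-1.125)

# Rewards at the two ends of the range where the document says values cluster.
mild = normalize_reward(-1.0)
severe = normalize_reward(-8.0)
```

Under this assumption the baseline input lands near 0.72, in line with the NO_OP row of the table. The raw rewards behind the SCALE_UP rows are not given, so those rows are not reproduced here.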

	---

	## Layer 4: Action-Efficiency Penalties

	Cooldown — Same action on same node within 3 ticks: `reward −= cooldown × 0.1`. Action still executes (emergencies aren't blocked), but the agent learns to wait for boot delay before re-scaling.

	Wasted action (−0.05) — Rejected/invalid actions reduce reward immediately. The LLM sees the consequence in its very next step, not at episode end.
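
One way these two penalties could be tracked is sketched below. Several details are assumptions, because the document leaves them open: `cooldown` is read as the number of cooldown ticks still remaining, "within 3 ticks" is read as fewer than 3 ticks since the previous identical action, and the class and method names are hypothetical.

```python
from typing import Dict, Tuple

COOLDOWN_TICKS = 3
COOLDOWN_WEIGHT = 0.1
WASTED_ACTION_PENALTY = 0.05


class ActionPenaltyTracker:
    """Remembers when each (node, action) pair last ran and prices repeats."""

    def __init__(self) -> None:
        self._last_tick: Dict[Tuple[str, str], int] = {}

    def penalty(self, tick: int, node: str, action: str, rejected: bool = False) -> float:
        """Return the amount to subtract from this step's reward."""
        total = 0.0
        key = (node, action)
        last = self._last_tick.get(key)
        if last is not None and tick - last < COOLDOWN_TICKS:
            # Assumption: the penalty scales with the cooldown ticks still remaining.
            remaining = COOLDOWN_TICKS - (tick - last)
            total += remaining * COOLDOWN_WEIGHT
        # The action is recorded either way; the cooldown never blocks execution.
        self._last_tick[key] = tick
        if rejected:
            total += WASTED_ACTION_PENALTY
        return total


tracker = ActionPenaltyTracker()
first = tracker.penalty(tick=0, node="node-1", action="SCALE_UP")
repeat = tracker.penalty(tick=1, node="node-1", action="SCALE_UP")
after_wait = tracker.penalty(tick=5, node="node-1", action="SCALE_UP")
invalid = tracker.penalty(tick=6, node="node-2", action="SCALE_DOWN", rejected=True)
```

In this sketch only the immediate repeat and the rejected action are penalized. Waiting out the cooldown brings the penalty back to zero.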

	---

	## Layer 5: Episode Grader

	```
	composite = 0.4·Uptime + 0.2·Stability + 0.4·Cost − invalid_penalty + bonus
	```

	| Dimension | Formula | Notes |
	|-----------|---------|-------|
	| Uptime | Fraction of ticks with SLA met | Latency ≤ 200 ms, errors ≤ 5% |
	| Cost | `exp(−3.0 × over_ratio)` | Exponential decay from baseline; 2× spend → score ≈ 0.05 |
	| Stability | `1 / (1 + (avg_energy/2000)²)` | Inverse Lyapunov, no early saturation |

	Task-3 coupling — Cost score zeroed if uptime < 50%. Prevents "cheap but dead" strategies.

	Prevention bonuses (additive, no overlap with step reward):
	- +0.10 for zero VIP failures across the episode
	- +0.05 for fewer than 3 SLA violations across the episode
	- +0.05 for zero invalid actions across the episode
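
The grader can be sketched end to end as follows. The function names are hypothetical. `over_ratio` is assumed to mean `(spend − baseline) / baseline`, which matches the "2× spend → 0.05" note. `invalid_penalty` is taken as an input because its formula is not given here.

```python
import math


def cost_score(over_ratio: float) -> float:
    """Exponential decay from baseline spend; over_ratio = 1 means 2x baseline (assumed)."""
    return math.exp(-3.0 * over_ratio)


def stability_score(avg_energy: float) -> float:
    """Inverse-Lyapunov score: 1.0 at zero energy, 0.5 at avg_energy = 2000."""
    return 1.0 / (1.0 + (avg_energy / 2000.0) ** 2)


def composite_score(
    uptime: float,
    over_ratio: float,
    avg_energy: float,
    invalid_penalty: float = 0.0,  # formula not given in the document
    vip_failures: int = 0,
    sla_violations: int = 0,
    invalid_actions: int = 0,
) -> float:
    # Coupling rule: the cost score is zeroed when uptime falls below 50%.
    cost = cost_score(over_ratio) if uptime >= 0.5 else 0.0
    bonus = 0.0
    if vip_failures == 0:
        bonus += 0.10
    if sla_violations < 3:
        bonus += 0.05
    if invalid_actions == 0:
        bonus += 0.05
    return 0.4 * uptime + 0.2 * stability_score(avg_energy) + 0.4 * cost - invalid_penalty + bonus


# Flawless episode: all three dimensions at 1.0 plus every bonus.
perfect = composite_score(uptime=1.0, over_ratio=0.0, avg_energy=0.0)

# "Cheap but dead": baseline spend, yet uptime below 50% wipes out the cost score.
cheap_but_dead = composite_score(
    uptime=0.4, over_ratio=0.0, avg_energy=0.0,
    vip_failures=1, sla_violations=10, invalid_actions=2,
)
```

With all bonuses earned, the composite in this sketch reaches 1.2, so it is not confined to [0, 1]. The cheap-but-dead episode scores 0.36 despite spending nothing above baseline.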

	---

	## Why This Is Innovative

	1. Dynamical systems, not threshold monitoring — Lyapunov drift measures direction of travel, not just current state. The agent learns whether its actions move the cluster toward or away from equilibrium.

	2. Topology-aware energy — The edge imbalance term captures parent-child queue mismatch that flat per-node metrics miss entirely.

	3. Baseline-anchored cost — The three-tier model separates "infrastructure you already pay for" from "capacity you chose to add," preventing the reward from penalizing the default cluster state.

	4. Preventive, not reactive — Smooth SLA sigmoids give gradient before violation. The agent learns the pre-scale window that boot delay demands, rather than waiting for alarms.

	5. Layered safety — SLA + barrier = two-tier defense at different thresholds and urgencies. Not all danger is equally urgent.

	6. Action quality as a first-class signal — Wasted actions, rapid re-scaling, and invalid commands produce immediate penalties. Prevents "spam SCALE_UP and hope."

	7. Simulator-to-K8s parity: Every parameter has a real-world counterpart. DEFAULT_CAPACITY=3 corresponds to the K8s replica count, boot delay to pod startup time, and cost tiers to cloud pricing. Trained policies are designed to transfer to live infrastructure.

	8. Theoretical guarantee: The reward structure instantiates Neely's Drift-Plus-Penalty optimization, whose analysis guarantees queue stability with bounded average cost for policies that minimize the drift-plus-penalty expression at each step. The agent is trained toward a theoretically grounded control policy, not ad-hoc heuristics.