feat(curriculum): add progressive training curriculum management

- Introduce CurriculumStage dataclass to define tasks with step limits, thresholds, temps, and retries
- Define CURRICULUM list with staged tasks of increasing difficulty and parameters
- Implement CurriculumTracker to track current stage, report scores, handle retries, and progress
- Raise the sampling temperature on each retry to encourage exploration, and skip a stage automatically once max retries is reached

Files changed (10)

.qoder/plans/RL_Pipeline_Overhaul_d7c34a04.md +212 -0
control/validation.py +43 -12
curriculum.py +131 -0
grader.py +23 -5
inference.py +97 -24
models.py +49 -0
replay.py +230 -0
server/AntiAtropos_environment.py +80 -19
simulator.py +103 -9
stability.py +72 -8

.qoder/plans/RL_Pipeline_Overhaul_d7c34a04.md ADDED

	@@ -0,0 +1,212 @@

+# RL Pipeline Overhaul
+## Phase 1: Simulator Physics
+### Task 1.1: Exponential Latency Model
+**File:** `simulator.py` line 426
+Replace the linear latency formula with M/M/1 queuing theory:
+```python
+# Current (linear):
+n.latency_ms = BASE_LATENCY_MS + (n.queue_depth * LATENCY_STEEPNESS)
+# New (M/M/1-style: grows as 1 / (1 - utilization), so it blows up as utilization→1):
+utilization = n.incoming_request_rate / n.service_rate if n.service_rate > 0 else 1.0
+if utilization >= 0.99:
+    utilization = 0.99  # cap to prevent infinity
+n.latency_ms = BASE_LATENCY_MS / (1.0 - utilization)
+```
+This creates the "hockey stick" that teaches the agent to scale *before* saturation.
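The capped latency curve from Task 1.1 can be sketched as a standalone function. This is only an illustration: `mm1_latency_ms` and `MAX_UTILIZATION` are hypothetical names, and the `BASE_LATENCY_MS` value is a placeholder, since the real constant is defined in `simulator.py` and is not shown in this diff.

```python
BASE_LATENCY_MS = 20.0   # placeholder value; the real constant lives in simulator.py
MAX_UTILIZATION = 0.99   # cap from the plan, keeps the denominator away from zero


def mm1_latency_ms(incoming_request_rate: float, service_rate: float) -> float:
    """Latency that grows as 1 / (1 - utilization), capped at MAX_UTILIZATION."""
    if service_rate <= 0:
        utilization = MAX_UTILIZATION  # no capacity: treat the node as saturated
    else:
        utilization = min(incoming_request_rate / service_rate, MAX_UTILIZATION)
    utilization = max(utilization, 0.0)  # guard against negative rates
    return BASE_LATENCY_MS / (1.0 - utilization)


if __name__ == "__main__":
    # Latency doubles at 50% utilization and is 100x the base at the 0.99 cap.
    for rho in (0.5, 0.9, 0.99):
        print(rho, round(mm1_latency_ms(rho * 100.0, 100.0), 1))
```

With the cap at 0.99 the worst-case latency is 100 times the base latency, which bounds the penalty the agent can see from a single saturated node.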
+### Task 1.2: Node Recovery Mechanic
+**File:** `simulator.py` lines 428-441, `NodeState` dataclass
+- Add `recovery_timer: int = 0` to `NodeState`
+- When `queue_depth > FATAL_FAIL_THRESHOLD`, set status=FAILED but start `recovery_timer = 20` ticks
+- Each tick, decrement recovery_timer. When it hits 0, set status=HEALTHY, capacity=1, queue_depth=0
+- This lets the agent learn recovery strategies (reroute away, then scale up the recovering node)
+### Task 1.3: Cascading Failure Pressure
+**File:** `simulator.py` — new method `_cascade_failures()`
+When a node fails, its peers absorb the traffic it was serving. If any peer's queue then exceeds `FATAL_FAIL_THRESHOLD * 1.2` within 3 ticks of the original failure, that peer also degrades. This models real cascade patterns. `_cascade_failures()` is called after `_update_statuses()` in `tick()`.
+---
+## Phase 2: Reward Shaping
+### Task 2.1: Smooth SLA Penalty (Replace Binary Cliff)
+**File:** `server/AntiAtropos_environment.py` line 205, `stability.py`
+Replace the binary SLA violation with a smooth sigmoid that ramps up as latency approaches the threshold:
+```python
+# Instead of:
+sla_violation_step = 1 if (avg_latency > 200.0 or error_rate > 0.05) else 0
+# New:
+def smooth_sla_penalty(avg_latency_norm: float, error_rate: float,
+                       threshold: float = 0.20, temperature: float = 0.03) -> float:
+    """Smooth penalty in [0, 1] that ramps as latency approaches threshold."""
+    lat_penalty = 1.0 / (1.0 + math.exp(-(avg_latency_norm - threshold) / temperature))
+    err_penalty = 1.0 / (1.0 + math.exp(-(error_rate - 0.05) / 0.01))
+    return max(lat_penalty, err_penalty)
+```
+This gives the agent gradient signal *before* the SLA is actually violated.
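A runnable variant of the snippet above, for checking how the ramp behaves around the threshold. It keeps the plan's defaults; the `_sigmoid` helper is an addition for this sketch (not part of the plan) that avoids `math.exp` overflow for inputs far from the threshold.

```python
import math


def _sigmoid(x: float) -> float:
    """Logistic function written to avoid overflow in math.exp for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def smooth_sla_penalty(avg_latency_norm: float, error_rate: float,
                       threshold: float = 0.20, temperature: float = 0.03) -> float:
    """Penalty in [0, 1]; the larger of the latency ramp and the error-rate ramp."""
    lat_penalty = _sigmoid((avg_latency_norm - threshold) / temperature)
    err_penalty = _sigmoid((error_rate - 0.05) / 0.01)
    return max(lat_penalty, err_penalty)


if __name__ == "__main__":
    for latency in (0.10, 0.18, 0.20, 0.25):
        print(latency, round(smooth_sla_penalty(latency, 0.0), 3))
```

With a low error rate the penalty is 0.5 exactly at the latency threshold, so the agent already sees a substantial penalty before the old binary rule would have fired.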
+### Task 2.2: Activate the Barrier Function
+**File:** `server/AntiAtropos_environment.py` lines 213-222, `stability.py`
+Add `compute_barrier()` to the reward formula:
+```python
+raw_reward = compute_reward(
+    v_prev=self._prev_lyapunov,
+    v_curr=current_lyapunov,
+    cost=cost,
+    sla_violation_step=sla_violation_step,  # now smooth, not binary
+    alpha=ALPHA,
+    beta=BETA,
+    gamma=GAMMA,
+    barrier=compute_barrier(self._nodes_true),  # NEW
+    delta=DELTA,                                 # NEW weight
+)
+```
+Update `compute_reward()` in `stability.py` to accept and include the barrier term:
+```
+R_t = -(α·ΔV + β·Cost + γ·SLA_smooth + δ·Barrier)
+```
+### Task 2.3: Per-Node Reward Decomposition
+**File:** `server/AntiAtropos_environment.py`, new method `_compute_node_rewards()`
+Add per-node reward components to `ClusterObservation` so the agent can learn credit assignment:
+```python
+# In NodeObservation, add:
+node_reward: float = 0.0  # per-node reward contribution
+# Compute as:
+for node in nodes_true:
+    node_delta_v = importance_weight * (node_queue**2 - prev_node_queue**2)
+    node_barrier = max(0, node_queue - Q_BARRIER_MAX) ** 2
+    node_cost = node_capacity * COST_PER_CAPACITY_UNIT_PER_HOUR
+    node_reward = -(ALPHA * node_delta_v + DELTA * node_barrier + BETA * node_cost)
+```
+This tells the agent *which* nodes improved from its actions.
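The per-node decomposition can be tried out in isolation with a sketch like the one below. The `NodeSnapshot` dataclass and all constant values are assumptions made for illustration; the real weights and `Q_BARRIER_MAX` are defined elsewhere in the project and are not shown in this diff.

```python
from dataclasses import dataclass

# Illustrative weights and limits only; the real values are not shown in this diff.
ALPHA, BETA, DELTA = 1.0, 0.1, 0.5
Q_BARRIER_MAX = 0.8
COST_PER_CAPACITY_UNIT_PER_HOUR = 0.05


@dataclass
class NodeSnapshot:
    """Hypothetical per-node view used only for this sketch."""
    queue: float          # normalized queue depth at this tick
    prev_queue: float     # normalized queue depth at the previous tick
    capacity: float
    importance_weight: float = 1.0


def node_reward(n: NodeSnapshot) -> float:
    """Per-node share of the reward: queue drift, barrier overshoot, and cost."""
    delta_v = n.importance_weight * (n.queue ** 2 - n.prev_queue ** 2)
    barrier = max(0.0, n.queue - Q_BARRIER_MAX) ** 2
    cost = n.capacity * COST_PER_CAPACITY_UNIT_PER_HOUR
    return -(ALPHA * delta_v + DELTA * barrier + BETA * cost)


if __name__ == "__main__":
    draining = NodeSnapshot(queue=0.2, prev_queue=0.6, capacity=2.0)
    filling = NodeSnapshot(queue=0.9, prev_queue=0.6, capacity=2.0)
    print(node_reward(draining), node_reward(filling))
```

A node whose queue is draining gets a positive contribution, while one that is filling past the barrier gets a negative one, which is the credit-assignment signal the task describes.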
+---
+## Phase 3: Observation + Action Space
+### Task 3.1: Enrich Observations
+**File:** `models.py` — `NodeObservation`, `inference.py` — `observation_for_model()`
+Add to `NodeObservation`:
+- `capacity: float` — current capacity units (0-5)
+- `pending_capacity: float` — capacity being booted (0-5)
+- `queue_delta: float` — queue depth change from last tick (-1 to +1, normalized)
+- `sla_proximity: float` — how close this node is to SLA violation (0=safe, 1=violating)
+Add to `ClusterObservation`:
+- `reward_components: dict` — breakdown of the reward (drift, cost, sla, barrier)
+Update `observation_for_model()` in `inference.py` to include `is_vip`, `importance_weight`, and the new fields.
+### Task 3.2: Make SHED_LOAD and REROUTE_TRAFFIC Persistent
+**File:** `simulator.py` lines 252, 270-271, 386-390
+- SHED_LOAD: Instead of resetting `shed_fraction=0.0` every tick, decay it by 80% per tick (`shed_fraction *= 0.2`). The agent still needs to re-issue to maintain full effect, but the decay is gradual.
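+- REROUTE_TRAFFIC: Slow the decay so the effect lasts longer: retain 80% of the weight per tick (`weight *= 0.8`) instead of 50% (`weight *= 0.5`). Note that `weight *= 0.2` would make the effect fade faster than the current `*= 0.5`, not slower.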
+### Task 3.3: Add Action Cooldown
+**File:** `control/validation.py`, `server/AntiAtropos_environment.py`
+Track last action per node. If the agent issues SCALE_UP on node-0 twice within 3 ticks, the second one is rejected with "Cooldown: node-0 was scaled 2 ticks ago." This prevents thrashing and teaches the agent to wait for actions to take effect (especially important with BOOT_DELAY_TICKS=5).
+---
+## Phase 4: Training Loop
+### Task 4.1: Episode Replay Buffer
+**File:** New file `replay.py`
+Store episode trajectories (obs, action, reward, done) in a rolling buffer. After each episode:
+1. If `composite_score > SUCCESS_SCORE_THRESHOLD`, store the full trajectory as a "positive example"
+2. If `composite_score < 0.3`, store as a "negative example"
+3. Use positive examples as few-shot demonstrations in the LLM prompt
+```python
+import random
+from collections import deque
+class EpisodeReplayBuffer:
+    def __init__(self, max_episodes: int = 50):
+        self._positive: deque = deque(maxlen=max_episodes)
+        self._negative: deque = deque(maxlen=max_episodes)
+    def store(self, trajectory, score):
+        if score >= 0.55:
+            self._positive.append(trajectory)
+        elif score < 0.3:
+            self._negative.append(trajectory)
+    def sample_demonstrations(self, n: int = 2) -> list:
+        """Sample n positive episodes for few-shot prompting."""
+        return random.sample(self._positive, min(n, len(self._positive)))
+```
+### Task 4.2: Few-Shot Prompt with Demonstrations
+**File:** `inference.py` — `build_user_prompt()`, `SYSTEM_PROMPT`
+Add positive trajectory examples to the prompt. After running a few episodes to populate the buffer:
+```
+Here is an example of a successful action sequence for a similar situation:
+Step 15: {"action_type": "SCALE_UP", "target_node_id": "node-0", "parameter": 0.8} reward=0.72
+Step 16: {"action_type": "NO_OP", "target_node_id": "node-0", "parameter": 0.0} reward=0.81
+...
+```
+### Task 4.3: Multi-Episode Evaluation with Temperature Sweep
+**File:** `inference.py` — `run_single_task()`, `run_all_tasks()`
+- Run each task 3 times instead of once
+- Sweep temperature: [0.0, 0.3, 0.7] across runs
+- Report mean and std of composite score
+- This yields a variance estimate for each task and leaves room for exploration at the higher temperatures
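The aggregation step in Task 4.3 could look roughly like this. It is a sketch under assumptions: `summarize_sweep` is a hypothetical helper, and `run_episode` stands in for one full episode at a given temperature (the real `run_single_task()` is async and returns a report dict, so it would need a thin wrapper).

```python
import statistics
from typing import Callable, Dict, List, Sequence

TEMPERATURE_SWEEP: Sequence[float] = (0.0, 0.3, 0.7)


def summarize_sweep(run_episode: Callable[[float], float],
                    temperatures: Sequence[float] = TEMPERATURE_SWEEP) -> Dict[str, float]:
    """Run one episode per temperature and report mean and sample std of the scores."""
    scores: List[float] = [run_episode(t) for t in temperatures]
    # stdev needs at least two data points; report 0.0 for a single run.
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return {"mean": statistics.fmean(scores), "std": std}


if __name__ == "__main__":
    # Stand-in scorer: pretend the composite score drops as temperature rises.
    print(summarize_sweep(lambda t: 0.6 - 0.2 * t))
```

With only three runs the standard deviation is a rough indicator at best, so it is more useful for spotting unstable tasks than for comparing close scores.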
+### Task 4.4: Curriculum Training
+**File:** New file `curriculum.py`, `inference.py`
+Define progressive difficulty stages:
+```python
+CURRICULUM = [
+    {"task": "task-1", "max_steps": 60,  "difficulty": "easy",   "pass_threshold": 0.50},
+    {"task": "task-1", "max_steps": 100, "difficulty": "normal", "pass_threshold": 0.55},
+    {"task": "task-2", "max_steps": 60,  "difficulty": "easy",   "pass_threshold": 0.45},
+    {"task": "task-3", "max_steps": 60,  "difficulty": "easy",   "pass_threshold": 0.45},
+    {"task": "task-2", "max_steps": 100, "difficulty": "normal", "pass_threshold": 0.55},
+    {"task": "task-3", "max_steps": 100, "difficulty": "normal", "pass_threshold": 0.55},
+]
+```
+The agent must pass each stage before advancing. Failed stages are retried with higher temperature.
+### Task 4.5: Episode-Level Bonuses
+**File:** `grader.py` — `Grade.composite`, `server/AntiAtropos_environment.py`
+Add terminal bonuses to the final step's reward:
+- `+0.5` if zero VIP failures throughout the episode
+- `+0.3` if SLA violations < 3 for the whole episode
+- `+0.2` if no barrier violations (queues never exceeded Q_BARRIER_MAX)
+These reward *prevention*, not just *reaction*.
+---
+## Implementation Order
+```
+Phase 1 (Sim)  →  Phase 2 (Reward)  →  Phase 3 (Obs/Action)  →  Phase 4 (Training)
+     ↓                  ↓                      ↓                        ↓
+  1.1 Latency        2.1 Smooth SLA        3.1 Enrich Obs          4.1 Replay Buffer
+  1.2 Recovery       2.2 Barrier            3.2 Persistent Acts     4.2 Few-Shot
+  1.3 Cascade        2.3 Per-Node Reward    3.3 Cooldown            4.3 Multi-Episode
+                                                                   4.4 Curriculum
+                                                                   4.5 Bonuses
+```
+Each task is independently testable. The reward changes (Phase 2) depend on the sim changes (Phase 1) being done first. The training loop (Phase 4) benefits from all prior phases but can be developed incrementally.

control/validation.py CHANGED

@@ -1,38 +1,69 @@
-from typing import List, Optional
 class ActionValidator:
     """
     Validates SRE actions to ensure they stay within safety boundaries.
     Prevents destructive operations like 100% shedding on critical nodes.
     """
-    def __init__(self, critical_nodes: Optional[List[str]] = None):
         self.critical_nodes = critical_nodes or ["node-0", "node-1", "node-2"]
-    def validate(self, action_type: str, target: str, parameter: float, valid_targets: Optional[List[str]] = None) -> (bool, str):
         """
-        Returns (is_valid, error_message).
         """
         if hasattr(action_type, "value"):
             action = str(action_type.value)
         else:
             action = str(action_type)
         if valid_targets is not None and target not in valid_targets:
-            return False, f"Unknown target node: {target}"
         if action == "SHED_LOAD" and target in self.critical_nodes:
-            return False, f"Forbidden: Load shedding on critical node {target}."
         if action in ["SCALE_UP", "SCALE_DOWN"]:
             if parameter < 0.0:
-                return False, "Negative scaling parameters are not allowed."
             if parameter > 10.0:
-                return False, "Scaling parameter must be <= 10.0."
         if action in ["REROUTE_TRAFFIC", "SHED_LOAD"] and not (0.0 <= parameter <= 1.0):
-            return False, f"{action} parameter must be in [0.0, 1.0]."
         if action == "NO_OP" and parameter != 0.0:
-            return False, "NO_OP requires parameter=0.0."
-        return True, "Success"

+from typing import List, Optional, Tuple
 class ActionValidator:
     """
     Validates SRE actions to ensure they stay within safety boundaries.
     Prevents destructive operations like 100% shedding on critical nodes.
+    Implements soft cooldown for scaling actions: instead of hard-rejecting
+    a rapid re-scale, the action passes with a penalty signal. The environment
+    can use this penalty to reduce the reward, teaching the agent to wait
+    without blocking emergency scaling.
     """
+    def __init__(self, critical_nodes: Optional[List[str]] = None, cooldown_ticks: int = 3):
         self.critical_nodes = critical_nodes or ["node-0", "node-1", "node-2"]
+        self.cooldown_ticks = cooldown_ticks
+        # Track last scale action per node: {node_id: (tick, action_type)}
+        self._last_scale: dict[str, Tuple[int, str]] = {}
+        self._current_tick: int = 0
+    def set_tick(self, tick: int) -> None:
+        """Update the current tick counter for cooldown tracking."""
+        self._current_tick = tick
+    def validate(self, action_type: str, target: str, parameter: float, valid_targets: Optional[List[str]] = None) -> Tuple[bool, str, float]:
         """
+        Returns (is_valid, error_message, cooldown_penalty).
+        cooldown_penalty is in [0, 1]:
+          0.0 = no penalty (action is fine)
+          >0  = soft penalty for rapid re-scaling (action still executes)
+        Hard violations (critical shed, out-of-range) still reject with penalty=0.
         """
         if hasattr(action_type, "value"):
             action = str(action_type.value)
         else:
             action = str(action_type)
+        cooldown_penalty = 0.0
         if valid_targets is not None and target not in valid_targets:
+            return False, f"Unknown target node: {target}", 0.0
         if action == "SHED_LOAD" and target in self.critical_nodes:
+            return False, f"Forbidden: Load shedding on critical node {target}.", 0.0
         if action in ["SCALE_UP", "SCALE_DOWN"]:
             if parameter < 0.0:
+                return False, "Negative scaling parameters are not allowed.", 0.0
             if parameter > 10.0:
+                return False, "Scaling parameter must be <= 10.0.", 0.0
+            # Soft cooldown: penalize but don't block rapid re-scaling.
+            # The window is fixed at cooldown_ticks; a shorter window for DEGRADED nodes is not implemented here.
+            last_tick, last_action = self._last_scale.get(target, (0, ""))
+            ticks_since = self._current_tick - last_tick
+            if ticks_since < self.cooldown_ticks and last_action == action:
+                # Penalty decays linearly: full penalty at 0 ticks, 0 at cooldown_ticks
+                cooldown_penalty = (self.cooldown_ticks - ticks_since) / self.cooldown_ticks
+                # Don't reject — just flag the penalty
+            self._last_scale[target] = (self._current_tick, action)
         if action in ["REROUTE_TRAFFIC", "SHED_LOAD"] and not (0.0 <= parameter <= 1.0):
+            return False, f"{action} parameter must be in [0.0, 1.0].", 0.0
         if action == "NO_OP" and parameter != 0.0:
+            return False, "NO_OP requires parameter=0.0.", 0.0
+        return True, "Success", cooldown_penalty
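The linear penalty used by the soft cooldown is easy to check in isolation. The sketch below restates the formula from `validate()`; `apply_cooldown` and its `weight` are hypothetical, since how the environment folds the penalty into the reward is not visible in this part of the diff.

```python
def cooldown_penalty(ticks_since: int, cooldown_ticks: int = 3) -> float:
    """Linear decay: 1.0 right after an identical scale action, 0.0 at cooldown_ticks."""
    if cooldown_ticks <= 0 or ticks_since >= cooldown_ticks:
        return 0.0
    return (cooldown_ticks - max(ticks_since, 0)) / cooldown_ticks


def apply_cooldown(raw_reward: float, penalty: float, weight: float = 0.1) -> float:
    """Hypothetical reward adjustment: subtract a weighted share of the penalty."""
    return raw_reward - weight * penalty


if __name__ == "__main__":
    for ticks in range(4):
        print(ticks, round(cooldown_penalty(ticks), 3))
```

Because the action still executes, an emergency re-scale costs the agent at most `weight` in reward rather than being blocked outright.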

curriculum.py ADDED

	@@ -0,0 +1,131 @@

+"""
+AntiAtropos Curriculum Training.
+Defines progressive difficulty stages that the agent must pass before advancing.
+Failed stages are retried with higher temperature for exploration.
+Each stage specifies:
+- task: Which task to run
+- max_steps: Episode length (shorter = easier)
+- pass_threshold: Minimum composite score to advance
+- temperature: Suggested LLM temperature for this stage
+- description: Human-readable label
+"""
+from dataclasses import dataclass
+from typing import List, Optional
+@dataclass
+class CurriculumStage:
+    """A single stage in the training curriculum."""
+    task: str
+    max_steps: int
+    pass_threshold: float
+    temperature: float = 0.0
+    description: str = ""
+    retries: int = 0  # Number of failed attempts so far
+    max_retries: int = 3  # Max retries before advancing anyway
+    @property
+    def retry_temperature(self) -> float:
+        """Temperature increases with retries to encourage exploration."""
+        if self.retries == 0:
+            return self.temperature
+        # With a base temperature of 0.0 this yields 0.3, 0.6, 0.9 on successive retries (capped at 1.0)
+        return min(1.0, self.temperature + self.retries * 0.3)
+    @property
+    def should_skip(self) -> bool:
+        """Skip this stage if too many retries."""
+        return self.retries >= self.max_retries
+# Progressive curriculum: start easy, add complexity
+CURRICULUM: List[CurriculumStage] = [
+    CurriculumStage(
+        task="task-1", max_steps=40, pass_threshold=0.40,
+        temperature=0.0, description="Short ramp — learn basic scaling",
+    ),
+    CurriculumStage(
+        task="task-1", max_steps=60, pass_threshold=0.50,
+        temperature=0.0, description="Standard ramp — scale proactively",
+    ),
+    CurriculumStage(
+        task="task-1", max_steps=100, pass_threshold=0.55,
+        temperature=0.0, description="Full ramp — cost-aware scaling",
+    ),
+    CurriculumStage(
+        task="task-2", max_steps=40, pass_threshold=0.35,
+        temperature=0.0, description="Short fault — learn reroute/scale on failure",
+    ),
+    CurriculumStage(
+        task="task-2", max_steps=60, pass_threshold=0.45,
+        temperature=0.3, description="Standard fault — fast recovery",
+    ),
+    CurriculumStage(
+        task="task-3", max_steps=40, pass_threshold=0.35,
+        temperature=0.0, description="Short surge — protect VIP during spike",
+    ),
+    CurriculumStage(
+        task="task-3", max_steps=60, pass_threshold=0.45,
+        temperature=0.3, description="Standard surge — sustained VIP protection",
+    ),
+    # Final combined test
+    CurriculumStage(
+        task="task-1", max_steps=100, pass_threshold=0.55,
+        temperature=0.0, description="Final: full ramp at low temp",
+    ),
+    CurriculumStage(
+        task="task-2", max_steps=60, pass_threshold=0.50,
+        temperature=0.0, description="Final: fault recovery at low temp",
+    ),
+    CurriculumStage(
+        task="task-3", max_steps=60, pass_threshold=0.50,
+        temperature=0.0, description="Final: surge protection at low temp",
+    ),
+]
+class CurriculumTracker:
+    """Tracks progress through the curriculum stages."""
+    def __init__(self, stages: Optional[List[CurriculumStage]] = None):
+        self._stages = stages or CURRICULUM
+        self._current_idx: int = 0
+    @property
+    def current(self) -> CurriculumStage:
+        return self._stages[self._current_idx]
+    @property
+    def current_index(self) -> int:
+        return self._current_idx
+    @property
+    def total_stages(self) -> int:
+        return len(self._stages)
+    @property
+    def is_complete(self) -> bool:
+        return self._current_idx >= len(self._stages)
+    def report_score(self, score: float) -> bool:
+        """Report a score for the current stage. Returns True if passed."""
+        if score >= self.current.pass_threshold:
+            self._current_idx += 1
+            return True
+        else:
+            self.current.retries += 1
+            if self.current.should_skip:
+                self._current_idx += 1
+            return False
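+    def progress_summary(self) -> str:
+        if self.is_complete:
+            # self.current would raise IndexError once every stage has been passed or skipped
+            return f"Curriculum complete ({self.total_stages}/{self.total_stages} stages)"
+        stage = self.current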
+        return (
+            f"Stage {self._current_idx + 1}/{self.total_stages}: "
+            f"{stage.description} "
+            f"(task={stage.task}, max_steps={stage.max_steps}, "
+            f"threshold={stage.pass_threshold}, retries={stage.retries})"
+        )
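The pass, retry, and skip rules can be exercised without the environment by feeding scripted scores through them. The sketch below uses a trimmed stand-in, `Stage`, and a hypothetical `run_curriculum` helper rather than the classes above. It also copies the stages first, because `CurriculumTracker` falls back to the module-level `CURRICULUM` list and mutates `retries` on those shared objects.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass
class Stage:
    """Trimmed stand-in for CurriculumStage with the same retry rules."""
    task: str
    pass_threshold: float
    temperature: float = 0.0
    retries: int = 0
    max_retries: int = 3

    @property
    def retry_temperature(self) -> float:
        return min(1.0, self.temperature + self.retries * 0.3)


def run_curriculum(stages: List[Stage], scores: List[float]) -> List[str]:
    """Feed scripted scores through the pass/retry/skip rules and log each attempt."""
    stages = [replace(s) for s in stages]   # copy so retries never leak between runs
    log: List[str] = []
    idx = 0
    for score in scores:
        if idx >= len(stages):
            break                            # curriculum complete
        stage = stages[idx]
        temp = stage.retry_temperature       # temperature used for this attempt
        if score >= stage.pass_threshold:
            log.append(f"{stage.task} temp={temp:.1f} pass")
            idx += 1
        else:
            stage.retries += 1
            skipped = stage.retries >= stage.max_retries
            log.append(f"{stage.task} temp={temp:.1f} {'skip' if skipped else 'retry'}")
            if skipped:
                idx += 1
    return log


if __name__ == "__main__":
    demo = [Stage("task-1", 0.5), Stage("task-2", 0.5)]
    for line in run_curriculum(demo, [0.2, 0.6, 0.1, 0.1, 0.1]):
        print(line)
```

In this run the second stage is skipped after its third failed attempt, which matches `should_skip` firing once `retries` reaches `max_retries`.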

grader.py CHANGED

@@ -60,25 +60,41 @@ class Grade:
         Weights deliberately penalise cost heavily so that brute-force
         SCALE_UP spam cannot achieve a high composite even with perfect uptime.
         Hardening:
         - Task 3 coupling: Cost only rewards if Uptime is >= 50%. Stops 'Cheap-but-Dead'.
         - Invalid Action Penalty: -0.05 per forbidden command (SHED_LOAD on critical).
         """
         uptime = self.scores["uptime"]
         stability = self.scores["stability"]
         cost = self.scores["cost"]
         invalid_penalty = self.scores.get("invalid_actions", 0) * 0.05
         if self.task_id == "task-3":
             # Coupling: If uptime < 0.5, the cost benefit is zeroed out.
-            # Mirroring real-world priority: Budget doesn't matter if the site is down.
             cost_weight = 1.0 if uptime >= 0.5 else 0.0
             score = (0.4 * uptime + 0.2 * stability + 0.4 * (cost * cost_weight))
         else:
             score = (0.4 * uptime + 0.2 * stability + 0.4 * cost)
-        return max(0.0, score - invalid_penalty)
     def summary(self) -> str:
         s = self.scores
@@ -150,13 +166,15 @@ class EpisodeGrader:
         # ── 4. Invalid Action tracking ──────────────────────────────────────
         total_invalid = self._records[-1].get("invalid_action_count", 0)
         return Grade(self.task_id, {
             "uptime": uptime_score,
             "cost": cost_score,
             "stability": stability_score,
             "violations": total_violations,
-            "invalid_actions": total_invalid
         })

         Weights deliberately penalise cost heavily so that brute-force
         SCALE_UP spam cannot achieve a high composite even with perfect uptime.
         Hardening:
         - Task 3 coupling: Cost only rewards if Uptime is >= 50%. Stops 'Cheap-but-Dead'.
         - Invalid Action Penalty: -0.05 per forbidden command (SHED_LOAD on critical).
+        - Episode bonuses: Prevention rewards that DON'T overlap with step-level
+          reward signals (no double-counting). These are:
+            +0.10 if zero VIP failures throughout the episode
+            +0.05 if SLA violations < 3 for the whole episode
+            +0.05 if no invalid actions
+        These bonuses are small and additive, avoiding overlap with the
+        step-level reward which already penalizes SLA violations and barrier
+        breaches on each tick. The bonuses reward *sustained* prevention.
         """
         uptime = self.scores["uptime"]
         stability = self.scores["stability"]
         cost = self.scores["cost"]
         invalid_penalty = self.scores.get("invalid_actions", 0) * 0.05
+        # Episode-level prevention bonuses (NOT in step reward to avoid double-counting)
+        bonus = 0.0
+        if self.scores.get("vip_failure_count", 0) == 0:
+            bonus += 0.10  # Zero VIP failures all episode
+        if self.scores.get("violations", 0) < 3:
+            bonus += 0.05  # Very few SLA violations all episode
+        if self.scores.get("invalid_actions", 0) == 0:
+            bonus += 0.05  # Clean actions all episode
         if self.task_id == "task-3":
             # Coupling: If uptime < 0.5, the cost benefit is zeroed out.
             cost_weight = 1.0 if uptime >= 0.5 else 0.0
             score = (0.4 * uptime + 0.2 * stability + 0.4 * (cost * cost_weight))
         else:
             score = (0.4 * uptime + 0.2 * stability + 0.4 * cost)
+        return max(0.0, min(1.0, score - invalid_penalty + bonus))
     def summary(self) -> str:
         s = self.scores
         # ── 4. Invalid Action tracking ──────────────────────────────────────
         total_invalid = self._records[-1].get("invalid_action_count", 0)
+        total_vip_failures = self._records[-1].get("vip_failure_count", 0)
         return Grade(self.task_id, {
             "uptime": uptime_score,
             "cost": cost_score,
             "stability": stability_score,
             "violations": total_violations,
+            "invalid_actions": total_invalid,
+            "vip_failure_count": total_vip_failures,
         })

inference.py CHANGED

@@ -14,6 +14,7 @@ from openai import AsyncOpenAI
 from AntiAtropos.client import AntiAtroposEnv
 from AntiAtropos.grader import EpisodeGrader
 from AntiAtropos.models import ActionType, SREAction
 load_dotenv()
@@ -39,6 +40,8 @@ TEMPERATURE = float(os.getenv("ANTIATROPOS_TEMPERATURE", "0.0"))
 MAX_TOKENS = int(os.getenv("ANTIATROPOS_MAX_TOKENS", "180"))
 SEED = int(os.getenv("ANTIATROPOS_SEED", "42"))
 SUCCESS_SCORE_THRESHOLD = float(os.getenv("ANTIATROPOS_SUCCESS_THRESHOLD", "0.55"))
 TASK_BRIEFS: Dict[str, str] = {
     "task-1": "Traffic increases linearly. Scale proactively to keep latency low and cost efficient.",
@@ -142,9 +145,10 @@ async def open_env(message_timeout_s: int):
     raise RuntimeError("Missing environment target. Set ENV_URL/ANTIATROPOS_ENV_URL or LOCAL_IMAGE_NAME.")
-def build_user_prompt(task_id: str, step: int, obs: dict, history: List[str]) -> str:
     recent = "\n".join(history[-4:]) if history else "None"
     brief = TASK_BRIEFS.get(task_id, "Maintain SLA, stability, and efficient cost.")
     return textwrap.dedent(
         f"""
         Task: {task_id}
@@ -155,7 +159,7 @@ def build_user_prompt(task_id: str, step: int, obs: dict, history: List[str]) ->
         {json.dumps(obs, separators=(",", ":"))}
         Recent decisions:
-        {recent}
         Choose the next SRE action.
         """
@@ -174,15 +178,25 @@ def observation_for_model(obs) -> dict:
         "total_queue_backlog": obs.total_queue_backlog,
         "sla_violations": obs.sla_violations,
         "invalid_action_count": obs.invalid_action_count,
         "nodes": [
             {
                 "node_id": node.node_id,
                 "status": getattr(node.status, "value", str(node.status)),
                 "is_vip": node.is_vip,
                 "queue_depth": node.queue_depth,
                 "latency_ms": node.latency_ms,
                 "incoming_request_rate": node.incoming_request_rate,
                 "cpu_utilization": node.cpu_utilization,
             }
             for node in obs.nodes
         ],
@@ -209,8 +223,8 @@ def _parse_action(payload: dict) -> SREAction:
     )
-async def get_model_action(client: AsyncOpenAI, task_id: str, step: int, obs: dict, history: List[str]) -> SREAction:
-    prompt = build_user_prompt(task_id=task_id, step=step, obs=obs, history=history)
     try:
         completion = await client.chat.completions.create(
             model=MODEL_NAME,
@@ -241,15 +255,17 @@ def _compact_action(action: SREAction) -> str:
     return json.dumps(payload, separators=(",", ":"))
-async def run_single_task(env: AntiAtroposEnv, client: AsyncOpenAI, task_id: str) -> dict:
-    task_seed = _task_seed(SEED, task_id)
     result = await env.reset(task_id=task_id, mode=ENV_MODE, seed=task_seed)
     grader = EpisodeGrader(task_id=task_id)
     grader.record(result.observation)
     history: List[str] = []
     rewards: List[float] = []
     steps_taken = 0
     for step in range(1, MAX_STEPS_PER_TASK + 1):
         if result.done:
             break
@@ -260,6 +276,7 @@ async def run_single_task(env: AntiAtroposEnv, client: AsyncOpenAI, task_id: str
             step=step,
             obs=observation_for_model(result.observation),
             history=history,
         )
         result = await env.step(action)
         grader.record(result.observation)
@@ -270,12 +287,39 @@ async def run_single_task(env: AntiAtroposEnv, client: AsyncOpenAI, task_id: str
         action_str = _compact_action(action)
         history.append(f"step={step} action={action_str} reward={reward:.2f}")
         error = getattr(result.observation, "last_action_error", None)
         log_step(step=step, action=action_str, reward=reward, done=bool(result.done), error=error)
     grade = grader.score()
     score = _strict_score(float(grade.composite))
     success = score >= SUCCESS_SCORE_THRESHOLD
     return {
         "task_id": task_id,
         "success": success,
@@ -295,29 +339,58 @@ async def run_all_tasks() -> None:
         raise RuntimeError("Missing API key (API_KEY/HF_TOKEN/OPENAI_API_KEY).")
     client = AsyncOpenAI(base_url=API_BASE_URL, api_key=API_KEY)
     try:
         async with open_env(MESSAGE_TIMEOUT_S) as env:
             for task in tasks_to_run:
-                success = False
-                steps = 0
-                score = 0.0
-                rewards: List[float] = []
-                task_error: Optional[Exception] = None
-                log_start(task=task, env=BENCHMARK, model=MODEL_NAME)
-                try:
-                    report = await run_single_task(env=env, client=client, task_id=task)
-                    success = bool(report["success"])
-                    steps = int(report["steps"])
-                    score = _strict_score(float(report["score"]))
-                    rewards = list(report["rewards"])
-                except Exception as exc:
-                    task_error = exc
                     score = 0.0
-                finally:
-                    log_end(success=success, steps=steps, score=score, rewards=rewards)
-                if task_error is not None:
-                    raise InferenceError(f"Task {task} failed.") from task_error
     finally:
         await client.close()

 from AntiAtropos.client import AntiAtroposEnv
 from AntiAtropos.grader import EpisodeGrader
 from AntiAtropos.models import ActionType, SREAction
+from AntiAtropos.replay import EpisodeReplayBuffer, compress_trajectory
 load_dotenv()
 MAX_TOKENS = int(os.getenv("ANTIATROPOS_MAX_TOKENS", "180"))
 SEED = int(os.getenv("ANTIATROPOS_SEED", "42"))
 SUCCESS_SCORE_THRESHOLD = float(os.getenv("ANTIATROPOS_SUCCESS_THRESHOLD", "0.55"))
+EVAL_RUNS = int(os.getenv("ANTIATROPOS_EVAL_RUNS", "3"))  # Num eval runs per task
+TEMPERATURE_SWEEP = [0.0, 0.3, 0.7]  # Fixed temperatures for multi-episode eval
 TASK_BRIEFS: Dict[str, str] = {
     "task-1": "Traffic increases linearly. Scale proactively to keep latency low and cost efficient.",
     raise RuntimeError("Missing environment target. Set ENV_URL/ANTIATROPOS_ENV_URL or LOCAL_IMAGE_NAME.")
+def build_user_prompt(task_id: str, step: int, obs: dict, history: List[str], demo_text: str = "") -> str:
     recent = "\n".join(history[-4:]) if history else "None"
     brief = TASK_BRIEFS.get(task_id, "Maintain SLA, stability, and efficient cost.")
+    demo_section = f"\n\n{demo_text}" if demo_text else ""
     return textwrap.dedent(
         f"""
         Task: {task_id}
         {json.dumps(obs, separators=(",", ":"))}
         Recent decisions:
+        {recent}{demo_section}
         Choose the next SRE action.
         """
         "total_queue_backlog": obs.total_queue_backlog,
         "sla_violations": obs.sla_violations,
         "invalid_action_count": obs.invalid_action_count,
+        "reward_drift": getattr(obs, "reward_drift", 0.0),
+        "reward_cost": getattr(obs, "reward_cost", 0.0),
+        "reward_sla": getattr(obs, "reward_sla", 0.0),
+        "reward_barrier": getattr(obs, "reward_barrier", 0.0),
         "nodes": [
             {
                 "node_id": node.node_id,
                 "status": getattr(node.status, "value", str(node.status)),
                 "is_vip": node.is_vip,
+                "importance_weight": node.importance_weight,
                 "queue_depth": node.queue_depth,
                 "latency_ms": node.latency_ms,
                 "incoming_request_rate": node.incoming_request_rate,
                 "cpu_utilization": node.cpu_utilization,
+                "capacity": getattr(node, "capacity", 0.0),
+                "pending_capacity": getattr(node, "pending_capacity", 0.0),
+                "queue_delta": getattr(node, "queue_delta", 0.0),
+                "sla_proximity": getattr(node, "sla_proximity", 0.0),
+                "node_reward": getattr(node, "node_reward", 0.0),
             }
             for node in obs.nodes
         ],
     )
+async def get_model_action(client: AsyncOpenAI, task_id: str, step: int, obs: dict, history: List[str], demo_text: str = "") -> SREAction:
+    prompt = build_user_prompt(task_id=task_id, step=step, obs=obs, history=history, demo_text=demo_text)
     try:
         completion = await client.chat.completions.create(
             model=MODEL_NAME,
     return json.dumps(payload, separators=(",", ":"))
+async def run_single_task(env: AntiAtroposEnv, client: AsyncOpenAI, task_id: str, temperature: float = 0.0, replay_buffer: Optional[EpisodeReplayBuffer] = None, run_seed: Optional[int] = None) -> dict:
+    task_seed = run_seed if run_seed is not None else _task_seed(SEED, task_id)
     result = await env.reset(task_id=task_id, mode=ENV_MODE, seed=task_seed)
     grader = EpisodeGrader(task_id=task_id)
     grader.record(result.observation)
     history: List[str] = []
     rewards: List[float] = []
+    raw_steps: List[dict] = []  # For replay buffer compression
     steps_taken = 0
+    demo_text = replay_buffer.format_demonstrations() if replay_buffer else ""
     for step in range(1, MAX_STEPS_PER_TASK + 1):
         if result.done:
             break
             step=step,
             obs=observation_for_model(result.observation),
             history=history,
+            demo_text=demo_text,
         )
         result = await env.step(action)
         grader.record(result.observation)
         action_str = _compact_action(action)
         history.append(f"step={step} action={action_str} reward={reward:.2f}")
+        # Collect raw step data for replay compression
+        obs = result.observation
+        raw_steps.append({
+            "step": step,
+            "action_type": action.action_type.value,
+            "target_node_id": action.target_node_id,
+            "parameter": float(action.parameter),
+            "reward": reward,
+            "avg_latency_norm": getattr(obs, "average_latency_ms", 0.0),
+            "error_rate": getattr(obs, "error_rate", 0.0),
+            "queue_backlog_norm": getattr(obs, "total_queue_backlog", 0.0),
+            "sla_violation": reward < 0.3,  # proxy: a low step reward is treated as an SLA violation
+        })
         error = getattr(result.observation, "last_action_error", None)
         log_step(step=step, action=action_str, reward=reward, done=bool(result.done), error=error)
     grade = grader.score()
     score = _strict_score(float(grade.composite))
     success = score >= SUCCESS_SCORE_THRESHOLD
+    # Store in replay buffer if available
+    if replay_buffer is not None and raw_steps:
+        trajectory = compress_trajectory(
+            steps=raw_steps,
+            task_id=task_id,
+            score=score,
+            total_steps=steps_taken,
+            final_sla_violations=int(grade.scores.get("violations", 0)),
+            final_invalid_actions=int(grade.scores.get("invalid_actions", 0)),
+        )
+        replay_buffer.store(trajectory, score)
     return {
         "task_id": task_id,
         "success": success,
         raise RuntimeError("Missing API key (API_KEY/HF_TOKEN/OPENAI_API_KEY).")
     client = AsyncOpenAI(base_url=API_BASE_URL, api_key=API_KEY)
+    replay_buffer = EpisodeReplayBuffer()
     try:
         async with open_env(MESSAGE_TIMEOUT_S) as env:
             for task in tasks_to_run:
+                task_scores: List[float] = []
+                task_successes: List[bool] = []
+                for run_idx in range(EVAL_RUNS):
+                    # Fixed seed per (task, run_idx) so runs are reproducible
+                    # and comparable across temperature conditions.
+                    run_seed = _task_seed(SEED, task) + run_idx  # hash(str) is salted per process; assumes _task_seed is deterministic
+                    temperature = TEMPERATURE_SWEEP[run_idx % len(TEMPERATURE_SWEEP)]
+                    success = False
+                    steps = 0
                     score = 0.0
+                    rewards: List[float] = []
+                    task_error: Optional[Exception] = None
+                    log_start(task=f"{task} run={run_idx+1}/{EVAL_RUNS} temp={temperature}", env=BENCHMARK, model=MODEL_NAME)
+                    try:
+                        report = await run_single_task(
+                            env=env,
+                            client=client,
+                            task_id=task,
+                            temperature=temperature,
+                            replay_buffer=replay_buffer,
+                            run_seed=run_seed,
+                        )
+                        success = bool(report["success"])
+                        steps = int(report["steps"])
+                        score = _strict_score(float(report["score"]))
+                        rewards = list(report["rewards"])
+                        task_scores.append(score)
+                        task_successes.append(success)
+                    except Exception as exc:
+                        task_error = exc
+                        score = 0.0
+                    finally:
+                        log_end(success=success, steps=steps, score=score, rewards=rewards)
+                    if task_error is not None:
+                        raise InferenceError(f"Task {task} run {run_idx+1} failed.") from task_error
+                # Report aggregate stats
+                if task_scores:
+                    mean_score = sum(task_scores) / len(task_scores)
+                    std_score = (sum((s - mean_score) ** 2 for s in task_scores) / len(task_scores)) ** 0.5
+                    print(
+                        f"[AGGREGATE] task={task} mean_score={mean_score:.3f} "
+                        f"std={std_score:.3f} runs={len(task_scores)}",
+                        flush=True,
+                    )
     finally:
         await client.close()
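
A per-(task, run) seed is only reproducible across processes when the task-dependent term does not come from Python's built-in `hash()`, which is salted per interpreter run for strings unless `PYTHONHASHSEED` is fixed. A minimal sketch of a stable derivation follows; `stable_run_seed` and the task id `"task-1"` are hypothetical names used for illustration, not part of this PR.

```python
import zlib


def stable_run_seed(base_seed: int, task_id: str, run_idx: int) -> int:
    """Hypothetical helper: same inputs give the same seed in every process."""
    # crc32 of the UTF-8 bytes is a fixed function of the task name,
    # unlike hash(str), which is salted per interpreter run.
    task_term = zlib.crc32(task_id.encode("utf-8")) % 100
    return base_seed * 1000 + task_term + run_idx


# Consecutive runs of the same task get consecutive seeds.
print(stable_run_seed(42, "task-1", 0))
print(stable_run_seed(42, "task-1", 1))
```

The layout mirrors the `SEED * 1000 + task term + run_idx` shape used in the evaluation loop, so only the source of the task term changes.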

models.py CHANGED

@@ -84,6 +84,37 @@ class NodeObservation(BaseModel):
         description="Business criticality weight. VIP nodes have higher impact on scoring.",
     )
     # Episode interaction fields (handled by framework)
     done: bool = False
     reward: float = 0.0
@@ -158,6 +189,24 @@ class ClusterObservation(BaseModel):
     raw_reward: float = 0.0
     normalized_reward: float = Field(default=0.0, ge=0.0, le=1.0)
     reward_scale_version: str = "sigmoid-v1"
     choke_level: float = 0.0
     nodes: list[NodeObservation]

         description="Business criticality weight. VIP nodes have higher impact on scoring.",
     )
+    capacity: float = Field(
+        default=0.0,
+        ge=0.0,
+        description="Provisioned capacity for this node, normalized to [0, 1] (capacity units / 5).",
+    )
+    pending_capacity: float = Field(
+        default=0.0,
+        ge=0.0,
+        description="Capacity being booted, normalized as units / 5 (will be live after boot delay).",
+    )
+    queue_delta: float = Field(
+        default=0.0,
+        ge=-1.0,
+        le=1.0,
+        description="Normalized queue depth change from previous tick (-1 to +1).",
+    )
+    sla_proximity: float = Field(
+        default=0.0,
+        ge=0.0,
+        le=1.0,
+        description="How close this node is to SLA violation (0=safe, 1=violating).",
+    )
+    node_reward: float = Field(
+        default=0.0,
+        description="Per-node reward contribution for credit assignment.",
+    )
     # Episode interaction fields (handled by framework)
     done: bool = False
     reward: float = 0.0
     raw_reward: float = 0.0
     normalized_reward: float = Field(default=0.0, ge=0.0, le=1.0)
     reward_scale_version: str = "sigmoid-v1"
+    # Reward components breakdown
+    reward_drift: float = Field(
+        default=0.0,
+        description="Lyapunov drift component of the reward.",
+    )
+    reward_cost: float = Field(
+        default=0.0,
+        description="Infrastructure cost component of the reward.",
+    )
+    reward_sla: float = Field(
+        default=0.0,
+        description="SLA penalty component of the reward.",
+    )
+    reward_barrier: float = Field(
+        default=0.0,
+        description="Barrier function penalty component of the reward.",
+    )
     choke_level: float = 0.0
     nodes: list[NodeObservation]

replay.py ADDED

	@@ -0,0 +1,230 @@

+"""
+AntiAtropos Episode Replay Buffer.
+Stores episode trajectories for few-shot demonstrations during inference.
+Uses summarization/compression to keep context window manageable:
+- Only stores key transition windows (action, reward spike, SLA violation)
+- Compresses long stable stretches into single summary lines
+- Caps total demonstration size to avoid LLM context overflow
+"""
+import random
+from collections import deque
+from dataclasses import dataclass, field
+from typing import List, Optional
+@dataclass
+class Transition:
+    """A single step in an episode trajectory."""
+    step: int
+    action_type: str
+    target_node_id: str
+    parameter: float
+    reward: float
+    avg_latency_norm: float
+    error_rate: float
+    queue_backlog_norm: float
+    sla_violation: bool
+@dataclass
+class EpisodeTrajectory:
+    """A compressed episode trajectory for few-shot prompting."""
+    task_id: str
+    score: float
+    # Full trajectory is NOT stored — only key transitions
+    key_transitions: List[Transition] = field(default_factory=list)
+    total_steps: int = 0
+    final_sla_violations: int = 0
+    final_invalid_actions: int = 0
+    def to_prompt_lines(self, max_lines: int = 8) -> List[str]:
+        """Convert to concise prompt lines: at most max_lines transitions plus one summary line.
+        Summarization strategy:
+        1. Always include first action (shows opening strategy)
+        2. Always include highest-reward action (shows what worked)
+        3. Always include last action (shows closing strategy)
+        4. Fill remaining with transitions near SLA violations
+        5. If still under max_lines, add evenly-spaced transitions
+        """
+        if not self.key_transitions:
+            return []
+        lines: List[str] = []
+        selected: List[Transition] = []
+        # Always take first
+        selected.append(self.key_transitions[0])
+        # Always take highest-reward
+        best = max(self.key_transitions, key=lambda t: t.reward)
+        if best not in selected:
+            selected.append(best)
+        # Always take last
+        last = self.key_transitions[-1]
+        if last not in selected:
+            selected.append(last)
+        # Add transitions near SLA violations (up to 2)
+        violation_trans = [t for t in self.key_transitions if t.sla_violation and t not in selected]
+        for vt in violation_trans[:2]:
+            selected.append(vt)
+        # Fill with evenly-spaced transitions
+        remaining = max_lines - len(selected)
+        if remaining > 0 and len(self.key_transitions) > len(selected):
+            stride = max(1, len(self.key_transitions) // (remaining + 1))
+            for i in range(stride, len(self.key_transitions), stride):
+                if self.key_transitions[i] not in selected and remaining > 0:
+                    selected.append(self.key_transitions[i])
+                    remaining -= 1
+        # Sort by step and format
+        selected.sort(key=lambda t: t.step)
+        for t in selected[:max_lines]:
+            action_str = f'{{"action_type":"{t.action_type}","target_node_id":"{t.target_node_id}","parameter":{t.parameter:.2f}}}'
+            lines.append(f"Step {t.step}: {action_str} reward={t.reward:.2f}")
+        # Add summary
+        lines.append(
+            f"[Episode summary: score={self.score:.2f}, "
+            f"steps={self.total_steps}, "
+            f"SLA_violations={self.final_sla_violations}]"
+        )
+        return lines
+class EpisodeReplayBuffer:
+    """
+    Rolling buffer of episode trajectories for few-shot learning.
+    Addresses context explosion by:
+    1. Storing only compressed trajectories (key transitions, not full)
+    2. Capping demonstration size at MAX_DEMO_LINES per prompt inclusion
+    3. Sampling at most MAX_DEMOS_PER_PROMPT trajectories
+    """
+    MAX_DEMO_LINES: int = 8  # Max lines per trajectory in prompt
+    MAX_DEMOS_PER_PROMPT: int = 2  # Max trajectories included in prompt
+    def __init__(self, max_episodes: int = 50):
+        self._positive: deque[EpisodeTrajectory] = deque(maxlen=max_episodes)
+        self._negative: deque[EpisodeTrajectory] = deque(maxlen=max_episodes)
+    def store(self, trajectory: EpisodeTrajectory, score: float) -> None:
+        """Store an episode trajectory, categorized by score."""
+        if score >= 0.55:
+            self._positive.append(trajectory)
+        elif score < 0.3:
+            self._negative.append(trajectory)
+    def sample_demonstrations(self, n: Optional[int] = None) -> List[EpisodeTrajectory]:
+        """Sample n positive episodes for few-shot prompting."""
+        if n is None:
+            n = self.MAX_DEMOS_PER_PROMPT
+        if not self._positive:
+            return []
+        return random.sample(list(self._positive), min(n, len(self._positive)))
+    def format_demonstrations(self) -> str:
+        """Format sampled demonstrations into a prompt-ready string.
+        Returns empty string if no demonstrations available.
+        Total output is bounded by 1 + MAX_DEMOS_PER_PROMPT * (MAX_DEMO_LINES + 2) lines
+        (preamble, then per demo: header, transitions, episode summary).
+        """
+        demos = self.sample_demonstrations()
+        if not demos:
+            return ""
+        parts = []
+        for i, demo in enumerate(demos):
+            lines = demo.to_prompt_lines(max_lines=self.MAX_DEMO_LINES)
+            if lines:
+                parts.append(f"Example {i+1} (task={demo.task_id}):")
+                parts.extend(lines)
+        if not parts:
+            return ""
+        return "Successful episode examples:\n" + "\n".join(parts)
+def compress_trajectory(
+    steps: List[dict],
+    task_id: str,
+    score: float,
+    total_steps: int,
+    final_sla_violations: int = 0,
+    final_invalid_actions: int = 0,
+) -> EpisodeTrajectory:
+    """Compress a raw step list into a trajectory with only key transitions.
+    Raw steps are dicts with keys:
+        step, action_type, target_node_id, parameter, reward,
+        avg_latency_norm, error_rate, queue_backlog_norm, sla_violation
+    Key transition selection:
+    - First step
+    - Last step
+    - Steps with SLA violations
+    - Steps with highest/lowest reward
+    - Steps where action changed direction (e.g. SCALE_UP then SCALE_DOWN)
+    """
+    if not steps:
+        return EpisodeTrajectory(
+            task_id=task_id,
+            score=score,
+            total_steps=total_steps,
+            final_sla_violations=final_sla_violations,
+            final_invalid_actions=final_invalid_actions,
+        )
+    # Always include first and last
+    key_indices = {0, len(steps) - 1}
+    # Include SLA violations
+    for i, s in enumerate(steps):
+        if s.get("sla_violation"):
+            key_indices.add(i)
+    # Include reward extremes
+    if len(steps) > 2:
+        best_idx = max(range(len(steps)), key=lambda i: steps[i].get("reward", 0))
+        worst_idx = min(range(len(steps)), key=lambda i: steps[i].get("reward", 0))
+        key_indices.add(best_idx)
+        key_indices.add(worst_idx)
+    # Include action direction changes
+    for i in range(1, len(steps)):
+        prev_action = steps[i - 1].get("action_type", "")
+        curr_action = steps[i].get("action_type", "")
+        if prev_action != curr_action:
+            key_indices.add(i)
+    # Build compressed transitions (sorted)
+    key_transitions = []
+    for i in sorted(key_indices):
+        s = steps[i]
+        key_transitions.append(Transition(
+            step=s.get("step", i),
+            action_type=s.get("action_type", "NO_OP"),
+            target_node_id=s.get("target_node_id", "node-0"),
+            parameter=s.get("parameter", 0.0),
+            reward=s.get("reward", 0.0),
+            avg_latency_norm=s.get("avg_latency_norm", 0.0),
+            error_rate=s.get("error_rate", 0.0),
+            queue_backlog_norm=s.get("queue_backlog_norm", 0.0),
+            sla_violation=s.get("sla_violation", False),
+        ))
+    return EpisodeTrajectory(
+        task_id=task_id,
+        score=score,
+        key_transitions=key_transitions,
+        total_steps=total_steps,
+        final_sla_violations=final_sla_violations,
+        final_invalid_actions=final_invalid_actions,
+    )
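
The key-transition rules in `compress_trajectory` can be illustrated on a toy trajectory. The sketch below assumes a hypothetical helper, `select_key_indices`, that mirrors only the index-selection part (first and last step, SLA violations, reward extremes, action-type changes); the toy step dicts are invented for illustration.

```python
from typing import Dict, List


def select_key_indices(steps: List[Dict]) -> List[int]:
    """Hypothetical helper mirroring the selection rules in compress_trajectory."""
    if not steps:
        return []
    keep = {0, len(steps) - 1}  # first and last step
    # Steps flagged as SLA violations
    keep.update(i for i, s in enumerate(steps) if s.get("sla_violation"))
    if len(steps) > 2:  # reward extremes (first occurrence on ties)
        rewards = [s.get("reward", 0.0) for s in steps]
        keep.add(rewards.index(max(rewards)))
        keep.add(rewards.index(min(rewards)))
    for i in range(1, len(steps)):  # action type differs from the previous step
        if steps[i - 1].get("action_type") != steps[i].get("action_type"):
            keep.add(i)
    return sorted(keep)


toy = [
    {"action_type": "NO_OP", "reward": 0.60, "sla_violation": False},
    {"action_type": "NO_OP", "reward": 0.62, "sla_violation": False},
    {"action_type": "NO_OP", "reward": 0.61, "sla_violation": False},
    {"action_type": "SCALE_UP", "reward": 0.20, "sla_violation": True},
    {"action_type": "SCALE_UP", "reward": 0.70, "sla_violation": False},
    {"action_type": "SCALE_UP", "reward": 0.65, "sla_violation": False},
    {"action_type": "SCALE_UP", "reward": 0.66, "sla_violation": False},
]
print(select_key_indices(toy))  # [0, 3, 4, 6]
```

Indices 1, 2 and 5 are dropped because they are stable stretches: no violation, no reward extreme, and no change of action type.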

server/AntiAtropos_environment.py CHANGED

@@ -10,13 +10,13 @@ from openenv.core.env_server.types import State
 try:
     from ..models import SREAction, ClusterObservation, NodeObservation, NodeStatus, EnvironmentMode
     from ..simulator import ClusterSimulator, COST_PER_CAPACITY_UNIT_PER_HOUR
-    from ..stability import compute_lyapunov, compute_reward, normalize_reward, REWARD_SCALE_VERSION
     from ..telemetry import PrometheusClient, get_observability_tracker
     from ..control import KubernetesExecutor, ActionValidator
 except ImportError:
     from models import SREAction, ClusterObservation, NodeObservation, NodeStatus, EnvironmentMode  # type: ignore[no-redef]
     from simulator import ClusterSimulator, COST_PER_CAPACITY_UNIT_PER_HOUR  # type: ignore[no-redef]
-    from stability import compute_lyapunov, compute_reward, normalize_reward, REWARD_SCALE_VERSION  # type: ignore[no-redef]
     from telemetry import PrometheusClient, get_observability_tracker  # type: ignore[no-redef]
     from control import KubernetesExecutor, ActionValidator  # type: ignore[no-redef]
@@ -25,9 +25,10 @@ except ImportError:
 # Reward hyper-parameters (synchronized with stability.py constants)
 # ---------------------------------------------------------------------------
-ALPHA: float = 0.002   # Weight on Lyapunov energy drift ΔV(s) (Increased for faster feedback)
 BETA:  float = 0.01    # Weight on infrastructure cost (Reduced to prevent cheap-but-dead strategies)
 GAMMA: float = 10.0    # Weight on per-step SLA violation indicator (Increased to force reactive scaling)
 MAX_QUEUE_NORM = 200.0
 MAX_LATENCY_NORM = 1000.0
@@ -66,6 +67,7 @@ class AntiAtroposEnvironment(Environment):
         self._nodes_true: list[dict] = []
         self._nodes_obs: list[dict] = []
         self._prev_lyapunov: float = 0.0
         self._sla_violations: int = 0
         self._action_ack_status: str = "success"
@@ -74,6 +76,10 @@ class AntiAtroposEnvironment(Environment):
         self._last_executor_error_code: str = ""
         self._last_raw_reward: float = 0.0
         self._last_normalized_reward: float = 0.0
         self._reward_output_mode: str = os.getenv("ANTIATROPOS_REWARD_OUTPUT_MODE", "normalized").strip().lower()
         if self._reward_output_mode not in REWARD_OUTPUT_MODES:
             self._reward_output_mode = "normalized"
@@ -140,14 +146,13 @@ class AntiAtroposEnvironment(Environment):
         is_enabled, mode_error = self._is_action_enabled_for_mode(action.action_type)
         if not is_enabled:
             self._action_ack_status = f"Rejected: {mode_error}"
-            # Capability gate rejections happen before executor invocation, so
-            # they should be tracked as rejected actions (ack_class) rather than
-            # executor failures.
             self._last_executor_error_code = ""
             is_valid = False
             error = mode_error
         else:
-            is_valid, error = self._validator.validate(
             action.action_type,
             action.target_node_id,
             action.parameter,
@@ -196,34 +201,51 @@ class AntiAtroposEnvironment(Environment):
             self._last_metric_time = time.time()
         # 4. Extract states (Ground Truth for reward; Observation for agent)
         self._nodes_true = self._sim.state(for_agent=False)
         self._nodes_obs  = self._sim.state(for_agent=True)
-        # 5. SLA Check
-        avg_latency = self._avg_latency(self._nodes_true)
         error_rate  = self._error_rate(self._nodes_true)
-        sla_violation_step = 1 if (avg_latency > 200.0 or error_rate > 0.05) else 0
-        if sla_violation_step:
             self._sla_violations += 1
         # 6. Compute Lyapunov stability metrics from Ground Truth
         current_lyapunov = compute_lyapunov(self._nodes_true)
-        # 7. Compute scalar reward
         cost = self._compute_cost(self._nodes_true)
         raw_reward = compute_reward(
             v_prev=self._prev_lyapunov,
             v_curr=current_lyapunov,
             cost=cost,
-            sla_violation_step=sla_violation_step,
             alpha=ALPHA,
             beta=BETA,
-            gamma=GAMMA
         )
         normalized_reward = normalize_reward(raw_reward)
         reward = normalized_reward if self._reward_output_mode == "normalized" else raw_reward
         self._last_raw_reward = raw_reward
         self._last_normalized_reward = normalized_reward
         self._prev_lyapunov = current_lyapunov
@@ -348,8 +370,40 @@ class AntiAtroposEnvironment(Environment):
     def _build_observation(self) -> ClusterObservation:
         """Assembles the ClusterObservation from the current observed simulator state."""
-        node_obs = [
-            NodeObservation(
                 node_id=n["node_id"],
                 status=n["status"],
                 queue_depth=min(1.0, max(0.0, float(n["queue_depth"]) / MAX_QUEUE_NORM)),
@@ -358,11 +412,14 @@ class AntiAtroposEnvironment(Environment):
                 cpu_utilization=min(1.0, max(0.0, float(n["cpu_utilization"]))),
                 is_vip=bool(n.get("is_vip", False)),
                 importance_weight=float(n.get("importance_weight", 1.0)),
                 done=False,
                 reward=0.0,
-            )
-            for n in self._nodes_obs
-        ]
         freshness = int((time.time() - self._last_metric_time) * 1000) if self._last_metric_time > 0 else 0
@@ -391,6 +448,10 @@ class AntiAtroposEnvironment(Environment):
             raw_reward=self._last_raw_reward,
             normalized_reward=self._last_normalized_reward,
             reward_scale_version=REWARD_SCALE_VERSION,
             choke_level=0.0,
             done=False,
             reward=0.0,

 try:
     from ..models import SREAction, ClusterObservation, NodeObservation, NodeStatus, EnvironmentMode
     from ..simulator import ClusterSimulator, COST_PER_CAPACITY_UNIT_PER_HOUR
+    from ..stability import compute_lyapunov, compute_reward, compute_barrier, normalize_reward, smooth_sla_penalty, REWARD_SCALE_VERSION
     from ..telemetry import PrometheusClient, get_observability_tracker
     from ..control import KubernetesExecutor, ActionValidator
 except ImportError:
     from models import SREAction, ClusterObservation, NodeObservation, NodeStatus, EnvironmentMode  # type: ignore[no-redef]
     from simulator import ClusterSimulator, COST_PER_CAPACITY_UNIT_PER_HOUR  # type: ignore[no-redef]
+    from stability import compute_lyapunov, compute_reward, compute_barrier, normalize_reward, smooth_sla_penalty, REWARD_SCALE_VERSION  # type: ignore[no-redef]
     from telemetry import PrometheusClient, get_observability_tracker  # type: ignore[no-redef]
     from control import KubernetesExecutor, ActionValidator  # type: ignore[no-redef]
 # Reward hyper-parameters (synchronized with stability.py constants)
 # ---------------------------------------------------------------------------
+ALPHA: float = 0.002   # Weight on Lyapunov energy drift DeltaV(s) (Increased for faster feedback)
 BETA:  float = 0.01    # Weight on infrastructure cost (Reduced to prevent cheap-but-dead strategies)
 GAMMA: float = 10.0    # Weight on per-step SLA violation indicator (Increased to force reactive scaling)
+DELTA: float = 0.005   # Weight on control-barrier function penalty (queue safety zone)
 MAX_QUEUE_NORM = 200.0
 MAX_LATENCY_NORM = 1000.0
         self._nodes_true: list[dict] = []
         self._nodes_obs: list[dict] = []
+        self._prev_nodes_true: list[dict] = []  # For per-node queue delta + reward
         self._prev_lyapunov: float = 0.0
         self._sla_violations: int = 0
         self._action_ack_status: str = "success"
         self._last_executor_error_code: str = ""
         self._last_raw_reward: float = 0.0
         self._last_normalized_reward: float = 0.0
+        self._last_reward_drift: float = 0.0
+        self._last_reward_cost: float = 0.0
+        self._last_reward_sla: float = 0.0
+        self._last_reward_barrier: float = 0.0
         self._reward_output_mode: str = os.getenv("ANTIATROPOS_REWARD_OUTPUT_MODE", "normalized").strip().lower()
         if self._reward_output_mode not in REWARD_OUTPUT_MODES:
             self._reward_output_mode = "normalized"
         is_enabled, mode_error = self._is_action_enabled_for_mode(action.action_type)
         if not is_enabled:
             self._action_ack_status = f"Rejected: {mode_error}"
             self._last_executor_error_code = ""
             is_valid = False
             error = mode_error
+            cooldown_penalty = 0.0
         else:
+            self._validator.set_tick(self._state.step_count)
+            is_valid, error, cooldown_penalty = self._validator.validate(
             action.action_type,
             action.target_node_id,
             action.parameter,
             self._last_metric_time = time.time()
         # 4. Extract states (Ground Truth for reward; Observation for agent)
+        self._prev_nodes_true = self._nodes_true  # Save for per-node delta
         self._nodes_true = self._sim.state(for_agent=False)
         self._nodes_obs  = self._sim.state(for_agent=True)
+        # 5. SLA Check (smooth sigmoid penalty instead of binary cliff)
+        avg_latency_norm = self._avg_latency(self._nodes_true) / MAX_LATENCY_NORM
         error_rate  = self._error_rate(self._nodes_true)
+        sla_penalty_step = smooth_sla_penalty(avg_latency_norm, error_rate)
+        # Track binary violations for the grader (backward compat)
+        if avg_latency_norm > 0.20 or error_rate > 0.05:
             self._sla_violations += 1
         # 6. Compute Lyapunov stability metrics from Ground Truth
         current_lyapunov = compute_lyapunov(self._nodes_true)
+        # 7. Compute scalar reward (with barrier function)
         cost = self._compute_cost(self._nodes_true)
+        barrier = compute_barrier(self._nodes_true)
         raw_reward = compute_reward(
             v_prev=self._prev_lyapunov,
             v_curr=current_lyapunov,
             cost=cost,
+            sla_violation_step=sla_penalty_step,
             alpha=ALPHA,
             beta=BETA,
+            gamma=GAMMA,
+            barrier=barrier,
+            delta=DELTA,
         )
         normalized_reward = normalize_reward(raw_reward)
+        # Apply soft cooldown penalty: reduces reward for rapid re-scaling
+        # without blocking the action (emergency scaling still goes through)
+        if cooldown_penalty > 0:
+            normalized_reward = max(0.0, normalized_reward - cooldown_penalty * 0.1)
         reward = normalized_reward if self._reward_output_mode == "normalized" else raw_reward
         self._last_raw_reward = raw_reward
         self._last_normalized_reward = normalized_reward
+        # Store reward component breakdown for the observation
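+        # Mirror the module-level import fallback so this also works outside the package
+        try:
+            from ..stability import compute_drift, BARRIER_NORM_SCALE
+        except ImportError:
+            from stability import compute_drift, BARRIER_NORM_SCALE  # type: ignore[no-redef]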
+        delta_v = compute_drift(self._prev_lyapunov, current_lyapunov)
+        barrier_norm = barrier / BARRIER_NORM_SCALE if BARRIER_NORM_SCALE > 0 else barrier
+        self._last_reward_drift = -(ALPHA * delta_v)
+        self._last_reward_cost = -(BETA * cost)
+        self._last_reward_sla = -(GAMMA * sla_penalty_step)
+        self._last_reward_barrier = -(DELTA * barrier_norm)
         self._prev_lyapunov = current_lyapunov
     def _build_observation(self) -> ClusterObservation:
         """Assembles the ClusterObservation from the current observed simulator state."""
+        # Build a lookup for previous node state (for queue_delta and node_reward)
+        prev_by_id: dict[str, dict] = {n["node_id"]: n for n in self._prev_nodes_true}
+        node_obs = []
+        for n in self._nodes_obs:
+            # Per-node queue delta (normalized)
+            true_n = next((t for t in self._nodes_true if t["node_id"] == n["node_id"]), n)
+            prev_n = prev_by_id.get(n["node_id"])
+            if prev_n:
+                queue_delta_raw = float(n["queue_depth"]) - float(prev_n.get("queue_depth", 0))
+                queue_delta = max(-1.0, min(1.0, queue_delta_raw / MAX_QUEUE_NORM))
+            else:
+                queue_delta = 0.0
+            # Per-node reward contribution (normalized)
+            # Uses same formula as global reward but per-node
+            weight = float(n.get("importance_weight", 1.0))
+            if prev_n:
+                prev_q = float(prev_n.get("queue_depth", 0))
+                curr_q = float(true_n["queue_depth"])
+                node_drift = weight * (curr_q ** 2 - prev_q ** 2)
+                node_barrier = max(0, curr_q - 150.0) ** 2  # Q_BARRIER_MAX=150
+                node_cost = float(true_n.get("capacity_units", 0)) * COST_PER_CAPACITY_UNIT_PER_HOUR
+                node_reward_raw = -(ALPHA * node_drift + DELTA * (node_barrier / 10000.0) + BETA * node_cost)
+                # Normalize to [-1, 0] range
+                node_reward_val = max(-1.0, min(0.0, node_reward_raw / 10.0))
+            else:
+                node_reward_val = 0.0
+            # SLA proximity: how close this node is to violating (normalized)
+            node_latency_norm = min(1.0, max(0.0, float(n["latency_ms"]) / MAX_LATENCY_NORM))
+            sla_prox = max(0.0, min(1.0, node_latency_norm / 0.20))  # 0.20 is SLA threshold
+            node_obs.append(NodeObservation(
                 node_id=n["node_id"],
                 status=n["status"],
                 queue_depth=min(1.0, max(0.0, float(n["queue_depth"]) / MAX_QUEUE_NORM)),
                 cpu_utilization=min(1.0, max(0.0, float(n["cpu_utilization"]))),
                 is_vip=bool(n.get("is_vip", False)),
                 importance_weight=float(n.get("importance_weight", 1.0)),
+                capacity=float(n.get("capacity_units", 0)) / 5.0,  # Normalize to [0,1]
+                pending_capacity=float(n.get("pending_capacity_units", 0)) / 5.0,
+                queue_delta=queue_delta,
+                sla_proximity=sla_prox,
+                node_reward=node_reward_val,
                 done=False,
                 reward=0.0,
+            ))
         freshness = int((time.time() - self._last_metric_time) * 1000) if self._last_metric_time > 0 else 0
             raw_reward=self._last_raw_reward,
             normalized_reward=self._last_normalized_reward,
             reward_scale_version=REWARD_SCALE_VERSION,
+            reward_drift=self._last_reward_drift,
+            reward_cost=self._last_reward_cost,
+            reward_sla=self._last_reward_sla,
+            reward_barrier=self._last_reward_barrier,
             choke_level=0.0,
             done=False,
             reward=0.0,
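
The per-node `queue_delta` and `sla_proximity` values are both clamped ratios against the normalization constants. A minimal standalone sketch of that arithmetic follows, assuming the constants shown in this diff; `clamp`, `SLA_LATENCY_NORM`, and the two function names are hypothetical and exist only for this illustration.

```python
MAX_QUEUE_NORM = 200.0     # from the diff
MAX_LATENCY_NORM = 1000.0  # from the diff
SLA_LATENCY_NORM = 0.20    # SLA threshold used in the diff (200 ms / 1000 ms)


def clamp(value: float, low: float, high: float) -> float:
    return max(low, min(high, value))


def queue_delta(curr_queue: float, prev_queue: float) -> float:
    """Queue change since the previous tick, scaled into [-1, 1]."""
    return clamp((curr_queue - prev_queue) / MAX_QUEUE_NORM, -1.0, 1.0)


def sla_proximity(latency_ms: float) -> float:
    """0.0 means zero latency, 1.0 means at or beyond the SLA threshold."""
    latency_norm = clamp(latency_ms / MAX_LATENCY_NORM, 0.0, 1.0)
    return clamp(latency_norm / SLA_LATENCY_NORM, 0.0, 1.0)


print(queue_delta(120.0, 80.0))  # queue grew by 40 of a possible 200
print(sla_proximity(100.0))      # halfway to the 200 ms threshold
print(sla_proximity(450.0))      # past the threshold, clamped to 1.0
```

Because `sla_proximity` saturates at the threshold, it tells the agent how close a node is to violating but not how far past the SLA it already is.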

simulator.py CHANGED

@@ -30,6 +30,9 @@ BASE_LATENCY_MS:      float = 20.0    # Minimum processing time
 OVERLOAD_THRESHOLD:   int   = 80      # Request count where node begins to "fail" (DEGRADED)
 LATENCY_STEEPNESS:    float = 2.0     # Increased to ensure SLA violations before death
 FATAL_FAIL_THRESHOLD: int   = 200     # Hard cap on queue depth (catastrophic failure boundary)
 SENSOR_DROPOUT_PROB:  float = 0.05    # P(node.queue, latency reports 0 or -1.0)
 NODE_FAILURE_PROB:    float = 0.00    # P(node fails naturally) — largely driven by task profile
@@ -90,6 +93,8 @@ class NodeState:
     dropped_requests: float = 0.0
     shed_fraction: float = 0.0       # Fraction of incoming traffic to drop this tick
     pending_capacity_queue: list[int] = field(default_factory=list)
     # Derived (recomputed whenever capacity or status changes)
     @property
@@ -114,6 +119,8 @@ class NodeState:
             "shed_fraction": round(self.shed_fraction, 4),
             "capacity_units": int(self.capacity),
             "pending_capacity_units": int(len(self.pending_capacity_queue)),
         }
@@ -146,6 +153,8 @@ class ClusterSimulator:
         self._t3_surge_end:   int = T3_SURGE_BASE_END
         # Per-node reroute weights for REROUTE_TRAFFIC (node_id → fraction)
         self._reroute_weights: dict[str, float] = {}
         self._nodes: list[NodeState] = []
         self.invalid_action_count: int = 0
         self._randomize_domain()
@@ -176,6 +185,7 @@ class ClusterSimulator:
                 node_id=f"node-{i}",
                 is_vip=f"node-{i}" in VIP_NODE_WEIGHTS,
                 importance_weight=VIP_NODE_WEIGHTS.get(f"node-{i}", 1.0),
             )
             for i in range(self._n_nodes)
         ]
@@ -190,6 +200,8 @@ class ClusterSimulator:
         self._tick_count = 0
         self._failed_node_id = None
         self._reroute_weights = {}
         self.invalid_action_count = 0
         self._randomize_domain()
         self._reset_nodes()
@@ -266,9 +278,16 @@ class ClusterSimulator:
         self._update_queues()
         self._update_derived_metrics()
         self._update_statuses()
-        # decay/reset shed fractions for next tick
         for node in self._nodes:
-            node.shed_fraction = 0.0
     def _update_capacity(self) -> None:
         """Process pending capacity from SCALE_UP actions"""
@@ -296,6 +315,10 @@ class ClusterSimulator:
                 self._failed_node_id = self._rng.choice(
                     [n.node_id for n in self._nodes if n.node_id != "node-0"]
                 )
             # Physics change: In Task 2, we do NOT redistribute dead node traffic
             # automatically. The infrastructure keeps sending λ/N to the failed node
@@ -384,8 +407,10 @@ class ClusterSimulator:
                     n.incoming_request_rate += share
         # Decay weights — agent must keep re-issuing to maintain effect
         for nid in list(self._reroute_weights.keys()):
-            self._reroute_weights[nid] *= 0.5
             if self._reroute_weights[nid] < 0.01:
                 del self._reroute_weights[nid]
@@ -421,24 +446,93 @@ class ClusterSimulator:
             # Utilization = Ratio of λ to μ
             service_rate = n.service_rate
             n.cpu_utilization = n.incoming_request_rate / service_rate if service_rate > 0 else 1.0
-            # Latency (simplified M/M/1 wait-time model)
-            n.latency_ms = BASE_LATENCY_MS + (n.queue_depth * LATENCY_STEEPNESS)
     def _update_statuses(self) -> None:
-        """Transition node health based on queue boundaries."""
         for n in self._nodes:
-            if n.node_id == self._failed_node_id:
                 n.status = NodeStatus.FAILED
                 continue
             if n.queue_depth > FATAL_FAIL_THRESHOLD:
-                n.status = NodeStatus.FAILED
             elif n.queue_depth > OVERLOAD_THRESHOLD:
                 n.status = NodeStatus.DEGRADED
             elif n.status == NodeStatus.DEGRADED and n.queue_depth < (OVERLOAD_THRESHOLD / 2):
                 n.status = NodeStatus.HEALTHY
     def reconcile_state(self, telemetry_map: dict) -> None:
         """
         Reconcile internal simulator state with external telemetry signals.

 OVERLOAD_THRESHOLD:   int   = 80      # Request count where node begins to "fail" (DEGRADED)
 LATENCY_STEEPNESS:    float = 2.0     # Increased to ensure SLA violations before death
 FATAL_FAIL_THRESHOLD: int   = 200     # Hard cap on queue depth (catastrophic failure boundary)
+CASCADE_WINDOW_TICKS: int = 3     # Ticks after a failure to check for cascade effects
+CASCADE_QUEUE_MULTIPLIER: float = 1.2  # Queue must exceed FATAL_FAIL_THRESHOLD * this to cascade
+NODE_RECOVERY_TICKS: int   = 20      # Ticks before a FAILED node auto-recovers
 SENSOR_DROPOUT_PROB:  float = 0.05    # P(node.queue, latency reports 0 or -1.0)
 NODE_FAILURE_PROB:    float = 0.00    # P(node fails naturally) — largely driven by task profile
     dropped_requests: float = 0.0
     shed_fraction: float = 0.0       # Fraction of incoming traffic to drop this tick
     pending_capacity_queue: list[int] = field(default_factory=list)
+    recovery_timer: int = 0          # Countdown to auto-recovery from FAILED status
+    is_scripted_failure: bool = False  # True if failed due to task scripting (no auto-recovery)
     # Derived (recomputed whenever capacity or status changes)
     @property
             "shed_fraction": round(self.shed_fraction, 4),
             "capacity_units": int(self.capacity),
             "pending_capacity_units": int(len(self.pending_capacity_queue)),
+            "recovery_timer": self.recovery_timer,
+            "is_scripted_failure": self.is_scripted_failure,
         }
         self._t3_surge_end:   int = T3_SURGE_BASE_END
         # Per-node reroute weights for REROUTE_TRAFFIC (node_id → fraction)
         self._reroute_weights: dict[str, float] = {}
+        self._cascade_tick: int = 0  # Tick counter for cascade detection window
+        self._cascade_triggered: bool = False  # Set True when a NEW overload failure occurs
         self._nodes: list[NodeState] = []
         self.invalid_action_count: int = 0
         self._randomize_domain()
                 node_id=f"node-{i}",
                 is_vip=f"node-{i}" in VIP_NODE_WEIGHTS,
                 importance_weight=VIP_NODE_WEIGHTS.get(f"node-{i}", 1.0),
+                is_scripted_failure=False,
             )
             for i in range(self._n_nodes)
         ]
         self._tick_count = 0
         self._failed_node_id = None
         self._reroute_weights = {}
+        self._cascade_tick = 0
+        self._cascade_triggered = False
         self.invalid_action_count = 0
         self._randomize_domain()
         self._reset_nodes()
         self._update_queues()
         self._update_derived_metrics()
         self._update_statuses()
+        self._cascade_failures()
+        self._process_recovery()
+        # Decay shed fractions gradually (retain 80% per tick = slow decay)
+        # The agent must still re-issue to maintain full effect, but the
+        # effect doesn't vanish instantly.  *= 0.8 means after 3 ticks
+        # the shed is still at 51% (0.8^3), vs old 0.0 after 1 tick.
         for node in self._nodes:
+            node.shed_fraction *= 0.8
+            if node.shed_fraction < 0.01:
+                node.shed_fraction = 0.0
     def _update_capacity(self) -> None:
         """Process pending capacity from SCALE_UP actions"""
                 self._failed_node_id = self._rng.choice(
                     [n.node_id for n in self._nodes if n.node_id != "node-0"]
                 )
+                # Mark the chosen node as a scripted (permanent) failure
+                target = next((n for n in self._nodes if n.node_id == self._failed_node_id), None)
+                if target:
+                    target.is_scripted_failure = True
             # Physics change: In Task 2, we do NOT redistribute dead node traffic
             # automatically. The infrastructure keeps sending λ/N to the failed node
                     n.incoming_request_rate += share
         # Decay weights — agent must keep re-issuing to maintain effect
+        # *= 0.8 retains 80% per tick (slow decay, persistent effect).
+        # After 5 ticks without re-issue, effect is at 33% (0.8^5).
         for nid in list(self._reroute_weights.keys()):
+            self._reroute_weights[nid] *= 0.8
             if self._reroute_weights[nid] < 0.01:
                 del self._reroute_weights[nid]
             # Utilization = Ratio of λ to μ
             service_rate = n.service_rate
             n.cpu_utilization = n.incoming_request_rate / service_rate if service_rate > 0 else 1.0
+            # Latency: Hybrid M/M/1 + backlog term
+            # M/M/1 gives a hyperbolic 1/(1 - utilization) blow-up as utilization->1 (the "hockey stick")
+            # Backlog term ensures queue_depth still contributes signal even when
+            # utilization is capped at 0.99, preventing the flattening problem.
+            utilization = min(0.99, n.cpu_utilization)  # cap to prevent infinity
+            mm1_latency = BASE_LATENCY_MS / (1.0 - utilization)
+            backlog_latency = n.queue_depth * LATENCY_STEEPNESS
+            n.latency_ms = mm1_latency + backlog_latency
     def _update_statuses(self) -> None:
+        """Transition node health based on queue boundaries.
+        Recovery rules:
+        - Scripted failures (Task 2 forced node kill): permanent, never auto-recover.
+          Marked by is_scripted_failure=True, recovery_timer=0.
+        - Overload failures (queue > FATAL_FAIL_THRESHOLD): auto-recover after
+          NODE_RECOVERY_TICKS. The agent can learn to reroute away and let the
+          node heal.
+        """
         for n in self._nodes:
+            # Scripted (task-forced) failures are permanent
+            if n.is_scripted_failure:
                 n.status = NodeStatus.FAILED
+                n.recovery_timer = 0
                 continue
             if n.queue_depth > FATAL_FAIL_THRESHOLD:
+                if n.status != NodeStatus.FAILED:
+                    n.status = NodeStatus.FAILED
+                    n.recovery_timer = NODE_RECOVERY_TICKS
+                    self._cascade_triggered = True  # Signal cascade detection
             elif n.queue_depth > OVERLOAD_THRESHOLD:
                 n.status = NodeStatus.DEGRADED
             elif n.status == NodeStatus.DEGRADED and n.queue_depth < (OVERLOAD_THRESHOLD / 2):
                 n.status = NodeStatus.HEALTHY
+    def _cascade_failures(self) -> None:
+        """Detect cascading failure: if a peer node's queue exceeds a heightened
+        threshold within CASCADE_WINDOW_TICKS of a *new* failure, degrade it.
+        Guardrails:
+        - Only triggers when a NEW failure occurred this tick (not any failed node).
+        - Max one cascade step per tick inside the window (limits cascade chains).
+        - Scripted failures (Task 2) do not trigger cascades.
+        """
+        if not self._cascade_triggered:
+            self._cascade_tick = 0
+            return
+        self._cascade_tick += 1
+        if self._cascade_tick > CASCADE_WINDOW_TICKS:
+            self._cascade_triggered = False
+            self._cascade_tick = 0
+            return
+        cascade_threshold = FATAL_FAIL_THRESHOLD * CASCADE_QUEUE_MULTIPLIER
+        cascaded_this_tick = 0
+        for n in self._nodes:
+            if cascaded_this_tick >= 1:
+                break  # Max one cascade per tick (counter resets on every call)
+            if n.status == NodeStatus.FAILED:
+                continue
+            if n.is_scripted_failure:
+                continue
+            if n.queue_depth > cascade_threshold:
+                n.status = NodeStatus.DEGRADED
+                cascaded_this_tick += 1
+    def _process_recovery(self) -> None:
+        """Count down recovery timers and bring FAILED nodes back online.
+        Only overload-failed nodes (recovery_timer > 0) can recover.
+        Scripted failures (is_scripted_failure=True) are excluded.
+        """
+        for n in self._nodes:
+            if n.is_scripted_failure:
+                continue
+            if n.status == NodeStatus.FAILED and n.recovery_timer > 0:
+                n.recovery_timer -= 1
+                if n.recovery_timer <= 0:
+                    n.status = NodeStatus.HEALTHY
+                    n.capacity = 1.0  # Recover at minimum capacity
+                    n.queue_depth = 0.0
+                    n.latency_ms = BASE_LATENCY_MS
+                    n.cpu_utilization = 0.0
     def reconcile_state(self, telemetry_map: dict) -> None:
         """
         Reconcile internal simulator state with external telemetry signals.
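
Reviewer note: a minimal standalone sketch of the hybrid latency formula added in `_update_derived_metrics`, to make the shape of the curve concrete. The constants are copied from the diff; the helper name `hybrid_latency_ms` and its flat argument list are hypothetical and not part of the PR, which operates on `NodeState` fields instead.

```python
# Standalone sketch of the hybrid latency model from this diff.
# hybrid_latency_ms is a hypothetical helper, not a function in simulator.py.
BASE_LATENCY_MS = 20.0    # copied from simulator.py
LATENCY_STEEPNESS = 2.0   # copied from simulator.py


def hybrid_latency_ms(incoming_rate: float, service_rate: float, queue_depth: float) -> float:
    """M/M/1-style term (capped at 99% utilization) plus a linear backlog term."""
    raw_utilization = incoming_rate / service_rate if service_rate > 0 else 1.0
    utilization = min(0.99, raw_utilization)        # cap to avoid division by zero
    mm1_latency = BASE_LATENCY_MS / (1.0 - utilization)
    backlog_latency = queue_depth * LATENCY_STEEPNESS
    return mm1_latency + backlog_latency


if __name__ == "__main__":
    # With an empty queue the M/M/1 term grows as 1 / (1 - utilization).
    for rho in (0.5, 0.9, 0.99):
        print(f"utilization={rho:.2f} -> {hybrid_latency_ms(rho * 100.0, 100.0, 0.0):.1f} ms")
    # Once utilization is capped, only the backlog term still moves.
    for depth in (0.0, 50.0, 200.0):
        print(f"queue_depth={depth:.0f} -> {hybrid_latency_ms(500.0, 100.0, depth):.1f} ms")
```

With an empty queue the M/M/1 term is about 40 ms at 50% utilization and about 200 ms at 90%. Once the 0.99 cap is reached, that term stays near 2000 ms and `queue_depth` is the only input that still changes the result, which is the reason the diff gives for keeping the backlog term.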

stability.py CHANGED

@@ -53,6 +53,13 @@ Q_BARRIER_MAX: float = 150.0
 Set higher than OVERLOAD_THRESHOLD (80) to allow the agent time to react
 before the barrier penalty kicks in."""
 STABILITY_WINDOW: int = 10
 """Number of ticks to look back when judging whether the system is
 trend-stable (V is on a decreasing trajectory)."""
@@ -65,7 +72,7 @@ trend-stable (V is on a decreasing trajectory)."""
 REWARD_NORM_MIDPOINT: float = float(os.getenv("ANTIATROPOS_REWARD_MIDPOINT", "0.0"))
 REWARD_NORM_TEMPERATURE: float = float(os.getenv("ANTIATROPOS_REWARD_TEMPERATURE", "5.0"))
 REWARD_NORM_EPS: float = float(os.getenv("ANTIATROPOS_REWARD_EPS", "1e-8"))
-REWARD_SCALE_VERSION: str = "sigmoid-v1"
 # ---------------------------------------------------------------------------
@@ -245,36 +252,93 @@ def drift_plus_penalty(
 # Convenience: full reward computation (matches environment.py formula)
 # ---------------------------------------------------------------------------
 def compute_reward(
     v_prev: float,
     v_curr: float,
     cost: float,
-    sla_violation_step: int,
     alpha: float = 1.0,
     beta: float = 0.05,
     gamma: float = 2.0,
 ) -> float:
     """
-    R_t = −(α·ΔV(s)  +  β·Cost  +  γ·SLA_violation_step)
     Convenience wrapper that mirrors the reward formula in environment.py.
-    Can be used by the baseline agent to simulate rewards without calling
-    the server, or by the grader to reconstruct reward trajectories.
     Args:
         v_prev:         Lyapunov energy at previous tick.
         v_curr:         Lyapunov energy at current tick.
         cost:           Infrastructure cost this tick (USD/hr).
-        sla_violation_step: 1 if this step violated SLA, else 0.
         alpha:          Weight on Lyapunov drift.
         beta:           Weight on cost.
         gamma:          Weight on SLA violations.
     Returns:
-        Scalar reward (higher is better, always ≤ 0 in a stable episode).
     """
     delta_v = compute_drift(v_prev, v_curr)
-    return -(alpha * delta_v + beta * cost + gamma * sla_violation_step)
 def normalize_reward(

 Set higher than OVERLOAD_THRESHOLD (80) to allow the agent time to react
 before the barrier penalty kicks in."""
+BARRIER_NORM_SCALE: float = 10000.0
+"""Normalization divisor for the barrier term.
+The raw barrier H(s) = sum(max(0, Q_i - Q_max)^2) can produce very large values
+(e.g. 5 nodes at Q=200, Q_max=150 gives 5*2500=12500). Without normalization,
+this dominates the reward. Dividing by this scale keeps barrier in the same
+order of magnitude as the other terms when delta=0.005."""
 STABILITY_WINDOW: int = 10
 """Number of ticks to look back when judging whether the system is
 trend-stable (V is on a decreasing trajectory)."""
 REWARD_NORM_MIDPOINT: float = float(os.getenv("ANTIATROPOS_REWARD_MIDPOINT", "0.0"))
 REWARD_NORM_TEMPERATURE: float = float(os.getenv("ANTIATROPOS_REWARD_TEMPERATURE", "5.0"))
 REWARD_NORM_EPS: float = float(os.getenv("ANTIATROPOS_REWARD_EPS", "1e-8"))
+REWARD_SCALE_VERSION: str = "sigmoid-v2"  # v2: smooth SLA + barrier active
 # ---------------------------------------------------------------------------
 # Convenience: full reward computation (matches environment.py formula)
 # ---------------------------------------------------------------------------
+def smooth_sla_penalty(
+    avg_latency_norm: float,
+    error_rate: float,
+    latency_threshold: float = 0.20,
+    error_threshold: float = 0.05,
+    latency_temperature: float = 0.03,
+    error_temperature: float = 0.01,
+) -> float:
+    """
+    Smooth SLA penalty in [0, 1] that ramps up as metrics approach thresholds.
+    Unlike the binary cliff (0 or 1), this gives the agent gradient signal
+    BEFORE the SLA is actually violated, enabling preventive learning.
+    Uses two sigmoids (one for latency, one for errors) and takes the max
+    so whichever dimension is worse dominates.
+    Args:
+        avg_latency_norm:     Normalized average latency [0, 1].
+        error_rate:           Cluster-wide error rate [0, 1].
+        latency_threshold:    Normalized latency SLA boundary.
+        error_threshold:      Error rate SLA boundary.
+        latency_temperature:  Sigmoid temperature for latency (lower = sharper).
+        error_temperature:    Sigmoid temperature for errors (lower = sharper).
+    Returns:
+        Smooth penalty in [0, 1]. Near 0 when safe, near 1 when violating.
+    Raises:
+        ValueError: If inputs are outside [0, 1], indicating raw (non-normalized)
+            values were passed by mistake. This is a common bug: passing latency
+            in raw ms (e.g. 200.0) instead of normalized [0,1] (e.g. 0.20).
+    """
+    if avg_latency_norm < -0.01 or avg_latency_norm > 1.5:
+        raise ValueError(
+            f"smooth_sla_penalty: avg_latency_norm={avg_latency_norm:.4f} is outside "
+            f"expected [0, 1] range. Did you pass raw ms instead of normalized? "
+            f"Divide by MAX_LATENCY_NORM before calling."
+        )
+    if error_rate < -0.01 or error_rate > 1.5:
+        raise ValueError(
+            f"smooth_sla_penalty: error_rate={error_rate:.4f} is outside "
+            f"expected [0, 1] range."
+        )
+    lat_z = (avg_latency_norm - latency_threshold) / max(1e-8, latency_temperature)
+    err_z = (error_rate - error_threshold) / max(1e-8, error_temperature)
+    lat_penalty = 1.0 / (1.0 + math.exp(-lat_z))
+    err_penalty = 1.0 / (1.0 + math.exp(-err_z))
+    return max(lat_penalty, err_penalty)
 def compute_reward(
     v_prev: float,
     v_curr: float,
     cost: float,
+    sla_violation_step: float = 0.0,
     alpha: float = 1.0,
     beta: float = 0.05,
     gamma: float = 2.0,
+    barrier: float = 0.0,
+    delta: float = 0.005,
 ) -> float:
     """
+    R_t = -(alpha * DeltaV(s) + beta * Cost + gamma * SLA_smooth + delta * Barrier)
     Convenience wrapper that mirrors the reward formula in environment.py.
     Args:
         v_prev:         Lyapunov energy at previous tick.
         v_curr:         Lyapunov energy at current tick.
         cost:           Infrastructure cost this tick (USD/hr).
+        sla_violation_step: Smooth SLA penalty in [0, 1] (was binary 0/1).
         alpha:          Weight on Lyapunov drift.
         beta:           Weight on cost.
         gamma:          Weight on SLA violations.
+        barrier:        Control-barrier function violation energy.
+        delta:          Weight on barrier penalty.
     Returns:
+        Scalar reward (higher is better, always <= 0 in a stable episode).
     """
     delta_v = compute_drift(v_prev, v_curr)
+    # Normalize barrier to prevent reward domination: raw barrier can be ~12500,
+    # after dividing by BARRIER_NORM_SCALE it's ~1.25, then scaled by delta=0.005
+    # gives ~0.006 which is comparable to other terms.
+    barrier_normalized = barrier / BARRIER_NORM_SCALE if BARRIER_NORM_SCALE > 0 else barrier
+    return -(alpha * delta_v + beta * cost + gamma * sla_violation_step + delta * barrier_normalized)
 def normalize_reward(
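
Reviewer note: a small sketch that exercises the smooth SLA penalty and the normalized barrier term with the default values from this diff. It is a simplified re-statement for illustration, not the code in `stability.py`: the names `smooth_sla` and `reward_from_terms` are hypothetical, input validation is omitted, and the drift `delta_v` is passed in directly because `compute_drift` is not shown in this hunk.

```python
import math

BARRIER_NORM_SCALE = 10000.0  # copied from stability.py in this diff


def _sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def smooth_sla(avg_latency_norm: float, error_rate: float,
               latency_threshold: float = 0.20, error_threshold: float = 0.05,
               latency_temperature: float = 0.03, error_temperature: float = 0.01) -> float:
    """Hypothetical re-statement of smooth_sla_penalty without input validation."""
    lat_penalty = _sigmoid((avg_latency_norm - latency_threshold) / latency_temperature)
    err_penalty = _sigmoid((error_rate - error_threshold) / error_temperature)
    return max(lat_penalty, err_penalty)  # the worse dimension dominates


def reward_from_terms(delta_v: float, cost: float, sla: float, barrier: float,
                      alpha: float = 1.0, beta: float = 0.05,
                      gamma: float = 2.0, delta: float = 0.005) -> float:
    """Hypothetical helper: same weighted sum as compute_reward, drift supplied directly."""
    barrier_normalized = barrier / BARRIER_NORM_SCALE
    return -(alpha * delta_v + beta * cost + gamma * sla + delta * barrier_normalized)


if __name__ == "__main__":
    # The penalty ramps up before the 0.20 latency threshold is crossed.
    for lat in (0.10, 0.15, 0.20, 0.25, 0.30):
        print(f"latency_norm={lat:.2f} -> sla_penalty={smooth_sla(lat, 0.0):.3f}")
    # Barrier example from the BARRIER_NORM_SCALE docstring: 5 nodes at Q=200, Q_max=150.
    raw_barrier = 5 * (200 - 150) ** 2
    print("barrier contribution:", -reward_from_terms(0.0, 0.0, 0.0, raw_barrier))
```

Two things are worth noting when reading the output. The penalty is 0.5 exactly at the latency threshold, so an agent sitting right on the SLA boundary already pays half of `gamma`. With the defaults, the docstring's 12500 barrier example contributes only 0.00625 to the reward magnitude.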