Space: saravanatanjiro/Openenv

Commit dfc5996 (1 parent: af6bbef)

Migrate LLM pipeline to custom GRPO with robust rewards

Replace the REINFORCE-style loop with grouped GRPO optimization and add decomposed reward telemetry, GRPO diagnostics, and UI controls so training is more stable, observable, and reproducible.

Files changed (6)

README.md +61 -10
app.py +41 -5
cloud_arena/evaluation.py +85 -0
cloud_arena/llm_environment.py +224 -308
cloud_arena/llm_training.py +350 -306
cloud_arena/visualization.py +51 -0

README.md CHANGED

@@ -5,18 +5,69 @@ colorFrom: blue
 colorTo: purple
 sdk: docker
 pinned: false
-short_description: Cloud Arena Mathematical Model RL Training
 ---
-# Cloud Arena — Mathematical Model RL Training
-Multi-objective cloud operations RL environment trained with **MaskablePPO**.
-This is the **Mathematical Model** (MLP + stable-baselines3), NOT the LLM model.
-## Features
-- 125-dim observation space, 150 discrete actions
-- 6-phase curriculum learning
-- Action masking, fog-of-war, chaos events
-- Boss fight evaluation scenarios
-- Interactive training dashboard

 colorTo: purple
 sdk: docker
 pinned: false
+short_description: Cloud Arena RL with Custom GRPO
 ---
+# Cloud Arena RL
+This Space contains two independent RL systems:
+- Mathematical model RL (`MaskablePPO` + MLP) for structured cloud-ops optimization.
+- LLM RL using a **custom GRPO** training loop with LoRA adaptation.
+## LLM Algorithm
+The LLM pipeline now uses Group Relative Policy Optimization (GRPO):
+1. For each state, sample `K` responses.
+2. Simulate each sampled action on an environment clone.
+3. Compute group-relative normalized advantage.
+4. Optimize a clipped policy objective with KL and entropy regularization.
+Key implementation file:
+- `cloud_arena/llm_training.py`
+## Reward Design
+The LLM environment reward is decomposed into robust components:
+- `cost_delta`: incentivize concrete savings.
+- `risk`: reward lower operational risk.
+- `reliability`: reward safe, stable outcomes.
+- `action_quality`: valid action bonus, tool misuse and veto penalties.
+- `anti_loop`: repeated-action and hesitation penalties.
+- `terminal`: success bonus and failure penalties.
+Safety protections:
+- Semantic veto for production-like resources.
+- Structural crash penalty on dependency-breaking stop/delete actions.
+- Tool reward caps and repeated-action penalties to prevent farming loops.
+Key environment file:
+- `cloud_arena/llm_environment.py`
+## GRPO Runtime Optimizations
+- LoRA fine-tuning over causal LMs.
+- Gradient accumulation and gradient clipping.
+- KL watchdog with adaptive KL coefficient.
+- VRAM cleanup between model runs.
+- Deterministic seeds for reproducible smoke checks.
+## Recommended Config (Smoke Benchmark)
+- Iterations: `30`
+- Steps per episode: `12`
+- Group size: `4`
+- Clip epsilon: `0.2`
+- KL coefficient: `0.01`
+- Entropy coefficient: `0.001`
+- Max generation tokens: `80`
+- Temperature: `0.7`
+## Validation Criteria
+- Determinism: repeated runs with fixed seed show similar trends.
+- Safety: veto/violation rates stay stable or improve.
+- Learning: post-training reward exceeds pre-training baseline for at least one default model.
+- Stability: no persistent NaNs, KL blowups, or reward collapse.
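
Review note: the four GRPO steps listed in the new README can be sketched roughly as follows. This is a minimal illustration, not the code in `cloud_arena/llm_training.py` (which is not shown in this excerpt). The function names, the use of the population standard deviation for normalization, and passing the KL and entropy terms as precomputed scalars are all assumptions made for the example.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the K rewards sampled for one state: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantage, clip_epsilon=0.2):
    """PPO-style clipped term for a single sampled response (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_epsilon, min(1.0 + clip_epsilon, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def grpo_loss(logps_new, logps_old, rewards, clip_epsilon=0.2,
              kl_coef=0.01, entropy_coef=0.001,
              kl_estimate=0.0, entropy_estimate=0.0):
    """Loss to minimize: negative clipped surrogate, plus KL penalty, minus entropy bonus.

    kl_estimate and entropy_estimate are hypothetical precomputed scalars.
    """
    advantages = group_relative_advantages(rewards)
    terms = [
        clipped_surrogate(new, old, adv, clip_epsilon)
        for new, old, adv in zip(logps_new, logps_old, advantages)
    ]
    return -mean(terms) + kl_coef * kl_estimate - entropy_coef * entropy_estimate
```

When every response in a group earns the same reward, all advantages are zero and the group contributes no policy gradient, which is one reason to sample with a nonzero temperature.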

app.py CHANGED

@@ -54,7 +54,17 @@ def run_math_evaluation():
 # ── LLM Model Training ───────────────────────────────────────────────────────
-def run_llm_training(model_name, num_iterations, steps_per_episode):
     from cloud_arena.llm_training import train_llm
     try:
         iters = int(num_iterations)
@@ -62,11 +72,18 @@ def run_llm_training(model_name, num_iterations, steps_per_episode):
             model_name=model_name,
             num_iterations=iters,
             steps_per_episode=int(steps_per_episode),
         )
         delta = all_rewards[-1] - all_rewards[0]
         summary = (
-            f"✅ LLM Training Complete\n"
             f"Model: {model_name}\n"
             f"Pre-training reward: {all_rewards[0]:+.3f}\n"
             f"Post-training reward: {all_rewards[-1]:+.3f}\n"
             f"Δ Change: {delta:+.3f}\n\n"
@@ -98,7 +115,7 @@ with gr.Blocks(title="Cloud Arena RL") as demo:
         eval_btn.click(run_math_evaluation, outputs=eval_output)
     with gr.Tab("🧠 LLM RL"):
-        gr.Markdown("### Multi-Model RL Benchmark — REINFORCE + LoRA")
         gr.Markdown("> Comma-separate model names to benchmark multiple models sequentially")
         llm_model = gr.Textbox(
             value="unsloth/Qwen2.5-Math-7B-Instruct-bnb-4bit, unsloth/gemma-2b-it-bnb-4bit, unsloth/llama-3-8b-Instruct-bnb-4bit",
@@ -106,11 +123,30 @@ with gr.Blocks(title="Cloud Arena RL") as demo:
         )
         llm_iters = gr.Number(value=200, label="Training Iterations per Model")
         llm_steps = gr.Number(value=15, label="Steps per Episode")
         llm_btn = gr.Button("🚀 Start LLM Training", variant="primary")
         llm_output = gr.Textbox(label="Training Log", lines=15)
         llm_img = gr.Image(label="Results")
-        llm_btn.click(run_llm_training, inputs=[llm_model, llm_iters, llm_steps],
-                      outputs=[llm_output, llm_img])
 if __name__ == "__main__":
     demo.launch(server_name="0.0.0.0", server_port=7860, theme=gr.themes.Base())

 # ── LLM Model Training ───────────────────────────────────────────────────────
+def run_llm_training(
+    model_name,
+    num_iterations,
+    steps_per_episode,
+    group_size,
+    clip_epsilon,
+    kl_coef,
+    entropy_coef,
+    max_gen_tokens,
+    temperature,
+):
     from cloud_arena.llm_training import train_llm
     try:
         iters = int(num_iterations)
             model_name=model_name,
             num_iterations=iters,
             steps_per_episode=int(steps_per_episode),
+            group_size=int(group_size),
+            clip_epsilon=float(clip_epsilon),
+            kl_coef=float(kl_coef),
+            entropy_coef=float(entropy_coef),
+            max_gen_tokens=int(max_gen_tokens),
+            temperature=float(temperature),
         )
         delta = all_rewards[-1] - all_rewards[0]
         summary = (
+            f"✅ LLM GRPO Training Complete\n"
             f"Model: {model_name}\n"
+            f"Algorithm: Custom GRPO\n"
             f"Pre-training reward: {all_rewards[0]:+.3f}\n"
             f"Post-training reward: {all_rewards[-1]:+.3f}\n"
             f"Δ Change: {delta:+.3f}\n\n"
         eval_btn.click(run_math_evaluation, outputs=eval_output)
     with gr.Tab("🧠 LLM RL"):
+        gr.Markdown("### Multi-Model RL Benchmark — Custom GRPO + LoRA")
         gr.Markdown("> Comma-separate model names to benchmark multiple models sequentially")
         llm_model = gr.Textbox(
             value="unsloth/Qwen2.5-Math-7B-Instruct-bnb-4bit, unsloth/gemma-2b-it-bnb-4bit, unsloth/llama-3-8b-Instruct-bnb-4bit",
         )
         llm_iters = gr.Number(value=200, label="Training Iterations per Model")
         llm_steps = gr.Number(value=15, label="Steps per Episode")
+        grpo_group = gr.Number(value=4, label="GRPO Group Size (K)")
+        grpo_clip = gr.Number(value=0.2, label="GRPO Clip Epsilon")
+        grpo_kl = gr.Number(value=0.01, label="KL Coefficient")
+        grpo_entropy = gr.Number(value=0.001, label="Entropy Coefficient")
+        grpo_tokens = gr.Number(value=80, label="Max Generation Tokens")
+        grpo_temp = gr.Number(value=0.7, label="Sampling Temperature")
         llm_btn = gr.Button("🚀 Start LLM Training", variant="primary")
         llm_output = gr.Textbox(label="Training Log", lines=15)
         llm_img = gr.Image(label="Results")
+        llm_btn.click(
+            run_llm_training,
+            inputs=[
+                llm_model,
+                llm_iters,
+                llm_steps,
+                grpo_group,
+                grpo_clip,
+                grpo_kl,
+                grpo_entropy,
+                grpo_tokens,
+                grpo_temp,
+            ],
+            outputs=[llm_output, llm_img],
+        )
 if __name__ == "__main__":
     demo.launch(server_name="0.0.0.0", server_port=7860, theme=gr.themes.Base())
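
Review note: `run_llm_training` now casts six UI values inline before calling `train_llm`. One possible way to keep the casting and range checks in one place is a small config object, sketched below. `GRPOConfig` and `from_ui` are hypothetical names that do not exist in this commit, and the validation bounds are assumptions rather than constraints taken from `train_llm`. The defaults mirror the values set on the new `gr.Number` fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GRPOConfig:
    """Hypothetical container for the GRPO hyperparameters exposed in the UI."""
    group_size: int = 4
    clip_epsilon: float = 0.2
    kl_coef: float = 0.01
    entropy_coef: float = 0.001
    max_gen_tokens: int = 80
    temperature: float = 0.7

    def __post_init__(self):
        # Assumed bounds; adjust to whatever train_llm actually tolerates.
        if self.group_size < 2:
            raise ValueError("group_size must be >= 2 for a group-relative baseline")
        if not 0.0 < self.clip_epsilon < 1.0:
            raise ValueError("clip_epsilon must be in (0, 1)")
        if self.kl_coef < 0.0 or self.entropy_coef < 0.0:
            raise ValueError("kl_coef and entropy_coef must be non-negative")
        if self.max_gen_tokens < 1:
            raise ValueError("max_gen_tokens must be >= 1")
        if self.temperature <= 0.0:
            raise ValueError("temperature must be > 0")

    @classmethod
    def from_ui(cls, group_size, clip_epsilon, kl_coef, entropy_coef,
                max_gen_tokens, temperature):
        """Cast raw numeric UI values (often floats) to the expected types."""
        return cls(
            group_size=int(group_size),
            clip_epsilon=float(clip_epsilon),
            kl_coef=float(kl_coef),
            entropy_coef=float(entropy_coef),
            max_gen_tokens=int(max_gen_tokens),
            temperature=float(temperature),
        )
```

With a group size of 1 the group mean equals the single reward, so the normalized advantage is always zero; that is the reason for the lower bound of 2 in this sketch.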

cloud_arena/evaluation.py CHANGED

@@ -124,3 +124,88 @@ def run_boss_fights(model_path="./models/cloud_arena_final",
         boss_scores[s_id] = score
     return boss_scores

         boss_scores[s_id] = score
     return boss_scores
+def evaluate_llm_grpo(model, tokenizer, n_eval=20, steps_per_episode=15, seed=123):
+    """
+    Evaluate LLM policy quality on the FinOps environment using the same
+    ACTION parser logic as training.
+    """
+    import random
+    import torch
+    from cloud_arena.llm_environment import SB3Adapter
+    from cloud_arena.llm_training import extract_action_and_reasoning, format_prompt
+    random.seed(seed)
+    np.random.seed(seed)
+    if torch.cuda.is_available():
+        torch.cuda.manual_seed_all(seed)
+    env = SB3Adapter()
+    metrics = {
+        "episodes": n_eval,
+        "win_rate": 0.0,
+        "avg_savings_pct": 0.0,
+        "avg_episode_len": 0.0,
+        "safety_violation_rate": 0.0,
+        "action_distribution": {str(i): 0 for i in range(5)},
+        "avg_reward_components": {},
+    }
+    wins = 0
+    total_savings = 0.0
+    total_steps = 0
+    total_safety_violations = 0
+    reward_components_sum = {}
+    total_component_steps = 0
+    for _ in range(n_eval):
+        _, _ = env.reset()
+        done = False
+        step_count = 0
+        last_info = {}
+        while not done and step_count < steps_per_episode:
+            state_dict = env.core._get_internal_state()
+            prompt = format_prompt(state_dict)
+            inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
+            input_ids = inputs["input_ids"].to(model.device)
+            attn_mask = inputs["attention_mask"].to(model.device)
+            with torch.no_grad():
+                out = model.generate(
+                    input_ids=input_ids,
+                    attention_mask=attn_mask,
+                    max_new_tokens=80,
+                    do_sample=False,
+                    pad_token_id=tokenizer.pad_token_id,
+                )
+            response = tokenizer.decode(out[0][input_ids.shape[1] :], skip_special_tokens=True)
+            action, _ = extract_action_and_reasoning(response)
+            metrics["action_distribution"][str(action)] += 1
+            _, _, terminated, truncated, info = env.step(action)
+            done = bool(terminated or truncated)
+            step_count += 1
+            last_info = info
+            total_safety_violations += int(info.get("safety_violation", 0))
+            rc = info.get("reward_components", {})
+            for k, v in rc.items():
+                reward_components_sum[k] = reward_components_sum.get(k, 0.0) + float(v)
+            total_component_steps += 1
+        wins += int(last_info.get("win", False))
+        total_savings += float(last_info.get("savings_pct", 0.0))
+        total_steps += step_count
+    total_actions = max(sum(metrics["action_distribution"].values()), 1)
+    metrics["action_distribution"] = {
+        k: round(v / total_actions, 4) for k, v in metrics["action_distribution"].items()
+    }
+    metrics["win_rate"] = round(wins / max(n_eval, 1), 4)
+    metrics["avg_savings_pct"] = round(total_savings / max(n_eval, 1), 3)
+    metrics["avg_episode_len"] = round(total_steps / max(n_eval, 1), 3)
+    metrics["safety_violation_rate"] = round(total_safety_violations / max(total_steps, 1), 4)
+    metrics["avg_reward_components"] = {
+        k: round(v / max(total_component_steps, 1), 4) for k, v in reward_components_sum.items()
+    }
+    return metrics
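
Review note: `evaluate_llm_grpo` pre-populates `action_distribution` with the string keys `"0"` through `"4"` and indexes it with `str(action)`, so any parsed action whose string form is not one of those keys would raise a `KeyError`. Whether that can happen depends on what `extract_action_and_reasoning` returns, which is not visible in this excerpt. A more defensive tally and the same count-to-share normalization could look like the sketch below; `tally_action` and `normalize_counts` are hypothetical helpers, and the `Action` enum is copied here only to keep the example self-contained.

```python
from enum import IntEnum


class Action(IntEnum):
    """Mirror of the Action enum in cloud_arena/llm_environment.py."""
    NOOP = 0
    CHECK_DEPENDENCIES = 1
    RESIZE = 2
    STOP = 3
    DELETE = 4


def tally_action(distribution, action):
    """Count one action, keyed by its integer value as a string."""
    key = str(int(action))  # works for plain ints and IntEnum members alike
    distribution[key] = distribution.get(key, 0) + 1


def normalize_counts(distribution):
    """Convert raw counts to shares, guarding against an empty tally."""
    total = max(sum(distribution.values()), 1)
    return {k: round(v / total, 4) for k, v in distribution.items()}


if __name__ == "__main__":
    dist = {str(i): 0 for i in range(len(Action))}
    for a in (Action.DELETE, 4, Action.NOOP, Action.RESIZE):
        tally_action(dist, a)
    print(normalize_counts(dist))
```

Unexpected action ids are recorded under their own key instead of aborting the evaluation run, which keeps a parser regression visible in the reported distribution.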

cloud_arena/llm_environment.py CHANGED

@@ -1,419 +1,341 @@
-# ============================================================
-# CELL 3 — Cloud FinOps Environment (Final Fixed Version)
-#
-# ALL loopholes closed:
-#   1. CHECK_DEPENDENCIES after cap → hesitation penalty (not 0.0)
-#      This kills the "+0.200 every episode" passive policy
-#   2. W_HESITATION = 0.10 — strong enough to force action
-#   3. Win bonus +2.0 — rewards completing the goal, not just steps
-#   4. RESIZE guaranteed to reduce cost (uniform 0.40-0.65)
-#   5. MIN_DELETABLE_COST_RATIO = 0.35 — win is always reachable
-#   6. Stronger semantic veto — also catches high-dependency temp nodes
-# ============================================================
-import numpy as np
 import gymnasium as gym
 from gymnasium import spaces
-from enum import IntEnum
-import random
 random.seed(42)
 np.random.seed(42)
-# ─── Action Space ─────────────────────────────────────────────────────────────
 class Action(IntEnum):
-    NOOP               = 0
     CHECK_DEPENDENCIES = 1
-    RESIZE             = 2
-    STOP               = 3
-    DELETE             = 4
-NUM_ACTIONS = len(Action)
-# ─── Constants ────────────────────────────────────────────────────────────────
 N_RESOURCES = 6
 OBS_PER_RES = 5
-OBS_DIM     = N_RESOURCES * OBS_PER_RES + 2   # = 32
 PROD_NAMES = [
-    "storage-prod-db",      "core-auth-router",     "primary-k8s-master",
-    "billing-db-01",        "payment-gateway-prod",  "prod-cache-redis",
-    "prod-elb-frontend",    "rds-prod-main",          "main-api-prod",
-    "prod-cosmos-db",       "primary-gke-cluster",   "prod-spanner-db",
 ]
 TEMP_NAMES = [
-    "worker-node-temp",     "test-frontend-ui",      "sandbox-db-04",
-    "batch-processor-temp", "dev-cache-redis",        "temp-worker-88",
-    "staging-api-v2",       "dev-log-collector",      "temp-ecs-task",
-    "dev-gke-node",         "test-bigquery-scratch",  "sandbox-spanner-dev",
 ]
-# ─── Cloud Resource ───────────────────────────────────────────────────────────
 class CloudResource:
     def __init__(self, name, cost_per_hr, cpu_pct, dependencies, is_prod):
-        self.name         = name
-        self.cost_per_hr  = cost_per_hr
-        self.cpu_pct      = cpu_pct
         self.dependencies = dependencies
-        self.is_prod      = is_prod
-        self.active       = True
     def to_obs_array(self):
-        return np.array([
-            self.cpu_pct / 100.0,
-            self.cost_per_hr / 5.0,
-            len(self.dependencies) / 14.0,
-            float(self.is_prod),
-            float(self.active),
-        ], dtype=np.float32)
-# ─── Resource Generator ───────────────────────────────────────────────────────
 def generate_resources(n=N_RESOURCES):
-    """
-    Creates a randomised pool of cloud resources per episode.
-    Guarantees:
-      1. One production trap with low CPU (looks deletable but isn't)
-      2. Temp resources account for >= 35% of total cost
-         so the 20% savings win condition is always reachable
-    """
     resources = []
-    # Guaranteed prod trap — low CPU makes it look safe to delete
     prod_name = random.choice(PROD_NAMES)
-    resources.append(CloudResource(
-        name         = prod_name,
-        cost_per_hr  = round(random.uniform(0.5, 3.0), 2),
-        cpu_pct      = random.randint(2, 12),
-        dependencies = random.sample(TEMP_NAMES, k=random.randint(2, 4)),
-        is_prod      = True,
-    ))
-    # Fill remaining slots with random mix
     for _ in range(n - 1):
-        is_prod   = random.random() < 0.30   # 30% chance prod
         name_pool = PROD_NAMES if is_prod else TEMP_NAMES
         dep_count = random.randint(1, 5) if is_prod else random.randint(0, 3)
-        resources.append(CloudResource(
-            name         = random.choice(name_pool),
-            cost_per_hr  = round(random.uniform(0.8, 4.0), 2),
-            cpu_pct      = random.randint(1, 95),
-            dependencies = random.sample(TEMP_NAMES, k=min(dep_count, len(TEMP_NAMES))),
-            is_prod      = is_prod,
-        ))
-    # ── Guarantee minimum deletable cost ratio ────────────────────────────
-    # Raises temp resource costs until they represent >= 35% of total.
-    # Without this guarantee, some episodes are mathematically unwinnable.
-    MIN_RATIO = 0.35
-    for _ in range(10):   # iterate up to 10x to converge
-        total      = sum(r.cost_per_hr for r in resources)
         temp_total = sum(r.cost_per_hr for r in resources if not r.is_prod)
-        if total > 0 and (temp_total / total) < MIN_RATIO:
             for r in resources:
                 if not r.is_prod:
                     r.cost_per_hr = round(r.cost_per_hr * 1.3, 2)
         else:
             break
     return resources
-# ─── Core Environment (OpenEnv dict API) ─────────────────────────────────────
 class AWSCostEnv:
-    """
-    Cloud FinOps Optimisation Environment — OpenEnv dict API.
-    Wrap with SB3Adapter for stable-baselines3 PPO training.
-    REWARD FORMULA
-    --------------
-    Savings  : clip(delta_cost_pct × W_SAVINGS, -5, +5)
-    Win bonus: +W_WIN_BONUS when savings >= target (one-time)
-    NOOP     : -W_HESITATION per step
-    Tool     : +W_TOOL per new node checked (capped at W_TOOL_EPISODE_CAP)
-               After cap → -W_HESITATION (closes passive policy loophole)
-    Veto     : PENALTY_VETO (semantic guardrail blocked the action)
-    Crash    : PENALTY_CRASH, episode ends immediately
-    KEY LOOPHOLE FIXES
-    ------------------
-    Fix 1 — CHECK after cap returns -W_HESITATION not 0.0
-             Prevents "+0.200 every episode" passive exploit
-    Fix 2 — RESIZE guaranteed to reduce cost (0.40-0.65 multiplier)
-             Prevents zero-saving resize farming
-    Fix 3 — Tool cap resets every episode via reset()
-    Fix 4 — Semantic veto also catches high-dependency temp nodes
-    Fix 5 — Min deletable ratio guarantee makes win always reachable
-    """
-    # ── Reward weights (do not change without updating Cell 4 too) ──────────
-    W_SAVINGS          = 20.0
-    W_HESITATION       = 0.10    # raised: strong enough to force decisive action
-    W_TOOL             = 0.20
-    W_TOOL_EPISODE_CAP = 0.60    # max tool reward per episode (3 uses)
-    W_WIN_BONUS        = 2.0     # one-time bonus for completing the goal
-    PENALTY_CRASH      = -10.0
-    PENALTY_VETO       = -0.50
-    MAX_STEPS          = 100
     def __init__(self, n_resources=N_RESOURCES, target_savings=0.20):
-        self.n_resources    = n_resources
         self.target_savings = target_savings
-        self.resources      = []
-        self.baseline_cost  = 0.0
-        self.current_cost   = 0.0
-        self.current_step   = 0
         self.nodes_investigated_this_episode = set()
-        self.total_tool_reward_this_episode  = 0.0
-    # ── Private helpers ──────────────────────────────────────────────────────────
     def _resource_from_action(self, action_idx):
         idx = (action_idx - 2) % self.n_resources
         return self.resources[idx % len(self.resources)]
     def _has_dependency_violation(self, resource):
-        """True if deleting this resource breaks any other active resource."""
         for other in self.resources:
-            if other.active and other.name != resource.name:
-                if resource.name in other.dependencies:
-                    return True
         return False
     def _calc_cost(self):
         return sum(r.cost_per_hr for r in self.resources if r.active)
     def _get_obs(self):
         obs = []
         for r in self.resources:
             obs.extend(r.to_obs_array())
-        budget_used = (
-            1.0 - (self.current_cost / self.baseline_cost)
-            if self.baseline_cost > 0 else 0.0
-        )
         steps_left = 1.0 - (self.current_step / self.MAX_STEPS)
         obs.extend([budget_used, steps_left])
         return np.array(obs, dtype=np.float32)
     def _get_internal_state(self):
-        """Human-readable state dict for OpenEnv /state endpoint."""
         return {
-            "step":          self.current_step,
             "baseline_cost": self.baseline_cost,
-            "current_cost":  self.current_cost,
-            "savings_pct":   round(
-                (1 - self.current_cost / self.baseline_cost) * 100, 2
-            ) if self.baseline_cost > 0 else 0.0,
-            "resources": [{
-                "name":         r.name,
-                "active":       r.active,
-                "is_prod":      r.is_prod,
-                "cost_per_hr":  r.cost_per_hr,
-                "cpu_pct":      r.cpu_pct,
-                "dependencies": r.dependencies,
-            } for r in self.resources]
         }
-    def _semantic_veto(self, name: str, dep_count: int) -> bool:
-        """
-        Semantic guardrail — returns True if action should be blocked.
-        Two veto triggers:
-          1. Name contains production keywords (primary check)
-          2. High dependency count on any resource (structural safety net)
-             Even temp-named nodes with 5+ deps get vetoed
-             This catches the edge case that caused the -31.800 crash
-        In production: replace with call to fine-tuned Llama inference endpoint.
-        """
-        name_lower    = name.lower()
-        prod_keywords = [
-            "prod", "primary", "main", "core",
-            "billing", "payment", "rds", "master"
-        ]
-        # Primary: semantic name check
-        if any(kw in name_lower for kw in prod_keywords):
-            return True
-        # Secondary: structural safety net — high deps = critical regardless of name
-        if dep_count >= 5:
-            return True
-        return False
-    # ── Lifecycle ─────────────────────────────────────────────────────────────
     def reset(self):
-        """Reset environment for a new episode. Returns OpenEnv dict."""
-        self.current_step   = 0
         self.nodes_investigated_this_episode = set()
-        self.total_tool_reward_this_episode  = 0.0
-        self.resources      = generate_resources(self.n_resources)
-        self.baseline_cost  = self._calc_cost()
-        self.current_cost   = self.baseline_cost
         return {
             "observation": self._get_obs(),
-            "info": {
-                "msg":           "Episode reset",
-                "baseline_cost": self.baseline_cost,
-            }
         }
     def step(self, action):
-        """
-        Execute one environment step.
-        Args:
-            action : int, one of Action enum values (0-4)
-        Returns:
-            dict with keys: observation, state, reward, done, info
-        """
         self.current_step += 1
         truncated = self.current_step >= self.MAX_STEPS
-        # ── 1. NOOP — hesitation penalty ──────────────────────────────────
         if action == Action.NOOP:
-            return {
-                "observation": self._get_obs(),
-                "state":       self._get_internal_state(),
-                "reward":      float(-self.W_HESITATION),
-                "done":        bool(truncated),
-                "info":        {"msg": "Hesitation penalty", "win": False,
-                                "savings_pct": round(
-                    (1 - self.current_cost / self.baseline_cost) * 100, 2)}
-            }
         target = self._resource_from_action(action)
-        # ── 2. CHECK_DEPENDENCIES ─────────────────────────────────────────
-        # LOOPHOLE FIX: After cap is reached, return hesitation penalty
-        # instead of 0.0. This kills the passive "+0.200 every episode" policy.
         if action == Action.CHECK_DEPENDENCIES:
-            under_cap  = self.total_tool_reward_this_episode < self.W_TOOL_EPISODE_CAP
-            new_node   = target.name not in self.nodes_investigated_this_episode
             if new_node and under_cap:
-                # Valid tool use — reward it
                 self.nodes_investigated_this_episode.add(target.name)
                 self.total_tool_reward_this_episode += self.W_TOOL
-                tool_reward = self.W_TOOL
                 msg = f"Checked {target.name}"
             else:
-                # Cap reached or node already checked — penalise like NOOP
-                tool_reward = -self.W_HESITATION
-                msg = "Tool cap reached — penalised"
-            return {
-                "observation": self._get_obs(),
-                "state":       self._get_internal_state(),
-                "reward":      float(tool_reward),
-                "done":        bool(truncated),
-                "info":        {"msg": msg, "win": False,
-                                "savings_pct": round(
-                    (1 - self.current_cost / self.baseline_cost) * 100, 2)}
-            }
-        # ── 3. SEMANTIC + STRUCTURAL GUARDRAIL ────────────────────────────
-        # Blocks dangerous actions using name keywords AND dependency count.
-        # Dependency count fix closes the edge case that caused -31.800 crash.
         danger = action in (Action.STOP, Action.DELETE)
         if danger and self._semantic_veto(target.name, len(target.dependencies)):
-            return {
-                "observation": self._get_obs(),
-                "state":       self._get_internal_state(),
-                "reward":      float(self.PENALTY_VETO),
-                "done":        bool(truncated),
-                "info":        {"msg": f"SEMANTIC VETO on {target.name}",
-                                "win": False,
-                                "savings_pct": round(
-                    (1 - self.current_cost / self.baseline_cost) * 100, 2)}
-            }
-        # ── 4. EXECUTE ACTION ─────────────────────────────────────────────
-        prev_cost = self.current_cost
-        if action == Action.RESIZE:
-            if target.active:
-                old_cost = target.cost_per_hr
-                # LOOPHOLE FIX: 0.40-0.65 multiplier guarantees meaningful reduction
-                target.cost_per_hr = round(
-                    target.cost_per_hr * random.uniform(0.40, 0.65), 2
-                )
-                # Extra safety: if somehow no reduction, penalise
-                if target.cost_per_hr >= old_cost:
-                    target.cost_per_hr = round(old_cost * 0.50, 2)
         elif action in (Action.STOP, Action.DELETE):
-            # ── 5. STRUCTURAL DEPENDENCY CHECK ────────────────────────────
             if self._has_dependency_violation(target):
-                return {
-                    "observation": self._get_obs(),
-                    "state":       self._get_internal_state(),
-                    "reward":      float(self.PENALTY_CRASH),
-                    "done":        True,
-                    "info":        {
-                        "msg":         f"CATASTROPHIC FAILURE: {target.name}",
-                        "win":         False,
-                        "savings_pct": round(
-                            (1 - self.current_cost / self.baseline_cost) * 100, 2)
-                    }
-                }
             target.active = False
-        # ── 6. FINANCIAL REWARD ───────────────────────────────────────────
         self.current_cost = self._calc_cost()
-        delta_pct         = (prev_cost - self.current_cost) / self.baseline_cost
-        savings_reward    = float(np.clip(delta_pct * self.W_SAVINGS, -5.0, 5.0))
-        # ── 7. WIN CONDITION + BONUS ──────────────────────────────────────
-        total_saved = (
-            (self.baseline_cost - self.current_cost) / self.baseline_cost
-        )
         is_win = total_saved >= self.target_savings
-        # One-time win bonus — rewards completing the goal
-        if is_win:
-            savings_reward += self.W_WIN_BONUS
-        is_done = bool(is_win or truncated)
-        return {
-            "observation": self._get_obs(),
-            "state":       self._get_internal_state(),
-            "reward":      savings_reward,
-            "done":        is_done,
-            "info": {
-                "msg":         "Win!" if is_win else "Action Successful",
-                "win":         is_win,
-                "savings_pct": round(total_saved * 100, 2),
-            }
-        }
-# ─── SB3 Adapter (Gymnasium wrapper for PPO) ─────────────────────────────────
 class SB3Adapter(gym.Env):
-    """
-    Wraps AWSCostEnv (OpenEnv dict API) into the Gymnasium 5-tuple API
-    that stable-baselines3 PPO expects.
-    terminated = agent achieved the savings target (win)
-    truncated  = MAX_STEPS reached without winning
-    """
     metadata = {"render_modes": []}
     def __init__(self):
         super().__init__()
         self.core = AWSCostEnv()
         self.action_space = spaces.Discrete(NUM_ACTIONS)
-        self.observation_space = spaces.Box(
-            low=-np.inf, high=np.inf, shape=(OBS_DIM,), dtype=np.float32
-        )
     def reset(self, seed=None, options=None):
         super().reset(seed=seed)
@@ -421,16 +343,10 @@ class SB3Adapter(gym.Env):
         return result["observation"], result["info"]
     def step(self, action):
-        result     = self.core.step(action)
         terminated = result["done"] and result["info"].get("win", False)
-        truncated  = result["done"] and not result["info"].get("win", False)
-        return (
-            result["observation"],
-            result["reward"],
-            terminated,
-            truncated,
-            result["info"],
-        )
     def render(self):
         pass
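
Review note: for comparison with the decomposed reward introduced on the new side, the financial reward removed by this commit (the "REWARD FORMULA" docstring and steps 6 and 7 of the old `step`) can be worked through in isolation. The sketch below reproduces only that arithmetic for a successful RESIZE/STOP/DELETE step; `legacy_step_reward` is a hypothetical name, and the NOOP, tool, veto, and crash branches are deliberately left out.

```python
def legacy_step_reward(prev_cost, new_cost, baseline_cost,
                       target_savings=0.20, w_savings=20.0, w_win_bonus=2.0):
    """Pre-change reward: clipped savings term plus a bonus on reaching the target."""
    # Per-step savings as a fraction of the episode baseline, scaled and clipped.
    delta_pct = (prev_cost - new_cost) / baseline_cost
    reward = max(-5.0, min(5.0, delta_pct * w_savings))

    # The episode is won once cumulative savings reach the target fraction.
    total_saved = (baseline_cost - new_cost) / baseline_cost
    is_win = total_saved >= target_savings
    if is_win:
        reward += w_win_bonus
    return reward, is_win


if __name__ == "__main__":
    # Baseline 10.0/hr: first cut saves 10% (no win), second brings the total to 25% (win).
    print(legacy_step_reward(10.0, 9.0, 10.0))
    print(legacy_step_reward(9.0, 7.5, 10.0))
```

Because the old environment ended the episode on a win, the bonus was paid at most once per episode.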

+import random
+from enum import IntEnum
 import gymnasium as gym
+import numpy as np
 from gymnasium import spaces
 random.seed(42)
 np.random.seed(42)
 class Action(IntEnum):
+    NOOP = 0
     CHECK_DEPENDENCIES = 1
+    RESIZE = 2
+    STOP = 3
+    DELETE = 4
+NUM_ACTIONS = len(Action)
 N_RESOURCES = 6
 OBS_PER_RES = 5
+OBS_DIM = N_RESOURCES * OBS_PER_RES + 2
 PROD_NAMES = [
+    "storage-prod-db", "core-auth-router", "primary-k8s-master",
+    "billing-db-01", "payment-gateway-prod", "prod-cache-redis",
+    "prod-elb-frontend", "rds-prod-main", "main-api-prod",
+    "prod-cosmos-db", "primary-gke-cluster", "prod-spanner-db",
 ]
 TEMP_NAMES = [
+    "worker-node-temp", "test-frontend-ui", "sandbox-db-04",
+    "batch-processor-temp", "dev-cache-redis", "temp-worker-88",
+    "staging-api-v2", "dev-log-collector", "temp-ecs-task",
+    "dev-gke-node", "test-bigquery-scratch", "sandbox-spanner-dev",
 ]
 class CloudResource:
     def __init__(self, name, cost_per_hr, cpu_pct, dependencies, is_prod):
+        self.name = name
+        self.cost_per_hr = cost_per_hr
+        self.cpu_pct = cpu_pct
         self.dependencies = dependencies
+        self.is_prod = is_prod
+        self.active = True
     def to_obs_array(self):
+        return np.array(
+            [
+                self.cpu_pct / 100.0,
+                self.cost_per_hr / 5.0,
+                len(self.dependencies) / 14.0,
+                float(self.is_prod),
+                float(self.active),
+            ],
+            dtype=np.float32,
+        )
 def generate_resources(n=N_RESOURCES):
     resources = []
     prod_name = random.choice(PROD_NAMES)
+    resources.append(
+        CloudResource(
+            name=prod_name,
+            cost_per_hr=round(random.uniform(0.5, 3.0), 2),
+            cpu_pct=random.randint(2, 12),
+            dependencies=random.sample(TEMP_NAMES, k=random.randint(2, 4)),
+            is_prod=True,
+        )
+    )
     for _ in range(n - 1):
+        is_prod = random.random() < 0.30
         name_pool = PROD_NAMES if is_prod else TEMP_NAMES
         dep_count = random.randint(1, 5) if is_prod else random.randint(0, 3)
+        resources.append(
+            CloudResource(
+                name=random.choice(name_pool),
+                cost_per_hr=round(random.uniform(0.8, 4.0), 2),
+                cpu_pct=random.randint(1, 95),
+                dependencies=random.sample(TEMP_NAMES, k=min(dep_count, len(TEMP_NAMES))),
+                is_prod=is_prod,
+            )
+        )
+    min_ratio = 0.35
+    for _ in range(10):
+        total = sum(r.cost_per_hr for r in resources)
         temp_total = sum(r.cost_per_hr for r in resources if not r.is_prod)
+        if total > 0 and (temp_total / total) < min_ratio:
             for r in resources:
                 if not r.is_prod:
                     r.cost_per_hr = round(r.cost_per_hr * 1.3, 2)
         else:
             break
     return resources
 class AWSCostEnv:
+    W_COST = 18.0
+    W_RISK = 5.0
+    W_RELIABILITY = 3.5
+    W_VALID_ACTION = 0.2
+    W_WIN_BONUS = 2.5
+    W_FAIL_PENALTY = -3.0
+    W_REPEAT_ACTION = -0.06
+    W_HESITATION = -0.10
+    W_TOOL = 0.20
+    W_TOOL_EPISODE_CAP = 0.60
+    W_VETO = -0.70
+    W_CRASH = -10.0
+    W_IDLE = -0.08
+    MAX_STEPS = 100
+    MAX_COMPONENT_ABS = 5.0
     def __init__(self, n_resources=N_RESOURCES, target_savings=0.20):
+        self.n_resources = n_resources
         self.target_savings = target_savings
+        self.resources = []
+        self.baseline_cost = 0.0
+        self.current_cost = 0.0
+        self.current_step = 0
         self.nodes_investigated_this_episode = set()
+        self.total_tool_reward_this_episode = 0.0
+        self.action_history = []
+        self.last_action = None
+        self.same_action_streak = 0
     def _resource_from_action(self, action_idx):
         idx = (action_idx - 2) % self.n_resources
         return self.resources[idx % len(self.resources)]
     def _has_dependency_violation(self, resource):
         for other in self.resources:
+            if other.active and other.name != resource.name and resource.name in other.dependencies:
+                return True
         return False
     def _calc_cost(self):
         return sum(r.cost_per_hr for r in self.resources if r.active)
+    def _active_resources(self):
+        return [r for r in self.resources if r.active]
+    def _risk_score(self):
+        active = self._active_resources()
+        if not active:
+            return 0.0
+        risky = sum(1 for r in active if (r.is_prod or len(r.dependencies) >= 4))
+        return risky / len(active)
+    def _reliability_score(self):
+        active = self._active_resources()
+        if not active:
+            return 0.0
+        healthy = sum(1 for r in active if len(r.dependencies) < 5)
+        return healthy / len(active)
+    def _semantic_veto(self, name: str, dep_count: int) -> bool:
+        name_lower = name.lower()
+        prod_keywords = ["prod", "primary", "main", "core", "billing", "payment", "rds", "master"]
+        if any(kw in name_lower for kw in prod_keywords):
+            return True
+        if dep_count >= 5:
+            return True
+        return False
     def _get_obs(self):
         obs = []
         for r in self.resources:
             obs.extend(r.to_obs_array())
+        budget_used = 1.0 - (self.current_cost / self.baseline_cost) if self.baseline_cost > 0 else 0.0
         steps_left = 1.0 - (self.current_step / self.MAX_STEPS)
         obs.extend([budget_used, steps_left])
         return np.array(obs, dtype=np.float32)
     def _get_internal_state(self):
+        savings_pct = (1 - self.current_cost / self.baseline_cost) * 100 if self.baseline_cost > 0 else 0.0
         return {
+            "step": self.current_step,
             "baseline_cost": self.baseline_cost,
+            "current_cost": self.current_cost,
+            "savings_pct": round(savings_pct, 2),
+            "resources": [
+                {
+                    "name": r.name,
+                    "active": r.active,
+                    "is_prod": r.is_prod,
+                    "cost_per_hr": r.cost_per_hr,
+                    "cpu_pct": r.cpu_pct,
+                    "dependencies": r.dependencies,
+                }
+                for r in self.resources
+            ],
         }
+    def _clip_component(self, value):
+        return float(np.clip(value, -self.MAX_COMPONENT_ABS, self.MAX_COMPONENT_ABS))
+    def _update_repeat_penalty(self, action):
+        if self.last_action == action:
+            self.same_action_streak += 1
+        else:
+            self.same_action_streak = 0
+        self.last_action = action
+        return self._clip_component(self.same_action_streak * self.W_REPEAT_ACTION)
+    def _build_step_result(self, reward_components, done, win, msg, safety_violation=False):
+        total_reward = float(sum(reward_components.values()))
+        savings_pct = round((1 - self.current_cost / self.baseline_cost) * 100, 2) if self.baseline_cost > 0 else 0.0
+        return {
+            "observation": self._get_obs(),
+            "state": self._get_internal_state(),
+            "reward": total_reward,
+            "done": bool(done),
+            "info": {
+                "msg": msg,
+                "win": bool(win),
+                "savings_pct": savings_pct,
+                "safety_violation": int(safety_violation),
+                "reward_components": reward_components,
+            },
+        }
     def reset(self):
+        self.current_step = 0
         self.nodes_investigated_this_episode = set()
+        self.total_tool_reward_this_episode = 0.0
+        self.action_history = []
+        self.last_action = None
+        self.same_action_streak = 0
+        self.resources = generate_resources(self.n_resources)
+        self.baseline_cost = self._calc_cost()
+        self.current_cost = self.baseline_cost
         return {
             "observation": self._get_obs(),
+            "info": {"msg": "Episode reset", "baseline_cost": self.baseline_cost},
         }
     def step(self, action):
         self.current_step += 1
         truncated = self.current_step >= self.MAX_STEPS
+        self.action_history.append(int(action))
+        reward_components = {
+            "cost_delta": 0.0,
+            "risk": 0.0,
+            "reliability": 0.0,
+            "action_quality": 0.0,
+            "terminal": 0.0,
+            "anti_loop": self._update_repeat_penalty(int(action)),
+        }
+        prev_cost = self.current_cost
+        prev_risk = self._risk_score()
+        prev_reliability = self._reliability_score()
         if action == Action.NOOP:
+            reward_components["action_quality"] += self._clip_component(self.W_HESITATION)
+            reward_components["anti_loop"] += self._clip_component(self.W_IDLE)
+            return self._build_step_result(
+                reward_components, truncated, False, "Hesitation penalty"
+            )
         target = self._resource_from_action(action)
         if action == Action.CHECK_DEPENDENCIES:
+            under_cap = self.total_tool_reward_this_episode < self.W_TOOL_EPISODE_CAP
+            new_node = target.name not in self.nodes_investigated_this_episode
             if new_node and under_cap:
                 self.nodes_investigated_this_episode.add(target.name)
                 self.total_tool_reward_this_episode += self.W_TOOL
+                reward_components["action_quality"] += self._clip_component(self.W_TOOL)
                 msg = f"Checked {target.name}"
             else:
+                reward_components["action_quality"] += self._clip_component(self.W_HESITATION)
+                msg = "Tool cap reached or repeated check"
+            return self._build_step_result(
+                reward_components, truncated, False, msg
+            )
         danger = action in (Action.STOP, Action.DELETE)
         if danger and self._semantic_veto(target.name, len(target.dependencies)):
+            reward_components["action_quality"] += self._clip_component(self.W_VETO)
+            return self._build_step_result(
+                reward_components, truncated, False, f"SEMANTIC VETO on {target.name}", safety_violation=True
+            )
+        if action == Action.RESIZE and target.active:
+            old_cost = target.cost_per_hr
+            target.cost_per_hr = round(target.cost_per_hr * random.uniform(0.40, 0.65), 2)
+            if target.cost_per_hr >= old_cost:
+                target.cost_per_hr = round(old_cost * 0.50, 2)
+            reward_components["action_quality"] += self._clip_component(self.W_VALID_ACTION)
         elif action in (Action.STOP, Action.DELETE):
             if self._has_dependency_violation(target):
+                reward_components["terminal"] += self._clip_component(self.W_CRASH)
+                return self._build_step_result(
+                    reward_components, True, False, f"CATASTROPHIC FAILURE: {target.name}", safety_violation=True
+                )
             target.active = False
+            reward_components["action_quality"] += self._clip_component(self.W_VALID_ACTION)
         self.current_cost = self._calc_cost()
+        total_saved = ((self.baseline_cost - self.current_cost) / self.baseline_cost) if self.baseline_cost > 0 else 0.0
         is_win = total_saved >= self.target_savings
+        new_risk = self._risk_score()
+        new_reliability = self._reliability_score()
+        cost_delta_pct = (prev_cost - self.current_cost) / self.baseline_cost if self.baseline_cost > 0 else 0.0
+        risk_improvement = prev_risk - new_risk
+        reliability_improvement = new_reliability - prev_reliability
+        reward_components["cost_delta"] += self._clip_component(cost_delta_pct * self.W_COST)
+        reward_components["risk"] += self._clip_component(risk_improvement * self.W_RISK)
+        reward_components["reliability"] += self._clip_component(reliability_improvement * self.W_RELIABILITY)
+        if is_win:
+            reward_components["terminal"] += self._clip_component(self.W_WIN_BONUS)
+        elif truncated:
+            reward_components["terminal"] += self._clip_component(self.W_FAIL_PENALTY)
+        done = bool(is_win or truncated)
+        msg = "Win!" if is_win else "Action Successful"
+        return self._build_step_result(reward_components, done, is_win, msg)
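The reward returned by `step` is a plain sum of named components, each passed through `_clip_component` before it is added. A minimal standalone sketch of that clip-then-sum pattern follows. It reuses the weights from the diff, but `clip_component`, `total_reward`, and the example deltas are illustrative assumptions rather than the environment's code.

```python
# Standalone sketch of the clip-then-sum pattern; not the full env logic.
MAX_COMPONENT_ABS = 5.0
W_COST, W_RISK, W_CRASH = 18.0, 5.0, -10.0


def clip_component(value, limit=MAX_COMPONENT_ABS):
    """Clamp one reward component into [-limit, +limit]."""
    return max(-limit, min(limit, float(value)))


def total_reward(components):
    """The step reward is the unweighted sum of the already clipped components."""
    return float(sum(components.values()))


# Hypothetical step: cost drops by 10% of baseline, risk ratio improves by 0.05.
components = {
    "cost_delta": clip_component(0.10 * W_COST),
    "risk": clip_component(0.05 * W_RISK),
    "terminal": 0.0,
}
reward = total_reward(components)

# The crash weight exceeds the clip limit, so it is clamped before use.
crash = clip_component(W_CRASH)
```

One consequence of the constants as written: with `MAX_COMPONENT_ABS = 5.0`, the `W_CRASH = -10.0` penalty reaches the agent as -5.0.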
 class SB3Adapter(gym.Env):
     metadata = {"render_modes": []}
     def __init__(self):
         super().__init__()
         self.core = AWSCostEnv()
         self.action_space = spaces.Discrete(NUM_ACTIONS)
+        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(OBS_DIM,), dtype=np.float32)
     def reset(self, seed=None, options=None):
         super().reset(seed=seed)
         result = self.core.reset()
         return result["observation"], result["info"]
     def step(self, action):
+        result = self.core.step(action)
         terminated = result["done"] and result["info"].get("win", False)
+        truncated = result["done"] and not result["info"].get("win", False)
+        return (result["observation"], result["reward"], terminated, truncated, result["info"])
     def render(self):
         pass
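`SB3Adapter.step` splits the core env's single `done` flag into Gymnasium's `terminated` and `truncated` using only `info["win"]`. The sketch below restates that mapping as a pure function (`split_done` is a hypothetical name) to make one edge case visible: an episode ended by a catastrophic failure has `done=True` and `win=False`, so it is reported as truncated rather than terminated.

```python
def split_done(done, win):
    """Mirror of the adapter's mapping from (done, win) to (terminated, truncated)."""
    terminated = bool(done and win)
    truncated = bool(done and not win)
    return terminated, truncated


cases = {
    "win": split_done(True, True),
    "step_limit": split_done(True, False),
    "crash": split_done(True, False),   # indistinguishable from the step limit
    "ongoing": split_done(False, False),
}
```

Gymnasium reserves `truncated` for cutoffs outside the task, such as a step limit. If that distinction matters to the learner, the adapter would need an extra signal from the core env to tell a crash from a timeout; this is a suggestion, not something the commit contains.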

cloud_arena/llm_training.py CHANGED

@@ -1,23 +1,24 @@
-# ============================================================
-# Multi-Model RL Benchmarking Pipeline
-# Sequential training of multiple LLMs with VRAM cleanup
-# REINFORCE + LoRA on Cloud FinOps Environment
-# ============================================================
-import os, re, json, time, gc
 import numpy as np
 import torch
 import torch.nn.functional as F
-import matplotlib
 matplotlib.use("Agg")
-import matplotlib.pyplot as plt
-import warnings
 warnings.filterwarnings("ignore", category=UserWarning)
 warnings.filterwarnings("ignore", category=FutureWarning)
-from cloud_arena.llm_environment import SB3Adapter, Action, AWSCostEnv
-# ── Configuration ─────────────────────────────────────────────────────────────
 MODELS_TO_TEST = [
     "unsloth/Qwen2.5-Math-7B-Instruct-bnb-4bit",
@@ -27,13 +28,27 @@ MODELS_TO_TEST = [
 ACTION_NAMES = {0: "NOOP", 1: "CHECK_DEPS", 2: "RESIZE", 3: "STOP", 4: "DELETE"}
 DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
-GRAD_ACCUM_STEPS = 4
 MAX_SEQ_LEN = 512
-MAX_GEN_TOKENS = 80
-EMA_ALPHA = 0.3  # EMA smoothing factor for reward graph
-# ── Prompt & Parser ───────────────────────────────────────────────────────────
 def format_prompt(state_dict):
     resources_text = ""
@@ -42,327 +57,360 @@ def format_prompt(state_dict):
         tag = "PROD" if r["is_prod"] else "TEMP"
         resources_text += (
             f"  - {r['name']} [{status}] ({tag}): "
-            f"${r['cost_per_hr']:.2f}/hr, CPU={r['cpu_pct']}%, "
-            f"Deps={len(r['dependencies'])}\n"
         )
     savings_pct = state_dict.get("savings_pct", 0.0)
     return (
-        f"You are a Cloud FinOps AI. Reduce cost by >=20% without breaking production.\n\n"
-        f"Actions: 0=NOOP, 1=CHECK_DEPS, 2=RESIZE, 3=STOP, 4=DELETE\n\n"
         f"Resources:\n{resources_text}\n"
         f"Baseline: ${state_dict['baseline_cost']:.2f}/hr | "
         f"Current: ${state_dict['current_cost']:.2f}/hr | Savings: {savings_pct:.1f}%\n\n"
-        f"Rules:\n- Never delete/stop prod resources or those with >=5 deps\n"
-        f"- Temp resources with 0-1 deps are safe to delete\n- RESIZE is always safe\n\n"
-        f"CRITICAL: Output ONLY a brief reason then ACTION: <number 0-4>. Nothing else.\n\n"
-        f"REASONING:"
     )
 def extract_action_and_reasoning(response_text):
-    """Regex safety net: extracts action even from truncated/malformed output."""
     reasoning = response_text.strip()
-    action = 2  # Default RESIZE
-    match = re.search(r'ACTION:\s*([0-4])', response_text, re.IGNORECASE)
     if match:
         return int(match.group(1)), reasoning
-    json_match = re.search(r'\{.*?\}', response_text, re.DOTALL)
-    if json_match:
-        try:
-            parsed = json.loads(json_match.group(0))
-            a = parsed.get("action", 2)
-            if isinstance(a, int) and 0 <= a <= 4:
-                return a, reasoning
-        except (json.JSONDecodeError, ValueError):
-            pass
-    digits = re.findall(r'\b([0-4])\b', response_text[-30:])
     if digits:
         action = int(digits[-1])
     return action, reasoning
-# ── REINFORCE Loss ────────────────────────────────────────────────────────────
-def compute_pg_loss(model, tokenizer, prompt, response_text, reward):
     full_text = prompt + response_text
     enc = tokenizer(full_text, return_tensors="pt", truncation=True, max_length=MAX_SEQ_LEN).to(DEVICE)
     prompt_len = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=MAX_SEQ_LEN)["input_ids"].shape[1]
-    outputs = model(**enc, labels=enc["input_ids"])
-    logits = outputs.logits[:, prompt_len-1:-1, :]
     targets = enc["input_ids"][:, prompt_len:]
     if targets.shape[1] == 0 or logits.shape[1] == 0:
-        return 0.0
-    ml = min(logits.shape[1], targets.shape[1])
-    log_probs = F.log_softmax(logits[:, :ml, :], dim=-1)
-    token_lp = log_probs.gather(2, targets[:, :ml].unsqueeze(-1)).squeeze(-1)
-    loss = -(reward / 10.0) * token_lp.mean()
-    loss.backward()
-    return loss.item()
-# ── Episode Runner ────────────────────────────────────────────────────────────
-def run_episode(model, tokenizer, env, is_training=False, optimizer=None,
-                steps_per_episode=15, iteration_num=0, total_iters=0):
     obs, info = env.reset()
-    state_dict = env.core._get_internal_state()
     done = False
-    episode_reward = 0.0
     step_count = 0
-    reasoning_log = []
-    if is_training and optimizer is not None:
-        optimizer.zero_grad()
-    while not done and step_count < steps_per_episode:
         prompt = format_prompt(state_dict)
-        inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=MAX_SEQ_LEN)
-        input_ids = inputs["input_ids"].to(DEVICE)
-        attn_mask = inputs["attention_mask"].to(DEVICE)
-        with torch.no_grad():
-            gen = model.generate(
-                input_ids, attention_mask=attn_mask,
-                max_new_tokens=MAX_GEN_TOKENS,
-                do_sample=True, temperature=0.7, top_p=0.95,
-                pad_token_id=tokenizer.pad_token_id,
             )
-        response_text = tokenizer.decode(gen[0][input_ids.shape[1]:], skip_special_tokens=True)
-        action, reasoning = extract_action_and_reasoning(response_text)
-        next_obs, reward, terminated, truncated, next_info = env.step(action)
-        done = terminated or truncated
-        episode_reward += reward
-        # ── Detailed per-step terminal output ──
-        if is_training and total_iters > 0:
-            pct = (iteration_num / total_iters) * 100
-            print(f"  [{pct:5.1f}%] Ep {iteration_num} Step {step_count+1}: "
-                  f"{ACTION_NAMES.get(action,'?')} → r={reward:+.3f} | "
-                  f"💬 {reasoning[:80]}")
-        reasoning_log.append({
-            "step": step_count + 1, "action": action,
-            "action_name": ACTION_NAMES.get(action, "?"),
-            "reward": round(reward, 4),
-            "reasoning": reasoning[:200],
-            "message": next_info.get("msg", ""),
-        })
-        if is_training and optimizer is not None:
-            compute_pg_loss(model, tokenizer, prompt, response_text, reward)
-        obs = next_obs
-        state_dict = env.core._get_internal_state()
         step_count += 1
-    return episode_reward, reasoning_log
-# ── VRAM Cleanup ──────────────────────────────────────────────────────────────
-def nuke_vram(model=None, optimizer=None, tokenizer=None):
-    """Aggressively free VRAM between model runs."""
-    if model is not None:
-        del model
-    if optimizer is not None:
-        del optimizer
-    if tokenizer is not None:
-        del tokenizer
-    gc.collect()
-    if torch.cuda.is_available():
-        torch.cuda.empty_cache()
-        torch.cuda.synchronize()
-        vram = torch.cuda.memory_allocated() / 1e9
-        print(f"  🧹 VRAM after cleanup: {vram:.2f} GB")
-# ── Single Model Training ────────────────────────────────────────────────────
-def train_single_model(model_name, num_iterations=200, steps_per_episode=15,
-                       learning_rate=2e-6):
-    """Train one model, return rewards list."""
-    hf_token = os.environ.get("HF_TOKEN")
     from transformers import AutoModelForCausalLM, AutoTokenizer
-    from peft import get_peft_model, LoraConfig, TaskType
     short_name = model_name.split("/")[-1]
-    print(f"\n{'='*60}")
-    print(f"  🧠 Loading: {short_name}")
-    print(f"{'='*60}")
     tokenizer = AutoTokenizer.from_pretrained(model_name, token=hf_token)
-    model = AutoModelForCausalLM.from_pretrained(
-        model_name, torch_dtype=torch.bfloat16, token=hf_token,
         attn_implementation="sdpa",
     ).to(DEVICE)
     lora_cfg = LoraConfig(
-        r=16, lora_alpha=16,
         target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
-        lora_dropout=0.0, bias="none", task_type=TaskType.CAUSAL_LM,
-    )
-    model = get_peft_model(model, lora_cfg)
-    if tokenizer.pad_token is None:
-        tokenizer.pad_token = tokenizer.eos_token
-    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
-    total = sum(p.numel() for p in model.parameters())
-    print(f"  ✅ Loaded | Trainable: {trainable:,} / {total:,}")
-    if torch.cuda.is_available():
-        vram = torch.cuda.memory_allocated() / 1e9
-        print(f"  📊 VRAM used: {vram:.2f} GB")
-    optimizer = torch.optim.AdamW(
-        filter(lambda p: p.requires_grad, model.parameters()), lr=learning_rate
     )
     env = SB3Adapter()
-    all_rewards = []
-    # Pre-training eval
-    print(f"\n  ▶ PRE-TRAINING EVAL")
     model.eval()
-    pre_r, _ = run_episode(model, tokenizer, env, steps_per_episode=steps_per_episode)
-    all_rewards.append(pre_r)
-    print(f"    Baseline reward: {pre_r:+.3f}")
-    # Training loop
-    print(f"\n  ▶ TRAINING ({num_iterations} iters, accum={GRAD_ACCUM_STEPS})")
     model.train()
-    t0 = time.time()
-    for i in range(num_iterations):
-        reward, log_data = run_episode(
-            model, tokenizer, env, is_training=True, optimizer=optimizer,
-            steps_per_episode=steps_per_episode,
-            iteration_num=i+1, total_iters=num_iterations,
-        )
         all_rewards.append(reward)
-        if (i + 1) % GRAD_ACCUM_STEPS == 0:
-            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
-            optimizer.step()
-            optimizer.zero_grad()
-        # Per-iteration summary
-        pct = ((i+1) / num_iterations) * 100
-        elapsed = time.time() - t0
-        eta = (elapsed / (i+1)) * (num_iterations - i - 1)
-        ema = all_rewards[-1] if len(all_rewards) < 3 else (
-            EMA_ALPHA * all_rewards[-1] + (1 - EMA_ALPHA) * all_rewards[-2]
         )
-        print(f"  ┃ [{pct:5.1f}%] Iter {i+1:3d}/{num_iterations} │ "
-              f"r={reward:+.3f} │ EMA={ema:+.3f} │ "
-              f"ETA={eta:.0f}s")
-    # Final grad step
-    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
-    optimizer.step()
-    # Post-training eval
-    print(f"\n  ▶ POST-TRAINING EVAL")
     model.eval()
-    post_r, _ = run_episode(model, tokenizer, env, steps_per_episode=steps_per_episode)
-    all_rewards.append(post_r)
-    delta = post_r - pre_r
-    print(f"    Final reward: {post_r:+.3f} (Δ={delta:+.3f})")
-    print(f"    Time: {time.time()-t0:.0f}s")
-    # Cleanup VRAM
-    nuke_vram(model, optimizer, tokenizer)
-    return all_rewards
-# ── EMA Graph ─────────────────────────────────────────────────────────────────
-def compute_ema(rewards, alpha=EMA_ALPHA):
-    ema = [rewards[0]]
-    for r in rewards[1:]:
-        ema.append(alpha * r + (1 - alpha) * ema[-1])
-    return ema
-def generate_comparison_graph(all_results, output_path="outputs/multi_model_comparison.png"):
-    BG = '#0e1117'
-    COLORS = ['#00d4ff', '#ffa500', '#39ff14', '#ff6b6b', '#b47eff']
-    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 7), facecolor=BG)
-    for ax in [ax1, ax2]:
-        ax.set_facecolor(BG)
-        ax.tick_params(colors='#e6e6e6', labelsize=10)
-        ax.grid(True, alpha=0.08, color='white')
-        for s in ['top', 'right']:
-            ax.spines[s].set_visible(False)
-        for s in ['left', 'bottom']:
-            ax.spines[s].set_color('#333')
-    # Left: EMA reward curves
-    for idx, (name, rewards) in enumerate(all_results.items()):
-        color = COLORS[idx % len(COLORS)]
-        ema = compute_ema(rewards)
-        ax1.plot(ema, color=color, lw=2.5, label=name, alpha=0.9)
-        ax1.plot(rewards, color=color, lw=0.5, alpha=0.2)
-    ax1.set_title("Training Reward (EMA Smoothed)", color='#e6e6e6', fontsize=14, fontweight='bold')
-    ax1.set_xlabel("Episode", color='#e6e6e6', fontsize=11)
-    ax1.set_ylabel("Reward", color='#e6e6e6', fontsize=11)
-    ax1.legend(facecolor='#1a1a2e', edgecolor='#333', labelcolor='#e6e6e6', fontsize=9)
-    ax1.axhline(y=0, color='gray', linestyle='--', alpha=0.3)
-    # Right: Before vs After comparison bars
-    names = list(all_results.keys())
-    pre_scores = [all_results[n][0] for n in names]
-    post_scores = [all_results[n][-1] for n in names]
-    x = np.arange(len(names))
-    w = 0.35
-    bars1 = ax2.bar(x - w/2, pre_scores, w, label='Before', color='#ef4444', edgecolor='white', lw=1)
-    bars2 = ax2.bar(x + w/2, post_scores, w, label='After', color='#22c55e', edgecolor='white', lw=1)
-    ax2.set_xticks(x)
-    ax2.set_xticklabels(names, fontsize=8, color='#e6e6e6', rotation=15)
-    ax2.set_title("Pre vs Post Training", color='#e6e6e6', fontsize=14, fontweight='bold')
-    ax2.set_ylabel("Reward", color='#e6e6e6', fontsize=11)
-    ax2.legend(facecolor='#1a1a2e', edgecolor='#333', labelcolor='#e6e6e6')
-    ax2.axhline(y=0, color='gray', linestyle='--', alpha=0.3)
-    for bar, val in zip(list(bars1) + list(bars2), pre_scores + post_scores):
-        ax2.text(bar.get_x() + bar.get_width()/2, val + 0.1,
-                 f"{val:+.1f}", ha='center', va='bottom', fontsize=9,
-                 color='#e6e6e6', fontweight='bold')
-    plt.tight_layout()
-    plt.savefig(output_path, dpi=200, bbox_inches='tight', facecolor=BG)
-    plt.close()
-    return output_path
-# ── Main Pipeline ─────────────────────────────────────────────────────────────
-def train_llm(model_name=None, num_iterations=200, steps_per_episode=15,
-              learning_rate=2e-6, progress_callback=None):
-    """
-    Multi-model or single-model training pipeline.
-    If model_name contains commas, runs multi-model benchmark.
-    """
     log_lines = []
     def log(msg):
         print(msg)
         log_lines.append(msg)
         if progress_callback:
             progress_callback("\n".join(log_lines))
-    # Determine model list
     if model_name and "," in model_name:
         models = [m.strip() for m in model_name.split(",")]
     elif model_name:
@@ -370,53 +418,49 @@ def train_llm(model_name=None, num_iterations=200, steps_per_episode=15,
     else:
         models = MODELS_TO_TEST
-    log(f"🖥️  Device: {DEVICE}")
-    log(f"🔁 Models to test: {len(models)}")
-    for m in models:
-        log(f"   • {m}")
     all_results = {}
     full_log = []
-    for model_idx, mname in enumerate(models):
         short = mname.split("/")[-1]
-        log(f"\n{'━'*60}")
-        log(f"  [{model_idx+1}/{len(models)}] {short}")
-        log(f"{'━'*60}")
         try:
-            rewards = train_single_model(
-                mname, num_iterations=num_iterations,
-                steps_per_episode=steps_per_episode,
-                learning_rate=learning_rate,
-            )
             all_results[short] = rewards
             delta = rewards[-1] - rewards[0]
-            log(f"  ✅ {short}: Pre={rewards[0]:+.3f} → Post={rewards[-1]:+.3f} (Δ={delta:+.3f})")
-            full_log.append({
-                "model": mname, "pre": rewards[0], "post": rewards[-1],
-                "delta": delta, "all_rewards": rewards,
-            })
         except Exception as e:
-            log(f"  ❌ {short} FAILED: {e}")
             full_log.append({"model": mname, "error": str(e)})
-            nuke_vram()  # cleanup even on failure
-    # Generate comparison graph
     graph_path = None
     if all_results:
-        os.makedirs("outputs", exist_ok=True)
-        graph_path = generate_comparison_graph(all_results)
-        log(f"\n📊 Comparison graph saved to {graph_path}")
-    # Save log
-    with open("outputs/multi_model_log.json", "w") as f:
-        json.dump(full_log, f, indent=2, default=str)
-    # Build flat reward list for backward compat
     flat_rewards = []
     for rewards in all_results.values():
         flat_rewards.extend(rewards)
-    log_text = "\n".join(log_lines)
-    return flat_rewards or [0], full_log, graph_path, log_text

+import copy
+import gc
+import json
+import os
+import re
+import time
+import warnings
+from dataclasses import dataclass
+import matplotlib
 import numpy as np
 import torch
 import torch.nn.functional as F
+from cloud_arena.llm_environment import AWSCostEnv, SB3Adapter
+from cloud_arena.visualization import generate_grpo_dashboard
 matplotlib.use("Agg")
 warnings.filterwarnings("ignore", category=UserWarning)
 warnings.filterwarnings("ignore", category=FutureWarning)
 MODELS_TO_TEST = [
     "unsloth/Qwen2.5-Math-7B-Instruct-bnb-4bit",
 ACTION_NAMES = {0: "NOOP", 1: "CHECK_DEPS", 2: "RESIZE", 3: "STOP", 4: "DELETE"}
 DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
 MAX_SEQ_LEN = 512
+EMA_ALPHA = 0.3
+@dataclass
+class GRPOConfig:
+    num_iterations: int = 200
+    steps_per_episode: int = 15
+    group_size: int = 4
+    clip_epsilon: float = 0.2
+    kl_coef: float = 0.01
+    entropy_coef: float = 0.001
+    learning_rate: float = 2e-6
+    grad_accum_steps: int = 4
+    max_gen_tokens: int = 80
+    temperature: float = 0.7
+    top_p: float = 0.95
+    max_grad_norm: float = 1.0
+    seed: int = 42
+    target_kl: float = 0.12
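A plain-Python sketch of how `group_size`, `clip_epsilon`, `kl_coef`, and `entropy_coef` typically enter a GRPO update. The advantage normalization follows what `run_grpo_episode` does with each group's rewards. The loss function is an assumption based on the README's description (clipped policy objective with KL and entropy regularization), since the actual loss code is outside this excerpt. The names `group_advantages` and `grpo_sample_loss` are hypothetical, and a simple log-probability difference stands in for the KL term.

```python
import math
import statistics


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group: (r - mean) / (population std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) + eps
    return [(r - mean) / std for r in rewards]


def grpo_sample_loss(new_lp, old_lp, ref_lp, advantage, entropy,
                     clip_epsilon=0.2, kl_coef=0.01, entropy_coef=0.001):
    """PPO-style clipped surrogate for one sampled response (scalar log-probs)."""
    ratio = math.exp(new_lp - old_lp)
    clipped = max(1.0 - clip_epsilon, min(1.0 + clip_epsilon, ratio))
    # Pessimistic choice between the unclipped and clipped objectives.
    surrogate = min(ratio * advantage, clipped * advantage)
    # Crude KL proxy (assumption): drift of the policy log-prob from the reference.
    kl_proxy = new_lp - ref_lp
    # Minimize: negative surrogate, plus KL penalty, minus entropy bonus.
    return -surrogate + kl_coef * kl_proxy - entropy_coef * entropy


# A group of four sampled responses with made-up rewards.
advs = group_advantages([1.0, 0.0, 0.0, 1.0])

# The ratio exp(0.5) exceeds 1 + clip_epsilon, so the positive advantage is capped.
loss = grpo_sample_loss(new_lp=-1.0, old_lp=-1.5, ref_lp=-1.0,
                        advantage=1.0, entropy=0.0)
```

When every response in a group earns the same reward, all advantages are zero and that group contributes no policy gradient, which is why the per-group reward spread is worth tracking.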
 def format_prompt(state_dict):
     resources_text = ""
         tag = "PROD" if r["is_prod"] else "TEMP"
         resources_text += (
             f"  - {r['name']} [{status}] ({tag}): "
+            f"${r['cost_per_hr']:.2f}/hr, CPU={r['cpu_pct']}%, Deps={len(r['dependencies'])}\n"
         )
     savings_pct = state_dict.get("savings_pct", 0.0)
     return (
+        "You are a Cloud FinOps AI.\n"
+        "Goal: Reduce cloud cost by >=20% while preserving safety and reliability.\n\n"
+        "Actions: 0=NOOP, 1=CHECK_DEPS, 2=RESIZE, 3=STOP, 4=DELETE\n\n"
         f"Resources:\n{resources_text}\n"
         f"Baseline: ${state_dict['baseline_cost']:.2f}/hr | "
         f"Current: ${state_dict['current_cost']:.2f}/hr | Savings: {savings_pct:.1f}%\n\n"
+        "Safety policy:\n"
+        "- Avoid deleting/stopping production-like or high dependency resources.\n"
+        "- Prefer low-risk actions that improve savings steadily.\n\n"
+        "Output format strictly:\n"
+        "Reason: <short>\n"
+        "ACTION: <number 0-4>\n\n"
+        "RESPONSE:"
     )
 def extract_action_and_reasoning(response_text):
     reasoning = response_text.strip()
+    action = 2
+    match = re.search(r"ACTION:\s*([0-4])", response_text, re.IGNORECASE)
     if match:
         return int(match.group(1)), reasoning
+    digits = re.findall(r"\b([0-4])\b", response_text[-30:])
     if digits:
         action = int(digits[-1])
     return action, reasoning
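The parser falls back in stages: an explicit `ACTION: n` match first, then the last standalone digit 0-4 within the final 30 characters, and finally the default action 2 (RESIZE). The sketch below reproduces that logic in a self-contained form so the fallbacks can be exercised directly. `parse_action` is a hypothetical stand-in that returns only the action.

```python
import re


def parse_action(response_text, default=2):
    """Return the action id parsed from a model response, or `default`."""
    # Stage 1: explicit "ACTION: n" marker, case-insensitive.
    match = re.search(r"ACTION:\s*([0-4])", response_text, re.IGNORECASE)
    if match:
        return int(match.group(1))
    # Stage 2: last standalone digit 0-4 near the end of the response.
    digits = re.findall(r"\b([0-4])\b", response_text[-30:])
    if digits:
        return int(digits[-1])
    # Stage 3: nothing usable, fall back to the default (RESIZE).
    return default


explicit = parse_action("Reason: idle temp node\nACTION: 4")
lowercase = parse_action("reason: shrink it\naction:3")
trailing = parse_action("I would pick option 1")
fallback = parse_action("no idea")
two_digit = parse_action("ACTION: 12")
```

Because the first pattern has no trailing word boundary, `ACTION: 12` parses as action 1 rather than being rejected.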
+def seed_everything(seed):
+    np.random.seed(seed)
+    torch.manual_seed(seed)
+    if torch.cuda.is_available():
+        torch.cuda.manual_seed_all(seed)
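The environment in this commit samples resources and RESIZE factors through Python's `random` module, which `seed_everything` as shown does not seed. Unless that happens elsewhere, two runs with the same `seed` can still see different episodes. A stdlib-only sketch of the effect, where `draw_costs` is a hypothetical stand-in for the environment's sampling:

```python
import random


def draw_costs(n=3):
    """Stand-in for the env's resource sampling: draws from the global RNG."""
    return [round(random.uniform(0.8, 4.0), 2) for _ in range(n)]


random.seed(42)
first = draw_costs()

random.seed(42)
second = draw_costs()  # same values as `first`: the global RNG was re-seeded
```

Adding `random.seed(seed)` to the helper would be the natural extension; that is a suggestion, not part of the diff.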
+def nuke_vram(model=None, optimizer=None, tokenizer=None, ref_model=None):
+    if model is not None:
+        del model
+    if optimizer is not None:
+        del optimizer
+    if tokenizer is not None:
+        del tokenizer
+    if ref_model is not None:
+        del ref_model
+    gc.collect()
+    if torch.cuda.is_available():
+        torch.cuda.empty_cache()
+        torch.cuda.synchronize()
+def _completion_logprob(model, tokenizer, prompt, response_text):
     full_text = prompt + response_text
     enc = tokenizer(full_text, return_tensors="pt", truncation=True, max_length=MAX_SEQ_LEN).to(DEVICE)
     prompt_len = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=MAX_SEQ_LEN)["input_ids"].shape[1]
+    outputs = model(**enc)
+    logits = outputs.logits[:, prompt_len - 1 : -1, :]
     targets = enc["input_ids"][:, prompt_len:]
     if targets.shape[1] == 0 or logits.shape[1] == 0:
+        z = torch.zeros(1, device=DEVICE)
+        return z, z, z
+    n_tokens = min(logits.shape[1], targets.shape[1])
+    log_probs = F.log_softmax(logits[:, :n_tokens, :], dim=-1)
+    probs = torch.softmax(logits[:, :n_tokens, :], dim=-1)
+    picked = log_probs.gather(2, targets[:, :n_tokens].unsqueeze(-1)).squeeze(-1)
+    token_logprob = picked.mean()
+    entropy = (-(probs * log_probs).sum(-1)).mean()
+    return token_logprob, entropy, torch.tensor(float(n_tokens), device=DEVICE)
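`_completion_logprob` reduces a response to two scalars: the mean log-probability of the tokens that were actually sampled, and the mean entropy of the per-position distributions. The same reduction on a toy two-position, three-token vocabulary, written without tensors; `log_softmax` and `mean_logprob_and_entropy` are illustrative helpers, not functions from the repository.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def mean_logprob_and_entropy(logits_per_pos, target_ids):
    """Average the picked-token log-prob and the entropy over positions."""
    picked, entropies = [], []
    for logits, target in zip(logits_per_pos, target_ids):
        lp = log_softmax(logits)
        picked.append(lp[target])
        entropies.append(-sum(math.exp(v) * v for v in lp))
    n = len(picked)
    return sum(picked) / n, sum(entropies) / n


# Uniform logits: every token has probability 1/3, so entropy is ln(3).
mean_lp, mean_ent = mean_logprob_and_entropy(
    [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [0, 2]
)
```

Because the function averages over tokens instead of summing, the returned log-probability is length-normalized: a long response and a short one with similar per-token confidence get similar values.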
+def _sample_response(model, tokenizer, prompt, cfg):
+    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=MAX_SEQ_LEN)
+    input_ids = inputs["input_ids"].to(DEVICE)
+    attn_mask = inputs["attention_mask"].to(DEVICE)
+    with torch.no_grad():
+        out = model.generate(
+            input_ids=input_ids,
+            attention_mask=attn_mask,
+            max_new_tokens=cfg.max_gen_tokens,
+            do_sample=True,
+            temperature=cfg.temperature,
+            top_p=cfg.top_p,
+            pad_token_id=tokenizer.pad_token_id,
+        )
+    return tokenizer.decode(out[0][input_ids.shape[1] :], skip_special_tokens=True)
+def _evaluate_action_on_clone(core_env, action):
+    env_copy = copy.deepcopy(core_env)
+    result = env_copy.step(action)
+    return result
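Scoring each candidate on a `copy.deepcopy` of the core env leaves the live episode untouched until one action is committed. A toy sketch of that pattern follows; `ToyEnv` and `evaluate_on_clone` are hypothetical. One caveat that appears to apply to the environment in this diff: `deepcopy` duplicates the object's fields but not the module-level `random` state, so stochastic effects such as the RESIZE cost factor can differ between the simulated step and the committed one.

```python
import copy


class ToyEnv:
    """Tiny deterministic env: action 1 halves the cost, anything else is a no-op."""

    def __init__(self):
        self.cost = 10.0
        self.steps = 0

    def step(self, action):
        self.steps += 1
        if action == 1:
            self.cost *= 0.5
        return {"reward": 10.0 - self.cost, "steps": self.steps}


def evaluate_on_clone(env, action):
    """Step a deep copy so the original env state is never mutated."""
    return copy.deepcopy(env).step(action)


env = ToyEnv()
results = [evaluate_on_clone(env, a) for a in (0, 1, 1)]
```

Each candidate is evaluated from the same starting state, so the rewards are directly comparable within the group.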
+def run_grpo_episode(model, ref_model, tokenizer, env, cfg, optimizer=None, train_mode=False):
     obs, info = env.reset()
     done = False
     step_count = 0
+    episode_reward = 0.0
+    chosen_samples = []
+    stats = {
+        "wins": 0,
+        "veto_rate": 0.0,
+        "safety_violations": 0,
+        "avg_group_std": 0.0,
+        "avg_group_reward": 0.0,
+        "avg_token_len": 0.0,
+        "reward_components": {},
+    }
+    veto_count = 0
+    all_group_stds = []
+    all_group_rewards = []
+    token_lens = []
+    component_acc = {}
+    while not done and step_count < cfg.steps_per_episode:
+        state_dict = env.core._get_internal_state()
         prompt = format_prompt(state_dict)
+        group = []
+        for _ in range(cfg.group_size):
+            response_text = _sample_response(model, tokenizer, prompt, cfg)
+            action, reasoning = extract_action_and_reasoning(response_text)
+            sim_result = _evaluate_action_on_clone(env.core, action)
+            reward = float(sim_result["reward"])
+            info = sim_result["info"]
+            # Rollout log-probs are constants in the later ratio, so compute both without building a graph.
+            with torch.no_grad():
+                token_lp, _, token_len = _completion_logprob(model, tokenizer, prompt, response_text)
+                old_lp = token_lp.detach()
+                ref_lp, _, _ = _completion_logprob(ref_model, tokenizer, prompt, response_text)
+            group.append(
+                {
+                    "prompt": prompt,
+                    "response": response_text,
+                    "action": action,
+                    "reasoning": reasoning,
+                    "reward": reward,
+                    "old_logprob": old_lp,
+                    "ref_logprob": ref_lp.detach(),
+                    "token_len": float(token_len.item()),
+                    "info": info,
+                }
             )
+        rewards = np.array([s["reward"] for s in group], dtype=np.float32)
+        baseline = float(rewards.mean())
+        std = float(rewards.std() + 1e-6)
+        for s in group:
+            s["advantage"] = float((s["reward"] - baseline) / std)
+        all_group_stds.append(std)
+        all_group_rewards.append(baseline)
+        best = max(group, key=lambda x: x["reward"])
+        chosen_samples.append(best)
+        token_lens.append(best["token_len"])
+        real_step = env.step(best["action"])
+        _, step_reward, terminated, truncated, step_info = real_step
+        done = bool(terminated or truncated)
+        episode_reward += float(step_reward)
+        veto_count += int(step_info.get("safety_violation", 0))
+        if step_info.get("win", False):
+            stats["wins"] += 1
+        if step_info.get("safety_violation", 0):
+            stats["safety_violations"] += 1
+        for k, v in step_info.get("reward_components", {}).items():
+            component_acc[k] = component_acc.get(k, 0.0) + float(v)
         step_count += 1
+    if train_mode and chosen_samples and optimizer is not None:
+        optimizer.zero_grad(set_to_none=True)
+        loss_sum = 0.0
+        kl_sum = 0.0
+        ent_sum = 0.0
+        clip_frac_count = 0
+        for i, sample in enumerate(chosen_samples, start=1):
+            new_lp, ent, _ = _completion_logprob(model, tokenizer, sample["prompt"], sample["response"])
+            ratio = torch.exp(new_lp - sample["old_logprob"])
+            clipped = torch.clamp(ratio, 1.0 - cfg.clip_epsilon, 1.0 + cfg.clip_epsilon)
+            adv = torch.tensor(sample["advantage"], device=DEVICE, dtype=torch.float32)
+            pg_loss = -torch.min(ratio * adv, clipped * adv)
+            kl = torch.clamp(new_lp - sample["ref_logprob"], min=-2.0, max=2.0)
+            total_loss = pg_loss + (cfg.kl_coef * kl) - (cfg.entropy_coef * ent)
+            # Skip backward for non-finite losses and for samples with no gradient path
+            # (empty completions), but still reach the optimizer step below.
+            if total_loss.requires_grad and bool(torch.isfinite(total_loss).all()):
+                (total_loss / cfg.grad_accum_steps).backward()
+                loss_sum += float(total_loss.detach().item())
+                kl_sum += float(kl.detach().item())
+                ent_sum += float(ent.detach().item())
+                clipped_out = (ratio > 1.0 + cfg.clip_epsilon) | (ratio < 1.0 - cfg.clip_epsilon)
+                clip_frac_count += int(clipped_out.item())
+            if i % cfg.grad_accum_steps == 0 or i == len(chosen_samples):
+                torch.nn.utils.clip_grad_norm_(model.parameters(), cfg.max_grad_norm)
+                optimizer.step()
+                optimizer.zero_grad(set_to_none=True)
+        n = max(len(chosen_samples), 1)
+        stats["loss"] = loss_sum / n
+        stats["kl"] = kl_sum / n
+        stats["entropy"] = ent_sum / n
+        stats["clip_frac"] = clip_frac_count / n
+    else:
+        stats["loss"] = 0.0
+        stats["kl"] = 0.0
+        stats["entropy"] = 0.0
+        stats["clip_frac"] = 0.0
+    stats["veto_rate"] = veto_count / max(step_count, 1)
+    stats["avg_group_std"] = float(np.mean(all_group_stds)) if all_group_stds else 0.0
+    stats["avg_group_reward"] = float(np.mean(all_group_rewards)) if all_group_rewards else 0.0
+    stats["avg_token_len"] = float(np.mean(token_lens)) if token_lens else 0.0
+    if step_count > 0:
+        stats["reward_components"] = {k: v / step_count for k, v in component_acc.items()}
+    else:
+        stats["reward_components"] = {}
+    return episode_reward, stats
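
The group-relative advantage and the clipped objective in `run_grpo_episode` can be checked in isolation. The sketch below uses only the standard library and scalar log-probabilities in place of tensors. `group_advantages` and `clipped_objective` are hypothetical names introduced for illustration, and the reward values are made up.

```python
import math
from statistics import fmean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group: (r - mean) / (std + eps)."""
    baseline = fmean(rewards)
    std = pstdev(rewards) + eps  # population std, the same default as numpy's .std()
    return [(r - baseline) / std for r in rewards]


def clipped_objective(new_lp, old_lp, advantage, clip_epsilon=0.2):
    """PPO-style clipped surrogate loss for one sample (lower is better)."""
    ratio = math.exp(new_lp - old_lp)
    clipped = min(max(ratio, 1.0 - clip_epsilon), 1.0 + clip_epsilon)
    return -min(ratio * advantage, clipped * advantage)


if __name__ == "__main__":
    advs = group_advantages([1.0, 2.0, 3.0, 4.0])
    print([round(a, 3) for a in advs])
    # With a positive advantage, a large ratio is capped at (1 + clip_epsilon) * advantage.
    print(clipped_objective(new_lp=-0.5, old_lp=-1.5, advantage=1.0))
    # With a negative advantage, the same ratio is not capped and is penalized in full.
    print(clipped_objective(new_lp=-0.5, old_lp=-1.5, advantage=-1.0))
```

Under this normalization the highest-reward sample in a group always has a non-negative advantage, and a group with identical rewards yields zero advantage for every member.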
+def train_single_model_grpo(model_name, cfg):
+    from peft import LoraConfig, TaskType, get_peft_model
     from transformers import AutoModelForCausalLM, AutoTokenizer
+    hf_token = os.environ.get("HF_TOKEN")
     short_name = model_name.split("/")[-1]
+    print(f"\n{'=' * 60}\n  Loading: {short_name}\n{'=' * 60}")
+    seed_everything(cfg.seed)
     tokenizer = AutoTokenizer.from_pretrained(model_name, token=hf_token)
+    if tokenizer.pad_token is None:
+        tokenizer.pad_token = tokenizer.eos_token
+    base_model = AutoModelForCausalLM.from_pretrained(
+        model_name,
+        torch_dtype=torch.bfloat16 if DEVICE == "cuda" else torch.float32,
+        token=hf_token,
         attn_implementation="sdpa",
     ).to(DEVICE)
+    ref_model = AutoModelForCausalLM.from_pretrained(
+        model_name,
+        torch_dtype=torch.bfloat16 if DEVICE == "cuda" else torch.float32,
+        token=hf_token,
+        attn_implementation="sdpa",
+    ).to(DEVICE)
+    ref_model.eval()
+    for p in ref_model.parameters():
+        p.requires_grad_(False)
     lora_cfg = LoraConfig(
+        r=16,
+        lora_alpha=16,
         target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
+        lora_dropout=0.0,
+        bias="none",
+        task_type=TaskType.CAUSAL_LM,
     )
+    model = get_peft_model(base_model, lora_cfg)
+    optimizer = torch.optim.AdamW(filter(lambda p: p.requires_grad, model.parameters()), lr=cfg.learning_rate)
     env = SB3Adapter()
+    all_rewards = []
+    iter_stats = []
+    print("\n  PRE-TRAINING EVAL")
     model.eval()
+    pre_reward, pre_stats = run_grpo_episode(model, ref_model, tokenizer, env, cfg, train_mode=False)
+    all_rewards.append(pre_reward)
+    iter_stats.append(pre_stats)
+    print(f"    Baseline reward: {pre_reward:+.3f}")
+    print(f"\n  GRPO TRAINING ({cfg.num_iterations} iters, group={cfg.group_size})")
     model.train()
+    start = time.time()
+    for i in range(cfg.num_iterations):
+        reward, stats = run_grpo_episode(model, ref_model, tokenizer, env, cfg, optimizer=optimizer, train_mode=True)
         all_rewards.append(reward)
+        iter_stats.append(stats)
+        if stats["kl"] > cfg.target_kl * 1.5:
+            cfg.kl_coef = min(cfg.kl_coef * 1.15, 0.2)
+        elif stats["kl"] < cfg.target_kl * 0.5:
+            cfg.kl_coef = max(cfg.kl_coef * 0.95, 1e-4)
+        pct = ((i + 1) / cfg.num_iterations) * 100
+        elapsed = time.time() - start
+        eta = (elapsed / (i + 1)) * (cfg.num_iterations - i - 1)
+        # Exponential moving average over every reward recorded so far.
+        ema = all_rewards[0]
+        for r in all_rewards[1:]:
+            ema = EMA_ALPHA * r + (1 - EMA_ALPHA) * ema
+        print(
+            f"  [{pct:5.1f}%] Iter {i+1:3d}/{cfg.num_iterations} | "
+            f"r={reward:+.3f} ema={ema:+.3f} loss={stats['loss']:+.4f} "
+            f"kl={stats['kl']:+.4f} ent={stats['entropy']:+.4f} eta={eta:.0f}s"
         )
+    print("\n  POST-TRAINING EVAL")
     model.eval()
+    post_reward, post_stats = run_grpo_episode(model, ref_model, tokenizer, env, cfg, train_mode=False)
+    all_rewards.append(post_reward)
+    iter_stats.append(post_stats)
+    print(f"    Final reward: {post_reward:+.3f} (Δ={post_reward - pre_reward:+.3f})")
+    nuke_vram(model, optimizer, tokenizer, ref_model=ref_model)
+    return all_rewards, iter_stats
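
The training loop adjusts `cfg.kl_coef` in place after each iteration. The same rule can be written as a pure function, which makes the band and the bounds easier to test. This is a sketch: `adapt_kl_coef` is a hypothetical helper, the multipliers and bounds are copied from the loop above, and the `target_kl` value of 0.02 is an assumption made only for the demo.

```python
def adapt_kl_coef(kl_coef, observed_kl, target_kl,
                  grow=1.15, shrink=0.95, lo=1e-4, hi=0.2):
    """Nudge the KL penalty weight so the observed KL drifts back toward target_kl."""
    if observed_kl > target_kl * 1.5:
        # Policy is moving too far from the reference: penalize harder, capped at hi.
        return min(kl_coef * grow, hi)
    if observed_kl < target_kl * 0.5:
        # Policy is barely moving: relax the penalty, floored at lo.
        return max(kl_coef * shrink, lo)
    return kl_coef


if __name__ == "__main__":
    coef = 0.01
    for kl in [0.05, 0.05, 0.01, 0.001]:  # made-up KL readings
        coef = adapt_kl_coef(coef, kl, target_kl=0.02)
        print(f"kl={kl:.3f} -> kl_coef={coef:.5f}")
```

Returning the new coefficient instead of mutating the config keeps each model's run independent of the schedule reached by a previous one.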
+def _build_grpo_config(
+    num_iterations=200,
+    steps_per_episode=15,
+    group_size=4,
+    clip_epsilon=0.2,
+    kl_coef=0.01,
+    entropy_coef=0.001,
+    learning_rate=2e-6,
+    max_gen_tokens=80,
+    temperature=0.7,
+    seed=42,
+):
+    return GRPOConfig(
+        num_iterations=int(num_iterations),
+        steps_per_episode=int(steps_per_episode),
+        group_size=max(2, int(group_size)),
+        clip_epsilon=float(clip_epsilon),
+        kl_coef=float(kl_coef),
+        entropy_coef=float(entropy_coef),
+        learning_rate=float(learning_rate),
+        max_gen_tokens=int(max_gen_tokens),
+        temperature=float(temperature),
+        seed=int(seed),
+    )
+def train_llm(
+    model_name=None,
+    num_iterations=200,
+    steps_per_episode=15,
+    learning_rate=2e-6,
+    progress_callback=None,
+    group_size=4,
+    clip_epsilon=0.2,
+    kl_coef=0.01,
+    entropy_coef=0.001,
+    max_gen_tokens=80,
+    temperature=0.7,
+    seed=42,
+):
     log_lines = []
     def log(msg):
         print(msg)
         log_lines.append(msg)
         if progress_callback:
             progress_callback("\n".join(log_lines))
     if model_name and "," in model_name:
         models = [m.strip() for m in model_name.split(",")]
     elif model_name:
         models = [model_name]
     else:
         models = MODELS_TO_TEST
+    log(f"Device: {DEVICE}")
+    log(f"Models: {len(models)}")
     all_results = {}
+    all_stats = {}
     full_log = []
+    for idx, mname in enumerate(models, start=1):
         short = mname.split("/")[-1]
+        log(f"\n{'-' * 58}\n[{idx}/{len(models)}] {short}\n{'-' * 58}")
+        cfg = _build_grpo_config(
+            num_iterations=num_iterations,
+            steps_per_episode=steps_per_episode,
+            group_size=group_size,
+            clip_epsilon=clip_epsilon,
+            kl_coef=kl_coef,
+            entropy_coef=entropy_coef,
+            learning_rate=learning_rate,
+            max_gen_tokens=max_gen_tokens,
+            temperature=temperature,
+            seed=seed + idx,
+        )
         try:
+            rewards, iter_stats = train_single_model_grpo(mname, cfg)
             all_results[short] = rewards
+            all_stats[short] = iter_stats
             delta = rewards[-1] - rewards[0]
+            log(f"GRPO complete: pre={rewards[0]:+.3f} post={rewards[-1]:+.3f} delta={delta:+.3f}")
+            full_log.append({"model": mname, "pre": rewards[0], "post": rewards[-1], "delta": delta, "rewards": rewards})
         except Exception as e:
+            log(f"FAILED: {short}: {e}")
             full_log.append({"model": mname, "error": str(e)})
+            nuke_vram()
+    os.makedirs("outputs", exist_ok=True)
     graph_path = None
     if all_results:
+        graph_path = generate_grpo_dashboard(all_results, all_stats, output_path="outputs/grpo_dashboard.png")
+        log(f"Dashboard saved: {graph_path}")
+    with open("outputs/multi_model_log.json", "w", encoding="utf-8") as f:
+        json.dump({"summary": full_log, "stats": all_stats}, f, indent=2, default=str)
     flat_rewards = []
     for rewards in all_results.values():
         flat_rewards.extend(rewards)
+    return flat_rewards or [0], full_log, graph_path, "\n".join(log_lines)

cloud_arena/visualization.py CHANGED

@@ -54,3 +54,54 @@ def generate_dashboard(callback, output_path="outputs/training_dashboard.png"):
     plt.savefig(output_path, dpi=200, bbox_inches='tight', facecolor=REF_BG)
     plt.close()
     return output_path

+def generate_grpo_dashboard(all_results, all_stats, output_path="outputs/grpo_dashboard.png"):
+    fig, axs = plt.subplots(2, 2, figsize=(16, 10), facecolor=REF_BG)
+    ax1, ax2, ax3, ax4 = axs.flatten()
+    for ax in [ax1, ax2, ax3, ax4]:
+        ax.set_facecolor(REF_BG)
+        ax.grid(True, alpha=0.08, color="white")
+        ax.spines["top"].set_visible(False)
+        ax.spines["right"].set_visible(False)
+        ax.spines["left"].set_color("#333333")
+        ax.spines["bottom"].set_color("#333333")
+        ax.tick_params(colors=TEXT_COLOR, labelsize=9)
+    palette = ["#00d4ff", "#ffa500", "#39ff14", "#ff6b6b", "#b47eff"]
+    model_names = list(all_results.keys())
+    for i, name in enumerate(model_names):
+        c = palette[i % len(palette)]
+        rewards = all_results[name]
+        ax1.plot(smooth(np.array(rewards), box_pts=min(20, max(3, len(rewards) // 5))), color=c, lw=2, label=name)
+        kl_curve = [s.get("kl", 0.0) for s in all_stats.get(name, [])]
+        ent_curve = [s.get("entropy", 0.0) for s in all_stats.get(name, [])]
+        veto_curve = [s.get("veto_rate", 0.0) for s in all_stats.get(name, [])]
+        ax2.plot(kl_curve, color=c, lw=1.8, label=name)
+        ax3.plot(ent_curve, color=c, lw=1.8, label=name)
+        ax4.plot(veto_curve, color=c, lw=1.8, label=name)
+    ax1.set_title("GRPO Reward (Smoothed)", color=TEXT_COLOR, fontsize=12, fontweight="bold")
+    ax1.set_xlabel("Episode", color=TEXT_COLOR)
+    ax1.set_ylabel("Reward", color=TEXT_COLOR)
+    ax1.legend(facecolor="#1a1a2e", edgecolor="#333", labelcolor=TEXT_COLOR, fontsize=8)
+    ax2.set_title("KL Trend", color=TEXT_COLOR, fontsize=12, fontweight="bold")
+    ax2.set_xlabel("Episode", color=TEXT_COLOR)
+    ax2.set_ylabel("KL", color=TEXT_COLOR)
+    ax3.set_title("Entropy Trend", color=TEXT_COLOR, fontsize=12, fontweight="bold")
+    ax3.set_xlabel("Episode", color=TEXT_COLOR)
+    ax3.set_ylabel("Entropy", color=TEXT_COLOR)
+    ax4.set_title("Safety Violation / Veto Rate", color=TEXT_COLOR, fontsize=12, fontweight="bold")
+    ax4.set_xlabel("Episode", color=TEXT_COLOR)
+    ax4.set_ylabel("Rate", color=TEXT_COLOR)
+    ax4.set_ylim(0, 1)
+    plt.tight_layout()
+    plt.savefig(output_path, dpi=200, bbox_inches="tight", facecolor=REF_BG)
+    plt.close()
+    return output_path
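
`generate_grpo_dashboard` expects `all_results` as a mapping from model name to a reward list and `all_stats` as a mapping from model name to a list of per-episode stats dicts. The sketch below shows that shape and the per-metric extraction the function performs, without requiring matplotlib. `extract_curve` and the sample values are hypothetical.

```python
def extract_curve(iter_stats, key, default=0.0):
    """Pull one metric out of a list of per-episode stats dicts."""
    return [float(s.get(key, default)) for s in iter_stats]


if __name__ == "__main__":
    # Made-up values shaped like the inputs of generate_grpo_dashboard.
    all_results = {"demo-model": [0.10, 0.25, 0.40]}
    all_stats = {
        "demo-model": [
            {"kl": 0.0, "entropy": 0.0, "veto_rate": 0.2},  # eval-only episode reports zeros
            {"kl": 0.013, "entropy": 1.9, "veto_rate": 0.1},
            {"kl": 0.021, "entropy": 1.7},  # missing key falls back to the default
        ]
    }
    for name, rewards in all_results.items():
        stats = all_stats.get(name, [])
        print(name, len(rewards), extract_curve(stats, "kl"), extract_curve(stats, "veto_rate"))
```

The first and last entries of each stats list come from the evaluation-only episodes, so their `kl` and `entropy` values are zero by construction and show up as dips at both ends of the KL and entropy panels.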