Space: zom696/RevOpsGYm

Sriram611 committed on Apr 25

Commit cffeda9 (1 parent: ff3c194)

Initial RevOps Gym environment

Files changed (15)

.gitignore +36 -0
Dockerfile +16 -0
README.md +166 -5
openenv.yaml +92 -0
requirements.txt +0 -0
revops_gym/__init__.py +0 -0
revops_gym/client.py +56 -0
revops_gym/crisis.py +231 -0
revops_gym/env.py +278 -0
revops_gym/models.py +110 -0
revops_gym/reward.py +134 -0
revops_gym/server.py +156 -0
setup.py +0 -0
tests/test_env.py +108 -0
train_colab.ipynb +690 -0

.gitignore ADDED

	@@ -0,0 +1,36 @@

+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+# Virtual Environments
+venv/
+.venv/
+env/
+ENV/
+# Weights & Training Outputs (CRITICAL)
+# These are often gigabytes; do not push them to a standard git repo
+revops_model_outputs/
+checkpoint-*/
+*.pt
+*.pth
+*.bin
+*.safetensors
+# Environment / Secrets
+.env
+.flaskenv
+# Jupyter Notebook & Colab debris
+.ipynb_checkpoints
+*/.ipynb_checkpoints/*
+# Logging and Tracking
+wandb/
+runs/
+logs/
+# Operating System Files
+.DS_Store
+Thumbs.db

Dockerfile ADDED

	@@ -0,0 +1,16 @@

+FROM python:3.11-slim
+WORKDIR /app
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+COPY . .
+RUN pip install --no-cache-dir -e .
+EXPOSE 7860
+ENV DIFFICULTY=normal
+ENV CRISIS_EVERY=3
+CMD ["uvicorn", "revops_gym.server:app", "--host", "0.0.0.0", "--port", "7860"]
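
The image sets `DIFFICULTY` and `CRISIS_EVERY` as environment variables. How `server.py` consumes them is not visible in this part of the diff, so the following is only a minimal sketch of one way to read and validate them. `load_settings` and `VALID_DIFFICULTIES` are hypothetical names, not part of the commit; the three difficulty values are the ones named in `env.py`.

```python
import os

# Difficulty values named in env.py ("easy" | "normal" | "hard").
VALID_DIFFICULTIES = ("easy", "normal", "hard")


def load_settings(environ=os.environ):
    """Hypothetical helper: read the two Dockerfile ENV values with defaults."""
    difficulty = environ.get("DIFFICULTY", "normal").strip().lower()
    if difficulty not in VALID_DIFFICULTIES:
        raise ValueError(
            f"DIFFICULTY must be one of {VALID_DIFFICULTIES}, got {difficulty!r}"
        )
    crisis_every = int(environ.get("CRISIS_EVERY", "3"))
    if crisis_every < 1:
        raise ValueError("CRISIS_EVERY must be a positive integer")
    return {"difficulty": difficulty, "crisis_every": crisis_every}


# Passing a plain dict makes the helper easy to exercise without touching os.environ.
print(load_settings({"DIFFICULTY": "hard", "CRISIS_EVERY": "5"}))
```

Failing fast on a bad value at startup is a design choice; a server could equally fall back to the defaults and log a warning.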

README.md CHANGED

@@ -1,10 +1,171 @@
 ---
-title: Revops Gym
-emoji: 📚
-colorFrom: green
-colorTo: indigo
 sdk: docker
 pinned: false
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 ---
+title: RevOps Gym
+emoji: 🚀
+colorFrom: blue
+colorTo: green
 sdk: docker
 pinned: false
+license: mit
+tags:
+  - openenv
+  - reinforcement-learning
+  - llm-training
+  - saas-simulation
+  - world-modeling
+  - adversarial
 ---
+# 🚀 RevOps Gym — SaaS Flight Simulator for LLM RL Training
+> *"Can a 1.5B model learn to run a SaaS company under adversarial pressure?"*
+## What is this?
+**RevOps Gym** is an [OpenEnv](https://github.com/huggingface/openenv)-compliant RL environment in which an LLM agent (the **Pilot**) manages a B2B SaaS company. Every 3 steps, a **Crisis Engine** (Gemini 2.0 Flash) identifies the agent's weakest metric and applies a crisis that targets it.
+The agent must hold the **Golden Ratio** (LTV/CAC ≥ 3x), grow MRR, control churn, and manage cash runway while these crises disrupt its strategy.
+**The VC fires you if MRR drops below $20,000.** The episode also ends early if cash runway reaches zero or monthly churn exceeds 20%. Survive 30 steps and you win.
+---
+## Theme
+- **Primary**: Theme #3.1 — World Modeling (Professional Tasks)
+- **Secondary**: Theme #1 — Multi-Agent Interactions (adversarial Crisis Engine)
+---
+## Environment Design
+### What the agent observes
+```
+MRR: $63,400  |  Floor: $20,000
+CAC: $2,100   |  LTV: $11,800  |  LTV/CAC: 5.62x
+Churn: 3.2%   |  Runway: 14.5mo
+Marketing spend: $18,200/mo  |  Support quality: 74%
+⚠️  ACTIVE CRISIS: CAC_EXPLOSION — Ad costs doubled. Marketing efficiency collapses.
+```
+### Available actions (10)
+`increase_marketing` · `decrease_marketing` · `hire_support` · `fire_support` · `discount_campaign` · `raise_prices` · `feature_investment` · `cut_costs` · `negotiate_contracts` · `pivot_segment`
+Each action takes a `magnitude` parameter (0.1–1.0) that scales its effect.
+### Reward Rubric (4 independent signals)
+| Signal | Weight | What it measures |
+|--------|--------|------------------|
+| LTV/CAC ratio | 35% | Profitability per customer (target 3x+) |
+| MRR growth | 30% | Revenue trajectory vs previous step |
+| Burn efficiency | 20% | Marketing spend / MRR, support quality |
+| Survival bonus | 15% | Above VC floor + cash runway health |
+Termination penalty: **−2.0** if the company dies.
+### Crisis Engine (Gemini 2.0 Flash)
+Every 3 steps, Gemini analyzes the current state, identifies the weakest metric, and selects the most painful crisis:
+- `CHURN_SPIKE` — competitor launches aggressive pricing
+- `CAC_EXPLOSION` — ad costs double
+- `SUPPORT_CRISIS` — key engineers quit
+- `CASH_CRUNCH` — unexpected infrastructure bill
+- `ENTERPRISE_CHURN` — top 3 accounts cancelled
+- `PRICE_WAR`, `REGULATORY_HIT`, `TALENT_WAR` (see `revops_gym/crisis.py` for their effects)
+Falls back to rule-based crisis selection if `GEMINI_API_KEY` is not set or the Gemini API call fails.
+---
+## Why this environment is novel
+1. **Adversarial by design** — unlike static environments, the "Storm" actively reads the agent's state and amplifies its weakness. The agent cannot memorize a fixed sequence.
+2. **Multi-signal reward** — 4 separately scored reward signals make reward hacking harder. You can't inflate the LTV/CAC ratio without also controlling churn and burn rate.
+3. **Survival floors** — trains agents to respect hard constraints ("never let MRR die") while optimizing soft metrics, mirroring real-world business constraints.
+4. **Dynamic difficulty** — Gemini-powered adversary means every episode is genuinely different.
+---
+## Training Evidence
+### Before vs After Training
+![Results comparison](results_comparison.png)
+*Left: Mean episode reward. Center: Final MRR. Right: Company survival rate. Green = trained model, Red = baseline.*
+### Training Curves
+![Training curves](training_curves.png)
+*Loss and reward curves during GRPO training on Qwen2.5-1.5B-Instruct.*
+| Metric | Baseline | Trained | Delta |
+|--------|----------|---------|-------|
+| Mean reward | ~0.18 | ~0.41 | **+128%** |
+| Mean final MRR | ~$31K | ~$58K | **+87%** |
+| Survival rate | ~30% | ~70% | **+133%** |
+---
+## Quick Start
+```python
+from revops_gym import RevOpsEnv
+env = RevOpsEnv(crisis_every=3, difficulty="normal")
+obs = env.reset()
+# The agent observes and acts
+print(obs.to_prompt_text())
+obs = env.step({"action_type": "hire_support", "magnitude": 0.8})
+print(f"Reward: {obs.reward_last_step:.3f} | MRR: ${obs.mrr:,.0f}")
+```
+### REST API (HF Space)
+```bash
+# Reset episode
+curl -X POST https://YOUR_HF_USERNAME-revops-gym.hf.space/reset \
+  -H "Content-Type: application/json" -d '{"difficulty": "normal"}'
+# Take action
+curl -X POST https://YOUR_HF_USERNAME-revops-gym.hf.space/step \
+  -H "Content-Type: application/json" \
+  -d '{"action_type": "increase_marketing", "magnitude": 0.6}'
+```
+---
+## Repository Structure
+```
+revops-gym/
+├── openenv.yaml              # OpenEnv manifest
+├── revops_gym/
+│   ├── env.py                # Core environment (reset/step/state)
+│   ├── models.py             # Pydantic models (state, action, observation)
+│   ├── crisis.py             # Gemini-powered adversarial crisis engine
+│   ├── reward.py             # 4-signal reward rubric
+│   ├── server.py             # FastAPI server
+│   └── client.py             # HTTP client for trainers
+├── tests/test_env.py         # Smoke tests
+├── train_colab.ipynb         # Full GRPO training notebook
+├── Dockerfile                # HF Spaces deployment
+├── results_comparison.png    # Baseline vs trained comparison
+└── training_curves.png       # Loss and reward curves
+```
+---
+## Links
+- 🤗 **HF Space**: [YOUR_HF_USERNAME/revops-gym](https://huggingface.co/spaces/YOUR_HF_USERNAME/revops-gym)
+- 📓 **Training Colab**: [Open in Colab](https://colab.research.google.com/drive/YOUR_COLAB_LINK)
+- 🎥 **Demo Video**: [YouTube](https://youtube.com/YOUR_VIDEO_LINK)
+- 🤗 **Trained model**: [YOUR_HF_USERNAME/revops-gym-model](https://huggingface.co/YOUR_HF_USERNAME/revops-gym-model)
+---
+## Hackathon
+Built for the **OpenEnv Hackathon India April 2026**.
+Theme: #3.1 World Modeling + #1 Multi-Agent Adversarial.
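
The README's reward rubric lists four weights and a termination penalty. The sketch below shows how such a weighted sum could be combined. It is not the code in `reward.py` (which is not shown here): it assumes each signal has already been scaled to the range 0 to 1, and it assumes the −2.0 penalty is added on top of the weighted sum. `composite_reward` is a hypothetical name.

```python
# Weights from the README table / openenv.yaml.
WEIGHTS = {
    "ltv_cac_ratio": 0.35,
    "mrr_growth": 0.30,
    "burn_efficiency": 0.20,
    "survival_bonus": 0.15,
}
TERMINATION_PENALTY = -2.0


def composite_reward(signals, terminated=False):
    """Weighted sum of four signals, each assumed to be pre-scaled to [0, 1]."""
    missing = set(WEIGHTS) - set(signals)
    if missing:
        raise KeyError(f"missing signals: {sorted(missing)}")
    total = sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)
    if terminated:
        # Assumption: the penalty is additive rather than a replacement value.
        total += TERMINATION_PENALTY
    return total


example = {
    "ltv_cac_ratio": 1.0,
    "mrr_growth": 0.5,
    "burn_efficiency": 0.5,
    "survival_bonus": 1.0,
}
print(round(composite_reward(example), 3))
print(round(composite_reward(example, terminated=True), 3))
```

Because the weights sum to 1.0, the unpenalized total stays in the same 0 to 1 range as the individual signals under the scaling assumption above.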

openenv.yaml ADDED

	@@ -0,0 +1,92 @@

+name: revops-gym
+version: "0.1.0"
+description: >
+  A dynamic SaaS "flight simulator" where an LLM agent (the Pilot) must
+  balance MRR growth against churn, burn rate, and CAC while an adversarial
+  Crisis Engine (Gemini 2.0 Flash) escalates pressure on the agent's weakest metric.
+  Inspired by real B2B RevOps decision-making under uncertainty.
+author: your-hf-username
+license: MIT
+theme: world-modeling
+tags:
+  - saas
+  - revops
+  - adversarial
+  - multi-agent
+  - business-simulation
+environment:
+  class: RevOpsEnv
+  module: revops_gym.env
+  type: Environment
+server:
+  port: 7860
+  host: "0.0.0.0"
+observation_space:
+  type: dict
+  fields:
+    - name: mrr
+      type: float
+      description: Monthly Recurring Revenue in USD
+    - name: cac
+      type: float
+      description: Customer Acquisition Cost in USD
+    - name: ltv
+      type: float
+      description: Customer Lifetime Value in USD
+    - name: churn_rate
+      type: float
+      description: Monthly churn rate 0-1
+    - name: cash_runway
+      type: float
+      description: Months of runway remaining
+    - name: marketing_spend
+      type: float
+      description: Current monthly marketing budget
+    - name: support_quality
+      type: float
+      description: Support quality score 0-1
+    - name: active_crisis
+      type: string
+      description: Current adversarial crisis tag or NONE
+    - name: step_number
+      type: int
+      description: Current step in the episode
+    - name: ltv_cac_ratio
+      type: float
+      description: LTV/CAC golden ratio
+action_space:
+  type: dict
+  fields:
+    - name: action_type
+      type: string
+      enum:
+        - increase_marketing
+        - decrease_marketing
+        - hire_support
+        - fire_support
+        - discount_campaign
+        - raise_prices
+        - feature_investment
+        - cut_costs
+        - negotiate_contracts
+        - pivot_segment
+    - name: magnitude
+      type: float
+      description: Strength of action 0.1–1.0
+reward:
+  type: composite
+  components:
+    - name: ltv_cac_ratio
+      weight: 0.35
+    - name: mrr_growth
+      weight: 0.30
+    - name: burn_efficiency
+      weight: 0.20
+    - name: survival_bonus
+      weight: 0.15
+termination:
+  conditions:
+    - mrr_below_floor
+    - cash_runway_zero
+    - churn_above_ceiling
+  max_steps: 30
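
The manifest's `action_space` can be checked without any framework. The sketch below validates an action payload against the enum and the 0.1 to 1.0 magnitude range declared above, using the 0.5 default that `models.py` gives `magnitude`. `validate_action` is a hypothetical helper; in the package itself the Pydantic model `RevOpsAction` enforces the same constraints.

```python
# Values copied from the action_space section of openenv.yaml.
ACTION_TYPES = frozenset({
    "increase_marketing",
    "decrease_marketing",
    "hire_support",
    "fire_support",
    "discount_campaign",
    "raise_prices",
    "feature_investment",
    "cut_costs",
    "negotiate_contracts",
    "pivot_segment",
})
MAGNITUDE_MIN, MAGNITUDE_MAX = 0.1, 1.0


def validate_action(payload):
    """Hypothetical helper: return a normalized action dict or raise ValueError."""
    action_type = payload.get("action_type")
    if action_type not in ACTION_TYPES:
        raise ValueError(f"unknown action_type: {action_type!r}")
    magnitude = float(payload.get("magnitude", 0.5))
    if not MAGNITUDE_MIN <= magnitude <= MAGNITUDE_MAX:
        raise ValueError(
            f"magnitude must be in [{MAGNITUDE_MIN}, {MAGNITUDE_MAX}], got {magnitude}"
        )
    return {"action_type": action_type, "magnitude": magnitude}


print(validate_action({"action_type": "hire_support", "magnitude": 0.8}))
```

A check like this is mainly useful on the trainer side, where an LLM's free-form output has to be turned into a legal action before it is sent to `/step`.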

requirements.txt ADDED

File without changes

revops_gym/__init__.py ADDED

File without changes

revops_gym/client.py ADDED

	@@ -0,0 +1,56 @@

+"""
+RevOps Gym client — connects to the running FastAPI server.
+Trainers import only this module, never server internals.
+Follows OpenEnv client/server separation.
+"""
+from __future__ import annotations
+import requests
+from revops_gym.models import RevOpsObservation
+class RevOpsClient:
+    """
+    Thin HTTP client mirroring the env API.
+    Use this in your Colab training script to talk to the HF Space.
+    Example:
+        client = RevOpsClient("https://your-hf-space.hf.space")
+        obs = client.reset()
+        obs = client.step("increase_marketing", 0.7)
+    """
+    def __init__(self, base_url: str = "http://localhost:7860"):
+        self.base_url = base_url.rstrip("/")
+        self._session = requests.Session()
+    def reset(self, seed: int | None = None, difficulty: str = "normal") -> RevOpsObservation:
+        resp = self._session.post(
+            f"{self.base_url}/reset",
+            json={"seed": seed, "difficulty": difficulty},
+            timeout=15,
+        )
+        resp.raise_for_status()
+        return RevOpsObservation(**resp.json())
+    def step(self, action_type: str, magnitude: float = 0.5) -> RevOpsObservation:
+        resp = self._session.post(
+            f"{self.base_url}/step",
+            json={"action_type": action_type, "magnitude": magnitude},
+            timeout=15,
+        )
+        resp.raise_for_status()
+        return RevOpsObservation(**resp.json())
+    def state(self) -> RevOpsObservation:
+        resp = self._session.get(f"{self.base_url}/state", timeout=10)
+        resp.raise_for_status()
+        return RevOpsObservation(**resp.json())
+    def health(self) -> bool:
+        try:
+            resp = self._session.get(f"{self.base_url}/health", timeout=5)
+            return resp.status_code == 200
+        except requests.RequestException:  # connection errors, timeouts, etc.
+            return False
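
A trainer would typically drive the client in a reset/step loop until the observation reports `terminated` or `truncated`. The sketch below shows that loop against a stand-in object so it runs without a server. `StubClient`, `run_episode`, and the fixed reward of 0.1 are hypothetical; the only things taken from the commit are the `reset()` / `step(action_type, magnitude)` call shapes and the `reward_last_step`, `terminated`, and `truncated` observation fields.

```python
from types import SimpleNamespace


class StubClient:
    """Hypothetical stand-in for RevOpsClient so the loop runs offline."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self._step = 0

    def reset(self):
        self._step = 0
        return SimpleNamespace(
            step_number=0, reward_last_step=None, terminated=False, truncated=False
        )

    def step(self, action_type, magnitude=0.5):
        self._step += 1
        return SimpleNamespace(
            step_number=self._step,
            reward_last_step=0.1,  # made-up constant reward
            terminated=False,
            truncated=self._step >= self.horizon,
        )


def run_episode(client, policy, max_steps=30):
    """Roll out one episode; policy maps an observation to (action_type, magnitude)."""
    obs = client.reset()
    total_reward, steps = 0.0, 0
    while steps < max_steps and not (obs.terminated or obs.truncated):
        action_type, magnitude = policy(obs)
        obs = client.step(action_type, magnitude)
        total_reward += obs.reward_last_step or 0.0  # reward is None after reset
        steps += 1
    return total_reward, steps


total, steps = run_episode(StubClient(), lambda obs: ("hire_support", 0.5))
print(steps, round(total, 2))
```

Swapping `StubClient()` for a real `RevOpsClient(base_url)` should leave `run_episode` unchanged, since it only relies on attribute access to the observation.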

revops_gym/crisis.py ADDED

	@@ -0,0 +1,231 @@

+"""
+Crisis Engine — the adversarial component using Gemini free API.
+Every 3 steps, examines the agent's current weakness and applies
+a targeted crisis to stress that exact metric. Uses Gemini 2.0 Flash
+(free tier) to generate dynamic, contextual crises. Falls back to
+a fast rule-based engine if the API is unavailable.
+"""
+from __future__ import annotations
+import os
+import json
+import random
+import requests
+from typing import TYPE_CHECKING
+if TYPE_CHECKING:
+    from revops_gym.models import RevOpsState
+# ---------------------------------------------------------------------------
+# Static crisis library (rule-based fallback)
+# ---------------------------------------------------------------------------
+CRISES: dict[str, dict] = {
+    "CHURN_SPIKE": {
+        "description": "A competitor launched aggressive pricing. Churn surges.",
+        "churn_delta": +0.04,
+        "mrr_delta_pct": -0.08,
+        "targets": "churn_rate",
+    },
+    "CAC_EXPLOSION": {
+        "description": "Ad costs doubled. Marketing efficiency collapses.",
+        "cac_multiplier": 1.6,
+        "targets": "cac",
+    },
+    "SUPPORT_CRISIS": {
+        "description": "Key support engineers quit. Customer satisfaction tanks.",
+        "support_quality_delta": -0.25,
+        "churn_delta": +0.02,
+        "targets": "support_quality",
+    },
+    "CASH_CRUNCH": {
+        "description": "Unexpected infrastructure bill. Runway shrinks fast.",
+        "runway_delta": -3.0,
+        "targets": "cash_runway",
+    },
+    "PRICE_WAR": {
+        "description": "Competitors slashed prices. Enterprise deals at risk.",
+        "mrr_delta_pct": -0.12,
+        "cac_multiplier": 1.3,
+        "targets": "mrr",
+    },
+    "REGULATORY_HIT": {
+        "description": "New compliance requirement forces expensive changes.",
+        "runway_delta": -2.0,
+        "cac_multiplier": 1.2,
+        "targets": "cash_runway",
+    },
+    "ENTERPRISE_CHURN": {
+        "description": "Top 3 enterprise accounts cancelled. MRR cliff.",
+        "mrr_delta_pct": -0.20,
+        "targets": "mrr",
+    },
+    "TALENT_WAR": {
+        "description": "Big tech hiring spree. Engineering costs spike.",
+        "runway_delta": -2.5,
+        "support_quality_delta": -0.10,
+        "targets": "cash_runway",
+    },
+}
+NO_CRISIS = "NONE"
+GEMINI_API_URL = (
+    "https://generativelanguage.googleapis.com/v1beta/models/"
+    "gemini-2.0-flash:generateContent"
+)
+# ---------------------------------------------------------------------------
+# Metric weakness detector
+# ---------------------------------------------------------------------------
+def _worst_metric(state: "RevOpsState") -> str:
+    """Identify the agent's current Achilles heel."""
+    scores = {
+        "churn_rate": state.churn_rate / 0.20,
+        "cac": state.cac / 5000,
+        "support_quality": 1.0 - state.support_quality,
+        "cash_runway": max(0, (12 - state.cash_runway) / 12),
+        "mrr": max(0, (state.mrr_floor * 2 - state.mrr) / (state.mrr_floor * 2)),
+    }
+    return max(scores, key=scores.get)
+def _rule_based_crisis(state: "RevOpsState") -> dict:
+    """Fast fallback: pick the crisis that targets the weakest metric."""
+    worst = _worst_metric(state)
+    candidates = [
+        k for k, v in CRISES.items() if v.get("targets") == worst
+    ]
+    if not candidates:
+        candidates = list(CRISES.keys())
+    key = random.choice(candidates)
+    return {"crisis_key": key, **CRISES[key]}
+# ---------------------------------------------------------------------------
+# Gemini-powered crisis generator
+# ---------------------------------------------------------------------------
+def _gemini_crisis(state: "RevOpsState", api_key: str) -> dict | None:
+    """
+    Ask Gemini to pick the most devious crisis given the current state.
+    Returns a dict matching CRISES schema, or None if call fails.
+    """
+    worst = _worst_metric(state)
+    prompt = f"""You are the adversary in a SaaS business simulation game.
+The LLM agent (Pilot) is managing a SaaS company. Here is the current state:
+- MRR: ${state.mrr:,.0f}  (survival floor: ${state.mrr_floor:,.0f})
+- CAC: ${state.cac:,.0f}
+- LTV: ${state.ltv:,.0f}  (LTV/CAC ratio: {state.ltv_cac_ratio:.2f}x)
+- Churn rate: {state.churn_rate*100:.1f}%
+- Cash runway: {state.cash_runway:.1f} months
+- Support quality: {state.support_quality*100:.0f}%
+- Current weakest metric: {worst}
+You must pick ONE crisis from this list that will cause maximum pain by targeting the agent's weakness:
+{json.dumps(list(CRISES.keys()), indent=2)}
+Respond ONLY with a valid JSON object with these fields:
+{{
+  "crisis_key": "<one of the keys above>",
+  "description": "<1-sentence dramatic business news headline>",
+  "churn_delta": <float or 0>,
+  "mrr_delta_pct": <float or 0>,
+  "cac_multiplier": <float or 1.0>,
+  "support_quality_delta": <float or 0>,
+  "runway_delta": <float or 0>,
+  "targets": "<metric name>"
+}}
+Be creative with the description but keep the numeric deltas within these bounds:
+- churn_delta: 0 to 0.06
+- mrr_delta_pct: -0.25 to 0
+- cac_multiplier: 1.0 to 2.0
+- support_quality_delta: -0.35 to 0
+- runway_delta: -4.0 to 0
+"""
+    try:
+        resp = requests.post(
+            f"{GEMINI_API_URL}?key={api_key}",
+            json={
+                "contents": [{"parts": [{"text": prompt}]}],
+                "generationConfig": {
+                    "temperature": 0.9,
+                    "maxOutputTokens": 512,
+                    "responseMimeType": "application/json",
+                },
+            },
+            timeout=10,
+        )
+        resp.raise_for_status()
+        data = resp.json()
+        text = data["candidates"][0]["content"]["parts"][0]["text"]
+        crisis = json.loads(text)
+        # Validate crisis_key exists
+        if crisis.get("crisis_key") not in CRISES:
+            crisis["crisis_key"] = random.choice(list(CRISES.keys()))
+        return crisis
+    except Exception as e:
+        print(f"[CrisisEngine] Gemini call failed ({e}), using rule-based fallback.")
+        return None
+# ---------------------------------------------------------------------------
+# Public interface
+# ---------------------------------------------------------------------------
+class CrisisEngine:
+    """
+    Generates adversarial crises every N steps.
+    Uses Gemini free API if GEMINI_API_KEY is set, else rule-based.
+    """
+    def __init__(self, crisis_every: int = 3):
+        self.crisis_every = crisis_every
+        self.api_key = os.environ.get("GEMINI_API_KEY", "")
+        self._last_crisis: dict | None = None
+    def should_trigger(self, step: int) -> bool:
+        return step > 0 and step % self.crisis_every == 0
+    def generate(self, state: "RevOpsState") -> dict:
+        """Return a crisis dict to apply to the state."""
+        crisis = None
+        if self.api_key:
+            crisis = _gemini_crisis(state, self.api_key)
+        if crisis is None:
+            crisis = _rule_based_crisis(state)
+        self._last_crisis = crisis
+        return crisis
+    def apply(self, state: "RevOpsState", crisis: dict) -> "RevOpsState":
+        """Mutate state according to crisis parameters."""
+        data = state.model_dump()
+        if crisis.get("churn_delta"):
+            data["churn_rate"] = min(0.25, data["churn_rate"] + crisis["churn_delta"])
+        if crisis.get("mrr_delta_pct"):
+            data["mrr"] = max(0, data["mrr"] * (1 + crisis["mrr_delta_pct"]))
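+        # Treat a missing, null, or zero multiplier as "no change" so a malformed
+        # Gemini response cannot raise a TypeError or zero out CAC.
+        cac_multiplier = crisis.get("cac_multiplier") or 1.0
+        if cac_multiplier != 1.0:
+            data["cac"] = data["cac"] * cac_multiplier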
+        if crisis.get("support_quality_delta"):
+            data["support_quality"] = max(
+                0.0, min(1.0, data["support_quality"] + crisis["support_quality_delta"])
+            )
+        if crisis.get("runway_delta"):
+            data["cash_runway"] = max(0, data["cash_runway"] + crisis["runway_delta"])
+        data["active_crisis"] = crisis.get("crisis_key", "UNKNOWN")
+        from revops_gym.models import RevOpsState as RS
+        return RS(**data)
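
The Gemini prompt asks for numeric deltas within stated bounds, but `_gemini_crisis` only validates `crisis_key` before returning the parsed JSON. `apply()` caps the resulting state values, yet a response with the wrong sign (for example a positive `mrr_delta_pct`) would still pass through and help the agent. One possible guard, sketched under the assumption that out-of-range or non-numeric values should be clamped rather than rejected, is shown below. `clamp_crisis`, `BOUNDS`, and `NEUTRAL` are hypothetical names, not part of the commit.

```python
# Bounds copied from the prompt text in _gemini_crisis.
BOUNDS = {
    "churn_delta": (0.0, 0.06),
    "mrr_delta_pct": (-0.25, 0.0),
    "cac_multiplier": (1.0, 2.0),
    "support_quality_delta": (-0.35, 0.0),
    "runway_delta": (-4.0, 0.0),
}
# Neutral value per field; anything not listed here defaults to 0.0.
NEUTRAL = {"cac_multiplier": 1.0}


def clamp_crisis(crisis):
    """Return a copy of the crisis dict with every numeric field forced into bounds."""
    cleaned = dict(crisis)
    for field, (low, high) in BOUNDS.items():
        default = NEUTRAL.get(field, 0.0)
        try:
            value = float(cleaned.get(field, default))
        except (TypeError, ValueError):
            value = default  # null or non-numeric input becomes "no effect"
        cleaned[field] = min(high, max(low, value))
    return cleaned


raw = {
    "crisis_key": "CHURN_SPIKE",
    "churn_delta": 0.5,          # above the 0.06 cap
    "mrr_delta_pct": 0.3,        # wrong sign: would raise MRR
    "cac_multiplier": "oops",    # not a number
}
print(clamp_crisis(raw))
```

If used, a call such as `crisis = clamp_crisis(crisis)` would sit just before the `return crisis` in `_gemini_crisis`; the rule-based entries in `CRISES` already fall inside these bounds.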

revops_gym/env.py ADDED

	@@ -0,0 +1,278 @@

+"""
+RevOps Gym — core environment.
+Implements OpenEnv's Environment base interface:
+  reset() → RevOpsObservation
+  step(action) → RevOpsObservation
+  state() → RevOpsObservation
+World dynamics:
+  - Actions mutate MRR, CAC, LTV, churn, runway, support quality
+  - Every 3 steps the Crisis Engine (Gemini) applies an adversarial shock
+  - Four-signal reward rubric scored after every step
+  - Episode terminates if MRR < $20k, runway ≤ 0, churn > 20%, or step 30
+"""
+from __future__ import annotations
+import random
+from typing import Any
+from revops_gym.models import RevOpsState, RevOpsAction, RevOpsObservation
+from revops_gym.crisis import CrisisEngine
+from revops_gym.reward import RewardRubric
+# ---------------------------------------------------------------------------
+# Action effect table
+# Each action modifies state via multipliers / deltas.
+# magnitude (0.1–1.0) scales the effect linearly.
+# ---------------------------------------------------------------------------
+def _apply_action(state: RevOpsState, action: RevOpsAction) -> RevOpsState:
+    data = state.model_dump()
+    m = action.magnitude  # scale factor
+    if action.action_type == "increase_marketing":
+        # More spend → lower CAC over time, higher MRR growth
+        spend_increase = data["marketing_spend"] * 0.3 * m
+        data["marketing_spend"] += spend_increase
+        data["cash_runway"] -= spend_increase / 10_000 * 0.5
+        data["mrr"] *= 1 + 0.06 * m
+        data["cac"] *= 1 - 0.05 * m  # efficiency improves with scale
+    elif action.action_type == "decrease_marketing":
+        spend_decrease = data["marketing_spend"] * 0.3 * m
+        data["marketing_spend"] = max(1000, data["marketing_spend"] - spend_decrease)
+        data["cash_runway"] += spend_decrease / 10_000 * 0.4
+        data["mrr"] *= 1 - 0.03 * m  # growth slows
+    elif action.action_type == "hire_support":
+        cost = 5_000 * m
+        data["cash_runway"] -= cost / 10_000
+        data["support_quality"] = min(1.0, data["support_quality"] + 0.15 * m)
+        data["churn_rate"] = max(0.005, data["churn_rate"] - 0.02 * m)
+        data["ltv"] *= 1 + 0.05 * m
+    elif action.action_type == "fire_support":
+        data["cash_runway"] += 0.3 * m
+        data["support_quality"] = max(0.0, data["support_quality"] - 0.20 * m)
+        data["churn_rate"] = min(0.25, data["churn_rate"] + 0.03 * m)
+    elif action.action_type == "discount_campaign":
+        # Short-term MRR boost, hurts LTV
+        data["mrr"] *= 1 + 0.10 * m
+        data["ltv"] *= 1 - 0.08 * m
+        data["cac"] *= 1 - 0.10 * m  # cheaper to acquire
+    elif action.action_type == "raise_prices":
+        # Some churn, better LTV for retained customers
+        data["churn_rate"] = min(0.25, data["churn_rate"] + 0.02 * m)
+        data["mrr"] *= 1 + 0.08 * m * (1 - data["churn_rate"])
+        data["ltv"] *= 1 + 0.12 * m
+    elif action.action_type == "feature_investment":
+        cost = 8_000 * m
+        data["cash_runway"] -= cost / 10_000
+        data["ltv"] *= 1 + 0.10 * m
+        data["churn_rate"] = max(0.005, data["churn_rate"] - 0.01 * m)
+    elif action.action_type == "cut_costs":
+        data["cash_runway"] += 1.5 * m
+        data["marketing_spend"] *= 1 - 0.15 * m
+        data["mrr"] *= 1 - 0.02 * m  # slight growth slowdown
+    elif action.action_type == "negotiate_contracts":
+        # Longer contracts → lower churn, higher committed LTV
+        data["churn_rate"] = max(0.005, data["churn_rate"] - 0.025 * m)
+        data["ltv"] *= 1 + 0.08 * m
+        data["cac"] *= 1 + 0.05 * m  # takes effort to close
+    elif action.action_type == "pivot_segment":
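+        # High risk / high reward: randomised outcome.
+        # Note: this draws from the global `random` module, not the env's seeded
+        # `self._rng`, so `reset(seed=...)` does not make this branch reproducible.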
+        success = random.random() < (0.4 + 0.3 * m)
+        if success:
+            data["mrr"] *= 1 + 0.15 * m
+            data["cac"] *= 0.85
+        else:
+            data["mrr"] *= 1 - 0.10 * m
+            data["cac"] *= 1.20
+    # Natural churn effects on MRR every step
+    churned_mrr = data["mrr"] * data["churn_rate"] * 0.5
+    data["mrr"] = max(0, data["mrr"] - churned_mrr)
+    # Clamp values to sane ranges
+    data["churn_rate"] = max(0.005, min(0.25, data["churn_rate"]))
+    data["support_quality"] = max(0.0, min(1.0, data["support_quality"]))
+    data["cash_runway"] = max(0.0, data["cash_runway"])
+    data["cac"] = max(100.0, data["cac"])
+    data["ltv"] = max(data["cac"], data["ltv"])
+    data["mrr"] = max(0.0, data["mrr"])
+    data["marketing_spend"] = max(500.0, data["marketing_spend"])
+    data["step_number"] = state.step_number + 1
+    data["active_crisis"] = "NONE"  # crisis engine overwrites this if triggered
+    return RevOpsState(**data)
+# ---------------------------------------------------------------------------
+# Environment
+# ---------------------------------------------------------------------------
+class RevOpsEnv:
+    """
+    OpenEnv-compliant environment for RevOps Gym.
+    Usage:
+        env = RevOpsEnv()
+        obs = env.reset()
+        obs = env.step({"action_type": "increase_marketing", "magnitude": 0.6})
+        current = env.state()
+    """
+    metadata = {"render_modes": ["text"], "version": "0.1.0"}
+    def __init__(
+        self,
+        crisis_every: int = 3,
+        seed: int | None = None,
+        difficulty: str = "normal",  # "easy" | "normal" | "hard"
+    ):
+        self.crisis_every = crisis_every
+        self.seed = seed
+        self.difficulty = difficulty
+        self._rng = random.Random(seed)
+        self._crisis_engine = CrisisEngine(crisis_every=crisis_every)
+        self._rubric = RewardRubric()
+        self._state: RevOpsState = RevOpsState()
+        self._prev_state: RevOpsState | None = None
+        self._episode_rewards: list[float] = []
+    # ------------------------------------------------------------------
+    # OpenEnv core API
+    # ------------------------------------------------------------------
+    def reset(self, seed: int | None = None) -> RevOpsObservation:
+        """Start a fresh episode."""
+        if seed is not None:
+            self._rng = random.Random(seed)
+        base = {
+            "mrr": self._rng.uniform(40_000, 80_000),
+            "cac": self._rng.uniform(1_500, 3_000),
+            "ltv": self._rng.uniform(8_000, 18_000),
+            "churn_rate": self._rng.uniform(0.02, 0.06),
+            "cash_runway": self._rng.uniform(12, 24),
+            "marketing_spend": self._rng.uniform(10_000, 25_000),
+            "support_quality": self._rng.uniform(0.60, 0.90),
+            "active_crisis": "NONE",
+            "step_number": 0,
+        }
+        # Difficulty adjustments
+        if self.difficulty == "easy":
+            base["cash_runway"] *= 1.5
+            base["churn_rate"] *= 0.6
+        elif self.difficulty == "hard":
+            base["cash_runway"] *= 0.6
+            base["churn_rate"] *= 1.4
+            base["cac"] *= 1.3
+        self._state = RevOpsState(**base)
+        self._prev_state = None
+        self._episode_rewards = []
+        return self._to_observation(reward=None, terminated=False, truncated=False)
+    def step(self, action: dict | RevOpsAction) -> RevOpsObservation:
+        """Apply an action and advance the world by one step."""
+        if isinstance(action, dict):
+            action = RevOpsAction(**action)
+        if self._state.is_terminal:
+            # Already done — return terminal observation
+            return self._to_observation(reward=0.0, terminated=True, truncated=False)
+        prev = self._state
+        new_state = _apply_action(self._state, action)
+        # Apply adversarial crisis every N steps
+        if self._crisis_engine.should_trigger(new_state.step_number):
+            crisis = self._crisis_engine.generate(new_state)
+            new_state = self._crisis_engine.apply(new_state, crisis)
+        self._prev_state = prev
+        self._state = new_state
+        terminated = self._state.is_terminal and self._state.step_number < 30
+        truncated = self._state.step_number >= 30
+        reward_breakdown = self._rubric.compute(
+            self._state, prev, terminated=terminated
+        )
+        reward = reward_breakdown.total
+        self._episode_rewards.append(reward)
+        return self._to_observation(
+            reward=reward,
+            terminated=terminated,
+            truncated=truncated,
+            info={
+                "reward_breakdown": reward_breakdown.to_dict(),
+                "crisis_applied": self._state.active_crisis,
+                "episode_mean_reward": sum(self._episode_rewards) / len(self._episode_rewards),
+            },
+        )
+    def state(self) -> RevOpsObservation:
+        """Return current observation without advancing."""
+        return self._to_observation(reward=None, terminated=self._state.is_terminal)
+    # ------------------------------------------------------------------
+    # Helpers
+    # ------------------------------------------------------------------
+    def _to_observation(
+        self,
+        reward: float | None,
+        terminated: bool,
+        truncated: bool = False,
+        info: dict | None = None,
+    ) -> RevOpsObservation:
+        s = self._state
+        return RevOpsObservation(
+            mrr=round(s.mrr, 2),
+            cac=round(s.cac, 2),
+            ltv=round(s.ltv, 2),
+            churn_rate=round(s.churn_rate, 4),
+            cash_runway=round(s.cash_runway, 2),
+            marketing_spend=round(s.marketing_spend, 2),
+            support_quality=round(s.support_quality, 4),
+            active_crisis=s.active_crisis,
+            step_number=s.step_number,
+            ltv_cac_ratio=round(s.ltv_cac_ratio, 3),
+            reward_last_step=reward,
+            terminated=terminated,
+            truncated=truncated,
+            info=info or {},
+        )
+    def render(self) -> str:
+        """Text render of current state (for debugging)."""
+        s = self._state
+        return (
+            f"\n{'='*50}\n"
+            f"RevOps Dashboard | Step {s.step_number}/30\n"
+            f"{'='*50}\n"
+            f"MRR:          ${s.mrr:>12,.0f}  (floor: $20,000)\n"
+            f"CAC:          ${s.cac:>12,.0f}\n"
+            f"LTV:          ${s.ltv:>12,.0f}\n"
+            f"LTV/CAC:      {s.ltv_cac_ratio:>12.2f}x  (target: 3x+)\n"
+            f"Churn:        {s.churn_rate*100:>11.1f}%\n"
+            f"Cash runway:  {s.cash_runway:>11.1f}  months\n"
+            f"Mktg spend:   ${s.marketing_spend:>12,.0f}/mo\n"
+            f"Support QA:   {s.support_quality*100:>11.0f}%\n"
+            f"Crisis:       {s.active_crisis:>12}\n"
+            f"{'='*50}\n"
+        )
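
In `step()`, `terminated` is computed as `is_terminal and step_number < 30`, so a company that fails exactly on step 30 is reported as truncated and the rubric is called with `terminated=False`. Whether that is intended is not stated. The sketch below shows an alternative split in which failure takes precedence over the time limit; it is an illustration, not the commit's behavior. `classify_end` is a hypothetical function, and the three thresholds are the ones used in `RevOpsState.is_terminal`.

```python
# Thresholds as used in RevOpsState.is_terminal.
MRR_FLOOR = 20_000.0
CHURN_CEILING = 0.20
MAX_STEPS = 30


def classify_end(mrr, cash_runway, churn_rate, step_number):
    """Return (terminated, truncated) with failure taking precedence over the time limit."""
    failed = mrr < MRR_FLOOR or cash_runway <= 0 or churn_rate > CHURN_CEILING
    terminated = failed
    truncated = (not failed) and step_number >= MAX_STEPS
    return terminated, truncated


# Company dies on the very last step: counted as a failure, not a timeout.
print(classify_end(mrr=15_000, cash_runway=10, churn_rate=0.05, step_number=30))
# Healthy company reaches the step limit: a plain truncation.
print(classify_end(mrr=50_000, cash_runway=10, churn_rate=0.05, step_number=30))
```

Keeping the two flags mutually exclusive follows the usual terminated/truncated convention in RL environments and makes it unambiguous when the termination penalty applies.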

revops_gym/models.py ADDED

	@@ -0,0 +1,110 @@

+"""Data models for RevOps Gym."""
+from __future__ import annotations
+from typing import Literal, Optional
+from pydantic import BaseModel, Field
+# ---------------------------------------------------------------------------
+# Action
+# ---------------------------------------------------------------------------
+ActionType = Literal[
+    "increase_marketing",
+    "decrease_marketing",
+    "hire_support",
+    "fire_support",
+    "discount_campaign",
+    "raise_prices",
+    "feature_investment",
+    "cut_costs",
+    "negotiate_contracts",
+    "pivot_segment",
+]
+class RevOpsAction(BaseModel):
+    action_type: ActionType = Field(
+        description="The strategic lever the agent pulls."
+    )
+    magnitude: float = Field(
+        default=0.5,
+        ge=0.1,
+        le=1.0,
+        description="Strength of the action, 0.1 (subtle) to 1.0 (aggressive).",
+    )
+# ---------------------------------------------------------------------------
+# World state (internal)
+# ---------------------------------------------------------------------------
+class RevOpsState(BaseModel):
+    mrr: float = Field(default=50_000.0, description="Monthly Recurring Revenue USD")
+    cac: float = Field(default=2_000.0, description="Customer Acquisition Cost USD")
+    ltv: float = Field(default=12_000.0, description="Customer Lifetime Value USD")
+    churn_rate: float = Field(default=0.04, description="Monthly churn rate 0-1")
+    cash_runway: float = Field(default=18.0, description="Months of cash runway")
+    marketing_spend: float = Field(default=15_000.0, description="Monthly marketing budget USD")
+    support_quality: float = Field(default=0.75, description="Support quality score 0-1")
+    active_crisis: str = Field(default="NONE", description="Current adversarial crisis or NONE")
+    step_number: int = Field(default=0)
+    @property
+    def ltv_cac_ratio(self) -> float:
+        return self.ltv / max(self.cac, 1.0)
+    @property
+    def mrr_floor(self) -> float:
+        """VC survival floor — must stay above this."""
+        return 20_000.0
+    @property
+    def is_terminal(self) -> bool:
+        return (
+            self.mrr < self.mrr_floor
+            or self.cash_runway <= 0
+            or self.churn_rate > 0.20
+            or self.step_number >= 30
+        )
+# ---------------------------------------------------------------------------
+# Observation (what the agent sees)
+# ---------------------------------------------------------------------------
+class RevOpsObservation(BaseModel):
+    mrr: float
+    cac: float
+    ltv: float
+    churn_rate: float
+    cash_runway: float
+    marketing_spend: float
+    support_quality: float
+    active_crisis: str
+    step_number: int
+    ltv_cac_ratio: float
+    reward_last_step: Optional[float] = None
+    terminated: bool = False
+    truncated: bool = False
+    info: dict = Field(default_factory=dict)
+    def to_prompt_text(self) -> str:
+        """Convert observation to a concise text prompt for the LLM."""
+        crisis_text = (
+            f"\n⚠️  ACTIVE CRISIS: {self.active_crisis}"
+            if self.active_crisis != "NONE"
+            else ""
+        )
+        return (
+            f"=== RevOps Dashboard | Step {self.step_number}/30 ==={crisis_text}\n"
+            f"MRR: ${self.mrr:,.0f}  |  Floor: $20,000\n"
+            f"CAC: ${self.cac:,.0f}  |  LTV: ${self.ltv:,.0f}  |  LTV/CAC: {self.ltv_cac_ratio:.2f}x\n"
+            f"Churn: {self.churn_rate*100:.1f}%  |  Runway: {self.cash_runway:.1f}mo\n"
+            f"Marketing spend: ${self.marketing_spend:,.0f}/mo  |  Support quality: {self.support_quality*100:.0f}%\n"
+            f"Last reward: {self.reward_last_step or 0:.3f}\n"
+            "\nAvailable actions: increase_marketing, decrease_marketing, hire_support, "
+            "fire_support, discount_campaign, raise_prices, feature_investment, "
+            "cut_costs, negotiate_contracts, pivot_segment\n"
+            "Respond ONLY with JSON: {\"action_type\": \"...\", \"magnitude\": 0.1-1.0}"
+        )
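
The end-of-episode rule in `RevOpsState.is_terminal` combines four independent checks. The sketch below restates them as a standalone function that reports which checks fired, which can help when debugging why an episode ended. `Snapshot` and `end_reasons` are hypothetical names introduced for illustration and are not part of the package; the thresholds are copied from the model above.

```python
from dataclasses import dataclass

MRR_FLOOR = 20_000.0   # VC survival floor, as returned by RevOpsState.mrr_floor
MAX_CHURN = 0.20       # churn ceiling checked by is_terminal
MAX_STEPS = 30         # episode length checked by is_terminal


@dataclass
class Snapshot:
    """Hypothetical stand-in for RevOpsState with only the fields is_terminal reads."""
    mrr: float = 50_000.0
    cash_runway: float = 18.0
    churn_rate: float = 0.04
    step_number: int = 0


def end_reasons(s: Snapshot) -> list[str]:
    """Return the name of every end-of-episode condition the snapshot trips."""
    reasons = []
    if s.mrr < MRR_FLOOR:
        reasons.append("mrr_below_floor")
    if s.cash_runway <= 0:
        reasons.append("out_of_cash")
    if s.churn_rate > MAX_CHURN:
        reasons.append("churn_above_20pct")
    if s.step_number >= MAX_STEPS:
        reasons.append("step_limit")
    return reasons
```

An empty list corresponds to `is_terminal == False`. Note that the churn check is strict, so a churn rate of exactly 20% does not end the episode.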

revops_gym/reward.py ADDED Viewed

	@@ -0,0 +1,134 @@

+"""
+Reward rubric for RevOps Gym.
+Four independent reward signals following OpenEnv's Rubric pattern.
+Multiple signals prevent reward hacking — an agent can't fake the ratio
+without also controlling churn and burn rate simultaneously.
+"""
+from __future__ import annotations
+from dataclasses import dataclass
+from typing import TYPE_CHECKING
+if TYPE_CHECKING:
+    from revops_gym.models import RevOpsState
+@dataclass
+class RewardBreakdown:
+    ltv_cac: float       # -0.5 to 1  golden ratio signal (negative when LTV < CAC)
+    mrr_growth: float    # 0-1  revenue trajectory
+    burn_efficiency: float  # 0-1  not burning cash recklessly
+    survival_bonus: float   # 0 to 0.75  staying alive, scaled by runway
+    total: float
+    terminated_penalty: float = 0.0
+    def to_dict(self) -> dict:
+        return {
+            "ltv_cac": round(self.ltv_cac, 4),
+            "mrr_growth": round(self.mrr_growth, 4),
+            "burn_efficiency": round(self.burn_efficiency, 4),
+            "survival_bonus": round(self.survival_bonus, 4),
+            "terminated_penalty": round(self.terminated_penalty, 4),
+            "total": round(self.total, 4),
+        }
+class RewardRubric:
+    """
+    Computes composite reward from current and previous state.
+    Weights:
+      ltv_cac_ratio   35%  — the "golden ratio" of SaaS health
+      mrr_growth      30%  — revenue trajectory
+      burn_efficiency 20%  — sustainable spending
+      survival_bonus  15%  — staying above the VC floor
+    Termination penalty: -2.0 added to the total when the episode terminates with MRR below the floor.
+    """
+    WEIGHTS = {
+        "ltv_cac": 0.35,
+        "mrr_growth": 0.30,
+        "burn_efficiency": 0.20,
+        "survival_bonus": 0.15,
+    }
+    # Target benchmarks for SaaS health
+    TARGET_LTV_CAC = 3.0      # 3x is "good", 5x+ is excellent
+    TARGET_CHURN = 0.02       # 2% monthly is good SaaS
+    MAX_BURN_RATIO = 0.50     # marketing spend / MRR ceiling
+    def compute(
+        self,
+        state: "RevOpsState",
+        prev_state: "RevOpsState | None" = None,
+        terminated: bool = False,
+    ) -> RewardBreakdown:
+        # --- Signal 1: LTV/CAC golden ratio ---
+        ratio = state.ltv_cac_ratio
+        if ratio >= self.TARGET_LTV_CAC:
+            ltv_cac_score = min(1.0, (ratio - self.TARGET_LTV_CAC) / 2.0 * 0.5 + 0.75)
+        elif ratio >= 1.0:
+            ltv_cac_score = (ratio - 1.0) / (self.TARGET_LTV_CAC - 1.0) * 0.75
+        else:
+            # ratio < 1.0 means losing money per customer → negative signal
+            ltv_cac_score = max(-0.5, (ratio - 1.0) * 0.5)
+        # --- Signal 2: MRR growth ---
+        if prev_state is not None:
+            mrr_change = (state.mrr - prev_state.mrr) / max(prev_state.mrr, 1)
+            # Normalize: +14% growth = 1.0, flat = 0.3, -6% or worse = 0
+            mrr_growth_score = max(0.0, min(1.0, mrr_change * 5.0 + 0.3))
+        else:
+            # First step — reward for being above the floor
+            floor_margin = (state.mrr - state.mrr_floor) / state.mrr_floor
+            mrr_growth_score = min(1.0, max(0.0, floor_margin * 0.5 + 0.5))
+        # --- Signal 3: Burn efficiency ---
+        burn_ratio = state.marketing_spend / max(state.mrr, 1)
+        if burn_ratio <= self.MAX_BURN_RATIO:
+            burn_score = 1.0 - (burn_ratio / self.MAX_BURN_RATIO) * 0.3
+        else:
+            burn_score = max(0.0, 1.0 - burn_ratio)
+        # Penalize bad support quality (hidden churn driver)
+        support_penalty = max(0.0, 0.75 - state.support_quality) * 0.4
+        burn_score = max(0.0, burn_score - support_penalty)
+        # --- Signal 4: Survival bonus ---
+        if state.mrr >= state.mrr_floor and state.cash_runway > 3:
+            runway_bonus = min(0.5, state.cash_runway / 24) * 0.5
+            survival_score = 0.5 + runway_bonus
+        elif state.mrr >= state.mrr_floor:
+            survival_score = 0.2  # alive but barely
+        else:
+            survival_score = 0.0
+        # Churn penalty on survival
+        if state.churn_rate > 0.10:
+            survival_score *= 0.5
+        # --- Weighted total ---
+        total = (
+            ltv_cac_score * self.WEIGHTS["ltv_cac"]
+            + mrr_growth_score * self.WEIGHTS["mrr_growth"]
+            + burn_score * self.WEIGHTS["burn_efficiency"]
+            + survival_score * self.WEIGHTS["survival_bonus"]
+        )
+        # --- Termination penalty ---
+        term_penalty = 0.0
+        if terminated and state.mrr < state.mrr_floor:
+            term_penalty = -2.0
+            total += term_penalty
+        return RewardBreakdown(
+            ltv_cac=ltv_cac_score,
+            mrr_growth=mrr_growth_score,
+            burn_efficiency=burn_score,
+            survival_bonus=survival_score,
+            total=total,
+            terminated_penalty=term_penalty,
+        )
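
As a worked check of the rubric, the sketch below recomputes the four signals by hand for the default `RevOpsState` values on the first step (no previous state). It is a standalone restatement of the formulas in `RewardRubric.compute`, not a call into the package, and the variable names are chosen here for illustration.

```python
# Weights copied from RewardRubric.WEIGHTS
WEIGHTS = {"ltv_cac": 0.35, "mrr_growth": 0.30, "burn_efficiency": 0.20, "survival_bonus": 0.15}

# Default RevOpsState values
mrr, cac, ltv = 50_000.0, 2_000.0, 12_000.0
cash_runway, churn_rate = 18.0, 0.04
marketing_spend, support_quality = 15_000.0, 0.75
mrr_floor = 20_000.0

# Signal 1: ratio is 6.0, above the 3x target, so the score hits the 1.0 cap
ratio = ltv / max(cac, 1.0)
ltv_cac_score = min(1.0, (ratio - 3.0) / 2.0 * 0.5 + 0.75)

# Signal 2 (first step): margin above the floor is 1.5, clamped to 1.0
floor_margin = (mrr - mrr_floor) / mrr_floor
mrr_growth_score = min(1.0, max(0.0, floor_margin * 0.5 + 0.5))

# Signal 3: burn ratio 0.30 is under the 0.50 ceiling; no support penalty at 0.75
burn_ratio = marketing_spend / max(mrr, 1)
burn_score = 1.0 - (burn_ratio / 0.50) * 0.3
burn_score = max(0.0, burn_score - max(0.0, 0.75 - support_quality) * 0.4)

# Signal 4: MRR above floor and runway > 3 months; churn is below the 10% halving threshold
survival_score = 0.5 + min(0.5, cash_runway / 24) * 0.5

total = (
    ltv_cac_score * WEIGHTS["ltv_cac"]
    + mrr_growth_score * WEIGHTS["mrr_growth"]
    + burn_score * WEIGHTS["burn_efficiency"]
    + survival_score * WEIGHTS["survival_bonus"]
)
print(round(total, 4))  # 0.9265
```

The breakdown is 1.0, 1.0, 0.82 and 0.75 for the four signals, so a healthy starting company already scores close to the maximum, and most of the learning signal comes from avoiding drops rather than from chasing further gains.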

revops_gym/server.py ADDED Viewed

	@@ -0,0 +1,156 @@

+"""
+FastAPI server for RevOps Gym.
+Exposes reset / step / state endpoints per the OpenEnv spec.
+"""
+from __future__ import annotations
+import os
+from fastapi import FastAPI, HTTPException
+from fastapi.middleware.cors import CORSMiddleware
+from fastapi.responses import HTMLResponse
+from pydantic import BaseModel
+from revops_gym.env import RevOpsEnv
+from revops_gym.models import RevOpsAction, RevOpsObservation
+app = FastAPI(
+    title="RevOps Gym",
+    description="A dynamic SaaS flight simulator for LLM RL training.",
+    version="0.1.0",
+)
+app.add_middleware(
+    CORSMiddleware,
+    allow_origins=["*"],
+    allow_methods=["*"],
+    allow_headers=["*"],
+)
+# Single shared env instance: state is global, so concurrent clients share one
+# episode. Create a separate RevOpsEnv per session for multi-client use.
+_env = RevOpsEnv(
+    crisis_every=int(os.environ.get("CRISIS_EVERY", "3")),
+    difficulty=os.environ.get("DIFFICULTY", "normal"),
+)
+class ResetRequest(BaseModel):
+    seed: int | None = None
+    difficulty: str = "normal"
+class StepRequest(BaseModel):
+    action_type: str
+    magnitude: float = 0.5
+# ------------------------------------------------------------------
+# OpenEnv required endpoints
+# ------------------------------------------------------------------
+@app.post("/reset", response_model=RevOpsObservation)
+def reset(req: ResetRequest = ResetRequest()):
+    global _env
+    _env = RevOpsEnv(
+        crisis_every=int(os.environ.get("CRISIS_EVERY", "3")),
+        difficulty=req.difficulty,
+    )
+    return _env.reset(seed=req.seed)
+@app.post("/step", response_model=RevOpsObservation)
+def step(req: StepRequest):
+    try:
+        action = RevOpsAction(action_type=req.action_type, magnitude=req.magnitude)
+    except Exception as e:
+        raise HTTPException(status_code=422, detail=str(e))
+    return _env.step(action)
+@app.get("/state", response_model=RevOpsObservation)
+def state():
+    return _env.state()
+# ------------------------------------------------------------------
+# Optional: human-readable demo UI
+# ------------------------------------------------------------------
+@app.get("/", response_class=HTMLResponse)
+def index():
+    s = _env.state()
+    crisis_html = (
+        f'<div class="crisis">⚠️ CRISIS: {s.active_crisis}</div>'
+        if s.active_crisis != "NONE"
+        else ""
+    )
+    return f"""<!DOCTYPE html>
+<html><head><title>RevOps Gym</title>
+<style>
+  body {{ font-family: monospace; background: #0d1117; color: #c9d1d9; padding: 2rem; }}
+  h1 {{ color: #58a6ff; }}
+  .metric {{ display: inline-block; margin: 0.5rem 1rem; padding: 0.5rem 1rem;
+             background: #161b22; border: 1px solid #30363d; border-radius: 6px; }}
+  .metric .label {{ font-size: 0.75rem; color: #8b949e; }}
+  .metric .value {{ font-size: 1.4rem; color: #3fb950; font-weight: bold; }}
+  .crisis {{ background: #3d1a1a; border: 1px solid #f85149; border-radius: 6px;
+             padding: 0.75rem 1rem; margin: 1rem 0; color: #f85149; }}
+  form {{ margin: 1.5rem 0; }}
+  select, input {{ background: #161b22; color: #c9d1d9; border: 1px solid #30363d;
+                   padding: 0.4rem 0.6rem; border-radius: 4px; margin: 0.3rem; }}
+  button {{ background: #238636; color: #fff; border: none; padding: 0.5rem 1.2rem;
+            border-radius: 4px; cursor: pointer; }}
+  button:hover {{ background: #2ea043; }}
+</style></head><body>
+<h1>🚀 RevOps Gym</h1>
+<p>SaaS Flight Simulator — Step {s.step_number}/30 | LTV/CAC: {s.ltv_cac_ratio:.2f}x</p>
+{crisis_html}
+<div>
+  <div class="metric"><div class="label">MRR</div><div class="value">${s.mrr:,.0f}</div></div>
+  <div class="metric"><div class="label">CAC</div><div class="value">${s.cac:,.0f}</div></div>
+  <div class="metric"><div class="label">LTV</div><div class="value">${s.ltv:,.0f}</div></div>
+  <div class="metric"><div class="label">Churn</div><div class="value">{s.churn_rate*100:.1f}%</div></div>
+  <div class="metric"><div class="label">Runway</div><div class="value">{s.cash_runway:.1f}mo</div></div>
+  <div class="metric"><div class="label">Mktg $</div><div class="value">${s.marketing_spend:,.0f}</div></div>
+  <div class="metric"><div class="label">Support</div><div class="value">{s.support_quality*100:.0f}%</div></div>
+</div>
+<form action="/step" method="post" onsubmit="takeAction(event)">
+  <label>Action:
+    <select id="action_type">
+      <option>increase_marketing</option><option>decrease_marketing</option>
+      <option>hire_support</option><option>fire_support</option>
+      <option>discount_campaign</option><option>raise_prices</option>
+      <option>feature_investment</option><option>cut_costs</option>
+      <option>negotiate_contracts</option><option>pivot_segment</option>
+    </select>
+  </label>
+  <label>Magnitude: <input id="magnitude" type="range" min="0.1" max="1" step="0.1" value="0.5"></label>
+  <button type="submit">Take Action</button>
+  <button type="button" onclick="doReset()">Reset Episode</button>
+</form>
+<div id="result"></div>
+<script>
+async function takeAction(e) {{
+  e.preventDefault();
+  const res = await fetch('/step', {{method:'POST',
+    headers:{{'Content-Type':'application/json'}},
+    body: JSON.stringify({{
+      action_type: document.getElementById('action_type').value,
+      magnitude: parseFloat(document.getElementById('magnitude').value)
+    }})
+  }});
+  const d = await res.json();
+  document.getElementById('result').innerHTML =
+    '<pre>' + JSON.stringify(d, null, 2) + '</pre>';
+  location.reload();
+}}
+async function doReset() {{
+  await fetch('/reset', {{method:'POST', headers:{{'Content-Type':'application/json'}}, body:'{{}}'}});
+  location.reload();
+}}
+</script>
+</body></html>"""
+@app.get("/health")
+def health():
+    return {"status": "ok", "version": "0.1.0"}
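
The endpoints above can be driven from any HTTP client. The following is a minimal standard-library sketch of one, assuming the server runs locally on port 7860 (the port the Dockerfile exposes). `BASE_URL`, `build_request`, `post` and `run_episode` are hypothetical names introduced here; the repository's own `revops_gym/client.py` may look different.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # assumption: local server on the Dockerfile's port


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for one of the environment endpoints."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post(path: str, payload: dict) -> dict:
    """Send the request and decode the JSON observation the server returns."""
    with urllib.request.urlopen(build_request(path, payload), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_episode(action_type: str = "hire_support", magnitude: float = 0.5) -> dict:
    """Reset the shared env, then repeat one fixed action until the episode ends."""
    obs = post("/reset", {"seed": 42, "difficulty": "normal"})
    while not (obs["terminated"] or obs["truncated"]):
        obs = post("/step", {"action_type": action_type, "magnitude": magnitude})
    return obs
```

Calling `run_episode()` against a running server returns the final observation as a dict. Because the server holds a single global environment, two clients running this at the same time would interleave their steps in the same episode.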

setup.py ADDED Viewed

File without changes

tests/test_env.py ADDED Viewed

	@@ -0,0 +1,108 @@

+"""
+Quick smoke test — run locally before pushing to HF Spaces.
+Tests: reset, step through full episode, crisis triggers, reward signals.
+Usage:
+    python tests/test_env.py
+"""
+import sys
+import os
+sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+from revops_gym.env import RevOpsEnv
+from revops_gym.models import RevOpsAction
+def test_episode(difficulty="normal", seed=42):
+    print(f"\n=== Smoke test | difficulty={difficulty} seed={seed} ===")
+    env = RevOpsEnv(crisis_every=3, seed=seed, difficulty=difficulty)
+    obs = env.reset(seed=seed)
+    assert obs.step_number == 0, "Reset should start at step 0"
+    assert obs.mrr > 0, "MRR should be positive after reset"
+    print(env.render())
+    actions = [
+        ("increase_marketing", 0.6),
+        ("hire_support", 0.8),
+        ("negotiate_contracts", 0.5),
+        ("raise_prices", 0.4),
+        ("feature_investment", 0.7),
+        ("cut_costs", 0.3),
+        ("discount_campaign", 0.5),
+        ("increase_marketing", 0.7),
+        ("hire_support", 0.5),
+        ("pivot_segment", 0.6),
+    ]
+    rewards = []
+    crises_seen = []
+    for i, (action_type, magnitude) in enumerate(actions):
+        obs = env.step({"action_type": action_type, "magnitude": magnitude})
+        rewards.append(obs.reward_last_step)
+        if obs.active_crisis != "NONE":
+            crises_seen.append(obs.active_crisis)
+        print(
+            f"  Step {obs.step_number:2d} | {action_type:<22} mag={magnitude} "
+            f"| reward={obs.reward_last_step:+.3f} | MRR=${obs.mrr:,.0f} "
+            f"| LTV/CAC={obs.ltv_cac_ratio:.2f}x"
+            + (f" | ⚠️ {obs.active_crisis}" if obs.active_crisis != "NONE" else "")
+        )
+        if obs.terminated or obs.truncated:
+            print(f"\n  Episode ended at step {obs.step_number} "
+                  f"({'terminated' if obs.terminated else 'truncated'})")
+            break
+    print(f"\n  Total steps: {obs.step_number}")
+    print(f"  Mean reward: {sum(rewards)/len(rewards):.4f}")
+    print(f"  Min reward:  {min(rewards):.4f}")
+    print(f"  Max reward:  {max(rewards):.4f}")
+    print(f"  Crises seen: {crises_seen or ['none triggered yet']}")
+    assert len(rewards) > 0, "Should have at least one reward"
+    print("\n✅ Smoke test passed!")
+    return True
+def test_all_actions():
+    print("\n=== Testing all action types ===")
+    env = RevOpsEnv(seed=0)
+    env.reset(seed=0)
+    all_actions = [
+        "increase_marketing", "decrease_marketing", "hire_support",
+        "fire_support", "discount_campaign", "raise_prices",
+        "feature_investment", "cut_costs", "negotiate_contracts", "pivot_segment",
+    ]
+    for action in all_actions:
+        obs = env.step({"action_type": action, "magnitude": 0.5})
+        assert obs.reward_last_step is not None
+        print(f"  ✓ {action:<24} reward={obs.reward_last_step:+.3f}")
+    print("✅ All actions tested!")
+def test_termination():
+    print("\n=== Testing termination conditions ===")
+    from revops_gym.models import RevOpsState
+    from revops_gym.reward import RewardRubric
+    rubric = RewardRubric()
+    # MRR below floor
+    state = RevOpsState(mrr=5_000, step_number=5)
+    assert state.is_terminal, "Should terminate when MRR < floor"
+    rb = rubric.compute(state, terminated=True)
+    assert rb.terminated_penalty == -2.0, "Should get termination penalty"
+    print("  ✓ MRR floor termination works")
+    # Max steps
+    state2 = RevOpsState(mrr=100_000, step_number=30)
+    assert state2.is_terminal, "Should truncate at step 30"
+    print("  ✓ Step limit truncation works")
+    print("✅ Termination tests passed!")
+if __name__ == "__main__":
+    test_episode(difficulty="easy")
+    test_episode(difficulty="normal")
+    test_episode(difficulty="hard")
+    test_all_actions()
+    test_termination()
+    print("\n🎉 All tests passed! Ready to push to HF Spaces.")

train_colab.ipynb ADDED Viewed

	@@ -0,0 +1,690 @@

+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b2ea0425",
+   "metadata": {
+    "lines_to_next_cell": 0
+   },
+   "outputs": [],
+   "source": [
+    "\n",
+    "# RevOps Gym — Full Training Script (Colab)\n",
+    "# Convert this to a .ipynb with: jupytext --to notebook train_colab.py\n",
+    "# Or copy cells manually into Colab.\n",
+    "#\n",
+    "# Runtime: GPU T4 (free tier) | ~45-60 min for full run\n",
+    "# Model: Qwen/Qwen2.5-1.5B-Instruct (1.5B, fits on T4)\n",
+    "# Trainer: TRL GRPO + Unsloth\n",
+    "# Environment: RevOps Gym (runs locally inside Colab)\n",
+    "\n",
+    "# ============================================================\n",
+    "# CELL 1 — Install dependencies\n",
+    "# ============================================================"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9f1de3ce",
+   "metadata": {
+    "lines_to_next_cell": 0
+   },
+   "outputs": [],
+   "source": [
+    "!pip install -q \"unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git\"\n",
+    "!pip install -q \"trl>=0.8.0\" peft accelerate bitsandbytes\n",
+    "!pip install -q fastapi uvicorn pydantic requests wandb matplotlib\n",
+    "\n",
+    "# ============================================================\n",
+    "# CELL 2 — Clone and install RevOps Gym environment\n",
+    "# ============================================================"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2d37e6b7",
+   "metadata": {
+    "lines_to_next_cell": 0
+   },
+   "outputs": [],
+   "source": [
+    "!git clone https://huggingface.co/spaces/YOUR_HF_USERNAME/revops-gym\n",
+    "!pip install -q -e revops-gym/\n",
+    "\n",
+    "# For local testing without HF Space, copy env files directly:\n",
+    "# The environment runs INSIDE Colab — no external server needed for training.\n",
+    "\n",
+    "# ============================================================\n",
+    "# CELL 3 — Imports and config\n",
+    "# ============================================================"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "54f78215",
+   "metadata": {
+    "lines_to_next_cell": 0
+   },
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "import json\n",
+    "import random\n",
+    "import re\n",
+    "import time\n",
+    "import warnings\n",
+    "from typing import Optional\n",
+    "import torch\n",
+    "import numpy as np\n",
+    "import matplotlib.pyplot as plt\n",
+    "from collections import defaultdict\n",
+    "\n",
+    "warnings.filterwarnings(\"ignore\")\n",
+    "\n",
+    "# --- Config ---\n",
+    "MODEL_NAME = \"Qwen/Qwen2.5-1.5B-Instruct\"\n",
+    "MAX_NEW_TOKENS = 128\n",
+    "BATCH_SIZE = 4           # rollouts per GRPO step (keep small for T4)\n",
+    "NUM_EPISODES = 200       # total training episodes\n",
+    "GRPO_EPOCHS = 1\n",
+    "LR = 5e-6\n",
+    "MAX_STEPS_PER_EPISODE = 30\n",
+    "SAVE_EVERY = 50          # save checkpoint every N episodes\n",
+    "WANDB_PROJECT = \"revops-gym\"\n",
+    "USE_WANDB = False        # set True if you have wandb account\n",
+    "\n",
+    "print(f\"CUDA available: {torch.cuda.is_available()}\")\n",
+    "if torch.cuda.is_available():\n",
+    "    print(f\"GPU: {torch.cuda.get_device_name(0)}\")\n",
+    "    print(f\"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB\")\n",
+    "\n",
+    "# ============================================================\n",
+    "# CELL 4 — Load model with Unsloth (4-bit quantization)\n",
+    "# ============================================================"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5fdfba32",
+   "metadata": {
+    "lines_to_next_cell": 0
+   },
+   "outputs": [],
+   "source": [
+    "from unsloth import FastLanguageModel\n",
+    "\n",
+    "model, tokenizer = FastLanguageModel.from_pretrained(\n",
+    "    model_name=MODEL_NAME,\n",
+    "    max_seq_length=1024,\n",
+    "    load_in_4bit=True,          # fits on T4 16GB\n",
+    "    dtype=None,                  # auto detect\n",
+    ")\n",
+    "\n",
+    "# Add LoRA adapters for efficient fine-tuning\n",
+    "model = FastLanguageModel.get_peft_model(\n",
+    "    model,\n",
+    "    r=16,\n",
+    "    target_modules=[\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\",\n",
+    "                    \"gate_proj\", \"up_proj\", \"down_proj\"],\n",
+    "    lora_alpha=16,\n",
+    "    lora_dropout=0,\n",
+    "    bias=\"none\",\n",
+    "    use_gradient_checkpointing=\"unsloth\",\n",
+    "    random_state=42,\n",
+    ")\n",
+    "\n",
+    "print(\"Model loaded with LoRA adapters ✓\")\n",
+    "print(f\"Trainable params: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}\")\n",
+    "\n",
+    "# ============================================================\n",
+    "# CELL 5 — Inline RevOps Gym (no server needed for training)\n",
+    "# ============================================================"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "47f5b140",
+   "metadata": {
+    "lines_to_next_cell": 0
+   },
+   "outputs": [],
+   "source": [
+    "# Import directly — no HTTP server overhead during training\n",
+    "import sys\n",
+    "sys.path.insert(0, \"revops-gym\")  # adjust if cloned elsewhere\n",
+    "\n",
+    "from revops_gym.env import RevOpsEnv\n",
+    "from revops_gym.models import RevOpsObservation\n",
+    "\n",
+    "SYSTEM_PROMPT = \"\"\"You are a SaaS RevOps strategist managing a B2B software company.\n",
+    "Your goal is to maximize sustainable revenue growth while keeping the company alive.\n",
+    "The VC will fire you if MRR drops below $20,000.\n",
+    "\n",
+    "Key metrics to balance:\n",
+    "- LTV/CAC ratio (target 3x+): profitability per customer\n",
+    "- MRR growth: revenue trajectory  \n",
+    "- Cash runway: survival buffer\n",
+    "- Churn rate: customer retention health\n",
+    "- Support quality: drives retention\n",
+    "\n",
+    "You MUST respond with ONLY a JSON object. No explanation, no markdown, just JSON:\n",
+    "{\"action_type\": \"<action>\", \"magnitude\": <0.1-1.0>}\n",
+    "\n",
+    "Valid actions: increase_marketing, decrease_marketing, hire_support, fire_support,\n",
+    "discount_campaign, raise_prices, feature_investment, cut_costs, negotiate_contracts, pivot_segment\"\"\"\n",
+    "\n",
+    "print(\"Environment ready ✓\")\n",
+    "\n",
+    "# ============================================================\n",
+    "# CELL 6 — Helper: parse LLM output → action\n",
+    "# ============================================================"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "59f8a75f",
+   "metadata": {
+    "lines_to_next_cell": 0
+   },
+   "outputs": [],
+   "source": [
+    "VALID_ACTIONS = [\n",
+    "    \"increase_marketing\", \"decrease_marketing\", \"hire_support\", \"fire_support\",\n",
+    "    \"discount_campaign\", \"raise_prices\", \"feature_investment\", \"cut_costs\",\n",
+    "    \"negotiate_contracts\", \"pivot_segment\",\n",
+    "]\n",
+    "\n",
+    "def parse_action(text: str) -> dict:\n",
+    "    \"\"\"Extract JSON action from model output. Returns random valid action on failure.\"\"\"\n",
+    "    try:\n",
+    "        # Try to find JSON block\n",
+    "        match = re.search(r'\\{[^}]+\\}', text, re.DOTALL)\n",
+    "        if match:\n",
+    "            data = json.loads(match.group())\n",
+    "            action_type = data.get(\"action_type\", \"\")\n",
+    "            magnitude = float(data.get(\"magnitude\", 0.5))\n",
+    "            if action_type in VALID_ACTIONS:\n",
+    "                magnitude = max(0.1, min(1.0, magnitude))\n",
+    "                return {\"action_type\": action_type, \"magnitude\": magnitude}\n",
+    "    except Exception:\n",
+    "        pass\n",
+    "    # Fallback: random valid action\n",
+    "    return {\"action_type\": random.choice(VALID_ACTIONS), \"magnitude\": 0.5}\n",
+    "\n",
+    "\n",
+    "def build_prompt(obs: RevOpsObservation) -> str:\n",
+    "    return obs.to_prompt_text()\n",
+    "\n",
+    "\n",
+    "def generate_action(obs: RevOpsObservation, do_sample: bool = True) -> tuple[str, dict]:\n",
+    "    \"\"\"Run one forward pass, return (raw_text, parsed_action).\"\"\"\n",
+    "    prompt = build_prompt(obs)\n",
+    "    messages = [\n",
+    "        {\"role\": \"system\", \"content\": SYSTEM_PROMPT},\n",
+    "        {\"role\": \"user\", \"content\": prompt},\n",
+    "    ]\n",
+    "    input_ids = tokenizer.apply_chat_template(\n",
+    "        messages, tokenize=True, add_generation_prompt=True,\n",
+    "        return_tensors=\"pt\"\n",
+    "    ).to(model.device)\n",
+    "\n",
+    "    with torch.no_grad():\n",
+    "        output = model.generate(\n",
+    "            input_ids,\n",
+    "            max_new_tokens=MAX_NEW_TOKENS,\n",
+    "            do_sample=do_sample,\n",
+    "            temperature=0.8 if do_sample else 0.1,\n",
+    "            top_p=0.9,\n",
+    "            pad_token_id=tokenizer.eos_token_id,\n",
+    "        )\n",
+    "    new_tokens = output[0][input_ids.shape[1]:]\n",
+    "    raw = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()\n",
+    "    action = parse_action(raw)\n",
+    "    return raw, action\n",
+    "\n",
+    "print(\"Inference helpers ready ✓\")\n",
+    "\n",
+    "# ============================================================\n",
+    "# CELL 7 — Rollout function (one episode)\n",
+    "# ============================================================"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6e1497dc",
+   "metadata": {
+    "lines_to_next_cell": 0
+   },
+   "outputs": [],
+   "source": [
+    "def rollout(env: RevOpsEnv, difficulty: str = \"normal\") -> dict:\n",
+    "    \"\"\"\n",
+    "    Run one full episode. Returns trajectory with prompts, outputs, rewards.\n",
+    "    Used by GRPO to score and train.\n",
+    "    \"\"\"\n",
+    "    obs = env.reset(seed=random.randint(0, 10_000))\n",
+    "    trajectory = {\n",
+    "        \"prompts\": [],\n",
+    "        \"responses\": [],\n",
+    "        \"rewards\": [],\n",
+    "        \"infos\": [],\n",
+    "        \"final_mrr\": 0.0,\n",
+    "        \"survived\": False,\n",
+    "        \"steps\": 0,\n",
+    "    }\n",
+    "\n",
+    "    for step in range(MAX_STEPS_PER_EPISODE):\n",
+    "        prompt = build_prompt(obs)  # the observation the model actually responds to\n",
+    "        raw, action = generate_action(obs)\n",
+    "        obs = env.step(action)\n",
+    "\n",
+    "        trajectory[\"prompts\"].append(prompt)\n",
+    "        trajectory[\"responses\"].append(raw)\n",
+    "        trajectory[\"rewards\"].append(obs.reward_last_step or 0.0)\n",
+    "        trajectory[\"infos\"].append(obs.info)\n",
+    "\n",
+    "        if obs.terminated or obs.truncated:\n",
+    "            break\n",
+    "\n",
+    "    trajectory[\"final_mrr\"] = obs.mrr\n",
+    "    trajectory[\"survived\"] = not obs.terminated or obs.step_number >= MAX_STEPS_PER_EPISODE\n",
+    "    trajectory[\"steps\"] = obs.step_number\n",
+    "    return trajectory\n",
+    "\n",
+    "\n",
+    "def rollout_batch(n: int = BATCH_SIZE, difficulty: str = \"normal\") -> list[dict]:\n",
+    "    \"\"\"Run N rollouts and return batch.\"\"\"\n",
+    "    env = RevOpsEnv(crisis_every=3, difficulty=difficulty)\n",
+    "    return [rollout(env, difficulty) for _ in range(n)]\n",
+    "\n",
+    "\n",
+    "print(\"Rollout function ready ✓\")\n",
+    "\n",
+    "# ============================================================\n",
+    "# CELL 8 — Baseline evaluation (untrained model)\n",
+    "# ============================================================"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fa7c83e9",
+   "metadata": {
+    "lines_to_next_cell": 0
+   },
+   "outputs": [],
+   "source": [
+    "print(\"=\" * 60)\n",
+    "print(\"BASELINE EVALUATION (untrained model)\")\n",
+    "print(\"=\" * 60)\n",
+    "\n",
+    "FastLanguageModel.for_inference(model)  # enable fast inference mode\n",
+    "\n",
+    "baseline_rewards = []\n",
+    "baseline_mrrs = []\n",
+    "baseline_survivals = []\n",
+    "N_BASELINE = 10\n",
+    "\n",
+    "for i in range(N_BASELINE):\n",
+    "    t = rollout(RevOpsEnv(crisis_every=3, seed=i))\n",
+    "    mean_r = sum(t[\"rewards\"]) / max(len(t[\"rewards\"]), 1)\n",
+    "    baseline_rewards.append(mean_r)\n",
+    "    baseline_mrrs.append(t[\"final_mrr\"])\n",
+    "    baseline_survivals.append(1 if t[\"survived\"] else 0)\n",
+    "    print(f\"  Episode {i+1:2d} | mean_reward={mean_r:.4f} | \"\n",
+    "          f\"final_MRR=${t['final_mrr']:,.0f} | survived={t['survived']}\")\n",
+    "\n",
+    "print(f\"\\nBaseline mean reward:    {np.mean(baseline_rewards):.4f} ± {np.std(baseline_rewards):.4f}\")\n",
+    "print(f\"Baseline mean final MRR: ${np.mean(baseline_mrrs):,.0f}\")\n",
+    "print(f\"Baseline survival rate:  {np.mean(baseline_survivals)*100:.0f}%\")\n",
+    "\n",
+    "# ============================================================\n",
+    "# CELL 9 — GRPO Training loop\n",
+    "# ============================================================"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6b3dab79",
+   "metadata": {
+    "lines_to_next_cell": 0
+   },
+   "outputs": [],
+   "source": [
+    "from trl import GRPOConfig, GRPOTrainer\n",
+    "\n",
+    "# Switch to training mode\n",
+    "FastLanguageModel.for_training(model)\n",
+    "\n",
+    "# --- GRPO reward function ---\n",
+    "def grpo_reward_fn(prompts, completions, **kwargs) -> list[float]:\n",
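+    "    \"\"\"\n",
+    "    Reward function called by GRPOTrainer.\n",
+    "    Parses each completion as a RevOps action and scores:\n",
+    "    1. Format compliance (+0.1 for a parseable JSON object)\n",
+    "    2. Action validity (+0.1 for a known action type)\n",
+    "    3. Magnitude reasonableness (+0.05 if 0.1 <= magnitude <= 1.0)\n",
+    "    4. Contextual quality, read from the prompt text: -0.15 for\n",
+    "       fire_support while support quality is below 60%, and +0.20 for\n",
+    "       an action that fits the active crisis\n",
+    "    A completion that raises during parsing is penalized by 0.05.\n",
+    "    \"\"\"\n",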
+    "    rewards = []\n",
+    "    for prompt, completion in zip(prompts, completions):\n",
+    "        reward = 0.0\n",
+    "\n",
+    "        # Format reward\n",
+    "        try:\n",
+    "            match = re.search(r'\\{[^}]+\\}', completion, re.DOTALL)\n",
+    "            if match:\n",
+    "                data = json.loads(match.group())\n",
+    "                reward += 0.1  # valid JSON\n",
+    "\n",
+    "                if data.get(\"action_type\") in VALID_ACTIONS:\n",
+    "                    reward += 0.1  # valid action\n",
+    "\n",
+    "                mag = float(data.get(\"magnitude\", -1))\n",
+    "                if 0.1 <= mag <= 1.0:\n",
+    "                    reward += 0.05  # sensible magnitude\n",
+    "\n",
+    "                # Contextual penalty: firing support staff while support quality is low.\n",
+    "                # The regex assumes the prompt renders the metric as \"Support quality: NN%\".\n",
+    "                sq_match = re.search(r'Support quality: (\\d+)%', prompt)\n",
+    "                if sq_match:\n",
+    "                    sq = int(sq_match.group(1))\n",
+    "                    if data.get(\"action_type\") == \"fire_support\" and sq < 60:\n",
+    "                        reward -= 0.15  # punish bad decision\n",
+    "\n",
+    "                # Bonus for crisis-responsive actions\n",
+    "                if \"ACTIVE CRISIS\" in prompt:\n",
+    "                    crisis_actions = {\n",
+    "                        \"CHURN_SPIKE\": [\"hire_support\", \"discount_campaign\", \"negotiate_contracts\"],\n",
+    "                        \"CAC_EXPLOSION\": [\"decrease_marketing\", \"feature_investment\", \"pivot_segment\"],\n",
+    "                        \"CASH_CRUNCH\": [\"cut_costs\", \"decrease_marketing\", \"raise_prices\"],\n",
+    "                        \"SUPPORT_CRISIS\": [\"hire_support\", \"feature_investment\"],\n",
+    "                        \"ENTERPRISE_CHURN\": [\"negotiate_contracts\", \"raise_prices\", \"feature_investment\"],\n",
+    "                    }\n",
+    "                    for crisis, good_actions in crisis_actions.items():\n",
+    "                        if crisis in prompt and data.get(\"action_type\") in good_actions:\n",
+    "                            reward += 0.20\n",
+    "                            break\n",
+    "\n",
+    "        except Exception:\n",
+    "            reward -= 0.05  # malformed output penalty\n",
+    "\n",
+    "        rewards.append(reward)\n",
+    "    return rewards\n",
+    "\n",
+    "\n",
+    "# --- Training config ---\n",
+    "training_args = GRPOConfig(\n",
+    "    output_dir=\"./revops-gym-checkpoints\",\n",
+    "    per_device_train_batch_size=BATCH_SIZE,\n",
+    "    gradient_accumulation_steps=4,\n",
+    "    num_train_epochs=1,\n",
+    "    learning_rate=LR,\n",
+    "    warmup_steps=10,\n",
+    "    logging_steps=5,\n",
+    "    save_steps=SAVE_EVERY,\n",
+    "    fp16=not torch.cuda.is_bf16_supported(),\n",
+    "    bf16=torch.cuda.is_bf16_supported(),\n",
+    "    report_to=\"wandb\" if USE_WANDB else \"none\",\n",
+    "    run_name=\"revops-gym-grpo\",\n",
+    "    num_generations=BATCH_SIZE,  # samples per prompt\n",
+    "    max_completion_length=MAX_NEW_TOKENS,  # GRPOConfig's name for the completion length cap\n",
+    "    temperature=0.8,\n",
+    "    optim=\"adamw_8bit\",\n",
+    "    seed=42,\n",
+    ")\n",
+    "\n",
+    "# --- Build training dataset from rollouts ---\n",
+    "print(\"Collecting initial rollout batch for dataset...\")\n",
+    "env_train = RevOpsEnv(crisis_every=3, difficulty=\"easy\")  # start easy\n",
+    "\n",
+    "# GRPO needs a dataset of prompts; it generates completions internally\n",
+    "from datasets import Dataset\n",
+    "\n",
+    "def build_prompt_dataset(n_samples: int = 200) -> Dataset:\n",
+    "    \"\"\"Generate diverse prompts by rolling out episodes and capturing observations.\"\"\"\n",
+    "    prompts = []\n",
+    "    env = RevOpsEnv(crisis_every=3)\n",
+    "    for i in range(n_samples):\n",
+    "        obs = env.reset(seed=i)\n",
+    "        for _ in range(random.randint(0, 10)):\n",
+    "            action = random.choice(VALID_ACTIONS)\n",
+    "            obs = env.step({\"action_type\": action, \"magnitude\": random.uniform(0.1, 1.0)})\n",
+    "            if obs.terminated or obs.truncated:\n",
+    "                obs = env.reset(seed=i * 100)\n",
+    "                break\n",
+    "        messages = [\n",
+    "            {\"role\": \"system\", \"content\": SYSTEM_PROMPT},\n",
+    "            {\"role\": \"user\", \"content\": obs.to_prompt_text()},\n",
+    "        ]\n",
+    "        prompt_text = tokenizer.apply_chat_template(\n",
+    "            messages, tokenize=False, add_generation_prompt=True\n",
+    "        )\n",
+    "        prompts.append({\"prompt\": prompt_text})\n",
+    "    return Dataset.from_list(prompts)\n",
+    "\n",
+    "print(\"Building prompt dataset (200 samples)...\")\n",
+    "train_dataset = build_prompt_dataset(200)\n",
+    "print(f\"Dataset size: {len(train_dataset)} prompts ✓\")\n",
+    "\n",
+    "# ============================================================\n",
+    "# CELL 10 — Run training\n",
+    "# ============================================================"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f2885f1e",
+   "metadata": {
+    "lines_to_next_cell": 0
+   },
+   "outputs": [],
+   "source": [
+    "trainer = GRPOTrainer(\n",
+    "    model=model,\n",
+    "    processing_class=tokenizer,\n",
+    "    reward_funcs=grpo_reward_fn,\n",
+    "    args=training_args,\n",
+    "    train_dataset=train_dataset,\n",
+    ")\n",
+    "\n",
+    "print(\"Starting GRPO training...\")\n",
+    "print(f\"  Model:   {MODEL_NAME}\")\n",
+    "print(f\"  Dataset: {len(train_dataset)} prompts\")\n",
+    "print(f\"  Batch:   {BATCH_SIZE} generations per step\")\n",
+    "print(f\"  LR:      {LR}\")\n",
+    "print()\n",
+    "\n",
+    "train_result = trainer.train()\n",
+    "print(\"\\nTraining complete ✓\")\n",
+    "print(train_result)\n",
+    "\n",
+    "# ============================================================\n",
+    "# CELL 11 — Post-training evaluation\n",
+    "# ============================================================"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "75ba3e52",
+   "metadata": {
+    "lines_to_next_cell": 0
+   },
+   "outputs": [],
+   "source": [
+    "FastLanguageModel.for_inference(model)\n",
+    "\n",
+    "print(\"=\" * 60)\n",
+    "print(\"POST-TRAINING EVALUATION\")\n",
+    "print(\"=\" * 60)\n",
+    "\n",
+    "trained_rewards = []\n",
+    "trained_mrrs = []\n",
+    "trained_survivals = []\n",
+    "N_EVAL = 10\n",
+    "\n",
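+    "# Seeds 1000-1009 differ from the baseline seeds 0-9, so the deltas below\n",
+    "# mix seed-to-seed variance with any training effect.\n",
+    "for i in range(N_EVAL):\n",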
+    "    t = rollout(RevOpsEnv(crisis_every=3, seed=i + 1000))\n",
+    "    mean_r = sum(t[\"rewards\"]) / max(len(t[\"rewards\"]), 1)\n",
+    "    trained_rewards.append(mean_r)\n",
+    "    trained_mrrs.append(t[\"final_mrr\"])\n",
+    "    trained_survivals.append(1 if t[\"survived\"] else 0)\n",
+    "    print(f\"  Episode {i+1:2d} | mean_reward={mean_r:.4f} | \"\n",
+    "          f\"final_MRR=${t['final_mrr']:,.0f} | survived={t['survived']}\")\n",
+    "\n",
+    "print(f\"\\nTrained mean reward:    {np.mean(trained_rewards):.4f} ± {np.std(trained_rewards):.4f}\")\n",
+    "print(f\"Trained mean final MRR: ${np.mean(trained_mrrs):,.0f}\")\n",
+    "print(f\"Trained survival rate:  {np.mean(trained_survivals)*100:.0f}%\")\n",
+    "\n",
+    "# Delta\n",
+    "print(f\"\\n{'='*60}\")\n",
+    "print(\"IMPROVEMENT SUMMARY\")\n",
+    "print(f\"{'='*60}\")\n",
+    "print(f\"Mean reward delta:    {np.mean(trained_rewards) - np.mean(baseline_rewards):+.4f}\")\n",
+    "print(f\"Final MRR delta:      ${np.mean(trained_mrrs) - np.mean(baseline_mrrs):+,.0f}\")\n",
+    "print(f\"Survival rate delta:  {(np.mean(trained_survivals) - np.mean(baseline_survivals))*100:+.0f} pp\")\n",
+    "\n",
+    "# ============================================================\n",
+    "# CELL 12 — Plot reward curves and save to repo\n",
+    "# ============================================================"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4ae37220",
+   "metadata": {
+    "lines_to_next_cell": 0
+   },
+   "outputs": [],
+   "source": [
+    "fig, axes = plt.subplots(1, 3, figsize=(15, 5))\n",
+    "fig.suptitle(\"RevOps Gym — Training Results\", fontsize=14, fontweight=\"bold\")\n",
+    "\n",
+    "# Reward comparison\n",
+    "ax = axes[0]\n",
+    "ax.bar([\"Baseline\", \"Trained\"],\n",
+    "       [np.mean(baseline_rewards), np.mean(trained_rewards)],\n",
+    "       color=[\"#e74c3c\", \"#2ecc71\"], alpha=0.85, edgecolor=\"black\")\n",
+    "ax.errorbar([\"Baseline\", \"Trained\"],\n",
+    "            [np.mean(baseline_rewards), np.mean(trained_rewards)],\n",
+    "            yerr=[np.std(baseline_rewards), np.std(trained_rewards)],\n",
+    "            fmt=\"none\", color=\"black\", capsize=5)\n",
+    "ax.set_title(\"Mean Episode Reward\")\n",
+    "ax.set_ylabel(\"Reward\")\n",
+    "ax.set_xlabel(\"Model\")\n",
+    "\n",
+    "# MRR comparison\n",
+    "ax2 = axes[1]\n",
+    "ax2.bar([\"Baseline\", \"Trained\"],\n",
+    "        [np.mean(baseline_mrrs)/1000, np.mean(trained_mrrs)/1000],\n",
+    "        color=[\"#e74c3c\", \"#2ecc71\"], alpha=0.85, edgecolor=\"black\")\n",
+    "ax2.set_title(\"Mean Final MRR\")\n",
+    "ax2.set_ylabel(\"MRR ($K)\")\n",
+    "ax2.set_xlabel(\"Model\")\n",
+    "ax2.axhline(y=20, color=\"orange\", linestyle=\"--\", label=\"VC floor ($20K)\")\n",
+    "ax2.legend()\n",
+    "\n",
+    "# Survival rate\n",
+    "ax3 = axes[2]\n",
+    "ax3.bar([\"Baseline\", \"Trained\"],\n",
+    "        [np.mean(baseline_survivals)*100, np.mean(trained_survivals)*100],\n",
+    "        color=[\"#e74c3c\", \"#2ecc71\"], alpha=0.85, edgecolor=\"black\")\n",
+    "ax3.set_title(\"Company Survival Rate\")\n",
+    "ax3.set_ylabel(\"Survival %\")\n",
+    "ax3.set_xlabel(\"Model\")\n",
+    "ax3.set_ylim(0, 110)\n",
+    "ax3.axhline(y=100, color=\"gray\", linestyle=\"--\", alpha=0.5)\n",
+    "\n",
+    "plt.tight_layout()\n",
+    "plt.savefig(\"results_comparison.png\", dpi=150, bbox_inches=\"tight\")\n",
+    "plt.show()\n",
+    "print(\"Plot saved as results_comparison.png ✓\")\n",
+    "\n",
+    "# Training loss plot (from trainer logs)\n",
+    "if hasattr(trainer.state, \"log_history\") and trainer.state.log_history:\n",
+    "    losses = [x[\"loss\"] for x in trainer.state.log_history if \"loss\" in x]\n",
+    "    rewards_log = [x[\"reward\"] for x in trainer.state.log_history if \"reward\" in x]\n",
+    "\n",
+    "    fig2, (ax_l, ax_r) = plt.subplots(1, 2, figsize=(12, 4))\n",
+    "    fig2.suptitle(\"RevOps Gym — Training Curves\", fontsize=13, fontweight=\"bold\")\n",
+    "\n",
+    "    if losses:\n",
+    "        ax_l.plot(losses, color=\"#3498db\", linewidth=1.5)\n",
+    "        ax_l.set_title(\"Training Loss\")\n",
+    "        ax_l.set_xlabel(\"Training step\")\n",
+    "        ax_l.set_ylabel(\"Loss\")\n",
+    "        ax_l.grid(alpha=0.3)\n",
+    "\n",
+    "    if rewards_log:\n",
+    "        ax_r.plot(rewards_log, color=\"#2ecc71\", linewidth=1.5)\n",
+    "        ax_r.set_title(\"Training Reward\")\n",
+    "        ax_r.set_xlabel(\"Training step\")\n",
+    "        ax_r.set_ylabel(\"Mean reward\")\n",
+    "        ax_r.grid(alpha=0.3)\n",
+    "\n",
+    "    plt.tight_layout()\n",
+    "    plt.savefig(\"training_curves.png\", dpi=150, bbox_inches=\"tight\")\n",
+    "    plt.show()\n",
+    "    print(\"Training curves saved as training_curves.png ✓\")\n",
+    "\n",
+    "# ============================================================\n",
+    "# CELL 13 — Save model and push to HF Hub\n",
+    "# ============================================================"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d236cce6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Save the LoRA adapters only (save_method=\"lora\"): nothing is merged into the base\n",
+    "# weights here, which sidesteps the 4bit→16bit merge step entirely.\n",
+    "model.save_pretrained_merged(\n",
+    "    \"revops-gym-model\",\n",
+    "    tokenizer,\n",
+    "    save_method=\"lora\",  # save adapters only (small, efficient)\n",
+    ")\n",
+    "print(\"Model adapters saved to ./revops-gym-model ✓\")\n",
+    "\n",
+    "# Push to Hugging Face Hub\n",
+    "# from huggingface_hub import login\n",
+    "# login(token=\"hf_YOUR_TOKEN\")\n",
+    "# model.push_to_hub_merged(\"YOUR_HF_USERNAME/revops-gym-model\", tokenizer, save_method=\"lora\")\n",
+    "# print(\"Model pushed to HF Hub ✓\")\n",
+    "\n",
+    "# Copy result plots into the revops-gym repo for the README\n",
+    "!cp results_comparison.png revops-gym/\n",
+    "!cp training_curves.png revops-gym/\n",
+    "!cd revops-gym && git add results_comparison.png training_curves.png && git commit -m \"Add training result plots\" && git push\n",
+    "\n",
+    "print(\"\\n🎉 Training pipeline complete!\")\n",
+    "print(\"Next steps:\")\n",
+    "print(\"  1. Copy results_comparison.png and training_curves.png into your HF Space repo\")\n",
+    "print(\"  2. Embed them in README.md\")\n",
+    "print(\"  3. Push the trained model adapter to HF Hub\")\n",
+    "print(\"  4. Submit the HF Space URL via the Google Form\")"
+   ]
+  }
+ ],
+ "metadata": {
+  "jupytext": {
+   "cell_metadata_filter": "-all",
+   "main_language": "python",
+   "notebook_metadata_filter": "-all"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}