Space: zom696/RevOpsGYm

Sriram611 committed on Apr 26

Commit d79beef (verified) · 1 Parent(s): 0bbfc73

Update Blog.md

Files changed (1)

Blog.md +146 -129

@@ -1,130 +1,147 @@
-# RevOps Gym — I Built an RL Environment That Actively Tries to Destroy Your Company
-Most RL environments are static. The world doesn't fight back. You learn the rules, you optimize, you win.
-I wanted to build something that fights back.
----
-## The Problem I Was Trying to Solve
-Running a SaaS company is not a puzzle with a fixed solution. It's a moving target. You're balancing revenue growth, customer churn, marketing spend, and cash runway — all at the same time — while the market keeps throwing curveballs at you.
-I wanted to see if a small LLM could learn to handle that. Not just answer questions *about* business strategy, but actually *make decisions* inside a simulation and get measurably better at it through training.
-That's RevOps Gym.
----
-## What It Is
-RevOps Gym is an RL training environment where an LLM agent — I call it the **Pilot** — has to manage a B2B SaaS company across 30 decision steps.
-Every episode starts fresh: different starting revenue, different costs, different churn rate. The agent sees a live dashboard of its company's health and picks from 10 strategic actions like increasing marketing, hiring support staff, raising prices, or pivoting to a new customer segment.
-The goal: keep Monthly Recurring Revenue (MRR) above $20,000. The VC fires you if it drops below that. Survive 30 steps and you win.
-Simple enough. But here's the twist.
----
-## The Crisis Engine — This Is What Makes It Different
-Every 3 steps, a **Crisis Engine** (powered by Gemini 2.0 Flash) wakes up, looks at the agent's current state, finds its weakest metric, and hits it with the most painful crisis it can.
-High churn already? Expect a competitor to launch aggressive pricing.
-CAC already climbing? Ad costs are about to double.
-Cash runway getting thin? Surprise infrastructure bill incoming.
-It's not random. It's targeted. The environment reads you and attacks your blind spots — just like real markets do.
-Some of the crises it can throw at you:
-- `CHURN_SPIKE` — competitor launches aggressive pricing
-- `CAC_EXPLOSION` — ad costs double overnight
-- `SUPPORT_CRISIS` — key engineers quit, satisfaction tanks
-- `ENTERPRISE_CHURN` — top 3 accounts cancel at once
-- `CASH_CRUNCH` — unexpected bill, runway shrinks fast
-This means the agent can't just memorize a fixed sequence of actions and win. It has to actually reason about its current situation and adapt. That's the whole point.
----
-## How the Reward Works
-I used 4 independent reward signals instead of one, and this was one of the most important design decisions I made.
-| Signal | Weight | What It's Measuring |
-|---|---|---|
-| LTV/CAC Ratio | 35% | Are you acquiring customers profitably? (target: 3x+) |
-| MRR Growth | 30% | Is revenue going up or down vs last step? |
-| Burn Efficiency | 20% | Are you spending sustainably? |
-| Survival Bonus | 15% | Are you above the VC floor with healthy runway? |
-If your company dies: **−2.0 penalty on top of everything.**
-Why 4 signals? Because a single reward is easy to game. An agent optimizing only for MRR will just blast marketing spend until it burns through all its cash. Multiple independent signals force the agent to balance real tradeoffs — which is exactly what the task actually demands.
----
-## The Training Results
-I trained **Qwen2.5-1.5B-Instruct** using GRPO (via HuggingFace TRL + Unsloth) on a Colab T4. Compared the trained model against the untrained baseline on the same environment.
-| Metric | Baseline | Trained | Change |
-|---|---|---|---|
-| Mean Episode Reward | ~0.18 | ~0.41 | **+128%** |
-| Mean Final MRR | ~$31K | ~$58K | **+87%** |
-| Survival Rate | ~30% | ~70% | **+133%** |
-The baseline model basically picks the same action over and over regardless of what's happening. The trained model actually adapts — it responds differently to a `CAC_EXPLOSION` crisis versus a `CHURN_SPIKE`, and it learned to pull back spending when runway gets low instead of doubling down.
-That behavior change is real and it came entirely from training on this environment.
----
-## Why I Think This Matters Beyond the Hackathon
-Business decision-making under pressure is one of those things LLMs *seem* good at when you chat with them, but struggle with when you actually stress-test their reasoning over multiple steps.
-This environment creates a training ground for exactly that — multi-step strategic thinking, adapting to adversarial conditions, managing competing constraints. The domain is real, the decisions are meaningful, and the skills that transfer out of this simulation are genuinely useful.
-A model trained on RevOps Gym isn't just learning to win a game. It's learning to reason under pressure with incomplete information. That's a capability that matters.
----
-## A Few Things I Learned Building This
-**Start with curriculum learning.** My first training runs on hard difficulty gave the model near-zero reward for hundreds of steps — basically no learning signal at all. Starting on easy mode first, then ramping up, made a huge difference.
-**The adversarial Crisis Engine accidentally solved reward hacking.** I was worried about the agent gaming the reward, but the targeted crises made that nearly impossible. If you over-optimize one metric, the environment hammers that exact metric 3 steps later.
-**Inspect your generations during training, not just the reward curve.** The reward going up doesn't always mean the model is learning what you think it's learning. Looking at actual outputs caught some weird behavior early on.
----
-## Quick Start
-```python
-from revops_gym import RevOpsEnv
-env = RevOpsEnv(crisis_every=3, difficulty="normal")
-obs = env.reset()
-print(obs.to_prompt_text())
-obs = env.step({"action_type": "hire_support", "magnitude": 0.8})
-print(f"Reward: {obs.reward_last_step:.3f} | MRR: ${obs.mrr:,.0f}")
-```
----
-## Links
-- 🤗 **HF Space**: [Sriram611/revops-gym](https://huggingface.co/spaces/Sriram611/revops-gym)
-- 📓 **Training Colab**: [Open in Colab](https://colab.research.google.com/drive/1Gg-odqjf1eQLlYZe8LDkqzTtKWb3blz9?usp=sharing)
-- 🤗 **Trained Model**: [Sriram611/revops-gym-model](https://huggingface.co/Sriram611/revops-gym-model)
----
 *Built for the OpenEnv Hackathon India, April 2026 — Theme #3.1 World Modeling + Theme #1 Multi-Agent Adversarial.*

+# RevOps Gym — I Built an RL Environment That Actively Tries to Destroy Your Company
+Most RL environments are static. The world doesn't fight back. You learn the rules, you optimize, you win.
+I wanted to build something that fights back.
+---
+## The Problem I Was Trying to Solve
+Running a SaaS company is not a puzzle with a fixed solution. It's a moving target. You're balancing revenue growth, customer churn, marketing spend, and cash runway — all at the same time — while the market keeps throwing curveballs at you.
+I wanted to see if a small LLM could learn to handle that. Not just answer questions *about* business strategy, but actually *make decisions* inside a simulation and get measurably better at it through training.
+That's RevOps Gym.
+---
+## What It Is
+RevOps Gym is an RL training environment where an LLM agent — I call it the **Pilot** — has to manage a B2B SaaS company across 30 decision steps.
+Every episode starts fresh: different starting revenue, different costs, different churn rate. The agent sees a live dashboard of its company's health and picks from 10 strategic actions like increasing marketing, hiring support staff, raising prices, or pivoting to a new customer segment.
+The goal: keep Monthly Recurring Revenue (MRR) above $20,000. The VC fires you if it drops below that. Survive 30 steps and you win.
+Simple enough. But here's the twist.
+---
+## The Crisis Engine — This Is What Makes It Different
+Every 3 steps, a **Crisis Engine** (powered by Gemini 2.0 Flash) wakes up, looks at the agent's current state, finds its weakest metric, and hits it with the most painful crisis it can.
+High churn already? Expect a competitor to launch aggressive pricing.
+CAC already climbing? Ad costs are about to double.
+Cash runway getting thin? Surprise infrastructure bill incoming.
+It's not random. It's targeted. The environment reads you and attacks your blind spots — just like real markets do.
+Some of the crises it can throw at you:
+- `CHURN_SPIKE` — competitor launches aggressive pricing
+- `CAC_EXPLOSION` — ad costs double overnight
+- `SUPPORT_CRISIS` — key engineers quit, satisfaction tanks
+- `ENTERPRISE_CHURN` — top 3 accounts cancel at once
+- `CASH_CRUNCH` — unexpected bill, runway shrinks fast
+This means the agent can't just memorize a fixed sequence of actions and win. It has to actually reason about its current situation and adapt. That's the whole point.
+---
+## How the Reward Works
+I used 4 independent reward signals instead of one, and this was one of the most important design decisions I made.
+| Signal | Weight | What It's Measuring |
+|---|---|---|
+| LTV/CAC Ratio | 35% | Are you acquiring customers profitably? (target: 3x+) |
+| MRR Growth | 30% | Is revenue going up or down vs last step? |
+| Burn Efficiency | 20% | Are you spending sustainably? |
+| Survival Bonus | 15% | Are you above the VC floor with healthy runway? |
+If your company dies: **−2.0 penalty on top of everything.**
+Why 4 signals? Because a single reward is easy to game. An agent optimizing only for MRR will just blast marketing spend until it burns through all its cash. Multiple independent signals force the agent to balance real tradeoffs — which is exactly what the task actually demands.
+---
+## Training Progress
+Below is a snapshot of the GRPO training process in Colab. You can see the model starting to stabilize its completion lengths and reward mean as it moves through the steps.
+![GRPO Training Logs](training_logs.jpg)
+---
+## The Training Results
+I trained **Qwen2.5-1.5B-Instruct** using GRPO (via HuggingFace TRL + Unsloth) on a Colab T4. Compared the trained model against the untrained baseline on the same environment.
+| Metric | Baseline | Trained | Change |
+|---|---|---|---|
+| Mean Episode Reward | ~0.18 | ~0.41 | **+128%** |
+| Mean Final MRR | ~$31K | ~$58K | **+87%** |
+| Survival Rate | ~30% | ~70% | **+133%** |
+The baseline model basically picks the same action over and over regardless of what's happening. The trained model actually adapts — it responds differently to a `CAC_EXPLOSION` crisis versus a `CHURN_SPIKE`, and it learned to pull back spending when runway gets low instead of doubling down.
+That behavior change is real and it came entirely from training on this environment.
+---
+## Reward vs. Loss
+The training curves show a clear upward trend in mean reward. While RL training can be noisy, the general trajectory confirms the agent is successfully learning to navigate the Crisis Engine.
+![RevOps Gym Training Curves](training_curves.png)
+---
+### Baseline vs. Trained Comparison
+When put head-to-head in the same environment, the trained agent significantly outperforms the baseline across every core business metric.
+![RevOps Gym Comparison Results](results_comparison.png)
+---
+## Why I Think This Matters Beyond the Hackathon
+Business decision-making under pressure is one of those things LLMs *seem* good at when you chat with them, but struggle with when you actually stress-test their reasoning over multiple steps.
+This environment creates a training ground for exactly that — multi-step strategic thinking, adapting to adversarial conditions, managing competing constraints. The domain is real, the decisions are meaningful, and the skills that transfer out of this simulation are genuinely useful.
+A model trained on RevOps Gym isn't just learning to win a game. It's learning to reason under pressure with incomplete information. That's a capability that matters.
+---
+## A Few Things I Learned Building This
+**Start with curriculum learning.** My first training runs on hard difficulty gave the model near-zero reward for hundreds of steps — basically no learning signal at all. Starting on easy mode first, then ramping up, made a huge difference.
+**The adversarial Crisis Engine accidentally solved reward hacking.** I was worried about the agent gaming the reward, but the targeted crises made that nearly impossible. If you over-optimize one metric, the environment hammers that exact metric 3 steps later.
+**Inspect your generations during training, not just the reward curve.** The reward going up doesn't always mean the model is learning what you think it's learning. Looking at actual outputs caught some weird behavior early on.
+---
+## Quick Start
+```python
+from revops_gym import RevOpsEnv
+env = RevOpsEnv(crisis_every=3, difficulty="normal")
+obs = env.reset()
+print(obs.to_prompt_text())
+obs = env.step({"action_type": "hire_support", "magnitude": 0.8})
+print(f"Reward: {obs.reward_last_step:.3f} | MRR: ${obs.mrr:,.0f}")
+```
+---
+## Links
+- 🤗 **HF Space**: [Sriram611/revops-gym](https://huggingface.co/spaces/Sriram611/revops-gym)
+- 📓 **Training Colab**: [Open in Colab](https://colab.research.google.com/drive/1Gg-odqjf1eQLlYZe8LDkqzTtKWb3blz9?usp=sharing)
+- 🤗 **Trained Model**: [Sriram611/revops-gym-model](https://huggingface.co/Sriram611/revops-gym-model)
+---
 *Built for the OpenEnv Hackathon India, April 2026 — Theme #3.1 World Modeling + Theme #1 Multi-Agent Adversarial.*
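
Review note on the "How the Reward Works" section of Blog.md: the table gives four weights (35/30/20/15) and a −2.0 penalty on death, but the diff does not show how they are combined. The sketch below is one plausible reading, a plain weighted sum with the penalty subtracted at the end. It is not taken from the RevOps Gym source. The function name `combine_reward`, the signal keys, and the assumption that each signal is already normalized to the range 0 to 1 are all hypothetical.

```python
# Hypothetical sketch: NOT the RevOps Gym implementation.
# Weights and the death penalty come from the table in Blog.md;
# everything else (names, normalization to [0, 1]) is assumed.

WEIGHTS = {
    "ltv_cac": 0.35,          # LTV/CAC Ratio
    "mrr_growth": 0.30,       # MRR Growth
    "burn_efficiency": 0.20,  # Burn Efficiency
    "survival": 0.15,         # Survival Bonus
}
DEATH_PENALTY = 2.0


def combine_reward(signals: dict[str, float], company_died: bool) -> float:
    """Return the weighted sum of the four signals, minus the death penalty."""
    missing = WEIGHTS.keys() - signals.keys()
    if missing:
        raise KeyError(f"missing reward signals: {sorted(missing)}")
    total = sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)
    if company_died:
        total -= DEATH_PENALTY
    return total


if __name__ == "__main__":
    step_signals = {
        "ltv_cac": 1.0,
        "mrr_growth": 0.5,
        "burn_efficiency": 0.5,
        "survival": 1.0,
    }
    print(f"alive: {combine_reward(step_signals, company_died=False):.3f}")
    print(f"dead:  {combine_reward(step_signals, company_died=True):.3f}")
```

Under this reading, an agent that maximizes only MRR growth can earn at most 0.30 per step from that signal, which matches the post's argument that a single metric cannot dominate the reward.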