RevOps Gym — I Built an RL Environment That Actively Tries to Destroy Your Company

Most RL environments are static. The world doesn't fight back. You learn the rules, you optimize, you win.

I wanted to build something that fights back.

The Problem I Was Trying to Solve

Running a SaaS company is not a puzzle with a fixed solution. It's a moving target. You're balancing revenue growth, customer churn, marketing spend, and cash runway — all at the same time — while the market keeps throwing curveballs at you.

I wanted to see if a small LLM could learn to handle that. Not just answer questions about business strategy, but actually make decisions inside a simulation and get measurably better at it through training.

That's RevOps Gym.

What It Is

RevOps Gym is an RL training environment where an LLM agent — I call it the Pilot — has to manage a B2B SaaS company across 30 decision steps.

Every episode starts fresh: different starting revenue, different costs, different churn rate. The agent sees a live dashboard of its company's health and picks from 10 strategic actions like increasing marketing, hiring support staff, raising prices, or pivoting to a new customer segment.

The goal: keep Monthly Recurring Revenue (MRR) above $20,000. The VC fires you if it drops below that. Survive 30 steps and you win.

Simple enough. But here's the twist.

The Crisis Engine — This Is What Makes It Different

Every 3 steps, a Crisis Engine (powered by Gemini 2.0 Flash) wakes up, looks at the agent's current state, finds its weakest metric, and hits it with the most painful crisis it can.

High churn already? Expect a competitor to launch aggressive pricing. CAC already climbing? Ad costs are about to double. Cash runway getting thin? Surprise infrastructure bill incoming.

It's not random. It's targeted. The environment reads you and attacks your blind spots — just like real markets do.

Some of the crises it can throw at you:

CHURN_SPIKE — competitor launches aggressive pricing
CAC_EXPLOSION — ad costs double overnight
SUPPORT_CRISIS — key engineers quit, satisfaction tanks
ENTERPRISE_CHURN — top 3 accounts cancel at once
CASH_CRUNCH — unexpected bill, runway shrinks fast

This means the agent can't just memorize a fixed sequence of actions and win. It has to actually reason about its current situation and adapt. That's the whole point.
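The targeting idea can be sketched without an LLM in the loop. What follows is a rule-based stand-in, not the actual Gemini-powered engine: the field names, the "healthy" reference values, and the metric-to-crisis mapping are all assumptions made for illustration, and ENTERPRISE_CHURN is left out for brevity.

```python
from dataclasses import dataclass

@dataclass
class CompanyState:
    # Hypothetical field names; the real observation may differ.
    churn_rate: float      # monthly churn, e.g. 0.05 means 5%
    cac: float             # customer acquisition cost in dollars
    satisfaction: float    # 0.0 (worst) to 1.0 (best)
    runway_months: float   # months of cash left

# Assumed reference values used to normalize each metric.
HEALTHY = {"churn_rate": 0.03, "cac": 500.0, "satisfaction": 0.8, "runway_months": 12.0}

# Illustrative mapping from a weak metric to a crisis from the list above.
CRISIS_FOR = {
    "churn_rate": "CHURN_SPIKE",
    "cac": "CAC_EXPLOSION",
    "satisfaction": "SUPPORT_CRISIS",
    "runway_months": "CASH_CRUNCH",
}

def weakness_scores(state):
    """Return one score per metric; higher means further from healthy."""
    return {
        # Higher-is-worse metrics: actual divided by healthy.
        "churn_rate": state.churn_rate / HEALTHY["churn_rate"],
        "cac": state.cac / HEALTHY["cac"],
        # Lower-is-worse metrics: healthy divided by actual.
        "satisfaction": HEALTHY["satisfaction"] / max(state.satisfaction, 1e-6),
        "runway_months": HEALTHY["runway_months"] / max(state.runway_months, 1e-6),
    }

def pick_crisis(state, step, crisis_every=3):
    """Return a crisis name on crisis steps, otherwise None.

    Step 0 is skipped here by assumption, so the first crisis lands on step 3.
    """
    if step == 0 or step % crisis_every != 0:
        return None
    scores = weakness_scores(state)
    weakest = max(scores, key=scores.get)
    return CRISIS_FOR[weakest]
```

With this stand-in, a company whose churn is three times the healthy level gets CHURN_SPIKE on step 3, while one with two months of runway gets CASH_CRUNCH. The real engine makes that choice with an LLM rather than a lookup table.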

How the Reward Works

I used 4 independent reward signals instead of one, and this was one of the most important design decisions I made.

| Signal | Weight | What It's Measuring |
|---|---|---|
| LTV/CAC Ratio | 35% | Are you acquiring customers profitably? (target: 3x+) |
| MRR Growth | 30% | Is revenue going up or down vs last step? |
| Burn Efficiency | 20% | Are you spending sustainably? |
| Survival Bonus | 15% | Are you above the VC floor with healthy runway? |

If your company dies: −2.0 penalty on top of everything.

Why 4 signals? Because a single reward is easy to game. An agent optimizing only for MRR will just blast marketing spend until it burns through all its cash. Multiple independent signals force the agent to balance real tradeoffs — which is exactly what the task actually demands.
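As a rough illustration of how the four signals might combine, here is a minimal sketch. The weights and the −2.0 penalty come from the table above; the assumption that each signal is already scored on a 0 to 1 scale, the `ltv_cac_score` normalization, and every function name are hypothetical and not taken from the environment's code.

```python
# Weights from the reward table; the death penalty is the -2.0 stated above.
WEIGHTS = {
    "ltv_cac": 0.35,
    "mrr_growth": 0.30,
    "burn_efficiency": 0.20,
    "survival": 0.15,
}
DEATH_PENALTY = -2.0

def ltv_cac_score(ltv, cac, target=3.0):
    """Assumed normalization: the score reaches 1.0 at the 3x LTV/CAC target."""
    if cac <= 0:
        return 0.0
    return min(ltv / cac / target, 1.0)

def combine_signals(signals, alive):
    """Weighted sum of the per-signal scores, plus the penalty if the company died."""
    total = sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)
    if not alive:
        total += DEATH_PENALTY
    return total

# An agent that maxes out MRR growth and ignores everything else
# collects only the 30% share that signal carries.
mrr_only = {"ltv_cac": 0.0, "mrr_growth": 1.0, "burn_efficiency": 0.0, "survival": 0.0}
print(combine_signals(mrr_only, alive=True))
```

Under these assumptions, a single-metric strategy is capped at that metric's weight, which is the tradeoff the four-signal design is meant to enforce.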

Training Progress

Below is a snapshot of the GRPO training process in Colab. You can see the model starting to stabilize its completion lengths and reward mean as it moves through the steps.

The Training Results

I trained Qwen2.5-1.5B-Instruct using GRPO (via HuggingFace TRL + Unsloth) on a Colab T4, then compared the trained model against the untrained baseline on the same environment.

| Metric | Baseline | Trained | Change |
|---|---|---|---|
| Mean Episode Reward | ~0.18 | ~0.41 | +128% |
| Mean Final MRR | ~$31K | ~$58K | +87% |
| Survival Rate | ~30% | ~70% | +133% |
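The Change column is the relative improvement over the baseline. A quick check of the arithmetic, using the rounded values from the table (so the results are approximate too):

```python
def pct_change(baseline, trained):
    """Relative change from baseline to trained, in percent."""
    return (trained - baseline) / baseline * 100.0

rows = {
    "Mean Episode Reward": (0.18, 0.41),
    "Mean Final MRR": (31_000, 58_000),
    "Survival Rate": (0.30, 0.70),
}

for name, (baseline, trained) in rows.items():
    print(f"{name}: {pct_change(baseline, trained):+.0f}%")
# Mean Episode Reward: +128%
# Mean Final MRR: +87%
# Survival Rate: +133%
```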

The baseline model basically picks the same action over and over regardless of what's happening. The trained model actually adapts — it responds differently to a CAC_EXPLOSION crisis versus a CHURN_SPIKE, and it learned to pull back spending when runway gets low instead of doubling down.

That behavior change is real and it came entirely from training on this environment.

Reward vs. Loss

The training curves show a clear upward trend in mean reward. While RL training can be noisy, the general trajectory confirms the agent is successfully learning to navigate the Crisis Engine.
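For context on what the reward feeds into: GRPO scores each completion relative to the other completions sampled for the same prompt. The sketch below shows that group-relative normalization in plain Python. It illustrates the idea only and is not TRL's implementation; the function name and the `eps` value are assumptions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four completions for one prompt, each scored by the environment.
print(group_relative_advantages([0.1, 0.4, 0.4, 0.7]))

# If every completion earns the same reward, all advantages are zero,
# so that group gives the policy nothing to learn from.
print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))
```

The second case is worth keeping in mind when reading noisy curves: groups where all completions score alike contribute no gradient signal, however high or low that shared score is.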

Baseline vs. Trained Comparison

When put head-to-head in the same environment, the trained agent beats the baseline on all three metrics reported above: mean episode reward, mean final MRR, and survival rate.

Why I Think This Matters Beyond the Hackathon

Business decision-making under pressure is one of those things LLMs seem good at in a chat, but they struggle with it once you stress-test their reasoning over multiple steps.

This environment creates a training ground for exactly that — multi-step strategic thinking, adapting to adversarial conditions, managing competing constraints. The domain is real, the decisions are meaningful, and the skills that transfer out of this simulation are genuinely useful.

A model trained on RevOps Gym isn't just learning to win a game. It's learning to reason under pressure with incomplete information. That's a capability that matters.

A Few Things I Learned Building This

Start with curriculum learning. My first training runs on hard difficulty gave the model near-zero reward for hundreds of steps — basically no learning signal at all. Starting on easy mode first, then ramping up, made a huge difference.
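A step-based schedule is one simple way to do this ramp. In the sketch below, the thresholds are hypothetical and would need tuning, and the `"easy"` and `"hard"` strings are assumed to be valid `difficulty` values alongside the `"normal"` used in the Quick Start.

```python
def difficulty_for_step(step, easy_until=200, normal_until=600):
    """Pick a difficulty from the training step; thresholds are placeholders."""
    if step < easy_until:
        return "easy"
    if step < normal_until:
        return "normal"
    return "hard"

# Intended use (not executed here, since it needs the environment installed):
#   env = RevOpsEnv(crisis_every=3, difficulty=difficulty_for_step(step))

for step in (0, 199, 200, 599, 600):
    print(step, difficulty_for_step(step))
```

An alternative is to promote the difficulty only once the rolling mean reward clears a threshold, which adapts to how fast the model is actually learning.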

The adversarial Crisis Engine accidentally solved reward hacking. I was worried about the agent gaming the reward, but the targeted crises made that nearly impossible. If you over-optimize one metric, the environment hammers whichever metric you neglected at the next crisis, at most 3 steps later.

Inspect your generations during training, not just the reward curve. The reward going up doesn't always mean the model is learning what you think it's learning. Looking at actual outputs caught some weird behavior early on.
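One cheap way to do that inspection is to tally what the sampled completions actually say. The sketch below assumes each completion is meant to be a JSON action like the one in the Quick Start; the helper name and that output format are assumptions, not part of the environment.

```python
import json
from collections import Counter

def summarize_generations(completions):
    """Tally parsed action types and count completions that fail to parse."""
    actions = Counter()
    invalid = 0
    for text in completions:
        try:
            action = json.loads(text)
            actions[action["action_type"]] += 1
        except (json.JSONDecodeError, KeyError, TypeError):
            # Not JSON, missing the key, or not a JSON object at all.
            invalid += 1
    return actions, invalid

samples = [
    '{"action_type": "hire_support", "magnitude": 0.8}',
    '{"action_type": "hire_support", "magnitude": 0.2}',
    "I think we should raise prices",
    '{"magnitude": 0.5}',
]
actions, invalid = summarize_generations(samples)
print(actions.most_common(), "invalid:", invalid)
```

A tally dominated by one action, or a high invalid count, is a hint that a rising reward reflects something other than better decisions.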

Quick Start

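```python
from revops_gym import RevOpsEnv

# Create an environment that triggers a crisis every 3 steps.
env = RevOpsEnv(crisis_every=3, difficulty="normal")

# Start a fresh episode and print the dashboard the agent sees.
obs = env.reset()
print(obs.to_prompt_text())

# Take one action, then read the step reward and the updated MRR.
obs = env.step({"action_type": "hire_support", "magnitude": 0.8})
print(f"Reward: {obs.reward_last_step:.3f} | MRR: ${obs.mrr:,.0f}")
```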