title: RevOps Gym
emoji: 🚀
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
license: mit
tags:
  - openenv
  - reinforcement-learning
  - llm-training
  - saas-simulation
  - world-modeling
  - adversarial

🚀 RevOps Gym: SaaS Flight Simulator for LLM RL Training

Train a language model to run a B2B SaaS company under adversarial pressure, with real business tradeoffs, across 30 decision steps.



The Problem This Solves

LLMs can talk about business strategy. But can they actually execute it, step by step, under pressure, with competing constraints and an adversary actively working against them?

That's the gap RevOps Gym targets.

Revenue Operations, the discipline of aligning sales, marketing, and customer success, is one of the most consequential decision-making domains in the modern economy. Every B2B SaaS company lives or dies by a handful of metrics: Monthly Recurring Revenue (MRR), Customer Acquisition Cost (CAC), Lifetime Value (LTV), churn, and cash runway. Decisions around these metrics are multi-step, non-linear, and made under incomplete information: exactly the kind of reasoning that current LLMs struggle with when pushed beyond a single turn.

RevOps Gym creates a faithful, adversarial simulation of this domain, one that an LLM can train on and measurably improve at.


Hackathon Themes

| Theme | Coverage |
| --- | --- |
| #3.1 World Modeling: Professional Tasks | Primary: agent operates inside a dynamic business world with real state, real tradeoffs, and causal action effects |
| #1 Multi-Agent Interactions | Secondary: the Gemini-powered Crisis Engine is an active adversarial agent competing against the Pilot |

Environment Overview

The agent, called the Pilot, must manage a procedurally generated B2B SaaS company across 30 decision steps. The company is randomly initialized each episode (different MRR, CAC, churn, runway), so episodes differ from one another and a fixed action sequence is unlikely to succeed.

The win condition: survive 30 steps with MRR above $20,000. The lose condition: MRR drops below the VC floor, cash runway hits zero, or churn exceeds 20%.

What the Agent Observes

Every step, the Pilot sees a structured text dashboard:

=== RevOps Dashboard | Step 12/30 ===
⚠️  ACTIVE CRISIS: CAC_EXPLOSION - Ad costs doubled. Marketing efficiency collapses.
MRR: $63,400  |  Floor: $20,000
CAC: $2,100   |  LTV: $11,800  |  LTV/CAC: 5.62x
Churn: 3.2%   |  Runway: 14.5mo
Marketing spend: $18,200/mo  |  Support quality: 74%
Last reward: 0.312

Available actions: increase_marketing, decrease_marketing, hire_support,
fire_support, discount_campaign, raise_prices, feature_investment,
cut_costs, negotiate_contracts, pivot_segment
Respond ONLY with JSON: {"action_type": "...", "magnitude": 0.0-1.0}

Action Space

10 discrete strategic actions, each with a continuous magnitude parameter (0.1–1.0) that scales the effect intensity:

| Action | Effect |
| --- | --- |
| increase_marketing | Boosts MRR growth, raises spend, improves CAC at scale |
| decrease_marketing | Frees cash, slows growth |
| hire_support | Improves support quality, reduces churn, increases LTV; costs runway |
| fire_support | Saves cash, degrades support quality, raises churn |
| discount_campaign | Short-term MRR spike, hurts LTV |
| raise_prices | Increases LTV and MRR for retained customers, some churn risk |
| feature_investment | Raises LTV and reduces churn, costs runway |
| cut_costs | Extends runway, slows growth slightly |
| negotiate_contracts | Reduces churn, raises LTV, slightly increases CAC |
| pivot_segment | High risk / high reward, probabilistic outcome |
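
Because the Pilot replies in free text, a training loop needs to turn each reply into a validated action before calling the environment. The sketch below shows one way to do that. `parse_action` is a hypothetical helper, not part of the repository, and clamping the magnitude into the documented 0.1 to 1.0 range (rather than rejecting the reply) is an assumption.

```python
import json

# Action names copied from the dashboard prompt and the table above.
VALID_ACTIONS = {
    "increase_marketing", "decrease_marketing", "hire_support",
    "fire_support", "discount_campaign", "raise_prices",
    "feature_investment", "cut_costs", "negotiate_contracts",
    "pivot_segment",
}


def parse_action(reply: str, default_magnitude: float = 0.5) -> dict:
    """Parse a model reply into a validated action dict (hypothetical helper)."""
    # Tolerate chatter around the JSON by slicing from the first '{' to the last '}'.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    data = json.loads(reply[start:end + 1])

    action_type = data.get("action_type")
    if action_type not in VALID_ACTIONS:
        raise ValueError(f"unknown action_type: {action_type!r}")

    # Assumption: out-of-range magnitudes are clamped instead of rejected.
    magnitude = float(data.get("magnitude", default_magnitude))
    magnitude = min(1.0, max(0.1, magnitude))
    return {"action_type": action_type, "magnitude": magnitude}


print(parse_action('{"action_type": "hire_support", "magnitude": 0.8}'))
```

Raising `ValueError` on malformed replies lets the caller decide whether to penalize the rollout or fall back to a default action.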

Termination Conditions

An episode ends when any of these are true:

  • mrr < $20,000 (VC floor breached)
  • cash_runway ≤ 0 (company bankrupt)
  • churn_rate > 20% (unrecoverable customer loss)
  • step_number ≥ 30 (episode complete, agent survived)
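
The four conditions above can be sketched as a single check. This is an illustration rather than the code in env.py: the function name, the reason strings, and the choice to test failure conditions before the survival condition are all assumptions.

```python
def episode_status(mrr, cash_runway, churn_rate, step_number,
                   mrr_floor=20_000.0, max_steps=30):
    """Return (done, reason) for the four termination conditions listed above."""
    # Assumption: failure conditions take priority over the survival condition.
    if mrr < mrr_floor:
        return True, "vc_floor_breached"
    if cash_runway <= 0:
        return True, "bankrupt"
    if churn_rate > 0.20:
        return True, "unrecoverable_churn"
    if step_number >= max_steps:
        return True, "survived"
    return False, "running"


# Values taken from the sample dashboard at step 12.
print(episode_status(mrr=63_400, cash_runway=14.5, churn_rate=0.032, step_number=12))
```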

The Crisis Engine: What Makes This Environment Novel

Every 3 steps, the Crisis Engine activates. It reads the agent's current state, identifies the weakest metric using a normalized scoring function, and selects the most damaging crisis it can deploy against that exact vulnerability.

This is not random. It is targeted adversarial pressure: the environment actively hunts for the agent's blind spots.

# From crisis.py: weakness detector
# (mrr_floor is not defined in this excerpt; it is presumably set elsewhere in crisis.py)
def _worst_metric(state: RevOpsState) -> str:
    scores = {
        "churn_rate":      state.churn_rate / 0.20,
        "cac":             state.cac / 5000,
        "support_quality": 1.0 - state.support_quality,
        "cash_runway":     max(0, (12 - state.cash_runway) / 12),
        "mrr":             max(0, (mrr_floor * 2 - state.mrr) / (mrr_floor * 2)),
    }
    return max(scores, key=scores.get)
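
As a worked example, the same scoring can be run on the numbers from the sample dashboard shown earlier. The sketch below is self-contained: `DashboardState` is a hypothetical stand-in for `RevOpsState`, and `mrr_floor` is passed as a parameter with the $20,000 value from the dashboard, which may not match how crisis.py obtains it.

```python
from dataclasses import dataclass


@dataclass
class DashboardState:
    """Hypothetical stand-in for RevOpsState with only the fields scored here."""
    mrr: float
    cac: float
    churn_rate: float
    cash_runway: float
    support_quality: float


def worst_metric(state, mrr_floor=20_000.0):
    """Same normalization as the excerpt above; also returns the raw scores."""
    scores = {
        "churn_rate":      state.churn_rate / 0.20,
        "cac":             state.cac / 5000,
        "support_quality": 1.0 - state.support_quality,
        "cash_runway":     max(0, (12 - state.cash_runway) / 12),
        "mrr":             max(0, (mrr_floor * 2 - state.mrr) / (mrr_floor * 2)),
    }
    return max(scores, key=scores.get), scores


# Step 12 of the sample dashboard.
state = DashboardState(mrr=63_400, cac=2_100, churn_rate=0.032,
                       cash_runway=14.5, support_quality=0.74)
metric, scores = worst_metric(state)
print(metric)  # cac
```

With these inputs CAC scores 0.42, support quality 0.26, and churn 0.16, while runway and MRR are clipped to 0 because both sit above their thresholds. CAC is therefore the weakest metric, which is consistent with the CAC_EXPLOSION crisis shown in that dashboard.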

When GEMINI_API_KEY is set, Gemini 2.0 Flash is called with the full state context and asked to generate a contextual, creative crisis description with calibrated numeric deltas. If the API is unavailable, the engine falls back to a deterministic rule-based selector, so training never stalls.

Available Crises

| Crisis | Effect |
| --- | --- |
| CHURN_SPIKE | Competitor launches aggressive pricing: churn +4%, MRR −8% |
| CAC_EXPLOSION | Ad costs double: CAC ×1.6 |
| SUPPORT_CRISIS | Key engineers quit: support quality −25%, churn +2% |
| CASH_CRUNCH | Unexpected infrastructure bill: runway −3 months |
| PRICE_WAR | Competitors slash prices: MRR −12%, CAC ×1.3 |
| REGULATORY_HIT | New compliance requirement: runway −2 months, CAC ×1.2 |
| ENTERPRISE_CHURN | Top 3 accounts cancelled: MRR −20% |
| TALENT_WAR | Big tech hiring spree: runway −2.5 months, support −10% |
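
The table above can be read as a small data structure of state mutations. The sketch below transcribes it that way. It is not the code in crisis.py: the state is a plain dict, and treating churn and support changes as additive percentage points (while MRR and CAC changes are multiplicative) is an interpretation of the table, not something the README confirms.

```python
# Effects transcribed from the crisis table: "mul" multiplies, "add" adds.
CRISES = {
    "CHURN_SPIKE":      {"churn_rate": ("add", 0.04), "mrr": ("mul", 0.92)},
    "CAC_EXPLOSION":    {"cac": ("mul", 1.6)},
    "SUPPORT_CRISIS":   {"support_quality": ("add", -0.25), "churn_rate": ("add", 0.02)},
    "CASH_CRUNCH":      {"cash_runway": ("add", -3.0)},
    "PRICE_WAR":        {"mrr": ("mul", 0.88), "cac": ("mul", 1.3)},
    "REGULATORY_HIT":   {"cash_runway": ("add", -2.0), "cac": ("mul", 1.2)},
    "ENTERPRISE_CHURN": {"mrr": ("mul", 0.80)},
    "TALENT_WAR":       {"cash_runway": ("add", -2.5), "support_quality": ("add", -0.10)},
}


def apply_crisis(state: dict, crisis: str) -> dict:
    """Return a copy of the state with the named crisis applied."""
    new_state = dict(state)
    for field, (op, value) in CRISES[crisis].items():
        if op == "mul":
            new_state[field] = new_state[field] * value
        else:
            new_state[field] = new_state[field] + value
    # Assumption: support quality is a fraction and stays within [0, 1].
    new_state["support_quality"] = min(1.0, max(0.0, new_state["support_quality"]))
    return new_state


before = {"mrr": 50_000.0, "cac": 2_000.0, "churn_rate": 0.03,
          "cash_runway": 12.0, "support_quality": 0.8}
after = apply_crisis(before, "PRICE_WAR")
print(after["mrr"], after["cac"])
```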

Why this discourages reward hacking: if the agent pushes one metric at the expense of the others, the metric it neglected becomes its weakest score, and the Crisis Engine targets it at the next activation, at most three steps later. Over-investing in marketing without controlling CAC? Expect CAC_EXPLOSION. Ignoring support quality to save cash? SUPPORT_CRISIS is coming.


Reward Architecture

Four independent reward signals, composited with calibrated weights. This multi-signal design is central to the environment's integrity: a single reward is easy to game, while four signals that pull in different directions are much harder to exploit at once.

Total Reward = (LTV/CAC × 0.35) + (MRR Growth × 0.30) + (Burn Efficiency × 0.20) + (Survival Bonus × 0.15)

If the company dies: −2.0 termination penalty applied on top.
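
The composite above is a plain weighted sum, which can be sketched as follows. The weights and the −2.0 penalty come from the formula. The function name and the dict keys are illustrative and may not match `RewardRubric` in reward.py.

```python
# Weights from the formula above; key names are illustrative.
WEIGHTS = {
    "ltv_cac": 0.35,
    "mrr_growth": 0.30,
    "burn_efficiency": 0.20,
    "survival": 0.15,
}
TERMINATION_PENALTY = -2.0


def total_reward(signals: dict, company_died: bool = False) -> float:
    """Weighted sum of the four signal scores, plus the penalty on death."""
    reward = sum(weight * signals[name] for name, weight in WEIGHTS.items())
    if company_died:
        reward += TERMINATION_PENALTY
    return reward


example = {"ltv_cac": 0.8, "mrr_growth": 0.3, "burn_efficiency": 0.5, "survival": 1.0}
print(round(total_reward(example), 3))                      # 0.62
print(round(total_reward(example, company_died=True), 3))   # -1.38
```

Because the weights sum to 1.0, the pre-penalty reward cannot exceed the largest individual signal score, so a single failed episode outweighs several good steps.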

Signal Breakdown

Signal 1: LTV/CAC Ratio (35%). The "golden ratio" of SaaS health. Target is ≥ 3×. Score is nonlinear: below 1× gives a negative signal (losing money per customer), between 1× and 3× it scales up to 0.75, and above 3× it rewards further improvement up to a ceiling of 1.0.

Signal 2: MRR Growth (30%). Measures revenue trajectory relative to the previous step. +10% growth → score of 1.0. Flat → 0.3. −20% → 0. The first step rewards being above the VC floor.

Signal 3: Burn Efficiency (20%). Penalizes unsustainable marketing spend (marketing/MRR above a 50% ceiling). It also penalizes poor support quality as a proxy for hidden churn risk: an agent that ignores support quality will see this signal degrade even if spend looks fine.

Signal 4: Survival Bonus (15%). Binary floor check (MRR > $20K, runway > 3 months) with a runway health bonus (up to +0.5 for 24+ months of runway). Halved if churn exceeds 10%.
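
To make Signal 2 concrete, the sketch below passes a curve through the three anchor points the README gives (+10% → 1.0, flat → 0.3, −20% → 0). Linear interpolation between those points and clamping outside them are assumptions: reward.py may use a different shape, and the first-step rule is not modeled here.

```python
def mrr_growth_score(prev_mrr: float, mrr: float) -> float:
    """Piecewise-linear score through (-20%, 0.0), (0%, 0.3), (+10%, 1.0)."""
    growth = (mrr - prev_mrr) / prev_mrr  # assumes prev_mrr > 0
    if growth >= 0.10:
        return 1.0
    if growth <= -0.20:
        return 0.0
    if growth >= 0.0:
        # Rise from 0.3 at flat to 1.0 at +10% growth.
        return 0.3 + 0.7 * (growth / 0.10)
    # Fall from 0.3 at flat to 0.0 at -20% growth.
    return 0.3 * (1.0 + growth / 0.20)


for new_mrr in (80_000, 90_000, 100_000, 105_000, 110_000):
    print(new_mrr, round(mrr_growth_score(100_000, new_mrr), 3))
```

Under this interpolation, growth is rewarded more steeply than shrinkage is punished: each point of growth adds 0.07 to the score, while each point of decline removes 0.015.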


System Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        TRAINING LOOP                            │
│                                                                 │
│   LLM Agent (Qwen2.5-1.5B)  ◄──── Prompt text observation       │
│          │                                                      │
│          │  JSON action {"action_type": ..., "magnitude": ...}  │
│          ▼                                                      │
│   TRL GRPOTrainer  ◄──── Reward signal (4-signal composite)     │
│          │                                                      │
│          │  Policy update via GRPO                              │
│          ▼                                                      │
│   Unsloth (memory efficiency + fast rollout)                    │
└──────────────────────────┬──────────────────────────────────────┘
                           │  HTTP (OpenEnv API)
┌──────────────────────────▼──────────────────────────────────────┐
│                     ENVIRONMENT SERVER                          │
│                     (FastAPI / Docker)                          │
│                                                                 │
│   POST /reset  →  RevOpsEnv.reset()  →  Random episode init     │
│   POST /step   →  RevOpsEnv.step()   →  Action + crisis + reward│
│   GET  /state  →  RevOpsEnv.state()  →  Current observation     │
│                                                                 │
│   ┌─────────────┐   ┌──────────────┐   ┌────────────────────┐   │
│   │   env.py    │   │   crisis.py  │   │    reward.py       │   │
│   │             │   │              │   │                    │   │
│   │ _apply_     │──►│ CrisisEngine │   │ RewardRubric       │   │
│   │ action()    │   │              │   │                    │   │
│   │             │   │ Gemini 2.0   │   │ 4-signal composite │   │
│   │ World       │   │ Flash (LLM)  │   │                    │   │
│   │ dynamics    │   │    +         │   │ ltv_cac    35%     │   │
│   │             │   │ Rule-based   │   │ mrr_growth 30%     │   │
│   │             │   │ fallback     │   │ burn_eff   20%     │   │
│   └─────────────┘   └──────────────┘   │ survival   15%     │   │
│                                        └────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘
                           │
                           │  Deployed to
                           ▼
              🤗 HuggingFace Spaces (Docker)

Component Responsibilities

| File | Role |
| --- | --- |
| env.py | Core world dynamics: reset(), step(), state(). Orchestrates action effects, crisis triggering, reward scoring |
| models.py | Pydantic data models: RevOpsState (internal world), RevOpsAction (agent input), RevOpsObservation (agent output + reward) |
| crisis.py | Adversarial engine: weakness detection, Gemini API call, rule-based fallback, state mutation |
| reward.py | Four-signal reward rubric: independent scoring functions, weighted composition, termination penalty |
| server.py | FastAPI server: OpenEnv-compliant /reset, /step, /state endpoints plus a live judge dashboard with Chart.js plots |
| client.py | HTTP client wrapper: used by training scripts to interact with the environment through a clean Python API |

Training Pipeline

Base model: Qwen2.5-1.5B-Instruct
     │
     │  No SFT needed: base instruct model already formats JSON
     ▼
GRPO Training (HuggingFace TRL)
     │
     ├── Rollout: model generates {"action_type": ..., "magnitude": ...}
     │
     ├── Environment step: action → world dynamics → crisis check → reward
     │
     ├── Reward: 4-signal composite (ltv_cac, mrr_growth, burn_efficiency, survival)
     │
     ├── GRPO update: shift probability mass toward higher-reward trajectories
     │
     └── Repeat across episodes (random init each time)
     │
     ▼
Unsloth (QLoRA efficiency: runs on single Colab T4)
     │
     ▼
Trained adapter saved → tested on held-out episodes → results compared vs baseline

Why GRPO over PPO: GRPO removes the need for a separate value model, making it significantly more memory-efficient. This is the key reason training on a 1.5B model fits comfortably on a Colab T4 with Unsloth.
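
The reason no value model is needed is that GRPO scores each sampled completion against the other completions drawn for the same prompt. The following is a simplified sketch of that group-relative baseline in plain Python. It is not TRL's implementation, and the exact normalization there (for example the standard-deviation estimator or the epsilon) may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # The group mean plays the role a learned value baseline plays in PPO.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions sampled for the same dashboard prompt, scored by the environment.
advantages = group_relative_advantages([0.1, 0.4, 0.4, 0.7])
print([round(a, 2) for a in advantages])
```

Completions that beat their group's average get a positive advantage and are reinforced. If every completion in a group earns the same reward, all advantages are zero and that group contributes no gradient.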

Why no SFT warmup: Qwen2.5-1.5B-Instruct already follows JSON format instructions reliably. The environment gives a non-zero reward signal from episode one, so no curriculum is needed to bootstrap valid rollouts.


Training Results

Model trained on Qwen2.5-1.5B-Instruct via GRPO + Unsloth on a single Colab T4 GPU.

Figure (training logs): raw GRPO training logs showing loss, reward, and completion metrics across 50 steps.

Figure (results comparison): left, mean episode reward; center, final MRR at episode end; right, company survival rate. 🟢 Trained model vs 🔴 untrained baseline.

Figure (training curves): loss and reward curves during GRPO training. Reward climbs steadily; loss converges.

| Metric | Baseline (untrained) | Trained | Improvement |
| --- | --- | --- | --- |
| Mean episode reward | ~0.18 | ~0.41 | +128% |
| Mean final MRR | ~$31,000 | ~$58,000 | +87% |
| Company survival rate | ~30% | ~70% | +133% |
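
The improvement column is a relative change against the baseline, which the short check below reproduces from the approximate values in the table. Note that the survival figure is therefore a relative gain (30% to 70% of episodes), not a 133-point increase.

```python
def relative_improvement(baseline: float, trained: float) -> float:
    """Percentage change of the trained value relative to the baseline."""
    return (trained - baseline) / baseline * 100.0


# Approximate values copied from the results table.
rows = {
    "Mean episode reward": (0.18, 0.41),
    "Mean final MRR": (31_000, 58_000),
    "Company survival rate": (0.30, 0.70),
}
for name, (baseline, trained) in rows.items():
    print(f"{name}: {relative_improvement(baseline, trained):+.0f}%")
```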

Qualitative Behavior Change

The baseline model applies a near-fixed strategy (typically aggressive marketing) regardless of the current state or active crisis. The trained model demonstrates state-dependent reasoning:

  • Responds to CAC_EXPLOSION by pulling back marketing spend and pivoting to retention actions
  • Responds to CHURN_SPIKE by prioritizing hire_support and negotiate_contracts over growth actions
  • Scales back magnitude on risky actions when cash runway is low
  • Maintains LTV/CAC above 3× across more episodes by balancing growth and unit economics simultaneously

Why This Environment Is Technically Novel

1. Adversarial non-stationarity. Unlike environments with fixed dynamics, the Crisis Engine creates a non-stationary world. The agent cannot memorize a winning sequence, because the environment reads its state and adapts. This pushes the agent toward generalizable reasoning rather than pattern matching.

2. Multi-signal reward design. The four reward functions share no common exploitable shortcut. The only way to score well across all four signals simultaneously is to actually solve the underlying business problem.

3. Continuous action magnitude. Many discrete-action environments reduce each decision to picking one option from a fixed menu. The magnitude parameter (0.1–1.0) forces the agent to also reason about how much to commit to a strategy, which is a harder and more realistic problem.

4. LLM-in-the-loop adversary. Gemini 2.0 Flash doesn't just select from a fixed crisis menu: it generates contextually calibrated crisis descriptions and numeric deltas based on the agent's actual current state. When the Gemini path is active, crises are semantically and numerically tailored to the agent's specific weaknesses at that moment.

5. Real-world domain fidelity. The action effects, reward signals, and crisis scenarios are grounded in actual SaaS business mechanics: LTV/CAC, churn economics, burn rate analysis. A model trained here is meant to learn transferable business reasoning, not arbitrary game mechanics.


Real-World Relevance

The capabilities RevOps Gym trains are directly applicable to:

  • AI business analysts: agents that can reason through multi-step financial decisions, not just summarize data
  • Executive decision-support tools: AI that can model "what happens if I cut marketing by 30% while a competitor is aggressive" rather than just answering in the abstract
  • RL benchmarking for business domains: an underexplored area, since most existing work focuses on games, math, and code

The SaaS management domain is rich, verifiable, and economically significant. An LLM that can genuinely reason through these tradeoffs under adversarial pressure represents a meaningful capability advance over current models.


Quick Start

Python (local)

from revops_gym import RevOpsEnv

env = RevOpsEnv(crisis_every=3, difficulty="normal")
obs = env.reset()

# See what the agent sees
print(obs.to_prompt_text())

# Take an action
obs = env.step({"action_type": "hire_support", "magnitude": 0.8})
print(f"Reward: {obs.reward_last_step:.3f} | MRR: ${obs.mrr:,.0f} | LTV/CAC: {obs.ltv_cac_ratio:.2f}x")
print(f"Reward breakdown: {obs.info['reward_breakdown']}")

REST API (HuggingFace Space)

# Start a new episode
curl -X POST https://YOUR_HF_USERNAME-revops-gym.hf.space/reset \
  -H "Content-Type: application/json" \
  -d '{"difficulty": "normal"}'

# Take an action
curl -X POST https://YOUR_HF_USERNAME-revops-gym.hf.space/step \
  -H "Content-Type: application/json" \
  -d '{"action_type": "increase_marketing", "magnitude": 0.6}'

# Check current state
curl https://YOUR_HF_USERNAME-revops-gym.hf.space/state
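
The same calls can be made from Python with only the standard library. This is a sketch that mirrors the curl commands above, not the repository's client.py. `build_post` and `send` are hypothetical helper names, `BASE_URL` is the same placeholder as in the curl examples, and the response is assumed to be a JSON object.

```python
import json
import urllib.request

# Placeholder host copied from the curl examples; replace with a real Space URL.
BASE_URL = "https://YOUR_HF_USERNAME-revops-gym.hf.space"


def build_post(path: str, payload: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a JSON POST request for /reset or /step without sending it."""
    return urllib.request.Request(
        url=base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send a request and decode the JSON body (assumes a JSON response)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


reset_request = build_post("/reset", {"difficulty": "normal"})
step_request = build_post("/step", {"action_type": "increase_marketing", "magnitude": 0.6})
print(step_request.get_method(), step_request.full_url)

# Uncomment once BASE_URL points at a live Space:
# observation = send(reset_request)
# observation = send(step_request)
```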

Difficulty Levels

| Level | Cash Runway | Churn Rate | CAC |
| --- | --- | --- | --- |
| easy | ×1.5 | ×0.6 | Base |
| normal | Base | Base | Base |
| hard | ×0.6 | ×1.4 | ×1.3 |
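
Read as multipliers on the randomly drawn starting values, the table could be applied as in the sketch below. The multipliers come from the table. The function name, the dict-based state, and the example base values are illustrative assumptions rather than the logic in env.py.

```python
# Multipliers from the difficulty table ("Base" is written as 1.0).
DIFFICULTY = {
    "easy":   {"cash_runway": 1.5, "churn_rate": 0.6, "cac": 1.0},
    "normal": {"cash_runway": 1.0, "churn_rate": 1.0, "cac": 1.0},
    "hard":   {"cash_runway": 0.6, "churn_rate": 1.4, "cac": 1.3},
}


def apply_difficulty(base: dict, level: str) -> dict:
    """Scale the affected starting metrics; other fields are left unchanged."""
    scaled = dict(base)
    for field, factor in DIFFICULTY[level].items():
        scaled[field] = base[field] * factor
    return scaled


# Illustrative starting values, not taken from the environment.
base = {"mrr": 50_000.0, "cac": 2_000.0, "churn_rate": 0.05, "cash_runway": 10.0}
for level in DIFFICULTY:
    print(level, apply_difficulty(base, level))
```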

Repository Structure

revops-gym/
├── openenv.yaml              # OpenEnv manifest (obs space, action space, reward schema)
├── Dockerfile                # HuggingFace Spaces deployment
├── setup.py                  # Package definition
├── requirements.txt          # Dependencies
│
├── revops_gym/
│   ├── __init__.py
│   ├── env.py                # Core environment: reset / step / state / action effects
│   ├── models.py             # Pydantic models: RevOpsState, RevOpsAction, RevOpsObservation
│   ├── crisis.py             # Crisis Engine: weakness detection, Gemini API, rule-based fallback
│   ├── reward.py             # RewardRubric: 4-signal composite scoring
│   ├── server.py             # FastAPI server: OpenEnv endpoints + live judge dashboard
│   └── client.py             # HTTP client for training scripts
│
├── tests/
│   └── test_env.py           # Smoke tests: reset, step, termination, reward sanity
│
├── train_colab.ipynb         # Full GRPO training notebook (TRL + Unsloth, runs on Colab T4)
├── results_comparison.png    # Baseline vs trained: reward, MRR, survival rate
└── training_curves.png       # Loss and reward curves during training

Minimum Requirements Checklist

  • OpenEnv compliant: implements reset(), step(), state() per spec; valid openenv.yaml manifest
  • Training script: full GRPO training notebook (train_colab.ipynb) using HuggingFace TRL + Unsloth, runs on free Colab T4
  • Training evidence: reward curves, loss curves, and before/after comparison plots committed to repo
  • HuggingFace Space: Docker deployment with live interactive judge dashboard
  • Write-up: this README plus the blog post linked below

Links

| Resource | Link |
| --- | --- |
| 🤗 HuggingFace Space (live environment) | Sriram611/revops-gym |
| 📓 Training Colab Notebook | Open in Colab |
| 📝 Blog Post (HuggingFace) | Read the blog |
| 🤗 Trained Model | Sriram611/revops-gym-model |

Built for the OpenEnv Hackathon India, April 2026. Themes: #3.1 World Modeling (Professional Tasks) + #1 Multi-Agent Adversarial.