| --- |
| title: RevOps Gym |
| emoji: π |
| colorFrom: blue |
| colorTo: green |
| sdk: docker |
| pinned: false |
| license: mit |
| tags: |
| - openenv |
| - reinforcement-learning |
| - llm-training |
| - saas-simulation |
| - world-modeling |
| - adversarial |
| --- |
| |
| # RevOps Gym: SaaS Flight Simulator for LLM RL Training |
|
|
| > *Train a language model to run a B2B SaaS company under adversarial pressure, with real business tradeoffs, across 30 decision steps.* |
|
|
| [OpenEnv](https://github.com/huggingface/openenv) |
| [HuggingFace Space](https://huggingface.co/spaces/YOUR_HF_USERNAME/revops-gym) |
| [Open in Colab](https://colab.research.google.com/drive/YOUR_COLAB_LINK) |
| [License: MIT](LICENSE) |
|
|
| --- |
|
|
| ## The Problem This Solves |
|
|
| LLMs can *talk* about business strategy. But can they actually *execute* it, step by step, under pressure, with competing constraints and an adversary actively working against them? |
|
|
| That's the gap RevOps Gym targets. |
|
|
| Revenue Operations, the discipline of aligning sales, marketing, and customer success, is one of the most consequential decision-making domains in the modern economy. Every B2B SaaS company lives or dies by a handful of metrics: Monthly Recurring Revenue (MRR), Customer Acquisition Cost (CAC), Lifetime Value (LTV), churn, and cash runway. Decisions around these metrics are multi-step, non-linear, and made under incomplete information: exactly the kind of reasoning that current LLMs struggle with when pushed beyond a single turn. |
|
|
| RevOps Gym creates a **faithful, adversarial simulation** of this domain, one that an LLM can train on and measurably improve at. |
|
|
| --- |
|
|
| ## Hackathon Themes |
|
|
| | Theme | Coverage | |
| |---|---| |
| | **#3.1: World Modeling (Professional Tasks)** | Primary: the agent operates inside a dynamic business world with real state, real tradeoffs, and causal action effects | |
| | **#1: Multi-Agent Interactions** | Secondary: the Gemini-powered Crisis Engine is an active adversarial agent competing against the Pilot | |
|
|
| --- |
|
|
| ## Environment Overview |
|
|
| The agent, called the **Pilot**, must manage a procedurally generated B2B SaaS company across **30 decision steps**. The company is randomly initialized each episode (different MRR, CAC, churn, runway), so no two episodes are identical and fixed-sequence strategies cannot succeed. |
|
|
| **The win condition:** survive 30 steps with MRR above $20,000. |
| **The lose condition:** MRR drops below the VC floor, cash runway hits zero, or churn exceeds 20%. |
|
|
| ### What the Agent Observes |
|
|
| Every step, the Pilot sees a structured text dashboard: |
|
|
| ``` |
| === RevOps Dashboard | Step 12/30 === |
| ⚠️ ACTIVE CRISIS: CAC_EXPLOSION - Ad costs doubled. Marketing efficiency collapses. |
| MRR: $63,400 | Floor: $20,000 |
| CAC: $2,100 | LTV: $11,800 | LTV/CAC: 5.62x |
| Churn: 3.2% | Runway: 14.5mo |
| Marketing spend: $18,200/mo | Support quality: 74% |
| Last reward: 0.312 |
| |
| Available actions: increase_marketing, decrease_marketing, hire_support, |
| fire_support, discount_campaign, raise_prices, feature_investment, |
| cut_costs, negotiate_contracts, pivot_segment |
| Respond ONLY with JSON: {"action_type": "...", "magnitude": 0.0-1.0} |
| ``` |
|
|
| ### Action Space |
|
|
| 10 discrete strategic actions, each with a continuous `magnitude` parameter (0.1 to 1.0) that scales the effect intensity: |
|
|
| | Action | Effect | |
| |---|---| |
| | `increase_marketing` | Boosts MRR growth, raises spend, improves CAC at scale | |
| | `decrease_marketing` | Frees cash, slows growth | |
| | `hire_support` | Improves support quality, reduces churn, increases LTV; costs runway | |
| | `fire_support` | Saves cash, degrades support quality, raises churn | |
| | `discount_campaign` | Short-term MRR spike, hurts LTV | |
| | `raise_prices` | Increases LTV and MRR for retained customers, some churn risk | |
| | `feature_investment` | Raises LTV and reduces churn, costs runway | |
| | `cut_costs` | Extends runway, slows growth slightly | |
| | `negotiate_contracts` | Reduces churn, raises LTV, slightly increases CAC | |
| | `pivot_segment` | High risk / high reward: probabilistic outcome | |
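A model reply has to be turned into one of these actions before the environment can use it. The following is a minimal sketch of that validation step, not the project's own parser: `parse_action` and `VALID_ACTIONS` are hypothetical names, and clamping `magnitude` into 0.1 to 1.0 (rather than rejecting out-of-range values) is an assumption. Note that the dashboard prompt advertises `0.0-1.0` while this section states 0.1 to 1.0; the sketch follows the latter.

```python
import json

# Action names copied from the table above.
VALID_ACTIONS = {
    "increase_marketing", "decrease_marketing", "hire_support", "fire_support",
    "discount_campaign", "raise_prices", "feature_investment", "cut_costs",
    "negotiate_contracts", "pivot_segment",
}


def parse_action(raw: str, default_magnitude: float = 0.5) -> dict:
    """Parse a model reply into a validated action dict (hypothetical helper).

    Raises ValueError on malformed JSON, a non-object reply,
    an unknown action_type, or a non-numeric magnitude.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")

    action_type = data.get("action_type")
    if not isinstance(action_type, str) or action_type not in VALID_ACTIONS:
        raise ValueError(f"unknown action_type: {action_type!r}")

    try:
        magnitude = float(data.get("magnitude", default_magnitude))
    except (TypeError, ValueError) as exc:
        raise ValueError("magnitude must be a number") from exc

    # Assumption: out-of-range magnitudes are clamped into [0.1, 1.0], not rejected.
    magnitude = min(1.0, max(0.1, magnitude))
    return {"action_type": action_type, "magnitude": magnitude}
```

Rejecting invalid replies with an exception lets a training loop decide separately how to score them, for example with a fixed format penalty.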
|
|
| ### Termination Conditions |
|
|
| An episode ends when any of these are true: |
|
|
| - `mrr < $20,000` (VC floor breached) |
| - `cash_runway ≤ 0` (company bankrupt) |
| - `churn_rate > 20%` (unrecoverable customer loss) |
| - `step_number ≥ 30` (episode complete, agent survived) |
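The four conditions above can be sketched as a single check. This is an illustrative version under stated assumptions, not the code in `env.py`: `CompanySnapshot` is a hypothetical stand-in for `RevOpsState`, the returned labels are invented for the example, and checking failure conditions before the survival condition is an assumed ordering.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CompanySnapshot:
    """Hypothetical stand-in for the fields of RevOpsState used here."""
    mrr: float          # dollars per month
    cash_runway: float  # months
    churn_rate: float   # fraction, e.g. 0.032 for 3.2%
    step_number: int


def termination_reason(
    s: CompanySnapshot, mrr_floor: float = 20_000.0, max_steps: int = 30
) -> Optional[str]:
    """Return a label if the episode should end, else None."""
    # Assumption: failures are checked first, so a company that dies
    # on the final step counts as a loss rather than a survival.
    if s.mrr < mrr_floor:
        return "vc_floor_breached"
    if s.cash_runway <= 0:
        return "bankrupt"
    if s.churn_rate > 0.20:
        return "unrecoverable_churn"
    if s.step_number >= max_steps:
        return "survived"
    return None
```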
|
|
| --- |
|
|
| ## The Crisis Engine: What Makes This Environment Novel |
|
|
| Every 3 steps, the **Crisis Engine** activates. It reads the agent's current state, identifies the **weakest metric** using a normalized scoring function, and selects the most damaging crisis it can deploy against that exact vulnerability. |
|
|
| This is not random. It is targeted adversarial pressure: the environment actively hunts for the agent's blind spots. |
|
|
| ```python |
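| # From crisis.py: weakness detector |
| from revops_gym.models import RevOpsState |
| |
| |
| def _worst_metric(state: RevOpsState, mrr_floor: float = 20_000.0) -> str: |
|     """Return the name of the metric with the highest normalized weakness score.""" |
|     # mrr_floor was not defined in the original excerpt; taking it as a parameter |
|     # (defaulting to the $20,000 VC floor) is an assumption to keep this runnable. |
|     scores = { |
|         "churn_rate": state.churn_rate / 0.20,  # 1.0 at the 20% termination limit |
|         "cac": state.cac / 5000, |
|         "support_quality": 1.0 - state.support_quality, |
|         "cash_runway": max(0, (12 - state.cash_runway) / 12), |
|         "mrr": max(0, (mrr_floor * 2 - state.mrr) / (mrr_floor * 2)), |
|     } |
|     return max(scores, key=scores.get) |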
| ``` |
|
|
| When `GEMINI_API_KEY` is set, Gemini 2.0 Flash is called with the full state context and asked to generate a contextual, creative crisis description with calibrated numeric deltas. If the API is unavailable, the engine falls back to a deterministic rule-based selector, so **training never stalls**. |
|
|
| ### Available Crises |
|
|
| | Crisis | Effect | |
| |---|---| |
| | `CHURN_SPIKE` | Competitor launches aggressive pricing → churn +4%, MRR −8% | |
| | `CAC_EXPLOSION` | Ad costs double → CAC ×1.6 | |
| | `SUPPORT_CRISIS` | Key engineers quit → support quality −25%, churn +2% | |
| | `CASH_CRUNCH` | Unexpected infrastructure bill → runway −3 months | |
| | `PRICE_WAR` | Competitors slash prices → MRR −12%, CAC ×1.3 | |
| | `REGULATORY_HIT` | New compliance requirement → runway −2 months, CAC ×1.2 | |
| | `ENTERPRISE_CHURN` | Top 3 accounts cancelled → MRR −20% | |
| | `TALENT_WAR` | Big tech hiring spree → runway −2.5 months, support −10% | |
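To make the table concrete, here is a hedged sketch of how these effects could be encoded as data and applied to a state. It is not the code in `crisis.py`: `State`, `CRISES`, and `apply_crisis` are hypothetical names, and the table leaves some semantics open. The sketch assumes that MRR and CAC changes are multiplicative, that churn and support-quality changes are absolute percentage points on a 0 to 1 scale, and that runway changes are in months.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class State:
    """Hypothetical subset of the world state touched by crises."""
    mrr: float
    cac: float
    churn_rate: float       # fraction, 0.03 == 3%
    support_quality: float  # fraction, 0.74 == 74%
    cash_runway: float      # months


# Effects transcribed from the table above (interpretation is an assumption).
CRISES = {
    "CHURN_SPIKE":      {"churn_add": 0.04, "mrr_mult": 0.92},
    "CAC_EXPLOSION":    {"cac_mult": 1.6},
    "SUPPORT_CRISIS":   {"support_add": -0.25, "churn_add": 0.02},
    "CASH_CRUNCH":      {"runway_add": -3.0},
    "PRICE_WAR":        {"mrr_mult": 0.88, "cac_mult": 1.3},
    "REGULATORY_HIT":   {"runway_add": -2.0, "cac_mult": 1.2},
    "ENTERPRISE_CHURN": {"mrr_mult": 0.80},
    "TALENT_WAR":       {"runway_add": -2.5, "support_add": -0.10},
}


def apply_crisis(state: State, name: str) -> State:
    """Return a new State with the named crisis applied."""
    fx = CRISES[name]
    support = state.support_quality + fx.get("support_add", 0.0)
    return replace(
        state,
        mrr=state.mrr * fx.get("mrr_mult", 1.0),
        cac=state.cac * fx.get("cac_mult", 1.0),
        churn_rate=state.churn_rate + fx.get("churn_add", 0.0),
        support_quality=min(1.0, max(0.0, support)),  # keep within [0, 1]
        cash_runway=state.cash_runway + fx.get("runway_add", 0.0),
    )
```

Keeping the effects in a plain dictionary makes it straightforward for a rule-based fallback to select a crisis by name.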
|
|
| **Why this prevents reward hacking:** if the agent over-optimizes one metric while neglecting another, the Crisis Engine targets the neglected metric at its next activation, at most three steps later. Over-investing in marketing without controlling CAC? Expect `CAC_EXPLOSION`. Ignoring support quality to save cash? `SUPPORT_CRISIS` is coming. |
|
|
| --- |
|
|
| ## Reward Architecture |
|
|
| Four independent reward signals are combined with fixed weights. This multi-signal design is central to the environment's integrity: a single reward is easy to game, while four distinct signals are much harder to exploit at the same time. |
|
|
| ``` |
| Total Reward = (LTV/CAC × 0.35) + (MRR Growth × 0.30) + (Burn Efficiency × 0.20) + (Survival Bonus × 0.15) |
| ``` |
|
|
| If the company dies: **−2.0 termination penalty** applied on top. |
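The weighted sum and the termination penalty can be written as a short function. This is a minimal sketch of the formula stated above, not the `RewardRubric` implementation: `total_reward` is a hypothetical name, and the signal keys are borrowed from the names used elsewhere in this README.

```python
# Weights from the formula above; they sum to 1.0.
WEIGHTS = {
    "ltv_cac": 0.35,
    "mrr_growth": 0.30,
    "burn_efficiency": 0.20,
    "survival": 0.15,
}
TERMINATION_PENALTY = -2.0


def total_reward(signals: dict, company_died: bool = False) -> float:
    """Combine the four per-signal scores into one scalar reward."""
    total = sum(weight * signals[name] for name, weight in WEIGHTS.items())
    if company_died:
        # The penalty is added on top of the weighted sum, as described above.
        total += TERMINATION_PENALTY
    return total
```

For example, per-signal scores of 0.75, 0.3, 0.5, and 1.0 give 0.2625 + 0.09 + 0.10 + 0.15 = 0.6025, or −1.3975 if the company died on that step.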
|
|
| ### Signal Breakdown |
|
|
| **Signal 1: LTV/CAC Ratio (35%)** |
| The "golden ratio" of SaaS health. Target is ≥ 3×. Score is nonlinear: below 1× gives a negative signal (losing money per customer), between 1× and 3× the score scales up to 0.75, and above 3× further improvement is rewarded up to a ceiling of 1.0. |
|
|
| **Signal 2: MRR Growth (30%)** |
| Measures revenue trajectory relative to the previous step. +10% growth → score of 1.0. Flat → 0.3. −20% → 0. First step rewards being above the VC floor. |
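The three anchor points given for this signal can be joined into a curve. The sketch below assumes piecewise-linear interpolation between them and clamping outside the range; the text only fixes the anchors, so the shape in between is an assumption, and `mrr_growth_score` is a hypothetical name. The first-step rule is not specified precisely enough to reproduce and is left out.

```python
def mrr_growth_score(prev_mrr: float, mrr: float) -> float:
    """Score step-over-step MRR growth on a 0.0 to 1.0 scale.

    Anchors from the text: -20% -> 0.0, flat -> 0.3, +10% -> 1.0.
    Assumption: linear interpolation between anchors, clamped outside.
    """
    if prev_mrr <= 0:
        raise ValueError("prev_mrr must be positive")
    growth = (mrr - prev_mrr) / prev_mrr
    if growth <= -0.20:
        return 0.0
    if growth >= 0.10:
        return 1.0
    if growth < 0.0:
        # Segment from (-0.20, 0.0) to (0.0, 0.3).
        return 0.3 * (growth + 0.20) / 0.20
    # Segment from (0.0, 0.3) to (0.10, 1.0).
    return 0.3 + 0.7 * growth / 0.10
```

Under this interpolation a 5% gain scores 0.65 and a 10% loss scores 0.15, so shrinking is penalized but not as sharply as growth is rewarded.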
|
|
| **Signal 3: Burn Efficiency (20%)** |
| Penalizes unsustainable marketing spend (marketing/MRR > 50% ceiling). Additionally penalizes poor support quality as a proxy for hidden churn risk: an agent that ignores support quality will see this signal degrade even if spend looks fine. |
|
|
| **Signal 4: Survival Bonus (15%)** |
| Binary floor check (MRR > $20K, runway > 3 months) with a runway health bonus (up to +0.5 for 24+ months of runway). Halved if churn exceeds 10%. |
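One way to read this description is sketched below. Several details are assumptions rather than facts from `reward.py`: the base value of 0.5 for passing the floor check, the linear shape of the runway bonus, and the choice to grant no bonus when the floor check fails. `survival_score` is a hypothetical name.

```python
def survival_score(
    mrr: float, cash_runway: float, churn_rate: float, mrr_floor: float = 20_000.0
) -> float:
    """Illustrative survival signal on a 0.0 to 1.0 scale."""
    # Binary floor check from the text: MRR above the floor, runway above 3 months.
    if not (mrr > mrr_floor and cash_runway > 3.0):
        return 0.0
    base = 0.5  # assumed value for passing the floor check
    # Runway health bonus: assumed linear, reaching the stated +0.5 at 24 months.
    bonus = 0.5 * min(cash_runway / 24.0, 1.0)
    score = base + bonus
    # Halved when churn exceeds 10%, as stated above.
    if churn_rate > 0.10:
        score *= 0.5
    return score
```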
|
|
| --- |
|
|
| ## System Architecture |
|
|
| ``` |
| ┌─────────────────────────────────────────────────────────────────┐ |
| │                          TRAINING LOOP                          │ |
| │                                                                 │ |
| │   LLM Agent (Qwen2.5-1.5B)  ◄────  Prompt text observation      │ |
| │         │                                                       │ |
| │         │  JSON action {"action_type": ..., "magnitude": ...}   │ |
| │         ▼                                                       │ |
| │   TRL GRPOTrainer  ◄────  Reward signal (4-signal composite)    │ |
| │         │                                                       │ |
| │         │  Policy update via GRPO                               │ |
| │         ▼                                                       │ |
| │   Unsloth (memory efficiency + fast rollout)                    │ |
| └──────────────────────────┬──────────────────────────────────────┘ |
|                            │  HTTP (OpenEnv API)                    |
| ┌──────────────────────────▼──────────────────────────────────────┐ |
| │                       ENVIRONMENT SERVER                        │ |
| │                       (FastAPI / Docker)                        │ |
| │                                                                 │ |
| │   POST /reset  →  RevOpsEnv.reset()  →  Random episode init     │ |
| │   POST /step   →  RevOpsEnv.step()   →  Action + crisis + reward│ |
| │   GET  /state  →  RevOpsEnv.state()  →  Current observation     │ |
| │                                                                 │ |
| │   ┌─────────────┐   ┌──────────────┐   ┌────────────────────┐   │ |
| │   │   env.py    │   │  crisis.py   │   │    reward.py       │   │ |
| │   │             │   │              │   │                    │   │ |
| │   │  _apply_    │──►│ CrisisEngine │   │  RewardRubric      │   │ |
| │   │  action()   │   │              │   │                    │   │ |
| │   │             │   │ Gemini 2.0   │   │ 4-signal composite │   │ |
| │   │  World      │   │ Flash (LLM)  │   │                    │   │ |
| │   │  dynamics   │   │      +       │   │ ltv_cac      35%   │   │ |
| │   │             │   │ Rule-based   │   │ mrr_growth   30%   │   │ |
| │   │             │   │ fallback     │   │ burn_eff     20%   │   │ |
| │   └─────────────┘   └──────────────┘   │ survival     15%   │   │ |
| │                                        └────────────────────┘   │ |
| └──────────────────────────┬──────────────────────────────────────┘ |
|                            │                                        |
|                            │  Deployed to                           |
|                            ▼                                        |
|               🤗 HuggingFace Spaces (Docker)                        |
| ``` |
|
|
| ### Component Responsibilities |
|
|
| | File | Role | |
| |---|---| |
| | `env.py` | Core world dynamics: `reset()`, `step()`, `state()`. Orchestrates action effects, crisis triggering, reward scoring | |
| | `models.py` | Pydantic data models: `RevOpsState` (internal world), `RevOpsAction` (agent input), `RevOpsObservation` (agent output + reward) | |
| | `crisis.py` | Adversarial engine: weakness detection, Gemini API call, rule-based fallback, state mutation | |
| | `reward.py` | Four-signal reward rubric: independent scoring functions, weighted composition, termination penalty | |
| | `server.py` | FastAPI server: OpenEnv-compliant `/reset`, `/step`, `/state` endpoints + live judge dashboard with Chart.js plots | |
| | `client.py` | HTTP client wrapper: used by training scripts to interact with the environment via a clean Python API | |
|
|
| --- |
|
|
| ## Training Pipeline |
|
|
| ``` |
| Base model: Qwen2.5-1.5B-Instruct |
|       │ |
|       │  No SFT needed: base instruct model already formats JSON |
|       ▼ |
| GRPO Training (HuggingFace TRL) |
|       │ |
|       ├── Rollout: model generates {"action_type": ..., "magnitude": ...} |
|       │ |
|       ├── Environment step: action → world dynamics → crisis check → reward |
|       │ |
|       ├── Reward: 4-signal composite (ltv_cac, mrr_growth, burn_efficiency, survival) |
|       │ |
|       ├── GRPO update: shift probability mass toward higher-reward trajectories |
|       │ |
|       └── Repeat across episodes (random init each time) |
|       │ |
|       ▼ |
| Unsloth (QLoRA efficiency, runs on single Colab T4) |
|       │ |
|       ▼ |
| Trained adapter saved → tested on held-out episodes → results compared vs baseline |
| ``` |
|
|
| **Why GRPO over PPO:** GRPO removes the need for a separate value model, making it significantly more memory-efficient. This is the key reason training on a 1.5B model fits comfortably on a Colab T4 with Unsloth. |
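GRPO can drop the value model because it scores each sampled completion relative to the other completions drawn for the same prompt. The sketch below shows that group-relative normalization in isolation. It is illustrative only and is not TRL's implementation: `group_relative_advantages` is a hypothetical name, and using the population standard deviation with a small `eps` is an assumption about details that vary between implementations.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list, eps: float = 1e-4) -> list:
    """Normalize rewards within one group of completions for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so the
    group itself serves as the baseline that a value model would provide in PPO.
    """
    if not rewards:
        raise ValueError("rewards must not be empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids divide-by-zero
    return [(r - mu) / (sigma + eps) for r in rewards]
```

If every completion in a group earns the same reward, all advantages are zero and that group contributes no gradient, which is one reason varied rewards from the first episodes matter.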
|
|
| **Why no SFT warmup:** Qwen2.5-1.5B-Instruct already follows JSON format instructions reliably. The environment gives a non-zero reward signal from episode one, so no curriculum is needed to bootstrap valid rollouts. |
|
|
| --- |
|
|
| ## Training Results |
|
|
| Model trained on **Qwen2.5-1.5B-Instruct** via GRPO + Unsloth on a single Colab T4 GPU. |
|
|
|  |
| *Raw GRPO training logs showing loss, reward, and completion metrics across 50 steps.* |
|
|
|  |
| *Left: Mean episode reward. Center: Final MRR at episode end. Right: Company survival rate. π’ Trained model vs π΄ Untrained baseline.* |
|
|
|  |
| *Loss and reward curves during GRPO training. Reward climbs steadily; loss converges.* |
|
|
| | Metric | Baseline (untrained) | Trained | Improvement | |
| |---|---|---|---| |
| | Mean episode reward | ~0.18 | ~0.41 | **+128%** | |
| | Mean final MRR | ~$31,000 | ~$58,000 | **+87%** | |
| | Company survival rate | ~30% | ~70% | **+133%** | |
|
|
| ### Qualitative Behavior Change |
|
|
| The baseline model applies a near-fixed strategy (typically aggressive marketing) regardless of the current state or active crisis. The trained model demonstrates **state-dependent reasoning**: |
|
|
| - Responds to `CAC_EXPLOSION` by pulling back marketing spend and pivoting to retention actions |
| - Responds to `CHURN_SPIKE` by prioritizing `hire_support` and `negotiate_contracts` over growth actions |
| - Scales back `magnitude` on risky actions when cash runway is low |
| - Maintains LTV/CAC above 3× across more episodes by balancing growth and unit economics simultaneously |
|
|
| --- |
|
|
| ## Why This Environment Is Technically Novel |
|
|
| **1. Adversarial non-stationarity.** Unlike environments with fixed dynamics, RevOps Gym is non-stationary: the Crisis Engine reads the agent's state and adapts, so the agent cannot memorize a winning sequence. This forces the agent to learn generalizable reasoning, not pattern matching. |
|
|
| **2. Multi-signal reward design.** Four orthogonal reward functions that share no common exploitable shortcut. The only way to score well across all four signals simultaneously is to actually solve the underlying business problem. |
|
|
| **3. Continuous action magnitude.** Most discrete action environments reduce decisions to binary choices. The `magnitude` parameter (0.1 to 1.0) forces the agent to reason about *how much* to commit to a strategy, which is a harder and more realistic problem. |
|
|
| **4. LLM-in-the-loop adversary.** Gemini 2.0 Flash doesn't just select from a fixed crisis menu: it generates contextually calibrated crisis descriptions and numeric deltas based on the agent's actual current state. Every episode has crises that are semantically and numerically tailored to the agent's specific weaknesses at that moment. |
|
|
| **5. Real-world domain fidelity.** The action effects, reward signals, and crisis scenarios are grounded in actual SaaS business mechanics: LTV/CAC, churn economics, burn rate analysis. A model trained here learns transferable business reasoning, not arbitrary game mechanics. |
|
|
| --- |
|
|
| ## Real-World Relevance |
|
|
| The capabilities RevOps Gym trains are directly applicable to: |
|
|
| - **AI business analysts**: agents that can reason through multi-step financial decisions, not just summarize data |
| - **Executive decision-support tools**: AI that can model "what happens if I cut marketing by 30% while a competitor is aggressive" rather than just answering in the abstract |
| - **RL benchmarking for business domains**: an underexplored area, since most existing work focuses on games, math, and code |
|
|
| The SaaS management domain is rich, verifiable, and economically significant. An LLM that can genuinely reason through these tradeoffs under adversarial pressure represents a meaningful capability advance over current models. |
|
|
| --- |
|
|
| ## Quick Start |
|
|
| ### Python (local) |
|
|
| ```python |
| from revops_gym import RevOpsEnv |
| |
| env = RevOpsEnv(crisis_every=3, difficulty="normal") |
| obs = env.reset() |
| |
| # See what the agent sees |
| print(obs.to_prompt_text()) |
| |
| # Take an action |
| obs = env.step({"action_type": "hire_support", "magnitude": 0.8}) |
| print(f"Reward: {obs.reward_last_step:.3f} | MRR: ${obs.mrr:,.0f} | LTV/CAC: {obs.ltv_cac_ratio:.2f}x") |
| print(f"Reward breakdown: {obs.info['reward_breakdown']}") |
| ``` |
|
|
| ### REST API (HuggingFace Space) |
|
|
| ```bash |
| # Start a new episode |
| curl -X POST https://YOUR_HF_USERNAME-revops-gym.hf.space/reset \ |
| -H "Content-Type: application/json" \ |
| -d '{"difficulty": "normal"}' |
| |
| # Take an action |
| curl -X POST https://YOUR_HF_USERNAME-revops-gym.hf.space/step \ |
| -H "Content-Type: application/json" \ |
| -d '{"action_type": "increase_marketing", "magnitude": 0.6}' |
| |
| # Check current state |
| curl https://YOUR_HF_USERNAME-revops-gym.hf.space/state |
| ``` |
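The same calls can be made from Python with only the standard library. This is a minimal sketch and not the project's `client.py`: `build_post` and `call` are hypothetical helpers, `BASE_URL` reuses the placeholder host from the curl examples, and the assumption that the endpoints return a JSON object comes from the request format rather than from a documented response schema.

```python
import json
import urllib.request

# Placeholder host copied from the curl examples above; replace with a real Space URL.
BASE_URL = "https://YOUR_HF_USERNAME-revops-gym.hf.space"


def build_post(path: str, payload: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a JSON POST request for an endpoint such as /reset or /step."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the response body as JSON (assumed format)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage (requires a running server):
#   obs = call(build_post("/reset", {"difficulty": "normal"}))
#   obs = call(build_post("/step", {"action_type": "increase_marketing", "magnitude": 0.6}))
```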
|
|
| ### Difficulty Levels |
|
|
| | Level | Cash Runway | Churn Rate | CAC | |
| |---|---|---|---| |
| | `easy` | ×1.5 multiplier | ×0.6 multiplier | Base | |
| | `normal` | Base | Base | Base | |
| | `hard` | ×0.6 multiplier | ×1.4 multiplier | ×1.3 multiplier | |
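The table can be read as a set of multipliers applied to a freshly sampled starting state. The sketch below illustrates that reading; `DIFFICULTY_MULTIPLIERS` and `apply_difficulty` are hypothetical names, the state is represented as a plain dict for brevity, and applying the multipliers once at episode reset is an assumption.

```python
# Multipliers transcribed from the table above ("Base" is 1.0).
DIFFICULTY_MULTIPLIERS = {
    "easy":   {"cash_runway": 1.5, "churn_rate": 0.6, "cac": 1.0},
    "normal": {"cash_runway": 1.0, "churn_rate": 1.0, "cac": 1.0},
    "hard":   {"cash_runway": 0.6, "churn_rate": 1.4, "cac": 1.3},
}


def apply_difficulty(base: dict, difficulty: str) -> dict:
    """Return a copy of the base starting state scaled for the given difficulty."""
    try:
        multipliers = DIFFICULTY_MULTIPLIERS[difficulty]
    except KeyError:
        raise ValueError(f"unknown difficulty: {difficulty!r}") from None
    scaled = dict(base)  # leave the caller's dict untouched
    for key, factor in multipliers.items():
        scaled[key] = base[key] * factor
    return scaled
```

For instance, a base state with 10 months of runway, 5% churn, and a $2,000 CAC would start a `hard` episode with 6 months of runway, 7% churn, and a $2,600 CAC.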
|
|
| --- |
|
|
| ## Repository Structure |
|
|
| ``` |
| revops-gym/ |
| ├── openenv.yaml              # OpenEnv manifest (obs space, action space, reward schema) |
| ├── Dockerfile                # HuggingFace Spaces deployment |
| ├── setup.py                  # Package definition |
| ├── requirements.txt          # Dependencies |
| │ |
| ├── revops_gym/ |
| │   ├── __init__.py |
| │   ├── env.py                # Core environment: reset / step / state / action effects |
| │   ├── models.py             # Pydantic models: RevOpsState, RevOpsAction, RevOpsObservation |
| │   ├── crisis.py             # Crisis Engine: weakness detection, Gemini API, rule-based fallback |
| │   ├── reward.py             # RewardRubric: 4-signal composite scoring |
| │   ├── server.py             # FastAPI server: OpenEnv endpoints + live judge dashboard |
| │   └── client.py             # HTTP client for training scripts |
| │ |
| ├── tests/ |
| │   └── test_env.py           # Smoke tests: reset, step, termination, reward sanity |
| │ |
| ├── train_colab.ipynb         # Full GRPO training notebook (TRL + Unsloth, runs on Colab T4) |
| ├── results_comparison.png    # Baseline vs trained: reward, MRR, survival rate |
| └── training_curves.png       # Loss and reward curves during training |
| ``` |
|
|
| --- |
|
|
| ## Minimum Requirements Checklist |
|
|
| - [x] **OpenEnv compliant**: implements `reset()`, `step()`, `state()` per spec; valid `openenv.yaml` manifest |
| - [x] **Training script**: full GRPO training notebook (`train_colab.ipynb`) using HuggingFace TRL + Unsloth, runs on free Colab T4 |
| - [x] **Training evidence**: reward curves, loss curves, and before/after comparison plots committed to repo |
| - [x] **HuggingFace Space**: Docker deployment with live interactive judge dashboard |
| - [x] **Write-up**: this README + blog post linked below |
|
|
| --- |
|
|
| ## Links |
|
|
| | Resource | Link | |
| |---|---| |
| | HuggingFace Space (live environment) | [Sriram611/revops-gym](https://huggingface.co/spaces/Sriram611/revops-gym) |
| | Training Colab Notebook | [Open in Colab](https://colab.research.google.com/drive/1Gg-odqjf1eQLlYZe8LDkqzTtKWb3blz9?usp=sharing) |
| | Blog Post (HuggingFace) | [Read the blog](https://huggingface.co/spaces/Sriram611/revops-gym/blob/main/Blog.md) |
| | Trained Model | [Sriram611/revops-gym-model](https://huggingface.co/Sriram611/revops-gym-model) |
|
|
| --- |
|
|
| *Built for the **OpenEnv Hackathon India, April 2026**.* |
| *Themes: #3.1 World Modeling (Professional Tasks) + #1 Multi-Agent Adversarial.* |