
title: RevOps Gym
emoji: 🚀
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
license: mit
tags:
  - openenv
  - reinforcement-learning
  - llm-training
  - saas-simulation
  - world-modeling
  - adversarial

🚀 RevOps Gym — SaaS Flight Simulator for LLM RL Training

Train a language model to run a B2B SaaS company — under adversarial pressure, with real business tradeoffs, across 30 decision steps.

The Problem This Solves

LLMs can talk about business strategy. But can they actually execute it, step by step, under pressure, with competing constraints and an adversary actively working against them?

That's the gap RevOps Gym targets.

Revenue Operations — the discipline of aligning sales, marketing, and customer success — is one of the most consequential decision-making domains in the modern economy. Every B2B SaaS company lives or dies by a handful of metrics: Monthly Recurring Revenue (MRR), Customer Acquisition Cost (CAC), Lifetime Value (LTV), churn, and cash runway. Decisions around these metrics are multi-step, non-linear, and made under incomplete information — exactly the kind of reasoning that current LLMs struggle with when pushed beyond a single turn.

RevOps Gym creates a faithful, adversarial simulation of this domain — one that an LLM can train on and measurably improve at.

Hackathon Themes

Theme	Coverage
#3.1 — World Modeling: Professional Tasks	Primary — agent operates inside a dynamic business world with real state, real tradeoffs, and causal action effects
#1 — Multi-Agent Interactions	Secondary — the Gemini-powered Crisis Engine is an active adversarial agent competing against the Pilot

Environment Overview

The agent — called the Pilot — must manage a procedurally generated B2B SaaS company across 30 decision steps. The company is randomly initialized each episode (different MRR, CAC, churn, runway), so no two episodes are identical and fixed-sequence strategies cannot succeed.

The win condition: survive 30 steps with MRR above $20,000. The lose condition: MRR drops below the VC floor, cash runway hits zero, or churn exceeds 20%.

What the Agent Observes

Every step, the Pilot sees a structured text dashboard:

=== RevOps Dashboard | Step 12/30 ===
⚠️  ACTIVE CRISIS: CAC_EXPLOSION — Ad costs doubled. Marketing efficiency collapses.
MRR: $63,400  |  Floor: $20,000
CAC: $2,100   |  LTV: $11,800  |  LTV/CAC: 5.62x
Churn: 3.2%   |  Runway: 14.5mo
Marketing spend: $18,200/mo  |  Support quality: 74%
Last reward: 0.312

Available actions: increase_marketing, decrease_marketing, hire_support,
fire_support, discount_campaign, raise_prices, feature_investment,
cut_costs, negotiate_contracts, pivot_segment
Respond ONLY with JSON: {"action_type": "...", "magnitude": 0.0-1.0}

Action Space

10 discrete strategic actions, each with a continuous magnitude parameter (0.1–1.0) that scales the effect intensity:

Action	Effect
`increase_marketing`	Boosts MRR growth, raises spend, improves CAC at scale
`decrease_marketing`	Frees cash, slows growth
`hire_support`	Improves support quality, reduces churn, increases LTV — costs runway
`fire_support`	Saves cash, degrades support quality, raises churn
`discount_campaign`	Short-term MRR spike, hurts LTV
`raise_prices`	Increases LTV and MRR for retained customers, some churn risk
`feature_investment`	Raises LTV and reduces churn, costs runway
`cut_costs`	Extends runway, slows growth slightly
`negotiate_contracts`	Reduces churn, raises LTV, slightly increases CAC
`pivot_segment`	High risk / high reward — probabilistic outcome
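Because the Pilot replies with free-form text, a training loop usually needs to validate the reply before passing it to the environment. The following is a minimal sketch of such a check, not the repository's own code: `parse_action` and `VALID_ACTIONS` are hypothetical names, and the lower magnitude bound is a parameter because the prompt text advertises 0.0-1.0 while the action-space description gives 0.1–1.0.

```python
import json
import math

# Action names copied from the action table.
VALID_ACTIONS = {
    "increase_marketing", "decrease_marketing", "hire_support",
    "fire_support", "discount_campaign", "raise_prices",
    "feature_investment", "cut_costs", "negotiate_contracts",
    "pivot_segment",
}


def parse_action(raw: str, min_mag: float = 0.1, max_mag: float = 1.0) -> dict:
    """Parse a model completion into a validated action dict.

    Hypothetical helper: raises ValueError on malformed input and clamps
    the magnitude into [min_mag, max_mag].
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    action_type = data.get("action_type")
    if not isinstance(action_type, str) or action_type not in VALID_ACTIONS:
        raise ValueError(f"unknown action_type: {action_type!r}")

    try:
        magnitude = float(data.get("magnitude"))
    except (TypeError, ValueError) as exc:
        raise ValueError("magnitude must be a number") from exc
    if not math.isfinite(magnitude):
        raise ValueError("magnitude must be finite")

    # Clamp rather than reject, so slightly out-of-range replies still count.
    magnitude = min(max(magnitude, min_mag), max_mag)
    return {"action_type": action_type, "magnitude": magnitude}
```

Clamping instead of rejecting is a design choice made here for illustration; a stricter trainer could treat out-of-range magnitudes as invalid rollouts.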

Termination Conditions

An episode ends as soon as any of these conditions is true:

mrr < $20,000 (VC floor breached)
cash_runway ≤ 0 (company bankrupt)
churn_rate > 20% (unrecoverable customer loss)
step_number ≥ 30 (episode complete — agent survived)
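The four conditions above can be written as a single check. This is an illustrative sketch rather than the code in `env.py`: the function name, the reason strings, and the order in which the conditions are tested are assumptions.

```python
from typing import Optional


def termination_reason(
    mrr: float,
    cash_runway: float,
    churn_rate: float,
    step_number: int,
    mrr_floor: float = 20_000,
    max_churn: float = 0.20,
    max_steps: int = 30,
) -> Optional[str]:
    """Return a short reason string if the episode is over, else None."""
    if mrr < mrr_floor:
        return "vc_floor_breached"
    if cash_runway <= 0:
        return "bankrupt"
    if churn_rate > max_churn:
        return "unrecoverable_churn"
    if step_number >= max_steps:
        return "survived"
    return None
```

With the dashboard values shown earlier (MRR $63,400, runway 14.5 months, churn 3.2%, step 12), none of the conditions hold and the episode continues.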

The Crisis Engine — What Makes This Environment Novel

Every 3 steps, the Crisis Engine activates. It reads the agent's current state, identifies the weakest metric using a normalized scoring function, and selects the most damaging crisis it can deploy against that exact vulnerability.

This is not random. It is targeted adversarial pressure — the environment actively hunts for the agent's blind spots.

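# From crisis.py: weakness detector
# Each score is normalized so that a higher value means a weaker metric.
# Note: mrr_floor is not defined in this excerpt; it is presumably the
# $20,000 VC floor defined elsewhere in crisis.py.
def _worst_metric(state: RevOpsState) -> str:
    scores = {
        "churn_rate":      state.churn_rate / 0.20,                # 1.0 at the 20% termination threshold
        "cac":             state.cac / 5000,                       # relative to a $5,000 reference CAC
        "support_quality": 1.0 - state.support_quality,            # quality shortfall
        "cash_runway":     max(0, (12 - state.cash_runway) / 12),  # zero once runway reaches 12 months
        "mrr":             max(0, (mrr_floor * 2 - state.mrr) / (mrr_floor * 2)),  # zero at 2x the floor
    }
    return max(scores, key=scores.get)  # metric with the highest weakness score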

When GEMINI_API_KEY is set, Gemini 2.0 Flash is called with the full state context and asked to generate a contextual, creative crisis description with calibrated numeric deltas. If the API is unavailable, the engine falls back to a deterministic rule-based selector — training never stalls.
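To experiment with the weakness detector outside the package, the excerpt can be reproduced as a self-contained sketch. `StateStub` is a hypothetical stand-in for `RevOpsState` that carries only the scored fields, `MRR_FLOOR` is assumed to be the $20,000 VC floor, and `crisis_due` assumes that "every 3 steps" means steps 3, 6, 9, and so on.

```python
from dataclasses import dataclass

MRR_FLOOR = 20_000  # assumption: the VC floor from the termination conditions


@dataclass
class StateStub:
    """Hypothetical stand-in for RevOpsState (only the scored fields)."""
    mrr: float
    cac: float
    churn_rate: float       # fraction, e.g. 0.032 for 3.2%
    support_quality: float  # fraction, e.g. 0.74 for 74%
    cash_runway: float      # months


def worst_metric(state: StateStub, mrr_floor: float = MRR_FLOOR) -> str:
    """Same scoring as the crisis.py excerpt, with mrr_floor made explicit."""
    scores = {
        "churn_rate":      state.churn_rate / 0.20,
        "cac":             state.cac / 5000,
        "support_quality": 1.0 - state.support_quality,
        "cash_runway":     max(0, (12 - state.cash_runway) / 12),
        "mrr":             max(0, (mrr_floor * 2 - state.mrr) / (mrr_floor * 2)),
    }
    return max(scores, key=scores.get)


def crisis_due(step_number: int, crisis_every: int = 3) -> bool:
    """Assumed schedule: a crisis fires on every crisis_every-th step."""
    return step_number > 0 and step_number % crisis_every == 0


dashboard = StateStub(mrr=63_400, cac=2_100, churn_rate=0.032,
                      support_quality=0.74, cash_runway=14.5)
print(worst_metric(dashboard))  # cac
```

For the dashboard values shown earlier the scores are 0.16 (churn), 0.42 (CAC), 0.26 (support), 0 (runway) and 0 (MRR), so CAC is the weakest metric, which is consistent with the CAC_EXPLOSION crisis in that example.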

Available Crises

Crisis	Effect
`CHURN_SPIKE`	Competitor launches aggressive pricing — churn +4%, MRR −8%
`CAC_EXPLOSION`	Ad costs double — CAC ×1.6
`SUPPORT_CRISIS`	Key engineers quit — support quality −25%, churn +2%
`CASH_CRUNCH`	Unexpected infrastructure bill — runway −3 months
`PRICE_WAR`	Competitors slash prices — MRR −12%, CAC ×1.3
`REGULATORY_HIT`	New compliance requirement — runway −2 months, CAC ×1.2
`ENTERPRISE_CHURN`	Top 3 accounts cancelled — MRR −20%
`TALENT_WAR`	Big tech hiring spree — runway −2.5 months, support −10%
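The crisis table can be read as a set of state deltas. The sketch below encodes it as data and applies one crisis to a plain dict; it is not the mutation code in `crisis.py`. Two interpretations are assumed: percentage changes to churn and support quality are additive percentage points, and MRR and CAC changes are multiplicative. The key names (`mrr_mult`, `churn_add`, and so on) are hypothetical.

```python
# Effects transcribed from the crisis table (interpretation assumed, see above).
CRISES = {
    "CHURN_SPIKE":      {"churn_add": 0.04, "mrr_mult": 0.92},
    "CAC_EXPLOSION":    {"cac_mult": 1.6},
    "SUPPORT_CRISIS":   {"support_add": -0.25, "churn_add": 0.02},
    "CASH_CRUNCH":      {"runway_add": -3.0},
    "PRICE_WAR":        {"mrr_mult": 0.88, "cac_mult": 1.3},
    "REGULATORY_HIT":   {"runway_add": -2.0, "cac_mult": 1.2},
    "ENTERPRISE_CHURN": {"mrr_mult": 0.80},
    "TALENT_WAR":       {"runway_add": -2.5, "support_add": -0.10},
}


def apply_crisis(state: dict, name: str) -> dict:
    """Return a new state dict with the named crisis applied."""
    fx = CRISES[name]
    new = dict(state)  # leave the caller's state untouched
    new["mrr"] = state["mrr"] * fx.get("mrr_mult", 1.0)
    new["cac"] = state["cac"] * fx.get("cac_mult", 1.0)
    new["churn_rate"] = state["churn_rate"] + fx.get("churn_add", 0.0)
    new["cash_runway"] = state["cash_runway"] + fx.get("runway_add", 0.0)
    # Support quality is a fraction, so keep it inside [0, 1].
    support = state["support_quality"] + fx.get("support_add", 0.0)
    new["support_quality"] = min(1.0, max(0.0, support))
    return new
```

For example, under these assumptions `PRICE_WAR` applied to a company with $50,000 MRR and a $2,000 CAC leaves it with $44,000 MRR and a $2,600 CAC.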

Why this discourages reward hacking: if the agent over-optimizes one metric and neglects another, the Crisis Engine targets the neglected metric at its next activation, at most three steps later. Over-investing in marketing without controlling CAC? Expect CAC_EXPLOSION. Ignoring support quality to save cash? SUPPORT_CRISIS is coming.

Reward Architecture

Four independent reward signals, composited with calibrated weights. This multi-signal design is central to the environment's integrity: a single scalar objective is easy to game, while four signals that pull in different directions are much harder to exploit at the same time.

Total Reward = (LTV/CAC × 0.35) + (MRR Growth × 0.30) + (Burn Efficiency × 0.20) + (Survival Bonus × 0.15)

If the company dies: −2.0 termination penalty applied on top.
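As a worked version of the formula, the sketch below combines four already-computed signal scores with the stated weights and the termination penalty. The function and dictionary names are hypothetical; only the weights and the −2.0 penalty come from this README.

```python
# Weights from the Total Reward formula.
WEIGHTS = {
    "ltv_cac": 0.35,
    "mrr_growth": 0.30,
    "burn_efficiency": 0.20,
    "survival": 0.15,
}
TERMINATION_PENALTY = -2.0


def total_reward(signals: dict, company_died: bool = False) -> float:
    """Weighted sum of the four signals, plus the penalty if the company died."""
    total = sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)
    if company_died:
        total += TERMINATION_PENALTY
    return total


example = {"ltv_cac": 0.75, "mrr_growth": 0.3, "burn_efficiency": 0.5, "survival": 1.0}
print(round(total_reward(example), 4))  # 0.6025
```

Here 0.75 × 0.35 + 0.3 × 0.30 + 0.5 × 0.20 + 1.0 × 0.15 = 0.6025; the same signals on a step where the company dies would give 0.6025 − 2.0 = −1.3975.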

Signal Breakdown

Signal 1: LTV/CAC Ratio (35%). The "golden ratio" of SaaS health, with a target of ≥ 3×. The score is nonlinear: below 1× it is negative (the company loses money on each customer), between 1× and 3× it scales up to 0.75, and above 3× it keeps rewarding improvement up to a ceiling of 1.0.

Signal 2: MRR Growth (30%). Measures revenue trajectory relative to the previous step: +10% growth scores 1.0, flat scores 0.3, and −20% scores 0. On the first step, where there is no previous value to compare against, the signal rewards being above the VC floor.

Signal 3: Burn Efficiency (20%). Penalizes unsustainable marketing spend (a marketing-to-MRR ratio above the 50% ceiling). It also penalizes poor support quality as a proxy for hidden churn risk, so an agent that ignores support quality will see this signal degrade even if spend looks fine.

Signal 4: Survival Bonus (15%). A binary floor check (MRR > $20K, runway > 3 months) plus a runway health bonus (up to +0.5 for 24+ months of runway). The signal is halved if churn exceeds 10%.
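To make one of these signals concrete, here is a rough sketch of the MRR growth score. The README only fixes three anchor points (−20% gives 0, flat gives 0.3, +10% gives 1.0); the straight-line interpolation between them and the clamping outside that range are assumptions, and the actual curve in `reward.py` may differ. The first-step special case is left out.

```python
def mrr_growth_score(prev_mrr: float, mrr: float) -> float:
    """Piecewise-linear score through (-20%, 0.0), (0%, 0.3), (+10%, 1.0).

    Interpolation and clamping are assumptions made for illustration.
    """
    growth = (mrr - prev_mrr) / prev_mrr
    if growth <= -0.20:
        return 0.0
    if growth >= 0.10:
        return 1.0
    if growth < 0:
        # Shrinking revenue: rise from 0.0 at -20% to 0.3 at 0%.
        return 0.3 * (growth + 0.20) / 0.20
    # Growing revenue: rise from 0.3 at 0% to 1.0 at +10%.
    return 0.3 + 0.7 * growth / 0.10
```

Under this interpolation, the slope is much steeper on the growth side than on the decline side, so a small amount of growth is rewarded more strongly than an equally small decline is punished.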

System Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        TRAINING LOOP                            │
│                                                                 │
│   LLM Agent (Qwen2.5-1.5B)  ◄──── Prompt text observation     │
│          │                                                      │
│          │  JSON action {"action_type": ..., "magnitude": ...} │
│          ▼                                                      │
│   TRL GRPOTrainer  ◄──── Reward signal (4-signal composite)    │
│          │                                                      │
│          │  Policy update via GRPO                             │
│          ▼                                                      │
│   Unsloth (memory efficiency + fast rollout)                   │
└──────────────────────────┬──────────────────────────────────────┘
                           │  HTTP (OpenEnv API)
┌──────────────────────────▼──────────────────────────────────────┐
│                     ENVIRONMENT SERVER                          │
│                     (FastAPI / Docker)                          │
│                                                                 │
│   POST /reset  →  RevOpsEnv.reset()  →  Random episode init    │
│   POST /step   →  RevOpsEnv.step()   →  Action + crisis + reward│
│   GET  /state  →  RevOpsEnv.state()  →  Current observation    │
│                                                                 │
│   ┌─────────────┐   ┌──────────────┐   ┌────────────────────┐  │
│   │   env.py    │   │   crisis.py  │   │    reward.py       │  │
│   │             │   │              │   │                    │  │
│   │ _apply_     │──►│ CrisisEngine │   │ RewardRubric       │  │
│   │ action()    │   │              │   │                    │  │
│   │             │   │ Gemini 2.0   │   │ 4-signal composite │  │
│   │ World       │   │ Flash (LLM)  │   │                    │  │
│   │ dynamics    │   │    +         │   │ ltv_cac    35%     │  │
│   │             │   │ Rule-based   │   │ mrr_growth 30%     │  │
│   │             │   │ fallback     │   │ burn_eff   20%     │  │
│   └─────────────┘   └──────────────┘   │ survival   15%     │  │
│                                        └────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
                           │
                           │  Deployed to
                           ▼
              🤗 HuggingFace Spaces (Docker)

Component Responsibilities

File	Role
`env.py`	Core world dynamics — `reset()`, `step()`, `state()`. Orchestrates action effects, crisis triggering, reward scoring
`models.py`	Pydantic data models: `RevOpsState` (internal world state), `RevOpsAction` (the agent's chosen action, sent to the environment), `RevOpsObservation` (what the environment returns to the agent, including the reward)
`crisis.py`	Adversarial engine — weakness detection, Gemini API call, rule-based fallback, state mutation
`reward.py`	Four-signal reward rubric — independent scoring functions, weighted composition, termination penalty
`server.py`	FastAPI server — OpenEnv-compliant `/reset`, `/step`, `/state` endpoints + live judge dashboard with Chart.js plots
`client.py`	HTTP client wrapper — used by training scripts to interact with the environment via clean Python API

Training Pipeline

Base model: Qwen2.5-1.5B-Instruct
     │
     │  No SFT needed — base instruct model already formats JSON
     ▼
GRPO Training (HuggingFace TRL)
     │
     ├── Rollout: model generates {"action_type": ..., "magnitude": ...}
     │
     ├── Environment step: action → world dynamics → crisis check → reward
     │
     ├── Reward: 4-signal composite (ltv_cac, mrr_growth, burn_efficiency, survival)
     │
     ├── GRPO update: shift probability mass toward higher-reward trajectories
     │
     └── Repeat across episodes (random init each time)
     │
     ▼
Unsloth (QLoRA efficiency — runs on single Colab T4)
     │
     ▼
Trained adapter saved → tested on held-out episodes → results compared vs baseline

Why GRPO over PPO: GRPO removes the need for a separate value model by estimating advantages from a group of completions sampled for the same prompt, which makes it significantly more memory-efficient. This is the key reason training a 1.5B model fits comfortably on a Colab T4 with Unsloth.
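The idea that replaces the value model is the group-relative advantage: each completion's reward is compared with the mean reward of the other completions sampled for the same prompt. The sketch below shows only that normalization step in plain Python. It is not TRL's implementation; the function name is hypothetical, and details such as the epsilon value and population versus sample standard deviation vary between implementations.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for the same prompt.

    A completion that beats the group mean gets a positive advantage,
    one that falls below it gets a negative advantage.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a choice made for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled actions for the same dashboard, scored by the environment.
print(group_relative_advantages([0.1, 0.4, 0.4, 0.7]))
```

Because the baseline is the group mean, no learned critic is needed; the cost is that several completions must be sampled per prompt.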

Why no SFT warmup: Qwen2.5-1.5B-Instruct already follows JSON format instructions reliably. The environment gives a non-zero reward signal from episode one — no curriculum needed to bootstrap valid rollouts.

Training Results

The model (Qwen2.5-1.5B-Instruct) was trained with GRPO + Unsloth on a single Colab T4 GPU.

Figure: Raw GRPO training logs showing loss, reward, and completion metrics across 50 steps.

Figure: Baseline vs trained comparison. Left: mean episode reward. Center: final MRR at episode end. Right: company survival rate. 🟢 Trained model vs 🔴 untrained baseline.

Figure: Loss and reward curves during GRPO training. Reward climbs steadily; loss converges.

Metric	Baseline (untrained)	Trained	Relative improvement
Mean episode reward	~0.18	~0.41	+128%
Mean final MRR	~$31,000	~$58,000	+87%
Company survival rate	~30%	~70%	+133%
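The last column is a relative change, (trained − baseline) / baseline, computed from the approximate values in the table. A short check of that arithmetic, with a hypothetical helper name:

```python
def relative_improvement(baseline: float, trained: float) -> float:
    """Relative change from baseline to trained, in percent."""
    return (trained - baseline) / baseline * 100.0


# Approximate values from the results table.
rows = {
    "Mean episode reward": (0.18, 0.41),
    "Mean final MRR": (31_000, 58_000),
    "Company survival rate": (0.30, 0.70),
}
for name, (baseline, trained) in rows.items():
    print(f"{name}: {relative_improvement(baseline, trained):+.0f}%")
```

This reproduces +128%, +87% and +133%. Note that the survival rate figure is relative: in absolute terms it is a gain of 40 percentage points (from about 30% to about 70%).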

Qualitative Behavior Change

The baseline model applies a near-fixed strategy (typically aggressive marketing) regardless of the current state or active crisis. The trained model demonstrates state-dependent reasoning:

Responds to CAC_EXPLOSION by pulling back marketing spend and pivoting to retention actions
Responds to CHURN_SPIKE by prioritizing hire_support and negotiate_contracts over growth actions
Scales back magnitude on risky actions when cash runway is low
Maintains LTV/CAC above 3× across more episodes by balancing growth and unit economics simultaneously

Why This Environment Is Technically Novel

1. Adversarial non-stationarity. Unlike environments with fixed dynamics, the Crisis Engine creates a genuinely non-stationary world. The agent cannot memorize a winning sequence — the environment reads its state and adapts. This forces the agent to learn generalizable reasoning, not pattern matching.

2. Multi-signal reward design. Four orthogonal reward functions that share no common exploitable shortcut. The only way to score well across all four signals simultaneously is to actually solve the underlying business problem.

3. Continuous action magnitude. A purely discrete action space reduces each decision to picking one item from a fixed menu. The magnitude parameter (0.1–1.0) additionally forces the agent to reason about how much to commit to a strategy, which is a harder and more realistic problem.

4. LLM-in-the-loop adversary. When GEMINI_API_KEY is set, Gemini 2.0 Flash doesn't just select from a fixed crisis menu: it generates contextually calibrated crisis descriptions and numeric deltas based on the agent's actual current state. Crises are then semantically and numerically tailored to the agent's specific weaknesses at that moment.

5. Real-world domain fidelity. The action effects, reward signals, and crisis scenarios are grounded in actual SaaS business mechanics — LTV/CAC, churn economics, burn rate analysis. A model trained here learns transferable business reasoning, not arbitrary game mechanics.

Real-World Relevance

The capabilities RevOps Gym trains are directly applicable to:

AI business analysts — agents that can reason through multi-step financial decisions, not just summarize data
Executive decision-support tools — AI that can model "what happens if I cut marketing by 30% while a competitor is aggressive" rather than just answering in the abstract
RL benchmarking for business domains: most existing RL work with LLMs focuses on games, math, and code, which leaves business decision-making underexplored

The SaaS management domain is rich, verifiable, and economically significant. An LLM that can genuinely reason through these tradeoffs under adversarial pressure represents a meaningful capability advance over current models.

Quick Start

Python (local)

from revops_gym import RevOpsEnv

env = RevOpsEnv(crisis_every=3, difficulty="normal")
obs = env.reset()

# See what the agent sees
print(obs.to_prompt_text())

# Take an action
obs = env.step({"action_type": "hire_support", "magnitude": 0.8})
print(f"Reward: {obs.reward_last_step:.3f} | MRR: ${obs.mrr:,.0f} | LTV/CAC: {obs.ltv_cac_ratio:.2f}x")
print(f"Reward breakdown: {obs.info['reward_breakdown']}")

REST API (HuggingFace Space)

# Start a new episode
curl -X POST https://YOUR_HF_USERNAME-revops-gym.hf.space/reset \
  -H "Content-Type: application/json" \
  -d '{"difficulty": "normal"}'

# Take an action
curl -X POST https://YOUR_HF_USERNAME-revops-gym.hf.space/step \
  -H "Content-Type: application/json" \
  -d '{"action_type": "increase_marketing", "magnitude": 0.6}'

# Check current state
curl https://YOUR_HF_USERNAME-revops-gym.hf.space/state
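The same calls can be made from Python with only the standard library. This is a sketch, not the bundled `client.py`: `build_post` and `send` are hypothetical helpers, the base URL is the same placeholder used in the curl commands, and the response is assumed to be a JSON body whose exact fields are not documented here.

```python
import json
import urllib.request

# Placeholder host, as in the curl examples; replace with the real Space URL.
BASE_URL = "https://YOUR_HF_USERNAME-revops-gym.hf.space"


def build_post(base_url: str, path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for an endpoint such as /reset or /step."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response (schema assumed)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage (requires a running Space, so it is left commented out):
# obs = send(build_post(BASE_URL, "/reset", {"difficulty": "normal"}))
# obs = send(build_post(BASE_URL, "/step",
#                       {"action_type": "increase_marketing", "magnitude": 0.6}))
```

Separating request construction from sending keeps the payload logic testable without network access.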

Difficulty Levels

Level	Cash Runway	Churn Rate	CAC
`easy`	×1.5 multiplier	×0.6 multiplier	Base
`normal`	Base	Base	Base
`hard`	×0.6 multiplier	×1.4 multiplier	×1.3 multiplier
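The difficulty table amounts to three multipliers applied to the randomly drawn starting values. A minimal sketch of that step follows; the dictionary layout and the `apply_difficulty` name are assumptions, and how `env.py` actually applies the multipliers is not shown in this README.

```python
# Multipliers from the difficulty table ("Base" is written as 1.0).
DIFFICULTY = {
    "easy":   {"cash_runway": 1.5, "churn_rate": 0.6, "cac": 1.0},
    "normal": {"cash_runway": 1.0, "churn_rate": 1.0, "cac": 1.0},
    "hard":   {"cash_runway": 0.6, "churn_rate": 1.4, "cac": 1.3},
}


def apply_difficulty(base: dict, level: str) -> dict:
    """Scale the base starting values by the multipliers for one level.

    Fields without a multiplier (for example MRR) pass through unchanged.
    """
    multipliers = DIFFICULTY[level]
    return {key: value * multipliers.get(key, 1.0) for key, value in base.items()}


start = {"mrr": 50_000, "cash_runway": 10.0, "churn_rate": 0.05, "cac": 2_000}
print(apply_difficulty(start, "hard"))
```

For the example values, `hard` shortens a 10-month runway to 6 months, raises 5% churn to 7%, and raises a $2,000 CAC to $2,600, while MRR is untouched.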

Repository Structure

revops-gym/
├── openenv.yaml              # OpenEnv manifest (obs space, action space, reward schema)
├── Dockerfile                # HuggingFace Spaces deployment
├── setup.py                  # Package definition
├── requirements.txt          # Dependencies
│
├── revops_gym/
│   ├── __init__.py
│   ├── env.py                # Core environment — reset / step / state / action effects
│   ├── models.py             # Pydantic models — RevOpsState, RevOpsAction, RevOpsObservation
│   ├── crisis.py             # Crisis Engine — weakness detection, Gemini API, rule-based fallback
│   ├── reward.py             # RewardRubric — 4-signal composite scoring
│   ├── server.py             # FastAPI server — OpenEnv endpoints + live judge dashboard
│   └── client.py             # HTTP client for training scripts
│
├── tests/
│   └── test_env.py           # Smoke tests — reset, step, termination, reward sanity
│
├── train_colab.ipynb         # Full GRPO training notebook (TRL + Unsloth, runs on Colab T4)
├── results_comparison.png    # Baseline vs trained — reward, MRR, survival rate
└── training_curves.png       # Loss and reward curves during training

Minimum Requirements Checklist

OpenEnv compliant — implements reset(), step(), state() per spec; valid openenv.yaml manifest
Training script — full GRPO training notebook (train_colab.ipynb) using HuggingFace TRL + Unsloth, runs on free Colab T4
Training evidence — reward curves, loss curves, and before/after comparison plots committed to repo
HuggingFace Space — Docker deployment with live interactive judge dashboard
Write-up — this README + Blog post linked below

Links

Resource	Link
🤗 HuggingFace Space (live environment)	Sriram611/revops-gym
📓 Training Colab Notebook	Open in Colab
📝 Blog Post (HuggingFace)	Read the blog
🤗 Trained Model	Sriram611/revops-gym-model

Built for the OpenEnv Hackathon India, April 2026. Themes: #3.1 World Modeling (Professional Tasks) + #1 Multi-Agent Adversarial.