	---
	title: RevOps Gym
	emoji: 🚀
	colorFrom: blue
	colorTo: green
	sdk: docker
	pinned: false
	license: mit
	tags:
	- openenv
	- reinforcement-learning
	- llm-training
	- saas-simulation
	- world-modeling
	- adversarial
	---

	# 🚀 RevOps Gym — SaaS Flight Simulator for LLM RL Training

	> Train a language model to run a B2B SaaS company — under adversarial pressure, with real business tradeoffs, across 30 decision steps.

	[![OpenEnv](https://img.shields.io/badge/OpenEnv-compliant-blue)](https://github.com/huggingface/openenv)
	[![HF Space](https://img.shields.io/badge/🤗%20HuggingFace-Space-yellow)](https://huggingface.co/spaces/Sriram611/revops-gym)
	[![Colab](https://img.shields.io/badge/Training-Colab%20Notebook-orange)](https://colab.research.google.com/drive/1Gg-odqjf1eQLlYZe8LDkqzTtKWb3blz9?usp=sharing)
	[![License: MIT](https://img.shields.io/badge/License-MIT-green)](LICENSE)

	---

	## The Problem This Solves

	LLMs can talk about business strategy. But can they actually execute it, step by step, under pressure, with competing constraints and an adversary actively working against them?

	That's the gap RevOps Gym targets.

	Revenue Operations — the discipline of aligning sales, marketing, and customer success — is one of the most consequential decision-making domains in the modern economy. Every B2B SaaS company lives or dies by a handful of metrics: Monthly Recurring Revenue (MRR), Customer Acquisition Cost (CAC), Lifetime Value (LTV), churn, and cash runway. Decisions around these metrics are multi-step, non-linear, and made under incomplete information — exactly the kind of reasoning that current LLMs struggle with when pushed beyond a single turn.

	RevOps Gym creates a faithful, adversarial simulation of this domain — one that an LLM can train on and measurably improve at.

	---

	## Hackathon Themes

	| Theme | Coverage |
	|---|---|
	| #3.1 World Modeling: Professional Tasks | Primary: the agent operates inside a dynamic business world with real state, real tradeoffs, and causal action effects |
	| #1 Multi-Agent Interactions | Secondary: the Gemini-powered Crisis Engine is an active adversarial agent competing against the Pilot |

	---

	## Environment Overview

	The agent — called the Pilot — must manage a procedurally generated B2B SaaS company across 30 decision steps. The company is randomly initialized each episode (different MRR, CAC, churn, runway), so no two episodes are identical and fixed-sequence strategies cannot succeed.

	The win condition: survive 30 steps with MRR above $20,000.
	The lose condition: MRR drops below the VC floor, cash runway hits zero, or churn exceeds 20%.

	### What the Agent Observes

	Every step, the Pilot sees a structured text dashboard:

	```
	=== RevOps Dashboard | Step 12/30 ===
	⚠️ ACTIVE CRISIS: CAC_EXPLOSION — Ad costs doubled. Marketing efficiency collapses.
	MRR: $63,400 | Floor: $20,000
	CAC: $2,100 | LTV: $11,800 | LTV/CAC: 5.62x
	Churn: 3.2% | Runway: 14.5mo
	Marketing spend: $18,200/mo | Support quality: 74%
	Last reward: 0.312

	Available actions: increase_marketing, decrease_marketing, hire_support,
	fire_support, discount_campaign, raise_prices, feature_investment,
	cut_costs, negotiate_contracts, pivot_segment
	Respond ONLY with JSON: {"action_type": "...", "magnitude": 0.0-1.0}
	```

	### Action Space

	10 discrete strategic actions, each with a continuous `magnitude` parameter (0.1–1.0) that scales the effect intensity:

	| Action | Effect |
	|---|---|
	| `increase_marketing` | Boosts MRR growth, raises spend, improves CAC at scale |
	| `decrease_marketing` | Frees cash, slows growth |
	| `hire_support` | Improves support quality, reduces churn, increases LTV; costs runway |
	| `fire_support` | Saves cash, degrades support quality, raises churn |
	| `discount_campaign` | Short-term MRR spike, hurts LTV |
	| `raise_prices` | Increases LTV and MRR for retained customers, some churn risk |
	| `feature_investment` | Raises LTV and reduces churn, costs runway |
	| `cut_costs` | Extends runway, slows growth slightly |
	| `negotiate_contracts` | Reduces churn, raises LTV, slightly increases CAC |
	| `pivot_segment` | High risk / high reward: probabilistic outcome |
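
The dashboard asks the model to reply with a single JSON object, so a training loop has to turn raw completions into valid actions. The sketch below shows one way to do that. `parse_action`, its default magnitude of 0.5, and its policy of clamping magnitude into 0.1 to 1.0 and returning `None` on malformed output are assumptions for illustration, not part of the RevOps Gym codebase.

```python
import json
from typing import Optional

# The ten action names listed in the table above.
VALID_ACTIONS = {
    "increase_marketing", "decrease_marketing", "hire_support",
    "fire_support", "discount_campaign", "raise_prices",
    "feature_investment", "cut_costs", "negotiate_contracts",
    "pivot_segment",
}


def parse_action(completion: str) -> Optional[dict]:
    """Hypothetical helper: extract a valid action dict from model output.

    Returns None when the text contains no usable JSON action.
    """
    # Tolerate chatter around the JSON by slicing from the first "{" to the last "}".
    start, end = completion.find("{"), completion.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(completion[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    action_type = data.get("action_type")
    if not isinstance(action_type, str) or action_type not in VALID_ACTIONS:
        return None
    try:
        # Default of 0.5 for a missing magnitude is an assumption.
        magnitude = float(data.get("magnitude", 0.5))
    except (TypeError, ValueError):
        return None
    # Clamp into the documented 0.1-1.0 range (assumed policy).
    magnitude = min(1.0, max(0.1, magnitude))
    return {"action_type": action_type, "magnitude": magnitude}
```

Returning `None` instead of raising lets the caller decide how to treat an invalid completion, for example by assigning it a fixed low reward.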

	### Termination Conditions

	An episode ends when any of these are true:

	- `mrr < $20,000` (VC floor breached)
	- `cash_runway ≤ 0` (company bankrupt)
	- `churn_rate > 20%` (unrecoverable customer loss)
	- `step_number ≥ 30` (episode complete — agent survived)
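
The four rules above can be sketched as a single check. This is a minimal illustration, not the code in `env.py`: the function name, the returned labels, and the choice to test failure conditions before the step limit are all assumptions.

```python
from typing import Optional

MRR_FLOOR = 20_000   # VC floor in dollars
MAX_CHURN = 0.20     # 20% churn ceiling
MAX_STEPS = 30       # episode length


def termination_reason(mrr: float, cash_runway: float,
                       churn_rate: float, step_number: int) -> Optional[str]:
    """Return a label for the first termination rule that fires, else None."""
    if mrr < MRR_FLOOR:
        return "vc_floor_breached"
    if cash_runway <= 0:
        return "bankrupt"
    if churn_rate > MAX_CHURN:
        return "unrecoverable_churn"
    if step_number >= MAX_STEPS:
        return "survived"
    return None


# The step-12 dashboard shown earlier is still a live episode.
print(termination_reason(63_400, 14.5, 0.032, 12))  # None
```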

	---

	## The Crisis Engine — What Makes This Environment Novel

	Every 3 steps, the Crisis Engine activates. It reads the agent's current state, identifies the weakest metric using a normalized scoring function, and selects the most damaging crisis it can deploy against that exact vulnerability.

	This is not random. It is targeted adversarial pressure — the environment actively hunts for the agent's blind spots.

	```python
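	# From crisis.py: weakness detector. Each score is normalized so that a
	# higher value means a weaker metric; the highest-scoring metric is targeted.
	# mrr_floor is defined elsewhere in the module (presumably the $20,000 VC floor).
	def _worst_metric(state: RevOpsState) -> str:
	    scores = {
	        "churn_rate": state.churn_rate / 0.20,
	        "cac": state.cac / 5000,
	        "support_quality": 1.0 - state.support_quality,
	        "cash_runway": max(0, (12 - state.cash_runway) / 12),
	        "mrr": max(0, (mrr_floor * 2 - state.mrr) / (mrr_floor * 2)),
	    }
	    return max(scores, key=scores.get)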
	```
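
As a worked example, the same scoring can be run against the step-12 dashboard shown earlier. The sketch is self-contained: `DashboardState` is a hypothetical stand-in for `RevOpsState`, and passing `mrr_floor` as a parameter with a $20,000 default is an assumption made so the snippet runs on its own.

```python
from dataclasses import dataclass


@dataclass
class DashboardState:
    """Hypothetical stand-in for RevOpsState with only the scored fields."""
    mrr: float
    cac: float
    churn_rate: float
    cash_runway: float
    support_quality: float


def worst_metric(state: DashboardState, mrr_floor: float = 20_000) -> str:
    # Same normalization as the snippet above; higher score = weaker metric.
    scores = {
        "churn_rate": state.churn_rate / 0.20,
        "cac": state.cac / 5000,
        "support_quality": 1.0 - state.support_quality,
        "cash_runway": max(0, (12 - state.cash_runway) / 12),
        "mrr": max(0, (mrr_floor * 2 - state.mrr) / (mrr_floor * 2)),
    }
    return max(scores, key=scores.get)


# Values from the step-12 dashboard.
state = DashboardState(mrr=63_400, cac=2_100, churn_rate=0.032,
                       cash_runway=14.5, support_quality=0.74)
weakest = worst_metric(state)
print(weakest)  # cac
```

The scores come out as churn 0.16, CAC 0.42, support quality 0.26, and zero for runway and MRR, so CAC is the weakest metric. That matches the `CAC_EXPLOSION` crisis active in that dashboard.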

	When `GEMINI_API_KEY` is set, Gemini 2.0 Flash is called with the full state context and asked to generate a contextual, creative crisis description with calibrated numeric deltas. If the API is unavailable, the engine falls back to a deterministic rule-based selector — training never stalls.

	### Available Crises

	| Crisis | Effect |
	|---|---|
	| `CHURN_SPIKE` | Competitor launches aggressive pricing: churn +4%, MRR −8% |
	| `CAC_EXPLOSION` | Ad costs double: CAC ×1.6 |
	| `SUPPORT_CRISIS` | Key engineers quit: support quality −25%, churn +2% |
	| `CASH_CRUNCH` | Unexpected infrastructure bill: runway −3 months |
	| `PRICE_WAR` | Competitors slash prices: MRR −12%, CAC ×1.3 |
	| `REGULATORY_HIT` | New compliance requirement: runway −2 months, CAC ×1.2 |
	| `ENTERPRISE_CHURN` | Top 3 accounts cancelled: MRR −20% |
	| `TALENT_WAR` | Big tech hiring spree: runway −2.5 months, support −10% |
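
To make the table concrete, here is a rough sketch of how these effects could be applied to a state. It is not the code in `crisis.py`. The dict-based state, the `apply_crisis` name, and the reading of churn and support-quality changes as absolute percentage-point shifts (with MRR changes as relative) are assumptions.

```python
# Additive deltas and multiplicative factors transcribed from the table above.
# Treating "+4%" churn and "-25%" support quality as absolute percentage-point
# shifts is an assumption; MRR changes are treated as relative.
CRISES = {
    "CHURN_SPIKE":      {"add": {"churn_rate": 0.04}, "mul": {"mrr": 0.92}},
    "CAC_EXPLOSION":    {"add": {}, "mul": {"cac": 1.6}},
    "SUPPORT_CRISIS":   {"add": {"support_quality": -0.25, "churn_rate": 0.02}, "mul": {}},
    "CASH_CRUNCH":      {"add": {"cash_runway": -3.0}, "mul": {}},
    "PRICE_WAR":        {"add": {}, "mul": {"mrr": 0.88, "cac": 1.3}},
    "REGULATORY_HIT":   {"add": {"cash_runway": -2.0}, "mul": {"cac": 1.2}},
    "ENTERPRISE_CHURN": {"add": {}, "mul": {"mrr": 0.80}},
    "TALENT_WAR":       {"add": {"cash_runway": -2.5, "support_quality": -0.10}, "mul": {}},
}


def apply_crisis(state: dict, name: str) -> dict:
    """Return a new state dict with the named crisis applied."""
    crisis = CRISES[name]
    new_state = dict(state)  # leave the caller's state untouched
    for key, factor in crisis["mul"].items():
        new_state[key] *= factor
    for key, delta in crisis["add"].items():
        new_state[key] += delta
    # Keep support quality inside [0, 1] (assumed bound).
    new_state["support_quality"] = min(1.0, max(0.0, new_state["support_quality"]))
    return new_state


before = {"mrr": 50_000.0, "cac": 2_000.0, "churn_rate": 0.03,
          "cash_runway": 12.0, "support_quality": 0.80}
after = apply_crisis(before, "PRICE_WAR")  # MRR drops 12%, CAC rises 30%
```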

	Why this discourages reward hacking: if the agent over-optimizes one metric and lets another slide, the Crisis Engine targets the neglected metric at its next activation, at most three steps later. Over-investing in marketing without controlling CAC? Expect `CAC_EXPLOSION`. Ignoring support quality to save cash? `SUPPORT_CRISIS` is coming.

	---

	## Reward Architecture

	Four independent reward signals, composited with calibrated weights. This multi-signal design is central to the environment's integrity: a single scalar reward is easy to game, while four signals that measure different aspects of company health are much harder to exploit at once.

	```
	Total Reward = (LTV/CAC × 0.35) + (MRR Growth × 0.30) + (Burn Efficiency × 0.20) + (Survival Bonus × 0.15)
	```

	If the company dies: −2.0 termination penalty applied on top.
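
The weighted sum and the termination penalty can be sketched as follows. The weights and the −2.0 penalty come from the formula above; the function name and the dict of per-signal scores are illustrative assumptions, and `reward.py` may structure this differently.

```python
# Weights from the formula above.
WEIGHTS = {
    "ltv_cac": 0.35,
    "mrr_growth": 0.30,
    "burn_efficiency": 0.20,
    "survival": 0.15,
}
TERMINATION_PENALTY = -2.0


def total_reward(signals: dict, company_died: bool = False) -> float:
    """Weighted sum of the four signal scores, plus the death penalty if any."""
    reward = sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)
    if company_died:
        reward += TERMINATION_PENALTY
    return reward


# Hypothetical per-signal scores for one step.
example = {"ltv_cac": 0.8, "mrr_growth": 0.3, "burn_efficiency": 0.6, "survival": 1.0}
step_reward = total_reward(example)  # 0.28 + 0.09 + 0.12 + 0.15 = 0.64
```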

	### Signal Breakdown

	Signal 1 — LTV/CAC Ratio (35%)
	The "golden ratio" of SaaS health. Target is ≥ 3×. Score is nonlinear: below 1× gives negative signal (losing money per customer), between 1–3× scales to 0.75, above 3× rewards further improvement up to a ceiling of 1.0.

	Signal 2 — MRR Growth (30%)
	Measures revenue trajectory relative to the previous step. +10% growth → score of 1.0. Flat → 0.3. −20% → 0. First step rewards being above the VC floor.
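
One function consistent with those three anchor points is a piecewise-linear curve, sketched below. The interpolation between anchors, the clamping outside them, and the function name are assumptions; the actual shape in `reward.py` may differ, and the first-step rule is not modeled here.

```python
def mrr_growth_score(prev_mrr: float, mrr: float) -> float:
    """Piecewise-linear score through (-20%, 0.0), (0%, 0.3), (+10%, 1.0).

    Assumes prev_mrr > 0.
    """
    growth = (mrr - prev_mrr) / prev_mrr
    if growth <= -0.20:
        return 0.0
    if growth >= 0.10:
        return 1.0
    if growth < 0.0:
        # -20% .. 0% maps linearly onto 0.0 .. 0.3
        return 0.3 * (growth + 0.20) / 0.20
    # 0% .. +10% maps linearly onto 0.3 .. 1.0
    return 0.3 + 0.7 * growth / 0.10


flat = mrr_growth_score(60_000, 60_000)    # 0.3
strong = mrr_growth_score(60_000, 66_000)  # 1.0
```

Under this reading the slope is much steeper above zero growth than below it, so small positive growth is rewarded far more than small declines are punished.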

	Signal 3 — Burn Efficiency (20%)
	Penalizes unsustainable marketing spend (a marketing-to-MRR ratio above the 50% ceiling). It also penalizes poor support quality as a proxy for hidden churn risk: an agent that ignores support quality will see this signal degrade even if spend looks fine.

	Signal 4 — Survival Bonus (15%)
	Binary floor check (MRR > $20K, runway > 3 months) with a runway health bonus (up to +0.5 for 24+ months of runway). Halved if churn exceeds 10%.

	---

	## System Architecture

	```
	┌─────────────────────────────────────────────────────────────────┐
	│                          TRAINING LOOP                          │
	│                                                                 │
	│  LLM Agent (Qwen2.5-1.5B)  ◄────  Prompt text observation       │
	│       │                                                         │
	│       │  JSON action {"action_type": ..., "magnitude": ...}     │
	│       ▼                                                         │
	│  TRL GRPOTrainer  ◄────  Reward signal (4-signal composite)     │
	│       │                                                         │
	│       │  Policy update via GRPO                                 │
	│       ▼                                                         │
	│  Unsloth (memory efficiency + fast rollout)                     │
	└──────────────────────────┬──────────────────────────────────────┘
	                           │  HTTP (OpenEnv API)
	┌──────────────────────────▼──────────────────────────────────────┐
	│                       ENVIRONMENT SERVER                        │
	│                       (FastAPI / Docker)                        │
	│                                                                 │
	│  POST /reset  →  RevOpsEnv.reset()  →  Random episode init      │
	│  POST /step   →  RevOpsEnv.step()   →  Action + crisis + reward │
	│  GET  /state  →  RevOpsEnv.state()  →  Current observation      │
	│                                                                 │
	│  ┌─────────────┐   ┌──────────────┐   ┌────────────────────┐    │
	│  │ env.py      │   │ crisis.py    │   │ reward.py          │    │
	│  │             │   │              │   │                    │    │
	│  │ _apply_     │──►│ CrisisEngine │   │ RewardRubric       │    │
	│  │ action()    │   │              │   │                    │    │
	│  │             │   │ Gemini 2.0   │   │ 4-signal composite │    │
	│  │ World       │   │ Flash (LLM)  │   │                    │    │
	│  │ dynamics    │   │      +       │   │ ltv_cac     35%    │    │
	│  │             │   │ Rule-based   │   │ mrr_growth  30%    │    │
	│  │             │   │ fallback     │   │ burn_eff    20%    │    │
	│  └─────────────┘   └──────────────┘   │ survival    15%    │    │
	│                                       └────────────────────┘    │
	└─────────────────────────────────────────────────────────────────┘
	                           │
	                           │  Deployed to
	                           ▼
	             🤗 HuggingFace Spaces (Docker)
	```

	### Component Responsibilities

	| File | Role |
	|---|---|
	| `env.py` | Core world dynamics: `reset()`, `step()`, `state()`. Orchestrates action effects, crisis triggering, reward scoring |
	| `models.py` | Pydantic data models: `RevOpsState` (internal world), `RevOpsAction` (agent input), `RevOpsObservation` (agent output + reward) |
	| `crisis.py` | Adversarial engine: weakness detection, Gemini API call, rule-based fallback, state mutation |
	| `reward.py` | Four-signal reward rubric: independent scoring functions, weighted composition, termination penalty |
	| `server.py` | FastAPI server: OpenEnv-compliant `/reset`, `/step`, `/state` endpoints + live judge dashboard with Chart.js plots |
	| `client.py` | HTTP client wrapper: used by training scripts to interact with the environment via a clean Python API |

	---

	## Training Pipeline

	```
	Base model: Qwen2.5-1.5B-Instruct
	│
	│ No SFT needed — base instruct model already formats JSON
	▼
	GRPO Training (HuggingFace TRL)
	│
	├── Rollout: model generates {"action_type": ..., "magnitude": ...}
	│
	├── Environment step: action → world dynamics → crisis check → reward
	│
	├── Reward: 4-signal composite (ltv_cac, mrr_growth, burn_efficiency, survival)
	│
	├── GRPO update: shift probability mass toward higher-reward trajectories
	│
	└── Repeat across episodes (random init each time)
	│
	▼
	Unsloth (QLoRA efficiency — runs on single Colab T4)
	│
	▼
	Trained adapter saved → tested on held-out episodes → results compared vs baseline
	```

	Why GRPO over PPO: GRPO removes the need for a separate value model, making it significantly more memory-efficient. This is the key reason training on a 1.5B model fits comfortably on a Colab T4 with Unsloth.
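
GRPO samples a group of completions for the same prompt and scores each one against the group's own statistics, which is what stands in for the learned value model. A minimal plain-Python sketch of that normalization follows. The function name and the sample rewards are illustrative, and the exact normalization inside TRL's `GRPOTrainer` may differ in detail.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize each reward against its own sampling group.

    The group mean acts as the baseline, so no critic network is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every completion earns the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions sampled from one dashboard prompt.
rewards = [0.12, 0.31, 0.45, 0.20]
advantages = group_relative_advantages(rewards)
```

Completions that beat the group mean get a positive advantage and are reinforced; the rest are pushed down. If all completions in a group earn the same reward, every advantage is zero and that group contributes no learning signal.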

	Why no SFT warmup: Qwen2.5-1.5B-Instruct already follows JSON format instructions reliably. The environment gives a non-zero reward signal from episode one — no curriculum needed to bootstrap valid rollouts.

	---

	## Training Results

	Qwen2.5-1.5B-Instruct was fine-tuned with GRPO and Unsloth on a single Colab T4 GPU.

	![Training logs](training_logs.jpeg)
	Raw GRPO training logs showing loss, reward, and completion metrics across 50 steps.

	![Results comparison](results_comparison.png)
	Left: Mean episode reward. Center: Final MRR at episode end. Right: Company survival rate. 🟢 Trained model vs 🔴 Untrained baseline.

	![Training curves](training_curves.png)
	Loss and reward curves during GRPO training. Reward climbs steadily; loss converges.

	| Metric | Baseline (untrained) | Trained | Improvement |
	|---|---|---|---|
	| Mean episode reward | ~0.18 | ~0.41 | +128% |
	| Mean final MRR | ~$31,000 | ~$58,000 | +87% |
	| Company survival rate | ~30% | ~70% | +40 pp (+133% relative) |
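
The improvement column is the relative change from baseline to trained, which can be checked from the approximate values in the table. The helper name below is illustrative.

```python
def relative_improvement(baseline: float, trained: float) -> int:
    """Relative change from baseline to trained, as a rounded percentage."""
    return round((trained - baseline) / baseline * 100)


print(relative_improvement(0.18, 0.41))      # 128
print(relative_improvement(31_000, 58_000))  # 87
print(relative_improvement(30, 70))          # 133
```

For the survival rate, the absolute gain is 40 percentage points (30% to 70%).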

	### Qualitative Behavior Change

	The baseline model applies a near-fixed strategy (typically aggressive marketing) regardless of the current state or active crisis. The trained model demonstrates state-dependent reasoning:

	- Responds to `CAC_EXPLOSION` by pulling back marketing spend and pivoting to retention actions
	- Responds to `CHURN_SPIKE` by prioritizing `hire_support` and `negotiate_contracts` over growth actions
	- Scales back `magnitude` on risky actions when cash runway is low
	- Maintains LTV/CAC above 3× across more episodes by balancing growth and unit economics simultaneously

	---

	## Why This Environment Is Technically Novel

	1. Adversarial non-stationarity. Unlike environments with fixed dynamics, the Crisis Engine creates a genuinely non-stationary world. The agent cannot memorize a winning sequence — the environment reads its state and adapts. This forces the agent to learn generalizable reasoning, not pattern matching.

	2. Multi-signal reward design. The four reward functions score different aspects of company health, so no single shortcut maximizes all of them. Scoring well across all four at once requires actually solving the underlying business problem.

	3. Continuous action magnitude. A purely discrete action space reduces each decision to picking one option from a fixed menu. The `magnitude` parameter (0.1–1.0) also forces the agent to decide how much to commit to a strategy, which is a harder and more realistic problem.

	4. LLM-in-the-loop adversary. Gemini 2.0 Flash doesn't just select from a fixed crisis menu — it generates contextually calibrated crisis descriptions and numeric deltas based on the agent's actual current state. Every episode has crises that are semantically and numerically tailored to the agent's specific weaknesses at that moment.

	5. Real-world domain fidelity. The action effects, reward signals, and crisis scenarios are grounded in actual SaaS business mechanics — LTV/CAC, churn economics, burn rate analysis. A model trained here learns transferable business reasoning, not arbitrary game mechanics.

	---

	## Real-World Relevance

	The capabilities RevOps Gym trains are directly applicable to:

	- AI business analysts — agents that can reason through multi-step financial decisions, not just summarize data
	- Executive decision-support tools — AI that can model "what happens if I cut marketing by 30% while a competitor is aggressive" rather than just answering in the abstract
	- RL benchmarking for business domains — an underexplored area where most existing work focuses on games, math, and code, leaving an open gap

	The SaaS management domain is rich, verifiable, and economically significant. An LLM that can genuinely reason through these tradeoffs under adversarial pressure represents a meaningful capability advance over current models.

	---

	## Quick Start

	### Python (local)

	```python
	from revops_gym import RevOpsEnv

	env = RevOpsEnv(crisis_every=3, difficulty="normal")
	obs = env.reset()

	# See what the agent sees
	print(obs.to_prompt_text())

	# Take an action
	obs = env.step({"action_type": "hire_support", "magnitude": 0.8})
	print(f"Reward: {obs.reward_last_step:.3f} | MRR: ${obs.mrr:,.0f} | LTV/CAC: {obs.ltv_cac_ratio:.2f}x")
	print(f"Reward breakdown: {obs.info['reward_breakdown']}")
	```

	### REST API (HuggingFace Space)

	```bash
	# Start a new episode
	curl -X POST https://YOUR_HF_USERNAME-revops-gym.hf.space/reset \
	-H "Content-Type: application/json" \
	-d '{"difficulty": "normal"}'

	# Take an action
	curl -X POST https://YOUR_HF_USERNAME-revops-gym.hf.space/step \
	-H "Content-Type: application/json" \
	-d '{"action_type": "increase_marketing", "magnitude": 0.6}'

	# Check current state
	curl https://YOUR_HF_USERNAME-revops-gym.hf.space/state
	```

	### Difficulty Levels

	| Level | Cash Runway | Churn Rate | CAC |
	|---|---|---|---|
	| `easy` | ×1.5 multiplier | ×0.6 multiplier | Base |
	| `normal` | Base | Base | Base |
	| `hard` | ×0.6 multiplier | ×1.4 multiplier | ×1.3 multiplier |
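
As an illustration of how these multipliers could scale a randomly initialized company, here is a small sketch. The multipliers come from the table; the `apply_difficulty` helper, the dict-based state, and the base values are hypothetical, and `env.py` may apply them differently.

```python
# Multipliers from the table above ("Base" is 1.0).
DIFFICULTY = {
    "easy":   {"cash_runway": 1.5, "churn_rate": 0.6, "cac": 1.0},
    "normal": {"cash_runway": 1.0, "churn_rate": 1.0, "cac": 1.0},
    "hard":   {"cash_runway": 0.6, "churn_rate": 1.4, "cac": 1.3},
}


def apply_difficulty(base: dict, level: str = "normal") -> dict:
    """Return a copy of the base state scaled by the level's multipliers."""
    scaled = dict(base)
    for key, factor in DIFFICULTY[level].items():
        scaled[key] = base[key] * factor
    return scaled


# Hypothetical base values for one episode.
base = {"mrr": 50_000.0, "cash_runway": 10.0, "churn_rate": 0.05, "cac": 2_000.0}
hard = apply_difficulty(base, "hard")  # runway 6 months, churn 7%, CAC $2,600
```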

	---

	## Repository Structure

	```
	revops-gym/
	├── openenv.yaml              # OpenEnv manifest (obs space, action space, reward schema)
	├── Dockerfile                # HuggingFace Spaces deployment
	├── setup.py                  # Package definition
	├── requirements.txt          # Dependencies
	│
	├── revops_gym/
	│   ├── __init__.py
	│   ├── env.py                # Core environment: reset / step / state / action effects
	│   ├── models.py             # Pydantic models: RevOpsState, RevOpsAction, RevOpsObservation
	│   ├── crisis.py             # Crisis Engine: weakness detection, Gemini API, rule-based fallback
	│   ├── reward.py             # RewardRubric: 4-signal composite scoring
	│   ├── server.py             # FastAPI server: OpenEnv endpoints + live judge dashboard
	│   └── client.py             # HTTP client for training scripts
	│
	├── tests/
	│   └── test_env.py           # Smoke tests: reset, step, termination, reward sanity
	│
	├── train_colab.ipynb         # Full GRPO training notebook (TRL + Unsloth, runs on Colab T4)
	├── results_comparison.png    # Baseline vs trained: reward, MRR, survival rate
	└── training_curves.png       # Loss and reward curves during training
	```

	---

	## Minimum Requirements Checklist

	- [x] OpenEnv compliant — implements `reset()`, `step()`, `state()` per spec; valid `openenv.yaml` manifest
	- [x] Training script — full GRPO training notebook (`train_colab.ipynb`) using HuggingFace TRL + Unsloth, runs on free Colab T4
	- [x] Training evidence — reward curves, loss curves, and before/after comparison plots committed to repo
	- [x] HuggingFace Space — Docker deployment with live interactive judge dashboard
	- [x] Write-up — this README + Blog post linked below

	---

	## Links

	| Resource | Link |
	|---|---|
	| 🤗 HuggingFace Space (live environment) | [Sriram611/revops-gym](https://huggingface.co/spaces/Sriram611/revops-gym) |
	| 📓 Training Colab Notebook | [Open in Colab](https://colab.research.google.com/drive/1Gg-odqjf1eQLlYZe8LDkqzTtKWb3blz9?usp=sharing) |
	| 📝 Blog Post (HuggingFace) | [Read the blog](https://huggingface.co/spaces/Sriram611/revops-gym/blob/main/Blog.md) |
	| 🤗 Trained Model | [Sriram611/revops-gym-model](https://huggingface.co/Sriram611/revops-gym-model) |

	---

	Built for the OpenEnv Hackathon India, April 2026.
	Themes: #3.1 World Modeling (Professional Tasks) + #1 Multi-Agent Adversarial.