RevOps / project.md

About the Project

The Problem

Every B2B company runs the same painful cycle: marketing generates a flood of inbound leads, drops them into a CRM, and expects the revenue machine to do the rest. It doesn't. 67% of all B2B lost sales are directly caused by poor lead qualification. Only 25% of marketing-generated leads are good enough to hand to a sales rep at all, yet the other 75% still land in someone's inbox. Bad lead data costs the average sales department 550 hours and $32,000 per rep per year in wasted effort. Companies that mis-route leads — assigning them to the wrong rep, wrong territory, or wrong segment — lose 10–30% of their pipeline annually to pure operational leakage, not market conditions, not product fit.

Speed compounds the damage. A lead that receives a response within 5 minutes is 9x more likely to convert than one contacted even 10 minutes later. The average B2B lead response time is still 42 hours. One in four companies never responds to inbound leads at all.

The gap between a high-performing and an underperforming revenue team is not product quality or even sales talent — it is the quality of the operational layer sitting between a form submission and a human conversation. Enterprise B2B SaaS companies with advanced lead scoring and tight routing achieve 40% MQL-to-SQL conversion. The industry average is 13–15%. That 27-point spread is a pure process gap.

This is the problem RevOps-Agent is built to close.


What is RevOps-Agent?

RevOps-Agent is an OpenEnv-compliant reinforcement learning environment that simulates a real B2B Revenue Operations (RevOps) inbox. An AI agent receives inbound sales leads — originating from web forms, whitepapers, or click-to-message ads — and must perform the full qualification and routing workflow that a trained human RevOps specialist would execute.

The agent is not just classifying. It is operating. It must decide:

  • Whether to enrich the lead with company data before scoring it
  • Whether a CRM match exists and what it means for routing
  • What score (0–100) reflects the lead's fit against the Ideal Customer Profile (ICP)
  • Which specific sales rep — by segment, territory, and pipeline load — should receive the lead
  • Whether the lead is a re-engagement from a closed-lost deal, requiring special handling
  • Whether a lead should be disqualified entirely, and under which category
  • When to enrol a lead in a nurture sequence instead of routing to a rep

Every decision has a measurable business consequence, and the environment scores them accordingly.


Why This Environment Exists

Current agent benchmarks overwhelmingly focus on reasoning, coding, and tool-use in isolation. There is almost no RL environment that benchmarks an agent's ability to operate inside a live business workflow with real data dependencies, CRM conflicts, and operational consequences.

RevOps is an ideal domain for this for four reasons:

  1. Structured state, not free text. Lead qualification operates on well-defined data objects — company size, territory, rep roster, opportunity history. This makes observations and action spaces well-typed and graders deterministic, not vibes-based.

  2. Partial progress is meaningful. An agent that enriches, checks CRM, merges the account, and then routes to the wrong rep deserves a different score than an agent that blind-routes immediately. The trajectory matters, not just the final action.

  3. Mistakes have asymmetric costs. Routing a $240,000 re-engagement deal to the wrong rep (instead of the original AE who owns the relationship) is not the same as a missed nurture enrolment. The reward function reflects this directly.

  4. Scale makes automation non-optional. A company receiving 1,000 inbound leads per month cannot manually review each one. The only path to a 40% MQL-to-SQL rate (vs. the 13–15% industry average) is a reliable, intelligent operational agent that processes leads faster and more accurately than any human team.


Environment Design

The environment exposes a standard OpenEnv interface:

  • reset(task_id) → initializes a fresh episode with a synthetic lead scenario and returns the first RevOpsObservation
  • step(action) → processes one RevOpsAction, advances the episode state, and returns a new observation with a shaped reward signal
  • state() → returns the current RevOpsState including action history, accumulated reward, and grader score if the episode is complete
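As a concrete (if toy) illustration of that contract, here is a minimal in-memory stand-in. The class name StubRevOpsEnv, the observation fields, and the per-step reward values are all illustrative assumptions, not the real server implementation:

```python
from dataclasses import dataclass, field

@dataclass
class RevOpsState:
    history: list = field(default_factory=list)  # action types taken so far
    total_reward: float = 0.0

class StubRevOpsEnv:
    """In-memory stand-in for the HTTP environment (hypothetical, for illustration)."""

    def reset(self, task_id: str) -> dict:
        # A real reset would load a synthetic lead scenario for task_id.
        self._state = RevOpsState()
        return {"task_id": task_id, "lead": {"email": "cto@example.com"}, "feedback": ""}

    def step(self, action: dict) -> dict:
        # Toy shaping: a small positive reward for enriching as the first action.
        reward = 0.05 if action["type"] == "enrich_lead" and not self._state.history else 0.0
        self._state.history.append(action["type"])
        self._state.total_reward += reward
        return {"feedback": f"executed {action['type']}", "reward": reward}

    def state(self) -> RevOpsState:
        return self._state

env = StubRevOpsEnv()
env.reset("task_easy")
env.step({"type": "enrich_lead"})
env.step({"type": "route_to_rep", "rep_id": "rep-1"})
print(env.state().history)  # ['enrich_lead', 'route_to_rep']
```

The real environment exchanges typed Pydantic objects over HTTP rather than bare dicts; the reset/step/state loop shape is the point here.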

Observation space contains everything a real RevOps analyst would have on their screen: the raw inbound lead (email, title, company, message, source), enrichment results (company size, revenue, industry, tech stack, tier), CRM state (existing accounts, open deals, closed-lost opportunities, assigned rep), the full available rep roster, the ICP criteria document, SLA time remaining, and human-readable feedback from the last action.
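Sketched as plain dataclasses, that observation might look roughly like this; field names and types are assumptions for illustration (the real Pydantic models live in models.py):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Lead:
    email: str
    title: Optional[str] = None
    company: Optional[str] = None
    message: Optional[str] = None
    source: Optional[str] = None  # web form, whitepaper, click-to-message ad

@dataclass
class RevOpsObservation:
    lead: Lead
    enrichment: Optional[dict] = None  # company size, revenue, industry, tech stack, tier
    crm: Optional[dict] = None         # accounts, open deals, closed-lost opps, assigned rep
    rep_roster: list = field(default_factory=list)
    icp_criteria: str = ""
    sla_minutes_remaining: int = 0
    feedback: str = ""                 # human-readable result of the last action

# A bare inbound lead, before any enrichment or CRM lookup:
obs = RevOpsObservation(lead=Lead(email="ops@example.com", source="web_form"))
print(obs.enrichment is None)  # True
```

The None fields are the interesting part: they are what the agent must decide to fill in (via enrich_lead and check_crm) before scoring and routing.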

Action space is a discrete set of typed operations: enrich_lead, check_crm, update_lead_score, route_to_rep, merge_with_account, flag_reengagement, disqualify, add_to_nurture, request_more_info, and escalate_to_manager. Each action requires specific fields (e.g., a rep_id for routing, an account_id for merging), enforced by Pydantic validation at the API layer.
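A stdlib sketch of that per-action field enforcement follows; the real models use Pydantic, and the exact field names and rules below are illustrative:

```python
from dataclasses import dataclass
from typing import Optional

VALID_TYPES = {
    "enrich_lead", "check_crm", "update_lead_score", "route_to_rep",
    "merge_with_account", "flag_reengagement", "disqualify",
    "add_to_nurture", "request_more_info", "escalate_to_manager",
}

@dataclass
class RevOpsAction:
    type: str
    rep_id: Optional[str] = None      # required for route_to_rep
    account_id: Optional[str] = None  # required for merge_with_account

    def __post_init__(self):
        # Reject malformed actions before they touch episode state.
        if self.type not in VALID_TYPES:
            raise ValueError(f"unknown action type: {self.type}")
        if self.type == "route_to_rep" and self.rep_id is None:
            raise ValueError("route_to_rep requires rep_id")
        if self.type == "merge_with_account" and self.account_id is None:
            raise ValueError("merge_with_account requires account_id")

RevOpsAction(type="route_to_rep", rep_id="rep-amer-mm-1")  # valid
# RevOpsAction(type="route_to_rep")  # would raise: route_to_rep requires rep_id
```

In the actual server the same rejection happens via Pydantic at the API layer, so an agent never gets partial credit for a structurally invalid action.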

Reward function is trajectory-shaped, not sparse. Correct intermediate steps earn positive reward. Process violations — scoring without enrichment, routing without checking CRM in the hard task — earn negative reward immediately. The final routing decision carries the largest single reward (+0.50 correct / -0.40 wrong) so the terminal signal is strong, but the full episode reward discriminates between agents that got lucky on the final step vs. agents that understood the full operational context.
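A toy tally makes the shaping concrete. Only the +0.50 / -0.40 final-routing values come from the design above; every other magnitude here is an assumed placeholder:

```python
# Illustrative per-event reward values; only the final-routing pair matches the design.
STEP_REWARDS = {
    "correct_enrichment": 0.10,         # assumed
    "correct_crm_check": 0.10,          # assumed
    "score_without_enrichment": -0.15,  # process violation (assumed magnitude)
    "final_route_correct": 0.50,        # from the design
    "final_route_wrong": -0.40,         # from the design
}

def episode_reward(events: list) -> float:
    """Sum the shaped reward over an ordered list of episode events."""
    return round(sum(STEP_REWARDS[e] for e in events), 2)

# An agent that blind-scores but gets lucky on the final route...
lucky = episode_reward(["score_without_enrichment", "final_route_correct"])
# ...vs. one that follows the full process before routing:
careful = episode_reward(["correct_enrichment", "correct_crm_check", "final_route_correct"])
print(lucky, careful)  # 0.35 0.7
```

Both episodes end with a correct route, yet the trajectory-level totals separate them, which is exactly the discrimination the shaped reward is after.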


The Three Tasks

Task 1 — Direct Match Routing (Easy)

A high-intent VP of Engineering fills out a web form with full company information. Their company is clearly mid-market, AMER-based, and in-ICP. The correct path is straightforward: enrich to confirm, score in the 60–90 range, route to the mid-market AMER rep. This task validates that the agent understands the basic workflow and does not over-engineer a simple case. Passing score: ≥ 0.80.

Task 2 — Enrichment-Gated ICP Scoring (Medium)

A lead arrives with only an email domain — no name, no title, no company. The agent must call enrich_lead before scoring, or it receives a process violation penalty. Once enriched, it discovers a 1,200-person German FinTech (enterprise, EMEA). The agent must score this as a strong ICP match (≥ 80) and route to the EMEA enterprise rep. This task tests whether the agent understands why enrichment exists — not as a formality, but as a prerequisite to correct scoring. Passing score: ≥ 0.75.

Task 3 — CRM Conflict & Re-Engagement Resolution (Hard)

A CFO from a major financial firm messages through a click-to-message ad: "We looked at your product last year. Ready to revisit." The CRM shows this exact company has a closed-lost opportunity worth $240,000 — lost 4 months ago to a budget freeze — assigned to a specific Account Executive. The correct path requires checking the CRM, discovering the closed-lost opportunity, merging the lead into the existing account, flagging it as a re-engagement tied to the correct opportunity ID, and routing specifically to the original AE — not the highest-capacity rep, not round-robin, not the most senior enterprise rep available. Simultaneously, two junk leads (a student and a competitor's sales rep) sit in the queue and must be correctly disqualified. An agent that skips the CRM check and routes to any available enterprise rep scores 0.0 on this task — total failure regardless of other correct actions. Passing score: ≥ 0.70.
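The hard gate described here can be sketched as a toy grader. The IDs and partial-credit weights below are hypothetical; only the rule that routing without a prior CRM check scores 0.0, and the 0.0 to 1.0 scale, come from the task design:

```python
ORIGINAL_AE = "rep-ae-007"    # hypothetical ID of the AE who owned the closed-lost deal
CLOSED_LOST_OPP = "opp-240k"  # hypothetical opportunity ID

def grade_task_hard(trajectory):
    """trajectory: ordered (action_type, payload) pairs; weights are assumptions."""
    actions = [a for a, _ in trajectory]
    # Hard gate: no route, or a route with no prior check_crm, is total failure.
    if "route_to_rep" not in actions:
        return 0.0
    if "check_crm" not in actions[: actions.index("route_to_rep")]:
        return 0.0
    payloads = dict(trajectory)  # later duplicates overwrite; fine for this sketch
    score = 0.0
    if "merge_with_account" in actions:
        score += 0.2
    if payloads.get("flag_reengagement") == CLOSED_LOST_OPP:
        score += 0.2
    if payloads.get("route_to_rep") == ORIGINAL_AE:
        score += 0.4
    score += 0.1 * min(actions.count("disqualify"), 2)  # the two junk leads
    return round(score, 2)

traj = [
    ("check_crm", None),
    ("merge_with_account", "acct-finco"),
    ("flag_reengagement", "opp-240k"),
    ("disqualify", "lead-student"),
    ("disqualify", "lead-competitor"),
    ("route_to_rep", "rep-ae-007"),
]
print(grade_task_hard(traj))  # 1.0
```

Note the asymmetry: an agent can do everything else right, but one blind route zeroes the episode, mirroring how a mis-routed $240K re-engagement erases the value of tidy queue hygiene.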


What Makes This Hard for Frontier Models

Task 3 is specifically designed to defeat "good enough" language models. The surface-level pattern — enterprise lead, high intent, route to enterprise rep — points to the wrong answer. The correct answer requires the agent to:

  1. Recognize that CRM state supersedes general routing rules
  2. Understand that a re-engagement to a closed-lost account belongs to the rep who owns the relationship, not the rep with the lowest queue
  3. Execute a multi-step merge + flag sequence before routing, not just a single routing action
  4. Disqualify concurrent noise leads correctly without getting distracted from the primary task

This is not a reasoning puzzle. It is an operational judgment call that costs real companies real pipeline every week.


Real-World Utility

This environment is immediately deployable as an evaluation benchmark for any team building a sales automation agent, a RevOps AI assistant, or a CRM intelligence layer. The scenarios are synthetically generated but structurally identical to real operational cases, modelled on lead-routing failure patterns documented across the B2B SaaS industry.

Any organization running HubSpot, Salesforce, or a homegrown CRM with inbound lead volume above ~200 leads/month faces the exact operational decisions modelled here. The environment gives the RL/agent community a reproducible, gradeable benchmark for a workflow that currently costs the industry billions in wasted pipeline annually — and has no standardized agent evaluation framework.

File map

  • openenv.yaml — Required manifest; openenv validate reads this
  • models.py — All Pydantic models: RevOpsAction, RevOpsObservation, RevOpsState, RevOpsReward
  • server/data_generator.py — Synthetic scenarios for all 3 tasks; no external APIs needed
  • server/graders.py — Deterministic 0.0→1.0 graders, one per task
  • server/environment.py — Core RevOpsEnvironment(Environment) with reset() / step() / state()
  • server/app.py — FastAPI app with optional web UI toggle
  • client.py — RevOpsEnv(HTTPEnvClient); what the baseline script talks to
  • baseline.py — Calls the OpenAI API, runs all 3 tasks, prints grader scores
  • server/Dockerfile — Python 3.11-slim, uvicorn, port 8000
  • pyproject.toml — Package deps, including openenv-core>=0.2.0

The 3 tasks at a glance

  • Easy: Clean high-intent lead → enrich → score ~75 → route to mid-market AMER rep. Pass if grader ≥ 0.80.
  • Medium: Blank lead (email only) → MUST enrich first or gets penalized → discovers EMEA enterprise → score ≥ 80 → route to EMEA enterprise rep. Pass if grader ≥ 0.75.
  • Hard: Re-engaged CFO from closed-lost $240K deal → check CRM → merge account → flag re-engagement with correct opp ID → route to the ORIGINAL AE, not round-robin. Pass if grader ≥ 0.70.

The one trick that wins judges

The CRM conflict trap in task_hard is the creative highlight. Any naive agent trained on "route enterprise leads to enterprise reps" will route to the wrong person and score 0.0. Only an agent that reasons through the full CRM state reaches the original AE. That is what lets task_hard "genuinely challenge frontier models", to use the exact language of the hackathon rubric.