# ProcureRL: A Deep Dive

## Table of Contents

1. [What is ProcureRL?](#what-is-procurerl)
2. [Why Does This Exist?](#why-does-this-exist)
3. [The Big Picture Architecture](#the-big-picture-architecture)
4. [The Three Tasks](#the-three-tasks)
5. [Data Models: What's Floating Around](#data-models-whats-floating-around)
6. [The Scripted Opponent System](#the-scripted-opponent-system)
7. [The Grading System](#the-grading-system)
8. [The Environment Core](#the-environment-core)
9. [The Server API](#the-server-api)
10. [The Inference Script](#the-inference-script)
11. [End-to-End Example](#end-to-end-example)
12. [Docker Deployment](#docker-deployment)
13. [Calibration and Testing](#calibration-and-testing)
14. [Summary: How Everything Fits Together](#summary-how-everything-fits-together)

---

## What is ProcureRL?

ProcureRL is an **OpenEnv-compliant Reinforcement Learning environment** where an LLM (Large Language Model) agent learns to negotiate procurement deals against scripted supplier opponents.

In simpler terms: it's a training ground for AI to practice negotiation, like a flight simulator but for procurement conversations.

### The Core Innovation: Language-Sensitive Opponent

What makes ProcureRL special is that the opponent's behavior **responds to the quality of the agent's natural language**, not just the prices offered. This means:

- An agent that outputs aggressive or low-effort language gets a **tough, unyielding opponent**
- An agent that outputs collaborative, professional language gets a **more cooperative, flexible opponent**

The language IS the policy, not just the action space. This makes an LLM genuinely required, not incidental.

---

## Why Does This Exist?

Real-world procurement negotiation is:

- **Sequential**: one decision affects the next
- **Hidden utility**: the opponent's real priorities are not revealed
- **Language-dependent**: how you say things matters as much as what you offer
- **High-stakes**: Walmart has deployed AI (Pactum) for exactly this, and 90% of CPOs are reportedly adopting AI negotiation in 2025

Traditional rule-based negotiation tools are limited. An RL-trained LLM policy can learn to navigate this complexity in ways that static rules cannot.

---

## The Big Picture Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                        ProcureRL System                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌──────────────────┐    ┌──────────────────┐                   │
│  │   LLM Agent      │───▶│   Environment    │                   │
│  │  (inference.py)  │    │  (Procure_RL_    │                   │
│  │                  │    │  environment.py) │                   │
│  └──────────────────┘    └────────┬─────────┘                   │
│                                   │                             │
│                                   ▼                             │
│                          ┌──────────────────┐                   │
│                          │   Scripted       │                   │
│                          │   Opponent       │                   │
│                          │  (opponent.py)   │                   │
│                          └────────┬─────────┘                   │
│                                   │                             │
│                                   ▼                             │
│                          ┌──────────────────┐                   │
│                          │   Graders        │                   │
│                          │  (graders.py)    │                   │
│                          └──────────────────┘                   │
│                                                                 │
│  ┌──────────────────┐    ┌──────────────────┐                   │
│  │   Server API     │    │   OpenEnv.yaml   │                   │
│  │  (server/app.py) │    │   (manifest)     │                   │
│  └────────┬─────────┘    └──────────────────┘                   │
│           │                                                     │
│           ▼                                                     │
│  ┌──────────────────┐                                           │
│  │ Docker Container │◀── HF Spaces Deployment                   │
│  │   (port 7860)    │                                           │
│  └──────────────────┘                                           │
└─────────────────────────────────────────────────────────────────┘
```

The system is designed so that:

1. **Environment** is deterministic and reproducible (seeded RNG)
2. **Opponent** responds to language quality (via rapport system)
3. **Graders** produce bounded [0.0, 1.0] scores
4. **Server** exposes everything over HTTP for OpenEnv compliance
5. **Inference** runs a baseline LLM agent against the environment

---

## The Three Tasks

ProcureRL includes three tasks of increasing difficulty:

### Task 1: `single_issue` (Easy)

**Scenario:** Software license renewal.
Price only.

```
Buyer Target: $36,000
Seller Opens: ~$52,000 (varies by seed)
Seller Floor: ~$44,000 (varies by seed)
Max Rounds: 6
Opponent Persona: Cooperative
```

The agent must negotiate the price down from opening to target. The cooperative opponent starts friendly and remains fairly flexible.

**Example Grading:**
- Deal at $38K in round 2: ~0.85 score
- Deal at $44K in round 6: ~0.35 score
- No deal: 0.0 score

### Task 2: `multi_issue` (Medium)

**Scenario:** Enterprise software negotiation with price AND payment terms.

```
Issues: price ($40K-$58K) + payment_days (30-90)
Opponent Persona: Cash Flow Stressed
  → Cares more about getting paid quickly (payment_weight: 0.65)
  → Cares less about final price (price_weight: 0.35)
Max Rounds: 8
```

**The Strategic Opportunity:** If the agent offers Net-30 or Net-45 payment terms, the opponent becomes more flexible on price. A naive agent treats both issues equally and scores low. A smart agent bundles payment speed with price negotiation.

**Example Grading:**
- Price $42K + Net-30 payment: ~0.60 score
- Price $42K + Net-90 payment: ~0.35 score
- No deal: 0.0 score

### Task 3: `adversarial` (Hard)

**Scenario:** Large contract with three issues: price, payment, and support hours.

```
Issues: price + payment_days + support_hours
Opponent Persona: Aggressive Anchor
  → Opens at ceiling on all issues
  → Hardens position if agent makes consecutive concessions
  → Rapport-sensitive but requires consistent collaborative framing
Max Rounds: 10
Survival Floor: 0.15 (completing any deal gets at least 0.15)
```

**The Challenge:** If the agent concedes on price in 2+ consecutive rounds, the opponent recognizes this pattern and becomes much harder to negotiate with. The agent must resist anchoring, break consecutive concession patterns, and maintain collaborative tone under pressure.
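
The concession pattern described in the challenge can be illustrated with a small standalone sketch. It assumes that a concession means the buyer raising its price offer relative to its previous offer, and that any round without an increase resets the streak. `consecutive_concessions` and `triggers_hardening` are hypothetical helper names used only for illustration, not functions from the ProcureRL codebase.

```python
from typing import List


def consecutive_concessions(buyer_prices: List[float]) -> int:
    """Return the current streak of price increases at the end of the history."""
    streak = 0
    for previous, current in zip(buyer_prices, buyer_prices[1:]):
        # Assumption: raising the offered price counts as a concession.
        streak = streak + 1 if current > previous else 0
    return streak


def triggers_hardening(buyer_prices: List[float], threshold: int = 2) -> bool:
    """True once the streak reaches the 2+ rounds named in the task description."""
    return consecutive_concessions(buyer_prices) >= threshold


steady_climb = [40000, 42000, 44000]           # two increases in a row
broken_pattern = [40000, 42000, 42000, 43000]  # holding once resets the streak

print(triggers_hardening(steady_climb))    # True
print(triggers_hardening(broken_pattern))  # False
```

Under these assumptions, holding a price for a single round is enough to break the streak, which is one way an agent might avoid the hardened opponent.
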

**Example Grading:**
- Strategic deal with no consecutive concessions: ~0.50 score
- Same deal but with consecutive concession pattern: ~0.40 score
- Survival deal (just complete): 0.15 score

---

## Data Models: What's Floating Around

The system uses three Pydantic models defined in `models.py`:

### `NegotiationAction`

What the agent sends to the environment:

```python
from typing import Any, Dict, List, Optional

from pydantic import BaseModel, Field


class NegotiationAction(BaseModel):
    move_type: str         # "make_offer" | "accept" | "reject" | "bundle"
    terms: Dict[str, Any]  # {"price": 42000, "payment_days": 45}
    message: str = ""      # Natural language; affects opponent rapport!
```

**Important:** The `message` field is not just flavor text. It directly affects opponent behavior through the rapport system.

### `NegotiationObservation`

What the environment sends back to the agent after each step:

```python
class NegotiationObservation(BaseModel):
    task_id: str                       # Which task we're running
    round_number: int                  # Current round (0 to max_rounds)
    max_rounds: int                    # Task's round limit
    supplier_message: str              # Opponent's latest message
    current_offer: Dict[str, Any]      # Terms currently on the table
    last_4_exchanges: List[Dict]       # Recent conversation history
    buyer_constraints: Dict[str, Any]  # Agent's targets and limits
    rapport_hint: str                  # "positive" | "neutral" | "negative"
    done: bool                         # Is episode finished?
    reward: Optional[float] = None     # Reward (only on done)
    metadata: Dict[str, Any] = Field(...)
    # Extra info (deal_price, errors)
```

### `NegotiationState`

The environment's internal state (accessible via `env.state`):

```python
class NegotiationState(BaseModel):
    task_id: str = ""
    episode_id: str = ""
    round_number: int = 0
    rapport_score: float = 0.5          # 0.0 to 1.0, starts neutral
    consecutive_concessions: int = 0    # Tracks concession patterns
    deal_reached: bool = False
    final_terms: Optional[Dict] = None  # Set when episode ends
    cumulative_reward: float = 0.0
```

---

## The Scripted Opponent System

The opponent is implemented in `opponent.py` as the `ScriptedPersonaOpponent` class.

### The Rapport System (Language Sensitivity)

The key mechanism is **rapport**, a score from 0.0 to 1.0 that changes based on the agent's language quality.

**Collaborative Signals (increase rapport):**

```python
COLLABORATIVE_SIGNALS = [
    "understand", "partnership", "mutual", "together", "value",
    "appreciate", "flexible", "work with", "long-term", "relationship",
    "reasonable", "fair", "both", "solution"
]
```

**Aggressive Signals (decrease rapport):**

```python
AGGRESSIVE_SIGNALS = [
    "demand", "require", "final offer", "unacceptable", "must",
    "non-negotiable", "take it or leave", "bottom line", "ultimatum",
    "insist", "refuse", "absolutely not"
]
```

**How it works:**

```python
def update_rapport(self, agent_message: str) -> None:
    msg_lower = agent_message.lower()
    delta = 0.0
    delta += sum(0.08 for w in COLLABORATIVE_SIGNALS if w in msg_lower)
    delta -= sum(0.08 for w in AGGRESSIVE_SIGNALS if w in msg_lower)
    delta = max(-0.20, min(0.20, delta))  # Cap per-round change
    self.rapport = max(0.0, min(1.0, self.rapport + delta))
```

Each message adds 0.08 to rapport for every collaborative keyword it contains and subtracts 0.08 for every aggressive keyword, with the net change capped at ±0.20 per round. Matching is a case-insensitive substring check, so each keyword counts at most once per message.
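
To see the cap in action, here is a minimal standalone sketch of the same keyword arithmetic. It copies the two signal lists from above into module-level constants and exposes the per-message change as a free function; `rapport_delta` is a hypothetical name chosen for this example and is not part of `opponent.py`.

```python
COLLABORATIVE_SIGNALS = [
    "understand", "partnership", "mutual", "together", "value",
    "appreciate", "flexible", "work with", "long-term", "relationship",
    "reasonable", "fair", "both", "solution",
]
AGGRESSIVE_SIGNALS = [
    "demand", "require", "final offer", "unacceptable", "must",
    "non-negotiable", "take it or leave", "bottom line", "ultimatum",
    "insist", "refuse", "absolutely not",
]


def rapport_delta(message: str) -> float:
    """Net rapport change for one message, capped at +/-0.20 (illustrative helper)."""
    msg = message.lower()
    delta = 0.0
    delta += sum(0.08 for w in COLLABORATIVE_SIGNALS if w in msg)
    delta -= sum(0.08 for w in AGGRESSIVE_SIGNALS if w in msg)
    return max(-0.20, min(0.20, delta))


# Three collaborative keywords (value, partnership, fair): 0.24 before the cap.
print(rapport_delta("I value our partnership and want a fair deal."))
# Three aggressive keywords (unacceptable, demand, final offer): -0.24 before the cap.
print(rapport_delta("This is unacceptable. We demand a lower price, final offer."))
# No keywords at all leaves rapport unchanged.
print(rapport_delta("We need a lower price."))
```

Because the check is a plain substring match, a word such as "mustard" would also count as the aggressive signal "must"; the sketch keeps that behavior rather than adding word-boundary handling.
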

### Concession Rate: How Fast the Opponent Moves

Rapport directly modifies the opponent's concession rate:

```python
def get_concession_rate(self) -> float:
    base_rates = {
        "cooperative": 0.05,         # 5% per round base
        "cash_flow_stressed": 0.07,
        "aggressive_anchor": 0.04,
    }
    base = base_rates[self.persona]
    modifier = (self.rapport - 0.5) * base  # +/- 50% of base
    return max(0.01, base + modifier)
```

**Example:** Cooperative opponent with high rapport (0.8) concedes at 0.05 + (0.8 - 0.5) × 0.05 = **6.5% per round**. With low rapport (0.2), it concedes at 0.05 + (0.2 - 0.5) × 0.05 = **3.5% per round**.

### Three Personas

#### 1. Cooperative (`single_issue`)

- Friendly, understanding tone
- 5% base concession rate, highly sensitive to rapport
- Accepts early if price is at or above floor and round ≥ 2

#### 2. Cash Flow Stressed (`multi_issue`)

- Cares about payment timing more than price
- 7% base concession rate, moderate rapport sensitivity
- Acceptance requires `payment_days ≤ 45`
- Comments on payment timing in responses

#### 3. Aggressive Anchor (`adversarial`)

- Opens at ceiling, hardens with pressure
- 4% base concession rate (least flexible)
- **Penalizes consecutive concessions**: if the agent concedes 2+ rounds in a row, the concession rate drops to 40% of normal
- Uses "hardening" templates when cornered

### Opponent Response Flow

```python
def respond(self, agent_message, agent_terms, round_number, consecutive_concessions):
    # 1. Update rapport based on agent's language
    self.update_rapport(agent_message)

    # 2. Check acceptance (from round 2 onward, and price must be ≥ floor)
    if round_number >= 2 and agent_price >= self.price_floor and _acceptance_condition():
        return self.templates["accept"], {**agent_terms, "_accepted": True}

    # 3. Calculate concession rate
    concession = self.get_concession_rate()

    # 4. Aggressive anchor gets harder if detecting concession pattern
    if self.persona == "aggressive_anchor" and consecutive_concessions >= 2:
        concession = concession * 0.4  # 60% reduction!
        template_key = "hardening"
    elif round_number >= 0.7 * max_rounds:  # Last 30% of the round budget
        template_key = "near_close"
    else:
        template_key = "counter"

    # 5. Compute new position
    new_position = self.current_position * (1 - concession)
    new_position = max(self.price_floor, new_position)  # Never go below floor

    # 6. Return message and counter terms
    return message, counter_terms
```

---

## The Grading System

Graders are in `graders.py` and produce scores in [0.0, 1.0]. They are **pure Python with zero LLM calls**, ensuring deterministic, reproducible scoring.

### Key Design: Relative Scoring

The graders score based on **how much the agent improved from the opponent's opening price**, not on absolute thresholds. This makes the environment learnable: the agent learns to negotiate better deals relative to where negotiations started.

```python
# Instead of scoring against a hardcoded floor, we score relative to the opening:
value = (opponent_opening - final_price) / (opponent_opening - BUYER_TARGET)
```

### Single Issue Grading

```python
def grade_single_issue(final_terms, deal_reached, rounds_taken, max_rounds=6, opponent_opening=52000.0):
    if not deal_reached:
        return 0.0

    final_price = final_terms.get("price", opponent_opening)
    BUYER_TARGET = 38000.0

    # If price didn't improve from opening, minimal score
    if final_price >= opponent_opening:
        return 0.05

    # How much did we improve relative to the possible improvement?
    value = (opponent_opening - final_price) / (opponent_opening - BUYER_TARGET)
    value = max(0.0, min(1.0, value))

    # Efficiency penalty for taking too long
    efficiency = 1.0 - (rounds_taken / max_rounds) ** 1.5 * 0.4
    efficiency = max(0.1, efficiency)  # Never below 0.1

    return round(value * efficiency, 4)
```

**Example:**
- Opening: $52,000, Target: $38,000, Range: $14,000
- Final price: $45,000 → improvement: $7,000 → value = 0.50
- Round 3 → efficiency = 1.0 - (3/6)^1.5 × 0.4 = 1.0 - 0.354 × 0.4 = 0.86
- **Score: 0.50 × 0.86 = 0.43**

### Multi-Issue Grading

```python
def grade_multi_issue(final_terms, deal_reached, rounds_taken, max_rounds=8, opponent_opening=52000.0):
    # Two dimensions: price (70% weight) and payment_days (30% weight)
    price_value = (opponent_opening - final_price) / (opponent_opening - 40000)
    payment_score = (90 - payment_days) / (90 - 30)
    value = 0.70 * price_value + 0.30 * payment_score

    # If price didn't improve but payment did, still score on payment
    if final_price >= opponent_opening:
        value = 0.30 * payment_score  # Only payment matters
```

**Example** (with the default opening of $52,000):
- Price: $44,000 (good), Payment: Net-45 (good) → price_value = 8,000 / 12,000 = 0.67, payment_score = 45 / 60 = 0.75
- value = 0.70 × 0.67 + 0.30 × 0.75 = 0.69

### Adversarial Grading

```python
def grade_adversarial(final_terms, deal_reached, rounds_taken, consecutive_concessions_flag, ...):
    SURVIVAL_FLOOR = 0.15  # Completing any deal gets at least 0.15

    # Three dimensions with weights
    value = 0.40 * price_value + 0.35 * payment_score + 0.25 * support_score

    # Pattern penalty: bad if you showed consecutive concessions
    pattern_penalty = 0.10 if consecutive_concessions_flag else 0.0

    raw = (value * efficiency) - pattern_penalty
    return round(max(SURVIVAL_FLOOR, raw), 4)
```

---

## The Environment Core

The `ProcureRLEnvironment` class in `server/Procure_RL_environment.py` is the heart of the system.

### Reset Flow

```python
def reset(self, seed=None, episode_id=None, **kwargs):
    task_id = kwargs.get("task_id", "single_issue")

    # 1.
    # Set up opponent with seeded RNG
    # Note: hash() of a tuple containing a str is salted per process unless
    # PYTHONHASHSEED is fixed, so this value is only stable within one process.
    opponent_seed = hash((seed, task_id)) % (2**32)
    self._opponent = ScriptedPersonaOpponent(task_id=task_id, seed=opponent_seed, persona=...)

    # 2. Get opponent's opening message and terms
    opening_msg, opening_terms = self._opponent.get_opening_message()
    self._opponent_opening_price = opening_terms.get("price", 52000.0)

    # 3. Initialize state
    self._state = NegotiationState(
        task_id=task_id,
        episode_id=episode_id or str(uuid.uuid4())[:8],
        round_number=0,
        rapport_score=0.5,  # Neutral
        ...
    )

    # 4. Return initial observation
    return NegotiationObservation(
        ...
        supplier_message=opening_msg,
        current_offer=opening_terms,
        ...
    )
```

### Step Flow

```python
def step(self, action, **kwargs):
    # 1. Validate action
    if not isinstance(action, NegotiationAction):
        action = NegotiationAction(...)  # Convert from dict

    # 2. Track consecutive concessions (for adversarial opponent)
    if self._prev_agent_price is not None and "price" in action.terms:
        if float(action.terms["price"]) > self._prev_agent_price:
            self._consecutive_concessions += 1  # Agent moved toward opponent
        else:
            self._consecutive_concessions = 0
    if "price" in action.terms:  # Guard: accept/reject actions may carry no price
        self._prev_agent_price = float(action.terms["price"])

    # 3. Handle different move types
    if action.move_type in ("make_offer", "bundle"):
        # Get opponent response
        opponent_msg, opponent_terms = self._opponent.respond(...)

        # Check if opponent accepted
        if opponent_terms.get("_accepted"):
            # Episode ends, compute reward
            reward = grade(...)
            return obs_with_reward

        # Otherwise, continue negotiation
        self._last_offer = opponent_terms
        return obs_with_current_state

    if action.move_type == "accept":
        # Agent accepts current terms, episode ends
        reward = grade(...)
        return obs_with_reward

    if action.move_type == "reject":
        if round_number >= max_rounds:
            # Rejected at limit, no reward
            return obs_done_no_reward
        return obs_continue  # Rejected early, keep going
```

### State Property

```python
@property
def state(self) -> NegotiationState:
    return self._state
```

Returns the internal `NegotiationState` object, giving access to:

- `round_number`
- `rapport_score`
- `consecutive_concessions`
- `deal_reached`
- `final_terms`
- `cumulative_reward`

---

## The Server API

The FastAPI server in `server/app.py` exposes the environment over HTTP and WebSocket.

### Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check |
| `/reset` | POST | Reset environment with `task_id` and `seed` |
| `/step` | POST | Execute an action |
| `/state` | GET | Get current `NegotiationState` |
| `/ws` | WS | WebSocket for persistent sessions |

### Request/Response Examples

**POST /reset**

```json
// Request
{"task_id": "single_issue", "seed": 42}

// Response
{
  "task_id": "single_issue",
  "round_number": 0,
  "max_rounds": 6,
  "supplier_message": "Thanks for reaching out. Our standard pricing for this package is $52,400. Happy to discuss.",
  "current_offer": {"price": 52400.0},
  "buyer_constraints": {"price": {"target": 36000, "worst": 55000, "budget": 53000}},
  "rapport_hint": "neutral",
  "done": false
}
```

**POST /step**

```json
// Request
{"move_type": "make_offer", "terms": {"price": 48000}, "message": "I appreciate your flexibility and would like to find a fair price for both parties."}

// Response
{
  "observation": {
    "task_id": "single_issue",
    "round_number": 1,
    "max_rounds": 6,
    "supplier_message": "I appreciate you working with us. Based on our costs, $49,800 is where we can be.",
    "current_offer": {"price": 49800.0},
    "rapport_hint": "positive",
    "done": false
  },
  "reward": 0.0,
  "done": false,
  "info": {}
}
```

### Key Implementation Detail: Lambda Closure

```python
_env_instance = ProcureRLEnvironment()

app = create_app(
    lambda: _env_instance,  # Lambda is CRITICAL: the factory always returns the one shared instance
    NegotiationAction,
    NegotiationObservation,
    env_name="ProcureRL",
    max_concurrent_envs=1,
)
```

`create_app()` takes a factory and calls it whenever it needs an environment. Passing the `ProcureRLEnvironment` class itself would therefore build a **fresh environment** on each call, so state set by `/reset` would not carry over to `/step`. The lambda closes over `_env_instance`, so every call returns the same object and all requests share one environment.

---

## The Inference Script

`inference.py` is a baseline agent that runs an LLM against the environment.

### Output Format (Sacred)

The script MUST output exactly:

```
[START] task=single_issue env=procure-rl model=Qwen/Qwen2.5-72B-Instruct
[STEP] step=1 action=make_offer({"price": 45000}) reward=0.00 done=false error=null
[STEP] step=2 action=accept({}) reward=0.47 done=true error=null
[END] success=true steps=2 score=0.47 rewards=0.00,0.47
```

Any deviation from this format causes validation to fail.

### How It Works

```python
def run_task(task_id):
    env = ProcureRLEnvironment()
    obs = env.reset(task_id=task_id, seed=42)
    print(f"[START] task={task_id} ...")

    step, done, final_score = 0, False, 0.0
    while not done and step < MAX_STEPS:
        step += 1

        # 1. Get action from LLM
        action_dict = get_agent_action(obs_to_dict(obs))

        # 2. Convert to NegotiationAction
        action = NegotiationAction(
            move_type=action_dict.get("move_type", "make_offer"),
            terms=action_dict.get("terms", {}),
            message=action_dict.get("message", "")
        )

        # 3. Step environment
        obs = env.step(action)

        # 4.
        # Print step result
        reward = obs.reward or 0.0  # obs.reward is None until the episode ends
        print(f"[STEP] step={step} action={...} reward={reward:.2f} ...")

        if obs.done:
            final_score = reward
            break

    print(f"[END] success={...} steps={step} score={final_score:.2f} ...")
```

### LLM Prompt

```python
SYSTEM_PROMPT = """You are a professional procurement negotiator.
Your goal is to negotiate the best possible deal for your company.

You will receive a supplier's message and current offer terms.
You must respond with a JSON action:
{
  "move_type": "make_offer",
  "terms": {"price": 42000, "payment_days": 45},
  "message": "Your natural language response to the supplier"
}

move_type must be one of: make_offer, accept, reject, bundle
message should be professional and collaborative when possible."""
```

---

## End-to-End Example

Here's a full negotiation episode for `single_issue`:

### Round 0: Reset

```python
env.reset(task_id="single_issue", seed=42)

# Returns:
# supplier_message: "Thanks for reaching out. Our standard pricing for this package is $52,400..."
# current_offer: {"price": 52400.0}
# buyer_constraints: {"price": {"target": 36000, ...}}
# rapport_hint: "neutral"
```

### Round 1: Agent Makes Offer with Collaborative Language

```python
action = NegotiationAction(
    move_type="make_offer",
    terms={"price": 48000},
    message="I value our potential partnership and believe we can find a fair price that works for both of us. We're flexible on timeline."
)
obs = env.step(action)

# Returns:
# supplier_message: "I appreciate you working with us. Based on our costs, $49,600 is where we can be."
# current_offer: {"price": 49600.0}
# rapport_hint: "positive" (because message contained collaborative signals)
# reward: 0.0 (still negotiating, no reward yet)
```

### Round 2: Agent Lowers Its Offer

```python
action = NegotiationAction(
    move_type="make_offer",
    terms={"price": 47000},
    message="I understand your cost constraints. Let's work together to find a solution."
)
obs = env.step(action)

# The buyer's price went down, so the step logic does not count this as a concession.
# Returns:
# supplier_message: "I think we're close.
#   If you can do $46,700, I can get this approved today."
# current_offer: {"price": 46700.0}
# rapport_hint: "positive"
```

### Round 3: Agent Lowers Its Offer Again

```python
action = NegotiationAction(
    move_type="make_offer",
    terms={"price": 46000},
    message="We can move to $46,000 as a final compromise."
)
obs = env.step(action)

# Returns:
# supplier_message: "That works for us. Let's move forward at those terms."
# done: true
# reward: 0.38 (see the grading breakdown below)
# info: {"deal_price": 46000}
```

### Grading This Episode

- Opening: $52,400
- Target: $38,000 (`BUYER_TARGET` in `grade_single_issue`)
- Range: $14,400
- Improvement: $52,400 - $46,000 = $6,400
- value = $6,400 / $14,400 = 0.44
- Round 3 → efficiency = 1.0 - (3/6)^1.5 × 0.4 = 0.86
- **Score: 0.44 × 0.86 = 0.38**

---

## Docker Deployment

### Dockerfile

```dockerfile
FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

ENV PORT=7860
EXPOSE 7860

CMD ["python", "-m", "uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "7860"]
```

Key points:

- Port **7860** (not 8000): the port HF Spaces expects by default
- `ENV PORT=7860`: sets `PORT` for any code that reads it; the `CMD` also passes `--port 7860` explicitly
- Uses `python -m uvicorn` with full module path

### Running

```bash
# Build
docker build -t procure-rl .

# Run
docker run -p 7860:7860 procure-rl

# Test
curl -X POST http://localhost:7860/reset -H "Content-Type: application/json" -d '{"task_id": "single_issue"}'
```

### Health Check

The server exposes a health endpoint:

```json
GET /health → {"status": "ok", "service": "procure-rl"}
```

---

## Calibration and Testing

### Test Files

#### `test_graders.py`

Verifies all graders return scores in [0.0, 1.0] range, even with edge cases.

#### `test_rl_properties.py`

Tests fundamental RL properties:

1. **Reproducibility**: Same seed → Same opening message
2. **Language sensitivity**: Collaborative language → Higher rapport
3. **Sequential decisions**: Consecutive concessions tracked in state
4.
   **Delayed reward**: Only terminal state has non-zero reward
5. **Accept terminates**: `move_type="accept"` ends episode
6. **Reset cleans state**: Fresh state after reset

#### `test_calibration.py`

Verifies score spread between random and strategic agents:

```
single_issue: Random avg=0.371, Strategic avg=0.487, Spread=0.116 ✅
multi_issue:  Random avg=0.364, Strategic avg=0.535, Spread=0.171 ✅
adversarial:  Random avg=0.304, Strategic avg=0.607, Spread=0.303 ✅
```

A healthy spread means the environment actually differentiates good vs bad behavior.

### Score Calibration Targets

| Task | Random Agent | Base LLM | Goal (Trained) |
|------|-------------|----------|-----------------|
| single_issue | 0.15–0.25 | 0.35–0.45 | 0.68–0.78 |
| multi_issue | 0.08–0.15 | 0.20–0.30 | 0.55–0.65 |
| adversarial | 0.03–0.10 | 0.12–0.20 | 0.45–0.55 |

---

## Summary: How Everything Fits Together

```
User runs inference.py
        │
        ▼
LLM agent receives observation (supplier message, current offer, constraints)
        │
        ▼
LLM decides action (make_offer with terms + collaborative message)
        │
        ▼
Environment.step(action) is called
        │
        ├─▶ Opponent responds (language → rapport → concession rate → counter)
        │
        ├─▶ State is updated (round_number++, rapport_score, consecutive_concessions)
        │
        └─▶ Observation returned (supplier_message, current_offer, rapport_hint)
                │
                ▼
If episode done: Grader scores the deal (relative to opening price, efficiency, patterns)
        │
        ▼
Score in [0.0, 1.0] returned
```

The agent learns through many episodes:

- **What language gets better rapport** → better concession rates
- **When to concede vs hold** → efficiency bonus
- **How to bundle multiple issues** → multi-issue tasks
- **How to avoid consecutive concession patterns** → adversarial task

The environment is designed to be learnable but not trivial, requiring genuine strategic thinking from an LLM agent.
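
As a rough illustration of the concede-versus-hold trade-off listed above, the sketch below re-implements the single-issue scoring formula from the grading section as standalone functions. It is an illustrative copy, not the code in `graders.py`: `efficiency` and `single_issue_score` are hypothetical names, and the defaults ($52,000 opening, $38,000 target, 6 rounds) are taken from the `grade_single_issue` snippet.

```python
def efficiency(rounds_taken: int, max_rounds: int) -> float:
    """Round-count multiplier: 1.0 minus up to 0.4, never below 0.1."""
    return max(0.1, 1.0 - (rounds_taken / max_rounds) ** 1.5 * 0.4)


def single_issue_score(
    final_price: float,
    rounds_taken: int,
    opening: float = 52000.0,
    target: float = 38000.0,
    max_rounds: int = 6,
) -> float:
    """Relative improvement from the opening price, scaled by efficiency."""
    if final_price >= opening:
        return 0.05  # Deal reached, but no improvement over the opening
    value = (opening - final_price) / (opening - target)
    value = max(0.0, min(1.0, value))
    return round(value * efficiency(rounds_taken, max_rounds), 4)


# Same $45,000 deal, closed at different points in the episode.
print(single_issue_score(45000, rounds_taken=3))  # 0.4293
print(single_issue_score(45000, rounds_taken=6))  # 0.3
```

Under these assumptions, the same price is worth about 0.13 less when it takes all six rounds, so holding out only pays off if the extra rounds move the price by more than the efficiency loss.
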