| # ProcureRL: A Deep Dive | |
| ## Table of Contents | |
| 1. [What is ProcureRL?](#what-is-procurerl) | |
| 2. [Why Does This Exist?](#why-does-this-exist) | |
| 3. [The Big Picture Architecture](#the-big-picture-architecture) | |
| 4. [The Three Tasks](#the-three-tasks) | |
| 5. [Data Models: What's Floating Around](#data-models-whats-floating-around) | |
| 6. [The Scripted Opponent System](#the-scripted-opponent-system) | |
| 7. [The Grading System](#the-grading-system) | |
| 8. [The Environment Core](#the-environment-core) | |
| 9. [The Server API](#the-server-api) | |
| 10. [The Inference Script](#the-inference-script) | |
| 11. [End-to-End Example](#end-to-end-example) | |
| 12. [Docker Deployment](#docker-deployment) | |
| 13. [Calibration and Testing](#calibration-and-testing) | |
| --- | |
| ## What is ProcureRL? | |
| ProcureRL is an **OpenEnv-compliant Reinforcement Learning environment** where an LLM (Large Language Model) agent learns to negotiate procurement deals against scripted supplier opponents. | |
| In simpler terms: it's a training ground for AI to practice negotiation, like a flight simulator but for procurement conversations. | |
| ### The Core Innovation: Language-Sensitive Opponent | |
| What makes ProcureRL special is that the opponent's behavior **responds to the quality of the agent's natural language**, not just the prices offered. This means: | |
| - An agent that outputs aggressive or low-effort language gets a **tough, unyielding opponent** | |
| - An agent that outputs collaborative, professional language gets a **more cooperative, flexible opponent** | |
| The language **is** the policy, not just the action space. This makes an LLM genuinely required, not incidental. | |
| --- | |
| ## Why Does This Exist? | |
| Real-world procurement negotiation is: | |
| - **Sequential**: one decision affects the next | |
| - **Hidden utility**: the opponent's real priorities are not revealed | |
| - **Language-dependent**: how you say things matters as much as what you offer | |
| - **High-stakes**: Walmart deployed AI (Pactum) for exactly this, and 90% of CPOs are reported to be adopting AI negotiation in 2025 | |
| Traditional rule-based negotiation tools are limited. An RL-trained LLM policy can learn to navigate this complexity in ways that static rules cannot. | |
| --- | |
| ## The Big Picture Architecture | |
| ``` | |
| ┌─────────────────────────────────────────────────────────────────┐ | |
| │                        ProcureRL System                         │ | |
| ├─────────────────────────────────────────────────────────────────┤ | |
| │                                                                 │ | |
| │  ┌──────────────────┐     ┌──────────────────┐                  │ | |
| │  │    LLM Agent     │────▶│   Environment    │                  │ | |
| │  │  (inference.py)  │     │  (Procure_RL_    │                  │ | |
| │  │                  │     │  environment.py) │                  │ | |
| │  └──────────────────┘     └────────┬─────────┘                  │ | |
| │                                    │                            │ | |
| │                                    ▼                            │ | |
| │                           ┌──────────────────┐                  │ | |
| │                           │     Scripted     │                  │ | |
| │                           │     Opponent     │                  │ | |
| │                           │  (opponent.py)   │                  │ | |
| │                           └────────┬─────────┘                  │ | |
| │                                    │                            │ | |
| │                                    ▼                            │ | |
| │                           ┌──────────────────┐                  │ | |
| │                           │     Graders      │                  │ | |
| │                           │   (graders.py)   │                  │ | |
| │                           └──────────────────┘                  │ | |
| │                                                                 │ | |
| │  ┌──────────────────┐     ┌──────────────────┐                  │ | |
| │  │    Server API    │     │   OpenEnv.yaml   │                  │ | |
| │  │ (server/app.py)  │     │    (manifest)    │                  │ | |
| │  └────────┬─────────┘     └──────────────────┘                  │ | |
| │           │                                                     │ | |
| │           ▼                                                     │ | |
| │  ┌──────────────────┐                                           │ | |
| │  │ Docker Container │◀── HF Spaces Deployment                   │ | |
| │  │   (port 7860)    │                                           │ | |
| │  └──────────────────┘                                           │ | |
| └─────────────────────────────────────────────────────────────────┘ | |
| ``` | |
| The system is designed so that: | |
| 1. **Environment** is deterministic and reproducible (seeded RNG) | |
| 2. **Opponent** responds to language quality (via rapport system) | |
| 3. **Graders** produce bounded [0.0, 1.0] scores | |
| 4. **Server** exposes everything over HTTP for OpenEnv compliance | |
| 5. **Inference** runs a baseline LLM agent against the environment | |
| --- | |
| ## The Three Tasks | |
| ProcureRL includes three tasks of increasing difficulty: | |
| ### Task 1: `single_issue` (Easy) | |
| **Scenario:** Software license renewal. Price only. | |
| ``` | |
| Buyer Target: $36,000 | |
| Seller Opens: ~$52,000 (varies by seed) | |
| Seller Floor: ~$44,000 (varies by seed) | |
| Max Rounds: 6 | |
| Opponent Persona: Cooperative | |
| ``` | |
| The agent must negotiate the price down from opening to target. The cooperative opponent starts friendly and remains fairly flexible. | |
| **Example Grading:** | |
| - Deal at $38K in round 2: ~0.85 score | |
| - Deal at $44K in round 6: ~0.35 score | |
| - No deal: 0.0 score | |
| ### Task 2: `multi_issue` (Medium) | |
| **Scenario:** Enterprise software negotiation with price AND payment terms. | |
| ``` | |
| Issues: price ($40K-$58K) + payment_days (30-90) | |
| Opponent Persona: Cash Flow Stressed | |
|   → Cares more about getting paid quickly (payment_weight: 0.65) | |
|   → Cares less about final price (price_weight: 0.35) | |
| Max Rounds: 8 | |
| ``` | |
| **The Strategic Opportunity:** If the agent offers Net-30 or Net-45 payment terms, the opponent becomes more flexible on price. A naive agent treats both issues equally and scores low. A smart agent bundles payment speed with price negotiation. | |
| **Example Grading:** | |
| - Price $42K + Net-30 payment: ~0.60 score | |
| - Price $42K + Net-90 payment: ~0.35 score | |
| - No deal: 0.0 score | |
| ### Task 3: `adversarial` (Hard) | |
| **Scenario:** Large contract with three issues: price, payment, and support hours. | |
| ``` | |
| Issues: price + payment_days + support_hours | |
| Opponent Persona: Aggressive Anchor | |
|   → Opens at ceiling on all issues | |
|   → Hardens position if agent makes consecutive concessions | |
|   → Rapport-sensitive but requires consistent collaborative framing | |
| Max Rounds: 10 | |
| Survival Floor: 0.15 (completing any deal gets at least 0.15) | |
| ``` | |
| **The Challenge:** If the agent concedes on price in 2+ consecutive rounds, the opponent recognizes this pattern and becomes much harder to negotiate with. The agent must resist anchoring, break consecutive concession patterns, and maintain collaborative tone under pressure. | |
| **Example Grading:** | |
| - Strategic deal with no consecutive concessions: ~0.50 score | |
| - Same deal but with consecutive concession pattern: ~0.40 score | |
| - Survival deal (just complete): 0.15 score | |
| --- | |
| ## Data Models: What's Floating Around | |
| The system uses three Pydantic models defined in `models.py`: | |
| ### `NegotiationAction` | |
| What the agent sends to the environment: | |
| ```python | |
| from typing import Any, Dict, List, Optional  # shared by the three models | |
| from pydantic import BaseModel, Field | |
| class NegotiationAction(BaseModel): | |
|     move_type: str  # "make_offer" | "accept" | "reject" | "bundle" | |
|     terms: Dict[str, Any]  # {"price": 42000, "payment_days": 45} | |
|     message: str = ""  # Natural language; affects opponent rapport! | |
| ``` | |
| **Important:** The `message` field is not just flavor text. It directly affects opponent behavior through the rapport system. | |
| ### `NegotiationObservation` | |
| What the environment sends back to the agent after each step: | |
| ```python | |
| class NegotiationObservation(BaseModel): | |
|     task_id: str  # Which task we're running | |
|     round_number: int  # Current round (0 to max_rounds) | |
|     max_rounds: int  # Task's round limit | |
|     supplier_message: str  # Opponent's latest message | |
|     current_offer: Dict[str, Any]  # Terms currently on the table | |
|     last_4_exchanges: List[Dict]  # Recent conversation history | |
|     buyer_constraints: Dict[str, Any]  # Agent's targets and limits | |
|     rapport_hint: str  # "positive" | "neutral" | "negative" | |
|     done: bool  # Is episode finished? | |
|     reward: Optional[float] = None  # Reward (only on done) | |
|     metadata: Dict[str, Any] = Field(...)  # Extra info (deal_price, errors) | |
| ``` | |
| ### `NegotiationState` | |
| The environment's internal state (accessible via `env.state`): | |
| ```python | |
| class NegotiationState(BaseModel): | |
|     task_id: str = "" | |
|     episode_id: str = "" | |
|     round_number: int = 0 | |
|     rapport_score: float = 0.5  # 0.0 to 1.0, starts neutral | |
|     consecutive_concessions: int = 0  # Tracks concession patterns | |
|     deal_reached: bool = False | |
|     final_terms: Optional[Dict] = None  # Set when episode ends | |
|     cumulative_reward: float = 0.0 | |
| ``` | |
| --- | |
| ## The Scripted Opponent System | |
| The opponent is implemented in `opponent.py` as the `ScriptedPersonaOpponent` class. | |
| ### The Rapport System (Language Sensitivity) | |
| The key mechanism is **rapport**, a score from 0.0 to 1.0 that changes based on the agent's language quality. | |
| **Collaborative Signals (increase rapport):** | |
| ```python | |
| COLLABORATIVE_SIGNALS = [ | |
| "understand", "partnership", "mutual", "together", "value", | |
| "appreciate", "flexible", "work with", "long-term", "relationship", | |
| "reasonable", "fair", "both", "solution" | |
| ] | |
| ``` | |
| **Aggressive Signals (decrease rapport):** | |
| ```python | |
| AGGRESSIVE_SIGNALS = [ | |
| "demand", "require", "final offer", "unacceptable", "must", | |
| "non-negotiable", "take it or leave", "bottom line", "ultimatum", | |
| "insist", "refuse", "absolutely not" | |
| ] | |
| ``` | |
| **How it works:** | |
| ```python | |
| def update_rapport(self, agent_message: str) -> None: | |
|     msg_lower = agent_message.lower() | |
|     delta = 0.0 | |
|     delta += sum(0.08 for w in COLLABORATIVE_SIGNALS if w in msg_lower) | |
|     delta -= sum(0.08 for w in AGGRESSIVE_SIGNALS if w in msg_lower) | |
|     delta = max(-0.20, min(0.20, delta))  # Cap per-round change | |
|     self.rapport = max(0.0, min(1.0, self.rapport + delta)) | |
| ``` | |
| Every message the agent sends adjusts rapport by ±0.08 per keyword detected, capped at ±0.20 per round. | |
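To make the keyword mechanics concrete, here is a minimal standalone sketch of the same update rule. The helper names `rapport_delta` and `apply_rapport` are hypothetical (the source implements this as a method on `ScriptedPersonaOpponent`); the signal lists and constants are copied from above. Because matching is by substring, a word such as "unfair" still counts as a hit for "fair", which is worth keeping in mind when reading rapport traces.

```python
COLLABORATIVE_SIGNALS = [
    "understand", "partnership", "mutual", "together", "value",
    "appreciate", "flexible", "work with", "long-term", "relationship",
    "reasonable", "fair", "both", "solution",
]
AGGRESSIVE_SIGNALS = [
    "demand", "require", "final offer", "unacceptable", "must",
    "non-negotiable", "take it or leave", "bottom line", "ultimatum",
    "insist", "refuse", "absolutely not",
]


def rapport_delta(message: str) -> float:
    """Per-round rapport change: 0.08 per matched signal, capped at +/-0.20."""
    msg = message.lower()
    delta = 0.08 * sum(w in msg for w in COLLABORATIVE_SIGNALS)
    delta -= 0.08 * sum(w in msg for w in AGGRESSIVE_SIGNALS)
    return max(-0.20, min(0.20, delta))


def apply_rapport(rapport: float, message: str) -> float:
    """Return the new rapport, clamped to [0.0, 1.0]."""
    return max(0.0, min(1.0, rapport + rapport_delta(message)))


if __name__ == "__main__":
    for text in (
        "I value our partnership",
        "I understand and appreciate the partnership",
        "This is unacceptable, final offer",
        "That is unfair",  # substring match on "fair"
    ):
        print(f"{rapport_delta(text):+.2f}  {text}")
```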
| ### Concession Rate: How Fast the Opponent Moves | |
| Rapport directly modifies the opponent's concession rate: | |
| ```python | |
| def get_concession_rate(self) -> float: | |
|     base_rates = { | |
|         "cooperative": 0.05,  # 5% per round base | |
|         "cash_flow_stressed": 0.07, | |
|         "aggressive_anchor": 0.04, | |
|     } | |
|     base = base_rates[self.persona] | |
|     modifier = (self.rapport - 0.5) * base  # +/- 50% of base | |
|     return max(0.01, base + modifier) | |
| ``` | |
| **Example:** Cooperative opponent with high rapport (0.8) concedes at 0.05 + (0.8 - 0.5) × 0.05 = 0.065, or **6.5% per round**. With low rapport (0.2), it concedes at 0.05 + (0.2 - 0.5) × 0.05 = 0.035, or **3.5% per round**. | |
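The formula can be evaluated for every persona with a small free-function sketch. `concession_rate` is a hypothetical standalone version of `get_concession_rate`, with the base rates copied from the listing above.

```python
BASE_RATES = {
    "cooperative": 0.05,
    "cash_flow_stressed": 0.07,
    "aggressive_anchor": 0.04,
}


def concession_rate(persona: str, rapport: float) -> float:
    """Base rate scaled by rapport: +/-50% of base at the extremes, floor 0.01."""
    base = BASE_RATES[persona]
    return max(0.01, base + (rapport - 0.5) * base)


if __name__ == "__main__":
    for persona in BASE_RATES:
        rates = [concession_rate(persona, r) for r in (0.0, 0.2, 0.5, 0.8, 1.0)]
        print(persona, [f"{rate:.3f}" for rate in rates])
```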
| ### Three Personas | |
| #### 1. Cooperative (`single_issue`) | |
| - Friendly, understanding tone | |
| - 5% base concession rate, highly sensitive to rapport | |
| - Accepts early if price is above floor and round ≥ 2 | |
| #### 2. Cash Flow Stressed (`multi_issue`) | |
| - Cares about payment timing more than price | |
| - 7% base concession rate, moderate rapport sensitivity | |
| - Acceptance requires `payment_days ≤ 45` | |
| - Comments on payment timing in responses | |
| #### 3. Aggressive Anchor (`adversarial`) | |
| - Opens at ceiling, hardens with pressure | |
| - 4% base concession rate (least flexible) | |
| - **Penalizes consecutive concessions**: if the agent concedes 2+ rounds in a row, the concession rate drops to 40% of normal | |
| - Uses "hardening" templates when cornered | |
| ### Opponent Response Flow | |
| ```python | |
| def respond(self, agent_message, agent_terms, round_number, consecutive_concessions): | |
|     # Condensed sketch: building `message` and `counter_terms` is elided. | |
|     # 1. Update rapport based on agent's language | |
|     self.update_rapport(agent_message) | |
|     # 2. Check acceptance (from round 2 onward, and price must be >= floor) | |
|     agent_price = agent_terms["price"] | |
|     if round_number >= 2 and agent_price >= self.price_floor and _acceptance_condition(): | |
|         return self.templates["accept"], {**agent_terms, "_accepted": True} | |
|     # 3. Calculate concession rate | |
|     concession = self.get_concession_rate() | |
|     # 4. Aggressive anchor gets harder if detecting concession pattern | |
|     if self.persona == "aggressive_anchor" and consecutive_concessions >= 2: | |
|         concession = concession * 0.4  # 60% reduction! | |
|         template_key = "hardening" | |
|     elif round_number >= 0.7 * max_rounds:  # last 30% of the episode | |
|         template_key = "near_close" | |
|     else: | |
|         template_key = "counter" | |
|     # 5. Compute new position | |
|     new_position = self.current_position * (1 - concession) | |
|     new_position = max(self.price_floor, new_position)  # Never go below floor | |
|     # 6. Return message and counter terms | |
|     return message, counter_terms | |
| ``` | |
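Step 5 compounds: each counter is a percentage cut from the previous position, clamped at the floor. The sketch below traces a few rounds under that rule only, ignoring acceptance and rapport updates. The function name `simulate_counters` and the $52,000 / $44,000 figures are illustrative assumptions borrowed from the `single_issue` task description, not values the source assigns to the aggressive persona.

```python
def simulate_counters(opening: float, floor: float, rate: float, rounds: int) -> list:
    """Apply position * (1 - rate) each round, never dropping below the floor."""
    position = opening
    path = []
    for _ in range(rounds):
        position = max(floor, position * (1 - rate))
        path.append(round(position, 2))
    return path


# Aggressive anchor at neutral rapport (4% per round) ...
normal = simulate_counters(52000.0, 44000.0, 0.04, 4)
# ... versus the same opponent after hardening (rate multiplied by 0.4).
hardened = simulate_counters(52000.0, 44000.0, 0.04 * 0.4, 4)

if __name__ == "__main__":
    print("normal:  ", normal)
    print("hardened:", hardened)
```

Comparing the two paths shows why triggering the hardening branch is costly: after the same number of rounds the hardened opponent has moved far less from its opening position.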
| --- | |
| ## The Grading System | |
| Graders are in `graders.py` and produce scores in [0.0, 1.0]. They are **pure Python with zero LLM calls**, ensuring deterministic, reproducible scoring. | |
| ### Key Design: Relative Scoring | |
| The graders score based on **how much the agent improved from the opponent's opening price**, not on absolute thresholds. This makes the environment learnable: the agent learns to negotiate better deals relative to where negotiations started. | |
| ```python | |
| # Instead of scoring against a hardcoded floor, we score relative to the opening: | |
| value = (opponent_opening - final_price) / (opponent_opening - BUYER_TARGET) | |
| ``` | |
| ### Single Issue Grading | |
| ```python | |
| def grade_single_issue(final_terms, deal_reached, rounds_taken, max_rounds=6, opponent_opening=52000.0): | |
|     if not deal_reached: | |
|         return 0.0 | |
|     final_price = final_terms.get("price", opponent_opening) | |
|     BUYER_TARGET = 38000.0 | |
|     # If price didn't improve from opening, minimal score | |
|     if final_price >= opponent_opening: | |
|         return 0.05 | |
|     # How much did we improve relative to the possible improvement? | |
|     value = (opponent_opening - final_price) / (opponent_opening - BUYER_TARGET) | |
|     value = max(0.0, min(1.0, value)) | |
|     # Efficiency penalty for taking too long | |
|     efficiency = 1.0 - (rounds_taken / max_rounds) ** 1.5 * 0.4 | |
|     efficiency = max(0.1, efficiency)  # Never below 0.1 | |
|     return round(value * efficiency, 4) | |
| ``` | |
| **Example:** | |
| - Opening: $52,000, Target: $38,000, Range: $14,000 | |
| - Final price: $45,000 → improvement: $7,000 → value = 0.50 | |
| - Round 3 → efficiency = 1.0 - (3/6)^1.5 × 0.4 = 0.86 | |
| - **Score: 0.50 × 0.86 = 0.43** | |
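For checking figures like these by hand, the grader can be restated as a self-contained function. This is a sketch, not the repository's code: `single_issue_score` is a hypothetical name, it takes the price directly rather than a `final_terms` dict, and `buyer_target` is exposed as a parameter (defaulting to the listed `BUYER_TARGET`) purely for experimentation.

```python
def single_issue_score(
    final_price: float,
    rounds_taken: int,
    max_rounds: int = 6,
    opponent_opening: float = 52000.0,
    buyer_target: float = 38000.0,
    deal_reached: bool = True,
) -> float:
    """Relative price improvement multiplied by a round-based efficiency factor."""
    if not deal_reached:
        return 0.0
    if final_price >= opponent_opening:
        return 0.05  # deal closed with no improvement over the opening
    value = (opponent_opening - final_price) / (opponent_opening - buyer_target)
    value = max(0.0, min(1.0, value))
    efficiency = max(0.1, 1.0 - (rounds_taken / max_rounds) ** 1.5 * 0.4)
    return round(value * efficiency, 4)


if __name__ == "__main__":
    for price, rounds in [(45000.0, 3), (44000.0, 6), (52000.0, 2)]:
        print(price, rounds, single_issue_score(price, rounds))
```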
| ### Multi-Issue Grading | |
| ```python | |
| def grade_multi_issue(final_terms, deal_reached, rounds_taken, max_rounds=8, opponent_opening=52000.0): | |
|     # Two dimensions: price (70% weight) and payment_days (30% weight) | |
|     price_value = (opponent_opening - final_price) / (opponent_opening - 40000) | |
|     payment_score = (90 - payment_days) / (90 - 30) | |
|     value = 0.70 * price_value + 0.30 * payment_score | |
|     # If price didn't improve but payment did, still score on payment | |
|     if final_price >= opponent_opening: | |
|         value = 0.30 * payment_score  # Only payment matters | |
| ``` | |
| **Example:** | |
| - Price: $44,000 (good), Payment: Net-45 (good), default opening of $52,000 → price_value = 8,000 / 12,000 = 0.67, payment_score = 45 / 60 = 0.75 | |
| - value = 0.70 × 0.67 + 0.30 × 0.75 = 0.69 | |
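The weighted value is easy to explore numerically. The sketch below covers only the `value` term shown in the excerpt; the efficiency factor and deal check are not shown for this grader, so they are left out rather than guessed. `multi_issue_value` is a hypothetical helper name.

```python
def multi_issue_value(
    final_price: float,
    payment_days: int,
    opponent_opening: float = 52000.0,
) -> float:
    """Weighted value: 70% price improvement, 30% payment speed."""
    price_value = (opponent_opening - final_price) / (opponent_opening - 40000)
    payment_score = (90 - payment_days) / (90 - 30)
    if final_price >= opponent_opening:
        return 0.30 * payment_score  # no price improvement: only payment counts
    return 0.70 * price_value + 0.30 * payment_score


if __name__ == "__main__":
    for price, days in [(42000, 30), (42000, 90), (44000, 45), (52000, 30)]:
        print(price, days, round(multi_issue_value(price, days), 3))
```

Holding price fixed and moving from Net-90 to Net-30 adds the full 0.30 payment weight, which is the bundling opportunity the task description points at.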
| ### Adversarial Grading | |
| ```python | |
| def grade_adversarial(final_terms, deal_reached, rounds_taken, consecutive_concessions_flag, ...): | |
|     SURVIVAL_FLOOR = 0.15  # Completing any deal gets at least 0.15 | |
|     # Three dimensions with weights | |
|     value = 0.40 * price_value + 0.35 * payment_score + 0.25 * support_score | |
|     # Pattern penalty: bad if you showed consecutive concessions | |
|     pattern_penalty = 0.10 if consecutive_concessions_flag else 0.0 | |
|     raw = (value * efficiency) - pattern_penalty | |
|     return round(max(SURVIVAL_FLOOR, raw), 4) | |
| ``` | |
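The interaction between the pattern penalty and the survival floor can be seen in a reduced sketch. It assumes a deal was reached and takes the three sub-scores and the efficiency factor as already-computed inputs in [0, 1], since the excerpt elides how they are derived; `adversarial_score` is a hypothetical name.

```python
SURVIVAL_FLOOR = 0.15
PATTERN_PENALTY = 0.10


def adversarial_score(
    price_value: float,
    payment_score: float,
    support_score: float,
    efficiency: float,
    consecutive_concessions_flag: bool,
) -> float:
    """Weighted value times efficiency, minus the pattern penalty, floored at 0.15."""
    value = 0.40 * price_value + 0.35 * payment_score + 0.25 * support_score
    penalty = PATTERN_PENALTY if consecutive_concessions_flag else 0.0
    return round(max(SURVIVAL_FLOOR, value * efficiency - penalty), 4)


if __name__ == "__main__":
    print(adversarial_score(0.6, 0.5, 0.4, 0.9, False))
    print(adversarial_score(0.6, 0.5, 0.4, 0.9, True))
    print(adversarial_score(0.1, 0.1, 0.1, 0.6, True))
```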
| --- | |
| ## The Environment Core | |
| The `ProcureRLEnvironment` class in `server/Procure_RL_environment.py` is the heart of the system. | |
| ### Reset Flow | |
| ```python | |
| def reset(self, seed=None, episode_id=None, **kwargs): | |
|     task_id = kwargs.get("task_id", "single_issue") | |
|     # 1. Set up opponent with seeded RNG | |
|     opponent_seed = hash((seed, task_id)) % (2**32) | |
|     self._opponent = ScriptedPersonaOpponent(task_id=task_id, seed=opponent_seed, persona=...) | |
|     # 2. Get opponent's opening message and terms | |
|     opening_msg, opening_terms = self._opponent.get_opening_message() | |
|     self._opponent_opening_price = opening_terms.get("price", 52000.0) | |
|     # 3. Initialize state | |
|     self._state = NegotiationState( | |
|         task_id=task_id, | |
|         episode_id=episode_id or str(uuid.uuid4())[:8], | |
|         round_number=0, | |
|         rapport_score=0.5,  # Neutral | |
|         ... | |
|     ) | |
|     # 4. Return initial observation | |
|     return NegotiationObservation( | |
|         ... | |
|         supplier_message=opening_msg, | |
|         current_offer=opening_terms, | |
|         ... | |
|     ) | |
| ``` | |
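One caveat on step 1: Python salts the built-in `hash()` of strings per process unless `PYTHONHASHSEED` is fixed, so `hash((seed, task_id))` can produce a different opponent seed in each interpreter run. If cross-process reproducibility matters, a digest-based derivation is one option. The sketch below is an alternative, not what the repository does, and `stable_opponent_seed` is a hypothetical name.

```python
import hashlib


def stable_opponent_seed(seed, task_id: str) -> int:
    """Derive a 32-bit seed from (seed, task_id) that is identical across processes."""
    key = f"{seed}:{task_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big")  # first 4 bytes -> 0 .. 2**32 - 1


if __name__ == "__main__":
    for task in ("single_issue", "multi_issue", "adversarial"):
        print(task, stable_opponent_seed(42, task))
```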
| ### Step Flow | |
| ```python | |
| def step(self, action, **kwargs): | |
|     # 1. Validate action | |
|     if not isinstance(action, NegotiationAction): | |
|         action = NegotiationAction(...)  # Convert from dict | |
|     # 2. Track consecutive concessions (for adversarial opponent) | |
|     if self._prev_agent_price is not None and "price" in action.terms: | |
|         if float(action.terms["price"]) > self._prev_agent_price: | |
|             self._consecutive_concessions += 1  # Agent moved toward opponent | |
|         else: | |
|             self._consecutive_concessions = 0 | |
|         self._prev_agent_price = float(action.terms["price"]) | |
|     # 3. Handle different move types | |
|     if action.move_type in ("make_offer", "bundle"): | |
|         # Get opponent response | |
|         opponent_msg, opponent_terms = self._opponent.respond(...) | |
|         # Check if opponent accepted | |
|         if opponent_terms.get("_accepted"): | |
|             # Episode ends, compute reward | |
|             reward = grade(...) | |
|             return obs_with_reward | |
|         # Otherwise, continue negotiation | |
|         self._last_offer = opponent_terms | |
|         return obs_with_current_state | |
|     if action.move_type == "accept": | |
|         # Agent accepts current terms, episode ends | |
|         reward = grade(...) | |
|         return obs_with_reward | |
|     if action.move_type == "reject": | |
|         if round_number >= max_rounds: | |
|             # Rejected at limit, no reward | |
|             return obs_done_no_reward | |
|         return obs_continue  # Rejected early, keep going | |
| ``` | |
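The concession bookkeeping in step 2 is the part the adversarial task hinges on, so it helps to see it in isolation. The sketch below is a standalone tracker with a hypothetical class name, `ConcessionTracker`; it follows the rule that a buyer offer higher than the previous one counts as a concession and anything else resets the streak.

```python
class ConcessionTracker:
    """Counts how many offers in a row moved the buyer's price upward."""

    def __init__(self) -> None:
        self.prev_price = None
        self.streak = 0

    def observe(self, price: float) -> int:
        """Record an offer and return the current consecutive-concession count."""
        if self.prev_price is not None:
            if price > self.prev_price:
                self.streak += 1  # buyer moved toward the seller
            else:
                self.streak = 0  # held or pulled back: pattern broken
        self.prev_price = price
        return self.streak


if __name__ == "__main__":
    tracker = ConcessionTracker()
    for offer in (45000, 46000, 47000, 47000, 47500):
        print(offer, tracker.observe(offer))
```

Holding the price for one round (the fourth offer in the demo) resets the count, which is how an agent can avoid the opponent's hardening branch.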
| ### State Property | |
| ```python | |
| @property | |
| def state(self) -> NegotiationState: | |
|     return self._state | |
| ``` | |
| Returns the internal `NegotiationState` object, giving access to: | |
| - `round_number` | |
| - `rapport_score` | |
| - `consecutive_concessions` | |
| - `deal_reached` | |
| - `final_terms` | |
| - `cumulative_reward` | |
| --- | |
| ## The Server API | |
| The FastAPI server in `server/app.py` exposes the environment over HTTP and WebSocket. | |
| ### Endpoints | |
| | Endpoint | Method | Description | | |
| |----------|--------|-------------| | |
| | `/health` | GET | Health check | | |
| | `/reset` | POST | Reset environment with `task_id` and `seed` | | |
| | `/step` | POST | Execute an action | | |
| | `/state` | GET | Get current `NegotiationState` | | |
| | `/ws` | WS | WebSocket for persistent sessions | | |
| ### Request/Response Examples | |
| **POST /reset** | |
| ```json | |
| // Request | |
| {"task_id": "single_issue", "seed": 42} | |
| // Response | |
| { | |
| "task_id": "single_issue", | |
| "round_number": 0, | |
| "max_rounds": 6, | |
| "supplier_message": "Thanks for reaching out. Our standard pricing for this package is $52,400. Happy to discuss.", | |
| "current_offer": {"price": 52400.0}, | |
| "buyer_constraints": {"price": {"target": 36000, "worst": 55000, "budget": 53000}}, | |
| "rapport_hint": "neutral", | |
| "done": false | |
| } | |
| ``` | |
| **POST /step** | |
| ```json | |
| // Request | |
| {"move_type": "make_offer", "terms": {"price": 48000}, "message": "I appreciate your flexibility and would like to find a fair price for both parties."} | |
| // Response | |
| { | |
| "observation": { | |
| "task_id": "single_issue", | |
| "round_number": 1, | |
| "max_rounds": 6, | |
| "supplier_message": "I appreciate you working with us. Based on our costs, $49,800 is where we can be.", | |
| "current_offer": {"price": 49800.0}, | |
| "rapport_hint": "positive", | |
| "done": false | |
| }, | |
| "reward": 0.0, | |
| "done": false, | |
| "info": {} | |
| } | |
| ``` | |
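A client for these endpoints needs nothing beyond the standard library. The sketch below assumes the server is reachable at `http://localhost:7860` (the local Docker setup described later) and that request bodies have exactly the shapes shown above; `build_post` and `call` are hypothetical helper names.

```python
import json
from urllib import request

BASE_URL = "http://localhost:7860"  # assumption: server running locally


def build_post(path: str, payload: dict) -> request.Request:
    """Build a JSON POST request for one of the environment endpoints."""
    data = json.dumps(payload).encode("utf-8")
    return request.Request(
        BASE_URL + path,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(req: request.Request) -> dict:
    """Send the request and decode the JSON response body."""
    with request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage against a running server:
#   obs = call(build_post("/reset", {"task_id": "single_issue", "seed": 42}))
#   result = call(build_post("/step", {
#       "move_type": "make_offer",
#       "terms": {"price": 48000},
#       "message": "I appreciate your flexibility.",
#   }))
```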
| ### Key Implementation Detail: Lambda Closure | |
| ```python | |
| _env_instance = ProcureRLEnvironment() | |
| app = create_app( | |
|     lambda: _env_instance,  # Factory always returns the one shared instance | |
|     NegotiationAction, | |
|     NegotiationObservation, | |
|     env_name="ProcureRL", | |
|     max_concurrent_envs=1, | |
| ) | |
| ``` | |
| If the class itself were passed as the factory, `create_app()` would call it for each request and get a **fresh environment** every time instead of reusing the same one. The lambda always returns the module-level `_env_instance`, so all requests share the same environment. | |
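The difference between the two kinds of factory shows up clearly in a toy model. The code below does not use OpenEnv at all: `CounterEnv` and `handle_requests` are stand-ins invented for illustration, with `handle_requests` simply calling the factory once per simulated request.

```python
class CounterEnv:
    """Stand-in environment that only remembers how many steps it has seen."""

    def __init__(self) -> None:
        self.steps = 0

    def step(self) -> int:
        self.steps += 1
        return self.steps


def handle_requests(env_factory, n: int) -> list:
    """Simulate n requests, each obtaining its environment from the factory."""
    return [env_factory().step() for _ in range(n)]


shared = CounterEnv()
fresh_results = handle_requests(CounterEnv, 3)        # new env per request
shared_results = handle_requests(lambda: shared, 3)   # same env every time

if __name__ == "__main__":
    print("class as factory: ", fresh_results)
    print("lambda over shared:", shared_results)
```

With the class as factory every request starts from step 1, so a negotiation could never progress past its first round; with the shared instance the state carries over between requests.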
| --- | |
| ## The Inference Script | |
| `inference.py` is a baseline agent that runs an LLM against the environment. | |
| ### Output Format (Sacred) | |
| The script MUST output exactly: | |
| ``` | |
| [START] task=single_issue env=procure-rl model=Qwen/Qwen2.5-72B-Instruct | |
| [STEP] step=1 action=make_offer({"price": 45000}) reward=0.00 done=false error=null | |
| [STEP] step=2 action=accept({}) reward=0.47 done=true error=null | |
| [END] success=true steps=2 score=0.47 rewards=0.00,0.47 | |
| ``` | |
| Any deviation from this format causes validation to fail. | |
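Since the format is strict, it is safer to build each line with a dedicated formatter than with ad hoc f-strings. The helpers below are a sketch whose output matches the sample lines above; the function names are hypothetical, and rendering terms with `json.dumps` is an assumption inferred from the sample (`{"price": 45000}`).

```python
import json


def format_start(task: str, env: str, model: str) -> str:
    return f"[START] task={task} env={env} model={model}"


def format_step(step: int, move_type: str, terms: dict, reward: float,
                done: bool, error=None) -> str:
    action = f"{move_type}({json.dumps(terms)})"
    err = "null" if error is None else error
    return (f"[STEP] step={step} action={action} reward={reward:.2f} "
            f"done={str(done).lower()} error={err}")


def format_end(success: bool, steps: int, score: float, rewards: list) -> str:
    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
    return (f"[END] success={str(success).lower()} steps={steps} "
            f"score={score:.2f} rewards={rewards_str}")


if __name__ == "__main__":
    print(format_start("single_issue", "procure-rl", "Qwen/Qwen2.5-72B-Instruct"))
    print(format_step(1, "make_offer", {"price": 45000}, 0.0, False))
    print(format_step(2, "accept", {}, 0.47, True))
    print(format_end(True, 2, 0.47, [0.0, 0.47]))
```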
| ### How It Works | |
| ```python | |
| def run_task(task_id): | |
|     env = ProcureRLEnvironment() | |
|     obs = env.reset(task_id=task_id, seed=42) | |
|     print(f"[START] task={task_id} ...") | |
|     done, step, final_score = False, 0, 0.0  # loop state | |
|     while not done and step < MAX_STEPS: | |
|         step += 1 | |
|         # 1. Get action from LLM | |
|         action_dict = get_agent_action(obs_to_dict(obs)) | |
|         # 2. Convert to NegotiationAction | |
|         action = NegotiationAction( | |
|             move_type=action_dict.get("move_type", "make_offer"), | |
|             terms=action_dict.get("terms", {}), | |
|             message=action_dict.get("message", "") | |
|         ) | |
|         # 3. Step environment | |
|         obs = env.step(action) | |
|         reward = obs.reward or 0.0  # reward is None until the episode ends | |
|         # 4. Print step result | |
|         print(f"[STEP] step={step} action={...} reward={reward:.2f} ...") | |
|         if obs.done: | |
|             final_score = reward | |
|             break | |
|     print(f"[END] success={...} steps={step} score={final_score:.2f} ...") | |
| ``` | |
| ### LLM Prompt | |
| ```python | |
| SYSTEM_PROMPT = """You are a professional procurement negotiator. Your goal is to negotiate the best possible deal for your company. | |
| You will receive a supplier's message and current offer terms. You must respond with a JSON action: | |
| { | |
| "move_type": "make_offer", | |
| "terms": {"price": 42000, "payment_days": 45}, | |
| "message": "Your natural language response to the supplier" | |
| } | |
| move_type must be one of: make_offer, accept, reject, bundle | |
| message should be professional and collaborative when possible.""" | |
| ``` | |
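Model replies often wrap the JSON in prose or code fences, so the agent side needs a tolerant parser before it can build a `NegotiationAction`. The sketch below is one way to do that; `parse_action` is a hypothetical helper (the source only names `get_agent_action`), and falling back to an empty `make_offer` mirrors the defaults used in `run_task`.

```python
import json
import re

VALID_MOVES = {"make_offer", "accept", "reject", "bundle"}


def parse_action(raw: str) -> dict:
    """Extract the first {...} span from an LLM reply and normalize its fields."""
    fallback = {"move_type": "make_offer", "terms": {}, "message": ""}
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        return fallback
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return fallback
    move = data.get("move_type")
    if move not in VALID_MOVES:
        move = "make_offer"  # unknown move types degrade to a plain offer
    terms = data.get("terms")
    if not isinstance(terms, dict):
        terms = {}
    return {"move_type": move, "terms": terms, "message": str(data.get("message", ""))}


if __name__ == "__main__":
    reply = 'Sure!\n{"move_type": "bundle", "terms": {"price": 43000, "payment_days": 30}, "message": "A fair package for both of us."}'
    print(parse_action(reply))
```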
| --- | |
| ## End-to-End Example | |
| Here's a full negotiation episode for `single_issue`: | |
| ### Round 0: Reset | |
| ```python | |
| env.reset(task_id="single_issue", seed=42) | |
| # Returns: | |
| # supplier_message: "Thanks for reaching out. Our standard pricing for this package is $52,400..." | |
| # current_offer: {"price": 52400.0} | |
| # buyer_constraints: {"price": {"target": 36000, ...}} | |
| # rapport_hint: "neutral" | |
| ``` | |
| ### Round 1: Agent Makes Offer with Collaborative Language | |
| ```python | |
| action = NegotiationAction( | |
| move_type="make_offer", | |
| terms={"price": 48000}, | |
| message="I value our potential partnership and believe we can find a fair price that works for both of us. We're flexible on timeline." | |
| ) | |
| obs = env.step(action) | |
| # Returns: | |
| # supplier_message: "I appreciate you working with us. Based on our costs, $49,600 is where we can be." | |
| # current_offer: {"price": 49600.0} | |
| # rapport_hint: "positive" (because message contained collaborative signals) | |
| # reward: 0.0 (still negotiating, no reward yet) | |
| ``` | |
| ### Round 2: Agent Concedes | |
| ```python | |
| action = NegotiationAction( | |
| move_type="make_offer", | |
| terms={"price": 47000}, | |
| message="I understand your cost constraints. Let's work together to find a solution." | |
| ) | |
| obs = env.step(action) | |
| # Returns: | |
| # supplier_message: "I think we're close. If you can do $46,700, I can get this approved today." | |
| # current_offer: {"price": 46700.0} | |
| # rapport_hint: "positive" | |
| ``` | |
| ### Round 3: Agent Concedes Again (Consecutive!) | |
| ```python | |
| action = NegotiationAction( | |
| move_type="make_offer", | |
| terms={"price": 46000}, | |
| message="We can move to $46,000 as a final compromise." | |
| ) | |
| obs = env.step(action) | |
| # Returns: | |
| # supplier_message: "That works for us. Let's move forward at those terms." | |
| # done: true | |
| # reward: 0.38 (see "Grading This Episode") | |
| # info: {"deal_price": 46000} | |
| ``` | |
| ### Grading This Episode | |
| The arithmetic follows `grade_single_issue` as listed earlier, which hardcodes `BUYER_TARGET = 38000.0` (the `buyer_constraints` target shown to the agent is $36,000). | |
| - Opening: $52,400 | |
| - Target: $38,000 | |
| - Range: $14,400 | |
| - Improvement: $52,400 - $46,000 = $6,400 | |
| - value = $6,400 / $14,400 = 0.44 | |
| - Round 3 → efficiency = 1.0 - (3/6)^1.5 × 0.4 = 0.86 | |
| - **Score: 0.44 × 0.86 = 0.38** | |
| --- | |
| ## Docker Deployment | |
| ### Dockerfile | |
| ```dockerfile | |
| FROM python:3.11-slim | |
| WORKDIR /app | |
| COPY requirements.txt . | |
| RUN pip install --no-cache-dir -r requirements.txt | |
| COPY . . | |
| ENV PORT=7860 | |
| EXPOSE 7860 | |
| CMD ["python", "-m", "uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "7860"] | |
| ``` | |
| Key points: | |
| - Port **7860** (not 8000): the default port HF Spaces expects for Docker Spaces | |
| - `ENV PORT=7860` sets the `PORT` variable for the app; the `CMD` also passes `--port 7860` explicitly | |
| - Uses `python -m uvicorn` with full module path | |
| ### Running | |
| ```bash | |
| # Build | |
| docker build -t procure-rl . | |
| # Run | |
| docker run -p 7860:7860 procure-rl | |
| # Test | |
| curl -X POST http://localhost:7860/reset -H "Content-Type: application/json" -d '{"task_id": "single_issue"}' | |
| ``` | |
| ### Health Check | |
| The server exposes a health endpoint: | |
| ```json | |
| GET /health → {"status": "ok", "service": "procure-rl"} | |
| ``` | |
| --- | |
| ## Calibration and Testing | |
| ### Test Files | |
| #### `test_graders.py` | |
| Verifies all graders return scores in [0.0, 1.0] range, even with edge cases. | |
| #### `test_rl_properties.py` | |
| Tests fundamental RL properties: | |
| 1. **Reproducibility**: Same seed → Same opening message | |
| 2. **Language sensitivity**: Collaborative language → Higher rapport | |
| 3. **Sequential decisions**: Consecutive concessions tracked in state | |
| 4. **Delayed reward**: Only terminal state has non-zero reward | |
| 5. **Accept terminates**: `move_type="accept"` ends episode | |
| 6. **Reset cleans state**: Fresh state after reset | |
| #### `test_calibration.py` | |
| Verifies score spread between random and strategic agents: | |
| ``` | |
| single_issue: Random avg=0.371, Strategic avg=0.487, Spread=0.116 ✓ | |
| multi_issue: Random avg=0.364, Strategic avg=0.535, Spread=0.171 ✓ | |
| adversarial: Random avg=0.304, Strategic avg=0.607, Spread=0.303 ✓ | |
| ``` | |
| A healthy spread means the environment actually differentiates good vs bad behavior. | |
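The spread itself is just the difference between the two agents' mean scores. The sketch below recomputes it from the averages reported above; `score_spread` is a hypothetical helper for use with raw per-episode scores, and no pass/fail threshold is implied because the source does not state one.

```python
from statistics import mean


def score_spread(random_scores, strategic_scores) -> float:
    """Mean strategic score minus mean random score."""
    return mean(strategic_scores) - mean(random_scores)


# Averages reported by test_calibration.py: (random, strategic)
REPORTED = {
    "single_issue": (0.371, 0.487),
    "multi_issue": (0.364, 0.535),
    "adversarial": (0.304, 0.607),
}

spreads = {
    task: round(strategic - rand, 3)
    for task, (rand, strategic) in REPORTED.items()
}

if __name__ == "__main__":
    for task, spread in spreads.items():
        print(f"{task}: spread={spread:.3f}")
```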
| ### Score Calibration Targets | |
| | Task | Random Agent | Base LLM | Goal (Trained) | | |
| |------|-------------|----------|-----------------| | |
| | single_issue | 0.15–0.25 | 0.35–0.45 | 0.68–0.78 | | |
| | multi_issue | 0.08–0.15 | 0.20–0.30 | 0.55–0.65 | | |
| | adversarial | 0.03–0.10 | 0.12–0.20 | 0.45–0.55 | | |
| --- | |
| ## Summary: How Everything Fits Together | |
| ``` | |
| User runs inference.py | |
|   │ | |
|   ▼ | |
| LLM agent receives observation (supplier message, current offer, constraints) | |
|   │ | |
|   ▼ | |
| LLM decides action (make_offer with terms + collaborative message) | |
|   │ | |
|   ▼ | |
| Environment.step(action) is called | |
|   │ | |
|   ├──▶ Opponent responds (language → rapport → concession rate → counter) | |
|   │ | |
|   ├──▶ State is updated (round_number++, rapport_score, consecutive_concessions) | |
|   │ | |
|   ├──▶ Observation returned (supplier_message, current_offer, rapport_hint) | |
|   │ | |
|   ▼ | |
| If episode done: Grader scores the deal (relative to opening price, efficiency, patterns) | |
|   │ | |
|   ▼ | |
| Score in [0.0, 1.0] returned | |
| ``` | |
| The agent learns through many episodes: | |
| - **What language gets better rapport** → better concession rates | |
| - **When to concede vs hold** → efficiency bonus | |
| - **How to bundle multiple issues** → multi-issue tasks | |
| - **How to avoid consecutive concession patterns** → adversarial task | |
| The environment is designed to be learnable but not trivial, requiring genuine strategic thinking from an LLM agent. | |