# ProcureRL: A Deep Dive
## Table of Contents
1. [What is ProcureRL?](#what-is-procurerl)
2. [Why Does This Exist?](#why-does-this-exist)
3. [The Big Picture Architecture](#the-big-picture-architecture)
4. [The Three Tasks](#the-three-tasks)
5. [Data Models: What's Floating Around](#data-models-whats-floating-around)
6. [The Scripted Opponent System](#the-scripted-opponent-system)
7. [The Grading System](#the-grading-system)
8. [The Environment Core](#the-environment-core)
9. [The Server API](#the-server-api)
10. [The Inference Script](#the-inference-script)
11. [End-to-End Example](#end-to-end-example)
12. [Docker Deployment](#docker-deployment)
13. [Calibration and Testing](#calibration-and-testing)
14. [Summary: How Everything Fits Together](#summary-how-everything-fits-together)
---
## What is ProcureRL?
ProcureRL is an **OpenEnv-compliant Reinforcement Learning environment** where an LLM (Large Language Model) agent learns to negotiate procurement deals against scripted supplier opponents.
In simpler terms: it's a training ground for AI to practice negotiation, like a flight simulator but for procurement conversations.
### The Core Innovation: Language-Sensitive Opponent
What makes ProcureRL special is that the opponent's behavior **responds to the quality of the agent's natural language**, not just the prices offered. This means:
- An agent that outputs aggressive or low-effort language gets a **tough, unyielding opponent**
- An agent that outputs collaborative, professional language gets a **more cooperative, flexible opponent**
The language IS the policy, not just the action space. This makes an LLM genuinely required, not incidental.
---
## Why Does This Exist?
Real-world procurement negotiation is:
- **Sequential**: one decision affects the next
- **Hidden utility**: the opponent's real priorities are not revealed
- **Language-dependent**: how you say things matters as much as what you offer
- **High-stakes**: Walmart has deployed AI (Pactum) for exactly this, and a reported 90% of CPOs are adopting AI negotiation in 2025
Traditional rule-based negotiation tools are limited. An RL-trained LLM policy can learn to navigate this complexity in ways that static rules cannot.
---
## The Big Picture Architecture
```
┌──────────────────────────────────────────────────────────────┐
│                       ProcureRL System                       │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────────────┐     ┌──────────────────┐               │
│  │  LLM Agent       │────▶│  Environment     │               │
│  │  (inference.py)  │     │  (Procure_RL_    │               │
│  │                  │     │  environment.py) │               │
│  └──────────────────┘     └────────┬─────────┘               │
│                                    │                         │
│                                    ▼                         │
│                           ┌──────────────────┐               │
│                           │  Scripted        │               │
│                           │  Opponent        │               │
│                           │  (opponent.py)   │               │
│                           └────────┬─────────┘               │
│                                    │                         │
│                                    ▼                         │
│                           ┌──────────────────┐               │
│                           │  Graders         │               │
│                           │  (graders.py)    │               │
│                           └──────────────────┘               │
│                                                              │
│  ┌──────────────────┐     ┌──────────────────┐               │
│  │  Server API      │     │  OpenEnv.yaml    │               │
│  │  (server/app.py) │     │  (manifest)      │               │
│  └────────┬─────────┘     └──────────────────┘               │
│           │                                                  │
│           ▼                                                  │
│  ┌──────────────────┐                                        │
│  │ Docker Container │◀── HF Spaces Deployment                │
│  │ (port 7860)      │                                        │
│  └──────────────────┘                                        │
└──────────────────────────────────────────────────────────────┘
```
The system is designed so that:
1. **Environment** is deterministic and reproducible (seeded RNG)
2. **Opponent** responds to language quality (via rapport system)
3. **Graders** produce bounded [0.0, 1.0] scores
4. **Server** exposes everything over HTTP for OpenEnv compliance
5. **Inference** runs a baseline LLM agent against the environment
---
## The Three Tasks
ProcureRL includes three tasks of increasing difficulty:
### Task 1: `single_issue` (Easy)
**Scenario:** Software license renewal. Price only.
```
Buyer Target: $36,000
Seller Opens: ~$52,000 (varies by seed)
Seller Floor: ~$44,000 (varies by seed)
Max Rounds: 6
Opponent Persona: Cooperative
```
The agent must negotiate the price down from opening to target. The cooperative opponent starts friendly and remains fairly flexible.
**Example Grading:**
- Deal at $38K in round 2: ~0.85 score
- Deal at $44K in round 6: ~0.35 score
- No deal: 0.0 score
### Task 2: `multi_issue` (Medium)
**Scenario:** Enterprise software negotiation with price AND payment terms.
```
Issues: price ($40K-$58K) + payment_days (30-90)
Opponent Persona: Cash Flow Stressed
  → Cares more about getting paid quickly (payment_weight: 0.65)
  → Cares less about final price (price_weight: 0.35)
Max Rounds: 8
```
**The Strategic Opportunity:** If the agent offers Net-30 or Net-45 payment terms, the opponent becomes more flexible on price. A naive agent treats both issues equally and scores low. A smart agent bundles payment speed with price negotiation.
**Example Grading:**
- Price $42K + Net-30 payment: ~0.60 score
- Price $42K + Net-90 payment: ~0.35 score
- No deal: 0.0 score
### Task 3: `adversarial` (Hard)
**Scenario:** Large contract with three issues: price, payment, and support hours.
```
Issues: price + payment_days + support_hours
Opponent Persona: Aggressive Anchor
  → Opens at ceiling on all issues
  → Hardens position if agent makes consecutive concessions
  → Rapport-sensitive but requires consistent collaborative framing
Max Rounds: 10
Survival Floor: 0.15 (completing any deal gets at least 0.15)
```
**The Challenge:** If the agent concedes on price in 2+ consecutive rounds, the opponent recognizes this pattern and becomes much harder to negotiate with. The agent must resist anchoring, break consecutive concession patterns, and maintain collaborative tone under pressure.
**Example Grading:**
- Strategic deal with no consecutive concessions: ~0.50 score
- Same deal but with consecutive concession pattern: ~0.40 score
- Survival deal (just complete): 0.15 score
---
## Data Models: What's Floating Around
The system uses three Pydantic models defined in `models.py`:
### `NegotiationAction`
What the agent sends to the environment:
```python
from typing import Any, Dict
from pydantic import BaseModel

class NegotiationAction(BaseModel):
    move_type: str          # "make_offer" | "accept" | "reject" | "bundle"
    terms: Dict[str, Any]   # {"price": 42000, "payment_days": 45}
    message: str = ""       # Natural language; affects opponent rapport!
```
**Important:** The `message` field is not just flavor text. It directly affects opponent behavior through the rapport system.
### `NegotiationObservation`
What the environment sends back to the agent after each step:
```python
class NegotiationObservation(BaseModel):
    task_id: str                           # Which task we're running
    round_number: int                      # Current round (0 to max_rounds)
    max_rounds: int                        # Task's round limit
    supplier_message: str                  # Opponent's latest message
    current_offer: Dict[str, Any]          # Terms currently on the table
    last_4_exchanges: List[Dict]           # Recent conversation history
    buyer_constraints: Dict[str, Any]      # Agent's targets and limits
    rapport_hint: str                      # "positive" | "neutral" | "negative"
    done: bool                             # Is episode finished?
    reward: Optional[float] = None         # Reward (only on done)
    metadata: Dict[str, Any] = Field(...)  # Extra info (deal_price, errors)
```
### `NegotiationState`
The environment's internal state (accessible via `env.state`):
```python
class NegotiationState(BaseModel):
    task_id: str = ""
    episode_id: str = ""
    round_number: int = 0
    rapport_score: float = 0.5           # 0.0 to 1.0, starts neutral
    consecutive_concessions: int = 0     # Tracks concession patterns
    deal_reached: bool = False
    final_terms: Optional[Dict] = None   # Set when episode ends
    cumulative_reward: float = 0.0
---
## The Scripted Opponent System
The opponent is implemented in `opponent.py` as the `ScriptedPersonaOpponent` class.
### The Rapport System (Language Sensitivity)
The key mechanism is **rapport**: a score from 0.0 to 1.0 that changes based on the agent's language quality.
**Collaborative Signals (increase rapport):**
```python
COLLABORATIVE_SIGNALS = [
    "understand", "partnership", "mutual", "together", "value",
    "appreciate", "flexible", "work with", "long-term", "relationship",
    "reasonable", "fair", "both", "solution"
]
```
**Aggressive Signals (decrease rapport):**
```python
AGGRESSIVE_SIGNALS = [
    "demand", "require", "final offer", "unacceptable", "must",
    "non-negotiable", "take it or leave", "bottom line", "ultimatum",
    "insist", "refuse", "absolutely not"
]
```
**How it works:**
```python
def update_rapport(self, agent_message: str) -> None:
    msg_lower = agent_message.lower()
    delta = 0.0
    delta += sum(0.08 for w in COLLABORATIVE_SIGNALS if w in msg_lower)
    delta -= sum(0.08 for w in AGGRESSIVE_SIGNALS if w in msg_lower)
    delta = max(-0.20, min(0.20, delta))  # Cap per-round change
    self.rapport = max(0.0, min(1.0, self.rapport + delta))
```
Every message the agent sends adjusts rapport by ±0.08 per keyword detected, capped at ±0.20 per round.
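To make the cap concrete, here is a minimal standalone sketch of the same update rule written as a pure function. The helper name `rapport_after` and the sample messages are illustrative and not part of `opponent.py`; the signal lists are copied from above.

```python
COLLABORATIVE_SIGNALS = [
    "understand", "partnership", "mutual", "together", "value",
    "appreciate", "flexible", "work with", "long-term", "relationship",
    "reasonable", "fair", "both", "solution",
]
AGGRESSIVE_SIGNALS = [
    "demand", "require", "final offer", "unacceptable", "must",
    "non-negotiable", "take it or leave", "bottom line", "ultimatum",
    "insist", "refuse", "absolutely not",
]


def rapport_after(rapport: float, message: str) -> float:
    """Return the rapport after one agent message (hypothetical helper)."""
    msg = message.lower()
    delta = sum(0.08 for w in COLLABORATIVE_SIGNALS if w in msg)
    delta -= sum(0.08 for w in AGGRESSIVE_SIGNALS if w in msg)
    delta = max(-0.20, min(0.20, delta))  # cap the per-round change
    return max(0.0, min(1.0, rapport + delta))


# Three collaborative keywords ("appreciate", "fair", "both") would add 0.24,
# but the cap limits the change to +0.20.
friendly = rapport_after(0.5, "I appreciate this and want a fair deal for both sides.")
# One aggressive keyword ("final offer") subtracts 0.08.
hostile = rapport_after(0.5, "This is my final offer.")
print(round(friendly, 2), round(hostile, 2))  # 0.7 0.42
```

Because the check is plain substring matching, a word like "unfair" still counts as the collaborative signal "fair". That is worth keeping in mind when reading rapport changes in logs.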
### Concession Rate: How Fast the Opponent Moves
Rapport directly modifies the opponent's concession rate:
```python
def get_concession_rate(self) -> float:
    base_rates = {
        "cooperative": 0.05,         # 5% per round base
        "cash_flow_stressed": 0.07,
        "aggressive_anchor": 0.04,
    }
    base = base_rates[self.persona]
    modifier = (self.rapport - 0.5) * base  # +/- 50% of base
    return max(0.01, base + modifier)
```
**Example:** A cooperative opponent with high rapport (0.8) concedes at 0.05 + (0.8 - 0.5) × 0.05 = **6.5% per round**. With low rapport (0.2), it concedes at 0.05 + (0.2 - 0.5) × 0.05 = **3.5% per round**. At the extremes (rapport 1.0 and 0.0) the rate reaches 7.5% and 2.5%.
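The same formula can be tabulated for all three personas. The sketch below mirrors `get_concession_rate()` as a pure function; the name `concession_rate` and the module-level `BASE_RATES` dict are illustrative.

```python
BASE_RATES = {
    "cooperative": 0.05,
    "cash_flow_stressed": 0.07,
    "aggressive_anchor": 0.04,
}


def concession_rate(persona: str, rapport: float) -> float:
    """Pure-function mirror of get_concession_rate() (illustrative)."""
    base = BASE_RATES[persona]
    modifier = (rapport - 0.5) * base  # ranges over +/- 50% of base
    return max(0.01, base + modifier)


# Print the rate at minimum, neutral, and maximum rapport for each persona.
for persona in BASE_RATES:
    low, mid, high = (concession_rate(persona, r) for r in (0.0, 0.5, 1.0))
    print(f"{persona:20s} low={low:.3f} neutral={mid:.3f} high={high:.3f}")
```

With these base rates the lowest reachable value is 0.02 (aggressive anchor at rapport 0.0), so the `max(0.01, ...)` guard never binds for rapport in [0.0, 1.0].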
### Three Personas
#### 1. Cooperative (`single_issue`)
- Friendly, understanding tone
- 5% base concession rate, highly sensitive to rapport
- Accepts early if price is above floor and round ≥ 2
#### 2. Cash Flow Stressed (`multi_issue`)
- Cares about payment timing more than price
- 7% base concession rate, moderate rapport sensitivity
- Acceptance requires `payment_days ≤ 45`
- Comments on payment timing in responses
#### 3. Aggressive Anchor (`adversarial`)
- Opens at ceiling, hardens with pressure
- 4% base concession rate (least flexible)
- **Penalizes consecutive concessions**: if agent concedes 2+ rounds in a row, concession rate drops to 40% of normal
- Uses "hardening" templates when cornered
### Opponent Response Flow
```python
def respond(self, agent_message, agent_terms, round_number, consecutive_concessions):
    # 1. Update rapport based on agent's language
    self.update_rapport(agent_message)
    agent_price = agent_terms.get("price", 0.0)

    # 2. Check acceptance (from round 2 onward, and price must be >= floor)
    if round_number >= 2 and agent_price >= self.price_floor and _acceptance_condition():
        return self.templates["accept"], {**agent_terms, "_accepted": True}

    # 3. Calculate concession rate
    concession = self.get_concession_rate()

    # 4. Aggressive anchor gets harder if detecting concession pattern
    if self.persona == "aggressive_anchor" and consecutive_concessions >= 2:
        concession = concession * 0.4  # 60% reduction!
        template_key = "hardening"
    elif round_number >= 0.7 * self.max_rounds:  # last 30% of the episode (attribute name assumed)
        template_key = "near_close"
    else:
        template_key = "counter"

    # 5. Compute new position
    new_position = self.current_position * (1 - concession)
    new_position = max(self.price_floor, new_position)  # Never go below floor

    # 6. Return message and counter terms
    #    (message is rendered from self.templates[template_key]; construction not shown)
    return message, counter_terms
```
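The effect of step 4 on the seller's price path can be seen with a small simulation. This is a sketch only: the opening price, floor, and the helper name `simulate_positions` are hypothetical inputs chosen for illustration, since the adversarial task's actual numbers are not listed here.

```python
def simulate_positions(opening, floor, rate, rounds, hardened_from=None):
    """Apply position *= (1 - rate) each round, clamped at the floor.

    hardened_from: first round (1-based) at which the 0.4 multiplier applies,
    or None for no hardening. All arguments are illustrative.
    """
    position = opening
    history = []
    for rnd in range(1, rounds + 1):
        hardened = hardened_from is not None and rnd >= hardened_from
        effective_rate = rate * 0.4 if hardened else rate
        position = max(floor, position * (1 - effective_rate))
        history.append(round(position, 2))
    return history


# Hypothetical numbers: opening $60,000, floor $50,000, 4% base rate.
normal = simulate_positions(60000.0, 50000.0, 0.04, rounds=5)
hardened = simulate_positions(60000.0, 50000.0, 0.04, rounds=5, hardened_from=3)
print("normal:  ", normal)
print("hardened:", hardened)
```

In this run the unhardened seller reaches the floor by round 5, while the hardened seller is still well above it, which is the cost of triggering the consecutive-concession pattern.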
---
## The Grading System
Graders are in `graders.py` and produce scores in [0.0, 1.0]. They are **pure Python with zero LLM calls**, ensuring deterministic, reproducible scoring.
### Key Design: Relative Scoring
The graders score based on **how much the agent improved from the opponent's opening price**, not on absolute thresholds. This makes the environment learnable: the agent learns to negotiate better deals relative to where negotiations started.
```python
# Instead of scoring against a hardcoded floor, we score relative to the opening:
value = (opponent_opening - final_price) / (opponent_opening - BUYER_TARGET)
```
### Single Issue Grading
```python
def grade_single_issue(final_terms, deal_reached, rounds_taken, max_rounds=6, opponent_opening=52000.0):
    if not deal_reached:
        return 0.0
    final_price = final_terms.get("price", opponent_opening)
    BUYER_TARGET = 38000.0
    # If price didn't improve from opening, minimal score
    if final_price >= opponent_opening:
        return 0.05
    # How much did we improve relative to the possible improvement?
    value = (opponent_opening - final_price) / (opponent_opening - BUYER_TARGET)
    value = max(0.0, min(1.0, value))
    # Efficiency penalty for taking too long
    efficiency = 1.0 - (rounds_taken / max_rounds) ** 1.5 * 0.4
    efficiency = max(0.1, efficiency)  # Never below 0.1
    return round(value * efficiency, 4)
```
**Example:**
- Opening: $52,000, Target: $38,000, Range: $14,000
- Final price: $45,000 → improvement: $7,000 → value = 0.50
- Round 3 → efficiency = 1.0 - (3/6)^1.5 × 0.4 = 0.86
- **Score: 0.50 × 0.86 = 0.43**
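The efficiency multiplier is easy to misjudge by hand, so the sketch below tabulates it for every round of a 6-round task and reproduces the worked score. The helper name `efficiency` is illustrative; the formula is the one from `grade_single_issue`.

```python
def efficiency(rounds_taken: int, max_rounds: int = 6) -> float:
    """Efficiency multiplier from grade_single_issue (illustrative helper)."""
    return max(0.1, 1.0 - (rounds_taken / max_rounds) ** 1.5 * 0.4)


# Multiplier for each possible closing round.
for r in range(1, 7):
    print(f"round {r}: efficiency = {efficiency(r):.3f}")

# Worked example: $45,000 deal from a $52,000 opening with a $38,000 target.
value = (52000 - 45000) / (52000 - 38000)   # 0.50
score = round(value * efficiency(3), 4)
print(score)  # 0.4293
```

Since the penalty term is at most 0.4, the multiplier never drops below 0.6 when `rounds_taken <= max_rounds`, so the `max(0.1, ...)` guard only matters if an episode somehow exceeds its round limit.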
### Multi-Issue Grading
```python
def grade_multi_issue(final_terms, deal_reached, rounds_taken, max_rounds=8, opponent_opening=52000.0):
    # Excerpt: final_price and payment_days are read from final_terms (not shown)
    # Two dimensions: price (70% weight) and payment_days (30% weight)
    price_value = (opponent_opening - final_price) / (opponent_opening - 40000)
    payment_score = (90 - payment_days) / (90 - 30)
    value = 0.70 * price_value + 0.30 * payment_score
    # If price didn't improve but payment did, still score on payment
    if final_price >= opponent_opening:
        value = 0.30 * payment_score  # Only payment matters
```
**Example:**
- Price: $44,000 (good), Payment: Net-45 (good) → with the default opening of $52,000, price_value = 8,000 / 12,000 = 0.67 and payment_score = 45 / 60 = 0.75
- value = 0.70×0.67 + 0.30×0.75 = 0.69
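The weighted value (before any efficiency factor) can be checked with a few lines. This sketch follows the excerpt above, including the payment-only fallback; the function name `multi_issue_value` is hypothetical and no clamping is applied because the excerpt shows none.

```python
def multi_issue_value(final_price, payment_days, opponent_opening=52000.0):
    """Weighted value from the grade_multi_issue excerpt (illustrative)."""
    payment_score = (90 - payment_days) / (90 - 30)
    if final_price >= opponent_opening:
        # Price did not improve: only the payment dimension counts.
        return 0.30 * payment_score
    price_value = (opponent_opening - final_price) / (opponent_opening - 40000)
    return 0.70 * price_value + 0.30 * payment_score


print(round(multi_issue_value(44000, 45), 2))   # 0.69
print(round(multi_issue_value(42000, 30), 2))   # 0.88
print(round(multi_issue_value(42000, 90), 2))   # 0.58
print(round(multi_issue_value(52000, 30), 2))   # 0.3
```

The gap between the Net-30 and Net-90 rows at the same price shows how much of the score rides on payment terms alone.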
### Adversarial Grading
```python
def grade_adversarial(final_terms, deal_reached, rounds_taken, consecutive_concessions_flag, ...):
    SURVIVAL_FLOOR = 0.15  # Completing any deal gets at least 0.15
    # Three dimensions with weights
    value = 0.40 * price_value + 0.35 * payment_score + 0.25 * support_score
    # Pattern penalty: bad if you showed consecutive concessions
    pattern_penalty = 0.10 if consecutive_concessions_flag else 0.0
    raw = (value * efficiency) - pattern_penalty
    return round(max(SURVIVAL_FLOOR, raw), 4)
```
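A short sketch shows how the pattern penalty and the survival floor interact. It takes `value` and `efficiency` as already-computed inputs, and the sample numbers are hypothetical; only the last three lines of the excerpt are modeled.

```python
SURVIVAL_FLOOR = 0.15


def adversarial_score(value, efficiency, consecutive_concessions_flag):
    """Final combination step of grade_adversarial (illustrative helper)."""
    pattern_penalty = 0.10 if consecutive_concessions_flag else 0.0
    raw = value * efficiency - pattern_penalty
    return round(max(SURVIVAL_FLOOR, raw), 4)


# Same deal quality, with and without the consecutive-concession pattern.
print(adversarial_score(0.6, 0.85, False))  # 0.51
print(adversarial_score(0.6, 0.85, True))   # 0.41
# A weak deal falls below the floor, so the penalty no longer changes the result.
print(adversarial_score(0.2, 0.8, True))    # 0.15
print(adversarial_score(0.2, 0.8, False))   # 0.16
```

One consequence: for deals whose raw score is already near 0.15, the agent gets little or no learning signal from avoiding the pattern, because the floor absorbs the penalty.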
---
## The Environment Core
The `ProcureRLEnvironment` class in `server/Procure_RL_environment.py` is the heart of the system.
### Reset Flow
```python
def reset(self, seed=None, episode_id=None, **kwargs):
    task_id = kwargs.get("task_id", "single_issue")
    # 1. Set up opponent with seeded RNG
    opponent_seed = hash((seed, task_id)) % (2**32)
    self._opponent = ScriptedPersonaOpponent(task_id=task_id, seed=opponent_seed, persona=...)
    # 2. Get opponent's opening message and terms
    opening_msg, opening_terms = self._opponent.get_opening_message()
    self._opponent_opening_price = opening_terms.get("price", 52000.0)
    # 3. Initialize state
    self._state = NegotiationState(
        task_id=task_id,
        episode_id=episode_id or str(uuid.uuid4())[:8],
        round_number=0,
        rapport_score=0.5,  # Neutral
        ...
    )
    # 4. Return initial observation
    return NegotiationObservation(
        ...
        supplier_message=opening_msg,
        current_offer=opening_terms,
        ...
    )
```
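One caveat about step 1: Python salts `hash()` for strings per process unless `PYTHONHASHSEED` is fixed, so `hash((seed, task_id))` can yield a different opponent seed on each run. If cross-process reproducibility matters, a digest-based derivation is one option. The sketch below is an alternative, not what `reset()` currently does, and `stable_opponent_seed` is a hypothetical name.

```python
import hashlib


def stable_opponent_seed(seed, task_id: str) -> int:
    """Derive a 32-bit seed that is identical across processes (sketch)."""
    payload = f"{seed}:{task_id}".encode("utf-8")
    digest = hashlib.sha256(payload).digest()
    # Take the first 4 bytes as an unsigned 32-bit integer.
    return int.from_bytes(digest[:4], "big")


a = stable_opponent_seed(42, "single_issue")
b = stable_opponent_seed(42, "single_issue")
c = stable_opponent_seed(42, "multi_issue")
print(a == b, a != c, 0 <= a < 2**32)
```

The result depends only on the inputs, so the same `seed` and `task_id` map to the same opponent regardless of interpreter session.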
### Step Flow
```python
def step(self, action, **kwargs):
    # 1. Validate action
    if not isinstance(action, NegotiationAction):
        action = NegotiationAction(...)  # Convert from dict
    # 2. Track consecutive concessions (for adversarial opponent)
    if self._prev_agent_price is not None and "price" in action.terms:
        if float(action.terms["price"]) > self._prev_agent_price:
            self._consecutive_concessions += 1  # Agent moved toward opponent
        else:
            self._consecutive_concessions = 0
        self._prev_agent_price = float(action.terms["price"])
    # 3. Handle different move types
    if action.move_type in ("make_offer", "bundle"):
        # Get opponent response
        opponent_msg, opponent_terms = self._opponent.respond(...)
        # Check if opponent accepted
        if opponent_terms.get("_accepted"):
            # Episode ends, compute reward
            reward = grade(...)
            return obs_with_reward
        # Otherwise, continue negotiation
        self._last_offer = opponent_terms
        return obs_with_current_state
    if action.move_type == "accept":
        # Agent accepts current terms, episode ends
        reward = grade(...)
        return obs_with_reward
    if action.move_type == "reject":
        if round_number >= max_rounds:
            # Rejected at limit, no reward
            return obs_done_no_reward
        return obs_continue  # Rejected early, keep going
```
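The concession-tracking rule in step 2 can be replayed over a list of buyer offers to see when the counter rises and resets. This is a standalone sketch; `count_consecutive_concessions` is a hypothetical helper, and it treats the first offer as having no previous price.

```python
def count_consecutive_concessions(prices):
    """Replay the step() tracking rule over a sequence of buyer price offers."""
    streak = 0
    prev = None
    history = []
    for price in prices:
        if prev is not None:
            # A higher buyer price moves toward the seller: that is a concession.
            streak = streak + 1 if price > prev else 0
        prev = price
        history.append(streak)
    return history


# Two raises in a row reach the threshold of 2; holding the price resets it.
print(count_consecutive_concessions([40000, 42000, 44000, 44000, 45000]))
# [0, 1, 2, 0, 1]

# Falling offers never increment the counter.
print(count_consecutive_concessions([48000, 47000, 46000]))
# [0, 0, 0]
```

Under this rule, an agent can avoid the aggressive anchor's hardening by holding its price for one round between increases.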
### State Property
```python
@property
def state(self) -> NegotiationState:
    return self._state
```
Returns the internal `NegotiationState` object, giving access to:
- `round_number`
- `rapport_score`
- `consecutive_concessions`
- `deal_reached`
- `final_terms`
- `cumulative_reward`
---
## The Server API
The FastAPI server in `server/app.py` exposes the environment over HTTP and WebSocket.
### Endpoints
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check |
| `/reset` | POST | Reset environment with `task_id` and `seed` |
| `/step` | POST | Execute an action |
| `/state` | GET | Get current `NegotiationState` |
| `/ws` | WS | WebSocket for persistent sessions |
### Request/Response Examples
**POST /reset**
```json
// Request
{"task_id": "single_issue", "seed": 42}
// Response
{
  "task_id": "single_issue",
  "round_number": 0,
  "max_rounds": 6,
  "supplier_message": "Thanks for reaching out. Our standard pricing for this package is $52,400. Happy to discuss.",
  "current_offer": {"price": 52400.0},
  "buyer_constraints": {"price": {"target": 36000, "worst": 55000, "budget": 53000}},
  "rapport_hint": "neutral",
  "done": false
}
```
**POST /step**
```json
// Request
{"move_type": "make_offer", "terms": {"price": 48000}, "message": "I appreciate your flexibility and would like to find a fair price for both parties."}
// Response
{
  "observation": {
    "task_id": "single_issue",
    "round_number": 1,
    "max_rounds": 6,
    "supplier_message": "I appreciate you working with us. Based on our costs, $49,800 is where we can be.",
    "current_offer": {"price": 49800.0},
    "rapport_hint": "positive",
    "done": false
  },
  "reward": 0.0,
  "done": false,
  "info": {}
}
```
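For quick manual testing, the two POST endpoints can be called with nothing but the standard library. This is a sketch under assumptions: `BASE_URL` presumes the server is running locally on port 7860, and `build_post` / `post_json` are hypothetical helper names. The payload shapes are the ones shown in the examples above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # assumption: local server on the Docker port


def build_post(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request without sending it."""
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path: str, payload: dict) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(build_post(path, payload), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage, with the server running:
#   obs = post_json("/reset", {"task_id": "single_issue", "seed": 42})
#   result = post_json("/step", {
#       "move_type": "make_offer",
#       "terms": {"price": 48000},
#       "message": "I appreciate your flexibility.",
#   })
```

Splitting request construction from sending keeps the payload logic testable without a live server.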
### Key Implementation Detail: Lambda Closure
```python
_env_instance = ProcureRLEnvironment()
app = create_app(
    lambda: _env_instance,  # Factory that always returns the one shared instance
    NegotiationAction,
    NegotiationObservation,
    env_name="ProcureRL",
    max_concurrent_envs=1,
)
```
`create_app()` takes a factory callable and calls it whenever it needs an environment. Passing the class `ProcureRLEnvironment` directly would construct a **fresh environment** on every call, so state set by `/reset` would be gone by the time `/step` runs. The lambda closes over `_env_instance`, so every call returns the same object and all requests share one environment.
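The difference between the two kinds of factory can be shown in isolation. The sketch below does not use `create_app()` at all; `CounterEnv` and `serve_two_requests` are hypothetical stand-ins that only mimic a server asking a factory for an environment on each request.

```python
class CounterEnv:
    """Stand-in for an environment with mutable state (hypothetical)."""

    def __init__(self):
        self.round_number = 0

    def step(self):
        self.round_number += 1
        return self.round_number


def serve_two_requests(factory):
    """Mimic a server that calls the factory once per request."""
    first = factory().step()
    second = factory().step()
    return first, second


shared = CounterEnv()
print(serve_two_requests(CounterEnv))      # (1, 1): a fresh env per call loses state
print(serve_two_requests(lambda: shared))  # (1, 2): the shared instance keeps state
```

The trade-off of the shared instance is that concurrent clients would interfere with each other, which is consistent with `max_concurrent_envs=1` in the configuration above.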
---
## The Inference Script
`inference.py` is a baseline agent that runs an LLM against the environment.
### Output Format (Sacred)
The script MUST output exactly:
```
[START] task=single_issue env=procure-rl model=Qwen/Qwen2.5-72B-Instruct
[STEP] step=1 action=make_offer({"price": 45000}) reward=0.00 done=false error=null
[STEP] step=2 action=accept({}) reward=0.47 done=true error=null
[END] success=true steps=2 score=0.47 rewards=0.00,0.47
```
Any deviation from this format causes validation to fail.
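Since the format is strict, it helps to build these lines in one place. The sketch below shows one way to do it; `format_step` and `format_end` are hypothetical helpers, and taking the final reward as `score` is an assumption based on the sample output above.

```python
import json


def format_step(step, move_type, terms, reward, done, error=None):
    """Render one [STEP] line in the documented format."""
    action = f"{move_type}({json.dumps(terms)})"
    err = "null" if error is None else error
    return (
        f"[STEP] step={step} action={action} reward={reward:.2f} "
        f"done={str(done).lower()} error={err}"
    )


def format_end(success, rewards):
    """Render the [END] line; score is assumed to be the last reward."""
    score = rewards[-1] if rewards else 0.0
    joined = ",".join(f"{r:.2f}" for r in rewards)
    return (
        f"[END] success={str(success).lower()} steps={len(rewards)} "
        f"score={score:.2f} rewards={joined}"
    )


print(format_step(1, "make_offer", {"price": 45000}, 0.0, False))
print(format_step(2, "accept", {}, 0.47, True))
print(format_end(True, [0.0, 0.47]))
```

These three calls reproduce the `[STEP]` and `[END]` lines of the sample output exactly, including lowercase booleans and two-decimal rewards.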
### How It Works
```python
def run_task(task_id):
    env = ProcureRLEnvironment()
    obs = env.reset(task_id=task_id, seed=42)
    print(f"[START] task={task_id} ...")
    step, final_score = 0, 0.0
    while not obs.done and step < MAX_STEPS:
        step += 1
        # 1. Get action from LLM
        action_dict = get_agent_action(obs_to_dict(obs))
        # 2. Convert to NegotiationAction
        action = NegotiationAction(
            move_type=action_dict.get("move_type", "make_offer"),
            terms=action_dict.get("terms", {}),
            message=action_dict.get("message", ""),
        )
        # 3. Step environment
        obs = env.step(action)
        # 4. Print step result (reward may be None before the episode ends)
        reward = obs.reward or 0.0
        print(f"[STEP] step={step} action={...} reward={reward:.2f} ...")
        if obs.done:
            final_score = reward
            break
    print(f"[END] success={...} steps={step} score={final_score:.2f} ...")
```
### LLM Prompt
```python
SYSTEM_PROMPT = """You are a professional procurement negotiator. Your goal is to negotiate the best possible deal for your company.
You will receive a supplier's message and current offer terms. You must respond with a JSON action:
{
"move_type": "make_offer",
"terms": {"price": 42000, "payment_days": 45},
"message": "Your natural language response to the supplier"
}
move_type must be one of: make_offer, accept, reject, bundle
message should be professional and collaborative when possible."""
```
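LLM replies often wrap the JSON in prose or code fences, so the agent needs a tolerant parser before it can build a `NegotiationAction`. The sketch below is one possible approach, not the parser used in `inference.py`; `parse_action` is a hypothetical name, and the fallback values simply mirror the defaults used in `run_task`.

```python
import json

VALID_MOVES = {"make_offer", "accept", "reject", "bundle"}


def parse_action(raw: str) -> dict:
    """Parse the outermost {...} span of an LLM reply; fall back to defaults."""
    fallback = {"move_type": "make_offer", "terms": {}, "message": ""}
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return fallback
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return fallback
    if not isinstance(data, dict) or data.get("move_type") not in VALID_MOVES:
        return fallback
    terms = data.get("terms")
    return {
        "move_type": data["move_type"],
        "terms": terms if isinstance(terms, dict) else {},
        "message": str(data.get("message", "")),
    }


reply = 'Sure!\n```json\n{"move_type": "accept", "terms": {}, "message": "Deal."}\n```'
print(parse_action(reply))
print(parse_action("I cannot produce JSON right now."))
```

Rejecting unknown `move_type` values here keeps malformed model output from reaching the environment as an invalid action.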
---
## End-to-End Example
Here's a full negotiation episode for `single_issue`:
### Round 0: Reset
```python
env.reset(task_id="single_issue", seed=42)
# Returns:
# supplier_message: "Thanks for reaching out. Our standard pricing for this package is $52,400..."
# current_offer: {"price": 52400.0}
# buyer_constraints: {"price": {"target": 36000, ...}}
# rapport_hint: "neutral"
```
### Round 1: Agent Makes Offer with Collaborative Language
```python
action = NegotiationAction(
    move_type="make_offer",
    terms={"price": 48000},
    message="I value our potential partnership and believe we can find a fair price that works for both of us. We're flexible on timeline."
)
obs = env.step(action)
# Returns:
# supplier_message: "I appreciate you working with us. Based on our costs, $49,600 is where we can be."
# current_offer: {"price": 49600.0}
# rapport_hint: "positive" (because message contained collaborative signals)
# reward: 0.0 (still negotiating, no reward yet)
```
### Round 2: Agent Lowers Its Offer
```python
action = NegotiationAction(
    move_type="make_offer",
    terms={"price": 47000},
    message="I understand your cost constraints. Let's work together to find a solution."
)
obs = env.step(action)
# Returns:
# supplier_message: "I think we're close. If you can do $46,700, I can get this approved today."
# current_offer: {"price": 46700.0}
# rapport_hint: "positive"
```
### Round 3: Agent Lowers Its Offer Again
```python
action = NegotiationAction(
    move_type="make_offer",
    terms={"price": 46000},
    message="We can move to $46,000 as a final compromise."
)
obs = env.step(action)
# Returns:
# supplier_message: "That works for us. Let's move forward at those terms."
# done: true
# reward: terminal score from grade_single_issue (worked through under "Grading This Episode")
# info: {"deal_price": 46000}
```
### Grading This Episode
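- Opening: $52,400
- Target: $36,000 (from `buyer_constraints`)
- Range: $16,400
- Improvement: $52,400 - $46,000 = $6,400
- value = $6,400 / $16,400 = 0.39
- Round 3 → efficiency = 1.0 - (3/6)^1.5 × 0.4 = 0.86
- **Score: 0.39 × 0.86 ≈ 0.34**

Note that `grade_single_issue` as listed earlier hardcodes `BUYER_TARGET = 38000.0`. With that target the range is $14,400, value is 0.44, and the score is about 0.38.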
---
## Docker Deployment
### Dockerfile
```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
ENV PORT=7860
EXPOSE 7860
CMD ["python", "-m", "uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "7860"]
```
Key points:
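- Port **7860** (not 8000): the default port for Docker-based HF Spaces
- `ENV PORT=7860`: exposes the port as an environment variable; the `CMD` also passes `--port 7860` explicitly, so uvicorn listens on 7860 either way
- Uses `python -m uvicorn` with full module path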
### Running
```bash
# Build
docker build -t procure-rl .
# Run
docker run -p 7860:7860 procure-rl
# Test
curl -X POST http://localhost:7860/reset -H "Content-Type: application/json" -d '{"task_id": "single_issue"}'
```
### Health Check
The server exposes a health endpoint:
```
GET /health → {"status": "ok", "service": "procure-rl"}
```
---
## Calibration and Testing
### Test Files
#### `test_graders.py`
Verifies all graders return scores in [0.0, 1.0] range, even with edge cases.
#### `test_rl_properties.py`
Tests fundamental RL properties:
1. **Reproducibility**: Same seed → Same opening message
2. **Language sensitivity**: Collaborative language → Higher rapport
3. **Sequential decisions**: Consecutive concessions tracked in state
4. **Delayed reward**: Only terminal state has non-zero reward
5. **Accept terminates**: `move_type="accept"` ends episode
6. **Reset cleans state**: Fresh state after reset
#### `test_calibration.py`
Verifies score spread between random and strategic agents:
```
single_issue: Random avg=0.371, Strategic avg=0.487, Spread=0.116 ✅
multi_issue: Random avg=0.364, Strategic avg=0.535, Spread=0.171 ✅
adversarial: Random avg=0.304, Strategic avg=0.607, Spread=0.303 ✅
```
A healthy spread means the environment actually differentiates good vs bad behavior.
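The spread column is just the difference of the two averages, which the sketch below recomputes from the reported numbers. The `MIN_SPREAD` threshold is an assumption chosen for illustration; the actual pass criterion in `test_calibration.py` is not shown in this document.

```python
# Reported (random_avg, strategic_avg) pairs from the output above.
REPORTED = {
    "single_issue": (0.371, 0.487),
    "multi_issue": (0.364, 0.535),
    "adversarial": (0.304, 0.607),
}
MIN_SPREAD = 0.10  # assumption: illustrative threshold, not from test_calibration.py


def spread(random_avg: float, strategic_avg: float) -> float:
    """Score gap between the strategic and random agents."""
    return round(strategic_avg - random_avg, 3)


for task, (rand, strat) in REPORTED.items():
    s = spread(rand, strat)
    print(f"{task}: spread={s:.3f} ok={s >= MIN_SPREAD}")
```

The adversarial task shows the widest gap, meaning strategy matters most where the opponent punishes naive concession patterns.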
### Score Calibration Targets
| Task | Random Agent | Base LLM | Goal (Trained) |
|------|-------------|----------|-----------------|
| single_issue | 0.15–0.25 | 0.35–0.45 | 0.68–0.78 |
| multi_issue | 0.08–0.15 | 0.20–0.30 | 0.55–0.65 |
| adversarial | 0.03–0.10 | 0.12–0.20 | 0.45–0.55 |
---
## Summary: How Everything Fits Together
```
User runs inference.py
        │
        ▼
LLM agent receives observation (supplier message, current offer, constraints)
        │
        ▼
LLM decides action (make_offer with terms + collaborative message)
        │
        ▼
Environment.step(action) is called
        │
        ├─▶ Opponent responds (language → rapport → concession rate → counter)
        │
        ├─▶ State is updated (round_number++, rapport_score, consecutive_concessions)
        │
        └─▶ Observation returned (supplier_message, current_offer, rapport_hint)
                │
                ▼
If episode done: Grader scores the deal (relative to opening price, efficiency, patterns)
        │
        ▼
Score in [0.0, 1.0] returned
```
The agent learns through many episodes:
- **What language gets better rapport** → better concession rates
- **When to concede vs hold** → efficiency bonus
- **How to bundle multiple issues** → multi-issue tasks
- **How to avoid consecutive concession patterns** → adversarial task
The environment is designed to be learnable but not trivial, requiring genuine strategic thinking from an LLM agent.