# ProcureRL: A Deep Dive

## Table of Contents
1. [What is ProcureRL?](#what-is-procurerl)
2. [Why Does This Exist?](#why-does-this-exist)
3. [The Big Picture Architecture](#the-big-picture-architecture)
4. [The Three Tasks](#the-three-tasks)
5. [Data Models: What's Floating Around](#data-models-whats-floating-around)
6. [The Scripted Opponent System](#the-scripted-opponent-system)
7. [The Grading System](#the-grading-system)
8. [The Environment Core](#the-environment-core)
9. [The Server API](#the-server-api)
10. [The Inference Script](#the-inference-script)
11. [End-to-End Example](#end-to-end-example)
12. [Docker Deployment](#docker-deployment)
13. [Calibration and Testing](#calibration-and-testing)
14. [Summary: How Everything Fits Together](#summary-how-everything-fits-together)

---

## What is ProcureRL?

ProcureRL is an **OpenEnv-compliant Reinforcement Learning environment** where an LLM (Large Language Model) agent learns to negotiate procurement deals against scripted supplier opponents.

In simpler terms: it's a training ground for AI to practice negotiation — like a flight simulator, but for procurement conversations.

### The Core Innovation: Language-Sensitive Opponent

What makes ProcureRL special is that the opponent's behavior **responds to the quality of the agent's natural language**, not just the prices offered. This means:

- An agent that outputs aggressive or low-effort language gets a **tough, unyielding opponent**
- An agent that outputs collaborative, professional language gets a **more cooperative, flexible opponent**

The language IS the policy, not just the action space. This makes an LLM genuinely required rather than incidental.

---

## Why Does This Exist?

Real-world procurement negotiation is:
- **Sequential**: one decision affects the next
- **Hidden utility**: the opponent's real priorities are not revealed
- **Language-dependent**: how you say things matters as much as what you offer
- **High-stakes**: Walmart has deployed an AI negotiation tool (Pactum) for exactly this, and a reported 90% of CPOs are adopting AI negotiation in 2025

Traditional rule-based negotiation tools are limited. An RL-trained LLM policy can learn to navigate this complexity in ways that static rules cannot.

---

## The Big Picture Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                         ProcureRL System                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌──────────────────┐    ┌──────────────────┐                  │
│  │   LLM Agent      │───▶│  Environment     │                  │
│  │   (inference.py)  │    │  (Procure_RL_    │                  │
│  │                   │    │   environment.py)│                  │
│  └──────────────────┘    └────────┬─────────┘                  │
│                                    │                            │
│                                    ▼                            │
│                          ┌──────────────────┐                  │
│                          │  Scripted         │                  │
│                          │  Opponent         │                  │
│                          │  (opponent.py)    │                  │
│                          └────────┬─────────┘                  │
│                                    │                            │
│                                    ▼                            │
│                          ┌──────────────────┐                  │
│                          │  Graders         │                  │
│                          │  (graders.py)     │                  │
│                          └──────────────────┘                  │
│                                                                 │
│  ┌──────────────────┐    ┌──────────────────┐                  │
│  │  Server API      │    │  OpenEnv.yaml    │                  │
│  │  (server/app.py)  │    │  (manifest)       │                  │
│  └────────┬─────────┘    └──────────────────┘                  │
│           │                                                     │
│           ▼                                                     │
│  ┌──────────────────┐                                         │
│  │  Docker Container │◀── HF Spaces Deployment                │
│  │  (port 7860)       │                                        │
│  └──────────────────┘                                         │
└─────────────────────────────────────────────────────────────────┘
```

The system is designed so that:
1. **Environment** is deterministic and reproducible (seeded RNG)
2. **Opponent** responds to language quality (via rapport system)
3. **Graders** produce bounded [0.0, 1.0] scores
4. **Server** exposes everything over HTTP for OpenEnv compliance
5. **Inference** runs a baseline LLM agent against the environment

---

## The Three Tasks

ProcureRL includes three tasks of increasing difficulty:

### Task 1: `single_issue` (Easy)

**Scenario:** Software license renewal. Price only.

```
Buyer Target: $36,000
Seller Opens: ~$52,000 (varies by seed)
Seller Floor: ~$44,000 (varies by seed)
Max Rounds: 6
Opponent Persona: Cooperative
```

The agent must negotiate the price down from opening to target. The cooperative opponent starts friendly and remains fairly flexible.

**Example Grading** (approximate, using the `grade_single_issue` formula with a $52,000 opening):
- Deal at $38K in round 2: ~0.92 score
- Deal at $44K in round 6: ~0.34 score
- No deal: 0.0 score

### Task 2: `multi_issue` (Medium)

**Scenario:** Enterprise software negotiation with price AND payment terms.

```
Issues: price ($40K-$58K) + payment_days (30-90)
Opponent Persona: Cash Flow Stressed
  → Cares more about getting paid quickly (payment_weight: 0.65)
  → Cares less about final price (price_weight: 0.35)
Max Rounds: 8
```

**The Strategic Opportunity:** If the agent offers Net-30 or Net-45 payment terms, the opponent becomes more flexible on price. A naive agent treats both issues equally and scores low. A smart agent bundles payment speed with price negotiation.

**Example Grading:**
- Price $42K + Net-30 payment: ~0.60 score
- Price $42K + Net-90 payment: ~0.35 score
- No deal: 0.0 score

### Task 3: `adversarial` (Hard)

**Scenario:** Large contract with three issues — price, payment, and support hours.

```
Issues: price + payment_days + support_hours
Opponent Persona: Aggressive Anchor
  → Opens at ceiling on all issues
  → Hardens position if agent makes consecutive concessions
  → Rapport-sensitive but requires consistent collaborative framing
Max Rounds: 10
Survival Floor: 0.15 (completing any deal gets at least 0.15)
```

**The Challenge:** If the agent concedes on price in 2+ consecutive rounds, the opponent recognizes this pattern and becomes much harder to negotiate with. The agent must resist anchoring, break consecutive concession patterns, and maintain collaborative tone under pressure.

**Example Grading:**
- Strategic deal with no consecutive concessions: ~0.50 score
- Same deal but with consecutive concession pattern: ~0.40 score
- Survival deal (just complete): 0.15 score

---

## Data Models: What's Floating Around

The system uses three Pydantic models defined in `models.py`:

### `NegotiationAction`

What the agent sends to the environment:

```python
from typing import Any, Dict
from pydantic import BaseModel

class NegotiationAction(BaseModel):
    move_type: str           # "make_offer" | "accept" | "reject" | "bundle"
    terms: Dict[str, Any]    # {"price": 42000, "payment_days": 45}
    message: str = ""        # Natural language; affects opponent rapport!
```

**Important:** The `message` field is not just flavor text. It directly affects opponent behavior through the rapport system.

### `NegotiationObservation`

What the environment sends back to the agent after each step:

```python
class NegotiationObservation(BaseModel):
    task_id: str                           # Which task we're running
    round_number: int                      # Current round (0 to max_rounds)
    max_rounds: int                        # Task's round limit
    supplier_message: str                  # Opponent's latest message
    current_offer: Dict[str, Any]          # Terms currently on the table
    last_4_exchanges: List[Dict]           # Recent conversation history
    buyer_constraints: Dict[str, Any]      # Agent's targets and limits
    rapport_hint: str                       # "positive" | "neutral" | "negative"
    done: bool                             # Is episode finished?
    reward: Optional[float] = None          # Reward (only on done)
    metadata: Dict[str, Any] = Field(...)  # Extra info (deal_price, errors)
```

### `NegotiationState`

The environment's internal state (accessible via `env.state`):

```python
class NegotiationState(BaseModel):
    task_id: str = ""
    episode_id: str = ""
    round_number: int = 0
    rapport_score: float = 0.5              # 0.0 to 1.0, starts neutral
    consecutive_concessions: int = 0        # Tracks concession patterns
    deal_reached: bool = False
    final_terms: Optional[Dict] = None       # Set when episode ends
    cumulative_reward: float = 0.0
```

---

## The Scripted Opponent System

The opponent is implemented in `opponent.py` as the `ScriptedPersonaOpponent` class.

### The Rapport System (Language Sensitivity)

The key mechanism is **rapport** — a score from 0.0 to 1.0 that changes based on the agent's language quality.

**Collaborative Signals (increase rapport):**
```python
COLLABORATIVE_SIGNALS = [
    "understand", "partnership", "mutual", "together", "value",
    "appreciate", "flexible", "work with", "long-term", "relationship",
    "reasonable", "fair", "both", "solution"
]
```

**Aggressive Signals (decrease rapport):**
```python
AGGRESSIVE_SIGNALS = [
    "demand", "require", "final offer", "unacceptable", "must",
    "non-negotiable", "take it or leave", "bottom line", "ultimatum",
    "insist", "refuse", "absolutely not"
]
```

**How it works:**
```python
def update_rapport(self, agent_message: str) -> None:
    msg_lower = agent_message.lower()
    delta = 0.0
    delta += sum(0.08 for w in COLLABORATIVE_SIGNALS if w in msg_lower)
    delta -= sum(0.08 for w in AGGRESSIVE_SIGNALS if w in msg_lower)
    delta = max(-0.20, min(0.20, delta))  # Cap per-round change
    self.rapport = max(0.0, min(1.0, self.rapport + delta))
```

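Each signal phrase found in the agent's message (a case-insensitive substring match, counted once per phrase) shifts rapport by 0.08 up or down. The net change is capped at ±0.20 per round, and rapport itself is clamped to [0.0, 1.0].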
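To make the rapport arithmetic concrete, here is a minimal standalone sketch that applies the same rule to a few messages. The function name `next_rapport` and the sample messages are illustrative assumptions; the signal lists and constants are copied from the listing above.

```python
COLLABORATIVE_SIGNALS = [
    "understand", "partnership", "mutual", "together", "value",
    "appreciate", "flexible", "work with", "long-term", "relationship",
    "reasonable", "fair", "both", "solution",
]
AGGRESSIVE_SIGNALS = [
    "demand", "require", "final offer", "unacceptable", "must",
    "non-negotiable", "take it or leave", "bottom line", "ultimatum",
    "insist", "refuse", "absolutely not",
]


def next_rapport(rapport: float, agent_message: str) -> float:
    """Hypothetical pure-function version of update_rapport()."""
    msg_lower = agent_message.lower()
    delta = sum(0.08 for w in COLLABORATIVE_SIGNALS if w in msg_lower)
    delta -= sum(0.08 for w in AGGRESSIVE_SIGNALS if w in msg_lower)
    delta = max(-0.20, min(0.20, delta))  # cap the per-round change
    return max(0.0, min(1.0, rapport + delta))


# Two collaborative signals ("appreciate", "fair"): 0.5 + 0.16
friendly = next_rapport(0.5, "I appreciate your time and want a fair deal.")
# Two aggressive signals ("unacceptable", "demand"): 0.5 - 0.16
hostile = next_rapport(0.5, "This is unacceptable. We demand a lower price.")
# Five collaborative signals would add 0.40, but the cap limits it to +0.20
capped = next_rapport(
    0.5,
    "I value our partnership and want a fair price for both of us. We're flexible.",
)
print(round(friendly, 2), round(hostile, 2), round(capped, 2))
```

Because matching is by substring, a word such as "must" counts as aggressive even inside an otherwise friendly sentence.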

### Concession Rate: How Fast the Opponent Moves

Rapport directly modifies the opponent's concession rate:

```python
def get_concession_rate(self) -> float:
    base_rates = {
        "cooperative": 0.05,        # 5% per round base
        "cash_flow_stressed": 0.07,
        "aggressive_anchor": 0.04,
    }
    base = base_rates[self.persona]
    modifier = (self.rapport - 0.5) * base  # +/- 50% of base
    return max(0.01, base + modifier)
```

**Example:** Cooperative opponent with high rapport (0.8) concedes at 0.05 + (0.8 - 0.5) × 0.05 = **6.5% per round**. With low rapport (0.2), it concedes at 0.05 + (0.2 - 0.5) × 0.05 = **3.5% per round**. The extremes, at rapport 1.0 and 0.0, are 7.5% and 2.5%.
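The same formula can be tabulated for every persona. This is a small sketch that reuses the base rates from `get_concession_rate()`; the free function `concession_rate` is a hypothetical stand-in for the method.

```python
BASE_RATES = {
    "cooperative": 0.05,
    "cash_flow_stressed": 0.07,
    "aggressive_anchor": 0.04,
}


def concession_rate(persona: str, rapport: float) -> float:
    """Standalone copy of the rate formula: base scaled by +/- 50% via rapport."""
    base = BASE_RATES[persona]
    modifier = (rapport - 0.5) * base
    return max(0.01, base + modifier)


for persona in BASE_RATES:
    low, neutral, high = (concession_rate(persona, r) for r in (0.0, 0.5, 1.0))
    print(f"{persona:20s} low={low:.3f} neutral={neutral:.3f} high={high:.3f}")
```

With these base rates the lowest reachable value is half of 0.04, so the `0.01` floor only acts as a safety net.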

### Three Personas

#### 1. Cooperative (`single_issue`)
- Friendly, understanding tone
- 5% base concession rate, highly sensitive to rapport
- Accepts early if price is above floor and round ≥ 2

#### 2. Cash Flow Stressed (`multi_issue`)
- Cares about payment timing more than price
- 7% base concession rate, moderate rapport sensitivity
- Acceptance requires `payment_days ≤ 45`
- Comments on payment timing in responses

#### 3. Aggressive Anchor (`adversarial`)
- Opens at ceiling, hardens with pressure
- 4% base concession rate (least flexible)
- **Penalizes consecutive concessions** — if agent concedes 2+ rounds in a row, concession rate drops to 40% of normal
- Uses "hardening" templates when cornered

### Opponent Response Flow

```python
def respond(self, agent_message, agent_terms, round_number, consecutive_concessions):
    # 1. Update rapport based on agent's language
    self.update_rapport(agent_message)

    # 2. Check acceptance (from round 2 onward, and only if the offered price is ≥ floor)
    agent_price = float(agent_terms.get("price", 0.0))
    if round_number >= 2 and agent_price >= self.price_floor and _acceptance_condition():
        return self.templates["accept"], {**agent_terms, "_accepted": True}

    # 3. Calculate concession rate
    concession = self.get_concession_rate()

    # 4. Aggressive anchor gets harder if detecting concession pattern
    if self.persona == "aggressive_anchor" and consecutive_concessions >= 2:
        concession = concession * 0.4  # 60% reduction!
        template_key = "hardening"
    elif round_number >= 0.7 * self.max_rounds:  # final ~30% of rounds; attribute name assumed
        template_key = "near_close"
    else:
        template_key = "counter"

    # 5. Compute new position
    new_position = self.current_position * (1 - concession)
    new_position = max(self.price_floor, new_position)  # Never go below floor

    # 6. Return message and counter terms
    return message, counter_terms
```
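Steps 3 to 5 of the flow above can be simulated in isolation to see how the seller's position moves over an episode. The sketch below is an approximation under stated assumptions: it ignores acceptance and message templates, holds the concession rate fixed, and `simulate_positions` and its `hardened_from` parameter are hypothetical names.

```python
def simulate_positions(opening, floor, rate, rounds, hardened_from=None):
    """Apply position *= (1 - rate) each round, never dropping below the floor.

    hardened_from: first round at which the rate is multiplied by 0.4,
    mimicking the aggressive anchor's reaction to consecutive concessions.
    """
    positions = []
    position = opening
    for round_number in range(1, rounds + 1):
        effective = rate
        if hardened_from is not None and round_number >= hardened_from:
            effective = rate * 0.4
        position = max(floor, position * (1 - effective))
        positions.append(round(position, 2))
    return positions


# Cooperative seller at rapport 0.8 (rate 0.065): reaches the floor by round 3.
cooperative = simulate_positions(52000.0, 44000.0, 0.065, rounds=6)
# Aggressive anchor at neutral rapport, with and without hardening from round 3.
# The opening and floor here are made-up values for illustration.
normal = simulate_positions(60000.0, 40000.0, 0.04, rounds=5)
hardened = simulate_positions(60000.0, 40000.0, 0.04, rounds=5, hardened_from=3)
print(cooperative)
print(normal)
print(hardened)
```

Once hardening starts, the seller's price stays noticeably higher for the rest of the episode.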

---

## The Grading System

Graders are in `graders.py` and produce scores in [0.0, 1.0]. They are **pure Python — zero LLM calls**, ensuring deterministic, reproducible scoring.

### Key Design: Relative Scoring

The graders score based on **how much the agent improved from the opponent's opening price**, not on absolute thresholds. This makes the environment learnable — the agent learns to negotiate better deals relative to where negotiations started.

```python
# Instead of scoring against a hardcoded floor, we score relative to the opening:
value = (opponent_opening - final_price) / (opponent_opening - BUYER_TARGET)
```

### Single Issue Grading

```python
def grade_single_issue(final_terms, deal_reached, rounds_taken, max_rounds=6, opponent_opening=52000.0):
    if not deal_reached:
        return 0.0

    final_price = final_terms.get("price", opponent_opening)
    BUYER_TARGET = 38000.0

    # If price didn't improve from opening, minimal score
    if final_price >= opponent_opening:
        return 0.05

    # How much did we improve relative to the possible improvement?
    value = (opponent_opening - final_price) / (opponent_opening - BUYER_TARGET)
    value = max(0.0, min(1.0, value))

    # Efficiency penalty for taking too long
    efficiency = 1.0 - (rounds_taken / max_rounds) ** 1.5 * 0.4
    efficiency = max(0.1, efficiency)  # Never below 0.1

    return round(value * efficiency, 4)
```

**Example:**
- Opening: $52,000, Target: $38,000, Range: $14,000
- Final price: $45,000 → improvement: $7,000 → value = 0.50
- Round 3 → efficiency = 1.0 - (3/6)^1.5 × 0.4 = 1.0 - 0.354 × 0.4 ≈ 0.86
- **Score: 0.50 × 0.86 ≈ 0.43**
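As a quick check on this arithmetic, the two factors of the single-issue score can be computed separately. This is a sketch that mirrors the formulas in `grade_single_issue`; the helper names `efficiency` and `relative_value` are not part of the project.

```python
def efficiency(rounds_taken: int, max_rounds: int = 6) -> float:
    """Efficiency multiplier: close to 1.0 early, 0.6 at the final round."""
    return max(0.1, 1.0 - (rounds_taken / max_rounds) ** 1.5 * 0.4)


def relative_value(final_price: float,
                   opponent_opening: float = 52000.0,
                   buyer_target: float = 38000.0) -> float:
    """Share of the opening-to-target gap that the deal captured, clamped to [0, 1]."""
    value = (opponent_opening - final_price) / (opponent_opening - buyer_target)
    return max(0.0, min(1.0, value))


# Score for a $45,000 deal depending on the round in which it closes.
for rounds_taken in range(1, 7):
    score = relative_value(45000.0) * efficiency(rounds_taken)
    print(rounds_taken, round(efficiency(rounds_taken), 4), round(score, 4))
```

Within `max_rounds` the multiplier never drops below 0.6, so the `0.1` lower bound in the grader is only reached if `rounds_taken` exceeds the limit.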

### Multi-Issue Grading

```python
def grade_multi_issue(final_terms, deal_reached, rounds_taken, max_rounds=8, opponent_opening=52000.0):
    # Two dimensions: price (70% weight) and payment_days (30% weight)
    price_value = (opponent_opening - final_price) / (opponent_opening - 40000)
    payment_score = (90 - payment_days) / (90 - 30)

    value = 0.70 * price_value + 0.30 * payment_score

    # If price didn't improve but payment did, still score on payment
    if final_price >= opponent_opening:
        value = 0.30 * payment_score  # Only payment matters
```

**Example:**
- Price: $44,000 (good), Payment: Net-45 (good), default $52,000 opening → price_value = 8,000 / 12,000 = 0.67, payment_score = 45 / 60 = 0.75
- value = 0.70×0.67 + 0.30×0.75 = 0.69
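The weighted value can be reproduced with a few lines of Python. This sketch covers only the `value` term from the excerpt above; the clamping to [0, 1] is an assumption borrowed from the single-issue grader, and `multi_issue_value` is a hypothetical helper name.

```python
def multi_issue_value(final_price: float, payment_days: int,
                      opponent_opening: float = 52000.0) -> float:
    """Weighted deal value: 70% price improvement, 30% payment speed."""
    price_value = (opponent_opening - final_price) / (opponent_opening - 40000)
    payment_score = (90 - payment_days) / (90 - 30)
    # Assumed clamps, mirroring grade_single_issue.
    price_value = max(0.0, min(1.0, price_value))
    payment_score = max(0.0, min(1.0, payment_score))
    if final_price >= opponent_opening:
        return 0.30 * payment_score  # no price improvement: only payment counts
    return 0.70 * price_value + 0.30 * payment_score


print(round(multi_issue_value(44000.0, 45), 4))  # the example above
print(round(multi_issue_value(42000.0, 30), 4))  # fast payment
print(round(multi_issue_value(42000.0, 90), 4))  # slow payment, same price
```

At a fixed price, moving from Net-90 to Net-30 is worth the full 0.30 payment weight before the efficiency factor is applied.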

### Adversarial Grading

```python
def grade_adversarial(final_terms, deal_reached, rounds_taken, consecutive_concessions_flag, ...):
    SURVIVAL_FLOOR = 0.15  # Completing any deal gets at least 0.15

    # Three dimensions with weights
    value = 0.40 * price_value + 0.35 * payment_score + 0.25 * support_score

    # Pattern penalty: bad if you showed consecutive concessions
    pattern_penalty = 0.10 if consecutive_concessions_flag else 0.0

    raw = (value * efficiency) - pattern_penalty
    return round(max(SURVIVAL_FLOOR, raw), 4)
```
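The interaction between the pattern penalty and the survival floor is easiest to see with numbers. The following sketch takes the three component scores and the efficiency factor as inputs, since the excerpt does not show how they are derived; `adversarial_score` is a hypothetical name.

```python
SURVIVAL_FLOOR = 0.15


def adversarial_score(price_value: float, payment_score: float,
                      support_score: float, efficiency: float,
                      consecutive_concessions_flag: bool) -> float:
    """Combine the weighted value, efficiency, pattern penalty and floor."""
    value = 0.40 * price_value + 0.35 * payment_score + 0.25 * support_score
    pattern_penalty = 0.10 if consecutive_concessions_flag else 0.0
    raw = (value * efficiency) - pattern_penalty
    return round(max(SURVIVAL_FLOOR, raw), 4)


# A solid deal loses exactly 0.10 when the concession pattern was detected.
clean = adversarial_score(0.6, 0.6, 0.6, 0.85, False)
flagged = adversarial_score(0.6, 0.6, 0.6, 0.85, True)
# A weak deal sits on the floor either way, so the penalty has no visible effect.
weak_clean = adversarial_score(0.2, 0.2, 0.2, 0.6, False)
weak_flagged = adversarial_score(0.2, 0.2, 0.2, 0.6, True)
print(clean, flagged, weak_clean, weak_flagged)
```

The floor therefore rewards finishing a deal at all, while the penalty only separates agents whose deals are already above 0.15.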

---

## The Environment Core

The `ProcureRLEnvironment` class in `server/Procure_RL_environment.py` is the heart of the system.

### Reset Flow

```python
def reset(self, seed=None, episode_id=None, **kwargs):
    task_id = kwargs.get("task_id", "single_issue")

    # 1. Set up opponent with seeded RNG
    opponent_seed = hash((seed, task_id)) % (2**32)
    self._opponent = ScriptedPersonaOpponent(task_id=task_id, seed=opponent_seed, persona=...)

    # 2. Get opponent's opening message and terms
    opening_msg, opening_terms = self._opponent.get_opening_message()
    self._opponent_opening_price = opening_terms.get("price", 52000.0)

    # 3. Initialize state
    self._state = NegotiationState(
        task_id=task_id,
        episode_id=episode_id or str(uuid.uuid4())[:8],
        round_number=0,
        rapport_score=0.5,  # Neutral
        ...
    )

    # 4. Return initial observation
    return NegotiationObservation(
        ...
        supplier_message=opening_msg,
        current_offer=opening_terms,
        ...
    )
```
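One caveat about the seeding line in step 1: Python salts `str` hashes per interpreter process unless `PYTHONHASHSEED` is set, so `hash((seed, task_id))` can produce a different value on each run. If cross-process reproducibility matters, a digest-based derivation is one option. The sketch below is an assumption about how that could look, not the project's code, and `stable_opponent_seed` is a hypothetical helper.

```python
import hashlib


def stable_opponent_seed(seed, task_id: str) -> int:
    """Derive a 32-bit seed that is identical in every Python process."""
    key = f"{seed}:{task_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big")  # first 4 bytes -> 0 .. 2**32 - 1


seed_a = stable_opponent_seed(42, "single_issue")
seed_b = stable_opponent_seed(42, "single_issue")
seed_c = stable_opponent_seed(42, "multi_issue")
print(seed_a == seed_b, seed_a != seed_c)
```

The result can be passed to `ScriptedPersonaOpponent(seed=...)` in place of the `hash(...) % (2**32)` expression.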

### Step Flow

```python
def step(self, action, **kwargs):
    # 1. Validate action
    if not isinstance(action, NegotiationAction):
        action = NegotiationAction(...)  # Convert from dict

    # 2. Track consecutive concessions (for adversarial opponent)
    if "price" in action.terms:
        price = float(action.terms["price"])
        if self._prev_agent_price is not None and price > self._prev_agent_price:
            self._consecutive_concessions += 1  # Buyer raised its offer toward the seller
        else:
            self._consecutive_concessions = 0
        self._prev_agent_price = price  # Only updated when a price was sent

    # 3. Handle different move types
    if action.move_type in ("make_offer", "bundle"):
        # Get opponent response
        opponent_msg, opponent_terms = self._opponent.respond(...)

        # Check if opponent accepted
        if opponent_terms.get("_accepted"):
            # Episode ends, compute reward
            reward = grade(...)
            return obs_with_reward

        # Otherwise, continue negotiation
        self._last_offer = opponent_terms
        return obs_with_current_state

    if action.move_type == "accept":
        # Agent accepts current terms, episode ends
        reward = grade(...)
        return obs_with_reward

    if action.move_type == "reject":
        if round_number >= max_rounds:
            # Rejected at limit, no reward
            return obs_done_no_reward
        return obs_continue  # Rejected early, keep going
```
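The concession counter from step 2 can be exercised on its own. This is a minimal sketch with a hypothetical `ConcessionTracker` class; it assumes that actions without a `price` (for example `accept` with empty terms) leave the counter untouched.

```python
class ConcessionTracker:
    """Counts how many rounds in a row the buyer raised its price offer."""

    def __init__(self):
        self.prev_price = None
        self.consecutive = 0

    def observe(self, terms: dict) -> int:
        if "price" not in terms:
            return self.consecutive  # nothing to compare, keep the count
        price = float(terms["price"])
        if self.prev_price is not None and price > self.prev_price:
            self.consecutive += 1  # buyer moved toward the seller
        else:
            self.consecutive = 0   # first offer, hold, or lower offer
        self.prev_price = price
        return self.consecutive


tracker = ConcessionTracker()
history = [tracker.observe(t) for t in (
    {"price": 40000}, {"price": 42000}, {"price": 44000},
    {"price": 44000}, {},
)]
print(history)
```

Holding the price for a single round is enough to reset the count, which is the behavior an agent can exploit against the aggressive anchor.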

### State Property

```python
@property
def state(self) -> NegotiationState:
    return self._state
```

Returns the internal `NegotiationState` object, giving access to:
- `round_number`
- `rapport_score`
- `consecutive_concessions`
- `deal_reached`
- `final_terms`
- `cumulative_reward`

---

## The Server API

The FastAPI server in `server/app.py` exposes the environment over HTTP and WebSocket.

### Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check |
| `/reset` | POST | Reset environment with `task_id` and `seed` |
| `/step` | POST | Execute an action |
| `/state` | GET | Get current `NegotiationState` |
| `/ws` | WS | WebSocket for persistent sessions |

### Request/Response Examples

**POST /reset**
```json
// Request
{"task_id": "single_issue", "seed": 42}

// Response
{
  "task_id": "single_issue",
  "round_number": 0,
  "max_rounds": 6,
  "supplier_message": "Thanks for reaching out. Our standard pricing for this package is $52,400. Happy to discuss.",
  "current_offer": {"price": 52400.0},
  "buyer_constraints": {"price": {"target": 36000, "worst": 55000, "budget": 53000}},
  "rapport_hint": "neutral",
  "done": false
}
```

**POST /step**
```json
// Request
{"move_type": "make_offer", "terms": {"price": 48000}, "message": "I appreciate your flexibility and would like to find a fair price for both parties."}

// Response
{
  "observation": {
    "task_id": "single_issue",
    "round_number": 1,
    "max_rounds": 6,
    "supplier_message": "I appreciate you working with us. Based on our costs, $49,800 is where we can be.",
    "current_offer": {"price": 49800.0},
    "rapport_hint": "positive",
    "done": false
  },
  "reward": 0.0,
  "done": false,
  "info": {}
}
```
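For reference, the two calls above can be issued from Python with only the standard library. This is a sketch that assumes the server is reachable at `http://localhost:7860` and that the request bodies have exactly the shapes shown; `post_json`, `make_offer_payload` and `run_demo` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # assumption: local container on the default port


def post_json(path: str, payload: dict, base_url: str = BASE_URL) -> dict:
    """POST a JSON body and decode the JSON response."""
    request = urllib.request.Request(
        base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def make_offer_payload(price: float, message: str) -> dict:
    """Build a /step body for a single-issue price offer."""
    return {"move_type": "make_offer", "terms": {"price": price}, "message": message}


def run_demo() -> None:
    """Reset the environment, then send one offer (needs a running server)."""
    first = post_json("/reset", {"task_id": "single_issue", "seed": 42})
    print(first.get("supplier_message"))
    result = post_json("/step", make_offer_payload(
        48000, "I appreciate your flexibility and would like a fair price for both parties."))
    print(result)
```

Call `run_demo()` once the server is up; nothing is sent at import time.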

### Key Implementation Detail: Lambda Closure

```python
_env_instance = ProcureRLEnvironment()

app = create_app(
    lambda: _env_instance,  # Factory that always returns the one shared instance
    NegotiationAction,
    NegotiationObservation,
    env_name="ProcureRL",
    max_concurrent_envs=1,
)
```

`create_app()` takes a factory callable and calls it whenever it needs an environment. Passing the `ProcureRLEnvironment` class itself would therefore build a **fresh environment** each time, losing negotiation state between `/reset` and `/step`. The lambda closes over `_env_instance` and always returns that same object, so all requests share one environment.
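The difference between the two kinds of factory can be shown without the server. This is a toy sketch with a hypothetical `DummyEnv` class standing in for `ProcureRLEnvironment`; how often `create_app()` actually invokes the factory is taken from the description above rather than verified here.

```python
class DummyEnv:
    """Stand-in for an environment that carries state between calls."""

    def __init__(self):
        self.round_number = 0


shared = DummyEnv()
shared_factory = lambda: shared  # closure: every call returns the same object
fresh_factory = DummyEnv         # class as factory: every call builds a new object

a, b = shared_factory(), shared_factory()
a.round_number += 1              # state written through one handle...
print(a is b, b.round_number)    # ...is visible through the other

c, d = fresh_factory(), fresh_factory()
c.round_number += 1
print(c is d, d.round_number)    # separate objects, state is not shared
```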

---

## The Inference Script

`inference.py` is a baseline agent that runs an LLM against the environment.

### Output Format (Sacred)

The script MUST output exactly:
```
[START] task=single_issue env=procure-rl model=Qwen/Qwen2.5-72B-Instruct
[STEP] step=1 action=make_offer({"price": 45000}) reward=0.00 done=false error=null
[STEP] step=2 action=accept({}) reward=0.47 done=true error=null
[END] success=true steps=2 score=0.47 rewards=0.00,0.47
```

Any deviation from this format causes validation to fail.
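Since the format is strict, it can help to build the lines with small formatting helpers. The sketch below reproduces the `[STEP]` and `[END]` lines from the example above; the function names are hypothetical, and rendering a non-null `error` as its plain string is an assumption.

```python
import json


def format_step(step: int, move_type: str, terms: dict,
                reward: float, done: bool, error=None) -> str:
    """Render one [STEP] line with two-decimal reward and lowercase booleans."""
    action = f"{move_type}({json.dumps(terms)})"
    error_text = "null" if error is None else str(error)  # assumed rendering
    return (f"[STEP] step={step} action={action} reward={reward:.2f} "
            f"done={str(done).lower()} error={error_text}")


def format_end(success: bool, steps: int, score: float, rewards: list) -> str:
    """Render the final [END] line with a comma-separated reward list."""
    rewards_text = ",".join(f"{r:.2f}" for r in rewards)
    return (f"[END] success={str(success).lower()} steps={steps} "
            f"score={score:.2f} rewards={rewards_text}")


print(format_step(1, "make_offer", {"price": 45000}, 0.0, False))
print(format_step(2, "accept", {}, 0.47, True))
print(format_end(True, 2, 0.47, [0.0, 0.47]))
```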

### How It Works

```python
def run_task(task_id):
    env = ProcureRLEnvironment()
    obs = env.reset(task_id=task_id, seed=42)

    print(f"[START] task={task_id} ...")

    step = 0
    final_score = 0.0
    while not obs.done and step < MAX_STEPS:
        step += 1

        # 1. Get action from LLM
        action_dict = get_agent_action(obs_to_dict(obs))

        # 2. Convert to NegotiationAction
        action = NegotiationAction(
            move_type=action_dict.get("move_type", "make_offer"),
            terms=action_dict.get("terms", {}),
            message=action_dict.get("message", "")
        )

        # 3. Step environment
        obs = env.step(action)

        # 4. Print step result (reward may be None before the episode ends)
        reward = obs.reward if obs.reward is not None else 0.0
        print(f"[STEP] step={step} action={...} reward={reward:.2f} ...")

        if obs.done:
            final_score = reward

    print(f"[END] success={...} steps={step} score={final_score:.2f} ...")
```

### LLM Prompt

```python
SYSTEM_PROMPT = """You are a professional procurement negotiator. Your goal is to negotiate the best possible deal for your company.

You will receive a supplier's message and current offer terms. You must respond with a JSON action:
{
  "move_type": "make_offer",
  "terms": {"price": 42000, "payment_days": 45},
  "message": "Your natural language response to the supplier"
}

move_type must be one of: make_offer, accept, reject, bundle
message should be professional and collaborative when possible."""
```
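Model output does not always come back as clean JSON, so the reply has to be parsed defensively before it is turned into a `NegotiationAction`. The sketch below is one possible approach and not the project's `get_agent_action`; `parse_action` is a hypothetical helper, and its fallback mirrors the defaults used in `run_task`.

```python
import json

VALID_MOVES = {"make_offer", "accept", "reject", "bundle"}
FALLBACK = {"move_type": "make_offer", "terms": {}, "message": ""}


def parse_action(raw_text: str) -> dict:
    """Extract the outermost JSON object from an LLM reply, or fall back."""
    start, end = raw_text.find("{"), raw_text.rfind("}")
    if start == -1 or end <= start:
        return dict(FALLBACK)
    try:
        data = json.loads(raw_text[start:end + 1])
    except json.JSONDecodeError:
        return dict(FALLBACK)
    if not isinstance(data, dict) or data.get("move_type") not in VALID_MOVES:
        return dict(FALLBACK)
    terms = data.get("terms")
    return {
        "move_type": data["move_type"],
        "terms": terms if isinstance(terms, dict) else {},
        "message": str(data.get("message", "")),
    }


reply = 'Here is my move: {"move_type": "accept", "terms": {}, "message": "Deal."} Thanks!'
print(parse_action(reply))
print(parse_action("I cannot decide."))
```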

---

## End-to-End Example

Here's a full negotiation episode for `single_issue`:

### Round 0: Reset
```python
env.reset(task_id="single_issue", seed=42)
# Returns:
#   supplier_message: "Thanks for reaching out. Our standard pricing for this package is $52,400..."
#   current_offer: {"price": 52400.0}
#   buyer_constraints: {"price": {"target": 36000, ...}}
#   rapport_hint: "neutral"
```

### Round 1: Agent Makes Offer with Collaborative Language

```python
action = NegotiationAction(
    move_type="make_offer",
    terms={"price": 40000},
    message="I value our potential partnership and believe we can find a fair price that works for both of us. We're flexible on timeline."
)
obs = env.step(action)
# Returns:
#   supplier_message: "I appreciate you working with us. Based on our costs, $49,600 is where we can be."
#   current_offer: {"price": 49600.0}
#   rapport_hint: "positive"  (because message contained collaborative signals)
#   reward: 0.0  (still negotiating, no reward yet)
```

### Round 2: Agent Concedes

```python
action = NegotiationAction(
    move_type="make_offer",
    terms={"price": 43000},
    message="I understand your cost constraints. Let's work together to find a solution."
)
obs = env.step(action)
# Returns:
#   supplier_message: "I think we're close. If you can do $46,700, I can get this approved today."
#   current_offer: {"price": 46700.0}
#   rapport_hint: "positive"
```

### Round 3: Agent Concedes Again (Consecutive!)

```python
action = NegotiationAction(
    move_type="make_offer",
    terms={"price": 46000},
    message="We can move to $46,000 as a final compromise."
)
obs = env.step(action)
# Returns:
#   supplier_message: "That works for us. Let's move forward at those terms."
#   done: true
#   reward: 0.38  (per grade_single_issue: $46K reached in round 3)
#   info: {"deal_price": 46000}
```

### Grading This Episode

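- Opening: $52,400
- Grader target (`BUYER_TARGET` in `grade_single_issue`): $38,000, which differs from the $36,000 target shown in `buyer_constraints`
- Range: $14,400
- Improvement: $52,400 - $46,000 = $6,400
- value = $6,400 / $14,400 = 0.44
- Round 3 → efficiency = 1.0 - (3/6)^1.5 × 0.4 ≈ 0.86
- **Score: 0.44 × 0.86 ≈ 0.38**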

---

## Docker Deployment

### Dockerfile

```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
ENV PORT=7860
EXPOSE 7860
CMD ["python", "-m", "uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "7860"]
```

Key points:
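- Port **7860** (not 8000): the default port that HF Spaces routes to for Docker Spaces
- `ENV PORT=7860`: exposes the port number to any code that reads `PORT`; the `CMD` also passes `--port 7860` explicitly
- Uses `python -m uvicorn` with the full module path (`server.app:app`)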

### Running

```bash
# Build
docker build -t procure-rl .

# Run
docker run -p 7860:7860 procure-rl

# Test
curl -X POST http://localhost:7860/reset -H "Content-Type: application/json" -d '{"task_id": "single_issue"}'
```

### Health Check

The server exposes a health endpoint:
```
GET /health → {"status": "ok", "service": "procure-rl"}
```

---

## Calibration and Testing

### Test Files

#### `test_graders.py`
Verifies all graders return scores in [0.0, 1.0] range, even with edge cases.

#### `test_rl_properties.py`
Tests fundamental RL properties:
1. **Reproducibility**: Same seed → Same opening message
2. **Language sensitivity**: Collaborative language → Higher rapport
3. **Sequential decisions**: Consecutive concessions tracked in state
4. **Delayed reward**: Only terminal state has non-zero reward
5. **Accept terminates**: `move_type="accept"` ends episode
6. **Reset cleans state**: Fresh state after reset

#### `test_calibration.py`
Verifies score spread between random and strategic agents:

```
single_issue: Random avg=0.371, Strategic avg=0.487, Spread=0.116 ✅
multi_issue:   Random avg=0.364, Strategic avg=0.535, Spread=0.171 ✅
adversarial:   Random avg=0.304, Strategic avg=0.607, Spread=0.303 ✅
```

A healthy spread means the environment actually differentiates good vs bad behavior.
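The spread figures are simply the difference between the two averages, which can be checked directly. In the sketch below the averages are copied from the output above, while `MIN_SPREAD` is a hypothetical threshold chosen for illustration and not taken from `test_calibration.py`.

```python
# Averages copied from the calibration output above.
RESULTS = {
    "single_issue": {"random": 0.371, "strategic": 0.487},
    "multi_issue": {"random": 0.364, "strategic": 0.535},
    "adversarial": {"random": 0.304, "strategic": 0.607},
}
MIN_SPREAD = 0.10  # hypothetical pass threshold


def spread(task_id: str) -> float:
    """Strategic average minus random average, rounded to three decimals."""
    scores = RESULTS[task_id]
    return round(scores["strategic"] - scores["random"], 3)


for task_id in RESULTS:
    status = "ok" if spread(task_id) >= MIN_SPREAD else "too small"
    print(f"{task_id:13s} spread={spread(task_id):.3f} {status}")
```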

### Score Calibration Targets

| Task | Random Agent | Base LLM | Goal (Trained) |
|------|-------------|----------|-----------------|
| single_issue | 0.15–0.25 | 0.35–0.45 | 0.68–0.78 |
| multi_issue | 0.08–0.15 | 0.20–0.30 | 0.55–0.65 |
| adversarial | 0.03–0.10 | 0.12–0.20 | 0.45–0.55 |

---

## Summary: How Everything Fits Together

```
User runs inference.py
    │
    ▼
LLM agent receives observation (supplier message, current offer, constraints)
    │
    ▼
LLM decides action (make_offer with terms + collaborative message)
    │
    ▼
Environment.step(action) is called
    │
    ├─▶ Opponent responds (language → rapport → concession rate → counter)
    │
    ├─▶ State is updated (round_number++, rapport_score, consecutive_concessions)
    │
    └─▶ Observation returned (supplier_message, current_offer, rapport_hint)
    │
    ▼
If episode done: Grader scores the deal (relative to opening price, efficiency, patterns)
    │
    ▼
Score in [0.0, 1.0] returned
```

The agent learns through many episodes:
- **What language gets better rapport** → better concession rates
- **When to concede vs hold** → efficiency bonus
- **How to bundle multiple issues** → multi-issue tasks
- **How to avoid consecutive concession patterns** → adversarial task

The environment is designed to be learnable but not trivial — requiring genuine strategic thinking from an LLM agent.