Spaces: akshaypulla/procure-rl

akshaypulla committed on Apr 7

Commit c1be7c3 (verified)

1 Parent(s): 81ddf95

Upload folder using huggingface_hub

Files changed (28)

Dockerfile +89 -0
EXPLANATION.md +829 -0
Instructions.md +283 -0
README.md +322 -4
__init__.py +17 -0
client.py +76 -0
graders.py +161 -0
inference.py +197 -0
models.py +80 -0
openenv.yaml +38 -0
openenv_Procure_RL.egg-info/PKG-INFO +9 -0
openenv_Procure_RL.egg-info/SOURCES.txt +14 -0
openenv_Procure_RL.egg-info/dependency_links.txt +1 -0
openenv_Procure_RL.egg-info/entry_points.txt +2 -0
openenv_Procure_RL.egg-info/requires.txt +5 -0
openenv_Procure_RL.egg-info/top_level.txt +1 -0
opponent.py +213 -0
plan.md +1228 -0
pyproject.toml +45 -0
server/Procure_RL_environment.py +316 -0
server/__init__.py +11 -0
server/app.py +637 -0
server/requirements.txt +6 -0
test_calibration.py +110 -0
test_graders.py +76 -0
test_rl_properties.py +119 -0
uv.lock +0 -0
web_ui.png +0 -0

Dockerfile ADDED

	@@ -0,0 +1,89 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+# Multi-stage build using openenv-base
+# This Dockerfile is flexible and works for both:
+# - In-repo environments (with local OpenEnv sources)
+# - Standalone environments (with openenv from PyPI/Git)
+# The build script (openenv build) handles context detection and sets appropriate build args.
+ARG BASE_IMAGE=ghcr.io/meta-pytorch/openenv-base:latest
+FROM ${BASE_IMAGE} AS builder
+WORKDIR /app
+# Ensure git is available (required for installing dependencies from VCS)
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends git && \
+    rm -rf /var/lib/apt/lists/*
+# Build argument to control whether we're building standalone or in-repo
+ARG BUILD_MODE=in-repo
+ARG ENV_NAME=Procure_RL
+# Copy environment code (always at root of build context)
+COPY . /app/env
+# For in-repo builds, openenv is already vendored in the build context
+# For standalone builds, openenv will be installed via pyproject.toml
+WORKDIR /app/env
+# Ensure uv is available (for local builds where base image lacks it)
+RUN if ! command -v uv >/dev/null 2>&1; then \
+        curl -LsSf https://astral.sh/uv/install.sh | sh && \
+        mv /root/.local/bin/uv /usr/local/bin/uv && \
+        mv /root/.local/bin/uvx /usr/local/bin/uvx; \
+    fi
+# Install dependencies using uv sync
+# If uv.lock exists, use it; otherwise resolve on the fly
+RUN --mount=type=cache,target=/root/.cache/uv \
+    if [ -f uv.lock ]; then \
+        uv sync --frozen --no-install-project --no-editable; \
+    else \
+        uv sync --no-install-project --no-editable; \
+    fi
+RUN --mount=type=cache,target=/root/.cache/uv \
+    if [ -f uv.lock ]; then \
+        uv sync --frozen --no-editable; \
+    else \
+        uv sync --no-editable; \
+    fi
+# Final runtime stage
+FROM ${BASE_IMAGE}
+WORKDIR /app
+# Copy the virtual environment from builder
+COPY --from=builder /app/env/.venv /app/.venv
+# Copy the environment code
+COPY --from=builder /app/env /app/env
+# Also copy README.md to /app for the web interface
+COPY --from=builder /app/env/README.md /app/README.md
+# Set PATH to use the virtual environment
+ENV PATH="/app/.venv/bin:$PATH"
+# Set PYTHONPATH so imports work correctly
+ENV PYTHONPATH="/app/env:$PYTHONPATH"
+# Set PORT for HF Spaces compatibility
+ENV PORT=7860
+# Enable the web interface
+ENV ENABLE_WEB_INTERFACE=true
+# Health check
+HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
+    CMD curl -f http://localhost:7860/health || exit 1
+# Run the FastAPI server
+# The module path is constructed to work with the /app/env structure
+CMD ["sh", "-c", "cd /app/env && uvicorn server.app:app --host 0.0.0.0 --port 7860"]

EXPLANATION.md ADDED

	@@ -0,0 +1,829 @@

+# ProcureRL: A Deep Dive
+## Table of Contents
+1. [What is ProcureRL?](#what-is-procurerl)
+2. [Why Does This Exist?](#why-does-this-exist)
+3. [The Big Picture Architecture](#the-big-picture-architecture)
+4. [The Three Tasks](#the-three-tasks)
+5. [Data Models: What's Floating Around](#data-models-whats-floating-around)
+6. [The Scripted Opponent System](#the-scripted-opponent-system)
+7. [The Grading System](#the-grading-system)
+8. [The Environment Core](#the-environment-core)
+9. [The Server API](#the-server-api)
+10. [The Inference Script](#the-inference-script)
+11. [End-to-End Example](#end-to-end-example)
+12. [Docker Deployment](#docker-deployment)
+13. [Calibration and Testing](#calibration-and-testing)
+---
+## What is ProcureRL?
+ProcureRL is an **OpenEnv-compliant Reinforcement Learning environment** where an LLM (Large Language Model) agent learns to negotiate procurement deals against scripted supplier opponents.
+In simpler terms: it's a training ground for AI to practice negotiation — like a flight simulator, but for procurement conversations.
+### The Core Innovation: Language-Sensitive Opponent
+What makes ProcureRL special is that the opponent's behavior **responds to the quality of the agent's natural language**, not just the prices offered. This means:
+- An agent that outputs aggressive or low-effort language gets a **tough, unyielding opponent**
+- An agent that outputs collaborative, professional language gets a **more cooperative, flexible opponent**
+The language IS the policy, not just the action space. This makes an LLM genuinely required, not incidental.
+---
+## Why Does This Exist?
+Real-world procurement negotiation is:
+- **Sequential**: one decision affects the next
+- **Hidden utility**: the opponent's real priorities are not revealed
+- **Language-dependent**: how you say things matters as much as what you offer
+- **High-stakes**: Walmart deployed AI (Pactum) for exactly this, and 90% of CPOs are reportedly adopting AI negotiation in 2025
+Traditional rule-based negotiation tools are limited. An RL-trained LLM policy can learn to navigate this complexity in ways that static rules cannot.
+---
+## The Big Picture Architecture
+```
+┌─────────────────────────────────────────────────────────────────┐
+│                         ProcureRL System                        │
+├─────────────────────────────────────────────────────────────────┤
+│                                                                 │
+│  ┌──────────────────┐    ┌──────────────────┐                  │
+│  │   LLM Agent      │───▶│  Environment     │                  │
+│  │   (inference.py)  │    │  (Procure_RL_    │                  │
+│  │                   │    │   environment.py)│                  │
+│  └──────────────────┘    └────────┬─────────┘                  │
+│                                    │                            │
+│                                    ▼                            │
+│                          ┌──────────────────┐                  │
+│                          │  Scripted         │                  │
+│                          │  Opponent         │                  │
+│                          │  (opponent.py)    │                  │
+│                          └────────┬─────────┘                  │
+│                                    │                            │
+│                                    ▼                            │
+│                          ┌──────────────────┐                  │
+│                          │  Graders         │                  │
+│                          │  (graders.py)     │                  │
+│                          └──────────────────┘                  │
+│                                                                 │
+│  ┌──────────────────┐    ┌──────────────────┐                  │
+│  │  Server API      │    │  OpenEnv.yaml    │                  │
+│  │  (server/app.py)  │    │  (manifest)       │                  │
+│  └────────┬─────────┘    └──────────────────┘                  │
+│           │                                                     │
+│           ▼                                                     │
+│  ┌──────────────────┐                                         │
+│  │  Docker Container │◀── HF Spaces Deployment                │
+│  │  (port 7860)       │                                        │
+│  └──────────────────┘                                         │
+└─────────────────────────────────────────────────────────────────┘
+```
+The system is designed so that:
+1. **Environment** is deterministic and reproducible (seeded RNG)
+2. **Opponent** responds to language quality (via rapport system)
+3. **Graders** produce bounded [0.0, 1.0] scores
+4. **Server** exposes everything over HTTP for OpenEnv compliance
+5. **Inference** runs a baseline LLM agent against the environment
+---
+## The Three Tasks
+ProcureRL includes three tasks of increasing difficulty:
+### Task 1: `single_issue` (Easy)
+**Scenario:** Software license renewal. Price only.
+```
+Buyer Target: $36,000
+Seller Opens: ~$52,000 (varies by seed)
+Seller Floor: ~$44,000 (varies by seed)
+Max Rounds: 6
+Opponent Persona: Cooperative
+```
+The agent must negotiate the price down from opening to target. The cooperative opponent starts friendly and remains fairly flexible.
+**Example Grading:**
+- Deal at $38K in round 2: ~0.85 score
+- Deal at $44K in round 6: ~0.35 score
+- No deal: 0.0 score
+### Task 2: `multi_issue` (Medium)
+**Scenario:** Enterprise software negotiation with price AND payment terms.
+```
+Issues: price ($40K-$58K) + payment_days (30-90)
+Opponent Persona: Cash Flow Stressed
+  → Cares more about getting paid quickly (payment_weight: 0.65)
+  → Cares less about final price (price_weight: 0.35)
+Max Rounds: 8
+```
+**The Strategic Opportunity:** If the agent offers Net-30 or Net-45 payment terms, the opponent becomes more flexible on price. A naive agent treats both issues equally and scores low. A smart agent bundles payment speed with price negotiation.
+**Example Grading:**
+- Price $42K + Net-30 payment: ~0.60 score
+- Price $42K + Net-90 payment: ~0.35 score
+- No deal: 0.0 score
+### Task 3: `adversarial` (Hard)
+**Scenario:** Large contract with three issues — price, payment, and support hours.
+```
+Issues: price + payment_days + support_hours
+Opponent Persona: Aggressive Anchor
+  → Opens at ceiling on all issues
+  → Hardens position if agent makes consecutive concessions
+  → Rapport-sensitive but requires consistent collaborative framing
+Max Rounds: 10
+Survival Floor: 0.15 (completing any deal gets at least 0.15)
+```
+**The Challenge:** If the agent concedes on price in 2+ consecutive rounds, the opponent recognizes this pattern and becomes much harder to negotiate with. The agent must resist anchoring, break consecutive concession patterns, and maintain collaborative tone under pressure.
+**Example Grading:**
+- Strategic deal with no consecutive concessions: ~0.50 score
+- Same deal but with consecutive concession pattern: ~0.40 score
+- Survival deal (just complete): 0.15 score
+---
+## Data Models: What's Floating Around
+The system uses three Pydantic models defined in `models.py`:
+### `NegotiationAction`
+What the agent sends to the environment:
+```python
+class NegotiationAction(BaseModel):
+    move_type: str           # "make_offer" | "accept" | "reject" | "bundle"
+    terms: Dict[str, Any]    # {"price": 42000, "payment_days": 45}
+    message: str = ""        # Natural language — affects opponent rapport!
+```
+**Important:** The `message` field is not just flavor text. It directly affects opponent behavior through the rapport system.
+### `NegotiationObservation`
+What the environment sends back to the agent after each step:
+```python
+class NegotiationObservation(BaseModel):
+    task_id: str                           # Which task we're running
+    round_number: int                      # Current round (0 to max_rounds)
+    max_rounds: int                        # Task's round limit
+    supplier_message: str                  # Opponent's latest message
+    current_offer: Dict[str, Any]          # Terms currently on the table
+    last_4_exchanges: List[Dict]           # Recent conversation history
+    buyer_constraints: Dict[str, Any]      # Agent's targets and limits
+    rapport_hint: str                       # "positive" | "neutral" | "negative"
+    done: bool                             # Is episode finished?
+    reward: Optional[float] = None          # Reward (only on done)
+    metadata: Dict[str, Any] = Field(...)  # Extra info (deal_price, errors)
+```
+### `NegotiationState`
+The environment's internal state (accessible via `env.state`):
+```python
+class NegotiationState(BaseModel):
+    task_id: str = ""
+    episode_id: str = ""
+    round_number: int = 0
+    rapport_score: float = 0.5              # 0.0 to 1.0, starts neutral
+    consecutive_concessions: int = 0        # Tracks concession patterns
+    deal_reached: bool = False
+    final_terms: Optional[Dict] = None       # Set when episode ends
+    cumulative_reward: float = 0.0
+```
+---
+## The Scripted Opponent System
+The opponent is implemented in `opponent.py` as the `ScriptedPersonaOpponent` class.
+### The Rapport System (Language Sensitivity)
+The key mechanism is **rapport** — a score from 0.0 to 1.0 that changes based on the agent's language quality.
+**Collaborative Signals (increase rapport):**
+```python
+COLLABORATIVE_SIGNALS = [
+    "understand", "partnership", "mutual", "together", "value",
+    "appreciate", "flexible", "work with", "long-term", "relationship",
+    "reasonable", "fair", "both", "solution"
+]
+```
+**Aggressive Signals (decrease rapport):**
+```python
+AGGRESSIVE_SIGNALS = [
+    "demand", "require", "final offer", "unacceptable", "must",
+    "non-negotiable", "take it or leave", "bottom line", "ultimatum",
+    "insist", "refuse", "absolutely not"
+]
+```
+**How it works:**
+```python
+def update_rapport(self, agent_message: str) -> None:
+    msg_lower = agent_message.lower()
+    delta = 0.0
+    delta += sum(0.08 for w in COLLABORATIVE_SIGNALS if w in msg_lower)
+    delta -= sum(0.08 for w in AGGRESSIVE_SIGNALS if w in msg_lower)
+    delta = max(-0.20, min(0.20, delta))  # Cap per-round change
+    self.rapport = max(0.0, min(1.0, self.rapport + delta))
+```
+Every message the agent sends adjusts rapport by ±0.08 per keyword detected, capped at ±0.20 per round.
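The rapport update above can be tried in isolation. The following is a minimal sketch, assuming the keyword lists and constants shown in the excerpt; `rapport_after` is a hypothetical free function written for illustration, not part of `opponent.py`. Because matching uses a substring test, a keyword such as "must" also fires inside unrelated words like "mustard".

```python
COLLABORATIVE_SIGNALS = [
    "understand", "partnership", "mutual", "together", "value",
    "appreciate", "flexible", "work with", "long-term", "relationship",
    "reasonable", "fair", "both", "solution",
]
AGGRESSIVE_SIGNALS = [
    "demand", "require", "final offer", "unacceptable", "must",
    "non-negotiable", "take it or leave", "bottom line", "ultimatum",
    "insist", "refuse", "absolutely not",
]


def rapport_after(rapport: float, message: str) -> float:
    """Hypothetical pure-function mirror of update_rapport()."""
    msg = message.lower()
    delta = 0.08 * sum(1 for w in COLLABORATIVE_SIGNALS if w in msg)
    delta -= 0.08 * sum(1 for w in AGGRESSIVE_SIGNALS if w in msg)
    delta = max(-0.20, min(0.20, delta))      # per-round cap
    return max(0.0, min(1.0, rapport + delta))  # keep rapport in [0, 1]


# Four collaborative keywords would add 0.32, but the cap limits the gain to 0.20.
print(round(rapport_after(0.5, "We value a fair, long-term partnership."), 2))
# Three aggressive keywords would subtract 0.24, capped at 0.20.
print(round(rapport_after(0.5, "That is unacceptable, we must insist."), 2))
# Substring matching: "mustard" contains "must" and costs 0.08.
print(round(rapport_after(0.5, "Please pass the mustard."), 2))
```

One consequence of the cap is that stacking more than three keywords into a single message gains nothing; steady collaborative framing across rounds moves rapport further than one keyword-heavy message.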
+### Concession Rate: How Fast the Opponent Moves
+Rapport directly modifies the opponent's concession rate:
+```python
+def get_concession_rate(self) -> float:
+    base_rates = {
+        "cooperative": 0.05,        # 5% per round base
+        "cash_flow_stressed": 0.07,
+        "aggressive_anchor": 0.04,
+    }
+    base = base_rates[self.persona]
+    modifier = (self.rapport - 0.5) * base  # +/- 50% of base
+    return max(0.01, base + modifier)
+```
+**Example:** Cooperative opponent with high rapport (0.8) concedes at 0.05 + (0.8 - 0.5) × 0.05 = **6.5% per round**. With low rapport (0.2), it concedes at 0.05 + (0.2 - 0.5) × 0.05 = **3.5% per round**. The extremes of 7.5% and 2.5% are reached only at rapport 1.0 and 0.0.
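To see the full range of each persona, the formula can be evaluated at the rapport extremes. This is a sketch under the assumption that the base rates and the `max(0.01, ...)` floor are exactly as in the excerpt; `concession_rate` is a hypothetical standalone version of `get_concession_rate()`.

```python
BASE_RATES = {
    "cooperative": 0.05,
    "cash_flow_stressed": 0.07,
    "aggressive_anchor": 0.04,
}


def concession_rate(persona: str, rapport: float) -> float:
    """Hypothetical standalone version of get_concession_rate()."""
    base = BASE_RATES[persona]
    modifier = (rapport - 0.5) * base  # ranges from -50% to +50% of base
    return max(0.01, base + modifier)


for persona in BASE_RATES:
    # Rates in percent at rapport 0.0, 0.5 and 1.0
    rates = [round(concession_rate(persona, r) * 100, 2) for r in (0.0, 0.5, 1.0)]
    print(f"{persona:20s} {rates}")
```

With these base rates the lowest possible rate is half the base (2% for the aggressive anchor), so the 0.01 floor never binds on rapport alone; it only matters once another multiplier, such as the hardening factor, is applied afterwards.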
+### Three Personas
+#### 1. Cooperative (`single_issue`)
+- Friendly, understanding tone
+- 5% base concession rate, highly sensitive to rapport
+- Accepts early if price is above floor and round ≥ 2
+#### 2. Cash Flow Stressed (`multi_issue`)
+- Cares about payment timing more than price
+- 7% base concession rate, moderate rapport sensitivity
+- Acceptance requires `payment_days ≤ 45`
+- Comments on payment timing in responses
+#### 3. Aggressive Anchor (`adversarial`)
+- Opens at ceiling, hardens with pressure
+- 4% base concession rate (least flexible)
+- **Penalizes consecutive concessions** — if agent concedes 2+ rounds in a row, concession rate drops to 40% of normal
+- Uses "hardening" templates when cornered
+### Opponent Response Flow
+```python
+def respond(self, agent_message, agent_terms, round_number, consecutive_concessions):
+    # 1. Update rapport based on agent's language
+    self.update_rapport(agent_message)
+    # 2. Check acceptance (from round 2 onward, and price must be ≥ floor)
+    if round_number >= 2 and agent_price >= self.price_floor and _acceptance_condition():
+        return self.templates["accept"], {**agent_terms, "_accepted": True}
+    # 3. Calculate concession rate
+    concession = self.get_concession_rate()
+    # 4. Aggressive anchor gets harder if detecting concession pattern
+    if self.persona == "aggressive_anchor" and consecutive_concessions >= 2:
+        concession = concession * 0.4  # 60% reduction!
+        template_key = "hardening"
+    elif round_number >= 0.7 * max_rounds:  # max_rounds = the task's round limit
+        template_key = "near_close"
+    else:
+        template_key = "counter"
+    # 5. Compute new position
+    new_position = self.current_position * (1 - concession)
+    new_position = max(self.price_floor, new_position)  # Never go below floor
+    # 6. Return message and counter terms
+    return message, counter_terms
+```
+---
+## The Grading System
+Graders are in `graders.py` and produce scores in [0.0, 1.0]. They are **pure Python — zero LLM calls**, ensuring deterministic, reproducible scoring.
+### Key Design: Relative Scoring
+The graders score based on **how much the agent improved from the opponent's opening price**, not on absolute thresholds. This makes the environment learnable — the agent learns to negotiate better deals relative to where negotiations started.
+```python
+# Instead of scoring against a hardcoded floor, we score relative to the opening:
+value = (opponent_opening - final_price) / (opponent_opening - BUYER_TARGET)
+```
+### Single Issue Grading
+```python
+def grade_single_issue(final_terms, deal_reached, rounds_taken, max_rounds=6, opponent_opening=52000.0):
+    if not deal_reached:
+        return 0.0
+    final_price = final_terms.get("price", opponent_opening)
+    BUYER_TARGET = 38000.0
+    # If price didn't improve from opening, minimal score
+    if final_price >= opponent_opening:
+        return 0.05
+    # How much did we improve relative to the possible improvement?
+    value = (opponent_opening - final_price) / (opponent_opening - BUYER_TARGET)
+    value = max(0.0, min(1.0, value))
+    # Efficiency penalty for taking too long
+    efficiency = 1.0 - (rounds_taken / max_rounds) ** 1.5 * 0.4
+    efficiency = max(0.1, efficiency)  # Never below 0.1
+    return round(value * efficiency, 4)
+```
+**Example:**
+- Opening: $52,000, Target: $38,000, Range: $14,000
+- Final price: $45,000 → improvement: $7,000 → value = 0.50
+- Round 3 → efficiency = 1.0 - (3/6)^1.5 × 0.4 = 0.86
+- **Score: 0.50 × 0.86 = 0.43**
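The arithmetic for a $45,000 deal can be checked round by round. The sketch below assumes the formula and defaults from the `grade_single_issue` excerpt; `single_issue_breakdown` is a hypothetical helper with a simplified signature that exposes the intermediate terms.

```python
def single_issue_breakdown(final_price, rounds_taken, max_rounds=6,
                           opponent_opening=52000.0, buyer_target=38000.0):
    """Hypothetical helper: return (value, efficiency, score) for a closed deal."""
    if final_price >= opponent_opening:
        # No improvement over the opening: fixed minimal score.
        # value and efficiency are placeholders in this branch.
        return 0.0, 0.0, 0.05
    value = (opponent_opening - final_price) / (opponent_opening - buyer_target)
    value = max(0.0, min(1.0, value))
    efficiency = max(0.1, 1.0 - (rounds_taken / max_rounds) ** 1.5 * 0.4)
    return value, efficiency, round(value * efficiency, 4)


# Same $45,000 deal, closed in each possible round.
for rounds in range(1, 7):
    value, efficiency, score = single_issue_breakdown(45000.0, rounds)
    print(f"round {rounds}: value={value:.2f} "
          f"efficiency={efficiency:.2f} score={score:.4f}")
```

Within the round limit, efficiency only falls from roughly 0.97 (round 1) to 0.60 (round 6), so the `max(0.1, ...)` guard matters only if `rounds_taken` exceeds `max_rounds`.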
+### Multi-Issue Grading
+```python
+def grade_multi_issue(final_terms, deal_reached, rounds_taken, max_rounds=8, opponent_opening=52000.0):
+    # Two dimensions: price (70% weight) and payment_days (30% weight)
+    price_value = (opponent_opening - final_price) / (opponent_opening - 40000)
+    payment_score = (90 - payment_days) / (90 - 30)
+    value = 0.70 * price_value + 0.30 * payment_score
+    # If price didn't improve but payment did, still score on payment
+    if final_price >= opponent_opening:
+        value = 0.30 * payment_score  # Only payment matters
+```
+**Example:**
+- Price: $44,000 (good), Payment: Net-45 (good), default opening of $52,000 → price_value=0.67, payment_score=0.75
+- value = 0.70×0.67 + 0.30×0.75 = 0.69
+### Adversarial Grading
+```python
+def grade_adversarial(final_terms, deal_reached, rounds_taken, consecutive_concessions_flag, ...):
+    SURVIVAL_FLOOR = 0.15  # Completing any deal gets at least 0.15
+    # Three dimensions with weights
+    value = 0.40 * price_value + 0.35 * payment_score + 0.25 * support_score
+    # Pattern penalty: bad if you showed consecutive concessions
+    pattern_penalty = 0.10 if consecutive_concessions_flag else 0.0
+    raw = (value * efficiency) - pattern_penalty
+    return round(max(SURVIVAL_FLOOR, raw), 4)
+```
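The interaction between the pattern penalty and the survival floor is easiest to see with the final combination step on its own. This sketch takes `value` and `efficiency` as inputs because the excerpt elides how the three sub-scores are computed; `adversarial_score` is a hypothetical helper covering only the last two lines of the grader, for a deal that was reached.

```python
SURVIVAL_FLOOR = 0.15
PATTERN_PENALTY = 0.10


def adversarial_score(value: float, efficiency: float,
                      consecutive_concessions_flag: bool) -> float:
    """Hypothetical helper: final combination step for a completed deal."""
    penalty = PATTERN_PENALTY if consecutive_concessions_flag else 0.0
    raw = value * efficiency - penalty
    return round(max(SURVIVAL_FLOOR, raw), 4)


# A solid deal loses a flat 0.10 when the concession pattern was flagged.
print(adversarial_score(0.60, 0.85, False))
print(adversarial_score(0.60, 0.85, True))
# A weak deal is lifted to the floor, with or without the penalty.
print(adversarial_score(0.20, 0.60, False))
print(adversarial_score(0.20, 0.60, True))
```

Since the floor is applied after the penalty, the penalty has no effect on deals whose raw score is already at or below 0.15; it only differentiates agents that negotiate well enough to clear the floor.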
+---
+## The Environment Core
+The `ProcureRLEnvironment` class in `server/Procure_RL_environment.py` is the heart of the system.
+### Reset Flow
+```python
+def reset(self, seed=None, episode_id=None, **kwargs):
+    task_id = kwargs.get("task_id", "single_issue")
+    # 1. Set up opponent with seeded RNG
+    opponent_seed = hash((seed, task_id)) % (2**32)
+    self._opponent = ScriptedPersonaOpponent(task_id=task_id, seed=opponent_seed, persona=...)
+    # 2. Get opponent's opening message and terms
+    opening_msg, opening_terms = self._opponent.get_opening_message()
+    self._opponent_opening_price = opening_terms.get("price", 52000.0)
+    # 3. Initialize state
+    self._state = NegotiationState(
+        task_id=task_id,
+        episode_id=episode_id or str(uuid.uuid4())[:8],
+        round_number=0,
+        rapport_score=0.5,  # Neutral
+        ...
+    )
+    # 4. Return initial observation
+    return NegotiationObservation(
+        ...
+        supplier_message=opening_msg,
+        current_offer=opening_terms,
+        ...
+    )
+```
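One caveat on the seeding step: Python salts the built-in `hash()` of strings per process unless `PYTHONHASHSEED` is fixed, so `hash((seed, task_id))` can yield a different opponent seed in each run. A minimal sketch of a process-independent alternative follows; `stable_opponent_seed` is a hypothetical helper, not something the repository defines.

```python
import hashlib


def stable_opponent_seed(seed, task_id: str) -> int:
    """Hypothetical helper: derive a 32-bit seed that is stable across processes."""
    payload = f"{seed}:{task_id}".encode("utf-8")
    digest = hashlib.sha256(payload).digest()
    # The first 4 bytes give an integer in [0, 2**32).
    return int.from_bytes(digest[:4], "big")


print(stable_opponent_seed(42, "single_issue"))
print(stable_opponent_seed(42, "adversarial"))
```

The result could be passed as `seed=` to `ScriptedPersonaOpponent` in place of the `hash(...) % (2**32)` expression, assuming the opponent only needs an integer in that range.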
+### Step Flow
+```python
+def step(self, action, **kwargs):
+    # 1. Validate action
+    if not isinstance(action, NegotiationAction):
+        action = NegotiationAction(...)  # Convert from dict
+    # 2. Track consecutive concessions (for adversarial opponent)
+    if "price" in action.terms:  # accept/reject may carry no price
+        price = float(action.terms["price"])
+        if self._prev_agent_price is not None and price > self._prev_agent_price:
+            self._consecutive_concessions += 1  # Buyer raised its price, i.e. moved toward the opponent
+        else:
+            self._consecutive_concessions = 0
+        self._prev_agent_price = price
+    # 3. Handle different move types
+    if action.move_type in ("make_offer", "bundle"):
+        # Get opponent response
+        opponent_msg, opponent_terms = self._opponent.respond(...)
+        # Check if opponent accepted
+        if opponent_terms.get("_accepted"):
+            # Episode ends, compute reward
+            reward = grade(...)
+            return obs_with_reward
+        # Otherwise, continue negotiation
+        self._last_offer = opponent_terms
+        return obs_with_current_state
+    if action.move_type == "accept":
+        # Agent accepts current terms, episode ends
+        reward = grade(...)
+        return obs_with_reward
+    if action.move_type == "reject":
+        if round_number >= max_rounds:
+            # Rejected at limit, no reward
+            return obs_done_no_reward
+        return obs_continue  # Rejected early, keep going
+```
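The concession tracking in step 2 can be replayed over a list of buyer offers. The sketch assumes the rule shown above (a higher buyer price than the previous offer counts as a concession, anything else resets the counter); `concession_streaks` is a hypothetical helper.

```python
def concession_streaks(prices):
    """Hypothetical helper: counter value after each successive buyer offer."""
    streak, prev, history = 0, None, []
    for price in prices:
        if prev is not None:
            # Raising the offer moves toward the seller; holding or lowering resets.
            streak = streak + 1 if price > prev else 0
        prev = price
        history.append(streak)
    return history


# Two raises in a row reach the threshold of 2 that hardens the aggressive anchor.
print(concession_streaks([40000, 42000, 44000, 44000, 45000]))
# A buyer who keeps lowering its offer never registers a concession.
print(concession_streaks([48000, 47000, 46000]))
```

Holding the price for a single round is enough to reset the counter, which is the cheapest way for an agent to break a pattern before the adversarial opponent hardens.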
+### State Property
+```python
+@property
+def state(self) -> NegotiationState:
+    return self._state
+```
+Returns the internal `NegotiationState` object, giving access to:
+- `round_number`
+- `rapport_score`
+- `consecutive_concessions`
+- `deal_reached`
+- `final_terms`
+- `cumulative_reward`
+---
+## The Server API
+The FastAPI server in `server/app.py` exposes the environment over HTTP and WebSocket.
+### Endpoints
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/health` | GET | Health check |
+| `/reset` | POST | Reset environment with `task_id` and `seed` |
+| `/step` | POST | Execute an action |
+| `/state` | GET | Get current `NegotiationState` |
+| `/ws` | WS | WebSocket for persistent sessions |
+### Request/Response Examples
+**POST /reset**
+```json
+// Request
+{"task_id": "single_issue", "seed": 42}
+// Response
+{
+  "task_id": "single_issue",
+  "round_number": 0,
+  "max_rounds": 6,
+  "supplier_message": "Thanks for reaching out. Our standard pricing for this package is $52,400. Happy to discuss.",
+  "current_offer": {"price": 52400.0},
+  "buyer_constraints": {"price": {"target": 36000, "worst": 55000, "budget": 53000}},
+  "rapport_hint": "neutral",
+  "done": false
+}
+```
+**POST /step**
+```json
+// Request
+{"move_type": "make_offer", "terms": {"price": 48000}, "message": "I appreciate your flexibility and would like to find a fair price for both parties."}
+// Response
+{
+  "observation": {
+    "task_id": "single_issue",
+    "round_number": 1,
+    "max_rounds": 6,
+    "supplier_message": "I appreciate you working with us. Based on our costs, $49,800 is where we can be.",
+    "current_offer": {"price": 49800.0},
+    "rapport_hint": "positive",
+    "done": false
+  },
+  "reward": 0.0,
+  "done": false,
+  "info": {}
+}
+```
+### Key Implementation Detail: Lambda Closure
+```python
+_env_instance = ProcureRLEnvironment()
+app = create_app(
+    lambda: _env_instance,  # CRITICAL: return the shared instance; building a new env here would give each request a fresh one
+    NegotiationAction,
+    NegotiationObservation,
+    env_name="ProcureRL",
+    max_concurrent_envs=1,
+)
+```
+`create_app()` takes a factory and calls it whenever it needs an environment. If that factory built a new `ProcureRLEnvironment()` on each call, every request would get a **fresh environment** and the negotiation state would be lost between `/reset` and `/step`. The lambda instead closes over `_env_instance`, so all requests share the same environment.
+---
+## The Inference Script
+`inference.py` is a baseline agent that runs an LLM against the environment.
+### Output Format (Sacred)
+The script MUST output exactly:
+```
+[START] task=single_issue env=procure-rl model=Qwen/Qwen2.5-72B-Instruct
+[STEP] step=1 action=make_offer({"price": 45000}) reward=0.00 done=false error=null
+[STEP] step=2 action=accept({}) reward=0.47 done=true error=null
+[END] success=true steps=2 score=0.47 rewards=0.00,0.47
+```
+Any deviation from this format causes validation to fail.
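Since the format is strict, it can help to check a run's output locally before submitting. The validator below is only a sketch inferred from the four sample lines above; the real validation rules are not shown here, and the regular expressions and `validate_log` are assumptions.

```python
import re

# Patterns inferred from the sample output; the official rules may differ.
START_RE = re.compile(r"^\[START\] task=(\S+) env=(\S+) model=(\S+)$")
STEP_RE = re.compile(
    r"^\[STEP\] step=(\d+) action=(\w+)\((.*)\) "
    r"reward=(\d+\.\d{2}) done=(true|false) error=(.+)$"
)
END_RE = re.compile(
    r"^\[END\] success=(true|false) steps=(\d+) "
    r"score=(\d+\.\d{2}) rewards=(\d+\.\d{2}(?:,\d+\.\d{2})*)$"
)


def validate_log(lines):
    """Return True if the lines match the START / STEP... / END layout."""
    if len(lines) < 2 or not START_RE.match(lines[0]):
        return False
    end = END_RE.match(lines[-1])
    if not end:
        return False
    steps = [STEP_RE.match(line) for line in lines[1:-1]]
    if not all(steps):
        return False
    # Steps must be numbered 1..n, and the END line must agree with them.
    numbered = [int(m.group(1)) for m in steps] == list(range(1, len(steps) + 1))
    rewards = [m.group(4) for m in steps]
    return (numbered
            and int(end.group(2)) == len(steps)
            and end.group(4).split(",") == rewards)


sample = [
    "[START] task=single_issue env=procure-rl model=Qwen/Qwen2.5-72B-Instruct",
    '[STEP] step=1 action=make_offer({"price": 45000}) reward=0.00 done=false error=null',
    "[STEP] step=2 action=accept({}) reward=0.47 done=true error=null",
    "[END] success=true steps=2 score=0.47 rewards=0.00,0.47",
]
print(validate_log(sample))
```

The check is deliberately strict about two-decimal rewards and lowercase booleans, because both appear consistently in the sample lines.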
+### How It Works
+```python
+def run_task(task_id):
+    env = ProcureRLEnvironment()
+    obs = env.reset(task_id=task_id, seed=42)
+    print(f"[START] task={task_id} ...")
+    step, done, final_score = 0, False, 0.0
+    while not done and step < MAX_STEPS:
+        step += 1
+        # 1. Get action from LLM
+        action_dict = get_agent_action(obs_to_dict(obs))
+        # 2. Convert to NegotiationAction
+        action = NegotiationAction(
+            move_type=action_dict.get("move_type", "make_offer"),
+            terms=action_dict.get("terms", {}),
+            message=action_dict.get("message", "")
+        )
+        # 3. Step environment
+        obs = env.step(action)
+        # 4. Print step result (reward is None until the episode ends)
+        reward = obs.reward if obs.reward is not None else 0.0
+        print(f"[STEP] step={step} action={...} reward={reward:.2f} ...")
+        if obs.done:
+            final_score = reward
+            break
+    print(f"[END] success={...} steps={step} score={final_score:.2f} ...")
+```
+### LLM Prompt
+```python
+SYSTEM_PROMPT = """You are a professional procurement negotiator. Your goal is to negotiate the best possible deal for your company.
+You will receive a supplier's message and current offer terms. You must respond with a JSON action:
+{
+  "move_type": "make_offer",
+  "terms": {"price": 42000, "payment_days": 45},
+  "message": "Your natural language response to the supplier"
+}
+move_type must be one of: make_offer, accept, reject, bundle
+message should be professional and collaborative when possible."""
+```
+---
+## End-to-End Example
+Here's a full negotiation episode for `single_issue`:
+### Round 0: Reset
+```python
+env.reset(task_id="single_issue", seed=42)
+# Returns:
+#   supplier_message: "Thanks for reaching out. Our standard pricing for this package is $52,400..."
+#   current_offer: {"price": 52400.0}
+#   buyer_constraints: {"price": {"target": 36000, ...}}
+#   rapport_hint: "neutral"
+```
+### Round 1: Agent Makes Offer with Collaborative Language
+```python
+action = NegotiationAction(
+    move_type="make_offer",
+    terms={"price": 48000},
+    message="I value our potential partnership and believe we can find a fair price that works for both of us. We're flexible on timeline."
+)
+obs = env.step(action)
+# Returns:
+#   supplier_message: "I appreciate you working with us. Based on our costs, $49,600 is where we can be."
+#   current_offer: {"price": 49600.0}
+#   rapport_hint: "positive"  (because message contained collaborative signals)
+#   reward: 0.0  (still negotiating, no reward yet)
+```
+### Round 2: Agent Lowers Its Offer
+```python
+action = NegotiationAction(
+    move_type="make_offer",
+    terms={"price": 47000},
+    message="I understand your cost constraints. Let's work together to find a solution."
+)
+obs = env.step(action)
+# Returns:
+#   supplier_message: "I think we're close. If you can do $46,700, I can get this approved today."
+#   current_offer: {"price": 46700.0}
+#   rapport_hint: "positive"
+```
+### Round 3: Agent Lowers Its Offer Again
+```python
+action = NegotiationAction(
+    move_type="make_offer",
+    terms={"price": 46000},
+    message="We can move to $46,000 as a final compromise."
+)
+obs = env.step(action)
+# Returns:
+#   supplier_message: "That works for us. Let's move forward at those terms."
+#   done: true
+#   reward: 0.38  (value 0.44 × efficiency 0.86 under grade_single_issue)
+#   info: {"deal_price": 46000}
+```
+### Grading This Episode
+- Opening: $52,400
+- Target (`BUYER_TARGET` in `grade_single_issue`): $38,000
+- Range: $14,400
+- Improvement: $52,400 - $46,000 = $6,400
+- value = $6,400 / $14,400 = 0.44
+- Round 3 → efficiency = 1.0 - (3/6)^1.5 × 0.4 = 0.86
+- **Score: 0.44 × 0.86 = 0.38**
+---
+## Docker Deployment
+### Dockerfile
+```dockerfile
+FROM python:3.11-slim
+WORKDIR /app
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+COPY . .
+ENV PORT=7860
+EXPOSE 7860
+CMD ["python", "-m", "uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "7860"]
+```
+Key points:
+- Port **7860** (not 8000) — required by HF Spaces
+- `ENV PORT=7860` — tells the app which port to listen on
+- Uses `python -m uvicorn` with full module path
+### Running
+```bash
+# Build
+docker build -t procure-rl .
+# Run
+docker run -p 7860:7860 procure-rl
+# Test
+curl -X POST http://localhost:7860/reset -H "Content-Type: application/json" -d '{"task_id": "single_issue"}'
+```
+### Health Check
+The server exposes a health endpoint:
+```json
+GET /health → {"status": "ok", "service": "procure-rl"}
+```
+---
+## Calibration and Testing
+### Test Files
+#### `test_graders.py`
+Verifies all graders return scores in [0.0, 1.0] range, even with edge cases.
+#### `test_rl_properties.py`
+Tests fundamental RL properties:
+1. **Reproducibility**: Same seed → Same opening message
+2. **Language sensitivity**: Collaborative language → Higher rapport
+3. **Sequential decisions**: Consecutive concessions tracked in state
+4. **Delayed reward**: Only terminal state has non-zero reward
+5. **Accept terminates**: `move_type="accept"` ends episode
+6. **Reset cleans state**: Fresh state after reset
+#### `test_calibration.py`
+Verifies score spread between random and strategic agents:
+```
+single_issue: Random avg=0.371, Strategic avg=0.487, Spread=0.116 ✅
+multi_issue:   Random avg=0.364, Strategic avg=0.535, Spread=0.171 ✅
+adversarial:   Random avg=0.304, Strategic avg=0.607, Spread=0.303 ✅
+```
+A healthy spread means the environment actually differentiates good vs bad behavior.
+### Score Calibration Targets
+| Task | Random Agent | Base LLM | Goal (Trained) |
+|------|-------------|----------|-----------------|
+| single_issue | 0.15–0.25 | 0.35–0.45 | 0.68–0.78 |
+| multi_issue | 0.08–0.15 | 0.20–0.30 | 0.55–0.65 |
+| adversarial | 0.03–0.10 | 0.12–0.20 | 0.45–0.55 |
+---
+## Summary: How Everything Fits Together
+```
+User runs inference.py
+    │
+    ▼
+LLM agent receives observation (supplier message, current offer, constraints)
+    │
+    ▼
+LLM decides action (make_offer with terms + collaborative message)
+    │
+    ▼
+Environment.step(action) is called
+    │
+    ├─▶ Opponent responds (language → rapport → concession rate → counter)
+    │
+    ├─▶ State is updated (round_number++, rapport_score, consecutive_concessions)
+    │
+    └─▶ Observation returned (supplier_message, current_offer, rapport_hint)
+    │
+    ▼
+If episode done: Grader scores the deal (relative to opening price, efficiency, patterns)
+    │
+    ▼
+Score in [0.0, 1.0] returned
+```
+The agent learns through many episodes:
+- **What language gets better rapport** → better concession rates
+- **When to concede vs hold** → efficiency bonus
+- **How to bundle multiple issues** → multi-issue tasks
+- **How to avoid consecutive concession patterns** → adversarial task
+The environment is designed to be learnable but not trivial — requiring genuine strategic thinking from an LLM agent.

Instructions.md ADDED

	@@ -0,0 +1,283 @@

+## Overview
+Build a **deterministic OpenEnv environment** for real-world procurement negotiation.
+- Must follow OpenEnv API (`reset / step / state`)
+- Must include **3 tasks (easy → medium → hard)**
+- Must produce **deterministic rewards in [0.0, 1.0]**
+- Must be **fully reproducible and deployable**
+---
+## Core Requirements
+### 1. Environment
+Implement in:
+```
+procure_rl/environment.py
+```
+- `reset(task_id, seed)` → initial observation
+- `step(action)` → `(observation, reward, done, info)`
+- `state()` → internal state
+Use typed models from:
+```
+procure_rl/models.py
+```
+---
+### 2. Tasks (MANDATORY: 3)
+Defined in:
+```
+procure_rl/environment.py (TASK_CONFIG)
+```
+| Task         | Description                       |
+| ------------ | --------------------------------- |
+| single_issue | price-only negotiation            |
+| multi_issue  | price + payment tradeoff          |
+| adversarial  | multi-issue + aggressive opponent |
+Each must:
+- have different difficulty
+- run within step limits
+- produce score ∈ [0,1]
+---
+### 3. Opponent (CRITICAL)
+Implemented in:
+```
+procure_rl/opponent.py
+```
+Requirements:
+- deterministic (seeded RNG)
+- no LLM usage
+- **language-sensitive behavior** (via keyword detection)
+👉 This is what makes an LLM useful without breaking reproducibility.
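One way to satisfy the "deterministic (seeded RNG)" requirement is to give each episode its own `random.Random` instance instead of touching the global generator. The sketch below is a hypothetical helper, not the real `opponent.py`; the class name `SeededOpener`, the base price, and the 3% jitter are assumptions chosen for illustration:

```python
import random


class SeededOpener:
    """Hypothetical helper: draws the supplier's opening price from a private RNG."""

    def __init__(self, seed, base_price=52000.0, jitter=0.03):
        # random.Random(seed) is independent of the global random module state,
        # so two instances built with the same seed produce the same sequence.
        self.rng = random.Random(seed)
        self.base_price = base_price
        self.jitter = jitter

    def opening_price(self):
        factor = 1.0 + self.rng.uniform(-self.jitter, self.jitter)
        return round(self.base_price * factor, 2)


first = SeededOpener(seed=42).opening_price()
second = SeededOpener(seed=42).opening_price()
print(first == second)  # same seed, same opening price
```

Keeping the generator private to the episode means unrelated code that calls `random.seed()` cannot change the opponent's behavior.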
+---
+### 4. Reward / Graders
+Implemented in:
+```
+procure_rl/graders.py
+```
+Requirements:
+- deterministic
+- bounded [0.0, 1.0]
+- reflect:
+  - deal quality
+  - efficiency (rounds)
+- no randomness, no LLM
+---
+### 5. API Server
+Implemented in:
+```
+server/app.py
+```
+Endpoints:
+- `/reset`
+- `/step`
+- `/state`
+- `/health`
+Must return valid JSON and HTTP 200.
+---
+### 6. OpenEnv Spec
+File:
+```
+openenv.yaml
+```
+Must define:
+- environment name
+- tasks (3+)
+- reward range
+- action/observation description
+Validate with:
+```
+openenv validate
+```
+---
+### 7. Inference Script (MANDATORY)
+File:
+```
+inference.py
+```
+Requirements:
+- uses OpenAI client
+- reads:
+  - `API_BASE_URL`
+  - `MODEL_NAME`
+  - `HF_TOKEN`
+- interacts with env via loop
+- prints EXACT format:
+```
+[START] ...
+[STEP] ...
+[END] ...
+```
+⚠️ Any formatting deviation → failure
+---
+### 8. Docker + Deployment
+File:
+```
+Dockerfile
+```
+Must:
+- build successfully
+- expose port `7860`
+- run FastAPI server
+Test:
+```
+docker build -t procure-rl .
+docker run -p 7860:7860 procure-rl
+```
+---
+### 9. Hugging Face Space
+Must:
+- deploy successfully
+- respond to `/reset` with HTTP 200
+---
+### 10. README
+Must include:
+- environment description
+- action & observation formats
+- task descriptions
+- setup instructions
+- baseline scores
+---
+## Validation Checklist (ALL REQUIRED)
+Run before submission:
+```
+openenv validate
+docker build .
+python inference.py
+```
+Ensure:
+- all 3 tasks run
+- scores ∈ [0,1]
+- runtime < 20 minutes
+- no crashes
+---
+## Constraints
+- No LLM inside environment
+- No randomness without seed
+- Must run on:
+  - 2 vCPU
+  - 8GB RAM
+---
+## Key Design Principle
+> LLM is used for **decision-making**, not environment logic.
+- Environment = deterministic
+- Agent (LLM) = intelligent
+---
+## File Reference Summary
+```
+procure_rl/
+  models.py        # dataclasses
+  environment.py   # core logic
+  opponent.py      # scripted opponent
+  graders.py       # reward functions
+server/
+  app.py           # API
+inference.py       # baseline agent
+openenv.yaml       # spec
+Dockerfile         # deployment
+README.md          # docs
+```
+---
+## Final Rule
+If any of these fail:
+- Docker build
+- openenv validate
+- inference script
+👉 **Submission is disqualified**
+---
+## One-line Goal
+> Build a deterministic, real-world negotiation environment where an LLM agent must make sequential decisions to maximize reward.
+---

README.md CHANGED

@@ -1,10 +1,328 @@
 ---
-title: Procure Rl
-emoji: 🏆
 colorFrom: green
-colorTo: purple
 sdk: docker
 pinned: false
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 ---
+title: ProcureRL Environment
+emoji: 🤝
 colorFrom: green
+colorTo: blue
 sdk: docker
 pinned: false
+app_port: 7860
+base_path: /web
+tags:
+  - openenv
+  - negotiation
+  - procurement
+  - rl
+  - real-world
 ---
+# ProcureRL: Procurement Negotiation RL Environment
+An OpenEnv-compliant RL environment where an LLM agent learns to negotiate procurement deals against scripted supplier opponents with language-sensitive behavior.
+## The Key Innovation: Language-Sensitive Opponent
+The opponent's concession rate is directly affected by the **quality of the agent's natural language**:
+- **Collaborative language** ("let's work together", "mutual benefit") → increases rapport → opponent concedes more
+- **Neutral language** → opponent concedes at baseline rate
+- **Aggressive language** ("final offer", "take it or leave it") → rapport drops → opponent hardens
+This makes an LLM genuinely necessary: the quality of its language directly affects negotiation outcomes.
+## Quick Start
+```python
+from server.Procure_RL_environment import ProcureRLEnvironment
+from models import NegotiationAction
+env = ProcureRLEnvironment()
+obs = env.reset(task_id="single_issue", seed=42)
+print(f"Supplier: {obs.supplier_message}")
+print(f"Offer: {obs.current_offer}")
+print(f"Your target: {obs.buyer_constraints}")
+action = NegotiationAction(
+    move_type="make_offer",
+    terms={"price": 42000},
+    message="Let's find a mutually beneficial solution."
+)
+obs = env.step(action)
+print(f"Response: {obs.supplier_message}")
+print(f"New offer: {obs.current_offer}")
+```
+## Web Interface Example
+The web interface at `/web` provides a visual playground. Here's how to use it:
+### Step 1: Reset the Environment
+Click **Reset** to start a new negotiation episode. You can customize the reset by passing JSON:
+```json
+{"task_id": "single_issue", "seed": 42}
+```
+**Available tasks:**
+- `single_issue` — Price-only negotiation (6 rounds max)
+- `multi_issue` — Price + payment terms (8 rounds max)
+- `adversarial` — Price + payment + support hours (10 rounds max)
+### Step 2: Make an Offer
+Fill in the form fields:
+| Field | Example Value | Notes |
+|-------|--------------|-------|
+| `move_type` | `make_offer` | Options: make_offer, accept, reject, bundle |
+| `terms` | `{"price": 42000}` | JSON object with negotiation terms |
+| `message` | `I value our partnership and believe we can find a fair solution.` | Your natural language message (affects opponent rapport!) |
+**Example: Making a collaborative offer**
+```
+move_type: make_offer
+terms: {"price": 45000}
+message: We appreciate your flexibility and would like to work together to find a solution that benefits both parties.
+```
+### Step 3: Read the Response
+After clicking **Step**, you'll see:
+- `supplier_message` — The opponent's natural language response
+- `current_offer` — Updated terms on the table
+- `rapport_hint` — "positive", "neutral", or "negative" based on your language
+- `round_number` — Current round (0-indexed)
+### Step 4: Continue or Accept
+- **Make another offer** to continue negotiating
+- **Use `accept`** when you're satisfied with the current terms
+- **Use `reject`** only if you want to walk away (no reward)
+**Example: Accepting current terms**
+```
+move_type: accept
+terms: {}
+message:
+```
+### Multi-Issue Negotiation (Task 2 & 3)
+For `multi_issue` and `adversarial`, include multiple terms:
+```json
+{
+  "move_type": "make_offer",
+  "terms": {
+    "price": 44000,
+    "payment_days": 30
+  },
+  "message": "We can offer faster payment terms if that helps your cash flow."
+}
+```
+**Key insight:** In `multi_issue`, the opponent cares more about payment timing than price. Offering Net-30 payment can get you a better price!
+### Example Full Episode
+**Round 0 (Reset):**
+- Task: `single_issue`
+- Supplier opens: ~$52,000
+- Your target: $36,000
+**Round 1:**
+- `move_type`: `make_offer`
+- `terms`: `{"price": 48000}`
+- `message`: `We value your partnership and want to find a fair price for both parties.`
+**Round 2:**
+- Supplier counter-offers at ~$46,000 (rapport is positive!)
+- `move_type`: `make_offer`
+- `terms`: `{"price": 45000}`
+- `message`: `I appreciate your movement. Let's see if we can get to $45,000.`
+**Round 3:**
+- Supplier accepts or counter-offers near your target
+- `move_type`: `accept`
+- `terms`: `{}`
+- Final score: based on how close the final price is to your target and how few rounds the deal took
+## The Three Tasks
+### 1. `single_issue` (Easy)
+Renew software license. Price only.
+- Buyer target: $36,000, Budget: $53,000
+- Seller opens: ~$52,000 (varies by seed)
+- Opponent persona: Cooperative
+- Max rounds: 6
+**Scoring:** Deal quality (how close to target) × Efficiency (how few rounds)
+### 2. `multi_issue` (Medium)
+Enterprise software deal. Price + payment terms.
+- Buyer weights: price 70%, payment 30%
+- Seller persona: Cash Flow Stressed (cares more about payment timing)
+- **Trade opportunity**: offer Net-30 payment to get lower price
+- Max rounds: 8
+**Scoring:** Weighted combination of price improvement + payment terms
+### 3. `adversarial` (Hard)
+Large contract negotiation. Price + payment + support hours.
+- Opponent persona: Aggressive Anchor
+  - Opens at ceiling on all issues
+  - Hardens position if you make 2+ consecutive concessions
+  - Requires consistent collaborative framing
+- Survival floor: any deal scores at least 0.15
+- Max rounds: 10
+**Scoring:** Multi-dimensional value minus pattern penalty for consecutive concessions
+## Action Space
+```python
+NegotiationAction(
+    move_type="make_offer",  # make_offer | accept | reject | bundle
+    terms={"price": 44000, "payment_days": 45, "support_hours": 120},
+    message="We appreciate your flexibility on this."
+)
+```
+| move_type | Description |
+|-----------|-------------|
+| `make_offer` | Propose terms (price required, others optional) |
+| `accept` | Accept current offer on table |
+| `reject` | Walk away (only use at final round) |
+| `bundle` | Alias for make_offer with multi-issue terms |
+## Observation Space
+```python
+NegotiationObservation(
+    task_id="single_issue",
+    round_number=2,
+    max_rounds=6,
+    supplier_message="I appreciate your offer. Based on our costs...",
+    current_offer={"price": 46000},
+    last_4_exchanges=[...],
+    buyer_constraints={"price": {"target": 36000, "worst": 55000, "budget": 53000}},
+    rapport_hint="positive",  # positive | neutral | negative
+    done=False
+)
+```
+## Running the Server
+```bash
+# Build Docker image
+docker build -t procure-rl .
+# Run container (port 7860 - required for HF Spaces)
+docker run -p 7860:7860 procure-rl
+# Access web interface at http://localhost:7860/web
+```
+## API Endpoints
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/health` | GET | Health check |
+| `/metadata` | GET | Environment metadata |
+| `/reset` | POST | Reset environment |
+| `/step` | POST | Execute action |
+| `/state` | GET | Get current state |
+| `/ws` | WS | WebSocket for persistent sessions |
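A minimal sketch of calling `/reset` over HTTP with only the standard library. The JSON body shape (`task_id` and `seed` at the top level) is an assumption based on the reset JSON shown for the web interface; the server may expect a different envelope. The helper `build_reset_request` is hypothetical and only builds the request, so it can be inspected without a running container:

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # port used in the Docker instructions


def build_reset_request(task_id="single_issue", seed=42, base_url=BASE_URL):
    """Build (but do not send) a POST /reset request with a JSON body."""
    # Assumed payload shape; adjust if the server expects a different envelope.
    body = json.dumps({"task_id": task_id, "seed": seed}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/reset",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_reset_request()
print(req.get_method(), req.full_url)
# With the container running, send it with:
#   with urllib.request.urlopen(req) as resp:
#       print(resp.status, json.loads(resp.read()))
```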
+## Baseline Inference
+Run inference against all three tasks:
+```bash
+cp .env.example .env
+# Edit .env and add your HF_TOKEN
+HF_TOKEN=your_token python inference.py
+```
+Output format (exact):
+```
+[START] task=single_issue env=procure-rl model=Qwen/Qwen2.5-72B-Instruct
+[STEP] step=1 action=make_offer({"price": 42000}) reward=0.00 done=false error=null
+[STEP] step=2 action=make_offer({"price": 41000}) reward=0.52 done=true error=null
+[END] success=true steps=2 score=0.52 rewards=0.00,0.52
+```
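Because the log format must match exactly, it can help to keep the formatting in small helpers. The sketch below mirrors the f-strings used in `inference.py`; the function names `format_start`, `format_step`, and `format_end` are illustrative and do not exist in the repository:

```python
import json


def format_start(task, env, model):
    return f"[START] task={task} env={env} model={model}"


def format_step(step, move_type, terms, reward, done, error=None):
    # The action is rendered as move_type(<JSON terms>), as in the sample output.
    action = f"{move_type}({json.dumps(terms)})"
    return (
        f"[STEP] step={step} action={action} reward={reward:.2f} "
        f"done={str(done).lower()} error={error if error else 'null'}"
    )


def format_end(success, steps, score, rewards):
    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
    return (
        f"[END] success={str(success).lower()} steps={steps} "
        f"score={score:.2f} rewards={rewards_str}"
    )


print(format_start("single_issue", "procure-rl", "Qwen/Qwen2.5-72B-Instruct"))
print(format_step(1, "make_offer", {"price": 42000}, 0.0, False))
print(format_step(2, "make_offer", {"price": 41000}, 0.52, True))
print(format_end(True, 2, 0.52, [0.0, 0.52]))
```

Run as-is, this prints the same four lines as the sample output.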
+## Environment Design
+### Rapport System
+The opponent maintains a rapport score (0.0 to 1.0) updated per-round:
+```python
+COLLABORATIVE_SIGNALS = ["understand", "partnership", "mutual", "together", ...]
+AGGRESSIVE_SIGNALS = ["demand", "require", "final offer", "unacceptable", ...]
+# +0.08 per collaborative signal detected, -0.08 per aggressive signal detected
+delta = 0.08 * (collaborative_hits - aggressive_hits)
+delta = max(-0.20, min(0.20, delta))  # cap per round
+```
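The rule above can be sketched as a runnable function. This is an illustration under stated assumptions, not the code in `opponent.py`: the signal lists are truncated to the keywords shown, matching is plain case-insensitive substring search, and the helper names `rapport_delta` and `update_rapport` are hypothetical. The starting rapport of 0.5 matches the default in `NegotiationState`.

```python
COLLABORATIVE_SIGNALS = ["understand", "partnership", "mutual", "together"]  # subset
AGGRESSIVE_SIGNALS = ["demand", "require", "final offer", "unacceptable"]    # subset


def rapport_delta(message):
    """Per-round rapport change: +/-0.08 per detected signal, capped at +/-0.20."""
    text = message.lower()
    hits = sum(s in text for s in COLLABORATIVE_SIGNALS)
    misses = sum(s in text for s in AGGRESSIVE_SIGNALS)
    delta = 0.08 * hits - 0.08 * misses
    return max(-0.20, min(0.20, delta))


def update_rapport(rapport, message):
    """Apply the capped delta and keep the score inside [0.0, 1.0]."""
    return max(0.0, min(1.0, rapport + rapport_delta(message)))


print(update_rapport(0.5, "We value this partnership and want mutual benefit."))
```

Substring matching is deliberately naive here: "require" would also match "requirement", so a real implementation may prefer word-boundary matching.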
+### Opponent Personas
+| Persona | Base Concession | Rapport Modifier | Special Behavior |
+|---------|----------------|-------------------|-------------------|
+| `cooperative` | 5% | ±50% | Responsive to language |
+| `cash_flow_stressed` | 7% | ±50% | Accepts Net-45+, comments on payment |
+| `aggressive_anchor` | 4% | ±50% | Hardens after 2+ consecutive concessions |
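To make the ±50% rapport modifier concrete, the sketch below scales each persona's base concession rate linearly with rapport. The linear mapping and the helper name `concession_rate` are assumptions made for illustration; the actual curve in `opponent.py` may differ:

```python
BASE_CONCESSION = {
    "cooperative": 0.05,
    "cash_flow_stressed": 0.07,
    "aggressive_anchor": 0.04,
}


def concession_rate(persona, rapport):
    """Scale the persona's base rate by up to +/-50% as rapport moves from 0.0 to 1.0."""
    rapport = max(0.0, min(1.0, rapport))
    modifier = 1.0 + (rapport - 0.5)  # 0.5x at rapport 0.0, 1.5x at rapport 1.0
    return BASE_CONCESSION[persona] * modifier


for rapport in (0.0, 0.5, 1.0):
    print(rapport, round(concession_rate("cooperative", rapport), 4))
```

Under this assumption a cooperative supplier concedes 2.5% per round at zero rapport and 7.5% at full rapport.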
+### Grading
+Graders are pure Python — zero LLM calls. They combine:
+- **Value**: how close to buyer's target
+- **Efficiency**: penalty for taking too many rounds
+- **Pattern penalty** (adversarial only): for consecutive concession behavior
+Graders never crash on malformed input — they fall back to worst-case values.
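A worked example of the value × efficiency calculation, following the formula in `grade_single_issue` for a deal that was reached. The constants (opening price 52000, buyer target 38000, 6 rounds) are the defaults in `graders.py`; the function name `single_issue_score` is illustrative:

```python
def single_issue_score(final_price, rounds_taken, opening=52000.0,
                       buyer_target=38000.0, max_rounds=6):
    """Value x efficiency, following grade_single_issue for a reached deal."""
    if final_price >= opening:
        return 0.05  # no improvement over the opening price
    value = (opening - final_price) / (opening - buyer_target)
    value = max(0.0, min(1.0, value))
    efficiency = max(0.1, 1.0 - (rounds_taken / max_rounds) ** 1.5 * 0.4)
    return round(value * efficiency, 4)


# $45,000 after 3 of 6 rounds: value = 7000 / 14000 = 0.5,
# efficiency = 1 - 0.5**1.5 * 0.4 ≈ 0.8586
print(single_issue_score(45000, rounds_taken=3))  # 0.4293
```

Closing the same deal in one round instead of three would raise the efficiency term and therefore the score.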
+## Project Structure
+```
+Procure_RL/
+├── __init__.py                    # Package exports
+├── client.py                      # EnvClient wrapper
+├── models.py                      # NegotiationAction, NegotiationObservation, NegotiationState
+├── opponent.py                    # ScriptedPersonaOpponent with 3 personas + rapport
+├── graders.py                     # grade_single_issue, grade_multi_issue, grade_adversarial
+├── inference.py                   # Baseline agent with [START][STEP][END] output
+├── server/
+│   ├── __init__.py
+│   ├── app.py                    # FastAPI app
+│   ├── Procure_RL_environment.py # ProcureRLEnvironment
+│   └── requirements.txt
+├── Dockerfile                    # Container build (port 7860)
+├── openenv.yaml                  # OpenEnv manifest
+├── pyproject.toml
+├── plan.md                       # Full design specification
+└── README.md                     # This file
+```
+## Why This Environment?
+**Market validation**: Walmart deployed Pactum for AI negotiation. 90% of CPOs adopting AI negotiation in 2025.
+**Research gap**: Zero negotiation environments in OpenEnv hub.
+**LLM advantage**: Language quality directly affects opponent rapport — the language IS the policy.
+**Reproducibility**: Deterministic scripted opponent, pure Python graders, no LLM in environment loop.
+## Calibration
+If the base LLM scores above 0.55 on `single_issue` → the opponent is too easy; reduce the cooperative concession rate.
+If the base LLM scores below 0.15 on `single_issue` → the opponent is too hard; increase the cooperative concession rate.
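The calibration rule can be expressed as a small check. This is a sketch: the thresholds come from the two rules above, while the helper name `calibration_advice` and the "no change" result for scores in between are assumptions:

```python
def calibration_advice(base_llm_score, too_easy=0.55, too_hard=0.15):
    """Map a base-LLM single_issue score to the tuning action described above."""
    if base_llm_score > too_easy:
        return "reduce cooperative concession rate"
    if base_llm_score < too_hard:
        return "increase cooperative concession rate"
    return "no change"


print(calibration_advice(0.60))
print(calibration_advice(0.40))
```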

__init__.py ADDED

	@@ -0,0 +1,17 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+"""ProcureRL Environment."""
+from .client import ProcureRLEnv
+from .models import NegotiationAction, NegotiationObservation, NegotiationState
+__all__ = [
+    "NegotiationAction",
+    "NegotiationObservation",
+    "NegotiationState",
+    "ProcureRLEnv",
+]

client.py ADDED

	@@ -0,0 +1,76 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+"""ProcureRL Environment Client."""
+from typing import Dict, Any
+from openenv.core import EnvClient
+from openenv.core.client_types import StepResult
+from .models import NegotiationAction, NegotiationObservation, NegotiationState
+class ProcureRLEnv(
+    EnvClient[NegotiationAction, NegotiationObservation, NegotiationState]
+):
+    """
+    Client for the ProcureRL Environment.
+    This client maintains a persistent WebSocket connection to the environment server,
+    enabling efficient multi-step interactions with lower latency.
+    Each client instance has its own dedicated environment session on the server.
+    Example:
+        >>> with ProcureRLEnv(base_url="http://localhost:8000") as client:
+        ...     result = client.reset(task_id="single_issue")
+        ...     print(result.observation.supplier_message)
+        ...
+        ...     action = NegotiationAction(move_type="make_offer", terms={"price": 42000}, message="Let's discuss")
+        ...     result = client.step(action)
+        ...     print(result.observation.supplier_message)
+    """
+    def _step_payload(self, action: NegotiationAction) -> Dict[str, Any]:
+        return {
+            "move_type": action.move_type,
+            "terms": action.terms,
+            "message": action.message,
+        }
+    def _parse_result(
+        self, payload: Dict[str, Any]
+    ) -> StepResult[NegotiationObservation]:
+        obs_data = payload.get("observation", {})
+        observation = NegotiationObservation(
+            task_id=obs_data.get("task_id", ""),
+            round_number=obs_data.get("round_number", 0),
+            max_rounds=obs_data.get("max_rounds", 0),
+            supplier_message=obs_data.get("supplier_message", ""),
+            current_offer=obs_data.get("current_offer", {}),
+            last_4_exchanges=obs_data.get("last_4_exchanges", []),
+            buyer_constraints=obs_data.get("buyer_constraints", {}),
+            rapport_hint=obs_data.get("rapport_hint", "neutral"),
+            done=obs_data.get("done", False),
+        )
+        return StepResult(
+            observation=observation,
+            reward=payload.get("reward", 0.0),
+            done=payload.get("done", False),
+        )
+    def _parse_state(self, payload: Dict[str, Any]) -> NegotiationState:
+        return NegotiationState(
+            task_id=payload.get("task_id", ""),
+            episode_id=payload.get("episode_id", ""),
+            round_number=payload.get("round_number", 0),
+            rapport_score=payload.get("rapport_score", 0.5),
+            consecutive_concessions=payload.get("consecutive_concessions", 0),
+            deal_reached=payload.get("deal_reached", False),
+            final_terms=payload.get("final_terms"),
+            cumulative_reward=payload.get("cumulative_reward", 0.0),
+        )

graders.py ADDED

	@@ -0,0 +1,161 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+"""
+Grading functions for procurement negotiation tasks.
+Pure Python — zero LLM calls. Graders must never crash on malformed input.
+Scoring is based on how much the agent improved from the opponent's opening price,
+not on absolute thresholds. This makes the environment learnable — the agent learns
+to negotiate better deals relative to where negotiations started.
+"""
+from typing import Dict, Optional
+def grade_single_issue(
+    final_terms: Dict,
+    deal_reached: bool,
+    rounds_taken: int,
+    max_rounds: int = 6,
+    opponent_opening: float = 52000.0,
+) -> float:
+    if not deal_reached:
+        return 0.0
+    if final_terms is None:
+        return 0.0
+    final_price = final_terms.get("price", opponent_opening)
+    BUYER_TARGET = 38000.0
+    if final_price >= opponent_opening:
+        return 0.05
+    value = (opponent_opening - final_price) / (opponent_opening - BUYER_TARGET)
+    value = max(0.0, min(1.0, value))
+    efficiency = 1.0 - (rounds_taken / max_rounds) ** 1.5 * 0.4
+    efficiency = max(0.1, efficiency)
+    return round(value * efficiency, 4)
+def grade_multi_issue(
+    final_terms: Dict,
+    deal_reached: bool,
+    rounds_taken: int,
+    max_rounds: int = 8,
+    opponent_opening: float = 52000.0,
+) -> float:
+    if not deal_reached:
+        return 0.0
+    if final_terms is None:
+        return 0.0
+    final_price = final_terms.get("price", opponent_opening)
+    payment_days = final_terms.get("payment_days", 90)
+    BUYER_PRICE_TARGET = 40000.0
+    PAYMENT_TARGET = 30
+    PAYMENT_WORST = 90
+    if final_price >= opponent_opening and payment_days >= 90:
+        return 0.05
+    price_value = (opponent_opening - final_price) / (
+        opponent_opening - BUYER_PRICE_TARGET
+    )
+    price_value = max(0.0, min(1.0, price_value))
+    payment_score = (PAYMENT_WORST - payment_days) / (PAYMENT_WORST - PAYMENT_TARGET)
+    payment_score = max(0.0, min(1.0, payment_score))
+    value = 0.70 * price_value + 0.30 * payment_score
+    if final_price >= opponent_opening:
+        value = 0.30 * payment_score
+    efficiency = 1.0 - (rounds_taken / max_rounds) * 0.30
+    efficiency = max(0.1, efficiency)
+    return round(value * efficiency, 4)
+def grade_adversarial(
+    final_terms: Dict,
+    deal_reached: bool,
+    rounds_taken: int,
+    consecutive_concessions_flag: bool,
+    max_rounds: int = 10,
+    opponent_opening: float = 100000.0,
+) -> float:
+    if not deal_reached:
+        return 0.0
+    if final_terms is None:
+        return 0.0
+    SURVIVAL_FLOOR = 0.15
+    final_price = final_terms.get("price", opponent_opening)
+    payment_days = final_terms.get("payment_days", 90)
+    support_hours = final_terms.get("support_hours", 80)
+    BUYER_PRICE_TARGET = 80000.0
+    price_value = (opponent_opening - final_price) / (
+        opponent_opening - BUYER_PRICE_TARGET
+    )
+    price_value = max(0.0, min(1.0, price_value))
+    payment_score = (90 - payment_days) / (90 - 30)
+    payment_score = max(0.0, min(1.0, payment_score))
+    support_score = (support_hours - 80) / (200 - 80)
+    support_score = max(0.0, min(1.0, support_score))
+    value = 0.40 * price_value + 0.35 * payment_score + 0.25 * support_score
+    if final_price >= opponent_opening:
+        value = 0.25 * (0.35 * payment_score + 0.25 * support_score)
+    efficiency = 1.0 - (rounds_taken / max_rounds) * 0.25
+    efficiency = max(0.1, efficiency)
+    pattern_penalty = 0.10 if consecutive_concessions_flag else 0.0
+    raw = (value * efficiency) - pattern_penalty
+    return round(max(SURVIVAL_FLOOR, raw), 4)
+def grade(
+    task_id: str,
+    final_terms: Dict,
+    deal_reached: bool,
+    rounds_taken: int,
+    opponent_opening: Optional[float] = None,
+    **kwargs,
+) -> float:
+    # Forward opponent_opening only when the caller supplies it, so each grader
+    # keeps its own default (52000 for single/multi-issue, 100000 for adversarial).
+    opening = {} if opponent_opening is None else {"opponent_opening": opponent_opening}
+    if task_id == "single_issue":
+        return grade_single_issue(final_terms, deal_reached, rounds_taken, **opening)
+    elif task_id == "multi_issue":
+        return grade_multi_issue(final_terms, deal_reached, rounds_taken, **opening)
+    elif task_id == "adversarial":
+        return grade_adversarial(
+            final_terms,
+            deal_reached,
+            rounds_taken,
+            kwargs.get("consecutive_concessions_flag", False),
+            **opening,
+        )
+    return 0.0

inference.py ADDED

	@@ -0,0 +1,197 @@

+#!/usr/bin/env python3
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+"""
+Baseline inference script for ProcureRL.
+Runs an LLM agent against the procurement negotiation environment
+and outputs results in exact [START][STEP][END] format.
+"""
+import os
+import sys
+import json
+API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
+API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
+MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
+BENCHMARK = "procure-rl"
+MAX_STEPS = 10
+try:
+    from openai import OpenAI
+    client = OpenAI(api_key=API_KEY, base_url=API_BASE_URL)
+except Exception as e:
+    print(f"[ERROR] Failed to initialize OpenAI client: {e}")
+    sys.exit(1)
+sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
+from server.Procure_RL_environment import ProcureRLEnvironment
+from models import NegotiationAction
+TASKS = ["single_issue", "multi_issue", "adversarial"]
+SYSTEM_PROMPT = """You are a professional procurement negotiator. Your goal is to negotiate the best possible deal for your company.
+You will receive a supplier's message and current offer terms. You must respond with a JSON action in this exact format:
+{
+  "move_type": "make_offer",
+  "terms": {"price": 42000, "payment_days": 45},
+  "message": "Your natural language response to the supplier"
+}
+move_type must be one of: make_offer, accept, reject, bundle
+terms must include price and any other issues being negotiated.
+message should be professional and collaborative when possible.
+Your buyer constraints will be provided. Do not exceed your budget. Try to reach the target price."""
+def get_agent_action(obs_dict: dict) -> dict:
+    task_id = obs_dict.get("task_id", "single_issue")
+    supplier_msg = obs_dict.get("supplier_message", "")
+    current_offer = obs_dict.get("current_offer", {})
+    constraints = obs_dict.get("buyer_constraints", {})
+    rapport_hint = obs_dict.get("rapport_hint", "neutral")
+    round_num = obs_dict.get("round_number", 0)
+    max_rounds = obs_dict.get("max_rounds", 10)
+    user_content = f"""Task: {task_id}
+Round: {round_num}/{max_rounds}
+Supplier says: "{supplier_msg}"
+Current offer on table: {json.dumps(current_offer)}
+Your constraints: {json.dumps(constraints)}
+Relationship rapport: {rapport_hint}
+Respond with your negotiation action as JSON."""
+    try:
+        response = client.chat.completions.create(
+            model=MODEL_NAME,
+            messages=[
+                {"role": "system", "content": SYSTEM_PROMPT},
+                {"role": "user", "content": user_content},
+            ],
+            max_tokens=300,
+            temperature=0.3,
+        )
+        content = response.choices[0].message.content.strip()
+    except Exception as e:
+        return {
+            "move_type": "make_offer",
+            "terms": current_offer,
+            "message": f"Error: {str(e)}",
+        }
+    try:
+        start = content.find("{")
+        end = content.rfind("}") + 1
+        if start >= 0 and end > start:
+            action_dict = json.loads(content[start:end])
+        else:
+            action_dict = {
+                "move_type": "make_offer",
+                "terms": current_offer,
+                "message": content[:200]
+                if content
+                else "I'd like to continue our discussion.",
+            }
+    except json.JSONDecodeError:
+        action_dict = {
+            "move_type": "make_offer",
+            "terms": current_offer,
+            "message": "I'd like to continue our discussion.",
+        }
+    return action_dict
+def obs_to_dict(obs) -> dict:
+    return {
+        "task_id": obs.task_id,
+        "round_number": obs.round_number,
+        "max_rounds": obs.max_rounds,
+        "supplier_message": obs.supplier_message,
+        "current_offer": obs.current_offer,
+        "buyer_constraints": obs.buyer_constraints,
+        "rapport_hint": obs.rapport_hint,
+        "done": obs.done,
+    }
+def run_task(task_id: str) -> dict:
+    env = ProcureRLEnvironment()
+    obs = env.reset(task_id=task_id, seed=42)
+    obs_dict = obs_to_dict(obs)
+    print(f"[START] task={task_id} env={BENCHMARK} model={MODEL_NAME}")
+    rewards = []
+    step = 0
+    done = False
+    final_score = 0.0
+    while not done and step < MAX_STEPS:
+        step += 1
+        action_dict = get_agent_action(obs_dict)
+        action = NegotiationAction(
+            move_type=action_dict.get("move_type", "make_offer"),
+            terms=action_dict.get("terms", {}),
+            message=action_dict.get("message", ""),
+        )
+        obs = env.step(action)
+        rewards.append(obs.reward if obs.reward is not None else 0.0)
+        action_str = f"{action.move_type}({json.dumps(action.terms)})"
+        error = obs.metadata.get("error", None) if obs.metadata else None
+        print(
+            f"[STEP] step={step} action={action_str} reward={obs.reward if obs.reward else 0.0:.2f} done={str(obs.done).lower()} error={error if error else 'null'}"
+        )
+        if obs.done:
+            final_score = (
+                obs.reward
+                if obs.reward is not None and obs.reward > 0
+                else (max(rewards) if rewards else 0.0)
+            )
+            break
+        obs_dict = obs_to_dict(obs)
+    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
+    success = final_score > 0.1
+    print(
+        f"[END] success={str(success).lower()} steps={step} score={final_score:.2f} rewards={rewards_str}"
+    )
+    return {"task": task_id, "score": final_score, "steps": step}
+if __name__ == "__main__":
+    if not API_KEY:
+        print("[ERROR] HF_TOKEN or API_KEY environment variable not set")
+        sys.exit(1)
+    results = []
+    for task in TASKS:
+        try:
+            result = run_task(task)
+            results.append(result)
+        except Exception as e:
+            print(f"[ERROR] Task {task} failed: {e}")
+            results.append({"task": task, "score": 0.0, "steps": 0, "error": str(e)})
+    print("\nBaseline Results:")
+    for r in results:
+        task = r["task"]
+        score = r["score"]
+        print(f"  {task}: {score:.3f}")

models.py ADDED

	@@ -0,0 +1,80 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+"""
+Data models for the ProcureRL Environment.
+The ProcureRL environment is a procurement negotiation RL environment where
+an LLM agent learns to negotiate against scripted supplier opponents.
+"""
+from typing import Optional, List, Dict, Any
+from pydantic import BaseModel, Field, ConfigDict
+try:
+    from openenv.core.env_server.types import Action, Observation, State as OpenEnvState
+except ImportError:
+    OpenEnvState = object
+class NegotiationAction(BaseModel):
+    model_config = ConfigDict(extra="allow")
+    move_type: str = Field(
+        default="make_offer",
+        description="Choose action: make_offer (propose), accept (take current deal), reject (walk away), bundle (multi-issue offer)",
+    )
+    terms: Dict[str, Any] = Field(
+        default_factory=lambda: {"price": 45000},
+        description='For single_issue: {"price": 45000}. For multi_issue: {"price": 45000, "payment_days": 30}. For adversarial: add "support_hours": 100',
+    )
+    message: str = Field(
+        default="I value our partnership and believe we can reach a fair agreement together.",
+        description="Write a collaborative message. Use: partnership, mutual, flexible, understand, solution. Avoid: demand, final offer, ultimatum",
+    )
+    def model_post_init(self, *args, **kwargs):
+        valid_moves = ("make_offer", "accept", "reject", "bundle")
+        if self.move_type not in valid_moves:
+            raise ValueError(
+                f"Invalid move_type: {self.move_type}. Must be one of {valid_moves}"
+            )
+class NegotiationObservation(BaseModel):
+    model_config = ConfigDict(extra="allow")
+    task_id: str = ""
+    round_number: int = 0
+    max_rounds: int = 0
+    supplier_message: str = ""
+    current_offer: Dict[str, Any] = Field(default_factory=dict)
+    last_4_exchanges: List[Dict] = Field(default_factory=list)
+    buyer_constraints: Dict[str, Any] = Field(default_factory=dict)
+    rapport_hint: str = "neutral"
+    done: bool = False
+    reward: Optional[float] = None
+    metadata: Dict[str, Any] = Field(default_factory=dict)
+class NegotiationState(BaseModel):
+    model_config = ConfigDict(extra="allow", validate_assignment=True)
+    task_id: str = ""
+    episode_id: str = ""
+    round_number: int = 0
+    step_count: int = 0  # Required by OpenEnv web interface
+    rapport_score: float = 0.5
+    consecutive_concessions: int = 0
+    deal_reached: bool = False
+    final_terms: Optional[Dict] = None
+    cumulative_reward: float = 0.0
+    def __getitem__(self, key):
+        return getattr(self, key)
+    def get(self, key, default=None):
+        return getattr(self, key, default)
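`NegotiationState` offers dict-style access by forwarding `__getitem__` and `get` to `getattr`. The sketch below uses a plain-Python stand-in (`StateStandIn` and `lookup_error_name` are hypothetical names, and Pydantic is deliberately left out) to show one consequence: a missing key through `state[...]` raises `AttributeError` rather than `KeyError`, so a caller that only catches `KeyError` would not handle it.

```python
class StateStandIn:
    """Hypothetical stand-in mirroring the two accessor methods of NegotiationState."""

    def __init__(self):
        self.round_number = 0
        self.rapport_score = 0.5

    def __getitem__(self, key):
        # Same forwarding as the model: attribute lookup, not dict lookup.
        return getattr(self, key)

    def get(self, key, default=None):
        return getattr(self, key, default)


def lookup_error_name(state, key):
    """Return the name of the exception raised by state[key], or None on success."""
    try:
        state[key]
    except Exception as exc:
        return type(exc).__name__
    return None


state = StateStandIn()
print(state["rapport_score"])
print(state.get("missing", "fallback"))
print(lookup_error_name(state, "missing"))
```

Using `state.get(key, default)` avoids the exception entirely when a field may be absent.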

openenv.yaml ADDED

	@@ -0,0 +1,38 @@

+spec_version: 1
+name: procure-rl
+version: "1.0.0"
+type: space
+runtime: fastapi
+app: server.app:app
+port: 7860
+description: "LLM agent learns procurement negotiation strategy against scripted supplier opponents with hidden utility functions"
+author: "procure-rl"
+tags:
+  - openenv
+  - negotiation
+  - procurement
+  - real-world
+  - rl
+tasks:
+  - id: single_issue
+    description: "Negotiate software license price with cooperative supplier"
+    difficulty: easy
+    max_steps: 6
+    reward_range: [0.0, 1.0]
+  - id: multi_issue
+    description: "Negotiate price and payment terms with cash-flow-sensitive supplier"
+    difficulty: medium
+    max_steps: 8
+    reward_range: [0.0, 1.0]
+  - id: adversarial
+    description: "Negotiate multiple issues against aggressive anchoring supplier"
+    difficulty: hard
+    max_steps: 10
+    reward_range: [0.0, 1.0]
+reward_range: [0.0, 1.0]
+observation_space:
+  type: object
+  description: "Natural language supplier message with structured negotiation state and rapport signal"
+action_space:
+  type: object
+  description: "Negotiation move type, structured terms, and natural language message"

openenv_Procure_RL.egg-info/PKG-INFO ADDED

	@@ -0,0 +1,9 @@

+Metadata-Version: 2.4
+Name: openenv-Procure_RL
+Version: 0.1.0
+Summary: Procure Rl environment for OpenEnv
+Requires-Python: >=3.10
+Requires-Dist: openenv-core[core]>=0.2.2
+Provides-Extra: dev
+Requires-Dist: pytest>=8.0.0; extra == "dev"
+Requires-Dist: pytest-cov>=4.0.0; extra == "dev"

openenv_Procure_RL.egg-info/SOURCES.txt ADDED

	@@ -0,0 +1,14 @@

+README.md
+pyproject.toml
+./__init__.py
+./client.py
+./models.py
+openenv_Procure_RL.egg-info/PKG-INFO
+openenv_Procure_RL.egg-info/SOURCES.txt
+openenv_Procure_RL.egg-info/dependency_links.txt
+openenv_Procure_RL.egg-info/entry_points.txt
+openenv_Procure_RL.egg-info/requires.txt
+openenv_Procure_RL.egg-info/top_level.txt
+server/Procure_RL_environment.py
+server/__init__.py
+server/app.py

openenv_Procure_RL.egg-info/dependency_links.txt ADDED

	@@ -0,0 +1 @@


+

openenv_Procure_RL.egg-info/entry_points.txt ADDED

	@@ -0,0 +1,2 @@


+[console_scripts]
+server = Procure_RL.server.app:main

openenv_Procure_RL.egg-info/requires.txt ADDED

	@@ -0,0 +1,5 @@

+openenv-core[core]>=0.2.2
+[dev]
+pytest>=8.0.0
+pytest-cov>=4.0.0

openenv_Procure_RL.egg-info/top_level.txt ADDED

	@@ -0,0 +1 @@


+Procure_RL

opponent.py ADDED

	@@ -0,0 +1,213 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+"""
+Scripted persona opponent for procurement negotiation.
+The opponent's behavior is deterministic given a seed AND sensitive to
+the agent's language quality via the rapport system.
+"""
+import random
+from dataclasses import dataclass, field
+from typing import Dict, Tuple
+COLLABORATIVE_SIGNALS = [
+    "understand",
+    "partnership",
+    "mutual",
+    "together",
+    "value",
+    "appreciate",
+    "flexible",
+    "work with",
+    "long-term",
+    "relationship",
+    "reasonable",
+    "fair",
+    "both",
+    "solution",
+]
+AGGRESSIVE_SIGNALS = [
+    "demand",
+    "require",
+    "final offer",
+    "unacceptable",
+    "must",
+    "non-negotiable",
+    "take it or leave",
+    "bottom line",
+    "ultimatum",
+    "insist",
+    "refuse",
+    "absolutely not",
+]
+PERSONA_TEMPLATES = {
+    "cooperative": {
+        "opening": [
+            "Thanks for reaching out. Our standard pricing for this package is ${target}. Happy to discuss.",
+            "We value your interest. We're pricing this at ${target} based on current market rates.",
+        ],
+        "counter": [
+            "I appreciate you working with us. Based on our costs, ${counter} is where we can be.",
+            "Thank you for your offer. We can move to ${counter} given our margin requirements.",
+        ],
+        "near_close": [
+            "I think we're close. If you can do ${close}, I can get this approved today.",
+            "We're almost there. ${close} works for our team. Shall we finalize?",
+        ],
+        "accept": "That works for us. Let's move forward at those terms.",
+        "reject": "That's below what we can accept, but we want to make this work.",
+    },
+    "cash_flow_stressed": {
+        "opening": [
+            "Our pricing is ${target}. I should mention — payment timing is particularly important to us this quarter.",
+            "We're at ${target}. Between us, our finance team has specific requirements around cash flow timing.",
+        ],
+        "counter": [
+            "We can move on price if payment terms work for you. ${counter} with your payment preference?",
+            "Price flexibility depends on receivables timing for us. ${counter} if we can discuss payment terms.",
+        ],
+        "near_close": [
+            "If you can do Net-30 on payment, we can get to ${close} on price.",
+            "Payment timing is our real constraint. ${close} with faster payment terms?",
+        ],
+        "accept": "Agreed. The payment structure works for our cash flow needs.",
+        "reject": "The price is tight but we could explore it if payment terms align.",
+    },
+    "aggressive_anchor": {
+        "opening": [
+            "Our price is ${target}. This reflects our full service quality and market position.",
+            "We're firm at ${target}. This is based on our cost structure and service level.",
+        ],
+        "counter": [
+            "We can go to ${counter}. That's already a significant concession from our position.",
+            "${counter} is our revised position. We're not in a position to move much further.",
+        ],
+        "hardening": [
+            "We've already moved considerably. ${floor} is our absolute position.",
+            "I need to be direct — we're at ${floor} and that's where we'll stay.",
+        ],
+        "near_close": [
+            "Final position: ${close}. We need a decision today.",
+            "${close} is where we are. This is our best and final offer.",
+        ],
+        "accept": "Accepted.",
+        "reject": "That doesn't work. Come back with a serious offer.",
+    },
+}
+class ScriptedPersonaOpponent:
+    def __init__(self, task_id: str, seed: int, persona: str):
+        self.rng = random.Random(seed)
+        self.task_id = task_id
+        self.persona = persona
+        self.templates = PERSONA_TEMPLATES[persona]
+        if task_id == "single_issue":
+            self.price_floor = self.rng.uniform(42000, 46000)
+            self.price_target = self.price_floor * self.rng.uniform(1.28, 1.38)
+        elif task_id == "multi_issue":
+            self.price_floor = self.rng.uniform(40000, 46000)
+            self.price_target = self.price_floor * self.rng.uniform(1.25, 1.35)
+            self.payment_preference = self.rng.choice([30, 45, 60])
+        elif task_id == "adversarial":
+            self.price_floor = self.rng.uniform(85000, 95000)
+            self.price_target = self.price_floor * self.rng.uniform(1.30, 1.40)
+        self.rapport = 0.5
+        self.concession_count = 0
+        self.current_position = self.price_target
+    def update_rapport(self, agent_message: str) -> None:
+        msg_lower = agent_message.lower()
+        delta = 0.0
+        delta += sum(0.08 for w in COLLABORATIVE_SIGNALS if w in msg_lower)
+        delta -= sum(0.08 for w in AGGRESSIVE_SIGNALS if w in msg_lower)
+        delta = max(-0.20, min(0.20, delta))
+        self.rapport = max(0.0, min(1.0, self.rapport + delta))
+    def get_concession_rate(self) -> float:
+        base_rates = {
+            "cooperative": 0.05,
+            "cash_flow_stressed": 0.07,
+            "aggressive_anchor": 0.04,
+        }
+        base = base_rates[self.persona]
+        modifier = (self.rapport - 0.5) * base
+        return max(0.01, base + modifier)
+    def respond(
+        self,
+        agent_message: str,
+        agent_terms: Dict,
+        round_number: int,
+        consecutive_concessions: int,
+    ) -> Tuple[str, Dict]:
+        self.update_rapport(agent_message)
+        self.concession_count += 1
+        agent_price = agent_terms.get("price", 0)
+        if (
+            round_number >= 2
+            and agent_price >= self.price_floor
+            and self._acceptance_condition(agent_terms)
+        ):
+            return self.templates["accept"], {**agent_terms, "_accepted": True}
+        concession = self.get_concession_rate()
+        if self.persona == "aggressive_anchor" and consecutive_concessions >= 2:
+            concession = concession * 0.4
+            template_key = "hardening"
+        elif round_number >= self._max_rounds() * 0.7:
+            template_key = "near_close"
+        else:
+            template_key = "counter"
+        new_position = self.current_position * (1 - concession)
+        new_position = max(self.price_floor, new_position)
+        self.current_position = new_position
+        templates_for_key = self.templates.get(template_key, self.templates["counter"])
+        template = self.rng.choice(templates_for_key)
+        message = template.replace("${counter}", f"${new_position:,.0f}")
+        message = message.replace("${floor}", f"${self.price_floor:,.0f}")
+        message = message.replace("${close}", f"${new_position:,.0f}")
+        counter_terms = dict(agent_terms)
+        counter_terms["price"] = round(new_position, 2)
+        if self.persona == "cash_flow_stressed" and "payment_days" in agent_terms:
+            if agent_terms["payment_days"] > 60:
+                message += (
+                    " Though I'll need to flag the payment timing to our finance team."
+                )
+        return message, counter_terms
+    def _acceptance_condition(self, terms: Dict) -> bool:
+        if self.persona == "cash_flow_stressed":
+            payment_ok = terms.get("payment_days", 60) <= 45
+            return payment_ok
+        return True
+    def _max_rounds(self) -> int:
+        return {"single_issue": 6, "multi_issue": 8, "adversarial": 10}[self.task_id]
+    def get_opening_message(self) -> Tuple[str, Dict]:
+        template = self.rng.choice(self.templates["opening"])
+        message = template.replace("${target}", f"${self.price_target:,.0f}")
+        terms = {"price": round(self.price_target, 2)}
+        if self.task_id in ["multi_issue", "adversarial"]:
+            terms["payment_days"] = 90
+        if self.task_id == "adversarial":
+            terms["support_hours"] = 80
+        return message, terms
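`update_rapport` relies on plain substring checks (`w in msg_lower`), so a signal can fire inside a longer word: "require" matches "requirements", and "fair" matches "unfair". The sketch below reproduces the scoring with a shortened signal list and compares it with a word-boundary variant. `rapport_delta` and its `whole_words` flag are hypothetical names for illustration and are not part of the committed code.

```python
import re

# Shortened signal lists, taken from opponent.py for illustration.
COLLABORATIVE = ["understand", "fair", "flexible"]
AGGRESSIVE = ["require", "must", "demand"]


def rapport_delta(message: str, whole_words: bool = False) -> float:
    """Per-round rapport change: +/-0.08 per matched signal, capped at +/-0.20."""
    text = message.lower()

    def hit(signal: str) -> bool:
        if whole_words:
            # Match the signal only when it stands alone as a word or phrase.
            return re.search(rf"\b{re.escape(signal)}\b", text) is not None
        return signal in text

    delta = sum(0.08 for s in COLLABORATIVE if hit(s))
    delta -= sum(0.08 for s in AGGRESSIVE if hit(s))
    return max(-0.20, min(0.20, delta))


msg = "We understand your requirements and timeline."
print(rapport_delta(msg))                    # substring matching
print(rapport_delta(msg, whole_words=True))  # word-boundary matching
```

With this message the substring version nets to 0.0, because "requirements" cancels "understand", while the word-boundary version yields +0.08. Whether the stricter matching is desirable is a design choice; it would also stop inflected forms such as "valued" from counting as collaborative.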

plan.md ADDED

	@@ -0,0 +1,1228 @@

+# THE DEFINITIVE FINAL DESIGN
+## Core Mechanic: Language-Sensitive Scripted Opponent
+This is the one thing that makes everything work. The opponent's behavior is deterministic given a seed AND sensitive to the agent's language quality.
+```python
+# Deterministic keyword detection — pure Python
+COLLABORATIVE_SIGNALS = [
+    "understand", "partnership", "mutual", "together", "value",
+    "appreciate", "flexible", "work with", "long-term", "relationship"
+]
+AGGRESSIVE_SIGNALS = [
+    "demand", "require", "final offer", "unacceptable", "must",
+    "non-negotiable", "take it or leave", "bottom line", "ultimatum"
+]
+def update_rapport(current_rapport: float, agent_message: str) -> float:
+    msg_lower = agent_message.lower()
+    delta = 0.0
+    delta += sum(0.08 for w in COLLABORATIVE_SIGNALS if w in msg_lower)
+    delta -= sum(0.08 for w in AGGRESSIVE_SIGNALS if w in msg_lower)
+    delta = max(-0.20, min(0.20, delta))  # cap per-round change
+    return max(0.0, min(1.0, current_rapport + delta))
+```
+The rapport score directly modifies the opponent's concession rate. For a 7% base rate and the `(rapport - 0.5) * base` modifier defined in `get_concession_rate`:
+- Rapport 0.8: opponent concedes about 9.1% per round
+- Rapport 0.5: opponent concedes 7% per round (neutral)
+- Rapport 0.2: opponent concedes about 4.9% per round (hardened)
+A heuristic agent that sends no message or uses aggressive language faces a neutral or hostile opponent. An LLM that learns collaborative framing faces a cooperative one. This is the LLM advantage.
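To make the rapport effect concrete, the sketch below applies the modifier used in `get_concession_rate` (base plus `(rapport - 0.5) * base`, floored at 0.01) and walks the seller's position down round by round. The 7% base is the `cash_flow_stressed` value from this plan; the opening price, the floor, and the helper names `concession_rate` and `rounds_to_floor` are assumptions chosen for illustration.

```python
def concession_rate(rapport: float, base: float = 0.07) -> float:
    """Per-round concession: base scaled between 0.5x and 1.5x for rapport in [0, 1]."""
    return max(0.01, base + (rapport - 0.5) * base)


def rounds_to_floor(start: float, floor: float, rapport: float, max_rounds: int = 50) -> int:
    """Count rounds until the seller's position reaches the floor at a fixed rapport."""
    position = start
    for round_number in range(1, max_rounds + 1):
        # Same update as the opponent: shrink the position, never below the floor.
        position = max(floor, position * (1 - concession_rate(rapport)))
        if position <= floor:
            return round_number
    return max_rounds


for rapport in (0.2, 0.5, 0.8):
    rate = concession_rate(rapport)
    rounds = rounds_to_floor(start=52000, floor=44000, rapport=rapport)
    print(f"rapport={rapport:.1f} rate={rate:.3f} rounds_to_floor={rounds}")
```

Under these assumed numbers, the seller reaches the floor in 4, 3, and 2 rounds for rapport 0.2, 0.5, and 0.8 respectively, which is the gap a language-aware agent can exploit within a fixed round budget.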
+## The Three Tasks — Final, Locked
+### Task 1: `single_issue` (Easy)
+**Scenario:** Renew software license. Price only.
+```
+Buyer target: $38,000
+Seller opens: $52,000
+Seller floor: $44,000
+Pareto optimal: $43,000
+Max rounds: 6
+Persona: Cooperative (concedes 10% baseline, rapport-sensitive)
+```
+**Calibration:** A base LLM that simply offers reasonable prices without collaborative language scores ~0.38. A base LLM that naturally uses professional language scores ~0.52. Scores above 0.75 require learning to time concessions correctly.
+**Grader:**
+```python
+def grade_single_issue(final_price, deal_reached, rounds_taken):
+    if not deal_reached:
+        return 0.0
+    # Value: how close to buyer target
+    value = (44000 - final_price) / (44000 - 38000)
+    value = max(0.0, min(1.0, value))
+    # Efficiency: penalty grows sharply in late rounds
+    efficiency = 1.0 - (rounds_taken / 6) ** 1.5 * 0.4
+    efficiency = max(0.0, efficiency)
+    return round(value * efficiency, 4)
+```
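A worked example helps show how the value and efficiency factors interact. The sketch below restates the single-issue grader and evaluates a few hypothetical outcomes; the prices and round counts are made up for illustration.

```python
def grade_single_issue(final_price: float, deal_reached: bool, rounds_taken: int) -> float:
    """Restatement of the plan's single-issue grader."""
    if not deal_reached:
        return 0.0
    # Value: 1.0 at the buyer target ($38K), 0.0 at the reference floor ($44K).
    value = (44000 - final_price) / (44000 - 38000)
    value = max(0.0, min(1.0, value))
    # Efficiency: the penalty grows faster than linearly with rounds used.
    efficiency = 1.0 - (rounds_taken / 6) ** 1.5 * 0.4
    efficiency = max(0.0, efficiency)
    return round(value * efficiency, 4)


examples = [
    (41000, True, 3),   # mid-range price, closed early
    (41000, True, 6),   # same price, closed in the last round
    (44000, True, 2),   # deal exactly at the reference floor
    (40000, False, 6),  # no deal
]
for price, deal, rounds in examples:
    print(price, deal, rounds, grade_single_issue(price, deal, rounds))
```

The same $41,000 deal scores 0.4293 in round 3 but 0.3 in round 6. A deal at or above $44,000 scores 0.0, the same as no deal at all.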
+### Task 2: `multi_issue` (Medium)
+**Scenario:** Enterprise software. Price + payment terms.
+```
+Issues: price ($40K-$58K) + payment_days (30-90)
+Seller persona: Cash Flow Stressed
+  → price_weight: 0.35 (somewhat cares)
+  → payment_weight: 0.65 (cares much more)
+Buyer weights: price 0.70, payment 0.30
+Pareto insight: buyer should offer Net-30 to get lower price
+Max rounds: 8
+```
+**Why medium:** A base LLM treats both issues equally and misses the trade opportunity (score ~0.25). An LLM that discovers the seller cares more about payment timing can bundle correctly (score ~0.50).
+**Grader:**
+```python
+def grade_multi_issue(final_terms, deal_reached, rounds_taken):
+    if not deal_reached:
+        return 0.0
+    # Buyer utility function
+    price_score = (58000 - final_terms['price']) / (58000 - 40000)
+    payment_score = (90 - final_terms['payment_days']) / (90 - 30)
+    price_score = max(0.0, min(1.0, price_score))
+    payment_score = max(0.0, min(1.0, payment_score))
+    value = 0.70 * price_score + 0.30 * payment_score
+    efficiency = 1.0 - (rounds_taken / 8) * 0.30
+    return round(value * efficiency, 4)
+```
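The weights in this grader imply an exchange rate between the two issues. The sketch below restates the grader, compares two hypothetical deals, and derives how many dollars of price one payment day is worth to the buyer under these weights. The deals and the helper `dollars_per_payment_day` are assumptions for illustration.

```python
def grade_multi_issue(final_terms: dict, deal_reached: bool, rounds_taken: int) -> float:
    """Restatement of the plan's multi-issue grader (buyer weights 0.70 / 0.30)."""
    if not deal_reached:
        return 0.0
    price_score = max(0.0, min(1.0, (58000 - final_terms["price"]) / (58000 - 40000)))
    payment_score = max(0.0, min(1.0, (90 - final_terms["payment_days"]) / (90 - 30)))
    value = 0.70 * price_score + 0.30 * payment_score
    efficiency = 1.0 - (rounds_taken / 8) * 0.30
    return round(value * efficiency, 4)


def dollars_per_payment_day() -> float:
    """Price change that shifts buyer value as much as one payment day."""
    value_per_day = 0.30 / (90 - 30)
    value_per_dollar = 0.70 / (58000 - 40000)
    return value_per_day / value_per_dollar


deal_a = {"price": 46000, "payment_days": 30}
deal_b = {"price": 44000, "payment_days": 60}
print(grade_multi_issue(deal_a, True, 5))
print(grade_multi_issue(deal_b, True, 5))
print(round(dollars_per_payment_day(), 2))
```

As written, this grader scores fewer payment days higher, so deal A (0.6229) beats deal B (0.5642) despite its higher price. One payment day is worth roughly $128.57 of price.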
+### Task 3: `adversarial` (Hard)
+**Scenario:** Large contract. Price + payment + support hours.
+```
+Issues: price + payment_days + support_hours
+Seller persona: Aggressive Anchor
+  → Opens at ceiling on all issues
+  → Hardens position if agent makes consecutive concessions
+  → Rapport-sensitive but requires consistent collaborative framing
+Adaptation: if agent concedes 2+ rounds in a row, seller increases floor by 3%
+Max rounds: 10
+Survival floor: deal at any terms scores minimum 0.15
+```
+**Why hard:** The agent must resist anchoring, break consecutive-concession patterns, and maintain a collaborative tone under pressure. Base LLM score ~0.15; strong LLM ~0.40.
+**Grader:**
+```python
+def grade_adversarial(final_terms, deal_reached, rounds_taken, consecutive_flag):
+    if not deal_reached:
+        return 0.0
+    # Survival floor — completing deal always scores at least 0.15
+    floor = 0.15
+    price_score = (120000 - final_terms['price']) / (120000 - 80000)
+    payment_score = (90 - final_terms['payment_days']) / (90 - 30)
+    support_score = (final_terms['support_hours'] - 80) / (200 - 80)
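+    # Clamp each component to [0, 1]; rebinding a loop variable would not update the scores
+    price_score, payment_score, support_score = (
+        max(0.0, min(1.0, s)) for s in (price_score, payment_score, support_score)
+    )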
+    value = 0.40 * price_score + 0.35 * payment_score + 0.25 * support_score
+    efficiency = 1.0 - (rounds_taken / 10) * 0.25
+    # Penalty for consecutive concession pattern
+    pattern_penalty = 0.1 if consecutive_flag else 0.0
+    raw = (value * efficiency) - pattern_penalty
+    return round(max(floor, raw), 4)
+```
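Each component score needs to be clamped to [0, 1] before weighting. Otherwise an out-of-range term, such as support hours below 80, drags the weighted value down. In Python, assigning to a loop variable rebinds only that variable, so the clamp has to write back to the named scores or build a new list. The sketch below compares the weighted value with and without clamping; `clamp01` and `weighted_value` are hypothetical helpers and the sample terms are made up.

```python
def clamp01(x: float) -> float:
    """Clamp a score to the closed interval [0, 1]."""
    return max(0.0, min(1.0, x))


def weighted_value(terms: dict, clamp: bool = True) -> float:
    """Buyer value for the adversarial task using the plan's ranges and weights."""
    scores = [
        (120000 - terms["price"]) / (120000 - 80000),
        (90 - terms["payment_days"]) / (90 - 30),
        (terms["support_hours"] - 80) / (200 - 80),
    ]
    if clamp:
        # Build a new list; the clamped values replace the raw ones.
        scores = [clamp01(s) for s in scores]
    weights = (0.40, 0.35, 0.25)
    return sum(w * s for w, s in zip(weights, scores))


terms = {"price": 100000, "payment_days": 60, "support_hours": 20}
print(round(weighted_value(terms, clamp=True), 4))
print(round(weighted_value(terms, clamp=False), 4))
```

With 20 support hours the raw support score is -0.5, so the unclamped value is 0.25 while the clamped value is 0.375.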
+## Score Calibration Table
+| Agent Type                  | single_issue | multi_issue | adversarial |
+| --------------------------- | ------------ | ----------- | ----------- |
+| Random/heuristic            | 0.15–0.25    | 0.08–0.15   | 0.03–0.10   |
+| Base LLM (no language)      | 0.35–0.45    | 0.20–0.30   | 0.12–0.20   |
+| Base LLM (natural language) | 0.48–0.58    | 0.28–0.38   | 0.18–0.28   |
+| GRPO-trained LLM (goal)     | 0.68–0.78    | 0.55–0.65   | 0.45–0.55   |
+This gives a clear score spread at every level. Phase 2 will show meaningful differentiation.
+---
+# THE CLAUDE CODE PROMPT
+Paste this entire block into Claude Code:
+---
+**Build ProcureRL: A Procurement Negotiation RL Environment**
+This is a complete OpenEnv-compliant environment. Build everything exactly as specified. No additions, no changes to the design.
+---
+**Project Structure:**
+```
+procure-rl/
+├── procure_rl/
+│   ├── __init__.py
+│   ├── environment.py
+│   ├── models.py
+│   ├── opponent.py
+│   ├── graders.py
+│   └── scenarios.py
+├── server/
+│   └── app.py
+├── inference.py
+├── openenv.yaml
+├── Dockerfile
+├── requirements.txt
+└── README.md
+```
+---
+**models.py — exact dataclasses:**
+```python
+from dataclasses import dataclass, field
+from typing import Optional, List, Dict, Any
+try:
+    from openenv.core.env_server import Action, Observation, State
+except ImportError:
+    Action = object
+    Observation = object
+    State = object
+@dataclass
+class NegotiationAction(Action):
+    move_type: str  # make_offer | accept | reject | bundle
+    terms: Dict[str, Any]  # {price: 44000, payment_days: 45, support_hours: 120}
+    message: str = ""  # natural language — affects opponent rapport
+@dataclass
+class NegotiationObservation(Observation):
+    task_id: str
+    round_number: int
+    max_rounds: int
+    supplier_message: str
+    current_offer: Dict[str, Any]
+    last_4_exchanges: List[Dict]  # capped at 4 for token efficiency
+    buyer_constraints: Dict[str, Any]  # buyer's targets and limits
+    rapport_hint: str  # "positive" | "neutral" | "negative" — visible to agent
+    done: bool
+@dataclass
+class NegotiationState(State):
+    task_id: str = ""
+    episode_id: str = ""
+    round_number: int = 0
+    rapport_score: float = 0.5
+    consecutive_concessions: int = 0
+    deal_reached: bool = False
+    final_terms: Optional[Dict] = None
+    cumulative_reward: float = 0.0
+```
+---
+**opponent.py — ScriptedPersonaOpponent:**
+```python
+import random
+from dataclasses import dataclass, field
+from typing import Dict, Tuple
+COLLABORATIVE_SIGNALS = [
+    "understand", "partnership", "mutual", "together", "value",
+    "appreciate", "flexible", "work with", "long-term", "relationship",
+    "reasonable", "fair", "both", "solution"
+]
+AGGRESSIVE_SIGNALS = [
+    "demand", "require", "final offer", "unacceptable", "must",
+    "non-negotiable", "take it or leave", "bottom line", "ultimatum",
+    "insist", "refuse", "absolutely not"
+]
+PERSONA_TEMPLATES = {
+    "cooperative": {
+        "opening": [
+            "Thanks for reaching out. Our standard pricing for this package is ${target}. Happy to discuss.",
+            "We value your interest. We're pricing this at ${target} based on current market rates.",
+        ],
+        "counter": [
+            "I appreciate you working with us. Based on our costs, ${counter} is where we can be.",
+            "Thank you for your offer. We can move to ${counter} given our margin requirements.",
+        ],
+        "near_close": [
+            "I think we're close. If you can do ${close}, I can get this approved today.",
+            "We're almost there. ${close} works for our team. Shall we finalize?"
+        ],
+        "accept": "That works for us. Let's move forward at those terms.",
+        "reject": "That's below what we can accept, but we want to make this work."
+    },
+    "cash_flow_stressed": {
+        "opening": [
+            "Our pricing is ${target}. I should mention — payment timing is particularly important to us this quarter.",
+            "We're at ${target}. Between us, our finance team has specific requirements around cash flow timing.",
+        ],
+        "counter": [
+            "We can move on price if payment terms work for you. ${counter} with your payment preference?",
+            "Price flexibility depends on receivables timing for us. ${counter} if we can discuss payment terms.",
+        ],
+        "near_close": [
+            "If you can do Net-30 on payment, we can get to ${close} on price.",
+            "Payment timing is our real constraint. ${close} with faster payment terms?"
+        ],
+        "accept": "Agreed. The payment structure works for our cash flow needs.",
+        "reject": "The price is tight but we could explore it if payment terms align."
+    },
+    "aggressive_anchor": {
+        "opening": [
+            "Our price is ${target}. This reflects our full service quality and market position.",
+            "We're firm at ${target}. This is based on our cost structure and service level.",
+        ],
+        "counter": [
+            "We can go to ${counter}. That's already a significant concession from our position.",
+            "${counter} is our revised position. We're not in a position to move much further.",
+        ],
+        "hardening": [
+            "We've already moved considerably. ${floor} is our absolute position.",
+            "I need to be direct — we're at ${floor} and that's where we'll stay.",
+        ],
+        "near_close": [
+            "Final position: ${close}. We need a decision today.",
+            "${close} is where we are. This is our best and final offer."
+        ],
+        "accept": "Accepted.",
+        "reject": "That doesn't work. Come back with a serious offer."
+    }
+}
+class ScriptedPersonaOpponent:
+    def __init__(self, task_id: str, seed: int, persona: str):
+        self.rng = random.Random(seed)
+        self.task_id = task_id
+        self.persona = persona
+        self.templates = PERSONA_TEMPLATES[persona]
+        # Sampled reservation values — never revealed to agent
+        if task_id == "single_issue":
+            self.price_floor = self.rng.uniform(42000, 46000)
+            self.price_target = self.price_floor * self.rng.uniform(1.28, 1.38)
+        elif task_id == "multi_issue":
+            self.price_floor = self.rng.uniform(40000, 46000)
+            self.price_target = self.price_floor * self.rng.uniform(1.25, 1.35)
+            self.payment_preference = self.rng.choice([30, 45, 60])  # preferred days
+        elif task_id == "adversarial":
+            self.price_floor = self.rng.uniform(85000, 95000)
+            self.price_target = self.price_floor * self.rng.uniform(1.30, 1.40)
+        self.rapport = 0.5
+        self.concession_count = 0
+        self.current_position = self.price_target
+    def update_rapport(self, agent_message: str) -> None:
+        msg_lower = agent_message.lower()
+        delta = 0.0
+        delta += sum(0.08 for w in COLLABORATIVE_SIGNALS if w in msg_lower)
+        delta -= sum(0.08 for w in AGGRESSIVE_SIGNALS if w in msg_lower)
+        delta = max(-0.20, min(0.20, delta))
+        self.rapport = max(0.0, min(1.0, self.rapport + delta))
+    def get_concession_rate(self) -> float:
+        # Base rate by persona
+        base_rates = {
+            "cooperative": 0.10,
+            "cash_flow_stressed": 0.07,
+            "aggressive_anchor": 0.04
+        }
+        base = base_rates[self.persona]
+        # Rapport modifier: +/- 50% of base rate
+        modifier = (self.rapport - 0.5) * base
+        return max(0.01, base + modifier)
+    def respond(self, agent_message: str, agent_terms: Dict,
+                round_number: int, consecutive_concessions: int) -> Tuple[str, Dict]:
+        self.update_rapport(agent_message)
+        self.concession_count += 1
+        agent_price = agent_terms.get('price', 0)
+        # Check if we should accept
+        if agent_price >= self.price_floor and self._acceptance_condition(agent_terms):
+            return self.templates["accept"], {**agent_terms, "_accepted": True}
+        # Compute counter position
+        concession = self.get_concession_rate()
+        # Aggressive anchor hardens if consecutive concessions detected
+        if self.persona == "aggressive_anchor" and consecutive_concessions >= 2:
+            concession = concession * 0.4  # barely moves
+            template_key = "hardening"
+        elif round_number >= self._max_rounds() * 0.7:
+            template_key = "near_close"
+        else:
+            template_key = "counter"
+        new_position = self.current_position * (1 - concession)
+        new_position = max(self.price_floor, new_position)
+        self.current_position = new_position
+        # Select template
+        templates_for_key = self.templates.get(template_key, self.templates["counter"])
+        template = self.rng.choice(templates_for_key)
+        message = template.replace("${counter}", f"${new_position:,.0f}")
+        message = message.replace("${floor}", f"${self.price_floor:,.0f}")
+        message = message.replace("${close}", f"${new_position:,.0f}")
+        counter_terms = dict(agent_terms)
+        counter_terms['price'] = round(new_position, 2)
+        # Cash flow stressed adds payment commentary
+        if self.persona == "cash_flow_stressed" and 'payment_days' in agent_terms:
+            if agent_terms['payment_days'] > 60:
+                message += " Though I'll need to flag the payment timing to our finance team."
+        return message, counter_terms
+    def _acceptance_condition(self, terms: Dict) -> bool:
+        if self.persona == "cash_flow_stressed":
+            payment_ok = terms.get('payment_days', 60) <= 45
+            return payment_ok
+        return True
+    def _max_rounds(self) -> int:
+        return {"single_issue": 6, "multi_issue": 8, "adversarial": 10}[self.task_id]
+    def get_opening_message(self) -> Tuple[str, Dict]:
+        template = self.rng.choice(self.templates["opening"])
+        message = template.replace("${target}", f"${self.price_target:,.0f}")
+        terms = {"price": round(self.price_target, 2)}
+        if self.task_id in ["multi_issue", "adversarial"]:
+            terms["payment_days"] = 90
+        if self.task_id == "adversarial":
+            terms["support_hours"] = 80
+        return message, terms
+```
+---
+**graders.py — pure Python, zero LLM calls:**
+```python
+from typing import Dict, Optional
+def grade_single_issue(
+    final_terms: Dict,
+    deal_reached: bool,
+    rounds_taken: int,
+    max_rounds: int = 6
+) -> float:
+    if not deal_reached:
+        return 0.0
+    final_price = final_terms.get('price', 99999)
+    # Buyer target: $38K, seller floor: ~$44K
+    BUYER_TARGET = 38000
+    SELLER_FLOOR = 44000
+    value = (SELLER_FLOOR - final_price) / (SELLER_FLOOR - BUYER_TARGET)
+    value = max(0.0, min(1.0, value))
+    # Efficiency penalty grows sharply in late rounds
+    efficiency = 1.0 - (rounds_taken / max_rounds) ** 1.5 * 0.4
+    efficiency = max(0.1, efficiency)
+    return round(value * efficiency, 4)
+def grade_multi_issue(
+    final_terms: Dict,
+    deal_reached: bool,
+    rounds_taken: int,
+    max_rounds: int = 8
+) -> float:
+    if not deal_reached:
+        return 0.0
+    final_price = final_terms.get('price', 99999)
+    payment_days = final_terms.get('payment_days', 90)
+    # Price component (buyer cares 70%)
+    PRICE_WORST = 58000
+    PRICE_TARGET = 40000
+    price_score = (PRICE_WORST - final_price) / (PRICE_WORST - PRICE_TARGET)
+    price_score = max(0.0, min(1.0, price_score))
+    # Payment component (buyer cares 30%)
+    PAYMENT_WORST = 90
+    PAYMENT_TARGET = 30
+    payment_score = (PAYMENT_WORST - payment_days) / (PAYMENT_WORST - PAYMENT_TARGET)
+    payment_score = max(0.0, min(1.0, payment_score))
+    value = 0.70 * price_score + 0.30 * payment_score
+    efficiency = 1.0 - (rounds_taken / max_rounds) * 0.30
+    efficiency = max(0.1, efficiency)
+    return round(value * efficiency, 4)
+def grade_adversarial(
+    final_terms: Dict,
+    deal_reached: bool,
+    rounds_taken: int,
+    consecutive_concessions_flag: bool,
+    max_rounds: int = 10
+) -> float:
+    if not deal_reached:
+        return 0.0
+    SURVIVAL_FLOOR = 0.15
+    final_price = final_terms.get('price', 999999)
+    payment_days = final_terms.get('payment_days', 90)
+    support_hours = final_terms.get('support_hours', 80)
+    # Price (buyer weight 40%)
+    PRICE_WORST = 120000
+    PRICE_TARGET = 80000
+    price_score = (PRICE_WORST - final_price) / (PRICE_WORST - PRICE_TARGET)
+    price_score = max(0.0, min(1.0, price_score))
+    # Payment (buyer weight 35%)
+    payment_score = (90 - payment_days) / (90 - 30)
+    payment_score = max(0.0, min(1.0, payment_score))
+    # Support hours (buyer weight 25%)
+    support_score = (support_hours - 80) / (200 - 80)
+    support_score = max(0.0, min(1.0, support_score))
+    value = 0.40 * price_score + 0.35 * payment_score + 0.25 * support_score
+    efficiency = 1.0 - (rounds_taken / max_rounds) * 0.25
+    efficiency = max(0.1, efficiency)
+    # Penalty for being exploited by consecutive concession pattern
+    pattern_penalty = 0.10 if consecutive_concessions_flag else 0.0
+    raw = (value * efficiency) - pattern_penalty
+    return round(max(SURVIVAL_FLOOR, raw), 4)
+def grade(task_id: str, final_terms: Dict, deal_reached: bool,
+          rounds_taken: int, **kwargs) -> float:
+    if task_id == "single_issue":
+        return grade_single_issue(final_terms, deal_reached, rounds_taken)
+    elif task_id == "multi_issue":
+        return grade_multi_issue(final_terms, deal_reached, rounds_taken)
+    elif task_id == "adversarial":
+        return grade_adversarial(
+            final_terms, deal_reached, rounds_taken,
+            kwargs.get("consecutive_concessions_flag", False)
+        )
+    raise ValueError(f"Unknown task: {task_id}")
+```
+---
+**environment.py:**
+```python
+import uuid
+from typing import Optional
+from procure_rl.models import NegotiationAction, NegotiationObservation, NegotiationState
+from procure_rl.opponent import ScriptedPersonaOpponent
+from procure_rl.graders import grade
+TASK_CONFIG = {
+    "single_issue": {
+        "persona": "cooperative",
+        "max_rounds": 6,
+        "buyer_constraints": {
+            "price": {"target": 38000, "worst": 52000, "budget": 50000}
+        }
+    },
+    "multi_issue": {
+        "persona": "cash_flow_stressed",
+        "max_rounds": 8,
+        "buyer_constraints": {
+            "price": {"target": 40000, "worst": 58000, "budget": 55000},
+            "payment_days": {"target": 60, "worst": 30, "preference": 60}
+        }
+    },
+    "adversarial": {
+        "persona": "aggressive_anchor",
+        "max_rounds": 10,
+        "buyer_constraints": {
+            "price": {"target": 80000, "worst": 120000, "budget": 115000},
+            "payment_days": {"target": 60, "worst": 30, "preference": 60},
+            "support_hours": {"target": 150, "worst": 80, "preference": 150}
+        }
+    }
+}
+try:
+    from openenv.core.env_server import Environment
+except ImportError:
+    class Environment:
+        pass
+class ProcureRLEnvironment(Environment):
+    def __init__(self):
+        self._state = NegotiationState()
+        self._opponent = None
+        self._task_config = None
+        self._done = False
+        self._last_offer = {}
+        self._consecutive_concessions = 0
+        self._prev_agent_price = None
+    def reset(self, task_id: str = "single_issue", seed: int = 42) -> NegotiationObservation:
+        if task_id not in TASK_CONFIG:
+            raise ValueError(f"Unknown task: {task_id}. Valid: {list(TASK_CONFIG.keys())}")
+        config = TASK_CONFIG[task_id]
+        self._task_config = config
+        self._done = False
+        self._consecutive_concessions = 0
+        self._prev_agent_price = None
+        self._opponent = ScriptedPersonaOpponent(
+            task_id=task_id,
+            seed=seed,
+            persona=config["persona"]
+        )
+        opening_msg, opening_terms = self._opponent.get_opening_message()
+        self._last_offer = opening_terms
+        self._state = NegotiationState(
+            task_id=task_id,
+            episode_id=str(uuid.uuid4())[:8],
+            round_number=0,
+            rapport_score=0.5,
+            consecutive_concessions=0,
+            deal_reached=False,
+            final_terms=None,
+            cumulative_reward=0.0
+        )
+        return NegotiationObservation(
+            task_id=task_id,
+            round_number=0,
+            max_rounds=config["max_rounds"],
+            supplier_message=opening_msg,
+            current_offer=opening_terms,
+            last_4_exchanges=[{"role": "supplier", "message": opening_msg, "terms": opening_terms}],
+            buyer_constraints=config["buyer_constraints"],
+            rapport_hint="neutral",
+            done=False
+        )
+    def step(self, action: NegotiationAction):
+        if self._done:
+            obs = self._make_obs("Episode finished. Call reset().")
+            return obs, 0.0, True, {"error": "episode_done"}
+        self._state.round_number += 1
+        round_num = self._state.round_number
+        config = self._task_config
+        max_rounds = config["max_rounds"]
+        reward = 0.0
+        error = None
+        # Track consecutive concessions
+        if self._prev_agent_price is not None:
+            current_price = action.terms.get('price', self._prev_agent_price)
+            if current_price > self._prev_agent_price:  # agent conceded (price went up toward seller)
+                self._consecutive_concessions += 1
+            else:
+                self._consecutive_concessions = 0
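+        # Only update the reference price when this action carries one, so an
+        # accept/reject with empty terms does not reset concession tracking.
+        if 'price' in action.terms:
+            self._prev_agent_price = action.terms['price']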
+        self._state.consecutive_concessions = self._consecutive_concessions
+        # Handle accept
+        if action.move_type == "accept":
+            self._done = True
+            self._state.deal_reached = True
+            self._state.final_terms = self._last_offer
+            reward = grade(
+                self._state.task_id,
+                self._last_offer,
+                True,
+                round_num,
+                consecutive_concessions_flag=(self._consecutive_concessions >= 2)
+            )
+            self._state.cumulative_reward = reward
+            obs = self._make_obs()
+            obs.done = True
+            return obs, reward, True, {"deal_price": self._last_offer.get('price')}
+        # Handle reject
+        if action.move_type == "reject":
+            if round_num >= max_rounds:
+                self._done = True
+                reward = 0.0
+                obs = self._make_obs()
+                obs.done = True
+                return obs, reward, True, {"error": "rejected_at_limit"}
+            obs = self._make_obs()
+            return obs, 0.0, False, {}
+        # Handle make_offer or bundle
+        opponent_msg, opponent_terms = self._opponent.respond(
+            agent_message=action.message,
+            agent_terms=action.terms,
+            round_number=round_num,
+            consecutive_concessions=self._consecutive_concessions
+        )
+        # Check if opponent accepted
+        if opponent_terms.get("_accepted"):
+            self._done = True
+            self._state.deal_reached = True
+            self._state.final_terms = action.terms
+            reward = grade(
+                self._state.task_id,
+                action.terms,
+                True,
+                round_num,
+                consecutive_concessions_flag=(self._consecutive_concessions >= 2)
+            )
+            self._state.cumulative_reward = reward
+            obs = self._make_obs(supplier_message=opponent_msg)
+            obs.done = True
+            return obs, reward, True, {"deal_price": action.terms.get('price')}
+        self._last_offer = opponent_terms
+        self._state.rapport_score = self._opponent.rapport
+        # Episode limit
+        if round_num >= max_rounds:
+            self._done = True
+            reward = 0.0
+            obs = self._make_obs(supplier_message=opponent_msg)
+            obs.done = True
+            return obs, reward, True, {"error": "max_rounds_reached"}
+        obs = self._make_obs(supplier_message=opponent_msg)
+        return obs, 0.0, False, {}
+    def state(self) -> NegotiationState:
+        return self._state
+    def _make_obs(self, supplier_message: Optional[str] = None) -> NegotiationObservation:
+        rapport = self._state.rapport_score
+        if rapport >= 0.65:
+            hint = "positive"
+        elif rapport <= 0.35:
+            hint = "negative"
+        else:
+            hint = "neutral"
+        return NegotiationObservation(
+            task_id=self._state.task_id,
+            round_number=self._state.round_number,
+            max_rounds=self._task_config["max_rounds"],
+            supplier_message=supplier_message or "",
+            current_offer=self._last_offer,
+            last_4_exchanges=[],
+            buyer_constraints=self._task_config["buyer_constraints"],
+            rapport_hint=hint,
+            done=self._done
+        )
+```
+---
+**server/app.py:**
+```python
+import sys, os
+sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+from dataclasses import asdict
+from fastapi import FastAPI, HTTPException
+from fastapi.middleware.cors import CORSMiddleware
+from pydantic import BaseModel
+from typing import Optional, Dict, Any
+from procure_rl.environment import ProcureRLEnvironment
+from procure_rl.models import NegotiationAction
+app = FastAPI(title="ProcureRL", version="1.0.0")
+app.add_middleware(CORSMiddleware, allow_origins=["*"], allow_methods=["*"], allow_headers=["*"])
+_env = ProcureRLEnvironment()
+class ResetRequest(BaseModel):
+    task_id: Optional[str] = "single_issue"
+    seed: Optional[int] = 42
+class StepRequest(BaseModel):
+    move_type: str = "make_offer"
+    terms: Dict[str, Any] = {}
+    message: str = ""
+@app.get("/health")
+async def health():
+    return {"status": "ok", "service": "procure-rl"}
+@app.get("/metadata")
+async def metadata():
+    return {
+        "name": "procure-rl",
+        "tasks": ["single_issue", "multi_issue", "adversarial"],
+        "description": "LLM agent learns procurement negotiation"
+    }
+@app.post("/reset")
+async def reset(req: ResetRequest = ResetRequest()):
+    try:
+        obs = _env.reset(task_id=req.task_id, seed=req.seed)
+        return asdict(obs)
+    except ValueError as e:
+        raise HTTPException(400, str(e))
+    except Exception as e:
+        raise HTTPException(500, f"Reset failed: {e}")
+@app.post("/step")
+async def step(req: StepRequest):
+    action = NegotiationAction(
+        move_type=req.move_type,
+        terms=req.terms,
+        message=req.message
+    )
+    try:
+        obs, reward, done, info = _env.step(action)
+        return {"observation": asdict(obs), "reward": reward, "done": done, "info": info}
+    except Exception as e:
+        raise HTTPException(500, f"Step failed: {e}")
+@app.get("/state")
+async def state():
+    return asdict(_env.state())
+if __name__ == "__main__":
+    import uvicorn
+    port = int(os.getenv("PORT", 7860))
+    uvicorn.run("server.app:app", host="0.0.0.0", port=port)
+```
+---
+**inference.py — exact stdout format, no deviation:**
+```python
+import os
+import json
+from openai import OpenAI
+API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
+API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
+MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
+BENCHMARK = "procure-rl"
+MAX_STEPS = 10
+client = OpenAI(api_key=API_KEY, base_url=API_BASE_URL)
+# Import environment directly (not via HTTP for baseline)
+import sys
+sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
+from procure_rl.environment import ProcureRLEnvironment
+from procure_rl.models import NegotiationAction
+TASKS = ["single_issue", "multi_issue", "adversarial"]
+SYSTEM_PROMPT = """You are a professional procurement negotiator. Your goal is to negotiate the best possible deal for your company.
+You will receive a supplier's message and current offer terms. You must respond with a JSON action in this exact format:
+{
+  "move_type": "make_offer",
+  "terms": {"price": 42000, "payment_days": 45},
+  "message": "Your natural language response to the supplier"
+}
+move_type must be one of: make_offer, accept, reject, bundle
+terms must include price and any other issues being negotiated.
+message should be professional and collaborative when possible.
+Your buyer constraints will be provided. Do not exceed your budget. Try to reach the target price."""
+def get_agent_action(obs_dict: dict) -> dict:
+    task_id = obs_dict.get("task_id", "single_issue")
+    supplier_msg = obs_dict.get("supplier_message", "")
+    current_offer = obs_dict.get("current_offer", {})
+    constraints = obs_dict.get("buyer_constraints", {})
+    rapport_hint = obs_dict.get("rapport_hint", "neutral")
+    round_num = obs_dict.get("round_number", 0)
+    max_rounds = obs_dict.get("max_rounds", 10)
+    user_content = f"""Task: {task_id}
+Round: {round_num}/{max_rounds}
+Supplier says: "{supplier_msg}"
+Current offer on table: {json.dumps(current_offer)}
+Your constraints: {json.dumps(constraints)}
+Relationship rapport: {rapport_hint}
+Respond with your negotiation action as JSON."""
+    response = client.chat.completions.create(
+        model=MODEL_NAME,
+        messages=[
+            {"role": "system", "content": SYSTEM_PROMPT},
+            {"role": "user", "content": user_content}
+        ],
+        max_tokens=300,
+        temperature=0.3
+    )
+    # message.content can be None (e.g. an empty completion), so guard before strip()
+    content = (response.choices[0].message.content or "").strip()
+    # Parse JSON from response
+    try:
+        # Find JSON in response
+        start = content.find('{')
+        end = content.rfind('}') + 1
+        if start >= 0 and end > start:
+            action_dict = json.loads(content[start:end])
+        else:
+            # Fallback
+            action_dict = {
+                "move_type": "make_offer",
+                "terms": current_offer,
+                "message": content[:200]
+            }
+    except json.JSONDecodeError:
+        action_dict = {
+            "move_type": "make_offer",
+            "terms": current_offer,
+            "message": "I'd like to continue our discussion."
+        }
+    return action_dict
+def run_task(task_id: str) -> dict:
+    env = ProcureRLEnvironment()
+    obs = env.reset(task_id=task_id, seed=42)
+    obs_dict = {
+        "task_id": obs.task_id,
+        "round_number": obs.round_number,
+        "max_rounds": obs.max_rounds,
+        "supplier_message": obs.supplier_message,
+        "current_offer": obs.current_offer,
+        "buyer_constraints": obs.buyer_constraints,
+        "rapport_hint": obs.rapport_hint,
+        "done": obs.done
+    }
+    print(f"[START] task={task_id} env={BENCHMARK} model={MODEL_NAME}")
+    rewards = []
+    step = 0
+    done = False
+    final_score = 0.0
+    while not done and step < MAX_STEPS:
+        step += 1
+        action_dict = get_agent_action(obs_dict)
+        action = NegotiationAction(
+            move_type=action_dict.get("move_type", "make_offer"),
+            terms=action_dict.get("terms", {}),
+            message=action_dict.get("message", "")
+        )
+        obs, reward, done, info = env.step(action)
+        rewards.append(reward)
+        action_str = f"{action.move_type}({json.dumps(action.terms)})"
+        error = info.get("error", None)
+        print(f"[STEP] step={step} action={action_str} reward={reward:.2f} done={str(done).lower()} error={error if error else 'null'}")
+        if done:
+            final_score = reward if reward > 0 else (max(rewards) if rewards else 0.0)
+            break
+        obs_dict = {
+            "task_id": obs.task_id,
+            "round_number": obs.round_number,
+            "max_rounds": obs.max_rounds,
+            "supplier_message": obs.supplier_message,
+            "current_offer": obs.current_offer,
+            "buyer_constraints": obs.buyer_constraints,
+            "rapport_hint": obs.rapport_hint,
+            "done": obs.done
+        }
+    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
+    success = final_score > 0.1
+    print(f"[END] success={str(success).lower()} steps={step} score={final_score:.2f} rewards={rewards_str}")
+    return {"task": task_id, "score": final_score, "steps": step}
+if __name__ == "__main__":
+    results = []
+    for task in TASKS:
+        result = run_task(task)
+        results.append(result)
+    print("\nBaseline Results:")
+    for r in results:
+        print(f"  {r['task']}: {r['score']:.3f}")
+```
+---
+**openenv.yaml:**
+```yaml
+name: procure-rl
+version: "1.0.0"
+description: "LLM agent learns procurement negotiation strategy against scripted supplier opponents with hidden utility functions"
+author: "your-hf-username"
+tags:
+  - openenv
+  - negotiation
+  - procurement
+  - real-world
+  - rl
+tasks:
+  - id: single_issue
+    description: "Negotiate software license price with cooperative supplier"
+    difficulty: easy
+    max_steps: 6
+    reward_range: [0.0, 1.0]
+  - id: multi_issue
+    description: "Negotiate price and payment terms with cash-flow-sensitive supplier"
+    difficulty: medium
+    max_steps: 8
+    reward_range: [0.0, 1.0]
+  - id: adversarial
+    description: "Negotiate multiple issues against aggressive anchoring supplier"
+    difficulty: hard
+    max_steps: 10
+    reward_range: [0.0, 1.0]
+reward_range: [0.0, 1.0]
+observation_space:
+  type: object
+  description: "Natural language supplier message with structured negotiation state and rapport signal"
+action_space:
+  type: object
+  description: "Negotiation move type, structured terms, and natural language message"
+```
+---
+**Dockerfile:**
+```dockerfile
+FROM python:3.11-slim
+WORKDIR /app
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+COPY . .
+ENV PORT=7860
+EXPOSE 7860
+CMD ["python", "-m", "uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "7860"]
+```
+---
+**requirements.txt:**
+```
+fastapi==0.109.0
+uvicorn==0.27.0
+pydantic>=2.0.0
+openai>=1.0.0
+openenv-core>=0.1.0
+```
+---
+Build all files exactly as specified. Run locally with:
+```
+docker build -t procure-rl .
+docker run -p 7860:7860 procure-rl
+```
+Test with:
+```
+curl -X POST http://localhost:7860/reset -H "Content-Type: application/json" -d '{"task_id": "single_issue"}'
+```
+Then run inference.py locally with HF_TOKEN set and verify that stdout follows the [START], [STEP], [END] line format.
+---
+# PLAN.MD
+````markdown
+# ProcureRL — Implementation Plan
+## What We Are Building
+An OpenEnv-compliant RL environment where an LLM agent learns
+procurement negotiation strategy against scripted supplier opponents.
+The key innovation: language-sensitive opponent behavior. The agent's
+natural language quality affects opponent concession rates, making LLM
+genuinely required — not just for parsing but for output quality.
+## Why This Wins
+- Zero negotiation environments in OpenEnv hub — confirmed
+- Documented LLM weakness in buyer negotiation (ACL 2024)
+- Walmart/Pactum market validation — real enterprise deployment exists
+- Nash-inspired grader with language mechanism — novel and memorable
+- Deterministic, reproducible, pure Python graders
+## Implementation Order (strict)
+### Phase 1: Core Logic (Day 1, first 4 hours)
+- [ ] procure_rl/models.py — dataclasses only
+- [ ] procure_rl/opponent.py — ScriptedPersonaOpponent
+- [ ] procure_rl/graders.py — three grader functions
+- [ ] procure_rl/environment.py — ProcureRLEnvironment
+- [ ] Test: import and run reset() + step() in Python shell
+### Phase 2: Server (Day 1, next 2 hours)
+- [ ] server/app.py — FastAPI with /health /reset /step /state
+- [ ] requirements.txt
+- [ ] Test: uvicorn server.app:app, curl /health
+### Phase 3: Spec Compliance (Day 1, final 2 hours)
+- [ ] openenv.yaml — exact schema
+- [ ] Run: openenv validate
+- [ ] Fix any validation errors
+### Phase 4: Dockerfile + HF Spaces (Day 2, first 3 hours)
+- [ ] Dockerfile
+- [ ] docker build -t procure-rl .
+- [ ] docker run -p 7860:7860 procure-rl
+- [ ] curl http://localhost:7860/health
+- [ ] Push to HF Spaces
+### Phase 5: Inference Script (Day 2, next 2 hours)
+- [ ] inference.py
+- [ ] Run locally: HF_TOKEN=xxx python inference.py
+- [ ] Verify [START][STEP][END] format exactly
+- [ ] Verify runtime < 20 minutes
+### Phase 6: README + Calibration (Day 2, final 2 hours)
+- [ ] README.md with all required sections
+- [ ] Run inference.py with weak model (7B) and strong model (72B)
+- [ ] Verify score spread exists
+- [ ] Submit
+## Critical Checks Before Submission
+```bash
+# 1. Spec compliance
+openenv validate
+# 2. Docker build
+docker build -t procure-rl .
+# 3. Docker run
+docker run -p 7860:7860 procure-rl &
+curl -X POST http://localhost:7860/reset \
+  -H "Content-Type: application/json" \
+  -d '{"task_id": "single_issue"}'
+# 4. Inference script
+HF_TOKEN=your_token python inference.py
+# 5. Score verification
+# single_issue: should be 0.30-0.55
+# multi_issue: should be 0.15-0.35
+# adversarial: should be 0.10-0.25
+```
+````
+## Score Calibration Targets
+| Task         | Random    | Base LLM  | Goal      |
+| ------------ | --------- | --------- | --------- |
+| single_issue | 0.15-0.25 | 0.35-0.50 | 0.65-0.78 |
+| multi_issue  | 0.08-0.15 | 0.20-0.32 | 0.52-0.65 |
+| adversarial  | 0.03-0.10 | 0.12-0.22 | 0.42-0.55 |
+If base LLM scores above 0.55 on single_issue → opponent too easy,
+reduce cooperative concession rate.
+If base LLM scores below 0.15 on single_issue → opponent too hard,
+increase cooperative concession rate.
+## README Required Sections
+1. Environment description and motivation (Walmart/Pactum reference)
+2. The Language-Sensitive Opponent (this is your wow factor)
+3. Action space definition with examples
+4. Observation space definition
+5. Task descriptions with expected scores
+6. Setup instructions (pip install + docker)
+7. Baseline scores (from inference.py run)
+## What NOT To Add
+- Nash bargaining (too complex, edge cases)
+- Step reward shaping (shaping bias risk)
+- LLM inside environment (reproducibility)
+- More than 3 tasks (scope creep)
+- Preference shift mechanics (complexity risk)
+## The One Sentence For Every Judge Question
+"Why RL?"
+→ Sequential decisions, delayed reward, hidden opponent utility — policy
+only emerges through thousands of negotiation episodes.
+"Why LLM?"
+→ Language quality directly affects opponent rapport score and concession
+rate. A heuristic agent gets neutral rapport. An LLM that learns
+collaborative framing gets cooperative responses. The language IS the policy.
+"Is this real?"
+→ Walmart deployed Pactum for exactly this. 90% of CPOs adopting AI
+negotiation in 2025. The gap between rule-based current tools and
+trained LLM policy is the research contribution.
+"Is this novel?"
+→ Zero negotiation environments in OpenEnv hub. Confirmed.
+```
+---
+**This is the final version. Build it exactly as specified.**
+```
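
The calibration rule in plan.md above (retune the cooperative opponent when the base LLM's single_issue score leaves the expected band) can be sketched as a small check. This is an illustrative sketch and not part of the commit: `calibration_advice` is a hypothetical helper, and the 0.15 and 0.55 thresholds are the ones stated in the plan text.

```python
def calibration_advice(
    single_issue_score: float,
    too_hard_below: float = 0.15,   # threshold from plan.md
    too_easy_above: float = 0.55,   # threshold from plan.md
) -> str:
    """Map a base-LLM single_issue score to the tuning action named in plan.md."""
    if single_issue_score > too_easy_above:
        # Opponent too easy.
        return "reduce cooperative concession rate"
    if single_issue_score < too_hard_below:
        # Opponent too hard.
        return "increase cooperative concession rate"
    return "keep current opponent settings"


if __name__ == "__main__":
    for score in (0.10, 0.42, 0.60):
        print(f"{score:.2f}: {calibration_advice(score)}")
```

A score exactly on a threshold is treated as in-band here, since the plan only names "above 0.55" and "below 0.15" as triggers.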

pyproject.toml ADDED Viewed

	@@ -0,0 +1,45 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+[build-system]
+requires = ["setuptools>=45", "wheel"]
+build-backend = "setuptools.build_meta"
+[project]
+name = "openenv-Procure_RL"
+version = "0.1.0"
+description = "ProcureRL environment for OpenEnv"
+requires-python = ">=3.10"
+dependencies = [
+    # Core OpenEnv runtime (provides FastAPI server + HTTP client types)
+    # install from github
+    # "openenv-core[core] @ git+https://github.com/meta-pytorch/OpenEnv.git",
+    "openenv-core[core]>=0.2.2",
+    # Environment-specific dependencies
+    # Add all dependencies needed for your environment here
+    # Examples:
+    # "numpy>=1.19.0",
+    # "torch>=2.0.0",
+    # "gymnasium>=0.29.0",
+    # "openspiel>=1.0.0",
+    # "smolagents>=1.22.0,<2",
+]
+[project.optional-dependencies]
+dev = [
+    "pytest>=8.0.0",
+    "pytest-cov>=4.0.0",
+]
+[project.scripts]
+# Server entry point - enables running via: uv run --project . server
+# or: python -m Procure_RL.server.app
+server = "Procure_RL.server.app:main"
+[tool.setuptools]
+include-package-data = true
+packages = ["Procure_RL", "Procure_RL.server"]
+package-dir = { "Procure_RL" = ".", "Procure_RL.server" = "server" }

server/Procure_RL_environment.py ADDED Viewed

	@@ -0,0 +1,316 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+"""
+ProcureRL Environment Implementation.
+An OpenEnv-compliant RL environment for procurement negotiation where
+an LLM agent learns to negotiate against scripted supplier opponents.
+"""
+import uuid
+from typing import Optional, Dict, Any
+try:
+    from openenv.core.env_server.interfaces import Environment
+except ImportError:
+    Environment = object
+try:
+    from ..models import NegotiationAction, NegotiationObservation, NegotiationState
+    from ..opponent import ScriptedPersonaOpponent
+    from ..graders import grade
+except ImportError:
+    import sys
+    import os
+    sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+    from models import NegotiationAction, NegotiationObservation, NegotiationState
+    from opponent import ScriptedPersonaOpponent
+    from graders import grade
+TASK_CONFIG = {
+    "single_issue": {
+        "persona": "cooperative",
+        "max_rounds": 6,
+        "buyer_constraints": {
+            "price": {"target": 36000, "worst": 55000, "budget": 53000}
+        },
+    },
+    "multi_issue": {
+        "persona": "cash_flow_stressed",
+        "max_rounds": 8,
+        "buyer_constraints": {
+            "price": {"target": 40000, "worst": 58000, "budget": 55000},
+            "payment_days": {"target": 60, "worst": 30, "preference": 60},
+        },
+    },
+    "adversarial": {
+        "persona": "aggressive_anchor",
+        "max_rounds": 10,
+        "buyer_constraints": {
+            "price": {"target": 80000, "worst": 120000, "budget": 115000},
+            "payment_days": {"target": 60, "worst": 30, "preference": 60},
+            "support_hours": {"target": 150, "worst": 80, "preference": 150},
+        },
+    },
+}
+VALID_MOVES = ("make_offer", "accept", "reject", "bundle")
+class ProcureRLEnvironment(Environment):
+    SUPPORTS_CONCURRENT_SESSIONS: bool = True
+    def __init__(self):
+        self._state = NegotiationState()
+        self._opponent = None
+        self._task_config = None
+        self._done = False
+        self._last_offer: Dict[str, Any] = {}
+        self._consecutive_concessions = 0
+        self._prev_agent_price: Optional[float] = None
+        self._exchanges: list = []
+        self._last_info: Dict[str, Any] = {}
+    def reset(
+        self, seed: Optional[int] = None, episode_id: Optional[str] = None, **kwargs
+    ) -> NegotiationObservation:
+        task_id = kwargs.get("task_id", "single_issue")
+        seed = seed if seed is not None else 42
+        if task_id not in TASK_CONFIG:
+            obs = self._make_obs(
+                f"Unknown task: {task_id}. Valid: {list(TASK_CONFIG.keys())}"
+            )
+            obs.done = True
+            obs.metadata["error"] = f"unknown_task:{task_id}"
+            return obs
+        config = TASK_CONFIG[task_id]
+        self._task_config = config
+        self._done = False
+        self._consecutive_concessions = 0
+        self._prev_agent_price = None
+        self._exchanges = []
+        self._last_info = {}
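+        # hash() of a str is salted per process (PYTHONHASHSEED), so derive the
+        # opponent seed from a stable checksum to keep episodes reproducible.
+        import zlib
+        opponent_seed = zlib.crc32(f"{seed}:{task_id}".encode("utf-8"))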
+        self._opponent = ScriptedPersonaOpponent(
+            task_id=task_id, seed=opponent_seed, persona=config["persona"]
+        )
+        opening_msg, opening_terms = self._opponent.get_opening_message()
+        self._last_offer = opening_terms
+        self._opponent_opening_price = opening_terms.get("price", 52000.0)
+        self._state = NegotiationState(
+            task_id=task_id,
+            episode_id=episode_id or str(uuid.uuid4())[:8],
+            round_number=0,
+            step_count=0,
+            rapport_score=0.5,
+            consecutive_concessions=0,
+            deal_reached=False,
+            final_terms=None,
+            cumulative_reward=0.0,
+        )
+        self._exchanges.append(
+            {"role": "supplier", "message": opening_msg, "terms": opening_terms}
+        )
+        return NegotiationObservation(
+            task_id=task_id,
+            round_number=0,
+            max_rounds=config["max_rounds"],
+            supplier_message=opening_msg,
+            current_offer=opening_terms,
+            last_4_exchanges=self._exchanges[-4:],
+            buyer_constraints=config["buyer_constraints"],
+            rapport_hint="neutral",
+            done=False,
+        )
+    def step(self, action: NegotiationAction, **kwargs) -> NegotiationObservation:
+        self._last_info = {}
+        if self._done:
+            obs = self._make_obs("Episode finished. Call reset().")
+            obs.done = True
+            obs.metadata["error"] = "episode_done"
+            return obs
+        if self._task_config is None:
+            obs = self._make_obs("Environment not initialized. Call reset() first.")
+            obs.done = True
+            obs.metadata["error"] = "not_initialized"
+            return obs
+        if not isinstance(action, NegotiationAction):
+            action_dict = (
+                action if isinstance(action, dict) else {"move_type": "make_offer"}
+            )
+            action = NegotiationAction(
+                move_type=action_dict.get("move_type", "make_offer"),
+                terms=action_dict.get("terms", {}),
+                message=action_dict.get("message", ""),
+            )
+        if action.move_type not in VALID_MOVES:
+            obs = self._make_obs()
+            obs.metadata["error"] = f"invalid_move_type:{action.move_type}"
+            return obs
+        self._state.round_number += 1
+        self._state.step_count += 1
+        round_num = self._state.round_number
+        config = self._task_config
+        max_rounds = config["max_rounds"]
+        reward = 0.0
+        if self._prev_agent_price is not None and "price" in action.terms:
+            current_price = float(action.terms.get("price", self._prev_agent_price))
+            if current_price > self._prev_agent_price:
+                self._consecutive_concessions += 1
+            else:
+                self._consecutive_concessions = 0
+        if "price" in action.terms:
+            self._prev_agent_price = float(action.terms.get("price"))
+        self._state.consecutive_concessions = self._consecutive_concessions
+        if action.move_type in ("make_offer", "bundle"):
+            opponent_msg, opponent_terms = self._opponent.respond(
+                agent_message=action.message,
+                agent_terms=action.terms,
+                round_number=round_num,
+                consecutive_concessions=self._consecutive_concessions,
+            )
+            self._exchanges.append(
+                {"role": "agent", "message": action.message, "terms": action.terms}
+            )
+            if opponent_terms.get("_accepted"):
+                self._done = True
+                self._state.deal_reached = True
+                self._state.final_terms = action.terms
+                reward = grade(
+                    self._state.task_id,
+                    action.terms,
+                    True,
+                    round_num,
+                    opponent_opening=self._opponent_opening_price,
+                    consecutive_concessions_flag=(self._consecutive_concessions >= 2),
+                )
+                self._state.cumulative_reward = reward
+                obs = self._make_obs(supplier_message=opponent_msg)
+                obs.done = True
+                obs.reward = reward
+                self._last_info["deal_price"] = action.terms.get("price")
+                self._exchanges.append(
+                    {
+                        "role": "supplier",
+                        "message": opponent_msg,
+                        "terms": {
+                            k: v
+                            for k, v in opponent_terms.items()
+                            if not k.startswith("_")
+                        },
+                    }
+                )
+                return obs
+            self._last_offer = {
+                k: v for k, v in opponent_terms.items() if not k.startswith("_")
+            }
+            self._state.rapport_score = self._opponent.rapport
+            self._exchanges.append(
+                {"role": "supplier", "message": opponent_msg, "terms": self._last_offer}
+            )
+            if round_num >= max_rounds:
+                self._done = True
+                reward = 0.0
+                obs = self._make_obs(supplier_message=opponent_msg)
+                obs.done = True
+                obs.reward = reward
+                self._last_info["error"] = "max_rounds_reached"
+                return obs
+            obs = self._make_obs(supplier_message=opponent_msg)
+            obs.reward = reward
+            return obs
+        if action.move_type == "accept":
+            self._done = True
+            self._state.deal_reached = True
+            self._state.final_terms = self._last_offer
+            reward = grade(
+                self._state.task_id,
+                self._last_offer,
+                True,
+                round_num,
+                opponent_opening=self._opponent_opening_price,
+                consecutive_concessions_flag=(self._consecutive_concessions >= 2),
+            )
+            self._state.cumulative_reward = reward
+            obs = self._make_obs()
+            obs.done = True
+            obs.reward = reward
+            self._last_info["deal_price"] = self._last_offer.get("price")
+            return obs
+        if action.move_type == "reject":
+            if round_num >= max_rounds:
+                self._done = True
+                reward = 0.0
+                obs = self._make_obs()
+                obs.done = True
+                obs.reward = reward
+                self._last_info["error"] = "rejected_at_limit"
+                return obs
+            obs = self._make_obs()
+            obs.reward = 0.0
+            return obs
+        obs = self._make_obs()
+        obs.reward = 0.0
+        return obs
+    @property
+    def state(self) -> NegotiationState:
+        return self._state
+    def close(self) -> None:
+        pass
+    def _make_obs(self, supplier_message: Optional[str] = None) -> NegotiationObservation:
+        rapport = self._state.rapport_score
+        if rapport >= 0.65:
+            hint = "positive"
+        elif rapport <= 0.35:
+            hint = "negative"
+        else:
+            hint = "neutral"
+        return NegotiationObservation(
+            task_id=self._state.task_id or "",
+            round_number=self._state.round_number,
+            max_rounds=self._task_config["max_rounds"] if self._task_config else 0,
+            supplier_message=supplier_message or "",
+            current_offer=self._last_offer,
+            last_4_exchanges=self._exchanges[-4:] if self._exchanges else [],
+            buyer_constraints=self._task_config["buyer_constraints"]
+            if self._task_config
+            else {},
+            rapport_hint=hint,
+            done=self._done,
+            metadata=self._last_info,
+        )
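
Review note on `_make_obs`: the `rapport_hint` field is a three-way bucketing of the internal rapport score. The thresholds below are copied from the diff above. The helper name `rapport_hint` and the assumption that the score lies in [0, 1] are illustrative and not confirmed by this commit. A minimal standalone sketch:

```python
def rapport_hint(rapport_score: float) -> str:
    """Map a rapport score to the coarse hint exposed in observations.

    Thresholds (0.65 and 0.35) mirror _make_obs; the function name and the
    assumed [0, 1] score range are hypothetical.
    """
    if rapport_score >= 0.65:
        return "positive"
    if rapport_score <= 0.35:
        return "negative"
    return "neutral"


# Both thresholds are inclusive, so 0.65 is "positive" and 0.35 is "negative".
for score in (0.2, 0.35, 0.5, 0.65, 0.9):
    print(score, rapport_hint(score))
```

Pulling the bucketing into a pure function like this would make the boundary cases testable without constructing an environment.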

server/__init__.py ADDED

	@@ -0,0 +1,11 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+"""ProcureRL environment server components."""
+from .Procure_RL_environment import ProcureRLEnvironment
+__all__ = ["ProcureRLEnvironment"]

server/app.py ADDED

	@@ -0,0 +1,637 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+"""
+FastAPI application for the ProcureRL Environment.
+This module creates an HTTP server that exposes the ProcureRLEnvironment
+over HTTP and WebSocket endpoints, compatible with EnvClient.
+Endpoints:
+    - POST /reset: Reset the environment
+    - POST /step: Execute an action
+    - GET /state: Get current environment state
+    - GET /schema: Get action/observation schemas
+    - WS /ws: WebSocket endpoint for persistent sessions
+Usage:
+    # Development (with auto-reload):
+    uvicorn server.app:app --reload --host 0.0.0.0 --port 7860
+    # Production:
+    uvicorn server.app:app --host 0.0.0.0 --port 7860 --workers 4
+    # Or run directly:
+    python -m server.app
+"""
+import sys
+import os
+import json
+sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+try:
+    from openenv.core.env_server.http_server import create_app
+except Exception as e:
+    raise ImportError(
+        "openenv is required for the web interface. Install dependencies with '\n    uv sync\n'"
+    ) from e
+import gradio as gr
+from models import NegotiationAction, NegotiationObservation, NegotiationState
+from server.Procure_RL_environment import ProcureRLEnvironment
+_env_instance = ProcureRLEnvironment()
+def build_custom_gradio_ui(
+    web_manager,
+    action_fields,
+    metadata,
+    is_chat_env,
+    title,
+    quick_start_md,
+):
+    """Custom Gradio UI with interactive negotiation simulation."""
+    readme_content = _load_readme_content(metadata)
+    display_title = metadata.name if metadata else title
+    # Example actions for the Example tab
+    EXAMPLE_1 = {
+        "move_type": "make_offer",
+        "terms": {"price": 48000},
+        "message": "I value our partnership and believe we can reach a fair agreement together. Let's work collaboratively to find a solution.",
+    }
+    EXAMPLE_2 = {
+        "move_type": "make_offer",
+        "terms": {"price": 45000},
+        "message": "We appreciate your flexibility. Here's our counter-offer to move us closer to a mutual agreement.",
+    }
+    # Agent strategies for auto-play
+    AGENT_STRATEGY = [
+        (
+            "make_offer",
+            {"price": 48000},
+            "I value our partnership and believe we can reach a fair agreement together.",
+        ),
+        (
+            "make_offer",
+            {"price": 46000},
+            "I appreciate your movement. Let's see if we can meet in the middle.",
+        ),
+        (
+            "make_offer",
+            {"price": 44000},
+            "We're getting closer. I think we can finalize this at a fair price for both parties.",
+        ),
+        (
+            "make_offer",
+            {"price": 42000},
+            "I believe we've found a good deal. Let's accept these terms.",
+        ),
+        ("accept", {}, ""),
+    ]
+    async def reset_env(task_id, seed):
+        try:
+            data = await web_manager.reset_environment(
+                {"task_id": task_id, "seed": int(seed)}
+            )
+            obs_d = _format_observation_full(data)
+            conv_h = _build_conversation_hist([])
+            price_d = _build_price_display(0, 52000, 36000, 52000)
+            status = "✅ Reset successful! Make your offer."
+            json_d = json.dumps(data, indent=2)
+            return obs_d, conv_h, price_d, status, json_d
+        except Exception as e:
+            return f"Error: {e}", "", "", f"Error: {e}", ""
+    async def step_manual(move_type, terms_str, message, conversation_state):
+        try:
+            terms = json.loads(terms_str) if terms_str.strip() else {}
+            action_data = {"move_type": move_type, "terms": terms, "message": message}
+            data = await web_manager.step_environment(action_data)
+            # Update conversation
+            new_conv = conversation_state.copy() if conversation_state else []
+            new_conv.append(
+                {
+                    "role": "you",
+                    "message": message or f"[{move_type}: {terms}]",
+                    "terms": terms,
+                }
+            )
+            if not data.get("observation", {}).get("done"):
+                supplier_msg = data.get("observation", {}).get("supplier_message", "")
+                new_conv.append(
+                    {
+                        "role": "supplier",
+                        "message": supplier_msg,
+                        "terms": data.get("observation", {}).get("current_offer", {}),
+                    }
+                )
+            # Get price info for chart
+            obs = data.get("observation", {})
+            current_price = obs.get("current_offer", {}).get("price", 0)
+            opponent_opening = 52000  # Placeholder, currently unused; not yet read from state
+            reward = obs.get("reward")
+            done = obs.get("done", False)
+            status_msg = f"Step complete! Round {obs.get('round_number', 0)}/{obs.get('max_rounds', 6)}"
+            if done and reward is not None:
+                status_msg = f"🏁 Deal done! Final score: {reward:.4f}"
+            elif done:
+                status_msg = "❌ No deal reached."
+            obs_display = _format_observation_full(data)
+            conv_hist = _build_conversation_hist(new_conv)
+            price_disp = _build_price_display(
+                obs.get("round_number", 0), current_price, 36000, 52000
+            )
+            json_data = json.dumps(data, indent=2)
+            return obs_display, conv_hist, price_disp, status_msg, json_data
+        except json.JSONDecodeError:
+            return "", "", "", "❌ Invalid JSON in terms field", ""
+        except Exception as e:
+            return "", "", "", f"Error: {e}", f"Error: {str(e)}"
+    async def run_agent_example(task_id="single_issue", seed=42):
+        try:
+            # Reset first
+            await web_manager.reset_environment({"task_id": task_id, "seed": seed})
+            conv = []
+            steps_log = []
+            price_points = []
+            for i, (move_type, terms, message) in enumerate(AGENT_STRATEGY):
+                action_data = {
+                    "move_type": move_type,
+                    "terms": terms,
+                    "message": message,
+                }
+                data = await web_manager.step_environment(action_data)
+                obs = data.get("observation", {})
+                current_price = obs.get("current_offer", {}).get("price", 0)
+                price_points.append(current_price)
+                conv.append(
+                    {
+                        "role": "you",
+                        "message": message or f"[{move_type}: {terms}]",
+                        "terms": terms,
+                    }
+                )
+                steps_log.append(
+                    f"**Step {i + 1}:** `{move_type}` → ${current_price:,.0f}"
+                )
+                if obs.get("done"):
+                    steps_log.append(
+                        f"✅ Deal completed! Reward: **{obs.get('reward', 0):.4f}**"
+                    )
+                    conv.append(
+                        {
+                            "role": "supplier",
+                            "message": obs.get("supplier_message", ""),
+                            "terms": obs.get("current_offer", {}),
+                        }
+                    )
+                    break
+                supplier_msg = obs.get("supplier_message", "")
+                conv.append(
+                    {
+                        "role": "supplier",
+                        "message": supplier_msg,
+                        "terms": obs.get("current_offer", {}),
+                    }
+                )
+            return (
+                _build_agent_demo_result(steps_log, conv, price_points),
+                json.dumps(data, indent=2),
+                "✅ Agent demo complete!",
+            )
+        except Exception as e:
+            # Three values, matching the (result, json, status) outputs of the demo tab.
+            return f"Error: {e}", "", f"Error: {e}"
+    def apply_example(example_data):
+        return (
+            example_data["move_type"],
+            json.dumps(example_data["terms"]),
+            example_data["message"],
+        )
+    def _format_observation_full(data):
+        """Format observation as rich markdown."""
+        if not data:
+            return "No data"
+        obs = data.get("observation", data)
+        lines = [f"## 🎯 Round {obs.get('round_number', 0)}/{obs.get('max_rounds', 6)}"]
+        lines.append(f"**Task:** `{obs.get('task_id', '')}`")
+        lines.append(
+            f"**Rapport:** {_get_rapport_emoji(obs.get('rapport_hint', 'neutral'))} {obs.get('rapport_hint', 'neutral')}"
+        )
+        if obs.get("done"):
+            r = obs.get("reward")
+            lines.append("\n### 🏁 Episode Complete!")
+            if r is not None:
+                lines.append(f"**Final Score:** `{r:.4f}`")
+            return "\n".join(lines)
+        lines.append("\n### 💬 Supplier says:")
+        lines.append(f"> {obs.get('supplier_message', '')}")
+        offer = obs.get("current_offer", {})
+        if offer:
+            lines.append("\n### 📋 Current Offer:")
+            for k, v in offer.items():
+                lines.append(
+                    f"- **{k.title()}:** `{v:,.2f}`"
+                    if isinstance(v, float)
+                    else f"- **{k.title()}:** `{v}`"
+                )
+        constraints = obs.get("buyer_constraints", {})
+        if constraints:
+            lines.append("\n### 🎯 Your Targets:")
+            for k, v in constraints.items():
+                if isinstance(v, dict):
+                    lines.append(
+                        f"- **{k.title()}:** target `${v.get('target', 'N/A'):,}` | worst `${v.get('worst', 'N/A'):,}`"
+                    )
+        return "\n".join(lines)
+    def _get_rapport_emoji(rapport):
+        if rapport == "positive":
+            return "😊"
+        elif rapport == "negative":
+            return "😤"
+        return "😐"
+    def _build_conversation_hist(conv):
+        """Build the conversation history as Markdown."""
+        if not conv:
+            return "**Conversation will appear here...**\n\nMake your first offer to start the negotiation!"
+        lines = ["## 💬 Conversation History\n"]
+        for msg in conv:
+            if msg["role"] == "you":
+                lines.append(f"**🧑 You:** {msg['message']}")
+                if msg.get("terms"):
+                    lines.append(f"   → Terms: `{json.dumps(msg['terms'])}`")
+            else:
+                lines.append(f"**🏪 Supplier:** {msg['message']}")
+        return "\n".join(lines)
+    def _build_price_display(round_num, current_price, target, opening):
+        """Build price tracker display."""
+        range_price = opening - target
+        progress = (
+            ((opening - current_price) / range_price * 100) if range_price > 0 else 0
+        )
+        progress = max(0, min(100, progress))
+        bar = "█" * int(progress / 5) + "░" * (20 - int(progress / 5))
+        lines = [
+            "## 📊 Price Tracker\n",
+            f"Opening: `${opening:,.0f}`",
+            f"Target:  `${target:,.0f}`",
+            f"Current: `${current_price:,.0f}`",
+            f"\n**Progress:** `{progress:.1f}%`",
+            f"\n[{bar}]",
+        ]
+        return "\n".join(lines)
+    def _build_agent_demo_result(steps_log, conv, price_points):
+        """Build agent demo result display."""
+        lines = ["## 🤖 Agent Negotiation Demo\n"]
+        lines.append("Watch how a strategic agent negotiates:\n")
+        lines.append("### 📜 Steps:")
+        lines.extend(steps_log)
+        lines.append("\n### 💬 Full Conversation:")
+        for msg in conv:
+            if msg["role"] == "you":
+                lines.append(f"**🧑 You:** {msg['message']}")
+            else:
+                lines.append(f"**🏪 Supplier:** {msg['message']}")
+        if price_points:
+            lines.append("\n### 📈 Price Journey:")
+            lines.append(f"`{' → '.join(f'${p:,.0f}' for p in price_points)}`")
+        return "\n".join(lines)
+    with gr.Blocks(title=display_title) as demo:
+        gr.Markdown(f"# 🤝 {display_title}")
+        gr.Markdown("### Interactive Procurement Negotiation Simulation")
+        with gr.Tabs() as tabs:
+            with gr.TabItem("🎮 Play Now"):
+                # Interactive tab where the user plays against the opponent.
+                with gr.Row():
+                    with gr.Column(scale=2):
+                        conversation_display = gr.Markdown(
+                            "*Click Reset to start a new negotiation!*"
+                        )
+                        price_tracker = gr.Markdown(
+                            "## 📊 Price Tracker\n*Reset to see price tracker*"
+                        )
+                        obs_display = gr.Markdown("*Reset to see current state*")
+                    with gr.Column(scale=1):
+                        gr.Markdown("### ⚙️ Controls")
+                        task_dropdown = gr.Dropdown(
+                            choices=["single_issue", "multi_issue", "adversarial"],
+                            value="single_issue",
+                            label="Task",
+                            info="Choose which negotiation scenario",
+                        )
+                        seed_input = gr.Number(
+                            value=42,
+                            label="Seed",
+                            info="Random seed for reproducibility",
+                        )
+                        move_type_input = gr.Textbox(
+                            label="Move Type",
+                            info="make_offer | accept | reject | bundle",
+                            value="make_offer",
+                        )
+                        terms_input = gr.Textbox(
+                            label="Terms (JSON)",
+                            info='Example: {"price": 45000}',
+                            value='{"price": 48000}',
+                        )
+                        message_input = gr.Textbox(
+                            label="Your Message",
+                            info="Be collaborative for better rapport!",
+                            value="I value our partnership and believe we can reach a fair agreement.",
+                            lines=2,
+                        )
+                        gr.Markdown("**💡 Quick Examples:**")
+                        with gr.Row():
+                            eg1_btn = gr.Button(
+                                "😊 Friendly", variant="secondary", size="sm"
+                            )
+                            eg2_btn = gr.Button(
+                                "💼 Professional", variant="secondary", size="sm"
+                            )
+                            eg3_btn = gr.Button(
+                                "⚡ Counter-Offer", variant="secondary", size="sm"
+                            )
+                        with gr.Row():
+                            step_btn = gr.Button("📤 Submit Offer", variant="primary")
+                            accept_btn = gr.Button("✅ Accept Deal", variant="primary")
+                            reset_btn = gr.Button("🔄 Reset", variant="secondary")
+                        status_output = gr.Textbox(
+                            label="Status", interactive=False, lines=1
+                        )
+                        with gr.Accordion("📋 Raw JSON", open=False):
+                            raw_json = gr.Code(
+                                label="", language="json", interactive=False, lines=10
+                            )
+                # Example messages for quick fill
+                FRIENDLY_EX = (
+                    "make_offer",
+                    '{"price": 48000}',
+                    "I truly value our partnership and believe we can find a fair solution that benefits both parties. I'm flexible and want to work with you.",
+                )
+                PROF_EX = (
+                    "make_offer",
+                    '{"price": 46000}',
+                    "Based on market research and our long-term relationship potential, I believe $46,000 is a fair price. What do you think?",
+                )
+                COUNTER_EX = (
+                    "make_offer",
+                    '{"price": 44000}',
+                    "We've made good progress. I can meet you at $44,000 if you can agree to these terms today.",
+                )
+                def get_friendly():
+                    return FRIENDLY_EX[0], FRIENDLY_EX[1], FRIENDLY_EX[2]
+                def get_prof():
+                    return PROF_EX[0], PROF_EX[1], PROF_EX[2]
+                def get_counter():
+                    return COUNTER_EX[0], COUNTER_EX[1], COUNTER_EX[2]
+                eg1_btn.click(
+                    fn=get_friendly,
+                    outputs=[move_type_input, terms_input, message_input],
+                )
+                eg2_btn.click(
+                    fn=get_prof,
+                    outputs=[move_type_input, terms_input, message_input],
+                )
+                eg3_btn.click(
+                    fn=get_counter,
+                    outputs=[move_type_input, terms_input, message_input],
+                )
+                async def do_reset(task_id, seed):
+                    return await reset_env(task_id, seed)
+                reset_btn.click(
+                    fn=do_reset,
+                    inputs=[task_dropdown, seed_input],
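+                    # Order must match reset_env's return tuple:
+                    # (observation, conversation, price tracker, status, raw JSON).
+                    outputs=[
+                        obs_display,
+                        conversation_display,
+                        price_tracker,
+                        status_output,
+                        raw_json,
+                    ],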
+                )
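+                async def do_step(mt, ts, msg):
+                    # NOTE: an empty history is passed on every call, so the
+                    # conversation panel shows only the latest exchange.
+                    return await step_manual(mt, ts, msg, [])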
+                step_btn.click(
+                    fn=do_step,
+                    inputs=[move_type_input, terms_input, message_input],
+                    outputs=[
+                        obs_display,
+                        conversation_display,
+                        price_tracker,
+                        status_output,
+                        raw_json,
+                    ],
+                )
+                async def do_accept():
+                    return await step_manual("accept", "{}", "", [])
+                accept_btn.click(
+                    fn=do_accept,
+                    outputs=[
+                        obs_display,
+                        conversation_display,
+                        price_tracker,
+                        status_output,
+                        raw_json,
+                    ],
+                )
+            with gr.TabItem("🤖 Watch Agent"):
+                # Example tab showing the agent negotiation demo.
+                gr.Markdown("### Watch a Strategic Agent Negotiate")
+                gr.Markdown(
+                    "This demo shows how an LLM agent would approach the negotiation with collaborative language and strategic pricing."
+                )
+                with gr.Row():
+                    task_selector = gr.Dropdown(
+                        choices=["single_issue", "multi_issue", "adversarial"],
+                        value="single_issue",
+                        label="Select Task",
+                    )
+                    run_btn = gr.Button(
+                        "▶️ Run Agent Demo", variant="primary", size="lg"
+                    )
+                agent_result = gr.Markdown(
+                    "*Click 'Run Agent Demo' to watch the agent negotiate*"
+                )
+                agent_json = gr.Code(
+                    label="Full JSON", language="json", interactive=False, lines=15
+                )
+                agent_status = gr.Textbox(label="Status", interactive=False)
+                async def do_agent_run(tid):
+                    return await run_agent_example(tid, 42)
+                run_btn.click(
+                    fn=do_agent_run,
+                    inputs=[task_selector],
+                    outputs=[agent_result, agent_json, agent_status],
+                )
+            with gr.TabItem("📖 Instructions"):
+                # Instructions tab.
+                gr.Markdown("""
+                ## 🎮 How to Play
+                ### 1. Choose Your Task
+                - **single_issue**: Negotiate only the price (easiest)
+                - **multi_issue**: Negotiate price + payment terms (medium)
+                - **adversarial**: Negotiate price + payment + support (hardest)
+                ### 2. Make Offers
+                - **Move Type**: `make_offer` to propose terms, `accept` to take current deal, `reject` to walk away
+                - **Terms**: JSON with your offered price (and payment_days for multi_issue/adversarial)
+                - **Message**: Be collaborative! Use words like "partnership", "mutual", "flexible" to increase rapport
+                ### 3. Watch the Response
+                - The supplier will counter-offer or accept
+                - Your **rapport** changes based on your language quality
+                - Higher rapport → opponent gives better concessions
+                ### 4. Goal
+                - Get the price as close to your target (shown in observations) as possible
+                - Use fewer rounds for a better efficiency score
+                - **Don't make 2+ consecutive concessions** in adversarial mode!
+                ## 🎯 Quick Tips
+                | Do | Don't |
+                |---|---|
+                | Use collaborative language | Use aggressive language ("final offer", "ultimatum") |
+                | Make strategic concessions | Concede every round (adversarial mode) |
+                | Offer Net-30 payment (multi_issue) | Ignore payment terms |
+                | Accept when terms are good | Wait until max rounds |
+                ## 🤖 Agent Demo
+                The "Watch Agent" tab shows how a strategic agent negotiates step-by-step.
+                """)
+        # Quick Start and README accordions
+        with gr.Accordion("📘 Quick Start Guide", open=False):
+            if quick_start_md:
+                gr.Markdown(quick_start_md)
+        with gr.Accordion("📚 Full README", open=False):
+            gr.Markdown(readme_content)
+    return demo
+def _load_readme_content(metadata):
+    """Load README content from metadata or filesystem."""
+    if metadata and hasattr(metadata, "readme_content") and metadata.readme_content:
+        return metadata.readme_content
+    try:
+        from pathlib import Path
+        readme_path = Path("/app/README.md")
+        if readme_path.exists():
+            return readme_path.read_text(encoding="utf-8")
+    except Exception:
+        pass
+    return "No README available."
+def _format_observation(data):
+    """Format observation as markdown for display."""
+    if not data:
+        return "No data"
+    obs = data.get("observation", data)
+    lines = []
+    task_id = obs.get("task_id", "")
+    round_num = obs.get("round_number", 0)
+    max_rounds = obs.get("max_rounds", 0)
+    done = obs.get("done", False)
+    reward = obs.get("reward")
+    lines.append(f"### Round {round_num}/{max_rounds}")
+    lines.append(f"**Task:** {task_id}")
+    lines.append(f"**Done:** {done}")
+    if reward is not None:
+        lines.append(f"**Reward:** {reward:.4f}")
+    supplier_msg = obs.get("supplier_message", "")
+    if supplier_msg:
+        lines.append(f"\n**Supplier:** {supplier_msg}")
+    current_offer = obs.get("current_offer", {})
+    if current_offer:
+        lines.append(f"\n**Current Offer:** {json.dumps(current_offer)}")
+    rapport = obs.get("rapport_hint", "neutral")
+    lines.append(f"\n**Rapport:** {rapport}")
+    return "\n".join(lines)
+app = create_app(
+    lambda: _env_instance,
+    NegotiationAction,
+    NegotiationObservation,
+    env_name="ProcureRL",
+    max_concurrent_envs=1,
+    gradio_builder=build_custom_gradio_ui,
+)
+def main():
+    """Run the ProcureRL server with uvicorn."""
+    import uvicorn
+    port = int(os.getenv("PORT", 7860))
+    uvicorn.run("server.app:app", host="0.0.0.0", port=port)
+if __name__ == "__main__":
+    main()
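
Review note on the price tracker: `_build_price_display` is always called with a hard-coded target of 36000 and opening of 52000, whichever task is selected. The sketch below reproduces the progress arithmetic from the diff as a pure function so the clamping is easy to see. The name `price_progress` and the tuple return type are hypothetical.

```python
def price_progress(current_price: float, target: float, opening: float) -> tuple[float, str]:
    """Return (percent progress from opening toward target, 20-cell text bar).

    Mirrors the arithmetic in _build_price_display; the helper itself is
    illustrative and not part of the commit.
    """
    price_range = opening - target
    progress = ((opening - current_price) / price_range * 100) if price_range > 0 else 0.0
    progress = max(0.0, min(100.0, progress))  # clamp to [0, 100]
    filled = int(progress / 5)  # each cell represents 5%
    bar = "█" * filled + "░" * (20 - filled)
    return progress, bar


# Halfway between the hard-coded opening (52000) and target (36000).
print(price_progress(44000, 36000, 52000))
# A price above the opening clamps to 0% progress.
print(price_progress(100000, 36000, 52000))
```

If adversarial prices sit in the 80000 to 120000 range used by `test_calibration.py`, the tracker would presumably stay at 0% for that task, so passing task-specific bounds may be worth considering.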

server/requirements.txt ADDED

	@@ -0,0 +1,6 @@

+openenv[core]>=0.2.0
+fastapi>=0.115.0
+uvicorn>=0.24.0

test_calibration.py ADDED

	@@ -0,0 +1,110 @@

+#!/usr/bin/env python3
+# test_calibration.py
+import sys
+sys.path.insert(0, ".")
+from server.Procure_RL_environment import ProcureRLEnvironment
+from models import NegotiationAction
+import random
+def run_random_agent(task_id, seed=42):
+    """Simulate a naive baseline agent that makes uniformly random offers."""
+    env = ProcureRLEnvironment()
+    obs = env.reset(seed=seed, task_id=task_id)
+    rng = random.Random(seed + 1)
+    config = {
+        "single_issue": {"price": (38000, 52000)},
+        "multi_issue": {"price": (40000, 58000), "payment_days": (30, 90)},
+        "adversarial": {
+            "price": (80000, 120000),
+            "payment_days": (30, 90),
+            "support_hours": (80, 200),
+        },
+    }
+    for _ in range(15):
+        terms = {}
+        for issue, (lo, hi) in config[task_id].items():
+            terms[issue] = rng.uniform(lo, hi)
+        action = NegotiationAction(
+            move_type="make_offer", terms=terms, message="Here is my offer."
+        )
+        obs = env.step(action)
+        if obs.done:
+            return obs.reward or 0.0
+    # Force accept at end
+    obs = env.step(NegotiationAction(move_type="accept", terms={}, message=""))
+    return obs.reward or 0.0
+def run_good_agent(task_id, seed=42):
+    """Simulate a smart agent with collaborative language and adaptive pricing"""
+    env = ProcureRLEnvironment()
+    obs = env.reset(seed=seed, task_id=task_id)
+    # Get opponent's opening to adapt our target
+    opening_price = obs.current_offer.get("price", 52000)
+    # Get opponent's floor (never go below floor or opponent won't accept)
+    floor = (
+        env._opponent.price_floor
+        if hasattr(env._opponent, "price_floor")
+        else opening_price * 0.80
+    )
+    # Adaptive targets that stay above floor
+    if task_id == "single_issue":
+        # Target 22% below opening, but at least 5% above the opponent's floor
+        target_price = max(opening_price * 0.78, floor * 1.05)
+        targets = {"price": target_price}
+    elif task_id == "multi_issue":
+        # Target 20% below opening, above floor
+        target_price = max(opening_price * 0.80, floor * 1.05)
+        targets = {"price": target_price, "payment_days": 45}
+    else:  # adversarial
+        # Target 20% below opening, above floor
+        target_price = max(opening_price * 0.80, floor * 1.05)
+        targets = {"price": target_price, "payment_days": 50, "support_hours": 160}
+    for _ in range(10):
+        action = NegotiationAction(
+            move_type="make_offer",
+            terms=targets,
+            message="I value our partnership and believe this offer reflects fair market value for both parties. I'm flexible and want to find a solution that works for us both.",
+        )
+        obs = env.step(action)
+        if obs.done:
+            return obs.reward or 0.0
+    obs = env.step(NegotiationAction(move_type="accept", terms={}, message=""))
+    return obs.reward or 0.0
+print("=== Score Spread Calibration ===")
+for task in ["single_issue", "multi_issue", "adversarial"]:
+    random_scores = [run_random_agent(task, seed=i) for i in range(5)]
+    good_scores = [run_good_agent(task, seed=i) for i in range(5)]
+    random_avg = sum(random_scores) / len(random_scores)
+    good_avg = sum(good_scores) / len(good_scores)
+    spread = good_avg - random_avg
+    print(f"\n{task}:")
+    print(
+        f"  Random agent:      {[round(s, 3) for s in random_scores]} avg={random_avg:.3f}"
+    )
+    print(
+        f"  Strategic agent:   {[round(s, 3) for s in good_scores]} avg={good_avg:.3f}"
+    )
+    print(f"  Spread:            {spread:.3f}")
+    if spread < 0.05:
+        print(f"  ⚠️  WARNING: spread too small — environment may be trivial or broken")
+    elif good_avg < 0.10:
+        print(f"  ⚠️  WARNING: even good agent scores very low — too hard")
+    else:
+        print(f"  ✅ Score spread looks healthy")
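
Review note on the calibration check: the verdict at the end of the script is only printed, so a regression would not fail a CI run. The sketch below moves the same decision into a function that returns a label. The thresholds (0.05 and 0.10) come from the script above. The function name, parameter names, and label strings are hypothetical.

```python
from statistics import mean


def classify_spread(random_scores, good_scores, min_spread=0.05, min_good_avg=0.10):
    """Return (spread, label) using the same checks as the calibration script.

    The label is one of "spread_too_small", "too_hard", or "healthy";
    these strings are illustrative, not taken from the commit.
    """
    good_avg = mean(good_scores)
    spread = good_avg - mean(random_scores)
    if spread < min_spread:
        return spread, "spread_too_small"
    if good_avg < min_good_avg:
        return spread, "too_hard"
    return spread, "healthy"


# Made-up scores for illustration only, not output from the environment.
print(classify_spread([0.1, 0.2], [0.5, 0.6]))
```

A test could then assert that the label is `"healthy"` for each task instead of relying on someone reading the printed output.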

test_graders.py ADDED

	@@ -0,0 +1,76 @@

+#!/usr/bin/env python3
+# test_graders.py
+import sys
+sys.path.insert(0, ".")
+from graders import grade_single_issue, grade_multi_issue, grade_adversarial
+print("=== single_issue grader ===")
+# Perfect deal — should be near 1.0
+print("Perfect:", grade_single_issue({"price": 38000}, True, 1, 6))
+# Worst acceptable deal
+print("Worst deal:", grade_single_issue({"price": 44000}, True, 6, 6))
+# No deal
+print("No deal:", grade_single_issue({}, False, 0, 6))
+# Late deal — efficiency penalty
+print("Late deal:", grade_single_issue({"price": 40000}, True, 5, 6))
+# Boundary — price above floor
+print("Above floor:", grade_single_issue({"price": 46000}, True, 3, 6))
+print("\n=== multi_issue grader ===")
+# Best possible
+print("Best:", grade_multi_issue({"price": 40000, "payment_days": 30}, True, 1, 8))
+# Price good, payment bad
+print(
+    "Price only:", grade_multi_issue({"price": 40000, "payment_days": 90}, True, 4, 8)
+)
+# Payment good, price bad
+print(
+    "Payment only:", grade_multi_issue({"price": 58000, "payment_days": 30}, True, 4, 8)
+)
+# No deal
+print("No deal:", grade_multi_issue({}, False, 0, 8))
+print("\n=== adversarial grader ===")
+# Best possible
+print(
+    "Best:",
+    grade_adversarial(
+        {"price": 80000, "payment_days": 30, "support_hours": 200}, True, 1, False, 10
+    ),
+)
+# Survival floor — bad deal still completed
+print(
+    "Bad deal (floor):",
+    grade_adversarial(
+        {"price": 120000, "payment_days": 90, "support_hours": 80}, True, 10, True, 10
+    ),
+)
+# No deal
+print("No deal:", grade_adversarial({}, False, 0, False, 10))
+# Consecutive concession penalty
+print(
+    "Pattern penalty:",
+    grade_adversarial(
+        {"price": 90000, "payment_days": 60, "support_hours": 140}, True, 5, True, 10
+    ),
+)
+print("\n=== Verify all scores are in [0.0, 1.0] ===")
+all_scores = [
+    grade_single_issue({"price": 38000}, True, 1, 6),
+    grade_single_issue({"price": 44000}, True, 6, 6),
+    grade_single_issue({}, False, 0, 6),
+    grade_multi_issue({"price": 40000, "payment_days": 30}, True, 1, 8),
+    grade_multi_issue({}, False, 0, 8),
+    grade_adversarial(
+        {"price": 80000, "payment_days": 30, "support_hours": 200}, True, 1, False, 10
+    ),
+    grade_adversarial({}, False, 0, False, 10),
+]
+print("All scores:", all_scores)
+assert all(0.0 <= s <= 1.0 for s in all_scores), (
+    f"FAIL: scores outside [0.0, 1.0]: {all_scores}"
+)
+print("All scores in range: PASS")
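The range check at the end of this script could also be factored into a small reusable helper. The sketch below assumes only that a grader is a callable returning a float. `toy_grader` and `out_of_range` are hypothetical names that do not exist in the repository, and the scoring inside `toy_grader` is an illustrative stand-in, not the logic in `graders.py`.

```python
from typing import Callable, List, Sequence, Tuple


def toy_grader(terms: dict, deal_reached: bool, rounds_used: int, max_rounds: int) -> float:
    """Hypothetical stand-in for a grader; NOT the logic in graders.py."""
    if not deal_reached or "price" not in terms:
        return 0.0
    # Linear price score between an assumed best (38000) and worst (44000) price.
    price_score = (44000 - terms["price"]) / (44000 - 38000)
    price_score = min(1.0, max(0.0, price_score))
    # Assumed efficiency factor: earlier deals keep more of the price score.
    efficiency = 1.0 - 0.5 * (rounds_used - 1) / max(1, max_rounds - 1)
    return round(price_score * efficiency, 4)


def out_of_range(cases: Sequence[Tuple], grader: Callable[..., float]) -> List[Tuple]:
    """Return (args, score) pairs whose score falls outside [0.0, 1.0]."""
    failures = []
    for args in cases:
        score = grader(*args)
        if not 0.0 <= score <= 1.0:
            failures.append((args, score))
    return failures


cases = [
    ({"price": 38000}, True, 1, 6),
    ({"price": 44000}, True, 6, 6),
    ({"price": 46000}, True, 3, 6),
    ({}, False, 0, 6),
]
failures = out_of_range(cases, toy_grader)
print("Out-of-range cases:", failures)
```

Returning the offending cases, instead of asserting over a flat list of scores, means a failure message can name the exact inputs that produced an invalid score.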

test_rl_properties.py ADDED

	@@ -0,0 +1,119 @@

+#!/usr/bin/env python3
+# test_rl_properties.py
+import sys
+sys.path.insert(0, ".")
+from server.Procure_RL_environment import ProcureRLEnvironment
+from models import NegotiationAction
+print("=== Test 1: Reproducibility ===")
+env1 = ProcureRLEnvironment()
+obs1 = env1.reset(seed=42, task_id="single_issue")
+env2 = ProcureRLEnvironment()
+obs2 = env2.reset(seed=42, task_id="single_issue")
+assert obs1.supplier_message == obs2.supplier_message, (
+    "FAIL: same seed gives different opening"
+)
+print("Same seed = same opening message: PASS")
+print(f"Opening: {obs1.supplier_message[:80]}...")
+print("\n=== Test 2: Different seeds give different behavior ===")
+# Informational only: prints both opening offers without asserting that they differ.
+env3 = ProcureRLEnvironment()
+obs3 = env3.reset(seed=99, task_id="single_issue")
+print(f"Seed 42 opening price: {obs1.current_offer}")
+print(f"Seed 99 opening price: {obs3.current_offer}")
+print("\n=== Test 3: Rapport affects opponent ===")
+# Agent with collaborative language
+env_collab = ProcureRLEnvironment()
+env_collab.reset(seed=42, task_id="single_issue")
+action_collab = NegotiationAction(
+    move_type="make_offer",
+    terms={"price": 40000},
+    message="I genuinely value a long-term partnership and believe this price reflects our mutual interests.",
+)
+obs_c = env_collab.step(action_collab)
+rapport_collab = env_collab.state.rapport_score
+# Agent with aggressive language
+env_aggro = ProcureRLEnvironment()
+env_aggro.reset(seed=42, task_id="single_issue")
+action_aggro = NegotiationAction(
+    move_type="make_offer",
+    terms={"price": 40000},
+    message="This is my final offer. Non-negotiable. Take it or leave it.",
+)
+obs_a = env_aggro.step(action_aggro)
+rapport_aggro = env_aggro.state.rapport_score
+print(f"Collaborative rapport: {rapport_collab:.3f}")
+print(f"Aggressive rapport: {rapport_aggro:.3f}")
+assert rapport_collab > rapport_aggro, "FAIL: rapport not sensitive to language"
+print("Language affects rapport: PASS")
+print("\n=== Test 4: Sequential decisions matter ===")
+env = ProcureRLEnvironment()
+obs = env.reset(seed=42, task_id="single_issue")
+print(f"Round 0: {obs.current_offer}")
+# Make 3 consecutive concessions
+for i in range(3):
+    action = NegotiationAction(
+        move_type="make_offer",
+        terms={"price": 40000 + i * 1000},
+        message="We can move slightly on price.",
+    )
+    obs = env.step(action)
+    print(
+        f"Round {i + 1}: consecutive_concessions={env.state.consecutive_concessions}, reward={obs.reward}"
+    )
+    if obs.done:
+        break
+print("Sequential state tracking: PASS")
+print("\n=== Test 5: Delayed reward ===")
+env = ProcureRLEnvironment()
+env.reset(seed=42, task_id="single_issue")
+rewards = []
+for i in range(5):
+    action = NegotiationAction(
+        move_type="make_offer",
+        terms={"price": 41000},
+        message="I think this is a fair price for both parties.",
+    )
+    obs = env.step(action)
+    rewards.append(obs.reward)
+    if obs.done:
+        break
+print(f"Intermediate rewards: {rewards[:-1]}")
+print(f"Final reward: {rewards[-1]}")
+assert all(r == 0.0 for r in rewards[:-1]) or rewards[-1] > 0, (
+    "FAIL: expected zero intermediate rewards or a positive final reward"
+)
+print("Reward is delayed to episode end: PASS")
+print("\n=== Test 6: Accept terminates correctly ===")
+env = ProcureRLEnvironment()
+env.reset(seed=42, task_id="single_issue")
+# First make an offer
+env.step(
+    NegotiationAction(
+        move_type="make_offer", terms={"price": 43000}, message="Reasonable offer."
+    )
+)
+# Then accept current terms
+obs = env.step(NegotiationAction(move_type="accept", terms={}, message=""))
+print(f"Accept: done={obs.done}, reward={obs.reward:.4f}")
+assert obs.done, "FAIL: accept should terminate episode"
+assert obs.reward >= 0.0, "FAIL: reward should be non-negative on accept"
+print("Accept terminates episode: PASS")
+print("\n=== Test 7: Reset produces clean state ===")
+env.reset(seed=42, task_id="multi_issue")
+assert env.state.round_number == 0
+assert not env.state.deal_reached
+assert env.state.cumulative_reward == 0.0
+print("Reset produces clean state: PASS")
+print("\n=== All RL property tests passed ===")
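Test 1 above only holds if the environment derives all of its randomness from the seed passed to `reset`. A minimal sketch of that pattern follows, using a hypothetical `ToyEnv` class (not `ProcureRLEnvironment`, whose internals are not shown here) and an assumed opening-offer range chosen purely for illustration.

```python
import random


class ToyEnv:
    """Hypothetical environment; illustrates a per-instance seeded RNG only."""

    def reset(self, seed: int) -> int:
        # A private Random instance keeps episodes independent of global RNG state.
        self._rng = random.Random(seed)
        self.round_number = 0
        # Assumed opening-offer range, for illustration only.
        self.opening_offer = self._rng.randrange(45000, 55000, 500)
        return self.opening_offer


a, b = ToyEnv(), ToyEnv()
random.seed(0)
first = a.reset(seed=42)
random.random()  # perturb the global RNG between the two resets
second = b.reset(seed=42)
print("Same seed, same opening offer:", first == second)
```

Keeping the generator on the instance, instead of calling the module-level `random` functions, is one way to make two environments with the same seed agree even when other code consumes random numbers in between.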

uv.lock ADDED

The diff for this file is too large to render. See raw diff

web_ui.png ADDED