title: ProcureRL Environment
emoji: 🤝
colorFrom: green
colorTo: blue
sdk: docker
pinned: false
app_port: 7860
base_path: /web
tags:
  - openenv
  - negotiation
  - procurement
  - rl
  - real-world

ProcureRL: Procurement Negotiation RL Environment

An OpenEnv-compliant RL environment where an LLM agent learns to negotiate procurement deals against scripted supplier opponents with language-sensitive behavior.

The Key Innovation: Language-Sensitive Opponent

The opponent's concession rate is directly affected by the quality of the agent's natural language:

  • Collaborative language ("let's work together", "mutual benefit") → increases rapport → opponent concedes more
  • Neutral language → opponent concedes at baseline rate
  • Aggressive language ("final offer", "take it or leave it") → rapport drops → opponent hardens

This makes an LLM genuinely required: output quality directly affects negotiation outcomes.

Quick Start

from server.Procure_RL_environment import ProcureRLEnvironment
from models import NegotiationAction

env = ProcureRLEnvironment()
obs = env.reset(task_id="single_issue", seed=42)

print(f"Supplier: {obs.supplier_message}")
print(f"Offer: {obs.current_offer}")
print(f"Your target: {obs.buyer_constraints}")

action = NegotiationAction(
    move_type="make_offer",
    terms={"price": 42000},
    message="Let's find a mutually beneficial solution."
)
obs = env.step(action)
print(f"Response: {obs.supplier_message}")
print(f"New offer: {obs.current_offer}")

Web Interface Example

The web interface at /web provides a visual playground. Here's how to use it:

Step 1: Reset the Environment

Click Reset to start a new negotiation episode. You can customize the reset by passing JSON:

{"task_id": "single_issue", "seed": 42}

Available tasks:

  • single_issue: Price-only negotiation (6 rounds max)
  • multi_issue: Price + payment terms (8 rounds max)
  • adversarial: Price + payment + support hours (10 rounds max)

Step 2: Make an Offer

Fill in the form fields:

| Field | Example Value | Notes |
|---|---|---|
| move_type | make_offer | Options: make_offer, accept, reject, bundle |
| terms | {"price": 42000} | JSON object with negotiation terms |
| message | I value our partnership and believe we can find a fair solution. | Your natural language message (affects opponent rapport!) |

Example: Making a collaborative offer

move_type: make_offer
terms: {"price": 45000}
message: We appreciate your flexibility and would like to work together to find a solution that benefits both parties.

Step 3: Read the Response

After clicking Step, you'll see:

  • supplier_message: The opponent's natural language response
  • current_offer: Updated terms on the table
  • rapport_hint: "positive", "neutral", or "negative" based on your language
  • round_number: Current round (0-indexed)

Step 4: Continue or Accept

  • Make another offer to continue negotiating
  • Use accept when you're satisfied with the current terms
  • Use reject only if you want to walk away (no reward)

Example: Accepting current terms

move_type: accept
terms: {}
message: 

Multi-Issue Negotiation (Tasks 2 and 3)

For multi_issue and adversarial, include multiple terms:

{
  "move_type": "make_offer",
  "terms": {
    "price": 44000,
    "payment_days": 30
  },
  "message": "We can offer faster payment terms if that helps your cash flow."
}

Key insight: In multi_issue, the opponent cares more about payment timing than price. Offering Net-30 payment can get you a better price!

Example Full Episode

Round 0 (Reset):

  • Task: single_issue
  • Supplier opens: ~$52,000
  • Your target: $36,000

Round 1:

  • move_type: make_offer
  • terms: {"price": 48000}
  • message: We value your partnership and want to find a fair price for both parties.

Round 2:

  • Supplier counter-offers at ~$46,000 (rapport is positive!)
  • move_type: make_offer
  • terms: {"price": 45000}
  • message: I appreciate your movement. Let's see if we can get to $45,000.

Round 3:

  • Supplier accepts or counter-offers near your last offer
  • move_type: accept
  • terms: {}
  • Final score: based on how close the deal is to your target and how few rounds it took

The Three Tasks

1. single_issue (Easy)

Renew software license. Price only.

  • Buyer target: $36,000, Budget: $53,000
  • Seller opens: ~$52,000 (varies by seed)
  • Opponent persona: Cooperative
  • Max rounds: 6

Scoring: Deal quality (how close to target) × Efficiency (how few rounds)
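
The exact formula lives in graders.py and is not reproduced here. As a minimal sketch of the "quality times efficiency" idea, assuming deal quality is linear between the budget and the target and efficiency falls linearly to a floor of 0.5 (the function name, the linear shapes, and the 0.5 floor are all hypothetical):

```python
def sketch_single_issue_score(final_price, rounds_used,
                              target=36_000, budget=53_000, max_rounds=6):
    """Illustrative only: deal quality x efficiency, both clamped to [0, 1]."""
    # Deal quality: 1.0 at the buyer's target, 0.0 at the budget ceiling.
    quality = (budget - final_price) / (budget - target)
    quality = max(0.0, min(1.0, quality))
    # Efficiency (assumed): 1.0 if closed in one round, 0.5 if all rounds are used.
    efficiency = 1.0 - 0.5 * (rounds_used - 1) / (max_rounds - 1)
    efficiency = max(0.5, min(1.0, efficiency))
    return quality * efficiency


if __name__ == "__main__":
    # A deal halfway between target and budget, closed in the last round.
    print(sketch_single_issue_score(44_500, rounds_used=6))
```

Under these assumptions a multiplicative score means a fast but poor deal and a slow but good deal are both penalised, which matches the description above.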

2. multi_issue (Medium)

Enterprise software deal. Price + payment terms.

  • Buyer weights: price 70%, payment 30%
  • Seller persona: Cash Flow Stressed (cares more about payment timing)
  • Trade opportunity: offer Net-30 payment to get lower price
  • Max rounds: 8

Scoring: Weighted combination of price improvement + payment terms
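
To make the weighted combination concrete, here is a small sketch using the 70/30 buyer weights stated above. The `normalize` helper, the `ranges` argument, and the example ranges are assumptions for illustration; the real bounds are defined by the environment.

```python
BUYER_WEIGHTS = {"price": 0.7, "payment_days": 0.3}  # weights from the task description


def normalize(value, worst, target):
    """Map value onto [0, 1]: 0.0 at worst, 1.0 at target (either direction)."""
    if worst == target:
        return 0.0
    score = (value - worst) / (target - worst)
    return max(0.0, min(1.0, score))


def sketch_multi_issue_score(terms, ranges, weights=BUYER_WEIGHTS):
    """Weighted sum of per-issue scores; ranges maps issue -> (worst, target)."""
    return sum(w * normalize(terms[issue], *ranges[issue])
               for issue, w in weights.items())


if __name__ == "__main__":
    # Hypothetical ranges: the buyer prefers a low price and long payment terms.
    ranges = {"price": (60_000, 40_000), "payment_days": (30, 90)}
    print(sketch_multi_issue_score({"price": 50_000, "payment_days": 30}, ranges))
```

With these weights, giving up the payment issue entirely costs the buyer at most 0.3, which is why trading Net-30 for a lower price can be worthwhile.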

3. adversarial (Hard)

Large contract negotiation. Price + payment + support hours.

  • Opponent persona: Aggressive Anchor
    • Opens at ceiling on all issues
    • Hardens position if you make 2+ consecutive concessions
    • Requires consistent collaborative framing
  • Survival floor: any deal scores at least 0.15
  • Max rounds: 10

Scoring: Multi-dimensional value minus pattern penalty for consecutive concessions
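
The pattern penalty and survival floor could be sketched as follows. This assumes a concession is any buyer price offer higher than the previous one and that the penalty grows by a fixed amount per extra consecutive concession; the function names and the 0.05 penalty step are hypothetical, while the 0.15 floor comes from the task description.

```python
def count_consecutive_concessions(buyer_prices):
    """Longest run of consecutive price increases (the buyer concedes by offering more)."""
    longest = run = 0
    for prev, cur in zip(buyer_prices, buyer_prices[1:]):
        run = run + 1 if cur > prev else 0
        longest = max(longest, run)
    return longest


def sketch_adversarial_score(value, buyer_prices, penalty_step=0.05, floor=0.15):
    """Value minus a penalty for runs of 2+ concessions, never below the floor."""
    streak = count_consecutive_concessions(buyer_prices)
    penalty = penalty_step * max(0, streak - 1)  # no penalty for a single concession
    return max(floor, value - penalty)


if __name__ == "__main__":
    print(sketch_adversarial_score(0.6, [40_000, 42_000, 44_000]))
```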

Action Space

NegotiationAction(
    move_type="make_offer",  # make_offer | accept | reject | bundle
    terms={"price": 44000, "payment_days": 45, "support_hours": 120},
    message="We appreciate your flexibility on this."
)

| move_type | Description |
|---|---|
| make_offer | Propose terms (price required, others optional) |
| accept | Accept current offer on table |
| reject | Walk away (only use at final round) |
| bundle | Alias for make_offer with multi-issue terms |
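
Before sending an action, a client may want to check it against the rules in this table. The helper below is a hypothetical client-side check, not part of the environment's API; it only encodes what the table states (four move types, price required for offers).

```python
VALID_MOVES = {"make_offer", "accept", "reject", "bundle"}


def validate_action(move_type, terms):
    """Return a list of problems; an empty list means the action looks well-formed."""
    problems = []
    if move_type not in VALID_MOVES:
        problems.append(f"unknown move_type: {move_type!r}")
    # bundle is described as an alias for make_offer, so it gets the same check.
    if move_type in {"make_offer", "bundle"} and "price" not in terms:
        problems.append(f"{move_type} requires a 'price' term")
    return problems


if __name__ == "__main__":
    print(validate_action("make_offer", {"payment_days": 30}))
```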

Observation Space

NegotiationObservation(
    task_id="single_issue",
    round_number=2,
    max_rounds=6,
    supplier_message="I appreciate your offer. Based on our costs...",
    current_offer={"price": 46000},
    last_4_exchanges=[...],
    buyer_constraints={"price": {"target": 36000, "worst": 55000, "budget": 53000}},
    rapport_hint="positive",  # positive | neutral | negative
    done=False
)

Running the Server

# Build Docker image
docker build -t procure-rl -f server/Dockerfile .

# Run container (port 7860 - required for HF Spaces)
docker run -p 7860:7860 procure-rl

# Access web interface at http://localhost:7860/web

API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| /health | GET | Health check |
| /metadata | GET | Environment metadata |
| /reset | POST | Reset environment |
| /step | POST | Execute action |
| /state | GET | Get current state |
| /ws | WS | WebSocket for persistent sessions |

Baseline Inference

Run inference against all three tasks:

cp .env.example .env
# Edit .env and add your HF_TOKEN
HF_TOKEN=your_token python inference.py

Output format (exact):

[START] task=single_issue env=procure-rl model=Qwen/Qwen2.5-72B-Instruct
[STEP] step=1 action=make_offer({"price": 42000}) reward=0.00 done=false error=null
[STEP] step=2 action=make_offer({"price": 41000}) reward=0.52 done=true error=null
[END] success=true steps=2 score=0.52 rewards=0.00,0.52
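
Because the format is fixed, the log can be parsed mechanically. The sketch below parses `[STEP]` lines with a regular expression; `parse_step_line` is a hypothetical helper and assumes the fields always appear in the order shown above.

```python
import re

STEP_RE = re.compile(
    r"^\[STEP\] step=(?P<step>\d+) action=(?P<action>.+) "
    r"reward=(?P<reward>-?\d+\.\d+) done=(?P<done>true|false) error=(?P<error>.+)$"
)


def parse_step_line(line):
    """Parse one [STEP] line into a dict, or return None if it does not match."""
    m = STEP_RE.match(line.strip())
    if m is None:
        return None
    return {
        "step": int(m["step"]),
        "action": m["action"],
        "reward": float(m["reward"]),
        "done": m["done"] == "true",
        "error": None if m["error"] == "null" else m["error"],
    }


if __name__ == "__main__":
    line = '[STEP] step=2 action=make_offer({"price": 41000}) reward=0.52 done=true error=null'
    print(parse_step_line(line))
```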

Environment Design

Rapport System

The opponent maintains a rapport score (0.0 to 1.0) updated per-round:

COLLABORATIVE_SIGNALS = ["understand", "partnership", "mutual", "together", ...]
AGGRESSIVE_SIGNALS = ["demand", "require", "final offer", "unacceptable", ...]

delta = +0.08 per collaborative signal detected
delta = -0.08 per aggressive signal detected
delta = max(-0.20, min(0.20, delta))  # cap per round
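
A runnable sketch of this update rule, assuming case-insensitive substring matching and using only the signal words listed above (the full lists are in opponent.py). The function name `update_rapport` and the clamping of the result to the 0.0 to 1.0 range are assumptions.

```python
COLLABORATIVE = ["understand", "partnership", "mutual", "together"]   # subset shown above
AGGRESSIVE = ["demand", "require", "final offer", "unacceptable"]     # subset shown above


def update_rapport(rapport, message, step=0.08, cap=0.20):
    """Apply +/-step per detected signal, cap the per-round change, clamp to [0, 1]."""
    text = message.lower()
    delta = step * sum(signal in text for signal in COLLABORATIVE)
    delta -= step * sum(signal in text for signal in AGGRESSIVE)
    delta = max(-cap, min(cap, delta))  # cap per round
    return max(0.0, min(1.0, rapport + delta))


if __name__ == "__main__":
    print(update_rapport(0.5, "Let's work together for mutual benefit."))
```

Substring matching is simple but coarse: "require" also fires on "required", so a real implementation may want word-boundary matching.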

Opponent Personas

| Persona | Base Concession | Rapport Modifier | Special Behavior |
|---|---|---|---|
| cooperative | 5% | ±50% | Responsive to language |
| cash_flow_stressed | 7% | ±50% | Accepts Net-45+, comments on payment |
| aggressive_anchor | 4% | ±50% | Hardens after 2+ consecutive concessions |
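
One way to read the ±50% modifier is as a linear scaling of the base rate by rapport. The mapping below (rapport 0.5 is neutral, 0.0 gives -50%, 1.0 gives +50%) is an assumption for illustration; the actual mapping is defined in opponent.py.

```python
BASE_CONCESSION = {
    "cooperative": 0.05,
    "cash_flow_stressed": 0.07,
    "aggressive_anchor": 0.04,
}


def effective_concession(persona, rapport, max_modifier=0.5):
    """Scale the persona's base concession rate by a rapport-dependent modifier."""
    # Assumed mapping: rapport 0.0 -> -50%, 0.5 -> no change, 1.0 -> +50%.
    modifier = (rapport - 0.5) * 2 * max_modifier
    return BASE_CONCESSION[persona] * (1 + modifier)


if __name__ == "__main__":
    for rapport in (0.0, 0.5, 1.0):
        print(rapport, effective_concession("cooperative", rapport))
```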

Grading

Graders are pure Python with zero LLM calls. They combine:

  • Value: how close to buyer's target
  • Efficiency: penalty for taking too many rounds
  • Pattern penalty (adversarial only): for consecutive concession behavior

Graders never crash on malformed input; they fall back to worst-case values.
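
The fallback behavior might look like the sketch below, which substitutes a worst-case price whenever the final terms are missing or unparseable. `coerce_price` is a hypothetical helper, and the default of 55000 is taken from the `worst` value in the observation example rather than from the grader source.

```python
import math


def coerce_price(terms, worst=55_000):
    """Return a usable price, substituting the worst-case value for malformed input."""
    try:
        price = float(terms["price"])
    except (TypeError, KeyError, ValueError):
        # terms is not a dict, has no price, or the price is not numeric.
        return float(worst)
    if not math.isfinite(price) or price <= 0:
        return float(worst)
    return price


if __name__ == "__main__":
    for bad in (None, {}, {"price": "abc"}, {"price": float("nan")}):
        print(repr(bad), "->", coerce_price(bad))
```

Falling back to the worst case rather than raising keeps an episode scoreable even when the agent emits invalid JSON, at the cost of giving that episode a low score.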

Project Structure

Procure_RL/
├── __init__.py                    # Package exports
├── client.py                      # EnvClient wrapper
├── models.py                      # NegotiationAction, NegotiationObservation, NegotiationState
├── opponent.py                    # ScriptedPersonaOpponent with 3 personas + rapport
├── graders.py                     # grade_single_issue, grade_multi_issue, grade_adversarial
├── inference.py                   # Baseline agent with [START][STEP][END] output
├── server/
│   ├── __init__.py
│   ├── app.py                    # FastAPI app
│   ├── Procure_RL_environment.py # ProcureRLEnvironment
│   ├── requirements.txt
│   └── Dockerfile
├── openenv.yaml                  # OpenEnv manifest
├── pyproject.toml
├── plan.md                       # Full design specification
└── README.md                     # This file

Why This Environment?

Market validation: Walmart deployed Pactum for AI negotiation. 90% of CPOs adopting AI negotiation in 2025.

Research gap: Zero negotiation environments in OpenEnv hub.

LLM advantage: Language quality directly affects opponent rapport: the language IS the policy.

Reproducibility: Deterministic scripted opponent, pure Python graders, no LLM in environment loop.

Calibration

If the base LLM scores above 0.55 on single_issue, the opponent is too easy: reduce the cooperative concession rate.

If the base LLM scores below 0.15 on single_issue, the opponent is too hard: increase the cooperative concession rate.
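
These two rules can be expressed as a small check to run after a baseline evaluation. The function name and the returned strings are hypothetical; the 0.15 and 0.55 thresholds are the ones given above.

```python
def calibration_advice(mean_score, low=0.15, high=0.55):
    """Suggest how to adjust the cooperative persona given the base LLM's single_issue score."""
    if mean_score > high:
        return "opponent too easy: reduce cooperative concession rate"
    if mean_score < low:
        return "opponent too hard: increase cooperative concession rate"
    return "within target band: no change"


if __name__ == "__main__":
    print(calibration_advice(0.52))
```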