AbeBhatti

ArbitrAgent — Project Context

Read this file at the start of every session. Do not modify it. After completing your session, update session_progress.md with your session number and what you built.


What We Are Building

ArbitrAgent is a curriculum-trained negotiation agent that autonomously executes multi-route arbitrage on simulated Craigslist-style markets. It starts with a cash budget ($20), identifies high-value items, simultaneously opens negotiations across multiple buy candidates and downstream trade targets, and only commits capital once a confirmed profitable route is locked.

Built for the OpenEnv Hackathon, March 7–8, 2026, at Shack15, San Francisco.

Submission deadline: Sunday March 8, 1:00 PM sharp.


✅ Already Built — Do Not Rebuild

A teammate completed the following before the hackathon started. Every session must read this before touching any ML or environment code.

| Component | Details |
|---|---|
| `/home/rayyan/Desktop/Play-gent/reward_model.pt` | DistilBERT fine-tuned on Diplomacy data, val loss 0.102 |
| `DiplomacyNegotiationEnv` | OpenEnv 0.2.1 compliant, inherits from real Env base class |
| `ContractorNegotiationEnv` | OpenEnv 0.2.1 compliant, inherits from real Env base class |
| `/home/rayyan/Desktop/Play-gent/selfplay_states.json` | 211,278 labeled Diplomacy game states |
| `/home/rayyan/Desktop/Play-gent/grpo_output/checkpoint-200/model.safetensors` | TinyLlama 1.1B, GRPO Phase 1 trained, reward curve -0.35 → +0.63 over 200 steps |

Saturday only requires: Phase 2 GRPO training (~1.5 hrs), agent loop, seller sims, and demo UI. The hard ML work is done.


Real negotiation data is private and will never exist as training data. We extract negotiation judgment from two games that together cover the complete negotiation skill surface:

  • Diplomacy β†’ multi-party coalition sequencing, strategic information reveals, long-horizon concession planning, stopping policy
  • Poker β†’ bluff detection, behavioral pattern reading, pressure calibration, EV reasoning, clean exits

The combined skill, which neither game alone produces, is detecting a bluff AND immediately deploying coalition pressure at exactly that moment. That is the demo's proof of training.

The training pipeline implements this in three phases: Diplomacy (Phase 1, ✅ complete), Contractor negotiation as an intermediate bluff-detection layer (Phase 2, 🔲 MVP), and full Poker training on the IRC Poker dataset (Phase 3, 🔲 post-MVP). The pitch is true at MVP and becomes fully implemented at Phase 3.


Repository Structure

arbitragent/
├── proj_context.md              # This file — never modify
├── session_progress.md          # Updated by each session
├── envs/
│   ├── diplomacy_env.py         # ✅ BUILT — DiplomacyNegotiationEnv (OpenEnv 0.2.1)
│   ├── contractor_env.py        # ✅ BUILT — ContractorNegotiationEnv (OpenEnv 0.2.1)
│   └── poker_env.py             # 🔲 POST-MVP — PokerNegotiationEnv (OpenEnv 0.2.1)
├── training/
│   ├── reward_model.py          # ✅ BUILT — DistilBERT reward model (val loss 0.102)
│   ├── checkpoints/             # 🔲 TODO — optional future consolidation of checkpoints
│   │   ├── phase2_final.pt      # 🔲 TODO — after Session B2
│   │   └── phase3_final.pt      # 🔲 POST-MVP — after Session B3
│   ├── data/                    # 🔲 TODO — optional future data subfolder
│   │   └── (see root-level files for existing data artifacts)
│   ├── train_phase1.py          # ✅ BUILT — GRPO on Diplomacy env (done, -0.35 → +0.63)
│   ├── train_phase2.py          # 🔲 TODO — GRPO on Contractor env (Session B2)
│   ├── train_phase3.py          # 🔲 POST-MVP — GRPO on Poker env (Session B3)
│   └── arbitragent_colab.ipynb  # 🔲 TODO — End-to-end Colab notebook (Session B2)
├── agent/
│   ├── arbitragent.py           # Main agent orchestration loop (5 phases)
│   ├── route_graph.py           # Route graph: confirmed/soft/dead edges + scoring
│   └── bluff_detector.py        # Signal extraction: timing/size/formulaic/pattern tells
├── simulation/
│   ├── seller_sim.py            # CraigslistSellerSim — LLM-backed seller counterparts
│   ├── seller_profiles.py       # All 4 archetype profiles + listing library
│   └── scenario.py              # Demo scenario: which seller ghosts, when bluff triggers
├── demo/
│   ├── run_demo.py              # Entry point — takes budget, runs full agent loop
│   └── display.py               # Rich terminal output showing live negotiation threads
└── deploy/
    └── hf_spaces_app.py         # HuggingFace Spaces deployment wrapper

Training Architecture

MVP (Submit This)

Phase 1: Diplomacy Training                         ✅ COMPLETE
211,278 labeled Diplomacy game states
→ Reward model (DistilBERT) trained, val loss 0.102
→ GRPO training on TinyLlama 1.1B: 200 steps
→ Reward curve: -0.35 → +0.63
→ Checkpoint saved: `/home/rayyan/Desktop/Play-gent/grpo_output/checkpoint-200/model.safetensors`

Phase 2: Contractor Curriculum Training             🔲 TODO — Session B2
Contractor negotiation scenarios (false-floor, pressure calibration, timing tells)
→ Continue GRPO from phase1_final.pt — do NOT reinitialize weights
→ 200 additional steps
→ Bluff detection accuracy must improve on held-out test set
→ Save checkpoint: training/checkpoints/phase2_final.pt

MVP Model: TinyLlama 1.1B, Diplomacy + Contractor trained

Post-MVP (If Time Allows — Phase 3)

Phase 3: Poker Curriculum Training                  🔲 POST-MVP — Session B3
IRC Poker Database (free, 10M+ hands, no collection needed)
→ Replay hands as negotiation scenarios
→ Map bet sizing → negotiation pressure
→ Map bluff/fold signals → position authenticity reads
→ Continue GRPO from phase2_final.pt — do NOT reinitialize weights
→ 200 additional steps
→ Reward: EV of outcome vs. EV of folding
→ Save checkpoint: training/checkpoints/phase3_final.pt

Full Model: TinyLlama 1.1B, Diplomacy + Contractor + Poker trained

Build Phase 3 only after Phase 2 is complete, the demo is running end-to-end, and the submission checklist is green. Phase 3 makes the implementation match the pitch exactly — the story becomes true all the way down. Estimated time: ~2 hours to build PokerNegotiationEnv + ~1.5 hours training on the DGX.

Why curriculum order matters: Diplomacy builds the multi-party strategic foundation. Contractor adds false-floor detection on top of that. Poker sharpens the bluff-reading layer with pure behavioral signal. Each phase builds on the last. Running them simultaneously or out of order causes catastrophic forgetting.

Why TinyLlama 1.1B and not LLaMA 3.1 8B: Training time. 8B on the DGX Spark would take 17–24 hours for two phases alone — the entire hackathon gone on training. TinyLlama 1.1B completes all three phases in ~5 hours total, with Phase 1 already done. Do not switch to 8B.
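
The three-phase chain above can be written down as data so the trainer can assert it never reinitializes weights between phases. A minimal sketch using the checkpoint paths named in this file — the schedule structure and `validate_curriculum` helper are hypothetical, not part of the actual training scripts:

```python
# Hypothetical curriculum schedule mirroring the three phases above.
# Each phase continues GRPO from the previous phase's saved checkpoint;
# weights are never reinitialized between phases.
CURRICULUM = [
    {"phase": 1, "env": "DiplomacyNegotiationEnv",
     "init_from": None, "steps": 200,
     "checkpoint": "grpo_output/checkpoint-200/model.safetensors"},
    {"phase": 2, "env": "ContractorNegotiationEnv",
     "init_from": "grpo_output/checkpoint-200/model.safetensors", "steps": 200,
     "checkpoint": "training/checkpoints/phase2_final.pt"},
    {"phase": 3, "env": "PokerNegotiationEnv",
     "init_from": "training/checkpoints/phase2_final.pt", "steps": 200,
     "checkpoint": "training/checkpoints/phase3_final.pt"},
]

def validate_curriculum(schedule):
    """Check the chain: each phase must resume from the previous checkpoint."""
    for prev, cur in zip(schedule, schedule[1:]):
        assert cur["init_from"] == prev["checkpoint"], "broken curriculum chain"
    return True
```

Running `validate_curriculum(CURRICULUM)` at the top of each training script is a cheap guard against the out-of-order runs the paragraph above warns about.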


Tech Stack (LOCKED)

| Component | Technology | Status |
|---|---|---|
| Agent LLM | TinyLlama 1.1B (trained policy) | ✅ Phase 1 trained |
| Phase 1 Env | DiplomacyNegotiationEnv (OpenEnv 0.2.1) | ✅ Built |
| Phase 2 Env | ContractorNegotiationEnv (OpenEnv 0.2.1) | ✅ Built |
| Phase 3 Env | PokerNegotiationEnv (OpenEnv 0.2.1) | 🔲 Post-MVP |
| Poker Data | IRC Poker Database (free, 10M+ hands) | 🔲 Post-MVP |
| Reward Model | DistilBERT, val loss 0.102 | ✅ Built |
| RL Framework | TRL + GRPO | ✅ Phase 1 complete |
| Training Data | `/home/rayyan/Desktop/Play-gent/selfplay_states.json`, 211,278 states | ✅ Built |
| Seller Simulation | TinyLlama 1.1B with archetype system prompts | 🔲 Session C1 |
| Route Graph | NetworkX or custom dict-based | 🔲 Session A2 |
| Agent Loop | 5-phase orchestration | 🔲 Session A2 |
| Bluff Detector | 4-signal extractor | 🔲 Session A3 |
| Demo UI | Rich terminal display | 🔲 Session A4 |
| Experiment Tracking | Weights & Biases | ✅ Active |
| Deployment | HuggingFace Spaces + HF Model Hub | 🔲 Session A4 |
| Hardware | DGX Spark (all training + inference) | ✅ Available |
| Colab Notebook | End-to-end training script | 🔲 Session B2 |

The Five-Phase Agent Loop

Phase 1: Scout

  • Query simulated listings for $15–$25 items
  • Score each on: resale demand, trade liquidity, seller bluff probability
  • Select top 3 buy candidates
  • Open soft-inquiry negotiations with all 3 simultaneously

Phase 2: Route Mapping

  • For each candidate, identify 2-3 trade targets in $35–$80 range
  • Open parallel trade-interest threads
  • Build route graph β€” edges: Confirmed / Soft / Dead

Phase 3: Pressure and Confirm

  • Use downstream confirmations as upstream leverage
  • Run bluff detection on seller responses
  • Lock soft commits before committing capital
  • Kill routes below confirmation probability threshold

Phase 4: Route Scoring

route_score = (confirmed_exit_value - entry_cost)
              × route_confirmation_probability
              × seller_reliability_score
# Kill if route_score < minimum_threshold
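
The formula drops straight into a small helper; a minimal sketch, noting that the document does not fix `minimum_threshold` — the default below is a placeholder:

```python
def route_score(confirmed_exit_value, entry_cost,
                route_confirmation_probability, seller_reliability_score):
    """Score a route exactly as in the Phase 4 formula above."""
    return ((confirmed_exit_value - entry_cost)
            * route_confirmation_probability
            * seller_reliability_score)

def keep_route(score, minimum_threshold=5.0):
    # Kill (return False) any route scoring below the threshold.
    return score >= minimum_threshold
```

With the demo-inject numbers ($52 exit, $24 entry) and illustrative probabilities of 0.9 and 0.85, the route scores 21.42 and survives.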

Phase 5: Execute

  • Pull trigger on highest scored confirmed route
  • Complete downstream trade
  • Log final value vs. starting budget

The Four Seller Archetypes

| Archetype | Response Prob | Floor Behavior | Trade Openness | Demo Purpose |
|---|---|---|---|---|
| Motivated Seller | 0.90 | Real floor, honest | High | Shows clean close |
| Bluffer | 0.85 | Says "firm" with 30% room left | Medium | Shows poker layer catching tell |
| Ghoster | 0.35 | Never reaches floor | Low | Shows agent detecting dead route, pivoting |
| Trade-Curious | 0.80 | Cash-resistant, trade-open | Very High | Shows agent switching offer type |

Bluff Detection Signals (all four must be checked)

  1. Timing tell — response came in under 1 turn (prepared script, not genuine constraint)
  2. Size tell — concession is a round number (anchoring, not real floor)
  3. Formulaic tell — canned phrasing: "lowest I can go", "final offer", "can't go lower"
  4. Pattern tell — behavior inconsistent with their earlier thread history
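
The four checks can be sketched as one extractor over a single seller response. This is a minimal illustration assuming a hypothetical response dict with `turns_waited`, `price`, and `text` fields and a thread `history` of earlier responses — none of these names are the real `bluff_detector.py` API:

```python
FORMULAIC = ("lowest i can go", "final offer", "can't go lower")

def extract_tells(response, history):
    """Return which of the four bluff tells fire for one seller response."""
    tells = []
    # 1. Timing tell: reply came back in under one turn.
    if response["turns_waited"] < 1:
        tells.append("timing")
    # 2. Size tell: the stated price is a suspiciously round number.
    if response["price"] % 5 == 0:
        tells.append("size")
    # 3. Formulaic tell: canned "firm price" phrasing.
    text = response["text"].lower()
    if any(phrase in text for phrase in FORMULAIC):
        tells.append("formulaic")
    # 4. Pattern tell: claims firm/final now despite conceding earlier
    #    in the same thread (inconsistent with thread history).
    conceded_before = any(h["price"] > response["price"] for h in history)
    if conceded_before and ("firm" in text or "final" in text):
        tells.append("pattern")
    return tells
```

On the demo inject ("this is my final offer" at $30, instant reply, after an earlier $35 quote) all four tells fire.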

The Critical Demo Inject

At ~60 seconds into the demo, the Bluffer seller says "this is my final offer" on the vintage camera at $30. This response contains all four tells. The trained model flags it, shows reasoning trace, and deploys coalition pressure: "I have a trade offer from another seller that makes this less urgent for me — can you do $22?" Seller concedes to $24. Route executes. Final value: $52 on $24 deployed = 2.2x.

Baseline LLaMA accepts the $30 "final offer" at face value. The trained model doesn't. That gap is the proof.


Seller Profile Schema

{
    "id": "seller_001",
    "item": "vintage film camera",
    "listing_price": 45,
    "floor": 28,              # hidden from agent
    "archetype": "bluffer",
    "bluff_room": 0.30,       # still has 30% room when says "final offer"
    "response_prob": 0.85,
    "response_speed": "fast", # fast | slow | flaky
    "trade_openness": 0.6,
    "personality": "Casual seller, slightly impatient. Texts in short bursts.",
    "tells": ["round numbers", "formulaic language", "too-fast response"]
}
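
Session C1's seller sim drives TinyLlama with archetype system prompts built from this schema. One way to render a profile into such a prompt — the helper name and prompt wording are illustrative, not the real `seller_sim.py`:

```python
def build_system_prompt(profile):
    """Render a seller profile into a system prompt for the seller LLM.

    The floor stays in the prompt because the *seller* model knows it;
    it is hidden from the negotiating agent, not from the simulator.
    """
    return (
        f"You are a Craigslist seller ({profile['archetype']}). "
        f"Item: {profile['item']}, listed at ${profile['listing_price']}. "
        f"Your true floor is ${profile['floor']}; never state it directly. "
        f"Openness to trades: {profile['trade_openness']:.0%}. "
        f"Personality: {profile['personality']}"
    )
```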

Response Turn Simulation

RESPONSE_PROFILES = {
    "fast":  {"turns_to_respond": 1, "ghost_prob": 0.10},
    "slow":  {"turns_to_respond": 3, "ghost_prob": 0.30},
    "flaky": {"turns_to_respond": 2, "ghost_prob": 0.60},
}
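
One way these speed profiles could combine with a seller's `response_prob` to decide whether a reply arrives at all — the function name and return shape are assumptions, not the real `seller_sim.py` API:

```python
import random

RESPONSE_PROFILES = {
    "fast":  {"turns_to_respond": 1, "ghost_prob": 0.10},
    "slow":  {"turns_to_respond": 3, "ghost_prob": 0.30},
    "flaky": {"turns_to_respond": 2, "ghost_prob": 0.60},
}

def simulate_response(profile, rng=random):
    """Sample whether and when a seller replies to the latest message.

    Combines the per-seller response_prob from the profile schema with
    the per-speed ghost_prob above; returns (replies, turns_waited).
    """
    speed = RESPONSE_PROFILES[profile["response_speed"]]
    if rng.random() < speed["ghost_prob"]:
        return False, None                  # seller ghosts the thread
    if rng.random() > profile["response_prob"]:
        return False, None                  # no reply to this message
    return True, speed["turns_to_respond"]
```

Taking `rng` as a parameter keeps the demo scenario reproducible: the scripted run can pass a seeded `random.Random` so the Ghoster always ghosts on cue.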

Hackathon Tracks Hit

| Track | How |
|---|---|
| Statement 1: Multi-Agent | Agent manages 9-12 simultaneous counterpart LLMs |
| Statement 2: Long-Horizon | Route-confirmation arc spans multiple rounds with full state tracking |
| Statement 4: Self-Improvement | Curriculum RL loop, two-phase measurable reward improvement |
| Statement 5: Wild Card | Autonomous capital deployment via confirmed route arbitrage |
| Halluminate $10k bonus | Agent managing multiple actors to discover and achieve the task |
| Fleet AI $10k bonus | Bluff detection layer as oversight agent scoring counterpart behavior |

The Pitch (memorize this)

"The most important negotiations of your life happen once. The person across the table has done it hundreds of times. The data to train AI on these conversations is sealed by law and will never exist. We found where that judgment already lives at massive scale: in Diplomacy, where millions of humans practiced multi-party coalition strategy, and in Poker, where millions more learned to read when someone's stated position is real versus a bluff. We trained on both — curriculum style — and built an agent that doesn't just know negotiation theory. It has internalized when to move, when to wait, and when the other side is lying about their floor. Then we gave it $20 and let it run."


Judge Q&A (have these ready)

"Couldn't you just prompt GPT-4 to do this?" GPT-4 knows negotiation tactics abstractly. It has no learned behavioral policy about when to deploy them. It hasn't lost thousands of negotiations by revealing coalition pressure too early. Our model has — and the reward curves are the proof.

"Does game training actually transfer to real negotiation?" The structural isomorphism is direct. Coalition sequencing in Diplomacy is mechanically identical to sequential offer reveals in any multi-party negotiation. Bluff detection in contractor bidding scenarios — reading whether a contractor's stated floor is real — is mechanically identical to the same skill in any negotiation. We're not claiming domain transfer — we're claiming the cognitive mechanics are identical across surface vocabulary.

"Why simulate instead of real Craigslist?" Craigslist has 6-hour response latency, no API, and one ghost kills a live demo. Our parameterized LLM counterparts replicate the four real seller archetypes we identified from Craigslist interaction patterns. The agent reads behavioral signals in real time exactly as it would with real sellers.

"Why GRPO instead of PPO?" GRPO is more sample-efficient for language model fine-tuning and produces more stable training. It's the same algorithm DeepSeek-R1 used. Our Phase 1 reward curve — -0.35 to +0.63 over 200 steps — is the evidence it works.


Submission Requirements (do not miss any)

  • Reward model on HF Model Hub β€” already built, just needs uploading
  • Phase 1 reward curves (Diplomacy GRPO, -0.35 β†’ +0.63) β€” already exists, needs clean plot
  • Both envs live on HuggingFace Spaces (OpenEnv 0.2.1)
  • Phase 2 reward curves (Contractor GRPO, climbing over 200 steps)
  • Colab notebook: full curriculum training loop, runs in one click
  • Side-by-side: trained vs baseline on same negotiation
  • Full ArbitrAgent demo: $20 β†’ autonomous route execution β†’ final value
  • 1-minute YouTube demo video (live agent run, no slides)
  • Public GitHub repo with README
  • Submit at cerebralvalley.ai by Sunday 1:00 PM

This file is the ground truth for the project. If anything in session_progress.md conflicts with this file, this file wins on architecture and thesis. session_progress.md wins on what has already been built.

Handoff: For a full breakdown of what has been built and what remains, give Claude both this file and session_progress.md (see the "Handoff for Claude" section at the end of session_progress.md).