AbeBhatti

ArbitrAgent — Project Context

Read this file at the start of every session. Do not modify it. After completing your session, update session_progress.md with your session number and what you built.


What We Are Building

ArbitrAgent is a curriculum-trained negotiation agent that autonomously executes multi-route arbitrage on simulated Craigslist-style markets. It starts with a cash budget ($20), identifies high-value items, simultaneously opens negotiations across multiple buy candidates and downstream trade targets, and only commits capital once a confirmed profitable route is locked.

Built for the OpenEnv Hackathon, March 7–8, 2026, at Shack15, San Francisco.

Submission deadline: Sunday March 8, 1:00 PM sharp.


✅ Already Built — Do Not Rebuild

A teammate completed the following before the hackathon started. Every session must read this before touching any ML or environment code.

| Component | Details |
|---|---|
| `/home/rayyan/Desktop/Play-gent/reward_model.pt` | DistilBERT fine-tuned on Diplomacy data, val loss 0.102 |
| `DiplomacyNegotiationEnv` | OpenEnv 0.2.1 compliant, inherits from real Env base class |
| `ContractorNegotiationEnv` | OpenEnv 0.2.1 compliant, inherits from real Env base class |
| `/home/rayyan/Desktop/Play-gent/selfplay_states.json` | 211,278 labeled Diplomacy game states |
| `/home/rayyan/Desktop/Play-gent/grpo_output/checkpoint-200/model.safetensors` | TinyLlama 1.1B, GRPO Phase 1 trained, reward curve -0.35 → +0.63 over 200 steps |

Saturday only requires: Phase 2 GRPO training (~1.5 hrs), agent loop, seller sims, and demo UI. The hard ML work is done.


Real negotiation data is private and will never exist as training data. We extract negotiation judgment from two games that together cover the complete negotiation skill surface:

  • Diplomacy β†’ multi-party coalition sequencing, strategic information reveals, long-horizon concession planning, stopping policy
  • Poker β†’ bluff detection, behavioral pattern reading, pressure calibration, EV reasoning, clean exits

The combined skill, which neither game alone produces, is detecting a bluff AND immediately deploying coalition pressure at exactly that moment. That is the demo's proof of training.

The training pipeline implements this in three phases: Diplomacy (Phase 1, ✅ complete), Contractor negotiation as an intermediate bluff-detection layer (Phase 2, 🔲 MVP), and full Poker training on the IRC Poker dataset (Phase 3, 🔲 post-MVP). The pitch is true at MVP and becomes fully implemented at Phase 3.


Repository Structure

arbitragent/
├── proj_context.md              # This file — never modify
├── session_progress.md          # Updated by each session
├── envs/
│   ├── diplomacy_env.py         # ✅ BUILT — DiplomacyNegotiationEnv (OpenEnv 0.2.1)
│   ├── contractor_env.py        # ✅ BUILT — ContractorNegotiationEnv (OpenEnv 0.2.1)
│   └── poker_env.py             # 🔲 POST-MVP — PokerNegotiationEnv (OpenEnv 0.2.1)
├── training/
│   ├── reward_model.py          # ✅ BUILT — DistilBERT reward model (val loss 0.102)
│   ├── checkpoints/             # 🔲 TODO — optional future consolidation of checkpoints
│   │   ├── phase2_final.pt      # 🔲 TODO — after Session B2
│   │   └── phase3_final.pt      # 🔲 POST-MVP — after Session B3
│   ├── data/                    # 🔲 TODO — optional future data subfolder
│   │   └── (see root-level files for existing data artifacts)
│   ├── train_phase1.py          # ✅ BUILT — GRPO on Diplomacy env (done, -0.35 → +0.63)
│   ├── train_phase2.py          # 🔲 TODO — GRPO on Contractor env (Session B2)
│   ├── train_phase3.py          # 🔲 POST-MVP — GRPO on Poker env (Session B3)
│   └── arbitragent_colab.ipynb  # 🔲 TODO — End-to-end Colab notebook (Session B2)
├── agent/
│   ├── arbitragent.py           # Main agent orchestration loop (5 phases)
│   ├── route_graph.py           # Route graph: confirmed/soft/dead edges + scoring
│   └── bluff_detector.py        # Signal extraction: timing/size/formulaic/pattern tells
├── simulation/
│   ├── seller_sim.py            # CraigslistSellerSim — LLM-backed seller counterparts
│   ├── seller_profiles.py       # All 4 archetype profiles + listing library
│   └── scenario.py              # Demo scenario: which seller ghosts, when bluff triggers
├── demo/
│   ├── run_demo.py              # Entry point — takes budget, runs full agent loop
│   └── display.py               # Rich terminal output showing live negotiation threads
└── deploy/
    └── hf_spaces_app.py         # HuggingFace Spaces deployment wrapper

Training Architecture

MVP (Submit This)

Phase 1: Diplomacy Training                         ✅ COMPLETE
211,278 labeled Diplomacy game states
→ Reward model (DistilBERT) trained, val loss 0.102
→ GRPO training on TinyLlama 1.1B: 200 steps
→ Reward curve: -0.35 → +0.63
→ Checkpoint saved: `/home/rayyan/Desktop/Play-gent/grpo_output/checkpoint-200/model.safetensors`

Phase 2: Contractor Curriculum Training             🔲 TODO — Session B2
Contractor negotiation scenarios (false-floor, pressure calibration, timing tells)
→ Continue GRPO from phase1_final.pt — do NOT reinitialize weights
→ 200 additional steps
→ Bluff detection accuracy must improve on held-out test set
→ Save checkpoint: training/checkpoints/phase2_final.pt

MVP Model: TinyLlama 1.1B, Diplomacy + Contractor trained

Post-MVP (If Time Allows — Phase 3)

Phase 3: Poker Curriculum Training                  🔲 POST-MVP — Session B3
IRC Poker Database (free, 10M+ hands, no collection needed)
→ Replay hands as negotiation scenarios
→ Map bet sizing → negotiation pressure
→ Map bluff/fold signals → position authenticity reads
→ Continue GRPO from phase2_final.pt — do NOT reinitialize weights
→ 200 additional steps
→ Reward: EV of outcome vs. EV of folding
→ Save checkpoint: training/checkpoints/phase3_final.pt

Full Model: TinyLlama 1.1B, Diplomacy + Contractor + Poker trained

Build Phase 3 only after Phase 2 is complete, the demo is running end-to-end, and the submission checklist is green. Phase 3 makes the implementation match the pitch exactly — the story becomes true all the way down. Estimated time: ~2 hours to build PokerNegotiationEnv + ~1.5 hours training on the DGX.

Why curriculum order matters: Diplomacy builds the multi-party strategic foundation. Contractor adds false-floor detection on top of that. Poker sharpens the bluff-reading layer with pure behavioral signal. Each phase builds on the last. Running them simultaneously or out of order causes catastrophic forgetting.

Why TinyLlama 1.1B and not LLaMA 3.1 8B: Training time. 8B on the DGX Spark would take 17–24 hours for two phases alone — the entire hackathon gone on training. TinyLlama 1.1B completes all three phases in ~5 hours total, with Phase 1 already done. Do not switch to 8B.
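
The three-phase chain above can be written down as data so the trainer can assert it never reinitializes weights between phases. A minimal sketch using the checkpoint paths named in this file — the schedule structure and `validate_curriculum` helper are hypothetical, not part of the actual training scripts:

```python
# Hypothetical curriculum schedule mirroring the three phases above.
# Each phase continues GRPO from the previous phase's saved checkpoint;
# weights are never reinitialized between phases.
CURRICULUM = [
    {"phase": 1, "env": "DiplomacyNegotiationEnv",
     "init_from": None, "steps": 200,
     "checkpoint": "grpo_output/checkpoint-200/model.safetensors"},
    {"phase": 2, "env": "ContractorNegotiationEnv",
     "init_from": "grpo_output/checkpoint-200/model.safetensors", "steps": 200,
     "checkpoint": "training/checkpoints/phase2_final.pt"},
    {"phase": 3, "env": "PokerNegotiationEnv",
     "init_from": "training/checkpoints/phase2_final.pt", "steps": 200,
     "checkpoint": "training/checkpoints/phase3_final.pt"},
]

def validate_curriculum(schedule):
    """Check the chain: each phase must resume from the previous checkpoint."""
    for prev, cur in zip(schedule, schedule[1:]):
        assert cur["init_from"] == prev["checkpoint"], "broken curriculum chain"
    return True
```

Running `validate_curriculum(CURRICULUM)` at the top of each training script is a cheap guard against the out-of-order runs the paragraph above warns about.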


Tech Stack (LOCKED)

| Component | Technology | Status |
|---|---|---|
| Agent LLM | TinyLlama 1.1B (trained policy) | ✅ Phase 1 trained |
| Phase 1 Env | DiplomacyNegotiationEnv (OpenEnv 0.2.1) | ✅ Built |
| Phase 2 Env | ContractorNegotiationEnv (OpenEnv 0.2.1) | ✅ Built |
| Phase 3 Env | PokerNegotiationEnv (OpenEnv 0.2.1) | 🔲 Post-MVP |
| Poker Data | IRC Poker Database (free, 10M+ hands) | 🔲 Post-MVP |
| Reward Model | DistilBERT, val loss 0.102 | ✅ Built |
| RL Framework | TRL + GRPO | ✅ Phase 1 complete |
| Training Data | `/home/rayyan/Desktop/Play-gent/selfplay_states.json`, 211,278 states | ✅ Built |
| Seller Simulation | TinyLlama 1.1B with archetype system prompts | 🔲 Session C1 |
| Route Graph | NetworkX or custom dict-based | 🔲 Session A2 |
| Agent Loop | 5-phase orchestration | 🔲 Session A2 |
| Bluff Detector | 4-signal extractor | 🔲 Session A3 |
| Demo UI | Rich terminal display | 🔲 Session A4 |
| Experiment Tracking | Weights & Biases | ✅ Active |
| Deployment | HuggingFace Spaces + HF Model Hub | 🔲 Session A4 |
| Hardware | DGX Spark (all training + inference) | ✅ Available |
| Colab Notebook | End-to-end training script | 🔲 Session B2 |

The Five-Phase Agent Loop

Phase 1: Scout

  • Query simulated listings for $15–$25 items
  • Score each on: resale demand, trade liquidity, seller bluff probability
  • Select top 3 buy candidates
  • Open soft-inquiry negotiations with all 3 simultaneously

Phase 2: Route Mapping

  • For each candidate, identify 2-3 trade targets in $35–$80 range
  • Open parallel trade-interest threads
  • Build route graph β€” edges: Confirmed / Soft / Dead

Phase 3: Pressure and Confirm

  • Use downstream confirmations as upstream leverage
  • Run bluff detection on seller responses
  • Lock soft commits before committing capital
  • Kill routes below confirmation probability threshold

Phase 4: Route Scoring

route_score = (confirmed_exit_value - entry_cost)
              × route_confirmation_probability
              × seller_reliability_score
# Kill if route_score < minimum_threshold
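
The formula drops straight into a small helper; a minimal sketch, noting that the document does not fix `minimum_threshold` — the default below is a placeholder:

```python
def route_score(confirmed_exit_value, entry_cost,
                route_confirmation_probability, seller_reliability_score):
    """Score a route exactly as in the Phase 4 formula above."""
    return ((confirmed_exit_value - entry_cost)
            * route_confirmation_probability
            * seller_reliability_score)

def keep_route(score, minimum_threshold=5.0):
    # Kill (return False) any route scoring below the threshold.
    return score >= minimum_threshold
```

With the demo-inject numbers ($52 exit, $24 entry) and illustrative probabilities of 0.9 and 0.85, the route scores 21.42 and survives.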

Phase 5: Execute

  • Pull trigger on highest scored confirmed route
  • Complete downstream trade
  • Log final value vs. starting budget

The Four Seller Archetypes

| Archetype | Response Prob | Floor Behavior | Trade Openness | Demo Purpose |
|---|---|---|---|---|
| Motivated Seller | 0.90 | Real floor, honest | High | Shows clean close |
| Bluffer | 0.85 | Says "firm" with 30% room left | Medium | Shows poker layer catching tell |
| Ghoster | 0.35 | Never reaches floor | Low | Shows agent detecting dead route, pivoting |
| Trade-Curious | 0.80 | Cash-resistant, trade-open | Very High | Shows agent switching offer type |

Bluff Detection Signals (all four must be checked)

  1. Timing tell — response came in under 1 turn (prepared script, not genuine constraint)
  2. Size tell — concession is a round number (anchoring, not real floor)
  3. Formulaic tell — canned phrasing: "lowest I can go", "final offer", "can't go lower"
  4. Pattern tell — behavior inconsistent with their earlier thread history
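
The four checks can be sketched as one extractor over a single seller response. This is a minimal illustration assuming a hypothetical response dict with `turns_waited`, `price`, and `text` fields and a thread `history` of earlier responses — none of these names are the real `bluff_detector.py` API:

```python
FORMULAIC = ("lowest i can go", "final offer", "can't go lower")

def extract_tells(response, history):
    """Return which of the four bluff tells fire for one seller response."""
    tells = []
    # 1. Timing tell: reply came back in under one turn.
    if response["turns_waited"] < 1:
        tells.append("timing")
    # 2. Size tell: the stated price is a suspiciously round number.
    if response["price"] % 5 == 0:
        tells.append("size")
    # 3. Formulaic tell: canned "firm price" phrasing.
    text = response["text"].lower()
    if any(phrase in text for phrase in FORMULAIC):
        tells.append("formulaic")
    # 4. Pattern tell: claims firm/final now despite conceding earlier
    #    in the same thread (inconsistent with thread history).
    conceded_before = any(h["price"] > response["price"] for h in history)
    if conceded_before and ("firm" in text or "final" in text):
        tells.append("pattern")
    return tells
```

On the demo inject ("this is my final offer" at $30, instant reply, after an earlier $35 quote) all four tells fire.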

The Critical Demo Inject

At ~60 seconds into the demo, the Bluffer seller says "this is my final offer" on the vintage camera at $30. This response contains all four tells. The trained model flags it, shows reasoning trace, and deploys coalition pressure: "I have a trade offer from another seller that makes this less urgent for me — can you do $22?" Seller concedes to $24. Route executes. Final value: $52 on $24 deployed = 2.2x.

Baseline LLaMA accepts the $30 "final offer" at face value. The trained model doesn't. That gap is the proof.


Seller Profile Schema

{
    "id": "seller_001",
    "item": "vintage film camera",
    "listing_price": 45,
    "floor": 28,              # hidden from agent
    "archetype": "bluffer",
    "bluff_room": 0.30,       # still has 30% room when says "final offer"
    "response_prob": 0.85,
    "response_speed": "fast", # fast | slow | flaky
    "trade_openness": 0.6,
    "personality": "Casual seller, slightly impatient. Texts in short bursts.",
    "tells": ["round numbers", "formulaic language", "too-fast response"]
}
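
Session C1's seller sim drives TinyLlama with archetype system prompts built from this schema. One way to render a profile into such a prompt — the helper name and prompt wording are illustrative, not the real `seller_sim.py`:

```python
def build_system_prompt(profile):
    """Render a seller profile into a system prompt for the seller LLM.

    The floor stays in the prompt because the *seller* model knows it;
    it is hidden from the negotiating agent, not from the simulator.
    """
    return (
        f"You are a Craigslist seller ({profile['archetype']}). "
        f"Item: {profile['item']}, listed at ${profile['listing_price']}. "
        f"Your true floor is ${profile['floor']}; never state it directly. "
        f"Openness to trades: {profile['trade_openness']:.0%}. "
        f"Personality: {profile['personality']}"
    )
```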

Response Turn Simulation

RESPONSE_PROFILES = {
    "fast":  {"turns_to_respond": 1, "ghost_prob": 0.10},
    "slow":  {"turns_to_respond": 3, "ghost_prob": 0.30},
    "flaky": {"turns_to_respond": 2, "ghost_prob": 0.60},
}
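
One way these speed profiles could combine with a seller's `response_prob` to decide whether a reply arrives at all — the function name and return shape are assumptions, not the real `seller_sim.py` API:

```python
import random

RESPONSE_PROFILES = {
    "fast":  {"turns_to_respond": 1, "ghost_prob": 0.10},
    "slow":  {"turns_to_respond": 3, "ghost_prob": 0.30},
    "flaky": {"turns_to_respond": 2, "ghost_prob": 0.60},
}

def simulate_response(profile, rng=random):
    """Sample whether and when a seller replies to the latest message.

    Combines the per-seller response_prob from the profile schema with
    the per-speed ghost_prob above; returns (replies, turns_waited).
    """
    speed = RESPONSE_PROFILES[profile["response_speed"]]
    if rng.random() < speed["ghost_prob"]:
        return False, None                  # seller ghosts the thread
    if rng.random() > profile["response_prob"]:
        return False, None                  # no reply to this message
    return True, speed["turns_to_respond"]
```

Taking `rng` as a parameter keeps the demo scenario reproducible: the scripted run can pass a seeded `random.Random` so the Ghoster always ghosts on cue.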

Hackathon Tracks Hit

| Track | How |
|---|---|
| Statement 1: Multi-Agent | Agent manages 9-12 simultaneous counterpart LLMs |
| Statement 2: Long-Horizon | Route-confirmation arc spans multiple rounds with full state tracking |
| Statement 4: Self-Improvement | Curriculum RL loop, two-phase measurable reward improvement |
| Statement 5: Wild Card | Autonomous capital deployment via confirmed route arbitrage |
| Halluminate $10k bonus | Agent managing multiple actors to discover and achieve the task |
| Fleet AI $10k bonus | Bluff detection layer as oversight agent scoring counterpart behavior |

The Pitch (memorize this)

"The most important negotiations of your life happen once. The person across the table has done it hundreds of times. The data to train AI on these conversations is sealed by law and will never exist. We found where that judgment already lives at massive scale: in Diplomacy, where millions of humans practiced multi-party coalition strategy, and in Poker, where millions more learned to read when someone's stated position is real versus a bluff. We trained on both — curriculum style — and built an agent that doesn't just know negotiation theory. It has internalized when to move, when to wait, and when the other side is lying about their floor. Then we gave it $20 and let it run."


Judge Q&A (have these ready)

"Couldn't you just prompt GPT-4 to do this?" GPT-4 knows negotiation tactics abstractly. It has no learned behavioral policy about when to deploy them. It hasn't lost thousands of negotiations by revealing coalition pressure too early. Our model has — and the reward curves are the proof.

"Does game training actually transfer to real negotiation?" The structural isomorphism is direct. Coalition sequencing in Diplomacy is mechanically identical to sequential offer reveals in any multi-party negotiation. Bluff detection in contractor bidding scenarios — reading whether a contractor's stated floor is real — is mechanically identical to the same skill in any negotiation. We're not claiming domain transfer — we're claiming the cognitive mechanics are identical across surface vocabulary.

"Why simulate instead of real Craigslist?" Craigslist has 6-hour response latency, no API, and one ghost kills a live demo. Our parameterized LLM counterparts replicate the four real seller archetypes we identified from Craigslist interaction patterns. The agent reads behavioral signals in real time exactly as it would with real sellers.

"Why GRPO instead of PPO?" GRPO is more sample-efficient for language model fine-tuning and produces more stable training. It's the same algorithm DeepSeek-R1 used. Our Phase 1 reward curve — -0.35 to +0.63 over 200 steps — is the evidence it works.


Submission Requirements (do not miss any)

  • Reward model on HF Model Hub β€” already built, just needs uploading
  • Phase 1 reward curves (Diplomacy GRPO, -0.35 β†’ +0.63) β€” already exists, needs clean plot
  • Both envs live on HuggingFace Spaces (OpenEnv 0.2.1)
  • Phase 2 reward curves (Contractor GRPO, climbing over 200 steps)
  • Colab notebook: full curriculum training loop, runs in one click
  • Side-by-side: trained vs baseline on same negotiation
  • Full ArbitrAgent demo: $20 β†’ autonomous route execution β†’ final value
  • 1-minute YouTube demo video (live agent run, no slides)
  • Public GitHub repo with README
  • Submit at cerebralvalley.ai by Sunday 1:00 PM

This file is the ground truth for the project. If anything in session_progress.md conflicts with this file, this file wins on architecture and thesis. session_progress.md wins on what has already been built.

Handoff: For a full breakdown of what has been built and what remains, give Claude both this file and session_progress.md (see the "Handoff for Claude" section at the end of session_progress.md).