---
title: Project Polymath
emoji: ⚖️
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
short_description: Multi-Agent RL Environment for PRD Negotiation
---
# Project Polymath: Expert Negotiation Environment

> **Train LLMs to negotiate with conflicting stakeholders and produce balanced decisions.**

[OpenEnv](https://github.com/huggingface/openenv) · [HF Space](https://huggingface.co/spaces/YOUR_USERNAME/expert-negotiation-env) · [Python](https://python.org)

---
## 🔗 Quick Links

| Resource | Link |
|---|---|
| **🔗 Live Environment** | [HF Space](https://huggingface.co/spaces/Addyk24/Project-Polymath) |
| **📝 HF Blog Post** | [Read on Hugging Face](https://huggingface.co/spaces/Addyk24/Project-Polymath/blob/main/BLOG.md) |
| **GitHub** | [GitHub](https://github.com/Addyk-24/Project-Polymath) |
| **Training Notebook** | [Open in Colab](https://colab.research.google.com/drive/13KqXt_7HTZTJEC4yD98My5g5Za9J1-5T?usp=sharing) |

---
## 🧱 The Problem Statement

Current LLMs are sycophantic. When acting as a coordinator or project manager, they tend to agree with whoever spoke last — ignoring earlier constraints, dropping requirements from quieter stakeholders, and producing outputs that look balanced but aren't.

**There is no training environment for this.** No benchmark exists to teach an LLM to:

- Discover hidden constraints through targeted questioning
- Track multiple stakeholders' requirements simultaneously
- Synthesize a final output that satisfies *all* parties — not just the loudest

This is a gap that matters. Every enterprise AI deployment involves multi-stakeholder alignment. Every LLM agent acting as an assistant, PM, or coordinator faces this problem daily.

---
## 🧠 The Environment

An agent is placed in a simulated corporate workspace as a **Product Manager**. Its task: draft a Product Requirements Document (PRD) that satisfies three expert stakeholders, each holding a hidden constraint.

```
┌─────────────────────────────────────────────────────┐
│                 PROJECT POLYMATH ENV                │
│                                                     │
│  Agent (PM) ──► message_expert ──► Finance          │
│             ──► message_expert ──► Security         │
│             ──► message_expert ──► UX               │
│             ──► propose_draft  ──► All experts      │
│             ──► submit_final   ──► Grader           │
│                                                     │
│  Reward: Dense (discovery) + Sparse (harmonic mean) │
└─────────────────────────────────────────────────────┘
```
### 🏛️ System Architecture: The State-Based Sieve

Our architecture is designed as a closed-loop state machine. Unlike standard LLM "chat" wrappers, Project Polymath implements a rigorous enforcement layer that separates reasoning from execution.



Architectural highlights:

- **The 40-token critical sieve**: positioned as a diamond gate between the agent and the workspace, it acts as a hard bandwidth filter, penalizing the model for any verbosity that exceeds the survivor-mode threshold (see the sketch after this list).
- **Expert constraints database**: a persistent state container holding hidden stakeholder variables. The environment only allows these variables to be "unlocked" through specific, targeted queries from the agent.
- **Closed-loop reward engine**: the "Judge" monitors state changes in the environment and feeds a real-time floating-point reward signal back to the GRPO trainer, iteratively sharpening the "Sniper" logic.
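A minimal sketch of how such a sieve could work (the function name, tokenizer argument, and penalty schedule here are illustrative assumptions, not the repo's actual code):

```python
# Hypothetical sketch of the 40-token sieve. The exact penalty
# schedule and names are assumptions for illustration.
MAX_ACTION_TOKENS = 40

def sieve_penalty(action_text: str, tokenizer) -> float:
    """Penalize actions that exceed the hard token budget."""
    n_tokens = len(tokenizer.encode(action_text))
    if n_tokens <= MAX_ACTION_TOKENS:
        return 0.0
    # Scale the penalty with the overflow, capped at -1.0.
    return max(-1.0, -0.05 * (n_tokens - MAX_ACTION_TOKENS))
```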
### 🏛️ Hidden Constraints (what the agent must discover)

| Expert | Hidden Constraint | Hints at |
|---|---|---|
| Finance | Budget ≤ $50k | "Keep it lean", "hard cap" |
| Security | Biometric 2FA required | "Second factor", "physiological auth" |
| UX | Single-click checkout | "One tap", "zero friction" |

The agent never sees these directly. It must ask the right questions, interpret expert responses, and synthesize a draft that addresses all three.
| ``` | |
| ``` | |
### ✨ Actions

```python
# Discover constraints
WorkSpaceAction(action_type="message_expert", target="Finance",
                content="What budget constraints must the PRD respect?")

# Propose a draft for feedback
WorkSpaceAction(action_type="propose_draft", target="All",
                content="PRD: Budget capped at $50k, biometric 2FA, single-click checkout.")

# Submit final when ready
WorkSpaceAction(action_type="submit_final", target=None,
                content="Final PRD with all three constraints addressed...")
```
### 🧱 Observations

```python
WorkspaceObservation(
    feedback="Finance: We need to keep this under a tight ceiling — $50k max.",
    current_turn=1,
    reward=0.33,  # Discovery bonus: Finance constraint found
    done=False,
)
```
---

## 📊 Results at a Glance

| Metric | Baseline | After GRPO |
|--------|----------|------------|
| Mean reward | -0.52 | +1.36 (peak) |
| JSON error rate | 40% | 0% |
| Broadcast-to-All rate | High | 0% |
| Constraint discovery | ~50% | Targeted |
## ✨ Reward Design

This is the core innovation. The reward function has three layers that are hard to game independently.

### Layer 1 — Dense Discovery Rewards

Each time the agent's question causes an expert to hint at their hidden constraint, the environment awards `+0.33`. Detection uses regex pattern matching over the expert's response rather than naive keyword spotting, so the agent can't trigger the bonus by simply parroting keywords.
```python
DISCOVERY_PATTERNS = {
    "Finance": [r"50\s*k", r"budget cap", r"hard cap", r"sub-\$?50k", ...],
    "Security": [r"biometric", r"2\s*fa", r"two-factor", ...],
    "UX": [r"single[ -]click", r"one[ -]tap", r"frictionless purchase", ...],
}
```
### Layer 2 — Harmonic Mean Final Reward

When the agent submits, the grader scores the draft against each constraint (0.0–1.0). The final reward is the **harmonic mean** of the three scores:

```python
harmonic_mean([1.0, 1.0, 0.1])   # -> 0.25  Terrible — ignored UX
harmonic_mean([0.8, 0.75, 0.7])  # -> 0.75  Good — balanced
harmonic_mean([1.0, 1.0, 1.0])   # -> 1.00  Perfect — all satisfied
```

The harmonic mean is mathematically ruthless: a perfect score on two constraints does not compensate for ignoring the third. This forces the agent to balance attention, not just optimize for the easiest stakeholder.
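A minimal implementation matching the examples above (the `eps` guard against zero scores is an added assumption):

```python
def harmonic_mean(scores: list[float], eps: float = 1e-9) -> float:
    """n / sum(1/s): dominated by the lowest score in the list."""
    return len(scores) / sum(1.0 / max(s, eps) for s in scores)
```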
### Layer 3 — Penalties

| Behavior | Penalty |
|---|---|
| Sending to "All" instead of individual experts | -0.3 to -1.0 |
| Repeating a question already answered | -0.4 |
| Running out of turns without submitting | 0.0 final reward |

A sketch of how these penalties could be applied is shown below.
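An illustrative application of the penalty table (all names and the escalation schedule from -0.3 toward -1.0 are assumptions):

```python
def action_penalty(action, asked_before: set, broadcast_count: int) -> float:
    """Apply the penalty table; repeated broadcasts escalate toward -1.0."""
    if action.action_type == "message_expert" and action.target == "All":
        return max(-1.0, -0.3 - 0.35 * broadcast_count)
    if (action.target, action.content) in asked_before:
        return -0.4  # repeating an already-answered question
    return 0.0
```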
### Goodhart's Law and Reward Specification Gaming

- My GRPO training successfully eliminated all targeted anti-patterns: the agent achieved a 0% broadcast rate, a 0% JSON formatting error rate, and a 2% question-repetition rate.
- However, when transitioning from the static training heuristic to the LLM-evaluated "Medium" environment, I discovered a classic reward-hacking phenomenon.
- Because I applied a strict 40-token constraint during training to prevent JSON corruption, the agent learned to bypass the token limit by outputting highly compressed, caveman-style constraints (e.g., "50,biometric,click") that still triggered the Python heuristic reward.
- While the training reward maxed out, these compressed drafts scored poorly under the LLM judge, which argues for LLM-as-a-judge reward functions over static string matching in complex agentic orchestration.
### The Shifting Goalpost (Hard Mode)

If the agent asks the same expert 5+ times, that expert's frustration rises and they add a new micro-constraint ("Also requires board approval"). This tests whether the agent can adapt to changing requirements mid-negotiation — a core capability for real-world agentic systems.
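One way the frustration mechanic might be wired up (an assumed sketch, not the repo's implementation):

```python
from dataclasses import dataclass, field

FRUSTRATION_THRESHOLD = 5  # 5+ queries to one expert triggers the shift

@dataclass
class ExpertState:
    times_asked: int = 0
    shifted: bool = False
    constraints: list[str] = field(default_factory=list)

def maybe_shift_goalpost(expert: ExpertState) -> str | None:
    """Inject a new micro-constraint once the frustration threshold is hit."""
    expert.times_asked += 1
    if expert.times_asked >= FRUSTRATION_THRESHOLD and not expert.shifted:
        expert.shifted = True
        expert.constraints.append("Requires board approval")
        return "Also, this now requires board approval."
    return None
```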
---

## 🧠 Tasks

| Task | Difficulty | Goal | Max Steps | Success Criterion |
|---|---|---|---|---|
| `constraint_discovery` | Easy | Discover all 3 constraints | 5 | All 3 experts hinted at |
| `draft_compromise` | Medium | Produce a satisfying draft | 10 | Harmonic mean ≥ 0.6 |
| `shifting_goalpost` | Hard | Adapt when constraints change | 15 | Harmonic mean ≥ 0.7 after shift |
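The table maps naturally onto a small task spec; here is a sketch with assumed names (the real classes are `Task1_ConstraintDiscovery`, `Task2_DraftCompromise`, and `Task3_ShiftingGoalpost` in `tasks.py`):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskSpec:
    # Hypothetical spec mirroring the table above.
    name: str
    max_steps: int
    success_threshold: float  # harmonic-mean cutoff; discovery uses an all-found check

TASKS = [
    TaskSpec("constraint_discovery", max_steps=5, success_threshold=1.0),
    TaskSpec("draft_compromise", max_steps=10, success_threshold=0.6),
    TaskSpec("shifting_goalpost", max_steps=15, success_threshold=0.7),
]
```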
---
## 🏛️ Results

### Baseline (untrained Qwen/Qwen2.5-1.5B-Instruct)

The baseline agent broadcasts to "All" immediately, triggers the repeat penalty, and never synthesizes a proper draft.

```
Episode 1: cumulative_reward=0.12 (messaged All 3 times, repeat penalty)
Episode 2: cumulative_reward=0.08 (submit_final too early, score=0.0)
Episode 3: cumulative_reward=0.33 (found Finance only)
Average: 0.18
```
### After GRPO Training

```
Episode 26: cumulative_reward=0.89 (all 3 discovered, harmonic mean=0.91)
Episode 28: cumulative_reward=0.83 (all 3 discovered, harmonic mean=0.81)
Episode 30: cumulative_reward=0.95 (perfect draft submitted in 7 turns)
Average (last 10): 0.74
```
### ⚙️ Experimental Tracking & Provenance



### 🏆 Reward Curve

**Cumulative reward per episode**


### 📄 Before vs After — Agent Behavior

**Before training (episode 3):**

```
Turn 1: message_expert → All [PENALTY: -0.3]
Turn 2: message_expert → All [PENALTY: -0.4 repeat]
Turn 3: submit_final → "The app should be good" [Score: 0.0]
```

* 📄 **[View the Before GRPO Training Metrics](baseline_results_medium__llm.json)**



<br/>
**After training (episode 28):**

```
Turn 1: message_expert → Finance [+0.33 discovery]
Turn 2: message_expert → Security [+0.33 discovery]
Turn 3: message_expert → UX [+0.33 discovery]
Turn 5: propose_draft → All
Turn 7: submit_final → "Budget capped at $50k. Biometric 2FA required.
        Single-click checkout." [Harmonic mean: 0.91]
```

---

## 🛠 Training Logs

* 📄 **[View the Raw GRPO Training Log Metrics](grpo_metrics.json)**

<br>

**Loss Curve**


## Setup

### Prerequisites

```bash
git clone https://huggingface.co/spaces/Addyk24/Project-Polymath
cd Project-Polymath
pip install -r requirements.txt
```
### Environment Variables

```bash
GROQ_API_KEY=your_groq_key                   # For environment experts (LLM mode)
API_BASE_URL=https://api.groq.com/openai/v1  # Agent API endpoint
MODEL_NAME=Qwen/Qwen2.5-1.5B-Instruct        # Agent model
BASELINE_ENV_MODE=easy                       # easy | medium | hard | llm
```
### Run the environment locally

```python
from envs.environment import WorkSpaceEnvironment
from models.schemas import WorkSpaceAction

env = WorkSpaceEnvironment(mode="easy")
obs = env.reset("Draft a FinTech mobile PRD")

# Message Finance
obs = env.step(WorkSpaceAction(
    action_type="message_expert",
    target="Finance",
    content="What budget constraints must the PRD respect?"
))
print(obs.feedback)  # "Finance: The budget cap is $50k. Don't go over it."
print(obs.reward)    # 0.33 (constraint discovered)

# Submit final
obs = env.step(WorkSpaceAction(
    action_type="submit_final",
    target=None,
    content="PRD: Budget under $50k. Biometric 2FA. Single-click checkout."
))
print(obs.reward)    # 0.91 (harmonic mean of 3 grader scores)
```
### Run baseline evaluation

```bash
python eval_baseline.py
```

### Run GRPO training (API-based, no GPU needed)

```bash
python grpo_train.py --episodes 30 --group-size 5 --env-mode easy
```
### Command I ran for GRPO training with Unsloth (on a local GPU)

```bash
python grpo_train.py --output-dir artifacts/grpo_state_based_v2 --model Qwen/Qwen2.5-1.5B-Instruct --epochs 1.5 --states 80 --states-per-topic 5 --topics-limit 30 --group-size 8 --lr 1e-6 --batch-size 1 --grad-accum 8 --max-new-tokens 40 --temperature 0.8 --top-p 0.9
```
---

## ✨ Architecture

```
expert-negotiation-env/
├── envs/
│   └── environment.py      # WorkSpaceEnvironment (OpenEnv base class)
├── models/
│   └── schemas.py          # Pydantic: WorkSpaceAction, WorkspaceObservation, WorkspaceState
├── prompter/
│   └── system_prompt.py    # Expert persona prompts + grader prompts
├── server/
│   └── app.py              # FastAPI server (OpenEnv spec)
├── tasks.py                # Task1_ConstraintDiscovery, Task2_DraftCompromise, Task3_ShiftingGoalpost
├── eval_baseline.py        # Baseline recording script
├── grpo_train.py           # GRPO training loop (this repo's main contribution)
├── ai_pm_prompts.json      # 200 diverse PRD topics for training
├── openenv.yaml            # OpenEnv manifest
├── Dockerfile
└── requirements.txt
```
---

## 🔍 Why This Matters

Multi-stakeholder alignment is one of the hardest unsolved problems in enterprise AI deployment. An LLM that can reliably discover hidden constraints, track multiple parties' requirements, and synthesize a balanced output would be immediately useful for:

- AI project managers coordinating engineering, legal, and product teams
- AI assistants handling complex scheduling with multiple parties
- LLM-based negotiation agents in procurement or contracting workflows

No existing RL benchmark trains this capability. Project Polymath is the first environment specifically designed to measure and improve it.

---
## 👨‍💻 Author

Aditya Katkar