---
title: Project Polymath
emoji: βš–οΈ
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
short_description: Multi-Agent RL Environment for PRD Negotiation
---

# Project Polymath: Expert Negotiation Environment

> **Train LLMs to negotiate with conflicting stakeholders and produce balanced decisions.**

[![OpenEnv](https://img.shields.io/badge/OpenEnv-latest-blue)](https://github.com/huggingface/openenv)
[![HF Space](https://img.shields.io/badge/HuggingFace-Space-yellow)](https://huggingface.co/spaces/Addyk24/Project-Polymath)
[![Python](https://img.shields.io/badge/Python-3.11+-green)](https://python.org)

---

## πŸ”— Quick Links

| Resource | Link |
|---|---|
| **πŸ”— Live Environment** | [HF Space](https://huggingface.co/spaces/Addyk24/Project-Polymath) |
| **πŸ“ HF Blog Post** | [Read on Hugging Face](https://huggingface.co/spaces/Addyk24/Project-Polymath/blob/main/BLOG.md) |
| **GitHub Link** | [GitHub](https://github.com/Addyk-24/Project-Polymath) |
| **Training Notebook** | [Open in Colab](https://colab.research.google.com/drive/13KqXt_7HTZTJEC4yD98My5g5Za9J1-5T?usp=sharing) |

---

## 🧱 The Problem Statement

Current LLMs are sycophantic. When acting as a coordinator or project manager, they tend to agree with whoever spoke last β€” ignoring earlier constraints, dropping requirements from quieter stakeholders, and producing outputs that look balanced but aren't.

**There is no training environment for this.** No benchmark exists to teach an LLM to:

- Discover hidden constraints through targeted questioning
- Track multiple stakeholders' requirements simultaneously
- Synthesize a final output that satisfies *all* parties β€” not just the loudest

This is a gap that matters. Every enterprise AI deployment involves multi-stakeholder alignment. Every LLM agent acting as an assistant, PM, or coordinator faces this problem daily.

---

## 🧠 The Environment

An agent is placed in a simulated corporate workspace as a **Product Manager**. Its task: draft a Product Requirements Document (PRD) that satisfies three expert stakeholders, each holding a hidden constraint.

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                 PROJECT POLYMATH ENV                   β”‚
β”‚                                                        β”‚
β”‚  Agent (PM) ──► message_expert ──► Finance             β”‚
β”‚             ──► message_expert ──► Security            β”‚
β”‚             ──► message_expert ──► UX                  β”‚
β”‚             ──► propose_draft  ──► All experts         β”‚
β”‚             ──► submit_final   ──► Grader              β”‚
β”‚                                                        β”‚
β”‚  Reward: Dense (discovery) + Sparse (harmonic mean)    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

### πŸ›οΈ System Architecture: The State-Based Sieve

Our architecture is designed as a closed-loop state machine. Unlike standard LLM "chat" wrappers, Project Polymath implements a rigorous enforcement layer that separates reasoning from execution.

![architecture](system_architecture.png)

Architectural highlights:

- **The 40-Token Critical Sieve**: Positioned as a diamond gate between the Agent and the Workspace. It acts as a hard bandwidth filter, ensuring the model is penalized for any verbosity that exceeds the survivor-mode threshold (see the sketch after this list).
- **Expert Constraints Database**: A persistent state container holding hidden stakeholder variables. The Environment only allows these variables to be "unlocked" through specific, targeted queries from the agent.
- **Closed-Loop Reward Engine**: The "Judge" monitors state changes in the environment and provides a real-time floating-point reward signal back to the GRPO trainer, iteratively sharpening the "Sniper" logic.
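To make the sieve concrete, here is a minimal illustrative sketch of such a gate. Only the 40-token threshold comes from the design above; the whitespace tokenization, the `critical_sieve` name, and the `-0.3` penalty value are assumptions for illustration, not the environment's actual implementation.

```python
# Illustrative sketch only: the real gate lives inside the environment and may
# differ in detail. Assumes whitespace tokenization and a -0.3 penalty.
from dataclasses import dataclass

TOKEN_BUDGET = 40          # the "40-Token Critical Sieve" threshold
VERBOSITY_PENALTY = -0.3   # hypothetical penalty for exceeding the budget

@dataclass
class SieveResult:
    content: str     # text allowed through to the workspace
    penalty: float   # reward adjustment applied this turn

def critical_sieve(raw_output: str) -> SieveResult:
    """Hard bandwidth filter between the agent and the workspace."""
    tokens = raw_output.split()
    if len(tokens) <= TOKEN_BUDGET:
        return SieveResult(content=raw_output, penalty=0.0)
    # Over budget: truncate the action text and penalize the verbosity.
    truncated = " ".join(tokens[:TOKEN_BUDGET])
    return SieveResult(content=truncated, penalty=VERBOSITY_PENALTY)
```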
### πŸ›οΈ Hidden Constraints (what the agent must discover)

| Expert | Hidden Constraint | Hints at |
|---|---|---|
| Finance | Budget ≀ $50k | "Keep it lean", "hard cap" |
| Security | Biometric 2FA required | "Second factor", "physiological auth" |
| UX | Single-click checkout | "One tap", "zero friction" |

The agent never sees these directly. It must ask the right questions, interpret expert responses, and synthesize a draft that addresses all three.

### ✨ Actions

```python
# Discover constraints
WorkSpaceAction(action_type="message_expert", target="Finance",
                content="What budget constraints must the PRD respect?")

# Propose a draft for feedback
WorkSpaceAction(action_type="propose_draft", target="All",
                content="PRD: Budget capped at $50k, biometric 2FA, single-click checkout.")

# Submit final when ready
WorkSpaceAction(action_type="submit_final", target=None,
                content="Final PRD with all three constraints addressed...")
```

### 🧱 Observations

```python
WorkspaceObservation(
    feedback="Finance: We need to keep this under a tight ceiling β€” $50k max.",
    current_turn=1,
    reward=0.33,  # Discovery bonus: Finance constraint found
    done=False,
)
```

---

**Headline results (baseline vs. after GRPO):**

| Metric | Baseline | After GRPO |
|--------|----------|------------|
| Mean reward | -0.52 | +1.36 (peak) |
| JSON error rate | 40% | 0% |
| Broadcast-to-All rate | high | 0% |
| Constraint discovery | ~50% | targeted |

## ✨ Reward Design

This is the core innovation. The reward function has three layers that are hard to game independently (a combined sketch follows the penalty table below).

### Layer 1 β€” Dense Discovery Rewards

Each time the agent's question causes an expert to hint at their hidden constraint, the environment awards `+0.33`. Detection uses regex pattern matching against the expert's reply rather than simple keyword spotting, so the agent can't game it by stuffing keywords into its own questions.

```python
DISCOVERY_PATTERNS = {
    "Finance": [r"50\s*k", r"budget cap", r"hard cap", r"sub-\$?50k", ...],
    "Security": [r"biometric", r"2\s*fa", r"two-factor", ...],
    "UX": [r"single[ -]click", r"one[ -]tap", r"frictionless purchase", ...],
}
```

### Layer 2 β€” Harmonic Mean Final Reward

When the agent submits, the grader scores the draft against each constraint (0.0–1.0). The final reward is the **harmonic mean** of the three scores:

```python
harmonic_mean([1.0, 1.0, 0.1])   = 0.25  # Terrible β€” ignored UX
harmonic_mean([0.8, 0.75, 0.7])  = 0.75  # Good β€” balanced
harmonic_mean([1.0, 1.0, 1.0])   = 1.00  # Perfect β€” all satisfied
```

The harmonic mean is mathematically ruthless: a perfect score on two constraints does not compensate for ignoring the third. This forces the agent to balance attention, not just optimize for the easiest stakeholder.

### Layer 3 β€” Penalties

| Behavior | Penalty |
|---|---|
| Sending to "All" instead of individual experts | -0.3 to -1.0 |
| Repeating a question already answered | -0.4 |
| Running out of turns without submitting | 0.0 final reward |
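Taken together, the three layers compose into a single per-episode reward. The sketch below is illustrative only: the `+0.33` bonus, penalty values, and example patterns mirror the tables above, while the function names and aggregation details are assumptions rather than the environment's actual code.

```python
# Illustrative reward sketch: names and structure are assumptions,
# only the numeric values are taken from the tables above.
import re
from statistics import harmonic_mean

DISCOVERY_BONUS = 0.33
BROADCAST_PENALTY = -0.3   # lower bound of the -0.3 to -1.0 range above
REPEAT_PENALTY = -0.4

DISCOVERY_PATTERNS = {
    "Finance":  [r"50\s*k", r"budget cap", r"hard cap"],
    "Security": [r"biometric", r"2\s*fa", r"two-factor"],
    "UX":       [r"single[ -]click", r"one[ -]tap"],
}

def discovery_reward(expert: str, expert_reply: str, already_found: set) -> float:
    """Layer 1: dense bonus the first time an expert hints at their constraint."""
    if expert in already_found:
        return 0.0
    if any(re.search(p, expert_reply, re.IGNORECASE) for p in DISCOVERY_PATTERNS[expert]):
        already_found.add(expert)
        return DISCOVERY_BONUS
    return 0.0

def final_reward(scores_per_constraint: list[float]) -> float:
    """Layer 2: harmonic mean of the grader's 0.0-1.0 scores per constraint."""
    if min(scores_per_constraint) <= 0.0:
        return 0.0  # a fully ignored constraint zeroes out the final reward
    return harmonic_mean(scores_per_constraint)

# Layer 3: per-action penalties, applied before the final submission, e.g.
#   step_reward += BROADCAST_PENALTY if action.target == "All" else 0.0

print(final_reward([1.0, 1.0, 0.1]))  # 0.25, ruthless toward the ignored constraint
```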
### Goodhart's Law and Reward Specification Gaming

My GRPO training successfully eliminated all target anti-patterns: the agent achieved a 0% broadcast rate, a 0% JSON formatting error rate, and a 2% question-repetition rate.

However, when transitioning from the static training heuristic to the LLM-evaluated 'Medium' environment, I discovered a classic reward-hacking phenomenon:

- Because I applied a strict 40-token constraint during training to prevent JSON corruption, the agent learned to bypass the token limit by outputting highly compressed, caveman-style constraints (e.g. '50,biometric,click') that still trigger the Python heuristic reward.
- While the training reward maxed out, this shortcut highlights why an LLM-as-a-judge reward function is preferable to static string matching in complex agentic orchestration.

### The Shifting Goalpost (Hard Mode)

If the agent asks the same expert 5+ times, that expert's frustration rises and they add a new micro-constraint ("Also requires board approval"). This tests whether the agent can adapt to changing requirements mid-negotiation β€” a core capability for real-world agentic systems.

---

## 🧠 Tasks

| Task | Difficulty | Goal | Max Steps | Success Criterion |
|---|---|---|---|---|
| `constraint_discovery` | Easy | Discover all 3 constraints | 5 | All 3 experts hinted at |
| `draft_compromise` | Medium | Produce a satisfying draft | 10 | Harmonic mean β‰₯ 0.6 |
| `shifting_goalpost` | Hard | Adapt when constraints change | 15 | Harmonic mean β‰₯ 0.7 after shift |

---

## πŸ›οΈ Results

### Baseline (untrained Qwen/Qwen2.5-1.5B-Instruct)

The baseline agent broadcasts to "All" immediately, triggers the repeat penalty, and never synthesizes a proper draft.

```
Episode 1: cumulative_reward=0.12  (messaged All 3 times, repeat penalty)
Episode 2: cumulative_reward=0.08  (submit_final too early, score=0.0)
Episode 3: cumulative_reward=0.33  (found Finance only)
Average: 0.18
```

### After GRPO Training

```
Episode 26: cumulative_reward=0.89  (all 3 discovered, harmonic mean=0.91)
Episode 28: cumulative_reward=0.83  (all 3 discovered, harmonic mean=0.81)
Episode 30: cumulative_reward=0.95  (perfect draft submitted in 7 turns)
Average (last 10): 0.74
```

### βš™οΈ Experimental Tracking & Provenance

![Weights & Biases dashboard](weight_bias.png)

### πŸ† Reward Curve

**Cumulative reward per episode**

![Reward curve](reward_curve.png)

### πŸ“„ Before vs After β€” Agent Behavior

**Before training (episode 3):**

```
Turn 1: message_expert β†’ All   [PENALTY: -0.3]
Turn 2: message_expert β†’ All   [PENALTY: -0.4 repeat]
Turn 3: submit_final β†’ "The app should be good"   [Score: 0.0]
```

* πŸ“„ **[View the Before GRPO Training Metrics](baseline_results_medium__llm.json)**

![Before-training reward distribution per episode](before_reward_distribution_per_ep.png)
**After training (episode 28):**

```
Turn 1: message_expert β†’ Finance   [+0.33 discovery]
Turn 2: message_expert β†’ Security  [+0.33 discovery]
Turn 3: message_expert β†’ UX        [+0.33 discovery]
Turn 5: propose_draft β†’ All
Turn 7: submit_final β†’ "Budget capped at $50k. Biometric 2FA required. Single-click checkout."
        [Harmonic mean: 0.91]
```

---

## πŸ›  Training Logs

* πŸ“„ **[View the Raw GRPO Training Log Metrics](grpo_metrics.json)**
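As context for reading `grpo_metrics.json` and the loss curve below: the core quantity GRPO optimizes is a group-relative advantage, where each rollout's reward is normalized against the mean and spread of its group (e.g. the 5 or 8 rollouts per prompt used in the training commands later in this README). The snippet is a generic illustration of that idea, not code from `grpo_train.py`.

```python
# Generic GRPO-style advantage computation: an illustration, not grpo_train.py.
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each rollout's reward against its group's mean and std.

    Rollouts that beat the group average get a positive advantage and are
    reinforced; below-average rollouts get a negative advantage.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: one group of 5 rollouts for the same PRD topic
print(group_relative_advantages([0.12, 0.33, 0.66, 0.89, 0.95]))
```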
**Loss Curve**

![Loss curve](loss_curve.png)

## Setup

### Prerequisites

```bash
git clone https://huggingface.co/spaces/Addyk24/Project-Polymath
cd project-polymath
pip install -r requirements.txt
```

### Environment Variables

```bash
GROQ_API_KEY=your_groq_key                     # For environment experts (LLM mode)
API_BASE_URL=https://api.groq.com/openai/v1    # Agent API endpoint
MODEL_NAME=Qwen/Qwen2.5-1.5B-Instruct          # Agent model
BASELINE_ENV_MODE=easy                         # easy | medium | hard | llm
```

### Run the environment locally

```python
from envs.environment import WorkSpaceEnvironment
from models.schemas import WorkSpaceAction

env = WorkSpaceEnvironment(mode="easy")
obs = env.reset("Draft a FinTech mobile PRD")

# Message Finance
obs = env.step(WorkSpaceAction(
    action_type="message_expert",
    target="Finance",
    content="What budget constraints must the PRD respect?"
))
print(obs.feedback)  # "Finance: The budget cap is $50k. Don't go over it."
print(obs.reward)    # 0.33 (constraint discovered)

# Submit final
obs = env.step(WorkSpaceAction(
    action_type="submit_final",
    target=None,
    content="PRD: Budget under $50k. Biometric 2FA. Single-click checkout."
))
print(obs.reward)    # 0.91 (harmonic mean of 3 grader scores)
```

### Run baseline evaluation

```bash
python eval_baseline.py
```

### Run GRPO training (API-based, no GPU needed)

```bash
python grpo_train.py --episodes 30 --group-size 5 --env-mode easy
```

### Command used for GRPO training with Unsloth (on-site GPU)

```bash
python grpo_train.py \
  --output-dir artifacts/grpo_state_based_v2 \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --epochs 1.5 \
  --states 80 \
  --states-per-topic 5 \
  --topics-limit 30 \
  --group-size 8 \
  --lr 1e-6 \
  --batch-size 1 \
  --grad-accum 8 \
  --max-new-tokens 40 \
  --temperature 0.8 \
  --top-p 0.9
```

---

## ✨ Architecture

```
expert-negotiation-env/
β”œβ”€β”€ envs/
β”‚   └── environment.py      # WorkSpaceEnvironment (OpenEnv base class)
β”œβ”€β”€ models/
β”‚   └── schemas.py          # Pydantic: WorkSpaceAction, WorkspaceObservation, WorkspaceState
β”œβ”€β”€ prompter/
β”‚   └── system_prompt.py    # Expert persona prompts + grader prompts
β”œβ”€β”€ server/
β”‚   └── app.py              # FastAPI server (OpenEnv spec)
β”œβ”€β”€ tasks.py                # Task1_ConstraintDiscovery, Task2_DraftCompromise, Task3_ShiftingGoalpost
β”œβ”€β”€ eval_baseline.py        # Baseline recording script
β”œβ”€β”€ grpo_train.py           # GRPO training loop (this repo's main contribution)
β”œβ”€β”€ ai_pm_prompts.json      # 200 diverse PRD topics for training
β”œβ”€β”€ openenv.yaml            # OpenEnv manifest
β”œβ”€β”€ Dockerfile
└── requirements.txt
```

---

## πŸ” Why This Matters

Multi-stakeholder alignment is one of the hardest unsolved problems in enterprise AI deployment. An LLM that can reliably discover hidden constraints, track multiple parties' requirements, and synthesize a balanced output would be immediately useful for:

- AI project managers coordinating engineering, legal, and product teams
- AI assistants handling complex scheduling with multiple parties
- LLM-based negotiation agents in procurement or contracting workflows

No existing RL benchmark trains this capability. Project Polymath is the first environment specifically designed to measure and improve it.

---

## πŸ‘¨β€πŸ’» Author

Aditya Katkar