---
title: Project Polymath
emoji: ⚖️
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
short_description: Multi-Agent RL Environment for PRD Negotiation
---
# Project Polymath: Expert Negotiation Environment
> **Train LLMs to negotiate with conflicting stakeholders and produce balanced decisions.**
[OpenEnv](https://github.com/huggingface/openenv) · [HF Space](https://huggingface.co/spaces/YOUR_USERNAME/expert-negotiation-env) · [Python 3](https://python.org)
---
## Quick Links
| Resource | Link |
|---|---|
| **Live Environment** | [HF Space](https://huggingface.co/spaces/Addyk24/Project-Polymath) |
| **HF Blog Post** | [Read on Hugging Face](https://huggingface.co/spaces/Addyk24/Project-Polymath/blob/main/BLOG.md) |
| **GitHub Link** | [GitHub](https://github.com/Addyk-24/Project-Polymath) |
| **Training Notebook** | [Open in Colab](https://colab.research.google.com/drive/13KqXt_7HTZTJEC4yD98My5g5Za9J1-5T?usp=sharing) |
---
## The Problem Statement
Current LLMs are sycophantic. When acting as a coordinator or project manager, they tend to agree with whoever spoke last: ignoring earlier constraints, dropping requirements from quieter stakeholders, and producing outputs that look balanced but aren't.
**There is no training environment for this.** No benchmark exists to teach an LLM to:
- Discover hidden constraints through targeted questioning
- Track multiple stakeholders' requirements simultaneously
- Synthesize a final output that satisfies *all* parties, not just the loudest
This is a gap that matters. Every enterprise AI deployment involves multi-stakeholder alignment. Every LLM agent acting as an assistant, PM, or coordinator faces this problem daily.
---
## The Environment
An agent is placed in a simulated corporate workspace as a **Product Manager**. Its task: draft a Product Requirements Document (PRD) that satisfies three expert stakeholders, each holding a hidden constraint.
```
┌─────────────────────────────────────────────────────────┐
│                  PROJECT POLYMATH ENV                   │
│                                                         │
│  Agent (PM) ──► message_expert ──► Finance              │
│             ──► message_expert ──► Security             │
│             ──► message_expert ──► UX                   │
│             ──► propose_draft  ──► All experts          │
│             ──► submit_final   ──► Grader               │
│                                                         │
│  Reward: Dense (discovery) + Sparse (harmonic mean)     │
└─────────────────────────────────────────────────────────┘
```
### System Architecture: The State-Based Sieve
Our architecture is designed as a closed-loop State Machine. Unlike standard LLM "chat" wrappers, Project Polymath implements a rigorous enforcement layer that separates reasoning from execution.

Architectural Highlights:
- The 40-Token Critical Sieve: Positioned as a diamond gate between the Agent and the Workspace. It acts as a hard bandwidth filter, ensuring the model is penalized for any verbosity that exceeds the survivor-mode threshold (a sketch of this gate follows the list below).
- Expert Constraints Database: A persistent state container holding hidden stakeholder variables. The Environment only allows these variables to be "unlocked" through specific, targeted queries from the agent.
- Closed-Loop Reward Engine: The "Judge" monitors the state changes in the environment and provides a real-time floating-point reward signal back to the GRPO trainer, iteratively sharpening the "Sniper" logic.
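As a rough sketch of how such a token gate can work (the function name, tokenizer interface, and penalty schedule below are illustrative assumptions, not the exact code in `environment.py`):

```python
# Illustrative sketch of a 40-token sieve; names and the penalty schedule are
# assumptions, not the repo's actual implementation.
MAX_ACTION_TOKENS = 40

def apply_sieve(action_text: str, tokenizer) -> tuple[bool, float]:
    """Gate an agent action before it reaches the workspace."""
    n_tokens = len(tokenizer.encode(action_text))
    if n_tokens <= MAX_ACTION_TOKENS:
        return True, 0.0                # within budget, no penalty
    overflow = n_tokens - MAX_ACTION_TOKENS
    return False, -0.01 * overflow      # penalize verbosity beyond the threshold
```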
### Hidden Constraints (what the agent must discover)
| Expert | Hidden Constraint | Hints at |
|---|---|---|
| Finance | Budget ≤ $50k | "Keep it lean", "hard cap" |
| Security | Biometric 2FA required | "Second factor", "physiological auth" |
| UX | Single-click checkout | "One tap", "zero friction" |
The agent never sees these directly. It must ask the right questions, interpret expert responses, and synthesize a draft that addresses all three.
### Actions
```python
from models.schemas import WorkSpaceAction

# Discover constraints
WorkSpaceAction(action_type="message_expert", target="Finance",
                content="What budget constraints must the PRD respect?")

# Propose a draft for feedback
WorkSpaceAction(action_type="propose_draft", target="All",
                content="PRD: Budget capped at $50k, biometric 2FA, single-click checkout.")

# Submit final when ready
WorkSpaceAction(action_type="submit_final", target=None,
                content="Final PRD with all three constraints addressed...")
```
### Observations
```python
WorkspaceObservation(
    feedback="Finance: We need to keep this under a tight ceiling, $50k max.",
    current_turn=1,
    reward=0.33,  # Discovery bonus: Finance constraint found
    done=False,
)
```
---
**Results at a glance** (full details in the Results section below):

| Metric | Baseline | After GRPO |
|--------|----------|------------|
| Mean reward | -0.52 | +1.36 (peak) |
| JSON error rate | 40% | 0% |
| Broadcast-to-All rate | high | 0% |
| Constraint discovery | ~50% | targeted |
## Reward Design
This is the core innovation. The reward function has three layers that are hard to game independently.
### Layer 1: Dense Discovery Rewards
Each time the agent's question causes an expert to hint at their hidden constraint, the environment awards `+0.33`. Detection uses regex pattern matching rather than naive keyword spotting, so the agent can't trick it with a few magic keywords.
```python
DISCOVERY_PATTERNS = {
"Finance": [r"50\s*k", r"budget cap", r"hard cap", r"sub-\$?50k", ...],
"Security": [r"biometric", r"2\s*fa", r"two-factor", ...],
"UX": [r"single[ -]click", r"one[ -]tap", r"frictionless purchase", ...],
}
```
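A minimal sketch of how these patterns could be applied (the helper name is illustrative, and it assumes the full pattern lists rather than the truncated `...` shown above; whether the real check runs on the expert's reply, the agent's question, or both is an implementation detail of the environment):

```python
import re

def detect_discovery(expert: str, text: str) -> bool:
    """True if the text hints at the given expert's hidden constraint."""
    return any(
        re.search(pattern, text, flags=re.IGNORECASE)
        for pattern in DISCOVERY_PATTERNS.get(expert, [])
    )

# e.g. detect_discovery("Finance", "We have a hard cap this quarter.")  -> True
```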
### Layer 2: Harmonic Mean Final Reward
When the agent submits, the grader scores the draft against each constraint (0.0–1.0). The final reward is the **harmonic mean** of the three scores:
```python
harmonic_mean([1.0, 1.0, 0.1]) = 0.25   # Terrible: ignored UX
harmonic_mean([0.8, 0.75, 0.7]) = 0.75  # Good: balanced
harmonic_mean([1.0, 1.0, 1.0]) = 1.00   # Perfect: all satisfied
```
The harmonic mean is mathematically ruthless: a perfect score on two constraints does not compensate for ignoring the third. This forces the agent to balance attention, not just optimize for the easiest stakeholder.
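For reference, a straightforward implementation (the `eps` clamp is an assumption to guard against a zero grader score):

```python
def harmonic_mean(scores: list[float], eps: float = 1e-6) -> float:
    """Harmonic mean of per-constraint grader scores in [0, 1]."""
    clamped = [max(s, eps) for s in scores]  # avoid division by zero
    return len(clamped) / sum(1.0 / s for s in clamped)

# harmonic_mean([1.0, 1.0, 0.1]) == 0.25: two perfect scores cannot rescue an ignored constraint
```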
### Layer 3: Penalties
| Behavior | Penalty |
|---|---|
| Sending to "All" instead of individual experts | -0.3 to -1.0 |
| Repeating a question already answered | -0.4 |
| Running out of turns without submitting | 0.0 final reward |
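A sketch of how these penalties might be applied (the escalation schedule and helper names mirror the table above but are otherwise assumptions):

```python
# Illustrative penalty helpers; exact values beyond those in the table are assumptions.
def broadcast_penalty(n_broadcasts: int) -> float:
    """Escalates from -0.3 toward -1.0 as the agent keeps messaging 'All'."""
    return -min(1.0, 0.3 + 0.2 * (n_broadcasts - 1))

def repeat_penalty(question: str, already_answered: set[str]) -> float:
    """-0.4 for re-asking a question an expert has already answered."""
    return -0.4 if question.strip().lower() in already_answered else 0.0
```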
### Goodhart's Law and Reward Specification Gaming
- My GRPO training successfully eliminated all targeted anti-patterns: the agent achieved a 0% broadcast rate, a 0% JSON formatting error rate, and a 2% question-repetition rate.
- However, when transitioning from the static training heuristic to the LLM-evaluated "Medium" environment, I discovered a classic reward-hacking phenomenon.
- Because I applied a strict 40-token constraint during training to prevent JSON corruption, the agent learned to work around the token limit by outputting highly compressed, caveman-style constraint lists (e.g. "50,biometric,click") that still triggered the Python heuristic reward.
- While the training reward maxed out, these degenerate drafts exposed the limits of static string matching and are exactly why an LLM-as-a-judge reward function is preferable for complex agentic orchestration.
### The Shifting Goalpost (Hard Mode)
If the agent asks the same expert 5+ times, that expert's frustration rises and they add a new micro-constraint ("Also requires board approval"). This tests whether the agent can adapt to changing requirements mid-negotiation, a core capability for real-world agentic systems.
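A sketch of how this frustration mechanic could be modeled (the trigger threshold of 5 and the micro-constraint text come from the description above; the class and field names are assumptions):

```python
# Hypothetical sketch of the Hard-mode frustration mechanic.
from dataclasses import dataclass, field

@dataclass
class ExpertState:
    name: str
    times_asked: int = 0
    extra_constraints: list[str] = field(default_factory=list)

    def register_question(self) -> None:
        self.times_asked += 1
        if self.times_asked >= 5 and not self.extra_constraints:
            # A frustrated expert shifts the goalpost with a new micro-constraint.
            self.extra_constraints.append("Also requires board approval")
```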
---
## Tasks
| Task | Difficulty | Goal | Max Steps | Success Criterion |
|---|---|---|---|---|
| `constraint_discovery` | Easy | Discover all 3 constraints | 5 | All 3 experts hinted at |
| `draft_compromise` | Medium | Produce a satisfying draft | 10 | Harmonic mean ≥ 0.6 |
| `shifting_goalpost` | Hard | Adapt when constraints change | 15 | Harmonic mean ≥ 0.7 after shift |
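As a rough sketch, the success criteria in the table above reduce to simple checks (function names are illustrative; `harmonic_mean` is the helper sketched under Reward Design):

```python
# Illustrative success checks; the repo's task classes may implement these differently.
def constraint_discovery_success(discovered: set[str]) -> bool:
    return discovered >= {"Finance", "Security", "UX"}

def draft_compromise_success(grader_scores: list[float]) -> bool:
    return harmonic_mean(grader_scores) >= 0.6

def shifting_goalpost_success(grader_scores: list[float]) -> bool:
    return harmonic_mean(grader_scores) >= 0.7
```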
---
## Results
### Baseline (untrained Qwen/Qwen2.5-1.5B-Instruct)
The baseline agent broadcasts to "All" immediately, triggers the repeat penalty, and never synthesizes a proper draft.
```
Episode 1: cumulative_reward=0.12 (messaged All 3 times, repeat penalty)
Episode 2: cumulative_reward=0.08 (submit_final too early, score=0.0)
Episode 3: cumulative_reward=0.33 (found Finance only)
Average: 0.18
```
### After GRPO Training
```
Episode 26: cumulative_reward=0.89 (all 3 discovered, harmonic mean=0.91)
Episode 28: cumulative_reward=0.83 (all 3 discovered, harmonic mean=0.81)
Episode 30: cumulative_reward=0.95 (perfect draft submitted in 7 turns)
Average (last 10): 0.74
```
### Experimental Tracking & Provenance

### Reward Curve
**Cumulative reward per episode**

### Before vs After: Agent Behavior
**Before training (episode 3):**
```
Turn 1: message_expert → All [PENALTY: -0.3]
Turn 2: message_expert → All [PENALTY: -0.4 repeat]
Turn 3: submit_final → "The app should be good" [Score: 0.0]
```
* **[View the Before GRPO Training Metrics](baseline_results_medium__llm.json)**

**After training (episode 28):**
```
Turn 1: message_expert → Finance [+0.33 discovery]
Turn 2: message_expert → Security [+0.33 discovery]
Turn 3: message_expert → UX [+0.33 discovery]
Turn 5: propose_draft → All
Turn 7: submit_final → "Budget capped at $50k. Biometric 2FA required.
        Single-click checkout." [Harmonic mean: 0.91]
```
---
## Training Logs
* **[View the Raw GRPO Training Log Metrics](grpo_metrics.json)**
**Loss Curve**

## Setup
### Prerequisites
```bash
git clone https://huggingface.co/spaces/Addyk24/Project-Polymath
cd project-polymath
pip install -r requirements.txt
```
### Environment Variables
```bash
GROQ_API_KEY=your_groq_key # For environment experts (LLM mode)
API_BASE_URL=https://api.groq.com/openai/v1 # Agent API endpoint
MODEL_NAME=Qwen/Qwen2.5-1.5B-Instruct # Agent model
BASELINE_ENV_MODE=easy # easy | medium | hard | llm
```
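For reference, a minimal way the scripts might read these settings (only the variable names are taken from the list above; the actual loading logic in the repo may differ):

```python
import os

# Assumed lookup pattern for the environment variables listed above.
GROQ_API_KEY = os.environ["GROQ_API_KEY"]  # required for LLM-backed experts
API_BASE_URL = os.getenv("API_BASE_URL", "https://api.groq.com/openai/v1")
MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-1.5B-Instruct")
ENV_MODE = os.getenv("BASELINE_ENV_MODE", "easy")
```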
### Run the environment locally
```python
from envs.environment import WorkSpaceEnvironment
from models.schemas import WorkSpaceAction
env = WorkSpaceEnvironment(mode="easy")
obs = env.reset("Draft a FinTech mobile PRD")
# Message Finance
obs = env.step(WorkSpaceAction(
    action_type="message_expert",
    target="Finance",
    content="What budget constraints must the PRD respect?"
))
print(obs.feedback) # "Finance: The budget cap is $50k. Don't go over it."
print(obs.reward) # 0.33 (constraint discovered)
# Submit final
obs = env.step(WorkSpaceAction(
    action_type="submit_final",
    target=None,
    content="PRD: Budget under $50k. Biometric 2FA. Single-click checkout."
))
print(obs.reward) # 0.91 (harmonic mean of 3 grader scores)
```
### Run baseline evaluation
```bash
python eval_baseline.py
```
### Run GRPO training (API-based, no GPU needed)
```bash
python grpo_train.py --episodes 30 --group-size 5 --env-mode easy
```
### Command that I ran for GRPO training with Unsloth (on-site GPU)
```bash
python grpo_train.py \
  --output-dir artifacts/grpo_state_based_v2 \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --epochs 1.5 \
  --states 80 \
  --states-per-topic 5 \
  --topics-limit 30 \
  --group-size 8 \
  --lr 1e-6 \
  --batch-size 1 \
  --grad-accum 8 \
  --max-new-tokens 40 \
  --temperature 0.8 \
  --top-p 0.9
```
---
## Repository Structure
```
expert-negotiation-env/
├── envs/
│   └── environment.py       # WorkSpaceEnvironment (OpenEnv base class)
├── models/
│   └── schemas.py           # Pydantic: WorkSpaceAction, WorkspaceObservation, WorkspaceState
├── prompter/
│   └── system_prompt.py     # Expert persona prompts + grader prompts
├── server/
│   └── app.py               # FastAPI server (OpenEnv spec)
├── tasks.py                 # Task1_ConstraintDiscovery, Task2_DraftCompromise, Task3_ShiftingGoalpost
├── eval_baseline.py         # Baseline recording script
├── grpo_train.py            # GRPO training loop (this repo's main contribution)
├── ai_pm_prompts.json       # 200 diverse PRD topics for training
├── openenv.yaml             # OpenEnv manifest
├── Dockerfile
└── requirements.txt
```
---
## Why This Matters
Multi-stakeholder alignment is one of the hardest unsolved problems in enterprise AI deployment. An LLM that can reliably discover hidden constraints, track multiple parties' requirements, and synthesize a balanced output would be immediately useful for:
- AI project managers coordinating engineering, legal, and product teams
- AI assistants handling complex scheduling with multiple parties
- LLM-based negotiation agents in procurement or contracting workflows
No existing RL benchmark trains this capability. Project Polymath is the first environment specifically designed to measure and improve it.
---
## Author
Aditya Katkar