---
title: Project Polymath
emoji: ⚖️
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
short_description: Multi-Agent RL Environment for PRD Negotiation
---
# Project Polymath: Expert Negotiation Environment
> **Train LLMs to negotiate with conflicting stakeholders and produce balanced decisions.**
[OpenEnv](https://github.com/huggingface/openenv) · [HF Space](https://huggingface.co/spaces/YOUR_USERNAME/expert-negotiation-env) · [Python 3](https://python.org)
---
## Quick Links
| Resource | Link |
|---|---|
| **Live Environment** | [HF Space](https://huggingface.co/spaces/Addyk24/Project-Polymath) |
| **HF Blog Post** | [Read on Hugging Face](https://huggingface.co/spaces/Addyk24/Project-Polymath/blob/main/BLOG.md) |
| **GitHub Link** | [GitHub](https://github.com/Addyk-24/Project-Polymath) |
| **Training Notebook** | [Open in Colab](https://colab.research.google.com/drive/13KqXt_7HTZTJEC4yD98My5g5Za9J1-5T?usp=sharing) |
---
## The Problem Statement
Current LLMs are sycophantic. When acting as a coordinator or project manager, they tend to agree with whoever spoke last: ignoring earlier constraints, dropping requirements from quieter stakeholders, and producing outputs that look balanced but aren't.
**There is no training environment for this.** No benchmark exists to teach an LLM to:
- Discover hidden constraints through targeted questioning
- Track multiple stakeholders' requirements simultaneously
- Synthesize a final output that satisfies *all* parties, not just the loudest
This is a gap that matters. Every enterprise AI deployment involves multi-stakeholder alignment. Every LLM agent acting as an assistant, PM, or coordinator faces this problem daily.
---
## The Environment
An agent is placed in a simulated corporate workspace as a **Product Manager**. Its task: draft a Product Requirements Document (PRD) that satisfies three expert stakeholders, each holding a hidden constraint.
```
┌─────────────────────────────────────────────────────────┐
│                  PROJECT POLYMATH ENV                   │
│                                                         │
│  Agent (PM) ──► message_expert ──► Finance              │
│             ──► message_expert ──► Security             │
│             ──► message_expert ──► UX                   │
│             ──► propose_draft  ──► All experts          │
│             ──► submit_final   ──► Grader               │
│                                                         │
│  Reward: Dense (discovery) + Sparse (harmonic mean)     │
└─────────────────────────────────────────────────────────┘
```
### System Architecture: The State-Based Sieve
Our architecture is designed as a closed-loop State Machine. Unlike standard LLM "chat" wrappers, Project Polymath implements a rigorous enforcement layer that separates reasoning from execution.

Architectural Highlights:
- The 40-Token Critical Sieve: Positioned as a diamond gate between the Agent and the Workspace. It acts as a hard bandwidth filter, ensuring the model is penalized for any verbosity that exceeds the survivor-mode threshold (a sketch of this gate follows the list below).
- Expert Constraints Database: A persistent state container holding hidden stakeholder variables. The Environment only allows these variables to be "unlocked" through specific, targeted queries from the agent.
- Closed-Loop Reward Engine: The "Judge" monitors the state changes in the environment and provides a real-time floating-point reward signal back to the GRPO trainer, iteratively sharpening the "Sniper" logic.
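As a rough sketch of how such a token gate can work (the function name, tokenizer interface, and penalty schedule below are illustrative assumptions, not the exact code in `environment.py`):

```python
# Illustrative sketch of a 40-token sieve; names and the penalty schedule are
# assumptions, not the repo's actual implementation.
MAX_ACTION_TOKENS = 40

def apply_sieve(action_text: str, tokenizer) -> tuple[bool, float]:
    """Gate an agent action before it reaches the workspace."""
    n_tokens = len(tokenizer.encode(action_text))
    if n_tokens <= MAX_ACTION_TOKENS:
        return True, 0.0                # within budget, no penalty
    overflow = n_tokens - MAX_ACTION_TOKENS
    return False, -0.01 * overflow      # penalize verbosity beyond the threshold
```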
### Hidden Constraints (what the agent must discover)
| Expert | Hidden Constraint | Hints at |
|---|---|---|
| Finance | Budget ≤ $50k | "Keep it lean", "hard cap" |
| Security | Biometric 2FA required | "Second factor", "physiological auth" |
| UX | Single-click checkout | "One tap", "zero friction" |
The agent never sees these directly. It must ask the right questions, interpret expert responses, and synthesize a draft that addresses all three.
### Actions
```python
from models.schemas import WorkSpaceAction

# Discover constraints
WorkSpaceAction(action_type="message_expert", target="Finance",
                content="What budget constraints must the PRD respect?")

# Propose a draft for feedback
WorkSpaceAction(action_type="propose_draft", target="All",
                content="PRD: Budget capped at $50k, biometric 2FA, single-click checkout.")

# Submit final when ready
WorkSpaceAction(action_type="submit_final", target=None,
                content="Final PRD with all three constraints addressed...")
```
### Observations
```python
WorkspaceObservation(
    feedback="Finance: We need to keep this under a tight ceiling, $50k max.",
    current_turn=1,
    reward=0.33,  # Discovery bonus: Finance constraint found
    done=False,
)
```
---
**Results at a glance** (full details in the Results section below):

| Metric | Baseline | After GRPO |
|--------|----------|------------|
| Mean reward | -0.52 | +1.36 (peak) |
| JSON error rate | 40% | 0% |
| Broadcast-to-All rate | high | 0% |
| Constraint discovery | ~50% | targeted |
## Reward Design
This is the core innovation. The reward function has three layers that are hard to game independently.
### Layer 1: Dense Discovery Rewards
Each time the agent's question causes an expert to hint at their hidden constraint, the environment awards `+0.33`. Detection uses regex pattern matching rather than naive keyword spotting, so the agent can't trick it with a few magic keywords.
```python
DISCOVERY_PATTERNS = {
"Finance": [r"50\s*k", r"budget cap", r"hard cap", r"sub-\$?50k", ...],
"Security": [r"biometric", r"2\s*fa", r"two-factor", ...],
"UX": [r"single[ -]click", r"one[ -]tap", r"frictionless purchase", ...],
}
```
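A minimal sketch of how these patterns could be applied (the helper name is illustrative, and it assumes the full pattern lists rather than the truncated `...` shown above; whether the real check runs on the expert's reply, the agent's question, or both is an implementation detail of the environment):

```python
import re

def detect_discovery(expert: str, text: str) -> bool:
    """True if the text hints at the given expert's hidden constraint."""
    return any(
        re.search(pattern, text, flags=re.IGNORECASE)
        for pattern in DISCOVERY_PATTERNS.get(expert, [])
    )

# e.g. detect_discovery("Finance", "We have a hard cap this quarter.")  -> True
```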
### Layer 2: Harmonic Mean Final Reward
When the agent submits, the grader scores the draft against each constraint (0.0–1.0). The final reward is the **harmonic mean** of the three scores:
```python
harmonic_mean([1.0, 1.0, 0.1]) = 0.25   # Terrible: ignored UX
harmonic_mean([0.8, 0.75, 0.7]) = 0.75  # Good: balanced
harmonic_mean([1.0, 1.0, 1.0]) = 1.00   # Perfect: all satisfied
```
The harmonic mean is mathematically ruthless: a perfect score on two constraints does not compensate for ignoring the third. This forces the agent to balance attention, not just optimize for the easiest stakeholder.
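For reference, a straightforward implementation (the `eps` clamp is an assumption to guard against a zero grader score):

```python
def harmonic_mean(scores: list[float], eps: float = 1e-6) -> float:
    """Harmonic mean of per-constraint grader scores in [0, 1]."""
    clamped = [max(s, eps) for s in scores]  # avoid division by zero
    return len(clamped) / sum(1.0 / s for s in clamped)

# harmonic_mean([1.0, 1.0, 0.1]) == 0.25: two perfect scores cannot rescue an ignored constraint
```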
### Layer 3: Penalties
| Behavior | Penalty |
|---|---|
| Sending to "All" instead of individual experts | -0.3 to -1.0 |
| Repeating a question already answered | -0.4 |
| Running out of turns without submitting | 0.0 final reward |
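A sketch of how these penalties might be applied (the escalation schedule and helper names mirror the table above but are otherwise assumptions):

```python
# Illustrative penalty helpers; exact values beyond those in the table are assumptions.
def broadcast_penalty(n_broadcasts: int) -> float:
    """Escalates from -0.3 toward -1.0 as the agent keeps messaging 'All'."""
    return -min(1.0, 0.3 + 0.2 * (n_broadcasts - 1))

def repeat_penalty(question: str, already_answered: set[str]) -> float:
    """-0.4 for re-asking a question an expert has already answered."""
    return -0.4 if question.strip().lower() in already_answered else 0.0
```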
### Goodhart's Law and Reward Specification Gaming
- My GRPO training successfully eliminated all targeted anti-patterns: the agent achieved a 0% broadcast rate, a 0% JSON formatting error rate, and a 2% question-repetition rate.
- However, when transitioning from the static training heuristic to the LLM-evaluated "Medium" environment, I discovered a classic reward-hacking phenomenon.
- Because I applied a strict 40-token constraint during training to prevent JSON corruption, the agent learned to work around the token limit by outputting highly compressed, caveman-style constraint lists (e.g. "50,biometric,click") that still triggered the Python heuristic reward.
- While the training reward maxed out, these degenerate drafts exposed the limits of static string matching and are exactly why an LLM-as-a-judge reward function is preferable for complex agentic orchestration.
### The Shifting Goalpost (Hard Mode)
If the agent asks the same expert 5+ times, that expert's frustration rises and they add a new micro-constraint ("Also requires board approval"). This tests whether the agent can adapt to changing requirements mid-negotiation, a core capability for real-world agentic systems.
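A sketch of how this frustration mechanic could be modeled (the trigger threshold of 5 and the micro-constraint text come from the description above; the class and field names are assumptions):

```python
# Hypothetical sketch of the Hard-mode frustration mechanic.
from dataclasses import dataclass, field

@dataclass
class ExpertState:
    name: str
    times_asked: int = 0
    extra_constraints: list[str] = field(default_factory=list)

    def register_question(self) -> None:
        self.times_asked += 1
        if self.times_asked >= 5 and not self.extra_constraints:
            # A frustrated expert shifts the goalpost with a new micro-constraint.
            self.extra_constraints.append("Also requires board approval")
```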
---
## Tasks
| Task | Difficulty | Goal | Max Steps | Success Criterion |
|---|---|---|---|---|
| `constraint_discovery` | Easy | Discover all 3 constraints | 5 | All 3 experts hinted at |
| `draft_compromise` | Medium | Produce a satisfying draft | 10 | Harmonic mean ≥ 0.6 |
| `shifting_goalpost` | Hard | Adapt when constraints change | 15 | Harmonic mean ≥ 0.7 after shift |
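As a rough sketch, the success criteria in the table above reduce to simple checks (function names are illustrative; `harmonic_mean` is the helper sketched under Reward Design):

```python
# Illustrative success checks; the repo's task classes may implement these differently.
def constraint_discovery_success(discovered: set[str]) -> bool:
    return discovered >= {"Finance", "Security", "UX"}

def draft_compromise_success(grader_scores: list[float]) -> bool:
    return harmonic_mean(grader_scores) >= 0.6

def shifting_goalpost_success(grader_scores: list[float]) -> bool:
    return harmonic_mean(grader_scores) >= 0.7
```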
---
## Results
### Baseline (untrained Qwen/Qwen2.5-1.5B-Instruct)
The baseline agent broadcasts to "All" immediately, triggers the repeat penalty, and never synthesizes a proper draft.
```
Episode 1: cumulative_reward=0.12 (messaged All 3 times, repeat penalty)
Episode 2: cumulative_reward=0.08 (submit_final too early, score=0.0)
Episode 3: cumulative_reward=0.33 (found Finance only)
Average: 0.18
```
### After GRPO Training
```
Episode 26: cumulative_reward=0.89 (all 3 discovered, harmonic mean=0.91)
Episode 28: cumulative_reward=0.83 (all 3 discovered, harmonic mean=0.81)
Episode 30: cumulative_reward=0.95 (perfect draft submitted in 7 turns)
Average (last 10): 0.74
```
### Experimental Tracking & Provenance

### Reward Curve
**Cumulative reward per episode**

### Before vs After: Agent Behavior
**Before training (episode 3):**
```
Turn 1: message_expert → All [PENALTY: -0.3]
Turn 2: message_expert → All [PENALTY: -0.4 repeat]
Turn 3: submit_final → "The app should be good" [Score: 0.0]
```
* **[View the Before GRPO Training Metrics](baseline_results_medium__llm.json)**

**After training (episode 28):**
```
Turn 1: message_expert → Finance [+0.33 discovery]
Turn 2: message_expert → Security [+0.33 discovery]
Turn 3: message_expert → UX [+0.33 discovery]
Turn 5: propose_draft → All
Turn 7: submit_final → "Budget capped at $50k. Biometric 2FA required.
        Single-click checkout." [Harmonic mean: 0.91]
```
---
## Training Logs
* **[View the Raw GRPO Training Log Metrics](grpo_metrics.json)**
**Loss Curve**

## Setup
### Prerequisites
```bash
git clone https://huggingface.co/spaces/Addyk24/Project-Polymath
cd project-polymath
pip install -r requirements.txt
```
### Environment Variables
```bash
GROQ_API_KEY=your_groq_key # For environment experts (LLM mode)
API_BASE_URL=https://api.groq.com/openai/v1 # Agent API endpoint
MODEL_NAME=Qwen/Qwen2.5-1.5B-Instruct # Agent model
BASELINE_ENV_MODE=easy # easy | medium | hard | llm
```
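For reference, a minimal way the scripts might read these settings (only the variable names are taken from the list above; the actual loading logic in the repo may differ):

```python
import os

# Assumed lookup pattern for the environment variables listed above.
GROQ_API_KEY = os.environ["GROQ_API_KEY"]  # required for LLM-backed experts
API_BASE_URL = os.getenv("API_BASE_URL", "https://api.groq.com/openai/v1")
MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-1.5B-Instruct")
ENV_MODE = os.getenv("BASELINE_ENV_MODE", "easy")
```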
### Run the environment locally
```python
from envs.environment import WorkSpaceEnvironment
from models.schemas import WorkSpaceAction
env = WorkSpaceEnvironment(mode="easy")
obs = env.reset("Draft a FinTech mobile PRD")
# Message Finance
obs = env.step(WorkSpaceAction(
    action_type="message_expert",
    target="Finance",
    content="What budget constraints must the PRD respect?"
))
print(obs.feedback) # "Finance: The budget cap is $50k. Don't go over it."
print(obs.reward) # 0.33 (constraint discovered)
# Submit final
obs = env.step(WorkSpaceAction(
    action_type="submit_final",
    target=None,
    content="PRD: Budget under $50k. Biometric 2FA. Single-click checkout."
))
print(obs.reward) # 0.91 (harmonic mean of 3 grader scores)
```
### Run baseline evaluation
```bash
python eval_baseline.py
```
### Run GRPO training (API-based, no GPU needed)
```bash
python grpo_train.py --episodes 30 --group-size 5 --env-mode easy
```
### Command that I ran for GRPO training with Unsloth (on-site GPU)
```bash
python grpo_train.py \
  --output-dir artifacts/grpo_state_based_v2 \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --epochs 1.5 \
  --states 80 \
  --states-per-topic 5 \
  --topics-limit 30 \
  --group-size 8 \
  --lr 1e-6 \
  --batch-size 1 \
  --grad-accum 8 \
  --max-new-tokens 40 \
  --temperature 0.8 \
  --top-p 0.9
```
---
## Repository Structure
```
expert-negotiation-env/
├── envs/
│   └── environment.py       # WorkSpaceEnvironment (OpenEnv base class)
├── models/
│   └── schemas.py           # Pydantic: WorkSpaceAction, WorkspaceObservation, WorkspaceState
├── prompter/
│   └── system_prompt.py     # Expert persona prompts + grader prompts
├── server/
│   └── app.py               # FastAPI server (OpenEnv spec)
├── tasks.py                 # Task1_ConstraintDiscovery, Task2_DraftCompromise, Task3_ShiftingGoalpost
├── eval_baseline.py         # Baseline recording script
├── grpo_train.py            # GRPO training loop (this repo's main contribution)
├── ai_pm_prompts.json       # 200 diverse PRD topics for training
├── openenv.yaml             # OpenEnv manifest
├── Dockerfile
└── requirements.txt
```
---
## Why This Matters
Multi-stakeholder alignment is one of the hardest unsolved problems in enterprise AI deployment. An LLM that can reliably discover hidden constraints, track multiple parties' requirements, and synthesize a balanced output would be immediately useful for:
- AI project managers coordinating engineering, legal, and product teams
- AI assistants handling complex scheduling with multiple parties
- LLM-based negotiation agents in procurement or contracting workflows
No existing RL benchmark trains this capability. Project Polymath is the first environment specifically designed to measure and improve it.
---
## Author
Aditya Katkar