---
title: Project Polymath
emoji: ⚖️
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
short_description: Multi-Agent RL Environment for PRD Negotiation
---

# Project Polymath: Expert Negotiation Environment

> **Train LLMs to negotiate with conflicting stakeholders and produce balanced decisions.**

[![OpenEnv](https://img.shields.io/badge/OpenEnv-latest-blue)](https://github.com/huggingface/openenv)
[![HF Space](https://img.shields.io/badge/HuggingFace-Space-yellow)](https://huggingface.co/spaces/Addyk24/Project-Polymath)
[![Python](https://img.shields.io/badge/Python-3.11+-green)](https://python.org)

---

## 🔗 Quick Links

| Resource | Link |
|---|---|
| **🔗 Live Environment** | [HF Space](https://huggingface.co/spaces/Addyk24/Project-Polymath) |
| **📝 HF Blog Post** | [Read on Hugging Face](https://huggingface.co/spaces/Addyk24/Project-Polymath/blob/main/BLOG.md) |
| **GitHub Link** | [GitHub](https://github.com/Addyk-24/Project-Polymath) |
| **Training Notebook** | [Open in Colab](https://colab.research.google.com/drive/13KqXt_7HTZTJEC4yD98My5g5Za9J1-5T?usp=sharing) |

---

## 🧱 The Problem Statement

Current LLMs are sycophantic. When acting as a coordinator or project manager, they tend to agree with whoever spoke last — ignoring earlier constraints, dropping requirements from quieter stakeholders, and producing outputs that look balanced but aren't.

**There is no training environment for this.** No benchmark exists to teach an LLM to:
- Discover hidden constraints through targeted questioning
- Track multiple stakeholders' requirements simultaneously
- Synthesize a final output that satisfies *all* parties — not just the loudest

This is a gap that matters. Every enterprise AI deployment involves multi-stakeholder alignment. Every LLM agent acting as an assistant, PM, or coordinator faces this problem daily.

---

## 🧠 The Environment

An agent is placed in a simulated corporate workspace as a **Product Manager**. Its task: draft a Product Requirements Document (PRD) that satisfies three expert stakeholders, each holding a hidden constraint.

```
┌─────────────────────────────────────────────────────┐
│              PROJECT POLYMATH ENV                   │
│                                                     │
│  Agent (PM) ──► message_expert ──► Finance          │
│            ──► message_expert ──► Security          │  
│            ──► message_expert ──► UX                │
│            ──► propose_draft  ──► All experts       │
│            ──► submit_final   ──► Grader            │
│                                                     │
│  Reward: Dense (discovery) + Sparse (harmonic mean) │
└─────────────────────────────────────────────────────┘
```

### 🏛️ System Architecture: The State-Based Sieve

Our architecture is designed as a closed-loop State Machine. Unlike standard LLM "chat" wrappers, Project Polymath implements a rigorous enforcement layer that separates reasoning from execution.


![architecture](system_architecture.png)


**Architectural Highlights:**

- **The 40-Token Critical Sieve:** Positioned as a diamond gate between the Agent and the Workspace, it acts as a hard bandwidth filter, penalizing the model for any verbosity that exceeds the survivor-mode threshold (a minimal sketch follows this list).

- **Expert Constraints Database:** A persistent state container holding hidden stakeholder variables. The environment only allows these variables to be "unlocked" through specific, targeted queries from the agent.

- **Closed-Loop Reward Engine:** The "Judge" monitors state changes in the environment and feeds a real-time floating-point reward signal back to the GRPO trainer, iteratively sharpening the "Sniper" logic.
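
To make the sieve concrete, here is a minimal sketch of how the gate could be enforced. The whitespace token count and the flat penalty value are illustrative assumptions, not the environment's actual implementation:

```python
TOKEN_LIMIT = 40  # the survivor-mode bandwidth budget described above

def passes_sieve(content: str, limit: int = TOKEN_LIMIT) -> bool:
    """Return True if the agent's message fits the token budget.

    Whitespace splitting is a stand-in for real model tokenization.
    """
    return len(content.split()) <= limit

def sieve_penalty(content: str, limit: int = TOKEN_LIMIT) -> float:
    """Flat penalty for exceeding the sieve (the -0.5 value is illustrative)."""
    return 0.0 if passes_sieve(content, limit) else -0.5
```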


### 🏛️ Hidden Constraints (what the agent must discover)

| Expert | Hidden Constraint | Hints at |
|---|---|---|
| Finance | Budget ≤ $50k | "Keep it lean", "hard cap" |
| Security | Biometric 2FA required | "Second factor", "physiological auth" |
| UX | Single-click checkout | "One tap", "zero friction" |

The agent never sees these directly. It must ask the right questions, interpret expert responses, and synthesize a draft that addresses all three.

### ✨ Actions

```python
# Discover constraints
WorkSpaceAction(action_type="message_expert", target="Finance",
                content="What budget constraints must the PRD respect?")

# Propose a draft for feedback
WorkSpaceAction(action_type="propose_draft", target="All",
                content="PRD: Budget capped at $50k, biometric 2FA, single-click checkout.")

# Submit final when ready
WorkSpaceAction(action_type="submit_final", target=None,
                content="Final PRD with all three constraints addressed...")
```

### 🧱 Observations

```python
WorkspaceObservation(
    feedback="Finance: We need to keep this under a tight ceiling — $50k max.",
    current_turn=1,
    reward=0.33,   # Discovery bonus: Finance constraint found
    done=False,
)
```

---

## 📊 Results at a Glance

| Metric | Baseline | After GRPO |
|--------|----------|------------|
| Mean reward | -0.52 | +1.36 (peak) |
| JSON error rate | 40% | 0% |
| Broadcast-to-All rate | high | 0% |
| Constraint discovery | ~50% | targeted |

## ✨ Reward Design

This is the core innovation. The reward function has three layers that are hard to game independently.

### Layer 1 — Dense Discovery Rewards

Each time the agent's question causes an expert to hint at their hidden constraint, the environment awards `+0.33`. Detection runs regex pattern matching over the expert's reply rather than naive keyword spotting, so the agent cannot trigger the bonus by stuffing keywords into its own messages.

```python
DISCOVERY_PATTERNS = {
    "Finance": [r"50\s*k", r"budget cap", r"hard cap", r"sub-\$?50k", ...],
    "Security": [r"biometric", r"2\s*fa", r"two-factor", ...],
    "UX": [r"single[ -]click", r"one[ -]tap", r"frictionless purchase", ...],
}
```
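
A minimal sketch of how these patterns might be applied to an expert's reply. The function name, the once-per-expert bookkeeping, and case-insensitive matching are assumptions for illustration; it presumes the `DISCOVERY_PATTERNS` dict above is filled out with complete pattern strings:

```python
import re

def check_discovery(expert: str, reply: str, discovered: set) -> float:
    """Award the +0.33 bonus the first time an expert's reply matches one
    of their constraint patterns; `discovered` tracks experts whose
    constraints have already been unlocked, so the bonus fires only once."""
    if expert in discovered:
        return 0.0
    for pattern in DISCOVERY_PATTERNS[expert]:
        if re.search(pattern, reply, flags=re.IGNORECASE):
            discovered.add(expert)
            return 0.33
    return 0.0
```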

### Layer 2 — Harmonic Mean Final Reward

When the agent submits, the grader scores the draft against each constraint (0.0–1.0). The final reward is the **harmonic mean** of the three scores:

```python
harmonic_mean([1.0, 1.0, 0.1]) = 0.25  # Terrible — ignored UX
harmonic_mean([0.8, 0.75, 0.7]) = 0.75  # Good — balanced
harmonic_mean([1.0, 1.0, 1.0]) = 1.00  # Perfect — all satisfied
```

The harmonic mean is mathematically ruthless: a perfect score on two constraints does not compensate for ignoring the third. This forces the agent to balance attention, not just optimize for the easiest stakeholder.
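
For reference, a minimal implementation of the scoring rule (the `eps` guard against zero scores is an added assumption):

```python
def harmonic_mean(scores, eps=1e-9):
    """Harmonic mean of per-constraint grader scores. One near-zero score
    drags the whole reward toward zero, which is exactly the point."""
    return len(scores) / sum(1.0 / max(s, eps) for s in scores)

print(round(harmonic_mean([1.0, 1.0, 0.1]), 2))   # 0.25 (ignored UX)
print(round(harmonic_mean([0.8, 0.75, 0.7]), 2))  # 0.75 (balanced)
```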

### Layer 3 — Penalties

| Behavior | Penalty |
|---|---|
| Sending to "All" instead of individual experts | -0.3 to -1.0 |
| Repeating a question already answered | -0.4 |
| Running out of turns without submitting | 0.0 final reward |
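
A sketch of how the repeat penalty might be tracked. Exact-string matching is a simplifying assumption (the environment's actual detection of a repeated question could well be fuzzier), and the helper name is illustrative:

```python
def repeat_penalty(question: str, asked: set) -> float:
    """-0.4 for re-asking a question that has already been answered.

    Exact-string matching is a simplification of real repeat detection.
    """
    if question in asked:
        return -0.4
    asked.add(question)
    return 0.0
```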


### Goodhart's Law and Reward Specification Gaming

- My GRPO training successfully eliminated all target anti-patterns: the agent achieved a 0% broadcast rate, a 0% JSON formatting error rate, and a 2% question-repetition rate.
- However, when transitioning from the static training heuristic to the LLM-evaluated "Medium" environment, I discovered a classic reward-hacking phenomenon.
- Because I applied a strict 40-token constraint during training to prevent JSON corruption, the agent learned to bypass the token limit by outputting highly compressed, caveman-style constraints (e.g., "50,biometric,click") to trigger the Python heuristic reward.
- While the training reward maxed out, the LLM-as-a-judge scores exposed the gap, underscoring the need for semantic reward functions over static string matching in complex agentic orchestration.

### The Shifting Goalpost (Hard Mode)

If the agent asks the same expert 5+ times, that expert's frustration rises and they add a new micro-constraint ("Also requires board approval"). This tests whether the agent can adapt to changing requirements mid-negotiation — a core capability for real-world agentic systems.
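
One way the shift could be modeled, as a sketch. The threshold of 5 comes from the description above, while the data structures, helper name, and in-place mutation are assumptions:

```python
from collections import Counter

FRUSTRATION_THRESHOLD = 5  # 5+ messages to one expert triggers the shift

def maybe_shift_goalpost(expert: str, counts: Counter, constraints: dict) -> bool:
    """Record one more message to `expert`; once frustration crosses the
    threshold, append a micro-constraint so the grading target moves."""
    counts[expert] += 1
    shifted = counts[expert] >= FRUSTRATION_THRESHOLD
    if shifted and "board approval" not in constraints[expert]:
        constraints[expert] += " Also requires board approval."
        return True  # the goalpost has shifted
    return False
```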

---

## 🧠 Tasks

| Task | Difficulty | Goal | Max Steps | Success Criterion |
|---|---|---|---|---|
| `constraint_discovery` | Easy | Discover all 3 constraints | 5 | All 3 experts hinted at |
| `draft_compromise` | Medium | Produce a satisfying draft | 10 | Harmonic mean ≥ 0.6 |
| `shifting_goalpost` | Hard | Adapt when constraints change | 15 | Harmonic mean ≥ 0.7 after shift |

---

## 🏛️ Results

### Baseline (untrained Qwen/Qwen2.5-1.5B-Instruct)

The baseline agent broadcasts to "All" immediately, triggers the repeat penalty, and never synthesizes a proper draft.

```
Episode 1:  cumulative_reward=0.12  (messaged All 3 times, repeat penalty)
Episode 2:  cumulative_reward=0.08  (submit_final too early, score=0.0)
Episode 3:  cumulative_reward=0.33  (found Finance only)
Average: 0.18
```

### After GRPO Training

```
Episode 26: cumulative_reward=0.89  (all 3 discovered, harmonic mean=0.91)
Episode 28: cumulative_reward=0.83  (all 3 discovered, harmonic mean=0.81)
Episode 30: cumulative_reward=0.95  (perfect draft submitted in 7 turns)
Average (last 10): 0.74
```
### ⚙️ Experimental Tracking & Provenance

![Weights & Biases dashboard](weight_bias.png)

### 🏆 Reward Curve

**Cumulative reward per episode**

![Telemetry Dashboard](reward_curve.png)


### 📄 Before vs After — Agent Behavior

**Before training (episode 3):**
```
Turn 1: message_expert → All  [PENALTY: -0.3]
Turn 2: message_expert → All  [PENALTY: -0.4 repeat]
Turn 3: submit_final → "The app should be good"  [Score: 0.0]
```
* 📄 **[View the Before GRPO Training Metrics](baseline_results_medium__llm.json)**

  
![Telemetry Dashboard](before_reward_distribution_per_ep.png)

<br/>

**After training (episode 28):**
```
Turn 1: message_expert → Finance  [+0.33 discovery]
Turn 2: message_expert → Security [+0.33 discovery]
Turn 3: message_expert → UX       [+0.33 discovery]
Turn 5: propose_draft → All
Turn 7: submit_final → "Budget capped at $50k. Biometric 2FA required.
         Single-click checkout." [Harmonic mean: 0.91]
```

---
## 🛠 Training Logs
* 📄 **[View the Raw GRPO Training Log Metrics](grpo_metrics.json)**

  <br>
  
**Loss Curve**

![Telemetry Dashboard](loss_curve.png)


## Setup

### Prerequisites

```bash
git clone https://huggingface.co/spaces/Addyk24/Project-Polymath
cd project-polymath
pip install -r requirements.txt
```

### Environment Variables

```bash
GROQ_API_KEY=your_groq_key        # For environment experts (LLM mode)
API_BASE_URL=https://api.groq.com/openai/v1  # Agent API endpoint
MODEL_NAME=Qwen/Qwen2.5-1.5B-Instruct  # Agent model
BASELINE_ENV_MODE=easy            # easy | medium | hard | llm
```
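
If you keep these in a local `.env` file, one common way to load them in Python (assuming `python-dotenv`, which is not a stated project dependency):

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads key=value pairs from .env into the process environment
groq_key = os.environ["GROQ_API_KEY"]
env_mode = os.getenv("BASELINE_ENV_MODE", "easy")  # default mirrors the example above
```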

### Run the environment locally

```python
from envs.environment import WorkSpaceEnvironment
from models.schemas import WorkSpaceAction

env = WorkSpaceEnvironment(mode="easy")
obs = env.reset("Draft a FinTech mobile PRD")

# Message Finance
obs = env.step(WorkSpaceAction(
    action_type="message_expert",
    target="Finance",
    content="What budget constraints must the PRD respect?"
))
print(obs.feedback)   # "Finance: The budget cap is $50k. Don't go over it."
print(obs.reward)     # 0.33 (constraint discovered)

# Submit final
obs = env.step(WorkSpaceAction(
    action_type="submit_final",
    target=None,
    content="PRD: Budget under $50k. Biometric 2FA. Single-click checkout."
))
print(obs.reward)     # 0.91 (harmonic mean of 3 grader scores)
```

### Run baseline evaluation

```bash
python eval_baseline.py
```

### Run GRPO training (API-based, no GPU needed)

```bash
python grpo_train.py --episodes 30 --group-size 5 --env-mode easy
```

### Command I ran for GRPO training with Unsloth (local GPU)

```bash
python grpo_train.py \
  --output-dir artifacts/grpo_state_based_v2 \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --epochs 1.5 \
  --states 80 \
  --states-per-topic 5 \
  --topics-limit 30 \
  --group-size 8 \
  --lr 1e-6 \
  --batch-size 1 \
  --grad-accum 8 \
  --max-new-tokens 40 \
  --temperature 0.8 \
  --top-p 0.9
```

---

## ✨ Architecture

```
expert-negotiation-env/
├── envs/
│   └── environment.py      # WorkSpaceEnvironment (OpenEnv base class)
├── models/
│   └── schemas.py          # Pydantic: WorkSpaceAction, WorkspaceObservation, WorkspaceState
├── prompter/
│   └── system_prompt.py    # Expert persona prompts + grader prompts
├── server/
│   └── app.py              # FastAPI server (OpenEnv spec)
├── tasks.py                # Task1_ConstraintDiscovery, Task2_DraftCompromise, Task3_ShiftingGoalpost
├── eval_baseline.py        # Baseline recording script
├── grpo_train.py           # GRPO training loop (this repo's main contribution)
├── ai_pm_prompts.json      # 200 diverse PRD topics for training
├── openenv.yaml            # OpenEnv manifest
├── Dockerfile
└── requirements.txt
```

---

## 🔍 Why This Matters

Multi-stakeholder alignment is one of the hardest unsolved problems in enterprise AI deployment. An LLM that can reliably discover hidden constraints, track multiple parties' requirements, and synthesize a balanced output would be immediately useful for:

- AI project managers coordinating engineering, legal, and product teams
- AI assistants handling complex scheduling with multiple parties
- LLM-based negotiation agents in procurement or contracting workflows

No existing RL benchmark trains this capability. Project Polymath is the first environment specifically designed to measure and improve it.

---

## 👨‍💻 Author
Aditya Katkar