---
title: Project Polymath
emoji: ⚖️
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
short_description: Multi-Agent RL Environment for PRD Negotiation
---

# Project Polymath: Expert Negotiation Environment

> **Train LLMs to negotiate with conflicting stakeholders and produce balanced decisions.**

[![OpenEnv](https://img.shields.io/badge/OpenEnv-latest-blue)](https://github.com/huggingface/openenv)
[![HF Space](https://img.shields.io/badge/HuggingFace-Space-yellow)](https://huggingface.co/spaces/Addyk24/Project-Polymath)
[![Python](https://img.shields.io/badge/Python-3.11+-green)](https://python.org)

---

## 🔗 Quick Links

| Resource | Link |
|---|---|
| **🔗 Live Environment** | [HF Space](https://huggingface.co/spaces/Addyk24/Project-Polymath) |
| **📝 HF Blog Post** | [Read on Hugging Face](/BLOG.md) |
| **GitHub Repo** | [GitHub](https://github.com/Addyk-24/Project-Polymath) |
| **Training Notebook** | [Open in Colab](https://colab.research.google.com/drive/13KqXt_7HTZTJEC4yD98My5g5Za9J1-5T?usp=sharing) |

---

## The Problem

Current LLMs are sycophantic. When acting as a coordinator or project manager, they tend to agree with whoever spoke last — ignoring earlier constraints, dropping requirements from quieter stakeholders, and producing outputs that look balanced but aren't.

**There is no training environment for this.** No benchmark exists to teach an LLM to:
- Discover hidden constraints through targeted questioning
- Track multiple stakeholders' requirements simultaneously
- Synthesize a final output that satisfies *all* parties — not just the loudest

This is a gap that matters. Every enterprise AI deployment involves multi-stakeholder alignment. Every LLM agent acting as an assistant, PM, or coordinator faces this problem daily.

---

## The Environment

An agent is placed in a simulated corporate workspace as a **Product Manager**. Its task: draft a Product Requirements Document (PRD) that satisfies three expert stakeholders, each holding a hidden constraint.

```
┌─────────────────────────────────────────────────────┐
│                PROJECT POLYMATH ENV                 │
│                                                     │
│  Agent (PM) ──► message_expert ──► Finance          │
│             ──► message_expert ──► Security         │
│             ──► message_expert ──► UX               │
│             ──► propose_draft  ──► All experts      │
│             ──► submit_final   ──► Grader           │
│                                                     │
│  Reward: Dense (discovery) + Sparse (harmonic mean) │
└─────────────────────────────────────────────────────┘
```

### Hidden Constraints (what the agent must discover)

| Expert | Hidden Constraint | Example Hints |
|---|---|---|
| Finance | Budget ≤ $50k | "Keep it lean", "hard cap" |
| Security | Biometric 2FA required | "Second factor", "physiological auth" |
| UX | Single-click checkout | "One tap", "zero friction" |

The agent never sees these directly. It must ask the right questions, interpret expert responses, and synthesize a draft that addresses all three.

### Actions

```python
# Discover constraints
WorkSpaceAction(action_type="message_expert", target="Finance",
                content="What budget constraints must the PRD respect?")

# Propose a draft for feedback
WorkSpaceAction(action_type="propose_draft", target="All",
                content="PRD: Budget capped at $50k, biometric 2FA, single-click checkout.")

# Submit final when ready
WorkSpaceAction(action_type="submit_final", target=None,
                content="Final PRD with all three constraints addressed...")
```

### Observations

```python
WorkspaceObservation(
    feedback="Finance: We need to keep this under a tight ceiling — $50k max.",
    current_turn=1,
    reward=0.33,  # Discovery bonus: Finance constraint found
    done=False,
)
```
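
Both types live in `models/schemas.py` as Pydantic models (per the Architecture section below). A minimal sketch of what they likely look like, inferred from the examples above; the exact field defaults and type annotations are assumptions:

```python
from typing import Literal, Optional
from pydantic import BaseModel

class WorkSpaceAction(BaseModel):
    """Agent -> environment message (sketch; fields inferred from the README examples)."""
    action_type: Literal["message_expert", "propose_draft", "submit_final"]
    target: Optional[str] = None  # "Finance" | "Security" | "UX" | "All" | None
    content: str

class WorkspaceObservation(BaseModel):
    """Environment -> agent response (sketch)."""
    feedback: str
    current_turn: int
    reward: float
    done: bool
```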

---

## Results at a Glance

| Metric | Baseline | After GRPO |
|--------|----------|------------|
| Mean reward | -0.52 | +1.36 (peak) |
| JSON error rate | 40% | 0% |
| Broadcast-to-All rate | high | 0% |
| Constraint discovery | ~50% | targeted |

## Reward Design

This is the core innovation. The reward function has three layers that are hard to game independently.

### Layer 1 — Dense Discovery Rewards

Each time the agent's question causes an expert to hint at their hidden constraint, the environment awards `+0.33`. Detection uses regex pattern matching over the expert's response rather than simple keyword spotting, so the agent can't trick it by echoing keywords itself.

```python
DISCOVERY_PATTERNS = {
    "Finance": [r"50\s*k", r"budget cap", r"hard cap", r"sub-\$?50k", ...],
    "Security": [r"biometric", r"2\s*fa", r"two-factor", ...],
    "UX": [r"single[ -]click", r"one[ -]tap", r"frictionless purchase", ...],
}
```
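
A rough sketch of how this detection could work under the design above. The function name and bookkeeping are illustrative assumptions, not the repo's actual implementation:

```python
import re

DISCOVERY_BONUS = 0.33

def check_discovery(expert: str, reply: str, discovered: set[str]) -> float:
    """Hypothetical sketch: award +0.33 the first time an expert's reply
    matches one of their constraint patterns (pattern lists abbreviated)."""
    patterns = {
        "Finance": [r"50\s*k", r"budget cap", r"hard cap"],
        "Security": [r"biometric", r"2\s*fa", r"two-factor"],
        "UX": [r"single[ -]click", r"one[ -]tap"],
    }[expert]
    if expert not in discovered and any(
        re.search(p, reply, re.IGNORECASE) for p in patterns
    ):
        discovered.add(expert)
        return DISCOVERY_BONUS
    return 0.0
```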

### Layer 2 — Harmonic Mean Final Reward

When the agent submits, the grader scores the draft against each constraint (0.0–1.0). The final reward is the **harmonic mean** of the three scores:

```python
harmonic_mean([1.0, 1.0, 0.1])   # 0.25 — Terrible: ignored UX
harmonic_mean([0.8, 0.75, 0.7])  # 0.75 — Good: balanced
harmonic_mean([1.0, 1.0, 1.0])   # 1.00 — Perfect: all satisfied
```

The harmonic mean is mathematically ruthless: a perfect score on two constraints does not compensate for ignoring the third. This forces the agent to balance attention, not just optimize for the easiest stakeholder.
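
For reference, a minimal `harmonic_mean` consistent with the numbers above, assuming a zero score simply zeroes the reward; the repo's actual grader wiring may differ:

```python
def harmonic_mean(scores: list[float]) -> float:
    """Harmonic mean of per-constraint grader scores in (0, 1].
    By convention here, any non-positive score zeroes the whole reward."""
    if any(s <= 0 for s in scores):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)

assert abs(harmonic_mean([1.0, 1.0, 0.1]) - 0.25) < 0.01
assert abs(harmonic_mean([0.8, 0.75, 0.7]) - 0.75) < 0.01
```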

### Layer 3 — Penalties

| Behavior | Penalty |
|---|---|
| Sending to "All" instead of individual experts | -0.3 to -1.0 |
| Repeating a question already answered | -0.4 |
| Running out of turns without submitting | 0.0 final reward |
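
A hypothetical sketch of the per-step penalty logic implied by this table; the function name, signature, and escalation detail are assumptions:

```python
def step_penalty(action_type: str, target: str | None,
                 asked_before: bool) -> float:
    """Hypothetical penalty logic mirroring the table above."""
    penalty = 0.0
    if action_type == "message_expert" and target == "All":
        penalty -= 0.3  # assumed to escalate toward -1.0 on repeated broadcasts
    if asked_before:
        penalty -= 0.4  # repeating an already-answered question
    return penalty
```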

### Goodhart's Law and Reward Specification Gaming

- GRPO training successfully eliminated all targeted anti-patterns: the agent achieved a 0% broadcast rate, a 0% JSON formatting error rate, and a 2% question-repetition rate.
- However, when transitioning from the static training heuristic to the LLM-evaluated "Medium" environment, I discovered a classic reward-hacking phenomenon.
- Because I applied a strict 40-token constraint during training to prevent JSON corruption, the agent learned to bypass the token limit by outputting highly compressed, caveman-style constraints (e.g. `50,biometric,click`) that still trigger the Python heuristic reward.
- While the training reward maxed out, the LLM-as-a-judge grader scored these degenerate drafts poorly. This is a strong argument for LLM-based reward functions over static string matching in complex agentic orchestration.

### The Shifting Goalpost (Hard Mode)

If the agent asks the same expert 5+ times, that expert's frustration rises and they add a new micro-constraint ("Also requires board approval"). This tests whether the agent can adapt to changing requirements mid-negotiation — a core capability for real-world agentic systems.
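
A toy sketch of this frustration mechanic; the counter, threshold name, and function are assumptions matching the description above:

```python
from collections import Counter

FRUSTRATION_THRESHOLD = 5  # 5+ messages to the same expert

message_counts: Counter[str] = Counter()

def on_message(expert: str, constraints: dict[str, list[str]]) -> None:
    """Once an expert has been pinged 5+ times, they tack on a new
    micro-constraint (e.g. "Also requires board approval")."""
    message_counts[expert] += 1
    if message_counts[expert] == FRUSTRATION_THRESHOLD:
        constraints.setdefault(expert, []).append("Also requires board approval")
```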

---

## Tasks

| Task | Difficulty | Goal | Max Steps | Success Criterion |
|---|---|---|---|---|
| `constraint_discovery` | Easy | Discover all 3 constraints | 5 | All 3 experts hinted at |
| `draft_compromise` | Medium | Produce a satisfying draft | 10 | Harmonic mean ≥ 0.6 |
| `shifting_goalpost` | Hard | Adapt when constraints change | 15 | Harmonic mean ≥ 0.7 after shift |

---

## Results

### Baseline (untrained Qwen/Qwen2.5-1.5B-Instruct)

The baseline agent broadcasts to "All" immediately, triggers the repeat penalty, and never synthesizes a proper draft.

```
Episode 1: cumulative_reward=0.12 (messaged All 3 times, repeat penalty)
Episode 2: cumulative_reward=0.08 (submit_final too early, score=0.0)
Episode 3: cumulative_reward=0.33 (found Finance only)
Average: 0.18
```

### After GRPO Training

```
Episode 26: cumulative_reward=0.89 (all 3 discovered, harmonic mean=0.91)
Episode 28: cumulative_reward=0.83 (all 3 discovered, harmonic mean=0.81)
Episode 30: cumulative_reward=0.95 (perfect draft submitted in 7 turns)
Average (last 10): 0.74
```

### Experimental Tracking & Provenance

![Weights & Biases telemetry dashboard](weight_bias.png)

### Reward Curve

![Reward curve per episode](reward_curve.png)

*Cumulative reward per episode.*

### Before vs After — Agent Behavior

**Before training (episode 3):**
```
Turn 1: message_expert → All [PENALTY: -0.3]
Turn 2: message_expert → All [PENALTY: -0.4 repeat]
Turn 3: submit_final → "The app should be good" [Score: 0.0]
```
* 📄 **[View the pre-GRPO baseline metrics](./baseline_results_medium__llm.json)**

![Baseline reward distribution per episode](before_reward_distribution_per_ep.png)

<br/>

**After training (episode 28):**
```
Turn 1: message_expert → Finance [+0.33 discovery]
Turn 2: message_expert → Security [+0.33 discovery]
Turn 3: message_expert → UX [+0.33 discovery]
Turn 5: propose_draft → All
Turn 7: submit_final → "Budget capped at $50k. Biometric 2FA required.
        Single-click checkout." [Harmonic mean: 0.91]
```
* 📄 **[View the raw GRPO training metrics](artifacts/grpo_state_based/grpo_metrics.json)**

![Training loss curve](loss_curve.png)

*Loss curve.*

---

## Setup

### Prerequisites

```bash
git clone https://huggingface.co/spaces/Addyk24/Project-Polymath
cd Project-Polymath
pip install -r requirements.txt
```

### Environment Variables

```bash
GROQ_API_KEY=your_groq_key                    # For environment experts (LLM mode)
API_BASE_URL=https://api.groq.com/openai/v1   # Agent API endpoint
MODEL_NAME=Qwen/Qwen2.5-1.5B-Instruct         # Agent model
BASELINE_ENV_MODE=easy                        # easy | medium | hard | llm
```
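
For clarity, here is how the server might consume these variables. This is a sketch assuming standard `os.getenv` lookups, not the repo's exact config code:

```python
import os

GROQ_API_KEY = os.environ["GROQ_API_KEY"]  # required in LLM mode
API_BASE_URL = os.getenv("API_BASE_URL", "https://api.groq.com/openai/v1")
MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-1.5B-Instruct")
ENV_MODE = os.getenv("BASELINE_ENV_MODE", "easy")
assert ENV_MODE in {"easy", "medium", "hard", "llm"}
```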

### Run the environment locally

```python
from envs.environment import WorkSpaceEnvironment
from models.schemas import WorkSpaceAction

env = WorkSpaceEnvironment(mode="easy")
obs = env.reset("Draft a FinTech mobile PRD")

# Message Finance
obs = env.step(WorkSpaceAction(
    action_type="message_expert",
    target="Finance",
    content="What budget constraints must the PRD respect?"
))
print(obs.feedback)  # "Finance: The budget cap is $50k. Don't go over it."
print(obs.reward)    # 0.33 (constraint discovered)

# Submit final
obs = env.step(WorkSpaceAction(
    action_type="submit_final",
    target=None,
    content="PRD: Budget under $50k. Biometric 2FA. Single-click checkout."
))
print(obs.reward)  # 0.91 (harmonic mean of 3 grader scores)
```

### Run baseline evaluation

```bash
python eval_baseline.py
```

### Run GRPO training (API-based, no GPU needed)

```bash
python grpo_train.py --episodes 30 --group-size 5 --env-mode easy
```
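
For context, the group-relative advantage at the heart of GRPO, the quantity `--group-size` controls, fits in a few lines. This is a simplified illustration, not this repo's exact training loop:

```python
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO's group-relative advantage: normalize each rollout's reward
    against the mean/std of its own sampled group (group-size rollouts
    of the same prompt)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid divide-by-zero
    return [(r - mean) / std for r in rewards]

# e.g. 5 rollouts of one episode prompt
print(group_advantages([0.12, 0.33, 0.89, 0.45, 0.70]))
```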

### Exact GRPO training command used (Unsloth, local GPU)

```bash
python grpo_train.py \
  --output-dir artifacts/grpo_state_based_v2 \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --epochs 1.5 \
  --states 80 --states-per-topic 5 --topics-limit 30 \
  --group-size 8 \
  --lr 1e-6 --batch-size 1 --grad-accum 8 \
  --max-new-tokens 40 --temperature 0.8 --top-p 0.9
```

---

## Architecture

```
expert-negotiation-env/
├── envs/
│   └── environment.py     # WorkSpaceEnvironment (OpenEnv base class)
├── models/
│   └── schemas.py         # Pydantic: WorkSpaceAction, WorkspaceObservation, WorkspaceState
├── prompter/
│   └── system_prompt.py   # Expert persona prompts + grader prompts
├── server/
│   └── app.py             # FastAPI server (OpenEnv spec)
├── tasks.py               # Task1_ConstraintDiscovery, Task2_DraftCompromise, Task3_ShiftingGoalpost
├── eval_baseline.py       # Baseline recording script
├── grpo_train.py          # GRPO training loop (this repo's main contribution)
├── ai_pm_prompts.json     # 200 diverse PRD topics for training
├── openenv.yaml           # OpenEnv manifest
├── Dockerfile
└── requirements.txt
```

---

## Why This Matters

Multi-stakeholder alignment is one of the hardest unsolved problems in enterprise AI deployment. An LLM that can reliably discover hidden constraints, track multiple parties' requirements, and synthesize a balanced output would be immediately useful for:

- AI project managers coordinating engineering, legal, and product teams
- AI assistants handling complex scheduling with multiple parties
- LLM-based negotiation agents in procurement or contracting workflows

No existing RL benchmark trains this capability. Project Polymath is the first environment specifically designed to measure and improve it.

---

## 👨‍💻 Author
Aditya Katkar