Addyk24 committed
Commit 35e5549 · verified · 1 Parent(s): 64bdab0

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +324 -78

README.md CHANGED
@@ -1,78 +1,324 @@
- ---
- title: Project Polymath
- emoji: ⚖️
- colorFrom: blue
- colorTo: indigo
- sdk: docker
- pinned: false
- short_description: Multi-Agent RL Environment for PRD Negotiation
- ---
-
-
-
- # Project-Polymath
- Target Themes: Multi-Agent Interactions (Halluminate Bonus) & Simulated Experts-in-the-Loop (Snorkel AI Bonus).
-
-
- ## 1.💡 The Problem Statement (The 30% Storytelling Hook)
-
- Current LLMs are sycophantic and struggle with multi-stakeholder alignment. When an AI agent acts as a project manager or coordinator, it often blindly agrees with the last piece of feedback it received. The Problem: There is no benchmark to train agents to negotiate, balance conflicting constraints, and synthesize a final product when dealing with multiple "experts" who have different, shifting agendas.
-
- ## 2.⚙️ The Environment
- A Simulated Corporate Workspace (built on OpenEnv) The agent is placed in a simulated multi-turn environment (like a Slack channel or email thread) where it must draft a "Product Requirements Document" (PRD) or "Corporate Policy." The environment contains 3 LLM-driven "Simulated Experts" (e.g., The Security Lead, The Finance Director, and The UX Designer).
-
- - The Twist: Each expert has a hidden set of constraints (e.g., Finance has a strict $50k budget, Security requires 2FA, UX demands a 1-click checkout). The environment dynamically shifts these preferences slightly if the agent pushes back too hard.
-
- ## 3.📊 Capabilities of the Agent
- The agent must possess a persistent world model and theory-of-mind reasoning.
-
- - Information Gathering: It can query specific experts (action: message_expert, target: Finance).
- - State Tracking: It must maintain a persistent internal scratchpad of what each expert wants.
- - Drafting: It can propose a draft (action: propose_draft) which triggers the environment to return feedback from all three experts.
- - Persuasion/Negotiation: It must logically push back against experts if their constraints conflict with another expert.
-
- ## 4.🧠 The Tasks (Escalating Difficulty)
-
- - Task 1 (Easy): Information Retrieval. The agent must simply message all 3 experts, discover their hidden constraints, and output them correctly.
- - Task 2 (Medium): The Compromise. Two experts have slightly conflicting constraints. The agent must propose a draft that satisfies both by finding a middle ground.
- - Task 3 (Hard - The Long Horizon): The Shifting Goalpost. Mid-negotiation, the "CEO" (Environment event) changes the core objective. The agent must completely refactor the draft and re-align all 3 experts before the turn limit expires.
-
- ## 5.🎯 The Reward Model / Evaluation Logic (The 20% Technical Score)
-
- Judges want to see continuous math, not binary pass/fail grades.
-
- - Dense Step Rewards: +0.1 every time the agent discovers a previously unknown hidden constraint.
- - -0.5 Repetition penalty (asking an expert a question they already answered).
- - Sparse Final Reward (The Harmonic Mean): When the agent submits the final draft, the environment uses a frozen LLM grader to score the draft from 0.0 to 1.0 against each expert's hidden constraints.
- - Crucial Innovation: Do not average the scores. Calculate the Harmonic Mean of the three scores. The harmonic mean heavily punishes the agent if it completely ignores one expert to please the other two. (e.g., Scores of 1.0, 1.0, and 0.1 yield a terrible harmonic mean, forcing the agent to balance its attention).
-
- ## 6.🛡️ Post-Training & Self-Improvement Strategy
-
- GRPO (Group Relative Policy Optimization) via Unsloth/TRL. Instead of traditional PPO, you will use GRPO (which is what DeepSeek used and is currently the hottest trend in RL).
- - Strategy: For a given negotiation scenario, the model generates 8 different conversation trajectories.
- - The environment scores all 8 trajectories using the Harmonic Mean reward.
- - The model self-improves by increasing the probability of the actions taken in the highest-scoring trajectory relative to the group average.
-
-
- ## WORKFLOW
-
- ```bash
- Project-Polymath/
- ├── schema/
- │   ├── Action: {message_target, content} or {propose_draft, content}
- │   ├── Observation: {expert_responses, known_constraints, turn_count}
- │   └── State: {episode_id, discovered_constraints, draft_history}
- ├── experts/
- │   ├── SecurityExpert — hidden constraint: must include 2FA, data encryption
- │   └── FinanceExpert — hidden constraint: budget under $50k, no recurring costs
- ├── environment.py — reset(), step(), state()
- ├── reward.py — dense step rewards + harmonic mean final reward
- └── tasks.py — 3 difficulty tiers
-
- ```
-
-
- ## 👨‍💻 Author
- Aditya Katkar
-
-
+ ---
+ title: Project Polymath
+ emoji: ⚖️
+ colorFrom: blue
+ colorTo: indigo
+ sdk: docker
+ pinned: false
+ short_description: Multi-Agent RL Environment for PRD Negotiation
+ ---
+
+ # Project Polymath: Expert Negotiation Environment
+
+ > **Train LLMs to negotiate with conflicting stakeholders and produce balanced decisions.**
+
+ [![OpenEnv](https://img.shields.io/badge/OpenEnv-latest-blue)](https://github.com/huggingface/openenv)
+ [![HF Space](https://img.shields.io/badge/HuggingFace-Space-yellow)](https://huggingface.co/spaces/Addyk24/Project-Polymath)
+ [![Python](https://img.shields.io/badge/Python-3.11+-green)](https://python.org)
+
+ ---
+
+ ## 🔗 Quick Links
+
+ | Resource | Link |
+ |---|---|
+ | **Live Environment** | [HF Space](https://huggingface.co/spaces/Addyk24/Project-Polymath) |
+ | **HF Blog Post** | [Read on Hugging Face](/BLOG.md) |
+ | **GitHub Link** | [GitHub](https://github.com/Addyk-24/Project-Polymath) |
+ | **Training Notebook** | [Open in Colab](https://colab.research.google.com/YOUR_COLAB_LINK) |
+
+ ---
+
+ ## The Problem
+
+ Current LLMs are sycophantic. When acting as a coordinator or project manager, they tend to agree with whoever spoke last — ignoring earlier constraints, dropping requirements from quieter stakeholders, and producing outputs that look balanced but aren't.
+
+ **There is no training environment for this.** No benchmark exists to teach an LLM to:
+ - Discover hidden constraints through targeted questioning
+ - Track multiple stakeholders' requirements simultaneously
+ - Synthesize a final output that satisfies *all* parties — not just the loudest
+
+ This is a gap that matters. Every enterprise AI deployment involves multi-stakeholder alignment. Every LLM agent acting as an assistant, PM, or coordinator faces this problem daily.
+
+ ---
+
+ ## The Environment
+
+ An agent is placed in a simulated corporate workspace as a **Product Manager**. Its task: draft a Product Requirements Document (PRD) that satisfies three expert stakeholders, each holding a hidden constraint.
+
+ ```
+ ┌──────────────────────────────────────────────────────┐
+ │                 PROJECT POLYMATH ENV                 │
+ │                                                      │
+ │  Agent (PM) ──► message_expert ──► Finance           │
+ │             ──► message_expert ──► Security          │
+ │             ──► message_expert ──► UX                │
+ │             ──► propose_draft  ──► All experts       │
+ │             ──► submit_final   ──► Grader            │
+ │                                                      │
+ │  Reward: Dense (discovery) + Sparse (harmonic mean)  │
+ └──────────────────────────────────────────────────────┘
+ ```
+
+ ### Hidden Constraints (what the agent must discover)
+
+ | Expert | Hidden Constraint | Hints at |
+ |---|---|---|
+ | Finance | Budget under $50k | "Keep it lean", "hard cap" |
+ | Security | Biometric 2FA required | "Second factor", "physiological auth" |
+ | UX | Single-click checkout | "One tap", "zero friction" |
+
+ The agent never sees these directly. It must ask the right questions, interpret expert responses, and synthesize a draft that addresses all three.
+
+ ### Actions
+
+ ```python
+ # Discover constraints
+ WorkSpaceAction(action_type="message_expert", target="Finance",
+                 content="What budget constraints must the PRD respect?")
+
+ # Propose a draft for feedback
+ WorkSpaceAction(action_type="propose_draft", target="All",
+                 content="PRD: Budget capped at $50k, biometric 2FA, single-click checkout.")
+
+ # Submit final when ready
+ WorkSpaceAction(action_type="submit_final", target=None,
+                 content="Final PRD with all three constraints addressed...")
+ ```
+
+ ### Observations
+
+ ```python
+ WorkspaceObservation(
+     feedback="Finance: We need to keep this under a tight ceiling — $50k max.",
+     current_turn=1,
+     reward=0.33,  # Discovery bonus: Finance constraint found
+     done=False,
+ )
+ ```
+
+ ---
+
+ **Headline results** (full details in the [Results](#results) section below):
+
+ | Metric | Baseline | After GRPO |
+ |--------|----------|------------|
+ | Mean reward | -0.52 | +1.36 (peak) |
+ | JSON error rate | 40% | 0% |
+ | Broadcast-to-All rate | high | 0% |
+ | Constraint discovery | ~50% | targeted |
+
+ ## Reward Design
+
+ This is the core innovation. The reward function has three layers that are hard to game independently.
+
+ ### Layer 1 — Dense Discovery Rewards
+
+ Each time the agent's question causes an expert to hint at their hidden constraint, the environment awards `+0.33`. Detection runs regex patterns over the expert's reply rather than over the agent's message, so the agent cannot farm the bonus by simply echoing keywords.
+
+ ```python
+ DISCOVERY_PATTERNS = {
+     "Finance": [r"50\s*k", r"budget cap", r"hard cap", r"sub-\$?50k", ...],
+     "Security": [r"biometric", r"2\s*fa", r"two-factor", ...],
+     "UX": [r"single[ -]click", r"one[ -]tap", r"frictionless purchase", ...],
+ }
+ ```
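+
+ A minimal sketch of how this layer could award the bonus, assuming a `discovered` set kept in episode state (illustrative, not the repo's exact code; the `...` above elide further patterns):
+
+ ```python
+ import re
+
+ def discovery_reward(expert: str, reply: str, discovered: set[str]) -> float:
+     """+0.33 the first time an expert's reply matches one of their patterns."""
+     if expert not in discovered and any(
+         re.search(pat, reply, re.IGNORECASE) for pat in DISCOVERY_PATTERNS[expert]
+     ):
+         discovered.add(expert)  # the bonus is paid at most once per expert
+         return 0.33
+     return 0.0
+ ```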
+
+ ### Layer 2 — Harmonic Mean Final Reward
+
+ When the agent submits, the grader scores the draft against each constraint (0.0–1.0). The final reward is the **harmonic mean** of the three scores:
+
+ ```python
+ harmonic_mean([1.0, 1.0, 0.1])  = 0.25  # Terrible — ignored UX
+ harmonic_mean([0.8, 0.75, 0.7]) = 0.75  # Good — balanced
+ harmonic_mean([1.0, 1.0, 1.0])  = 1.00  # Perfect — all satisfied
+ ```
+
+ The harmonic mean is mathematically ruthless: a perfect score on two constraints does not compensate for ignoring the third. This forces the agent to balance attention, not just optimize for the easiest stakeholder.
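+
+ For reference, a one-line implementation of the aggregation (a sketch; the repo's grader may handle zeros differently):
+
+ ```python
+ def harmonic_mean(scores: list[float], eps: float = 1e-6) -> float:
+     """n divided by the sum of reciprocals; eps keeps a zero score finite."""
+     return len(scores) / sum(1.0 / max(s, eps) for s in scores)
+
+ assert round(harmonic_mean([1.0, 1.0, 0.1]), 2) == 0.25  # ignoring one expert is fatal
+ ```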
+
+ ### Layer 3 — Penalties
+
+ | Behavior | Penalty |
+ |---|---|
+ | Sending to "All" instead of individual experts | -0.3 to -1.0 |
+ | Repeating a question already answered | -0.4 |
+ | Running out of turns without submitting | 0.0 final reward |
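+
+ The broadcast penalty range suggests an escalating schedule; one assumed implementation consistent with the table (illustrative only):
+
+ ```python
+ def broadcast_penalty(n_broadcasts: int) -> float:
+     """-0.3 for the first "All" message, deepening to a -1.0 cap on repeats (assumed)."""
+     return -min(0.3 * n_broadcasts, 1.0)
+ ```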
+
+ ### Goodhart's Law and Reward Specification Gaming
+
+ - GRPO training successfully eliminated all targeted anti-patterns: the agent achieved a 0% broadcast rate, a 0% JSON formatting error rate, and a 2% question-repetition rate.
+ - However, when transitioning from the static training heuristic to the LLM-evaluated "Medium" environment, I discovered a classic reward-hacking phenomenon.
+ - Because I applied a strict 40-token limit during training to prevent JSON corruption, the agent learned to game it by emitting highly compressed, caveman-style constraint strings (e.g., "50,biometric,click") that still trigger the Python heuristic reward.
+ - While the training reward maxed out, the LLM-as-a-judge grader penalized these degenerate drafts, which is exactly why a judged reward function beats static string matching in complex agentic orchestration.
+
+ ### The Shifting Goalpost (Hard Mode)
+
+ If the agent asks the same expert 5+ times, that expert's frustration rises and they add a new micro-constraint ("Also requires board approval"). This tests whether the agent can adapt to changing requirements mid-negotiation — a core capability for real-world agentic systems.
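+
+ A hypothetical sketch of that trigger (field names below are illustrative, not the repo's actual schema):
+
+ ```python
+ FRUSTRATION_THRESHOLD = 5
+
+ def on_expert_messaged(expert: str, ask_counts: dict[str, int],
+                        constraints: dict[str, list[str]]) -> None:
+     """After the 5th message to the same expert, they gain a micro-constraint."""
+     ask_counts[expert] = ask_counts.get(expert, 0) + 1
+     if ask_counts[expert] == FRUSTRATION_THRESHOLD:
+         constraints[expert].append("Also requires board approval")
+ ```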
+
+ ---
+
+ ## Tasks
+
+ | Task | Difficulty | Goal | Max Steps | Success Criterion |
+ |---|---|---|---|---|
+ | `constraint_discovery` | Easy | Discover all 3 constraints | 5 | All 3 experts hinted at |
+ | `draft_compromise` | Medium | Produce a satisfying draft | 10 | Harmonic mean ≥ 0.6 |
+ | `shifting_goalpost` | Hard | Adapt when constraints change | 15 | Harmonic mean ≥ 0.7 after shift |
+
+ ---
+
+ ## Results
+
+ ### Baseline (untrained Qwen/Qwen2.5-1.5B-Instruct)
+
+ The baseline agent broadcasts to "All" immediately, triggers the repeat penalty, and never synthesizes a proper draft.
+
+ ```
+ Episode 1: cumulative_reward=0.12 (messaged All 3 times, repeat penalty)
+ Episode 2: cumulative_reward=0.08 (submit_final too early, score=0.0)
+ Episode 3: cumulative_reward=0.33 (found Finance only)
+ Average: 0.18
+ ```
+
+ ### After GRPO Training
+
+ ```
+ Episode 26: cumulative_reward=0.89 (all 3 discovered, harmonic mean=0.91)
+ Episode 28: cumulative_reward=0.83 (all 3 discovered, harmonic mean=0.81)
+ Episode 30: cumulative_reward=0.95 (perfect draft submitted in 7 turns)
+ Average (last 10): 0.74
+ ```
+
+ ### Reward Curve
+
+ ![Reward curve showing improvement from ~0.18 baseline to ~0.74 after GRPO training](image-2.png)
+ *Cumulative reward per episode.*
+
+ ![Loss curve during GRPO training](image.png)
+ *Loss curve.*
+
+ ### Before vs After — Agent Behavior
+
+ **Before training (episode 3):**
+ ```
+ Turn 1: message_expert → All [PENALTY: -0.3]
+ Turn 2: message_expert → All [PENALTY: -0.4 repeat]
+ Turn 3: submit_final → "The app should be good" [Score: 0.0]
+ ```
+
+ **After training (episode 28):**
+ ```
+ Turn 1: message_expert → Finance [+0.33 discovery]
+ Turn 2: message_expert → Security [+0.33 discovery]
+ Turn 3: message_expert → UX [+0.33 discovery]
+ Turn 5: propose_draft → All
+ Turn 7: submit_final → "Budget capped at $50k. Biometric 2FA required.
+          Single-click checkout." [Harmonic mean: 0.91]
+ ```
+
+ ---
+
+ ## Setup
+
+ ### Installation
+
+ ```bash
+ git clone https://huggingface.co/spaces/Addyk24/Project-Polymath
+ cd Project-Polymath
+ pip install -r requirements.txt
+ ```
+
+ ### Environment Variables
+
+ ```bash
+ GROQ_API_KEY=your_groq_key                    # For environment experts (LLM mode)
+ API_BASE_URL=https://api.groq.com/openai/v1   # Agent API endpoint
+ MODEL_NAME=Qwen/Qwen2.5-1.5B-Instruct         # Agent model
+ BASELINE_ENV_MODE=easy                        # easy | medium | hard | llm
+ ```
+
+ ### Run the environment locally
+
+ ```python
+ from envs.environment import WorkSpaceEnvironment
+ from models.schemas import WorkSpaceAction
+
+ env = WorkSpaceEnvironment(mode="easy")
+ obs = env.reset("Draft a FinTech mobile PRD")
+
+ # Message Finance
+ obs = env.step(WorkSpaceAction(
+     action_type="message_expert",
+     target="Finance",
+     content="What budget constraints must the PRD respect?"
+ ))
+ print(obs.feedback)  # "Finance: The budget cap is $50k. Don't go over it."
+ print(obs.reward)    # 0.33 (constraint discovered)
+
+ # Submit final
+ obs = env.step(WorkSpaceAction(
+     action_type="submit_final",
+     target=None,
+     content="PRD: Budget under $50k. Biometric 2FA. Single-click checkout."
+ ))
+ print(obs.reward)  # 0.91 (harmonic mean of 3 grader scores)
+ ```
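+
+ The same pattern extends to a full discovery sweep; a short loop over all three experts (same assumed API as the snippet above):
+
+ ```python
+ for expert in ["Finance", "Security", "UX"]:
+     obs = env.step(WorkSpaceAction(
+         action_type="message_expert",
+         target=expert,
+         content=f"What hard constraints does {expert} have for this PRD?",
+     ))
+     print(expert, obs.reward, obs.feedback)  # +0.33 each time a new constraint surfaces
+ ```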
+
+ ### Run baseline evaluation
+
+ ```bash
+ python eval_baseline.py
+ ```
+
+ ### Run GRPO training (API-based, no GPU needed)
+
+ ```bash
+ python grpo_train.py --episodes 30 --group-size 5 --env-mode easy
+ ```
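+
+ For intuition, the group-relative step at the heart of GRPO can be sketched as follows (simplified; `grpo_train.py` is the authoritative implementation):
+
+ ```python
+ import statistics
+
+ def group_advantages(rewards: list[float]) -> list[float]:
+     """Normalize each trajectory's reward against its group (size = --group-size)."""
+     mean = statistics.mean(rewards)
+     std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
+     return [(r - mean) / std for r in rewards]
+
+ # Trajectories above the group mean get positive advantage and are reinforced.
+ print(group_advantages([0.2, 0.9, 0.4, 0.7, 0.3]))
+ ```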
+
+ ### GRPO training with Unsloth on a local GPU (exact command used)
+
+ ```bash
+ python grpo_train.py \
+   --output-dir artifacts/grpo_state_based_v2 \
+   --model Qwen/Qwen2.5-1.5B-Instruct \
+   --epochs 1.5 \
+   --states 80 \
+   --states-per-topic 5 \
+   --topics-limit 30 \
+   --group-size 8 \
+   --lr 1e-6 \
+   --batch-size 1 \
+   --grad-accum 8 \
+   --max-new-tokens 40 \
+   --temperature 0.8 \
+   --top-p 0.9
+ ```
+
+ ---
+
+ ## Architecture
+
+ ```
+ expert-negotiation-env/
+ ├── envs/
+ │   └── environment.py      # WorkSpaceEnvironment (OpenEnv base class)
+ ├── models/
+ │   └── schemas.py          # Pydantic: WorkSpaceAction, WorkspaceObservation, WorkspaceState
+ ├── prompter/
+ │   └── system_prompt.py    # Expert persona prompts + grader prompts
+ ├── server/
+ │   └── app.py              # FastAPI server (OpenEnv spec)
+ ├── tasks.py                # Task1_ConstraintDiscovery, Task2_DraftCompromise, Task3_ShiftingGoalpost
+ ├── eval_baseline.py        # Baseline recording script
+ ├── grpo_train.py           # GRPO training loop (this repo's main contribution)
+ ├── ai_pm_prompts.json      # 200 diverse PRD topics for training
+ ├── openenv.yaml            # OpenEnv manifest
+ ├── Dockerfile
+ └── requirements.txt
+ ```
+
+ ---
+
+ ## Why This Matters
+
+ Multi-stakeholder alignment is one of the hardest unsolved problems in enterprise AI deployment. An LLM that can reliably discover hidden constraints, track multiple parties' requirements, and synthesize a balanced output would be immediately useful for:
+
+ - AI project managers coordinating engineering, legal, and product teams
+ - AI assistants handling complex scheduling with multiple parties
+ - LLM-based negotiation agents in procurement or contracting workflows
+
+ No existing RL benchmark trains this capability. Project Polymath is, to my knowledge, the first environment specifically designed to measure and improve it.
+
+ ---
+
+ ## 👨‍💻 Author
+ Aditya Katkar
+