---
title: Project Polymath
emoji: ⚖️
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
short_description: Multi-Agent RL Environment for PRD Negotiation
---
# Project Polymath: Expert Negotiation Environment
> **Train LLMs to negotiate with conflicting stakeholders and produce balanced decisions.**
[![OpenEnv](https://img.shields.io/badge/OpenEnv-latest-blue)](https://github.com/huggingface/openenv)
[![HF Space](https://img.shields.io/badge/HuggingFace-Space-yellow)](https://huggingface.co/spaces/Addyk24/Project-Polymath)
[![Python](https://img.shields.io/badge/Python-3.11+-green)](https://python.org)
---
## 🔗 Quick Links
| Resource | Link |
|---|---|
| **🔗Live Environment** | [HF Space](https://huggingface.co/spaces/Addyk24/Project-Polymath) |
| **📝HF Blog Post** | [Read on Hugging Face](https://huggingface.co/spaces/Addyk24/Project-Polymath/blob/main/BLOG.md) |
| **GitHub Link** | [GitHub](https://github.com/Addyk-24/Project-Polymath) |
| **Training Notebook** | [Open in Colab](https://colab.research.google.com/drive/13KqXt_7HTZTJEC4yD98My5g5Za9J1-5T?usp=sharing) |
---
## 🧱 The Problem Statement
Current LLMs are sycophantic. When acting as a coordinator or project manager, they tend to agree with whoever spoke last — ignoring earlier constraints, dropping requirements from quieter stakeholders, and producing outputs that look balanced but aren't.
**There is no training environment for this.** No benchmark exists to teach an LLM to:
- Discover hidden constraints through targeted questioning
- Track multiple stakeholders' requirements simultaneously
- Synthesize a final output that satisfies *all* parties — not just the loudest
This is a gap that matters. Every enterprise AI deployment involves multi-stakeholder alignment. Every LLM agent acting as an assistant, PM, or coordinator faces this problem daily.
---
## 🧠 The Environment
An agent is placed in a simulated corporate workspace as a **Product Manager**. Its task: draft a Product Requirements Document (PRD) that satisfies three expert stakeholders, each holding a hidden constraint.
```
┌─────────────────────────────────────────────────────┐
│ PROJECT POLYMATH ENV │
│ │
│ Agent (PM) ──► message_expert ──► Finance │
│ ──► message_expert ──► Security │
│ ──► message_expert ──► UX │
│ ──► propose_draft ──► All experts │
│ ──► submit_final ──► Grader │
│ │
│ Reward: Dense (discovery) + Sparse (harmonic mean) │
└─────────────────────────────────────────────────────┘
```
### 🏛️ System Architecture: The State-Based Sieve
Our architecture is designed as a closed-loop State Machine. Unlike standard LLM "chat" wrappers, Project Polymath implements a rigorous enforcement layer that separates reasoning from execution.
![architecture](system_architecture.png)
Architectural Highlights:
- The 40-Token Critical Sieve: Positioned as a diamond gate between the Agent and the Workspace. It acts as a hard bandwidth filter, ensuring the model is penalized for any verbosity that exceeds the survivor-mode threshold.
- Expert Constraints Database: A persistent state container holding hidden stakeholder variables. The Environment only allows these variables to be "unlocked" through specific, targeted queries from the agent.
- Closed-Loop Reward Engine: The "Judge" monitors the state changes in the environment and provides a real-time floating-point reward signal back to the GRPO trainer, iteratively sharpening the "Sniper" logic.
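As a rough illustration, the sieve can be approximated by a token-budget check. This is a sketch: the function name, per-token penalty, and whitespace tokenization are my assumptions, and a real implementation would likely count tokens with the model's tokenizer.

```python
MAX_ACTION_TOKENS = 40  # the "Critical Sieve" bandwidth threshold

def sieve_penalty(action_text: str, per_token_penalty: float = 0.05) -> float:
    """Negative reward proportional to how far the action overshoots the budget.

    Tokens are approximated by whitespace splitting; the real environment
    may use the model tokenizer instead.
    """
    overflow = max(0, len(action_text.split()) - MAX_ACTION_TOKENS)
    return -per_token_penalty * overflow
```

Anything at or under the 40-token budget costs nothing; every extra token subtracts from the reward.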
### 🏛️ Hidden Constraints (what the agent must discover)
| Expert | Hidden Constraint | Hints at |
|---|---|---|
| Finance | Budget ≤ $50k | "Keep it lean", "hard cap" |
| Security | Biometric 2FA required | "Second factor", "physiological auth" |
| UX | Single-click checkout | "One tap", "zero friction" |
The agent never sees these directly. It must ask the right questions, interpret expert responses, and synthesize a draft that addresses all three.
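Concretely, the hidden state might be stored as a simple server-side mapping. The field names below are assumptions for illustration, not the repo's actual schema:

```python
# Sketch of the server-side constraint store; the agent never sees this dict.
HIDDEN_CONSTRAINTS = {
    "Finance": {"constraint": "Budget <= $50k", "hints": ["keep it lean", "hard cap"]},
    "Security": {"constraint": "Biometric 2FA required", "hints": ["second factor", "physiological auth"]},
    "UX": {"constraint": "Single-click checkout", "hints": ["one tap", "zero friction"]},
}
```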
### ✨ Actions
```python
# Discover constraints
WorkSpaceAction(action_type="message_expert", target="Finance",
content="What budget constraints must the PRD respect?")
# Propose a draft for feedback
WorkSpaceAction(action_type="propose_draft", target="All",
content="PRD: Budget capped at $50k, biometric 2FA, single-click checkout.")
# Submit final when ready
WorkSpaceAction(action_type="submit_final", target=None,
content="Final PRD with all three constraints addressed...")
```
### 🧱 Observations
```python
WorkspaceObservation(
feedback="Finance: We need to keep this under a tight ceiling — $50k max.",
current_turn=1,
reward=0.33, # Discovery bonus: Finance constraint found
done=False,
)
```
---
**Training results at a glance:**

| Metric | Baseline | After GRPO |
|--------|----------|------------|
| Mean reward | -0.52 | +1.36 (peak) |
| JSON error rate | 40% | 0% |
| Broadcast-to-All rate | high | 0% |
| Constraint discovery | ~50% | targeted |
## ✨ Reward Design
This is the core innovation. The reward function has three layers that are hard to game independently.
### Layer 1 — Dense Discovery Rewards
Each time the agent's question causes an expert to hint at their hidden constraint, the environment awards `+0.33`. Detection runs regex patterns over the expert's response rather than a naive keyword check, so the agent can't farm the bonus by stuffing keywords into its own messages.
```python
DISCOVERY_PATTERNS = {
"Finance": [r"50\s*k", r"budget cap", r"hard cap", r"sub-\$?50k", ...],
"Security": [r"biometric", r"2\s*fa", r"two-factor", ...],
"UX": [r"single[ -]click", r"one[ -]tap", r"frictionless purchase", ...],
}
```
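Detection can be sketched as a case-insensitive scan of the expert's reply. The helper name below is hypothetical, and the pattern list is an illustrative subset of the table above:

```python
import re

def discovered(patterns: list[str], reply: str) -> bool:
    """True if any discovery pattern matches the expert's reply."""
    return any(re.search(p, reply, re.IGNORECASE) for p in patterns)

finance_patterns = [r"50\s*k", r"budget cap", r"hard cap"]
discovered(finance_patterns, "We need to stay under a hard cap of $50k.")  # → True
```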
### Layer 2 — Harmonic Mean Final Reward
When the agent submits, the grader scores the draft against each constraint (0.0–1.0). The final reward is the **harmonic mean** of the three scores:
```python
harmonic_mean([1.0, 1.0, 0.1]) = 0.25  # Terrible — ignored UX
harmonic_mean([0.8, 0.75, 0.7]) = 0.75 # Good — balanced
harmonic_mean([1.0, 1.0, 1.0]) = 1.00 # Perfect — all satisfied
```
The harmonic mean is mathematically ruthless: a perfect score on two constraints does not compensate for ignoring the third. This forces the agent to balance attention, not just optimize for the easiest stakeholder.
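The metric itself is a one-liner. This sketch adds an epsilon guard so a zero score doesn't divide by zero; that guard is my assumption, not necessarily the repo's behavior:

```python
def harmonic_mean(scores: list[float], eps: float = 1e-9) -> float:
    """Harmonic mean of grader scores; eps guards against division by zero."""
    return len(scores) / sum(1.0 / max(s, eps) for s in scores)

harmonic_mean([1.0, 1.0, 1.0])  # → 1.0
```

A single low score drags the whole mean down far harder than an arithmetic mean would, which is exactly the property the grader relies on.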
### Layer 3 — Penalties
| Behavior | Penalty |
|---|---|
| Sending to "All" instead of individual experts | -0.3 to -1.0 |
| Repeating a question already answered | -0.4 |
| Running out of turns without submitting | 0.0 final reward |
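One way to realize the escalating broadcast penalty from the table above. The step size and exact escalation schedule are assumptions; only the -0.3 to -1.0 range comes from the table:

```python
def broadcast_penalty(target: str, n_prior_broadcasts: int) -> float:
    """Escalating penalty for messaging 'All' instead of a single expert.

    Starts at -0.3 on the first broadcast and steps toward the -1.0 floor
    (step size is an assumption for illustration).
    """
    if target != "All":
        return 0.0
    return max(-1.0, -0.3 - 0.35 * n_prior_broadcasts)
```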
### Goodhart’s Law and Reward Specification Gaming
- GRPO training successfully eliminated all targeted anti-patterns: the agent achieved a 0% broadcast rate, a 0% JSON formatting error rate, and a 2% question-repetition rate.
- However, when transitioning from the static training heuristic to the LLM-evaluated "Medium" environment, I discovered a classic reward-hacking phenomenon.
- Because I applied a strict 40-token constraint during training to prevent JSON corruption, the agent learned to satisfy the token limit by outputting highly compressed, caveman-style constraint lists (e.g. "50,biometric,click") that still triggered the Python heuristic reward.
- While the training reward maxed out, the LLM-as-a-judge grader penalized these compressed drafts, showing why judge-based evaluation is more robust than static string matching for complex agentic orchestration.
### The Shifting Goalpost (Hard Mode)
If the agent asks the same expert 5+ times, that expert's frustration rises and they add a new micro-constraint ("Also requires board approval"). This tests whether the agent can adapt to changing requirements mid-negotiation — a core capability for real-world agentic systems.
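A minimal sketch of the frustration mechanic, assuming a per-expert counter (class and attribute names are hypothetical, not the repo's schema):

```python
FRUSTRATION_THRESHOLD = 5  # asks of the same expert before the goalpost shifts

class ExpertState:
    """Minimal sketch of per-expert frustration tracking (Hard mode)."""

    def __init__(self) -> None:
        self.times_asked = 0
        self.extra_constraints: list[str] = []

    def on_message(self) -> None:
        self.times_asked += 1
        if self.times_asked == FRUSTRATION_THRESHOLD:
            # A new micro-constraint appears mid-negotiation.
            self.extra_constraints.append("Also requires board approval")
```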
---
## 🧠 Tasks
| Task | Difficulty | Goal | Max Steps | Success Criterion |
|---|---|---|---|---|
| `constraint_discovery` | Easy | Discover all 3 constraints | 5 | All 3 experts hinted at |
| `draft_compromise` | Medium | Produce a satisfying draft | 10 | Harmonic mean ≥ 0.6 |
| `shifting_goalpost` | Hard | Adapt when constraints change | 15 | Harmonic mean ≥ 0.7 after shift |
---
## 🏛️ Results
### Baseline (untrained Qwen/Qwen2.5-1.5B-Instruct)
The baseline agent broadcasts to "All" immediately, triggers the repeat penalty, and never synthesizes a proper draft.
```
Episode 1: cumulative_reward=0.12 (messaged All 3 times, repeat penalty)
Episode 2: cumulative_reward=0.08 (submit_final too early, score=0.0)
Episode 3: cumulative_reward=0.33 (found Finance only)
Average: 0.18
```
### After GRPO Training
```
Episode 26: cumulative_reward=0.89 (all 3 discovered, harmonic mean=0.91)
Episode 28: cumulative_reward=0.83 (all 3 discovered, harmonic mean=0.81)
Episode 30: cumulative_reward=0.95 (perfect draft submitted in 7 turns)
Average (last 10): 0.74
```
### ⚙️ Experimental Tracking & Provenance
![Weights & Biases dashboard](weight_bias.png)
### 🏆 Reward Curve
**Cumulative reward per episode**
![Telemetry Dashboard](reward_curve.png)
### 📄 Before vs After — Agent Behavior
**Before training (episode 3):**
```
Turn 1: message_expert → All [PENALTY: -0.3]
Turn 2: message_expert → All [PENALTY: -0.4 repeat]
Turn 3: submit_final → "The app should be good" [Score: 0.0]
```
* 📄 **[View the Before GRPO Training Metrics](baseline_results_medium__llm.json)**
![Telemetry Dashboard](before_reward_distribution_per_ep.png)
<br/>
**After training (episode 28):**
```
Turn 1: message_expert → Finance [+0.33 discovery]
Turn 2: message_expert → Security [+0.33 discovery]
Turn 3: message_expert → UX [+0.33 discovery]
Turn 5: propose_draft → All
Turn 7: submit_final → "Budget capped at $50k. Biometric 2FA required.
Single-click checkout." [Harmonic mean: 0.91]
```
---
## 🛠 Training Logs
* 📄 **[View the Raw GRPO Training Log Metrics](grpo_metrics.json)**
<br>
**Loss Curve**
![Telemetry Dashboard](loss_curve.png)
## Setup
### Prerequisites
```bash
git clone https://huggingface.co/spaces/Addyk24/Project-Polymath
cd project-polymath
pip install -r requirements.txt
```
### Environment Variables
```bash
GROQ_API_KEY=your_groq_key # For environment experts (LLM mode)
API_BASE_URL=https://api.groq.com/openai/v1 # Agent API endpoint
MODEL_NAME=Qwen/Qwen2.5-1.5B-Instruct # Agent model
BASELINE_ENV_MODE=easy # easy | medium | hard | llm
```
### Run the environment locally
```python
from envs.environment import WorkSpaceEnvironment
from models.schemas import WorkSpaceAction
env = WorkSpaceEnvironment(mode="easy")
obs = env.reset("Draft a FinTech mobile PRD")
# Message Finance
obs = env.step(WorkSpaceAction(
action_type="message_expert",
target="Finance",
content="What budget constraints must the PRD respect?"
))
print(obs.feedback) # "Finance: The budget cap is $50k. Don't go over it."
print(obs.reward) # 0.33 (constraint discovered)
# Submit final
obs = env.step(WorkSpaceAction(
action_type="submit_final",
target=None,
content="PRD: Budget under $50k. Biometric 2FA. Single-click checkout."
))
print(obs.reward) # 0.91 (harmonic mean of 3 grader scores)
```
### Run baseline evaluation
```bash
python eval_baseline.py
```
### Run GRPO training (API-based, no GPU needed)
```bash
python grpo_train.py --episodes 30 --group-size 5 --env-mode easy
```
### Command that I ran for GRPO training with Unsloth (on-site GPU)
```bash
python grpo_train.py --output-dir artifacts/grpo_state_based_v2 --model Qwen/Qwen2.5-1.5B-Instruct --epochs 1.5 --states 80 --states-per-topic 5 --topics-limit 30 --group-size 8 --lr 1e-6 --batch-size 1 --grad-accum 8 --max-new-tokens 40 --temperature 0.8 --top-p 0.9
```
---
## ✨ Architecture
```
expert-negotiation-env/
├── envs/
│ └── environment.py # WorkSpaceEnvironment (OpenEnv base class)
├── models/
│ └── schemas.py # Pydantic: WorkSpaceAction, WorkspaceObservation, WorkspaceState
├── prompter/
│ └── system_prompt.py # Expert persona prompts + grader prompts
├── server/
│ └── app.py # FastAPI server (OpenEnv spec)
├── tasks.py # Task1_ConstraintDiscovery, Task2_DraftCompromise, Task3_ShiftingGoalpost
├── eval_baseline.py # Baseline recording script
├── grpo_train.py # GRPO training loop (this repo's main contribution)
├── ai_pm_prompts.json # 200 diverse PRD topics for training
├── openenv.yaml # OpenEnv manifest
├── Dockerfile
└── requirements.txt
```
---
## 🔍 Why This Matters
Multi-stakeholder alignment is one of the hardest unsolved problems in enterprise AI deployment. An LLM that can reliably discover hidden constraints, track multiple parties' requirements, and synthesize a balanced output would be immediately useful for:
- AI project managers coordinating engineering, legal, and product teams
- AI assistants handling complex scheduling with multiple parties
- LLM-based negotiation agents in procurement or contracting workflows
No existing RL benchmark trains this capability. Project Polymath is the first environment specifically designed to measure and improve it.
---
## 👨‍💻 Author
Aditya Katkar