---
title: Project Polymath
emoji: ⚖️
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
short_description: Multi-Agent RL Environment for PRD Negotiation
---

# Project Polymath: Expert Negotiation Environment

> **Train LLMs to negotiate with conflicting stakeholders and produce balanced decisions.**

[![OpenEnv](https://img.shields.io/badge/OpenEnv-latest-blue)](https://github.com/huggingface/openenv)
[![HF Space](https://img.shields.io/badge/HuggingFace-Space-yellow)](https://huggingface.co/spaces/Addyk24/Project-Polymath)
[![Python](https://img.shields.io/badge/Python-3.11+-green)](https://python.org)

---

## 🔗 Quick Links

| Resource | Link |
|---|---|
| **🔗 Live Environment** | [HF Space](https://huggingface.co/spaces/Addyk24/Project-Polymath) |
| **📝 HF Blog Post** | [Read on Hugging Face](/BLOG.md) |
| **GitHub Repo** | [GitHub](https://github.com/Addyk-24/Project-Polymath) |
| **Training Notebook** | [Open in Colab](https://colab.research.google.com/drive/13KqXt_7HTZTJEC4yD98My5g5Za9J1-5T?usp=sharing) |

---

## The Problem

Current LLMs are sycophantic. When acting as a coordinator or project manager, they tend to agree with whoever spoke last — ignoring earlier constraints, dropping requirements from quieter stakeholders, and producing outputs that look balanced but aren't.

**There is no training environment for this.** No benchmark exists to teach an LLM to:
- Discover hidden constraints through targeted questioning
- Track multiple stakeholders' requirements simultaneously
- Synthesize a final output that satisfies *all* parties — not just the loudest

This is a gap that matters. Every enterprise AI deployment involves multi-stakeholder alignment. Every LLM agent acting as an assistant, PM, or coordinator faces this problem daily.

---

## The Environment

An agent is placed in a simulated corporate workspace as a **Product Manager**. Its task: draft a Product Requirements Document (PRD) that satisfies three expert stakeholders, each holding a hidden constraint.

```
┌─────────────────────────────────────────────────────┐
│                PROJECT POLYMATH ENV                 │
│                                                     │
│  Agent (PM) ──► message_expert ──► Finance          │
│             ──► message_expert ──► Security         │
│             ──► message_expert ──► UX               │
│             ──► propose_draft  ──► All experts      │
│             ──► submit_final   ──► Grader           │
│                                                     │
│  Reward: Dense (discovery) + Sparse (harmonic mean) │
└─────────────────────────────────────────────────────┘
```

### Hidden Constraints (what the agent must discover)

| Expert | Hidden Constraint | Example Hints |
|---|---|---|
| Finance | Budget ≤ $50k | "Keep it lean", "hard cap" |
| Security | Biometric 2FA required | "Second factor", "physiological auth" |
| UX | Single-click checkout | "One tap", "zero friction" |

The agent never sees these directly. It must ask the right questions, interpret expert responses, and synthesize a draft that addresses all three.

### Actions

```python
# Discover constraints
WorkSpaceAction(action_type="message_expert", target="Finance",
                content="What budget constraints must the PRD respect?")

# Propose a draft for feedback
WorkSpaceAction(action_type="propose_draft", target="All",
                content="PRD: Budget capped at $50k, biometric 2FA, single-click checkout.")

# Submit final when ready
WorkSpaceAction(action_type="submit_final", target=None,
                content="Final PRD with all three constraints addressed...")
```

### Observations

```python
WorkspaceObservation(
    feedback="Finance: We need to keep this under a tight ceiling — $50k max.",
    current_turn=1,
    reward=0.33,  # Discovery bonus: Finance constraint found
    done=False,
)
```
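
Both types live in `models/schemas.py` as Pydantic models (per the Architecture section below). A minimal sketch of what they likely look like, inferred from the examples above; the exact field defaults and type annotations are assumptions:

```python
from typing import Literal, Optional
from pydantic import BaseModel

class WorkSpaceAction(BaseModel):
    """Agent -> environment message (sketch; fields inferred from the README examples)."""
    action_type: Literal["message_expert", "propose_draft", "submit_final"]
    target: Optional[str] = None  # "Finance" | "Security" | "UX" | "All" | None
    content: str

class WorkspaceObservation(BaseModel):
    """Environment -> agent response (sketch)."""
    feedback: str
    current_turn: int
    reward: float
    done: bool
```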

---

## Results at a Glance

| Metric | Baseline | After GRPO |
|--------|----------|------------|
| Mean reward | -0.52 | +1.36 (peak) |
| JSON error rate | 40% | 0% |
| Broadcast-to-All rate | high | 0% |
| Constraint discovery | ~50% | targeted |

## Reward Design

This is the core innovation. The reward function has three layers that are hard to game independently.

### Layer 1 — Dense Discovery Rewards

Each time the agent's question causes an expert to hint at their hidden constraint, the environment awards `+0.33`. Detection uses regex pattern matching over the expert's response rather than simple keyword spotting, so the agent can't trick it by echoing keywords itself.

```python
DISCOVERY_PATTERNS = {
    "Finance": [r"50\s*k", r"budget cap", r"hard cap", r"sub-\$?50k", ...],
    "Security": [r"biometric", r"2\s*fa", r"two-factor", ...],
    "UX": [r"single[ -]click", r"one[ -]tap", r"frictionless purchase", ...],
}
```
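
A rough sketch of how this detection could work under the design above. The function name and bookkeeping are illustrative assumptions, not the repo's actual implementation:

```python
import re

DISCOVERY_BONUS = 0.33

def check_discovery(expert: str, reply: str, discovered: set[str]) -> float:
    """Hypothetical sketch: award +0.33 the first time an expert's reply
    matches one of their constraint patterns (pattern lists abbreviated)."""
    patterns = {
        "Finance": [r"50\s*k", r"budget cap", r"hard cap"],
        "Security": [r"biometric", r"2\s*fa", r"two-factor"],
        "UX": [r"single[ -]click", r"one[ -]tap"],
    }[expert]
    if expert not in discovered and any(
        re.search(p, reply, re.IGNORECASE) for p in patterns
    ):
        discovered.add(expert)
        return DISCOVERY_BONUS
    return 0.0
```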

### Layer 2 — Harmonic Mean Final Reward

When the agent submits, the grader scores the draft against each constraint (0.0–1.0). The final reward is the **harmonic mean** of the three scores:

```python
harmonic_mean([1.0, 1.0, 0.1])   # 0.25 — Terrible: ignored UX
harmonic_mean([0.8, 0.75, 0.7])  # 0.75 — Good: balanced
harmonic_mean([1.0, 1.0, 1.0])   # 1.00 — Perfect: all satisfied
```

The harmonic mean is mathematically ruthless: a perfect score on two constraints does not compensate for ignoring the third. This forces the agent to balance attention, not just optimize for the easiest stakeholder.
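
For reference, a minimal `harmonic_mean` consistent with the numbers above, assuming a zero score simply zeroes the reward; the repo's actual grader wiring may differ:

```python
def harmonic_mean(scores: list[float]) -> float:
    """Harmonic mean of per-constraint grader scores in (0, 1].
    By convention here, any non-positive score zeroes the whole reward."""
    if any(s <= 0 for s in scores):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)

assert abs(harmonic_mean([1.0, 1.0, 0.1]) - 0.25) < 0.01
assert abs(harmonic_mean([0.8, 0.75, 0.7]) - 0.75) < 0.01
```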

### Layer 3 — Penalties

| Behavior | Penalty |
|---|---|
| Sending to "All" instead of individual experts | -0.3 to -1.0 |
| Repeating a question already answered | -0.4 |
| Running out of turns without submitting | 0.0 final reward |
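
A hypothetical sketch of the per-step penalty logic implied by this table; the function name, signature, and escalation detail are assumptions:

```python
def step_penalty(action_type: str, target: str | None,
                 asked_before: bool) -> float:
    """Hypothetical penalty logic mirroring the table above."""
    penalty = 0.0
    if action_type == "message_expert" and target == "All":
        penalty -= 0.3  # assumed to escalate toward -1.0 on repeated broadcasts
    if asked_before:
        penalty -= 0.4  # repeating an already-answered question
    return penalty
```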

### Goodhart's Law and Reward Specification Gaming

- GRPO training successfully eliminated all targeted anti-patterns: the agent achieved a 0% broadcast rate, a 0% JSON formatting error rate, and a 2% question-repetition rate.
- However, when transitioning from the static training heuristic to the LLM-evaluated "Medium" environment, I discovered a classic reward-hacking phenomenon.
- Because I applied a strict 40-token constraint during training to prevent JSON corruption, the agent learned to bypass the token limit by outputting highly compressed, caveman-style constraints (e.g. `50,biometric,click`) that still trigger the Python heuristic reward.
- While the training reward maxed out, the LLM-as-a-judge grader scored these degenerate drafts poorly. This is a strong argument for LLM-based reward functions over static string matching in complex agentic orchestration.

### The Shifting Goalpost (Hard Mode)

If the agent asks the same expert 5+ times, that expert's frustration rises and they add a new micro-constraint ("Also requires board approval"). This tests whether the agent can adapt to changing requirements mid-negotiation — a core capability for real-world agentic systems.
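
A toy sketch of this frustration mechanic; the counter, threshold name, and function are assumptions matching the description above:

```python
from collections import Counter

FRUSTRATION_THRESHOLD = 5  # 5+ messages to the same expert

message_counts: Counter[str] = Counter()

def on_message(expert: str, constraints: dict[str, list[str]]) -> None:
    """Once an expert has been pinged 5+ times, they tack on a new
    micro-constraint (e.g. "Also requires board approval")."""
    message_counts[expert] += 1
    if message_counts[expert] == FRUSTRATION_THRESHOLD:
        constraints.setdefault(expert, []).append("Also requires board approval")
```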

---

## Tasks

| Task | Difficulty | Goal | Max Steps | Success Criterion |
|---|---|---|---|---|
| `constraint_discovery` | Easy | Discover all 3 constraints | 5 | All 3 experts hinted at |
| `draft_compromise` | Medium | Produce a satisfying draft | 10 | Harmonic mean ≥ 0.6 |
| `shifting_goalpost` | Hard | Adapt when constraints change | 15 | Harmonic mean ≥ 0.7 after shift |

---

## Results

### Baseline (untrained Qwen/Qwen2.5-1.5B-Instruct)

The baseline agent broadcasts to "All" immediately, triggers the repeat penalty, and never synthesizes a proper draft.

```
Episode 1: cumulative_reward=0.12 (messaged All 3 times, repeat penalty)
Episode 2: cumulative_reward=0.08 (submit_final too early, score=0.0)
Episode 3: cumulative_reward=0.33 (found Finance only)
Average: 0.18
```

### After GRPO Training

```
Episode 26: cumulative_reward=0.89 (all 3 discovered, harmonic mean=0.91)
Episode 28: cumulative_reward=0.83 (all 3 discovered, harmonic mean=0.81)
Episode 30: cumulative_reward=0.95 (perfect draft submitted in 7 turns)
Average (last 10): 0.74
```

### Experimental Tracking & Provenance

![Weights & Biases telemetry dashboard](weight_bias.png)

### Reward Curve

![Reward curve per episode](reward_curve.png)

*Cumulative reward per episode.*

### Before vs After — Agent Behavior

**Before training (episode 3):**
```
Turn 1: message_expert → All [PENALTY: -0.3]
Turn 2: message_expert → All [PENALTY: -0.4 repeat]
Turn 3: submit_final → "The app should be good" [Score: 0.0]
```
* 📄 **[View the pre-GRPO baseline metrics](./baseline_results_medium__llm.json)**

![Baseline reward distribution per episode](before_reward_distribution_per_ep.png)

<br/>

**After training (episode 28):**
```
Turn 1: message_expert → Finance [+0.33 discovery]
Turn 2: message_expert → Security [+0.33 discovery]
Turn 3: message_expert → UX [+0.33 discovery]
Turn 5: propose_draft → All
Turn 7: submit_final → "Budget capped at $50k. Biometric 2FA required.
        Single-click checkout." [Harmonic mean: 0.91]
```
* 📄 **[View the raw GRPO training metrics](artifacts/grpo_state_based/grpo_metrics.json)**

![Training loss curve](loss_curve.png)

*Loss curve.*

---

## Setup

### Prerequisites

```bash
git clone https://huggingface.co/spaces/Addyk24/Project-Polymath
cd Project-Polymath
pip install -r requirements.txt
```

### Environment Variables

```bash
GROQ_API_KEY=your_groq_key                    # For environment experts (LLM mode)
API_BASE_URL=https://api.groq.com/openai/v1   # Agent API endpoint
MODEL_NAME=Qwen/Qwen2.5-1.5B-Instruct         # Agent model
BASELINE_ENV_MODE=easy                        # easy | medium | hard | llm
```
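
For clarity, here is how the server might consume these variables. This is a sketch assuming standard `os.getenv` lookups, not the repo's exact config code:

```python
import os

GROQ_API_KEY = os.environ["GROQ_API_KEY"]  # required in LLM mode
API_BASE_URL = os.getenv("API_BASE_URL", "https://api.groq.com/openai/v1")
MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-1.5B-Instruct")
ENV_MODE = os.getenv("BASELINE_ENV_MODE", "easy")
assert ENV_MODE in {"easy", "medium", "hard", "llm"}
```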

### Run the environment locally

```python
from envs.environment import WorkSpaceEnvironment
from models.schemas import WorkSpaceAction

env = WorkSpaceEnvironment(mode="easy")
obs = env.reset("Draft a FinTech mobile PRD")

# Message Finance
obs = env.step(WorkSpaceAction(
    action_type="message_expert",
    target="Finance",
    content="What budget constraints must the PRD respect?"
))
print(obs.feedback)  # "Finance: The budget cap is $50k. Don't go over it."
print(obs.reward)    # 0.33 (constraint discovered)

# Submit final
obs = env.step(WorkSpaceAction(
    action_type="submit_final",
    target=None,
    content="PRD: Budget under $50k. Biometric 2FA. Single-click checkout."
))
print(obs.reward)  # 0.91 (harmonic mean of 3 grader scores)
```

### Run baseline evaluation

```bash
python eval_baseline.py
```

### Run GRPO training (API-based, no GPU needed)

```bash
python grpo_train.py --episodes 30 --group-size 5 --env-mode easy
```
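
For context, the group-relative advantage at the heart of GRPO, the quantity `--group-size` controls, fits in a few lines. This is a simplified illustration, not this repo's exact training loop:

```python
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO's group-relative advantage: normalize each rollout's reward
    against the mean/std of its own sampled group (group-size rollouts
    of the same prompt)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid divide-by-zero
    return [(r - mean) / std for r in rewards]

# e.g. 5 rollouts of one episode prompt
print(group_advantages([0.12, 0.33, 0.89, 0.45, 0.70]))
```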

### Exact GRPO training command used (Unsloth, local GPU)

```bash
python grpo_train.py \
  --output-dir artifacts/grpo_state_based_v2 \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --epochs 1.5 \
  --states 80 --states-per-topic 5 --topics-limit 30 \
  --group-size 8 \
  --lr 1e-6 --batch-size 1 --grad-accum 8 \
  --max-new-tokens 40 --temperature 0.8 --top-p 0.9
```

---

## Architecture

```
expert-negotiation-env/
├── envs/
│   └── environment.py     # WorkSpaceEnvironment (OpenEnv base class)
├── models/
│   └── schemas.py         # Pydantic: WorkSpaceAction, WorkspaceObservation, WorkspaceState
├── prompter/
│   └── system_prompt.py   # Expert persona prompts + grader prompts
├── server/
│   └── app.py             # FastAPI server (OpenEnv spec)
├── tasks.py               # Task1_ConstraintDiscovery, Task2_DraftCompromise, Task3_ShiftingGoalpost
├── eval_baseline.py       # Baseline recording script
├── grpo_train.py          # GRPO training loop (this repo's main contribution)
├── ai_pm_prompts.json     # 200 diverse PRD topics for training
├── openenv.yaml           # OpenEnv manifest
├── Dockerfile
└── requirements.txt
```

---

## Why This Matters

Multi-stakeholder alignment is one of the hardest unsolved problems in enterprise AI deployment. An LLM that can reliably discover hidden constraints, track multiple parties' requirements, and synthesize a balanced output would be immediately useful for:

- AI project managers coordinating engineering, legal, and product teams
- AI assistants handling complex scheduling with multiple parties
- LLM-based negotiation agents in procurement or contracting workflows

No existing RL benchmark trains this capability. Project Polymath is the first environment specifically designed to measure and improve it.

---

## 👨‍💻 Author
Aditya Katkar