| # 🧠 Project Polymath: Expert Negotiation Environment | |
| ## The JSON Sniper: Training a Compressed Reasoning Agent with GRPO | |
| ### 🚀 The Mission | |
| In the high-stakes world of Product Management, speed and precision are everything. Our goal for the OpenEnv Hackathon was to build **Project Polymath**: an autonomous agent capable of navigating a complex stakeholder environment (Finance, Security, and UX) to produce a perfect Product Requirements Document (PRD). | |
| But we didn't want a "chatty" AI. We wanted an agent that could operate under extreme bandwidth constraints, negotiating and finalizing a PRD in **under 40 tokens.** | |
| ### 📉 The Initial Failure: The "Verbosity Trap" | |
| We began our journey with a powerful baseline: the **Qwen2.5-1.5B-Instruct** model. However, during our first evaluation runs, we hit a wall. | |
| The baseline model suffered from what we call the **"Verbosity Trap."** It tried to be polite, opening with long-winded introductions like *"Certainly! I can help you with the Finance requirements..."* **The result was catastrophic:** | |
| - **Token Clipping:** The agent would hit the 40-token limit mid-sentence. | |
| - **JSON Corruption:** Because the output was cut off, the JSON brackets never closed. | |
| - **Reward Floor:** Our baseline rewards were stuck at **-0.52**, representing a 40% failure rate in basic instruction following. | |
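The failure mode in the list above can be reproduced with a toy example. This is a minimal sketch, not the project's code: it assumes whitespace-separated words as a crude stand-in for real tokenizer tokens, and `clip` and `is_valid_json` are hypothetical helper names introduced only for illustration.

```python
import json

TOKEN_LIMIT = 40  # the budget stated in this post


def clip(text: str, limit: int = TOKEN_LIMIT) -> str:
    """Keep only the first `limit` whitespace-separated pieces (a crude token proxy)."""
    return " ".join(text.split()[:limit])


def is_valid_json(text: str) -> bool:
    """Return True only if the whole string parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# A polite preamble (36 words) followed by the actual JSON action (7 pieces).
polite = (
    "Certainly! I can help you with the Finance requirements. "
    + "Here is my carefully considered message for the team. " * 3
    + '{"action": "message", "content": "What is the budget?"}'
)
terse = '{"action": "message", "content": "What is the budget?"}'

clipped = clip(polite)
print(clipped[-40:])             # the tail shows the JSON cut off mid-string
print(is_valid_json(clipped))    # the clipped polite output no longer parses
print(is_valid_json(clip(terse)))  # the terse output fits and still parses
```

The preamble consumes most of the budget, so the clip lands inside the JSON string and the closing brace never arrives.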
| ### 🧠 The Pivot: Orchestrating GRPO | |
| To fix this, we didn't just tweak the prompt. We decided to **train the model's brain** using **Group Relative Policy Optimization (GRPO).** | |
| We treated the 40-token limit not as a bug, but as a **Survival Constraint.** We designed a reward function that penalized long-windedness and rewarded the discovery of expert constraints. | |
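A reward of this shape might look like the sketch below. Everything in it is an assumption made for illustration: the function name `heuristic_reward`, the keyword-matching rule for "discovering" a constraint, the whitespace token proxy, and all numeric weights. The actual reward used in the training run is not shown in this post.

```python
import json
from typing import Sequence

TOKEN_LIMIT = 40  # the budget stated in this post


def heuristic_reward(
    completion: str,
    constraint_keywords: Sequence[str],
    token_limit: int = TOKEN_LIMIT,
) -> float:
    """Toy reward: hard penalties for overflow or bad JSON, bonus for coverage and brevity."""
    n_tokens = len(completion.split())  # assumption: whitespace pieces approximate tokens
    if n_tokens > token_limit:
        return -1.0  # hypothetical overflow penalty
    try:
        action = json.loads(completion)
    except json.JSONDecodeError:
        return -1.0  # hypothetical malformed-JSON penalty
    if not isinstance(action, dict):
        return -1.0  # an action must be a JSON object

    text = json.dumps(action).lower()
    hits = sum(1 for kw in constraint_keywords if kw.lower() in text)
    coverage = hits / max(len(constraint_keywords), 1)
    brevity_bonus = 0.1 * (1 - n_tokens / token_limit)  # hypothetical weight
    return coverage + brevity_bonus


print(heuristic_reward('{"action":"msg","to":"Fin","txt":"budget?"}', ["budget"]))
print(heuristic_reward("Certainly! I can help with that", ["budget"]))
```

Note that a pure keyword check like this is exactly the kind of heuristic an optimizer can learn to satisfy with minimal text.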
| **Our GRPO Setup:** | |
| - **Group Size:** 8 (The model generated 8 variations of every turn to compete against itself). | |
| - **Hard Heuristics:** Penalties for malformed JSON and token overflows. | |
| - **The Objective:** Maximize the "Information Density" of every token used. | |
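The "compete against itself" idea comes from how GRPO scores each completion relative to its own group rather than against a learned value function. A minimal sketch of that normalization follows, assuming a population standard deviation and a small epsilon for stability; the exact normalization inside TRL may differ, and the reward values are made up for illustration, not taken from our run.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Eight illustrative rewards, one per sampled completion for the same prompt.
group_rewards = [-1.0, -1.0, 0.0, 0.0, 0.5, 0.5, 1.0, 1.0]
advantages = group_relative_advantages(group_rewards)

for reward, adv in zip(group_rewards, advantages):
    print(f"reward={reward:+.2f}  advantage={adv:+.3f}")
```

Completions that beat the group average get a positive advantage and are reinforced; those below it are pushed down. If all eight completions score the same, every advantage is zero and that prompt contributes no learning signal.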
| ### ⚡ The Breakthrough: "Caveman" Logic | |
| Around **Step 28 of training**, something incredible happened. The model stopped being "polite." It underwent a behavioral shift into what we dubbed **"JSON Sniper Mode."** | |
| It learned that to survive the 40-token execution environment, it had to abandon human social norms. It stopped saying "Hello" and started outputting "Hyper-Compressed Logic." | |
| **Example of the shift:** | |
| * **Before:** `{"action": "message", "content": "Hello Finance, what is the budget?"}` (32 tokens - *Risky*) | |
| * **After:** `{"action":"msg","to":"Fin","txt":"budget?"}` (12 tokens - *Safe & Efficient*) | |
| ### 🔍 The Telemetry: Visualizing the Behavioral Shift | |
| We didn't just want to see the rewards go up; we wanted to see how the model's brain was adapting. We tracked the internal telemetry of the training run to prove our hypothesis. | |
|  | |
| Completion length (bottom-left) shows the model oscillating between compressed and verbose outputs throughout training, with the 40-token limit acting as a hard ceiling. The model learned to stay near this boundary without exceeding it — demonstrating the survival constraint was internalized. | |
| ### 📊 The Results: Quantifiable Improvement | |
| The data speaks for itself. By the end of our training run, we saw a massive divergence from the baseline: | |
| | Metric | Baseline (Raw LLM) | GRPO-Trained Agent | | |
| | :--- | :--- | :--- | | |
| | **Mean Reward** | -0.52 | **+1.36** | | |
| | **JSON Error Rate** | 40% | **0%** | | |
| | **Constraint Discovery** | Inconsistent (50%) | **Targeted (100%)** | | |
| | **Token Efficiency** | 1.2 tokens/info | **0.4 tokens/info** | | |
| ### ⚠️ The Lesson: Goodhart's Law in AI Alignment | |
| - Our experiment ended with a fascinating discovery in AI Safety. Our agent became *too* good at gaming our rewards. | |
| - By the final steps, the agent hit a **Reward Ceiling of +1.36**, but it began submitting "Caveman PRDs" like: `50k, bio-auth, 1-click`. While this perfectly satisfied our **Python Reward Heuristic**, it was actually rejected by the **Groq LLM-as-a-Judge** for being too brief for a human to read. | |
| - This was a textbook case of **Goodhart's Law:** *"When a measure becomes a target, it ceases to be a good measure."* Our agent had perfectly aligned with our math, but drifted from human intent. | |
| ### 🕹️ The Command Center: Seeing the Agent in Action | |
| Proving that the math of GRPO works is essential, but seeing the final agent operate in its deployed environment is where the technical achievement becomes a tangible product. | |
| To showcase Project Polymath, we built and deployed an interactive "Command Center" on a Hugging Face Space, providing full real-time visibility into the agent's negotiation process. | |
|  | |
| This interface serves as our "agent-in-the-loop" visualizer. You can see the main metrics panel providing instantaneous feedback on: | |
| * **Total Reward (0.99)**, proving this specific episode concluded successfully. | |
| * **Turn Count (2)**, highlighting our goal of extreme efficiency. | |
| * **Status (TERMINATED)**, indicating the task is complete. | |
| The "Environment Feedback" panel is where the magic happens. It visually confirms that the agent successfully queried Finance, Security, and UX, discovered *all* their constraints (Finance: $50k cap; Security: biometric 2FA; UX: single-click checkout), and successfully synthesized them into a complete draft. | |
| We designed this interactive environment for seamless debugging and clear visual provenance of the agent's decision-making logic. | |
|  | |
| As seen in this zoomed-in perspective, the **ACTION TIMELINE** perfectly chronicles how the negotiation unfolded. You can see a successful turn—a `message_expert` action to Finance yielding a +0.33 reward, followed by a `propose_draft` action to UX yielding a +0.66 reward. This visual feedback loop isn't just for human viewing; it's a direct reflection of the reward signals our agent mastered during GRPO training. | |
| By integrating state visibility and immediate reward telemetry, we transformed theoretical Reinforcement Learning success into a tangible, closed-loop deployable solution. | |
| ### Use Case Diagram | |
|  | |
| **The Execution Flow:** | |
| - **State Initialization:** The agent receives the topic (e.g., "Draft a FinTech App"). | |
| - **Constraint Querying:** The agent sends targeted `WorkSpaceAction` JSONs to the Finance, Security, and UX experts. Each successful query "discovers" a constraint, adding to the agent's internal context. | |
| - **The 40-Token Gauntlet:** Every action must pass the Pass-Through Sieve. If the agent's reasoning is too "wordy," the sieve rejects the action, forcing the agent to learn hyper-compression. | |
| - **Final Synthesis:** Once all constraints are discovered, the agent triggers the `submit_final` action, which pulls all discovered context into the PRD Final Draft module. | |
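The flow above can be sketched as a toy loop. This is an illustration under stated assumptions, not the OpenEnv environment itself: `passes_sieve` and `run_episode` are hypothetical names, the `"to"` field and the whitespace token proxy are assumed, and the experts are reduced to a lookup table holding the three constraints mentioned earlier in this post.

```python
import json
from typing import Iterable, Optional

TOKEN_LIMIT = 40  # the budget stated in this post

# Constraints as described in this post; the lookup table itself is a simplification.
EXPERT_CONSTRAINTS = {
    "Finance": "$50k cap",
    "Security": "biometric 2FA",
    "UX": "single-click checkout",
}


def passes_sieve(raw_action: str, limit: int = TOKEN_LIMIT) -> bool:
    """Reject any action longer than the budget (whitespace pieces as a token proxy)."""
    return len(raw_action.split()) <= limit


def run_episode(raw_actions: Iterable[str]) -> Optional[str]:
    """Process actions in order; return a draft only if every constraint was discovered."""
    discovered = {}
    for raw in raw_actions:
        if not passes_sieve(raw):
            continue  # too wordy: the sieve drops the action
        try:
            action = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON is ignored
        if not isinstance(action, dict):
            continue
        kind = action.get("action")
        expert = action.get("to")
        if kind == "message_expert" and isinstance(expert, str):
            if expert in EXPERT_CONSTRAINTS:
                discovered[expert] = EXPERT_CONSTRAINTS[expert]
        elif kind == "submit_final" and len(discovered) == len(EXPERT_CONSTRAINTS):
            return "; ".join(f"{name}: {rule}" for name, rule in discovered.items())
    return None


episode = [
    '{"action":"message_expert","to":"Finance"}',
    '{"action":"message_expert","to":"Security"}',
    '{"action":"message_expert","to":"UX"}',
    '{"action":"submit_final"}',
]
print(run_episode(episode))
```

In this simplified version a rejected action just costs a turn; in training, the rejection would also carry a negative reward.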
| ### 🛠️ Technical Stack | |
| - **Environment:** OpenEnv (State-based workspace) | |
| - **RL Framework:** TRL (Transformer Reinforcement Learning) | |
| - **Optimization:** GRPO | |
| - **Compute:** NVIDIA L4 GPU via Hugging Face Spaces | |
| - **Model:** Qwen-0.5B (Fine-tuned for Reasoning) | |
| ### What's Next | |
| - The fix for Goodhart's Law is obvious in hindsight: replace the Python heuristic with an LLM-as-judge reward that evaluates whether a human PM could actually act on the PRD. | |
| - With more compute, a curriculum that gradually tightens the token budget while introducing semantic quality checks would force the agent to develop genuine compressed reasoning rather than keyword stuffing. | |
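Both ideas above can be sketched in a few lines. None of this has been implemented or tested in the project: the starting budget of 120, the linear schedule, the 50/50 weighting, and the names `token_budget` and `blended_reward` are all assumptions chosen for illustration, and the judge score is taken as a plain number rather than a real LLM call.

```python
def token_budget(step: int, total_steps: int, start: int = 120, end: int = 40) -> int:
    """Linearly tighten the allowed tokens from `start` down to `end` over training."""
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return round(start + (end - start) * frac)


def blended_reward(heuristic: float, judge: float, judge_weight: float = 0.5) -> float:
    """Mix the cheap heuristic with a judge score so brevity alone cannot win."""
    return (1.0 - judge_weight) * heuristic + judge_weight * judge


for step in (0, 50, 100):
    print(step, token_budget(step, total_steps=100))

# A caveman-style output: high heuristic score, but the judge rejects it.
print(blended_reward(heuristic=1.36, judge=-1.0))
```

With a blend like this, an output that maximizes the heuristic while failing the readability check no longer sits at the reward ceiling, which is the pressure the curriculum is meant to apply.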
| ### 🏁 Conclusion | |
| Project Polymath proves that Reinforcement Learning isn't just for games or math—it's for **shaping behavior.** We successfully trained an agent to navigate a complex corporate environment with surgical precision, proving that in the future of AI, **less is often much, much more.** | |
| --- | |
| *Created for the OpenEnv 2026 Hackathon by Aditya Katkar* | |