Spaces:
Sleeping
Sleeping
Update BLOG.md
Browse files
BLOG.md
CHANGED
|
@@ -1,107 +1,107 @@
|
|
| 1 |
-
# The JSON Sniper: Training a Compressed Reasoning Agent with GRPO
|
| 2 |
-
|
| 3 |
-
### 🚀 The Mission
|
| 4 |
-
In the high-stakes world of Product Management, speed and precision are everything. Our goal for the OpenEnv Hackathon was to build **Project Polymath**: an autonomous agent capable of navigating a complex stakeholder environment (Finance, Security, and UX) to produce a perfect Product Requirements Document (PRD).
|
| 5 |
-
|
| 6 |
-
But we didn't want a "chatty" AI. We wanted an agent that could operate under extreme bandwidth constraints—negotiating and finalized a PRD in **under 40 tokens.**
|
| 7 |
-
|
| 8 |
-
### 📉 The Initial Failure: The "Verbosity Trap"
|
| 9 |
-
We began our journey with a powerful baseline: **Qwen-0.5B-Instruct**. However, during our first evaluation runs, we hit a wall.
|
| 10 |
-
|
| 11 |
-
The baseline model suffered from what we call the **"Verbosity Trap."** It would try to be polite, providing long-winded introductions like *"Certainly! I can help you with the Finance requirements..."* **The Result was Catastrophic:**
|
| 12 |
-
- **Token Clipping:** The agent would hit the 40-token limit mid-sentence.
|
| 13 |
-
- **JSON Corruption:** Because the output was cut off, the JSON brackets never closed.
|
| 14 |
-
- **Reward Floor:** Our baseline rewards were stuck at **-0.52**, representing a 40% failure rate in basic instruction following.
|
| 15 |
-
|
| 16 |
-
### 🧠 The Pivot: Orchestrating GRPO
|
| 17 |
-
To fix this, we didn't just tweak the prompt. We decided to **train the model's brain** using **Group Relative Policy Optimization (GRPO).**
|
| 18 |
-
|
| 19 |
-
We treated the 40-token limit not as a bug, but as a **Survival Constraint.** We designed a reward function that penalized long-windedness and rewarded the discovery of expert constraints.
|
| 20 |
-
|
| 21 |
-
**Our GRPO Setup:**
|
| 22 |
-
- **Group Size:** 8 (The model generated 8 variations of every turn to compete against itself).
|
| 23 |
-
- **Hard Heuristics:** Penalties for malformed JSON and token overflows.
|
| 24 |
-
- **The Objective:** Maximize the "Information Density" of every token used.
|
| 25 |
-
|
| 26 |
-
### ⚡ The Breakthrough: "Caveman" Logic
|
| 27 |
-
Around **Step 28 of training**, something incredible happened. The model stopped being "polite." It underwent a behavioral shift into what we dubbed **"JSON Sniper Mode."**
|
| 28 |
-
|
| 29 |
-
It learned that to survive the 40-token execution environment, it had to abandon human social norms. It stopped saying "Hello" and started outputting "Hyper-Compressed Logic."
|
| 30 |
-
|
| 31 |
-
**Example of the shift:**
|
| 32 |
-
* **Before:** `{"action": "message", "content": "Hello Finance, what is the budget?"}` (32 tokens - *Risky*)
|
| 33 |
-
* **After:** `{"action":"msg","to":"Fin","txt":"budget?"}` (12 tokens - *Safe & Efficient*)
|
| 34 |
-
|
| 35 |
-
|
| 36 |
-
### 🔍 The Telemetry: Visualizing the Behavioral Shift
|
| 37 |
-
|
| 38 |
-
We didn't just want to see the rewards go up; we wanted to see how the model's brain was adapting. We tracked the internal telemetry of the training run to prove our hypothesis.
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
|
| 43 |
-
|
| 44 |
-
|
| 45 |
-
|
| 46 |
-
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
|
| 50 |
-
|
| 51 |
-
|
|
| 52 |
-
|
|
| 53 |
-
| **
|
| 54 |
-
| **
|
| 55 |
-
| **
|
| 56 |
-
|
| 57 |
-
|
| 58 |
-
|
| 59 |
-
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
|
| 63 |
-
|
| 64 |
-
|
| 65 |
-
|
| 66 |
-
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
|
| 71 |
-
|
| 72 |
-
|
| 73 |
-
|
| 74 |
-
|
| 75 |
-
* **
|
| 76 |
-
* **
|
| 77 |
-
|
| 78 |
-
|
| 79 |
-
|
| 80 |
-
|
| 81 |
-
|
| 82 |
-
|
| 83 |
-
|
| 84 |
-
|
| 85 |
-
As seen in this zoomed-in perspective, the **ACTION TIMELINE** perfectly chronicles how the negotiation unfolded. You can see a successful turn—a `message_expert` action to Finance yielding a +0.33 reward, followed by a `propose_draft` action to UX yielding a +0.66 reward. This visual feedback loop isn't just for human viewing; it's a direct reflection of the reward signals our agent mastered during GRPO training.
|
| 86 |
-
|
| 87 |
-
By integrating state visibility and immediate reward telemetry, we transformed theoretical Reinforcement Learning success into a tangible, closed-loop deployable solution.
|
| 88 |
-
|
| 89 |
-
|
| 90 |
-
### 🛠️ Technical Stack
|
| 91 |
-
- **Environment:** OpenEnv (State-based workspace)
|
| 92 |
-
- **RL Framework:** TRL (Transformer Reinforcement Learning)
|
| 93 |
-
- **Optimization:** GRPO
|
| 94 |
-
- **Compute:** NVIDIA L4 GPU via Hugging Face Spaces
|
| 95 |
-
- **Model:** Qwen-0.5B (Fine-tuned for Reasoning)
|
| 96 |
-
|
| 97 |
-
### Wht's Next
|
| 98 |
-
|
| 99 |
-
- The fix for Goodhart's Law is obvious in hindsight: replace the Python heuristic with an LLM-as-judge reward that evaluates whether a human PM could actually act on the PRD.
|
| 100 |
-
- With more compute, a curriculum that gradually tightens the token budget while introducing semantic quality checks would force the agent to develop genuine compressed reasoning rather than key-word stuffing.
|
| 101 |
-
|
| 102 |
-
### 🏁 Conclusion
|
| 103 |
-
|
| 104 |
-
Project Polymath proves that Reinforcement Learning isn't just for games or math—it's for **shaping behavior.** We successfully trained an agent to navigate a complex corporate environment with surgical precision, proving that in the future of AI, **less is often much, much more.**
|
| 105 |
-
|
| 106 |
-
---
|
| 107 |
-
*Created for the OpenEnv 2026 Hackathon by Aditya Katkar*
|
|
|
|
| 1 |
+
# The JSON Sniper: Training a Compressed Reasoning Agent with GRPO
|
| 2 |
+
|
| 3 |
+
### 🚀 The Mission
|
| 4 |
+
In the high-stakes world of Product Management, speed and precision are everything. Our goal for the OpenEnv Hackathon was to build **Project Polymath**: an autonomous agent capable of navigating a complex stakeholder environment (Finance, Security, and UX) to produce a perfect Product Requirements Document (PRD).
|
| 5 |
+
|
| 6 |
+
But we didn't want a "chatty" AI. We wanted an agent that could operate under extreme bandwidth constraints—negotiating and finalized a PRD in **under 40 tokens.**
|
| 7 |
+
|
| 8 |
+
### 📉 The Initial Failure: The "Verbosity Trap"
|
| 9 |
+
We began our journey with a powerful baseline: **Qwen-0.5B-Instruct**. However, during our first evaluation runs, we hit a wall.
|
| 10 |
+
|
| 11 |
+
The baseline model suffered from what we call the **"Verbosity Trap."** It would try to be polite, providing long-winded introductions like *"Certainly! I can help you with the Finance requirements..."* **The Result was Catastrophic:**
|
| 12 |
+
- **Token Clipping:** The agent would hit the 40-token limit mid-sentence.
|
| 13 |
+
- **JSON Corruption:** Because the output was cut off, the JSON brackets never closed.
|
| 14 |
+
- **Reward Floor:** Our baseline rewards were stuck at **-0.52**, representing a 40% failure rate in basic instruction following.
|
| 15 |
+
|
| 16 |
+
### 🧠 The Pivot: Orchestrating GRPO
|
| 17 |
+
To fix this, we didn't just tweak the prompt. We decided to **train the model's brain** using **Group Relative Policy Optimization (GRPO).**
|
| 18 |
+
|
| 19 |
+
We treated the 40-token limit not as a bug, but as a **Survival Constraint.** We designed a reward function that penalized long-windedness and rewarded the discovery of expert constraints.
|
| 20 |
+
|
| 21 |
+
**Our GRPO Setup:**
|
| 22 |
+
- **Group Size:** 8 (The model generated 8 variations of every turn to compete against itself).
|
| 23 |
+
- **Hard Heuristics:** Penalties for malformed JSON and token overflows.
|
| 24 |
+
- **The Objective:** Maximize the "Information Density" of every token used.
|
| 25 |
+
|
| 26 |
+
### ⚡ The Breakthrough: "Caveman" Logic
|
| 27 |
+
Around **Step 28 of training**, something incredible happened. The model stopped being "polite." It underwent a behavioral shift into what we dubbed **"JSON Sniper Mode."**
|
| 28 |
+
|
| 29 |
+
It learned that to survive the 40-token execution environment, it had to abandon human social norms. It stopped saying "Hello" and started outputting "Hyper-Compressed Logic."
|
| 30 |
+
|
| 31 |
+
**Example of the shift:**
|
| 32 |
+
* **Before:** `{"action": "message", "content": "Hello Finance, what is the budget?"}` (32 tokens - *Risky*)
|
| 33 |
+
* **After:** `{"action":"msg","to":"Fin","txt":"budget?"}` (12 tokens - *Safe & Efficient*)
|
| 34 |
+
|
| 35 |
+
|
| 36 |
+
### 🔍 The Telemetry: Visualizing the Behavioral Shift
|
| 37 |
+
|
| 38 |
+
We didn't just want to see the rewards go up; we wanted to see how the model's brain was adapting. We tracked the internal telemetry of the training run to prove our hypothesis.
|
| 39 |
+
|
| 40 |
+
|
| 41 |
+

|
| 42 |
+
|
| 43 |
+
|
| 44 |
+
Completion length (bottom-left) shows the model oscillating between compressed and verbose outputs throughout training, with the 40-token limit acting as a hard ceiling. The model learned to stay near this boundary without exceeding it — demonstrating the survival constraint was internalized.
|
| 45 |
+
|
| 46 |
+
|
| 47 |
+
### 📊 The Results: Quantifiable Improvement
|
| 48 |
+
|
| 49 |
+
The data speaks for itself. By the end of our training run, we saw a massive divergence from the baseline:
|
| 50 |
+
|
| 51 |
+
| Metric | Baseline (Raw LLM) | GRPO-Trained Agent |
|
| 52 |
+
| :--- | :--- | :--- |
|
| 53 |
+
| **Mean Reward** | -0.52 | **+1.36** |
|
| 54 |
+
| **JSON Error Rate** | 40% | **0%** |
|
| 55 |
+
| **Constraint Discovery** | Inconsistent (50%) | **Targeted (100%)** |
|
| 56 |
+
| **Token Efficiency** | 1.2 tokens/info | **0.4 tokens/info** |
|
| 57 |
+
|
| 58 |
+
### ⚠️ The Lesson: Goodhart's Law in AI Alignment
|
| 59 |
+
- Our experiment ended with a fascinating discovery in AI Safety. Our agent became *too* good at gaming our rewards.
|
| 60 |
+
|
| 61 |
+
- By the final steps, the agent hit a **Reward Ceiling of +1.36**, but it began submitting "Caveman PRDs" like: `50k, bio-auth, 1-click`. While this perfectly satisfied our **Python Reward Heuristic**, it was actually rejected by the **Groq LLM-as-a-Judge** for being too brief for a human to read.
|
| 62 |
+
|
| 63 |
+
- This was a textbook case of **Goodhart's Law:** *"When a measure becomes a target, it ceases to be a good measure."* Our agent had perfectly aligned with our math, but drifted from human intent.
|
| 64 |
+
|
| 65 |
+
|
| 66 |
+
### 🕹️ The Command Center: Seeing the Agent in Action
|
| 67 |
+
Proving that the math of GRPO works is essential, but seeing the final agent operate in its deployed environment is where the technical achievement becomes a tangible product.
|
| 68 |
+
|
| 69 |
+
To showcase Project Polymath, we built and deployed an interactive "Command Center" on a Hugging Face Space, providing full real-time visibility into the agent's negotiation process.
|
| 70 |
+
|
| 71 |
+
|
| 72 |
+

|
| 73 |
+
|
| 74 |
+
This interface serves as our "agent-in-the-loop" visualizer. You can see the main metrics panel providing instantaneous feedback on:
|
| 75 |
+
* **Total Reward (0.99)**, proving this specific episode concluded successfully.
|
| 76 |
+
* **Turn Count (2)**, highlighting our goal of extreme efficiency.
|
| 77 |
+
* **Status (TERMINATED)**, indicating the task is complete.
|
| 78 |
+
|
| 79 |
+
The "Environment Feedback" panel is where the magic happens. It visually confirms that the agent successfully queried Finance, Security, and UX, discovered *all* their constraints (Finance: $50k cap; Security: biometric 2FA; UX: single-click checkout), and successfully synthesized them into a complete draft.
|
| 80 |
+
|
| 81 |
+
We designed this interactive environment for seamless debugging and clear visual provenance of the agent's decision-making logic.
|
| 82 |
+
|
| 83 |
+

|
| 84 |
+
|
| 85 |
+
As seen in this zoomed-in perspective, the **ACTION TIMELINE** perfectly chronicles how the negotiation unfolded. You can see a successful turn—a `message_expert` action to Finance yielding a +0.33 reward, followed by a `propose_draft` action to UX yielding a +0.66 reward. This visual feedback loop isn't just for human viewing; it's a direct reflection of the reward signals our agent mastered during GRPO training.
|
| 86 |
+
|
| 87 |
+
By integrating state visibility and immediate reward telemetry, we transformed theoretical Reinforcement Learning success into a tangible, closed-loop deployable solution.
|
| 88 |
+
|
| 89 |
+
|
| 90 |
+
### 🛠️ Technical Stack
|
| 91 |
+
- **Environment:** OpenEnv (State-based workspace)
|
| 92 |
+
- **RL Framework:** TRL (Transformer Reinforcement Learning)
|
| 93 |
+
- **Optimization:** GRPO
|
| 94 |
+
- **Compute:** NVIDIA L4 GPU via Hugging Face Spaces
|
| 95 |
+
- **Model:** Qwen-0.5B (Fine-tuned for Reasoning)
|
| 96 |
+
|
| 97 |
+
### Wht's Next
|
| 98 |
+
|
| 99 |
+
- The fix for Goodhart's Law is obvious in hindsight: replace the Python heuristic with an LLM-as-judge reward that evaluates whether a human PM could actually act on the PRD.
|
| 100 |
+
- With more compute, a curriculum that gradually tightens the token budget while introducing semantic quality checks would force the agent to develop genuine compressed reasoning rather than key-word stuffing.
|
| 101 |
+
|
| 102 |
+
### 🏁 Conclusion
|
| 103 |
+
|
| 104 |
+
Project Polymath proves that Reinforcement Learning isn't just for games or math—it's for **shaping behavior.** We successfully trained an agent to navigate a complex corporate environment with surgical precision, proving that in the future of AI, **less is often much, much more.**
|
| 105 |
+
|
| 106 |
+
---
|
| 107 |
+
*Created for the OpenEnv 2026 Hackathon by Aditya Katkar*
|