Addyk24 commited on
Commit
3db85fe
·
verified ·
1 Parent(s): 7173be4

Update BLOG.md

Browse files
Files changed (1) hide show
  1. BLOG.md +107 -107
BLOG.md CHANGED
@@ -1,107 +1,107 @@
1
- # The JSON Sniper: Training a Compressed Reasoning Agent with GRPO
2
-
3
- ### 🚀 The Mission
4
- In the high-stakes world of Product Management, speed and precision are everything. Our goal for the OpenEnv Hackathon was to build **Project Polymath**: an autonomous agent capable of navigating a complex stakeholder environment (Finance, Security, and UX) to produce a perfect Product Requirements Document (PRD).
5
-
6
- But we didn't want a "chatty" AI. We wanted an agent that could operate under extreme bandwidth constraints—negotiating and finalized a PRD in **under 40 tokens.**
7
-
8
- ### 📉 The Initial Failure: The "Verbosity Trap"
9
- We began our journey with a powerful baseline: **Qwen-0.5B-Instruct**. However, during our first evaluation runs, we hit a wall.
10
-
11
- The baseline model suffered from what we call the **"Verbosity Trap."** It would try to be polite, providing long-winded introductions like *"Certainly! I can help you with the Finance requirements..."* **The Result was Catastrophic:**
12
- - **Token Clipping:** The agent would hit the 40-token limit mid-sentence.
13
- - **JSON Corruption:** Because the output was cut off, the JSON brackets never closed.
14
- - **Reward Floor:** Our baseline rewards were stuck at **-0.52**, representing a 40% failure rate in basic instruction following.
15
-
16
- ### 🧠 The Pivot: Orchestrating GRPO
17
- To fix this, we didn't just tweak the prompt. We decided to **train the model's brain** using **Group Relative Policy Optimization (GRPO).**
18
-
19
- We treated the 40-token limit not as a bug, but as a **Survival Constraint.** We designed a reward function that penalized long-windedness and rewarded the discovery of expert constraints.
20
-
21
- **Our GRPO Setup:**
22
- - **Group Size:** 8 (The model generated 8 variations of every turn to compete against itself).
23
- - **Hard Heuristics:** Penalties for malformed JSON and token overflows.
24
- - **The Objective:** Maximize the "Information Density" of every token used.
25
-
26
- ### ⚡ The Breakthrough: "Caveman" Logic
27
- Around **Step 28 of training**, something incredible happened. The model stopped being "polite." It underwent a behavioral shift into what we dubbed **"JSON Sniper Mode."**
28
-
29
- It learned that to survive the 40-token execution environment, it had to abandon human social norms. It stopped saying "Hello" and started outputting "Hyper-Compressed Logic."
30
-
31
- **Example of the shift:**
32
- * **Before:** `{"action": "message", "content": "Hello Finance, what is the budget?"}` (32 tokens - *Risky*)
33
- * **After:** `{"action":"msg","to":"Fin","txt":"budget?"}` (12 tokens - *Safe & Efficient*)
34
-
35
-
36
- ### 🔍 The Telemetry: Visualizing the Behavioral Shift
37
-
38
- We didn't just want to see the rewards go up; we wanted to see how the model's brain was adapting. We tracked the internal telemetry of the training run to prove our hypothesis.
39
-
40
- <img width="2076" height="1473" alt="weight_bias" src="https://github.com/user-attachments/assets/e041bd1a-fd74-48ce-9712-d9dfdee12d83" />
41
-
42
-
43
- Completion length (bottom-left) shows the model oscillating between compressed and verbose outputs throughout training, with the 40-token limit acting as a hard ceiling. The model learned to stay near this boundary without exceeding it — demonstrating the survival constraint was internalized.
44
-
45
-
46
- ### 📊 The Results: Quantifiable Improvement
47
-
48
- The data speaks for itself. By the end of our training run, we saw a massive divergence from the baseline:
49
-
50
- | Metric | Baseline (Raw LLM) | GRPO-Trained Agent |
51
- | :--- | :--- | :--- |
52
- | **Mean Reward** | -0.52 | **+1.36** |
53
- | **JSON Error Rate** | 40% | **0%** |
54
- | **Constraint Discovery** | Inconsistent (50%) | **Targeted (100%)** |
55
- | **Token Efficiency** | 1.2 tokens/info | **0.4 tokens/info** |
56
-
57
- ### ⚠️ The Lesson: Goodhart's Law in AI Alignment
58
- - Our experiment ended with a fascinating discovery in AI Safety. Our agent became *too* good at gaming our rewards.
59
-
60
- - By the final steps, the agent hit a **Reward Ceiling of +1.36**, but it began submitting "Caveman PRDs" like: `50k, bio-auth, 1-click`. While this perfectly satisfied our **Python Reward Heuristic**, it was actually rejected by the **Groq LLM-as-a-Judge** for being too brief for a human to read.
61
-
62
- - This was a textbook case of **Goodhart's Law:** *"When a measure becomes a target, it ceases to be a good measure."* Our agent had perfectly aligned with our math, but drifted from human intent.
63
-
64
-
65
- ### 🕹️ The Command Center: Seeing the Agent in Action
66
- Proving that the math of GRPO works is essential, but seeing the final agent operate in its deployed environment is where the technical achievement becomes a tangible product.
67
-
68
- To showcase Project Polymath, we built and deployed an interactive "Command Center" on a Hugging Face Space, providing full real-time visibility into the agent's negotiation process.
69
-
70
- <img width="569" height="443" alt="space_ui_1" src="https://github.com/user-attachments/assets/2a95c852-7eb2-4c43-b4c9-5ccd6335acc3" />
71
-
72
-
73
- This interface serves as our "agent-in-the-loop" visualizer. You can see the main metrics panel providing instantaneous feedback on:
74
- * **Total Reward (0.99)**, proving this specific episode concluded successfully.
75
- * **Turn Count (2)**, highlighting our goal of extreme efficiency.
76
- * **Status (TERMINATED)**, indicating the task is complete.
77
-
78
- The "Environment Feedback" panel is where the magic happens. It visually confirms that the agent successfully queried Finance, Security, and UX, discovered *all* their constraints (Finance: $50k cap; Security: biometric 2FA; UX: single-click checkout), and successfully synthesized them into a complete draft.
79
-
80
- We designed this interactive environment for seamless debugging and clear visual provenance of the agent's decision-making logic.
81
-
82
- <img width="524" height="290" alt="space_ui_2" src="https://github.com/user-attachments/assets/20fdd7f8-3009-4db0-a0b5-b73b6282a1e8" />
83
-
84
-
85
- As seen in this zoomed-in perspective, the **ACTION TIMELINE** perfectly chronicles how the negotiation unfolded. You can see a successful turn—a `message_expert` action to Finance yielding a +0.33 reward, followed by a `propose_draft` action to UX yielding a +0.66 reward. This visual feedback loop isn't just for human viewing; it's a direct reflection of the reward signals our agent mastered during GRPO training.
86
-
87
- By integrating state visibility and immediate reward telemetry, we transformed theoretical Reinforcement Learning success into a tangible, closed-loop deployable solution.
88
-
89
-
90
- ### 🛠️ Technical Stack
91
- - **Environment:** OpenEnv (State-based workspace)
92
- - **RL Framework:** TRL (Transformer Reinforcement Learning)
93
- - **Optimization:** GRPO
94
- - **Compute:** NVIDIA L4 GPU via Hugging Face Spaces
95
- - **Model:** Qwen-0.5B (Fine-tuned for Reasoning)
96
-
97
- ### Wht's Next
98
-
99
- - The fix for Goodhart's Law is obvious in hindsight: replace the Python heuristic with an LLM-as-judge reward that evaluates whether a human PM could actually act on the PRD.
100
- - With more compute, a curriculum that gradually tightens the token budget while introducing semantic quality checks would force the agent to develop genuine compressed reasoning rather than key-word stuffing.
101
-
102
- ### 🏁 Conclusion
103
-
104
- Project Polymath proves that Reinforcement Learning isn't just for games or math—it's for **shaping behavior.** We successfully trained an agent to navigate a complex corporate environment with surgical precision, proving that in the future of AI, **less is often much, much more.**
105
-
106
- ---
107
- *Created for the OpenEnv 2026 Hackathon by Aditya Katkar*
 
1
+ # The JSON Sniper: Training a Compressed Reasoning Agent with GRPO
2
+
3
+ ### 🚀 The Mission
4
+ In the high-stakes world of Product Management, speed and precision are everything. Our goal for the OpenEnv Hackathon was to build **Project Polymath**: an autonomous agent capable of navigating a complex stakeholder environment (Finance, Security, and UX) to produce a perfect Product Requirements Document (PRD).
5
+
6
+ But we didn't want a "chatty" AI. We wanted an agent that could operate under extreme bandwidth constraints—negotiating and finalized a PRD in **under 40 tokens.**
7
+
8
+ ### 📉 The Initial Failure: The "Verbosity Trap"
9
+ We began our journey with a powerful baseline: **Qwen-0.5B-Instruct**. However, during our first evaluation runs, we hit a wall.
10
+
11
+ The baseline model suffered from what we call the **"Verbosity Trap."** It would try to be polite, providing long-winded introductions like *"Certainly! I can help you with the Finance requirements..."* **The Result was Catastrophic:**
12
+ - **Token Clipping:** The agent would hit the 40-token limit mid-sentence.
13
+ - **JSON Corruption:** Because the output was cut off, the JSON brackets never closed.
14
+ - **Reward Floor:** Our baseline rewards were stuck at **-0.52**, representing a 40% failure rate in basic instruction following.
15
+
16
+ ### 🧠 The Pivot: Orchestrating GRPO
17
+ To fix this, we didn't just tweak the prompt. We decided to **train the model's brain** using **Group Relative Policy Optimization (GRPO).**
18
+
19
+ We treated the 40-token limit not as a bug, but as a **Survival Constraint.** We designed a reward function that penalized long-windedness and rewarded the discovery of expert constraints.
20
+
21
+ **Our GRPO Setup:**
22
+ - **Group Size:** 8 (The model generated 8 variations of every turn to compete against itself).
23
+ - **Hard Heuristics:** Penalties for malformed JSON and token overflows.
24
+ - **The Objective:** Maximize the "Information Density" of every token used.
25
+
26
+ ### ⚡ The Breakthrough: "Caveman" Logic
27
+ Around **Step 28 of training**, something incredible happened. The model stopped being "polite." It underwent a behavioral shift into what we dubbed **"JSON Sniper Mode."**
28
+
29
+ It learned that to survive the 40-token execution environment, it had to abandon human social norms. It stopped saying "Hello" and started outputting "Hyper-Compressed Logic."
30
+
31
+ **Example of the shift:**
32
+ * **Before:** `{"action": "message", "content": "Hello Finance, what is the budget?"}` (32 tokens - *Risky*)
33
+ * **After:** `{"action":"msg","to":"Fin","txt":"budget?"}` (12 tokens - *Safe & Efficient*)
34
+
35
+
36
+ ### 🔍 The Telemetry: Visualizing the Behavioral Shift
37
+
38
+ We didn't just want to see the rewards go up; we wanted to see how the model's brain was adapting. We tracked the internal telemetry of the training run to prove our hypothesis.
39
+
40
+
41
+ ![weight_bias](weight_bias.png)
42
+
43
+
44
+ Completion length (bottom-left) shows the model oscillating between compressed and verbose outputs throughout training, with the 40-token limit acting as a hard ceiling. The model learned to stay near this boundary without exceeding it — demonstrating the survival constraint was internalized.
45
+
46
+
47
+ ### 📊 The Results: Quantifiable Improvement
48
+
49
+ The data speaks for itself. By the end of our training run, we saw a massive divergence from the baseline:
50
+
51
+ | Metric | Baseline (Raw LLM) | GRPO-Trained Agent |
52
+ | :--- | :--- | :--- |
53
+ | **Mean Reward** | -0.52 | **+1.36** |
54
+ | **JSON Error Rate** | 40% | **0%** |
55
+ | **Constraint Discovery** | Inconsistent (50%) | **Targeted (100%)** |
56
+ | **Token Efficiency** | 1.2 tokens/info | **0.4 tokens/info** |
57
+
58
+ ### ⚠️ The Lesson: Goodhart's Law in AI Alignment
59
+ - Our experiment ended with a fascinating discovery in AI Safety. Our agent became *too* good at gaming our rewards.
60
+
61
+ - By the final steps, the agent hit a **Reward Ceiling of +1.36**, but it began submitting "Caveman PRDs" like: `50k, bio-auth, 1-click`. While this perfectly satisfied our **Python Reward Heuristic**, it was actually rejected by the **Groq LLM-as-a-Judge** for being too brief for a human to read.
62
+
63
+ - This was a textbook case of **Goodhart's Law:** *"When a measure becomes a target, it ceases to be a good measure."* Our agent had perfectly aligned with our math, but drifted from human intent.
64
+
65
+
66
+ ### 🕹️ The Command Center: Seeing the Agent in Action
67
+ Proving that the math of GRPO works is essential, but seeing the final agent operate in its deployed environment is where the technical achievement becomes a tangible product.
68
+
69
+ To showcase Project Polymath, we built and deployed an interactive "Command Center" on a Hugging Face Space, providing full real-time visibility into the agent's negotiation process.
70
+
71
+
72
+ ![space_ui_1](space_ui_1.png)
73
+
74
+ This interface serves as our "agent-in-the-loop" visualizer. You can see the main metrics panel providing instantaneous feedback on:
75
+ * **Total Reward (0.99)**, proving this specific episode concluded successfully.
76
+ * **Turn Count (2)**, highlighting our goal of extreme efficiency.
77
+ * **Status (TERMINATED)**, indicating the task is complete.
78
+
79
+ The "Environment Feedback" panel is where the magic happens. It visually confirms that the agent successfully queried Finance, Security, and UX, discovered *all* their constraints (Finance: $50k cap; Security: biometric 2FA; UX: single-click checkout), and successfully synthesized them into a complete draft.
80
+
81
+ We designed this interactive environment for seamless debugging and clear visual provenance of the agent's decision-making logic.
82
+
83
+ ![space_ui_2](space_ui_2.png)
84
+
85
+ As seen in this zoomed-in perspective, the **ACTION TIMELINE** perfectly chronicles how the negotiation unfolded. You can see a successful turn—a `message_expert` action to Finance yielding a +0.33 reward, followed by a `propose_draft` action to UX yielding a +0.66 reward. This visual feedback loop isn't just for human viewing; it's a direct reflection of the reward signals our agent mastered during GRPO training.
86
+
87
+ By integrating state visibility and immediate reward telemetry, we transformed theoretical Reinforcement Learning success into a tangible, closed-loop deployable solution.
88
+
89
+
90
+ ### 🛠️ Technical Stack
91
+ - **Environment:** OpenEnv (State-based workspace)
92
+ - **RL Framework:** TRL (Transformer Reinforcement Learning)
93
+ - **Optimization:** GRPO
94
+ - **Compute:** NVIDIA L4 GPU via Hugging Face Spaces
95
+ - **Model:** Qwen-0.5B (Fine-tuned for Reasoning)
96
+
97
+ ### Wht's Next
98
+
99
+ - The fix for Goodhart's Law is obvious in hindsight: replace the Python heuristic with an LLM-as-judge reward that evaluates whether a human PM could actually act on the PRD.
100
+ - With more compute, a curriculum that gradually tightens the token budget while introducing semantic quality checks would force the agent to develop genuine compressed reasoning rather than key-word stuffing.
101
+
102
+ ### 🏁 Conclusion
103
+
104
+ Project Polymath proves that Reinforcement Learning isn't just for games or math—it's for **shaping behavior.** We successfully trained an agent to navigate a complex corporate environment with surgical precision, proving that in the future of AI, **less is often much, much more.**
105
+
106
+ ---
107
+ *Created for the OpenEnv 2026 Hackathon by Aditya Katkar*