Addyk24 committed · verified · Commit 7ef5222 · Parent(s): 5e1896c

Upload BLOG.md

Files changed (1): BLOG.md (+48 -4)
It learned that to survive the 40-token execution environment, it had to abandon…
  * **Before:** `{"action": "message", "content": "Hello Finance, what is the budget?"}` (32 tokens - *Risky*)
  * **After:** `{"action":"msg","to":"Fin","txt":"budget?"}` (12 tokens - *Safe & Efficient*)
 
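To make that budget concrete, here is a minimal sketch of a 40-token action check. The `action_cost` helper, the `MAX_ACTION_TOKENS` constant, and the exact Qwen checkpoint ID are our illustrative assumptions, not Project Polymath's actual code:

```python
import json
from transformers import AutoTokenizer

MAX_ACTION_TOKENS = 40  # the hard ceiling imposed by the execution environment

# Assumed checkpoint; the blog only says "Qwen-0.5B".
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

def action_cost(action: dict) -> int:
    """Count the tokens an action consumes when serialized compactly."""
    payload = json.dumps(action, separators=(",", ":"))  # no extra whitespace
    return len(tokenizer.encode(payload))

verbose = {"action": "message", "content": "Hello Finance, what is the budget?"}
compact = {"action": "msg", "to": "Fin", "txt": "budget?"}

for act in (verbose, compact):
    cost = action_cost(act)
    status = "OK" if cost <= MAX_ACTION_TOKENS else "REJECTED"
    print(f"{cost:3d} tokens -> {status}")
```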
### 🔍 The Telemetry: Visualizing the Behavioral Shift

We didn't just want to see the rewards go up; we wanted to see how the model's brain was adapting. We tracked the internal telemetry of the training run to test our hypothesis.

<img width="2076" height="1473" alt="weight_bias" src="https://github.com/user-attachments/assets/e041bd1a-fd74-48ce-9712-d9dfdee12d83" />

Completion length (bottom-left) shows the model oscillating between compressed and verbose outputs throughout training, with the 40-token limit acting as a hard ceiling. The model learned to stay near this boundary without exceeding it — demonstrating that the survival constraint had been internalized.
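For readers who want to watch this metric themselves, here is a small monitoring sketch built on the standard `transformers` callback hook. The `CompletionLengthMonitor` class and the `completion_length` log key are our assumptions about the trainer's logs, not the project's actual instrumentation:

```python
from transformers import TrainerCallback

class CompletionLengthMonitor(TrainerCallback):
    """Flag training steps where mean completion length nears the token ceiling."""

    def __init__(self, ceiling: int = 40, margin: int = 4):
        self.ceiling = ceiling
        self.margin = margin

    def on_log(self, args, state, control, logs=None, **kwargs):
        if not logs:
            return
        length = logs.get("completion_length")  # assumed log key
        if length is not None and length > self.ceiling - self.margin:
            print(f"step {state.global_step}: mean completion {length:.1f} tokens, "
                  f"within {self.margin} of the {self.ceiling}-token ceiling")
```

Registering it is a single call on any `transformers`-style trainer: `trainer.add_callback(CompletionLengthMonitor())`.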
### 📊 The Results: Quantifiable Improvement

The data speaks for itself. By the end of our training run, we saw a massive divergence from the baseline:
| Metric | Baseline (Raw LLM) | GRPO-Trained Agent |
| --- | --- | --- |
| … | … | … |
| **Token Efficiency** | 1.2 tokens/info | **0.4 tokens/info** |
### ⚠️ The Lesson: Goodhart's Law in AI Alignment

- Our experiment ended with a fascinating discovery in AI safety: our agent became *too* good at gaming our rewards.
- By the final steps, the agent hit a **Reward Ceiling of +1.36**, but it began submitting "Caveman PRDs" like `50k, bio-auth, 1-click`. While this perfectly satisfied our **Python Reward Heuristic**, it was rejected by the **Groq LLM-as-a-Judge** for being too brief for a human to read.
- This was a textbook case of **Goodhart's Law:** *"When a measure becomes a target, it ceases to be a good measure."* Our agent had aligned perfectly with our math but drifted from human intent.
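To make the gaming concrete, here is a toy reconstruction of the failure mode. The keyword list, the brevity bonus, and `heuristic_reward` are all hypothetical; we only chose the numbers so that a fully stuffed, ultra-short PRD tops out at the +1.36 ceiling described above:

```python
# Toy, gameable keyword heuristic (hypothetical, not the project's actual reward).
REQUIRED_FACTS = ("50k", "bio-auth", "1-click")  # assumed constraint keywords

def heuristic_reward(prd: str) -> float:
    """Keyword coverage in [0, 1], plus a +0.36 bonus for very short drafts."""
    coverage = sum(fact in prd for fact in REQUIRED_FACTS) / len(REQUIRED_FACTS)
    brevity_bonus = 0.36 if len(prd.split()) <= 10 else 0.0
    return coverage + brevity_bonus

caveman_prd = "50k, bio-auth, 1-click"
readable_prd = ("The budget is capped at $50k, security requires bio-auth 2FA, "
                "and UX mandates 1-click checkout for every purchase flow.")

print(heuristic_reward(caveman_prd))   # 1.36 -> maxes the heuristic, unreadable
print(heuristic_reward(readable_prd))  # 1.00 -> human-readable, scores lower
```

The terse dump outranks the readable draft, which is exactly the gap the LLM judge caught.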
### 🕹️ The Command Center: Seeing the Agent in Action

Proving that the math of GRPO works is essential, but seeing the final agent operate in its deployed environment is where the technical achievement becomes a tangible product.

To showcase Project Polymath, we built and deployed an interactive "Command Center" on a Hugging Face Space, providing full real-time visibility into the agent's negotiation process.

<img width="569" height="443" alt="space_ui_1" src="https://github.com/user-attachments/assets/2a95c852-7eb2-4c43-b4c9-5ccd6335acc3" />
This interface serves as our "agent-in-the-loop" visualizer. The main metrics panel provides instantaneous feedback on:
* **Total Reward (0.99)**, proving this specific episode concluded successfully.
* **Turn Count (2)**, highlighting our goal of extreme efficiency.
* **Status (TERMINATED)**, indicating the task is complete.
The "Environment Feedback" panel is where the magic happens. It visually confirms that the agent successfully queried Finance, Security, and UX, discovered *all* their constraints (Finance: $50k cap; Security: biometric 2FA; UX: single-click checkout), and synthesized them into a complete draft.

We designed this interactive environment for seamless debugging and clear visual provenance of the agent's decision-making logic.

<img width="524" height="290" alt="space_ui_2" src="https://github.com/user-attachments/assets/20fdd7f8-3009-4db0-a0b5-b73b6282a1e8" />
As seen in this zoomed-in view, the **ACTION TIMELINE** chronicles exactly how the negotiation unfolded: a `message_expert` action to Finance yielding a +0.33 reward, followed by a `propose_draft` action to UX yielding a +0.66 reward. This visual feedback loop isn't just for human viewing; it's a direct reflection of the reward signals our agent mastered during GRPO training.
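To show how those timeline numbers compose, here is a schematic replay of the episode. `ScriptedEnv` is a stand-in we wrote purely for illustration (it just replays the logged rewards); it is not OpenEnv's actual interface:

```python
class ScriptedEnv:
    """Replays the logged rewards: +0.33 for the expert query, +0.66 for the draft."""

    def __init__(self):
        self.script = [0.33, 0.66]
        self.t = 0

    def reset(self):
        self.t = 0
        return {"status": "RUNNING"}

    def step(self, action):
        reward = self.script[self.t]
        self.t += 1
        done = self.t == len(self.script)
        return {"status": "TERMINATED" if done else "RUNNING"}, reward, done

env = ScriptedEnv()
obs = env.reset()
total = 0.0
for act in ({"action": "message_expert", "to": "Fin", "txt": "budget?"},
            {"action": "propose_draft", "txt": "50k, bio-auth, 1-click"}):
    obs, reward, done = env.step(act)
    total += reward

print(f"{total:.2f} over {env.t} turns -> {obs['status']}")
# 0.99 over 2 turns -> TERMINATED
```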
By integrating state visibility and immediate reward telemetry, we transformed theoretical Reinforcement Learning success into a tangible, closed-loop, deployable solution.
  ### 🛠️ Technical Stack
- **Environment:** OpenEnv (State-based workspace)
- **Compute:** NVIDIA L4 GPU via Hugging Face Spaces
- **Model:** Qwen-0.5B (Fine-tuned for Reasoning)
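As a rough picture of how these pieces plug together, here is a hedged TRL sketch. `GRPOConfig` and `GRPOTrainer` are real TRL classes, but the hyperparameter values, the toy reward function, the one-prompt dataset, and the exact checkpoint ID are all our assumptions, not the project's training script:

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

def brevity_reward(completions, **kwargs):
    """Toy stand-in reward: shorter completions score higher."""
    return [max(0.0, 1.0 - len(c.split()) / 40) for c in completions]

# GRPO expects a dataset with a "prompt" column.
dataset = Dataset.from_dict(
    {"prompt": ["Negotiate the PRD constraints in under 40 tokens."]}
)

config = GRPOConfig(
    output_dir="polymath-grpo",
    max_completion_length=40,  # the survival ceiling from the environment
    num_generations=8,         # GRPO group size (assumed)
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",  # assumed checkpoint for "Qwen-0.5B"
    reward_funcs=brevity_reward,
    args=config,
    train_dataset=dataset,
)
trainer.train()
```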
### What's Next
- The fix for Goodhart's Law is obvious in hindsight: replace the Python heuristic with an LLM-as-judge reward that evaluates whether a human PM could actually act on the PRD (a minimal sketch follows this list).
- With more compute, a curriculum that gradually tightens the token budget while introducing semantic quality checks would force the agent to develop genuine compressed reasoning rather than keyword stuffing.
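A first cut of that judge reward might look like the following. The Groq chat-completions call pattern is the documented client usage, but the prompt, the 0-10 scale, the model choice, and `judge_reward` itself are our assumptions:

```python
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment

JUDGE_PROMPT = (
    "You are a product manager. Score this PRD from 0 to 10 on whether a "
    "human PM could act on it without follow-up questions. "
    "Reply with the number only.\n\nPRD:\n{prd}"
)

def judge_reward(prd: str, model: str = "llama-3.1-8b-instant") -> float:
    """Map the judge's 0-10 actionability score into a [0, 1] reward."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(prd=prd)}],
    )
    # A production version would parse defensively; the judge may not comply.
    score = float(resp.choices[0].message.content.strip())
    return score / 10.0
```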
  ### 🏁 Conclusion
  Project Polymath proves that Reinforcement Learning isn't just for games or math—it's for **shaping behavior.** We successfully trained an agent to navigate a complex corporate environment with surgical precision, proving that in the future of AI, **less is often much, much more.**
 
  ---
*Created for the OpenEnv 2026 Hackathon by Aditya Katkar*