	# 🧠 Project Polymath: Expert Negotiation Environment

	## The JSON Sniper: Training a Compressed Reasoning Agent with GRPO

	### 🚀 The Mission
	In the high-stakes world of Product Management, speed and precision are everything. Our goal for the OpenEnv Hackathon was to build Project Polymath: an autonomous agent capable of navigating a complex stakeholder environment (Finance, Security, and UX) to produce a perfect Product Requirements Document (PRD).

	But we didn't want a "chatty" AI. We wanted an agent that could operate under extreme bandwidth constraints, negotiating and finalizing a PRD in under 40 tokens.

	### 📉 The Initial Failure: The "Verbosity Trap"
	We began our journey with a powerful baseline: the Qwen2.5-1.5B-Instruct model. However, during our first evaluation runs, we hit a wall.

	The baseline model suffered from what we call the "Verbosity Trap." It would try to be polite, providing long-winded introductions like "Certainly! I can help you with the Finance requirements..." The result was catastrophic:
	- Token Clipping: The agent would hit the 40-token limit mid-sentence.
	- JSON Corruption: Because the output was cut off, the JSON brackets never closed.
	- Reward Floor: Our baseline rewards were stuck at -0.52, representing a 40% failure rate in basic instruction following.
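To make the clipping failure concrete, here is a minimal sketch of how a hard token cap can turn a valid JSON action into an unparseable one. It assumes a crude whitespace tokenizer and a small demo limit of 10; the real environment uses the model's tokenizer and the 40-token limit, and the helper names `clip_tokens` and `is_valid_json` are hypothetical.

```python
import json


def clip_tokens(text: str, limit: int) -> str:
    """Keep the first `limit` whitespace-separated tokens (a stand-in for a real tokenizer)."""
    return " ".join(text.split()[:limit])


def is_valid_json(text: str) -> bool:
    """Return True if `text` parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# A polite, wordy action versus the compressed form from later in this post.
verbose = (
    '{"action": "message", "content": "Certainly! I can help you with '
    'the Finance requirements. Could you share the budget?"}'
)
compact = '{"action":"msg","to":"Fin","txt":"budget?"}'

DEMO_LIMIT = 10  # assumption: small limit so the whitespace tokenizer shows the effect

clipped_verbose = clip_tokens(verbose, DEMO_LIMIT)
clipped_compact = clip_tokens(compact, DEMO_LIMIT)
```

The verbose action is cut off inside the `content` string, so the closing quote and brace never arrive, while the compact action fits under the cap and still parses.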

	### 🧠 The Pivot: Orchestrating GRPO
	To fix this, we didn't just tweak the prompt. We decided to train the model's brain using Group Relative Policy Optimization (GRPO).

	We treated the 40-token limit not as a bug, but as a Survival Constraint. We designed a reward function that penalized long-windedness and rewarded the discovery of expert constraints.
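A reward of this shape might look roughly like the sketch below. Only the 40-token limit comes from our setup as described here; the penalty value, the per-discovery bonus, the per-token cost, and the function name `heuristic_reward` are illustrative assumptions, not the exact numbers used in training.

```python
import json

TOKEN_LIMIT = 40         # limit described in this post
INVALID_PENALTY = -1.0   # assumed penalty for malformed JSON or overflow
DISCOVERY_BONUS = 0.33   # assumed bonus per newly discovered constraint
LENGTH_COST = 0.005      # assumed cost per token used


def heuristic_reward(completion: str, n_tokens: int, new_constraints: int) -> float:
    """Score one completion: hard penalties first, then bonus minus length cost."""
    try:
        json.loads(completion)
    except json.JSONDecodeError:
        return INVALID_PENALTY   # malformed JSON
    if n_tokens > TOKEN_LIMIT:
        return INVALID_PENALTY   # token overflow
    return DISCOVERY_BONUS * new_constraints - LENGTH_COST * n_tokens
```

With these assumed weights, two valid actions that discover the same constraint are ranked by length, so the shorter one always scores higher.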

	Our GRPO Setup:
	- Group Size: 8 (The model generated 8 variations of every turn to compete against itself).
	- Hard Heuristics: Penalties for malformed JSON and token overflows.
	- The Objective: Maximize the "Information Density" of every token used.
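The "compete against itself" idea can be sketched as the group-relative advantage at the heart of GRPO: each of the 8 completions is scored against the mean and spread of its own group. This is a simplified illustration in plain Python, using the population standard deviation and a small epsilon as assumptions; it is not the TRL implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against its group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for one group of 8 completions of the same prompt.
group_rewards = [-1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
advantages = group_relative_advantages(group_rewards)
```

Completions that beat their siblings get a positive advantage and are reinforced; if all 8 score the same, every advantage is zero and the group contributes no learning signal.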

	### ⚡ The Breakthrough: "Caveman" Logic
	Around Step 28 of training, something incredible happened. The model stopped being "polite." It underwent a behavioral shift into what we dubbed "JSON Sniper Mode."

	It learned that to survive the 40-token execution environment, it had to abandon human social norms. It stopped saying "Hello" and started outputting "Hyper-Compressed Logic."

	Example of the shift:
	* Before: `{"action": "message", "content": "Hello Finance, what is the budget?"}` (32 tokens - Risky)
	* After: `{"action":"msg","to":"Fin","txt":"budget?"}` (12 tokens - Safe & Efficient)
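Part of the saving is purely mechanical and can be reproduced with Python's standard `json` module: short keys plus separators without spaces. The sketch below compares character lengths only as a rough proxy; actual token counts depend on the model's tokenizer, and `serialize_compact` is a hypothetical helper.

```python
import json

verbose_action = {"action": "message", "content": "Hello Finance, what is the budget?"}
compact_action = {"action": "msg", "to": "Fin", "txt": "budget?"}


def serialize_compact(action: dict) -> str:
    """Serialize without the default spaces after ',' and ':'."""
    return json.dumps(action, separators=(",", ":"))


verbose_text = json.dumps(verbose_action)        # default separators, as in "Before"
compact_text = serialize_compact(compact_action)  # matches the "After" string above
```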


	### 🔍 The Telemetry: Visualizing the Behavioral Shift

	We didn't just want to see the rewards go up; we wanted to see how the model's brain was adapting. We tracked the internal telemetry of the training run to prove our hypothesis.


	![weight_bias](weight_bias.png)


	Completion length (bottom-left) shows the model oscillating between compressed and verbose outputs throughout training, with the 40-token limit acting as a hard ceiling. The model learned to stay near this boundary without exceeding it — demonstrating the survival constraint was internalized.


	### 📊 The Results: Quantifiable Improvement

	The data speaks for itself. By the end of our training run, we saw a massive divergence from the baseline:

	| Metric | Baseline (Raw LLM) | GRPO-Trained Agent |
	| :--- | :--- | :--- |
	| Mean Reward | -0.52 | +1.36 |
	| JSON Error Rate | 40% | 0% |
	| Constraint Discovery | Inconsistent (50%) | Targeted (100%) |
	| Token Efficiency | 1.2 tokens/info | 0.4 tokens/info |

	### ⚠️ The Lesson: Goodhart's Law in AI Alignment
	- Our experiment ended with a fascinating discovery in AI Safety. Our agent became too good at gaming our rewards.

	- By the final steps, the agent hit a Reward Ceiling of +1.36, but it began submitting "Caveman PRDs" like: `50k, bio-auth, 1-click`. While this perfectly satisfied our Python Reward Heuristic, it was actually rejected by the Groq LLM-as-a-Judge for being too brief for a human to read.

	- This was a textbook case of Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." Our agent had perfectly aligned with our math, but drifted from human intent.
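The gap between the two evaluators can be illustrated with a toy example. The sketch assumes a keyword-matching heuristic and a crude word-count check standing in for the LLM judge; both functions and the keyword list are hypothetical and much simpler than the real reward code or the Groq judge.

```python
REQUIRED_KEYWORDS = ("50k", "bio-auth", "1-click")  # assumed keyword list


def keyword_reward(prd: str) -> float:
    """Fraction of required keywords present: the kind of proxy that can be gamed."""
    text = prd.lower()
    return sum(kw in text for kw in REQUIRED_KEYWORDS) / len(REQUIRED_KEYWORDS)


def looks_readable(prd: str, min_words: int = 12) -> bool:
    """Crude stand-in for a judge: a PRD needs enough words to be actionable."""
    return len(prd.split()) >= min_words


caveman_prd = "50k, bio-auth, 1-click"
readable_prd = (
    "Budget is capped at 50k. Login requires bio-auth as a second factor. "
    "Checkout must be 1-click for returning users."
)
```

Both drafts earn the full keyword score, but only the second passes the readability check, which is the measure-versus-target split described above.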


	### 🕹️ The Command Center: Seeing the Agent in Action
	Proving that the math of GRPO works is essential, but seeing the final agent operate in its deployed environment is where the technical achievement becomes a tangible product.

	To showcase Project Polymath, we built and deployed an interactive "Command Center" on a Hugging Face Space, providing full real-time visibility into the agent's negotiation process.


	![space_ui_1](space_ui_1.png)

	This interface serves as our "agent-in-the-loop" visualizer. You can see the main metrics panel providing instantaneous feedback on:
	* Total Reward (0.99), proving this specific episode concluded successfully.
	* Turn Count (2), highlighting our goal of extreme efficiency.
	* Status (TERMINATED), indicating the task is complete.

	The "Environment Feedback" panel is where the magic happens. It visually confirms that the agent successfully queried Finance, Security, and UX, discovered all their constraints (Finance: $50k cap; Security: biometric 2FA; UX: single-click checkout), and successfully synthesized them into a complete draft.

	We designed this interactive environment for seamless debugging and clear visual provenance of the agent's decision-making logic.

	![space_ui_2](space_ui_2.png)

	As seen in this zoomed-in perspective, the ACTION TIMELINE perfectly chronicles how the negotiation unfolded. You can see a successful turn—a `message_expert` action to Finance yielding a +0.33 reward, followed by a `propose_draft` action to UX yielding a +0.66 reward. This visual feedback loop isn't just for human viewing; it's a direct reflection of the reward signals our agent mastered during GRPO training.
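The numbers in the two screenshots line up: the per-action rewards in the timeline sum to the Total Reward shown in the metrics panel. A small sketch of that bookkeeping, using a hypothetical `TimelineEntry` record rather than the Space's actual data model:

```python
from dataclasses import dataclass


@dataclass
class TimelineEntry:
    action: str
    target: str
    reward: float


# Values as shown in the ACTION TIMELINE screenshot.
timeline = [
    TimelineEntry("message_expert", "Finance", 0.33),
    TimelineEntry("propose_draft", "UX", 0.66),
]

# Round to two decimals to avoid floating-point noise in the sum.
total_reward = round(sum(entry.reward for entry in timeline), 2)
turn_count = len(timeline)
```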

	By integrating state visibility and immediate reward telemetry, we transformed theoretical Reinforcement Learning success into a tangible, closed-loop deployable solution.

	### Use Case Diagram

	![use-case-diagram](Use_Case_diagram.png)


	The Execution Flow:

	- State Initialization: The agent receives the topic (e.g., "Draft a FinTech App").
	- Constraint Querying: The agent sends targeted `WorkSpaceAction` JSONs to the Finance, Security, and UX experts. Each successful query "discovers" a constraint, adding to the agent's internal context.
	- The 40-Token Gauntlet: Every action must pass the Pass-Through Sieve. If the agent's reasoning is too "wordy," the sieve rejects the action, forcing the agent to learn hyper-compression.
	- Final Synthesis: Once all constraints are discovered, the agent triggers the `submit_final` action, which pulls all discovered context into the PRD Final Draft module.
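The flow above can be sketched as a toy state machine. This is not the OpenEnv environment itself: the class layout, field names, return strings, and the choice to treat exactly 40 tokens as passing are all assumptions made for illustration. Only the action names, the three experts, and their constraints come from this post.

```python
from dataclasses import dataclass, field
from typing import Optional

TOKEN_LIMIT = 40
EXPERT_CONSTRAINTS = {
    "Finance": "$50k cap",
    "Security": "biometric 2FA",
    "UX": "single-click checkout",
}


@dataclass
class WorkSpaceAction:
    kind: str          # "message_expert" or "submit_final"
    target: str = ""   # expert name for message_expert
    n_tokens: int = 0  # token count of the raw model output


@dataclass
class Workspace:
    topic: str
    discovered: dict = field(default_factory=dict)
    final_draft: Optional[str] = None

    def passes_sieve(self, action: WorkSpaceAction) -> bool:
        """The 40-token gauntlet: over-budget actions never reach the experts."""
        return action.n_tokens <= TOKEN_LIMIT

    def step(self, action: WorkSpaceAction) -> str:
        if not self.passes_sieve(action):
            return "rejected: over token limit"
        if action.kind == "message_expert" and action.target in EXPERT_CONSTRAINTS:
            self.discovered[action.target] = EXPERT_CONSTRAINTS[action.target]
            return f"discovered: {action.target}"
        if action.kind == "submit_final":
            if len(self.discovered) < len(EXPERT_CONSTRAINTS):
                return "rejected: constraints missing"
            # Final synthesis: pull every discovered constraint into the draft.
            self.final_draft = "; ".join(f"{k}: {v}" for k, v in self.discovered.items())
            return "terminated"
        return "rejected: unknown action"
```

A short episode would call `step` once per expert with a compact action and then issue `submit_final`; any action over the limit is rejected before it can change the workspace state.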


	### 🛠️ Technical Stack
	- Environment: OpenEnv (State-based workspace)
	- RL Framework: TRL (Transformer Reinforcement Learning)
	- Optimization: GRPO
	- Compute: NVIDIA L4 GPU via Hugging Face Spaces
	- Model: Qwen-0.5B (Fine-tuned for Reasoning)

	### What's Next

	- The fix for Goodhart's Law is obvious in hindsight: replace the Python heuristic with an LLM-as-judge reward that evaluates whether a human PM could actually act on the PRD.
	- With more compute, a curriculum that gradually tightens the token budget while introducing semantic quality checks would force the agent to develop genuine compressed reasoning rather than keyword stuffing.
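One simple way to express such a curriculum is a linear schedule that shrinks the allowed budget over training. The sketch below is a hypothetical starting point: the 120-token starting budget and the linear shape are assumptions, and only the 40-token end point comes from this project.

```python
def token_budget(step: int, total_steps: int, start: int = 120, end: int = 40) -> int:
    """Linearly tighten the token budget from `start` to `end` across training."""
    if total_steps <= 1:
        return end
    # Clamp progress to [0, 1] so out-of-range steps stay within the schedule.
    progress = min(max(step / (total_steps - 1), 0.0), 1.0)
    return round(start + (end - start) * progress)


# Budget at each step of a hypothetical 100-step run.
schedule = [token_budget(s, 100) for s in range(100)]
```

The semantic quality check would then be applied at every stage, so the agent cannot trade readability for brevity as the budget shrinks.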

	### 🏁 Conclusion

	Project Polymath proves that Reinforcement Learning isn't just for games or math—it's for shaping behavior. We successfully trained an agent to navigate a complex corporate environment with surgical precision, proving that in the future of AI, less is often much, much more.

	---
	Created for the OpenEnv 2026 Hackathon by Aditya Katkar
