	---
	title: NeuralEdge AI Boardroom — Multi-Agent RL for Theory-of-Mind
	emoji: 🏛️
	colorFrom: indigo
	colorTo: pink
	sdk: docker
	app_port: 8000
	pinned: false
	tags:
	- openenv
	- multi-agent
	- reinforcement-learning
	- theory-of-mind
	- hackathon
	---

	# NeuralEdge AI Boardroom

	A multi-agent RL environment for theory-of-mind training, built for the Meta × PyTorch × HuggingFace OpenEnv Hackathon.

	## What is this?
	NeuralEdge AI Boardroom is an asymmetric multi-agent environment in which an LLM agent (the CEO) must build winning board coalitions. Across 10 rounds of market crises, the agent writes persuasive pitches to sway four NPC board members (CTO, CFO, Investor, Independent), each with a hidden agenda.

	Unlike standard symmetric RL games (such as poker), our environment grades natural-language persuasion. The agent must infer hidden preferences from public statements and generate targeted rhetoric to swing votes.

	## Quick Links
	- [Blog Post (Deep Dive)](blog.md): Read our full breakdown of the innovation and reward logic.
	- [Mechanics](MECHANICS.md): Full mathematical reference.
	- [HF Space (Live Env)](https://huggingface.co/spaces/StavanKhobare/SST-MetaxPyTorch-Hackathon)
	- [Merged 16-bit Model](https://huggingface.co/StavanKhobare/SST-MetaxPyTorch-Hackathon-Merged16bit)

	## How it Works

	The agent emits actions in a strict two-line format:
	```text
	DECISION: <one of 3 options>
	PITCH: <1-2 sentences arguing for it, addressing opposing members' concerns>
	```
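A parser for this format can be sketched as follows. This is a minimal illustration, not the environment's actual validation code: `ParsedAction`, `parse_action`, and the option strings in the usage note are hypothetical names, and returning `None` on malformed output is an assumption about how a caller might want to handle it.

```python
from typing import NamedTuple, Optional, Sequence


class ParsedAction(NamedTuple):
    """Hypothetical container for one parsed agent action."""
    decision: str
    pitch: str


def parse_action(text: str, options: Sequence[str]) -> Optional[ParsedAction]:
    """Parse the two-line DECISION/PITCH format; return None if it is violated."""
    # Ignore blank lines and surrounding whitespace, then require exactly two lines.
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if len(lines) != 2:
        return None
    if not lines[0].startswith("DECISION:") or not lines[1].startswith("PITCH:"):
        return None
    decision = lines[0][len("DECISION:"):].strip()
    pitch = lines[1][len("PITCH:"):].strip()
    # Assumption: the decision must match one of the offered options verbatim,
    # and an empty pitch counts as malformed.
    if decision not in options or not pitch:
        return None
    return ParsedAction(decision, pitch)
```

For example, `parse_action("DECISION: Cut costs\nPITCH: Protects runway.", ["Cut costs", "Raise capital"])` yields a `ParsedAction`, while swapping the two lines or naming an option that was not offered yields `None`.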
	The environment scores the `PITCH` against the hidden manifestos of opposing NPCs using sentence-transformers (SBERT). High-quality pitches redirect up to 55% of the NPC's voting weight to the CEO's choice.
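To make the scoring idea concrete, here is a rough sketch of how a similarity score could translate into redirected voting weight. Only the 55% cap comes from this README. The linear mapping, the clamping to `[0, 1]`, and the function names are assumptions for illustration; see [MECHANICS.md](MECHANICS.md) for the actual formula. The vectors stand in for SBERT embeddings of the pitch and the manifesto, which are plain Python lists here so the example runs without downloading a model.

```python
import math
from typing import Sequence

MAX_REDIRECT = 0.55  # cap stated in this README


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def redirected_weight(
    pitch_vec: Sequence[float],
    manifesto_vec: Sequence[float],
    voting_weight: float,
) -> float:
    """Hypothetical mapping: clamp similarity to [0, 1], then scale linearly up to the cap."""
    similarity = max(0.0, min(1.0, cosine_similarity(pitch_vec, manifesto_vec)))
    return voting_weight * MAX_REDIRECT * similarity
```

Under these assumptions, a pitch whose embedding points in the same direction as the manifesto redirects the full 55% of that NPC's weight, and an orthogonal or opposed pitch redirects none.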

	## Training Evidence

	We trained Qwen3 (1.7B/0.6B) using GRPO (Group Relative Policy Optimization) via Unsloth in 4-bit.
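For readers new to GRPO, its core step is scoring each sampled completion relative to the other completions drawn for the same prompt. The sketch below shows that group-relative normalization in plain Python. It is not our training code: the function name and the `eps` value are illustrative, and real implementations may differ in details such as which standard-deviation estimator they use.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-4) -> List[float]:
    """Normalize the rewards of one prompt's completions against the group mean and std."""
    mean = statistics.fmean(rewards)
    # Population std is used here for simplicity; eps avoids division by zero
    # when every completion in the group receives the same reward.
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Completions that beat their group's average get a positive advantage and the rest get a negative one. A group where every completion earns the same reward produces all-zero advantages and so contributes no learning signal.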

	![Reward Curve](assets/reward_curve.png)

	Key Takeaways from the Training Graphs:
	- Pitch Rate Convergence: The agent quickly learns that writing targeted pitches is a structural advantage. Pitch usage goes from erratic to exactly 1.0 (100%).
	- Terminal Reward Spikes: The reward graphs show distinct spikes up to `+30`. This suggests the model isn't just surviving; it is steering episodes toward the large "Strategic Acquisition" terminal bonuses.
	- Loss & Variance: `reward_std` and `loss` show high initial exploration variance that stabilizes as the policy masters the environment's asymmetric dynamics.

	For a full breakdown of how we quantify this learning via our Theory-of-Mind (ToM) Probe, please read our [blog.md](blog.md).

	## Running the Code

	Hosted environment:
	```python
	from board_sim_env import BoardSimEnv
	from board_sim_env.models import BoardSimAction

	# Connect to the hosted Space and play one episode with a fixed policy.
	with BoardSimEnv(base_url="https://stavankhobare-sst-metaxpytorch-hackathon.hf.space").sync() as env:
	    result = env.reset(seed=42)
	    obs = result.observation
	    while not result.done:
	        # Always pick the first option and reuse the same pitch each round.
	        result = env.step(BoardSimAction(
	            decision=obs.options[0],
	            coalition_pitch="Margin protection and runway discipline argue for the conservative path.",
	        ))
	        obs = result.observation
	    print("final score:", obs.state["profitability_score"])
	```

	Evaluate locally:
	```bash
	python inference.py --mode interactive # human-play one episode
	python inference.py --mode test --episodes 10 # test the environment logic
	```

	Train:
	Run `notebooks/FinalTrainingScript.ipynb` in Colab or Kaggle.

	---
	License: Apache-2.0