---
title: NeuralEdge AI Boardroom Multi-Agent RL for Theory-of-Mind
emoji: 🏛️
colorFrom: indigo
colorTo: pink
sdk: docker
app_port: 8000
pinned: false
tags:
- openenv
- multi-agent
- reinforcement-learning
- theory-of-mind
- hackathon
---
# NeuralEdge AI Boardroom
**A multi-agent RL environment for theory-of-mind training.**
*Meta × PyTorch × HuggingFace OpenEnv Hackathon*
## What is this?
NeuralEdge AI Boardroom is an asymmetric multi-agent environment in which an LLM agent (the CEO) must build winning board coalitions. Across 10 rounds of market crises, the agent writes persuasive pitches to sway 4 NPC board members (CTO, CFO, Investor, Independent), each with a **hidden agenda**.
Unlike standard symmetric RL games (like Poker), our environment grades **natural language persuasion**. The agent must infer hidden preferences from public statements and generate targeted rhetoric to swing votes.
## Quick Links
- **[Blog Post (Deep Dive)](blog.md)**: Read our full breakdown of the innovation and reward logic.
- **[Mechanics](MECHANICS.md)**: Full mathematical reference.
- **[HF Space (Live Env)](https://huggingface.co/spaces/StavanKhobare/SST-MetaxPyTorch-Hackathon)**
- **[Merged 16-bit Model](https://huggingface.co/StavanKhobare/SST-MetaxPyTorch-Hackathon-Merged16bit)**
## How it Works
The agent emits actions in a strict two-line format:
```text
DECISION: <one of 3 options>
PITCH: <1-2 sentences arguing for it, addressing opposing members' concerns>
```
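A minimal sketch of how a raw model completion could be checked against this two-line format. The `parse_action` helper and `ParsedAction` type are hypothetical names for illustration, not part of the `board_sim_env` package, and the environment's own parser may be more lenient or stricter.

```python
from typing import NamedTuple, Optional


class ParsedAction(NamedTuple):
    decision: str
    pitch: str


def parse_action(text: str) -> Optional[ParsedAction]:
    """Return the decision and pitch, or None if the format is violated."""
    # Ignore blank lines and surrounding whitespace.
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if len(lines) != 2:
        return None
    if not lines[0].startswith("DECISION:") or not lines[1].startswith("PITCH:"):
        return None
    decision = lines[0][len("DECISION:"):].strip()
    pitch = lines[1][len("PITCH:"):].strip()
    if not decision or not pitch:
        return None
    return ParsedAction(decision, pitch)


action = parse_action("DECISION: Cut costs\nPITCH: This protects our runway.")
print(action)
```

Returning `None` for malformed output lets a caller apply whatever fallback or penalty it prefers without raising inside a rollout loop.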
The environment scores the `PITCH` against the hidden manifestos of the opposing NPCs using sentence-transformers (SBERT). A high-scoring pitch redirects up to 55% of an opposing NPC's voting weight to the CEO's choice.
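The effect of the 55% cap can be illustrated with a small sketch. It assumes the pitch score is a similarity in `[0, 1]` and that the redirected share scales linearly with it; both are assumptions made for illustration, and `redirect_votes` is a hypothetical helper. The exact formula is in [MECHANICS.md](MECHANICS.md).

```python
from typing import Dict


def redirect_votes(
    npc_weights: Dict[str, float],
    ceo_choice: str,
    pitch_similarity: float,
    max_redirect: float = 0.55,  # cap stated in this README
) -> Dict[str, float]:
    """Move a share of one NPC's voting weight toward the CEO's choice."""
    # Clamp the score so the redirected share never exceeds the cap.
    sim = min(max(pitch_similarity, 0.0), 1.0)
    share = max_redirect * sim  # assumed linear scaling
    total = sum(npc_weights.values())
    # Shrink every option proportionally, then hand the freed weight to the CEO's option.
    adjusted = {opt: w * (1.0 - share) for opt, w in npc_weights.items()}
    adjusted[ceo_choice] = adjusted.get(ceo_choice, 0.0) + share * total
    return adjusted


before = {"expand": 0.6, "hold": 0.3, "cut": 0.1}
after = redirect_votes(before, ceo_choice="cut", pitch_similarity=1.0)
print(after)
```

Under these assumptions the NPC's total voting weight is preserved, so a perfect pitch shifts the balance toward the CEO's option without changing how much the NPC's vote counts overall.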
## Training Evidence
We trained **Qwen3 (1.7B/0.6B)** using **GRPO (Group Relative Policy Optimization)** via Unsloth in 4-bit.
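GRPO scores each sampled completion relative to the other completions drawn for the same prompt, rather than against a learned value baseline. A minimal sketch of that group-relative normalization follows; the use of population standard deviation and the `eps` value are assumptions for illustration and are not taken from this project's training notebook.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    # eps avoids division by zero when every completion earns the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt, one of which reached a large terminal bonus.
print(group_relative_advantages([1.0, 2.0, 1.5, 30.0]))
```

When all completions in a group earn the same reward, every advantage is zero and that group contributes no gradient signal.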
![Reward Curve](assets/reward_curve.png)
**Key Takeaways from the Training Graphs:**
- **Pitch Rate Convergence**: The agent quickly learns that writing targeted pitches is a structural advantage. Pitch usage goes from erratic to **1.0 (100%)**.
- **Terminal Reward Spikes**: The reward graphs show distinct spikes up to `+30`. This suggests the model isn't just surviving; it is steering episodes toward the large "Strategic Acquisition" terminal bonuses.
- **Loss & Variance**: `reward_std` and `loss` show high initial exploration variance that stabilizes as the policy masters the environment's asymmetric dynamics.
For a full breakdown of how we quantify this learning via our **Theory-of-Mind (ToM) Probe**, please read our [blog.md](blog.md).
## Running the Code
**Hosted environment:**
```python
from board_sim_env import BoardSimEnv
from board_sim_env.models import BoardSimAction

with BoardSimEnv(base_url="https://stavankhobare-sst-metaxpytorch-hackathon.hf.space").sync() as env:
    result = env.reset(seed=42)
    obs = result.observation
    while not result.done:
        result = env.step(BoardSimAction(
            decision=obs.options[0],
            coalition_pitch="Margin protection and runway discipline argue for the conservative path.",
        ))
        obs = result.observation
    print("final score:", obs.state["profitability_score"])
```
**Evaluate locally:**
```bash
python inference.py --mode interactive # human-play one episode
python inference.py --mode test --episodes 10 # test the environment logic
```
**Train:**
Run `notebooks/FinalTrainingScript.ipynb` in Colab or Kaggle.
---
**License**: Apache-2.0