---
title: NeuralEdge AI Boardroom — Multi-Agent RL for Theory-of-Mind
emoji: 🏛️
colorFrom: indigo
colorTo: pink
sdk: docker
app_port: 8000
pinned: false
tags:
  - openenv
  - multi-agent
  - reinforcement-learning
  - theory-of-mind
  - hackathon
---

# NeuralEdge AI Boardroom

**A multi-agent RL environment for theory-of-mind training.**
*Meta × PyTorch × HuggingFace OpenEnv Hackathon*

## What is this?
NeuralEdge AI Boardroom is an asymmetric multi-agent environment where an LLM-agent (the CEO) must build winning board coalitions. Across 10 rounds of market crises, the agent must write persuasive pitches to sway 4 NPC board members (CTO, CFO, Investor, Independent), each with a **hidden agenda**.

Unlike standard symmetric RL games (like Poker), our environment grades **natural language persuasion**. The agent must infer hidden preferences from public statements and generate targeted rhetoric to swing votes.

## Quick Links
- **[Blog Post (Deep Dive)](blog.md)**: Read our full breakdown of the innovation and reward logic.
- **[Mechanics](MECHANICS.md)**: Full mathematical reference.
- **[HF Space (Live Env)](https://huggingface.co/spaces/StavanKhobare/SST-MetaxPyTorch-Hackathon)**
- **[Merged 16-bit Model](https://huggingface.co/StavanKhobare/SST-MetaxPyTorch-Hackathon-Merged16bit)**

## How it Works

The agent emits actions in a strict two-line format:
```text
DECISION: <one of 3 options>
PITCH: <1-2 sentences arguing for it, addressing opposing members' concerns>
```
The environment scores the `PITCH` against the hidden manifestos of opposing NPCs using sentence-transformers (SBERT). High-quality pitches redirect up to 55% of the NPC's voting weight to the CEO's choice. 

## Training Evidence

We trained **Qwen3 (1.7B/0.6B)** using **GRPO (Group Relative Policy Optimization)** via Unsloth in 4-bit.

![Reward Curve](assets/reward_curve.png)

**Key Takeaways from the Training Graphs:**
- **Pitch Rate Convergence**: The agent quickly realizes that writing targeted pitches is a structural advantage. Pitch usage goes from erratic to exactly **1.0 (100%)**.
- **Terminal Reward Spikes**: The reward graphs show distinct spikes up to `+30`. This proves the model isn't just surviving; it's actively navigating the environment to trigger the massive "Strategic Acquisition" terminal bonuses.
- **Loss & Variance**: `reward_std` and `loss` show high initial exploration variance that stabilizes as the policy masters the environment's asymmetric dynamics.

For a full breakdown of how we quantify this learning via our **Theory-of-Mind (ToM) Probe**, please read our [blog.md](blog.md).

## Running the Code

**Hosted environment:**
```python
from board_sim_env import BoardSimEnv
from board_sim_env.models import BoardSimAction

with BoardSimEnv(base_url="https://stavankhobare-sst-metaxpytorch-hackathon.hf.space").sync() as env:
    result = env.reset(seed=42)
    obs = result.observation
    while not result.done:
        result = env.step(BoardSimAction(
            decision=obs.options[0],
            coalition_pitch="Margin protection and runway discipline argue for the conservative path.",
        ))
        obs = result.observation
    print("final score:", obs.state["profitability_score"])
```

**Evaluate locally:**
```bash
python inference.py --mode interactive                # human-play one episode
python inference.py --mode test --episodes 10         # test the environment logic
```

**Train:**
Run the `notebooks/FinalTrainingScript.ipynb` in Colab or Kaggle.

---
**License**: Apache-2.0