metadata
title: NeuralEdge AI Boardroom - Multi-Agent RL for Theory-of-Mind
emoji: 🏛️
colorFrom: indigo
colorTo: pink
sdk: docker
app_port: 8000
pinned: false
tags:
  - openenv
  - multi-agent
  - reinforcement-learning
  - theory-of-mind
  - hackathon

NeuralEdge AI Boardroom

A multi-agent RL environment for theory-of-mind training, built for the Meta × PyTorch × Hugging Face OpenEnv Hackathon.

What is this?

NeuralEdge AI Boardroom is an asymmetric multi-agent environment in which an LLM agent (the CEO) must build winning board coalitions. Across 10 rounds of market crises, the agent writes persuasive pitches to sway four NPC board members (CTO, CFO, Investor, Independent), each with a hidden agenda.

Unlike standard symmetric RL games such as poker, this environment grades natural-language persuasion. The agent must infer each member's hidden preferences from their public statements and generate targeted rhetoric to swing votes.

Quick Links

How it Works

The agent emits actions in a strict two-line format:

DECISION: <one of 3 options>
PITCH: <1-2 sentences arguing for it, addressing opposing members' concerns>
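A minimal sketch of how a completion in this two-line format could be validated before it is turned into an action. The helper name `parse_action`, the `ParsedAction` type, and the exact strictness rules (exactly two non-empty lines, decision must match one of the offered options verbatim) are assumptions for illustration; the environment's own parser may be more lenient or stricter.

```python
import re
from typing import NamedTuple, Optional, Sequence


class ParsedAction(NamedTuple):
    decision: str
    pitch: str


# Assumed patterns: each line starts with its keyword, followed by non-empty text.
_DECISION_RE = re.compile(r"^DECISION:\s*(.+?)\s*$")
_PITCH_RE = re.compile(r"^PITCH:\s*(.+?)\s*$")


def parse_action(text: str, options: Sequence[str]) -> Optional[ParsedAction]:
    """Return the parsed action, or None if the text violates the format."""
    # Ignore blank lines, then require exactly one DECISION and one PITCH line.
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if len(lines) != 2:
        return None
    decision_match = _DECISION_RE.match(lines[0])
    pitch_match = _PITCH_RE.match(lines[1])
    if decision_match is None or pitch_match is None:
        return None
    decision = decision_match.group(1)
    # Assumption: the decision must be one of the options offered this round.
    if decision not in options:
        return None
    return ParsedAction(decision=decision, pitch=pitch_match.group(1))
```

Returning `None` for malformed output, rather than raising, makes it easy to assign a format penalty during training without interrupting a rollout.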

The environment scores the PITCH against the hidden manifestos of the opposing NPCs using sentence-transformers (SBERT). A high-quality pitch redirects up to 55% of an opposing NPC's voting weight to the CEO's choice.
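The scoring step can be pictured roughly as follows. This sketch takes embeddings as plain lists of floats (in the real environment they would come from an SBERT model) and assumes a linear mapping from cosine similarity to redirected vote share, clipped at zero and capped at 55%. The function names and the linear mapping are assumptions, not the environment's actual formula.

```python
import math
from typing import Sequence, Tuple

MAX_REDIRECT = 0.55  # cap stated above: at most 55% of an NPC's weight can move


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def redirected_share(pitch_vec: Sequence[float], manifesto_vec: Sequence[float]) -> float:
    """Fraction of an NPC's voting weight moved to the CEO's choice.

    Assumption: the share grows linearly with similarity; negative
    similarity moves nothing.
    """
    similarity = max(0.0, min(1.0, cosine_similarity(pitch_vec, manifesto_vec)))
    return MAX_REDIRECT * similarity


def split_weight(weight: float, share: float) -> Tuple[float, float]:
    """Return (weight kept on the NPC's own choice, weight moved to the CEO's)."""
    moved = weight * share
    return weight - moved, moved
```

Under these assumptions, a pitch whose embedding matches a manifesto exactly moves 55% of that member's weight, while an unrelated pitch moves none.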

Training Evidence

We trained Qwen3 (1.7B and 0.6B) with GRPO (Group Relative Policy Optimization) via Unsloth in 4-bit precision.
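For readers new to GRPO, its core idea is that each completion is scored relative to the other completions sampled for the same prompt, so no separate value network is needed. The sketch below shows that group-relative normalization in isolation. It uses the population standard deviation and an `eps` of 1e-4 as illustrative choices; the training notebook's actual settings may differ.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-4) -> List[float]:
    """Normalize rewards within one group of completions for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so
    above-average completions get positive advantages and below-average
    completions get negative ones.
    """
    group_mean = mean(rewards)
    group_std = pstdev(rewards)  # assumption: population std, not sample std
    return [(r - group_mean) / (group_std + eps) for r in rewards]
```

One consequence worth noting: if every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.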

Reward Curve

Key Takeaways from the Training Graphs:

  • Pitch Rate Convergence: The agent quickly learns that writing targeted pitches is a structural advantage. Pitch usage goes from erratic to exactly 1.0 (100%).
  • Terminal Reward Spikes: The reward graphs show distinct spikes up to +30. This suggests the model isn't just surviving; it is steering episodes toward the large "Strategic Acquisition" terminal bonus.
  • Loss & Variance: reward_std and loss show high variance during early exploration, then stabilize as the policy adapts to the environment's asymmetric dynamics.
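As a rough illustration of the pitch-rate metric, the sketch below computes the fraction of completions that contain a non-empty PITCH line. This definition and the helper name `pitch_rate` are assumptions; the training notebook may compute the metric differently.

```python
from typing import Sequence


def has_pitch(completion: str) -> bool:
    """True if any line starts with 'PITCH:' and has text after the keyword."""
    for line in completion.splitlines():
        stripped = line.strip()
        if stripped.startswith("PITCH:") and stripped[len("PITCH:"):].strip():
            return True
    return False


def pitch_rate(completions: Sequence[str]) -> float:
    """Fraction of completions that include a non-empty pitch (0.0 if empty)."""
    if not completions:
        return 0.0
    return sum(1 for text in completions if has_pitch(text)) / len(completions)
```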

For a full breakdown of how we quantify this learning with our Theory-of-Mind (ToM) Probe, see blog.md.

Running the Code

Hosted environment:

# Play one episode against the hosted Space with a fixed baseline policy.
from board_sim_env import BoardSimEnv
from board_sim_env.models import BoardSimAction

with BoardSimEnv(base_url="https://stavankhobare-sst-metaxpytorch-hackathon.hf.space").sync() as env:
    result = env.reset(seed=42)  # fixed seed for a reproducible episode
    obs = result.observation
    while not result.done:
        # Baseline: always pick the first option and reuse the same pitch.
        result = env.step(BoardSimAction(
            decision=obs.options[0],
            coalition_pitch="Margin protection and runway discipline argue for the conservative path.",
        ))
        obs = result.observation
    print("final score:", obs.state["profitability_score"])

Evaluate locally:

python inference.py --mode interactive                # human-play one episode
python inference.py --mode test --episodes 10         # test the environment logic

Train: run notebooks/FinalTrainingScript.ipynb in Colab or Kaggle.


License: Apache-2.0