
title: NeuralEdge AI Boardroom — Multi-Agent RL for Theory-of-Mind
emoji: 🏛️
colorFrom: indigo
colorTo: pink
sdk: docker
app_port: 8000
pinned: false
tags:
  - openenv
  - multi-agent
  - reinforcement-learning
  - theory-of-mind
  - hackathon

NeuralEdge AI Boardroom

A multi-agent RL environment for theory-of-mind training, built for the Meta × PyTorch × HuggingFace OpenEnv Hackathon.

What is this?

NeuralEdge AI Boardroom is an asymmetric multi-agent environment in which an LLM agent (the CEO) must build winning board coalitions. Across 10 rounds of market crises, the agent writes persuasive pitches to sway four NPC board members (CTO, CFO, Investor, Independent), each with a hidden agenda.

Unlike standard symmetric RL games (such as poker), our environment grades natural-language persuasion. The agent must infer each member's hidden preferences from their public statements and generate targeted rhetoric to swing votes.

Quick Links

Blog Post (Deep Dive): Read our full breakdown of the innovation and reward logic.
Mechanics: Full mathematical reference.
HF Space (Live Env)
Merged 16-bit Model

How it Works

The agent emits actions in a strict two-line format:

DECISION: <one of 3 options>
PITCH: <1-2 sentences arguing for it, addressing opposing members' concerns>
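The two-line format above can be validated with a small parser. The following is a minimal sketch, not the environment's own code: `parse_action`, `ParsedAction`, and the option names in the usage note are hypothetical, and the real environment may be more or less strict about whitespace.

```python
import re
from typing import NamedTuple, Optional


class ParsedAction(NamedTuple):
    decision: str
    pitch: str


# Hypothetical helper, not part of the board_sim_env package.
# Exactly two lines are accepted: DECISION first, then PITCH.
_ACTION_RE = re.compile(
    r"^DECISION:[ \t]*(?P<decision>[^\n]+)\nPITCH:[ \t]*(?P<pitch>[^\n]+)$"
)


def parse_action(text: str, options: list[str]) -> Optional[ParsedAction]:
    """Return the parsed action, or None if the format is violated."""
    match = _ACTION_RE.match(text.strip())
    if match is None:
        return None
    decision = match.group("decision").strip()
    pitch = match.group("pitch").strip()
    # Reject decisions outside the offered options and empty pitches.
    if decision not in options or not pitch:
        return None
    return ParsedAction(decision, pitch)
```

For example, with an assumed option list `["Cut costs", "Raise capital", "Acquire rival"]`, the text `"DECISION: Cut costs\nPITCH: Protects runway."` parses successfully, while a missing PITCH line or an unknown decision yields `None`.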

The environment scores the PITCH against the hidden manifestos of opposing NPCs using sentence-transformers (SBERT) embeddings. A high-scoring pitch can redirect up to 55% of an opposing NPC's voting weight to the CEO's choice.
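To make the 55% cap concrete, here is a rough sketch of how a similarity score could translate into redirected voting weight. Only the 0.55 cap comes from this README. The linear mapping, the clamping, and the function names `cosine_similarity` and `redirect_vote` are assumptions for illustration; the actual formula is documented in the Mechanics reference. In the real environment the vectors would be SBERT embeddings of the pitch and the manifesto, while plain lists stand in for them here.

```python
import math

MAX_REDIRECT = 0.55  # cap stated in this README


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def redirect_vote(similarity, npc_choice, ceo_choice, weight=1.0):
    """Split one NPC's voting weight between its own choice and the CEO's.

    Assumption: the redirected share grows linearly with the similarity
    score, clamped to [0, 1], and never exceeds MAX_REDIRECT.
    """
    if npc_choice == ceo_choice:
        return {ceo_choice: weight}  # already aligned, nothing to redirect
    score = min(max(similarity, 0.0), 1.0)
    moved = weight * MAX_REDIRECT * score
    return {npc_choice: weight - moved, ceo_choice: moved}
```

Under these assumptions, even a perfectly matched pitch leaves the NPC with 45% of its weight on its original choice, so the CEO cannot win a hostile board by rhetoric alone.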

Training Evidence

We trained Qwen3 models (1.7B and 0.6B) with GRPO (Group Relative Policy Optimization), loaded in 4-bit via Unsloth.
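For readers new to GRPO, its core idea is that each sampled completion is scored relative to the other completions drawn for the same prompt, so no separate value network is needed. The snippet below is a minimal sketch of that normalization step only, not the training code from the notebook. The function name, the `eps` value, and the use of population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize a group of rewards sampled for the same prompt.

    Each advantage is the reward's distance from the group mean,
    measured in units of the group's standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; libraries differ
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every completion in a group earns the same reward, all advantages are zero and that group contributes no gradient signal, which is one reason reward variance within a group matters during training.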

Key Takeaways from the Training Graphs:

Pitch Rate Convergence: The agent quickly learns that writing targeted pitches is a structural advantage. Pitch usage goes from erratic to 1.0 (a pitch in 100% of actions).
Terminal Reward Spikes: The reward graphs show distinct spikes up to +30. This indicates the model isn't just surviving; it is actively steering episodes toward the large "Strategic Acquisition" terminal bonus.
Loss & Variance: reward_std and loss show high variance during early exploration, then stabilize as the policy adapts to the environment's asymmetric dynamics.

For a full breakdown of how we quantify this learning with our Theory-of-Mind (ToM) probe, see blog.md.

Running the Code

Hosted environment:

from board_sim_env import BoardSimEnv
from board_sim_env.models import BoardSimAction

# Connect to the hosted Space through the synchronous client wrapper.
with BoardSimEnv(base_url="https://stavankhobare-sst-metaxpytorch-hackathon.hf.space").sync() as env:
    result = env.reset(seed=42)
    obs = result.observation
    # Baseline policy: always pick the first option and reuse one fixed pitch.
    while not result.done:
        result = env.step(BoardSimAction(
            decision=obs.options[0],
            coalition_pitch="Margin protection and runway discipline argue for the conservative path.",
        ))
        obs = result.observation
    # Read the final score from the last observation's state.
    print("final score:", obs.state["profitability_score"])

Evaluate locally:

python inference.py --mode interactive                # human-play one episode
python inference.py --mode test --episodes 10         # test the environment logic

Train: Run notebooks/FinalTrainingScript.ipynb in Colab or Kaggle.

License: Apache-2.0