StavanKhobare committed on
Commit
86360cd
·
1 Parent(s): 312c390

Update blog.md

  1. blog.md +55 -22
blog.md CHANGED
@@ -1,37 +1,70 @@
1
- # NeuralEdge AI Boardroom: Rethinking Multi-Agent RL
 
 
2
 
3
- Welcome to the NeuralEdge AI Boardroom, our submission for the Meta × PyTorch Hackathon. This blog breaks down our core innovation, the environment's reward design, and the training evidence that proves our LLM agent is learning to exhibit Theory-of-Mind (ToM).
4
 
5
- ## Our Innovation: Asymmetric Theory-of-Mind (ToM) Persuasion
6
 
7
- Most multi-agent RL benchmarks (like Poker or Werewolf) focus on symmetric games with discrete action spaces. We built something fundamentally different: an **asymmetric, partially observable environment where natural language persuasion is graded**.
8
 
9
- Our LLM agent plays the CEO. It must satisfy four Board Members (NPCs), each with a hidden agenda (e.g., the CFO cares about runway; the CTO cares about engineering morale). The agent never sees these agendas. Instead, it must infer them from the NPCs' voting history and public statements, and then generate a **persuasive pitch** to swing their votes.
10
 
11
- We quantify this innovation via our **Theory-of-Mind (ToM) Probe**. During evaluation, we freeze the agent and ask it to predict *which* specific board member will oppose its decision. A random baseline guesses correctly ~25% of the time, but our GRPO-trained model explicitly learns to identify its opponents and tailor its rhetoric to their hidden agendas.
12
 
13
- ## The Reward Function Design
14
 
15
- Our reward function (`board_sim_env_environment.py:723`) is deliberately designed to mix dense step-level shaping with sparse, episodic terminal spikes.
16
 
17
- Here is the reasoning behind its core components:
18
 
19
- 1. **Δ Profitability (Dense):** The agent receives a continuous reward based on the change in the company's profitability score. This teaches basic corporate survival (protect runway, grow revenue).
20
- 2. **Coalition Success (Dense):** Winning a vote yields `+1.0`, losing yields `-0.4`.
21
- 3. **Persuasion & Pitching (The ToM Signal):** If the agent writes a pitch, it gets a `+0.05` bootstrap reward. If it writes a pitch that semantically aligns (measured via SBERT cosine similarity) with the hidden manifestos of *opposing* NPCs, it earns up to `+0.6`. This forces the LLM to learn high-quality, targeted rhetoric.
22
- 4. **Terminal Spikes (Sparse):** Survival alone isn't enough. Running out of money triggers a `-2.0` penalty. Successfully reaching the endgame triggers massive spikes: `+30` for an Acquisition, `+25` for an IPO.
23
 
24
- This combination ensures the agent doesn't just learn to "survive" by making safe choices, but actively learns to build trust and persuade opponents to achieve a massive exit.
25
 
26
- ## Training Evidence & Graph Analysis
27
 
28
- We trained Qwen3 (1.7B) using a KL-free **GRPO (Group Relative Policy Optimization)** setup. Our W&B training dashboard reveals exactly how the model adapts to the environment over time:
29
 
30
- ![W&B Training Graphs](assets/reward_curve.png)
31
 
32
- 1. **Pitch Rate Convergence (`pitch_rate`)**: Early in training, the agent's pitch rate is erratic. Very quickly, the pitch rate spikes and locks in at **1.0 (100%)**. The agent discovers that emitting an empty pitch is strategically suboptimal. It learns to always write a pitch to capture the persuasion reward channel.
33
- 2. **Reward Maximums (`reward_max` & `reward`)**: The mean step reward stays relatively flat (representing the dense shaping), but we see distinct, massive spikes up to `+30`. These spikes confirm the agent successfully navigated the 10-round gauntlet to achieve the optimal "Acquisition" endgame.
34
- 3. **Reward Standard Deviation (`reward_std`)**: The high variance (spikes up to 15) indicates active exploration. In our episodic structure, high variance is a feature, not a bug—it means the agent is exploring different terminal outcomes (bankruptcy vs. IPO vs. acquisition).
35
- 4. **Loss Stabilization (`loss`)**: The GRPO loss (advantage × NLL) starts highly volatile but compresses around zero as the policy stabilizes and the group-relative advantages converge.
36
 
37
- By the end of training, the model hasn't just learned what decisions to make—it has learned *how to argue for them*.
1
+ <h1 align="center">NeuralEdge AI Boardroom</h1>
2
+ <h3 align="center">Asymmetric Theory-of-Mind Training over OpenEnv v0.2.3</h3>
3
+ <p align="center"><em>Submission to the Meta × PyTorch × HuggingFace OpenEnv Hackathon — Theme 1: Multi-Agent Interactions</em></p>
4
 
5
+ ---
6
 
7
+ ## Abstract
8
 
9
+ We present **NeuralEdge AI Boardroom**, an asymmetric, partially-observable multi-agent reinforcement-learning environment in which a single CEO agent must build winning coalitions across ten rounds of strategic crises against four NPC board members with private agendas. Unlike symmetric multi-agent benchmarks (Poker, Werewolf, Diplomacy variants), the action channel is a `(decision, coalition_pitch)` tuple where the pitch is a free-text argument graded against each NPC's hidden manifesto via sentence-transformer cosine similarity. We fine-tune **Qwen3-0.6B** with a 4-bit LoRA (Unsloth, r=32, α=64) under a KL-regularised GRPO objective, and observe the four RL signatures the environment is designed to elicit: rapid pitch-channel uptake, terminal-spike density on `reward_max`, exploration-consistent `reward_std`, and stable GRPO loss.
10
 
11
+ ---
12
 
13
+ ## 1. Motivation
14
 
15
+ Real-world LLM agents do not negotiate against symmetric opponents. They write proposals, reason about heterogeneous stakeholders whose objectives are not directly observable, and argue for decisions in natural language. Existing multi-agent benchmarks capture none of this — they are zero-sum games over discrete action spaces. The underserved capability is **asymmetric persuasion under hidden-preference uncertainty with language-quality grading**, and BoardSim is the environment that trains it.
16
 
17
+ ## 2. Environment Design
18
 
19
+ Built natively on **OpenEnv v0.2.3**: `BoardSimEnv` is an `EnvClient` subclass over a Dockerised FastAPI server hosted on HuggingFace Spaces, with a synchronous `reset(seed) / step(action)` contract and a typed `BoardSimAction(decision, coalition_pitch)` schema.
20
 
21
+ | Property | Value |
22
+ |---|---|
23
+ | Episode length | 10 rounds, events shuffled per seed |
24
+ | Observation | public state + 4 NPC pre-vote statements (zero agenda leak) |
25
+ | Action | `decision ∈ options[3]`, `coalition_pitch: free text` |
26
+ | Vote resolution | weighted tally — CEO=2.5, NPC weights ∈ [0.8, 1.3] |
27
+ | Persuasion cap | up to **55%** of opposing NPC weight redirected by pitch |
28
+ | Trust dynamics | per-NPC trust ∈ [0.1, 1.0], ±0.08/round, multiplicatively gates vote weight |
29
 
30
+ Three independent variability layers (event order, ±15% consequence-magnitude noise, and ±25% NPC agenda jitter) eliminate trajectory memorisation as a learning shortcut.
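
The vote-resolution rows in the table above can be read as the following minimal sketch. It is an illustration under stated assumptions, not the environment's code: the function name `resolve_vote`, the tuple layout, and the rule that the redirected share scales linearly with a pitch score in [0, 1] are all hypothetical. Only the CEO weight of 2.5, the 55% cap, and the multiplicative trust gate come from the table.

```python
from collections import defaultdict

CEO_WEIGHT = 2.5        # from the table
PERSUASION_CAP = 0.55   # from the table: at most 55% of an opposing NPC's weight


def resolve_vote(agent_decision, npc_votes, pitch_scores):
    """Weighted tally with trust gating and capped persuasion.

    npc_votes: iterable of (name, choice, base_weight, trust) tuples.
    pitch_scores: dict mapping NPC name to a pitch score in [0, 1].
    Assumption: the redirected share is PERSUASION_CAP * pitch_score.
    """
    tally = defaultdict(float)
    tally[agent_decision] += CEO_WEIGHT
    for name, choice, base_weight, trust in npc_votes:
        weight = base_weight * trust  # trust multiplicatively gates vote weight
        if choice != agent_decision:
            score = min(max(pitch_scores.get(name, 0.0), 0.0), 1.0)
            redirected = PERSUASION_CAP * score * weight
            tally[agent_decision] += redirected
            weight -= redirected
        tally[choice] += weight
    winner = max(tally, key=tally.get)
    return winner, dict(tally)


# Four NPCs oppose the CEO's choice "A"; the last two names are placeholders.
votes = [("CFO", "B", 1.0, 1.0), ("CTO", "B", 1.0, 1.0),
         ("NPC3", "B", 1.0, 1.0), ("NPC4", "B", 1.0, 1.0)]
no_pitch = resolve_vote("A", votes, {})
full_pitch = resolve_vote("A", votes, {name: 1.0 for name, *_ in votes})
```

In this toy setup the CEO loses without a pitch and wins when every opposing NPC is fully persuaded, which is the mechanism that makes the pitch channel worth learning.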
31
 
32
+ ## 3. Reward Function
33
 
34
+ Reference: `envs/board_sim_env/server/board_sim_env_environment.py:723`. Per-step reward is dense and bounded ≈ [−0.7, +0.65]:
35
 
36
+ ```
37
+ r_t = Δprofitability / 100
38
+ + (+1.0 if winning_decision == agent_decision else −0.4)
39
+ + 0.5 · Σ Δtrust
40
+ + 0.05 · 𝟙[pitch ≠ ∅] # bootstrap
41
+ + 0.6 · mean_pitch_score over opposing NPCs # ToM persuasion
42
+ − 0.5 · 𝟙[decision ∉ options] # format penalty
43
+ ```
44
 
45
+ The terminal step adds episodic spikes (+30 acquisition, +25 IPO, +5 stay-private, −2 bankruptcy) plus a ±10 final-profitability tier. The split is deliberate: dense shaping for per-step credit assignment, sparse spikes for outcome-quality discrimination across the ten-round horizon.
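
As a reading aid, the per-step formula above can be transcribed term by term into Python. This is a sketch, not the contents of `board_sim_env_environment.py`: the function name, argument names, and dictionary keys are hypothetical. The ±10 final-profitability tier is left out because its thresholds are not given here.

```python
def step_reward(delta_profitability, vote_won, trust_deltas,
                pitch, opposing_pitch_scores, decision_valid):
    """Per-step reward, one line per term of the formula above."""
    reward = delta_profitability / 100.0
    reward += 1.0 if vote_won else -0.4          # coalition outcome
    reward += 0.5 * sum(trust_deltas)            # sum of per-NPC trust changes
    if pitch.strip():
        reward += 0.05                           # bootstrap for a non-empty pitch
    if opposing_pitch_scores:
        mean_score = sum(opposing_pitch_scores) / len(opposing_pitch_scores)
        reward += 0.6 * mean_score               # ToM persuasion term
    if not decision_valid:
        reward -= 0.5                            # format penalty
    return reward


# Terminal spikes listed in the text; the key names are illustrative.
TERMINAL_BONUS = {
    "acquisition": 30.0,
    "ipo": 25.0,
    "stay_private": 5.0,
    "bankruptcy": -2.0,
}


def terminal_reward(step_value, outcome):
    """Add the episodic spike to the final step's dense reward."""
    return step_value + TERMINAL_BONUS[outcome]
```

Keeping the dense terms and the terminal table separate mirrors the split described above: the first function carries per-step credit, the second only fires once per episode.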
 
 
 
46
 
47
+ ## 4. Training and Empirical Signal
48
+
49
+ **Stack:** Qwen3-0.6B base · Unsloth 4-bit LoRA (r=32, α=64, all linear modules) · KL-regularised GRPO (β=0.04 against a frozen reference) · `GROUP_SIZE=4`, lr=5e-6, T=1.0, top_p=0.95 · OpenEnv v0.2.3 client over the live HF Space.
50
+
51
+ ![GRPO training curves on BoardSim](assets/reward_curve.png)
52
+
53
+ The curves expose the four signatures the environment is designed to elicit:
54
+
55
+ 1. **Pitch-rate convergence to 1.0.** The agent rapidly internalises that emitting an empty `coalition_pitch` is strategically dominated: an empty pitch forfeits both the +0.05 bootstrap and the +0.6 × pitch_score persuasion term, and the format penalty fires whenever the two-line schema breaks. Convergence to a 100% pitch rate directly measures the policy learning *the structural reward channel that the random baseline cannot exploit*.
56
+ 2. **Terminal-spike density on `reward_max`.** Repeated +25 / +30 spikes are not training noise; they are the agent successfully navigating ten rounds of asymmetric vote resolution into the IPO and Acquisition terminal events. This is the signal that long-horizon credit assignment is working.
57
+ 3. **`reward_std` consistent with active exploration.** Group-relative advantage estimation requires within-group reward variance — the curve confirms the policy is sampling distinct terminal outcomes rather than collapsing to a single sub-optimal mode.
58
+ 4. **GRPO loss stabilising.** The advantage-weighted NLL flattens as group-relative advantages stop diverging from the running mean — the standard signature of a stable GRPO checkpoint.
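
Signatures 3 and 4 both depend on how GRPO turns a group of sampled rewards into advantages. The sketch below shows the usual group-relative normalisation for one prompt's group. It is a hypothetical helper, not the training script; the population standard deviation and the `eps` value are assumptions, and implementations differ on both.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std
    return [(r - mean) / (std + eps) for r in rewards]


# A group of four completions, matching GROUP_SIZE=4; the rewards are toy values.
mixed_group = group_relative_advantages([0.0, 2.0, 0.0, 2.0])
collapsed_group = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

When every completion in a group earns the same reward, all advantages are zero and the group contributes no policy gradient, which is why non-zero `reward_std` matters for learning.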
59
+
60
+ ## 5. Discussion
61
+
62
+ The headline measurement is *not* aggregate step-reward, which is dominated by terminal-spike density and seed luck. The axes that genuinely separate a trained policy from the random-baseline floor are **held-out final profitability** on paired same-seed evaluation and **pitch-usage rate**. The 200-episode random baseline scores 45.7 ± 13.1 profitability with **0% pitch usage**; the trained agent's pitch-rate convergence to 1.0 establishes a structural advantage that a random policy cannot replicate, because the persuasion reward channel is gated on producing non-empty, role-aligned text. Per-round trust dynamics convert the episode into a long-arc credit-assignment problem in which early persuasion compounds into late-round vote dominance, a regime closer to real-world stakeholder negotiation than zero-sum self-play.
63
+
64
+ ## Conclusion
65
+
66
+ NeuralEdge AI Boardroom contributes an asymmetric multi-agent environment to the OpenEnv ecosystem, one in which **language quality is part of the reward** rather than a wrapper around discrete play. Qwen3-0.6B + LoRA + GRPO produces the predicted RL signatures end-to-end on the live HF-Space-hosted environment, validating BoardSim as a target for theory-of-mind capability training rather than a dressed-up symmetric benchmark.
67
+
68
+ ---
69
+
70
+ <p align="center"><em>Code · Environment · Adapter — see the project README for HF Space and GitHub links.</em></p>