StavanKhobare's picture
Update documentation, add blog, and simplify inference script
312c390

BoardSim — Full Mechanics Reference

Authoritative math and design reference for the BoardSim environment (organisation-agnostic boardroom simulation). Target audience: hackathon judges who want internals, and future contributors. See README.md for the submission overview.


1. State variables

State lives in BoardState.state_dict, initialised in BoardSimEnvironment.reset() at envs/board_sim_env/server/board_sim_env_environment.py:536.

Core company state (mutated each round by event consequences)

Field Initial Range Unit Meaning
revenue 2,000,000 [0, 1e12] USD/year Annual recurring revenue
burn_rate 1,200,000 [0, 1e10] USD/month Monthly cash expenditure
runway_months 14.0 [0, 120] months Time until cash = 0
product_readiness 0.45 [0, 1] fraction Shippability / quality of the product
market_share 0.08 [0, 1] fraction % of total addressable market
team_morale 0.70 [0, 1] fraction Team retention / engagement signal
investor_confidence 0.65 [0, 1] fraction Board investors' belief in success
regulatory_risk 0.20 [0, 1] fraction Legal / compliance exposure

Coalition state

Field Initial Range Update rule
trust[CTO] 0.5 [0.1, 1.0] ±0.08 per round depending on alignment with the winning decision
trust[CFO] 0.5 [0.1, 1.0] same
trust[Investor Rep] 0.5 [0.1, 1.0] same
trust[Independent] 0.5 [0.1, 1.0] same

Trust feeds back via two channels:

  • NPC confidence: confidence += (trust − 0.5) × 0.30, clipped.
  • Vote weight multiplier: trust_mult = clamp(trust × 2.0, 0.5, 1.5) applied to that NPC's tally contribution next round.

Bookkeeping

Field Purpose
round 1..10, increments each step
profitability_score Composite recomputed at end of each step
history Per-round log: agent_decision, winning_decision, vote_tally, pitch_scores, pitch_used
trust_history Per-round snapshot of all 4 trust values
done_reason "runway_exhausted" / "acquisition" / "finished_10" / None
winning_decision Last round's vote winner

2. Profitability score

profitability_score = clamp(raw, 0, 100)

raw =
  min(revenue / 8_000_000, 1.0) × 22         # revenue term       (max 22)
  + max(0, 1 − burn_rate / 1_400_000) × 18   # burn efficiency    (max 18)
  + min(runway_months / 18.0, 1.0) × 18      # runway term        (max 18)
  − max(0, (6 − runway_months) / 6) × 10     # low-runway penalty (bites < 6 mo)
  + min(market_share, 0.50) / 0.50 × 14      # market share       (max 14)
  + product_readiness × 10                   # product readiness  (max 10)
  + team_morale × 7                          # team morale        (max  7)
  + investor_confidence × 11                 # investor confidence (max 11)
  − regulatory_risk × 18                     # regulatory drag    (max −18)

Initial state ≈ 37.3/100. Theoretical max = 100.


3. Transition

next_state = current_state + consequences[winning_decision] × (1 + ε)
    where ε ~ N(0, 0.15) per consequence value, fixed at episode reset (seeded)

runway_months -= _advance_runway()                # depends on net cash flow
trust[role]   ±= 0.08 per NPC                     # based on alignment with winning_decision
profitability_score = compute_profitability_score(next_state)

Runway decrement

monthly_revenue = revenue / 12.0
net = monthly_revenue - burn_rate
if net >= 0:
    runway_months -= 0.5                       # profitable: slow burn
else:
    burn_months = min(2.0, max(1.0, abs(net) / burn_rate + 1.0))
    runway_months -= burn_months               # unprofitable: faster bleed

Three layers of variability (no trajectory memorisation)

  1. Event order shuffled per episode — same 10 events, different sequence per seed.
  2. Consequence magnitudes ±15% Gaussian noise — sampled at reset(), fixed for the episode.
  3. NPC agendas ±25% sign-preserving jitter_jitter_agendas(seed) perturbs base NPC priorities each episode.

4. Vote resolution

Vote weights

CEO: 2.5    CTO: 1.2    CFO: 1.0    Investor Rep: 1.3    Independent: 0.8

CEO weight 2.5 ensures a decisive CEO call usually wins — the agent's actions visibly move outcomes round-to-round. NPCs still matter via persuasion shifts and trust dynamics.

NPC option scoring (per NPC, per round)

for each option opt:
    score[opt] = Σ over (metric, weight) in NPC_agenda:
                    consequences[opt][metric] × weight        (with unit normalisation)
    score[opt] += N(0, 0.20)                                  # personality noise

NPC votes for argmax(score)
margin     = top_two_score_difference
confidence = clamp(0.5 + 0.5 × margin + (trust − 0.5)×0.30, 0.05, 1.0)

Unit normalisation in scoring: revenue /= 1e6, burn_rate /= 1e5, runway_months /= 6. revenue_mult consequences are scored against the current revenue × the agenda weight on revenue.

Pitch persuasion — semantic similarity, not keyword matching

ps_role = pitch_score(pitch, role) ∈ [0, 1]

# Persuasion redirects up to 55% of the NPC's vote weight to CEO's pick:
shift_frac        = 0.55 × ps_role
tally[NPC_vote]    += base_weight × (1 − shift_frac)
tally[CEO_decision] += base_weight × shift_frac

Where base_weight = ROLE_WEIGHT[role] × confidence × clamp(trust[role] × 2, 0.5, 1.5).

The pitch scorer (_PitchScorer in board_sim_env_environment.py) has two backends:

  1. Sentence-transformer (primary): all-MiniLM-L6-v2, normalised cosine. score = clamp((cosine + 0.05) × 1.2, 0, 1). Genuine sentence embeddings — semantically aligned arguments score high even with no shared tokens.
  2. TF-IDF fallback: (1,2)-grams, English stop-words removed, IDF-weighted bag-of-bigrams cosine vs the role's manifesto. score = clamp(cosine × 1.4, 0, 1). Token-based but properly stop-worded and IDF-weighted — already much more robust than a literal keyword count.

Set BOARDSIM_PITCH_BACKEND=tfidf to force the fallback (e.g. for CI without the embedding model).

NPC manifestos (the hidden objective the CEO must infer)

Role Manifesto (paraphrased)
CTO Operational excellence, engineering quality, team morale, technical risk reduction.
CFO Capital discipline, runway, balance-sheet protection, regulatory caution.
Investor Rep Growth, market share, ambitious returns, decisive bold bets.
Independent Long-term reputation, governance, stakeholder trust, ethical responsibility.

The full text lives in NPC_MANIFESTOS in the environment file.

Tie-breaking

If two options tie in the tally, the CEO's pick wins (implementation: insert agent_decision first into the ordered tally before max()).


5. Reward formula

Applied at the end of each step() call:

# Primary signal — normalised profitability delta
reward  = (new_score − old_score) / 100.0

# Coalition bonus / penalty (magnitudes raised so CEO impact is visible)
reward += 1.0   if winning_decision == agent_decision  else −0.4

# Trust delta term
reward += 0.5 × (Σ trust_after − Σ trust_before)

# Pitch bootstrap + semantic persuasion
if pitch is non-empty:
    reward += 0.05
    if any NPC opposed the CEO's pick:
        reward += 0.6 × mean(pitch_score over opposing NPCs)

# Format penalty
if action.decision not in current_round.options:
    reward −= 0.5

# Terminal
if runway_months <= 0:
    reward −= 2.0                          # bankruptcy
if terminal:
    reward += event._terminal_bonus        # acquisition +30, IPO +25, stay-private +5, etc.
    reward += {+10 if final ≥ 60, +5 if ≥ 40, −5 if < 20}
Term Purpose
Δ score / 100 Primary learning signal: profitability improvement per decision
Coalition ±1.0 / −0.4 Teaches the agent to actually win votes, not pick "good-looking" options
Trust × 0.5 Rewards long-arc coalition building across rounds
Pitch bootstrap +0.05 Ensures the pitch channel is exercised before the model is good enough to earn semantic bonuses
Pitch persuasion × 0.6 Rewards pitches semantically aligned with opposing NPC manifestos (ToM signal)
Invalid −0.5 Format-compliance signal (DECISION: / PITCH: two-line structure)
Bankruptcy −2.0 Episode-ending failure signal
Terminal tiered Long-horizon incentive toward high profitability, acquisition, or IPO

6. Step ordering

1. old_score      = compute_profitability_score(state)            # snapshot BEFORE
2. NPC votes computed from current state + trust
3. CEO decision + pitch → _resolve_vote() → winning_decision
4. consequences[winning_decision] × noise → applied to state
5. _advance_runway()
6. trust updated per NPC (±0.08)
7. new_score      = compute_profitability_score(state)            # AFTER consequences
8. reward = (new_score − old_score)/100 + coalition + trust + pitch + ...
9. next observation returned with new_score in obs.state

The CEO never consults profitability to make its decision — it sees the previous round's score in the observation, emits a decision, and then the score updates. Profitability is the outcome metric, not a planning input.


7. Training pipeline

Per-round gradient flow

The training loop samples one completion per round, per group member. Every one of the 10 decisions in a trajectory contributes gradient signal — not just the opening decision.

For each training step:
    Create GROUP_SIZE independent envs (different seeds)
    For each round r in 0..9:
        For each group member g:
            prompt = build_prompt(obs_g)
            completion = model.generate(prompt, do_sample=True)   # gradient-connected
            obs_g = env_g.step(parse(completion))
            ep_reward[g] += obs_g.reward
    advantages = GRPO(ep_rewards)            # group-relative normalisation
    For each (g, r) completion:
        loss = advantage[g] × NLL(completion) / (GROUP_SIZE × n_rounds)
             + β_KL × KL(π_θ || π_ref)
    optimizer.step()

KL penalty

A frozen reference model computes reference log-probs. KL ≈ current_loss − ref_loss per completion, clamped at 0. Coefficient β = 0.04. Prevents drift into degenerate text patterns (always emitting the same decision, empty pitches).

Reward normalisation

Three normalisations in the reward function so terms are commensurate:

  1. Δ score ÷ 100 — brings profitability delta into the same scale as the coalition term.
  2. Bankruptcy penalty −2 (was −5) — one bad arc no longer drowns 9 rounds of positive signal.
  3. Pitch bootstrap +0.05 — kickstarts the pitch channel before the model is good enough to earn semantic bonuses.

8. The baseline — same Qwen3-0.6B with LoRA disabled

Earlier revisions compared the trained policy against a uniform-random policy. A coin flip is not a meaningful opponent for a 4 B language model picking among 3 well-formed strings — it can only highlight that the LM ≠ noise, which is not the relevant question.

The current baseline runs the same Qwen3-0.6B, on the same paired seeds, with the LoRA adapter context-managed off. Implementation (see Training.py / notebooks/train_grpo_v2.ipynb):

# Fine-tuned (LoRA active)
trained_finals = run_episodes(model, seeds=HELDOUT)

# Same model, LoRA disabled — apples-to-apples base reference.
with model.disable_adapter():
    base_finals = run_episodes(model, seeds=HELDOUT)

Statistical comparison on the per-seed paired delta trained − base:

  • Paired t-test
  • Wilcoxon signed-rank
  • Cohen's d
  • Bootstrap 95% CI on the mean delta
  • Win-rate (fraction of seeds where trained > base)

9. Theory-of-Mind — what's actually measured

ToM in this environment has a specific, narrow meaning: can the agent infer what each NPC privately values, given only their statements and prior votes?

It is graded two ways:

  1. Pitch persuasion score: cosine(SBERT(pitch), SBERT(role_manifesto)). A pitch that genuinely articulates the role's priorities scores above ~0.4; a pitch that is merely topically adjacent scores ~0.1; off-topic pitches score ~0.0. This replaces the earlier keyword-overlap metric, which the agent could trivially game.
  2. ToM probe: ask the model to name the SINGLE board member most likely to oppose its chosen decision. Random baseline = 25% (1 of 4). The probe is run for both the fine-tuned policy and the disable-adapter base — the delta isolates what fine-tuning taught the model about its boardroom.

Trust trajectory across 10 rounds is a secondary diagnostic: rising trust for 3+ NPCs indicates the agent is consistently picking decisions aligned with their private preferences, which requires implicit modelling of those preferences.