BoardSim — Full Mechanics Reference
Authoritative math and design reference for the BoardSim environment (organisation-agnostic boardroom simulation). Target audience: hackathon judges who want internals, and future contributors. See README.md for the submission overview.
1. State variables
State lives in BoardState.state_dict, initialised in BoardSimEnvironment.reset() at envs/board_sim_env/server/board_sim_env_environment.py:536.
Core company state (mutated each round by event consequences)
| Field | Initial | Range | Unit | Meaning |
|---|---|---|---|---|
| revenue | 2,000,000 | [0, 1e12] | USD/year | Annual recurring revenue |
| burn_rate | 1,200,000 | [0, 1e10] | USD/month | Monthly cash expenditure |
| runway_months | 14.0 | [0, 120] | months | Time until cash = 0 |
| product_readiness | 0.45 | [0, 1] | fraction | Shippability / quality of the product |
| market_share | 0.08 | [0, 1] | fraction | % of total addressable market |
| team_morale | 0.70 | [0, 1] | fraction | Team retention / engagement signal |
| investor_confidence | 0.65 | [0, 1] | fraction | Board investors' belief in success |
| regulatory_risk | 0.20 | [0, 1] | fraction | Legal / compliance exposure |
Coalition state
| Field | Initial | Range | Update rule |
|---|---|---|---|
| trust[CTO] | 0.5 | [0.1, 1.0] | ±0.08 per round depending on alignment with the winning decision |
| trust[CFO] | 0.5 | [0.1, 1.0] | same |
| trust[Investor Rep] | 0.5 | [0.1, 1.0] | same |
| trust[Independent] | 0.5 | [0.1, 1.0] | same |
Trust feeds back via two channels:
- NPC confidence: confidence += (trust − 0.5) × 0.30, clipped.
- Vote weight multiplier: trust_mult = clamp(trust × 2.0, 0.5, 1.5), applied to that NPC's tally contribution next round.
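The trust update and its two feedback channels can be sketched in a few lines. This is a minimal illustration, not the environment's code: the function names are hypothetical, and "aligned" is assumed to mean that the NPC voted for the winning decision.

```python
def clamp(x: float, lo: float, hi: float) -> float:
    """Restrict x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


TRUST_STEP = 0.08          # per-round trust change (from the table above)
TRUST_RANGE = (0.1, 1.0)   # trust bounds (from the table above)


def update_trust(trust: float, aligned: bool) -> float:
    """Raise trust by 0.08 when aligned, lower it by 0.08 otherwise, then clip.

    Assumption: `aligned` means the NPC's vote matched the winning decision.
    """
    delta = TRUST_STEP if aligned else -TRUST_STEP
    return clamp(trust + delta, *TRUST_RANGE)


def confidence_shift(trust: float) -> float:
    """Additive term applied to NPC confidence before clipping."""
    return (trust - 0.5) * 0.30


def trust_multiplier(trust: float) -> float:
    """Vote-weight multiplier applied to the NPC's tally contribution."""
    return clamp(trust * 2.0, 0.5, 1.5)
```

At the initial trust of 0.5 the confidence shift is zero and the multiplier is exactly 1.0, so trust only starts to matter once it has moved away from its starting value.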
Bookkeeping
| Field | Purpose |
|---|---|
| round | 1..10, increments each step |
| profitability_score | Composite recomputed at end of each step |
| history | Per-round log: agent_decision, winning_decision, vote_tally, pitch_scores, pitch_used |
| trust_history | Per-round snapshot of all 4 trust values |
| done_reason | "runway_exhausted" / "acquisition" / "finished_10" / None |
| winning_decision | Last round's vote winner |
2. Profitability score
profitability_score = clamp(raw, 0, 100)
raw =
min(revenue / 8_000_000, 1.0) × 22 # revenue term (max 22)
+ max(0, 1 − burn_rate / 1_400_000) × 18 # burn efficiency (max 18)
+ min(runway_months / 18.0, 1.0) × 18 # runway term (max 18)
− max(0, (6 − runway_months) / 6) × 10 # low-runway penalty (bites < 6 mo)
+ min(market_share, 0.50) / 0.50 × 14 # market share (max 14)
+ product_readiness × 10 # product readiness (max 10)
+ team_morale × 7 # team morale (max 7)
+ investor_confidence × 11 # investor confidence (max 11)
− regulatory_risk × 18 # regulatory drag (max −18)
Initial state ≈ 37.3/100. Theoretical max = 100.
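The formula above translates directly into Python, which makes the quoted initial value easy to check. The sketch below assumes the state is a plain dict keyed by the field names from section 1; `INITIAL_STATE` is an illustrative constant, not a name from the codebase.

```python
def clamp(x: float, lo: float, hi: float) -> float:
    """Restrict x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def compute_profitability_score(s: dict) -> float:
    """Composite 0..100 score, term by term as in the formula above."""
    raw = (
        min(s["revenue"] / 8_000_000, 1.0) * 22            # revenue term
        + max(0.0, 1 - s["burn_rate"] / 1_400_000) * 18    # burn efficiency
        + min(s["runway_months"] / 18.0, 1.0) * 18         # runway term
        - max(0.0, (6 - s["runway_months"]) / 6) * 10      # low-runway penalty
        + min(s["market_share"], 0.50) / 0.50 * 14         # market share
        + s["product_readiness"] * 10                      # product readiness
        + s["team_morale"] * 7                             # team morale
        + s["investor_confidence"] * 11                    # investor confidence
        - s["regulatory_risk"] * 18                        # regulatory drag
    )
    return clamp(raw, 0.0, 100.0)


# Initial values from the state table in section 1.
INITIAL_STATE = {
    "revenue": 2_000_000,
    "burn_rate": 1_200_000,
    "runway_months": 14.0,
    "product_readiness": 0.45,
    "market_share": 0.08,
    "team_morale": 0.70,
    "investor_confidence": 0.65,
    "regulatory_risk": 0.20,
}

print(round(compute_profitability_score(INITIAL_STATE), 1))  # 37.3
```

The positive terms sum to 22 + 18 + 18 + 14 + 10 + 7 + 11 = 100, which is where the theoretical maximum comes from.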
3. Transition
next_state = current_state + consequences[winning_decision] × (1 + ε)
where ε ~ N(0, 0.15) per consequence value, fixed at episode reset (seeded)
runway_months -= _advance_runway() # depends on net cash flow
trust[role] ±= 0.08 per NPC # based on alignment with winning_decision
profitability_score = compute_profitability_score(next_state)
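A minimal sketch of the noisy transition, under stated assumptions: 0.15 is treated as the standard deviation of ε, results are clipped to the ranges from the state table, and consequences are purely additive (multiplicative consequences such as revenue_mult are not modelled here). The helper names `sample_noise` and `apply_consequences` are hypothetical.

```python
import random

# Ranges for a few fields, copied from the state table in section 1.
RANGES = {
    "revenue": (0.0, 1e12),
    "burn_rate": (0.0, 1e10),
    "team_morale": (0.0, 1.0),
}


def sample_noise(consequences: dict, seed: int, sigma: float = 0.15) -> dict:
    """Draw one epsilon per (option, metric) at reset; reuse it all episode."""
    rng = random.Random(seed)
    return {
        opt: {metric: rng.gauss(0.0, sigma) for metric in deltas}
        for opt, deltas in consequences.items()
    }


def apply_consequences(state: dict, deltas: dict, eps: dict, ranges: dict) -> dict:
    """Return a new state with each delta scaled by (1 + eps) and clipped."""
    nxt = dict(state)
    for metric, delta in deltas.items():
        lo, hi = ranges[metric]
        value = state[metric] + delta * (1.0 + eps[metric])
        nxt[metric] = max(lo, min(hi, value))
    return nxt
```

Because the noise table is drawn once from a seeded generator, replaying an episode with the same seed reproduces the same consequence magnitudes.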
Runway decrement
monthly_revenue = revenue / 12.0
net = monthly_revenue - burn_rate
if net >= 0:
    runway_months -= 0.5 # profitable: slow burn
else:
    burn_months = min(2.0, max(1.0, abs(net) / burn_rate + 1.0))
    runway_months -= burn_months # unprofitable: faster bleed
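As a standalone function the runway rule looks like the sketch below, which mirrors the snippet above and evaluates it on the initial state. The name `advance_runway` and its pure-function signature are illustrative; the environment's `_advance_runway()` may be structured differently.

```python
def advance_runway(revenue: float, burn_rate: float, runway_months: float) -> float:
    """Return the runway after one round, following the rule above."""
    monthly_revenue = revenue / 12.0
    net = monthly_revenue - burn_rate
    if net >= 0:
        burn_months = 0.5  # profitable: slow burn
    else:
        # Unprofitable: lose between 1 and 2 months, scaled by the shortfall.
        burn_months = min(2.0, max(1.0, abs(net) / burn_rate + 1.0))
    return runway_months - burn_months


# Initial state: revenue 2.0M USD/year, burn 1.2M USD/month, 14 months runway.
print(round(advance_runway(2_000_000, 1_200_000, 14.0), 2))  # 12.14
```

With the initial numbers, monthly revenue covers only about 14% of burn, so the first round costs roughly 1.86 months of runway rather than 1.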
Three layers of variability (no trajectory memorisation)
- Event order shuffled per episode: same 10 events, different sequence per seed.
- Consequence magnitudes ±15% Gaussian noise: sampled at reset(), fixed for the episode.
- NPC agendas ±25% sign-preserving jitter: _jitter_agendas(seed) perturbs base NPC priorities each episode.
4. Vote resolution
Vote weights
CEO: 2.5 CTO: 1.2 CFO: 1.0 Investor Rep: 1.3 Independent: 0.8
CEO weight 2.5 ensures a decisive CEO call usually wins — the agent's actions visibly move outcomes round-to-round. NPCs still matter via persuasion shifts and trust dynamics.
NPC option scoring (per NPC, per round)
for each option opt:
    score[opt] = Σ over (metric, weight) in NPC_agenda:
        consequences[opt][metric] × weight (with unit normalisation)
    score[opt] += N(0, 0.20) # personality noise
NPC votes for argmax(score)
margin = top_two_score_difference
confidence = clamp(0.5 + 0.5 × margin + (trust − 0.5)×0.30, 0.05, 1.0)
Unit normalisation in scoring: revenue /= 1e6, burn_rate /= 1e5, runway_months /= 6. revenue_mult consequences are scored against the current revenue × the agenda weight on revenue.
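The scoring rule can be sketched as follows. Assumptions are flagged in the comments: 0.20 is taken as the noise standard deviation, the revenue_mult special case is left out, and the function names and agenda format (metric to weight) are illustrative rather than taken from the environment.

```python
import random

# Divisors from the unit-normalisation note above; other metrics are unscaled.
UNIT_SCALE = {"revenue": 1e6, "burn_rate": 1e5, "runway_months": 6.0}


def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def score_options(consequences: dict, agenda: dict, rng: random.Random,
                  noise_sigma: float = 0.20) -> dict:
    """Agenda-weighted, unit-normalised score per option plus personality noise."""
    scores = {}
    for opt, deltas in consequences.items():
        total = 0.0
        for metric, weight in agenda.items():
            scaled = deltas.get(metric, 0.0) / UNIT_SCALE.get(metric, 1.0)
            total += scaled * weight
        scores[opt] = total + rng.gauss(0.0, noise_sigma)
    return scores


def npc_vote(scores: dict, trust: float) -> tuple:
    """Pick the argmax option and derive confidence from the top-two margin."""
    ranked = sorted(scores.values(), reverse=True)
    margin = ranked[0] - ranked[1] if len(ranked) > 1 else 0.0
    vote = max(scores, key=scores.get)
    confidence = clamp(0.5 + 0.5 * margin + (trust - 0.5) * 0.30, 0.05, 1.0)
    return vote, confidence
```

Since the noise is comparable in size to many normalised score gaps, an NPC facing two closely matched options can plausibly flip its vote between episodes.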
Pitch persuasion — semantic similarity, not keyword matching
ps_role = pitch_score(pitch, role) ∈ [0, 1]
# Persuasion redirects up to 55% of the NPC's vote weight to CEO's pick:
shift_frac = 0.55 × ps_role
tally[NPC_vote] += base_weight × (1 − shift_frac)
tally[CEO_decision] += base_weight × shift_frac
Where base_weight = ROLE_WEIGHT[role] × confidence × clamp(trust[role] × 2, 0.5, 1.5).
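Putting the weights, confidence, trust multiplier, and persuasion shift together gives a tally like the sketch below. It assumes the CEO's own weight of 2.5 is added directly to the CEO's pick; the function name and argument layout are hypothetical.

```python
ROLE_WEIGHT = {"CTO": 1.2, "CFO": 1.0, "Investor Rep": 1.3, "Independent": 0.8}
CEO_WEIGHT = 2.5


def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def build_tally(ceo_decision: str, npc_votes: dict, confidence: dict,
                trust: dict, pitch_scores: dict, options: list) -> dict:
    """Weighted vote tally with up to 55% of each NPC's weight redirected."""
    tally = {opt: 0.0 for opt in options}
    tally[ceo_decision] += CEO_WEIGHT  # assumption: CEO weight counted as-is
    for role, vote in npc_votes.items():
        base_weight = (ROLE_WEIGHT[role] * confidence[role]
                       * clamp(trust[role] * 2, 0.5, 1.5))
        shift_frac = 0.55 * pitch_scores[role]
        tally[vote] += base_weight * (1 - shift_frac)
        tally[ceo_decision] += base_weight * shift_frac
    return tally
```

In the extreme case where all four NPCs oppose the CEO at full confidence and neutral trust, the opposition holds 4.3 against the CEO's 2.5; a perfect pitch (score 1.0 for every role) moves 55% of that 4.3 across, flipping the result to roughly 4.87 versus 1.94.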
The pitch scorer (_PitchScorer in board_sim_env_environment.py) has two backends:
- Sentence-transformer (primary): all-MiniLM-L6-v2, normalised cosine. score = clamp((cosine + 0.05) × 1.2, 0, 1). Genuine sentence embeddings: semantically aligned arguments score high even with no shared tokens.
- TF-IDF fallback: (1,2)-grams, English stop-words removed, IDF-weighted bag-of-bigrams cosine vs the role's manifesto. score = clamp(cosine × 1.4, 0, 1). Token-based but properly stop-worded and IDF-weighted, so already much more robust than a literal keyword count.
Set BOARDSIM_PITCH_BACKEND=tfidf to force the fallback (e.g. for CI without the embedding model).
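The two cosine-to-score mappings differ only in offset and gain. The sketch below isolates them so the saturation points are easy to see; the function names are illustrative and the embedding or TF-IDF computation that produces the cosine is out of scope here.

```python
def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def embedding_pitch_score(cosine: float) -> float:
    """Sentence-transformer backend: small offset, gain 1.2."""
    return clamp((cosine + 0.05) * 1.2, 0.0, 1.0)


def tfidf_pitch_score(cosine: float) -> float:
    """TF-IDF fallback: no offset, gain 1.4."""
    return clamp(cosine * 1.4, 0.0, 1.0)
```

Under these mappings the embedding score saturates at 1.0 once the cosine reaches about 0.78, and the TF-IDF score once it reaches about 0.71, so scores from the two backends are not directly comparable.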
NPC manifestos (the hidden objective the CEO must infer)
| Role | Manifesto (paraphrased) |
|---|---|
| CTO | Operational excellence, engineering quality, team morale, technical risk reduction. |
| CFO | Capital discipline, runway, balance-sheet protection, regulatory caution. |
| Investor Rep | Growth, market share, ambitious returns, decisive bold bets. |
| Independent | Long-term reputation, governance, stakeholder trust, ethical responsibility. |
The full text lives in NPC_MANIFESTOS in the environment file.
Tie-breaking
If two options tie in the tally, the CEO's pick wins (implementation: insert agent_decision first into the ordered tally before max()).
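The tie-break relies on two Python guarantees: dicts preserve insertion order, and max() returns the first maximal element it encounters. A minimal sketch, with a hypothetical helper name:

```python
def pick_winner(tally: dict, agent_decision: str) -> str:
    """Return the top-weighted option; ties resolve to the CEO's pick."""
    # Re-order so the CEO's pick is iterated first.
    ordered = {agent_decision: tally.get(agent_decision, 0.0)}
    for opt, weight in tally.items():
        if opt != agent_decision:
            ordered[opt] = weight
    # max() keeps the first of several equal maxima, i.e. the CEO's pick.
    return max(ordered, key=ordered.get)


print(pick_winner({"A": 3.0, "B": 3.0}, agent_decision="B"))  # B
```

The ordering only matters on an exact tie; a strictly larger tally for another option still wins.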
5. Reward formula
Applied at the end of each step() call:
# Primary signal — normalised profitability delta
reward = (new_score − old_score) / 100.0
# Coalition bonus / penalty (magnitudes raised so CEO impact is visible)
reward += 1.0 if winning_decision == agent_decision else −0.4
# Trust delta term
reward += 0.5 × (Σ trust_after − Σ trust_before)
# Pitch bootstrap + semantic persuasion
if pitch is non-empty:
reward += 0.05
if any NPC opposed the CEO's pick:
reward += 0.6 × mean(pitch_score over opposing NPCs)
# Format penalty
if action.decision not in current_round.options:
reward −= 0.5
# Terminal
if runway_months <= 0:
reward −= 2.0 # bankruptcy
if terminal:
reward += event._terminal_bonus # acquisition +30, IPO +25, stay-private +5, etc.
reward += {+10 if final ≥ 60, +5 if ≥ 40, −5 if < 20}
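For reference, the reward terms can be assembled into one function. This is a sketch under assumptions, not the environment's implementation: the persuasion bonus is assumed to apply only when the pitch is non-empty, the tiered bonus is assumed to apply only on terminal steps, and the signature is invented for illustration.

```python
def step_reward(old_score: float, new_score: float, agent_won: bool,
                trust_before: list, trust_after: list, pitch: str,
                opposing_pitch_scores: list, decision_valid: bool,
                runway_months: float, terminal: bool = False,
                terminal_bonus: float = 0.0) -> float:
    """Combine the reward terms listed above for a single step."""
    reward = (new_score - old_score) / 100.0          # profitability delta
    reward += 1.0 if agent_won else -0.4              # coalition term
    reward += 0.5 * (sum(trust_after) - sum(trust_before))  # trust delta
    if pitch.strip():
        reward += 0.05                                # pitch bootstrap
        if opposing_pitch_scores:                     # assumption: nested here
            mean_score = sum(opposing_pitch_scores) / len(opposing_pitch_scores)
            reward += 0.6 * mean_score                # semantic persuasion
    if not decision_valid:
        reward -= 0.5                                 # format penalty
    if runway_months <= 0:
        reward -= 2.0                                 # bankruptcy
    if terminal:
        reward += terminal_bonus                      # event-specific bonus
        if new_score >= 60:
            reward += 10
        elif new_score >= 40:
            reward += 5
        elif new_score < 20:
            reward -= 5
    return reward
```

For a non-terminal step where the score rises from 37 to 39, the CEO wins the vote, total trust is unchanged, and one opposing NPC scores the pitch at 0.5, the terms add up to 0.02 + 1.0 + 0 + 0.05 + 0.3 = 1.37.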
| Term | Purpose |
|---|---|
| Δ score / 100 | Primary learning signal: profitability improvement per decision |
| Coalition +1.0 / −0.4 | Teaches the agent to actually win votes, not pick "good-looking" options |
| Trust × 0.5 | Rewards long-arc coalition building across rounds |
| Pitch bootstrap +0.05 | Ensures the pitch channel is exercised before the model is good enough to earn semantic bonuses |
| Pitch persuasion × 0.6 | Rewards pitches semantically aligned with opposing NPC manifestos (ToM signal) |
| Invalid −0.5 | Format-compliance signal (DECISION: / PITCH: two-line structure) |
| Bankruptcy −2.0 | Episode-ending failure signal |
| Terminal tiered | Long-horizon incentive toward high profitability, acquisition, or IPO |
6. Step ordering
1. old_score = compute_profitability_score(state) # snapshot BEFORE
2. NPC votes computed from current state + trust
3. CEO decision + pitch → _resolve_vote() → winning_decision
4. consequences[winning_decision] × noise → applied to state
5. _advance_runway()
6. trust updated per NPC (±0.08)
7. new_score = compute_profitability_score(state) # AFTER consequences
8. reward = (new_score − old_score)/100 + coalition + trust + pitch + ...
9. next observation returned with new_score in obs.state
The CEO cannot consult the updated profitability when making its decision: it sees the previous round's score in the observation, emits a decision, and only then does the score update. Profitability is the outcome metric, not a planning input.
7. Training pipeline
Per-round gradient flow
The training loop samples one completion per round, per group member. Every one of the 10 decisions in a trajectory contributes gradient signal — not just the opening decision.
For each training step:
    Create GROUP_SIZE independent envs (different seeds)
    For each round r in 0..9:
        For each group member g:
            prompt = build_prompt(obs_g)
            completion = model.generate(prompt, do_sample=True) # gradient-connected
            obs_g = env_g.step(parse(completion))
            ep_reward[g] += obs_g.reward
    advantages = GRPO(ep_rewards) # group-relative normalisation
    For each (g, r) completion:
        loss = advantage[g] × NLL(completion) / (GROUP_SIZE × n_rounds)
            + β_KL × KL(π_θ || π_ref)
    optimizer.step()
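The group-relative normalisation step can be sketched as below. It assumes the common GRPO form (subtract the group mean, divide by the group standard deviation); whether the training code uses the population or sample standard deviation, or a different epsilon, is not stated in this document.

```python
from statistics import mean, pstdev


def group_relative_advantages(ep_rewards: list, eps: float = 1e-8) -> list:
    """Normalise episode rewards within one group to zero mean, unit scale.

    Assumption: population standard deviation with a small epsilon guard.
    """
    mu = mean(ep_rewards)
    sigma = pstdev(ep_rewards)
    return [(r - mu) / (sigma + eps) for r in ep_rewards]
```

When every group member earns the same episode reward, all advantages are zero and that training step contributes no policy gradient, which is why the groups run on different seeds.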
KL penalty
A frozen reference model computes reference log-probs. KL ≈ current_loss − ref_loss per completion, clamped at 0. Coefficient β = 0.04. Prevents drift into degenerate text patterns (always emitting the same decision, empty pitches).
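A scalar sketch of the per-completion loss with the clamped KL estimate is shown below. In training these quantities would be tensors; plain floats are used here only to make the arithmetic visible. The function name is hypothetical, and the 1 / (GROUP_SIZE × n_rounds) factor is applied to the policy term only, as the pseudocode above is written.

```python
def completion_loss(advantage: float, current_nll: float, ref_nll: float,
                    group_size: int, n_rounds: int = 10,
                    beta_kl: float = 0.04) -> float:
    """Advantage-weighted NLL plus a KL penalty estimated from the loss gap."""
    # KL estimate: current loss minus reference loss, clamped at 0.
    kl_estimate = max(0.0, current_nll - ref_nll)
    policy_term = advantage * current_nll / (group_size * n_rounds)
    return policy_term + beta_kl * kl_estimate
```

With advantage 1.0, a current NLL of 2.0, a reference NLL of 1.5, and a group of 4, the policy term is 2.0 / 40 = 0.05 and the KL term is 0.04 × 0.5 = 0.02.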
Reward normalisation
Three normalisations in the reward function so terms are commensurate:
- Δ score ÷ 100 — brings profitability delta into the same scale as the coalition term.
- Bankruptcy penalty −2 (was −5) — one bad arc no longer drowns 9 rounds of positive signal.
- Pitch bootstrap +0.05 — kickstarts the pitch channel before the model is good enough to earn semantic bonuses.
8. The baseline — same Qwen3-0.6B with LoRA disabled
Earlier revisions compared the trained policy against a uniform-random policy. A coin flip is not a meaningful opponent for a 0.6B-parameter language model picking among 3 well-formed strings: it can only show that the LM ≠ noise, which is not the relevant question.
The current baseline runs the same Qwen3-0.6B, on the same paired seeds, with the LoRA adapter context-managed off. Implementation (see Training.py / notebooks/train_grpo_v2.ipynb):
# Fine-tuned (LoRA active)
trained_finals = run_episodes(model, seeds=HELDOUT)
# Same model, LoRA disabled — apples-to-apples base reference.
with model.disable_adapter():
base_finals = run_episodes(model, seeds=HELDOUT)
Statistical comparison on the per-seed paired delta trained − base:
- Paired t-test
- Wilcoxon signed-rank
- Cohen's d
- Bootstrap 95% CI on the mean delta
- Win-rate (fraction of seeds where trained > base)
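Most of these statistics need nothing beyond the standard library. The sketch below computes the paired t statistic, the paired form of Cohen's d (mean delta over the standard deviation of deltas, which is an assumption about the variant used), a percentile bootstrap CI, and the win-rate. The Wilcoxon test and p-values are left to SciPy (`scipy.stats.wilcoxon`, `scipy.stats.ttest_rel`). The function name and return format are illustrative.

```python
import math
import random
from statistics import mean, stdev


def paired_summary(trained: list, base: list, n_boot: int = 2000,
                   seed: int = 0) -> dict:
    """Summarise per-seed paired deltas (trained - base)."""
    deltas = [t - b for t, b in zip(trained, base)]
    n = len(deltas)
    mu = mean(deltas)
    sd = stdev(deltas)  # sample standard deviation of the deltas
    # Guard against zero variance (all deltas identical).
    t_stat = mu / (sd / math.sqrt(n)) if sd > 0 else float("nan")
    cohens_d = mu / sd if sd > 0 else float("nan")
    # Percentile bootstrap on the mean delta, seeded for reproducibility.
    rng = random.Random(seed)
    boot = sorted(mean(rng.choices(deltas, k=n)) for _ in range(n_boot))
    ci_95 = (boot[int(0.025 * n_boot)], boot[int(0.975 * n_boot) - 1])
    win_rate = sum(d > 0 for d in deltas) / n
    return {"mean_delta": mu, "t_stat": t_stat, "cohens_d": cohens_d,
            "ci_95": ci_95, "win_rate": win_rate}
```

Because both runs use the same held-out seeds, the per-seed delta cancels most of the episode-to-episode variance, which is the main reason to prefer paired tests here over comparing the two group means.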
9. Theory-of-Mind — what's actually measured
ToM in this environment has a specific, narrow meaning: can the agent infer what each NPC privately values, given only their statements and prior votes?
It is graded two ways:
- Pitch persuasion score: cosine(SBERT(pitch), SBERT(role_manifesto)). A pitch that genuinely articulates the role's priorities scores above ~0.4; a pitch that is merely topically adjacent scores ~0.1; off-topic pitches score ~0.0. This replaces the earlier keyword-overlap metric, which the agent could trivially game.
- ToM probe: ask the model to name the SINGLE board member most likely to oppose its chosen decision. Random baseline = 25% (1 of 4). The probe is run for both the fine-tuned policy and the disable-adapter base; the delta isolates what fine-tuning taught the model about its boardroom.
Trust trajectory across 10 rounds is a secondary diagnostic: rising trust for 3+ NPCs indicates the agent is consistently picking decisions aligned with their private preferences, which requires implicit modelling of those preferences.
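The trust diagnostic can be computed from trust_history with a few lines. This sketch assumes each snapshot is a dict mapping role to trust, and defines "rising" as the final snapshot exceeding the first; both are assumptions, since the document does not fix the snapshot format or the exact criterion.

```python
def rising_trust_roles(trust_history: list) -> list:
    """Return roles whose trust in the last snapshot exceeds the first.

    Assumption: trust_history is a list of {role: trust} dicts, one per round.
    """
    first, last = trust_history[0], trust_history[-1]
    return [role for role in first if last[role] > first[role]]


def coalition_building_signal(trust_history: list, min_roles: int = 3) -> bool:
    """True when trust rose for at least `min_roles` NPCs over the episode."""
    return len(rising_trust_roles(trust_history)) >= min_roles
```

A stricter variant could require a positive slope across all rounds rather than comparing only the endpoints.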