| # BoardSim — Full Mechanics Reference |
|
|
| > Authoritative math and design reference for the BoardSim environment |
| > (organisation-agnostic boardroom simulation). |
| > Target audience: hackathon judges who want internals, and future contributors. |
| > See `README.md` for the submission overview. |
|
|
| --- |
|
|
| ## 1. State variables |
|
|
| State lives in `BoardState.state_dict`, initialised in `BoardSimEnvironment.reset()` at `envs/board_sim_env/server/board_sim_env_environment.py:536`. |
|
|
| ### Core company state (mutated each round by event consequences) |
|
|
| | Field | Initial | Range | Unit | Meaning | |
| |---|---|---|---|---| |
| | `revenue` | 2,000,000 | [0, 1e12] | USD/year | Annual recurring revenue | |
| | `burn_rate` | 1,200,000 | [0, 1e10] | USD/month | Monthly cash expenditure | |
| | `runway_months` | 14.0 | [0, 120] | months | Time until cash = 0 | |
| | `product_readiness` | 0.45 | [0, 1] | fraction | Shippability / quality of the product | |
| | `market_share` | 0.08 | [0, 1] | fraction | % of total addressable market | |
| | `team_morale` | 0.70 | [0, 1] | fraction | Team retention / engagement signal | |
| | `investor_confidence` | 0.65 | [0, 1] | fraction | Board investors' belief in success | |
| | `regulatory_risk` | 0.20 | [0, 1] | fraction | Legal / compliance exposure | |
|
|
| ### Coalition state |
|
|
| | Field | Initial | Range | Update rule | |
| |---|---|---|---| |
| | `trust[CTO]` | 0.5 | [0.1, 1.0] | ±0.08 per round depending on alignment with the *winning* decision | |
| | `trust[CFO]` | 0.5 | [0.1, 1.0] | same | |
| | `trust[Investor Rep]` | 0.5 | [0.1, 1.0] | same | |
| | `trust[Independent]` | 0.5 | [0.1, 1.0] | same | |
|
|
| Trust feeds back via two channels: |
| - **NPC confidence**: `confidence += (trust − 0.5) × 0.30`, clipped. |
| - **Vote weight multiplier**: `trust_mult = clamp(trust × 2.0, 0.5, 1.5)` applied to that NPC's tally contribution next round. |
|
|
| ### Bookkeeping |
|
|
| | Field | Purpose | |
| |---|---| |
| | `round` | 1..10, increments each step | |
| | `profitability_score` | Composite recomputed at end of each step | |
| | `history` | Per-round log: agent_decision, winning_decision, vote_tally, pitch_scores, pitch_used | |
| | `trust_history` | Per-round snapshot of all 4 trust values | |
| | `done_reason` | `"runway_exhausted"` / `"acquisition"` / `"finished_10"` / `None` | |
| | `winning_decision` | Last round's vote winner | |
|
|
| --- |
|
|
| ## 2. Profitability score |
|
|
| ``` |
| profitability_score = clamp(raw, 0, 100) |
| |
| raw = |
| min(revenue / 8_000_000, 1.0) × 22 # revenue term (max 22) |
| + max(0, 1 − burn_rate / 1_400_000) × 18 # burn efficiency (max 18) |
| + min(runway_months / 18.0, 1.0) × 18 # runway term (max 18) |
| − max(0, (6 − runway_months) / 6) × 10 # low-runway penalty (bites < 6 mo) |
| + min(market_share, 0.50) / 0.50 × 14 # market share (max 14) |
| + product_readiness × 10 # product readiness (max 10) |
| + team_morale × 7 # team morale (max 7) |
| + investor_confidence × 11 # investor confidence (max 11) |
| − regulatory_risk × 18 # regulatory drag (max −18) |
| ``` |
|
|
| Initial state ≈ 37.3/100. Theoretical max = 100. |
|
|
| --- |
|
|
| ## 3. Transition |
|
|
| ``` |
| next_state = current_state + consequences[winning_decision] × (1 + ε) |
| where ε ~ N(0, 0.15) per consequence value, fixed at episode reset (seeded) |
| |
| runway_months -= _advance_runway() # depends on net cash flow |
| trust[role] ±= 0.08 per NPC # based on alignment with winning_decision |
| profitability_score = compute_profitability_score(next_state) |
| ``` |
|
|
| ### Runway decrement |
|
|
| ```python |
| monthly_revenue = revenue / 12.0 |
| net = monthly_revenue - burn_rate |
| if net >= 0: |
| runway_months -= 0.5 # profitable: slow burn |
| else: |
| burn_months = min(2.0, max(1.0, abs(net) / burn_rate + 1.0)) |
| runway_months -= burn_months # unprofitable: faster bleed |
| ``` |
|
|
| ### Three layers of variability (no trajectory memorisation) |
|
|
| 1. **Event order shuffled per episode** — same 10 events, different sequence per seed. |
| 2. **Consequence magnitudes ±15% Gaussian noise** — sampled at `reset()`, fixed for the episode. |
| 3. **NPC agendas ±25% sign-preserving jitter** — `_jitter_agendas(seed)` perturbs base NPC priorities each episode. |
|
|
| --- |
|
|
| ## 4. Vote resolution |
|
|
| ### Vote weights |
|
|
| ``` |
| CEO: 2.5 CTO: 1.2 CFO: 1.0 Investor Rep: 1.3 Independent: 0.8 |
| ``` |
|
|
| CEO weight 2.5 ensures a decisive CEO call usually wins — the agent's actions visibly move outcomes round-to-round. NPCs still matter via persuasion shifts and trust dynamics. |
|
|
| ### NPC option scoring (per NPC, per round) |
|
|
| ``` |
| for each option opt: |
| score[opt] = Σ over (metric, weight) in NPC_agenda: |
| consequences[opt][metric] × weight (with unit normalisation) |
| score[opt] += N(0, 0.20) # personality noise |
| |
| NPC votes for argmax(score) |
| margin = top_two_score_difference |
| confidence = clamp(0.5 + 0.5 × margin + (trust − 0.5)×0.30, 0.05, 1.0) |
| ``` |
|
|
| Unit normalisation in scoring: `revenue /= 1e6`, `burn_rate /= 1e5`, `runway_months /= 6`. `revenue_mult` consequences are scored against the current `revenue` × the agenda weight on `revenue`. |
|
|
| ### Pitch persuasion — semantic similarity, not keyword matching |
|
|
| ``` |
| ps_role = pitch_score(pitch, role) ∈ [0, 1] |
| |
| # Persuasion redirects up to 55% of the NPC's vote weight to CEO's pick: |
| shift_frac = 0.55 × ps_role |
| tally[NPC_vote] += base_weight × (1 − shift_frac) |
| tally[CEO_decision] += base_weight × shift_frac |
| ``` |
|
|
| Where `base_weight = ROLE_WEIGHT[role] × confidence × clamp(trust[role] × 2, 0.5, 1.5)`. |
|
|
| The pitch scorer (`_PitchScorer` in `board_sim_env_environment.py`) has two backends: |
|
|
| 1. **Sentence-transformer (primary)**: `all-MiniLM-L6-v2`, normalised cosine. `score = clamp((cosine + 0.05) × 1.2, 0, 1)`. Genuine sentence embeddings — semantically aligned arguments score high even with no shared tokens. |
| 2. **TF-IDF fallback**: `(1,2)`-grams, English stop-words removed, IDF-weighted bag-of-bigrams cosine vs the role's manifesto. `score = clamp(cosine × 1.4, 0, 1)`. Token-based but properly stop-worded and IDF-weighted — already much more robust than a literal keyword count. |
|
|
| Set `BOARDSIM_PITCH_BACKEND=tfidf` to force the fallback (e.g. for CI without the embedding model). |
|
|
| ### NPC manifestos (the hidden objective the CEO must infer) |
|
|
| | Role | Manifesto (paraphrased) | |
| |---|---| |
| | CTO | Operational excellence, engineering quality, team morale, technical risk reduction. | |
| | CFO | Capital discipline, runway, balance-sheet protection, regulatory caution. | |
| | Investor Rep | Growth, market share, ambitious returns, decisive bold bets. | |
| | Independent | Long-term reputation, governance, stakeholder trust, ethical responsibility. | |
|
|
| The full text lives in `NPC_MANIFESTOS` in the environment file. |
|
|
| ### Tie-breaking |
|
|
| If two options tie in the tally, the CEO's pick wins (implementation: insert `agent_decision` first into the ordered tally before `max()`). |
|
|
| --- |
|
|
| ## 5. Reward formula |
|
|
| Applied at the end of each `step()` call: |
|
|
| ``` |
| # Primary signal — normalised profitability delta |
| reward = (new_score − old_score) / 100.0 |
| |
| # Coalition bonus / penalty (magnitudes raised so CEO impact is visible) |
| reward += 1.0 if winning_decision == agent_decision else −0.4 |
| |
| # Trust delta term |
| reward += 0.5 × (Σ trust_after − Σ trust_before) |
| |
| # Pitch bootstrap + semantic persuasion |
| if pitch is non-empty: |
| reward += 0.05 |
| if any NPC opposed the CEO's pick: |
| reward += 0.6 × mean(pitch_score over opposing NPCs) |
| |
| # Format penalty |
| if action.decision not in current_round.options: |
| reward −= 0.5 |
| |
| # Terminal |
| if runway_months <= 0: |
| reward −= 2.0 # bankruptcy |
| if terminal: |
| reward += event._terminal_bonus # acquisition +30, IPO +25, stay-private +5, etc. |
| reward += {+10 if final ≥ 60, +5 if ≥ 40, −5 if < 20} |
| ``` |
|
|
| | Term | Purpose | |
| |---|---| |
| | Δ score / 100 | Primary learning signal: profitability improvement per decision | |
| | Coalition ±1.0 / −0.4 | Teaches the agent to actually win votes, not pick "good-looking" options | |
| | Trust × 0.5 | Rewards long-arc coalition building across rounds | |
| | Pitch bootstrap +0.05 | Ensures the pitch channel is exercised before the model is good enough to earn semantic bonuses | |
| | Pitch persuasion × 0.6 | Rewards pitches semantically aligned with opposing NPC manifestos (ToM signal) | |
| | Invalid −0.5 | Format-compliance signal (DECISION: / PITCH: two-line structure) | |
| | Bankruptcy −2.0 | Episode-ending failure signal | |
| | Terminal tiered | Long-horizon incentive toward high profitability, acquisition, or IPO | |
|
|
| --- |
|
|
| ## 6. Step ordering |
|
|
| ``` |
| 1. old_score = compute_profitability_score(state) # snapshot BEFORE |
| 2. NPC votes computed from current state + trust |
| 3. CEO decision + pitch → _resolve_vote() → winning_decision |
| 4. consequences[winning_decision] × noise → applied to state |
| 5. _advance_runway() |
| 6. trust updated per NPC (±0.08) |
| 7. new_score = compute_profitability_score(state) # AFTER consequences |
| 8. reward = (new_score − old_score)/100 + coalition + trust + pitch + ... |
| 9. next observation returned with new_score in obs.state |
| ``` |
|
|
| The CEO **never consults profitability to make its decision** — it sees the previous round's score in the observation, emits a decision, and then the score updates. Profitability is the *outcome metric*, not a planning input. |
|
|
| --- |
|
|
| ## 7. Training pipeline |
|
|
| ### Per-round gradient flow |
|
|
| The training loop samples one completion per round, per group member. Every one of the 10 decisions in a trajectory contributes gradient signal — not just the opening decision. |
|
|
| ``` |
| For each training step: |
| Create GROUP_SIZE independent envs (different seeds) |
| For each round r in 0..9: |
| For each group member g: |
| prompt = build_prompt(obs_g) |
| completion = model.generate(prompt, do_sample=True) # gradient-connected |
| obs_g = env_g.step(parse(completion)) |
| ep_reward[g] += obs_g.reward |
| advantages = GRPO(ep_rewards) # group-relative normalisation |
| For each (g, r) completion: |
| loss = advantage[g] × NLL(completion) / (GROUP_SIZE × n_rounds) |
| + β_KL × KL(π_θ || π_ref) |
| optimizer.step() |
| ``` |
|
|
| ### KL penalty |
|
|
| A frozen reference model computes reference log-probs. KL ≈ `current_loss − ref_loss` per completion, clamped at 0. Coefficient β = 0.04. Prevents drift into degenerate text patterns (always emitting the same decision, empty pitches). |
|
|
| ### Reward normalisation |
|
|
| Three normalisations in the reward function so terms are commensurate: |
| 1. **Δ score ÷ 100** — brings profitability delta into the same scale as the coalition term. |
| 2. **Bankruptcy penalty −2** (was −5) — one bad arc no longer drowns 9 rounds of positive signal. |
| 3. **Pitch bootstrap +0.05** — kickstarts the pitch channel before the model is good enough to earn semantic bonuses. |
|
|
| --- |
|
|
| ## 8. The baseline — same Qwen3-0.6B with LoRA disabled |
|
|
| Earlier revisions compared the trained policy against a uniform-random policy. A coin flip is not a meaningful opponent for a 4 B language model picking among 3 well-formed strings — it can only highlight that the LM ≠ noise, which is not the relevant question. |
|
|
| The current baseline runs **the same Qwen3-0.6B**, on the **same paired seeds**, with the LoRA adapter context-managed off. Implementation (see `Training.py` / `notebooks/train_grpo_v2.ipynb`): |
|
|
| ```python |
| # Fine-tuned (LoRA active) |
| trained_finals = run_episodes(model, seeds=HELDOUT) |
| |
| # Same model, LoRA disabled — apples-to-apples base reference. |
| with model.disable_adapter(): |
| base_finals = run_episodes(model, seeds=HELDOUT) |
| ``` |
|
|
| Statistical comparison on the per-seed paired delta `trained − base`: |
|
|
| - Paired t-test |
| - Wilcoxon signed-rank |
| - Cohen's d |
| - Bootstrap 95% CI on the mean delta |
| - Win-rate (fraction of seeds where trained > base) |
|
|
| --- |
|
|
| ## 9. Theory-of-Mind — what's actually measured |
|
|
| ToM in this environment has a specific, narrow meaning: **can the agent infer what each NPC privately values**, given only their statements and prior votes? |
|
|
| It is graded two ways: |
|
|
| 1. **Pitch persuasion score**: `cosine(SBERT(pitch), SBERT(role_manifesto))`. A pitch that genuinely articulates the role's priorities scores above ~0.4; a pitch that is merely topically adjacent scores ~0.1; off-topic pitches score ~0.0. This replaces the earlier keyword-overlap metric, which the agent could trivially game. |
| 2. **ToM probe**: ask the model to name the SINGLE board member most likely to *oppose* its chosen decision. Random baseline = 25% (1 of 4). The probe is run for both the fine-tuned policy and the disable-adapter base — the delta isolates what fine-tuning taught the model about its boardroom. |
|
|
| Trust trajectory across 10 rounds is a secondary diagnostic: rising trust for 3+ NPCs indicates the agent is consistently picking decisions aligned with their private preferences, which requires implicit modelling of those preferences. |
|
|