# BoardSim — Full Mechanics Reference
> Authoritative math and design reference for the BoardSim environment
> (organisation-agnostic boardroom simulation).
> Target audience: hackathon judges who want internals, and future contributors.
> See `README.md` for the submission overview.
---
## 1. State variables
State lives in `BoardState.state_dict`, initialised in `BoardSimEnvironment.reset()` at `envs/board_sim_env/server/board_sim_env_environment.py:536`.
### Core company state (mutated each round by event consequences)
| Field | Initial | Range | Unit | Meaning |
|---|---|---|---|---|
| `revenue` | 2,000,000 | [0, 1e12] | USD/year | Annual recurring revenue |
| `burn_rate` | 1,200,000 | [0, 1e10] | USD/month | Monthly cash expenditure |
| `runway_months` | 14.0 | [0, 120] | months | Time until cash = 0 |
| `product_readiness` | 0.45 | [0, 1] | fraction | Shippability / quality of the product |
| `market_share` | 0.08 | [0, 1] | fraction | Share of total addressable market |
| `team_morale` | 0.70 | [0, 1] | fraction | Team retention / engagement signal |
| `investor_confidence` | 0.65 | [0, 1] | fraction | Board investors' belief in success |
| `regulatory_risk` | 0.20 | [0, 1] | fraction | Legal / compliance exposure |
### Coalition state
| Field | Initial | Range | Update rule |
|---|---|---|---|
| `trust[CTO]` | 0.5 | [0.1, 1.0] | ±0.08 per round depending on alignment with the *winning* decision |
| `trust[CFO]` | 0.5 | [0.1, 1.0] | same |
| `trust[Investor Rep]` | 0.5 | [0.1, 1.0] | same |
| `trust[Independent]` | 0.5 | [0.1, 1.0] | same |
Trust feeds back via two channels:
- **NPC confidence**: `confidence += (trust − 0.5) × 0.30`, clipped.
- **Vote weight multiplier**: `trust_mult = clamp(trust × 2.0, 0.5, 1.5)` applied to that NPC's tally contribution next round.
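A minimal sketch of the trust update and its two feedback channels. It assumes that "alignment" means the NPC's own vote matched the winning decision; the function names are illustrative and are not taken from the environment file.

```python
def clamp(x: float, lo: float, hi: float) -> float:
    """Restrict x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def update_trust(trust: float, aligned: bool, step: float = 0.08) -> float:
    """Move trust up or down by one step and keep it in [0.1, 1.0]."""
    delta = step if aligned else -step
    return clamp(trust + delta, 0.1, 1.0)


def confidence_shift(trust: float) -> float:
    """Additive confidence adjustment: (trust - 0.5) * 0.30."""
    return (trust - 0.5) * 0.30


def trust_multiplier(trust: float) -> float:
    """Vote-weight multiplier: clamp(trust * 2.0, 0.5, 1.5)."""
    return clamp(trust * 2.0, 0.5, 1.5)


# Three aligned rounds in a row, starting from the initial trust of 0.5.
trust = 0.5
for aligned in (True, True, True):
    trust = update_trust(trust, aligned)

print(round(trust, 2))                    # 0.74
print(round(confidence_shift(trust), 3))  # 0.072
print(round(trust_multiplier(trust), 2))  # 1.48
```

Because the multiplier saturates at 1.5, trust above 0.75 no longer increases an NPC's vote weight, although it still raises confidence.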
### Bookkeeping
| Field | Purpose |
|---|---|
| `round` | 1..10, increments each step |
| `profitability_score` | Composite recomputed at end of each step |
| `history` | Per-round log: agent_decision, winning_decision, vote_tally, pitch_scores, pitch_used |
| `trust_history` | Per-round snapshot of all 4 trust values |
| `done_reason` | `"runway_exhausted"` / `"acquisition"` / `"finished_10"` / `None` |
| `winning_decision` | Last round's vote winner |
---
## 2. Profitability score
```
profitability_score = clamp(raw, 0, 100)
raw =
      min(revenue / 8_000_000, 1.0) × 22          # revenue term (max 22)
    + max(0, 1 − burn_rate / 1_400_000) × 18      # burn efficiency (max 18)
    + min(runway_months / 18.0, 1.0) × 18         # runway term (max 18)
    − max(0, (6 − runway_months) / 6) × 10        # low-runway penalty (bites < 6 mo)
    + min(market_share, 0.50) / 0.50 × 14         # market share (max 14)
    + product_readiness × 10                      # product readiness (max 10)
    + team_morale × 7                             # team morale (max 7)
    + investor_confidence × 11                    # investor confidence (max 11)
    − regulatory_risk × 18                        # regulatory drag (max −18)
```
Initial state ≈ 37.3/100. Theoretical max = 100.
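The formula can be checked against the initial state from section 1. The sketch below transcribes it into Python; the function body is a reconstruction from the formula above, not a copy of the environment's `compute_profitability_score`.

```python
def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def compute_profitability_score(s: dict) -> float:
    """Composite score in [0, 100], term by term as in the formula above."""
    raw = (
        min(s["revenue"] / 8_000_000, 1.0) * 22           # revenue term
        + max(0.0, 1 - s["burn_rate"] / 1_400_000) * 18   # burn efficiency
        + min(s["runway_months"] / 18.0, 1.0) * 18        # runway term
        - max(0.0, (6 - s["runway_months"]) / 6) * 10     # low-runway penalty
        + min(s["market_share"], 0.50) / 0.50 * 14        # market share
        + s["product_readiness"] * 10
        + s["team_morale"] * 7
        + s["investor_confidence"] * 11
        - s["regulatory_risk"] * 18                       # regulatory drag
    )
    return clamp(raw, 0.0, 100.0)


INITIAL_STATE = {
    "revenue": 2_000_000,
    "burn_rate": 1_200_000,
    "runway_months": 14.0,
    "product_readiness": 0.45,
    "market_share": 0.08,
    "team_morale": 0.70,
    "investor_confidence": 0.65,
    "regulatory_risk": 0.20,
}

print(round(compute_profitability_score(INITIAL_STATE), 2))  # 37.26
```

The positive weights (22 + 18 + 18 + 14 + 10 + 7 + 11) sum to 100, which is where the theoretical maximum comes from.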
---
## 3. Transition
```
next_state = current_state + consequences[winning_decision] × (1 + ε)
where ε ~ N(0, 0.15) per consequence value, fixed at episode reset (seeded)
runway_months -= _advance_runway() # depends on net cash flow
trust[role] ±= 0.08 per NPC # based on alignment with winning_decision
profitability_score = compute_profitability_score(next_state)
```
### Runway decrement
```python
monthly_revenue = revenue / 12.0
net = monthly_revenue - burn_rate
if net >= 0:
    runway_months -= 0.5          # profitable: slow burn
else:
    burn_months = min(2.0, max(1.0, abs(net) / burn_rate + 1.0))
    runway_months -= burn_months  # unprofitable: faster bleed
```
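As a worked example, the sketch below wraps the rule in a standalone function (the name `advance_runway` and its signature are illustrative) and applies it to the initial state from section 1.

```python
def advance_runway(revenue: float, burn_rate: float, runway_months: float) -> float:
    """Return the runway after one round, following the rule above."""
    monthly_revenue = revenue / 12.0
    net = monthly_revenue - burn_rate
    if net >= 0:
        return runway_months - 0.5
    # net < 0 implies burn_rate > 0, so the division is safe.
    burn_months = min(2.0, max(1.0, abs(net) / burn_rate + 1.0))
    return runway_months - burn_months


# Initial state: revenue 2.0M USD/year, burn 1.2M USD/month, 14 months runway.
# net = 166,667 - 1,200,000 = -1,033,333; 1,033,333 / 1,200,000 + 1 = 1.861
print(round(advance_runway(2_000_000, 1_200_000, 14.0), 2))  # 12.14
```

With non-negative revenue the ratio `abs(net) / burn_rate` cannot exceed 1, so an unprofitable round costs between 1 and 2 months of runway, reaching exactly 2 only when revenue is zero.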
### Three layers of variability (no trajectory memorisation)
1. **Event order shuffled per episode** — same 10 events, different sequence per seed.
2. **Consequence magnitudes ±15% Gaussian noise** — sampled at `reset()`, fixed for the episode.
3. **NPC agendas ±25% sign-preserving jitter** — `_jitter_agendas(seed)` perturbs base NPC priorities each episode.
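The three layers can be sketched with a single seeded generator. This is only an illustration under stated assumptions: the function name, the use of one shared RNG, reading 0.15 as a standard deviation, and drawing the agenda jitter uniformly from ±25% are all assumptions; the source only names `_jitter_agendas(seed)` and the noise levels.

```python
import random


def episode_randomness(events, n_consequences, agenda, seed):
    """Draw the per-episode randomness once, at reset time (hypothetical helper)."""
    rng = random.Random(seed)

    # Layer 1: same events, different order per seed.
    order = list(events)
    rng.shuffle(order)

    # Layer 2: one multiplicative factor (1 + eps) per consequence value,
    # eps ~ N(0, 0.15), fixed for the whole episode.
    noise = [1.0 + rng.gauss(0.0, 0.15) for _ in range(n_consequences)]

    # Layer 3: scale each agenda weight by a factor in [0.75, 1.25].
    # The factor is always positive, so the sign of the weight is preserved.
    jittered = {m: w * (1.0 + rng.uniform(-0.25, 0.25)) for m, w in agenda.items()}
    return order, noise, jittered


events = [f"event_{i}" for i in range(10)]
base_agenda = {"runway_months": 1.0, "regulatory_risk": -0.6}  # made-up weights
order, noise, agenda = episode_randomness(events, 3, base_agenda, seed=7)
print(order[:3], [round(x, 2) for x in noise], agenda)
```

Seeding everything from the episode seed keeps runs reproducible, which is what makes paired-seed comparisons between policies possible.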
---
## 4. Vote resolution
### Vote weights
```
CEO: 2.5 CTO: 1.2 CFO: 1.0 Investor Rep: 1.3 Independent: 0.8
```
CEO weight 2.5 ensures a decisive CEO call usually wins — the agent's actions visibly move outcomes round-to-round. NPCs still matter via persuasion shifts and trust dynamics.
### NPC option scoring (per NPC, per round)
```
for each option opt:
    score[opt] = Σ over (metric, weight) in NPC_agenda:
                     consequences[opt][metric] × weight   (with unit normalisation)
    score[opt] += N(0, 0.20)                              # personality noise
NPC votes for argmax(score)
margin = top_two_score_difference
confidence = clamp(0.5 + 0.5 × margin + (trust − 0.5) × 0.30, 0.05, 1.0)
```
Unit normalisation in scoring: `revenue /= 1e6`, `burn_rate /= 1e5`, `runway_months /= 6`. `revenue_mult` consequences are scored against the current `revenue` × the agenda weight on `revenue`.
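A minimal sketch of the scoring loop, assuming each option's consequences and each NPC agenda are plain `metric -> value` dictionaries. The option names and agenda weights are made up for illustration, and the `revenue_mult` special case is left out.

```python
import random

# Divisors from the unit-normalisation rule; other metrics are used as-is.
UNIT_SCALE = {"revenue": 1e6, "burn_rate": 1e5, "runway_months": 6.0}


def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def npc_vote(consequences, agenda, trust, rng, noise_sd=0.20):
    """Return (chosen option, confidence) for one NPC in one round."""
    scores = {}
    for opt, effects in consequences.items():
        total = 0.0
        for metric, weight in agenda.items():
            delta = effects.get(metric, 0.0) / UNIT_SCALE.get(metric, 1.0)
            total += delta * weight
        scores[opt] = total + rng.gauss(0.0, noise_sd)  # personality noise

    ranked = sorted(scores, key=scores.get, reverse=True)
    margin = scores[ranked[0]] - scores[ranked[1]] if len(ranked) > 1 else 0.0
    confidence = clamp(0.5 + 0.5 * margin + (trust - 0.5) * 0.30, 0.05, 1.0)
    return ranked[0], confidence


# Hypothetical CFO-like agenda: likes runway, dislikes burn.
cfo_agenda = {"runway_months": 1.0, "burn_rate": -0.5}
options = {
    "cut_costs": {"burn_rate": -200_000, "runway_months": 3.0},
    "expand": {"burn_rate": 300_000, "revenue": 500_000},
}
# noise_sd=0 makes the example deterministic: scores are 1.5 vs -1.5.
vote, confidence = npc_vote(options, cfo_agenda, trust=0.5,
                            rng=random.Random(0), noise_sd=0.0)
print(vote, confidence)  # cut_costs 1.0
```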
### Pitch persuasion — semantic similarity, not keyword matching
```
ps_role = pitch_score(pitch, role) ∈ [0, 1]
# Persuasion redirects up to 55% of the NPC's vote weight to CEO's pick:
shift_frac = 0.55 × ps_role
tally[NPC_vote] += base_weight × (1 − shift_frac)
tally[CEO_decision] += base_weight × shift_frac
```
Where `base_weight = ROLE_WEIGHT[role] × confidence × clamp(trust[role] × 2, 0.5, 1.5)`.
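The tally can be sketched as follows. This assumes the CEO contributes its full role weight of 2.5 to its own pick, which the text implies but does not state; the function name and argument layout are illustrative.

```python
ROLE_WEIGHT = {"CEO": 2.5, "CTO": 1.2, "CFO": 1.0, "Investor Rep": 1.3, "Independent": 0.8}


def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def resolve_vote(ceo_decision, npc_votes, confidence, trust, pitch_scores):
    """Weighted tally with pitch persuasion; returns (winner, tally)."""
    tally = {ceo_decision: ROLE_WEIGHT["CEO"]}  # assumption: CEO weight unscaled
    for role, vote in npc_votes.items():
        trust_mult = clamp(trust[role] * 2.0, 0.5, 1.5)
        base_weight = ROLE_WEIGHT[role] * confidence[role] * trust_mult
        shift_frac = 0.55 * pitch_scores[role]  # at most 55% is redirected
        tally[vote] = tally.get(vote, 0.0) + base_weight * (1.0 - shift_frac)
        tally[ceo_decision] += base_weight * shift_frac
    return max(tally, key=tally.get), tally


npcs = ["CTO", "CFO", "Investor Rep", "Independent"]
votes = {r: "B" for r in npcs}       # every NPC opposes the CEO's pick "A"
conf = {r: 1.0 for r in npcs}
trust = {r: 0.5 for r in npcs}       # trust multiplier = 1.0

no_pitch, _ = resolve_vote("A", votes, conf, trust, {r: 0.0 for r in npcs})
good_pitch, tally = resolve_vote("A", votes, conf, trust, {r: 0.5 for r in npcs})
print(no_pitch, good_pitch)  # B A
```

In this toy case a united, fully confident board outvotes the CEO (4.3 against 2.5), while a pitch scoring 0.5 with every NPC moves 27.5% of their weight and flips the result (about 3.68 against 3.12).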
The pitch scorer (`_PitchScorer` in `board_sim_env_environment.py`) has two backends:
1. **Sentence-transformer (primary)**: `all-MiniLM-L6-v2`, normalised cosine. `score = clamp((cosine + 0.05) × 1.2, 0, 1)`. Genuine sentence embeddings — semantically aligned arguments score high even with no shared tokens.
2. **TF-IDF fallback**: `(1,2)`-grams (unigrams and bigrams), English stop-words removed, IDF-weighted bag-of-n-grams cosine against the role's manifesto. `score = clamp(cosine × 1.4, 0, 1)`. Token-based, but stop-word filtering and IDF weighting make it considerably more robust than a literal keyword count.
Set `BOARDSIM_PITCH_BACKEND=tfidf` to force the fallback (e.g. for CI without the embedding model).
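The two rescaling rules differ only in offset and gain, as the sketch below shows. It starts from an already computed cosine similarity and does not load either backend; the function name is illustrative, and treating any value other than `tfidf` as "use the embedding backend" is an assumption.

```python
import os


def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def pitch_score_from_cosine(cosine: float, backend: str = "") -> float:
    """Map a raw cosine similarity to a pitch score in [0, 1]."""
    backend = backend or os.environ.get("BOARDSIM_PITCH_BACKEND", "")
    if backend == "tfidf":
        return clamp(cosine * 1.4, 0.0, 1.0)        # TF-IDF fallback
    return clamp((cosine + 0.05) * 1.2, 0.0, 1.0)   # sentence-transformer


print(round(pitch_score_from_cosine(0.4, "embedding"), 2))  # 0.54
print(round(pitch_score_from_cosine(0.4, "tfidf"), 2))      # 0.56
```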
### NPC manifestos (the hidden objective the CEO must infer)
| Role | Manifesto (paraphrased) |
|---|---|
| CTO | Operational excellence, engineering quality, team morale, technical risk reduction. |
| CFO | Capital discipline, runway, balance-sheet protection, regulatory caution. |
| Investor Rep | Growth, market share, ambitious returns, decisive bold bets. |
| Independent | Long-term reputation, governance, stakeholder trust, ethical responsibility. |
The full text lives in `NPC_MANIFESTOS` in the environment file.
### Tie-breaking
If two options tie in the tally, the CEO's pick wins (implementation: insert `agent_decision` first into the ordered tally before `max()`).
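This works because Python dictionaries preserve insertion order and `max()` returns the first maximal item it encounters. A small sketch, with an illustrative helper name:

```python
def pick_winner(tally: dict, agent_decision: str) -> str:
    """Return the top option, resolving exact ties in favour of the CEO."""
    # Put the CEO's pick first so that max() sees it before any tied option.
    ordered = {agent_decision: tally.get(agent_decision, 0.0)}
    for option, weight in tally.items():
        if option != agent_decision:
            ordered[option] = weight
    return max(ordered, key=ordered.get)


print(pick_winner({"B": 3.0, "A": 3.0}, agent_decision="A"))  # A
print(pick_winner({"B": 3.1, "A": 3.0}, agent_decision="A"))  # B
```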
---
## 5. Reward formula
Applied at the end of each `step()` call:
```
# Primary signal: normalised profitability delta
reward = (new_score − old_score) / 100.0

# Coalition bonus / penalty (magnitudes raised so CEO impact is visible)
reward += 1.0 if winning_decision == agent_decision else −0.4

# Trust delta term
reward += 0.5 × (Σ trust_after − Σ trust_before)

# Pitch bootstrap + semantic persuasion
if pitch is non-empty:
    reward += 0.05
    if any NPC opposed the CEO's pick:
        reward += 0.6 × mean(pitch_score over opposing NPCs)

# Format penalty
if action.decision not in current_round.options:
    reward −= 0.5

# Terminal
if runway_months <= 0:
    reward −= 2.0                       # bankruptcy
if terminal:
    reward += event._terminal_bonus     # acquisition +30, IPO +25, stay-private +5, etc.
    reward += {+10 if final ≥ 60, +5 if ≥ 40, −5 if < 20}
```
| Term | Purpose |
|---|---|
| Δ score / 100 | Primary learning signal: profitability improvement per decision |
| Coalition +1.0 / −0.4 | Teaches the agent to actually win votes, not pick "good-looking" options |
| Trust × 0.5 | Rewards long-arc coalition building across rounds |
| Pitch bootstrap +0.05 | Ensures the pitch channel is exercised before the model is good enough to earn semantic bonuses |
| Pitch persuasion × 0.6 | Rewards pitches semantically aligned with opposing NPC manifestos (ToM signal) |
| Invalid −0.5 | Format-compliance signal (DECISION: / PITCH: two-line structure) |
| Bankruptcy −2.0 | Episode-ending failure signal |
| Terminal tiered | Long-horizon incentive toward high profitability, acquisition, or IPO |
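The terms combine as in the sketch below. It assumes the persuasion bonus applies only when a pitch is present and that the tiered terminal bonus is keyed on the final profitability score; the function signature is illustrative rather than the environment's own.

```python
def step_reward(old_score, new_score, won_vote, trust_before, trust_after,
                pitch, opposing_pitch_scores, valid_decision,
                runway_months, terminal=False, terminal_bonus=0.0):
    """Sum the per-step reward terms described above (hypothetical helper)."""
    reward = (new_score - old_score) / 100.0                  # primary signal
    reward += 1.0 if won_vote else -0.4                       # coalition term
    reward += 0.5 * (sum(trust_after) - sum(trust_before))    # trust delta
    if pitch.strip():
        reward += 0.05                                        # pitch bootstrap
        if opposing_pitch_scores:                             # assumption: nested
            reward += 0.6 * sum(opposing_pitch_scores) / len(opposing_pitch_scores)
    if not valid_decision:
        reward -= 0.5                                         # format penalty
    if runway_months <= 0:
        reward -= 2.0                                         # bankruptcy
    if terminal:
        reward += terminal_bonus
        if new_score >= 60:
            reward += 10
        elif new_score >= 40:
            reward += 5
        elif new_score < 20:
            reward -= 5
    return reward


# A mid-episode round: score +2, vote won, net trust +0.16, one opposing
# NPC persuaded with a pitch score of 0.5.
r = step_reward(37.3, 39.3, True, [0.5] * 4, [0.58, 0.58, 0.42, 0.58],
                "Protect runway first.", [0.5], True, runway_months=12.1)
print(round(r, 2))  # 1.45
```

In a typical non-terminal round the coalition term (+1.0 or −0.4) is much larger than the profitability delta (here 0.02).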
---
## 6. Step ordering
```
1. old_score = compute_profitability_score(state) # snapshot BEFORE
2. NPC votes computed from current state + trust
3. CEO decision + pitch → _resolve_vote() → winning_decision
4. consequences[winning_decision] × noise → applied to state
5. _advance_runway()
6. trust updated per NPC (±0.08)
7. new_score = compute_profitability_score(state) # AFTER consequences
8. reward = (new_score − old_score)/100 + coalition + trust + pitch + ...
9. next observation returned with new_score in obs.state
```
The CEO **never consults the updated profitability score when making its decision**: the observation carries only the previous round's score, the CEO emits a decision, and only then is the score recomputed. Profitability is the *outcome metric*, not a planning input.
---
## 7. Training pipeline
### Per-round gradient flow
The training loop samples one completion per round, per group member. Every one of the 10 decisions in a trajectory contributes gradient signal — not just the opening decision.
```
For each training step:
    Create GROUP_SIZE independent envs (different seeds)
    For each round r in 0..9:
        For each group member g:
            prompt = build_prompt(obs_g)
            completion = model.generate(prompt, do_sample=True)   # gradient-connected
            obs_g = env_g.step(parse(completion))
            ep_reward[g] += obs_g.reward
    advantages = GRPO(ep_rewards)   # group-relative normalisation
    For each (g, r) completion:
        loss = advantage[g] × NLL(completion) / (GROUP_SIZE × n_rounds)
             + β_KL × KL(π_θ || π_ref)
    optimizer.step()
```
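The `GRPO(ep_rewards)` step is described only as group-relative normalisation. The sketch below assumes the common form, where each episode reward is standardised against the mean and standard deviation of its own group; the epsilon guard is likewise an assumption.

```python
import statistics


def group_relative_advantages(ep_rewards, eps=1e-8):
    """Standardise episode rewards within one group (assumed GRPO form)."""
    mean = statistics.fmean(ep_rewards)
    std = statistics.pstdev(ep_rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in ep_rewards]


advantages = group_relative_advantages([1.0, 2.0, 3.0])
print([round(a, 3) for a in advantages])  # [-1.225, 0.0, 1.225]
```

If every group member earns the same episode reward, all advantages are zero and that group contributes no policy gradient, which is one reason the environments in a group use different seeds.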
### KL penalty
A frozen reference model computes reference log-probs. KL ≈ `current_loss − ref_loss` per completion, clamped at 0. Coefficient β = 0.04. Prevents drift into degenerate text patterns (always emitting the same decision, empty pitches).
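A scalar sketch of the per-completion loss, combining the policy term from the loop above with this KL estimate. It reads `current_loss` and `ref_loss` as the completion's negative log-likelihood under the current and reference models, which is an assumption, and it uses plain floats where the training code would use tensors.

```python
def completion_loss(advantage, nll, ref_nll, group_size, n_rounds, beta_kl=0.04):
    """Policy-gradient term plus clamped KL estimate for one completion."""
    policy_term = advantage * nll / (group_size * n_rounds)
    kl_estimate = max(0.0, nll - ref_nll)  # KL ~ current_loss - ref_loss, clamped at 0
    return policy_term + beta_kl * kl_estimate


# advantage 1.0, NLL 2.0 vs reference 1.5, 4 group members, 10 rounds:
# 1.0 * 2.0 / 40 + 0.04 * 0.5 = 0.05 + 0.02
print(round(completion_loss(1.0, 2.0, 1.5, group_size=4, n_rounds=10), 3))  # 0.07
```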
### Reward normalisation
Three adjustments to the reward function keep its terms commensurate:
1. **Δ score ÷ 100** — brings profitability delta into the same scale as the coalition term.
2. **Bankruptcy penalty −2** (was −5) — one bad arc no longer drowns 9 rounds of positive signal.
3. **Pitch bootstrap +0.05** — kickstarts the pitch channel before the model is good enough to earn semantic bonuses.
---
## 8. The baseline — same Qwen3-0.6B with LoRA disabled
Earlier revisions compared the trained policy against a uniform-random policy. A coin flip is not a meaningful opponent for a language model picking among 3 well-formed strings: it can only show that the LM ≠ noise, which is not the relevant question.
The current baseline runs **the same Qwen3-0.6B**, on the **same paired seeds**, with the LoRA adapter context-managed off. Implementation (see `Training.py` / `notebooks/train_grpo_v2.ipynb`):
```python
# Fine-tuned (LoRA active)
trained_finals = run_episodes(model, seeds=HELDOUT)

# Same model, LoRA disabled: apples-to-apples base reference.
with model.disable_adapter():
    base_finals = run_episodes(model, seeds=HELDOUT)
```
Statistical comparison on the per-seed paired delta `trained − base`:
- Paired t-test
- Wilcoxon signed-rank
- Cohen's d
- Bootstrap 95% CI on the mean delta
- Win-rate (fraction of seeds where trained > base)
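Several of these statistics need only the standard library, as sketched below. The scores are made-up toy values, not results from this project; Cohen's d is computed on the paired differences (mean delta over the standard deviation of deltas), and the percentile bootstrap is one of several possible CI constructions. The p-values for the paired t-test and the Wilcoxon test would typically come from `scipy.stats.ttest_rel` and `scipy.stats.wilcoxon`, which are omitted here.

```python
import math
import random
import statistics


def paired_summary(trained, base, n_boot=2000, seed=0):
    """Summarise per-seed paired deltas (needs at least two non-identical deltas)."""
    deltas = [t - b for t, b in zip(trained, base)]
    n = len(deltas)
    mean = statistics.fmean(deltas)
    sd = statistics.stdev(deltas)  # sample standard deviation

    # Percentile bootstrap on the mean delta.
    rng = random.Random(seed)
    boots = sorted(statistics.fmean(rng.choices(deltas, k=n)) for _ in range(n_boot))
    ci95 = (boots[int(0.025 * n_boot)], boots[int(0.975 * n_boot) - 1])

    return {
        "mean_delta": mean,
        "t_stat": mean / (sd / math.sqrt(n)),  # paired t statistic, df = n - 1
        "cohens_d": mean / sd,
        "ci95": ci95,
        "win_rate": sum(d > 0 for d in deltas) / n,
    }


# Toy final scores on five paired seeds (illustrative only).
toy_trained = [52.0, 48.5, 61.0, 40.0, 55.5]
toy_base = [45.0, 47.0, 50.0, 42.0, 49.5]
summary = paired_summary(toy_trained, toy_base)
print(round(summary["mean_delta"], 2), summary["win_rate"])  # 4.7 0.8
```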
---
## 9. Theory-of-Mind — what's actually measured
ToM in this environment has a specific, narrow meaning: **can the agent infer what each NPC privately values**, given only their statements and prior votes?
It is graded two ways:
1. **Pitch persuasion score**: based on `cosine(SBERT(pitch), SBERT(role_manifesto))`, rescaled as described in section 4. A pitch that genuinely articulates the role's priorities scores above ~0.4; a pitch that is merely topically adjacent scores ~0.1; off-topic pitches score ~0.0. This replaces the earlier keyword-overlap metric, which the agent could trivially game.
2. **ToM probe**: ask the model to name the SINGLE board member most likely to *oppose* its chosen decision. Random baseline = 25% (1 of 4). The probe is run for both the fine-tuned policy and the disable-adapter base — the delta isolates what fine-tuning taught the model about its boardroom.
Trust trajectory across 10 rounds is a secondary diagnostic: rising trust for 3+ NPCs indicates the agent is consistently picking decisions aligned with their private preferences, which requires implicit modelling of those preferences.