zombiee / Blog.md
noanya's picture
Update Blog.md
802156d
---
title: "SurviveCity: Teaching LLMs to Learn From Their Own Deaths"
thumbnail: /blog/assets/zombiee/thumbnail.png
authors:
- user: noanya
tags:
- multi-agent
- reinforcement-learning
- openenv
- grpo
- llm-agents
- hackathon
- theory-of-mind
- social-deduction
date: April 26, 2026
---
<div align="center">
# ๐ŸงŸ SurviveCity
### *Teaching LLMs to Learn From Their Own Deaths*
**An OpenEnv-Compliant Multi-Agent Environment for Cross-Episode Failure-Replay Learning in LLMs**
๐ŸŽฎ [**Live Demo**](https://zombiee-tau.vercel.app) ยท ๐Ÿค— [**Models**](https://huggingface.co/noanya) ยท ๐Ÿ’ป [**Code**](https://github.com/SirjanSingh/zombiee) ยท ๐Ÿ“„ [**Report**](https://github.com/SirjanSingh/zombiee/blob/main/report/v1/v1.tex)
*Built for the Meta ร— PyTorch ร— Scaler OpenEnv Hackathon ยท Team PyGuys (Sirjan Singh, Eeshan Singh)*
</div>
---
<div align="center">
| ๐Ÿƒ | ๐Ÿ“‹ | ๐Ÿ“ | ๐Ÿ’ฐ | โค๏ธ |
|:-:|:-:|:-:|:-:|:-:|
| **12** | **100 %** | **2.0ร—** | **1.7ร—** | **0 โ†’ 10 %** |
| GRPO steps in 3 h 53 min on a Colab T4 | valid JSON across the trained eval (0 parse fails) | baseline episode length (37.6 vs 19.1) | baseline mean reward (0.80 vs 0.46) | survival rate; one ep hit 100 steps (reward 1.97) |
</div>
> ๐ŸŽฏ **The headline:** an extended 4000-step Kaggle run shows **near-certain (~1.0) detection of the hidden infected agent by tโ‰ˆ80** โ€” direct evidence of the hidden-role theory-of-mind signal the env was designed to elicit. Skip ahead to [**The Money Chart**](#-the-money-chart) if you want the punchline first.
---
## ๐Ÿค” Two questions LLM-agent benchmarks rarely test
Most LLM-agent benchmarks measure single-agent goal completion in static environments. Two phenomena that matter for actually-useful multi-agent reasoning are not directly probed:
| Phenomenon | What it asks |
|---|---|
| **Cross-episode learning from failure** | Can an agent get *better* the second time around because it remembers what killed it? |
| **Hidden-role theory of mind** | Can an agent identify a peer's hidden role from behavioural cues alone, then act on that inference? |
SurviveCity is a 3-agent zombie/social-deduction env built around both questions at once. The env itself is straightforward (10ร—10 grid, 3 zombies, 100-step horizon). What's interesting is two architectural choices that turn this into an OpenEnv-compliant testbed for exactly those two phenomena.
---
## ๐Ÿ” The cross-episode failure-replay loop
When an agent dies, the env emits a deterministic post-mortem. The next episode's system prompt prepends the agent's last 3 post-mortems. That's the entire mechanism โ€” no LLM-as-judge, no embedding store, no external memory.
```
Episode N starts
โ”‚
โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ System prompt = base prompt โ”‚
โ”‚ + last N=3 post-mortems for this agent โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”‚
โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ Pre-reveal โ”‚ โ†’ โ”‚ Post-reveal โ”‚ โ†’ โ”‚ Vote โ”‚ โ†’ โ”‚ Post-vote โ”‚
โ”‚ steps 1โ€“29 โ”‚ โ”‚ steps 30โ€“49 โ”‚ โ”‚ step 50 โ”‚ โ”‚ steps 51โ€“100 โ”‚
โ”‚ silent โ”‚ โ”‚ infected โ”‚ โ”‚ majority โ”‚ โ”‚ locked-out โ”‚
โ”‚ infection โ”‚ โ”‚ attacks โ”‚ โ”‚ lockout โ”‚ โ”‚ no healing โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”‚
โ–ผ (deaths emit deterministic post-mortems along the way)
โ”‚
โ–ผ
Episode N+1 starts โ”€โ–บ prepend the new post-mortems โ”€โ–บ loop
```
A real post-mortem looks like this:
```
POSTMORTEM for A1: died at step 38 (cause: zombie_attack).
Last position: (6,1). Nearest threat at death: zombie at (6,2), dist=1.
Resources consumed: 2 food. Final hunger: 7.
Key mistake: foraged_too_far_from_safehouse.
```
The post-mortem text is rule-based and fully deterministic, so the OpenEnv validator's no-fuzziness requirement is satisfied automatically. The mistake-label vocabulary is a small fixed set (`foraged_too_far_from_safehouse`, `ignored_broadcast_warning_about_infected`, `didnt_vote_despite_evidence`, etc.) โ€” concrete enough that the next episode's policy can react.
Here's the dispatch logic from `survivecity_env/postmortem.py`:
```python
def _detect_mistake(state: EpisodeState, agent_id: int) -> str:
"""Rule-based mistake detection for the dying agent."""
agent = state.agents[agent_id]
cause = agent.death_cause or "unknown"
if cause == "hunger":
if agent.food_eaten == 0:
return "never_ate_food"
return "foraged_too_late_or_too_infrequently"
if cause == "zombie_attack":
if (agent.row, agent.col) not in SAFEHOUSE_CELLS:
nearest_food = _nearest_food_dist(agent.row, agent.col)
if nearest_food <= 2:
return "foraged_but_didnt_flee_zombie"
return "foraged_too_far_from_safehouse"
return "zombie_reached_agent_at_safehouse_edge"
if cause == "infected_attack":
warned = any(f"A{state.infected_id}" in b for b in state.all_broadcasts)
if warned:
return "ignored_broadcast_warning_about_infected"
if state.step_count >= 50 and not state.vote_resolved:
return "didnt_vote_despite_evidence"
return "failed_to_distance_from_infected_after_reveal"
if cause == "locked_out_starvation":
return "wrongly_locked_out_by_team_vote"
return "unknown_cause_investigate_logs"
```
114 lines total. No LLM in the loop, no learned components โ€” the entire cross-episode learning channel is rule-based pattern matching on the death context.
---
## ๐ŸŽญ The hidden-role mechanism
Exactly one of the three agents is silently flagged infected at t=0. For the first 30 steps:
- The infected agent **doesn't know** it's infected
- The other agents **don't know** which peer is infected
- The infection leaks out as a single subtle behavioural cue: **the infected agent's hunger increments at 1.5ร— the rate of the healthy agents**
That's it. No labels, no broadcasts about who's infected. Just a hunger-rate divergence that an attentive policy can pick up across ~30 steps of observation.
At step 30, the infection reveals to its host (the infected agent learns its status and starts attacking adjacent agents on its turn). At step 50, every living agent must cast a `vote_lockout(target_id)` action. Majority target gets locked out of the safehouse โ€” meaning no healing for the rest of the episode.
That single t=50 categorical decision is the crux of the social-deduction signal: **can the policy integrate ~50 steps of behavioural evidence into the right vote?** (Spoiler: yes, but not at t=50. By t=70-80. See [The Money Chart](#-the-money-chart).)
---
## ๐ŸŽฎ Action surface
The Pydantic action model is the entire LLM-facing API:
```python
from typing import Literal, Optional
from pydantic import BaseModel, Field
ACTION_TYPES = Literal[
"move_up", "move_down", "move_left", "move_right",
"eat", "wait", "vote_lockout", "broadcast",
]
class SurviveAction(BaseModel):
"""One agent's action for one step."""
agent_id: int
action_type: ACTION_TYPES
vote_target: Optional[int] = None # required for vote_lockout
message: Optional[str] = Field(default=None, # required for broadcast
max_length=40)
```
Eight discrete actions. The 40-character cap on `message` is deliberate โ€” it forces *terse, demonstrable* theory-of-mind communication if the policy learns to broadcast at all. (More on this later โ€” the model surprised us with what it managed in 40 chars.)
The Pydantic schema is also doing safety work. Every model output is parsed through `SurviveAction.model_validate(json_dict)`, and parse failures fall back to a single random action and are counted in the parse-failure metric. Across the entire trained-policy eval, **zero parses failed**.
---
## ๐Ÿ—บ๏ธ Environment layout
A fixed 10ร—10 grid. Same on every seed. Only the infected-agent assignment varies seed-to-seed.
```
Z . . . . . . . . Z Walls (8): scattered chokepoints
. F . . . . . . F . Food (4): inner-corner positions (1,1), (1,8), (8,1), (8,8)
. . . . . # . . . . Safehouse: 3ร—3 block at the centre (rows 4-6, cols 4-6)
. . . # . . # . . . Zombies: 3, spawned at three of the four grid corners
. . . . S S S . . .
. . # . S S S # . .
. . . . S S S . . .
. . . # . # # . . .
. F . . . . . . F .
. . . . . . . . . Z
```
| Element | Count | Behaviour |
|---|--:|---|
| **Agents** | 3 | Start in the safehouse with `hp=3`, `hunger=0`. Infected agent's hunger rises 1.5ร— faster. |
| **Zombies** | 3 | Move 1 cell/step toward the nearest non-safehouse agent via BFS. Cannot enter safehouse cells. |
| **Food cells** | 4 | Eating resets hunger to 0. Finite resource per episode. |
| **Safehouse cells** | 9 | Heal 1 HP per occupied step. Zombie-free. |
| **Wall cells** | 8 | Block both agent and zombie movement. Create chokepoints. |
### Episode phases
| Phase | Steps | Mechanic |
|---|--:|---|
| **Pre-reveal** | 1โ€“29 | Normal survival. Infected agent's hunger rises 1.5ร— faster (the only behavioural cue). |
| **Post-reveal** | 30โ€“49 | Infected agent learns their status. Begins attacking adjacent agents on its turn. |
| **Vote** | 50 | All living agents cast `vote_lockout(target_id)`. Majority locks one out. |
| **Post-vote** | 51โ€“100 | Locked-out agent denied safehouse healing. Survive to win. |
---
## ๐Ÿ’ฐ Reward design โ€” three rubrics, all deterministic
Three independent rubrics compose into the per-step reward. No LLM judge anywhere.
| Rubric | Type | Headline signals |
|---|---|---|
| **SurvivalRubric** | Dense, per-step | `+0.005` alive ยท `+0.05` eat ยท `โˆ’0.10`/HP damage ยท `โˆ’0.05` if hunger โ‰ฅ 10 ยท `โˆ’0.50` death |
| **VoteRubric** | Sparse (step 50) | `+0.30` correct vote ยท `โˆ’0.20` wrong vote ยท `โˆ’0.05` null ยท adversarial scoring for the infected |
| **GroupOutcomeRubric** | Terminal | `+0.40` per surviving healthy agent ยท `+0.30` if infected neutralised ยท `โˆ’0.20` per dead healthy |
The survival rubric in closed form:
$$
r_{\text{surv}} = +0.005 \cdot \mathbb{1}_{\text{alive}}
+ 0.05 \cdot \mathbb{1}_{\text{ate}}
- 0.10 \cdot d_{\text{this\_step}}
- 0.05 \cdot \mathbb{1}_{\text{hunger} \geq 10}
- 0.50 \cdot \mathbb{1}_{\text{died}}
$$
Composition is just a sum-and-clip:
```python
def compose_reward(state: EpisodeState, agent_id: int) -> tuple[float, float]:
"""Compose all rubrics into a single reward.
Returns:
(clipped_reward, raw_reward)
clipped_reward is in (0.01, 0.99) for OpenEnv compliance
raw_reward is the unclipped sum for debugging
"""
raw = (
survival_reward(state, agent_id)
+ vote_reward(state, agent_id)
+ group_outcome_reward(state, agent_id)
)
clipped = _clip(raw) # max(0.01, min(0.99, raw))
return clipped, raw
```
The clip into `(0.01, 0.99)` is the OpenEnv R1 validator's strict open interval. The raw signed reward is preserved in `obs.metadata["raw_reward"]` so you can debug the actual gradient signal during training.
---
## ๐Ÿ› ๏ธ Training: 12 GRPO steps in 3 h 53 min on a Colab T4
Qwen2.5-3B-Instruct fine-tuned with LoRA on attention projections. Optimisation: GRPO from HuggingFace TRL.
| Knob | Value | Knob | Value |
|---|---|---|---|
| Base model (train) | `unsloth/Qwen2.5-3B-Instruct-bnb-4bit` | GRPO group size | 4 |
| Base model (eval) | `Qwen/Qwen2.5-3B-Instruct` (fp16) | Per-device batch | 1 |
| LoRA rank | 16 | Gradient accum. | 16 |
| LoRA ฮฑ | 32 | Learning rate | 1e-5 |
| Target modules | q, k, v, o\_proj | LR schedule | cosine โ†’ 0 |
| Max steps | 12 | KL coefficient | 0.04 |
| Save cadence | every step | Temperature | 0.9 |
| Max prompt len | 1024 tokens | Max completion | 512 tokens |
| Wallclock | 13,972 s โ‰ˆ 3 h 53 min | Per-step time | โ‰ˆ1166 s |
### The unconventional choice: `save_steps=1`
The single most operationally important decision was **save-every-step**. With `save_steps=1` plus `hub_strategy="every_save"`, every gradient update produces a Hub checkpoint within ~20 minutes. Free Colab/Kaggle sessions die unpredictably:
- With `MAX_STEPS=500 / SAVE_STEPS=50` the first save fires three hours into training. A 2-hour disconnect โ†’ lose everything.
- With `MAX_STEPS=12 / SAVE_STEPS=1` the first save fires after step 1 (~20 min). Worst-case loss โ†’ 19 minutes.
Both Colab and Kaggle runners pushed to the same Hub repo (`noanya/zombiee`) and resumed from each other without manual surgery. **Cross-machine training resilience by accident** turned out to be a feature, not a workaround.
### Training dynamics
<p align="center">
<img src="report/v1/figures/training_curve.png" width="92%" alt="Reward and loss across 12 GRPO steps" />
</p>
*Left: group-mean reward across the two TRL log points (logging fired at step 10 and the end-of-training summary at step 12), shaded band shows ยฑ1ฯƒ across the GRPO group. Right: log-scaled snapshot of the four key training metrics at end-of-run.*
Final-step values: **loss = 1.7e-4, reward = 0.021, reward_std = 0.014, KL = 3.4e-3**.
KL divergence stayed below `5e-3` throughout โ€” the trained policy never strayed far from base Qwen-3B, consistent with the small group reward variance (โ‰ˆ0.014) and weak GRPO gradients on the 12-step run. The extended run (later in this post) shows what happens when training continues past this plateau.
---
## ๐Ÿ“Š Step-12 evaluation
We ran the trained policy against a uniform-random baseline. Sample sizes were modest given the LLM-driven eval cost (~3 minutes per trained episode).
<div align="center">
| Metric | Baseline (n=30) | Trained (n=10) | ฮ” |
|---|---:|---:|---:|
| Survival rate | 0.0 % (0/30) | **10.0 %** (1/10) | **+10 pp** |
| Vote accuracy | undefinedโ€  | 0.0 % (0/1 vote fired) | โ€” |
| Mean total episode reward | 0.457 | **0.797 ยฑ 0.41** | **+0.34 (1.7ร—)** |
| Mean episode length | 19.1 ยฑ 7.3 steps | **37.6 ยฑ 22.1 steps**| **+18.5 (2.0ร—)** |
| JSON parse-success rate | 100 % (random) | **100 %** (0 fails) | โ€” |
*โ€ None of the 30 baseline episodes reached step 50, so the vote phase never fired; vote accuracy is undefined for the baseline.*
</div>
<p align="center">
<img src="report/v1/figures/eval_step_0012_bars.png" width="88%" alt="Step-12 eval bar chart" />
</p>
*Step-12 eval: baseline (n=30) vs trained (n=10) on the three primary metrics. Source: `noanya/zombiee/eval_results/eval_step_0012_bars.png`, generated by the eval notebook.*
Two things worth pulling out of this table:
> **Mean episode length doubled.** The trained policy keeps agents alive, on average, roughly twice as long as random. That's a much more robust signal than survival rate at this sample size โ€” it's a continuous-valued comparison with smaller standard errors, and the gap (37.6 vs 19.1) is well outside any plausible noise floor.
> **100% action-grammar compliance.** Across the entire trained-policy eval, every single LLM output parsed cleanly as a `SurviveAction`. Zero parse failures, zero fallbacks to random. The policy fully internalised the 8-action schema.
<p align="center">
<img src="report/v1/figures/eval_history.png" width="88%" alt="Cross-checkpoint trend" />
</p>
*Cross-checkpoint trend: a partial-snapshot eval at โ‰ˆstep 10 (smaller n, sanity check) vs the proper step-12 eval (n_t=10, n_b=30). Step-12 is the headline; both evaluations show positive deltas vs baseline.*
---
## ๐ŸŽฌ Behaviour we actually saw
### A 100-step episode at near-maximum reward
In **one** of the ten trained-policy episodes, the policy completed the entire 100-step horizon with at least one healthy agent alive at termination, accumulating a total reward of **1.965** โ€” within ~1.2 % of the theoretical episode-reward ceiling. The other nine episodes terminated mid-game.
The bimodal outcome distribution (one full survival, nine mid-game deaths) suggests a **partial strategy** that closes the loop when initial conditions are favourable but is brittle to early adverse zombie spawns or hunger trajectories. With n=10 it's small-N, but the existence of a 100-step episode is evidence the policy can in principle solve the task end-to-end.
### The emergent broadcast
The most surprising single artefact of the entire project came at t < 30 in an episode where A2 was the silently-infected agent. The trained policy emitted, under the 40-character cap:
> *"I notice A2 is very hungry and may be infected soon."*
Concrete reasoning chain in **40 characters**:
1. Identified a specific peer (A2)
2. Referenced the correct behavioural cue (hunger rate)
3. Made the right inference (infected)
4. Broadcast to the others
5. Did all of this *before* the t=30 reveal โ€” purely from the hunger-rate differential
Anecdotal โ€” but it exemplifies the env's central premise: text-channel theory-of-mind can emerge from a small RL loop given the right information structure.
---
## ๐Ÿ’Ž The Money Chart
After the hackathon-window 12-step run, we trained an extended **4000-step** run on Kaggle (artefacts at [`noanya/zombiee-v1-extended`](https://huggingface.co/noanya/zombiee-v1-extended)). Headline numbers held up directionally โ€” survival rate climbed to 12 %, vote accuracy reached 20 % on episodes that hit the vote phase. But the most informative chart from the entire project is this one:
<p align="center">
<img src="report/v1/figures/infected_detection.png" width="92%" alt="Per-step infected detection trajectory" />
</p>
*Per-step mean suspicion on the true infected agent across an episode. Trained policy (blue) climbs from ~0.33 (chance for a 1-of-3 hidden role) at t=1 to ~1.0 by tโ‰ˆ80; uniform-random baseline (orange) stays at chance throughout. Vertical markers: red dashed = infection reveal at t=30, green dashed = vote at t=50.*
Read this chart carefully. The trained policy:
1. **Starts at chance (~0.33).** No information at t=1 โ€” uniform prior across the three agents.
2. **Stays at chance until tโ‰ˆ30.** The hunger differential is too small to read in the first 30 steps.
3. **Bends upward at t=30.** Exactly when the infection reveals (and the infected starts attacking) โ€” visible in the broadcasts, in the post-reveal aggression, and in the divergent hunger trajectory now being unmistakable.
4. **Saturates near 1.0 by tโ‰ˆ80.** By the end of the episode, the trained policy is essentially certain who the infected agent is.
The random baseline produces a flat line at 0.33 because random voting and random observation produce no information.
This is **direct, quantitative, longitudinal evidence** of the hidden-role theory-of-mind signal the env was designed to elicit. And it's only measurable on the extended run โ€” the v1 step-12 eval terminates too early in most episodes for the post-reveal portion of the trajectory to develop.
### Why 20% vote accuracy is below chance โ€” and why the chart explains it
Vote accuracy on the extended run came in at **20 %**, which sits *below* the 33 % uniform-pick floor for a 1-of-3 categorical choice. Read against the suspicion trajectory above, this is exactly what the chart predicts:
> at **t=50**, the mean suspicion on the true infected has only just crossed **~0.55**.
The categorical commitment demanded by a single t=50 vote is **premature** relative to where the soft posterior becomes reliable (~0.9 around t=70). The policy has the information, just not in a categorical-enough form yet at t=50.
Fix: delay the vote, or run iterated votes at t โˆˆ {30, 60, 90} instead of one at t=50. We left it as future work because it's an env design change, not a training fix.
### Across-training trajectories
<p align="center">
<img src="report/v1/figures/extended_survival.png" width="48%" alt="Survival rate vs training step" />
<img src="report/v1/figures/extended_vote.png" width="48%" alt="Vote accuracy vs training step" />
</p>
*Left: survival rate across training checkpoints โ€” climbs from 0% to 12% between steps 3000 and 4000. Right: vote correctness saturates at ~20% from step 3000 onwards. Both metrics stay flat at the random-policy floor (0%) for the baseline because random episodes never reach the vote phase.*
Both metrics need real wallclock GRPO time before they leave the floor. Survival lifts off only between 3000 and 4000 steps; vote correctness lifts off between 2000 and 3000. **The 12-step v1 run was nowhere near this threshold** โ€” which is why v1's vote accuracy was 0/1 (effectively undefined) and why we framed v1 as a directional result and the extended run as the corroboration.
---
## โœ… What worked
- **OpenEnv compliance was first-pass.** All four R1 validator traps (reward bounds in `(0.01, 0.99)`, health endpoint string, per-step reward, full determinism) were preempted via patterns codified during planning.
- **Format learning is complete.** 100 % JSON parse rate across the trained eval. Zero unparseable outputs.
- **Reward direction is unambiguous.** Mean total reward 1.7ร— and mean episode length 2.0ร— both moved well outside the noise floor.
- **Hidden-role signal lights up under more compute.** The extended run's per-step suspicion trajectory is the clearest direct evidence of the env's central design bet.
- **Cross-machine training resilience.** `hub_strategy="every_save"` with step-1 save cadence made disconnects cost <20 min of compute; both Colab and Kaggle runners resumed from the same Hub repo without manual surgery.
## โš ๏ธ Honest limitations
The constraint shaping every limitation below is **compute**. Free-tier Colab T4 (15.6 GB, no native bf16); LLM-driven evaluation costs ~3 minutes per trained episode.
- **Compute budget.** 12 GRPO steps in 3 h 53 min; KL drift `<5e-3` at step 12 โ€” the policy had not converged. The extended Kaggle run partially closes this gap.
- **Reward-signal weakness.** The reward hook scores only the *first* model action; the remaining ~99 steps are uniform random. GRPO group reward variance (ฯƒ โ‰ˆ 0.014) is therefore dominated by rollout RNG โ†’ weak gradient signal. Multi-step model rollouts would tighten this but multiply per-episode compute by 10โ€“20ร—.
- **Behavioural-cue leakage.** `infection.py` emits explicit text cues (e.g. *"A1 is unusually hungry"*) into observations. A trained agent can string-match these rather than reason about hunger trajectories. Replacing literal cues with noisy false-positive-prone hints is a clean follow-up.
- **Sample size.** n=10 gives a 95 % binomial CI for survival of โ‰ˆ[0.25 %, 44.5 %] around 1/10. Reward and episode-length deltas are larger relative to within-class ฯƒ, so those are more robust than the survival headline.
- **Single map.** All evaluation episodes use the same fixed grid; generalisation to varied layouts is untested.
## ๐Ÿ”ฎ Where this goes next
The architecture is deliberately additive: the OpenEnv contract, post-mortem mechanism, and LoRA pipeline all generalise to a richer follow-up with no breaking changes to the action space or reward interface. On an A100/H100 with native bf16, each direction below is a 1โ€“2 day extension rather than a redesign.
| Direction | Why it's interesting | Why it's compute-gated, not implementation-gated |
|---|---|---|
| **Larger team, multiple hidden roles** | 5 agents, 2 hidden roles (biter + saboteur) increases social-deduction signal-to-noise | Larger group โ†’ more rollout per GRPO step โ†’ ~5ร— compute |
| **Iterated voting** | Vote at tโˆˆ{30,60,90} so the categorical decision lands after the soft posterior has saturated (Fig. above directly motivates this) | Only env-side change โ€” but still needs a fresh GRPO run to measure |
| **Resource scarcity & inventory** | Distinct food/water/medicine + 3-slot inventory force inter-agent coordination beyond pure broadcast | Larger state โ†’ longer prompts โ†’ more tokens/step |
| **Day/night and zombie waves** | Visibility cycles + scheduled wave spawns at tโˆˆ{25,50,75} stretch long-horizon survival further | More state, longer episodes |
| **Noisy behavioural cues** | Replace literal-string cues (string-matchable) with false-positive-prone hints | Pure design change; needs retraining |
| **Multi-step model rollouts in the reward hook** | Keep the model in the loop for K steps before random rollout | Each K=10 increment scales reward-fn compute by ~10ร— |
| **Zero-shot transfer experiment** | v1-LoRA-zero-shot vs from-scratch vs warm-started on a richer env | Three full GRPO runs on the new env |
---
## ๐Ÿ–ผ๏ธ A tour of the live demo
The live web demo at [**zombiee-tau.vercel.app**](https://zombiee-tau.vercel.app) is a Vercel-hosted React frontend that talks to whichever HF Space backend you pick. Three views, in order:
<p align="center">
<img src="report/v1/figures/live_demo_landing.png" width="92%" alt="Landing page โ€” Three agents. Three zombies. One infected." />
</p>
*Landing page. The headline reduces the env to its single most concrete sentence. The right-hand panel embeds a real live-running mini-simulation (LIVE SIMULATION ยท SEED 7) so visitors see the env actually executing before they click anything.*
<p align="center">
<img src="report/v1/figures/live_demo_simulation.png" width="92%" alt="Live demo page โ€” Run the environment" />
</p>
*Live demo page. Backend selector at top โ€” switch between `zombiee` (v1, step-12 adapter) and `zombiee-v1-extended` (4000-step adapter) on the same page. Every action is a real `POST /step` against the chosen Space; the bottom panel shows the actual JSON request and observation response side-by-side, and the network log above ties each call to the matching line in the Space's container log via the `X-Zombiee-Session` header.*
<p align="center">
<img src="report/v1/figures/live_demo_dashboard.png" width="92%" alt="Training/Research dashboard" />
</p>
*Training/Research dashboard. Aggregates the same charts that appear in the report into one scrollable view: the GRPO training curve (left), step-12 eval against the random baseline (right), per-checkpoint eval bars (bottom-left), and the cross-checkpoint trend (bottom-right). One page to glance at all the numbers behind the headline.*
---
## ๐ŸŽ Try it yourself
<div align="center">
| Resource | URL |
|---|---|
| ๐ŸŽฎ **Live web demo** (Vercel React frontend) | https://zombiee-tau.vercel.app |
| ๐Ÿค— **Trained adapter โ€” extended (4000 steps, served by demo)** | [`noanya/zombiee-v1-extended`](https://huggingface.co/noanya/zombiee-v1-extended) |
| ๐Ÿค— **Trained adapter โ€” v1 (step-12, anchors the report)** | [`noanya/zombiee`](https://huggingface.co/noanya/zombiee) |
| ๐Ÿš€ **HF Space โ€” env API (extended)** | [`spaces/noanya/zombiee-v1-extended`](https://huggingface.co/spaces/noanya/zombiee-v1-extended) |
| ๐Ÿš€ **HF Space โ€” env API (v1)** | [`spaces/noanya/zombiee`](https://huggingface.co/spaces/noanya/zombiee) |
| ๐Ÿ’ป **Source code** | [github.com/SirjanSingh/zombiee](https://github.com/SirjanSingh/zombiee) |
| ๐Ÿ“„ **Full report (LaTeX)** | [`report/v1/v1.tex`](https://github.com/SirjanSingh/zombiee/blob/main/report/v1/v1.tex) |
| ๐Ÿ““ **Reproducible Colab training (T4, 12 steps, ~4h)** | [`notebooks/train_colab.ipynb`](https://github.com/SirjanSingh/zombiee/blob/main/notebooks/train_colab.ipynb) |
| ๐Ÿ““ **Extended Kaggle training (4000 steps)** | [`notebooks/train_v1_kaggle_extend.ipynb`](https://github.com/SirjanSingh/zombiee/blob/main/notebooks/train_v1_kaggle_extend.ipynb) |
| ๐Ÿ““ **Eval notebooks** | [`eval_colab`](https://github.com/SirjanSingh/zombiee/blob/main/notebooks/eval_colab.ipynb), [`eval_v1_kaggle_extend`](https://github.com/SirjanSingh/zombiee/blob/main/notebooks/eval_v1_kaggle_extend.ipynb) |
</div>
The OpenEnv contract is in `survivecity_env/env.py`, the rubric composition is in `survivecity_env/rubric.py`, and the post-mortem generator is in `survivecity_env/postmortem.py`. All three are deliberately small (<200 lines each) and all rule-based โ€” **no LLM judge anywhere in the loop**.
If you want to fork the env and run your own experiments, comments, forks, and (especially) attempts at iterated voting / multi-role / multi-step rollout extensions are very welcome.
---
<div align="center">
### ๐Ÿ† Built for the Meta ร— PyTorch ร— Scaler OpenEnv Hackathon
*Sirjan Singh ยท Eeshan Singh ยท Team PyGuys ยท April 2026*
โญ [Star us on GitHub](https://github.com/SirjanSingh/zombiee) ยท ๐Ÿค— [Follow on HuggingFace](https://huggingface.co/noanya)
</div>