title: 'SurviveCity: Teaching LLMs to Learn From Their Own Deaths'
thumbnail: /blog/assets/zombiee/thumbnail.png
authors:
- user: noanya
tags:
- multi-agent
- reinforcement-learning
- openenv
- grpo
- llm-agents
- hackathon
- theory-of-mind
- social-deduction
date: April 26, 2026
SurviveCity
Teaching LLMs to Learn From Their Own Deaths
An OpenEnv-Compliant Multi-Agent Environment for Cross-Episode Failure-Replay Learning in LLMs
Live Demo · Models · Code · Report
Built for the Meta × PyTorch × Scaler OpenEnv Hackathon · Team PyGuys (Sirjan Singh, Eeshan Singh)
| GRPO steps | Valid JSON | Episode length | Mean reward | Survival rate |
|---|---|---|---|---|
| 12 | 100 % | 2.0× | 1.7× | 0 → 10 % |
| in 3 h 53 min on a Colab T4 | across the trained eval (0 parse fails) | vs baseline (37.6 vs 19.1 steps) | vs baseline (0.80 vs 0.46) | one episode hit 100 steps (reward 1.97) |
The headline: an extended 4000-step Kaggle run shows near-certain (~1.0) detection of the hidden infected agent by t≈80, direct evidence of the hidden-role theory-of-mind signal the env was designed to elicit. Skip ahead to The Money Chart if you want the punchline first.
Two questions LLM-agent benchmarks rarely test
Most LLM-agent benchmarks measure single-agent goal completion in static environments. Two phenomena that matter for actually-useful multi-agent reasoning are not directly probed:
| Phenomenon | What it asks |
|---|---|
| Cross-episode learning from failure | Can an agent get better the second time around because it remembers what killed it? |
| Hidden-role theory of mind | Can an agent identify a peer's hidden role from behavioural cues alone, then act on that inference? |
SurviveCity is a 3-agent zombie/social-deduction env built around both questions at once. The env itself is straightforward (10×10 grid, 3 zombies, 100-step horizon). What's interesting is two architectural choices that turn this into an OpenEnv-compliant testbed for exactly those two phenomena.
The cross-episode failure-replay loop
When an agent dies, the env emits a deterministic post-mortem. The next episode's system prompt prepends the agent's last 3 post-mortems. That's the entire mechanism: no LLM-as-judge, no embedding store, no external memory.
```text
Episode N starts
        |
        v
+------------------------------------------------+
| System prompt = base prompt                    |
|   + last N=3 post-mortems for this agent       |
+------------------------------------------------+
        |
        v
+--------------+    +--------------+    +--------------+    +---------------+
| Pre-reveal   | -> | Post-reveal  | -> | Vote         | -> | Post-vote     |
| steps 1-29   |    | steps 30-49  |    | step 50      |    | steps 51-100  |
| silent       |    | infected     |    | majority     |    | locked-out    |
| infection    |    | attacks      |    | lockout      |    | no healing    |
+--------------+    +--------------+    +--------------+    +---------------+
        |
        v   (deaths emit deterministic post-mortems along the way)
        |
        v
Episode N+1 starts -> prepend the new post-mortems -> loop
```
A real post-mortem looks like this:
```text
POSTMORTEM for A1: died at step 38 (cause: zombie_attack).
Last position: (6,1). Nearest threat at death: zombie at (6,2), dist=1.
Resources consumed: 2 food. Final hunger: 7.
Key mistake: foraged_too_far_from_safehouse.
```
The post-mortem text is rule-based and fully deterministic, so the OpenEnv validator's no-fuzziness requirement is satisfied automatically. The mistake-label vocabulary is a small fixed set (foraged_too_far_from_safehouse, ignored_broadcast_warning_about_infected, didnt_vote_despite_evidence, etc.), concrete enough that the next episode's policy can react.
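The replay channel described above can be sketched in a few lines. This is a minimal illustration, not the repo's code: `PostmortemMemory` and its method names are hypothetical, and placing the post-mortems before the base prompt is an assumption based on the word "prepends".

```python
from collections import defaultdict, deque

MAX_POSTMORTEMS = 3  # the post keeps the last 3 post-mortems per agent


class PostmortemMemory:
    """Hypothetical per-agent rolling buffer of post-mortem strings."""

    def __init__(self, max_items: int = MAX_POSTMORTEMS) -> None:
        # deque(maxlen=...) drops the oldest entry once the buffer is full.
        self._buffers = defaultdict(lambda: deque(maxlen=max_items))

    def record(self, agent_id: int, postmortem: str) -> None:
        self._buffers[agent_id].append(postmortem)

    def build_system_prompt(self, agent_id: int, base_prompt: str) -> str:
        # Assumption: post-mortems come first, then the base prompt.
        history = list(self._buffers.get(agent_id, ()))
        return "\n\n".join(history + [base_prompt])
```

Because the buffer is bounded, the prompt overhead stays constant no matter how many episodes an agent has died in.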
Here's the dispatch logic from survivecity_env/postmortem.py:
```python
def _detect_mistake(state: EpisodeState, agent_id: int) -> str:
    """Rule-based mistake detection for the dying agent."""
    agent = state.agents[agent_id]
    cause = agent.death_cause or "unknown"

    if cause == "hunger":
        if agent.food_eaten == 0:
            return "never_ate_food"
        return "foraged_too_late_or_too_infrequently"

    if cause == "zombie_attack":
        if (agent.row, agent.col) not in SAFEHOUSE_CELLS:
            nearest_food = _nearest_food_dist(agent.row, agent.col)
            if nearest_food <= 2:
                return "foraged_but_didnt_flee_zombie"
            return "foraged_too_far_from_safehouse"
        return "zombie_reached_agent_at_safehouse_edge"

    if cause == "infected_attack":
        warned = any(f"A{state.infected_id}" in b for b in state.all_broadcasts)
        if warned:
            return "ignored_broadcast_warning_about_infected"
        if state.step_count >= 50 and not state.vote_resolved:
            return "didnt_vote_despite_evidence"
        return "failed_to_distance_from_infected_after_reveal"

    if cause == "locked_out_starvation":
        return "wrongly_locked_out_by_team_vote"

    return "unknown_cause_investigate_logs"
```
114 lines total. No LLM in the loop, no learned components: the entire cross-episode learning channel is rule-based pattern matching on the death context.
The hidden-role mechanism
Exactly one of the three agents is silently flagged infected at t=0. For the first 30 steps:
- The infected agent doesn't know it's infected
- The other agents don't know which peer is infected
- The infection leaks out as a single subtle behavioural cue: the infected agent's hunger increments at 1.5× the rate of the healthy agents
That's it. No labels, no broadcasts about who's infected. Just a hunger-rate divergence that an attentive policy can pick up across ~30 steps of observation.
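To make the cue concrete, here is one way an observer could turn hunger traces into a guess about who is infected. It is only a sketch under stated assumptions: hunger is observed every step as a number, a drop in hunger means the agent ate and is skipped, and the function names are hypothetical. It is not how the trained policy reasons, nor how the suspicion chart later in the post is computed.

```python
def mean_hunger_rate(hunger_trace: list[float]) -> float:
    """Average per-step hunger gain, skipping steps where hunger dropped (eating)."""
    gains = [b - a for a, b in zip(hunger_trace, hunger_trace[1:]) if b >= a]
    return sum(gains) / len(gains) if gains else 0.0


def most_suspicious(traces: dict[int, list[float]]) -> int:
    """Return the agent id whose hunger rises fastest on average."""
    rates = {agent_id: mean_hunger_rate(trace) for agent_id, trace in traces.items()}
    return max(rates, key=rates.get)


# Toy traces: agents 0 and 1 gain 1.0 per step, agent 2 gains 1.5 per step.
traces = {
    0: [0.0, 1.0, 2.0, 3.0, 0.0, 1.0],
    1: [0.0, 1.0, 2.0, 0.0, 1.0, 2.0],
    2: [0.0, 1.5, 3.0, 4.5, 0.0, 1.5],
}
suspect = most_suspicious(traces)
```

With noiseless traces the 1.5× ratio is obvious after a couple of steps. In the env, observation limits and eating schedules make the signal much harder to read.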
At step 30, the infection reveals to its host (the infected agent learns its status and starts attacking adjacent agents on its turn). At step 50, every living agent must cast a vote_lockout(target_id) action. The majority target is locked out of the safehouse, which means no healing for the rest of the episode.
That single t=50 categorical decision is the crux of the social-deduction signal: can the policy integrate ~50 steps of behavioural evidence into the right vote? (Spoiler: yes, but not at t=50. By t=70-80. See The Money Chart.)
Action surface
The Pydantic action model is the entire LLM-facing API:
```python
from typing import Literal, Optional

from pydantic import BaseModel, Field

ACTION_TYPES = Literal[
    "move_up", "move_down", "move_left", "move_right",
    "eat", "wait", "vote_lockout", "broadcast",
]


class SurviveAction(BaseModel):
    """One agent's action for one step."""

    agent_id: int
    action_type: ACTION_TYPES
    vote_target: Optional[int] = None  # required for vote_lockout
    message: Optional[str] = Field(default=None, max_length=40)  # required for broadcast
```
Eight discrete actions. The 40-character cap on message is deliberate: it forces terse, demonstrable theory-of-mind communication if the policy learns to broadcast at all. (More on this later: the model surprised us with what it managed in 40 chars.)
The Pydantic schema is also doing safety work. Every model output is parsed through SurviveAction.model_validate(json_dict), and parse failures fall back to a single random action and are counted in the parse-failure metric. Across the entire trained-policy eval, zero parses failed.
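The parse-then-fallback pattern can be sketched with the validator injected as a callable, so the example runs on the standard library alone. Everything named here (`parse_or_fallback`, `FALLBACK_ACTIONS`, `toy_validate`, the `stats` dict) is hypothetical, and restricting the fallback to argument-free actions is an assumption rather than something the post states.

```python
import json
import random
from typing import Any, Callable

# Assumption: the fallback samples only actions that need no extra arguments.
FALLBACK_ACTIONS = ["move_up", "move_down", "move_left", "move_right", "eat", "wait"]


def parse_or_fallback(
    raw: str,
    agent_id: int,
    validate: Callable[[dict], Any],
    rng: random.Random,
    stats: dict,
) -> Any:
    """Validate one model output; on failure, count it and return a random action."""
    stats["total"] += 1
    try:
        return validate(json.loads(raw))
    except (ValueError, TypeError):  # json.JSONDecodeError is a ValueError
        stats["parse_failures"] += 1
        fallback = {"agent_id": agent_id, "action_type": rng.choice(FALLBACK_ACTIONS)}
        return validate(fallback)


def toy_validate(payload: dict) -> dict:
    """Stand-in validator for the demo; swap in SurviveAction.model_validate."""
    allowed = set(FALLBACK_ACTIONS) | {"vote_lockout", "broadcast"}
    if not isinstance(payload, dict) or payload.get("action_type") not in allowed:
        raise ValueError("invalid action payload")
    return payload
```

In the real env you would pass `SurviveAction.model_validate` as `validate`. Pydantic's `ValidationError` is a `ValueError` subclass, so the same `except` clause should cover it, though that is worth confirming on your installed version.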
Environment layout
A fixed 10×10 grid, identical on every seed. Only the infected-agent assignment varies from seed to seed.
```text
Z . . . . . . . . Z    Walls (8): scattered chokepoints
. F . . . . . . F .    Food (4): inner-corner positions (1,1), (1,8), (8,1), (8,8)
. . . . . # . . . .    Safehouse: 3×3 block at the centre (rows 4-6, cols 4-6)
. . . # . . # . . .    Zombies: 3, spawned at three of the four grid corners
. . . . S S S . . .
. . # . S S S # . .
. . . . S S S . . .
. . . # . # # . . .
. F . . . . . . F .
. . . . . . . . . Z
```
| Element | Count | Behaviour |
|---|---|---|
| Agents | 3 | Start in the safehouse with hp=3, hunger=0. Infected agent's hunger rises 1.5× faster. |
| Zombies | 3 | Move 1 cell/step toward the nearest non-safehouse agent via BFS. Cannot enter safehouse cells. |
| Food cells | 4 | Eating resets hunger to 0. Finite resource per episode. |
| Safehouse cells | 9 | Heal 1 HP per occupied step. Zombie-free. |
| Wall cells | 8 | Block both agent and zombie movement. Create chokepoints. |
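The zombie rule in the table ("move 1 cell/step toward the nearest non-safehouse agent via BFS") can be sketched as a breadth-first search that returns only the first step of a shortest path. This is an illustrative version, not the repo's implementation: `bfs_next_step`, 4-connected movement, and the neighbour ordering used to break ties are all assumptions. For a zombie, `blocked` would be the wall cells plus the safehouse cells.

```python
from collections import deque

Cell = tuple[int, int]


def bfs_next_step(start: Cell, targets: set[Cell], blocked: set[Cell], size: int = 10) -> Cell:
    """Return the first cell on a shortest path from start to the nearest target.

    Returns start unchanged if it is already on a target or no target is reachable.
    """
    if not targets or start in targets:
        return start
    queue = deque([start])
    first_move = {start: start}  # cell -> first step taken from start to reach it
    while queue:
        row, col = queue.popleft()
        for d_row, d_col in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nxt = (row + d_row, col + d_col)
            if not (0 <= nxt[0] < size and 0 <= nxt[1] < size):
                continue  # off the grid
            if nxt in blocked or nxt in first_move:
                continue  # wall, safehouse, or already visited
            first_move[nxt] = nxt if (row, col) == start else first_move[(row, col)]
            if nxt in targets:
                return first_move[nxt]
            queue.append(nxt)
    return start
```

Because BFS explores cells in order of distance, the first target it reaches is a nearest one, so a single search handles both "which agent" and "which direction".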
Episode phases
| Phase | Steps | Mechanic |
|---|---|---|
| Pre-reveal | 1–29 | Normal survival. Infected agent's hunger rises 1.5× faster (the only behavioural cue). |
| Post-reveal | 30–49 | Infected agent learns its status and begins attacking adjacent agents on its turn. |
| Vote | 50 | All living agents cast vote_lockout(target_id). Majority locks one out. |
| Post-vote | 51–100 | Locked-out agent denied safehouse healing. Survive to win. |
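The phase table maps directly onto a step-to-phase lookup plus a vote tally. The sketch below uses hypothetical names (`phase_for_step`, `resolve_lockout`) and assumes "majority" means a strict majority of the living voters; the post does not say how ties or abstentions are handled.

```python
from collections import Counter
from typing import Optional

REVEAL_STEP, VOTE_STEP, HORIZON = 30, 50, 100


def phase_for_step(step: int) -> str:
    """Map a 1-indexed step number to the phase names used in the table."""
    if not 1 <= step <= HORIZON:
        raise ValueError(f"step must be in 1..{HORIZON}, got {step}")
    if step < REVEAL_STEP:
        return "pre_reveal"
    if step < VOTE_STEP:
        return "post_reveal"
    if step == VOTE_STEP:
        return "vote"
    return "post_vote"


def resolve_lockout(votes: dict[int, Optional[int]]) -> Optional[int]:
    """votes maps voter id -> target id (None = no vote cast).

    Assumption: a target is locked out only with a strict majority of voters.
    """
    cast = [target for target in votes.values() if target is not None]
    if not cast:
        return None
    target, count = Counter(cast).most_common(1)[0]
    return target if count > len(votes) / 2 else None
```

With three living agents, two matching votes are enough to lock someone out, and a three-way split locks out nobody under this assumption.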
Reward design: three rubrics, all deterministic
Three independent rubrics compose into the per-step reward. No LLM judge anywhere.
| Rubric | Type | Headline signals |
|---|---|---|
| SurvivalRubric | Dense, per-step | +0.005 alive · +0.05 eat · −0.10/HP damage · −0.05 if hunger ≥ 10 · −0.50 death |
| VoteRubric | Sparse (step 50) | +0.30 correct vote · −0.20 wrong vote · −0.05 null · adversarial scoring for the infected |
| GroupOutcomeRubric | Terminal | +0.40 per surviving healthy agent · +0.30 if infected neutralised · −0.20 per dead healthy |
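The survival rubric in closed form, using the coefficients from the table above (here $\Delta\mathrm{HP}^{-}_t$ is the HP lost at step $t$):

$$
r^{\text{surv}}_t = 0.005\,\mathbb{1}[\text{alive}_t] + 0.05\,\mathbb{1}[\text{eat}_t] - 0.10\,\Delta\mathrm{HP}^{-}_t - 0.05\,\mathbb{1}[\text{hunger}_t \ge 10] - 0.50\,\mathbb{1}[\text{death}_t]
$$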
Composition is just a sum-and-clip:
```python
def compose_reward(state: EpisodeState, agent_id: int) -> tuple[float, float]:
    """Compose all rubrics into a single reward.

    Returns:
        (clipped_reward, raw_reward)
        clipped_reward is in (0.01, 0.99) for OpenEnv compliance
        raw_reward is the unclipped sum for debugging
    """
    raw = (
        survival_reward(state, agent_id)
        + vote_reward(state, agent_id)
        + group_outcome_reward(state, agent_id)
    )
    clipped = _clip(raw)  # max(0.01, min(0.99, raw))
    return clipped, raw
```
The clip into (0.01, 0.99) is the OpenEnv R1 validator's strict open interval. The raw signed reward is preserved in obs.metadata["raw_reward"] so you can debug the actual gradient signal during training.
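A small worked example shows why keeping the raw value matters. `clip_reward` below mirrors the inline comment on `_clip`; the raw sums are hypothetical combinations of coefficients from the rubric table, not logged values.

```python
def clip_reward(raw: float, low: float = 0.01, high: float = 0.99) -> float:
    """Clamp a raw reward to the range used in compose_reward's inline comment."""
    return max(low, min(high, raw))


# Hypothetical raw per-step sums built from the rubric table.
examples = {
    "alive and ate": 0.005 + 0.05,
    "alive but hunger >= 10": 0.005 - 0.05,
    "death step": -0.50,
    "terminal, two healthy survivors, infected neutralised": 2 * 0.40 + 0.30,
}
clipped = {name: clip_reward(raw) for name, raw in examples.items()}
```

Both negative cases collapse to 0.01 after clipping, so the clipped reward alone cannot tell a mildly hungry step from a death. The raw value in `obs.metadata["raw_reward"]` keeps that distinction.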
Training: 12 GRPO steps in 3 h 53 min on a Colab T4
Qwen2.5-3B-Instruct fine-tuned with LoRA on attention projections. Optimisation: GRPO from HuggingFace TRL.
| Knob | Value | Knob | Value |
|---|---|---|---|
| Base model (train) | unsloth/Qwen2.5-3B-Instruct-bnb-4bit | GRPO group size | 4 |
| Base model (eval) | Qwen/Qwen2.5-3B-Instruct (fp16) | Per-device batch | 1 |
| LoRA rank | 16 | Gradient accum. | 16 |
| LoRA α | 32 | Learning rate | 1e-5 |
| Target modules | q, k, v, o_proj | LR schedule | cosine → 0 |
| Max steps | 12 | KL coefficient | 0.04 |
| Save cadence | every step | Temperature | 0.9 |
| Max prompt len | 1024 tokens | Max completion | 512 tokens |
| Wallclock | 13,972 s ≈ 3 h 53 min | Per-step time | ≈1166 s |
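For readers who want to reproduce the setup, the table can be written down as plain keyword dictionaries. This is a sketch, not the training notebook: the argument names follow TRL's `GRPOConfig` and PEFT's `LoraConfig` as they exist in recent releases, the expansion of "q, k, v, o_proj" to four module names is an assumption, and all of it should be checked against your installed versions before use.

```python
# Keyword arguments intended for peft.LoraConfig(**LORA_KWARGS).
LORA_KWARGS = dict(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed expansion
    task_type="CAUSAL_LM",
)

# Keyword arguments intended for trl.GRPOConfig(**GRPO_KWARGS).
GRPO_KWARGS = dict(
    max_steps=12,
    save_steps=1,                    # save cadence: every step
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    num_generations=4,               # GRPO group size
    beta=0.04,                       # KL coefficient
    temperature=0.9,
    max_prompt_length=1024,
    max_completion_length=512,
    push_to_hub=True,
    hub_strategy="every_save",
    hub_model_id="noanya/zombiee",
)

# Sanity check: completions per optimizer step should split into whole groups.
effective_batch = (
    GRPO_KWARGS["per_device_train_batch_size"]
    * GRPO_KWARGS["gradient_accumulation_steps"]
)
groups_per_step = effective_batch // GRPO_KWARGS["num_generations"]
```

How TRL maps batch size onto generation groups has changed between versions, so treat `groups_per_step` as a rough guide rather than a guarantee.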
The unconventional choice: save_steps=1
The single most operationally important decision was save-every-step. With save_steps=1 plus hub_strategy="every_save", every gradient update produces a Hub checkpoint within ~20 minutes. Free Colab/Kaggle sessions die unpredictably:
- With `MAX_STEPS=500 / SAVE_STEPS=50`, the first save fires three hours into training. A 2-hour disconnect means losing everything.
- With `MAX_STEPS=12 / SAVE_STEPS=1`, the first save fires after step 1 (~20 min). Worst-case loss ≈ 19 minutes.
Both Colab and Kaggle runners pushed to the same Hub repo (noanya/zombiee) and resumed from each other without manual surgery. Cross-machine training resilience by accident turned out to be a feature, not a workaround.
Training dynamics
Left: group-mean reward across the two TRL log points (logging fired at step 10 and the end-of-training summary at step 12), shaded band shows ±1σ across the GRPO group. Right: log-scaled snapshot of the four key training metrics at end-of-run.
Final-step values: loss = 1.7e-4, reward = 0.021, reward_std = 0.014, KL = 3.4e-3.
KL divergence stayed below 5e-3 throughout: the trained policy never strayed far from base Qwen-3B, consistent with the small group reward standard deviation (≈0.014) and weak GRPO gradients on the 12-step run. The extended run (later in this post) shows what happens when training continues past this plateau.
Step-12 evaluation
We ran the trained policy against a uniform-random baseline. Sample sizes were modest given the LLM-driven eval cost (~3 minutes per trained episode).
| Metric | Baseline (n=30) | Trained (n=10) | Δ |
|---|---|---|---|
| Survival rate | 0.0 % (0/30) | 10.0 % (1/10) | +10 pp |
| Vote accuracy | undefined† | 0.0 % (0/1 vote fired) | n/a |
| Mean total episode reward | 0.457 | 0.797 ± 0.41 | +0.34 (1.7×) |
| Mean episode length | 19.1 ± 7.3 steps | 37.6 ± 22.1 steps | +18.5 (2.0×) |
| JSON parse-success rate | 100 % (random) | 100 % (0 fails) | n/a |

† None of the 30 baseline episodes reached step 50, so the vote phase never fired; vote accuracy is undefined for the baseline.
Step-12 eval: baseline (n=30) vs trained (n=10) on the three primary metrics. Source: noanya/zombiee/eval_results/eval_step_0012_bars.png, generated by the eval notebook.
Two things worth pulling out of this table:
Mean episode length doubled. The trained policy keeps agents alive, on average, roughly twice as long as random. That's a much more robust signal than survival rate at this sample size: it's a continuous-valued comparison with smaller standard errors, and the gap (37.6 vs 19.1) is well outside any plausible noise floor.
100% action-grammar compliance. Across the entire trained-policy eval, every single LLM output parsed cleanly as a SurviveAction. Zero parse failures, zero fallbacks to random. The policy fully internalised the 8-action schema.
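The episode-length comparison can be checked with a back-of-the-envelope Welch statistic from the table's summary numbers. This assumes the ± values in the table are sample standard deviations, which the post does not state; `welch_stat` is a hypothetical helper.

```python
import math


def welch_stat(mean_a: float, sd_a: float, n_a: int,
               mean_b: float, sd_b: float, n_b: int) -> tuple[float, float]:
    """Return (standard error of the difference, Welch t statistic)."""
    se = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return se, (mean_a - mean_b) / se


# Trained (n=10) vs baseline (n=30) mean episode length, from the table.
se, t_stat = welch_stat(37.6, 22.1, 10, 19.1, 7.3, 30)
# se is roughly 7.1 steps and t_stat roughly 2.6
```

Most of the standard error comes from the ten trained episodes, so extra trained-policy episodes would tighten this estimate far more than extra baseline runs.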
Cross-checkpoint trend: a partial-snapshot eval at ≈step 10 (smaller n, sanity check) vs the proper step-12 eval (n_t=10, n_b=30). Step-12 is the headline; both evaluations show positive deltas vs baseline.
Behaviour we actually saw
A 100-step episode at near-maximum reward
In one of the ten trained-policy episodes, the policy completed the entire 100-step horizon with at least one healthy agent alive at termination, accumulating a total reward of 1.965, within ~1.2 % of the theoretical episode-reward ceiling. The other nine episodes terminated mid-game.
The bimodal outcome distribution (one full survival, nine mid-game deaths) suggests a partial strategy that closes the loop when initial conditions are favourable but is brittle to early adverse zombie spawns or hunger trajectories. With n=10 it's small-N, but the existence of a 100-step episode is evidence the policy can in principle solve the task end-to-end.
The emergent broadcast
The most surprising single artefact of the entire project came at t < 30 in an episode where A2 was the silently-infected agent. The trained policy emitted:
"I notice A2 is very hungry and may be infected soon."
The reasoning chain packed into that one short message:
- Identified a specific peer (A2)
- Referenced the correct behavioural cue (hunger rate)
- Made the right inference (infected)
- Broadcast to the others
- Did all of this before the t=30 reveal, purely from the hunger-rate differential
Anecdotal, but it exemplifies the env's central premise: text-channel theory-of-mind can emerge from a small RL loop given the right information structure.
The Money Chart
After the hackathon-window 12-step run, we trained an extended 4000-step run on Kaggle (artefacts at noanya/zombiee-v1-extended). Headline numbers held up directionally: survival rate climbed to 12 %, and vote accuracy reached 20 % on episodes that hit the vote phase. But the most informative chart from the entire project is this one:
Per-step mean suspicion on the true infected agent across an episode. Trained policy (blue) climbs from ~0.33 (chance for a 1-of-3 hidden role) at t=1 to ~1.0 by t≈80; uniform-random baseline (orange) stays at chance throughout. Vertical markers: red dashed = infection reveal at t=30, green dashed = vote at t=50.
Read this chart carefully. The trained policy:
- Starts at chance (~0.33). No information at t=1, so the prior is uniform across the three agents.
- Stays at chance until t≈30. The hunger differential is too small to read in the first 30 steps.
- Bends upward at t=30. Exactly when the infection reveals (and the infected starts attacking): visible in the broadcasts, in the post-reveal aggression, and in the divergent hunger trajectory now being unmistakable.
- Saturates near 1.0 by t≈80. By the end of the episode, the trained policy is essentially certain who the infected agent is.
The random baseline produces a flat line at 0.33 because random voting and random observation produce no information.
This is direct, quantitative, longitudinal evidence of the hidden-role theory-of-mind signal the env was designed to elicit. And it's only measurable on the extended run: the v1 step-12 eval terminates too early in most episodes for the post-reveal portion of the trajectory to develop.
Why 20% vote accuracy is below chance, and why the chart explains it
Vote accuracy on the extended run came in at 20 %, which sits below the 33 % uniform-pick floor for a 1-of-3 categorical choice. Read against the suspicion trajectory above, this is exactly what the chart predicts:
at t=50, the mean suspicion on the true infected has only just crossed ~0.55.
The categorical commitment demanded by a single t=50 vote is premature relative to where the soft posterior becomes reliable (~0.9 around t=70). The policy has the information, just not in a categorical-enough form yet at t=50.
Fix: delay the vote, or run iterated votes at t ∈ {30, 60, 90} instead of one at t=50. We left it as future work because it's an env design change, not a training fix.
Across-training trajectories
Left: survival rate across training checkpoints, climbing from 0% to 12% between steps 3000 and 4000. Right: vote correctness saturates at ~20% from step 3000 onwards. Both metrics stay flat at the random-policy floor (0%) for the baseline because random episodes never reach the vote phase.
Both metrics need real wallclock GRPO time before they leave the floor. Survival lifts off only between 3000 and 4000 steps; vote correctness lifts off between 2000 and 3000. The 12-step v1 run was nowhere near this threshold, which is why v1's vote accuracy was 0/1 (effectively undefined) and why we framed v1 as a directional result and the extended run as the corroboration.
What worked
- OpenEnv compliance was first-pass. All four R1 validator traps (reward bounds in `(0.01, 0.99)`, health endpoint string, per-step reward, full determinism) were preempted via patterns codified during planning.
- Format learning is complete. 100 % JSON parse rate across the trained eval. Zero unparseable outputs.
- Reward direction is unambiguous. Mean total reward 1.7× and mean episode length 2.0× both moved well outside the noise floor.
- Hidden-role signal lights up under more compute. The extended run's per-step suspicion trajectory is the clearest direct evidence of the env's central design bet.
- Cross-machine training resilience. `hub_strategy="every_save"` with step-1 save cadence made disconnects cost <20 min of compute; both Colab and Kaggle runners resumed from the same Hub repo without manual surgery.
Honest limitations
The constraint shaping every limitation below is compute. Free-tier Colab T4 (15.6 GB, no native bf16); LLM-driven evaluation costs ~3 minutes per trained episode.
- Compute budget. 12 GRPO steps in 3 h 53 min; KL drift `<5e-3` at step 12, so the policy had not converged. The extended Kaggle run partially closes this gap.
- Reward-signal weakness. The reward hook scores only the first model action; the remaining ~99 steps are uniform random. The GRPO group reward standard deviation (σ ≈ 0.014) is therefore dominated by rollout RNG, which leaves a weak gradient signal. Multi-step model rollouts would tighten this but multiply per-episode compute by 10–20×.
- Behavioural-cue leakage. `infection.py` emits explicit text cues (e.g. "A1 is unusually hungry") into observations. A trained agent can string-match these rather than reason about hunger trajectories. Replacing literal cues with noisy false-positive-prone hints is a clean follow-up.
- Sample size. n=10 gives a 95 % binomial CI for survival of ≈[0.25 %, 44.5 %] around 1/10. Reward and episode-length deltas are larger relative to within-class σ, so those are more robust than the survival headline.
- Single map. All evaluation episodes use the same fixed grid; generalisation to varied layouts is untested.
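The survival interval quoted in the sample-size bullet matches an exact (Clopper-Pearson) binomial interval for 1 success in 10 trials. The sketch below reproduces it with the standard library only, using bisection on the binomial tail; the function names are hypothetical, and in practice a stats library would do the same job.

```python
import math


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_increasing(f, target: float, iters: int = 100) -> float:
    """Bisection for f(p) = target where f is increasing on [0, 1]."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact two-sided (1 - alpha) confidence interval for a binomial proportion."""
    # Lower bound: P(X >= k | p) = alpha / 2. Upper bound: P(X <= k | p) = alpha / 2.
    lower = 0.0 if k == 0 else _solve_increasing(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2)
    upper = 1.0 if k == n else _solve_increasing(
        lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2)
    return lower, upper


low, high = clopper_pearson(1, 10)  # about 0.0025 and 0.445
```

The same call with `k=0, n=30` gives the baseline's interval, which is a quick way to see how much the two survival intervals overlap.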
Where this goes next
The architecture is deliberately additive: the OpenEnv contract, post-mortem mechanism, and LoRA pipeline all generalise to a richer follow-up with no breaking changes to the action space or reward interface. On an A100/H100 with native bf16, each direction below is a 1–2 day extension rather than a redesign.
| Direction | Why it's interesting | Why it's compute-gated, not implementation-gated |
|---|---|---|
| Larger team, multiple hidden roles | 5 agents, 2 hidden roles (biter + saboteur) increases social-deduction signal-to-noise | Larger group → more rollout per GRPO step → ~5× compute |
| Iterated voting | Vote at t∈{30,60,90} so the categorical decision lands after the soft posterior has saturated (Fig. above directly motivates this) | Only an env-side change, but it still needs a fresh GRPO run to measure |
| Resource scarcity & inventory | Distinct food/water/medicine + 3-slot inventory force inter-agent coordination beyond pure broadcast | Larger state → longer prompts → more tokens/step |
| Day/night and zombie waves | Visibility cycles + scheduled wave spawns at t∈{25,50,75} stretch long-horizon survival further | More state, longer episodes |
| Noisy behavioural cues | Replace literal-string cues (string-matchable) with false-positive-prone hints | Pure design change; needs retraining |
| Multi-step model rollouts in the reward hook | Keep the model in the loop for K steps before random rollout | Each K=10 increment scales reward-fn compute by ~10× |
| Zero-shot transfer experiment | v1-LoRA-zero-shot vs from-scratch vs warm-started on a richer env | Three full GRPO runs on the new env |
A tour of the live demo
The live web demo at zombiee-tau.vercel.app is a Vercel-hosted React frontend that talks to whichever HF Space backend you pick. Three views, in order:
Landing page. The headline reduces the env to its single most concrete sentence. The right-hand panel embeds a real live-running mini-simulation (LIVE SIMULATION · SEED 7) so visitors see the env actually executing before they click anything.
Live demo page. Backend selector at top: switch between zombiee (v1, step-12 adapter) and zombiee-v1-extended (4000-step adapter) on the same page. Every action is a real POST /step against the chosen Space; the bottom panel shows the actual JSON request and observation response side-by-side, and the network log above ties each call to the matching line in the Space's container log via the X-Zombiee-Session header.
Training/Research dashboard. Aggregates the same charts that appear in the report into one scrollable view: the GRPO training curve (left), step-12 eval against the random baseline (right), per-checkpoint eval bars (bottom-left), and the cross-checkpoint trend (bottom-right). One page to glance at all the numbers behind the headline.
Try it yourself
| Resource | URL |
|---|---|
| Live web demo (Vercel React frontend) | https://zombiee-tau.vercel.app |
| Trained adapter, extended (4000 steps, served by demo) | noanya/zombiee-v1-extended |
| Trained adapter, v1 (step-12, anchors the report) | noanya/zombiee |
| HF Space, env API (extended) | spaces/noanya/zombiee-v1-extended |
| HF Space, env API (v1) | spaces/noanya/zombiee |
| Source code | github.com/SirjanSingh/zombiee |
| Full report (LaTeX) | report/v1/v1.tex |
| Reproducible Colab training (T4, 12 steps, ~4h) | notebooks/train_colab.ipynb |
| Extended Kaggle training (4000 steps) | notebooks/train_v1_kaggle_extend.ipynb |
| Eval notebooks | eval_colab, eval_v1_kaggle_extend |
The OpenEnv contract is in survivecity_env/env.py, the rubric composition is in survivecity_env/rubric.py, and the post-mortem generator is in survivecity_env/postmortem.py. All three are deliberately small (<200 lines each) and all rule-based, with no LLM judge anywhere in the loop.
Comments, forks, and (especially) attempts at iterated voting, multi-role, or multi-step rollout extensions are very welcome.
Built for the Meta × PyTorch × Scaler OpenEnv Hackathon
Sirjan Singh · Eeshan Singh · Team PyGuys · April 2026
Star us on GitHub · Follow on HuggingFace