# Demo outline
## Open these tabs first
Local resources:
- `pres/index.html`
- `pres/training_results.html`
- `pres/trajectory.html`
- `pres/training_script.html`
- `pres/reward_curve.png`
- `pres/before_after_reward.png`
Remote resources:
- HF Space repo: <https://huggingface.co/spaces/thomasm6m6/freeciv_env>
- HF Space app: <https://thomasm6m6-freeciv-env.hf.space>
Supporting files:
- reward data: `pres/reward_steps.csv`
- training script: `scripts/train_grpo_fast.py`
- env config: `openenv.yaml`
## What we have ready
- real OpenEnv environment for Freeciv
- real live backend on H100 via Freeciv Web
- successful GRPO training run on the live backend
- reward curve PNG
- before/after reward PNG
- live trajectory page with real observations + legal actions
- note: use reward improvement as the before/after story; raw checkpoint-to-checkpoint action examples were too noisy to be worth showing live
- minimal training script page
- HF Space deployed: `thomasm6m6/freeciv_env`
## What not to spend time on
- long architecture explanation
- low-level websocket/runtime debugging
- model internals
- many charts
Use the product demo + reward improvement as the center of the pitch.
---
## 1-minute YouTube flow
### 0:00–0:10
Open: `pres/trajectory.html`
Say:
- We built a real OpenEnv environment for Freeciv, a long-horizon strategy game.
- The model sees text observations and legal actions, and acts turn by turn against a live backend.
### 0:10–0:22
Stay on `pres/trajectory.html`
Say:
- This is not a toy prompt task.
- It has delayed reward, persistent world state, multiple units, city-building, and long-horizon planning.
- That maps directly to the hackathon’s long-horizon planning and world-modeling tracks.
### 0:22–0:38
Switch to `pres/training_script.html`
Say:
- We also built the minimal RL training loop with Unsloth + TRL GRPO.
- The script collects live Freeciv states, formats them into prompts, and trains a policy on the real environment.
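The formatting step can be sketched roughly like this. This is a hedged illustration, not the actual code in `scripts/train_grpo_fast.py`; the function name and observation fields (`turn`, `economy`, `units`, `legal_actions`) are assumptions.

```python
# Hypothetical sketch of the observation-to-prompt step that runs before GRPO
# training. Field names are illustrative, not the script's real interface.

def format_observation(obs: dict) -> str:
    """Render a Freeciv observation dict as a text prompt for the policy."""
    lines = [
        f"Turn {obs['turn']}",
        f"Economy: gold={obs['economy']['gold']}, science={obs['economy']['science']}",
        "Units: " + "; ".join(obs["units"]),
        "Legal actions:",
    ]
    lines += [f"  {i}. {a}" for i, a in enumerate(obs["legal_actions"], 1)]
    lines.append("Choose one legal action by number.")
    return "\n".join(lines)

example = {
    "turn": 3,
    "economy": {"gold": 52, "science": 4},
    "units": ["Settlers at (12, 8)", "Warriors at (12, 9)"],
    "legal_actions": ["end_turn", "move_unit", "build_city"],
}
prompt = format_observation(example)
print(prompt)
```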
### 0:38–0:55
Switch to `pres/training_results.html`
Say:
- We ran training on the H100 against the live Freeciv backend.
- Reward improved from 0.125 at the start to 1.0 by the end of the run.
- This gives observable training progress, which is the key hackathon requirement.
### 0:55–1:00
Optional final cut to HF Space repo URL
Say:
- The environment is packaged as OpenEnv and deployed to Hugging Face Spaces for submission.
---
## 3-minute live pitch flow
### 0:00–0:25 — problem
Open: `pres/trajectory.html`
Say:
- We wanted a real LLM RL environment for long-horizon strategic planning.
- Freeciv is a strong fit because it has persistent state, delayed reward, many legal actions, and requires planning across turns.
### 0:25–1:05 — show the environment
Stay on `pres/trajectory.html`
Point out:
- text-first observation
- legal actions
- units / cities / economy summaries
- live backend on H100
Say:
- The agent does not get a canned benchmark prompt.
- It interacts with a real running world and must choose from legal actions each turn.
### 1:05–1:35 — show the training loop
Open: `pres/training_script.html`
Say:
- This is the minimal GRPO loop.
- We use live Freeciv sessions, prepare observations, build prompts, and train with Unsloth + TRL.
- The important thing is that the training loop is small and actually runs on the real backend.
### 1:35–2:25 — show training improvement
Open: `pres/training_results.html`
Say:
- This is the core result.
- Reward increases over training steps on real Freeciv states.
- Start: 0.125. End: 1.0.
- This is the evidence that the environment and reward pipeline are coherent enough to drive learning.
If short on time, only show:
- reward curve
- before/after reward bars
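If asked how a reward can move from 0.125 to 1.0 on a scale like this, one simple design is to score whether the model's completion names exactly one legal action. The sketch below illustrates that idea; it is an assumption about the reward design, not the actual pipeline behind `pres/reward_steps.csv`.

```python
# Illustrative reward sketch: 1.0 when the completion names exactly one legal
# action, 0.0 otherwise. This is an assumed design, not the project's actual
# reward function.

def action_reward(completion: str, legal_actions: list[str]) -> float:
    """Return 1.0 if the completion contains exactly one legal action name."""
    matches = [a for a in legal_actions if a in completion]
    return 1.0 if len(matches) == 1 else 0.0

legal = ["end_turn", "move_unit", "build_city"]
print(action_reward("I will build_city now.", legal))  # 1.0
print(action_reward("do something", legal))            # 0.0
```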
### 2:25–2:50 — why this matters
Stay on `pres/training_results.html`
Say:
- This fits Statement 2: long-horizon planning.
- It also fits Statement 3.1: world modeling, because the agent interacts with a real dynamic system and must maintain state over time.
### 2:50–3:00 — close
Open: HF Space repo URL or `pres/index.html`
Say:
- The environment is packaged in OpenEnv, runs with a real backend, has a minimal RL script, and already shows reward improvement.
---
## Likely Q/A answers
### Why Freeciv?
- It is long-horizon, strategic, partially observable, and naturally multi-step.
- It is much closer to real planning than one-shot QA.
### What exactly is the observation/action interface?
- Observation is text-first: turn summary, economy, units, cities, map, legal actions.
- Actions are structured: end turn, move unit, build city, set city production, set research.
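For illustration, structured actions of those five kinds could be serialized along these lines. The key names (`type`, `unit_id`, `city_id`, etc.) are hypothetical, not the environment's actual wire format.

```python
import json

# Hypothetical shape of the structured actions listed above; key names are
# illustrative, not the environment's exact schema.
actions = [
    {"type": "end_turn"},
    {"type": "move_unit", "unit_id": 101, "direction": "north"},
    {"type": "build_city", "unit_id": 101, "name": "Alpha"},
    {"type": "set_city_production", "city_id": 7, "item": "Warriors"},
    {"type": "set_research", "tech": "Bronze Working"},
]
payload = json.dumps(actions[1])
print(payload)
```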
### Is the backend real?
- Yes. Training was run against a live Freeciv Web backend on the H100.
### What evidence do you have that training worked?
- The reward curve in `pres/training_results.html`.
- It rises from 0.125 to 1.0 during the live run.
### Why not show a bigger model?
- For the hackathon, reliability and observable reward improvement mattered more than model scale.
- A smaller model let us get an end-to-end live run working on the real backend.
### What is still incomplete?
- The environment currently exposes a small action subset rather than the full Freeciv action surface.
- The main accomplishment is that live interaction and RL training now work end to end.
---
## If something breaks during the pitch
Fallback tab order:
1. `pres/training_results.html`
2. `pres/trajectory.html`
3. `pres/training_script.html`
4. HF Space repo URL
If the live environment demo is flaky, just narrate from the trajectory page and go straight to the reward curve.