# Demo outline

## Open these tabs first

Local resources:

- `pres/index.html`
- `pres/training_results.html`
- `pres/trajectory.html`
- `pres/training_script.html`
- `pres/reward_curve.png`
- `pres/before_after_reward.png`

Remote resources:

- HF Space repo: <https://huggingface.co/spaces/thomasm6m6/freeciv_env>
- HF Space app: <https://thomasm6m6-freeciv-env.hf.space>

Supporting files:

- reward data: `pres/reward_steps.csv`
- training script: `scripts/train_grpo_fast.py`
- env config: `openenv.yaml`
## What we have ready

- real OpenEnv environment for Freeciv
- real live backend on H100 via Freeciv Web
- successful GRPO training run on the live backend
- reward curve PNG
- before/after reward PNG
- live trajectory page with real observations + legal actions
- note: use reward improvement as the before/after story; raw checkpoint-to-checkpoint action examples were too noisy to be worth showing live
- minimal training script page
- HF Space deployed: `thomasm6m6/freeciv_env`
## What not to spend time on

- long architecture explanation
- low-level websocket/runtime debugging
- model internals
- many charts

Use the product demo + reward improvement as the center of the pitch.
---
## 1-minute YouTube flow

### 0:00–0:10

Open: `pres/trajectory.html`

Say:

- We built a real OpenEnv environment for Freeciv, a long-horizon strategy game.
- The model sees text observations and legal actions, and acts turn by turn against a live backend.

### 0:10–0:22

Stay on `pres/trajectory.html`

Say:

- This is not a toy prompt task.
- It has delayed reward, persistent world state, multiple units, city-building, and long-horizon planning.
- That maps directly to the hackathon’s long-horizon planning and world-modeling tracks.
### 0:22–0:38

Switch to `pres/training_script.html`

Say:

- We also built a minimal RL training loop with Unsloth + TRL GRPO.
- The script collects live Freeciv states, formats them into prompts, and trains a policy on the real environment.

### 0:38–0:55

Switch to `pres/training_results.html`

Say:

- We ran training on the H100 against the live Freeciv backend.
- Reward improved from 0.125 at the start to 1.0 by the end of the run.
- This gives observable training progress, which is the key hackathon requirement.

### 0:55–1:00

Optional final cut to the HF Space repo URL

Say:

- The environment is packaged as OpenEnv and deployed to Hugging Face Spaces for submission.
---
## 3-minute live pitch flow

### 0:00–0:25 — problem

Open: `pres/trajectory.html`

Say:

- We wanted a real LLM RL environment for long-horizon strategic planning.
- Freeciv is a strong fit because it has persistent state, delayed reward, many legal actions, and requires planning across turns.

### 0:25–1:05 — show the environment

Stay on `pres/trajectory.html`

Point out:

- text-first observation
- legal actions
- units / cities / economy summaries
- live backend on H100

Say:

- The agent does not get a canned benchmark prompt.
- It interacts with a real running world and must choose from legal actions each turn.
### 1:05–1:35 — show the training loop

Open: `pres/training_script.html`

Say:

- This is the minimal GRPO loop.
- We use live Freeciv sessions, prepare observations, build prompts, and train with Unsloth + TRL.
- The important thing is that the training loop is small and actually runs on the real backend.
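If anyone asks how "prepare observations, build prompts" works, the formatting step can be sketched as a pure function over a text-first observation. The field names here (`turn`, `economy`, `units`, `legal_actions`) are illustrative, not the env's actual schema:

```python
def build_prompt(obs: dict) -> str:
    """Turn a text-first Freeciv observation into a training prompt.

    Field names are illustrative; the real env schema may differ.
    """
    lines = [
        f"Turn {obs['turn']}",
        f"Economy: {obs['economy']}",
        f"Units: {', '.join(obs['units'])}",
        "Legal actions:",
    ]
    # One bullet per legal action, so the policy picks from a closed set.
    lines += [f"- {a}" for a in obs["legal_actions"]]
    lines.append("Choose one legal action.")
    return "\n".join(lines)
```

Prompts built this way feed straight into the GRPO trainer as the per-step input text.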
### 1:35–2:25 — show training improvement

Open: `pres/training_results.html`

Say:

- This is the core result.
- Reward increases over training steps on real Freeciv states.
- Start: 0.125. End: 1.0.
- This is the evidence that the environment and reward pipeline are coherent enough to drive learning.

If short on time, only show:

- reward curve
- before/after reward bars

### 2:25–2:50 — why this matters

Stay on `pres/training_results.html`

Say:

- This fits Statement 2: long-horizon planning.
- It also fits Statement 3.1: world modeling, because the agent interacts with a real dynamic system and must maintain state over time.

### 2:50–3:00 — close

Open: the HF Space repo URL or `pres/index.html`

Say:

- The environment is packaged in OpenEnv, runs with a real backend, has a minimal RL script, and already shows reward improvement.
---
## Likely Q&A answers

### Why Freeciv?

- It is long-horizon, strategic, partially observable, and naturally multi-step.
- It is much closer to real planning than one-shot QA.

### What exactly is the observation/action interface?

- Observation is text-first: turn summary, economy, units, cities, map, legal actions.
- Actions are structured: end turn, move unit, build city, set city production, set research.
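If pressed on what "structured" means, a minimal sketch with illustrative dataclasses works (the real interface's names and fields may differ):

```python
from dataclasses import dataclass

# Illustrative action shapes; the env's actual schema may differ.
@dataclass(frozen=True)
class EndTurn:
    pass

@dataclass(frozen=True)
class MoveUnit:
    unit_id: int
    direction: str  # e.g. "north"

@dataclass(frozen=True)
class BuildCity:
    unit_id: int

def is_legal(action, legal_actions) -> bool:
    # The env exposes the legal set each turn; anything else is rejected.
    return action in legal_actions
```

The point of the closed legal-action set is that the policy can never emit a structurally invalid move, only a strategically bad one.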
### Is the backend real?

- Yes. Training was run against a live Freeciv Web backend on the H100.

### What evidence do you have that training worked?

- The reward curve in `pres/training_results.html`.
- It rises from 0.125 to 1.0 during the live run.
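The endpoints can be recomputed on the spot from `pres/reward_steps.csv`. The column names below are an assumption about that file's layout, and the sample rows simply mirror the reported start/end rewards:

```python
import csv
import io

def reward_endpoints(csv_text: str) -> tuple[float, float]:
    """Return (first, last) reward values from a step,reward CSV.

    The step,reward column layout is an assumption about
    pres/reward_steps.csv, not a verified schema.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return float(rows[0]["reward"]), float(rows[-1]["reward"])

# Illustrative sample mirroring the run's reported endpoints.
SAMPLE = "step,reward\n0,0.125\n100,1.0\n"
```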
### Why not show a bigger model?

- For the hackathon, reliability and observable reward improvement mattered more than model scale.
- A smaller model let us get an end-to-end live run working on the real backend.

### What is still incomplete?

- The environment currently exposes a small action subset rather than the full Freeciv action surface.
- The main accomplishment is that live interaction and RL training now work end to end.
---
## If something breaks during the pitch

Fallback tab order:

1. `pres/training_results.html`
2. `pres/trajectory.html`
3. `pres/training_script.html`
4. HF Space repo URL

If the live environment demo is flaky, just narrate from the trajectory page and go straight to the reward curve.