Demo outline
Open these tabs first
Local resources:
- pres/index.html
- pres/training_results.html
- pres/trajectory.html
- pres/training_script.html
- pres/reward_curve.png
- pres/before_after_reward.png
Remote resources:
- HF Space repo: https://huggingface.co/spaces/thomasm6m6/freeciv_env
- HF Space app: https://thomasm6m6-freeciv-env.hf.space
Supporting files:
- reward data: pres/reward_steps.csv
- training script: scripts/train_grpo_fast.py
- env config: openenv.yaml
What we have ready
- real OpenEnv environment for Freeciv
- real live backend on H100 via Freeciv Web
- successful GRPO training run on the live backend
- reward curve PNG
- before/after reward PNG
- live trajectory page with real observations + legal actions
- note: use reward improvement as the before/after story; raw checkpoint-to-checkpoint action examples were too noisy to be worth showing live
- minimal training script page
- HF Space deployed:
thomasm6m6/freeciv_env
What not to spend time on
- long architecture explanation
- low-level websocket/runtime debugging
- model internals
- many charts
Use the product demo + reward improvement as the center of the pitch.
1 minute YouTube flow
0:00–0:10
Open: pres/trajectory.html
Say:
- We built a real OpenEnv environment for Freeciv, a long-horizon strategy game.
- The model sees text observations and legal actions, and acts turn by turn against a live backend.
0:10–0:22
Stay on pres/trajectory.html
Say:
- This is not a toy prompt task.
- It has delayed reward, persistent world state, multiple units, city-building, and long-horizon planning.
- That maps directly to the hackathon’s long-horizon planning and world-modeling tracks.
0:22–0:38
Switch to pres/training_script.html
Say:
- We also built the minimal RL training loop with Unsloth + TRL GRPO.
- The script collects live Freeciv states, formats them into prompts, and trains a policy on the real environment.
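The observation-to-prompt step can be sketched as follows. This is a minimal illustration, not the actual code in scripts/train_grpo_fast.py; the function and field names (turn, gold, units, legal_actions) are assumptions.

```python
# Hypothetical sketch of formatting a Freeciv observation into a training prompt.
# Field names are illustrative assumptions, not the environment's real schema.

def build_prompt(obs: dict) -> str:
    """Format a text-first observation plus legal actions into one prompt."""
    lines = [
        f"Turn {obs['turn']} | gold: {obs['gold']}",
        "Units: " + ", ".join(obs["units"]),
        "Legal actions:",
    ]
    # Number the legal actions so the policy can answer with an index.
    lines += [f"{i}. {a}" for i, a in enumerate(obs["legal_actions"])]
    lines.append("Choose one action by number.")
    return "\n".join(lines)

example_obs = {
    "turn": 3,
    "gold": 50,
    "units": ["settlers@(2,4)", "warrior@(2,5)"],
    "legal_actions": ["end_turn", "move settlers north", "build_city"],
}
prompt = build_prompt(example_obs)
```

Numbering the legal actions keeps the policy's output space small and easy to score, which matters when the reward signal comes from whether a sampled completion maps to a legal move.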
0:38–0:55
Switch to pres/training_results.html
Say:
- We ran training on the H100 against the live Freeciv backend.
- Reward improved from 0.125 at the start to 1.0 by the end of the run.
- This gives observable training progress, which is the key hackathon requirement.
0:55–1:00
Optional final cut to HF Space repo URL
Say:
- The environment is packaged as OpenEnv and deployed to Hugging Face Spaces for submission.
3 minute live pitch flow
0:00–0:25 — problem
Open: pres/trajectory.html
Say:
- We wanted a real LLM RL environment for long-horizon strategic planning.
- Freeciv is a strong fit because it has persistent state, delayed reward, many legal actions, and requires planning across turns.
0:25–1:05 — show the environment
Stay on pres/trajectory.html
Point out:
- text-first observation
- legal actions
- units / cities / economy summaries
- live backend on H100
Say:
- The agent does not get a canned benchmark prompt.
- It interacts with a real running world and must choose from legal actions each turn.
1:05–1:35 — show the training loop
Open: pres/training_script.html
Say:
- This is the minimal GRPO loop.
- We use live Freeciv sessions, prepare observations, build prompts, and train with Unsloth + TRL.
- The important thing is that the training loop is small and actually runs on the real backend.
1:35–2:25 — show training improvement
Open: pres/training_results.html
Say:
- This is the core result.
- Reward increases over training steps on real Freeciv states.
- Start: 0.125. End: 1.0.
- This is the evidence that the environment and reward pipeline are coherent enough to drive learning.
If short on time, only show:
- reward curve
- before/after reward bars
2:25–2:50 — why this matters
Stay on pres/training_results.html
Say:
- This fits Statement 2: long-horizon planning.
- It also fits Statement 3.1: world modeling, because the agent interacts with a real dynamic system and must maintain state over time.
2:50–3:00 — close
Open: HF Space repo URL or pres/index.html
Say:
- The environment is packaged in OpenEnv, runs with a real backend, has a minimal RL script, and already shows reward improvement.
Likely Q/A answers
Why Freeciv?
- It is long-horizon, strategic, partially observable, and naturally multi-step.
- It is much closer to real planning than one-shot QA.
What exactly is the observation/action interface?
- Observation is text-first: turn summary, economy, units, cities, map, legal actions.
- Actions are structured: end turn, move unit, build city, set city production, set research.
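For illustration, the structured actions listed above could be encoded and checked against the legal set roughly like this. This is a sketch under assumed dict shapes; the environment's actual wire format may differ.

```python
# Hypothetical encoding/validation for the structured action types listed above.
# The dict shapes and field names are illustrative assumptions.

VALID_TYPES = {
    "end_turn", "move_unit", "build_city",
    "set_city_production", "set_research",
}

def validate_action(action: dict, legal_actions: list) -> bool:
    """Accept an action only if its type is known and it is in the legal set."""
    if action.get("type") not in VALID_TYPES:
        return False
    return action in legal_actions

legal = [
    {"type": "end_turn"},
    {"type": "move_unit", "unit_id": 101, "direction": "north"},
    {"type": "build_city", "unit_id": 101},
]
ok = validate_action({"type": "move_unit", "unit_id": 101, "direction": "north"}, legal)
bad = validate_action({"type": "move_unit", "unit_id": 999, "direction": "east"}, legal)
```

Rejecting anything outside the per-turn legal set is what keeps the agent interacting with the real world state rather than hallucinating moves.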
Is the backend real?
- Yes. Training was run against a live Freeciv Web backend on the H100.
What evidence do you have that training worked?
- The reward curve in pres/training_results.html.
- It rises from 0.125 to 1.0 during the live run.
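If asked for the raw numbers, the curve can be recomputed from pres/reward_steps.csv. The snippet below is a sketch that assumes the file has "step" and "reward" columns; the inline sample stands in for the real file.

```python
import csv
import io

# Hypothetical reader for pres/reward_steps.csv. The column names
# ("step", "reward") are assumptions about the file's layout; the
# inline sample mirrors the run's stated start/end rewards.
sample = """step,reward
0,0.125
50,0.5
100,1.0
"""

rows = list(csv.DictReader(io.StringIO(sample)))
rewards = [float(r["reward"]) for r in rows]
start, end = rewards[0], rewards[-1]
```

Swapping io.StringIO(sample) for open("pres/reward_steps.csv") would read the real data, assuming those column names hold.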
Why not show a bigger model?
- For the hackathon, reliability and observable reward improvement mattered more than model scale.
- A smaller model let us get an end-to-end live run working on the real backend.
What is still incomplete?
- The environment currently exposes a small action subset rather than the full Freeciv action surface.
- The main accomplishment is that live interaction and RL training now work end to end.
If something breaks during the pitch
Fallback tab order:
- pres/training_results.html
- pres/trajectory.html
- pres/training_script.html
- HF Space repo URL
If the live environment demo is flaky, just narrate from the trajectory page and go straight to the reward curve.