
Demo outline

Open these tabs first

Local resources:

  • pres/index.html
  • pres/training_results.html
  • pres/trajectory.html
  • pres/training_script.html
  • pres/reward_curve.png
  • pres/before_after_reward.png

Remote resources:

Supporting files:

  • reward data: pres/reward_steps.csv
  • training script: scripts/train_grpo_fast.py
  • env config: openenv.yaml

What we have ready

  • real OpenEnv environment for Freeciv
  • real live backend on H100 via Freeciv Web
  • successful GRPO training run on the live backend
  • reward curve PNG
  • before/after reward PNG
  • live trajectory page with real observations + legal actions
  • note: use reward improvement as the before/after story; raw checkpoint-to-checkpoint action examples were too noisy to be worth showing live
  • minimal training script page
  • HF Space deployed: thomasm6m6/freeciv_env

What not to spend time on

  • long architecture explanation
  • low-level websocket/runtime debugging
  • model internals
  • many charts

Use the product demo + reward improvement as the center of the pitch.


1-minute YouTube flow

0:00–0:10

Open: pres/trajectory.html

Say:

  • We built a real OpenEnv environment for Freeciv, a long-horizon strategy game.
  • The model sees text observations and legal actions, and acts turn by turn against a live backend.

0:10–0:22

Stay on pres/trajectory.html

Say:

  • This is not a toy prompt task.
  • It has delayed reward, persistent world state, multiple units, city-building, and long-horizon planning.
  • That maps directly to the hackathon’s long-horizon planning and world-modeling tracks.

0:22–0:38

Switch to pres/training_script.html

Say:

  • We also built the minimal RL training loop with Unsloth + TRL GRPO.
  • The script collects live Freeciv states, formats them into prompts, and trains a policy on the real environment.
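The collect-and-format step can be sketched roughly as below. This is a hedged illustration, not the actual scripts/train_grpo_fast.py: the state fields (`turn`, `gold`, `units`, `legal_actions`) and the `format_prompt` helper are hypothetical names chosen for the sketch.

```python
# Hypothetical sketch of turning a Freeciv observation dict into a training
# prompt; field names are illustrative, not the real env schema.
def format_prompt(state: dict) -> str:
    lines = [
        f"Turn {state['turn']} | Gold: {state['gold']}",
        "Units: " + ", ".join(state["units"]),
        "Legal actions:",
    ]
    lines += [f"  {i}. {a}" for i, a in enumerate(state["legal_actions"], 1)]
    lines.append("Choose one legal action by name.")
    return "\n".join(lines)

example_state = {
    "turn": 3,
    "gold": 50,
    "units": ["settlers@(2,3)", "warriors@(2,4)"],
    "legal_actions": ["end_turn", "move_unit", "build_city"],
}
prompt = format_prompt(example_state)
```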

0:38–0:55

Switch to pres/training_results.html

Say:

  • We ran training on the H100 against the live Freeciv backend.
  • Reward improved from 0.125 at the start to 1.0 by the end of the run.
  • This gives observable training progress, which is the key hackathon requirement.

0:55–1:00

Optional final cut to HF Space repo URL

Say:

  • The environment is packaged as OpenEnv and deployed to Hugging Face Spaces for submission.

3-minute live pitch flow

0:00–0:25 — problem

Open: pres/trajectory.html

Say:

  • We wanted a real LLM RL environment for long-horizon strategic planning.
  • Freeciv is a strong fit because it has persistent state, delayed reward, many legal actions, and requires planning across turns.

0:25–1:05 — show the environment

Stay on pres/trajectory.html

Point out:

  • text-first observation
  • legal actions
  • units / cities / economy summaries
  • live backend on H100

Say:

  • The agent does not get a canned benchmark prompt.
  • It interacts with a real running world and must choose from legal actions each turn.

1:05–1:35 — show the training loop

Open: pres/training_script.html

Say:

  • This is the minimal GRPO loop.
  • We use live Freeciv sessions, prepare observations, build prompts, and train with Unsloth + TRL.
  • The important thing is that the training loop is small and actually runs on the real backend.
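For GRPO, the key glue is a reward function scored per sampled completion. Below is a minimal hedged sketch of what such a function could look like; the legality check, the parsing of the completion, and the 0.125 floor / 1.0 ceiling (chosen to echo the reward range in the results page) are assumptions, not the actual reward pipeline.

```python
# Hypothetical GRPO-style reward: score each sampled completion by whether
# its first line names a legal action. Values and parsing are illustrative.
def reward_fn(completions: list[str], legal_actions: list[str]) -> list[float]:
    rewards = []
    for text in completions:
        chosen = text.strip().split("\n")[0].strip()
        rewards.append(1.0 if chosen in legal_actions else 0.125)
    return rewards

scores = reward_fn(
    ["build_city", "attack_moon"],
    legal_actions=["end_turn", "move_unit", "build_city"],
)
```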

1:35–2:25 — show training improvement

Open: pres/training_results.html

Say:

  • This is the core result.
  • Reward increases over training steps on real Freeciv states.
  • Start: 0.125. End: 1.0.
  • This is the evidence that the environment and reward pipeline are coherent enough to drive learning.

If short on time, only show:

  • reward curve
  • before/after reward bars

2:25–2:50 — why this matters

Stay on pres/training_results.html

Say:

  • This fits Statement 2: long-horizon planning.
  • It also fits Statement 3.1: world modeling, because the agent interacts with a real dynamic system and must maintain state over time.

2:50–3:00 — close

Open: HF Space repo URL or pres/index.html

Say:

  • The environment is packaged in OpenEnv, runs with a real backend, has a minimal RL script, and already shows reward improvement.

Likely Q/A answers

Why Freeciv?

  • It is long-horizon, strategic, partially observable, and naturally multi-step.
  • It is much closer to real planning than one-shot QA.

What exactly is the observation/action interface?

  • Observation is text-first: turn summary, economy, units, cities, map, legal actions.
  • Actions are structured: end turn, move unit, build city, set city production, set research.
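A hedged sketch of what that structured action interface can look like as data types; the class and field names below are illustrative stand-ins, not the environment's actual schema.

```python
from dataclasses import dataclass, asdict

# Illustrative action types; names and fields are assumptions, not the
# real Freeciv env schema.
@dataclass
class MoveUnit:
    unit_id: int
    direction: str  # e.g. "north"

@dataclass
class SetResearch:
    tech: str

def to_action_dict(action) -> dict:
    """Serialize an action into a tagged dict for the env backend."""
    return {"type": type(action).__name__, **asdict(action)}

payload = to_action_dict(MoveUnit(unit_id=7, direction="north"))
```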

Is the backend real?

  • Yes. Training was run against a live Freeciv Web backend on the H100.

What evidence do you have that training worked?

  • The reward curve in pres/training_results.html.
  • It rises from 0.125 to 1.0 during the live run.

Why not show a bigger model?

  • For the hackathon, reliability and observable reward improvement mattered more than model scale.
  • A smaller model let us get an end-to-end live run working on the real backend.

What is still incomplete?

  • The environment currently exposes a small action subset rather than the full Freeciv action surface.
  • The main accomplishment is that live interaction and RL training now work end to end.

If something breaks during the pitch

Fallback tab order:

  1. pres/training_results.html
  2. pres/trajectory.html
  3. pres/training_script.html
  4. HF Space repo URL

If the live environment demo is flaky, just narrate from the trajectory page and go straight to the reward curve.