
Demo outline

Open these tabs first

Local resources:

  • pres/index.html
  • pres/training_results.html
  • pres/trajectory.html
  • pres/training_script.html
  • pres/reward_curve.png
  • pres/before_after_reward.png

Remote resources:

Supporting files:

  • reward data: pres/reward_steps.csv
  • training script: scripts/train_grpo_fast.py
  • env config: openenv.yaml

What we have ready

  • real OpenEnv environment for Freeciv
  • real live backend on H100 via Freeciv Web
  • successful GRPO training run on the live backend
  • reward curve PNG
  • before/after reward PNG
  • live trajectory page with real observations + legal actions
  • note: use reward improvement as the before/after story; raw checkpoint-to-checkpoint action examples were too noisy to be worth showing live
  • minimal training script page
  • HF Space deployed: thomasm6m6/freeciv_env

What not to spend time on

  • long architecture explanation
  • low-level websocket/runtime debugging
  • model internals
  • many charts

Use the product demo + reward improvement as the center of the pitch.


1-minute YouTube flow

0:00–0:10

Open: pres/trajectory.html

Say:

  • We built a real OpenEnv environment for Freeciv, a long-horizon strategy game.
  • The model sees text observations and legal actions, and acts turn by turn against a live backend.

0:10–0:22

Stay on pres/trajectory.html

Say:

  • This is not a toy prompt task.
  • It has delayed reward, persistent world state, multiple units, city-building, and long-horizon planning.
  • That maps directly to the hackathon’s long-horizon planning and world-modeling tracks.

0:22–0:38

Switch to pres/training_script.html

Say:

  • We also built the minimal RL training loop with Unsloth + TRL GRPO.
  • The script collects live Freeciv states, formats them into prompts, and trains a policy on the real environment.
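The collect-and-format step can be sketched roughly as below. This is a hedged illustration, not the actual scripts/train_grpo_fast.py: the state fields (`turn`, `gold`, `units`, `legal_actions`) and the `format_prompt` helper are hypothetical names chosen for the sketch.

```python
# Hypothetical sketch of turning a Freeciv observation dict into a training
# prompt; field names are illustrative, not the real env schema.
def format_prompt(state: dict) -> str:
    lines = [
        f"Turn {state['turn']} | Gold: {state['gold']}",
        "Units: " + ", ".join(state["units"]),
        "Legal actions:",
    ]
    lines += [f"  {i}. {a}" for i, a in enumerate(state["legal_actions"], 1)]
    lines.append("Choose one legal action by name.")
    return "\n".join(lines)

example_state = {
    "turn": 3,
    "gold": 50,
    "units": ["settlers@(2,3)", "warriors@(2,4)"],
    "legal_actions": ["end_turn", "move_unit", "build_city"],
}
prompt = format_prompt(example_state)
```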

0:38–0:55

Switch to pres/training_results.html

Say:

  • We ran training on the H100 against the live Freeciv backend.
  • Reward improved from 0.125 at the start to 1.0 by the end of the run.
  • This gives observable training progress, which is the key hackathon requirement.

0:55–1:00

Optional final cut to HF Space repo URL

Say:

  • The environment is packaged as OpenEnv and deployed to Hugging Face Spaces for submission.

3-minute live pitch flow

0:00–0:25 — problem

Open: pres/trajectory.html

Say:

  • We wanted a real LLM RL environment for long-horizon strategic planning.
  • Freeciv is a strong fit because it has persistent state, delayed reward, many legal actions, and requires planning across turns.

0:25–1:05 — show the environment

Stay on pres/trajectory.html

Point out:

  • text-first observation
  • legal actions
  • units / cities / economy summaries
  • live backend on H100

Say:

  • The agent does not get a canned benchmark prompt.
  • It interacts with a real running world and must choose from legal actions each turn.

1:05–1:35 — show the training loop

Open: pres/training_script.html

Say:

  • This is the minimal GRPO loop.
  • We use live Freeciv sessions, prepare observations, build prompts, and train with Unsloth + TRL.
  • The important thing is that the training loop is small and actually runs on the real backend.
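For GRPO, the key glue is a reward function scored per sampled completion. Below is a minimal hedged sketch of what such a function could look like; the legality check, the parsing of the completion, and the 0.125 floor / 1.0 ceiling (chosen to echo the reward range in the results page) are assumptions, not the actual reward pipeline.

```python
# Hypothetical GRPO-style reward: score each sampled completion by whether
# its first line names a legal action. Values and parsing are illustrative.
def reward_fn(completions: list[str], legal_actions: list[str]) -> list[float]:
    rewards = []
    for text in completions:
        chosen = text.strip().split("\n")[0].strip()
        rewards.append(1.0 if chosen in legal_actions else 0.125)
    return rewards

scores = reward_fn(
    ["build_city", "attack_moon"],
    legal_actions=["end_turn", "move_unit", "build_city"],
)
```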

1:35–2:25 — show training improvement

Open: pres/training_results.html

Say:

  • This is the core result.
  • Reward increases over training steps on real Freeciv states.
  • Start: 0.125. End: 1.0.
  • This is the evidence that the environment and reward pipeline are coherent enough to drive learning.

If short on time, only show:

  • reward curve
  • before/after reward bars

2:25–2:50 — why this matters

Stay on pres/training_results.html

Say:

  • This fits Statement 2: long-horizon planning.
  • It also fits Statement 3.1: world modeling, because the agent interacts with a real dynamic system and must maintain state over time.

2:50–3:00 — close

Open: HF Space repo URL or pres/index.html

Say:

  • The environment is packaged in OpenEnv, runs with a real backend, has a minimal RL script, and already shows reward improvement.

Likely Q/A answers

Why Freeciv?

  • It is long-horizon, strategic, partially observable, and naturally multi-step.
  • It is much closer to real planning than one-shot QA.

What exactly is the observation/action interface?

  • Observation is text-first: turn summary, economy, units, cities, map, legal actions.
  • Actions are structured: end turn, move unit, build city, set city production, set research.
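A hedged sketch of what that structured action interface can look like as data types; the class and field names below are illustrative stand-ins, not the environment's actual schema.

```python
from dataclasses import dataclass, asdict

# Illustrative action types; names and fields are assumptions, not the
# real Freeciv env schema.
@dataclass
class MoveUnit:
    unit_id: int
    direction: str  # e.g. "north"

@dataclass
class SetResearch:
    tech: str

def to_action_dict(action) -> dict:
    """Serialize an action into a tagged dict for the env backend."""
    return {"type": type(action).__name__, **asdict(action)}

payload = to_action_dict(MoveUnit(unit_id=7, direction="north"))
```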

Is the backend real?

  • Yes. Training was run against a live Freeciv Web backend on the H100.

What evidence do you have that training worked?

  • The reward curve in pres/training_results.html.
  • It rises from 0.125 to 1.0 during the live run.

Why not show a bigger model?

  • For the hackathon, reliability and observable reward improvement mattered more than model scale.
  • A smaller model let us get an end-to-end live run working on the real backend.

What is still incomplete?

  • The environment currently exposes a small action subset rather than the full Freeciv action surface.
  • The main accomplishment is that live interaction and RL training now work end to end.

If something breaks during the pitch

Fallback tab order:

  1. pres/training_results.html
  2. pres/trajectory.html
  3. pres/training_script.html
  4. HF Space repo URL

If the live environment demo is flaky, just narrate from the trajectory page and go straight to the reward curve.