# Demo outline
## Open these tabs first
Local resources:
- `pres/index.html`
- `pres/training_results.html`
- `pres/trajectory.html`
- `pres/training_script.html`
- `pres/reward_curve.png`
- `pres/before_after_reward.png`
Remote resources:
- HF Space repo: <https://huggingface.co/spaces/thomasm6m6/freeciv_env>
- HF Space app: <https://thomasm6m6-freeciv-env.hf.space>
Supporting files:
- reward data: `pres/reward_steps.csv`
- training script: `scripts/train_grpo_fast.py`
- env config: `openenv.yaml`
## What we have ready
- real OpenEnv environment for Freeciv
- real live backend on H100 via Freeciv Web
- successful GRPO training run on the live backend
- reward curve PNG
- before/after reward PNG
- live trajectory page with real observations + legal actions
- note: use reward improvement as the before/after story; raw checkpoint-to-checkpoint action examples were too noisy to be worth showing live
- minimal training script page
- HF Space deployed: `thomasm6m6/freeciv_env`
## What not to spend time on
- long architecture explanation
- low-level websocket/runtime debugging
- model internals
- many charts
Use the product demo + reward improvement as the center of the pitch.
---
## 1 minute YouTube flow
### 0:00–0:10
Open: `pres/trajectory.html`
Say:
- We built a real OpenEnv environment for Freeciv, a long-horizon strategy game.
- The model sees text observations and legal actions, and acts turn by turn against a live backend.
### 0:10–0:22
Stay on `pres/trajectory.html`
Say:
- This is not a toy prompt task.
- It has delayed reward, persistent world state, multiple units, city-building, and long-horizon planning.
- That maps directly to the hackathon’s long-horizon planning and world-modeling tracks.
### 0:22–0:38
Switch to `pres/training_script.html`
Say:
- We also built a minimal RL training loop with Unsloth + TRL GRPO.
- The script collects live Freeciv states, formats them into prompts, and trains a policy on the real environment.
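If a visual aid helps here, the prompt-formatting step can be sketched like this. The field names below are illustrative assumptions, not the real schema used in `scripts/train_grpo_fast.py`:

```python
# Illustrative sketch only: observation field names are assumptions,
# not the actual schema used by the training script.

def format_prompt(obs: dict) -> str:
    """Turn one Freeciv text observation into a single training prompt."""
    lines = [
        f"Turn {obs['turn']}",
        f"Economy: {obs['economy']}",
        f"Units: {', '.join(obs['units'])}",
        f"Cities: {', '.join(obs['cities'])}",
        "Legal actions:",
    ]
    # List each legal action as a bullet the model can choose from.
    lines += [f"- {a}" for a in obs["legal_actions"]]
    lines.append("Choose one legal action:")
    return "\n".join(lines)
```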
### 0:38–0:55
Switch to `pres/training_results.html`
Say:
- We ran training on the H100 against the live Freeciv backend.
- Reward improved from 0.125 at the start to 1.0 by the end of the run.
- This gives observable training progress, which is the key hackathon requirement.
### 0:55–1:00
Optional final cut to HF Space repo URL
Say:
- The environment is packaged as OpenEnv and deployed to Hugging Face Spaces for submission.
---
## 3 minute live pitch flow
### 0:00–0:25 — problem
Open: `pres/trajectory.html`
Say:
- We wanted a real LLM RL environment for long-horizon strategic planning.
- Freeciv is a strong fit because it has persistent state, delayed reward, many legal actions, and requires planning across turns.
### 0:25–1:05 — show the environment
Stay on `pres/trajectory.html`
Point out:
- text-first observation
- legal actions
- units / cities / economy summaries
- live backend on H100
Say:
- The agent does not get a canned benchmark prompt.
- It interacts with a real running world and must choose from legal actions each turn.
### 1:05–1:35 — show the training loop
Open: `pres/training_script.html`
Say:
- This is the minimal GRPO loop.
- We use live Freeciv sessions, prepare observations, build prompts, and train with Unsloth + TRL.
- The important thing is that the training loop is small and actually runs on the real backend.
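The interaction pattern behind that loop can be sketched as follows, assuming a Gym-style `reset()`/`step()` interface; the actual OpenEnv client API may differ:

```python
# Sketch of the env interaction pattern, assuming a Gym-style
# reset()/step(action) interface; the real OpenEnv client may differ.

def rollout(env, policy, max_turns=50):
    """Collect one episode as a list of (action, reward) pairs."""
    obs = env.reset()
    trajectory = []
    for _ in range(max_turns):
        action = policy(obs)  # pick one of the legal actions
        obs, reward, done = env.step(action)
        trajectory.append((action, reward))
        if done:
            break
    return trajectory
```

The collected trajectories are what GRPO scores with the reward signal; keeping this loop small is what makes the end-to-end live run tractable.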
### 1:35–2:25 — show training improvement
Open: `pres/training_results.html`
Say:
- This is the core result.
- Reward increases over training steps on real Freeciv states.
- Start: 0.125. End: 1.0.
- This is the evidence that the environment and reward pipeline are coherent enough to drive learning.
If short on time, only show:
- reward curve
- before/after reward bars
### 2:25–2:50 — why this matters
Stay on `pres/training_results.html`
Say:
- This fits Statement 2: long-horizon planning.
- It also fits Statement 3.1: world modeling, because the agent interacts with a real dynamic system and must maintain state over time.
### 2:50–3:00 — close
Open: HF Space repo URL or `pres/index.html`
Say:
- The environment is packaged in OpenEnv, runs with a real backend, has a minimal RL script, and already shows reward improvement.
---
## Likely Q/A answers
### Why Freeciv?
- It is long-horizon, strategic, partially observable, and naturally multi-step.
- It is much closer to real planning than one-shot QA.
### What exactly is the observation/action interface?
- Observation is text-first: turn summary, economy, units, cities, map, legal actions.
- Actions are structured: end turn, move unit, build city, set city production, set research.
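If asked for detail, the structured actions can be sketched as dataclasses like the ones below; the names and fields here are hypothetical, not the environment's real schema:

```python
# Hypothetical action types mirroring the exposed action subset;
# field names are assumptions, not the environment's real schema.
from dataclasses import dataclass
from typing import Union

@dataclass
class EndTurn:
    pass

@dataclass
class MoveUnit:
    unit_id: int
    direction: str  # e.g. "north"

@dataclass
class BuildCity:
    unit_id: int

@dataclass
class SetCityProduction:
    city_id: int
    item: str

@dataclass
class SetResearch:
    tech: str

Action = Union[EndTurn, MoveUnit, BuildCity, SetCityProduction, SetResearch]
```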
### Is the backend real?
- Yes. Training was run against a live Freeciv Web backend on the H100.
### What evidence do you have that training worked?
- The reward curve in `pres/training_results.html`.
- It rises from 0.125 to 1.0 during the live run.
### Why not show a bigger model?
- For the hackathon, reliability and observable reward improvement mattered more than model scale.
- A smaller model let us get an end-to-end live run working on the real backend.
### What is still incomplete?
- The environment currently exposes a small action subset rather than the full Freeciv action surface.
- The main accomplishment is that live interaction and RL training now work end to end.
---
## If something breaks during the pitch
Fallback tab order:
1. `pres/training_results.html`
2. `pres/trajectory.html`
3. `pres/training_script.html`
4. HF Space repo URL
If the live environment demo is flaky, just narrate from the trajectory page and go straight to the reward curve.