Trenches Training Plan
This document is the working plan for the historical prediction training setup.
Goal
Train six separate entity models in the same OpenEnv-backed simulator so that, at each turn, every model does two things:
- choose an action
- predict what will happen next
The core idea is:
- the environment replays a real historical event window
- each model only sees information available up to that point in time
- each model generates a predicted future timeline
- the environment later reveals what actually happened
- reward is based partly on whether the model predicted correctly
Target training window:
- 2025
- 2026
Intended Training Shape
Two timelines exist at once:
- `ground_truth_timeline`: the real historical sequence of events.
- `predicted_timeline`: what the entity believed would happen next, based only on information available at that turn.
The environment reward should compare the second timeline against the first.
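One way to make that comparison concrete is a simple matching rule; the function name and matching criteria below are illustrative sketches, not the backend's actual scorer:

```python
def timeline_hit_rate(predicted: list[dict], ground_truth: list[dict],
                      window: int = 2) -> float:
    """Fraction of predictions matched by a real event with the same topic
    within `window` turns of the predicted turn (sketch of a matching rule)."""
    if not predicted:
        return 0.0
    hits = 0
    for p in predicted:
        for g in ground_truth:
            if g["topic"] == p["topic"] and abs(g["turn"] - p["turn"]) <= window:
                hits += 1
                break  # each prediction counts at most once
    return hits / len(predicted)
```

A real scorer would also weigh actors, targets, and severity, but topic-plus-timing is enough to show the shape of the comparison.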
Why OpenEnv Is The Right Boundary
OpenEnv is the environment interface, not the trainer itself.
That is exactly what we need:
- `reset()` starts a historical replay episode at a chosen point
- `step()` accepts an entity output
- the env advances time
- the env computes reward from action quality and prediction quality
Training should happen outside the backend with something like Hugging Face TRL.
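As a sketch of that boundary, assuming a minimal replay-backed env (class, method names, and the toy reward are illustrative, not the real OpenEnv or backend API):

```python
from dataclasses import dataclass

@dataclass
class ReplayEnv:
    events: list      # ground-truth events, ordered by timestamp
    cursor: int = 0   # index of the next event to reveal

    def reset(self, start_index: int = 0) -> dict:
        """Start a historical replay episode at a chosen point."""
        self.cursor = start_index
        # the observation contains only events strictly before the cursor
        return {"visible_events": self.events[: self.cursor]}

    def step(self, action: dict, prediction: dict) -> tuple[dict, float, bool]:
        """Advance time by one event and score the prediction against it."""
        revealed = self.events[self.cursor]
        self.cursor += 1
        # toy reward: 1.0 if the predicted topic matches the revealed event
        reward = 1.0 if prediction.get("topic") == revealed["topic"] else 0.0
        done = self.cursor >= len(self.events)
        obs = {"visible_events": self.events[: self.cursor]}
        return obs, reward, done
```

The trainer only ever sees observations and rewards through `reset()`/`step()`, which is what keeps the training loop swappable.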
What Exists Already
The current backend already has:
- an OpenEnv environment boundary
- session and step logic
- per-entity observations
- per-entity rewards
- latent state
- latent events
- belief state
- source projection
- scenario and benchmark support
- a structured `Prediction` schema
- prediction storage and scoring in session state
- replay mode driven by historical event timestamps
- a bundled set of 6 synthetic seed replay datasets (in `synthetic_historical_replays/`)
- a replay-aware TRL/OpenEnv CLI training loop
- a historical data collection pipeline (GDELT → replay JSON)
What Is Missing
The backend does not yet have:
- a larger curated truth dataset beyond the bundled synthetic seed replays
- a proper evaluation report for prediction quality
- baselines and train/eval split reporting
Planned Implementation Order
Phase 1: Historical Replay Foundation
- Define a normalized historical event schema.
- Build a replay dataset for selected 2025-2026 events.
- Add historical replay mode to the backend environment.
- Ensure agents only see information available before each replay timestamp.
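The no-future-information rule in the last step comes down to a strict timestamp cutoff; a minimal sketch (helper name is hypothetical):

```python
from datetime import datetime

def visible_events(events: list[dict], now: str) -> list[dict]:
    """Return only events strictly before the replay timestamp `now`.

    Strict inequality matters: an event timestamped exactly at `now` is
    treated as not-yet-observed, so it can never leak into the observation.
    """
    cutoff = datetime.fromisoformat(now)
    return [e for e in events if datetime.fromisoformat(e["timestamp"]) < cutoff]
```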
Phase 2: Prediction Contract
- Add a structured `Prediction` object for each agent.
- Extend agent outputs so a turn can include:
  - `action`
  - `prediction`
- Store prediction history in session state.
Phase 3: Reward Logic
- Add reward terms for:
- correct topic
- correct actor
- correct target
- correct timing window
- correct severity band
- confidence calibration
- Penalize:
- confident false predictions
- vague predictions
- repeated contradiction with real history
- Exclude fake/manual events from training reward.
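A hedged sketch of how these terms might be blended; the weights, field names, and the linear calibration proxy are placeholders, not the implemented reward:

```python
def forecast_reward(prediction: dict, outcome: dict, horizon_turns: int) -> float:
    """Toy blend of the reward terms listed above; all weights are placeholders."""
    # fake/manual events never drive training reward
    if not outcome.get("confirmed", False):
        return 0.0
    score = 0.0
    score += 0.25 * (prediction["topic"] == outcome["topic"])
    score += 0.20 * (prediction["predicted_actor"] in outcome["actors"])
    score += 0.20 * (prediction["predicted_target"] in outcome["targets"])
    # timing window: full credit if the event lands within the horizon
    score += 0.15 * (outcome["turns_until_event"] <= horizon_turns)
    # severity band: within one band counts as correct
    score += 0.10 * (abs(prediction["expected_severity"] - outcome["severity"]) <= 1)
    # calibration: confident correct predictions gain, confident wrong ones lose
    # (a simple linear proxy standing in for a proper scoring rule)
    correct = prediction["topic"] == outcome["topic"]
    score += 0.10 * (prediction["confidence"] if correct else -prediction["confidence"])
    return score
```

A proper scoring rule (e.g. Brier-style) would replace the linear calibration term in a real implementation.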
Phase 4: Training Loop
- Train one entity first.
- Use OpenEnv + HF TRL.
- Prove a working historical replay training loop.
- Scale to six entity-specific models.
Phase 5: Evaluation
- Build evaluation metrics for forecast quality.
- Compare against simple baselines.
- Separate train and eval windows.
- Report before/after performance.
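For the calibration side of forecast quality, the Brier score is a standard starting point; a minimal sketch:

```python
def brier_score(forecasts: list[tuple[float, bool]]) -> float:
    """Mean squared error between stated confidence and actual outcome.

    `forecasts` pairs each prediction's confidence with whether it came true.
    Lower is better; a constant-0.5 baseline scores 0.25, so a trained model
    should beat that before any claim of improvement is made.
    """
    return sum((p - float(o)) ** 2 for p, o in forecasts) / len(forecasts)
```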
Recommended Minimal Event Schema
Each historical event should have:
- `event_id`
- `timestamp`
- `topic`
- `region`
- `actors`
- `targets`
- `severity`
- `summary`
- `source_type`
- `confirmed`
- `tags`
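An illustrative record matching this event schema (all values invented for the example):

```python
# Illustrative event record; every value here is made up.
event = {
    "event_id": "evt-0001",
    "timestamp": "2025-03-14T09:30:00Z",
    "topic": "diplomacy",
    "region": "middle_east",
    "actors": ["us"],
    "targets": ["israel"],
    "severity": 2,
    "summary": "Example diplomatic statement used for replay.",
    "source_type": "gdelt",
    "confirmed": True,
    "tags": ["synthetic", "seed"],
}
```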
Recommended Prediction Schema
Each prediction should have:
- `prediction_id`
- `agent_id`
- `turn`
- `timestamp`
- `topic`
- `predicted_actor`
- `predicted_target`
- `time_horizon_turns`
- `expected_severity`
- `confidence`
- `summary`
- `rationale`
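A matching illustrative prediction record (all values invented):

```python
# Illustrative prediction record; every value here is made up.
prediction = {
    "prediction_id": "pred-0001",
    "agent_id": "us",
    "turn": 7,
    "timestamp": "2025-03-14T09:30:00Z",
    "topic": "diplomacy",
    "predicted_actor": "us",
    "predicted_target": "israel",
    "time_horizon_turns": 3,
    "expected_severity": 2,
    "confidence": 0.6,
    "summary": "Expects a follow-up statement within three turns.",
    "rationale": "Recent visible events show rising diplomatic activity.",
}
```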
Critical Design Rules
No leakage. The model must never see future information.
Real events and fake events must be separated. Manual events can drive behavior but must not drive training reward.
Action and prediction should remain separate outputs. Mixing them into one blob will make both training and debugging worse.
Train one entity first before scaling to six. Prove the loop on one actor before multiplying complexity.
Evaluate against baselines. Otherwise there is no evidence the training helped.
Suggested First Entity
Start with:
`us`
Why:
- broad observation surface
- strong strategic tradeoffs
- likely easiest to benchmark against known 2025-2026 developments
Known Future Work
After the first working replay-training loop:
- train all six entities
- compare model families
- add branch evaluation for counterfactual timelines
- add replay UI for predicted vs actual timeline alignment
Working Status
Current status:
- all 6 synthetic seed replay datasets created and bundled (in `synthetic_historical_replays/`)
- base model: `Qwen/Qwen3-8B` (shared across all entities, no quantization)
- OpenEnv step accepts separate `action` and `prediction`
- forecast reward is blended into entity reward on replay steps
- TRL CLI training path is implemented and smoke-tested end to end
- local smoke tests pass for US + Israel entities (tiny-gpt2)
- HF GPU smoke test passed on T4 (trenches-training-smoke)
- historical data collection pipeline implemented (GDELT → replay JSON)
- multi-entity scaling to A100 and evaluation still pending
This file should be updated as the forecasting/replay training system is built.