Trenches Training Plan

This document is the working plan for the historical prediction training setup.

Goal

Train six separate entity models in the same OpenEnv-backed simulator so they do two things at each turn:

choose an action
predict what will happen next

The core idea is:

the environment replays a real historical event window
each model only sees information available up to that point in time
each model generates a predicted future timeline
the environment later reveals what actually happened
reward is based partly on whether the model predicted correctly

Target training window:

2025
2026

Intended Training Shape

Two timelines exist at once:

ground_truth_timeline: The real historical sequence of events.
predicted_timeline: What the entity believed would happen next, based only on available information at that turn.

The environment reward should compare predicted_timeline against ground_truth_timeline.
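A minimal sketch of that comparison, assuming each timeline is reduced to (turn, topic) pairs and that a prediction counts as a hit when the same topic appears in the real history within a fixed number of turns. The name `timeline_hit_rate`, the tuple layout, and the `horizon` parameter are illustrative assumptions, not the backend's API.

```python
from typing import List, Tuple

# Hypothetical simplified layout: (turn_index, topic). The real timelines
# carry richer records; this only illustrates the direction of comparison.
TimelineEntry = Tuple[int, str]


def timeline_hit_rate(
    predicted_timeline: List[TimelineEntry],
    ground_truth_timeline: List[TimelineEntry],
    horizon: int = 3,
) -> float:
    """Fraction of predictions whose topic occurs in the ground truth
    within `horizon` turns after the turn the prediction was made."""
    if not predicted_timeline:
        return 0.0
    hits = 0
    for made_at, topic in predicted_timeline:
        if any(
            true_topic == topic and made_at < true_turn <= made_at + horizon
            for true_turn, true_topic in ground_truth_timeline
        ):
            hits += 1
    return hits / len(predicted_timeline)


# Toy data: the first prediction lands inside the horizon, the second does not.
ground_truth = [(2, "sanctions"), (5, "ceasefire_talks")]
predicted = [(1, "sanctions"), (1, "ceasefire_talks")]
rate = timeline_hit_rate(predicted, ground_truth, horizon=3)
```

A real scorer would also compare actor, target, severity, and confidence; topic-only matching is used here to keep the example short.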

Why OpenEnv Is The Right Boundary

OpenEnv is the environment interface, not the trainer itself.

That is exactly what we need:

reset() starts a historical replay episode at a chosen point
step() accepts an entity output
the env advances time
the env computes reward from action quality and prediction quality
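The reset/step contract above can be illustrated with a toy environment. This is a sketch under stated assumptions: `ToyReplayEnv` and `ReplayEvent` are hypothetical stand-ins, not the OpenEnv base class or the real backend, and the action reward is left as a placeholder.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ReplayEvent:
    turn: int   # replay turn at which the event becomes public
    topic: str


class ToyReplayEnv:
    """Hypothetical stand-in for a replay environment (not OpenEnv itself)."""

    def __init__(self, events: List[ReplayEvent]) -> None:
        self._events = sorted(events, key=lambda e: e.turn)
        self._last_turn = self._events[-1].turn if self._events else 0
        self._turn = 0

    def _observation(self) -> Dict[str, object]:
        # Only events at or before the current turn are visible.
        visible = [e.topic for e in self._events if e.turn <= self._turn]
        return {"turn": self._turn, "visible_topics": visible}

    def reset(self, start_turn: int = 0) -> Dict[str, object]:
        """Start a replay episode at a chosen point."""
        self._turn = start_turn
        return self._observation()

    def step(
        self, action: str, predicted_topic: str
    ) -> Tuple[Dict[str, object], float, bool]:
        """Accept an entity output, advance time, and score it."""
        self._turn += 1
        revealed = {e.topic for e in self._events if e.turn == self._turn}
        prediction_reward = 1.0 if predicted_topic in revealed else 0.0
        action_reward = 0.0  # placeholder: action scoring is out of scope here
        done = self._turn >= self._last_turn
        return self._observation(), action_reward + prediction_reward, done


env = ToyReplayEnv([ReplayEvent(1, "sanctions"), ReplayEvent(2, "strike")])
observation = env.reset()
observation, reward, done = env.step(action="hold", predicted_topic="sanctions")
```

The trainer only ever calls `reset()` and `step()`, which is what keeps the training code outside the backend.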

Training should happen outside the backend, in a separate trainer such as Hugging Face TRL.

What Exists Already

The current backend already has:

an OpenEnv environment boundary
session and step logic
per-entity observations
per-entity rewards
latent state
latent events
belief state
source projection
scenario and benchmark support
a structured Prediction schema
prediction storage and scoring in session state
replay mode driven by historical event timestamps
a bundled set of 6 synthetic seed replay datasets (in synthetic_historical_replays/)
a replay-aware TRL/OpenEnv CLI training loop
a historical data collection pipeline (GDELT → replay JSON)

What Is Missing

The backend does not yet have:

a larger curated truth dataset beyond the bundled synthetic seed replays
a proper evaluation report for prediction quality
baselines and train/eval split reporting

Planned Implementation Order

Phase 1: Historical Replay Foundation

Define a normalized historical event schema.
Build a replay dataset for selected 2025-2026 events.
Add historical replay mode to the backend environment.
Ensure agents only see information available before each replay timestamp.
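The visibility rule in the last step could be enforced with a filter like the one below. It is a sketch only: `TimedEvent` and `visible_events` are hypothetical names, and the strict `<` comparison is an assumption about what "before each replay timestamp" means.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass
class TimedEvent:
    timestamp: datetime  # assumed timezone-aware, in UTC
    summary: str


def visible_events(events: List[TimedEvent], replay_time: datetime) -> List[TimedEvent]:
    """Return only events stamped strictly before `replay_time`.

    Mixing naive and timezone-aware datetimes raises TypeError on comparison,
    so all timestamps should use the same convention.
    """
    return [e for e in events if e.timestamp < replay_time]


events = [
    TimedEvent(datetime(2025, 3, 1, tzinfo=timezone.utc), "a"),
    TimedEvent(datetime(2025, 3, 5, tzinfo=timezone.utc), "b"),
    TimedEvent(datetime(2025, 3, 9, tzinfo=timezone.utc), "c"),
]
cutoff = datetime(2025, 3, 5, tzinfo=timezone.utc)
seen = visible_events(events, cutoff)  # only the event before the cutoff
```

Applying the filter when the observation is built, rather than in the prompt template, keeps future events out of everything downstream.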

Phase 2: Prediction Contract

Add a structured Prediction object for each agent.
Extend agent outputs so a turn can include:
- action
- prediction
Store prediction history in session state.
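One way to keep the two outputs separate is to have the model emit a JSON object with distinct keys and to parse them independently. The sketch below assumes that format; `parse_turn_output`, the `"action"` and `"prediction"` keys, and the list used as prediction history are all illustrative, not the backend's actual contract.

```python
import json
from typing import Any, Dict, List, Optional, Tuple


def parse_turn_output(raw: str) -> Tuple[Dict[str, Any], Optional[Dict[str, Any]]]:
    """Split one turn's output into (action, prediction).

    The action is required; the prediction is optional so that a turn
    without a forecast is still a valid turn.
    """
    payload = json.loads(raw)
    if not isinstance(payload, dict) or "action" not in payload:
        raise ValueError("turn output must be a JSON object with an 'action' key")
    return payload["action"], payload.get("prediction")


# Hypothetical model output for a single turn.
raw = '{"action": {"type": "hold"}, "prediction": {"topic": "sanctions", "confidence": 0.7}}'
action, prediction = parse_turn_output(raw)

# Stand-in for the prediction history kept in session state.
prediction_history: List[Dict[str, Any]] = []
if prediction is not None:
    prediction_history.append(prediction)
```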

Phase 3: Reward Logic

Add reward terms for:
- correct topic
- correct actor
- correct target
- correct timing window
- correct severity band
- confidence calibration
Penalize:
- confident false predictions
- vague predictions
- repeated contradiction with real history
Exclude fake/manual events from training reward.
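A rough sketch of how some of these terms might combine into one number. Everything here is an assumption made for illustration: the `Forecast` and `Outcome` types, the equal weighting of the five match terms, the 0.8/0.2 blend, and the one-band severity tolerance. It covers the match terms, calibration, the confident-miss penalty, and the fake-event exclusion; vagueness and repeated contradiction are not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Forecast:
    topic: str
    actor: str
    target: str
    horizon_turns: int
    severity: int      # assumed integer band, e.g. 1 (low) to 5 (high)
    confidence: float  # assumed probability in [0, 1]


@dataclass
class Outcome:
    topic: str
    actor: str
    target: str
    turns_until_event: int
    severity: int
    is_real: bool      # False for fake/manual events


def forecast_reward(forecast: Forecast, outcome: Optional[Outcome]) -> float:
    # Fake/manual events never contribute to training reward.
    if outcome is not None and not outcome.is_real:
        return 0.0
    # Nothing comparable happened: penalize in proportion to confidence.
    if outcome is None:
        return -forecast.confidence
    matches = [
        forecast.topic == outcome.topic,
        forecast.actor == outcome.actor,
        forecast.target == outcome.target,
        0 < outcome.turns_until_event <= forecast.horizon_turns,
        abs(forecast.severity - outcome.severity) <= 1,
    ]
    accuracy = sum(matches) / len(matches)
    # Brier-style calibration term: confidence should track accuracy.
    calibration = 1.0 - (forecast.confidence - accuracy) ** 2
    return 0.8 * accuracy + 0.2 * calibration


forecast = Forecast("sanctions", "actor_a", "actor_b", 3, 3, 0.8)
real = Outcome("sanctions", "actor_a", "actor_b", 2, 4, True)
manual = Outcome("sanctions", "actor_a", "actor_b", 2, 4, False)

hit_reward = forecast_reward(forecast, real)       # all five terms match
miss_reward = forecast_reward(forecast, None)      # confident miss
manual_reward = forecast_reward(forecast, manual)  # excluded from reward
```

Returning the negative confidence on a miss means a cautious wrong forecast costs less than a confident one, which is the behavior the penalty list asks for.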

Phase 4: Training Loop

Train one entity first.
Use OpenEnv + HF TRL.
Prove a working historical replay training loop.
Scale to six entity-specific models.

Phase 5: Evaluation

Build evaluation metrics for forecast quality.
Compare against simple baselines.
Separate train and eval windows.
Report before/after performance.
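As one possible metric, the Brier score of the model's confidences can be compared against a constant base-rate baseline fitted on the train window only. The function names and all numbers below are made-up illustrations, not results from this project.

```python
from typing import Sequence


def brier_score(confidences: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(confidences) != len(outcomes) or not confidences:
        raise ValueError("need equally sized, non-empty inputs")
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(confidences)


def skill_score(model_brier: float, baseline_brier: float) -> float:
    """Brier skill score: 1.0 is perfect, 0.0 matches the baseline,
    negative values are worse than the baseline."""
    if baseline_brier == 0.0:
        raise ValueError("baseline is already perfect; skill is undefined")
    return 1.0 - model_brier / baseline_brier


# Illustrative toy data: 1 means the predicted event happened.
train_outcomes = [1, 0, 0, 1]             # train window (fits the baseline)
eval_outcomes = [1, 0, 1, 1]              # held-out eval window
model_confidences = [0.9, 0.2, 0.7, 0.6]  # model output on the eval window

# Baseline: always predict the base rate observed in the train window.
base_rate = sum(train_outcomes) / len(train_outcomes)
baseline_confidences = [base_rate] * len(eval_outcomes)

model_brier = brier_score(model_confidences, eval_outcomes)
baseline_brier = brier_score(baseline_confidences, eval_outcomes)
skill = skill_score(model_brier, baseline_brier)
```

Fitting the baseline on the train window and scoring both on the eval window keeps the comparison free of the same leakage the replay environment is designed to prevent.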

Recommended Minimal Event Schema

Each historical event should have:

event_id
timestamp
topic
region
actors
targets
severity
summary
source_type
confirmed
tags
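The field list could be expressed as a dataclass along these lines. The field names come from the list above; the types, the optional `tags` default, and the `from_dict` helper are assumptions, since the plan does not specify them.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List


@dataclass(frozen=True)
class HistoricalEvent:
    event_id: str
    timestamp: datetime
    topic: str
    region: str
    actors: List[str]
    targets: List[str]
    severity: int        # assumed integer band; the plan does not fix a scale
    summary: str
    source_type: str
    confirmed: bool
    tags: List[str] = field(default_factory=list)

    @classmethod
    def from_dict(cls, raw: Dict[str, Any]) -> "HistoricalEvent":
        """Build an event from a JSON-like dict with an ISO 8601 timestamp."""
        return cls(
            event_id=str(raw["event_id"]),
            timestamp=datetime.fromisoformat(raw["timestamp"]),
            topic=raw["topic"],
            region=raw["region"],
            actors=list(raw["actors"]),
            targets=list(raw["targets"]),
            severity=int(raw["severity"]),
            summary=raw["summary"],
            source_type=raw["source_type"],
            confirmed=bool(raw["confirmed"]),
            tags=list(raw.get("tags", [])),
        )


# Placeholder record for illustration only; not taken from any dataset.
event = HistoricalEvent.from_dict({
    "event_id": "evt-0001",
    "timestamp": "2025-03-01T12:00:00+00:00",
    "topic": "sanctions",
    "region": "example_region",
    "actors": ["actor_a"],
    "targets": ["actor_b"],
    "severity": 3,
    "summary": "Placeholder summary.",
    "source_type": "news",
    "confirmed": True,
})
```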

Recommended Prediction Schema

Each prediction should have:

prediction_id
agent_id
turn
timestamp
topic
predicted_actor
predicted_target
time_horizon_turns
expected_severity
confidence
summary
rationale
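A sketch of the same fields as a validated record. `PredictionRecord` is a hypothetical name chosen to avoid implying this is the backend's existing Prediction schema; the types and the two validation rules are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class PredictionRecord:
    prediction_id: str
    agent_id: str
    turn: int
    timestamp: datetime
    topic: str
    predicted_actor: str
    predicted_target: str
    time_horizon_turns: int
    expected_severity: int  # assumed integer band
    confidence: float       # assumed probability in [0, 1]
    summary: str
    rationale: str

    def __post_init__(self) -> None:
        # Reject values that would make scoring ambiguous.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")
        if self.time_horizon_turns < 1:
            raise ValueError("time_horizon_turns must be at least 1")


# Placeholder values for illustration only.
record = PredictionRecord(
    prediction_id="pred-0001",
    agent_id="us",
    turn=4,
    timestamp=datetime(2025, 3, 1, tzinfo=timezone.utc),
    topic="sanctions",
    predicted_actor="actor_a",
    predicted_target="actor_b",
    time_horizon_turns=3,
    expected_severity=3,
    confidence=0.7,
    summary="Placeholder forecast.",
    rationale="Placeholder reasoning.",
)
```

Validating at construction time means a malformed prediction fails before it reaches reward computation.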

Critical Design Rules

No leakage. The model must never see future information.
Real events and fake events must be separated. Manual events can drive behavior but must not drive training reward.
Action and prediction should remain separate outputs. Mixing them into one blob will make both training and debugging worse.
Train one entity first before scaling to six. Prove the loop on one actor before multiplying complexity.
Evaluate against baselines. Otherwise there is no evidence the training helped.

Suggested First Entity

Start with:

us

Why:

broad observation surface
strong strategic tradeoffs
likely easiest to benchmark against known 2025-2026 developments

Known Future Work

After the first working replay-training loop:

train all six entities
compare model families
add branch evaluation for counterfactual timelines
add replay UI for predicted vs actual timeline alignment

Working Status

Current status:

all 6 synthetic seed replay datasets created and bundled (in synthetic_historical_replays/)
base model: Qwen/Qwen3-8B (shared across all entities, no quantization)
OpenEnv step accepts separate action and prediction
forecast reward is blended into entity reward on replay steps
TRL CLI training path is implemented and smoke-tested end to end
local smoke tests pass for US + Israel entities (tiny-gpt2)
HF GPU smoke test passed on T4 (trenches-training-smoke)
historical data collection pipeline implemented (GDELT → replay JSON)
multi-entity scaling to an A100 and evaluation are both still pending

This file should be updated as the forecasting/replay training system is built.