Data Handoff
Chosen Base Model
Use:
Qwen/Qwen3-8B
Why this is the best default for the 2025-01 -> 2026-01 post-training window:
- it was released inside the required time frame
- it is available on Hugging Face
- it is strong enough for structured action + prediction output
- it is small enough that running six separate per-entity post-training jobs on it is still realistic
This is the recommended first real base model for all six entities.
What I Added For Data
The repo already had:
- synthetic seed replay JSON files under backend/src/trenches_env/historical_replays
- an OpenEnv replay training path
- a training CLI that consumes replay JSON with the HistoricalReplayDefinition -> HistoricalEvent schema
What I added is the first path from real historical sources into that same replay schema.
New Files
backend/src/trenches_env/historical_collection.py
- builds historical source profiles from the existing source manifest
- derives historical domains from allowlisted agent sources
- defines the 2025 and 2026 collection windows
- dedupes collected articles
- converts collected articles into the exact replay event schema used by training
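The dedupe step listed above can be sketched as follows. This is a minimal illustration, not the logic in historical_collection.py: the function name dedupe_articles and the choice of keys (normalized URL and normalized title) are assumptions made for the example.

```python
from urllib.parse import urlsplit


def dedupe_articles(articles):
    """Keep the first article seen for each normalized URL or title.

    Hypothetical helper: the real collector's dedupe rules may differ.
    """
    seen = set()
    unique = []
    for article in articles:
        parts = urlsplit(article["url"])
        # Ignore scheme, "www." prefix, query string and trailing slash.
        url_key = (parts.netloc.lower().removeprefix("www."), parts.path.rstrip("/"))
        # Collapse case and whitespace so reposted headlines match.
        title_key = " ".join(article["title"].lower().split())
        if url_key in seen or title_key in seen:
            continue
        seen.update({url_key, title_key})
        unique.append(article)
    return unique
```

Matching on either key is deliberately aggressive: it drops syndicated copies of the same headline, at the cost of occasionally merging two distinct articles that share a title.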
backend/src/trenches_env/historical_collection_cli.py
- CLI collector
- queries the GDELT DOC API month by month
- writes raw article audit files
- writes replay JSON files in the same schema as the existing synthetic seeds
backend/tests/test_historical_collection.py
- validates source-profile extraction
- validates article -> replay-event conversion
- validates replay JSON compatibility with the existing historical replay loader
What Source Data It Uses
The collector starts from the existing backend/src/trenches_env/source_manifest.json.
That means it does not invent a separate source universe. It reuses the current project’s aligned sources, then extracts historical domains from them. In practice this means it leans on the project’s existing training-core sources such as:
- Reuters and wire-style reporting
- official government / ministry sources
- regional English-language outlets already assigned to the entities
- market / shipping / sanctions / diplomacy sources already present in the manifest
For historical collection, it converts those sources into domain-filtered GDELT queries and collects article candidates month by month.
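A rough sketch of month-by-month, domain-filtered query construction is shown here. The helper names (month_windows, build_query_url) and the keyword string are hypothetical; the endpoint and the query, mode, format, maxrecords, startdatetime and enddatetime parameters follow the public GDELT DOC 2.0 API as I understand it, and should be checked against its documentation before use. The sketch only builds URLs and makes no network calls.

```python
from datetime import date
from urllib.parse import urlencode

GDELT_DOC_ENDPOINT = "https://api.gdeltproject.org/api/v2/doc/doc"


def month_windows(start: date, end: date):
    """Yield (window_start, window_end) pairs covering [start, end) by month."""
    cursor = start
    while cursor < end:
        if cursor.month == 12:
            nxt = date(cursor.year + 1, 1, 1)
        else:
            nxt = date(cursor.year, cursor.month + 1, 1)
        yield cursor, min(nxt, end)
        cursor = nxt


def build_query_url(domains, keywords, start, end, max_records=50):
    """Build one GDELT DOC article-list URL restricted to the given domains."""
    domain_clause = " OR ".join(f"domain:{d}" for d in domains)
    if len(domains) > 1:
        # OR-ed terms are wrapped in parentheses in GDELT query syntax.
        domain_clause = f"({domain_clause})"
    params = {
        "query": f"{keywords} {domain_clause}",
        "mode": "artlist",
        "format": "json",
        "maxrecords": max_records,
        "startdatetime": start.strftime("%Y%m%d000000"),
        "enddatetime": end.strftime("%Y%m%d000000"),
    }
    return f"{GDELT_DOC_ENDPOINT}?{urlencode(params)}"
```

For example, iterating month_windows(date(2025, 1, 1), date(2026, 1, 1)) and calling build_query_url once per window yields twelve URLs for the 2025 window.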
Output Files
The collector writes two outputs per run.
1. Replay JSON
Path example:
backend/src/trenches_env/historical_replays/us_historical_2025.json
This matches the structure of the existing synthetic seed files:
- replay_id
- name
- description
- training_agent
- events[]
Each event matches the current training schema:
- event_id
- timestamp
- topic
- region
- actors
- targets
- severity
- summary
- public_summary
- source_type
- confirmed
- tags
- impact
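To make the event shape concrete, here is a minimal sketch that maps one collected article onto the field names listed above. Only the field names come from this document; the function name article_to_event, the value types, and every placeholder value ("unclassified", "unknown", "low", "news", the empty impact) are assumptions for illustration and do not reflect the real converter's output.

```python
import hashlib

EVENT_FIELDS = (
    "event_id", "timestamp", "topic", "region", "actors", "targets",
    "severity", "summary", "public_summary", "source_type",
    "confirmed", "tags", "impact",
)


def article_to_event(article, training_agent):
    """Map a collected article onto the replay event field names.

    Hypothetical sketch: placeholder values stand in for the heuristic
    topic, severity, actor/target and impact inference.
    """
    digest = hashlib.sha1(article["url"].encode("utf-8")).hexdigest()[:12]
    return {
        "event_id": f"{training_agent}-{digest}",  # stable per URL
        "timestamp": article["timestamp"],
        "topic": "unclassified",
        "region": "unknown",
        "actors": [],
        "targets": [],
        "severity": "low",
        "summary": article["title"],
        "public_summary": article["title"],
        "source_type": "news",
        "confirmed": False,
        "tags": [article["domain"]],
        "impact": {},
    }
```

Deriving event_id from a hash of the URL keeps identifiers stable across repeated collection runs, which makes curator edits easier to track.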
2. Raw Audit JSONL
Path example:
backend/tmp-historical-raw/us_historical_2025.articles.jsonl
Each line contains:
- article_id
- agent_id
- source_id
- source_name
- title
- url
- domain
- timestamp
- query
- window_id
This is the provenance trail for curator review.
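A minimal sketch of writing and reading such an audit file is shown below. The field names come from the list above; the helper names write_audit_jsonl and read_audit_jsonl are hypothetical and are not the collector's own functions.

```python
import json
from pathlib import Path

AUDIT_FIELDS = (
    "article_id", "agent_id", "source_id", "source_name", "title",
    "url", "domain", "timestamp", "query", "window_id",
)


def write_audit_jsonl(path, articles):
    """Write one JSON object per line, restricted to the audit fields."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        for article in articles:
            # Missing fields are written as null so every row has the same keys.
            row = {field: article.get(field) for field in AUDIT_FIELDS}
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")
    return len(articles)


def read_audit_jsonl(path):
    """Load the audit rows back for curator review."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

One object per line keeps the file easy to grep and to diff, which suits manual provenance checks.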
Date Windows
The collector currently supports:
- 2025 -> 2025-01-01 through 2026-01-01
- 2026 -> 2026-01-01 through the current day at collection time
Important note:
As of March 7, 2026, 2026 cannot honestly mean 2026-01-01 -> 2027-01-01 yet. The collector clamps future end dates to the current day so it does not pretend future historical data exists.
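The clamping rule can be sketched in a few lines. The names WINDOW_BOUNDS and resolve_window are hypothetical, and the nominal 2027-01-01 end for the 2026 window is an assumption taken from the note above, not from the collector's code.

```python
from datetime import date

# Nominal window bounds (end-exclusive); the 2026 end date is assumed.
WINDOW_BOUNDS = {
    "2025": (date(2025, 1, 1), date(2026, 1, 1)),
    "2026": (date(2026, 1, 1), date(2027, 1, 1)),
}


def resolve_window(window_id, today=None):
    """Return (start, end) with the end clamped to the collection day."""
    today = today or date.today()
    start, end = WINDOW_BOUNDS[window_id]
    return start, min(end, today)
```

With today set to 2026-03-07, the 2026 window resolves to 2026-01-01 through 2026-03-07, while the 2025 window is unaffected because its end date is already in the past.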
What Is Real vs Heuristic
Real:
- source alignment from the project’s own source manifest
- historical article collection via GDELT
- raw audit/provenance files
- replay JSON output in the exact schema the training system already consumes
Heuristic:
- topic classification from article titles
- severity classification from article titles
- dedupe logic
- actor/target inference
- event impact generation
That heuristic layer is intentional. It gives you a bootstrap pipeline from real historical articles into replay training data, but the resulting replay should still be curator-reviewed before production post-training.
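To show why this layer needs review, here is a sketch of what title-based severity classification can look like. The keyword lists, the severity labels and the function name classify_severity are all invented for this example; the collector's actual rules are not shown in this document.

```python
import re

# Hypothetical keyword tiers, checked from most to least severe.
SEVERITY_KEYWORDS = [
    ("critical", ("strike", "attack", "missile", "invasion")),
    ("high", ("sanctions", "blockade", "mobilization")),
    ("medium", ("talks", "warning", "tariff")),
]


def classify_severity(title):
    """Return the first severity tier whose keywords appear in the title."""
    words = set(re.findall(r"[a-z]+", title.lower()))
    for label, keywords in SEVERITY_KEYWORDS:
        if words & set(keywords):
            return label
    return "low"
```

A rule like this would label a headline about a labor strike as critical, which is exactly the kind of error a curator pass is meant to catch.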
Commands
From repo root:
backend/.venv/bin/python -m trenches_env.historical_collection_cli \
--training-agent us \
--window 2025 \
--window 2026 \
--max-records-per-query 50 \
--max-events 128 \
--output-dir backend/src/trenches_env/historical_replays \
--raw-dir backend/tmp-historical-raw
All entities:
backend/.venv/bin/python -m trenches_env.historical_collection_cli \
--training-agent all \
--window 2025 \
--window 2026 \
--max-records-per-query 50 \
--max-events 128 \
--output-dir backend/src/trenches_env/historical_replays \
--raw-dir backend/tmp-historical-raw
Docs Updated
I also updated:
- backend/TRAINING_RUNBOOK.md
- backend/TRAINING_FLOW.md
- backend/POST_TRAINING_PLAN.md
- backend/pyproject.toml
So the collection path is now documented and exposed as a real CLI entry point.
Verification
The added data-collection path was verified locally with:
PYTHONPYCACHEPREFIX=/tmp/trenches-pyc python -m py_compile \
backend/src/trenches_env/historical_collection.py \
backend/src/trenches_env/historical_collection_cli.py
cd backend
uv run --extra dev python -m pytest \
tests/test_historical_collection.py \
tests/test_openenv_adapter.py \
tests/test_server.py -q
Result:
20 passed in 8.78s
Handoff
What is ready now:
- a chosen base model: Qwen/Qwen3-8B
- a collector path from real historical sources into the existing replay schema
- raw provenance output
- replay JSON output compatible with the current OpenEnv training flow
What still needs to happen next:
- Run the collector for each entity.
- Curator-review the raw article audit files and the generated replay JSON.
- Replace the current synthetic seed replays with reviewed historical replays.
- Update the actual training runs to use Qwen/Qwen3-8B as the base model.
- Keep the old synthetic seeds only for smoke tests.
One important truth:
The collector is the first real data path, but it does not by itself make the replay production-grade. The training-ready replay still needs human review because event impact shaping is currently heuristic.