# Data Handoff
## Chosen Base Model
Use:
- `Qwen/Qwen3-8B`
Why this is the best default for the `2025-01 -> 2026-01` post-training window:
- it was released inside the required time frame
- it is available on Hugging Face
- it is strong enough for structured action + prediction output
- it is small enough that running six separate per-entity post-training jobs stays realistic
This is the recommended first real base model for all six entities.
## What I Added For Data
The repo already had:
- synthetic seed replay JSON files under [backend/src/trenches_env/historical_replays](/Users/alazarmanakelew/IdeaProjects/trenches/backend/src/trenches_env/historical_replays)
- an OpenEnv replay training path
- a training CLI that consumes replay JSON with the `HistoricalReplayDefinition -> HistoricalEvent` schema
What I added is the first path from real historical sources into that same replay schema.
### New Files
- [backend/src/trenches_env/historical_collection.py](/Users/alazarmanakelew/IdeaProjects/trenches/backend/src/trenches_env/historical_collection.py)
  - builds historical source profiles from the existing source manifest
  - derives historical domains from allowlisted agent sources
  - defines the `2025` and `2026` collection windows
  - dedupes collected articles
  - converts collected articles into the exact replay event schema used by training
- [backend/src/trenches_env/historical_collection_cli.py](/Users/alazarmanakelew/IdeaProjects/trenches/backend/src/trenches_env/historical_collection_cli.py)
  - CLI collector
  - queries the GDELT DOC API month by month
  - writes raw article audit files
  - writes replay JSON files in the same schema as the existing synthetic seeds
- [backend/tests/test_historical_collection.py](/Users/alazarmanakelew/IdeaProjects/trenches/backend/tests/test_historical_collection.py)
  - validates source-profile extraction
  - validates article -> replay-event conversion
  - validates replay JSON compatibility with the existing historical replay loader
## What Source Data It Uses
The collector starts from the existing [backend/src/trenches_env/source_manifest.json](/Users/alazarmanakelew/IdeaProjects/trenches/backend/src/trenches_env/source_manifest.json).
That means it does not invent a separate source universe. It reuses the current project’s aligned sources, then extracts historical domains from them. In practice this means it leans on the project’s existing training-core sources such as:
- Reuters and wire-style reporting
- official government / ministry sources
- regional English-language outlets already assigned to the entities
- market / shipping / sanctions / diplomacy sources already present in the manifest
For historical collection, it converts those sources into domain-filtered GDELT queries and collects article candidates month by month.
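The domain-to-query step can be sketched as follows. The GDELT DOC API endpoint and parameter names (`query`, `mode`, `startdatetime`, `enddatetime`, `maxrecords`) are real, but the helper name, the example domain, and the month window are illustrative and are not the collector's actual code.

```python
from urllib.parse import urlencode

GDELT_DOC_API = "https://api.gdeltproject.org/api/v2/doc/doc"

def build_gdelt_query_url(domain: str, start: str, end: str, max_records: int = 50) -> str:
    """Build a domain-filtered GDELT DOC API URL for one collection month.

    `start` and `end` are GDELT datetime strings in YYYYMMDDHHMMSS form.
    """
    params = {
        "query": f"domain:{domain}",       # restrict results to one allowlisted domain
        "mode": "ArtList",                  # article-list mode
        "format": "json",
        "startdatetime": start,
        "enddatetime": end,
        "maxrecords": max_records,
    }
    return f"{GDELT_DOC_API}?{urlencode(params)}"

# Example: one month of candidates from a single wire-service domain.
url = build_gdelt_query_url("reuters.com", "20250101000000", "20250201000000")
```

Collecting month by month keeps each request under GDELT's per-query record cap instead of trying to pull a whole year in one call.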
## Output Files
The collector writes two outputs per run.
### 1. Replay JSON
Path example:
- `backend/src/trenches_env/historical_replays/us_historical_2025.json`
This matches the same structure as the existing synthetic seed files:
- `replay_id`
- `name`
- `description`
- `training_agent`
- `events[]`
Each event matches the current training schema:
- `event_id`
- `timestamp`
- `topic`
- `region`
- `actors`
- `targets`
- `severity`
- `summary`
- `public_summary`
- `source_type`
- `confirmed`
- `tags`
- `impact`
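As a concrete illustration, a single converted event might look like the dictionary below. The field names match the schema above; every value, including the shape of `impact`, is invented for the example and is not taken from real collector output.

```python
# Hypothetical example of one replay event in the training schema.
# All values are illustrative; the real `impact` structure may differ.
event = {
    "event_id": "us-2025-0001",
    "timestamp": "2025-03-14T00:00:00Z",
    "topic": "sanctions",
    "region": "north_america",
    "actors": ["us_treasury"],
    "targets": ["shipping_operator"],
    "severity": "medium",
    "summary": "Treasury announces new sanctions on a shipping operator.",
    "public_summary": "US expands sanctions list.",
    "source_type": "wire",
    "confirmed": True,
    "tags": ["sanctions", "shipping"],
    "impact": {"market": -0.2, "diplomacy": -0.1},
}

# Sanity check: the event carries exactly the fields the schema lists.
REQUIRED_FIELDS = {
    "event_id", "timestamp", "topic", "region", "actors", "targets",
    "severity", "summary", "public_summary", "source_type", "confirmed",
    "tags", "impact",
}
assert set(event) == REQUIRED_FIELDS
```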
### 2. Raw Audit JSONL
Path example:
- `backend/tmp-historical-raw/us_historical_2025.articles.jsonl`
Each line contains:
- `article_id`
- `agent_id`
- `source_id`
- `source_name`
- `title`
- `url`
- `domain`
- `timestamp`
- `query`
- `window_id`
This is the provenance trail for curator review.
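Writing one audit line per article can be sketched as below. The record fields mirror the list above; the helper name and the sample values are illustrative, not the collector's real function.

```python
import json

AUDIT_FIELDS = [
    "article_id", "agent_id", "source_id", "source_name",
    "title", "url", "domain", "timestamp", "query", "window_id",
]

def audit_line(article: dict) -> str:
    """Serialize one article record as a single JSONL line."""
    record = {field: article.get(field) for field in AUDIT_FIELDS}
    return json.dumps(record, ensure_ascii=False)

# Hypothetical article record for illustration.
line = audit_line({
    "article_id": "a1", "agent_id": "us", "source_id": "reuters",
    "source_name": "Reuters", "title": "Example headline",
    "url": "https://example.com/a1", "domain": "reuters.com",
    "timestamp": "2025-03-14T12:00:00Z",
    "query": "domain:reuters.com", "window_id": "2025",
})
```

One JSON object per line keeps the audit file streamable: a curator (or `grep`) can scan it without loading the whole run into memory.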
## Date Windows
The collector currently supports:
- `2025` -> `2025-01-01` through `2026-01-01`
- `2026` -> `2026-01-01` through the current day at collection time
Important note:
As of March 7, 2026, `2026` cannot honestly mean `2026-01-01 -> 2027-01-01` yet. The collector clamps future end dates to the current day so it does not pretend future historical data exists.
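The clamping rule reduces to taking the minimum of the nominal window end and today. A minimal sketch, with the window table and function name assumed from the text rather than copied from the collector:

```python
from datetime import date

# Nominal collection windows (end dates are exclusive).
WINDOWS = {
    "2025": (date(2025, 1, 1), date(2026, 1, 1)),
    "2026": (date(2026, 1, 1), date(2027, 1, 1)),
}

def effective_window(window_id: str, today: date) -> tuple[date, date]:
    """Clamp a window's end date so collection never claims future data."""
    start, end = WINDOWS[window_id]
    return start, min(end, today)

# On 2026-03-07 the `2026` window ends at the current day, not 2027-01-01.
start, end = effective_window("2026", date(2026, 3, 7))
```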
## What Is Real vs Heuristic
Real:
- source alignment from the project’s own source manifest
- historical article collection via GDELT
- raw audit/provenance files
- replay JSON output in the exact schema the training system already consumes
Heuristic:
- topic classification from article titles
- severity classification from article titles
- dedupe logic
- actor/target inference
- event `impact` generation
That heuristic layer is intentional. It gives you a bootstrap pipeline from real historical articles into replay training data, but the resulting replay should still be curator-reviewed before production post-training.
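To make the heuristic layer concrete, title-based topic and severity tagging can look like the keyword sketch below. The keyword lists and function name are invented for illustration; they are not the collector's actual rules.

```python
# Illustrative keyword tables; the real collector's rules may differ.
TOPIC_KEYWORDS = {
    "sanctions": ["sanction", "embargo", "export control"],
    "shipping": ["tanker", "port", "shipping", "freight"],
    "diplomacy": ["summit", "talks", "ambassador", "treaty"],
}
SEVERITY_KEYWORDS = {
    "high": ["strike", "attack", "collapse", "emergency"],
    "medium": ["warns", "sanction", "dispute"],
}

def classify_title(title: str) -> tuple[str, str]:
    """Heuristically assign (topic, severity) from an article title alone."""
    lowered = title.lower()
    topic = next(
        (t for t, kws in TOPIC_KEYWORDS.items() if any(k in lowered for k in kws)),
        "general",
    )
    severity = next(
        (s for s, kws in SEVERITY_KEYWORDS.items() if any(k in lowered for k in kws)),
        "low",
    )
    return topic, severity
```

This is exactly the kind of shallow signal a curator should re-check: a title-only classifier cannot distinguish a sanctions announcement from an op-ed about one.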
## Commands
From repo root:
```bash
backend/.venv/bin/python -m trenches_env.historical_collection_cli \
  --training-agent us \
  --window 2025 \
  --window 2026 \
  --max-records-per-query 50 \
  --max-events 128 \
  --output-dir backend/src/trenches_env/historical_replays \
  --raw-dir backend/tmp-historical-raw
```
All entities:
```bash
backend/.venv/bin/python -m trenches_env.historical_collection_cli \
  --training-agent all \
  --window 2025 \
  --window 2026 \
  --max-records-per-query 50 \
  --max-events 128 \
  --output-dir backend/src/trenches_env/historical_replays \
  --raw-dir backend/tmp-historical-raw
```
## Docs Updated
I also updated:
- [backend/TRAINING_RUNBOOK.md](/Users/alazarmanakelew/IdeaProjects/trenches/backend/TRAINING_RUNBOOK.md)
- [backend/TRAINING_FLOW.md](/Users/alazarmanakelew/IdeaProjects/trenches/backend/TRAINING_FLOW.md)
- [backend/POST_TRAINING_PLAN.md](/Users/alazarmanakelew/IdeaProjects/trenches/backend/POST_TRAINING_PLAN.md)
- [backend/pyproject.toml](/Users/alazarmanakelew/IdeaProjects/trenches/backend/pyproject.toml)
So the collection path is now documented and exposed as a real CLI entry point.
## Verification
The added data-collection path was verified locally with:
```bash
PYTHONPYCACHEPREFIX=/tmp/trenches-pyc python -m py_compile \
  backend/src/trenches_env/historical_collection.py \
  backend/src/trenches_env/historical_collection_cli.py
```
```bash
cd backend
uv run --extra dev python -m pytest \
  tests/test_historical_collection.py \
  tests/test_openenv_adapter.py \
  tests/test_server.py -q
```
Result:
- `20 passed in 8.78s`
## Handoff
What is ready now:
- a chosen base model: `Qwen/Qwen3-8B`
- a collector path from real historical sources into the existing replay schema
- raw provenance output
- replay JSON output compatible with the current OpenEnv training flow
What still needs to happen next:
1. Run the collector for each entity.
2. Curator-review the raw article audit files and the generated replay JSON.
3. Replace the current synthetic seed replays with reviewed historical replays.
4. Update the actual training runs to use `Qwen/Qwen3-8B` as the base model.
5. Keep the old synthetic seeds only for smoke tests.
One important truth:
The collector is the first real data path, but it does not make the replays production-grade by itself. The training-ready replays still need human review because topic, severity, actor/target, and impact shaping are currently heuristic.