trenches / DATA.md

Data Handoff

Chosen Base Model

Use:

  • Qwen/Qwen3-8B

Why this is the best default for the 2025-01 -> 2026-01 post-training window:

  • it was released inside the required time frame
  • it is available on Hugging Face
  • it is strong enough for structured action + prediction output
  • it is still realistic to run six separate entity post-training jobs on it

This is the recommended first real base model for all six entities.

What I Added For Data

The repo already had:

  • synthetic seed replay JSON files under backend/src/trenches_env/historical_replays
  • an OpenEnv replay training path
  • a training CLI that consumes replay JSON with the HistoricalReplayDefinition -> HistoricalEvent schema

What I added is the first path from real historical sources into that same replay schema.

New Files

  • backend/src/trenches_env/historical_collection.py

    • builds historical source profiles from the existing source manifest
    • derives historical domains from allowlisted agent sources
    • defines the 2025 and 2026 collection windows
    • dedupes collected articles
    • converts collected articles into the exact replay event schema used by training
  • backend/src/trenches_env/historical_collection_cli.py

    • CLI collector
    • queries the GDELT DOC API month by month
    • writes raw article audit files
    • writes replay JSON files in the same schema as the existing synthetic seeds
  • backend/tests/test_historical_collection.py

    • validates source-profile extraction
    • validates article -> replay-event conversion
    • validates replay JSON compatibility with the existing historical replay loader
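The article -> replay-event conversion at the heart of these files can be sketched roughly as follows. This is an illustrative stand-in, not the repo's actual code: the field names follow the event schema listed later in this handoff, while the helper name, the hard-coded `topic`/`severity` defaults, and the input keys are assumptions standing in for the collector's heuristics.

```python
# Hypothetical sketch of the article -> replay-event mapping; the real
# collector derives topic/severity heuristically instead of hard-coding them.
def article_to_event(article: dict, index: int) -> dict:
    """Map one collected article record onto the replay event schema."""
    return {
        "event_id": f"hist-{article['agent_id']}-{index:04d}",
        "timestamp": article["timestamp"],
        "topic": "diplomacy",            # heuristic in the real collector
        "region": article.get("region", "global"),
        "actors": [article["agent_id"]],
        "targets": [],
        "severity": "low",               # heuristic in the real collector
        "summary": article["title"],
        "public_summary": article["title"],
        "source_type": "news",
        "confirmed": False,              # pending curator review
        "tags": ["historical", article["domain"]],
        "impact": {},
    }

event = article_to_event(
    {
        "agent_id": "us",
        "timestamp": "2025-03-01T00:00:00Z",
        "title": "Example wire headline",
        "domain": "reuters.com",
    },
    index=1,
)
```

The point of keeping the output dict key-for-key identical to the synthetic seeds is that the existing replay loader and training CLI need no changes to consume it.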

What Source Data It Uses

The collector starts from the existing backend/src/trenches_env/source_manifest.json.

That means it does not invent a separate source universe. It reuses the current project’s aligned sources, then extracts historical domains from them. In practice this means it leans on the project’s existing training-core sources such as:

  • Reuters and wire-style reporting
  • official government / ministry sources
  • regional English-language outlets already assigned to the entities
  • market / shipping / sanctions / diplomacy sources already present in the manifest

For historical collection, it converts those sources into domain-filtered GDELT queries and collects article candidates month by month.
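A domain-filtered, month-by-month query against the GDELT DOC 2.0 API can be sketched like this. The parameter names (`query`, `mode`, `format`, `startdatetime`, `enddatetime`, `maxrecords`) and the `domain:` operator follow GDELT's public DOC API; the exact query shape the collector builds is an assumption for illustration.

```python
from urllib.parse import urlencode

# GDELT DOC 2.0 API endpoint; datetimes use the YYYYMMDDHHMMSS format.
GDELT_DOC_API = "https://api.gdeltproject.org/api/v2/doc/doc"

def build_month_query(domains: list[str], year: int, month: int,
                      max_records: int = 50) -> str:
    """Build one artlist query covering a single calendar month."""
    # Roll over to January of the next year after December.
    next_year = year + 1 if month == 12 else year
    next_month = 1 if month == 12 else month + 1
    params = {
        "query": " OR ".join(f"domain:{d}" for d in domains),
        "mode": "artlist",
        "format": "json",
        "startdatetime": f"{year:04d}{month:02d}01000000",
        "enddatetime": f"{next_year:04d}{next_month:02d}01000000",
        "maxrecords": max_records,
    }
    return f"{GDELT_DOC_API}?{urlencode(params)}"

url = build_month_query(["reuters.com"], 2025, 1)
```

Issuing one bounded query per month keeps each request small and makes the raw audit files naturally partition by collection window.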

Output Files

The collector writes two outputs per run.

1. Replay JSON

Path example:

  • backend/src/trenches_env/historical_replays/us_historical_2025.json

This matches the same structure as the existing synthetic seed files:

  • replay_id
  • name
  • description
  • training_agent
  • events[]

Each event matches the current training schema:

  • event_id
  • timestamp
  • topic
  • region
  • actors
  • targets
  • severity
  • summary
  • public_summary
  • source_type
  • confirmed
  • tags
  • impact
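Putting the envelope and event schema together, a minimal replay payload looks roughly like this. The top-level keys and event fields come from the lists above; the concrete values are stubs for illustration.

```python
import json

# Hedged sketch of one replay JSON file; every value below is a placeholder.
replay = {
    "replay_id": "us_historical_2025",
    "name": "US historical replay 2025",
    "description": "Collected from GDELT-discovered articles (curator review pending).",
    "training_agent": "us",
    "events": [
        {
            "event_id": "hist-us-0001",
            "timestamp": "2025-01-15T00:00:00Z",
            "topic": "diplomacy",
            "region": "global",
            "actors": ["us"],
            "targets": [],
            "severity": "low",
            "summary": "Example headline",
            "public_summary": "Example headline",
            "source_type": "news",
            "confirmed": False,
            "tags": ["historical"],
            "impact": {},
        }
    ],
}

payload = json.dumps(replay, indent=2)
```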

2. Raw Audit JSONL

Path example:

  • backend/tmp-historical-raw/us_historical_2025.articles.jsonl

Each line contains:

  • article_id
  • agent_id
  • source_id
  • source_name
  • title
  • url
  • domain
  • timestamp
  • query
  • window_id

This is the provenance trail for curator review.
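The audit format is plain JSONL: one JSON object per line, keyed by the provenance fields listed above. A minimal writer can be sketched as follows (writing to an in-memory buffer purely for illustration; the real collector writes to the `--raw-dir` path).

```python
import io
import json

def write_audit_lines(records: list[dict], out) -> None:
    """Emit one JSON object per line; sorted keys keep diffs stable."""
    for record in records:
        out.write(json.dumps(record, sort_keys=True) + "\n")

buf = io.StringIO()
write_audit_lines(
    [{
        "article_id": "a1", "agent_id": "us", "source_id": "reuters",
        "source_name": "Reuters", "title": "Example headline",
        "url": "https://example.com", "domain": "reuters.com",
        "timestamp": "2025-01-01T00:00:00Z",
        "query": "domain:reuters.com", "window_id": "2025",
    }],
    buf,
)
lines = buf.getvalue().splitlines()
```

Line-per-record JSONL is deliberate here: a curator can grep, sample, or diff the trail without loading the whole file.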

Date Windows

The collector currently supports:

  • 2025 -> 2025-01-01 through 2026-01-01
  • 2026 -> 2026-01-01 through the current day at collection time

Important note:

As of March 7, 2026, the 2026 window cannot honestly cover 2026-01-01 -> 2027-01-01 yet. The collector clamps future end dates to the current day so it does not pretend future historical data exists.
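The clamping behaviour is simple enough to state in a few lines. This is a sketch of the rule described above, not the collector's actual function:

```python
from datetime import date

def clamp_window(start: date, end: date, today: date) -> tuple[date, date]:
    """Cut a window's end date off at 'today' so it never extends into the future."""
    return start, min(end, today)

# A nominal 2026 window, clamped at collection time on 2026-03-07.
start, end = clamp_window(date(2026, 1, 1), date(2027, 1, 1), today=date(2026, 3, 7))
```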

What Is Real vs Heuristic

Real:

  • source alignment from the project’s own source manifest
  • historical article collection via GDELT
  • raw audit/provenance files
  • replay JSON output in the exact schema the training system already consumes

Heuristic:

  • topic classification from article titles
  • severity classification from article titles
  • dedupe logic
  • actor/target inference
  • event impact generation

That heuristic layer is intentional. It gives you a bootstrap pipeline from real historical articles into replay training data, but the resulting replay should still be curator-reviewed before production post-training.
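To make the heuristic layer concrete, here is a hedged illustration of title-based classification and dedupe. The keyword table is invented for this example; the real collector's rules differ, which is exactly why curator review remains mandatory.

```python
# Invented keyword table for illustration only.
TOPIC_KEYWORDS = {
    "sanctions": "sanctions",
    "tariff": "trade",
    "missile": "military",
}

def classify_topic(title: str) -> str:
    """First keyword hit wins; everything else falls back to 'general'."""
    lowered = title.lower()
    for keyword, topic in TOPIC_KEYWORDS.items():
        if keyword in lowered:
            return topic
    return "general"

def dedupe_by_title(articles: list[dict]) -> list[dict]:
    """Drop later articles whose normalized title was already seen."""
    seen, kept = set(), []
    for article in articles:
        key = article["title"].strip().lower()
        if key not in seen:
            seen.add(key)
            kept.append(article)
    return kept

topic = classify_topic("New sanctions announced")
unique = dedupe_by_title([{"title": "A"}, {"title": "a "}, {"title": "B"}])
```

Title-only signals like these are cheap and reproducible, but they will mislabel nuanced events; that limitation is the whole argument for the review step before production post-training.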

Commands

From repo root:

backend/.venv/bin/python -m trenches_env.historical_collection_cli \
  --training-agent us \
  --window 2025 \
  --window 2026 \
  --max-records-per-query 50 \
  --max-events 128 \
  --output-dir backend/src/trenches_env/historical_replays \
  --raw-dir backend/tmp-historical-raw

All entities:

backend/.venv/bin/python -m trenches_env.historical_collection_cli \
  --training-agent all \
  --window 2025 \
  --window 2026 \
  --max-records-per-query 50 \
  --max-events 128 \
  --output-dir backend/src/trenches_env/historical_replays \
  --raw-dir backend/tmp-historical-raw

Docs Updated

I also updated the relevant project docs, so the collection path is now documented and exposed as a real CLI entry point.

Verification

The added data-collection path was verified locally with:

PYTHONPYCACHEPREFIX=/tmp/trenches-pyc python -m py_compile \
  backend/src/trenches_env/historical_collection.py \
  backend/src/trenches_env/historical_collection_cli.py
cd backend
uv run --extra dev python -m pytest \
  tests/test_historical_collection.py \
  tests/test_openenv_adapter.py \
  tests/test_server.py -q

Result:

  • 20 passed in 8.78s

Handoff

What is ready now:

  • a chosen base model: Qwen/Qwen3-8B
  • a collector path from real historical sources into the existing replay schema
  • raw provenance output
  • replay JSON output compatible with the current OpenEnv training flow

What still needs to happen next:

  1. Run the collector for each entity.
  2. Curator-review the raw article audit files and the generated replay JSON.
  3. Replace the current synthetic seed replays with reviewed historical replays.
  4. Update the actual training runs to use Qwen/Qwen3-8B as the base model.
  5. Keep the old synthetic seeds only for smoke tests.

One important truth:

The collector is the first real data path, but it does not magically make the replay production-grade by itself. The training-ready replay still needs human review because event impact shaping is currently heuristic.