# Data Handoff

## Chosen Base Model

Use:

- `Qwen/Qwen3-8B`

Why this is the best default for the `2025-01 -> 2026-01` post-training window:

- it was released inside the required time frame
- it is available on Hugging Face
- it is strong enough for structured action + prediction output
- it is small enough (8B parameters) that running six separate entity post-training jobs on it is still realistic

This is the recommended first real base model for all six entities.

## What I Added For Data

The repo already had:

- synthetic seed replay JSON files under [backend/src/trenches_env/historical_replays](/Users/alazarmanakelew/IdeaProjects/trenches/backend/src/trenches_env/historical_replays)
- an OpenEnv replay training path
- a training CLI that consumes replay JSON with the `HistoricalReplayDefinition -> HistoricalEvent` schema

What I added is the first path from real historical sources into that same replay schema.

### New Files

- [backend/src/trenches_env/historical_collection.py](/Users/alazarmanakelew/IdeaProjects/trenches/backend/src/trenches_env/historical_collection.py)
  - builds historical source profiles from the existing source manifest
  - derives historical domains from allowlisted agent sources
  - defines the `2025` and `2026` collection windows
  - dedupes collected articles
  - converts collected articles into the exact replay event schema used by training

- [backend/src/trenches_env/historical_collection_cli.py](/Users/alazarmanakelew/IdeaProjects/trenches/backend/src/trenches_env/historical_collection_cli.py)
  - CLI collector
  - queries the GDELT DOC API month by month
  - writes raw article audit files
  - writes replay JSON files in the same schema as the existing synthetic seeds

- [backend/tests/test_historical_collection.py](/Users/alazarmanakelew/IdeaProjects/trenches/backend/tests/test_historical_collection.py)
  - validates source-profile extraction
  - validates article -> replay-event conversion
  - validates replay JSON compatibility with the existing historical replay loader

## What Source Data It Uses

The collector starts from the existing [backend/src/trenches_env/source_manifest.json](/Users/alazarmanakelew/IdeaProjects/trenches/backend/src/trenches_env/source_manifest.json).

That means it does not invent a separate source universe. It reuses the current project’s aligned sources, then extracts historical domains from them. In practice this means it leans on the project’s existing training-core sources such as:

- Reuters and wire-style reporting
- official government / ministry sources
- regional English-language outlets already assigned to the entities
- market / shipping / sanctions / diplomacy sources already present in the manifest

For historical collection, it converts those sources into domain-filtered GDELT queries and collects article candidates month by month.
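A minimal sketch of the month-by-month, domain-filtered query idea follows. It is not the collector's actual code: the helper names, the placeholder domains, and the exact parameter choices are assumptions for illustration. It only builds request URLs for the public GDELT DOC 2.0 endpoint and does not send them.

```python
from datetime import date
from urllib.parse import urlencode

GDELT_DOC_URL = "https://api.gdeltproject.org/api/v2/doc/doc"


def month_ranges(start: date, end: date):
    """Yield (range_start, range_end) pairs that split [start, end) by calendar month."""
    cursor = start
    while cursor < end:
        # First day of the month after `cursor`.
        if cursor.month == 12:
            next_month = date(cursor.year + 1, 1, 1)
        else:
            next_month = date(cursor.year, cursor.month + 1, 1)
        yield cursor, min(next_month, end)
        cursor = next_month


def build_query_url(domains: list[str], start: date, end: date, max_records: int = 50) -> str:
    """Build one article-list request restricted to the given domains (hypothetical helper)."""
    domain_filter = " OR ".join(f"domain:{d}" for d in domains)
    if len(domains) > 1:
        # GDELT expects OR'd terms to be wrapped in parentheses.
        domain_filter = f"({domain_filter})"
    params = {
        "query": domain_filter,
        "mode": "ArtList",
        "format": "json",
        "startdatetime": start.strftime("%Y%m%d000000"),
        "enddatetime": end.strftime("%Y%m%d000000"),
        "maxrecords": max_records,
    }
    return f"{GDELT_DOC_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder domains; the real list comes from the source manifest.
    for month_start, month_end in month_ranges(date(2025, 1, 1), date(2026, 1, 1)):
        print(build_query_url(["reuters.com", "example.org"], month_start, month_end))
```

Splitting by month keeps each request inside the per-query record cap, which is why the CLI exposes `--max-records-per-query`.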

## Output Files

The collector writes two outputs per run.

### 1. Replay JSON

Path example:

- `backend/src/trenches_env/historical_replays/us_historical_2025.json`

This has the same top-level structure as the existing synthetic seed files:

- `replay_id`
- `name`
- `description`
- `training_agent`
- `events[]`

Each event matches the current training schema:

- `event_id`
- `timestamp`
- `topic`
- `region`
- `actors`
- `targets`
- `severity`
- `summary`
- `public_summary`
- `source_type`
- `confirmed`
- `tags`
- `impact`
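As a quick pre-review sanity check, the field lists above can be turned into a small key-presence test. This is a sketch only: it assumes every listed field is required and says nothing about value types, which remain defined by the existing historical replay loader. The function name is hypothetical.

```python
REPLAY_FIELDS = ("replay_id", "name", "description", "training_agent", "events")
EVENT_FIELDS = (
    "event_id", "timestamp", "topic", "region", "actors", "targets",
    "severity", "summary", "public_summary", "source_type", "confirmed",
    "tags", "impact",
)


def find_missing_fields(replay: dict) -> list[str]:
    """Return readable paths for every expected key that is absent."""
    missing = [key for key in REPLAY_FIELDS if key not in replay]
    for index, event in enumerate(replay.get("events", [])):
        missing.extend(
            f"events[{index}].{key}" for key in EVENT_FIELDS if key not in event
        )
    return missing
```

To use it, load a replay file with `json.load` and pass the resulting dict to `find_missing_fields`; an empty list means every listed key is present.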

### 2. Raw Audit JSONL

Path example:

- `backend/tmp-historical-raw/us_historical_2025.articles.jsonl`

Each line contains:

- `article_id`
- `agent_id`
- `source_id`
- `source_name`
- `title`
- `url`
- `domain`
- `timestamp`
- `query`
- `window_id`

This is the provenance trail for curator review.
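For curator review, a per-domain count is often the first thing worth looking at. The sketch below assumes only what is listed above: one JSON object per line with a `domain` key. The function name is hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def count_by_domain(lines: Iterable[str]) -> Counter:
    """Count audit records per `domain`, skipping blank lines."""
    counts: Counter = Counter()
    for line in lines:
        if line.strip():
            counts[json.loads(line)["domain"]] += 1
    return counts
```

Because an open text file iterates line by line, the file handle can be passed in directly: `with open(path, encoding="utf-8") as handle: count_by_domain(handle)`.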

## Date Windows

The collector currently supports:

- `2025` -> `2025-01-01` through `2026-01-01`
- `2026` -> `2026-01-01` through the current day at collection time

Important note:

As of March 7, 2026, the `2026` window cannot yet cover `2026-01-01 -> 2027-01-01`. The collector clamps any future end date to the current day, so it never claims to have collected data for dates that have not happened yet.
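The clamping rule can be sketched as follows. The function and constant names are hypothetical, and raising an error when a window has not started yet is an assumption, not confirmed collector behavior.

```python
from datetime import date

# Nominal window bounds, taken from the list above (the 2026 end is the
# unclamped calendar-year end).
NOMINAL_WINDOWS = {
    "2025": (date(2025, 1, 1), date(2026, 1, 1)),
    "2026": (date(2026, 1, 1), date(2027, 1, 1)),
}


def resolve_window(window_id: str, today: date) -> tuple[date, date]:
    """Return (start, end) for a window, with the end clamped to `today`."""
    start, nominal_end = NOMINAL_WINDOWS[window_id]
    if today < start:
        raise ValueError(f"window {window_id!r} has not started as of {today}")
    return start, min(nominal_end, today)
```

With `today = date(2026, 3, 7)`, the `2025` window is returned unchanged and the `2026` window ends on `2026-03-07`.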

## What Is Real vs Heuristic

Real:

- source alignment from the project’s own source manifest
- historical article collection via GDELT
- raw audit/provenance files
- replay JSON output in the exact schema the training system already consumes

Heuristic:

- topic classification from article titles
- severity classification from article titles
- dedupe logic
- actor/target inference
- event `impact` generation

That heuristic layer is intentional. It gives you a bootstrap pipeline from real historical articles into replay training data, but the resulting replay should still be curator-reviewed before production post-training.
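To make "heuristic" concrete, here is one plausible shape for the dedupe step. It is an illustration under stated assumptions, not the logic in `historical_collection.py`: it treats two articles as duplicates when they share a URL location (host plus path, ignoring scheme, `www.`, query string, and trailing slash) or a case- and whitespace-normalized title.

```python
from urllib.parse import urlsplit


def dedupe_key(article: dict) -> tuple[str, str]:
    """Return (normalized URL location, normalized title) for one article."""
    parts = urlsplit(article["url"])
    location = parts.netloc.lower().removeprefix("www.") + parts.path.rstrip("/")
    title = " ".join(article["title"].lower().split())
    return location, title


def dedupe_articles(articles: list[dict]) -> list[dict]:
    """Keep the first article seen for each URL location and each title."""
    seen_locations: set[str] = set()
    seen_titles: set[str] = set()
    kept = []
    for article in articles:
        location, title = dedupe_key(article)
        if location in seen_locations or title in seen_titles:
            continue
        seen_locations.add(location)
        seen_titles.add(title)
        kept.append(article)
    return kept
```

A rule like this collapses syndicated copies of the same wire story, but it can also drop distinct articles that happen to share a headline, which is one reason the output still needs curator review.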

## Commands

From repo root:

```bash
backend/.venv/bin/python -m trenches_env.historical_collection_cli \
  --training-agent us \
  --window 2025 \
  --window 2026 \
  --max-records-per-query 50 \
  --max-events 128 \
  --output-dir backend/src/trenches_env/historical_replays \
  --raw-dir backend/tmp-historical-raw
```

All entities:

```bash
backend/.venv/bin/python -m trenches_env.historical_collection_cli \
  --training-agent all \
  --window 2025 \
  --window 2026 \
  --max-records-per-query 50 \
  --max-events 128 \
  --output-dir backend/src/trenches_env/historical_replays \
  --raw-dir backend/tmp-historical-raw
```

## Docs Updated

I also updated:

- [backend/TRAINING_RUNBOOK.md](/Users/alazarmanakelew/IdeaProjects/trenches/backend/TRAINING_RUNBOOK.md)
- [backend/TRAINING_FLOW.md](/Users/alazarmanakelew/IdeaProjects/trenches/backend/TRAINING_FLOW.md)
- [backend/POST_TRAINING_PLAN.md](/Users/alazarmanakelew/IdeaProjects/trenches/backend/POST_TRAINING_PLAN.md)
- [backend/pyproject.toml](/Users/alazarmanakelew/IdeaProjects/trenches/backend/pyproject.toml)

So the collection path is now documented and exposed as a real CLI entry point.

## Verification

The added data-collection path was verified locally with a bytecode-compile check of the two new modules, followed by the related test modules:

```bash
PYTHONPYCACHEPREFIX=/tmp/trenches-pyc python -m py_compile \
  backend/src/trenches_env/historical_collection.py \
  backend/src/trenches_env/historical_collection_cli.py
```

```bash
cd backend
uv run --extra dev python -m pytest \
  tests/test_historical_collection.py \
  tests/test_openenv_adapter.py \
  tests/test_server.py -q
```

Result:

- `20 passed in 8.78s`

## Handoff

What is ready now:

- a chosen base model: `Qwen/Qwen3-8B`
- a collector path from real historical sources into the existing replay schema
- raw provenance output
- replay JSON output compatible with the current OpenEnv training flow

What still needs to happen next:

1. Run the collector for each entity.
2. Curator-review the raw article audit files and the generated replay JSON.
3. Replace the current synthetic seed replays with reviewed historical replays.
4. Update the actual training runs to use `Qwen/Qwen3-8B` as the base model.
5. Keep the old synthetic seeds only for smoke tests.

One important truth:

The collector is the first real data path, but it does not make the replay production-grade by itself. The training-ready replay still needs human review because topic and severity classification, dedupe, actor/target inference, and event `impact` generation are all currently heuristic.