# Data Handoff

## Chosen Base Model

Use:

- `Qwen/Qwen3-8B`

Why this is the best default for the `2025-01 -> 2026-01` post-training window:

- it was released inside the required time frame
- it is available on Hugging Face
- it is strong enough for structured action + prediction output
- it is still realistic to run six separate entity post-training jobs on it

This is the recommended first real base model for all six entities.

## What I Added For Data

The repo already had:

- synthetic seed replay JSON files under [backend/src/trenches_env/historical_replays](backend/src/trenches_env/historical_replays)
- an OpenEnv replay training path
- a training CLI that consumes replay JSON with the `HistoricalReplayDefinition -> HistoricalEvent` schema

What I added is the first path from real historical sources into that same replay schema.

### New Files

- [backend/src/trenches_env/historical_collection.py](backend/src/trenches_env/historical_collection.py)
  - builds historical source profiles from the existing source manifest
  - derives historical domains from allowlisted agent sources
  - defines the `2025` and `2026` collection windows
  - dedupes collected articles
  - converts collected articles into the exact replay event schema used by training

- [backend/src/trenches_env/historical_collection_cli.py](backend/src/trenches_env/historical_collection_cli.py)
  - CLI collector
  - queries the GDELT DOC API month by month
  - writes raw article audit files
  - writes replay JSON files in the same schema as the existing synthetic seeds

- [backend/tests/test_historical_collection.py](backend/tests/test_historical_collection.py)
  - validates source-profile extraction
  - validates article -> replay-event conversion
  - validates replay JSON compatibility with the existing historical replay loader
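The dedupe step listed above can be pictured with a short sketch. The real logic lives in `historical_collection.py`; this hypothetical version keys only on a normalized URL (host plus path, trailing slash stripped), which is one plausible way to collapse near-duplicate article records.

```python
# Hypothetical sketch of the dedupe step: the real logic lives in
# historical_collection.py; this version keys on normalized URL only.
from urllib.parse import urlsplit


def dedupe_articles(articles: list[dict]) -> list[dict]:
    """Keep the first article seen for each normalized URL."""
    seen: set[str] = set()
    unique: list[dict] = []
    for article in articles:
        parts = urlsplit(article["url"].lower())
        # Drop scheme and query so http/https and tracking params collapse.
        key = f"{parts.netloc}{parts.path.rstrip('/')}"
        if key not in seen:
            seen.add(key)
            unique.append(article)
    return unique


articles = [
    {"url": "https://example.com/story-1", "title": "A"},
    {"url": "http://example.com/story-1/", "title": "A (dup)"},
    {"url": "https://example.com/story-2", "title": "B"},
]
print(len(dedupe_articles(articles)))  # 2
```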

## What Source Data It Uses

The collector starts from the existing [backend/src/trenches_env/source_manifest.json](backend/src/trenches_env/source_manifest.json).

That means it does not invent a separate source universe. It reuses the current project’s aligned sources, then extracts historical domains from them. In practice this means it leans on the project’s existing training-core sources such as:

- Reuters and wire-style reporting
- official government / ministry sources
- regional English-language outlets already assigned to the entities
- market / shipping / sanctions / diplomacy sources already present in the manifest

For historical collection, it converts those sources into domain-filtered GDELT queries and collects article candidates month by month.
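A domain-filtered GDELT query can be sketched as follows. The parameter names match the public GDELT DOC 2.0 API; the keyword and domain values here are illustrative placeholders, not the collector's real queries.

```python
# Sketch of a domain-filtered GDELT DOC 2.0 query for one month.
# The "sanctions" keyword and "reuters.com" domain are illustrative.
from urllib.parse import urlencode

GDELT_DOC_API = "https://api.gdeltproject.org/api/v2/doc/doc"


def build_month_query(domain: str, keywords: str, start: str, end: str) -> str:
    """Build an artlist query URL for one [start, end] date window."""
    params = {
        "query": f"{keywords} domain:{domain}",
        "mode": "artlist",
        "format": "json",
        "maxrecords": 50,
        "startdatetime": f"{start}000000",  # YYYYMMDD + HHMMSS
        "enddatetime": f"{end}235959",
    }
    return f"{GDELT_DOC_API}?{urlencode(params)}"


url = build_month_query("reuters.com", "sanctions", "20250101", "20250131")
print(url)
```

Running one such query per source domain per month is what "collects article candidates month by month" means in practice.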

## Output Files

The collector writes two outputs per run.

### 1. Replay JSON

Path example:

- `backend/src/trenches_env/historical_replays/us_historical_2025.json`

This matches the structure of the existing synthetic seed files:

- `replay_id`
- `name`
- `description`
- `training_agent`
- `events[]`

Each event matches the current training schema:

- `event_id`
- `timestamp`
- `topic`
- `region`
- `actors`
- `targets`
- `severity`
- `summary`
- `public_summary`
- `source_type`
- `confirmed`
- `tags`
- `impact`
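Putting the two field lists together, a minimal replay file looks like the sketch below. Every value is an illustrative placeholder, not collector output, and the value types (string severity, dict-shaped `impact`) are assumptions about the schema rather than confirmed details.

```python
# A minimal replay in the schema above; all values are illustrative
# placeholders, and the value types are assumptions, not confirmed.
import json

event = {
    "event_id": "us-2025-0001",
    "timestamp": "2025-03-14T09:00:00Z",
    "topic": "sanctions",
    "region": "north_america",
    "actors": ["us"],
    "targets": ["example_target"],
    "severity": "medium",
    "summary": "Treasury announces new sanctions designations.",
    "public_summary": "New sanctions announced.",
    "source_type": "wire",
    "confirmed": True,
    "tags": ["sanctions", "treasury"],
    "impact": {"stability": -0.1},
}

replay = {
    "replay_id": "us_historical_2025",
    "name": "US historical 2025",
    "description": "Collected historical events for the US entity.",
    "training_agent": "us",
    "events": [event],
}

# Serialize exactly as the collector would write the replay file.
print(json.dumps(replay, indent=2)[:80])
```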

### 2. Raw Audit JSONL

Path example:

- `backend/tmp-historical-raw/us_historical_2025.articles.jsonl`

Each line contains:

- `article_id`
- `agent_id`
- `source_id`
- `source_name`
- `title`
- `url`
- `domain`
- `timestamp`
- `query`
- `window_id`

This is the provenance trail for curator review.
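One audit line can be sketched like this. The field values mirror the key list above but are placeholders, not real collector output; the only real constraint is one JSON object per line.

```python
# Sketch of one audit JSONL line; values are placeholders that mirror
# the key list above, not real collector output.
import json

record = {
    "article_id": "a-0001",
    "agent_id": "us",
    "source_id": "reuters",
    "source_name": "Reuters",
    "title": "Example headline",
    "url": "https://www.reuters.com/example",
    "domain": "reuters.com",
    "timestamp": "2025-03-14T09:00:00Z",
    "query": "sanctions domain:reuters.com",
    "window_id": "2025",
}
line = json.dumps(record, ensure_ascii=False)  # one JSON object per line
print(line[:40])
```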

## Date Windows

The collector currently supports:

- `2025` -> `2025-01-01` through `2026-01-01`
- `2026` -> `2026-01-01` through the current day at collection time

Important note:

As of March 7, 2026, `2026` cannot honestly mean `2026-01-01 -> 2027-01-01` yet. The collector clamps future end dates to the current day so it does not pretend future historical data exists.
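The clamp described above reduces to one comparison. This is a sketch of the behavior, not the collector's actual function names: a window whose end date is in the future is truncated to the current day before any queries are issued.

```python
# Sketch of the end-date clamp: a window whose end is in the future is
# truncated to "today" so no future dates are ever queried.
from datetime import date


def clamp_window(start: date, end: date, today: date) -> tuple[date, date]:
    """Never let a collection window extend past the current day."""
    return start, min(end, today)


# The 2026 window, clamped as of March 7, 2026.
start, end = clamp_window(date(2026, 1, 1), date(2027, 1, 1), date(2026, 3, 7))
print(end)  # 2026-03-07
```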

## What Is Real vs Heuristic

Real:

- source alignment from the project’s own source manifest
- historical article collection via GDELT
- raw audit/provenance files
- replay JSON output in the exact schema the training system already consumes

Heuristic:

- topic classification from article titles
- severity classification from article titles
- dedupe logic
- actor/target inference
- event `impact` generation

That heuristic layer is intentional. It gives you a bootstrap pipeline from real historical articles into replay training data, but the resulting replay should still be curator-reviewed before production post-training.
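To make the heuristic layer concrete, title-based topic classification amounts to keyword matching along these lines. The keyword lists here are made up for illustration; the collector's real lists live in `historical_collection.py`.

```python
# Illustrative sketch of title-based topic classification; the keyword
# lists are made up, not the collector's real heuristics.
TOPIC_KEYWORDS = {
    "sanctions": ["sanction", "embargo", "export control"],
    "diplomacy": ["summit", "talks", "treaty"],
    "shipping": ["tanker", "port", "shipping"],
}


def classify_topic(title: str) -> str:
    """Return the first topic whose keyword appears in the title."""
    lowered = title.lower()
    for topic, keywords in TOPIC_KEYWORDS.items():
        if any(kw in lowered for kw in keywords):
            return topic
    return "general"


print(classify_topic("US imposes new sanctions on shipping firm"))  # sanctions
```

The obvious weakness, and the reason curator review is required, is that a headline mentioning two topics is labeled by whichever keyword list matches first.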

## Commands

From repo root:

```bash
backend/.venv/bin/python -m trenches_env.historical_collection_cli \
  --training-agent us \
  --window 2025 \
  --window 2026 \
  --max-records-per-query 50 \
  --max-events 128 \
  --output-dir backend/src/trenches_env/historical_replays \
  --raw-dir backend/tmp-historical-raw
```

All entities:

```bash
backend/.venv/bin/python -m trenches_env.historical_collection_cli \
  --training-agent all \
  --window 2025 \
  --window 2026 \
  --max-records-per-query 50 \
  --max-events 128 \
  --output-dir backend/src/trenches_env/historical_replays \
  --raw-dir backend/tmp-historical-raw
```

## Docs Updated

I also updated:

- [backend/TRAINING_RUNBOOK.md](backend/TRAINING_RUNBOOK.md)
- [backend/TRAINING_FLOW.md](backend/TRAINING_FLOW.md)
- [backend/POST_TRAINING_PLAN.md](backend/POST_TRAINING_PLAN.md)
- [backend/pyproject.toml](backend/pyproject.toml)

So the collection path is now documented and exposed as a real CLI entry point.

## Verification

The added data-collection path was verified locally with:

```bash
PYTHONPYCACHEPREFIX=/tmp/trenches-pyc python -m py_compile \
  backend/src/trenches_env/historical_collection.py \
  backend/src/trenches_env/historical_collection_cli.py
```

```bash
cd backend
uv run --extra dev python -m pytest \
  tests/test_historical_collection.py \
  tests/test_openenv_adapter.py \
  tests/test_server.py -q
```

Result:

- `20 passed in 8.78s`

## Handoff

What is ready now:

- a chosen base model: `Qwen/Qwen3-8B`
- a collector path from real historical sources into the existing replay schema
- raw provenance output
- replay JSON output compatible with the current OpenEnv training flow

What still needs to happen next:

1. Run the collector for each entity.
2. Curator-review the raw article audit files and the generated replay JSON.
3. Replace the current synthetic seed replays with reviewed historical replays.
4. Update the actual training runs to use `Qwen/Qwen3-8B` as the base model.
5. Keep the old synthetic seeds only for smoke tests.

One important truth:

The collector is the first real data path, but it does not magically make the replay production-grade by itself. The training-ready replay still needs human review because event impact shaping is currently heuristic.