# Training Data

Seed and generated datasets for SFT warm-start live under `data/`.

## Files

| File | Purpose |
|------|---------|
| `sft.jsonl` | Seed SFT dataset in ChatML format, including assistant tool calls and tool responses. |
| `tool_info.md` | Reusable tool catalog that can be injected into generated system prompts with `--tool-info`. |
| `synthetic*.jsonl` | Synthetic datasets generated by `openrange synthetic-data` (gitignored). |

## Seed SFT Format

Each line in `sft.jsonl` is a single solved trajectory:

```json
{
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "...", "tool_calls": [...]},
    {"role": "tool", "tool_call_id": "...", "name": "shell_command", "content": "..."}
  ],
  "metadata": {"source": "bootstrap", "success": true},
  "ground_truth_flag": "FLAG{...}",
  "optimal_steps": 8
}
```
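As a quick sanity check on this layout, the sketch below loads a JSONL file and reports records that do not match the example above. It is a minimal illustration and not part of OpenRange: the helper names `validate_record` and `load_trajectories` are hypothetical, and treating all four top-level keys as required is an assumption drawn only from the sample record.

```python
import json
from pathlib import Path

# Assumption: every record carries the four top-level keys shown in the example.
REQUIRED_KEYS = ("messages", "metadata", "ground_truth_flag", "optimal_steps")
KNOWN_ROLES = {"system", "user", "assistant", "tool"}


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one trajectory record (empty if none)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in record]
    for index, message in enumerate(record.get("messages", [])):
        role = message.get("role")
        if role not in KNOWN_ROLES:
            problems.append(f"message {index}: unexpected role {role!r}")
        # Tool responses in the example carry the id of the call they answer.
        if role == "tool" and "tool_call_id" not in message:
            problems.append(f"message {index}: tool message lacks tool_call_id")
    return problems


def load_trajectories(path: str) -> list[dict]:
    """Parse a JSONL file line by line and raise on the first invalid record."""
    records = []
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        problems = validate_record(record)
        if problems:
            raise ValueError(f"{path}:{line_number}: {'; '.join(problems)}")
        records.append(record)
    return records
```

Calling `load_trajectories("data/sft.jsonl")` would then return the parsed records, or raise a `ValueError` naming the first offending line.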

## Generating Synthetic Data

Pass the seed file as bootstrap context. The command writes the seed records and the newly generated OpenRange traces to a single output file:

```bash
uv run --extra synthetic openrange synthetic-data \
  --manifest manifests/tier1_basic.yaml \
  --output data/synthetic_sft_5.jsonl \
  --num-traces 5 \
  --roles red \
  --teacher-model azure/gpt-5.2-codex \
  --bootstrap-traces data/sft.jsonl \
  --tool-info data/tool_info.md
```

The output file keeps the imported bootstrap records intact and appends the generated OpenRange records after them.
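The merge behavior described above can be spot-checked with a short script. This is a sketch under assumptions: `split_merged_output` is a hypothetical helper, not an OpenRange API, and it reads "intact" to mean that each bootstrap record parses to the same JSON value in the output, in the same order.

```python
import json


def split_merged_output(bootstrap_lines: list[str], output_lines: list[str]) -> list[dict]:
    """Check that the output starts with the bootstrap records; return the records after them."""
    bootstrap = [json.loads(line) for line in bootstrap_lines if line.strip()]
    merged = [json.loads(line) for line in output_lines if line.strip()]
    # Assumption: bootstrap records come first and are unchanged once parsed.
    if merged[: len(bootstrap)] != bootstrap:
        raise ValueError("output does not begin with the bootstrap records")
    return merged[len(bootstrap):]
```

To run it against real files, pass in the lines of `data/sft.jsonl` and of the generated output file, for example via `Path(...).read_text(encoding="utf-8").splitlines()`. The returned list holds only the newly generated records.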