Aaron Brown

Add episode CLI, synthetic data pipeline, NPC generalization, service manifest

f016eb7 2 months ago

	# Training Data

	Seed and generated datasets for SFT warm-start live under `data/`.

	## Files

	| File | Purpose |
	|------|---------|
	| `sft.jsonl` | Seed SFT dataset in ChatML format, including assistant tool calls and tool responses. |
	| `tool_info.md` | Reusable tool catalog that can be injected into generated system prompts with `--tool-info`. |
	| `synthetic*.jsonl` | Synthetic datasets generated by `openrange synthetic-data` (gitignored). |

	## Seed SFT Format

	Each line in `sft.jsonl` is a single solved trajectory:

	```json
	{
	  "messages": [
	    {"role": "system", "content": "..."},
	    {"role": "user", "content": "..."},
	    {"role": "assistant", "content": "...", "tool_calls": [...]},
	    {"role": "tool", "tool_call_id": "...", "name": "shell_command", "content": "..."}
	  ],
	  "metadata": {"source": "bootstrap", "success": true},
	  "ground_truth_flag": "FLAG{...}",
	  "optimal_steps": 8
	}
	```
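
A quick structural check can catch malformed lines before they reach training. The sketch below is not part of OpenRange: `validate_record` and `iter_jsonl` are hypothetical helpers, and the rules they apply (allowed roles, `tool_call_id` on tool messages) are inferred only from the example record above, not from an authoritative schema.

```python
import json

# Roles seen in the example record above (an assumption, not a published schema).
ALLOWED_ROLES = {"system", "user", "assistant", "tool"}


def validate_record(record):
    """Return a list of human-readable problems; an empty list means none were found."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: expected an object")
            continue
        role = msg.get("role")
        if role not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {role!r}")
        # Tool responses in the example carry the id of the call they answer.
        if role == "tool" and "tool_call_id" not in msg:
            problems.append(f"message {i}: tool message has no 'tool_call_id'")
    return problems


def iter_jsonl(path):
    """Yield (line_number, parsed_record) for each non-blank line of a JSONL file."""
    with open(path, encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            if line.strip():
                yield line_no, json.loads(line)
```

For example, `for n, rec in iter_jsonl("data/sft.jsonl"): print(n, validate_record(rec))` would list any problems per line.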

	## Generating Synthetic Data

	Use the seed file as bootstrap context and merge newly generated OpenRange traces into a single output:

	```bash
	uv run --extra synthetic openrange synthetic-data \
	  --manifest manifests/tier1_basic.yaml \
	  --output data/synthetic_sft_5.jsonl \
	  --num-traces 5 \
	  --roles red \
	  --teacher-model azure/gpt-5.2-codex \
	  --bootstrap-traces data/sft.jsonl \
	  --tool-info data/tool_info.md
	```

	The output file keeps the imported bootstrap records intact and appends the generated OpenRange records after them.
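
To sanity-check that layout after a run, one option is to compare the start of the output file against the seed file. This is a minimal sketch with hypothetical helpers (`read_jsonl`, `split_output`). It assumes that "intact" means each bootstrap record parses to the same JSON value and appears in the same order. If the pipeline rewrites or annotates the imported records, the check would need to be loosened.

```python
import json


def read_jsonl(path):
    """Parse every non-blank line of a JSONL file into a list of records."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def split_output(bootstrap, output):
    """Return the records that follow the bootstrap prefix in `output`.

    Raises ValueError if `output` does not begin with exactly the
    bootstrap records (assumed layout, see the note above).
    """
    n = len(bootstrap)
    if output[:n] != bootstrap:
        raise ValueError("output does not start with the bootstrap records")
    return output[n:]
```

With the command above, `split_output(read_jsonl("data/sft.jsonl"), read_jsonl("data/synthetic_sft_5.jsonl"))` would return only the newly generated records.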