xiaozheyao/agent-warmup / README.md

xzyao

22 days ago


	# agent-warmup

	An SFT dataset of agentic / reasoning trajectories, assembled to **warm-start a
	general-purpose coding agent before RL**. Each row is one conversation in the
	OpenAI chat format (`messages`), so the data is model-agnostic and human-readable
	— you tokenize it for whatever model you train (see Tokenizing for training).

	- 27,679 trajectories
	- Format: line-delimited JSON (`agent_warmup.jsonl`), one trajectory per line
	- License / provenance: see Sources — this is a normalized re-mix of public
	trace datasets plus locally generated SWE-bench rollouts

	## Sources

	| `source` | rows | tool use | notes |
	|--------------------------|-------:|----------|-------|
	| `claude-reasoning` | 8,706 | none | reasoning traces, no tools |
	| `hermes-agent-reasoning` | 14,696 | yes | tool schemas embedded inline in the system prompt (`<tools>…</tools>`) |
	| `swe-bench` | 4,231 | yes | locally generated; tool schemas in the row's `tools` field |
	| `pi-traces` | 46 | yes | session logs; no separate tool schema |

	All rows are currently `verified: false` (no gold-test execution has confirmed
	the trajectories yet).
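
To check the per-source row counts against a local copy of the file, something like the following should work. It is a minimal sketch that relies only on the `source` field described above; the helper name `count_sources` is made up for illustration, and the file path in the usage comment is an assumption.

```python
import json
from collections import Counter


def count_sources(path):
    """Count trajectories per `source` in a line-delimited JSON file."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # tolerate blank lines
                counts[json.loads(line)["source"]] += 1
    return counts


# Usage (assumes the file sits in the working directory):
# print(count_sources("agent_warmup.jsonl"))
```

Reading line by line keeps memory flat, which matters because individual trajectories can be long.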

	## Row schema

	```jsonc
	{
	  "id": "swe-bench:django__django-12345",   // unique
	  "source": "swe-bench",                    // one of the four above
	  "messages": [                             // OpenAI chat format
	    {"role": "system", "content": "..."},
	    {"role": "user", "content": "..."},
	    {"role": "assistant", "content": "...", // may be null when only calling tools
	     "reasoning": "...",                    // chain-of-thought, SEPARATE field (optional)
	     "tool_calls": [
	       {"id": "call_1", "type": "function",
	        "function": {"name": "bash",
	                     "arguments": "{\"command\":\"ls\"}"}}]},
	    {"role": "tool", "tool_call_id": "call_1", "name": "bash", "content": "..."}
	  ],
	  "tools": [                                // OPTIONAL: only swe-bench rows (OpenAI tool schema)
	    {"type": "function", "function": {"name": "bash", "description": "...",
	                                      "parameters": {...}}}],
	  "verified": false,
	  "meta": { /* source-specific: repo, base_commit, model, category, ... */ }
	}
	```

	Field notes:
	- `messages[*].reasoning`: chain-of-thought kept out of `content`, in a separate
	field. Present on `claude-reasoning`, `hermes-agent-reasoning`, and `pi-traces`;
	absent on `swe-bench`. Train on it or drop it as you see fit.
	- `messages[*].content` may be `null` on an assistant turn that only emits
	`tool_calls`.
	- `tools` is present only on `swe-bench` rows. `hermes-agent-reasoning` describes
	its tools inside the system-prompt text instead; the other sources carry no
	tool schema.
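
The field notes translate into a few defensive access patterns when walking a row. The sketch below assumes only the schema shown above; `iter_tool_calls` and `assistant_text` are hypothetical helper names, not part of the dataset tooling.

```python
import json


def iter_tool_calls(row):
    """Yield (name, arguments_dict) for every assistant tool call in a row."""
    for msg in row["messages"]:
        if msg.get("role") != "assistant":
            continue
        for call in msg.get("tool_calls") or []:  # key may be missing or null
            fn = call["function"]
            # `arguments` is a JSON-encoded string, not a nested object.
            yield fn["name"], json.loads(fn["arguments"])


def assistant_text(msg):
    """Return (content, reasoning), mapping null or missing values to ''."""
    return msg.get("content") or "", msg.get("reasoning") or ""
```

Decoding `arguments` with `json.loads` is needed because the OpenAI chat format stores tool-call arguments as a string, as the schema example shows.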

	## Tokenizing for training

	The training signal is assistant tokens only. How you render the chat into
	tokens depends on your target model's chat template — below is the exact recipe
	we use for Apertus, which also produced the companion
	`agent_warmup.apertus.parquet`. A reference implementation is in
	[`tokenize_apertus.py`](./tokenize_apertus.py).

	### Why you can't tokenize message-by-message

	The naive approach (tokenize each message alone, concatenate, mask non-assistant
	messages) does not work for templates like Apertus:

	1. The template is stateful — it tracks whether it is inside an assistant
	turn. A standalone `tool` message raises *"Tool message outside of
	assistant"*.
	2. Tool outputs are rendered inside the assistant span with no delimiting
	special token, so the assistant-vs-tool loss boundary can't be recovered from
	the token stream after the fact.

	### The recipe

	For each conversation:

	1. Normalize — replace `null` assistant `content` with `""` (Apertus rejects
	non-string content); drop trailing non-assistant turns; skip conversations
	with no assistant turn.
	2. Render tools — if the row has a `tools` field (swe-bench), flatten the
	OpenAI-nested schema to the flat `{name, description, parameters}` shape the
	Apertus template's `render_tools` expects, and pass it as `tools=` to the
	template. Other sources: pass nothing (their tools are already in the prompt,
	or absent).
	3. Tokenize the whole conversation once:
	`full = tok.apply_chat_template(messages, tools=tools, tokenize=True, add_generation_prompt=False)`
	4. Recover per-message spans by cumulative-prefix LCP. For `k = 0..N-1`,
	tokenize the prefix `messages[:k+1]` the same way and take the
	longest-common-prefix length against `full`:

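	```python
	boundaries = []  # boundaries[k]: end (exclusive) of message k in `full`
	for k in range(len(messages)):
	    ids_k = tok.apply_chat_template(messages[:k + 1], tools=tools, tokenize=True,
	                                    add_generation_prompt=False)
	    # longest_common_prefix: helper returning the shared leading tokens
	    boundaries.append(len(longest_common_prefix(ids_k, full)))
	# message k owns full[boundaries[k-1] : boundaries[k]], starting from 0 for k = 0
	```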

	(The Apertus template defers the `end_assistant` token when a tool turn
	follows, so `ids_k` is not always a clean prefix of `full` — but the divergent
	trailing special token always sits beyond the LCP, so the boundary still
	lands exactly after the message's content.)
	5. Build the loss mask. Tokens owned by `assistant` messages get
	`loss_mask = 1`. This includes the assistant's own
	`<|tools_prefix|>…<|tools_suffix|>` tool-call emission, which you do want to
	train on. Everything else (system, user, tool outputs, and the rendered tool
	schemas in the developer block) gets `0`.
	6. Truncate at a message boundary `<= max_length` (we use 32768). Drop rows
	whose mask sums to 0.
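
Steps 1 and 2 of the recipe need no tokenizer, so they can be sketched in plain Python. This is an illustration under the assumptions stated in the recipe, not the code from `tokenize_apertus.py`; the function names `normalize_messages` and `flatten_tools` are invented here.

```python
def normalize_messages(messages):
    """Step 1: return cleaned messages, or None if no assistant turn remains."""
    msgs = [dict(m) for m in messages]  # shallow copies; the input is untouched
    for m in msgs:
        if m.get("role") == "assistant" and m.get("content") is None:
            m["content"] = ""  # template is described as rejecting non-string content
    while msgs and msgs[-1].get("role") != "assistant":
        msgs.pop()  # drop trailing non-assistant turns
    return msgs or None


def flatten_tools(tools):
    """Step 2: OpenAI-nested schema -> flat {name, description, parameters}."""
    if not tools:
        return None  # sources without a `tools` field pass nothing
    flat = []
    for t in tools:
        fn = t.get("function", t)  # assumption: tolerate already-flat entries
        flat.append({"name": fn["name"],
                     "description": fn.get("description", ""),
                     "parameters": fn.get("parameters", {})})
    return flat
```

Once trailing non-assistant turns are dropped, any non-empty result ends in an assistant turn, so the "no assistant turn" check reduces to testing for an empty list.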

	Output columns: `input_ids: list[int]`, `loss_mask: list[int]` (same length).
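
Steps 3 to 6 can be sketched independently of any particular tokenizer by passing the rendering step in as a function. In the sketch below, `render(msgs) -> list[int]` is a stand-in for the `apply_chat_template` call from step 3, and `lcp_len` and `build_example` are hypothetical names; the reference implementation may be organized differently.

```python
def lcp_len(a, b):
    """Length of the longest common prefix of two token-id lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def build_example(messages, render, max_length=32768):
    """Return {"input_ids", "loss_mask"} for one conversation, or None."""
    full = render(messages)  # step 3: one pass over the whole chat
    input_ids, loss_mask, prev = [], [], 0
    for k, msg in enumerate(messages):
        # Step 4: boundary_k is the LCP of the k-th cumulative prefix with `full`.
        end = max(prev, lcp_len(render(messages[:k + 1]), full))
        if end > max_length:  # step 6: stop at the last boundary that fits
            break
        keep = 1 if msg["role"] == "assistant" else 0  # step 5
        input_ids.extend(full[prev:end])
        loss_mask.extend([keep] * (end - prev))
        prev = end
    if sum(loss_mask) == 0:
        return None  # nothing to train on
    return {"input_ids": input_ids, "loss_mask": loss_mask}
```

With a Hugging Face tokenizer `tok`, `render` could be `lambda m: tok.apply_chat_template(m, tools=tools, tokenize=True, add_generation_prompt=False)`. Note that this renders the conversation once per message, so the cost grows roughly quadratically with the number of turns.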

	### Gotcha: transformers version

	Use transformers 4.x. transformers 5.x changed
	`apply_chat_template(tokenize=True)` to return a `BatchEncoding` instead of a
	`list[int]`, which breaks the prefix arithmetic above. If you must use 5.x,
	extract `out["input_ids"]` yourself before the LCP step.
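
If the same script has to run under both major versions, one option is a small adapter that accepts either return shape. This is a sketch, assuming a single unbatched conversation and that the dict-like result exposes `input_ids` as described above; `to_id_list` is a hypothetical helper name.

```python
def to_id_list(encoded):
    """Return a flat list of token ids from either return shape."""
    if hasattr(encoded, "keys"):  # dict-like result (e.g. a BatchEncoding)
        encoded = encoded["input_ids"]
    return list(encoded)
```

Wrapping every `apply_chat_template(..., tokenize=True)` result in this adapter before the LCP step keeps the prefix arithmetic on plain lists.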

	### Run the reference tokenizer

	```bash
	pip install "transformers<5" pandas pyarrow
	python tokenize_apertus.py \
	  --src agent_warmup.jsonl \
	  --model swiss-ai/Apertus-8B-Instruct-2509 \
	  --out agent_warmup.apertus.parquet \
	  --max-length 32768 --workers 16
	```

	### Loading the pre-tokenized parquet

	```python
	import pandas as pd, torch
	df = pd.read_parquet("agent_warmup.apertus.parquet")
	row = df.iloc[0]
	input_ids = torch.tensor(row["input_ids"])
	loss_mask = torch.tensor(row["loss_mask"]) # 1 = compute loss, 0 = ignore
	# labels = input_ids.clone(); labels[loss_mask == 0] = -100
	```
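
The commented-out last line hints at how the mask becomes training labels. A framework-free sketch of the same idea follows; `build_labels` is a hypothetical helper, and `-100` is used because it is the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
def build_labels(input_ids, loss_mask, ignore_index=-100):
    """Copy input_ids, replacing every masked-out position with ignore_index."""
    if len(input_ids) != len(loss_mask):
        raise ValueError("input_ids and loss_mask must have the same length")
    return [int(tok) if keep else ignore_index
            for tok, keep in zip(input_ids, loss_mask)]
```

The length check is cheap insurance, since the two columns are documented to have the same length for every row.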
