# agent-warmup

An SFT dataset of agentic / reasoning trajectories, assembled to **warm-start a
general-purpose coding agent before RL**. Each row is one conversation in the
OpenAI chat format (`messages`), so the data is model-agnostic and human-readable;
you tokenize it for whatever model you train (see **Tokenizing for training**).

- **27,679** trajectories
- Format: line-delimited JSON (`agent_warmup.jsonl`), one trajectory per line
- License / provenance: see **Sources**. This is a normalized re-mix of public
  trace datasets plus locally generated SWE-bench rollouts.
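Because the file is line-delimited JSON, it can be streamed with the standard library alone. The following is a minimal sketch, assuming `agent_warmup.jsonl` sits in the working directory; the helper names `iter_rows` and `count_by_source` are illustrative and not part of the dataset tooling.

```python
import json
from collections import Counter
from typing import Dict, Iterator


def iter_rows(path: str) -> Iterator[Dict]:
    """Yield one trajectory dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)


def count_by_source(path: str) -> Counter:
    """Tally trajectories by their `source` field."""
    return Counter(row["source"] for row in iter_rows(path))
```

Calling `count_by_source("agent_warmup.jsonl")` should give per-source totals that can be checked against the table under **Sources**.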
## Sources

| `source` | rows | tool use | notes |
|--------------------------|-------:|----------|-------|
| `claude-reasoning` | 8,706 | none | reasoning traces, no tools |
| `hermes-agent-reasoning` | 14,696 | yes | tool schemas embedded **inline** in the system prompt (`<tools>…</tools>`) |
| `swe-bench` | 4,231 | yes | locally generated; tool schemas in the row's `tools` field |
| `pi-traces` | 46 | yes | session logs; no separate tool schema |

All rows are currently `verified: false` (no gold-test execution has confirmed
the trajectories yet).
## Row schema

```jsonc
{
  "id": "swe-bench:django__django-12345",   // unique
  "source": "swe-bench",                    // one of the four above
  "messages": [                             // OpenAI chat format
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "...", // may be null when only calling tools
     "reasoning": "...",                    // chain-of-thought, SEPARATE field (optional)
     "tool_calls": [
       {"id": "call_1", "type": "function",
        "function": {"name": "bash",
                     "arguments": "{\"command\":\"ls\"}"}}]},
    {"role": "tool", "tool_call_id": "call_1", "name": "bash", "content": "..."}
  ],
  "tools": [                                // OPTIONAL: only swe-bench rows (OpenAI tool schema)
    {"type": "function", "function": {"name": "bash", "description": "...",
                                      "parameters": {...}}}],
  "verified": false,
  "meta": { /* source-specific: repo, base_commit, model, category, ... */ }
}
```
Field notes:

- `messages[*].reasoning`: chain-of-thought kept **out of** `content` (a separate
  field). Present on `claude-reasoning`, `hermes-agent-reasoning`, and
  `pi-traces`; absent on `swe-bench`. Train on it or drop it as you see fit.
- `messages[*].content` may be `null` on an assistant turn that only emits
  `tool_calls`.
- `tools` is present **only** on `swe-bench` rows. `hermes-agent-reasoning`
  describes its tools inside the system-prompt text instead; the remaining
  sources carry no tool schema.
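The field notes translate into a few small accessors. This is a sketch under the schema shown above, not code shipped with the dataset: `iter_tool_calls`, `drop_reasoning`, and `has_tool_schema` are hypothetical names, and the only structural fact relied on is that `function.arguments` is a JSON-encoded string, as in the schema example.

```python
import copy
import json
from typing import Dict, Iterator, Tuple


def iter_tool_calls(row: Dict) -> Iterator[Tuple[str, Dict]]:
    """Yield (tool_name, parsed_arguments) for every assistant tool call."""
    for msg in row["messages"]:
        if msg["role"] != "assistant":
            continue
        # `tool_calls` may be missing or null on plain-text turns.
        for call in msg.get("tool_calls") or []:
            fn = call["function"]
            # `arguments` is stored as a JSON string, so decode it.
            yield fn["name"], json.loads(fn["arguments"])


def drop_reasoning(row: Dict) -> Dict:
    """Return a deep copy of the row with `reasoning` removed from every message."""
    out = copy.deepcopy(row)
    for msg in out["messages"]:
        msg.pop("reasoning", None)
    return out


def has_tool_schema(row: Dict) -> bool:
    """True when the row carries a non-empty `tools` list (swe-bench rows)."""
    return bool(row.get("tools"))
```

`drop_reasoning` copies the row rather than editing it in place, so the same loaded data can feed both a with-reasoning and a without-reasoning run.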
## Tokenizing for training

The training signal is **assistant tokens only**. How you render the chat into
tokens depends on your target model's chat template. Below is the exact recipe
we use for **Apertus**, which also produced the companion
`agent_warmup.apertus.parquet`. A reference implementation is in
[`tokenize_apertus.py`](./tokenize_apertus.py).
### Why you can't tokenize message-by-message

The naive approach (tokenize each message alone, concatenate, mask non-assistant
messages) **does not work** for templates like Apertus:

1. The template is **stateful**: it tracks whether it is inside an assistant
   turn. A standalone `tool` message raises *"Tool message outside of
   assistant"*.
2. Tool *outputs* are rendered **inside** the assistant span with no delimiting
   special token, so the assistant-vs-tool loss boundary can't be recovered from
   the token stream after the fact.
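Both failure modes can be reproduced with a toy renderer. The function below is a deliberately simplified stand-in and **not** the Apertus template: `toy_render` and its `<A>…</A>` markers are invented for illustration, and it mimics only the two properties listed above.

```python
from typing import Dict, List


def toy_render(messages: List[Dict]) -> str:
    """Toy stateful chat template (illustrative only, not Apertus)."""
    out: List[str] = []
    in_assistant = False  # state carried across messages
    for m in messages:
        role = m["role"]
        if role == "assistant":
            if not in_assistant:
                out.append("<A>")
                in_assistant = True
            out.append(m["content"] or "")
        elif role == "tool":
            if not in_assistant:
                raise ValueError("Tool message outside of assistant")
            # Tool output lands inside the open assistant span as plain text,
            # with no special marker to separate it from assistant tokens.
            out.append("[" + m["content"] + "]")
        else:
            if in_assistant:
                out.append("</A>")
                in_assistant = False
            out.append(f"<{role}>{m['content']}</{role}>")
    if in_assistant:
        out.append("</A>")
    return "".join(out)
```

Rendering a whole conversation works, while `toy_render([tool_message])` raises, and the rendered assistant span contains the tool output with nothing but ordinary text around it.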
### The recipe

For each conversation:

1. **Normalize**: replace `null` assistant `content` with `""` (Apertus rejects
   non-string content); drop trailing non-assistant turns; skip conversations
   with no assistant turn.
2. **Render tools**: if the row has a `tools` field (swe-bench), flatten the
   OpenAI-nested schema to the flat `{name, description, parameters}` shape the
   Apertus template's `render_tools` expects, and pass it as `tools=` to the
   template. Other sources: pass nothing (their tools are already in the prompt,
   or absent).
3. **Tokenize the whole conversation once**:
   `full = tok.apply_chat_template(messages, tools=tools, tokenize=True, add_generation_prompt=False)`
4. **Recover per-message spans by cumulative-prefix LCP.** For `k = 0..N-1`,
   tokenize the prefix `messages[:k+1]` the same way and take the
   longest-common-prefix length against `full`:
   ```
   ids_k = tok.apply_chat_template(messages[:k+1], tools=tools, tokenize=True,
                                   add_generation_prompt=False)
   boundary_k = len(longest_common_prefix(ids_k, full))
   message k owns full[boundary_{k-1} : boundary_k]    (with boundary_{-1} = 0)
   ```
   (The Apertus template defers the `end_assistant` token when a tool turn
   follows, so `ids_k` is not always a clean prefix of `full`. The divergent
   trailing special token always sits *beyond* the LCP, though, so the boundary
   still lands exactly after the message's content.)
5. **Build the loss mask.** Tokens owned by `assistant` messages get
   `loss_mask = 1`. This **includes** the assistant's own
   `<|tools_prefix|>…<|tools_suffix|>` tool-call emission, which you *do* want to
   train on. Everything else (system, user, tool outputs, and the rendered tool
   schemas in the developer block) gets `0`.
6. **Truncate** at a message boundary `<= max_length` (we use 32768). Drop rows
   whose mask sums to 0.
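Steps 1 and 2 are pure data transformations and can be sketched without a tokenizer. The helpers below are an illustrative reading of the description above, not an excerpt from `tokenize_apertus.py`; the names `normalize_messages` and `flatten_tools`, and the empty defaults for missing `description` / `parameters`, are assumptions.

```python
import copy
from typing import Dict, List, Optional


def normalize_messages(messages: List[Dict]) -> Optional[List[Dict]]:
    """Step 1. Returns cleaned messages, or None when the row should be skipped."""
    msgs = copy.deepcopy(messages)
    for m in msgs:
        # Null assistant content becomes an empty string.
        if m["role"] == "assistant" and m.get("content") is None:
            m["content"] = ""
    # Drop trailing non-assistant turns.
    while msgs and msgs[-1]["role"] != "assistant":
        msgs.pop()
    # An empty list here means there was no assistant turn at all.
    return msgs or None


def flatten_tools(tools: Optional[List[Dict]]) -> Optional[List[Dict]]:
    """Step 2. OpenAI-nested schema -> flat {name, description, parameters}."""
    if not tools:
        return None  # sources without a `tools` field pass nothing
    flat = []
    for t in tools:
        fn = t.get("function", t)  # assumption: tolerate already-flat entries
        flat.append({
            "name": fn["name"],
            "description": fn.get("description", ""),  # assumed default
            "parameters": fn.get("parameters", {}),    # assumed default
        })
    return flat
```

Returning `None` from `normalize_messages` gives the caller a single check for the skip case, since a conversation with no assistant turn is emptied entirely by the trailing-turn loop.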

Output columns: `input_ids: list[int]`, `loss_mask: list[int]` (same length).
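Steps 3 to 6 can be sketched independently of any particular tokenizer. In the code below, `encode` is a hypothetical callable that maps a message list to token ids; with a real tokenizer it would wrap the `tok.apply_chat_template(...)` call from step 3. The function names are illustrative, and cutting at the *last* boundary that fits within `max_length` is an assumption about step 6; the reference script may differ in detail.

```python
from typing import Callable, Dict, List, Optional, Tuple

Encode = Callable[[List[Dict]], List[int]]


def lcp_len(a: List[int], b: List[int]) -> int:
    """Length of the longest common prefix of two token lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def message_boundaries(messages: List[Dict], encode: Encode) -> Tuple[List[int], List[int]]:
    """Steps 3-4: tokenize once, then find where each message ends in `full`."""
    full = encode(messages)
    bounds = [lcp_len(encode(messages[:k + 1]), full) for k in range(len(messages))]
    return full, bounds


def build_example(messages: List[Dict], encode: Encode,
                  max_length: int = 32768) -> Optional[Dict[str, List[int]]]:
    """Steps 5-6: assistant-only loss mask, truncated at a message boundary."""
    full, bounds = message_boundaries(messages, encode)
    mask = [0] * len(full)
    start = 0  # boundary_{-1} = 0
    for msg, end in zip(messages, bounds):
        if msg["role"] == "assistant":
            mask[start:end] = [1] * (end - start)
        start = end
    # Assumption: keep as many whole messages as fit within max_length.
    cut = max((b for b in bounds if b <= max_length), default=0)
    ids, mask = full[:cut], mask[:cut]
    if sum(mask) == 0:
        return None  # nothing left to train on
    return {"input_ids": ids, "loss_mask": mask}
```

With a Hugging Face tokenizer, `encode` could be `lambda m: tok.apply_chat_template(m, tools=tools, tokenize=True, add_generation_prompt=False)`. The cost is one extra tokenization per message, which is quadratic in conversation length but needs no knowledge of the template's internals.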

### Gotcha: transformers version

Use **transformers 4.x**. transformers **5.x** changed
`apply_chat_template(tokenize=True)` to return a `BatchEncoding` instead of a
`list[int]`, which breaks the prefix arithmetic above. If you must use 5.x,
extract `out["input_ids"]` yourself before the LCP step.
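One way to stay version-agnostic is to pass every template result through a small coercion helper before the LCP step. This is a sketch, not part of the reference script: `as_token_ids` is a hypothetical name, and the handling of array-like or single-row batched values is a defensive assumption rather than documented behavior.

```python
from typing import Any, List


def as_token_ids(out: Any) -> List[int]:
    """Return a flat list of ints from either a list of ids or a mapping with "input_ids"."""
    if isinstance(out, (list, tuple)):
        return list(out)          # already a plain list of ids
    ids = out["input_ids"]        # mapping-style result
    if hasattr(ids, "tolist"):    # assumption: may be an array or tensor
        ids = ids.tolist()
    # Assumption: unwrap a single-row batch such as [[1, 2, 3]].
    if ids and isinstance(ids[0], (list, tuple)):
        ids = ids[0]
    return list(ids)
```

Wrapping the call as `as_token_ids(tok.apply_chat_template(...))` leaves the prefix arithmetic unchanged under either return type.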

### Run the reference tokenizer

```bash
pip install "transformers<5" pandas pyarrow
python tokenize_apertus.py \
  --src agent_warmup.jsonl \
  --model swiss-ai/Apertus-8B-Instruct-2509 \
  --out agent_warmup.apertus.parquet \
  --max-length 32768 --workers 16
```

### Loading the pre-tokenized parquet

```python
import pandas as pd
import torch

df = pd.read_parquet("agent_warmup.apertus.parquet")
row = df.iloc[0]
input_ids = torch.tensor(row["input_ids"])
loss_mask = torch.tensor(row["loss_mask"])  # 1 = compute loss, 0 = ignore

# Causal-LM labels: positions excluded by the mask are set to -100 (ignored).
labels = input_ids.clone()
labels[loss_mask == 0] = -100
```