# TRACE_SOURCE_RECONNAISSANCE.md

Spike 007 trace-source audit, feeding ADR-002.

Status: **DECIDED** — recommend **(a) Claude Code session JSONL** (`~/.claude/projects/<encoded>/<sessionId>.jsonl`).

---

## 0. TL;DR

Of the six candidates audited, Claude Code session JSONL wins on every axis except "official Anthropic-published schema" (no such doc exists), and for that single weakness there is now a community-maintained reverse-engineered JSON Schema validated against ~50,000 messages from real sessions, plus three independent third-party schema specs. The user has **1,015 .jsonl sessions on this machine** today; the eight largest sampled span 550 → 17,315 lines and contain **6,762 multi-turn `tool_use` messages**. Acquisition cost is zero. Licensing is clean: the JSONL files are local user-owned data; the proprietary Claude Code binary is not redistributed by us.

The runners-up — OpenHands (well-documented but acquisition is non-trivial), SWE-bench trajectory submissions (heterogeneous schemas across submitters), Aider markdown chat history (lossy / unparseable for tool calls), and Cline (no public stable export format) — each lose on at least one of the four axes.

---

## 1. Context: TraceExample dataclass field reality

**Important correction to the parent task description.** The task brief said "TraceExample dataclass with fields state_text, action_taken, hint_text (optional), reward (float), teacher_id (str)". Reading the actual file at
`/mnt/e/CS/HF/composer-replication-framework/spikes/005-integrated-trainer-skeleton/teacher_replay.py` shows the existing types are different — there is no `TraceExample` class. The closest existing types are two `TypedDict`s used by `replay_trace()` and `extract_dpo_pairs()`:

```python
class TraceState(TypedDict):
    state_id: str           # unique within the trace
    messages: list[dict]    # conversation up to and including this step's user prompt
    student_action: str     # what the student actually did at this step

class DPOPair(TypedDict):
    state_id: str
    state_messages: list[dict]
    chosen: str       # teacher-consensus action
    rejected: str     # student action
    n_teachers_agreeing: int
```

The mapping sketch in §6 below targets `TraceState` (the *input* to teacher replay), since that is the type a `TraceIngester` is upstream of. If Spike 007 also wants a unified `TraceExample` per the brief, the natural shape is `TraceState` ∪ `{teacher_id: str | None, reward: float | None, hint_text: str | None}` — flagged for ADR-002 to settle.
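A minimal sketch of what that union could look like, assuming the three extra fields stay optional. `TraceExample` and `to_trace_example` are hypothetical names for illustration; neither exists in `teacher_replay.py`, and ADR-002 may settle on a different shape.

```python
from typing import Optional, TypedDict


class TraceState(TypedDict):
    state_id: str
    messages: list
    student_action: str


class TraceExample(TraceState):
    """Hypothetical unified record: TraceState plus optional annotations."""
    teacher_id: Optional[str]
    reward: Optional[float]
    hint_text: Optional[str]


def to_trace_example(
    state: TraceState,
    *,
    teacher_id: Optional[str] = None,
    reward: Optional[float] = None,
    hint_text: Optional[str] = None,
) -> TraceExample:
    """Lift an ingester-produced TraceState into the wider shape."""
    return TraceExample(
        state_id=state["state_id"],
        messages=state["messages"],
        student_action=state["student_action"],
        teacher_id=teacher_id,
        reward=reward,
        hint_text=hint_text,
    )
```

Keeping the extra fields `None` by default means ingester output can be lifted without inventing values the trace does not contain.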

---

## 2. Candidate audit summary

Scoring legend: `+` good, `~` mixed, `-` bad, on each of the four required axes.

| # | Candidate | Schema documented | Real ≥5 multi-turn traces | Hint-receptive signal density | License OK | Verdict |
|---|---|---|---|---|---|---|
| **a** | **Claude Code JSONL** (`~/.claude/projects/`) | `~` Anthropic publishes high-level format note; community schemas are detailed and validated | **+** 1,015 local sessions, 5+ trivially | **+** Per-step `assistant.message.content[].tool_use` blocks → discrete actions, ideal teacher-correction sites | **+** User-owned local files; framework MIT | **CHOSEN** |
| b | Cline VS Code extension | `-` No published stable export schema | `~` Requires running Cline + manual export | `~` Plausible if exported but unverified | `~` Cline source Apache-2.0 but trace format isn't a stable contract | reject |
| c | OpenHands trajectories | **+** Well-documented (events/, base_state.json, Pydantic Event models) | `-` Need to *run* OpenHands or download eval traces — not zero-cost | **+** ActionEvent/ObservationEvent split is conceptually ideal | **+** OpenHands MIT-licensed | strong runner-up |
| d | Aider chat history | `~` Format is "markdown, level-4 headings for user input" — fragile | `~` Available if Aider was used | `-` Tool calls are flattened into prose; recovering structured actions is lossy | `+` Aider Apache-2.0 | reject |
| e | SWE-bench / Lite leaderboard `trajs/` | `-` Each submitter chooses a free-form text format (md/json/yaml) | **+** ~hundreds of submissions on github.com/swe-bench/experiments | `~` Heterogeneous; structured ones (e.g. mini-swe-agent `.traj.json`) are good, others are essentially logs | **+** Public submissions with usage rights for research | reject as primary; usable as future cross-validation set |
| f | SWE-smith-trajectories on HF | **+** Standard OpenAI messages format, documented per dataset card | **+** 5,017 trajectories, 76,002 rows, public | **+** Single-attempt per-instance SWE-agent runs | **+** Apache-2.0 dataset license | strong runner-up; **complement, not replacement** |

The (f) row was discovered during audit (the parent task allowed "any other public source you find that is better"). It's a strong candidate but answers a *different* question: SWE-smith trajectories give us reproducible benchmark traces; Claude Code JSONL gives us *the user's actual workflow*. For Spike 007's purpose (verify the teacher-replay path works on a real, signal-dense trace at zero acquisition cost), (a) is the right primary; (f) is queued for a later cross-validation phase.

---

## 3. Chosen format spec — Claude Code session JSONL

### 3.1 Location and naming

- **Root**: `~/.claude/projects/` (overridable via `CLAUDE_CONFIG_DIR`).
  Source: <https://code.claude.com/docs/en/sessions> ("Transcripts are stored as JSONL at `~/.claude/projects/<encoded-cwd>/<sessionId>.jsonl`").
- **Project-key encoding**: working-directory absolute path with `/` and `\` and `:` replaced by `-`, with a leading `-`. (Hidden directories with a leading dot become double dashes.)
  Source: <https://github.com/jamie-bitflight/claude_skills/blob/main/plugins/agentskill-kaizen/skills/transcript-analysis/references/session-log-schema.md> §"Project key encoding".
- **File**: `<sessionId>.jsonl`. Subagent transcripts are `agent-<agentId>.jsonl`; a `SessionReader` should *skip* files starting with `agent-` when listing main sessions.
  Source: same `claude_skills` doc, §"Subagent File Location".
- **Encoding**: UTF-8, newline-delimited JSON. One JSON object per line. No `[`/`]` wrapping. Local cleanup default 30 days, configurable via `cleanupPeriodDays` in `~/.claude/settings.json`.
  Source: <https://code.claude.com/docs/en/data-usage> ("Local caching: Claude Code clients store session transcripts locally in plaintext under `~/.claude/projects/` for 30 days by default to enable session resumption.")
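
The location rules above can be sketched as follows. This is an illustration only: `encode_project_key` and `list_main_sessions` are hypothetical helper names, and the encoding is an assumption reconstructed from the community description plus directory names seen in this audit (for example `-mnt-e-CS-HF-eidolon`), not an Anthropic-published algorithm.

```python
import re
from pathlib import Path


def encode_project_key(cwd: str) -> str:
    """Encode an absolute working directory as a project-directory name.

    Assumption: a leading dot on a path component maps to '-', which
    together with the separator before it yields the double dash.
    """
    without_hidden_dots = re.sub(r"(?<=[/\\:])\.", "-", cwd)
    return re.sub(r"[/\\:]", "-", without_hidden_dots)


def list_main_sessions(root: Path) -> list:
    """Return main-session transcripts under root, skipping agent-*.jsonl."""
    return sorted(
        p for p in root.rglob("*.jsonl") if not p.name.startswith("agent-")
    )
```

On a POSIX path the leading `-` falls out of the leading `/`, so no special case is needed for it.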

### 3.2 Common record fields

Every record (both user and assistant types) carries:

| field | type | meaning |
|---|---|---|
| `parentUuid` | `string \| null` | UUID of the parent record (null on the first record) |
| `uuid` | `string` | This record's UUID |
| `sessionId` | `string` | UUID of the session (matches filename) |
| `timestamp` | `string` (ISO-8601) | Wall-clock time of the record |
| `cwd` | `string` | Absolute working directory |
| `version` | `string` | Claude Code version (e.g. `"2.1.143"`) |
| `gitBranch` | `string` | Empty string `""` when not in a git repo |
| `isSidechain` | `boolean` | True for sub-agent (Task tool) chains |
| `userType` | `string` | `"external"` or similar |
| `type` | `string` | Discriminator — see §3.3 |
| `entrypoint` | `string` | e.g. `"sdk-cli"` |

Sources for these fields:
- <https://github.com/KyleAMathews/claude-code-ui/blob/main/spec.md> §"Type Definitions" → `BaseMessageEntry`
- <https://github.com/jamie-bitflight/claude_skills/blob/main/plugins/agentskill-kaizen/skills/transcript-analysis/references/session-log-schema.md> §"Top-Level Record Fields"
- <https://github.com/moru-ai/agent-schemas/blob/main/claude-code/v2.1.1/session.schema.json> (machine-validated against ~50,000 messages from 480 real sessions)
- Direct inspection (this doc): `head` of `~/.claude/projects/-mnt-e-CS-HF-eidolon/c6967343-51a3-4b1b-9472-a569e96114b1.jsonl` confirms presence of every field above.

### 3.3 Record types (`type` discriminator)

| `type` | Role |
|---|---|
| `user` | Both human prompts AND tool results (distinguished by `message.content[].type`) |
| `assistant` | Model output: text, `thinking`, and `tool_use` blocks |
| `system` | Hook summaries, stop notices |
| `summary` | Context-compaction markers |
| `attachment` | Hook stdout/stderr, e.g. `SessionStart` hook output |
| `queue-operation` | Prompt enqueue/dequeue events |
| `file-history-snapshot` | File-state tracking for undo |
| `last-prompt` | Bookkeeping for resume |

Source: <https://github.com/KyleAMathews/claude-code-ui/blob/main/spec.md> §"Entry Types"; corroborated by direct `Counter` inspection of one local session showing `attachment, assistant, user, last-prompt, queue-operation` types in expected proportions.
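
A `Counter` inspection of this kind can be reproduced with a short sketch like the one below. The function name `count_record_types` and the `<unparseable>` / `<missing>` bucket labels are illustrative choices, not part of any schema.

```python
import json
from collections import Counter
from pathlib import Path


def count_record_types(path: Path) -> Counter:
    """Tally the `type` discriminator over one session JSONL file."""
    counts: Counter = Counter()
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # blank lines carry no record
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                counts["<unparseable>"] += 1
                continue
            if not isinstance(rec, dict):
                counts["<unparseable>"] += 1
                continue
            counts[rec.get("type", "<missing>")] += 1
    return counts
```

Calling `count_record_types(path).most_common()` on a session gives the per-type proportions in descending order.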

### 3.4 The two record types we care about

#### Assistant record carrying a tool call (the "student action")

Real example, redacted from `~/.claude/projects/-mnt-e-CS-github-VIGOR--overstory-worktrees-builder-doc-adapter-skeleton/39df59f0-674c-413a-b333-cdac0cea9db7.jsonl`:

```json
{
  "type": "assistant",
  "uuid": "24a16a51-3133-4ba5-9d23-472864286154",
  "parentUuid": "1b11c3b3-832b-4473-a944-b61a1f3f2594",
  "sessionId": "39df59f0-…",
  "timestamp": "2026-05-16T04:52:21.947Z",
  "message": {
    "role": "assistant",
    "model": "claude-opus-4-7",
    "content": [
      {
        "type": "tool_use",
        "id": "toolu_bdrk_012HC2dggmSgtVAtWWzwikZq",
        "name": "Bash",
        "input": {
          "command": "ov mail check --agent builder-doc-adapter-skeleton 2>&1 | head -200",
          "description": "Check builder agent inbox"
        }
      }
    ],
    "stop_reason": "tool_use",
    "usage": { "input_tokens": 6, "cache_creation_input_tokens": 48287, "output_tokens": 1021, ... }
  }
}
```

The student's *action* at this step = the JSON of `message.content[i]` where `content[i].type == "tool_use"` (or, if multiple tool_use blocks, the array of them; or if pure-text reply, the `content[i].text` of the `text` block).

#### User record carrying a tool result (the "observation")

```json
{
  "type": "user",
  "uuid": "b9f9414b-…",
  "parentUuid": "24a16a51-…",            // matches the assistant uuid above
  "sessionId": "39df59f0-…",
  "timestamp": "2026-05-16T04:52:23.229Z",
  "message": {
    "role": "user",
    "content": [
      {
        "tool_use_id": "toolu_bdrk_012HC2dggmSgtVAtWWzwikZq",
        "type": "tool_result",
        "content": "  No new messages",
        "is_error": false
      }
    ]
  },
  "toolUseResult": {                       // duplicate, structured form
    "stdout": "  No new messages",
    "stderr": "",
    "interrupted": false,
    "isImage": false,
    "noOutputExpected": false
  },
  "sourceToolAssistantUUID": "24a16a51-…"  // back-pointer to the assistant uuid
}
```

User records carrying actual human prompts have `message.content` as a list with `{"type":"text","text":"..."}` blocks (or, in older logs, `message.content` as a plain string).
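
To make the link between the two record types concrete, here is a hedged sketch that joins each `tool_use` block to its `tool_result` via `tool_use_id`, and normalizes the older plain-string `message.content` into the list form. `normalize_content` and `pair_tool_calls` are hypothetical helpers, and the sketch assumes all records of one session are already loaded into memory.

```python
def normalize_content(content):
    """Return content as a list of blocks; older logs store a plain string."""
    if isinstance(content, str):
        return [{"type": "text", "text": content}]
    return content if isinstance(content, list) else []


def _blocks(rec, record_type, block_type):
    """Yield blocks of block_type from rec if rec has the given record type."""
    msg = rec.get("message")
    if rec.get("type") != record_type or not isinstance(msg, dict):
        return
    for block in normalize_content(msg.get("content")):
        if isinstance(block, dict) and block.get("type") == block_type:
            yield block


def pair_tool_calls(records):
    """Return (tool_use, tool_result or None) pairs in call order."""
    results = {}
    for rec in records:
        for block in _blocks(rec, "user", "tool_result"):
            results[block.get("tool_use_id")] = block
    pairs = []
    for rec in records:
        for block in _blocks(rec, "assistant", "tool_use"):
            pairs.append((block, results.get(block.get("id"))))
    return pairs
```

A `None` in the second slot marks a tool call whose result never made it into the log, for example an interrupted session.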

### 3.5 Schema stability

- **Anthropic's official documentation** acknowledges the location and "each line is a JSON object for a message, tool use, or metadata entry" but does **not** publish a versioned schema.
- **Practical stability**: moru-ai/agent-schemas tracked v2.0.76 → v2.1.1; only one new field of note (`toolUseResult`). Schema pins `additionalProperties: true` for forward compatibility. This level of stability is sufficient for Spike 007 (a research spike, not a long-lived product API).
- **Mitigation**: pin to a specific Claude Code `version` field range and version-gate the ingester (e.g. accept `2.1.x`, warn on others).
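
A version gate along those lines might look like the sketch below. `TESTED_PREFIXES` and `check_version` are hypothetical names, and whether to warn or hard-fail is still an open question, so both behaviours are shown behind a `strict` flag.

```python
import warnings

# Assumption: the ingester has only been exercised against 2.1.x logs.
TESTED_PREFIXES = ("2.1.",)


def check_version(rec: dict, *, strict: bool = False) -> bool:
    """Return True if the record's `version` falls in the tested range.

    Outside the range: warn by default, raise ValueError when strict=True.
    """
    version = rec.get("version")
    ok = isinstance(version, str) and version.startswith(TESTED_PREFIXES)
    if not ok:
        msg = f"untested Claude Code version: {version!r}"
        if strict:
            raise ValueError(msg)
        warnings.warn(msg, stacklevel=2)
    return ok
```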

### 3.6 Licensing

- The Claude Code binary is **proprietary** (Anthropic Commercial Terms of Service, <https://github.com/anthropics/claude-code/blob/1e95326e12183286fc6cbd828c8a86a0d8e03c62/LICENSE.md>).
- The session JSONL files are **local user data** generated on the user's machine during ordinary use. Anthropic's data-usage doc explicitly calls them "local caching … session transcripts locally in plaintext" — they belong to the user.
- Our framework is MIT-licensed and we are **not redistributing the Claude Code binary or any third-party trace files**. We are reading the user's own local logs (analogous to processing one's own `.bash_history`).
- We MUST NOT publish raw trace files in our repo without the user's consent (PII risk: cwd, gitBranch, file contents). The framework should ship only the *ingester*, plus a tiny synthetic-fixture trace for unit tests.

---

## 4. Acquiring the 5 real example traces

**Zero acquisition cost.** All five live on this machine right now.

Discovery command (used during this audit):

```bash
find ~/.claude/projects -name "*.jsonl" 2>/dev/null | wc -l
# → 1015 files
```

Five concrete pre-selected sessions, each multi-turn (≥ 100 tool_use messages), each from a distinct project, each ≥ 50 KB:

| # | Tool-use msgs | User msgs | Asst msgs | Total lines | Path |
|---|---|---|---|---|---|
| 1 | 2,830 | 3,199 | 4,325 | 17,315 | `/home/codeseys/.claude/projects/-mnt-e-CS-HF-eidolon/c6967343-51a3-4b1b-9472-a569e96114b1.jsonl` |
| 2 | 1,350 | 1,407 | 2,016 | 7,673 | `/home/codeseys/.claude/projects/-mnt-e-CS-github-agent-manager/c42b68ea-d410-455e-bc71-92ec6c4adce9.jsonl` |
| 3 | 984 | 1,032 | 1,549 | 5,783 | `/home/codeseys/.claude/projects/-mnt-e-CS-HF-streaming-speech-to-speech/73c9925c-d5e5-48fc-a97b-a58687c2fb3c.jsonl` |
| 4 | 717 | 759 | 1,142 | 4,036 | `/home/codeseys/.claude/projects/-mnt-e-CS-github/6ac8e20f-98ec-4279-9957-e68862a90c5e.jsonl` |
| 5 | 125 | 126 | 197 | 629 | `/home/codeseys/.claude/projects/-mnt-e-CS-github-VIGOR--overstory-worktrees-builder-iteration-checkpoint/e4a34e2b-40c6-49ce-b253-912a43224aae.jsonl` |

(All five inspected programmatically during this audit — counts above are real, not estimates.)

For users on other machines: `find ~/.claude/projects -name '*.jsonl' -size +50k | head` will surface candidates. For repository CI we will commit a small (~5 KB) **synthetic** fixture conforming to the schema, never any of the user's real traces.
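
The size and tool-use criteria used for the table can be applied programmatically with a sketch like this one. It is an approximation under stated assumptions: a "tool-use message" is taken to be an assistant record with at least one `tool_use` block (the audit's exact counting rule is not recorded here), the helper names are hypothetical, and the "distinct project" criterion is left to the caller.

```python
import json
from pathlib import Path


def count_tool_use_messages(path: Path) -> int:
    """Count assistant records containing at least one tool_use block."""
    n = 0
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                continue  # blank or truncated line
            if not isinstance(rec, dict) or rec.get("type") != "assistant":
                continue
            msg = rec.get("message")
            content = msg.get("content") if isinstance(msg, dict) else None
            if isinstance(content, list) and any(
                isinstance(b, dict) and b.get("type") == "tool_use"
                for b in content
            ):
                n += 1
    return n


def select_sessions(root: Path, *, min_tool_use: int = 100,
                    min_bytes: int = 50 * 1024) -> list:
    """Return main-session files meeting both thresholds, sorted by path."""
    picked = []
    for p in sorted(root.rglob("*.jsonl")):
        if p.name.startswith("agent-") or p.stat().st_size < min_bytes:
            continue
        if count_tool_use_messages(p) >= min_tool_use:
            picked.append(p)
    return picked
```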

---

## 5. Decision-relevant tradeoffs vs runners-up

### Why we are NOT picking OpenHands trajectories (c)
- **Pro**: cleanest schema we audited — Pydantic `Event` / `ActionEvent` / `ObservationEvent` models, source: <https://docs.openhands.dev/sdk/arch/events>, source code: <https://github.com/OpenHands/OpenHands/blob/3ec999e8/openhands/events/serialization/event.py>. Tool-call structure is *more* normalized than Claude Code's (explicit Action/Observation typing).
- **Con**: acquisition is not zero-cost here. The persistence directory defaults to `workspace/conversations/` and exists only if the user has *run OpenHands locally*. Public eval trajectories are scattered across the eval/ folder rather than published in one clean public bucket.
- **Decisive**: Spike 001's economic floor was measured on 50 synthetic states, so Spike 007's purpose is to verify ingestion + replay on real traces *that already exist*. (a) gives that today; (c) requires standing up OpenHands first. Its storage format is also split between v0 (per-event JSON files) and v1 (timestamped files) per <https://github.com/All-Hands-AI/OpenHands/issues/8701>, which is a flux risk.
- **Future use**: if the framework ever ships "trace ingester adapters" plural, OpenHands is the second adapter to write — its event-typed model is conceptually superior.

### Why we are NOT picking SWE-bench leaderboard trajectories (e)
- **Pro**: hundreds of submissions on <https://github.com/swe-bench/experiments>, with required `trajs/` folders.
- **Con**: leaderboard rules say "The reasoning trace can be represented with **any text based file format (e.g. md, json, yaml)**" (source: <https://github.com/swe-bench/experiments> README). Each submitter picks their own. Building a generic ingester is a per-submission engineering project, not a single adapter. SWE-agent uses one shape (`{"action", "observation", "response"}` arrays — confirmed via <https://huggingface.co/datasets/JetBrains-Research/swe-traj-complete>); mini-swe-agent uses `.traj.json` with OpenAI messages format (<https://huggingface.co/datasets/tarsur385/swebench-verified-trajectories>).
- **Decisive**: heterogeneous schema = fragile ingester = wrong choice for *first* spike.

### Why we are NOT picking Aider (d)
- The `chat_history_file` is **markdown** (`.aider.chat.history.md`), per <https://aider.chat/docs/config/dotenv.html>. Source code at <https://github.com/Aider-AI/aider/blob/bdb4d9ff/aider/history.py> shows it's literally `f.write(text)` of formatted prose with `####` for user input.
- **Decisive**: tool calls in Aider are *applied as edits*, not preserved as discrete structured actions in the markdown log. Reconstructing "the action the student took at step k" is lossy. The `.aider.llm.history` log is closer to what we want but is opt-in and not always present.

### Why we are NOT picking Cline (b)
- No public commitment to a stable export schema. Cline's storage is internal to the VS Code extension (workspace state DB + per-task JSON in extension storage). Searching for "Cline trace export schema" yields no Anthropic-style spec doc. Workable in principle, but reverse-engineering an extension's storage is not the right ground for a 1-week spike.

### Why we are NOT picking SWE-smith-trajectories (f)
- This is the **strongest external dataset** we found and **should be Spike 007's stretch goal / Spike 008's primary**: 5,017 fine-tuning trajectories from SWE-agent + Claude 3.7 Sonnet, 4.22 GB on HuggingFace, OpenAI messages format. Source: <https://huggingface.co/datasets/SWE-bench/SWE-smith-trajectories>.
- **Why not first**: the messages-only format collapses tool calls and tool results into the OpenAI chat-completions wire format with text-encoded tool blocks. That works for SFT but is *less* signal-dense for the teacher-correction spike than Claude Code's `tool_use` blocks. In Claude Code's format each tool call's `name` and `input` fields are structurally separated, making "did the teacher pick a different tool?" a one-line check.
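
As an illustration of that one-line check, the sketch below compares tool names between a student turn and a teacher turn. It assumes the teacher's reply has already been parsed into the same content-block shape as Claude Code's `message.content`; both function names are hypothetical.

```python
def tool_names(content_blocks):
    """Return the names of all tool_use blocks, in order."""
    return [
        b.get("name")
        for b in content_blocks
        if isinstance(b, dict) and b.get("type") == "tool_use"
    ]


def teacher_picked_different_tool(student_blocks, teacher_blocks) -> bool:
    """True when the ordered lists of tool names differ."""
    return tool_names(student_blocks) != tool_names(teacher_blocks)
```

With text-encoded tool blocks the same comparison would first require parsing the tool call back out of prose, which is where the signal loss comes from.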

---

## 6. TraceIngester sketch

> **Realised in v0.1 (Wave 17 update):** The realised ingester ships at
> `composer_replication/ingestion/claude_code.py` exporting
> `ClaudeCodeIngester`, with the spike at
> `spikes/007-real-trace-ingestion/claude_code_ingester.py`. The
> public production surface is:
>
> ```python
> from pathlib import Path
> from composer_replication.ingestion.claude_code import ClaudeCodeIngester
>
> ingester = ClaudeCodeIngester(skip_sidechain=True, strip_thinking=True)
> for trace_state in ingester.ingest(Path("~/.claude/projects/.../session.jsonl").expanduser()):
>     # trace_state matches the TraceState TypedDict from §1
>     ...
> stats = ingester.last_stats  # IngestionStats — turn counts, skip reasons
> ```
>
> The shipped `ClaudeCodeIngester` differs from the pre-spike sketch
> below in:
> - Class name: `ClaudeCodeIngester` (not `TraceIngester`)
> - Module path: `composer_replication.ingestion.claude_code` (not
>   `spikes/007-trace-ingester/trace_ingester.py`)
> - The constructor takes config kwargs (`system_prompt`,
>   `skip_sidechain`, `strip_thinking`, `max_history_tokens`); paths
>   are passed to `.ingest(Path)` per call instead of being held by the
>   ingester
> - The yielded type is `TraceState` (matches §1)
>
> The pre-spike sketch below is preserved as historical proposal context.

Drop-in adapter for spike-005's `replay_trace()`. Targets `TraceState` (the actual existing TypedDict; see §1).

```python
# spikes/007-trace-ingester/trace_ingester.py
from __future__ import annotations
import json
from collections.abc import Iterator
from pathlib import Path
from typing import Any

# Re-use the existing TypedDicts from spike-005:
#   from spikes.005_integrated_trainer_skeleton.teacher_replay import TraceState

# A "step" in the trace is each assistant record with a non-empty action. The
# state visible to the model at that step = all messages strictly before it,
# in OpenAI/Anthropic chat format. The student_action = the tool_use payload(s),
# or the concatenated text for a text-only reply.

def _record_to_chat_message(rec: dict) -> dict | None:
    """Turn one Claude Code JSONL record into an OpenAI/Anthropic chat-message
    dict, or return None for non-conversational records (queue-operation,
    attachment, file-history-snapshot, system, last-prompt, summary)."""
    t = rec.get("type")
    if t not in ("user", "assistant"):
        return None
    msg = rec.get("message")
    if not isinstance(msg, dict):
        return None
    role = msg.get("role")
    content = msg.get("content")
    if role not in ("user", "assistant") or content is None:
        return None
    # Strip thinking blocks — they are not portable across teacher models and
    # should not influence the teacher's decision at replay time.
    if isinstance(content, list):
        content = [c for c in content
                   if not (isinstance(c, dict) and c.get("type") == "thinking")]
    return {"role": role, "content": content}


def _serialize_action(content_blocks: list[dict]) -> str:
    """Canonicalize the student's action at a step.

    For tool_use steps: JSON-encode the (name, input) pairs.
    For text-only steps: return the concatenated text.
    """
    tool_uses = [b for b in content_blocks if isinstance(b, dict) and b.get("type") == "tool_use"]
    if tool_uses:
        return json.dumps(
            [{"name": tu.get("name"), "input": tu.get("input")} for tu in tool_uses],
            sort_keys=True,
        )
    texts = [b.get("text", "") for b in content_blocks if isinstance(b, dict) and b.get("type") == "text"]
    return "\n".join(t for t in texts if t)


class TraceIngester:
    """Reads a Claude Code session JSONL and yields TraceState records.

    One TraceState is emitted per assistant record with a non-empty action.
    The `messages` field is the prior conversation (user and assistant
    messages only; no system prompt is injected) up to but not including the
    current assistant turn; `student_action` is the canonicalized
    serialization of that turn's content blocks.
    """

    def __init__(self, *, skip_thinking: bool = True, min_action_chars: int = 1) -> None:
        self.skip_thinking = skip_thinking
        self.min_action_chars = min_action_chars

    def ingest(self, path: str | Path) -> Iterator[dict]:  # yields TraceState
        path = Path(path)
        prior_messages: list[dict] = []
        session_id_for_state = path.stem  # filename = session UUID

        with path.open("r", encoding="utf-8") as f:
            for line_idx, line in enumerate(f):
                line = line.strip()
                if not line:
                    continue
                try:
                    rec = json.loads(line)
                except json.JSONDecodeError:
                    continue  # tolerate truncated last-line writes

                chat_msg = _record_to_chat_message(rec)
                if chat_msg is None:
                    continue

                if chat_msg["role"] == "assistant":
                    # Emit a TraceState representing "before this turn".
                    blocks = chat_msg["content"] if isinstance(chat_msg["content"], list) else []
                    student_action = _serialize_action(blocks)
                    if len(student_action) >= self.min_action_chars:
                        yield {
                            "state_id": f"{session_id_for_state}:{rec.get('uuid', line_idx)}",
                            "messages": list(prior_messages),    # snapshot
                            "student_action": student_action,
                        }
                # Append to history so subsequent turns see it, but skip records
                # left with empty content once thinking blocks are stripped.
                if chat_msg["content"]:
                    prior_messages.append(chat_msg)
```

Notes:
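- We skip `thinking` blocks because (1) they're Anthropic-specific and (2) feeding them to other-vendor teachers (GPT/DeepSeek) leaks reasoning the teacher should produce on its own. This matches the philosophy used in spike-005's `_normalize_action`. In this sketch the stripping is unconditional: the `skip_thinking` constructor flag is stored but not yet consulted by `_record_to_chat_message`.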
- We do NOT inject a system prompt — Claude Code's initial system prompt is not in the JSONL (it's set at SDK init and visible only via `attachment` records). Downstream callers may want to prepend a synthetic system message for teacher fairness. Open question for ADR-002.
- `state_id = f"{sessionId}:{recordUuid}"` is globally unique and stable across re-ingest.
- Failures (unparseable lines, missing fields) are tolerated silently. A counters-based sibling method `ingest_with_stats(path)` is a small follow-up.

### 6.1 Smoke-test plan (for Spike 007 itself)

```python
ingester = TraceIngester()
states = list(ingester.ingest("/home/codeseys/.claude/projects/-mnt-e-CS-github-VIGOR--overstory-worktrees-builder-iteration-checkpoint/e4a34e2b-40c6-49ce-b253-912a43224aae.jsonl"))
# Expect at most 197 states (the assistant-message count from §4); assistant
# records whose only content is a thinking block emit no state.
# Then teacher-replay on the first 5 states, confirm cost is in the
# spike-001 ballpark ($0.05–$0.20 for 5 states × 3 teachers).
```

Spike 001 baseline to beat: $0.98/trace mean (50-state synthetic), $0.30/trace projected with VOI gating. On real states a ~5–20× cost increase is plausible due to longer message histories (10k+ tokens vs synthetic ~300 tokens), so a relevant **economic check** for Spike 007 is: if the first 5 states cost > $5 (i.e. > $1/state), the VOI gate from Spike 001 is *required* before scaling. Flag this finding in the spike write-up.
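
The arithmetic behind that check can be written out as a small sketch. It assumes the Spike 001 per-trace figure scales linearly with the number of states, which is an approximation; the constant and function names are illustrative. Under that assumption five synthetic-sized states cost about $0.10, and a 5–20× increase puts five real states at roughly $0.49–$1.96, below the $5 trigger.

```python
SPIKE_001_COST_PER_TRACE_USD = 0.98   # ungated mean from Spike 001
SYNTHETIC_STATES_PER_TRACE = 50       # states per synthetic trace
GATE_THRESHOLD_PER_STATE_USD = 1.00   # i.e. more than $5 for 5 states


def projected_cost(n_states: int, multiplier: float = 1.0) -> float:
    """Linear projection from the Spike 001 per-state cost."""
    per_state = SPIKE_001_COST_PER_TRACE_USD / SYNTHETIC_STATES_PER_TRACE
    return per_state * n_states * multiplier


def voi_gate_required(observed_cost_usd: float, n_states: int) -> bool:
    """True when the observed per-state cost exceeds the threshold."""
    return observed_cost_usd / n_states > GATE_THRESHOLD_PER_STATE_USD
```

If the measured cost for the first five states lands well above the projected range, that points at history length rather than teacher count as the driver.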

---

## 7. Open questions for ADR-002

1. Do we promote `TraceState` to a top-level `TraceExample` dataclass, with optional `teacher_id`, `reward`, `hint_text`? Or keep `TraceState` as ingester output and `DPOPair` as trainer input, treating the brief's "TraceExample" as conceptual?
2. Should `TraceIngester.ingest()` emit one record per **assistant turn** (current sketch) or per **assistant `tool_use` block** within a turn? Some Claude Code records have multiple tool_use blocks in one assistant message.
3. Synthetic system prompt at replay time — yes/no? If yes, what content?
4. Trace-version pinning: hard-fail or warn when `version` field falls outside a known-tested range?
5. Subagent transcripts (`agent-*.jsonl`) — include or skip? They are denser per-turn but their parent context is the orchestrator, not the user, which changes the teacher-replay semantics.
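
For question 2, the two granularities can be compared side by side with a sketch like the following. Both functions are hypothetical illustrations of the alternatives, not a recommendation; they reuse the `(name, input)` canonicalization from the §6 sketch.

```python
import json


def _tool_uses(content_blocks):
    return [
        b for b in content_blocks
        if isinstance(b, dict) and b.get("type") == "tool_use"
    ]


def actions_per_turn(content_blocks):
    """One action per assistant turn: all tool_use blocks serialized together."""
    tus = _tool_uses(content_blocks)
    if not tus:
        return []
    payload = [{"name": tu.get("name"), "input": tu.get("input")} for tu in tus]
    return [json.dumps(payload, sort_keys=True)]


def actions_per_block(content_blocks):
    """One action per tool_use block within the turn."""
    return [
        json.dumps({"name": tu.get("name"), "input": tu.get("input")},
                   sort_keys=True)
        for tu in _tool_uses(content_blocks)
    ]
```

Per-block emission yields more states from the same trace, but every state within a turn shares the same prior `messages`, so the teacher would be asked several times about an identical context.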

---

## 8. References (primary sources only)

Anthropic / Claude Code official:
- <https://code.claude.com/docs/en/sessions> — session storage location and "JSONL, one JSON per line"
- <https://code.claude.com/docs/en/data-usage> — "local caching … session transcripts locally in plaintext under `~/.claude/projects/` for 30 days by default"
- <https://code.claude.com/docs/en/legal-and-compliance> — Commercial Terms vs Consumer Terms applicability
- <https://github.com/anthropics/claude-code/blob/1e95326e12183286fc6cbd828c8a86a0d8e03c62/LICENSE.md> — proprietary license

Community schemas (reverse-engineered from real session data):
- <https://github.com/moru-ai/agent-schemas/blob/main/claude-code/v2.1.1/session.schema.json> — JSON Schema Draft 2020-12, validated against ~50,000 messages from 480 sessions
- <https://github.com/KyleAMathews/claude-code-ui/blob/main/spec.md> §"Claude Code Session Log Format" — Entry types and TypeScript discriminated union
- <https://github.com/jamie-bitflight/claude_skills/blob/main/plugins/agentskill-kaizen/skills/transcript-analysis/references/session-log-schema.md> — top-level fields, project-key encoding, subagent file location
- <https://github.com/dagster-io/erk/blob/master/docs/learned/sessions/layout.md> — directory structure, plan-mode `slug` field
- <https://github.com/pedropaulovc/claude-code-types> — TypeScript type definitions from session logs

Runners-up reference points:
- OpenHands events: <https://docs.openhands.dev/sdk/arch/events>, <https://docs.openhands.dev/sdk/guides/convo-persistence>, <https://github.com/OpenHands/OpenHands/blob/3ec999e8/openhands/events/serialization/event.py>, <https://github.com/All-Hands-AI/OpenHands/issues/8701>
- SWE-bench experiments: <https://github.com/swe-bench/experiments>
- SWE-smith trajectories on HF: <https://huggingface.co/datasets/SWE-bench/SWE-smith-trajectories>
- mini-swe-agent traj.json: <https://huggingface.co/datasets/tarsur385/swebench-verified-trajectories>
- swe-traj-complete (SWE-agent format example): <https://huggingface.co/datasets/JetBrains-Research/swe-traj-complete>
- Aider history file format: <https://aider.chat/docs/config/dotenv.html>, <https://github.com/Aider-AI/aider/blob/bdb4d9ff/aider/history.py>, <https://github.com/paul-gauthier/aider/blob/main/aider/io.py>

Internal references:
- `spikes/005-integrated-trainer-skeleton/teacher_replay.py` — `TraceState`, `DPOPair`, `replay_trace`, `extract_dpo_pairs` (read in full during this audit; see §1 for actual field list)
- Spike 001 economic floor: $0.98/trace mean ungated, $0.30/trace projected with VOI gating