# TRACE_SOURCE_RECONNAISSANCE.md

Spike 007 trace-source audit, feeding ADR-002.

Status: **DECIDED** — recommend **(a) Claude Code session JSONL** (`~/.claude/projects/<encoded>/<sessionId>.jsonl`).

---

## 0. TL;DR

Of the six candidates audited, Claude Code session JSONL wins on every axis except "official Anthropic-published schema" (no such doc exists), and for that single weakness there is now a community-maintained reverse-engineered JSON Schema validated against ~50,000 messages from real sessions, plus three independent third-party schema specs. The user has **1,015 .jsonl sessions on this machine** today; the eight largest sampled span 550 → 17,315 lines and contain **6,762 multi-turn `tool_use` messages**. Acquisition cost is zero. Licensing is clean: the JSONL files are local user-owned data; the proprietary Claude Code binary is not redistributed by us.

The runners-up each lose on at least one of the four axes: OpenHands (well-documented, but acquisition is non-trivial), SWE-bench trajectory submissions (heterogeneous schemas across submitters), Aider markdown chat history (lossy / unparseable for tool calls), and Cline (no public stable export format). SWE-smith-trajectories, found during the audit, is held back as a complement for later cross-validation rather than a replacement (see §2).

---

## 1. Context: TraceExample dataclass field reality

**Important correction to the parent task description.** The task brief said "TraceExample dataclass with fields state_text, action_taken, hint_text (optional), reward (float), teacher_id (str)". Reading the actual file at
`/mnt/e/CS/HF/composer-replication-framework/spikes/005-integrated-trainer-skeleton/teacher_replay.py` shows the existing types are different — there is no `TraceExample` class. The closest existing types are two `TypedDict`s used by `replay_trace()` and `extract_dpo_pairs()`:

```python
class TraceState(TypedDict):
    state_id: str           # unique within the trace
    messages: list[dict]    # conversation up to and including this step's user prompt
    student_action: str     # what the student actually did at this step

class DPOPair(TypedDict):
    state_id: str
    state_messages: list[dict]
    chosen: str       # teacher-consensus action
    rejected: str     # student action
    n_teachers_agreeing: int
```

The mapping sketch in §6 below targets `TraceState` (the *input* to teacher replay), since that is the type a `TraceIngester` is upstream of. If Spike 007 also wants a unified `TraceExample` per the brief, the natural shape is `TraceState` ∪ `{teacher_id: str | None, reward: float | None, hint_text: str | None}` — flagged for ADR-002 to settle.
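
To make the proposed union concrete for ADR-002, here is a minimal sketch of what such a type could look like. `TraceExample` and `to_trace_example` are hypothetical names that do not exist in spike-005; the three extra fields simply follow the brief, and `TraceState` is re-declared only so the snippet is self-contained.

```python
from typing import Optional, TypedDict


class TraceState(TypedDict):
    # Mirrors the existing spike-005 TypedDict shown above.
    state_id: str
    messages: list[dict]
    student_action: str


class TraceExample(TraceState):
    # Hypothetical superset: each extra field stays None until a later
    # stage (teacher replay, reward scoring, hinting) fills it in.
    teacher_id: Optional[str]
    reward: Optional[float]
    hint_text: Optional[str]


def to_trace_example(state: TraceState) -> TraceExample:
    """Lift an ingester-produced TraceState into the wider shape."""
    return TraceExample(**state, teacher_id=None, reward=None, hint_text=None)
```

Keeping the extra fields as explicit `None` values (rather than omitting the keys) means downstream code can index them without `KeyError`, at the cost of a slightly noisier ingester output.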

---

## 2. Candidate audit summary

Scoring legend: `+` good, `~` mixed, `-` bad, on each of the four required axes.

| # | Candidate | Schema documented | Real ≥5 multi-turn traces | Hint-receptive signal density | License OK | Verdict |
|---|---|---|---|---|---|---|
| **a** | **Claude Code JSONL** (`~/.claude/projects/`) | `~` Anthropic publishes high-level format note; community schemas are detailed and validated | **+** 1,015 local sessions, 5+ trivially | **+** Per-step `assistant.message.content[].tool_use` blocks → discrete actions, ideal teacher-correction sites | **+** User-owned local files; framework MIT | **CHOSEN** |
| b | Cline VS Code extension | `-` No published stable export schema | `~` Requires running Cline + manual export | `~` Plausible if exported but unverified | `~` Cline source Apache-2.0 but trace format isn't a stable contract | reject |
| c | OpenHands trajectories | **+** Well-documented (events/, base_state.json, Pydantic Event models) | `-` Need to *run* OpenHands or download eval traces — not zero-cost | **+** ActionEvent/ObservationEvent split is conceptually ideal | **+** OpenHands MIT-licensed | strong runner-up |
| d | Aider chat history | `~` Format is "markdown, level-4 headings for user input", which is fragile | `~` Available if Aider was used | `-` Tool calls are flattened into prose; recovering structured actions is lossy | **+** Aider Apache-2.0 | reject |
| e | SWE-bench / Lite leaderboard `trajs/` | `-` Each submitter chooses a free-form text format (md/json/yaml) | **+** ~hundreds of submissions on github.com/swe-bench/experiments | `~` Heterogeneous; structured ones (e.g. mini-swe-agent `.traj.json`) are good, others are essentially logs | **+** Public submissions with usage rights for research | reject as primary; usable as future cross-validation set |
| f | SWE-smith-trajectories on HF | **+** Standard OpenAI messages format, documented per dataset card | **+** 5,017 trajectories, 76,002 rows, public | **+** Single-attempt per-instance SWE-agent runs | **+** Apache-2.0 dataset license | strong runner-up; **complement, not replacement** |

Row (f) was discovered during the audit (the parent task allowed "any other public source you find that is better"). It's a strong candidate but answers a *different* question: SWE-smith trajectories give us reproducible benchmark traces; Claude Code JSONL gives us *the user's actual workflow*. For Spike 007's purpose (verify the teacher-replay path works on a real, signal-dense trace at zero acquisition cost), (a) is the right primary; (f) is queued for a later cross-validation phase.

---

## 3. Chosen format spec — Claude Code session JSONL

### 3.1 Location and naming

- **Root**: `~/.claude/projects/` (overridable via `CLAUDE_CONFIG_DIR`).
  Source: <https://code.claude.com/docs/en/sessions> ("Transcripts are stored as JSONL at `~/.claude/projects/<encoded-cwd>/<sessionId>.jsonl`").
- **Project-key encoding**: the absolute working-directory path with each `/`, `\`, and `:` replaced by `-`, so an absolute POSIX path yields a key with a leading `-`. (A hidden directory's leading dot is also replaced, so `/.name` becomes `--name`.)
  Source: <https://github.com/jamie-bitflight/claude_skills/blob/main/plugins/agentskill-kaizen/skills/transcript-analysis/references/session-log-schema.md> §"Project key encoding".
- **File**: `<sessionId>.jsonl`. Subagent transcripts are `agent-<agentId>.jsonl`; a `SessionReader` should *skip* files starting with `agent-` when listing main sessions.
  Source: same `claude_skills` doc, §"Subagent File Location".
- **Encoding**: UTF-8, newline-delimited JSON. One JSON object per line. No `[`/`]` wrapping. Local cleanup default 30 days, configurable via `cleanupPeriodDays` in `~/.claude/settings.json`.
  Source: <https://code.claude.com/docs/en/data-usage> ("Local caching: Claude Code clients store session transcripts locally in plaintext under `~/.claude/projects/` for 30 days by default to enable session resumption.")
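
The location rules above can be sketched as two small helpers. This is an approximation built only from the rules stated in this section, not from Claude Code's source: `encode_project_key` and `list_main_sessions` are hypothetical names, and the real encoder may treat other characters differently.

```python
import re
from pathlib import Path


def encode_project_key(cwd: str) -> str:
    """Approximate the project directory name derived from a working directory."""
    # A dot that starts a path component (hidden directory) becomes '-'.
    hidden_fixed = re.sub(r"(?<=[/\\])\.", "-", cwd)
    # Every path separator or drive colon becomes '-'.
    return re.sub(r"[/\\:]", "-", hidden_fixed)


def list_main_sessions(root: Path) -> list[Path]:
    """Return main-session transcripts under root, skipping agent-*.jsonl files."""
    return sorted(
        p for p in root.rglob("*.jsonl") if not p.name.startswith("agent-")
    )


print(encode_project_key("/mnt/e/CS/HF/eidolon"))  # -mnt-e-CS-HF-eidolon
```

A caller would typically pass `Path.home() / ".claude" / "projects"` as `root`, or the directory named by `CLAUDE_CONFIG_DIR` when that override is set.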

### 3.2 Common record fields

Every record (both user and assistant types) carries:

| field | type | meaning |
|---|---|---|
| `parentUuid` | `string \| null` | UUID of the parent record (null on the first record) |
| `uuid` | `string` | This record's UUID |
| `sessionId` | `string` | UUID of the session (matches filename) |
| `timestamp` | `string` (ISO-8601) | Wall-clock time of the record |
| `cwd` | `string` | Absolute working directory |
| `version` | `string` | Claude Code version (e.g. `"2.1.143"`) |
| `gitBranch` | `string` | Empty string `""` when not in a git repo |
| `isSidechain` | `boolean` | True for sub-agent (Task tool) chains |
| `userType` | `string` | `"external"` or similar |
| `type` | `string` | Discriminator — see §3.3 |
| `entrypoint` | `string` | e.g. `"sdk-cli"` |

Sources for these fields:
- <https://github.com/KyleAMathews/claude-code-ui/blob/main/spec.md> §"Type Definitions" → `BaseMessageEntry`
- <https://github.com/jamie-bitflight/claude_skills/blob/main/plugins/agentskill-kaizen/skills/transcript-analysis/references/session-log-schema.md> §"Top-Level Record Fields"
- <https://github.com/moru-ai/agent-schemas/blob/main/claude-code/v2.1.1/session.schema.json> (machine-validated against ~50,000 messages from 480 real sessions)
- Direct inspection (this doc): `head` of `~/.claude/projects/-mnt-e-CS-HF-eidolon/c6967343-51a3-4b1b-9472-a569e96114b1.jsonl` confirms presence of every field above.

### 3.3 Record types (`type` discriminator)

| `type` | Role |
|---|---|
| `user` | Both human prompts AND tool results (distinguished by `message.content[].type`) |
| `assistant` | Model output: text, `thinking`, and `tool_use` blocks |
| `system` | Hook summaries, stop notices |
| `summary` | Context-compaction markers |
| `attachment` | Hook stdout/stderr, e.g. `SessionStart` hook output |
| `queue-operation` | Prompt enqueue/dequeue events |
| `file-history-snapshot` | File-state tracking for undo |
| `last-prompt` | Bookkeeping for resume |

Source: <https://github.com/KyleAMathews/claude-code-ui/blob/main/spec.md> §"Entry Types"; corroborated by direct `Counter` inspection of one local session showing `attachment, assistant, user, last-prompt, queue-operation` types in expected proportions.
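
For anyone who wants to repeat the `Counter` inspection on their own session, a minimal sketch follows. `count_record_types` is a hypothetical helper name, and the `<unparseable>` / `<missing>` buckets are labels chosen here for illustration.

```python
import json
from collections import Counter
from pathlib import Path


def count_record_types(path: Path) -> Counter:
    """Tally the `type` discriminator over every line of one session file."""
    counts: Counter = Counter()
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                counts["<unparseable>"] += 1
                continue
            if isinstance(rec, dict):
                counts[rec.get("type", "<missing>")] += 1
            else:
                counts["<unparseable>"] += 1
    return counts
```

Calling `count_record_types(path).most_common()` gives the per-type proportions in descending order.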

### 3.4 The two record types we care about

#### Assistant record carrying a tool call (the "student action")

Real example, redacted from `~/.claude/projects/-mnt-e-CS-github-VIGOR--overstory-worktrees-builder-doc-adapter-skeleton/39df59f0-674c-413a-b333-cdac0cea9db7.jsonl`:

```json
{
  "type": "assistant",
  "uuid": "24a16a51-3133-4ba5-9d23-472864286154",
  "parentUuid": "1b11c3b3-832b-4473-a944-b61a1f3f2594",
  "sessionId": "39df59f0-…",
  "timestamp": "2026-05-16T04:52:21.947Z",
  "message": {
    "role": "assistant",
    "model": "claude-opus-4-7",
    "content": [
      {
        "type": "tool_use",
        "id": "toolu_bdrk_012HC2dggmSgtVAtWWzwikZq",
        "name": "Bash",
        "input": {
          "command": "ov mail check --agent builder-doc-adapter-skeleton 2>&1 | head -200",
          "description": "Check builder agent inbox"
        }
      }
    ],
    "stop_reason": "tool_use",
    "usage": { "input_tokens": 6, "cache_creation_input_tokens": 48287, "output_tokens": 1021, ... }
  }
}
```

The student's *action* at this step = the JSON of `message.content[i]` where `content[i].type == "tool_use"` (or, if multiple tool_use blocks, the array of them; or if pure-text reply, the `content[i].text` of the `text` block).

#### User record carrying a tool result (the "observation")

```json
{
  "type": "user",
  "uuid": "b9f9414b-…",
  "parentUuid": "24a16a51-…",            // matches the assistant uuid above
  "sessionId": "39df59f0-…",
  "timestamp": "2026-05-16T04:52:23.229Z",
  "message": {
    "role": "user",
    "content": [
      {
        "tool_use_id": "toolu_bdrk_012HC2dggmSgtVAtWWzwikZq",
        "type": "tool_result",
        "content": "  No new messages",
        "is_error": false
      }
    ]
  },
  "toolUseResult": {                       // duplicate, structured form
    "stdout": "  No new messages",
    "stderr": "",
    "interrupted": false,
    "isImage": false,
    "noOutputExpected": false
  },
  "sourceToolAssistantUUID": "24a16a51-…"  // back-pointer to the assistant uuid
}
```

User records carrying actual human prompts have `message.content` as a list with `{"type":"text","text":"..."}` blocks (or, in older logs, `message.content` as a plain string).
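
Because both human prompts and tool results arrive as `type == "user"` records, an ingester has to look at the content blocks to tell them apart. A minimal sketch of that check, assuming only the shapes shown in this section (`user_record_kind` is a hypothetical name):

```python
def user_record_kind(rec: dict) -> str:
    """Classify a `type == "user"` record as 'tool_result', 'human_prompt', or 'other'."""
    msg = rec.get("message")
    if not isinstance(msg, dict):
        return "other"
    content = msg.get("content")
    if isinstance(content, str):
        return "human_prompt"  # older logs store the prompt as a plain string
    if isinstance(content, list):
        kinds = {b.get("type") for b in content if isinstance(b, dict)}
        if "tool_result" in kinds:
            return "tool_result"
        if "text" in kinds:
            return "human_prompt"
    return "other"
```

Checking for `tool_result` before `text` is a deliberate choice in this sketch: a record that mixes both block types is treated as an observation rather than a fresh human turn.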

### 3.5 Schema stability

- **Anthropic's official documentation** acknowledges the location and "each line is a JSON object for a message, tool use, or metadata entry" but does **not** publish a versioned schema.
- **Practical stability**: moru-ai/agent-schemas tracked v2.0.76 → v2.1.1; only one new field of note (`toolUseResult`). Schema pins `additionalProperties: true` for forward compatibility. This level of stability is sufficient for Spike 007 (a research spike, not a long-lived product API).
- **Mitigation**: pin to a specific Claude Code `version` field range and version-gate the ingester (e.g. accept `2.1.x`, warn on others).
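
The version gate in the mitigation bullet could look roughly like the sketch below. `ACCEPTED_PREFIXES` and `check_version` are hypothetical names, and the `2.1.` prefix is only the example range mentioned above, not a tested compatibility matrix.

```python
import warnings

# Assumption: the ingester has only been exercised against 2.1.x logs.
ACCEPTED_PREFIXES = ("2.1.",)


def check_version(rec: dict, *, strict: bool = False) -> bool:
    """Return True if the record's `version` is in the accepted range.

    Outside the range, warn by default or raise when strict=True.
    """
    version = rec.get("version")
    ok = isinstance(version, str) and version.startswith(ACCEPTED_PREFIXES)
    if not ok:
        msg = f"untested Claude Code version: {version!r}"
        if strict:
            raise ValueError(msg)
        warnings.warn(msg, stacklevel=2)
    return ok
```

Whether `strict` should default to `True` is exactly the hard-fail versus warn question listed in §7.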

### 3.6 Licensing

- The Claude Code binary is **proprietary** (Anthropic Commercial Terms of Service, <https://github.com/anthropics/claude-code/blob/1e95326e12183286fc6cbd828c8a86a0d8e03c62/LICENSE.md>).
- The session JSONL files are **local user data** generated on the user's machine during ordinary use. Anthropic's data-usage doc explicitly calls them "local caching … session transcripts locally in plaintext" — they belong to the user.
- Our framework is MIT-licensed and we are **not redistributing the Claude Code binary or any third-party trace files**. We are reading the user's own local logs (analogous to processing one's own `.bash_history`).
- We MUST NOT publish raw trace files in our repo without the user's consent (PII risk: cwd, gitBranch, file contents). The framework should ship only the *ingester*, plus a tiny synthetic-fixture trace for unit tests.

---

## 4. Acquiring the 5 real example traces

**Zero acquisition cost.** All five live on this machine right now.

Discovery command (used during this audit):

```bash
find ~/.claude/projects -name "*.jsonl" 2>/dev/null | wc -l
# → 1015
```

Five concrete pre-selected sessions, each multi-turn (≥ 100 tool_use messages), each from a distinct project, each ≥ 50 KB:

| # | Tool-use msgs | User msgs | Asst msgs | Total lines | Path |
|---|---|---|---|---|---|
| 1 | 2,830 | 3,199 | 4,325 | 17,315 | `/home/codeseys/.claude/projects/-mnt-e-CS-HF-eidolon/c6967343-51a3-4b1b-9472-a569e96114b1.jsonl` |
| 2 | 1,350 | 1,407 | 2,016 | 7,673 | `/home/codeseys/.claude/projects/-mnt-e-CS-github-agent-manager/c42b68ea-d410-455e-bc71-92ec6c4adce9.jsonl` |
| 3 | 984 | 1,032 | 1,549 | 5,783 | `/home/codeseys/.claude/projects/-mnt-e-CS-HF-streaming-speech-to-speech/73c9925c-d5e5-48fc-a97b-a58687c2fb3c.jsonl` |
| 4 | 717 | 759 | 1,142 | 4,036 | `/home/codeseys/.claude/projects/-mnt-e-CS-github/6ac8e20f-98ec-4279-9957-e68862a90c5e.jsonl` |
| 5 | 125 | 126 | 197 | 629 | `/home/codeseys/.claude/projects/-mnt-e-CS-github-VIGOR--overstory-worktrees-builder-iteration-checkpoint/e4a34e2b-40c6-49ce-b253-912a43224aae.jsonl` |

(All five inspected programmatically during this audit — counts above are real, not estimates.)

For users on other machines: `find ~/.claude/projects -name '*.jsonl' -size +50k | head` will surface candidates. For repository CI we will commit a small (~5 KB) **synthetic** fixture conforming to the schema, never any of the user's real traces.
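
A synthetic fixture of the kind described above could be generated rather than hand-written. The sketch below emits a three-record session (prompt, tool call, tool result) using only the fields documented in §3; every value in it is made up, and `write_synthetic_fixture` is a hypothetical helper name.

```python
import json
from pathlib import Path


def write_synthetic_fixture(path: Path) -> None:
    """Write a tiny, fully synthetic Claude Code style session to `path`."""
    common = {
        "sessionId": "00000000-0000-4000-8000-000000000000",
        "cwd": "/tmp/fixture",
        "version": "2.1.0",
        "gitBranch": "",
        "isSidechain": False,
        "userType": "external",
    }
    records = [
        {**common, "type": "user", "uuid": "u1", "parentUuid": None,
         "timestamp": "2026-01-01T00:00:00.000Z",
         "message": {"role": "user",
                     "content": [{"type": "text", "text": "List the files."}]}},
        {**common, "type": "assistant", "uuid": "a1", "parentUuid": "u1",
         "timestamp": "2026-01-01T00:00:01.000Z",
         "message": {"role": "assistant", "stop_reason": "tool_use",
                     "content": [{"type": "tool_use", "id": "toolu_fixture_1",
                                  "name": "Bash", "input": {"command": "ls"}}]}},
        {**common, "type": "user", "uuid": "u2", "parentUuid": "a1",
         "timestamp": "2026-01-01T00:00:02.000Z",
         "message": {"role": "user",
                     "content": [{"type": "tool_result",
                                  "tool_use_id": "toolu_fixture_1",
                                  "content": "README.md", "is_error": False}]}},
    ]
    with path.open("w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
```

Generating the fixture in a test helper keeps it well under the ~5 KB budget and avoids any chance of a real path or branch name leaking into the repository.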

---

## 5. Decision-relevant tradeoffs vs runners-up

### Why we are NOT picking OpenHands trajectories (c)
- **Pro**: cleanest schema we audited — Pydantic `Event` / `ActionEvent` / `ObservationEvent` models, source: <https://docs.openhands.dev/sdk/arch/events>, source code: <https://github.com/OpenHands/OpenHands/blob/3ec999e8/openhands/events/serialization/event.py>. Tool-call structure is *more* normalized than Claude Code's (explicit Action/Observation typing).
- **Con**: acquisition is not zero-cost here. The persistence dir defaults to `workspace/conversations/` and only exists if the user has *run OpenHands locally*. Public eval trajectories are spread across the eval/ folder rather than a clean public bucket.
- **Decisive**: Spike 001's economic floor was measured on 50 synthetic states. Spike 007's purpose is to verify ingestion + replay on real traces *that already exist*. (a) gives that today; (c) requires standing up OpenHands first. Its storage format is also split between v0 (per-event JSON files) and v1 (timestamped files) per <https://github.com/All-Hands-AI/OpenHands/issues/8701>, which is a flux risk.
- **Future use**: if the framework ever ships "trace ingester adapters" plural, OpenHands is the second adapter to write — its event-typed model is conceptually superior.

### Why we are NOT picking SWE-bench leaderboard trajectories (e)
- **Pro**: hundreds of submissions on <https://github.com/swe-bench/experiments>, with required `trajs/` folders.
- **Con**: leaderboard rules say "The reasoning trace can be represented with **any text based file format (e.g. md, json, yaml)**" (source: <https://github.com/swe-bench/experiments> README). Each submitter picks their own. Building a generic ingester is a per-submission engineering project, not a single adapter. SWE-agent uses one shape (`{"action", "observation", "response"}` arrays — confirmed via <https://huggingface.co/datasets/JetBrains-Research/swe-traj-complete>); mini-swe-agent uses `.traj.json` with OpenAI messages format (<https://huggingface.co/datasets/tarsur385/swebench-verified-trajectories>).
- **Decisive**: heterogeneous schema = fragile ingester = wrong choice for *first* spike.

### Why we are NOT picking Aider (d)
- The `chat_history_file` is **markdown** (`.aider.chat.history.md`), per <https://aider.chat/docs/config/dotenv.html>. Source code at <https://github.com/Aider-AI/aider/blob/bdb4d9ff/aider/history.py> shows it's literally `f.write(text)` of formatted prose with `####` for user input.
- **Decisive**: tool calls in Aider are *applied as edits*, not preserved as discrete structured actions in the markdown log. Reconstructing "the action the student took at step k" is lossy. The `.aider.llm.history` log is closer to what we want but is opt-in and not always present.

### Why we are NOT picking Cline (b)
- No public commitment to a stable export schema. Cline's storage is internal to the VS Code extension (workspace state DB + per-task JSON in extension storage). Searching for "Cline trace export schema" yields no Anthropic-style spec doc. Workable in principle, but reverse-engineering an extension's storage is not the right ground for a 1-week spike.

### Why we are NOT picking SWE-smith-trajectories (f)
- This is the **strongest external dataset** we found and **should be Spike 007's stretch goal / Spike 008's primary**: 5,017 fine-tuning trajectories from SWE-agent + Claude 3.7 Sonnet, 4.22 GB on HuggingFace, OpenAI messages format. Source: <https://huggingface.co/datasets/SWE-bench/SWE-smith-trajectories>.
- **Why not first**: the messages-only format collapses tool calls and tool results into the OpenAI chat-completions wire format with text-encoded tool blocks. That works for SFT but is *less* signal-dense for the teacher-correction spike than Claude Code's `tool_use` blocks: in Claude Code's format the tool call's `name` and `input` are structurally separate fields, which makes "did the teacher pick a different tool?" a one-line check.
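
To illustrate how cheap that check becomes, here is a sketch that compares two actions serialized the way §6's `_serialize_action` does (a JSON list of `{"name", "input"}` objects, or plain text for a text-only turn). `picked_different_tool` and `_tool_names` are hypothetical helpers, not part of spike-005.

```python
import json


def _tool_names(action: str) -> list:
    """Ordered tool names in a serialized action; [] for a text-only action."""
    try:
        parsed = json.loads(action)
    except json.JSONDecodeError:
        return []  # plain text: no tool was called
    if not isinstance(parsed, list):
        return []
    return [call.get("name") for call in parsed if isinstance(call, dict)]


def picked_different_tool(student_action: str, teacher_action: str) -> bool:
    """True when the two actions call different tools (inputs are ignored)."""
    return _tool_names(student_action) != _tool_names(teacher_action)
```

With text-encoded tool blocks the same comparison would first need a per-dataset parser to recover the tool name from prose.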

---

## 6. TraceIngester sketch

> **Realised in v0.1 (Wave 17 update):** The realised ingester ships at
> `composer_replication/ingestion/claude_code.py` exporting
> `ClaudeCodeIngester`, with the spike at
> `spikes/007-real-trace-ingestion/claude_code_ingester.py`. The
> public production surface is:
>
> ```python
> from pathlib import Path
> from composer_replication.ingestion.claude_code import ClaudeCodeIngester
>
> ingester = ClaudeCodeIngester(skip_sidechain=True, strip_thinking=True)
> for trace_state in ingester.ingest(Path("~/.claude/projects/.../session.jsonl").expanduser()):
>     # trace_state matches the TraceState TypedDict from §1
>     ...
> stats = ingester.last_stats  # IngestionStats — turn counts, skip reasons
> ```
>
> The shipped `ClaudeCodeIngester` differs from the pre-spike sketch
> below in:
> - Class name: `ClaudeCodeIngester` (not `TraceIngester`)
> - Module path: `composer_replication.ingestion.claude_code` (not
>   `spikes/007-trace-ingester/trace_ingester.py`)
> - The constructor takes config kwargs (`system_prompt`,
>   `skip_sidechain`, `strip_thinking`, `max_history_tokens`); paths
>   are passed to `.ingest(Path)` per call instead of being held by the
>   ingester
> - The yielded type is `TraceState` (matches §1)
>
> The pre-spike sketch below is preserved as historical proposal context.

Drop-in adapter for spike-005's `replay_trace()`. Targets `TraceState` (the actual existing TypedDict; see §1).

```python
# spikes/007-trace-ingester/trace_ingester.py
from __future__ import annotations
import json
from collections.abc import Iterator
from pathlib import Path
from typing import Any

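# Re-use the existing TypedDicts from spike-005 (TraceState lives in
# spikes/005-integrated-trainer-skeleton/teacher_replay.py). That directory name
# is not a valid Python identifier, so a plain `import` statement cannot reach
# it; load it via importlib or copy the TypedDict.

# A "step" in the trace is each assistant record (tool_use or text-only). The
# state visible to the model at that step = all messages strictly before it,
# in OpenAI/Anthropic chat format. The student_action = the tool_use payload(s),
# or the concatenated text when the turn has no tool_use block.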

def _record_to_chat_message(rec: dict, *, strip_thinking: bool = True) -> dict | None:
    """Turn one Claude Code JSONL record into an OpenAI/Anthropic chat-message
    dict, or return None for non-conversational records (queue-operation,
    attachment, file-history-snapshot, system, last-prompt, summary)."""
    t = rec.get("type")
    if t not in ("user", "assistant"):
        return None
    msg = rec.get("message")
    if not isinstance(msg, dict):
        return None
    role = msg.get("role")
    content = msg.get("content")
    if role not in ("user", "assistant") or content is None:
        return None
    # Strip thinking blocks by default: they are not portable across teacher
    # models and should not influence the teacher's decision at replay time.
    if strip_thinking and isinstance(content, list):
        content = [c for c in content
                   if not (isinstance(c, dict) and c.get("type") == "thinking")]
    return {"role": role, "content": content}


def _serialize_action(content_blocks: list[dict]) -> str:
    """Canonicalize the student's action at a step.

    For tool_use steps: JSON-encode the (name, input) pairs.
    For text-only steps: return the concatenated text.
    """
    tool_uses = [b for b in content_blocks if isinstance(b, dict) and b.get("type") == "tool_use"]
    if tool_uses:
        return json.dumps(
            [{"name": tu.get("name"), "input": tu.get("input")} for tu in tool_uses],
            sort_keys=True,
        )
    texts = [b.get("text", "") for b in content_blocks if isinstance(b, dict) and b.get("type") == "text"]
    return "\n".join(t for t in texts if t)


class TraceIngester:
    """Reads a Claude Code session JSONL and yields TraceState records.

    One TraceState is emitted per assistant record. The `messages` field is the
    full prior conversation (user and assistant messages; no system prompt is
    injected) up to but not including the current assistant turn;
    `student_action` is the canonicalized serialization of that turn's content
    blocks.
    """

    def __init__(self, *, skip_thinking: bool = True, min_action_chars: int = 1) -> None:
        self.skip_thinking = skip_thinking
        self.min_action_chars = min_action_chars

    def ingest(self, path: str | Path) -> Iterator[dict]:  # yields TraceState
        path = Path(path)
        prior_messages: list[dict] = []
        session_id_for_state = path.stem  # filename = session UUID

        with path.open("r", encoding="utf-8") as f:
            for line_idx, line in enumerate(f):
                line = line.strip()
                if not line:
                    continue
                try:
                    rec = json.loads(line)
                except json.JSONDecodeError:
                    continue  # tolerate truncated last-line writes
                if not isinstance(rec, dict):
                    continue  # a valid JSON line that is not an object

                # Pass the constructor flag through so skip_thinking=False
                # keeps thinking blocks in the replayed history.
                chat_msg = _record_to_chat_message(rec, strip_thinking=self.skip_thinking)
                if chat_msg is None:
                    continue

                if chat_msg["role"] == "assistant":
                    # Emit a TraceState representing "before this turn".
                    blocks = chat_msg["content"] if isinstance(chat_msg["content"], list) else []
                    student_action = _serialize_action(blocks)
                    if len(student_action) >= self.min_action_chars:
                        yield {
                            "state_id": f"{session_id_for_state}:{rec.get('uuid', line_idx)}",
                            "messages": list(prior_messages),    # snapshot
                            "student_action": student_action,
                        }
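                # Append to history so subsequent turns see it, except records
                # whose content is empty (e.g. a thinking-only record after
                # stripping), which would be an invalid chat message.
                if chat_msg["content"]:
                    prior_messages.append(chat_msg)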
```

Notes:
- We skip `thinking` blocks because (1) they're Anthropic-specific and (2) feeding them to other-vendor teachers (GPT/DeepSeek) leaks reasoning the teacher should produce on its own. This matches the philosophy used in spike-005's `_normalize_action`.
- We do NOT inject a system prompt — Claude Code's initial system prompt is not in the JSONL (it's set at SDK init and visible only via `attachment` records). Downstream callers may want to prepend a synthetic system message for teacher fairness. Open question for ADR-002.
- `state_id = f"{sessionId}:{recordUuid}"` is globally unique and stable across re-ingest.
- Failures (unparseable lines, missing fields) are tolerated silently. A counters-based sibling method `ingest_with_stats(path)` is a small follow-up.
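
As a starting point for the counters-based follow-up, the sketch below tallies why lines would be dropped, using the same skip rules as the ingester sketch above. It is a standalone function rather than the proposed `ingest_with_stats` method; `tally_skip_reasons` and the bucket names are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def tally_skip_reasons(path: Path) -> Counter:
    """Count kept records and each reason a line would be silently skipped."""
    stats: Counter = Counter()
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                stats["blank_line"] += 1
                continue
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                stats["unparseable"] += 1
                continue
            if not isinstance(rec, dict) or rec.get("type") not in ("user", "assistant"):
                stats["non_conversational"] += 1
                continue
            msg = rec.get("message")
            if not isinstance(msg, dict) or msg.get("content") is None:
                stats["missing_message"] += 1
                continue
            stats["kept"] += 1
    return stats
```

A high `unparseable` count on anything other than the final line would suggest a corrupted file rather than a truncated write, which is the kind of signal silent tolerance hides.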

### 6.1 Smoke-test plan (for Spike 007 itself)

```python
ingester = TraceIngester()
states = list(ingester.ingest("/home/codeseys/.claude/projects/-mnt-e-CS-github-VIGOR--overstory-worktrees-builder-iteration-checkpoint/e4a34e2b-40c6-49ce-b253-912a43224aae.jsonl"))
# Expect roughly 197 states (matches asst-message count counted in §4).
# Then teacher-replay on the first 5 states, confirm cost is in the
# spike-001 ballpark ($0.05–$0.20 for 5 states × 3 teachers).
```

Spike 001 baseline to beat: $0.98/trace mean (50-state synthetic), $0.30/trace projected with VOI gating. On real states a ~5–20× cost increase is plausible due to longer message histories (10k+ tokens vs synthetic ~300 tokens), so a relevant **economic check** for Spike 007 is: if the first 5 states cost > $5 (i.e. > $1/state), the VOI gate from Spike 001 is *required* before scaling. Flag this finding in the spike write-up.
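
The arithmetic behind that check can be written out as follows. This is a back-of-the-envelope sketch that assumes the $0.98/trace baseline is spread evenly over its 50 states; the constant and function names are illustrative only.

```python
BASELINE_PER_TRACE = 0.98   # dollars, Spike 001 mean, ungated, 50-state synthetic trace
STATES_PER_TRACE = 50
SMOKE_STATES = 5            # states replayed in the smoke test
GATE_THRESHOLD = 5.00       # dollars for the first 5 states, i.e. $1/state


def projected_smoke_cost(multiplier: float) -> float:
    """Expected cost of the 5-state smoke test at a given real-vs-synthetic multiplier."""
    per_state = BASELINE_PER_TRACE / STATES_PER_TRACE   # about $0.0196 per state
    return per_state * multiplier * SMOKE_STATES


def voi_gate_required(observed_cost: float) -> bool:
    """True when the observed 5-state cost crosses the threshold from the text."""
    return observed_cost > GATE_THRESHOLD


for m in (1, 5, 20):
    print(f"{m:>2}x -> ${projected_smoke_cost(m):.3f}")
```

Under the even-spread assumption the 1x, 5x, and 20x cases come to about $0.098, $0.49, and $1.96 for five states, all below the $5 trigger; the trigger would only fire if real states cost roughly 51x the synthetic per-state figure.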

---

## 7. Open questions for ADR-002

1. Do we promote `TraceState` to a top-level `TraceExample` dataclass, with optional `teacher_id`, `reward`, `hint_text`? Or keep `TraceState` as ingester output and `DPOPair` as trainer input, treating the brief's "TraceExample" as conceptual?
2. Should `TraceIngester.ingest()` emit one record per **assistant turn** (current sketch) or per **assistant `tool_use` block** within a turn? Some Claude Code records have multiple tool_use blocks in one assistant message.
3. Synthetic system prompt at replay time — yes/no? If yes, what content?
4. Trace-version pinning: hard-fail or warn when `version` field falls outside a known-tested range?
5. Subagent transcripts (`agent-*.jsonl`) — include or skip? They are denser per-turn but their parent context is the orchestrator, not the user, which changes the teacher-replay semantics.
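
To make question 2 concrete, the per-block alternative could be sketched as below. `split_turn_per_tool_use` is a hypothetical helper, and the `#<index>` suffix on `state_id` is an assumption of this sketch rather than an agreed convention.

```python
import json


def split_turn_per_tool_use(
    state_id: str, messages: list[dict], blocks: list[dict]
) -> list[dict]:
    """Emit one TraceState-shaped dict per tool_use block of a single assistant turn."""
    tool_uses = [
        b for b in blocks if isinstance(b, dict) and b.get("type") == "tool_use"
    ]
    states = []
    for i, tu in enumerate(tool_uses):
        states.append({
            "state_id": f"{state_id}#{i}",
            # Every block in the turn was produced from the same prior history.
            "messages": list(messages),
            "student_action": json.dumps(
                {"name": tu.get("name"), "input": tu.get("input")}, sort_keys=True
            ),
        })
    return states
```

One consequence worth weighing in the ADR: all blocks from one turn share an identical `messages` history, so per-block emission multiplies teacher-replay cost without giving the teacher any new context.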

---

## 8. References (primary sources only)

Anthropic / Claude Code official:
- <https://code.claude.com/docs/en/sessions> — session storage location and "JSONL, one JSON per line"
- <https://code.claude.com/docs/en/data-usage> — "local caching … session transcripts locally in plaintext under `~/.claude/projects/` for 30 days by default"
- <https://code.claude.com/docs/en/legal-and-compliance> — Commercial Terms vs Consumer Terms applicability
- <https://github.com/anthropics/claude-code/blob/1e95326e12183286fc6cbd828c8a86a0d8e03c62/LICENSE.md> — proprietary license

Community schemas (reverse-engineered from real session data):
- <https://github.com/moru-ai/agent-schemas/blob/main/claude-code/v2.1.1/session.schema.json> — JSON Schema Draft 2020-12, validated against ~50,000 messages from 480 sessions
- <https://github.com/KyleAMathews/claude-code-ui/blob/main/spec.md> §"Claude Code Session Log Format" — Entry types and TypeScript discriminated union
- <https://github.com/jamie-bitflight/claude_skills/blob/main/plugins/agentskill-kaizen/skills/transcript-analysis/references/session-log-schema.md> — top-level fields, project-key encoding, subagent file location
- <https://github.com/dagster-io/erk/blob/master/docs/learned/sessions/layout.md> — directory structure, plan-mode `slug` field
- <https://github.com/pedropaulovc/claude-code-types> — TypeScript type definitions from session logs

Runners-up reference points:
- OpenHands events: <https://docs.openhands.dev/sdk/arch/events>, <https://docs.openhands.dev/sdk/guides/convo-persistence>, <https://github.com/OpenHands/OpenHands/blob/3ec999e8/openhands/events/serialization/event.py>, <https://github.com/All-Hands-AI/OpenHands/issues/8701>
- SWE-bench experiments: <https://github.com/swe-bench/experiments>
- SWE-smith trajectories on HF: <https://huggingface.co/datasets/SWE-bench/SWE-smith-trajectories>
- mini-swe-agent traj.json: <https://huggingface.co/datasets/tarsur385/swebench-verified-trajectories>
- swe-traj-complete (SWE-agent format example): <https://huggingface.co/datasets/JetBrains-Research/swe-traj-complete>
- Aider history file format: <https://aider.chat/docs/config/dotenv.html>, <https://github.com/Aider-AI/aider/blob/bdb4d9ff/aider/history.py>, <https://github.com/paul-gauthier/aider/blob/main/aider/io.py>

Internal references:
- `spikes/005-integrated-trainer-skeleton/teacher_replay.py` — `TraceState`, `DPOPair`, `replay_trace`, `extract_dpo_pairs` (read in full during this audit; see §1 for actual field list)
- Spike 001 economic floor: $0.98/trace mean ungated, $0.30/trace projected with VOI gating