
	# datasets — Four-Layer Dataset Strategy + HF Hub Publication

	Module path: `driftcall/data/` (loaders) + `data/` (on-disk artifacts)
	Owner: Person C (Training & Data)
	Implements: DESIGN.md §8 (Dataset Strategy — §§8.1, 8.2, 8.3, 8.4, 8.5, 8.6)
	Consumed by: `driftcall/task_generator.py` (L1), `driftcall/drift_injector.py` (L2 drift patterns), `driftcall/vendors/*.py` (L2 API schemas), `driftcall/audio/*.py` (L3 audio), `training/train_grpo.py` (L4 SFT warmup).
	Status: Design spec — no code yet.

	---

	## 1. Purpose

	`datasets` is the authoring-and-loading contract for every piece of static data DriftCall depends on. It is not a training-data-pipeline module (that is `training.md`) and it does not compose rewards (that is `rewards.md`). It does exactly four things:

	1. Defines the four dataset layers per DESIGN.md §8.2 — task-brief templates, vendor API schemas + drift patterns, voice audio, and the optional SFT warmup corpus — as on-disk files with frozen YAML/JSON schemas.
	2. Loads each file through a deterministic, lazy, singleton loader that NFC-normalizes all Indic strings at load time (cross-references `docs/modules/task_generator.md` §3.4 invariant #8).
	3. Validates each file at load time: schema shape, type constraints, license header presence, train/val leakage, and consistency cross-references (e.g., every `drift_slot_tags` token in templates.yaml is targetable by ≥ 1 pattern in drift_patterns.yaml).
	4. Publishes the public-facing bundle `<team>/driftcall-indic-briefs` to the Hugging Face Hub per DESIGN.md §8.6, packaged from a deterministic `enumerate_variants()` walk (see `docs/modules/task_generator.md` §2.2).

	No file in `data/` is ever written at runtime. All four layers are authored before Phase C ships. Runtime only reads. The only exception is the dataset-packaging script (`training/data_export.py`) which writes `train/briefs.jsonl` + `val/briefs.jsonl` once, pre-publication.

	Supervision for GRPO comes from the 5 reward functions (DESIGN.md §8.1) — these files exist to parameterize the environment, not to teach the policy. L4 (SFT warmup) is optional per DESIGN.md §8.2 row 4.

	---

	## 2. Interface

	### 2.1 Directory layout (on disk, shipped inside the env Docker image per DESIGN.md §11.1)

	```
	data/
	├── task_briefs/
	│   ├── templates.yaml        # L1 — hand-authored + procedural expansion source (§8.3)
	│   └── i18n.yaml             # L1 — Indic localized strings (cities, weekdays, dish names)
	├── drift_patterns/
	│   └── drifts.yaml           # L2 — 20 drift patterns (§6.3, §8.2 row 2)
	├── api_schemas/              # L2 — frozen JSON Schema per vendor per version
	│   ├── airline/
	│   │   ├── v1.json
	│   │   ├── v2.json
	│   │   └── v3.json
	│   ├── cab/
	│   │   ├── v1.json
	│   │   ├── v2.json
	│   │   └── v3.json
	│   ├── restaurant/
	│   │   ├── v1.json
	│   │   ├── v2.json
	│   │   └── v3.json
	│   ├── hotel/
	│   │   ├── v1.json
	│   │   ├── v2.json
	│   │   └── v3.json
	│   └── payment/
	│       ├── v1.json
	│       └── v2.json
	├── audio/                    # L3 — synthesized + real voice clips (§8.2 row 3, §9)
	│   ├── synth/                # Kokoro-82M output, generated lazily; gitignored
	│   │   └── .gitkeep
	│   ├── real/                 # AI4Bharat IndicVoices-R held-out subset for pitch demo
	│   │   └── MANIFEST.jsonl    # (utterance_id, path, language, license, sha256)
	│   └── LICENSES.md           # per-clip license attribution
	└── sft_warmup/               # L4 — optional Sarvam-M synthesized trajectories (§8.2 row 4)
	    ├── trajectories.jsonl    # 200–500 correct rollouts
	    └── LICENSES.md
	```

	Publication structure (HF Hub dataset repo `<team>/driftcall-indic-briefs`, DESIGN.md §8.6):

	```
	driftcall-indic-briefs/
	├── README.md # model card — provenance, license, stats, reward caveats
	├── train/briefs.jsonl # 15,000 sampled episodes (seed, stage, language_weights, GoalSpec)
	├── val/briefs.jsonl # 500 held-out episodes — seeds disjoint from train
	├── drift_patterns.yaml # exact copy of data/drift_patterns/drifts.yaml (20 patterns)
	├── api_schemas/ # exact copy of data/api_schemas/
	└── LICENSE # bundle license (Apache 2.0 by default; see §3.4)
	```

	### 2.2 Per-file contracts

	| File | Format | Authored by | Runtime writer | Schema anchor |
	|---|---|---|---|---|
	| `data/task_briefs/templates.yaml` | YAML | Hand (20 seeds) | none | `Template` (§4.1 task_generator.md) |
	| `data/task_briefs/i18n.yaml` | YAML | Hand | none | `Mapping[LanguageCode, Mapping[str, str]]` |
	| `data/drift_patterns/drifts.yaml` | YAML | Hand | none | `DriftPattern` (§4.2 drift_injector.md) |
	| `data/api_schemas/<domain>/v<N>.json` | JSON Schema 2020-12 | Hand | none | `APISchema` (§4.4 below) |
	| `data/audio/real/MANIFEST.jsonl` | JSONL | Hand (curated from IndicVoices-R) | none | `AudioClipManifest` (§4.5) |
	| `data/audio/synth/*.wav` | WAV 16kHz mono | `audio/tts_kokoro.py` (lazy) | `audio/tts_kokoro.py` | n/a — generated |
	| `data/sft_warmup/trajectories.jsonl` | JSONL | Sarvam-M via HF Inference (offline) | `training/sft_generator.py` (one-shot) | `SFTTrajectory` (§4.6) |

	### 2.3 Loaders (all return frozen dataclasses, all NFC-normalize Indic strings)

	```python
	from __future__ import annotations
	from pathlib import Path
	from driftcall.data.models import (
	    TemplateLibrary, I18nLibrary,
	    DriftPatternLibrary, APISchemaRegistry,
	    AudioManifest, SFTCorpus,
	)

	# L1 — task briefs
	def load_templates(path: Path | str = "data/task_briefs/templates.yaml") -> TemplateLibrary: ...
	def load_i18n(path: Path | str = "data/task_briefs/i18n.yaml") -> I18nLibrary: ...

	# L2 — drift patterns + api schemas
	def load_drift_patterns(path: Path | str = "data/drift_patterns/drifts.yaml") -> DriftPatternLibrary: ...
	def load_api_schemas(root: Path | str = "data/api_schemas") -> APISchemaRegistry: ...

	# L3 — audio manifest (paths + licenses only; actual WAVs resolved on-demand)
	def load_audio_manifest(path: Path | str = "data/audio/real/MANIFEST.jsonl") -> AudioManifest: ...

	# L4 — optional SFT warmup
	def load_sft_corpus(path: Path | str = "data/sft_warmup/trajectories.jsonl") -> SFTCorpus: ...
	```

	Each loader is implemented as a module-level lazy singleton — the first call reads + validates + freezes; subsequent calls return the same instance. Not thread-safe for write (there is no write); safe for concurrent read.

	### 2.4 HF Hub publication commands

	Packaging runs once, pre-event. The script is `training/data_export.py` (see `docs/modules/training.md` for its interface — this module only defines the on-disk shape of what it writes).

	Immutability. The published bundle is IMMUTABLE after publication. Re-running `hf upload` against the same `data/publication/` tree produces a byte-identical bundle (invariant #6). Adding rows to `val/briefs.jsonl` requires a MINOR-version bump (v1.1, v1.2, …) and a new publication seed; the `train/` split NEVER silently mutates between versions — a version bump either adds disjoint val rows or re-publishes train+val together, never partial mutation of train.

	Seed selection (deterministic, locked). Train and val seeds are drawn by `training/data_export.py` using these two exact expressions:

	```python
	import random
	# Train: 15,000 seeds sampled without replacement from [0, 20_000_000).
	train_seeds = random.Random(20260425).sample(range(0, 20_000_000), 15_000)
	# Val: deterministic slice of 500 contiguous seeds in the reserved range.
	val_seeds = list(range(20_000_000, 20_000_500))
	```

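	Both lists are identical across re-runs on a pinned Python version. The output of `random.sample` for a given seed is not guaranteed to stay stable across Python releases, so the interpreter version should be recorded alongside the seed. The publication meta-seed `20260425` is locked; changing it requires a major-version bump and a new repo name or subfolder.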

	```bash
	# Generate the sampled briefs by walking enumerate_variants() (see task_generator.md §2.2)
	python3 training/data_export.py \
	  --out-train data/publication/train/briefs.jsonl \
	  --out-val data/publication/val/briefs.jsonl \
	  --n-train 15000 \
	  --n-val 500 \
	  --seed 20260425 # frozen publication seed; NOT a training seed

	# Copy the static L2 artifacts verbatim
	cp data/drift_patterns/drifts.yaml data/publication/drift_patterns.yaml
	cp -r data/api_schemas data/publication/api_schemas

	# Upload (see DRIFTCALL/CLAUDE.md §6 command table and huggingface-skills:hf-cli).
	# NOTE: `hf` is the modern CLI replacing the deprecated `huggingface-cli`.
	# The org is part of the repo id, so no separate org flag is passed.
	hf upload <org>/driftcall-indic-briefs \
	  data/publication/ . \
	  --repo-type dataset \
	  --commit-message "v1.0 publication — locked 2026-04-25"
	```

	The publication seed `20260425` is fixed and recorded in the README. Re-running the script produces a byte-identical bundle (determinism contract inherited from `task_generator.generate` per `docs/modules/task_generator.md` §3.1).

	> Doc-sync flag: `DRIFTCALL/CLAUDE.md` §6 still lists the deprecated `huggingface-cli upload` command; update that table to `hf upload` in the same PR that lands this doc (captured as Open Question #1 / CLAUDE.md sync item).

	---

	## 3. Behavior Spec

	### 3.1 Authoring conventions

	NFC normalization. Every string value in every YAML/JSON file is NFC-normalized before it is committed. The loaders re-normalize defensively at load time (invariant #8, `docs/modules/task_generator.md` §3.4). Editors used during authoring (VS Code, vim) must be configured to save NFC — a pre-commit hook (`ruff`-adjacent script) runs `python -c "import unicodedata, sys; ..."` to reject NFD commits.
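
The hook body itself is elided above. As a rough illustration only, an NFC check of this kind could look like the following sketch; `find_non_nfc_lines` is a hypothetical helper name, not part of the real hook.

```python
import unicodedata


def find_non_nfc_lines(text: str) -> list[int]:
    """Return 1-based line numbers whose content is not NFC-normalized."""
    return [
        lineno
        for lineno, line in enumerate(text.splitlines(), start=1)
        if not unicodedata.is_normalized("NFC", line)
    ]


# "é" as one precomposed code point (NFC) vs. base letter + combining accent (NFD).
nfc_line = "caf\u00e9"
nfd_line = "cafe\u0301"

bad_lines = find_non_nfc_lines(f"{nfc_line}\n{nfd_line}\n")
repaired = unicodedata.normalize("NFC", nfd_line)
```

A hook built on this would exit non-zero whenever `bad_lines` is non-empty; the loaders apply `unicodedata.normalize("NFC", ...)` defensively regardless.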

	License headers. Every hand-authored file begins with a YAML comment block declaring SPDX identifier, author, year, and upstream attribution if the content is derived from a public dataset (§8.5):

	```yaml
	# SPDX-License-Identifier: Apache-2.0
	# Copyright 2026 DriftCall Team
	# Derived-from: AmazonScience/MASSIVE (intent taxonomy, Apache-2.0)
	# See data/LICENSES.md for full attribution chain.
	```

	JSON files carry the same metadata in a root-level `$comment` field. JSON itself has no comment syntax (RFC 8259), so the schema files use the `$comment` keyword that JSON Schema 2020-12 defines for this purpose.

	Seed determinism. Every numeric or stochastic sampling decision in template/drift authoring threads through a fixed seed: the publication seed `20260425`, the template-expansion seed `42`, or the curriculum-language seeds declared in DESIGN.md §10.3. No wall-clock, no `random.random()`, no host-machine entropy.

	No PII. Authored strings never contain real names, phone numbers, email addresses, booking reference numbers, card PANs, or IP addresses. The `from` / `to` fields use IATA codes; the `pickup` / `drop` fields use fictional neighborhood landmarks. A CI lint built on `grep -rEn '[0-9]{10}' data/` (the `-r` flag is required to recurse into the directory) runs before every commit and fails on any 10-digit run outside the allowed IATA/timestamp contexts.
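
A plain `grep` cannot exempt the allowed contexts on its own. The sketch below shows one way the exemption could work; the allowlist pattern (epoch values under a key named `ts`) is purely an assumption for illustration, since this document does not define the allowed contexts precisely.

```python
import re

DIGIT_RUN = re.compile(r"[0-9]{10}")  # same pattern as the grep lint

# Hypothetical allowlist: exempts only epoch-style values under a "ts" key.
ALLOWED_CONTEXTS = (re.compile(r'"ts"\s*:\s*[0-9]+'),)


def find_digit_runs(text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs with a 10-digit run outside allowed contexts."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        scrubbed = line
        for allowed in ALLOWED_CONTEXTS:
            scrubbed = allowed.sub("", scrubbed)  # drop exempt spans before scanning
        if DIGIT_RUN.search(scrubbed):
            hits.append((lineno, line))
    return hits


sample = 'pickup: "Lake View Gate 2"\ncontact: "9876543210"\n{"ts": 1777091400}\n'
hits = find_digit_runs(sample)
```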

	Eval-set held out from training. The 500-episode val set uses seeds drawn from a reserved range (seed ∈ `[20_000_000, 20_000_500)`); training always draws seeds from `[0, 20_000_000)`. The publication script asserts disjointness at write time (see §5 leak detection). The exact seed-selection expressions are specified in §2.4.

	Canonical JSON key ordering. Every row in `train/briefs.jsonl` and `val/briefs.jsonl` is serialized with:

	```python
	json.dumps(row, ensure_ascii=False, sort_keys=True, separators=(",", ":"))
	```

	This is enforced as an invariant precondition for byte-identical re-runs (see §3.5 invariant #6). `ensure_ascii=False` preserves Devanagari / Tamil / Kannada script without `\uXXXX` escaping; `sort_keys=True` canonicalizes key order; `separators=(",", ":")` eliminates whitespace variance across Python/libc versions.
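
To make the effect concrete, here is a small sketch showing that two rows with the same content but different key insertion order serialize to the same bytes, and therefore the same SHA-256. The row fields are illustrative and not the full `BriefRow` schema.

```python
import hashlib
import json


def canonical_line(row: dict) -> str:
    """Serialize one row with the canonical settings from this section."""
    return json.dumps(row, ensure_ascii=False, sort_keys=True, separators=(",", ":"))


# Same content, different key insertion order.
row_a = {"seed": 42, "language": "hi", "city": "बेंगलुरु"}
row_b = {"city": "बेंगलुरु", "language": "hi", "seed": 42}

line_a = canonical_line(row_a)
line_b = canonical_line(row_b)
digest_a = hashlib.sha256(line_a.encode("utf-8")).hexdigest()
digest_b = hashlib.sha256(line_b.encode("utf-8")).hexdigest()
```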

	Per-row data lineage. Every `BriefRow` carries the full six-tuple `(template_id, seed, stage, language, domain, generator_version)` plus three corpus-version hashes (`catalogue_hash`, `templates_sha256`, `i18n_sha256`). This is enforced as an invariant (§3.5 invariant #9) so that any published row is re-derivable from the triple `(seed, stage, library@hash)` alone.

	### 3.2 Lazy singleton loaders

	```python
	# sketch of the module-level pattern, mirrored in every loader
	import threading
	from pathlib import Path

	_LIBRARIES: dict[Path, TemplateLibrary] = {}  # per-path cache
	_LIBRARY_LOCK = threading.Lock()

	def load_templates(path: Path | str = "data/task_briefs/templates.yaml") -> TemplateLibrary:
	    key = Path(path)
	    library = _LIBRARIES.get(key)
	    if library is None:
	        with _LIBRARY_LOCK:
	            library = _LIBRARIES.get(key)  # re-check under the lock
	            if library is None:
	                library = _load_and_validate_templates(key)
	                _LIBRARIES[key] = library
	    return library
	```

	The singleton is path-keyed — if a test passes a different `path`, a fresh instance is built (still cached in a per-path dict). Production callers always use the default path.

	### 3.3 Schema validation at load time

	Each loader does three passes:

	1. YAML/JSON parse. Failure → `MalformedYAMLError` / `MalformedJSONError` with line/column.
	2. Type + shape validation against the dataclass schema in §4. Failure → `DatasetSchemaError` naming the offending key.
	3. Cross-file consistency check (loader-specific):
	   - `load_drift_patterns` asserts `pattern.id` values are unique, exactly 20 patterns total, `drift_type ∈ {schema,policy,tnc,pricing,auth}`, and every `from_version`/`to_version` references an existing schema file in `data/api_schemas/<domain>/`.
	   - `load_templates` asserts every `drift_slot_tags` token is matched by ≥ 1 `DriftPattern.mutation` key or value. For example, `airline.total_fare_inr` must be targetable by some pattern; a tag that no pattern can reach is treated as an authoring error.
	   - `load_api_schemas` asserts each `v<N>.json` validates as JSON Schema 2020-12 against the meta-schema via `jsonschema.Draft202012Validator.check_schema`.
	   - `load_audio_manifest` asserts every referenced `path` exists on disk and its sha256 matches the recorded hash.

	Failures here abort env startup; HTTP 503 is served until the data is fixed (mirrors `DriftCatalogueError` handling in `docs/modules/drift_injector.md` §5).
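
As an illustration of the third pass, the audio-manifest check could be sketched as below. This is a minimal version under stated assumptions: the exception class is a local stand-in for the one named in §5, a missing clip raises the built-in `FileNotFoundError` rather than the module's own error type, and paths are resolved relative to a caller-supplied `audio_root`.

```python
import hashlib
import json
from pathlib import Path


class ChecksumMismatchError(Exception):
    """Local stand-in for the exception named in §5."""


def verify_manifest(manifest_path: Path, audio_root: Path) -> int:
    """Check each manifest row's file exists and matches its sha256; return rows checked."""
    checked = 0
    for line in manifest_path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # tolerate blank lines between rows
        row = json.loads(line)
        clip_path = audio_root / row["path"]
        if not clip_path.is_file():
            raise FileNotFoundError(clip_path)
        actual = hashlib.sha256(clip_path.read_bytes()).hexdigest()
        if actual != row["sha256"]:
            raise ChecksumMismatchError(
                f"{row['utterance_id']}: expected {row['sha256']}, got {actual}"
            )
        checked += 1
    return checked
```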

	### 3.4 License compatibility check

	Per §8.5 the public datasets we reference carry mixed licenses:

	| Upstream | License | Redistributable in our bundle? |
	|---|---|---|
	| AI4Bharat IndicVoices-R | Apache-2.0 | Yes, with attribution |
	| MASSIVE (Amazon) | Apache-2.0 | Yes, with attribution |
	| Schema-Guided Dialogue (SGD) | CC-BY-SA | Inspiration only — derived schema patterns, not verbatim rows |
	| MTOP (Facebook) | MIT-style (see original repo) | Inspiration only — derived Hindi task phrasings, not verbatim rows |
	| APIs.guru | CC0 | Yes, no attribution required but recorded |

	The bundle license (`LICENSE` at the root of `<team>/driftcall-indic-briefs`) is Apache-2.0. Because CC-BY-SA is a copyleft (share-alike) license, we never copy SGD or MTOP rows verbatim; we take only inspiration (intent labels, schema shapes). A CI check enforces that no string in `train/briefs.jsonl` or `val/briefs.jsonl` shares a verbatim run of ≥ 10 tokens with a cached SGD/MTOP export. See §5 for the error mode and §7 edge case #3 for the exact detection rule.
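
The 10-token rule can be illustrated with a simple n-gram set intersection. Note this is a simpler stand-in for the suffix-array check described in §7, and it assumes whitespace tokenization, which the spec does not pin down.

```python
def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    """All contiguous n-token windows of a token list (empty if fewer than n tokens)."""
    return {tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}


def has_verbatim_overlap(candidate: str, upstream_rows: list[str], n: int = 10) -> bool:
    """True if candidate shares a run of >= n whitespace-delimited tokens with any upstream row."""
    candidate_grams = ngrams(candidate.split(), n)
    if not candidate_grams:
        return False
    return any(candidate_grams & ngrams(row.split(), n) for row in upstream_rows)


upstream = ["i would like to book a table for two people at seven pm tonight please"]
copied = "hello i would like to book a table for two people at seven pm"
paraphrased = "please reserve a table for two people at seven this evening"

copied_flagged = has_verbatim_overlap(copied, upstream)
paraphrase_flagged = has_verbatim_overlap(paraphrased, upstream)
```

Any shared run longer than 10 tokens necessarily contains a shared 10-token window, so checking windows of exactly `n` tokens is sufficient for the "≥ 10" rule.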

	Full verbatim license text (MANDATORY). The root `LICENSE` file MUST contain the full verbatim Apache 2.0 license text as published at https://www.apache.org/licenses/LICENSE-2.0.txt — NOT a URL, NOT a one-line SPDX identifier, NOT a summary. The same requirement applies to `data/audio/LICENSES.md` and `data/sft_warmup/LICENSES.md` (both must embed the full Apache-2.0 text plus per-clip / per-trajectory attribution rows). CI check `tests/data/test_license_text.py` verifies that the byte length of each `LICENSE` file is ≥ 11,000 (Apache-2.0 full text is ~11,357 bytes) and that the canonical "Apache License / Version 2.0, January 2004" header string is present.

	`LICENSES.md` schema (L3 audio + L4 SFT warmup). Both `data/audio/LICENSES.md` and `data/sft_warmup/LICENSES.md` follow the same markdown format:

	1. A preamble (5–15 lines) naming the bundle and linking back to the root `LICENSE`.
	2. The full verbatim Apache-2.0 text (as above).
	3. A single markdown table with exactly these columns, one row per clip (L3) or per trajectory (L4):

	```markdown
	| utterance_id | upstream_source      | upstream_license | attribution_required | notes                    |
	|--------------|----------------------|------------------|----------------------|--------------------------|
	| iv_r_kn_0451 | IndicVoices-R        | Apache-2.0       | yes                  | speaker consent verified |
	| sft_00042    | Sarvam-M (synthesis) | Apache-2.0       | no                   | rollout seed 42, stage 2 |
	```

	For L4 the `utterance_id` column is replaced by `trajectory_id` but the other four columns are identical. Loaders do not parse these tables at runtime; they are human-audit artifacts enforced only by pre-commit schema check `scripts/check_licenses_md.py`.

	### 3.5 Invariants (enforced by tests)

	1. Every string value in every loaded library is NFC (`unicodedata.is_normalized("NFC", s) == True`).
	2. `load_drift_patterns()` returns exactly 20 patterns (matches `docs/modules/drift_injector.md` §4.4 and DESIGN.md §6.3).
	3. `load_api_schemas()` returns exactly `{airline:v1,v2,v3 + cab:v1,v2,v3 + restaurant:v1,v2,v3 + hotel:v1,v2,v3 + payment:v1,v2}` = 14 schemas across 5 domains (matches DESIGN.md §8.6 bundle enumeration and §5 vendor catalogue).
	4. `load_templates()` library satisfies: every template has ≥ 1 variant in every `LanguageCode` (`hi`, `ta`, `kn`, `en`, `hinglish`); every primary-domain pattern's `mutation` field set is a subset of the union of `drift_slot_tags` across that domain's templates. The two transversal payment-auth patterns (`payment.auth_scope_upgrade`, `payment.mfa_required`) are EXEMPT from this subset check — they mutate shared payment fields (`token`, `scope`, `mfa_code`) that are intentionally not present in primary-domain goal templates and therefore cannot appear in `drift_slot_tags`.
	5. Publication invariant: train seed set ∩ val seed set = ∅.
	6. Publication invariant: running `data_export.py` twice with the same seed produces byte-identical `train/briefs.jsonl` + `val/briefs.jsonl` (SHA-256 match). Enforced via canonical JSON dump (§3.1): `json.dumps(row, ensure_ascii=False, sort_keys=True, separators=(",", ":"))`.
	7. Every hand-authored YAML or JSON file in `data/` carries an SPDX license header (a leading YAML comment block, or a root-level JSON `$comment`).
	8. No 10-digit digit-run in any authored string outside the timestamp / IATA allowed contexts (PII guard).
	9. Per-row data lineage. Every `BriefRow` (§4.7) in the published `train/` and `val/` splits carries all of: `template_id`, `seed`, `stage`, `language`, `domain`, `generator_version`, `catalogue_hash`, `templates_sha256`, `i18n_sha256`. At eval-load time, `catalogue_hash` / `templates_sha256` / `i18n_sha256` must match the currently-loaded library hashes, else `CatalogueHashMismatchError` is raised (§5).
	10. Bundle immutability. After publication (§2.4), the train split SHA-256 MUST match across all future re-runs of `hf upload`; adding val rows requires a minor-version bump, never a silent train-split mutation.

	---

	## 4. Data Structures

	All types are frozen dataclasses, immutable after load. Mappings are wrapped in `types.MappingProxyType`.
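
A minimal sketch of what that immutability looks like in practice, using a hypothetical `ExampleLibrary` type that is not one of the real types defined below:

```python
from dataclasses import FrozenInstanceError, dataclass
from types import MappingProxyType
from typing import Mapping


@dataclass(frozen=True)
class ExampleLibrary:
    """Hypothetical stand-in for a loaded library."""
    strings: Mapping[str, str]
    source_sha256: str


def freeze(raw: dict[str, str], sha: str) -> ExampleLibrary:
    # Copy first so later mutation of the caller's dict cannot leak through the proxy.
    return ExampleLibrary(strings=MappingProxyType(dict(raw)), source_sha256=sha)


lib = freeze({"BLR": "Bengaluru"}, sha="0" * 64)

attribute_write_blocked = False
item_write_blocked = False

try:
    lib.source_sha256 = "changed"  # frozen dataclass rejects attribute assignment
except FrozenInstanceError:
    attribute_write_blocked = True

try:
    lib.strings["BLR"] = "x"  # read-only proxy rejects item assignment
except TypeError:
    item_write_blocked = True
```

`frozen=True` alone only blocks attribute rebinding; the `MappingProxyType` wrapper is what prevents in-place mutation of the mapping itself.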

	### 4.1 `TemplateLibrary` (re-exported from `task_generator.models` — single source of truth)

	```python
	@dataclass(frozen=True)
	class TemplateLibrary:
	    # exactly 20 at v1.0 (4 domains × 5 templates); ≥ 20 after minor-version bumps
	    templates: tuple[Template, ...]
	    cities_by_domain: Mapping[Domain, tuple[str, ...]]  # 10 per domain
	    i18n: Mapping[LanguageCode, Mapping[str, str]]  # merged from i18n.yaml
	    source_sha256: str  # hash of templates.yaml bytes
	```

	The `templates` tuple length is exactly 20 at v1.0 publication (4 domains × 5 templates per domain). Post-v1.0 minor-version bumps may grow this count monotonically; the invariant `len(templates) >= 20` and `len(templates) % 5 == 0` holds across all future versions. `load_templates` asserts `len(templates) == 20` at v1.0 via the `generator_version` check.

	Authoritative schema lives in `docs/modules/task_generator.md` §4. This module re-exports the type so callers of `load_templates` receive the same object that `task_generator.generate` consumes.

	### 4.2 `I18nLibrary`

	```python
	@dataclass(frozen=True)
	class I18nLibrary:
	    strings: Mapping[LanguageCode, Mapping[str, str]]
	    # e.g., strings["hi"]["BLR"] = "बेंगलुरु"
	    #       strings["ta"]["Monday"] = "திங்கள்"
	    source_sha256: str
	```

	Merged into `TemplateLibrary.i18n` by `load_templates`, but exposed independently for pure-i18n use cases (e.g., the Gradio demo UI localizing labels).

	### 4.3 `DriftPatternLibrary`

	```python
	@dataclass(frozen=True)
	class DriftPatternLibrary:
	    patterns: Mapping[str, DriftPattern]  # keyed by DriftPattern.id
	    by_domain: Mapping[str, tuple[str, ...]]  # domain → pattern_ids
	    by_type: Mapping[str, tuple[str, ...]]  # drift_type → pattern_ids
	    source_sha256: str
	```

	`DriftPattern` itself is defined in `docs/modules/drift_injector.md` §4.2 (see the `DriftPattern` dataclass snippet). This module owns loading, `drift_injector` owns applying.

	### 4.4 `APISchemaRegistry`

	```python
	@dataclass(frozen=True)
	class APISchema:
	    domain: str  # "airline" | "cab" | "restaurant" | "hotel" | "payment"
	    version: str  # "v1" | "v2" | "v3"
	    schema: Mapping[str, Any]  # parsed JSON Schema 2020-12 document
	    source_sha256: str

	@dataclass(frozen=True)
	class APISchemaRegistry:
	    schemas: Mapping[str, Mapping[str, APISchema]]
	    # schemas["airline"]["v2"] = APISchema(...)

	    def get(self, domain: str, version: str) -> APISchema: ...
	    def versions(self, domain: str) -> tuple[str, ...]: ...  # ordered v1,v2,v3
	```

	Each `v<N>.json` is a valid JSON Schema 2020-12 document describing the tool-response shape for that domain at that drift version. Vendors (DESIGN.md §5) validate outgoing responses against these schemas at test time; the injector (`docs/modules/drift_injector.md` §3) consults version transitions via these files.
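
The two registry methods are left as stubs above. One plausible way to fill them in is sketched here; raising `KeyError` on an unknown domain or version, and sorting versions numerically, are assumptions rather than documented behavior.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Any, Mapping


@dataclass(frozen=True)
class APISchema:
    domain: str
    version: str
    schema: Mapping[str, Any]
    source_sha256: str


@dataclass(frozen=True)
class APISchemaRegistry:
    schemas: Mapping[str, Mapping[str, APISchema]]

    def get(self, domain: str, version: str) -> APISchema:
        # Assumption: an unknown domain/version surfaces as KeyError.
        return self.schemas[domain][version]

    def versions(self, domain: str) -> tuple[str, ...]:
        # Sort on the integer after "v" so a future "v10" would follow "v9".
        return tuple(sorted(self.schemas[domain], key=lambda v: int(v[1:])))


# Build a tiny registry with the payment versions inserted out of order.
payment = {
    v: APISchema("payment", v, MappingProxyType({"type": "object"}), "0" * 64)
    for v in ("v2", "v1")
}
registry = APISchemaRegistry(MappingProxyType({"payment": MappingProxyType(payment)}))
payment_versions = registry.versions("payment")
```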

	### 4.5 `AudioManifest`

	```python
	@dataclass(frozen=True)
	class AudioClip:
	    utterance_id: str  # stable; matches a curated IndicVoices-R clip id
	    path: Path  # relative to data/audio/
	    language: LanguageCode
	    # manifest is authored-only; synth clips are lazily generated and NEVER recorded here
	    source: Literal["real_indicvoices_r"]
	    license: str  # SPDX identifier
	    sha256: str
	    duration_s: float  # ≤ 20.0 (DESIGN.md §9 upper bound)

	@dataclass(frozen=True)
	class AudioManifest:
	    clips: tuple[AudioClip, ...]
	    source_sha256: str  # hash of MANIFEST.jsonl bytes
	```

	The `source` field is a single-value `Literal` — the manifest is authored-only. Synth clips generated on-demand by `audio/tts_kokoro.py` are never recorded in the manifest (they are transient, gitignored under `data/audio/synth/`). This keeps the manifest auditable and its SHA-256 stable across Kokoro model-weight updates.

	### 4.6 `SFTCorpus` (L4, optional)

	```python
	@dataclass(frozen=True)
	class SFTTrajectory:
	    episode_id: int
	    goal_seed: int  # same seed space as train/; NEVER a val seed (§3.1)
	    turns: tuple[Mapping[str, Any], ...]  # role/content pairs, JSON-serializable
	    stage: Literal[1, 2, 3]
	    reward_breakdown: Mapping[str, float]  # R1..R5 + total, from the env at synthesis time
	    generation_batch_id: str  # uuid4 per invocation of sft_generator.py
	    generation_index: int  # monotonic within a batch, 0..N-1

	@dataclass(frozen=True)
	class SFTCorpus:
	    trajectories: tuple[SFTTrajectory, ...]
	    generator: Literal["sarvam-m-hf-inference"]
	    generation_seed: int
	    target_count: int  # from --target-count CLI flag
	    source_sha256: str
	```

	Consumed by `training/train_grpo.py` only when `--sft-warmup-steps > 0` is passed. The corpus is absent by default. If the file is missing, `load_sft_corpus` raises `DatasetFileMissingError` (§5); the training script treats that error as non-fatal and proceeds without warmup.

	Atomic append + restart recovery (`training/sft_generator.py`):

	- Each trajectory is appended to `data/sft_warmup/trajectories.jsonl` as a single canonical-JSON line followed by `os.fsync(fd)` on the file descriptor, ensuring durability before the next Sarvam-M API call. Partial-write recovery is therefore line-granular.
	- Every row carries `generation_batch_id` (uuid4, generated once per invocation of `sft_generator.py`) and `generation_index` (monotonic integer 0..N-1 within that batch).
	- On restart, `sft_generator.py` reads the existing `trajectories.jsonl`, reconstructs `(seed, generation_index)` pairs already completed, and resumes from the next uncompleted seed in its deterministic seed list. This tolerates Sarvam-M rate-limit drops, OOM kills, and SIGKILL.
	- After all generation completes, the script performs a final count validation: if `len(trajectories) != target_count`, it raises `PartialSFTCorpusError` (§5). The loader `load_sft_corpus` also performs this check at load time and raises the same error if the on-disk row count does not match the `target_count` field.
	- Edge case #11 (§7) walks through a concrete partial-generation-recovery scenario.
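
The append-and-resume behavior described above can be sketched as follows. This is a simplified illustration: it tracks completed work by `goal_seed` only (the real script reconstructs `(seed, generation_index)` pairs), and the helper names are hypothetical.

```python
import json
import os
from pathlib import Path


def append_trajectory(path: Path, row: dict) -> None:
    """Append one canonical-JSON line and fsync before returning."""
    line = json.dumps(row, ensure_ascii=False, sort_keys=True, separators=(",", ":")) + "\n"
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(line)
        fh.flush()             # push Python's buffer to the OS
        os.fsync(fh.fileno())  # ask the OS to persist it to disk


def completed_seeds(path: Path) -> set[int]:
    """Seeds already on disk; a torn final line (crash mid-write) is ignored."""
    done: set[int] = set()
    if not path.exists():
        return done
    for line in path.read_text(encoding="utf-8", errors="replace").splitlines():
        try:
            done.add(json.loads(line)["goal_seed"])
        except (json.JSONDecodeError, KeyError, TypeError):
            continue
    return done


def remaining_seeds(seed_list: list[int], path: Path) -> list[int]:
    """Deterministic seed list minus what is already on disk, order preserved."""
    done = completed_seeds(path)
    return [s for s in seed_list if s not in done]
```

A production version would also truncate a torn final line before appending again, so that the next row does not get concatenated onto the partial one.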

	### 4.7 `BriefRow` — canonical publication-row contract

	Every line of `train/briefs.jsonl` and `val/briefs.jsonl` in the published HF Hub bundle is exactly one serialized `BriefRow`. This dataclass is the single-source-of-truth schema for everything an offline consumer (a judge re-running eval, a third party reproducing our experiments) needs to re-derive the episode from `(seed, library@hash)` alone.

	```python
	from __future__ import annotations
	from dataclasses import dataclass
	from typing import Literal
	from driftcall.models import GoalSpec, DriftEvent, LanguageCode

	@dataclass(frozen=True)
	class BriefRow:
	    episode_id: str  # deterministic from seed + stage (e.g. "s2_ep_00000042")
	    # original episode seed (train: [0, 20_000_000), val: [20_000_000, 20_000_500))
	    seed: int
	    stage: Literal[1, 2, 3]  # curriculum stage at publication time
	    language: LanguageCode  # "hi" | "ta" | "kn" | "en" | "hinglish"
	    domain: Literal["airline", "cab", "restaurant", "hotel"]
	    template_id: str  # e.g. "airline.book.budget_timewindow"
	    goal: GoalSpec  # full GoalSpec (slots + constraints + seed_utterance)
	    drift_schedule: tuple[DriftEvent, ...]  # schedule pre-computed by drift_injector
	    catalogue_hash: str  # sha256(drift_patterns/drifts.yaml bytes)
	    templates_sha256: str  # sha256(task_briefs/templates.yaml bytes)
	    i18n_sha256: str  # sha256(task_briefs/i18n.yaml bytes)
	    generator_version: str  # e.g. "driftcall-1.0.0" — semver of the generator
	    created_ts_ist: str  # ISO 8601 with +05:30 offset, e.g. "2026-04-25T10:30:00+05:30"
	```

	Serialization is always canonical: `json.dumps(asdict(row), ensure_ascii=False, sort_keys=True, separators=(",", ":"))`. A concrete JSONL line example is given in §8.5.

	At eval-load time, the loader re-hashes the currently-loaded `drifts.yaml` / `templates.yaml` / `i18n.yaml` and compares against `catalogue_hash` / `templates_sha256` / `i18n_sha256`. Any mismatch raises `CatalogueHashMismatchError` (§5) — this prevents silent semantic drift where a consumer runs `train/briefs.jsonl` against a newer catalogue and gets different episodes.
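
A minimal sketch of that comparison, assuming the caller already holds the raw bytes of the three currently loaded files. The exception class is a local stand-in for the one in §5, and `check_row_hashes` is a hypothetical helper name.

```python
import hashlib


class CatalogueHashMismatchError(Exception):
    """Local stand-in for the exception named in §5."""


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def check_row_hashes(row: dict, loaded: dict[str, bytes]) -> None:
    """Compare a row's three lineage hashes against the currently loaded file bytes."""
    expected = {
        "catalogue_hash": sha256_hex(loaded["drifts.yaml"]),
        "templates_sha256": sha256_hex(loaded["templates.yaml"]),
        "i18n_sha256": sha256_hex(loaded["i18n.yaml"]),
    }
    for field, current in expected.items():
        if row[field] != current:
            raise CatalogueHashMismatchError(
                f"{field}: row has {row[field]}, loaded library has {current}"
            )


loaded_files = {"drifts.yaml": b"patterns-v1", "templates.yaml": b"tpl-v1", "i18n.yaml": b"i18n-v1"}
published_row = {
    "catalogue_hash": sha256_hex(b"patterns-v1"),
    "templates_sha256": sha256_hex(b"tpl-v1"),
    "i18n_sha256": sha256_hex(b"i18n-v1"),
}
check_row_hashes(published_row, loaded_files)  # matching hashes: no error
```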

	---

	## 5. Error Modes

	All exceptions subclass `DatasetError(Exception)`. Each failure condition maps to exactly one exception type, and each type is unit-tested.

	| Exception | Trigger | Where raised |
	|---|---|---|
	| `DatasetFileMissingError` | `data/<path>` absent on disk | every loader |
	| `MalformedYAMLError` | YAML parse failure (syntax) | `load_templates`, `load_i18n`, `load_drift_patterns` |
	| `MalformedJSONError` | JSON parse failure (syntax) | `load_api_schemas`, `load_audio_manifest`, `load_sft_corpus` |
	| `DatasetSchemaError` | type/shape validation failure (missing required key, wrong type, extra unknown key) | every loader |
	| `UnknownLanguageKeyError` | a language key ∉ `LanguageCode = {"hi","ta","kn","en","hinglish"}` appears in `templates.yaml` or `i18n.yaml` | `load_templates`, `load_i18n` |
	| `LicenseConflictError` | a CC-BY-SA or GPL-licensed row appears in publication bundle while bundle is Apache-2.0; or verbatim ≥ 10-token suffix matches a CC-BY-SA upstream row | publication script (see §3.4) |
	| `TrainValLeakError` | train and val seed sets intersect; or an `SFTTrajectory.goal_seed` sits in the val reserved range `[20_000_000, 20_000_500)` | publication script, `load_sft_corpus` |
	| `DriftPatternOrphanError` | `drift_patterns.yaml` references a `from_version`/`to_version` not present in `data/api_schemas/<domain>/` | `load_drift_patterns` |
	| `ChecksumMismatchError` | `AudioClip.sha256` does not match the on-disk file's hash | `load_audio_manifest` |
	| `UnicodeNFDError` | any loaded string fails `unicodedata.is_normalized("NFC", s)` | every loader |
	| `PIIDetectedError` | a 10-digit run appears outside allowed contexts in authored text | every text-bearing loader; also CI lint |
	| `DuplicateDriftPatternIdError` | two entries in `drifts.yaml` share an `id` | `load_drift_patterns` |
	| `CatalogueHashMismatchError` | a `BriefRow` in `train/briefs.jsonl` or `val/briefs.jsonl` carries `catalogue_hash` / `templates_sha256` / `i18n_sha256` that does not match the currently-loaded library (drifts.yaml / templates.yaml / i18n.yaml) hashes | eval-load path (consumers of published bundle) |
	| `PartialSFTCorpusError` | `len(SFTCorpus.trajectories) != target_count` at final-count validation; raised by `training/sft_generator.py` post-generation and by `load_sft_corpus` at load time | `load_sft_corpus`, `training/sft_generator.py` |

	No silent fallbacks. If `data/sft_warmup/trajectories.jsonl` is missing, `load_sft_corpus` raises `DatasetFileMissingError`; the training script is the one that treats this as non-fatal (falls back to no-SFT warmup). Loaders themselves never substitute defaults.
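
The split of responsibility described here (the loader raises, the caller decides) might be sketched like this. The `load_sft_corpus` below is a stand-in that always reports a missing file, and `resolve_warmup` is a hypothetical name for the training-side policy.

```python
class DatasetError(Exception):
    """Base class for every error in the table above."""


class DatasetFileMissingError(DatasetError):
    pass


def load_sft_corpus(path: str):
    # Stand-in loader: always reports the file as missing, to exercise the fallback path.
    raise DatasetFileMissingError(path)


def resolve_warmup(sft_warmup_steps: int, path: str = "data/sft_warmup/trajectories.jsonl"):
    """Caller-side policy: a missing optional corpus disables warmup; other errors propagate."""
    if sft_warmup_steps <= 0:
        return None  # warmup not requested, so the loader is never called
    try:
        return load_sft_corpus(path)
    except DatasetFileMissingError:
        return None  # proceed without SFT warmup


corpus = resolve_warmup(sft_warmup_steps=100)
```

Catching only `DatasetFileMissingError` keeps schema or leak errors fatal, which matches the rule that loaders never substitute defaults.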

	---

	## 6. Dependencies

	### 6.1 Reads

	- `data/task_briefs/templates.yaml`, `data/task_briefs/i18n.yaml`
	- `data/drift_patterns/drifts.yaml`
	- `data/api_schemas/*/*.json`
	- `data/audio/real/MANIFEST.jsonl` + the `.wav` files it references
	- `data/sft_warmup/trajectories.jsonl` (optional)

	### 6.2 Imports

	- `driftcall.models` — `GoalSpec`, `LanguageCode`, `Domain`
	- Python stdlib: `json`, `hashlib`, `pathlib`, `unicodedata`, `threading`, `dataclasses`, `typing`, `types`
	- Third-party: `PyYAML`, `jsonschema` (for JSON Schema 2020-12 meta-validation)

	### 6.3 Consumers

	Consuming modules and the exact function they call:

	- `docs/modules/task_generator.md` — `load_templates()` in `task_generator.generate()`'s lazy-singleton `_get_library()`.
	- `docs/modules/drift_injector.md` — `load_drift_patterns()` in the injector's module-level registry; consults DESIGN.md §6.3 pattern catalogue.
	- `docs/modules/vendors.md` — `load_api_schemas()` at vendor import time; each vendor asserts its own response shape against the schema in test fixtures.
	- `docs/modules/audio.md` — `load_audio_manifest()` for the pitch demo (§9.5 IndicVoices-R clip playback).
	- `docs/modules/training.md` — `load_sft_corpus()` behind `--sft-warmup-steps` flag; also invokes `training/data_export.py` which calls `task_generator.enumerate_variants()` to produce the publication briefs.

	### 6.4 Publishes to

	- HF Hub dataset repo `<team>/driftcall-indic-briefs` (one-time, pre-event, Phase C5 per `DRIFTCALL/CLAUDE.md` §4.1).

	### 6.5 Non-dependencies (explicit)

	- Does not import from `env.py`, `rewards.py`, `app.py`, or the training entrypoint. Pure data layer.
	- Does not hit the network at runtime. Every file is local. Publication script is a separate, explicitly-invoked entrypoint.
	- Does not depend on GPU, CUDA, or PyTorch. CPU-only.

	---

	## 7. Edge Cases

	1. Missing template variant for a rare language. `templates.yaml` is authored with `hinglish` + `hi` + `en` + `ta` but an author forgets `kn` for one template. `load_templates` runs per-template check `set(variants.keys()) == LanguageCode.values` and raises `DatasetSchemaError: template 'restaurant.order.veg' missing language_variants['kn']`. The generator's `NoVariantForLanguageError` (task_generator.md §5) never has a chance to fire because loading fails first. Fix: author supplies the missing variant; loader re-runs.

	2. Unicode NFD in author contribution. A collaborator pastes a Kannada weekday name from macOS clipboard (NFD by default for composed characters). `load_i18n` re-normalizes to NFC before equality/hashing; the assertion `unicodedata.is_normalized("NFC", value)` runs post-normalization as a defense against Python/ICU bugs and is expected never to fail. In practice the round-trip succeeds and NFC is stored. The pre-commit hook separately catches NFD at commit time so CI never sees it.

	3. License incompatibility (CC-BY-SA row smuggled into an Apache-2.0 bundle). An author, inspired by an SGD row, copies 20 tokens verbatim into a template variant. Publication CI runs a suffix-array check over cached SGD/MTOP exports looking for ≥ 10-token verbatim matches; on hit, `LicenseConflictError("variant in 'airline.book.budget_timewindow' matches SGD row sgd_5432:0 (≥ 10 tokens)")` raises. Fix: rewrite the variant. We keep only inspiration, never verbatim text, from CC-BY-SA sources. The threshold (10 tokens) is a pragmatic choice: below that length we treat overlap as incidental linguistic reuse; at or above we flag.

	4. Empty language cohort in a stage mix. A future curriculum config passes `language_weights = {"en": 1.0, "hi": 0.0, "ta": 0.0, "kn": 0.0, "hinglish": 0.0}`. This is valid at the task-generator level (task_generator.md §3.2 — non-negative weights summing to 1 are legal). `datasets` does not re-validate curriculum config; it only asserts the library has variants for all 5 languages. Downstream (`task_generator`) will simply never draw `hi`/`ta`/`kn`/`hinglish`. No error in this module.

	5. Train/val episode-id collision at publication time. `data_export.py` draws 15,000 seeds for train and 500 for val. If the RNG accidentally maps a train seed into `[20_000_000, 20_000_500)` (the val reserved range) — which cannot happen given the seed-space partitioning in §3.1 — the assertion `train_seeds.isdisjoint(val_seeds)` raises `TrainValLeakError` with the offending seed. Safeguard: train seeds are drawn from `[0, 20_000_000)` and val seeds from `[20_000_000, 20_000_500)`. The two ranges are non-overlapping by construction; the assertion is a defense against future range edits.

	6. Drift-pattern-id orphan (trace references pattern not in YAML). A test fixture or cached trace references `drift_pattern_id='airline.mysterious_fee'` but `drift_patterns.yaml` has no such entry (it was renamed or removed). `load_drift_patterns` does not look at traces; it only checks internal consistency. The trace consumer (`rewards.r2_drift_detection` in `docs/modules/rewards.md`) raises `UnknownDriftPatternError` at scoring time per drift_injector.md §5. If the orphan is discovered during dataset publication, the publication script emits `DriftPatternOrphanError` and aborts.

	7. JSON Schema file that is valid JSON but not valid JSON Schema 2020-12. `data/api_schemas/cab/v3.json` is hand-edited and accidentally drops the `$schema` keyword or uses an unknown keyword. `load_api_schemas` runs `jsonschema.Draft202012Validator.check_schema(schema)` and on failure raises `DatasetSchemaError("cab/v3.json: not a valid JSON Schema 2020-12: <error>")`. The env refuses to serve `reset()` until fixed.

	8. Audio clip on disk does not match manifest sha256. `data/audio/real/MANIFEST.jsonl` lists `kn_greeting_03.wav` with `sha256=abc...`. The file gets re-encoded (e.g., by a well-intentioned ffmpeg pass). `load_audio_manifest` re-hashes every referenced WAV and raises `ChecksumMismatchError("kn_greeting_03.wav: expected abc..., got def...")`. Fix: either revert the WAV, or regenerate the manifest after an audit trail commit.

	9. SFT corpus contains a val-reserved seed. Sarvam-M synthesis inadvertently uses a seed in `[20_000_000, 20_000_500)`. `load_sft_corpus` raises `TrainValLeakError`. The training script may be configured to treat this as fatal (default) or to filter out those trajectories (`--sft-tolerate-leak`); the loader itself always raises.

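	10. PyYAML silently deduplicating keys. If `drift_patterns.yaml` has two entries with the same `id`, the YAML parse still succeeds: both list entries load, and any id-keyed registry built from them would silently keep only one (PyYAML itself does the same for a duplicated mapping key, where the last value wins). `load_drift_patterns` builds a set of ids during validation and raises `DuplicateDriftPatternIdError` on collision, with both source line numbers.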

	11. Partial SFT corpus recovery (L4 restart). `training/sft_generator.py` is mid-run at trajectory 137 of a target 300 when the host OOM-kills the process (Sarvam-M inference peak memory). On restart, the script re-opens `data/sft_warmup/trajectories.jsonl`, reads the existing 137 rows (each fsync'd atomically per §4.6), reconstructs the completed `(generation_batch_id, generation_index)` pairs, and resumes from index 137 of the same batch. It does NOT start a new `generation_batch_id` — batch id is rehydrated from the last row. When generation finally reaches 300, the script validates `len(rows) == target_count`; if a Sarvam-M response was silently truncated (say, only 298 rows written), `PartialSFTCorpusError("expected 300, got 298")` is raised and the operator must decide whether to re-run the missing two or ship a corpus with a smaller `target_count`. `load_sft_corpus` performs the same count check at load time.
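The seed-space partition behind edge cases 5 and 9 can be sketched as follows. This is a minimal sketch under stated assumptions: the two ranges come from the text above, while `check_seed_partition` and the error-message wording are hypothetical.

```python
from typing import Iterable

# Ranges quoted in edge case 5: train is [0, 20_000_000), val is [20_000_000, 20_000_500).
TRAIN_SEEDS = range(0, 20_000_000)
VAL_SEEDS = range(20_000_000, 20_000_500)


class TrainValLeakError(ValueError):
    """Raised when a seed crosses the train/val boundary."""


def check_seed_partition(train: Iterable[int], val: Iterable[int]) -> None:
    """Raise TrainValLeakError if either split strays outside its reserved range."""
    train_set, val_set = set(train), set(val)
    for seed in sorted(train_set):
        if seed not in TRAIN_SEEDS:  # O(1) membership test on a range
            raise TrainValLeakError(f"train seed {seed} is outside the train range")
    for seed in sorted(val_set):
        if seed not in VAL_SEEDS:
            raise TrainValLeakError(f"val seed {seed} is outside the val range")
    # Redundant while the ranges stay disjoint; kept as a guard against range edits.
    if not train_set.isdisjoint(val_set):
        raise TrainValLeakError(f"seed {min(train_set & val_set)} appears in both splits")


check_seed_partition([0, 42, 19_999_999], VAL_SEEDS)  # passes silently
```

A train seed such as `20_000_007` would be rejected by the first loop, before the disjointness guard is ever reached.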

	---

	## 8. Examples

	### 8.1 Full `templates.yaml` entry for `airline.book.budget_timewindow`

	```yaml
	# SPDX-License-Identifier: Apache-2.0
	# Copyright 2026 DriftCall Team
	# Derived-from: AmazonScience/MASSIVE (intent taxonomy, Apache-2.0)

	- template_id: airline.book.budget_timewindow
	  domain: airline
	  intent: book_flight
	  min_stage: 1
	  required_slots: [from, to, when]
	  optional_slots: [seat_pref]
	  constraints_template:
	    budget_inr:
	      distribution: uniform
	      low: 3000
	      high: 15000
	      step: 500
	    time_window:
	      choices: [morning, afternoon, evening, late_night]
	  drift_slot_tags: [price, total_fare_inr]
	  # Language keys: ISO short codes matching LanguageCode = Literal["hi","ta","kn","en","hinglish"]
	  language_variants:
	    hinglish:
	      - "Bhai {when} ko {to} jaana hai, cheapest flight {time_window} mein, {budget_inr} rupees max"
	      - "{when} ko {from} se {to} ka ticket book kar de, under {budget_inr}, {time_window} ke baad"
	    hi:
	      - "मुझे {when} को {from} से {to} जाना है, {budget_inr} रुपये से कम में"
	    ta:
	      - "{when} அன்று {from} லிருந்து {to} க்கு டிக்கெட் வேண்டும், {budget_inr} ரூபாய்க்கு கீழ்"
	    kn:
	      - "{when} ರಂದು {from} ಇಂದ {to} ಗೆ ಅಗ್ಗದ ವಿಮಾನ ಟಿಕೆಟ್ ಬೇಕು, {budget_inr} ರೂಪಾಯಿಗಳ ಒಳಗೆ"
	    en:
	      - "Book the cheapest flight from {from} to {to} on {when}, budget under ₹{budget_inr}, departing {time_window}"
	```

	This is the single source-of-truth entry for the Stage-1 airline booking template; it mirrors DESIGN.md §8.3 and `docs/modules/task_generator.md` §4.1.

	### 8.2 Full `drift_patterns.yaml` entry for `airline.price_rename`

	```yaml
	# SPDX-License-Identifier: Apache-2.0
	# Copyright 2026 DriftCall Team

	- id: airline.price_rename
	  drift_type: schema
	  domain: airline
	  from_version: v1
	  to_version: v2
	  description: "field 'price' renamed to 'total_fare_inr'; 'currency' removed"
	  mutation:
	    rename: {price: total_fare_inr}
	    remove: [currency]
	  detection_hints:
	    - "total_fare_inr"
	    - "price"
	    - "rename"
	```

	`load_drift_patterns` will (a) parse this, (b) check `id` uniqueness, (c) confirm `from_version=v1` + `to_version=v2` both exist as `data/api_schemas/airline/v1.json` + `data/api_schemas/airline/v2.json`, (d) confirm `detection_hints` is non-empty, (e) wrap `mutation` in `MappingProxyType`. Matches `docs/modules/drift_injector.md` §4.3 byte-for-byte.
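Steps (b) through (e) can be sketched on an already-parsed entry as below. This is an illustration under assumptions, not the loader itself: `validate_drift_patterns` is a hypothetical name, the `known_versions` mapping stands in for the on-disk check of `data/api_schemas/<domain>/<version>.json`, and the error messages are invented.

```python
from types import MappingProxyType
from typing import Mapping, Sequence, Set, Tuple


class DuplicateDriftPatternIdError(ValueError):
    pass


class DatasetSchemaError(ValueError):
    pass


def validate_drift_patterns(
    entries: Sequence[dict], known_versions: Mapping[str, Set[str]]
) -> Tuple[Mapping, ...]:
    """Check id uniqueness, version existence, and hints; freeze each mutation."""
    seen = set()
    frozen = []
    for entry in entries:
        pattern_id = entry["id"]
        if pattern_id in seen:                                   # step (b)
            raise DuplicateDriftPatternIdError(pattern_id)
        seen.add(pattern_id)
        versions = known_versions.get(entry["domain"], set())
        for key in ("from_version", "to_version"):               # step (c)
            if entry[key] not in versions:
                raise DatasetSchemaError(f"{pattern_id}: {key}={entry[key]!r} has no schema")
        if not entry.get("detection_hints"):                     # step (d)
            raise DatasetSchemaError(f"{pattern_id}: detection_hints is empty")
        copy = dict(entry)
        copy["mutation"] = MappingProxyType(dict(entry["mutation"]))  # step (e)
        frozen.append(MappingProxyType(copy))
    return tuple(frozen)


entry = {
    "id": "airline.price_rename",
    "drift_type": "schema",
    "domain": "airline",
    "from_version": "v1",
    "to_version": "v2",
    "mutation": {"rename": {"price": "total_fare_inr"}, "remove": ["currency"]},
    "detection_hints": ["total_fare_inr", "price", "rename"],
}
patterns = validate_drift_patterns([entry], {"airline": {"v1", "v2"}})
```

Note that `MappingProxyType` only freezes the top level of `mutation`; the nested `rename` dict and `remove` list remain mutable in this sketch.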

	### 8.3 `data/api_schemas/airline/v2.json`

	```json
	{
	  "$schema": "https://json-schema.org/draft/2020-12/schema",
	  "$id": "https://driftcall.dev/schemas/airline/v2.json",
	  "$comment": "SPDX-License-Identifier: Apache-2.0. v2 = post-price_rename drift (DESIGN.md §5.1).",
	  "title": "Airline search result (v2)",
	  "type": "object",
	  "required": ["flight_id", "from", "to", "depart", "total_fare_inr", "seats_left"],
	  "additionalProperties": false,
	  "properties": {
	    "flight_id": {"type": "string", "pattern": "^[0-9A-Z]{2}-[0-9]{4}$"},
	    "from": {"type": "string", "pattern": "^[A-Z]{3}$"},
	    "to": {"type": "string", "pattern": "^[A-Z]{3}$"},
	    "depart": {"type": "string", "format": "date-time"},
	    "total_fare_inr": {"type": "integer", "minimum": 0},
	    "seats_left": {"type": "integer", "minimum": 0}
	  }
	}
	```

	Note that `price` and `currency` from v1 are absent (drift `airline.price_rename` applied). Vendors (`docs/modules/vendors.md`) validate their emitted `airline.search` responses against whichever version the injector has installed in `state.schema_versions['airline']`. This schema also serves as the R2 structural detection surface: a tool call that keys into `price` after drift raises `KeyError` or gets a 422 response, which is a detection-positive signal per DESIGN.md §7.1 R2.
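To make the v1 to v2 transition concrete, the sketch below applies the `rename` and `remove` keys from the `airline.price_rename` mutation to a sample record. The `apply_mutation` helper and the v1 record values are illustrative assumptions; how the real injector interprets `mutation` is defined in `docs/modules/drift_injector.md`, not here.

```python
def apply_mutation(record: dict, mutation: dict) -> dict:
    """Return a copy of `record` with fields renamed, then removed."""
    out = dict(record)  # never mutate the caller's record
    for old, new in mutation.get("rename", {}).items():
        if old in out:
            out[new] = out.pop(old)
    for field in mutation.get("remove", []):
        out.pop(field, None)
    return out


# Hypothetical v1 search result; only `price` and `currency` differ from v2.
v1_record = {
    "flight_id": "6E-2041",
    "from": "HYD",
    "to": "BLR",
    "depart": "2026-04-30T18:05:00+05:30",
    "price": 7400,
    "currency": "INR",
    "seats_left": 9,
}
mutation = {"rename": {"price": "total_fare_inr"}, "remove": ["currency"]}
v2_record = apply_mutation(v1_record, mutation)

try:
    v2_record["price"]  # an agent still using the v1 field name
    stale_lookup_failed = False
except KeyError:
    stale_lookup_failed = True
```

After the mutation, the record's key set equals the `required` list of the v2 schema, and the stale `price` lookup fails, which is the structural signal the paragraph above describes.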

	### 8.4 `MANIFEST.jsonl` row for a curated IndicVoices-R clip (L3)

	```json
	{"utterance_id": "iv_r_kn_0451", "path": "real/kn/iv_r_kn_0451.wav", "language": "kn", "source": "real_indicvoices_r", "license": "Apache-2.0", "sha256": "b7f1a9c2e5d4...", "duration_s": 4.82}
	```

	Referenced only by the pitch demo. Training never touches this file — DRIFTCALL/CLAUDE.md §9 "Do not put TTS/ASR in the training loop".
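A manifest check of the kind `load_audio_manifest` performs (re-hash each referenced file and compare against the `sha256` field) might look like the sketch below. `verify_manifest`, `sha256_file`, the `audio_root` parameter, and the `demo` helper are hypothetical; the demo writes placeholder bytes to a temporary directory rather than touching real clips.

```python
import hashlib
import json
import tempfile
from pathlib import Path


class ChecksumMismatchError(ValueError):
    pass


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large WAVs are never fully held in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path: Path, audio_root: Path) -> int:
    """Re-hash every file listed in the JSONL manifest; return the number checked."""
    checked = 0
    with manifest_path.open(encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            row = json.loads(line)
            actual = sha256_file(audio_root / row["path"])
            if actual != row["sha256"]:
                raise ChecksumMismatchError(
                    f"{row['path']}: expected {row['sha256']}, got {actual}"
                )
            checked += 1
    return checked


def demo() -> tuple:
    """Build a one-row manifest, verify it, then tamper with the clip."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        clip = root / "real" / "kn" / "iv_r_kn_0451.wav"
        clip.parent.mkdir(parents=True)
        clip.write_bytes(b"placeholder audio bytes")
        row = {"path": "real/kn/iv_r_kn_0451.wav", "sha256": sha256_file(clip)}
        manifest = root / "MANIFEST.jsonl"
        manifest.write_text(json.dumps(row) + "\n", encoding="utf-8")
        checked = verify_manifest(manifest, root)
        clip.write_bytes(b"re-encoded bytes")  # simulate the ffmpeg pass
        try:
            verify_manifest(manifest, root)
            tamper_detected = False
        except ChecksumMismatchError:
            tamper_detected = True
    return checked, tamper_detected
```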

	### 8.5 Canonical `BriefRow` JSONL line (single row from `train/briefs.jsonl`)

	One line from the published bundle — canonical JSON (sorted keys, no whitespace, UTF-8 preserved for Devanagari):

	```json
	{"catalogue_hash":"3f9a8e7c2b1d4e5f6a0b9c8d7e6f5a4b3c2d1e0f9a8b7c6d5e4f3a2b1c0d9e8f","created_ts_ist":"2026-04-25T10:30:00+05:30","domain":"airline","drift_schedule":[{"description":"'price' field renamed to 'total_fare_inr'","domain":"airline","drift_type":"schema","from_version":"v1","pattern_id":"airline.price_rename","to_version":"v2","turn":4}],"episode_id":"s2_ep_00000042","generator_version":"driftcall-1.0.0","goal":{"constraints":{"budget_inr":8000,"time_window":"evening"},"domain":"airline","intent":"book_flight","language":"hinglish","seed_utterance":"Bhai Friday ko Bangalore jaana hai, cheapest flight evening mein, 8000 rupees max","slots":{"from":"HYD","to":"BLR","when":"2026-04-30"}},"i18n_sha256":"a1b2c3d4e5f60718293a4b5c6d7e8f901234567890abcdef1234567890abcdef","language":"hinglish","seed":42,"stage":2,"template_id":"airline.book.budget_timewindow","templates_sha256":"b2c3d4e5f60718293a4b5c6d7e8f901234567890abcdef1234567890abcdef12"}
	```

	Note: keys are alphabetically sorted (`catalogue_hash`, `created_ts_ist`, `domain`, …), strings are NFC-normalized, and there is no whitespace between JSON tokens (spaces inside string values are preserved). The 64-hex hashes are full sha256 hex digests.
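One way to produce a line in this canonical form with the standard library is sketched below. `canonical_json` and `_nfc` are hypothetical helper names, and the exact serializer used by `training/data_export.py` is not specified in this section.

```python
import json
import unicodedata


def _nfc(obj):
    """Recursively NFC-normalize every string (keys and values) in a JSON-like value."""
    if isinstance(obj, str):
        return unicodedata.normalize("NFC", obj)
    if isinstance(obj, dict):
        return {_nfc(key): _nfc(value) for key, value in obj.items()}
    if isinstance(obj, list):
        return [_nfc(value) for value in obj]
    return obj


def canonical_json(row: dict) -> str:
    """Serialize with sorted keys, no inter-token whitespace, and raw UTF-8."""
    return json.dumps(
        _nfc(row),
        sort_keys=True,          # alphabetical keys at every nesting level
        separators=(",", ":"),   # drop the default ", " and ": " padding
        ensure_ascii=False,      # keep Devanagari as-is instead of \uXXXX escapes
    )


line = canonical_json({"seed": 42, "language": "hi", "goal": {"b": 1, "a": "मुझे"}})
print(line)
```

Because the output is deterministic for a given row, hashing the resulting line gives a stable fingerprint across runs and machines.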

	### 8.6 `README.md` YAML frontmatter (HF Hub dataset card)

	The published `<org>/driftcall-indic-briefs/README.md` begins with the following YAML frontmatter. The HF Dataset Viewer reads this block to auto-configure splits, license, and task tags.

	```yaml
	---
	license: apache-2.0
	language: [hi, ta, kn, en]
	size_categories: [10K<n<100K]
	task_categories: [conversational, text-generation]
	pretty_name: DriftCall Indic Briefs
	configs:
	  - config_name: default
	    data_files:
	      - split: train
	        path: train/briefs.jsonl
	      - split: val
	        path: val/briefs.jsonl
	dataset_info:
	  features:
	    - { name: episode_id, dtype: string }
	    - { name: seed, dtype: int64 }
	    - { name: stage, dtype: int32 }
	    - { name: language, dtype: string }
	    - { name: domain, dtype: string }
	    - { name: template_id, dtype: string }
	  splits:
	    - { name: train, num_examples: 15000 }
	    - { name: val, num_examples: 500 }
	---
	```

	The body of `README.md` follows below the frontmatter: dataset description, licensing chain (full Apache-2.0 text is in the separate `LICENSE` file per §3.4), provenance (`generator_version`, `catalogue_hash`), reward-caveat paragraph, and usage example. The frontmatter's `features` block lists only the top-level flat columns; nested structs (`goal`, `drift_schedule`) are auto-inferred by the HF Datasets library on first load.

	---

	## 9. Open Questions

	1. HF org name not yet finalized. `<org>` placeholder in `<org>/driftcall-indic-briefs` depends on `DRIFTCALL/CLAUDE.md` §8 kickoff-checklist item "HF org name locked". The publication script parameterizes the org via `--hf-org`; no code change needed once locked, just a CLI arg at publication time. Does not block Phase D. Sync note: `DRIFTCALL/CLAUDE.md` §6 command table still lists the deprecated `huggingface-cli upload` — when the org name is locked, update that table to the modern `hf upload` in the same PR.

	2. SFT warmup corpus size — 200 vs 500 trajectories. DESIGN.md §8.2 row 4 quotes the range "200–500". The exact count depends on Sarvam-M's cost/latency budget during one-shot synthesis. Recommend 200 as a floor (sufficient for format priming per §10 training convergence target) and 500 as a ceiling if inference time permits. Resolution: Person C chooses during Phase C4; does not affect loader or schema.

	3. Audio manifest curation count. DESIGN.md §9 implies a handful of real IndicVoices-R clips for pitch demo realism, but does not specify exact count. Recommend 20 curated clips (4 per language × 5 languages), balanced by speaker gender and dialect region. Resolution: Person D curates during Phase C5; this module only ensures the manifest format is stable.

	### 9.1 Resolved

	- License-cache implementation (previously Open Q #4). `data/.license_cache/{sgd,mtop}.idx` is a sqlite3 FTS5 index built by `scripts/build_license_cache.py` at dev time. Schema: `CREATE VIRTUAL TABLE licensed_text USING fts5(chunk_text, source_id);` with 5-gram tokenization. CI invokes this index (read-only) on each PR to verify that no `seed_utterance` or template variant in the publication bundle substring-matches any upstream CC-BY-SA text (≥ 10-token threshold, §3.4). The index is built once per upstream corpus version and committed to the repo so re-builds are only needed when SGD or MTOP themselves publish a new version. Determinism + reviewability win over per-PR rebuild cost.
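The ≥ 10-token verbatim rule itself can be illustrated without the FTS5 index. The sketch below swaps in a plain in-memory set of 10-token shingles, which is a different technique from the sqlite3 index described above, and it assumes whitespace tokenization; `shingles` and `has_verbatim_overlap` are hypothetical names.

```python
from typing import Iterable, Set, Tuple


def shingles(text: str, n: int = 10) -> Set[Tuple[str, ...]]:
    """All runs of n consecutive whitespace-separated tokens in `text`."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def has_verbatim_overlap(candidate: str, licensed_texts: Iterable[str], n: int = 10) -> bool:
    """True if `candidate` shares any n-token run with any licensed text."""
    candidate_shingles = shingles(candidate, n)
    return any(candidate_shingles & shingles(text, n) for text in licensed_texts)


# Made-up stand-in for an upstream CC-BY-SA row (14 tokens).
licensed = "i would like to book a table for two people at seven tonight please"
tokens = licensed.split()

copied = "hello " + " ".join(tokens[:10]) + " thanks"   # contains a 10-token run
inspired = " ".join(tokens[:9]) + " friends"            # only 9 tokens in common

flag_copied = has_verbatim_overlap(copied, [licensed])
flag_inspired = has_verbatim_overlap(inspired, [licensed])
```

Under these assumptions the first candidate is flagged and the second is not, which matches the stated policy of treating overlap below 10 tokens as incidental reuse.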

	---

	*This doc tells you HOW the four dataset layers are shaped, loaded, validated, and published. Do not write loaders before a fresh critic returns `NOTHING_FURTHER`. Do not commit `data/.yaml` without the pre-commit NFC + PII + license-header guards running. Do not ship the HF Hub bundle without the train/val disjointness and verbatim-match checks green.*