# datasets — Four-Layer Dataset Strategy + HF Hub Publication

**Module path:** `driftcall/data/` (loaders) + `data/` (on-disk artifacts)
**Owner:** Person C (Training & Data)
**Implements:** DESIGN.md §8 (Dataset Strategy — §§8.1, 8.2, 8.3, 8.4, 8.5, 8.6)
**Consumed by:** `driftcall/task_generator.py` (L1), `driftcall/drift_injector.py` (L2 drift patterns), `driftcall/vendors/*.py` (L2 API schemas), `driftcall/audio/*.py` (L3 audio), `training/train_grpo.py` (L4 SFT warmup).
**Status:** Design spec — no code yet.

---

## 1. Purpose

`datasets` is the **authoring-and-loading contract** for every piece of static data DriftCall depends on. It is *not* a training-data-pipeline module (that is `training.md`) and it does *not* compose rewards (that is `rewards.md`). It does exactly four things:

1. **Defines** the four dataset layers per DESIGN.md §8.2 — task-brief templates, vendor API schemas + drift patterns, voice audio, and the optional SFT warmup corpus — as on-disk files with frozen YAML/JSON schemas.
2. **Loads** each file through a deterministic, lazy, singleton loader that NFC-normalizes all Indic strings at load time (cross-references `docs/modules/task_generator.md` §3.4 invariant #8).
3. **Validates** each file at load time: schema shape, type constraints, license header presence, train/val leakage, and consistency cross-references (e.g., every `drift_slot_tags` token in `templates.yaml` is targetable by ≥ 1 pattern in `drifts.yaml`).
4. **Publishes** the public-facing bundle `/driftcall-indic-briefs` to the Hugging Face Hub per DESIGN.md §8.6, packaged from a deterministic `enumerate_variants()` walk (see `docs/modules/task_generator.md` §2.2).

**No hand-authored file in `data/` is ever written at runtime.** All four layers are authored before Phase C ships. Runtime only reads. The only writers are offline or generated-artifact paths: the dataset-packaging script (`training/data_export.py`), which writes `train/briefs.jsonl` + `val/briefs.jsonl` once, pre-publication; the one-shot `training/sft_generator.py` (L4); and the lazily generated, gitignored synth clips under `data/audio/synth/` (see §2.2).

Supervision for GRPO comes from the 5 reward functions (DESIGN.md §8.1) — these files exist to parameterize the environment, not to teach the policy. L4 (SFT warmup) is optional per DESIGN.md §8.2 row 4.

---

## 2. Interface

### 2.1 Directory layout (on disk, shipped inside the env Docker image per DESIGN.md §11.1)

```
data/
├── task_briefs/
│   ├── templates.yaml        # L1 — hand-authored + procedural expansion source (§8.3)
│   └── i18n.yaml             # L1 — Indic localized strings (cities, weekdays, dish names)
├── drift_patterns/
│   └── drifts.yaml           # L2 — 20 drift patterns (§6.3, §8.2 row 2)
├── api_schemas/              # L2 — frozen JSON Schema per vendor per version
│   ├── airline/
│   │   ├── v1.json
│   │   ├── v2.json
│   │   └── v3.json
│   ├── cab/
│   │   ├── v1.json
│   │   ├── v2.json
│   │   └── v3.json
│   ├── restaurant/
│   │   ├── v1.json
│   │   ├── v2.json
│   │   └── v3.json
│   ├── hotel/
│   │   ├── v1.json
│   │   ├── v2.json
│   │   └── v3.json
│   └── payment/
│       ├── v1.json
│       └── v2.json
├── audio/                    # L3 — synthesized + real voice clips (§8.2 row 3, §9)
│   ├── synth/                # Kokoro-82M output, generated lazily; gitignored
│   │   └── .gitkeep
│   ├── real/                 # AI4Bharat IndicVoices-R held-out subset for pitch demo
│   │   └── MANIFEST.jsonl    # (utterance_id, path, language, license, sha256)
│   └── LICENSES.md           # per-clip license attribution
└── sft_warmup/               # L4 — optional Sarvam-M synthesized trajectories (§8.2 row 4)
    ├── trajectories.jsonl    # 200–500 correct rollouts
    └── LICENSES.md
```

**Publication structure** (HF Hub dataset repo `/driftcall-indic-briefs`, DESIGN.md §8.6):

```
driftcall-indic-briefs/
├── README.md                 # model card — provenance, license, stats, reward caveats
├── train/briefs.jsonl        # 15,000 sampled episodes (seed, stage, language_weights, GoalSpec)
├── val/briefs.jsonl          # 500 held-out episodes — seeds disjoint from train
├── drift_patterns.yaml       # exact copy of data/drift_patterns/drifts.yaml (20 patterns)
├── api_schemas/              # exact copy of data/api_schemas/
└── LICENSE                   # bundle license (Apache 2.0 by default; see §3.4)
```

### 2.2 Per-file contracts

| File | Format | Authored by | Runtime writer | Schema anchor |
|---|---|---|---|---|
| `data/task_briefs/templates.yaml` | YAML | Hand (20 seeds) | none | `Template` (§4.1 task_generator.md) |
| `data/task_briefs/i18n.yaml` | YAML | Hand | none | `Mapping[LanguageCode, Mapping[str, str]]` |
| `data/drift_patterns/drifts.yaml` | YAML | Hand | none | `DriftPattern` (§4.2 drift_injector.md) |
| `data/api_schemas//v.json` | JSON Schema 2020-12 | Hand | none | `APISchema` (§4.4 below) |
| `data/audio/real/MANIFEST.jsonl` | JSONL | Hand (curated from IndicVoices-R) | none | `AudioClipManifest` (§4.5) |
| `data/audio/synth/*.wav` | WAV 16kHz mono | `audio/tts_kokoro.py` (lazy) | `audio/tts_kokoro.py` | n/a — generated |
| `data/sft_warmup/trajectories.jsonl` | JSONL | Sarvam-M via HF Inference (offline) | `training/sft_generator.py` (one-shot) | `SFTTrajectory` (§4.6) |

### 2.3 Loaders (all return frozen dataclasses, all NFC-normalize Indic strings)

```python
from __future__ import annotations
from pathlib import Path
from driftcall.data.models import (
    TemplateLibrary, I18nLibrary, DriftPatternLibrary,
    APISchemaRegistry, AudioManifest, SFTCorpus,
)

# L1 — task briefs
def load_templates(path: Path | str = "data/task_briefs/templates.yaml") -> TemplateLibrary: ...
def load_i18n(path: Path | str = "data/task_briefs/i18n.yaml") -> I18nLibrary: ...

# L2 — drift patterns + api schemas
def load_drift_patterns(path: Path | str = "data/drift_patterns/drifts.yaml") -> DriftPatternLibrary: ...
def load_api_schemas(root: Path | str = "data/api_schemas") -> APISchemaRegistry: ...

# L3 — audio manifest (paths + licenses only; actual WAVs resolved on-demand)
def load_audio_manifest(path: Path | str = "data/audio/real/MANIFEST.jsonl") -> AudioManifest: ...

# L4 — optional SFT warmup
def load_sft_corpus(path: Path | str = "data/sft_warmup/trajectories.jsonl") -> SFTCorpus: ...
```

Each loader is implemented as a **module-level lazy singleton** — the first call reads + validates + freezes; subsequent calls return the same instance. Not thread-safe for write (there is no write); safe for concurrent read.

### 2.4 HF Hub publication commands

Packaging runs *once*, pre-event. The script is `training/data_export.py` (see `docs/modules/training.md` for its interface — this module only defines the on-disk shape of what it writes).

**Immutability.** The published bundle is IMMUTABLE after publication. Re-running `hf upload` against the same `data/publication/` tree produces a byte-identical bundle (invariant #6). Adding rows to `val/briefs.jsonl` requires a MINOR-version bump (v1.1, v1.2, …) and a new publication seed; the `train/` split NEVER silently mutates between versions — a version bump either adds disjoint val rows or re-publishes train+val together, never partial mutation of train.

**Seed selection (deterministic, locked).** Train and val seeds are drawn by `training/data_export.py` using these two exact expressions:

```python
import random

# Train: 15,000 seeds sampled without replacement from [0, 20_000_000).
train_seeds = random.Random(20260425).sample(range(0, 20_000_000), 15_000)

# Val: deterministic slice of 500 contiguous seeds in the reserved range.
val_seeds = list(range(20_000_000, 20_000_500))
```

Both lists are identical across re-runs on a pinned Python version (the output of `random.sample` is not guaranteed to stay stable across Python releases, so the interpreter version is part of the determinism contract). The publication meta-seed `20260425` is locked; changing it requires a major-version bump and a new repo name or subfolder.

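The seed contract above can be checked mechanically before any briefs are written. The following is a minimal sketch, not the actual `data_export.py` logic: the helper names `draw_seeds` and `check_seed_contract` are hypothetical, and plain `assert` stands in for the `TrainValLeakError` that the publication script is specified to raise.

```python
import random

PUBLICATION_SEED = 20260425                  # locked meta-seed from this section
TRAIN_RANGE = range(0, 20_000_000)           # training seed space
VAL_RANGE = range(20_000_000, 20_000_500)    # reserved validation seed space


def draw_seeds() -> tuple[list[int], list[int]]:
    """Reproduce the two seed-selection expressions (hypothetical helper)."""
    train = random.Random(PUBLICATION_SEED).sample(TRAIN_RANGE, 15_000)
    val = list(VAL_RANGE)
    return train, val


def check_seed_contract(train: list[int], val: list[int]) -> None:
    """Assert the size, range, and disjointness properties stated in the spec."""
    assert len(train) == 15_000 and len(set(train)) == 15_000  # no duplicates
    assert len(val) == 500
    assert all(seed in TRAIN_RANGE for seed in train)
    assert set(train).isdisjoint(val)                          # invariant #5


train_seeds, val_seeds = draw_seeds()
check_seed_contract(train_seeds, val_seeds)
```

Calling `draw_seeds()` twice in the same interpreter returns equal lists, which is the property the byte-identical re-run guarantee depends on.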
```bash
# Generate the sampled briefs by walking enumerate_variants() (see task_generator.md §2.2)
python3 training/data_export.py \
  --out-train data/publication/train/briefs.jsonl \
  --out-val data/publication/val/briefs.jsonl \
  --n-train 15000 \
  --n-val 500 \
  --seed 20260425   # frozen publication seed; NOT a training seed

# Copy the static L2 artifacts verbatim
cp data/drift_patterns/drifts.yaml data/publication/drift_patterns.yaml
cp -r data/api_schemas data/publication/api_schemas

# Upload (see DRIFTCALL/CLAUDE.md §6 command table and huggingface-skills:hf-cli).
# NOTE: `hf` is the modern CLI replacing the deprecated `huggingface-cli`.
hf upload /driftcall-indic-briefs \
  data/publication/ . \
  --repo-type dataset \
  --hf-org \
  --commit-message "v1.0 publication — locked 2026-04-25"
```

The publication seed `20260425` is fixed and recorded in the README. Re-running the script produces a byte-identical bundle (determinism contract inherited from `task_generator.generate` per `docs/modules/task_generator.md` §3.1).

> **Doc-sync flag:** `DRIFTCALL/CLAUDE.md` §6 still lists the deprecated `huggingface-cli upload` command; update that table to `hf upload` in the same PR that lands this doc (captured as Open Question #1 / CLAUDE.md sync item).

---

## 3. Behavior Spec

### 3.1 Authoring conventions

**NFC normalization.** Every string value in every YAML/JSON file is NFC-normalized before it is committed. The loaders re-normalize defensively at load time (invariant #8, `docs/modules/task_generator.md` §3.4). Editors used during authoring (VS Code, vim) must be configured to save NFC — a pre-commit hook (`ruff`-adjacent script) runs `python -c "import unicodedata, sys; ..."` to reject NFD commits.

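As an illustration of the NFC rule, the check a pre-commit hook or loader might run can be sketched as below. This is an assumption-laden sketch: `find_non_nfc_lines` and `normalize_nfc` are hypothetical helper names, and the real hook's one-liner is elided in this spec, so nothing here should be read as its actual contents.

```python
import unicodedata


def find_non_nfc_lines(text: str) -> list[int]:
    """Return 1-based line numbers whose content is not NFC-normalized."""
    return [
        lineno
        for lineno, line in enumerate(text.splitlines(), start=1)
        if not unicodedata.is_normalized("NFC", line)
    ]


def normalize_nfc(text: str) -> str:
    """Defensive re-normalization, as the loaders are specified to do."""
    return unicodedata.normalize("NFC", text)


# "e" followed by U+0301 (combining acute) is the decomposed (NFD) form of U+00E9.
decomposed = "cafe\u0301\nplain ascii\n"
offending = find_non_nfc_lines(decomposed)                  # line 1 is flagged
clean = find_non_nfc_lines(normalize_nfc(decomposed))       # nothing flagged
```

A hook built on this would exit non-zero whenever `offending` is non-empty, while a loader would call `normalize_nfc` first and then assert that no line is flagged.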
**License headers.** Every hand-authored file begins with a YAML comment block declaring SPDX identifier, author, year, and upstream attribution if the content is derived from a public dataset (§8.5):

```yaml
# SPDX-License-Identifier: Apache-2.0
# Copyright 2026 DriftCall Team
# Derived-from: AmazonScience/MASSIVE (intent taxonomy, Apache-2.0)
# See data/LICENSES.md for full attribution chain.
```

JSON files carry the same metadata in a `$comment` field at root (`$comment` is a JSON Schema core keyword reserved for human-readable notes; validators ignore its value).

**Seed determinism.** Every numeric or stochastic sampling decision in template/drift authoring threads through a fixed seed: the publication seed `20260425`, the template-expansion seed `42`, or the curriculum-language seeds declared in DESIGN.md §10.3. No wall-clock, no `random.random()`, no host-machine entropy.

**No PII.** Authored strings never contain real names, phone numbers, email addresses, booking reference numbers, card PANs, or IP addresses. The `from` / `to` fields use IATA codes; the `pickup` / `drop` fields use fictional neighborhood landmarks. A CI lint (`grep -rEn '[0-9]{10}' data/`; note the `-r`, without which `grep` does not descend into the directory) runs before every commit and fails on any 10-digit run outside the allowed IATA/timestamp contexts.

**Eval-set held out from training.** The 500-episode val set uses seeds drawn from a reserved range (seed ∈ `[20_000_000, 20_000_500)`); training always draws seeds from `[0, 20_000_000)`. The publication script asserts disjointness at write time (see §5 leak detection). The exact seed-selection expressions are specified in §2.4.

**Canonical JSON key ordering.** Every row in `train/briefs.jsonl` and `val/briefs.jsonl` is serialized with:

```python
json.dumps(row, ensure_ascii=False, sort_keys=True, separators=(",", ":"))
```

This is enforced as an invariant precondition for byte-identical re-runs (see §3.5 invariant #6).
`ensure_ascii=False` preserves Devanagari / Tamil / Kannada script without `\uXXXX` escaping; `sort_keys=True` canonicalizes key order; `separators=(",", ":")` eliminates whitespace variance across Python/libc versions.

**Per-row data lineage.** Every `BriefRow` carries the full six-tuple `(template_id, seed, stage, language, domain, generator_version)` plus three corpus-version hashes (`catalogue_hash`, `templates_sha256`, `i18n_sha256`). This is enforced as an invariant (§3.5 invariant #9) so that any published row is re-derivable from the triple `(seed, stage, library@hash)` alone.

### 3.2 Lazy singleton loaders

```python
# sketch of the module-level pattern, mirrored in every loader
import threading
from pathlib import Path

_LIBRARIES: dict[Path, TemplateLibrary] = {}   # per-path cache (see note below)
_LIBRARY_LOCK = threading.Lock()

def load_templates(path: Path | str = "data/task_briefs/templates.yaml") -> TemplateLibrary:
    key = Path(path)
    library = _LIBRARIES.get(key)              # fast path: no lock once cached
    if library is None:
        with _LIBRARY_LOCK:
            library = _LIBRARIES.get(key)      # re-check under the lock
            if library is None:
                library = _load_and_validate_templates(key)
                _LIBRARIES[key] = library
    return library
```

The singleton is **path-keyed** — if a test passes a different `path`, a fresh instance is built (still cached in a per-path dict). Production callers always use the default path.

### 3.3 Schema validation at load time

Each loader does three passes:

1. **YAML/JSON parse.** Failure → `MalformedYAMLError` / `MalformedJSONError` with line/column.
2. **Type + shape validation** against the dataclass schema in §4. Failure → `DatasetSchemaError` naming the offending key.
3. **Cross-file consistency** check (loader-specific):
   - `load_drift_patterns` asserts `pattern.id` values are unique, exactly 20 patterns total, `drift_type ∈ {schema,policy,tnc,pricing,auth}`, and every `from_version`/`to_version` references an existing schema file in `data/api_schemas//`.
   - `load_templates` asserts every `drift_slot_tags` token is matched by ≥ 1 `DriftPattern.mutation` key or value (`airline.total_fare_inr` must be targetable, else why tag it).
   - `load_api_schemas` asserts each `v.json` validates as JSON Schema 2020-12 against the meta-schema via `jsonschema.Draft202012Validator.check_schema`.
   - `load_audio_manifest` asserts every referenced `path` exists on disk and its sha256 matches the recorded hash.

Failures here abort env startup; HTTP 503 is served until the data is fixed (mirrors `DriftCatalogueError` handling in `docs/modules/drift_injector.md` §5).

### 3.4 License compatibility check

Per §8.5 the public datasets we reference carry mixed licenses:

| Upstream | License | Redistributable in our bundle? |
|---|---|---|
| AI4Bharat IndicVoices-R | Apache-2.0 | Yes, with attribution |
| MASSIVE (Amazon) | Apache-2.0 | Yes, with attribution |
| Schema-Guided Dialogue (SGD) | CC-BY-SA | Inspiration only — derived schema patterns, not verbatim rows |
| MTOP (Facebook) | MIT-style (see original repo) | Inspiration only — derived Hindi task phrasings, not verbatim rows |
| APIs.guru | CC0 | Yes, no attribution required but recorded |

The bundle license (`LICENSE` at the root of `/driftcall-indic-briefs`) is **Apache-2.0**. Because CC-BY-SA is a share-alike (copyleft) license, we never copy SGD or MTOP rows verbatim — only *inspiration* (intent labels, schema shapes). A CI check enforces that no string in `train/briefs.jsonl` or `val/briefs.jsonl` shares a verbatim run of ≥ 10 contiguous tokens with a cached SGD/MTOP export. See §5 for the error mode and §7 edge case #3 for the exact detection rule.

**Full verbatim license text (MANDATORY).** The root `LICENSE` file MUST contain the **full verbatim Apache 2.0 license text** as published at https://www.apache.org/licenses/LICENSE-2.0.txt — NOT a URL, NOT a one-line SPDX identifier, NOT a summary. The same requirement applies to `data/audio/LICENSES.md` and `data/sft_warmup/LICENSES.md` (both must embed the full Apache-2.0 text plus per-clip / per-trajectory attribution rows).
CI check `tests/data/test_license_text.py` verifies that the byte length of each `LICENSE` file is ≥ 11,000 (Apache-2.0 full text is ~11,357 bytes) and that the canonical "Apache License / Version 2.0, January 2004" header string is present.

**`LICENSES.md` schema (L3 audio + L4 SFT warmup).** Both `data/audio/LICENSES.md` and `data/sft_warmup/LICENSES.md` follow the same markdown format:

1. A preamble (5–15 lines) naming the bundle and linking back to the root `LICENSE`.
2. The full verbatim Apache-2.0 text (as above).
3. A single markdown table with exactly these columns, one row per clip (L3) or per trajectory (L4):

   ```markdown
   | utterance_id | upstream_source      | upstream_license | attribution_required | notes                    |
   |--------------|----------------------|------------------|----------------------|--------------------------|
   | iv_r_kn_0451 | IndicVoices-R        | Apache-2.0       | yes                  | speaker consent verified |
   | sft_00042    | Sarvam-M (synthesis) | Apache-2.0       | no                   | rollout seed 42, stage 2 |
   ```

For L4 the `utterance_id` column is replaced by `trajectory_id` but the other four columns are identical. Loaders do not parse these tables at runtime; they are human-audit artifacts enforced only by pre-commit schema check `scripts/check_licenses_md.py`.

### 3.5 Invariants (enforced by tests)

1. Every string value in every loaded library is NFC (`unicodedata.is_normalized("NFC", s) == True`).
2. `load_drift_patterns()` returns exactly 20 patterns (matches `docs/modules/drift_injector.md` §4.4 and DESIGN.md §6.3).
3. `load_api_schemas()` returns exactly `{airline:v1,v2,v3 + cab:v1,v2,v3 + restaurant:v1,v2,v3 + hotel:v1,v2,v3 + payment:v1,v2}` = **14 schemas across 5 domains** (matches DESIGN.md §8.6 bundle enumeration and §5 vendor catalogue).
4. `load_templates()` library satisfies: every template has ≥ 1 variant in every `LanguageCode` (`hi`, `ta`, `kn`, `en`, `hinglish`); every **primary-domain** pattern's `mutation` field set is a subset of the union of `drift_slot_tags` across that domain's templates. The two transversal payment-auth patterns (`payment.auth_scope_upgrade`, `payment.mfa_required`) are EXEMPT from this subset check — they mutate shared payment fields (`token`, `scope`, `mfa_code`) that are intentionally not present in primary-domain goal templates and therefore cannot appear in `drift_slot_tags`.
5. Publication invariant: train seed set ∩ val seed set = ∅.
6. Publication invariant: running `data_export.py` twice with the same seed produces byte-identical `train/briefs.jsonl` + `val/briefs.jsonl` (SHA-256 match). Enforced via canonical JSON dump (§3.1): `json.dumps(row, ensure_ascii=False, sort_keys=True, separators=(",", ":"))`.
7. Every file in `data/` begins with an SPDX license header (YAML comment or JSON `$comment`).
8. No 10-digit digit-run in any authored string outside the timestamp / IATA allowed contexts (PII guard).
9. **Per-row data lineage.** Every `BriefRow` (§4.7) in the published `train/` and `val/` splits carries all of: `template_id`, `seed`, `stage`, `language`, `domain`, `generator_version`, `catalogue_hash`, `templates_sha256`, `i18n_sha256`. At eval-load time, `catalogue_hash` / `templates_sha256` / `i18n_sha256` must match the currently-loaded library hashes, else `CatalogueHashMismatchError` is raised (§5).
10. **Bundle immutability.** After publication (§2.4), the train split SHA-256 MUST match across all future re-runs of `hf upload`; adding val rows requires a minor-version bump, never a silent train-split mutation.

---

## 4. Data Structures

All types are frozen dataclasses, immutable after load. Mappings are wrapped in `types.MappingProxyType`.

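To make the immutability convention concrete, here is a minimal sketch of a frozen dataclass whose nested mappings are wrapped in `MappingProxyType`. `ExampleLibrary` and `freeze` are hypothetical names used only for illustration; the real types are the ones defined in the subsections that follow.

```python
from dataclasses import FrozenInstanceError, dataclass
from types import MappingProxyType
from typing import Mapping


@dataclass(frozen=True)
class ExampleLibrary:
    """Hypothetical stand-in with the same shape as I18nLibrary."""
    strings: Mapping[str, Mapping[str, str]]
    source_sha256: str


def freeze(nested: dict[str, dict[str, str]]) -> Mapping[str, Mapping[str, str]]:
    """Copy each inner dict and expose both levels as read-only views."""
    return MappingProxyType(
        {lang: MappingProxyType(dict(table)) for lang, table in nested.items()}
    )


lib = ExampleLibrary(
    strings=freeze({"hi": {"BLR": "बेंगलुरु"}}),
    source_sha256="0" * 64,  # placeholder digest, not a real hash
)
```

`frozen=True` blocks attribute reassignment (raising `FrozenInstanceError`), while the proxies block item assignment on the mappings themselves (raising `TypeError`). Copying the inner dicts in `freeze` matters because a proxy is only a view: a caller holding the original dict could otherwise still mutate it.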
### 4.1 `TemplateLibrary` (re-exported from `task_generator.models` — single source of truth)

```python
@dataclass(frozen=True)
class TemplateLibrary:
    templates: tuple[Template, ...]                      # exactly 20 at v1.0
                                                         # (4 domains × 5 templates);
                                                         # ≥ 20 after minor-version bumps
    cities_by_domain: Mapping[Domain, tuple[str, ...]]   # 10 per domain
    i18n: Mapping[LanguageCode, Mapping[str, str]]       # merged from i18n.yaml
    source_sha256: str                                   # hash of templates.yaml bytes
```

The `templates` tuple length is **exactly 20** at v1.0 publication (4 domains × 5 templates per domain). Post-v1.0 minor-version bumps may grow this count monotonically; the invariant `len(templates) >= 20` and `len(templates) % 5 == 0` holds across all future versions. `load_templates` asserts `len(templates) == 20` at v1.0 via the `generator_version` check.

Authoritative schema lives in `docs/modules/task_generator.md` §4. This module re-exports the type so callers of `load_templates` receive the same object that `task_generator.generate` consumes.

### 4.2 `I18nLibrary`

```python
@dataclass(frozen=True)
class I18nLibrary:
    strings: Mapping[LanguageCode, Mapping[str, str]]
    # e.g., strings["hi"]["BLR"] = "बेंगलुरु"
    #       strings["ta"]["Monday"] = "திங்கள்"
    source_sha256: str
```

Merged into `TemplateLibrary.i18n` by `load_templates`, but exposed independently for pure-i18n use cases (e.g., the Gradio demo UI localizing labels).

### 4.3 `DriftPatternLibrary`

```python
@dataclass(frozen=True)
class DriftPatternLibrary:
    patterns: Mapping[str, DriftPattern]        # keyed by DriftPattern.id
    by_domain: Mapping[str, tuple[str, ...]]    # domain → pattern_ids
    by_type: Mapping[str, tuple[str, ...]]      # drift_type → pattern_ids
    source_sha256: str
```

`DriftPattern` itself is defined in `docs/modules/drift_injector.md` §4.2 (see the `DriftPattern` dataclass snippet). This module owns *loading*, `drift_injector` owns *applying*.

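One plausible way to derive the `by_domain` and `by_type` indexes at load time is sketched below. `PatternStub` is a hypothetical stand-in (the real `DriftPattern` lives in `drift_injector.md`), its `domain` field is an assumption, and `ValueError` stands in for `DuplicateDriftPatternIdError`.

```python
from collections import defaultdict
from dataclasses import dataclass
from types import MappingProxyType
from typing import Iterable, Mapping


@dataclass(frozen=True)
class PatternStub:
    """Hypothetical minimal pattern: only the fields the indexes need."""
    id: str
    domain: str       # assumed field name
    drift_type: str   # one of schema, policy, tnc, pricing, auth


def build_indexes(patterns: Iterable[PatternStub]):
    """Return read-only (by_id, by_domain, by_type) mappings."""
    by_id: dict[str, PatternStub] = {}
    by_domain: dict[str, list[str]] = defaultdict(list)
    by_type: dict[str, list[str]] = defaultdict(list)
    for pattern in patterns:
        if pattern.id in by_id:  # duplicate ids must fail loudly, not overwrite
            raise ValueError(f"duplicate drift pattern id: {pattern.id}")
        by_id[pattern.id] = pattern
        by_domain[pattern.domain].append(pattern.id)
        by_type[pattern.drift_type].append(pattern.id)

    def freeze(index: Mapping[str, list[str]]) -> Mapping[str, tuple[str, ...]]:
        return MappingProxyType({key: tuple(ids) for key, ids in index.items()})

    return MappingProxyType(by_id), freeze(by_domain), freeze(by_type)


stubs = [
    PatternStub("payment.auth_scope_upgrade", "payment", "auth"),
    PatternStub("payment.mfa_required", "payment", "auth"),
]
patterns_by_id, by_domain, by_type = build_indexes(stubs)
```

Storing only pattern ids in the secondary indexes keeps `patterns` as the single place where the full objects live, which matches the `Mapping[str, tuple[str, ...]]` annotations above.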
### 4.4 `APISchemaRegistry`

```python
@dataclass(frozen=True)
class APISchema:
    domain: str                  # "airline" | "cab" | "restaurant" | "hotel" | "payment"
    version: str                 # "v1" | "v2" | "v3"
    schema: Mapping[str, Any]    # parsed JSON Schema 2020-12 document
    source_sha256: str

@dataclass(frozen=True)
class APISchemaRegistry:
    schemas: Mapping[str, Mapping[str, APISchema]]
    # schemas["airline"]["v2"] = APISchema(...)

    def get(self, domain: str, version: str) -> APISchema: ...
    def versions(self, domain: str) -> tuple[str, ...]: ...   # ordered v1,v2,v3
```

Each `v.json` is a valid JSON Schema 2020-12 document describing the tool-response shape for that domain at that drift version. Vendors (DESIGN.md §5) validate outgoing responses against these schemas at test time; the injector (`docs/modules/drift_injector.md` §3) consults version transitions via these files.

### 4.5 `AudioManifest`

```python
@dataclass(frozen=True)
class AudioClip:
    utterance_id: str                        # stable; matches a curated IndicVoices-R clip id
    path: Path                               # relative to data/audio/
    language: LanguageCode
    source: Literal["real_indicvoices_r"]    # manifest is authored-only; synth clips
                                             # are lazily generated and NEVER recorded here
    license: str                             # SPDX identifier
    sha256: str
    duration_s: float                        # ≤ 20.0 (DESIGN.md §9 upper bound)

@dataclass(frozen=True)
class AudioManifest:
    clips: tuple[AudioClip, ...]
    source_sha256: str                       # hash of MANIFEST.jsonl bytes
```

The `source` field is a single-value `Literal` — the manifest is **authored-only**. Synth clips generated on-demand by `audio/tts_kokoro.py` are **never** recorded in the manifest (they are transient, gitignored under `data/audio/synth/`). This keeps the manifest auditable and its SHA-256 stable across Kokoro model-weight updates.

### 4.6 `SFTCorpus` (L4, optional)

```python
@dataclass(frozen=True)
class SFTTrajectory:
    episode_id: int
    goal_seed: int                           # same seed space as train/; NEVER a val seed (§3.1)
    turns: tuple[Mapping[str, Any], ...]     # role/content pairs, JSON-serializable
    stage: Literal[1, 2, 3]
    reward_breakdown: Mapping[str, float]    # R1..R5 + total, from the env at synthesis time
    generation_batch_id: str                 # uuid4 per invocation of sft_generator.py
    generation_index: int                    # monotonic within a batch, 0..N-1

@dataclass(frozen=True)
class SFTCorpus:
    trajectories: tuple[SFTTrajectory, ...]
    generator: Literal["sarvam-m-hf-inference"]
    generation_seed: int
    target_count: int                        # from --target-count CLI flag
    source_sha256: str
```

Consumed by `training/train_grpo.py` only when `--sft-warmup-steps > 0` is passed. Absent by default; if the file is missing, `load_sft_corpus` raises `DatasetFileMissingError` (§5), which the training script treats as non-fatal and proceeds without warmup.

**Atomic append + restart recovery (`training/sft_generator.py`):**

- Each trajectory is appended to `data/sft_warmup/trajectories.jsonl` as a single canonical-JSON line followed by `os.fsync(fd)` on the file descriptor, ensuring durability before the next Sarvam-M API call. Partial-write recovery is therefore line-granular.
- Every row carries `generation_batch_id` (uuid4, generated once per fresh invocation of `sft_generator.py`; on restart it is rehydrated from the last row on disk, see §7 edge case #11) and `generation_index` (monotonic integer 0..N-1 within that batch).
- On restart, `sft_generator.py` reads the existing `trajectories.jsonl`, reconstructs the `(goal_seed, generation_index)` pairs already completed, and resumes from the next uncompleted seed in its deterministic seed list. This tolerates Sarvam-M rate-limit drops, OOM kills, and SIGKILL.
- After all generation completes, the script performs a **final count validation**: if `len(trajectories) != target_count`, it raises `PartialSFTCorpusError` (§5). The loader `load_sft_corpus` also performs this check at load time and raises the same error if the on-disk row count does not match the `target_count` field.
- Edge case #11 (§7) walks through a concrete partial-generation-recovery scenario.

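The append-and-resume behavior described in the bullets above can be sketched as follows. This is an illustrative sketch, not the `sft_generator.py` implementation: `append_row`, `completed_indices`, and `resume` are hypothetical names, the `make_row` callback stands in for a Sarvam-M call, `RuntimeError` stands in for `PartialSFTCorpusError`, and the sketch assumes every line already on disk is complete.

```python
import json
import os
from pathlib import Path
from typing import Callable


def append_row(path: Path, row: dict) -> None:
    """Append one canonical-JSON line and fsync before returning."""
    line = json.dumps(row, ensure_ascii=False, sort_keys=True, separators=(",", ":"))
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(line + "\n")
        handle.flush()               # push Python's buffer to the OS
        os.fsync(handle.fileno())    # ask the OS to persist it to disk


def completed_indices(path: Path) -> set[int]:
    """Collect generation_index values already written (empty if no file)."""
    if not path.exists():
        return set()
    with open(path, encoding="utf-8") as handle:
        return {json.loads(line)["generation_index"] for line in handle if line.strip()}


def resume(path: Path, target_count: int, make_row: Callable[[int], dict]) -> int:
    """Generate only the missing indices, then validate the final row count."""
    done = completed_indices(path)
    for index in range(target_count):
        if index not in done:
            append_row(path, make_row(index))
    final = len(completed_indices(path))
    if final != target_count:
        raise RuntimeError(f"expected {target_count}, got {final}")
    return final
```

Because each row is flushed and fsynced before the next one is generated, a crash loses at most the trajectory in flight, and `resume` skips everything already recorded instead of regenerating it.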
### 4.7 `BriefRow` — canonical publication-row contract

Every line of `train/briefs.jsonl` and `val/briefs.jsonl` in the published HF Hub bundle is exactly one serialized `BriefRow`. This dataclass is the single-source-of-truth schema for everything an offline consumer (a judge re-running eval, a third party reproducing our experiments) needs to re-derive the episode from `(seed, stage, library@hash)` alone.

```python
from __future__ import annotations
from dataclasses import dataclass
from typing import Literal

from driftcall.models import GoalSpec, DriftEvent, LanguageCode

@dataclass(frozen=True)
class BriefRow:
    episode_id: str                          # deterministic from seed + stage (e.g. "s2_ep_00000042")
    seed: int                                # original episode seed (train: [0, 20_000_000),
                                             # val: [20_000_000, 20_000_500))
    stage: Literal[1, 2, 3]                  # curriculum stage at publication time
    language: LanguageCode                   # "hi" | "ta" | "kn" | "en" | "hinglish"
    domain: Literal["airline", "cab", "restaurant", "hotel"]
    template_id: str                         # e.g. "airline.book.budget_timewindow"
    goal: GoalSpec                           # full GoalSpec (slots + constraints + seed_utterance)
    drift_schedule: tuple[DriftEvent, ...]   # schedule pre-computed by drift_injector
    catalogue_hash: str                      # sha256(drift_patterns/drifts.yaml bytes)
    templates_sha256: str                    # sha256(task_briefs/templates.yaml bytes)
    i18n_sha256: str                         # sha256(task_briefs/i18n.yaml bytes)
    generator_version: str                   # e.g. "driftcall-1.0.0" — semver of the generator
    created_ts_ist: str                      # ISO 8601 with +05:30 offset, e.g. "2026-04-25T10:30:00+05:30"
```

Serialization is always canonical: `json.dumps(asdict(row), ensure_ascii=False, sort_keys=True, separators=(",", ":"))`. A concrete JSONL line example is given in §8.5.

At eval-load time, the loader re-hashes the currently-loaded `drifts.yaml` / `templates.yaml` / `i18n.yaml` and compares against `catalogue_hash` / `templates_sha256` / `i18n_sha256`.
Any mismatch raises `CatalogueHashMismatchError` (§5) — this prevents silent semantic drift where a consumer runs `train/briefs.jsonl` against a newer catalogue and gets different episodes.

---

## 5. Error Modes

All exceptions subclass `DatasetError(Exception)`. Each is raised exactly once and unit-tested.

| Exception | Trigger | Where raised |
|---|---|---|
| `DatasetFileMissingError` | `data/` absent on disk | every loader |
| `MalformedYAMLError` | YAML parse failure (syntax) | `load_templates`, `load_i18n`, `load_drift_patterns` |
| `MalformedJSONError` | JSON parse failure (syntax) | `load_api_schemas`, `load_audio_manifest`, `load_sft_corpus` |
| `DatasetSchemaError` | type/shape validation failure (missing required key, wrong type, extra unknown key) | every loader |
| `UnknownLanguageKeyError` | a language key ∉ `LanguageCode = {"hi","ta","kn","en","hinglish"}` appears in `templates.yaml` or `i18n.yaml` | `load_templates`, `load_i18n` |
| `LicenseConflictError` | a CC-BY-SA or GPL-licensed row appears in publication bundle while bundle is Apache-2.0; or verbatim ≥ 10-token suffix matches a CC-BY-SA upstream row | publication script (see §3.4) |
| `TrainValLeakError` | train and val seed sets intersect; or an `SFTTrajectory.goal_seed` sits in the val reserved range `[20_000_000, 20_000_500)` | publication script, `load_sft_corpus` |
| `DriftPatternOrphanError` | `drifts.yaml` references a `from_version`/`to_version` not present in `data/api_schemas//` | `load_drift_patterns` |
| `ChecksumMismatchError` | `AudioClip.sha256` does not match the on-disk file's hash | `load_audio_manifest` |
| `UnicodeNFDError` | any loaded string fails `unicodedata.is_normalized("NFC", s)` | every loader |
| `PIIDetectedError` | a 10-digit run appears outside allowed contexts in authored text | every text-bearing loader; also CI lint |
| `DuplicateDriftPatternIdError` | two entries in `drifts.yaml` share an `id` | `load_drift_patterns` |
| `CatalogueHashMismatchError` | a `BriefRow` in `train/briefs.jsonl` or `val/briefs.jsonl` carries `catalogue_hash` / `templates_sha256` / `i18n_sha256` that does not match the currently-loaded library (drifts.yaml / templates.yaml / i18n.yaml) hashes | eval-load path (consumers of published bundle) |
| `PartialSFTCorpusError` | `len(SFTCorpus.trajectories) != target_count` at final-count validation; raised by `training/sft_generator.py` post-generation and by `load_sft_corpus` at load time | `load_sft_corpus`, `training/sft_generator.py` |

**No silent fallbacks.** If `data/sft_warmup/trajectories.jsonl` is missing, `load_sft_corpus` raises `DatasetFileMissingError`; the training script is the one that treats this as non-fatal (falls back to no-SFT warmup). Loaders themselves never substitute defaults.

---

## 6. Dependencies

### 6.1 Reads

- `data/task_briefs/templates.yaml`, `data/task_briefs/i18n.yaml`
- `data/drift_patterns/drifts.yaml`
- `data/api_schemas/**/*.json`
- `data/audio/real/MANIFEST.jsonl` + the `.wav` files it references
- `data/sft_warmup/trajectories.jsonl` (optional)

### 6.2 Imports

- `driftcall.models` — `GoalSpec`, `LanguageCode`, `Domain`
- Python stdlib: `json`, `hashlib`, `pathlib`, `unicodedata`, `threading`, `dataclasses`, `typing`, `types`
- Third-party: `PyYAML`, `jsonschema` (for JSON Schema 2020-12 meta-validation)

### 6.3 Consumers

Consuming modules and the exact function they call:

- `docs/modules/task_generator.md` — `load_templates()` in `task_generator.generate()`'s lazy-singleton `_get_library()`.
- `docs/modules/drift_injector.md` — `load_drift_patterns()` in the injector's module-level registry; consults DESIGN.md §6.3 pattern catalogue.
- `docs/modules/vendors.md` — `load_api_schemas()` at vendor import time; each vendor asserts its own response shape against the schema in test fixtures.
- `docs/modules/audio.md` — `load_audio_manifest()` for the pitch demo (§9.5 IndicVoices-R clip playback).
- `docs/modules/training.md` — `load_sft_corpus()` behind `--sft-warmup-steps` flag; also invokes `training/data_export.py` which calls `task_generator.enumerate_variants()` to produce the publication briefs.

### 6.4 Publishes to

- HF Hub dataset repo `/driftcall-indic-briefs` (one-time, pre-event, Phase C5 per `DRIFTCALL/CLAUDE.md` §4.1).

### 6.5 Non-dependencies (explicit)

- Does **not** import from `env.py`, `rewards.py`, `app.py`, or the training entrypoint. Pure data layer.
- Does **not** hit the network at runtime. Every file is local. Publication script is a separate, explicitly-invoked entrypoint.
- Does **not** depend on GPU, CUDA, or PyTorch. CPU-only.

---

## 7. Edge Cases

1. **Missing template variant for a rare language.** `templates.yaml` is authored with `hinglish` + `hi` + `en` + `ta` but an author forgets `kn` for one template. `load_templates` runs per-template check `set(variants.keys()) == LanguageCode.values` and raises `DatasetSchemaError: template 'restaurant.order.veg' missing language_variants['kn']`. The generator's `NoVariantForLanguageError` (task_generator.md §5) never has a chance to fire because loading fails first. Fix: author supplies the missing variant; loader re-runs.

2. **Unicode NFD in author contribution.** A collaborator pastes a Kannada weekday name from macOS clipboard (NFD by default for composed characters). `load_i18n` re-normalizes to NFC *before* equality/hashing; the assertion `unicodedata.is_normalized("NFC", value)` fires post-normalization as a defense against Python/ICU bugs. In practice the round-trip succeeds and NFC is stored. The pre-commit hook separately catches NFD at commit time so CI never sees it.

3. **License incompatibility (CC-BY-SA row smuggled into an Apache-2.0 bundle).** An author, inspired by an SGD row, copies 20 tokens verbatim into a template variant.
   Publication CI runs a suffix-array check over cached SGD/MTOP exports looking for ≥ 10-token verbatim matches; on hit, `LicenseConflictError("variant in 'airline.book.budget_timewindow' matches SGD row sgd_5432:0 (≥ 10 tokens)")` raises. Fix: rewrite the variant. We keep only *inspiration*, never verbatim text, from CC-BY-SA sources. The threshold (10 tokens) is a pragmatic choice: below that length we treat overlap as incidental linguistic reuse; at or above we flag.

4. **Empty language cohort in a stage mix.** A future curriculum config passes `language_weights = {"en": 1.0, "hi": 0.0, "ta": 0.0, "kn": 0.0, "hinglish": 0.0}`. This is valid at the task-generator level (task_generator.md §3.2 — non-negative weights summing to 1 are legal). `datasets` does not re-validate curriculum config; it only asserts the *library* has variants for all 5 languages. Downstream (`task_generator`) will simply never draw `hi`/`ta`/`kn`/`hinglish`. No error in this module.

5. **Train/val episode-id collision at publication time.** `data_export.py` draws 15,000 seeds for train and 500 for val. If the RNG accidentally maps a train seed into `[20_000_000, 20_000_500)` (the val reserved range) — which cannot happen given the seed-space partitioning in §3.1 — the assertion `train_seeds.isdisjoint(val_seeds)` raises `TrainValLeakError` with the offending seed. Safeguard: train seeds are drawn from `[0, 20_000_000)` and val seeds from `[20_000_000, 20_000_500)`. The two ranges are non-overlapping by construction; the assertion is a defense against future range edits.

6. **Drift-pattern-id orphan (trace references pattern not in YAML).** A test fixture or cached trace references `drift_pattern_id='airline.mysterious_fee'` but `drifts.yaml` has no such entry (it was renamed or removed). `load_drift_patterns` does not look at traces — it only checks internal consistency.
   The *trace consumer* (`rewards.r2_drift_detection` in `docs/modules/rewards.md`) raises `UnknownDriftPatternError` at scoring time per drift_injector.md §5. If the orphan is discovered during dataset publication, the publication script emits `DriftPatternOrphanError` and aborts.
7. **JSON Schema file that is valid JSON but not valid JSON Schema 2020-12.** `data/api_schemas/cab/v3.json` is hand-edited and ends up with a malformed keyword (e.g., `"type": "intger"` or a non-array `required`). `load_api_schemas` runs `jsonschema.Draft202012Validator.check_schema(schema)` and on failure raises `DatasetSchemaError("cab/v3.json: not a valid JSON Schema 2020-12: ")`. Note that `check_schema` validates against the 2020-12 metaschema, which accepts unknown keywords and a missing `$schema`, so catching those two slips would need an explicit loader check. The env refuses to serve `reset()` until fixed.
8. **Audio clip on disk does not match manifest sha256.** `data/audio/real/MANIFEST.jsonl` lists `kn_greeting_03.wav` with `sha256=abc...`. The file gets re-encoded (e.g., by a well-intentioned ffmpeg pass). `load_audio_manifest` re-hashes every referenced WAV and raises `ChecksumMismatchError("kn_greeting_03.wav: expected abc..., got def...")`. Fix: either revert the WAV, or regenerate the manifest after an audit-trail commit.
9. **SFT corpus contains a val-reserved seed.** Sarvam-M synthesis inadvertently uses a seed in `[20_000_000, 20_000_500)`. `load_sft_corpus` raises `TrainValLeakError`. The training script may be configured to treat this as fatal (default) or to filter out those trajectories (`--sft-tolerate-leak`); the loader itself always raises.
10. **PyYAML silently deduplicating keys.** If `drifts.yaml` has two entries with the same `id`, the YAML parse is valid, but any id-keyed lookup built from it silently keeps only one. `load_drift_patterns` builds a set of ids during validation and raises `DuplicateDriftPatternIdError` on collision, with both source line numbers.
11. **Partial SFT corpus recovery (L4 restart).** `training/sft_generator.py` is mid-run at trajectory 137 of a target 300 when the host OOM-kills the process (Sarvam-M inference peak memory).
    On restart, the script re-opens `data/sft_warmup/trajectories.jsonl`, reads the existing 137 rows (each fsync'd atomically per §4.6), reconstructs the completed `(generation_batch_id, generation_index)` pairs, and resumes from index 137 of the same batch. It does NOT start a new `generation_batch_id`; the batch id is rehydrated from the last row. When generation finally reaches 300, the script validates `len(rows) == target_count`; if a Sarvam-M response was silently truncated (say, only 298 rows written), `PartialSFTCorpusError("expected 300, got 298")` is raised and the operator must decide whether to re-run the missing two or ship a corpus with a smaller `target_count`. `load_sft_corpus` performs the same count check at load time.

---

## 8. Examples

### 8.1 Full `templates.yaml` entry for `airline.book.budget_timewindow`

```yaml
# SPDX-License-Identifier: Apache-2.0
# Copyright 2026 DriftCall Team
# Derived-from: AmazonScience/MASSIVE (intent taxonomy, Apache-2.0)
- template_id: airline.book.budget_timewindow
  domain: airline
  intent: book_flight
  min_stage: 1
  required_slots: [from, to, when]
  optional_slots: [seat_pref]
  constraints_template:
    budget_inr:
      distribution: uniform
      low: 3000
      high: 15000
      step: 500
    time_window:
      choices: [morning, afternoon, evening, late_night]
  drift_slot_tags: [price, total_fare_inr]
  # Language keys: ISO short codes matching LanguageCode = Literal["hi","ta","kn","en","hinglish"]
  language_variants:
    hinglish:
      - "Bhai {when} ko {to} jaana hai, cheapest flight {time_window} mein, {budget_inr} rupees max"
      - "{when} ko {from} se {to} ka ticket book kar de, under {budget_inr}, {time_window} ke baad"
    hi:
      - "मुझे {when} को {from} से {to} जाना है, {budget_inr} रुपये से कम में"
    ta:
      - "{when} அன்று {from} லிருந்து {to} க்கு டிக்கெட் வேண்டும், {budget_inr} ரூபாய்க்கு கீழ்"
    kn:
      - "{when} ರಂದು {from} ಇಂದ {to} ಗೆ ಅಗ್ಗದ ವಿಮಾನ ಟಿಕೆಟ್ ಬೇಕು, {budget_inr} ರೂಪಾಯಿಗಳ ಒಳಗೆ"
    en:
      - "Book the cheapest flight from {from} to {to} on {when}, budget under ₹{budget_inr}, departing {time_window}"
```

This is the single source-of-truth entry for the Stage-1 airline booking template; it mirrors DESIGN.md §8.3 and `docs/modules/task_generator.md` §4.1.

### 8.2 Full `drift_patterns.yaml` entry for `airline.price_rename`

```yaml
# SPDX-License-Identifier: Apache-2.0
# Copyright 2026 DriftCall Team
- id: airline.price_rename
  drift_type: schema
  domain: airline
  from_version: v1
  to_version: v2
  description: "field 'price' renamed to 'total_fare_inr'; 'currency' removed"
  mutation:
    rename: {price: total_fare_inr}
    remove: [currency]
  detection_hints:
    - "total_fare_inr"
    - "price"
    - "rename"
```

`load_drift_patterns` will (a) parse this, (b) check `id` uniqueness, (c) confirm `from_version=v1` + `to_version=v2` both exist as `data/api_schemas/airline/v1.json` + `data/api_schemas/airline/v2.json`, (d) confirm `detection_hints` is non-empty, (e) wrap `mutation` in `MappingProxyType`. Matches `docs/modules/drift_injector.md` §4.3 byte-for-byte.

### 8.3 `data/api_schemas/airline/v2.json`

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://driftcall.dev/schemas/airline/v2.json",
  "$comment": "SPDX-License-Identifier: Apache-2.0. v2 = post-price_rename drift (DESIGN.md §5.1).",
  "title": "Airline search result (v2)",
  "type": "object",
  "required": ["flight_id", "from", "to", "depart", "total_fare_inr", "seats_left"],
  "additionalProperties": false,
  "properties": {
    "flight_id": {"type": "string", "pattern": "^[0-9A-Z]{2}-[0-9]{4}$"},
    "from": {"type": "string", "pattern": "^[A-Z]{3}$"},
    "to": {"type": "string", "pattern": "^[A-Z]{3}$"},
    "depart": {"type": "string", "format": "date-time"},
    "total_fare_inr": {"type": "integer", "minimum": 0},
    "seats_left": {"type": "integer", "minimum": 0}
  }
}
```

Note that `price` and `currency` from v1 are absent (drift `airline.price_rename` applied).
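To make the v1-to-v2 relationship concrete, the sketch below applies the `rename` and `remove` operations from the `airline.price_rename` mutation (§8.2) to a v1-style record and checks that the result carries exactly the fields required by the v2 schema above. It is an illustration under assumptions, not the injector's implementation: `apply_mutation` is a hypothetical helper, and the v1 record shape is assumed (v1.json is not shown in this section).

```python
from types import MappingProxyType

# Mutation block copied from the `airline.price_rename` entry in §8.2,
# wrapped read-only in the spirit of step (e) of `load_drift_patterns`.
PRICE_RENAME = MappingProxyType({
    "rename": {"price": "total_fare_inr"},
    "remove": ["currency"],
})

# Required keys copied from the v2 schema in §8.3.
V2_REQUIRED = {"flight_id", "from", "to", "depart", "total_fare_inr", "seats_left"}


def apply_mutation(record, mutation):
    """Hypothetical helper: return a NEW dict with fields renamed/removed."""
    rename = mutation.get("rename", {})
    remove = set(mutation.get("remove", []))
    out = {}
    for key, value in record.items():
        if key in remove:
            continue  # field dropped by the drift
        out[rename.get(key, key)] = value  # renamed if listed, else unchanged
    return out


# Assumed v1-style record; the real v1 schema may differ.
v1_record = {
    "flight_id": "6E-2345",
    "from": "HYD",
    "to": "BLR",
    "depart": "2026-04-30T18:05:00+05:30",
    "price": 4200,
    "currency": "INR",
    "seats_left": 7,
}

v2_record = apply_mutation(v1_record, PRICE_RENAME)
assert set(v2_record) == V2_REQUIRED
```

The helper returns a new dict rather than mutating its input, so the v1 record stays available for before/after comparisons.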
Vendors (`docs/modules/vendors.md`) validate their emitted `airline.search` responses against whichever version the injector has installed in `state.schema_versions['airline']`. This schema also serves as the R2 structural detection surface: a tool call that keys into `price` after drift fails with a `KeyError` or a 422 response, which is a detection-positive signal per DESIGN.md §7.1 R2.

### 8.4 `MANIFEST.jsonl` row for a curated IndicVoices-R clip (L3)

```json
{"utterance_id": "iv_r_kn_0451", "path": "real/kn/iv_r_kn_0451.wav", "language": "kn", "source": "real_indicvoices_r", "license": "Apache-2.0", "sha256": "b7f1a9c2e5d4...", "duration_s": 4.82}
```

Referenced only by the pitch demo. Training never touches this file, per DRIFTCALL/CLAUDE.md §9: "Do not put TTS/ASR in the training loop".

### 8.5 Canonical `BriefRow` JSONL line (single row from `train/briefs.jsonl`)

One line from the published bundle, in canonical JSON (sorted keys, no whitespace, UTF-8 preserved for Devanagari):

```json
{"catalogue_hash":"3f9a8e7c2b1d4e5f6a0b9c8d7e6f5a4b3c2d1e0f9a8b7c6d5e4f3a2b1c0d9e8f","created_ts_ist":"2026-04-25T10:30:00+05:30","domain":"airline","drift_schedule":[{"description":"'price' field renamed to 'total_fare_inr'","domain":"airline","drift_type":"schema","from_version":"v1","pattern_id":"airline.price_rename","to_version":"v2","turn":4}],"episode_id":"s2_ep_00000042","generator_version":"driftcall-1.0.0","goal":{"constraints":{"budget_inr":8000,"time_window":"evening"},"domain":"airline","intent":"book_flight","language":"hinglish","seed_utterance":"Bhai Friday ko Bangalore jaana hai, cheapest flight evening mein, 8000 rupees max","slots":{"from":"HYD","to":"BLR","when":"2026-04-30"}},"i18n_sha256":"a1b2c3d4e5f60718293a4b5c6d7e8f901234567890abcdef1234567890abcdef","language":"hinglish","seed":42,"stage":2,"template_id":"airline.book.budget_timewindow","templates_sha256":"b2c3d4e5f60718293a4b5c6d7e8f901234567890abcdef1234567890abcdef12"}
```

Note: keys are alphabetically sorted
(`catalogue_hash`, `created_ts_ist`, `domain`, …), strings are NFC-normalized, and there is no whitespace between JSON tokens. The 64-hex hashes are full sha256 hex digests.

### 8.6 `README.md` YAML frontmatter (HF Hub dataset card)

The published `/driftcall-indic-briefs/README.md` begins with the following YAML frontmatter. The HF Dataset Viewer reads this block to auto-configure splits, license, and task tags.

```yaml
---
license: apache-2.0
language: [hi, ta, kn, en]
size_categories: [10K…
```

---

## 9. Open Questions

1. **HF org name.** The org placeholder in `/driftcall-indic-briefs` depends on `DRIFTCALL/CLAUDE.md` §8 kickoff-checklist item "HF org name locked". The publication script parameterizes the org via `--hf-org`; no code change is needed once it is locked, just a CLI arg at publication time. Does not block Phase D. **Sync note:** `DRIFTCALL/CLAUDE.md` §6 command table still lists the deprecated `huggingface-cli upload`; when the org name is locked, update that table to the modern `hf upload` in the same PR.
2. **SFT warmup corpus size: 200 vs 500 trajectories.** DESIGN.md §8.2 row 4 quotes the range "200–500". The exact count depends on Sarvam-M's cost/latency budget during one-shot synthesis. Recommend 200 as a floor (sufficient for format priming per the §10 training convergence target) and 500 as a ceiling if inference time permits. Resolution: Person C chooses during Phase C4; does not affect loader or schema.
3. **Audio manifest curation count.** DESIGN.md §9 implies a handful of real IndicVoices-R clips for pitch-demo realism, but does not specify an exact count. Recommend 20 curated clips (4 per language × 5 languages), balanced by speaker gender and dialect region. Resolution: Person D curates during Phase C5; this module only ensures the manifest format is stable.

### 9.1 Resolved

- **License-cache implementation (previously Open Q #4).** `data/.license_cache/{sgd,mtop}.idx` is a sqlite3 FTS5 index built by `scripts/build_license_cache.py` at dev time.
  Schema: `CREATE VIRTUAL TABLE licensed_text USING fts5(chunk_text, source_id);` with 5-gram tokenization. CI invokes this index (read-only) on each PR to verify that no `seed_utterance` or template variant in the publication bundle substring-matches any upstream CC-BY-SA text (≥ 10-token threshold, §3.4). The index is built once per upstream corpus version and committed to the repo, so rebuilds are only needed when SGD or MTOP themselves publish a new version. Determinism + reviewability win over per-PR rebuild cost.

---

**This doc tells you HOW the four dataset layers are shaped, loaded, validated, and published. Do not write loaders before a fresh critic returns `NOTHING_FURTHER`. Do not commit `data/*.yaml` without the pre-commit NFC + PII + license-header guards running. Do not ship the HF Hub bundle without the train/val disjointness and verbatim-match checks green.**
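As an illustration of the verbatim-match guard named above, the following sketch flags any candidate string that shares a run of 10 or more consecutive tokens with an upstream text. It deliberately swaps the sqlite3 FTS5 index for a plain in-memory set of 10-token shingles so that it runs with the standard library alone; whitespace tokenization, lowercasing, the function names, and the `sgd_demo:0` source id are all assumptions made for the example.

```python
THRESHOLD = 10  # tokens; the >= 10-token rule from the license check


def shingles(text, n=THRESHOLD):
    """All runs of n consecutive whitespace tokens (lowercased)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(corpus):
    """Map each n-token shingle to the first source id that contains it."""
    index = {}
    for source_id, text in corpus.items():
        for shingle in shingles(text):
            index.setdefault(shingle, source_id)
    return index


def find_verbatim_match(candidate, index):
    """Return a matching source id, or None if no n-token run overlaps."""
    for shingle in shingles(candidate):
        if shingle in index:
            return index[shingle]
    return None


# Hypothetical upstream row, standing in for a cached CC-BY-SA export.
upstream = {
    "sgd_demo:0": "i would like to book a table for two people at seven tonight please",
}
index = build_index(upstream)

# Shares exactly 10 consecutive tokens with the upstream row: flagged.
copied = "hey i would like to book a table for two people ok"
# Shares only 9 consecutive tokens: treated as incidental reuse.
incidental = "i would like to book a table for two now"

hit = find_verbatim_match(copied, index)
miss = find_verbatim_match(incidental, index)
```

Any run longer than the threshold necessarily contains a run of exactly 10 tokens, so checking fixed-size shingles is enough to catch longer copies as well.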