
datasets — Four-Layer Dataset Strategy + HF Hub Publication

Module path: driftcall/data/ (loaders) + data/ (on-disk artifacts)
Owner: Person C (Training & Data)
Implements: DESIGN.md §8 (Dataset Strategy: §§8.1, 8.2, 8.3, 8.4, 8.5, 8.6)
Consumed by: driftcall/task_generator.py (L1), driftcall/drift_injector.py (L2 drift patterns), driftcall/vendors/*.py (L2 API schemas), driftcall/audio/*.py (L3 audio), training/train_grpo.py (L4 SFT warmup)
Status: Design spec; no code yet.


1. Purpose

datasets is the authoring-and-loading contract for every piece of static data DriftCall depends on. It is not a training-data-pipeline module (that is training.md) and it does not compose rewards (that is rewards.md). It does exactly four things:

  1. Defines the four dataset layers per DESIGN.md §8.2 — task-brief templates, vendor API schemas + drift patterns, voice audio, and the optional SFT warmup corpus — as on-disk files with frozen YAML/JSON schemas.
  2. Loads each file through a deterministic, lazy, singleton loader that NFC-normalizes all Indic strings at load time (cross-references docs/modules/task_generator.md §3.4 invariant #8).
  3. Validates each file at load time: schema shape, type constraints, license header presence, train/val leakage, and consistency cross-references (e.g., every drift_slot_tags token in templates.yaml is targetable by ≥ 1 pattern in drifts.yaml).
  4. Publishes the public-facing bundle <team>/driftcall-indic-briefs to the Hugging Face Hub per DESIGN.md §8.6, packaged from a deterministic enumerate_variants() walk (see docs/modules/task_generator.md §2.2).

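No hand-authored file in data/ is ever written at runtime. All four layers are authored before Phase C ships, and the runtime only reads them. The exceptions are narrow: the dataset-packaging script (training/data_export.py) writes train/briefs.jsonl + val/briefs.jsonl once, pre-publication; training/sft_generator.py writes the optional L4 corpus in a one-shot offline run; and audio/tts_kokoro.py lazily fills the gitignored data/audio/synth/ cache (see §2.2).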

Supervision for GRPO comes from the 5 reward functions (DESIGN.md §8.1) — these files exist to parameterize the environment, not to teach the policy. L4 (SFT warmup) is optional per DESIGN.md §8.2 row 4.


2. Interface

2.1 Directory layout (on disk, shipped inside the env Docker image per DESIGN.md §11.1)

data/
├── task_briefs/
│   ├── templates.yaml                # L1 — hand-authored + procedural expansion source (§8.3)
│   └── i18n.yaml                     # L1 — Indic localized strings (cities, weekdays, dish names)
├── drift_patterns/
│   └── drifts.yaml                   # L2 — 20 drift patterns (§6.3, §8.2 row 2)
├── api_schemas/                      # L2 — frozen JSON Schema per vendor per version
│   ├── airline/
│   │   ├── v1.json
│   │   ├── v2.json
│   │   └── v3.json
│   ├── cab/
│   │   ├── v1.json
│   │   ├── v2.json
│   │   └── v3.json
│   ├── restaurant/
│   │   ├── v1.json
│   │   ├── v2.json
│   │   └── v3.json
│   ├── hotel/
│   │   ├── v1.json
│   │   ├── v2.json
│   │   └── v3.json
│   └── payment/
│       ├── v1.json
│       └── v2.json
├── audio/                            # L3 — synthesized + real voice clips (§8.2 row 3, §9)
│   ├── synth/                        # Kokoro-82M output, generated lazily; gitignored
│   │   └── .gitkeep
│   ├── real/                         # AI4Bharat IndicVoices-R held-out subset for pitch demo
│   │   └── MANIFEST.jsonl            # (utterance_id, path, language, license, sha256)
│   └── LICENSES.md                   # per-clip license attribution
└── sft_warmup/                       # L4 — optional Sarvam-M synthesized trajectories (§8.2 row 4)
    ├── trajectories.jsonl            # 200–500 correct rollouts
    └── LICENSES.md

Publication structure (HF Hub dataset repo <team>/driftcall-indic-briefs, DESIGN.md §8.6):

driftcall-indic-briefs/
├── README.md                         # model card — provenance, license, stats, reward caveats
├── train/briefs.jsonl                # 15,000 sampled episodes (seed, stage, language_weights, GoalSpec)
├── val/briefs.jsonl                  #    500 held-out episodes — seeds disjoint from train
├── drift_patterns.yaml               # exact copy of data/drift_patterns/drifts.yaml (20 patterns)
├── api_schemas/                      # exact copy of data/api_schemas/
└── LICENSE                           # bundle license (Apache 2.0 by default; see §3.4)

2.2 Per-file contracts

| File | Format | Authored by | Runtime writer | Schema anchor |
|------|--------|-------------|----------------|---------------|
| data/task_briefs/templates.yaml | YAML | Hand (20 seeds) | none | Template (§4.1 task_generator.md) |
| data/task_briefs/i18n.yaml | YAML | Hand | none | Mapping[LanguageCode, Mapping[str, str]] |
| data/drift_patterns/drifts.yaml | YAML | Hand | none | DriftPattern (§4.2 drift_injector.md) |
| data/api_schemas/<domain>/v<N>.json | JSON Schema 2020-12 | Hand | none | APISchema (§4.4 below) |
| data/audio/real/MANIFEST.jsonl | JSONL | Hand (curated from IndicVoices-R) | none | AudioClipManifest (§4.5) |
| data/audio/synth/*.wav | WAV 16 kHz mono | audio/tts_kokoro.py (lazy) | audio/tts_kokoro.py | n/a (generated) |
| data/sft_warmup/trajectories.jsonl | JSONL | Sarvam-M via HF Inference (offline) | training/sft_generator.py (one-shot) | SFTTrajectory (§4.6) |

2.3 Loaders (all return frozen dataclasses, all NFC-normalize Indic strings)

from __future__ import annotations
from pathlib import Path
from driftcall.data.models import (
    TemplateLibrary, I18nLibrary,
    DriftPatternLibrary, APISchemaRegistry,
    AudioManifest, SFTCorpus,
)

# L1 — task briefs
def load_templates(path: Path | str = "data/task_briefs/templates.yaml") -> TemplateLibrary: ...
def load_i18n(path: Path | str = "data/task_briefs/i18n.yaml") -> I18nLibrary: ...

# L2 — drift patterns + api schemas
def load_drift_patterns(path: Path | str = "data/drift_patterns/drifts.yaml") -> DriftPatternLibrary: ...
def load_api_schemas(root: Path | str = "data/api_schemas") -> APISchemaRegistry: ...

# L3 — audio manifest (paths + licenses only; actual WAVs resolved on-demand)
def load_audio_manifest(path: Path | str = "data/audio/real/MANIFEST.jsonl") -> AudioManifest: ...

# L4 — optional SFT warmup
def load_sft_corpus(path: Path | str = "data/sft_warmup/trajectories.jsonl") -> SFTCorpus: ...

Each loader is implemented as a module-level lazy singleton — the first call reads + validates + freezes; subsequent calls return the same instance. Not thread-safe for write (there is no write); safe for concurrent read.

2.4 HF Hub publication commands

Packaging runs once, pre-event. The script is training/data_export.py (see docs/modules/training.md for its interface — this module only defines the on-disk shape of what it writes).

Immutability. The published bundle is IMMUTABLE after publication. Re-running hf upload against the same data/publication/ tree produces a byte-identical bundle (invariant #6). Adding rows to val/briefs.jsonl requires a MINOR-version bump (v1.1, v1.2, …) and a new publication seed; the train/ split NEVER silently mutates between versions — a version bump either adds disjoint val rows or re-publishes train+val together, never partial mutation of train.

Seed selection (deterministic, locked). Train and val seeds are drawn by training/data_export.py using these two exact expressions:

import random
# Train: 15,000 seeds sampled without replacement from [0, 20_000_000).
train_seeds = random.Random(20260425).sample(range(0, 20_000_000), 15_000)
# Val: deterministic slice of 500 contiguous seeds in the reserved range.
val_seeds = list(range(20_000_000, 20_000_500))

Both lists are byte-identical across re-runs. The publication meta-seed 20260425 is locked; changing it requires a major-version bump and a new repo name or subfolder.

# Generate the sampled briefs by walking enumerate_variants() (see task_generator.md §2.2)
python3 training/data_export.py \
    --out-train data/publication/train/briefs.jsonl \
    --out-val   data/publication/val/briefs.jsonl \
    --n-train   15000 \
    --n-val     500 \
    --seed      20260425        # frozen publication seed; NOT a training seed

# Copy the static L2 artifacts verbatim
cp  data/drift_patterns/drifts.yaml  data/publication/drift_patterns.yaml
cp -r  data/api_schemas              data/publication/api_schemas

# Upload (see DRIFTCALL/CLAUDE.md §6 command table and huggingface-skills:hf-cli).
# NOTE: `hf` is the modern CLI replacing the deprecated `huggingface-cli`.
# The org is already part of the repo id, so no separate org flag is passed.
hf upload <org>/driftcall-indic-briefs \
    data/publication/ . \
    --repo-type dataset \
    --commit-message "v1.0 publication - locked 2026-04-25"

The publication seed 20260425 is fixed and recorded in the README. Re-running the script produces a byte-identical bundle (determinism contract inherited from task_generator.generate per docs/modules/task_generator.md §3.1).

Doc-sync flag: DRIFTCALL/CLAUDE.md §6 still lists the deprecated huggingface-cli upload command; update that table to hf upload in the same PR that lands this doc (captured as Open Question #1 / CLAUDE.md sync item).


3. Behavior Spec

3.1 Authoring conventions

NFC normalization. Every string value in every YAML/JSON file is NFC-normalized before it is committed. The loaders re-normalize defensively at load time (invariant #8, docs/modules/task_generator.md §3.4). Editors used during authoring (VS Code, vim) must be configured to save NFC — a pre-commit hook (ruff-adjacent script) runs python -c "import unicodedata, sys; ..." to reject NFD commits.
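As an illustration of the defensive re-normalization step, a loader could walk the parsed YAML/JSON tree and normalize every string it finds. This is a minimal sketch, not the spec's implementation: the helper name normalize_nfc and the recursion strategy are assumptions introduced here.

```python
import unicodedata
from typing import Any


def normalize_nfc(value: Any) -> Any:
    """Recursively NFC-normalize every string in a parsed YAML/JSON tree.

    Hypothetical helper: the name and traversal are illustrative assumptions.
    """
    if isinstance(value, str):
        return unicodedata.normalize("NFC", value)
    if isinstance(value, dict):
        # Keys are normalized too, since i18n keys may contain Indic text.
        return {normalize_nfc(k): normalize_nfc(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return type(value)(normalize_nfc(v) for v in value)
    return value  # numbers, booleans, None pass through unchanged


# "é" stored in decomposed (NFD) form: "e" followed by U+0301.
nfd_tree = {"city": ["Cafe\u0301"]}
nfc_tree = normalize_nfc(nfd_tree)
```

After this pass, unicodedata.is_normalized("NFC", s) holds for every string in the tree, which is the property invariant #1 tests for.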

License headers. Every hand-authored file begins with a YAML comment block declaring SPDX identifier, author, year, and upstream attribution if the content is derived from a public dataset (§8.5):

# SPDX-License-Identifier: Apache-2.0
# Copyright 2026 DriftCall Team
# Derived-from: AmazonScience/MASSIVE (intent taxonomy, Apache-2.0)
# See data/LICENSES.md for full attribution chain.

JSON files carry the same metadata in a root-level $comment field. $comment is a standard JSON Schema keyword (introduced in draft-07 and retained in 2020-12) that validators ignore, so it does not affect validation.

Seed determinism. Every numeric or stochastic sampling decision in template/drift authoring threads through a fixed seed: the publication seed 20260425, the template-expansion seed 42, or the curriculum-language seeds declared in DESIGN.md §10.3. No wall-clock, no random.random(), no host-machine entropy.

No PII. Authored strings never contain real names, phone numbers, email addresses, booking reference numbers, card PANs, or IP addresses. The from / to fields use IATA codes; the pickup / drop fields use fictional neighborhood landmarks. A CI lint (grep -rEn '[0-9]{10}' data/) runs before every commit and fails on any 10-digit run outside the allowed IATA/timestamp contexts.

Eval-set held out from training. The 500-episode val set uses seeds drawn from a reserved range (seed ∈ [20_000_000, 20_000_500)); training always draws seeds from [0, 20_000_000). The publication script asserts disjointness at write time (see §5 leak detection). The exact seed-selection expressions are specified in §2.4.
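The partitioning and the write-time disjointness assertion can be sketched as follows. The two seed expressions are the ones locked in §2.4; the function name draw_publication_seeds and the local TrainValLeakError stand-in are assumptions for illustration only.

```python
import random

TRAIN_RANGE = range(0, 20_000_000)
VAL_RANGE = range(20_000_000, 20_000_500)


class TrainValLeakError(Exception):
    """Local stand-in for the spec's TrainValLeakError (see §5)."""


def draw_publication_seeds(meta_seed: int = 20260425, n_train: int = 15_000):
    """Draw train/val seeds and fail loudly if the two sets ever intersect.

    Hypothetical helper: only the two seed expressions come from §2.4.
    """
    train = random.Random(meta_seed).sample(TRAIN_RANGE, n_train)
    val = list(VAL_RANGE)
    leaked = set(train).intersection(val)
    if leaked:
        raise TrainValLeakError(f"seeds present in both splits: {sorted(leaked)[:5]}")
    return train, val


train_seeds, val_seeds = draw_publication_seeds()
```

With the ranges above the check can never fire; it only becomes meaningful if someone later edits one range without the other.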

Canonical JSON key ordering. Every row in train/briefs.jsonl and val/briefs.jsonl is serialized with:

json.dumps(row, ensure_ascii=False, sort_keys=True, separators=(",", ":"))

This is enforced as an invariant precondition for byte-identical re-runs (see §3.5 invariant #6). ensure_ascii=False preserves Devanagari / Tamil / Kannada script without \uXXXX escaping; sort_keys=True canonicalizes key order; separators=(",", ":") eliminates whitespace variance across Python/libc versions.
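To make the effect of the three arguments concrete, the sketch below serializes the same data with two different key insertion orders and hashes the result. The row contents are illustrative only and are not the BriefRow schema; canonical_line is a hypothetical helper name.

```python
import hashlib
import json


def canonical_line(row: dict) -> str:
    """Serialize one row exactly as the canonical-dump rule above prescribes."""
    return json.dumps(row, ensure_ascii=False, sort_keys=True, separators=(",", ":"))


# Same data, different insertion order (illustrative fields, not a real BriefRow).
row_a = {"seed": 42, "language": "hi", "city": "बेंगलुरु"}
row_b = {"city": "बेंगलुरु", "language": "hi", "seed": 42}

digest_a = hashlib.sha256(canonical_line(row_a).encode("utf-8")).hexdigest()
digest_b = hashlib.sha256(canonical_line(row_b).encode("utf-8")).hexdigest()

print(canonical_line(row_a))  # {"city":"बेंगलुरु","language":"hi","seed":42}
print(digest_a == digest_b)   # True
```

Because the bytes are identical regardless of how the dict was built, the SHA-256 comparison in invariant #6 does not depend on dict construction order.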

Per-row data lineage. Every BriefRow carries the full six-tuple (template_id, seed, stage, language, domain, generator_version) plus three corpus-version hashes (catalogue_hash, templates_sha256, i18n_sha256). This is enforced as an invariant (§3.5 invariant #9) so that any published row is re-derivable from the triple (seed, stage, library@hash) alone.

3.2 Lazy singleton loaders

import threading
from pathlib import Path

# sketch of the module-level pattern, mirrored in every loader
_LIBRARIES: dict[Path, TemplateLibrary] = {}     # per-path cache
_LIBRARY_LOCK = threading.Lock()

def load_templates(path: Path | str = "data/task_briefs/templates.yaml") -> TemplateLibrary:
    key = Path(path)
    library = _LIBRARIES.get(key)
    if library is None:
        with _LIBRARY_LOCK:
            library = _LIBRARIES.get(key)         # re-check under the lock
            if library is None:
                library = _load_and_validate_templates(key)
                _LIBRARIES[key] = library
    return library

The singleton is path-keyed — if a test passes a different path, a fresh instance is built (still cached in a per-path dict). Production callers always use the default path.

3.3 Schema validation at load time

Each loader does three passes:

  1. YAML/JSON parse. Failure → MalformedYAMLError / MalformedJSONError with line/column.
  2. Type + shape validation against the dataclass schema in §4. Failure → DatasetSchemaError naming the offending key.
  3. Cross-file consistency check (loader-specific):
    • load_drift_patterns asserts pattern.id values are unique, exactly 20 patterns total, drift_type ∈ {schema,policy,tnc,pricing,auth}, and every from_version/to_version references an existing schema file in data/api_schemas/<domain>/.
    • load_templates asserts every drift_slot_tags token is matched by ≥ 1 DriftPattern.mutation key or value (airline.total_fare_inr must be targetable, else why tag it).
    • load_api_schemas asserts each v<N>.json validates as JSON Schema 2020-12 against the meta-schema via jsonschema.Draft202012Validator.check_schema.
    • load_audio_manifest asserts every referenced path exists on disk and its sha256 matches the recorded hash.

Failures here abort env startup; HTTP 503 is served until the data is fixed (mirrors DriftCatalogueError handling in docs/modules/drift_injector.md §5).
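The third pass for load_drift_patterns might look roughly like the sketch below, operating on already-parsed pattern dicts. The exception names come from §5; the function name, the dict-based pattern shape, the "domain" key, and the example pattern id airline.fare_rename are assumptions made for illustration (the real DriftPattern lives in drift_injector.md §4.2).

```python
from __future__ import annotations

ALLOWED_DRIFT_TYPES = {"schema", "policy", "tnc", "pricing", "auth"}


class DatasetError(Exception): ...
class DatasetSchemaError(DatasetError): ...
class DuplicateDriftPatternIdError(DatasetError): ...
class DriftPatternOrphanError(DatasetError): ...


def check_drift_patterns(patterns: list[dict], schema_versions: dict[str, set[str]],
                         expected_count: int = 20) -> None:
    """Cross-file consistency pass; raises on the first violation found."""
    seen: set[str] = set()
    for pattern in patterns:
        if pattern["id"] in seen:
            raise DuplicateDriftPatternIdError(pattern["id"])
        seen.add(pattern["id"])
        if pattern["drift_type"] not in ALLOWED_DRIFT_TYPES:
            raise DatasetSchemaError(f"{pattern['id']}: bad drift_type {pattern['drift_type']!r}")
        # Versions that actually exist as files under data/api_schemas/<domain>/.
        known = schema_versions.get(pattern["domain"], set())
        for key in ("from_version", "to_version"):
            if pattern[key] not in known:
                raise DriftPatternOrphanError(f"{pattern['id']}: {key}={pattern[key]!r} has no schema file")
    if len(patterns) != expected_count:
        raise DatasetSchemaError(f"expected {expected_count} patterns, got {len(patterns)}")


versions = {"airline": {"v1", "v2", "v3"}, "payment": {"v1", "v2"}}
good = [
    {"id": "airline.fare_rename", "domain": "airline", "drift_type": "schema",
     "from_version": "v1", "to_version": "v2"},
    {"id": "payment.mfa_required", "domain": "payment", "drift_type": "auth",
     "from_version": "v1", "to_version": "v2"},
]
check_drift_patterns(good, versions, expected_count=2)  # toy count; the spec requires 20
```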

3.4 License compatibility check

Per §8.5 the public datasets we reference carry mixed licenses:

| Upstream | License | Redistributable in our bundle? |
|----------|---------|--------------------------------|
| AI4Bharat IndicVoices-R | Apache-2.0 | Yes, with attribution |
| MASSIVE (Amazon) | Apache-2.0 | Yes, with attribution |
| Schema-Guided Dialogue (SGD) | CC-BY-SA | Inspiration only: derived schema patterns, not verbatim rows |
| MTOP (Facebook) | MIT-style (see original repo) | Inspiration only: derived Hindi task phrasings, not verbatim rows |
| APIs.guru | CC0 | Yes, no attribution required but recorded |

The bundle license (LICENSE at the root of <team>/driftcall-indic-briefs) is Apache-2.0. Because CC-BY-SA is copyleft-adjacent, we never copy SGD or MTOP rows verbatim — only inspiration (intent labels, schema shapes). A CI check enforces that no string in train/briefs.jsonl or val/briefs.jsonl appears verbatim (≥ 10-token suffix match) in a cached SGD/MTOP export. See §5 for the error mode and §7 edge case #3 for the exact detection rule.
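One simple way to approximate the verbatim-overlap check is to compare sets of 10-token n-grams. Note that this is a plain n-gram set intersection, not the suffix-array approach mentioned in §7, and that whitespace tokenization, the function names, and the example strings (which are made up, not taken from SGD or MTOP) are all assumptions.

```python
from __future__ import annotations


def token_ngrams(text: str, n: int = 10) -> set[tuple[str, ...]]:
    """All contiguous n-token windows of a whitespace-tokenized string."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def find_verbatim_overlap(candidate: str, upstream_rows: list[str], n: int = 10) -> int | None:
    """Return the index of the first upstream row sharing an n-token run, else None."""
    candidate_grams = token_ngrams(candidate, n)
    for index, row in enumerate(upstream_rows):
        if candidate_grams & token_ngrams(row, n):
            return index
    return None


# Made-up strings for illustration only.
upstream = ["please book me a flight from delhi to mumbai on friday morning before nine"]
copied = "hello please book me a flight from delhi to mumbai on thanks"   # shares 10 tokens
short = "please book me a flight from delhi to mumbai"                    # only 9 tokens

hit = find_verbatim_overlap(copied, upstream)
miss = find_verbatim_overlap(short, upstream)
```

A hit would translate into a LicenseConflictError in the publication script; overlaps shorter than the threshold are treated as incidental reuse.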

Full verbatim license text (MANDATORY). The root LICENSE file MUST contain the full verbatim Apache 2.0 license text as published at https://www.apache.org/licenses/LICENSE-2.0.txt — NOT a URL, NOT a one-line SPDX identifier, NOT a summary. The same requirement applies to data/audio/LICENSES.md and data/sft_warmup/LICENSES.md (both must embed the full Apache-2.0 text plus per-clip / per-trajectory attribution rows). CI check tests/data/test_license_text.py verifies that the byte length of each LICENSE file is ≥ 11,000 (Apache-2.0 full text is ~11,357 bytes) and that the canonical "Apache License / Version 2.0, January 2004" header string is present.
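The two conditions checked by tests/data/test_license_text.py could be expressed as below. This is a sketch under stated assumptions: the helper name is hypothetical, and the " / " in the header string above is interpreted as a line break, so the two header lines are checked separately.

```python
MIN_LICENSE_BYTES = 11_000
# Assumption: the header spans two lines in the canonical text.
REQUIRED_HEADER_LINES = ("Apache License", "Version 2.0, January 2004")


def license_text_problems(data: bytes) -> list[str]:
    """Return human-readable problems; an empty list means the file passes."""
    problems: list[str] = []
    if len(data) < MIN_LICENSE_BYTES:
        problems.append(f"only {len(data)} bytes, expected >= {MIN_LICENSE_BYTES}")
    text = data.decode("utf-8", errors="replace")
    for line in REQUIRED_HEADER_LINES:
        if line not in text:
            problems.append(f"missing header line: {line!r}")
    return problems


# A bare SPDX identifier fails all three checks.
spdx_only = license_text_problems(b"Apache-2.0")
# Synthetic stand-in for a full-length file (not the real license text).
padded = license_text_problems(
    b"Apache License\nVersion 2.0, January 2004\n" + b"x" * MIN_LICENSE_BYTES
)
```

The length floor alone cannot prove the text is verbatim; it only rules out the URL, SPDX-identifier, and summary shortcuts that the requirement forbids.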

LICENSES.md schema (L3 audio + L4 SFT warmup). Both data/audio/LICENSES.md and data/sft_warmup/LICENSES.md follow the same markdown format:

  1. A preamble (5–15 lines) naming the bundle and linking back to the root LICENSE.
  2. The full verbatim Apache-2.0 text (as above).
  3. A single markdown table with exactly these columns, one row per clip (L3) or per trajectory (L4):
| utterance_id | upstream_source      | upstream_license | attribution_required | notes                         |
|--------------|----------------------|------------------|----------------------|-------------------------------|
| iv_r_kn_0451 | IndicVoices-R        | Apache-2.0       | yes                  | speaker consent verified      |
| sft_00042    | Sarvam-M (synthesis) | Apache-2.0       | no                   | rollout seed 42, stage 2      |

For L4 the utterance_id column is replaced by trajectory_id but the other four columns are identical. Loaders do not parse these tables at runtime; they are human-audit artifacts enforced only by pre-commit schema check scripts/check_licenses_md.py.

3.5 Invariants (enforced by tests)

  1. Every string value in every loaded library is NFC (unicodedata.is_normalized("NFC", s) == True).
  2. load_drift_patterns() returns exactly 20 patterns (matches docs/modules/drift_injector.md §4.4 and DESIGN.md §6.3).
  3. load_api_schemas() returns exactly {airline:v1,v2,v3 + cab:v1,v2,v3 + restaurant:v1,v2,v3 + hotel:v1,v2,v3 + payment:v1,v2} = 14 schemas across 5 domains (matches DESIGN.md §8.6 bundle enumeration and §5 vendor catalogue).
  4. load_templates() library satisfies: every template has ≥ 1 variant in every LanguageCode (hi, ta, kn, en, hinglish); every primary-domain pattern's mutation field set is a subset of the union of drift_slot_tags across that domain's templates. The two transversal payment-auth patterns (payment.auth_scope_upgrade, payment.mfa_required) are EXEMPT from this subset check — they mutate shared payment fields (token, scope, mfa_code) that are intentionally not present in primary-domain goal templates and therefore cannot appear in drift_slot_tags.
  5. Publication invariant: train seed set ∩ val seed set = ∅.
  6. Publication invariant: running data_export.py twice with the same seed produces byte-identical train/briefs.jsonl + val/briefs.jsonl (SHA-256 match). Enforced via canonical JSON dump (§3.1): json.dumps(row, ensure_ascii=False, sort_keys=True, separators=(",", ":")).
  7. Every file in data/ begins with an SPDX license header (YAML comment or JSON $comment).
  8. No 10-digit digit-run in any authored string outside the timestamp / IATA allowed contexts (PII guard).
  9. Per-row data lineage. Every BriefRow (§4.7) in the published train/ and val/ splits carries all of: template_id, seed, stage, language, domain, generator_version, catalogue_hash, templates_sha256, i18n_sha256. At eval-load time, catalogue_hash / templates_sha256 / i18n_sha256 must match the currently-loaded library hashes, else CatalogueHashMismatchError is raised (§5).
  10. Bundle immutability. After publication (§2.4), the train split SHA-256 MUST match across all future re-runs of hf upload; adding val rows requires a minor-version bump, never a silent train-split mutation.

4. Data Structures

All types are frozen dataclasses, immutable after load. Mappings are wrapped in types.MappingProxyType.

4.1 TemplateLibrary (re-exported from task_generator.models — single source of truth)

@dataclass(frozen=True)
class TemplateLibrary:
    templates: tuple[Template, ...]                                      # exactly 20 at v1.0
                                                                         # (4 domains × 5 templates);
                                                                         # ≥ 20 after minor-version bumps
    cities_by_domain: Mapping[Domain, tuple[str, ...]]                   # 10 per domain
    i18n: Mapping[LanguageCode, Mapping[str, str]]                       # merged from i18n.yaml
    source_sha256: str                                                   # hash of templates.yaml bytes

The templates tuple length is exactly 20 at v1.0 publication (4 domains × 5 templates per domain). Post-v1.0 minor-version bumps may grow this count monotonically; the invariant len(templates) >= 20 and len(templates) % 5 == 0 holds across all future versions. load_templates asserts len(templates) == 20 at v1.0 via the generator_version check.

Authoritative schema lives in docs/modules/task_generator.md §4. This module re-exports the type so callers of load_templates receive the same object that task_generator.generate consumes.

4.2 I18nLibrary

@dataclass(frozen=True)
class I18nLibrary:
    strings: Mapping[LanguageCode, Mapping[str, str]]
    # e.g., strings["hi"]["BLR"] = "बेंगलुरु"
    # strings["ta"]["Monday"] = "திங்கள்"
    source_sha256: str

Merged into TemplateLibrary.i18n by load_templates, but exposed independently for pure-i18n use cases (e.g., the Gradio demo UI localizing labels).

4.3 DriftPatternLibrary

@dataclass(frozen=True)
class DriftPatternLibrary:
    patterns: Mapping[str, DriftPattern]                                 # keyed by DriftPattern.id
    by_domain: Mapping[str, tuple[str, ...]]                             # domain → pattern_ids
    by_type:   Mapping[str, tuple[str, ...]]                             # drift_type → pattern_ids
    source_sha256: str

DriftPattern itself is defined in docs/modules/drift_injector.md §4.2 (see the DriftPattern dataclass snippet). This module owns loading, drift_injector owns applying.

4.4 APISchemaRegistry

@dataclass(frozen=True)
class APISchema:
    domain: str                       # "airline" | "cab" | "restaurant" | "hotel" | "payment"
    version: str                      # "v1" | "v2" | "v3"
    schema: Mapping[str, Any]         # parsed JSON Schema 2020-12 document
    source_sha256: str

@dataclass(frozen=True)
class APISchemaRegistry:
    schemas: Mapping[str, Mapping[str, APISchema]]
    # schemas["airline"]["v2"] = APISchema(...)

    def get(self, domain: str, version: str) -> APISchema: ...
    def versions(self, domain: str) -> tuple[str, ...]: ...              # ordered v1,v2,v3

Each v<N>.json is a valid JSON Schema 2020-12 document describing the tool-response shape for that domain at that drift version. Vendors (DESIGN.md §5) validate outgoing responses against these schemas at test time; the injector (docs/modules/drift_injector.md §3) consults version transitions via these files.
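The two stubbed registry methods could be filled in along these lines. This is a sketch, not the spec: raising KeyError for an unknown domain/version and sorting versions by their numeric suffix are assumptions, and the example registry holds placeholder schemas.

```python
from __future__ import annotations

from dataclasses import dataclass
from types import MappingProxyType
from typing import Any, Mapping


@dataclass(frozen=True)
class APISchema:
    domain: str
    version: str
    schema: Mapping[str, Any]
    source_sha256: str


@dataclass(frozen=True)
class APISchemaRegistry:
    schemas: Mapping[str, Mapping[str, APISchema]]

    def get(self, domain: str, version: str) -> APISchema:
        try:
            return self.schemas[domain][version]
        except KeyError:
            # Assumption: the spec does not name the error type for this case.
            raise KeyError(f"no schema for {domain}/{version}") from None

    def versions(self, domain: str) -> tuple[str, ...]:
        # Sort by numeric suffix so "v10" would sort after "v2".
        return tuple(sorted(self.schemas[domain], key=lambda v: int(v[1:])))


def _placeholder(domain: str, version: str) -> APISchema:
    return APISchema(domain, version, MappingProxyType({"type": "object"}), "0" * 64)


registry = APISchemaRegistry(MappingProxyType({
    "payment": MappingProxyType({
        "v2": _placeholder("payment", "v2"),
        "v1": _placeholder("payment", "v1"),
    }),
}))
```

Wrapping both mapping levels in MappingProxyType means an attempt to add or replace a schema after load raises TypeError, which matches the "immutable after load" rule at the top of §4.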

4.5 AudioManifest

@dataclass(frozen=True)
class AudioClip:
    utterance_id: str                 # stable; matches a curated IndicVoices-R clip id
    path: Path                        # relative to data/audio/
    language: LanguageCode
    source: Literal["real_indicvoices_r"]   # manifest is authored-only; synth clips
                                            # are lazily generated and NEVER recorded here
    license: str                      # SPDX identifier
    sha256: str
    duration_s: float                 # ≤ 20.0 (DESIGN.md §9 upper bound)

@dataclass(frozen=True)
class AudioManifest:
    clips: tuple[AudioClip, ...]
    source_sha256: str                # hash of MANIFEST.jsonl bytes

The source field is a single-value Literal — the manifest is authored-only. Synth clips generated on-demand by audio/tts_kokoro.py are never recorded in the manifest (they are transient, gitignored under data/audio/synth/). This keeps the manifest auditable and its SHA-256 stable across Kokoro model-weight updates.
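The per-clip existence and checksum check that load_audio_manifest performs (§3.3) can be sketched as follows. The exception names come from §5; the function names, chunk size, and the fake clip bytes are assumptions for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


class DatasetFileMissingError(Exception): ...
class ChecksumMismatchError(Exception): ...


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large clips are never fully loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_clip(audio_root: Path, relative_path: Path, expected_sha256: str) -> None:
    full_path = audio_root / relative_path
    if not full_path.is_file():
        raise DatasetFileMissingError(str(full_path))
    actual = sha256_of_file(full_path)
    if actual != expected_sha256:
        raise ChecksumMismatchError(f"{relative_path}: expected {expected_sha256}, got {actual}")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "real").mkdir()
    clip = root / "real" / "clip_0001.wav"
    clip.write_bytes(b"RIFF....fake-wav-bytes")          # placeholder, not real audio
    recorded = sha256_of_file(clip)
    verify_clip(root, Path("real/clip_0001.wav"), recorded)  # passes silently
    try:
        verify_clip(root, Path("real/clip_0001.wav"), "0" * 64)
        mismatch_detected = False
    except ChecksumMismatchError:
        mismatch_detected = True
```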

4.6 SFTCorpus (L4, optional)

@dataclass(frozen=True)
class SFTTrajectory:
    episode_id: int
    goal_seed: int                    # same seed space as train/; NEVER a val seed (§3.1)
    turns: tuple[Mapping[str, Any], ...]   # role/content pairs, JSON-serializable
    stage: Literal[1, 2, 3]
    reward_breakdown: Mapping[str, float]  # R1..R5 + total, from the env at synthesis time
    generation_batch_id: str          # uuid4 per invocation of sft_generator.py
    generation_index: int             # monotonic within a batch, 0..N-1

@dataclass(frozen=True)
class SFTCorpus:
    trajectories: tuple[SFTTrajectory, ...]
    generator: Literal["sarvam-m-hf-inference"]
    generation_seed: int
    target_count: int                 # from --target-count CLI flag
    source_sha256: str

Consumed by training/train_grpo.py only when --sft-warmup-steps > 0 is passed. The corpus is absent by default. If the file is missing, load_sft_corpus raises DatasetFileMissingError (§5); the training script catches it, logs a non-fatal warning, and proceeds without warmup.

Atomic append + restart recovery (training/sft_generator.py):

  • Each trajectory is appended to data/sft_warmup/trajectories.jsonl as a single canonical-JSON line followed by os.fsync(fd) on the file descriptor, ensuring durability before the next Sarvam-M API call. Partial-write recovery is therefore line-granular.
  • Every row carries generation_batch_id (uuid4, generated once per invocation of sft_generator.py) and generation_index (monotonic integer 0..N-1 within that batch).
  • On restart, sft_generator.py reads the existing trajectories.jsonl, reconstructs the (goal_seed, generation_index) pairs already completed, and resumes from the next uncompleted seed in its deterministic seed list. This tolerates Sarvam-M rate-limit drops, OOM kills, and SIGKILL.
  • After all generation completes, the script performs a final count validation: if len(trajectories) != target_count, it raises PartialSFTCorpusError (§5). The loader load_sft_corpus also performs this check at load time and raises the same error if the on-disk row count does not match the target_count field.
  • Edge case #11 (§7) walks through a concrete partial-generation-recovery scenario.
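The append-then-fsync step and the restart scan described in the bullets above can be sketched as follows. The function names are hypothetical, the rows carry only two of the SFTTrajectory fields, and handling of a torn final line is simplified: it is ignored when scanning, and a real implementation would also need to truncate it before appending again.

```python
from __future__ import annotations

import json
import os
import tempfile
from pathlib import Path


def append_trajectory(path: Path, row: dict) -> None:
    """Append one canonical-JSON line and fsync before returning."""
    line = json.dumps(row, ensure_ascii=False, sort_keys=True, separators=(",", ":"))
    with path.open("a", encoding="utf-8") as fh:
        fh.write(line + "\n")
        fh.flush()               # push Python's buffer to the OS
        os.fsync(fh.fileno())    # ask the OS to persist it to disk


def completed_seeds(path: Path) -> set[int]:
    """Seeds of fully written rows; a final line without a newline is skipped."""
    if not path.exists():
        return set()
    done: set[int] = set()
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            if line.endswith("\n"):
                done.add(json.loads(line)["goal_seed"])
    return done


def remaining_seeds(seed_list: list[int], path: Path) -> list[int]:
    """Preserve the deterministic seed order, dropping seeds already on disk."""
    done = completed_seeds(path)
    return [seed for seed in seed_list if seed not in done]


with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp) / "trajectories.jsonl"
    append_trajectory(corpus, {"goal_seed": 11, "generation_index": 0})
    append_trajectory(corpus, {"goal_seed": 22, "generation_index": 1})
    todo = remaining_seeds([11, 22, 33], corpus)   # a restart would resume at seed 33
```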

4.7 BriefRow — canonical publication-row contract

Every line of train/briefs.jsonl and val/briefs.jsonl in the published HF Hub bundle is exactly one serialized BriefRow. This dataclass is the single-source-of-truth schema for everything an offline consumer (a judge re-running eval, a third party reproducing our experiments) needs to re-derive the episode from (seed, stage, library@hash) alone.

from __future__ import annotations
from dataclasses import dataclass
from typing import Literal
from driftcall.models import GoalSpec, DriftEvent, LanguageCode

@dataclass(frozen=True)
class BriefRow:
    episode_id: str                    # deterministic from seed + stage (e.g. "s2_ep_00000042")
    seed: int                          # original episode seed (train: [0, 20_000_000),
                                       #                         val:   [20_000_000, 20_000_500))
    stage: Literal[1, 2, 3]            # curriculum stage at publication time
    language: LanguageCode             # "hi" | "ta" | "kn" | "en" | "hinglish"
    domain: Literal["airline", "cab", "restaurant", "hotel"]
    template_id: str                   # e.g. "airline.book.budget_timewindow"
    goal: GoalSpec                     # full GoalSpec (slots + constraints + seed_utterance)
    drift_schedule: tuple[DriftEvent, ...]   # schedule pre-computed by drift_injector
    catalogue_hash: str                # sha256(drift_patterns/drifts.yaml bytes)
    templates_sha256: str              # sha256(task_briefs/templates.yaml bytes)
    i18n_sha256: str                   # sha256(task_briefs/i18n.yaml bytes)
    generator_version: str             # e.g. "driftcall-1.0.0" — semver of the generator
    created_ts_ist: str                # ISO 8601 with +05:30 offset, e.g. "2026-04-25T10:30:00+05:30"

Serialization is always canonical: json.dumps(asdict(row), ensure_ascii=False, sort_keys=True, separators=(",", ":")). A concrete JSONL line example is given in §8.5.

At eval-load time, the loader re-hashes the currently-loaded drifts.yaml / templates.yaml / i18n.yaml and compares against catalogue_hash / templates_sha256 / i18n_sha256. Any mismatch raises CatalogueHashMismatchError (§5) — this prevents silent semantic drift where a consumer runs train/briefs.jsonl against a newer catalogue and gets different episodes.
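A consumer-side version of this comparison might look like the sketch below. The three field names come from BriefRow; the function name, the dict-based row, and the placeholder file contents are assumptions, and the local exception class stands in for the one listed in §5.

```python
import hashlib


class CatalogueHashMismatchError(Exception):
    """Local stand-in for the spec's CatalogueHashMismatchError (see §5)."""


# BriefRow hash field -> the library file it fingerprints.
HASH_FIELDS = {
    "catalogue_hash": "drifts.yaml",
    "templates_sha256": "templates.yaml",
    "i18n_sha256": "i18n.yaml",
}


def check_row_hashes(row: dict, loaded_hashes: dict) -> None:
    """Raise if a published row was generated against different library files."""
    for field, source in HASH_FIELDS.items():
        if row[field] != loaded_hashes[field]:
            raise CatalogueHashMismatchError(
                f"{field} ({source}): row has {row[field][:12]}, "
                f"loaded library has {loaded_hashes[field][:12]}"
            )


# Placeholder bytes standing in for the three library files.
loaded = {
    field: hashlib.sha256(f"contents of {source}".encode("utf-8")).hexdigest()
    for field, source in HASH_FIELDS.items()
}
matching_row = {"seed": 42, **loaded}
stale_row = {**matching_row, "catalogue_hash": hashlib.sha256(b"older drifts.yaml").hexdigest()}

check_row_hashes(matching_row, loaded)  # passes silently
```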


5. Error Modes

All exceptions subclass DatasetError(Exception). Each has the trigger listed below and is covered by a unit test.

| Exception | Trigger | Where raised |
|-----------|---------|--------------|
| DatasetFileMissingError | data/<path> absent on disk | every loader |
| MalformedYAMLError | YAML parse failure (syntax) | load_templates, load_i18n, load_drift_patterns |
| MalformedJSONError | JSON parse failure (syntax) | load_api_schemas, load_audio_manifest, load_sft_corpus |
| DatasetSchemaError | type/shape validation failure (missing required key, wrong type, extra unknown key) | every loader |
| UnknownLanguageKeyError | a language key ∉ LanguageCode = {"hi","ta","kn","en","hinglish"} appears in templates.yaml or i18n.yaml | load_templates, load_i18n |
| LicenseConflictError | a CC-BY-SA or GPL-licensed row appears in the publication bundle while the bundle is Apache-2.0; or a verbatim ≥ 10-token run matches a CC-BY-SA upstream row | publication script (see §3.4) |
| TrainValLeakError | train and val seed sets intersect; or an SFTTrajectory.goal_seed sits in the val reserved range [20_000_000, 20_000_500) | publication script, load_sft_corpus |
| DriftPatternOrphanError | drifts.yaml references a from_version/to_version not present in data/api_schemas/<domain>/ | load_drift_patterns |
| ChecksumMismatchError | AudioClip.sha256 does not match the on-disk file's hash | load_audio_manifest |
| UnicodeNFDError | any loaded string fails unicodedata.is_normalized("NFC", s) | every loader |
| PIIDetectedError | a 10-digit run appears outside allowed contexts in authored text | every text-bearing loader; also CI lint |
| DuplicateDriftPatternIdError | two entries in drifts.yaml share an id | load_drift_patterns |
| CatalogueHashMismatchError | a BriefRow in train/briefs.jsonl or val/briefs.jsonl carries a catalogue_hash / templates_sha256 / i18n_sha256 that does not match the currently-loaded library (drifts.yaml / templates.yaml / i18n.yaml) hashes | eval-load path (consumers of published bundle) |
| PartialSFTCorpusError | len(SFTCorpus.trajectories) != target_count at final-count validation; raised by training/sft_generator.py post-generation and by load_sft_corpus at load time | load_sft_corpus, training/sft_generator.py |

No silent fallbacks. If data/sft_warmup/trajectories.jsonl is missing, load_sft_corpus raises DatasetFileMissingError; the training script is the one that treats this as non-fatal (falls back to no-SFT warmup). Loaders themselves never substitute defaults.


6. Dependencies

6.1 Reads

  • data/task_briefs/templates.yaml, data/task_briefs/i18n.yaml
  • data/drift_patterns/drifts.yaml
  • data/api_schemas/**/*.json
  • data/audio/real/MANIFEST.jsonl + the .wav files it references
  • data/sft_warmup/trajectories.jsonl (optional)

6.2 Imports

  • driftcall.models: GoalSpec, LanguageCode, Domain
  • Python stdlib: json, hashlib, pathlib, unicodedata, threading, dataclasses, typing, types
  • Third-party: PyYAML, jsonschema (for JSON Schema 2020-12 meta-validation)

6.3 Consumers

Consuming modules and the exact function they call:

  • docs/modules/task_generator.md: load_templates() in task_generator.generate()'s lazy-singleton _get_library().
  • docs/modules/drift_injector.md: load_drift_patterns() in the injector's module-level registry; consults DESIGN.md §6.3 pattern catalogue.
  • docs/modules/vendors.md: load_api_schemas() at vendor import time; each vendor asserts its own response shape against the schema in test fixtures.
  • docs/modules/audio.md: load_audio_manifest() for the pitch demo (§9.5 IndicVoices-R clip playback).
  • docs/modules/training.md: load_sft_corpus() behind the --sft-warmup-steps flag; also invokes training/data_export.py, which calls task_generator.enumerate_variants() to produce the publication briefs.

6.4 Publishes to

  • HF Hub dataset repo <team>/driftcall-indic-briefs (one-time, pre-event, Phase C5 per DRIFTCALL/CLAUDE.md §4.1).

6.5 Non-dependencies (explicit)

  • Does not import from env.py, rewards.py, app.py, or the training entrypoint. Pure data layer.
  • Does not hit the network at runtime. Every file is local. Publication script is a separate, explicitly-invoked entrypoint.
  • Does not depend on GPU, CUDA, or PyTorch. CPU-only.

7. Edge Cases

  1. Missing template variant for a rare language. templates.yaml is authored with hinglish + hi + en + ta but an author forgets kn for one template. load_templates runs per-template check set(variants.keys()) == LanguageCode.values and raises DatasetSchemaError: template 'restaurant.order.veg' missing language_variants['kn']. The generator's NoVariantForLanguageError (task_generator.md §5) never has a chance to fire because loading fails first. Fix: author supplies the missing variant; loader re-runs.

  2. Unicode NFD in author contribution. A collaborator pastes a Kannada weekday name from macOS clipboard (NFD by default for composed characters). load_i18n re-normalizes to NFC before equality/hashing; the assertion unicodedata.is_normalized("NFC", value) then runs post-normalization as a defense against Python/ICU bugs. In practice the round-trip succeeds and NFC is stored. The pre-commit hook separately catches NFD at commit time so CI never sees it.

  3. License incompatibility (CC-BY-SA row smuggled into an Apache-2.0 bundle). An author, inspired by an SGD row, copies 20 tokens verbatim into a template variant. Publication CI runs a suffix-array check over cached SGD/MTOP exports looking for ≥ 10-token verbatim matches; on hit, LicenseConflictError("variant in 'airline.book.budget_timewindow' matches SGD row sgd_5432:0 (≥ 10 tokens)") raises. Fix: rewrite the variant. We keep only inspiration, never verbatim text, from CC-BY-SA sources. The threshold (10 tokens) is a pragmatic choice: below that length we treat overlap as incidental linguistic reuse; at or above we flag.

  4. Empty language cohort in a stage mix. A future curriculum config passes language_weights = {"en": 1.0, "hi": 0.0, "ta": 0.0, "kn": 0.0, "hinglish": 0.0}. This is valid at the task-generator level (task_generator.md §3.2 — non-negative weights summing to 1 are legal). datasets does not re-validate curriculum config; it only asserts the library has variants for all 5 languages. Downstream (task_generator) will simply never draw hi/ta/kn/hinglish. No error in this module.

  5. Train/val episode-id collision at publication time. data_export.py draws 15,000 seeds for train and 500 for val. Train seeds are drawn from [0, 20_000_000) and val seeds from the reserved range [20_000_000, 20_000_500), so the two ranges are non-overlapping by construction (seed-space partitioning, §3.1) and a train seed cannot land in the val range. The assertion train_seeds.isdisjoint(val_seeds) is kept as a defense against future range edits; if it ever fails, it raises TrainValLeakError with the offending seed.

  6. Drift-pattern-id orphan (trace references pattern not in YAML). A test fixture or cached trace references drift_pattern_id='airline.mysterious_fee' but drift_patterns.yaml has no such entry (it was renamed or removed). load_drift_patterns does not look at traces; it only checks internal consistency. The trace consumer (rewards.r2_drift_detection in docs/modules/rewards.md) raises UnknownDriftPatternError at scoring time per drift_injector.md §5. If the orphan is discovered during dataset publication, the publication script emits DriftPatternOrphanError and aborts.

  7. JSON Schema file that is valid JSON but not valid JSON Schema 2020-12. data/api_schemas/cab/v3.json is hand-edited and accidentally drops the $schema keyword or uses an unknown keyword. load_api_schemas runs jsonschema.Draft202012Validator.check_schema(schema) and on failure raises DatasetSchemaError("cab/v3.json: not a valid JSON Schema 2020-12: <error>"). The env refuses to serve reset() until fixed.

  8. Audio clip on disk does not match manifest sha256. data/audio/real/MANIFEST.jsonl lists kn_greeting_03.wav with sha256=abc.... The file gets re-encoded (e.g., by a well-intentioned ffmpeg pass). load_audio_manifest re-hashes every referenced WAV and raises ChecksumMismatchError("kn_greeting_03.wav: expected abc..., got def..."). Fix: either revert the WAV, or regenerate the manifest after an audit trail commit.

  9. SFT corpus contains a val-reserved seed. Sarvam-M synthesis inadvertently uses a seed in [20_000_000, 20_000_500). load_sft_corpus raises TrainValLeakError. The training script may be configured to treat this as fatal (default) or to filter out those trajectories (--sft-tolerate-leak); the loader itself always raises.

  10. PyYAML silently deduplicating keys. If drift_patterns.yaml has two entries with the same id, the YAML parse still succeeds: PyYAML raises no error for duplicates, and for duplicate mapping keys the last value silently wins. load_drift_patterns builds a set of ids during validation and raises DuplicateDriftPatternIdError on collision, with both source line numbers.

  11. Partial SFT corpus recovery (L4 restart). training/sft_generator.py is mid-run at trajectory 137 of a target 300 when the host OOM-kills the process (Sarvam-M inference peak memory). On restart, the script re-opens data/sft_warmup/trajectories.jsonl, reads the existing 137 rows (each fsync'd atomically per §4.6), reconstructs the completed (generation_batch_id, generation_index) pairs, and resumes from index 137 of the same batch. It does NOT start a new generation_batch_id — batch id is rehydrated from the last row. When generation finally reaches 300, the script validates len(rows) == target_count; if a Sarvam-M response was silently truncated (say, only 298 rows written), PartialSFTCorpusError("expected 300, got 298") is raised and the operator must decide whether to re-run the missing two or ship a corpus with a smaller target_count. load_sft_corpus performs the same count check at load time.
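
Edge cases 1, 2, and 10 are all plain load-time checks. A minimal sketch of the three, assuming templates and drift patterns have already been parsed into plain dicts. The function names and the locally defined exception classes are hypothetical stand-ins, not the module's actual API, and the duplicate check reports list positions rather than source line numbers:

```python
import unicodedata

# Language set taken from the LanguageCode literal quoted in §8.1.
LANGUAGES = frozenset({"hi", "ta", "kn", "en", "hinglish"})


class DatasetSchemaError(ValueError):
    """Local stand-in for the loader's schema error."""


class DuplicateDriftPatternIdError(ValueError):
    """Local stand-in for the duplicate-id error."""


def check_language_coverage(template: dict) -> None:
    """Edge case 1: the variant keys must equal the language set exactly."""
    keys = set(template["language_variants"])
    missing = sorted(LANGUAGES - keys)
    if missing:
        raise DatasetSchemaError(
            f"template {template['template_id']!r} missing "
            f"language_variants[{missing[0]!r}]"
        )
    unknown = sorted(keys - LANGUAGES)
    if unknown:
        raise DatasetSchemaError(
            f"template {template['template_id']!r} has unknown language {unknown[0]!r}"
        )


def to_nfc(value: str) -> str:
    """Edge case 2: normalize to NFC, then assert the result really is NFC."""
    normalized = unicodedata.normalize("NFC", value)
    assert unicodedata.is_normalized("NFC", normalized)
    return normalized


def check_unique_ids(patterns: list) -> None:
    """Edge case 10: reject two pattern entries that share an id."""
    first_seen = {}
    for position, pattern in enumerate(patterns):
        pattern_id = pattern["id"]
        if pattern_id in first_seen:
            raise DuplicateDriftPatternIdError(
                f"{pattern_id!r} defined at entries "
                f"{first_seen[pattern_id]} and {position}"
            )
        first_seen[pattern_id] = position
```

Reporting real line numbers, as edge case 10 specifies, would need a YAML loader that keeps node positions; that part is left out of the sketch.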

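The seed-space partition behind edge cases 5 and 9 can be sketched as follows. The range bounds are the ones stated in edge case 5; the function names and the local TrainValLeakError class are hypothetical stand-ins:

```python
# Seed ranges as stated in edge case 5.
TRAIN_RANGE = range(0, 20_000_000)
VAL_RANGE = range(20_000_000, 20_000_500)


class TrainValLeakError(ValueError):
    """Local stand-in for the loader's leak error."""


def check_split_seeds(train_seeds: set, val_seeds: set) -> None:
    """Edge case 5: no train seed may sit in the val range or in the val set."""
    for seed in sorted(train_seeds):
        # Membership in a range object is a constant-time arithmetic check.
        if seed in VAL_RANGE:
            raise TrainValLeakError(f"train seed {seed} falls in the val-reserved range")
    overlap = train_seeds & val_seeds
    if overlap:
        # Redundant while the ranges stay disjoint; guards future range edits.
        raise TrainValLeakError(f"seed {min(overlap)} appears in both splits")


def check_sft_seeds(sft_seeds: set) -> None:
    """Edge case 9: the SFT corpus must not use a val-reserved seed."""
    for seed in sorted(sft_seeds):
        if seed in VAL_RANGE:
            raise TrainValLeakError(f"SFT seed {seed} falls in the val-reserved range")
```
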

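The restart logic in edge case 11 can be sketched as below. It assumes each JSONL row carries generation_batch_id and a zero-based generation_index (consistent with "resumes from index 137" after 137 rows); the helper names and the local exception class are hypothetical:

```python
import json
from pathlib import Path
from typing import Optional, Tuple


class PartialSFTCorpusError(ValueError):
    """Local stand-in for the corpus-count error."""


def read_rows(path: Path) -> list:
    """Read every non-empty JSONL row; a missing file means nothing was written."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def resume_point(path: Path) -> Tuple[Optional[str], int]:
    """Return (batch id rehydrated from the last row, first index not yet written)."""
    rows = read_rows(path)
    if not rows:
        return None, 0
    batch_id = rows[-1]["generation_batch_id"]
    done = {
        row["generation_index"]
        for row in rows
        if row["generation_batch_id"] == batch_id
    }
    next_index = 0
    while next_index in done:
        next_index += 1
    return batch_id, next_index


def check_complete(path: Path, target_count: int) -> None:
    """Final count check, run after generation and again at load time."""
    count = len(read_rows(path))
    if count != target_count:
        raise PartialSFTCorpusError(f"expected {target_count}, got {count}")
```

The sketch relies on rows being written atomically, as the edge case states; a half-written final line would surface as a JSON decode error rather than being skipped.
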
8. Examples

8.1 Full templates.yaml entry for airline.book.budget_timewindow

# SPDX-License-Identifier: Apache-2.0
# Copyright 2026 DriftCall Team
# Derived-from: AmazonScience/MASSIVE (intent taxonomy, Apache-2.0)

- template_id: airline.book.budget_timewindow
  domain: airline
  intent: book_flight
  min_stage: 1
  required_slots: [from, to, when]
  optional_slots: [seat_pref]
  constraints_template:
    budget_inr:
      distribution: uniform
      low: 3000
      high: 15000
      step: 500
    time_window:
      choices: [morning, afternoon, evening, late_night]
  drift_slot_tags: [price, total_fare_inr]
  # Language keys: ISO short codes matching LanguageCode = Literal["hi","ta","kn","en","hinglish"]
  language_variants:
    hinglish:
      - "Bhai {when} ko {to} jaana hai, cheapest flight {time_window} mein, {budget_inr} rupees max"
      - "{when} ko {from} se {to} ka ticket book kar de, under {budget_inr}, {time_window} ke baad"
    hi:
      - "मुझे {when} को {from} से {to} जाना है, {budget_inr} रुपये से कम में"
    ta:
      - "{when} அன்று {from} லிருந்து {to} க்கு டிக்கெட் வேண்டும், {budget_inr} ரூபாய்க்கு கீழ்"
    kn:
      - "{when} ರಂದು {from} ಇಂದ {to} ಗೆ ಅಗ್ಗದ ವಿಮಾನ ಟಿಕೆಟ್ ಬೇಕು, {budget_inr} ರೂಪಾಯಿಗಳ ಒಳಗೆ"
    en:
      - "Book the cheapest flight from {from} to {to} on {when}, budget under ₹{budget_inr}, departing {time_window}"

This is the single source-of-truth entry for the Stage-1 airline booking template; it mirrors DESIGN.md §8.3 and docs/modules/task_generator.md §4.1.

8.2 Full drift_patterns.yaml entry for airline.price_rename

# SPDX-License-Identifier: Apache-2.0
# Copyright 2026 DriftCall Team

- id: airline.price_rename
  drift_type: schema
  domain: airline
  from_version: v1
  to_version: v2
  description: "field 'price' renamed to 'total_fare_inr'; 'currency' removed"
  mutation:
    rename: {price: total_fare_inr}
    remove: [currency]
  detection_hints:
    - "total_fare_inr"
    - "price"
    - "rename"

load_drift_patterns will (a) parse this, (b) check id uniqueness, (c) confirm from_version=v1 + to_version=v2 both exist as data/api_schemas/airline/v1.json + data/api_schemas/airline/v2.json, (d) confirm detection_hints is non-empty, (e) wrap mutation in MappingProxyType. Matches docs/modules/drift_injector.md §4.3 byte-for-byte.
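
To make the mutation block concrete, here is a minimal sketch of what rename and remove mean when applied to a v1 payload. It illustrates the data shape only and is not the injector's implementation; apply_mutation and the sample v1 payload are hypothetical (the v1 schema is not reproduced in this section, so the payload assumes v1 carries price and currency alongside the v2 fields):

```python
from types import MappingProxyType

# The mutation from airline.price_rename, wrapped read-only as step (e) describes.
PATTERN = {
    "id": "airline.price_rename",
    "mutation": MappingProxyType(
        {"rename": {"price": "total_fare_inr"}, "remove": ["currency"]}
    ),
}


def apply_mutation(payload: dict, mutation) -> dict:
    """Return a new payload with renamed keys and removed keys dropped."""
    rename = mutation.get("rename", {})
    remove = set(mutation.get("remove", []))
    result = {}
    for key, value in payload.items():
        if key in remove:
            continue
        result[rename.get(key, key)] = value
    return result


# Hypothetical v1 response; values are invented for illustration.
v1_payload = {
    "flight_id": "6E-1234",
    "from": "HYD",
    "to": "BLR",
    "depart": "2026-04-30T18:05:00+05:30",
    "price": 7450,
    "currency": "INR",
    "seats_left": 9,
}
v2_payload = apply_mutation(v1_payload, PATTERN["mutation"])
```

MappingProxyType only blocks assignment at the top level: PATTERN["mutation"]["rename"] = {} raises TypeError, but the nested rename dict itself stays mutable unless it is wrapped as well.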

8.3 data/api_schemas/airline/v2.json

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://driftcall.dev/schemas/airline/v2.json",
  "$comment": "SPDX-License-Identifier: Apache-2.0. v2 = post-price_rename drift (DESIGN.md §5.1).",
  "title": "Airline search result (v2)",
  "type": "object",
  "required": ["flight_id", "from", "to", "depart", "total_fare_inr", "seats_left"],
  "additionalProperties": false,
  "properties": {
    "flight_id":       {"type": "string", "pattern": "^[0-9A-Z]{2}-[0-9]{4}$"},
    "from":            {"type": "string", "pattern": "^[A-Z]{3}$"},
    "to":              {"type": "string", "pattern": "^[A-Z]{3}$"},
    "depart":          {"type": "string", "format": "date-time"},
    "total_fare_inr":  {"type": "integer", "minimum": 0},
    "seats_left":      {"type": "integer", "minimum": 0}
  }
}

Note that price and currency from v1 are absent (drift airline.price_rename applied). Vendors (docs/modules/vendors.md) validate their emitted airline.search responses against whichever version the injector has installed in state.schema_versions['airline']. This schema also serves as the R2 structural detection surface: a tool call that keys into price after drift returns KeyError / 422, which is a detection-positive signal per DESIGN.md §7.1 R2.
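
The effect of required plus additionalProperties: false can be sketched with a small standard-library checker. This is a deliberately partial stand-in covering only the keywords the v2 schema uses (required, additionalProperties, type, pattern, minimum); it is not a JSON Schema implementation, and the function name violations and both sample payloads are hypothetical:

```python
import re

V2_SCHEMA = {
    "required": ["flight_id", "from", "to", "depart", "total_fare_inr", "seats_left"],
    "additionalProperties": False,
    "properties": {
        "flight_id": {"type": "string", "pattern": "^[0-9A-Z]{2}-[0-9]{4}$"},
        "from": {"type": "string", "pattern": "^[A-Z]{3}$"},
        "to": {"type": "string", "pattern": "^[A-Z]{3}$"},
        "depart": {"type": "string", "format": "date-time"},  # format not checked here
        "total_fare_inr": {"type": "integer", "minimum": 0},
        "seats_left": {"type": "integer", "minimum": 0},
    },
}


def violations(payload: dict, schema: dict) -> list:
    """Collect human-readable violations for the keyword subset above."""
    errors = []
    props = schema["properties"]
    for key in schema["required"]:
        if key not in payload:
            errors.append(f"missing required property {key!r}")
    if schema.get("additionalProperties") is False:
        for key in payload:
            if key not in props:
                errors.append(f"unexpected property {key!r}")
    for key, rule in props.items():
        if key not in payload:
            continue
        value = payload[key]
        if rule["type"] == "string":
            if not isinstance(value, str):
                errors.append(f"{key!r} is not a string")
            elif "pattern" in rule and not re.search(rule["pattern"], value):
                errors.append(f"{key!r} does not match {rule['pattern']}")
        elif rule["type"] == "integer":
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, int):
                errors.append(f"{key!r} is not an integer")
            elif value < rule.get("minimum", value):
                errors.append(f"{key!r} is below minimum {rule['minimum']}")
    return errors


v2_ok = {"flight_id": "6E-1234", "from": "HYD", "to": "BLR",
         "depart": "2026-04-30T18:05:00+05:30", "total_fare_inr": 7450, "seats_left": 9}
v1_style = {"flight_id": "6E-1234", "from": "HYD", "to": "BLR",
            "depart": "2026-04-30T18:05:00+05:30", "price": 7450,
            "currency": "INR", "seats_left": 9}
```

Here violations(v2_ok, V2_SCHEMA) is empty, while the v1-style payload is reported as missing total_fare_inr and as carrying the unexpected properties price and currency.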

8.4 MANIFEST.jsonl row for a curated IndicVoices-R clip (L3)

{"utterance_id": "iv_r_kn_0451", "path": "real/kn/iv_r_kn_0451.wav", "language": "kn", "source": "real_indicvoices_r", "license": "Apache-2.0", "sha256": "b7f1a9c2e5d4...", "duration_s": 4.82}

Referenced only by the pitch demo. Training never touches this file — DRIFTCALL/CLAUDE.md §9 "Do not put TTS/ASR in the training loop".
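
A minimal sketch of the manifest re-hash described in edge case 8, using only the standard library. It assumes each row's path is resolved relative to an audio root directory passed in by the caller; verify_manifest, sha256_of, and the local exception class are hypothetical names:

```python
import hashlib
import json
from pathlib import Path


class ChecksumMismatchError(ValueError):
    """Local stand-in for the loader's checksum error."""


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large WAVs are never read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path: Path, audio_root: Path) -> int:
    """Re-hash every clip listed in the manifest; return the number checked."""
    checked = 0
    with manifest_path.open(encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue
            row = json.loads(line)
            actual = sha256_of(audio_root / row["path"])
            if actual != row["sha256"]:
                raise ChecksumMismatchError(
                    f"{row['path']}: expected {row['sha256']}, got {actual}"
                )
            checked += 1
    return checked
```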

8.5 Canonical BriefRow JSONL line (single row from train/briefs.jsonl)

One line from the published bundle — canonical JSON (sorted keys, no whitespace, UTF-8 preserved for Devanagari):

{"catalogue_hash":"3f9a8e7c2b1d4e5f6a0b9c8d7e6f5a4b3c2d1e0f9a8b7c6d5e4f3a2b1c0d9e8f","created_ts_ist":"2026-04-25T10:30:00+05:30","domain":"airline","drift_schedule":[{"description":"'price' field renamed to 'total_fare_inr'","domain":"airline","drift_type":"schema","from_version":"v1","pattern_id":"airline.price_rename","to_version":"v2","turn":4}],"episode_id":"s2_ep_00000042","generator_version":"driftcall-1.0.0","goal":{"constraints":{"budget_inr":8000,"time_window":"evening"},"domain":"airline","intent":"book_flight","language":"hinglish","seed_utterance":"Bhai Friday ko Bangalore jaana hai, cheapest flight evening mein, 8000 rupees max","slots":{"from":"HYD","to":"BLR","when":"2026-04-30"}},"i18n_sha256":"a1b2c3d4e5f60718293a4b5c6d7e8f901234567890abcdef1234567890abcdef","language":"hinglish","seed":42,"stage":2,"template_id":"airline.book.budget_timewindow","templates_sha256":"b2c3d4e5f60718293a4b5c6d7e8f901234567890abcdef1234567890abcdef12"}

Note: keys are alphabetically sorted (catalogue_hash, created_ts_ist, domain, …), strings are NFC-normalized, and there is no whitespace between JSON tokens (spaces inside string values are preserved). The 64-hex hashes are full sha256 hex digests.
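
One way to produce a line in this canonical form with the standard library is sketched below. The helper names nfc and canonical_line are hypothetical, and the sketch assumes rows are plain dicts of JSON-compatible values:

```python
import json
import unicodedata


def nfc(value):
    """Recursively NFC-normalize every string (keys included) in a JSON-like value."""
    if isinstance(value, str):
        return unicodedata.normalize("NFC", value)
    if isinstance(value, dict):
        return {nfc(key): nfc(item) for key, item in value.items()}
    if isinstance(value, list):
        return [nfc(item) for item in value]
    return value


def canonical_line(row: dict) -> str:
    """Serialize with sorted keys, no inter-token whitespace, and raw UTF-8."""
    return json.dumps(
        nfc(row),
        sort_keys=True,          # applies to nested objects as well
        separators=(",", ":"),   # drops the default ", " and ": " padding
        ensure_ascii=False,      # keeps Devanagari etc. as-is instead of \u escapes
    )
```

Because the output is byte-stable for a given row, hashing the encoded line gives a reproducible digest across machines.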

8.6 README.md YAML frontmatter (HF Hub dataset card)

The published <org>/driftcall-indic-briefs/README.md begins with the following YAML frontmatter. The HF Dataset Viewer reads this block to auto-configure splits, license, and task tags.

---
license: apache-2.0
language: [hi, ta, kn, en]
size_categories: [10K<n<100K]
task_categories: [conversational, text-generation]
pretty_name: DriftCall Indic Briefs
configs:
  - config_name: default
    data_files:
      - split: train
        path: train/briefs.jsonl
      - split: val
        path: val/briefs.jsonl
dataset_info:
  features:
    - { name: episode_id, dtype: string }
    - { name: seed, dtype: int64 }
    - { name: stage, dtype: int32 }
    - { name: language, dtype: string }
    - { name: domain, dtype: string }
    - { name: template_id, dtype: string }
  splits:
    - { name: train, num_examples: 15000 }
    - { name: val, num_examples: 500 }
---

The body of README.md follows below the frontmatter: dataset description, licensing chain (full Apache-2.0 text is in the separate LICENSE file per §3.4), provenance (generator_version, catalogue_hash), reward-caveat paragraph, and usage example. The frontmatter's features block lists only the top-level flat columns; nested structs (goal, drift_schedule) are auto-inferred by the HF Datasets library on first load.


9. Open Questions

  1. HF org name not yet finalized. <org> placeholder in <org>/driftcall-indic-briefs depends on DRIFTCALL/CLAUDE.md §8 kickoff-checklist item "HF org name locked". The publication script parameterizes the org via --hf-org; no code change needed once locked, just a CLI arg at publication time. Does not block Phase D. Sync note: DRIFTCALL/CLAUDE.md §6 command table still lists the deprecated huggingface-cli upload — when the org name is locked, update that table to the modern hf upload in the same PR.

  2. SFT warmup corpus size — 200 vs 500 trajectories. DESIGN.md §8.2 row 4 quotes the range "200–500". The exact count depends on Sarvam-M's cost/latency budget during one-shot synthesis. Recommend 200 as a floor (sufficient for format priming per §10 training convergence target) and 500 as a ceiling if inference time permits. Resolution: Person C chooses during Phase C4; does not affect loader or schema.

  3. Audio manifest curation count. DESIGN.md §9 implies a handful of real IndicVoices-R clips for pitch demo realism, but does not specify exact count. Recommend 20 curated clips (4 per language × 5 languages), balanced by speaker gender and dialect region. Resolution: Person D curates during Phase C5; this module only ensures the manifest format is stable.

9.1 Resolved

  • License-cache implementation (previously Open Q #4). data/.license_cache/{sgd,mtop}.idx is a sqlite3 FTS5 index built by scripts/build_license_cache.py at dev time. Schema: CREATE VIRTUAL TABLE licensed_text USING fts5(chunk_text, source_id); with 5-gram tokenization. CI invokes this index (read-only) on each PR to verify that no seed_utterance or template variant in the publication bundle substring-matches any upstream CC-BY-SA text (≥ 10-token threshold, §3.4). The index is built once per upstream corpus version and committed to the repo so re-builds are only needed when SGD or MTOP themselves publish a new version. Determinism + reviewability win over per-PR rebuild cost.
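
The ≥ 10-token verbatim rule itself can be illustrated without the index. The sketch below swaps the FTS5 index for an in-memory dictionary of 10-token shingles, so it shows the matching rule only, not the committed cache; whitespace tokenization with case folding is an assumption, and the upstream sentence is invented for the example rather than taken from SGD:

```python
from typing import Dict, List, Optional, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, n: int) -> List[Shingle]:
    """All runs of n consecutive tokens, in order of appearance."""
    tokens = text.casefold().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_index(upstream: Dict[str, str], n: int = 10) -> Dict[Shingle, str]:
    """Map every n-token run in the upstream corpus to the first row containing it."""
    index: Dict[Shingle, str] = {}
    for source_id, text in upstream.items():
        for shingle in shingles(text, n):
            index.setdefault(shingle, source_id)
    return index


def find_conflict(candidate: str, index: Dict[Shingle, str], n: int = 10) -> Optional[str]:
    """Return the upstream row id of the first n-token verbatim match, if any."""
    for shingle in shingles(candidate, n):
        if shingle in index:
            return index[shingle]
    return None


# Invented upstream text keyed by the row id format used in edge case 3.
upstream = {
    "sgd_5432:0": "i would like to book a table for two people at seven tonight please"
}
index = build_index(upstream)
hit = find_conflict("Please note I would like to book a table for two people now", index)
miss = find_conflict("i would like to book a table for two", index)
```

The first candidate shares a 10-token run with the upstream row and is flagged; the second shares only nine tokens and passes, matching the threshold semantics described in §7 edge case 3.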

This doc tells you HOW the four dataset layers are shaped, loaded, validated, and published. Do not write loaders before a fresh critic returns NOTHING_FURTHER. Do not commit data/*.yaml without the pre-commit NFC + PII + license-header guards running. Do not ship the HF Hub bundle without the train/val disjointness and verbatim-match checks green.