# datasets — Four-Layer Dataset Strategy + HF Hub Publication

**Module path:** `driftcall/data/` (loaders) + `data/` (on-disk artifacts)
**Owner:** Person C (Training & Data)
**Implements:** DESIGN.md §8 (Dataset Strategy — §§8.1, 8.2, 8.3, 8.4, 8.5, 8.6)
**Consumed by:** `driftcall/task_generator.py` (L1), `driftcall/drift_injector.py` (L2 drift patterns), `driftcall/vendors/*.py` (L2 API schemas), `driftcall/audio/*.py` (L3 audio), `training/train_grpo.py` (L4 SFT warmup).
**Status:** Design spec — no code yet.

---

## 1. Purpose

`datasets` is the **authoring-and-loading contract** for every piece of static data DriftCall depends on. It is *not* a training-data-pipeline module (that is `training.md`) and it does *not* compose rewards (that is `rewards.md`). It does exactly four things:

1. **Defines** the four dataset layers per DESIGN.md §8.2 — task-brief templates, vendor API schemas + drift patterns, voice audio, and the optional SFT warmup corpus — as on-disk files with frozen YAML/JSON schemas.
2. **Loads** each file through a deterministic, lazy, singleton loader that NFC-normalizes all Indic strings at load time (cross-references `docs/modules/task_generator.md` §3.4 invariant #8).
3. **Validates** each file at load time: schema shape, type constraints, license header presence, train/val leakage, and cross-file consistency (e.g., every `drift_slot_tags` token in `templates.yaml` is targetable by ≥ 1 pattern in `drifts.yaml`).
4. **Publishes** the public-facing bundle `<team>/driftcall-indic-briefs` to the Hugging Face Hub per DESIGN.md §8.6, packaged from a deterministic `enumerate_variants()` walk (see `docs/modules/task_generator.md` §2.2).

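**No authored file in `data/` is ever written at runtime.** All four layers are authored before Phase C ships; the runtime only reads them. The exceptions are generated artifacts: the dataset-packaging script (`training/data_export.py`) writes `train/briefs.jsonl` + `val/briefs.jsonl` once, pre-publication; `training/sft_generator.py` writes the optional L4 corpus in a one-shot offline run; and `audio/tts_kokoro.py` lazily generates the gitignored clips under `data/audio/synth/` (see §2.2).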

Supervision for GRPO comes from the 5 reward functions (DESIGN.md §8.1) — these files exist to parameterize the environment, not to teach the policy. L4 (SFT warmup) is optional per DESIGN.md §8.2 row 4.

---

## 2. Interface

### 2.1 Directory layout (on disk, shipped inside the env Docker image per DESIGN.md §11.1)

```
data/
├── task_briefs/
│   ├── templates.yaml                # L1 — hand-authored + procedural expansion source (§8.3)
│   └── i18n.yaml                     # L1 — Indic localized strings (cities, weekdays, dish names)
├── drift_patterns/
│   └── drifts.yaml                   # L2 — 20 drift patterns (§6.3, §8.2 row 2)
├── api_schemas/                      # L2 — frozen JSON Schema per vendor per version
│   ├── airline/
│   │   ├── v1.json
│   │   ├── v2.json
│   │   └── v3.json
│   ├── cab/
│   │   ├── v1.json
│   │   ├── v2.json
│   │   └── v3.json
│   ├── restaurant/
│   │   ├── v1.json
│   │   ├── v2.json
│   │   └── v3.json
│   ├── hotel/
│   │   ├── v1.json
│   │   ├── v2.json
│   │   └── v3.json
│   └── payment/
│       ├── v1.json
│       └── v2.json
├── audio/                            # L3 — synthesized + real voice clips (§8.2 row 3, §9)
│   ├── synth/                        # Kokoro-82M output, generated lazily; gitignored
│   │   └── .gitkeep
│   ├── real/                         # AI4Bharat IndicVoices-R held-out subset for pitch demo
│   │   └── MANIFEST.jsonl            # (utterance_id, path, language, license, sha256)
│   └── LICENSES.md                   # per-clip license attribution
└── sft_warmup/                       # L4 — optional Sarvam-M synthesized trajectories (§8.2 row 4)
    ├── trajectories.jsonl            # 200–500 correct rollouts
    └── LICENSES.md
```

**Publication structure** (HF Hub dataset repo `<team>/driftcall-indic-briefs`, DESIGN.md §8.6):

```
driftcall-indic-briefs/
├── README.md                         # model card — provenance, license, stats, reward caveats
├── train/briefs.jsonl                # 15,000 sampled episodes (seed, stage, language_weights, GoalSpec)
├── val/briefs.jsonl                  #    500 held-out episodes — seeds disjoint from train
├── drift_patterns.yaml               # exact copy of data/drift_patterns/drifts.yaml (20 patterns)
├── api_schemas/                      # exact copy of data/api_schemas/
└── LICENSE                           # bundle license (Apache 2.0 by default; see §3.4)
```

### 2.2 Per-file contracts

| File | Format | Authored by | Runtime writer | Schema anchor |
|---|---|---|---|---|
| `data/task_briefs/templates.yaml` | YAML | Hand (20 seeds) | none | `Template` (§4.1 task_generator.md) |
| `data/task_briefs/i18n.yaml` | YAML | Hand | none | `Mapping[LanguageCode, Mapping[str, str]]` |
| `data/drift_patterns/drifts.yaml` | YAML | Hand | none | `DriftPattern` (§4.2 drift_injector.md) |
| `data/api_schemas/<domain>/v<N>.json` | JSON Schema 2020-12 | Hand | none | `APISchema` (§4.4 below) |
| `data/audio/real/MANIFEST.jsonl` | JSONL | Hand (curated from IndicVoices-R) | none | `AudioClipManifest` (§4.5) |
| `data/audio/synth/*.wav` | WAV 16kHz mono | `audio/tts_kokoro.py` (lazy) | `audio/tts_kokoro.py` | n/a — generated |
| `data/sft_warmup/trajectories.jsonl` | JSONL | Sarvam-M via HF Inference (offline) | `training/sft_generator.py` (one-shot) | `SFTTrajectory` (§4.6) |

### 2.3 Loaders (all return frozen dataclasses, all NFC-normalize Indic strings)

```python
from __future__ import annotations
from pathlib import Path
from driftcall.data.models import (
    TemplateLibrary, I18nLibrary,
    DriftPatternLibrary, APISchemaRegistry,
    AudioManifest, SFTCorpus,
)

# L1 — task briefs
def load_templates(path: Path | str = "data/task_briefs/templates.yaml") -> TemplateLibrary: ...
def load_i18n(path: Path | str = "data/task_briefs/i18n.yaml") -> I18nLibrary: ...

# L2 — drift patterns + api schemas
def load_drift_patterns(path: Path | str = "data/drift_patterns/drifts.yaml") -> DriftPatternLibrary: ...
def load_api_schemas(root: Path | str = "data/api_schemas") -> APISchemaRegistry: ...

# L3 — audio manifest (paths + licenses only; actual WAVs resolved on-demand)
def load_audio_manifest(path: Path | str = "data/audio/real/MANIFEST.jsonl") -> AudioManifest: ...

# L4 — optional SFT warmup
def load_sft_corpus(path: Path | str = "data/sft_warmup/trajectories.jsonl") -> SFTCorpus: ...
```

Each loader is implemented as a **module-level lazy singleton** — the first call reads + validates + freezes; subsequent calls return the same instance. Not thread-safe for write (there is no write); safe for concurrent read.

### 2.4 HF Hub publication commands

Packaging runs *once*, pre-event. The script is `training/data_export.py` (see `docs/modules/training.md` for its interface — this module only defines the on-disk shape of what it writes).

**Immutability.** The published bundle is IMMUTABLE after publication. Re-running `hf upload` against the same `data/publication/` tree produces a byte-identical bundle (invariant #6). Adding rows to `val/briefs.jsonl` requires a MINOR-version bump (v1.1, v1.2, …) and a new publication seed; the `train/` split NEVER silently mutates between versions — a version bump either adds disjoint val rows or re-publishes train+val together, never partial mutation of train.

**Seed selection (deterministic, locked).** Train and val seeds are drawn by `training/data_export.py` using these two exact expressions:

```python
import random
# Train: 15,000 seeds sampled without replacement from [0, 20_000_000).
train_seeds = random.Random(20260425).sample(range(0, 20_000_000), 15_000)
# Val: deterministic slice of 500 contiguous seeds in the reserved range.
val_seeds = list(range(20_000_000, 20_000_500))
```

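Both lists are identical across re-runs on a pinned Python version. The standard library guarantees cross-version reproducibility only for `Random.random()`, not for `sample()`, so the interpreter version is part of the determinism contract. The publication meta-seed `20260425` is locked; changing it requires a major-version bump and a new repo name or subfolder.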

```bash
# Generate the sampled briefs by walking enumerate_variants() (see task_generator.md §2.2)
python3 training/data_export.py \
    --out-train data/publication/train/briefs.jsonl \
    --out-val   data/publication/val/briefs.jsonl \
    --n-train   15000 \
    --n-val     500 \
    --seed      20260425        # frozen publication seed; NOT a training seed

# Copy the static L2 artifacts verbatim
cp  data/drift_patterns/drifts.yaml  data/publication/drift_patterns.yaml
cp -r  data/api_schemas              data/publication/api_schemas

# Upload (see DRIFTCALL/CLAUDE.md §6 command table and huggingface-skills:hf-cli).
# NOTE: `hf` is the modern CLI replacing the deprecated `huggingface-cli`.
hf upload <team>/driftcall-indic-briefs \
    data/publication/ . \
    --repo-type dataset \
    --commit-message "v1.0 publication — locked 2026-04-25"
```

The publication seed `20260425` is fixed and recorded in the README. Re-running the script produces a byte-identical bundle (determinism contract inherited from `task_generator.generate` per `docs/modules/task_generator.md` §3.1).

> **Doc-sync flag:** `DRIFTCALL/CLAUDE.md` §6 still lists the deprecated `huggingface-cli upload` command; update that table to `hf upload` in the same PR that lands this doc (captured as Open Question #1 / CLAUDE.md sync item).

---

## 3. Behavior Spec

### 3.1 Authoring conventions

**NFC normalization.** Every string value in every YAML/JSON file is NFC-normalized before it is committed. The loaders re-normalize defensively at load time (invariant #8, `docs/modules/task_generator.md` §3.4). Editors used during authoring (VS Code, vim) must be configured to save NFC — a pre-commit hook (`ruff`-adjacent script) runs `python -c "import unicodedata, sys; ..."` to reject NFD commits.
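
The defensive re-normalization at load time could look roughly like the sketch below. The helper names `nfc_normalize` and `find_non_nfc` are hypothetical and not part of the module; only `unicodedata.normalize` and `unicodedata.is_normalized` (Python 3.8+) are standard-library calls. The same `find_non_nfc` walk is one plausible core for the pre-commit check.

```python
import unicodedata
from typing import Any


def nfc_normalize(value: Any) -> Any:
    """Recursively NFC-normalize every string in a parsed YAML/JSON tree."""
    if isinstance(value, str):
        return unicodedata.normalize("NFC", value)
    if isinstance(value, dict):
        return {nfc_normalize(k): nfc_normalize(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return type(value)(nfc_normalize(v) for v in value)
    return value


def find_non_nfc(value: Any, path: str = "$") -> list[str]:
    """Return the paths of all string values that are not already NFC."""
    if isinstance(value, str):
        return [] if unicodedata.is_normalized("NFC", value) else [path]
    if isinstance(value, dict):
        return [p for k, v in value.items() for p in find_non_nfc(v, f"{path}.{k}")]
    if isinstance(value, (list, tuple)):
        return [p for i, v in enumerate(value) for p in find_non_nfc(v, f"{path}[{i}]")]
    return []


# "e" followed by a combining acute accent (U+0301) is NFD, not NFC.
raw = {"city": "Cafe\u0301", "names": ["BLR", "बेंगलुरु"]}
clean = nfc_normalize(raw)
print(find_non_nfc(raw), find_non_nfc(clean))
```

Reporting paths rather than a bare boolean makes a rejected commit easier to fix, since the author sees which key carries the decomposed string.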

**License headers.** Every hand-authored file begins with a YAML comment block declaring SPDX identifier, author, year, and upstream attribution if the content is derived from a public dataset (§8.5):

```yaml
# SPDX-License-Identifier: Apache-2.0
# Copyright 2026 DriftCall Team
# Derived-from: AmazonScience/MASSIVE (intent taxonomy, Apache-2.0)
# See data/LICENSES.md for full attribution chain.
```

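JSON files carry the same metadata in a `$comment` field at root. `$comment` is a standard JSON Schema keyword (introduced in draft-07 and retained in 2020-12) that validators ignore; plain JSON has no comment syntax, so this keyword is the portable place for the header.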

**Seed determinism.** Every numeric or stochastic sampling decision in template/drift authoring threads through a fixed seed: the publication seed `20260425`, the template-expansion seed `42`, or the curriculum-language seeds declared in DESIGN.md §10.3. No wall-clock, no `random.random()`, no host-machine entropy.

**No PII.** Authored strings never contain real names, phone numbers, email addresses, booking reference numbers, card PANs, or IP addresses. The `from` / `to` fields use IATA codes; the `pickup` / `drop` fields use fictional neighborhood landmarks. A CI lint (`grep -rEn '[0-9]{10}' data/`) runs before every commit and fails on any run of 10 or more digits outside the allowed IATA/timestamp contexts.
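
For loaders that need the same guard in-process, a Python equivalent of the grep pattern might look like this. It is a sketch only: `find_digit_runs` is a hypothetical helper, and the allow-list for timestamp contexts is not specified here, so the example flags every run of ten or more digits and leaves the exemptions to the caller.

```python
import re

# Any run of 10 or more consecutive ASCII digits, mirroring grep -E '[0-9]{10}'.
DIGIT_RUN = re.compile(r"[0-9]{10,}")


def find_digit_runs(text: str) -> list[tuple[int, str]]:
    """Return (line_number, matched_run) for every suspicious digit run."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in DIGIT_RUN.finditer(line):
            hits.append((lineno, match.group()))
    return hits


# Made-up authored snippet; line 3 carries a phone-number-shaped value.
sample = "from: BLR\nto: DEL\ncontact: 9876543210\nbudget_inr: 12000\n"
print(find_digit_runs(sample))
```

Eight-digit seeds such as `20000000` pass untouched, so the reserved seed ranges never trip the lint.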

**Eval-set held out from training.** The 500-episode val set uses seeds drawn from a reserved range (seed ∈ `[20_000_000, 20_000_500)`); training always draws seeds from `[0, 20_000_000)`. The publication script asserts disjointness at write time (see §5 leak detection). The exact seed-selection expressions are specified in §2.4.
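
The disjointness assertion can be sketched directly from the two seed expressions in §2.4. The function name `assert_disjoint` is hypothetical, and `ValueError` stands in for the module's `TrainValLeakError`.

```python
import random

VAL_START, VAL_END = 20_000_000, 20_000_500   # reserved val range from §2.4

train_seeds = random.Random(20260425).sample(range(0, VAL_START), 15_000)
val_seeds = list(range(VAL_START, VAL_END))


def assert_disjoint(train: list[int], val: list[int]) -> None:
    """Fail loudly if any seed appears in both splits."""
    overlap = set(train) & set(val)
    if overlap:
        # Stand-in for TrainValLeakError; show only a few offenders.
        raise ValueError(f"train/val seed leak: {sorted(overlap)[:5]}")


assert_disjoint(train_seeds, val_seeds)
print(len(train_seeds), len(val_seeds))
```

Because the train range ends exactly where the val range begins, the check cannot fail for these expressions; its value is in catching a later edit that widens either range.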

**Canonical JSON key ordering.** Every row in `train/briefs.jsonl` and `val/briefs.jsonl` is serialized with:

```python
json.dumps(row, ensure_ascii=False, sort_keys=True, separators=(",", ":"))
```

This is enforced as an invariant precondition for byte-identical re-runs (see §3.5 invariant #6). `ensure_ascii=False` preserves Devanagari / Tamil / Kannada script without `\uXXXX` escaping; `sort_keys=True` canonicalizes key order; `separators=(",", ":")` drops the spaces that the default separators insert after `,` and `:`, so each row has exactly one compact byte representation.
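
A short demonstration of why the canonical dump matters: two rows with the same content but different key insertion order serialize to the same bytes and therefore the same SHA-256. The row fields here are illustrative and much smaller than a real `BriefRow`; `canonical_line` is a hypothetical wrapper name.

```python
import hashlib
import json


def canonical_line(row: dict) -> str:
    """Serialize one row with the canonical settings from §3.1."""
    return json.dumps(row, ensure_ascii=False, sort_keys=True, separators=(",", ":"))


# Same content, different key insertion order.
row_a = {"seed": 42, "language": "hi", "city": "बेंगलुरु"}
row_b = {"city": "बेंगलुरु", "language": "hi", "seed": 42}

line_a, line_b = canonical_line(row_a), canonical_line(row_b)
digest_a = hashlib.sha256(line_a.encode("utf-8")).hexdigest()
digest_b = hashlib.sha256(line_b.encode("utf-8")).hexdigest()
print(line_a)
```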

**Per-row data lineage.** Every `BriefRow` carries the full six-tuple `(template_id, seed, stage, language, domain, generator_version)` plus three corpus-version hashes (`catalogue_hash`, `templates_sha256`, `i18n_sha256`). This is enforced as an invariant (§3.5 invariant #9) so that any published row is re-derivable from the triple `(seed, stage, library@hash)` alone.

### 3.2 Lazy singleton loaders

```python
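# sketch of the module-level pattern, mirrored in every loader
import threading
from pathlib import Path

_LIBRARIES: dict[Path, TemplateLibrary] = {}   # path-keyed cache, one entry per file
_LIBRARY_LOCK = threading.Lock()

def load_templates(path: Path | str = "data/task_briefs/templates.yaml") -> TemplateLibrary:
    key = Path(path)
    library = _LIBRARIES.get(key)
    if library is None:
        with _LIBRARY_LOCK:
            # double-checked locking: re-test once the lock is held
            library = _LIBRARIES.get(key)
            if library is None:
                library = _load_and_validate_templates(key)
                _LIBRARIES[key] = library
    return library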
```

The singleton is **path-keyed** — if a test passes a different `path`, a fresh instance is built (still cached in a per-path dict). Production callers always use the default path.

### 3.3 Schema validation at load time

Each loader does three passes:

1. **YAML/JSON parse.** Failure → `MalformedYAMLError` / `MalformedJSONError` with line/column.
2. **Type + shape validation** against the dataclass schema in §4. Failure → `DatasetSchemaError` naming the offending key.
3. **Cross-file consistency** check (loader-specific):
   - `load_drift_patterns` asserts `pattern.id` values are unique, exactly 20 patterns total, `drift_type ∈ {schema,policy,tnc,pricing,auth}`, and every `from_version`/`to_version` references an existing schema file in `data/api_schemas/<domain>/`.
   - `load_templates` asserts every `drift_slot_tags` token is matched by ≥ 1 `DriftPattern.mutation` key or value (e.g., a tagged slot such as `airline.total_fare_inr` must be targetable by some pattern; a tag that no pattern can reach is rejected).
   - `load_api_schemas` asserts each `v<N>.json` validates as JSON Schema 2020-12 against the meta-schema via `jsonschema.Draft202012Validator.check_schema`.
   - `load_audio_manifest` asserts every referenced `path` exists on disk and its sha256 matches the recorded hash.

Failures here abort env startup; HTTP 503 is served until the data is fixed (mirrors `DriftCatalogueError` handling in `docs/modules/drift_injector.md` §5).
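
As an illustration of the third pass, the `load_drift_patterns` checks on id uniqueness, pattern count, and `drift_type` membership might be sketched as follows. The dict shape (`id`, `drift_type` keys) and the function name `validate_patterns` are assumptions; the real `DriftPattern` schema lives in `drift_injector.md`, and the real loader raises typed errors rather than returning strings.

```python
from collections import Counter

ALLOWED_DRIFT_TYPES = {"schema", "policy", "tnc", "pricing", "auth"}
EXPECTED_PATTERN_COUNT = 20


def validate_patterns(patterns: list[dict]) -> list[str]:
    """Collect human-readable problems instead of stopping at the first one."""
    problems = []
    counts = Counter(p["id"] for p in patterns)
    for pattern_id, n in counts.items():
        if n > 1:
            problems.append(f"duplicate id: {pattern_id}")
    if len(patterns) != EXPECTED_PATTERN_COUNT:
        problems.append(
            f"expected {EXPECTED_PATTERN_COUNT} patterns, got {len(patterns)}"
        )
    for p in patterns:
        if p["drift_type"] not in ALLOWED_DRIFT_TYPES:
            problems.append(f"{p['id']}: unknown drift_type {p['drift_type']!r}")
    return problems


# Twenty well-formed placeholder patterns, then a copy with one bad entry.
good = [{"id": f"p{i:02d}", "drift_type": "schema"} for i in range(20)]
bad = good[:19] + [{"id": "p00", "drift_type": "latency"}]
print(validate_patterns(bad))
```

Collecting every problem in one pass is a design choice for authoring ergonomics; at env startup a single failure is already enough to serve HTTP 503.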

### 3.4 License compatibility check

Per §8.5 the public datasets we reference carry mixed licenses:

| Upstream | License | Redistributable in our bundle? |
|---|---|---|
| AI4Bharat IndicVoices-R | Apache-2.0 | Yes, with attribution |
| MASSIVE (Amazon) | Apache-2.0 | Yes, with attribution |
| Schema-Guided Dialogue (SGD) | CC-BY-SA | Inspiration only — derived schema patterns, not verbatim rows |
| MTOP (Facebook) | MIT-style (see original repo) | Inspiration only — derived Hindi task phrasings, not verbatim rows |
| APIs.guru | CC0 | Yes, no attribution required but recorded |

The bundle license (`LICENSE` at the root of `<team>/driftcall-indic-briefs`) is **Apache-2.0**. Because CC-BY-SA is a copyleft (share-alike) license, we never copy SGD or MTOP rows verbatim; we take only *inspiration* (intent labels, schema shapes). A CI check enforces that no string in `train/briefs.jsonl` or `val/briefs.jsonl` appears verbatim (≥ 10-token suffix match) in a cached SGD/MTOP export. See §5 for the error mode and §7 edge case #3 for the exact detection rule.
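
One way to approximate the verbatim check is a shared-window test over token n-grams. The sketch below assumes whitespace tokenization and treats any shared contiguous 10-token window as a match; the authoritative rule is the one in §7 edge case #3, which may differ. Function names and the sample strings are made up for illustration.

```python
def token_windows(text: str, n: int = 10) -> set[tuple[str, ...]]:
    """All contiguous n-token windows of a whitespace-tokenized string."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def has_verbatim_overlap(candidate: str, upstream_rows: list[str], n: int = 10) -> bool:
    """True if the candidate shares any n-token window with an upstream row."""
    upstream: set[tuple[str, ...]] = set()
    for row in upstream_rows:
        upstream |= token_windows(row, n)
    return not token_windows(candidate, n).isdisjoint(upstream)


# Invented stand-in for a cached upstream export (not a real SGD/MTOP row).
upstream_rows = ["i would like to book a table for two people at seven tonight"]

copied = "please note i would like to book a table for two people now"
original = "mujhe do logon ke liye table chahiye"
print(has_verbatim_overlap(copied, upstream_rows))
print(has_verbatim_overlap(original, upstream_rows))
```

Strings shorter than ten tokens produce no windows and can never match, which keeps short slot values such as city names out of the check.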

**Full verbatim license text (MANDATORY).** The root `LICENSE` file MUST contain the **full verbatim Apache 2.0 license text** as published at https://www.apache.org/licenses/LICENSE-2.0.txt — NOT a URL, NOT a one-line SPDX identifier, NOT a summary. The same requirement applies to `data/audio/LICENSES.md` and `data/sft_warmup/LICENSES.md` (both must embed the full Apache-2.0 text plus per-clip / per-trajectory attribution rows). CI check `tests/data/test_license_text.py` verifies that the byte length of each `LICENSE` file is ≥ 11,000 (Apache-2.0 full text is ~11,357 bytes) and that the canonical "Apache License / Version 2.0, January 2004" header string is present.
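
The two conditions the CI check verifies can be sketched as below. This is not the contents of `tests/data/test_license_text.py`; `license_problems` is a hypothetical helper, and the header is matched as two separate substrings because the canonical text places "Apache License" and "Version 2.0, January 2004" on separate lines.

```python
APACHE_MIN_BYTES = 11_000
REQUIRED_MARKERS = ("Apache License", "Version 2.0, January 2004")


def license_problems(data: bytes) -> list[str]:
    """Return the reasons a LICENSE payload fails the full-text check."""
    problems = []
    if len(data) < APACHE_MIN_BYTES:
        problems.append(f"too short: {len(data)} bytes < {APACHE_MIN_BYTES}")
    text = data.decode("utf-8", errors="replace")
    for marker in REQUIRED_MARKERS:
        if marker not in text:
            problems.append(f"missing marker: {marker!r}")
    return problems


# A one-line SPDX identifier is exactly what the rule forbids.
spdx_only = b"SPDX-License-Identifier: Apache-2.0\n"
# Size-only stand-in for the real text; padding is NOT a valid license.
padded = ("Apache License\nVersion 2.0, January 2004\n" + "x" * 11_000).encode()
print(license_problems(spdx_only))
```

As the padded example shows, size and header checks are a cheap guard against URL-only or SPDX-only files, not proof that the text is verbatim; a hash comparison against the canonical file would be stricter.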

**`LICENSES.md` schema (L3 audio + L4 SFT warmup).** Both `data/audio/LICENSES.md` and `data/sft_warmup/LICENSES.md` follow the same markdown format:

1. A preamble (5–15 lines) naming the bundle and linking back to the root `LICENSE`.
2. The full verbatim Apache-2.0 text (as above).
3. A single markdown table with exactly these columns, one row per clip (L3) or per trajectory (L4):

```markdown
| utterance_id | upstream_source      | upstream_license | attribution_required | notes                         |
|--------------|----------------------|------------------|----------------------|-------------------------------|
| iv_r_kn_0451 | IndicVoices-R        | Apache-2.0       | yes                  | speaker consent verified      |
| sft_00042    | Sarvam-M (synthesis) | Apache-2.0       | no                   | rollout seed 42, stage 2      |
```

For L4 the `utterance_id` column is replaced by `trajectory_id` but the other four columns are identical. Loaders do not parse these tables at runtime; they are human-audit artifacts enforced only by pre-commit schema check `scripts/check_licenses_md.py`.

### 3.5 Invariants (enforced by tests)

1. Every string value in every loaded library is NFC (`unicodedata.is_normalized("NFC", s) == True`).
2. `load_drift_patterns()` returns exactly 20 patterns (matches `docs/modules/drift_injector.md` §4.4 and DESIGN.md §6.3).
3. `load_api_schemas()` returns exactly `{airline:v1,v2,v3 + cab:v1,v2,v3 + restaurant:v1,v2,v3 + hotel:v1,v2,v3 + payment:v1,v2}` = **14 schemas across 5 domains** (matches DESIGN.md §8.6 bundle enumeration and §5 vendor catalogue).
4. `load_templates()` library satisfies: every template has ≥ 1 variant in every `LanguageCode` (`hi`, `ta`, `kn`, `en`, `hinglish`); every **primary-domain** pattern's `mutation` field set is a subset of the union of `drift_slot_tags` across that domain's templates. The two transversal payment-auth patterns (`payment.auth_scope_upgrade`, `payment.mfa_required`) are EXEMPT from this subset check — they mutate shared payment fields (`token`, `scope`, `mfa_code`) that are intentionally not present in primary-domain goal templates and therefore cannot appear in `drift_slot_tags`.
5. Publication invariant: train seed set ∩ val seed set = ∅.
6. Publication invariant: running `data_export.py` twice with the same seed produces byte-identical `train/briefs.jsonl` + `val/briefs.jsonl` (SHA-256 match). Enforced via canonical JSON dump (§3.1): `json.dumps(row, ensure_ascii=False, sort_keys=True, separators=(",", ":"))`.
7. Every file in `data/` begins with an SPDX license header (YAML comment or JSON `$comment`).
8. No 10-digit digit-run in any authored string outside the timestamp / IATA allowed contexts (PII guard).
9. **Per-row data lineage.** Every `BriefRow` (§4.7) in the published `train/` and `val/` splits carries all of: `template_id`, `seed`, `stage`, `language`, `domain`, `generator_version`, `catalogue_hash`, `templates_sha256`, `i18n_sha256`. At eval-load time, `catalogue_hash` / `templates_sha256` / `i18n_sha256` must match the currently-loaded library hashes, else `CatalogueHashMismatchError` is raised (§5).
10. **Bundle immutability.** After publication (§2.4), the train split SHA-256 MUST match across all future re-runs of `hf upload`; adding val rows requires a minor-version bump, never a silent train-split mutation.

---

## 4. Data Structures

All types are frozen dataclasses, immutable after load. Mappings are wrapped in `types.MappingProxyType`.

### 4.1 `TemplateLibrary` (re-exported from `task_generator.models` — single source of truth)

```python
@dataclass(frozen=True)
class TemplateLibrary:
    templates: tuple[Template, ...]                                      # exactly 20 at v1.0
                                                                         # (4 domains × 5 templates);
                                                                         # ≥ 20 after minor-version bumps
    cities_by_domain: Mapping[Domain, tuple[str, ...]]                   # 10 per domain
    i18n: Mapping[LanguageCode, Mapping[str, str]]                       # merged from i18n.yaml
    source_sha256: str                                                   # hash of templates.yaml bytes
```

The `templates` tuple length is **exactly 20** at v1.0 publication (4 domains × 5 templates per domain). Post-v1.0 minor-version bumps may grow this count monotonically; the invariant `len(templates) >= 20` and `len(templates) % 5 == 0` holds across all future versions. `load_templates` asserts `len(templates) == 20` at v1.0 via the `generator_version` check.

Authoritative schema lives in `docs/modules/task_generator.md` §4. This module re-exports the type so callers of `load_templates` receive the same object that `task_generator.generate` consumes.

### 4.2 `I18nLibrary`

```python
@dataclass(frozen=True)
class I18nLibrary:
    strings: Mapping[LanguageCode, Mapping[str, str]]
    # e.g., strings["hi"]["BLR"] = "बेंगलुरु"
    # strings["ta"]["Monday"] = "திங்கள்"
    source_sha256: str
```

Merged into `TemplateLibrary.i18n` by `load_templates`, but exposed independently for pure-i18n use cases (e.g., the Gradio demo UI localizing labels).

### 4.3 `DriftPatternLibrary`

```python
@dataclass(frozen=True)
class DriftPatternLibrary:
    patterns: Mapping[str, DriftPattern]                                 # keyed by DriftPattern.id
    by_domain: Mapping[str, tuple[str, ...]]                             # domain → pattern_ids
    by_type:   Mapping[str, tuple[str, ...]]                             # drift_type → pattern_ids
    source_sha256: str
```

`DriftPattern` itself is defined in `docs/modules/drift_injector.md` §4.2 (see the `DriftPattern` dataclass snippet). This module owns *loading*, `drift_injector` owns *applying*.

### 4.4 `APISchemaRegistry`

```python
@dataclass(frozen=True)
class APISchema:
    domain: str                       # "airline" | "cab" | "restaurant" | "hotel" | "payment"
    version: str                      # "v1" | "v2" | "v3"
    schema: Mapping[str, Any]         # parsed JSON Schema 2020-12 document
    source_sha256: str

@dataclass(frozen=True)
class APISchemaRegistry:
    schemas: Mapping[str, Mapping[str, APISchema]]
    # schemas["airline"]["v2"] = APISchema(...)

    def get(self, domain: str, version: str) -> APISchema: ...
    def versions(self, domain: str) -> tuple[str, ...]: ...              # ordered v1,v2,v3
```

Each `v<N>.json` is a valid JSON Schema 2020-12 document describing the tool-response shape for that domain at that drift version. Vendors (DESIGN.md §5) validate outgoing responses against these schemas at test time; the injector (`docs/modules/drift_injector.md` §3) consults version transitions via these files.
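
The two accessor methods are left as stubs above. A minimal sketch of how they might behave follows, assuming a plain `KeyError` for unknown entries and numeric ordering of version labels; neither choice is confirmed by this spec, and the `_schema` helper exists only to build sample data.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Any, Mapping


@dataclass(frozen=True)
class APISchema:
    domain: str
    version: str
    schema: Mapping[str, Any]
    source_sha256: str


@dataclass(frozen=True)
class APISchemaRegistry:
    schemas: Mapping[str, Mapping[str, APISchema]]

    def get(self, domain: str, version: str) -> APISchema:
        try:
            return self.schemas[domain][version]
        except KeyError:
            raise KeyError(f"no schema for {domain}/{version}") from None

    def versions(self, domain: str) -> tuple[str, ...]:
        # Sort on the numeric part so a future "v10" follows "v9", not "v1".
        return tuple(sorted(self.schemas[domain], key=lambda v: int(v[1:])))


def _schema(domain: str, version: str) -> APISchema:
    """Build a placeholder schema entry for the example."""
    return APISchema(domain, version, MappingProxyType({"type": "object"}), "0" * 64)


# Deliberately inserted out of order to show that versions() sorts.
registry = APISchemaRegistry(MappingProxyType({
    "payment": MappingProxyType({v: _schema("payment", v) for v in ("v2", "v1")}),
}))
print(registry.versions("payment"))
```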

### 4.5 `AudioManifest`

```python
@dataclass(frozen=True)
class AudioClip:
    utterance_id: str                 # stable; matches a curated IndicVoices-R clip id
    path: Path                        # relative to data/audio/
    language: LanguageCode
    source: Literal["real_indicvoices_r"]   # manifest is authored-only; synth clips
                                            # are lazily generated and NEVER recorded here
    license: str                      # SPDX identifier
    sha256: str
    duration_s: float                 # ≤ 20.0 (DESIGN.md §9 upper bound)

@dataclass(frozen=True)
class AudioManifest:
    clips: tuple[AudioClip, ...]
    source_sha256: str                # hash of MANIFEST.jsonl bytes
```

The `source` field is a single-value `Literal` — the manifest is **authored-only**. Synth clips generated on-demand by `audio/tts_kokoro.py` are **never** recorded in the manifest (they are transient, gitignored under `data/audio/synth/`). This keeps the manifest auditable and its SHA-256 stable across Kokoro model-weight updates.
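
The per-clip integrity check that `load_audio_manifest` performs (file exists, SHA-256 matches) can be sketched with the standard library alone. `sha256_of` and `verify_clip` are hypothetical names, and the real loader raises `ChecksumMismatchError` or `DatasetFileMissingError` rather than returning a boolean.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file so a large WAV is never held in memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_clip(path: Path, recorded_sha256: str) -> bool:
    """True only if the file exists and its hash equals the manifest entry."""
    return path.is_file() and sha256_of(path) == recorded_sha256


with tempfile.TemporaryDirectory() as tmp:
    clip = Path(tmp) / "iv_r_kn_0451.wav"
    payload = b"RIFF-placeholder-bytes"            # stand-in, not a real WAV
    clip.write_bytes(payload)
    recorded = hashlib.sha256(payload).hexdigest()

    ok = verify_clip(clip, recorded)
    tampered = verify_clip(clip, "0" * 64)
    missing = verify_clip(Path(tmp) / "absent.wav", recorded)

print(ok, tampered, missing)
```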

### 4.6 `SFTCorpus` (L4, optional)

```python
@dataclass(frozen=True)
class SFTTrajectory:
    episode_id: int
    goal_seed: int                    # same seed space as train/; NEVER a val seed (§3.1)
    turns: tuple[Mapping[str, Any], ...]   # role/content pairs, JSON-serializable
    stage: Literal[1, 2, 3]
    reward_breakdown: Mapping[str, float]  # R1..R5 + total, from the env at synthesis time
    generation_batch_id: str          # uuid4 per invocation of sft_generator.py
    generation_index: int             # monotonic within a batch, 0..N-1

@dataclass(frozen=True)
class SFTCorpus:
    trajectories: tuple[SFTTrajectory, ...]
    generator: Literal["sarvam-m-hf-inference"]
    generation_seed: int
    target_count: int                 # from --target-count CLI flag
    source_sha256: str
```

Consumed by `training/train_grpo.py` only when `--sft-warmup-steps > 0` is passed. Absent by default; if the file is missing, the loader emits a non-fatal warning and training proceeds without warmup.

**Atomic append + restart recovery (`training/sft_generator.py`):**

- Each trajectory is appended to `data/sft_warmup/trajectories.jsonl` as a single canonical-JSON line followed by `os.fsync(fd)` on the file descriptor, ensuring durability before the next Sarvam-M API call. Partial-write recovery is therefore line-granular.
- Every row carries `generation_batch_id` (uuid4, generated once per invocation of `sft_generator.py`) and `generation_index` (monotonic integer 0..N-1 within that batch).
- On restart, `sft_generator.py` reads the existing `trajectories.jsonl`, reconstructs `(seed, generation_index)` pairs already completed, and resumes from the next uncompleted seed in its deterministic seed list. This tolerates Sarvam-M rate-limit drops, OOM kills, and SIGKILL.
- After all generation completes, the script performs a **final count validation**: if `len(trajectories) != target_count`, it raises `PartialSFTCorpusError` (§5). The loader `load_sft_corpus` also performs this check at load time and raises the same error if the on-disk row count does not match the `target_count` field.
- Edge case #11 (§7) walks through a concrete partial-generation-recovery scenario.
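
The append-then-resume behavior described in the bullets above can be sketched as follows. All function names are hypothetical, and the rows carry only two fields for brevity. Note that Python's buffered file objects need an explicit `flush()` before `os.fsync()` has anything to sync.

```python
import json
import os
import tempfile
from pathlib import Path


def append_trajectory(path: Path, row: dict) -> None:
    """Append one canonical-JSON line and force it to disk before returning."""
    line = json.dumps(row, ensure_ascii=False, sort_keys=True, separators=(",", ":"))
    with path.open("a", encoding="utf-8") as handle:
        handle.write(line + "\n")
        handle.flush()                 # push Python's buffer to the OS
        os.fsync(handle.fileno())      # push the OS buffer to the device


def completed_seeds(path: Path) -> set[int]:
    """Seeds already on disk; a torn final line is ignored."""
    done: set[int] = set()
    if not path.exists():
        return done
    for line in path.read_text(encoding="utf-8").splitlines():
        try:
            done.add(json.loads(line)["goal_seed"])
        except (json.JSONDecodeError, KeyError, TypeError):
            continue                   # partial write from a killed run
    return done


def remaining_seeds(path: Path, seed_list: list[int]) -> list[int]:
    """Seeds from the deterministic list that still need a trajectory."""
    done = completed_seeds(path)
    return [s for s in seed_list if s not in done]


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "trajectories.jsonl"
    for index, seed in enumerate((11, 12)):
        append_trajectory(out, {"goal_seed": seed, "generation_index": index})
    with out.open("a", encoding="utf-8") as handle:
        handle.write('{"goal_seed": 13')           # simulate a kill mid-write
    todo = remaining_seeds(out, [11, 12, 13, 14])

print(todo)
```

This sketch only skips a torn tail when reading. A resuming writer would also need to truncate or terminate that tail before appending, otherwise the next row would be glued onto the partial line.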

### 4.7 `BriefRow` — canonical publication-row contract

Every line of `train/briefs.jsonl` and `val/briefs.jsonl` in the published HF Hub bundle is exactly one serialized `BriefRow`. This dataclass is the single-source-of-truth schema for everything an offline consumer (a judge re-running eval, a third party reproducing our experiments) needs to re-derive the episode from `(seed, stage, library@hash)` alone.

```python
from __future__ import annotations
from dataclasses import dataclass
from typing import Literal
from driftcall.models import GoalSpec, DriftEvent, LanguageCode

@dataclass(frozen=True)
class BriefRow:
    episode_id: str                    # deterministic from seed + stage (e.g. "s2_ep_00000042")
    seed: int                          # original episode seed (train: [0, 20_000_000),
                                       #                         val:   [20_000_000, 20_000_500))
    stage: Literal[1, 2, 3]            # curriculum stage at publication time
    language: LanguageCode             # "hi" | "ta" | "kn" | "en" | "hinglish"
    domain: Literal["airline", "cab", "restaurant", "hotel"]
    template_id: str                   # e.g. "airline.book.budget_timewindow"
    goal: GoalSpec                     # full GoalSpec (slots + constraints + seed_utterance)
    drift_schedule: tuple[DriftEvent, ...]   # schedule pre-computed by drift_injector
    catalogue_hash: str                # sha256(drift_patterns/drifts.yaml bytes)
    templates_sha256: str              # sha256(task_briefs/templates.yaml bytes)
    i18n_sha256: str                   # sha256(task_briefs/i18n.yaml bytes)
    generator_version: str             # e.g. "driftcall-1.0.0" — semver of the generator
    created_ts_ist: str                # ISO 8601 with +05:30 offset, e.g. "2026-04-25T10:30:00+05:30"
```

Serialization is always canonical: `json.dumps(asdict(row), ensure_ascii=False, sort_keys=True, separators=(",", ":"))`. A concrete JSONL line example is given in §8.5.

At eval-load time, the loader re-hashes the currently-loaded `drifts.yaml` / `templates.yaml` / `i18n.yaml` and compares against `catalogue_hash` / `templates_sha256` / `i18n_sha256`. Any mismatch raises `CatalogueHashMismatchError` (§5) — this prevents silent semantic drift where a consumer runs `train/briefs.jsonl` against a newer catalogue and gets different episodes.
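
The comparison itself is small. The sketch below works on plain dicts rather than `BriefRow` instances, defines a local stand-in for `CatalogueHashMismatchError` (the real one subclasses `DatasetError`), and hashes short placeholder byte strings instead of the actual YAML files.

```python
import hashlib


class CatalogueHashMismatchError(Exception):
    """Local stand-in for the exception listed in §5."""


HASH_FIELDS = ("catalogue_hash", "templates_sha256", "i18n_sha256")


def sha256_bytes(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def check_row_hashes(row: dict, loaded: dict) -> None:
    """Raise if any lineage hash in the row differs from the loaded library."""
    stale = [name for name in HASH_FIELDS if row.get(name) != loaded[name]]
    if stale:
        raise CatalogueHashMismatchError(f"stale hashes: {', '.join(stale)}")


# Placeholder bytes; a real caller would hash the on-disk file contents.
loaded = {
    "catalogue_hash": sha256_bytes(b"drifts-v1"),
    "templates_sha256": sha256_bytes(b"templates-v1"),
    "i18n_sha256": sha256_bytes(b"i18n-v1"),
}
fresh_row = dict(loaded, seed=42)
stale_row = dict(fresh_row, catalogue_hash=sha256_bytes(b"drifts-v2"))

check_row_hashes(fresh_row, loaded)    # passes silently
```

Naming the stale fields in the message tells the consumer which of the three files has moved since publication.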

---

## 5. Error Modes

All exceptions subclass `DatasetError(Exception)`. Each is raised exactly once and unit-tested.

| Exception | Trigger | Where raised |
|---|---|---|
| `DatasetFileMissingError` | `data/<path>` absent on disk | every loader |
| `MalformedYAMLError` | YAML parse failure (syntax) | `load_templates`, `load_i18n`, `load_drift_patterns` |
| `MalformedJSONError` | JSON parse failure (syntax) | `load_api_schemas`, `load_audio_manifest`, `load_sft_corpus` |
| `DatasetSchemaError` | type/shape validation failure (missing required key, wrong type, extra unknown key) | every loader |
| `UnknownLanguageKeyError` | a language key ∉ `LanguageCode = {"hi","ta","kn","en","hinglish"}` appears in `templates.yaml` or `i18n.yaml` | `load_templates`, `load_i18n` |
| `LicenseConflictError` | a CC-BY-SA or GPL-licensed row appears in publication bundle while bundle is Apache-2.0; or verbatim ≥ 10-token suffix matches a CC-BY-SA upstream row | publication script (see §3.4) |
| `TrainValLeakError` | train and val seed sets intersect; or an `SFTTrajectory.goal_seed` sits in the val reserved range `[20_000_000, 20_000_500)` | publication script, `load_sft_corpus` |
| `DriftPatternOrphanError` | `drifts.yaml` references a `from_version`/`to_version` not present in `data/api_schemas/<domain>/` | `load_drift_patterns` |
| `ChecksumMismatchError` | `AudioClip.sha256` does not match the on-disk file's hash | `load_audio_manifest` |
| `UnicodeNFDError` | any loaded string fails `unicodedata.is_normalized("NFC", s)` | every loader |
| `PIIDetectedError` | a 10-digit run appears outside allowed contexts in authored text | every text-bearing loader; also CI lint |
| `DuplicateDriftPatternIdError` | two entries in `drifts.yaml` share an `id` | `load_drift_patterns` |
| `CatalogueHashMismatchError` | a `BriefRow` in `train/briefs.jsonl` or `val/briefs.jsonl` carries `catalogue_hash` / `templates_sha256` / `i18n_sha256` that does not match the currently-loaded library (drifts.yaml / templates.yaml / i18n.yaml) hashes | eval-load path (consumers of published bundle) |
| `PartialSFTCorpusError` | `len(SFTCorpus.trajectories) != target_count` at final-count validation; raised by `training/sft_generator.py` post-generation and by `load_sft_corpus` at load time | `load_sft_corpus`, `training/sft_generator.py` |

**No silent fallbacks.** If `data/sft_warmup/trajectories.jsonl` is missing, `load_sft_corpus` raises `DatasetFileMissingError`; the training script is the one that treats this as non-fatal (falls back to no-SFT warmup). Loaders themselves never substitute defaults.
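
The split of responsibilities can be sketched as follows. The loader body is heavily simplified and `sft_rows_or_none` is a hypothetical name for the caller-side policy; the point is only that the fallback lives outside the loader.

```python
from pathlib import Path


class DatasetError(Exception):
    """Base class for every error in the table above."""


class DatasetFileMissingError(DatasetError):
    pass


def load_sft_corpus(path: Path) -> list:
    # Simplified: the real loader also parses and validates each row.
    if not path.exists():
        raise DatasetFileMissingError(str(path))
    return path.read_text(encoding="utf-8").splitlines()


def sft_rows_or_none(path: Path):
    # Caller-side policy (hypothetical helper): the training script, not the
    # loader, decides that a missing corpus means "skip SFT warmup".
    try:
        return load_sft_corpus(path)
    except DatasetFileMissingError:
        return None


rows = sft_rows_or_none(Path("data/sft_warmup/does_not_exist.jsonl"))
```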

---

## 6. Dependencies

### 6.1 Reads

- `data/task_briefs/templates.yaml`, `data/task_briefs/i18n.yaml`
- `data/drift_patterns/drifts.yaml`
- `data/api_schemas/**/*.json`
- `data/audio/real/MANIFEST.jsonl` + the `.wav` files it references
- `data/sft_warmup/trajectories.jsonl` (optional)

### 6.2 Imports

- `driftcall.models`: `GoalSpec`, `LanguageCode`, `Domain`
- Python stdlib: `json`, `hashlib`, `pathlib`, `unicodedata`, `threading`, `dataclasses`, `typing`, `types`
- Third-party: `PyYAML`, `jsonschema` (for JSON Schema 2020-12 meta-validation)

### 6.3 Consumers

Consuming modules and the exact function they call:

- `docs/modules/task_generator.md`: `load_templates()` in `task_generator.generate()`'s lazy-singleton `_get_library()`.
- `docs/modules/drift_injector.md`: `load_drift_patterns()` in the injector's module-level registry; consults DESIGN.md §6.3 pattern catalogue.
- `docs/modules/vendors.md`: `load_api_schemas()` at vendor import time; each vendor asserts its own response shape against the schema in test fixtures.
- `docs/modules/audio.md`: `load_audio_manifest()` for the pitch demo (§9.5 IndicVoices-R clip playback).
- `docs/modules/training.md`: `load_sft_corpus()` behind `--sft-warmup-steps` flag; also invokes `training/data_export.py` which calls `task_generator.enumerate_variants()` to produce the publication briefs.

### 6.4 Publishes to

- HF Hub dataset repo `<org>/driftcall-indic-briefs` (one-time, pre-event, Phase C5 per `DRIFTCALL/CLAUDE.md` §4.1).

### 6.5 Non-dependencies (explicit)

- Does **not** import from `env.py`, `rewards.py`, `app.py`, or the training entrypoint. Pure data layer.
- Does **not** hit the network at runtime. Every file is local. Publication script is a separate, explicitly-invoked entrypoint.
- Does **not** depend on GPU, CUDA, or PyTorch. CPU-only.

---

## 7. Edge Cases

1. **Missing template variant for a rare language.** `templates.yaml` is authored with `hinglish` + `hi` + `en` + `ta` but an author forgets `kn` for one template. `load_templates` runs the per-template check `set(variants.keys()) == set(typing.get_args(LanguageCode))` and raises `DatasetSchemaError: template 'restaurant.order.veg' missing language_variants['kn']`. The generator's `NoVariantForLanguageError` (task_generator.md §5) never has a chance to fire because loading fails first. Fix: author supplies the missing variant; loader re-runs.

2. **Unicode NFD in author contribution.** A collaborator pastes a Kannada weekday name from macOS clipboard (NFD by default for composed characters). `load_i18n` re-normalizes to NFC *before* equality/hashing; the assertion `unicodedata.is_normalized("NFC", value)` then runs post-normalization as a defense against Python/ICU bugs. In practice the round-trip succeeds and NFC is stored. The pre-commit hook separately catches NFD at commit time so CI never sees it.

3. **License incompatibility (CC-BY-SA row smuggled into an Apache-2.0 bundle).** An author, inspired by an SGD row, copies 20 tokens verbatim into a template variant. Publication CI runs a suffix-array check over cached SGD/MTOP exports looking for ≥ 10-token verbatim matches; on hit, `LicenseConflictError("variant in 'airline.book.budget_timewindow' matches SGD row sgd_5432:0 (≥ 10 tokens)")` raises. Fix: rewrite the variant. We keep only *inspiration*, never verbatim text, from CC-BY-SA sources. The threshold (10 tokens) is a pragmatic choice: below that length we treat overlap as incidental linguistic reuse; at or above we flag.

4. **Empty language cohort in a stage mix.** A future curriculum config passes `language_weights = {"en": 1.0, "hi": 0.0, "ta": 0.0, "kn": 0.0, "hinglish": 0.0}`. This is valid at the task-generator level (task_generator.md §3.2 — non-negative weights summing to 1 are legal). `datasets` does not re-validate curriculum config; it only asserts the *library* has variants for all 5 languages. Downstream (`task_generator`) will simply never draw `hi`/`ta`/`kn`/`hinglish`. No error in this module.

5. **Train/val episode-id collision at publication time.** `data_export.py` draws 15,000 seeds for train and 500 for val. If the RNG accidentally maps a train seed into `[20_000_000, 20_000_500)` (the val reserved range) — which cannot happen given the seed-space partitioning in §3.1 — the assertion `train_seeds.isdisjoint(val_seeds)` raises `TrainValLeakError` with the offending seed. Safeguard: train seeds are drawn from `[0, 20_000_000)` and val seeds from `[20_000_000, 20_000_500)`. The two ranges are non-overlapping by construction; the assertion is a defense against future range edits.

6. **Drift-pattern-id orphan (trace references pattern not in YAML).** A test fixture or cached trace references `drift_pattern_id='airline.mysterious_fee'` but `drifts.yaml` has no such entry (it was renamed or removed). `load_drift_patterns` does not look at traces; it only checks internal consistency. The *trace consumer* (`rewards.r2_drift_detection` in `docs/modules/rewards.md`) raises `UnknownDriftPatternError` at scoring time per drift_injector.md §5. If the orphan is discovered during dataset publication, the publication script raises `DriftPatternOrphanError` and aborts.

7. **JSON Schema file that is valid JSON but not valid JSON Schema 2020-12.** `data/api_schemas/cab/v3.json` is hand-edited and accidentally drops the `$schema` keyword or uses an unknown keyword. `load_api_schemas` runs `jsonschema.Draft202012Validator.check_schema(schema)` and on failure raises `DatasetSchemaError("cab/v3.json: not a valid JSON Schema 2020-12: <error>")`. The env refuses to serve `reset()` until fixed.

8. **Audio clip on disk does not match manifest sha256.** `data/audio/real/MANIFEST.jsonl` lists `kn_greeting_03.wav` with `sha256=abc...`. The file gets re-encoded (e.g., by a well-intentioned ffmpeg pass). `load_audio_manifest` re-hashes every referenced WAV and raises `ChecksumMismatchError("kn_greeting_03.wav: expected abc..., got def...")`. Fix: either revert the WAV, or regenerate the manifest after an audit trail commit.

9. **SFT corpus contains a val-reserved seed.** Sarvam-M synthesis inadvertently uses a seed in `[20_000_000, 20_000_500)`. `load_sft_corpus` raises `TrainValLeakError`. The training script may be configured to treat this as fatal (default) or to filter out those trajectories (`--sft-tolerate-leak`); the loader itself always raises.

10. **PyYAML silently deduplicating keys.** If `drifts.yaml` has two entries with the same `id`, the YAML parse is valid but one wins. `load_drift_patterns` builds a set of ids during validation and raises `DuplicateDriftPatternIdError` on collision, with both source line numbers.

11. **Partial SFT corpus recovery (L4 restart).** `training/sft_generator.py` is mid-run at trajectory 137 of a target 300 when the host OOM-kills the process (Sarvam-M inference peak memory). On restart, the script re-opens `data/sft_warmup/trajectories.jsonl`, reads the existing 137 rows (each fsync'd atomically per §4.6), reconstructs the completed `(generation_batch_id, generation_index)` pairs, and resumes from index 137 of the same batch. It does NOT start a new `generation_batch_id` — batch id is rehydrated from the last row. When generation finally reaches 300, the script validates `len(rows) == target_count`; if a Sarvam-M response was silently truncated (say, only 298 rows written), `PartialSFTCorpusError("expected 300, got 298")` is raised and the operator must decide whether to re-run the missing two or ship a corpus with a smaller `target_count`. `load_sft_corpus` performs the same count check at load time.
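
The seed-partition checks from items 5 and 9 can be sketched as below. The range bounds are the ones quoted in this document; `assert_no_leak` is a hypothetical helper name and the exception is a local stand-in.

```python
TRAIN_RANGE = range(0, 20_000_000)
VAL_RANGE = range(20_000_000, 20_000_500)


class TrainValLeakError(Exception):
    """Local stand-in; the real class subclasses DatasetError (§5)."""


def assert_no_leak(train_seeds: set, val_seeds: set, sft_goal_seeds=()) -> None:
    # Item 5: train and val seed sets must be disjoint.
    overlap = train_seeds & val_seeds
    if overlap:
        raise TrainValLeakError(f"seed {min(overlap)} appears in both train and val")
    # Item 9: no SFT trajectory may use a seed from the val reserved range.
    for seed in sft_goal_seeds:
        if seed in VAL_RANGE:  # O(1) membership test for ints on a range
            raise TrainValLeakError(f"SFT goal_seed {seed} is in the val reserved range")


assert_no_leak({0, 41, 19_999_999}, {20_000_000, 20_000_499}, sft_goal_seeds=[7, 1234])

try:
    assert_no_leak({0, 41}, {20_000_000}, sft_goal_seeds=[20_000_123])
    leak_detected = False
except TrainValLeakError:
    leak_detected = True
```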

---

## 8. Examples

### 8.1 Full `templates.yaml` entry for `airline.book.budget_timewindow`

```yaml
# SPDX-License-Identifier: Apache-2.0
# Copyright 2026 DriftCall Team
# Derived-from: AmazonScience/MASSIVE (intent taxonomy, Apache-2.0)

- template_id: airline.book.budget_timewindow
  domain: airline
  intent: book_flight
  min_stage: 1
  required_slots: [from, to, when]
  optional_slots: [seat_pref]
  constraints_template:
    budget_inr:
      distribution: uniform
      low: 3000
      high: 15000
      step: 500
    time_window:
      choices: [morning, afternoon, evening, late_night]
  drift_slot_tags: [price, total_fare_inr]
  # Language keys: ISO short codes matching LanguageCode = Literal["hi","ta","kn","en","hinglish"]
  language_variants:
    hinglish:
      - "Bhai {when} ko {to} jaana hai, cheapest flight {time_window} mein, {budget_inr} rupees max"
      - "{when} ko {from} se {to} ka ticket book kar de, under {budget_inr}, {time_window} ke baad"
    hi:
      - "मुझे {when} को {from} से {to} जाना है, {budget_inr} रुपये से कम में"
    ta:
      - "{when} அன்று {from} லிருந்து {to} க்கு டிக்கெட் வேண்டும், {budget_inr} ரூபாய்க்கு கீழ்"
    kn:
      - "{when} ರಂದು {from} ಇಂದ {to} ಗೆ ಅಗ್ಗದ ವಿಮಾನ ಟಿಕೆಟ್ ಬೇಕು, {budget_inr} ರೂಪಾಯಿಗಳ ಒಳಗೆ"
    en:
      - "Book the cheapest flight from {from} to {to} on {when}, budget under ₹{budget_inr}, departing {time_window}"
```

This is the single source-of-truth entry for the Stage-1 airline booking template; it mirrors DESIGN.md §8.3 and `docs/modules/task_generator.md` §4.1.
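
For an entry shaped like the one above, the language-key check could look like the sketch below. It operates on an already-parsed dict, so no YAML library is needed; `check_language_keys` is a hypothetical helper, and mapping its two return values onto `DatasetSchemaError` and `UnknownLanguageKeyError` is left to the loader.

```python
from typing import Literal, get_args

LanguageCode = Literal["hi", "ta", "kn", "en", "hinglish"]
REQUIRED_LANGUAGES = frozenset(get_args(LanguageCode))


def check_language_keys(template: dict):
    """Return (missing, unknown) language keys for one parsed template entry."""
    present = set(template.get("language_variants", {}))
    missing = sorted(REQUIRED_LANGUAGES - present)  # would map to DatasetSchemaError
    unknown = sorted(present - REQUIRED_LANGUAGES)  # would map to UnknownLanguageKeyError
    return missing, unknown


template = {
    "template_id": "restaurant.order.veg",
    "language_variants": {"hinglish": ["..."], "hi": ["..."], "en": ["..."], "ta": ["..."]},
}
print(check_language_keys(template))
```

Deriving the required set from `get_args(LanguageCode)` keeps the check in step with the type alias if a language is ever added.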

### 8.2 Full `drifts.yaml` entry for `airline.price_rename`

```yaml
# SPDX-License-Identifier: Apache-2.0
# Copyright 2026 DriftCall Team

- id: airline.price_rename
  drift_type: schema
  domain: airline
  from_version: v1
  to_version: v2
  description: "field 'price' renamed to 'total_fare_inr'; 'currency' removed"
  mutation:
    rename: {price: total_fare_inr}
    remove: [currency]
  detection_hints:
    - "total_fare_inr"
    - "price"
    - "rename"
```

`load_drift_patterns` will (a) parse this, (b) check `id` uniqueness, (c) confirm `from_version=v1` + `to_version=v2` both exist as `data/api_schemas/airline/v1.json` + `data/api_schemas/airline/v2.json`, (d) confirm `detection_hints` is non-empty, (e) wrap `mutation` in `MappingProxyType`. Matches `docs/modules/drift_injector.md` §4.3 byte-for-byte.
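
Steps (b) through (e) can be sketched on already-parsed entries as follows. This is an illustration, not the real loader: `validate_patterns` is a hypothetical name, the on-disk schema lookup is replaced by an `available_versions` dict, and the exceptions are local stand-ins.

```python
from types import MappingProxyType


class DuplicateDriftPatternIdError(Exception): ...
class DriftPatternOrphanError(Exception): ...
class DatasetSchemaError(Exception): ...


def validate_patterns(entries: list, available_versions: dict) -> dict:
    """entries: parsed YAML list; available_versions: domain -> set of schema versions."""
    seen = {}
    for entry in entries:
        pid = entry["id"]
        if pid in seen:  # (b) id uniqueness
            raise DuplicateDriftPatternIdError(pid)
        for key in ("from_version", "to_version"):  # (c) both schema versions exist
            if entry[key] not in available_versions.get(entry["domain"], set()):
                raise DriftPatternOrphanError(f"{pid}: {key}={entry[key]!r} has no schema file")
        if not entry.get("detection_hints"):  # (d) non-empty hints
            raise DatasetSchemaError(f"{pid}: detection_hints must be non-empty")
        # (e) read-only view; note MappingProxyType is shallow, nested dicts stay mutable
        seen[pid] = {**entry, "mutation": MappingProxyType(entry["mutation"])}
    return seen


entry = {
    "id": "airline.price_rename", "drift_type": "schema", "domain": "airline",
    "from_version": "v1", "to_version": "v2",
    "mutation": {"rename": {"price": "total_fare_inr"}, "remove": ["currency"]},
    "detection_hints": ["total_fare_inr", "price", "rename"],
}
patterns = validate_patterns([entry], {"airline": {"v1", "v2"}})
```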

### 8.3 `data/api_schemas/airline/v2.json`

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://driftcall.dev/schemas/airline/v2.json",
  "$comment": "SPDX-License-Identifier: Apache-2.0. v2 = post-price_rename drift (DESIGN.md §5.1).",
  "title": "Airline search result (v2)",
  "type": "object",
  "required": ["flight_id", "from", "to", "depart", "total_fare_inr", "seats_left"],
  "additionalProperties": false,
  "properties": {
    "flight_id":       {"type": "string", "pattern": "^[0-9A-Z]{2}-[0-9]{4}$"},
    "from":            {"type": "string", "pattern": "^[A-Z]{3}$"},
    "to":              {"type": "string", "pattern": "^[A-Z]{3}$"},
    "depart":          {"type": "string", "format": "date-time"},
    "total_fare_inr":  {"type": "integer", "minimum": 0},
    "seats_left":      {"type": "integer", "minimum": 0}
  }
}
```

Note that `price` and `currency` from v1 are absent (drift `airline.price_rename` applied). Vendors (`docs/modules/vendors.md`) validate their emitted `airline.search` responses against whichever version the injector has installed in `state.schema_versions['airline']`. This schema also serves as the R2 structural detection surface: a tool call that keys into `price` after drift fails with a `KeyError` / 422, which is a detection-positive signal per DESIGN.md §7.1 R2.
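
To make the v1-to-v2 transition concrete, here is a sketch that applies the `airline.price_rename` mutation from §8.2 to a record. Two assumptions are made loudly: the v1 record shape is inferred (the v1 schema is not shown here), and `apply_mutation` is a hypothetical reading of the `rename`/`remove` semantics, not the injector's actual code.

```python
def apply_mutation(record: dict, mutation: dict) -> dict:
    # Work on a copy so the v1 record is left untouched.
    out = dict(record)
    for old, new in mutation.get("rename", {}).items():
        if old in out:
            out[new] = out.pop(old)
    for field in mutation.get("remove", []):
        out.pop(field, None)
    return out


# Assumed v1 shape: v2's fields, with 'price' + 'currency' instead of 'total_fare_inr'.
v1_record = {
    "flight_id": "6E-2041", "from": "HYD", "to": "BLR",
    "depart": "2026-04-30T18:40:00+05:30",
    "price": 7450, "currency": "INR", "seats_left": 9,
}
mutation = {"rename": {"price": "total_fare_inr"}, "remove": ["currency"]}
v2_record = apply_mutation(v1_record, mutation)

try:
    v2_record["price"]  # an agent still reading the old field name
    stale_read_failed = False
except KeyError:
    stale_read_failed = True
```

After the mutation the record carries exactly the six fields listed under `required` in the v2 schema, and the stale `price` lookup fails, which is the structural signal described above.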

### 8.4 `MANIFEST.jsonl` row for a curated IndicVoices-R clip (L3)

```json
{"utterance_id": "iv_r_kn_0451", "path": "real/kn/iv_r_kn_0451.wav", "language": "kn", "source": "real_indicvoices_r", "license": "Apache-2.0", "sha256": "b7f1a9c2e5d4...", "duration_s": 4.82}
```

Referenced only by the pitch demo. Training never touches this file — DRIFTCALL/CLAUDE.md §9 "Do not put TTS/ASR in the training loop".
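
A sketch of the per-clip checksum check on a manifest row like the one above. Assumptions: `verify_clip` is a hypothetical helper, `path` is resolved relative to an audio root directory, and the demo uses placeholder bytes rather than a real WAV file.

```python
import hashlib
import json
import tempfile
from pathlib import Path


class ChecksumMismatchError(Exception): ...


def verify_clip(manifest_line: str, audio_root: Path) -> dict:
    row = json.loads(manifest_line)
    digest = hashlib.sha256()
    with open(audio_root / row["path"], "rb") as f:
        # Stream in 1 MiB chunks so large clips are not read into memory at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    actual = digest.hexdigest()
    if actual != row["sha256"]:
        raise ChecksumMismatchError(f"{row['path']}: expected {row['sha256']}, got {actual}")
    return row


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    clip = root / "real" / "kn" / "iv_r_kn_0451.wav"
    clip.parent.mkdir(parents=True)
    clip.write_bytes(b"placeholder audio bytes")
    line = json.dumps({
        "utterance_id": "iv_r_kn_0451",
        "path": "real/kn/iv_r_kn_0451.wav",
        "sha256": hashlib.sha256(clip.read_bytes()).hexdigest(),
    })
    verified = verify_clip(line, root)

    clip.write_bytes(b"re-encoded bytes")  # simulate a well-intentioned ffmpeg pass
    try:
        verify_clip(line, root)
        tamper_detected = False
    except ChecksumMismatchError:
        tamper_detected = True
```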

### 8.5 Canonical `BriefRow` JSONL line (single row from `train/briefs.jsonl`)

One line from the published bundle — canonical JSON (sorted keys, no whitespace, UTF-8 preserved for Devanagari):

```json
{"catalogue_hash":"3f9a8e7c2b1d4e5f6a0b9c8d7e6f5a4b3c2d1e0f9a8b7c6d5e4f3a2b1c0d9e8f","created_ts_ist":"2026-04-25T10:30:00+05:30","domain":"airline","drift_schedule":[{"description":"'price' field renamed to 'total_fare_inr'","domain":"airline","drift_type":"schema","from_version":"v1","pattern_id":"airline.price_rename","to_version":"v2","turn":4}],"episode_id":"s2_ep_00000042","generator_version":"driftcall-1.0.0","goal":{"constraints":{"budget_inr":8000,"time_window":"evening"},"domain":"airline","intent":"book_flight","language":"hinglish","seed_utterance":"Bhai Friday ko Bangalore jaana hai, cheapest flight evening mein, 8000 rupees max","slots":{"from":"HYD","to":"BLR","when":"2026-04-30"}},"i18n_sha256":"a1b2c3d4e5f60718293a4b5c6d7e8f901234567890abcdef1234567890abcdef","language":"hinglish","seed":42,"stage":2,"template_id":"airline.book.budget_timewindow","templates_sha256":"b2c3d4e5f60718293a4b5c6d7e8f901234567890abcdef1234567890abcdef12"}
```

Note: keys are alphabetically sorted (`catalogue_hash`, `created_ts_ist`, `domain`, …), strings are NFC-normalized, and there is no whitespace between JSON tokens (spaces inside string values are preserved). The 64-hex hashes are full sha256 hex digests.
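
A consumer can check these properties on any line with a round-trip, as in the sketch below. `is_canonical_line` and `iter_strings` are hypothetical helpers; the check assumes rows contain no floats whose text form could change on re-serialization.

```python
import json
import unicodedata


def iter_strings(value):
    # Walk a decoded JSON value and yield every string, including dict keys.
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for key, item in value.items():
            yield key
            yield from iter_strings(item)
    elif isinstance(value, list):
        for item in value:
            yield from iter_strings(item)


def is_canonical_line(line: str) -> bool:
    obj = json.loads(line)
    redumped = json.dumps(obj, ensure_ascii=False, sort_keys=True, separators=(",", ":"))
    all_nfc = all(unicodedata.is_normalized("NFC", s) for s in iter_strings(obj))
    return redumped == line and all_nfc


print(is_canonical_line('{"language":"hi","seed":42}'))    # sorted keys, compact
print(is_canonical_line('{"seed": 42, "language": "hi"}'))  # unsorted keys, extra spaces
```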

### 8.6 `README.md` YAML frontmatter (HF Hub dataset card)

The published `<org>/driftcall-indic-briefs/README.md` begins with the following YAML frontmatter. The HF Dataset Viewer reads this block to auto-configure splits, license, and task tags.

```yaml
---
license: apache-2.0
language: [hi, ta, kn, en]
size_categories: [10K<n<100K]
task_categories: [conversational, text-generation]
pretty_name: DriftCall Indic Briefs
configs:
  - config_name: default
    data_files:
      - split: train
        path: train/briefs.jsonl
      - split: val
        path: val/briefs.jsonl
dataset_info:
  features:
    - { name: episode_id, dtype: string }
    - { name: seed, dtype: int64 }
    - { name: stage, dtype: int32 }
    - { name: language, dtype: string }
    - { name: domain, dtype: string }
    - { name: template_id, dtype: string }
  splits:
    - { name: train, num_examples: 15000 }
    - { name: val, num_examples: 500 }
---
```

The body of `README.md` follows below the frontmatter: dataset description, licensing chain (full Apache-2.0 text is in the separate `LICENSE` file per §3.4), provenance (`generator_version`, `catalogue_hash`), reward-caveat paragraph, and usage example. The frontmatter's `features` block lists only the top-level flat columns; nested structs (`goal`, `drift_schedule`) are auto-inferred by the HF Datasets library on first load.

---

## 9. Open Questions

1. **HF org name not yet finalized.** `<org>` placeholder in `<org>/driftcall-indic-briefs` depends on `DRIFTCALL/CLAUDE.md` §8 kickoff-checklist item "HF org name locked". The publication script parameterizes the org via `--hf-org`; no code change needed once locked, just a CLI arg at publication time. Does not block Phase D. **Sync note:** `DRIFTCALL/CLAUDE.md` §6 command table still lists the deprecated `huggingface-cli upload` — when the org name is locked, update that table to the modern `hf upload` in the same PR.

2. **SFT warmup corpus size — 200 vs 500 trajectories.** DESIGN.md §8.2 row 4 quotes the range "200–500". The exact count depends on Sarvam-M's cost/latency budget during one-shot synthesis. Recommend 200 as a floor (sufficient for format priming per §10 training convergence target) and 500 as a ceiling if inference time permits. Resolution: Person C chooses during Phase C4; does not affect loader or schema.

3. **Audio manifest curation count.** DESIGN.md §9 implies a handful of real IndicVoices-R clips for pitch demo realism, but does not specify exact count. Recommend 20 curated clips (4 per language × 5 languages), balanced by speaker gender and dialect region. Resolution: Person D curates during Phase C5; this module only ensures the manifest format is stable.

### 9.1 Resolved

- **License-cache implementation (previously Open Q #4).** `data/.license_cache/{sgd,mtop}.idx` is a sqlite3 FTS5 index built by `scripts/build_license_cache.py` at dev time. Schema: `CREATE VIRTUAL TABLE licensed_text USING fts5(chunk_text, source_id);` with 5-gram tokenization. CI invokes this index (read-only) on each PR to verify that no `seed_utterance` or template variant in the publication bundle substring-matches any upstream CC-BY-SA text (≥ 10-token threshold, §3.4). The index is built once per upstream corpus version and committed to the repo so re-builds are only needed when SGD or MTOP themselves publish a new version. Determinism + reviewability win over per-PR rebuild cost.

---

**This doc tells you HOW the four dataset layers are shaped, loaded, validated, and published. Do not write loaders before a fresh critic returns `NOTHING_FURTHER`. Do not commit `data/*.yaml` without the pre-commit NFC + PII + license-header guards running. Do not ship the HF Hub bundle without the train/val disjointness and verbatim-match checks green.**