
datasets_tests.md — Test Plan for `docs/modules/datasets.md`

Owner: Person B (Rewards & Tests), co-authored with Person C (Training & Data)
Target module: DRIFTCALL/docs/modules/datasets.md (final sealed)
Implements coverage for: DESIGN.md §8 (§§8.1, 8.2, 8.3, 8.4, 8.5, 8.6) and CLAUDE.md §3.1
Frameworks: pytest, hypothesis, pytest-cov
Status: DRAFT, pending ≥ 1 fresh critic round (test-plan gate is lighter per CLAUDE.md §3.2 Batch D4)

0. Scope & Non-goals

datasets.md specifies four on-disk data layers (L1 templates/i18n, L2 drift-patterns + api-schemas, L3 audio manifest, L4 SFT warmup) plus a one-shot HF Hub publication contract. Every loader is a lazy singleton that NFC-normalizes on read, validates against a frozen dataclass schema, and raises a typed DatasetError subclass on any shape / license / lineage / leak violation.

This plan covers:

Constructibility + immutability of every frozen dataclass declared in datasets.md §4 (§4.1 – §4.7).
Canonical JSON serialization — byte-identical output of json.dumps(row, ensure_ascii=False, sort_keys=True, separators=(",",":")) across Python / libc versions (datasets.md §3.1 invariant #6).
Lineage hash triple — every BriefRow carries catalogue_hash / templates_sha256 / i18n_sha256, and any mismatch at eval-load raises CatalogueHashMismatchError (datasets.md §3.5 invariant #9, §5).
Size + contents invariants — TemplateLibrary has exactly 20 templates at v1.0, DriftPatternLibrary has exactly 20 patterns, APISchemaRegistry exactly 14 schemas over 5 domains (datasets.md §3.5 invariants #2 / #3 / #4).
License bundle integrity — root LICENSE contains the full verbatim Apache-2.0 text (byte length ≥ 11 000, canonical header string present, SHA pinned in fixture), LICENSES.md markdown table parses (datasets.md §3.4).
Audio manifest provenance — AudioClip.source only accepts the single Literal["real_indicvoices_r"]; the string "synth_kokoro" is rejected at dataclass-construction time (datasets.md §4.5).
Publication determinism — random.Random(20260425).sample(range(0, 20_000_000), 15_000) is byte-identical across re-runs; val seeds are list(range(20_000_000, 20_000_500)); train ∩ val = ∅ (datasets.md §2.4, §3.1, §3.5 invariants #5 / #6).
SFT restart recovery — training/sft_generator.py appends one canonical-JSON line + os.fsync(fd) per trajectory, rehydrates generation_batch_id on restart, emits monotonic generation_index, and raises PartialSFTCorpusError on final-count mismatch (datasets.md §4.6, §7 edge 11).
License-cache FTS5 schema — scripts/build_license_cache.py produces data/.license_cache/{sgd,mtop}.idx with CREATE VIRTUAL TABLE licensed_text USING fts5(chunk_text, source_id); and 5-gram tokenization (datasets.md §9.1).
HF dataset-card frontmatter — README.md YAML frontmatter parses via the HF datasets loader (datasets.md §8.6).

Every test below maps to one numbered clause in datasets.md. Clause references are embedded in each test docstring as datasets.md §X.Y / datasets.md §7 edge N.

1. Unit tests

All unit tests live in DRIFTCALL/tests/data/. Import surface under test:

from driftcall.data.models import (
    TemplateLibrary, I18nLibrary,
    DriftPatternLibrary, APISchemaRegistry, APISchema,
    AudioManifest, AudioClip,
    SFTCorpus, SFTTrajectory,
    BriefRow,
)
from driftcall.data.loaders import (
    load_templates, load_i18n,
    load_drift_patterns, load_api_schemas,
    load_audio_manifest, load_sft_corpus,
)
from driftcall.data.errors import (
    DatasetError, DatasetFileMissingError, MalformedYAMLError, MalformedJSONError,
    DatasetSchemaError, UnknownLanguageKeyError, LicenseConflictError,
    TrainValLeakError, DriftPatternOrphanError, ChecksumMismatchError,
    UnicodeNFDError, PIIDetectedError, DuplicateDriftPatternIdError,
    CatalogueHashMismatchError, PartialSFTCorpusError,
)
from training.data_export import canonical_dumps, sample_train_seeds, val_seeds
from training.sft_generator import append_trajectory, resume_batch
from scripts.build_license_cache import build_index, FTS5_SCHEMA_DDL

Fixtures (§5) come from tests/data/conftest.py and tests/conftest.py.

1.1 `BriefRow` — frozen dataclass + 13-field contract

#	Test name	Asserts	Maps to
U1	`test_brief_row_has_exactly_thirteen_fields`	`len(dataclasses.fields(BriefRow)) == 13`; field names equal the ordered tuple `("episode_id","seed","stage","language","domain","template_id","goal","drift_schedule","catalogue_hash","templates_sha256","i18n_sha256","generator_version","created_ts_ist")`.	datasets.md §4.7
U2	`test_brief_row_is_frozen`	Building from `brief_row_happy` fixture, every attempted assignment (`row.seed = 7`, `row.episode_id = "x"`, etc.) raises `dataclasses.FrozenInstanceError`. Parametrized over all 13 fields.	datasets.md §3.5 invariant #1 (immutability), §4.7
U3	`test_brief_row_happy_construct_roundtrip`	`brief_row_happy` constructs; `dataclasses.asdict(row)` returns a dict with 13 keys matching the spec; all string fields are NFC.	datasets.md §4.7
U4	`test_brief_row_missing_required_field_raises`	`BriefRow()` raises `TypeError` (no defaults — every field is required). Parametrized: supplying 12 of 13 fields also raises.	datasets.md §4.7
U5	`test_brief_row_stage_literal_enforced`	`BriefRow(..., stage=4)` is statically illegal; at runtime a `DatasetSchemaError` is raised by `load_*` on a `stage` value ∉ `{1,2,3}`.	datasets.md §4.7
U6	`test_brief_row_domain_literal_enforced`	A `domain="payment"` row is rejected at load time (`BriefRow.domain` is the 4-value primary-domain literal — payment is L2-only and does not appear in publication).	datasets.md §4.7, §3.5 invariant #4
U7	`test_brief_row_created_ts_ist_must_carry_plus0530_offset`	`load_briefs("train/briefs.jsonl")` rejects a row whose `created_ts_ist` does not end in `+05:30`.	datasets.md §4.7
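
The frozen-dataclass checks in U2 and U4 can be sketched on a small stand-in. `MiniBriefRow` and `frozen_field_names` below are hypothetical names used only for illustration; the real `BriefRow` has 13 fields and lives in `driftcall.data.models`.

```python
import dataclasses
from dataclasses import dataclass


@dataclass(frozen=True)
class MiniBriefRow:
    """Hypothetical 3-field stand-in for the 13-field BriefRow."""
    episode_id: str
    seed: int
    stage: int


def frozen_field_names(row) -> list[str]:
    """Return the names of the fields that reject assignment on `row`."""
    rejected = []
    for field in dataclasses.fields(row):
        try:
            setattr(row, field.name, None)
        except dataclasses.FrozenInstanceError:
            rejected.append(field.name)
    return rejected


def constructs_without_args() -> bool:
    """True only if the dataclass can be built with no arguments."""
    try:
        MiniBriefRow()
    except TypeError:
        return False
    return True


row = MiniBriefRow(episode_id="ep-0001", seed=7, stage=2)
rejected_fields = frozen_field_names(row)
```

In a real test the loop would likely become a `pytest.mark.parametrize` over `dataclasses.fields(BriefRow)`, so each field reports its own failure.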

1.2 Canonical JSON ordering

#	Test name	Asserts	Maps to
U8	`test_canonical_dumps_sorts_keys`	`canonical_dumps({"b":1,"a":2}) == '{"a":2,"b":1}'`; output contains no spaces; no trailing newline.	datasets.md §3.1 canonical-JSON block
U9	`test_canonical_dumps_preserves_devanagari`	`canonical_dumps({"city":"बेंगलुरु"})` contains the literal Devanagari bytes (UTF-8), NOT `\u0915…` escapes. Exact bytes asserted with `==`.	datasets.md §3.1 (`ensure_ascii=False`)
U10	`test_canonical_dumps_exact_separators`	The serialized form of `{"a":1,"b":2}`, encoded as UTF-8, equals `b'{"a":1,"b":2}'` byte-for-byte; no whitespace between `,` / `:` and neighbours.	datasets.md §3.1 canonical-JSON block
U11	`test_canonical_dumps_brief_row_matches_golden`	`canonical_dumps(asdict(brief_row_happy))` equals the golden line in §8.5 of datasets.md byte-for-byte.	datasets.md §8.5
U12	`test_canonical_dumps_is_idempotent`	`canonical_dumps(json.loads(canonical_dumps(row)))` == `canonical_dumps(row)` for 100 random fixture perturbations (fuzzed via hypothesis — see §2).	datasets.md §3.1, §3.5 invariant #6
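
A minimal sketch of the serializer these tests exercise, built directly from the `json.dumps` arguments quoted in §0. The real `canonical_dumps` in `training.data_export` may differ in signature or add validation; this version only shows the three keyword arguments at work.

```python
import json


def canonical_dumps(row: dict) -> str:
    """Serialize with sorted keys, compact separators, and raw UTF-8 text."""
    return json.dumps(
        row,
        ensure_ascii=False,      # keep Devanagari as-is instead of \uXXXX escapes
        sort_keys=True,          # key order never depends on insertion order
        separators=(",", ":"),   # no whitespace after ',' or ':'
    )


city = "बेंगलुरु"
sorted_line = canonical_dumps({"b": 1, "a": 2})
city_line = canonical_dumps({"city": city})
round_trip = canonical_dumps(json.loads(city_line))
```

Because the output is a `str`, byte-level comparisons need an explicit `.encode("utf-8")` on both sides.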

1.3 Lineage hashes (`catalogue_hash`, `templates_sha256`, `i18n_sha256`)

#	Test name	Asserts	Maps to
U13	`test_catalogue_hash_matches_drifts_yaml_bytes`	`catalogue_hash == hashlib.sha256(Path("data/drift_patterns/drifts.yaml").read_bytes()).hexdigest()`; length 64 hex chars; lowercase.	datasets.md §4.7, §3.5 invariant #9
U14	`test_templates_sha256_matches_templates_yaml_bytes`	`templates_sha256 == sha256(templates.yaml)` byte-for-byte.	datasets.md §4.7
U15	`test_i18n_sha256_matches_i18n_yaml_bytes`	`i18n_sha256 == sha256(i18n.yaml)` byte-for-byte.	datasets.md §4.7
U16	`test_catalogue_hash_mismatch_raises_at_load`	Given `brief_row_happy` serialized with `catalogue_hash="deadbeef…"` (wrong), `load_briefs(path)` raises `CatalogueHashMismatchError` naming the offending field(s).	datasets.md §3.5 invariant #9, §5 (`CatalogueHashMismatchError`)
U17	`test_hash_computation_is_stable_across_reruns`	Computing the three hashes twice in the same process returns identical values; computing across two subprocesses (via `subprocess.check_output`) also identical.	datasets.md §3.5 invariant #6
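
The lineage check in U13 to U16 reduces to hashing raw file bytes and comparing against the value pinned in the row. The sketch below assumes a hypothetical helper `check_lineage` and a local stand-in exception class; the real loader's structure is not specified here.

```python
import hashlib
import tempfile
from pathlib import Path


class CatalogueHashMismatchError(Exception):
    """Local stand-in for driftcall.data.errors.CatalogueHashMismatchError."""


def file_sha256(path: Path) -> str:
    """Lowercase 64-char hex SHA-256 of the file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def check_lineage(row: dict, sources: dict) -> None:
    """Raise if any pinned hash in `row` differs from the file on disk."""
    mismatched = sorted(
        field for field, path in sources.items()
        if row.get(field) != file_sha256(path)
    )
    if mismatched:
        raise CatalogueHashMismatchError(
            "lineage mismatch in: " + ", ".join(mismatched)
        )


with tempfile.TemporaryDirectory() as tmp:
    drifts = Path(tmp) / "drifts.yaml"
    drifts.write_bytes(b"patterns: []\n")
    sources = {"catalogue_hash": drifts}
    row = {"catalogue_hash": file_sha256(drifts)}
    check_lineage(row, sources)              # hashes agree: no exception

    drifts.write_bytes(b"patterns: [] \n")   # one-byte mutation on disk
    try:
        check_lineage(row, sources)
        mismatch_message = None
    except CatalogueHashMismatchError as exc:
        mismatch_message = str(exc)
```

Hashing `read_bytes()` rather than decoded text keeps the digest independent of newline translation and encoding defaults.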

1.4 `AudioClip.source` excludes synth

#	Test name	Asserts	Maps to
U18	`test_audio_clip_source_accepts_real_only`	`AudioClip(..., source="real_indicvoices_r", ...)` constructs; `AudioClip(..., source="synth_kokoro", ...)` raises `DatasetSchemaError` at load.	datasets.md §4.5
U19	`test_audio_manifest_rejects_synth_row`	A `MANIFEST.jsonl` line containing `"source":"synth_kokoro"` causes `load_audio_manifest` to raise `DatasetSchemaError("source must be 'real_indicvoices_r'")`.	datasets.md §4.5
U20	`test_audio_manifest_duration_upper_bound`	Loading an `AudioClip(..., duration_s=20.01)` row raises `DatasetSchemaError`; `20.00` is accepted (DESIGN.md §9 upper bound).	datasets.md §4.5
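
A `Literal` annotation is not enforced by Python at runtime, so the rejection of `"synth_kokoro"` needs an explicit check somewhere. One possible arrangement, shown on a hypothetical `MiniAudioClip` with a local stand-in error class, is to validate in `__post_init__`; whether the real `AudioClip` validates there or in the loader is an assumption.

```python
from dataclasses import dataclass
from typing import Literal, get_args

AudioSource = Literal["real_indicvoices_r"]
MAX_DURATION_S = 20.0


class DatasetSchemaError(ValueError):
    """Local stand-in for driftcall.data.errors.DatasetSchemaError."""


@dataclass(frozen=True)
class MiniAudioClip:
    """Hypothetical reduced AudioClip with only the fields checked here."""
    path: str
    source: AudioSource
    duration_s: float

    def __post_init__(self) -> None:
        # get_args(Literal[...]) returns the tuple of allowed values.
        if self.source not in get_args(AudioSource):
            raise DatasetSchemaError("source must be 'real_indicvoices_r'")
        if self.duration_s > MAX_DURATION_S:
            raise DatasetSchemaError("duration_s must be <= 20.0")


def is_rejected(source: str, duration_s: float) -> bool:
    """True if construction raises DatasetSchemaError."""
    try:
        MiniAudioClip("clip.wav", source, duration_s)
    except DatasetSchemaError:
        return True
    return False
```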

1.5 `TemplateLibrary.size == 20` at v1.0

#	Test name	Asserts	Maps to
U21	`test_template_library_size_is_exactly_twenty_at_v1`	`len(load_templates().templates) == 20`; `len(templates) % 5 == 0`; `generator_version.startswith("driftcall-1.0")`.	datasets.md §3.5 invariant #4, §4.1
U22	`test_template_library_four_domains_five_each`	Grouped by `template.domain`, exactly 4 primary domains are present (airline, cab, restaurant, hotel), 5 templates each. Payment is NOT a primary-domain template owner.	datasets.md §4.1, §3.5 invariant #4
U23	`test_template_library_every_language_every_template`	For every template, `set(template.language_variants.keys()) == {"hi","ta","kn","en","hinglish"}`; missing key raises `DatasetSchemaError` at load.	datasets.md §3.5 invariant #4, §7 edge 1
U24	`test_template_library_future_version_monotonic_growth`	Synthesize a mock `templates.yaml` with 25 entries tagged `generator_version="driftcall-1.1.0"`; `load_templates` accepts it (monotonic growth invariant holds: `len >= 20` and `len % 5 == 0`).	datasets.md §4.1

1.6 `LICENSES.md` schema parse + `LICENSE` verbatim Apache-2.0

#	Test name	Asserts	Maps to
U25	`test_root_license_byte_length_at_least_11000`	`len(Path("LICENSE").read_bytes()) >= 11_000`.	datasets.md §3.4
U26	`test_root_license_contains_apache_canonical_header`	The bytes `b"Apache License\n Version 2.0, January 2004"` appear at the top of `LICENSE`.	datasets.md §3.4
U27	`test_root_license_sha256_pinned`	`sha256(LICENSE bytes) == APACHE_2_0_CANONICAL_SHA` (pinned constant `8a0d778…`; exact value locked in `tests/data/fixtures/license_hashes.py`).	datasets.md §3.4
U28	`test_audio_licenses_md_embeds_full_apache_text`	`data/audio/LICENSES.md` byte length ≥ 11 000 AND contains the canonical Apache header.	datasets.md §3.4
U29	`test_sft_licenses_md_embeds_full_apache_text`	Same check for `data/sft_warmup/LICENSES.md`.	datasets.md §3.4
U30	`test_licenses_md_table_schema_parses`	The markdown table in each `LICENSES.md` parses with columns in exact order `["utterance_id" or "trajectory_id", "upstream_source", "upstream_license", "attribution_required", "notes"]` (the first column is either `utterance_id` or `trajectory_id`); every row has 5 cells; `attribution_required ∈ {"yes","no"}`.
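
The table check in U30 could be sketched as below. The parser is a deliberately small pipe-splitter, not a full markdown implementation, and it assumes the first column header is one of `utterance_id` or `trajectory_id`. The function names and the sample row are hypothetical.

```python
EXPECTED_TAIL = ["upstream_source", "upstream_license", "attribution_required", "notes"]
FIRST_COLUMN_CHOICES = ("utterance_id", "trajectory_id")


def parse_md_table(text: str):
    """Split a pipe-delimited markdown table into (header, rows)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip().startswith("|")]
    cells = [[c.strip() for c in ln.strip("|").split("|")] for ln in lines]
    # cells[1] is the |---|---| separator row and carries no data.
    return cells[0], cells[2:]


def license_table_errors(text: str) -> list[str]:
    """Return a list of schema problems; an empty list means the table is valid."""
    header, rows = parse_md_table(text)
    errors = []
    if header[0] not in FIRST_COLUMN_CHOICES or header[1:] != EXPECTED_TAIL:
        errors.append("unexpected header: " + repr(header))
    for number, row in enumerate(rows, start=1):
        if len(row) != 5:
            errors.append(f"row {number}: expected 5 cells, got {len(row)}")
        elif row[3] not in ("yes", "no"):
            errors.append(f"row {number}: attribution_required must be yes/no")
    return errors


SAMPLE = """
| utterance_id | upstream_source | upstream_license | attribution_required | notes |
|---|---|---|---|---|
| clip_0001 | IndicVoices-R | example-license | yes | sample row |
"""
sample_errors = license_table_errors(SAMPLE)
bad_errors = license_table_errors(SAMPLE.replace("| yes |", "| maybe |"))
```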

1.7 Seed selection — deterministic, byte-identical

#	Test name	Asserts	Maps to
U31	`test_train_seed_sampling_is_deterministic`	`sample_train_seeds() == random.Random(20260425).sample(range(0, 20_000_000), 15_000)`; re-running the function twice yields identical lists (element-wise equal + ordering identical).	datasets.md §2.4
U32	`test_train_seed_count_is_fifteen_thousand`	`len(sample_train_seeds()) == 15_000`; all elements in `[0, 20_000_000)`; no duplicates.	datasets.md §2.4, §3.1
U33	`test_val_seeds_are_exact_contiguous_slice`	`val_seeds() == list(range(20_000_000, 20_000_500))`; `len == 500`; first == 20_000_000; last == 20_000_499.	datasets.md §2.4
U34	`test_train_val_disjoint`	`set(sample_train_seeds()).isdisjoint(set(val_seeds()))`; the disjointness check raises `TrainValLeakError` when val seed `20_000_050` is spliced into the train output.	datasets.md §3.5 invariant #5, §7 edge 5
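
The seed contract in U31 to U34 follows directly from the expressions quoted in the table. In the sketch below, `assert_disjoint` and the local `TrainValLeakError` class are illustrative stand-ins, and the zero-argument function signatures are assumed from how the tests call them.

```python
import random

PUBLICATION_SEED = 20260425


class TrainValLeakError(Exception):
    """Local stand-in for driftcall.data.errors.TrainValLeakError."""


def sample_train_seeds() -> list[int]:
    """15 000 distinct seeds drawn from [0, 20_000_000) with a fixed RNG seed."""
    return random.Random(PUBLICATION_SEED).sample(range(0, 20_000_000), 15_000)


def val_seeds() -> list[int]:
    """The contiguous block of 500 validation seeds."""
    return list(range(20_000_000, 20_000_500))


def assert_disjoint(train: list[int], val: list[int]) -> None:
    """Raise if any seed appears in both splits."""
    leaked = set(train) & set(val)
    if leaked:
        raise TrainValLeakError(f"seeds in both splits: {sorted(leaked)}")


train = sample_train_seeds()
val = val_seeds()
assert_disjoint(train, val)  # ranges do not overlap, so this passes

try:
    assert_disjoint(train + [20_000_050], val)
    leak_message = None
except TrainValLeakError as exc:
    leak_message = str(exc)
```

`random.sample` draws without replacement, which is what guarantees the no-duplicates property in U32.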

1.8 SFT restart recovery + `PartialSFTCorpusError`

#	Test name	Asserts	Maps to
U35	`test_sft_append_one_line_fsyncs`	`append_trajectory(fd, traj)` writes exactly one canonical-JSON line ending `\n` and invokes `os.fsync(fd)` once per call (verified via `unittest.mock.patch("os.fsync")`).	datasets.md §4.6
U36	`test_sft_generation_batch_id_monotonic_within_batch`	Batch generates N=10 trajectories; all carry identical `generation_batch_id` (uuid4); `generation_index` values are `[0..9]` strictly monotonic and contiguous.	datasets.md §4.6
U37	`test_sft_restart_rehydrates_batch_id`	Given `trajectories.jsonl` pre-populated with 3 rows (batch_id `B`), call `resume_batch(path)`; returned `(batch_id, next_index) == (B, 3)`. New rows appended reuse `B`.	datasets.md §4.6, §7 edge 11
U38	`test_sft_partial_corpus_error_on_resume_count_mismatch`	Corpus file has 298 rows with `target_count=300` recorded in corpus metadata; `load_sft_corpus` raises `PartialSFTCorpusError("expected 300, got 298")`.	datasets.md §4.6, §5 (`PartialSFTCorpusError`), §7 edge 11
U39	`test_sft_generator_final_count_validation`	`training/sft_generator.py` run with `--target-count 5` but Sarvam-M drops 1 response → generator raises `PartialSFTCorpusError` post-loop (not silently).	datasets.md §4.6
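
The append, fsync, and resume cycle described in U35 to U38 can be sketched as follows. The `append_trajectory(fd, traj)` and `resume_batch(path)` signatures mirror the calls in the table; `check_final_count`, the local error class, and the behaviour on an empty file (start a fresh uuid4 batch at index 0) are assumptions. The sketch does not handle a torn final line left by a crash.

```python
import json
import os
import tempfile
import uuid


class PartialSFTCorpusError(Exception):
    """Local stand-in for driftcall.data.errors.PartialSFTCorpusError."""


def append_trajectory(fd: int, traj: dict) -> None:
    """Write one canonical-JSON line and force it to disk."""
    line = json.dumps(traj, ensure_ascii=False, sort_keys=True, separators=(",", ":"))
    os.write(fd, (line + "\n").encode("utf-8"))
    os.fsync(fd)


def resume_batch(path: str) -> tuple:
    """Return (batch_id, next_index), reusing the batch id of existing rows."""
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        return str(uuid.uuid4()), 0
    with open(path, encoding="utf-8") as fh:
        rows = [json.loads(line) for line in fh if line.strip()]
    return rows[-1]["generation_batch_id"], rows[-1]["generation_index"] + 1


def check_final_count(path: str, target_count: int) -> None:
    """Raise if the corpus holds a different number of rows than requested."""
    with open(path, encoding="utf-8") as fh:
        got = sum(1 for line in fh if line.strip())
    if got != target_count:
        raise PartialSFTCorpusError(f"expected {target_count}, got {got}")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "trajectories.jsonl")
    batch_id, next_index = resume_batch(path)        # fresh start
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
    for i in range(next_index, 3):
        append_trajectory(fd, {"generation_batch_id": batch_id, "generation_index": i})
    os.close(fd)
    resumed = resume_batch(path)                     # simulated restart
    try:
        check_final_count(path, 5)
        count_message = None
    except PartialSFTCorpusError as exc:
        count_message = str(exc)
```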

1.9 License-cache FTS5 schema

#	Test name	Asserts	Maps to
U40	`test_license_cache_schema_ddl_is_exact`	`FTS5_SCHEMA_DDL == "CREATE VIRTUAL TABLE licensed_text USING fts5(chunk_text, source_id);"` — byte-for-byte.	datasets.md §9.1
U41	`test_license_cache_uses_5gram_tokenizer`	Run the FTS5 integrity check (`INSERT INTO licensed_text(licensed_text) VALUES('integrity-check')`) and inspect the stored `CREATE VIRTUAL TABLE` statement, which must record `tokenize = 'unicode61'` with the 5-gram config; `build_index(tokenizer="trigram")` raises `ValueError`.	datasets.md §9.1
U42	`test_license_cache_built_is_read_only_in_ci`	In CI mode (`DRIFTCALL_CI=1`), invoking `build_index` raises `RuntimeError("license cache is read-only in CI")`.	datasets.md §9.1
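
As a rough illustration of the license cache, the sketch below creates the table with the DDL from U40 and indexes word-level 5-gram chunks. Reading "5-gram tokenization" as whitespace-delimited 5-word shingles is an assumption, as are the `build_index(conn, docs)` signature and the `has_verbatim_hit` helper. It requires a Python build whose sqlite3 includes FTS5.

```python
import sqlite3

FTS5_SCHEMA_DDL = "CREATE VIRTUAL TABLE licensed_text USING fts5(chunk_text, source_id);"


def word_5grams(text: str) -> list[str]:
    """Overlapping 5-word windows; texts shorter than 5 words yield nothing."""
    tokens = text.split()
    return [" ".join(tokens[i:i + 5]) for i in range(len(tokens) - 4)]


def build_index(conn: sqlite3.Connection, docs: dict) -> None:
    """Create the FTS5 table and insert every 5-gram chunk per source."""
    conn.execute(FTS5_SCHEMA_DDL)
    for source_id, text in docs.items():
        conn.executemany(
            "INSERT INTO licensed_text(chunk_text, source_id) VALUES (?, ?)",
            [(chunk, source_id) for chunk in word_5grams(text)],
        )


def has_verbatim_hit(conn: sqlite3.Connection, phrase: str) -> bool:
    """True if `phrase` occurs as a contiguous token run inside any chunk."""
    query = '"' + phrase.replace('"', '""') + '"'   # FTS5 phrase-query syntax
    row = conn.execute(
        "SELECT 1 FROM licensed_text WHERE licensed_text MATCH ? LIMIT 1",
        (query,),
    ).fetchone()
    return row is not None


conn = sqlite3.connect(":memory:")
build_index(conn, {"sgd": "could you please book a flight to Chennai for tomorrow morning"})
hit = has_verbatim_hit(conn, "book a flight to Chennai")
miss = has_verbatim_hit(conn, "reserve a table for two")
chunk_count = conn.execute("SELECT count(*) FROM licensed_text").fetchone()[0]
```

With 5-word chunks, a phrase query can only match runs of at most five tokens, so longer contamination checks would need larger windows or several overlapping queries.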

1.10 `README.md` YAML frontmatter — HF dataset loader parse

#	Test name	Asserts	Maps to
U43	`test_readme_frontmatter_parses_with_pyyaml`	`yaml.safe_load(frontmatter_block)` yields a dict with keys `{license, language, size_categories, task_categories, pretty_name, configs, dataset_info}`; `license == "apache-2.0"`; `language == ["hi","ta","kn","en"]`.	datasets.md §8.6
U44	`test_readme_frontmatter_loads_via_hf_datasets`	`datasets.load_dataset(str(bundle_dir))` (HF loader) returns a `DatasetDict` with `{"train","val"}` splits; `train.num_rows == 15_000`; `val.num_rows == 500`. Skipped if `datasets` not installed.	datasets.md §8.6
U45	`test_readme_frontmatter_features_flat_columns_only`	The `features` block lists only the 6 flat columns `{episode_id, seed, stage, language, domain, template_id}`; nested `goal`/`drift_schedule` are NOT pre-declared (auto-inferred).	datasets.md §8.6

1.11 Miscellaneous spec wiring

#	Test name	Asserts	Maps to
U46	`test_load_drift_patterns_count_equals_twenty`	`len(load_drift_patterns().patterns) == 20`; every `drift_type ∈ {"schema","policy","tnc","pricing","auth"}`.	datasets.md §3.5 invariant #2
U47	`test_load_api_schemas_count_equals_fourteen_across_five_domains`	`APISchemaRegistry` reports exactly 14 schemas keyed `{airline:{v1,v2,v3}, cab:{v1,v2,v3}, restaurant:{v1,v2,v3}, hotel:{v1,v2,v3}, payment:{v1,v2}}`.	datasets.md §3.5 invariant #3
U48	`test_drift_pattern_orphan_raises`	YAML with `from_version="v5"` (nonexistent) raises `DriftPatternOrphanError`.	datasets.md §5, §7 edge 6
U49	`test_duplicate_drift_pattern_id_raises`	YAML with two entries sharing `id: airline.price_rename` raises `DuplicateDriftPatternIdError` citing both line numbers.	datasets.md §5, §7 edge 10
U50	`test_nfc_normalization_applied_at_load`	A templates YAML authored with NFD Kannada weekday normalizes to NFC on load; `unicodedata.is_normalized("NFC", v) is True` for every loaded string.	datasets.md §3.5 invariant #1, §7 edge 2
U51	`test_pii_10_digit_run_raises`	An authored string containing `"9876543210"` outside IATA / timestamp contexts raises `PIIDetectedError`.	datasets.md §3.5 invariant #8, §3.1
U52	`test_license_header_missing_raises`	A YAML file without the `# SPDX-License-Identifier:` leading comment raises `DatasetSchemaError` at load.	datasets.md §3.5 invariant #7
U53	`test_audio_manifest_sha256_mismatch_raises`	Corrupt a wav byte; `load_audio_manifest` raises `ChecksumMismatchError` citing expected vs actual.	datasets.md §5, §7 edge 8
U54	`test_sft_trajectory_val_seed_raises`	`SFTTrajectory(goal_seed=20_000_042, …)` on load raises `TrainValLeakError`.	datasets.md §5, §7 edge 9
U55	`test_loader_is_singleton_per_path`	`load_templates()` twice returns the same object by identity (`is`); called with a different `path=` yields a distinct instance cached separately.	datasets.md §3.2
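
The NFC-on-read behaviour in U50 can be approximated by normalizing every string in a parsed YAML/JSON structure. `nfc_deep` is a hypothetical helper name; how the real loaders walk their data is not specified here.

```python
import unicodedata


def nfc_deep(value):
    """Recursively NFC-normalize every string in nested dicts and lists."""
    if isinstance(value, str):
        return unicodedata.normalize("NFC", value)
    if isinstance(value, dict):
        return {nfc_deep(k): nfc_deep(v) for k, v in value.items()}
    if isinstance(value, list):
        return [nfc_deep(item) for item in value]
    return value  # ints, floats, bools, None pass through unchanged


# Build decomposed (NFD) input so the normalization has work to do.
decomposed_latin = "e\u0301"                              # 'e' + combining acute
decomposed_kannada = unicodedata.normalize("NFD", "ಕೋ")   # Kannada syllable, decomposed
authored = {"weekday": [decomposed_kannada], "note": decomposed_latin, "count": 3}

loaded = nfc_deep(authored)
all_nfc = all(
    unicodedata.is_normalized("NFC", s)
    for s in [loaded["note"], *loaded["weekday"], *loaded.keys()]
)
```

`unicodedata.is_normalized` is available from Python 3.8 onward.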

Total unit tests: 55 (target ≥ 35).

2. Property tests

All property tests live in DRIFTCALL/tests/data/test_properties.py using hypothesis.

#	Property	Strategy	Maps to
P1	Byte-identical re-runs of `data_export`. For the fixed seed `s = 20260425`, two invocations of `data_export.main(seed=s)` produce byte-identical `train/briefs.jsonl` + `val/briefs.jsonl` (SHA-256 hashes match).	Fixed seed + hypothesis-generated minor perturbations (run order, tmpdir path).	datasets.md §3.5 invariant #6
P2	`BriefRow` is frozen. For any `BriefRow` instance generated by `brief_row_strategy()`, assigning to any of its 13 fields raises `FrozenInstanceError`. Hypothesis enumerates field name and value type.	`st.builds(BriefRow, …)` + `st.sampled_from(fields_of(BriefRow))`.	datasets.md §3.5 invariant #1, §4.7
P3	Seed-range disjointness. For any pair `(t, v)` where `t ∈ [0, 20_000_000)` and `v ∈ [20_000_000, 20_000_500)`, `t != v` and both sets generated by the spec are disjoint. Hypothesis samples 10 000 pairs.	`st.tuples(st.integers(min_value=0, max_value=19_999_999), st.integers(min_value=20_000_000, max_value=20_000_499))`.	datasets.md §3.5 invariant #5, §2.4
P4	Canonical JSON determinism under key permutation. For any dict `d` and any permutation `d'` of its keys, `canonical_dumps(d) == canonical_dumps(d')` byte-for-byte.	`st.dictionaries(st.text(), st.one_of(st.integers(), st.text()))` + `.map(shuffle_keys)`.	datasets.md §3.1, §3.5 invariant #6
P5	NFC idempotence. For any string `s`, `nfc(nfc(s)) == nfc(s)`; `load_templates` applied twice yields the same library by hash.	`st.text(alphabet=st.characters(min_codepoint=0x0900, max_codepoint=0x0DFF))` — Devanagari + Tamil + Kannada ranges.	datasets.md §3.5 invariant #1, §7 edge 2
P6	Catalogue-hash round-trip. For any `brief_row_happy`-shaped row with a synthetic YAML `y`, `sha256(y)` computed by `load_*` equals `hashlib.sha256(y.encode("utf-8")).hexdigest()` (i.e., loader uses the same algorithm as the spec).	`st.text(alphabet=st.characters(whitelist_categories=("Ll","Lu","Nd")))`.	datasets.md §3.5 invariant #9, §4.7

Total properties: 6 (target ≥ 5).
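
To show the shape of property P4 without depending on hypothesis, the sketch below uses the standard-library `random` module as a stand-in generator. It is not the plan's actual strategy: hypothesis would also shrink failing cases, which this loop does not. `shuffle_keys` matches the name used in the strategy column, but its signature here is assumed.

```python
import json
import random


def canonical_dumps(row: dict) -> str:
    """Same canonical form as the plan: sorted keys, compact, raw UTF-8."""
    return json.dumps(row, ensure_ascii=False, sort_keys=True, separators=(",", ":"))


def shuffle_keys(d: dict, rng: random.Random) -> dict:
    """Return a dict with the same items in a random insertion order."""
    keys = list(d)
    rng.shuffle(keys)
    return {k: d[k] for k in keys}


def key_order_invariance_holds(trials: int = 200, seed: int = 0) -> bool:
    """Check canonical_dumps(d) == canonical_dumps(permuted d) on random dicts."""
    rng = random.Random(seed)
    for _ in range(trials):
        size = rng.randrange(0, 12)
        d = {f"k{rng.randrange(1000)}": rng.randrange(-10**6, 10**6) for _ in range(size)}
        if canonical_dumps(d) != canonical_dumps(shuffle_keys(d, rng)):
            return False
    return True


invariant_ok = key_order_invariance_holds()
```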

3. Integration tests

All integration tests live in DRIFTCALL/tests/data/test_integration.py. Marked @pytest.mark.integration — run by CI and by pytest -m integration locally.

#	Test name	Scenario	Maps to
I1	`test_full_data_export_writes_train_and_val_jsonl`	Invoke `training/data_export.main(--out-train, --out-val, --n-train 15000, --n-val 500, --seed 20260425)` in a tmpdir; assert both files exist, each line parses as canonical JSON, `train` has 15 000 rows, `val` has 500, and the set of `(seed)` values across both splits equals `set(train_seeds) ∪ set(val_seeds)`.	datasets.md §2.4
I2	`test_full_data_export_round_trip_hashes`	Re-run I1 a second time in a separate tmpdir; assert `sha256(train/briefs.jsonl)` and `sha256(val/briefs.jsonl)` match the first-run hashes byte-for-byte.	datasets.md §3.5 invariant #6
I3	`test_hf_upload_dry_run`	Run `hf upload <org>/driftcall-indic-briefs data/publication/ . --repo-type dataset --dry-run` (via subprocess). Assert exit 0; stdout lists exactly the files enumerated in datasets.md §2.1 publication tree; no network request fires (use `HF_HUB_OFFLINE=1`).	datasets.md §2.4
I4	`test_round_trip_load_json_dumps_load`	For every row in `train/briefs.jsonl`: `row_dict_a = json.loads(line)`; `line_b = canonical_dumps(row_dict_a)`; `row_dict_b = json.loads(line_b)`; assert `row_dict_a == row_dict_b` AND `line_b == line.rstrip("\n")`. 15 000 rows checked; fails on first discrepancy.	datasets.md §3.1, §3.5 invariant #6
I5	`test_verbatim_contamination_detector_sgd_mtop`	Build `corpus_snapshot_20260425` license cache (sqlite FTS5 over SGD + MTOP exports). For every `seed_utterance` in `train/briefs.jsonl` + `val/briefs.jsonl`, query the FTS5 index for a ≥ 10-token verbatim suffix match. Assert zero hits. Any hit → `LicenseConflictError` is raised by the CI wrapper.	datasets.md §3.4, §7 edge 3, §9.1
I6	`test_loader_cross_consistency_templates_vs_drift_patterns`	Load both libraries; assert every primary-domain pattern's `mutation` keys ⊆ union of `drift_slot_tags` across its domain's templates (the two transversal payment-auth patterns exempted).	datasets.md §3.5 invariant #4
I7	`test_loader_cross_consistency_drifts_vs_api_schemas`	For every pattern, `from_version` and `to_version` exist under `data/api_schemas/<pattern.domain>/`.	datasets.md §3.3
I8	`test_eval_load_raises_on_catalogue_hash_mismatch`	Publish a bundle with current catalogue; mutate `drifts.yaml` by one byte; invoke consumer-side `load_briefs(path)` → raises `CatalogueHashMismatchError` before any row is consumed.	datasets.md §3.5 invariant #9, §5
I9	`test_sft_generator_restart_end_to_end`	Generate 5 trajectories; `kill -9` the process after row 3 (simulated via `subprocess` + `signal.SIGKILL`); restart; assert final file has 5 rows, a single shared `generation_batch_id`, `generation_index` == `[0,1,2,3,4]` strictly, and no `PartialSFTCorpusError`.	datasets.md §4.6, §7 edge 11
I10	`test_bundle_immutability_after_publish`	Publish v1.0; attempt to re-publish without changes; assert `sha256(train/briefs.jsonl)` matches v1.0. Mutate one byte in an authored template → re-publication fails CI with `DatasetSchemaError` (version bump required).	datasets.md §3.5 invariant #10, §2.4

Total integration tests: 10.

4. Coverage target

100% line + 95% branch on:

driftcall/data/models.py (dataclass definitions — trivial to hit 100%)
driftcall/data/loaders.py (every loader, every validation branch, every error raise)
driftcall/data/errors.py (every DatasetError subclass constructed in at least one test)
training/data_export.py (seed sampling, canonical dumps, write path, disjointness assertion)
training/sft_generator.py (append + fsync, batch-id rehydration, partial-count validation, Sarvam-M error paths mocked)
scripts/build_license_cache.py (FTS5 schema DDL, 5-gram tokenizer wiring, CI read-only guard)

Branch coverage ≥ 95% — every error-mode if / raise pair is exercised. The remaining 5% allowance covers unreachable else branches defensively guarding against enum exhaustion (Python has no exhaustive-match static guarantee).

Enforced by:

python3 -m pytest tests/data/ \
  --cov=driftcall.data \
  --cov=training.data_export \
  --cov=training.sft_generator \
  --cov=scripts.build_license_cache \
  --cov-branch \
  --cov-fail-under=100 \
  --cov-report=term-missing

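Any PR that drops line coverage below 100% or branch coverage below 95% on these modules fails CI. Note that `--cov-fail-under` checks a single combined total, so with `--cov-branch` enabled the command above demands 100% across lines and branches together; the separate 95% branch threshold needs its own check against the coverage report.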

5. Fixtures

Fixtures live in DRIFTCALL/tests/data/conftest.py and are shared verbatim with evaluation_tests.md and training_tests.md. All fixtures are @pytest.fixture(scope="session") unless noted; they are pure-read and return frozen dataclasses or bytes.

5.1 `brief_row_happy`

A canonical Stage-2 airline-booking BriefRow with hinglish seed_utterance, drifted at turn 4 via airline.price_rename. Matches the JSONL example in datasets.md §8.5 exactly (its canonical_dumps(asdict(row)) equals the §8.5 golden line byte-for-byte). All three lineage hashes are pinned to the corpus_snapshot_20260425 fixture (§5.5).

5.2 `brief_row_stage3_compound`

A Stage-3 compound row (hotel + payment, bilingual code-switch between hi and en) with two drift events in drift_schedule — hotel.tnc_change at turn 3 and payment.auth_scope_upgrade at turn 6. Exercises the transversal-payment-auth-exempt branch of datasets.md §3.5 invariant #4.

5.3 `manifest_ok`

An AudioManifest built from 20 curated AudioClip rows (4 per language × 5 languages) pulled from IndicVoices-R — all source="real_indicvoices_r", every sha256 matching the on-disk WAV under tests/data/fixtures/audio/real/, every duration_s ≤ 20.0. Used to verify the happy-path of load_audio_manifest.

5.4 `manifest_with_orphan`

Same as manifest_ok but with one row whose path references kn/iv_r_kn_9999.wav (absent on disk). Drives the ChecksumMismatchError + DatasetFileMissingError error-mode tests.

5.5 `corpus_snapshot_20260425`

A byte-frozen snapshot of the entire data/ tree as of 2026-04-25 (publication seed date). Contains:

templates.yaml + its pinned sha256
i18n.yaml + its pinned sha256
drifts.yaml + its pinned sha256
All 14 api_schemas/*/*.json files
The 20-row MANIFEST.jsonl
License-cache sqlite files (.license_cache/sgd.idx, .license_cache/mtop.idx)
Pinned Apache-2.0 LICENSE SHA constant (APACHE_2_0_CANONICAL_SHA)

Loaded once per session via pytest.fixture(scope="session"). Every lineage-hash test compares against constants frozen in this fixture, so a corpus-file byte mutation anywhere under data/ causes the hash-pinning tests (U13 – U17, U27) to fail loudly — the intended canary for silent catalogue drift.

This fixture is shared verbatim with:

DRIFTCALL/docs/tests/evaluation_tests.md (consumer-side CatalogueHashMismatchError coverage)
DRIFTCALL/docs/tests/training_tests.md (GRPO warmup corpus lineage, sft_warmup/ happy-path + restart coverage)

Authors of those test plans import via from tests.data.conftest import corpus_snapshot_20260425 rather than re-deriving snapshots — single source of truth prevents divergence.

This test plan implements the full verification surface for docs/modules/datasets.md. It stays in DRAFT until ≥ 1 fresh critic returns NOTHING_FURTHER per CLAUDE.md §3.2 Batch D4. Fixtures are locked to corpus_snapshot_20260425 and shared with evaluation_tests.md + training_tests.md, so any change here must be mirrored there in the same PR.
