datasets_tests.md — Test Plan for docs/modules/datasets.md
Owner: Person B (Rewards & Tests), co-authored with Person C (Training & Data)
Target module: DRIFTCALL/docs/modules/datasets.md (final sealed)
Implements coverage for: DESIGN.md §8 (§§8.1, 8.2, 8.3, 8.4, 8.5, 8.6) and CLAUDE.md §3.1
Frameworks: pytest, hypothesis, pytest-cov
Status: DRAFT — pending ≥ 1 fresh critic round (test-plan gate is lighter per CLAUDE.md §3.2 Batch D4)
0. Scope & Non-goals
datasets.md specifies four on-disk data layers (L1 templates/i18n, L2 drift-patterns + api-schemas, L3 audio manifest, L4 SFT warmup) plus a one-shot HF Hub publication contract. Every loader is a lazy singleton that NFC-normalizes on read, validates against a frozen dataclass schema, and raises a typed DatasetError subclass on any shape / license / lineage / leak violation.
This plan covers:
- Constructibility + immutability of every frozen dataclass declared in datasets.md §4 (§4.1 – §4.7).
- Canonical JSON serialization: byte-identical output of `json.dumps(row, ensure_ascii=False, sort_keys=True, separators=(",",":"))` across Python / libc versions (datasets.md §3.1 invariant #6).
- Lineage hash triple: every `BriefRow` carries `catalogue_hash` / `templates_sha256` / `i18n_sha256`, and any mismatch at eval-load raises `CatalogueHashMismatchError` (datasets.md §3.5 invariant #9, §5).
- Size + contents invariants: `TemplateLibrary` has exactly 20 templates at v1.0, `DriftPatternLibrary` has exactly 20 patterns, `APISchemaRegistry` exactly 14 schemas over 5 domains (datasets.md §3.5 invariants #2 / #3 / #4).
- License bundle integrity: root `LICENSE` contains the full verbatim Apache-2.0 text (byte length ≥ 11 000, canonical header string present, SHA pinned in fixture), `LICENSES.md` markdown table parses (datasets.md §3.4).
- Audio manifest provenance: `AudioClip.source` only accepts the single `Literal["real_indicvoices_r"]`; the string `"synth_kokoro"` is rejected at dataclass-construction time (datasets.md §4.5).
- Publication determinism: `random.Random(20260425).sample(range(0, 20_000_000), 15_000)` is byte-identical across re-runs; val seeds are `list(range(20_000_000, 20_000_500))`; train ∩ val = ∅ (datasets.md §2.4, §3.1, §3.5 invariants #5 / #6).
- SFT restart recovery: `training/sft_generator.py` appends one canonical-JSON line + `os.fsync(fd)` per trajectory, rehydrates `generation_batch_id` on restart, emits monotonic `generation_index`, and raises `PartialSFTCorpusError` on final-count mismatch (datasets.md §4.6, §7 edge 11).
- License-cache FTS5 schema: `scripts/build_license_cache.py` produces `data/.license_cache/{sgd,mtop}.idx` with `CREATE VIRTUAL TABLE licensed_text USING fts5(chunk_text, source_id);` and 5-gram tokenization (datasets.md §9.1).
- HF dataset-card frontmatter: `README.md` YAML frontmatter parses via the HF `datasets` loader (datasets.md §8.6).
Every test below maps to one numbered clause in datasets.md. Clause references are embedded in each test docstring as datasets.md §X.Y / datasets.md §7 edge N.
1. Unit tests
All unit tests live in DRIFTCALL/tests/data/. Import surface under test:
```python
from driftcall.data.models import (
    TemplateLibrary, I18nLibrary,
    DriftPatternLibrary, APISchemaRegistry, APISchema,
    AudioManifest, AudioClip,
    SFTCorpus, SFTTrajectory,
    BriefRow,
)
from driftcall.data.loaders import (
    load_templates, load_i18n,
    load_drift_patterns, load_api_schemas,
    load_audio_manifest, load_sft_corpus,
)
from driftcall.data.errors import (
    DatasetError, DatasetFileMissingError, MalformedYAMLError, MalformedJSONError,
    DatasetSchemaError, UnknownLanguageKeyError, LicenseConflictError,
    TrainValLeakError, DriftPatternOrphanError, ChecksumMismatchError,
    UnicodeNFDError, PIIDetectedError, DuplicateDriftPatternIdError,
    CatalogueHashMismatchError, PartialSFTCorpusError,
)
from training.data_export import canonical_dumps, sample_train_seeds, val_seeds
from training.sft_generator import append_trajectory, resume_batch
from scripts.build_license_cache import build_index, FTS5_SCHEMA_DDL
```
Fixtures (§5) come from tests/data/conftest.py and tests/conftest.py.
1.1 BriefRow — frozen dataclass + 13-field contract
| # | Test name | Asserts | Maps to |
|---|---|---|---|
| U1 | `test_brief_row_has_exactly_thirteen_fields` | `len(dataclasses.fields(BriefRow)) == 13`; field names equal the ordered tuple `("episode_id","seed","stage","language","domain","template_id","goal","drift_schedule","catalogue_hash","templates_sha256","i18n_sha256","generator_version","created_ts_ist")`. | datasets.md §4.7 |
| U2 | `test_brief_row_is_frozen` | Building from `brief_row_happy` fixture, every attempted assignment (`row.seed = 7`, `row.episode_id = "x"`, etc.) raises `dataclasses.FrozenInstanceError`. Parametrized over all 13 fields. | datasets.md §3.5 invariant #1 (immutability), §4.7 |
| U3 | `test_brief_row_happy_construct_roundtrip` | `brief_row_happy` constructs; `dataclasses.asdict(row)` returns a dict with 13 keys matching the spec; all string fields are NFC. | datasets.md §4.7 |
| U4 | `test_brief_row_missing_required_field_raises` | `BriefRow()` raises `TypeError` (no defaults: every field is required). Parametrized: supplying 12 of 13 fields also raises. | datasets.md §4.7 |
| U5 | `test_brief_row_stage_literal_enforced` | `BriefRow(..., stage=4)` is statically illegal; at runtime a `DatasetSchemaError` is raised by `load_*` on a stage value ∉ {1,2,3}. | datasets.md §4.7 |
| U6 | `test_brief_row_domain_literal_enforced` | A `domain="payment"` row is rejected at load time (`BriefRow.domain` is the 4-value primary-domain literal; payment is L2-only and does not appear in publication). | datasets.md §4.7, §3.5 invariant #4 |
| U7 | `test_brief_row_created_ts_ist_must_carry_plus0530_offset` | `load_briefs("train/briefs.jsonl")` rejects a row whose `created_ts_ist` does not end in `+05:30`. | datasets.md §4.7 |
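The frozen-dataclass checks in U1, U2 and U4 can be illustrated with a minimal sketch. `MiniBriefRow` is a hypothetical three-field stand-in (the real `BriefRow` has 13 fields and lives in `driftcall.data.models`); only the standard-library behaviour shown here is assumed.

```python
import dataclasses
from dataclasses import dataclass


@dataclass(frozen=True)
class MiniBriefRow:
    """Hypothetical three-field stand-in for the 13-field BriefRow."""
    episode_id: str
    seed: int
    stage: int


row = MiniBriefRow(episode_id="ep-0001", seed=42, stage=2)

# Field order is part of the contract, so compare against an ordered tuple.
field_names = tuple(f.name for f in dataclasses.fields(MiniBriefRow))
assert field_names == ("episode_id", "seed", "stage")

# Every assignment on a frozen instance must raise FrozenInstanceError.
for name in field_names:
    try:
        setattr(row, name, None)
    except dataclasses.FrozenInstanceError:
        continue
    raise AssertionError(f"{name} was assignable on a frozen dataclass")

# No defaults: omitting any field is a TypeError at construction time.
try:
    MiniBriefRow(episode_id="ep-0002", seed=7)  # stage is missing
    missing_field_error = None
except TypeError as exc:
    missing_field_error = str(exc)
```

In the real suite the loop over field names would more naturally be a `pytest.mark.parametrize` over `dataclasses.fields(BriefRow)`.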
1.2 Canonical JSON ordering
| # | Test name | Asserts | Maps to |
|---|---|---|---|
| U8 | `test_canonical_dumps_sorts_keys` | `canonical_dumps({"b":1,"a":2}) == '{"a":2,"b":1}'`; output contains no spaces; no trailing newline. | datasets.md §3.1 canonical-JSON block |
| U9 | `test_canonical_dumps_preserves_devanagari` | `canonical_dumps({"city":"बेंगलुरु"})` contains the literal Devanagari bytes (UTF-8), not `\uXXXX` escapes. Exact bytes asserted with `==`. | datasets.md §3.1 (`ensure_ascii=False`) |
| U10 | `test_canonical_dumps_exact_separators` | The UTF-8 encoding of `canonical_dumps({"a":1,"b":2})` equals `b'{"a":1,"b":2}'` byte-for-byte; no whitespace around `,` or `:`. | datasets.md §3.1 canonical-JSON block |
| U11 | `test_canonical_dumps_brief_row_matches_golden` | `canonical_dumps(asdict(brief_row_happy))` equals the golden line in §8.5 of datasets.md byte-for-byte. | datasets.md §8.5 |
| U12 | `test_canonical_dumps_is_idempotent` | `canonical_dumps(json.loads(canonical_dumps(row))) == canonical_dumps(row)` for 100 random fixture perturbations (fuzzed via hypothesis, see §2). | datasets.md §3.1, §3.5 invariant #6 |
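A minimal sketch of the behaviour these tests pin down, assuming `canonical_dumps` is a thin wrapper around the `json.dumps` call quoted in §0. The real function in `training/data_export.py` may differ in signature or extra validation.

```python
import json


def canonical_dumps(row: dict) -> str:
    """Serialize with sorted keys, compact separators, and raw (unescaped) UTF-8 text."""
    return json.dumps(row, ensure_ascii=False, sort_keys=True, separators=(",", ":"))


# U8: keys are sorted and no whitespace is emitted.
line = canonical_dumps({"b": 1, "a": 2})

# U9: non-ASCII text is kept literally rather than escaped.
city = canonical_dumps({"city": "बेंगलुरु"})

# U12: parsing and re-serializing yields the same string.
round_trip = canonical_dumps(json.loads(city))
```

Comparing `line.encode("utf-8")` against a bytes literal, as U10 does, catches separator drift that a dict-level comparison would miss.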
1.3 Lineage hashes (catalogue_hash, templates_sha256, i18n_sha256)
| # | Test name | Asserts | Maps to |
|---|---|---|---|
| U13 | `test_catalogue_hash_matches_drifts_yaml_bytes` | `catalogue_hash == hashlib.sha256(Path("data/drift_patterns/drifts.yaml").read_bytes()).hexdigest()`; length 64 hex chars; lowercase. | datasets.md §4.7, §3.5 invariant #9 |
| U14 | `test_templates_sha256_matches_templates_yaml_bytes` | `templates_sha256 == sha256(templates.yaml)` byte-for-byte. | datasets.md §4.7 |
| U15 | `test_i18n_sha256_matches_i18n_yaml_bytes` | `i18n_sha256 == sha256(i18n.yaml)` byte-for-byte. | datasets.md §4.7 |
| U16 | `test_catalogue_hash_mismatch_raises_at_load` | Given `brief_row_happy` serialized with `catalogue_hash="deadbeef…"` (wrong), `load_briefs(path)` raises `CatalogueHashMismatchError` naming the offending field(s). | datasets.md §3.5 invariant #9, §5 (`CatalogueHashMismatchError`) |
| U17 | `test_hash_computation_is_stable_across_reruns` | Computing the three hashes twice in the same process returns identical values; computing across two subprocesses (via `subprocess.check_output`) also identical. | datasets.md §3.5 invariant #6 |
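The lineage check in U13 and U16 amounts to hashing the raw file bytes and comparing against the value stored in the row. The sketch below assumes a hypothetical helper `check_lineage` and a local stand-in for `CatalogueHashMismatchError`; neither name is taken from the loader's actual API.

```python
import hashlib
import tempfile
from pathlib import Path


class CatalogueHashMismatchError(Exception):
    """Local stand-in for driftcall.data.errors.CatalogueHashMismatchError."""


def file_sha256(path: Path) -> str:
    """Lowercase hex SHA-256 of the file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def check_lineage(row: dict, sources: dict[str, Path]) -> None:
    """Hypothetical helper: compare each recorded hash with the file on disk."""
    bad = [field for field, path in sources.items() if row[field] != file_sha256(path)]
    if bad:
        raise CatalogueHashMismatchError(f"lineage mismatch in: {', '.join(bad)}")


with tempfile.TemporaryDirectory() as tmp:
    drifts = Path(tmp) / "drifts.yaml"
    drifts.write_bytes(b"patterns: []\n")
    sources = {"catalogue_hash": drifts}
    row = {"catalogue_hash": file_sha256(drifts)}
    check_lineage(row, sources)  # hashes match, so nothing is raised

    drifts.write_bytes(b"patterns: [] \n")  # one extra byte
    try:
        check_lineage(row, sources)
        mismatch_message = None
    except CatalogueHashMismatchError as exc:
        mismatch_message = str(exc)
```

Hashing `read_bytes()` rather than decoded text keeps the result independent of newline translation and locale, which is what the cross-subprocess check in U17 relies on.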
1.4 AudioClip.source excludes synth
| # | Test name | Asserts | Maps to |
|---|---|---|---|
| U18 | `test_audio_clip_source_accepts_real_only` | `AudioClip(..., source="real_indicvoices_r", ...)` constructs; `AudioClip(..., source="synth_kokoro", ...)` raises `DatasetSchemaError` at load. | datasets.md §4.5 |
| U19 | `test_audio_manifest_rejects_synth_row` | A `MANIFEST.jsonl` line containing `"source":"synth_kokoro"` causes `load_audio_manifest` to raise `DatasetSchemaError("source must be 'real_indicvoices_r'")`. | datasets.md §4.5 |
| U20 | `test_audio_manifest_duration_upper_bound` | Loading an `AudioClip` row with `duration_s=20.01` raises `DatasetSchemaError`; `duration_s=20.00` is accepted (DESIGN.md §9 upper bound). | datasets.md §4.5 |
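Because `typing.Literal` is only a static hint, a runtime rejection of `"synth_kokoro"` needs an explicit check somewhere. One possible shape, sketched with a hypothetical two-field `MiniAudioClip` and a local stand-in for `DatasetSchemaError`, is a `__post_init__` guard; whether datasets.md places the check in the dataclass or in the loader is not something this sketch asserts.

```python
from dataclasses import dataclass
from typing import Literal, get_args

AudioSource = Literal["real_indicvoices_r"]
MAX_DURATION_S = 20.0  # upper bound quoted in U20


class DatasetSchemaError(ValueError):
    """Local stand-in for driftcall.data.errors.DatasetSchemaError."""


@dataclass(frozen=True)
class MiniAudioClip:
    """Hypothetical reduced AudioClip; the real dataclass has more fields."""
    source: AudioSource
    duration_s: float

    def __post_init__(self) -> None:
        # Literal[...] is not enforced at runtime, so check membership explicitly.
        if self.source not in get_args(AudioSource):
            raise DatasetSchemaError("source must be 'real_indicvoices_r'")
        if self.duration_s > MAX_DURATION_S:
            raise DatasetSchemaError(
                f"duration_s {self.duration_s} exceeds {MAX_DURATION_S}"
            )


def rejected(**kwargs) -> bool:
    """Return True when construction raises DatasetSchemaError."""
    try:
        MiniAudioClip(**kwargs)
    except DatasetSchemaError:
        return True
    return False


ok = MiniAudioClip(source="real_indicvoices_r", duration_s=20.00)
```

Deriving the allowed values from `get_args(AudioSource)` keeps the runtime check and the type annotation from drifting apart.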
1.5 TemplateLibrary.size == 20 at v1.0
| # | Test name | Asserts | Maps to |
|---|---|---|---|
| U21 | `test_template_library_size_is_exactly_twenty_at_v1` | `len(load_templates().templates) == 20`; `len(templates) % 5 == 0`; `generator_version.startswith("driftcall-1.0")`. | datasets.md §3.5 invariant #4, §4.1 |
| U22 | `test_template_library_four_domains_five_each` | Grouped by `template.domain`, exactly 4 primary domains are present (airline, cab, restaurant, hotel), 5 templates each. Payment is NOT a primary-domain template owner. | datasets.md §4.1, §3.5 invariant #4 |
| U23 | `test_template_library_every_language_every_template` | For every template, `set(template.language_variants.keys()) == {"hi","ta","kn","en","hinglish"}`; missing key raises `DatasetSchemaError` at load. | datasets.md §3.5 invariant #4, §7 edge 1 |
| U24 | `test_template_library_future_version_monotonic_growth` | Synthesize a mock `templates.yaml` with 25 entries tagged `generator_version="driftcall-1.1.0"`; `load_templates` accepts it (monotonic growth invariant holds: `len >= 20` and `len % 5 == 0`). | datasets.md §4.1 |
1.6 LICENSES.md schema parse + LICENSE verbatim Apache-2.0
| # | Test name | Asserts | Maps to |
|---|---|---|---|
| U25 | `test_root_license_byte_length_at_least_11000` | `len(Path("LICENSE").read_bytes()) >= 11_000`. | datasets.md §3.4 |
| U26 | `test_root_license_contains_apache_canonical_header` | The bytes `b"Apache License\n Version 2.0, January 2004"` appear at the top of `LICENSE`. | datasets.md §3.4 |
| U27 | `test_root_license_sha256_pinned` | `sha256(LICENSE bytes) == APACHE_2_0_CANONICAL_SHA` (pinned constant `8a0d778…`; exact value locked in `tests/data/fixtures/license_hashes.py`). | datasets.md §3.4 |
| U28 | `test_audio_licenses_md_embeds_full_apache_text` | `data/audio/LICENSES.md` byte length ≥ 11 000 AND contains the canonical Apache header. | datasets.md §3.4 |
| U29 | `test_sft_licenses_md_embeds_full_apache_text` | Same check for `data/sft_warmup/LICENSES.md`. | datasets.md §3.4 |
| U30 | `test_licenses_md_table_schema_parses` | The markdown table in each `LICENSES.md` parses with columns in exact order: `utterance_id` (or `trajectory_id`), `upstream_source`, `upstream_license`, `attribution_required`, `notes`; every row has 5 cells; `attribution_required ∈ {"yes","no"}`. | datasets.md §3.4 |
1.7 Seed selection — deterministic, byte-identical
| # | Test name | Asserts | Maps to |
|---|---|---|---|
| U31 | `test_train_seed_sampling_is_deterministic` | `sample_train_seeds() == random.Random(20260425).sample(range(0, 20_000_000), 15_000)`; re-running the function twice yields identical lists (element-wise equal + ordering identical). | datasets.md §2.4 |
| U32 | `test_train_seed_count_is_fifteen_thousand` | `len(sample_train_seeds()) == 15_000`; all elements in [0, 20_000_000); no duplicates. | datasets.md §2.4, §3.1 |
| U33 | `test_val_seeds_are_exact_contiguous_slice` | `val_seeds() == list(range(20_000_000, 20_000_500))`; `len == 500`; first == 20_000_000; last == 20_000_499. | datasets.md §2.4 |
| U34 | `test_train_val_disjoint` | `set(sample_train_seeds()).isdisjoint(set(val_seeds()))`; splicing the val seed 20_000_050 into the train output raises `TrainValLeakError`. | datasets.md §3.5 invariant #5, §7 edge 5 |
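The seed contract in U31 to U34 is small enough to sketch directly from the expressions quoted above. The function names mirror the import surface in §1, but the bodies here are an assumption about how `training/data_export.py` might implement them.

```python
import random

TRAIN_SEED = 20260425


def sample_train_seeds() -> list[int]:
    """Draw 15 000 distinct train seeds from [0, 20_000_000) with a private RNG."""
    return random.Random(TRAIN_SEED).sample(range(0, 20_000_000), 15_000)


def val_seeds() -> list[int]:
    """Validation seeds are a fixed contiguous slice above the train range."""
    return list(range(20_000_000, 20_000_500))


train_a = sample_train_seeds()
train_b = sample_train_seeds()
val = val_seeds()

# The ranges cannot overlap by construction, but the explicit check is cheap.
is_disjoint = set(train_a).isdisjoint(val)
```

Using a private `random.Random` instance avoids interference from the module-level RNG state. The Python documentation only guarantees cross-version reproducibility for `random()` itself, not for `sample()`, so pinning the interpreter version for publication runs is a reasonable extra precaution.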
1.8 SFT restart recovery + PartialSFTCorpusError
| # | Test name | Asserts | Maps to |
|---|---|---|---|
| U35 | `test_sft_append_one_line_fsyncs` | `append_trajectory(fd, traj)` writes exactly one canonical-JSON line ending `\n` and invokes `os.fsync(fd)` once per call (verified via `unittest.mock.patch("os.fsync")`). | datasets.md §4.6 |
| U36 | `test_sft_generation_batch_id_monotonic_within_batch` | Batch generates N=10 trajectories; all carry identical `generation_batch_id` (uuid4); `generation_index` values are [0..9] strictly monotonic and contiguous. | datasets.md §4.6 |
| U37 | `test_sft_restart_rehydrates_batch_id` | Given `trajectories.jsonl` pre-populated with 3 rows (batch_id B), call `resume_batch(path)`; returned `(batch_id, next_index) == (B, 3)`. New rows appended reuse B. | datasets.md §4.6, §7 edge 11 |
| U38 | `test_sft_partial_corpus_error_on_resume_count_mismatch` | Corpus file has 298 rows with `target_count=300` recorded in corpus metadata; `load_sft_corpus` raises `PartialSFTCorpusError("expected 300, got 298")`. | datasets.md §4.6, §5 (`PartialSFTCorpusError`), §7 edge 11 |
| U39 | `test_sft_generator_final_count_validation` | `training/sft_generator.py` run with `--target-count 5` but Sarvam-M drops 1 response → generator raises `PartialSFTCorpusError` post-loop (not silently). | datasets.md §4.6 |
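The append-and-resume behaviour in U35 to U37 can be sketched as follows. The signatures `append_trajectory(fd, traj)` and `resume_batch(path)` come from the table above; the bodies, the use of the last row to rehydrate state, and the two-key trajectory dicts are assumptions made for illustration.

```python
import json
import os
import tempfile
import uuid


def append_trajectory(fd: int, traj: dict) -> None:
    """Write one canonical-JSON line and force it to disk before returning."""
    line = json.dumps(traj, ensure_ascii=False, sort_keys=True, separators=(",", ":"))
    os.write(fd, (line + "\n").encode("utf-8"))
    os.fsync(fd)


def resume_batch(path: str) -> tuple[str, int]:
    """Return (batch_id, next_index); start a fresh batch if nothing is on disk."""
    if not os.path.exists(path):
        return str(uuid.uuid4()), 0
    with open(path, encoding="utf-8") as fh:
        rows = [json.loads(line) for line in fh if line.strip()]
    if not rows:
        return str(uuid.uuid4()), 0
    return rows[-1]["generation_batch_id"], rows[-1]["generation_index"] + 1


def generate(path: str, stop_before: int) -> tuple[str, int]:
    """Append placeholder trajectories from the resume point up to stop_before."""
    batch_id, start = resume_batch(path)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        for i in range(start, stop_before):
            append_trajectory(fd, {"generation_batch_id": batch_id, "generation_index": i})
    finally:
        os.close(fd)
    return batch_id, start


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "trajectories.jsonl")
    first_id, first_start = generate(path, stop_before=3)   # run stops after 3 rows
    second_id, second_start = generate(path, stop_before=5)  # restart finishes the batch
    with open(path, encoding="utf-8") as fh:
        indices = [json.loads(line)["generation_index"] for line in fh]
```

This sketch does not handle a torn final line left by a crash in the middle of a write; a real implementation would need to decide whether to truncate or reject such a file.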
1.9 License-cache FTS5 schema
| # | Test name | Asserts | Maps to |
|---|---|---|---|
| U40 | `test_license_cache_schema_ddl_is_exact` | `FTS5_SCHEMA_DDL == "CREATE VIRTUAL TABLE licensed_text USING fts5(chunk_text, source_id);"`, compared byte-for-byte. | datasets.md §9.1 |
| U41 | `test_license_cache_uses_5gram_tokenizer` | Run the FTS5 integrity check (`INSERT INTO licensed_text(licensed_text) VALUES('integrity-check')`) and confirm the stored CREATE VIRTUAL TABLE statement records `tokenize = 'unicode61'` with 5-gram config; `build_index(tokenizer="trigram")` raises `ValueError`. | datasets.md §9.1 |
| U42 | `test_license_cache_built_is_read_only_in_ci` | In CI mode (`DRIFTCALL_CI=1`), invoking `build_index` raises `RuntimeError("license cache is read-only in CI")`. | datasets.md §9.1 |
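As a rough illustration of how the license cache might be queried, the sketch below creates the table with the DDL from U40 in an in-memory database and looks up word-level 5-grams as FTS5 phrase queries. Treating "5-gram" as five consecutive whitespace-separated tokens, the helper names `five_grams` and `has_verbatim_hit`, and the sample sentence are all assumptions; the sketch also requires an SQLite build with FTS5 enabled.

```python
import sqlite3

FTS5_SCHEMA_DDL = "CREATE VIRTUAL TABLE licensed_text USING fts5(chunk_text, source_id);"


def five_grams(text: str) -> list[str]:
    """Split text into overlapping runs of five whitespace-separated tokens."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + 5]) for i in range(len(tokens) - 4)]


conn = sqlite3.connect(":memory:")
conn.execute(FTS5_SCHEMA_DDL)

# Made-up sentence standing in for an upstream licensed utterance.
licensed = "i would like to book a table for two people tonight"
conn.executemany(
    "INSERT INTO licensed_text (chunk_text, source_id) VALUES (?, ?)",
    [(chunk, "sgd") for chunk in five_grams(licensed)],
)


def has_verbatim_hit(candidate: str) -> bool:
    """Return True if any 5-gram of candidate appears verbatim in the index."""
    for chunk in five_grams(candidate):
        # Quote the chunk as an FTS5 phrase and restrict it to the chunk_text column.
        phrase = 'chunk_text : "' + chunk.replace('"', '""') + '"'
        hit = conn.execute(
            "SELECT 1 FROM licensed_text WHERE licensed_text MATCH ? LIMIT 1",
            (phrase,),
        ).fetchone()
        if hit:
            return True
    return False


overlap = has_verbatim_hit("please book a table for two people now")
clean = has_verbatim_hit("mujhe kal subah ki flight chahiye bangalore se")
```

The integration check in I5 asks for matches of at least 10 tokens; a 5-gram hit as used here is a looser signal and would need to be extended before it could stand in for that test.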
1.10 README.md YAML frontmatter — HF dataset loader parse
| # | Test name | Asserts | Maps to |
|---|---|---|---|
| U43 | `test_readme_frontmatter_parses_with_pyyaml` | `yaml.safe_load(frontmatter_block)` yields a dict with keys `{license, language, size_categories, task_categories, pretty_name, configs, dataset_info}`; `license == "apache-2.0"`; `language == ["hi","ta","kn","en"]`. | datasets.md §8.6 |
| U44 | `test_readme_frontmatter_loads_via_hf_datasets` | `datasets.load_dataset(str(bundle_dir))` (HF loader) returns a `DatasetDict` with `{"train","val"}` splits; `train.num_rows == 15_000`; `val.num_rows == 500`. Skipped if `datasets` not installed. | datasets.md §8.6 |
| U45 | `test_readme_frontmatter_features_flat_columns_only` | The `features` block lists only the 6 flat columns `{episode_id, seed, stage, language, domain, template_id}`; nested `goal`/`drift_schedule` are NOT pre-declared (auto-inferred). | datasets.md §8.6 |
1.11 Miscellaneous spec wiring
| # | Test name | Asserts | Maps to |
|---|---|---|---|
| U46 | `test_load_drift_patterns_count_equals_twenty` | `len(load_drift_patterns().patterns) == 20`; every `drift_type ∈ {"schema","policy","tnc","pricing","auth"}`. | datasets.md §3.5 invariant #2 |
| U47 | `test_load_api_schemas_count_equals_fourteen_across_five_domains` | `APISchemaRegistry` reports exactly 14 schemas keyed `{airline:{v1,v2,v3}, cab:{v1,v2,v3}, restaurant:{v1,v2,v3}, hotel:{v1,v2,v3}, payment:{v1,v2}}`. | datasets.md §3.5 invariant #3 |
| U48 | `test_drift_pattern_orphan_raises` | YAML with `from_version="v5"` (nonexistent) raises `DriftPatternOrphanError`. | datasets.md §5, §7 edge 6 |
| U49 | `test_duplicate_drift_pattern_id_raises` | YAML with two entries sharing `id: airline.price_rename` raises `DuplicateDriftPatternIdError` citing both line numbers. | datasets.md §5, §7 edge 10 |
| U50 | `test_nfc_normalization_applied_at_load` | A templates YAML authored with NFD Kannada weekday normalizes to NFC on load; `unicodedata.is_normalized("NFC", v)` is True for every loaded string. | datasets.md §3.5 invariant #1, §7 edge 2 |
| U51 | `test_pii_10_digit_run_raises` | An authored string containing `"9876543210"` outside IATA / timestamp contexts raises `PIIDetectedError`. | datasets.md §3.5 invariant #8, §3.1 |
| U52 | `test_license_header_missing_raises` | A YAML file without the `# SPDX-License-Identifier:` leading comment raises `DatasetSchemaError` at load. | datasets.md §3.5 invariant #7 |
| U53 | `test_audio_manifest_sha256_mismatch_raises` | Corrupt a wav byte; `load_audio_manifest` raises `ChecksumMismatchError` citing expected vs actual. | datasets.md §5, §7 edge 8 |
| U54 | `test_sft_trajectory_val_seed_raises` | `SFTTrajectory(goal_seed=20_000_042, …)` on load raises `TrainValLeakError`. | datasets.md §5, §7 edge 9 |
| U55 | `test_loader_is_singleton_per_path` | `load_templates()` twice returns the same object by identity (`is`); called with a different `path=` yields a distinct instance cached separately. | datasets.md §3.2 |
Total unit tests: 55 (target ≥ 35).
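Two of the loader behaviours above, NFC normalization on read (U50) and one cached instance per path (U55), can be sketched together. The in-memory `_RAW_FILES` mapping, the `nfc` helper, and the use of `functools.lru_cache` are assumptions for illustration; the real loaders read YAML from disk and return frozen dataclasses rather than dicts.

```python
import unicodedata
from functools import lru_cache

# Kannada "ಸೋಮವಾರ" (Monday) spelled with escapes; NFD splits the vowel sign U+0CCB.
WEEKDAY_NFC = "\u0cb8\u0ccb\u0cae\u0cb5\u0cbe\u0cb0"
WEEKDAY_NFD = unicodedata.normalize("NFD", WEEKDAY_NFC)

# Hypothetical in-memory "files" so the sketch needs no disk access.
_RAW_FILES = {
    "templates.yaml": {"weekday": WEEKDAY_NFD},
    "alt/templates.yaml": {"weekday": WEEKDAY_NFD},
}


def nfc(value):
    """Recursively NFC-normalize every string in a nested dict/list structure."""
    if isinstance(value, str):
        return unicodedata.normalize("NFC", value)
    if isinstance(value, dict):
        return {nfc(k): nfc(v) for k, v in value.items()}
    if isinstance(value, list):
        return [nfc(v) for v in value]
    return value


@lru_cache(maxsize=None)
def load_templates(path: str = "templates.yaml") -> dict:
    """Load once per distinct path argument and normalize on read."""
    return nfc(_RAW_FILES[path])


first = load_templates()
second = load_templates()
other = load_templates("alt/templates.yaml")
```

Note that `lru_cache` keys on the call arguments as written, so `load_templates()` and `load_templates("templates.yaml")` would be cached as separate entries; a real loader would probably resolve the path before caching.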
2. Property tests
All property tests live in DRIFTCALL/tests/data/test_properties.py using hypothesis.
| # | Property | Strategy | Maps to |
|---|---|---|---|
| P1 | Byte-identical re-runs of `data_export`. For the fixed seed s = 20260425, two invocations of `data_export.main(seed=s)` produce byte-identical `train/briefs.jsonl` + `val/briefs.jsonl` (SHA-256 hashes match). | Fixed seed + hypothesis-generated minor perturbations (run order, tmpdir path). | datasets.md §3.5 invariant #6 |
| P2 | `BriefRow` is frozen. For any `BriefRow` instance generated by `brief_row_strategy()`, assigning to any of its 13 fields raises `FrozenInstanceError`. Hypothesis enumerates field name and value type. | `st.builds(BriefRow, …)` + `st.sampled_from(fields_of(BriefRow))`. | datasets.md §3.5 invariant #1, §4.7 |
| P3 | Seed-range disjointness. For any pair (t, v) where t ∈ [0, 20_000_000) and v ∈ [20_000_000, 20_000_500), t != v and both sets generated by the spec are disjoint. Hypothesis samples 10 000 pairs. | `st.tuples(st.integers(min_value=0, max_value=19_999_999), st.integers(min_value=20_000_000, max_value=20_000_499))`. | datasets.md §3.5 invariant #5, §2.4 |
| P4 | Canonical JSON determinism under key permutation. For any dict d and any permutation d' of its keys, `canonical_dumps(d) == canonical_dumps(d')` byte-for-byte. | `st.dictionaries(st.text(), st.one_of(st.integers(), st.text()))` + `.map(shuffle_keys)`. | datasets.md §3.1, §3.5 invariant #6 |
| P5 | NFC idempotence. For any string s, `nfc(nfc(s)) == nfc(s)`; `load_templates` applied twice yields the same library by hash. | `st.text(alphabet=st.characters(min_codepoint=0x0900, max_codepoint=0x0DFF))` (the contiguous Indic range, which includes the Devanagari, Tamil, and Kannada blocks). | datasets.md §3.5 invariant #1, §7 edge 2 |
| P6 | Catalogue-hash round-trip. For any `brief_row_happy`-shaped row with a synthetic YAML y, `sha256(y)` computed by `load_*` equals `hashlib.sha256(y.encode("utf-8")).hexdigest()` (i.e., loader uses the same algorithm as the spec). | `st.text(alphabet=st.characters(whitelist_categories=("Ll","Lu","Nd")))`. | datasets.md §3.5 invariant #9, §4.7 |
Total properties: 6 (target ≥ 5).
3. Integration tests
All integration tests live in DRIFTCALL/tests/data/test_integration.py. Marked @pytest.mark.integration — run by CI and by pytest -m integration locally.
| # | Test name | Scenario | Maps to |
|---|---|---|---|
| I1 | `test_full_data_export_writes_train_and_val_jsonl` | Invoke `training/data_export.main(--out-train, --out-val, --n-train 15000, --n-val 500, --seed 20260425)` in a tmpdir; assert both files exist, each line parses as canonical JSON, train has 15 000 rows, val has 500, and the set of (seed) values across both splits equals `set(train_seeds) ∪ set(val_seeds)`. | datasets.md §2.4 |
| I2 | `test_full_data_export_round_trip_hashes` | Re-run I1 a second time in a separate tmpdir; assert `sha256(train/briefs.jsonl)` and `sha256(val/briefs.jsonl)` match the first-run hashes byte-for-byte. | datasets.md §3.5 invariant #6 |
| I3 | `test_hf_upload_dry_run` | Run `hf upload <org>/driftcall-indic-briefs data/publication/ . --repo-type dataset --dry-run` (via subprocess). Assert exit 0; stdout lists exactly the files enumerated in datasets.md §2.1 publication tree; no network request fires (use `HF_HUB_OFFLINE=1`). | datasets.md §2.4 |
| I4 | `test_round_trip_load_json_dumps_load` | For every row in `train/briefs.jsonl`: `row_dict_a = json.loads(line)`; `line_b = canonical_dumps(row_dict_a)`; `row_dict_b = json.loads(line_b)`; assert `row_dict_a == row_dict_b` AND `line_b == line.rstrip("\n")`. 15 000 rows checked; fails on first discrepancy. | datasets.md §3.1, §3.5 invariant #6 |
| I5 | `test_verbatim_contamination_detector_sgd_mtop` | Build `corpus_snapshot_20260425` license cache (sqlite FTS5 over SGD + MTOP exports). For every `seed_utterance` in `train/briefs.jsonl` + `val/briefs.jsonl`, query the FTS5 index for a ≥ 10-token verbatim suffix match. Assert zero hits. Any hit → `LicenseConflictError` is raised by the CI wrapper. | datasets.md §3.4, §7 edge 3, §9.1 |
| I6 | `test_loader_cross_consistency_templates_vs_drift_patterns` | Load both libraries; assert every primary-domain pattern's mutation keys ⊆ union of `drift_slot_tags` across its domain's templates (the two transversal payment-auth patterns exempted). | datasets.md §3.5 invariant #4 |
| I7 | `test_loader_cross_consistency_drifts_vs_api_schemas` | For every pattern, `from_version` and `to_version` exist under `data/api_schemas/<pattern.domain>/`. | datasets.md §3.3 |
| I8 | `test_eval_load_raises_on_catalogue_hash_mismatch` | Publish a bundle with current catalogue; mutate `drifts.yaml` by one byte; invoke consumer-side `load_briefs(path)` → raises `CatalogueHashMismatchError` before any row is consumed. | datasets.md §3.5 invariant #9, §5 |
| I9 | `test_sft_generator_restart_end_to_end` | Generate 5 trajectories; `kill -9` the process after row 3 (simulated via subprocess + `signal.SIGKILL`); restart; assert final file has 5 rows, a single shared `generation_batch_id`, `generation_index == [0,1,2,3,4]` strictly, and no `PartialSFTCorpusError`. | datasets.md §4.6, §7 edge 11 |
| I10 | `test_bundle_immutability_after_publish` | Publish v1.0; attempt to re-publish without changes; assert `sha256(train/briefs.jsonl)` matches v1.0. Mutate one byte in an authored template → re-publication fails CI with `DatasetSchemaError` (version bump required). | datasets.md §3.5 invariant #10, §2.4 |
Total integration tests: 10.
4. Coverage target
100% line + 95% branch on:
- `driftcall/data/models.py` (dataclass definitions; trivial to hit 100%)
- `driftcall/data/loaders.py` (every loader, every validation branch, every error raise)
- `driftcall/data/errors.py` (every `DatasetError` subclass constructed in at least one test)
- `training/data_export.py` (seed sampling, canonical dumps, write path, disjointness assertion)
- `training/sft_generator.py` (append + fsync, batch-id rehydration, partial-count validation, Sarvam-M error paths mocked)
- `scripts/build_license_cache.py` (FTS5 schema DDL, 5-gram tokenizer wiring, CI read-only guard)
Branch coverage ≥ 95% — every error-mode if / raise pair is exercised. The remaining 5% allowance covers unreachable else branches defensively guarding against enum exhaustion (Python has no exhaustive-match static guarantee).
Enforced by:
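```bash
# --cov-fail-under is checked against coverage.py's combined total,
# which counts both lines and branches once --cov-branch is set.
python3 -m pytest tests/data/ \
  --cov=driftcall.data \
  --cov=training.data_export \
  --cov=training.sft_generator \
  --cov=scripts.build_license_cache \
  --cov-branch \
  --cov-fail-under=100 \
  --cov-report=term-missing
```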
Any PR that drops line coverage below 100% or branch coverage below 95% on these modules fails CI.
5. Fixtures
Fixtures live in DRIFTCALL/tests/data/conftest.py and are shared verbatim with evaluation_tests.md and training_tests.md. All fixtures are @pytest.fixture(scope="session") unless noted; they are pure-read and return frozen dataclasses or bytes.
5.1 brief_row_happy
A canonical Stage-2 airline-booking BriefRow with hinglish seed_utterance, drifted at turn 4 via airline.price_rename. Matches the JSONL example in datasets.md §8.5 exactly (its canonical_dumps(asdict(row)) equals the §8.5 golden line byte-for-byte). All three lineage hashes are pinned to the corpus_snapshot_20260425 fixture (§5.5).
5.2 brief_row_stage3_compound
A Stage-3 compound row (hotel + payment, bilingual code-switch between hi and en) with two drift events in drift_schedule — hotel.tnc_change at turn 3 and payment.auth_scope_upgrade at turn 6. Exercises the transversal-payment-auth-exempt branch of datasets.md §3.5 invariant #4.
5.3 manifest_ok
An AudioManifest built from 20 curated AudioClip rows (4 per language × 5 languages) pulled from IndicVoices-R — all source="real_indicvoices_r", every sha256 matching the on-disk WAV under tests/data/fixtures/audio/real/, every duration_s ≤ 20.0. Used to verify the happy-path of load_audio_manifest.
5.4 manifest_with_orphan
Same as manifest_ok but with one row whose path references kn/iv_r_kn_9999.wav (absent on disk). Drives the ChecksumMismatchError + DatasetFileMissingError error-mode tests.
5.5 corpus_snapshot_20260425
A byte-frozen snapshot of the entire data/ tree as of 2026-04-25 (publication seed date). Contains:
- `templates.yaml` + its pinned `sha256`
- `i18n.yaml` + its pinned `sha256`
- `drifts.yaml` + its pinned `sha256`
- All 14 `api_schemas/*/*.json` files
- The 20-row `MANIFEST.jsonl`
- License-cache sqlite files (`.license_cache/sgd.idx`, `.license_cache/mtop.idx`)
- Pinned Apache-2.0 `LICENSE` SHA constant (`APACHE_2_0_CANONICAL_SHA`)
Loaded once per session via pytest.fixture(scope="session"). Every lineage-hash test compares against constants frozen in this fixture, so a corpus-file byte mutation anywhere under data/ causes the hash-pinning tests (U13 – U17, U27) to fail loudly — the intended canary for silent catalogue drift.
This fixture is shared verbatim with:
- `DRIFTCALL/docs/tests/evaluation_tests.md` (consumer-side `CatalogueHashMismatchError` coverage)
- `DRIFTCALL/docs/tests/training_tests.md` (GRPO warmup corpus lineage, `sft_warmup/` happy-path + restart coverage)
Authors of those test plans import via from tests.data.conftest import corpus_snapshot_20260425 rather than re-deriving snapshots — single source of truth prevents divergence.
This test plan implements the full verification surface for docs/modules/datasets.md. It remains a draft until at least one fresh critic returns NOTHING_FURTHER per CLAUDE.md §3.2 Batch D4. Fixtures are locked to corpus_snapshot_20260425 and shared with evaluation_tests.md + training_tests.md; any change here must be mirrored there in the same PR.