task_generator_tests — Test Plan for driftcall/task_generator.py
Module under test: driftcall/task_generator.py
Design doc: docs/modules/task_generator.md (sealed)
Cross-refs: DESIGN.md §3.1 (System Architecture), §4.1, §4.2, §8.3, §8.4, §10.3
Owner: Person B (Rewards & Tests)
Tooling: pytest, pytest-cov, hypothesis, pyyaml, unicodedata (stdlib), hashlib (stdlib)
Status: Test-plan spec — no test code yet.
This plan is the authoritative test contract for task_generator. Every behavior clause in §3 of task_generator.md maps to at least one test case below. Every exception in §5 has a raise-site test. Every invariant in §3.6 has a property test. The plan is shared with env_tests.md at the fixture layer (§5 below).
1. Unit Tests
All unit tests live in tests/test_task_generator.py, one pytest class per surface under test. Marker: @pytest.mark.unit. Fixtures are loaded from tests/fixtures/task_generator/ (see §5).
Total unit test count: 36 (≥ 25 required): U1–U30 in §§1.1–1.7 plus U34–U39 in §1.8.
1.1 Determinism — generate(seed, stage, language_weights) (5 cases)
| # | Test id | Input | Assertion |
|---|---|---|---|
| U1 | test_generate_same_seed_same_goalspec | seed=42, stage=1, W=stage_1_weights called 100 times in a loop | All 100 returned GoalSpec instances are == to the first (frozen dataclass equality). Assertion count = 99. |
| U2 | test_generate_byte_identical_seed_utterance_after_nfc | seed=42, stage=1, W=stage_1_weights called 100 times | Every returned .seed_utterance.encode("utf-8") equals the first call's bytes. Guards §3.1 determinism clause. |
| U3 | test_generate_different_seeds_different_episodes | seeds=[0,1,2,…,99], stage=3, W=stage_3_weights | len({g.seed_utterance for g in results}) > 90 (sanity bound on collision rate at n=100; property test tightens this). |
| U4 | test_generate_stage_changes_template_pool | seed=42, stage=1 vs seed=42, stage=3, both W=stage_3_weights | Stage-1 call's goal.constraints length ≤ 2 per §3.5; stage-3 call's length may be up to 3. Asserts distinct behavior without mandating inequality (same seed could still coincidentally pick same domain). |
| U5 | test_generate_returns_frozen_goalspec | Any valid call | dataclasses.is_dataclass(goal) and goal.__dataclass_params__.frozen is True. |
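U1 and U5 both rest on frozen-dataclass semantics. The sketch below illustrates those semantics with a hypothetical stand-in, GoalSpecStub; the real GoalSpec fields are defined elsewhere and are not reproduced here, so the field names are assumptions.

```python
import dataclasses
from dataclasses import FrozenInstanceError, dataclass


# Hypothetical stand-in for GoalSpec; field names are illustrative only.
@dataclass(frozen=True)
class GoalSpecStub:
    seed_utterance: str
    language: str


first = GoalSpecStub("Flight from BLR to MAA", "en")
again = GoalSpecStub("Flight from BLR to MAA", "en")

# U1-style check: dataclasses compare field by field (eq=True is the default).
assert first == again

# U5-style check: the object is a dataclass and its class is declared frozen.
assert dataclasses.is_dataclass(first)
assert first.__dataclass_params__.frozen is True

# Attribute assignment on a frozen instance raises FrozenInstanceError.
try:
    first.language = "hi"
except FrozenInstanceError:
    mutation_blocked = True
else:
    mutation_blocked = False
```

Because equality is field-wise, comparing 99 later results against the first one is enough to show that all 100 are identical.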
1.2 Stage-aware constraint counts — §3.5 table (3 cases)
| # | Test id | Input | Assertion |
|---|---|---|---|
| U6 | test_stage_1_constraint_count_leq_2 | 200 calls with stage=1, seeds=range(200), W=stage_1_weights | all(len(g.constraints) <= 2 for g in results); matches §3.5 "up to 2 constraints". |
| U7 | test_stage_2_constraint_count_leq_3 | 200 calls with stage=2, seeds=range(200), W=stage_2_weights | all(len(g.constraints) <= 3 for g in results); Stage-2 permits 2 constraints per §3.5, plus up to 1 optional-slot constraint (3 total upper bound per fixture). |
| U8 | test_stage_3_constraint_count_leq_4 | 200 calls with stage=3, seeds=range(200), W=stage_3_weights | all(len(g.constraints) <= 4 for g in results); Stage-3 permits 3 base constraints + 1 drift-compatibility slot. |
Note on upper bounds: §3.5 says "compound constraints ≤ 2/2/3 respectively". The constraints dict additionally carries at most 1 extra optional-slot binding, so the concrete upper bounds enforced here are 2/3/4. These are the numbers the fixture templates are authored to satisfy; if the fixture grows, tighten the bounds in a follow-up commit. Do not loosen them.
1.3 Language-weight sampling distribution (2 cases)
| # | Test id | Input | Assertion |
|---|---|---|---|
| U9 | test_language_weights_sampled_distribution_matches_at_n1000 | n=1000 calls with seeds=range(1000), stage=3, W={"en":0.3,"hi":0.3,"ta":0.2,"kn":0.1,"hinglish":0.1} | For each language L, let p = W[L], observed = count(g.language==L)/n. Assert abs(observed - p) < 2*sqrt(p*(1-p)/n) (±2σ binomial tolerance). Covers §3.2. |
| U10 | test_language_weights_zero_keys_never_drawn | n=500 calls with W={"en":1.0, "hi":0.0, "ta":0.0, "kn":0.0, "hinglish":0.0} | all(g.language == "en" for g in results). Zero-weight languages are never selected. |
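To make the U9 tolerance concrete, the sketch below evaluates the ±2σ band for the weights in the table at n=1000. The helper names are hypothetical and are not part of the module under test.

```python
import math


def two_sigma_tolerance(p: float, n: int) -> float:
    """Half-width of the ±2σ band for an observed binomial proportion."""
    return 2.0 * math.sqrt(p * (1.0 - p) / n)


def within_tolerance(count: int, p: float, n: int) -> bool:
    """True when count/n lies strictly inside p ± 2σ."""
    return abs(count / n - p) < two_sigma_tolerance(p, n)


weights = {"en": 0.3, "hi": 0.3, "ta": 0.2, "kn": 0.1, "hinglish": 0.1}
n = 1000

# Per-language half-widths: roughly 0.029 for p=0.3, 0.025 for p=0.2, 0.019 for p=0.1.
tolerances = {lang: two_sigma_tolerance(p, n) for lang, p in weights.items()}
```

At n=1000 a language with weight 0.3 must therefore land within about 29 draws of 300. A ±2σ band rejects a few percent of honest random samples per language, but the seeds here are fixed, so a given implementation either passes or fails the check consistently.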
1.4 Validation exceptions — §5 error-mode table (5 required, 9 provided)
| # | Test id | Trigger | Expected raise |
|---|---|---|---|
| U11 | test_invalid_language_error_on_unsupported_key | W={"hindi": 1.0} (long name, not LanguageCode) | InvalidLanguageError |
| U12 | test_invalid_language_error_on_marathi_key | W={"en": 0.5, "marathi": 0.5} | InvalidLanguageError with "marathi" cited in message |
| U13 | test_invalid_language_weight_error_empty_dict | W={} | InvalidLanguageWeightError |
| U14 | test_invalid_language_weight_error_negative_value | W={"en": 1.5, "hi": -0.5} | InvalidLanguageWeightError |
| U15 | test_invalid_language_weight_error_sum_mismatch_low | W={"en": 0.5, "hi": 0.3} (sum 0.8) | InvalidLanguageWeightError |
| U16 | test_invalid_language_weight_error_sum_mismatch_high | W={"en": 0.7, "hi": 0.5} (sum 1.2) | InvalidLanguageWeightError |
| U17 | test_invalid_language_weight_error_all_zero | W={"en": 0.0, "hi": 0.0, "ta": 0.0, "kn": 0.0, "hinglish": 0.0} | InvalidLanguageWeightError (defensive all-zero path per §3.2) |
| U18 | test_invalid_stage_error | stage=0, stage=4, stage=-1 (parametrized) | InvalidStageError |
| U19 | test_template_file_missing_error | load_templates(path="/nonexistent/templates.yaml") | TemplateFileMissingError |
The 5 "validation exceptions" required by the task map to U11 (InvalidLanguageError) + U13/U14/U15/U17 (four InvalidLanguageWeightError branches: empty / neg / sum-mismatch / all-zero). U12, U16, U18, U19 are additional coverage for the broader §5 table.
1.5 Unicode NFC assertion — §3.4, §3.6-4, §3.6-8 (5 cases)
| # | Test id | Input | Assertion |
|---|---|---|---|
| U20 | test_seed_utterance_is_nfc_for_every_language | One generate call per L ∈ {"hi","ta","kn","en","hinglish"} with single-language W | unicodedata.is_normalized("NFC", g.seed_utterance) is True for each. |
| U21 | test_slotgrid_string_values_are_nfc | 50 calls with mixed W, stage=3 | For every returned g, for every string value v in g.slots.values(): isinstance(v, str) implies unicodedata.is_normalized("NFC", v). Guards §3.6-8. |
| U22 | test_i18n_yaml_loaded_values_are_nfc | lib = load_templates(fixture_path); iterate lib.i18n | Every string in lib.i18n[lang][key] passes is_normalized("NFC", v). Guards §3.4 loader contract. |
| U23 | test_templates_yaml_variant_strings_are_nfc_post_load | lib.templates → template.language_variants | Every variant string passes is_normalized("NFC", v). Guards §3.4. |
| U24 | test_nfd_input_renormalized_to_nfc_on_load | Fixture templates_nfd.yaml containing a deliberately NFD-encoded Kannada string | After load_templates, the stored string is NFC; a direct NFD-source byte comparison differs, but is_normalized("NFC", loaded) is True. |
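The NFD-versus-NFC distinction behind U24 can be reproduced with the standard library alone. The sketch below uses the Kannada word for Chennai from the i18n fixture example; renormalize is a hypothetical stand-in for whatever the loader does internally, not the actual load_templates code.

```python
import unicodedata

# "Chennai" in Kannada, written with escapes so the code points are unambiguous.
# The final vowel sign U+0CC8 decomposes canonically into U+0CC6 U+0CD6.
nfc_text = unicodedata.normalize("NFC", "\u0c9a\u0cc6\u0ca8\u0ccd\u0ca8\u0cc8")
nfd_text = unicodedata.normalize("NFD", nfc_text)


def renormalize(value: str) -> str:
    """Hypothetical loader step: force a loaded string into NFC."""
    return unicodedata.normalize("NFC", value)


loaded = renormalize(nfd_text)

# The NFD source differs at the byte level, yet the loaded value is NFC again.
source_bytes_differ = nfd_text.encode("utf-8") != loaded.encode("utf-8")
loaded_is_nfc = unicodedata.is_normalized("NFC", loaded)
```

unicodedata.is_normalized requires Python 3.8 or newer, which matters if the test suite has to run on older interpreters.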
1.6 blake2b sub-seed domain separation — §3.1 (4 cases)
| # | Test id | Input | Assertion |
|---|---|---|---|
| U25 | test_stable_sub_seed_formula | stable_sub_seed(42, "domain") | Returns int.from_bytes(hashlib.blake2b(b"42:domain", digest_size=8).digest(), "big"), recomputed inline in the test and compared byte-exact. Pins the formula. |
| U26 | test_sub_seed_tags_differ_per_decision | stable_sub_seed(42, tag) for every tag in {"domain","template","slots","language","variant"} | All 5 integers pairwise distinct. Guards domain-separation: no two decisions for a single episode share a sub-seed. |
| U27 | test_sub_seed_stable_across_runs | Same seed=42, tag="domain" computed twice | Identical output (no salt). |
| U28 | test_sub_seed_different_seed_different_output | stable_sub_seed(42, "domain") vs stable_sub_seed(43, "domain") | Different output (with probability ~1 − 2⁻⁶⁴; treat as hard assertion, since the false-positive rate is negligible). |
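A minimal sketch of the formula that U25 pins, reconstructed from the expected value in the table. It assumes the seed is rendered as a decimal string and joined to the tag with a colon, as b"42:domain" suggests; the real function may differ in details that U25 does not exercise.

```python
import hashlib


def stable_sub_seed(seed: int, tag: str) -> int:
    """blake2b over b"<seed>:<tag>" with an 8-byte digest, read big-endian."""
    payload = f"{seed}:{tag}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(payload, digest_size=8).digest(), "big")


# One tag per sampling decision, as listed in U26.
TAGS = ("domain", "template", "slots", "language", "variant")
sub_seeds = {tag: stable_sub_seed(42, tag) for tag in TAGS}
```

Each tag yields an independent 64-bit integer, so the five decisions for one episode do not share randomness even though they start from the same episode seed.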
1.7 Structural invariants — §3.6 (2 cases)
| # | Test id | Input | Assertion |
|---|---|---|---|
| U29 | test_seed_utterance_has_no_unresolved_placeholders | 100 calls, stage=3, mixed W | For every g: re.search(r"\{[a-z_][a-z0-9_]*\}", g.seed_utterance) is None. Guards §3.6-3. |
| U30 | test_seed_utterance_length_leq_280 | 100 calls, stage=3, mixed W | all(len(g.seed_utterance) <= 280 for g in results). Guards §3.6-7 (SMS-length bound for ASR). |
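The placeholder check in U29 is a plain regex search. The sketch below applies the same pattern to one rendered and one leaky utterance; both sample strings are invented for illustration and are not generator output.

```python
import re

# Same pattern as U29: a lowercase identifier wrapped in braces.
PLACEHOLDER = re.compile(r"\{[a-z_][a-z0-9_]*\}")


def unresolved_placeholders(utterance: str) -> list[str]:
    """Return every {slot} token that survived rendering."""
    return PLACEHOLDER.findall(utterance)


# Invented sample strings, for illustration only.
rendered = "Flight from Bengaluru to Chennai on Monday, under ₹4500, morning"
leaky = "Flight from Bengaluru to {to} on Monday, under ₹{budget_inr}"

rendered_hits = unresolved_placeholders(rendered)
leaky_hits = unresolved_placeholders(leaky)
```

The pattern only matches lowercase slot names, so a token such as {From} would pass unnoticed; that is acceptable as long as template slot names stay lowercase.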
2. Property Tests (hypothesis)
Live in tests/test_task_generator_properties.py. Marker: @pytest.mark.property. All use hypothesis.settings(max_examples=...) tuned per-test.
Total property count: 6 (≥ 5 required).
P1 — Purity & Determinism
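from hypothesis import given, settings, strategies as st
from driftcall.task_generator import generate
from tests.fixtures.task_generator.strategies import language_weights_strategy

@given(seed=st.integers(min_value=0, max_value=2**62),
       stage=st.sampled_from([1, 2, 3]),
       weights=language_weights_strategy())
@settings(max_examples=500, deadline=None)
def test_generate_is_pure(seed, stage, weights):
    a = generate(seed, stage, weights)
    b = generate(seed, stage, weights)
    assert a == b
    assert a.seed_utterance == b.seed_utterance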
Shrinks to minimal failing (seed, stage, weights) on any non-determinism regression.
P2 — Unique episode_ids over procedural space
import pytest

# No @settings here: this is a plain pytest test, and Hypothesis rejects
# @settings on a test that has no @given.
@pytest.mark.slow
def test_procedural_space_uniqueness_200000():
    """Walk 200,000 distinct seeds (DESIGN.md §8.4 procedural-space cardinality).
    Assert unique GoalSpec.episode_id values under fixed stage=3 + uniform weights."""
    W = {"en": 0.2, "hi": 0.2, "ta": 0.2, "kn": 0.2, "hinglish": 0.2}
    ids = set()
    for s in range(200_000):
        g = generate(s, 3, W)
        ids.add(g.episode_id)
    assert len(ids) == 200_000
Expected runtime at ~0.5 ms per call ≈ 100 s. Marker @pytest.mark.slow; excluded from default pytest run, included in CI nightly.
P3 — Language distribution at n=10,000 (chi-square)
from collections import Counter

# Plain pytest test over a fixed seed range; no @given, so no @settings.
def test_language_distribution_chi_square_n10000():
    W = {"en": 0.3, "hi": 0.3, "ta": 0.2, "kn": 0.1, "hinglish": 0.1}
    n = 10_000
    observed = Counter(generate(s, 3, W).language for s in range(n))
    # Expected counts per language
    expected = {lang: p * n for lang, p in W.items()}
    chi2 = sum((observed[l] - expected[l])**2 / expected[l] for l in W)
    # df=4, alpha=0.001 critical value ≈ 18.47
    assert chi2 < 18.47, f"chi-square {chi2} rejects null at p<0.001"
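As a worked example of the P3 statistic, the sketch below computes chi-square for two hand-made count tables. The counts are invented for illustration and are not measured generator output; the helper name is hypothetical.

```python
def chi_square_statistic(observed: dict[str, int], weights: dict[str, float]) -> float:
    """Pearson chi-square of observed counts against weight-implied expectations.

    Every weight must be strictly positive, otherwise the expected count is zero.
    """
    n = sum(observed.values())
    return sum(
        (observed.get(lang, 0) - p * n) ** 2 / (p * n)
        for lang, p in weights.items()
    )


CRITICAL_DF4_ALPHA_001 = 18.47  # threshold quoted in P3 (df=4, alpha=0.001)

weights = {"en": 0.3, "hi": 0.3, "ta": 0.2, "kn": 0.1, "hinglish": 0.1}

# Invented counts, each table summing to 10,000.
close_counts = {"en": 3050, "hi": 2950, "ta": 2010, "kn": 990, "hinglish": 1000}
skewed_counts = {"en": 3300, "hi": 2700, "ta": 2000, "kn": 1000, "hinglish": 1000}

close_chi2 = chi_square_statistic(close_counts, weights)    # about 1.82
skewed_chi2 = chi_square_statistic(skewed_counts, weights)  # about 60
```

A drift of 50 draws on the two 30% languages stays far below the threshold, while a drift of 300 draws exceeds it by a wide margin. This is the sensitivity the n=10,000 test aims for.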
P4 — Stage monotonicity of template pool
@given(seed=st.integers(min_value=0, max_value=10_000))
@settings(max_examples=200, deadline=None)
def test_stage_template_pool_monotone(seed):
    """Stage 3 template pool ⊇ Stage 2 pool ⊇ Stage 1 pool (§3.5)."""
    W = {"en": 1.0, "hi": 0.0, "ta": 0.0, "kn": 0.0, "hinglish": 0.0}
    # English-only weights ensure language sampling doesn't shift the template branch.
    t1 = generate(seed, 1, W).constraints
    # Constraint-count invariant must hold irrespective of seed
    assert len(t1) <= 2
P5 — NFC closure under all inputs
@given(seed=st.integers(min_value=0, max_value=2**62),
       stage=st.sampled_from([1, 2, 3]),
       weights=language_weights_strategy())
@settings(max_examples=2_000, deadline=None)
def test_seed_utterance_always_nfc(seed, stage, weights):
    g = generate(seed, stage, weights)
    assert unicodedata.is_normalized("NFC", g.seed_utterance)
    for v in g.slots.values():
        if isinstance(v, str):
            assert unicodedata.is_normalized("NFC", v)
P6 — Budget bounded by template declaration
@given(seed=st.integers(min_value=0, max_value=10_000),
       stage=st.sampled_from([1, 2, 3]))
@settings(max_examples=1_000, deadline=None)
def test_budget_within_declared_range(seed, stage):
    W = {"en": 1.0, "hi": 0.0, "ta": 0.0, "kn": 0.0, "hinglish": 0.0}
    g = generate(seed, stage, W)
    if "budget_inr" in g.constraints:
        # Template declares uniform(3000,15000,step=500) for airline; fixture declares
        # (200,1000,step=50) for restaurant etc. Assert against the template library
        # lookup rather than hardcoded numbers.
        tmpl = _lookup_template_for_test(g.template_id)
        budget = tmpl.constraints_template["budget_inr"]
        low, high = budget.low, budget.high
        assert low <= g.constraints["budget_inr"] <= high
hypothesis strategies (fixture module tests/fixtures/task_generator/strategies.py):
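from hypothesis import assume, strategies as st

def language_weights_strategy():
    """Return a SearchStrategy of dict[LanguageCode, float] with sum=1.0±1e-7 and all >=0."""
    langs = ["hi", "ta", "kn", "en", "hinglish"]

    @st.composite
    def _impl(draw):
        raw = [draw(st.floats(min_value=0.0, max_value=1.0, allow_nan=False)) for _ in langs]
        total = sum(raw)
        # An all-zero draw cannot be normalized and would hit the all-zero
        # InvalidLanguageWeightError path (U17), so discard it.
        assume(total > 0.0)
        return {l: r / total for l, r in zip(langs, raw)}

    return _impl()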
3. Integration Tests
Live in tests/test_task_generator_integration.py. Marker: @pytest.mark.integration. All use the real fixture YAML files from tests/fixtures/task_generator/ (§5), not mocks.
I1 — Load real fixtures and validate shape
def test_load_templates_from_fixture():
    # File name per §5.1 and the §6 layout.
    lib = load_templates(FIXTURE_DIR / "templates_fixture.yaml")
    assert isinstance(lib, TemplateLibrary)
    assert len({t.domain for t in lib.templates}) == 4  # airline, cab, restaurant, hotel
    assert len(lib.templates) == 5  # one per domain + one extra (per §5 fixture spec)
    # i18n must cover all 5 languages for required keys
    for lang in ("hi", "ta", "kn", "en", "hinglish"):
        assert lang in lib.i18n
I2 — Generate 100 briefs, assert valid_goal_spec() invariants
Shared fixture from models_tests.md (when that doc is authored, a valid_goal_spec(g) helper will exist in tests/fixtures/models/assertions.py). Until then, this test imports the placeholder valid_goal_spec and asserts:
def test_100_briefs_pass_goal_spec_invariants():
    """End-to-end: 100 seeds × stage=3 × mixed weights → every GoalSpec passes
    the canonical invariant suite from models_tests.md."""
    from tests.fixtures.models.assertions import valid_goal_spec
    W = {"en": 0.3, "hi": 0.3, "ta": 0.2, "kn": 0.1, "hinglish": 0.1}
    for s in range(100):
        g = generate(seed=s, stage=3, language_weights=W)
        valid_goal_spec(g)  # raises AssertionError on any invariant break
Invariants enforced by valid_goal_spec (contract carried in models_tests.md):
- g is a frozen dataclass instance of GoalSpec.
- g.domain ∈ {"airline","cab","restaurant","hotel"}.
- g.language ∈ {"hi","ta","kn","en","hinglish"}.
- unicodedata.is_normalized("NFC", g.seed_utterance).
- len(g.seed_utterance) <= 280.
- No unresolved {slot} in g.seed_utterance.
- g.slots keys ⊇ template's required_slots.
- Every numeric in g.constraints is finite and within [low, high] of its template binding.
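Until the shared helper exists, the template-independent part of this contract can be sketched as follows. GoalSpecStub and check_goal_spec are hypothetical names; the two template-dependent invariants (required_slots coverage and the [low, high] range) are left out because they need the template library.

```python
import math
import re
import unicodedata
from dataclasses import dataclass, field, is_dataclass

DOMAINS = {"airline", "cab", "restaurant", "hotel"}
LANGUAGES = {"hi", "ta", "kn", "en", "hinglish"}
PLACEHOLDER = re.compile(r"\{[a-z_][a-z0-9_]*\}")


# Hypothetical stand-in; the real GoalSpec is defined by the models module.
@dataclass(frozen=True)
class GoalSpecStub:
    domain: str
    language: str
    seed_utterance: str
    slots: dict = field(default_factory=dict)
    constraints: dict = field(default_factory=dict)


def check_goal_spec(g) -> None:
    """Raise AssertionError on the first broken invariant (template-free subset)."""
    assert is_dataclass(g) and g.__dataclass_params__.frozen
    assert g.domain in DOMAINS
    assert g.language in LANGUAGES
    assert unicodedata.is_normalized("NFC", g.seed_utterance)
    assert len(g.seed_utterance) <= 280
    assert PLACEHOLDER.search(g.seed_utterance) is None
    for value in g.constraints.values():
        # bool is a subclass of int, so exclude it from the numeric check.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            assert math.isfinite(value)


good = GoalSpecStub("airline", "en", "Flight from Bengaluru to Chennai",
                    {"from": "BLR"}, {"budget_inr": 4500})
bad = GoalSpecStub("airline", "en", "Flight from {from} to Chennai")
```

Calling check_goal_spec(good) returns silently, while check_goal_spec(bad) raises on the unresolved {from} placeholder.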
I3 — enumerate_variants yields deterministic stable order
def test_enumerate_variants_stable_order():
    W = {"en": 0.2, "hi": 0.2, "ta": 0.2, "kn": 0.2, "hinglish": 0.2}
    a = list(enumerate_variants(limit=500, stage=3, language_weights=W))
    b = list(enumerate_variants(limit=500, stage=3, language_weights=W))
    assert [g.episode_id for g in a] == [g.episode_id for g in b]
I4 — Cross-language Indic script isolation
@pytest.mark.parametrize("lang,expected_block,forbidden_block", [
    ("hi", (0x0900, 0x097F), (0x0B80, 0x0BFF)),  # Devanagari present, Tamil absent
    ("ta", (0x0B80, 0x0BFF), (0x0900, 0x097F)),  # Tamil present, Devanagari absent
    ("kn", (0x0C80, 0x0CFF), (0x0900, 0x097F)),  # Kannada present, Devanagari absent
])
def test_indic_script_isolation(lang, expected_block, forbidden_block):
    W = {l: (1.0 if l == lang else 0.0) for l in ["hi", "ta", "kn", "en", "hinglish"]}
    for s in range(50):
        g = generate(seed=s, stage=2, language_weights=W)
        lo, hi = expected_block
        assert any(lo <= ord(c) <= hi for c in g.seed_utterance), \
            f"no {lang} codepoints in utterance {g.seed_utterance!r}"
        fo, fh = forbidden_block
        # Slot values are localized through i18n lookups scoped to the sampled
        # language, so no forbidden-block codepoint may appear anywhere in the
        # rendered utterance (for ta/kn this means no Devanagari at all).
        assert not any(fo <= ord(c) <= fh for c in g.seed_utterance), \
            f"forbidden block leaked into {lang} utterance {g.seed_utterance!r}"
I5 — Hinglish is Roman-only (no Devanagari leakage)
def test_hinglish_never_contains_devanagari():
    W = {"hinglish": 1.0, "hi": 0.0, "ta": 0.0, "kn": 0.0, "en": 0.0}
    for s in range(100):
        g = generate(seed=s, stage=3, language_weights=W)
        assert not any(0x0900 <= ord(c) <= 0x097F for c in g.seed_utterance)
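I4 and I5 both reduce to a codepoint-range scan. A small sketch of that scan follows, using strings from the fixture examples in §5; has_block is a hypothetical helper name.

```python
# Unicode block ranges used by I4 and I5.
DEVANAGARI = (0x0900, 0x097F)
TAMIL = (0x0B80, 0x0BFF)
KANNADA = (0x0C80, 0x0CFF)


def has_block(text: str, block: tuple[int, int]) -> bool:
    """True when any character of text falls inside the inclusive block range."""
    lo, hi = block
    return any(lo <= ord(ch) <= hi for ch in text)


hindi = "सोमवार"        # Monday (hi), from the i18n fixture example
tamil = "சென்னை"        # Chennai (ta)
kannada = "ಬೆಂಗಳೂರು"    # Bengaluru (kn)
hinglish = "Bhai Monday ko Bengaluru se Chennai, 4500 rupees max"
```

The rupee sign ₹ is U+20B9, which lies in the Currency Symbols block, so templates that embed it do not trip any of these script checks.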
4. Coverage Target
| Metric | Target |
|---|---|
| Line coverage on driftcall/task_generator.py | 100% |
| Branch coverage on driftcall/task_generator.py | ≥ 95% |
| Every exception raise site from §5 of task_generator.md | covered by ≥ 1 unit test |
| NFC normalization check on _format_utterance output | runs on all 5 languages (U20) |
Enforcement:
python3 -m pytest tests/test_task_generator.py tests/test_task_generator_properties.py \
    tests/test_task_generator_integration.py \
    --cov=driftcall.task_generator \
    --cov-branch \
    --cov-fail-under=95 \
    --cov-report=term-missing
Exception raise-site coverage matrix (all 9 exception types from task_generator.md §5; InvalidLanguageWeightError is listed once per branch):
| Exception | Raise site (per §5) | Covering test |
|---|---|---|
| MissingSlotError | _format_utterance when {X} unbound | U34* (see §1.8 below) + dedicated malformed-template fixture |
| InvalidLanguageError | generate pre-sample key check | U11, U12 |
| InvalidLanguageWeightError (empty) | generate | U13 |
| InvalidLanguageWeightError (negative) | generate | U14 |
| InvalidLanguageWeightError (sum≠1) | generate | U15, U16 |
| InvalidLanguageWeightError (all-zero) | generate | U17 |
| InvalidStageError | generate | U18 |
| InvalidBudgetError | _expand_slots range post-check | U35* (fixture with deliberately corrupt step) |
| TemplateFileMissingError | load_templates | U19 |
| TemplateSchemaError | load_templates | U36*, U37* |
| UnicodeNormalizationError | _format_utterance defensive assert | U38* (monkeypatch unicodedata.is_normalized to return False) |
| NoVariantForLanguageError | _format_utterance missing variant | U39* (malformed fixture) |
*U34–U39 are additional malformed-fixture raise-site tests, counted in the §1 grand total of 36. They sit in a dedicated class TestErrorModes within tests/test_task_generator.py.
1.8 Malformed-fixture raise-site tests — appended to §1
(Appended here so the §1 count of 36 reflects all tests that live in the unit file.)
- U34 test_missing_slot_error: fixture templates_missing_slot.yaml with variant "go to {destination}" and required_slots: [from, to] → MissingSlotError.
- U35 test_invalid_budget_error_from_step_misalignment: inject a patched template whose step divides unevenly (low=100, high=250, step=70) via a _library_override test hook; generate forces _expand_slots to produce 240, then validates against the declared range → InvalidBudgetError.
- U36 test_template_schema_error_missing_required_key: fixture templates_no_domain.yaml → TemplateSchemaError on load.
- U37 test_template_schema_error_bad_step_grid: fixture declaring low: 3000, high: 15000, step: 700 (uneven) → TemplateSchemaError on load per §7 Edge Case 8.
- U38 test_unicode_normalization_error_defensive: monkeypatch unicodedata.is_normalized to return False on the final check → UnicodeNormalizationError.
- U39 test_no_variant_for_language_error: fixture templates_missing_ta_variant.yaml declaring no Tamil variants; call with W={"ta":1.0,…} → NoVariantForLanguageError.
Revised §1 total: 36 unit test cases (30 for U1–U30 in §§1.1–1.7, plus 6 for U34–U39 in the §1.8 malformed-fixture suite).
Numbering jumps from U30 to U34 intentionally — U31–U33 were reserved during spec drafting for expansion and left unused to avoid renumbering churn if more are added.
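The "uneven step" condition in U35 and U37 can be expressed as a divisibility check. The sketch below assumes that uneven means (high - low) is not a multiple of step; the authoritative rule is §7 Edge Case 8 of the design doc, and both helper names are hypothetical.

```python
def step_grid_is_even(low: int, high: int, step: int) -> bool:
    """True when high is reachable from low in a whole number of steps."""
    return step > 0 and low <= high and (high - low) % step == 0


def grid_values(low: int, high: int, step: int) -> list[int]:
    """All values low, low+step, ... that do not exceed high."""
    return list(range(low, high + 1, step))


airline_ok = step_grid_is_even(3000, 15000, 500)    # fixture template grid
u37_uneven = step_grid_is_even(3000, 15000, 700)    # 12000 % 700 == 100
u35_uneven = step_grid_is_even(100, 250, 70)        # 150 % 70 == 10
u35_grid = grid_values(100, 250, 70)                # [100, 170, 240]
```

For the U35 parameters the largest grid value is 240, which matches the value the test description forces _expand_slots to produce.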
5. Fixtures
All fixtures live in tests/fixtures/task_generator/ and are shared with env_tests.md (the env test plan imports the same YAML files to drive DriftCallEnv.reset() integration tests).
5.1 Template fixture
File: tests/fixtures/task_generator/templates_fixture.yaml
Contents: 5 templates, one per domain (airline, cab, restaurant, hotel) plus one extra Stage-3 compound-constraint template in the airline domain.
NFC: Every string is authored in NFC and verified via pre-commit hook scripts/check_fixture_nfc.py (runs is_normalized("NFC", v) across every string leaf).
Example shape (airline template):
- template_id: airline.book.fixture_v1
  domain: airline
  intent: book_flight
  min_stage: 1
  required_slots: [from, to, when]
  optional_slots: [seat_pref]
  constraints_template:
    budget_inr: {distribution: uniform, low: 3000, high: 15000, step: 500}
    time_window: {choices: [morning, afternoon, evening, late_night]}
  drift_slot_tags: [price, total_fare_inr]
  language_variants:
    hinglish: ["Bhai {when} ko {from} se {to}, {budget_inr} rupees max, {time_window}"]
    hi: ["{when} को {from} से {to}, ₹{budget_inr} से कम, {time_window}"]
    ta: ["{when} அன்று {from} லிருந்து {to}, ₹{budget_inr} கீழ், {time_window}"]
    kn: ["{when} ರಂದು {from} ಇಂದ {to}, ₹{budget_inr} ಒಳಗೆ, {time_window}"]
    en: ["Flight from {from} to {to} on {when}, under ₹{budget_inr}, {time_window}"]
The full fixture carries all 5 templates: airline.book.fixture_v1 (shown above), cab.ride.fixture_v1, restaurant.order.fixture_v1, hotel.book.fixture_v1, and airline.book.compound_v1 (Stage-3 compound).
5.2 i18n fixture
File: tests/fixtures/task_generator/i18n_fixture.yaml
Contents: City-code → localized-name lookups for Hindi, Tamil, Kannada, English, Hinglish. Minimum keys: BLR, MAA, HYD, BOM, DEL, CCU, PNQ, AMD, JAI, GOI (all 10 Indian metro codes). Weekday names in each language. Domain-specific nouns (dish names for restaurant, room types for hotel).
NFC verification is part of the test U22 and the pre-commit hook above.
Example:
hi:
  cities:
    BLR: "बेंगलुरु"
    MAA: "चेन्नई"
    HYD: "हैदराबाद"
  weekdays:
    monday: "सोमवार"
ta:
  cities:
    BLR: "பெங்களூரு"
    MAA: "சென்னை"
  weekdays:
    monday: "திங்கட்கிழமை"
kn:
  cities:
    BLR: "ಬೆಂಗಳೂರು"
    MAA: "ಚೆನ್ನೈ"
  weekdays:
    monday: "ಸೋಮವಾರ"
en:
  cities:
    BLR: "Bengaluru"
hinglish:
  cities:
    BLR: "Bengaluru"
5.3 Stage-weight fixtures
Python-module fixtures exported from tests/fixtures/task_generator/weights.py:
# Matches DESIGN.md §10.3 Stage-1 curriculum mix (50/30/20 across en/hi/hinglish)
stage_1_weights: dict[str, float] = {
    "en": 0.50, "hi": 0.30, "hinglish": 0.20, "ta": 0.00, "kn": 0.00,
}
# Stage-2 broadens to all 5 languages with 30/30/20/10/10
stage_2_weights: dict[str, float] = {
    "en": 0.30, "hi": 0.30, "hinglish": 0.20, "ta": 0.10, "kn": 0.10,
}
# Stage-3 same distribution; stage differs only in template pool + drift schedule
stage_3_weights: dict[str, float] = {
    "en": 0.30, "hi": 0.30, "hinglish": 0.20, "ta": 0.10, "kn": 0.10,
}
Each dict sums to exactly 1.00 under IEEE-754 double-precision (verified in a conftest.py sanity check).
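One way the conftest.py sanity check could look is sketched below. The helper name weights_are_valid is hypothetical, and the sketch compares the sum with a small tolerance rather than strict equality so that it stays robust if keys are reordered; swap in == 1.0 if the exact-sum guarantee is what needs pinning.

```python
import math

stage_1_weights = {"en": 0.50, "hi": 0.30, "hinglish": 0.20, "ta": 0.00, "kn": 0.00}
stage_2_weights = {"en": 0.30, "hi": 0.30, "hinglish": 0.20, "ta": 0.10, "kn": 0.10}
stage_3_weights = {"en": 0.30, "hi": 0.30, "hinglish": 0.20, "ta": 0.10, "kn": 0.10}


def weights_are_valid(weights: dict[str, float], tol: float = 1e-9) -> bool:
    """Non-empty, no negative entries, and a sum within tol of 1.0."""
    return (
        len(weights) > 0
        and all(w >= 0.0 for w in weights.values())
        and math.isclose(sum(weights.values()), 1.0, abs_tol=tol)
    )


ALL_STAGE_WEIGHTS = (stage_1_weights, stage_2_weights, stage_3_weights)
all_valid = all(weights_are_valid(w) for w in ALL_STAGE_WEIGHTS)
```

The same predicate rejects the malformed inputs from U13 to U16 (empty, negative, and sum-mismatch dicts), which makes it a convenient cross-check on the weight fixtures.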
5.4 Malformed fixtures (error-mode coverage only)
Distinct YAML files, each authored to trigger exactly one exception. They live in tests/fixtures/task_generator/malformed/:
| File | Purpose |
|---|---|
| templates_missing_slot.yaml | triggers MissingSlotError (U34) |
| templates_no_domain.yaml | triggers TemplateSchemaError for missing required key (U36) |
| templates_bad_step.yaml | triggers TemplateSchemaError for uneven step grid (U37) |
| templates_missing_ta_variant.yaml | triggers NoVariantForLanguageError (U39) |
| templates_nfd.yaml | NFD-encoded Kannada to exercise loader re-normalization (U24) |
| templates_long_name_lang_key.yaml | uses "hindi" as a language key to trigger schema rejection per §4.1 |
5.5 Shared-fixture contract with env_tests.md
env_tests.md (authored in the same Batch D4) imports templates_fixture.yaml, i18n_fixture.yaml, and all three stage_N_weights from this directory. The env test plan exercises DriftCallEnv.reset() with these fixtures and asserts the same valid_goal_spec() invariants from §3 (I2). Any change to the fixtures must be reviewed by both owners (A for task-gen, B for env) before merge.
6. Appendix — Test File Layout
tests/
├── conftest.py                          # pytest-wide fixtures (paths, weights)
├── test_task_generator.py               # §1 unit tests (U1–U30, U34–U39)
├── test_task_generator_properties.py    # §2 property tests (P1–P6)
├── test_task_generator_integration.py   # §3 integration tests (I1–I5)
└── fixtures/
    ├── models/
    │   └── assertions.py                # valid_goal_spec() helper (cross-doc)
    └── task_generator/
        ├── strategies.py                # hypothesis strategies
        ├── weights.py                   # stage_1/2/3_weights
        ├── templates_fixture.yaml
        ├── i18n_fixture.yaml
        └── malformed/
            ├── templates_missing_slot.yaml
            ├── templates_no_domain.yaml
            ├── templates_bad_step.yaml
            ├── templates_missing_ta_variant.yaml
            ├── templates_nfd.yaml
            └── templates_long_name_lang_key.yaml
7. Sanity Checks (for the implementer)
Before declaring task_generator.py done:
- pytest tests/test_task_generator.py -v: all 36 unit tests pass (U1–U30 plus U34–U39).
- pytest tests/test_task_generator_properties.py -v: all 6 properties pass (including the 200,000-seed walk under -m slow).
- pytest tests/test_task_generator_integration.py -v: all 5 integration tests pass against real YAML fixtures.
- pytest --cov=driftcall.task_generator --cov-branch --cov-fail-under=95: 100% line, ≥ 95% branch.
- scripts/check_fixture_nfc.py: NFC hook green on every YAML leaf.
- ruff check tests/test_task_generator*.py: clean.
- mypy --strict tests/test_task_generator*.py: clean (test code is type-checked too).
When all green, dispatch ≥ 2 fresh critic agents per CLAUDE.md §3.4. Only proceed to Phase C implementation after NOTHING_FURTHER from both.