# task_generator_tests — Test Plan for `driftcall/task_generator.py`

**Module under test:** `driftcall/task_generator.py`
**Design doc:** `docs/modules/task_generator.md` (sealed)
**Cross-refs:** DESIGN.md §3.1 (System Architecture), §4.1, §4.2, §8.3, §8.4, §10.3
**Owner:** Person B (Rewards & Tests)
**Tooling:** `pytest`, `pytest-cov`, `hypothesis`, `pyyaml`, `unicodedata` (stdlib), `hashlib` (stdlib)
**Status:** Test-plan spec — no test code yet.

This plan is the authoritative test contract for `task_generator`. Every behavior clause in §3 of `task_generator.md` maps to at least one test case below. Every exception in §5 has a raise-site test. Every invariant in §3.6 has a property test. The plan is shared with `env_tests.md` at the fixture layer (§5 below).

---

## 1. Unit Tests

All unit tests live in `tests/test_task_generator.py`, one `pytest` class per surface under test. Marker: `@pytest.mark.unit`. Fixtures are loaded from `tests/fixtures/task_generator/` (see §5).

**Total unit test count: 36** (U1–U30 in §§1.1–1.7 plus U34–U39 in §1.8; ≥ 25 required).

### 1.1 Determinism — `generate(seed, stage, language_weights)` (5 cases)

| # | Test id | Input | Assertion |
|---|---|---|---|
| U1 | `test_generate_same_seed_same_goalspec` | `seed=42, stage=1, W=stage_1_weights` called 100 times in a loop | All 100 returned `GoalSpec` instances are `==` to the first (frozen dataclass equality). `assertion count = 99`. |
| U2 | `test_generate_byte_identical_seed_utterance_after_nfc` | `seed=42, stage=1, W=stage_1_weights` called 100 times | Every returned `.seed_utterance.encode("utf-8")` equals the first call's bytes. Guards §3.1 determinism clause. |
| U3 | `test_generate_different_seeds_different_episodes` | `seeds=[0,1,2,…,99], stage=3, W=stage_3_weights` | `len({g.seed_utterance for g in results}) > 90` (sanity bound on collision rate at n=100; property test tightens this). |
| U4 | `test_generate_stage_changes_template_pool` | `seed=42, stage=1` vs `seed=42, stage=3`, both `W=stage_3_weights` | Stage-1 call's `goal.constraints` length ≤ 2 per §3.5; stage-3 call's length may be up to 3. Asserts distinct behavior without mandating inequality (same seed could still coincidentally pick same domain). |
| U5 | `test_generate_returns_frozen_goalspec` | Any valid call | `dataclasses.is_dataclass(goal) and goal.__dataclass_params__.frozen is True`. |
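
The equality and frozen checks in U1 and U5 can be sketched as follows. `DemoGoal` is a hypothetical stand-in with made-up fields, not the real `GoalSpec`; only the shape of the assertions is meant to carry over.

```python
import dataclasses


# Hypothetical stand-in for GoalSpec; the real fields live in the models module.
@dataclasses.dataclass(frozen=True)
class DemoGoal:
    seed_utterance: str
    language: str


def is_frozen_dataclass(obj: object) -> bool:
    """U5-style check: obj is a dataclass instance declared with frozen=True."""
    return (
        dataclasses.is_dataclass(obj)
        and not isinstance(obj, type)
        and type(obj).__dataclass_params__.frozen
    )


# Two instances built from the same field values compare equal (U1-style check).
a = DemoGoal("Flight from BLR to DEL", "en")
b = DemoGoal("Flight from BLR to DEL", "en")
```

Field-wise `==` is what lets U1 compare 100 results against the first without a custom comparator; assigning to a field of a frozen instance raises `dataclasses.FrozenInstanceError`.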

### 1.2 Stage-aware constraint counts — §3.5 table (3 cases)

| # | Test id | Input | Assertion |
|---|---|---|---|
| U6 | `test_stage_1_constraint_count_leq_2` | 200 calls with `stage=1, seeds=range(200), W=stage_1_weights` | `all(len(g.constraints) <= 2 for g in results)` — matches §3.5 "up to 2 constraints". |
| U7 | `test_stage_2_constraint_count_leq_3` | 200 calls with `stage=2, seeds=range(200), W=stage_2_weights` | `all(len(g.constraints) <= 3 for g in results)` — Stage-2 permits 2 constraints per §3.5, plus up to 1 optional-slot constraint (3 total upper bound per fixture). |
| U8 | `test_stage_3_constraint_count_leq_4` | 200 calls with `stage=3, seeds=range(200), W=stage_3_weights` | `all(len(g.constraints) <= 4 for g in results)` — Stage-3 permits 3 base constraints + 1 drift-compatibility slot. |

> Note on upper bounds: §3.5 says "compound constraints ≤ 2/2/3 respectively". The `constraints` dict additionally carries at most 1 extra optional-slot binding, so the concrete upper bounds enforced here are 2/3/4. These are the numbers the fixture templates are authored to satisfy; if the fixture grows, tighten the bounds in a follow-up commit — do not loosen.

### 1.3 Language-weight sampling distribution (2 cases)

| # | Test id | Input | Assertion |
|---|---|---|---|
| U9 | `test_language_weights_sampled_distribution_matches_at_n1000` | `n=1000` calls with `seeds=range(1000), stage=3, W={"en":0.3,"hi":0.3,"ta":0.2,"kn":0.1,"hinglish":0.1}` | For each language `L`, let `p = W[L]`, `observed = count(g.language==L)/n`. Assert `abs(observed - p) < 2*sqrt(p*(1-p)/n)` (±2σ binomial tolerance). Covers §3.2. |
| U10 | `test_language_weights_zero_keys_never_drawn` | `n=500` calls with `W={"en":1.0, "hi":0.0, "ta":0.0, "kn":0.0, "hinglish":0.0}` | `all(g.language == "en" for g in results)`. Zero-weight languages are never selected. |
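
A minimal sketch of the ±2σ tolerance used by U9. The helper name `within_two_sigma` is an assumption made for illustration; the plan does not prescribe one.

```python
import math


def within_two_sigma(count: int, n: int, p: float) -> bool:
    """True when the observed share count/n lies within 2 sigma of p (binomial sigma)."""
    sigma = math.sqrt(p * (1.0 - p) / n)
    return abs(count / n - p) < 2.0 * sigma


# For n=1000 and p=0.3, sigma is about 0.0145, so the accepted band is roughly 0.271 to 0.329.
ok_at_320 = within_two_sigma(320, 1000, 0.3)
ok_at_340 = within_two_sigma(340, 1000, 0.3)
```

Because U9 uses fixed seeds (`range(1000)`), the outcome is deterministic: for a given implementation the check either always passes or always fails, so a failure points at a sampling change rather than at run-to-run noise.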

### 1.4 Validation exceptions — §5 error-mode table (5 required, 9 provided)

| # | Test id | Trigger | Expected raise |
|---|---|---|---|
| U11 | `test_invalid_language_error_on_unsupported_key` | `W={"hindi": 1.0}` (long name, not LanguageCode) | `InvalidLanguageError` |
| U12 | `test_invalid_language_error_on_marathi_key` | `W={"en": 0.5, "marathi": 0.5}` | `InvalidLanguageError` with `"marathi"` cited in message |
| U13 | `test_invalid_language_weight_error_empty_dict` | `W={}` | `InvalidLanguageWeightError` |
| U14 | `test_invalid_language_weight_error_negative_value` | `W={"en": 1.5, "hi": -0.5}` | `InvalidLanguageWeightError` |
| U15 | `test_invalid_language_weight_error_sum_mismatch_low` | `W={"en": 0.5, "hi": 0.3}` (sum 0.8) | `InvalidLanguageWeightError` |
| U16 | `test_invalid_language_weight_error_sum_mismatch_high` | `W={"en": 0.7, "hi": 0.5}` (sum 1.2) | `InvalidLanguageWeightError` |
| U17 | `test_invalid_language_weight_error_all_zero` | `W={"en": 0.0, "hi": 0.0, "ta": 0.0, "kn": 0.0, "hinglish": 0.0}` | `InvalidLanguageWeightError` (defensive all-zero path per §3.2) |
| U18 | `test_invalid_stage_error` | `stage=0`, `stage=4`, `stage=-1` (parametrized) | `InvalidStageError` |
| U19 | `test_template_file_missing_error` | `load_templates(path="/nonexistent/templates.yaml")` | `TemplateFileMissingError` |

> The 5 "validation exceptions" required by the task map to U11 (`InvalidLanguageError`) + U13/U14/U15/U17 (four `InvalidLanguageWeightError` branches: empty / neg / sum-mismatch / all-zero). U12, U16, U18, U19 are additional coverage for the broader §5 table.
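
The branches exercised by U11–U17 can be illustrated with a hypothetical validator. The function name, the check order, and the `1e-7` sum tolerance are assumptions for this sketch (the tolerance is borrowed from the strategy docstring in §2); the exception classes are local stand-ins for those in `task_generator.md` §5.

```python
import math

SUPPORTED = ("hi", "ta", "kn", "en", "hinglish")


class InvalidLanguageError(ValueError):
    """Stand-in for the exception of the same name in the module under test."""


class InvalidLanguageWeightError(ValueError):
    """Stand-in for the exception of the same name in the module under test."""


def validate_language_weights(weights: dict[str, float], tol: float = 1e-7) -> None:
    """Hypothetical validator; the real checks live inside generate()."""
    if not weights:
        raise InvalidLanguageWeightError("empty weights")
    for key in weights:
        if key not in SUPPORTED:
            raise InvalidLanguageError(f"unsupported language key: {key!r}")
    if any(not math.isfinite(w) or w < 0.0 for w in weights.values()):
        raise InvalidLanguageWeightError("negative or non-finite weight")
    total = math.fsum(weights.values())
    if total == 0.0:
        raise InvalidLanguageWeightError("all weights are zero")
    if abs(total - 1.0) > tol:
        raise InvalidLanguageWeightError(f"weights sum to {total}, expected 1.0")
```

Note that U14's input (`{"en": 1.5, "hi": -0.5}`) sums to exactly 1.0, so the negative-value check has to be separate from the sum check, as it is in this sketch.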

### 1.5 Unicode NFC assertion — §3.4, §3.6-4, §3.6-8 (5 cases)

| # | Test id | Input | Assertion |
|---|---|---|---|
| U20 | `test_seed_utterance_is_nfc_for_every_language` | One `generate` call per `L ∈ {"hi","ta","kn","en","hinglish"}` with single-language `W` | `unicodedata.is_normalized("NFC", g.seed_utterance)` is `True` for each. |
| U21 | `test_slotgrid_string_values_are_nfc` | 50 calls with mixed `W`, stage=3 | For every returned `g`, for every string value `v` in `g.slots.values()`: `isinstance(v, str) implies unicodedata.is_normalized("NFC", v)`. Guards §3.6-8. |
| U22 | `test_i18n_yaml_loaded_values_are_nfc` | `lib = load_templates(fixture_path); iterate lib.i18n` | Every string in `lib.i18n[lang][key]` passes `is_normalized("NFC", v)`. Guards §3.4 loader contract. |
| U23 | `test_templates_yaml_variant_strings_are_nfc_post_load` | `lib.templates → template.language_variants` | Every variant string passes `is_normalized("NFC", v)`. Guards §3.4. |
| U24 | `test_nfd_input_renormalized_to_nfc_on_load` | Fixture `templates_nfd.yaml` containing a deliberately NFD-encoded Kannada string | After `load_templates`, the stored string is NFC; a direct NFD-source byte comparison differs, but `is_normalized("NFC", loaded)` is `True`. |
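
The loader contract behind U22–U24 can be sketched as a recursive pass over the parsed YAML tree. `normalize_leaves` is a hypothetical helper name; whether the real loader normalizes keys as well as values is an assumption here.

```python
import unicodedata
from typing import Any


def normalize_leaves(node: Any) -> Any:
    """Return a copy of a parsed-YAML tree with every string (keys included) in NFC."""
    if isinstance(node, str):
        return unicodedata.normalize("NFC", node)
    if isinstance(node, dict):
        return {normalize_leaves(k): normalize_leaves(v) for k, v in node.items()}
    if isinstance(node, list):
        return [normalize_leaves(item) for item in node]
    return node  # ints, floats, bools, None pass through unchanged


# A Kannada string containing U+0CC0 (vowel sign II), which NFD splits
# into U+0CBF followed by U+0CD5.
nfc_word = "\u0ca8\u0cc0\u0cb0\u0cc1"
nfd_word = unicodedata.normalize("NFD", nfc_word)
loaded = normalize_leaves({"kn": {"variants": [nfd_word]}})
```

The NFD form is one code point longer than the NFC form, which is the byte-level difference U24 relies on; after the pass, the stored string equals the NFC original.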

### 1.6 blake2b sub-seed domain separation — §3.1 (4 cases)

| # | Test id | Input | Assertion |
|---|---|---|---|
| U25 | `test_stable_sub_seed_formula` | `stable_sub_seed(42, "domain")` | Returns `int.from_bytes(hashlib.blake2b(b"42:domain", digest_size=8).digest(), "big")` — recomputed inline in the test, compared byte-exact. Pins the formula. |
| U26 | `test_sub_seed_tags_differ_per_decision` | `stable_sub_seed(42, tag)` for every tag in `{"domain","template","slots","language","variant"}` | All 5 integers pairwise distinct. Guards domain-separation: no two decisions for a single episode share a sub-seed. |
| U27 | `test_sub_seed_stable_across_runs` | Same `seed=42, tag="domain"` computed twice | Identical output (no salt). |
| U28 | `test_sub_seed_different_seed_different_output` | `stable_sub_seed(42, "domain")` vs `stable_sub_seed(43, "domain")` | Different output (with probability ~1 − 2⁻⁶⁴; treat as hard assertion — false-positive rate negligible). |
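
A minimal sketch of the sub-seed derivation, written directly from the formula pinned by U25. The function body is an assumption about how `stable_sub_seed` might be implemented; only the formula itself comes from the table above.

```python
import hashlib


def stable_sub_seed(seed: int, tag: str) -> int:
    """Derive a 64-bit sub-seed from blake2b over the string '<seed>:<tag>'."""
    payload = f"{seed}:{tag}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(payload, digest_size=8).digest(), "big")


# One sub-seed per decision tag for a single episode seed (U26).
TAGS = ("domain", "template", "slots", "language", "variant")
sub_seeds = {tag: stable_sub_seed(42, tag) for tag in TAGS}
```

An unkeyed digest gives the same value in every process, which Python's built-in `hash()` on strings does not guarantee, since string hashing is randomized per process unless `PYTHONHASHSEED` is fixed.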

### 1.7 Structural invariants — §3.6 (2 cases)

| # | Test id | Input | Assertion |
|---|---|---|---|
| U29 | `test_seed_utterance_has_no_unresolved_placeholders` | 100 calls, stage=3, mixed `W` | For every `g`: `re.search(r"\{[a-z_][a-z0-9_]*\}", g.seed_utterance)` is `None`. Guards §3.6-3. |
| U30 | `test_seed_utterance_length_leq_280` | 100 calls, stage=3, mixed `W` | `all(len(g.seed_utterance) <= 280 for g in results)`. Guards §3.6-7 (SMS-length bound for ASR). |
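
The placeholder check in U29 can be sketched with the same regular expression. `unresolved_placeholders` is a hypothetical helper name used only for illustration.

```python
import re

# Same pattern as U29: a lowercase snake_case name wrapped in braces.
PLACEHOLDER = re.compile(r"\{[a-z_][a-z0-9_]*\}")


def unresolved_placeholders(utterance: str) -> list[str]:
    """Return every '{slot}' token still present in a rendered utterance."""
    return PLACEHOLDER.findall(utterance)


leftover = unresolved_placeholders("Flight from {from} to DEL on {when}")
clean = unresolved_placeholders("Flight from BLR to DEL on Monday")
```

The pattern only matches lowercase names, so it assumes slot names follow the snake_case convention seen in the fixture (`{budget_inr}`, `{time_window}`). The length bound in U30 uses `len()`, which counts code points rather than UTF-8 bytes.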

---

## 2. Property Tests (hypothesis)

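Live in `tests/test_task_generator_properties.py`. Marker: `@pytest.mark.property`. The `@given`-based properties (P1, P4, P5, P6) set `hypothesis.settings(max_examples=...)` per test; P2 and P3 are fixed-seed walks that take no generated arguments.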

**Total property count: 6** (≥ 5 required).

### P1 — Purity & Determinism

```python
@given(seed=st.integers(min_value=0, max_value=2**62),
       stage=st.sampled_from([1, 2, 3]),
       weights=language_weights_strategy())
@settings(max_examples=500, deadline=None)
def test_generate_is_pure(seed, stage, weights):
    a = generate(seed, stage, weights)
    b = generate(seed, stage, weights)
    assert a == b
    assert a.seed_utterance == b.seed_utterance
```

Shrinks to minimal failing `(seed, stage, weights)` on any non-determinism regression.

### P2 — Unique episode_ids over procedural space

```python
# Plain pytest test: the hypothesis pytest plugin rejects @settings on a test without @given.
@pytest.mark.slow
def test_procedural_space_uniqueness_200000():
    """Walk 200,000 distinct seeds (DESIGN.md §8.4 procedural-space cardinality).
    Assert unique GoalSpec.episode_id values under fixed stage=3 + uniform weights."""
    W = {"en": 0.2, "hi": 0.2, "ta": 0.2, "kn": 0.2, "hinglish": 0.2}
    ids = set()
    for s in range(200_000):
        g = generate(s, 3, W)
        ids.add(g.episode_id)
    assert len(ids) == 200_000
```

Expected runtime at ~0.5 ms per call ≈ 100 s. Marker `@pytest.mark.slow`; excluded from default `pytest` run, included in CI nightly.

### P3 — Language distribution at n=10,000 (chi-square)

```python
from collections import Counter

# Plain pytest test: the hypothesis pytest plugin rejects @settings on a test without @given.
def test_language_distribution_chi_square_n10000():
    W = {"en": 0.3, "hi": 0.3, "ta": 0.2, "kn": 0.1, "hinglish": 0.1}
    n = 10_000
    observed = Counter(generate(s, 3, W).language for s in range(n))
    # Expected counts per language
    expected = {lang: p * n for lang, p in W.items()}
    chi2 = sum((observed[l] - expected[l])**2 / expected[l] for l in W)
    # df=4, alpha=0.001 critical value ≈ 18.47
    assert chi2 < 18.47, f"chi-square {chi2} rejects null at p<0.001"
```
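
As a worked instance of the statistic in P3, the sketch below computes chi-square for two made-up observation tables at n = 10,000. The counts are illustrative only and do not come from any real run; `chi_square` is a hypothetical helper name.

```python
def chi_square(observed: dict[str, int], weights: dict[str, float]) -> float:
    """Pearson chi-square of observed counts against expected counts weights[k] * n."""
    n = sum(observed.values())
    return sum((observed[k] - weights[k] * n) ** 2 / (weights[k] * n) for k in weights)


W = {"en": 0.3, "hi": 0.3, "ta": 0.2, "kn": 0.1, "hinglish": 0.1}

# Mild deviation: 2500/3000 + 2500/3000 + 0 + 400/1000 + 400/1000, about 2.47.
mild = chi_square({"en": 3050, "hi": 2950, "ta": 2000, "kn": 1020, "hinglish": 980}, W)

# Strong skew between en and hi: 90000/3000 + 90000/3000 = 60.
skewed = chi_square({"en": 3300, "hi": 2700, "ta": 2000, "kn": 1000, "hinglish": 1000}, W)
```

With five languages there are 4 degrees of freedom, so the mild table passes the 18.47 threshold comfortably while the skewed one fails it.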

### P4 — Stage monotonicity of template pool

```python
@given(seed=st.integers(min_value=0, max_value=10_000))
@settings(max_examples=200, deadline=None)
def test_stage_template_pool_monotone(seed):
    """Stage 3 template pool ⊇ Stage 2 pool ⊇ Stage 1 pool (§3.5).

    Checked by proxy: the constraint-count ceiling from §1.2 widens with stage (2/3/4)."""
    # English-only weights ensure language doesn't shift the template branch.
    W = {"en": 1.0, "hi": 0.0, "ta": 0.0, "kn": 0.0, "hinglish": 0.0}
    # The per-stage ceiling must hold irrespective of seed.
    for stage, ceiling in ((1, 2), (2, 3), (3, 4)):
        assert len(generate(seed, stage, W).constraints) <= ceiling
```

### P5 — NFC closure under all inputs

```python
@given(seed=st.integers(min_value=0, max_value=2**62),
       stage=st.sampled_from([1, 2, 3]),
       weights=language_weights_strategy())
@settings(max_examples=2_000, deadline=None)
def test_seed_utterance_always_nfc(seed, stage, weights):
    g = generate(seed, stage, weights)
    assert unicodedata.is_normalized("NFC", g.seed_utterance)
    for v in g.slots.values():
        if isinstance(v, str):
            assert unicodedata.is_normalized("NFC", v)
```

### P6 — Budget bounded by template declaration

```python
@given(seed=st.integers(min_value=0, max_value=10_000),
       stage=st.sampled_from([1, 2, 3]))
@settings(max_examples=1_000, deadline=None)
def test_budget_within_declared_range(seed, stage):
    W = {"en": 1.0, "hi": 0.0, "ta": 0.0, "kn": 0.0, "hinglish": 0.0}
    g = generate(seed, stage, W)
    if "budget_inr" in g.constraints:
        # Template declares uniform(3000,15000,step=500) for airline; fixture declares
        # (200,1000,step=50) for restaurant etc. Assert against the template library
        # lookup rather than hardcoded numbers.
        tmpl = _lookup_template_for_test(g.template_id)
        low, high = tmpl.constraints_template["budget_inr"].low, tmpl.constraints_template["budget_inr"].high
        assert low <= g.constraints["budget_inr"] <= high
```

**hypothesis strategies** (fixture module `tests/fixtures/task_generator/strategies.py`):

```python
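from hypothesis import assume, strategies as st


def language_weights_strategy():
    """Return a strategy for dict[LanguageCode, float] with all weights >= 0 and sum = 1.0 ± 1e-7."""
    langs = ["hi", "ta", "kn", "en", "hinglish"]

    @st.composite
    def _impl(draw):
        raw = [draw(st.floats(min_value=0.0, max_value=1.0, allow_nan=False)) for _ in langs]
        total = sum(raw)
        # Discard the all-zero draw: it cannot be normalized, and generate() rejects
        # an all-zero dict with InvalidLanguageWeightError (see U17).
        assume(total > 0.0)
        return {l: r / total for l, r in zip(langs, raw)}

    return _impl()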
```

---

## 3. Integration Tests

Live in `tests/test_task_generator_integration.py`. Marker: `@pytest.mark.integration`. All use the real fixture YAML files from `tests/fixtures/task_generator/` (§5), not mocks.

### I1 — Load real fixtures and validate shape

```python
def test_load_templates_from_fixture():
    lib = load_templates(FIXTURE_DIR / "templates_fixture.yaml")  # file name per §5.1
    assert isinstance(lib, TemplateLibrary)
    assert len({t.domain for t in lib.templates}) == 4  # airline, cab, restaurant, hotel
    assert len(lib.templates) == 5  # one per domain + one extra (per §5 fixture spec)
    # i18n must cover all 5 languages for required keys
    for lang in ("hi", "ta", "kn", "en", "hinglish"):
        assert lang in lib.i18n
```

### I2 — Generate 100 briefs, assert `valid_goal_spec()` invariants

This test relies on a helper shared with `models_tests.md`: once that doc is authored, `valid_goal_spec(g)` will live in `tests/fixtures/models/assertions.py`. Until then, the test imports a placeholder `valid_goal_spec` from the same path:

```python
def test_100_briefs_pass_goal_spec_invariants():
    """End-to-end: 100 seeds × stage=3 × mixed weights → every GoalSpec passes
    the canonical invariant suite from models_tests.md."""
    from tests.fixtures.models.assertions import valid_goal_spec

    W = {"en": 0.3, "hi": 0.3, "ta": 0.2, "kn": 0.1, "hinglish": 0.1}
    for s in range(100):
        g = generate(seed=s, stage=3, language_weights=W)
        valid_goal_spec(g)  # raises AssertionError on any invariant break
```

Invariants enforced by `valid_goal_spec` (contract carried in `models_tests.md`):
1. `g` is a frozen dataclass instance of `GoalSpec`.
2. `g.domain ∈ {"airline","cab","restaurant","hotel"}`.
3. `g.language ∈ {"hi","ta","kn","en","hinglish"}`.
4. `unicodedata.is_normalized("NFC", g.seed_utterance)`.
5. `len(g.seed_utterance) <= 280`.
6. No unresolved `{slot}` in `g.seed_utterance`.
7. `g.slots` keys ⊇ template's `required_slots`.
8. Every numeric in `g.constraints` is finite and within `[low, high]` of its template binding.

### I3 — `enumerate_variants` yields deterministic stable order

```python
def test_enumerate_variants_stable_order():
    W = {"en": 0.2, "hi": 0.2, "ta": 0.2, "kn": 0.2, "hinglish": 0.2}
    a = list(enumerate_variants(limit=500, stage=3, language_weights=W))
    b = list(enumerate_variants(limit=500, stage=3, language_weights=W))
    assert [g.episode_id for g in a] == [g.episode_id for g in b]
```

### I4 — Cross-language Indic script isolation

```python
@pytest.mark.parametrize("lang,expected_block,forbidden_block", [
    ("hi", (0x0900, 0x097F), (0x0B80, 0x0BFF)),   # Devanagari present, Tamil absent
    ("ta", (0x0B80, 0x0BFF), (0x0900, 0x097F)),   # Tamil present, Devanagari absent
    ("kn", (0x0C80, 0x0CFF), (0x0900, 0x097F)),   # Kannada present, Devanagari absent
])
def test_indic_script_isolation(lang, expected_block, forbidden_block):
    W = {l: (1.0 if l == lang else 0.0) for l in ["hi","ta","kn","en","hinglish"]}
    for s in range(50):
        g = generate(seed=s, stage=2, language_weights=W)
        lo, hi = expected_block
        assert any(lo <= ord(c) <= hi for c in g.seed_utterance), \
            f"no {lang} codepoints in utterance {g.seed_utterance!r}"
        fo, fh = forbidden_block
        # The check is strict: no forbidden-block codepoint may appear anywhere in the
        # rendered utterance, slot values included, because every i18n lookup is
        # scoped to `lang` (so Devanagari must not leak into ta/kn, nor Tamil into hi).
        assert not any(fo <= ord(c) <= fh for c in g.seed_utterance), \
            f"forbidden block leaked into {lang} utterance {g.seed_utterance!r}"
```

### I5 — Hinglish is Roman-only (no Devanagari leakage)

```python
def test_hinglish_never_contains_devanagari():
    W = {"hinglish": 1.0, "hi": 0.0, "ta": 0.0, "kn": 0.0, "en": 0.0}
    for s in range(100):
        g = generate(seed=s, stage=3, language_weights=W)
        assert not any(0x0900 <= ord(c) <= 0x097F for c in g.seed_utterance)
```

---

## 4. Coverage Target

| Metric | Target |
|---|---|
| Line coverage on `driftcall/task_generator.py` | **100%** |
| Branch coverage on `driftcall/task_generator.py` | **≥ 95%** |
| Every exception raise site from §5 of `task_generator.md` | **covered by ≥ 1 unit test** |
| NFC normalization check on `_format_utterance` output | **runs on all 5 languages** (U20) |

**Enforcement:**

```bash
python3 -m pytest tests/test_task_generator.py tests/test_task_generator_properties.py \
    tests/test_task_generator_integration.py \
    --cov=driftcall.task_generator \
    --cov-branch \
    --cov-fail-under=95 \
    --cov-report=term-missing
```

**Exception raise-site coverage matrix** (all 9 exception types from `task_generator.md` §5; `InvalidLanguageWeightError` is listed once per branch):

| Exception | Raise site (per §5) | Covering test |
|---|---|---|
| `MissingSlotError` | `_format_utterance` when `{X}` unbound | U34* (see §1.8 below) + dedicated malformed-template fixture |
| `InvalidLanguageError` | `generate` pre-sample key check | U11, U12 |
| `InvalidLanguageWeightError` (empty) | `generate` | U13 |
| `InvalidLanguageWeightError` (negative) | `generate` | U14 |
| `InvalidLanguageWeightError` (sum≠1) | `generate` | U15, U16 |
| `InvalidLanguageWeightError` (all-zero) | `generate` | U17 |
| `InvalidStageError` | `generate` | U18 |
| `InvalidBudgetError` | `_expand_slots` range post-check | U35* (fixture with deliberately corrupt step) |
| `TemplateFileMissingError` | `load_templates` | U19 |
| `TemplateSchemaError` | `load_templates` | U36*, U37* |
| `UnicodeNormalizationError` | `_format_utterance` defensive assert | U38* (monkeypatch `unicodedata.is_normalized` to return False) |
| `NoVariantForLanguageError` | `_format_utterance` missing variant | U39* (malformed fixture) |

> *U34–U39 are additional malformed-fixture raise-site tests, included in the §1 grand total of 36. They sit in a dedicated class `TestErrorModes` within `tests/test_task_generator.py`.

### 1.8 Malformed-fixture raise-site tests — appended to §1

(Appended here so the §1 count of 36 reflects all tests that live in the unit file.)

- **U34** `test_missing_slot_error` — fixture `templates_missing_slot.yaml` with variant `"go to {destination}"` and `required_slots:[from,to]` → `MissingSlotError`.
- **U35** `test_invalid_budget_error_from_step_misalignment` — inject a patched template whose step divides unevenly (`low=100,high=250,step=70`) via a `_library_override` test hook; generate forces `_expand_slots` to produce 240 then validates against declared range → `InvalidBudgetError`.
- **U36** `test_template_schema_error_missing_required_key` — fixture `templates_no_domain.yaml` → `TemplateSchemaError` on load.
- **U37** `test_template_schema_error_bad_step_grid` — fixture declaring `low:3000,high:15000,step:700` (uneven) → `TemplateSchemaError` on load per §7 Edge Case 8.
- **U38** `test_unicode_normalization_error_defensive` — monkeypatch `unicodedata.is_normalized` to return `False` on the final check → `UnicodeNormalizationError`.
- **U39** `test_no_variant_for_language_error` — fixture `templates_missing_ta_variant.yaml` declaring no Tamil variants; call with `W={"ta":1.0,…}` → `NoVariantForLanguageError`.
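
The step-grid condition behind U35 and U37 can be sketched as below, assuming "uneven" means that `high - low` is not a whole multiple of `step`. Both helper names are hypothetical.

```python
def step_grid_is_even(low: int, high: int, step: int) -> bool:
    """True when high is reachable from low in a whole number of steps."""
    if step <= 0 or high < low:
        return False
    return (high - low) % step == 0


def grid_values(low: int, high: int, step: int) -> list[int]:
    """All budget values on the grid low, low+step, ... that do not exceed high."""
    return list(range(low, high + 1, step))


airline_ok = step_grid_is_even(3000, 15000, 500)   # fixture template in §5.1
u37_grid = step_grid_is_even(3000, 15000, 700)     # 12000 % 700 == 100
u35_values = grid_values(100, 250, 70)             # largest reachable value is 240
```

Under this reading, U37's grid is rejected at load time because 12000 is not divisible by 700, and U35's grid tops out at 240, the value named in that test's description.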

**Revised §1 total:** 36 unit test cases (U1–U30 in §§1.1–1.7, U34–U39 in §1.8 malformed-fixture suite).

> Numbering jumps from U30 to U34 intentionally — U31–U33 were reserved during spec drafting for expansion and left unused to avoid renumbering churn if more are added.

---

## 5. Fixtures

All fixtures live in `tests/fixtures/task_generator/` and are **shared with `env_tests.md`** (the env test plan imports the same YAML files to drive `DriftCallEnv.reset()` integration tests).

### 5.1 Template fixture

**File:** `tests/fixtures/task_generator/templates_fixture.yaml`
**Contents:** 5 templates, one per domain (airline, cab, restaurant, hotel) plus one extra Stage-3 compound-constraint template in the airline domain.
**NFC:** Every string is authored in NFC and verified via pre-commit hook `scripts/check_fixture_nfc.py` (runs `is_normalized("NFC", v)` across every string leaf).

Example shape (airline template):

```yaml
- template_id: airline.book.fixture_v1
  domain: airline
  intent: book_flight
  min_stage: 1
  required_slots: [from, to, when]
  optional_slots: [seat_pref]
  constraints_template:
    budget_inr: {distribution: uniform, low: 3000, high: 15000, step: 500}
    time_window: {choices: [morning, afternoon, evening, late_night]}
  drift_slot_tags: [price, total_fare_inr]
  language_variants:
    hinglish: ["Bhai {when} ko {from} se {to}, {budget_inr} rupees max, {time_window}"]
    hi:       ["{when} को {from} से {to}, ₹{budget_inr} से कम, {time_window}"]
    ta:       ["{when} அன்று {from} லிருந்து {to}, ₹{budget_inr} கீழ், {time_window}"]
    kn:       ["{when} ರಂದು {from} ಇಂದ {to}, ₹{budget_inr} ಒಳಗೆ, {time_window}"]
    en:       ["Flight from {from} to {to} on {when}, under ₹{budget_inr}, {time_window}"]
```

The full fixture carries all 5 templates: `airline.book.fixture_v1` (shown above), `cab.ride.fixture_v1`, `restaurant.order.fixture_v1`, `hotel.book.fixture_v1`, and `airline.book.compound_v1` (Stage-3 compound).

### 5.2 i18n fixture

**File:** `tests/fixtures/task_generator/i18n_fixture.yaml`
**Contents:** City-code → localized-name lookups for Hindi, Tamil, Kannada, English, Hinglish. Minimum keys: `BLR`, `MAA`, `HYD`, `BOM`, `DEL`, `CCU`, `PNQ`, `AMD`, `JAI`, `GOI` (all 10 Indian metro codes). Weekday names in each language. Domain-specific nouns (dish names for restaurant, room types for hotel).

NFC verification is part of the test `U22` and the pre-commit hook above.

Example:

```yaml
hi:
  cities:
    BLR: "बेंगलुरु"
    MAA: "चेन्नई"
    HYD: "हैदराबाद"
  weekdays:
    monday: "सोमवार"
ta:
  cities:
    BLR: "பெங்களூரு"
    MAA: "சென்னை"
  weekdays:
    monday: "திங்கட்கிழமை"
kn:
  cities:
    BLR: "ಬೆಂಗಳೂರು"
    MAA: "ಚೆನ್ನೈ"
  weekdays:
    monday: "ಸೋಮವಾರ"
en:
  cities:
    BLR: "Bengaluru"
hinglish:
  cities:
    BLR: "Bengaluru"
```

### 5.3 Stage-weight fixtures

Python-module fixtures exported from `tests/fixtures/task_generator/weights.py`:

```python
# Matches DESIGN.md §10.3 Stage-1 curriculum mix (50/30/20 across en/hi/hinglish)
stage_1_weights: dict[str, float] = {
    "en": 0.50, "hi": 0.30, "hinglish": 0.20, "ta": 0.00, "kn": 0.00,
}

# Stage-2 broadens to all 5 languages with 30/30/20/10/10
stage_2_weights: dict[str, float] = {
    "en": 0.30, "hi": 0.30, "hinglish": 0.20, "ta": 0.10, "kn": 0.10,
}

# Stage-3 same distribution; stage differs only in template pool + drift schedule
stage_3_weights: dict[str, float] = {
    "en": 0.30, "hi": 0.30, "hinglish": 0.20, "ta": 0.10, "kn": 0.10,
}
```

Each dict sums to exactly `1.00` under IEEE-754 double-precision (verified in a `conftest.py` sanity check).
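
A possible shape for that `conftest.py` sanity check is sketched below. The dicts are copied inline so the snippet stands alone; the function name and the use of `math.fsum` (so the result does not depend on dict ordering) are assumptions.

```python
import math

stage_1_weights = {"en": 0.50, "hi": 0.30, "hinglish": 0.20, "ta": 0.00, "kn": 0.00}
stage_2_weights = {"en": 0.30, "hi": 0.30, "hinglish": 0.20, "ta": 0.10, "kn": 0.10}
stage_3_weights = {"en": 0.30, "hi": 0.30, "hinglish": 0.20, "ta": 0.10, "kn": 0.10}


def assert_weights_sum_to_one(name: str, weights: dict[str, float]) -> None:
    """Fail loudly if a stage-weight dict drifts away from an exact sum of 1.0."""
    total = math.fsum(weights.values())
    assert total == 1.0, f"{name} sums to {total!r}, expected exactly 1.0"


for _name, _weights in (
    ("stage_1_weights", stage_1_weights),
    ("stage_2_weights", stage_2_weights),
    ("stage_3_weights", stage_3_weights),
):
    assert_weights_sum_to_one(_name, _weights)
```

An exact-equality check is deliberately brittle: editing a weight to a value whose doubles no longer add up to 1.0 will fail here before it reaches `generate`.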

### 5.4 Malformed fixtures (error-mode coverage only)

Distinct YAML files, each authored to trigger exactly one exception. They live in `tests/fixtures/task_generator/malformed/`:

| File | Purpose |
|---|---|
| `templates_missing_slot.yaml` | triggers `MissingSlotError` (U34) |
| `templates_no_domain.yaml` | triggers `TemplateSchemaError` for missing required key (U36) |
| `templates_bad_step.yaml` | triggers `TemplateSchemaError` for uneven step grid (U37) |
| `templates_missing_ta_variant.yaml` | triggers `NoVariantForLanguageError` (U39) |
| `templates_nfd.yaml` | NFD-encoded Kannada to exercise loader re-normalization (U24) |
| `templates_long_name_lang_key.yaml` | uses `"hindi"` as a language key to trigger schema rejection per §4.1 |

### 5.5 Shared-fixture contract with `env_tests.md`

`env_tests.md` (authored in the same Batch D4) imports `templates_fixture.yaml`, `i18n_fixture.yaml`, and all three `stage_N_weights` from this directory. The env test plan exercises `DriftCallEnv.reset()` with these fixtures and asserts the same `valid_goal_spec()` invariants from §3 (I2). Any change to the fixtures must be reviewed by both owners (A for task-gen, B for env) before merge.

---

## 6. Appendix — Test File Layout

```
tests/
├── conftest.py                         # pytest-wide fixtures (paths, weights)
├── test_task_generator.py              # §1 unit tests (U1–U30, U34–U39)
├── test_task_generator_properties.py   # §2 property tests (P1–P6)
├── test_task_generator_integration.py  # §3 integration tests (I1–I5)
└── fixtures/
    ├── models/
    │   └── assertions.py               # valid_goal_spec() helper (cross-doc)
    └── task_generator/
        ├── strategies.py               # hypothesis strategies
        ├── weights.py                  # stage_1/2/3_weights
        ├── templates_fixture.yaml
        ├── i18n_fixture.yaml
        └── malformed/
            ├── templates_missing_slot.yaml
            ├── templates_no_domain.yaml
            ├── templates_bad_step.yaml
            ├── templates_missing_ta_variant.yaml
            ├── templates_nfd.yaml
            └── templates_long_name_lang_key.yaml
```

---

## 7. Sanity Checks (for the implementer)

Before declaring `task_generator.py` done:

1. `pytest tests/test_task_generator.py -v` — all 36 unit tests pass (U1–U30, U34–U39).
2. `pytest tests/test_task_generator_properties.py -v` — all 6 properties pass (including the 200,000-seed walk under `-m slow`).
3. `pytest tests/test_task_generator_integration.py -v` — all 5 integration tests pass against real YAML fixtures.
4. `pytest --cov=driftcall.task_generator --cov-branch --cov-fail-under=95` — 100% line, ≥ 95% branch.
5. `scripts/check_fixture_nfc.py` — NFC hook green on every YAML leaf.
6. `ruff check tests/test_task_generator*.py` — clean.
7. `mypy --strict tests/test_task_generator*.py` — clean (test code is type-checked too).

When all green, dispatch ≥ 2 fresh critic agents per CLAUDE.md §3.4. Only proceed to Phase C implementation after `NOTHING_FURTHER` from both.