# task_generator_tests — Test Plan for `driftcall/task_generator.py`
**Module under test:** `driftcall/task_generator.py`
**Design doc:** `docs/modules/task_generator.md` (sealed)
**Cross-refs:** DESIGN.md §3.1 (System Architecture), §4.1, §4.2, §8.3, §8.4, §10.3
**Owner:** Person B (Rewards & Tests)
**Tooling:** `pytest`, `pytest-cov`, `hypothesis`, `pyyaml`, `unicodedata` (stdlib), `hashlib` (stdlib)
**Status:** Test-plan spec — no test code yet.
This plan is the authoritative test contract for `task_generator`. Every behavior clause in §3 of `task_generator.md` maps to at least one test case below. Every exception in §5 has a raise-site test. Every invariant in §3.6 has a property test. The plan is shared with `env_tests.md` at the fixture layer (§5 below).
---
## 1. Unit Tests
All unit tests live in `tests/test_task_generator.py`, one `pytest` class per surface under test. Marker: `@pytest.mark.unit`. Fixtures are loaded from `tests/fixtures/task_generator/` (see §5).
**Total unit test count: 36** (≥ 25 required): U1–U30 in §§1.1–1.7 plus U34–U39 in §1.8.
### 1.1 Determinism — `generate(seed, stage, language_weights)` (5 cases)
| # | Test id | Input | Assertion |
|---|---|---|---|
| U1 | `test_generate_same_seed_same_goalspec` | `seed=42, stage=1, W=stage_1_weights` called 100 times in a loop | All 100 returned `GoalSpec` instances are `==` to the first (frozen dataclass equality). `assertion count = 99`. |
| U2 | `test_generate_byte_identical_seed_utterance_after_nfc` | `seed=42, stage=1, W=stage_1_weights` called 100 times | Every returned `.seed_utterance.encode("utf-8")` equals the first call's bytes. Guards §3.1 determinism clause. |
| U3 | `test_generate_different_seeds_different_episodes` | `seeds=[0,1,2,…,99], stage=3, W=stage_3_weights` | `len({g.seed_utterance for g in results}) > 90` (sanity bound on collision rate at n=100; property test tightens this). |
| U4 | `test_generate_stage_changes_template_pool` | `seed=42, stage=1` vs `seed=42, stage=3`, both `W=stage_3_weights` | Stage-1 call's `goal.constraints` length ≤ 2 per §3.5; stage-3 call's length may be up to 4 (the bound enforced by U8). Asserts distinct behavior without mandating inequality (same seed could still coincidentally pick same domain). |
| U5 | `test_generate_returns_frozen_goalspec` | Any valid call | `dataclasses.is_dataclass(goal) and goal.__dataclass_params__.frozen is True`. |
### 1.2 Stage-aware constraint counts — §3.5 table (3 cases)
| # | Test id | Input | Assertion |
|---|---|---|---|
| U6 | `test_stage_1_constraint_count_leq_2` | 200 calls with `stage=1, seeds=range(200), W=stage_1_weights` | `all(len(g.constraints) <= 2 for g in results)` — matches §3.5 "up to 2 constraints". |
| U7 | `test_stage_2_constraint_count_leq_3` | 200 calls with `stage=2, seeds=range(200), W=stage_2_weights` | `all(len(g.constraints) <= 3 for g in results)` — Stage-2 permits 2 constraints per §3.5, plus up to 1 optional-slot constraint (3 total upper bound per fixture). |
| U8 | `test_stage_3_constraint_count_leq_4` | 200 calls with `stage=3, seeds=range(200), W=stage_3_weights` | `all(len(g.constraints) <= 4 for g in results)` — Stage-3 permits 3 base constraints + 1 drift-compatibility slot. |
> Note on upper bounds: §3.5 says "compound constraints ≤ 2/2/3 respectively". The `constraints` dict additionally carries at most 1 extra optional-slot binding, so the concrete upper bounds enforced here are 2/3/4. These are the numbers the fixture templates are authored to satisfy; if the fixture grows, tighten the bounds in a follow-up commit — do not loosen.
### 1.3 Language-weight sampling distribution (2 cases)
| # | Test id | Input | Assertion |
|---|---|---|---|
| U9 | `test_language_weights_sampled_distribution_matches_at_n1000` | `n=1000` calls with `seeds=range(1000), stage=3, W={"en":0.3,"hi":0.3,"ta":0.2,"kn":0.1,"hinglish":0.1}` | For each language `L`, let `p = W[L]`, `observed = count(g.language==L)/n`. Assert `abs(observed - p) < 2*sqrt(p*(1-p)/n)` (±2σ binomial tolerance). Covers §3.2. |
| U10 | `test_language_weights_zero_keys_never_drawn` | `n=500` calls with `W={"en":1.0, "hi":0.0, "ta":0.0, "kn":0.0, "hinglish":0.0}` | `all(g.language == "en" for g in results)`. Zero-weight languages are never selected. |
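
The ±2σ tolerance in U9 can be worked through numerically. The following is a minimal sketch; the helper names `two_sigma_tolerance` and `within_two_sigma` are illustrative and not part of the module under test. It uses the normal approximation to the binomial, so for a random sample each per-language check would fail roughly 5% of the time. U9 stays stable only because its seeds are fixed.

```python
import math


def two_sigma_tolerance(p: float, n: int) -> float:
    """Half-width of the ±2σ band for an observed proportion (true rate p, n draws)."""
    return 2.0 * math.sqrt(p * (1.0 - p) / n)


def within_two_sigma(count: int, n: int, p: float) -> bool:
    """True when the observed frequency count/n lies strictly inside the band around p."""
    return abs(count / n - p) < two_sigma_tolerance(p, n)


# Weights from U9; n matches the test's sample size.
U9_WEIGHTS = {"en": 0.3, "hi": 0.3, "ta": 0.2, "kn": 0.1, "hinglish": 0.1}
N = 1000

for lang, p in U9_WEIGHTS.items():
    print(f"{lang}: p={p} tolerance=±{two_sigma_tolerance(p, N):.4f}")
```

At `n=1000` the band is a little under ±0.03 for `p=0.3` and a little under ±0.02 for `p=0.1`, which is why the tighter chi-square check is deferred to the larger sample in §2.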
### 1.4 Validation exceptions — §5 error-mode table (5 required, 9 provided)
| # | Test id | Trigger | Expected raise |
|---|---|---|---|
| U11 | `test_invalid_language_error_on_unsupported_key` | `W={"hindi": 1.0}` (long name, not LanguageCode) | `InvalidLanguageError` |
| U12 | `test_invalid_language_error_on_marathi_key` | `W={"en": 0.5, "marathi": 0.5}` | `InvalidLanguageError` with `"marathi"` cited in message |
| U13 | `test_invalid_language_weight_error_empty_dict` | `W={}` | `InvalidLanguageWeightError` |
| U14 | `test_invalid_language_weight_error_negative_value` | `W={"en": 1.5, "hi": -0.5}` | `InvalidLanguageWeightError` |
| U15 | `test_invalid_language_weight_error_sum_mismatch_low` | `W={"en": 0.5, "hi": 0.3}` (sum 0.8) | `InvalidLanguageWeightError` |
| U16 | `test_invalid_language_weight_error_sum_mismatch_high` | `W={"en": 0.7, "hi": 0.5}` (sum 1.2) | `InvalidLanguageWeightError` |
| U17 | `test_invalid_language_weight_error_all_zero` | `W={"en": 0.0, "hi": 0.0, "ta": 0.0, "kn": 0.0, "hinglish": 0.0}` | `InvalidLanguageWeightError` (defensive all-zero path per §3.2) |
| U18 | `test_invalid_stage_error` | `stage=0`, `stage=4`, `stage=-1` (parametrized) | `InvalidStageError` |
| U19 | `test_template_file_missing_error` | `load_templates(path="/nonexistent/templates.yaml")` | `TemplateFileMissingError` |
> The 5 "validation exceptions" required by the task map to U11 (`InvalidLanguageError`) + U13/U14/U15/U17 (four `InvalidLanguageWeightError` branches: empty / neg / sum-mismatch / all-zero). U12, U16, U18, U19 are additional coverage for the broader §5 table.
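
To make the branch structure behind U11–U17 concrete, here is a hypothetical stand-in validator. It is a sketch only: the function name `validate_language_weights`, the check order, the NaN handling, and the `1e-7` tolerance (borrowed from the strategy docstring in §2) are assumptions, not the sealed design. It shows why the negative-value check must be separate from the sum check: the U14 input sums to exactly 1.0.

```python
import math

SUPPORTED_LANGUAGES = frozenset({"hi", "ta", "kn", "en", "hinglish"})


class InvalidLanguageError(ValueError):
    """Stand-in for the exception of the same name in the module under test."""


class InvalidLanguageWeightError(ValueError):
    """Stand-in for the exception of the same name in the module under test."""


def validate_language_weights(weights: dict, tol: float = 1e-7) -> None:
    """Hypothetical validator; the order of checks is an assumption."""
    if not weights:
        raise InvalidLanguageWeightError("language_weights is empty")
    unknown = sorted(set(weights) - SUPPORTED_LANGUAGES)
    if unknown:
        raise InvalidLanguageError(f"unsupported language keys: {unknown}")
    # NaN compares False against everything, so reject non-finite values explicitly.
    if any(not math.isfinite(w) or w < 0.0 for w in weights.values()):
        raise InvalidLanguageWeightError("weights must be finite and non-negative")
    total = math.fsum(weights.values())
    if total == 0.0:
        raise InvalidLanguageWeightError("all weights are zero")
    if abs(total - 1.0) > tol:
        raise InvalidLanguageWeightError(f"weights sum to {total}, expected 1.0")


# The U14 input sums to 1.0 but still has to be rejected.
try:
    validate_language_weights({"en": 1.5, "hi": -0.5})
except InvalidLanguageWeightError as exc:
    print("rejected:", exc)
```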
### 1.5 Unicode NFC assertion — §3.4, §3.6-4, §3.6-8 (5 cases)
| # | Test id | Input | Assertion |
|---|---|---|---|
| U20 | `test_seed_utterance_is_nfc_for_every_language` | One `generate` call per `L ∈ {"hi","ta","kn","en","hinglish"}` with single-language `W` | `unicodedata.is_normalized("NFC", g.seed_utterance)` is `True` for each. |
| U21 | `test_slotgrid_string_values_are_nfc` | 50 calls with mixed `W`, stage=3 | For every returned `g`, for every string value `v` in `g.slots.values()`: `isinstance(v, str) implies unicodedata.is_normalized("NFC", v)`. Guards §3.6-8. |
| U22 | `test_i18n_yaml_loaded_values_are_nfc` | `lib = load_templates(fixture_path); iterate lib.i18n` | Every string in `lib.i18n[lang][key]` passes `is_normalized("NFC", v)`. Guards §3.4 loader contract. |
| U23 | `test_templates_yaml_variant_strings_are_nfc_post_load` | `lib.templates → template.language_variants` | Every variant string passes `is_normalized("NFC", v)`. Guards §3.4. |
| U24 | `test_nfd_input_renormalized_to_nfc_on_load` | Fixture `templates_nfd.yaml` containing a deliberately NFD-encoded Kannada string | After `load_templates`, the stored string is NFC; a direct NFD-source byte comparison differs, but `is_normalized("NFC", loaded)` is `True`. |
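
The NFD-versus-NFC distinction that U24 relies on can be reproduced with the standard library alone. This sketch uses a two-codepoint Kannada syllable chosen for illustration; the actual contents of `templates_nfd.yaml` are not specified here. The `nfc` helper is a hypothetical stand-in for whatever normalization the loader applies.

```python
import unicodedata

# KANNADA LETTER KA + KANNADA VOWEL SIGN OO, written in composed (NFC) form.
composed = "\u0c95\u0ccb"
# NFD splits the vowel sign into its canonical parts, so the string grows.
decomposed = unicodedata.normalize("NFD", composed)


def nfc(text: str) -> str:
    """Stand-in for the loader's re-normalization step."""
    return unicodedata.normalize("NFC", text)


print([f"U+{ord(c):04X}" for c in composed])
print([f"U+{ord(c):04X}" for c in decomposed])
print("decomposed is NFC:", unicodedata.is_normalized("NFC", decomposed))
print("round-trips to composed:", nfc(decomposed) == composed)
```

Both strings render identically, which is why U24 compares bytes and normalization status rather than relying on visual inspection.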
### 1.6 blake2b sub-seed domain separation — §3.1 (4 cases)
| # | Test id | Input | Assertion |
|---|---|---|---|
| U25 | `test_stable_sub_seed_formula` | `stable_sub_seed(42, "domain")` | Returns `int.from_bytes(hashlib.blake2b(b"42:domain", digest_size=8).digest(), "big")` — recomputed inline in the test, compared byte-exact. Pins the formula. |
| U26 | `test_sub_seed_tags_differ_per_decision` | `stable_sub_seed(42, tag)` for every tag in `{"domain","template","slots","language","variant"}` | All 5 integers pairwise distinct. Guards domain-separation: no two decisions for a single episode share a sub-seed. |
| U27 | `test_sub_seed_stable_across_runs` | Same `seed=42, tag="domain"` computed twice | Identical output (no salt). |
| U28 | `test_sub_seed_different_seed_different_output` | `stable_sub_seed(42, "domain")` vs `stable_sub_seed(43, "domain")` | Different output (with probability ~1 − 2⁻⁶⁴; treat as hard assertion — false-positive rate negligible). |
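
The formula pinned by U25 can be sketched directly. The payload encoding `f"{seed}:{tag}"` is inferred from the `b"42:domain"` literal in U25; feeding the result to `random.Random` is an assumption about how the sub-seed is consumed, shown only to illustrate domain separation.

```python
import hashlib
import random


def stable_sub_seed(seed: int, tag: str) -> int:
    """64-bit sub-seed derived from (seed, tag) with blake2b, no salt."""
    payload = f"{seed}:{tag}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(payload, digest_size=8).digest(), "big")


TAGS = ("domain", "template", "slots", "language", "variant")
sub_seeds = {tag: stable_sub_seed(42, tag) for tag in TAGS}

# Assumed usage: one independent RNG stream per decision, so adding a draw to
# the "slots" decision cannot shift the outcome of the "language" decision.
rngs = {tag: random.Random(value) for tag, value in sub_seeds.items()}

for tag, value in sub_seeds.items():
    print(f"{tag:>8}: {value:#018x}")
```

Unlike the built-in `hash()`, which is randomized per process for strings, the blake2b digest is identical across runs and machines, which is the property U27 checks.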
### 1.7 Structural invariants — §3.6 (2 cases)
| # | Test id | Input | Assertion |
|---|---|---|---|
| U29 | `test_seed_utterance_has_no_unresolved_placeholders` | 100 calls, stage=3, mixed `W` | For every `g`: `re.search(r"\{[a-z_][a-z0-9_]*\}", g.seed_utterance)` is `None`. Guards §3.6-3. |
| U30 | `test_seed_utterance_length_leq_280` | 100 calls, stage=3, mixed `W` | `all(len(g.seed_utterance) <= 280 for g in results)`. Guards §3.6-7 (SMS-length bound for ASR). |
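
The two structural checks in U29 and U30 combine naturally into one helper. The sketch below is illustrative; `utterance_problems` is a hypothetical name, while the regex and the 280 limit are taken from the table above. Note that `len` counts Unicode code points, not bytes or grapheme clusters, which matters for Indic scripts where one visible syllable may span several code points.

```python
import re

PLACEHOLDER = re.compile(r"\{[a-z_][a-z0-9_]*\}")
MAX_UTTERANCE_CHARS = 280


def utterance_problems(utterance: str) -> list[str]:
    """Return human-readable violations; an empty list means both checks pass."""
    problems = []
    leftover = PLACEHOLDER.findall(utterance)
    if leftover:
        problems.append(f"unresolved placeholders: {leftover}")
    if len(utterance) > MAX_UTTERANCE_CHARS:
        problems.append(f"length {len(utterance)} exceeds {MAX_UTTERANCE_CHARS}")
    return problems


print(utterance_problems("Flight from DEL to BLR on Monday, under ₹5000, morning"))
print(utterance_problems("Flight from {from} to BLR on {when}"))
```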
---
## 2. Property Tests (hypothesis)
Live in `tests/test_task_generator_properties.py`. Marker: `@pytest.mark.property`. Tests driven by `@given` set `hypothesis.settings(max_examples=...)` per test; P2 and P3 are fixed-seed loops without `@given`.
**Total property count: 6** (≥ 5 required).
### P1 — Purity & Determinism
```python
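from hypothesis import given, settings, strategies as st

from driftcall.task_generator import generate
from tests.fixtures.task_generator.strategies import language_weights_strategy


@given(seed=st.integers(min_value=0, max_value=2**62),
       stage=st.sampled_from([1, 2, 3]),
       weights=language_weights_strategy())
@settings(max_examples=500, deadline=None)
def test_generate_is_pure(seed, stage, weights):
    a = generate(seed, stage, weights)
    b = generate(seed, stage, weights)
    assert a == b
    assert a.seed_utterance == b.seed_utterance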
```
Shrinks to minimal failing `(seed, stage, weights)` on any non-determinism regression.
### P2 — Unique episode_ids over procedural space
```python
import pytest


# No @settings here: this is a plain loop without @given, and the Hypothesis
# pytest plugin reports an error for @settings on a test that lacks @given.
@pytest.mark.slow
def test_procedural_space_uniqueness_200000():
    """Walk 200,000 distinct seeds (DESIGN.md §8.4 procedural-space cardinality).

    Assert unique GoalSpec.episode_id values under fixed stage=3 + uniform weights.
    """
    W = {"en": 0.2, "hi": 0.2, "ta": 0.2, "kn": 0.2, "hinglish": 0.2}
    ids = set()
    for s in range(200_000):
        g = generate(s, 3, W)
        ids.add(g.episode_id)
    assert len(ids) == 200_000
```
Expected runtime at ~0.5 ms per call ≈ 100 s. Marker `@pytest.mark.slow`; excluded from default `pytest` run, included in CI nightly.
### P3 — Language distribution at n=10,000 (chi-square)
```python
from collections import Counter


# Plain deterministic loop, so no Hypothesis decorators (@settings needs @given).
def test_language_distribution_chi_square_n10000():
    W = {"en": 0.3, "hi": 0.3, "ta": 0.2, "kn": 0.1, "hinglish": 0.1}
    n = 10_000
    observed = Counter(generate(s, 3, W).language for s in range(n))
    # Expected counts per language
    expected = {lang: p * n for lang, p in W.items()}
    chi2 = sum((observed[l] - expected[l])**2 / expected[l] for l in W)
    # df=4, alpha=0.001 critical value ≈ 18.47
    assert chi2 < 18.47, f"chi-square {chi2} rejects null at p<0.001"
```
### P4 — Stage monotonicity of template pool
```python
@given(seed=st.integers(min_value=0, max_value=10_000))
@settings(max_examples=200, deadline=None)
def test_stage_template_pool_monotone(seed):
    """Stage 3 template pool ⊇ Stage 2 pool ⊇ Stage 1 pool (§3.5)."""
    # English-only weights ensure language doesn't shift the template branch.
    W = {"en": 1.0, "hi": 0.0, "ta": 0.0, "kn": 0.0, "hinglish": 0.0}
    t1 = generate(seed, 1, W).constraints
    # Constraint-count invariant must hold irrespective of seed
    assert len(t1) <= 2
```
### P5 — NFC closure under all inputs
```python
import unicodedata


@given(seed=st.integers(min_value=0, max_value=2**62),
       stage=st.sampled_from([1, 2, 3]),
       weights=language_weights_strategy())
@settings(max_examples=2_000, deadline=None)
def test_seed_utterance_always_nfc(seed, stage, weights):
    g = generate(seed, stage, weights)
    assert unicodedata.is_normalized("NFC", g.seed_utterance)
    for v in g.slots.values():
        if isinstance(v, str):
            assert unicodedata.is_normalized("NFC", v)
```
### P6 — Budget bounded by template declaration
```python
@given(seed=st.integers(min_value=0, max_value=10_000),
       stage=st.sampled_from([1, 2, 3]))
@settings(max_examples=1_000, deadline=None)
def test_budget_within_declared_range(seed, stage):
    W = {"en": 1.0, "hi": 0.0, "ta": 0.0, "kn": 0.0, "hinglish": 0.0}
    g = generate(seed, stage, W)
    if "budget_inr" in g.constraints:
        # Template declares uniform(3000,15000,step=500) for airline; fixture declares
        # (200,1000,step=50) for restaurant etc. Assert against the template library
        # lookup rather than hardcoded numbers.
        tmpl = _lookup_template_for_test(g.template_id)
        budget = tmpl.constraints_template["budget_inr"]
        assert budget.low <= g.constraints["budget_inr"] <= budget.high
```
**hypothesis strategies** (fixture module `tests/fixtures/task_generator/strategies.py`):
```python
from hypothesis import assume, strategies as st


def language_weights_strategy():
    """Return a strategy of dict[LanguageCode, float] with sum=1.0±1e-7 and all >=0."""
    langs = ["hi", "ta", "kn", "en", "hinglish"]

    @st.composite
    def _impl(draw):
        raw = [draw(st.floats(min_value=0.0, max_value=1.0, allow_nan=False)) for _ in langs]
        total = sum(raw)
        # Reject all-zero draws: generate() raises InvalidLanguageWeightError on them (U17).
        assume(total > 0.0)
        return {l: r / total for l, r in zip(langs, raw)}

    return _impl()
```
---
## 3. Integration Tests
Live in `tests/test_task_generator_integration.py`. Marker: `@pytest.mark.integration`. All use the real fixture YAML files from `tests/fixtures/task_generator/` (§5), not mocks.
### I1 — Load real fixtures and validate shape
```python
def test_load_templates_from_fixture():
    # File name per the §5.1 fixture spec.
    lib = load_templates(FIXTURE_DIR / "templates_fixture.yaml")
    assert isinstance(lib, TemplateLibrary)
    assert len({t.domain for t in lib.templates}) == 4  # airline, cab, restaurant, hotel
    assert len(lib.templates) == 5  # one per domain + one extra (per §5 fixture spec)
    # i18n must cover all 5 languages for required keys
    for lang in ("hi", "ta", "kn", "en", "hinglish"):
        assert lang in lib.i18n
```
### I2 — Generate 100 briefs, assert `valid_goal_spec()` invariants
This test relies on the `valid_goal_spec(g)` helper that `models_tests.md` will define in `tests/fixtures/models/assertions.py`. Until that doc is authored, the test imports a placeholder `valid_goal_spec`:
```python
def test_100_briefs_pass_goal_spec_invariants():
    """End-to-end: 100 seeds × stage=3 × mixed weights → every GoalSpec passes
    the canonical invariant suite from models_tests.md."""
    from tests.fixtures.models.assertions import valid_goal_spec
    W = {"en": 0.3, "hi": 0.3, "ta": 0.2, "kn": 0.1, "hinglish": 0.1}
    for s in range(100):
        g = generate(seed=s, stage=3, language_weights=W)
        valid_goal_spec(g)  # raises AssertionError on any invariant break
```
Invariants enforced by `valid_goal_spec` (contract carried in `models_tests.md`):
1. `g` is a frozen dataclass instance of `GoalSpec`.
2. `g.domain ∈ {"airline","cab","restaurant","hotel"}`.
3. `g.language ∈ {"hi","ta","kn","en","hinglish"}`.
4. `unicodedata.is_normalized("NFC", g.seed_utterance)`.
5. `len(g.seed_utterance) <= 280`.
6. No unresolved `{slot}` in `g.seed_utterance`.
7. `g.slots` keys ⊇ template's `required_slots`.
8. Every numeric in `g.constraints` is finite and within `[low, high]` of its template binding.
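
As an illustration of what the placeholder helper might check, the sketch below covers invariants 1–6, which need no template lookup. Everything here is hypothetical: `StandInGoalSpec` carries only three of the real `GoalSpec` fields, the function name is invented, and invariant 1 is reduced to "frozen dataclass instance" because the real class is not importable in a standalone example. Invariants 7 and 8 are omitted since they depend on the template library.

```python
import dataclasses
import re
import unicodedata

DOMAINS = frozenset({"airline", "cab", "restaurant", "hotel"})
LANGUAGES = frozenset({"hi", "ta", "kn", "en", "hinglish"})
PLACEHOLDER = re.compile(r"\{[a-z_][a-z0-9_]*\}")


@dataclasses.dataclass(frozen=True)
class StandInGoalSpec:
    """Hypothetical stand-in; the real GoalSpec has more fields."""
    domain: str
    language: str
    seed_utterance: str


def check_template_free_invariants(g) -> None:
    """Invariants 1-6 from the list above; raises AssertionError on the first break."""
    assert dataclasses.is_dataclass(g) and not isinstance(g, type)
    assert type(g).__dataclass_params__.frozen
    assert g.domain in DOMAINS
    assert g.language in LANGUAGES
    assert unicodedata.is_normalized("NFC", g.seed_utterance)
    assert len(g.seed_utterance) <= 280
    assert PLACEHOLDER.search(g.seed_utterance) is None


good = StandInGoalSpec("airline", "en", "Flight from DEL to BLR on Monday, under ₹5000, morning")
check_template_free_invariants(good)  # passes silently
```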
### I3 — `enumerate_variants` yields deterministic stable order
```python
def test_enumerate_variants_stable_order():
    W = {"en": 0.2, "hi": 0.2, "ta": 0.2, "kn": 0.2, "hinglish": 0.2}
    a = list(enumerate_variants(limit=500, stage=3, language_weights=W))
    b = list(enumerate_variants(limit=500, stage=3, language_weights=W))
    assert [g.episode_id for g in a] == [g.episode_id for g in b]
```
### I4 — Cross-language Indic script isolation
```python
@pytest.mark.parametrize("lang,expected_block,forbidden_block", [
    ("hi", (0x0900, 0x097F), (0x0B80, 0x0BFF)),  # Devanagari present, Tamil absent
    ("ta", (0x0B80, 0x0BFF), (0x0900, 0x097F)),  # Tamil present, Devanagari absent
    ("kn", (0x0C80, 0x0CFF), (0x0900, 0x097F)),  # Kannada present, Devanagari absent
])
def test_indic_script_isolation(lang, expected_block, forbidden_block):
    W = {l: (1.0 if l == lang else 0.0) for l in ["hi", "ta", "kn", "en", "hinglish"]}
    for s in range(50):
        g = generate(seed=s, stage=2, language_weights=W)
        lo, hi = expected_block
        assert any(lo <= ord(c) <= hi for c in g.seed_utterance), \
            f"no {lang} codepoints in utterance {g.seed_utterance!r}"
        fo, fh = forbidden_block
        # Slot values are rendered through i18n lookups scoped to `lang`, so no
        # forbidden-block codepoint (e.g., a Devanagari city name in a ta/kn
        # utterance) may appear anywhere in the rendered utterance.
        assert not any(fo <= ord(c) <= fh for c in g.seed_utterance), \
            f"forbidden block leaked into {lang} utterance {g.seed_utterance!r}"
```
### I5 — Hinglish is Roman-only (no Devanagari leakage)
```python
def test_hinglish_never_contains_devanagari():
    W = {"hinglish": 1.0, "hi": 0.0, "ta": 0.0, "kn": 0.0, "en": 0.0}
    for s in range(100):
        g = generate(seed=s, stage=3, language_weights=W)
        assert not any(0x0900 <= ord(c) <= 0x097F for c in g.seed_utterance)
```
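The block-range predicate used by I4 and I5 can be exercised on its own against the sample strings from the §5 fixtures. This is a minimal sketch; the helper name `uses_block` is illustrative. The block boundaries are those used in the parametrization above.

```python
DEVANAGARI = (0x0900, 0x097F)
TAMIL = (0x0B80, 0x0BFF)
KANNADA = (0x0C80, 0x0CFF)


def uses_block(text: str, block: tuple[int, int]) -> bool:
    """True if any code point of `text` falls inside the inclusive block range."""
    lo, hi = block
    return any(lo <= ord(ch) <= hi for ch in text)


samples = {
    "hi": "बेंगलुरु",
    "ta": "சென்னை",
    "kn": "ಬೆಂಗಳೂರು",
    "hinglish": "Bhai Monday ko Delhi se Bengaluru, 5000 rupees max",
}

for lang, text in samples.items():
    flags = {name: uses_block(text, block)
             for name, block in (("deva", DEVANAGARI), ("taml", TAMIL), ("knda", KANNADA))}
    print(lang, flags)
```

The rupee sign `₹` (U+20B9) sits outside all three ranges, so its presence in the Tamil and Kannada variants does not trip the forbidden-block assertion.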
---
## 4. Coverage Target
| Metric | Target |
|---|---|
| Line coverage on `driftcall/task_generator.py` | **100%** |
| Branch coverage on `driftcall/task_generator.py` | **≥ 95%** |
| Every exception raise site from §5 of `task_generator.md` | **covered by ≥ 1 unit test** |
| NFC normalization check on `_format_utterance` output | **runs on all 5 languages** (U20) |
**Enforcement:**
```bash
python3 -m pytest tests/test_task_generator.py tests/test_task_generator_properties.py \
tests/test_task_generator_integration.py \
--cov=driftcall.task_generator \
--cov-branch \
--cov-fail-under=95 \
--cov-report=term-missing
```
**Exception raise-site coverage matrix** (all 9 exception types from `task_generator.md` §5; `InvalidLanguageWeightError` is split into its four branches, giving 12 rows):
| Exception | Raise site (per §5) | Covering test |
|---|---|---|
| `MissingSlotError` | `_format_utterance` when `{X}` unbound | U34* (see §1.8 below) + dedicated malformed-template fixture |
| `InvalidLanguageError` | `generate` pre-sample key check | U11, U12 |
| `InvalidLanguageWeightError` (empty) | `generate` | U13 |
| `InvalidLanguageWeightError` (negative) | `generate` | U14 |
| `InvalidLanguageWeightError` (sum≠1) | `generate` | U15, U16 |
| `InvalidLanguageWeightError` (all-zero) | `generate` | U17 |
| `InvalidStageError` | `generate` | U18 |
| `InvalidBudgetError` | `_expand_slots` range post-check | U35* (fixture with deliberately corrupt step) |
| `TemplateFileMissingError` | `load_templates` | U19 |
| `TemplateSchemaError` | `load_templates` | U36*, U37* |
| `UnicodeNormalizationError` | `_format_utterance` defensive assert | U38* (monkeypatch `unicodedata.is_normalized` to return False) |
| `NoVariantForLanguageError` | `_format_utterance` missing variant | U39* (malformed fixture) |
> *U34–U39 are additional malformed-fixture raise-site tests, included in the §1 grand total of 36. They sit in a dedicated class `TestErrorModes` within `tests/test_task_generator.py`.
### 1.8 Malformed-fixture raise-site tests — appended to §1
(Appended here so the §1 count of 36 reflects all tests that live in the unit file.)
- **U34** `test_missing_slot_error` — fixture `templates_missing_slot.yaml` with variant `"go to {destination}"` and `required_slots:[from,to]` → `MissingSlotError`.
- **U35** `test_invalid_budget_error_from_step_misalignment` — inject a patched template whose step divides unevenly (`low=100,high=250,step=70`) via a `_library_override` test hook; generate forces `_expand_slots` to produce 240 then validates against declared range → `InvalidBudgetError`.
- **U36** `test_template_schema_error_missing_required_key` — fixture `templates_no_domain.yaml` → `TemplateSchemaError` on load.
- **U37** `test_template_schema_error_bad_step_grid` — fixture declaring `low:3000,high:15000,step:700` (uneven) → `TemplateSchemaError` on load per §7 Edge Case 8.
- **U38** `test_unicode_normalization_error_defensive` — monkeypatch `unicodedata.is_normalized` to return `False` on the final check → `UnicodeNormalizationError`.
- **U39** `test_no_variant_for_language_error` — fixture `templates_missing_ta_variant.yaml` declaring no Tamil variants; call with `W={"ta":1.0,…}` → `NoVariantForLanguageError`.
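
The patching pattern behind U38 can be sketched with the standard library. The formatter below is a hypothetical stand-in, not the real `_format_utterance`, and `unittest.mock` is used here only to keep the example free of third-party imports; in the actual test, pytest's `monkeypatch.setattr(unicodedata, "is_normalized", ...)` plays the same role.

```python
import unicodedata
from unittest import mock


class UnicodeNormalizationError(RuntimeError):
    """Stand-in for the exception of the same name in the module under test."""


def format_utterance_stand_in(text: str) -> str:
    """Hypothetical formatter with a defensive post-normalization check."""
    rendered = unicodedata.normalize("NFC", text)
    # Unreachable in normal operation; exists to catch a broken normalizer.
    if not unicodedata.is_normalized("NFC", rendered):
        raise UnicodeNormalizationError(f"not NFC after normalize: {rendered!r}")
    return rendered


def defensive_branch_fires() -> bool:
    """Force is_normalized to report False and confirm the guard raises."""
    with mock.patch.object(unicodedata, "is_normalized", return_value=False):
        try:
            format_utterance_stand_in("Bengaluru")
        except UnicodeNormalizationError:
            return True
    return False


print("guard reachable under patch:", defensive_branch_fires())
```

The patch only takes effect if the code under test looks the function up as `unicodedata.is_normalized` at call time. A module that does `from unicodedata import is_normalized` holds its own reference and would need that name patched instead.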
**Revised §1 total:** 36 unit test cases (30 in §§1.1–1.7 as U1–U30, plus 6 in the §1.8 malformed-fixture suite as U34–U39).
> Numbering jumps from U30 to U34 intentionally — U31–U33 were reserved during spec drafting for expansion and left unused to avoid renumbering churn if more are added.
---
## 5. Fixtures
All fixtures live in `tests/fixtures/task_generator/` and are **shared with `env_tests.md`** (the env test plan imports the same YAML files to drive `DriftCallEnv.reset()` integration tests).
### 5.1 Template fixture
**File:** `tests/fixtures/task_generator/templates_fixture.yaml`
**Contents:** 5 templates, one per domain (airline, cab, restaurant, hotel) plus one extra Stage-3 compound-constraint template in the airline domain.
**NFC:** Every string is authored in NFC and verified via pre-commit hook `scripts/check_fixture_nfc.py` (runs `is_normalized("NFC", v)` across every string leaf).
Example shape (airline template):
```yaml
- template_id: airline.book.fixture_v1
  domain: airline
  intent: book_flight
  min_stage: 1
  required_slots: [from, to, when]
  optional_slots: [seat_pref]
  constraints_template:
    budget_inr: {distribution: uniform, low: 3000, high: 15000, step: 500}
    time_window: {choices: [morning, afternoon, evening, late_night]}
  drift_slot_tags: [price, total_fare_inr]
  language_variants:
    hinglish: ["Bhai {when} ko {from} se {to}, {budget_inr} rupees max, {time_window}"]
    hi: ["{when} को {from} से {to}, ₹{budget_inr} से कम, {time_window}"]
    ta: ["{when} அன்று {from} லிருந்து {to}, ₹{budget_inr} கீழ், {time_window}"]
    kn: ["{when} ರಂದು {from} ಇಂದ {to}, ₹{budget_inr} ಒಳಗೆ, {time_window}"]
    en: ["Flight from {from} to {to} on {when}, under ₹{budget_inr}, {time_window}"]
```
The full fixture carries all 5 templates: the airline template shown above plus `cab.ride.fixture_v1`, `restaurant.order.fixture_v1`, `hotel.book.fixture_v1`, and `airline.book.compound_v1` (Stage-3 compound).
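
The NFC pre-commit check described in this section could be sketched as a recursive walk over the parsed YAML. The real contents of `scripts/check_fixture_nfc.py` are not specified here, so the function names below are hypothetical, and the sketch operates on an already-parsed structure (what `yaml.safe_load` would return) to stay free of third-party imports. It checks string values only, not mapping keys.

```python
import unicodedata
from typing import Any, Iterator


def iter_string_leaves(node: Any, path: str = "") -> Iterator[tuple[str, str]]:
    """Yield (path, value) for every string leaf in nested dicts and lists."""
    if isinstance(node, str):
        yield path, node
    elif isinstance(node, dict):
        for key, value in node.items():
            yield from iter_string_leaves(value, f"{path}.{key}" if path else str(key))
    elif isinstance(node, (list, tuple)):
        for index, value in enumerate(node):
            yield from iter_string_leaves(value, f"{path}[{index}]")


def non_nfc_leaves(document: Any) -> list[str]:
    """Paths of string leaves that are not NFC-normalized."""
    return [path for path, value in iter_string_leaves(document)
            if not unicodedata.is_normalized("NFC", value)]


# Illustrative data: one clean Hindi leaf, one clean and one NFD Kannada leaf.
composed = "\u0c95\u0ccb"
sample = {
    "hi": {"cities": {"BLR": "बेंगलुरु"}},
    "kn": {"variants": [composed, unicodedata.normalize("NFD", composed)]},
    "min_stage": 1,  # non-string leaves are skipped
}

print(non_nfc_leaves(sample))
```

A hook built this way can exit non-zero whenever the returned list is non-empty and print the offending paths, which points the author at the exact leaf to fix.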
### 5.2 i18n fixture
**File:** `tests/fixtures/task_generator/i18n_fixture.yaml`
**Contents:** City-code → localized-name lookups for Hindi, Tamil, Kannada, English, and Hinglish, plus weekday names in each language and domain-specific nouns (dish names for restaurant, room types for hotel). Minimum city keys: `BLR`, `MAA`, `HYD`, `BOM`, `DEL`, `CCU`, `PNQ`, `AMD`, `JAI`, `GOI` (10 Indian city codes).
NFC verification is part of the test `U22` and the pre-commit hook above.
Example:
```yaml
hi:
  cities:
    BLR: "बेंगलुरु"
    MAA: "चेन्नई"
    HYD: "हैदराबाद"
  weekdays:
    monday: "सोमवार"
ta:
  cities:
    BLR: "பெங்களூரு"
    MAA: "சென்னை"
  weekdays:
    monday: "திங்கட்கிழமை"
kn:
  cities:
    BLR: "ಬೆಂಗಳೂರು"
    MAA: "ಚೆನ್ನೈ"
  weekdays:
    monday: "ಸೋಮವಾರ"
en:
  cities:
    BLR: "Bengaluru"
hinglish:
  cities:
    BLR: "Bengaluru"
```
### 5.3 Stage-weight fixtures
Python-module fixtures exported from `tests/fixtures/task_generator/weights.py`:
```python
# Matches DESIGN.md §10.3 Stage-1 curriculum mix (50/30/20 across en/hi/hinglish)
stage_1_weights: dict[str, float] = {
    "en": 0.50, "hi": 0.30, "hinglish": 0.20, "ta": 0.00, "kn": 0.00,
}
# Stage-2 broadens to all 5 languages with 30/30/20/10/10
stage_2_weights: dict[str, float] = {
    "en": 0.30, "hi": 0.30, "hinglish": 0.20, "ta": 0.10, "kn": 0.10,
}
# Stage-3 same distribution; stage differs only in template pool + drift schedule
stage_3_weights: dict[str, float] = {
    "en": 0.30, "hi": 0.30, "hinglish": 0.20, "ta": 0.10, "kn": 0.10,
}
```
Each dict sums to exactly `1.00` under IEEE-754 double-precision (verified in a `conftest.py` sanity check).
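
A sanity check of this kind might look like the sketch below. The function name `check_weight_table` and the `1e-9` tolerance are assumptions; the weight tables are copied from the fixture above. The sketch uses `math.fsum` with a small tolerance rather than exact `==`, since naive float summation can depend on the order of the terms.

```python
import math

LANGUAGES = frozenset({"hi", "ta", "kn", "en", "hinglish"})

STAGE_WEIGHTS = {
    1: {"en": 0.50, "hi": 0.30, "hinglish": 0.20, "ta": 0.00, "kn": 0.00},
    2: {"en": 0.30, "hi": 0.30, "hinglish": 0.20, "ta": 0.10, "kn": 0.10},
    3: {"en": 0.30, "hi": 0.30, "hinglish": 0.20, "ta": 0.10, "kn": 0.10},
}


def check_weight_table(weights: dict[str, float]) -> None:
    """Assert full language coverage, non-negative entries, and unit sum."""
    assert set(weights) == LANGUAGES, f"unexpected keys: {sorted(weights)}"
    assert all(w >= 0.0 for w in weights.values())
    total = math.fsum(weights.values())
    assert math.isclose(total, 1.0, abs_tol=1e-9), f"weights sum to {total}"


for stage, weights in STAGE_WEIGHTS.items():
    check_weight_table(weights)
    print(f"stage {stage}: ok")
```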
### 5.4 Malformed fixtures (error-mode coverage only)
Distinct YAML files, each authored to trigger exactly one exception. They live in `tests/fixtures/task_generator/malformed/`:
| File | Purpose |
|---|---|
| `templates_missing_slot.yaml` | triggers `MissingSlotError` (U34) |
| `templates_no_domain.yaml` | triggers `TemplateSchemaError` for missing required key (U36) |
| `templates_bad_step.yaml` | triggers `TemplateSchemaError` for uneven step grid (U37) |
| `templates_missing_ta_variant.yaml` | triggers `NoVariantForLanguageError` (U39) |
| `templates_nfd.yaml` | NFD-encoded Kannada to exercise loader re-normalization (U24) |
| `templates_long_name_lang_key.yaml` | uses `"hindi"` as a language key to trigger schema rejection per §4.1 |
### 5.5 Shared-fixture contract with `env_tests.md`
`env_tests.md` (authored in the same Batch D4) imports `templates_fixture.yaml`, `i18n_fixture.yaml`, and all three `stage_N_weights` from this directory. The env test plan exercises `DriftCallEnv.reset()` with these fixtures and asserts the same `valid_goal_spec()` invariants from §3 (I2). Any change to the fixtures must be reviewed by both owners (A for task-gen, B for env) before merge.
---
## 6. Appendix — Test File Layout
```
tests/
├── conftest.py                              # pytest-wide fixtures (paths, weights)
├── test_task_generator.py                   # §1 unit tests (U1–U30, U34–U39)
├── test_task_generator_properties.py        # §2 property tests (P1–P6)
├── test_task_generator_integration.py       # §3 integration tests (I1–I5)
└── fixtures/
    ├── models/
    │   └── assertions.py                    # valid_goal_spec() helper (cross-doc)
    └── task_generator/
        ├── strategies.py                    # hypothesis strategies
        ├── weights.py                       # stage_1/2/3_weights
        ├── templates_fixture.yaml
        ├── i18n_fixture.yaml
        └── malformed/
            ├── templates_missing_slot.yaml
            ├── templates_no_domain.yaml
            ├── templates_bad_step.yaml
            ├── templates_missing_ta_variant.yaml
            ├── templates_nfd.yaml
            └── templates_long_name_lang_key.yaml
```
---
## 7. Sanity Checks (for the implementer)
Before declaring `task_generator.py` done:
1. `pytest tests/test_task_generator.py -v` — all 36 unit tests pass (U1–U30, U34–U39).
2. `pytest tests/test_task_generator_properties.py -v` — all 6 properties pass (including the 200,000-seed walk under `-m slow`).
3. `pytest tests/test_task_generator_integration.py -v` — all 5 integration tests pass against real YAML fixtures.
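4. `pytest --cov=driftcall.task_generator --cov-branch --cov-fail-under=95 --cov-report=term-missing` — 100% line, ≥ 95% branch. With `--cov-branch`, `--cov-fail-under` gates only the combined total, so confirm the line and branch figures from the report.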
5. `scripts/check_fixture_nfc.py` — NFC hook green on every YAML leaf.
6. `ruff check tests/test_task_generator*.py` — clean.
7. `mypy --strict tests/test_task_generator*.py` — clean (test code is type-checked too).
When all green, dispatch ≥ 2 fresh critic agents per CLAUDE.md §3.4. Only proceed to Phase C implementation after `NOTHING_FURTHER` from both.