| # env_tests.md — Test Plan for `driftcall/env.py` | |
| **Module under test:** `driftcall/env.py` (class `DriftCallEnv`) | |
| **Design doc:** `DRIFTCALL/docs/modules/env.md` (final sealed, 9-section spec) | |
| **Owner:** Person B (Rewards & Tests); reviewed by Person A (Environment) | |
| **Implements test coverage for:** DESIGN.md §4 (OpenEnv Interface), §4.2–4.5 (reset/step/budget), §6.2 (drift trigger), §7 (reward invariants), §9.4 (audio boundary), §11.1 (one env per session) | |
| **Framework:** `pytest` + `hypothesis` (+ `pytest-cov`) | |
| **Coverage tool:** `pytest --cov=driftcall.env --cov-branch --cov-report=term-missing` | |
| **Status:** Test plan — pre-critic-gate | |
| **Last updated:** 2026-04-24 | |
| **Training path constraint:** All tests are **CUDA-free** (text-only). Audio-boundary tests use in-process stub engines — no Kokoro / Whisper model loads, no network, no disk writes. | |
| This plan specifies **100% line coverage** and **≥ 95% branch coverage** on `driftcall/env.py`. Every behavior clause in `env.md §2–§3`, every error mode `E1–E12` in `env.md §5`, every edge case in `env.md §7`, and every worked example in `env.md §8` has at least one dedicated test. Fixtures are **shared** with `docs/tests/deploy_env_space_tests.md` and reuse factories already defined in `models_tests.md`, `vendors_tests.md`, `drift_injector_tests.md`, `task_generator_tests.md`, and `rewards_tests.md` — single source of truth in `tests/conftest.py`. | |
| **Test count target:** ≥ 25 unit + ≥ 5 property + 4 integration = **34 cases minimum**; inventory below sums to **45** (35 unit + 6 property + 4 integration). | |
| --- | |
| ## 0. Scope & Contract | |
| Covered (public surface of `DriftCallEnv` + `EnvConfig.from_mapping`): | |
| - `DriftCallEnv.__init__(config)` — config validation, unknown-key rejection, mutually-exclusive fields | |
| - `reset(seed)` — deterministic trajectory, curriculum_stage derivation, language_weights propagation, `audio_boundary_enabled` toggle invokes `tts_engine.synthesize` | |
| - `step(action)` — full pipeline ordering per env.md §2.3: (1a pure `_validate_action` → 1b caller handles repeated failures → 2 turn increment → 3 drift fold → 4 side-channel emit → 5 dispatch → 6 `dataclasses.replace` record → 7 terminal check → 8 `compute_rewards` once → 9 observation) | |
| - `state()` — frozen reference return (no deepcopy), `E2` when unready | |
| - `close()` — idempotent, `E3` afterwards, does NOT free shared audio singletons (env.md §9 open question 7) | |
| - `episode()`, `rewards()`, `done()` — terminal-only gating, memoized return | |
| - All 12 typed exceptions in `driftcall.env.errors` rooted at `DriftCallEnvError` | |
| Not covered here (covered elsewhere, referenced only): | |
| - Vendor dispatch internals → `vendors_tests.md` | |
| - Drift pattern catalogue → `drift_injector_tests.md` | |
| - Reward arithmetic → `rewards_tests.md` | |
| - Sim-caller responder body → resolved via env.md §9 Q1 at critic gate; this plan only asserts the responder is deterministic `(seed, turn)`-keyed. | |
| --- | |
| ## 1. Unit tests (≥ 25 cases — inventory: 35) | |
| All unit tests live in `tests/test_env/`. Layout: | |
```
tests/test_env/
    __init__.py
    test_init_config_validation.py
    test_reset.py
    test_step_ordering.py
    test_step_validation_purity.py
    test_state_accessor.py
    test_close_idempotent.py
    test_terminal_accessors.py
    test_audio_boundary_toggle.py
    test_error_taxonomy.py
```
| ### 1.1 `__init__` + `EnvConfig.from_mapping` — config validation (9 cases) | |
| **Scope:** E1 `InvalidConfigError` on every malformed-config branch. `__init__` performs no I/O. | |
| | # | Name | Setup | Assertion | | |
| |---|---|---|---| | |
| | U1 | `test_init_default_config_ok` | `DriftCallEnv()` (no arg) | Succeeds. `env._config.curriculum_stage == 1`. `env._config.language_weights == {"en":0.4,"hinglish":0.4,"hi":0.1,"ta":0.05,"kn":0.05}`. `env._config.audio_boundary_enabled is False`. `env._state is None`. | | |
| | U2 | `test_init_rejects_unknown_key` | `DriftCallEnv({"curriculum_stage":1, "frobnicate":True})` | Raises `InvalidConfigError`; message contains `"frobnicate"` and the full allowed-key list. | | |
| | U3 | `test_init_rejects_invalid_stage` | Parametrized: `0, 4, -1, "1", 1.0, None` | Raises `InvalidConfigError` with `"curriculum_stage"`. | | |
| | U4 | `test_init_rejects_weights_wrong_sum` | `language_weights={"en":0.5,"hinglish":0.4}` (sum=0.9) | Raises `InvalidConfigError`; message cites `"sum"`. | | |
| | U5 | `test_init_rejects_weights_negative` | `language_weights={"en":0.6,"hinglish":0.5,"hi":-0.1}` | Raises `InvalidConfigError`; cites `"negative"`. | | |
| | U6 | `test_init_rejects_audio_enabled_missing_tts` | `audio_boundary_enabled=True, tts_engine=None, asr_engine=<stub>` | Raises `InvalidConfigError`; cites `"tts_engine"`. | | |
| | U7 | `test_init_rejects_audio_disabled_with_tts` | `audio_boundary_enabled=False, tts_engine=<stub>` | Raises `InvalidConfigError` ("tts_engine must be None when audio_boundary_enabled is False" — env.md §7.5). | | |
| | U8 | `test_init_is_pure_no_io` | Patch `builtins.open`, `socket.socket`, and `os.urandom` to raise. `DriftCallEnv({"curriculum_stage":2})`. | Succeeds without invoking any patched callable. Asserts env.md §2.1 "no I/O, no model load, no network call". | | |
| | U9 | `test_init_stores_frozen_config_copy` | Pass a mutable `weights` dict; mutate it after construction. | `env._config.language_weights` unchanged. `EnvConfig` instance has `__dataclass_params__.frozen is True`. | | |
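The validation rules exercised by U2–U5 and U9 can be illustrated with a minimal sketch. `EnvConfigSketch` and `InvalidConfigSketchError` are hypothetical stand-ins, not the real `EnvConfig` or `InvalidConfigError`; only two fields are modelled, and the tolerance on the weight sum is an assumption.

```python
import math
from dataclasses import dataclass, fields
from types import MappingProxyType
from typing import Any, Mapping


class InvalidConfigSketchError(ValueError):
    """Hypothetical stand-in for InvalidConfigError (E1)."""


@dataclass(frozen=True)
class EnvConfigSketch:
    """Hypothetical two-field miniature of EnvConfig."""

    curriculum_stage: int
    language_weights: Mapping[str, float]

    @classmethod
    def from_mapping(cls, raw: Mapping[str, Any]) -> "EnvConfigSketch":
        allowed = sorted(f.name for f in fields(cls))
        unknown = sorted(set(raw) - set(allowed))
        if unknown:
            # Name the offending keys and the full allowed list (U2).
            raise InvalidConfigSketchError(
                f"unknown config keys {unknown}; allowed keys: {allowed}"
            )
        stage = raw.get("curriculum_stage", 1)
        # An exact type check rejects "1", 1.0, None and bool (U3).
        if type(stage) is not int or stage not in (1, 2, 3):
            raise InvalidConfigSketchError(f"curriculum_stage invalid: {stage!r}")
        # Copy the caller's dict so later mutation cannot leak in (U9).
        weights = dict(raw.get("language_weights", {"en": 1.0}))
        if any(w < 0 for w in weights.values()):
            raise InvalidConfigSketchError("language_weights has a negative weight")
        if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-9):
            raise InvalidConfigSketchError("language_weights sum must be 1.0")
        return cls(curriculum_stage=stage, language_weights=MappingProxyType(weights))
```

Wrapping the copied dict in `MappingProxyType` is one way to make the stored weights read-only; the real module may achieve the same guarantee differently.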
| ### 1.2 `reset()` — trajectory setup (8 cases) | |
| | # | Name | Setup | Assertion | | |
| |---|---|---|---| | |
| | U10 | `test_reset_stage1_sets_max_turns_8` | `env=DriftCallEnv({"curriculum_stage":1}); obs=env.reset(seed=1)` | `env._state.max_turns == 8`. `obs.budget_remaining == 8`. `obs.turn == 0`. | | |
| | U11 | `test_reset_stage2_sets_max_turns_12` | stage 2 | `max_turns == 12`; `budget_remaining == 12`. | | |
| | U12 | `test_reset_stage3_sets_max_turns_16` | stage 3 | `max_turns == 16`; `budget_remaining == 16`. | | |
| | U13 | `test_reset_populates_curriculum_stage_on_state` | stage 2 | `env._state.stage == 2` (or equivalent attribute; matches env.md §4.3 `stage` field piped into `Episode`). | | |
| | U14 | `test_reset_passes_language_weights_to_task_generator` | Monkeypatch `task_generator.generate` to record args. `reset(seed=7)` with custom weights. | Recorded `language_weights` argument is **byte-identical** to `env._config.language_weights` (not merely equal-by-value). | | |
| | U15 | `test_reset_same_seed_same_goal_and_schedule` | `env.reset(seed=42)` twice (construct two envs) | `obs_a.goal == obs_b.goal`; `env_a._state.drift_schedule == env_b._state.drift_schedule`; `env_a._state.vendor_states == env_b._state.vendor_states`. | | |
| | U16 | `test_reset_none_seed_populates_from_urandom` | `reset(seed=None)` | `env._seed` is an `int`. Two calls produce different `_seed` with high probability (assert inequality across 3 calls — tolerates 1-in-2^64 flake). | | |
| | U17 | `test_reset_audio_boundary_enabled_invokes_tts_synthesize` | Stub `tts_engine` with a recording synthesize. `audio_boundary_enabled=True`. `reset(seed=11)`. | Stub recorded **exactly one** call with args `(goal.seed_utterance, goal.language)`. `obs.last_transcript == obs.goal.seed_utterance` (canonical source unchanged — env.md §3.7 clause 1). | | |
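The seed handling that U15 and U16 pin down can be sketched as follows. Both function names are hypothetical, and the 64-bit width is an assumption inferred from the "1-in-2^64" flake tolerance above; the real `reset` may derive its seed differently.

```python
import os
import random
from typing import Optional, Sequence


def resolve_seed(seed: Optional[int]) -> int:
    """Return the caller's seed, or draw 64 bits from the OS when it is None."""
    if seed is None:
        return int.from_bytes(os.urandom(8), "big")
    return seed


def sample_domain(seed: int, domains: Sequence[str]) -> str:
    """Pick a domain deterministically from the seed (hypothetical helper)."""
    # A private Random instance keeps the draw independent of global RNG state,
    # so two envs reset with the same seed agree regardless of test order.
    return random.Random(seed).choice(list(domains))
```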
| ### 1.3 `step()` — pipeline ordering (7 cases) | |
| Every case instruments the env by monkeypatching private helpers (`_validate_action`, `_fire_drifts`, `_emit_side_channel`, `_dispatch`, `_record_action`, `_check_terminal`, `_build_observation`) to append their names to a shared `call_log` list, which proves the call order. | |
| | # | Name | Setup | Assertion | | |
| |---|---|---|---| | |
| | U18 | `test_step_validates_before_any_mutation` | Valid stage-1 env after `reset`. Issue a valid TOOL_CALL. | `call_log == ["_validate_action", "_fire_drifts", "_emit_side_channel", "_dispatch", "_record_action", "_check_terminal", "_build_observation"]` — this is the env.md §2.3 order. | | |
| | U19 | `test_step_increments_turn_after_validate_before_dispatch` | Valid TOOL_CALL. | `obs.turn == 1` post-step. Turn counter bump occurs between `_validate_action` and `_fire_drifts` (per env.md §2.3 step 2). Instrumented via snapshot of `self._state.turn` inside stubbed `_fire_drifts`. | | |
| | U20 | `test_step_fires_drifts_before_dispatch` | Scripted scheduler fires `airline.price_rename` at turn 1. Agent action: `TOOL_CALL airline.search` at turn 1. | `obs.tool_results[-1].schema_version == "v2"` (tool saw post-drift schema). `obs.drift_log[-1].pattern_id == "airline.price_rename"`. | | |
| | U21 | `test_step_records_action_via_dataclasses_replace` | Valid TOOL_CALL. | `prev_state = env._state; env.step(a); next_state = env._state`. Assert `prev_state is not next_state`, `id(prev_state.actions) != id(next_state.actions)`, `next_state.actions == prev_state.actions + (a,)`. | | |
| | U22 | `test_step_checks_terminal_after_record` | Stage-1 env; issue 8 benign SPEAK actions (budget=8). | 8th step: `env.done() is True`. `env.episode().terminated_by == "TIMEOUT"`. Turn counter = 8. | | |
| | U23 | `test_step_submit_calls_compute_rewards_exactly_once` | Monkeypatch `rewards.compute_rewards` with a recorder. Issue TOOL_CALL then SUBMIT. | Recorder called **once**. `env.rewards()` returns the exact object the recorder produced. A second call to `env.rewards()` returns the **same identity** (memoized — env.md §3.6). | | |
| | U24 | `test_step_abort_forces_r1_zero` | `reset(seed=1)`; `step(ABORT)`. | `env.episode().terminated_by == "ABORT"`. `env.rewards().r1 == 0.0`. R2…R5 still computed (non-None). | | |
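The call-order instrumentation used throughout this subsection can be sketched on a toy pipeline. `PipelineSketch` and `instrument` are hypothetical; the toy models only four of the helpers, with no-op bodies.

```python
from typing import List


class PipelineSketch:
    """Hypothetical stand-in for the step() pipeline; helpers are no-ops."""

    def _validate_action(self, action: str) -> None:
        pass

    def _fire_drifts(self) -> None:
        pass

    def _dispatch(self, action: str) -> None:
        pass

    def _record_action(self, action: str) -> None:
        pass

    def step(self, action: str) -> None:
        self._validate_action(action)
        self._fire_drifts()
        self._dispatch(action)
        self._record_action(action)


def instrument(obj: object, names: List[str], call_log: List[str]) -> None:
    """Wrap each named method so it logs its name, then runs the original."""
    for name in names:
        original = getattr(obj, name)  # bound method

        # Default arguments pin the current name/original for this wrapper,
        # avoiding the late-binding closure pitfall inside the loop.
        def wrapper(*args, _name=name, _original=original, **kwargs):
            call_log.append(_name)
            return _original(*args, **kwargs)

        setattr(obj, name, wrapper)  # instance attribute shadows the method
```

In a pytest test, `monkeypatch.setattr(env, name, wrapper)` would replace the bare `setattr` so that the patches are undone automatically at teardown.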
| ### 1.4 `_validate_action` purity & `InvalidActionError` (4 cases) | |
| These cases pin env.md §3.5 / E4 behavior: `_validate_action` raises **before** any mutation; env remains valid for a subsequent `step()`. | |
| | # | Name | Setup | Assertion | | |
| |---|---|---|---| | |
| | U25 | `test_invalid_action_raises_no_state_mutation` | Valid stage-1 env. Snapshot `prev_state = env._state`. Call `env.step(DriftCallAction(action_type=TOOL_CALL, tool_name="airline.search"))` with `tool_args=None` (required dict). | Raises `InvalidActionError`. `env._state is prev_state`. `env._state.turn == prev_state.turn`. `len(env._state.actions) == len(prev_state.actions)`. `env._state.done is False`. No Rewards cached (`env._rewards is None`). | | |
| | U26 | `test_env_valid_after_invalid_action` | U25's env, then issue a **valid** TOOL_CALL. | Succeeds. `env._state.turn == 1`. Observation returned normally. Proves env is still steppable. | | |
| | U27 | `test_invalid_action_no_drift_fired_no_terminal_marker` | Scripted scheduler places drift at turn 1. Attempt invalid action. | Raises `InvalidActionError`. `env._state.drift_fired == ()`. `env.done() is False`. The drift did NOT fire (drift firing is inside step 3, after validate). | | |
| | U28 | `test_oversize_rationale_raises_invalid_action` | `DriftCallAction(action_type=SUBMIT, confidence=0.5, rationale="x"*201)` | Raises `InvalidActionError` with `"rationale"`. State unchanged (repeat U25's state-preservation asserts). | | |
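The state-preservation assertions in U25–U28 rely on validation running before any assignment to the state, and on transitions producing a new frozen object. A minimal sketch, assuming a two-field state (`StateSketch`, `EnvSketch`, and the error class are hypothetical):

```python
from dataclasses import dataclass, replace
from typing import Tuple


class InvalidActionSketchError(ValueError):
    """Hypothetical stand-in for InvalidActionError (E4)."""


@dataclass(frozen=True)
class StateSketch:
    turn: int = 0
    actions: Tuple[str, ...] = ()


class EnvSketch:
    def __init__(self) -> None:
        self._state = StateSketch()

    def step(self, action: str) -> StateSketch:
        # 1. Pure validation: raises before self._state is touched.
        if not action:
            raise InvalidActionSketchError("action must be non-empty")
        # 2. Build a successor state; the previous object is never mutated.
        self._state = replace(
            self._state,
            turn=self._state.turn + 1,
            actions=self._state.actions + (action,),
        )
        return self._state
```

Because a failed validation leaves `self._state` bound to the same object, the tests can assert identity (`is`) rather than field-by-field equality.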
| ### 1.5 `state()` — frozen reference (2 cases) | |
| | # | Name | Setup | Assertion | | |
| |---|---|---|---| | |
| | U29 | `test_state_returns_frozen_reference` | Post-reset env. | `env.state() is env._state`. `env.state().__dataclass_params__.frozen is True`. Attempting `env.state().turn = 99` raises `dataclasses.FrozenInstanceError`. | | |
| | U30 | `test_state_unready_raises_e2` | Fresh `DriftCallEnv()` without reset. | `env.state()` raises `EnvNotReadyError`. `env.done() is False` (not an error — env.md §7.1). | | |
| ### 1.6 `close()` — idempotency (2 cases) | |
| | # | Name | Setup | Assertion | | |
| |---|---|---|---| | |
| | U31 | `test_close_idempotent` | `env.close(); env.close(); env.close()` | No exception; `env._closed is True` after first call and stays True. | | |
| | U32 | `test_close_does_not_free_shared_audio_engines` | Build env with `audio_boundary_enabled=True` and stub TTS/ASR engines. `env.close()`. | `env._closed is True`; `env._state is None`; the stub engines expose no `close()` method at all (`assert not hasattr(tts_stub, "close")` and same for `asr_stub`) — env.md §9 Q7: engines are process-global singletons, and audio.md §2.1–2.2 define no `close()` on `TTSEngine`/`ASREngine`. | | |
| ### 1.7 Terminal-only accessors + error taxonomy (3 cases) | |
| | # | Name | Setup | Assertion | | |
| |---|---|---|---| | |
| | U33 | `test_episode_before_terminal_raises_e6` | Post-reset, mid-episode. | `env.episode()` raises `EpisodeNotTerminalError`. Same for `env.rewards()`. `env.done() is False`. | | |
| | U34 | `test_double_submit_raises_e5` | Submit, then attempt another step. | Second `step(...)` raises `EpisodeAlreadyTerminalError` (E5 — env.md §7.2). `env.done()` still True. Rewards object identity preserved. | | |
| | U35 | `test_all_12_errors_derive_from_driftcallenverror` | Introspect `driftcall.env.errors`. | The set `{InvalidConfigError, EnvNotReadyError, EnvClosedError, InvalidActionError, EpisodeAlreadyTerminalError, EpisodeNotTerminalError, ConcurrentStepError, UnknownDomainError, UnknownToolError, DriftInjectionError, RewardComputationError, AudioPipelineError}` each subclass `DriftCallEnvError` which subclasses `Exception`. Count is exactly 12. | | |
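The introspection in U35 can be sketched against a miniature module. The module object below is a hypothetical three-class subset built in-process; the real test would import `driftcall.env.errors` and expect twelve classes.

```python
import inspect
import types


class DriftCallEnvError(Exception):
    """Root of the taxonomy."""


class InvalidConfigError(DriftCallEnvError):
    pass


class EnvNotReadyError(DriftCallEnvError):
    pass


class EnvClosedError(DriftCallEnvError):
    pass


# Hypothetical miniature of the errors module, populated by hand.
errors_sketch = types.ModuleType("errors_sketch")
for _cls in (DriftCallEnvError, InvalidConfigError, EnvNotReadyError, EnvClosedError):
    setattr(errors_sketch, _cls.__name__, _cls)


def typed_errors(module, root):
    """Return every class in module that strictly subclasses root."""
    return {
        name: obj
        for name, obj in inspect.getmembers(module, inspect.isclass)
        if issubclass(obj, root) and obj is not root
    }
```

Comparing the returned names against an explicit expected set catches both a missing exception and an extra, undocumented one.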
| --- | |
| ## 2. Property tests (≥ 5 — inventory: 6) | |
| Written with `hypothesis`. Strategies live in `tests/test_env/strategies.py` (shared with `test_rewards` where applicable). | |
| | # | Name | Property | Strategy | | |
| |---|---|---|---| | |
| | P1 | `test_step_is_pure_per_call` | For a fresh env `e1` and `e2` constructed with the same config and `reset(seed=s)`, given the same action sequence, every `step()` return is **equal** and the post-step states are **equal**. Same `(state, action) → (state', observation)`. | Seeds in `integers(0, 2**31-1)`; action sequences built from a `DriftCallAction` strategy over valid types; stage in `sampled_from([1,2,3])`. ≥ 200 examples. | | |
| | P2 | `test_validation_failure_preserves_pre_step_state` | For any env in a steppable state and any `DriftCallAction` that fails `_validate_action`: state after the raised `InvalidActionError` equals state before (by identity — `env._state is prev`). | Mixed-validity action strategy; hypothesis `assume()` filters to invalid ones. | | |
| | P3 | `test_turn_counter_monotone_non_decreasing` | Across any legal step sequence, `env._state.turn` is monotone non-decreasing; it **strictly increases** on every non-raising `step()` and is **unchanged** on every raised `InvalidActionError`. | Random action sequences up to length 20; assume `stage=3` to permit budget 16. | | |
| | P4 | `test_frozen_state_identity_changes_on_transition` | After every successful `step()`, `prev_state is not next_state` and `id(prev_state.actions) != id(next_state.actions)` whenever `len(next.actions) > len(prev.actions)`. (env.md §3.8 invariant.) | As P1. | | |
| | P5 | `test_rewards_memoized_identity` | After termination, `env.rewards() is env.rewards()` (identity, not just equality) across 10 calls. Same for `env.episode()`. | Parametrized over `terminated_by ∈ {"SUBMIT","ABORT","TIMEOUT"}`. | | |
| | P6 | `test_available_tools_fixed_for_episode` | The set `obs.available_tools` is equal across every observation in an episode, regardless of drifts fired. (env.md §3.4 clause 4.) | Random schedules over stage 2/3; ≥ 50 episodes. | | |
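The identity property in P5 (and the compute-once check in U23) follows from caching the rewards object at termination. A minimal stdlib-only sketch, with hypothetical class names and a plain `RuntimeError` standing in for E6:

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class RewardsSketch:
    r1: float


class TerminalCacheSketch:
    """Hypothetical holder that computes rewards once, at termination."""

    def __init__(self, compute: Callable[[], RewardsSketch]) -> None:
        self._compute = compute
        self._rewards: Optional[RewardsSketch] = None

    def terminate(self) -> None:
        if self._rewards is None:  # guard against a second computation
            self._rewards = self._compute()

    def rewards(self) -> RewardsSketch:
        if self._rewards is None:
            raise RuntimeError("episode not terminal")
        return self._rewards  # same object on every call
```

A recorder passed as `compute` gives the test both signals at once: the call count proves single computation, and `is` proves memoized identity.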
| --- | |
| ## 3. Integration tests (4 cases) | |
| Live in `tests/test_env/test_e2e.py`. These are **full episode traces** matching env.md §8 examples. All dependencies are real (real `task_generator`, real `drift_injector`, real vendors, real `rewards.compute_rewards`) — only the audio engines are stubbed in I4. | |
| | # | Name | Maps to | Scenario | | |
| |---|---|---|---| | |
| | I1 | `test_episode_stage1_airline_happy_submit` | env.md §8.1 | `DriftCallEnv({"curriculum_stage":1})`; `reset(seed=42)`. Replay the 5-turn script: `airline.search` → 3 more tool calls → `SUBMIT(confidence=0.9)`. Assertions: `env.done() is True`; `env.episode().terminated_by == "SUBMIT"`; `env.episode().turns_used == 5`; `obs.drift_log == ()`; `env.rewards().r1 == 1.0`; `env.rewards().r2 == 0.5` (stage-1 neutral); `env.rewards().reward` in `[0.85, 1.0]`. | | |
| | I2 | `test_episode_stage2_drift_detect_adapt` | env.md §8.2 | `stage=2; seed=7`. Scripted sequence through turn 6 terminating in SUBMIT. Drift `airline.price_rename` fires turn 3. Agent SPEAK at turn 4 mentions `"total_fare_inr"`. Assertions: `obs.drift_log[0].pattern_id == "airline.price_rename"`; `obs.drift_log[0].turn == 3`; `obs.tool_results[-2].response` references `"total_fare_inr"` (not `"price"`); `env.rewards().r1 == 1.0`; `env.rewards().r2 == 1.0`; `env.rewards().reward ≈ 0.90 ± 0.05`. | | |
| | I3 | `test_episode_stage3_compound_drift_timeout` | env.md §8.3 | `stage=3; seed=2026`. Script designed to consume all 16 turns. Two drifts fire (airline turn 3, payment turn 9). Assertions: `env.done() is True`; `env.episode().terminated_by == "TIMEOUT"`; `env.episode().turns_used == 16`; `env.rewards().r1 == 0.0`; `env.rewards().r2 in {0.5, 1.0}`; `env.rewards().reward < 0.3`. | | |
| | I4 | `test_episode_audio_boundary_enabled_stubs` | env.md §8.4 | `audio_boundary_enabled=True`, `tts_engine=StubTTS()`, `asr_engine=StubWhisper()` (contracts in §5 — signatures match `audio.md §2.1–2.2`). Stubs are in-process, CUDA-free, deterministic: `synthesize(text, language_code, voice_pack=None, *, seed=0, sample_rate_hz=16000) → f"WAV[{text}:{language_code}:{seed}:{sample_rate_hz}]".encode("utf-8")`; `transcribe(audio_bytes, language_hint, *, beam_size=1, vad_filter=True, max_duration_s=30.0) → TranscriptResult(text=<scripted>, language_detected="hinglish", confidence=0.82, duration_s=1.250)`. Episode: `reset(seed=11)` → `CLARIFY` → `TOOL_CALL` → `SUBMIT`. Assertions: stub TTS synthesize called on `reset` **and** on every CLARIFY/SPEAK side-channel emission; `obs.last_transcript` after CLARIFY equals the stubbed ASR text; `obs.last_confidence == 0.82`; reward computation is **100% textual** — no TTS bytes reach `compute_rewards` (verified by asserting `episode.actions` and `episode.tool_results` contain no `bytes` objects). | | |
| All integration tests reuse fixtures: | |
| - `goal_airline`, `goal_restaurant` — from `drift_injector_tests.md §5.2` (session-scoped `GoalSpec` instances) | |
| - `airline_v1`, `airline_v2`, `payment_v2`: from `vendors_tests.md §5.1` (per-domain aliases over `vendor_states_v{1,2,3}`; `payment_v2` is the state after the `payment.auth_scope_upgrade` drift) | |
| - `drift_patterns_fixture` — from `drift_injector_tests.md §5.1` (authoritative 20-pattern catalogue; individual events + compound schedules used by I2/I3 are defined locally in §5 below as `drift_event_airline_price_rename_turn3`, `drift_event_payment_auth_turn9`, and `schedule_stage3_compound`, because drift_injector_tests.md only ships the catalogue, not pre-composed schedules) | |
| - `episode_happy_airline`, `episode_timeout` — from `rewards_tests.md §5` (§5.1 and §5.4 respectively) | |
| - `valid_tool_call_action`, `valid_submit_action`, `valid_observation_reset` — from `models_tests.md §5.4` (the factory/instance fixtures used to assemble per-step action sequences) | |
| No integration test touches the network. No test loads a real Kokoro/Whisper model. | |
| --- | |
| ## 4. Coverage target | |
| **100% line coverage** and **≥ 95% branch coverage** on `driftcall/env.py` under `pytest --cov=driftcall.env --cov-branch --cov-report=term-missing`. | |
| ### 4.1 Error-mode coverage matrix (every E1–E12 raised at least once) | |
| | Code | Exception | Raised by which test | | |
| |---|---|---| | |
| | E1 | `InvalidConfigError` | U2 (unknown key), U3 (bad stage), U4 (weights sum), U5 (negative weight), U6 (missing TTS), U7 (forbidden TTS). Also raised from `reset` if a scripted scheduler produces an event with turn > max_turns; covered by the dedicated test `test_reset_scripted_bad_schedule_raises_e1`. | |
| | E2 | `EnvNotReadyError` | U30 (`state()`), plus `test_step_before_reset_raises_e2`, `test_episode_before_reset_raises_e2`. | | |
| | E3 | `EnvClosedError` | `test_step_after_close_raises_e3`, `test_reset_after_close_raises_e3`. | | |
| | E4 | `InvalidActionError` | U25, U26, U27, U28, plus per-ActionType parametrized cases: missing `tool_name` on TOOL_CALL, message len 0 and len 2001 on SPEAK, NUL byte in message on CLARIFY, missing `confidence` on SUBMIT, forbidden `tool_name` on ABORT. | | |
| | E5 | `EpisodeAlreadyTerminalError` | U34 (double SUBMIT). | | |
| | E6 | `EpisodeNotTerminalError` | U33. | | |
| | E7 | `ConcurrentStepError` | `test_reentrant_step_raises_e7` — stub a vendor `dispatch` that re-invokes `env.step(other_action)`; assert E7 raised on the inner call; assert outer state unchanged. | | |
| | E8 | `UnknownDomainError` | `test_probe_schema_unknown_domain_raises_e8` — PROBE_SCHEMA with `tool_name="spaceship"`. | | |
| | E9 | `UnknownToolError` | `test_tool_call_unknown_tool_raises_e9` — `tool_name="airline.teleport"`. | | |
| | E10 | `DriftInjectionError` | `test_drift_fold_error_propagates_e10` — scripted scheduler yields event with unknown `pattern_id`; env must not swallow. | | |
| | E11 | `RewardComputationError` | `test_reward_compute_error_propagates_e11` — monkeypatch `rewards.compute_rewards` to raise; env must surface. | | |
| | E12 | `AudioPipelineError` | `test_audio_pipeline_error_on_clarify` — stub ASR that raises on 2nd transcribe; assert E12 surfaces from `step(CLARIFY)`; episode does NOT terminate (env.md §5 E12 note). Second test: `test_audio_pipeline_error_on_reset_is_e1_class` — stub TTS that raises on `reset`; the env is unready afterwards per env.md §5 E12. | | |
| Total dedicated error-mode tests: 12 exceptions × ≥ 1 = 12 minimum; inventory covers 18 error-mode paths. | |
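The re-entrancy scenario behind `test_reentrant_step_raises_e7` can be sketched with a simple in-progress flag. This is an assumption about mechanism: the real env may detect re-entry differently, and every name below is hypothetical.

```python
class ConcurrentStepSketchError(RuntimeError):
    """Hypothetical stand-in for ConcurrentStepError (E7)."""


class GuardedEnvSketch:
    def __init__(self, dispatch):
        self._dispatch = dispatch  # callable invoked mid-step, like a vendor
        self._in_step = False
        self.turn = 0

    def step(self, action):
        if self._in_step:
            # Raised before any state is touched by the inner call.
            raise ConcurrentStepSketchError("step() re-entered")
        self._in_step = True
        try:
            self._dispatch(self, action)
            self.turn += 1  # reached only if dispatch returned normally
        finally:
            self._in_step = False  # always release, even if dispatch raises


def reentrant_dispatch(env, action):
    """A misbehaving vendor stub that calls back into the env."""
    env.step("inner")
```

With `reentrant_dispatch` installed, the inner call raises, the error propagates through the outer `step`, and the `finally` clause leaves the env steppable with its turn counter unchanged.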
| ### 4.2 Line/branch targets | |
| - `DriftCallEnv.__init__` — 100% line; 100% branch (both `config is None` and `config is dict` branches hit in U1, U2). | |
| - `EnvConfig.from_mapping` — 100% line; 100% branch (all 7 raise branches covered by U2–U7 + reset-bad-schedule). | |
| - `reset` — 100% line; step 7b audio branch covered by U17 (True) and U10 (False). | |
| - `step` — 100% line; all 6 ActionType dispatch branches (TOOL_CALL / SPEAK / CLARIFY / PROBE_SCHEMA / SUBMIT / ABORT) each have ≥ 1 unit test + integration coverage; drift-fold-empty vs non-empty both covered (I1 empty, I2 non-empty); terminal vs non-terminal both covered (U22 TIMEOUT, U23 SUBMIT, I1/I2/I3 mix). | |
| - `state`, `close`, `episode`, `rewards`, `done` — 100% line; all raise/early-return branches covered. | |
| - `_validate_action` — 100% line; every row of env.md §3.1 Table is parametrized (per-ActionType forbidden-field matrix). | |
| - `_build_observation` — 100% line; `last_transcript` branches for turn 0 vs mid-episode vs audio-enabled all covered (U17, I4, I1). | |
| Branch coverage < 95% is a **hard CI fail**. | |
| --- | |
| ## 5. Fixtures | |
| All fixtures defined in `tests/conftest.py` under the `env_*` namespace. Shared with `docs/tests/deploy_env_space_tests.md` (same names, same content). | |
| | Name | Scope | Purpose | Reuses | | |
| |---|---|---|---| | |
| | `env_stage1_airline` | function | `DriftCallEnv({"curriculum_stage":1})` already `reset(seed=42)`, goal forced to airline via scripted `task_generator` monkeypatch when hermetic goal needed. Provides `(env, obs0)` tuple. | `goal_airline` from `drift_injector_tests.md §5.2`; `airline_v1` from `vendors_tests.md §5.1`. | | |
| | `env_stage2_restaurant_drift` | function | Stage-2 env `reset(seed=7)` with restaurant goal, scripted scheduler that fires `restaurant.items_shape_bump` at turn 3. Returns `(env, obs0, drift_event)`. | `goal_restaurant` from `drift_injector_tests.md §5.2`; `drift_event_restaurant_items_shape_bump_turn3` defined below. | | |
| | `env_stage3_compound` | function | Stage-3 env `reset(seed=2026)`, scripted scheduler with compound drift (airline turn 3 + payment turn 9). Used by I3. | `schedule_stage3_compound` defined below; reuses `drift_patterns_fixture` catalogue from `drift_injector_tests.md §5.1`. | | |
| | `env_audio_enabled` | function | Stage-1 env with `audio_boundary_enabled=True`, `tts_engine=StubTTS()`, `asr_engine=StubWhisper()`. Stubs are pure Python, CUDA-free, deterministic. Returns `(env, tts_stub, asr_stub)` for assertions on call counts. | `StubTTS`, `StubWhisper` classes defined in `tests/stubs/audio_stubs.py`. | | |
| | `env_config_invalid_key` | function | `{"curriculum_stage":1, "frobnicate":True}` — a single malformed config dict reused across U2 and any critic-requested smoke test. | — | | |
| Stub engine contracts (pinned here for cross-doc consistency with `audio_tests.md`; signatures match `docs/modules/audio.md §2.1` and `§2.2` exactly): | |
```python
from driftcall.audio.asr_whisper import TranscriptResult
from driftcall.audio.tts_kokoro import VoicePack


class StubTTS:
    """In-process TTS double. Matches audio.md §2.1 `TTSEngine.synthesize` signature."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, str, VoicePack | None, int, int]] = []

    def synthesize(
        self,
        text: str,
        language_code: str,
        voice_pack: VoicePack | None = None,
        *,
        seed: int = 0,
        sample_rate_hz: int = 16000,
    ) -> bytes:
        self.calls.append((text, language_code, voice_pack, seed, sample_rate_hz))
        return f"WAV[{text}:{language_code}:{seed}:{sample_rate_hz}]".encode("utf-8")


class StubWhisper:
    """In-process ASR double. Matches audio.md §2.2 `ASREngine.transcribe` signature
    and the 4-field `TranscriptResult` contract (text, language_detected, confidence, duration_s)."""

    def __init__(self, scripted: dict[int, str] | None = None) -> None:
        self.calls: list[bytes] = []
        self._scripted = scripted or {}

    def transcribe(
        self,
        audio_bytes: bytes,
        language_hint: str | None,
        *,
        beam_size: int = 1,
        vad_filter: bool = True,
        max_duration_s: float = 30.0,
    ) -> TranscriptResult:
        self.calls.append(audio_bytes)
        turn = len(self.calls)  # 1-based index of this transcribe call
        return TranscriptResult(
            text=self._scripted.get(turn, "shaam ko, 7 baje"),
            language_detected="hinglish",
            confidence=0.82,
            duration_s=1.250,
        )
```
| Neither stub exposes a `.close()` method: `audio.md §2.1–2.2` defines no such method on `TTSEngine`/`ASREngine`, and the engines are process-global singletons (env.md §9 Q7) — U32 asserts `env.close()` does NOT invoke anything engine-side, so the stubs simply must not carry a `close()` attribute at all (U32's "call count is 0" is upgraded to `not hasattr(stub, "close")` to match the real contract). | |
| ### 5.1 Locally-defined drift events and schedules (not shipped by `drift_injector_tests.md`) | |
| `drift_injector_tests.md §5.1` publishes the 20-pattern catalogue (`drift_patterns_fixture`) but does NOT pre-compose per-test `DriftEvent` instances or full `DriftSchedule` objects; those are composed locally here because scheduling is an env-side concern. All four fixtures below are session-scoped. The three `DriftEvent` fixtures take `drift_patterns_fixture` to look up the authoritative pattern record, and `schedule_stage3_compound` composes two of them. | |
```python
import pytest

from driftcall.models import DriftEvent
from driftcall.drift_injector import DriftSchedule


@pytest.fixture(scope="session")
def drift_event_airline_price_rename_turn3(drift_patterns_fixture) -> DriftEvent:
    """Used by I2. Pattern id asserted byte-identical to drift_patterns_fixture entry."""
    pattern = next(p for p in drift_patterns_fixture if p.id == "airline.price_rename")
    return DriftEvent(
        turn=3,
        drift_type=pattern.drift_type,      # "schema"
        domain=pattern.domain,              # "airline"
        description=pattern.description,
        from_version=pattern.from_version,  # "v1"
        to_version=pattern.to_version,      # "v2"
    )


@pytest.fixture(scope="session")
def drift_event_restaurant_items_shape_bump_turn3(drift_patterns_fixture) -> DriftEvent:
    """Used by env_stage2_restaurant_drift. `restaurant.items_shape_bump` is the
    canonical restaurant schema drift per drift_injector.md §4.4 (items gain required `modifiers`)."""
    pattern = next(p for p in drift_patterns_fixture if p.id == "restaurant.items_shape_bump")
    return DriftEvent(
        turn=3,
        drift_type=pattern.drift_type,
        domain=pattern.domain,
        description=pattern.description,
        from_version=pattern.from_version,
        to_version=pattern.to_version,
    )


@pytest.fixture(scope="session")
def drift_event_payment_auth_turn9(drift_patterns_fixture) -> DriftEvent:
    """Used by I3. Pattern id `payment.auth_scope_upgrade` (Auth axis, drift_injector.md §4.4)."""
    pattern = next(p for p in drift_patterns_fixture if p.id == "payment.auth_scope_upgrade")
    return DriftEvent(
        turn=9,
        drift_type=pattern.drift_type,      # "auth"
        domain=pattern.domain,              # "payment"
        description=pattern.description,
        from_version=pattern.from_version,
        to_version=pattern.to_version,
    )


@pytest.fixture(scope="session")
def schedule_stage3_compound(
    drift_event_airline_price_rename_turn3,
    drift_event_payment_auth_turn9,
) -> DriftSchedule:
    """Used by I3. Two drifts, one per domain, matching env.md §8.3 worked example."""
    return DriftSchedule(events=(
        drift_event_airline_price_rename_turn3,
        drift_event_payment_auth_turn9,
    ))
```
| These three `DriftEvent`s plus one `DriftSchedule` are the only drift fixtures defined *in* `env_tests.md`, alongside the `env_*` fixtures tabulated in §5; everything else is imported from the sibling test plans cited in §3 above. | |
| **Fixture immutability rule:** if any field of any fixture changes here, the matching fixture in `deploy_env_space_tests.md §5` **must** be updated in the same commit — they share a single `conftest.py` definition. CI guards this via a grep-based pre-commit hook (`scripts/check_fixture_parity.sh`). | |