driftcall / docs /tests /models_tests.md
saumilyajj's picture
Upload folder using huggingface_hub
f2df60e verified

models_tests.md — Test Plan for driftcall/models.py

Owner: Person B (Rewards & Tests) Target module: DRIFTCALL/docs/modules/models.md (sealed) Implements coverage for: DESIGN.md §4.1 (Dataclasses), §4.2 (reset), §4.3 (step), §4.4 (termination), §7.1 (Reward inputs) Frameworks: pytest, hypothesis Status: DRAFT — pending ≥ 1 fresh critic round (test-plan gate is lighter per CLAUDE.md §3.2 Batch D4)


0. Scope & Non-goals

models.py is pure shape — frozen dataclasses + one Enum. There is no runtime behavior to integration-test. This plan therefore concentrates on:

  1. Constructibility — every dataclass can be instantiated with valid inputs and rejects malformed required fields.
  2. Immutability — every frozen guarantee raises FrozenInstanceError on mutation attempt.
  3. Type shape — tuples stay tuples, dicts stay dicts, Optionals default to None, Literals accept documented values.
  4. Hashability boundaries — the classes the doc declares hashable hash; the classes the doc declares unhashable raise TypeError.
  5. Round-trip equalitydataclasses.asdict + JSON encode + decode + reconstruct produces an equal object (the invariant in models.md §3.4).
  6. Invariant witnessing — properties from models.md §3.5 that can be checked by construction are asserted via hypothesis. Invariants enforced elsewhere (e.g. env.step validation) are NOT retested here; they are documented as N/A with a pointer to env_tests.md.

Every test below maps to one numbered clause in docs/modules/models.md. Clause references are embedded in the test docstring as models.md §X.Y / models.md §7 edge N.


1. Unit tests

All unit tests live in DRIFTCALL/tests/test_models.py. Import line under test:

from driftcall.models import (
    ActionType,
    DriftCallAction,
    ToolResult,
    DriftEvent,
    GoalSpec,
    DriftCallObservation,
    DriftCallState,
)

Fixtures (§5) are imported from tests/conftest.py.

1.1 ActionType (Enum)

# Test name Asserts Maps to
U1 test_action_type_members_exactly_six set(ActionType) == {TOOL_CALL, SPEAK, CLARIFY, PROBE_SCHEMA, SUBMIT, ABORT}; len(ActionType) == 6. models.md §4.1
U2 test_action_type_is_str_subclass isinstance(ActionType.TOOL_CALL, str); ActionType.TOOL_CALL == "tool_call"; json.dumps({"t": ActionType.SPEAK}) == '{"t": "speak"}' (string-mixed Enum serializes to its .value). models.md §3.4, §4.1
U3 test_action_type_values_match_spec ActionType.TOOL_CALL.value == "tool_call" and the other five values equal exactly the lowercase strings listed in the spec table. models.md §4.1
U4 test_action_type_is_hashable hash(ActionType.TOOL_CALL) returns an int; {ActionType.SPEAK, ActionType.SPEAK} == {ActionType.SPEAK} (set de-dupes). models.md §3.2 bullet 1

1.2 DriftCallAction

# Test name Asserts Maps to
U5 test_driftcall_action_happy_tool_call Build a TOOL_CALL action from valid_tool_call_action() fixture; assert all six fields carry the values passed in; action.confidence is None; action.message is None. models.md §4.2, §8.1
U6 test_driftcall_action_happy_submit Build a SUBMIT action with confidence=0.87; assert action.action_type is ActionType.SUBMIT, action.confidence == 0.87, and tool_name / tool_args / message are all None. models.md §4.2, §3.5 SUBMIT row
U7 test_driftcall_action_happy_speak Build a SPEAK action with Unicode message="मुझे कल दिल्ली जाना है"; assert action.message round-trips byte-for-byte. models.md §7 edge 3
U8 test_driftcall_action_frozen_mutation_raises Construct any action; assert action.tool_name = "x" raises dataclasses.FrozenInstanceError; same for every one of the 6 fields. Parametrized over all 6 field names. models.md §3.1, §5 row 1
U9 test_driftcall_action_defaults_are_none DriftCallAction(action_type=ActionType.ABORT) succeeds; every optional field defaults to None. models.md §2 (dataclass signature), §4.2
U10 test_driftcall_action_missing_required_raises DriftCallAction() raises TypeError (missing action_type). models.md §5 row 3
U11 test_driftcall_action_equality_value_based Two actions built with identical fields compare ==; changing any single field makes them !=. Covers all 6 fields via parametrize. models.md §3.2
U12 test_driftcall_action_unhashable_due_to_dict hash(valid_tool_call_action()) raises TypeError: unhashable type: 'dict'. models.md §3.2 bullet 2, §5 row 7
U13 test_driftcall_action_confidence_none_vs_zero DriftCallAction(action_type=SUBMIT, confidence=0.0) != DriftCallAction(action_type=SUBMIT, confidence=None); 0.0 is a real low confidence, None means absent. Proves the Optional[float] semantics are preserved by the dataclass (i.e. None is a distinguishable value, not coerced to 0.0). models.md §4.2, §7 edge 6

1.3 ToolResult

# Test name Asserts Maps to
U14 test_tool_result_happy_ok Build from valid_tool_result(status="ok"); fields round-trip; result.status == "ok"; result.response["results"] is a list. models.md §4.3, §8.2
U15 test_tool_result_frozen_mutation_raises result.status = "timeout" raises FrozenInstanceError; parametrized over all 5 fields. models.md §3.1, §5 row 1
U16 test_tool_result_accepts_all_five_statuses Parametrized over ["ok", "schema_error", "policy_error", "auth_error", "timeout"]; all construct successfully. models.md §4.3 status row
U17 test_tool_result_empty_response_on_non_ok valid_tool_result(status="schema_error", response={}) constructs without error (models.py does not validate; vendor-contract bug detection lives in test_vendors.py). models.md §7 edge 7, §5 row 6
U18 test_tool_result_unhashable_due_to_dict hash(valid_tool_result()) raises TypeError. models.md §3.2 bullet 2

1.4 DriftEvent

# Test name Asserts Maps to
U19 test_drift_event_happy_schema Build from valid_drift_event(turn=3, domain="airline"); all six fields match input; from_version="v1", to_version="v2". models.md §4.4, §8.4
U20 test_drift_event_frozen_mutation_raises drift.turn = 99 raises FrozenInstanceError; parametrized over all 6 fields. models.md §3.1
U21 test_drift_event_is_hashable hash(valid_drift_event()) returns an int; {drift, drift} == {drift}. Confirms primitive-only fields keep DriftEvent hashable (models.md §3.2 bullet 1 asserts this). models.md §3.2
U22 test_drift_event_accepts_all_five_drift_types Parametrized over ["schema", "policy", "tnc", "pricing", "auth"]; all construct successfully. models.md §4.4 drift_type row

1.5 GoalSpec

# Test name Asserts Maps to
U23 test_goal_spec_happy_hinglish Build from valid_goal_spec(); fields round-trip; goal.language == "hinglish"; goal.slots["from"] == "HYD". models.md §4.5, §8.3
U24 test_goal_spec_accepts_all_five_languages Parametrized over ["hi", "ta", "kn", "en", "hinglish"]; all construct successfully. models.md §4.5 language row
U25 test_goal_spec_frozen_mutation_raises goal.intent = "other" raises FrozenInstanceError. models.md §3.1
U26 test_goal_spec_unhashable_due_to_dict hash(valid_goal_spec()) raises TypeError. models.md §3.2 bullet 2
U27 test_goal_spec_unicode_seed_utterance seed_utterance="{when} அன்று விமானம்" (Tamil) survives json.dumps(..., ensure_ascii=False) + json.loads + reconstruction; reconstructed == original. models.md §7 edge 3

1.6 DriftCallObservation

# Test name Asserts Maps to
U28 test_observation_reset_state Build turn-0 observation with tool_results=(), drift_log=(), last_transcript="", last_lang="", last_confidence=1.0, budget_remaining=12. Assert isinstance(obs.tool_results, tuple), len(obs.tool_results) == 0, same for drift_log. models.md §7 edge 1, §7 edge 2, §8.3
U29 test_observation_tuple_not_list_for_sequences Passing tool_results=[valid_tool_result()] (a list) still stores a list on the dataclass (Python does not coerce) — the test asserts type(obs.tool_results) is list and then asserts that our documented contract requires callers to pass tuples by flagging this shape as a contract-violating fixture. The real enforcement is a separate test that constructs with tool_results=(valid_tool_result(),) and asserts isinstance(obs.tool_results, tuple). Covers both the "what Python does" and "what the contract requires" halves of models.md §3.1. models.md §3.1, §7 edge 1
U30 test_observation_frozen_mutation_raises obs.turn = 1 raises FrozenInstanceError; parametrized over all 9 fields. models.md §3.1
U31 test_observation_unhashable hash(obs) raises TypeError (observation contains GoalSpec which contains dicts). models.md §3.2 bullet 2
U32 test_observation_goal_ref_stable Two observations built with the same GoalSpec object yield obs_a.goal is obs_b.goal when passed the same instance; i.e. frozen=True does not deep-copy. Encodes the invariant that goal is copied by reference within an episode. models.md §4.6 goal row

1.7 DriftCallState

# Test name Asserts Maps to
U33 test_state_happy_turn_zero Build state with turn=0, actions=(), drift_fired=(), done=False; assert length invariant len(state.actions) == state.turn holds at construction; state.done is False. models.md §4.7, §3.5 len(actions) == turn row
U34 test_state_replace_appends_action Use dataclasses.replace(state, turn=state.turn+1, actions=state.actions + (action,)); assert original state is unchanged (state.turn == 0, state.actions == ()) and new state has turn == 1, actions[-1] is action. This witnesses the replace-don't-mutate pattern from models.md §3.3. models.md §3.3, §8.4
U35 test_state_replace_dict_field_builds_new_dict Call replace(state, schema_versions={**state.schema_versions, "airline": "v2"}); assert state.schema_versions["airline"] == "v1" (original untouched) and new_state.schema_versions["airline"] == "v2". Confirms the "always build new dict" convention does not rely on shared references. models.md §3.3
U36 test_state_frozen_mutation_raises state.done = True raises FrozenInstanceError; parametrized over all 10 fields. models.md §3.1
U37 test_state_unhashable hash(state) raises TypeError. models.md §3.2 bullet 2

1.8 Cross-cutting: JSON round-trip

# Test name Asserts Maps to
U38 test_action_json_roundtrip_equality For an action with Unicode message, nested-dict tool_args, and all optional fields set: a2 = DriftCallAction(**json.loads(json.dumps(dataclasses.asdict(a1), ensure_ascii=False))) then action_type is re-coerced to ActionType; a2 == a1. models.md §3.4 invariant
U39 test_tool_result_json_roundtrip_equality Same pattern for ToolResult with nested list-of-dict response. models.md §3.4, §7 edge 4
U40 test_observation_json_roundtrip_preserves_tuple_length After JSON round-trip (which makes tuples into lists), a reconstructor that wraps the sequence fields back into tuples yields obs2 == obs1. Documents the reconstructor contract app.py must uphold. models.md §3.4

Unit test count: 40.


2. Property tests (Hypothesis)

All property tests live in DRIFTCALL/tests/test_models_properties.py. Hypothesis settings: deadline=None, max_examples=200 (enough coverage for dataclass shape; no IO so runs are fast).

Shared strategies live in tests/strategies.py:

# strategies.py (sketch for reference — NOT part of this test-plan doc to implement)
import string
from hypothesis import strategies as st
from driftcall.models import ActionType

_versions = st.from_regex(r"^v[1-9]\d?$", fullmatch=True)
_languages = st.sampled_from(["hi", "ta", "kn", "en", "hinglish"])
_domains = st.sampled_from(["airline", "cab", "restaurant", "hotel", "payment"])
_drift_types = st.sampled_from(["schema", "policy", "tnc", "pricing", "auth"])
_statuses = st.sampled_from(["ok", "schema_error", "policy_error", "auth_error", "timeout"])

_json_scalar = st.one_of(st.none(), st.booleans(), st.integers(-1_000_000, 1_000_000),
                         st.floats(allow_nan=False, allow_infinity=False),
                         st.text(alphabet=st.characters(blacklist_categories=("Cs",)), max_size=32))
_json_dict = st.dictionaries(keys=st.text(alphabet=string.ascii_letters, min_size=1, max_size=8),
                             values=_json_scalar, max_size=4)

Each property below names its strategies and states the invariant.

# Property name Strategy Invariant Maps to
P1 test_state_turn_matches_len_actions_by_construction Generate turn ∈ [0, 16], then build actions tuple of exactly turn DriftCallAction instances; construct DriftCallState with those values. len(state.actions) == state.turn holds for every generated example (proves the invariant is witness-able by construction; actual runtime enforcement is in env_tests.md). This is the "turn monotone non-decreasing by construction" property at the state-boundary layer. models.md §3.5 state row, §7 edge 9
P2 test_drift_fired_is_subset_of_drift_schedule Generate a drift_schedule tuple of 0..2 DriftEvents with unique ascending turn; pick a random prefix length k ∈ [0, len(schedule)]; set drift_fired = drift_schedule[:k]. set(state.drift_fired).issubset(set(state.drift_schedule)) AND len(state.drift_fired) <= len(state.drift_schedule) AND state.drift_fired == state.drift_schedule[:len(state.drift_fired)] (prefix, order preserved). models.md §3.5 drift_fired ⊆ drift_schedule row, §4.7 drift_fired row
P3 test_probe_schema_tool_name_is_bare_domain Generate DriftCallAction(action_type=PROBE_SCHEMA, tool_name=domain) where domain comes from _domains. "." not in action.tool_name AND action.tool_name in {"airline","cab","restaurant","hotel","payment"}. Encodes the "PROBE_SCHEMA carries bare domain, NOT domain.verb" invariant from models.md §3.5. (Note: models.py itself does not validate this; the property asserts the TEST FIXTURES obey the contract — a regression-canary against a fixture drifting to "airline.search" by accident.) models.md §3.5 PROBE_SCHEMA row
P4 test_frozen_invariant_universal For every dataclass type in {DriftCallAction, ToolResult, DriftEvent, GoalSpec, DriftCallObservation, DriftCallState}, generate a valid instance using the type-specific strategy, then for each field name on the class, assert setattr(instance, field_name, ...) raises FrozenInstanceError. The frozen guarantee is universal across the module. Stronger than the per-class U8/U15/U20/U25/U30/U36 tests because it is strategy-driven rather than single-example. models.md §3.1, §5 row 1
P5 test_json_roundtrip_preserves_equality_for_action Generate DriftCallAction with random valid fields (Unicode text up to 200 chars for rationale, nested JSON dicts for tool_args, random ActionType). Serialize via json.dumps(dataclasses.asdict(a), ensure_ascii=False), parse, rebuild (re-coerce action_type to ActionType). Rebuilt action == original for every generated example. Verifies the models.md §3.4 round-trip invariant holds under adversarial shape generation. models.md §3.4 invariant
P6 test_observation_budget_non_negative_by_construction Generate max_turns ∈ [1, 16], turn ∈ [0, max_turns], then budget_remaining = max_turns - turn. Build observation. obs.budget_remaining >= 0 always. (models.py does not enforce; the property ensures our fixture arithmetic respects models.md §3.5 budget_remaining row.) models.md §3.5 budget row, §4.6 budget row

Property test count: 6.

Notes on properties intentionally not tested here:

  • "DriftCallState.turn monotone non-decreasing across a step sequence" — that is a runtime property of env.step, tested in env_tests.md. At the models.py layer the property is trivially "by construction a single instance has one turn value", which P1 already witnesses by rebuilding valid states at every turn.
  • "DriftEvent.from_version != to_version" — enforced by drift injector, not by models.py. Tested in drift_injector_tests.md.
  • "DriftCallAction.confidence ∈ [0, 1] when SUBMIT" — enforced by env.step, tested in env_tests.md.

Each of these deferrals matches models.md §3.5's "Enforced by" column (not models.py).


3. Integration tests

N/A — defer to env_tests.md.

models.py has no IO, no network, no filesystem, no cross-module composition. Every meaningful interaction (e.g. "step consumes action, emits observation") belongs to env.py and is tested in DRIFTCALL/docs/tests/env_tests.md once that test plan is authored. This section is deliberately empty; the pointer is the deliverable.


4. Coverage target

Line coverage: 100% on driftcall/models.py. Branch coverage: ≥ 95% on driftcall/models.py.

models.py is effectively branchless — the only "branches" are the implicit ones the dataclass decorator generates in __init__, __repr__, __eq__, and __hash__. Python's coverage.py with branch=True will report those auto-generated lines; our unit and property tests hit every one:

models.py region Hit by
class ActionType(str, Enum): ... — 6 member lines U1, U2, U3 (every member is named or sampled).
@dataclass(frozen=True) class DriftCallAction: ... — 6 field lines + auto-generated __init__ / __eq__ / __repr__ / __hash__ U5–U13, U38, P4, P5.
@dataclass(frozen=True) class ToolResult: ... — 5 field lines + auto-methods U14–U18, U39, P4.
@dataclass(frozen=True) class DriftEvent: ... — 6 field lines + auto-methods U19–U22, P2, P4.
@dataclass(frozen=True) class GoalSpec: ... — 6 field lines + auto-methods U23–U27, P4.
@dataclass(frozen=True) class DriftCallObservation: ... — 9 field lines + auto-methods U28–U32, U40, P4, P6.
@dataclass(frozen=True) class DriftCallState: ... — 10 field lines + auto-methods U33–U37, P1, P2, P4.
__all__ = [...] list Imported in every test module; trivially covered.

Verification command (per CLAUDE.md §6):

python3 -m pytest tests/test_models.py tests/test_models_properties.py \
    --cov=driftcall.models --cov-branch --cov-report=term-missing

Gate: if either line-coverage < 100% or branch-coverage < 95%, the PR is blocked and the missing lines/branches are added as new unit tests before proceeding.


5. Fixtures

All fixtures live in DRIFTCALL/tests/conftest.py and are imported by every test file that needs models. They are pytest fixtures (not plain functions) so hypothesis strategies can reuse them via st.builds. Each fixture returns a frozen dataclass built with spec-valid defaults; keyword arguments override specific fields for variation. Naming follows valid_<thing>[_variant] convention.

5.1 valid_goal_spec

import pytest
from driftcall.models import GoalSpec

@pytest.fixture
def valid_goal_spec() -> GoalSpec:
    """A spec-valid Hinglish airline booking goal. Matches models.md §8.3."""
    return GoalSpec(
        domain="airline",
        intent="book_flight",
        slots={"from": "HYD", "to": "BLR", "when": "2026-04-25"},
        constraints={"budget_inr": 8000, "time_window": "evening"},
        language="hinglish",
        seed_utterance="Bhai Friday ko Bangalore jaana hai, 8000 rupees max, 6pm ke baad",
    )

Reused by: U23, U26, U28, U33, P6, and every downstream test file (test_env.py, test_rewards.py, test_vendors.py).

5.2 valid_drift_event

from driftcall.models import DriftEvent

@pytest.fixture
def valid_drift_event_factory():
    """Factory fixture so tests can override turn/domain. Matches models.md §8.4."""
    def _build(turn: int = 3, domain: str = "airline") -> DriftEvent:
        return DriftEvent(
            turn=turn,
            drift_type="schema",
            domain=domain,
            description="field 'price' renamed to 'total_fare_inr'; 'currency' removed",
            from_version="v1",
            to_version="v2",
        )
    return _build

Also provided as a plain (non-factory) valid_drift_event fixture with defaults turn=3, domain="airline" for convenience.

Reused by: U19, U20, U21, U22, P2, and drift_injector_tests.md / env_tests.md.

5.3 valid_tool_result

from driftcall.models import ToolResult
from typing import Any

@pytest.fixture
def valid_tool_result_factory():
    """Factory fixture so tests can override status/response. Matches models.md §8.2."""
    def _build(
        status: str = "ok",
        response: dict[str, Any] | None = None,
        tool_name: str = "airline.search",
        schema_version: str = "v1",
        latency_ms: int = 142,
    ) -> ToolResult:
        if response is None:
            response = (
                {
                    "results": [
                        {
                            "flight_id": "6E-2345",
                            "from": "HYD",
                            "to": "BLR",
                            "depart": "2026-04-25T18:30:00+05:30",
                            "price": 7200,
                            "currency": "INR",
                            "seats_left": 14,
                        }
                    ]
                }
                if status == "ok"
                else {"error_code": status.upper()}
            )
        return ToolResult(
            tool_name=tool_name,
            status=status,  # type: ignore[arg-type]
            response=response,
            schema_version=schema_version,
            latency_ms=latency_ms,
        )
    return _build

A plain valid_tool_result fixture with status="ok" defaults is also exposed.

Reused by: U14, U15, U16, U17, U18, U29, U39, and test_vendors.py, test_rewards.py, test_env.py.

5.4 Supporting fixtures (bonus, for the downstream test files)

These are included here because the fixtures module must ship as one unit, and the briefing names them "reusable across other modules' tests":

  • valid_tool_call_action() — returns the §8.1 DriftCallAction (airline.search with slots). Consumed by U5, U6, U11, U34.
  • valid_submit_action(confidence: float = 0.87) — builds a SUBMIT action; consumed by U6, U13.
  • valid_observation_reset(valid_goal_spec) — builds the §8.3 turn-0 observation; consumed by U28, U29, U30, U31, U32, U40, P6.
  • valid_state_reset(valid_goal_spec) — builds a turn-0 DriftCallState with max_turns=12, empty action/drift tuples, done=False; consumed by U33, U34, U35, U36, U37, P1.

Fixture count (fixtures that must exist in conftest.py): 7.

  • valid_goal_spec
  • valid_drift_event (plain)
  • valid_drift_event_factory
  • valid_tool_result (plain)
  • valid_tool_result_factory
  • valid_tool_call_action
  • valid_submit_action
  • valid_observation_reset
  • valid_state_reset

(The briefing requires the first three by name; the remaining four are the minimal set required to write the 40 unit tests above without duplicating construction boilerplate. All 9 ship together.)


6. Execution plan

  1. conftest.py fixtures land first (no tests will even collect without them).
  2. test_models.py is authored top-down: U1 → U40, committed in 4 logical chunks (ActionType + DriftCallAction, ToolResult + DriftEvent, GoalSpec + DriftCallObservation, DriftCallState + JSON round-trip). Each chunk is a separate commit with passing pytest -v + coverage output.
  3. test_models_properties.py lands last, after all unit tests are green. Hypothesis shrinking behavior requires the fixtures and strategies to be stable first.
  4. Coverage gate: pytest --cov=driftcall.models --cov-branch must show 100% line / ≥ 95% branch before PR.
  5. Ruff + mypy --strict run on both test files (CLAUDE.md §6 commands). The test files themselves must obey the same hygiene bar as production code.

7. Traceability summary

models.md clause Covered by
§2 (Interface) U5–U37 (every dataclass built with full spec signature)
§3.1 Immutability U8, U15, U20, U25, U30, U36, P4
§3.2 Equality & hashing U4, U11, U12, U18, U21, U26, U31, U37
§3.3 replace pattern U34, U35
§3.4 Serialization round-trip U38, U39, U40, P5
§3.5 Field invariants (constructible-at-boundary subset) P1, P2, P3, P6
§4.1–4.7 (per-class field tables) U1–U37 (one class per subsection)
§5 Error modes U8, U10, U12, U15, U18, U20, U25, U26, U30, U31, U36, U37
§7 Edge 1 (empty tuples at turn 0) U28, U29
§7 Edge 2 (empty strings vs None) U28
§7 Edge 3 (Unicode round-trip) U7, U27, U38
§7 Edge 4 (nested tool_args) U38, U39
§7 Edge 5 (large history) N/A at models layer — deferred to env_tests.md §integration
§7 Edge 6 (SUBMIT + confidence=None) U13
§7 Edge 7 (non-ok with empty response) U17
§7 Edge 8 (from_version == to_version) N/A at models layer — deferred to drift_injector_tests.md
§7 Edge 9 (len(actions) != turn) P1 (witnesses constructibility of the good case; violation-detection deferred to env_tests.md)
§7 Edge 10 (auth-drifted tool still listed) N/A at models layer — deferred to env_tests.md
§8.1–8.4 (worked examples) U5, U14, U19, U28, U33, U34

Every models.md clause is either directly tested here or explicitly deferred to a named downstream test plan with a reason. No clause is silently unaddressed.


8. Open questions

None. The test surface for models.py is bounded by the module's pure-shape nature: 40 unit tests + 6 properties + 9 fixtures + 100%/95% coverage exhaustively cover every field, every frozen guarantee, every hashability rule, every JSON round-trip invariant, and every edge case that can be witnessed at construction time. All edge cases requiring runtime enforcement (env.step validation, drift injector rules) are explicitly routed to their home test plans.