below-threshold/ai-response-validator

mbochniak01 committed 27 days ago

Commit

10aced5

1 Parent(s): 3c949de

Add typed client library, unit + integration tests, mypy, ruff, NOTES.md

Files changed (17)

Makefile +15 -1
NOTES.md +86 -0
client/__init__.py +15 -0
client/client.py +118 -0
client/exceptions.py +24 -0
client/models.py +41 -0
mypy.ini +23 -0
pytest.ini +5 -0
requirements.txt +14 -0
ruff.toml +12 -0
tests/conftest.py +23 -0
tests/integration/__init__.py +0 -0
tests/integration/test_api.py +95 -0
tests/unit/__init__.py +0 -0
tests/unit/test_client.py +120 -0
tests/unit/test_grader.py +140 -0
tests/unit/test_rosetta.py +67 -0

Makefile CHANGED

@@ -1,4 +1,4 @@
-.PHONY: install run eval-retail eval-pharma eval open-retail open-pharma
+.PHONY: install run test test-integration lint type-check eval-retail eval-pharma eval
 install:
 	pip install -r requirements.txt
@@ -6,6 +6,20 @@ install:
 run:
 	cd backend && uvicorn app:app --reload --host 0.0.0.0 --port 8000
+# Unit tests only (default — no server required)
+test:
+	pytest tests/unit -v
+# Integration tests — requires running server (make run in another terminal)
+test-integration:
+	pytest tests/integration -v -m integration
+lint:
+	ruff check client/ backend/ tests/
+type-check:
+	mypy client/
 eval-retail:
 	python eval/metrics.py --domain retail
 	open eval/reports/report_retail.html 2>/dev/null || xdg-open eval/reports/report_retail.html

NOTES.md ADDED

	@@ -0,0 +1,86 @@

+# Design Notes
+## Key decisions and tradeoffs
+### API target: own implementation
+Instead of wrapping a third-party fake API, the client wraps this project's own
+FastAPI backend. This means the client and the API are co-designed — the typed
+models on both sides stay in sync by design. The tradeoff: less realistic than
+wrapping an external API you don't control, but the test surface is richer and
+the integration tests verify real business logic, not just HTTP plumbing.
+### Two-layer evaluation (L1 live / L2 batch)
+L1 runs on every query inline (~1-2s overhead). L2 runs offline against a golden
+dataset. The split is a deliberate latency/depth tradeoff: LLM-judged metrics
+(contextual precision, reverse-question relevancy) add 30+ seconds per pair —
+unacceptable live, fine in batch. The golden dataset is the contract; L2 is the
+regression gate.
+### Deterministic chain_terminology over LLM judge
+The terminology check is a dict lookup, not a model call. Zero latency, zero cost,
+zero false negatives on known mappings. The tradeoff: it only catches terms in the
+catalog — novel terminology drift goes undetected. An LLM judge would catch drift
+but would introduce latency and non-determinism into a metric that must be auditable.
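As a rough illustration of the dict-lookup approach described above, here is a minimal sketch. The catalog layout, the `FEATURE_FLAG` key, and the name `check_terminology_sketch` are assumptions made for this example; the terms themselves are the ones exercised in the unit tests, and the real `rosetta.py` may be organized differently.

```python
# Hypothetical catalog: concept key -> {client id -> that client's term}.
# The FEATURE_FLAG key is invented for illustration.
CATALOG: dict[str, dict[str, str]] = {
    "STOCK_CHECK": {"novamart": "availability scan", "shelfwise": "stock check"},
    "FEATURE_FLAG": {"novamart": "capability switch", "shelfwise": "feature toggle"},
}


def check_terminology_sketch(text: str, client: str) -> dict[str, object]:
    """Flag rival-client terms that appear without the client's own term."""
    lowered = text.lower()  # case-insensitive matching
    violations: list[dict[str, str]] = []
    checked = 0
    for terms in CATALOG.values():
        expected = terms.get(client)
        if expected is None:
            continue  # concept not defined for this client
        checked += 1
        if expected in lowered:
            continue  # correct term present; a rival term alongside it is tolerated
        for other_client, rival in terms.items():
            if other_client != client and rival != expected and rival in lowered:
                violations.append({"expected": expected, "found": rival})
    return {"pass": not violations, "violations": violations, "checked": checked}
```

Because the check is a pure function over a static dict, the same input always yields the same verdict, which is what makes the metric auditable.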
+### In-memory retrieval over vector database
+KB size is 8-9 docs per domain. Encoding them at startup and doing cosine search
+at query time adds ~2ms retrieval overhead with no infrastructure dependency.
+A vector DB (Chroma, pgvector) would add operational complexity with zero
+retrieval quality gain at this scale.
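The in-memory approach can be sketched in a few lines of standard-library Python. This is a toy under stated assumptions: the vectors are hand-written 2-d stand-ins for real sentence embeddings, and `cosine`, `top_k`, and `INDEX` are hypothetical names, not the project's actual pipeline code.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], index: dict[str, list[float]], k: int = 3) -> list[tuple[str, float]]:
    """Score every document against the query and return the k best (id, score) pairs."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 2-d "embeddings"; a real service would encode the KB once at startup.
INDEX = {"returns": [1.0, 0.0], "shipping": [0.0, 1.0], "refunds": [1.0, 1.0]}
```

With fewer than ten documents per domain, a linear scan like this visits every vector on each query, so there is no approximate-search error for a vector database to improve on.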
+### httpx + tenacity for the client
+`httpx` is the modern alternative to `requests`: native async support if needed
+later, cleaner timeout API, better type annotations. `tenacity` separates retry
+policy from request logic cleanly — the retry decorator is readable and testable
+independently from the HTTP code.
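To show what "retry policy separate from request logic" buys in testability, here is a standard-library stand-in for the idea. It is not tenacity's API: `with_retries`, `TransientError`, and the doubling delay schedule are assumptions made for this sketch. The point is that the policy can be exercised with a fake `sleep` and no HTTP at all.

```python
import time
from collections.abc import Callable
from functools import wraps
from typing import TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Marker for failures worth retrying (hypothetical name)."""


def with_retries(
    attempts: int,
    base_delay: float = 0.5,
    max_delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[[Callable[[], T]], Callable[[], T]]:
    """Retry a zero-argument callable on TransientError with capped doubling delays."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")

    def decorator(func: Callable[[], T]) -> Callable[[], T]:
        @wraps(func)
        def wrapper() -> T:
            for attempt in range(1, attempts + 1):
                try:
                    return func()
                except TransientError:
                    if attempt == attempts:
                        raise  # budget exhausted: surface the last failure
                    sleep(min(base_delay * 2 ** (attempt - 1), max_delay))
            raise AssertionError("unreachable")

        return wrapper

    return decorator
```

Injecting `sleep` lets a test record the delays instead of waiting for them, so the policy is verified in microseconds.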
+### Integration tests are read-only by design
+The API has no mutable state: queries don't persist, no records are created or
+deleted. Cleanup is therefore trivially satisfied — there is nothing to clean up.
+This is called out explicitly because it's a deliberate architectural choice, not
+an oversight. A stateful API (task creation, deletion) would require explicit
+teardown fixtures.
+---
+## What another 4 hours would add
+- **`eval/metrics.py` — L2 LLM metrics**: contextual precision (chunk ranking),
+  contextual recall (coverage), and answer correctness against full reference answers.
+  Currently only keyphrase coverage is used as a proxy.
+- **Async client**: `httpx.AsyncClient` variant for high-concurrency load testing.
+- **Property-based tests**: `hypothesis` to fuzz `check_terminology` and graders
+  with generated strings — catches edge cases the golden dataset doesn't cover.
+- **CI pipeline**: GitHub Actions running `make lint`, `make type-check`,
+  `make test` on every PR. Integration tests gated on a self-hosted runner with
+  the API running.
+- **Threshold calibration report**: plot the distribution of L1 metric scores
+  across the golden dataset to verify that current thresholds aren't too tight
+  or too loose.
+---
+## Where LLM assistance helped and where it misled
+**Helped:**
+- Scaffolding the full project structure (backend, client, tests, config) in a
+  single session without losing consistency across files.
+- Writing the faithfulness prompt in a way that reliably returns structured JSON —
+  the few-shot JSON format in the prompt was a suggested pattern that works.
+- Catching that `except Exception` in the faithfulness grader was too broad and
+  replacing it with `(json.JSONDecodeError, anthropic.APIError)`.
+- Identifying that `_build_index_by_domain` was defined twice in pipeline.py
+  (duplicate introduced during an edit session) — caught during code review.
+**Misled or required correction:**
+- Initially used `lru_cache` on a function that takes a `SentenceTransformer`
+  instance as an argument — unhashable, so the cache silently failed. Required
+  switching to a module-level dict cache.
+- Generated a dead loop in `rosetta.py` (iterating over terms with `continue`
+  but no code after the continue branch) that did nothing. The logic existed in
+  a comment describing intent but was never implemented. Caught in review.
+- The first README used `sdk: gradio` in the HuggingFace frontmatter — the Space
+  was created as Gradio before switching to Docker. LLM-generated config matched
+  the target architecture but not the actual HF Space state.
+- Suggested `ShelfWise` as a fictional client name — it is a real US company.
+  Required renaming to `ShelfWise`.
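The module-level dict cache mentioned in the `lru_cache` item above might look roughly like the following. The names `_INDEX_CACHE` and `get_index`, and the choice of the domain string as the cache key, are assumptions for illustration; the notes do not show the actual replacement.

```python
from collections.abc import Callable

# Module-level cache keyed by a plain string, so hashing the model object
# passed to the builder is never required.
_INDEX_CACHE: dict[str, list[str]] = {}


def get_index(domain: str, build: Callable[[str], list[str]]) -> list[str]:
    """Return the cached index for `domain`, building it on first use only."""
    if domain not in _INDEX_CACHE:
        _INDEX_CACHE[domain] = build(domain)
    return _INDEX_CACHE[domain]
```

Keying on the domain name keeps the expensive builder (and whatever model it closes over) out of the cache key entirely.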

client/__init__.py ADDED

	@@ -0,0 +1,15 @@

+from client.client import ValidatorClient
+from client.exceptions import APIError, RetryExhaustedError, TimeoutError, ValidatorError
+from client.models import ConfigResponse, MetricResult, QueryResponse, Source
+__all__ = [
+    "ValidatorClient",
+    "ValidatorError",
+    "APIError",
+    "TimeoutError",
+    "RetryExhaustedError",
+    "ConfigResponse",
+    "QueryResponse",
+    "MetricResult",
+    "Source",
+]

client/client.py ADDED

	@@ -0,0 +1,118 @@

+"""
+ValidatorClient — typed HTTP client for the AI Response Validator API.
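+Retry policy: exponential backoff on HTTP 500/502/503/504 and network errors,
+up to max_retries attempts in total. Timeouts are not retried; they raise TimeoutError.
+Timeouts: one value that httpx applies to each phase (connect, read, write, pool),
+configurable per instance.
+Auth: optional Bearer token forwarded as Authorization header (for future use).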
+"""
+import httpx
+from tenacity import (
+    retry,
+    retry_if_exception_type,
+    stop_after_attempt,
+    wait_exponential,
+    RetryError,
+)
+from client.exceptions import APIError, RetryExhaustedError, TimeoutError, ValidatorError
+from client.models import ConfigResponse, QueryRequest, QueryResponse
+DEFAULT_TIMEOUT = 30.0
+DEFAULT_MAX_RETRIES = 3
+_RETRY_STATUS_CODES = {500, 502, 503, 504}
+class ValidatorClient:
+    """Typed client for the AI Response Validator API."""
+    def __init__(
+        self,
+        base_url: str,
+        timeout: float = DEFAULT_TIMEOUT,
+        max_retries: int = DEFAULT_MAX_RETRIES,
+        api_key: str | None = None,
+    ) -> None:
+        headers = {"Accept": "application/json"}
+        if api_key:
+            headers["Authorization"] = f"Bearer {api_key}"
+        self._client = httpx.Client(
+            base_url=base_url.rstrip("/"),
+            timeout=timeout,
+            headers=headers,
+        )
+        self._max_retries = max_retries
+    def _request(self, method: str, path: str, **kwargs: object) -> httpx.Response:
+        """Execute an HTTP request with retry on transient server errors."""
+        @retry(
+            retry=retry_if_exception_type(_TransientError),
+            stop=stop_after_attempt(self._max_retries),
+            wait=wait_exponential(multiplier=0.5, min=0.5, max=10),
+            reraise=False,
+        )
+        def _attempt() -> httpx.Response:
+            try:
+                response = self._client.request(method, path, **kwargs)  # type: ignore[arg-type]
+            except httpx.TimeoutException as exc:
+                raise TimeoutError(str(exc)) from exc
+            except httpx.NetworkError as exc:
+                raise _TransientError(str(exc)) from exc
+            if response.status_code in _RETRY_STATUS_CODES:
+                raise _TransientError(f"HTTP {response.status_code}")
+            if response.is_error:
+                detail = _extract_detail(response)
+                raise APIError(response.status_code, detail)
+            return response
+        try:
+            return _attempt()
+        except RetryError as exc:
+            last = exc.last_attempt.exception()
+            raise RetryExhaustedError(self._max_retries, last) from exc
+    def get_config(self) -> ConfigResponse:
+        """Return domain and client configuration (unauthenticated)."""
+        response = self._request("GET", "/config")
+        return ConfigResponse.model_validate(response.json())
+    def query(self, question: str, client_id: str) -> QueryResponse:
+        """Submit a question for a specific client and return a graded response."""
+        payload = QueryRequest(query=question, client=client_id)
+        response = self._request(
+            "POST",
+            "/query",
+            json=payload.model_dump(),
+        )
+        return QueryResponse.model_validate(response.json())
+    def health(self) -> bool:
+        """Return True if the API is reachable and healthy."""
+        try:
+            response = self._request("GET", "/health")
+            return response.json().get("status") == "ok"
+        except ValidatorError:
+            return False
+    def close(self) -> None:
+        self._client.close()
+    def __enter__(self) -> "ValidatorClient":
+        return self
+    def __exit__(self, *_: object) -> None:
+        self.close()
+class _TransientError(Exception):
+    """Internal marker for errors that should trigger a retry."""
+def _extract_detail(response: httpx.Response) -> str:
+    try:
+        return str(response.json().get("detail", response.text))
+    except Exception:
+        return response.text

client/exceptions.py ADDED

	@@ -0,0 +1,24 @@

+class ValidatorError(Exception):
+    """Base exception for all client errors."""
+class APIError(ValidatorError):
+    """HTTP error returned by the API (4xx / 5xx)."""
+    def __init__(self, status_code: int, detail: str) -> None:
+        self.status_code = status_code
+        self.detail = detail
+        super().__init__(f"HTTP {status_code}: {detail}")
+class TimeoutError(ValidatorError):
+    """Request exceeded the configured timeout."""
+class RetryExhaustedError(ValidatorError):
+    """All retry attempts failed."""
+    def __init__(self, attempts: int, last_error: Exception) -> None:
+        self.attempts = attempts
+        self.last_error = last_error
+        super().__init__(f"Failed after {attempts} attempts: {last_error}")

client/models.py ADDED

	@@ -0,0 +1,41 @@

+from pydantic import BaseModel, Field
+class QueryRequest(BaseModel):
+    query: str
+    client: str
+class MetricResult(BaseModel):
+    passed: bool
+    score: float = Field(ge=0.0, le=1.0)
+    detail: str
+class EvaluationResult(BaseModel):
+    overall_pass: bool
+    metrics: dict[str, MetricResult]
+class Source(BaseModel):
+    id: str
+    title: str
+    score: float = Field(ge=0.0, le=1.0)
+class QueryResponse(BaseModel):
+    query: str
+    client: str
+    client_display: str
+    answer: str
+    sources: list[Source]
+    evaluation: EvaluationResult
+class ClientInfo(BaseModel):
+    id: str
+    display: str
+class ConfigResponse(BaseModel):
+    domains: dict[str, list[ClientInfo]]

mypy.ini ADDED

	@@ -0,0 +1,23 @@

+[mypy]
+python_version = 3.11
+strict = True
+warn_return_any = True
+warn_unused_ignores = True
+# Client library — strict checking
+[mypy-client.*]
+disallow_untyped_defs = True
+disallow_any_generics = True
+# Third-party stubs
+[mypy-tenacity.*]
+ignore_missing_imports = True
+[mypy-sentence_transformers.*]
+ignore_missing_imports = True
+[mypy-sklearn.*]
+ignore_missing_imports = True
+[mypy-yaml.*]
+ignore_missing_imports = True

pytest.ini ADDED

	@@ -0,0 +1,5 @@

+[pytest]
+addopts = -m "not integration"
+markers =
+    integration: requires a running API server (use: make test-integration)
+testpaths = tests

requirements.txt CHANGED

@@ -1,3 +1,4 @@
+# API
 anthropic>=0.40.0
 fastapi>=0.115.0
 uvicorn[standard]>=0.30.0
@@ -6,3 +7,16 @@ sentence-transformers>=3.0.0
 scikit-learn>=1.5.0
 numpy>=1.26.0
 python-multipart>=0.0.9
+# Client
+httpx>=0.27.0
+tenacity>=8.3.0
+pydantic>=2.7.0
+# Testing
+pytest>=8.2.0
+pytest-mock>=3.14.0
+# Quality
+mypy>=1.10.0
+ruff>=0.4.0

ruff.toml ADDED

	@@ -0,0 +1,12 @@

+target-version = "py311"
+line-length = 100
+[lint]
+select = ["E", "F", "W", "I", "UP", "B", "C4", "RUF"]
+ignore = [
+    "E501",   # line too long — handled by formatter
+    "B008",   # function call in default argument (FastAPI pattern)
+]
+[lint.per-file-ignores]
+"tests/*" = ["S101"]  # assert is fine in tests

tests/conftest.py ADDED

	@@ -0,0 +1,23 @@

+import os
+from collections.abc import Iterator
+import pytest
+from client.client import ValidatorClient
+BASE_URL = os.environ.get("VALIDATOR_URL", "http://localhost:8000")
+def pytest_configure(config: pytest.Config) -> None:
+    config.addinivalue_line(
+        "markers",
+        "integration: marks tests that require a running API server (deselect with -m 'not integration')",
+    )
+@pytest.fixture(scope="session")
+def base_url() -> str:
+    return BASE_URL
+@pytest.fixture(scope="session")
+def api_client(base_url: str) -> Iterator[ValidatorClient]:
+    # A yield fixture is a generator function, so it is annotated as Iterator[...].
+    with ValidatorClient(base_url=base_url) as client:
+        yield client

tests/integration/__init__.py ADDED

File without changes

tests/integration/test_api.py ADDED

	@@ -0,0 +1,95 @@

+"""
+Integration tests — require a running API server.
+Run with: make test-integration
+Server URL: VALIDATOR_URL env var (default: http://localhost:8000)
+Design notes:
+- All tests are read-only: the API has no mutable state, so cleanup is N/A.
+- Tests are order-independent: no shared mutable fixtures.
+- Each test is self-contained: uses only the api_client session fixture.
+"""
+import pytest
+from client.client import ValidatorClient
+from client.exceptions import APIError
+from client.models import ConfigResponse, QueryResponse
+pytestmark = pytest.mark.integration
+@pytest.fixture(scope="module")
+def config(api_client: ValidatorClient) -> ConfigResponse:
+    return api_client.get_config()
+class TestHealth:
+    def test_api_is_reachable(self, api_client: ValidatorClient) -> None:
+        assert api_client.health() is True
+class TestConfig:
+    def test_returns_retail_and_pharma_domains(self, config: ConfigResponse) -> None:
+        assert "retail" in config.domains
+        assert "pharma" in config.domains
+    def test_each_domain_has_two_clients(self, config: ConfigResponse) -> None:
+        for domain, clients in config.domains.items():
+            assert len(clients) == 2, f"{domain} should have 2 clients"
+    def test_client_ids_are_strings(self, config: ConfigResponse) -> None:
+        for clients in config.domains.values():
+            for c in clients:
+                assert isinstance(c.id, str)
+                assert isinstance(c.display, str)
+class TestQuery:
+    def test_valid_query_returns_answer(self, api_client: ValidatorClient) -> None:
+        response = api_client.query(
+            "What happens when a product runs out of stock?",
+            "novamart",
+        )
+        assert isinstance(response, QueryResponse)
+        assert len(response.answer) > 0
+    def test_response_includes_all_five_metrics(self, api_client: ValidatorClient) -> None:
+        response = api_client.query("How do I add a new supplier?", "shelfwise")
+        metrics = response.evaluation.metrics
+        expected = {"pii_leakage", "token_budget", "answer_relevancy", "faithfulness", "chain_terminology"}
+        assert set(metrics.keys()) == expected
+    def test_metric_scores_are_in_valid_range(self, api_client: ValidatorClient) -> None:
+        response = api_client.query("What is prior authorization?", "clinixone")
+        for name, metric in response.evaluation.metrics.items():
+            assert 0.0 <= metric.score <= 1.0, f"{name} score out of range: {metric.score}"
+    def test_sources_are_returned(self, api_client: ValidatorClient) -> None:
+        response = api_client.query("How do compliance reports work?", "shelfwise")
+        assert len(response.sources) > 0
+        for source in response.sources:
+            assert source.title
+            assert 0.0 <= source.score <= 1.0
+    def test_client_display_name_matches_client_id(self, api_client: ValidatorClient) -> None:
+        response = api_client.query("What is formulary pre-approval?", "pharmalink")
+        assert response.client == "pharmalink"
+        assert response.client_display == "PharmaLink"
+    def test_unknown_client_raises_api_error(self, api_client: ValidatorClient) -> None:
+        with pytest.raises(APIError) as exc_info:
+            api_client.query("Any question", "nonexistent_client")
+        assert exc_info.value.status_code == 400
+    def test_empty_query_raises_api_error(self, api_client: ValidatorClient) -> None:
+        with pytest.raises(APIError) as exc_info:
+            api_client.query("   ", "novamart")
+        assert exc_info.value.status_code == 400
+    def test_pharma_query_uses_client_terminology(self, api_client: ValidatorClient) -> None:
+        response = api_client.query(
+            "How do I get approval before dispensing a drug?",
+            "pharmalink",
+        )
+        assert "formulary pre-approval" in response.answer.lower() or \
+               response.evaluation.metrics["chain_terminology"].passed

tests/unit/__init__.py ADDED

File without changes

tests/unit/test_client.py ADDED

	@@ -0,0 +1,120 @@

+"""
+Unit tests for ValidatorClient — all HTTP calls are mocked.
+Tests cover: error mapping, retry behavior, response parsing, timeout handling.
+"""
+import httpx
+import pytest
+from pytest_mock import MockerFixture
+from unittest.mock import MagicMock, patch
+from client.client import ValidatorClient
+from client.exceptions import APIError, RetryExhaustedError, TimeoutError
+@pytest.fixture
+def client() -> ValidatorClient:
+    return ValidatorClient(base_url="http://test.local", max_retries=2)
+def _mock_response(status_code: int, json_body: object) -> MagicMock:
+    response = MagicMock(spec=httpx.Response)
+    response.status_code = status_code
+    response.is_error = status_code >= 400
+    response.json.return_value = json_body
+    response.text = str(json_body)
+    return response
+class TestGetConfig:
+    def test_valid_response_parsed_to_model(self, client: ValidatorClient, mocker: MockerFixture) -> None:
+        mocker.patch.object(client._client, "request", return_value=_mock_response(200, {
+            "domains": {
+                "retail": [{"id": "novamart", "display": "NovaMart"}]
+            }
+        }))
+        config = client.get_config()
+        assert "retail" in config.domains
+        assert config.domains["retail"][0].id == "novamart"
+    def test_server_error_raises_retry_exhausted(self, client: ValidatorClient, mocker: MockerFixture) -> None:
+        # 500 is in _RETRY_STATUS_CODES, so it is retried and ends as RetryExhaustedError, never APIError.
+        mocker.patch.object(client._client, "request", return_value=_mock_response(500, {"detail": "boom"}))
+        with pytest.raises(RetryExhaustedError):
+            client.get_config()
+class TestQuery:
+    def test_valid_query_returns_response(self, client: ValidatorClient, mocker: MockerFixture) -> None:
+        mocker.patch.object(client._client, "request", return_value=_mock_response(200, {
+            "query": "test question",
+            "client": "novamart",
+            "client_display": "NovaMart",
+            "answer": "Here is the answer.",
+            "sources": [{"id": "retail_001", "title": "Stock Check", "score": 0.9}],
+            "evaluation": {
+                "overall_pass": True,
+                "metrics": {
+                    "pii_leakage": {"passed": True, "score": 1.0, "detail": "Clean"},
+                }
+            }
+        }))
+        response = client.query("test question", "novamart")
+        assert response.answer == "Here is the answer."
+        assert response.evaluation.overall_pass is True
+    def test_unknown_client_raises_api_error(self, client: ValidatorClient, mocker: MockerFixture) -> None:
+        mocker.patch.object(client._client, "request", return_value=_mock_response(400, {
+            "detail": "Unknown client: 'bogus'"
+        }))
+        with pytest.raises(APIError) as exc_info:
+            client.query("question", "bogus")
+        assert exc_info.value.status_code == 400
+    def test_empty_query_raises_api_error(self, client: ValidatorClient, mocker: MockerFixture) -> None:
+        mocker.patch.object(client._client, "request", return_value=_mock_response(400, {
+            "detail": "Query cannot be empty"
+        }))
+        with pytest.raises(APIError) as exc_info:
+            client.query("   ", "novamart")
+        assert exc_info.value.status_code == 400
+class TestRetryBehavior:
+    def test_retries_on_503_then_succeeds(self, client: ValidatorClient, mocker: MockerFixture) -> None:
+        call_count = 0
+        def side_effect(*args: object, **kwargs: object) -> MagicMock:
+            nonlocal call_count
+            call_count += 1
+            if call_count < 2:
+                return _mock_response(503, {"detail": "unavailable"})
+            return _mock_response(200, {"domains": {}})
+        mocker.patch.object(client._client, "request", side_effect=side_effect)
+        client.get_config()
+        assert call_count == 2
+    def test_exhausted_retries_raises_retry_exhausted(self, client: ValidatorClient, mocker: MockerFixture) -> None:
+        mocker.patch.object(client._client, "request", return_value=_mock_response(503, {}))
+        with pytest.raises(RetryExhaustedError) as exc_info:
+            client.get_config()
+        assert exc_info.value.attempts == client._max_retries
+class TestTimeoutHandling:
+    def test_timeout_raises_timeout_error(self, client: ValidatorClient, mocker: MockerFixture) -> None:
+        mocker.patch.object(
+            client._client, "request",
+            side_effect=httpx.TimeoutException("timed out")
+        )
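+        # health() catches ValidatorError (including TimeoutError) and returns False,
+        # so the timeout mapping is asserted through get_config() instead.
+        with pytest.raises(TimeoutError):
+            client.get_config()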
+class TestContextManager:
+    def test_client_closes_on_exit(self) -> None:
+        with patch("httpx.Client.close") as mock_close:
+            with ValidatorClient(base_url="http://test.local"):
+                pass
+            mock_close.assert_called_once()

tests/unit/test_grader.py ADDED

	@@ -0,0 +1,140 @@

+"""
+Unit tests for L1 graders — no network, no LLM calls.
+Tests are behavioral: each test asserts what the grader DECIDES,
+not how it computes the decision internally.
+"""
+import sys
+from pathlib import Path
+import pytest
+sys.path.insert(0, str(Path(__file__).parent.parent.parent / "backend"))
+from grader import (
+    grade_pii_leakage,
+    grade_token_budget,
+    grade_chain_terminology,
+    TOKEN_BUDGET,
+)
+# ── pii_leakage ──────────────────────────────────────────────────────────────
+class TestPiiLeakage:
+    def test_clean_response_passes(self) -> None:
+        result = grade_pii_leakage("Stock check is enabled for this retailer.")
+        assert result.passed is True
+        assert result.score == 1.0
+    def test_email_address_fails(self) -> None:
+        result = grade_pii_leakage("Contact ops@example.com for details.")
+        assert result.passed is False
+        assert "email" in result.detail
+    def test_ssn_pattern_fails(self) -> None:
+        result = grade_pii_leakage("Employee SSN: 123-45-6789 is on file.")
+        assert result.passed is False
+        assert "SSN" in result.detail
+    def test_phone_number_fails(self) -> None:
+        result = grade_pii_leakage("Call 555-867-5309 to reach the manager.")
+        assert result.passed is False
+        assert result.score == 0.0
+    def test_multiple_pii_types_all_reported(self) -> None:
+        result = grade_pii_leakage("Email ops@test.com or call 555-123-4567.")
+        assert result.passed is False
+        assert "email" in result.detail
+        assert "phone" in result.detail
+    def test_score_is_binary(self) -> None:
+        clean = grade_pii_leakage("No PII here.")
+        dirty = grade_pii_leakage("Email: a@b.com")
+        assert clean.score == 1.0
+        assert dirty.score == 0.0
+# ── token_budget ──────────────────────────────────────────────────────────────
+class TestTokenBudget:
+    def test_short_response_passes(self) -> None:
+        result = grade_token_budget("Short answer.")
+        assert result.passed is True
+        assert result.score == 1.0
+    def test_response_at_exact_budget_passes(self) -> None:
+        text = "a" * (TOKEN_BUDGET * 4)
+        result = grade_token_budget(text)
+        assert result.passed is True
+    def test_response_over_budget_fails(self) -> None:
+        text = "a" * (TOKEN_BUDGET * 4 + 4)
+        result = grade_token_budget(text)
+        assert result.passed is False
+        assert result.score < 1.0
+    def test_score_degrades_with_length(self) -> None:
+        moderate = grade_token_budget("a" * (TOKEN_BUDGET * 5))
+        extreme = grade_token_budget("a" * (TOKEN_BUDGET * 20))
+        assert moderate.score > extreme.score
+    def test_detail_reports_token_estimate(self) -> None:
+        result = grade_token_budget("hello world")
+        assert "tokens" in result.detail
+    def test_custom_budget_respected(self) -> None:
+        text = "a" * 40  # ~10 tokens
+        assert grade_token_budget(text, budget=100).passed is True
+        assert grade_token_budget(text, budget=5).passed is False
+# ── chain_terminology ─────────────────────────────────────────────────────────
+class TestChainTerminology:
+    def test_correct_client_term_passes(self) -> None:
+        result = grade_chain_terminology(
+            "Run an availability scan to check inventory levels.",
+            client="novamart",
+        )
+        assert result.passed is True
+    def test_rival_term_without_correct_term_fails(self) -> None:
+        # "stock check" is ShelfWise term for STOCK_CHECK — wrong for NovaMart
+        result = grade_chain_terminology(
+            "Run a stock check to see inventory levels.",
+            client="novamart",
+        )
+        assert result.passed is False
+        assert any(v["expected"] == "availability scan" for v in result.metadata["violations"])
+    def test_both_terms_present_does_not_flag(self) -> None:
+        # Response explains both — not a violation
+        result = grade_chain_terminology(
+            "Run an availability scan (also called stock check) to check inventory.",
+            client="novamart",
+        )
+        assert result.passed is True
+    def test_score_reflects_violation_ratio(self) -> None:
+        result = grade_chain_terminology(
+            "Run a stock check and use a feature toggle.",
+            client="novamart",
+        )
+        assert 0.0 <= result.score < 1.0
+    def test_clean_response_full_score(self) -> None:
+        result = grade_chain_terminology(
+            "This response uses no retail terminology at all.",
+            client="novamart",
+        )
+        assert result.score == 1.0
+    def test_pharma_client_rival_term_fails(self) -> None:
+        # "prior authorization" is ClinixOne term — wrong for PharmaLink
+        result = grade_chain_terminology(
+            "Submit a prior authorization request to get the drug approved.",
+            client="pharmalink",
+        )
+        assert result.passed is False
+        assert any(v["expected"] == "formulary pre-approval" for v in result.metadata["violations"])

tests/unit/test_rosetta.py ADDED

	@@ -0,0 +1,67 @@

+"""Unit tests for RosettaStone terminology translation and violation detection."""
+import sys
+from pathlib import Path
+sys.path.insert(0, str(Path(__file__).parent.parent.parent / "backend"))
+from rosetta import translate, client_terms, check_terminology
+class TestTranslate:
+    def test_known_key_returns_client_term(self) -> None:
+        assert translate("STOCK_CHECK", "novamart") == "availability scan"
+    def test_same_key_different_clients_differ(self) -> None:
+        novamart = translate("STOCK_CHECK", "novamart")
+        shelfwise = translate("STOCK_CHECK", "shelfwise")
+        assert novamart != shelfwise
+    def test_unknown_key_returns_none(self) -> None:
+        assert translate("NONEXISTENT_KEY", "novamart") is None
+    def test_pharma_client_returns_correct_term(self) -> None:
+        assert translate("DRUG_APPROVAL", "pharmalink") == "formulary pre-approval"
+        assert translate("DRUG_APPROVAL", "clinixone") == "prior authorization"
+class TestClientTerms:
+    def test_returns_full_mapping(self) -> None:
+        terms = client_terms("novamart")
+        assert isinstance(terms, dict)
+        assert "STOCK_CHECK" in terms
+        assert terms["STOCK_CHECK"] == "availability scan"
+    def test_different_clients_have_different_mappings(self) -> None:
+        nm = client_terms("novamart")
+        sw = client_terms("shelfwise")
+        assert nm["STOCK_CHECK"] != sw["STOCK_CHECK"]
+class TestCheckTerminology:
+    def test_correct_term_no_violation(self) -> None:
+        result = check_terminology("Please run an availability scan.", "novamart")
+        assert result["pass"] is True
+        assert result["violations"] == []
+    def test_rival_term_triggers_violation(self) -> None:
+        result = check_terminology("Please run a stock check.", "novamart")
+        assert result["pass"] is False
+        assert len(result["violations"]) >= 1
+    def test_violation_contains_expected_and_found(self) -> None:
+        result = check_terminology("Use a feature toggle to enable this.", "novamart")
+        violation = next(
+            (v for v in result["violations"] if v["found"] == "feature toggle"), None
+        )
+        assert violation is not None
+        assert violation["expected"] == "capability switch"
+    def test_checked_count_matches_catalog(self) -> None:
+        terms = client_terms("novamart")
+        result = check_terminology("Neutral response with no domain terms.", "novamart")
+        assert result["checked"] == len(terms)
+    def test_case_insensitive_matching(self) -> None:
+        result = check_terminology("Run a STOCK CHECK now.", "novamart")
+        assert result["pass"] is False