Spaces: AshwinP/compounding-test

apingali and Claude Opus 4.7 (1M context) committed 9 days ago

Commit 21da7b6 (parent: 61393d7)

test(hf-space): cover diagnose() Premium path + Anthropic call shape

User asked: do we have tests verifying the Premium path actually
works, barring an invalid API key? The honest answer was no: the
existing 31 tests covered parse_response and the provider dispatcher,
but diagnose() and _call_anthropic had zero coverage. That is why my
Premium-key-override logic shipped unverified.

Added 11 unit tests + 1 opt-in integration test:

diagnose() input validation (4):
- empty description → "Please describe" friendly error
- <200 words → "at least 200 words" error with current count
- >5000 words → "under 5000 words" error
- unknown provider → "Unknown model provider" error
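A minimal sketch of the checks these four tests pin down, assuming
diagnose() validates in this order (empty, too short, too long,
unknown provider) and returns the message as its writeup. The helper
name validate_inputs, the KNOWN_PROVIDERS set, and the exact wording
are hypothetical; only the asserted substrings come from the tests.

```python
from typing import Optional

# Hypothetical limits and provider set; the real values live in app.py.
MIN_WORDS = 200
MAX_WORDS = 5000
KNOWN_PROVIDERS = {"zerogpu", "anthropic"}


def validate_inputs(description: str, provider: str) -> Optional[str]:
    """Return a friendly error string, or None if the inputs are usable."""
    if not description or not description.strip():
        return "Please describe your initiative before running the diagnosis."
    word_count = len(description.split())
    if word_count < MIN_WORDS:
        # Echo the current count so the user knows how far off they are.
        return f"Please write at least {MIN_WORDS} words (currently {word_count})."
    if word_count > MAX_WORDS:
        return f"Please keep the description under {MAX_WORDS} words (currently {word_count})."
    if provider not in KNOWN_PROVIDERS:
        return f"Unknown model provider: {provider!r}."
    return None


print(validate_inputs(" ".join(["word"] * 50), "zerogpu"))
# → Please write at least 200 words (currently 50).
```

Returning None on success keeps the caller to a single
`if error: return error, ""`, which lines up with the tests expecting
an empty json_str on every validation failure.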

diagnose() Premium (anthropic) path (6):
- no env key + no user key → "Premium needs an API key"
- whitespace-only user key → same friendly error
- env key only → dispatches; env key visible at backend call time
- user-supplied key OVERRIDES env key — the visitor-key-overrides-
Space-owner-key invariant for public Spaces
- backend raises → F14 wraps as markdown with provider/model/
exception name/detail, no stack trace leak
- backend returns garbage → MalformedResponseError surfaces as
"model returned malformed output", no crash

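The key-precedence and error-wrapping rules in the Premium list can be
sketched as follows. The tests only observe ANTHROPIC_API_KEY at
backend call time, so this assumes diagnose() exposes the chosen key
through the environment for the duration of the call. resolve_key,
key_in_env, and format_backend_error are hypothetical names, and the
markdown wording is illustrative, not F14's actual text.

```python
import os
from contextlib import contextmanager
from typing import Iterator, Optional

ENV_VAR = "ANTHROPIC_API_KEY"


def resolve_key(user_key: Optional[str]) -> Optional[str]:
    """Visitor-supplied key wins; a whitespace-only field counts as empty."""
    if user_key and user_key.strip():
        return user_key.strip()
    env_key = os.environ.get(ENV_VAR, "").strip()
    return env_key or None  # None -> caller shows "Premium needs an API key"


@contextmanager
def key_in_env(key: str) -> Iterator[None]:
    """Expose `key` as ANTHROPIC_API_KEY for one backend call, then restore."""
    previous = os.environ.get(ENV_VAR)
    os.environ[ENV_VAR] = key
    try:
        yield
    finally:
        # Restore exactly what was there before, including "unset".
        if previous is None:
            os.environ.pop(ENV_VAR, None)
        else:
            os.environ[ENV_VAR] = previous


def format_backend_error(provider: str, model: str, exc: BaseException) -> str:
    """Markdown naming provider, model, exception class and detail; no traceback."""
    return (
        f"**The {provider} backend failed.**\n\n"
        f"- Model: `{model}`\n"
        f"- Error: `{type(exc).__name__}`: {exc}\n"
    )
```

One caveat with this design: os.environ is process-wide, so if the
Space serves concurrent requests, one visitor's key could be visible
to another request's backend call while the context manager is
active. Passing the key straight to the client, for example
anthropic.Anthropic(api_key=key), avoids that shared state.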
_call_anthropic call shape (1):
Mocks anthropic.Anthropic and asserts the SDK is invoked with:
- model = ANTHROPIC_MODEL_ID (claude-opus-4-7)
- max_tokens = 2500
- system block + cache_control: ephemeral
- user messages array
- response unwrapped via .content[0].text
Catches typos / forgotten cache_control / wrong arg names without
hitting the network or costing money.
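For reference, here is a request builder with the shape that test
asserts, plus a recording fake. It is a sketch: the client_factory
parameter is a hypothetical refactor (the committed test instead
monkeypatches anthropic.Anthropic, so the committed code appears to
construct the client itself), and RecordingClient is illustrative.
The keyword arguments (model, max_tokens, a system list of text blocks
carrying cache_control, messages) are ones the Anthropic Messages API
accepts.

```python
from types import SimpleNamespace
from typing import Any, Callable, Dict, List

ANTHROPIC_MODEL_ID = "claude-opus-4-7"  # value quoted in the commit message
MAX_TOKENS = 2500


def call_anthropic(system: str, user: str, client_factory: Callable[[], Any]) -> str:
    """Send one Messages API request and return the first content block's text.

    In production client_factory would be anthropic.Anthropic
    (hypothetical parameter, added here to make the fake injectable).
    """
    client = client_factory()
    message = client.messages.create(
        model=ANTHROPIC_MODEL_ID,
        max_tokens=MAX_TOKENS,
        # One text block marked cacheable, so the long, stable system
        # prompt can be served from the prompt cache on repeat calls.
        system=[{"type": "text", "text": system, "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user", "content": user}],
    )
    return message.content[0].text


class RecordingClient:
    """Test double exposing .messages.create(**kwargs); records every call."""

    def __init__(self, reply: str = "raw response text") -> None:
        self.calls: List[Dict[str, Any]] = []
        self._reply = reply
        self.messages = SimpleNamespace(create=self._create)

    def _create(self, **kwargs: Any) -> Any:
        self.calls.append(kwargs)
        # Mimic the SDK's Message object just enough for .content[0].text.
        return SimpleNamespace(content=[SimpleNamespace(text=self._reply)])


fake = RecordingClient()
result = call_anthropic("MY SYSTEM BLOCK", "MY USER PROMPT", client_factory=lambda: fake)
```

Injecting the factory means the fake never replaces an attribute on
the real anthropic module, and the same function works in production
by passing anthropic.Anthropic.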

Integration test (opt-in, 1):
test_call_anthropic_real_api_returns_text — skipped unless BOTH
ANTHROPIC_API_KEY AND ANTHROPIC_INTEGRATION=1 are set. Hits the
real API, costs ~$0.05/call. Run manually to verify keys + model
availability end-to-end; routine CI leaves it skipped.
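The opt-in gate reduces to a two-condition check. Pulling it into a
named predicate, as sketched here, makes the rule testable on a plain
dict and reusable across several integration tests. integration_enabled
is a hypothetical helper, not something this commit adds.

```python
import os
from typing import Mapping


def integration_enabled(environ: Mapping[str, str] = os.environ) -> bool:
    """True only if a non-empty API key is set AND ANTHROPIC_INTEGRATION is exactly "1"."""
    has_key = bool(environ.get("ANTHROPIC_API_KEY"))
    opted_in = environ.get("ANTHROPIC_INTEGRATION") == "1"
    return has_key and opted_in
```

Typical use would be
`@pytest.mark.skipif(not integration_enabled(), reason=...)`. Because
skipif evaluates a boolean condition when the test module is imported,
both variables must already be set when pytest starts, as in the
integration run command in this message.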

All 42 unit tests pass + 1 skipped. No backend code changes; this is
pure new coverage. Run all:
pytest gradio-apps/compounding-test/test_diagnose.py -v
Run integration too:
ANTHROPIC_API_KEY=sk-... ANTHROPIC_INTEGRATION=1 pytest ...

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (1)

test_diagnose.py +262 -0

test_diagnose.py CHANGED

@@ -9,11 +9,18 @@ from __future__ import annotations
 import pytest
+import json
+import os
+from unittest.mock import MagicMock
 from app import (
+    ANTHROPIC_MODEL_ID,
     MalformedResponseError,
     PROVIDERS,
+    _call_anthropic,
     _call_model,
     _detect_provider,
+    diagnose,
     parse_response,
 )
 import app as app_module
@@ -270,6 +277,261 @@ def test_call_model_unknown_provider_raises():
         _call_model("s", "u", "bogus-provider")
+# --- diagnose() input validation -------------------------------------------
+# Reused across diagnose() tests: a description long enough to pass the
+# 200-word minimum. The actual content doesn't matter for these tests
+# because we mock the backend.
+_LONG_DESCRIPTION = " ".join(["word"] * 250)
+def test_diagnose_empty_description_returns_friendly_error():
+    writeup, json_str = diagnose("", None, None, None, provider="zerogpu")
+    assert "Please describe" in writeup
+    assert json_str == ""
+def test_diagnose_short_description_returns_word_count_error():
+    short = " ".join(["word"] * 50)
+    writeup, json_str = diagnose(short, None, None, None, provider="zerogpu")
+    assert "at least 200 words" in writeup
+    assert "50" in writeup  # current word count
+    assert json_str == ""
+def test_diagnose_long_description_returns_word_count_error():
+    long = " ".join(["word"] * 5001)
+    writeup, json_str = diagnose(long, None, None, None, provider="zerogpu")
+    assert "under 5000 words" in writeup
+    assert json_str == ""
+def test_diagnose_unknown_provider_returns_friendly_error():
+    writeup, json_str = diagnose(_LONG_DESCRIPTION, None, None, None, provider="bogus")
+    assert "Unknown model provider" in writeup
+    assert "bogus" in writeup
+    assert json_str == ""
+# --- diagnose() Premium (Anthropic) path -----------------------------------
+def test_diagnose_premium_without_any_key_returns_friendly_error(monkeypatch):
+    monkeypatch.delenv("ANTHROPIC_API_KEY", raising=False)
+    writeup, json_str = diagnose(
+        _LONG_DESCRIPTION, None, None, None,
+        provider="anthropic",
+        anthropic_api_key=None,
+    )
+    assert "Premium" in writeup
+    assert "API key" in writeup
+    assert json_str == ""
+def test_diagnose_premium_with_empty_string_key_returns_friendly_error(monkeypatch):
+    # Whitespace-only key should not count as supplied
+    monkeypatch.delenv("ANTHROPIC_API_KEY", raising=False)
+    writeup, _ = diagnose(
+        _LONG_DESCRIPTION, None, None, None,
+        provider="anthropic",
+        anthropic_api_key="   ",
+    )
+    assert "Premium" in writeup
+    assert "API key" in writeup
+def test_diagnose_premium_with_env_key_dispatches_to_anthropic(monkeypatch):
+    monkeypatch.setenv("ANTHROPIC_API_KEY", "sk-env-xxx")
+    captured = {}
+    def fake_anthropic(system, user):
+        captured["system"] = system
+        captured["user"] = user
+        captured["env_key_at_call_time"] = os.environ.get("ANTHROPIC_API_KEY")
+        return _VALID_BACKEND_RESPONSE
+    monkeypatch.setitem(PROVIDERS, "anthropic", fake_anthropic)
+    writeup, json_str = diagnose(
+        _LONG_DESCRIPTION, "insurance", "enterprise", "$1M–$10M",
+        provider="anthropic",
+        anthropic_api_key=None,
+    )
+    # Backend was called (so dispatch worked) and the env key was visible
+    assert captured.get("env_key_at_call_time") == "sk-env-xxx"
+    # And the response made it through the parser → JSON tab populated
+    assert json_str
+    parsed = json.loads(json_str)
+    assert parsed["quadrant"] == "compounder"
+def test_diagnose_premium_user_key_overrides_env(monkeypatch):
+    """The page's API-key field should take precedence over any
+    ANTHROPIC_API_KEY env var the Space owner has configured. This is
+    the F15-ish defense — a visitor pasting their own key should never
+    inadvertently use the Space owner's key."""
+    monkeypatch.setenv("ANTHROPIC_API_KEY", "sk-space-owner-xxx")
+    captured = {}
+    def fake_anthropic(system, user):
+        captured["env_key_at_call_time"] = os.environ.get("ANTHROPIC_API_KEY")
+        return _VALID_BACKEND_RESPONSE
+    monkeypatch.setitem(PROVIDERS, "anthropic", fake_anthropic)
+    diagnose(
+        _LONG_DESCRIPTION, None, None, None,
+        provider="anthropic",
+        anthropic_api_key="sk-user-yyy",
+    )
+    assert captured.get("env_key_at_call_time") == "sk-user-yyy"
+def test_diagnose_premium_backend_exception_returns_friendly_error(monkeypatch):
+    """When the Anthropic SDK raises (auth fail, rate limit, network),
+    F14 should wrap it in a markdown message that names the provider,
+    model, exception class, and exception detail — never a raw trace."""
+    monkeypatch.setenv("ANTHROPIC_API_KEY", "sk-test")
+    class FakeAuthError(Exception):
+        pass
+    def failing_anthropic(system, user):
+        raise FakeAuthError("invalid x-api-key header")
+    monkeypatch.setitem(PROVIDERS, "anthropic", failing_anthropic)
+    writeup, json_str = diagnose(
+        _LONG_DESCRIPTION, None, None, None,
+        provider="anthropic",
+        anthropic_api_key=None,
+    )
+    assert "anthropic" in writeup
+    assert ANTHROPIC_MODEL_ID in writeup
+    assert "FakeAuthError" in writeup
+    assert "invalid x-api-key header" in writeup
+    assert "stack" not in writeup.lower()  # no stack trace leaked
+    assert json_str == ""
+def test_diagnose_premium_backend_returns_malformed_response(monkeypatch):
+    """When the backend returns something that fails the JSON schema
+    validator, surface the parser's error message — don't crash."""
+    monkeypatch.setenv("ANTHROPIC_API_KEY", "sk-test")
+    def garbage_anthropic(system, user):
+        return "Sorry, I cannot help with that request."  # no JSON block
+    monkeypatch.setitem(PROVIDERS, "anthropic", garbage_anthropic)
+    writeup, json_str = diagnose(
+        _LONG_DESCRIPTION, None, None, None,
+        provider="anthropic",
+        anthropic_api_key=None,
+    )
+    assert "malformed output" in writeup
+    assert json_str == ""
+# --- _call_anthropic: Anthropic SDK call shape -----------------------------
+#
+# Per Principle VII the actual API call is exempt from automated tests
+# (the SDK and the remote API are not our code). But the SHAPE of the
+# call we make IS our code: model id, system block, cache_control flag,
+# messages structure, response-unwrap path. These are easy to typo and
+# easy to miss in review. Shape tests catch that without hitting the
+# network.
+def test_call_anthropic_passes_system_block_with_cache_control(monkeypatch):
+    captured = {}
+    class FakeContentBlock:
+        text = "raw response text"
+    class FakeMessage:
+        content = [FakeContentBlock()]
+    class FakeClient:
+        class messages:  # noqa: N801 — mirroring SDK's nested .messages.create
+            @staticmethod
+            def create(**kwargs):
+                captured.update(kwargs)
+                return FakeMessage()
+    import anthropic as anthropic_module
+    monkeypatch.setattr(anthropic_module, "Anthropic", lambda: FakeClient())
+    result = _call_anthropic("MY SYSTEM BLOCK", "MY USER PROMPT")
+    # The function unwrapped content[0].text correctly
+    assert result == "raw response text"
+    # Model + token budget
+    assert captured["model"] == ANTHROPIC_MODEL_ID
+    assert captured["max_tokens"] == 2500
+    # System block is a list of one dict with cache_control
+    sys_block = captured["system"]
+    assert isinstance(sys_block, list)
+    assert len(sys_block) == 1
+    assert sys_block[0]["type"] == "text"
+    assert sys_block[0]["text"] == "MY SYSTEM BLOCK"
+    assert sys_block[0]["cache_control"] == {"type": "ephemeral"}
+    # User prompt in the messages array
+    assert captured["messages"] == [{"role": "user", "content": "MY USER PROMPT"}]
+# --- Integration test (opt-in; hits the real Anthropic API) ----------------
+#
+# Skipped unless ANTHROPIC_API_KEY is set AND ANTHROPIC_INTEGRATION=1 is
+# set. Costs money to run (~$0.05 per call to Opus 4.7). Use this when
+# you want to verify end-to-end that the key works and the model is
+# reachable; routine CI should leave this skipped.
+import pytest as _pytest  # already imported above as pytest, but kept explicit
+@_pytest.mark.skipif(
+    not (os.environ.get("ANTHROPIC_API_KEY") and os.environ.get("ANTHROPIC_INTEGRATION") == "1"),
+    reason="needs ANTHROPIC_API_KEY + ANTHROPIC_INTEGRATION=1 to hit the real API",
+)
+def test_call_anthropic_real_api_returns_text():
+    result = _call_anthropic(
+        "You are a one-word echo. Reply with exactly one word.",
+        "Say hello.",
+    )
+    assert isinstance(result, str)
+    assert len(result) > 0
+    assert len(result.split()) < 20  # one-word reply, generously bounded
+# ---------------------------------------------------------------------------
+# Fixture: a backend response that satisfies parse_response() so the
+# diagnose-Premium happy-path tests can assert on parser output without
+# duplicating the full JSON shape per test.
+# ---------------------------------------------------------------------------
+_VALID_BACKEND_RESPONSE = """```json
+{
+  "constraint": "Underwriting at the quote screen.",
+  "scores": {
+    "proprietary_data":         { "score": 4, "rationale": "first-party.", "quoted_span": "12-year claims database" },
+    "self_labeling":            { "score": 4, "rationale": "policies self-label.", "quoted_span": "Every policy we write" },
+    "decreasing_marginal_cost": { "score": 3, "rationale": "amortized pipeline.", "quoted_span": "feed back into the next quarter" },
+    "defensible_asymmetry":     { "score": 3, "rationale": "no data sharing.", "quoted_span": "we don't share data with" }
+  },
+  "quadrant": "compounder",
+  "closest_portrait": "progressive",
+  "closest_portrait_paragraph": "Your case tracks Progressive most closely.",
+  "warnings": []
+}
+```
+# The Verdict
+Your initiative is a compounder. Here's why.
+"""
 def test_warnings_populated_for_failure_quadrant():
     raw = VALID_JSON_BLOCK.replace('"quadrant": "compounder"', '"quadrant": "roman-candle"').replace(
         '"warnings": []',