Spaces: AshwinP/compounding-test

apingali Claude Opus 4.7 (1M context) committed 19 days ago

Commit

3c77cd5

1 Parent(s): 21da7b6

test(hf-space): cover _call_huggingface (full) + _call_zerogpu (refactor)

Symmetric coverage with the Premium path. 59 unit tests now pass
(was 42). One small production refactor to make ZeroGPU testable.

app.py refactor:
- Extracted _zerogpu_invoke() with the actual model-invocation logic
(chat template build → device move → generate → prompt-strip →
decode). _call_zerogpu became a one-line @spaces.GPU wrapper.
- This lets tests exercise the invocation path without needing torch
or @spaces.GPU runtime, by monkeypatching the module-level
_zerogpu_tokenizer / _zerogpu_model / _load_zerogpu_model.
- Stub fallback (deps unavailable) is unchanged.

test_diagnose.py: 17 new tests

_call_huggingface (10):
Token resolution:
- No token anywhere → RuntimeError with actionable message
- HF_TOKEN env wins (primary)
- HUGGING_FACE_HUB_TOKEN env as fallback
- get_token() from `hf auth login` cache as last fallback
- HF_TOKEN wins over the other two sources
InferenceClient init shape:
- model = HF_MODEL_ID
- provider="auto" — catches regressions that would re-break the
modern HF Inference Providers routing (the bug we fixed earlier)
- timeout=120
chat_completion shape:
- messages = [system, user] with correct role/content
- max_tokens=2500
- temperature=0.2 — intentionally low for small-model JSON
adherence; catch drift
- response unwrap via choices[0].message.content
Error handling:
- model_not_supported → RuntimeError with billing guidance
- alternate phrasing also triggers the wrap
- Other exceptions (ValueError, etc.) pass through so F14 can
format them in diagnose()
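
The resolution order and the error-wrapping rule listed above can be sketched as follows. This is an illustration, not the code in app.py: `resolve_hf_token` and `wrap_provider_error` are hypothetical helpers, the substring check on the exception text is an assumption about how the wrap is detected, and the message wording beyond the fragments matched in the tests is invented for the example.

```python
from typing import Callable, Mapping, Optional


def resolve_hf_token(
    env: Mapping[str, str],
    cached_token: Callable[[], Optional[str]],
) -> str:
    """Return the first token found: HF_TOKEN, then HUGGING_FACE_HUB_TOKEN,
    then the login cache. Raise if all three are empty."""
    token = (
        env.get("HF_TOKEN")
        or env.get("HUGGING_FACE_HUB_TOKEN")
        or cached_token()
    )
    if not token:
        # Illustrative wording; only the leading phrase mirrors the tests.
        raise RuntimeError(
            "No HuggingFace token found. Set HF_TOKEN or run `hf auth login`."
        )
    return token


def wrap_provider_error(exc: Exception, model_id: str) -> Exception:
    """Translate the model_not_supported failure; return anything else as-is."""
    if "model_not_supported" in str(exc):  # assumed detection rule
        return RuntimeError(
            f"{model_id} isn't available through any enabled provider."
        )
    return exc


# HF_TOKEN wins even when the other two sources are populated.
winner = resolve_hf_token(
    {"HF_TOKEN": "hf_winner", "HUGGING_FACE_HUB_TOKEN": "hf_loser_1"},
    lambda: "hf_loser_2",
)
```

Passing the environment and the cache lookup in as arguments is a choice made for this sketch so it runs without `huggingface_hub` installed; the tests in this commit patch `os.environ` and `huggingface_hub.get_token` instead.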

_call_zerogpu / _zerogpu_invoke (7):
Stub path:
- When deps unavailable, _call_zerogpu raises clear RuntimeError
- _zerogpu_available() reflects _ZEROGPU_DEPS_AVAILABLE
Invocation shape (via _zerogpu_invoke with mocked tokenizer/model):
- Builds chat template with system + user roles, return_tensors="pt",
add_generation_prompt=True
- Moves inputs to model.device (the .to() chain)
- generate() called with max_new_tokens=2500, temperature=0.2,
do_sample=True (required for non-zero temp), pad_token_id=eos_token_id
- Prompt tokens are stripped before decode (outputs[0][prompt_len:])
- skip_special_tokens=True on decode
- Returns the decoded string
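
The prompt-strip step can be illustrated with plain lists standing in for tensors. `strip_prompt` is a hypothetical helper and the token ids are made up; the only thing taken from the commit is the slice `outputs[0][prompt_len:]`.

```python
from typing import List, Sequence


def strip_prompt(outputs: Sequence[Sequence[int]], prompt_len: int) -> List[int]:
    """Drop the first prompt_len ids of the first sequence, keeping only
    the ids produced after the prompt."""
    return list(outputs[0][prompt_len:])


prompt_ids = [11, 12, 13, 14, 15]   # what went into generate()
new_ids = [901, 902, 903]           # what the model appended
outputs = [prompt_ids + new_ids]    # generate() returns prompt + completion

completion_ids = strip_prompt(outputs, len(prompt_ids))
```

Decoding `outputs[0]` without the slice would return the prompt text followed by the completion, which is the echo the test guards against.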

What this catches in practice:
- Bumping HF_MODEL_ID without re-validating that it is still passed through correctly
- Accidentally removing provider="auto" (the model_not_supported bug)
- SDK arg name drift (max_tokens for HF chat_completion vs
max_new_tokens for generate; easy to confuse)
- Forgetting do_sample=True when setting temperature
- Wrong response-unwrap path (HF uses .choices[].message.content,
Anthropic uses .content[0].text — easy to mix up)
- Forgetting to strip prompt tokens (would echo back the system prompt)

All 59 unit tests pass + 1 skipped opt-in integration test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (2)

app.py +32 -23
test_diagnose.py +282 -0

app.py CHANGED

@@ -377,35 +377,44 @@ def _load_zerogpu_model():
     )
+def _zerogpu_invoke(system_block: str, user_prompt: str) -> str:
+    """Model invocation logic for the ZeroGPU backend. Separated from
+    the `@spaces.GPU` decoration below so it can be unit-tested without
+    actually allocating a GPU. The function reads module-level globals
+    (`_zerogpu_tokenizer`, `_zerogpu_model`) which tests can monkeypatch
+    to fake the transformers types."""
+    _load_zerogpu_model()
+    messages = [
+        {"role": "system", "content": system_block},
+        {"role": "user", "content": user_prompt},
+    ]
+    inputs = _zerogpu_tokenizer.apply_chat_template(
+        messages,
+        return_tensors="pt",
+        add_generation_prompt=True,
+    ).to(_zerogpu_model.device)
+    outputs = _zerogpu_model.generate(
+        inputs,
+        max_new_tokens=2500,
+        temperature=0.2,
+        do_sample=True,
+        pad_token_id=_zerogpu_tokenizer.eos_token_id,
+    )
+    prompt_len = inputs.shape[1]
+    return _zerogpu_tokenizer.decode(
+        outputs[0][prompt_len:], skip_special_tokens=True
+    )
 if _ZEROGPU_DEPS_AVAILABLE:
     @_spaces.GPU(duration=ZEROGPU_DURATION_SECONDS)
     def _call_zerogpu(system_block: str, user_prompt: str) -> str:
         """ZeroGPU backend. Loads Phi-4-mini-instruct (or whatever
         ZEROGPU_MODEL_ID points at) into the Space's allocated GPU and
-        runs chat-template inference. Returns the assistant text only —
-        prompt tokens are stripped before decoding."""
-        _load_zerogpu_model()
-        messages = [
-            {"role": "system", "content": system_block},
-            {"role": "user", "content": user_prompt},
-        ]
-        inputs = _zerogpu_tokenizer.apply_chat_template(
-            messages,
-            return_tensors="pt",
-            add_generation_prompt=True,
-        ).to(_zerogpu_model.device)
-        outputs = _zerogpu_model.generate(
-            inputs,
-            max_new_tokens=2500,
-            temperature=0.2,
-            do_sample=True,
-            pad_token_id=_zerogpu_tokenizer.eos_token_id,
-        )
-        prompt_len = inputs.shape[1]
-        return _zerogpu_tokenizer.decode(
-            outputs[0][prompt_len:], skip_special_tokens=True
-        )
+        runs chat-template inference. Thin wrapper around the testable
+        `_zerogpu_invoke` so the decorator stays at module load time."""
+        return _zerogpu_invoke(system_block, user_prompt)
 else:

test_diagnose.py CHANGED

@@ -15,11 +15,16 @@ from unittest.mock import MagicMock
 from app import (
     ANTHROPIC_MODEL_ID,
+    HF_MODEL_ID,
     MalformedResponseError,
     PROVIDERS,
     _call_anthropic,
+    _call_huggingface,
     _call_model,
+    _call_zerogpu,
     _detect_provider,
+    _zerogpu_available,
+    _zerogpu_invoke,
     diagnose,
     parse_response,
 )
@@ -480,6 +485,283 @@ def test_call_anthropic_passes_system_block_with_cache_control(monkeypatch):
     assert captured["messages"] == [{"role": "user", "content": "MY USER PROMPT"}]
+# --- _call_huggingface: token resolution + call shape ----------------------
+def _install_fake_inference_client(monkeypatch, captured: dict, *,
+                                    response_text: str = "hf response",
+                                    raises: Exception | None = None):
+    """Replace huggingface_hub.InferenceClient with a fake that records
+    its init kwargs and chat_completion kwargs into `captured`. Optionally
+    have chat_completion raise an exception instead of returning."""
+    class _FakeMsg:
+        content = response_text
+    class _FakeChoice:
+        message = _FakeMsg()
+    class _FakeResponse:
+        choices = [_FakeChoice()]
+    class _FakeClient:
+        def __init__(self, **kwargs):
+            captured["init_kwargs"] = kwargs
+        def chat_completion(self, **kwargs):
+            captured["chat_kwargs"] = kwargs
+            if raises is not None:
+                raise raises
+            return _FakeResponse()
+    import huggingface_hub
+    monkeypatch.setattr(huggingface_hub, "InferenceClient", _FakeClient)
+def test_call_huggingface_no_token_anywhere_raises_actionable_error(monkeypatch):
+    monkeypatch.delenv("HF_TOKEN", raising=False)
+    monkeypatch.delenv("HUGGING_FACE_HUB_TOKEN", raising=False)
+    import huggingface_hub
+    monkeypatch.setattr(huggingface_hub, "get_token", lambda: None)
+    with pytest.raises(RuntimeError, match="No HuggingFace token"):
+        _call_huggingface("sys", "usr")
+def test_call_huggingface_uses_HF_TOKEN_env(monkeypatch):
+    monkeypatch.setenv("HF_TOKEN", "hf_from_env")
+    captured = {}
+    _install_fake_inference_client(monkeypatch, captured)
+    _call_huggingface("sys", "usr")
+    assert captured["init_kwargs"]["token"] == "hf_from_env"
+def test_call_huggingface_uses_HUGGING_FACE_HUB_TOKEN_env_as_fallback(monkeypatch):
+    monkeypatch.delenv("HF_TOKEN", raising=False)
+    monkeypatch.setenv("HUGGING_FACE_HUB_TOKEN", "hf_legacy_var")
+    captured = {}
+    _install_fake_inference_client(monkeypatch, captured)
+    _call_huggingface("sys", "usr")
+    assert captured["init_kwargs"]["token"] == "hf_legacy_var"
+def test_call_huggingface_uses_get_token_when_no_env(monkeypatch):
+    monkeypatch.delenv("HF_TOKEN", raising=False)
+    monkeypatch.delenv("HUGGING_FACE_HUB_TOKEN", raising=False)
+    import huggingface_hub
+    monkeypatch.setattr(huggingface_hub, "get_token", lambda: "hf_from_cli_login")
+    captured = {}
+    _install_fake_inference_client(monkeypatch, captured)
+    _call_huggingface("sys", "usr")
+    assert captured["init_kwargs"]["token"] == "hf_from_cli_login"
+def test_call_huggingface_HF_TOKEN_wins_over_other_sources(monkeypatch):
+    monkeypatch.setenv("HF_TOKEN", "hf_winner")
+    monkeypatch.setenv("HUGGING_FACE_HUB_TOKEN", "hf_loser_1")
+    import huggingface_hub
+    monkeypatch.setattr(huggingface_hub, "get_token", lambda: "hf_loser_2")
+    captured = {}
+    _install_fake_inference_client(monkeypatch, captured)
+    _call_huggingface("sys", "usr")
+    assert captured["init_kwargs"]["token"] == "hf_winner"
+def test_call_huggingface_init_shape_model_provider_timeout(monkeypatch):
+    monkeypatch.setenv("HF_TOKEN", "hf_test")
+    captured = {}
+    _install_fake_inference_client(monkeypatch, captured)
+    _call_huggingface("sys", "usr")
+    init = captured["init_kwargs"]
+    assert init["model"] == HF_MODEL_ID
+    # provider="auto" is the critical config that enables the modern HF
+    # Inference Providers routing layer — without it, the client falls
+    # back to the legacy hf-inference-only path. Catch any regression
+    # that removes this flag.
+    assert init["provider"] == "auto"
+    assert init["timeout"] == 120
+def test_call_huggingface_chat_completion_call_shape(monkeypatch):
+    monkeypatch.setenv("HF_TOKEN", "hf_test")
+    captured = {}
+    _install_fake_inference_client(monkeypatch, captured)
+    result = _call_huggingface("MY SYSTEM BLOCK", "MY USER PROMPT")
+    chat = captured["chat_kwargs"]
+    assert chat["messages"] == [
+        {"role": "system", "content": "MY SYSTEM BLOCK"},
+        {"role": "user", "content": "MY USER PROMPT"},
+    ]
+    assert chat["max_tokens"] == 2500
+    # Low temperature is intentional — smaller open models can produce
+    # looser JSON at higher temperatures. Catch any drift.
+    assert chat["temperature"] == 0.2
+    # Response unwrap: choices[0].message.content
+    assert result == "hf response"
+def test_call_huggingface_model_not_supported_error_wrapped(monkeypatch):
+    monkeypatch.setenv("HF_TOKEN", "hf_test")
+    fake_hf_error = Exception(
+        "Bad request: {'message': \"The requested model is not supported "
+        "by any provider you have enabled.\", 'code': 'model_not_supported'}"
+    )
+    captured = {}
+    _install_fake_inference_client(monkeypatch, captured, raises=fake_hf_error)
+    with pytest.raises(RuntimeError, match="isn't available through any"):
+        _call_huggingface("sys", "usr")
+def test_call_huggingface_model_not_supported_alternate_phrasing_wrapped(monkeypatch):
+    monkeypatch.setenv("HF_TOKEN", "hf_test")
+    fake_hf_error = Exception("...'code': 'model_not_supported'...")
+    captured = {}
+    _install_fake_inference_client(monkeypatch, captured, raises=fake_hf_error)
+    with pytest.raises(RuntimeError, match="isn't available through any"):
+        _call_huggingface("sys", "usr")
+def test_call_huggingface_other_exception_passes_through(monkeypatch):
+    """Errors that aren't the model_not_supported case (auth fail,
+    network timeout, malformed response) should propagate up so the
+    F14 wrapper in diagnose() can surface them with the original class
+    name and detail."""
+    monkeypatch.setenv("HF_TOKEN", "hf_test")
+    fake_other_error = ValueError("Invalid API key")
+    captured = {}
+    _install_fake_inference_client(monkeypatch, captured, raises=fake_other_error)
+    with pytest.raises(ValueError, match="Invalid API key"):
+        _call_huggingface("sys", "usr")
+# --- _call_zerogpu: stub path + invocation shape --------------------------
+def test_call_zerogpu_stub_raises_clear_error_when_deps_unavailable():
+    """In a local environment without spaces/torch/transformers installed,
+    _ZEROGPU_DEPS_AVAILABLE is False and _call_zerogpu is the stub that
+    raises a RuntimeError pointing the user to the other two backends."""
+    if _zerogpu_available():
+        pytest.skip("Test only meaningful when zerogpu deps are NOT installed")
+    with pytest.raises(RuntimeError, match="ZeroGPU backend requires"):
+        _call_zerogpu("sys", "usr")
+def test_zerogpu_available_reflects_dep_state():
+    """_zerogpu_available() is the sole gating function for the zerogpu
+    branch in _detect_provider; it must return the cached import-time
+    boolean rather than re-trying imports on every call."""
+    import app as app_module
+    assert _zerogpu_available() is app_module._ZEROGPU_DEPS_AVAILABLE
+def _install_fake_zerogpu_model(monkeypatch, captured: dict, *,
+                                 prompt_len: int = 5,
+                                 decoded_text: str = "model output"):
+    """Replace the module-level _zerogpu_tokenizer and _zerogpu_model
+    with fakes that record their calls. Simulates transformers types
+    just enough for _zerogpu_invoke() to run end-to-end without torch
+    actually installed."""
+    import app as app_module
+    class _FakeInputs:
+        def __init__(self):
+            self.shape = (1, prompt_len)
+        def to(self, device):
+            captured["inputs_moved_to_device"] = device
+            return self  # chain .to() back into self for further use
+    fake_inputs = _FakeInputs()
+    fake_outputs = [list(range(prompt_len + 10))]  # prompt tokens + 10 new tokens
+    class _FakeTokenizer:
+        eos_token_id = 99
+        def apply_chat_template(self, messages, **kwargs):
+            captured["apply_chat_template"] = {
+                "messages": messages,
+                "kwargs": kwargs,
+            }
+            return fake_inputs
+        def decode(self, token_ids, **kwargs):
+            captured["decode"] = {"token_ids": list(token_ids), "kwargs": kwargs}
+            return decoded_text
+    class _FakeModel:
+        device = "cuda:0"
+        def generate(self, inputs, **kwargs):
+            captured["generate_inputs"] = inputs
+            captured["generate_kwargs"] = kwargs
+            return fake_outputs
+    monkeypatch.setattr(app_module, "_zerogpu_tokenizer", _FakeTokenizer())
+    monkeypatch.setattr(app_module, "_zerogpu_model", _FakeModel())
+    # Skip the real model-load path; we've already populated the globals.
+    monkeypatch.setattr(app_module, "_load_zerogpu_model", lambda: None)
+def test_zerogpu_invoke_builds_chat_template_with_system_and_user(monkeypatch):
+    captured = {}
+    _install_fake_zerogpu_model(monkeypatch, captured)
+    _zerogpu_invoke("MY SYSTEM BLOCK", "MY USER PROMPT")
+    chat = captured["apply_chat_template"]
+    assert chat["messages"] == [
+        {"role": "system", "content": "MY SYSTEM BLOCK"},
+        {"role": "user", "content": "MY USER PROMPT"},
+    ]
+    assert chat["kwargs"]["return_tensors"] == "pt"
+    assert chat["kwargs"]["add_generation_prompt"] is True
+def test_zerogpu_invoke_moves_inputs_to_model_device(monkeypatch):
+    captured = {}
+    _install_fake_zerogpu_model(monkeypatch, captured)
+    _zerogpu_invoke("sys", "usr")
+    assert captured["inputs_moved_to_device"] == "cuda:0"
+def test_zerogpu_invoke_generate_call_shape(monkeypatch):
+    """The .generate() kwargs are easy to typo and carry real semantics:
+      max_new_tokens=2500 caps output length
+      temperature=0.2 keeps JSON output stable for small models
+      do_sample=True is needed for non-zero temperature to have effect
+      pad_token_id=eos_token_id avoids warning spam on short prompts
+    Catch regressions in any of these."""
+    captured = {}
+    _install_fake_zerogpu_model(monkeypatch, captured)
+    _zerogpu_invoke("sys", "usr")
+    gen = captured["generate_kwargs"]
+    assert gen["max_new_tokens"] == 2500
+    assert gen["temperature"] == 0.2
+    assert gen["do_sample"] is True
+    assert gen["pad_token_id"] == 99  # _FakeTokenizer.eos_token_id
+def test_zerogpu_invoke_strips_prompt_tokens_before_decode(monkeypatch):
+    """The decoded output must be the GENERATED text only, not echo back
+    the prompt. The function does this by slicing outputs[0][prompt_len:]
+    before calling decode. Verify the slice happens correctly."""
+    captured = {}
+    # prompt_len=5 → fake_outputs returns range(15) (5 prompt + 10 generated)
+    # so decode should be called with tokens [5..15)
+    _install_fake_zerogpu_model(monkeypatch, captured, prompt_len=5)
+    _zerogpu_invoke("sys", "usr")
+    decoded_tokens = captured["decode"]["token_ids"]
+    assert decoded_tokens == list(range(5, 15))
+    # And skip_special_tokens is on so we don't include things like </s>
+    assert captured["decode"]["kwargs"]["skip_special_tokens"] is True
+def test_zerogpu_invoke_returns_decoded_text(monkeypatch):
+    captured = {}
+    _install_fake_zerogpu_model(monkeypatch, captured, decoded_text="my generated answer")
+    result = _zerogpu_invoke("sys", "usr")
+    assert result == "my generated answer"
 # --- Integration test (opt-in; hits the real Anthropic API) ----------------
 #
 # Skipped unless ANTHROPIC_API_KEY is set AND ANTHROPIC_INTEGRATION=1 is