feat(hf-space): add ZeroGPU backend for HuggingFace Pro Spaces
User has a HF Pro plan, so the Space can use ZeroGPU: free on-demand
A10g/A100/H200 allocation per request, no API round-trip, no inference
credits burned. We load an open model directly via transformers and
decorate the inference call with @spaces.GPU so the Pro plan picks up
the GPU automatically. Default model is Phi-4-mini-instruct (small
enough for fast cold start); swappable to Gemma 2 9B, Llama-3.3, or
full Phi-4 via ZEROGPU_MODEL_ID.
What changed:
app.py
- try/except imports of `spaces`, `torch`, `transformers` at
module load; sets _ZEROGPU_DEPS_AVAILABLE accordingly.
_zerogpu_available() wrapper exists so tests can monkeypatch
the answer without touching the real imports.
- _call_zerogpu(system, user) decorated with
`@spaces.GPU(duration=ZEROGPU_DURATION_SECONDS)` when deps are
available; replaced with a clear-error stub otherwise.
- Lazy model load (_load_zerogpu_model): kept warm across
requests in module-level state; cold start only on first call
after Space process restart.
- _detect_provider precedence updated: Pro Space (SPACE_ID + deps)
→ zerogpu, else ANTHROPIC_API_KEY → anthropic, else HF_TOKEN or any
Space → huggingface, else anthropic as the final fallback.
Explicit MODEL_PROVIDER=anthropic still wins on a Pro Space.
- PROVIDERS dict always includes zerogpu (even when deps are absent;
the stub raises a clear error); UI only surfaces the dropdown
option when deps are importable.
- F14 error message now resolves the model_label across all three
providers via dict lookup.
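
The optional-dependency guard and clear-error stub described in the app.py notes can be sketched roughly as follows. This is a minimal, stdlib-only illustration, not the code from the commit: `hypothetical_gpu_runtime`, `call_gpu_backend`, and `call_echo_backend` are made-up names standing in for `spaces`, `_call_zerogpu`, and the other backends.

```python
import importlib


def _try_import(name):
    """Return the imported module, or None when it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# Hypothetical module name standing in for `spaces` / `torch` / `transformers`.
_gpu_runtime = _try_import("hypothetical_gpu_runtime")
_DEPS_AVAILABLE = _gpu_runtime is not None


def deps_available() -> bool:
    """Wrapped in a function so tests can patch the answer."""
    return _DEPS_AVAILABLE


if _DEPS_AVAILABLE:

    def call_gpu_backend(system: str, user: str) -> str:
        # Assumed API on the hypothetical runtime; never reached without it.
        return _gpu_runtime.run(system, user)

else:

    def call_gpu_backend(system: str, user: str) -> str:
        raise RuntimeError(
            "GPU backend dependencies are not installed; pick another provider."
        )


def call_echo_backend(system: str, user: str) -> str:
    """Trivial stand-in for a backend that is always available."""
    return f"{system}|{user}"


# The key is always present, so the dispatcher never needs a special case.
PROVIDERS = {"echo": call_echo_backend, "gpu": call_gpu_backend}
```

Keeping the key in the dict while hiding the UI option means a misconfigured `MODEL_PROVIDER` fails with a readable message instead of a `KeyError`.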
README.md (NEW)
HuggingFace Spaces YAML header specifying `sdk: gradio`,
`app_file: app.py`, `hardware: zero-a10g` so the deployed Space
actually receives ZeroGPU allocation. Also user-facing docs:
backend table, auto-detect precedence, configuration, local-dev
setup, test commands.
requirements.txt
Restructured with section comments. Added:
+ spaces>=0.30
+ transformers>=4.45
+ torch>=2.4
+ accelerate>=1.0
Documented that local-only users (anthropic backend) can omit
the heavy zerogpu lines.
.env.example
Added ZEROGPU BACKEND section: ZEROGPU_MODEL_ID with tested
alternatives, ZEROGPU_DURATION_SECONDS. Updated PROVIDER
SELECTION docs to include zerogpu and the new precedence.
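
As a rough illustration of how the two new variables might be read, the sketch below parses them from an environment mapping. The defaults match the ones named in this commit; the fallback on a malformed or non-positive duration is an assumption added here for illustration (the diff itself calls `int(...)` directly), and `read_zerogpu_config` is a hypothetical helper name.

```python
import os

DEFAULT_MODEL_ID = "microsoft/Phi-4-mini-instruct"
DEFAULT_DURATION_SECONDS = 120


def read_zerogpu_config(env=None):
    """Return (model_id, duration_seconds) from an env mapping.

    Hypothetical helper: falls back to the defaults when a value is
    missing, empty, non-numeric, or not positive.
    """
    env = os.environ if env is None else env
    model_id = env.get("ZEROGPU_MODEL_ID", "").strip() or DEFAULT_MODEL_ID
    raw = env.get("ZEROGPU_DURATION_SECONDS", "").strip()
    try:
        duration = int(raw) if raw else DEFAULT_DURATION_SECONDS
    except ValueError:
        duration = DEFAULT_DURATION_SECONDS
    if duration <= 0:
        duration = DEFAULT_DURATION_SECONDS
    return model_id, duration
```

Passing the mapping as an argument keeps the helper testable without touching the real process environment.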
test_diagnose.py
+ 4 zerogpu detection tests (Pro Space with deps → zerogpu,
Space without deps → huggingface fallback, explicit
MODEL_PROVIDER=zerogpu wins, explicit anthropic beats zerogpu
auto-detect).
+ 1 PROVIDERS-dict test (zerogpu always present so the
dispatcher is uniform regardless of dep availability).
All 31 tests pass (15 parser + 16 provider).
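
The precedence these tests exercise can be sketched as a small pure function. This is an approximation based on the docstring in the diff, not the actual `_detect_provider`: here the availability check is passed in as a parameter (`zerogpu_available`, a hypothetical argument), whereas the real tests monkeypatch a module-level function.

```python
def detect_provider(env, zerogpu_available):
    """Pick a provider name from an env mapping.

    `zerogpu_available` is a zero-argument callable; injecting it is an
    assumption of this sketch so the example needs no monkeypatching.
    """
    explicit = env.get("MODEL_PROVIDER", "").strip().lower()
    if explicit in ("anthropic", "huggingface", "zerogpu"):
        return explicit  # 1. explicit override always wins
    if env.get("SPACE_ID") and zerogpu_available():
        return "zerogpu"  # 2. on a Space with the GPU deps importable
    if env.get("ANTHROPIC_API_KEY"):
        return "anthropic"  # 3. Anthropic credentials present
    if (
        env.get("HF_TOKEN")
        or env.get("HUGGING_FACE_HUB_TOKEN")
        or env.get("SPACE_ID")
    ):
        return "huggingface"  # 4. HF token, or any Space without GPU deps
    return "anthropic"  # 5. fall through; the call-time error guides the user
```

For example, `detect_provider({"SPACE_ID": "x"}, lambda: True)` picks `zerogpu`, while the same env with `lambda: False` falls back to `huggingface`.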
specs/004-berkshire-test/contracts/hf-space-interface.md §2
Added §2.C with the full ZeroGPU invocation pattern (lazy load,
chat-template inference, prompt-stripped decode). Updated
provider-selection precedence table. Added "Required Space
metadata" subsection explaining the hardware: zero-a10g header.
Updated cache-strategy notes with ZeroGPU cold-start behavior.
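
The "prompt-stripped decode" step can be illustrated without loading a model. Causal-LM generation typically returns the prompt token ids followed by the newly generated ids, so the prompt length is sliced off before decoding. The toy vocabulary and the helper names below (`strip_prompt`, `toy_decode`) are invented for this sketch; the real code slices a tensor and calls the tokenizer's `decode`.

```python
def strip_prompt(output_ids, prompt_len):
    """Drop the echoed prompt tokens, keeping only the generated ones."""
    return output_ids[prompt_len:]


def toy_decode(ids, vocab, special_ids):
    """Toy stand-in for a tokenizer decode that skips special tokens."""
    return " ".join(vocab[i] for i in ids if i not in special_ids)


# Invented vocabulary: id 0 plays the role of the end-of-sequence token.
vocab = {0: "<eos>", 1: "system", 2: "user", 3: "hello", 4: "world"}

prompt_ids = [1, 2]                 # what the chat template produced
output_ids = prompt_ids + [3, 4, 0]  # prompt echoed back, then new tokens

text = toy_decode(strip_prompt(output_ids, len(prompt_ids)), vocab, {0})
```

Without the slice, the decoded string would begin with the system and user prompt text, which would break any downstream JSON parsing of the reply.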
specs/004-berkshire-test/tasks.md
T037 rationale updated to reflect the three-backend
architecture and the README.md Space-metadata addition.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- .env.example +31 -5
- README.md +86 -0
- app.py +127 -14
- requirements.txt +16 -2
- test_diagnose.py +28 -2
|
@@ -7,12 +7,18 @@
|
|
| 7 |
# PROVIDER SELECTION
|
| 8 |
# ============================================================
|
| 9 |
# Optional. If unset, the app auto-detects based on which credentials
|
| 10 |
-
# are present
|
|
|
|
| 11 |
# anthropic - Claude via the Anthropic SDK (best writeup quality)
|
| 12 |
-
# huggingface - Gemma 2 / Phi-4 / Llama-3.3 / Qwen via
|
| 13 |
-
#
|
| 14 |
-
#
|
| 15 |
-
#
|
| 16 |
# MODEL_PROVIDER=
|
| 17 |
|
| 18 |
# ============================================================
|
|
@@ -44,6 +50,26 @@ MODEL_ID=claude-opus-4-7
|
|
| 44 |
# shows a "try again" message rather than crashing.
|
| 45 |
# HF_MODEL_ID=google/gemma-2-9b-it
|
| 46 |
|
| 47 |
# ============================================================
|
| 48 |
# VALIDATION
|
| 49 |
# ============================================================
|
|
|
|
| 7 |
# PROVIDER SELECTION
|
| 8 |
# ============================================================
|
| 9 |
# Optional. If unset, the app auto-detects based on which credentials
|
| 10 |
+
# are present and whether we are running on a HuggingFace Space (see
|
| 11 |
+
# app.py::_detect_provider). Valid values:
|
| 12 |
# anthropic - Claude via the Anthropic SDK (best writeup quality)
|
| 13 |
+
# huggingface - Open models (Gemma 2 / Phi-4 / Llama-3.3 / Qwen) via
|
| 14 |
+
# HF Inference Providers API. Free on HF Spaces via
|
| 15 |
+
# the Space's monthly credits; HF_TOKEN locally.
|
| 16 |
+
# zerogpu - Open model (Phi-4-mini-instruct by default) loaded
|
| 17 |
+
# locally in the Space and run on free on-demand GPU
|
| 18 |
+
# via the HuggingFace Pro plan's ZeroGPU allocation.
|
| 19 |
+
# No API round-trip; no inference credits burned.
|
| 20 |
+
# Auto-detect precedence: Pro Space → zerogpu, else Anthropic key →
|
| 21 |
+
# anthropic, else HF_TOKEN or any Space → huggingface, else anthropic.
|
| 22 |
# MODEL_PROVIDER=
|
| 23 |
|
| 24 |
# ============================================================
|
|
|
|
| 50 |
# shows a "try again" message rather than crashing.
|
| 51 |
# HF_MODEL_ID=google/gemma-2-9b-it
|
| 52 |
|
| 53 |
+
# ============================================================
|
| 54 |
+
# ZEROGPU BACKEND (HuggingFace Pro plan)
|
| 55 |
+
# ============================================================
|
| 56 |
+
# No credentials required: the @spaces.GPU decorator handles allocation
|
| 57 |
+
# automatically when the Space has a Pro owner. Locally, the function
|
| 58 |
+
# decoration is a no-op and the model runs on CPU (slow, smoke-test only).
|
| 59 |
+
#
|
| 60 |
+
# Optional. Default microsoft/Phi-4-mini-instruct fits on the standard
|
| 61 |
+
# A100 allocation with fast cold start. Other tested choices:
|
| 62 |
+
# google/gemma-2-9b-it - larger, slower load, more capable
|
| 63 |
+
# meta-llama/Llama-3.3-8B-Instruct - Llama 3.3 8B, good JSON adherence
|
| 64 |
+
# microsoft/phi-4 - full 14B Phi-4, slower
|
| 65 |
+
# HuggingFace's gated models (Llama, etc.) need HF_TOKEN to download.
|
| 66 |
+
# ZEROGPU_MODEL_ID=microsoft/Phi-4-mini-instruct
|
| 67 |
+
|
| 68 |
+
# Optional. Maximum GPU allocation per request, in seconds. The Pro
|
| 69 |
+
# plan allows up to 120s per request; raise/lower to balance cold-start
|
| 70 |
+
# coverage vs. quota use.
|
| 71 |
+
# ZEROGPU_DURATION_SECONDS=120
|
| 72 |
+
|
| 73 |
# ============================================================
|
| 74 |
# VALIDATION
|
| 75 |
# ============================================================
|
|
@@ -0,0 +1,86 @@
|
|
| 1 |
+
---
|
| 2 |
+
title: The Compounding Test
|
| 3 |
+
emoji: π
|
| 4 |
+
colorFrom: indigo
|
| 5 |
+
colorTo: purple
|
| 6 |
+
sdk: gradio
|
| 7 |
+
sdk_version: 4.44.0
|
| 8 |
+
app_file: app.py
|
| 9 |
+
pinned: false
|
| 10 |
+
license: mit
|
| 11 |
+
short_description: A diagnostic for AI investments at non-tech companies.
|
| 12 |
+
hardware: zero-a10g
|
| 13 |
+
---
|
| 14 |
+
|
| 15 |
+
# The Compounding Test
|
| 16 |
+
|
| 17 |
+
A diagnostic for AI investments at non-technology companies. Paste a
|
| 18 |
+
description of your AI initiative (200-5000 words); receive a scored
|
| 19 |
+
writeup in one of four quadrants: **compounder**, **one-shot win**,
|
| 20 |
+
**compounding the wrong thing**, or **Roman Candle**.
|
| 21 |
+
|
| 22 |
+
Framework essay: <https://www.mile-hi.ai/journal/the-berkshire-test>
|
| 23 |
+
|
| 24 |
+
## Backends
|
| 25 |
+
|
| 26 |
+
The Space supports three interchangeable model backends. The dropdown
|
| 27 |
+
in the UI lets you switch per-submission to compare writeup quality.
|
| 28 |
+
|
| 29 |
+
| Backend | Model (default) | Credentials | Where it runs |
|
| 30 |
+
|---|---|---|---|
|
| 31 |
+
| `anthropic` | `claude-opus-4-7` | `ANTHROPIC_API_KEY` (Space secret) | Anthropic API |
|
| 32 |
+
| `huggingface` | `google/gemma-2-9b-it` | none on a Space; `HF_TOKEN` locally | HF Inference Providers |
|
| 33 |
+
| `zerogpu` | `microsoft/Phi-4-mini-instruct` | none (Pro plan handles it) | On-Space ZeroGPU |
|
| 34 |
+
|
| 35 |
+
**Auto-detect precedence:**
|
| 36 |
+
|
| 37 |
+
1. Explicit `MODEL_PROVIDER` env var wins.
|
| 38 |
+
2. On a Pro Space (zerogpu deps installed) → `zerogpu`.
|
| 39 |
+
3. Else if `ANTHROPIC_API_KEY` is set → `anthropic`.
|
| 40 |
+
4. Else if `HF_TOKEN` is set, or running on any Space → `huggingface`.
|
| 41 |
+
5. Else fall through to `anthropic` (call-time error guides the user).
|
| 42 |
+
|
| 43 |
+
## Configuration
|
| 44 |
+
|
| 45 |
+
See `.env.example` for the full list of env vars. Common overrides:
|
| 46 |
+
|
| 47 |
+
```
|
| 48 |
+
MODEL_PROVIDER=zerogpu
|
| 49 |
+
ZEROGPU_MODEL_ID=google/gemma-2-9b-it # bigger; ~30s cold start on A10
|
| 50 |
+
ZEROGPU_DURATION_SECONDS=120 # max GPU allocation per request
|
| 51 |
+
HF_MODEL_ID=meta-llama/Llama-3.3-70B-Instruct
|
| 52 |
+
MODEL_ID=claude-sonnet-4-6 # cheaper Anthropic fallback
|
| 53 |
+
```
|
| 54 |
+
|
| 55 |
+
## Local development
|
| 56 |
+
|
| 57 |
+
```bash
|
| 58 |
+
python3 -m venv .venv && source .venv/bin/activate
|
| 59 |
+
pip install -r requirements.txt # ~2GB with torch/transformers
|
| 60 |
+
cp .env.example .env # fill in whatever you have
|
| 61 |
+
python app.py # http://127.0.0.1:7860
|
| 62 |
+
```
|
| 63 |
+
|
| 64 |
+
If you only need to test the `anthropic` backend locally, you can skip
|
| 65 |
+
the heavy `spaces` / `torch` / `transformers` / `accelerate` lines in
|
| 66 |
+
`requirements.txt`; the app degrades gracefully (the zerogpu dropdown
|
| 67 |
+
option won't appear).
|
| 68 |
+
|
| 69 |
+
## Tests
|
| 70 |
+
|
| 71 |
+
```bash
|
| 72 |
+
pytest test_diagnose.py -v
|
| 73 |
+
```
|
| 74 |
+
|
| 75 |
+
31 tests covering the parser contract (15: what JSON shapes the parser
|
| 76 |
+
accepts and rejects) and the provider routing (16: auto-detection
|
| 77 |
+
precedence, dispatcher routing, env-driven overrides).
|
| 78 |
+
|
| 79 |
+
## Repository
|
| 80 |
+
|
| 81 |
+
Source lives in [apingali/effectiveness][repo] under
|
| 82 |
+
`gradio-apps/compounding-test/`. The Space is deployed from that path.
|
| 83 |
+
The published framework essay and four portrait articles live at
|
| 84 |
+
<https://www.mile-hi.ai/journal/the-berkshire-test>.
|
| 85 |
+
|
| 86 |
+
[repo]: https://github.com/apingali/effectiveness
|
|
@@ -5,8 +5,8 @@ the two-axis Berkshire Test for AI and returns a scored writeup.
|
|
| 5 |
|
| 6 |
Architecture per specs/004-berkshire-test/contracts/hf-space-interface.md:
|
| 7 |
- Inputs: a description (200-5000 words) + 3 optional clarifiers.
|
| 8 |
-
-
|
| 9 |
-
from available credentials:
|
| 10 |
* anthropic - Claude Opus / Sonnet via the Anthropic SDK;
|
| 11 |
system block is `cache_control:ephemeral` so
|
| 12 |
subsequent calls hit the 5-minute prefix cache.
|
|
@@ -15,6 +15,13 @@ Architecture per specs/004-berkshire-test/contracts/hf-space-interface.md:
|
|
| 15 |
huggingface_hub InferenceClient. Works on HF
|
| 16 |
Spaces with the Space's free inference credits;
|
| 17 |
locally requires HF_TOKEN.
|
| 18 |
- Output: two Gradio tabs: markdown writeup + raw JSON.
|
| 19 |
|
| 20 |
Engine/Site boundary (Principle VIII): this app lives in gradio-apps/
|
|
@@ -181,10 +188,34 @@ ROOT = Path(__file__).parent
|
|
| 181 |
|
| 182 |
ANTHROPIC_MODEL_ID = os.environ.get("MODEL_ID", "claude-opus-4-7")
|
| 183 |
HF_MODEL_ID = os.environ.get("HF_MODEL_ID", "google/gemma-2-9b-it")
|
|
|
|
|
|
|
| 184 |
MAX_DESCRIPTION_WORDS = int(os.environ.get("MAX_DESCRIPTION_WORDS", "5000"))
|
| 185 |
MIN_DESCRIPTION_WORDS = 200
|
| 186 |
|
| 187 |
|
| 188 |
# ---------------------------------------------------------------------------
|
| 189 |
# Provider abstraction (anthropic vs huggingface; selectable at runtime)
|
| 190 |
# ---------------------------------------------------------------------------
|
|
@@ -192,17 +223,22 @@ MIN_DESCRIPTION_WORDS = 200
|
|
| 192 |
|
| 193 |
def _detect_provider(env=None) -> str:
|
| 194 |
"""Pick a model provider from env. Order of precedence:
|
| 195 |
-
1. Explicit MODEL_PROVIDER (anthropic | huggingface).
|
| 196 |
-
2.
|
| 197 |
-
|
| 198 |
-
|
| 199 |
-
|
|
|
|
|
|
|
|
|
|
| 200 |
which env to set).
|
| 201 |
"""
|
| 202 |
env = env if env is not None else os.environ
|
| 203 |
explicit = env.get("MODEL_PROVIDER", "").strip().lower()
|
| 204 |
-
if explicit in ("anthropic", "huggingface"):
|
| 205 |
return explicit
|
|
|
|
|
|
|
| 206 |
if env.get("ANTHROPIC_API_KEY"):
|
| 207 |
return "anthropic"
|
| 208 |
if (
|
|
@@ -260,9 +296,75 @@ def _call_huggingface(system_block: str, user_prompt: str) -> str:
|
|
| 260 |
return resp.choices[0].message.content
|
| 261 |
|
| 262 |
|
| 263 |
PROVIDERS = {
|
| 264 |
"anthropic": _call_anthropic,
|
| 265 |
"huggingface": _call_huggingface,
|
|
|
|
| 266 |
}
|
| 267 |
|
| 268 |
|
|
@@ -394,7 +496,11 @@ def diagnose(
|
|
| 394 |
except Exception as e:
|
| 395 |
# API timeout / rate limit / auth / server / network failure
|
| 396 |
# (Anthropic SDK or huggingface_hub InferenceClient).
|
| 397 |
-
model_label =
|
| 398 |
return (
|
| 399 |
f"The diagnostic call to {provider} ({model_label}) failed "
|
| 400 |
f"({type(e).__name__}). Try again in a moment, switch providers in "
|
|
@@ -439,8 +545,13 @@ def build_demo():
|
|
| 439 |
|
| 440 |
provider_choices = [
|
| 441 |
(f"Anthropic - {ANTHROPIC_MODEL_ID} (requires ANTHROPIC_API_KEY)", "anthropic"),
|
| 442 |
-
(f"HuggingFace - {HF_MODEL_ID} (free on HF Spaces; HF_TOKEN locally)", "huggingface"),
|
| 443 |
]
|
| 444 |
|
| 445 |
with gr.Blocks(title="The Compounding Test") as demo:
|
| 446 |
gr.Markdown(
|
|
@@ -475,10 +586,12 @@ def build_demo():
|
|
| 475 |
label="Model provider",
|
| 476 |
info=(
|
| 477 |
"Claude gives the highest-quality writeups but needs your "
|
| 478 |
-
"own ANTHROPIC_API_KEY.
|
| 479 |
-
"
|
| 480 |
-
"
|
| 481 |
-
"
|
|
|
|
|
|
|
| 482 |
),
|
| 483 |
)
|
| 484 |
submit = gr.Button("Diagnose", variant="primary")
|
|
|
|
| 5 |
|
| 6 |
Architecture per specs/004-berkshire-test/contracts/hf-space-interface.md:
|
| 7 |
- Inputs: a description (200-5000 words) + 3 optional clarifiers.
|
| 8 |
+
- Three backends, selectable by env (`MODEL_PROVIDER`) or auto-detected
|
| 9 |
+
from available credentials and runtime environment:
|
| 10 |
* anthropic - Claude Opus / Sonnet via the Anthropic SDK;
|
| 11 |
system block is `cache_control:ephemeral` so
|
| 12 |
subsequent calls hit the 5-minute prefix cache.
|
|
|
|
| 15 |
huggingface_hub InferenceClient. Works on HF
|
| 16 |
Spaces with the Space's free inference credits;
|
| 17 |
locally requires HF_TOKEN.
|
| 18 |
+
* zerogpu - Open model (Phi-4-mini-instruct by default)
|
| 19 |
+
loaded LOCALLY in the Space via transformers,
|
| 20 |
+
decorated with `@spaces.GPU` so a HuggingFace
|
| 21 |
+
Pro plan gets free on-demand A100/H100 GPU
|
| 22 |
+
allocation per request. No per-call credit burn;
|
| 23 |
+
no API round-trip. Requires the Space to have a
|
| 24 |
+
Pro owner; locally falls back to CPU (slow).
|
| 25 |
- Output: two Gradio tabs: markdown writeup + raw JSON.
|
| 26 |
|
| 27 |
Engine/Site boundary (Principle VIII): this app lives in gradio-apps/
|
|
|
|
| 188 |
|
| 189 |
ANTHROPIC_MODEL_ID = os.environ.get("MODEL_ID", "claude-opus-4-7")
|
| 190 |
HF_MODEL_ID = os.environ.get("HF_MODEL_ID", "google/gemma-2-9b-it")
|
| 191 |
+
ZEROGPU_MODEL_ID = os.environ.get("ZEROGPU_MODEL_ID", "microsoft/Phi-4-mini-instruct")
|
| 192 |
+
ZEROGPU_DURATION_SECONDS = int(os.environ.get("ZEROGPU_DURATION_SECONDS", "120"))
|
| 193 |
MAX_DESCRIPTION_WORDS = int(os.environ.get("MAX_DESCRIPTION_WORDS", "5000"))
|
| 194 |
MIN_DESCRIPTION_WORDS = 200
|
| 195 |
|
| 196 |
|
| 197 |
+
# ZeroGPU availability is detected at import time. The `spaces` package
|
| 198 |
+
# is HuggingFace's runtime for on-demand GPU allocation; `transformers`
|
| 199 |
+
# + `torch` are required to actually load and run the model. All three
|
| 200 |
+
# must be importable for the zerogpu backend to function.
|
| 201 |
+
try:
|
| 202 |
+
import spaces as _spaces
|
| 203 |
+
import torch as _torch
|
| 204 |
+
from transformers import AutoModelForCausalLM as _AutoModelForCausalLM
|
| 205 |
+
from transformers import AutoTokenizer as _AutoTokenizer
|
| 206 |
+
|
| 207 |
+
_ZEROGPU_DEPS_AVAILABLE = True
|
| 208 |
+
except ImportError:
|
| 209 |
+
_ZEROGPU_DEPS_AVAILABLE = False
|
| 210 |
+
|
| 211 |
+
|
| 212 |
+
def _zerogpu_available() -> bool:
|
| 213 |
+
"""Return whether the zerogpu backend can be used. Wrapped as a
|
| 214 |
+
function so tests can monkeypatch the answer without touching the
|
| 215 |
+
real torch/transformers imports."""
|
| 216 |
+
return _ZEROGPU_DEPS_AVAILABLE
|
| 217 |
+
|
| 218 |
+
|
| 219 |
# ---------------------------------------------------------------------------
|
| 220 |
# Provider abstraction (anthropic vs huggingface; selectable at runtime)
|
| 221 |
# ---------------------------------------------------------------------------
|
|
|
|
| 223 |
|
| 224 |
def _detect_provider(env=None) -> str:
|
| 225 |
"""Pick a model provider from env. Order of precedence:
|
| 226 |
+
1. Explicit MODEL_PROVIDER (anthropic | huggingface | zerogpu).
|
| 227 |
+
2. Running on a HuggingFace Space (SPACE_ID set) AND the zerogpu
|
| 228 |
+
deps (spaces + transformers + torch) are importable → zerogpu.
|
| 229 |
+
This is the Pro-plan free-GPU path.
|
| 230 |
+
3. Presence of ANTHROPIC_API_KEY → anthropic.
|
| 231 |
+
4. Presence of HF_TOKEN / HUGGING_FACE_HUB_TOKEN, or running on
|
| 232 |
+
a HuggingFace Space without zerogpu deps → huggingface.
|
| 233 |
+
5. Fall through to anthropic (call-time error will tell the user
|
| 234 |
which env to set).
|
| 235 |
"""
|
| 236 |
env = env if env is not None else os.environ
|
| 237 |
explicit = env.get("MODEL_PROVIDER", "").strip().lower()
|
| 238 |
+
if explicit in ("anthropic", "huggingface", "zerogpu"):
|
| 239 |
return explicit
|
| 240 |
+
if env.get("SPACE_ID") and _zerogpu_available():
|
| 241 |
+
return "zerogpu"
|
| 242 |
if env.get("ANTHROPIC_API_KEY"):
|
| 243 |
return "anthropic"
|
| 244 |
if (
|
|
|
|
| 296 |
return resp.choices[0].message.content
|
| 297 |
|
| 298 |
|
| 299 |
+
# ZeroGPU backend. The model is loaded once on first call (lazy) and
|
| 300 |
+
# kept warm in module-level state so subsequent requests reuse it.
|
| 301 |
+
# The `@spaces.GPU` decorator MUST be applied at function-definition
|
| 302 |
+
time on a Pro Space; outside a Space, the decorator is a no-op and
|
| 303 |
+
# the function just runs on CPU (very slow, useful only for smoke tests).
|
| 304 |
+
_zerogpu_model = None
|
| 305 |
+
_zerogpu_tokenizer = None
|
| 306 |
+
|
| 307 |
+
|
| 308 |
+
def _load_zerogpu_model():
|
| 309 |
+
"""Load the model + tokenizer once. Called lazily on first request
|
| 310 |
+
so module import stays fast (the model weights are several GB or more)."""
|
| 311 |
+
global _zerogpu_model, _zerogpu_tokenizer
|
| 312 |
+
if _zerogpu_model is not None:
|
| 313 |
+
return
|
| 314 |
+
_zerogpu_tokenizer = _AutoTokenizer.from_pretrained(ZEROGPU_MODEL_ID)
|
| 315 |
+
_zerogpu_model = _AutoModelForCausalLM.from_pretrained(
|
| 316 |
+
ZEROGPU_MODEL_ID,
|
| 317 |
+
torch_dtype=_torch.bfloat16,
|
| 318 |
+
device_map="auto",
|
| 319 |
+
)
|
| 320 |
+
|
| 321 |
+
|
| 322 |
+
if _ZEROGPU_DEPS_AVAILABLE:
|
| 323 |
+
|
| 324 |
+
@_spaces.GPU(duration=ZEROGPU_DURATION_SECONDS)
|
| 325 |
+
def _call_zerogpu(system_block: str, user_prompt: str) -> str:
|
| 326 |
+
"""ZeroGPU backend. Loads Phi-4-mini-instruct (or whatever
|
| 327 |
+
ZEROGPU_MODEL_ID points at) into the Space's allocated GPU and
|
| 328 |
+
runs chat-template inference. Returns the assistant text only;
|
| 329 |
+
prompt tokens are stripped before decoding."""
|
| 330 |
+
_load_zerogpu_model()
|
| 331 |
+
messages = [
|
| 332 |
+
{"role": "system", "content": system_block},
|
| 333 |
+
{"role": "user", "content": user_prompt},
|
| 334 |
+
]
|
| 335 |
+
inputs = _zerogpu_tokenizer.apply_chat_template(
|
| 336 |
+
messages,
|
| 337 |
+
return_tensors="pt",
|
| 338 |
+
add_generation_prompt=True,
|
| 339 |
+
).to(_zerogpu_model.device)
|
| 340 |
+
outputs = _zerogpu_model.generate(
|
| 341 |
+
inputs,
|
| 342 |
+
max_new_tokens=2500,
|
| 343 |
+
temperature=0.2,
|
| 344 |
+
do_sample=True,
|
| 345 |
+
pad_token_id=_zerogpu_tokenizer.eos_token_id,
|
| 346 |
+
)
|
| 347 |
+
prompt_len = inputs.shape[1]
|
| 348 |
+
return _zerogpu_tokenizer.decode(
|
| 349 |
+
outputs[0][prompt_len:], skip_special_tokens=True
|
| 350 |
+
)
|
| 351 |
+
|
| 352 |
+
else:
|
| 353 |
+
|
| 354 |
+
def _call_zerogpu(system_block: str, user_prompt: str) -> str:
|
| 355 |
+
raise RuntimeError(
|
| 356 |
+
"ZeroGPU backend requires `spaces`, `transformers`, and `torch` "
|
| 357 |
+
"to be importable AND should be run on a HuggingFace Pro Space "
|
| 358 |
+
"for free on-demand GPU. Install the full requirements.txt and "
|
| 359 |
+
"deploy to a Space, or pick anthropic / huggingface from the "
|
| 360 |
+
"provider dropdown."
|
| 361 |
+
)
|
| 362 |
+
|
| 363 |
+
|
| 364 |
PROVIDERS = {
|
| 365 |
"anthropic": _call_anthropic,
|
| 366 |
"huggingface": _call_huggingface,
|
| 367 |
+
"zerogpu": _call_zerogpu,
|
| 368 |
}
|
| 369 |
|
| 370 |
|
|
|
|
| 496 |
except Exception as e:
|
| 497 |
# API timeout / rate limit / auth / server / network failure
|
| 498 |
# (Anthropic SDK or huggingface_hub InferenceClient).
|
| 499 |
+
model_label = {
|
| 500 |
+
"anthropic": ANTHROPIC_MODEL_ID,
|
| 501 |
+
"huggingface": HF_MODEL_ID,
|
| 502 |
+
"zerogpu": ZEROGPU_MODEL_ID,
|
| 503 |
+
}.get(provider, provider)
|
| 504 |
return (
|
| 505 |
f"The diagnostic call to {provider} ({model_label}) failed "
|
| 506 |
f"({type(e).__name__}). Try again in a moment, switch providers in "
|
|
|
|
| 545 |
|
| 546 |
provider_choices = [
|
| 547 |
(f"Anthropic - {ANTHROPIC_MODEL_ID} (requires ANTHROPIC_API_KEY)", "anthropic"),
|
| 548 |
+
(f"HuggingFace API - {HF_MODEL_ID} (free on HF Spaces; HF_TOKEN locally)", "huggingface"),
|
| 549 |
]
|
| 550 |
+
if _zerogpu_available():
|
| 551 |
+
provider_choices.append((
|
| 552 |
+
f"ZeroGPU - {ZEROGPU_MODEL_ID} (free GPU via HuggingFace Pro plan)",
|
| 553 |
+
"zerogpu",
|
| 554 |
+
))
|
| 555 |
|
| 556 |
with gr.Blocks(title="The Compounding Test") as demo:
|
| 557 |
gr.Markdown(
|
|
|
|
| 586 |
label="Model provider",
|
| 587 |
info=(
|
| 588 |
"Claude gives the highest-quality writeups but needs your "
|
| 589 |
+
"own ANTHROPIC_API_KEY. ZeroGPU runs an open-weight model "
|
| 590 |
+
"(Phi-4-mini-instruct by default) on the Space's free Pro "
|
| 591 |
+
"GPU - no API costs, no inference credits. HuggingFace API "
|
| 592 |
+
"routes to an open model through the HF Inference Providers "
|
| 593 |
+
"API - works without any keys on a Space via the Space's "
|
| 594 |
+
"monthly credits. Switch to compare writeup quality."
|
| 595 |
),
|
| 596 |
)
|
| 597 |
submit = gr.Button("Diagnose", variant="primary")
|
|
@@ -1,5 +1,19 @@
|
|
|
|
|
| 1 |
gradio>=4.0
|
| 2 |
-
anthropic>=0.39
|
| 3 |
-
huggingface_hub>=0.27
|
| 4 |
python-dotenv>=1.0
|
| 5 |
pytest>=8.0
|
|
|
|
| 1 |
+
# Core (always required)
|
| 2 |
gradio>=4.0
|
|
|
|
|
|
|
| 3 |
python-dotenv>=1.0
|
| 4 |
pytest>=8.0
|
| 5 |
+
|
| 6 |
+
# Anthropic backend (only needed if MODEL_PROVIDER=anthropic)
|
| 7 |
+
anthropic>=0.39
|
| 8 |
+
|
| 9 |
+
# HuggingFace API backend (only needed if MODEL_PROVIDER=huggingface)
|
| 10 |
+
huggingface_hub>=0.27
|
| 11 |
+
|
| 12 |
+
# ZeroGPU backend (only needed if MODEL_PROVIDER=zerogpu, i.e. running
|
| 13 |
+
# on a HuggingFace Pro Space with on-demand A100/H100 allocation).
|
| 14 |
+
# These are heavy (~2GB total via torch); local-only users who do not
|
| 15 |
+
# plan to use the zerogpu backend can omit these.
|
| 16 |
+
spaces>=0.30
|
| 17 |
+
transformers>=4.45
|
| 18 |
+
torch>=2.4
|
| 19 |
+
accelerate>=1.0
|
|
@@ -16,6 +16,7 @@ from app import (
|
|
| 16 |
_detect_provider,
|
| 17 |
parse_response,
|
| 18 |
)
|
|
|
|
| 19 |
|
| 20 |
|
| 21 |
# --- Fixtures ---------------------------------------------------------------
|
|
@@ -205,11 +206,36 @@ def test_detect_provider_huggingface_when_only_hf_token_set():
|
|
| 205 |
assert _detect_provider({"HF_TOKEN": "hf-xxx"}) == "huggingface"
|
| 206 |
|
| 207 |
|
| 208 |
-
def
|
| 209 |
-
#
|
|
|
|
| 210 |
assert _detect_provider({"SPACE_ID": "mile-hi-ai/compounding-test"}) == "huggingface"
|
| 211 |
|
| 212 |
| 213 |
def test_detect_provider_alt_hf_token_var():
|
| 214 |
# HuggingFace SDKs also recognize HUGGING_FACE_HUB_TOKEN
|
| 215 |
assert _detect_provider({"HUGGING_FACE_HUB_TOKEN": "hf-xxx"}) == "huggingface"
|
|
|
|
| 16 |
_detect_provider,
|
| 17 |
parse_response,
|
| 18 |
)
|
| 19 |
+
import app as app_module
|
| 20 |
|
| 21 |
|
| 22 |
# --- Fixtures ---------------------------------------------------------------
|
|
|
|
| 206 |
assert _detect_provider({"HF_TOKEN": "hf-xxx"}) == "huggingface"
|
| 207 |
|
| 208 |
|
| 209 |
+
def test_detect_provider_huggingface_when_running_on_hf_space_without_zerogpu(monkeypatch):
|
| 210 |
+
# On a Space WITHOUT ZeroGPU deps installed, fall back to the inference API.
|
| 211 |
+
monkeypatch.setattr(app_module, "_zerogpu_available", lambda: False)
|
| 212 |
assert _detect_provider({"SPACE_ID": "mile-hi-ai/compounding-test"}) == "huggingface"
|
| 213 |
|
| 214 |
|
| 215 |
+
def test_detect_provider_prefers_zerogpu_on_pro_space_with_deps(monkeypatch):
|
| 216 |
+
# On a Space WITH ZeroGPU deps installed (transformers + torch + spaces),
|
| 217 |
+
# default to the free GPU backend rather than burning inference credits.
|
| 218 |
+
monkeypatch.setattr(app_module, "_zerogpu_available", lambda: True)
|
| 219 |
+
assert _detect_provider({"SPACE_ID": "mile-hi-ai/compounding-test"}) == "zerogpu"
|
| 220 |
+
|
| 221 |
+
|
| 222 |
+
def test_detect_provider_explicit_anthropic_wins_over_zerogpu(monkeypatch):
|
| 223 |
+
# Explicit MODEL_PROVIDER beats the zerogpu auto-detect even on a Pro Space.
|
| 224 |
+
monkeypatch.setattr(app_module, "_zerogpu_available", lambda: True)
|
| 225 |
+
env = {"MODEL_PROVIDER": "anthropic", "SPACE_ID": "mile-hi-ai/compounding-test"}
|
| 226 |
+
assert _detect_provider(env) == "anthropic"
|
| 227 |
+
|
| 228 |
+
|
| 229 |
+
def test_detect_provider_explicit_zerogpu_wins():
|
| 230 |
+
assert _detect_provider({"MODEL_PROVIDER": "zerogpu"}) == "zerogpu"
|
| 231 |
+
|
| 232 |
+
|
| 233 |
+
def test_zerogpu_is_in_providers_dict():
|
| 234 |
+
# Even when deps aren't installed locally, the provider key exists so the
|
| 235 |
+
# dispatcher stays uniform (the stub raises a clear error if invoked).
|
| 236 |
+
assert "zerogpu" in PROVIDERS
|
| 237 |
+
|
| 238 |
+
|
| 239 |
def test_detect_provider_alt_hf_token_var():
|
| 240 |
# HuggingFace SDKs also recognize HUGGING_FACE_HUB_TOKEN
|
| 241 |
assert _detect_provider({"HUGGING_FACE_HUB_TOKEN": "hf-xxx"}) == "huggingface"
|