Spaces:

AshwinP
/

compounding-test

Sleeping

apingali Claude Opus 4.7 (1M context) commited on 20 days ago

Commit

3c7d5bb

1 Parent(s): e084f76

fix(hf-space): lower ZEROGPU_DURATION_SECONDS 120 → 60

User question: how to get inference working without a higher quota?

Answer: cut the per-call reservation in half. @spaces.GPU(duration=N)
reserves N seconds from the daily quota PER CALL regardless of actual
inference time. With 120s reserved per call and ~5 min daily quota on
Pro tier, the user fit ~2-3 calls before hitting "120s requested vs.
168s left, try again in 21:21:10".

Per the just-collected latency baseline in
specs/004-berkshire-test/sc-005-latency-baseline.md:
cold-start (first call after Space sleep): ~37s
warm (subsequent calls): ~15s

60s leaves margin for cold-start while halving the per-call quota cost.
Effective effect: ~2x more submissions per daily window with zero loss
of functionality. If a cold-start ever takes >60s, the call fails with
the F14 friendly error and the user retries — same UX as today.

Pro-tier max is 120s; the user can raise via the ZEROGPU_DURATION_SECONDS
env var if their quota tier supports it.

Tests: 64 passed, 1 skipped (no test impact — duration is a deploy-time
constant, not exercised in unit tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (1) hide show

app.py +8 -4

app.py CHANGED Viewed

@@ -189,10 +189,14 @@ ROOT = Path(__file__).parent
 ANTHROPIC_MODEL_ID = os.environ.get("MODEL_ID", "claude-opus-4-7")
 HF_MODEL_ID = os.environ.get("HF_MODEL_ID", "google/gemma-2-9b-it")
 ZEROGPU_MODEL_ID = os.environ.get("ZEROGPU_MODEL_ID", "microsoft/Phi-4-mini-instruct")
-# ZeroGPU's per-request duration cap depends on the Space owner's quota
-# tier. 120s is the Pro-tier default; we found 300s exceeds the limit.
-# Override via env var if your tier allows longer.
-ZEROGPU_DURATION_SECONDS = int(os.environ.get("ZEROGPU_DURATION_SECONDS", "120"))
 MAX_DESCRIPTION_WORDS = int(os.environ.get("MAX_DESCRIPTION_WORDS", "5000"))
 MIN_DESCRIPTION_WORDS = 200

 ANTHROPIC_MODEL_ID = os.environ.get("MODEL_ID", "claude-opus-4-7")
 HF_MODEL_ID = os.environ.get("HF_MODEL_ID", "google/gemma-2-9b-it")
 ZEROGPU_MODEL_ID = os.environ.get("ZEROGPU_MODEL_ID", "microsoft/Phi-4-mini-instruct")
+# ZeroGPU reserves this many seconds from the Space owner's daily quota
+# per request, regardless of actual inference time. Per the latency
+# baseline in specs/004-berkshire-test/sc-005-latency-baseline.md:
+# cold-start runs ~37s, warm runs ~15s. 60s leaves margin for cold-start
+# without burning quota the way 120s did. Halving the reservation
+# effectively doubles the number of submissions per quota window.
+# Pro-tier max is 120s; raise via env if you have headroom.
+ZEROGPU_DURATION_SECONDS = int(os.environ.get("ZEROGPU_DURATION_SECONDS", "60"))
 MAX_DESCRIPTION_WORDS = int(os.environ.get("MAX_DESCRIPTION_WORDS", "5000"))
 MIN_DESCRIPTION_WORDS = 200