apingali Claude Opus 4.7 (1M context) committed on
Commit 3c7d5bb · 1 Parent(s): e084f76

fix(hf-space): lower ZEROGPU_DURATION_SECONDS 120 → 60

User question: how to get inference working without a higher quota?

Answer: cut the per-call reservation in half. @spaces.GPU(duration=N)
reserves N seconds from the daily quota PER CALL regardless of actual
inference time. With 120s reserved per call and ~5 min daily quota on
Pro tier, the user could fit only ~2-3 calls before hitting "120s
requested vs. 168s left, try again in 21:21:10".
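
The quota arithmetic above can be sketched as follows. This is a minimal illustration, assuming a daily quota of roughly 300 seconds (the "~5 min" figure quoted above); `calls_per_window` and `DAILY_QUOTA_SECONDS` are hypothetical names and are not part of app.py.

```python
def calls_per_window(quota_seconds: int, reserved_per_call: int) -> int:
    """Return how many whole calls fit when each call reserves a fixed slice."""
    if reserved_per_call <= 0:
        raise ValueError("reserved_per_call must be positive")
    # The reservation is charged in full per call, so only whole slices count.
    return quota_seconds // reserved_per_call


# Assumption: "~5 min" daily quota, taken from the commit message.
DAILY_QUOTA_SECONDS = 300

print(calls_per_window(DAILY_QUOTA_SECONDS, 120))  # 2
print(calls_per_window(DAILY_QUOTA_SECONDS, 60))   # 5
```

Because the reservation is charged whether or not the GPU is busy for the full duration, the count depends only on the reserved value, not on the measured inference time.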

Per the just-collected latency baseline in
specs/004-berkshire-test/sc-005-latency-baseline.md:
  cold-start (first call after Space sleep): ~37s
  warm (subsequent calls): ~15s

60s leaves margin for cold-start while halving the per-call quota cost.
Net effect: ~2x as many submissions per daily window with no loss
of functionality. If a cold-start ever takes >60s, the call fails with
the F14 friendly error and the user retries, the same UX as today.
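
To make the margin concrete, here is a small sketch that compares a reservation against the two baseline figures quoted above. The helper `headroom` is hypothetical, and the latencies are the approximate values from the commit message, not fresh measurements.

```python
# Approximate latencies quoted from the baseline in the commit message.
COLD_START_SECONDS = 37
WARM_SECONDS = 15


def headroom(reserved_seconds: int, observed_seconds: int) -> int:
    """Seconds left over in the reservation; negative means the call would time out."""
    return reserved_seconds - observed_seconds


print(headroom(60, COLD_START_SECONDS))  # 23
print(headroom(60, WARM_SECONDS))        # 45
```

A negative result corresponds to the failure case described above, where the call exceeds its reservation and the user has to retry.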

Pro-tier max is 120s; the user can raise via the ZEROGPU_DURATION_SECONDS
env var if their quota tier supports it.
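
The commit reads the override with a plain `int(os.environ.get(...))`. As an optional hardening sketch, not what app.py does, the hypothetical helper below falls back to the default on malformed or non-positive values and can clamp to a ceiling if the caller supplies one. The `ceiling` parameter is an assumption added for illustration.

```python
import os
from typing import Mapping, Optional


def resolve_duration(
    env: Mapping[str, str],
    default: int = 60,
    ceiling: Optional[int] = None,
) -> int:
    """Read ZEROGPU_DURATION_SECONDS from env, falling back to default if unusable."""
    raw = env.get("ZEROGPU_DURATION_SECONDS", "")
    try:
        value = int(raw)
    except ValueError:
        # Missing or non-numeric value: keep the safe default.
        return default
    if value <= 0:
        return default
    if ceiling is not None:
        # Hypothetical clamp, e.g. to the tier maximum the owner knows about.
        value = min(value, ceiling)
    return value


# Usage: behaves like the commit's default when the variable is unset.
ZEROGPU_DURATION_SECONDS = resolve_duration(os.environ)
```

With the variable unset this yields 60, matching the new default; setting it to "120" restores the previous reservation.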

Tests: 64 passed, 1 skipped (no test impact — duration is a deploy-time
constant, not exercised in unit tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (1)
  1. app.py +8 -4
app.py CHANGED
@@ -189,10 +189,14 @@ ROOT = Path(__file__).parent
 ANTHROPIC_MODEL_ID = os.environ.get("MODEL_ID", "claude-opus-4-7")
 HF_MODEL_ID = os.environ.get("HF_MODEL_ID", "google/gemma-2-9b-it")
 ZEROGPU_MODEL_ID = os.environ.get("ZEROGPU_MODEL_ID", "microsoft/Phi-4-mini-instruct")
-# ZeroGPU's per-request duration cap depends on the Space owner's quota
-# tier. 120s is the Pro-tier default; we found 300s exceeds the limit.
-# Override via env var if your tier allows longer.
-ZEROGPU_DURATION_SECONDS = int(os.environ.get("ZEROGPU_DURATION_SECONDS", "120"))
+# ZeroGPU reserves this many seconds from the Space owner's daily quota
+# per request, regardless of actual inference time. Per the latency
+# baseline in specs/004-berkshire-test/sc-005-latency-baseline.md:
+# cold-start runs ~37s, warm runs ~15s. 60s leaves margin for cold-start
+# without burning quota the way 120s did. Halving the reservation
+# effectively doubles the number of submissions per quota window.
+# Pro-tier max is 120s; raise via env if you have headroom.
+ZEROGPU_DURATION_SECONDS = int(os.environ.get("ZEROGPU_DURATION_SECONDS", "60"))
 MAX_DESCRIPTION_WORDS = int(os.environ.get("MAX_DESCRIPTION_WORDS", "5000"))
 MIN_DESCRIPTION_WORDS = 200
 