fix(hf-space): lower ZEROGPU_DURATION_SECONDS 120 → 60
User question: how to get inference working without a higher quota?
Answer: cut the per-call reservation in half. @spaces.GPU(duration=N)
reserves N seconds from the daily quota PER CALL regardless of actual
inference time. With 120s reserved per call and ~5 min daily quota on
Pro tier, the user fit ~2-3 calls before hitting "120s requested vs.
168s left, try again in 21:21:10".
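The quota arithmetic above can be sketched as follows. This is a minimal illustration, assuming the daily quota is exactly 300 seconds (the message only says "~5 min") and that every call deducts its full reservation; `calls_per_window` and `DAILY_QUOTA_S` are hypothetical names, not part of the codebase.

```python
def calls_per_window(quota_s: int, duration_s: int) -> int:
    """Whole calls that fit when each call reserves duration_s seconds."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return quota_s // duration_s


# Assumption: "~5 min" of daily quota taken as exactly 300 seconds.
DAILY_QUOTA_S = 300

for duration in (120, 60):
    print(f"{duration}s reservation -> {calls_per_window(DAILY_QUOTA_S, duration)} calls")
```

Under that assumed quota, a 120s reservation fits 2 whole calls and a 60s reservation fits 5, which is roughly the doubling this change aims for.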
Per the just-collected latency baseline in
specs/004-berkshire-test/sc-005-latency-baseline.md:
  - cold-start (first call after Space sleep): ~37s
  - warm (subsequent calls): ~15s
60s leaves margin for cold-start while halving the per-call quota cost.
Net effect: ~2x more submissions per daily window with zero loss
of functionality. If a cold-start ever takes >60s, the call fails with
the F14 friendly error and the user retries, the same UX as today.
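The margin reasoning can be checked with a few lines. This is a sketch only: the latency figures are the approximate baseline values quoted in this message, and `headroom` and `fits` are hypothetical helpers, not functions in the Space.

```python
# Approximate baseline latencies quoted in the commit message (seconds).
COLD_START_S = 37
WARM_S = 15


def headroom(duration_s: int, observed_s: int) -> int:
    """Seconds left in the reservation after a call of observed_s seconds."""
    return duration_s - observed_s


def fits(duration_s: int, observed_s: int) -> bool:
    """True when the call finishes within its reserved duration."""
    return observed_s <= duration_s


for label, latency in (("cold", COLD_START_S), ("warm", WARM_S)):
    print(label, headroom(60, latency), fits(60, latency))
```

With a 60s reservation this leaves about 23s of slack on a cold start and 45s on a warm call; a cold start slower than 60s is the case that would fail and need a retry.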
Pro-tier max is 120s; the user can raise it via the ZEROGPU_DURATION_SECONDS
env var if their quota tier supports it.
Tests: 64 passed, 1 skipped (no test impact — duration is a deploy-time
constant, not exercised in unit tests).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -189,10 +189,14 @@ ROOT = Path(__file__).parent
 ANTHROPIC_MODEL_ID = os.environ.get("MODEL_ID", "claude-opus-4-7")
 HF_MODEL_ID = os.environ.get("HF_MODEL_ID", "google/gemma-2-9b-it")
 ZEROGPU_MODEL_ID = os.environ.get("ZEROGPU_MODEL_ID", "microsoft/Phi-4-mini-instruct")
-# ZeroGPU
-#
-#
-
+# ZeroGPU reserves this many seconds from the Space owner's daily quota
+# per request, regardless of actual inference time. Per the latency
+# baseline in specs/004-berkshire-test/sc-005-latency-baseline.md:
+# cold-start runs ~37s, warm runs ~15s. 60s leaves margin for cold-start
+# without burning quota the way 120s did. Halving the reservation
+# effectively doubles the number of submissions per quota window.
+# Pro-tier max is 120s; raise via env if you have headroom.
+ZEROGPU_DURATION_SECONDS = int(os.environ.get("ZEROGPU_DURATION_SECONDS", "60"))
 MAX_DESCRIPTION_WORDS = int(os.environ.get("MAX_DESCRIPTION_WORDS", "5000"))
 MIN_DESCRIPTION_WORDS = 200
 
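For context, a constant like this is typically handed to the `@spaces.GPU(duration=N)` decorator mentioned in the commit message. The sketch below is an assumption about how that wiring might look, not the Space's actual code: `run_inference` is a hypothetical placeholder, and the `ImportError` fallback exists only so the snippet runs outside a Space where the `spaces` package is not installed.

```python
import os

try:
    import spaces  # Hugging Face ZeroGPU helper, present inside a Space

    gpu = spaces.GPU
except ImportError:
    # Local fallback (assumption): a no-op decorator factory with the same shape.
    def gpu(duration=60):
        def wrap(fn):
            return fn

        return wrap


# Same pattern as the diff: env override with a 60-second default.
ZEROGPU_DURATION_SECONDS = int(os.environ.get("ZEROGPU_DURATION_SECONDS", "60"))


@gpu(duration=ZEROGPU_DURATION_SECONDS)
def run_inference(prompt: str) -> str:
    # Hypothetical stand-in for the real model call.
    return f"echo: {prompt}"


print(run_inference("hello"))
```

Because the value is read once at import time, changing the env var only takes effect after the Space restarts, which matches the message's description of it as a deploy-time constant.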