Spaces:
Running on Zero
A newer version of the Gradio SDK is available: 6.19.0
Deploying & configuring the model-serving apps
This guide covers prerequisites, deployment, configuration knobs, auth, GPU sizing, and wiring the endpoints into the engine.
The serving layer is deliberately small: it's Modal's canonical vLLM recipe β an
autoscaling @app.function that launches vllm serve as a subprocess behind a
@modal.web_server β applied once in service.py to every model in
catalogue.py. See ADR-0034 for why we stripped the earlier snapshot / FP8 /
structured-logging machinery back to this core.
Prerequisites
pip install -r modal/requirements.txt
modal token new # one-time auth with your Modal workspace
Gated repos (Gemma, and the Nemotron repos here) require a Hugging Face token. Accept each model's license on its Hugging Face page, then create the secret:
modal secret create huggingface-secret HF_TOKEN=hf_xxx
Only models with gated=True mount this secret; ungated models deploy without it.
Deploy
Each provider is its own Modal app, deployed independently:
modal deploy modal/app_nvidia.py # Nemotron 3 Nano 4B + 30B, Cascade 14B
modal deploy modal/app_openbmb.py # MiniCPM4.1-8B + MiniCPM-o 4.5
modal deploy modal/app_google.py # Gemma 4 12B + 26B
Use modal serve modal/app_<provider>.py for a hot-reloading dev session.
Or deploy one, several, or all providers with a single uv command β a thin wrapper that exposes the two deploy-time env knobs as flags:
uv run scripts/deploy_modal.py # all providers
uv run scripts/deploy_modal.py nvidia openbmb # just these
uv run scripts/deploy_modal.py nvidia --keep-warm # = MODAL_LLM_KEEP_WARM=1
# --auth β MODAL_LLM_REQUIRE_AUTH=1, --dry-run to preview the commands.
Run these from the repo root; the script's own directory (modal/) is on
sys.path, so from service import ... / from catalogue import ... resolve,
and import modal still binds the installed SDK (the folder name does not
shadow it).
Endpoints
Each model becomes its own OpenAI-compatible endpoint. Modal builds the URL from
the modal.App name and the function's endpoint_name:
https://<workspace>--<app-name>-<endpoint-name>.modal.run/v1
<app-name> is nvidia-llms, openbmb-llms, or google-llms (one per provider
app); <endpoint-name> is the per-model slug. e.g. the Nemotron 4B endpoint is
https://<workspace>--nvidia-llms-nemotron-3-nano-4b.modal.run/v1.
Model id vs URL slug. The
--modelvalue (and the"model"field in a raw request) is the served model id β the HF repo id, e.g.nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16β becauseserved_model_namedefaults to the reponame. It is not the URL slug (nemotron-3-nano-4b). Call/v1/modelson any endpoint to see the exact id it serves.
Standard routes: /v1/chat/completions, /v1/completions, /v1/models, plus
/docs for the Swagger UI. Smoke-test one:
python modal/client.py \
--base-url https://<workspace>--google-llms-gemma-4-12b.modal.run/v1 \
--model google/gemma-4-12B \
--prompt "Describe a mossy ticket booth in the wood."
Configuring models (per task)
All knobs live in catalogue.py as ModelConfig fields β no serving code
changes needed:
| Field | Purpose |
|---|---|
gpu |
Modal GPU spec, e.g. H200:1, H100:2, L40S:1, L4:1. |
tensor_parallel_size |
Shard across GPUs; set equal to the GPU count in gpu. |
max_model_len |
Cap context length to fit memory / tune throughput. |
max_concurrent_inputs |
Hard ceiling of requests multiplexed onto one container (autoscale target is ~75% of it). |
scaledown_window |
Idle seconds before a container stops (cold-start vs. cost). |
min_containers |
Keep N warm to eliminate cold starts (always-on cost). |
gpu_memory_utilization |
Fraction of VRAM for weights + KV cache (vLLM default 0.9); raise for a bigger KV cache. |
enable_prefix_caching |
Reuse the KV cache for shared prompt prefixes (on by default β big win when the system prompt / ledger context repeats across the cast). |
async_scheduling |
Overlap CPU request scheduling with GPU compute (on by default; off for the Transformers-backend Gemma 12B + omni models). |
enforce_eager |
Skip CUDA-graph capture β faster cold start, lower steady-state throughput. |
log_requests |
Log each request's id, sampling params, and token counts (on by default). |
reasoning_parser / tool_call_parser / enable_auto_tool_choice |
OpenAI tool/reasoning features (vLLM parser names; leave None if unsupported). |
mm_limits |
Per-prompt image/audio/video caps; set to 0 on an auto-detected-multimodal model you serve text-only. |
trust_remote_code |
Required by MiniCPM / Nemotron custom modeling code. |
vllm_version |
Per-model inference-stack pin (escape hatch); None = the default VLLM_VERSION, "nightly" = latest nightly wheel, else a pinned version. |
extra_vllm_args |
Raw vllm serve flags appended verbatim β the escape hatch for anything not modelled above (quantization, batch caps, custom parser plugins, β¦). |
extra_pip / env |
Extra image deps / container env (escape hatch). |
Per-model vLLM version. The image pins
VLLM_VERSION(seeservice.py) for reproducible deploys. A single model can override it viavllm_versionwhen the pinned release can't serve its architecture β this is scoped to that model's image, so one model's bump never touches another provider's app. Only the Gemma 4 12B setsvllm_version="nightly"(plustransformers>=5.10.2) because itsgemma4_unifiedarchitecture has no class in any stable vLLM β€0.22.1. The Gemma 4 26B is a standard MoE arch that serves on the pinned stable release, so it stays on the default pin.
Performance tuning
The serving path follows Modal's high-performance-LLM-inference guidance, so the defaults are already tuned for throughput; the knobs above let you push further per model:
- Prefix caching is on by default. In a multi-agent cast the system prompt and shared ledger context repeat across nearly every call, so reusing the KV cache for that shared prefix is the single largest win β leave it on.
- CUDA graphs are kept, their cost is amortized. Containers capture CUDA
graphs (no
enforce_eager) for best steady-state throughput, and the compile / graph cache is persisted on the sharedvllm-cacheVolume (VLLM_CACHE_ROOT), so only the first container compiles β later cold starts replay the cached graphs. Setenforce_eager=Trueon a model only when its backend can't capture graphs (the Transformers-backend Gemma 12B) or when cold start dominates. - Async scheduling overlaps CPU request scheduling with GPU compute; on by default for native vLLM models, off where the backend doesn't support it.
- Autoscaling scales out at ~75% of
max_concurrent_inputswhile a hot container bursts up to the ceiling, so we add capacity before a container saturates rather than after. Usemin_containersto remove cold starts entirely (at always-on cost).
For memory-bound models, raise gpu_memory_utilization (more KV cache β more
concurrency); if a step OOMs, lower max_model_len or cap the batch via
extra_vllm_args (e.g. ("--max-num-seqs", "32")).
Cold starts
A scale-from-zero cold start pays container boot β weight load β engine warmup. Two mechanisms keep that bounded:
1. Shared caches (always on). Weights are pulled once onto the
huggingface-cache Volume and the torch.compile / CUDA-graph artifacts are
persisted on the vllm-cache Volume (VLLM_CACHE_ROOT). So a model downloads
once across every container and provider, and only the first container
compiles its graphs β later cold starts replay the cache.
2. Demo-day keep-warm (deploy-time, no code edits). Pin one warm container for every profile-bound model (tiny/fast/balanced/strong) right before a live demo β specialists keep scale-to-zero:
MODAL_LLM_KEEP_WARM=1 modal deploy modal/app_nvidia.py # one warm container per tier model
modal deploy modal/app_nvidia.py # back to scale-to-zero after
This burns GPU-hours while deployed; it's a switch for the hours around a demo,
not a steady state. min_containers in catalogue.py remains the per-model
override for anything finer-grained.
Cold-start clients must follow redirects: a Modal endpoint that hasn't answered
within ~150s returns a 303 to the same URL while the container finishes
booting (modal/healthcheck.py handles this; so does the engine's gateway).
Add a model
Append one ModelConfig to the appropriate provider list in catalogue.py (tag
its profile tier to make it a tier default). The engine picks it up with no
edits β it reads the same catalogue.py.
Add a provider
- Add a
<PROVIDER>_MODELSlist and aPROVIDERS["<provider>"]entry (carrying itsappname) incatalogue.py. - Create
app_<provider>.pythat reads that entry:app = modal.App(PROVIDERS["<provider>"].app)thenregister_all(app, PROVIDERS["<provider>"].models).
Lower precision (quantization)
Every model repo here ships BF16 weights and serves at full precision. To
shrink a model's footprint β fit it on a smaller GPU, or free VRAM for a longer
context / more concurrency β pass vLLM's quantization flags through the
extra_vllm_args escape hatch on its ModelConfig:
extra_vllm_args=("--quantization", "fp8", "--kv-cache-dtype", "fp8")
This is purely serving-side: --served-model-name is unchanged, so the engine,
endpoint URLs, and the running cast are untouched.
Not every architecture serves under on-the-fly FP8. It needs an Ada/Hopper GPU (our L4/L40S/H200 all qualify) and vLLM support for the model's arch. Custom-code / hybrid-Mamba archs (the Nemotron Nanos, MiniCPM) and the Transformers-backend Gemma 12B may fail to boot under it. Verify a model after adding the flag (
modal/healthcheck.pyorcurl <url>/v1/models); if it won't start, drop the flag. This is why every model defaults to full precision.
Auth
Modal web endpoints are public by default. Secrets are supplied as environment variables (never hard-coded). To require a bearer token:
# Key MUST be VLLM_API_KEY (vLLM reads it); value is the token clients send.
modal secret create llm-api-key VLLM_API_KEY=sk-your-token
# Turn auth on at deploy time β no code edits:
MODAL_LLM_REQUIRE_AUTH=1 modal deploy modal/app_google.py
When MODAL_LLM_REQUIRE_AUTH is set, every endpoint mounts the llm-api-key
secret as the VLLM_API_KEY env var and vLLM enforces Authorization: Bearer <token> (401 otherwise). Clients pass the same token (the bundled client.py
reads it from LLM_API_KEY). Alternatively front endpoints with Modal Proxy
Auth Tokens (see docs/modal-llms.txt β Proxy Auth Tokens).
See openapi.md for the full API reference and the checked-in
OpenAPI spec (../openapi.yaml).
Observability & logging
Every container's stdout/stderr is captured by Modal β watch it live with
modal app logs <app-name> or in the dashboard. Each endpoint runs vLLM with
--enable-log-requests (toggle via log_requests), so every call logs its
request id, sampling params, and (on completion) prompt/generation token counts
and finish reason. Clients can pass an X-Request-Id header and it shows up in
the request logs β handy for correlating an engine call with its server-side line.
Throughput, KV-cache usage, and prefix-cache hit rate are logged every second
(VLLM_LOG_STATS_INTERVAL) and also exposed as Prometheus metrics at /metrics.
GPU sizing cheatsheet
BF16 weights β 2 bytes/param; leave headroom for the KV cache. MoE models load all expert weights even though only a slice activates per token, so size to the total parameter count.
| Model | Params (total / active) | Starting GPU |
|---|---|---|
| Nemotron-Cascade-14B-Thinking | ~14B (dense, Qwen3) | L40S:1 |
| Nemotron-3-Nano-4B | ~4B (Tiny Titan) | L4:1 |
| MiniCPM-o-4_5 (omni) | ~9B + media encoders | L40S:1 |
| MiniCPM4.1-8B | 8B | L40S:1 |
| Gemma-4-26B-A4B-it | ~25B / ~4B (MoE) | A100:1 |
| Gemma-4-12B-it | ~12B (dense) | L40S:1 |
These are starting points. If a container OOMs, lower max_model_len, raise the
GPU tier, or bump tensor_parallel_size (and the GPU count) for sharding.
Engine integration
The engine reads this same catalogue.py (by path, via
src/models/modal_catalogue.py) and routes every profile through the LiteLLM
gateway (ADR-0015 / ADR-0019). You don't wire endpoints by hand β set the
workspace and the four tiers bind automatically from config/models.yaml:
export MODAL_WORKSPACE="<your-workspace>" # activates the live path
export MODAL_LLM_KEY="EMPTY" # or the configured VLLM_API_KEY
Each profile's endpoint URL is derived as
https://${MODAL_WORKSPACE}--<app>-<endpoint>.modal.run/v1. To point a profile at
a different catalogue model, change its endpoint: in config/models.yaml; to
override the model string outright, set MODEL_TINY/FAST/BALANCED/STRONG. For a
one-off single endpoint (e.g. a local dev box), set MODAL_LLM_BASE_URL instead
of MODAL_WORKSPACE.