Spaces:
Running on Zero
Running on Zero
| # Deploying & configuring the model-serving apps | |
| This guide covers prerequisites, deployment, configuration knobs, auth, GPU | |
| sizing, and wiring the endpoints into the engine. | |
| The serving layer is deliberately small: it's Modal's canonical vLLM recipe β an | |
| autoscaling `@app.function` that launches `vllm serve` as a subprocess behind a | |
| `@modal.web_server` β applied once in `service.py` to every model in | |
| `catalogue.py`. See ADR-0034 for why we stripped the earlier snapshot / FP8 / | |
| structured-logging machinery back to this core. | |
| ## Prerequisites | |
| ```bash | |
| pip install -r modal/requirements.txt | |
| modal token new # one-time auth with your Modal workspace | |
| ``` | |
| Gated repos (Gemma, and the Nemotron repos here) require a Hugging Face token. | |
| Accept each model's license on its Hugging Face page, then create the secret: | |
| ```bash | |
| modal secret create huggingface-secret HF_TOKEN=hf_xxx | |
| ``` | |
| Only models with `gated=True` mount this secret; ungated models deploy without it. | |
| ## Deploy | |
| Each provider is its own Modal app, deployed independently: | |
| ```bash | |
| modal deploy modal/app_nvidia.py # Nemotron 3 Nano 4B + 30B, Cascade 14B | |
| modal deploy modal/app_openbmb.py # MiniCPM4.1-8B + MiniCPM-o 4.5 | |
| modal deploy modal/app_google.py # Gemma 4 12B + 26B | |
| ``` | |
| Use `modal serve modal/app_<provider>.py` for a hot-reloading dev session. | |
| Or deploy one, several, or all providers with a single uv command β a thin | |
| wrapper that exposes the two deploy-time env knobs as flags: | |
| ```bash | |
| uv run scripts/deploy_modal.py # all providers | |
| uv run scripts/deploy_modal.py nvidia openbmb # just these | |
| uv run scripts/deploy_modal.py nvidia --keep-warm # = MODAL_LLM_KEEP_WARM=1 | |
| # --auth β MODAL_LLM_REQUIRE_AUTH=1, --dry-run to preview the commands. | |
| ``` | |
| Run these from the repo root; the script's own directory (`modal/`) is on | |
| `sys.path`, so `from service import ...` / `from catalogue import ...` resolve, | |
| and `import modal` still binds the installed SDK (the folder name does not | |
| shadow it). | |
| ## Endpoints | |
| Each model becomes its own OpenAI-compatible endpoint. Modal builds the URL from | |
| the `modal.App` name **and** the function's `endpoint_name`: | |
| ``` | |
| https://<workspace>--<app-name>-<endpoint-name>.modal.run/v1 | |
| ``` | |
| `<app-name>` is `nvidia-llms`, `openbmb-llms`, or `google-llms` (one per provider | |
| app); `<endpoint-name>` is the per-model slug. e.g. the Nemotron 4B endpoint is | |
| `https://<workspace>--nvidia-llms-nemotron-3-nano-4b.modal.run/v1`. | |
| > **Model id vs URL slug.** The `--model` value (and the `"model"` field in a raw | |
| > request) is the *served model id* β the HF repo id, e.g. | |
| > `nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16` β because `served_model_name` defaults to | |
| > the repo `name`. It is **not** the URL slug (`nemotron-3-nano-4b`). Call | |
| > `/v1/models` on any endpoint to see the exact id it serves. | |
| Standard routes: `/v1/chat/completions`, `/v1/completions`, `/v1/models`, plus | |
| `/docs` for the Swagger UI. Smoke-test one: | |
| ```bash | |
| python modal/client.py \ | |
| --base-url https://<workspace>--google-llms-gemma-4-12b.modal.run/v1 \ | |
| --model google/gemma-4-12B \ | |
| --prompt "Describe a mossy ticket booth in the wood." | |
| ``` | |
| ## Configuring models (per task) | |
| All knobs live in `catalogue.py` as `ModelConfig` fields β no serving code | |
| changes needed: | |
| | Field | Purpose | | |
| | ----------------------- | -------------------------------------------------------------- | | |
| | `gpu` | Modal GPU spec, e.g. `H200:1`, `H100:2`, `L40S:1`, `L4:1`. | | |
| | `tensor_parallel_size` | Shard across GPUs; set equal to the GPU count in `gpu`. | | |
| | `max_model_len` | Cap context length to fit memory / tune throughput. | | |
| | `max_concurrent_inputs` | Hard ceiling of requests multiplexed onto one container (autoscale target is ~75% of it). | | |
| | `scaledown_window` | Idle seconds before a container stops (cold-start vs. cost). | | |
| | `min_containers` | Keep N warm to eliminate cold starts (always-on cost). | | |
| | `gpu_memory_utilization` | Fraction of VRAM for weights + KV cache (vLLM default `0.9`); raise for a bigger KV cache. | | |
| | `enable_prefix_caching` | Reuse the KV cache for shared prompt prefixes (on by default β big win when the system prompt / ledger context repeats across the cast). | | |
| | `async_scheduling` | Overlap CPU request scheduling with GPU compute (on by default; off for the Transformers-backend Gemma 12B + omni models). | | |
| | `enforce_eager` | Skip CUDA-graph capture β faster cold start, lower steady-state throughput. | | |
| | `log_requests` | Log each request's id, sampling params, and token counts (on by default). | | |
| | `reasoning_parser` / `tool_call_parser` / `enable_auto_tool_choice` | OpenAI tool/reasoning features (vLLM parser names; leave None if unsupported). | | |
| | `mm_limits` | Per-prompt image/audio/video caps; set to 0 on an auto-detected-multimodal model you serve text-only. | | |
| | `trust_remote_code` | Required by MiniCPM / Nemotron custom modeling code. | | |
| | `vllm_version` | Per-model inference-stack pin (escape hatch); `None` = the default `VLLM_VERSION`, `"nightly"` = latest nightly wheel, else a pinned version. | | |
| | `extra_vllm_args` | Raw `vllm serve` flags appended verbatim β the escape hatch for anything not modelled above (quantization, batch caps, custom parser plugins, β¦). | | |
| | `extra_pip` / `env` | Extra image deps / container env (escape hatch). | | |
| > **Per-model vLLM version.** The image pins `VLLM_VERSION` (see `service.py`) for | |
| > reproducible deploys. A single model can override it via `vllm_version` when the | |
| > pinned release can't serve its architecture β this is scoped to that model's image, | |
| > so one model's bump never touches another provider's app. Only the Gemma 4 **12B** | |
| > sets `vllm_version="nightly"` (plus `transformers>=5.10.2`) because its | |
| > `gemma4_unified` architecture has no class in any stable vLLM β€0.22.1. The Gemma 4 | |
| > **26B** is a standard MoE arch that serves on the pinned stable release, so it | |
| > stays on the default pin. | |
| ### Performance tuning | |
| The serving path follows Modal's high-performance-LLM-inference guidance, so the | |
| defaults are already tuned for throughput; the knobs above let you push further | |
| per model: | |
| - **Prefix caching is on by default.** In a multi-agent cast the system prompt and | |
| shared ledger context repeat across nearly every call, so reusing the KV cache | |
| for that shared prefix is the single largest win β leave it on. | |
| - **CUDA graphs are kept, their cost is amortized.** Containers capture CUDA | |
| graphs (no `enforce_eager`) for best steady-state throughput, and the compile / | |
| graph cache is persisted on the shared `vllm-cache` Volume (`VLLM_CACHE_ROOT`), | |
| so only the *first* container compiles β later cold starts replay the cached | |
| graphs. Set `enforce_eager=True` on a model only when its backend can't capture | |
| graphs (the Transformers-backend Gemma 12B) or when cold start dominates. | |
| - **Async scheduling** overlaps CPU request scheduling with GPU compute; on by | |
| default for native vLLM models, off where the backend doesn't support it. | |
| - **Autoscaling** scales out at ~75% of `max_concurrent_inputs` while a hot | |
| container bursts up to the ceiling, so we add capacity before a container | |
| saturates rather than after. Use `min_containers` to remove cold starts | |
| entirely (at always-on cost). | |
| For memory-bound models, raise `gpu_memory_utilization` (more KV cache β more | |
| concurrency); if a step OOMs, lower `max_model_len` or cap the batch via | |
| `extra_vllm_args` (e.g. `("--max-num-seqs", "32")`). | |
| ### Cold starts | |
| A scale-from-zero cold start pays container boot β weight load β engine warmup. | |
| Two mechanisms keep that bounded: | |
| **1. Shared caches (always on).** Weights are pulled once onto the | |
| `huggingface-cache` Volume and the torch.compile / CUDA-graph artifacts are | |
| persisted on the `vllm-cache` Volume (`VLLM_CACHE_ROOT`). So a model downloads | |
| once across every container and provider, and only the *first* container | |
| compiles its graphs β later cold starts replay the cache. | |
| **2. Demo-day keep-warm (deploy-time, no code edits).** Pin one warm container | |
| for every *profile-bound* model (tiny/fast/balanced/strong) right before a live | |
| demo β specialists keep scale-to-zero: | |
| ```bash | |
| MODAL_LLM_KEEP_WARM=1 modal deploy modal/app_nvidia.py # one warm container per tier model | |
| modal deploy modal/app_nvidia.py # back to scale-to-zero after | |
| ``` | |
| This burns GPU-hours while deployed; it's a switch for the hours around a demo, | |
| not a steady state. `min_containers` in `catalogue.py` remains the per-model | |
| override for anything finer-grained. | |
| Cold-start clients must follow redirects: a Modal endpoint that hasn't answered | |
| within ~150s returns a `303` to the same URL while the container finishes | |
| booting (`modal/healthcheck.py` handles this; so does the engine's gateway). | |
| ### Add a model | |
| Append one `ModelConfig` to the appropriate provider list in `catalogue.py` (tag | |
| its `profile` tier to make it a tier default). The engine picks it up with no | |
| edits β it reads the same `catalogue.py`. | |
| ### Add a provider | |
| 1. Add a `<PROVIDER>_MODELS` list and a `PROVIDERS["<provider>"]` entry (carrying | |
| its `app` name) in `catalogue.py`. | |
| 2. Create `app_<provider>.py` that reads that entry: | |
| `app = modal.App(PROVIDERS["<provider>"].app)` then | |
| `register_all(app, PROVIDERS["<provider>"].models)`. | |
| ## Lower precision (quantization) | |
| Every model repo here ships **BF16** weights and serves at full precision. To | |
| shrink a model's footprint β fit it on a smaller GPU, or free VRAM for a longer | |
| context / more concurrency β pass vLLM's quantization flags through the | |
| `extra_vllm_args` escape hatch on its `ModelConfig`: | |
| ```python | |
| extra_vllm_args=("--quantization", "fp8", "--kv-cache-dtype", "fp8") | |
| ``` | |
| This is purely serving-side: `--served-model-name` is unchanged, so the engine, | |
| endpoint URLs, and the running cast are untouched. | |
| > **Not every architecture serves under on-the-fly FP8.** It needs an Ada/Hopper | |
| > GPU (our L4/L40S/H200 all qualify) *and* vLLM support for the model's arch. | |
| > Custom-code / hybrid-Mamba archs (the Nemotron Nanos, MiniCPM) and the | |
| > Transformers-backend Gemma 12B may **fail to boot** under it. Verify a model | |
| > after adding the flag (`modal/healthcheck.py` or `curl <url>/v1/models`); if it | |
| > won't start, drop the flag. This is why every model defaults to full precision. | |
| ## Auth | |
| Modal web endpoints are public by default. Secrets are supplied as environment | |
| variables (never hard-coded). To require a bearer token: | |
| ```bash | |
| # Key MUST be VLLM_API_KEY (vLLM reads it); value is the token clients send. | |
| modal secret create llm-api-key VLLM_API_KEY=sk-your-token | |
| # Turn auth on at deploy time β no code edits: | |
| MODAL_LLM_REQUIRE_AUTH=1 modal deploy modal/app_google.py | |
| ``` | |
| When `MODAL_LLM_REQUIRE_AUTH` is set, every endpoint mounts the `llm-api-key` | |
| secret as the `VLLM_API_KEY` env var and vLLM enforces `Authorization: Bearer | |
| <token>` (401 otherwise). Clients pass the same token (the bundled `client.py` | |
| reads it from `LLM_API_KEY`). Alternatively front endpoints with Modal Proxy | |
| Auth Tokens (see `docs/modal-llms.txt` β Proxy Auth Tokens). | |
| See [`openapi.md`](openapi.md) for the full API reference and the checked-in | |
| OpenAPI spec (`../openapi.yaml`). | |
| ## Observability & logging | |
| Every container's stdout/stderr is captured by Modal β watch it live with | |
| `modal app logs <app-name>` or in the dashboard. Each endpoint runs vLLM with | |
| `--enable-log-requests` (toggle via `log_requests`), so every call logs its | |
| request id, sampling params, and (on completion) prompt/generation token counts | |
| and finish reason. Clients can pass an `X-Request-Id` header and it shows up in | |
| the request logs β handy for correlating an engine call with its server-side line. | |
| Throughput, KV-cache usage, and prefix-cache hit rate are logged every second | |
| (`VLLM_LOG_STATS_INTERVAL`) and also exposed as Prometheus metrics at `/metrics`. | |
| ## GPU sizing cheatsheet | |
| BF16 weights β 2 bytes/param; leave headroom for the KV cache. MoE models load | |
| all expert weights even though only a slice activates per token, so size to the | |
| total parameter count. | |
| | Model | Params (total / active) | Starting GPU | | |
| | ---------------------------------- | ----------------------- | ------------ | | |
| | Nemotron-Cascade-14B-Thinking | ~14B (dense, Qwen3) | `L40S:1` | | |
| | Nemotron-3-Nano-4B | ~4B (Tiny Titan) | `L4:1` | | |
| | MiniCPM-o-4_5 (omni) | ~9B + media encoders | `L40S:1` | | |
| | MiniCPM4.1-8B | 8B | `L40S:1` | | |
| | Gemma-4-26B-A4B-it | ~25B / ~4B (MoE) | `A100:1` | | |
| | Gemma-4-12B-it | ~12B (dense) | `L40S:1` | | |
| These are starting points. If a container OOMs, lower `max_model_len`, raise the | |
| GPU tier, or bump `tensor_parallel_size` (and the GPU count) for sharding. | |
| ## Engine integration | |
| The engine reads this same `catalogue.py` (by path, via | |
| `src/models/modal_catalogue.py`) and routes every profile through the LiteLLM | |
| gateway (ADR-0015 / ADR-0019). You don't wire endpoints by hand β set the | |
| workspace and the four tiers bind automatically from `config/models.yaml`: | |
| ```bash | |
| export MODAL_WORKSPACE="<your-workspace>" # activates the live path | |
| export MODAL_LLM_KEY="EMPTY" # or the configured VLLM_API_KEY | |
| ``` | |
| Each profile's endpoint URL is derived as | |
| `https://${MODAL_WORKSPACE}--<app>-<endpoint>.modal.run/v1`. To point a profile at | |
| a different catalogue model, change its `endpoint:` in `config/models.yaml`; to | |
| override the model string outright, set `MODEL_TINY/FAST/BALANCED/STRONG`. For a | |
| one-off single endpoint (e.g. a local dev box), set `MODAL_LLM_BASE_URL` instead | |
| of `MODAL_WORKSPACE`. | |