File size: 13,972 Bytes
8a801e8 5d4ef87 8a801e8 5d4ef87 8a801e8 1bc1435 5d4ef87 1bc1435 5d4ef87 1bc1435 8a801e8 5d4ef87 8a801e8 7cedfb2 8a801e8 7cedfb2 8a801e8 7cedfb2 8a801e8 7cedfb2 8a801e8 9dd6dab 8a801e8 5d4ef87 8a801e8 40a30b6 5d4ef87 40a30b6 6ca7a5f 5d4ef87 8a801e8 c1656a8 5d4ef87 8a801e8 c1656a8 5d4ef87 c1656a8 40a30b6 5d4ef87 40a30b6 5d4ef87 40a30b6 5d4ef87 40a30b6 ce159dc 5d4ef87 ce159dc 8a801e8 9dd6dab 8a801e8 9dd6dab 8a801e8 5d4ef87 e3dfec9 5d4ef87 e3dfec9 5d4ef87 e3dfec9 5d4ef87 e3dfec9 5d4ef87 e334e95 8a801e8 57b8237 8a801e8 57b8237 8a801e8 6ca7a5f 5d4ef87 6ca7a5f 8a801e8 5d4ef87 8a801e8 8400d8c 5d4ef87 8a801e8 9dd6dab 8a801e8 9dd6dab 8a801e8 9dd6dab | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 | # Deploying & configuring the model-serving apps
This guide covers prerequisites, deployment, configuration knobs, auth, GPU
sizing, and wiring the endpoints into the engine.
The serving layer is deliberately small: it's Modal's canonical vLLM recipe β an
autoscaling `@app.function` that launches `vllm serve` as a subprocess behind a
`@modal.web_server` β applied once in `service.py` to every model in
`catalogue.py`. See ADR-0034 for why we stripped the earlier snapshot / FP8 /
structured-logging machinery back to this core.
## Prerequisites
```bash
pip install -r modal/requirements.txt
modal token new # one-time auth with your Modal workspace
```
Gated repos (Gemma, and the Nemotron repos here) require a Hugging Face token.
Accept each model's license on its Hugging Face page, then create the secret:
```bash
modal secret create huggingface-secret HF_TOKEN=hf_xxx
```
Only models with `gated=True` mount this secret; ungated models deploy without it.
## Deploy
Each provider is its own Modal app, deployed independently:
```bash
modal deploy modal/app_nvidia.py # Nemotron 3 Nano 4B + 30B, Cascade 14B
modal deploy modal/app_openbmb.py # MiniCPM4.1-8B + MiniCPM-o 4.5
modal deploy modal/app_google.py # Gemma 4 12B + 26B
```
Use `modal serve modal/app_<provider>.py` for a hot-reloading dev session.
Or deploy one, several, or all providers with a single uv command β a thin
wrapper that exposes the two deploy-time env knobs as flags:
```bash
uv run scripts/deploy_modal.py # all providers
uv run scripts/deploy_modal.py nvidia openbmb # just these
uv run scripts/deploy_modal.py nvidia --keep-warm # = MODAL_LLM_KEEP_WARM=1
# --auth β MODAL_LLM_REQUIRE_AUTH=1, --dry-run to preview the commands.
```
Run these from the repo root; the script's own directory (`modal/`) is on
`sys.path`, so `from service import ...` / `from catalogue import ...` resolve,
and `import modal` still binds the installed SDK (the folder name does not
shadow it).
## Endpoints
Each model becomes its own OpenAI-compatible endpoint. Modal builds the URL from
the `modal.App` name **and** the function's `endpoint_name`:
```
https://<workspace>--<app-name>-<endpoint-name>.modal.run/v1
```
`<app-name>` is `nvidia-llms`, `openbmb-llms`, or `google-llms` (one per provider
app); `<endpoint-name>` is the per-model slug. e.g. the Nemotron 4B endpoint is
`https://<workspace>--nvidia-llms-nemotron-3-nano-4b.modal.run/v1`.
> **Model id vs URL slug.** The `--model` value (and the `"model"` field in a raw
> request) is the *served model id* β the HF repo id, e.g.
> `nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16` β because `served_model_name` defaults to
> the repo `name`. It is **not** the URL slug (`nemotron-3-nano-4b`). Call
> `/v1/models` on any endpoint to see the exact id it serves.
Standard routes: `/v1/chat/completions`, `/v1/completions`, `/v1/models`, plus
`/docs` for the Swagger UI. Smoke-test one:
```bash
python modal/client.py \
--base-url https://<workspace>--google-llms-gemma-4-12b.modal.run/v1 \
--model google/gemma-4-12B \
--prompt "Describe a mossy ticket booth in the wood."
```
## Configuring models (per task)
All knobs live in `catalogue.py` as `ModelConfig` fields β no serving code
changes needed:
| Field | Purpose |
| ----------------------- | -------------------------------------------------------------- |
| `gpu` | Modal GPU spec, e.g. `H200:1`, `H100:2`, `L40S:1`, `L4:1`. |
| `tensor_parallel_size` | Shard across GPUs; set equal to the GPU count in `gpu`. |
| `max_model_len` | Cap context length to fit memory / tune throughput. |
| `max_concurrent_inputs` | Hard ceiling of requests multiplexed onto one container (autoscale target is ~75% of it). |
| `scaledown_window` | Idle seconds before a container stops (cold-start vs. cost). |
| `min_containers` | Keep N warm to eliminate cold starts (always-on cost). |
| `gpu_memory_utilization` | Fraction of VRAM for weights + KV cache (vLLM default `0.9`); raise for a bigger KV cache. |
| `enable_prefix_caching` | Reuse the KV cache for shared prompt prefixes (on by default β big win when the system prompt / ledger context repeats across the cast). |
| `async_scheduling` | Overlap CPU request scheduling with GPU compute (on by default; off for the Transformers-backend Gemma 12B + omni models). |
| `enforce_eager` | Skip CUDA-graph capture β faster cold start, lower steady-state throughput. |
| `log_requests` | Log each request's id, sampling params, and token counts (on by default). |
| `reasoning_parser` / `tool_call_parser` / `enable_auto_tool_choice` | OpenAI tool/reasoning features (vLLM parser names; leave None if unsupported). |
| `mm_limits` | Per-prompt image/audio/video caps; set to 0 on an auto-detected-multimodal model you serve text-only. |
| `trust_remote_code` | Required by MiniCPM / Nemotron custom modeling code. |
| `vllm_version` | Per-model inference-stack pin (escape hatch); `None` = the default `VLLM_VERSION`, `"nightly"` = latest nightly wheel, else a pinned version. |
| `extra_vllm_args` | Raw `vllm serve` flags appended verbatim β the escape hatch for anything not modelled above (quantization, batch caps, custom parser plugins, β¦). |
| `extra_pip` / `env` | Extra image deps / container env (escape hatch). |
> **Per-model vLLM version.** The image pins `VLLM_VERSION` (see `service.py`) for
> reproducible deploys. A single model can override it via `vllm_version` when the
> pinned release can't serve its architecture β this is scoped to that model's image,
> so one model's bump never touches another provider's app. Only the Gemma 4 **12B**
> sets `vllm_version="nightly"` (plus `transformers>=5.10.2`) because its
> `gemma4_unified` architecture has no class in any stable vLLM β€0.22.1. The Gemma 4
> **26B** is a standard MoE arch that serves on the pinned stable release, so it
> stays on the default pin.
### Performance tuning
The serving path follows Modal's high-performance-LLM-inference guidance, so the
defaults are already tuned for throughput; the knobs above let you push further
per model:
- **Prefix caching is on by default.** In a multi-agent cast the system prompt and
shared ledger context repeat across nearly every call, so reusing the KV cache
for that shared prefix is the single largest win β leave it on.
- **CUDA graphs are kept, their cost is amortized.** Containers capture CUDA
graphs (no `enforce_eager`) for best steady-state throughput, and the compile /
graph cache is persisted on the shared `vllm-cache` Volume (`VLLM_CACHE_ROOT`),
so only the *first* container compiles β later cold starts replay the cached
graphs. Set `enforce_eager=True` on a model only when its backend can't capture
graphs (the Transformers-backend Gemma 12B) or when cold start dominates.
- **Async scheduling** overlaps CPU request scheduling with GPU compute; on by
default for native vLLM models, off where the backend doesn't support it.
- **Autoscaling** scales out at ~75% of `max_concurrent_inputs` while a hot
container bursts up to the ceiling, so we add capacity before a container
saturates rather than after. Use `min_containers` to remove cold starts
entirely (at always-on cost).
For memory-bound models, raise `gpu_memory_utilization` (more KV cache β more
concurrency); if a step OOMs, lower `max_model_len` or cap the batch via
`extra_vllm_args` (e.g. `("--max-num-seqs", "32")`).
### Cold starts
A scale-from-zero cold start pays container boot β weight load β engine warmup.
Two mechanisms keep that bounded:
**1. Shared caches (always on).** Weights are pulled once onto the
`huggingface-cache` Volume and the torch.compile / CUDA-graph artifacts are
persisted on the `vllm-cache` Volume (`VLLM_CACHE_ROOT`). So a model downloads
once across every container and provider, and only the *first* container
compiles its graphs β later cold starts replay the cache.
**2. Demo-day keep-warm (deploy-time, no code edits).** Pin one warm container
for every *profile-bound* model (tiny/fast/balanced/strong) right before a live
demo β specialists keep scale-to-zero:
```bash
MODAL_LLM_KEEP_WARM=1 modal deploy modal/app_nvidia.py # one warm container per tier model
modal deploy modal/app_nvidia.py # back to scale-to-zero after
```
This burns GPU-hours while deployed; it's a switch for the hours around a demo,
not a steady state. `min_containers` in `catalogue.py` remains the per-model
override for anything finer-grained.
Cold-start clients must follow redirects: a Modal endpoint that hasn't answered
within ~150s returns a `303` to the same URL while the container finishes
booting (`modal/healthcheck.py` handles this; so does the engine's gateway).
### Add a model
Append one `ModelConfig` to the appropriate provider list in `catalogue.py` (tag
its `profile` tier to make it a tier default). The engine picks it up with no
edits β it reads the same `catalogue.py`.
### Add a provider
1. Add a `<PROVIDER>_MODELS` list and a `PROVIDERS["<provider>"]` entry (carrying
its `app` name) in `catalogue.py`.
2. Create `app_<provider>.py` that reads that entry:
`app = modal.App(PROVIDERS["<provider>"].app)` then
`register_all(app, PROVIDERS["<provider>"].models)`.
## Lower precision (quantization)
Every model repo here ships **BF16** weights and serves at full precision. To
shrink a model's footprint β fit it on a smaller GPU, or free VRAM for a longer
context / more concurrency β pass vLLM's quantization flags through the
`extra_vllm_args` escape hatch on its `ModelConfig`:
```python
extra_vllm_args=("--quantization", "fp8", "--kv-cache-dtype", "fp8")
```
This is purely serving-side: `--served-model-name` is unchanged, so the engine,
endpoint URLs, and the running cast are untouched.
> **Not every architecture serves under on-the-fly FP8.** It needs an Ada/Hopper
> GPU (our L4/L40S/H200 all qualify) *and* vLLM support for the model's arch.
> Custom-code / hybrid-Mamba archs (the Nemotron Nanos, MiniCPM) and the
> Transformers-backend Gemma 12B may **fail to boot** under it. Verify a model
> after adding the flag (`modal/healthcheck.py` or `curl <url>/v1/models`); if it
> won't start, drop the flag. This is why every model defaults to full precision.
## Auth
Modal web endpoints are public by default. Secrets are supplied as environment
variables (never hard-coded). To require a bearer token:
```bash
# Key MUST be VLLM_API_KEY (vLLM reads it); value is the token clients send.
modal secret create llm-api-key VLLM_API_KEY=sk-your-token
# Turn auth on at deploy time β no code edits:
MODAL_LLM_REQUIRE_AUTH=1 modal deploy modal/app_google.py
```
When `MODAL_LLM_REQUIRE_AUTH` is set, every endpoint mounts the `llm-api-key`
secret as the `VLLM_API_KEY` env var and vLLM enforces `Authorization: Bearer
<token>` (401 otherwise). Clients pass the same token (the bundled `client.py`
reads it from `LLM_API_KEY`). Alternatively front endpoints with Modal Proxy
Auth Tokens (see `docs/modal-llms.txt` β Proxy Auth Tokens).
See [`openapi.md`](openapi.md) for the full API reference and the checked-in
OpenAPI spec (`../openapi.yaml`).
## Observability & logging
Every container's stdout/stderr is captured by Modal β watch it live with
`modal app logs <app-name>` or in the dashboard. Each endpoint runs vLLM with
`--enable-log-requests` (toggle via `log_requests`), so every call logs its
request id, sampling params, and (on completion) prompt/generation token counts
and finish reason. Clients can pass an `X-Request-Id` header and it shows up in
the request logs β handy for correlating an engine call with its server-side line.
Throughput, KV-cache usage, and prefix-cache hit rate are logged every second
(`VLLM_LOG_STATS_INTERVAL`) and also exposed as Prometheus metrics at `/metrics`.
## GPU sizing cheatsheet
BF16 weights β 2 bytes/param; leave headroom for the KV cache. MoE models load
all expert weights even though only a slice activates per token, so size to the
total parameter count.
| Model | Params (total / active) | Starting GPU |
| ---------------------------------- | ----------------------- | ------------ |
| Nemotron-Cascade-14B-Thinking | ~14B (dense, Qwen3) | `L40S:1` |
| Nemotron-3-Nano-4B | ~4B (Tiny Titan) | `L4:1` |
| MiniCPM-o-4_5 (omni) | ~9B + media encoders | `L40S:1` |
| MiniCPM4.1-8B | 8B | `L40S:1` |
| Gemma-4-26B-A4B-it | ~25B / ~4B (MoE) | `A100:1` |
| Gemma-4-12B-it | ~12B (dense) | `L40S:1` |
These are starting points. If a container OOMs, lower `max_model_len`, raise the
GPU tier, or bump `tensor_parallel_size` (and the GPU count) for sharding.
## Engine integration
The engine reads this same `catalogue.py` (by path, via
`src/models/modal_catalogue.py`) and routes every profile through the LiteLLM
gateway (ADR-0015 / ADR-0019). You don't wire endpoints by hand β set the
workspace and the four tiers bind automatically from `config/models.yaml`:
```bash
export MODAL_WORKSPACE="<your-workspace>" # activates the live path
export MODAL_LLM_KEY="EMPTY" # or the configured VLLM_API_KEY
```
Each profile's endpoint URL is derived as
`https://${MODAL_WORKSPACE}--<app>-<endpoint>.modal.run/v1`. To point a profile at
a different catalogue model, change its `endpoint:` in `config/models.yaml`; to
override the model string outright, set `MODEL_TINY/FAST/BALANCED/STRONG`. For a
one-off single endpoint (e.g. a local dev box), set `MODAL_LLM_BASE_URL` instead
of `MODAL_WORKSPACE`.
|