File size: 13,972 Bytes
8a801e8
 
 
 
 
5d4ef87
 
 
 
 
 
8a801e8
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
5d4ef87
 
 
8a801e8
 
 
 
1bc1435
5d4ef87
1bc1435
 
 
 
 
5d4ef87
1bc1435
 
8a801e8
5d4ef87
8a801e8
 
 
 
 
7cedfb2
 
8a801e8
 
7cedfb2
8a801e8
 
7cedfb2
 
 
 
 
 
 
 
 
 
8a801e8
 
 
 
 
7cedfb2
8a801e8
 
 
 
 
 
9dd6dab
8a801e8
 
 
 
 
 
 
5d4ef87
8a801e8
 
40a30b6
 
5d4ef87
40a30b6
6ca7a5f
5d4ef87
 
8a801e8
c1656a8
5d4ef87
8a801e8
 
c1656a8
 
 
5d4ef87
 
 
 
 
c1656a8
40a30b6
 
 
 
 
 
 
 
 
 
 
 
 
 
5d4ef87
40a30b6
 
5d4ef87
 
 
40a30b6
 
 
5d4ef87
 
40a30b6
ce159dc
 
5d4ef87
 
 
 
 
 
 
 
 
 
 
ce159dc
 
 
 
 
 
 
 
 
 
 
 
 
 
 
8a801e8
 
9dd6dab
 
 
8a801e8
 
 
9dd6dab
 
 
 
 
8a801e8
5d4ef87
e3dfec9
5d4ef87
 
 
 
e3dfec9
5d4ef87
 
e3dfec9
 
5d4ef87
 
 
e3dfec9
 
5d4ef87
 
 
 
e334e95
8a801e8
 
57b8237
 
8a801e8
57b8237
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
8a801e8
6ca7a5f
 
 
5d4ef87
 
 
 
 
6ca7a5f
 
 
 
8a801e8
 
 
 
 
 
 
 
5d4ef87
 
 
8a801e8
8400d8c
5d4ef87
8a801e8
 
 
 
 
 
9dd6dab
 
 
 
8a801e8
 
9dd6dab
 
8a801e8
 
9dd6dab
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
# Deploying & configuring the model-serving apps

This guide covers prerequisites, deployment, configuration knobs, auth, GPU
sizing, and wiring the endpoints into the engine.

The serving layer is deliberately small: it's Modal's canonical vLLM recipe β€” an
autoscaling `@app.function` that launches `vllm serve` as a subprocess behind a
`@modal.web_server` β€” applied once in `service.py` to every model in
`catalogue.py`. See ADR-0034 for why we stripped the earlier snapshot / FP8 /
structured-logging machinery back to this core.

## Prerequisites

```bash
pip install -r modal/requirements.txt
modal token new            # one-time auth with your Modal workspace
```

Gated repos (Gemma, and the Nemotron repos here) require a Hugging Face token.
Accept each model's license on its Hugging Face page, then create the secret:

```bash
modal secret create huggingface-secret HF_TOKEN=hf_xxx
```

Only models with `gated=True` mount this secret; ungated models deploy without it.

## Deploy

Each provider is its own Modal app, deployed independently:

```bash
modal deploy modal/app_nvidia.py     # Nemotron 3 Nano 4B + 30B, Cascade 14B
modal deploy modal/app_openbmb.py    # MiniCPM4.1-8B + MiniCPM-o 4.5
modal deploy modal/app_google.py     # Gemma 4 12B + 26B
```

Use `modal serve modal/app_<provider>.py` for a hot-reloading dev session.

Or deploy one, several, or all providers with a single uv command β€” a thin
wrapper that exposes the two deploy-time env knobs as flags:

```bash
uv run scripts/deploy_modal.py                      # all providers
uv run scripts/deploy_modal.py nvidia openbmb       # just these
uv run scripts/deploy_modal.py nvidia --keep-warm   # = MODAL_LLM_KEEP_WARM=1
# --auth β†’ MODAL_LLM_REQUIRE_AUTH=1, --dry-run to preview the commands.
```

Run these from the repo root; the script's own directory (`modal/`) is on
`sys.path`, so `from service import ...` / `from catalogue import ...` resolve,
and `import modal` still binds the installed SDK (the folder name does not
shadow it).

## Endpoints

Each model becomes its own OpenAI-compatible endpoint. Modal builds the URL from
the `modal.App` name **and** the function's `endpoint_name`:

```
https://<workspace>--<app-name>-<endpoint-name>.modal.run/v1
```

`<app-name>` is `nvidia-llms`, `openbmb-llms`, or `google-llms` (one per provider
app); `<endpoint-name>` is the per-model slug. e.g. the Nemotron 4B endpoint is
`https://<workspace>--nvidia-llms-nemotron-3-nano-4b.modal.run/v1`.

> **Model id vs URL slug.** The `--model` value (and the `"model"` field in a raw
> request) is the *served model id* β€” the HF repo id, e.g.
> `nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16` β€” because `served_model_name` defaults to
> the repo `name`. It is **not** the URL slug (`nemotron-3-nano-4b`). Call
> `/v1/models` on any endpoint to see the exact id it serves.

Standard routes: `/v1/chat/completions`, `/v1/completions`, `/v1/models`, plus
`/docs` for the Swagger UI. Smoke-test one:

```bash
python modal/client.py \
  --base-url https://<workspace>--google-llms-gemma-4-12b.modal.run/v1 \
  --model google/gemma-4-12B \
  --prompt "Describe a mossy ticket booth in the wood."
```

## Configuring models (per task)

All knobs live in `catalogue.py` as `ModelConfig` fields β€” no serving code
changes needed:

| Field                   | Purpose                                                        |
| ----------------------- | -------------------------------------------------------------- |
| `gpu`                   | Modal GPU spec, e.g. `H200:1`, `H100:2`, `L40S:1`, `L4:1`.     |
| `tensor_parallel_size`  | Shard across GPUs; set equal to the GPU count in `gpu`.        |
| `max_model_len`         | Cap context length to fit memory / tune throughput.            |
| `max_concurrent_inputs` | Hard ceiling of requests multiplexed onto one container (autoscale target is ~75% of it). |
| `scaledown_window`      | Idle seconds before a container stops (cold-start vs. cost).   |
| `min_containers`        | Keep N warm to eliminate cold starts (always-on cost).         |
| `gpu_memory_utilization` | Fraction of VRAM for weights + KV cache (vLLM default `0.9`); raise for a bigger KV cache. |
| `enable_prefix_caching` | Reuse the KV cache for shared prompt prefixes (on by default β€” big win when the system prompt / ledger context repeats across the cast). |
| `async_scheduling`      | Overlap CPU request scheduling with GPU compute (on by default; off for the Transformers-backend Gemma 12B + omni models). |
| `enforce_eager`         | Skip CUDA-graph capture β€” faster cold start, lower steady-state throughput. |
| `log_requests`          | Log each request's id, sampling params, and token counts (on by default). |
| `reasoning_parser` / `tool_call_parser` / `enable_auto_tool_choice` | OpenAI tool/reasoning features (vLLM parser names; leave None if unsupported). |
| `mm_limits`             | Per-prompt image/audio/video caps; set to 0 on an auto-detected-multimodal model you serve text-only. |
| `trust_remote_code`     | Required by MiniCPM / Nemotron custom modeling code.           |
| `vllm_version`          | Per-model inference-stack pin (escape hatch); `None` = the default `VLLM_VERSION`, `"nightly"` = latest nightly wheel, else a pinned version. |
| `extra_vllm_args`       | Raw `vllm serve` flags appended verbatim β€” the escape hatch for anything not modelled above (quantization, batch caps, custom parser plugins, …). |
| `extra_pip` / `env`     | Extra image deps / container env (escape hatch).               |

> **Per-model vLLM version.** The image pins `VLLM_VERSION` (see `service.py`) for
> reproducible deploys. A single model can override it via `vllm_version` when the
> pinned release can't serve its architecture β€” this is scoped to that model's image,
> so one model's bump never touches another provider's app. Only the Gemma 4 **12B**
> sets `vllm_version="nightly"` (plus `transformers>=5.10.2`) because its
> `gemma4_unified` architecture has no class in any stable vLLM ≀0.22.1. The Gemma 4
> **26B** is a standard MoE arch that serves on the pinned stable release, so it
> stays on the default pin.

### Performance tuning

The serving path follows Modal's high-performance-LLM-inference guidance, so the
defaults are already tuned for throughput; the knobs above let you push further
per model:

- **Prefix caching is on by default.** In a multi-agent cast the system prompt and
  shared ledger context repeat across nearly every call, so reusing the KV cache
  for that shared prefix is the single largest win β€” leave it on.
- **CUDA graphs are kept, their cost is amortized.** Containers capture CUDA
  graphs (no `enforce_eager`) for best steady-state throughput, and the compile /
  graph cache is persisted on the shared `vllm-cache` Volume (`VLLM_CACHE_ROOT`),
  so only the *first* container compiles β€” later cold starts replay the cached
  graphs. Set `enforce_eager=True` on a model only when its backend can't capture
  graphs (the Transformers-backend Gemma 12B) or when cold start dominates.
- **Async scheduling** overlaps CPU request scheduling with GPU compute; on by
  default for native vLLM models, off where the backend doesn't support it.
- **Autoscaling** scales out at ~75% of `max_concurrent_inputs` while a hot
  container bursts up to the ceiling, so we add capacity before a container
  saturates rather than after. Use `min_containers` to remove cold starts
  entirely (at always-on cost).

For memory-bound models, raise `gpu_memory_utilization` (more KV cache β†’ more
concurrency); if a step OOMs, lower `max_model_len` or cap the batch via
`extra_vllm_args` (e.g. `("--max-num-seqs", "32")`).

### Cold starts

A scale-from-zero cold start pays container boot β†’ weight load β†’ engine warmup.
Two mechanisms keep that bounded:

**1. Shared caches (always on).** Weights are pulled once onto the
`huggingface-cache` Volume and the torch.compile / CUDA-graph artifacts are
persisted on the `vllm-cache` Volume (`VLLM_CACHE_ROOT`). So a model downloads
once across every container and provider, and only the *first* container
compiles its graphs β€” later cold starts replay the cache.

**2. Demo-day keep-warm (deploy-time, no code edits).** Pin one warm container
for every *profile-bound* model (tiny/fast/balanced/strong) right before a live
demo β€” specialists keep scale-to-zero:

```bash
MODAL_LLM_KEEP_WARM=1 modal deploy modal/app_nvidia.py   # one warm container per tier model
modal deploy modal/app_nvidia.py                         # back to scale-to-zero after
```

This burns GPU-hours while deployed; it's a switch for the hours around a demo,
not a steady state. `min_containers` in `catalogue.py` remains the per-model
override for anything finer-grained.

Cold-start clients must follow redirects: a Modal endpoint that hasn't answered
within ~150s returns a `303` to the same URL while the container finishes
booting (`modal/healthcheck.py` handles this; so does the engine's gateway).

### Add a model

Append one `ModelConfig` to the appropriate provider list in `catalogue.py` (tag
its `profile` tier to make it a tier default). The engine picks it up with no
edits β€” it reads the same `catalogue.py`.

### Add a provider

1. Add a `<PROVIDER>_MODELS` list and a `PROVIDERS["<provider>"]` entry (carrying
   its `app` name) in `catalogue.py`.
2. Create `app_<provider>.py` that reads that entry:
   `app = modal.App(PROVIDERS["<provider>"].app)` then
   `register_all(app, PROVIDERS["<provider>"].models)`.

## Lower precision (quantization)

Every model repo here ships **BF16** weights and serves at full precision. To
shrink a model's footprint β€” fit it on a smaller GPU, or free VRAM for a longer
context / more concurrency β€” pass vLLM's quantization flags through the
`extra_vllm_args` escape hatch on its `ModelConfig`:

```python
extra_vllm_args=("--quantization", "fp8", "--kv-cache-dtype", "fp8")
```

This is purely serving-side: `--served-model-name` is unchanged, so the engine,
endpoint URLs, and the running cast are untouched.

> **Not every architecture serves under on-the-fly FP8.** It needs an Ada/Hopper
> GPU (our L4/L40S/H200 all qualify) *and* vLLM support for the model's arch.
> Custom-code / hybrid-Mamba archs (the Nemotron Nanos, MiniCPM) and the
> Transformers-backend Gemma 12B may **fail to boot** under it. Verify a model
> after adding the flag (`modal/healthcheck.py` or `curl <url>/v1/models`); if it
> won't start, drop the flag. This is why every model defaults to full precision.

## Auth

Modal web endpoints are public by default. Secrets are supplied as environment
variables (never hard-coded). To require a bearer token:

```bash
# Key MUST be VLLM_API_KEY (vLLM reads it); value is the token clients send.
modal secret create llm-api-key VLLM_API_KEY=sk-your-token

# Turn auth on at deploy time β€” no code edits:
MODAL_LLM_REQUIRE_AUTH=1 modal deploy modal/app_google.py
```

When `MODAL_LLM_REQUIRE_AUTH` is set, every endpoint mounts the `llm-api-key`
secret as the `VLLM_API_KEY` env var and vLLM enforces `Authorization: Bearer
<token>` (401 otherwise). Clients pass the same token (the bundled `client.py`
reads it from `LLM_API_KEY`). Alternatively front endpoints with Modal Proxy
Auth Tokens (see `docs/modal-llms.txt` β†’ Proxy Auth Tokens).

See [`openapi.md`](openapi.md) for the full API reference and the checked-in
OpenAPI spec (`../openapi.yaml`).

## Observability & logging

Every container's stdout/stderr is captured by Modal β€” watch it live with
`modal app logs <app-name>` or in the dashboard. Each endpoint runs vLLM with
`--enable-log-requests` (toggle via `log_requests`), so every call logs its
request id, sampling params, and (on completion) prompt/generation token counts
and finish reason. Clients can pass an `X-Request-Id` header and it shows up in
the request logs β€” handy for correlating an engine call with its server-side line.

Throughput, KV-cache usage, and prefix-cache hit rate are logged every second
(`VLLM_LOG_STATS_INTERVAL`) and also exposed as Prometheus metrics at `/metrics`.

## GPU sizing cheatsheet

BF16 weights β‰ˆ 2 bytes/param; leave headroom for the KV cache. MoE models load
all expert weights even though only a slice activates per token, so size to the
total parameter count.

| Model                              | Params (total / active) | Starting GPU |
| ---------------------------------- | ----------------------- | ------------ |
| Nemotron-Cascade-14B-Thinking      | ~14B (dense, Qwen3)     | `L40S:1`     |
| Nemotron-3-Nano-4B                 | ~4B (Tiny Titan)        | `L4:1`       |
| MiniCPM-o-4_5 (omni)               | ~9B + media encoders    | `L40S:1`     |
| MiniCPM4.1-8B                      | 8B                      | `L40S:1`     |
| Gemma-4-26B-A4B-it                 | ~25B / ~4B (MoE)        | `A100:1`     |
| Gemma-4-12B-it                     | ~12B (dense)            | `L40S:1`     |

These are starting points. If a container OOMs, lower `max_model_len`, raise the
GPU tier, or bump `tensor_parallel_size` (and the GPU count) for sharding.

## Engine integration

The engine reads this same `catalogue.py` (by path, via
`src/models/modal_catalogue.py`) and routes every profile through the LiteLLM
gateway (ADR-0015 / ADR-0019). You don't wire endpoints by hand β€” set the
workspace and the four tiers bind automatically from `config/models.yaml`:

```bash
export MODAL_WORKSPACE="<your-workspace>"   # activates the live path
export MODAL_LLM_KEY="EMPTY"                # or the configured VLLM_API_KEY
```

Each profile's endpoint URL is derived as
`https://${MODAL_WORKSPACE}--<app>-<endpoint>.modal.run/v1`. To point a profile at
a different catalogue model, change its `endpoint:` in `config/models.yaml`; to
override the model string outright, set `MODEL_TINY/FAST/BALANCED/STRONG`. For a
one-off single endpoint (e.g. a local dev box), set `MODAL_LLM_BASE_URL` instead
of `MODAL_WORKSPACE`.