agharsallah committed
Commit 5d4ef87 · 1 Parent(s): e3ba862

Refactor modal service and logging setup

- Updated the service module to streamline model registration and improve clarity in the handling of model configurations.
- Removed the vllm_logging module, integrating its functionality into the main logging setup for better maintainability and consistency.
- Simplified the build_command function by removing unnecessary precision handling and logging configurations.
- Enhanced test coverage for the build_command function, ensuring proper flag emissions and configurations.
- Cleaned up deprecated snapshot model handling and adjusted related tests for clarity and accuracy.
- Improved documentation throughout the service module to better reflect current functionality and design goals.

docs/adr/0030-gpu-memory-snapshots-cold-start.md CHANGED
@@ -2,7 +2,12 @@

  ## Status

- Accepted (extends [ADR-0014 *Modal model serving*](0014-modal-model-serving.md),
+ **Superseded by [ADR-0034 *Simplify the Modal serving layer*](0034-simplify-modal-serving-to-canonical-vllm.md)**
+ — the snapshot lifecycle was removed for being alpha and error-prone; cold starts
+ now rely on the shared compile/weight caches plus the retained `MODAL_LLM_KEEP_WARM`
+ demo switch. The historical context below stands.
+
+ Originally Accepted (extended [ADR-0014 *Modal model serving*](0014-modal-model-serving.md),
  [ADR-0019](0019-single-model-catalogue-no-cloud-path.md))

  ## Context
docs/adr/0031-fp8-quantization-control.md CHANGED
@@ -2,8 +2,12 @@

  ## Status

- Accepted (extends [ADR-0014 *Modal model serving*](0014-modal-model-serving.md),
- [ADR-0019](0019-single-model-catalogue-no-cloud-path.md); interacts with
+ **Superseded by [ADR-0034 *Simplify the Modal serving layer*](0034-simplify-modal-serving-to-canonical-vllm.md)**
+ — the env-controlled quantization machinery was removed; lower precision is now
+ reached via a model's `extra_vllm_args`. The historical context below stands.
+
+ Originally Accepted (extended [ADR-0014 *Modal model serving*](0014-modal-model-serving.md),
+ [ADR-0019](0019-single-model-catalogue-no-cloud-path.md); interacted with
  [ADR-0030 *GPU memory snapshots*](0030-gpu-memory-snapshots-cold-start.md))

  ## Context
docs/adr/0034-simplify-modal-serving-to-canonical-vllm.md ADDED
@@ -0,0 +1,95 @@
+ # ADR-0034: Simplify the Modal serving layer to the canonical vLLM recipe
+
+ ## Status
+
+ Accepted. **Supersedes [ADR-0030 *GPU memory snapshots*](0030-gpu-memory-snapshots-cold-start.md)
+ and [ADR-0031 *FP8 quantization control*](0031-fp8-quantization-control.md).**
+ Extends [ADR-0014 *Modal model serving*](0014-modal-model-serving.md) and
+ [ADR-0019](0019-single-model-catalogue-no-cloud-path.md).
+
+ ## Context
+
+ `modal/service.py` had grown to ~500 lines by accreting three optional
+ subsystems on top of the plain vLLM web-server path from ADR-0014:
+
+ - **GPU memory snapshots** (ADR-0030) — a class-based sleep→snapshot→wake
+ lifecycle, a second registration shape, and `enable_gpu_snapshot` (Modal
+ *alpha*).
+ - **FP8 / quantization control** (ADR-0031) — a deploy-time env-override resolver
+ plus a workaround for FP8-KV-cache crashing the snapshot wake path.
+ - **Structured JSON logging** — a `vllm_logging.py` formatter shipped into the
+ image and wired through a generated `dictConfig`.
+
+ In practice this surface was the source of the errors, not a benefit:
+
+ - The snapshot lifecycle is alpha and fragile — the documented FP8×snapshot
+ wake-path crash (ADR-0031) is one instance; the hand-folded URL label and
+ cloudpickled-closure constraints are others. Hard to deploy, hard to debug.
+ - The FP8 machinery defaulted to `None` on **every** model — pure surface area
+ with no model actually using it.
+ - JSON logging defaulted **off** — more surface area, off by default.
+ - Per-model configs had drifted from the models' real serving requirements
+ (e.g. the Gemma 4 26B was pinned to a nightly vLLM it doesn't need).
+
+ The working core is small and is exactly Modal's current canonical vLLM example:
+ an autoscaling `@app.function` + `@modal.concurrent` + `@modal.web_server` whose
+ body runs `subprocess.Popen(["vllm", "serve", ...])`.
+
+ ## Decision
+
+ **1. One serving path.** `register_model()` only registers the plain
+ `@app.function` web server. The snapshot class lifecycle
+ (`_register_snapshot_model`, `_class_name`, sleep/wake, `enable_gpu_snapshot`) is
+ deleted. `service.py` drops from ~500 to ~210 lines.
+
+ **2. Quantization moves to the escape hatch.** The `MODAL_LLM_QUANTIZATION` /
+ `MODAL_LLM_KV_CACHE_DTYPE` env resolver, the `quantization` / `kv_cache_dtype`
+ `ModelConfig` fields, and the FP8×snapshot workaround are removed. A model that
+ wants lower precision passes the flags through the existing `extra_vllm_args`
+ (`("--quantization", "fp8")`). Quantization was always opt-in and never on; this
+ keeps it possible without standing machinery.
+
+ **3. JSON logging is removed.** `vllm_logging.py` is deleted along with the
+ `MODAL_LLM_JSON_LOGS` / `MODAL_LLM_LOG_LEVEL` wiring. Modal captures
+ stdout/stderr; `--enable-log-requests` (kept, via `log_requests`) gives
+ per-request detail.
+
+ **4. `ModelConfig` is trimmed** to the fields the one path actually reads.
+ Removed: `gpu_snapshot`, `quantization`, `kv_cache_dtype`, `max_num_seqs`,
+ `max_num_batched_tokens`, `target_concurrent_inputs`, `buffer_containers`,
+ `log_outputs`, `max_log_len`, `uvicorn_access_log`, `multimodal`. The autoscale
+ target is computed inline (~75% of `max_concurrent_inputs`); anything exotic uses
+ `extra_vllm_args`.
+
+ **5. Per-model configs re-grounded in each model's documentation** (verified
+ against the HF model cards + vLLM recipes, June 2026):
+
+ | Model | Correction |
+ | --- | --- |
+ | Gemma 4 **26B-A4B** | Standard `gemma4` MoE — serves on the **pinned stable vLLM**. Dropped the nightly pin, `transformers>=5.10.2`, the unverified `VLLM_USE_FLASHINFER_SAMPLER=0`, and `enforce_eager` (native path → CUDA graphs work). |
+ | Gemma 4 **12B** | `gemma4_unified` (encoder-free) has no class in any stable vLLM ≤0.22.1 → **keeps** `vllm_version="nightly"` + `transformers>=5.10.2`; dropped the unverified flashinfer env. |
+ | Nemotron Nano **4B / 30B** | Hybrid-Mamba; `trust_remote_code` kept. Served as plain chat — NVIDIA's `nano_v3` reasoning parser ships as a downloadable *plugin file* and is omitted for boot-robustness (addable via `extra_vllm_args` later). 30B params corrected 30→31. |
+ | Nemotron **Cascade-14B** | Confirmed stock Qwen3 — `reasoning_parser="qwen3"` + `tool_call_parser="hermes"` are correct and built-in; kept. |
+ | MiniCPM **4.1-8B** | `trust_remote_code` kept; no tool parser (custom `<|tool_call_start|>` format — engine uses guided decoding per ADR-0016). Serves on the pinned stable. |
+ | MiniCPM **-o 4.5** | Params corrected 8→9B; served text+image (audio over vLLM is experimental — the documented `transformers==4.51.0` pin conflicts with vLLM's bundled version, so we keep the lean preprocessing deps). |
+
+ ## Consequences
+
+ - **Far smaller blast radius.** One registration shape, no alpha features, no
+ generated log config, no precision resolver. The thing that errored is gone.
+ - **Cold starts** now rely on the always-on shared caches (weights + compiled
+ graphs on Volumes) and the retained `MODAL_LLM_KEEP_WARM` demo-day switch
+ (mechanism 2 of ADR-0030, the robust half). We trade snapshot's seconds-from-
+ cold for simplicity; keep-warm covers the live-demo first-30-seconds bar.
+ - **Quantization / batch caps** are still reachable via `extra_vllm_args`, just
+ not first-class fields. If a model later needs standing FP8, re-promote a typed
+ field then — but not speculatively.
+ - **Gemma 4 26B is cheaper and more robust** off the nightly: it's a tier
+ default (`strong`), so removing its nightly dependency removes a recurring
+ break. Only the 12B remains on nightly, where it's unavoidable.
+ - **Prize impact unchanged.** All seven models and all four provider tracks
+ (OpenAI-compatible, MiniCPM, Nemotron, Gemma) still deploy; the no-API-key
+ deterministic stub is untouched. The serving path stays demo-ready for the
+ Modal Awards, now without the alpha-feature risk on stage.
+ - **Tests** for the removed precision/snapshot behaviour are replaced by tests
+ that pin the simplified `build_command` argv. Full suite stays green.
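
Decision 2 and the tests bullet can be illustrated with a minimal sketch of how a simplified `build_command` might assemble the `vllm serve` argv and pass `extra_vllm_args` through unchanged. This is not the code from `service.py`: the trimmed `ModelConfig` below, the `port` parameter, the placeholder model name, and the flag ordering are all assumptions made for illustration. Only the `vllm serve` flags themselves are real.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical, trimmed stand-in for the catalogue's ModelConfig."""

    name: str
    max_model_len: int | None = None
    trust_remote_code: bool = False
    enforce_eager: bool = False
    log_requests: bool = True
    extra_vllm_args: tuple[str, ...] = ()


def build_command(cfg: ModelConfig, port: int = 8000) -> list[str]:
    """Assemble a `vllm serve` argv (illustrative, not the real service.py)."""
    cmd = ["vllm", "serve", cfg.name, "--host", "0.0.0.0", "--port", str(port)]
    if cfg.max_model_len is not None:
        cmd += ["--max-model-len", str(cfg.max_model_len)]
    if cfg.trust_remote_code:
        cmd.append("--trust-remote-code")
    if cfg.enforce_eager:
        cmd.append("--enforce-eager")
    if cfg.log_requests:
        cmd.append("--enable-log-requests")
    # Escape hatch: anything exotic (quantization, batch caps) is appended as-is.
    cmd += list(cfg.extra_vllm_args)
    return cmd


# A model opting in to FP8 without any first-class quantization field.
fp8_cfg = ModelConfig(
    name="example-org/example-model",  # placeholder, not a catalogue entry
    max_model_len=16384,
    extra_vllm_args=("--quantization", "fp8"),
)
argv = build_command(fp8_cfg)
print(" ".join(argv))
```

A test that pins the argv, as the ADR describes, would then assert on the exact flags emitted for a given config rather than on any precision-resolution logic.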
modal/README.md CHANGED
@@ -18,8 +18,6 @@ modal/
  app_nvidia.py App "nvidia-llms" — Nemotron 3 Nano 4B + 30B, Cascade 14B Thinking.
  app_openbmb.py App "openbmb-llms" — MiniCPM4.1-8B + MiniCPM-o 4.5.
  app_google.py App "google-llms" — Gemma 4 12B + 26B.
- vllm_logging.py Dependency-free JSON log formatter shipped into the image
- when MODAL_LLM_JSON_LOGS=1 (structured logs via vLLM dictConfig).
  client.py OpenAI-SDK smoke-test client for any endpoint.
  openapi.yaml Checked-in OpenAPI 3.1 spec for the served API surface.
  pyproject.toml uv workspace member (deploy/client tooling; non-package).
@@ -71,11 +69,13 @@ sizing, and how to add models/providers or wire endpoints into the engine.
  radius; one provider's outage or redeploy never touches another.
  - **Scalable** — serverless autoscaling, input concurrency, a shared weight
  cache (pull once, warm everywhere), and per-model `min_containers` warm pools.
- - **Fast cold starts** — snapshot-enabled models (`gpu_snapshot=True`) restore a
- pre-warmed engine from a Modal memory snapshot in seconds instead of re-paying
- download + load + warmup; `MODAL_LLM_KEEP_WARM=1` at deploy time pins warm
- containers for the tier models on demo day. See
- [`docs/deploying.md` → Cold starts](docs/deploying.md#cold-starts) (ADR-0030).
+ - **One serving path** — Modal's canonical vLLM recipe (an autoscaling
+ `@app.function` launching `vllm serve` behind a `@modal.web_server`), written
+ once in `service.py`. No bespoke per-model lifecycle to break (ADR-0034).
+ - **Fast cold starts on demo day** — the shared `vllm-cache` Volume persists the
+ torch.compile / CUDA-graph artifacts so only the first container compiles, and
+ `MODAL_LLM_KEEP_WARM=1` at deploy time pins one warm container per tier model.
+ See [`docs/deploying.md` → Cold starts](docs/deploying.md#cold-starts).
  - **Extensible** — add a model = one `ModelConfig` in `catalogue.py`; add a
  provider = one `Provider` entry + one app file. The serving path is written once
  in `service.py`, and the engine picks up the new model with no edits (it reads
modal/catalogue.py CHANGED
@@ -72,60 +72,31 @@ class ModelConfig:
      max_model_len: int | None = None # cap context to fit memory / task
      trust_remote_code: bool = False # required by MiniCPM / Nemotron custom code

-     # Precision / quantization (vLLM serve flags). Both default to full precision
-     # (BF16 weights, model-dtype KV cache); set them to shrink the memory footprint
-     # so a model fits a smaller GPU or leaves more room for KV cache. A deploy-time
-     # env override (``MODAL_LLM_QUANTIZATION`` / ``MODAL_LLM_KV_CACHE_DTYPE``, read in
-     # ``service.py``) wins over these per-model values for a whole deploy. CAVEAT:
-     # on-the-fly FP8 needs an Ada/Hopper GPU (our L4/L40S/H200 all qualify) AND vLLM
-     # support for the architecture — custom-code / hybrid-mamba archs (Nemotron-H,
-     # MiniCPM) and the Transformers-backend Gemmas may fail to start under it, so these
-     # stay ``None`` until a model is verified to serve quantized. See ADR-0031.
-     quantization: str | None = None # vLLM --quantization, on-the-fly weight quant (e.g. "fp8"); None = full BF16
-     kv_cache_dtype: str | None = None # vLLM --kv-cache-dtype (e.g. "fp8"); None = auto (model dtype)
-
      # Performance / throughput (vLLM serve flags). Defaults target high
      # steady-state throughput on the common single-GPU path; tune per model.
-     # See ``service.build_command`` for how each maps to a flag.
+     # See ``service.build_command`` for how each maps to a flag. For anything more
+     # exotic (quantization, batch-size caps, …) use ``extra_vllm_args``.
      gpu_memory_utilization: float | None = None # fraction of VRAM for weights + KV cache (vLLM default 0.9)
      enable_prefix_caching: bool = True # reuse KV for shared prompt prefixes — big win when system/context repeat
      async_scheduling: bool = True # overlap CPU request scheduling with GPU compute
      enforce_eager: bool = False # skip CUDA-graph capture: faster cold start, lower steady-state throughput
-     max_num_seqs: int | None = None # cap sequences batched per step (memory vs. throughput)
-     max_num_batched_tokens: int | None = None # token budget per scheduler step (prefill throughput)
-
-     # Cold starts. Opt a model into Modal memory snapshots (CPU + experimental GPU
-     # snapshot): the container boots once, loads weights, warms the engine, puts it
-     # to sleep (vLLM sleep mode, weights offloaded to host RAM), and is snapshotted;
-     # every later cold start restores the snapshot and wakes the engine in seconds
-     # instead of re-paying download + load + warmup. Constraints (why this is per
-     # model, not global): single-GPU models only, the model's vLLM build must
-     # support `--enable-sleep-mode`, and host RAM must hold the offloaded weights.
-     # Modal marks GPU snapshots alpha — keep it off for exotic serving paths
-     # (Transformers-backend Gemma, the omni specialist) and flip off on any model
-     # that misbehaves; the plain serving path is unchanged.
-     gpu_snapshot: bool = False
-
-     # Observability / request logging (vLLM serve flags). Defaults give per-request
-     # visibility in the container logs out of the box; see ``service.build_command``.
-     log_requests: bool = True # log each request's id, sampling params, and token counts
-     log_outputs: bool = False # also log generated text (verbose; can echo story content) — opt-in
-     max_log_len: int | None = 2048 # truncate logged prompts/outputs to N chars (None = no cap)
-     uvicorn_access_log: bool = True # keep uvicorn's per-request HTTP access line (method, path, status)
+
+     # Observability. ``log_requests`` adds --enable-log-requests so each call's id,
+     # sampling params, and token counts show in the Modal container logs.
+     log_requests: bool = True

      # OpenAI feature parsers (vLLM names; leave None if unsupported on the model)
      reasoning_parser: str | None = None
      tool_call_parser: str | None = None
      enable_auto_tool_choice: bool = False

-     # Multimodal
-     multimodal: bool = False
-     mm_limits: dict[str, int] | None = None # e.g. {"image": 4, "audio": 2}
+     # Multimodal — per-prompt input caps, e.g. {"image": 4, "audio": 2}. Set the
+     # caps to 0 on an auto-detected-multimodal model you serve text-only, to skip
+     # the encoder warmup and free memory.
+     mm_limits: dict[str, int] | None = None

      # Scaling / lifecycle
      max_concurrent_inputs: int = 64 # hard ceiling of requests multiplexed onto one container
-     target_concurrent_inputs: int | None = None # autoscale target — scale out here, burst up to max; defaults to ~75%
-     buffer_containers: int = 0 # extra idle containers to pre-warm under active load (bursty traffic)
      scaledown_window: int = 15 * 60 # idle seconds before a container stops
      min_containers: int = 0 # keep N warm to remove cold starts (costs $)
      startup_timeout: int = 30 * 60 # weight download + load can be slow
@@ -169,31 +140,34 @@ NVIDIA_MODELS: tuple[ModelConfig, ...] = (
      ModelConfig(
          name="nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16",
          endpoint_name="nemotron-3-nano-4b",
-         # Tiny Titan tier (≤4B): comfortably fits a single 24GB L4.
+         # Tiny Titan tier (≤4B): ~4B BF16 weights (~8GB) fit a single 24GB L4.
          profile="tiny",
          params_b=4,
          gpu="L4:1",
          max_model_len=16384,
+         # Hybrid Mamba-2 + MLP + attention arch → custom modeling code; required.
          trust_remote_code=True,
          gated=True,
          max_concurrent_inputs=32,
-         # Tiny tier is the cast's hottest endpoint and 4B of BF16 weights (~8GB)
-         # easily fit host RAM during sleep — the ideal snapshot candidate.
-         gpu_snapshot=True,
+         # Served as a plain chat endpoint. NVIDIA ships a custom `nano_v3` reasoning
+         # parser as a downloadable plugin file (--reasoning-parser-plugin) plus a
+         # `qwen3_coder` tool parser; both are omitted here for boot-robustness (the
+         # plugin must be shipped into the image and is easy to get wrong). The
+         # model still reasons — the <think> block just stays inline in the content.
+         # Add them later via extra_vllm_args if structured reasoning/tools are needed.
      ),
      ModelConfig(
          name="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
          endpoint_name="nemotron-3-nano-30b",
-         # 30B total params in BF16 (~60GB) though only ~3B activate per token.
-         # An alternate strong model — not cast to a profile by default.
-         # No gpu_snapshot: sleep mode would offload ~60GB of weights to host RAM,
-         # past what a default container comfortably holds.
-         params_b=30,
+         # Hybrid Mamba-2 + MoE: ~31B total params in BF16 (~62GB), ~3B active per
+         # token. Needs an 80GB card — an alternate strong model, not a tier default.
+         params_b=31,
          gpu="H200:1",
          max_model_len=32768,
          trust_remote_code=True,
          gated=True,
          max_concurrent_inputs=64,
+         # Same plain-chat posture as the 4B (custom `nano_v3` parser plugin omitted).
      ),
      ModelConfig(
          name="nvidia/Nemotron-Cascade-14B-Thinking",
@@ -210,15 +184,13 @@ NVIDIA_MODELS: tuple[ModelConfig, ...] = (
          params_b=14,
          gpu="L40S:1",
          max_model_len=32768,
-         # Qwen3-native in vLLM (no custom code); ChatML template with a thinking
-         # block parsed by the Qwen3 reasoning parser.
+         # Post-trained from Qwen3-14B Base → stock Qwen3 arch (no custom code).
+         # ChatML thinking block parsed by the Qwen3 reasoning parser; `hermes` is
+         # the standard Qwen3-family tool parser. Both verified built-in in vLLM.
          reasoning_parser="qwen3",
          tool_call_parser="hermes",
          enable_auto_tool_choice=True,
          max_concurrent_inputs=48,
-         # Qwen3-native single-GPU path on the pinned vLLM — snapshot-safe, and a
-         # reasoning model is exactly where a multi-minute cold start hurts most.
-         gpu_snapshot=True,
      ),
  )

@@ -234,28 +206,31 @@ OPENBMB_MODELS: tuple[ModelConfig, ...] = (
          max_model_len=32768,
          trust_remote_code=True,
          max_concurrent_inputs=48,
-         # Fast tier default for the cast; 8B BF16 (~16GB) offloads to host RAM
-         # fine. Sleep mode is allocator-level, so the custom MiniCPM modeling
-         # code doesn't affect it.
-         gpu_snapshot=True,
          # No tool_call_parser on purpose: MiniCPM4.1 emits a custom
-         # <|tool_call_start|> format vLLM 0.21.0 has no parser for, so tool-call
-         # structured output 400s here. The engine's structured path uses vLLM
+         # <|tool_call_start|> code-block format vLLM has no matching parser for, so
+         # a tool parser would 400/mis-parse. The engine's structured path uses vLLM
          # guided decoding (response_format json_schema) instead, which is
          # parser-independent — see ADR-0016. Don't bolt on a mismatched parser.
+         # (The model card suggests a vLLM nightly; 0.21.0 predates the release and
+         # serves it fine — flip vllm_version="nightly" if a boot failure proves otherwise.)
      ),
      ModelConfig(
          name="openbmb/MiniCPM-o-4_5",
          endpoint_name="minicpm-o-4-5",
-         # Omni-modal (text + vision + audio). Needs custom code and media backends.
-         # A specialist model — not cast to a profile by default.
-         params_b=8,
+         # Omni-modal (text + vision + audio) on a Qwen3-8B backbone → ~9B total in
+         # BF16. A specialist model, not cast to a profile by default.
+         params_b=9,
          gpu="L40S:1",
          trust_remote_code=True,
-         multimodal=True,
-         mm_limits={"image": 4, "audio": 2, "video": 1},
-         # Audio/vision preprocessing backends pulled into the image.
+         # Text + image only here; audio in/out over vLLM is experimental (it really
+         # wants the Transformers/demo runtime). Caps keep the encoder warmup bounded.
+         mm_limits={"image": 1, "audio": 0, "video": 0},
+         # Light vision/audio preprocessing backends. NOTE: full omni support wants
+         # openbmb's `minicpmo-utils[all]` + a pinned transformers==4.51.0, but that
+         # pin conflicts with vLLM's bundled transformers — so we keep the lean set
+         # and serve text+image. Treat audio as experimental.
          extra_pip=("librosa", "soundfile", "timm"),
+         gpu_memory_utilization=0.9,
          max_concurrent_inputs=16,
          # Custom omni-modal code path: keep the async scheduler off (conservative
          # — it's a specialist, not on the default cast). Prefix caching stays on.
@@ -285,36 +260,25 @@ GOOGLE_MODELS: tuple[ModelConfig, ...] = (
          tool_call_parser="gemma4",
          enable_auto_tool_choice=True,
          max_concurrent_inputs=48,
-         # Served via vLLM's Transformers modeling backend (gemma4_unified has no
-         # native vLLM class), which runs eager-only — CUDA-graph capture and the
-         # async scheduler aren't supported on that path, so disable both here.
-         # Prefix caching still applies and stays on (the default). gpu_snapshot
-         # stays off too: sleep mode on the nightly Transformers backend is
-         # unverified, and the Gemmas already skip the costliest warmup (no
-         # CUDA-graph capture).
-         enforce_eager=True,
-         async_scheduling=False,
-         # Text-only in the cast (vision/audio is the MiniCPM-o specialist's job).
-         # vLLM auto-detects gemma4_unified as multimodal and otherwise spends a big
-         # slice of cold-start profiling a *video* encoder we never call (and the MM
-         # warmup fails anyway). Zeroing the per-prompt MM limits disables that whole
-         # path — faster start, less GPU memory, more KV cache.
-         mm_limits={"image": 0, "audio": 0, "video": 0},
-         # gemma4_unified uses *variable* head dims (256 on sliding-attention layers,
-         # 512 on full-attention ones). vLLM <= 0.22.1 (incl. the pinned 0.21.0) sizes
-         # the o_proj from a uniform head_dim and dies on the full-attention layers
-         # with "mat1 and mat2 shapes cannot be multiplied". Only a vLLM nightly serves
-         # gemma4_unified, paired with transformers >= 5.10.2 (which adds the arch) and
-         # the FlashInfer sampler off (its JIT path breaks on these builds). All three
-         # are scoped to this model, so NVIDIA/OpenBMB stay on the reproducible pin.
+         # gemma4_unified (encoder-free) has no native class in any *stable* vLLM
+         # (≤0.22.1 falls back to the Transformers backend and crashes); only the
+         # nightly wheel registers Gemma4UnifiedForConditionalGeneration. So this
+         # model alone pins the nightly + transformers>=5.10.2. Scoped here, so
+         # NVIDIA/OpenBMB and the 26B sibling stay on the reproducible pin.
          vllm_version="nightly",
          extra_pip=("transformers>=5.10.2",),
-         env={"VLLM_USE_FLASHINFER_SAMPLER": "0"},
+         # Transformers-backend / fresh-nightly path: eager-only is the safe choice
+         # (CUDA-graph capture + async scheduler aren't reliable here).
+         enforce_eager=True,
+         async_scheduling=False,
+         # Text-only in the cast — gemma4 auto-detects as multimodal, so zero the
+         # per-prompt caps to skip the encoder warmup and free memory for KV cache.
+         mm_limits={"image": 0, "audio": 0},
      ),
      ModelConfig(
          name="google/gemma-4-26B-A4B-it",
          endpoint_name="gemma-4-26b",
-         # MoE: ~26B total params (~4B active). Gated repo — needs an HF token.
+         # MoE: ~25B total params (~4B active) with a small vision encoder. Gated.
          profile="strong",
          params_b=26,
          gpu="H200:1",
@@ -324,18 +288,11 @@ GOOGLE_MODELS: tuple[ModelConfig, ...] = (
          tool_call_parser="gemma4",
          enable_auto_tool_choice=True,
          max_concurrent_inputs=64,
-         # Transformers modeling backend (see the 12B above): eager-only, so no
-         # CUDA graphs / async scheduler. Prefix caching stays on by default.
-         enforce_eager=True,
-         async_scheduling=False,
-         # Text-only in the cast — disable the auto-detected multimodal (video)
-         # encoder to cut cold-start profiling and free memory (see the 12B above).
-         mm_limits={"image": 0, "audio": 0, "video": 0},
-         # Same gemma4_unified fix as the 12B above (nightly vLLM + transformers
-         # >= 5.10.2 + FlashInfer sampler off).
-         vllm_version="nightly",
-         extra_pip=("transformers>=5.10.2",),
-         env={"VLLM_USE_FLASHINFER_SAMPLER": "0"},
+         # Standard gemma4 MoE arch (NOT the unified 12B path): served by a native
+         # vLLM class on the pinned stable release (0.19.1+), so NO nightly, no
+         # transformers pin, and CUDA graphs + async scheduling work — defaults stand.
+         # Text-only in the cast: zero the auto-detected multimodal caps.
+         mm_limits={"image": 0},
      ),
  )
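
With `target_concurrent_inputs` removed from `ModelConfig`, ADR-0034 says the autoscale target is computed inline at roughly 75% of `max_concurrent_inputs`. The sketch below shows one way that could look for the ceilings used in this catalogue. The helper name `autoscale_target`, the round-down behaviour, and the floor of 1 are assumptions, since the diff does not show the actual expression in `service.py`.

```python
def autoscale_target(max_concurrent_inputs: int, fraction: float = 0.75) -> int:
    """Return the per-container autoscale target for a given hard ceiling.

    Rounding down and clamping to at least 1 are illustrative choices, not
    confirmed behaviour of service.py.
    """
    return max(1, int(max_concurrent_inputs * fraction))


# Ceilings that appear in the catalogue entries above.
for ceiling in (16, 32, 48, 64):
    print(f"max_concurrent_inputs={ceiling} -> target={autoscale_target(ceiling)}")
```

Under these assumptions a container starts scaling out at the target and can still burst to the ceiling, which is presumably what the pair of values handed to `@modal.concurrent` expresses.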
 
modal/docs/deploying.md CHANGED
@@ -3,6 +3,12 @@
3
  This guide covers prerequisites, deployment, configuration knobs, auth, GPU
4
  sizing, and wiring the endpoints into the engine.
5
 
 
 
 
 
 
 
6
  ## Prerequisites
7
 
8
  ```bash
@@ -24,26 +30,25 @@ Only models with `gated=True` mount this secret; ungated models deploy without i
24
  Each provider is its own Modal app, deployed independently:
25
 
26
  ```bash
27
- modal deploy modal/app_nvidia.py # Nemotron 3 Nano 30B + 4B
28
- modal deploy modal/app_openbmb.py # MiniCPM-o 4.5 + MiniCPM4.1-8B
29
- modal deploy modal/app_google.py # Gemma 4 26B + 12B
30
  ```
31
 
32
  Use `modal serve modal/app_<provider>.py` for a hot-reloading dev session.
33
 
34
  Or deploy one, several, or all providers with a single uv command — a thin
35
- wrapper that exposes the deploy-time env knobs below as flags:
36
 
37
  ```bash
38
  uv run scripts/deploy_modal.py # all providers
39
  uv run scripts/deploy_modal.py nvidia openbmb # just these
40
  uv run scripts/deploy_modal.py nvidia --keep-warm # = MODAL_LLM_KEEP_WARM=1
41
- # --auth → MODAL_LLM_REQUIRE_AUTH=1, --json-logs → MODAL_LLM_JSON_LOGS=1,
42
- # --log-level LEVEL → MODAL_LLM_LOG_LEVEL, --dry-run to preview the commands.
43
  ```
44
 
45
  Run these from the repo root; the script's own directory (`modal/`) is on
46
- `sys.path`, so `from service import ...` / `from registry import ...` resolve,
47
  and `import modal` still binds the installed SDK (the folder name does not
48
  shadow it).
49
 
@@ -86,34 +91,29 @@ changes needed:
86
  | `gpu` | Modal GPU spec, e.g. `H200:1`, `H100:2`, `L40S:1`, `L4:1`. |
87
  | `tensor_parallel_size` | Shard across GPUs; set equal to the GPU count in `gpu`. |
88
  | `max_model_len` | Cap context length to fit memory / tune throughput. |
89
- | `max_concurrent_inputs` | Hard ceiling of requests multiplexed onto one container. |
90
- | `target_concurrent_inputs` | Autoscale target: scale out here, burst to the max (defaults to ~75% of the ceiling). |
91
- | `buffer_containers` | Extra idle containers pre-warmed under active load (bursty traffic). |
92
  | `scaledown_window` | Idle seconds before a container stops (cold-start vs. cost). |
93
- | `gpu_snapshot` | Serve via Modal memory snapshots (CPU + GPU): cold starts restore a warmed engine in seconds instead of re-paying load + warmup. See [Cold starts](#cold-starts). |
94
  | `min_containers` | Keep N warm to eliminate cold starts (always-on cost). |
95
  | `gpu_memory_utilization` | Fraction of VRAM for weights + KV cache (vLLM default `0.9`); raise for a bigger KV cache. |
96
  | `enable_prefix_caching` | Reuse the KV cache for shared prompt prefixes (on by default; a big win when the system prompt / ledger context repeats across the cast). |
97
- | `async_scheduling` | Overlap CPU request scheduling with GPU compute (on by default; off for the Transformers-backend Gemma + omni models). |
98
  | `enforce_eager` | Skip CUDA-graph capture: faster cold start, lower steady-state throughput. |
99
- | `max_num_seqs` / `max_num_batched_tokens` | Batch-size and per-step token budget (memory vs. throughput). |
100
  | `log_requests` | Log each request's id, sampling params, and token counts (on by default). |
101
- | `log_outputs` | Also log generated text (verbose; off by default). |
102
- | `max_log_len` | Truncate logged prompts/outputs to N chars (`None` = no cap; default 2048). |
103
- | `uvicorn_access_log` | Keep the per-request HTTP access line (method, path, status). |
104
- | `reasoning_parser` / `tool_call_parser` / `enable_auto_tool_choice` | OpenAI tool/reasoning features. |
105
- | `multimodal` / `mm_limits` | Image/audio/video inputs and per-prompt caps. |
106
  | `trust_remote_code` | Required by MiniCPM / Nemotron custom modeling code. |
107
  | `vllm_version` | Per-model inference-stack pin (escape hatch); `None` = the default `VLLM_VERSION`, `"nightly"` = latest nightly wheel, else a pinned version. |
108
- | `extra_vllm_args` | Raw `vllm serve` flags appended verbatim (escape hatch). |
109
  | `extra_pip` / `env` | Extra image deps / container env (escape hatch). |
110
 
111
  > **Per-model vLLM version.** The image pins `VLLM_VERSION` (see `service.py`) for
112
  > reproducible deploys. A single model can override it via `vllm_version` when the
113
  > pinned release can't serve its architecture; this is scoped to that model's image,
114
- > so one model's bump never touches another provider's app. The Gemma 4 entries set
115
- > `vllm_version="nightly"` (plus `transformers>=5.10.2` and `VLLM_USE_FLASHINFER_SAMPLER=0`)
116
- > because the `gemma4_unified` architecture is unservable on the pinned release.
 
 
117
 
118
  ### Performance tuning
119
 
@@ -129,45 +129,31 @@ per model:
129
  graph cache is persisted on the shared `vllm-cache` Volume (`VLLM_CACHE_ROOT`),
130
  so only the *first* container compiles; later cold starts replay the cached
131
  graphs. Set `enforce_eager=True` on a model only when its backend can't capture
132
- graphs (the Transformers-backend Gemma models) or when cold start dominates.
133
  - **Async scheduling** overlaps CPU request scheduling with GPU compute; on by
134
  default for native vLLM models, off where the backend doesn't support it.
135
- - **Autoscaling** scales out at `target_concurrent_inputs` (≈75% of the ceiling by
136
- default) while a hot container bursts up to `max_concurrent_inputs`, so we add
137
- capacity before a container saturates rather than after. Use `buffer_containers`
138
- to pre-warm spares for bursty traffic, or `min_containers` to remove cold starts
139
  entirely (at always-on cost).
140
- - **The V1 engine is pinned** (`VLLM_USE_V1=1`) for its better scheduler, chunked
141
- prefill, and prefix caching.
142
 
143
  For memory-bound models, raise `gpu_memory_utilization` (more KV cache → more
144
- concurrency) and cap `max_num_seqs` / `max_num_batched_tokens` if a step OOMs.
 
145
 
146
  ### Cold starts
147
 
148
- A scale-from-zero cold start normally pays the full pipeline: container boot →
149
- weight load → engine warmup, which takes minutes for the bigger models. Two mechanisms cut
150
- this (ADR-0030):
151
-
152
- **1. Memory snapshots (`gpu_snapshot=True`, per model).** The first container
153
- boots once, loads weights, runs a few warmup completions, puts vLLM to sleep
154
- (sleep level 1: weights offloaded to host RAM, KV cache dropped), and Modal
155
- snapshots the container, CPU *and* GPU state. Every later cold start restores
156
- the snapshot and wakes the engine, turning a multi-minute boot into seconds.
157
- Under the hood this switches the model from the plain `@app.function` web server
158
- to a class-based lifecycle (`@modal.enter(snap=True)` warmup → snapshot →
159
- `@modal.enter(snap=False)` wake), but the public URL and API are identical:
160
- clients can't tell the paths apart.
161
-
162
- Snapshot-enabled today: `nemotron-3-nano-4b` (tiny), `minicpm-4-1-8b` (fast),
163
- `nemotron-cascade-14b`. Left off deliberately: the Gemmas (nightly
164
- Transformers-backend path, sleep mode unverified), `nemotron-3-nano-30b`
165
- (~60GB of weights won't fit host RAM during sleep), and the omni specialist.
166
- GPU snapshots are **Modal-alpha**: if a snapshot model misbehaves, set its
167
- `gpu_snapshot=False` and redeploy; the plain path is unchanged.
168
-
169
- **2. Demo-day keep-warm (deploy-time, no code edits).** Pin warm containers for
170
- every *profile-bound* model (tiny/fast/balanced/strong) right before a live
171
  demo; specialists keep scale-to-zero:
172
 
173
  ```bash
@@ -197,49 +183,26 @@ edits; it reads the same `catalogue.py`.
197
  `app = modal.App(PROVIDERS["<provider>"].app)` then
198
  `register_all(app, PROVIDERS["<provider>"].models)`.
199
 
200
- ## Quantization (lower precision)
201
-
202
- Every model repo ships **BF16** weights. To shrink the memory footprint (fit a
203
- model on a smaller GPU, or free VRAM for a longer context / more concurrency), you
204
- can serve it at lower precision. This is purely serving-side: it only adds
205
- `--quantization` / `--kv-cache-dtype` to the vLLM argv, and `--served-model-name`
206
- is unchanged, so the engine, endpoint URLs, and the running cast are untouched.
207
-
208
- Two controls, env override wins:
209
 
210
- - **Per model**: set `quantization` (and/or `kv_cache_dtype`) on a `ModelConfig`
211
- in `catalogue.py`. This is the baseline a model serves at by default.
212
- - **Per deploy (no code edits)**: `MODAL_LLM_QUANTIZATION` / `MODAL_LLM_KV_CACHE_DTYPE`
213
- override every model in the deploy. A disable token (`none`/`off`/`bf16`/…) forces
214
- full precision even on a model that defaults to quantized.
215
 
216
- ```bash
217
- # On-the-fly FP8 weights for one provider (via the deploy helper):
218
- uv run scripts/deploy_modal.py nvidia --quantization fp8
219
-
220
- # FP8 weights + FP8 KV cache, raw modal CLI:
221
- MODAL_LLM_QUANTIZATION=fp8 MODAL_LLM_KV_CACHE_DTYPE=fp8 modal deploy modal/app_nvidia.py
222
-
223
- # Force full precision back (overrides any per-model default):
224
- uv run scripts/deploy_modal.py nvidia --quantization none
225
  ```
226
 
 
 
 
227
  > **Not every architecture serves under on-the-fly FP8.** It needs an Ada/Hopper
228
  > GPU (our L4/L40S/H200 all qualify) *and* vLLM support for the model's arch.
229
- > Custom-code / hybrid-mamba archs (Nemotron-H = `nemotron-3-nano-4b`/`-30b`,
230
- > MiniCPM) and the Transformers-backend Gemmas may **fail to boot** under it; a
231
- > failed boot surfaces as `modal-http: invalid function call` (no healthy
232
- > container). Verify a provider after flipping it on (`modal/healthcheck.py` or
233
- > `curl <url>/v1/models`); if a model won't start, redeploy that provider without
234
- > the flag. This is why all per-model defaults stay `None` for now. See ADR-0031.
235
-
236
- > **FP8 KV cache (`--kv-cache-dtype fp8`) is silently dropped for snapshot models.**
237
- > On the pinned vLLM it crashes the `/wake_up` path (`init_fp8_kv_scales` →
238
- > `'list' object has no attribute 'zero_'`), so an FP8-KV snapshot model boots but
239
- > can never wake. `build_command` drops the flag for any `gpu_snapshot=True` model
240
- > and logs a `⚠️` line at deploy; the endpoint serves with full-precision KV cache.
241
- > FP8 *weights* (`--quantization fp8`) are unaffected. To run FP8 KV cache on such a
242
- > model, set its `gpu_snapshot=False`. See ADR-0031.
243
 
244
  ## Auth
245
 
@@ -266,40 +229,11 @@ OpenAPI spec (`../openapi.yaml`).
266
  ## Observability & logging
267
 
268
  Every container's stdout/stderr is captured by Modal; watch it live with
269
- `modal app logs <app-name>` or in the dashboard. Two layers shape what you see:
270
-
271
- **Request-level detail (on by default).** Each endpoint runs vLLM with
272
- `--enable-log-requests`, so every call logs its request id, sampling params, and
273
- (on completion) prompt/generation token counts and finish reason. `--max-log-len`
274
- caps the logged prompt at 2048 chars so a long context can't bloat a log line.
275
- The uvicorn access log (method, path, status, latency) stays on. Tune per model:
276
-
277
- | Knob | Effect |
278
- | ----------------- | ------------------------------------------------------------- |
279
- | `log_requests` | Per-request id / params / token counts (default **on**). |
280
- | `log_outputs` | Also log the generated text; verbose, can echo story content (default off). |
281
- | `max_log_len` | Truncate logged prompts/outputs; set `None` to log them in full. |
282
- | `uvicorn_access_log` | Set `False` to drop the per-request HTTP access line. |
283
-
284
- Clients can pass an `X-Request-Id` header and it shows up in the request logs,
285
- handy for correlating an engine call with its server-side line.
286
-
287
- **Structured JSON (opt-in).** For grepping fields or shipping to an aggregator,
288
- emit one JSON object per log line instead of vLLM's coloured text. Turn it on at
289
- deploy time, no code edits:
290
-
291
- ```bash
292
- MODAL_LLM_JSON_LOGS=1 modal deploy modal/app_nvidia.py
293
- MODAL_LLM_JSON_LOGS=1 MODAL_LLM_LOG_LEVEL=DEBUG modal deploy modal/app_google.py
294
- ```
295
-
296
- This ships a dependency-free formatter (`modal/vllm_logging.py`) into the image
297
- and points vLLM's `VLLM_LOGGING_CONFIG_PATH` at a generated `dictConfig`, so
298
- **all** vLLM + uvicorn logs (including the request logs above) come out as JSON
299
- with `ts` / `level` / `logger` / `msg` / `src` plus any structured extras (request
300
- id, token counts). `MODAL_LLM_LOG_LEVEL` (default `INFO`) sets verbosity for both
301
- the text and JSON paths. Leave JSON off for live demos; the coloured text is
302
- easier to watch.
303
 
304
  Throughput, KV-cache usage, and prefix-cache hit rate are logged every second
305
  (`VLLM_LOG_STATS_INTERVAL`) and also exposed as Prometheus metrics at `/metrics`.
@@ -312,12 +246,13 @@ total parameter count.
312
 
313
  | Model | Params (total / active) | Starting GPU |
314
  | ---------------------------------- | ----------------------- | ------------ |
315
- | Nemotron-3-Nano-30B-A3B | 30B / ~3B (MoE) | `H200:1` |
316
- | Nemotron-3-Nano-4B | 4B (Tiny Titan) | `L4:1` |
317
- | MiniCPM-o-4_5 (omni) | ~8B + media encoders | `L40S:1` |
 
318
  | MiniCPM4.1-8B | 8B | `L40S:1` |
319
- | Gemma-4-26B-A4B-it | 26B / ~4B (MoE) | `H200:1` |
320
- | Gemma-4-12B | 12B | `L40S:1` |
321
 
322
  These are starting points. If a container OOMs, lower `max_model_len`, raise the
323
  GPU tier, or bump `tensor_parallel_size` (and the GPU count) for sharding.
 
3
  This guide covers prerequisites, deployment, configuration knobs, auth, GPU
4
  sizing, and wiring the endpoints into the engine.
5
 
6
+ The serving layer is deliberately small: it's Modal's canonical vLLM recipe (an
7
+ autoscaling `@app.function` that launches `vllm serve` as a subprocess behind a
8
+ `@modal.web_server`), applied once in `service.py` to every model in
9
+ `catalogue.py`. See ADR-0034 for why we stripped the earlier snapshot / FP8 /
10
+ structured-logging machinery back to this core.
11
+
12
  ## Prerequisites
13
 
14
  ```bash
 
30
  Each provider is its own Modal app, deployed independently:
31
 
32
  ```bash
33
+ modal deploy modal/app_nvidia.py # Nemotron 3 Nano 4B + 30B, Cascade 14B
34
+ modal deploy modal/app_openbmb.py # MiniCPM4.1-8B + MiniCPM-o 4.5
35
+ modal deploy modal/app_google.py # Gemma 4 12B + 26B
36
  ```
37
 
38
  Use `modal serve modal/app_<provider>.py` for a hot-reloading dev session.
39
 
40
  Or deploy one, several, or all providers with a single uv command: a thin
41
+ wrapper that exposes the two deploy-time env knobs as flags:
42
 
43
  ```bash
44
  uv run scripts/deploy_modal.py # all providers
45
  uv run scripts/deploy_modal.py nvidia openbmb # just these
46
  uv run scripts/deploy_modal.py nvidia --keep-warm # = MODAL_LLM_KEEP_WARM=1
47
+ # --auth → MODAL_LLM_REQUIRE_AUTH=1, --dry-run to preview the commands.
 
48
  ```
49
 
50
  Run these from the repo root; the script's own directory (`modal/`) is on
51
+ `sys.path`, so `from service import ...` / `from catalogue import ...` resolve,
52
  and `import modal` still binds the installed SDK (the folder name does not
53
  shadow it).
54
 
 
91
  | `gpu` | Modal GPU spec, e.g. `H200:1`, `H100:2`, `L40S:1`, `L4:1`. |
92
  | `tensor_parallel_size` | Shard across GPUs; set equal to the GPU count in `gpu`. |
93
  | `max_model_len` | Cap context length to fit memory / tune throughput. |
94
+ | `max_concurrent_inputs` | Hard ceiling of requests multiplexed onto one container (autoscale target is ~75% of it). |
 
 
95
  | `scaledown_window` | Idle seconds before a container stops (cold-start vs. cost). |
 
96
  | `min_containers` | Keep N warm to eliminate cold starts (always-on cost). |
97
  | `gpu_memory_utilization` | Fraction of VRAM for weights + KV cache (vLLM default `0.9`); raise for a bigger KV cache. |
98
  | `enable_prefix_caching` | Reuse the KV cache for shared prompt prefixes (on by default; a big win when the system prompt / ledger context repeats across the cast). |
99
+ | `async_scheduling` | Overlap CPU request scheduling with GPU compute (on by default; off for the Transformers-backend Gemma 12B + omni models). |
100
  | `enforce_eager` | Skip CUDA-graph capture: faster cold start, lower steady-state throughput. |
 
101
  | `log_requests` | Log each request's id, sampling params, and token counts (on by default). |
102
+ | `reasoning_parser` / `tool_call_parser` / `enable_auto_tool_choice` | OpenAI tool/reasoning features (vLLM parser names; leave None if unsupported). |
103
+ | `mm_limits` | Per-prompt image/audio/video caps; set to 0 on an auto-detected-multimodal model you serve text-only. |
 
 
 
104
  | `trust_remote_code` | Required by MiniCPM / Nemotron custom modeling code. |
105
  | `vllm_version` | Per-model inference-stack pin (escape hatch); `None` = the default `VLLM_VERSION`, `"nightly"` = latest nightly wheel, else a pinned version. |
106
+ | `extra_vllm_args` | Raw `vllm serve` flags appended verbatim: the escape hatch for anything not modelled above (quantization, batch caps, custom parser plugins, …). |
107
  | `extra_pip` / `env` | Extra image deps / container env (escape hatch). |
108
 
109
  > **Per-model vLLM version.** The image pins `VLLM_VERSION` (see `service.py`) for
110
  > reproducible deploys. A single model can override it via `vllm_version` when the
111
  > pinned release can't serve its architecture; this is scoped to that model's image,
112
+ > so one model's bump never touches another provider's app. Only the Gemma 4 **12B**
113
+ > sets `vllm_version="nightly"` (plus `transformers>=5.10.2`) because its
114
+ > `gemma4_unified` architecture has no class in any stable vLLM ≤0.22.1. The Gemma 4
115
+ > **26B** is a standard MoE arch that serves on the pinned stable release, so it
116
+ > stays on the default pin.
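
The `vllm_version` rule described above (`None` means the default pin, `"nightly"` means the nightly wheel, anything else is an exact pin) can be sketched as a small resolver. This is a minimal illustration, not the code in `service.py`: `resolve_vllm_requirement` and its return shape are hypothetical, and only the `VLLM_VERSION = "0.21.0"` constant is taken from the diff.

```python
from __future__ import annotations

# Default pin, matching the VLLM_VERSION constant shown in service.py.
VLLM_VERSION = "0.21.0"


def resolve_vllm_requirement(vllm_version: str | None) -> tuple[str, bool]:
    """Return (pip requirement, wants_nightly) for a model's vllm_version.

    Hypothetical helper for illustration only.
    """
    if vllm_version is None:
        # No override: the model shares the image-wide pin.
        return f"vllm=={VLLM_VERSION}", False
    if vllm_version == "nightly":
        # Unpinned requirement; the caller would point pip at a nightly index.
        return "vllm", True
    # Anything else is treated as an exact version pin for this model only.
    return f"vllm=={vllm_version}", False
```

Returning the nightly decision as a separate flag keeps the install-source choice out of the requirement string, so a per-model override never leaks into another model's image.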
117
 
118
  ### Performance tuning
119
 
 
129
  graph cache is persisted on the shared `vllm-cache` Volume (`VLLM_CACHE_ROOT`),
130
  so only the *first* container compiles; later cold starts replay the cached
131
  graphs. Set `enforce_eager=True` on a model only when its backend can't capture
132
+ graphs (the Transformers-backend Gemma 12B) or when cold start dominates.
133
  - **Async scheduling** overlaps CPU request scheduling with GPU compute; on by
134
  default for native vLLM models, off where the backend doesn't support it.
135
+ - **Autoscaling** scales out at ~75% of `max_concurrent_inputs` while a hot
136
+ container bursts up to the ceiling, so we add capacity before a container
137
+ saturates rather than after. Use `min_containers` to remove cold starts
 
138
  entirely (at always-on cost).
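
The autoscaling rule above (scale out at roughly 75% of the per-container ceiling, burst to the ceiling) can be illustrated with a short calculation. This is a sketch under assumptions: the helper name `autoscale_target`, the floor rounding, and the minimum of 1 are illustrative choices, not confirmed details of `service.py`.

```python
import math


def autoscale_target(max_concurrent_inputs: int, fraction: float = 0.75) -> int:
    """Derive the scale-out target from the hard per-container ceiling.

    Hypothetical helper: the rounding and the floor of 1 are assumptions.
    """
    if max_concurrent_inputs < 1:
        raise ValueError("max_concurrent_inputs must be at least 1")
    # Round down so the target never exceeds the ceiling, but keep it >= 1
    # so a container that only takes one input still has a valid target.
    return max(1, math.floor(max_concurrent_inputs * fraction))
```

With the `max_concurrent_inputs=64` used for the Gemma 4 26B entry, this gives a target of 48: new containers are requested once a container holds about 48 in-flight requests, while that container can still absorb up to 64.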
 
 
139
 
140
  For memory-bound models, raise `gpu_memory_utilization` (more KV cache → more
141
+ concurrency); if a step OOMs, lower `max_model_len` or cap the batch via
142
+ `extra_vllm_args` (e.g. `("--max-num-seqs", "32")`).
143
 
144
  ### Cold starts
145
 
146
+ A scale-from-zero cold start pays container boot → weight load → engine warmup.
147
+ Two mechanisms keep that bounded:
148
+
149
+ **1. Shared caches (always on).** Weights are pulled once onto the
150
+ `huggingface-cache` Volume and the torch.compile / CUDA-graph artifacts are
151
+ persisted on the `vllm-cache` Volume (`VLLM_CACHE_ROOT`). So a model downloads
152
+ once across every container and provider, and only the *first* container
153
+ compiles its graphs; later cold starts replay the cache.
154
+
155
+ **2. Demo-day keep-warm (deploy-time, no code edits).** Pin one warm container
156
+ for every *profile-bound* model (tiny/fast/balanced/strong) right before a live
 
157
  demo; specialists keep scale-to-zero:
158
 
159
  ```bash
 
183
  `app = modal.App(PROVIDERS["<provider>"].app)` then
184
  `register_all(app, PROVIDERS["<provider>"].models)`.
185
 
186
+ ## Lower precision (quantization)
 
187
 
188
+ Every model repo here ships **BF16** weights and serves at full precision. To
189
+ shrink a model's footprint (fit it on a smaller GPU, or free VRAM for a longer
190
+ context / more concurrency), pass vLLM's quantization flags through the
191
+ `extra_vllm_args` escape hatch on its `ModelConfig`:
 
192
 
193
+ ```python
194
+ extra_vllm_args=("--quantization", "fp8", "--kv-cache-dtype", "fp8")
 
195
  ```
196
 
197
+ This is purely serving-side: `--served-model-name` is unchanged, so the engine,
198
+ endpoint URLs, and the running cast are untouched.
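
To see why quantization flags leave the public surface untouched, it may help to picture how verbatim extra args could be appended to the `vllm serve` command line. The sketch below is an assumption-laden illustration: `build_serve_argv` and the `org/model-name` placeholder are hypothetical and do not reproduce the real `build_command` in `service.py`. Only the `vllm serve` flags themselves are taken from the text above.

```python
def build_serve_argv(model_id, served_name, port=8000, extra_vllm_args=()):
    """Assemble a vllm serve argv, appending escape-hatch flags verbatim.

    Hypothetical helper for illustration only.
    """
    argv = [
        "vllm", "serve", model_id,
        "--served-model-name", served_name,  # what clients put in "model"
        "--port", str(port),
    ]
    # Extra flags go last and are not interpreted, so quantization changes
    # only how weights are loaded, not the served name or the URL.
    argv.extend(extra_vllm_args)
    return argv


full_precision = build_serve_argv("org/model-name", "demo-model")
fp8 = build_serve_argv(
    "org/model-name",
    "demo-model",
    extra_vllm_args=("--quantization", "fp8", "--kv-cache-dtype", "fp8"),
)
```

Both command lines share the same prefix, including `--served-model-name`, which is why clients keep sending the same model id after a precision change.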
199
+
200
  > **Not every architecture serves under on-the-fly FP8.** It needs an Ada/Hopper
201
  > GPU (our L4/L40S/H200 all qualify) *and* vLLM support for the model's arch.
202
+ > Custom-code / hybrid-Mamba archs (the Nemotron Nanos, MiniCPM) and the
203
+ > Transformers-backend Gemma 12B may **fail to boot** under it. Verify a model
204
+ > after adding the flag (`modal/healthcheck.py` or `curl <url>/v1/models`); if it
205
+ > won't start, drop the flag. This is why every model defaults to full precision.
 
206
 
207
  ## Auth
208
 
 
229
  ## Observability & logging
230
 
231
  Every container's stdout/stderr is captured by Modal; watch it live with
232
+ `modal app logs <app-name>` or in the dashboard. Each endpoint runs vLLM with
233
+ `--enable-log-requests` (toggle via `log_requests`), so every call logs its
234
+ request id, sampling params, and (on completion) prompt/generation token counts
235
+ and finish reason. Clients can pass an `X-Request-Id` header and it shows up in
236
+ the request logs, handy for correlating an engine call with its server-side line.
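
The `X-Request-Id` correlation described above can be sketched from the client side with the standard library alone. This is a minimal example under assumptions: `build_chat_request` is a hypothetical helper, the base URL and model id passed to it are placeholders, and the bearer-token header assumes the endpoint uses OpenAI-style auth when auth is enabled.

```python
import json
import uuid
import urllib.request


def build_chat_request(base_url, model, prompt, api_key=None, request_id=None):
    """Build (but do not send) a chat completion request tagged with a request id."""
    # Generate an id when the caller does not supply one.
    request_id = request_id or uuid.uuid4().hex
    headers = {"Content-Type": "application/json", "X-Request-Id": request_id}
    if api_key:
        # Assumption: the endpoint expects an OpenAI-style bearer token.
        headers["Authorization"] = f"Bearer {api_key}"
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/chat/completions",
        data=body,
        headers=headers,
        method="POST",
    )
    return request_id, request
```

Send the returned request with `urllib.request.urlopen(request)`, then search the output of `modal app logs <app-name>` for the returned id to find the matching server-side line.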
 
237
 
238
  Throughput, KV-cache usage, and prefix-cache hit rate are logged every second
239
  (`VLLM_LOG_STATS_INTERVAL`) and also exposed as Prometheus metrics at `/metrics`.
 
246
 
247
  | Model | Params (total / active) | Starting GPU |
248
  | ---------------------------------- | ----------------------- | ------------ |
249
+ | Nemotron-3-Nano-30B-A3B | ~31B / ~3B (Mamba MoE) | `H200:1` |
250
+ | Nemotron-Cascade-14B-Thinking | ~14B (dense, Qwen3) | `L40S:1` |
251
+ | Nemotron-3-Nano-4B | ~4B (Tiny Titan) | `L4:1` |
252
+ | MiniCPM-o-4_5 (omni) | ~9B + media encoders | `L40S:1` |
253
  | MiniCPM4.1-8B | 8B | `L40S:1` |
254
+ | Gemma-4-26B-A4B-it | ~25B / ~4B (MoE) | `H200:1` |
255
+ | Gemma-4-12B-it | ~12B (dense) | `L40S:1` |
256
 
257
  These are starting points. If a container OOMs, lower `max_model_len`, raise the
258
  GPU tier, or bump `tensor_parallel_size` (and the GPU count) for sharding.
modal/healthcheck.py CHANGED
@@ -218,8 +218,7 @@ async def check_chat(client: httpx.AsyncClient, t: Target, deadline: float) -> N
218
  backoff = min(backoff * 1.5, 20.0)
219
 
220
 
221
- async def run_target(t: Target, api_key: str, timeout: int, do_chat: bool,
222
- sem: asyncio.Semaphore) -> None:
223
  async with sem:
224
  t.started = time.monotonic()
225
  deadline = t.started + timeout
@@ -231,9 +230,9 @@ async def run_target(t: Target, api_key: str, timeout: int, do_chat: bool,
231
  # within 150s returns a 303 to the same URL (clients are expected to follow
232
  # it, up to ~20 hops / 50 min) while the container finishes cold-starting.
233
  # Without this, the first 303 at ~150s looks like a terminal error.
234
- async with httpx.AsyncClient(headers=headers, timeout=client_timeout,
235
- limits=limits, follow_redirects=True,
236
- max_redirects=20) as client:
237
  await check_models(client, t, deadline)
238
  if t.models_ok and do_chat:
239
  await check_chat(client, t, deadline)
@@ -258,10 +257,9 @@ PHASE_ICON = {
258
 
259
  def render_board(targets: list[Target], started: float) -> str:
260
  width = max(len(t.key) for t in targets)
261
- lines = [f" cold-start health-check · {len(targets)} endpoints · "
262
- f"{time.monotonic() - started:5.0f}s elapsed"]
263
  for t in targets:
264
- live = (t.elapsed or (time.monotonic() - t.started if t.started else 0.0))
265
  icon = PHASE_ICON.get(t.phase, "?")
266
  detail = t.phase
267
  if t.phase == "booting":
@@ -311,16 +309,14 @@ def print_report(targets: list[Target], do_chat: bool) -> None:
311
  detail = t.error or (t.sample if t.chat_ok else t.served_reported) or ""
312
  if t.chat_ok and t.finish_reason:
313
  detail = f"[{t.finish_reason}] {detail}"
314
- print(f" {t.key:<{kw}} {yn(t.models_ok):<6} {yn(t.chat_ok):<5} "
315
- f"{lat} {detail[:60]}")
316
 
317
  def healthy(t: Target) -> bool:
318
  return bool(t.models_ok and (t.chat_ok or not do_chat))
319
 
320
  ok = sum(1 for t in targets if healthy(t))
321
  print(" " + "-" * (len(header) - 2))
322
- print(f" {ok}/{len(targets)} healthy"
323
- + ("" if do_chat else " (liveness only; chat not tested)"))
324
  failed = [t.key for t in targets if not healthy(t)]
325
  if failed:
326
  print(f" needs attention: {', '.join(failed)}")
@@ -343,14 +339,16 @@ def build_targets(catalogue: ModuleType, workspace: str | None, args) -> list[Ta
343
  base_url = base_override.rstrip("/")
344
  else:
345
  base_url = catalogue.endpoint_url(e.app, e.endpoint_name, workspace)
346
- targets.append(Target(
347
- key=key,
348
- app=e.app,
349
- served_model_id=e.served_model_id,
350
- profile=e.profile,
351
- params_b=e.params_b,
352
- base_url=base_url,
353
- ))
 
 
354
  return targets
355
 
356
 
@@ -359,8 +357,11 @@ async def main_async(args) -> int:
359
  workspace = resolve_workspace(args.workspace)
360
  base_override = os.environ.get("MODAL_LLM_BASE_URL")
361
  if not workspace and not base_override:
362
- print("ERROR: could not resolve a Modal workspace. Pass --workspace, set "
363
- "$MODAL_WORKSPACE, or run `modal token new`.", file=sys.stderr)
 
 
 
364
  return 2
365
 
366
  targets = build_targets(catalogue, workspace, args)
@@ -378,8 +379,10 @@ async def main_async(args) -> int:
378
  return 0
379
 
380
  do_chat = not args.no_chat
381
- print(f"Workspace: {workspace} endpoints: {len(targets)} "
382
- f"chat: {'yes' if do_chat else 'no'} per-endpoint timeout: {args.timeout}s")
 
 
383
  print("Firing all endpoints concurrently; cold starts overlap, so this takes")
384
  print("about as long as the single slowest model, not the sum.\n")
385
 
@@ -388,9 +391,7 @@ async def main_async(args) -> int:
388
  done = asyncio.Event()
389
  progress = asyncio.create_task(progress_loop(targets, started, done))
390
  try:
391
- await asyncio.gather(*(
392
- run_target(t, api_key, args.timeout, do_chat, sem) for t in targets
393
- ))
394
  finally:
395
  done.set()
396
  await progress
@@ -398,18 +399,21 @@ async def main_async(args) -> int:
398
  print_report(targets, do_chat)
399
 
400
  if args.json:
401
- summary = [{
402
- "endpoint": t.key,
403
- "app": t.app,
404
- "served_model_id": t.served_model_id,
405
- "base_url": t.base_url,
406
- "models_ok": t.models_ok,
407
- "chat_ok": t.chat_ok,
408
- "latency_s": round(t.elapsed, 1),
409
- "finish_reason": t.finish_reason,
410
- "served_reported": t.served_reported,
411
- "error": t.error,
412
- } for t in targets]
 
 
 
413
  Path(args.json).write_text(json.dumps(summary, indent=2))
414
  print(f"\nWrote JSON summary to {args.json}")
415
 
@@ -418,21 +422,17 @@ async def main_async(args) -> int:
418
 
419
 
420
  def parse_args(argv: list[str] | None = None) -> argparse.Namespace:
421
- p = argparse.ArgumentParser(description=__doc__,
422
- formatter_class=argparse.RawDescriptionHelpFormatter)
423
  p.add_argument("--workspace", help="Modal workspace slug (else $MODAL_WORKSPACE / `modal profile current`)")
424
  p.add_argument("--only", help="comma-separated endpoint keys to include")
425
  p.add_argument("--skip", help="comma-separated endpoint keys to exclude")
426
- p.add_argument("--profiles-only", action="store_true",
427
- help="test only the engine-bound tiers (tiny/fast/balanced/strong)")
428
- p.add_argument("--no-chat", action="store_true",
429
- help="liveness only (GET /v1/models); skip the chat completion")
430
- p.add_argument("--timeout", type=int, default=900,
431
- help="per-endpoint deadline in seconds (default 900)")
432
- p.add_argument("--concurrency", type=int, default=0,
433
- help="max endpoints in flight at once (default 0 = all)")
434
- p.add_argument("--print-urls", action="store_true",
435
- help="resolve and print endpoint URLs, then exit (no calls)")
436
  p.add_argument("--json", help="also write a machine-readable summary to this path")
437
  return p.parse_args(argv)
438
 
 
218
  backoff = min(backoff * 1.5, 20.0)
219
 
220
 
221
+ async def run_target(t: Target, api_key: str, timeout: int, do_chat: bool, sem: asyncio.Semaphore) -> None:
 
222
  async with sem:
223
  t.started = time.monotonic()
224
  deadline = t.started + timeout
 
230
  # within 150s returns a 303 to the same URL (clients are expected to follow
231
  # it, up to ~20 hops / 50 min) while the container finishes cold-starting.
232
  # Without this, the first 303 at ~150s looks like a terminal error.
233
+ async with httpx.AsyncClient(
234
+ headers=headers, timeout=client_timeout, limits=limits, follow_redirects=True, max_redirects=20
235
+ ) as client:
236
  await check_models(client, t, deadline)
237
  if t.models_ok and do_chat:
238
  await check_chat(client, t, deadline)
 
257
 
258
  def render_board(targets: list[Target], started: float) -> str:
259
  width = max(len(t.key) for t in targets)
260
+ lines = [f" cold-start health-check · {len(targets)} endpoints · {time.monotonic() - started:5.0f}s elapsed"]
 
261
  for t in targets:
262
+ live = t.elapsed or (time.monotonic() - t.started if t.started else 0.0)
263
  icon = PHASE_ICON.get(t.phase, "?")
264
  detail = t.phase
265
  if t.phase == "booting":
 
309
  detail = t.error or (t.sample if t.chat_ok else t.served_reported) or ""
310
  if t.chat_ok and t.finish_reason:
311
  detail = f"[{t.finish_reason}] {detail}"
312
+ print(f" {t.key:<{kw}} {yn(t.models_ok):<6} {yn(t.chat_ok):<5} {lat} {detail[:60]}")
 
313
 
314
  def healthy(t: Target) -> bool:
315
  return bool(t.models_ok and (t.chat_ok or not do_chat))
316
 
317
  ok = sum(1 for t in targets if healthy(t))
318
  print(" " + "-" * (len(header) - 2))
319
+ print(f" {ok}/{len(targets)} healthy" + ("" if do_chat else " (liveness only; chat not tested)"))
 
320
  failed = [t.key for t in targets if not healthy(t)]
321
  if failed:
322
  print(f" needs attention: {', '.join(failed)}")
 
339
  base_url = base_override.rstrip("/")
340
  else:
341
  base_url = catalogue.endpoint_url(e.app, e.endpoint_name, workspace)
342
+ targets.append(
343
+ Target(
344
+ key=key,
345
+ app=e.app,
346
+ served_model_id=e.served_model_id,
347
+ profile=e.profile,
348
+ params_b=e.params_b,
349
+ base_url=base_url,
350
+ )
351
+ )
352
  return targets
353
 
354
 
 
357
  workspace = resolve_workspace(args.workspace)
358
  base_override = os.environ.get("MODAL_LLM_BASE_URL")
359
  if not workspace and not base_override:
360
+ print(
361
+ "ERROR: could not resolve a Modal workspace. Pass --workspace, set "
362
+ "$MODAL_WORKSPACE, or run `modal token new`.",
363
+ file=sys.stderr,
364
+ )
365
  return 2
366
 
367
  targets = build_targets(catalogue, workspace, args)
 
379
  return 0
380
 
381
  do_chat = not args.no_chat
382
+ print(
383
+ f"Workspace: {workspace} endpoints: {len(targets)} "
384
+ f"chat: {'yes' if do_chat else 'no'} per-endpoint timeout: {args.timeout}s"
385
+ )
386
  print("Firing all endpoints concurrently; cold starts overlap, so this takes")
387
  print("about as long as the single slowest model, not the sum.\n")
388
 
 
391
  done = asyncio.Event()
392
  progress = asyncio.create_task(progress_loop(targets, started, done))
393
  try:
394
+ await asyncio.gather(*(run_target(t, api_key, args.timeout, do_chat, sem) for t in targets))
 
 
395
  finally:
396
  done.set()
397
  await progress
 
399
  print_report(targets, do_chat)
400
 
401
  if args.json:
402
+ summary = [
403
+ {
404
+ "endpoint": t.key,
405
+ "app": t.app,
406
+ "served_model_id": t.served_model_id,
407
+ "base_url": t.base_url,
408
+ "models_ok": t.models_ok,
409
+ "chat_ok": t.chat_ok,
410
+ "latency_s": round(t.elapsed, 1),
411
+ "finish_reason": t.finish_reason,
412
+ "served_reported": t.served_reported,
413
+ "error": t.error,
414
+ }
415
+ for t in targets
416
+ ]
417
  Path(args.json).write_text(json.dumps(summary, indent=2))
418
  print(f"\nWrote JSON summary to {args.json}")
419
 
 
422
 
423
 
424
  def parse_args(argv: list[str] | None = None) -> argparse.Namespace:
425
+ p = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
 
426
  p.add_argument("--workspace", help="Modal workspace slug (else $MODAL_WORKSPACE / `modal profile current`)")
427
  p.add_argument("--only", help="comma-separated endpoint keys to include")
428
  p.add_argument("--skip", help="comma-separated endpoint keys to exclude")
429
+ p.add_argument(
430
+ "--profiles-only", action="store_true", help="test only the engine-bound tiers (tiny/fast/balanced/strong)"
431
+ )
432
+ p.add_argument("--no-chat", action="store_true", help="liveness only (GET /v1/models); skip the chat completion")
433
+ p.add_argument("--timeout", type=int, default=900, help="per-endpoint deadline in seconds (default 900)")
434
+ p.add_argument("--concurrency", type=int, default=0, help="max endpoints in flight at once (default 0 = all)")
435
+ p.add_argument("--print-urls", action="store_true", help="resolve and print endpoint URLs, then exit (no calls)")
 
 
 
436
  p.add_argument("--json", help="also write a machine-readable summary to this path")
437
  return p.parse_args(argv)
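
The `--concurrency` flag above (default 0 = all) and the `sem` argument passed to `run_target` suggest a semaphore-bounded `asyncio.gather`. The diff does not show how `sem` is built, so the following is only a minimal sketch under that assumption; `run_bounded` and `fake_probe` are hypothetical names, and the probe just sleeps instead of calling an endpoint.

```python
import asyncio


async def run_bounded(coros, concurrency: int = 0):
    """Run coroutines concurrently; concurrency <= 0 means 'no cap'."""
    # Assumption: 0 maps to "all in flight at once", mirroring the CLI help text.
    limit = concurrency if concurrency > 0 else max(1, len(coros))
    sem = asyncio.Semaphore(limit)

    async def guarded(coro):
        # At most `limit` coroutines are inside this block at any moment.
        async with sem:
            return await coro

    # gather() returns results in input order, regardless of finish order.
    return await asyncio.gather(*(guarded(c) for c in coros))


async def fake_probe(name: str, delay: float) -> str:
    # Hypothetical stand-in for a real endpoint call; it only sleeps.
    await asyncio.sleep(delay)
    return name


def demo() -> list[str]:
    probes = [fake_probe("a", 0.03), fake_probe("b", 0.01), fake_probe("c", 0.02)]
    return asyncio.run(run_bounded(probes, concurrency=2))
```

With the cap disabled, total wall time tracks the slowest probe rather than the sum, which is the behaviour the script's banner message describes.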
438
 
modal/service.py CHANGED
@@ -1,19 +1,18 @@
1
  """Reusable, OpenAI-compatible model-serving layer for Modal.
2
 
3
- This module is provider-agnostic. It knows how to take a single ``ModelConfig``
4
- and turn it into a serverless, autoscaling, OpenAI-compatible HTTP endpoint
5
- backed by vLLM. Each provider app (``app_nvidia.py``, ``app_openbmb.py``,
6
- ``app_google.py``) imports :func:`register_model` and wires up its own models,
7
- so providers stay fully isolated in their own Modal apps while sharing one
8
- battle-tested serving path.
9
-
10
- Design goals:
11
- - **Extensible**: add a model by appending one ``ModelConfig`` to the registry.
12
- - **Scalable**: serverless autoscaling, input concurrency, shared weight cache.
13
- - **Configurable per task**: every knob (GPU, context length, parsers,
14
- multimodal limits, extra flags) lives in data, not code.
15
-
16
- The served endpoints speak the OpenAI REST API (``/v1/chat/completions`,
17
  ``/v1/completions``, ``/v1/models``), so any OpenAI-compatible client can call
18
  them by pointing ``base_url`` at the deployed URL.
19
  """
@@ -33,10 +32,11 @@ from catalogue import ModelConfig
33
 
34
  # --- Shared serving constants --------------------------------------------------
35
 
36
- # Pin the inference stack so deploys are reproducible. Bump deliberately.
 
37
  VLLM_VERSION = "0.21.0"
38
  CUDA_IMAGE = "nvidia/cuda:12.9.0-devel-ubuntu22.04"
39
- PYTHON_VERSION = "3.13"
40
 
41
  # The in-container port vLLM listens on; Modal maps it to a public HTTPS URL.
42
  VLLM_PORT = 8000
@@ -46,12 +46,12 @@ HF_CACHE_PATH = "/root/.cache/huggingface"
46
  VLLM_CACHE_PATH = "/root/.cache/vllm"
47
 
48
  # Name of the Modal Secret that holds a Hugging Face token (key: HF_TOKEN).
49
- # Required only for gated repos (e.g. Gemma). Create it once with:
50
  # modal secret create huggingface-secret HF_TOKEN=hf_...
51
  HF_SECRET_NAME = "huggingface-secret"
52
 
53
- # Name of the Modal Secret holding the bearer token clients must present.
54
- # The key MUST be VLLM_API_KEY: vLLM reads that env var and then enforces
55
  # `Authorization: Bearer <token>` on every request. Create it once with:
56
  # modal secret create llm-api-key VLLM_API_KEY=sk-...
57
  API_KEY_SECRET_NAME = "llm-api-key"
@@ -60,72 +60,27 @@ API_KEY_SECRET_NAME = "llm-api-key"
60
  # MODAL_LLM_REQUIRE_AUTH=1 modal deploy modal/app_google.py
61
  # When enabled, every endpoint mounts API_KEY_SECRET_NAME and rejects requests
62
  # without a valid bearer token. Off by default (endpoints are then public).
63
- REQUIRE_API_KEY = os.environ.get("MODAL_LLM_REQUIRE_AUTH", "").lower() in (
64
- "1",
65
- "true",
66
- "yes",
67
- )
68
-
69
- # Emit logs as structured JSON (one object per line) instead of vLLM's default
70
- # human-readable text. Opt in at deploy time (no code edits), mirroring the auth
71
- # toggle above:
72
- # MODAL_LLM_JSON_LOGS=1 modal deploy modal/app_google.py
73
- # Off by default: the coloured text logs are nicer to watch live; turn this on
74
- # when shipping logs to an aggregator or grepping fields. Request-level logging
75
- # itself (the per-request detail) is always on via ModelConfig, independent of
76
- # the format chosen here.
77
- JSON_LOGS = os.environ.get("MODAL_LLM_JSON_LOGS", "").lower() in ("1", "true", "yes")
78
-
79
- # Verbosity for the served loggers (vLLM honours VLLM_LOGGING_LEVEL; the JSON
80
- # config applies the same level). Read at deploy time and baked into the image.
81
- LOG_LEVEL = os.environ.get("MODAL_LLM_LOG_LEVEL", "INFO").upper()
82
 
83
  # Demo-day switch: keep N containers warm for every *profile-bound* model (the
84
- # tiers the cast actually runs on), removing their cold starts entirely for the
85
- # duration of the deploy. Specialists keep scale-to-zero. Costs GPU-hours while
86
- # deployed; turn it on right before a live demo, redeploy without it after:
87
  # MODAL_LLM_KEEP_WARM=1 modal deploy modal/app_nvidia.py
88
  KEEP_WARM = int(os.environ.get("MODAL_LLM_KEEP_WARM", "0") or "0")
89
 
90
- # Deploy-time precision overrides. When set, each wins over the matching per-model
91
- # ``ModelConfig`` field for *every* model in the deploy, so you flip a whole
92
- # provider to FP8 without editing the catalogue (deploys are per-provider, so the
93
- # blast radius is one app):
94
- # MODAL_LLM_QUANTIZATION=fp8 modal deploy modal/app_nvidia.py
95
- # MODAL_LLM_QUANTIZATION=fp8 MODAL_LLM_KV_CACHE_DTYPE=fp8 uv run scripts/deploy_modal.py nvidia
96
- # A disable token (``none``/``off``/``bf16``/…) forces full precision even if a model
97
- # defaults to a quantized mode. Read at deploy time and baked into each model's argv
98
- # (see build_command). CAVEAT: not every architecture serves under on-the-fly FP8.
99
- # Verify per provider; a model that can't will fail to boot. See ADR-0031.
100
- QUANTIZATION = os.environ.get("MODAL_LLM_QUANTIZATION", "").strip()
101
- KV_CACHE_DTYPE = os.environ.get("MODAL_LLM_KV_CACHE_DTYPE", "").strip()
102
- # Override values that mean "no quantization / model-default precision": they make
103
- # the resolver omit the flag rather than pass a bogus value to vLLM.
104
- _PRECISION_DISABLE = frozenset({"none", "off", "false", "0", "no", "bf16", "fp16", "auto"})
105
-
106
- # Where the structured-logging module + its generated config live in the
107
- # container. The module dir goes on PYTHONPATH so vLLM can import the formatter
108
- # the dictConfig references (``vllm_logging.JsonFormatter``).
109
- _LOG_MODULE_DIR = "/opt/mal_logging"
110
- _LOG_CONFIG_PATH = "/tmp/vllm_logging.json"
111
-
112
  # Weights and the vLLM compile cache are shared across every provider app, so a
113
  # model pulled once is warm for all subsequent deploys and containers.
114
  hf_cache_vol = modal.Volume.from_name("huggingface-cache", create_if_missing=True)
115
  vllm_cache_vol = modal.Volume.from_name("vllm-cache", create_if_missing=True)
116
 
117
- # Baseline image shared by every text model. Multimodal models extend it via
118
- # ``ModelConfig.extra_pip`` (see ``build_image``).
 
119
  _BASE_ENV = {
120
  "HF_HUB_CACHE": HF_CACHE_PATH,
121
  "HF_XET_HIGH_PERFORMANCE": "1", # faster weight downloads
122
  "VLLM_LOG_STATS_INTERVAL": "1",
123
- # Verbosity of vLLM's own loggers (throughput/cache stats, request logs).
124
- "VLLM_LOGGING_LEVEL": LOG_LEVEL,
125
- # Persist torch.compile + CUDA-graph artifacts on the shared vLLM cache
126
- # Volume (mounted at VLLM_CACHE_PATH). The first container compiles; every
127
- # later cold start replays the cached graphs instead of recompiling, so we
128
- # keep CUDA graphs (throughput) without paying their capture cost each boot.
129
  "VLLM_CACHE_ROOT": VLLM_CACHE_PATH,
130
  }
131
 
@@ -146,28 +101,6 @@ def build_image(cfg: ModelConfig) -> modal.Image:
146
  else:
147
  image = image.uv_pip_install(f"vllm=={cfg.vllm_version or VLLM_VERSION}")
148
  image = image.env(_BASE_ENV)
149
- if JSON_LOGS:
150
- # Ship the stdlib JSON formatter and put it on PYTHONPATH so vLLM can
151
- # import it when it applies the dictConfig. ``serve()`` writes the config
152
- # file and points VLLM_LOGGING_CONFIG_PATH at it. Baking the toggle into
153
- # the image env is what lets the (deploy-time) flag reach the container.
154
- from pathlib import Path
155
-
156
- image = (
157
- image.add_local_file(
158
- Path(__file__).with_name("vllm_logging.py"),
159
- f"{_LOG_MODULE_DIR}/vllm_logging.py",
160
- copy=True,
161
- )
162
- .env({"PYTHONPATH": _LOG_MODULE_DIR})
163
- .env({"MODAL_LLM_JSON_LOGS": "1", "MODAL_LLM_LOG_LEVEL": LOG_LEVEL})
164
- )
165
- if cfg.gpu_snapshot:
166
- # Snapshot prerequisites: VLLM_SERVER_DEV_MODE exposes the /sleep and
167
- # /wake_up endpoints the snapshot lifecycle drives, and single-threaded
168
- # inductor compilation keeps torch.compile artifacts snapshot-safe
169
- # (Modal's documented vLLM + GPU-snapshot recipe).
170
- image = image.env({"VLLM_SERVER_DEV_MODE": "1", "TORCHINDUCTOR_COMPILE_THREADS": "1"})
171
  if cfg.extra_pip:
172
  image = image.uv_pip_install(*cfg.extra_pip)
173
  if cfg.env:
@@ -175,20 +108,6 @@ def build_image(cfg: ModelConfig) -> modal.Image:
175
  return image
176
 
177
 
178
- def _resolve_precision(override: str, model_value: str | None) -> str | None:
179
- """Effective precision flag: a deploy-time *override* wins over *model_value*.
180
-
181
- A disable token in the override (``none``/``off``/``bf16``/…) returns ``None`` so
182
- the caller omits the flag and vLLM keeps full / model-default precision; an empty
183
- override falls back to the per-model value. Reads its inputs as arguments (the
184
- callers pass the module globals) so tests can monkeypatch ``QUANTIZATION`` /
185
- ``KV_CACHE_DTYPE`` and see the change without reimporting.
186
- """
187
- if override:
188
- return None if override.lower() in _PRECISION_DISABLE else override
189
- return model_value
190
-
191
-
192
  def build_command(cfg: ModelConfig) -> list[str]:
193
  """Assemble the ``vllm serve`` argv for a model. Returned as a list so we can
194
  launch with ``subprocess.Popen`` without a shell (no quoting pitfalls)."""
@@ -213,31 +132,6 @@ def build_command(cfg: ModelConfig) -> list[str]:
213
  cmd += ["--max-model-len", str(cfg.max_model_len)]
214
  if cfg.trust_remote_code:
215
  cmd += ["--trust-remote-code"]
216
- # Precision / quantization. A deploy-time env override (QUANTIZATION /
217
- # KV_CACHE_DTYPE) wins over the per-model ModelConfig field; both default to
218
- # full precision (no flag). On-the-fly FP8 needs Ada/Hopper + arch support.
219
- quantization = _resolve_precision(QUANTIZATION, cfg.quantization)
220
- if quantization:
221
- cmd += ["--quantization", quantization]
222
- kv_cache_dtype = _resolve_precision(KV_CACHE_DTYPE, cfg.kv_cache_dtype)
223
- # FP8 KV cache is incompatible with sleep-mode/snapshot models on the pinned
224
- # vLLM: the wake path runs init_fp8_kv_scales() over a post-sleep KV cache that
225
- # is a *list* of per-layer tensors, not one tensor, so cache_tensor.zero_()
226
- # throws and /wake_up 500s (every snapshot restore dies). Snapshot is a
227
- # structural per-model decision; the KV dtype is a deploy knob, so snapshot
228
- # wins. Drop the flag and warn loudly rather than ship an endpoint that boots
229
- # but can never wake. Weight --quantization is unaffected (different code path).
230
- if kv_cache_dtype and cfg.gpu_snapshot and kv_cache_dtype.lower().startswith("fp8"):
231
- print(
232
- f"⚠️ {cfg.endpoint_name}: dropping --kv-cache-dtype {kv_cache_dtype} - "
233
- "FP8 KV cache crashes the snapshot wake path on the pinned vLLM (see ADR-0031). "
234
- "Serving with full-precision KV cache. Drop gpu_snapshot to keep FP8 KV cache.",
235
- flush=True,
236
- )
237
- kv_cache_dtype = None
238
- if kv_cache_dtype:
239
- cmd += ["--kv-cache-dtype", kv_cache_dtype]
240
- # Performance / throughput knobs (all data-driven from ModelConfig).
241
  if cfg.gpu_memory_utilization is not None:
242
  cmd += ["--gpu-memory-utilization", str(cfg.gpu_memory_utilization)]
243
  # Prefix caching reuses the KV cache for shared prompt prefixes. In a
@@ -248,21 +142,10 @@ def build_command(cfg: ModelConfig) -> list[str]:
248
  cmd += ["--async-scheduling"]
249
  if cfg.enforce_eager:
250
  cmd += ["--enforce-eager"]
251
- if cfg.max_num_seqs:
252
- cmd += ["--max-num-seqs", str(cfg.max_num_seqs)]
253
- if cfg.max_num_batched_tokens:
254
- cmd += ["--max-num-batched-tokens", str(cfg.max_num_batched_tokens)]
255
  # Observability: log each incoming request (id, params, token counts) so the
256
- # Modal logs show what's actually being served. Bound the logged prompt length
257
- # by default so a long context can't blow up the log line.
258
  if cfg.log_requests:
259
  cmd += ["--enable-log-requests"]
260
- if cfg.log_outputs:
261
- cmd += ["--enable-log-outputs"]
262
- if cfg.max_log_len is not None:
263
- cmd += ["--max-log-len", str(cfg.max_log_len)]
264
- if not cfg.uvicorn_access_log:
265
- cmd += ["--disable-uvicorn-access-log"]
266
  if cfg.reasoning_parser:
267
  cmd += ["--reasoning-parser", cfg.reasoning_parser]
268
  if cfg.enable_auto_tool_choice:
@@ -271,10 +154,6 @@ def build_command(cfg: ModelConfig) -> list[str]:
271
  cmd += ["--tool-call-parser", cfg.tool_call_parser]
272
  if cfg.mm_limits:
273
  cmd += ["--limit-mm-per-prompt", json.dumps(cfg.mm_limits)]
274
- if cfg.gpu_snapshot:
275
- # Sleep mode lets the snapshot lifecycle offload weights to host RAM
276
- # (sleep level 1) before the memory snapshot is taken, then wake on restore.
277
- cmd += ["--enable-sleep-mode"]
278
  cmd += list(cfg.extra_vllm_args)
279
  return cmd
280
 
@@ -282,16 +161,11 @@ def build_command(cfg: ModelConfig) -> list[str]:
282
  # --- Endpoint registration ------------------------------------------------------
283
 
284
 
285
- def register_model(app: modal.App, cfg: ModelConfig) -> modal.Function | type:
286
  """Attach one model to ``app`` as an autoscaling, OpenAI-compatible endpoint.
287
 
288
- Dispatches on ``cfg.gpu_snapshot``: the default path is a serialized
289
- ``@app.function`` web server; snapshot models use a class-based lifecycle
290
- (load -> warm up -> sleep -> snapshot) so later cold starts restore in seconds
291
- instead of re-paying download + load + warmup. Both paths publish the same
292
- URL shape (``…--<app>-<endpoint_name>.modal.run``), so clients can't tell
293
- them apart.
294
-
295
  Everything is serialized (the prebuilt ``vllm serve`` argv is shipped to the
296
  container), which lets us register many distinctly-named endpoints from a
297
  simple loop without each needing a hand-written module-level function.
@@ -311,23 +185,11 @@ def register_model(app: modal.App, cfg: ModelConfig) -> modal.Function | type:
311
  if KEEP_WARM and cfg.profile:
312
  min_containers = max(min_containers, KEEP_WARM)
313
 
314
- # Autoscale at the target, but let a hot container absorb a burst up to the
315
- # hard max before another cold-starts (Modal high-perf-inference guidance).
316
- # Default the target to ~75% of the ceiling so we scale out before saturating.
317
- target_inputs = cfg.target_concurrent_inputs or max(1, (cfg.max_concurrent_inputs * 3) // 4)
318
-
319
- if cfg.gpu_snapshot:
320
- return _register_snapshot_model(
321
- app,
322
- cfg,
323
- image=image,
324
- cmd=cmd,
325
- secrets=secrets,
326
- min_containers=min_containers,
327
- target_inputs=target_inputs,
328
- )
329
-
330
- function_kwargs = dict(
331
  name=cfg.endpoint_name,
332
  image=image,
333
  gpu=cfg.gpu,
@@ -338,169 +200,18 @@ def register_model(app: modal.App, cfg: ModelConfig) -> modal.Function | type:
338
  timeout=cfg.request_timeout,
339
  serialized=True,
340
  )
341
- # Pre-warm spare containers under load for bursty traffic (opt-in per model).
342
- if cfg.buffer_containers:
343
- function_kwargs["buffer_containers"] = cfg.buffer_containers
344
-
345
- @app.function(**function_kwargs)
346
  @modal.concurrent(max_inputs=cfg.max_concurrent_inputs, target_inputs=target_inputs)
347
  @modal.web_server(port=VLLM_PORT, startup_timeout=cfg.startup_timeout)
348
  def serve():
349
- import os
350
  import subprocess
351
 
352
- env = dict(os.environ)
353
- # When structured logging is on, generate the dictConfig file and point
354
- # vLLM at it. Done at container start (not build) so the level is picked
355
- # up from the env without rebuilding the image.
356
- if env.get("MODAL_LLM_JSON_LOGS", "").lower() in ("1", "true", "yes"):
357
- import vllm_logging
358
-
359
- vllm_logging.write_config(_LOG_CONFIG_PATH, level=env.get("MODAL_LLM_LOG_LEVEL", "INFO"))
360
- env["VLLM_LOGGING_CONFIG_PATH"] = _LOG_CONFIG_PATH
361
-
362
  # vLLM serves the OpenAI REST API on VLLM_PORT; Modal exposes it publicly.
363
- subprocess.Popen(cmd, env=env)
 
364
 
365
  return serve
366
 
367
 
368
- def _class_name(slug: str) -> str:
369
- """Modal class name for an endpoint slug: ``nemotron-3-nano-4b`` -> ``Nemotron3Nano4b``."""
370
- return "".join(part.capitalize() for part in slug.replace("_", "-").split("-") if part) or "SnapshotServer"
371
-
372
-
373
- def _register_snapshot_model(
374
- app: modal.App,
375
- cfg: ModelConfig,
376
- *,
377
- image: modal.Image,
378
- cmd: list[str],
379
- secrets: list[modal.Secret],
380
- min_containers: int,
381
- target_inputs: int,
382
- ) -> type:
383
- """Snapshot serving path: Modal's vLLM + GPU-memory-snapshot recipe.
384
-
385
- First boot: start vLLM, wait for the port, run a few warmup completions so
386
- compiled artifacts and caches are resident, put the engine to sleep (weights
387
- offloaded to host RAM, KV cache dropped), and let Modal snapshot the
388
- container (CPU + GPU state). Every later cold start restores the snapshot
389
- and wakes the engine (seconds instead of minutes). The web URL label is
390
- pinned to ``<app>-<endpoint_name>`` so the public URL is identical to the
391
- plain function path (``…--<app>-<endpoint_name>.modal.run``) the catalogue's
392
- ``endpoint_url`` builds. A ``@modal.web_server`` ``label`` becomes the URL as
393
- ``<workspace>--<label>.modal.run`` *without* the app prefix Modal adds to a
394
- plain function's URL, so the app name must be folded into the label by hand
395
- or snapshot models answer at the wrong host (``…--<endpoint_name>``).
396
- """
397
- served_name = cfg.served_name
398
-
399
- # Helpers are nested (not module-level) on purpose: the class ships to the
400
- # container via cloudpickle (``serialized=True``), and closures are pickled
401
- by value; a module-level helper would be pickled by reference to the
402
- # ``service`` module, which doesn't exist inside the container.
403
- def _headers() -> dict[str, str]:
404
- import os
405
-
406
- key = os.environ.get("VLLM_API_KEY")
407
- return {"Authorization": f"Bearer {key}"} if key else {}
408
-
409
- def _wait_ready(proc) -> None:
410
- # vLLM opens the port only once the engine is initialized, so a
411
- # successful connect means "ready", not just "listening".
412
- import socket
413
- import time
414
-
415
- while True:
416
- try:
417
- socket.create_connection(("localhost", VLLM_PORT), timeout=1).close()
418
- return
419
- except OSError:
420
- if proc.poll() is not None:
421
- raise RuntimeError(f"vllm exited with code {proc.returncode}")
422
- time.sleep(0.2)
423
-
424
- def _post(path: str, json_body: dict | None = None, timeout: float = 300.0) -> None:
425
- import requests # vLLM dependency, always present in the image
426
-
427
- url = f"http://localhost:{VLLM_PORT}{path}"
428
- requests.post(url, headers=_headers(), json=json_body, timeout=timeout).raise_for_status()
429
-
430
- class _SnapshotServer:
431
- @modal.enter(snap=True)
432
- def start(self):
433
- import os
434
- import subprocess
435
-
436
- env = dict(os.environ)
437
- # Same structured-logging hook as the plain path (see ``serve``).
438
- if env.get("MODAL_LLM_JSON_LOGS", "").lower() in ("1", "true", "yes"):
439
- import vllm_logging
440
-
441
- vllm_logging.write_config(_LOG_CONFIG_PATH, level=env.get("MODAL_LLM_LOG_LEVEL", "INFO"))
442
- env["VLLM_LOGGING_CONFIG_PATH"] = _LOG_CONFIG_PATH
443
-
444
- self.vllm_proc = subprocess.Popen(cmd, env=env)
445
- _wait_ready(self.vllm_proc)
446
- # Touch the full serving path so compile/caching work happens *before*
447
- # the snapshot rather than on the first real request after restore.
448
- warmup = {
449
- "model": served_name,
450
- "messages": [{"role": "user", "content": "Who tends the wood?"}],
451
- "max_tokens": 8,
452
- }
453
- for _ in range(3):
454
- _post("/v1/chat/completions", json_body=warmup)
455
- # Offload weights to host RAM (sleep level 1); Modal snapshots the
456
- # container right after the snap=True enters return.
457
- _post("/sleep?level=1", timeout=120.0)
458
-
459
- @modal.enter(snap=False)
460
- def wake(self):
461
- # Runs after every restore (and on the snapshot-creating boot itself,
462
- # which simply resumes serving): reload weights onto the GPU.
463
- _post("/wake_up", timeout=120.0)
464
- _wait_ready(self.vllm_proc)
465
-
466
- @modal.web_server(port=VLLM_PORT, startup_timeout=cfg.startup_timeout, label=f"{app.name}-{cfg.endpoint_name}")
467
- def serve(self):
468
- pass # vLLM (already running) is the web server; Modal just exposes the port.
469
-
470
- @modal.exit()
471
- def stop(self):
472
- proc = getattr(self, "vllm_proc", None)
473
- if proc is not None:
474
- proc.terminate()
475
-
476
- # One Modal class per model, named after the endpoint (App.cls has no name
477
- # override, so rename the type before decorating).
478
- name = _class_name(cfg.endpoint_name)
479
- _SnapshotServer.__name__ = name
480
- _SnapshotServer.__qualname__ = name
481
-
482
- cls_kwargs = dict(
483
- image=image,
484
- gpu=cfg.gpu,
485
- volumes={HF_CACHE_PATH: hf_cache_vol, VLLM_CACHE_PATH: vllm_cache_vol},
486
- secrets=secrets,
487
- scaledown_window=cfg.scaledown_window,
488
- min_containers=min_containers,
489
- timeout=cfg.request_timeout,
490
- # Bounds the whole snap=True phase (download + load + warmup + sleep).
491
- startup_timeout=cfg.startup_timeout,
492
- serialized=True,
493
- enable_memory_snapshot=True,
494
- # GPU snapshots are Modal-alpha; scoped per model via cfg.gpu_snapshot.
495
- experimental_options={"enable_gpu_snapshot": True},
496
- )
497
- if cfg.buffer_containers:
498
- cls_kwargs["buffer_containers"] = cfg.buffer_containers
499
-
500
- concurrent = modal.concurrent(max_inputs=cfg.max_concurrent_inputs, target_inputs=target_inputs)
501
- return app.cls(**cls_kwargs)(concurrent(_SnapshotServer))
502
-
503
-
504
  def register_all(app: modal.App, configs: Iterable[ModelConfig]) -> None:
505
  """Register every model in ``configs`` onto ``app``."""
506
  for cfg in configs:
 
1
  """Reusable, OpenAI-compatible model-serving layer for Modal.
2
 
3
+ This module is provider-agnostic. It takes a single ``ModelConfig`` and turns it
4
+ into a serverless, autoscaling, OpenAI-compatible HTTP endpoint backed by vLLM.
5
+ Each provider app (``app_nvidia.py``, ``app_openbmb.py``, ``app_google.py``)
6
+ imports :func:`register_all` and wires up its own models, so providers stay
7
+ isolated in their own Modal apps while sharing one serving path.
8
+
9
+ This is Modal's canonical vLLM recipe, kept deliberately small: an autoscaling
10
+ ``@app.function`` whose body launches ``vllm serve`` as a subprocess behind a
11
+ ``@modal.web_server``. Everything that shapes a model (GPU, context length,
12
+ parsers, multimodal limits, extra flags) lives in data (the ``ModelConfig``),
13
+ not in code, so adding a model is one entry in ``catalogue.py``.
14
+
15
+ The served endpoints speak the OpenAI REST API (``/v1/chat/completions``,
 
16
  ``/v1/completions``, ``/v1/models``), so any OpenAI-compatible client can call
17
  them by pointing ``base_url`` at the deployed URL.
18
  """
 
32
 
33
  # --- Shared serving constants --------------------------------------------------
34
 
35
+ # Pin the inference stack so deploys are reproducible. Bump deliberately. This is
36
+ # the version Modal's current vLLM example ships with.
37
  VLLM_VERSION = "0.21.0"
38
  CUDA_IMAGE = "nvidia/cuda:12.9.0-devel-ubuntu22.04"
39
+ PYTHON_VERSION = "3.12"
40
 
41
  # The in-container port vLLM listens on; Modal maps it to a public HTTPS URL.
42
  VLLM_PORT = 8000
 
46
  VLLM_CACHE_PATH = "/root/.cache/vllm"
47
 
48
  # Name of the Modal Secret that holds a Hugging Face token (key: HF_TOKEN).
49
+ # Required only for gated repos. Create it once with:
50
  # modal secret create huggingface-secret HF_TOKEN=hf_...
51
  HF_SECRET_NAME = "huggingface-secret"
52
 
53
+ # Name of the Modal Secret holding the bearer token clients must present. The key
54
+ MUST be VLLM_API_KEY: vLLM reads that env var and then enforces
55
  # `Authorization: Bearer <token>` on every request. Create it once with:
56
  # modal secret create llm-api-key VLLM_API_KEY=sk-...
57
  API_KEY_SECRET_NAME = "llm-api-key"
 
60
  # MODAL_LLM_REQUIRE_AUTH=1 modal deploy modal/app_google.py
61
  # When enabled, every endpoint mounts API_KEY_SECRET_NAME and rejects requests
62
  # without a valid bearer token. Off by default (endpoints are then public).
63
+ REQUIRE_API_KEY = os.environ.get("MODAL_LLM_REQUIRE_AUTH", "").lower() in ("1", "true", "yes")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
64
 
65
  # Demo-day switch: keep N containers warm for every *profile-bound* model (the
66
+ # tiers the cast actually runs on), removing their cold starts for the duration
67
+ # of the deploy. Specialists keep scale-to-zero. Costs GPU-hours while deployed;
68
+ # turn it on right before a live demo, redeploy without it after:
69
  # MODAL_LLM_KEEP_WARM=1 modal deploy modal/app_nvidia.py
70
  KEEP_WARM = int(os.environ.get("MODAL_LLM_KEEP_WARM", "0") or "0")
71
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
72
  # Weights and the vLLM compile cache are shared across every provider app, so a
73
  # model pulled once is warm for all subsequent deploys and containers.
74
  hf_cache_vol = modal.Volume.from_name("huggingface-cache", create_if_missing=True)
75
  vllm_cache_vol = modal.Volume.from_name("vllm-cache", create_if_missing=True)
76
 
77
+ # Baseline image env shared by every model. Persisting the torch.compile + CUDA
78
+ # graph cache on the shared vLLM Volume means only the first container compiles;
79
+ # later cold starts replay the cached graphs instead of recapturing them.
80
  _BASE_ENV = {
81
  "HF_HUB_CACHE": HF_CACHE_PATH,
82
  "HF_XET_HIGH_PERFORMANCE": "1", # faster weight downloads
83
  "VLLM_LOG_STATS_INTERVAL": "1",
 
 
 
 
 
 
84
  "VLLM_CACHE_ROOT": VLLM_CACHE_PATH,
85
  }
86
 
 
101
  else:
102
  image = image.uv_pip_install(f"vllm=={cfg.vllm_version or VLLM_VERSION}")
103
  image = image.env(_BASE_ENV)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
104
  if cfg.extra_pip:
105
  image = image.uv_pip_install(*cfg.extra_pip)
106
  if cfg.env:
 
108
  return image
109
 
110
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
111
  def build_command(cfg: ModelConfig) -> list[str]:
112
  """Assemble the ``vllm serve`` argv for a model. Returned as a list so we can
113
  launch with ``subprocess.Popen`` without a shell (no quoting pitfalls)."""
 
132
  cmd += ["--max-model-len", str(cfg.max_model_len)]
133
  if cfg.trust_remote_code:
134
  cmd += ["--trust-remote-code"]
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
135
  if cfg.gpu_memory_utilization is not None:
136
  cmd += ["--gpu-memory-utilization", str(cfg.gpu_memory_utilization)]
137
  # Prefix caching reuses the KV cache for shared prompt prefixes. In a
 
142
  cmd += ["--async-scheduling"]
143
  if cfg.enforce_eager:
144
  cmd += ["--enforce-eager"]
 
 
 
 
145
  # Observability: log each incoming request (id, params, token counts) so the
146
+ # Modal logs show what's actually being served.
 
147
  if cfg.log_requests:
148
  cmd += ["--enable-log-requests"]
 
 
 
 
 
 
149
  if cfg.reasoning_parser:
150
  cmd += ["--reasoning-parser", cfg.reasoning_parser]
151
  if cfg.enable_auto_tool_choice:
 
154
  cmd += ["--tool-call-parser", cfg.tool_call_parser]
155
  if cfg.mm_limits:
156
  cmd += ["--limit-mm-per-prompt", json.dumps(cfg.mm_limits)]
 
 
 
 
157
  cmd += list(cfg.extra_vllm_args)
158
  return cmd
159
 
 
161
  # --- Endpoint registration ------------------------------------------------------
162
 
163
 
164
+ def register_model(app: modal.App, cfg: ModelConfig) -> modal.Function:
165
  """Attach one model to ``app`` as an autoscaling, OpenAI-compatible endpoint.
166
 
167
+ A single serialized ``@app.function`` web server launches ``vllm serve`` as a
168
+ subprocess; Modal exposes its port at ``…--<app>-<endpoint_name>.modal.run``.
 
 
 
 
 
169
  Everything is serialized (the prebuilt ``vllm serve`` argv is shipped to the
170
  container), which lets us register many distinctly-named endpoints from a
171
  simple loop without each needing a hand-written module-level function.
 
185
  if KEEP_WARM and cfg.profile:
186
  min_containers = max(min_containers, KEEP_WARM)
187
 
188
+ # Autoscale at ~75% of the ceiling, but let a hot container absorb a burst up
189
+ # to the hard max before another cold-starts (Modal high-perf guidance).
190
+ target_inputs = max(1, (cfg.max_concurrent_inputs * 3) // 4)
191
+
192
+ @app.function(
 
 
 
 
 
 
 
 
 
 
 
 
193
  name=cfg.endpoint_name,
194
  image=image,
195
  gpu=cfg.gpu,
 
200
  timeout=cfg.request_timeout,
201
  serialized=True,
202
  )
 
 
 
 
 
203
  @modal.concurrent(max_inputs=cfg.max_concurrent_inputs, target_inputs=target_inputs)
204
  @modal.web_server(port=VLLM_PORT, startup_timeout=cfg.startup_timeout)
205
  def serve():
 
206
  import subprocess
207
 
 
 
 
 
 
 
 
 
 
 
208
  # vLLM serves the OpenAI REST API on VLLM_PORT; Modal exposes it publicly.
209
+ # Inherits the container env (HF cache, vLLM cache, any secrets).
210
+ subprocess.Popen(cmd)
211
 
212
  return serve
213
 
214
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
215
  def register_all(app: modal.App, configs: Iterable[ModelConfig]) -> None:
216
  """Register every model in ``configs`` onto ``app``."""
217
  for cfg in configs:
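
To make the "data, not code" idea in the simplified `build_command` and the ~75% autoscaling target concrete, here is a minimal sketch. `MiniConfig` is a hypothetical, trimmed-down stand-in for `ModelConfig`, and the head of the argv (`vllm serve <model> --port ...`) is an assumption, since the diff elides the start of `build_command`; only the flags visible in the diff are emitted.

```python
from __future__ import annotations

import json
from dataclasses import dataclass, field


@dataclass
class MiniConfig:
    """Hypothetical stand-in for ModelConfig with a handful of its fields."""

    model_id: str
    max_model_len: int | None = None
    trust_remote_code: bool = False
    log_requests: bool = True
    mm_limits: dict[str, int] = field(default_factory=dict)
    extra_vllm_args: tuple[str, ...] = ()
    max_concurrent_inputs: int = 32


def sketch_command(cfg: MiniConfig, port: int = 8000) -> list[str]:
    """Build an argv list (no shell) from config fields, one flag per field."""
    # Assumed argv head; the real build_command's opening lines are not shown.
    cmd = ["vllm", "serve", cfg.model_id, "--port", str(port)]
    if cfg.max_model_len is not None:
        cmd += ["--max-model-len", str(cfg.max_model_len)]
    if cfg.trust_remote_code:
        cmd += ["--trust-remote-code"]
    if cfg.log_requests:
        cmd += ["--enable-log-requests"]
    if cfg.mm_limits:
        cmd += ["--limit-mm-per-prompt", json.dumps(cfg.mm_limits)]
    # Escape hatch: anything not modelled as a field is appended verbatim.
    cmd += list(cfg.extra_vllm_args)
    return cmd


def target_inputs(max_concurrent_inputs: int) -> int:
    """Autoscaling target at ~75% of the hard ceiling, never below 1."""
    return max(1, (max_concurrent_inputs * 3) // 4)
```

For example, a ceiling of 32 concurrent inputs gives a target of 24, so a new container is requested before the hot one saturates, while a ceiling of 1 still yields a target of 1 because of the `max(1, ...)` floor.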
modal/vllm_logging.py DELETED
@@ -1,118 +0,0 @@
1
- """Structured (JSON) logging for the vLLM subprocess (stdlib only).
2
-
3
- vLLM applies a standard :func:`logging.config.dictConfig` when the
4
- ``VLLM_LOGGING_CONFIG_PATH`` env var points at a JSON file (see vLLM's
5
- ``envs.py``). This module builds that config and ships the :class:`JsonFormatter`
6
- it references, so one importable module serves both sides:
7
-
8
- * :func:`write_config`, called by ``service.serve()`` to drop the JSON config
9
- file into the container before launching ``vllm serve``; and
10
- * :class:`JsonFormatter`, imported *by name* from the JSON config when vLLM
11
- runs ``dictConfig`` in its own process.
12
-
13
- For the second to work, this file is added to the container image and its
14
- directory is placed on ``PYTHONPATH`` (see ``service.build_image``). Keeping it
15
- **dependency-free** (no ``python-json-logger`` etc.) means there is no extra
16
- wheel to install and no import path that can drift between versions: vLLM only
17
- needs the stdlib plus this one file.
18
-
19
- One JSON object is emitted per log line: ``ts``, ``level``, ``logger``, ``msg``,
20
- the source ``module:lineno``, and any structured extras attached to the record
21
- (vLLM threads request ids and token counts through these). Output stays on
22
- stdout so Modal captures it like every other container log.
23
- """
24
-
25
- from __future__ import annotations
26
-
27
- import json
28
- import logging
29
-
30
- # Standard LogRecord attributes: everything here is either folded into a fixed
31
- # JSON key below or deliberately dropped. Anything *else* on the record is a
32
- # caller-supplied extra (e.g. a request id) and is included verbatim.
33
- _RESERVED: frozenset[str] = frozenset(
34
- {
35
- "args",
36
- "asctime",
37
- "created",
38
- "exc_info",
39
- "exc_text",
40
- "filename",
41
- "funcName",
42
- "levelname",
43
- "levelno",
44
- "lineno",
45
- "module",
46
- "msecs",
47
- "message",
48
- "msg",
49
- "name",
50
- "pathname",
51
- "process",
52
- "processName",
53
- "relativeCreated",
54
- "stack_info",
55
- "taskName",
56
- "thread",
57
- "threadName",
58
- }
59
- )
60
-
61
-
62
- class JsonFormatter(logging.Formatter):
63
- """Render each log record as a single compact JSON line.
64
-
65
- Referenced from the dictConfig by dotted path (``vllm_logging.JsonFormatter``),
66
- so it must stay importable under that name in the container.
67
- """
68
-
69
- def format(self, record: logging.LogRecord) -> str:
70
- data: dict[str, object] = {
71
- "ts": self.formatTime(record, self.datefmt),
72
- "level": record.levelname,
73
- "logger": record.name,
74
- "msg": record.getMessage(),
75
- "src": f"{record.module}:{record.lineno}",
76
- }
77
- if record.exc_info:
78
- data["exc"] = self.formatException(record.exc_info)
79
- # Fold in any structured extras (request_id, token counts, ...). Values
80
- # that aren't JSON-serialisable fall back to repr so a stray object can
81
- # never crash the logging path.
82
- for key, value in record.__dict__.items():
83
- if key in _RESERVED or key.startswith("_"):
84
- continue
85
- try:
86
- json.dumps(value)
87
- except (TypeError, ValueError):
88
- value = repr(value)
89
- data[key] = value
90
- return json.dumps(data, ensure_ascii=False, default=repr)
-
-
- def build_config(level: str = "INFO") -> dict:
-     """Return a ``logging.config.dictConfig`` that routes vLLM + uvicorn through
-     :class:`JsonFormatter` on stdout at ``level``."""
-     level = (level or "INFO").upper()
-     handler = {
-         "class": "logging.StreamHandler",
-         "formatter": "json",
-         "stream": "ext://sys.stdout",
-     }
-     logger = {"handlers": ["stdout"], "level": level, "propagate": False}
-     return {
-         "version": 1,
-         # Keep vLLM's own loggers; we only swap their formatting/handler.
-         "disable_existing_loggers": False,
-         "formatters": {"json": {"()": "vllm_logging.JsonFormatter"}},
-         "handlers": {"stdout": handler},
-         "loggers": {name: dict(logger) for name in ("vllm", "uvicorn", "uvicorn.access", "uvicorn.error")},
-         "root": {"handlers": ["stdout"], "level": level},
-     }
-
-
- def write_config(path: str, level: str = "INFO") -> str:
-     """Write the dictConfig JSON to ``path`` (for ``VLLM_LOGGING_CONFIG_PATH``)."""
-     with open(path, "w", encoding="utf-8") as fh:
-         json.dump(build_config(level), fh)
-     return path
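
The removed module relied on the `"()"` factory key of `logging.config.dictConfig` to plug in a custom formatter. As a minimal sketch of that mechanism only (not the project's code): `TinyJsonFormatter`, `configure`, and the `demo` logger name are hypothetical, and the formatter is passed as a class object rather than a dotted path, which works in-process but could not be written to a JSON file the way the removed `write_config` did.

```python
import io
import json
import logging
import logging.config


class TinyJsonFormatter(logging.Formatter):
    """Hypothetical stand-in for the removed JsonFormatter: one JSON object per record."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps(
            {"level": record.levelname, "logger": record.name, "msg": record.getMessage()}
        )


def configure(stream) -> None:
    """Apply a dictConfig whose formatter is built through the "()" factory key."""
    logging.config.dictConfig(
        {
            "version": 1,
            # Leave loggers created before this call enabled.
            "disable_existing_loggers": False,
            # "()" names the callable that dictConfig invokes to build the formatter.
            "formatters": {"json": {"()": TinyJsonFormatter}},
            "handlers": {
                "out": {
                    "class": "logging.StreamHandler",
                    "formatter": "json",
                    "stream": stream,
                }
            },
            "loggers": {"demo": {"handlers": ["out"], "level": "INFO", "propagate": False}},
        }
    )


if __name__ == "__main__":
    buffer = io.StringIO()
    configure(buffer)
    logging.getLogger("demo").info("hello %s", "world")
    print(buffer.getvalue(), end="")
```

When the config has to cross a process boundary as JSON, the factory must instead be a dotted import path string, which is why the removed file had to be importable inside the container.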
src/observability/logging_setup.py CHANGED
@@ -1,7 +1,6 @@
  """Root logging configuration - structured records to stdout and to the store.

- Generalises the dependency-free JSON formatter from ``modal/vllm_logging.py`` for
- the whole engine, and adds:
+ Provides a dependency-free JSON formatter for the whole engine, and adds:

  * a :class:`_ContextFilter` that stamps every record with the bound
    run/turn/agent (see :mod:`src.observability.context`); and
tests/test_modal_build_command.py CHANGED
@@ -1,11 +1,9 @@
- """Guard the precision flags ``build_command`` emits into the vLLM argv.
+ """Guard the ``vllm serve`` argv that ``build_command`` emits.

- Quantization is purely serving-side: it only adds ``--quantization`` /
- ``--kv-cache-dtype`` to the ``vllm serve`` argv (the ``--served-model-name`` is
- unchanged, so the engine never notices). Two controls feed those flags - a
- per-model ``ModelConfig`` field and a deploy-time env override that wins over it
- - and these tests pin both, plus the force-disable token, since this is the first
- test to assert on ``build_command``'s output at all.
+ The serving layer turns one ``ModelConfig`` into the argv launched inside the
+ container, so these tests pin the mapping from config fields to vLLM flags: the
+ always-present identity flags, the data-driven toggles (parsers, eager, prefix
+ caching), and the ``extra_vllm_args`` escape hatch.

  ``modal/service.py`` does ``import modal`` and ``from catalogue import …``, so we
  load it exactly the way ``modal deploy`` does: with ``modal/`` on ``sys.path`` (the
@@ -16,6 +14,7 @@ binds the installed SDK, not the folder).
  from __future__ import annotations

  import importlib
+ import json
  import sys
  from pathlib import Path

@@ -37,109 +36,105 @@ def _make(service, **kwargs):
      return service.ModelConfig(name="acme/Tiny-1B", endpoint_name="tiny-1b", **kwargs)


- # ── per-model field ───────────────────────────────────────────────────────────
+ def _flag_value(cmd: list[str], flag: str) -> str:
+     """The argument that follows ``flag`` in the argv."""
+     return cmd[cmd.index(flag) + 1]


- def test_no_quantization_by_default(service):
-     cmd = service.build_command(_make(service))
-     assert "--quantization" not in cmd
-     assert "--kv-cache-dtype" not in cmd
-
-
- def test_per_model_quantization_emits_flag(service):
-     cmd = service.build_command(_make(service, quantization="fp8"))
-     assert cmd[cmd.index("--quantization") + 1] == "fp8"
-
+ # ── always-present identity flags ─────────────────────────────────────────────

- def test_per_model_kv_cache_dtype_emits_flag(service):
-     cmd = service.build_command(_make(service, kv_cache_dtype="fp8"))
-     assert cmd[cmd.index("--kv-cache-dtype") + 1] == "fp8"

-
- # ── deploy-time env override ──────────────────────────────────────────────────
+ def test_serves_the_model_with_identity_flags(service):
+     cmd = service.build_command(_make(service))
+     assert cmd[:3] == ["vllm", "serve", "acme/Tiny-1B"]
+     # served-model-name defaults to the repo name (clients pass the repo id).
+     assert _flag_value(cmd, "--served-model-name") == "acme/Tiny-1B"
+     assert _flag_value(cmd, "--port") == str(service.VLLM_PORT)
+     assert _flag_value(cmd, "--tensor-parallel-size") == "1"


- def test_env_override_beats_unset_model_field(service, monkeypatch):
-     monkeypatch.setattr(service, "QUANTIZATION", "fp8")
-     cmd = service.build_command(_make(service))  # model field is None
-     assert cmd[cmd.index("--quantization") + 1] == "fp8"
+ def test_served_model_name_alias(service):
+     cmd = service.build_command(_make(service, served_model_name="acme/Tiny"))
+     assert _flag_value(cmd, "--served-model-name") == "acme/Tiny"
+     # but vLLM still loads the real repo (positional arg)
+     assert cmd[2] == "acme/Tiny-1B"


- def test_env_override_beats_model_field(service, monkeypatch):
-     monkeypatch.setattr(service, "QUANTIZATION", "awq")
-     cmd = service.build_command(_make(service, quantization="fp8"))
-     assert cmd[cmd.index("--quantization") + 1] == "awq"
+ # ── data-driven toggles ───────────────────────────────────────────────────────


- @pytest.mark.parametrize("token", ["none", "off", "bf16", "AUTO"])
- def test_disable_token_forces_full_precision(service, monkeypatch, token):
-     # A model that defaults to fp8 is overridden back to no flag at deploy time.
-     monkeypatch.setattr(service, "QUANTIZATION", token)
-     cmd = service.build_command(_make(service, quantization="fp8"))
-     assert "--quantization" not in cmd
+ def test_prefix_caching_on_by_default_off_when_disabled(service):
+     assert "--enable-prefix-caching" in service.build_command(_make(service))
+     off = service.build_command(_make(service, enable_prefix_caching=False))
+     assert "--no-enable-prefix-caching" in off
+     assert "--enable-prefix-caching" not in off


- def test_kv_cache_env_override(service, monkeypatch):
-     monkeypatch.setattr(service, "KV_CACHE_DTYPE", "fp8")
-     cmd = service.build_command(_make(service))
-     assert cmd[cmd.index("--kv-cache-dtype") + 1] == "fp8"
+ def test_optional_inference_flags_emitted(service):
+     cmd = service.build_command(
+         _make(
+             service,
+             max_model_len=8192,
+             trust_remote_code=True,
+             enforce_eager=True,
+             gpu_memory_utilization=0.9,
+         )
+     )
+     assert _flag_value(cmd, "--max-model-len") == "8192"
+     assert "--trust-remote-code" in cmd
+     assert "--enforce-eager" in cmd
+     assert _flag_value(cmd, "--gpu-memory-utilization") == "0.9"


- # ── FP8 KV cache × snapshot incompatibility (vLLM wake-path crash) ────────────
+ def test_async_scheduling_default_on_off_when_disabled(service):
+     assert "--async-scheduling" in service.build_command(_make(service))
+     assert "--async-scheduling" not in service.build_command(_make(service, async_scheduling=False))


- def test_fp8_kv_cache_dropped_for_snapshot_models(service):
-     # FP8 KV cache crashes the /wake_up path on snapshot models, so the flag is
-     # suppressed when gpu_snapshot is set - the endpoint serves with full-precision
-     # KV cache rather than booting into a state it can never wake from.
-     cmd = service.build_command(_make(service, kv_cache_dtype="fp8", gpu_snapshot=True))
-     assert "--kv-cache-dtype" not in cmd
-     # The snapshot flag itself still wins and is emitted.
-     assert "--enable-sleep-mode" in cmd
+ def test_parser_flags(service):
+     cmd = service.build_command(
+         _make(service, reasoning_parser="qwen3", tool_call_parser="hermes", enable_auto_tool_choice=True)
+     )
+     assert _flag_value(cmd, "--reasoning-parser") == "qwen3"
+     assert _flag_value(cmd, "--tool-call-parser") == "hermes"
+     assert "--enable-auto-tool-choice" in cmd
+     # None parsers emit nothing.
+     bare = service.build_command(_make(service))
+     assert "--reasoning-parser" not in bare
+     assert "--tool-call-parser" not in bare


- def test_fp8_kv_cache_env_override_dropped_for_snapshot_models(service, monkeypatch):
-     # The global deploy override is the common trigger: it lands on every model in
-     # the app, including snapshot ones, which must still drop it.
-     monkeypatch.setattr(service, "KV_CACHE_DTYPE", "fp8")
-     cmd = service.build_command(_make(service, gpu_snapshot=True))
-     assert "--kv-cache-dtype" not in cmd
+ def test_mm_limits_serialized_as_json(service):
+     cmd = service.build_command(_make(service, mm_limits={"image": 0, "audio": 0}))
+     assert json.loads(_flag_value(cmd, "--limit-mm-per-prompt")) == {"image": 0, "audio": 0}


- def test_fp8_variant_kv_cache_dropped_for_snapshot_models(service):
-     # Every fp8 variant hits init_fp8_kv_scales, so fp8_e5m2 is dropped too.
-     cmd = service.build_command(_make(service, kv_cache_dtype="fp8_e5m2", gpu_snapshot=True))
-     assert "--kv-cache-dtype" not in cmd
+ def test_log_requests_default_on(service):
+     assert "--enable-log-requests" in service.build_command(_make(service))
+     assert "--enable-log-requests" not in service.build_command(_make(service, log_requests=False))


- def test_non_fp8_kv_cache_kept_for_snapshot_models(service):
-     # The guard only fires on fp8; a non-fp8 dtype passes through even with snapshot.
-     cmd = service.build_command(_make(service, kv_cache_dtype="auto", gpu_snapshot=True))
-     assert cmd[cmd.index("--kv-cache-dtype") + 1] == "auto"
+ # ── escape hatch ──────────────────────────────────────────────────────────────


- def test_fp8_kv_cache_kept_for_non_snapshot_models(service):
-     # Without snapshot there's no wake path, so FP8 KV cache stays.
-     cmd = service.build_command(_make(service, kv_cache_dtype="fp8", gpu_snapshot=False))
-     assert cmd[cmd.index("--kv-cache-dtype") + 1] == "fp8"
+ def test_extra_vllm_args_appended_verbatim(service):
+     cmd = service.build_command(_make(service, extra_vllm_args=("--quantization", "fp8")))
+     assert cmd[-2:] == ["--quantization", "fp8"]


  # ── deploy script wiring ──────────────────────────────────────────────────────


- def test_deploy_script_propagates_quantization_env():
+ def test_deploy_script_propagates_knob_envs():
      sys.path.insert(0, str(Path(__file__).resolve().parents[1] / "scripts"))
      deploy_modal = importlib.import_module("deploy_modal")
      from argparse import Namespace

-     base = dict(keep_warm=False, auth=False, json_logs=False, log_level="", kv_cache_dtype=None)
-     env_fp8 = deploy_modal._env_for(Namespace(quantization="fp8", **base))
-     assert env_fp8["MODAL_LLM_QUANTIZATION"] == "fp8"
-
-     # ``--quantization none`` (force full precision) is still propagated, not dropped.
-     env_none = deploy_modal._env_for(Namespace(quantization="none", **base))
-     assert env_none["MODAL_LLM_QUANTIZATION"] == "none"
+     env = deploy_modal._env_for(Namespace(keep_warm=True, auth=True))
+     assert env["MODAL_LLM_KEEP_WARM"] == "1"
+     assert env["MODAL_LLM_REQUIRE_AUTH"] == "1"

-     # Unset → the env var is left alone (so a model's own default stands).
-     env_unset = deploy_modal._env_for(Namespace(quantization=None, **base))
-     assert "MODAL_LLM_QUANTIZATION" not in env_unset
+     # Both off → neither env var is set (so endpoints stay public + scale-to-zero).
+     env_off = deploy_modal._env_for(Namespace(keep_warm=False, auth=False))
+     assert "MODAL_LLM_KEEP_WARM" not in env_off
+     assert "MODAL_LLM_REQUIRE_AUTH" not in env_off
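
The assertions in the rewritten tests imply a fairly direct mapping from config fields to flags. A minimal sketch of a function that would satisfy them is given here; it is not the repository's `build_command`. The `ModelConfig` below is a hypothetical subset whose field names come from the keyword arguments the tests pass, while `tensor_parallel_size` and the value of `VLLM_PORT` are assumptions, since the tests only show the resulting flags.

```python
from __future__ import annotations

import json
from dataclasses import dataclass

VLLM_PORT = 8000  # assumed value; the tests only compare against service.VLLM_PORT


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical subset of the real config: only fields the tests pass."""

    name: str
    endpoint_name: str
    served_model_name: str | None = None
    tensor_parallel_size: int = 1  # assumed field name
    max_model_len: int | None = None
    gpu_memory_utilization: float | None = None
    trust_remote_code: bool = False
    enforce_eager: bool = False
    enable_prefix_caching: bool = True
    async_scheduling: bool = True
    log_requests: bool = True
    reasoning_parser: str | None = None
    tool_call_parser: str | None = None
    enable_auto_tool_choice: bool = False
    mm_limits: dict[str, int] | None = None
    extra_vllm_args: tuple[str, ...] = ()


def build_command(cfg: ModelConfig) -> list[str]:
    """Map one config onto a ``vllm serve`` argv (sketch, not the real function)."""
    cmd = [
        "vllm", "serve", cfg.name,
        "--served-model-name", cfg.served_model_name or cfg.name,
        "--port", str(VLLM_PORT),
        "--tensor-parallel-size", str(cfg.tensor_parallel_size),
        "--enable-prefix-caching" if cfg.enable_prefix_caching else "--no-enable-prefix-caching",
    ]
    # Valued flags are emitted only when the field is set.
    valued = {
        "--max-model-len": cfg.max_model_len,
        "--gpu-memory-utilization": cfg.gpu_memory_utilization,
        "--reasoning-parser": cfg.reasoning_parser,
        "--tool-call-parser": cfg.tool_call_parser,
    }
    for flag, value in valued.items():
        if value is not None:
            cmd += [flag, str(value)]
    # Boolean toggles become bare flags when switched on.
    toggles = {
        "--trust-remote-code": cfg.trust_remote_code,
        "--enforce-eager": cfg.enforce_eager,
        "--async-scheduling": cfg.async_scheduling,
        "--enable-log-requests": cfg.log_requests,
        "--enable-auto-tool-choice": cfg.enable_auto_tool_choice,
    }
    for flag, enabled in toggles.items():
        if enabled:
            cmd.append(flag)
    if cfg.mm_limits is not None:
        cmd += ["--limit-mm-per-prompt", json.dumps(cfg.mm_limits)]
    # Escape hatch: appended verbatim and last, so it can add flags such as quantization.
    cmd += list(cfg.extra_vllm_args)
    return cmd
```

Keeping the escape hatch last is what lets `test_extra_vllm_args_appended_verbatim` assert on `cmd[-2:]`, and it is how a flag such as `--quantization` can still be passed now that the dedicated precision handling has been removed.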
tests/test_modal_endpoint_urls.py CHANGED
@@ -16,8 +16,6 @@ import importlib.util
  import sys
  from pathlib import Path

- import pytest
-
  _CATALOGUE_PATH = Path(__file__).resolve().parents[1] / "modal" / "catalogue.py"

  # Max length of a single DNS label (RFC 1035). The whole subdomain before