Commit 5d4ef87 by agharsallah (1 parent: e3ba862)
Refactor modal service and logging setup
- Updated the service module to streamline model registration and improve clarity in the handling of model configurations.
- Removed the vllm_logging module, integrating its functionality into the main logging setup for better maintainability and consistency.
- Simplified the build_command function by removing unnecessary precision handling and logging configurations.
- Enhanced test coverage for the build_command function, ensuring proper flag emissions and configurations.
- Cleaned up deprecated snapshot model handling and adjusted related tests for clarity and accuracy.
- Improved documentation throughout the service module to better reflect current functionality and design goals.
- docs/adr/0030-gpu-memory-snapshots-cold-start.md +6 -1
- docs/adr/0031-fp8-quantization-control.md +6 -2
- docs/adr/0034-simplify-modal-serving-to-canonical-vllm.md +95 -0
- modal/README.md +7 -7
- modal/catalogue.py +58 -101
- modal/docs/deploying.md +64 -129
- modal/healthcheck.py +51 -51
- modal/service.py +37 -326
- modal/vllm_logging.py +0 -118
- src/observability/logging_setup.py +1 -2
- tests/test_modal_build_command.py +74 -79
- tests/test_modal_endpoint_urls.py +0 -2
docs/adr/0030-gpu-memory-snapshots-cold-start.md
CHANGED
|
@@ -2,7 +2,12 @@
|
|
| 2 |
|
| 3 |
## Status
|
| 4 |
|
| 5 |
-
|
| 6 |
[ADR-0019](0019-single-model-catalogue-no-cloud-path.md))
|
| 7 |
|
| 8 |
## Context
|
|
|
|
| 2 |
|
| 3 |
## Status
|
| 4 |
|
| 5 |
+
**Superseded by [ADR-0034 *Simplify the Modal serving layer*](0034-simplify-modal-serving-to-canonical-vllm.md)**
|
| 6 |
+
because the snapshot lifecycle was removed for being alpha and error-prone; cold starts
|
| 7 |
+
now rely on the shared compile/weight caches plus the retained `MODAL_LLM_KEEP_WARM`
|
| 8 |
+
demo switch. The historical context below stands.
|
| 9 |
+
|
| 10 |
+
Originally Accepted (extended [ADR-0014 *Modal model serving*](0014-modal-model-serving.md),
|
| 11 |
[ADR-0019](0019-single-model-catalogue-no-cloud-path.md))
|
| 12 |
|
| 13 |
## Context
|
docs/adr/0031-fp8-quantization-control.md
CHANGED
|
@@ -2,8 +2,12 @@
|
|
| 2 |
|
| 3 |
## Status
|
| 4 |
|
| 5 |
-
|
| 6 |
-
|
| 7 |
[ADR-0030 *GPU memory snapshots*](0030-gpu-memory-snapshots-cold-start.md))
|
| 8 |
|
| 9 |
## Context
|
|
|
|
| 2 |
|
| 3 |
## Status
|
| 4 |
|
| 5 |
+
**Superseded by [ADR-0034 *Simplify the Modal serving layer*](0034-simplify-modal-serving-to-canonical-vllm.md)**
|
| 6 |
+
because the env-controlled quantization machinery was removed; lower precision is now
|
| 7 |
+
reached via a model's `extra_vllm_args`. The historical context below stands.
|
| 8 |
+
|
| 9 |
+
Originally Accepted (extended [ADR-0014 *Modal model serving*](0014-modal-model-serving.md),
|
| 10 |
+
[ADR-0019](0019-single-model-catalogue-no-cloud-path.md); interacted with
|
| 11 |
[ADR-0030 *GPU memory snapshots*](0030-gpu-memory-snapshots-cold-start.md))
|
| 12 |
|
| 13 |
## Context
|
docs/adr/0034-simplify-modal-serving-to-canonical-vllm.md
ADDED
|
@@ -0,0 +1,95 @@
|
|
| 1 |
+
# ADR-0034: Simplify the Modal serving layer to the canonical vLLM recipe
|
| 2 |
+
|
| 3 |
+
## Status
|
| 4 |
+
|
| 5 |
+
Accepted. **Supersedes [ADR-0030 *GPU memory snapshots*](0030-gpu-memory-snapshots-cold-start.md)
|
| 6 |
+
and [ADR-0031 *FP8 quantization control*](0031-fp8-quantization-control.md).**
|
| 7 |
+
Extends [ADR-0014 *Modal model serving*](0014-modal-model-serving.md) and
|
| 8 |
+
[ADR-0019](0019-single-model-catalogue-no-cloud-path.md).
|
| 9 |
+
|
| 10 |
+
## Context
|
| 11 |
+
|
| 12 |
+
`modal/service.py` had grown to ~500 lines by accreting three optional
|
| 13 |
+
subsystems on top of the plain vLLM web-server path from ADR-0014:
|
| 14 |
+
|
| 15 |
+
- **GPU memory snapshots** (ADR-0030): a class-based sleep→snapshot→wake
|
| 16 |
+
lifecycle, a second registration shape, and `enable_gpu_snapshot` (Modal
|
| 17 |
+
*alpha*).
|
| 18 |
+
- **FP8 / quantization control** (ADR-0031): a deploy-time env-override resolver
|
| 19 |
+
plus a workaround for FP8-KV-cache crashing the snapshot wake path.
|
| 20 |
+
- **Structured JSON logging**: a `vllm_logging.py` formatter shipped into the
|
| 21 |
+
image and wired through a generated `dictConfig`.
|
| 22 |
+
|
| 23 |
+
In practice this surface was the source of the errors, not a benefit:
|
| 24 |
+
|
| 25 |
+
- The snapshot lifecycle is alpha and fragile: the documented FP8×snapshot
|
| 26 |
+
wake-path crash (ADR-0031) is one instance; the hand-folded URL label and
|
| 27 |
+
cloudpickled-closure constraints are others. Hard to deploy, hard to debug.
|
| 28 |
+
- The FP8 machinery defaulted to `None` on **every** model: pure surface area
|
| 29 |
+
with no model actually using it.
|
| 30 |
+
- JSON logging defaulted **off**: more surface area, off by default.
|
| 31 |
+
- Per-model configs had drifted from the models' real serving requirements
|
| 32 |
+
(e.g. the Gemma 4 26B was pinned to a nightly vLLM it doesn't need).
|
| 33 |
+
|
| 34 |
+
The working core is small and is exactly Modal's current canonical vLLM example:
|
| 35 |
+
an autoscaling `@app.function` + `@modal.concurrent` + `@modal.web_server` whose
|
| 36 |
+
body runs `subprocess.Popen(["vllm", "serve", ...])`.
|
| 37 |
+
|
| 38 |
+
## Decision
|
| 39 |
+
|
| 40 |
+
**1. One serving path.** `register_model()` only registers the plain
|
| 41 |
+
`@app.function` web server. The snapshot class lifecycle
|
| 42 |
+
(`_register_snapshot_model`, `_class_name`, sleep/wake, `enable_gpu_snapshot`) is
|
| 43 |
+
deleted. `service.py` drops from ~500 to ~210 lines.
|
| 44 |
+
|
| 45 |
+
**2. Quantization moves to the escape hatch.** The `MODAL_LLM_QUANTIZATION` /
|
| 46 |
+
`MODAL_LLM_KV_CACHE_DTYPE` env resolver, the `quantization` / `kv_cache_dtype`
|
| 47 |
+
`ModelConfig` fields, and the FP8×snapshot workaround are removed. A model that
|
| 48 |
+
wants lower precision passes the flags through the existing `extra_vllm_args`
|
| 49 |
+
(`("--quantization", "fp8")`). Quantization was always opt-in and never on; this
|
| 50 |
+
keeps it possible without standing machinery.
|
| 51 |
+
|
| 52 |
+
**3. JSON logging is removed.** `vllm_logging.py` is deleted along with the
|
| 53 |
+
`MODAL_LLM_JSON_LOGS` / `MODAL_LLM_LOG_LEVEL` wiring. Modal captures
|
| 54 |
+
stdout/stderr; `--enable-log-requests` (kept, via `log_requests`) gives
|
| 55 |
+
per-request detail.
|
| 56 |
+
|
| 57 |
+
**4. `ModelConfig` is trimmed** to the fields the one path actually reads.
|
| 58 |
+
Removed: `gpu_snapshot`, `quantization`, `kv_cache_dtype`, `max_num_seqs`,
|
| 59 |
+
`max_num_batched_tokens`, `target_concurrent_inputs`, `buffer_containers`,
|
| 60 |
+
`log_outputs`, `max_log_len`, `uvicorn_access_log`, `multimodal`. The autoscale
|
| 61 |
+
target is computed inline (~75% of `max_concurrent_inputs`); anything exotic uses
|
| 62 |
+
`extra_vllm_args`.
|
| 63 |
+
|
| 64 |
+
**5. Per-model configs re-grounded in each model's documentation** (verified
|
| 65 |
+
against the HF model cards + vLLM recipes, June 2026):
|
| 66 |
+
|
| 67 |
+
| Model | Correction |
|
| 68 |
+
| --- | --- |
|
| 69 |
+
| Gemma 4 **26B-A4B** | Standard `gemma4` MoE; serves on the **pinned stable vLLM**. Dropped the nightly pin, `transformers>=5.10.2`, the unverified `VLLM_USE_FLASHINFER_SAMPLER=0`, and `enforce_eager` (native path → CUDA graphs work). |
|
| 70 |
+
| Gemma 4 **12B** | `gemma4_unified` (encoder-free) has no class in any stable vLLM ≤0.22.1, so it **keeps** `vllm_version="nightly"` + `transformers>=5.10.2`; dropped the unverified flashinfer env. |
|
| 71 |
+
| Nemotron Nano **4B / 30B** | Hybrid-Mamba; `trust_remote_code` kept. Served as plain chat: NVIDIA's `nano_v3` reasoning parser ships as a downloadable *plugin file* and is omitted for boot-robustness (addable via `extra_vllm_args` later). 30B params corrected 30→31. |
|
| 72 |
+
| Nemotron **Cascade-14B** | Confirmed stock Qwen3: `reasoning_parser="qwen3"` + `tool_call_parser="hermes"` are correct and built-in; kept. |
|
| 73 |
+
| MiniCPM **4.1-8B** | `trust_remote_code` kept; no tool parser (custom `<|tool_call_start|>` format; the engine uses guided decoding per ADR-0016). Serves on the pinned stable. |
|
| 74 |
+
| MiniCPM **-o 4.5** | Params corrected 8→9B; served text+image (audio over vLLM is experimental: the documented `transformers==4.51.0` pin conflicts with vLLM's bundled version, so we keep the lean preprocessing deps). |
|
| 75 |
+
|
| 76 |
+
## Consequences
|
| 77 |
+
|
| 78 |
+
- **Far smaller blast radius.** One registration shape, no alpha features, no
|
| 79 |
+
generated log config, no precision resolver. The thing that errored is gone.
|
| 80 |
+
- **Cold starts** now rely on the always-on shared caches (weights + compiled
|
| 81 |
+
graphs on Volumes) and the retained `MODAL_LLM_KEEP_WARM` demo-day switch
|
| 82 |
+
(mechanism 2 of ADR-0030, the robust half). We trade snapshot's seconds-from-
|
| 83 |
+
cold for simplicity; keep-warm covers the live-demo first-30-seconds bar.
|
| 84 |
+
- **Quantization / batch caps** are still reachable via `extra_vllm_args`, just
|
| 85 |
+
not first-class fields. If a model later needs standing FP8, re-promote a typed
|
| 86 |
+
field then, but not speculatively.
|
| 87 |
+
- **Gemma 4 26B is cheaper and more robust** off the nightly: it's a tier
|
| 88 |
+
default (`strong`), so removing its nightly dependency removes a recurring
|
| 89 |
+
break. Only the 12B remains on nightly, where it's unavoidable.
|
| 90 |
+
- **Prize impact unchanged.** All seven models and all four provider tracks
|
| 91 |
+
(OpenAI-compatible, MiniCPM, Nemotron, Gemma) still deploy; the no-API-key
|
| 92 |
+
deterministic stub is untouched. The serving path stays demo-ready for the
|
| 93 |
+
Modal Awards, now without the alpha-feature risk on stage.
|
| 94 |
+
- **Tests** for the removed precision/snapshot behaviour are replaced by tests
|
| 95 |
+
that pin the simplified `build_command` argv. Full suite stays green.
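
To make decisions 1 to 4 concrete, here is a minimal sketch of what a simplified `build_command` could look like. It is not the code from `modal/service.py`, which this page does not show: the trimmed `ModelConfig` below is a hypothetical stand-in, the field-to-flag mapping is an assumption based on the field comments in the diff, and only commonly documented `vllm serve` flags are used. The point it illustrates is the escape hatch: quantization is appended verbatim from `extra_vllm_args` instead of being resolved from dedicated fields or env vars.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical, trimmed stand-in for the catalogue's ModelConfig."""

    name: str
    max_model_len: int | None = None
    trust_remote_code: bool = False
    enforce_eager: bool = False
    log_requests: bool = True
    reasoning_parser: str | None = None
    tool_call_parser: str | None = None
    enable_auto_tool_choice: bool = False
    extra_vllm_args: tuple[str, ...] = ()


def build_command(cfg: ModelConfig, port: int = 8000) -> list[str]:
    """Assemble a `vllm serve` argv for one model (illustrative only)."""
    cmd = ["vllm", "serve", cfg.name, "--host", "0.0.0.0", "--port", str(port)]
    if cfg.max_model_len is not None:
        cmd += ["--max-model-len", str(cfg.max_model_len)]
    if cfg.trust_remote_code:
        cmd.append("--trust-remote-code")
    if cfg.enforce_eager:
        cmd.append("--enforce-eager")
    if cfg.log_requests:
        cmd.append("--enable-log-requests")
    if cfg.reasoning_parser:
        cmd += ["--reasoning-parser", cfg.reasoning_parser]
    if cfg.tool_call_parser:
        cmd += ["--tool-call-parser", cfg.tool_call_parser]
    if cfg.enable_auto_tool_choice:
        cmd.append("--enable-auto-tool-choice")
    # Escape hatch: anything exotic (quantization, batch caps) is passed through
    # unchanged and last, so it is easy to spot in the final argv.
    cmd += list(cfg.extra_vllm_args)
    return cmd


if __name__ == "__main__":
    fp8 = ModelConfig(
        name="org/example-model",  # placeholder model id, not from the catalogue
        max_model_len=32768,
        extra_vllm_args=("--quantization", "fp8"),
    )
    print(" ".join(build_command(fp8)))
```

A test that "pins the argv" in the sense of the last bullet would then assert on the exact list returned for a given config, which is cheap to maintain once there is only one registration shape.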
|
modal/README.md
CHANGED
|
@@ -18,8 +18,6 @@ modal/
|
|
| 18 |
app_nvidia.py App "nvidia-llms": Nemotron 3 Nano 4B + 30B, Cascade 14B Thinking.
|
| 19 |
app_openbmb.py App "openbmb-llms": MiniCPM4.1-8B + MiniCPM-o 4.5.
|
| 20 |
app_google.py App "google-llms": Gemma 4 12B + 26B.
|
| 21 |
-
vllm_logging.py Dependency-free JSON log formatter shipped into the image
|
| 22 |
-
when MODAL_LLM_JSON_LOGS=1 (structured logs via vLLM dictConfig).
|
| 23 |
client.py OpenAI-SDK smoke-test client for any endpoint.
|
| 24 |
openapi.yaml Checked-in OpenAPI 3.1 spec for the served API surface.
|
| 25 |
pyproject.toml uv workspace member (deploy/client tooling; non-package).
|
|
@@ -71,11 +69,13 @@ sizing, and how to add models/providers or wire endpoints into the engine.
|
|
| 71 |
radius; one provider's outage or redeploy never touches another.
|
| 72 |
- **Scalable**: serverless autoscaling, input concurrency, a shared weight
|
| 73 |
cache (pull once, warm everywhere), and per-model `min_containers` warm pools.
|
| 74 |
-
- **
|
| 75 |
-
|
| 76 |
-
|
| 77 |
-
|
| 78 |
-
|
|
|
|
|
|
|
| 79 |
- **Extensible**: add a model = one `ModelConfig` in `catalogue.py`; add a
|
| 80 |
provider = one `Provider` entry + one app file. The serving path is written once
|
| 81 |
in `service.py`, and the engine picks up the new model with no edits (it reads
|
|
|
|
| 18 |
app_nvidia.py App "nvidia-llms": Nemotron 3 Nano 4B + 30B, Cascade 14B Thinking.
|
| 19 |
app_openbmb.py App "openbmb-llms": MiniCPM4.1-8B + MiniCPM-o 4.5.
|
| 20 |
app_google.py App "google-llms": Gemma 4 12B + 26B.
|
|
|
|
|
|
|
| 21 |
client.py OpenAI-SDK smoke-test client for any endpoint.
|
| 22 |
openapi.yaml Checked-in OpenAPI 3.1 spec for the served API surface.
|
| 23 |
pyproject.toml uv workspace member (deploy/client tooling; non-package).
|
|
|
|
| 69 |
radius; one provider's outage or redeploy never touches another.
|
| 70 |
- **Scalable**: serverless autoscaling, input concurrency, a shared weight
|
| 71 |
cache (pull once, warm everywhere), and per-model `min_containers` warm pools.
|
| 72 |
+
- **One serving path**: Modal's canonical vLLM recipe (an autoscaling
|
| 73 |
+
`@app.function` launching `vllm serve` behind a `@modal.web_server`), written
|
| 74 |
+
once in `service.py`. No bespoke per-model lifecycle to break (ADR-0034).
|
| 75 |
+
- **Fast cold starts on demo day**: the shared `vllm-cache` Volume persists the
|
| 76 |
+
torch.compile / CUDA-graph artifacts so only the first container compiles, and
|
| 77 |
+
`MODAL_LLM_KEEP_WARM=1` at deploy time pins one warm container per tier model.
|
| 78 |
+
See [`docs/deploying.md` → Cold starts](docs/deploying.md#cold-starts).
|
| 79 |
- **Extensible**: add a model = one `ModelConfig` in `catalogue.py`; add a
|
| 80 |
provider = one `Provider` entry + one app file. The serving path is written once
|
| 81 |
in `service.py`, and the engine picks up the new model with no edits (it reads
|
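
The two scaling rules mentioned in this commit (an autoscale target of roughly 75% of `max_concurrent_inputs`, and `MODAL_LLM_KEEP_WARM=1` pinning one warm container per tier model) can be sketched as plain functions. This is an illustration under assumptions, not the repository's code: the function names are hypothetical, the rounding rule and the `"1"`-only truthiness check are guesses, and "tier model" is taken to mean a model whose `profile` is set.

```python
from __future__ import annotations

import os
from collections.abc import Mapping


def autoscale_target(max_concurrent_inputs: int) -> int:
    """Scale-out point at ~75% of the hard per-container ceiling (never below 1)."""
    return max(1, (max_concurrent_inputs * 3) // 4)


def effective_min_containers(
    min_containers: int,
    profile: str | None,
    env: Mapping[str, str] | None = None,
) -> int:
    """Warm-pool size after applying the deploy-time keep-warm switch.

    Assumption: only models that are a tier default (profile is set) get pinned.
    """
    env = os.environ if env is None else env
    keep_warm = env.get("MODAL_LLM_KEEP_WARM", "") == "1"
    if keep_warm and profile is not None:
        return max(min_containers, 1)
    return min_containers


if __name__ == "__main__":
    print(autoscale_target(64))
    print(effective_min_containers(0, "strong", {"MODAL_LLM_KEEP_WARM": "1"}))
```

Keeping both rules as small pure functions means they can be unit-tested without deploying anything; the results would then be handed to Modal's concurrency and `min_containers` settings at registration time.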
modal/catalogue.py
CHANGED
|
@@ -72,60 +72,31 @@ class ModelConfig:
|
|
| 72 |
max_model_len: int | None = None # cap context to fit memory / task
|
| 73 |
trust_remote_code: bool = False # required by MiniCPM / Nemotron custom code
|
| 74 |
|
| 75 |
-
# Precision / quantization (vLLM serve flags). Both default to full precision
|
| 76 |
-
# (BF16 weights, model-dtype KV cache); set them to shrink the memory footprint
|
| 77 |
-
# so a model fits a smaller GPU or leaves more room for KV cache. A deploy-time
|
| 78 |
-
# env override (``MODAL_LLM_QUANTIZATION`` / ``MODAL_LLM_KV_CACHE_DTYPE``, read in
|
| 79 |
-
# ``service.py``) wins over these per-model values for a whole deploy. CAVEAT:
|
| 80 |
-
# on-the-fly FP8 needs an Ada/Hopper GPU (our L4/L40S/H200 all qualify) AND vLLM
|
| 81 |
-
# support for the architecture: custom-code / hybrid-mamba archs (Nemotron-H,
|
| 82 |
-
# MiniCPM) and the Transformers-backend Gemmas may fail to start under it, so these
|
| 83 |
-
# stay ``None`` until a model is verified to serve quantized. See ADR-0031.
|
| 84 |
-
quantization: str | None = None # vLLM --quantization, on-the-fly weight quant (e.g. "fp8"); None = full BF16
|
| 85 |
-
kv_cache_dtype: str | None = None # vLLM --kv-cache-dtype (e.g. "fp8"); None = auto (model dtype)
|
| 86 |
-
|
| 87 |
# Performance / throughput (vLLM serve flags). Defaults target high
|
| 88 |
# steady-state throughput on the common single-GPU path; tune per model.
|
| 89 |
-
# See ``service.build_command`` for how each maps to a flag.
|
|
|
|
| 90 |
gpu_memory_utilization: float | None = None # fraction of VRAM for weights + KV cache (vLLM default 0.9)
|
| 91 |
enable_prefix_caching: bool = True # reuse KV for shared prompt prefixes; big win when system/context repeat
|
| 92 |
async_scheduling: bool = True # overlap CPU request scheduling with GPU compute
|
| 93 |
enforce_eager: bool = False # skip CUDA-graph capture: faster cold start, lower steady-state throughput
|
| 94 |
-
|
| 95 |
-
|
| 96 |
-
|
| 97 |
-
|
| 98 |
-
# snapshot): the container boots once, loads weights, warms the engine, puts it
|
| 99 |
-
# to sleep (vLLM sleep mode, weights offloaded to host RAM), and is snapshotted;
|
| 100 |
-
# every later cold start restores the snapshot and wakes the engine in seconds
|
| 101 |
-
# instead of re-paying download + load + warmup. Constraints (why this is per
|
| 102 |
-
# model, not global): single-GPU models only, the model's vLLM build must
|
| 103 |
-
# support `--enable-sleep-mode`, and host RAM must hold the offloaded weights.
|
| 104 |
-
# Modal marks GPU snapshots alpha, so keep it off for exotic serving paths
|
| 105 |
-
# (Transformers-backend Gemma, the omni specialist) and flip off on any model
|
| 106 |
-
# that misbehaves; the plain serving path is unchanged.
|
| 107 |
-
gpu_snapshot: bool = False
|
| 108 |
-
|
| 109 |
-
# Observability / request logging (vLLM serve flags). Defaults give per-request
|
| 110 |
-
# visibility in the container logs out of the box; see ``service.build_command``.
|
| 111 |
-
log_requests: bool = True # log each request's id, sampling params, and token counts
|
| 112 |
-
log_outputs: bool = False # also log generated text (verbose; can echo story content); opt-in
|
| 113 |
-
max_log_len: int | None = 2048 # truncate logged prompts/outputs to N chars (None = no cap)
|
| 114 |
-
uvicorn_access_log: bool = True # keep uvicorn's per-request HTTP access line (method, path, status)
|
| 115 |
|
| 116 |
# OpenAI feature parsers (vLLM names; leave None if unsupported on the model)
|
| 117 |
reasoning_parser: str | None = None
|
| 118 |
tool_call_parser: str | None = None
|
| 119 |
enable_auto_tool_choice: bool = False
|
| 120 |
|
| 121 |
-
# Multimodal
|
| 122 |
-
multimodal
|
| 123 |
-
|
|
|
|
| 124 |
|
| 125 |
# Scaling / lifecycle
|
| 126 |
max_concurrent_inputs: int = 64 # hard ceiling of requests multiplexed onto one container
|
| 127 |
-
target_concurrent_inputs: int | None = None # autoscale target: scale out here, burst up to max; defaults to ~75%
|
| 128 |
-
buffer_containers: int = 0 # extra idle containers to pre-warm under active load (bursty traffic)
|
| 129 |
scaledown_window: int = 15 * 60 # idle seconds before a container stops
|
| 130 |
min_containers: int = 0 # keep N warm to remove cold starts (costs $)
|
| 131 |
startup_timeout: int = 30 * 60 # weight download + load can be slow
|
|
@@ -169,31 +140,34 @@ NVIDIA_MODELS: tuple[ModelConfig, ...] = (
|
|
| 169 |
ModelConfig(
|
| 170 |
name="nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16",
|
| 171 |
endpoint_name="nemotron-3-nano-4b",
|
| 172 |
-
# Tiny Titan tier (≤4B):
|
| 173 |
profile="tiny",
|
| 174 |
params_b=4,
|
| 175 |
gpu="L4:1",
|
| 176 |
max_model_len=16384,
|
|
|
|
| 177 |
trust_remote_code=True,
|
| 178 |
gated=True,
|
| 179 |
max_concurrent_inputs=32,
|
| 180 |
-
#
|
| 181 |
-
#
|
| 182 |
-
|
|
|
|
|
|
|
|
|
|
| 183 |
),
|
| 184 |
ModelConfig(
|
| 185 |
name="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
|
| 186 |
endpoint_name="nemotron-3-nano-30b",
|
| 187 |
-
#
|
| 188 |
-
#
|
| 189 |
-
|
| 190 |
-
# past what a default container comfortably holds.
|
| 191 |
-
params_b=30,
|
| 192 |
gpu="H200:1",
|
| 193 |
max_model_len=32768,
|
| 194 |
trust_remote_code=True,
|
| 195 |
gated=True,
|
| 196 |
max_concurrent_inputs=64,
|
|
|
|
| 197 |
),
|
| 198 |
ModelConfig(
|
| 199 |
name="nvidia/Nemotron-Cascade-14B-Thinking",
|
|
@@ -210,15 +184,13 @@ NVIDIA_MODELS: tuple[ModelConfig, ...] = (
|
|
| 210 |
params_b=14,
|
| 211 |
gpu="L40S:1",
|
| 212 |
max_model_len=32768,
|
| 213 |
-
# Qwen3-
|
| 214 |
-
# block parsed by the Qwen3 reasoning parser
|
|
|
|
| 215 |
reasoning_parser="qwen3",
|
| 216 |
tool_call_parser="hermes",
|
| 217 |
enable_auto_tool_choice=True,
|
| 218 |
max_concurrent_inputs=48,
|
| 219 |
-
# Qwen3-native single-GPU path on the pinned vLLM: snapshot-safe, and a
|
| 220 |
-
# reasoning model is exactly where a multi-minute cold start hurts most.
|
| 221 |
-
gpu_snapshot=True,
|
| 222 |
),
|
| 223 |
)
|
| 224 |
|
|
@@ -234,28 +206,31 @@ OPENBMB_MODELS: tuple[ModelConfig, ...] = (
|
|
| 234 |
max_model_len=32768,
|
| 235 |
trust_remote_code=True,
|
| 236 |
max_concurrent_inputs=48,
|
| 237 |
-
# Fast tier default for the cast; 8B BF16 (~16GB) offloads to host RAM
|
| 238 |
-
# fine. Sleep mode is allocator-level, so the custom MiniCPM modeling
|
| 239 |
-
# code doesn't affect it.
|
| 240 |
-
gpu_snapshot=True,
|
| 241 |
# No tool_call_parser on purpose: MiniCPM4.1 emits a custom
|
| 242 |
-
# <|tool_call_start|> format vLLM
|
| 243 |
-
#
|
| 244 |
# guided decoding (response_format json_schema) instead, which is
|
| 245 |
# parser-independent; see ADR-0016. Don't bolt on a mismatched parser.
|
|
|
|
|
|
|
| 246 |
),
|
| 247 |
ModelConfig(
|
| 248 |
name="openbmb/MiniCPM-o-4_5",
|
| 249 |
endpoint_name="minicpm-o-4-5",
|
| 250 |
-
# Omni-modal (text + vision + audio)
|
| 251 |
-
# A specialist model
|
| 252 |
-
params_b=
|
| 253 |
gpu="L40S:1",
|
| 254 |
trust_remote_code=True,
|
| 255 |
-
|
| 256 |
-
|
| 257 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 258 |
extra_pip=("librosa", "soundfile", "timm"),
|
|
|
|
| 259 |
max_concurrent_inputs=16,
|
| 260 |
# Custom omni-modal code path: keep the async scheduler off (conservative
|
| 261 |
# since it's a specialist, not on the default cast). Prefix caching stays on.
|
|
@@ -285,36 +260,25 @@ GOOGLE_MODELS: tuple[ModelConfig, ...] = (
|
|
| 285 |
tool_call_parser="gemma4",
|
| 286 |
enable_auto_tool_choice=True,
|
| 287 |
max_concurrent_inputs=48,
|
| 288 |
-
#
|
| 289 |
-
#
|
| 290 |
-
#
|
| 291 |
-
#
|
| 292 |
-
#
|
| 293 |
-
# unverified, and the Gemmas already skip the costliest warmup (no
|
| 294 |
-
# CUDA-graph capture).
|
| 295 |
-
enforce_eager=True,
|
| 296 |
-
async_scheduling=False,
|
| 297 |
-
# Text-only in the cast (vision/audio is the MiniCPM-o specialist's job).
|
| 298 |
-
# vLLM auto-detects gemma4_unified as multimodal and otherwise spends a big
|
| 299 |
-
# slice of cold-start profiling a *video* encoder we never call (and the MM
|
| 300 |
-
# warmup fails anyway). Zeroing the per-prompt MM limits disables that whole
|
| 301 |
-
# path: faster start, less GPU memory, more KV cache.
|
| 302 |
-
mm_limits={"image": 0, "audio": 0, "video": 0},
|
| 303 |
-
# gemma4_unified uses *variable* head dims (256 on sliding-attention layers,
|
| 304 |
-
# 512 on full-attention ones). vLLM <= 0.22.1 (incl. the pinned 0.21.0) sizes
|
| 305 |
-
# the o_proj from a uniform head_dim and dies on the full-attention layers
|
| 306 |
-
# with "mat1 and mat2 shapes cannot be multiplied". Only a vLLM nightly serves
|
| 307 |
-
# gemma4_unified, paired with transformers >= 5.10.2 (which adds the arch) and
|
| 308 |
-
# the FlashInfer sampler off (its JIT path breaks on these builds). All three
|
| 309 |
-
# are scoped to this model, so NVIDIA/OpenBMB stay on the reproducible pin.
|
| 310 |
vllm_version="nightly",
|
| 311 |
extra_pip=("transformers>=5.10.2",),
|
| 312 |
-
|
| 313 |
),
|
| 314 |
ModelConfig(
|
| 315 |
name="google/gemma-4-26B-A4B-it",
|
| 316 |
endpoint_name="gemma-4-26b",
|
| 317 |
-
# MoE: ~
|
| 318 |
profile="strong",
|
| 319 |
params_b=26,
|
| 320 |
gpu="H200:1",
|
|
@@ -324,18 +288,11 @@ GOOGLE_MODELS: tuple[ModelConfig, ...] = (
|
|
| 324 |
tool_call_parser="gemma4",
|
| 325 |
enable_auto_tool_choice=True,
|
| 326 |
max_concurrent_inputs=64,
|
| 327 |
-
#
|
| 328 |
-
#
|
| 329 |
-
|
| 330 |
-
|
| 331 |
-
|
| 332 |
-
# encoder to cut cold-start profiling and free memory (see the 12B above).
|
| 333 |
-
mm_limits={"image": 0, "audio": 0, "video": 0},
|
| 334 |
-
# Same gemma4_unified fix as the 12B above (nightly vLLM + transformers
|
| 335 |
-
# >= 5.10.2 + FlashInfer sampler off).
|
| 336 |
-
vllm_version="nightly",
|
| 337 |
-
extra_pip=("transformers>=5.10.2",),
|
| 338 |
-
env={"VLLM_USE_FLASHINFER_SAMPLER": "0"},
|
| 339 |
),
|
| 340 |
)
|
| 341 |
|
|
|
|
| 72 |
max_model_len: int | None = None # cap context to fit memory / task
|
| 73 |
trust_remote_code: bool = False # required by MiniCPM / Nemotron custom code
|
| 74 |
| 75 |
# Performance / throughput (vLLM serve flags). Defaults target high
|
| 76 |
# steady-state throughput on the common single-GPU path; tune per model.
|
| 77 |
+
# See ``service.build_command`` for how each maps to a flag. For anything more
|
| 78 |
+
# exotic (quantization, batch-size caps, …) use ``extra_vllm_args``.
|
| 79 |
gpu_memory_utilization: float | None = None # fraction of VRAM for weights + KV cache (vLLM default 0.9)
|
| 80 |
enable_prefix_caching: bool = True # reuse KV for shared prompt prefixes; big win when system/context repeat
|
| 81 |
async_scheduling: bool = True # overlap CPU request scheduling with GPU compute
|
| 82 |
enforce_eager: bool = False # skip CUDA-graph capture: faster cold start, lower steady-state throughput
|
| 83 |
+
|
| 84 |
+
# Observability. ``log_requests`` adds --enable-log-requests so each call's id,
|
| 85 |
+
# sampling params, and token counts show in the Modal container logs.
|
| 86 |
+
log_requests: bool = True
|
| 87 |
|
| 88 |
# OpenAI feature parsers (vLLM names; leave None if unsupported on the model)
|
| 89 |
reasoning_parser: str | None = None
|
| 90 |
tool_call_parser: str | None = None
|
| 91 |
enable_auto_tool_choice: bool = False
|
| 92 |
|
| 93 |
+
# Multimodal: per-prompt input caps, e.g. {"image": 4, "audio": 2}. Set the
|
| 94 |
+
# caps to 0 on an auto-detected-multimodal model you serve text-only, to skip
|
| 95 |
+
# the encoder warmup and free memory.
|
| 96 |
+
mm_limits: dict[str, int] | None = None
|
| 97 |
|
| 98 |
# Scaling / lifecycle
|
| 99 |
max_concurrent_inputs: int = 64 # hard ceiling of requests multiplexed onto one container
|
|
|
|
|
|
|
| 100 |
scaledown_window: int = 15 * 60 # idle seconds before a container stops
|
| 101 |
min_containers: int = 0 # keep N warm to remove cold starts (costs $)
|
| 102 |
startup_timeout: int = 30 * 60 # weight download + load can be slow
|
|
|
|
| 140 |
ModelConfig(
|
| 141 |
name="nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16",
|
| 142 |
endpoint_name="nemotron-3-nano-4b",
|
| 143 |
+
# Tiny Titan tier (≤4B): ~4B BF16 weights (~8GB) fit a single 24GB L4.
|
| 144 |
profile="tiny",
|
| 145 |
params_b=4,
|
| 146 |
gpu="L4:1",
|
| 147 |
max_model_len=16384,
|
| 148 |
+
# Hybrid Mamba-2 + MLP + attention arch with custom modeling code; required.
|
| 149 |
trust_remote_code=True,
|
| 150 |
gated=True,
|
| 151 |
max_concurrent_inputs=32,
|
| 152 |
+
# Served as a plain chat endpoint. NVIDIA ships a custom `nano_v3` reasoning
|
| 153 |
+
# parser as a downloadable plugin file (--reasoning-parser-plugin) plus a
|
| 154 |
+
# `qwen3_coder` tool parser; both are omitted here for boot-robustness (the
|
| 155 |
+
# plugin must be shipped into the image and is easy to get wrong). The
|
| 156 |
+
# model still reasons; the <think> block just stays inline in the content.
|
| 157 |
+
# Add them later via extra_vllm_args if structured reasoning/tools are needed.
|
| 158 |
),
|
| 159 |
ModelConfig(
|
| 160 |
name="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
|
| 161 |
endpoint_name="nemotron-3-nano-30b",
|
| 162 |
+
# Hybrid Mamba-2 + MoE: ~31B total params in BF16 (~62GB), ~3B active per
|
| 163 |
+
# token. Needs an 80GB card; an alternate strong model, not a tier default.
|
| 164 |
+
params_b=31,
|
|
|
|
|
|
|
| 165 |
gpu="H200:1",
|
| 166 |
max_model_len=32768,
|
| 167 |
trust_remote_code=True,
|
| 168 |
gated=True,
|
| 169 |
max_concurrent_inputs=64,
|
| 170 |
+
# Same plain-chat posture as the 4B (custom `nano_v3` parser plugin omitted).
|
| 171 |
),
|
| 172 |
ModelConfig(
|
| 173 |
name="nvidia/Nemotron-Cascade-14B-Thinking",
|
|
|
|
| 184 |
params_b=14,
|
| 185 |
gpu="L40S:1",
|
| 186 |
max_model_len=32768,
|
| 187 |
+
# Post-trained from Qwen3-14B Base: stock Qwen3 arch (no custom code).
|
| 188 |
+
# ChatML thinking block parsed by the Qwen3 reasoning parser; `hermes` is
|
| 189 |
+
# the standard Qwen3-family tool parser. Both verified built-in in vLLM.
|
| 190 |
reasoning_parser="qwen3",
|
| 191 |
tool_call_parser="hermes",
|
| 192 |
enable_auto_tool_choice=True,
|
| 193 |
max_concurrent_inputs=48,
|
|
|
|
|
|
|
|
|
|
| 194 |
),
|
| 195 |
)
|
| 196 |
|
|
|
|
| 206 |
max_model_len=32768,
|
| 207 |
trust_remote_code=True,
|
| 208 |
max_concurrent_inputs=48,
|
|
|
|
|
|
|
|
|
|
|
|
|
| 209 |
# No tool_call_parser on purpose: MiniCPM4.1 emits a custom
|
| 210 |
+
# <|tool_call_start|> code-block format vLLM has no matching parser for, so
|
| 211 |
+
# a tool parser would 400/mis-parse. The engine's structured path uses vLLM
|
| 212 |
# guided decoding (response_format json_schema) instead, which is
|
| 213 |
# parser-independent; see ADR-0016. Don't bolt on a mismatched parser.
|
| 214 |
+
# (The model card suggests a vLLM nightly; 0.21.0 predates the release and
|
| 215 |
+
# serves it fine; flip vllm_version="nightly" if a boot failure proves otherwise.)
|
| 216 |
),
|
| 217 |
ModelConfig(
|
| 218 |
name="openbmb/MiniCPM-o-4_5",
|
| 219 |
endpoint_name="minicpm-o-4-5",
|
| 220 |
+
# Omni-modal (text + vision + audio) on a Qwen3-8B backbone, ~9B total in
|
| 221 |
+
# BF16. A specialist model, not cast to a profile by default.
|
| 222 |
+
params_b=9,
|
| 223 |
gpu="L40S:1",
|
| 224 |
trust_remote_code=True,
|
| 225 |
+
# Text + image only here; audio in/out over vLLM is experimental (it really
|
| 226 |
+
# wants the Transformers/demo runtime). Caps keep the encoder warmup bounded.
|
| 227 |
+
mm_limits={"image": 1, "audio": 0, "video": 0},
|
| 228 |
+
# Light vision/audio preprocessing backends. NOTE: full omni support wants
|
| 229 |
+
# openbmb's `minicpmo-utils[all]` + a pinned transformers==4.51.0, but that
|
| 230 |
+
# pin conflicts with vLLM's bundled transformers, so we keep the lean set
|
| 231 |
+
# and serve text+image. Treat audio as experimental.
|
| 232 |
extra_pip=("librosa", "soundfile", "timm"),
|
| 233 |
+
gpu_memory_utilization=0.9,
|
| 234 |
max_concurrent_inputs=16,
|
| 235 |
# Custom omni-modal code path: keep the async scheduler off (conservative
|
| 236 |
# since it's a specialist, not on the default cast). Prefix caching stays on.
|
|
|
|
| 260 |
tool_call_parser="gemma4",
|
| 261 |
enable_auto_tool_choice=True,
|
| 262 |
max_concurrent_inputs=48,
|
| 263 |
+
# gemma4_unified (encoder-free) has no native class in any *stable* vLLM
|
| 264 |
+
# (≤0.22.1 falls back to the Transformers backend and crashes); only the
|
| 265 |
+
# nightly wheel registers Gemma4UnifiedForConditionalGeneration. So this
|
| 266 |
+
# model alone pins the nightly + transformers>=5.10.2. Scoped here, so
|
| 267 |
+
# NVIDIA/OpenBMB and the 26B sibling stay on the reproducible pin.
|
| 268 |
vllm_version="nightly",
|
| 269 |
extra_pip=("transformers>=5.10.2",),
|
| 270 |
+
# Transformers-backend / fresh-nightly path: eager-only is the safe choice
|
| 271 |
+
# (CUDA-graph capture + async scheduler aren't reliable here).
|
| 272 |
+
enforce_eager=True,
|
| 273 |
+
async_scheduling=False,
|
| 274 |
+
# Text-only in the cast: gemma4 auto-detects as multimodal, so zero the
|
| 275 |
+
# per-prompt caps to skip the encoder warmup and free memory for KV cache.
|
| 276 |
+
mm_limits={"image": 0, "audio": 0},
|
| 277 |
),
|
| 278 |
ModelConfig(
|
| 279 |
name="google/gemma-4-26B-A4B-it",
|
| 280 |
endpoint_name="gemma-4-26b",
|
| 281 |
+
# MoE: ~25B total params (~4B active) with a small vision encoder. Gated.
|
| 282 |
profile="strong",
|
| 283 |
params_b=26,
|
| 284 |
gpu="H200:1",
|
|
|
|
| 288 |
tool_call_parser="gemma4",
|
| 289 |
enable_auto_tool_choice=True,
|
| 290 |
max_concurrent_inputs=64,
|
| 291 |
+
# Standard gemma4 MoE arch (NOT the unified 12B path): served by a native
|
| 292 |
+
# vLLM class on the pinned stable release (0.19.1+), so NO nightly, no
|
| 293 |
+
# transformers pin, and CUDA graphs + async scheduling work, so defaults stand.
|
| 294 |
+
# Text-only in the cast: zero the auto-detected multimodal caps.
|
| 295 |
+
mm_limits={"image": 0},
|
| 296 |
),
|
| 297 |
)
|
| 298 |
|
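The catalogue entries above are declarative, so it may help to see how such fields could end up on the `vllm serve` command line. The sketch below is not the repository's `build_command`; `SketchModelConfig` and `sketch_build_command` are hypothetical names, only a subset of fields is modelled, and it assumes a recent vLLM that accepts a JSON value for `--limit-mm-per-prompt`.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class SketchModelConfig:
    """Hypothetical stand-in for a catalogue entry (subset of fields only)."""

    name: str
    endpoint_name: str
    tool_call_parser: Optional[str] = None
    enable_auto_tool_choice: bool = False
    enforce_eager: bool = False
    mm_limits: dict = field(default_factory=dict)
    extra_vllm_args: tuple = ()


def sketch_build_command(cfg: SketchModelConfig) -> list:
    """Translate the config into an argv list for `vllm serve` (illustrative)."""
    cmd = ["vllm", "serve", cfg.name]
    if cfg.tool_call_parser:
        cmd += ["--tool-call-parser", cfg.tool_call_parser]
    if cfg.enable_auto_tool_choice:
        cmd.append("--enable-auto-tool-choice")
    if cfg.enforce_eager:
        cmd.append("--enforce-eager")
    if cfg.mm_limits:
        # Assumption: the installed vLLM parses a JSON object for this flag.
        cmd += ["--limit-mm-per-prompt", json.dumps(cfg.mm_limits)]
    # Escape hatch: raw flags are appended verbatim, after the modelled ones.
    cmd += list(cfg.extra_vllm_args)
    return cmd


gemma_26b = SketchModelConfig(
    name="google/gemma-4-26B-A4B-it",
    endpoint_name="gemma-4-26b",
    tool_call_parser="gemma4",
    enable_auto_tool_choice=True,
    mm_limits={"image": 0},
)
command = sketch_build_command(gemma_26b)
```

Because `enforce_eager` stays at its default for the 26B entry, no `--enforce-eager` flag is emitted. That matches the comment that defaults stand for the native MoE path.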
modal/docs/deploying.md
CHANGED
|
@@ -3,6 +3,12 @@
|
|
| 3 |
This guide covers prerequisites, deployment, configuration knobs, auth, GPU
|
| 4 |
sizing, and wiring the endpoints into the engine.
|
| 5 |
|
| 6 |
## Prerequisites
|
| 7 |
|
| 8 |
```bash
|
|
@@ -24,26 +30,25 @@ Only models with `gated=True` mount this secret; ungated models deploy without i
|
|
| 24 |
Each provider is its own Modal app, deployed independently:
|
| 25 |
|
| 26 |
```bash
|
| 27 |
-
modal deploy modal/app_nvidia.py # Nemotron 3 Nano
|
| 28 |
-
modal deploy modal/app_openbmb.py # MiniCPM-o 4.5
|
| 29 |
-
modal deploy modal/app_google.py # Gemma 4
|
| 30 |
```
|
| 31 |
|
| 32 |
Use `modal serve modal/app_<provider>.py` for a hot-reloading dev session.
|
| 33 |
|
| 34 |
Or deploy one, several, or all providers with a single uv command - a thin
|
| 35 |
-
wrapper that exposes the deploy-time env knobs
|
| 36 |
|
| 37 |
```bash
|
| 38 |
uv run scripts/deploy_modal.py # all providers
|
| 39 |
uv run scripts/deploy_modal.py nvidia openbmb # just these
|
| 40 |
uv run scripts/deploy_modal.py nvidia --keep-warm # = MODAL_LLM_KEEP_WARM=1
|
| 41 |
-
# --auth → MODAL_LLM_REQUIRE_AUTH=1, --
|
| 42 |
-
# --log-level LEVEL → MODAL_LLM_LOG_LEVEL, --dry-run to preview the commands.
|
| 43 |
```
|
| 44 |
|
| 45 |
Run these from the repo root; the script's own directory (`modal/`) is on
|
| 46 |
-
`sys.path`, so `from service import ...` / `from
|
| 47 |
and `import modal` still binds the installed SDK (the folder name does not
|
| 48 |
shadow it).
|
| 49 |
|
|
@@ -86,34 +91,29 @@ changes needed:
|
|
| 86 |
| `gpu` | Modal GPU spec, e.g. `H200:1`, `H100:2`, `L40S:1`, `L4:1`. |
|
| 87 |
| `tensor_parallel_size` | Shard across GPUs; set equal to the GPU count in `gpu`. |
|
| 88 |
| `max_model_len` | Cap context length to fit memory / tune throughput. |
|
| 89 |
-
| `max_concurrent_inputs` | Hard ceiling of requests multiplexed onto one container.
|
| 90 |
-
| `target_concurrent_inputs` | Autoscale target - scale out here, burst to the max (defaults to ~75% of the ceiling). |
|
| 91 |
-
| `buffer_containers` | Extra idle containers pre-warmed under active load (bursty traffic). |
|
| 92 |
| `scaledown_window` | Idle seconds before a container stops (cold-start vs. cost). |
|
| 93 |
-
| `gpu_snapshot` | Serve via Modal memory snapshots (CPU + GPU): cold starts restore a warmed engine in seconds instead of re-paying load + warmup. See [Cold starts](#cold-starts). |
|
| 94 |
| `min_containers` | Keep N warm to eliminate cold starts (always-on cost). |
|
| 95 |
| `gpu_memory_utilization` | Fraction of VRAM for weights + KV cache (vLLM default `0.9`); raise for a bigger KV cache. |
|
| 96 |
| `enable_prefix_caching` | Reuse the KV cache for shared prompt prefixes (on by default - big win when the system prompt / ledger context repeats across the cast). |
|
| 97 |
-
| `async_scheduling` | Overlap CPU request scheduling with GPU compute (on by default; off for the Transformers-backend Gemma + omni models). |
|
| 98 |
| `enforce_eager` | Skip CUDA-graph capture - faster cold start, lower steady-state throughput. |
|
| 99 |
-
| `max_num_seqs` / `max_num_batched_tokens` | Batch-size and per-step token budget (memory vs. throughput). |
|
| 100 |
| `log_requests` | Log each request's id, sampling params, and token counts (on by default). |
|
| 101 |
-
| `
|
| 102 |
-
| `
|
| 103 |
-
| `uvicorn_access_log` | Keep the per-request HTTP access line (method, path, status). |
|
| 104 |
-
| `reasoning_parser` / `tool_call_parser` / `enable_auto_tool_choice` | OpenAI tool/reasoning features. |
|
| 105 |
-
| `multimodal` / `mm_limits` | Image/audio/video inputs and per-prompt caps. |
|
| 106 |
| `trust_remote_code` | Required by MiniCPM / Nemotron custom modeling code. |
|
| 107 |
| `vllm_version` | Per-model inference-stack pin (escape hatch); `None` = the default `VLLM_VERSION`, `"nightly"` = latest nightly wheel, else a pinned version. |
|
| 108 |
-
| `extra_vllm_args` | Raw `vllm serve` flags appended verbatim
|
| 109 |
| `extra_pip` / `env` | Extra image deps / container env (escape hatch). |
|
| 110 |
|
| 111 |
> **Per-model vLLM version.** The image pins `VLLM_VERSION` (see `service.py`) for
|
| 112 |
> reproducible deploys. A single model can override it via `vllm_version` when the
|
| 113 |
> pinned release can't serve its architecture - this is scoped to that model's image,
|
| 114 |
-
> so one model's bump never touches another provider's app.
|
| 115 |
-
> `vllm_version="nightly"` (plus `transformers>=5.10.2`
|
| 116 |
-
>
|
| 117 |
|
| 118 |
### Performance tuning
|
| 119 |
|
|
@@ -129,45 +129,31 @@ per model:
|
|
| 129 |
graph cache is persisted on the shared `vllm-cache` Volume (`VLLM_CACHE_ROOT`),
|
| 130 |
so only the *first* container compiles - later cold starts replay the cached
|
| 131 |
graphs. Set `enforce_eager=True` on a model only when its backend can't capture
|
| 132 |
-
graphs (the Transformers-backend Gemma
|
| 133 |
- **Async scheduling** overlaps CPU request scheduling with GPU compute; on by
|
| 134 |
default for native vLLM models, off where the backend doesn't support it.
|
| 135 |
-
- **Autoscaling** scales out at
|
| 136 |
-
|
| 137 |
-
|
| 138 |
-
to pre-warm spares for bursty traffic, or `min_containers` to remove cold starts
|
| 139 |
entirely (at always-on cost).
|
| 140 |
-
- **The V1 engine is pinned** (`VLLM_USE_V1=1`) for its better scheduler, chunked
|
| 141 |
-
prefill, and prefix caching.
|
| 142 |
|
| 143 |
For memory-bound models, raise `gpu_memory_utilization` (more KV cache → more
|
| 144 |
-
concurrency)
|
|
|
|
| 145 |
|
| 146 |
### Cold starts
|
| 147 |
|
| 148 |
-
A scale-from-zero cold start
|
| 149 |
-
|
| 150 |
-
|
| 151 |
-
|
| 152 |
-
|
| 153 |
-
|
| 154 |
-
|
| 155 |
-
|
| 156 |
-
|
| 157 |
-
|
| 158 |
-
|
| 159 |
-
`@modal.enter(snap=False)` wake), but the public URL and API are identical -
|
| 160 |
-
clients can't tell the paths apart.
|
| 161 |
-
|
| 162 |
-
Snapshot-enabled today: `nemotron-3-nano-4b` (tiny), `minicpm-4-1-8b` (fast),
|
| 163 |
-
`nemotron-cascade-14b`. Left off deliberately: the Gemmas (nightly
|
| 164 |
-
Transformers-backend path, sleep mode unverified), `nemotron-3-nano-30b`
|
| 165 |
-
(~60GB of weights won't fit host RAM during sleep), and the omni specialist.
|
| 166 |
-
GPU snapshots are **Modal-alpha** - if a snapshot model misbehaves, set its
|
| 167 |
-
`gpu_snapshot=False` and redeploy; the plain path is unchanged.
|
| 168 |
-
|
| 169 |
-
**2. Demo-day keep-warm (deploy-time, no code edits).** Pin warm containers for
|
| 170 |
-
every *profile-bound* model (tiny/fast/balanced/strong) right before a live
|
| 171 |
demo - specialists keep scale-to-zero:
|
| 172 |
|
| 173 |
```bash
|
|
@@ -197,49 +183,26 @@ edits - it reads the same `catalogue.py`.
|
|
| 197 |
`app = modal.App(PROVIDERS["<provider>"].app)` then
|
| 198 |
`register_all(app, PROVIDERS["<provider>"].models)`.
|
| 199 |
|
| 200 |
-
##
|
| 201 |
-
|
| 202 |
-
Every model repo ships **BF16** weights. To shrink the memory footprint - fit a
|
| 203 |
-
model on a smaller GPU, or free VRAM for a longer context / more concurrency - you
|
| 204 |
-
can serve it at lower precision. This is purely serving-side: it only adds
|
| 205 |
-
`--quantization` / `--kv-cache-dtype` to the vLLM argv, and `--served-model-name`
|
| 206 |
-
is unchanged, so the engine, endpoint URLs, and the running cast are untouched.
|
| 207 |
-
|
| 208 |
-
Two controls, env override wins:
|
| 209 |
|
| 210 |
-
|
| 211 |
-
|
| 212 |
-
|
| 213 |
-
|
| 214 |
-
full precision even on a model that defaults to quantized.
|
| 215 |
|
| 216 |
-
```
|
| 217 |
-
|
| 218 |
-
uv run scripts/deploy_modal.py nvidia --quantization fp8
|
| 219 |
-
|
| 220 |
-
# FP8 weights + FP8 KV cache, raw modal CLI:
|
| 221 |
-
MODAL_LLM_QUANTIZATION=fp8 MODAL_LLM_KV_CACHE_DTYPE=fp8 modal deploy modal/app_nvidia.py
|
| 222 |
-
|
| 223 |
-
# Force full precision back (overrides any per-model default):
|
| 224 |
-
uv run scripts/deploy_modal.py nvidia --quantization none
|
| 225 |
```
|
| 226 |
|
| 227 |
> **Not every architecture serves under on-the-fly FP8.** It needs an Ada/Hopper
|
| 228 |
> GPU (our L4/L40S/H200 all qualify) *and* vLLM support for the model's arch.
|
| 229 |
-
> Custom-code / hybrid-
|
| 230 |
-
>
|
| 231 |
-
>
|
| 232 |
-
>
|
| 233 |
-
> `curl <url>/v1/models`); if a model won't start, redeploy that provider without
|
| 234 |
-
> the flag. This is why all per-model defaults stay `None` for now. See ADR-0031.
|
| 235 |
-
|
| 236 |
-
> **FP8 KV cache (`--kv-cache-dtype fp8`) is silently dropped for snapshot models.**
|
| 237 |
-
> On the pinned vLLM it crashes the `/wake_up` path (`init_fp8_kv_scales` →
|
| 238 |
-
> `'list' object has no attribute 'zero_'`), so an FP8-KV snapshot model boots but
|
| 239 |
-
> can never wake. `build_command` drops the flag for any `gpu_snapshot=True` model
|
| 240 |
-
> and logs a `⚠️` line at deploy; the endpoint serves with full-precision KV cache.
|
| 241 |
-
> FP8 *weights* (`--quantization fp8`) are unaffected. To run FP8 KV cache on such a
|
| 242 |
-
> model, set its `gpu_snapshot=False`. See ADR-0031.
|
| 243 |
|
| 244 |
## Auth
|
| 245 |
|
|
@@ -266,40 +229,11 @@ OpenAPI spec (`../openapi.yaml`).
|
|
| 266 |
## Observability & logging
|
| 267 |
|
| 268 |
Every container's stdout/stderr is captured by Modal - watch it live with
|
| 269 |
-
`modal app logs <app-name>` or in the dashboard.
|
| 270 |
-
|
| 271 |
-
|
| 272 |
-
|
| 273 |
-
|
| 274 |
-
caps the logged prompt at 2048 chars so a long context can't bloat a log line.
|
| 275 |
-
The uvicorn access log (method, path, status, latency) stays on. Tune per model:
|
| 276 |
-
|
| 277 |
-
| Knob | Effect |
|
| 278 |
-
| ----------------- | ------------------------------------------------------------- |
|
| 279 |
-
| `log_requests` | Per-request id / params / token counts (default **on**). |
|
| 280 |
-
| `log_outputs` | Also log the generated text - verbose, can echo story content (default off). |
|
| 281 |
-
| `max_log_len` | Truncate logged prompts/outputs; set `None` to log them in full. |
|
| 282 |
-
| `uvicorn_access_log` | Set `False` to drop the per-request HTTP access line. |
|
| 283 |
-
|
| 284 |
-
Clients can pass an `X-Request-Id` header and it shows up in the request logs -
|
| 285 |
-
handy for correlating an engine call with its server-side line.
|
| 286 |
-
|
| 287 |
-
**Structured JSON (opt-in).** For grepping fields or shipping to an aggregator,
|
| 288 |
-
emit one JSON object per log line instead of vLLM's coloured text. Turn it on at
|
| 289 |
-
deploy time β no code edits:
|
| 290 |
-
|
| 291 |
-
```bash
|
| 292 |
-
MODAL_LLM_JSON_LOGS=1 modal deploy modal/app_nvidia.py
|
| 293 |
-
MODAL_LLM_JSON_LOGS=1 MODAL_LLM_LOG_LEVEL=DEBUG modal deploy modal/app_google.py
|
| 294 |
-
```
|
| 295 |
-
|
| 296 |
-
This ships a dependency-free formatter (`modal/vllm_logging.py`) into the image
|
| 297 |
-
and points vLLM's `VLLM_LOGGING_CONFIG_PATH` at a generated `dictConfig`, so
|
| 298 |
-
**all** vLLM + uvicorn logs (including the request logs above) come out as JSON
|
| 299 |
-
with `ts` / `level` / `logger` / `msg` / `src` plus any structured extras (request
|
| 300 |
-
id, token counts). `MODAL_LLM_LOG_LEVEL` (default `INFO`) sets verbosity for both
|
| 301 |
-
the text and JSON paths. Leave JSON off for live demos - the coloured text is
|
| 302 |
-
easier to watch.
|
| 303 |
|
| 304 |
Throughput, KV-cache usage, and prefix-cache hit rate are logged every second
|
| 305 |
(`VLLM_LOG_STATS_INTERVAL`) and also exposed as Prometheus metrics at `/metrics`.
|
|
@@ -312,12 +246,13 @@ total parameter count.
|
|
| 312 |
|
| 313 |
| Model | Params (total / active) | Starting GPU |
|
| 314 |
| ---------------------------------- | ----------------------- | ------------ |
|
| 315 |
-
| Nemotron-3-Nano-30B-A3B |
|
| 316 |
-
| Nemotron-
|
| 317 |
-
|
|
|
|
|
| 318 |
| MiniCPM4.1-8B | 8B | `L40S:1` |
|
| 319 |
-
| Gemma-4-26B-A4B-it |
|
| 320 |
-
| Gemma-4-12B
|
| 321 |
|
| 322 |
These are starting points. If a container OOMs, lower `max_model_len`, raise the
|
| 323 |
GPU tier, or bump `tensor_parallel_size` (and the GPU count) for sharding.
|
|
|
|
| 3 |
This guide covers prerequisites, deployment, configuration knobs, auth, GPU
|
| 4 |
sizing, and wiring the endpoints into the engine.
|
| 5 |
|
| 6 |
+
The serving layer is deliberately small: it's Modal's canonical vLLM recipe - an
|
| 7 |
+
autoscaling `@app.function` that launches `vllm serve` as a subprocess behind a
|
| 8 |
+
`@modal.web_server` - applied once in `service.py` to every model in
|
| 9 |
+
`catalogue.py`. See ADR-0034 for why we stripped the earlier snapshot / FP8 /
|
| 10 |
+
structured-logging machinery back to this core.
|
| 11 |
+
|
| 12 |
## Prerequisites
|
| 13 |
|
| 14 |
```bash
|
|
|
|
| 30 |
Each provider is its own Modal app, deployed independently:
|
| 31 |
|
| 32 |
```bash
|
| 33 |
+
modal deploy modal/app_nvidia.py # Nemotron 3 Nano 4B + 30B, Cascade 14B
|
| 34 |
+
modal deploy modal/app_openbmb.py # MiniCPM4.1-8B + MiniCPM-o 4.5
|
| 35 |
+
modal deploy modal/app_google.py # Gemma 4 12B + 26B
|
| 36 |
```
|
| 37 |
|
| 38 |
Use `modal serve modal/app_<provider>.py` for a hot-reloading dev session.
|
| 39 |
|
| 40 |
Or deploy one, several, or all providers with a single uv command - a thin
|
| 41 |
+
wrapper that exposes the two deploy-time env knobs as flags:
|
| 42 |
|
| 43 |
```bash
|
| 44 |
uv run scripts/deploy_modal.py # all providers
|
| 45 |
uv run scripts/deploy_modal.py nvidia openbmb # just these
|
| 46 |
uv run scripts/deploy_modal.py nvidia --keep-warm # = MODAL_LLM_KEEP_WARM=1
|
| 47 |
+
# --auth → MODAL_LLM_REQUIRE_AUTH=1, --dry-run to preview the commands.
|
|
|
|
| 48 |
```
|
| 49 |
|
| 50 |
Run these from the repo root; the script's own directory (`modal/`) is on
|
| 51 |
+
`sys.path`, so `from service import ...` / `from catalogue import ...` resolve,
|
| 52 |
and `import modal` still binds the installed SDK (the folder name does not
|
| 53 |
shadow it).
|
| 54 |
|
|
|
|
| 91 |
| `gpu` | Modal GPU spec, e.g. `H200:1`, `H100:2`, `L40S:1`, `L4:1`. |
|
| 92 |
| `tensor_parallel_size` | Shard across GPUs; set equal to the GPU count in `gpu`. |
|
| 93 |
| `max_model_len` | Cap context length to fit memory / tune throughput. |
|
| 94 |
+
| `max_concurrent_inputs` | Hard ceiling of requests multiplexed onto one container (autoscale target is ~75% of it). |
|
| 95 |
| `scaledown_window` | Idle seconds before a container stops (cold-start vs. cost). |
|
|
|
|
| 96 |
| `min_containers` | Keep N warm to eliminate cold starts (always-on cost). |
|
| 97 |
| `gpu_memory_utilization` | Fraction of VRAM for weights + KV cache (vLLM default `0.9`); raise for a bigger KV cache. |
|
| 98 |
| `enable_prefix_caching` | Reuse the KV cache for shared prompt prefixes (on by default - big win when the system prompt / ledger context repeats across the cast). |
|
| 99 |
+
| `async_scheduling` | Overlap CPU request scheduling with GPU compute (on by default; off for the Transformers-backend Gemma 12B + omni models). |
|
| 100 |
| `enforce_eager` | Skip CUDA-graph capture - faster cold start, lower steady-state throughput. |
|
|
|
|
| 101 |
| `log_requests` | Log each request's id, sampling params, and token counts (on by default). |
|
| 102 |
+
| `reasoning_parser` / `tool_call_parser` / `enable_auto_tool_choice` | OpenAI tool/reasoning features (vLLM parser names; leave None if unsupported). |
|
| 103 |
+
| `mm_limits` | Per-prompt image/audio/video caps; set to 0 on an auto-detected-multimodal model you serve text-only. |
|
| 104 |
| `trust_remote_code` | Required by MiniCPM / Nemotron custom modeling code. |
|
| 105 |
| `vllm_version` | Per-model inference-stack pin (escape hatch); `None` = the default `VLLM_VERSION`, `"nightly"` = latest nightly wheel, else a pinned version. |
|
| 106 |
+
| `extra_vllm_args` | Raw `vllm serve` flags appended verbatim - the escape hatch for anything not modelled above (quantization, batch caps, custom parser plugins, …). |
|
| 107 |
| `extra_pip` / `env` | Extra image deps / container env (escape hatch). |
|
| 108 |
|
| 109 |
> **Per-model vLLM version.** The image pins `VLLM_VERSION` (see `service.py`) for
|
| 110 |
> reproducible deploys. A single model can override it via `vllm_version` when the
|
| 111 |
> pinned release can't serve its architecture - this is scoped to that model's image,
|
| 112 |
+
> so one model's bump never touches another provider's app. Only the Gemma 4 **12B**
|
| 113 |
+
> sets `vllm_version="nightly"` (plus `transformers>=5.10.2`) because its
|
| 114 |
+
> `gemma4_unified` architecture has no class in any stable vLLM ≤0.22.1. The Gemma 4
|
| 115 |
+
> **26B** is a standard MoE arch that serves on the pinned stable release, so it
|
| 116 |
+
> stays on the default pin.
|
| 117 |
|
| 118 |
### Performance tuning
|
| 119 |
|
|
|
|
| 129 |
graph cache is persisted on the shared `vllm-cache` Volume (`VLLM_CACHE_ROOT`),
|
| 130 |
so only the *first* container compiles - later cold starts replay the cached
|
| 131 |
graphs. Set `enforce_eager=True` on a model only when its backend can't capture
|
| 132 |
+
graphs (the Transformers-backend Gemma 12B) or when cold start dominates.
|
| 133 |
- **Async scheduling** overlaps CPU request scheduling with GPU compute; on by
|
| 134 |
default for native vLLM models, off where the backend doesn't support it.
|
| 135 |
+
- **Autoscaling** scales out at ~75% of `max_concurrent_inputs` while a hot
|
| 136 |
+
container bursts up to the ceiling, so we add capacity before a container
|
| 137 |
+
saturates rather than after. Use `min_containers` to remove cold starts
|
|
|
|
| 138 |
entirely (at always-on cost).
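As a worked example of the autoscaling bullet, the sketch below computes a scale-out target from the per-container ceiling. The doc only says "~75%", so the exact rounding (floor, with a minimum of 1) and the helper name `autoscale_target` are assumptions, and where the two numbers are passed to Modal is not shown in this diff.

```python
import math


def autoscale_target(max_concurrent_inputs: int, fraction: float = 0.75) -> int:
    """Return the in-flight request count at which a new container is wanted.

    Assumption: the target is the floor of `fraction` times the hard ceiling,
    never below 1. The real service may round differently.
    """
    return max(1, math.floor(max_concurrent_inputs * fraction))


# Ceilings taken from the two Gemma catalogue entries in this commit.
gemma_12b_target = autoscale_target(48)
gemma_26b_target = autoscale_target(64)
```

The gap between target and ceiling is the burst headroom one hot container absorbs while a new container cold-starts.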
|
| 139 |
|
| 140 |
For memory-bound models, raise `gpu_memory_utilization` (more KV cache → more
|
| 141 |
+
concurrency); if a step OOMs, lower `max_model_len` or cap the batch via
|
| 142 |
+
`extra_vllm_args` (e.g. `("--max-num-seqs", "32")`).
|
| 143 |
|
| 144 |
### Cold starts
|
| 145 |
|
| 146 |
+
A scale-from-zero cold start pays container boot → weight load → engine warmup.
|
| 147 |
+
Two mechanisms keep that bounded:
|
| 148 |
+
|
| 149 |
+
**1. Shared caches (always on).** Weights are pulled once onto the
|
| 150 |
+
`huggingface-cache` Volume and the torch.compile / CUDA-graph artifacts are
|
| 151 |
+
persisted on the `vllm-cache` Volume (`VLLM_CACHE_ROOT`). So a model downloads
|
| 152 |
+
once across every container and provider, and only the *first* container
|
| 153 |
+
compiles its graphs - later cold starts replay the cache.
|
| 154 |
+
|
| 155 |
+
**2. Demo-day keep-warm (deploy-time, no code edits).** Pin one warm container
|
| 156 |
+
for every *profile-bound* model (tiny/fast/balanced/strong) right before a live
|
| 157 |
demo - specialists keep scale-to-zero:
|
| 158 |
|
| 159 |
```bash
|
|
|
|
| 183 |
`app = modal.App(PROVIDERS["<provider>"].app)` then
|
| 184 |
`register_all(app, PROVIDERS["<provider>"].models)`.
|
| 185 |
|
| 186 |
+
## Lower precision (quantization)
|
| 187 |
|
| 188 |
+
Every model repo here ships **BF16** weights and serves at full precision. To
|
| 189 |
+
shrink a model's footprint - fit it on a smaller GPU, or free VRAM for a longer
|
| 190 |
+
context / more concurrency - pass vLLM's quantization flags through the
|
| 191 |
+
`extra_vllm_args` escape hatch on its `ModelConfig`:
|
|
|
|
| 192 |
|
| 193 |
+
```python
|
| 194 |
+
extra_vllm_args=("--quantization", "fp8", "--kv-cache-dtype", "fp8")
|
| 195 |
```
|
| 196 |
|
| 197 |
+
This is purely serving-side: `--served-model-name` is unchanged, so the engine,
|
| 198 |
+
endpoint URLs, and the running cast are untouched.
|
| 199 |
+
|
| 200 |
> **Not every architecture serves under on-the-fly FP8.** It needs an Ada/Hopper
|
| 201 |
> GPU (our L4/L40S/H200 all qualify) *and* vLLM support for the model's arch.
|
| 202 |
+
> Custom-code / hybrid-Mamba archs (the Nemotron Nanos, MiniCPM) and the
|
| 203 |
+
> Transformers-backend Gemma 12B may **fail to boot** under it. Verify a model
|
| 204 |
+
> after adding the flag (`modal/healthcheck.py` or `curl <url>/v1/models`); if it
|
| 205 |
+
> won't start, drop the flag. This is why every model defaults to full precision.
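The "verify a model after adding the flag" step can be scripted without the full health-check. The sketch below assumes only the standard OpenAI-style `/v1/models` response shape (`{"object": "list", "data": [{"id": ...}]}`); `fetch_models` and `served_ids` are hypothetical helper names, and the bearer-token header is an assumption about how auth is configured.

```python
import json
import urllib.request
from typing import Optional


def served_ids(payload: dict) -> list:
    """Extract the model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_models(base_url: str, api_key: Optional[str] = None, timeout: float = 30.0) -> dict:
    """GET <base_url>/v1/models and return the decoded JSON body."""
    request = urllib.request.Request(base_url.rstrip("/") + "/v1/models")
    if api_key:
        # Assumption: the endpoint expects a bearer token when auth is enabled.
        request.add_header("Authorization", f"Bearer {api_key}")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Offline example with a hand-written payload; no network call is made here.
example_payload = {"object": "list", "data": [{"id": "gemma-4-26b", "object": "model"}]}
example_ids = served_ids(example_payload)
```

A model that failed to boot under FP8 never answers this request, so a timeout or connection error here is the cue to drop the flag and redeploy.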
|
| 206 |
|
| 207 |
## Auth
|
| 208 |
|
|
|
|
| 229 |
## Observability & logging
|
| 230 |
|
| 231 |
Every container's stdout/stderr is captured by Modal - watch it live with
|
| 232 |
+
`modal app logs <app-name>` or in the dashboard. Each endpoint runs vLLM with
|
| 233 |
+
`--enable-log-requests` (toggle via `log_requests`), so every call logs its
|
| 234 |
+
request id, sampling params, and (on completion) prompt/generation token counts
|
| 235 |
+
and finish reason. Clients can pass an `X-Request-Id` header and it shows up in
|
| 236 |
+
the request logs - handy for correlating an engine call with its server-side line.
|
| 237 |
|
| 238 |
Throughput, KV-cache usage, and prefix-cache hit rate are logged every second
|
| 239 |
(`VLLM_LOG_STATS_INTERVAL`) and also exposed as Prometheus metrics at `/metrics`.
|
|
|
|
| 246 |
|
| 247 |
| Model | Params (total / active) | Starting GPU |
|
| 248 |
| ---------------------------------- | ----------------------- | ------------ |
|
| 249 |
+
| Nemotron-3-Nano-30B-A3B | ~31B / ~3B (Mamba MoE) | `H200:1` |
|
| 250 |
+
| Nemotron-Cascade-14B-Thinking | ~14B (dense, Qwen3) | `L40S:1` |
|
| 251 |
+
| Nemotron-3-Nano-4B | ~4B (Tiny Titan) | `L4:1` |
|
| 252 |
+
| MiniCPM-o-4_5 (omni) | ~9B + media encoders | `L40S:1` |
|
| 253 |
| MiniCPM4.1-8B | 8B | `L40S:1` |
|
| 254 |
+
| Gemma-4-26B-A4B-it | ~25B / ~4B (MoE) | `H200:1` |
|
| 255 |
+
| Gemma-4-12B-it | ~12B (dense) | `L40S:1` |
|
| 256 |
|
| 257 |
These are starting points. If a container OOMs, lower `max_model_len`, raise the
|
| 258 |
GPU tier, or bump `tensor_parallel_size` (and the GPU count) for sharding.
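The starting GPUs in the table follow from a back-of-the-envelope weight estimate: BF16 stores 2 bytes per parameter, so weights take roughly 2 GB per billion parameters. The sketch below is an approximation only. The helper names are hypothetical, the VRAM figures are the nominal per-card capacities, and activations, CUDA graphs, and media encoders are ignored.

```python
# Nominal VRAM per card in GB for the GPU types named in the table.
GPU_VRAM_GB = {"L4": 24, "L40S": 48, "H100": 80, "H200": 141}


def weight_gb(params_b: float, bytes_per_param: int = 2) -> float:
    """Approximate weight footprint in GB (BF16 = 2 bytes per parameter)."""
    return params_b * bytes_per_param


def kv_headroom_gb(params_b: float, gpu: str, gpu_memory_utilization: float = 0.9) -> float:
    """VRAM left for the KV cache after weights, within vLLM's memory budget.

    A negative result means the weights alone exceed the budget on that card.
    """
    return GPU_VRAM_GB[gpu] * gpu_memory_utilization - weight_gb(params_b)


nemotron_30b_weights = weight_gb(31)                # total params, not active ones
gemma_26b_on_l40s = kv_headroom_gb(25, "L40S")      # does not fit
gemma_26b_on_h200 = kv_headroom_gb(25, "H200")      # fits with room to spare
cascade_14b_on_l40s = kv_headroom_gb(14, "L40S")
```

The same arithmetic shows why an MoE is sized by total rather than active parameters: every expert must be resident in VRAM even though only a few run per token.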
|
modal/healthcheck.py
CHANGED
|
@@ -218,8 +218,7 @@ async def check_chat(client: httpx.AsyncClient, t: Target, deadline: float) -> N
|
|
| 218 |
backoff = min(backoff * 1.5, 20.0)
|
| 219 |
|
| 220 |
|
| 221 |
-
async def run_target(t: Target, api_key: str, timeout: int, do_chat: bool,
|
| 222 |
-
sem: asyncio.Semaphore) -> None:
|
| 223 |
async with sem:
|
| 224 |
t.started = time.monotonic()
|
| 225 |
deadline = t.started + timeout
|
|
@@ -231,9 +230,9 @@ async def run_target(t: Target, api_key: str, timeout: int, do_chat: bool,
|
|
| 231 |
# within 150s returns a 303 to the same URL (clients are expected to follow
|
| 232 |
# it - up to ~20 hops / 50 min) while the container finishes cold-starting.
|
| 233 |
# Without this, the first 303 at ~150s looks like a terminal error.
|
| 234 |
-
async with httpx.AsyncClient(
|
| 235 |
-
|
| 236 |
-
|
| 237 |
await check_models(client, t, deadline)
|
| 238 |
if t.models_ok and do_chat:
|
| 239 |
await check_chat(client, t, deadline)
|
|
@@ -258,10 +257,9 @@ PHASE_ICON = {
|
|
| 258 |
|
| 259 |
def render_board(targets: list[Target], started: float) -> str:
|
| 260 |
width = max(len(t.key) for t in targets)
|
| 261 |
-
lines = [f" cold-start health-check · {len(targets)} endpoints · "
|
| 262 |
-
f"{time.monotonic() - started:5.0f}s elapsed"]
|
| 263 |
for t in targets:
|
| 264 |
-
live =
|
| 265 |
icon = PHASE_ICON.get(t.phase, "?")
|
| 266 |
detail = t.phase
|
| 267 |
if t.phase == "booting":
|
|
@@ -311,16 +309,14 @@ def print_report(targets: list[Target], do_chat: bool) -> None:
|
|
| 311 |
detail = t.error or (t.sample if t.chat_ok else t.served_reported) or ""
|
| 312 |
if t.chat_ok and t.finish_reason:
|
| 313 |
detail = f"[{t.finish_reason}] {detail}"
|
| 314 |
-
print(f" {t.key:<{kw}} {yn(t.models_ok):<6} {yn(t.chat_ok):<5} "
|
| 315 |
-
f"{lat} {detail[:60]}")
|
| 316 |
|
| 317 |
def healthy(t: Target) -> bool:
|
| 318 |
return bool(t.models_ok and (t.chat_ok or not do_chat))
|
| 319 |
|
| 320 |
ok = sum(1 for t in targets if healthy(t))
|
| 321 |
print(" " + "-" * (len(header) - 2))
|
| 322 |
-
print(f" {ok}/{len(targets)} healthy"
|
| 323 |
-
+ ("" if do_chat else " (liveness only - chat not tested)"))
|
| 324 |
failed = [t.key for t in targets if not healthy(t)]
|
| 325 |
if failed:
|
| 326 |
print(f" needs attention: {', '.join(failed)}")
|
|
@@ -343,14 +339,16 @@ def build_targets(catalogue: ModuleType, workspace: str | None, args) -> list[Ta
|
|
| 343 |
base_url = base_override.rstrip("/")
|
| 344 |
else:
|
| 345 |
base_url = catalogue.endpoint_url(e.app, e.endpoint_name, workspace)
|
| 346 |
-
targets.append(
|
| 347 |
-
|
| 348 |
-
|
| 349 |
-
|
| 350 |
-
|
| 351 |
-
|
| 352 |
-
|
| 353 |
-
|
| 354 |
return targets
|
| 355 |
|
| 356 |
|
|
@@ -359,8 +357,11 @@ async def main_async(args) -> int:
|
|
| 359 |
workspace = resolve_workspace(args.workspace)
|
| 360 |
base_override = os.environ.get("MODAL_LLM_BASE_URL")
|
| 361 |
if not workspace and not base_override:
|
| 362 |
-
print(
|
| 363 |
-
|
| 364 |
return 2
|
| 365 |
|
| 366 |
targets = build_targets(catalogue, workspace, args)
|
|
@@ -378,8 +379,10 @@ async def main_async(args) -> int:
|
|
| 378 |
return 0
|
| 379 |
|
| 380 |
do_chat = not args.no_chat
|
| 381 |
-
print(
|
| 382 |
-
|
| 383 |
print("Firing all endpoints concurrently - cold starts overlap, so this takes")
|
| 384 |
print("about as long as the single slowest model, not the sum.\n")
|
| 385 |
|
|
@@ -388,9 +391,7 @@ async def main_async(args) -> int:
|
|
| 388 |
done = asyncio.Event()
|
| 389 |
progress = asyncio.create_task(progress_loop(targets, started, done))
|
| 390 |
try:
|
| 391 |
-
await asyncio.gather(*(
|
| 392 |
-
run_target(t, api_key, args.timeout, do_chat, sem) for t in targets
|
| 393 |
-
))
|
| 394 |
finally:
|
| 395 |
done.set()
|
| 396 |
await progress
|
|
@@ -398,18 +399,21 @@ async def main_async(args) -> int:
|
|
| 398 |
print_report(targets, do_chat)
|
| 399 |
|
| 400 |
if args.json:
|
| 401 |
-
summary = [
|
| 402 |
-
|
| 403 |
-
|
| 404 |
-
|
| 405 |
-
|
| 406 |
-
|
| 407 |
-
|
| 408 |
-
|
| 409 |
-
|
| 410 |
-
|
| 411 |
-
|
| 412 |
-
|
| 413 |
Path(args.json).write_text(json.dumps(summary, indent=2))
|
| 414 |
print(f"\nWrote JSON summary to {args.json}")
|
| 415 |
|
|
@@ -418,21 +422,17 @@ async def main_async(args) -> int:
|
|
| 418 |
|
| 419 |
|
| 420 |
def parse_args(argv: list[str] | None = None) -> argparse.Namespace:
|
| 421 |
-
p = argparse.ArgumentParser(description=__doc__,
|
| 422 |
-
formatter_class=argparse.RawDescriptionHelpFormatter)
|
| 423 |
p.add_argument("--workspace", help="Modal workspace slug (else $MODAL_WORKSPACE / `modal profile current`)")
|
| 424 |
p.add_argument("--only", help="comma-separated endpoint keys to include")
|
| 425 |
p.add_argument("--skip", help="comma-separated endpoint keys to exclude")
|
| 426 |
-
p.add_argument(
|
| 427 |
-
|
| 428 |
-
|
| 429 |
-
|
| 430 |
-
p.add_argument("--timeout", type=int, default=900,
|
| 431 |
-
|
| 432 |
-
p.add_argument("--
|
| 433 |
-
help="max endpoints in flight at once (default 0 = all)")
|
| 434 |
-
p.add_argument("--print-urls", action="store_true",
|
| 435 |
-
help="resolve and print endpoint URLs, then exit (no calls)")
|
| 436 |
p.add_argument("--json", help="also write a machine-readable summary to this path")
|
| 437 |
return p.parse_args(argv)
|
| 438 |
|
|
|
|
| 218 |
backoff = min(backoff * 1.5, 20.0)
|
| 219 |
|
| 220 |
|
| 221 |
+
async def run_target(t: Target, api_key: str, timeout: int, do_chat: bool, sem: asyncio.Semaphore) -> None:
|
|
|
|
| 222 |
async with sem:
|
| 223 |
t.started = time.monotonic()
|
| 224 |
deadline = t.started + timeout
|
|
|
|
| 230 |
# within 150s returns a 303 to the same URL (clients are expected to follow
|
| 231 |
# it - up to ~20 hops / 50 min) while the container finishes cold-starting.
|
| 232 |
# Without this, the first 303 at ~150s looks like a terminal error.
|
| 233 |
+
async with httpx.AsyncClient(
|
| 234 |
+
headers=headers, timeout=client_timeout, limits=limits, follow_redirects=True, max_redirects=20
|
| 235 |
+
) as client:
|
| 236 |
await check_models(client, t, deadline)
|
| 237 |
if t.models_ok and do_chat:
|
| 238 |
await check_chat(client, t, deadline)
|
|
|
|
| 257 |
|
| 258 |
def render_board(targets: list[Target], started: float) -> str:
|
| 259 |
width = max(len(t.key) for t in targets)
|
| 260 |
+
lines = [f" cold-start health-check · {len(targets)} endpoints · {time.monotonic() - started:5.0f}s elapsed"]
|
|
|
|
| 261 |
for t in targets:
|
| 262 |
+
live = t.elapsed or (time.monotonic() - t.started if t.started else 0.0)
|
| 263 |
icon = PHASE_ICON.get(t.phase, "?")
|
| 264 |
detail = t.phase
|
| 265 |
if t.phase == "booting":
|
|
|
|
| 309 |
detail = t.error or (t.sample if t.chat_ok else t.served_reported) or ""
|
| 310 |
if t.chat_ok and t.finish_reason:
|
| 311 |
detail = f"[{t.finish_reason}] {detail}"
|
| 312 |
+
print(f" {t.key:<{kw}} {yn(t.models_ok):<6} {yn(t.chat_ok):<5} {lat} {detail[:60]}")
|
|
|
|
| 313 |
|
| 314 |
def healthy(t: Target) -> bool:
|
| 315 |
return bool(t.models_ok and (t.chat_ok or not do_chat))
|
| 316 |
|
| 317 |
ok = sum(1 for t in targets if healthy(t))
|
| 318 |
print(" " + "-" * (len(header) - 2))
|
| 319 |
+
print(f" {ok}/{len(targets)} healthy" + ("" if do_chat else " (liveness only - chat not tested)"))
|
|
|
|
| 320 |
failed = [t.key for t in targets if not healthy(t)]
|
| 321 |
if failed:
|
| 322 |
print(f" needs attention: {', '.join(failed)}")
|
|
|
|
| 339 |
base_url = base_override.rstrip("/")
|
| 340 |
else:
|
| 341 |
base_url = catalogue.endpoint_url(e.app, e.endpoint_name, workspace)
|
| 342 |
+
targets.append(
|
| 343 |
+
Target(
|
| 344 |
+
key=key,
|
| 345 |
+
app=e.app,
|
| 346 |
+
served_model_id=e.served_model_id,
|
| 347 |
+
profile=e.profile,
|
| 348 |
+
params_b=e.params_b,
|
| 349 |
+
base_url=base_url,
|
| 350 |
+
)
|
| 351 |
+
)
|
| 352 |
return targets
|
| 353 |
|
| 354 |
|
|
|
|
| 357 |
workspace = resolve_workspace(args.workspace)
|
| 358 |
base_override = os.environ.get("MODAL_LLM_BASE_URL")
|
| 359 |
if not workspace and not base_override:
|
| 360 |
+
print(
|
| 361 |
+
"ERROR: could not resolve a Modal workspace. Pass --workspace, set "
|
| 362 |
+
"$MODAL_WORKSPACE, or run `modal token new`.",
|
| 363 |
+
file=sys.stderr,
|
| 364 |
+
)
|
| 365 |
return 2
|
| 366 |
|
| 367 |
targets = build_targets(catalogue, workspace, args)
|
|
|
|
| 379 |
return 0
|
| 380 |
|
| 381 |
do_chat = not args.no_chat
|
| 382 |
+
print(
|
| 383 |
+
f"Workspace: {workspace} endpoints: {len(targets)} "
|
| 384 |
+
f"chat: {'yes' if do_chat else 'no'} per-endpoint timeout: {args.timeout}s"
|
| 385 |
+
)
|
| 386 |
print("Firing all endpoints concurrently - cold starts overlap, so this takes")
|
| 387 |
print("about as long as the single slowest model, not the sum.\n")
|
| 388 |
|
|
|
|
| 391 |
done = asyncio.Event()
|
| 392 |
progress = asyncio.create_task(progress_loop(targets, started, done))
|
| 393 |
try:
|
| 394 |
+
await asyncio.gather(*(run_target(t, api_key, args.timeout, do_chat, sem) for t in targets))
|
| 395 |
finally:
|
| 396 |
done.set()
|
| 397 |
await progress
|
|
|
|
| 399 |
print_report(targets, do_chat)
|
| 400 |
|
| 401 |
if args.json:
|
| 402 |
+
summary = [
|
| 403 |
+
{
|
| 404 |
+
"endpoint": t.key,
|
| 405 |
+
"app": t.app,
|
| 406 |
+
"served_model_id": t.served_model_id,
|
| 407 |
+
"base_url": t.base_url,
|
| 408 |
+
"models_ok": t.models_ok,
|
| 409 |
+
"chat_ok": t.chat_ok,
|
| 410 |
+
"latency_s": round(t.elapsed, 1),
|
| 411 |
+
"finish_reason": t.finish_reason,
|
| 412 |
+
"served_reported": t.served_reported,
|
| 413 |
+
"error": t.error,
|
| 414 |
+
}
|
| 415 |
+
for t in targets
|
| 416 |
+
]
|
| 417 |
Path(args.json).write_text(json.dumps(summary, indent=2))
|
| 418 |
print(f"\nWrote JSON summary to {args.json}")
|
| 419 |
|
|
|
|
| 422 |
|
| 423 |
|
| 424 |
def parse_args(argv: list[str] | None = None) -> argparse.Namespace:
|
| 425 |
+
p = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
|
|
|
|
| 426 |
p.add_argument("--workspace", help="Modal workspace slug (else $MODAL_WORKSPACE / `modal profile current`)")
|
| 427 |
p.add_argument("--only", help="comma-separated endpoint keys to include")
|
| 428 |
p.add_argument("--skip", help="comma-separated endpoint keys to exclude")
|
| 429 |
+
p.add_argument(
|
| 430 |
+
"--profiles-only", action="store_true", help="test only the engine-bound tiers (tiny/fast/balanced/strong)"
|
| 431 |
+
)
|
| 432 |
+
p.add_argument("--no-chat", action="store_true", help="liveness only (GET /v1/models); skip the chat completion")
|
| 433 |
+
p.add_argument("--timeout", type=int, default=900, help="per-endpoint deadline in seconds (default 900)")
|
| 434 |
+
p.add_argument("--concurrency", type=int, default=0, help="max endpoints in flight at once (default 0 = all)")
|
| 435 |
+
p.add_argument("--print-urls", action="store_true", help="resolve and print endpoint URLs, then exit (no calls)")
|
| 436 |
p.add_argument("--json", help="also write a machine-readable summary to this path")
|
| 437 |
return p.parse_args(argv)
|
| 438 |
|
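Reviewer note: the "cold starts overlap" claim in the healthcheck output can be illustrated with a minimal sketch. It is not the script's actual `run_target`; `probe` and `probe_all` are hypothetical names, `asyncio.sleep` stands in for a real endpoint call, and the mapping of `--concurrency 0` to "no limit" via a semaphore sized to the number of targets is an assumption about how `sem` is built.

```python
import asyncio
import time


async def probe(name: str, cold_start_s: float, sem: asyncio.Semaphore) -> tuple[str, float]:
    """Hypothetical stand-in for one endpoint check; the sleep simulates a cold start."""
    async with sem:
        started = time.perf_counter()
        await asyncio.sleep(cold_start_s)
        return name, time.perf_counter() - started


async def probe_all(delays: dict[str, float], concurrency: int = 0) -> dict[str, float]:
    """Run every probe under one semaphore; 0 is treated as 'all in flight at once'."""
    # Assumption: a limit of 0 becomes a semaphore large enough for every target.
    sem = asyncio.Semaphore(concurrency or max(1, len(delays)))
    results = await asyncio.gather(*(probe(n, d, sem) for n, d in delays.items()))
    return dict(results)


def timed_run(delays: dict[str, float], concurrency: int) -> float:
    """Wall-clock seconds for one full pass over the simulated endpoints."""
    t0 = time.perf_counter()
    asyncio.run(probe_all(delays, concurrency))
    return time.perf_counter() - t0
```

With simulated delays, `timed_run(delays, 0)` should land near the largest delay, while `timed_run(delays, 1)` approaches the sum, which is the behaviour the printed message describes.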
modal/service.py
CHANGED
|
@@ -1,19 +1,18 @@
|
|
| 1 |
"""Reusable, OpenAI-compatible model-serving layer for Modal.
|
| 2 |
|
| 3 |
-
This module is provider-agnostic. It
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
|
| 11 |
-
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
|
| 15 |
-
|
| 16 |
-
The served endpoints speak the OpenAI REST API (``/v1/chat/completions``,
|
| 17 |
``/v1/completions``, ``/v1/models``), so any OpenAI-compatible client can call
|
| 18 |
them by pointing ``base_url`` at the deployed URL.
|
| 19 |
"""
|
|
@@ -33,10 +32,11 @@ from catalogue import ModelConfig
|
|
| 33 |
|
| 34 |
# --- Shared serving constants --------------------------------------------------
|
| 35 |
|
| 36 |
-
# Pin the inference stack so deploys are reproducible. Bump deliberately.
|
|
|
|
| 37 |
VLLM_VERSION = "0.21.0"
|
| 38 |
CUDA_IMAGE = "nvidia/cuda:12.9.0-devel-ubuntu22.04"
|
| 39 |
-
PYTHON_VERSION = "3.
|
| 40 |
|
| 41 |
# The in-container port vLLM listens on; Modal maps it to a public HTTPS URL.
|
| 42 |
VLLM_PORT = 8000
|
|
@@ -46,12 +46,12 @@ HF_CACHE_PATH = "/root/.cache/huggingface"
|
|
| 46 |
VLLM_CACHE_PATH = "/root/.cache/vllm"
|
| 47 |
|
| 48 |
# Name of the Modal Secret that holds a Hugging Face token (key: HF_TOKEN).
|
| 49 |
-
# Required only for gated repos
|
| 50 |
# modal secret create huggingface-secret HF_TOKEN=hf_...
|
| 51 |
HF_SECRET_NAME = "huggingface-secret"
|
| 52 |
|
| 53 |
-
# Name of the Modal Secret holding the bearer token clients must present.
|
| 54 |
-
#
|
| 55 |
# `Authorization: Bearer <token>` on every request. Create it once with:
|
| 56 |
# modal secret create llm-api-key VLLM_API_KEY=sk-...
|
| 57 |
API_KEY_SECRET_NAME = "llm-api-key"
|
|
@@ -60,72 +60,27 @@ API_KEY_SECRET_NAME = "llm-api-key"
|
|
| 60 |
# MODAL_LLM_REQUIRE_AUTH=1 modal deploy modal/app_google.py
|
| 61 |
# When enabled, every endpoint mounts API_KEY_SECRET_NAME and rejects requests
|
| 62 |
# without a valid bearer token. Off by default (endpoints are then public).
|
| 63 |
-
REQUIRE_API_KEY = os.environ.get("MODAL_LLM_REQUIRE_AUTH", "").lower() in (
|
| 64 |
-
"1",
|
| 65 |
-
"true",
|
| 66 |
-
"yes",
|
| 67 |
-
)
|
| 68 |
-
|
| 69 |
-
# Emit logs as structured JSON (one object per line) instead of vLLM's default
|
| 70 |
-
# human-readable text. Opt in at deploy time (no code edits), mirroring the auth
|
| 71 |
-
# toggle above:
|
| 72 |
-
# MODAL_LLM_JSON_LOGS=1 modal deploy modal/app_google.py
|
| 73 |
-
# Off by default - the coloured text logs are nicer to watch live; turn this on
|
| 74 |
-
# when shipping logs to an aggregator or grepping fields. Request-level logging
|
| 75 |
-
# itself (the per-request detail) is always on via ModelConfig, independent of
|
| 76 |
-
# the format chosen here.
|
| 77 |
-
JSON_LOGS = os.environ.get("MODAL_LLM_JSON_LOGS", "").lower() in ("1", "true", "yes")
|
| 78 |
-
|
| 79 |
-
# Verbosity for the served loggers (vLLM honours VLLM_LOGGING_LEVEL; the JSON
|
| 80 |
-
# config applies the same level). Read at deploy time and baked into the image.
|
| 81 |
-
LOG_LEVEL = os.environ.get("MODAL_LLM_LOG_LEVEL", "INFO").upper()
|
| 82 |
|
| 83 |
# Demo-day switch: keep N containers warm for every *profile-bound* model (the
|
| 84 |
-
# tiers the cast actually runs on), removing their cold starts
|
| 85 |
-
#
|
| 86 |
-
#
|
| 87 |
# MODAL_LLM_KEEP_WARM=1 modal deploy modal/app_nvidia.py
|
| 88 |
KEEP_WARM = int(os.environ.get("MODAL_LLM_KEEP_WARM", "0") or "0")
|
| 89 |
|
| 90 |
-
# Deploy-time precision overrides. When set, each wins over the matching per-model
|
| 91 |
-
# ``ModelConfig`` field for *every* model in the deploy β so you flip a whole
|
| 92 |
-
# provider to FP8 without editing the catalogue (deploys are per-provider, so the
|
| 93 |
-
# blast radius is one app):
|
| 94 |
-
# MODAL_LLM_QUANTIZATION=fp8 modal deploy modal/app_nvidia.py
|
| 95 |
-
# MODAL_LLM_QUANTIZATION=fp8 MODAL_LLM_KV_CACHE_DTYPE=fp8 uv run scripts/deploy_modal.py nvidia
|
| 96 |
-
# A disable token (``none``/``off``/``bf16``/…) forces full precision even if a model
|
| 97 |
-
# defaults to a quantized mode. Read at deploy time and baked into each model's argv
|
| 98 |
-
# (see build_command). CAVEAT: not every architecture serves under on-the-fly FP8 -
|
| 99 |
-
# verify per provider; a model that can't will fail to boot. See ADR-0031.
|
| 100 |
-
QUANTIZATION = os.environ.get("MODAL_LLM_QUANTIZATION", "").strip()
|
| 101 |
-
KV_CACHE_DTYPE = os.environ.get("MODAL_LLM_KV_CACHE_DTYPE", "").strip()
|
| 102 |
-
# Override values that mean "no quantization / model-default precision" β they make
|
| 103 |
-
# the resolver omit the flag rather than pass a bogus value to vLLM.
|
| 104 |
-
_PRECISION_DISABLE = frozenset({"none", "off", "false", "0", "no", "bf16", "fp16", "auto"})
|
| 105 |
-
|
| 106 |
-
# Where the structured-logging module + its generated config live in the
|
| 107 |
-
# container. The module dir goes on PYTHONPATH so vLLM can import the formatter
|
| 108 |
-
# the dictConfig references (``vllm_logging.JsonFormatter``).
|
| 109 |
-
_LOG_MODULE_DIR = "/opt/mal_logging"
|
| 110 |
-
_LOG_CONFIG_PATH = "/tmp/vllm_logging.json"
|
| 111 |
-
|
| 112 |
# Weights and the vLLM compile cache are shared across every provider app, so a
|
| 113 |
# model pulled once is warm for all subsequent deploys and containers.
|
| 114 |
hf_cache_vol = modal.Volume.from_name("huggingface-cache", create_if_missing=True)
|
| 115 |
vllm_cache_vol = modal.Volume.from_name("vllm-cache", create_if_missing=True)
|
| 116 |
|
| 117 |
-
# Baseline image shared by every
|
| 118 |
-
#
|
|
|
|
| 119 |
_BASE_ENV = {
|
| 120 |
"HF_HUB_CACHE": HF_CACHE_PATH,
|
| 121 |
"HF_XET_HIGH_PERFORMANCE": "1", # faster weight downloads
|
| 122 |
"VLLM_LOG_STATS_INTERVAL": "1",
|
| 123 |
-
# Verbosity of vLLM's own loggers (throughput/cache stats, request logs).
|
| 124 |
-
"VLLM_LOGGING_LEVEL": LOG_LEVEL,
|
| 125 |
-
# Persist torch.compile + CUDA-graph artifacts on the shared vLLM cache
|
| 126 |
-
# Volume (mounted at VLLM_CACHE_PATH). The first container compiles; every
|
| 127 |
-
# later cold start replays the cached graphs instead of recompiling, so we
|
| 128 |
-
# keep CUDA graphs (throughput) without paying their capture cost each boot.
|
| 129 |
"VLLM_CACHE_ROOT": VLLM_CACHE_PATH,
|
| 130 |
}
|
| 131 |
|
|
@@ -146,28 +101,6 @@ def build_image(cfg: ModelConfig) -> modal.Image:
|
|
| 146 |
else:
|
| 147 |
image = image.uv_pip_install(f"vllm=={cfg.vllm_version or VLLM_VERSION}")
|
| 148 |
image = image.env(_BASE_ENV)
|
| 149 |
-
if JSON_LOGS:
|
| 150 |
-
# Ship the stdlib JSON formatter and put it on PYTHONPATH so vLLM can
|
| 151 |
-
# import it when it applies the dictConfig. ``serve()`` writes the config
|
| 152 |
-
# file and points VLLM_LOGGING_CONFIG_PATH at it. Baking the toggle into
|
| 153 |
-
# the image env is what lets the (deploy-time) flag reach the container.
|
| 154 |
-
from pathlib import Path
|
| 155 |
-
|
| 156 |
-
image = (
|
| 157 |
-
image.add_local_file(
|
| 158 |
-
Path(__file__).with_name("vllm_logging.py"),
|
| 159 |
-
f"{_LOG_MODULE_DIR}/vllm_logging.py",
|
| 160 |
-
copy=True,
|
| 161 |
-
)
|
| 162 |
-
.env({"PYTHONPATH": _LOG_MODULE_DIR})
|
| 163 |
-
.env({"MODAL_LLM_JSON_LOGS": "1", "MODAL_LLM_LOG_LEVEL": LOG_LEVEL})
|
| 164 |
-
)
|
| 165 |
-
if cfg.gpu_snapshot:
|
| 166 |
-
# Snapshot prerequisites: VLLM_SERVER_DEV_MODE exposes the /sleep and
|
| 167 |
-
# /wake_up endpoints the snapshot lifecycle drives, and single-threaded
|
| 168 |
-
# inductor compilation keeps torch.compile artifacts snapshot-safe
|
| 169 |
-
# (Modal's documented vLLM + GPU-snapshot recipe).
|
| 170 |
-
image = image.env({"VLLM_SERVER_DEV_MODE": "1", "TORCHINDUCTOR_COMPILE_THREADS": "1"})
|
| 171 |
if cfg.extra_pip:
|
| 172 |
image = image.uv_pip_install(*cfg.extra_pip)
|
| 173 |
if cfg.env:
|
|
@@ -175,20 +108,6 @@ def build_image(cfg: ModelConfig) -> modal.Image:
|
|
| 175 |
return image
|
| 176 |
|
| 177 |
|
| 178 |
-
def _resolve_precision(override: str, model_value: str | None) -> str | None:
|
| 179 |
-
"""Effective precision flag: a deploy-time *override* wins over *model_value*.
|
| 180 |
-
|
| 181 |
-
A disable token in the override (``none``/``off``/``bf16``/…) returns ``None`` so
|
| 182 |
-
the caller omits the flag and vLLM keeps full / model-default precision; an empty
|
| 183 |
-
override falls back to the per-model value. Reads its inputs as arguments (the
|
| 184 |
-
callers pass the module globals) so tests can monkeypatch ``QUANTIZATION`` /
|
| 185 |
-
``KV_CACHE_DTYPE`` and see the change without reimporting.
|
| 186 |
-
"""
|
| 187 |
-
if override:
|
| 188 |
-
return None if override.lower() in _PRECISION_DISABLE else override
|
| 189 |
-
return model_value
|
| 190 |
-
|
| 191 |
-
|
| 192 |
def build_command(cfg: ModelConfig) -> list[str]:
|
| 193 |
"""Assemble the ``vllm serve`` argv for a model. Returned as a list so we can
|
| 194 |
launch with ``subprocess.Popen`` without a shell (no quoting pitfalls)."""
|
|
@@ -213,31 +132,6 @@ def build_command(cfg: ModelConfig) -> list[str]:
|
|
| 213 |
cmd += ["--max-model-len", str(cfg.max_model_len)]
|
| 214 |
if cfg.trust_remote_code:
|
| 215 |
cmd += ["--trust-remote-code"]
|
| 216 |
-
# Precision / quantization. A deploy-time env override (QUANTIZATION /
|
| 217 |
-
# KV_CACHE_DTYPE) wins over the per-model ModelConfig field; both default to
|
| 218 |
-
# full precision (no flag). On-the-fly FP8 needs Ada/Hopper + arch support.
|
| 219 |
-
quantization = _resolve_precision(QUANTIZATION, cfg.quantization)
|
| 220 |
-
if quantization:
|
| 221 |
-
cmd += ["--quantization", quantization]
|
| 222 |
-
kv_cache_dtype = _resolve_precision(KV_CACHE_DTYPE, cfg.kv_cache_dtype)
|
| 223 |
-
# FP8 KV cache is incompatible with sleep-mode/snapshot models on the pinned
|
| 224 |
-
# vLLM: the wake path runs init_fp8_kv_scales() over a post-sleep KV cache that
|
| 225 |
-
# is a *list* of per-layer tensors, not one tensor, so cache_tensor.zero_()
|
| 226 |
-
# throws and /wake_up 500s (every snapshot restore dies). Snapshot is a
|
| 227 |
-
# structural per-model decision; the KV dtype is a deploy knob - so snapshot
|
| 228 |
-
# wins. Drop the flag and warn loudly rather than ship an endpoint that boots
|
| 229 |
-
# but can never wake. Weight --quantization is unaffected (different code path).
|
| 230 |
-
if kv_cache_dtype and cfg.gpu_snapshot and kv_cache_dtype.lower().startswith("fp8"):
|
| 231 |
-
print(
|
| 232 |
-
f"⚠️ {cfg.endpoint_name}: dropping --kv-cache-dtype {kv_cache_dtype} - "
|
| 233 |
-
"FP8 KV cache crashes the snapshot wake path on the pinned vLLM (see ADR-0031). "
|
| 234 |
-
"Serving with full-precision KV cache. Drop gpu_snapshot to keep FP8 KV cache.",
|
| 235 |
-
flush=True,
|
| 236 |
-
)
|
| 237 |
-
kv_cache_dtype = None
|
| 238 |
-
if kv_cache_dtype:
|
| 239 |
-
cmd += ["--kv-cache-dtype", kv_cache_dtype]
|
| 240 |
-
# Performance / throughput knobs (all data-driven from ModelConfig).
|
| 241 |
if cfg.gpu_memory_utilization is not None:
|
| 242 |
cmd += ["--gpu-memory-utilization", str(cfg.gpu_memory_utilization)]
|
| 243 |
# Prefix caching reuses the KV cache for shared prompt prefixes. In a
|
|
@@ -248,21 +142,10 @@ def build_command(cfg: ModelConfig) -> list[str]:
|
|
| 248 |
cmd += ["--async-scheduling"]
|
| 249 |
if cfg.enforce_eager:
|
| 250 |
cmd += ["--enforce-eager"]
|
| 251 |
-
if cfg.max_num_seqs:
|
| 252 |
-
cmd += ["--max-num-seqs", str(cfg.max_num_seqs)]
|
| 253 |
-
if cfg.max_num_batched_tokens:
|
| 254 |
-
cmd += ["--max-num-batched-tokens", str(cfg.max_num_batched_tokens)]
|
| 255 |
# Observability: log each incoming request (id, params, token counts) so the
|
| 256 |
-
# Modal logs show what's actually being served.
|
| 257 |
-
# by default so a long context can't blow up the log line.
|
| 258 |
if cfg.log_requests:
|
| 259 |
cmd += ["--enable-log-requests"]
|
| 260 |
-
if cfg.log_outputs:
|
| 261 |
-
cmd += ["--enable-log-outputs"]
|
| 262 |
-
if cfg.max_log_len is not None:
|
| 263 |
-
cmd += ["--max-log-len", str(cfg.max_log_len)]
|
| 264 |
-
if not cfg.uvicorn_access_log:
|
| 265 |
-
cmd += ["--disable-uvicorn-access-log"]
|
| 266 |
if cfg.reasoning_parser:
|
| 267 |
cmd += ["--reasoning-parser", cfg.reasoning_parser]
|
| 268 |
if cfg.enable_auto_tool_choice:
|
|
@@ -271,10 +154,6 @@ def build_command(cfg: ModelConfig) -> list[str]:
|
|
| 271 |
cmd += ["--tool-call-parser", cfg.tool_call_parser]
|
| 272 |
if cfg.mm_limits:
|
| 273 |
cmd += ["--limit-mm-per-prompt", json.dumps(cfg.mm_limits)]
|
| 274 |
-
if cfg.gpu_snapshot:
|
| 275 |
-
# Sleep mode lets the snapshot lifecycle offload weights to host RAM
|
| 276 |
-
# (sleep level 1) before the memory snapshot is taken, then wake on restore.
|
| 277 |
-
cmd += ["--enable-sleep-mode"]
|
| 278 |
cmd += list(cfg.extra_vllm_args)
|
| 279 |
return cmd
|
| 280 |
|
|
@@ -282,16 +161,11 @@ def build_command(cfg: ModelConfig) -> list[str]:
|
|
| 282 |
# --- Endpoint registration ------------------------------------------------------
|
| 283 |
|
| 284 |
|
| 285 |
-
def register_model(app: modal.App, cfg: ModelConfig) -> modal.Function | type:
|
| 286 |
"""Attach one model to ``app`` as an autoscaling, OpenAI-compatible endpoint.
|
| 287 |
|
| 288 |
-
|
| 289 |
-
|
| 290 |
-
(load → warm up → sleep → snapshot) so later cold starts restore in seconds
|
| 291 |
-
instead of re-paying download + load + warmup. Both paths publish the same
|
| 292 |
-
URL shape (``…--<app>-<endpoint_name>.modal.run``), so clients can't tell
|
| 293 |
-
them apart.
|
| 294 |
-
|
| 295 |
Everything is serialized (the prebuilt ``vllm serve`` argv is shipped to the
|
| 296 |
container), which lets us register many distinctly-named endpoints from a
|
| 297 |
simple loop without each needing a hand-written module-level function.
|
|
@@ -311,23 +185,11 @@ def register_model(app: modal.App, cfg: ModelConfig) -> modal.Function | type:
|
|
| 311 |
if KEEP_WARM and cfg.profile:
|
| 312 |
min_containers = max(min_containers, KEEP_WARM)
|
| 313 |
|
| 314 |
-
# Autoscale at the
|
| 315 |
-
# hard max before another cold-starts (Modal high-perf
|
| 316 |
-
|
| 317 |
-
|
| 318 |
-
|
| 319 |
-
if cfg.gpu_snapshot:
|
| 320 |
-
return _register_snapshot_model(
|
| 321 |
-
app,
|
| 322 |
-
cfg,
|
| 323 |
-
image=image,
|
| 324 |
-
cmd=cmd,
|
| 325 |
-
secrets=secrets,
|
| 326 |
-
min_containers=min_containers,
|
| 327 |
-
target_inputs=target_inputs,
|
| 328 |
-
)
|
| 329 |
-
|
| 330 |
-
function_kwargs = dict(
|
| 331 |
name=cfg.endpoint_name,
|
| 332 |
image=image,
|
| 333 |
gpu=cfg.gpu,
|
|
@@ -338,169 +200,18 @@ def register_model(app: modal.App, cfg: ModelConfig) -> modal.Function | type:
|
|
| 338 |
timeout=cfg.request_timeout,
|
| 339 |
serialized=True,
|
| 340 |
)
|
| 341 |
-
# Pre-warm spare containers under load for bursty traffic (opt-in per model).
|
| 342 |
-
if cfg.buffer_containers:
|
| 343 |
-
function_kwargs["buffer_containers"] = cfg.buffer_containers
|
| 344 |
-
|
| 345 |
-
@app.function(**function_kwargs)
|
| 346 |
@modal.concurrent(max_inputs=cfg.max_concurrent_inputs, target_inputs=target_inputs)
|
| 347 |
@modal.web_server(port=VLLM_PORT, startup_timeout=cfg.startup_timeout)
|
| 348 |
def serve():
|
| 349 |
-
import os
|
| 350 |
import subprocess
|
| 351 |
|
| 352 |
-
env = dict(os.environ)
|
| 353 |
-
# When structured logging is on, generate the dictConfig file and point
|
| 354 |
-
# vLLM at it. Done at container start (not build) so the level is picked
|
| 355 |
-
# up from the env without rebuilding the image.
|
| 356 |
-
if env.get("MODAL_LLM_JSON_LOGS", "").lower() in ("1", "true", "yes"):
|
| 357 |
-
import vllm_logging
|
| 358 |
-
|
| 359 |
-
vllm_logging.write_config(_LOG_CONFIG_PATH, level=env.get("MODAL_LLM_LOG_LEVEL", "INFO"))
|
| 360 |
-
env["VLLM_LOGGING_CONFIG_PATH"] = _LOG_CONFIG_PATH
|
| 361 |
-
|
| 362 |
# vLLM serves the OpenAI REST API on VLLM_PORT; Modal exposes it publicly.
|
| 363 |
-
|
|
|
|
| 364 |
|
| 365 |
return serve
|
| 366 |
|
| 367 |
|
| 368 |
-
def _class_name(slug: str) -> str:
|
| 369 |
-
"""Modal class name for an endpoint slug: ``nemotron-3-nano-4b`` → ``Nemotron3Nano4b``."""
|
| 370 |
-
return "".join(part.capitalize() for part in slug.replace("_", "-").split("-") if part) or "SnapshotServer"
|
| 371 |
-
|
| 372 |
-
|
| 373 |
-
def _register_snapshot_model(
|
| 374 |
-
app: modal.App,
|
| 375 |
-
cfg: ModelConfig,
|
| 376 |
-
*,
|
| 377 |
-
image: modal.Image,
|
| 378 |
-
cmd: list[str],
|
| 379 |
-
secrets: list[modal.Secret],
|
| 380 |
-
min_containers: int,
|
| 381 |
-
target_inputs: int,
|
| 382 |
-
) -> type:
|
| 383 |
-
"""Snapshot serving path - Modal's vLLM + GPU-memory-snapshot recipe.
|
| 384 |
-
|
| 385 |
-
First boot: start vLLM, wait for the port, run a few warmup completions so
|
| 386 |
-
compiled artifacts and caches are resident, put the engine to sleep (weights
|
| 387 |
-
offloaded to host RAM, KV cache dropped), and let Modal snapshot the
|
| 388 |
-
container (CPU + GPU state). Every later cold start restores the snapshot
|
| 389 |
-
and wakes the engine - seconds instead of minutes. The web URL label is
|
| 390 |
-
pinned to ``<app>-<endpoint_name>`` so the public URL is identical to the
|
| 391 |
-
plain function path (``…--<app>-<endpoint_name>.modal.run``) the catalogue's
|
| 392 |
-
``endpoint_url`` builds. A ``@modal.web_server`` ``label`` becomes the URL as
|
| 393 |
-
``<workspace>--<label>.modal.run`` *without* the app prefix Modal adds to a
|
| 394 |
-
plain function's URL, so the app name must be folded into the label by hand
|
| 395 |
-
or snapshot models answer at the wrong host (``…--<endpoint_name>``).
|
| 396 |
-
"""
|
| 397 |
-
served_name = cfg.served_name
|
| 398 |
-
|
| 399 |
-
# Helpers are nested (not module-level) on purpose: the class ships to the
|
| 400 |
-
# container via cloudpickle (``serialized=True``), and closures are pickled
|
| 401 |
-
# by value - a module-level helper would be pickled by reference to the
|
| 402 |
-
# ``service`` module, which doesn't exist inside the container.
|
| 403 |
-
def _headers() -> dict[str, str]:
|
| 404 |
-
import os
|
| 405 |
-
|
| 406 |
-
key = os.environ.get("VLLM_API_KEY")
|
| 407 |
-
return {"Authorization": f"Bearer {key}"} if key else {}
|
| 408 |
-
|
| 409 |
-
def _wait_ready(proc) -> None:
|
| 410 |
-
# vLLM opens the port only once the engine is initialized, so a
|
| 411 |
-
# successful connect means "ready", not just "listening".
|
| 412 |
-
import socket
|
| 413 |
-
import time
|
| 414 |
-
|
| 415 |
-
while True:
|
| 416 |
-
try:
|
| 417 |
-
socket.create_connection(("localhost", VLLM_PORT), timeout=1).close()
|
| 418 |
-
return
|
| 419 |
-
except OSError:
|
| 420 |
-
if proc.poll() is not None:
|
| 421 |
-
raise RuntimeError(f"vllm exited with code {proc.returncode}")
|
| 422 |
-
time.sleep(0.2)
|
| 423 |
-
|
| 424 |
-
def _post(path: str, json_body: dict | None = None, timeout: float = 300.0) -> None:
|
| 425 |
-
import requests # vLLM dependency, always present in the image
|
| 426 |
-
|
| 427 |
-
url = f"http://localhost:{VLLM_PORT}{path}"
|
| 428 |
-
requests.post(url, headers=_headers(), json=json_body, timeout=timeout).raise_for_status()
|
| 429 |
-
|
| 430 |
-
class _SnapshotServer:
|
| 431 |
-
@modal.enter(snap=True)
|
| 432 |
-
def start(self):
|
| 433 |
-
import os
|
| 434 |
-
import subprocess
|
| 435 |
-
|
| 436 |
-
env = dict(os.environ)
|
| 437 |
-
# Same structured-logging hook as the plain path (see ``serve``).
|
| 438 |
-
if env.get("MODAL_LLM_JSON_LOGS", "").lower() in ("1", "true", "yes"):
|
| 439 |
-
import vllm_logging
|
| 440 |
-
|
| 441 |
-
vllm_logging.write_config(_LOG_CONFIG_PATH, level=env.get("MODAL_LLM_LOG_LEVEL", "INFO"))
|
| 442 |
-
env["VLLM_LOGGING_CONFIG_PATH"] = _LOG_CONFIG_PATH
|
| 443 |
-
|
| 444 |
-
self.vllm_proc = subprocess.Popen(cmd, env=env)
|
| 445 |
-
_wait_ready(self.vllm_proc)
|
| 446 |
-
# Touch the full serving path so compile/caching work happens *before*
|
| 447 |
-
# the snapshot rather than on the first real request after restore.
|
| 448 |
-
warmup = {
|
| 449 |
-
"model": served_name,
|
| 450 |
-
"messages": [{"role": "user", "content": "Who tends the wood?"}],
|
| 451 |
-
"max_tokens": 8,
|
| 452 |
-
}
|
| 453 |
-
for _ in range(3):
|
| 454 |
-
_post("/v1/chat/completions", json_body=warmup)
|
| 455 |
-
# Offload weights to host RAM (sleep level 1); Modal snapshots the
|
| 456 |
-
# container right after the snap=True enters return.
|
| 457 |
-
_post("/sleep?level=1", timeout=120.0)
|
| 458 |
-
|
| 459 |
-
@modal.enter(snap=False)
|
| 460 |
-
def wake(self):
|
| 461 |
-
# Runs after every restore (and on the snapshot-creating boot itself,
|
| 462 |
-
# which simply resumes serving): reload weights onto the GPU.
|
| 463 |
-
_post("/wake_up", timeout=120.0)
|
| 464 |
-
_wait_ready(self.vllm_proc)
|
| 465 |
-
|
| 466 |
-
@modal.web_server(port=VLLM_PORT, startup_timeout=cfg.startup_timeout, label=f"{app.name}-{cfg.endpoint_name}")
|
| 467 |
-
def serve(self):
|
| 468 |
-
pass # vLLM (already running) is the web server; Modal just exposes the port.
|
| 469 |
-
|
| 470 |
-
@modal.exit()
|
| 471 |
-
def stop(self):
|
| 472 |
-
proc = getattr(self, "vllm_proc", None)
|
| 473 |
-
if proc is not None:
|
| 474 |
-
proc.terminate()
|
| 475 |
-
|
| 476 |
-
# One Modal class per model, named after the endpoint (App.cls has no name
|
| 477 |
-
# override, so rename the type before decorating).
|
| 478 |
-
name = _class_name(cfg.endpoint_name)
|
| 479 |
-
_SnapshotServer.__name__ = name
|
| 480 |
-
_SnapshotServer.__qualname__ = name
|
| 481 |
-
|
| 482 |
-
cls_kwargs = dict(
|
| 483 |
-
image=image,
|
| 484 |
-
gpu=cfg.gpu,
|
| 485 |
-
volumes={HF_CACHE_PATH: hf_cache_vol, VLLM_CACHE_PATH: vllm_cache_vol},
|
| 486 |
-
secrets=secrets,
|
| 487 |
-
scaledown_window=cfg.scaledown_window,
|
| 488 |
-
min_containers=min_containers,
|
| 489 |
-
timeout=cfg.request_timeout,
|
| 490 |
-
# Bounds the whole snap=True phase (download + load + warmup + sleep).
|
| 491 |
-
startup_timeout=cfg.startup_timeout,
|
| 492 |
-
serialized=True,
|
| 493 |
-
enable_memory_snapshot=True,
|
| 494 |
-
# GPU snapshots are Modal-alpha; scoped per model via cfg.gpu_snapshot.
|
| 495 |
-
experimental_options={"enable_gpu_snapshot": True},
|
| 496 |
-
)
|
| 497 |
-
if cfg.buffer_containers:
|
| 498 |
-
cls_kwargs["buffer_containers"] = cfg.buffer_containers
|
| 499 |
-
|
| 500 |
-
concurrent = modal.concurrent(max_inputs=cfg.max_concurrent_inputs, target_inputs=target_inputs)
|
| 501 |
-
return app.cls(**cls_kwargs)(concurrent(_SnapshotServer))
|
| 502 |
-
|
| 503 |
-
|
| 504 |
def register_all(app: modal.App, configs: Iterable[ModelConfig]) -> None:
|
| 505 |
"""Register every model in ``configs`` onto ``app``."""
|
| 506 |
for cfg in configs:
|
|
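Reviewer note: the idea behind `build_command` (assemble the `vllm serve` argv as a list so it can be launched without a shell) can be sketched in isolation. This is a trimmed-down illustration, not the module's code: `MiniConfig` and `build_argv` are hypothetical names, the field set is a small subset of `ModelConfig`, and the leading `--served-model-name` / `--port` arguments are an assumption because the start of the real function is not shown in this diff.

```python
from __future__ import annotations

import json
from dataclasses import dataclass, field


@dataclass
class MiniConfig:
    """Hypothetical stand-in for ModelConfig with only a few of its fields."""

    model: str
    served_name: str
    max_model_len: int | None = None
    trust_remote_code: bool = False
    mm_limits: dict[str, int] = field(default_factory=dict)
    extra_vllm_args: tuple[str, ...] = ()


def build_argv(cfg: MiniConfig, port: int = 8000) -> list[str]:
    """Return the argv as a list; every value is its own element, so no quoting is needed."""
    # Assumed prefix: model id, served name and port come first.
    cmd = ["vllm", "serve", cfg.model, "--served-model-name", cfg.served_name, "--port", str(port)]
    if cfg.max_model_len:
        cmd += ["--max-model-len", str(cfg.max_model_len)]
    if cfg.trust_remote_code:
        cmd += ["--trust-remote-code"]
    if cfg.mm_limits:
        # JSON stays a single argv element, which is why a shell-free launch is convenient.
        cmd += ["--limit-mm-per-prompt", json.dumps(cfg.mm_limits)]
    cmd += list(cfg.extra_vllm_args)
    return cmd
```

Because the JSON for `--limit-mm-per-prompt` travels as one list element, passing the result to `subprocess.Popen` needs no shell escaping, which matches the reasoning in the docstring.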
|
|
| 1 |
"""Reusable, OpenAI-compatible model-serving layer for Modal.
|
| 2 |
|
| 3 |
+
This module is provider-agnostic. It takes a single ``ModelConfig`` and turns it
|
| 4 |
+
into a serverless, autoscaling, OpenAI-compatible HTTP endpoint backed by vLLM.
|
| 5 |
+
Each provider app (``app_nvidia.py``, ``app_openbmb.py``, ``app_google.py``)
|
| 6 |
+
imports :func:`register_all` and wires up its own models, so providers stay
|
| 7 |
+
isolated in their own Modal apps while sharing one serving path.
|
| 8 |
+
|
| 9 |
+
This is Modal's canonical vLLM recipe, kept deliberately small: an autoscaling
|
| 10 |
+
``@app.function`` whose body launches ``vllm serve`` as a subprocess behind a
|
| 11 |
+
``@modal.web_server``. Everything that shapes a model (GPU, context length,
|
| 12 |
+
parsers, multimodal limits, extra flags) lives in data - the ``ModelConfig`` -
|
| 13 |
+
not in code, so adding a model is one entry in ``catalogue.py``.
|
| 14 |
+
|
| 15 |
+
The served endpoints speak the OpenAI REST API (``/v1/chat/completions``,
|
|
|
|
| 16 |
``/v1/completions``, ``/v1/models``), so any OpenAI-compatible client can call
|
| 17 |
them by pointing ``base_url`` at the deployed URL.
|
| 18 |
"""
|
|
|
|
| 32 |
|
| 33 |
# --- Shared serving constants --------------------------------------------------
|
| 34 |
|
| 35 |
+
# Pin the inference stack so deploys are reproducible. Bump deliberately. This is
|
| 36 |
+
# the version Modal's current vLLM example ships with.
|
| 37 |
VLLM_VERSION = "0.21.0"
|
| 38 |
CUDA_IMAGE = "nvidia/cuda:12.9.0-devel-ubuntu22.04"
|
| 39 |
+
PYTHON_VERSION = "3.12"
|
| 40 |
|
| 41 |
# The in-container port vLLM listens on; Modal maps it to a public HTTPS URL.
|
| 42 |
VLLM_PORT = 8000
|
|
|
|
| 46 |
VLLM_CACHE_PATH = "/root/.cache/vllm"
|
| 47 |
|
| 48 |
# Name of the Modal Secret that holds a Hugging Face token (key: HF_TOKEN).
|
| 49 |
+
# Required only for gated repos. Create it once with:
|
| 50 |
# modal secret create huggingface-secret HF_TOKEN=hf_...
|
| 51 |
HF_SECRET_NAME = "huggingface-secret"
|
| 52 |
|
| 53 |
+
# Name of the Modal Secret holding the bearer token clients must present. The key
|
| 54 |
+
# MUST be VLLM_API_KEY - vLLM reads that env var and then enforces
|
| 55 |
# `Authorization: Bearer <token>` on every request. Create it once with:
|
| 56 |
# modal secret create llm-api-key VLLM_API_KEY=sk-...
|
| 57 |
API_KEY_SECRET_NAME = "llm-api-key"
|
|
|
|
| 60 |
# MODAL_LLM_REQUIRE_AUTH=1 modal deploy modal/app_google.py
|
| 61 |
# When enabled, every endpoint mounts API_KEY_SECRET_NAME and rejects requests
|
| 62 |
# without a valid bearer token. Off by default (endpoints are then public).
|
| 63 |
+
REQUIRE_API_KEY = os.environ.get("MODAL_LLM_REQUIRE_AUTH", "").lower() in ("1", "true", "yes")
|
| 64 |
|
| 65 |
# Demo-day switch: keep N containers warm for every *profile-bound* model (the
|
| 66 |
+
# tiers the cast actually runs on), removing their cold starts for the duration
|
| 67 |
+
# of the deploy. Specialists keep scale-to-zero. Costs GPU-hours while deployed -
|
| 68 |
+
# turn it on right before a live demo, redeploy without it after:
|
| 69 |
# MODAL_LLM_KEEP_WARM=1 modal deploy modal/app_nvidia.py
|
| 70 |
KEEP_WARM = int(os.environ.get("MODAL_LLM_KEEP_WARM", "0") or "0")
|
| 71 |
|
| 72 |
# Weights and the vLLM compile cache are shared across every provider app, so a
|
| 73 |
# model pulled once is warm for all subsequent deploys and containers.
|
| 74 |
hf_cache_vol = modal.Volume.from_name("huggingface-cache", create_if_missing=True)
|
| 75 |
vllm_cache_vol = modal.Volume.from_name("vllm-cache", create_if_missing=True)
|
| 76 |
|
| 77 |
+
# Baseline image env shared by every model. Persisting the torch.compile + CUDA
|
| 78 |
+
# graph cache on the shared vLLM Volume means only the first container compiles;
|
| 79 |
+
# later cold starts replay the cached graphs instead of recapturing them.
|
| 80 |
_BASE_ENV = {
|
| 81 |
"HF_HUB_CACHE": HF_CACHE_PATH,
|
| 82 |
"HF_XET_HIGH_PERFORMANCE": "1", # faster weight downloads
|
| 83 |
"VLLM_LOG_STATS_INTERVAL": "1",
|
| 84 |
"VLLM_CACHE_ROOT": VLLM_CACHE_PATH,
|
| 85 |
}
|
| 86 |
|
|
|
|
| 101 |
else:
|
| 102 |
image = image.uv_pip_install(f"vllm=={cfg.vllm_version or VLLM_VERSION}")
|
| 103 |
image = image.env(_BASE_ENV)
|
| 104 |
if cfg.extra_pip:
|
| 105 |
image = image.uv_pip_install(*cfg.extra_pip)
|
| 106 |
if cfg.env:
|
|
|
|
| 108 |
return image
|
| 109 |
|
| 110 |
|
| 111 |
def build_command(cfg: ModelConfig) -> list[str]:
|
| 112 |
"""Assemble the ``vllm serve`` argv for a model. Returned as a list so we can
|
| 113 |
launch with ``subprocess.Popen`` without a shell (no quoting pitfalls)."""
|
|
|
|
| 132 |
cmd += ["--max-model-len", str(cfg.max_model_len)]
|
| 133 |
if cfg.trust_remote_code:
|
| 134 |
cmd += ["--trust-remote-code"]
|
| 135 |
if cfg.gpu_memory_utilization is not None:
|
| 136 |
cmd += ["--gpu-memory-utilization", str(cfg.gpu_memory_utilization)]
|
| 137 |
# Prefix caching reuses the KV cache for shared prompt prefixes. In a
|
|
|
|
| 142 |
cmd += ["--async-scheduling"]
|
| 143 |
if cfg.enforce_eager:
|
| 144 |
cmd += ["--enforce-eager"]
|
| 145 |
# Observability: log each incoming request (id, params, token counts) so the
|
| 146 |
+
# Modal logs show what's actually being served.
|
|
|
|
| 147 |
if cfg.log_requests:
|
| 148 |
cmd += ["--enable-log-requests"]
|
| 149 |
if cfg.reasoning_parser:
|
| 150 |
cmd += ["--reasoning-parser", cfg.reasoning_parser]
|
| 151 |
if cfg.enable_auto_tool_choice:
|
|
|
|
| 154 |
cmd += ["--tool-call-parser", cfg.tool_call_parser]
|
| 155 |
if cfg.mm_limits:
|
| 156 |
cmd += ["--limit-mm-per-prompt", json.dumps(cfg.mm_limits)]
|
| 157 |
cmd += list(cfg.extra_vllm_args)
|
| 158 |
return cmd
|
| 159 |
|
|
|
|
| 161 |
# --- Endpoint registration ------------------------------------------------------
|
| 162 |
|
| 163 |
|
| 164 |
+
def register_model(app: modal.App, cfg: ModelConfig) -> modal.Function:
|
| 165 |
"""Attach one model to ``app`` as an autoscaling, OpenAI-compatible endpoint.
|
| 166 |
|
| 167 |
+
A single serialized ``@app.function`` web server launches ``vllm serve`` as a
|
| 168 |
+
subprocess; Modal exposes its port at ``…--<app>-<endpoint_name>.modal.run``.
|
| 169 |
Everything is serialized (the prebuilt ``vllm serve`` argv is shipped to the
|
| 170 |
container), which lets us register many distinctly-named endpoints from a
|
| 171 |
simple loop without each needing a hand-written module-level function.
|
|
|
|
| 185 |
if KEEP_WARM and cfg.profile:
|
| 186 |
min_containers = max(min_containers, KEEP_WARM)
|
| 187 |
|
| 188 |
+
# Autoscale at ~75% of the ceiling, but let a hot container absorb a burst up
|
| 189 |
+
# to the hard max before another cold-starts (Modal high-perf guidance).
|
| 190 |
+
target_inputs = max(1, (cfg.max_concurrent_inputs * 3) // 4)
|
| 191 |
+
|
| 192 |
+
@app.function(
|
| 193 |
name=cfg.endpoint_name,
|
| 194 |
image=image,
|
| 195 |
gpu=cfg.gpu,
|
|
|
|
| 200 |
timeout=cfg.request_timeout,
|
| 201 |
serialized=True,
|
| 202 |
)
|
| 203 |
@modal.concurrent(max_inputs=cfg.max_concurrent_inputs, target_inputs=target_inputs)
|
| 204 |
@modal.web_server(port=VLLM_PORT, startup_timeout=cfg.startup_timeout)
|
| 205 |
def serve():
|
|
|
|
| 206 |
import subprocess
|
| 207 |
|
| 208 |
# vLLM serves the OpenAI REST API on VLLM_PORT; Modal exposes it publicly.
|
| 209 |
+
# Inherits the container env (HF cache, vLLM cache, any secrets).
|
| 210 |
+
subprocess.Popen(cmd)
|
| 211 |
|
| 212 |
return serve
|
| 213 |
|
| 214 |
|
| 215 |
def register_all(app: modal.App, configs: Iterable[ModelConfig]) -> None:
|
| 216 |
"""Register every model in ``configs`` onto ``app``."""
|
| 217 |
for cfg in configs:
|
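The autoscaling target computed in `register_model` is plain integer arithmetic. A minimal sketch, assuming only what the diff shows (three quarters of `max_concurrent_inputs`, floored, never below 1); the helper name `target_inputs_for` is hypothetical and does not exist in `modal/service.py`:

```python
def target_inputs_for(max_concurrent_inputs: int) -> int:
    """Soft autoscaling target: roughly 75% of the hard per-container ceiling."""
    # Integer floor of 0.75 * ceiling, clamped so a ceiling of 1 still yields 1.
    return max(1, (max_concurrent_inputs * 3) // 4)


if __name__ == "__main__":
    # Print the soft target next to a few illustrative ceilings.
    for ceiling in (1, 2, 8, 32):
        print(ceiling, target_inputs_for(ceiling))
```

The gap between the soft target and the hard ceiling is what lets a warm container absorb a burst before another container has to cold-start.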
modal/vllm_logging.py
DELETED
|
@@ -1,118 +0,0 @@
|
|
| 1 |
-
"""Structured (JSON) logging for the vLLM subprocess - stdlib only.
|
| 2 |
-
|
| 3 |
-
vLLM applies a standard :func:`logging.config.dictConfig` when the
|
| 4 |
-
``VLLM_LOGGING_CONFIG_PATH`` env var points at a JSON file (see vLLM's
|
| 5 |
-
``envs.py``). This module builds that config and ships the :class:`JsonFormatter`
|
| 6 |
-
it references, so one importable module serves both sides:
|
| 7 |
-
|
| 8 |
-
* :func:`write_config` - called by ``service.serve()`` to drop the JSON config
|
| 9 |
-
file into the container before launching ``vllm serve``; and
|
| 10 |
-
* :class:`JsonFormatter` - imported *by name* from the JSON config when vLLM
|
| 11 |
-
runs ``dictConfig`` in its own process.
|
| 12 |
-
|
| 13 |
-
For the second to work, this file is added to the container image and its
|
| 14 |
-
directory is placed on ``PYTHONPATH`` (see ``service.build_image``). Keeping it
|
| 15 |
-
**dependency-free** (no ``python-json-logger`` etc.) means there is no extra
|
| 16 |
-
wheel to install and no import path that can drift between versions; vLLM only
|
| 17 |
-
needs the stdlib plus this one file.
|
| 18 |
-
|
| 19 |
-
One JSON object is emitted per log line: ``ts``, ``level``, ``logger``, ``msg``,
|
| 20 |
-
the source ``module:lineno``, and any structured extras attached to the record
|
| 21 |
-
(vLLM threads request ids and token counts through these). Output stays on
|
| 22 |
-
stdout so Modal captures it like every other container log.
|
| 23 |
-
"""
|
| 24 |
-
|
| 25 |
-
from __future__ import annotations
|
| 26 |
-
|
| 27 |
-
import json
|
| 28 |
-
import logging
|
| 29 |
-
|
| 30 |
-
# Standard LogRecord attributes: everything here is either folded into a fixed
|
| 31 |
-
# JSON key below or deliberately dropped. Anything *else* on the record is a
|
| 32 |
-
# caller-supplied extra (e.g. a request id) and is included verbatim.
|
| 33 |
-
_RESERVED: frozenset[str] = frozenset(
|
| 34 |
-
{
|
| 35 |
-
"args",
|
| 36 |
-
"asctime",
|
| 37 |
-
"created",
|
| 38 |
-
"exc_info",
|
| 39 |
-
"exc_text",
|
| 40 |
-
"filename",
|
| 41 |
-
"funcName",
|
| 42 |
-
"levelname",
|
| 43 |
-
"levelno",
|
| 44 |
-
"lineno",
|
| 45 |
-
"module",
|
| 46 |
-
"msecs",
|
| 47 |
-
"message",
|
| 48 |
-
"msg",
|
| 49 |
-
"name",
|
| 50 |
-
"pathname",
|
| 51 |
-
"process",
|
| 52 |
-
"processName",
|
| 53 |
-
"relativeCreated",
|
| 54 |
-
"stack_info",
|
| 55 |
-
"taskName",
|
| 56 |
-
"thread",
|
| 57 |
-
"threadName",
|
| 58 |
-
}
|
| 59 |
-
)
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
class JsonFormatter(logging.Formatter):
|
| 63 |
-
"""Render each log record as a single compact JSON line.
|
| 64 |
-
|
| 65 |
-
Referenced from the dictConfig by dotted path (``vllm_logging.JsonFormatter``),
|
| 66 |
-
so it must stay importable under that name in the container.
|
| 67 |
-
"""
|
| 68 |
-
|
| 69 |
-
def format(self, record: logging.LogRecord) -> str:
|
| 70 |
-
data: dict[str, object] = {
|
| 71 |
-
"ts": self.formatTime(record, self.datefmt),
|
| 72 |
-
"level": record.levelname,
|
| 73 |
-
"logger": record.name,
|
| 74 |
-
"msg": record.getMessage(),
|
| 75 |
-
"src": f"{record.module}:{record.lineno}",
|
| 76 |
-
}
|
| 77 |
-
if record.exc_info:
|
| 78 |
-
data["exc"] = self.formatException(record.exc_info)
|
| 79 |
-
# Fold in any structured extras (request_id, token counts, ...). Values
|
| 80 |
-
# that aren't JSON-serialisable fall back to repr so a stray object can
|
| 81 |
-
# never crash the logging path.
|
| 82 |
-
for key, value in record.__dict__.items():
|
| 83 |
-
if key in _RESERVED or key.startswith("_"):
|
| 84 |
-
continue
|
| 85 |
-
try:
|
| 86 |
-
json.dumps(value)
|
| 87 |
-
except (TypeError, ValueError):
|
| 88 |
-
value = repr(value)
|
| 89 |
-
data[key] = value
|
| 90 |
-
return json.dumps(data, ensure_ascii=False, default=repr)
|
| 91 |
-
|
| 92 |
-
|
| 93 |
-
def build_config(level: str = "INFO") -> dict:
|
| 94 |
-
"""Return a ``logging.config.dictConfig`` that routes vLLM + uvicorn through
|
| 95 |
-
:class:`JsonFormatter` on stdout at ``level``."""
|
| 96 |
-
level = (level or "INFO").upper()
|
| 97 |
-
handler = {
|
| 98 |
-
"class": "logging.StreamHandler",
|
| 99 |
-
"formatter": "json",
|
| 100 |
-
"stream": "ext://sys.stdout",
|
| 101 |
-
}
|
| 102 |
-
logger = {"handlers": ["stdout"], "level": level, "propagate": False}
|
| 103 |
-
return {
|
| 104 |
-
"version": 1,
|
| 105 |
-
# Keep vLLM's own loggers; we only swap their formatting/handler.
|
| 106 |
-
"disable_existing_loggers": False,
|
| 107 |
-
"formatters": {"json": {"()": "vllm_logging.JsonFormatter"}},
|
| 108 |
-
"handlers": {"stdout": handler},
|
| 109 |
-
"loggers": {name: dict(logger) for name in ("vllm", "uvicorn", "uvicorn.access", "uvicorn.error")},
|
| 110 |
-
"root": {"handlers": ["stdout"], "level": level},
|
| 111 |
-
}
|
| 112 |
-
|
| 113 |
-
|
| 114 |
-
def write_config(path: str, level: str = "INFO") -> str:
|
| 115 |
-
"""Write the dictConfig JSON to ``path`` (for ``VLLM_LOGGING_CONFIG_PATH``)."""
|
| 116 |
-
with open(path, "w", encoding="utf-8") as fh:
|
| 117 |
-
json.dump(build_config(level), fh)
|
| 118 |
-
return path
|
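To illustrate how the deleted formatter folded caller-supplied extras into each JSON line, here is a trimmed, stdlib-only sketch. It is not the removed module: `MiniJsonFormatter`, `demo`, and the `vllm.demo` logger name are hypothetical, and the reserved-attribute set is derived from a blank `LogRecord` rather than listed by hand as the original did.

```python
import io
import json
import logging

# Attribute names every LogRecord carries; anything else arrived via ``extra=``.
_STANDARD = set(logging.LogRecord("", 0, "", 0, "", (), None).__dict__) | {"message", "asctime"}


class MiniJsonFormatter(logging.Formatter):
    """Hypothetical, trimmed stand-in for the deleted ``JsonFormatter``."""

    def format(self, record: logging.LogRecord) -> str:
        data = {"level": record.levelname, "logger": record.name, "msg": record.getMessage()}
        for key, value in record.__dict__.items():
            if key not in _STANDARD and not key.startswith("_"):
                data[key] = value
        # default=repr keeps a non-serialisable extra from raising inside logging.
        return json.dumps(data, default=repr)


def demo() -> dict:
    """Log one record with extras into a buffer and return the parsed JSON line."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(MiniJsonFormatter())
    logger = logging.getLogger("vllm.demo")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.info("finished %s", "request", extra={"request_id": "req-1", "obj": object()})
    return json.loads(stream.getvalue())
```

Calling `demo()` returns a dict holding the fixed keys plus `request_id` and a `repr` string for the non-serialisable `obj`, which is the fallback behaviour the original comments describe.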
src/observability/logging_setup.py
CHANGED
|
@@ -1,7 +1,6 @@
|
|
| 1 |
"""Root logging configuration - structured records to stdout and to the store.
|
| 2 |
|
| 3 |
-
|
| 4 |
-
the whole engine, and adds:
|
| 5 |
|
| 6 |
* a :class:`_ContextFilter` that stamps every record with the bound
|
| 7 |
run/turn/agent (see :mod:`src.observability.context`); and
|
|
|
|
| 1 |
"""Root logging configuration - structured records to stdout and to the store.
|
| 2 |
|
| 3 |
+
Provides a dependency-free JSON formatter for the whole engine, and adds:
|
|
|
|
| 4 |
|
| 5 |
* a :class:`_ContextFilter` that stamps every record with the bound
|
| 6 |
run/turn/agent (see :mod:`src.observability.context`); and
|
tests/test_modal_build_command.py
CHANGED
|
@@ -1,11 +1,9 @@
|
|
| 1 |
-
"""Guard the
|
| 2 |
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
β and these tests pin both, plus the force-disable token, since this is the first
|
| 8 |
-
test to assert on ``build_command``'s output at all.
|
| 9 |
|
| 10 |
``modal/service.py`` does ``import modal`` and ``from catalogue import …``, so we
|
| 11 |
load it exactly the way ``modal deploy`` does: with ``modal/`` on ``sys.path`` (the
|
|
@@ -16,6 +14,7 @@ binds the installed SDK, not the folder).
|
|
| 16 |
from __future__ import annotations
|
| 17 |
|
| 18 |
import importlib
|
|
|
|
| 19 |
import sys
|
| 20 |
from pathlib import Path
|
| 21 |
|
|
@@ -37,109 +36,105 @@ def _make(service, **kwargs):
|
|
| 37 |
return service.ModelConfig(name="acme/Tiny-1B", endpoint_name="tiny-1b", **kwargs)
|
| 38 |
|
| 39 |
|
| 40 |
-
|
|
|
|
|
|
|
| 41 |
|
| 42 |
|
| 43 |
-
|
| 44 |
-
cmd = service.build_command(_make(service))
|
| 45 |
-
assert "--quantization" not in cmd
|
| 46 |
-
assert "--kv-cache-dtype" not in cmd
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
def test_per_model_quantization_emits_flag(service):
|
| 50 |
-
cmd = service.build_command(_make(service, quantization="fp8"))
|
| 51 |
-
assert cmd[cmd.index("--quantization") + 1] == "fp8"
|
| 52 |
-
|
| 53 |
|
| 54 |
-
def test_per_model_kv_cache_dtype_emits_flag(service):
|
| 55 |
-
cmd = service.build_command(_make(service, kv_cache_dtype="fp8"))
|
| 56 |
-
assert cmd[cmd.index("--kv-cache-dtype") + 1] == "fp8"
|
| 57 |
|
| 58 |
-
|
| 59 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 60 |
|
| 61 |
|
| 62 |
-
def
|
| 63 |
-
|
| 64 |
-
|
| 65 |
-
|
|
|
|
| 66 |
|
| 67 |
|
| 68 |
-
|
| 69 |
-
monkeypatch.setattr(service, "QUANTIZATION", "awq")
|
| 70 |
-
cmd = service.build_command(_make(service, quantization="fp8"))
|
| 71 |
-
assert cmd[cmd.index("--quantization") + 1] == "awq"
|
| 72 |
|
| 73 |
|
| 74 |
-
|
| 75 |
-
|
| 76 |
-
|
| 77 |
-
|
| 78 |
-
|
| 79 |
-
assert "--quantization" not in cmd
|
| 80 |
|
| 81 |
|
| 82 |
-
def
|
| 83 |
-
|
| 84 |
-
|
| 85 |
-
| 86 |
|
| 87 |
|
| 88 |
-
|
|
|
|
|
|
|
| 89 |
|
| 90 |
|
| 91 |
-
def
|
| 92 |
-
|
| 93 |
-
|
| 94 |
-
|
| 95 |
-
|
| 96 |
-
assert "--
|
| 97 |
-
|
| 98 |
-
|
|
|
|
|
|
|
|
|
|
| 99 |
|
| 100 |
|
| 101 |
-
def
|
| 102 |
-
|
| 103 |
-
|
| 104 |
-
monkeypatch.setattr(service, "KV_CACHE_DTYPE", "fp8")
|
| 105 |
-
cmd = service.build_command(_make(service, gpu_snapshot=True))
|
| 106 |
-
assert "--kv-cache-dtype" not in cmd
|
| 107 |
|
| 108 |
|
| 109 |
-
def
|
| 110 |
-
|
| 111 |
-
|
| 112 |
-
assert "--kv-cache-dtype" not in cmd
|
| 113 |
|
| 114 |
|
| 115 |
-
|
| 116 |
-
# The guard only fires on fp8; a non-fp8 dtype passes through even with snapshot.
|
| 117 |
-
cmd = service.build_command(_make(service, kv_cache_dtype="auto", gpu_snapshot=True))
|
| 118 |
-
assert cmd[cmd.index("--kv-cache-dtype") + 1] == "auto"
|
| 119 |
|
| 120 |
|
| 121 |
-
def
|
| 122 |
-
|
| 123 |
-
cmd =
|
| 124 |
-
assert cmd[cmd.index("--kv-cache-dtype") + 1] == "fp8"
|
| 125 |
|
| 126 |
|
| 127 |
# ── deploy script wiring ───────────────────────────────────────────────────────
|
| 128 |
|
| 129 |
|
| 130 |
-
def
|
| 131 |
sys.path.insert(0, str(Path(__file__).resolve().parents[1] / "scripts"))
|
| 132 |
deploy_modal = importlib.import_module("deploy_modal")
|
| 133 |
from argparse import Namespace
|
| 134 |
|
| 135 |
-
|
| 136 |
-
|
| 137 |
-
assert
|
| 138 |
-
|
| 139 |
-
# ``--quantization none`` (force full precision) is still propagated, not dropped.
|
| 140 |
-
env_none = deploy_modal._env_for(Namespace(quantization="none", **base))
|
| 141 |
-
assert env_none["MODAL_LLM_QUANTIZATION"] == "none"
|
| 142 |
|
| 143 |
-
#
|
| 144 |
-
|
| 145 |
-
assert "
|
|
|
|
|
|
| 1 |
+
"""Guard the ``vllm serve`` argv that ``build_command`` emits.
|
| 2 |
|
| 3 |
+
The serving layer turns one ``ModelConfig`` into the argv launched inside the
|
| 4 |
+
container, so these tests pin the mapping from config fields to vLLM flags: the
|
| 5 |
+
always-present identity flags, the data-driven toggles (parsers, eager, prefix
|
| 6 |
+
caching), and the ``extra_vllm_args`` escape hatch.
|
|
|
|
|
|
|
| 7 |
|
| 8 |
``modal/service.py`` does ``import modal`` and ``from catalogue import …``, so we
|
| 9 |
load it exactly the way ``modal deploy`` does: with ``modal/`` on ``sys.path`` (the
|
|
|
|
| 14 |
from __future__ import annotations
|
| 15 |
|
| 16 |
import importlib
|
| 17 |
+
import json
|
| 18 |
import sys
|
| 19 |
from pathlib import Path
|
| 20 |
|
|
|
|
| 36 |
return service.ModelConfig(name="acme/Tiny-1B", endpoint_name="tiny-1b", **kwargs)
|
| 37 |
|
| 38 |
|
| 39 |
+
def _flag_value(cmd: list[str], flag: str) -> str:
|
| 40 |
+
"""The argument that follows ``flag`` in the argv."""
|
| 41 |
+
return cmd[cmd.index(flag) + 1]
|
| 42 |
|
| 43 |
|
| 44 |
+
# ── always-present identity flags ──────────────────────────────────────────────
|
| 45 |
|
|
|
|
|
|
|
|
|
|
| 46 |
|
| 47 |
+
def test_serves_the_model_with_identity_flags(service):
|
| 48 |
+
cmd = service.build_command(_make(service))
|
| 49 |
+
assert cmd[:3] == ["vllm", "serve", "acme/Tiny-1B"]
|
| 50 |
+
# served-model-name defaults to the repo name (clients pass the repo id).
|
| 51 |
+
assert _flag_value(cmd, "--served-model-name") == "acme/Tiny-1B"
|
| 52 |
+
assert _flag_value(cmd, "--port") == str(service.VLLM_PORT)
|
| 53 |
+
assert _flag_value(cmd, "--tensor-parallel-size") == "1"
|
| 54 |
|
| 55 |
|
| 56 |
+
def test_served_model_name_alias(service):
|
| 57 |
+
cmd = service.build_command(_make(service, served_model_name="acme/Tiny"))
|
| 58 |
+
assert _flag_value(cmd, "--served-model-name") == "acme/Tiny"
|
| 59 |
+
# but vLLM still loads the real repo (positional arg)
|
| 60 |
+
assert cmd[2] == "acme/Tiny-1B"
|
| 61 |
|
| 62 |
|
| 63 |
+
# ── data-driven toggles ────────────────────────────────────────────────────────
|
|
|
|
|
|
|
|
|
|
| 64 |
|
| 65 |
|
| 66 |
+
def test_prefix_caching_on_by_default_off_when_disabled(service):
|
| 67 |
+
assert "--enable-prefix-caching" in service.build_command(_make(service))
|
| 68 |
+
off = service.build_command(_make(service, enable_prefix_caching=False))
|
| 69 |
+
assert "--no-enable-prefix-caching" in off
|
| 70 |
+
assert "--enable-prefix-caching" not in off
|
|
|
|
| 71 |
|
| 72 |
|
| 73 |
+
def test_optional_inference_flags_emitted(service):
|
| 74 |
+
cmd = service.build_command(
|
| 75 |
+
_make(
|
| 76 |
+
service,
|
| 77 |
+
max_model_len=8192,
|
| 78 |
+
trust_remote_code=True,
|
| 79 |
+
enforce_eager=True,
|
| 80 |
+
gpu_memory_utilization=0.9,
|
| 81 |
+
)
|
| 82 |
+
)
|
| 83 |
+
assert _flag_value(cmd, "--max-model-len") == "8192"
|
| 84 |
+
assert "--trust-remote-code" in cmd
|
| 85 |
+
assert "--enforce-eager" in cmd
|
| 86 |
+
assert _flag_value(cmd, "--gpu-memory-utilization") == "0.9"
|
| 87 |
|
| 88 |
|
| 89 |
+
def test_async_scheduling_default_on_off_when_disabled(service):
|
| 90 |
+
assert "--async-scheduling" in service.build_command(_make(service))
|
| 91 |
+
assert "--async-scheduling" not in service.build_command(_make(service, async_scheduling=False))
|
| 92 |
|
| 93 |
|
| 94 |
+
def test_parser_flags(service):
|
| 95 |
+
cmd = service.build_command(
|
| 96 |
+
_make(service, reasoning_parser="qwen3", tool_call_parser="hermes", enable_auto_tool_choice=True)
|
| 97 |
+
)
|
| 98 |
+
assert _flag_value(cmd, "--reasoning-parser") == "qwen3"
|
| 99 |
+
assert _flag_value(cmd, "--tool-call-parser") == "hermes"
|
| 100 |
+
assert "--enable-auto-tool-choice" in cmd
|
| 101 |
+
# None parsers emit nothing.
|
| 102 |
+
bare = service.build_command(_make(service))
|
| 103 |
+
assert "--reasoning-parser" not in bare
|
| 104 |
+
assert "--tool-call-parser" not in bare
|
| 105 |
|
| 106 |
|
| 107 |
+
def test_mm_limits_serialized_as_json(service):
|
| 108 |
+
cmd = service.build_command(_make(service, mm_limits={"image": 0, "audio": 0}))
|
| 109 |
+
assert json.loads(_flag_value(cmd, "--limit-mm-per-prompt")) == {"image": 0, "audio": 0}
|
|
|
|
|
|
|
|
|
|
| 110 |
|
| 111 |
|
| 112 |
+
def test_log_requests_default_on(service):
|
| 113 |
+
assert "--enable-log-requests" in service.build_command(_make(service))
|
| 114 |
+
assert "--enable-log-requests" not in service.build_command(_make(service, log_requests=False))
|
|
|
|
| 115 |
|
| 116 |
|
| 117 |
+
# ── escape hatch ───────────────────────────────────────────────────────────────
|
|
|
|
|
|
|
|
|
|
| 118 |
|
| 119 |
|
| 120 |
+
def test_extra_vllm_args_appended_verbatim(service):
|
| 121 |
+
cmd = service.build_command(_make(service, extra_vllm_args=("--quantization", "fp8")))
|
| 122 |
+
assert cmd[-2:] == ["--quantization", "fp8"]
|
|
|
|
| 123 |
|
| 124 |
|
| 125 |
# ── deploy script wiring ───────────────────────────────────────────────────────
|
| 126 |
|
| 127 |
|
| 128 |
+
def test_deploy_script_propagates_knob_envs():
|
| 129 |
sys.path.insert(0, str(Path(__file__).resolve().parents[1] / "scripts"))
|
| 130 |
deploy_modal = importlib.import_module("deploy_modal")
|
| 131 |
from argparse import Namespace
|
| 132 |
|
| 133 |
+
env = deploy_modal._env_for(Namespace(keep_warm=True, auth=True))
|
| 134 |
+
assert env["MODAL_LLM_KEEP_WARM"] == "1"
|
| 135 |
+
assert env["MODAL_LLM_REQUIRE_AUTH"] == "1"
|
|
|
|
|
|
|
|
|
|
|
|
|
| 136 |
|
| 137 |
+
# Both off: neither env var is set (so endpoints stay public + scale-to-zero).
|
| 138 |
+
env_off = deploy_modal._env_for(Namespace(keep_warm=False, auth=False))
|
| 139 |
+
assert "MODAL_LLM_KEEP_WARM" not in env_off
|
| 140 |
+
assert "MODAL_LLM_REQUIRE_AUTH" not in env_off
|
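The assertion style used throughout the new tests can be tried without Modal or vLLM installed. The sketch below copies the `_flag_value` idea onto a hand-written argv; the argv contents and the port value `8000` are assumptions for illustration, not output of the real `build_command`.

```python
import json


def flag_value(cmd: list[str], flag: str) -> str:
    """Return the argument that follows ``flag`` in the argv."""
    return cmd[cmd.index(flag) + 1]


# Hypothetical argv shaped like the one the tests inspect; the port is assumed.
argv = [
    "vllm", "serve", "acme/Tiny-1B",
    "--port", "8000",
    "--limit-mm-per-prompt", json.dumps({"image": 0, "audio": 0}),
    "--enforce-eager",
]

if __name__ == "__main__":
    print(flag_value(argv, "--port"))
    print(json.loads(flag_value(argv, "--limit-mm-per-prompt")))
```

Because `list.index` raises `ValueError` when the flag is absent, a missing flag fails the test loudly instead of silently returning a neighbouring argument. Parsing the `--limit-mm-per-prompt` value with `json.loads` keeps the check independent of key order and whitespace.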
tests/test_modal_endpoint_urls.py
CHANGED
|
@@ -16,8 +16,6 @@ import importlib.util
|
|
| 16 |
import sys
|
| 17 |
from pathlib import Path
|
| 18 |
|
| 19 |
-
import pytest
|
| 20 |
-
|
| 21 |
_CATALOGUE_PATH = Path(__file__).resolve().parents[1] / "modal" / "catalogue.py"
|
| 22 |
|
| 23 |
# Max length of a single DNS label (RFC 1035). The whole subdomain before
|
|
|
|
| 16 |
import sys
|
| 17 |
from pathlib import Path
|
| 18 |
|
|
|
|
|
|
|
| 19 |
_CATALOGUE_PATH = Path(__file__).resolve().parents[1] / "modal" / "catalogue.py"
|
| 20 |
|
| 21 |
# Max length of a single DNS label (RFC 1035). The whole subdomain before
|