Space: build-small-hackathon/multi-agent-lab

agharsallah committed 23 days ago

Commit 5d4ef87 (1 parent: e3ba862)

Refactor modal service and logging setup

- Updated the service module to streamline model registration and improve clarity in the handling of model configurations.
- Removed the vllm_logging module, integrating its functionality into the main logging setup for better maintainability and consistency.
- Simplified the build_command function by removing unnecessary precision handling and logging configurations.
- Enhanced test coverage for the build_command function, ensuring proper flag emissions and configurations.
- Cleaned up deprecated snapshot model handling and adjusted related tests for clarity and accuracy.
- Improved documentation throughout the service module to better reflect current functionality and design goals.

Files changed (12)

docs/adr/0030-gpu-memory-snapshots-cold-start.md +6 -1
docs/adr/0031-fp8-quantization-control.md +6 -2
docs/adr/0034-simplify-modal-serving-to-canonical-vllm.md +95 -0
modal/README.md +7 -7
modal/catalogue.py +58 -101
modal/docs/deploying.md +64 -129
modal/healthcheck.py +51 -51
modal/service.py +37 -326
modal/vllm_logging.py +0 -118
src/observability/logging_setup.py +1 -2
tests/test_modal_build_command.py +74 -79
tests/test_modal_endpoint_urls.py +0 -2

docs/adr/0030-gpu-memory-snapshots-cold-start.md CHANGED

@@ -2,7 +2,12 @@
 ## Status
-Accepted (extends [ADR-0014 *Modal model serving*](0014-modal-model-serving.md),
 [ADR-0019](0019-single-model-catalogue-no-cloud-path.md))
 ## Context

 ## Status
+**Superseded by [ADR-0034 *Simplify the Modal serving layer*](0034-simplify-modal-serving-to-canonical-vllm.md)**
+— the snapshot lifecycle was removed for being alpha and error-prone; cold starts
+now rely on the shared compile/weight caches plus the retained `MODAL_LLM_KEEP_WARM`
+demo switch. The historical context below stands.
+Originally Accepted (extended [ADR-0014 *Modal model serving*](0014-modal-model-serving.md),
 [ADR-0019](0019-single-model-catalogue-no-cloud-path.md))
 ## Context

docs/adr/0031-fp8-quantization-control.md CHANGED

@@ -2,8 +2,12 @@
 ## Status
-Accepted (extends [ADR-0014 *Modal model serving*](0014-modal-model-serving.md),
-[ADR-0019](0019-single-model-catalogue-no-cloud-path.md); interacts with
 [ADR-0030 *GPU memory snapshots*](0030-gpu-memory-snapshots-cold-start.md))
 ## Context

 ## Status
+**Superseded by [ADR-0034 *Simplify the Modal serving layer*](0034-simplify-modal-serving-to-canonical-vllm.md)**
+— the env-controlled quantization machinery was removed; lower precision is now
+reached via a model's `extra_vllm_args`. The historical context below stands.
+Originally Accepted (extended [ADR-0014 *Modal model serving*](0014-modal-model-serving.md),
+[ADR-0019](0019-single-model-catalogue-no-cloud-path.md); interacted with
 [ADR-0030 *GPU memory snapshots*](0030-gpu-memory-snapshots-cold-start.md))
 ## Context

docs/adr/0034-simplify-modal-serving-to-canonical-vllm.md ADDED

@@ -0,0 +1,95 @@

+# ADR-0034: Simplify the Modal serving layer to the canonical vLLM recipe
+## Status
+Accepted. **Supersedes [ADR-0030 *GPU memory snapshots*](0030-gpu-memory-snapshots-cold-start.md)
+and [ADR-0031 *FP8 quantization control*](0031-fp8-quantization-control.md).**
+Extends [ADR-0014 *Modal model serving*](0014-modal-model-serving.md) and
+[ADR-0019](0019-single-model-catalogue-no-cloud-path.md).
+## Context
+`modal/service.py` had grown to ~500 lines by accreting three optional
+subsystems on top of the plain vLLM web-server path from ADR-0014:
+- **GPU memory snapshots** (ADR-0030) — a class-based sleep→snapshot→wake
+  lifecycle, a second registration shape, and `enable_gpu_snapshot` (Modal
+  *alpha*).
+- **FP8 / quantization control** (ADR-0031) — a deploy-time env-override resolver
+  plus a workaround for FP8-KV-cache crashing the snapshot wake path.
+- **Structured JSON logging** — a `vllm_logging.py` formatter shipped into the
+  image and wired through a generated `dictConfig`.
+In practice this surface was the source of the errors, not a benefit:
+- The snapshot lifecycle is alpha and fragile — the documented FP8×snapshot
+  wake-path crash (ADR-0031) is one instance; the hand-folded URL label and
+  cloudpickled-closure constraints are others. Hard to deploy, hard to debug.
+- The FP8 machinery defaulted to `None` on **every** model — pure surface area
+  with no model actually using it.
+- JSON logging defaulted **off** — more surface area, off by default.
+- Per-model configs had drifted from the models' real serving requirements
+  (e.g. the Gemma 4 26B was pinned to a nightly vLLM it doesn't need).
+The working core is small and is exactly Modal's current canonical vLLM example:
+an autoscaling `@app.function` + `@modal.concurrent` + `@modal.web_server` whose
+body runs `subprocess.Popen(["vllm", "serve", ...])`.
+## Decision
+**1. One serving path.** `register_model()` only registers the plain
+`@app.function` web server. The snapshot class lifecycle
+(`_register_snapshot_model`, `_class_name`, sleep/wake, `enable_gpu_snapshot`) is
+deleted. `service.py` drops from ~500 to ~210 lines.
+**2. Quantization moves to the escape hatch.** The `MODAL_LLM_QUANTIZATION` /
+`MODAL_LLM_KV_CACHE_DTYPE` env resolver, the `quantization` / `kv_cache_dtype`
+`ModelConfig` fields, and the FP8×snapshot workaround are removed. A model that
+wants lower precision passes the flags through the existing `extra_vllm_args`
+(`("--quantization", "fp8")`). Quantization was always opt-in and never on; this
+keeps it possible without standing machinery.
+**3. JSON logging is removed.** `vllm_logging.py` is deleted along with the
+`MODAL_LLM_JSON_LOGS` / `MODAL_LLM_LOG_LEVEL` wiring. Modal captures
+stdout/stderr; `--enable-log-requests` (kept, via `log_requests`) gives
+per-request detail.
+**4. `ModelConfig` is trimmed** to the fields the one path actually reads.
+Removed: `gpu_snapshot`, `quantization`, `kv_cache_dtype`, `max_num_seqs`,
+`max_num_batched_tokens`, `target_concurrent_inputs`, `buffer_containers`,
+`log_outputs`, `max_log_len`, `uvicorn_access_log`, `multimodal`. The autoscale
+target is computed inline (~75% of `max_concurrent_inputs`); anything exotic uses
+`extra_vllm_args`.
+**5. Per-model configs re-grounded in each model's documentation** (verified
+against the HF model cards + vLLM recipes, June 2026):
+| Model | Correction |
+| --- | --- |
+| Gemma 4 **26B-A4B** | Standard `gemma4` MoE — serves on the **pinned stable vLLM**. Dropped the nightly pin, `transformers>=5.10.2`, the unverified `VLLM_USE_FLASHINFER_SAMPLER=0`, and `enforce_eager` (native path → CUDA graphs work). |
+| Gemma 4 **12B** | `gemma4_unified` (encoder-free) has no class in any stable vLLM ≤0.22.1 → **keeps** `vllm_version="nightly"` + `transformers>=5.10.2`; dropped the unverified flashinfer env. |
+| Nemotron Nano **4B / 30B** | Hybrid-Mamba; `trust_remote_code` kept. Served as plain chat — NVIDIA's `nano_v3` reasoning parser ships as a downloadable *plugin file* and is omitted for boot-robustness (addable via `extra_vllm_args` later). 30B params corrected 30→31. |
+| Nemotron **Cascade-14B** | Confirmed stock Qwen3 — `reasoning_parser="qwen3"` + `tool_call_parser="hermes"` are correct and built-in; kept. |
+| MiniCPM **4.1-8B** | `trust_remote_code` kept; no tool parser (custom `<|tool_call_start|>` format — engine uses guided decoding per ADR-0016). Serves on the pinned stable. |
+| MiniCPM **-o 4.5** | Params corrected 8→9B; served text+image (audio over vLLM is experimental — the documented `transformers==4.51.0` pin conflicts with vLLM's bundled version, so we keep the lean preprocessing deps). |
+## Consequences
+- **Far smaller blast radius.** One registration shape, no alpha features, no
+  generated log config, no precision resolver. The thing that errored is gone.
+- **Cold starts** now rely on the always-on shared caches (weights + compiled
+  graphs on Volumes) and the retained `MODAL_LLM_KEEP_WARM` demo-day switch
+  (mechanism 2 of ADR-0030, the robust half). We trade snapshot's seconds-from-
+  cold for simplicity; keep-warm covers the live-demo first-30-seconds bar.
+- **Quantization / batch caps** are still reachable via `extra_vllm_args`, just
+  not first-class fields. If a model later needs standing FP8, re-promote a typed
+  field then — but not speculatively.
+- **Gemma 4 26B is cheaper and more robust** off the nightly: it's a tier
+  default (`strong`), so removing its nightly dependency removes a recurring
+  break. Only the 12B remains on nightly, where it's unavoidable.
+- **Prize impact unchanged.** All seven models and all four provider tracks
+  (OpenAI-compatible, MiniCPM, Nemotron, Gemma) still deploy; the no-API-key
+  deterministic stub is untouched. The serving path stays demo-ready for the
+  Modal Awards, now without the alpha-feature risk on stage.
+- **Tests** for the removed precision/snapshot behaviour are replaced by tests
+  that pin the simplified `build_command` argv. Full suite stays green.
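
A minimal sketch of what a simplified `build_command` could look like under ADR-0034, assuming a cut-down `ModelConfig` and a straightforward field-to-flag mapping. The real `service.py` is not shown in this diff, so the class subset, the `port` parameter, the flag order, and the JSON form passed to `--limit-mm-per-prompt` are all assumptions for illustration. It shows the "escape hatch" from Decision 2: `extra_vllm_args` is appended verbatim, so `("--quantization", "fp8")` needs no typed field.

```python
import json
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class ModelConfig:
    """Illustrative subset of the trimmed config, not the class in catalogue.py."""

    name: str
    max_model_len: Optional[int] = None
    trust_remote_code: bool = False
    enforce_eager: bool = False
    log_requests: bool = True
    reasoning_parser: Optional[str] = None
    tool_call_parser: Optional[str] = None
    enable_auto_tool_choice: bool = False
    mm_limits: Optional[Dict[str, int]] = None
    extra_vllm_args: Tuple[str, ...] = ()


def build_command(cfg: ModelConfig, port: int = 8000) -> List[str]:
    """Return a `vllm serve` argv for one model (hypothetical flag mapping)."""
    cmd = ["vllm", "serve", cfg.name, "--host", "0.0.0.0", "--port", str(port)]
    if cfg.max_model_len is not None:
        cmd += ["--max-model-len", str(cfg.max_model_len)]
    if cfg.trust_remote_code:
        cmd.append("--trust-remote-code")
    if cfg.enforce_eager:
        cmd.append("--enforce-eager")
    if cfg.log_requests:
        cmd.append("--enable-log-requests")
    if cfg.reasoning_parser:
        cmd += ["--reasoning-parser", cfg.reasoning_parser]
    if cfg.tool_call_parser:
        cmd += ["--tool-call-parser", cfg.tool_call_parser]
    if cfg.enable_auto_tool_choice:
        cmd.append("--enable-auto-tool-choice")
    if cfg.mm_limits is not None:
        # Assumption: per-prompt caps are passed as a JSON object.
        cmd += ["--limit-mm-per-prompt", json.dumps(cfg.mm_limits)]
    # Escape hatch: anything exotic is appended verbatim, last.
    cmd += list(cfg.extra_vllm_args)
    return cmd


# Usage: the Cascade-14B settings from the catalogue, plus opt-in FP8.
cascade = ModelConfig(
    name="nvidia/Nemotron-Cascade-14B-Thinking",
    max_model_len=32768,
    reasoning_parser="qwen3",
    tool_call_parser="hermes",
    enable_auto_tool_choice=True,
    extra_vllm_args=("--quantization", "fp8"),
)
print(" ".join(build_command(cascade)))
```

Because the function is a pure config-to-argv mapping, tests can pin the exact list it returns without starting vLLM or Modal, which is the style of test the commit message describes.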

modal/README.md CHANGED

@@ -18,8 +18,6 @@ modal/
   app_nvidia.py     App "nvidia-llms"  — Nemotron 3 Nano 4B + 30B, Cascade 14B Thinking.
   app_openbmb.py    App "openbmb-llms" — MiniCPM4.1-8B + MiniCPM-o 4.5.
   app_google.py     App "google-llms"  — Gemma 4 12B + 26B.
-  vllm_logging.py   Dependency-free JSON log formatter shipped into the image
-                    when MODAL_LLM_JSON_LOGS=1 (structured logs via vLLM dictConfig).
   client.py         OpenAI-SDK smoke-test client for any endpoint.
   openapi.yaml      Checked-in OpenAPI 3.1 spec for the served API surface.
   pyproject.toml    uv workspace member (deploy/client tooling; non-package).
@@ -71,11 +69,13 @@ sizing, and how to add models/providers or wire endpoints into the engine.
   radius; one provider's outage or redeploy never touches another.
 - **Scalable** — serverless autoscaling, input concurrency, a shared weight
   cache (pull once, warm everywhere), and per-model `min_containers` warm pools.
-- **Fast cold starts** — snapshot-enabled models (`gpu_snapshot=True`) restore a
-  pre-warmed engine from a Modal memory snapshot in seconds instead of re-paying
-  download + load + warmup; `MODAL_LLM_KEEP_WARM=1` at deploy time pins warm
-  containers for the tier models on demo day. See
-  [`docs/deploying.md` → Cold starts](docs/deploying.md#cold-starts) (ADR-0030).
 - **Extensible** — add a model = one `ModelConfig` in `catalogue.py`; add a
   provider = one `Provider` entry + one app file. The serving path is written once
   in `service.py`, and the engine picks up the new model with no edits (it reads

   app_nvidia.py     App "nvidia-llms"  — Nemotron 3 Nano 4B + 30B, Cascade 14B Thinking.
   app_openbmb.py    App "openbmb-llms" — MiniCPM4.1-8B + MiniCPM-o 4.5.
   app_google.py     App "google-llms"  — Gemma 4 12B + 26B.
   client.py         OpenAI-SDK smoke-test client for any endpoint.
   openapi.yaml      Checked-in OpenAPI 3.1 spec for the served API surface.
   pyproject.toml    uv workspace member (deploy/client tooling; non-package).
   radius; one provider's outage or redeploy never touches another.
 - **Scalable** — serverless autoscaling, input concurrency, a shared weight
   cache (pull once, warm everywhere), and per-model `min_containers` warm pools.
+- **One serving path** — Modal's canonical vLLM recipe (an autoscaling
+  `@app.function` launching `vllm serve` behind a `@modal.web_server`), written
+  once in `service.py`. No bespoke per-model lifecycle to break (ADR-0034).
+- **Fast cold starts on demo day** — the shared `vllm-cache` Volume persists the
+  torch.compile / CUDA-graph artifacts so only the first container compiles, and
+  `MODAL_LLM_KEEP_WARM=1` at deploy time pins one warm container per tier model.
+  See [`docs/deploying.md` → Cold starts](docs/deploying.md#cold-starts).
 - **Extensible** — add a model = one `ModelConfig` in `catalogue.py`; add a
   provider = one `Provider` entry + one app file. The serving path is written once
   in `service.py`, and the engine picks up the new model with no edits (it reads
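
The scaling knobs named in these README bullets can be sketched as two small pure functions. This is an illustration under stated assumptions, not the code in `service.py`: the helper names `target_inputs` and `effective_min_containers` are hypothetical, the rounding rule for the "~75%" autoscale target from ADR-0034 is a guess (floor, with a minimum of 1), and "tier model" is taken to mean a model whose `profile` is set.

```python
import math
import os
from typing import Mapping, Optional


def target_inputs(max_concurrent_inputs: int, fraction: float = 0.75) -> int:
    """Autoscale target: roughly 75% of the per-container ceiling, at least 1."""
    return max(1, math.floor(max_concurrent_inputs * fraction))


def effective_min_containers(
    min_containers: int,
    profile: Optional[str],
    env: Mapping[str, str] = os.environ,
) -> int:
    """Pin one warm container for tier models when MODAL_LLM_KEEP_WARM=1.

    Assumption: a model counts as a tier model when it has a profile
    ("tiny", "fast", "strong"); models without a profile are left alone.
    """
    keep_warm = env.get("MODAL_LLM_KEEP_WARM") == "1"
    if keep_warm and profile is not None:
        return max(min_containers, 1)
    return min_containers


# Usage with values from the catalogue: a ceiling of 64 inputs, tier "strong".
print(target_inputs(64))
print(effective_min_containers(0, "strong", {"MODAL_LLM_KEEP_WARM": "1"}))
```

Keeping both helpers free of Modal imports means the deploy-time behaviour can be checked in a unit test by passing an explicit `env` mapping.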

modal/catalogue.py CHANGED

@@ -72,60 +72,31 @@ class ModelConfig:
     max_model_len: int | None = None  # cap context to fit memory / task
     trust_remote_code: bool = False  # required by MiniCPM / Nemotron custom code
-    # Precision / quantization (vLLM serve flags). Both default to full precision
-    # (BF16 weights, model-dtype KV cache); set them to shrink the memory footprint
-    # so a model fits a smaller GPU or leaves more room for KV cache. A deploy-time
-    # env override (``MODAL_LLM_QUANTIZATION`` / ``MODAL_LLM_KV_CACHE_DTYPE``, read in
-    # ``service.py``) wins over these per-model values for a whole deploy. CAVEAT:
-    # on-the-fly FP8 needs an Ada/Hopper GPU (our L4/L40S/H200 all qualify) AND vLLM
-    # support for the architecture — custom-code / hybrid-mamba archs (Nemotron-H,
-    # MiniCPM) and the Transformers-backend Gemmas may fail to start under it, so these
-    # stay ``None`` until a model is verified to serve quantized. See ADR-0031.
-    quantization: str | None = None  # vLLM --quantization, on-the-fly weight quant (e.g. "fp8"); None = full BF16
-    kv_cache_dtype: str | None = None  # vLLM --kv-cache-dtype (e.g. "fp8"); None = auto (model dtype)
     # Performance / throughput (vLLM serve flags). Defaults target high
     # steady-state throughput on the common single-GPU path; tune per model.
-    # See ``service.build_command`` for how each maps to a flag.
     gpu_memory_utilization: float | None = None  # fraction of VRAM for weights + KV cache (vLLM default 0.9)
     enable_prefix_caching: bool = True  # reuse KV for shared prompt prefixes — big win when system/context repeat
     async_scheduling: bool = True  # overlap CPU request scheduling with GPU compute
     enforce_eager: bool = False  # skip CUDA-graph capture: faster cold start, lower steady-state throughput
-    max_num_seqs: int | None = None  # cap sequences batched per step (memory vs. throughput)
-    max_num_batched_tokens: int | None = None  # token budget per scheduler step (prefill throughput)
-    # Cold starts. Opt a model into Modal memory snapshots (CPU + experimental GPU
-    # snapshot): the container boots once, loads weights, warms the engine, puts it
-    # to sleep (vLLM sleep mode, weights offloaded to host RAM), and is snapshotted;
-    # every later cold start restores the snapshot and wakes the engine in seconds
-    # instead of re-paying download + load + warmup. Constraints (why this is per
-    # model, not global): single-GPU models only, the model's vLLM build must
-    # support `--enable-sleep-mode`, and host RAM must hold the offloaded weights.
-    # Modal marks GPU snapshots alpha — keep it off for exotic serving paths
-    # (Transformers-backend Gemma, the omni specialist) and flip off on any model
-    # that misbehaves; the plain serving path is unchanged.
-    gpu_snapshot: bool = False
-    # Observability / request logging (vLLM serve flags). Defaults give per-request
-    # visibility in the container logs out of the box; see ``service.build_command``.
-    log_requests: bool = True  # log each request's id, sampling params, and token counts
-    log_outputs: bool = False  # also log generated text (verbose; can echo story content) — opt-in
-    max_log_len: int | None = 2048  # truncate logged prompts/outputs to N chars (None = no cap)
-    uvicorn_access_log: bool = True  # keep uvicorn's per-request HTTP access line (method, path, status)
     # OpenAI feature parsers (vLLM names; leave None if unsupported on the model)
     reasoning_parser: str | None = None
     tool_call_parser: str | None = None
     enable_auto_tool_choice: bool = False
-    # Multimodal
-    multimodal: bool = False
-    mm_limits: dict[str, int] | None = None  # e.g. {"image": 4, "audio": 2}
     # Scaling / lifecycle
     max_concurrent_inputs: int = 64  # hard ceiling of requests multiplexed onto one container
-    target_concurrent_inputs: int | None = None  # autoscale target — scale out here, burst up to max; defaults to ~75%
-    buffer_containers: int = 0  # extra idle containers to pre-warm under active load (bursty traffic)
     scaledown_window: int = 15 * 60  # idle seconds before a container stops
     min_containers: int = 0  # keep N warm to remove cold starts (costs $)
     startup_timeout: int = 30 * 60  # weight download + load can be slow
@@ -169,31 +140,34 @@ NVIDIA_MODELS: tuple[ModelConfig, ...] = (
     ModelConfig(
         name="nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16",
         endpoint_name="nemotron-3-nano-4b",
-        # Tiny Titan tier (≤4B): comfortably fits a single 24GB L4.
         profile="tiny",
         params_b=4,
         gpu="L4:1",
         max_model_len=16384,
         trust_remote_code=True,
         gated=True,
         max_concurrent_inputs=32,
-        # Tiny tier is the cast's hottest endpoint and 4B of BF16 weights (~8GB)
-        # easily fit host RAM during sleep — the ideal snapshot candidate.
-        gpu_snapshot=True,
     ),
     ModelConfig(
         name="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
         endpoint_name="nemotron-3-nano-30b",
-        # 30B total params in BF16 (~60GB) though only ~3B activate per token.
-        # An alternate strong model — not cast to a profile by default.
-        # No gpu_snapshot: sleep mode would offload ~60GB of weights to host RAM,
-        # past what a default container comfortably holds.
-        params_b=30,
         gpu="H200:1",
         max_model_len=32768,
         trust_remote_code=True,
         gated=True,
         max_concurrent_inputs=64,
     ),
     ModelConfig(
         name="nvidia/Nemotron-Cascade-14B-Thinking",
@@ -210,15 +184,13 @@ NVIDIA_MODELS: tuple[ModelConfig, ...] = (
         params_b=14,
         gpu="L40S:1",
         max_model_len=32768,
-        # Qwen3-native in vLLM (no custom code); ChatML template with a thinking
-        # block parsed by the Qwen3 reasoning parser.
         reasoning_parser="qwen3",
         tool_call_parser="hermes",
         enable_auto_tool_choice=True,
         max_concurrent_inputs=48,
-        # Qwen3-native single-GPU path on the pinned vLLM — snapshot-safe, and a
-        # reasoning model is exactly where a multi-minute cold start hurts most.
-        gpu_snapshot=True,
     ),
 )
@@ -234,28 +206,31 @@ OPENBMB_MODELS: tuple[ModelConfig, ...] = (
         max_model_len=32768,
         trust_remote_code=True,
         max_concurrent_inputs=48,
-        # Fast tier default for the cast; 8B BF16 (~16GB) offloads to host RAM
-        # fine. Sleep mode is allocator-level, so the custom MiniCPM modeling
-        # code doesn't affect it.
-        gpu_snapshot=True,
         # No tool_call_parser on purpose: MiniCPM4.1 emits a custom
-        # <|tool_call_start|> format vLLM 0.21.0 has no parser for, so tool-call
-        # structured output 400s here. The engine's structured path uses vLLM
         # guided decoding (response_format json_schema) instead, which is
         # parser-independent — see ADR-0016. Don't bolt on a mismatched parser.
     ),
     ModelConfig(
         name="openbmb/MiniCPM-o-4_5",
         endpoint_name="minicpm-o-4-5",
-        # Omni-modal (text + vision + audio). Needs custom code and media backends.
-        # A specialist model — not cast to a profile by default.
-        params_b=8,
         gpu="L40S:1",
         trust_remote_code=True,
-        multimodal=True,
-        mm_limits={"image": 4, "audio": 2, "video": 1},
-        # Audio/vision preprocessing backends pulled into the image.
         extra_pip=("librosa", "soundfile", "timm"),
         max_concurrent_inputs=16,
         # Custom omni-modal code path: keep the async scheduler off (conservative
         # — it's a specialist, not on the default cast). Prefix caching stays on.
@@ -285,36 +260,25 @@ GOOGLE_MODELS: tuple[ModelConfig, ...] = (
         tool_call_parser="gemma4",
         enable_auto_tool_choice=True,
         max_concurrent_inputs=48,
-        # Served via vLLM's Transformers modeling backend (gemma4_unified has no
-        # native vLLM class), which runs eager-only — CUDA-graph capture and the
-        # async scheduler aren't supported on that path, so disable both here.
-        # Prefix caching still applies and stays on (the default). gpu_snapshot
-        # stays off too: sleep mode on the nightly Transformers backend is
-        # unverified, and the Gemmas already skip the costliest warmup (no
-        # CUDA-graph capture).
-        enforce_eager=True,
-        async_scheduling=False,
-        # Text-only in the cast (vision/audio is the MiniCPM-o specialist's job).
-        # vLLM auto-detects gemma4_unified as multimodal and otherwise spends a big
-        # slice of cold-start profiling a *video* encoder we never call (and the MM
-        # warmup fails anyway). Zeroing the per-prompt MM limits disables that whole
-        # path — faster start, less GPU memory, more KV cache.
-        mm_limits={"image": 0, "audio": 0, "video": 0},
-        # gemma4_unified uses *variable* head dims (256 on sliding-attention layers,
-        # 512 on full-attention ones). vLLM <= 0.22.1 (incl. the pinned 0.21.0) sizes
-        # the o_proj from a uniform head_dim and dies on the full-attention layers
-        # with "mat1 and mat2 shapes cannot be multiplied". Only a vLLM nightly serves
-        # gemma4_unified, paired with transformers >= 5.10.2 (which adds the arch) and
-        # the FlashInfer sampler off (its JIT path breaks on these builds). All three
-        # are scoped to this model, so NVIDIA/OpenBMB stay on the reproducible pin.
         vllm_version="nightly",
         extra_pip=("transformers>=5.10.2",),
-        env={"VLLM_USE_FLASHINFER_SAMPLER": "0"},
     ),
     ModelConfig(
         name="google/gemma-4-26B-A4B-it",
         endpoint_name="gemma-4-26b",
-        # MoE: ~26B total params (~4B active). Gated repo — needs an HF token.
         profile="strong",
         params_b=26,
         gpu="H200:1",
@@ -324,18 +288,11 @@ GOOGLE_MODELS: tuple[ModelConfig, ...] = (
         tool_call_parser="gemma4",
         enable_auto_tool_choice=True,
         max_concurrent_inputs=64,
-        # Transformers modeling backend (see the 12B above): eager-only, so no
-        # CUDA graphs / async scheduler. Prefix caching stays on by default.
-        enforce_eager=True,
-        async_scheduling=False,
-        # Text-only in the cast — disable the auto-detected multimodal (video)
-        # encoder to cut cold-start profiling and free memory (see the 12B above).
-        mm_limits={"image": 0, "audio": 0, "video": 0},
-        # Same gemma4_unified fix as the 12B above (nightly vLLM + transformers
-        # >= 5.10.2 + FlashInfer sampler off).
-        vllm_version="nightly",
-        extra_pip=("transformers>=5.10.2",),
-        env={"VLLM_USE_FLASHINFER_SAMPLER": "0"},
     ),
 )

     max_model_len: int | None = None  # cap context to fit memory / task
     trust_remote_code: bool = False  # required by MiniCPM / Nemotron custom code
     # Performance / throughput (vLLM serve flags). Defaults target high
     # steady-state throughput on the common single-GPU path; tune per model.
+    # See ``service.build_command`` for how each maps to a flag. For anything more
+    # exotic (quantization, batch-size caps, …) use ``extra_vllm_args``.
     gpu_memory_utilization: float | None = None  # fraction of VRAM for weights + KV cache (vLLM default 0.9)
     enable_prefix_caching: bool = True  # reuse KV for shared prompt prefixes — big win when system/context repeat
     async_scheduling: bool = True  # overlap CPU request scheduling with GPU compute
     enforce_eager: bool = False  # skip CUDA-graph capture: faster cold start, lower steady-state throughput
+    # Observability. ``log_requests`` adds --enable-log-requests so each call's id,
+    # sampling params, and token counts show in the Modal container logs.
+    log_requests: bool = True
     # OpenAI feature parsers (vLLM names; leave None if unsupported on the model)
     reasoning_parser: str | None = None
     tool_call_parser: str | None = None
     enable_auto_tool_choice: bool = False
+    # Multimodal — per-prompt input caps, e.g. {"image": 4, "audio": 2}. Set the
+    # caps to 0 on an auto-detected-multimodal model you serve text-only, to skip
+    # the encoder warmup and free memory.
+    mm_limits: dict[str, int] | None = None
     # Scaling / lifecycle
     max_concurrent_inputs: int = 64  # hard ceiling of requests multiplexed onto one container
     scaledown_window: int = 15 * 60  # idle seconds before a container stops
     min_containers: int = 0  # keep N warm to remove cold starts (costs $)
     startup_timeout: int = 30 * 60  # weight download + load can be slow
     ModelConfig(
         name="nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16",
         endpoint_name="nemotron-3-nano-4b",
+        # Tiny Titan tier (≤4B): ~4B BF16 weights (~8GB) fit a single 24GB L4.
         profile="tiny",
         params_b=4,
         gpu="L4:1",
         max_model_len=16384,
+        # Hybrid Mamba-2 + MLP + attention arch → custom modeling code; required.
         trust_remote_code=True,
         gated=True,
         max_concurrent_inputs=32,
+        # Served as a plain chat endpoint. NVIDIA ships a custom `nano_v3` reasoning
+        # parser as a downloadable plugin file (--reasoning-parser-plugin) plus a
+        # `qwen3_coder` tool parser; both are omitted here for boot-robustness (the
+        # plugin must be shipped into the image and is easy to get wrong). The
+        # model still reasons — the <think> block just stays inline in the content.
+        # Add them later via extra_vllm_args if structured reasoning/tools are needed.
     ),
     ModelConfig(
         name="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
         endpoint_name="nemotron-3-nano-30b",
+        # Hybrid Mamba-2 + MoE: ~31B total params in BF16 (~62GB), ~3B active per
+        # token. Needs an 80GB card — an alternate strong model, not a tier default.
+        params_b=31,
         gpu="H200:1",
         max_model_len=32768,
         trust_remote_code=True,
         gated=True,
         max_concurrent_inputs=64,
+        # Same plain-chat posture as the 4B (custom `nano_v3` parser plugin omitted).
     ),
     ModelConfig(
         name="nvidia/Nemotron-Cascade-14B-Thinking",
         params_b=14,
         gpu="L40S:1",
         max_model_len=32768,
+        # Post-trained from Qwen3-14B Base → stock Qwen3 arch (no custom code).
+        # ChatML thinking block parsed by the Qwen3 reasoning parser; `hermes` is
+        # the standard Qwen3-family tool parser. Both verified built-in in vLLM.
         reasoning_parser="qwen3",
         tool_call_parser="hermes",
         enable_auto_tool_choice=True,
         max_concurrent_inputs=48,
     ),
 )
         max_model_len=32768,
         trust_remote_code=True,
         max_concurrent_inputs=48,
         # No tool_call_parser on purpose: MiniCPM4.1 emits a custom
+        # <|tool_call_start|> code-block format vLLM has no matching parser for, so
+        # a tool parser would 400/mis-parse. The engine's structured path uses vLLM
         # guided decoding (response_format json_schema) instead, which is
         # parser-independent — see ADR-0016. Don't bolt on a mismatched parser.
+        # (The model card suggests a vLLM nightly; 0.21.0 predates the release and
+        # serves it fine — flip vllm_version="nightly" if a boot failure proves otherwise.)
     ),
     ModelConfig(
         name="openbmb/MiniCPM-o-4_5",
         endpoint_name="minicpm-o-4-5",
+        # Omni-modal (text + vision + audio) on a Qwen3-8B backbone → ~9B total in
+        # BF16. A specialist model, not cast to a profile by default.
+        params_b=9,
         gpu="L40S:1",
         trust_remote_code=True,
+        # Text + image only here; audio in/out over vLLM is experimental (it really
+        # wants the Transformers/demo runtime). Caps keep the encoder warmup bounded.
+        mm_limits={"image": 1, "audio": 0, "video": 0},
+        # Light vision/audio preprocessing backends. NOTE: full omni support wants
+        # openbmb's `minicpmo-utils[all]` + a pinned transformers==4.51.0, but that
+        # pin conflicts with vLLM's bundled transformers — so we keep the lean set
+        # and serve text+image. Treat audio as experimental.
         extra_pip=("librosa", "soundfile", "timm"),
+        gpu_memory_utilization=0.9,
         max_concurrent_inputs=16,
         # Custom omni-modal code path: keep the async scheduler off (conservative
         # — it's a specialist, not on the default cast). Prefix caching stays on.
         tool_call_parser="gemma4",
         enable_auto_tool_choice=True,
         max_concurrent_inputs=48,
+        # gemma4_unified (encoder-free) has no native class in any *stable* vLLM
+        # (≤0.22.1 falls back to the Transformers backend and crashes); only the
+        # nightly wheel registers Gemma4UnifiedForConditionalGeneration. So this
+        # model alone pins the nightly + transformers>=5.10.2. Scoped here, so
+        # NVIDIA/OpenBMB and the 26B sibling stay on the reproducible pin.
         vllm_version="nightly",
         extra_pip=("transformers>=5.10.2",),
+        # Transformers-backend / fresh-nightly path: eager-only is the safe choice
+        # (CUDA-graph capture + async scheduler aren't reliable here).
+        enforce_eager=True,
+        async_scheduling=False,
+        # Text-only in the cast — gemma4 auto-detects as multimodal, so zero the
+        # per-prompt caps to skip the encoder warmup and free memory for KV cache.
+        mm_limits={"image": 0, "audio": 0},
     ),
     ModelConfig(
         name="google/gemma-4-26B-A4B-it",
         endpoint_name="gemma-4-26b",
+        # MoE: ~25B total params (~4B active) with a small vision encoder. Gated.
         profile="strong",
         params_b=26,
         gpu="H200:1",
         tool_call_parser="gemma4",
         enable_auto_tool_choice=True,
         max_concurrent_inputs=64,
+        # Standard gemma4 MoE arch (NOT the unified 12B path): served by a native
+        # vLLM class on the pinned stable release (0.19.1+), so NO nightly, no
+        # transformers pin, and CUDA graphs + async scheduling work — defaults stand.
+        # Text-only in the cast: zero the auto-detected multimodal caps.
+        mm_limits={"image": 0},
     ),
 )
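
The sizing comments in this file (about 8GB for the 4B model, about 62GB for the 31B model) follow from BF16 using 2 bytes per parameter. A minimal sketch of that arithmetic, assuming 1 GB = 10^9 bytes and counting weights only; the helper names are hypothetical, and the check is a lower bound because KV cache and activations also need room inside the `gpu_memory_utilization` budget.

```python
BYTES_PER_PARAM_BF16 = 2  # bfloat16 stores each parameter in 16 bits


def bf16_weight_gb(params_b: float) -> float:
    """Approximate BF16 weight footprint in GB for a model with params_b billion parameters."""
    return params_b * BYTES_PER_PARAM_BF16


def weights_fit(params_b: float, vram_gb: float, gpu_memory_utilization: float = 0.9) -> bool:
    """True if the weights alone fit the VRAM share vLLM is allowed to use.

    Necessary, not sufficient: the KV cache must fit in whatever is left over.
    """
    return bf16_weight_gb(params_b) <= vram_gb * gpu_memory_utilization


# Usage: the 4B model on a 24GB card, and the 31B model on 24GB vs. 80GB.
for params_b, vram_gb in ((4, 24), (31, 24), (31, 80)):
    print(params_b, vram_gb, bf16_weight_gb(params_b), weights_fit(params_b, vram_gb))
```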

modal/docs/deploying.md CHANGED

@@ -3,6 +3,12 @@
 This guide covers prerequisites, deployment, configuration knobs, auth, GPU
 sizing, and wiring the endpoints into the engine.
 ## Prerequisites
 ```bash
@@ -24,26 +30,25 @@ Only models with `gated=True` mount this secret; ungated models deploy without i
 Each provider is its own Modal app, deployed independently:
 ```bash
-modal deploy modal/app_nvidia.py     # Nemotron 3 Nano 30B + 4B
-modal deploy modal/app_openbmb.py    # MiniCPM-o 4.5 + MiniCPM4.1-8B
-modal deploy modal/app_google.py     # Gemma 4 26B + 12B
 ```
 Use `modal serve modal/app_<provider>.py` for a hot-reloading dev session.
 Or deploy one, several, or all providers with a single uv command — a thin
-wrapper that exposes the deploy-time env knobs below as flags:
 ```bash
 uv run scripts/deploy_modal.py                      # all providers
 uv run scripts/deploy_modal.py nvidia openbmb       # just these
 uv run scripts/deploy_modal.py nvidia --keep-warm   # = MODAL_LLM_KEEP_WARM=1
-# --auth → MODAL_LLM_REQUIRE_AUTH=1, --json-logs → MODAL_LLM_JSON_LOGS=1,
-# --log-level LEVEL → MODAL_LLM_LOG_LEVEL, --dry-run to preview the commands.
 ```
 Run these from the repo root; the script's own directory (`modal/`) is on
-`sys.path`, so `from service import ...` / `from registry import ...` resolve,
 and `import modal` still binds the installed SDK (the folder name does not
 shadow it).
@@ -86,34 +91,29 @@ changes needed:
 | `gpu`                   | Modal GPU spec, e.g. `H200:1`, `H100:2`, `L40S:1`, `L4:1`.     |
 | `tensor_parallel_size`  | Shard across GPUs; set equal to the GPU count in `gpu`.        |
 | `max_model_len`         | Cap context length to fit memory / tune throughput.            |
-| `max_concurrent_inputs` | Hard ceiling of requests multiplexed onto one container.       |
-| `target_concurrent_inputs` | Autoscale target — scale out here, burst to the max (defaults to ~75% of the ceiling). |
-| `buffer_containers`     | Extra idle containers pre-warmed under active load (bursty traffic). |
 | `scaledown_window`      | Idle seconds before a container stops (cold-start vs. cost).   |
-| `gpu_snapshot`          | Serve via Modal memory snapshots (CPU + GPU): cold starts restore a warmed engine in seconds instead of re-paying load + warmup. See [Cold starts](#cold-starts). |
 | `min_containers`        | Keep N warm to eliminate cold starts (always-on cost).         |
 | `gpu_memory_utilization` | Fraction of VRAM for weights + KV cache (vLLM default `0.9`); raise for a bigger KV cache. |
 | `enable_prefix_caching` | Reuse the KV cache for shared prompt prefixes (on by default — big win when the system prompt / ledger context repeats across the cast). |
-| `async_scheduling`      | Overlap CPU request scheduling with GPU compute (on by default; off for the Transformers-backend Gemma + omni models). |
 | `enforce_eager`         | Skip CUDA-graph capture — faster cold start, lower steady-state throughput. |
-| `max_num_seqs` / `max_num_batched_tokens` | Batch-size and per-step token budget (memory vs. throughput). |
 | `log_requests`          | Log each request's id, sampling params, and token counts (on by default). |
-| `log_outputs`           | Also log generated text (verbose; off by default).            |
-| `max_log_len`           | Truncate logged prompts/outputs to N chars (`None` = no cap; default 2048). |
-| `uvicorn_access_log`    | Keep the per-request HTTP access line (method, path, status). |
-| `reasoning_parser` / `tool_call_parser` / `enable_auto_tool_choice` | OpenAI tool/reasoning features. |
-| `multimodal` / `mm_limits` | Image/audio/video inputs and per-prompt caps.               |
 | `trust_remote_code`     | Required by MiniCPM / Nemotron custom modeling code.           |
 | `vllm_version`          | Per-model inference-stack pin (escape hatch); `None` = the default `VLLM_VERSION`, `"nightly"` = latest nightly wheel, else a pinned version. |
-| `extra_vllm_args`       | Raw `vllm serve` flags appended verbatim (escape hatch).       |
 | `extra_pip` / `env`     | Extra image deps / container env (escape hatch).               |
 > **Per-model vLLM version.** The image pins `VLLM_VERSION` (see `service.py`) for
 > reproducible deploys. A single model can override it via `vllm_version` when the
 > pinned release can't serve its architecture — this is scoped to that model's image,
-> so one model's bump never touches another provider's app. The Gemma 4 entries set
-> `vllm_version="nightly"` (plus `transformers>=5.10.2` and `VLLM_USE_FLASHINFER_SAMPLER=0`)
-> because the `gemma4_unified` architecture is unservable on the pinned release.
 ### Performance tuning
@@ -129,45 +129,31 @@ per model:
   graph cache is persisted on the shared `vllm-cache` Volume (`VLLM_CACHE_ROOT`),
   so only the *first* container compiles — later cold starts replay the cached
   graphs. Set `enforce_eager=True` on a model only when its backend can't capture
-  graphs (the Transformers-backend Gemma models) or when cold start dominates.
 - **Async scheduling** overlaps CPU request scheduling with GPU compute; on by
   default for native vLLM models, off where the backend doesn't support it.
-- **Autoscaling** scales out at `target_concurrent_inputs` (≈75% of the ceiling by
-  default) while a hot container bursts up to `max_concurrent_inputs`, so we add
-  capacity before a container saturates rather than after. Use `buffer_containers`
-  to pre-warm spares for bursty traffic, or `min_containers` to remove cold starts
   entirely (at always-on cost).
-- **The V1 engine is pinned** (`VLLM_USE_V1=1`) for its better scheduler, chunked
-  prefill, and prefix caching.
 For memory-bound models, raise `gpu_memory_utilization` (more KV cache → more
-concurrency) and cap `max_num_seqs` / `max_num_batched_tokens` if a step OOMs.
 ### Cold starts
-A scale-from-zero cold start normally pays the full pipeline: container boot →
-weight load → engine warmup — minutes for the bigger models. Two mechanisms cut
-this (ADR-0030):
-**1. Memory snapshots (`gpu_snapshot=True`, per model).** The first container
-boots once, loads weights, runs a few warmup completions, puts vLLM to sleep
-(sleep level 1: weights offloaded to host RAM, KV cache dropped), and Modal
-snapshots the container — CPU *and* GPU state. Every later cold start restores
-the snapshot and wakes the engine, turning a multi-minute boot into seconds.
-Under the hood this switches the model from the plain `@app.function` web server
-to a class-based lifecycle (`@modal.enter(snap=True)` warmup → snapshot →
-`@modal.enter(snap=False)` wake), but the public URL and API are identical —
-clients can't tell the paths apart.
-Snapshot-enabled today: `nemotron-3-nano-4b` (tiny), `minicpm-4-1-8b` (fast),
-`nemotron-cascade-14b`. Left off deliberately: the Gemmas (nightly
-Transformers-backend path, sleep mode unverified), `nemotron-3-nano-30b`
-(~60GB of weights won't fit host RAM during sleep), and the omni specialist.
-GPU snapshots are **Modal-alpha** — if a snapshot model misbehaves, set its
-`gpu_snapshot=False` and redeploy; the plain path is unchanged.
-**2. Demo-day keep-warm (deploy-time, no code edits).** Pin warm containers for
-every *profile-bound* model (tiny/fast/balanced/strong) right before a live
 demo — specialists keep scale-to-zero:
 ```bash
@@ -197,49 +183,26 @@ edits — it reads the same `catalogue.py`.
    `app = modal.App(PROVIDERS["<provider>"].app)` then
    `register_all(app, PROVIDERS["<provider>"].models)`.
-## Quantization (lower precision)
-Every model repo ships **BF16** weights. To shrink the memory footprint — fit a
-model on a smaller GPU, or free VRAM for a longer context / more concurrency — you
-can serve it at lower precision. This is purely serving-side: it only adds
-`--quantization` / `--kv-cache-dtype` to the vLLM argv, and `--served-model-name`
-is unchanged, so the engine, endpoint URLs, and the running cast are untouched.
-Two controls, env override wins:
-- **Per model** — set `quantization` (and/or `kv_cache_dtype`) on a `ModelConfig`
-  in `catalogue.py`. This is the baseline a model serves at by default.
-- **Per deploy (no code edits)** — `MODAL_LLM_QUANTIZATION` / `MODAL_LLM_KV_CACHE_DTYPE`
-  override every model in the deploy. A disable token (`none`/`off`/`bf16`/…) forces
-  full precision even on a model that defaults to quantized.
-```bash
-# On-the-fly FP8 weights for one provider (via the deploy helper):
-uv run scripts/deploy_modal.py nvidia --quantization fp8
-# FP8 weights + FP8 KV cache, raw modal CLI:
-MODAL_LLM_QUANTIZATION=fp8 MODAL_LLM_KV_CACHE_DTYPE=fp8 modal deploy modal/app_nvidia.py
-# Force full precision back (overrides any per-model default):
-uv run scripts/deploy_modal.py nvidia --quantization none
 ```
 > **Not every architecture serves under on-the-fly FP8.** It needs an Ada/Hopper
 > GPU (our L4/L40S/H200 all qualify) *and* vLLM support for the model's arch.
-> Custom-code / hybrid-mamba archs (Nemotron-H = `nemotron-3-nano-4b`/`-30b`,
-> MiniCPM) and the Transformers-backend Gemmas may **fail to boot** under it — a
-> failed boot surfaces as `modal-http: invalid function call` (no healthy
-> container). Verify a provider after flipping it on (`modal/healthcheck.py` or
-> `curl <url>/v1/models`); if a model won't start, redeploy that provider without
-> the flag. This is why all per-model defaults stay `None` for now. See ADR-0031.
-> **FP8 KV cache (`--kv-cache-dtype fp8`) is silently dropped for snapshot models.**
-> On the pinned vLLM it crashes the `/wake_up` path (`init_fp8_kv_scales` →
-> `'list' object has no attribute 'zero_'`), so an FP8-KV snapshot model boots but
-> can never wake. `build_command` drops the flag for any `gpu_snapshot=True` model
-> and logs a `⚠️` line at deploy; the endpoint serves with full-precision KV cache.
-> FP8 *weights* (`--quantization fp8`) are unaffected. To run FP8 KV cache on such a
-> model, set its `gpu_snapshot=False`. See ADR-0031.
 ## Auth
@@ -266,40 +229,11 @@ OpenAPI spec (`../openapi.yaml`).
 ## Observability & logging
 Every container's stdout/stderr is captured by Modal — watch it live with
-`modal app logs <app-name>` or in the dashboard. Two layers shape what you see:
-**Request-level detail (on by default).** Each endpoint runs vLLM with
-`--enable-log-requests`, so every call logs its request id, sampling params, and
-(on completion) prompt/generation token counts and finish reason. `--max-log-len`
-caps the logged prompt at 2048 chars so a long context can't bloat a log line.
-The uvicorn access log (method, path, status, latency) stays on. Tune per model:
-| Knob              | Effect                                                        |
-| ----------------- | ------------------------------------------------------------- |
-| `log_requests`    | Per-request id / params / token counts (default **on**).      |
-| `log_outputs`     | Also log the generated text — verbose, can echo story content (default off). |
-| `max_log_len`     | Truncate logged prompts/outputs; set `None` to log them in full. |
-| `uvicorn_access_log` | Set `False` to drop the per-request HTTP access line.      |
-Clients can pass an `X-Request-Id` header and it shows up in the request logs —
-handy for correlating an engine call with its server-side line.
-**Structured JSON (opt-in).** For grepping fields or shipping to an aggregator,
-emit one JSON object per log line instead of vLLM's coloured text. Turn it on at
-deploy time — no code edits:
-```bash
-MODAL_LLM_JSON_LOGS=1 modal deploy modal/app_nvidia.py
-MODAL_LLM_JSON_LOGS=1 MODAL_LLM_LOG_LEVEL=DEBUG modal deploy modal/app_google.py
-```
-This ships a dependency-free formatter (`modal/vllm_logging.py`) into the image
-and points vLLM's `VLLM_LOGGING_CONFIG_PATH` at a generated `dictConfig`, so
-**all** vLLM + uvicorn logs (including the request logs above) come out as JSON
-with `ts` / `level` / `logger` / `msg` / `src` plus any structured extras (request
-id, token counts). `MODAL_LLM_LOG_LEVEL` (default `INFO`) sets verbosity for both
-the text and JSON paths. Leave JSON off for live demos — the coloured text is
-easier to watch.
 Throughput, KV-cache usage, and prefix-cache hit rate are logged every second
 (`VLLM_LOG_STATS_INTERVAL`) and also exposed as Prometheus metrics at `/metrics`.
@@ -312,12 +246,13 @@ total parameter count.
 | Model                              | Params (total / active) | Starting GPU |
 | ---------------------------------- | ----------------------- | ------------ |
-| Nemotron-3-Nano-30B-A3B            | 30B / ~3B (MoE)         | `H200:1`     |
-| Nemotron-3-Nano-4B                 | 4B (Tiny Titan)         | `L4:1`       |
-| MiniCPM-o-4_5 (omni)               | ~8B + media encoders    | `L40S:1`     |
 | MiniCPM4.1-8B                      | 8B                      | `L40S:1`     |
-| Gemma-4-26B-A4B-it                 | 26B / ~4B (MoE)         | `H200:1`     |
-| Gemma-4-12B                        | 12B                     | `L40S:1`     |
 These are starting points. If a container OOMs, lower `max_model_len`, raise the
 GPU tier, or bump `tensor_parallel_size` (and the GPU count) for sharding.

 This guide covers prerequisites, deployment, configuration knobs, auth, GPU
 sizing, and wiring the endpoints into the engine.
+The serving layer is deliberately small: it's Modal's canonical vLLM recipe — an
+autoscaling `@app.function` that launches `vllm serve` as a subprocess behind a
+`@modal.web_server` — applied once in `service.py` to every model in
+`catalogue.py`. See ADR-0034 for why we stripped the earlier snapshot / FP8 /
+structured-logging machinery back to this core.
 ## Prerequisites
 ```bash
 Each provider is its own Modal app, deployed independently:
 ```bash
+modal deploy modal/app_nvidia.py     # Nemotron 3 Nano 4B + 30B, Cascade 14B
+modal deploy modal/app_openbmb.py    # MiniCPM4.1-8B + MiniCPM-o 4.5
+modal deploy modal/app_google.py     # Gemma 4 12B + 26B
 ```
 Use `modal serve modal/app_<provider>.py` for a hot-reloading dev session.
 Or deploy one, several, or all providers with a single uv command — a thin
+wrapper that exposes the two deploy-time env knobs as flags:
 ```bash
 uv run scripts/deploy_modal.py                      # all providers
 uv run scripts/deploy_modal.py nvidia openbmb       # just these
 uv run scripts/deploy_modal.py nvidia --keep-warm   # = MODAL_LLM_KEEP_WARM=1
+# --auth → MODAL_LLM_REQUIRE_AUTH=1, --dry-run to preview the commands.
 ```
 Run these from the repo root; the script's own directory (`modal/`) is on
+`sys.path`, so `from service import ...` / `from catalogue import ...` resolve,
 and `import modal` still binds the installed SDK (the folder name does not
 shadow it).
 | `gpu`                   | Modal GPU spec, e.g. `H200:1`, `H100:2`, `L40S:1`, `L4:1`.     |
 | `tensor_parallel_size`  | Shard across GPUs; set equal to the GPU count in `gpu`.        |
 | `max_model_len`         | Cap context length to fit memory / tune throughput.            |
+| `max_concurrent_inputs` | Hard ceiling of requests multiplexed onto one container (autoscale target is ~75% of it). |
 | `scaledown_window`      | Idle seconds before a container stops (cold-start vs. cost).   |
 | `min_containers`        | Keep N warm to eliminate cold starts (always-on cost).         |
 | `gpu_memory_utilization` | Fraction of VRAM for weights + KV cache (vLLM default `0.9`); raise for a bigger KV cache. |
 | `enable_prefix_caching` | Reuse the KV cache for shared prompt prefixes (on by default — big win when the system prompt / ledger context repeats across the cast). |
+| `async_scheduling`      | Overlap CPU request scheduling with GPU compute (on by default; off for the Transformers-backend Gemma 12B + omni models). |
 | `enforce_eager`         | Skip CUDA-graph capture — faster cold start, lower steady-state throughput. |
 | `log_requests`          | Log each request's id, sampling params, and token counts (on by default). |
+| `reasoning_parser` / `tool_call_parser` / `enable_auto_tool_choice` | OpenAI tool/reasoning features (vLLM parser names; leave None if unsupported). |
+| `mm_limits`             | Per-prompt image/audio/video caps; set to 0 on an auto-detected-multimodal model you serve text-only. |
 | `trust_remote_code`     | Required by MiniCPM / Nemotron custom modeling code.           |
 | `vllm_version`          | Per-model inference-stack pin (escape hatch); `None` = the default `VLLM_VERSION`, `"nightly"` = latest nightly wheel, else a pinned version. |
+| `extra_vllm_args`       | Raw `vllm serve` flags appended verbatim — the escape hatch for anything not modelled above (quantization, batch caps, custom parser plugins, …). |
 | `extra_pip` / `env`     | Extra image deps / container env (escape hatch).               |
 > **Per-model vLLM version.** The image pins `VLLM_VERSION` (see `service.py`) for
 > reproducible deploys. A single model can override it via `vllm_version` when the
 > pinned release can't serve its architecture — this is scoped to that model's image,
+> so one model's bump never touches another provider's app. Only the Gemma 4 **12B**
+> sets `vllm_version="nightly"` (plus `transformers>=5.10.2`) because its
+> `gemma4_unified` architecture has no class in any stable vLLM ≤0.22.1. The Gemma 4
+> **26B** is a standard MoE arch that serves on the pinned stable release, so it
+> stays on the default pin.
 ### Performance tuning
   graph cache is persisted on the shared `vllm-cache` Volume (`VLLM_CACHE_ROOT`),
   so only the *first* container compiles — later cold starts replay the cached
   graphs. Set `enforce_eager=True` on a model only when its backend can't capture
+  graphs (the Transformers-backend Gemma 12B) or when cold start dominates.
 - **Async scheduling** overlaps CPU request scheduling with GPU compute; on by
   default for native vLLM models, off where the backend doesn't support it.
+- **Autoscaling** scales out at ~75% of `max_concurrent_inputs` while a hot
+  container bursts up to the ceiling, so we add capacity before a container
+  saturates rather than after. Use `min_containers` to remove cold starts
   entirely (at always-on cost).
 For memory-bound models, raise `gpu_memory_utilization` (more KV cache → more
+concurrency); if a step OOMs, lower `max_model_len` or cap the batch via
+`extra_vllm_args` (e.g. `("--max-num-seqs", "32")`).
 ### Cold starts
+A scale-from-zero cold start pays container boot → weight load → engine warmup.
+Two mechanisms keep that bounded:
+**1. Shared caches (always on).** Weights are pulled once onto the
+`huggingface-cache` Volume and the torch.compile / CUDA-graph artifacts are
+persisted on the `vllm-cache` Volume (`VLLM_CACHE_ROOT`). So a model downloads
+once across every container and provider, and only the *first* container
+compiles its graphs — later cold starts replay the cache.
+**2. Demo-day keep-warm (deploy-time, no code edits).** Pin one warm container
+for every *profile-bound* model (tiny/fast/balanced/strong) right before a live
 demo — specialists keep scale-to-zero:
 ```bash
    `app = modal.App(PROVIDERS["<provider>"].app)` then
    `register_all(app, PROVIDERS["<provider>"].models)`.
+## Lower precision (quantization)
+Every model repo here ships **BF16** weights and serves at full precision. To
+shrink a model's footprint — fit it on a smaller GPU, or free VRAM for a longer
+context / more concurrency — pass vLLM's quantization flags through the
+`extra_vllm_args` escape hatch on its `ModelConfig`:
+```python
+extra_vllm_args=("--quantization", "fp8", "--kv-cache-dtype", "fp8")
 ```
+This is purely serving-side: `--served-model-name` is unchanged, so the engine,
+endpoint URLs, and the running cast are untouched.
 > **Not every architecture serves under on-the-fly FP8.** It needs an Ada/Hopper
 > GPU (our L4/L40S/H200 all qualify) *and* vLLM support for the model's arch.
+> Custom-code / hybrid-Mamba archs (the Nemotron Nanos, MiniCPM) and the
+> Transformers-backend Gemma 12B may **fail to boot** under it. Verify a model
+> after adding the flag (`modal/healthcheck.py` or `curl <url>/v1/models`); if it
+> won't start, drop the flag. This is why every model defaults to full precision.
 ## Auth
 ## Observability & logging
 Every container's stdout/stderr is captured by Modal — watch it live with
+`modal app logs <app-name>` or in the dashboard. Each endpoint runs vLLM with
+`--enable-log-requests` (toggle via `log_requests`), so every call logs its
+request id, sampling params, and (on completion) prompt/generation token counts
+and finish reason. Clients can pass an `X-Request-Id` header and it shows up in
+the request logs — handy for correlating an engine call with its server-side line.
 Throughput, KV-cache usage, and prefix-cache hit rate are logged every second
 (`VLLM_LOG_STATS_INTERVAL`) and also exposed as Prometheus metrics at `/metrics`.
 | Model                              | Params (total / active) | Starting GPU |
 | ---------------------------------- | ----------------------- | ------------ |
+| Nemotron-3-Nano-30B-A3B            | ~31B / ~3B (Mamba MoE)  | `H200:1`     |
+| Nemotron-Cascade-14B-Thinking      | ~14B (dense, Qwen3)     | `L40S:1`     |
+| Nemotron-3-Nano-4B                 | ~4B (Tiny Titan)        | `L4:1`       |
+| MiniCPM-o-4_5 (omni)               | ~9B + media encoders    | `L40S:1`     |
 | MiniCPM4.1-8B                      | 8B                      | `L40S:1`     |
+| Gemma-4-26B-A4B-it                 | ~25B / ~4B (MoE)        | `H200:1`     |
+| Gemma-4-12B-it                     | ~12B (dense)            | `L40S:1`     |
 These are starting points. If a container OOMs, lower `max_model_len`, raise the
 GPU tier, or bump `tensor_parallel_size` (and the GPU count) for sharding.
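
Reviewer sketch for two rules stated in the guide above: `tensor_parallel_size` should equal the GPU count in `gpu`, and the autoscale target sits at roughly 75% of `max_concurrent_inputs`. The helper names (`gpu_count`, `check_sharding`, `autoscale_target`) are hypothetical, and the rounding is an assumption because the diff does not show how `service.py` computes the target.

```python
def gpu_count(gpu: str) -> int:
    """Return the count from a Modal GPU spec such as "H100:2"; a bare "L4" means 1."""
    _, _, count = gpu.partition(":")
    return int(count) if count else 1


def check_sharding(gpu: str, tensor_parallel_size: int) -> None:
    """Raise if the shard count does not match the number of GPUs requested."""
    available = gpu_count(gpu)
    if available != tensor_parallel_size:
        raise ValueError(
            f"tensor_parallel_size={tensor_parallel_size} but gpu={gpu!r} "
            f"provides {available} GPU(s)"
        )


def autoscale_target(max_concurrent_inputs: int, fraction: float = 0.75) -> int:
    """Scale-out threshold below the hard ceiling; rounds down, never below 1."""
    return max(1, int(max_concurrent_inputs * fraction))


check_sharding("H200:1", 1)      # matches, so no error is raised
target = autoscale_target(64)    # the Gemma 4 26B ceiling from catalogue.py
```

With a ceiling of 64, this rounding yields a target of 48, so new containers would start before the hot one saturates.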

modal/healthcheck.py CHANGED

@@ -218,8 +218,7 @@ async def check_chat(client: httpx.AsyncClient, t: Target, deadline: float) -> N
         backoff = min(backoff * 1.5, 20.0)
-async def run_target(t: Target, api_key: str, timeout: int, do_chat: bool,
-                     sem: asyncio.Semaphore) -> None:
     async with sem:
         t.started = time.monotonic()
         deadline = t.started + timeout
@@ -231,9 +230,9 @@ async def run_target(t: Target, api_key: str, timeout: int, do_chat: bool,
         # within 150s returns a 303 to the same URL (clients are expected to follow
         # it — up to ~20 hops / 50 min) while the container finishes cold-starting.
         # Without this, the first 303 at ~150s looks like a terminal error.
-        async with httpx.AsyncClient(headers=headers, timeout=client_timeout,
-                                     limits=limits, follow_redirects=True,
-                                     max_redirects=20) as client:
             await check_models(client, t, deadline)
             if t.models_ok and do_chat:
                 await check_chat(client, t, deadline)
@@ -258,10 +257,9 @@ PHASE_ICON = {
 def render_board(targets: list[Target], started: float) -> str:
     width = max(len(t.key) for t in targets)
-    lines = [f"  cold-start health-check · {len(targets)} endpoints · "
-             f"{time.monotonic() - started:5.0f}s elapsed"]
     for t in targets:
-        live = (t.elapsed or (time.monotonic() - t.started if t.started else 0.0))
         icon = PHASE_ICON.get(t.phase, "?")
         detail = t.phase
         if t.phase == "booting":
@@ -311,16 +309,14 @@ def print_report(targets: list[Target], do_chat: bool) -> None:
         detail = t.error or (t.sample if t.chat_ok else t.served_reported) or ""
         if t.chat_ok and t.finish_reason:
             detail = f"[{t.finish_reason}] {detail}"
-        print(f"  {t.key:<{kw}}  {yn(t.models_ok):<6} {yn(t.chat_ok):<5}  "
-              f"{lat}  {detail[:60]}")
     def healthy(t: Target) -> bool:
         return bool(t.models_ok and (t.chat_ok or not do_chat))
     ok = sum(1 for t in targets if healthy(t))
     print("  " + "-" * (len(header) - 2))
-    print(f"  {ok}/{len(targets)} healthy"
-          + ("" if do_chat else " (liveness only — chat not tested)"))
     failed = [t.key for t in targets if not healthy(t)]
     if failed:
         print(f"  needs attention: {', '.join(failed)}")
@@ -343,14 +339,16 @@ def build_targets(catalogue: ModuleType, workspace: str | None, args) -> list[Ta
             base_url = base_override.rstrip("/")
         else:
             base_url = catalogue.endpoint_url(e.app, e.endpoint_name, workspace)
-        targets.append(Target(
-            key=key,
-            app=e.app,
-            served_model_id=e.served_model_id,
-            profile=e.profile,
-            params_b=e.params_b,
-            base_url=base_url,
-        ))
     return targets
@@ -359,8 +357,11 @@ async def main_async(args) -> int:
     workspace = resolve_workspace(args.workspace)
     base_override = os.environ.get("MODAL_LLM_BASE_URL")
     if not workspace and not base_override:
-        print("ERROR: could not resolve a Modal workspace. Pass --workspace, set "
-              "$MODAL_WORKSPACE, or run `modal token new`.", file=sys.stderr)
         return 2
     targets = build_targets(catalogue, workspace, args)
@@ -378,8 +379,10 @@ async def main_async(args) -> int:
         return 0
     do_chat = not args.no_chat
-    print(f"Workspace: {workspace}   endpoints: {len(targets)}   "
-          f"chat: {'yes' if do_chat else 'no'}   per-endpoint timeout: {args.timeout}s")
     print("Firing all endpoints concurrently — cold starts overlap, so this takes")
     print("about as long as the single slowest model, not the sum.\n")
@@ -388,9 +391,7 @@ async def main_async(args) -> int:
     done = asyncio.Event()
     progress = asyncio.create_task(progress_loop(targets, started, done))
     try:
-        await asyncio.gather(*(
-            run_target(t, api_key, args.timeout, do_chat, sem) for t in targets
-        ))
     finally:
         done.set()
         await progress
@@ -398,18 +399,21 @@ async def main_async(args) -> int:
     print_report(targets, do_chat)
     if args.json:
-        summary = [{
-            "endpoint": t.key,
-            "app": t.app,
-            "served_model_id": t.served_model_id,
-            "base_url": t.base_url,
-            "models_ok": t.models_ok,
-            "chat_ok": t.chat_ok,
-            "latency_s": round(t.elapsed, 1),
-            "finish_reason": t.finish_reason,
-            "served_reported": t.served_reported,
-            "error": t.error,
-        } for t in targets]
         Path(args.json).write_text(json.dumps(summary, indent=2))
         print(f"\nWrote JSON summary to {args.json}")
@@ -418,21 +422,17 @@ async def main_async(args) -> int:
 def parse_args(argv: list[str] | None = None) -> argparse.Namespace:
-    p = argparse.ArgumentParser(description=__doc__,
-                                formatter_class=argparse.RawDescriptionHelpFormatter)
     p.add_argument("--workspace", help="Modal workspace slug (else $MODAL_WORKSPACE / `modal profile current`)")
     p.add_argument("--only", help="comma-separated endpoint keys to include")
     p.add_argument("--skip", help="comma-separated endpoint keys to exclude")
-    p.add_argument("--profiles-only", action="store_true",
-                   help="test only the engine-bound tiers (tiny/fast/balanced/strong)")
-    p.add_argument("--no-chat", action="store_true",
-                   help="liveness only (GET /v1/models); skip the chat completion")
-    p.add_argument("--timeout", type=int, default=900,
-                   help="per-endpoint deadline in seconds (default 900)")
-    p.add_argument("--concurrency", type=int, default=0,
-                   help="max endpoints in flight at once (default 0 = all)")
-    p.add_argument("--print-urls", action="store_true",
-                   help="resolve and print endpoint URLs, then exit (no calls)")
     p.add_argument("--json", help="also write a machine-readable summary to this path")
     return p.parse_args(argv)

         backoff = min(backoff * 1.5, 20.0)
+async def run_target(t: Target, api_key: str, timeout: int, do_chat: bool, sem: asyncio.Semaphore) -> None:
     async with sem:
         t.started = time.monotonic()
         deadline = t.started + timeout
         # within 150s returns a 303 to the same URL (clients are expected to follow
         # it — up to ~20 hops / 50 min) while the container finishes cold-starting.
         # Without this, the first 303 at ~150s looks like a terminal error.
+        async with httpx.AsyncClient(
+            headers=headers, timeout=client_timeout, limits=limits, follow_redirects=True, max_redirects=20
+        ) as client:
             await check_models(client, t, deadline)
             if t.models_ok and do_chat:
                 await check_chat(client, t, deadline)
 def render_board(targets: list[Target], started: float) -> str:
     width = max(len(t.key) for t in targets)
+    lines = [f"  cold-start health-check · {len(targets)} endpoints · {time.monotonic() - started:5.0f}s elapsed"]
     for t in targets:
+        live = t.elapsed or (time.monotonic() - t.started if t.started else 0.0)
         icon = PHASE_ICON.get(t.phase, "?")
         detail = t.phase
         if t.phase == "booting":
         detail = t.error or (t.sample if t.chat_ok else t.served_reported) or ""
         if t.chat_ok and t.finish_reason:
             detail = f"[{t.finish_reason}] {detail}"
+        print(f"  {t.key:<{kw}}  {yn(t.models_ok):<6} {yn(t.chat_ok):<5}  {lat}  {detail[:60]}")
     def healthy(t: Target) -> bool:
         return bool(t.models_ok and (t.chat_ok or not do_chat))
     ok = sum(1 for t in targets if healthy(t))
     print("  " + "-" * (len(header) - 2))
+    print(f"  {ok}/{len(targets)} healthy" + ("" if do_chat else " (liveness only — chat not tested)"))
     failed = [t.key for t in targets if not healthy(t)]
     if failed:
         print(f"  needs attention: {', '.join(failed)}")
             base_url = base_override.rstrip("/")
         else:
             base_url = catalogue.endpoint_url(e.app, e.endpoint_name, workspace)
+        targets.append(
+            Target(
+                key=key,
+                app=e.app,
+                served_model_id=e.served_model_id,
+                profile=e.profile,
+                params_b=e.params_b,
+                base_url=base_url,
+            )
+        )
     return targets
     workspace = resolve_workspace(args.workspace)
     base_override = os.environ.get("MODAL_LLM_BASE_URL")
     if not workspace and not base_override:
+        print(
+            "ERROR: could not resolve a Modal workspace. Pass --workspace, set "
+            "$MODAL_WORKSPACE, or run `modal token new`.",
+            file=sys.stderr,
+        )
         return 2
     targets = build_targets(catalogue, workspace, args)
         return 0
     do_chat = not args.no_chat
+    print(
+        f"Workspace: {workspace}   endpoints: {len(targets)}   "
+        f"chat: {'yes' if do_chat else 'no'}   per-endpoint timeout: {args.timeout}s"
+    )
     print("Firing all endpoints concurrently — cold starts overlap, so this takes")
     print("about as long as the single slowest model, not the sum.\n")
     done = asyncio.Event()
     progress = asyncio.create_task(progress_loop(targets, started, done))
     try:
+        await asyncio.gather(*(run_target(t, api_key, args.timeout, do_chat, sem) for t in targets))
     finally:
         done.set()
         await progress
     print_report(targets, do_chat)
     if args.json:
+        summary = [
+            {
+                "endpoint": t.key,
+                "app": t.app,
+                "served_model_id": t.served_model_id,
+                "base_url": t.base_url,
+                "models_ok": t.models_ok,
+                "chat_ok": t.chat_ok,
+                "latency_s": round(t.elapsed, 1),
+                "finish_reason": t.finish_reason,
+                "served_reported": t.served_reported,
+                "error": t.error,
+            }
+            for t in targets
+        ]
         Path(args.json).write_text(json.dumps(summary, indent=2))
         print(f"\nWrote JSON summary to {args.json}")
 def parse_args(argv: list[str] | None = None) -> argparse.Namespace:
+    p = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
     p.add_argument("--workspace", help="Modal workspace slug (else $MODAL_WORKSPACE / `modal profile current`)")
     p.add_argument("--only", help="comma-separated endpoint keys to include")
     p.add_argument("--skip", help="comma-separated endpoint keys to exclude")
+    p.add_argument(
+        "--profiles-only", action="store_true", help="test only the engine-bound tiers (tiny/fast/balanced/strong)"
+    )
+    p.add_argument("--no-chat", action="store_true", help="liveness only (GET /v1/models); skip the chat completion")
+    p.add_argument("--timeout", type=int, default=900, help="per-endpoint deadline in seconds (default 900)")
+    p.add_argument("--concurrency", type=int, default=0, help="max endpoints in flight at once (default 0 = all)")
+    p.add_argument("--print-urls", action="store_true", help="resolve and print endpoint URLs, then exit (no calls)")
     p.add_argument("--json", help="also write a machine-readable summary to this path")
     return p.parse_args(argv)
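
The `--concurrency` flag above (default 0 = all) pairs with the `sem` argument passed to `run_target`. The diff does not show how that semaphore is built, so the following is only a minimal sketch of one way to bound in-flight checks with `asyncio.Semaphore`. The names `run_bounded` and `probe` are hypothetical, and the `asyncio.sleep` call stands in for the real HTTP requests.

```python
import asyncio


async def run_bounded(names, concurrency=0):
    """Run one check per name, at most `concurrency` at a time (0 means no cap)."""
    # Assumption: 0 is treated as "all endpoints at once", mirroring the help text.
    limit = concurrency if concurrency > 0 else max(1, len(names))
    sem = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def probe(name):
        # Hypothetical stand-in for the per-endpoint health check.
        nonlocal in_flight, peak
        async with sem:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # placeholder for the network calls
            in_flight -= 1
            return name, True

    results = await asyncio.gather(*(probe(n) for n in names))
    return dict(results), peak


if __name__ == "__main__":
    tiers = ["tiny", "fast", "balanced", "strong"]
    results, peak = asyncio.run(run_bounded(tiers, concurrency=2))
    print(results, "peak in flight:", peak)
```

Keeping the cap inside each task, rather than batching the task list, lets a slow endpoint hold one slot while the remaining slots keep draining the queue.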

modal/service.py CHANGED

@@ -1,19 +1,18 @@
 """Reusable, OpenAI-compatible model-serving layer for Modal.
-This module is provider-agnostic. It knows how to take a single ``ModelConfig``
-and turn it into a serverless, autoscaling, OpenAI-compatible HTTP endpoint
-backed by vLLM. Each provider app (``app_nvidia.py``, ``app_openbmb.py``,
-``app_google.py``) imports :func:`register_model` and wires up its own models,
-so providers stay fully isolated in their own Modal apps while sharing one
-battle-tested serving path.
-Design goals:
-- **Extensible**: add a model by appending one ``ModelConfig`` to the registry.
-- **Scalable**: serverless autoscaling, input concurrency, shared weight cache.
-- **Configurable per task**: every knob (GPU, context length, parsers,
-  multimodal limits, extra flags) lives in data, not code.
-The served endpoints speak the OpenAI REST API (``/v1/chat/completions`,
 ``/v1/completions``, ``/v1/models``), so any OpenAI-compatible client can call
 them by pointing ``base_url`` at the deployed URL.
 """
@@ -33,10 +32,11 @@ from catalogue import ModelConfig
 # --- Shared serving constants --------------------------------------------------
-# Pin the inference stack so deploys are reproducible. Bump deliberately.
 VLLM_VERSION = "0.21.0"
 CUDA_IMAGE = "nvidia/cuda:12.9.0-devel-ubuntu22.04"
-PYTHON_VERSION = "3.13"
 # The in-container port vLLM listens on; Modal maps it to a public HTTPS URL.
 VLLM_PORT = 8000
@@ -46,12 +46,12 @@ HF_CACHE_PATH = "/root/.cache/huggingface"
 VLLM_CACHE_PATH = "/root/.cache/vllm"
 # Name of the Modal Secret that holds a Hugging Face token (key: HF_TOKEN).
-# Required only for gated repos (e.g. Gemma). Create it once with:
 #   modal secret create huggingface-secret HF_TOKEN=hf_...
 HF_SECRET_NAME = "huggingface-secret"
-# Name of the Modal Secret holding the bearer token clients must present.
-# The key MUST be VLLM_API_KEY — vLLM reads that env var and then enforces
 # `Authorization: Bearer <token>` on every request. Create it once with:
 #   modal secret create llm-api-key VLLM_API_KEY=sk-...
 API_KEY_SECRET_NAME = "llm-api-key"
@@ -60,72 +60,27 @@ API_KEY_SECRET_NAME = "llm-api-key"
 #   MODAL_LLM_REQUIRE_AUTH=1 modal deploy modal/app_google.py
 # When enabled, every endpoint mounts API_KEY_SECRET_NAME and rejects requests
 # without a valid bearer token. Off by default (endpoints are then public).
-REQUIRE_API_KEY = os.environ.get("MODAL_LLM_REQUIRE_AUTH", "").lower() in (
-    "1",
-    "true",
-    "yes",
-)
-# Emit logs as structured JSON (one object per line) instead of vLLM's default
-# human-readable text. Opt in at deploy time (no code edits), mirroring the auth
-# toggle above:
-#   MODAL_LLM_JSON_LOGS=1 modal deploy modal/app_google.py
-# Off by default — the coloured text logs are nicer to watch live; turn this on
-# when shipping logs to an aggregator or grepping fields. Request-level logging
-# itself (the per-request detail) is always on via ModelConfig, independent of
-# the format chosen here.
-JSON_LOGS = os.environ.get("MODAL_LLM_JSON_LOGS", "").lower() in ("1", "true", "yes")
-# Verbosity for the served loggers (vLLM honours VLLM_LOGGING_LEVEL; the JSON
-# config applies the same level). Read at deploy time and baked into the image.
-LOG_LEVEL = os.environ.get("MODAL_LLM_LOG_LEVEL", "INFO").upper()
 # Demo-day switch: keep N containers warm for every *profile-bound* model (the
-# tiers the cast actually runs on), removing their cold starts entirely for the
-# duration of the deploy. Specialists keep scale-to-zero. Costs GPU-hours while
-# deployed — turn it on right before a live demo, redeploy without it after:
 #   MODAL_LLM_KEEP_WARM=1 modal deploy modal/app_nvidia.py
 KEEP_WARM = int(os.environ.get("MODAL_LLM_KEEP_WARM", "0") or "0")
-# Deploy-time precision overrides. When set, each wins over the matching per-model
-# ``ModelConfig`` field for *every* model in the deploy — so you flip a whole
-# provider to FP8 without editing the catalogue (deploys are per-provider, so the
-# blast radius is one app):
-#   MODAL_LLM_QUANTIZATION=fp8 modal deploy modal/app_nvidia.py
-#   MODAL_LLM_QUANTIZATION=fp8 MODAL_LLM_KV_CACHE_DTYPE=fp8 uv run scripts/deploy_modal.py nvidia
-# A disable token (``none``/``off``/``bf16``/…) forces full precision even if a model
-# defaults to a quantized mode. Read at deploy time and baked into each model's argv
-# (see build_command). CAVEAT: not every architecture serves under on-the-fly FP8 —
-# verify per provider; a model that can't will fail to boot. See ADR-0031.
-QUANTIZATION = os.environ.get("MODAL_LLM_QUANTIZATION", "").strip()
-KV_CACHE_DTYPE = os.environ.get("MODAL_LLM_KV_CACHE_DTYPE", "").strip()
-# Override values that mean "no quantization / model-default precision" — they make
-# the resolver omit the flag rather than pass a bogus value to vLLM.
-_PRECISION_DISABLE = frozenset({"none", "off", "false", "0", "no", "bf16", "fp16", "auto"})
-# Where the structured-logging module + its generated config live in the
-# container. The module dir goes on PYTHONPATH so vLLM can import the formatter
-# the dictConfig references (``vllm_logging.JsonFormatter``).
-_LOG_MODULE_DIR = "/opt/mal_logging"
-_LOG_CONFIG_PATH = "/tmp/vllm_logging.json"
 # Weights and the vLLM compile cache are shared across every provider app, so a
 # model pulled once is warm for all subsequent deploys and containers.
 hf_cache_vol = modal.Volume.from_name("huggingface-cache", create_if_missing=True)
 vllm_cache_vol = modal.Volume.from_name("vllm-cache", create_if_missing=True)
-# Baseline image shared by every text model. Multimodal models extend it via
-# ``ModelConfig.extra_pip`` (see ``build_image``).
 _BASE_ENV = {
     "HF_HUB_CACHE": HF_CACHE_PATH,
     "HF_XET_HIGH_PERFORMANCE": "1",  # faster weight downloads
     "VLLM_LOG_STATS_INTERVAL": "1",
-    # Verbosity of vLLM's own loggers (throughput/cache stats, request logs).
-    "VLLM_LOGGING_LEVEL": LOG_LEVEL,
-    # Persist torch.compile + CUDA-graph artifacts on the shared vLLM cache
-    # Volume (mounted at VLLM_CACHE_PATH). The first container compiles; every
-    # later cold start replays the cached graphs instead of recompiling, so we
-    # keep CUDA graphs (throughput) without paying their capture cost each boot.
     "VLLM_CACHE_ROOT": VLLM_CACHE_PATH,
 }
@@ -146,28 +101,6 @@ def build_image(cfg: ModelConfig) -> modal.Image:
     else:
         image = image.uv_pip_install(f"vllm=={cfg.vllm_version or VLLM_VERSION}")
     image = image.env(_BASE_ENV)
-    if JSON_LOGS:
-        # Ship the stdlib JSON formatter and put it on PYTHONPATH so vLLM can
-        # import it when it applies the dictConfig. ``serve()`` writes the config
-        # file and points VLLM_LOGGING_CONFIG_PATH at it. Baking the toggle into
-        # the image env is what lets the (deploy-time) flag reach the container.
-        from pathlib import Path
-        image = (
-            image.add_local_file(
-                Path(__file__).with_name("vllm_logging.py"),
-                f"{_LOG_MODULE_DIR}/vllm_logging.py",
-                copy=True,
-            )
-            .env({"PYTHONPATH": _LOG_MODULE_DIR})
-            .env({"MODAL_LLM_JSON_LOGS": "1", "MODAL_LLM_LOG_LEVEL": LOG_LEVEL})
-        )
-    if cfg.gpu_snapshot:
-        # Snapshot prerequisites: VLLM_SERVER_DEV_MODE exposes the /sleep and
-        # /wake_up endpoints the snapshot lifecycle drives, and single-threaded
-        # inductor compilation keeps torch.compile artifacts snapshot-safe
-        # (Modal's documented vLLM + GPU-snapshot recipe).
-        image = image.env({"VLLM_SERVER_DEV_MODE": "1", "TORCHINDUCTOR_COMPILE_THREADS": "1"})
     if cfg.extra_pip:
         image = image.uv_pip_install(*cfg.extra_pip)
     if cfg.env:
@@ -175,20 +108,6 @@ def build_image(cfg: ModelConfig) -> modal.Image:
     return image
-def _resolve_precision(override: str, model_value: str | None) -> str | None:
-    """Effective precision flag: a deploy-time *override* wins over *model_value*.
-    A disable token in the override (``none``/``off``/``bf16``/…) returns ``None`` so
-    the caller omits the flag and vLLM keeps full / model-default precision; an empty
-    override falls back to the per-model value. Reads its inputs as arguments (the
-    callers pass the module globals) so tests can monkeypatch ``QUANTIZATION`` /
-    ``KV_CACHE_DTYPE`` and see the change without reimporting.
-    """
-    if override:
-        return None if override.lower() in _PRECISION_DISABLE else override
-    return model_value
 def build_command(cfg: ModelConfig) -> list[str]:
     """Assemble the ``vllm serve`` argv for a model. Returned as a list so we can
     launch with ``subprocess.Popen`` without a shell (no quoting pitfalls)."""
@@ -213,31 +132,6 @@ def build_command(cfg: ModelConfig) -> list[str]:
         cmd += ["--max-model-len", str(cfg.max_model_len)]
     if cfg.trust_remote_code:
         cmd += ["--trust-remote-code"]
-    # Precision / quantization. A deploy-time env override (QUANTIZATION /
-    # KV_CACHE_DTYPE) wins over the per-model ModelConfig field; both default to
-    # full precision (no flag). On-the-fly FP8 needs Ada/Hopper + arch support.
-    quantization = _resolve_precision(QUANTIZATION, cfg.quantization)
-    if quantization:
-        cmd += ["--quantization", quantization]
-    kv_cache_dtype = _resolve_precision(KV_CACHE_DTYPE, cfg.kv_cache_dtype)
-    # FP8 KV cache is incompatible with sleep-mode/snapshot models on the pinned
-    # vLLM: the wake path runs init_fp8_kv_scales() over a post-sleep KV cache that
-    # is a *list* of per-layer tensors, not one tensor, so cache_tensor.zero_()
-    # throws and /wake_up 500s (every snapshot restore dies). Snapshot is a
-    # structural per-model decision; the KV dtype is a deploy knob — so snapshot
-    # wins. Drop the flag and warn loudly rather than ship an endpoint that boots
-    # but can never wake. Weight --quantization is unaffected (different code path).
-    if kv_cache_dtype and cfg.gpu_snapshot and kv_cache_dtype.lower().startswith("fp8"):
-        print(
-            f"⚠️  {cfg.endpoint_name}: dropping --kv-cache-dtype {kv_cache_dtype} — "
-            "FP8 KV cache crashes the snapshot wake path on the pinned vLLM (see ADR-0031). "
-            "Serving with full-precision KV cache. Drop gpu_snapshot to keep FP8 KV cache.",
-            flush=True,
-        )
-        kv_cache_dtype = None
-    if kv_cache_dtype:
-        cmd += ["--kv-cache-dtype", kv_cache_dtype]
-    # Performance / throughput knobs (all data-driven from ModelConfig).
     if cfg.gpu_memory_utilization is not None:
         cmd += ["--gpu-memory-utilization", str(cfg.gpu_memory_utilization)]
     # Prefix caching reuses the KV cache for shared prompt prefixes. In a
@@ -248,21 +142,10 @@ def build_command(cfg: ModelConfig) -> list[str]:
         cmd += ["--async-scheduling"]
     if cfg.enforce_eager:
         cmd += ["--enforce-eager"]
-    if cfg.max_num_seqs:
-        cmd += ["--max-num-seqs", str(cfg.max_num_seqs)]
-    if cfg.max_num_batched_tokens:
-        cmd += ["--max-num-batched-tokens", str(cfg.max_num_batched_tokens)]
     # Observability: log each incoming request (id, params, token counts) so the
-    # Modal logs show what's actually being served. Bound the logged prompt length
-    # by default so a long context can't blow up the log line.
     if cfg.log_requests:
         cmd += ["--enable-log-requests"]
-    if cfg.log_outputs:
-        cmd += ["--enable-log-outputs"]
-    if cfg.max_log_len is not None:
-        cmd += ["--max-log-len", str(cfg.max_log_len)]
-    if not cfg.uvicorn_access_log:
-        cmd += ["--disable-uvicorn-access-log"]
     if cfg.reasoning_parser:
         cmd += ["--reasoning-parser", cfg.reasoning_parser]
     if cfg.enable_auto_tool_choice:
@@ -271,10 +154,6 @@ def build_command(cfg: ModelConfig) -> list[str]:
         cmd += ["--tool-call-parser", cfg.tool_call_parser]
     if cfg.mm_limits:
         cmd += ["--limit-mm-per-prompt", json.dumps(cfg.mm_limits)]
-    if cfg.gpu_snapshot:
-        # Sleep mode lets the snapshot lifecycle offload weights to host RAM
-        # (sleep level 1) before the memory snapshot is taken, then wake on restore.
-        cmd += ["--enable-sleep-mode"]
     cmd += list(cfg.extra_vllm_args)
     return cmd
@@ -282,16 +161,11 @@ def build_command(cfg: ModelConfig) -> list[str]:
 # --- Endpoint registration ------------------------------------------------------
-def register_model(app: modal.App, cfg: ModelConfig) -> modal.Function | type:
     """Attach one model to ``app`` as an autoscaling, OpenAI-compatible endpoint.
-    Dispatches on ``cfg.gpu_snapshot``: the default path is a serialized
-    ``@app.function`` web server; snapshot models use a class-based lifecycle
-    (load → warm up → sleep → snapshot) so later cold starts restore in seconds
-    instead of re-paying download + load + warmup. Both paths publish the same
-    URL shape (``…--<app>-<endpoint_name>.modal.run``), so clients can't tell
-    them apart.
     Everything is serialized (the prebuilt ``vllm serve`` argv is shipped to the
     container), which lets us register many distinctly-named endpoints from a
     simple loop without each needing a hand-written module-level function.
@@ -311,23 +185,11 @@ def register_model(app: modal.App, cfg: ModelConfig) -> modal.Function | type:
     if KEEP_WARM and cfg.profile:
         min_containers = max(min_containers, KEEP_WARM)
-    # Autoscale at the target, but let a hot container absorb a burst up to the
-    # hard max before another cold-starts (Modal high-perf-inference guidance).
-    # Default the target to ~75% of the ceiling so we scale out before saturating.
-    target_inputs = cfg.target_concurrent_inputs or max(1, (cfg.max_concurrent_inputs * 3) // 4)
-    if cfg.gpu_snapshot:
-        return _register_snapshot_model(
-            app,
-            cfg,
-            image=image,
-            cmd=cmd,
-            secrets=secrets,
-            min_containers=min_containers,
-            target_inputs=target_inputs,
-        )
-    function_kwargs = dict(
         name=cfg.endpoint_name,
         image=image,
         gpu=cfg.gpu,
@@ -338,169 +200,18 @@ def register_model(app: modal.App, cfg: ModelConfig) -> modal.Function | type:
         timeout=cfg.request_timeout,
         serialized=True,
     )
-    # Pre-warm spare containers under load for bursty traffic (opt-in per model).
-    if cfg.buffer_containers:
-        function_kwargs["buffer_containers"] = cfg.buffer_containers
-    @app.function(**function_kwargs)
     @modal.concurrent(max_inputs=cfg.max_concurrent_inputs, target_inputs=target_inputs)
     @modal.web_server(port=VLLM_PORT, startup_timeout=cfg.startup_timeout)
     def serve():
-        import os
         import subprocess
-        env = dict(os.environ)
-        # When structured logging is on, generate the dictConfig file and point
-        # vLLM at it. Done at container start (not build) so the level is picked
-        # up from the env without rebuilding the image.
-        if env.get("MODAL_LLM_JSON_LOGS", "").lower() in ("1", "true", "yes"):
-            import vllm_logging
-            vllm_logging.write_config(_LOG_CONFIG_PATH, level=env.get("MODAL_LLM_LOG_LEVEL", "INFO"))
-            env["VLLM_LOGGING_CONFIG_PATH"] = _LOG_CONFIG_PATH
         # vLLM serves the OpenAI REST API on VLLM_PORT; Modal exposes it publicly.
-        subprocess.Popen(cmd, env=env)
     return serve
-def _class_name(slug: str) -> str:
-    """Modal class name for an endpoint slug: ``nemotron-3-nano-4b`` → ``Nemotron3Nano4b``."""
-    return "".join(part.capitalize() for part in slug.replace("_", "-").split("-") if part) or "SnapshotServer"
-def _register_snapshot_model(
-    app: modal.App,
-    cfg: ModelConfig,
-    *,
-    image: modal.Image,
-    cmd: list[str],
-    secrets: list[modal.Secret],
-    min_containers: int,
-    target_inputs: int,
-) -> type:
-    """Snapshot serving path — Modal's vLLM + GPU-memory-snapshot recipe.
-    First boot: start vLLM, wait for the port, run a few warmup completions so
-    compiled artifacts and caches are resident, put the engine to sleep (weights
-    offloaded to host RAM, KV cache dropped), and let Modal snapshot the
-    container (CPU + GPU state). Every later cold start restores the snapshot
-    and wakes the engine — seconds instead of minutes. The web URL label is
-    pinned to ``<app>-<endpoint_name>`` so the public URL is identical to the
-    plain function path (``…--<app>-<endpoint_name>.modal.run``) the catalogue's
-    ``endpoint_url`` builds. A ``@modal.web_server`` ``label`` becomes the URL as
-    ``<workspace>--<label>.modal.run`` *without* the app prefix Modal adds to a
-    plain function's URL, so the app name must be folded into the label by hand
-    or snapshot models answer at the wrong host (``…--<endpoint_name>``).
-    """
-    served_name = cfg.served_name
-    # Helpers are nested (not module-level) on purpose: the class ships to the
-    # container via cloudpickle (``serialized=True``), and closures are pickled
-    # by value — a module-level helper would be pickled by reference to the
-    # ``service`` module, which doesn't exist inside the container.
-    def _headers() -> dict[str, str]:
-        import os
-        key = os.environ.get("VLLM_API_KEY")
-        return {"Authorization": f"Bearer {key}"} if key else {}
-    def _wait_ready(proc) -> None:
-        # vLLM opens the port only once the engine is initialized, so a
-        # successful connect means "ready", not just "listening".
-        import socket
-        import time
-        while True:
-            try:
-                socket.create_connection(("localhost", VLLM_PORT), timeout=1).close()
-                return
-            except OSError:
-                if proc.poll() is not None:
-                    raise RuntimeError(f"vllm exited with code {proc.returncode}")
-                time.sleep(0.2)
-    def _post(path: str, json_body: dict | None = None, timeout: float = 300.0) -> None:
-        import requests  # vLLM dependency, always present in the image
-        url = f"http://localhost:{VLLM_PORT}{path}"
-        requests.post(url, headers=_headers(), json=json_body, timeout=timeout).raise_for_status()
-    class _SnapshotServer:
-        @modal.enter(snap=True)
-        def start(self):
-            import os
-            import subprocess
-            env = dict(os.environ)
-            # Same structured-logging hook as the plain path (see ``serve``).
-            if env.get("MODAL_LLM_JSON_LOGS", "").lower() in ("1", "true", "yes"):
-                import vllm_logging
-                vllm_logging.write_config(_LOG_CONFIG_PATH, level=env.get("MODAL_LLM_LOG_LEVEL", "INFO"))
-                env["VLLM_LOGGING_CONFIG_PATH"] = _LOG_CONFIG_PATH
-            self.vllm_proc = subprocess.Popen(cmd, env=env)
-            _wait_ready(self.vllm_proc)
-            # Touch the full serving path so compile/caching work happens *before*
-            # the snapshot rather than on the first real request after restore.
-            warmup = {
-                "model": served_name,
-                "messages": [{"role": "user", "content": "Who tends the wood?"}],
-                "max_tokens": 8,
-            }
-            for _ in range(3):
-                _post("/v1/chat/completions", json_body=warmup)
-            # Offload weights to host RAM (sleep level 1); Modal snapshots the
-            # container right after the snap=True enters return.
-            _post("/sleep?level=1", timeout=120.0)
-        @modal.enter(snap=False)
-        def wake(self):
-            # Runs after every restore (and on the snapshot-creating boot itself,
-            # which simply resumes serving): reload weights onto the GPU.
-            _post("/wake_up", timeout=120.0)
-            _wait_ready(self.vllm_proc)
-        @modal.web_server(port=VLLM_PORT, startup_timeout=cfg.startup_timeout, label=f"{app.name}-{cfg.endpoint_name}")
-        def serve(self):
-            pass  # vLLM (already running) is the web server; Modal just exposes the port.
-        @modal.exit()
-        def stop(self):
-            proc = getattr(self, "vllm_proc", None)
-            if proc is not None:
-                proc.terminate()
-    # One Modal class per model, named after the endpoint (App.cls has no name
-    # override, so rename the type before decorating).
-    name = _class_name(cfg.endpoint_name)
-    _SnapshotServer.__name__ = name
-    _SnapshotServer.__qualname__ = name
-    cls_kwargs = dict(
-        image=image,
-        gpu=cfg.gpu,
-        volumes={HF_CACHE_PATH: hf_cache_vol, VLLM_CACHE_PATH: vllm_cache_vol},
-        secrets=secrets,
-        scaledown_window=cfg.scaledown_window,
-        min_containers=min_containers,
-        timeout=cfg.request_timeout,
-        # Bounds the whole snap=True phase (download + load + warmup + sleep).
-        startup_timeout=cfg.startup_timeout,
-        serialized=True,
-        enable_memory_snapshot=True,
-        # GPU snapshots are Modal-alpha; scoped per model via cfg.gpu_snapshot.
-        experimental_options={"enable_gpu_snapshot": True},
-    )
-    if cfg.buffer_containers:
-        cls_kwargs["buffer_containers"] = cfg.buffer_containers
-    concurrent = modal.concurrent(max_inputs=cfg.max_concurrent_inputs, target_inputs=target_inputs)
-    return app.cls(**cls_kwargs)(concurrent(_SnapshotServer))
 def register_all(app: modal.App, configs: Iterable[ModelConfig]) -> None:
     """Register every model in ``configs`` onto ``app``."""
     for cfg in configs:

 """Reusable, OpenAI-compatible model-serving layer for Modal.
+This module is provider-agnostic. It takes a single ``ModelConfig`` and turns it
+into a serverless, autoscaling, OpenAI-compatible HTTP endpoint backed by vLLM.
+Each provider app (``app_nvidia.py``, ``app_openbmb.py``, ``app_google.py``)
+imports :func:`register_all` and wires up its own models, so providers stay
+isolated in their own Modal apps while sharing one serving path.
+This is Modal's canonical vLLM recipe, kept deliberately small: an autoscaling
+``@app.function`` whose body launches ``vllm serve`` as a subprocess behind a
+``@modal.web_server``. Everything that shapes a model (GPU, context length,
+parsers, multimodal limits, extra flags) lives in data — the ``ModelConfig`` —
+not in code, so adding a model is one entry in ``catalogue.py``.
+The served endpoints speak the OpenAI REST API (``/v1/chat/completions``,
 ``/v1/completions``, ``/v1/models``), so any OpenAI-compatible client can call
 them by pointing ``base_url`` at the deployed URL.
 """
 # --- Shared serving constants --------------------------------------------------
+# Pin the inference stack so deploys are reproducible. Bump deliberately. This is
+# the version Modal's current vLLM example ships with.
 VLLM_VERSION = "0.21.0"
 CUDA_IMAGE = "nvidia/cuda:12.9.0-devel-ubuntu22.04"
+PYTHON_VERSION = "3.12"
 # The in-container port vLLM listens on; Modal maps it to a public HTTPS URL.
 VLLM_PORT = 8000
 VLLM_CACHE_PATH = "/root/.cache/vllm"
 # Name of the Modal Secret that holds a Hugging Face token (key: HF_TOKEN).
+# Required only for gated repos. Create it once with:
 #   modal secret create huggingface-secret HF_TOKEN=hf_...
 HF_SECRET_NAME = "huggingface-secret"
+# Name of the Modal Secret holding the bearer token clients must present. The key
+# MUST be VLLM_API_KEY — vLLM reads that env var and then enforces
 # `Authorization: Bearer <token>` on every request. Create it once with:
 #   modal secret create llm-api-key VLLM_API_KEY=sk-...
 API_KEY_SECRET_NAME = "llm-api-key"
 #   MODAL_LLM_REQUIRE_AUTH=1 modal deploy modal/app_google.py
 # When enabled, every endpoint mounts API_KEY_SECRET_NAME and rejects requests
 # without a valid bearer token. Off by default (endpoints are then public).
+REQUIRE_API_KEY = os.environ.get("MODAL_LLM_REQUIRE_AUTH", "").lower() in ("1", "true", "yes")
 # Demo-day switch: keep N containers warm for every *profile-bound* model (the
+# tiers the cast actually runs on), removing their cold starts for the duration
+# of the deploy. Specialists keep scale-to-zero. Costs GPU-hours while deployed —
+# turn it on right before a live demo, redeploy without it after:
 #   MODAL_LLM_KEEP_WARM=1 modal deploy modal/app_nvidia.py
 KEEP_WARM = int(os.environ.get("MODAL_LLM_KEEP_WARM", "0") or "0")
 # Weights and the vLLM compile cache are shared across every provider app, so a
 # model pulled once is warm for all subsequent deploys and containers.
 hf_cache_vol = modal.Volume.from_name("huggingface-cache", create_if_missing=True)
 vllm_cache_vol = modal.Volume.from_name("vllm-cache", create_if_missing=True)
+# Baseline image env shared by every model. Persisting the torch.compile + CUDA
+# graph cache on the shared vLLM Volume means only the first container compiles;
+# later cold starts replay the cached graphs instead of recapturing them.
 _BASE_ENV = {
     "HF_HUB_CACHE": HF_CACHE_PATH,
     "HF_XET_HIGH_PERFORMANCE": "1",  # faster weight downloads
     "VLLM_LOG_STATS_INTERVAL": "1",
     "VLLM_CACHE_ROOT": VLLM_CACHE_PATH,
 }
     else:
         image = image.uv_pip_install(f"vllm=={cfg.vllm_version or VLLM_VERSION}")
     image = image.env(_BASE_ENV)
     if cfg.extra_pip:
         image = image.uv_pip_install(*cfg.extra_pip)
     if cfg.env:
     return image
 def build_command(cfg: ModelConfig) -> list[str]:
     """Assemble the ``vllm serve`` argv for a model. Returned as a list so we can
     launch with ``subprocess.Popen`` without a shell (no quoting pitfalls)."""
         cmd += ["--max-model-len", str(cfg.max_model_len)]
     if cfg.trust_remote_code:
         cmd += ["--trust-remote-code"]
     if cfg.gpu_memory_utilization is not None:
         cmd += ["--gpu-memory-utilization", str(cfg.gpu_memory_utilization)]
     # Prefix caching reuses the KV cache for shared prompt prefixes. In a
         cmd += ["--async-scheduling"]
     if cfg.enforce_eager:
         cmd += ["--enforce-eager"]
     # Observability: log each incoming request (id, params, token counts) so the
+    # Modal logs show what's actually being served.
     if cfg.log_requests:
         cmd += ["--enable-log-requests"]
     if cfg.reasoning_parser:
         cmd += ["--reasoning-parser", cfg.reasoning_parser]
     if cfg.enable_auto_tool_choice:
         cmd += ["--tool-call-parser", cfg.tool_call_parser]
     if cfg.mm_limits:
         cmd += ["--limit-mm-per-prompt", json.dumps(cfg.mm_limits)]
     cmd += list(cfg.extra_vllm_args)
     return cmd
 # --- Endpoint registration ------------------------------------------------------
+def register_model(app: modal.App, cfg: ModelConfig) -> modal.Function:
     """Attach one model to ``app`` as an autoscaling, OpenAI-compatible endpoint.
+    A single serialized ``@app.function`` web server launches ``vllm serve`` as a
+    subprocess; Modal exposes its port at ``…--<app>-<endpoint_name>.modal.run``.
     Everything is serialized (the prebuilt ``vllm serve`` argv is shipped to the
     container), which lets us register many distinctly-named endpoints from a
     simple loop without each needing a hand-written module-level function.
     if KEEP_WARM and cfg.profile:
         min_containers = max(min_containers, KEEP_WARM)
+    # Autoscale at ~75% of the ceiling, but let a hot container absorb a burst up
+    # to the hard max before another cold-starts (Modal high-perf guidance).
+    target_inputs = max(1, (cfg.max_concurrent_inputs * 3) // 4)
+    @app.function(
         name=cfg.endpoint_name,
         image=image,
         gpu=cfg.gpu,
         timeout=cfg.request_timeout,
         serialized=True,
     )
     @modal.concurrent(max_inputs=cfg.max_concurrent_inputs, target_inputs=target_inputs)
     @modal.web_server(port=VLLM_PORT, startup_timeout=cfg.startup_timeout)
     def serve():
         import subprocess
         # vLLM serves the OpenAI REST API on VLLM_PORT; Modal exposes it publicly.
+        # Inherits the container env (HF cache, vLLM cache, any secrets).
+        subprocess.Popen(cmd)
     return serve
 def register_all(app: modal.App, configs: Iterable[ModelConfig]) -> None:
     """Register every model in ``configs`` onto ``app``."""
     for cfg in configs:
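
The simplified `build_command` and the `target_inputs` default can be illustrated with a small standalone sketch. `DemoConfig` is a hypothetical stand-in for `ModelConfig` that models only a handful of the fields visible in the diff. The `vllm serve <model>` prefix and the `is not None` check on `max_model_len` are assumptions, since the hunk does not show how the real function opens.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class DemoConfig:
    """Hypothetical stand-in for ModelConfig; only a few fields are modelled."""

    model: str
    max_model_len: int | None = None
    trust_remote_code: bool = False
    gpu_memory_utilization: float | None = None
    log_requests: bool = True
    extra_vllm_args: tuple[str, ...] = ()
    max_concurrent_inputs: int = 32


def demo_command(cfg: DemoConfig) -> list[str]:
    """Assemble an argv list; each flag is emitted only when its field is set."""
    cmd = ["vllm", "serve", cfg.model]  # assumed prefix, not shown in the diff
    if cfg.max_model_len is not None:
        cmd += ["--max-model-len", str(cfg.max_model_len)]
    if cfg.trust_remote_code:
        cmd += ["--trust-remote-code"]
    if cfg.gpu_memory_utilization is not None:
        cmd += ["--gpu-memory-utilization", str(cfg.gpu_memory_utilization)]
    if cfg.log_requests:
        cmd += ["--enable-log-requests"]
    cmd += list(cfg.extra_vllm_args)  # escape hatch for anything not modelled
    return cmd


def demo_target_inputs(max_concurrent_inputs: int) -> int:
    """Same arithmetic as the diff: about 75% of the ceiling, never below 1."""
    return max(1, (max_concurrent_inputs * 3) // 4)


if __name__ == "__main__":
    cfg = DemoConfig(model="org/model", max_model_len=8192)
    print(demo_command(cfg))
    print(demo_target_inputs(cfg.max_concurrent_inputs))
```

Returning a list rather than a string matches the docstring's point about launching with `subprocess.Popen` without a shell. The integer arithmetic means a ceiling of 1 still yields a target of 1 rather than 0.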

modal/vllm_logging.py DELETED

@@ -1,118 +0,0 @@
-"""Structured (JSON) logging for the vLLM subprocess — stdlib only.
-vLLM applies a standard :func:`logging.config.dictConfig` when the
-``VLLM_LOGGING_CONFIG_PATH`` env var points at a JSON file (see vLLM's
-``envs.py``). This module builds that config and ships the :class:`JsonFormatter`
-it references, so one importable module serves both sides:
-  * :func:`write_config` — called by ``service.serve()`` to drop the JSON config
-    file into the container before launching ``vllm serve``; and
-  * :class:`JsonFormatter` — imported *by name* from the JSON config when vLLM
-    runs ``dictConfig`` in its own process.
-For the second to work, this file is added to the container image and its
-directory is placed on ``PYTHONPATH`` (see ``service.build_image``). Keeping it
-**dependency-free** (no ``python-json-logger`` etc.) means there is no extra
-wheel to install and no import path that can drift between versions — vLLM only
-needs the stdlib plus this one file.
-One JSON object is emitted per log line: ``ts``, ``level``, ``logger``, ``msg``,
-the source ``module:lineno``, and any structured extras attached to the record
-(vLLM threads request ids and token counts through these). Output stays on
-stdout so Modal captures it like every other container log.
-"""
-from __future__ import annotations
-import json
-import logging
-# Standard LogRecord attributes — everything here is either folded into a fixed
-# JSON key below or deliberately dropped. Anything *else* on the record is a
-# caller-supplied extra (e.g. a request id) and is included verbatim.
-_RESERVED: frozenset[str] = frozenset(
-    {
-        "args",
-        "asctime",
-        "created",
-        "exc_info",
-        "exc_text",
-        "filename",
-        "funcName",
-        "levelname",
-        "levelno",
-        "lineno",
-        "module",
-        "msecs",
-        "message",
-        "msg",
-        "name",
-        "pathname",
-        "process",
-        "processName",
-        "relativeCreated",
-        "stack_info",
-        "taskName",
-        "thread",
-        "threadName",
-    }
-)
-class JsonFormatter(logging.Formatter):
-    """Render each log record as a single compact JSON line.
-    Referenced from the dictConfig by dotted path (``vllm_logging.JsonFormatter``),
-    so it must stay importable under that name in the container.
-    """
-    def format(self, record: logging.LogRecord) -> str:
-        data: dict[str, object] = {
-            "ts": self.formatTime(record, self.datefmt),
-            "level": record.levelname,
-            "logger": record.name,
-            "msg": record.getMessage(),
-            "src": f"{record.module}:{record.lineno}",
-        }
-        if record.exc_info:
-            data["exc"] = self.formatException(record.exc_info)
-        # Fold in any structured extras (request_id, token counts, ...). Values
-        # that aren't JSON-serialisable fall back to repr so a stray object can
-        # never crash the logging path.
-        for key, value in record.__dict__.items():
-            if key in _RESERVED or key.startswith("_"):
-                continue
-            try:
-                json.dumps(value)
-            except (TypeError, ValueError):
-                value = repr(value)
-            data[key] = value
-        return json.dumps(data, ensure_ascii=False, default=repr)
-def build_config(level: str = "INFO") -> dict:
-    """Return a ``logging.config.dictConfig`` that routes vLLM + uvicorn through
-    :class:`JsonFormatter` on stdout at ``level``."""
-    level = (level or "INFO").upper()
-    handler = {
-        "class": "logging.StreamHandler",
-        "formatter": "json",
-        "stream": "ext://sys.stdout",
-    }
-    logger = {"handlers": ["stdout"], "level": level, "propagate": False}
-    return {
-        "version": 1,
-        # Keep vLLM's own loggers; we only swap their formatting/handler.
-        "disable_existing_loggers": False,
-        "formatters": {"json": {"()": "vllm_logging.JsonFormatter"}},
-        "handlers": {"stdout": handler},
-        "loggers": {name: dict(logger) for name in ("vllm", "uvicorn", "uvicorn.access", "uvicorn.error")},
-        "root": {"handlers": ["stdout"], "level": level},
-    }
-def write_config(path: str, level: str = "INFO") -> str:
-    """Write the dictConfig JSON to ``path`` (for ``VLLM_LOGGING_CONFIG_PATH``)."""
-    with open(path, "w", encoding="utf-8") as fh:
-        json.dump(build_config(level), fh)
-    return path
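
Reviewer note: the extras-folding rule in the removed `JsonFormatter` (copy every non-standard record attribute, falling back to `repr` for values that are not JSON-serialisable) can be sketched with the standard library alone. This is a minimal illustration, not the project's formatter: `SketchJsonFormatter`, `_STANDARD`, and `render` are hypothetical names, and the standard-attribute set is derived from a throwaway `LogRecord` rather than the hand-written `_RESERVED` list above.

```python
import json
import logging

# Attribute names found on every LogRecord; anything else is treated as a
# caller-supplied extra. Derived from a throwaway record (an assumption of this
# sketch) and topped up with names that only appear after formatting.
_STANDARD = frozenset(
    vars(logging.LogRecord("x", logging.INFO, "x", 0, "", (), None))
) | {"message", "asctime", "taskName"}


class SketchJsonFormatter(logging.Formatter):
    """Hypothetical formatter: one JSON object per record, extras included."""

    def format(self, record: logging.LogRecord) -> str:
        data = {
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        for key, value in record.__dict__.items():
            if key in _STANDARD or key.startswith("_"):
                continue
            try:
                json.dumps(value)
            except (TypeError, ValueError):
                # A stray object must never crash the logging path.
                value = repr(value)
            data[key] = value
        return json.dumps(data, ensure_ascii=False)


def render(msg: str, **extra: object) -> dict:
    """Format one record carrying ``extra`` and parse the resulting JSON line."""
    record = logging.makeLogRecord(
        {"name": "demo", "levelname": "INFO", "msg": msg, **extra}
    )
    return json.loads(SketchJsonFormatter().format(record))
```

Calling `render("hi", request_id="r1")` yields a dict that carries `request_id` next to `level`, `logger`, and `msg`, while standard attributes such as `lineno` stay out of the payload.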

src/observability/logging_setup.py CHANGED

@@ -1,7 +1,6 @@
 """Root logging configuration — structured records to stdout and to the store.
-Generalises the dependency-free JSON formatter from ``modal/vllm_logging.py`` for
-the whole engine, and adds:
   * a :class:`_ContextFilter` that stamps every record with the bound
     run/turn/agent (see :mod:`src.observability.context`); and

 """Root logging configuration — structured records to stdout and to the store.
+Provides a dependency-free JSON formatter for the whole engine, and adds:
   * a :class:`_ContextFilter` that stamps every record with the bound
     run/turn/agent (see :mod:`src.observability.context`); and

tests/test_modal_build_command.py CHANGED

@@ -1,11 +1,9 @@
-"""Guard the precision flags ``build_command`` emits into the vLLM argv.
-Quantization is purely serving-side: it only adds ``--quantization`` /
-``--kv-cache-dtype`` to the ``vllm serve`` argv (the ``--served-model-name`` is
-unchanged, so the engine never notices). Two controls feed those flags — a
-per-model ``ModelConfig`` field and a deploy-time env override that wins over it
-— and these tests pin both, plus the force-disable token, since this is the first
-test to assert on ``build_command``'s output at all.
 ``modal/service.py`` does ``import modal`` and ``from catalogue import …``, so we
 load it exactly the way ``modal deploy`` does: with ``modal/`` on ``sys.path`` (the
@@ -16,6 +14,7 @@ binds the installed SDK, not the folder).
 from __future__ import annotations
 import importlib
 import sys
 from pathlib import Path
@@ -37,109 +36,105 @@ def _make(service, **kwargs):
     return service.ModelConfig(name="acme/Tiny-1B", endpoint_name="tiny-1b", **kwargs)
-# ── per-model field ──────────────────────────────────────────────────────────
-def test_no_quantization_by_default(service):
-    cmd = service.build_command(_make(service))
-    assert "--quantization" not in cmd
-    assert "--kv-cache-dtype" not in cmd
-def test_per_model_quantization_emits_flag(service):
-    cmd = service.build_command(_make(service, quantization="fp8"))
-    assert cmd[cmd.index("--quantization") + 1] == "fp8"
-def test_per_model_kv_cache_dtype_emits_flag(service):
-    cmd = service.build_command(_make(service, kv_cache_dtype="fp8"))
-    assert cmd[cmd.index("--kv-cache-dtype") + 1] == "fp8"
-# ── deploy-time env override ───────────────────────────────────────────────────
-def test_env_override_beats_unset_model_field(service, monkeypatch):
-    monkeypatch.setattr(service, "QUANTIZATION", "fp8")
-    cmd = service.build_command(_make(service))  # model field is None
-    assert cmd[cmd.index("--quantization") + 1] == "fp8"
-def test_env_override_beats_model_field(service, monkeypatch):
-    monkeypatch.setattr(service, "QUANTIZATION", "awq")
-    cmd = service.build_command(_make(service, quantization="fp8"))
-    assert cmd[cmd.index("--quantization") + 1] == "awq"
-@pytest.mark.parametrize("token", ["none", "off", "bf16", "AUTO"])
-def test_disable_token_forces_full_precision(service, monkeypatch, token):
-    # A model that defaults to fp8 is overridden back to no flag at deploy time.
-    monkeypatch.setattr(service, "QUANTIZATION", token)
-    cmd = service.build_command(_make(service, quantization="fp8"))
-    assert "--quantization" not in cmd
-def test_kv_cache_env_override(service, monkeypatch):
-    monkeypatch.setattr(service, "KV_CACHE_DTYPE", "fp8")
-    cmd = service.build_command(_make(service))
-    assert cmd[cmd.index("--kv-cache-dtype") + 1] == "fp8"
-# ── FP8 KV cache × snapshot incompatibility (vLLM wake-path crash) ─────────────
-def test_fp8_kv_cache_dropped_for_snapshot_models(service):
-    # FP8 KV cache crashes the /wake_up path on snapshot models, so the flag is
-    # suppressed when gpu_snapshot is set — the endpoint serves with full-precision
-    # KV cache rather than booting into a state it can never wake from.
-    cmd = service.build_command(_make(service, kv_cache_dtype="fp8", gpu_snapshot=True))
-    assert "--kv-cache-dtype" not in cmd
-    # The snapshot flag itself still wins and is emitted.
-    assert "--enable-sleep-mode" in cmd
-def test_fp8_kv_cache_env_override_dropped_for_snapshot_models(service, monkeypatch):
-    # The global deploy override is the common trigger: it lands on every model in
-    # the app, including snapshot ones, which must still drop it.
-    monkeypatch.setattr(service, "KV_CACHE_DTYPE", "fp8")
-    cmd = service.build_command(_make(service, gpu_snapshot=True))
-    assert "--kv-cache-dtype" not in cmd
-def test_fp8_variant_kv_cache_dropped_for_snapshot_models(service):
-    # Every fp8 variant hits init_fp8_kv_scales, so fp8_e5m2 is dropped too.
-    cmd = service.build_command(_make(service, kv_cache_dtype="fp8_e5m2", gpu_snapshot=True))
-    assert "--kv-cache-dtype" not in cmd
-def test_non_fp8_kv_cache_kept_for_snapshot_models(service):
-    # The guard only fires on fp8; a non-fp8 dtype passes through even with snapshot.
-    cmd = service.build_command(_make(service, kv_cache_dtype="auto", gpu_snapshot=True))
-    assert cmd[cmd.index("--kv-cache-dtype") + 1] == "auto"
-def test_fp8_kv_cache_kept_for_non_snapshot_models(service):
-    # Without snapshot there's no wake path, so FP8 KV cache stays.
-    cmd = service.build_command(_make(service, kv_cache_dtype="fp8", gpu_snapshot=False))
-    assert cmd[cmd.index("--kv-cache-dtype") + 1] == "fp8"
 # ── deploy script wiring ───────────────────────────────────────────────────────
-def test_deploy_script_propagates_quantization_env():
     sys.path.insert(0, str(Path(__file__).resolve().parents[1] / "scripts"))
     deploy_modal = importlib.import_module("deploy_modal")
     from argparse import Namespace
-    base = dict(keep_warm=False, auth=False, json_logs=False, log_level="", kv_cache_dtype=None)
-    env_fp8 = deploy_modal._env_for(Namespace(quantization="fp8", **base))
-    assert env_fp8["MODAL_LLM_QUANTIZATION"] == "fp8"
-    # ``--quantization none`` (force full precision) is still propagated, not dropped.
-    env_none = deploy_modal._env_for(Namespace(quantization="none", **base))
-    assert env_none["MODAL_LLM_QUANTIZATION"] == "none"
-    # Unset → the env var is left alone (so a model's own default stands).
-    env_unset = deploy_modal._env_for(Namespace(quantization=None, **base))
-    assert "MODAL_LLM_QUANTIZATION" not in env_unset

+"""Guard the ``vllm serve`` argv that ``build_command`` emits.
+The serving layer turns one ``ModelConfig`` into the argv launched inside the
+container, so these tests pin the mapping from config fields to vLLM flags: the
+always-present identity flags, the data-driven toggles (parsers, eager, prefix
+caching), and the ``extra_vllm_args`` escape hatch.
 ``modal/service.py`` does ``import modal`` and ``from catalogue import …``, so we
 load it exactly the way ``modal deploy`` does: with ``modal/`` on ``sys.path`` (the
 from __future__ import annotations
 import importlib
+import json
 import sys
 from pathlib import Path
     return service.ModelConfig(name="acme/Tiny-1B", endpoint_name="tiny-1b", **kwargs)
+def _flag_value(cmd: list[str], flag: str) -> str:
+    """The argument that follows ``flag`` in the argv."""
+    return cmd[cmd.index(flag) + 1]
+# ── always-present identity flags ──────────────────────────────────────────────
+def test_serves_the_model_with_identity_flags(service):
+    cmd = service.build_command(_make(service))
+    assert cmd[:3] == ["vllm", "serve", "acme/Tiny-1B"]
+    # served-model-name defaults to the repo name (clients pass the repo id).
+    assert _flag_value(cmd, "--served-model-name") == "acme/Tiny-1B"
+    assert _flag_value(cmd, "--port") == str(service.VLLM_PORT)
+    assert _flag_value(cmd, "--tensor-parallel-size") == "1"
+def test_served_model_name_alias(service):
+    cmd = service.build_command(_make(service, served_model_name="acme/Tiny"))
+    assert _flag_value(cmd, "--served-model-name") == "acme/Tiny"
+    # but vLLM still loads the real repo (positional arg)
+    assert cmd[2] == "acme/Tiny-1B"
+# ── data-driven toggles ────────────────────────────────────────────────────────
+def test_prefix_caching_on_by_default_off_when_disabled(service):
+    assert "--enable-prefix-caching" in service.build_command(_make(service))
+    off = service.build_command(_make(service, enable_prefix_caching=False))
+    assert "--no-enable-prefix-caching" in off
+    assert "--enable-prefix-caching" not in off
+def test_optional_inference_flags_emitted(service):
+    cmd = service.build_command(
+        _make(
+            service,
+            max_model_len=8192,
+            trust_remote_code=True,
+            enforce_eager=True,
+            gpu_memory_utilization=0.9,
+        )
+    )
+    assert _flag_value(cmd, "--max-model-len") == "8192"
+    assert "--trust-remote-code" in cmd
+    assert "--enforce-eager" in cmd
+    assert _flag_value(cmd, "--gpu-memory-utilization") == "0.9"
+def test_async_scheduling_default_on_off_when_disabled(service):
+    assert "--async-scheduling" in service.build_command(_make(service))
+    assert "--async-scheduling" not in service.build_command(_make(service, async_scheduling=False))
+def test_parser_flags(service):
+    cmd = service.build_command(
+        _make(service, reasoning_parser="qwen3", tool_call_parser="hermes", enable_auto_tool_choice=True)
+    )
+    assert _flag_value(cmd, "--reasoning-parser") == "qwen3"
+    assert _flag_value(cmd, "--tool-call-parser") == "hermes"
+    assert "--enable-auto-tool-choice" in cmd
+    # None parsers emit nothing.
+    bare = service.build_command(_make(service))
+    assert "--reasoning-parser" not in bare
+    assert "--tool-call-parser" not in bare
+def test_mm_limits_serialized_as_json(service):
+    cmd = service.build_command(_make(service, mm_limits={"image": 0, "audio": 0}))
+    assert json.loads(_flag_value(cmd, "--limit-mm-per-prompt")) == {"image": 0, "audio": 0}
+def test_log_requests_default_on(service):
+    assert "--enable-log-requests" in service.build_command(_make(service))
+    assert "--enable-log-requests" not in service.build_command(_make(service, log_requests=False))
+# ── escape hatch ────────────────────────────────────────────────────────────────
+def test_extra_vllm_args_appended_verbatim(service):
+    cmd = service.build_command(_make(service, extra_vllm_args=("--quantization", "fp8")))
+    assert cmd[-2:] == ["--quantization", "fp8"]
 # ── deploy script wiring ───────────────────────────────────────────────────────
+def test_deploy_script_propagates_knob_envs():
     sys.path.insert(0, str(Path(__file__).resolve().parents[1] / "scripts"))
     deploy_modal = importlib.import_module("deploy_modal")
     from argparse import Namespace
+    env = deploy_modal._env_for(Namespace(keep_warm=True, auth=True))
+    assert env["MODAL_LLM_KEEP_WARM"] == "1"
+    assert env["MODAL_LLM_REQUIRE_AUTH"] == "1"
+    # Both off → neither env var is set (so endpoints stay public + scale-to-zero).
+    env_off = deploy_modal._env_for(Namespace(keep_warm=False, auth=False))
+    assert "MODAL_LLM_KEEP_WARM" not in env_off
+    assert "MODAL_LLM_REQUIRE_AUTH" not in env_off
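
Reviewer note: the new tests describe `build_command` only through its output, so the mapping they pin can be sketched as below. This is a guess at the shape, not the code in `modal/service.py`: the field defaults, the `tensor_parallel_size` field, the flag ordering, and the `VLLM_PORT` value are assumptions; only the flag names and the asserted behaviour come from the tests above.

```python
import json
from dataclasses import dataclass
from typing import Optional

VLLM_PORT = 8000  # assumed value; the real constant lives in modal/service.py


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical stand-in for the catalogue's ModelConfig."""

    name: str
    endpoint_name: str
    served_model_name: Optional[str] = None
    tensor_parallel_size: int = 1  # assumed field; the tests only pin "1"
    max_model_len: Optional[int] = None
    gpu_memory_utilization: Optional[float] = None
    trust_remote_code: bool = False
    enforce_eager: bool = False
    enable_prefix_caching: bool = True
    async_scheduling: bool = True
    reasoning_parser: Optional[str] = None
    tool_call_parser: Optional[str] = None
    enable_auto_tool_choice: bool = False
    mm_limits: Optional[dict] = None
    log_requests: bool = True
    extra_vllm_args: tuple = ()


def build_command(cfg: ModelConfig) -> list:
    """Turn one ModelConfig into a ``vllm serve`` argv (sketch)."""
    # Identity flags are always present; the positional arg is the real repo.
    cmd = [
        "vllm", "serve", cfg.name,
        "--served-model-name", cfg.served_model_name or cfg.name,
        "--port", str(VLLM_PORT),
        "--tensor-parallel-size", str(cfg.tensor_parallel_size),
    ]
    # Prefix caching is explicit in both directions.
    cmd.append(
        "--enable-prefix-caching" if cfg.enable_prefix_caching
        else "--no-enable-prefix-caching"
    )
    # Valued options are emitted only when set.
    if cfg.max_model_len is not None:
        cmd += ["--max-model-len", str(cfg.max_model_len)]
    if cfg.gpu_memory_utilization is not None:
        cmd += ["--gpu-memory-utilization", str(cfg.gpu_memory_utilization)]
    if cfg.reasoning_parser:
        cmd += ["--reasoning-parser", cfg.reasoning_parser]
    if cfg.tool_call_parser:
        cmd += ["--tool-call-parser", cfg.tool_call_parser]
    if cfg.mm_limits is not None:
        cmd += ["--limit-mm-per-prompt", json.dumps(cfg.mm_limits)]
    # Boolean toggles map to bare flags.
    toggles = (
        (cfg.trust_remote_code, "--trust-remote-code"),
        (cfg.enforce_eager, "--enforce-eager"),
        (cfg.async_scheduling, "--async-scheduling"),
        (cfg.enable_auto_tool_choice, "--enable-auto-tool-choice"),
        (cfg.log_requests, "--enable-log-requests"),
    )
    cmd += [flag for enabled, flag in toggles if enabled]
    # Escape hatch: appended last and verbatim.
    cmd += list(cfg.extra_vllm_args)
    return cmd
```

Appending `extra_vllm_args` last is what lets `test_extra_vllm_args_appended_verbatim` assert on `cmd[-2:]`; it also means precision flags such as `--quantization fp8` now travel through the escape hatch instead of dedicated config fields.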

tests/test_modal_endpoint_urls.py CHANGED

@@ -16,8 +16,6 @@ import importlib.util
 import sys
 from pathlib import Path
-import pytest
 _CATALOGUE_PATH = Path(__file__).resolve().parents[1] / "modal" / "catalogue.py"
 # Max length of a single DNS label (RFC 1035). The whole subdomain before

 import sys
 from pathlib import Path
 _CATALOGUE_PATH = Path(__file__).resolve().parents[1] / "modal" / "catalogue.py"
 # Max length of a single DNS label (RFC 1035). The whole subdomain before