build-small-hackathon/multi-agent-lab

agharsallah Codex committed 28 days ago

Commit 57b8237 (1 parent: 39ef592)

feat: add OpenAPI docs and opt-in bearer auth via env-var secret

- openapi.yaml: checked-in OpenAPI 3.1 spec for the served API surface
- docs/openapi.md: API reference (endpoints, auth, examples, client gen)
- service.py: optional API-key auth, enabled at deploy time with
MODAL_LLM_REQUIRE_AUTH=1; token supplied via the llm-api-key Modal Secret
as the VLLM_API_KEY env var (never hard-coded)
- client.py: read bearer token from LLM_API_KEY env var
- deploying.md / README / ADR-0005: document auth + OpenAPI

Endpoints also self-document at /docs and /openapi.json.

Co-Authored-By: Codex <codex@openai.com>

Files changed (7)

docs/adr/0005-modal-model-serving.md +6 -0
modal/README.md +5 -0
modal/client.py +3 -2
modal/docs/deploying.md +18 -5
modal/docs/openapi.md +100 -0
modal/openapi.yaml +258 -0
modal/service.py +23 -1

docs/adr/0005-modal-model-serving.md CHANGED

@@ -40,6 +40,12 @@ OpenAI-compatible endpoints (vLLM behind an autoscaling web server).
   touching the others or the engine.
 - Gated repos (Gemma, the Nemotron repos used here) require a Hugging Face token
   in the `huggingface-secret` Modal Secret; ungated models deploy without it.
+- Endpoints are public by default; bearer-token auth is opt-in at deploy time
+  (`MODAL_LLM_REQUIRE_AUTH=1`) and supplied via the `llm-api-key` Secret as the
+  `VLLM_API_KEY` env var — secrets are never hard-coded.
+- The API is OpenAI-compatible: each endpoint self-documents at `/docs` and
+  `/openapi.json`, and a checked-in `modal/openapi.yaml` (3.1) plus
+  `modal/docs/openapi.md` document the shared surface.
 - vLLM tool/reasoning parser names are version-specific and left conservative;
   enable per model once verified against the deployed vLLM version.
 - Modal's docs index is mirrored at `modal/docs/modal-llms.txt` and refreshed

modal/README.md CHANGED

@@ -16,12 +16,17 @@ modal/
   app_openbmb.py    App "openbmb-llms" — MiniCPM-o 4.5 + MiniCPM4.1-8B.
   app_google.py     App "google-llms"  — Gemma 4 26B + 12B.
   client.py         OpenAI-SDK smoke-test client for any endpoint.
+  openapi.yaml      Checked-in OpenAPI 3.1 spec for the served API surface.
   requirements.txt  Deploy/client tooling (vLLM lives in the container image).
   docs/
     deploying.md    Deploy, configure, auth, GPU sizing, engine integration.
+    openapi.md      API reference: endpoints, auth, examples, client generation.
     modal-llms.txt  In-repo mirror of Modal's docs index, kept updated.
 ```
+Each running endpoint also self-documents at `/docs` (Swagger UI) and
+`/openapi.json` (live spec). See [`docs/openapi.md`](docs/openapi.md).
 ## Models
 | Provider | App            | Model                                   | Endpoint name         | GPU     |

modal/client.py CHANGED

@@ -26,8 +26,9 @@ def main() -> None:
     from openai import OpenAI
-    # vLLM accepts any token unless an API key was configured on the server.
-    client = OpenAI(base_url=args.base_url, api_key=os.environ.get("MODAL_LLM_KEY", "EMPTY"))
+    # Bearer token from the env var (set LLM_API_KEY to the value of the
+    # `llm-api-key` Modal Secret). Any value works when the server has no auth.
+    client = OpenAI(base_url=args.base_url, api_key=os.environ.get("LLM_API_KEY", "EMPTY"))
     response = client.chat.completions.create(
         model=args.model,
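
The change above renames the client's token variable from `MODAL_LLM_KEY` to `LLM_API_KEY`, so a shell that still exports only the old name would fall through to the `"EMPTY"` placeholder. A minimal sketch of a backward-compatible lookup follows; `resolve_api_key` is a hypothetical helper and is not part of this commit.

```python
import os
from typing import Mapping


# Hypothetical helper, not part of the commit: prefer the new variable name,
# fall back to the pre-commit name, then to the "EMPTY" placeholder that the
# bundled client already uses when no token is set.
def resolve_api_key(env: Mapping[str, str] = os.environ) -> str:
    for name in ("LLM_API_KEY", "MODAL_LLM_KEY"):
        value = env.get(name, "").strip()
        if value:
            return value
    return "EMPTY"
```

The result could be passed as `api_key=resolve_api_key()` when constructing the `OpenAI` client.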

modal/docs/deploying.md CHANGED

@@ -86,12 +86,25 @@ Append one `ModelConfig` to the appropriate provider list in `registry.py`.
 ## Auth
-Modal web endpoints are public by default. To require a bearer token, either:
-- Set `VLLM_API_KEY` on the container (via a `modal.Secret`) so vLLM enforces
-  `Authorization: Bearer <key>`; or
-- Front the endpoint with Modal Proxy Auth Tokens
-  (see `docs/modal-llms.txt` → Proxy Auth Tokens).
+Modal web endpoints are public by default. Secrets are supplied as environment
+variables (never hard-coded). To require a bearer token:
+```bash
+# Key MUST be VLLM_API_KEY (vLLM reads it); value is the token clients send.
+modal secret create llm-api-key VLLM_API_KEY=sk-your-token
+# Turn auth on at deploy time — no code edits:
+MODAL_LLM_REQUIRE_AUTH=1 modal deploy modal/app_google.py
+```
+When `MODAL_LLM_REQUIRE_AUTH` is set, every endpoint mounts the `llm-api-key`
+secret as the `VLLM_API_KEY` env var and vLLM enforces `Authorization: Bearer
+<token>` (401 otherwise). Clients pass the same token (the bundled `client.py`
+reads it from `LLM_API_KEY`). Alternatively front endpoints with Modal Proxy
+Auth Tokens (see `docs/modal-llms.txt` → Proxy Auth Tokens).
+See [`openapi.md`](openapi.md) for the full API reference and the checked-in
+OpenAPI spec (`../openapi.yaml`).
 ## GPU sizing cheatsheet
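
After an auth-enabled deploy, it can be useful to confirm that the endpoint actually rejects unauthenticated requests. The sketch below uses only the standard library; `build_models_request` and `probe_status` are hypothetical helper names, and `base_url` is assumed to already include the `/v1` suffix.

```python
import urllib.error
import urllib.request
from typing import Optional


def build_models_request(base_url: str, token: Optional[str] = None) -> urllib.request.Request:
    """Build a GET /models request, optionally carrying a bearer token."""
    request = urllib.request.Request(base_url.rstrip("/") + "/models", method="GET")
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    return request


def probe_status(base_url: str, token: Optional[str] = None, timeout: float = 30.0) -> int:
    """Return the HTTP status code of GET /models (needs network access)."""
    try:
        with urllib.request.urlopen(build_models_request(base_url, token), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 401 when auth is on and the token is missing or wrong
```

Going by the Auth section above, `probe_status(url)` should return 401 on an auth-enabled deploy, and `probe_status(url, token)` should succeed with the token stored in the `llm-api-key` secret.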

modal/docs/openapi.md ADDED

@@ -0,0 +1,100 @@

+# OpenAPI / API reference
+Every deployed model speaks the **OpenAI REST protocol**, so the API surface is
+the familiar OpenAI one. There are two sources of truth:
+- **Live, per-model spec** — each running endpoint serves its own
+  auto-generated spec at `/openapi.json` and an interactive Swagger UI at
+  `/docs`:
+  ```
+  https://<workspace>--<endpoint-name>.modal.run/openapi.json
+  https://<workspace>--<endpoint-name>.modal.run/docs
+  ```
+- **Checked-in spec** — [`../openapi.yaml`](../openapi.yaml) documents the
+  shared, stable surface across all endpoints (OpenAPI 3.1). Use it for client
+  generation and review; use the live spec for the exact, version-pinned shape.
+## Base URL
+```
+https://<workspace>--<endpoint-name>.modal.run/v1
+```
+One server per model; `<endpoint-name>` is the model's `endpoint_name` from
+`registry.py` (e.g. `gemma-4-12b`, `nemotron-3-nano-4b`).
+## Endpoints
+| Method & path           | Purpose                                  |
+| ----------------------- | ---------------------------------------- |
+| `GET  /v1/models`       | List the model served by this endpoint.  |
+| `POST /v1/chat/completions` | Chat completion (streaming via `stream: true`). |
+| `POST /v1/completions`  | Text completion.                         |
+Multimodal models (MiniCPM-o-4_5) accept array-style `content` parts
+(`text` / `image_url` / `input_audio`) on chat messages. Models configured with
+a `tool_call_parser` accept `tools` / `tool_choice`.
+## Authentication
+Auth is **off by default** (endpoints are public; any token is accepted). To
+require a bearer token, deploy with auth enabled — secrets are supplied as
+environment variables, never hard-coded:
+```bash
+# 1. Create the secret. The KEY must be VLLM_API_KEY (vLLM reads this env var);
+#    the VALUE is the bearer token clients will send.
+modal secret create llm-api-key VLLM_API_KEY=sk-your-token
+# 2. Deploy with auth turned on (per provider app).
+MODAL_LLM_REQUIRE_AUTH=1 modal deploy modal/app_google.py
+```
+With auth on, vLLM enforces `Authorization: Bearer <token>` and returns `401`
+otherwise. Clients pass the same token as their API key.
+## Examples
+### curl
+```bash
+curl https://<workspace>--gemma-4-12b.modal.run/v1/chat/completions \
+  -H "Authorization: Bearer $LLM_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "google/gemma-4-12B",
+    "messages": [{"role": "user", "content": "Describe a mossy ticket booth."}],
+    "max_tokens": 256
+  }'
+```
+### OpenAI SDK
+```python
+import os
+from openai import OpenAI
+client = OpenAI(
+    base_url="https://<workspace>--gemma-4-12b.modal.run/v1",
+    api_key=os.environ.get("LLM_API_KEY", "EMPTY"),  # any value works when auth is off
+)
+resp = client.chat.completions.create(
+    model="google/gemma-4-12B",
+    messages=[{"role": "user", "content": "Hello from the wood."}],
+)
+print(resp.choices[0].message.content)
+```
+The bundled [`../client.py`](../client.py) wraps this and reads the token from
+the `LLM_API_KEY` environment variable.
+## Generating clients
+```bash
+# Typed client from the checked-in spec...
+openapi-generator-cli generate -i modal/openapi.yaml -g python -o ./gen
+# ...or from a live endpoint's exact spec:
+curl -s https://<workspace>--gemma-4-12b.modal.run/openapi.json -o openapi.json
+```
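
The endpoint table in `openapi.md` notes that chat completions stream when `stream: true` is set. Below is a minimal sketch of how a client might reassemble the streamed text, assuming the usual OpenAI-style server-sent-events framing (`data: <json>` lines, terminated by `data: [DONE]`). `iter_stream_text` is a hypothetical helper, and the sample lines are illustrative rather than captured server output.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield content fragments from OpenAI-style SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


# Illustrative sample, not real server output.
sample = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"index": 0, "delta": {"content": " wood"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))
```

The OpenAI SDK performs this parsing itself when called with `stream=True`; the sketch only makes the wire format visible.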

modal/openapi.yaml ADDED

@@ -0,0 +1,258 @@

+openapi: 3.1.0
+info:
+  title: Multi-Agent Land — Model Serving API
+  version: "1.0.0"
+  description: >
+    OpenAI-compatible inference API served by the Modal apps in this repo
+    (`nvidia-llms`, `openbmb-llms`, `google-llms`). Each model is exposed as its
+    own endpoint and speaks the OpenAI REST protocol, so any OpenAI-compatible
+    client works by pointing `base_url` at the endpoint URL.
+    Every running endpoint also serves a live, model-specific spec at
+    `/openapi.json` and an interactive Swagger UI at `/docs`. This checked-in
+    spec documents the shared, stable surface across all endpoints.
+servers:
+  - url: https://{workspace}--{endpoint}.modal.run/v1
+    description: Per-model Modal endpoint (one server per model).
+    variables:
+      workspace:
+        default: your-workspace
+        description: Your Modal workspace slug.
+      endpoint:
+        default: gemma-4-12b
+        description: A model's endpoint_name from registry.py.
+        enum:
+          - nemotron-3-nano-30b
+          - nemotron-3-nano-4b
+          - minicpm-o-4-5
+          - minicpm-4-1-8b
+          - gemma-4-26b
+          - gemma-4-12b
+security:
+  - bearerAuth: []
+paths:
+  /models:
+    get:
+      operationId: listModels
+      summary: List the model(s) served by this endpoint.
+      responses:
+        "200":
+          description: Available models.
+          content:
+            application/json:
+              schema:
+                $ref: "#/components/schemas/ModelList"
+  /chat/completions:
+    post:
+      operationId: createChatCompletion
+      summary: Create a chat completion (OpenAI-compatible).
+      requestBody:
+        required: true
+        content:
+          application/json:
+            schema:
+              $ref: "#/components/schemas/ChatCompletionRequest"
+      responses:
+        "200":
+          description: Chat completion. A stream of SSE chunks when `stream` is true.
+          content:
+            application/json:
+              schema:
+                $ref: "#/components/schemas/ChatCompletionResponse"
+        "401":
+          $ref: "#/components/responses/Unauthorized"
+  /completions:
+    post:
+      operationId: createCompletion
+      summary: Create a text completion (OpenAI-compatible).
+      requestBody:
+        required: true
+        content:
+          application/json:
+            schema:
+              $ref: "#/components/schemas/CompletionRequest"
+      responses:
+        "200":
+          description: Text completion.
+          content:
+            application/json:
+              schema:
+                $ref: "#/components/schemas/CompletionResponse"
+        "401":
+          $ref: "#/components/responses/Unauthorized"
+components:
+  securitySchemes:
+    bearerAuth:
+      type: http
+      scheme: bearer
+      description: >
+        Required only when the apps are deployed with MODAL_LLM_REQUIRE_AUTH=1.
+        The token is the value of the `llm-api-key` Modal Secret (VLLM_API_KEY).
+        Without auth the endpoints are public and any token is accepted.
+  responses:
+    Unauthorized:
+      description: Missing or invalid bearer token (auth-enabled deploys only).
+      content:
+        application/json:
+          schema:
+            $ref: "#/components/schemas/Error"
+  schemas:
+    Model:
+      type: object
+      properties:
+        id:
+          type: string
+          description: Served model id (the Hugging Face repo id), e.g. "google/gemma-4-12B".
+        object:
+          type: string
+          const: model
+        created:
+          type: integer
+        owned_by:
+          type: string
+    ModelList:
+      type: object
+      properties:
+        object:
+          type: string
+          const: list
+        data:
+          type: array
+          items:
+            $ref: "#/components/schemas/Model"
+    ChatMessage:
+      type: object
+      required: [role, content]
+      properties:
+        role:
+          type: string
+          enum: [system, user, assistant, tool]
+        content:
+          description: >
+            Plain text, or an array of content parts for multimodal models
+            (e.g. MiniCPM-o) with `type` of text / image_url / input_audio.
+          oneOf:
+            - type: string
+            - type: array
+              items:
+                type: object
+    ChatCompletionRequest:
+      type: object
+      required: [model, messages]
+      properties:
+        model:
+          type: string
+          description: Served model id (must match the endpoint's model).
+        messages:
+          type: array
+          items:
+            $ref: "#/components/schemas/ChatMessage"
+        max_tokens:
+          type: integer
+        temperature:
+          type: number
+          default: 1.0
+        top_p:
+          type: number
+          default: 1.0
+        stream:
+          type: boolean
+          default: false
+        tools:
+          type: array
+          description: Tool/function definitions (models with a tool_call_parser).
+          items:
+            type: object
+        tool_choice:
+          description: "auto | none | required | a specific tool."
+          oneOf:
+            - type: string
+            - type: object
+    ChatCompletionResponse:
+      type: object
+      properties:
+        id:
+          type: string
+        object:
+          type: string
+          const: chat.completion
+        created:
+          type: integer
+        model:
+          type: string
+        choices:
+          type: array
+          items:
+            type: object
+            properties:
+              index:
+                type: integer
+              message:
+                $ref: "#/components/schemas/ChatMessage"
+              finish_reason:
+                type: string
+        usage:
+          $ref: "#/components/schemas/Usage"
+    CompletionRequest:
+      type: object
+      required: [model, prompt]
+      properties:
+        model:
+          type: string
+        prompt:
+          oneOf:
+            - type: string
+            - type: array
+              items:
+                type: string
+        max_tokens:
+          type: integer
+        temperature:
+          type: number
+        stream:
+          type: boolean
+          default: false
+    CompletionResponse:
+      type: object
+      properties:
+        id:
+          type: string
+        object:
+          type: string
+          const: text_completion
+        model:
+          type: string
+        choices:
+          type: array
+          items:
+            type: object
+            properties:
+              index:
+                type: integer
+              text:
+                type: string
+              finish_reason:
+                type: string
+        usage:
+          $ref: "#/components/schemas/Usage"
+    Usage:
+      type: object
+      properties:
+        prompt_tokens:
+          type: integer
+        completion_tokens:
+          type: integer
+        total_tokens:
+          type: integer
+    Error:
+      type: object
+      properties:
+        error:
+          type: object
+          properties:
+            message:
+              type: string
+            type:
+              type: string
+            code:
+              type: string
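
The `ChatCompletionRequest` and `ChatMessage` schemas above declare a few required fields and a role enum. As a rough illustration of what those constraints mean for a request body, the sketch below checks them by hand. `validate_chat_request` is a hypothetical helper that covers only the rules named in its comments, and it is not a substitute for a real OpenAPI or JSON Schema validator.

```python
from typing import Any, Dict, List

ALLOWED_ROLES = {"system", "user", "assistant", "tool"}  # ChatMessage.role enum


def validate_chat_request(body: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the body looks valid."""
    problems: List[str] = []
    for field in ("model", "messages"):  # ChatCompletionRequest.required
        if field not in body:
            problems.append(f"missing required field: {field}")
    if not isinstance(body.get("model", ""), str):
        problems.append("model must be a string")
    messages = body.get("messages", [])
    if not isinstance(messages, list):
        return problems + ["messages must be an array"]
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"messages[{i}] must be an object")
            continue
        for field in ("role", "content"):  # ChatMessage.required
            if field not in msg:
                problems.append(f"messages[{i}] missing required field: {field}")
        role = msg.get("role")
        if "role" in msg and not (isinstance(role, str) and role in ALLOWED_ROLES):
            problems.append(f"messages[{i}].role not in enum: {role!r}")
        if "content" in msg and not isinstance(msg["content"], (str, list)):
            problems.append(f"messages[{i}].content must be a string or an array")
    return problems
```

A body such as `{"model": "google/gemma-4-12B", "messages": [{"role": "user", "content": "hi"}]}` passes with no problems reported.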

modal/service.py CHANGED

@@ -21,6 +21,7 @@ them by pointing ``base_url`` at the deployed URL.
 from __future__ import annotations
 import json
+import os
 from dataclasses import dataclass, field
 import modal
@@ -44,6 +45,22 @@ VLLM_CACHE_PATH = "/root/.cache/vllm"
 #   modal secret create huggingface-secret HF_TOKEN=hf_...
 HF_SECRET_NAME = "huggingface-secret"
+# Name of the Modal Secret holding the bearer token clients must present.
+# The key MUST be VLLM_API_KEY — vLLM reads that env var and then enforces
+# `Authorization: Bearer <token>` on every request. Create it once with:
+#   modal secret create llm-api-key VLLM_API_KEY=sk-...
+API_KEY_SECRET_NAME = "llm-api-key"
+# Opt in to API-key auth at deploy time (no code edits needed):
+#   MODAL_LLM_REQUIRE_AUTH=1 modal deploy modal/app_google.py
+# When enabled, every endpoint mounts API_KEY_SECRET_NAME and rejects requests
+# without a valid bearer token. Off by default (endpoints are then public).
+REQUIRE_API_KEY = os.environ.get("MODAL_LLM_REQUIRE_AUTH", "").lower() in (
+    "1",
+    "true",
+    "yes",
+)
 # Weights and the vLLM compile cache are shared across every provider app, so a
 # model pulled once is warm for all subsequent deploys and containers.
 hf_cache_vol = modal.Volume.from_name("huggingface-cache", create_if_missing=True)
@@ -179,7 +196,12 @@ def register_model(app: modal.App, cfg: ModelConfig) -> modal.Function:
     """
     image = build_image(cfg)
     cmd = build_command(cfg)
-    secrets = [modal.Secret.from_name(HF_SECRET_NAME)] if cfg.gated else []
+    secrets = []
+    if cfg.gated:
+        secrets.append(modal.Secret.from_name(HF_SECRET_NAME))
+    if REQUIRE_API_KEY:
+        # Exposes VLLM_API_KEY in the container; vLLM then enforces bearer auth.
+        secrets.append(modal.Secret.from_name(API_KEY_SECRET_NAME))
     @app.function(
         name=cfg.endpoint_name,
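
The opt-in logic in `service.py` can be exercised without Modal by separating the flag parsing from the secret selection. The sketch below mirrors the diff's behavior but returns secret names instead of `modal.Secret` objects; `auth_required` and `secret_names` are hypothetical names used only for illustration.

```python
from typing import List, Mapping

HF_SECRET_NAME = "huggingface-secret"
API_KEY_SECRET_NAME = "llm-api-key"
TRUTHY = ("1", "true", "yes")


def auth_required(env: Mapping[str, str]) -> bool:
    """Mirror the REQUIRE_API_KEY check: case-insensitive match on 1/true/yes."""
    return env.get("MODAL_LLM_REQUIRE_AUTH", "").lower() in TRUTHY


def secret_names(gated: bool, env: Mapping[str, str]) -> List[str]:
    """Names of the Modal Secrets an endpoint would mount (names only, no Modal calls)."""
    names: List[str] = []
    if gated:
        names.append(HF_SECRET_NAME)
    if auth_required(env):
        names.append(API_KEY_SECRET_NAME)
    return names
```

The match is exact after lowercasing, so values such as `on` or ` 1 ` (with surrounding whitespace) leave auth disabled and the endpoints public.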