Commit 57b8237 · 1 Parent(s): 39ef592 · committed by agharsallah and Codex

feat: add OpenAPI docs and opt-in bearer auth via env-var secret

- openapi.yaml: checked-in OpenAPI 3.1 spec for the served API surface
- docs/openapi.md: API reference (endpoints, auth, examples, client gen)
- service.py: optional API-key auth, enabled at deploy time with
  MODAL_LLM_REQUIRE_AUTH=1; token supplied via the llm-api-key Modal Secret
  as the VLLM_API_KEY env var (never hard-coded)
- client.py: read bearer token from LLM_API_KEY env var
- deploying.md / README / ADR-0005: document auth + OpenAPI

Endpoints also self-document at /docs and /openapi.json.

Co-Authored-By: Codex <codex@openai.com>

docs/adr/0005-modal-model-serving.md CHANGED
@@ -40,6 +40,12 @@ OpenAI-compatible endpoints (vLLM behind an autoscaling web server).
   touching the others or the engine.
 - Gated repos (Gemma, the Nemotron repos used here) require a Hugging Face token
   in the `huggingface-secret` Modal Secret; ungated models deploy without it.
+- Endpoints are public by default; bearer-token auth is opt-in at deploy time
+  (`MODAL_LLM_REQUIRE_AUTH=1`) and supplied via the `llm-api-key` Secret as the
+  `VLLM_API_KEY` env var — secrets are never hard-coded.
+- The API is OpenAI-compatible: each endpoint self-documents at `/docs` and
+  `/openapi.json`, and a checked-in `modal/openapi.yaml` (3.1) plus
+  `modal/docs/openapi.md` document the shared surface.
 - vLLM tool/reasoning parser names are version-specific and left conservative;
   enable per model once verified against the deployed vLLM version.
 - Modal's docs index is mirrored at `modal/docs/modal-llms.txt` and refreshed
modal/README.md CHANGED
@@ -16,12 +16,17 @@ modal/
   app_openbmb.py     App "openbmb-llms" — MiniCPM-o 4.5 + MiniCPM4.1-8B.
   app_google.py      App "google-llms" — Gemma 4 26B + 12B.
   client.py          OpenAI-SDK smoke-test client for any endpoint.
+  openapi.yaml       Checked-in OpenAPI 3.1 spec for the served API surface.
   requirements.txt   Deploy/client tooling (vLLM lives in the container image).
   docs/
     deploying.md     Deploy, configure, auth, GPU sizing, engine integration.
+    openapi.md       API reference: endpoints, auth, examples, client generation.
     modal-llms.txt   In-repo mirror of Modal's docs index, kept updated.
 ```
 
+Each running endpoint also self-documents at `/docs` (Swagger UI) and
+`/openapi.json` (live spec). See [`docs/openapi.md`](docs/openapi.md).
+
 ## Models
 
 | Provider | App | Model | Endpoint name | GPU |
modal/client.py CHANGED
@@ -26,8 +26,9 @@ def main() -> None:
 
     from openai import OpenAI
 
-    # vLLM accepts any token unless an API key was configured on the server.
-    client = OpenAI(base_url=args.base_url, api_key=os.environ.get("MODAL_LLM_KEY", "EMPTY"))
+    # Bearer token from the env var (set LLM_API_KEY to the value of the
+    # `llm-api-key` Modal Secret). Any value works when the server has no auth.
+    client = OpenAI(base_url=args.base_url, api_key=os.environ.get("LLM_API_KEY", "EMPTY"))
 
     response = client.chat.completions.create(
         model=args.model,
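
The token lookup in this hunk can be exercised without a live endpoint. The following is a minimal sketch, not code from the commit: `resolve_api_key` and `auth_header` are hypothetical helpers that mirror the `os.environ.get("LLM_API_KEY", "EMPTY")` fallback and show the header an OpenAI-compatible client would then send.

```python
from collections.abc import Mapping


def resolve_api_key(env: Mapping[str, str]) -> str:
    """Return the bearer token, or the "EMPTY" placeholder used by client.py.

    Hypothetical helper: client.py inlines this lookup rather than naming it.
    """
    return env.get("LLM_API_KEY", "EMPTY")


def auth_header(token: str) -> dict[str, str]:
    """Build the Authorization header sent with every request."""
    return {"Authorization": f"Bearer {token}"}
```

One consequence of the rename is worth keeping in mind: a shell that still exports only the old `MODAL_LLM_KEY` variable now falls back to `EMPTY`, which an auth-enabled deploy is expected to reject with `401`.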
modal/docs/deploying.md CHANGED
@@ -86,12 +86,25 @@ Append one `ModelConfig` to the appropriate provider list in `registry.py`.
 
 ## Auth
 
-Modal web endpoints are public by default. To require a bearer token, either:
+Modal web endpoints are public by default. Secrets are supplied as environment
+variables (never hard-coded). To require a bearer token:
 
-- Set `VLLM_API_KEY` on the container (via a `modal.Secret`) so vLLM enforces
-  `Authorization: Bearer <key>`; or
-- Front the endpoint with Modal Proxy Auth Tokens
-  (see `docs/modal-llms.txt` → Proxy Auth Tokens).
+```bash
+# Key MUST be VLLM_API_KEY (vLLM reads it); value is the token clients send.
+modal secret create llm-api-key VLLM_API_KEY=sk-your-token
+
+# Turn auth on at deploy time — no code edits:
+MODAL_LLM_REQUIRE_AUTH=1 modal deploy modal/app_google.py
+```
+
+When `MODAL_LLM_REQUIRE_AUTH` is set, every endpoint mounts the `llm-api-key`
+secret as the `VLLM_API_KEY` env var and vLLM enforces `Authorization: Bearer
+<token>` (401 otherwise). Clients pass the same token (the bundled `client.py`
+reads it from `LLM_API_KEY`). Alternatively front endpoints with Modal Proxy
+Auth Tokens (see `docs/modal-llms.txt` → Proxy Auth Tokens).
+
+See [`openapi.md`](openapi.md) for the full API reference and the checked-in
+OpenAPI spec (`../openapi.yaml`).
 
 ## GPU sizing cheatsheet
 
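
The `sk-your-token` placeholder in the Auth section has to be replaced with a real value before the secret is created. As a hedged sketch (the helper names and the 32-byte length are assumptions, and the `sk-` prefix is only the convention used in the docs above), a random token and the matching `modal secret create` command line could be produced like this:

```python
import secrets
import shlex


def new_token(prefix: str = "sk-", nbytes: int = 32) -> str:
    """Generate a URL-safe random bearer token (hypothetical helper)."""
    return prefix + secrets.token_urlsafe(nbytes)


def secret_create_command(token: str, secret_name: str = "llm-api-key") -> str:
    """Render the CLI call documented above; the key must stay VLLM_API_KEY."""
    return shlex.join(
        ["modal", "secret", "create", secret_name, f"VLLM_API_KEY={token}"]
    )
```

The same token is what clients later export as `LLM_API_KEY`, so it is best handed to a secret manager rather than echoed into shell history.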
 
modal/docs/openapi.md ADDED
@@ -0,0 +1,100 @@
+# OpenAPI / API reference
+
+Every deployed model speaks the **OpenAI REST protocol**, so the API surface is
+the familiar OpenAI one. There are two sources of truth:
+
+- **Live, per-model spec** — each running endpoint serves its own
+  auto-generated spec at `/openapi.json` and an interactive Swagger UI at
+  `/docs`:
+
+  ```
+  https://<workspace>--<endpoint-name>.modal.run/openapi.json
+  https://<workspace>--<endpoint-name>.modal.run/docs
+  ```
+
+- **Checked-in spec** — [`../openapi.yaml`](../openapi.yaml) documents the
+  shared, stable surface across all endpoints (OpenAPI 3.1). Use it for client
+  generation and review; use the live spec for the exact, version-pinned shape.
+
+## Base URL
+
+```
+https://<workspace>--<endpoint-name>.modal.run/v1
+```
+
+One server per model; `<endpoint-name>` is the model's `endpoint_name` from
+`registry.py` (e.g. `gemma-4-12b`, `nemotron-3-nano-4b`).
+
+## Endpoints
+
+| Method & path               | Purpose                                         |
+| --------------------------- | ----------------------------------------------- |
+| `GET /v1/models`            | List the model served by this endpoint.         |
+| `POST /v1/chat/completions` | Chat completion (streaming via `stream: true`). |
+| `POST /v1/completions`      | Text completion.                                |
+
+Multimodal models (MiniCPM-o-4_5) accept array-style `content` parts
+(`text` / `image_url` / `input_audio`) on chat messages. Models configured with
+a `tool_call_parser` accept `tools` / `tool_choice`.
+
+## Authentication
+
+Auth is **off by default** (endpoints are public; any token is accepted). To
+require a bearer token, deploy with auth enabled — secrets are supplied as
+environment variables, never hard-coded:
+
+```bash
+# 1. Create the secret. The KEY must be VLLM_API_KEY (vLLM reads this env var);
+#    the VALUE is the bearer token clients will send.
+modal secret create llm-api-key VLLM_API_KEY=sk-your-token
+
+# 2. Deploy with auth turned on (per provider app).
+MODAL_LLM_REQUIRE_AUTH=1 modal deploy modal/app_google.py
+```
+
+With auth on, vLLM enforces `Authorization: Bearer <token>` and returns `401`
+otherwise. Clients pass the same token as their API key.
+
+## Examples
+
+### curl
+
+```bash
+curl https://<workspace>--gemma-4-12b.modal.run/v1/chat/completions \
+  -H "Authorization: Bearer $LLM_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "google/gemma-4-12B",
+    "messages": [{"role": "user", "content": "Describe a mossy ticket booth."}],
+    "max_tokens": 256
+  }'
+```
+
+### OpenAI SDK
+
+```python
+from openai import OpenAI
+
+client = OpenAI(
+    base_url="https://<workspace>--gemma-4-12b.modal.run/v1",
+    api_key=os.environ["LLM_API_KEY"],  # any value when auth is off
+)
+resp = client.chat.completions.create(
+    model="google/gemma-4-12B",
+    messages=[{"role": "user", "content": "Hello from the wood."}],
+)
+print(resp.choices[0].message.content)
+```
+
+The bundled [`../client.py`](../client.py) wraps this and reads the token from
+the `LLM_API_KEY` environment variable.
+
+## Generating clients
+
+```bash
+# Typed client from the checked-in spec...
+openapi-generator-cli generate -i modal/openapi.yaml -g python -o ./gen
+
+# ...or from a live endpoint's exact spec:
+curl -s https://<workspace>--gemma-4-12b.modal.run/openapi.json -o openapi.json
+```
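
The reference mentions streaming via `stream: true` but does not show how a client consumes it. Here is a minimal sketch, assuming the usual OpenAI-style framing (one `data: {json}` line per chunk, terminated by `data: [DONE]`); the function name `iter_stream_text` is hypothetical, and the exact chunk shape should be checked against the live `/openapi.json` of the deployed vLLM version.

```python
import json
from collections.abc import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style server-sent-event lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            # The first chunk usually carries only the role, so content may be absent.
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text
```

In practice the OpenAI SDK does this parsing itself when `stream=True` is passed; the sketch is mainly useful for `curl`-style or raw HTTP clients.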
modal/openapi.yaml ADDED
@@ -0,0 +1,258 @@
+openapi: 3.1.0
+info:
+  title: Multi-Agent Land — Model Serving API
+  version: "1.0.0"
+  description: >
+    OpenAI-compatible inference API served by the Modal apps in this repo
+    (`nvidia-llms`, `openbmb-llms`, `google-llms`). Each model is exposed as its
+    own endpoint and speaks the OpenAI REST protocol, so any OpenAI-compatible
+    client works by pointing `base_url` at the endpoint URL.
+
+    Every running endpoint also serves a live, model-specific spec at
+    `/openapi.json` and an interactive Swagger UI at `/docs`. This checked-in
+    spec documents the shared, stable surface across all endpoints.
+servers:
+  - url: https://{workspace}--{endpoint}.modal.run/v1
+    description: Per-model Modal endpoint (one server per model).
+    variables:
+      workspace:
+        default: your-workspace
+        description: Your Modal workspace slug.
+      endpoint:
+        default: gemma-4-12b
+        description: A model's endpoint_name from registry.py.
+        enum:
+          - nemotron-3-nano-30b
+          - nemotron-3-nano-4b
+          - minicpm-o-4-5
+          - minicpm-4-1-8b
+          - gemma-4-26b
+          - gemma-4-12b
+security:
+  - bearerAuth: []
+paths:
+  /models:
+    get:
+      operationId: listModels
+      summary: List the model(s) served by this endpoint.
+      responses:
+        "200":
+          description: Available models.
+          content:
+            application/json:
+              schema:
+                $ref: "#/components/schemas/ModelList"
+  /chat/completions:
+    post:
+      operationId: createChatCompletion
+      summary: Create a chat completion (OpenAI-compatible).
+      requestBody:
+        required: true
+        content:
+          application/json:
+            schema:
+              $ref: "#/components/schemas/ChatCompletionRequest"
+      responses:
+        "200":
+          description: Chat completion. A stream of SSE chunks when `stream` is true.
+          content:
+            application/json:
+              schema:
+                $ref: "#/components/schemas/ChatCompletionResponse"
+        "401":
+          $ref: "#/components/responses/Unauthorized"
+  /completions:
+    post:
+      operationId: createCompletion
+      summary: Create a text completion (OpenAI-compatible).
+      requestBody:
+        required: true
+        content:
+          application/json:
+            schema:
+              $ref: "#/components/schemas/CompletionRequest"
+      responses:
+        "200":
+          description: Text completion.
+          content:
+            application/json:
+              schema:
+                $ref: "#/components/schemas/CompletionResponse"
+        "401":
+          $ref: "#/components/responses/Unauthorized"
+components:
+  securitySchemes:
+    bearerAuth:
+      type: http
+      scheme: bearer
+      description: >
+        Required only when the apps are deployed with MODAL_LLM_REQUIRE_AUTH=1.
+        The token is the value of the `llm-api-key` Modal Secret (VLLM_API_KEY).
+        Without auth the endpoints are public and any token is accepted.
+  responses:
+    Unauthorized:
+      description: Missing or invalid bearer token (auth-enabled deploys only).
+      content:
+        application/json:
+          schema:
+            $ref: "#/components/schemas/Error"
+  schemas:
+    Model:
+      type: object
+      properties:
+        id:
+          type: string
+          description: Served model id (the Hugging Face repo id), e.g. "google/gemma-4-12B".
+        object:
+          type: string
+          const: model
+        created:
+          type: integer
+        owned_by:
+          type: string
+    ModelList:
+      type: object
+      properties:
+        object:
+          type: string
+          const: list
+        data:
+          type: array
+          items:
+            $ref: "#/components/schemas/Model"
+    ChatMessage:
+      type: object
+      required: [role, content]
+      properties:
+        role:
+          type: string
+          enum: [system, user, assistant, tool]
+        content:
+          description: >
+            Plain text, or an array of content parts for multimodal models
+            (e.g. MiniCPM-o) with `type` of text / image_url / input_audio.
+          oneOf:
+            - type: string
+            - type: array
+              items:
+                type: object
+    ChatCompletionRequest:
+      type: object
+      required: [model, messages]
+      properties:
+        model:
+          type: string
+          description: Served model id (must match the endpoint's model).
+        messages:
+          type: array
+          items:
+            $ref: "#/components/schemas/ChatMessage"
+        max_tokens:
+          type: integer
+        temperature:
+          type: number
+          default: 1.0
+        top_p:
+          type: number
+          default: 1.0
+        stream:
+          type: boolean
+          default: false
+        tools:
+          type: array
+          description: Tool/function definitions (models with a tool_call_parser).
+          items:
+            type: object
+        tool_choice:
+          description: "auto | none | required | a specific tool."
+          oneOf:
+            - type: string
+            - type: object
+    ChatCompletionResponse:
+      type: object
+      properties:
+        id:
+          type: string
+        object:
+          type: string
+          const: chat.completion
+        created:
+          type: integer
+        model:
+          type: string
+        choices:
+          type: array
+          items:
+            type: object
+            properties:
+              index:
+                type: integer
+              message:
+                $ref: "#/components/schemas/ChatMessage"
+              finish_reason:
+                type: string
+        usage:
+          $ref: "#/components/schemas/Usage"
+    CompletionRequest:
+      type: object
+      required: [model, prompt]
+      properties:
+        model:
+          type: string
+        prompt:
+          oneOf:
+            - type: string
+            - type: array
+              items:
+                type: string
+        max_tokens:
+          type: integer
+        temperature:
+          type: number
+        stream:
+          type: boolean
+          default: false
+    CompletionResponse:
+      type: object
+      properties:
+        id:
+          type: string
+        object:
+          type: string
+          const: text_completion
+        model:
+          type: string
+        choices:
+          type: array
+          items:
+            type: object
+            properties:
+              index:
+                type: integer
+              text:
+                type: string
+              finish_reason:
+                type: string
+        usage:
+          $ref: "#/components/schemas/Usage"
+    Usage:
+      type: object
+      properties:
+        prompt_tokens:
+          type: integer
+        completion_tokens:
+          type: integer
+        total_tokens:
+          type: integer
+    Error:
+      type: object
+      properties:
+        error:
+          type: object
+          properties:
+            message:
+              type: string
+            type:
+              type: string
+            code:
+              type: string
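
The `servers` entry in this spec is a URL template with two variables, one of them restricted by an `enum`. A small sketch of how a client might expand it follows; `base_url` and `ENDPOINTS` are illustrative names, with the endpoint list copied from the spec's `enum` rather than read from `registry.py`.

```python
# Endpoint names copied from the `enum` under servers[0].variables.endpoint.
ENDPOINTS = (
    "nemotron-3-nano-30b",
    "nemotron-3-nano-4b",
    "minicpm-o-4-5",
    "minicpm-4-1-8b",
    "gemma-4-26b",
    "gemma-4-12b",
)


def base_url(workspace: str, endpoint: str = "gemma-4-12b") -> str:
    """Expand https://{workspace}--{endpoint}.modal.run/v1 for one model."""
    if endpoint not in ENDPOINTS:
        raise ValueError(f"unknown endpoint {endpoint!r}; expected one of {ENDPOINTS}")
    return f"https://{workspace}--{endpoint}.modal.run/v1"
```

The result is what gets passed as `base_url` to an OpenAI-compatible client; rejecting names outside the enum catches typos before any request is sent.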
modal/service.py CHANGED
@@ -21,6 +21,7 @@ them by pointing ``base_url`` at the deployed URL.
 from __future__ import annotations
 
 import json
+import os
 from dataclasses import dataclass, field
 
 import modal
@@ -44,6 +45,22 @@ VLLM_CACHE_PATH = "/root/.cache/vllm"
 # modal secret create huggingface-secret HF_TOKEN=hf_...
 HF_SECRET_NAME = "huggingface-secret"
 
+# Name of the Modal Secret holding the bearer token clients must present.
+# The key MUST be VLLM_API_KEY — vLLM reads that env var and then enforces
+# `Authorization: Bearer <token>` on every request. Create it once with:
+# modal secret create llm-api-key VLLM_API_KEY=sk-...
+API_KEY_SECRET_NAME = "llm-api-key"
+
+# Opt in to API-key auth at deploy time (no code edits needed):
+# MODAL_LLM_REQUIRE_AUTH=1 modal deploy modal/app_google.py
+# When enabled, every endpoint mounts API_KEY_SECRET_NAME and rejects requests
+# without a valid bearer token. Off by default (endpoints are then public).
+REQUIRE_API_KEY = os.environ.get("MODAL_LLM_REQUIRE_AUTH", "").lower() in (
+    "1",
+    "true",
+    "yes",
+)
+
 # Weights and the vLLM compile cache are shared across every provider app, so a
 # model pulled once is warm for all subsequent deploys and containers.
 hf_cache_vol = modal.Volume.from_name("huggingface-cache", create_if_missing=True)
@@ -179,7 +196,12 @@ def register_model(app: modal.App, cfg: ModelConfig) -> modal.Function:
     """
     image = build_image(cfg)
     cmd = build_command(cfg)
-    secrets = [modal.Secret.from_name(HF_SECRET_NAME)] if cfg.gated else []
+    secrets = []
+    if cfg.gated:
+        secrets.append(modal.Secret.from_name(HF_SECRET_NAME))
+    if REQUIRE_API_KEY:
+        # Exposes VLLM_API_KEY in the container; vLLM then enforces bearer auth.
+        secrets.append(modal.Secret.from_name(API_KEY_SECRET_NAME))
 
     @app.function(
         name=cfg.endpoint_name,
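
The opt-in logic in `service.py` can be checked in isolation, without importing `modal`. The sketch below restates it with plain strings standing in for `modal.Secret.from_name(...)`; `require_api_key` and `secret_names` are hypothetical names introduced only for illustration.

```python
from collections.abc import Mapping

TRUTHY = ("1", "true", "yes")  # same accepted values as REQUIRE_API_KEY


def require_api_key(env: Mapping[str, str]) -> bool:
    """Mirror the MODAL_LLM_REQUIRE_AUTH check (case-insensitive, default off)."""
    return env.get("MODAL_LLM_REQUIRE_AUTH", "").lower() in TRUTHY


def secret_names(gated: bool, require_auth: bool) -> list[str]:
    """Return the Modal Secret names an endpoint would mount, in order."""
    names: list[str] = []
    if gated:
        names.append("huggingface-secret")
    if require_auth:
        names.append("llm-api-key")
    return names
```

Because the comparison is an exact match after lowercasing, values such as `on` or `" 1"` (with stray whitespace) leave auth disabled, so it is worth confirming a `401` on an unauthenticated request after deploying.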