# OpenAPI / API reference
Every deployed model speaks the **OpenAI REST protocol**, so the API surface is
the familiar OpenAI one. There are two sources of truth:
- **Live, per-model spec** — each running endpoint serves its own
auto-generated spec at `/openapi.json` and an interactive Swagger UI at
`/docs`:
```
https://<workspace>--<app-name>-<endpoint-name>.modal.run/openapi.json
https://<workspace>--<app-name>-<endpoint-name>.modal.run/docs
```
- **Checked-in spec** — [`../openapi.yaml`](../openapi.yaml) documents the
shared, stable surface across all endpoints (OpenAPI 3.1). Use it for client
generation and review; use the live spec for the exact, version-pinned shape.
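
As a rough illustration of working with the live spec, the sketch below derives the `/openapi.json` and `/docs` URLs from an endpoint root and lists the operations found in a spec's `paths` object. The helper names (`spec_urls`, `list_operations`, `fetch_live_spec`) are hypothetical and not part of this repo, and `fetch_live_spec` assumes the spec route is reachable without a token.

```python
import json
import urllib.request

# HTTP methods that may appear as keys of an OpenAPI path item.
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def spec_urls(root: str) -> dict:
    """Return the live spec and Swagger UI URLs for an endpoint root (without /v1)."""
    root = root.rstrip("/")
    return {"spec": f"{root}/openapi.json", "docs": f"{root}/docs"}


def list_operations(spec: dict) -> list:
    """Return sorted 'METHOD /path' strings from an OpenAPI document's `paths`."""
    ops = []
    for path, item in spec.get("paths", {}).items():
        for method in item:
            # Skip non-method keys such as `parameters` or `summary`.
            if method.lower() in HTTP_METHODS:
                ops.append(f"{method.upper()} {path}")
    return sorted(ops)


def fetch_live_spec(root: str, timeout: float = 30.0) -> dict:
    """Download and parse /openapi.json from a running endpoint (network call)."""
    with urllib.request.urlopen(spec_urls(root)["spec"], timeout=timeout) as resp:
        return json.load(resp)
```

Calling `list_operations(fetch_live_spec(root))` against a deployed endpoint gives a quick way to compare the live surface with the checked-in spec.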
## Base URL
```
https://<workspace>--<app-name>-<endpoint-name>.modal.run/v1
```
One server per model; the URL label is `<app-name>-<endpoint-name>` — the
`modal.App` (`nvidia-llms` / `openbmb-llms` / `google-llms`) plus the model's
`endpoint_name` from `registry.py` (e.g. `google-llms-gemma-4-12b`,
`nvidia-llms-nemotron-3-nano-4b`). The `model` you send is the *served id* (the
HF repo id), not this slug.
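
The URL scheme above can be sketched as a small helper. The function names and the `acme` workspace are made up for illustration; the app name, endpoint name, and model id are the ones quoted in this document.

```python
def endpoint_label(app_name: str, endpoint_name: str) -> str:
    """Join the modal.App name and the registry endpoint_name into the URL label."""
    return f"{app_name}-{endpoint_name}"


def base_url(workspace: str, app_name: str, endpoint_name: str) -> str:
    """Build the /v1 base URL for one deployed model."""
    label = endpoint_label(app_name, endpoint_name)
    return f"https://{workspace}--{label}.modal.run/v1"


# "acme" is a placeholder workspace, not a real deployment.
url = base_url("acme", "google-llms", "gemma-4-12b")

# The request body carries the served id (the HF repo id), never the URL label.
payload = {"model": "google/gemma-4-12B"}
```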
## Endpoints
| Method & path | Purpose |
| ----------------------- | ---------------------------------------- |
| `GET /v1/models` | List the model served by this endpoint. |
| `POST /v1/chat/completions` | Chat completion (streaming via `stream: true`). |
| `POST /v1/completions` | Text completion. |
Multimodal models (MiniCPM-o-4_5) accept array-style `content` parts
(`text` / `image_url` / `input_audio`) on chat messages. Models configured with
a `tool_call_parser` accept `tools` / `tool_choice`.
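
A minimal sketch of what these request shapes might look like, assuming the standard OpenAI chat format. The helper names and the `get_weather` tool are hypothetical; pass whichever served model id your endpoint reports from `GET /v1/models`.

```python
from typing import Optional


def user_message(text: str, image_url: Optional[str] = None) -> dict:
    """Build a user message; uses array-style content parts when an image is given."""
    if image_url is None:
        return {"role": "user", "content": text}
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }


def chat_payload(model: str, messages: list, tools: Optional[list] = None) -> dict:
    """Assemble a /v1/chat/completions body; only send `tools` to models with a tool_call_parser."""
    payload = {"model": model, "messages": messages}
    if tools:
        payload["tools"] = tools
        payload["tool_choice"] = "auto"
    return payload


# Hypothetical tool definition in the OpenAI function-calling shape.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}
```

Text-only messages keep the plain string form, so the same helper works for models that do not accept content parts.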
## Authentication
Auth is **off by default** (endpoints are public; any token is accepted). To
require a bearer token, deploy with auth enabled — secrets are supplied as
environment variables, never hard-coded:
```bash
# 1. Create the secret. The KEY must be VLLM_API_KEY (vLLM reads this env var);
# the VALUE is the bearer token clients will send.
modal secret create llm-api-key VLLM_API_KEY=sk-your-token
# 2. Deploy with auth turned on (per provider app).
MODAL_LLM_REQUIRE_AUTH=1 modal deploy modal/app_google.py
```
With auth on, vLLM enforces `Authorization: Bearer <token>` and returns `401`
otherwise. Clients pass the same token as their API key.
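
On the client side, the bearer-token flow could be sketched with the standard library alone, as below. The function names are illustrative and not part of this repo; only the `Authorization: Bearer <token>` header and the `401` response come from the behaviour described above.

```python
import json
import urllib.error
import urllib.request
from typing import Optional


def build_chat_request(base_url: str, model: str, prompt: str,
                       token: Optional[str] = None) -> urllib.request.Request:
    """Prepare a POST to /chat/completions, adding the bearer token when one is given."""
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    url = f"{base_url.rstrip('/')}/chat/completions"
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def send(req: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request (network call) and surface a 401 as a clear error."""
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 401:
            raise PermissionError("endpoint rejected the bearer token (401)") from err
        raise
```

A caller would typically read the token from the `LLM_API_KEY` environment variable and pass it as `token`, keeping it out of source code.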
## Examples
### curl
```bash
curl https://<workspace>--google-llms-gemma-4-12b.modal.run/v1/chat/completions \
  -H "Authorization: Bearer $LLM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-12B",
    "messages": [{"role": "user", "content": "Describe a mossy ticket booth."}],
    "max_tokens": 256
  }'
```
### OpenAI SDK
```python
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://<workspace>--google-llms-gemma-4-12b.modal.run/v1",
    # Any non-empty value works when auth is off; the fallback avoids a KeyError.
    api_key=os.environ.get("LLM_API_KEY", "not-needed"),
)

resp = client.chat.completions.create(
    model="google/gemma-4-12B",
    messages=[{"role": "user", "content": "Hello from the wood."}],
)
print(resp.choices[0].message.content)
```
The bundled [`../client.py`](../client.py) wraps this and reads the token from
the `LLM_API_KEY` environment variable.
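
For streaming (`stream: true`), a minimal sketch with the OpenAI SDK might look like the following. The helper names are hypothetical, and `client` is assumed to be an `OpenAI` instance pointed at one endpoint's `/v1` base URL.

```python
def collect_stream(chunks) -> str:
    """Concatenate the text deltas from an iterable of chat completion chunks."""
    parts = []
    for chunk in chunks:
        # Some chunks (for example a trailing usage chunk) carry no choices.
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
    return "".join(parts)


def stream_chat(client, model: str, prompt: str) -> str:
    """Request a streamed chat completion and return the full text (network call)."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    return collect_stream(stream)
```

To show tokens as they arrive instead of collecting them, print each `delta` inside the loop.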
## Generating clients
```bash
# Typed client from the checked-in spec...
openapi-generator-cli generate -i modal/openapi.yaml -g python -o ./gen
# ...or download a live endpoint's exact spec and generate from that:
curl -s https://<workspace>--google-llms-gemma-4-12b.modal.run/openapi.json -o openapi.json
openapi-generator-cli generate -i openapi.json -g python -o ./gen
```