OpenAPI / API reference
Every deployed model speaks the OpenAI REST protocol, so the API surface is the familiar OpenAI one. There are two sources of truth:
- Live, per-model spec: each running endpoint serves its own auto-generated spec at /openapi.json and an interactive Swagger UI at /docs:
  https://<workspace>--<app-name>-<endpoint-name>.modal.run/openapi.json
  https://<workspace>--<app-name>-<endpoint-name>.modal.run/docs
- Checked-in spec: ../openapi.yaml documents the shared, stable surface across all endpoints (OpenAPI 3.1). Use it for client generation and review; use the live spec for the exact, version-pinned shape.
Base URL
https://<workspace>--<app-name>-<endpoint-name>.modal.run/v1
One server per model; the URL label is <app-name>-<endpoint-name> — the
modal.App (nvidia-llms / openbmb-llms / google-llms) plus the model's
endpoint_name from registry.py (e.g. google-llms-gemma-4-12b,
nvidia-llms-nemotron-3-nano-4b). The model you send is the served id (the
HF repo id), not this slug.
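As a minimal sketch of the naming pattern above, the helper below assembles the three URLs for one endpoint. The function name `endpoint_urls` and the workspace value `my-workspace` are illustrative assumptions, not part of the project.

```python
def endpoint_urls(workspace: str, app_name: str, endpoint_name: str) -> dict[str, str]:
    """Build the URLs for one deployed model from the documented slug pattern.

    Hypothetical helper: the pattern comes from this page, the function does not.
    """
    root = f"https://{workspace}--{app_name}-{endpoint_name}.modal.run"
    return {
        "base_url": f"{root}/v1",           # pass this to an OpenAI-style client
        "openapi": f"{root}/openapi.json",  # live, per-model spec
        "docs": f"{root}/docs",             # interactive Swagger UI
    }


# "my-workspace" is a placeholder; substitute your own Modal workspace name.
urls = endpoint_urls("my-workspace", "google-llms", "gemma-4-12b")
print(urls["base_url"])
```

Note that the slug only selects the server; the `model` field of a request still carries the served id (the HF repo id).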
Endpoints
| Method & path | Purpose |
|---|---|
| GET /v1/models | List the model served by this endpoint. |
| POST /v1/chat/completions | Chat completion (streaming via stream: true). |
| POST /v1/completions | Text completion. |
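With stream: true, an OpenAI-compatible server normally answers with server-sent events: one `data: {...}` line per chunk and a final `data: [DONE]`. The sketch below shows one way to pull the text out of such a stream without an SDK. It assumes that wire format and uses canned sample lines instead of a live connection; `iter_stream_text` is a hypothetical name.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style chat completion SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


# Canned lines standing in for a real response body (illustrative only).
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))
```

In practice the OpenAI SDK does this parsing for you when `stream=True` is passed; the manual version is mainly useful for debugging with raw HTTP tools.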
Multimodal models (MiniCPM-o-4_5) accept array-style content parts
(text / image_url / input_audio) on chat messages. Models configured with
a tool_call_parser accept tools / tool_choice.
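The following sketch builds an array-style user message and a tool definition as plain dictionaries. The part shapes follow the OpenAI chat format; the helper names `user_message` and `function_tool` are hypothetical, and whether a given endpoint accepts each part type depends on the model, so check its live spec.

```python
import base64
from typing import Any, Optional


def user_message(
    text: str,
    image_url: Optional[str] = None,
    audio_bytes: Optional[bytes] = None,
    audio_format: str = "wav",
) -> dict[str, Any]:
    """Build a user message whose content is a list of typed parts."""
    parts: list[dict[str, Any]] = [{"type": "text", "text": text}]
    if image_url is not None:
        parts.append({"type": "image_url", "image_url": {"url": image_url}})
    if audio_bytes is not None:
        # Audio is sent inline as base64 text plus a format label.
        encoded = base64.b64encode(audio_bytes).decode("ascii")
        parts.append(
            {"type": "input_audio", "input_audio": {"data": encoded, "format": audio_format}}
        )
    return {"role": "user", "content": parts}


def function_tool(name: str, description: str, parameters: dict[str, Any]) -> dict[str, Any]:
    """Wrap a JSON Schema parameter object as one entry of the `tools` array."""
    return {
        "type": "function",
        "function": {"name": name, "description": description, "parameters": parameters},
    }


# Illustrative values only; the URL and the tool are made up for this example.
message = user_message("What is in this picture?", image_url="https://example.com/booth.png")
tool = function_tool(
    "get_weather",
    "Look up the weather for a city.",
    {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]},
)
print([part["type"] for part in message["content"]], tool["function"]["name"])
```

The resulting `message` goes into `messages`, and `[tool]` into `tools`, of a normal chat completion request.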
Authentication
Auth is off by default (endpoints are public; any token is accepted). To require a bearer token, deploy with auth enabled — secrets are supplied as environment variables, never hard-coded:
# 1. Create the secret. The KEY must be VLLM_API_KEY (vLLM reads this env var);
# the VALUE is the bearer token clients will send.
modal secret create llm-api-key VLLM_API_KEY=sk-your-token
# 2. Deploy with auth turned on (per provider app).
MODAL_LLM_REQUIRE_AUTH=1 modal deploy modal/app_google.py
With auth on, vLLM enforces Authorization: Bearer <token> and returns 401
otherwise. Clients pass the same token as their API key.
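For clients that do not use an SDK, the bearer header can be attached by hand. This is a minimal standard-library sketch: `build_chat_request` and `send` are hypothetical helpers, and nothing is sent until `send` is called against a real endpoint.

```python
import json
import urllib.error
import urllib.request
from typing import Any, Optional


def build_chat_request(
    base_url: str, model: str, messages: list[dict[str, Any]], token: Optional[str] = None
) -> urllib.request.Request:
    """Prepare a POST to /chat/completions, adding the bearer header if a token is given."""
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions", data=body, headers=headers, method="POST"
    )


def send(req: urllib.request.Request) -> dict[str, Any]:
    """Send the request; turn a 401 into a clearer error about the token."""
    try:
        with urllib.request.urlopen(req, timeout=60) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 401:
            raise PermissionError("endpoint rejected the bearer token (HTTP 401)") from err
        raise


# Placeholder host and token: this only builds the request object.
req = build_chat_request(
    "https://example.invalid/v1",
    "google/gemma-4-12B",
    [{"role": "user", "content": "ping"}],
    token="sk-your-token",
)
print(req.get_method(), req.full_url)
```

In real use, read the token from the environment (for example `os.environ["LLM_API_KEY"]`) instead of writing it into source.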
Examples
curl
curl https://<workspace>--google-llms-gemma-4-12b.modal.run/v1/chat/completions \
  -H "Authorization: Bearer $LLM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-12B",
    "messages": [{"role": "user", "content": "Describe a mossy ticket booth."}],
    "max_tokens": 256
  }'
OpenAI SDK
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://<workspace>--google-llms-gemma-4-12b.modal.run/v1",
    api_key=os.environ["LLM_API_KEY"],  # any value when auth is off
)
resp = client.chat.completions.create(
    model="google/gemma-4-12B",
    messages=[{"role": "user", "content": "Hello from the wood."}],
)
print(resp.choices[0].message.content)
The bundled ../client.py wraps this and reads the token from
the LLM_API_KEY environment variable.
Generating clients
# Typed client from the checked-in spec...
openapi-generator-cli generate -i modal/openapi.yaml -g python -o ./gen
# ...or from a live endpoint's exact spec:
curl -s https://<workspace>--google-llms-gemma-4-12b.modal.run/openapi.json -o openapi.json
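Before generating a client from a downloaded spec, it can help to check which operations it actually declares. The sketch below lists method and path pairs from an OpenAPI document that has already been parsed into a dictionary. `list_operations` is a hypothetical helper, and the inline `example_spec` is a made-up stand-in for the real file.

```python
HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options", "trace"}


def list_operations(spec: dict) -> list[tuple[str, str]]:
    """Return (METHOD, path) pairs declared under the spec's `paths` object."""
    operations = []
    for path, item in sorted(spec.get("paths", {}).items()):
        for key in sorted(item):
            # Path items may also hold non-method keys such as "parameters".
            if key.lower() in HTTP_METHODS:
                operations.append((key.upper(), path))
    return operations


# Stand-in for: spec = json.load(open("openapi.json"))
example_spec = {
    "openapi": "3.1.0",
    "paths": {
        "/v1/models": {"get": {}},
        "/v1/chat/completions": {"post": {}, "parameters": []},
    },
}
for method, path in list_operations(example_spec):
    print(method, path)
```

Running the same function over the checked-in and the live spec gives a quick view of where the two differ.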