OpenAPI / API reference

Every deployed model speaks the OpenAI REST protocol, so the API surface is the familiar OpenAI one. There are two sources of truth:

  • Live, per-model spec — each running endpoint serves its own auto-generated spec at /openapi.json and an interactive Swagger UI at /docs:

    https://<workspace>--<app-name>-<endpoint-name>.modal.run/openapi.json
    https://<workspace>--<app-name>-<endpoint-name>.modal.run/docs
    
  • Checked-in spec: ../openapi.yaml documents the shared, stable surface across all endpoints (OpenAPI 3.1). Use it for client generation and review; use the live spec for the exact, version-pinned shape.
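To see which operations a particular endpoint actually exposes, the live spec can be fetched and summarised. The following is a minimal sketch using only the standard library; the helper names `fetch_live_spec` and `list_operations` are illustrative and not part of this repo, and the demo runs against a hand-written sample rather than a real endpoint.

```python
import json
import urllib.request

HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def fetch_live_spec(host: str, timeout: float = 30.0) -> dict:
    """Download /openapi.json from a running endpoint (performs a network call)."""
    url = f"https://{host}/openapi.json"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def list_operations(spec: dict) -> list:
    """Return sorted (METHOD, path) pairs found under the spec's `paths` object."""
    ops = []
    for path, item in spec.get("paths", {}).items():
        for key in item:
            # Path items may also hold non-operation keys such as "parameters".
            if key.lower() in HTTP_METHODS:
                ops.append((key.upper(), path))
    return sorted(ops)


# Offline demo on a hand-written sample; a real call would look like
# fetch_live_spec("<workspace>--<app-name>-<endpoint-name>.modal.run").
sample_spec = {
    "paths": {
        "/v1/models": {"get": {}},
        "/v1/chat/completions": {"post": {}},
        "/v1/completions": {"post": {}},
    }
}
for method, path in list_operations(sample_spec):
    print(method, path)
```

Running `list_operations` on both the checked-in and the live spec is a quick way to spot paths that only the running server exposes.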

Base URL

https://<workspace>--<app-name>-<endpoint-name>.modal.run/v1

Each model gets its own server. The URL label is <app-name>-<endpoint-name>: the modal.App name (nvidia-llms, openbmb-llms, or google-llms) followed by the model's endpoint_name from registry.py, giving for example google-llms-gemma-4-12b or nvidia-llms-nemotron-3-nano-4b. The model value you send in a request is the served id (the HF repo id, such as google/gemma-4-12B), not this slug.
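The distinction between the URL slug and the served model id is easy to get wrong, so here is a small sketch that keeps the two apart. The `Endpoint` class is hypothetical (it is not taken from registry.py), `my-workspace` is a made-up workspace name, and the split of `google-llms-gemma-4-12b` into app name and endpoint name is inferred from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    workspace: str      # Modal workspace name
    app_name: str       # modal.App name, e.g. "google-llms"
    endpoint_name: str  # endpoint_name from registry.py (assumed "gemma-4-12b")
    model_id: str       # served id (HF repo id) sent in the request's "model" field

    @property
    def slug(self) -> str:
        """URL label: <app-name>-<endpoint-name>."""
        return f"{self.app_name}-{self.endpoint_name}"

    @property
    def base_url(self) -> str:
        """Base URL including the /v1 prefix."""
        return f"https://{self.workspace}--{self.slug}.modal.run/v1"


gemma = Endpoint("my-workspace", "google-llms", "gemma-4-12b", "google/gemma-4-12B")
print(gemma.base_url)   # where requests go
print(gemma.model_id)   # what goes in the "model" field
```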

Endpoints

| Method & path | Purpose |
| --- | --- |
| GET /v1/models | List the model served by this endpoint. |
| POST /v1/chat/completions | Chat completion (streaming via stream: true). |
| POST /v1/completions | Text completion. |
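With stream: true, the OpenAI protocol delivers the chat completion as server-sent events: each `data:` line carries a JSON chunk whose `choices[].delta.content` holds the next text fragment, and the stream ends with `data: [DONE]`. The sketch below shows one way to reassemble the text from such lines; `iter_stream_text` is an illustrative helper, and the sample lines are hand-written rather than captured from a real endpoint.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style chat-completion SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


# Offline demo with lines shaped like a streamed response.
sample_lines = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": " there"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample_lines)))
```

In practice the OpenAI SDK does this parsing for you when `stream=True` is passed; the manual version is mainly useful for clients that talk to the endpoint over plain HTTP.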

Multimodal models (MiniCPM-o-4_5) accept array-style content parts (text / image_url / input_audio) on chat messages. Models configured with a tool_call_parser accept tools / tool_choice.
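As an illustration of the request shapes mentioned above, the following sketch builds a chat message with array-style content parts and attaches a tool definition. It follows the OpenAI chat-completions schema; `multimodal_message` and the `get_weather` tool are hypothetical names, and `<served-model-id>` is a placeholder for the model's served id.

```python
import base64
import json
from typing import Optional


def multimodal_message(text: str,
                       image_url: Optional[str] = None,
                       audio_bytes: Optional[bytes] = None,
                       audio_format: str = "wav") -> dict:
    """Build a user message whose content is a list of typed parts."""
    parts = [{"type": "text", "text": text}]
    if image_url is not None:
        parts.append({"type": "image_url", "image_url": {"url": image_url}})
    if audio_bytes is not None:
        parts.append({
            "type": "input_audio",
            "input_audio": {
                # Audio is sent inline as base64 text.
                "data": base64.b64encode(audio_bytes).decode("ascii"),
                "format": audio_format,
            },
        })
    return {"role": "user", "content": parts}


# Hypothetical tool, only meaningful on models deployed with a tool_call_parser.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

payload = {
    "model": "<served-model-id>",  # placeholder, not a real id
    "messages": [multimodal_message("What is in this picture?",
                                    image_url="https://example.com/booth.png")],
    "tools": [weather_tool],
    "tool_choice": "auto",
}
print(json.dumps(payload, indent=2))
```

Whether a single model accepts both multimodal parts and tools depends on how it is configured; the two are combined here only to keep the example short.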

Authentication

Auth is off by default (endpoints are public; any token is accepted). To require a bearer token, deploy with auth enabled — secrets are supplied as environment variables, never hard-coded:

# 1. Create the secret. The KEY must be VLLM_API_KEY (vLLM reads this env var);
#    the VALUE is the bearer token clients will send.
modal secret create llm-api-key VLLM_API_KEY=sk-your-token

# 2. Deploy with auth turned on (per provider app).
MODAL_LLM_REQUIRE_AUTH=1 modal deploy modal/app_google.py

With auth on, vLLM enforces Authorization: Bearer <token> and returns 401 otherwise. Clients pass the same token as their API key.
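For clients that do not use the OpenAI SDK, the bearer-token handling described above can be sketched with the standard library. The helper names `auth_headers` and `list_models` are illustrative, and the response shape (`{"data": [{"id": ...}]}`) is the standard OpenAI model-list format rather than something confirmed against this deployment.

```python
import json
import urllib.error
import urllib.request
from typing import Optional


def auth_headers(token: Optional[str]) -> dict:
    """Return the Authorization header, or no headers when no token is set."""
    return {"Authorization": f"Bearer {token}"} if token else {}


def list_models(base_url: str, token: Optional[str] = None,
                timeout: float = 30.0) -> list:
    """GET {base_url}/models and return the served ids (performs a network call)."""
    req = urllib.request.Request(f"{base_url}/models", headers=auth_headers(token))
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            body = json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 401:
            # Auth is enabled and the token is missing or does not match.
            raise PermissionError("401 from endpoint: check LLM_API_KEY") from err
        raise
    return [model["id"] for model in body.get("data", [])]


# Offline demo: only the header construction is exercised here.
print(auth_headers("sk-your-token"))
print(auth_headers(None))
```

Reading the token from the environment, for example `list_models(base_url, os.environ.get("LLM_API_KEY"))`, keeps it out of source control in the same way the deployment does.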

Examples

curl

curl https://<workspace>--google-llms-gemma-4-12b.modal.run/v1/chat/completions \
  -H "Authorization: Bearer $LLM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-12B",
    "messages": [{"role": "user", "content": "Describe a mossy ticket booth."}],
    "max_tokens": 256
  }'

OpenAI SDK

import os

from openai import OpenAI

client = OpenAI(
    base_url="https://<workspace>--google-llms-gemma-4-12b.modal.run/v1",
    api_key=os.environ["LLM_API_KEY"],  # any value when auth is off
)
resp = client.chat.completions.create(
    model="google/gemma-4-12B",
    messages=[{"role": "user", "content": "Hello from the wood."}],
)
print(resp.choices[0].message.content)

The bundled ../client.py wraps this and reads the token from the LLM_API_KEY environment variable.

Generating clients

# Typed client from the checked-in spec...
openapi-generator-cli generate -i modal/openapi.yaml -g python -o ./gen

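# ...or from a live endpoint's exact spec: download it, then generate from the local copy
# (the output directory name is arbitrary).
curl -s https://<workspace>--google-llms-gemma-4-12b.modal.run/openapi.json -o openapi.json
openapi-generator-cli generate -i openapi.json -g python -o ./gen-live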