Spaces:

build-small-hackathon
/

multi-agent-lab

Running on Zero

App Files Files Community

multi-agent-lab / modal /docs /openapi.md

agharsallah

fix: update endpoint URLs to reflect new app naming conventions

7cedfb2 21 days ago

preview code

Raw

History Blame Contribute Delete

3.53 kB

	# OpenAPI / API reference

	Every deployed model speaks the OpenAI REST protocol, so the API surface is
	the familiar OpenAI one. There are two sources of truth:

	- Live, per-model spec — each running endpoint serves its own
	auto-generated spec at `/openapi.json` and an interactive Swagger UI at
	`/docs`:

	```
	https://<workspace>--<app-name>-<endpoint-name>.modal.run/openapi.json
	https://<workspace>--<app-name>-<endpoint-name>.modal.run/docs
	```

	- Checked-in spec — [`../openapi.yaml`](../openapi.yaml) documents the
	shared, stable surface across all endpoints (OpenAPI 3.1). Use it for client
	generation and review; use the live spec for the exact, version-pinned shape.

	## Base URL

	```
	https://<workspace>--<app-name>-<endpoint-name>.modal.run/v1
	```

	One server per model; the URL label is `<app-name>-<endpoint-name>` — the
	`modal.App` (`nvidia-llms` / `openbmb-llms` / `google-llms`) plus the model's
	`endpoint_name` from `registry.py` (e.g. `google-llms-gemma-4-12b`,
	`nvidia-llms-nemotron-3-nano-4b`). The `model` you send is the served id (the
	HF repo id), not this slug.

	## Endpoints

	\| Method & path \| Purpose \|
	\| ----------------------- \| ---------------------------------------- \|
	\| `GET /v1/models` \| List the model served by this endpoint. \|
	\| `POST /v1/chat/completions` \| Chat completion (streaming via `stream: true`). \|
	\| `POST /v1/completions` \| Text completion. \|

	Multimodal models (MiniCPM-o-4_5) accept array-style `content` parts
	(`text` / `image_url` / `input_audio`) on chat messages. Models configured with
	a `tool_call_parser` accept `tools` / `tool_choice`.

	## Authentication

	Auth is off by default (endpoints are public; any token is accepted). To
	require a bearer token, deploy with auth enabled — secrets are supplied as
	environment variables, never hard-coded:

	```bash
	# 1. Create the secret. The KEY must be VLLM_API_KEY (vLLM reads this env var);
	# the VALUE is the bearer token clients will send.
	modal secret create llm-api-key VLLM_API_KEY=sk-your-token

	# 2. Deploy with auth turned on (per provider app).
	MODAL_LLM_REQUIRE_AUTH=1 modal deploy modal/app_google.py
	```

	With auth on, vLLM enforces `Authorization: Bearer <token>` and returns `401`
	otherwise. Clients pass the same token as their API key.

	## Examples

	### curl

	```bash
	curl https://<workspace>--google-llms-gemma-4-12b.modal.run/v1/chat/completions \
	-H "Authorization: Bearer $LLM_API_KEY" \
	-H "Content-Type: application/json" \
	-d '{
	"model": "google/gemma-4-12B",
	"messages": [{"role": "user", "content": "Describe a mossy ticket booth."}],
	"max_tokens": 256
	}'
	```

	### OpenAI SDK

	```python
	from openai import OpenAI

	client = OpenAI(
	base_url="https://<workspace>--google-llms-gemma-4-12b.modal.run/v1",
	api_key=os.environ["LLM_API_KEY"], # any value when auth is off
	)
	resp = client.chat.completions.create(
	model="google/gemma-4-12B",
	messages=[{"role": "user", "content": "Hello from the wood."}],
	)
	print(resp.choices[0].message.content)
	```

	The bundled [`../client.py`](../client.py) wraps this and reads the token from
	the `LLM_API_KEY` environment variable.

	## Generating clients

	```bash
	# Typed client from the checked-in spec...
	openapi-generator-cli generate -i modal/openapi.yaml -g python -o ./gen

	# ...or from a live endpoint's exact spec:
	curl -s https://<workspace>--google-llms-gemma-4-12b.modal.run/openapi.json -o openapi.json
	```