# OpenAPI / API reference

Every deployed model speaks the **OpenAI REST protocol**, so the API surface is
the familiar OpenAI one. There are two sources of truth:

- **Live, per-model spec** — each running endpoint serves its own
  auto-generated spec at `/openapi.json` and an interactive Swagger UI at
  `/docs`:

  ```
  https://<workspace>--<app-name>-<endpoint-name>.modal.run/openapi.json
  https://<workspace>--<app-name>-<endpoint-name>.modal.run/docs
  ```

- **Checked-in spec** — [`../openapi.yaml`](../openapi.yaml) documents the
  shared, stable surface across all endpoints (OpenAPI 3.1). Use it for client
  generation and review; use the live spec for the exact, version-pinned shape.
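To see what a given endpoint actually exposes, the live spec can be fetched and its operations listed. The following is a minimal sketch using only the standard library; `fetch_spec` and `list_operations` are hypothetical helper names, and the sketch assumes the spec follows the usual OpenAPI layout with a top-level `paths` object.

```python
import json
from urllib.request import urlopen

# HTTP verbs that can appear as keys of an OpenAPI path item.
HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options"}


def fetch_spec(endpoint_root: str) -> dict:
    """Download `/openapi.json` from an endpoint root (no trailing `/v1`)."""
    with urlopen(f"{endpoint_root}/openapi.json") as resp:
        return json.load(resp)


def list_operations(spec: dict) -> list:
    """Return (METHOD, path) pairs found in the spec, sorted by path."""
    ops = []
    for path, item in spec.get("paths", {}).items():
        for key in item:
            # Path items may also carry non-verb keys such as "parameters".
            if key.lower() in HTTP_METHODS:
                ops.append((key.upper(), path))
    return sorted(ops, key=lambda op: (op[1], op[0]))
```

Calling `list_operations(fetch_spec("https://<workspace>--<app-name>-<endpoint-name>.modal.run"))` against a running endpoint should include the chat and completion routes described in this document.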

## Base URL

```
https://<workspace>--<app-name>-<endpoint-name>.modal.run/v1
```

Each model gets its own server. The URL label is `<app-name>-<endpoint-name>`:
the `modal.App` name (`nvidia-llms` / `openbmb-llms` / `google-llms`) followed
by the model's `endpoint_name` from `registry.py` (e.g.
`google-llms-gemma-4-12b`, `nvidia-llms-nemotron-3-nano-4b`). The `model` you
send in requests is the *served id* (the HF repo id), not this slug.
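The URL pattern above can be assembled mechanically. This is an illustrative sketch only: `endpoint_base_url` is a hypothetical helper, and `acme` in the usage note is a made-up workspace name.

```python
def endpoint_base_url(workspace: str, app_name: str, endpoint_name: str) -> str:
    """Build the `/v1` base URL from the pieces described above."""
    # Workspace and label are separated by a double hyphen; the label itself
    # is the app name and endpoint name joined by a single hyphen.
    return f"https://{workspace}--{app_name}-{endpoint_name}.modal.run/v1"
```

For example, `endpoint_base_url("acme", "google-llms", "gemma-4-12b")` yields `https://acme--google-llms-gemma-4-12b.modal.run/v1`.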

## Endpoints

| Method & path               | Purpose                                         |
| --------------------------- | ----------------------------------------------- |
| `GET  /v1/models`           | List the model served by this endpoint.         |
| `POST /v1/chat/completions` | Chat completion (streaming via `stream: true`). |
| `POST /v1/completions`      | Text completion.                                |

Multimodal models (MiniCPM-o-4_5) accept array-style `content` parts
(`text` / `image_url` / `input_audio`) on chat messages. Models configured with
a `tool_call_parser` accept `tools` / `tool_choice`.
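As a rough illustration of array-style `content`, the sketch below builds a user message with a text part and an optional image part. It assumes the standard OpenAI Chat Completions part shapes; `user_message` is a hypothetical helper, and the exact parts a given model accepts should be checked against its live spec.

```python
from typing import Optional


def user_message(text: str, image_url: Optional[str] = None) -> dict:
    """Build a chat message whose `content` is a list of typed parts."""
    parts = [{"type": "text", "text": text}]
    if image_url is not None:
        # OpenAI-style image part: the URL is nested under "image_url".
        parts.append({"type": "image_url", "image_url": {"url": image_url}})
    return {"role": "user", "content": parts}
```

The result can be passed directly as one element of the `messages` list in a chat completion request.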

## Authentication

Auth is **off by default** (endpoints are public; any token is accepted). To
require a bearer token, deploy with auth enabled — secrets are supplied as
environment variables, never hard-coded:

```bash
# 1. Create the secret. The KEY must be VLLM_API_KEY (vLLM reads this env var);
#    the VALUE is the bearer token clients will send.
modal secret create llm-api-key VLLM_API_KEY=sk-your-token

# 2. Deploy with auth turned on (per provider app).
MODAL_LLM_REQUIRE_AUTH=1 modal deploy modal/app_google.py
```

With auth on, vLLM enforces `Authorization: Bearer <token>` and returns `401`
otherwise. Clients pass the same token as their API key.
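On the client side, the token can be read from the environment in the same spirit. A minimal sketch, assuming the client keeps the token in `LLM_API_KEY` (the variable used in the examples in this document); `auth_headers` is a hypothetical helper.

```python
import os
from typing import Mapping


def auth_headers(env: Mapping[str, str] = os.environ) -> dict:
    """Build request headers, adding a bearer token only when one is set."""
    headers = {"Content-Type": "application/json"}
    token = env.get("LLM_API_KEY")
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers
```

With auth off, the missing `Authorization` header is harmless; with auth on, a request built without the token should come back as `401`.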

## Examples

### curl

```bash
curl https://<workspace>--google-llms-gemma-4-12b.modal.run/v1/chat/completions \
  -H "Authorization: Bearer $LLM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-12B",
    "messages": [{"role": "user", "content": "Describe a mossy ticket booth."}],
    "max_tokens": 256
  }'
```

### OpenAI SDK

```python
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://<workspace>--google-llms-gemma-4-12b.modal.run/v1",
    api_key=os.environ["LLM_API_KEY"],  # any value when auth is off
)
resp = client.chat.completions.create(
    model="google/gemma-4-12B",
    messages=[{"role": "user", "content": "Hello from the wood."}],
)
print(resp.choices[0].message.content)
```

The bundled [`../client.py`](../client.py) wraps this and reads the token from
the `LLM_API_KEY` environment variable.
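When `stream: true` is set, the response arrives incrementally. The sketch below shows one way to reassemble the text from a raw stream, assuming the usual OpenAI-style server-sent-event framing (`data: {json}` lines ending with `data: [DONE]`); `iter_stream_text` is a hypothetical helper, and the OpenAI SDK handles this framing itself when called with `stream=True`.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style chat completion SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # sentinel that marks the end of the stream
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text
```

Joining the yielded fragments with `"".join(...)` gives the full assistant reply.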

## Generating clients

```bash
# Typed client from the checked-in spec...
openapi-generator-cli generate -i modal/openapi.yaml -g python -o ./gen

# ...or from a live endpoint's exact spec:
curl -s https://<workspace>--google-llms-gemma-4-12b.modal.run/openapi.json -o openapi.json
```