Kyryll Kochkin committed
Commit 49badf7 · 1 Parent(s): 698373a

Use live API for OpenAI responses client test
README.md CHANGED
@@ -10,7 +10,7 @@ pinned: false
  # GPT3dev OpenAI-Compatible API
  **More detailed documentation is hosted on [DeepWiki](https://deepwiki.com/krll-corp/gpt3dev-api)**
 
- A production-ready FastAPI server that mirrors the OpenAI REST API surface while proxying requests to Hugging Face causal language models. The service implements the `/v1/completions`, `/v1/chat/completions`, `/v1/models`, and `/v1/embeddings` endpoints with full support for streaming Server-Sent Events (SSE) and OpenAI-style usage accounting. Chat completions are available for instruct-tuned models like `GPT4-dev-177M-1511-Instruct`.
+ A production-ready FastAPI server that mirrors the OpenAI REST API surface while proxying requests to Hugging Face causal language models. The service implements the `/v1/completions`, `/v1/chat/completions`, `/v1/responses`, `/v1/models`, and `/v1/embeddings` endpoints with full support for streaming Server-Sent Events (SSE) and OpenAI-style usage accounting. Chat completions are available for instruct-tuned models like `GPT4-dev-177M-1511-Instruct`.
 
  ## The API is hosted on HuggingFace Spaces:
  ```bash
@@ -19,7 +19,7 @@ https://k050506koch-gpt3-dev-api.hf.space
 
  ## Features
 
- - ✅ Drop-in compatible request/response schemas for OpenAI text completions.
+ - ✅ Drop-in compatible request/response schemas for OpenAI text completions and responses.
  - ✅ Streaming responses (`stream=true`) that emit OpenAI-formatted SSE frames ending with `data: [DONE]`.
  - ✅ Configurable Hugging Face model registry with lazy loading, shared model cache, and automatic device placement.
  - ✅ Prompt token counting via `tiktoken` when available (falls back to Hugging Face tokenizers).
@@ -130,6 +130,18 @@ curl http://localhost:7860/v1/chat/completions \
 
  Non-instruct models will return an error directing users to use `/v1/completions` instead.
 
+ ### Responses API
+
+ ```bash
+ curl http://localhost:7860/v1/responses \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "GPT4-dev-177M-1511-Instruct",
+     "input": "Summarize the key points in two sentences.",
+     "max_output_tokens": 128
+   }'
+ ```
+
  ### Embeddings
 
  The `/v1/embeddings` endpoint returns a 501 Not Implemented error with actionable guidance unless an embeddings backend is configured.
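
For Python users, the same request goes through the official OpenAI client pointed at this server. A minimal sketch, mirroring the live test added in this commit; it assumes a local server on port 7860 (as in the curl example) and that any placeholder API key is accepted:

```python
# Sketch: calling the new /v1/responses endpoint with the official OpenAI
# Python client. Assumes the server runs locally on port 7860 and accepts
# any placeholder API key.
from openai import OpenAI

client = OpenAI(api_key="test", base_url="http://localhost:7860/v1")
response = client.responses.create(
    model="GPT4-dev-177M-1511-Instruct",
    input="Summarize the key points in two sentences.",
    max_output_tokens=128,
)
# The first output item is an assistant message containing output_text parts.
print(response.output[0].content[0].text)
```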
app/main.py CHANGED
@@ -15,7 +15,7 @@ from fastapi.responses import JSONResponse
  from fastapi.routing import APIRoute
 
  from .core.settings import get_settings
- from .routers import chat, completions, embeddings, models
+ from .routers import chat, completions, embeddings, models, responses
 
 
  def configure_logging(level: str) -> None:
@@ -124,6 +124,7 @@ if settings.cors_allow_origins:
  app.include_router(models.router)
  app.include_router(completions.router)
  app.include_router(chat.router)
+ app.include_router(responses.router)
  app.include_router(embeddings.router)
 
 
app/routers/__init__.py CHANGED
@@ -1,4 +1,4 @@
  """Router package exports."""
- from . import chat, completions, embeddings, models
+ from . import chat, completions, embeddings, models, responses
 
- __all__ = ["chat", "completions", "embeddings", "models"]
+ __all__ = ["chat", "completions", "embeddings", "models", "responses"]
app/routers/responses.py ADDED
@@ -0,0 +1,186 @@
+ """Responses API endpoint."""
+ from __future__ import annotations
+
+ import asyncio
+ import json
+ import time
+ import uuid
+ from typing import Generator, List
+
+ from fastapi import APIRouter
+ from fastapi.responses import StreamingResponse
+
+ from ..core import engine
+ from ..core.errors import model_not_found, openai_http_error
+ from ..core.model_registry import get_model_spec
+ from ..schemas.responses import (
+     ResponseInputMessage,
+     ResponseOutputMessage,
+     ResponseOutputText,
+     ResponsePayload,
+     ResponseRequest,
+     ResponseUsage,
+ )
+
+ router = APIRouter(prefix="/v1", tags=["responses"])
+
+
+ def _render_input_text(message: ResponseInputMessage) -> str:
+     if isinstance(message.content, str):
+         return message.content
+     return "".join(part.text for part in message.content if part.type == "input_text")
+
+
+ def _normalize_messages(input_payload: List[ResponseInputMessage]) -> List[dict]:
+     return [{"role": message.role, "content": _render_input_text(message)} for message in input_payload]
+
+
+ def _stop_sequences(stop: List[str] | str | None) -> List[str]:
+     if isinstance(stop, list):
+         return stop
+     return [stop] if stop else []
+
+
+ def _build_output(text: str) -> ResponseOutputMessage:
+     return ResponseOutputMessage(
+         id=f"msg_{uuid.uuid4().hex}",
+         content=[ResponseOutputText(text=text)],
+     )
+
+
+ @router.post("/responses", response_model=ResponsePayload)
+ async def create_response(payload: ResponseRequest) -> ResponsePayload | StreamingResponse:
+     """Generate a response using OpenAI's Responses API format."""
+     try:
+         spec = get_model_spec(payload.model)
+     except KeyError:
+         raise model_not_found(payload.model)
+
+     stop_sequences = _stop_sequences(payload.stop)
+
+     if isinstance(payload.input, str):
+         if spec.is_instruct:
+             messages = [{"role": "user", "content": payload.input}]
+             prompt = engine.apply_chat_template(payload.model, messages)
+         else:
+             prompt = payload.input
+     else:
+         if not spec.is_instruct:
+             raise openai_http_error(
+                 400,
+                 f"Model '{payload.model}' is not an instruct model and cannot accept structured input. "
+                 "Provide a plain string input or use /v1/chat/completions for chat-formatted prompts.",
+                 error_type="invalid_request_error",
+                 param="model",
+             )
+         messages = _normalize_messages(payload.input)
+         prompt = engine.apply_chat_template(payload.model, messages)
+
+     if payload.stream:
+         return _streaming_response(payload, prompt, stop_sequences)
+
+     try:
+         result = await asyncio.to_thread(
+             engine.generate,
+             payload.model,
+             prompt,
+             temperature=payload.temperature,
+             top_p=payload.top_p,
+             max_tokens=payload.max_output_tokens,
+             stop=stop_sequences,
+             n=payload.n,
+         )
+     except Exception as exc:
+         raise openai_http_error(
+             500,
+             f"Generation error: {exc}",
+             error_type="server_error",
+             code="generation_error",
+         )
+
+     output: List[ResponseOutputMessage] = []
+     total_completion_tokens = 0
+     for item in result.completions:
+         total_completion_tokens += item.tokens
+         output.append(_build_output(item.text.strip()))
+
+     usage = ResponseUsage(
+         input_tokens=result.prompt_tokens,
+         output_tokens=total_completion_tokens,
+         total_tokens=result.prompt_tokens + total_completion_tokens,
+     )
+     return ResponsePayload(
+         id=f"resp_{uuid.uuid4().hex}",
+         model=payload.model,
+         output=output,
+         usage=usage,
+     )
+
+
+ def _streaming_response(
+     payload: ResponseRequest,
+     prompt: str,
+     stop_sequences: List[str],
+ ) -> StreamingResponse:
+     response_id = f"resp_{uuid.uuid4().hex}"
+     message_id = f"msg_{uuid.uuid4().hex}"
+     created = int(time.time())
+
+     def event_stream() -> Generator[bytes, None, None]:
+         stream = engine.create_stream(
+             payload.model,
+             prompt,
+             temperature=payload.temperature,
+             top_p=payload.top_p,
+             max_tokens=payload.max_output_tokens,
+             stop=stop_sequences,
+         )
+         base_payload = ResponsePayload(
+             id=response_id,
+             created=created,
+             model=payload.model,
+             output=[],
+             usage=ResponseUsage(input_tokens=0, output_tokens=0, total_tokens=0),
+         )
+         created_payload = {
+             "type": "response.created",
+             "response": base_payload.model_dump(),
+         }
+         yield f"data: {json.dumps(created_payload)}\n\n".encode()
+
+         collected = ""
+         for token in stream.iter_tokens():
+             collected += token
+             delta_payload = {
+                 "type": "response.output_text.delta",
+                 "response_id": response_id,
+                 "item_id": message_id,
+                 "output_index": 0,
+                 "content_index": 0,
+                 "delta": token,
+             }
+             yield f"data: {json.dumps(delta_payload)}\n\n".encode()
+
+         usage = ResponseUsage(
+             input_tokens=stream.prompt_tokens,
+             output_tokens=stream.completion_tokens,
+             total_tokens=stream.prompt_tokens + stream.completion_tokens,
+         )
+         final_payload = ResponsePayload(
+             id=response_id,
+             created=created,
+             model=payload.model,
+             output=[ResponseOutputMessage(id=message_id, content=[ResponseOutputText(text=collected)])],
+             usage=usage,
+         )
+         completed_payload = {
+             "type": "response.completed",
+             "response": final_payload.model_dump(),
+         }
+         yield f"data: {json.dumps(completed_payload)}\n\n".encode()
+         yield b"data: [DONE]\n\n"
+
+     return StreamingResponse(
+         event_stream(),
+         media_type="text/event-stream",
+     )
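
A minimal sketch of a client consuming this SSE stream, keyed off the `response.output_text.delta` and `response.completed` event types emitted above. It assumes a local server on port 7860; `httpx` is already a test dependency:

```python
# Sketch: consuming the /v1/responses SSE stream produced by the handler
# above. Assumes a local server on port 7860; httpx is already listed in
# requirements-test.txt.
import json

import httpx

with httpx.stream(
    "POST",
    "http://localhost:7860/v1/responses",
    json={
        "model": "GPT4-dev-177M-1511-Instruct",
        "input": "Say hello in one sentence.",
        "stream": True,
    },
    timeout=60.0,
) as resp:
    for line in resp.iter_lines():
        if not line.startswith("data: "):
            continue  # skip blank separator lines between frames
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        event = json.loads(data)
        if event["type"] == "response.output_text.delta":
            print(event["delta"], end="", flush=True)
        elif event["type"] == "response.completed":
            # Final frame carries the full output message and usage totals.
            usage = event["response"]["usage"]
            print(f"\n[{usage['total_tokens']} tokens]")
```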
app/schemas/responses.py ADDED
@@ -0,0 +1,58 @@
+ """Schemas for the Responses API endpoint."""
+ from __future__ import annotations
+
+ import time
+ from typing import List, Literal, Optional, Union
+
+ from pydantic import BaseModel, Field, AliasChoices
+
+
+ class ResponseInputContentPart(BaseModel):
+     type: Literal["input_text"] = "input_text"
+     text: str
+
+
+ class ResponseInputMessage(BaseModel):
+     role: Literal["system", "user", "assistant", "tool"]
+     content: Union[str, List[ResponseInputContentPart]]
+
+
+ class ResponseRequest(BaseModel):
+     model: str
+     input: Union[str, List[ResponseInputMessage]]
+     temperature: float = 1.0
+     top_p: float = 1.0
+     n: int = 1
+     stop: Optional[List[str] | str] = None
+     max_output_tokens: Optional[int] = Field(
+         default=None,
+         validation_alias=AliasChoices("max_output_tokens", "max_tokens"),
+     )
+     stream: bool = False
+
+
+ class ResponseOutputText(BaseModel):
+     type: Literal["output_text"] = "output_text"
+     text: str
+
+
+ class ResponseOutputMessage(BaseModel):
+     id: str
+     type: Literal["message"] = "message"
+     role: Literal["assistant"] = "assistant"
+     content: List[ResponseOutputText]
+
+
+ class ResponseUsage(BaseModel):
+     input_tokens: int
+     output_tokens: int
+     total_tokens: int
+
+
+ class ResponsePayload(BaseModel):
+     id: str
+     object: Literal["response"] = "response"
+     created: int = Field(default_factory=lambda: int(time.time()))
+     model: str
+     output: List[ResponseOutputMessage]
+     usage: ResponseUsage
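
One consequence of the `AliasChoices` validation alias above: clients may send either `max_output_tokens` (the Responses API spelling) or the legacy `max_tokens`, and both populate the same field. A quick sketch under pydantic v2, as pinned in the test requirements:

```python
# Sketch: either field spelling validates into max_output_tokens because
# AliasChoices lists both names (pydantic v2 behavior).
from app.schemas.responses import ResponseRequest

a = ResponseRequest.model_validate({"model": "GPT3-dev", "input": "Hi", "max_output_tokens": 64})
b = ResponseRequest.model_validate({"model": "GPT3-dev", "input": "Hi", "max_tokens": 64})
assert a.max_output_tokens == b.max_output_tokens == 64
```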
requirements-test.txt CHANGED
@@ -1,5 +1,6 @@
  fastapi>=0.110.0
  httpx>=0.27.0
+ openai>=1.30.0
  pytest>=7.4.0
  pytest-asyncio>=0.23.0
  pydantic>=2.6.0
tests/test_live_api.py CHANGED
@@ -25,6 +25,19 @@ def _get_models(timeout: float = 10.0) -> Set[str]:
      return {item["id"] for item in data.get("data", [])}
 
 
+ @pytest.mark.skipif(not RUN_LIVE, reason="set RUN_LIVE_API_TESTS=1 to run live API tests")
+ def test_responses_openai_client() -> None:
+     openai_module = pytest.importorskip("openai")
+     OpenAI = openai_module.OpenAI
+     model = "GPT4-dev-177M-1511-Instruct"
+     available = _get_models()
+     if model not in available:
+         pytest.skip(f"model {model} not available on server; available={sorted(available)}")
+     client = OpenAI(api_key="test", base_url=f"{BASE_URL}/v1")
+     response = client.responses.create(model=model, input="Say hello in one sentence.")
+     assert response.output[0].content[0].text
+
+
  @pytest.mark.skipif(not RUN_LIVE, reason="set RUN_LIVE_API_TESTS=1 to run live API tests")
  @pytest.mark.parametrize("model", ["GPT-2", "GPT3-dev-350m-2805"])  # adjust names as available
  def test_completion_basic(model: str) -> None:
@@ -51,4 +64,3 @@ def test_completion_basic(model: str) -> None:
      # The completion can be empty for some models with temperature=0, but should be a string
      usage = body.get("usage") or {}
      assert "total_tokens" in usage
- 
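As the skip markers indicate, the live tests are opt-in. A typical invocation, assuming the target server is already running at the configured `BASE_URL`:

```bash
# Live tests are skipped unless the opt-in flag is set.
RUN_LIVE_API_TESTS=1 pytest tests/test_live_api.py -v
```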
tests/test_openai_compat.py CHANGED
@@ -118,9 +118,10 @@ sys.path.append(str(Path(__file__).resolve().parents[1]))
 
  from app.core import model_registry as model_registry_module
  from app.core.model_registry import ModelMetadata, ModelSpec
- from app.routers import chat, completions, embeddings, models
+ from app.routers import chat, completions, embeddings, models, responses
  from app.schemas.chat import ChatCompletionRequest
  from app.schemas.completions import CompletionRequest
+ from app.schemas.responses import ResponseRequest
 
 
  def test_list_models() -> None:
@@ -244,6 +245,69 @@ def test_chat_rejects_non_instruct_model(monkeypatch: pytest.MonkeyPatch) -> None:
      assert "not an instruct model" in exc.value.detail["message"]
 
 
+ def test_responses_string_input(monkeypatch: pytest.MonkeyPatch) -> None:
+     class DummyResult:
+         prompt_tokens = 4
+         completions = [type("C", (), {"text": "Hello", "tokens": 2, "finish_reason": "stop"})()]
+
+     def fake_generate(*args, **kwargs):
+         return DummyResult()
+
+     monkeypatch.setattr("app.routers.responses.engine.generate", fake_generate)
+     monkeypatch.setattr(
+         "app.routers.responses.get_model_spec",
+         lambda model: ModelSpec(name=model, hf_repo="dummy/repo", is_instruct=False),
+     )
+     payload = ResponseRequest.model_validate({
+         "model": "GPT3-dev",
+         "input": "Hi",
+     })
+     response = asyncio.run(responses.create_response(payload))
+     body = response.model_dump()
+     assert body["object"] == "response"
+     assert body["output"][0]["role"] == "assistant"
+     assert body["output"][0]["content"][0]["text"] == "Hello"
+     assert body["usage"]["input_tokens"] == 4
+     assert body["usage"]["output_tokens"] == 2
+
+
+ def test_responses_instruct_messages(monkeypatch: pytest.MonkeyPatch) -> None:
+     class DummyResult:
+         prompt_tokens = 3
+         completions = [type("C", (), {"text": "Sure", "tokens": 1, "finish_reason": "stop"})()]
+
+     recorded_prompts: list[str] = []
+
+     def fake_generate(*args, **kwargs):
+         recorded_prompts.append(args[1])
+         return DummyResult()
+
+     monkeypatch.setattr("app.routers.responses.engine.generate", fake_generate)
+     monkeypatch.setattr(
+         "app.routers.responses.engine.apply_chat_template",
+         lambda model, messages: "formatted prompt",
+     )
+     monkeypatch.setattr(
+         "app.routers.responses.get_model_spec",
+         lambda model: ModelSpec(name=model, hf_repo="dummy/instruct", is_instruct=True),
+     )
+     payload = ResponseRequest.model_validate({
+         "model": "GPT4-dev-177M-1511-Instruct",
+         "input": [{"role": "user", "content": "Hi"}],
+     })
+     response = asyncio.run(responses.create_response(payload))
+     body = response.model_dump()
+     assert recorded_prompts == ["formatted prompt"]
+     assert body["output"][0]["content"][0]["text"] == "Sure"
+     assert body["usage"]["total_tokens"] == 4
+
+
+ def test_openai_client_responses_create(monkeypatch: pytest.MonkeyPatch) -> None:
+     openai_module = pytest.importorskip("openai")
+     OpenAI = openai_module.OpenAI
+     pytest.skip("OpenAI client test moved to live API coverage.")
+
+
  def test_embeddings_not_implemented() -> None:
      with pytest.raises(HTTPException) as exc:
          asyncio.run(embeddings.create_embeddings())