Valtry committed · Commit cf97964 · verified · 1 Parent(s): 42db743

Upload 4 files

Files changed (4)
  1. Dockerfile +12 -16
  2. README.md +97 -135
  3. app.py +115 -100
  4. requirements.txt +6 -6
Dockerfile CHANGED
@@ -1,16 +1,12 @@
- FROM python:3.10-slim
-
- ENV PYTHONDONTWRITEBYTECODE=1
- ENV PYTHONUNBUFFERED=1
- ENV PIP_NO_CACHE_DIR=1
-
- WORKDIR /app
-
- COPY requirements.txt /app/requirements.txt
- RUN pip install --upgrade pip && pip install -r /app/requirements.txt
-
- COPY app.py /app/app.py
-
- EXPOSE 7860
-
- CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]

+ FROM python:3.11-slim
+
+ WORKDIR /app
+
+ COPY requirements.txt ./
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ COPY . .
+
+ EXPOSE 7860
+
+ CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
 
 
 
 
README.md CHANGED
@@ -1,135 +1,97 @@
- ---
- title: Hugging Face Space LLM Runner API
- emoji: "🤖"
- colorFrom: blue
- colorTo: indigo
- sdk: docker
- app_file: app.py
- pinned: false
- ---
-
- # Hugging Face Space LLM Model Runner API
-
- This Space exposes a lightweight instruction-tuned LLM through a REST API.
-
- ## Purpose
-
- The Space is only responsible for:
- - Loading the model once at startup
- - Accepting prompts from external services
- - Generating text
- - Returning JSON responses
-
- Chatbot memory, tool routing, and conversation logic should be handled by your backend service.
-
- ## Stack
-
- - Python
- - FastAPI
- - Transformers
- - Accelerate
- - Torch
-
- ## Model
-
- Default model:
- - `Qwen/Qwen2.5-0.5B-Instruct`
-
- You can replace `MODEL_NAME` in `app.py` with:
- - `Qwen/Qwen2.5-1.5B-Instruct`
- - `Qwen/Qwen2.5-3B-Instruct`
- - `microsoft/Phi-3-mini-4k-instruct`
-
- ## Files
-
- - `app.py`
- - `requirements.txt`
- - `Dockerfile`
- - `README.md`
-
- ## API
-
- ### Health check
-
- `GET /health`
-
- Response:
-
- ```json
- {
-   "status": "ok"
- }
- ```
-
- ### Generate text
-
- `POST /generate`
-
- Request body:
-
- ```json
- {
-   "prompt": "Explain artificial intelligence in simple terms",
-   "max_tokens": 96,
-   "temperature": 0.7,
-   "top_p": 0.9
- }
- ```
-
- Notes:
- - `prompt` is required and must not be empty.
- - `max_tokens` default: `96` (max allowed: `256`)
- - `temperature` default: `0.7`
- - `top_p` default: `0.9`
-
- Response:
-
- ```json
- {
-   "response": "Artificial intelligence is..."
- }
- ```
-
- ## Run locally (without Docker)
-
- ```bash
- uvicorn app:app --host 0.0.0.0 --port 7860
- ```
-
- ## Run locally (with Docker)
-
- Build image:
-
- ```bash
- docker build -t hf-llm-runner .
- ```
-
- Run container:
-
- ```bash
- docker run --rm -p 7860:7860 hf-llm-runner
- ```
-
- ## Create and deploy on Hugging Face Spaces (Docker)
-
- 1. Go to Hugging Face -> Spaces -> **Create new Space**.
- 2. Set:
-    - Owner: your account/org
-    - Space name: your choice
-    - License: your choice
-    - SDK: **Docker**
-    - Visibility: Public (or Private if your plan supports it)
- 3. Create the Space.
- 4. Upload/push these files into the Space repo root:
-    - `app.py`
-    - `requirements.txt`
-    - `Dockerfile`
-    - `README.md`
- 5. Wait for build to finish. First startup may be slow because model weights download.
- 6. Test:
-    - `GET https://<your-space-name>.hf.space/health`
-    - `POST https://<your-space-name>.hf.space/generate`
-
- Use your Space URL from backend:
-
- `https://<your-space-name>.hf.space/generate`

+ ---
+ title: Streaming LLM API
+ colorFrom: blue
+ colorTo: green
+ sdk: docker
+ app_port: 7860
+ ---
+
+ # Hugging Face Space Streaming LLM Inference API
+
+ A lightweight Hugging Face Space API server for real-time token streaming with **Qwen2.5-0.5B-Instruct**.
+
+ ## Features
+
+ - FastAPI server with SSE streaming endpoint
+ - One-time model/tokenizer loading during startup
+ - Configurable generation parameters (`max_tokens`, `temperature`, `top_p`)
+ - Efficient inference with `torch.no_grad()` and `device_map="auto"`
+ - Request validation and clear error responses
+
+ ## Model
+
+ - **Primary model:** `Qwen/Qwen2.5-0.5B-Instruct`
+ - Automatically downloaded from Hugging Face at startup
+
+ ## File Structure
+
+ - `app.py`
+ - `requirements.txt`
+ - `README.md`
+ - `Dockerfile`
+
+ ## Requirements
+
+ ```txt
+ transformers
+ accelerate
+ torch
+ fastapi
+ uvicorn
+ pydantic
+ ```
+
+ ## Run Locally
+
+ ```bash
+ pip install -r requirements.txt
+ uvicorn app:app --host 0.0.0.0 --port 7860
+ ```
+
+ ## API
+
+ ### `POST /generate_stream`
+
+ Request JSON:
+
+ ```json
+ {
+   "prompt": "user prompt text",
+   "max_tokens": 512,
+   "temperature": 0.7,
+   "top_p": 0.9
+ }
+ ```
+
+ - `prompt` is required and must not be empty.
+ - `max_tokens`, `temperature`, and `top_p` are optional.
+
+ Response:
+
+ - Content type: `text/event-stream`
+ - Streams generated text chunks incrementally as SSE events.
+
+ ## Example cURL
+
+ ```bash
+ curl -N -X POST "https://your-space-name.hf.space/generate_stream" \
+   -H "Content-Type: application/json" \
+   -d '{"prompt":"Explain artificial intelligence"}'
+ ```
+
+ ## Backend Integration Flow
+
+ 1. Backend sends prompt to Hugging Face Space.
+ 2. Space generates and streams tokens.
+ 3. Backend relays streamed tokens to client in real time.
+
+ ## Hugging Face Space Setup
+
+ - Space SDK: **Docker**
+ - Ensure app starts with `uvicorn app:app --host 0.0.0.0 --port 7860`
+ - Expose port `7860`
+
+ ## Notes
+
+ - The first startup may take longer due to model download.
+ - Keep model loading in startup lifecycle so it is initialized once.
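The relay step in the Backend Integration Flow amounts to reading `data:` lines off the response until the `[DONE]` sentinel arrives. A minimal client-side sketch under those assumptions (the helper name `iter_sse_chunks` is illustrative and `BASE_URL` is a placeholder for your Space URL):

```python
import json
import urllib.request
from typing import Iterable, Iterator

BASE_URL = "https://your-space-name.hf.space"  # placeholder Space URL


def iter_sse_chunks(lines: Iterable[str]) -> Iterator[str]:
    """Yield payloads of 'data:' SSE lines, stopping at the [DONE] sentinel."""
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip the blank separator lines between events
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return
        yield payload


def stream_completion(prompt: str) -> Iterator[str]:
    """POST a prompt to /generate_stream and yield text chunks as they arrive."""
    req = urllib.request.Request(
        f"{BASE_URL}/generate_stream",
        data=json.dumps({"prompt": prompt}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        yield from iter_sse_chunks(line.decode().rstrip("\n") for line in resp)
```

A backend can forward each yielded chunk straight to its own client connection to keep streaming end to end.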
 
app.py CHANGED
@@ -1,100 +1,115 @@
- import os
- from typing import Optional
-
- import torch
- from fastapi import FastAPI, HTTPException
- from pydantic import BaseModel, Field
- from transformers import AutoModelForCausalLM, AutoTokenizer
-
-
- MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"
- DEFAULT_MAX_NEW_TOKENS = 96
- DEFAULT_TEMPERATURE = 0.7
- DEFAULT_TOP_P = 0.9
- MAX_ALLOWED_NEW_TOKENS = 256
- MAX_INPUT_TOKENS = 1024
-
- app = FastAPI(title="HF Space LLM Runner API", version="1.0.0")
-
-
- class GenerateRequest(BaseModel):
-     prompt: str = Field(..., description="Input prompt text")
-     max_tokens: int = Field(
-         default=DEFAULT_MAX_NEW_TOKENS,
-         ge=1,
-         le=MAX_ALLOWED_NEW_TOKENS,
-         description="Maximum new tokens",
-     )
-     temperature: float = Field(
-         default=DEFAULT_TEMPERATURE, ge=0.0, le=2.0, description="Sampling temperature"
-     )
-     top_p: float = Field(default=DEFAULT_TOP_P, ge=0.0, le=1.0, description="Top-p sampling")
-
-
- class GenerateResponse(BaseModel):
-     response: str
-
-
- tokenizer: Optional[AutoTokenizer] = None
- model: Optional[AutoModelForCausalLM] = None
-
-
- @app.on_event("startup")
- def load_model() -> None:
-     global tokenizer, model
-     if not torch.cuda.is_available():
-         torch.set_num_threads(min(4, os.cpu_count() or 1))
-     tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
-     model = AutoModelForCausalLM.from_pretrained(
-         MODEL_NAME,
-         torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
-         device_map="auto",
-     )
-     model.eval()
-
-
- @app.get("/health")
- def health() -> dict:
-     return {"status": "ok"}
-
-
- @app.post("/generate", response_model=GenerateResponse)
- def generate_text(request: GenerateRequest) -> GenerateResponse:
-     if tokenizer is None or model is None:
-         raise HTTPException(status_code=503, detail="Model not loaded yet")
-     if not request.prompt or not request.prompt.strip():
-         raise HTTPException(status_code=422, detail="prompt must not be empty")
-
-     messages = [{"role": "user", "content": request.prompt}]
-     prompt_text = tokenizer.apply_chat_template(
-         messages, tokenize=False, add_generation_prompt=True
-     )
-
-     inputs = tokenizer(
-         prompt_text,
-         return_tensors="pt",
-         truncation=True,
-         max_length=MAX_INPUT_TOKENS,
-     )
-     inputs = {k: v.to(model.device) for k, v in inputs.items()}
-
-     # Greedy output when temperature is 0, sampling otherwise.
-     do_sample = request.temperature > 0.0
-
-     with torch.no_grad():
-         output_ids = model.generate(
-             **inputs,
-             max_new_tokens=request.max_tokens,
-             temperature=request.temperature if do_sample else None,
-             top_p=request.top_p if do_sample else None,
-             do_sample=do_sample,
-             use_cache=True,
-             pad_token_id=tokenizer.eos_token_id,
-             eos_token_id=tokenizer.eos_token_id,
-         )
-
-     prompt_len = inputs["input_ids"].shape[1]
-     generated_ids = output_ids[0][prompt_len:]
-     generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True).strip()
-
-     return GenerateResponse(response=generated_text)
 
+ import asyncio
+ import threading
+ from contextlib import asynccontextmanager
+ from typing import AsyncGenerator
+
+ import torch
+ from fastapi import FastAPI, HTTPException
+ from fastapi.responses import StreamingResponse
+ from pydantic import BaseModel, Field, field_validator
+ from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
+
+ MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"
+
+ # Loaded once at startup.
+ tokenizer = None
+ model = None
+
+
+ class GenerateRequest(BaseModel):
+     prompt: str = Field(..., min_length=1, description="Input prompt text")
+     max_tokens: int = Field(default=512, ge=1, le=2048)
+     temperature: float = Field(default=0.7, ge=0.0, le=2.0)
+     top_p: float = Field(default=0.9, gt=0.0, le=1.0)
+
+     @field_validator("prompt")
+     @classmethod
+     def prompt_must_not_be_blank(cls, value: str) -> str:
+         if not value.strip():
+             raise ValueError("Prompt cannot be empty or whitespace")
+         return value
+
+
+ @asynccontextmanager
+ async def lifespan(_: FastAPI):
+     global tokenizer, model
+     tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
+     model = AutoModelForCausalLM.from_pretrained(
+         MODEL_ID,
+         torch_dtype="auto",
+         device_map="auto",
+         trust_remote_code=True,
+     )
+     model.eval()
+     yield
+
+
+ app = FastAPI(
+     title="Hugging Face Space Streaming LLM Inference API",
+     description="Streaming token generation API using Qwen2.5-0.5B-Instruct",
+     version="1.0.0",
+     lifespan=lifespan,
+ )
+
+
+ @app.get("/")
+ async def health() -> dict:
+     return {
+         "status": "ok",
+         "model": MODEL_ID,
+         "endpoints": ["POST /generate_stream"],
+     }
+
+
+ async def stream_generate(req: GenerateRequest) -> AsyncGenerator[str, None]:
+     if model is None or tokenizer is None:
+         raise HTTPException(status_code=503, detail="Model is still loading")
+
+     inputs = tokenizer(req.prompt, return_tensors="pt")
+     inputs = {k: v.to(model.device) for k, v in inputs.items()}
+
+     streamer = TextIteratorStreamer(
+         tokenizer,
+         skip_prompt=True,
+         skip_special_tokens=True,
+     )
+
+     generation_kwargs = {
+         **inputs,
+         "streamer": streamer,
+         "max_new_tokens": req.max_tokens,
+         "do_sample": req.temperature > 0,
+         "temperature": req.temperature if req.temperature > 0 else None,
+         "top_p": req.top_p,
+         "pad_token_id": tokenizer.eos_token_id,
+     }
+
+     def run_generation() -> None:
+         with torch.no_grad():
+             model.generate(**generation_kwargs)
+
+     thread = threading.Thread(target=run_generation, daemon=True)
+     thread.start()
+
+     for text in streamer:
+         # SSE format: each event line starts with "data:"
+         yield f"data: {text}\n\n"
+         await asyncio.sleep(0)
+
+     yield "data: [DONE]\n\n"
+
+
+ @app.post("/generate_stream")
+ async def generate_stream(req: GenerateRequest):
+     try:
+         return StreamingResponse(stream_generate(req), media_type="text/event-stream")
+     except HTTPException:
+         raise
+     except Exception as exc:  # pragma: no cover
+         raise HTTPException(status_code=500, detail=f"Generation error: {str(exc)}") from exc
+
+
+ if __name__ == "__main__":
+     import uvicorn
+
+     uvicorn.run("app:app", host="0.0.0.0", port=7860)
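One difference from the replaced `/generate` endpoint worth noting: the new streaming path tokenizes `req.prompt` directly, while the old code wrapped it in the tokenizer's chat template. For an instruction-tuned model like Qwen2.5-Instruct, re-adding that step usually improves output quality. A sketch, assuming the standard `apply_chat_template` API (the helper name `build_chat_messages` is hypothetical; the commented lines show where it would slot into `stream_generate`):

```python
def build_chat_messages(prompt: str) -> list:
    """Wrap a raw user prompt in the single-turn message format
    expected by tokenizer.apply_chat_template()."""
    return [{"role": "user", "content": prompt}]


# Inside stream_generate, instead of tokenizing req.prompt directly:
#
#     prompt_text = tokenizer.apply_chat_template(
#         build_chat_messages(req.prompt),
#         tokenize=False,
#         add_generation_prompt=True,
#     )
#     inputs = tokenizer(prompt_text, return_tensors="pt")
```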
requirements.txt CHANGED
@@ -1,6 +1,6 @@
- transformers>=4.45.0,<5.0.0
- accelerate>=0.33.0,<1.0.0
- torch>=2.3.0
- fastapi>=0.111.0,<1.0.0
- uvicorn>=0.30.0,<1.0.0
- pydantic>=2.7.0,<3.0.0

+ transformers
+ accelerate
+ torch
+ fastapi
+ uvicorn
+ pydantic