Valtry committed · Commit cf97964 · verified · 1 Parent(s): 42db743

Upload 4 files

Files changed (4)
  1. Dockerfile +12 -16
  2. README.md +97 -135
  3. app.py +115 -100
  4. requirements.txt +6 -6
Dockerfile CHANGED
@@ -1,16 +1,12 @@
- FROM python:3.10-slim
-
- ENV PYTHONDONTWRITEBYTECODE=1
- ENV PYTHONUNBUFFERED=1
- ENV PIP_NO_CACHE_DIR=1
-
- WORKDIR /app
-
- COPY requirements.txt /app/requirements.txt
- RUN pip install --upgrade pip && pip install -r /app/requirements.txt
-
- COPY app.py /app/app.py
-
- EXPOSE 7860
-
- CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]

+ FROM python:3.11-slim
+
+ WORKDIR /app
+
+ COPY requirements.txt ./
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ COPY . .
+
+ EXPOSE 7860
+
+ CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
 
 
 
 
README.md CHANGED
@@ -1,135 +1,97 @@
- ---
- title: Hugging Face Space LLM Runner API
- emoji: "🤖"
- colorFrom: blue
- colorTo: indigo
- sdk: docker
- app_file: app.py
- pinned: false
- ---
-
- # Hugging Face Space LLM Model Runner API
-
- This Space exposes a lightweight instruction-tuned LLM through a REST API.
-
- ## Purpose
-
- The Space is only responsible for:
- - Loading the model once at startup
- - Accepting prompts from external services
- - Generating text
- - Returning JSON responses
-
- Chatbot memory, tool routing, and conversation logic should be handled by your backend service.
-
- ## Stack
-
- - Python
- - FastAPI
- - Transformers
- - Accelerate
- - Torch
-
- ## Model
-
- Default model:
- - `Qwen/Qwen2.5-0.5B-Instruct`
-
- You can replace `MODEL_NAME` in `app.py` with:
- - `Qwen/Qwen2.5-1.5B-Instruct`
- - `Qwen/Qwen2.5-3B-Instruct`
- - `microsoft/Phi-3-mini-4k-instruct`
-
- ## Files
-
- - `app.py`
- - `requirements.txt`
- - `Dockerfile`
- - `README.md`
-
- ## API
-
- ### Health check
-
- `GET /health`
-
- Response:
-
- ```json
- {
-   "status": "ok"
- }
- ```
-
- ### Generate text
-
- `POST /generate`
-
- Request body:
-
- ```json
- {
-   "prompt": "Explain artificial intelligence in simple terms",
-   "max_tokens": 96,
-   "temperature": 0.7,
-   "top_p": 0.9
- }
- ```
-
- Notes:
- - `prompt` is required and must not be empty.
- - `max_tokens` default: `96` (max allowed: `256`)
- - `temperature` default: `0.7`
- - `top_p` default: `0.9`
-
- Response:
-
- ```json
- {
-   "response": "Artificial intelligence is..."
- }
- ```
-
- ## Run locally (without Docker)
-
- ```bash
- uvicorn app:app --host 0.0.0.0 --port 7860
- ```
-
- ## Run locally (with Docker)
-
- Build image:
-
- ```bash
- docker build -t hf-llm-runner .
- ```
-
- Run container:
-
- ```bash
- docker run --rm -p 7860:7860 hf-llm-runner
- ```
-
- ## Create and deploy on Hugging Face Spaces (Docker)
-
- 1. Go to Hugging Face -> Spaces -> **Create new Space**.
- 2. Set:
-    - Owner: your account/org
-    - Space name: your choice
-    - License: your choice
-    - SDK: **Docker**
-    - Visibility: Public (or Private if your plan supports it)
- 3. Create the Space.
- 4. Upload/push these files into the Space repo root:
-    - `app.py`
-    - `requirements.txt`
-    - `Dockerfile`
-    - `README.md`
- 5. Wait for build to finish. First startup may be slow because model weights download.
- 6. Test:
-    - `GET https://<your-space-name>.hf.space/health`
-    - `POST https://<your-space-name>.hf.space/generate`
-
- Use your Space URL from backend:
-
- `https://<your-space-name>.hf.space/generate`

+ ---
+ title: Streaming LLM API
+ colorFrom: blue
+ colorTo: green
+ sdk: docker
+ app_port: 7860
+ ---
+
+ # Hugging Face Space Streaming LLM Inference API
+
+ A lightweight Hugging Face Space API server for real-time token streaming with **Qwen2.5-0.5B-Instruct**.
+
+ ## Features
+
+ - FastAPI server with SSE streaming endpoint
+ - One-time model/tokenizer loading during startup
+ - Configurable generation parameters (`max_tokens`, `temperature`, `top_p`)
+ - Efficient inference with `torch.no_grad()` and `device_map="auto"`
+ - Request validation and clear error responses
+
+ ## Model
+
+ - **Primary model:** `Qwen/Qwen2.5-0.5B-Instruct`
+ - Automatically downloaded from Hugging Face at startup
+
+ ## File Structure
+
+ - `app.py`
+ - `requirements.txt`
+ - `README.md`
+ - `Dockerfile`
+
+ ## Requirements
+
+ ```txt
+ transformers
+ accelerate
+ torch
+ fastapi
+ uvicorn
+ pydantic
+ ```
+
+ ## Run Locally
+
+ ```bash
+ pip install -r requirements.txt
+ uvicorn app:app --host 0.0.0.0 --port 7860
+ ```
+
+ ## API
+
+ ### `POST /generate_stream`
+
+ Request JSON:
+
+ ```json
+ {
+   "prompt": "user prompt text",
+   "max_tokens": 512,
+   "temperature": 0.7,
+   "top_p": 0.9
+ }
+ ```
+
+ - `prompt` is required and must not be empty.
+ - `max_tokens`, `temperature`, and `top_p` are optional.
+
+ Response:
+
+ - Content type: `text/event-stream`
+ - Streams generated text chunks incrementally as SSE events.
+
+ ## Example cURL
+
+ ```bash
+ curl -N -X POST "https://your-space-name.hf.space/generate_stream" \
+   -H "Content-Type: application/json" \
+   -d '{"prompt":"Explain artificial intelligence"}'
+ ```
+
+ ## Backend Integration Flow
+
+ 1. Backend sends prompt to Hugging Face Space.
+ 2. Space generates and streams tokens.
+ 3. Backend relays streamed tokens to client in real time.
+
+ ## Hugging Face Space Setup
+
+ - Space SDK: **Docker**
+ - Ensure app starts with `uvicorn app:app --host 0.0.0.0 --port 7860`
+ - Expose port `7860`
+
+ ## Notes
+
+ - The first startup may take longer due to model download.
+ - Keep model loading in startup lifecycle so it is initialized once.
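The relay step in the Backend Integration Flow amounts to reading `data:` lines off the response until the `[DONE]` sentinel arrives. A minimal client-side sketch under those assumptions (the helper name `iter_sse_chunks` is illustrative and `BASE_URL` is a placeholder for your Space URL):

```python
import json
import urllib.request
from typing import Iterable, Iterator

BASE_URL = "https://your-space-name.hf.space"  # placeholder Space URL


def iter_sse_chunks(lines: Iterable[str]) -> Iterator[str]:
    """Yield payloads of 'data:' SSE lines, stopping at the [DONE] sentinel."""
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip the blank separator lines between events
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return
        yield payload


def stream_completion(prompt: str) -> Iterator[str]:
    """POST a prompt to /generate_stream and yield text chunks as they arrive."""
    req = urllib.request.Request(
        f"{BASE_URL}/generate_stream",
        data=json.dumps({"prompt": prompt}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        yield from iter_sse_chunks(line.decode().rstrip("\n") for line in resp)
```

A backend can forward each yielded chunk straight to its own client connection to keep streaming end to end.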
 
app.py CHANGED
@@ -1,100 +1,115 @@
- import os
- from typing import Optional
-
- import torch
- from fastapi import FastAPI, HTTPException
- from pydantic import BaseModel, Field
- from transformers import AutoModelForCausalLM, AutoTokenizer
-
-
- MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"
- DEFAULT_MAX_NEW_TOKENS = 96
- DEFAULT_TEMPERATURE = 0.7
- DEFAULT_TOP_P = 0.9
- MAX_ALLOWED_NEW_TOKENS = 256
- MAX_INPUT_TOKENS = 1024
-
- app = FastAPI(title="HF Space LLM Runner API", version="1.0.0")
-
-
- class GenerateRequest(BaseModel):
-     prompt: str = Field(..., description="Input prompt text")
-     max_tokens: int = Field(
-         default=DEFAULT_MAX_NEW_TOKENS,
-         ge=1,
-         le=MAX_ALLOWED_NEW_TOKENS,
-         description="Maximum new tokens",
-     )
-     temperature: float = Field(
-         default=DEFAULT_TEMPERATURE, ge=0.0, le=2.0, description="Sampling temperature"
-     )
-     top_p: float = Field(default=DEFAULT_TOP_P, ge=0.0, le=1.0, description="Top-p sampling")
-
-
- class GenerateResponse(BaseModel):
-     response: str
-
-
- tokenizer: Optional[AutoTokenizer] = None
- model: Optional[AutoModelForCausalLM] = None
-
-
- @app.on_event("startup")
- def load_model() -> None:
-     global tokenizer, model
-     if not torch.cuda.is_available():
-         torch.set_num_threads(min(4, os.cpu_count() or 1))
-     tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
-     model = AutoModelForCausalLM.from_pretrained(
-         MODEL_NAME,
-         torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
-         device_map="auto",
-     )
-     model.eval()
-
-
- @app.get("/health")
- def health() -> dict:
-     return {"status": "ok"}
-
-
- @app.post("/generate", response_model=GenerateResponse)
- def generate_text(request: GenerateRequest) -> GenerateResponse:
-     if tokenizer is None or model is None:
-         raise HTTPException(status_code=503, detail="Model not loaded yet")
-     if not request.prompt or not request.prompt.strip():
-         raise HTTPException(status_code=422, detail="prompt must not be empty")
-
-     messages = [{"role": "user", "content": request.prompt}]
-     prompt_text = tokenizer.apply_chat_template(
-         messages, tokenize=False, add_generation_prompt=True
-     )
-
-     inputs = tokenizer(
-         prompt_text,
-         return_tensors="pt",
-         truncation=True,
-         max_length=MAX_INPUT_TOKENS,
-     )
-     inputs = {k: v.to(model.device) for k, v in inputs.items()}
-
-     # Greedy output when temperature is 0, sampling otherwise.
-     do_sample = request.temperature > 0.0
-
-     with torch.no_grad():
-         output_ids = model.generate(
-             **inputs,
-             max_new_tokens=request.max_tokens,
-             temperature=request.temperature if do_sample else None,
-             top_p=request.top_p if do_sample else None,
-             do_sample=do_sample,
-             use_cache=True,
-             pad_token_id=tokenizer.eos_token_id,
-             eos_token_id=tokenizer.eos_token_id,
-         )
-
-     prompt_len = inputs["input_ids"].shape[1]
-     generated_ids = output_ids[0][prompt_len:]
-     generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True).strip()
-
-     return GenerateResponse(response=generated_text)
 
+ import asyncio
+ import threading
+ from contextlib import asynccontextmanager
+ from typing import AsyncGenerator
+
+ import torch
+ from fastapi import FastAPI, HTTPException
+ from fastapi.responses import StreamingResponse
+ from pydantic import BaseModel, Field, field_validator
+ from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
+
+ MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"
+
+ # Loaded once at startup.
+ tokenizer = None
+ model = None
+
+
+ class GenerateRequest(BaseModel):
+     prompt: str = Field(..., min_length=1, description="Input prompt text")
+     max_tokens: int = Field(default=512, ge=1, le=2048)
+     temperature: float = Field(default=0.7, ge=0.0, le=2.0)
+     top_p: float = Field(default=0.9, gt=0.0, le=1.0)
+
+     @field_validator("prompt")
+     @classmethod
+     def prompt_must_not_be_blank(cls, value: str) -> str:
+         if not value.strip():
+             raise ValueError("Prompt cannot be empty or whitespace")
+         return value
+
+
+ @asynccontextmanager
+ async def lifespan(_: FastAPI):
+     global tokenizer, model
+     tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
+     model = AutoModelForCausalLM.from_pretrained(
+         MODEL_ID,
+         torch_dtype="auto",
+         device_map="auto",
+         trust_remote_code=True,
+     )
+     model.eval()
+     yield
+
+
+ app = FastAPI(
+     title="Hugging Face Space Streaming LLM Inference API",
+     description="Streaming token generation API using Qwen2.5-0.5B-Instruct",
+     version="1.0.0",
+     lifespan=lifespan,
+ )
+
+
+ @app.get("/")
+ async def health() -> dict:
+     return {
+         "status": "ok",
+         "model": MODEL_ID,
+         "endpoints": ["POST /generate_stream"],
+     }
+
+
+ async def stream_generate(req: GenerateRequest) -> AsyncGenerator[str, None]:
+     if model is None or tokenizer is None:
+         raise HTTPException(status_code=503, detail="Model is still loading")
+
+     inputs = tokenizer(req.prompt, return_tensors="pt")
+     inputs = {k: v.to(model.device) for k, v in inputs.items()}
+
+     streamer = TextIteratorStreamer(
+         tokenizer,
+         skip_prompt=True,
+         skip_special_tokens=True,
+     )
+
+     generation_kwargs = {
+         **inputs,
+         "streamer": streamer,
+         "max_new_tokens": req.max_tokens,
+         "do_sample": req.temperature > 0,
+         "temperature": req.temperature if req.temperature > 0 else None,
+         "top_p": req.top_p,
+         "pad_token_id": tokenizer.eos_token_id,
+     }
+
+     def run_generation() -> None:
+         with torch.no_grad():
+             model.generate(**generation_kwargs)
+
+     thread = threading.Thread(target=run_generation, daemon=True)
+     thread.start()
+
+     for text in streamer:
+         # SSE format: each event line starts with "data:"
+         yield f"data: {text}\n\n"
+         await asyncio.sleep(0)
+
+     yield "data: [DONE]\n\n"
+
+
+ @app.post("/generate_stream")
+ async def generate_stream(req: GenerateRequest):
+     try:
+         return StreamingResponse(stream_generate(req), media_type="text/event-stream")
+     except HTTPException:
+         raise
+     except Exception as exc:  # pragma: no cover
+         raise HTTPException(status_code=500, detail=f"Generation error: {str(exc)}") from exc
+
+
+ if __name__ == "__main__":
+     import uvicorn
+
+     uvicorn.run("app:app", host="0.0.0.0", port=7860)
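One difference from the replaced `/generate` endpoint worth noting: the new streaming path tokenizes `req.prompt` directly, while the old code wrapped it in the tokenizer's chat template. For an instruction-tuned model like Qwen2.5-Instruct, re-adding that step usually improves output quality. A sketch, assuming the standard `apply_chat_template` API (the helper name `build_chat_messages` is hypothetical; the commented lines show where it would slot into `stream_generate`):

```python
def build_chat_messages(prompt: str) -> list:
    """Wrap a raw user prompt in the single-turn message format
    expected by tokenizer.apply_chat_template()."""
    return [{"role": "user", "content": prompt}]


# Inside stream_generate, instead of tokenizing req.prompt directly:
#
#     prompt_text = tokenizer.apply_chat_template(
#         build_chat_messages(req.prompt),
#         tokenize=False,
#         add_generation_prompt=True,
#     )
#     inputs = tokenizer(prompt_text, return_tensors="pt")
```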
requirements.txt CHANGED
@@ -1,6 +1,6 @@
- transformers>=4.45.0,<5.0.0
- accelerate>=0.33.0,<1.0.0
- torch>=2.3.0
- fastapi>=0.111.0,<1.0.0
- uvicorn>=0.30.0,<1.0.0
- pydantic>=2.7.0,<3.0.0

+ transformers
+ accelerate
+ torch
+ fastapi
+ uvicorn
+ pydantic