---
title: Streaming LLM API
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
---
# Hugging Face Space Streaming LLM Inference API

A lightweight Hugging Face Space API server for real-time token streaming with **Qwen2.5-0.5B-Instruct**.
## Features

- FastAPI server with an SSE streaming endpoint
- One-time model/tokenizer loading during startup
- Configurable generation parameters (`max_tokens`, `temperature`, `top_p`)
- Efficient inference with `torch.no_grad()` and `device_map="auto"`
- Request validation and clear error responses
## Model

- **Primary model:** `Qwen/Qwen2.5-0.5B-Instruct`
- Automatically downloaded from Hugging Face at startup
## File Structure

- `app.py`
- `requirements.txt`
- `README.md`
- `Dockerfile`
## Requirements

```txt
transformers
accelerate
torch
fastapi
uvicorn
pydantic
```
## Run Locally

```bash
pip install -r requirements.txt
uvicorn app:app --host 0.0.0.0 --port 7860
```
## API

### `POST /generate_stream`

Request JSON:

```json
{
  "prompt": "user prompt text",
  "max_tokens": 512,
  "temperature": 0.7,
  "top_p": 0.9
}
```

- `prompt` is required and must not be empty.
- `max_tokens`, `temperature`, and `top_p` are optional.

Response:

- Content type: `text/event-stream`
- Streams generated text chunks incrementally as SSE events.
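A Python client can consume the stream with `requests` by reading SSE lines as they arrive. A minimal sketch (the Space URL is a placeholder, and `parse_sse_data` is an illustrative helper, not part of the API):

```python
import requests


def parse_sse_data(line: str):
    """Return the payload of a `data:` SSE line, or None for other lines."""
    if line.startswith("data: "):
        return line[len("data: "):]
    return None


if __name__ == "__main__":
    url = "https://your-space-name.hf.space/generate_stream"
    payload = {"prompt": "Explain artificial intelligence", "max_tokens": 128}

    # stream=True keeps the connection open and yields lines incrementally.
    with requests.post(url, json=payload, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            chunk = parse_sse_data(line) if line else None
            if chunk is not None:
                print(chunk, end="", flush=True)
```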
## Example cURL

```bash
curl -N -X POST "https://your-space-name.hf.space/generate_stream" \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Explain artificial intelligence"}'
```
## Backend Integration Flow

1. Backend sends the prompt to the Hugging Face Space.
2. Space generates and streams tokens.
3. Backend relays the streamed tokens to the client in real time.
## Hugging Face Space Setup

- Space SDK: **Docker**
- Ensure the app starts with `uvicorn app:app --host 0.0.0.0 --port 7860`
- Expose port `7860`
## Notes

- The first startup may take longer due to the model download.
- Keep model loading in the startup lifecycle so it is initialized once.