---
title: Streaming LLM API
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
---

# Hugging Face Space Streaming LLM Inference API

A lightweight Hugging Face Space API server for real-time token streaming with **Qwen2.5-0.5B-Instruct**.

## Features

- FastAPI server with SSE streaming endpoint
- One-time model/tokenizer loading during startup
- Configurable generation parameters (`max_tokens`, `temperature`, `top_p`)
- Efficient inference with `torch.no_grad()` and `device_map="auto"`
- Request validation and clear error responses

## Model

- **Primary model:** `Qwen/Qwen2.5-0.5B-Instruct`
- Automatically downloaded from Hugging Face at startup

## File Structure

- `app.py`
- `requirements.txt`
- `README.md`
- `Dockerfile`

## Requirements

```txt
transformers
accelerate
torch
fastapi
uvicorn
pydantic
```

## Run Locally

```bash
pip install -r requirements.txt
uvicorn app:app --host 0.0.0.0 --port 7860
```

## API

### `POST /generate_stream`

Request JSON:

```json
{
  "prompt": "user prompt text",
  "max_tokens": 512,
  "temperature": 0.7,
  "top_p": 0.9
}
```

- `prompt` is required and must not be empty.
- `max_tokens`, `temperature`, and `top_p` are optional.

Response:

- Content type: `text/event-stream`
- Streams generated text chunks incrementally as SSE events.

## Example cURL

```bash
curl -N -X POST "https://your-space-name.hf.space/generate_stream" \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Explain artificial intelligence"}'
```

## Backend Integration Flow

1. Backend sends the prompt to the Hugging Face Space.
2. The Space generates and streams tokens.
3. Backend relays the streamed tokens to the client in real time.

## Hugging Face Space Setup

- Space SDK: **Docker**
- Ensure the app starts with `uvicorn app:app --host 0.0.0.0 --port 7860`
- Expose port `7860`

## Notes

- The first startup may take longer due to the model download.
- Keep model loading in the startup lifecycle so the model is initialized only once.
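## Example Dockerfile Sketch

Since the file structure lists a `Dockerfile` but does not show it, here is a minimal sketch consistent with the setup section (base image, cache path, and layer ordering are assumptions; Docker Spaces run the container as a non-root user, so model caches should point at a writable directory):

```dockerfile
# Hypothetical Dockerfile sketch for this Docker Space.
FROM python:3.11-slim

WORKDIR /app

# Keep Hugging Face caches in a directory writable by the non-root Space user.
ENV HF_HOME=/tmp/huggingface

# Install dependencies first so code changes don't invalidate this layer.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 7860
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
```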
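## Example `app.py` Sketch

A minimal sketch of how the pieces above could fit together in `app.py` — one-time model loading in the startup lifecycle, request validation with Pydantic, and SSE streaming via `TextIteratorStreamer`. This is illustrative, not the Space's actual implementation; the helper names and default values mirror the API section but are assumptions.

```python
# Hypothetical app.py sketch for the streaming endpoint described above.
from contextlib import asynccontextmanager
from threading import Thread

import torch
from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"
state = {}  # holds the model/tokenizer loaded once at startup


@asynccontextmanager
async def lifespan(app: FastAPI):
    # One-time load: downloads from Hugging Face on first start.
    state["tokenizer"] = AutoTokenizer.from_pretrained(MODEL_ID)
    state["model"] = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")
    yield
    state.clear()


app = FastAPI(lifespan=lifespan)


class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 512
    temperature: float = 0.7
    top_p: float = 0.9


@app.post("/generate_stream")
def generate_stream(req: GenerateRequest):
    if not req.prompt.strip():
        raise HTTPException(status_code=422, detail="prompt must not be empty")

    tokenizer, model = state["tokenizer"], state["model"]
    input_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": req.prompt}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)

    streamer = TextIteratorStreamer(
        tokenizer, skip_prompt=True, skip_special_tokens=True
    )
    gen_kwargs = dict(
        input_ids=input_ids,
        max_new_tokens=req.max_tokens,
        temperature=req.temperature,
        top_p=req.top_p,
        do_sample=True,
        streamer=streamer,
    )

    def run():
        # generate() blocks, so run it in a thread and read from the streamer.
        with torch.no_grad():
            model.generate(**gen_kwargs)

    Thread(target=run, daemon=True).start()

    def sse():
        for chunk in streamer:
            yield f"data: {chunk}\n\n"

    return StreamingResponse(sse(), media_type="text/event-stream")
```

Running `generate()` in a background thread lets the SSE generator yield each chunk as the streamer produces it, instead of waiting for the full completion.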
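## Example Backend Client

For the backend integration flow above, the relay side mostly reduces to parsing `data:` lines out of the event stream. A small, hedged sketch (`sse_data` is an illustrative helper, not part of any library):

```python
def sse_data(lines):
    """Yield the payload of each SSE "data:" line, skipping blank keep-alives."""
    for line in lines:
        if line and line.startswith("data:"):
            yield line[len("data:"):].lstrip()


# Hypothetical usage against a live Space (URL is a placeholder):
#
# import requests
# resp = requests.post(
#     "https://your-space-name.hf.space/generate_stream",
#     json={"prompt": "Explain artificial intelligence"},
#     stream=True,
# )
# for chunk in sse_data(resp.iter_lines(decode_unicode=True)):
#     print(chunk, end="", flush=True)
```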