---
title: Streaming LLM API
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
---
# Hugging Face Space Streaming LLM Inference API
A lightweight Hugging Face Space API server for real-time token streaming with Qwen2.5-0.5B-Instruct.
## Features
- FastAPI server with SSE streaming endpoint
- One-time model/tokenizer loading during startup
- Configurable generation parameters (`max_tokens`, `temperature`, `top_p`)
- Efficient inference with `torch.no_grad()` and `device_map="auto"`
- Request validation and clear error responses
## Model
- Primary model: `Qwen/Qwen2.5-0.5B-Instruct`
- Automatically downloaded from Hugging Face at startup
## File Structure
- `app.py`
- `requirements.txt`
- `README.md`
- `Dockerfile`
## Requirements
```text
transformers
accelerate
torch
fastapi
uvicorn
pydantic
```
## Run Locally
```bash
pip install -r requirements.txt
uvicorn app:app --host 0.0.0.0 --port 7860
```
## API

### POST /generate_stream
Request JSON:
```json
{
  "prompt": "user prompt text",
  "max_tokens": 512,
  "temperature": 0.7,
  "top_p": 0.9
}
```
`prompt` is required and must not be empty. `max_tokens`, `temperature`, and `top_p` are optional.
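The validation rules above can be sketched as a plain Python function. The actual server uses a Pydantic model; this standalone version is only illustrative, and the default values mirror the example request:

```python
def validate_request(payload: dict) -> dict:
    """Validate a /generate_stream request body and apply defaults.

    Illustrative stand-in for the server's Pydantic model: prompt is
    required and non-empty; the other fields fall back to defaults.
    """
    prompt = payload.get("prompt")
    if not isinstance(prompt, str) or not prompt.strip():
        raise ValueError("prompt is required and must not be empty")
    return {
        "prompt": prompt,
        "max_tokens": int(payload.get("max_tokens", 512)),
        "temperature": float(payload.get("temperature", 0.7)),
        "top_p": float(payload.get("top_p", 0.9)),
    }
```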
Response:
- Content type: `text/event-stream`
- Streams generated text chunks incrementally as SSE events.
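A client can reassemble the streamed text by pulling the payloads out of the SSE framing. A minimal parser, assuming the standard `data: <chunk>` framing with blank-line event separators (the exact chunk format the Space emits may differ):

```python
def parse_sse(raw: str) -> list[str]:
    """Extract the data payloads from a raw SSE response body.

    Each event is a 'data: ...' line; blank lines separate events.
    """
    chunks = []
    for line in raw.splitlines():
        if line.startswith("data:"):
            chunks.append(line[len("data:"):].strip())
    return chunks
```

Joining the returned chunks yields the full generated text.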
## Example cURL
```bash
curl -N -X POST "https://your-space-name.hf.space/generate_stream" \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Explain artificial intelligence"}'
```
## Backend Integration Flow
1. Backend sends the prompt to the Hugging Face Space.
2. The Space generates and streams tokens.
3. Backend relays the streamed tokens to the client in real time.
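The relay step above typically re-frames each upstream chunk as an SSE event before forwarding it. A minimal sketch, where `upstream` stands in for the chunk iterator the backend's HTTP client exposes while reading from the Space (names are hypothetical):

```python
from typing import Iterable, Iterator

def relay_stream(upstream: Iterable[str]) -> Iterator[str]:
    """Re-frame upstream text chunks as SSE events for the client.

    Empty chunks (e.g. keep-alives) are dropped rather than forwarded.
    """
    for chunk in upstream:
        if chunk:
            yield f"data: {chunk}\n\n"
```

In a FastAPI backend this generator could feed a `StreamingResponse` so the client sees tokens as soon as the Space emits them.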
## Hugging Face Space Setup
- Space SDK: Docker
- Ensure the app starts with `uvicorn app:app --host 0.0.0.0 --port 7860`
- Expose port `7860`
## Notes
- The first startup may take longer due to model download.
- Keep model loading in the startup lifecycle so it is initialized only once.
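The one-time initialization pattern above can be sketched with a cached loader. The loading body here is a placeholder; in the real app it would run the transformers model/tokenizer initialization inside a FastAPI startup or lifespan hook:

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def get_model():
    """Load the model exactly once; later calls return the cached object.

    The body is a placeholder for the real AutoModelForCausalLM /
    AutoTokenizer initialization, which is expensive and should not
    be repeated per request.
    """
    print("loading model...")  # executed only on the first call
    return object()  # stands in for the (model, tokenizer) pair
```

Request handlers then call `get_model()` freely; only the first call pays the load cost.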