---
title: Streaming LLM API
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
---
# Hugging Face Space Streaming LLM Inference API
A lightweight Hugging Face Space API server for real-time token streaming with **Qwen2.5-0.5B-Instruct**.
## Features
- FastAPI server with SSE streaming endpoint
- One-time model/tokenizer loading during startup
- Configurable generation parameters (`max_tokens`, `temperature`, `top_p`)
- Efficient inference with `torch.no_grad()` and `device_map="auto"`
- Request validation and clear error responses
## Model
- **Primary model:** `Qwen/Qwen2.5-0.5B-Instruct`
- Automatically downloaded from Hugging Face at startup
## File Structure
- `app.py` — FastAPI server with the streaming endpoint
- `requirements.txt` — Python dependencies
- `README.md` — this file
- `Dockerfile` — container definition for the Space
## Requirements
```txt
transformers
accelerate
torch
fastapi
uvicorn
pydantic
```
## Run Locally
```bash
pip install -r requirements.txt
uvicorn app:app --host 0.0.0.0 --port 7860
```
## API
### `POST /generate_stream`
Request JSON:
```json
{
  "prompt": "user prompt text",
  "max_tokens": 512,
  "temperature": 0.7,
  "top_p": 0.9
}
```
- `prompt` is required and must not be empty.
- `max_tokens`, `temperature`, and `top_p` are optional.
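The validation rules above can be sketched with a plain dataclass (the server itself presumably uses Pydantic, which is in `requirements.txt`; the defaults shown here are assumptions taken from the example request, not confirmed server values):

```python
from dataclasses import dataclass


@dataclass
class GenerateRequest:
    # prompt is required and must not be empty
    prompt: str
    # optional parameters; these defaults are illustrative assumptions
    max_tokens: int = 512
    temperature: float = 0.7
    top_p: float = 0.9

    def __post_init__(self) -> None:
        if not self.prompt or not self.prompt.strip():
            raise ValueError("prompt must not be empty")
```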
Response:
- Content type: `text/event-stream`
- Streams generated text chunks incrementally as SSE events.
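Each chunk of generated text is framed as a Server-Sent Event before it is written to the response. A minimal framing helper (illustrative; the exact event shape the Space emits may differ):

```python
def sse_event(chunk: str) -> str:
    """Frame a text chunk as one SSE event.

    An SSE event is one or more `data:` lines terminated by a blank line;
    a payload containing newlines needs one `data:` line per payload line.
    """
    lines = chunk.split("\n")
    return "".join(f"data: {line}\n" for line in lines) + "\n"
```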
## Example cURL
```bash
curl -N -X POST "https://your-space-name.hf.space/generate_stream" \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Explain artificial intelligence"}'
```
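A client consumes the stream line by line. A sketch of the SSE parsing step, written against any iterable of decoded lines so it works with, e.g., the lines yielded by an HTTP client's streaming iterator (the wiring to a real request is up to you):

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the payload of each `data:` line from an SSE stream.

    Blank lines separate events; any other field lines are ignored here.
    """
    for line in lines:
        if line.startswith("data:"):
            # drop the field name and the optional single leading space
            yield line[5:].removeprefix(" ")
```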
## Backend Integration Flow
1. Backend sends prompt to Hugging Face Space.
2. Space generates and streams tokens.
3. Backend relays streamed tokens to client in real time.
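Step 3 can be sketched as a generator that re-frames each upstream event for the backend's own downstream connection without buffering the whole response (names and event shape are illustrative):

```python
from typing import Iterable, Iterator


def relay_sse(upstream_lines: Iterable[str]) -> Iterator[str]:
    """Relay SSE events from the Space to the backend's client.

    Parses each `data:` payload from the upstream stream and immediately
    re-emits it as an SSE event downstream, preserving real-time delivery.
    """
    for line in upstream_lines:
        if line.startswith("data:"):
            payload = line[5:].removeprefix(" ")
            yield f"data: {payload}\n\n"
```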
## Hugging Face Space Setup
- Space SDK: **Docker**
- Ensure the app starts with `uvicorn app:app --host 0.0.0.0 --port 7860`
- Expose port `7860`
## Notes
- The first startup may take longer because the model is downloaded on first run.
- Keep model loading in the startup lifecycle so the model is initialized only once, not per request.
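The load-once pattern from the last note, sketched with the standard library's `asynccontextmanager` (FastAPI's lifespan hook has the same shape; the model-loading call and `STATE` dict are placeholders for the real `from_pretrained(...)` call and `app.state`):

```python
from contextlib import asynccontextmanager

STATE = {}  # stands in for app.state


def load_model():
    # placeholder for AutoModelForCausalLM.from_pretrained(
    #     "Qwen/Qwen2.5-0.5B-Instruct", device_map="auto")
    return "model-object"


@asynccontextmanager
async def lifespan(app):
    # startup: runs exactly once, before the first request is served
    STATE["model"] = load_model()
    yield
    # shutdown: release resources
    STATE.clear()
```

With FastAPI the app would be created as `FastAPI(lifespan=lifespan)`, and request handlers read the already-loaded model from state instead of reloading it per request.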