---
title: Streaming LLM API
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
---

Hugging Face Space Streaming LLM Inference API

A lightweight Hugging Face Space API server for real-time token streaming with Qwen2.5-0.5B-Instruct.

Features

  • FastAPI server with SSE streaming endpoint
  • One-time model/tokenizer loading during startup
  • Configurable generation parameters (max_tokens, temperature, top_p)
  • Efficient inference with torch.no_grad() and device_map="auto"
  • Request validation and clear error responses

Model

  • Primary model: Qwen/Qwen2.5-0.5B-Instruct
  • Automatically downloaded from Hugging Face at startup
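One-time loading with device_map="auto" (as listed under Features) might look like the sketch below; the function name load_model is illustrative, not taken from app.py:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"

def load_model():
    """Download (cached after the first run) and load tokenizer and model once."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype="auto",   # pick a dtype matching the checkpoint
        device_map="auto",    # place weights on GPU if available, else CPU
    )
    model.eval()              # inference mode; pair with torch.no_grad() when generating
    return tokenizer, model
```

Calling this once at startup (rather than per request) is what keeps request latency down after the first boot.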

File Structure

  • app.py
  • requirements.txt
  • README.md
  • Dockerfile

Requirements

transformers
accelerate
torch
fastapi
uvicorn
pydantic

Run Locally

pip install -r requirements.txt
uvicorn app:app --host 0.0.0.0 --port 7860

API

POST /generate_stream

Request JSON:

{
  "prompt": "user prompt text",
  "max_tokens": 512,
  "temperature": 0.7,
  "top_p": 0.9
}
  • prompt is required and must not be empty.
  • max_tokens, temperature, and top_p are optional.
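The schema and the non-empty-prompt rule might be expressed as a Pydantic model like the one below (a sketch assuming Pydantic v2; the default values mirror the example JSON above and are not necessarily the server's defaults):

```python
from pydantic import BaseModel, field_validator

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 512
    temperature: float = 0.7
    top_p: float = 0.9

    @field_validator("prompt")
    @classmethod
    def prompt_not_empty(cls, v: str) -> str:
        # Reject missing or whitespace-only prompts with a clear error.
        if not v.strip():
            raise ValueError("prompt must not be empty")
        return v
```

FastAPI turns a failed validation into a 422 response automatically, which covers the "clear error responses" feature.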

Response:

  • Content type: text/event-stream
  • Streams generated text chunks incrementally as SSE events.

Example cURL

curl -N -X POST "https://your-space-name.hf.space/generate_stream" \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Explain artificial intelligence"}'
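The same call can be made from Python. The sketch below uses the third-party requests library; the URL is the same placeholder as in the cURL example, and parse_sse_line is an illustrative helper, not part of the app:

```python
import requests  # third-party HTTP client (pip install requests)

SPACE_URL = "https://your-space-name.hf.space/generate_stream"  # placeholder

def parse_sse_line(line: str):
    """Return the payload of a 'data:' line, or None for any other SSE line."""
    return line[len("data: "):] if line.startswith("data: ") else None

def stream_generate(prompt: str, url: str = SPACE_URL):
    """POST the prompt and yield generated text chunks as they arrive."""
    with requests.post(url, json={"prompt": prompt}, stream=True,
                       timeout=300) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            chunk = parse_sse_line(line) if line else None
            if chunk is not None:
                yield chunk
```

stream=True is the requests equivalent of cURL's -N: it disables buffering so chunks surface as soon as the server sends them.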

Backend Integration Flow

  1. Backend sends prompt to Hugging Face Space.
  2. Space generates and streams tokens.
  3. Backend relays streamed tokens to client in real time.

Hugging Face Space Setup

  • Space SDK: Docker
  • Ensure the app starts with uvicorn app:app --host 0.0.0.0 --port 7860
  • Expose port 7860

Notes

  • The first startup may take longer because the model is downloaded on first run; later restarts reuse the cached weights.
  • Keep model loading in the startup lifecycle so the model is initialized only once, not on every request.