---
title: Streaming LLM API
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
---

Hugging Face Space Streaming LLM Inference API

A lightweight Hugging Face Space API server for real-time token streaming with Qwen2.5-0.5B-Instruct.

Features

  • FastAPI server with SSE streaming endpoint
  • One-time model/tokenizer loading during startup
  • Configurable generation parameters (max_tokens, temperature, top_p)
  • Efficient inference with torch.no_grad() and device_map="auto"
  • Request validation and clear error responses

Model

  • Primary model: Qwen/Qwen2.5-0.5B-Instruct
  • Automatically downloaded from Hugging Face at startup
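One-time loading with device_map="auto" (as listed under Features) might look like the sketch below; the function name load_model is illustrative, not taken from app.py:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"

def load_model():
    """Download (cached after the first run) and load tokenizer and model once."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype="auto",   # pick a dtype matching the checkpoint
        device_map="auto",    # place weights on GPU if available, else CPU
    )
    model.eval()              # inference mode; pair with torch.no_grad() when generating
    return tokenizer, model
```

Calling this once at startup (rather than per request) is what keeps request latency down after the first boot.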

File Structure

  • app.py
  • requirements.txt
  • README.md
  • Dockerfile

Requirements

transformers
accelerate
torch
fastapi
uvicorn
pydantic

Run Locally

pip install -r requirements.txt
uvicorn app:app --host 0.0.0.0 --port 7860

API

POST /generate_stream

Request JSON:

{
  "prompt": "user prompt text",
  "max_tokens": 512,
  "temperature": 0.7,
  "top_p": 0.9
}
  • prompt is required and must not be empty.
  • max_tokens, temperature, and top_p are optional.
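The schema and the non-empty-prompt rule might be expressed as a Pydantic model like the one below (a sketch assuming Pydantic v2; the default values mirror the example JSON above and are not necessarily the server's defaults):

```python
from pydantic import BaseModel, field_validator

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 512
    temperature: float = 0.7
    top_p: float = 0.9

    @field_validator("prompt")
    @classmethod
    def prompt_not_empty(cls, v: str) -> str:
        # Reject missing or whitespace-only prompts with a clear error.
        if not v.strip():
            raise ValueError("prompt must not be empty")
        return v
```

FastAPI turns a failed validation into a 422 response automatically, which covers the "clear error responses" feature.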

Response:

  • Content type: text/event-stream
  • Streams generated text chunks incrementally as SSE events.

Example cURL

curl -N -X POST "https://your-space-name.hf.space/generate_stream" \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Explain artificial intelligence"}'
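The same call can be made from Python. The sketch below uses the third-party requests library; the URL is the same placeholder as in the cURL example, and parse_sse_line is an illustrative helper, not part of the app:

```python
import requests  # third-party HTTP client (pip install requests)

SPACE_URL = "https://your-space-name.hf.space/generate_stream"  # placeholder

def parse_sse_line(line: str):
    """Return the payload of a 'data:' line, or None for any other SSE line."""
    return line[len("data: "):] if line.startswith("data: ") else None

def stream_generate(prompt: str, url: str = SPACE_URL):
    """POST the prompt and yield generated text chunks as they arrive."""
    with requests.post(url, json={"prompt": prompt}, stream=True,
                       timeout=300) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            chunk = parse_sse_line(line) if line else None
            if chunk is not None:
                yield chunk
```

stream=True is the requests equivalent of cURL's -N: it disables buffering so chunks surface as soon as the server sends them.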

Backend Integration Flow

  1. Backend sends prompt to Hugging Face Space.
  2. Space generates and streams tokens.
  3. Backend relays streamed tokens to client in real time.

Hugging Face Space Setup

  • Space SDK: Docker
  • Ensure the app starts with uvicorn app:app --host 0.0.0.0 --port 7860
  • Expose port 7860

Notes

  • The first startup may take longer because the model is downloaded on first run; later restarts reuse the cached weights.
  • Keep model loading in the startup lifecycle so the model is initialized only once, not on every request.