---
title: Streaming LLM API
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
---
# Hugging Face Space Streaming LLM Inference API
A lightweight Hugging Face Space API server for real-time token streaming with **Qwen2.5-0.5B-Instruct**.
## Features
- FastAPI server with SSE streaming endpoint
- One-time model/tokenizer loading during startup
- Configurable generation parameters (`max_tokens`, `temperature`, `top_p`)
- Efficient inference with `torch.no_grad()` and `device_map="auto"`
- Request validation and clear error responses
## Model
- **Primary model:** `Qwen/Qwen2.5-0.5B-Instruct`
- Automatically downloaded from Hugging Face at startup
## File Structure
- `app.py` — FastAPI server with the streaming endpoint
- `requirements.txt` — Python dependencies
- `README.md` — this file
- `Dockerfile` — container definition for the Space
## Requirements
```txt
transformers
accelerate
torch
fastapi
uvicorn
pydantic
```
## Run Locally
```bash
pip install -r requirements.txt
uvicorn app:app --host 0.0.0.0 --port 7860
```
## API
### `POST /generate_stream`
Request JSON:
```json
{
  "prompt": "user prompt text",
  "max_tokens": 512,
  "temperature": 0.7,
  "top_p": 0.9
}
```
- `prompt` is required and must not be empty.
- `max_tokens`, `temperature`, and `top_p` are optional.
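The validation rules above can be sketched with a plain dataclass (the server itself presumably uses Pydantic, which is in `requirements.txt`; the defaults shown here are assumptions taken from the example request, not confirmed server values):

```python
from dataclasses import dataclass


@dataclass
class GenerateRequest:
    # prompt is required and must not be empty
    prompt: str
    # optional parameters; these defaults are illustrative assumptions
    max_tokens: int = 512
    temperature: float = 0.7
    top_p: float = 0.9

    def __post_init__(self) -> None:
        if not self.prompt or not self.prompt.strip():
            raise ValueError("prompt must not be empty")
```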
Response:
- Content type: `text/event-stream`
- Streams generated text chunks incrementally as SSE events.
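Each chunk of generated text is framed as a Server-Sent Event before it is written to the response. A minimal framing helper (illustrative; the exact event shape the Space emits may differ):

```python
def sse_event(chunk: str) -> str:
    """Frame a text chunk as one SSE event.

    An SSE event is one or more `data:` lines terminated by a blank line;
    a payload containing newlines needs one `data:` line per payload line.
    """
    lines = chunk.split("\n")
    return "".join(f"data: {line}\n" for line in lines) + "\n"
```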
## Example cURL
```bash
curl -N -X POST "https://your-space-name.hf.space/generate_stream" \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Explain artificial intelligence"}'
```
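A client consumes the stream line by line. A sketch of the SSE parsing step, written against any iterable of decoded lines so it works with, e.g., the lines yielded by an HTTP client's streaming iterator (the wiring to a real request is up to you):

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the payload of each `data:` line from an SSE stream.

    Blank lines separate events; any other field lines are ignored here.
    """
    for line in lines:
        if line.startswith("data:"):
            # drop the field name and the optional single leading space
            yield line[5:].removeprefix(" ")
```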
## Backend Integration Flow
1. Backend sends prompt to Hugging Face Space.
2. Space generates and streams tokens.
3. Backend relays streamed tokens to client in real time.
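Step 3 can be sketched as a generator that re-frames each upstream event for the backend's own downstream connection without buffering the whole response (names and event shape are illustrative):

```python
from typing import Iterable, Iterator


def relay_sse(upstream_lines: Iterable[str]) -> Iterator[str]:
    """Relay SSE events from the Space to the backend's client.

    Parses each `data:` payload from the upstream stream and immediately
    re-emits it as an SSE event downstream, preserving real-time delivery.
    """
    for line in upstream_lines:
        if line.startswith("data:"):
            payload = line[5:].removeprefix(" ")
            yield f"data: {payload}\n\n"
```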
## Hugging Face Space Setup
- Space SDK: **Docker**
- Ensure the app starts with `uvicorn app:app --host 0.0.0.0 --port 7860`
- Expose port `7860`
## Notes
- The first startup may take longer because the model is downloaded on first run.
- Keep model loading in the startup lifecycle so the model is initialized only once, not per request.
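The load-once pattern from the last note, sketched with the standard library's `asynccontextmanager` (FastAPI's lifespan hook has the same shape; the model-loading call and `STATE` dict are placeholders for the real `from_pretrained(...)` call and `app.state`):

```python
from contextlib import asynccontextmanager

STATE = {}  # stands in for app.state


def load_model():
    # placeholder for AutoModelForCausalLM.from_pretrained(
    #     "Qwen/Qwen2.5-0.5B-Instruct", device_map="auto")
    return "model-object"


@asynccontextmanager
async def lifespan(app):
    # startup: runs exactly once, before the first request is served
    STATE["model"] = load_model()
    yield
    # shutdown: release resources
    STATE.clear()
```

With FastAPI the app would be created as `FastAPI(lifespan=lifespan)`, and request handlers read the already-loaded model from state instead of reloading it per request.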