---
title: React
emoji: 🌍
colorFrom: yellow
colorTo: gray
sdk: gradio
sdk_version: 6.9.0
app_file: app.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# HF Space Backend (Streaming LLM Server)
This folder contains Hugging Face Space backends for two model deployments that share the same production runtime.
## Files

- `app.py`: Nanbeige deployment entrypoint (`Nanbeige/Nanbeige4.1-3B`)
- `main.py`: LiquidAI deployment entrypoint (`LiquidAI/LFM2.5-1.2B-Thinking`)
- `server_runtime.py`: shared queue + worker + streaming runtime used by both entrypoints
- `index.html`: lightweight local streaming test UI
- `requirements.txt`: runtime dependencies
## Runtime Architecture

Both servers use the same execution flow:

```
Client Request
  -> FastAPI /chat
  -> asyncio.Queue request buffer
  -> worker pool (asyncio tasks)
  -> concurrency gate (asyncio.Semaphore)
  -> one generation thread per request (model.generate)
  -> per-request TextIteratorStreamer
  -> SSE token stream to client
```
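The flow above can be sketched as a minimal asyncio pattern. This is illustrative only: names like `generate` and `worker`, and the small constants, are stand-ins rather than the actual `server_runtime.py` API.

```python
import asyncio

QUEUE_MAX_SIZE = 8   # stands in for HF_QUEUE_MAX_SIZE
MAX_WORKERS = 2      # stands in for the hardware-aware worker count

async def generate(prompt: str) -> str:
    # The real runtime spawns a generation thread running model.generate
    # and streams tokens back; here we just simulate the work.
    await asyncio.sleep(0.01)
    return f"reply:{prompt}"

async def worker(queue: asyncio.Queue, gate: asyncio.Semaphore, results: list) -> None:
    while True:
        prompt = await queue.get()
        async with gate:  # concurrency gate limits parallel generations
            results.append(await generate(prompt))
        queue.task_done()

async def main() -> list:
    queue: asyncio.Queue = asyncio.Queue(maxsize=QUEUE_MAX_SIZE)  # request buffer
    gate = asyncio.Semaphore(MAX_WORKERS)
    results: list = []
    workers = [asyncio.create_task(worker(queue, gate, results))
               for _ in range(MAX_WORKERS)]
    for i in range(4):            # a burst of client requests
        await queue.put(f"prompt-{i}")
    await queue.join()            # wait until the buffer drains
    for w in workers:
        w.cancel()
    return results

results = asyncio.run(main())
```

Because the queue is bounded, a burst beyond `QUEUE_MAX_SIZE` waits (or can be rejected) instead of spawning unbounded work, which is the "queues overflow load" behavior described below.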
### Why this structure

- Keeps the event loop responsive.
- Prevents response mixing across users (isolated request objects).
- Supports controlled concurrency under CPU/GPU.
- Queues overflow load instead of hard failing during bursts.
## Concurrency

Hardware-aware worker count:

- CPU: `1..4` workers (core-based)
- GPU: `3..5` workers (VRAM tier-based)

Override at runtime with `HF_MAX_WORKERS`.

Queue settings: `HF_QUEUE_MAX_SIZE` (default: `512`)
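A sketch of how such a heuristic can be resolved. The `1..4` / `3..5` ranges and the override come from this document; the specific core and VRAM tier cutoffs below are illustrative assumptions, not the values in `server_runtime.py`.

```python
import os
from typing import Optional

def resolve_max_workers(cpu_cores: int, gpu_vram_gb: Optional[float]) -> int:
    # HF_MAX_WORKERS overrides the hardware-based heuristic entirely
    override = os.environ.get("HF_MAX_WORKERS")
    if override:
        return int(override)
    if gpu_vram_gb is not None:
        # GPU: VRAM tier-based, 3..5 workers (tier cutoffs are illustrative)
        if gpu_vram_gb >= 40:
            return 5
        if gpu_vram_gb >= 16:
            return 4
        return 3
    # CPU: core-based, clamped to 1..4 workers
    return max(1, min(4, cpu_cores // 2))
```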
## Thread Lifecycle and Safety

- Each request gets its own generation thread.
- Each request has a cancellation event.
- `CancelAwareStoppingCriteria` stops generation when the client disconnects or cancels.
- The streamer is explicitly ended in a `finally` block.
- The generation thread is joined with a long timeout (`HF_GENERATION_JOIN_TIMEOUT_SECONDS`, default `180`) to avoid orphaned work.

This fixes the old short-join behavior that frequently produced:

```
Generation thread did not finish within timeout
```
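The cancellation hook follows the standard transformers stopping-criteria pattern: `model.generate` calls the criteria after each decoding step, and returning `True` stops generation. A dependency-free sketch of the idea (the real `CancelAwareStoppingCriteria` would subclass `transformers.StoppingCriteria`; this mock only shows the shape):

```python
import threading

class CancelAwareStoppingCriteria:
    """Sketch: in the real runtime this subclasses transformers.StoppingCriteria,
    and generate() invokes it with (input_ids, scores) after every step."""

    def __init__(self, cancel_event: threading.Event):
        self.cancel_event = cancel_event

    def __call__(self, input_ids=None, scores=None, **kwargs) -> bool:
        # Returning True tells generate() to stop producing tokens.
        return self.cancel_event.is_set()

# One cancellation event per request; the request handler sets it when
# the client disconnects or cancels.
cancel = threading.Event()
criteria = CancelAwareStoppingCriteria(cancel)
before = criteria()   # client still connected -> keep generating
cancel.set()          # client disconnected / cancelled
after = criteria()    # generation should now stop
```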
## Metrics and Logging

Per-request logs include:
- request queued
- worker start/end
- first token latency
- generated token count
- tokens/sec
- active workers
- queue size
Debug token-by-token logging is optional; enable it with `HF_DEBUG_TOKEN_LOGS=1`.
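Two of the metrics above, first-token latency and tokens/sec, can be computed by wrapping any token iterator. This is a hypothetical helper for illustration, not the actual `server_runtime.py` API:

```python
import time

def stream_with_metrics(token_iter):
    """Wrap a token iterator and report first-token latency and tokens/sec."""
    start = time.monotonic()
    first_token_latency = None
    count = 0
    for token in token_iter:
        if first_token_latency is None:
            # Time from request start until the first streamed token
            first_token_latency = time.monotonic() - start
        count += 1
        yield token
    elapsed = time.monotonic() - start
    tokens_per_sec = count / elapsed if elapsed > 0 else 0.0
    print(f"tokens={count} first_token_s={first_token_latency:.3f} tok/s={tokens_per_sec:.1f}")

tokens = list(stream_with_metrics(iter(["Hel", "lo", "!"])))
```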
## API

### POST /chat

Body:

- `messages`: chat messages
- `stream`: `true` for SSE streaming
- `max_tokens`: max new tokens requested
- `temperature`: optional; if omitted, the model default is used
- `tools`: optional tool schemas for the chat template
Streaming response format:

- SSE `data: {"type":"token","content":"..."}` chunks
- a final `{"type":"done","content":""}` event
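A client can assemble the response by parsing those SSE lines. A minimal parser, assuming only the chunk shapes documented above:

```python
import json

def parse_sse(lines):
    """Collect (type, content) events from /chat SSE lines until 'done'."""
    events = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = json.loads(line[len("data:"):])
        events.append((payload["type"], payload["content"]))
        if payload["type"] == "done":
            break
    return events

raw = [
    'data: {"type":"token","content":"Hel"}',
    'data: {"type":"token","content":"lo"}',
    'data: {"type":"done","content":""}',
]
events = parse_sse(raw)
text = "".join(c for t, c in events if t == "token")  # "Hello"
```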
### GET /health

Returns:

- `status`
- `model_loaded`
- `device`
- `active_workers`
- `queue_size`
- `max_workers`
### GET /index

Serves the `index.html` test page.
## Model-Specific Settings

### app.py (Nanbeige4.1-3B)

- `max_input_tokens=32768`
- `eos_token_id=166101`
- `default_temperature=0.6`
- `top_p=0.95`
- `repetition_penalty=1.0`
- `tokenizer_use_fast=False`

### main.py (LFM2.5-1.2B-Thinking)

- `max_input_tokens=32768`
- `default_temperature=0.1`
- `top_p=0.1`
- `top_k=50`
- `repetition_penalty=1.05`
- `eos_token_id` from the tokenizer config
## Environment Variables

- `HF_MAX_WORKERS`
- `HF_QUEUE_MAX_SIZE`
- `HF_STREAMER_TIMEOUT_SECONDS`
- `HF_GENERATION_JOIN_TIMEOUT_SECONDS`
- `HF_MAX_INPUT_TOKENS`
- `HF_MAX_NEW_TOKENS`
- `HF_DEBUG_TOKEN_LOGS`
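A common way such integer knobs are consumed (a sketch; the defaults shown are the two documented in this README, and `env_int` is a hypothetical helper):

```python
import os

def env_int(name: str, default: int) -> int:
    # Fall back to the default when the variable is unset or empty
    value = os.environ.get(name)
    return int(value) if value else default

# Documented defaults for two of the knobs above
queue_max_size = env_int("HF_QUEUE_MAX_SIZE", 512)
join_timeout = env_int("HF_GENERATION_JOIN_TIMEOUT_SECONDS", 180)
```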
## Model Documentation References

### Nanbeige / app.py
- https://huggingface.co/Nanbeige/Nanbeige4.1-3B
- https://huggingface.co/Nanbeige/Nanbeige4.1-3B/blob/main/README.md
- https://huggingface.co/Nanbeige/Nanbeige4.1-3B/blob/main/Nanbeige4.1-3B-Report.pdf
- https://huggingface.co/Nanbeige/Nanbeige4.1-3B/blob/main/generation_config.json
- https://huggingface.co/Nanbeige/Nanbeige4.1-3B/blob/main/config.json
### LiquidAI / main.py
- https://huggingface.co/LiquidAI/LFM2.5-1.2B-Thinking
- https://huggingface.co/LiquidAI/LFM2.5-1.2B-Thinking/blob/main/README.md
- https://huggingface.co/LiquidAI/LFM2.5-1.2B-Thinking/blob/main/chat_template.jinja
- https://huggingface.co/LiquidAI/LFM2.5-1.2B-Thinking/blob/main/config.json
- https://docs.liquid.ai/lfm/key-concepts/chat-template
- https://docs.liquid.ai/lfm/key-concepts/text-generation-and-prompting
- https://docs.liquid.ai/lfm/key-concepts/tool-use
- https://huggingface.co/docs/transformers/en/chat_templating#using-applychattemplate
## Notes

- The model is loaded once per process during FastAPI lifespan startup.
- `index.html` is intentionally a simple streaming test page, not the production frontend.
- Both entrypoints (`app.py`, `main.py`) now behave consistently by design.