---
title: React
emoji: 🌍
colorFrom: yellow
colorTo: gray
sdk: gradio
sdk_version: 6.9.0
app_file: app.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

HF Space Backend (Streaming LLM Server)

This folder contains Hugging Face Space backends for two model deployments that share the same production runtime.

Files

  • app.py: Nanbeige deployment entrypoint (Nanbeige/Nanbeige4.1-3B)
  • main.py: LiquidAI deployment entrypoint (LiquidAI/LFM2.5-1.2B-Thinking)
  • server_runtime.py: shared queue + worker + streaming runtime used by both entrypoints
  • index.html: lightweight local streaming test UI
  • requirements.txt: runtime dependencies

Runtime Architecture

Both servers use the same execution flow:

Client Request
-> FastAPI /chat
-> asyncio.Queue request buffer
-> worker pool (asyncio tasks)
-> concurrency gate (asyncio.Semaphore)
-> one generation thread per request (model.generate)
-> per-request TextIteratorStreamer
-> SSE token stream to client
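The flow above can be sketched as follows. This is a minimal, illustrative model of the queue → worker pool → semaphore pattern; all names here (`fake_generate`, `worker`, `main`) are placeholders, and the real implementation lives in server_runtime.py.

```python
import asyncio

QUEUE_MAX_SIZE = 512   # mirrors the HF_QUEUE_MAX_SIZE default
MAX_WORKERS = 2        # hardware-derived in the real runtime

async def fake_generate(prompt: str) -> list[str]:
    # Stand-in for the per-request generation thread + streamer.
    await asyncio.sleep(0.01)
    return [f"{prompt}-tok{i}" for i in range(3)]

async def worker(queue: asyncio.Queue, gate: asyncio.Semaphore,
                 results: dict) -> None:
    while True:
        prompt = await queue.get()         # buffered request
        try:
            async with gate:               # concurrency gate
                results[prompt] = await fake_generate(prompt)
        finally:
            queue.task_done()

async def main() -> dict:
    queue: asyncio.Queue = asyncio.Queue(maxsize=QUEUE_MAX_SIZE)
    gate = asyncio.Semaphore(MAX_WORKERS)
    results: dict = {}
    workers = [asyncio.create_task(worker(queue, gate, results))
               for _ in range(MAX_WORKERS)]
    for prompt in ("a", "b", "c"):
        await queue.put(prompt)            # a full queue would make this wait
    await queue.join()                     # all buffered requests drained
    for w in workers:
        w.cancel()
    return results

results = asyncio.run(main())
```

Because the gate bounds in-flight generations while the queue absorbs the rest, a burst of requests waits in the buffer instead of overwhelming the model.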

Why this structure

  • Keeps the event loop responsive.
  • Prevents response mixing across users (isolated request objects).
  • Supports controlled concurrency under CPU/GPU.
  • Buffers excess load in the queue during bursts instead of failing hard.

Concurrency

Hardware-aware worker count:

  • CPU: 1..4 workers (core-based)
  • GPU: 3..5 workers (VRAM tier-based)

Override at runtime:

  • HF_MAX_WORKERS

Queue settings:

  • HF_QUEUE_MAX_SIZE (default: 512)

Thread Lifecycle and Safety

  • Each request gets its own generation thread.
  • Each request has a cancellation event.
  • CancelAwareStoppingCriteria stops generation when client disconnects/cancels.
  • Streamer is explicitly ended in finally block.
  • Generation thread is joined with a long timeout (HF_GENERATION_JOIN_TIMEOUT_SECONDS, default 180) to avoid orphaned work.

This replaces the old short join timeout, which frequently produced the warning:

  • Generation thread did not finish within timeout
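The cancellation pieces can be sketched like this. The sketch is duck-typed to stay self-contained; the real CancelAwareStoppingCriteria subclasses transformers.StoppingCriteria so that model.generate consults it after each decoding step, and `join_generation` is an illustrative name.

```python
import threading

class CancelAwareStoppingCriteria:
    """Duck-typed sketch; the real class subclasses
    transformers.StoppingCriteria and is checked by model.generate
    after every decoding step."""

    def __init__(self, cancel_event: threading.Event) -> None:
        self.cancel_event = cancel_event

    def __call__(self, input_ids=None, scores=None, **kwargs) -> bool:
        # Returning True tells generate() to stop, so a client
        # disconnect/cancel ends the generation thread promptly.
        return self.cancel_event.is_set()

def join_generation(thread: threading.Thread, timeout: float = 180.0) -> bool:
    """Join the generation thread with a long timeout
    (HF_GENERATION_JOIN_TIMEOUT_SECONDS defaults to 180).
    Returns True if the thread actually finished."""
    thread.join(timeout)
    return not thread.is_alive()
```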

Metrics and Logging

Per request logs include:

  • request queued
  • worker start/end
  • first token latency
  • generated token count
  • tokens/sec
  • active workers
  • queue size

Debug token-by-token logging is optional:

  • HF_DEBUG_TOKEN_LOGS=1
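The timing fields above could be computed along these lines; `RequestMetrics` is an illustrative name, not the runtime's actual code.

```python
import time

class RequestMetrics:
    """Tracks first-token latency, token count, and tokens/sec
    for one request."""

    def __init__(self) -> None:
        self.start = time.monotonic()
        self.first_token_at: float | None = None
        self.tokens = 0

    def on_token(self) -> None:
        if self.first_token_at is None:
            self.first_token_at = time.monotonic()
        self.tokens += 1

    def summary(self) -> dict:
        now = time.monotonic()
        ttft = (self.first_token_at or now) - self.start
        elapsed = max(now - self.start, 1e-9)   # avoid division by zero
        return {
            "first_token_latency_s": ttft,
            "generated_tokens": self.tokens,
            "tokens_per_sec": self.tokens / elapsed,
        }
```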

API

POST /chat

Body:

  • messages: chat messages
  • stream: true for SSE streaming
  • max_tokens: max new tokens requested
  • temperature: optional; if omitted, the model's default is used
  • tools: optional tool schemas for chat template

Streaming response format:

  • SSE data: {"type":"token","content":"..."} chunks
  • final {"type":"done","content":""} event
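A stdlib-only client sketch for consuming this stream. The /chat path, payload fields, and event shapes come from this README; the base URL and the function names (`parse_sse_line`, `stream_chat`) are placeholders.

```python
import json
import urllib.request

def parse_sse_line(line: str):
    """Parse one SSE line into an event dict; returns None for
    blank/non-data lines."""
    if not line.startswith("data:"):
        return None
    return json.loads(line[len("data:"):].strip())

def stream_chat(base_url: str, messages: list) -> str:
    """POST to /chat with stream=True and concatenate token events
    until the done event arrives."""
    payload = json.dumps({"messages": messages, "stream": True,
                          "max_tokens": 128}).encode()
    req = urllib.request.Request(f"{base_url}/chat", data=payload,
                                 headers={"Content-Type": "application/json"})
    text = []
    with urllib.request.urlopen(req) as resp:
        for raw in resp:
            event = parse_sse_line(raw.decode())
            if event is None:
                continue
            if event["type"] == "token":
                text.append(event["content"])
            elif event["type"] == "done":
                break
    return "".join(text)
```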

GET /health

Returns:

  • status
  • model_loaded
  • device
  • active_workers
  • queue_size
  • max_workers
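An example payload, assuming the field names above (the values are illustrative, not guaranteed):

```json
{
  "status": "ok",
  "model_loaded": true,
  "device": "cpu",
  "active_workers": 1,
  "queue_size": 0,
  "max_workers": 2
}
```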

GET /index

Serves the index.html test page.

Model-Specific Settings

app.py (Nanbeige4.1-3B)

  • max_input_tokens=32768
  • eos_token_id=166101
  • default_temperature=0.6
  • top_p=0.95
  • repetition_penalty=1.0
  • tokenizer_use_fast=False

main.py (LFM2.5-1.2B-Thinking)

  • max_input_tokens=32768
  • default_temperature=0.1
  • top_p=0.1
  • top_k=50
  • repetition_penalty=1.05
  • eos_token_id from tokenizer config

Environment Variables

  • HF_MAX_WORKERS
  • HF_QUEUE_MAX_SIZE
  • HF_STREAMER_TIMEOUT_SECONDS
  • HF_GENERATION_JOIN_TIMEOUT_SECONDS
  • HF_MAX_INPUT_TOKENS
  • HF_MAX_NEW_TOKENS
  • HF_DEBUG_TOKEN_LOGS
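Example overrides (the values here are illustrative, not recommended settings):

```shell
# Set before starting the Space process
export HF_MAX_WORKERS=4
export HF_QUEUE_MAX_SIZE=1024
export HF_GENERATION_JOIN_TIMEOUT_SECONDS=180
export HF_DEBUG_TOKEN_LOGS=1
```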

Model Documentation References

Nanbeige / app.py

LiquidAI / main.py

Notes

  • Model is loaded once per process during FastAPI lifespan startup.
  • index.html is intentionally a simple streaming test page, not the production frontend.
  • Both entrypoints (app.py, main.py) now behave consistently by design.