---
title: React
emoji: 🌍
colorFrom: yellow
colorTo: gray
sdk: gradio
sdk_version: 6.9.0
app_file: app.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# HF Space Backend (Streaming LLM Server)
This folder contains Hugging Face Space backends for two model deployments that share the same production runtime.
## Files

- `app.py`: Nanbeige deployment entrypoint (`Nanbeige/Nanbeige4.1-3B`)
- `main.py`: LiquidAI deployment entrypoint (`LiquidAI/LFM2.5-1.2B-Thinking`)
- `server_runtime.py`: shared queue + worker + streaming runtime used by both entrypoints
- `index.html`: lightweight local streaming test UI
- `requirements.txt`: runtime dependencies
## Runtime Architecture

Both servers use the same execution flow:

```
Client Request
  -> FastAPI /chat
  -> asyncio.Queue request buffer
  -> worker pool (asyncio tasks)
  -> concurrency gate (asyncio.Semaphore)
  -> one generation thread per request (model.generate)
  -> per-request TextIteratorStreamer
  -> SSE token stream to client
```
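The flow above can be sketched as a minimal asyncio pattern. This is illustrative only: names like `generate` and `worker`, and the small constants, are stand-ins rather than the actual `server_runtime.py` API.

```python
import asyncio

QUEUE_MAX_SIZE = 8   # stands in for HF_QUEUE_MAX_SIZE
MAX_WORKERS = 2      # stands in for the hardware-aware worker count

async def generate(prompt: str) -> str:
    # The real runtime spawns a generation thread running model.generate
    # and streams tokens back; here we just simulate the work.
    await asyncio.sleep(0.01)
    return f"reply:{prompt}"

async def worker(queue: asyncio.Queue, gate: asyncio.Semaphore, results: list) -> None:
    while True:
        prompt = await queue.get()
        async with gate:  # concurrency gate limits parallel generations
            results.append(await generate(prompt))
        queue.task_done()

async def main() -> list:
    queue: asyncio.Queue = asyncio.Queue(maxsize=QUEUE_MAX_SIZE)  # request buffer
    gate = asyncio.Semaphore(MAX_WORKERS)
    results: list = []
    workers = [asyncio.create_task(worker(queue, gate, results))
               for _ in range(MAX_WORKERS)]
    for i in range(4):            # a burst of client requests
        await queue.put(f"prompt-{i}")
    await queue.join()            # wait until the buffer drains
    for w in workers:
        w.cancel()
    return results

results = asyncio.run(main())
```

Because the queue is bounded, a burst beyond `QUEUE_MAX_SIZE` waits (or can be rejected) instead of spawning unbounded work, which is the "queues overflow load" behavior described below.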
### Why this structure

- Keeps the event loop responsive.
- Prevents response mixing across users (isolated request objects).
- Supports controlled concurrency under CPU/GPU.
- Queues overflow load instead of hard failing during bursts.
## Concurrency

Hardware-aware worker count:

- CPU: `1..4` workers (core-based)
- GPU: `3..5` workers (VRAM tier-based)

Override at runtime with `HF_MAX_WORKERS`.

Queue settings: `HF_QUEUE_MAX_SIZE` (default: `512`)
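A sketch of how such a heuristic can be resolved. The `1..4` / `3..5` ranges and the override come from this document; the specific core and VRAM tier cutoffs below are illustrative assumptions, not the values in `server_runtime.py`.

```python
import os
from typing import Optional

def resolve_max_workers(cpu_cores: int, gpu_vram_gb: Optional[float]) -> int:
    # HF_MAX_WORKERS overrides the hardware-based heuristic entirely
    override = os.environ.get("HF_MAX_WORKERS")
    if override:
        return int(override)
    if gpu_vram_gb is not None:
        # GPU: VRAM tier-based, 3..5 workers (tier cutoffs are illustrative)
        if gpu_vram_gb >= 40:
            return 5
        if gpu_vram_gb >= 16:
            return 4
        return 3
    # CPU: core-based, clamped to 1..4 workers
    return max(1, min(4, cpu_cores // 2))
```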
## Thread Lifecycle and Safety

- Each request gets its own generation thread.
- Each request has a cancellation event.
- `CancelAwareStoppingCriteria` stops generation when the client disconnects or cancels.
- The streamer is explicitly ended in a `finally` block.
- The generation thread is joined with a long timeout (`HF_GENERATION_JOIN_TIMEOUT_SECONDS`, default `180`) to avoid orphaned work.

This fixes the old short-join behavior that frequently produced:

```
Generation thread did not finish within timeout
```
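The cancellation hook follows the standard transformers stopping-criteria pattern: `model.generate` calls the criteria after each decoding step, and returning `True` stops generation. A dependency-free sketch of the idea (the real `CancelAwareStoppingCriteria` would subclass `transformers.StoppingCriteria`; this mock only shows the shape):

```python
import threading

class CancelAwareStoppingCriteria:
    """Sketch: in the real runtime this subclasses transformers.StoppingCriteria,
    and generate() invokes it with (input_ids, scores) after every step."""

    def __init__(self, cancel_event: threading.Event):
        self.cancel_event = cancel_event

    def __call__(self, input_ids=None, scores=None, **kwargs) -> bool:
        # Returning True tells generate() to stop producing tokens.
        return self.cancel_event.is_set()

# One cancellation event per request; the request handler sets it when
# the client disconnects or cancels.
cancel = threading.Event()
criteria = CancelAwareStoppingCriteria(cancel)
before = criteria()   # client still connected -> keep generating
cancel.set()          # client disconnected / cancelled
after = criteria()    # generation should now stop
```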
## Metrics and Logging

Per-request logs include:
- request queued
- worker start/end
- first token latency
- generated token count
- tokens/sec
- active workers
- queue size
Debug token-by-token logging is optional; enable it with `HF_DEBUG_TOKEN_LOGS=1`.
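Two of the metrics above, first-token latency and tokens/sec, can be computed by wrapping any token iterator. This is a hypothetical helper for illustration, not the actual `server_runtime.py` API:

```python
import time

def stream_with_metrics(token_iter):
    """Wrap a token iterator and report first-token latency and tokens/sec."""
    start = time.monotonic()
    first_token_latency = None
    count = 0
    for token in token_iter:
        if first_token_latency is None:
            # Time from request start until the first streamed token
            first_token_latency = time.monotonic() - start
        count += 1
        yield token
    elapsed = time.monotonic() - start
    tokens_per_sec = count / elapsed if elapsed > 0 else 0.0
    print(f"tokens={count} first_token_s={first_token_latency:.3f} tok/s={tokens_per_sec:.1f}")

tokens = list(stream_with_metrics(iter(["Hel", "lo", "!"])))
```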
## API

### POST /chat

Body:

- `messages`: chat messages
- `stream`: `true` for SSE streaming
- `max_tokens`: max new tokens requested
- `temperature`: optional; if omitted, the model default is used
- `tools`: optional tool schemas for the chat template
Streaming response format:

- SSE `data: {"type":"token","content":"..."}` chunks
- a final `{"type":"done","content":""}` event
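A client can assemble the response by parsing those SSE lines. A minimal parser, assuming only the chunk shapes documented above:

```python
import json

def parse_sse(lines):
    """Collect (type, content) events from /chat SSE lines until 'done'."""
    events = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = json.loads(line[len("data:"):])
        events.append((payload["type"], payload["content"]))
        if payload["type"] == "done":
            break
    return events

raw = [
    'data: {"type":"token","content":"Hel"}',
    'data: {"type":"token","content":"lo"}',
    'data: {"type":"done","content":""}',
]
events = parse_sse(raw)
text = "".join(c for t, c in events if t == "token")  # "Hello"
```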
### GET /health

Returns:

- `status`
- `model_loaded`
- `device`
- `active_workers`
- `queue_size`
- `max_workers`
### GET /index

Serves the `index.html` test page.
## Model-Specific Settings

### app.py (Nanbeige4.1-3B)

- `max_input_tokens=32768`
- `eos_token_id=166101`
- `default_temperature=0.6`
- `top_p=0.95`
- `repetition_penalty=1.0`
- `tokenizer_use_fast=False`

### main.py (LFM2.5-1.2B-Thinking)

- `max_input_tokens=32768`
- `default_temperature=0.1`
- `top_p=0.1`
- `top_k=50`
- `repetition_penalty=1.05`
- `eos_token_id` from the tokenizer config
## Environment Variables

- `HF_MAX_WORKERS`
- `HF_QUEUE_MAX_SIZE`
- `HF_STREAMER_TIMEOUT_SECONDS`
- `HF_GENERATION_JOIN_TIMEOUT_SECONDS`
- `HF_MAX_INPUT_TOKENS`
- `HF_MAX_NEW_TOKENS`
- `HF_DEBUG_TOKEN_LOGS`
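A common way such integer knobs are consumed (a sketch; the defaults shown are the two documented in this README, and `env_int` is a hypothetical helper):

```python
import os

def env_int(name: str, default: int) -> int:
    # Fall back to the default when the variable is unset or empty
    value = os.environ.get(name)
    return int(value) if value else default

# Documented defaults for two of the knobs above
queue_max_size = env_int("HF_QUEUE_MAX_SIZE", 512)
join_timeout = env_int("HF_GENERATION_JOIN_TIMEOUT_SECONDS", 180)
```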
## Model Documentation References

### Nanbeige / app.py
- https://huggingface.co/Nanbeige/Nanbeige4.1-3B
- https://huggingface.co/Nanbeige/Nanbeige4.1-3B/blob/main/README.md
- https://huggingface.co/Nanbeige/Nanbeige4.1-3B/blob/main/Nanbeige4.1-3B-Report.pdf
- https://huggingface.co/Nanbeige/Nanbeige4.1-3B/blob/main/generation_config.json
- https://huggingface.co/Nanbeige/Nanbeige4.1-3B/blob/main/config.json
### LiquidAI / main.py
- https://huggingface.co/LiquidAI/LFM2.5-1.2B-Thinking
- https://huggingface.co/LiquidAI/LFM2.5-1.2B-Thinking/blob/main/README.md
- https://huggingface.co/LiquidAI/LFM2.5-1.2B-Thinking/blob/main/chat_template.jinja
- https://huggingface.co/LiquidAI/LFM2.5-1.2B-Thinking/blob/main/config.json
- https://docs.liquid.ai/lfm/key-concepts/chat-template
- https://docs.liquid.ai/lfm/key-concepts/text-generation-and-prompting
- https://docs.liquid.ai/lfm/key-concepts/tool-use
- https://huggingface.co/docs/transformers/en/chat_templating#using-applychattemplate
## Notes

- The model is loaded once per process during FastAPI lifespan startup.
- `index.html` is intentionally a simple streaming test page, not the production frontend.
- Both entrypoints (`app.py`, `main.py`) now behave consistently by design.