---
title: React
emoji: 🌍
colorFrom: yellow
colorTo: gray
sdk: gradio
sdk_version: 6.9.0
app_file: app.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# HF Space Backend (Streaming LLM Server)

This folder contains the Hugging Face Space backends for two model deployments that share the same production runtime.

## Files
- `app.py`: Nanbeige deployment entrypoint (`Nanbeige/Nanbeige4.1-3B`)
- `main.py`: LiquidAI deployment entrypoint (`LiquidAI/LFM2.5-1.2B-Thinking`)
- `server_runtime.py`: shared queue + worker + streaming runtime used by both entrypoints
- `index.html`: lightweight local streaming test UI
- `requirements.txt`: runtime dependencies

## Runtime Architecture
Both servers use the same execution flow:

```
Client Request
  -> FastAPI /chat
  -> asyncio.Queue request buffer
  -> worker pool (asyncio tasks)
  -> concurrency gate (asyncio.Semaphore)
  -> one generation thread per request (model.generate)
  -> per-request TextIteratorStreamer
  -> SSE token stream to client
```

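The queue / worker / semaphore part of this flow can be sketched with stdlib `asyncio` alone. This is a minimal illustration of the pattern, not the actual `server_runtime.py` code; `QUEUE_MAX_SIZE`, `MAX_WORKERS`, and `handle_request` are placeholders:

```python
import asyncio

QUEUE_MAX_SIZE = 8   # stands in for HF_QUEUE_MAX_SIZE
MAX_WORKERS = 2      # stands in for the hardware-derived worker count

async def handle_request(prompt: str, gate: asyncio.Semaphore) -> str:
    # The semaphore caps how many generations run at once.
    async with gate:
        await asyncio.sleep(0)  # stands in for the real generation work
        return f"echo:{prompt}"

async def worker(q: asyncio.Queue, gate: asyncio.Semaphore, results: list) -> None:
    while True:
        prompt = await q.get()
        if prompt is None:      # shutdown sentinel
            q.task_done()
            return
        results.append(await handle_request(prompt, gate))
        q.task_done()

async def main() -> list:
    q: asyncio.Queue = asyncio.Queue(maxsize=QUEUE_MAX_SIZE)
    gate = asyncio.Semaphore(MAX_WORKERS)
    results: list = []
    workers = [asyncio.create_task(worker(q, gate, results)) for _ in range(MAX_WORKERS)]
    for prompt in ("a", "b", "c"):
        await q.put(prompt)     # requests buffer here during bursts
    await q.join()
    for _ in workers:
        await q.put(None)
    await asyncio.gather(*workers)
    return results

print(asyncio.run(main()))
```

Requests that arrive while all workers are busy simply wait in the queue instead of failing, which is the behavior described below.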
### Why this structure
- Keeps the event loop responsive.
- Prevents response mixing across users (each request has isolated state).
- Supports controlled concurrency on CPU or GPU.
- Queues overflow load instead of failing hard during bursts.

## Concurrency
Hardware-aware worker count:
- CPU: `1..4` workers (core-based)
- GPU: `3..5` workers (VRAM tier-based)

Override at runtime:
- `HF_MAX_WORKERS`

Queue settings:
- `HF_QUEUE_MAX_SIZE` (default: `512`)

## Thread Lifecycle and Safety
- Each request gets its own generation thread.
- Each request has its own cancellation event.
- `CancelAwareStoppingCriteria` stops generation when the client disconnects or cancels.
- The streamer is explicitly ended in a `finally` block.
- The generation thread is joined with a long timeout (`HF_GENERATION_JOIN_TIMEOUT_SECONDS`, default `180`) to avoid orphaned work.

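A stdlib-only sketch of this per-request pattern. The real code uses transformers' `StoppingCriteria` and `TextIteratorStreamer`; here a plain token loop, a `threading.Event`, and a `queue.Queue` stand in for them:

```python
import queue
import threading

JOIN_TIMEOUT_SECONDS = 5  # stands in for HF_GENERATION_JOIN_TIMEOUT_SECONDS

def generate(tokens, cancel: threading.Event, out: queue.Queue) -> None:
    """Stand-in for the generation thread; the cancel check mirrors what a
    cancel-aware stopping criteria does between generation steps."""
    try:
        for tok in tokens:
            if cancel.is_set():
                break
            out.put(tok)
    finally:
        out.put(None)  # the streamer is always ended, even on error or cancel

def run_request(tokens, cancelled: bool = False):
    cancel = threading.Event()
    if cancelled:
        cancel.set()  # simulate a client that already disconnected
    out: queue.Queue = queue.Queue()
    thread = threading.Thread(target=generate, args=(tokens, cancel, out))
    thread.start()
    received = []
    while (tok := out.get()) is not None:
        received.append(tok)
    thread.join(timeout=JOIN_TIMEOUT_SECONDS)  # long join avoids orphaned work
    return received

print(run_request(["a", "b", "c"]))                 # ['a', 'b', 'c']
print(run_request(["a", "b", "c"], cancelled=True)) # []
```

The `finally` block guarantees the consumer's `out.get()` loop terminates even when generation is cancelled or raises, which is the same reason the production streamer is ended in `finally`.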
This replaces the old short-join behavior, which frequently produced:
- `Generation thread did not finish within timeout`

## Metrics and Logging
Per-request logs include:
- request queued
- worker start/end
- first-token latency
- generated token count
- tokens/sec
- active workers
- queue size

Token-by-token debug logging is optional:
- `HF_DEBUG_TOKEN_LOGS=1`

## API
### `POST /chat`
Body:
- `messages`: chat messages
- `stream`: `true` for SSE streaming
- `max_tokens`: maximum number of new tokens to generate
- `temperature`: optional; if omitted, the model default is used
- `tools`: optional tool schemas passed to the chat template

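An illustrative request body (the field values are examples, not defaults):

```json
{
  "messages": [
    {"role": "user", "content": "Hello"}
  ],
  "stream": true,
  "max_tokens": 256,
  "temperature": 0.6
}
```
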
Streaming response format:
- SSE `data: {"type":"token","content":"..."}` chunks
- a final `{"type":"done","content":""}` event

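A minimal client-side sketch of consuming this format, assuming each event arrives as a complete `data: …` line (stdlib only; `parse_sse_lines` is a hypothetical helper, not part of this repo):

```python
import json

def parse_sse_lines(lines):
    """Yield token strings from `data: {...}` SSE lines until the done event."""
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip keep-alives / blank separator lines
        event = json.loads(line[len("data: "):])
        if event["type"] == "done":
            return
        if event["type"] == "token":
            yield event["content"]

stream = [
    'data: {"type":"token","content":"Hel"}',
    'data: {"type":"token","content":"lo"}',
    'data: {"type":"done","content":""}',
]
print("".join(parse_sse_lines(stream)))  # Hello
```
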
### `GET /health`
Returns:
- `status`
- `model_loaded`
- `device`
- `active_workers`
- `queue_size`
- `max_workers`

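An illustrative response shape (field values are examples; the exact `status` and `device` strings may differ):

```json
{
  "status": "ok",
  "model_loaded": true,
  "device": "cuda",
  "active_workers": 2,
  "queue_size": 0,
  "max_workers": 4
}
```
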
### `GET /index`
Serves the `index.html` test page.

## Model-Specific Settings
### `app.py` (Nanbeige4.1-3B)
- `max_input_tokens=32768`
- `eos_token_id=166101`
- `default_temperature=0.6`
- `top_p=0.95`
- `repetition_penalty=1.0`
- `tokenizer_use_fast=False`

### `main.py` (LFM2.5-1.2B-Thinking)
- `max_input_tokens=32768`
- `default_temperature=0.1`
- `top_p=0.1`
- `top_k=50`
- `repetition_penalty=1.05`
- `eos_token_id` taken from the tokenizer config

## Environment Variables
- `HF_MAX_WORKERS`
- `HF_QUEUE_MAX_SIZE`
- `HF_STREAMER_TIMEOUT_SECONDS`
- `HF_GENERATION_JOIN_TIMEOUT_SECONDS`
- `HF_MAX_INPUT_TOKENS`
- `HF_MAX_NEW_TOKENS`
- `HF_DEBUG_TOKEN_LOGS`

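A minimal sketch of reading these settings with fallbacks (the `env_int` helper is hypothetical; the `512` and `180` defaults are the ones documented above):

```python
import os

def env_int(name: str, default: int) -> int:
    """Read an integer setting from the environment, falling back to a default."""
    raw = os.environ.get(name)
    return int(raw) if raw else default

queue_max_size = env_int("HF_QUEUE_MAX_SIZE", 512)           # documented default
join_timeout = env_int("HF_GENERATION_JOIN_TIMEOUT_SECONDS", 180)
debug_tokens = os.environ.get("HF_DEBUG_TOKEN_LOGS") == "1"  # flag, not an int
print(queue_max_size, join_timeout, debug_tokens)
```
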
## Model Documentation References
### Nanbeige / `app.py`
- https://huggingface.co/Nanbeige/Nanbeige4.1-3B
- https://huggingface.co/Nanbeige/Nanbeige4.1-3B/blob/main/README.md
- https://huggingface.co/Nanbeige/Nanbeige4.1-3B/blob/main/Nanbeige4.1-3B-Report.pdf
- https://huggingface.co/Nanbeige/Nanbeige4.1-3B/blob/main/generation_config.json
- https://huggingface.co/Nanbeige/Nanbeige4.1-3B/blob/main/config.json

### LiquidAI / `main.py`
- https://huggingface.co/LiquidAI/LFM2.5-1.2B-Thinking
- https://huggingface.co/LiquidAI/LFM2.5-1.2B-Thinking/blob/main/README.md
- https://huggingface.co/LiquidAI/LFM2.5-1.2B-Thinking/blob/main/chat_template.jinja
- https://huggingface.co/LiquidAI/LFM2.5-1.2B-Thinking/blob/main/config.json
- https://docs.liquid.ai/lfm/key-concepts/chat-template
- https://docs.liquid.ai/lfm/key-concepts/text-generation-and-prompting
- https://docs.liquid.ai/lfm/key-concepts/tool-use
- https://huggingface.co/docs/transformers/en/chat_templating#using-applychattemplate

## Notes
- The model is loaded once per process, during FastAPI lifespan startup.
- `index.html` is intentionally a simple streaming test page, not the production frontend.
- Both entrypoints (`app.py`, `main.py`) behave consistently by design.