---
title: React
emoji: 🌍
colorFrom: yellow
colorTo: gray
sdk: gradio
sdk_version: 6.9.0
app_file: app.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# HF Space Backend (Streaming LLM Server)

This folder contains Hugging Face Space backends for two model deployments that share the same production runtime.

## Files

- `app.py`: Nanbeige deployment entrypoint (`Nanbeige/Nanbeige4.1-3B`)
- `main.py`: LiquidAI deployment entrypoint (`LiquidAI/LFM2.5-1.2B-Thinking`)
- `server_runtime.py`: shared queue + worker + streaming runtime used by both entrypoints
- `index.html`: lightweight local streaming test UI
- `requirements.txt`: runtime dependencies

## Runtime Architecture

Both servers use the same execution flow:

Client request ->
FastAPI `/chat` ->
`asyncio.Queue` request buffer ->
worker pool (`asyncio` tasks) ->
concurrency gate (`asyncio.Semaphore`) ->
one generation thread per request (`model.generate`) ->
per-request `TextIteratorStreamer` ->
SSE token stream to client

### Why this structure

- Keeps the event loop responsive.
- Prevents response mixing across users (isolated request objects).
- Supports controlled concurrency under CPU/GPU load.
- Queues overflow load instead of hard-failing during bursts.

## Concurrency

Hardware-aware worker count:

- CPU: `1..4` workers (core-based)
- GPU: `3..5` workers (VRAM tier-based)

Override at runtime:

- `HF_MAX_WORKERS`

Queue settings:

- `HF_QUEUE_MAX_SIZE` (default: `512`)

## Thread Lifecycle and Safety

- Each request gets its own generation thread.
- Each request has its own cancellation event.
- `CancelAwareStoppingCriteria` stops generation when the client disconnects or cancels.
- The streamer is explicitly ended in a `finally` block.
- The generation thread is joined with a long timeout (`HF_GENERATION_JOIN_TIMEOUT_SECONDS`, default `180`) to avoid orphaned work.
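The per-request lifecycle can be sketched as below. This is a minimal, illustrative sketch only: `fake_generate` stands in for `model.generate` plus `TextIteratorStreamer`, and the function names are hypothetical, not the actual `server_runtime.py` API.

```python
import queue
import threading

JOIN_TIMEOUT_SECONDS = 180  # mirrors HF_GENERATION_JOIN_TIMEOUT_SECONDS


def fake_generate(tokens, cancel, stream):
    """Per-request generation thread; stands in for model.generate."""
    try:
        for tok in tokens:
            if cancel.is_set():  # the role CancelAwareStoppingCriteria plays
                break
            stream.put(tok)
    finally:
        stream.put(None)  # streamer is always ended, even on error


def run_request(tokens, cancel_after=None):
    """One isolated request: own thread, own cancel event, own streamer."""
    cancel = threading.Event()
    stream = queue.Queue()
    t = threading.Thread(target=fake_generate, args=(tokens, cancel, stream))
    t.start()
    out = []
    while True:
        tok = stream.get()
        if tok is None:
            break
        out.append(tok)
        if cancel_after is not None and len(out) >= cancel_after:
            cancel.set()  # simulates a client disconnect mid-stream
    t.join(timeout=JOIN_TIMEOUT_SECONDS)  # long join avoids orphaned work
    return out


print(run_request(["Hel", "lo", "!"]))  # -> ['Hel', 'lo', '!']
```

Because each request owns its thread, event, and streamer queue, tokens from concurrent requests cannot interleave.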
This fixes the old short-join behavior that frequently produced:

- `Generation thread did not finish within timeout`

## Metrics and Logging

Per-request logs include:

- request queued
- worker start/end
- first-token latency
- generated token count
- tokens/sec
- active workers
- queue size

Debug token-by-token logging is optional:

- `HF_DEBUG_TOKEN_LOGS=1`

## API

### `POST /chat`

Body:

- `messages`: chat messages
- `stream`: `true` for SSE streaming
- `max_tokens`: maximum new tokens requested
- `temperature`: optional; if omitted, the model default is used
- `tools`: optional tool schemas for the chat template

Streaming response format:

- SSE `data: {"type":"token","content":"..."}` chunks
- final `{"type":"done","content":""}` event

### `GET /health`

Returns:

- `status`
- `model_loaded`
- `device`
- `active_workers`
- `queue_size`
- `max_workers`

### `GET /index`

Serves the `index.html` test page.

## Model-Specific Settings

### `app.py` (Nanbeige4.1-3B)

- `max_input_tokens=32768`
- `eos_token_id=166101`
- `default_temperature=0.6`
- `top_p=0.95`
- `repetition_penalty=1.0`
- `tokenizer_use_fast=False`

### `main.py` (LFM2.5-1.2B-Thinking)

- `max_input_tokens=32768`
- `default_temperature=0.1`
- `top_p=0.1`
- `top_k=50`
- `repetition_penalty=1.05`
- `eos_token_id` from tokenizer config

## Environment Variables

- `HF_MAX_WORKERS`
- `HF_QUEUE_MAX_SIZE`
- `HF_STREAMER_TIMEOUT_SECONDS`
- `HF_GENERATION_JOIN_TIMEOUT_SECONDS`
- `HF_MAX_INPUT_TOKENS`
- `HF_MAX_NEW_TOKENS`
- `HF_DEBUG_TOKEN_LOGS`

## Model Documentation References

### Nanbeige / `app.py`

- https://huggingface.co/Nanbeige/Nanbeige4.1-3B
- https://huggingface.co/Nanbeige/Nanbeige4.1-3B/blob/main/README.md
- https://huggingface.co/Nanbeige/Nanbeige4.1-3B/blob/main/Nanbeige4.1-3B-Report.pdf
- https://huggingface.co/Nanbeige/Nanbeige4.1-3B/blob/main/generation_config.json
- https://huggingface.co/Nanbeige/Nanbeige4.1-3B/blob/main/config.json

### LiquidAI / `main.py`

- https://huggingface.co/LiquidAI/LFM2.5-1.2B-Thinking
- https://huggingface.co/LiquidAI/LFM2.5-1.2B-Thinking/blob/main/README.md
- https://huggingface.co/LiquidAI/LFM2.5-1.2B-Thinking/blob/main/chat_template.jinja
- https://huggingface.co/LiquidAI/LFM2.5-1.2B-Thinking/blob/main/config.json
- https://docs.liquid.ai/lfm/key-concepts/chat-template
- https://docs.liquid.ai/lfm/key-concepts/text-generation-and-prompting
- https://docs.liquid.ai/lfm/key-concepts/tool-use
- https://huggingface.co/docs/transformers/en/chat_templating#using-applychattemplate

## Notes

- The model is loaded once per process during FastAPI lifespan startup.
- `index.html` is intentionally a simple streaming test page, not the production frontend.
- Both entrypoints (`app.py`, `main.py`) now behave consistently by design.
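A client can consume the `/chat` SSE stream documented above by parsing `data:` lines until the done event. The event shapes (`{"type":"token"}`, `{"type":"done"}`) come from this README; the parser below is an illustrative sketch, and `iter_tokens` is a hypothetical helper, not part of this repo.

```python
import json


def iter_tokens(sse_lines):
    """Yield token text from `data: ...` SSE lines until the done event."""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and SSE comments
        event = json.loads(line[len("data:"):].strip())
        if event.get("type") == "done":
            return
        if event.get("type") == "token":
            yield event["content"]


# Raw lines as the server would emit them:
lines = [
    'data: {"type":"token","content":"Hel"}',
    '',
    'data: {"type":"token","content":"lo"}',
    'data: {"type":"done","content":""}',
]
print("".join(iter_tokens(lines)))  # -> Hello
```

In practice the lines would come from an HTTP client reading the response body incrementally (for example, a streaming POST to `/chat` with `stream: true` in the body); only the parsing logic is shown here.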