---
title: React
emoji: 🌍
colorFrom: yellow
colorTo: gray
sdk: gradio
sdk_version: 6.9.0
app_file: app.py
pinned: false
---
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# HF Space Backend (Streaming LLM Server)
This folder contains Hugging Face Space backends for two model deployments that share the same production runtime.
## Files
- `app.py`: Nanbeige deployment entrypoint (`Nanbeige/Nanbeige4.1-3B`)
- `main.py`: LiquidAI deployment entrypoint (`LiquidAI/LFM2.5-1.2B-Thinking`)
- `server_runtime.py`: shared queue + worker + streaming runtime used by both entrypoints
- `index.html`: lightweight local streaming test UI
- `requirements.txt`: runtime dependencies
## Runtime Architecture
Both servers use the same execution flow:
```
Client Request
  -> FastAPI /chat
  -> asyncio.Queue request buffer
  -> worker pool (asyncio tasks)
  -> concurrency gate (asyncio.Semaphore)
  -> one generation thread per request (model.generate)
  -> per-request TextIteratorStreamer
  -> SSE token stream to client
```
### Why this structure
- Keeps the event loop responsive.
- Prevents response mixing across users (isolated request objects).
- Supports controlled concurrency under CPU/GPU.
- Buffers burst load in the queue instead of failing hard.
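The flow above can be sketched with stdlib `asyncio` alone. This is a minimal illustration, not the production runtime: `handle_request` is a stand-in for the real generation step, and `MAX_WORKERS` is hard-coded where the real server derives it from hardware.

```python
import asyncio

MAX_WORKERS = 3        # illustrative; the runtime picks this from hardware
QUEUE_MAX_SIZE = 512   # matches the documented HF_QUEUE_MAX_SIZE default

async def handle_request(request: dict, results: list) -> None:
    # Stand-in for the real generation step (model.generate + streamer).
    await asyncio.sleep(0)
    results.append(request["id"])

async def run(requests: list) -> list:
    queue = asyncio.Queue(maxsize=QUEUE_MAX_SIZE)
    gate = asyncio.Semaphore(MAX_WORKERS)
    results = []

    async def worker() -> None:
        while True:
            request = await queue.get()
            async with gate:  # at most MAX_WORKERS generations in flight
                await handle_request(request, results)
            queue.task_done()

    workers = [asyncio.create_task(worker()) for _ in range(MAX_WORKERS)]
    for r in requests:
        await queue.put(r)   # bursts wait in the queue instead of erroring
    await queue.join()       # drain everything that was accepted
    for w in workers:
        w.cancel()
    return results

# Six requests drain through three workers.
processed = asyncio.run(run([{"id": i} for i in range(6)]))
```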
## Concurrency
Hardware-aware worker count:
- CPU: `1..4` workers, scaled by core count
- GPU: `3..5` workers, scaled by VRAM tier
Override at runtime:
- `HF_MAX_WORKERS`
Queue settings:
- `HF_QUEUE_MAX_SIZE` (default: `512`)
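A heuristic matching the documented ranges might look like the sketch below. The tier thresholds are illustrative assumptions; the actual logic lives in `server_runtime.py`.

```python
import os

def pick_worker_count(cpu_cores: int, gpu_vram_gb=None) -> int:
    # HF_MAX_WORKERS always wins when set.
    override = os.environ.get("HF_MAX_WORKERS")
    if override:
        return max(1, int(override))
    if gpu_vram_gb is not None:
        # GPU: 3..5 workers by VRAM tier (thresholds are assumptions)
        if gpu_vram_gb >= 40:
            return 5
        if gpu_vram_gb >= 16:
            return 4
        return 3
    # CPU: 1..4 workers based on core count
    return min(4, max(1, cpu_cores // 2))
```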
## Thread Lifecycle and Safety
- Each request gets its own generation thread.
- Each request has a cancellation event.
- `CancelAwareStoppingCriteria` stops generation when client disconnects/cancels.
- Streamer is explicitly ended in `finally` block.
- Generation thread is joined with a long timeout (`HF_GENERATION_JOIN_TIMEOUT_SECONDS`, default `180`) to avoid orphaned work.
This replaces the earlier short join timeout, which frequently produced:
- `Generation thread did not finish within timeout`
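The cancellation mechanism can be sketched as follows. In the real runtime the criteria subclasses `transformers.StoppingCriteria` and `model.generate` polls it after each token; here `fake_generate` stands in for that loop so the pattern is self-contained.

```python
import threading

class CancelAwareStoppingCriteria:
    # Sketch of the cancellation hook: generate() polls __call__ after
    # each token, and a True return stops generation.
    def __init__(self, cancel_event: threading.Event) -> None:
        self.cancel_event = cancel_event

    def __call__(self, input_ids=None, scores=None, **kwargs) -> bool:
        return self.cancel_event.is_set()

def fake_generate(stop: CancelAwareStoppingCriteria, out: list) -> None:
    # Stand-in for the per-request generation thread.
    for token in range(1000):
        if stop(None, None):
            break
        out.append(token)
        if token == 4:  # simulate the client disconnecting mid-stream
            stop.cancel_event.set()

cancel = threading.Event()
tokens = []
thread = threading.Thread(
    target=fake_generate,
    args=(CancelAwareStoppingCriteria(cancel), tokens),
)
thread.start()
thread.join(timeout=180)  # long join, mirroring HF_GENERATION_JOIN_TIMEOUT_SECONDS
```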
## Metrics and Logging
Per request logs include:
- request queued
- worker start/end
- first token latency
- generated token count
- tokens/sec
- active workers
- queue size
Debug token-by-token logging is optional:
- `HF_DEBUG_TOKEN_LOGS=1`
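The derived metrics above reduce to simple arithmetic over three timestamps; this helper (names illustrative, not the runtime's API) shows the calculation:

```python
def request_metrics(queued_at: float, first_token_at: float,
                    finished_at: float, token_count: int) -> dict:
    # first-token latency counts queue wait + prefill; tokens/sec
    # covers only the streaming phase after the first token.
    gen_seconds = finished_at - first_token_at
    return {
        "first_token_latency_s": round(first_token_at - queued_at, 3),
        "generated_tokens": token_count,
        "tokens_per_sec": round(token_count / gen_seconds, 1) if gen_seconds > 0 else 0.0,
    }
```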
## API
### `POST /chat`
Body:
- `messages`: chat messages
- `stream`: `true` for SSE streaming
- `max_tokens`: max new tokens requested
- `temperature`: optional; if omitted model default is used
- `tools`: optional tool schemas for chat template
Streaming response format:
- SSE `data: {"type":"token","content":"..."}` chunks
- final `{"type":"done","content":""}` event
### `GET /health`
Returns:
- `status`
- `model_loaded`
- `device`
- `active_workers`
- `queue_size`
- `max_workers`
### `GET /index`
Serves `index.html` test page.
## Model-Specific Settings
### `app.py` (Nanbeige4.1-3B)
- `max_input_tokens=32768`
- `eos_token_id=166101`
- `default_temperature=0.6`
- `top_p=0.95`
- `repetition_penalty=1.0`
- `tokenizer_use_fast=False`
### `main.py` (LFM2.5-1.2B-Thinking)
- `max_input_tokens=32768`
- `default_temperature=0.1`
- `top_p=0.1`
- `top_k=50`
- `repetition_penalty=1.05`
- `eos_token_id` from tokenizer config
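The two settings blocks above differ only in values, so they can be captured in one structure. This dataclass is illustrative (field names are assumptions, not the runtime's actual config object); `eos_token_id=None` encodes "read from tokenizer config":

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class GenerationSettings:
    max_input_tokens: int
    default_temperature: float
    top_p: float
    repetition_penalty: float
    top_k: Optional[int] = None
    eos_token_id: Optional[int] = None  # None -> taken from tokenizer config
    tokenizer_use_fast: bool = True

NANBEIGE = GenerationSettings(
    max_input_tokens=32768, default_temperature=0.6, top_p=0.95,
    repetition_penalty=1.0, eos_token_id=166101, tokenizer_use_fast=False,
)

LFM = GenerationSettings(
    max_input_tokens=32768, default_temperature=0.1, top_p=0.1,
    repetition_penalty=1.05, top_k=50,
)
```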
## Environment Variables
- `HF_MAX_WORKERS`
- `HF_QUEUE_MAX_SIZE`
- `HF_STREAMER_TIMEOUT_SECONDS`
- `HF_GENERATION_JOIN_TIMEOUT_SECONDS`
- `HF_MAX_INPUT_TOKENS`
- `HF_MAX_NEW_TOKENS`
- `HF_DEBUG_TOKEN_LOGS`
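The integer-valued settings above share one pattern: unset or empty falls back to the documented default. A sketch of that lookup (helper name is illustrative):

```python
import os

def env_int(name: str, default: int) -> int:
    # Unset or empty environment variable -> documented default.
    raw = os.environ.get(name)
    return int(raw) if raw else default

# Example lookups using defaults stated in this README:
queue_max = env_int("HF_QUEUE_MAX_SIZE", 512)
join_timeout = env_int("HF_GENERATION_JOIN_TIMEOUT_SECONDS", 180)
debug_tokens = os.environ.get("HF_DEBUG_TOKEN_LOGS") == "1"
```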
## Model Documentation References
### Nanbeige / `app.py`
- https://huggingface.co/Nanbeige/Nanbeige4.1-3B
- https://huggingface.co/Nanbeige/Nanbeige4.1-3B/blob/main/README.md
- https://huggingface.co/Nanbeige/Nanbeige4.1-3B/blob/main/Nanbeige4.1-3B-Report.pdf
- https://huggingface.co/Nanbeige/Nanbeige4.1-3B/blob/main/generation_config.json
- https://huggingface.co/Nanbeige/Nanbeige4.1-3B/blob/main/config.json
### LiquidAI / `main.py`
- https://huggingface.co/LiquidAI/LFM2.5-1.2B-Thinking
- https://huggingface.co/LiquidAI/LFM2.5-1.2B-Thinking/blob/main/README.md
- https://huggingface.co/LiquidAI/LFM2.5-1.2B-Thinking/blob/main/chat_template.jinja
- https://huggingface.co/LiquidAI/LFM2.5-1.2B-Thinking/blob/main/config.json
- https://docs.liquid.ai/lfm/key-concepts/chat-template
- https://docs.liquid.ai/lfm/key-concepts/text-generation-and-prompting
- https://docs.liquid.ai/lfm/key-concepts/tool-use
- https://huggingface.co/docs/transformers/en/chat_templating#using-applychattemplate
## Notes
- Model is loaded once per process during FastAPI lifespan startup.
- `index.html` is intentionally a simple streaming test page, not the production frontend.
- Both entrypoints (`app.py`, `main.py`) now behave consistently by design.