---
title: React
emoji: 🌍
colorFrom: yellow
colorTo: gray
sdk: gradio
sdk_version: 6.9.0
app_file: app.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference


# HF Space Backend (Streaming LLM Server)

This folder contains Hugging Face Space backends for two model deployments that share the same production runtime.

## Files
- `app.py`: Nanbeige deployment entrypoint (`Nanbeige/Nanbeige4.1-3B`)
- `main.py`: LiquidAI deployment entrypoint (`LiquidAI/LFM2.5-1.2B-Thinking`)
- `server_runtime.py`: shared queue + worker + streaming runtime used by both entrypoints
- `index.html`: lightweight local streaming test UI
- `requirements.txt`: runtime dependencies

## Runtime Architecture
Both servers use the same execution flow:

Client Request  
-> FastAPI `/chat`  
-> `asyncio.Queue` request buffer  
-> worker pool (`asyncio` tasks)  
-> concurrency gate (`asyncio.Semaphore`)  
-> one generation thread per request (`model.generate`)  
-> per-request `TextIteratorStreamer`  
-> SSE token stream to client

### Why this structure
- Keeps the event loop responsive.
- Prevents response mixing across users (each request owns its own state objects).
- Supports controlled concurrency on both CPU and GPU.
- Queues excess load during bursts instead of failing hard.
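The queue -> worker pool -> semaphore flow above can be sketched in plain `asyncio`. This is an illustrative stand-in, not the server code: `handle_request` is a placeholder name, and an `await asyncio.sleep(0)` stands in for the per-request `model.generate` thread.

```python
import asyncio

async def demo():
    queue: asyncio.Queue = asyncio.Queue(maxsize=512)  # request buffer
    gate = asyncio.Semaphore(2)                        # concurrency gate
    results = []

    async def handle_request(req_id: int) -> None:
        async with gate:                # limit how many generations run at once
            await asyncio.sleep(0)      # stand-in for the real generation work
            results.append(req_id)

    async def worker() -> None:
        while True:
            req_id = await queue.get()
            try:
                await handle_request(req_id)
            finally:
                queue.task_done()

    workers = [asyncio.create_task(worker()) for _ in range(3)]
    for i in range(5):                  # enqueue five requests
        await queue.put(i)
    await queue.join()                  # wait until the queue drains
    for w in workers:
        w.cancel()
    return sorted(results)

print(asyncio.run(demo()))  # → [0, 1, 2, 3, 4]
```

Bursts beyond the semaphore's limit simply wait in the queue rather than erroring, which is the behavior the bullets above describe.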

## Concurrency
Hardware-aware worker count:
- CPU: `1..4` workers (core-based)
- GPU: `3..5` workers (VRAM tier-based)

Override at runtime:
- `HF_MAX_WORKERS`

Queue settings:
- `HF_QUEUE_MAX_SIZE` (default: `512`)
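A hardware-aware worker count along these lines might look like the sketch below. The `HF_MAX_WORKERS` override and the `1..4` / `3..5` ranges come from this README; the exact core and VRAM tier boundaries are assumptions, not the server's actual thresholds.

```python
import os
from typing import Optional

def pick_worker_count(cpu_cores: int, gpu_vram_gb: Optional[float]) -> int:
    """Return a worker count in the 1..4 (CPU) or 3..5 (GPU) range.
    Tier boundaries here are illustrative."""
    override = os.environ.get("HF_MAX_WORKERS")
    if override:
        return int(override)            # explicit runtime override wins
    if gpu_vram_gb is None:
        return max(1, min(4, cpu_cores // 2))  # CPU: core-based, clamped to 1..4
    if gpu_vram_gb >= 40:
        return 5                        # high-VRAM tier
    if gpu_vram_gb >= 16:
        return 4                        # mid-VRAM tier
    return 3                            # low-VRAM tier

print(pick_worker_count(8, None), pick_worker_count(4, 24.0))
```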

## Thread Lifecycle and Safety
- Each request gets its own generation thread.
- Each request has a cancellation event.
- `CancelAwareStoppingCriteria` stops generation when client disconnects/cancels.
- Streamer is explicitly ended in `finally` block.
- Generation thread is joined with a long timeout (`HF_GENERATION_JOIN_TIMEOUT_SECONDS`, default `180`) to avoid orphaned work.

This replaces the earlier short join timeout, which frequently produced:
- `Generation thread did not finish within timeout`
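The cancellation hook can be illustrated with a minimal stand-in for `CancelAwareStoppingCriteria`. It assumes the `transformers` `StoppingCriteria` call shape (`__call__(input_ids, scores) -> bool`) but skips the import so the pattern stands alone:

```python
import threading

class CancelAwareStoppingCriteria:
    """Stops generation once the request's cancellation event is set."""

    def __init__(self, cancel_event: threading.Event):
        self.cancel_event = cancel_event

    def __call__(self, input_ids=None, scores=None, **kwargs) -> bool:
        # Returning True tells generate() to stop at the next decoding step.
        return self.cancel_event.is_set()

cancel = threading.Event()
crit = CancelAwareStoppingCriteria(cancel)
print(crit())   # → False  (client still connected, keep generating)
cancel.set()    # client disconnected or cancelled
print(crit())   # → True
```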

## Metrics and Logging
Per request logs include:
- request queued
- worker start/end
- first token latency
- generated token count
- tokens/sec
- active workers
- queue size

Debug token-by-token logging is optional:
- `HF_DEBUG_TOKEN_LOGS=1`

## API
### `POST /chat`
Body:
- `messages`: chat messages
- `stream`: `true` for SSE streaming
- `max_tokens`: max new tokens requested
- `temperature`: optional; the model default is used if omitted
- `tools`: optional tool schemas for chat template

Streaming response format:
- SSE `data: {"type":"token","content":"..."}` chunks
- final `{"type":"done","content":""}` event
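A client can reassemble the documented SSE events with a small parser like this sketch; the sample lines below are illustrative but follow the `{"type":"token"}` / `{"type":"done"}` shape described above.

```python
import json

def parse_sse_tokens(lines):
    """Concatenate token contents from SSE data lines until the done event."""
    text = []
    for line in lines:
        if not line.startswith("data: "):
            continue
        event = json.loads(line[len("data: "):])
        if event["type"] == "token":
            text.append(event["content"])
        elif event["type"] == "done":
            break
    return "".join(text)

sample = [
    'data: {"type":"token","content":"Hel"}',
    'data: {"type":"token","content":"lo"}',
    'data: {"type":"done","content":""}',
]
print(parse_sse_tokens(sample))  # → Hello
```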

### `GET /health`
Returns:
- `status`
- `model_loaded`
- `device`
- `active_workers`
- `queue_size`
- `max_workers`
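An illustrative response body (field values are examples, not guaranteed outputs):

```json
{
  "status": "ok",
  "model_loaded": true,
  "device": "cuda",
  "active_workers": 2,
  "queue_size": 0,
  "max_workers": 4
}
```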

### `GET /index`
Serves `index.html` test page.

## Model-Specific Settings
### `app.py` (Nanbeige4.1-3B)
- `max_input_tokens=32768`
- `eos_token_id=166101`
- `default_temperature=0.6`
- `top_p=0.95`
- `repetition_penalty=1.0`
- `tokenizer_use_fast=False`

### `main.py` (LFM2.5-1.2B-Thinking)
- `max_input_tokens=32768`
- `default_temperature=0.1`
- `top_p=0.1`
- `top_k=50`
- `repetition_penalty=1.05`
- `eos_token_id` from tokenizer config

## Environment Variables
- `HF_MAX_WORKERS`
- `HF_QUEUE_MAX_SIZE`
- `HF_STREAMER_TIMEOUT_SECONDS`
- `HF_GENERATION_JOIN_TIMEOUT_SECONDS`
- `HF_MAX_INPUT_TOKENS`
- `HF_MAX_NEW_TOKENS`
- `HF_DEBUG_TOKEN_LOGS`

## Model Documentation References
### Nanbeige / `app.py`
- https://huggingface.co/Nanbeige/Nanbeige4.1-3B
- https://huggingface.co/Nanbeige/Nanbeige4.1-3B/blob/main/README.md
- https://huggingface.co/Nanbeige/Nanbeige4.1-3B/blob/main/Nanbeige4.1-3B-Report.pdf
- https://huggingface.co/Nanbeige/Nanbeige4.1-3B/blob/main/generation_config.json
- https://huggingface.co/Nanbeige/Nanbeige4.1-3B/blob/main/config.json

### LiquidAI / `main.py`
- https://huggingface.co/LiquidAI/LFM2.5-1.2B-Thinking
- https://huggingface.co/LiquidAI/LFM2.5-1.2B-Thinking/blob/main/README.md
- https://huggingface.co/LiquidAI/LFM2.5-1.2B-Thinking/blob/main/chat_template.jinja
- https://huggingface.co/LiquidAI/LFM2.5-1.2B-Thinking/blob/main/config.json
- https://docs.liquid.ai/lfm/key-concepts/chat-template
- https://docs.liquid.ai/lfm/key-concepts/text-generation-and-prompting
- https://docs.liquid.ai/lfm/key-concepts/tool-use
- https://huggingface.co/docs/transformers/en/chat_templating#using-applychattemplate

## Notes
- Model is loaded once per process during FastAPI lifespan startup.
- `index.html` is intentionally a simple streaming test page, not the production frontend.
- Both entrypoints (`app.py`, `main.py`) now behave consistently by design.