---
title: React
emoji: 🌍
colorFrom: yellow
colorTo: gray
sdk: gradio
sdk_version: 6.9.0
app_file: app.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference


# HF Space Backend (Streaming LLM Server)

This folder contains Hugging Face Space backends for two model deployments that share the same production runtime.

## Files
- `app.py`: Nanbeige deployment entrypoint (`Nanbeige/Nanbeige4.1-3B`)
- `main.py`: LiquidAI deployment entrypoint (`LiquidAI/LFM2.5-1.2B-Thinking`)
- `server_runtime.py`: shared queue + worker + streaming runtime used by both entrypoints
- `index.html`: lightweight local streaming test UI
- `requirements.txt`: runtime dependencies

## Runtime Architecture
Both servers use the same execution flow:

Client Request  
-> FastAPI `/chat`  
-> `asyncio.Queue` request buffer  
-> worker pool (`asyncio` tasks)  
-> concurrency gate (`asyncio.Semaphore`)  
-> one generation thread per request (`model.generate`)  
-> per-request `TextIteratorStreamer`  
-> SSE token stream to client

### Why this structure
- Keeps the event loop responsive.
- Prevents response mixing across users (each request owns its own state objects).
- Supports controlled concurrency on both CPU and GPU.
- Queues excess load during bursts instead of failing hard.
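The queue -> worker pool -> semaphore flow above can be sketched in plain `asyncio`. This is an illustrative stand-in, not the server code: `handle_request` is a placeholder name, and an `await asyncio.sleep(0)` stands in for the per-request `model.generate` thread.

```python
import asyncio

async def demo():
    queue: asyncio.Queue = asyncio.Queue(maxsize=512)  # request buffer
    gate = asyncio.Semaphore(2)                        # concurrency gate
    results = []

    async def handle_request(req_id: int) -> None:
        async with gate:                # limit how many generations run at once
            await asyncio.sleep(0)      # stand-in for the real generation work
            results.append(req_id)

    async def worker() -> None:
        while True:
            req_id = await queue.get()
            try:
                await handle_request(req_id)
            finally:
                queue.task_done()

    workers = [asyncio.create_task(worker()) for _ in range(3)]
    for i in range(5):                  # enqueue five requests
        await queue.put(i)
    await queue.join()                  # wait until the queue drains
    for w in workers:
        w.cancel()
    return sorted(results)

print(asyncio.run(demo()))  # → [0, 1, 2, 3, 4]
```

Bursts beyond the semaphore's limit simply wait in the queue rather than erroring, which is the behavior the bullets above describe.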

## Concurrency
Hardware-aware worker count:
- CPU: `1..4` workers (core-based)
- GPU: `3..5` workers (VRAM tier-based)

Override at runtime:
- `HF_MAX_WORKERS`

Queue settings:
- `HF_QUEUE_MAX_SIZE` (default: `512`)
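A hardware-aware worker count along these lines might look like the sketch below. The `HF_MAX_WORKERS` override and the `1..4` / `3..5` ranges come from this README; the exact core and VRAM tier boundaries are assumptions, not the server's actual thresholds.

```python
import os
from typing import Optional

def pick_worker_count(cpu_cores: int, gpu_vram_gb: Optional[float]) -> int:
    """Return a worker count in the 1..4 (CPU) or 3..5 (GPU) range.
    Tier boundaries here are illustrative."""
    override = os.environ.get("HF_MAX_WORKERS")
    if override:
        return int(override)            # explicit runtime override wins
    if gpu_vram_gb is None:
        return max(1, min(4, cpu_cores // 2))  # CPU: core-based, clamped to 1..4
    if gpu_vram_gb >= 40:
        return 5                        # high-VRAM tier
    if gpu_vram_gb >= 16:
        return 4                        # mid-VRAM tier
    return 3                            # low-VRAM tier

print(pick_worker_count(8, None), pick_worker_count(4, 24.0))
```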

## Thread Lifecycle and Safety
- Each request gets its own generation thread.
- Each request has a cancellation event.
- `CancelAwareStoppingCriteria` stops generation when client disconnects/cancels.
- Streamer is explicitly ended in `finally` block.
- Generation thread is joined with a long timeout (`HF_GENERATION_JOIN_TIMEOUT_SECONDS`, default `180`) to avoid orphaned work.

This replaces the earlier short join timeout, which frequently produced:
- `Generation thread did not finish within timeout`
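The cancellation hook can be illustrated with a minimal stand-in for `CancelAwareStoppingCriteria`. It assumes the `transformers` `StoppingCriteria` call shape (`__call__(input_ids, scores) -> bool`) but skips the import so the pattern stands alone:

```python
import threading

class CancelAwareStoppingCriteria:
    """Stops generation once the request's cancellation event is set."""

    def __init__(self, cancel_event: threading.Event):
        self.cancel_event = cancel_event

    def __call__(self, input_ids=None, scores=None, **kwargs) -> bool:
        # Returning True tells generate() to stop at the next decoding step.
        return self.cancel_event.is_set()

cancel = threading.Event()
crit = CancelAwareStoppingCriteria(cancel)
print(crit())   # → False  (client still connected, keep generating)
cancel.set()    # client disconnected or cancelled
print(crit())   # → True
```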

## Metrics and Logging
Per request logs include:
- request queued
- worker start/end
- first token latency
- generated token count
- tokens/sec
- active workers
- queue size

Debug token-by-token logging is optional:
- `HF_DEBUG_TOKEN_LOGS=1`

## API
### `POST /chat`
Body:
- `messages`: chat messages
- `stream`: `true` for SSE streaming
- `max_tokens`: max new tokens requested
- `temperature`: optional; the model default is used if omitted
- `tools`: optional tool schemas for chat template

Streaming response format:
- SSE `data: {"type":"token","content":"..."}` chunks
- final `{"type":"done","content":""}` event
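A client can reassemble the documented SSE events with a small parser like this sketch; the sample lines below are illustrative but follow the `{"type":"token"}` / `{"type":"done"}` shape described above.

```python
import json

def parse_sse_tokens(lines):
    """Concatenate token contents from SSE data lines until the done event."""
    text = []
    for line in lines:
        if not line.startswith("data: "):
            continue
        event = json.loads(line[len("data: "):])
        if event["type"] == "token":
            text.append(event["content"])
        elif event["type"] == "done":
            break
    return "".join(text)

sample = [
    'data: {"type":"token","content":"Hel"}',
    'data: {"type":"token","content":"lo"}',
    'data: {"type":"done","content":""}',
]
print(parse_sse_tokens(sample))  # → Hello
```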

### `GET /health`
Returns:
- `status`
- `model_loaded`
- `device`
- `active_workers`
- `queue_size`
- `max_workers`
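An illustrative response body (field values are examples, not guaranteed outputs):

```json
{
  "status": "ok",
  "model_loaded": true,
  "device": "cuda",
  "active_workers": 2,
  "queue_size": 0,
  "max_workers": 4
}
```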

### `GET /index`
Serves `index.html` test page.

## Model-Specific Settings
### `app.py` (Nanbeige4.1-3B)
- `max_input_tokens=32768`
- `eos_token_id=166101`
- `default_temperature=0.6`
- `top_p=0.95`
- `repetition_penalty=1.0`
- `tokenizer_use_fast=False`

### `main.py` (LFM2.5-1.2B-Thinking)
- `max_input_tokens=32768`
- `default_temperature=0.1`
- `top_p=0.1`
- `top_k=50`
- `repetition_penalty=1.05`
- `eos_token_id` from tokenizer config

## Environment Variables
- `HF_MAX_WORKERS`
- `HF_QUEUE_MAX_SIZE`
- `HF_STREAMER_TIMEOUT_SECONDS`
- `HF_GENERATION_JOIN_TIMEOUT_SECONDS`
- `HF_MAX_INPUT_TOKENS`
- `HF_MAX_NEW_TOKENS`
- `HF_DEBUG_TOKEN_LOGS`

## Model Documentation References
### Nanbeige / `app.py`
- https://huggingface.co/Nanbeige/Nanbeige4.1-3B
- https://huggingface.co/Nanbeige/Nanbeige4.1-3B/blob/main/README.md
- https://huggingface.co/Nanbeige/Nanbeige4.1-3B/blob/main/Nanbeige4.1-3B-Report.pdf
- https://huggingface.co/Nanbeige/Nanbeige4.1-3B/blob/main/generation_config.json
- https://huggingface.co/Nanbeige/Nanbeige4.1-3B/blob/main/config.json

### LiquidAI / `main.py`
- https://huggingface.co/LiquidAI/LFM2.5-1.2B-Thinking
- https://huggingface.co/LiquidAI/LFM2.5-1.2B-Thinking/blob/main/README.md
- https://huggingface.co/LiquidAI/LFM2.5-1.2B-Thinking/blob/main/chat_template.jinja
- https://huggingface.co/LiquidAI/LFM2.5-1.2B-Thinking/blob/main/config.json
- https://docs.liquid.ai/lfm/key-concepts/chat-template
- https://docs.liquid.ai/lfm/key-concepts/text-generation-and-prompting
- https://docs.liquid.ai/lfm/key-concepts/tool-use
- https://huggingface.co/docs/transformers/en/chat_templating#using-applychattemplate

## Notes
- Model is loaded once per process during FastAPI lifespan startup.
- `index.html` is intentionally a simple streaming test page, not the production frontend.
- Both entrypoints (`app.py`, `main.py`) now behave consistently by design.