---
title: React
emoji: 🌍
colorFrom: yellow
colorTo: gray
sdk: gradio
sdk_version: 6.9.0
app_file: app.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# HF Space Backend (Streaming LLM Server)

This folder contains the Hugging Face Space backends for two model deployments that share the same production runtime.

## Files
- `app.py`: Nanbeige deployment entrypoint (`Nanbeige/Nanbeige4.1-3B`)
- `main.py`: LiquidAI deployment entrypoint (`LiquidAI/LFM2.5-1.2B-Thinking`)
- `server_runtime.py`: shared queue + worker + streaming runtime used by both entrypoints
- `index.html`: lightweight local streaming test UI
- `requirements.txt`: runtime dependencies

## Runtime Architecture
Both servers use the same execution flow:

```
Client Request
  -> FastAPI /chat
  -> asyncio.Queue request buffer
  -> worker pool (asyncio tasks)
  -> concurrency gate (asyncio.Semaphore)
  -> one generation thread per request (model.generate)
  -> per-request TextIteratorStreamer
  -> SSE token stream to client
```

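The queue / worker / semaphore part of this flow can be sketched with stdlib `asyncio` alone. This is a minimal illustration of the pattern, not the actual `server_runtime.py` code; `QUEUE_MAX_SIZE`, `MAX_WORKERS`, and `handle_request` are placeholders:

```python
import asyncio

QUEUE_MAX_SIZE = 8   # stands in for HF_QUEUE_MAX_SIZE
MAX_WORKERS = 2      # stands in for the hardware-derived worker count

async def handle_request(prompt: str, gate: asyncio.Semaphore) -> str:
    # The semaphore caps how many generations run at once.
    async with gate:
        await asyncio.sleep(0)  # stands in for the real generation work
        return f"echo:{prompt}"

async def worker(q: asyncio.Queue, gate: asyncio.Semaphore, results: list) -> None:
    while True:
        prompt = await q.get()
        if prompt is None:      # shutdown sentinel
            q.task_done()
            return
        results.append(await handle_request(prompt, gate))
        q.task_done()

async def main() -> list:
    q: asyncio.Queue = asyncio.Queue(maxsize=QUEUE_MAX_SIZE)
    gate = asyncio.Semaphore(MAX_WORKERS)
    results: list = []
    workers = [asyncio.create_task(worker(q, gate, results)) for _ in range(MAX_WORKERS)]
    for prompt in ("a", "b", "c"):
        await q.put(prompt)     # requests buffer here during bursts
    await q.join()
    for _ in workers:
        await q.put(None)
    await asyncio.gather(*workers)
    return results

print(asyncio.run(main()))
```

Requests that arrive while all workers are busy simply wait in the queue instead of failing, which is the behavior described below.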
### Why this structure
- Keeps the event loop responsive.
- Prevents response mixing across users (each request has isolated state).
- Supports controlled concurrency on CPU or GPU.
- Queues overflow load instead of failing hard during bursts.

## Concurrency
Hardware-aware worker count:
- CPU: `1..4` workers (core-based)
- GPU: `3..5` workers (VRAM tier-based)

Override at runtime:
- `HF_MAX_WORKERS`

Queue settings:
- `HF_QUEUE_MAX_SIZE` (default: `512`)

## Thread Lifecycle and Safety
- Each request gets its own generation thread.
- Each request has its own cancellation event.
- `CancelAwareStoppingCriteria` stops generation when the client disconnects or cancels.
- The streamer is explicitly ended in a `finally` block.
- The generation thread is joined with a long timeout (`HF_GENERATION_JOIN_TIMEOUT_SECONDS`, default `180`) to avoid orphaned work.

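A stdlib-only sketch of this per-request pattern. The real code uses transformers' `StoppingCriteria` and `TextIteratorStreamer`; here a plain token loop, a `threading.Event`, and a `queue.Queue` stand in for them:

```python
import queue
import threading

JOIN_TIMEOUT_SECONDS = 5  # stands in for HF_GENERATION_JOIN_TIMEOUT_SECONDS

def generate(tokens, cancel: threading.Event, out: queue.Queue) -> None:
    """Stand-in for the generation thread; the cancel check mirrors what a
    cancel-aware stopping criteria does between generation steps."""
    try:
        for tok in tokens:
            if cancel.is_set():
                break
            out.put(tok)
    finally:
        out.put(None)  # the streamer is always ended, even on error or cancel

def run_request(tokens, cancelled: bool = False):
    cancel = threading.Event()
    if cancelled:
        cancel.set()  # simulate a client that already disconnected
    out: queue.Queue = queue.Queue()
    thread = threading.Thread(target=generate, args=(tokens, cancel, out))
    thread.start()
    received = []
    while (tok := out.get()) is not None:
        received.append(tok)
    thread.join(timeout=JOIN_TIMEOUT_SECONDS)  # long join avoids orphaned work
    return received

print(run_request(["a", "b", "c"]))                 # ['a', 'b', 'c']
print(run_request(["a", "b", "c"], cancelled=True)) # []
```

The `finally` block guarantees the consumer's `out.get()` loop terminates even when generation is cancelled or raises, which is the same reason the production streamer is ended in `finally`.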
This replaces the old short-join behavior, which frequently produced:
- `Generation thread did not finish within timeout`

## Metrics and Logging
Per-request logs include:
- request queued
- worker start/end
- first-token latency
- generated token count
- tokens/sec
- active workers
- queue size

Token-by-token debug logging is optional:
- `HF_DEBUG_TOKEN_LOGS=1`

## API
### `POST /chat`
Body:
- `messages`: chat messages
- `stream`: `true` for SSE streaming
- `max_tokens`: maximum number of new tokens to generate
- `temperature`: optional; if omitted, the model default is used
- `tools`: optional tool schemas passed to the chat template

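An illustrative request body (the field values are examples, not defaults):

```json
{
  "messages": [
    {"role": "user", "content": "Hello"}
  ],
  "stream": true,
  "max_tokens": 256,
  "temperature": 0.6
}
```
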
Streaming response format:
- SSE `data: {"type":"token","content":"..."}` chunks
- a final `{"type":"done","content":""}` event

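A minimal client-side sketch of consuming this format, assuming each event arrives as a complete `data: …` line (stdlib only; `parse_sse_lines` is a hypothetical helper, not part of this repo):

```python
import json

def parse_sse_lines(lines):
    """Yield token strings from `data: {...}` SSE lines until the done event."""
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip keep-alives / blank separator lines
        event = json.loads(line[len("data: "):])
        if event["type"] == "done":
            return
        if event["type"] == "token":
            yield event["content"]

stream = [
    'data: {"type":"token","content":"Hel"}',
    'data: {"type":"token","content":"lo"}',
    'data: {"type":"done","content":""}',
]
print("".join(parse_sse_lines(stream)))  # Hello
```
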
### `GET /health`
Returns:
- `status`
- `model_loaded`
- `device`
- `active_workers`
- `queue_size`
- `max_workers`

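An illustrative response shape (field values are examples; the exact `status` and `device` strings may differ):

```json
{
  "status": "ok",
  "model_loaded": true,
  "device": "cuda",
  "active_workers": 2,
  "queue_size": 0,
  "max_workers": 4
}
```
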
### `GET /index`
Serves the `index.html` test page.

## Model-Specific Settings
### `app.py` (Nanbeige4.1-3B)
- `max_input_tokens=32768`
- `eos_token_id=166101`
- `default_temperature=0.6`
- `top_p=0.95`
- `repetition_penalty=1.0`
- `tokenizer_use_fast=False`

### `main.py` (LFM2.5-1.2B-Thinking)
- `max_input_tokens=32768`
- `default_temperature=0.1`
- `top_p=0.1`
- `top_k=50`
- `repetition_penalty=1.05`
- `eos_token_id` taken from the tokenizer config

## Environment Variables
- `HF_MAX_WORKERS`
- `HF_QUEUE_MAX_SIZE`
- `HF_STREAMER_TIMEOUT_SECONDS`
- `HF_GENERATION_JOIN_TIMEOUT_SECONDS`
- `HF_MAX_INPUT_TOKENS`
- `HF_MAX_NEW_TOKENS`
- `HF_DEBUG_TOKEN_LOGS`

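A minimal sketch of reading these settings with fallbacks (the `env_int` helper is hypothetical; the `512` and `180` defaults are the ones documented above):

```python
import os

def env_int(name: str, default: int) -> int:
    """Read an integer setting from the environment, falling back to a default."""
    raw = os.environ.get(name)
    return int(raw) if raw else default

queue_max_size = env_int("HF_QUEUE_MAX_SIZE", 512)           # documented default
join_timeout = env_int("HF_GENERATION_JOIN_TIMEOUT_SECONDS", 180)
debug_tokens = os.environ.get("HF_DEBUG_TOKEN_LOGS") == "1"  # flag, not an int
print(queue_max_size, join_timeout, debug_tokens)
```
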
## Model Documentation References
### Nanbeige / `app.py`
- https://huggingface.co/Nanbeige/Nanbeige4.1-3B
- https://huggingface.co/Nanbeige/Nanbeige4.1-3B/blob/main/README.md
- https://huggingface.co/Nanbeige/Nanbeige4.1-3B/blob/main/Nanbeige4.1-3B-Report.pdf
- https://huggingface.co/Nanbeige/Nanbeige4.1-3B/blob/main/generation_config.json
- https://huggingface.co/Nanbeige/Nanbeige4.1-3B/blob/main/config.json

### LiquidAI / `main.py`
- https://huggingface.co/LiquidAI/LFM2.5-1.2B-Thinking
- https://huggingface.co/LiquidAI/LFM2.5-1.2B-Thinking/blob/main/README.md
- https://huggingface.co/LiquidAI/LFM2.5-1.2B-Thinking/blob/main/chat_template.jinja
- https://huggingface.co/LiquidAI/LFM2.5-1.2B-Thinking/blob/main/config.json
- https://docs.liquid.ai/lfm/key-concepts/chat-template
- https://docs.liquid.ai/lfm/key-concepts/text-generation-and-prompting
- https://docs.liquid.ai/lfm/key-concepts/tool-use
- https://huggingface.co/docs/transformers/en/chat_templating#using-applychattemplate

## Notes
- The model is loaded once per process, during FastAPI lifespan startup.
- `index.html` is intentionally a simple streaming test page, not the production frontend.
- Both entrypoints (`app.py`, `main.py`) behave consistently by design.