	---
	title: Streaming Zipformer
	emoji: 👀
	colorFrom: blue
	colorTo: purple
	sdk: docker
	pinned: false
	license: mit
	short_description: Streaming zipformer
	---

	# 🎙️ Real-Time Streaming ASR Demo (FastAPI + Sherpa-ONNX)

	This project demonstrates a real-time speech-to-text (ASR) web application with:

	* 🧠 [Sherpa-ONNX](https://github.com/k2-fsa/sherpa-onnx) streaming Zipformer model
	* 🚀 FastAPI backend with WebSocket support
	* 🎛️ Configurable browser-based UI using vanilla HTML/JS
	* ☁️ Docker-compatible deployment (CPU-only) on Hugging Face Spaces

	## 📦 Model

	The app uses the bilingual (Chinese-English) streaming Zipformer model:

	🔗 Model Source: [Zipformer Small Bilingual zh-en (2023-02-16)](https://k2-fsa.github.io/sherpa/onnx/pretrained_models/online-transducer/zipformer-transducer-models.html#sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16-bilingual-chinese-english)

	Model files (ONNX) are located under:

	```
	models/zipformer_bilingual/
	```
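
As a rough illustration of how a backend might pick up these files, the sketch below scans the model directory for the encoder, decoder, and joiner ONNX files plus `tokens.txt`. The helper name `find_model_files` is hypothetical, and the file-name patterns (`encoder*.onnx`, a `.int8` marker for quantized files) are assumptions based on common Sherpa-ONNX release naming, not something this repository confirms.

```python
from pathlib import Path
from typing import Dict


def find_model_files(model_dir: str, int8: bool = False) -> Dict[str, Path]:
    """Locate transducer model files in model_dir (hypothetical helper).

    Assumes files are named like 'encoder*.onnx' and that quantized
    variants carry '.int8' in their file name.
    """
    root = Path(model_dir)
    found: Dict[str, Path] = {}
    for part in ("encoder", "decoder", "joiner"):
        candidates = sorted(root.glob(f"{part}*.onnx"))
        # Keep only the files whose precision matches the request.
        matching = [p for p in candidates if (".int8" in p.name) == int8]
        if not matching:
            precision = "int8" if int8 else "fp32"
            raise FileNotFoundError(f"no {precision} {part} model in {root}")
        found[part] = matching[0]
    tokens = root / "tokens.txt"
    if not tokens.is_file():
        raise FileNotFoundError(f"tokens.txt not found in {root}")
    found["tokens"] = tokens
    return found
```

Calling `find_model_files("models/zipformer_bilingual")` would return a dictionary of paths that can then be handed to whatever code constructs the recognizer.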

	## 🚀 Features

	* 🎤 Real-Time Microphone Input: capture audio directly in the browser.
	* 🎛️ Recognition Settings: select ASR model and precision; view supported languages and model size.
	* 🔑 Hotword Biasing: input custom hotwords (one per line) and adjust boost score. See [Sherpa-ONNX Hotwords Guide](https://k2-fsa.github.io/sherpa/onnx/hotwords/index.html).
	* ⏱️ Endpoint Detection: configure silence-based rules (Rule 1 threshold, Rule 2 threshold, minimum utterance length) to control segmentation. See [Sherpa-NCNN Endpoint Detection](https://k2-fsa.github.io/sherpa/ncnn/endpoint.html).
	* 📊 Volume Meter: real-time volume indicator based on RMS.
	* 💬 Streaming Transcription: display partial (in red) and final (in green) results with automatic scrolling.
	* 🛠️ Debug Logging: backend logs configuration steps and endpoint detection events.
	* 🐳 Deployment: Dockerfile provided for CPU-only deployment on Hugging Face Spaces.
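
The volume meter relies on the root-mean-square (RMS) of the audio samples. The app computes this in the browser; the Python sketch below only illustrates the same arithmetic, assuming samples are floats normalized to the range -1.0 to 1.0. The function names and the -60 dB floor are illustrative choices, not taken from the app.

```python
import math
from typing import Sequence


def rms_level(samples: Sequence[float]) -> float:
    """Root-mean-square amplitude of a block of samples (0.0 for an empty block)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def rms_dbfs(samples: Sequence[float], floor_db: float = -60.0) -> float:
    """RMS level in dB relative to full scale, clamped to floor_db for silence."""
    level = rms_level(samples)
    if level <= 0.0:
        return floor_db
    return max(floor_db, 20.0 * math.log10(level))
```

A meter would typically call one of these functions on each incoming audio block and map the result to a bar width.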

	## 🛠️ Configuration Guide

	### 🔑 Hotword Biasing Configuration

	* Hotwords List (`hotwordsList`): Enter one hotword or phrase per line. These are words/phrases the ASR will preferentially recognize. For multilingual models, you can mix scripts according to your model’s `modeling-unit` (e.g., `cjkchar+bpe`).
	* Boost Score (`boostScore`): A global score applied at the token level for each matched hotword (range: `0.0`–`10.0`). You may also specify per-hotword scores inline in the list using `:`, for example:

	```
	语音识别 :3.5
	深度学习 :2.0
	SPEECH RECOGNITION :1.5
	```
	* Decoding Method: Hotword biasing only takes effect when the recognizer decodes with `modified_beam_search`; the default `greedy_search` ignores hotwords.
	* Applying: Click Apply Hotwords in the UI to send the following JSON payload to the backend:

	```json
	{
	  "type": "config",
	  "hotwords": ["..."],
	  "hotwordsScore": 2.0
	}
	```

	(For full details, see the [Sherpa-ONNX Hotwords Guide](https://k2-fsa.github.io/sherpa/onnx/hotwords/index.html).)
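
To make the list format and the payload concrete, here is a minimal sketch of how a client or test script might turn the hotword text box into the JSON message above. The helpers `parse_hotword_line` and `build_hotword_message` are hypothetical names, and forwarding each line unchanged (inline score included) is an assumption about what the backend expects.

```python
import json
from typing import List, Optional, Tuple


def parse_hotword_line(line: str) -> Tuple[str, Optional[float]]:
    """Split 'PHRASE :SCORE' into (phrase, score); score is None when absent."""
    text = line.strip()
    phrase, sep, tail = text.rpartition(" :")
    if sep:
        try:
            return phrase.strip(), float(tail)
        except ValueError:
            pass  # Not a numeric suffix, so treat the whole line as the phrase.
    return text, None


def build_hotword_message(raw_text: str, boost_score: float) -> str:
    """Build the JSON config message from the text box contents.

    Blank lines are dropped; the global boost score must lie in 0.0 to 10.0.
    """
    if not 0.0 <= boost_score <= 10.0:
        raise ValueError("boost score must be between 0.0 and 10.0")
    hotwords: List[str] = [ln.strip() for ln in raw_text.splitlines() if ln.strip()]
    message = {"type": "config", "hotwords": hotwords, "hotwordsScore": boost_score}
    # ensure_ascii=False keeps Chinese hotwords readable in logs.
    return json.dumps(message, ensure_ascii=False)
```

`parse_hotword_line` is useful for validating inline scores before sending, while `build_hotword_message` produces the string to send over the WebSocket.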

	### ⏱️ Endpoint Detection Configuration

	The system supports three endpointing rules borrowed from Kaldi:

	* Rule 1 (`epRule1`): Minimum duration of trailing silence to trigger an endpoint, in seconds (default: `2.4`). Fires whether or not any token has been decoded.
	* Rule 2 (`epRule2`): Minimum duration of trailing silence to trigger an endpoint only after at least one token is decoded, in seconds (default: `1.2`).
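	* Rule 3 (`epRule3`): Utterance length after which an endpoint is forced regardless of trailing silence, in seconds like Rules 1 and 2 (default: `300`). Sherpa-ONNX names this threshold `rule3_min_utterance_length`. Set a very large value to effectively disable it.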
	* Applying: Click Apply Endpoint Config in the UI to send the following JSON payload to the backend:

	```json
	{
	  "type": "config",
	  "epRule1": 2.4,
	  "epRule2": 1.2,
	  "epRule3": 300
	}
	```

	(See the [Sherpa-NCNN Endpointing documentation](https://k2-fsa.github.io/sherpa/ncnn/endpoint.html).)
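
The way the three rules combine can be sketched in a few lines of Python. This is an illustration of the rule logic as described above, not the Sherpa-ONNX implementation: `EndpointRules`, `rules_from_config`, and `is_endpoint` are hypothetical names, and the utterance length is assumed to be measured in the same unit as `epRule3`.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class EndpointRules:
    ep_rule1: float = 2.4    # trailing silence that always triggers an endpoint
    ep_rule2: float = 1.2    # trailing silence that triggers once a token is decoded
    ep_rule3: float = 300.0  # utterance length that forces an endpoint


def rules_from_config(message: Dict[str, Any]) -> EndpointRules:
    """Read thresholds from a config payload, falling back to the defaults."""
    defaults = EndpointRules()
    return EndpointRules(
        ep_rule1=float(message.get("epRule1", defaults.ep_rule1)),
        ep_rule2=float(message.get("epRule2", defaults.ep_rule2)),
        ep_rule3=float(message.get("epRule3", defaults.ep_rule3)),
    )


def is_endpoint(rules: EndpointRules, trailing_silence: float,
                utterance_length: float, decoded_any: bool) -> bool:
    """Return True if any of the three rules fires."""
    if trailing_silence >= rules.ep_rule1:
        return True  # Rule 1: long silence, even with nothing decoded
    if decoded_any and trailing_silence >= rules.ep_rule2:
        return True  # Rule 2: shorter silence after at least one token
    return utterance_length >= rules.ep_rule3  # Rule 3: utterance too long
```

With the defaults, 1.5 seconds of silence ends a segment only if something has already been decoded, while 2.4 seconds ends it unconditionally.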

	## 🧪 Local Development

	1. Install dependencies

	```bash
	pip install -r requirements.txt
	```

	2. Run the app locally

	```bash
	uvicorn app.main:app --reload --host 0.0.0.0 --port 8501
	```

	Open [http://localhost:8501](http://localhost:8501) in your browser.

	## 📁 Project Structure

	```
	.
	├── app
	│   ├── main.py            # FastAPI + WebSocket endpoint, config parsing, debug logging
	│   ├── asr_worker.py      # Audio resampling, inference, endpoint detection, OpenCC conversion
	│   └── static/index.html  # Client-side UI: recognition, hotword, endpoint, mic, transcript
	├── models/zipformer_bilingual/
	│   └── ... (onnx, tokens.txt)
	├── requirements.txt
	├── Dockerfile
	└── README.md
	```

	## 🔧 Credits

	* [Sherpa-ONNX](https://github.com/k2-fsa/sherpa-onnx)
	* [OpenCC](https://github.com/BYVoid/OpenCC)
	* [FastAPI](https://fastapi.tiangolo.com/)
	* [Hugging Face Spaces](https://huggingface.co/docs/hub/spaces)