---
title: Streaming Zipformer
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
short_description: Streaming zipformer
---

# Real-Time Streaming ASR Demo (FastAPI + Sherpa-ONNX)

This project demonstrates a real-time speech-to-text (ASR) web application with:

* [Sherpa-ONNX](https://github.com/k2-fsa/sherpa-onnx) streaming Zipformer model
* FastAPI backend with WebSocket support
* Configurable browser-based UI using vanilla HTML/JS
* Docker-compatible deployment (CPU-only) on Hugging Face Spaces

## Model

The app uses the bilingual (Chinese-English) streaming Zipformer model:

**Model Source:** [Zipformer Small Bilingual zh-en (2023-02-16)](https://k2-fsa.github.io/sherpa/onnx/pretrained_models/online-transducer/zipformer-transducer-models.html#sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16-bilingual-chinese-english)

Model files (ONNX) are located under:

```
models/zipformer_bilingual/
```

## Features

* **Real-Time Microphone Input:** capture audio directly in the browser.
* **Recognition Settings:** select the ASR model and precision; view supported languages and model size.
* **Hotword Biasing:** enter custom hotwords (one per line) and adjust the boost score. See the [Sherpa-ONNX Hotwords Guide](https://k2-fsa.github.io/sherpa/onnx/hotwords/index.html).
* **Endpoint Detection:** configure the endpointing rules (Rule 1 and Rule 2 trailing-silence thresholds, Rule 3 utterance-length limit) to control segmentation. See [Sherpa-NCNN Endpoint Detection](https://k2-fsa.github.io/sherpa/ncnn/endpoint.html).
* **Volume Meter:** real-time volume indicator based on RMS.
* **Streaming Transcription:** display partial results (in red) and final results (in green) with automatic scrolling.
* **Debug Logging:** the backend logs configuration steps and endpoint detection events.
* **Deployment:** a Dockerfile is provided for CPU-only deployment on Hugging Face Spaces.
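
The RMS-based volume meter mentioned above can be illustrated with a short sketch. This is not the app's actual code (the meter presumably runs in the browser); the function names `rms_level` and `rms_to_dbfs` and the `-60 dB` floor are assumptions chosen for illustration, and samples are assumed to be floats in `[-1.0, 1.0]`.

```python
import math
from typing import Sequence


def rms_level(samples: Sequence[float]) -> float:
    """Root-mean-square of one audio frame; 0.0 for an empty frame."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def rms_to_dbfs(rms: float, floor_db: float = -60.0) -> float:
    """Convert an RMS value to dB relative to full scale, clamped at floor_db.

    The floor is a hypothetical display limit so that silence does not
    produce negative infinity.
    """
    if rms <= 0.0:
        return floor_db
    return max(floor_db, 20.0 * math.log10(rms))


if __name__ == "__main__":
    frame = [0.5, -0.5, 0.5, -0.5]
    level = rms_level(frame)
    print(f"rms={level:.3f}, dBFS={rms_to_dbfs(level):.1f}")
```

A meter would typically call this once per captured frame and map the dB value onto a bar width.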

## Configuration Guide

### Hotword Biasing Configuration

* **Hotwords List** (`hotwordsList`): Enter one hotword or phrase per line. These are the words and phrases the recognizer will favor. For multilingual models, you can mix scripts according to your model's `modeling-unit` (e.g., `cjkchar+bpe`).
* **Boost Score** (`boostScore`): A global score applied at the token level for each matched hotword (range: `0.0` to `10.0`). You may also specify per-hotword scores inline in the list using `:`, for example:

  ```
  语音识别 :3.5
  深度学习 :2.0
  SPEECH RECOGNITION :1.5
  ```
* **Decoding Method**: Hotword biasing only takes effect when the recognizer decodes with `modified_beam_search`, not the default `greedy_search`.
* **Applying**: Click **Apply Hotwords** in the UI to send the following JSON payload to the backend:

  ```json
  {
    "type": "config",
    "hotwords": ["..."],
    "hotwordsScore": 2.0
  }
  ```

(For full details, see the [Sherpa-ONNX Hotwords Guide](https://k2-fsa.github.io/sherpa/onnx/hotwords/index.html).)
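
As a rough illustration of how the hotwords text box could be turned into the `config` payload shown above, here is a minimal Python sketch. The helper names `build_hotword_config` and `split_inline_score` are hypothetical and not part of the app or of Sherpa-ONNX; only the payload keys (`type`, `hotwords`, `hotwordsScore`) and the `0.0` to `10.0` range come from this README.

```python
import json
from typing import List, Optional, Tuple


def split_inline_score(entry: str) -> Tuple[str, Optional[float]]:
    """Split 'PHRASE :score' into (phrase, score); score is None if absent."""
    phrase, sep, tail = entry.rpartition(":")
    if sep and phrase.strip():
        try:
            return phrase.strip(), float(tail)
        except ValueError:
            pass  # the text after ':' is not a number, so treat it as part of the phrase
    return entry.strip(), None


def build_hotword_config(text: str, boost_score: float) -> str:
    """Build the JSON config message from the raw multi-line hotwords text."""
    if not 0.0 <= boost_score <= 10.0:
        raise ValueError("boost score must be within 0.0 and 10.0")
    # One hotword per line; blank lines are skipped, inline scores are kept as typed.
    hotwords: List[str] = [ln.strip() for ln in text.splitlines() if ln.strip()]
    payload = {"type": "config", "hotwords": hotwords, "hotwordsScore": boost_score}
    return json.dumps(payload, ensure_ascii=False)


if __name__ == "__main__":
    print(build_hotword_config("语音识别 :3.5\nSPEECH RECOGNITION :1.5\n", 2.0))
    print(split_inline_score("深度学习 :2.0"))
```

Keeping the inline `:score` suffix inside each list entry leaves the interpretation of per-hotword scores to the backend, which is one possible design; the app may handle this differently.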

### Endpoint Detection Configuration

The system supports three endpointing rules borrowed from Kaldi:

* **Rule 1** (`epRule1`): Minimum duration of trailing silence to trigger an endpoint, in **seconds** (default: `2.4`). Fires whether or not any token has been decoded.
* **Rule 2** (`epRule2`): Minimum duration of trailing silence to trigger an endpoint *only after* at least one token has been decoded, in **seconds** (default: `1.2`).
* **Rule 3** (`epRule3`): Maximum utterance length before forcing an endpoint, in **milliseconds** (default: `300`). Disable by setting a very large value.
* **Applying**: Click **Apply Endpoint Config** in the UI to send the following JSON payload to the backend:

  ```json
  {
    "type": "config",
    "epRule1": 2.4,
    "epRule2": 1.2,
    "epRule3": 300
  }
  ```

(See the [Sherpa-NCNN Endpointing documentation](https://k2-fsa.github.io/sherpa/ncnn/endpoint.html).)
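
The way the three rules combine can be sketched in plain Python. This is an illustration of the logic as described above, not the Sherpa implementation; the function name `is_endpoint` and its parameters are hypothetical. The sketch assumes every duration is passed in seconds, so a value such as `epRule3` would need converting from whatever unit the app uses before being passed in.

```python
def is_endpoint(
    trailing_silence_s: float,
    utterance_length_s: float,
    decoded_any_token: bool,
    rule1_silence_s: float,
    rule2_silence_s: float,
    rule3_utterance_s: float,
) -> bool:
    """Return True if any of the three endpointing rules fires."""
    # Rule 1: long trailing silence, regardless of whether anything was decoded.
    if trailing_silence_s >= rule1_silence_s:
        return True
    # Rule 2: shorter trailing silence, but only after at least one decoded token.
    if decoded_any_token and trailing_silence_s >= rule2_silence_s:
        return True
    # Rule 3: the utterance has reached the length limit, so force an endpoint.
    return utterance_length_s >= rule3_utterance_s


if __name__ == "__main__":
    # 1.5 s of silence after some decoded text: Rule 2 fires.
    print(is_endpoint(1.5, 3.0, True, 2.4, 1.2, 20.0))
    # Same silence with nothing decoded yet: no rule fires.
    print(is_endpoint(1.5, 3.0, False, 2.4, 1.2, 20.0))
```

Lowering the Rule 2 threshold makes segments end sooner after a speaker pauses, at the cost of splitting sentences at short hesitations.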

## Local Development

1. **Install dependencies**

   ```bash
   pip install -r requirements.txt
   ```

2. **Run the app locally**

   ```bash
   uvicorn app.main:app --reload --host 0.0.0.0 --port 8501
   ```

   Open [http://localhost:8501](http://localhost:8501) in your browser.

## Project Structure

```
.
├── app
│   ├── main.py            # FastAPI + WebSocket endpoint, config parsing, debug logging
│   ├── asr_worker.py      # Audio resampling, inference, endpoint detection, OpenCC conversion
│   └── static/index.html  # Client-side UI: recognition, hotword, endpoint, mic, transcript
├── models/zipformer_bilingual/
│   └── ... (onnx, tokens.txt)
├── requirements.txt
├── Dockerfile
└── README.md
```
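
To give a feel for the config parsing that `main.py` is described as doing, the following sketch applies an incoming `config` message to per-session settings. It is a guess at the shape of such code, not the project's implementation: `SessionSettings`, `apply_config`, and the attribute names are hypothetical, and the defaults are simply the values shown in the example payloads of this README.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class SessionSettings:
    """Hypothetical per-connection settings; defaults mirror the README examples."""
    hotwords: List[str] = field(default_factory=list)
    hotwords_score: float = 2.0
    ep_rule1: float = 2.4
    ep_rule2: float = 1.2
    ep_rule3: float = 300.0


def apply_config(settings: SessionSettings, raw: str) -> bool:
    """Update settings from a JSON text message; return False if it is not a config message."""
    msg = json.loads(raw)
    if not isinstance(msg, dict) or msg.get("type") != "config":
        return False
    # Hotword and endpoint keys are optional, so either payload can arrive on its own.
    if "hotwords" in msg:
        settings.hotwords = [str(h) for h in msg["hotwords"]]
    if "hotwordsScore" in msg:
        settings.hotwords_score = float(msg["hotwordsScore"])
    for key, attr in (("epRule1", "ep_rule1"), ("epRule2", "ep_rule2"), ("epRule3", "ep_rule3")):
        if key in msg:
            setattr(settings, attr, float(msg[key]))
    return True


if __name__ == "__main__":
    settings = SessionSettings()
    apply_config(settings, '{"type": "config", "epRule1": 3.0, "epRule2": 1.0, "epRule3": 500}')
    print(settings)
```

In a WebSocket handler, a function like this would be called for text frames, while binary frames would carry audio.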

## Credits

* [Sherpa-ONNX](https://github.com/k2-fsa/sherpa-onnx)
* [OpenCC](https://github.com/BYVoid/OpenCC)
* [FastAPI](https://fastapi.tiangolo.com/)
* [Hugging Face Spaces](https://huggingface.co/docs/hub/spaces)