---
language:
- kk
- ru
- uz
- en
license: cc-by-nc-4.0
tags:
- automatic-speech-recognition
- nemo
- fastconformer
- streaming
- kazakh
- russian
- uzbek
- english
- onnx
pipeline_tag: automatic-speech-recognition
---

# nur-dev/nemo-fast — Multilingual Streaming STT

**FastConformer Hybrid CTC+Transducer** fine-tuned for Kazakh, Russian, Uzbek, and English.
Supports real-time streaming inference via [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx) or batch inference via [NeMo](https://github.com/NVIDIA/NeMo).

---

## Model Description

| | Property | Value | |
| |----------|-------| |
| | Architecture | FastConformer Hybrid CTC+Transducer | |
| | Framework | NVIDIA NeMo | |
| | Parameters | ~120M | |
| | Tokenizer | SentencePiece BPE, 4096 vocab | |
| | Sample rate | 16 kHz mono | |
| | Languages | `kk` · `ru` · `uz` · `en` | |
| | Streaming | Yes (160 ms chunks) | |
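
The tokenizer is the SentencePiece model shipped inside the `.nemo` archive. As a quick sanity check (assuming `tokenizer.model` has been extracted as shown in the Inference section below), you can confirm the vocabulary size:

```python
# Verify the tokenizer vocabulary size (pip install sentencepiece).
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
print(sp.vocab_size())  # expected: 4096
```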

### WER Results

Evaluated with RNNT beam=16 + per-language KenLM 4-gram rescoring (ru α=0.4, uz α=0.7, kk/en α=0).

| | Language | WER (in-domain) | WER (FLEURS) | |
| |----------|----------------|--------------| |
| | English | 17.84% | 22.38% | |
| | Russian | 33.21% | 57.51% | |
| | Uzbek | 23.74% | 45.31% | |
| | Kazakh | 38.78% | 31.31% | |

> **Note on Kazakh FLEURS:** FLEURS WER (31.31%) is lower than the in-domain WER (38.78%) because the in-domain validation set includes conversational speech, which is harder than FLEURS read speech.
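
WER is the word-level edit distance between hypothesis and reference, divided by the number of reference words. A self-contained sketch of the metric is below; the numbers above were produced with the beam-search + KenLM setup described earlier, not with this snippet.

```python
# Minimal word error rate (WER) computation over whitespace-tokenized words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(f"{wer('this is a test', 'this is the test'):.2%}")  # 25.00%
```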

---

## Repository Contents

```
fastconformer_v6.nemo      # Full NeMo model (weights + tokenizer + config)
onnx/
  encoder.onnx             # FastConformer encoder for streaming inference
  decoder_joint.onnx       # Fused RNN-T decoder+joiner for streaming inference
```
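
The ONNX export is split into an encoder graph and a fused decoder+joiner graph. To sanity-check the files before wiring them into a runtime, you can list their inputs and outputs with `onnxruntime`; the tensor names and shapes are whatever the export produced, so they are printed rather than assumed here.

```python
# Inspect the exported ONNX graphs (pip install onnxruntime).
import onnxruntime as ort

for path in ["onnx/encoder.onnx", "onnx/decoder_joint.onnx"]:
    sess = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    print(path)
    for t in sess.get_inputs():
        print("  in :", t.name, t.shape, t.type)
    for t in sess.get_outputs():
        print("  out:", t.name, t.shape, t.type)
```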

---

## Inference

### Option A — NeMo (batch, GPU recommended)

```python
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.restore_from(
    "fastconformer_v6.nemo",
    map_location="cuda",
)
model.eval()

# Transcribe one or more audio files (16 kHz WAV/FLAC)
transcriptions = model.transcribe(["audio.wav"])
print(transcriptions[0])
```

For longer files, use CTC decoding (faster, slightly lower accuracy):

```python
transcriptions = model.transcribe(["audio.wav"], decoder_type="ctc")
```

---

### Option B — sherpa-onnx (streaming, CPU or GPU)

#### Install

```bash
pip install sherpa-onnx soundfile sounddevice numpy
```

#### Download ONNX files

```python
# Using huggingface_hub
from huggingface_hub import hf_hub_download

encoder = hf_hub_download("nur-dev/nemo-fast", "onnx/encoder.onnx")
decoder = hf_hub_download("nur-dev/nemo-fast", "onnx/decoder_joint.onnx")
```

You also need the tokenizer vocabulary. Extract it from the `.nemo` archive:

```bash
# .nemo files are tar archives; list the contents to see the exact member names
tar -tf fastconformer_v6.nemo
# extract the SentencePiece model and the vocab txt (adjust the names if they differ)
tar -xf fastconformer_v6.nemo tokenizer.model vocab.txt
```
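
If the member names inside the archive do not exactly match `tokenizer.model` / `vocab.txt` (some NeMo exports prefix them), a small Python alternative that locates the files by suffix:

```python
# Extract the tokenizer files from the .nemo tar archive, matching by suffix
# in case the member names carry a prefix.
import tarfile
from pathlib import Path

with tarfile.open("fastconformer_v6.nemo") as tar:
    for member in tar.getmembers():
        name = Path(member.name).name
        if name.endswith("tokenizer.model") or name.endswith("vocab.txt"):
            with tar.extractfile(member) as src, open(name, "wb") as dst:
                dst.write(src.read())
            print("extracted", name)
```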

#### Transcribe a file (non-streaming)

```python
import sherpa_onnx
import soundfile as sf

recognizer = sherpa_onnx.OfflineRecognizer.from_transducer(
    encoder="onnx/encoder.onnx",
    decoder="onnx/decoder_joint.onnx",
    joiner="onnx/decoder_joint.onnx",  # fused model: same file for both
    tokens="vocab.txt",
    num_threads=4,
    sample_rate=16000,
    feature_dim=80,
)

audio, sr = sf.read("audio.wav", dtype="float32")
assert sr == 16000, "Resample to 16 kHz first"

stream = recognizer.create_stream()
stream.accept_waveform(sr, audio)
recognizer.decode_stream(stream)
print(stream.result.text)
```
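
If the source audio is not already 16 kHz mono, resample it first. A minimal sketch assuming `librosa` is installed (an `ffmpeg` one-liner works just as well):

```python
# Convert arbitrary input audio to 16 kHz mono WAV before inference.
# Equivalent CLI: ffmpeg -i input.mp3 -ar 16000 -ac 1 audio.wav
import librosa
import soundfile as sf

audio, sr = librosa.load("input.mp3", sr=16000, mono=True)  # float32 in [-1, 1]
sf.write("audio.wav", audio, sr)
```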

#### Real-time streaming transcription

```python
import sherpa_onnx
import sounddevice as sd
import numpy as np

SAMPLE_RATE = 16000
CHUNK_MS = 160  # 160 ms per chunk
CHUNK_SAMPLES = int(SAMPLE_RATE * CHUNK_MS / 1000)

recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
    encoder="onnx/encoder.onnx",
    decoder="onnx/decoder_joint.onnx",
    joiner="onnx/decoder_joint.onnx",
    tokens="vocab.txt",
    num_threads=4,
    sample_rate=SAMPLE_RATE,
    feature_dim=80,
    decoding_method="modified_beam_search",
    max_active_paths=4,
    enable_endpoint_detection=True,
    rule1_min_trailing_silence=2.4,
    rule2_min_trailing_silence=1.2,
    rule3_min_utterance_length=20.0,
)

stream = recognizer.create_stream()

def callback(indata, frames, time, status):
    audio = indata[:, 0].astype(np.float32)
    stream.accept_waveform(SAMPLE_RATE, audio)
    while recognizer.is_ready(stream):
        recognizer.decode_stream(stream)
    text = recognizer.get_result(stream)  # decoded text so far, as a str
    if text:
        print(f"\r{text}", end="", flush=True)
    if recognizer.is_endpoint(stream):
        print()
        recognizer.reset(stream)

with sd.InputStream(samplerate=SAMPLE_RATE, channels=1,
                    blocksize=CHUNK_SAMPLES, callback=callback):
    print("Listening — press Ctrl+C to stop")
    while True:
        sd.sleep(100)
```
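
To test the streaming pipeline without a microphone, you can replay a WAV file through the same recognizer in 160 ms chunks. A sketch reusing the `recognizer`, `SAMPLE_RATE`, and `CHUNK_SAMPLES` defined above:

```python
# Simulated streaming: feed a 16 kHz WAV file in 160 ms chunks.
import numpy as np
import soundfile as sf

audio, sr = sf.read("audio.wav", dtype="float32")
assert sr == SAMPLE_RATE, "Resample to 16 kHz first"

stream = recognizer.create_stream()
for start in range(0, len(audio), CHUNK_SAMPLES):
    stream.accept_waveform(SAMPLE_RATE, audio[start:start + CHUNK_SAMPLES])
    while recognizer.is_ready(stream):
        recognizer.decode_stream(stream)

# Flush with a little trailing silence so the final tokens are emitted.
stream.accept_waveform(SAMPLE_RATE, np.zeros(int(0.6 * SAMPLE_RATE), dtype=np.float32))
stream.input_finished()
while recognizer.is_ready(stream):
    recognizer.decode_stream(stream)

print(recognizer.get_result(stream))
```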

---

### Option C — WebSocket / REST server

The full server is in the [audio-STT repository](https://github.com/nur-dev/audio-STT). Quick start:

```bash
pip install sherpa-onnx fastapi uvicorn websockets soundfile numpy

python serving/serve_streaming.py \
  --encoder onnx/encoder.onnx \
  --decoder onnx/decoder_joint.onnx \
  --joiner onnx/decoder_joint.onnx \
  --tokens vocab.txt \
  --host 0.0.0.0 \
  --port 8001
```

**REST endpoint:**

```bash
curl -X POST http://localhost:8001/transcribe \
  -F "file=@audio.wav" | jq .
# {"text": "транскрипция аудио"}
```
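
The same endpoint can be called from Python with `requests`, assuming the multipart field name `file` shown in the curl example:

```python
# Call the REST transcription endpoint (pip install requests).
import requests

with open("audio.wav", "rb") as f:
    resp = requests.post("http://localhost:8001/transcribe", files={"file": f})
resp.raise_for_status()
print(resp.json()["text"])
```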

**WebSocket (streaming):**

```javascript
const ws = new WebSocket("ws://localhost:8001/ws/transcribe");
ws.onmessage = (e) => console.log(JSON.parse(e.data));

// Send raw 16-bit PCM at 16 kHz in 160 ms chunks.
// Note: MediaRecorder emits compressed containers (webm/opus), not raw PCM, so
// capture audio with an AudioWorklet/ScriptProcessorNode and convert the Float32
// samples to Int16 before sending:
pcmNode.port.onmessage = (e) => ws.send(e.data);  // e.data: ArrayBuffer of Int16 PCM
```
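
A Python client sketch for the same WebSocket endpoint. It assumes the server accepts raw little-endian int16 PCM frames and replies with JSON text messages; the exact framing is defined by `serving/serve_streaming.py` in the audio-STT repository, so adjust accordingly.

```python
# WebSocket streaming client sketch (pip install websockets soundfile numpy).
import asyncio
import json

import numpy as np
import soundfile as sf
import websockets

SAMPLE_RATE = 16000
CHUNK_SAMPLES = SAMPLE_RATE * 160 // 1000  # 160 ms of audio per message

async def main():
    audio, sr = sf.read("audio.wav", dtype="float32")
    assert sr == SAMPLE_RATE, "Resample to 16 kHz first"
    pcm16 = (np.clip(audio, -1.0, 1.0) * 32767).astype("<i2").tobytes()
    chunk_bytes = CHUNK_SAMPLES * 2  # 2 bytes per int16 sample

    async with websockets.connect("ws://localhost:8001/ws/transcribe") as ws:
        for i in range(0, len(pcm16), chunk_bytes):
            await ws.send(pcm16[i:i + chunk_bytes])
            # Print any partial result the server has pushed so far.
            try:
                msg = await asyncio.wait_for(ws.recv(), timeout=0.01)
                print(json.loads(msg))
            except asyncio.TimeoutError:
                pass

asyncio.run(main())
```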

---

## Limitations

- **Kazakh (38.78% WER):** Training data is predominantly formal/read speech. Conversational Kazakh (call center, spontaneous speech) will have higher WER.
- **Russian/Uzbek out-of-domain:** FLEURS WER is significantly higher than in-domain WER (ru: 57.51%, uz: 45.31%), indicating sensitivity to recording conditions and speaking style.
- **No language identification:** The model does not auto-detect the language. Accuracy on mixed-language audio has not been characterized.
- **16 kHz mono only:** Audio must be resampled to 16 kHz mono before inference.


---

## License

[Creative Commons Attribution Non Commercial 4.0 (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/)

This model may not be used for commercial purposes without explicit written permission.