---
language:
  - kk
  - ru
  - uz
  - en
license: cc-by-nc-4.0
tags:
  - automatic-speech-recognition
  - nemo
  - fastconformer
  - streaming
  - kazakh
  - russian
  - uzbek
  - english
  - onnx
pipeline_tag: automatic-speech-recognition
---

# nur-dev/nemo-fast — Multilingual Streaming STT

FastConformer Hybrid CTC+Transducer fine-tuned for Kazakh, Russian, Uzbek, and English. Supports real-time streaming inference via sherpa-onnx or batch inference via NeMo.


## Model Description

| Property | Value |
|---|---|
| Architecture | FastConformer Hybrid CTC+Transducer |
| Framework | NVIDIA NeMo |
| Parameters | ~120M |
| Tokenizer | SentencePiece BPE, 4096 vocab |
| Sample rate | 16 kHz mono |
| Languages | kk · ru · uz · en |
| Streaming | Yes (160 ms chunks) |

## WER Results

Evaluated with RNNT beam=16 + per-language KenLM 4-gram rescoring (ru α=0.4, uz α=0.7, kk/en α=0).

| Language | WER (in-domain) | WER (FLEURS) |
|---|---|---|
| English | 17.84% | 22.38% |
| Russian | 33.21% | 57.51% |
| Uzbek | 23.74% | 45.31% |
| Kazakh | 38.78% | 31.31% |

> **Note on Kazakh FLEURS:** The FLEURS WER (31.31%) is lower than the in-domain WER (38.78%) because the in-domain validation set includes conversational speech, which is harder than FLEURS read speech.
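
The table above was produced with beam search on the RNN-T head. Below is a minimal sketch of switching NeMo to that decoding strategy; the field names follow NeMo's RNNT decoding config and may differ slightly between NeMo versions, and the per-language KenLM rescoring itself (done with external 4-gram models) is not shown:

```python
import copy

import nemo.collections.asr as nemo_asr
from omegaconf import open_dict

model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.restore_from("fastconformer_v6.nemo")
model.eval()

# Switch the RNN-T head from greedy decoding to beam search with beam size 16
decoding_cfg = copy.deepcopy(model.cfg.decoding)
with open_dict(decoding_cfg):
    decoding_cfg.strategy = "beam"
    decoding_cfg.beam.beam_size = 16
model.change_decoding_strategy(decoding_cfg, decoder_type="rnnt")

print(model.transcribe(["audio.wav"])[0])
```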


## Repository Contents

```
fastconformer_v6.nemo          # Full NeMo model (weights + tokenizer + config)
onnx/
  encoder.onnx                 # FastConformer encoder for streaming inference
  decoder_joint.onnx           # Fused RNN-T decoder+joiner for streaming inference
```

## Inference

### Option A — NeMo (batch, GPU recommended)

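If the checkpoint is not already on disk, one way to fetch it from this repo is with `huggingface_hub` (the returned cache path can be passed to `restore_from` instead of the literal filename used below):

```python
from huggingface_hub import hf_hub_download

# Downloads into the local Hugging Face cache and returns the file path
nemo_path = hf_hub_download("nur-dev/nemo-fast", "fastconformer_v6.nemo")
```

Then load the model and transcribe:
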
```python
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.restore_from(
    "fastconformer_v6.nemo",
    map_location="cuda",
)
model.eval()

# Transcribe one or more audio files (16 kHz WAV/FLAC)
transcriptions = model.transcribe(["audio.wav"])
print(transcriptions[0])
```

For longer files, use CTC decoding (faster, slightly lower accuracy):

```python
transcriptions = model.transcribe(["audio.wav"], decoder_type="ctc")
```

### Option B — sherpa-onnx (streaming, CPU or GPU)

#### Install

```bash
pip install sherpa-onnx soundfile numpy
```

#### Download ONNX files

```python
# Using huggingface_hub
from huggingface_hub import hf_hub_download

encoder = hf_hub_download("nur-dev/nemo-fast", "onnx/encoder.onnx")
decoder = hf_hub_download("nur-dev/nemo-fast", "onnx/decoder_joint.onnx")
```

You also need the tokenizer vocabulary. Extract it from the `.nemo` archive (if the member names differ, list them first with `unzip -l fastconformer_v6.nemo`):

```bash
# .nemo files are zip archives
unzip -p fastconformer_v6.nemo tokenizer.model > tokenizer.model
# or extract the vocab txt
unzip -p fastconformer_v6.nemo vocab.txt > vocab.txt
```
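
sherpa-onnx token files list one `<token> <id>` pair per line. If the extracted vocab.txt only contains tokens (one per line, no ids), a compatible file can be generated from the SentencePiece model; this is a sketch assuming the `sentencepiece` package is installed:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("tokenizer.model")

# Write "<piece> <id>" per line, the layout sherpa-onnx expects for its tokens file
with open("tokens.txt", "w", encoding="utf-8") as f:
    for i in range(sp.get_piece_size()):
        f.write(f"{sp.id_to_piece(i)} {i}\n")
```

If vocab.txt is already in this format, pass it directly as `tokens=` below.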

#### Transcribe a file (non-streaming)

```python
import sherpa_onnx
import soundfile as sf
import numpy as np

recognizer = sherpa_onnx.OfflineRecognizer.from_transducer(
    encoder="onnx/encoder.onnx",
    decoder="onnx/decoder_joint.onnx",
    joiner="onnx/decoder_joint.onnx",   # fused model: same file for both
    tokens="vocab.txt",
    num_threads=4,
    sample_rate=16000,
    feature_dim=80,
)

audio, sr = sf.read("audio.wav", dtype="float32")
assert sr == 16000, "Resample to 16 kHz first"

stream = recognizer.create_stream()
stream.accept_waveform(sr, audio)
recognizer.decode_stream(stream)
print(stream.result.text)
```
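
If the input is not already 16 kHz mono, one option is to resample while loading; here is a sketch using librosa (an extra dependency, not in the install line above), with `audio.mp3` as a placeholder path:

```python
import librosa

# Decode, downmix to mono, and resample to 16 kHz float32 in one call
audio, sr = librosa.load("audio.mp3", sr=16000, mono=True)

stream = recognizer.create_stream()
stream.accept_waveform(sr, audio)
recognizer.decode_stream(stream)
print(stream.result.text)
```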

#### Real-time streaming transcription

```python
import sherpa_onnx
import sounddevice as sd
import numpy as np

SAMPLE_RATE   = 16000
CHUNK_MS      = 160          # 160 ms per chunk
CHUNK_SAMPLES = int(SAMPLE_RATE * CHUNK_MS / 1000)

recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
    encoder="onnx/encoder.onnx",
    decoder="onnx/decoder_joint.onnx",
    joiner="onnx/decoder_joint.onnx",
    tokens="vocab.txt",
    num_threads=4,
    sample_rate=SAMPLE_RATE,
    feature_dim=80,
    decoding_method="modified_beam_search",
    max_active_paths=4,
    enable_endpoint_detection=True,
    rule1_min_trailing_silence=2.4,   # endpoint after 2.4 s of silence when nothing has been decoded yet
    rule2_min_trailing_silence=1.2,   # endpoint after 1.2 s of silence once some text has been decoded
    rule3_min_utterance_length=20.0,  # force an endpoint after 20 s of continuous speech
)

stream = recognizer.create_stream()

def callback(indata, frames, time, status):
    audio = indata[:, 0].astype(np.float32)
    stream.accept_waveform(SAMPLE_RATE, audio)
    while recognizer.is_ready(stream):
        recognizer.decode_stream(stream)
    text = recognizer.get_result(stream).strip()   # get_result() returns the text as a str
    if text:
        print(f"\r{text}", end="", flush=True)
    if recognizer.is_endpoint(stream):
        print()
        recognizer.reset(stream)

with sd.InputStream(samplerate=SAMPLE_RATE, channels=1,
                    blocksize=CHUNK_SAMPLES, callback=callback):
    print("Listening — press Ctrl+C to stop")
    while True:
        sd.sleep(100)
```
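
To try the streaming recognizer without a microphone, you can replay a 16 kHz mono WAV file in 160 ms chunks; this sketch reuses the `recognizer`, `SAMPLE_RATE`, and `CHUNK_SAMPLES` defined above:

```python
import soundfile as sf

audio, sr = sf.read("audio.wav", dtype="float32")
assert sr == SAMPLE_RATE, "Resample to 16 kHz first"

stream = recognizer.create_stream()
for start in range(0, len(audio), CHUNK_SAMPLES):
    stream.accept_waveform(SAMPLE_RATE, audio[start:start + CHUNK_SAMPLES])
    while recognizer.is_ready(stream):
        recognizer.decode_stream(stream)

# Tell the stream no more audio is coming, then drain the remaining frames
stream.input_finished()
while recognizer.is_ready(stream):
    recognizer.decode_stream(stream)

print(recognizer.get_result(stream))
```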

### Option C — WebSocket / REST server

The full server is in the audio-STT repository. Quick start:

```bash
pip install sherpa-onnx fastapi uvicorn websockets soundfile numpy

python serving/serve_streaming.py \
    --encoder  onnx/encoder.onnx \
    --decoder  onnx/decoder_joint.onnx \
    --joiner   onnx/decoder_joint.onnx \
    --tokens   vocab.txt \
    --host 0.0.0.0 \
    --port 8001
```

REST endpoint:

```bash
curl -X POST http://localhost:8001/transcribe \
     -F "file=@audio.wav" | jq .
# {"text": "транскрипция аудио"}
```

WebSocket (streaming):

```js
const ws = new WebSocket("ws://localhost:8001/ws/transcribe");
ws.onmessage = (e) => console.log(JSON.parse(e.data));

// Send raw 16-bit PCM at 16 kHz in 160 ms chunks
mediaRecorder.ondataavailable = (e) => ws.send(e.data);
```

## Limitations

- **Kazakh (38.78% WER):** Training data is predominantly formal/read speech. Conversational Kazakh (call center, spontaneous) will have higher WER.
- **Russian/Uzbek out-of-domain:** FLEURS WER is significantly higher than in-domain (ru: 57.51%, uz: 45.31%), indicating sensitivity to recording conditions and speaking style.
- **No language identification:** The model does not auto-detect language. Accuracy on mixed-language audio is not characterized.
- **16 kHz mono only:** Audio must be resampled to 16 kHz mono before inference.

## License

Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)

This model may not be used for commercial purposes without explicit written permission.