---
language:
- kk
- ru
- uz
- en
license: cc-by-nc-4.0
tags:
- automatic-speech-recognition
- nemo
- fastconformer
- streaming
- kazakh
- russian
- uzbek
- english
- onnx
pipeline_tag: automatic-speech-recognition
---

# nur-dev/nemo-fast — Multilingual Streaming STT

**FastConformer Hybrid CTC+Transducer** fine-tuned for Kazakh, Russian, Uzbek, and English.
Supports real-time streaming inference via [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx) or batch inference via [NeMo](https://github.com/NVIDIA/NeMo).

---

## Model Description

| | Property | Value | |
| |----------|-------| |
| | Architecture | FastConformer Hybrid CTC+Transducer | |
| | Framework | NVIDIA NeMo | |
| | Parameters | ~120M | |
| | Tokenizer | SentencePiece BPE, 4096 vocab | |
| | Sample rate | 16 kHz mono | |
| | Languages | `kk` · `ru` · `uz` · `en` | |
| | Streaming | Yes (160 ms chunks) | |
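
The tokenizer is the SentencePiece model shipped inside the `.nemo` archive. As a quick sanity check (assuming `tokenizer.model` has been extracted as shown in the Inference section below), you can confirm the vocabulary size:

```python
# Verify the tokenizer vocabulary size (pip install sentencepiece).
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
print(sp.vocab_size())  # expected: 4096
```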

### WER Results

Evaluated with RNNT beam=16 + per-language KenLM 4-gram rescoring (ru α=0.4, uz α=0.7, kk/en α=0).

| | Language | WER (in-domain) | WER (FLEURS) | |
| |----------|----------------|--------------| |
| | English | 17.84% | 22.38% | |
| | Russian | 33.21% | 57.51% | |
| | Uzbek | 23.74% | 45.31% | |
| | Kazakh | 38.78% | 31.31% | |

> **Note on Kazakh FLEURS:** FLEURS WER (31.31%) is lower than the in-domain WER (38.78%) because the in-domain validation set includes conversational speech, which is harder than FLEURS read speech.
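
WER is the word-level edit distance between hypothesis and reference, divided by the number of reference words. A self-contained sketch of the metric is below; the numbers above were produced with the beam-search + KenLM setup described earlier, not with this snippet.

```python
# Minimal word error rate (WER) computation over whitespace-tokenized words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(f"{wer('this is a test', 'this is the test'):.2%}")  # 25.00%
```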

---

## Repository Contents

```
fastconformer_v6.nemo      # Full NeMo model (weights + tokenizer + config)
onnx/
  encoder.onnx             # FastConformer encoder for streaming inference
  decoder_joint.onnx       # Fused RNN-T decoder+joiner for streaming inference
```
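
The ONNX export is split into an encoder graph and a fused decoder+joiner graph. To sanity-check the files before wiring them into a runtime, you can list their inputs and outputs with `onnxruntime`; the tensor names and shapes are whatever the export produced, so they are printed rather than assumed here.

```python
# Inspect the exported ONNX graphs (pip install onnxruntime).
import onnxruntime as ort

for path in ["onnx/encoder.onnx", "onnx/decoder_joint.onnx"]:
    sess = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    print(path)
    for t in sess.get_inputs():
        print("  in :", t.name, t.shape, t.type)
    for t in sess.get_outputs():
        print("  out:", t.name, t.shape, t.type)
```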

---

## Inference

### Option A — NeMo (batch, GPU recommended)

```python
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.restore_from(
    "fastconformer_v6.nemo",
    map_location="cuda",
)
model.eval()

# Transcribe one or more audio files (16 kHz WAV/FLAC)
transcriptions = model.transcribe(["audio.wav"])
print(transcriptions[0])
```

For longer files, use CTC decoding (faster, slightly lower accuracy):

```python
transcriptions = model.transcribe(["audio.wav"], decoder_type="ctc")
```

---

### Option B — sherpa-onnx (streaming, CPU or GPU)

#### Install

```bash
pip install sherpa-onnx soundfile sounddevice numpy
```

#### Download ONNX files

```python
# Using huggingface_hub
from huggingface_hub import hf_hub_download

encoder = hf_hub_download("nur-dev/nemo-fast", "onnx/encoder.onnx")
decoder = hf_hub_download("nur-dev/nemo-fast", "onnx/decoder_joint.onnx")
```

You also need the tokenizer vocabulary. Extract it from the `.nemo` archive:

```bash
# .nemo files are tar archives; list the contents to see the exact member names
tar -tf fastconformer_v6.nemo
# extract the SentencePiece model and the vocab txt (adjust the names if they differ)
tar -xf fastconformer_v6.nemo tokenizer.model vocab.txt
```
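
If the member names inside the archive do not exactly match `tokenizer.model` / `vocab.txt` (some NeMo exports prefix them), a small Python alternative that locates the files by suffix:

```python
# Extract the tokenizer files from the .nemo tar archive, matching by suffix
# in case the member names carry a prefix.
import tarfile
from pathlib import Path

with tarfile.open("fastconformer_v6.nemo") as tar:
    for member in tar.getmembers():
        name = Path(member.name).name
        if name.endswith("tokenizer.model") or name.endswith("vocab.txt"):
            with tar.extractfile(member) as src, open(name, "wb") as dst:
                dst.write(src.read())
            print("extracted", name)
```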

#### Transcribe a file (non-streaming)

```python
import sherpa_onnx
import soundfile as sf

recognizer = sherpa_onnx.OfflineRecognizer.from_transducer(
    encoder="onnx/encoder.onnx",
    decoder="onnx/decoder_joint.onnx",
    joiner="onnx/decoder_joint.onnx",  # fused model: same file for both
    tokens="vocab.txt",
    num_threads=4,
    sample_rate=16000,
    feature_dim=80,
)

audio, sr = sf.read("audio.wav", dtype="float32")
assert sr == 16000, "Resample to 16 kHz first"

stream = recognizer.create_stream()
stream.accept_waveform(sr, audio)
recognizer.decode_stream(stream)
print(stream.result.text)
```
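
If the source audio is not already 16 kHz mono, resample it first. A minimal sketch assuming `librosa` is installed (an `ffmpeg` one-liner works just as well):

```python
# Convert arbitrary input audio to 16 kHz mono WAV before inference.
# Equivalent CLI: ffmpeg -i input.mp3 -ar 16000 -ac 1 audio.wav
import librosa
import soundfile as sf

audio, sr = librosa.load("input.mp3", sr=16000, mono=True)  # float32 in [-1, 1]
sf.write("audio.wav", audio, sr)
```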

#### Real-time streaming transcription

```python
import sherpa_onnx
import sounddevice as sd
import numpy as np

SAMPLE_RATE = 16000
CHUNK_MS = 160  # 160 ms per chunk
CHUNK_SAMPLES = int(SAMPLE_RATE * CHUNK_MS / 1000)

recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
    encoder="onnx/encoder.onnx",
    decoder="onnx/decoder_joint.onnx",
    joiner="onnx/decoder_joint.onnx",
    tokens="vocab.txt",
    num_threads=4,
    sample_rate=SAMPLE_RATE,
    feature_dim=80,
    decoding_method="modified_beam_search",
    max_active_paths=4,
    enable_endpoint_detection=True,
    rule1_min_trailing_silence=2.4,
    rule2_min_trailing_silence=1.2,
    rule3_min_utterance_length=20.0,
)

stream = recognizer.create_stream()

def callback(indata, frames, time, status):
    audio = indata[:, 0].astype(np.float32)
    stream.accept_waveform(SAMPLE_RATE, audio)
    while recognizer.is_ready(stream):
        recognizer.decode_stream(stream)
    text = recognizer.get_result(stream)  # decoded text so far, as a str
    if text:
        print(f"\r{text}", end="", flush=True)
    if recognizer.is_endpoint(stream):
        print()
        recognizer.reset(stream)

with sd.InputStream(samplerate=SAMPLE_RATE, channels=1,
                    blocksize=CHUNK_SAMPLES, callback=callback):
    print("Listening — press Ctrl+C to stop")
    while True:
        sd.sleep(100)
```
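
To test the streaming pipeline without a microphone, you can replay a WAV file through the same recognizer in 160 ms chunks. A sketch reusing the `recognizer`, `SAMPLE_RATE`, and `CHUNK_SAMPLES` defined above:

```python
# Simulated streaming: feed a 16 kHz WAV file in 160 ms chunks.
import numpy as np
import soundfile as sf

audio, sr = sf.read("audio.wav", dtype="float32")
assert sr == SAMPLE_RATE, "Resample to 16 kHz first"

stream = recognizer.create_stream()
for start in range(0, len(audio), CHUNK_SAMPLES):
    stream.accept_waveform(SAMPLE_RATE, audio[start:start + CHUNK_SAMPLES])
    while recognizer.is_ready(stream):
        recognizer.decode_stream(stream)

# Flush with a little trailing silence so the final tokens are emitted.
stream.accept_waveform(SAMPLE_RATE, np.zeros(int(0.6 * SAMPLE_RATE), dtype=np.float32))
stream.input_finished()
while recognizer.is_ready(stream):
    recognizer.decode_stream(stream)

print(recognizer.get_result(stream))
```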

---

### Option C — WebSocket / REST server

The full server is in the [audio-STT repository](https://github.com/nur-dev/audio-STT). Quick start:

```bash
pip install sherpa-onnx fastapi uvicorn websockets soundfile numpy

python serving/serve_streaming.py \
  --encoder onnx/encoder.onnx \
  --decoder onnx/decoder_joint.onnx \
  --joiner onnx/decoder_joint.onnx \
  --tokens vocab.txt \
  --host 0.0.0.0 \
  --port 8001
```

**REST endpoint:**

```bash
curl -X POST http://localhost:8001/transcribe \
  -F "file=@audio.wav" | jq .
# {"text": "транскрипция аудио"}
```
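
The same endpoint can be called from Python with `requests`, assuming the multipart field name `file` shown in the curl example:

```python
# Call the REST transcription endpoint (pip install requests).
import requests

with open("audio.wav", "rb") as f:
    resp = requests.post("http://localhost:8001/transcribe", files={"file": f})
resp.raise_for_status()
print(resp.json()["text"])
```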

**WebSocket (streaming):**

```javascript
const ws = new WebSocket("ws://localhost:8001/ws/transcribe");
ws.onmessage = (e) => console.log(JSON.parse(e.data));

// Send raw 16-bit PCM at 16 kHz in 160 ms chunks.
// Note: MediaRecorder emits compressed containers (webm/opus), not raw PCM, so
// capture audio with an AudioWorklet/ScriptProcessorNode and convert the Float32
// samples to Int16 before sending:
pcmNode.port.onmessage = (e) => ws.send(e.data);  // e.data: ArrayBuffer of Int16 PCM
```
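
A Python client sketch for the same WebSocket endpoint. It assumes the server accepts raw little-endian int16 PCM frames and replies with JSON text messages; the exact framing is defined by `serving/serve_streaming.py` in the audio-STT repository, so adjust accordingly.

```python
# WebSocket streaming client sketch (pip install websockets soundfile numpy).
import asyncio
import json

import numpy as np
import soundfile as sf
import websockets

SAMPLE_RATE = 16000
CHUNK_SAMPLES = SAMPLE_RATE * 160 // 1000  # 160 ms of audio per message

async def main():
    audio, sr = sf.read("audio.wav", dtype="float32")
    assert sr == SAMPLE_RATE, "Resample to 16 kHz first"
    pcm16 = (np.clip(audio, -1.0, 1.0) * 32767).astype("<i2").tobytes()
    chunk_bytes = CHUNK_SAMPLES * 2  # 2 bytes per int16 sample

    async with websockets.connect("ws://localhost:8001/ws/transcribe") as ws:
        for i in range(0, len(pcm16), chunk_bytes):
            await ws.send(pcm16[i:i + chunk_bytes])
            # Print any partial result the server has pushed so far.
            try:
                msg = await asyncio.wait_for(ws.recv(), timeout=0.01)
                print(json.loads(msg))
            except asyncio.TimeoutError:
                pass

asyncio.run(main())
```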

---

## Limitations

- **Kazakh (38.78% WER):** Training data is predominantly formal/read speech. Conversational Kazakh (call center, spontaneous speech) will have higher WER.
- **Russian/Uzbek out-of-domain:** FLEURS WER is significantly higher than in-domain WER (ru: 57.51%, uz: 45.31%), indicating sensitivity to recording conditions and speaking style.
- **No language identification:** The model does not auto-detect the language. Accuracy on mixed-language audio has not been characterized.
- **16 kHz mono only:** Audio must be resampled to 16 kHz mono before inference.


---

## License

[Creative Commons Attribution Non Commercial 4.0 (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/)

This model may not be used for commercial purposes without explicit written permission.