# NVIDIA Conformer-CTC Large (Catalan) - ONNX
This is an ONNX export of `nvidia/stt_ca_conformer_ctc_large` for use with sherpa-onnx and other ONNX runtimes.

The original model transcribes speech into the lowercase Catalan alphabet (including spaces, dashes, and apostrophes) and was trained on roughly 1023 hours of Catalan speech from Mozilla Common Voice 9.0.
## Files

| File | Description | Size |
|---|---|---|
| `model.onnx` | ONNX model (Conformer encoder + CTC decoder) | ~507 MB |
| `tokens.txt` | BPE vocabulary (128 tokens + blank) | 933 bytes |
## Usage with sherpa-onnx

### Python

```python
import sherpa_onnx
import soundfile as sf

recognizer = sherpa_onnx.OfflineRecognizer.from_nemo_ctc(
    model="model.onnx",
    tokens="tokens.txt",
    num_threads=4,
)

audio, sample_rate = sf.read("audio.wav")

stream = recognizer.create_stream()
stream.accept_waveform(sample_rate, audio)
recognizer.decode_stream(stream)

print(stream.result.text)
```
### C++ / Rust / Other Languages

See the sherpa-onnx documentation for bindings in C++, C, Rust, Go, Swift, Kotlin, and more.
### With Gibberish Desktop

This model is natively supported by Gibberish, a local, real-time speech-to-text application. Simply select "Conformer CTC (Catalan)" from the model list.
## ONNX Export

This model was exported using the following script:

```python
import nemo.collections.asr as nemo_asr
from huggingface_hub import hf_hub_download

# Download original NeMo model
nemo_path = hf_hub_download(
    repo_id="nvidia/stt_ca_conformer_ctc_large",
    filename="stt_ca_conformer_ctc_large.nemo",
)

# Load and export
m = nemo_asr.models.EncDecCTCModel.restore_from(nemo_path)
m.eval()

# Export tokens (BPE vocabulary)
vocab_size = m.tokenizer.vocab_size
with open("tokens.txt", "w", encoding="utf-8") as f:
    for i in range(vocab_size):
        token = m.tokenizer.ids_to_tokens([i])[0]
        f.write(f"{token} {i}\n")
    f.write(f"<blk> {vocab_size}\n")

# Export ONNX model
m.export("model.onnx")
```
### Requirements

- `nemo_toolkit[asr]`
- `torch<2.6` (for ONNX export compatibility)
- `onnx`
- `huggingface_hub`
## Model Architecture

Conformer-CTC is a non-autoregressive variant of the Conformer model [1] for Automatic Speech Recognition that uses CTC loss/decoding. The architecture combines:
- Convolution modules for local feature extraction
- Self-attention modules for global context modeling
- CTC decoder for non-autoregressive transcription
See the NeMo documentation for complete architecture details.
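The CTC decoding step can be illustrated with a toy greedy decoder: take the most probable token id for each frame, collapse consecutive repeats, then drop blanks. This sketch is purely illustrative (sherpa-onnx performs this internally), and `ctc_greedy_decode` is a hypothetical helper name:

```python
def ctc_greedy_decode(frame_ids, blank_id):
    """Standard CTC greedy decoding: collapse consecutive repeated
    ids, then remove blank symbols."""
    out = []
    prev = None
    for i in frame_ids:
        if i != prev:          # collapse consecutive repeats
            if i != blank_id:  # drop blanks
                out.append(i)
        prev = i
    return out
```

For example, per-frame ids `[1, 1, 3, 2, 2, 3, 3, 1]` with blank id 3 decode to `[1, 2, 1]`: the repeated 1s and 2s collapse, the blanks separate the two occurrences of token 1.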
## Input/Output

### Input

- 16 kHz mono-channel audio (WAV format recommended)
- Audio is converted to 80-dimensional mel-filterbank features internally

### Output

- Transcribed text in lowercase Catalan
- Supported characters:

  `' - a b c d e f g h i j k l m n o p q r s t u v w x y z · à á ç è é í ï ñ ò ó ú ü ı – —`
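If the source audio is not already 16 kHz mono, it should be converted before recognition. A minimal NumPy-only sketch using linear-interpolation resampling (`to_16k_mono` is a hypothetical helper; a dedicated resampler such as `soxr` or `librosa` would give better quality):

```python
import numpy as np

def to_16k_mono(audio, sample_rate, target_rate=16000):
    """Downmix (samples, channels) audio to mono and resample to
    target_rate via linear interpolation."""
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim == 2:  # average channels to mono
        audio = audio.mean(axis=1)
    if sample_rate != target_rate:
        n_target = int(round(len(audio) * target_rate / sample_rate))
        x_old = np.linspace(0.0, 1.0, num=len(audio), endpoint=False)
        x_new = np.linspace(0.0, 1.0, num=n_target, endpoint=False)
        audio = np.interp(x_new, x_old, audio).astype(np.float32)
    return audio
```

The result can be passed to `stream.accept_waveform(16000, audio)` as in the usage example above.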
## Performance
| Tokenizer | Vocabulary Size | Dev WER | Test WER | Dataset |
|---|---|---|---|---|
| SentencePiece Unigram | 128 | 4.70% | 4.27% | MCV-9.0 |
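WER is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. A self-contained sketch for illustration (the figures in the table above are NVIDIA's reported results, not computed by this function):

```python
def wer(reference, hypothesis):
    """Word error rate: Levenshtein distance over word sequences,
    normalized by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution / match
        prev = cur
    return prev[-1] / len(ref)
```

For example, `wer("el gat negre", "el gos negre")` is 1/3: one substitution over three reference words.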
## Limitations
- Performance may degrade for speech with technical terms or vernacular not in the training data
- May perform worse for heavily accented speech
- Optimized for clean, close-microphone audio at 16 kHz
## License
This model is released under CC-BY-4.0, following the original NVIDIA model license.
## References
- Conformer: Convolution-augmented Transformer for Speech Recognition
- Google SentencePiece Tokenizer
- NVIDIA NeMo Toolkit
- sherpa-onnx
## Acknowledgments
- Original model by NVIDIA NeMo
- ONNX conversion for Gibberish