# NVIDIA Conformer-CTC Large (Catalan) - ONNX
This is an ONNX export of `nvidia/stt_ca_conformer_ctc_large` for use with sherpa-onnx and other ONNX runtimes.

The original model transcribes speech into the lowercase Catalan alphabet (including spaces, dashes, and apostrophes) and was trained on roughly 1023 hours of Catalan speech from Mozilla Common Voice 9.0.
## Files

| File | Description | Size |
|---|---|---|
| `model.onnx` | ONNX model (Conformer encoder + CTC decoder) | ~507 MB |
| `tokens.txt` | BPE vocabulary (128 tokens + blank) | 933 bytes |
## Usage with sherpa-onnx

### Python

```python
import sherpa_onnx
import soundfile as sf

recognizer = sherpa_onnx.OfflineRecognizer.from_nemo_ctc(
    model="model.onnx",
    tokens="tokens.txt",
    num_threads=4,
)

audio, sample_rate = sf.read("audio.wav")

stream = recognizer.create_stream()
stream.accept_waveform(sample_rate, audio)
recognizer.decode_stream(stream)

print(stream.result.text)
```
### C++ / Rust / Other Languages

See the sherpa-onnx documentation for bindings in C++, C, Rust, Go, Swift, Kotlin, and more.
### With Gibberish Desktop

This model is natively supported by Gibberish, a local, real-time speech-to-text application. Simply select "Conformer CTC (Catalan)" from the model list.
## ONNX Export

This model was exported using the following script:

```python
import nemo.collections.asr as nemo_asr
from huggingface_hub import hf_hub_download

# Download original NeMo model
nemo_path = hf_hub_download(
    repo_id="nvidia/stt_ca_conformer_ctc_large",
    filename="stt_ca_conformer_ctc_large.nemo",
)

# Load and export
m = nemo_asr.models.EncDecCTCModel.restore_from(nemo_path)
m.eval()

# Export tokens (BPE vocabulary)
vocab_size = m.tokenizer.vocab_size
with open("tokens.txt", "w", encoding="utf-8") as f:
    for i in range(vocab_size):
        token = m.tokenizer.ids_to_tokens([i])[0]
        f.write(f"{token} {i}\n")
    f.write(f"<blk> {vocab_size}\n")

# Export ONNX model
m.export("model.onnx")
```
### Requirements

- `nemo_toolkit[asr]`
- `torch<2.6` (for ONNX export compatibility)
- `onnx`
- `huggingface_hub`
## Model Architecture

Conformer-CTC is a non-autoregressive variant of the Conformer model [1] for Automatic Speech Recognition that uses CTC loss/decoding. The architecture combines:
- Convolution modules for local feature extraction
- Self-attention modules for global context modeling
- CTC decoder for non-autoregressive transcription
See the NeMo documentation for complete architecture details.
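The CTC decoding step can be illustrated with a toy greedy decoder: take the most probable token id for each frame, collapse consecutive repeats, then drop blanks. This sketch is purely illustrative (sherpa-onnx performs this internally), and `ctc_greedy_decode` is a hypothetical helper name:

```python
def ctc_greedy_decode(frame_ids, blank_id):
    """Standard CTC greedy decoding: collapse consecutive repeated
    ids, then remove blank symbols."""
    out = []
    prev = None
    for i in frame_ids:
        if i != prev:          # collapse consecutive repeats
            if i != blank_id:  # drop blanks
                out.append(i)
        prev = i
    return out
```

For example, per-frame ids `[1, 1, 3, 2, 2, 3, 3, 1]` with blank id 3 decode to `[1, 2, 1]`: the repeated 1s and 2s collapse, the blanks separate the two occurrences of token 1.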
## Input/Output

### Input

- 16 kHz mono-channel audio (WAV format recommended)
- Audio is converted to 80-dimensional mel-filterbank features internally

### Output

- Transcribed text in lowercase Catalan
- Supported characters:

  `' - a b c d e f g h i j k l m n o p q r s t u v w x y z · à á ç è é í ï ñ ò ó ú ü ı – —`
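If the source audio is not already 16 kHz mono, it should be converted before recognition. A minimal NumPy-only sketch using linear-interpolation resampling (`to_16k_mono` is a hypothetical helper; a dedicated resampler such as `soxr` or `librosa` would give better quality):

```python
import numpy as np

def to_16k_mono(audio, sample_rate, target_rate=16000):
    """Downmix (samples, channels) audio to mono and resample to
    target_rate via linear interpolation."""
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim == 2:  # average channels to mono
        audio = audio.mean(axis=1)
    if sample_rate != target_rate:
        n_target = int(round(len(audio) * target_rate / sample_rate))
        x_old = np.linspace(0.0, 1.0, num=len(audio), endpoint=False)
        x_new = np.linspace(0.0, 1.0, num=n_target, endpoint=False)
        audio = np.interp(x_new, x_old, audio).astype(np.float32)
    return audio
```

The result can be passed to `stream.accept_waveform(16000, audio)` as in the usage example above.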
## Performance
| Tokenizer | Vocabulary Size | Dev WER | Test WER | Dataset |
|---|---|---|---|---|
| SentencePiece Unigram | 128 | 4.70% | 4.27% | MCV-9.0 |
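WER is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. A self-contained sketch for illustration (the figures in the table above are NVIDIA's reported results, not computed by this function):

```python
def wer(reference, hypothesis):
    """Word error rate: Levenshtein distance over word sequences,
    normalized by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution / match
        prev = cur
    return prev[-1] / len(ref)
```

For example, `wer("el gat negre", "el gos negre")` is 1/3: one substitution over three reference words.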
## Limitations
- Performance may degrade for speech with technical terms or vernacular not in the training data
- May perform worse for heavily accented speech
- Optimized for clean, close-microphone audio at 16 kHz
## License
This model is released under CC-BY-4.0, following the original NVIDIA model license.
## References
- Conformer: Convolution-augmented Transformer for Speech Recognition
- Google SentencePiece Tokenizer
- NVIDIA NeMo Toolkit
- sherpa-onnx
## Acknowledgments
- Original model by NVIDIA NeMo
- ONNX conversion for Gibberish