NVIDIA Conformer-CTC Large (Catalan) - ONNX

| Model architecture | Model size | Language | Format |
|---|---|---|---|
| Conformer-CTC Large | ~507 MB | Catalan (ca) | ONNX |

This is an ONNX export of nvidia/stt_ca_conformer_ctc_large for use with sherpa-onnx and other ONNX runtimes.

The original model transcribes speech into the lowercase Catalan alphabet, including spaces, dashes, and apostrophes, and was trained on around 1,023 hours of Catalan speech from Mozilla Common Voice 9.0.

Files

| File | Description | Size |
|---|---|---|
| model.onnx | ONNX model (Conformer encoder + CTC decoder) | ~507 MB |
| tokens.txt | BPE vocabulary (128 tokens + blank) | 933 bytes |
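
For reference, tokens.txt follows the layout sherpa-onnx expects: one "token id" pair per line, with the CTC blank appended after the 128 BPE tokens. An illustrative excerpt with hypothetical tokens (the real entries differ):

▁a 0
▁de 1
...
<blk> 128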

Usage with sherpa-onnx

Python

import sherpa_onnx
import soundfile as sf

# Build an offline (non-streaming) recognizer from the NeMo CTC export
recognizer = sherpa_onnx.OfflineRecognizer.from_nemo_ctc(
    model="model.onnx",
    tokens="tokens.txt",
    num_threads=4,
)

# Load audio as mono float32 samples in [-1, 1]
audio, sample_rate = sf.read("audio.wav", dtype="float32")

stream = recognizer.create_stream()
stream.accept_waveform(sample_rate, audio)
recognizer.decode_stream(stream)

print(stream.result.text)
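
To transcribe several files in one call, streams can be batched and decoded together; decode_streams is part of the sherpa-onnx Python API, and the file names below are placeholders:

streams = []
for path in ["first.wav", "second.wav"]:  # placeholder file names
    audio, sample_rate = sf.read(path, dtype="float32")
    s = recognizer.create_stream()
    s.accept_waveform(sample_rate, audio)
    streams.append(s)

recognizer.decode_streams(streams)
for s in streams:
    print(s.result.text)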

C++ / Rust / Other Languages

See the sherpa-onnx documentation for bindings in C++, C, Rust, Go, Swift, Kotlin, and more.

With Gibberish Desktop

This model is natively supported by Gibberish, a local, real-time speech-to-text application. Simply select "Conformer CTC (Catalan)" from the model list.

ONNX Export

This model was exported using the following script:

import nemo.collections.asr as nemo_asr
from huggingface_hub import hf_hub_download

# Download original NeMo model
nemo_path = hf_hub_download(
    repo_id="nvidia/stt_ca_conformer_ctc_large",
    filename="stt_ca_conformer_ctc_large.nemo"
)

# Load the checkpoint and switch to inference mode
m = nemo_asr.models.EncDecCTCModelBPE.restore_from(nemo_path)
m.eval()

# Export tokens (BPE vocabulary)
vocab_size = m.tokenizer.vocab_size
with open("tokens.txt", "w", encoding="utf-8") as f:
    for i in range(vocab_size):
        token = m.tokenizer.ids_to_tokens([i])[0]
        f.write(f"{token} {i}\n")
    f.write(f"<blk> {vocab_size}\n")

# Export ONNX model
m.export("model.onnx")
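
After exporting, a quick sanity check is to open the model with onnxruntime and list its inputs and outputs; the names and shapes depend on the NeMo export, so none are hard-coded here:

import onnxruntime as ort

# Load the exported graph on CPU and print its I/O signature
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
for inp in sess.get_inputs():
    print("input:", inp.name, inp.shape, inp.type)
for out in sess.get_outputs():
    print("output:", out.name, out.shape, out.type)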

Requirements

  • nemo_toolkit[asr]
  • torch<2.6 (for ONNX export compatibility)
  • onnx
  • huggingface_hub
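
For example, in a fresh environment:

pip install "nemo_toolkit[asr]" "torch<2.6" onnx huggingface_hub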

Model Architecture

Conformer-CTC is a non-autoregressive variant of the Conformer model [1] for automatic speech recognition that uses CTC loss/decoding in place of an autoregressive decoder. The architecture combines:

  • Convolution modules for local feature extraction
  • Self-attention modules for global context modeling
  • CTC decoder for non-autoregressive transcription

See the NeMo documentation for complete architecture details.
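
To make the CTC part concrete, here is a minimal sketch of greedy CTC decoding (collapse repeated frame labels, then drop blanks); runtimes such as sherpa-onnx perform this step internally:

import numpy as np

def ctc_greedy_decode(log_probs: np.ndarray, blank_id: int) -> list:
    # log_probs: (num_frames, vocab_size) frame-wise log-probabilities
    best = log_probs.argmax(axis=-1)
    ids = []
    prev = blank_id
    for t in best:
        if t != blank_id and t != prev:
            ids.append(int(t))  # keep the first frame of each new non-blank label
        prev = t
    return ids  # token ids; map back to text with the BPE vocabulary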

Input/Output

Input

  • 16 kHz mono-channel audio (WAV format recommended)
  • Audio is converted to 80-dimensional mel-filterbank features internally (see the sketch below)
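
For intuition, 80-dimensional log-mel features can be computed as sketched below. This is illustrative only: the sherpa-onnx runtime extracts the real features itself, and the window/hop values here are typical NeMo defaults rather than parameters read from this model:

import librosa
import numpy as np

# 25 ms windows with a 10 ms hop at 16 kHz (assumed defaults)
audio, sr = librosa.load("audio.wav", sr=16000, mono=True)
mel = librosa.feature.melspectrogram(
    y=audio, sr=sr, n_fft=512, win_length=400, hop_length=160, n_mels=80
)
log_mel = np.log(mel + 1e-9)
print(log_mel.shape)  # (80, num_frames)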

Output

  • Transcribed text in lowercase Catalan
  • Supported characters: ' - a b c d e f g h i j k l m n o p q r s t u v w x y z · à á ç è é í ï ñ ò ó ú ü ı – —

Performance

| Tokenizer | Vocabulary Size | Dev WER | Test WER | Dataset |
|---|---|---|---|---|
| SentencePiece Unigram | 128 | 4.70% | 4.27% | MCV-9.0 |
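
WER counts word substitutions, deletions, and insertions against a reference transcript. To reproduce the metric on your own data, a minimal sketch assuming the third-party jiwer package (strings are placeholders):

import jiwer

references = ["bon dia a tothom"]  # placeholder ground truth
hypotheses = ["bon dia tothom"]    # placeholder model output
print(f"WER: {jiwer.wer(references, hypotheses):.2%}")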

Limitations

  • Performance may degrade for speech with technical terms or vernacular not in the training data
  • May perform worse for heavily accented speech
  • Optimized for clean, close-microphone audio at 16 kHz

License

This model is released under CC-BY-4.0, following the original NVIDIA model license.

References

  1. Conformer: Convolution-augmented Transformer for Speech Recognition
  2. Google SentencePiece Tokenizer
  3. NVIDIA NeMo Toolkit
  4. sherpa-onnx

