catalan-verification-model-pkt-b

Paper
Model Summary
Intended Uses and Limitations
How to Get Started with the Model
Training Details
Citation
Additional Information

Model Summary

We define verification models as ASR models specifically designed to assess the reliability of transcriptions. These models are particularly useful when no reference transcription is available, as they can generate hypotheses with a certain degree of confidence.

The core idea behind verification models is to train or fine-tune two or more ASR models on different datasets. If these models produce identical transcriptions for the same audio input, the result is likely to be accurate. Furthermore, if a verification model agrees with an existing reference transcription, this agreement can also be interpreted as a signal of reliability.
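The agreement test described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the `normalize` and `hypotheses_agree` helpers and the normalization rules (lowercasing, punctuation removal, whitespace collapsing) are assumptions introduced here, since the card only says that the transcriptions must be identical.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace (an assumed normalization)."""
    text = unicodedata.normalize("NFC", text).lower()
    # Keep word characters, spaces, apostrophes, and the Catalan middle dot.
    text = re.sub(r"[^\w\s'·]", " ", text)
    return " ".join(text.split())


def hypotheses_agree(*hypotheses: str) -> bool:
    """Return True when every hypothesis is identical after normalization."""
    return len({normalize(h) for h in hypotheses}) == 1


# Two models produce the same words, so the result is treated as likely accurate.
print(hypotheses_agree("Bon dia, com estàs?", "bon dia com estàs"))  # True
# A model output differs from a reference, so the reference is not corroborated.
print(hypotheses_agree("bon dia com estàs", "bon dia com està"))  # False
```

The same function covers both cases in the paragraph: comparing two model outputs, or comparing one model output against an existing reference.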

In this model card, we present Verification Model B for Catalan, available as "catalan-verification-model-pkt-b". This acoustic model is based on "nvidia/parakeet-rnnt-1.1b" and is designed for Automatic Speech Recognition in Catalan. It is intended to be used in tandem with Verification Model A, "catalan-verification-model-pkt-a", to enable cross-verification and boost transcription confidence in unannotated or weakly supervised scenarios.

The datasets used to train models A and B were partitioned between the two models using the following pseudocode:

dataset_A = []
dataset_B = []
# training_corpus is the ordered collection of training recordings.
for index, recording in enumerate(training_corpus):
    if index % 2 == 0:
        # Even-indexed recordings go to Model A.
        dataset_A.append(recording)
    else:
        # Odd-indexed recordings go to Model B.
        dataset_B.append(recording)

Intended Uses and Limitations

This model is designed for the following scenarios:

Verification of transcriptions: When two or more verification models produce the same output for a given audio segment, the transcription can be considered highly reliable. This is particularly useful in low-resource or weakly supervised settings.
Transcription without references: In situations where no reference transcription exists, this model can still produce a hypothesis that, when corroborated by a second verification model, may be considered trustworthy.
Data filtering and quality control: It can be used to automatically detect and retain high-confidence segments in large-scale speech datasets (e.g., for training or evaluation purposes).
Human-in-the-loop workflows: These models can assist human annotators by flagging reliable transcriptions, helping reduce manual verification time.
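The data-filtering and human-in-the-loop scenarios above could be combined into a simple triage step. The sketch below is illustrative only: the `Segment` record, its field names, and the `triage` function are hypothetical, and exact string equality is assumed as the agreement criterion.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    # Hypothetical field names, for illustration only.
    audio_path: str
    hypothesis_a: str  # output of Verification Model A
    hypothesis_b: str  # output of Verification Model B


def triage(segments):
    """Split segments into auto-accepted ones and ones needing human review."""
    accepted, needs_review = [], []
    for seg in segments:
        if seg.hypothesis_a.strip() == seg.hypothesis_b.strip():
            accepted.append(seg)
        else:
            needs_review.append(seg)
    return accepted, needs_review


segments = [
    Segment("clip_001.wav", "bon dia", "bon dia"),
    Segment("clip_002.wav", "fins demà", "fins a demà"),
]
accepted, needs_review = triage(segments)
print([s.audio_path for s in accepted])      # ['clip_001.wav']
print([s.audio_path for s in needs_review])  # ['clip_002.wav']
```

Accepted segments can be kept as high-confidence data, while the remainder is routed to annotators.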

As limitations, we identify the following:

No ground-truth guarantee: Agreement between models does not guarantee correctness; it only increases the likelihood of reliability.
Domain sensitivity: The accuracy and agreement rate may drop if used on speech data that differs significantly from the training domain (e.g., different accents, topics, or recording conditions).
Designed for pairwise comparison: This model is intended to work in conjunction with at least one other verification model. Using it in isolation does not provide verification benefits.
Language and model-specific: This particular model is optimized for Catalan and based on the Parakeet RNNT architecture. Performance in other languages or under different acoustic models may vary significantly.

How to Get Started with the Model

For an up-to-date, working version of this code, please see NVIDIA's official repository.

Installation

To use this model, you may install the NVIDIA NeMo Framework:

Create a virtual environment:

python -m venv /path/to/venv

Activate the environment:

source /path/to/venv/bin/activate

Install the modules:

BRANCH=main
python -m pip install "git+https://github.com/NVIDIA/NeMo.git@${BRANCH}#egg=nemo_toolkit[all]"

For Inference

To transcribe audio in Catalan using this model, you can follow this example:

import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(model_name="BSC-LT/catalan-verification-model-pkt-b")

output = asr_model.transcribe(['YOUR_WAV_FILE.wav'])
print(output[0].text)
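Because this model is meant to be paired with Verification Model A, a cross-verification run might look like the sketch below. It makes several assumptions: that both models load through `EncDecRNNTBPEModel.from_pretrained`, that `transcribe` returns objects with a `.text` attribute as in the example above, and that the repository IDs in the usage comment are placeholders to be replaced. `load_model` and `cross_verify` are hypothetical helpers, not part of NeMo.

```python
def load_model(model_name):
    """Load a NeMo RNNT model; NeMo is imported lazily so the rest runs without it."""
    import nemo.collections.asr as nemo_asr

    return nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(model_name=model_name)


def cross_verify(model_a, model_b, wav_paths):
    """Transcribe each file with both models and report whether they agree."""
    outputs_a = model_a.transcribe(wav_paths)
    outputs_b = model_b.transcribe(wav_paths)
    report = []
    for path, out_a, out_b in zip(wav_paths, outputs_a, outputs_b):
        report.append({
            "audio": path,
            "model_a": out_a.text,
            "model_b": out_b.text,
            "agree": out_a.text == out_b.text,
        })
    return report


# Intended usage (repository IDs are placeholders, not confirmed names):
# model_a = load_model("<org>/catalan-verification-model-pkt-a")
# model_b = load_model("<org>/catalan-verification-model-pkt-b")
# for row in cross_verify(model_a, model_b, ["YOUR_WAV_FILE.wav"]):
#     print(row)
```

Files where `agree` is true can be treated as higher-confidence transcriptions; the others need a reference or a human check.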

Training Details

Training data

The specific datasets used to create the model are:

Training procedure

This model is the result of fine-tuning the model "parakeet-rnnt-1.1b" by following this tutorial.

Training Hyperparameters

language: Catalan
hours of training audio: 1799
learning rate: 2e-4
devices=4
num_nodes=8
batch_size=8
accelerator=accelerator
strategy="ddp"
max_epochs=20
enable_checkpointing=True
logger=False
log_every_n_steps=100
check_val_every_n_epoch=1
precision='bf16-mixed'
callbacks=[checkpoint_callback]
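As a rough reading of these settings, the global batch size can be estimated from the device counts. The calculation below assumes that `batch_size` is the per-device batch size and that no gradient accumulation was used; neither is stated in this card.

```python
# Values copied from the hyperparameter list above.
devices = 4      # GPUs per node
num_nodes = 8
batch_size = 8   # assumed to be per device

# Under DDP each process handles its own batch, so the global batch
# is the per-device batch multiplied by the number of processes.
world_size = devices * num_nodes
global_batch_size = world_size * batch_size

print(world_size)         # 32
print(global_batch_size)  # 256
```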

Citation

If this model contributes to your research, please cite the work:

@misc{bsc-catvermodel-pkt-b-2025,
  title={Catalan Verification Model Parakeet B},
  author={Hernandez Mena, Carlos Daniel and Messaoudi, Abir and España-Bonet, Cristina},
  organization={Barcelona Supercomputing Center},
  url={https://huggingface.co/langtech-veu/catalan-verification-model-pkt-b},
  year={2025}
}

Additional Information

Author

The fine-tuning was carried out in June 2025 at the Language Technologies Laboratory of the Barcelona Supercomputing Center by Carlos Daniel Hernández Mena, under the supervision of Cristina España-Bonet. Abir Messaoudi validated the model.

Contact

For further information, please email bsc-lt@bsc.es.

Copyright

License

Apache-2.0

Funding

This work/research has been promoted and financed by the Government of Catalonia through the Aina project.

The training of the model was possible thanks to the computing time provided by Barcelona Supercomputing Center through MareNostrum 5.

We acknowledge the EuroHPC Joint Undertaking for awarding us access to MareNostrum 5 at BSC, Spain.

Evaluation results

WER on Common Voice 17.0 Catalan (Test), self-reported: 3.735
WER on Common Voice 17.0 Catalan (Dev), self-reported: 3.409
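For readers unfamiliar with the metric, word error rate is the word-level edit distance between reference and hypothesis divided by the number of reference words. The sketch below implements that standard definition and returns a percentage, on the assumption that the figures above are percentages; it is not the evaluation script used for this model, and the text normalization applied before scoring is not described in this card.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage using word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the distance between the processed reference prefix and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return 100.0 * previous[-1] / len(ref)


# One substituted word out of five reference words.
print(word_error_rate("el gat dorm al sofà", "el gat dorm al sofa"))  # 20.0
```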
