---
license: cc-by-4.0
language:
- eu
- gl
- ca
- es
metrics:
- perplexity
tags:
- kenlm
- n-gram
- language-model
- lm
- whisper
- automatic-speech-recognition
---

# Model Card for Whisper N-Gram Language Models

## Model Description

These are [KenLM](https://kheafield.com/code/kenlm/) n-gram language models
trained to support automatic speech recognition (ASR). They are designed to
work well with Whisper ASR models, but are applicable to any ASR system that
can use n-gram language models. By providing context-specific probabilities
for word sequences, they can improve recognition accuracy.

## Intended Use

These models are intended for language modeling tasks within ASR systems to
improve prediction accuracy, especially in low-resource language scenarios.
They can be integrated into any system that supports KenLM models.
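
As a sketch of such an integration, the example below rescores a small n-best list by interpolating an ASR score with an n-gram LM score. All hypotheses, scores, and the `lm_weight` value are invented for illustration; in a real system the LM score would come from `kenlm.Model.score` (a base-10 log probability).

```python
# Sketch: rescoring ASR n-best hypotheses with an n-gram LM score.
# All scores below are invented for illustration.

def rescore(hypotheses, lm_weight=0.5):
    """Sort hypotheses by interpolated score, best first.

    Each hypothesis is a (text, asr_score, lm_score) tuple, with both
    scores as log-probabilities in a consistent base.
    """
    return sorted(
        hypotheses,
        key=lambda h: h[1] + lm_weight * h[2],
        reverse=True,
    )

nbest = [
    ("talka diskoetxearekin grabatzen ditut", -12.0, -20.0),
    ("talka diskoetxea rekin grabatzen ditut", -11.5, -35.0),
]
print(rescore(nbest)[0][0])  # the LM score promotes the first hypothesis
```

The `lm_weight` hyperparameter is typically tuned on a development set for each language and ASR system pair.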

## Model Details

Each model is built using the KenLM toolkit and is based on n-gram statistics
extracted from large, domain-specific corpora. The models available are:

- **Basque (eu)**: `5gram-eu.bin` (11 GB)
- **Galician (gl)**: `5gram-gl.bin` (8.4 GB)
- **Catalan (ca)**: `5gram-ca.bin` (20 GB)
- **Spanish (es)**: `5gram-es.bin` (13 GB)

## How to Use

Here is an example of how to download, load, and use the Basque model with
KenLM in Python:

```python
import kenlm
from huggingface_hub import hf_hub_download

# Download the 5-gram binary model from the Hugging Face Hub.
filepath = hf_hub_download(repo_id="HiTZ/whisper-lm-ngrams", filename="5gram-eu.bin")

# Load the model and score a sentence (returns a base-10 log probability).
model = kenlm.Model(filepath)
print(model.score("talka diskoetxearekin grabatzen ditut beti abestien maketak", bos=True, eos=True))
```
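
Since `score` returns a base-10 log probability, it can be converted into the perplexity metric used to evaluate these models. The helper below is a sketch that assumes whitespace tokenization and counts one extra end-of-sentence token, matching scoring with `bos=True, eos=True`; the `-20.0` score is a hypothetical value.

```python
def perplexity_from_score(log10_score, sentence):
    """Convert a KenLM base-10 log probability into per-token perplexity.

    Assumes whitespace tokenization plus one </s> token, matching
    model.score(sentence, bos=True, eos=True).
    """
    n_tokens = len(sentence.split()) + 1  # words plus </s>
    return 10.0 ** (-log10_score / n_tokens)

# With a hypothetical log10 score of -20.0 for a 7-word sentence:
print(perplexity_from_score(-20.0, "talka diskoetxearekin grabatzen ditut beti abestien maketak"))
```

Lower perplexity on held-out text indicates that the model assigns higher probability to the word sequences of that domain.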

## Training Data

The models were trained on corpora capped at 27 million sentences each to
maintain comparability and manageability. The sources for each language are:

- **Basque**: [EusCrawl 1.0](https://www.ixa.eus/euscrawl/)
- **Galician**: [SLI GalWeb Corpus](https://github.com/xavier-gz/SLI_Galician_Corpora)
- **Catalan**: [Catalan Textual Corpus](https://zenodo.org/records/4519349)
- **Spanish**: [Spanish LibriSpeech MLS](https://openslr.org/94/)

Additional data from recent [Wikipedia dumps](https://dumps.wikimedia.org/) and
the [Opus corpus](https://opus.nlpl.eu/) was used as needed to reach the
sentence cap.
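
A sentence cap like this can be applied in a single streaming pass over the corpus files. The sketch below deduplicates lines and stops at a limit; the tiny in-memory corpus and the small limit are illustrative only and do not reflect the actual preprocessing pipeline.

```python
from itertools import islice

def cap_sentences(lines, limit):
    """Return up to `limit` unique, non-empty sentences from an iterable."""
    seen = set()

    def unique():
        for line in lines:
            sentence = line.strip()
            if sentence and sentence not in seen:
                seen.add(sentence)
                yield sentence

    return list(islice(unique(), limit))

# Tiny illustrative corpus; a real run would stream files with a
# 27-million-sentence limit.
corpus = ["kaixo mundua\n", "kaixo mundua\n", "egun on\n", "\n", "agur\n"]
print(cap_sentences(corpus, 2))  # ['kaixo mundua', 'egun on']
```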

## Model Performance

Performance varies by language and by the quality of the training data. The
models are typically evaluated by perplexity on held-out text and by the
improvement in ASR accuracy they provide when integrated into a recognition
system.

## Considerations

These models are designed for use in research and production for
language-specific ASR tasks. They should be tested for bias and fairness to
ensure appropriate use in diverse settings.

## Citation

If you use these models in your research, please cite:

```bibtex
@misc{dezuazo2025whisperlmimprovingasrmodels,
  title={Whisper-LM: Improving ASR Models with Language Models for Low-Resource Languages},
  author={Xabier de Zuazo and Eva Navas and Ibon Saratxaga and Inma Hernáez Rioja},
  year={2025},
  eprint={2503.23542},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2503.23542},
}
```

See the paper preprint at
[arXiv:2503.23542](https://arxiv.org/abs/2503.23542) for more details.

## Licensing

These models are available under the
[Creative Commons Attribution 4.0 International License (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/).
You are free to use, modify, and distribute them as long as you credit the
original creators.

## Acknowledgements

We would like to express our gratitude to Niels Rogge for his guidance and
support in the creation of this repository. You can find more about his work
at [his Hugging Face profile](https://huggingface.co/nielsr).