---
license: cc-by-4.0
language:
- eu
- gl
- ca
- es
metrics:
- perplexity
tags:
- kenlm
- n-gram
- language-model
- lm
- whisper
- automatic-speech-recognition
---

# Model Card for Whisper N-Gram Language Models

## Model Description

These are [KenLM](https://kheafield.com/code/kenlm/) n-gram language models
trained to support automatic speech recognition (ASR). They are designed to
work well with Whisper ASR models, but are applicable to any ASR system that
can use n-gram language models. By providing context-specific probabilities
for word sequences, they can improve recognition accuracy.

## Intended Use

These models are intended for language modeling tasks within ASR systems to
improve prediction accuracy, especially in low-resource language scenarios.
They can be integrated into any system that supports KenLM models.
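
As a sketch of such an integration, the example below rescores a small n-best list by interpolating an ASR score with an n-gram LM score. All hypotheses, scores, and the `lm_weight` value are invented for illustration; in a real system the LM score would come from `kenlm.Model.score` (a base-10 log probability).

```python
# Sketch: rescoring ASR n-best hypotheses with an n-gram LM score.
# All scores below are invented for illustration.

def rescore(hypotheses, lm_weight=0.5):
    """Sort hypotheses by interpolated score, best first.

    Each hypothesis is a (text, asr_score, lm_score) tuple, with both
    scores as log-probabilities in a consistent base.
    """
    return sorted(
        hypotheses,
        key=lambda h: h[1] + lm_weight * h[2],
        reverse=True,
    )

nbest = [
    ("talka diskoetxearekin grabatzen ditut", -12.0, -20.0),
    ("talka diskoetxea rekin grabatzen ditut", -11.5, -35.0),
]
print(rescore(nbest)[0][0])  # the LM score promotes the first hypothesis
```

The `lm_weight` hyperparameter is typically tuned on a development set for each language and ASR system pair.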

## Model Details

Each model is built using the KenLM toolkit and is based on n-gram statistics
extracted from large, domain-specific corpora. The models available are:

- **Basque (eu)**: `5gram-eu.bin` (11 GB)
- **Galician (gl)**: `5gram-gl.bin` (8.4 GB)
- **Catalan (ca)**: `5gram-ca.bin` (20 GB)
- **Spanish (es)**: `5gram-es.bin` (13 GB)

## How to Use

Here is an example of how to download, load, and use the Basque model with
KenLM in Python:

```python
import kenlm
from huggingface_hub import hf_hub_download

# Download the 5-gram binary model from the Hugging Face Hub.
filepath = hf_hub_download(repo_id="HiTZ/whisper-lm-ngrams", filename="5gram-eu.bin")

# Load the model and score a sentence (returns a base-10 log probability).
model = kenlm.Model(filepath)
print(model.score("talka diskoetxearekin grabatzen ditut beti abestien maketak", bos=True, eos=True))
```
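
Since `score` returns a base-10 log probability, it can be converted into the perplexity metric used to evaluate these models. The helper below is a sketch that assumes whitespace tokenization and counts one extra end-of-sentence token, matching scoring with `bos=True, eos=True`; the `-20.0` score is a hypothetical value.

```python
def perplexity_from_score(log10_score, sentence):
    """Convert a KenLM base-10 log probability into per-token perplexity.

    Assumes whitespace tokenization plus one </s> token, matching
    model.score(sentence, bos=True, eos=True).
    """
    n_tokens = len(sentence.split()) + 1  # words plus </s>
    return 10.0 ** (-log10_score / n_tokens)

# With a hypothetical log10 score of -20.0 for a 7-word sentence:
print(perplexity_from_score(-20.0, "talka diskoetxearekin grabatzen ditut beti abestien maketak"))
```

Lower perplexity on held-out text indicates that the model assigns higher probability to the word sequences of that domain.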

## Training Data

The models were trained on corpora capped at 27 million sentences each to
maintain comparability and manageability. The sources for each language are:

- **Basque**: [EusCrawl 1.0](https://www.ixa.eus/euscrawl/)
- **Galician**: [SLI GalWeb Corpus](https://github.com/xavier-gz/SLI_Galician_Corpora)
- **Catalan**: [Catalan Textual Corpus](https://zenodo.org/records/4519349)
- **Spanish**: [Spanish LibriSpeech MLS](https://openslr.org/94/)

Additional data from recent [Wikipedia dumps](https://dumps.wikimedia.org/) and
the [Opus corpus](https://opus.nlpl.eu/) was used as needed to reach the
sentence cap.
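
A sentence cap like this can be applied in a single streaming pass over the corpus files. The sketch below deduplicates lines and stops at a limit; the tiny in-memory corpus and the small limit are illustrative only and do not reflect the actual preprocessing pipeline.

```python
from itertools import islice

def cap_sentences(lines, limit):
    """Return up to `limit` unique, non-empty sentences from an iterable."""
    seen = set()

    def unique():
        for line in lines:
            sentence = line.strip()
            if sentence and sentence not in seen:
                seen.add(sentence)
                yield sentence

    return list(islice(unique(), limit))

# Tiny illustrative corpus; a real run would stream files with a
# 27-million-sentence limit.
corpus = ["kaixo mundua\n", "kaixo mundua\n", "egun on\n", "\n", "agur\n"]
print(cap_sentences(corpus, 2))  # ['kaixo mundua', 'egun on']
```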

## Model Performance

Performance varies by language and by the quality of the training data. The
models are typically evaluated by perplexity on held-out text and by the
improvement in ASR accuracy they provide when integrated into a recognition
system.

## Considerations

These models are designed for use in research and production for
language-specific ASR tasks. They should be tested for bias and fairness to
ensure appropriate use in diverse settings.

## Citation

If you use these models in your research, please cite:

```bibtex
@misc{dezuazo2025whisperlmimprovingasrmodels,
  title={Whisper-LM: Improving ASR Models with Language Models for Low-Resource Languages},
  author={Xabier de Zuazo and Eva Navas and Ibon Saratxaga and Inma Hernáez Rioja},
  year={2025},
  eprint={2503.23542},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2503.23542},
}
```

See the paper preprint at
[arXiv:2503.23542](https://arxiv.org/abs/2503.23542) for more details.

## Licensing

These models are available under the
[Creative Commons Attribution 4.0 International License (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/).
You are free to use, modify, and distribute them as long as you credit the
original creators.

## Acknowledgements

We would like to express our gratitude to Niels Rogge for his guidance and
support in the creation of this repository. You can find more about his work
at [his Hugging Face profile](https://huggingface.co/nielsr).