KenLM Language Models for JonaWhisper

Pruned trigram language models trained on Wikipedia, used for context-aware spell correction in JonaWhisper.

Models

| File | Language | Order | Pruning | Quantization |
|---|---|---|---|---|
| fr.binary | French | 3-gram | --prune 0 0 1 | 8-bit |
| en.binary | English | 3-gram | --prune 0 0 1 | 8-bit |
| de.binary | German | 3-gram | --prune 0 0 1 | 8-bit |
| es.binary | Spanish | 3-gram | --prune 0 0 1 | 8-bit |
| pt.binary | Portuguese | 3-gram | --prune 0 0 1 | 8-bit |
| it.binary | Italian | 3-gram | --prune 0 0 1 | 8-bit |
| nl.binary | Dutch | 3-gram | --prune 0 0 1 | 8-bit |
| pl.binary | Polish | 3-gram | --prune 0 0 1 | 8-bit |
| ru.binary | Russian | 3-gram | --prune 0 0 1 | 8-bit |

Usage

These models are downloaded automatically by JonaWhisper when spell-check with KenLM reranking is enabled. They provide context-aware correction: SymSpell generates candidates for unknown words, and KenLM scores each candidate in trigram context to pick the most natural option.
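The reranking step described above can be sketched as follows. This is a minimal illustration, not JonaWhisper's actual code: the function name `pick_best`, the toy scorer, and the example words are all hypothetical. The scorer is passed in as a callable so the sketch runs without a model file.

```python
from typing import Callable, Sequence


def pick_best(
    left: Sequence[str],
    candidates: Sequence[str],
    right: Sequence[str],
    score: Callable[[str], float],
) -> str:
    """Return the candidate whose surrounding context scores highest.

    `left` and `right` are the neighbouring words; `score` maps a
    space-separated string to a number where higher means more natural.
    """
    def context_score(candidate: str) -> float:
        return score(" ".join([*left, candidate, *right]))

    # max() keeps the first candidate on ties, so candidate order matters.
    return max(candidates, key=context_score)


# Toy stand-in for a language model (hypothetical): counts known bigrams.
KNOWN_BIGRAMS = {("the", "weather"), ("weather", "is")}


def toy_score(sentence: str) -> float:
    words = sentence.split()
    return float(sum((a, b) in KNOWN_BIGRAMS for a, b in zip(words, words[1:])))


best = pick_best(["the"], ["whether", "weather", "wether"], ["is"], toy_score)
```

With the `kenlm` Python bindings installed and a model file downloaded, something like `kenlm.Model("en.binary").score` could be passed as `score` instead of the toy function; it returns a log10 probability for the whole string, so higher is still better.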

Training

Trained from full Wikipedia dumps with KenLM: lmplz estimates the n-gram model and build_binary converts it to the quantized trie binary format. See jonawhisper-model-tools/kenlm.
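As a rough sketch of what the two KenLM invocations might look like for one language, the helper below builds the command lines without running them. The helper name, the corpus file name, and the output layout are assumptions; the `-q 8` flag is inferred from the "8-bit" column in the table. The actual pipeline lives in jonawhisper-model-tools/kenlm and may differ.

```python
from pathlib import Path
from typing import List, Tuple


def kenlm_commands(lang: str, corpus: Path, out_dir: Path) -> Tuple[List[str], List[str]]:
    """Build (lmplz, build_binary) argument lists for one language.

    Hypothetical helper: file names and layout are illustrative only.
    """
    arpa = out_dir / f"{lang}.arpa"
    binary = out_dir / f"{lang}.binary"

    # Estimate a 3-gram model; "--prune 0 0 1" keeps all unigrams and
    # bigrams and drops trigrams seen only once.
    lmplz = [
        "lmplz", "-o", "3",
        "--prune", "0", "0", "1",
        "--text", str(corpus),
        "--arpa", str(arpa),
    ]

    # Convert the ARPA file to a trie binary with 8-bit quantization
    # (assumed from the table's "8-bit" column).
    build = ["build_binary", "-q", "8", "trie", str(arpa), str(binary)]
    return lmplz, build


lmplz_cmd, build_cmd = kenlm_commands("fr", Path("wiki.fr.txt"), Path("out"))
```

If the KenLM binaries are on the PATH, each list can be handed to `subprocess.run(cmd, check=True)`. lmplz expects plain text, so the Wikipedia dump would need to be extracted and tokenized first.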

License

KenLM is LGPL-2.1. Wikipedia text is CC BY-SA 3.0.
