KenLM Language Models for JonaWhisper
Pruned trigram language models trained on Wikipedia, used for context-aware spell correction in JonaWhisper.
Models
| File | Language | Order | Pruning | Quantization |
|---|---|---|---|---|
| fr.binary | French | 3-gram | --prune 0 0 1 | 8-bit |
| en.binary | English | 3-gram | --prune 0 0 1 | 8-bit |
| de.binary | German | 3-gram | --prune 0 0 1 | 8-bit |
| es.binary | Spanish | 3-gram | --prune 0 0 1 | 8-bit |
| pt.binary | Portuguese | 3-gram | --prune 0 0 1 | 8-bit |
| it.binary | Italian | 3-gram | --prune 0 0 1 | 8-bit |
| nl.binary | Dutch | 3-gram | --prune 0 0 1 | 8-bit |
| pl.binary | Polish | 3-gram | --prune 0 0 1 | 8-bit |
| ru.binary | Russian | 3-gram | --prune 0 0 1 | 8-bit |
Usage
These models are downloaded automatically by JonaWhisper when spell-check with KenLM reranking is enabled. They provide context-aware correction: SymSpell generates candidates for unknown words, and KenLM scores each candidate in trigram context to pick the most natural option.
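The reranking step described above can be sketched roughly as follows. This is an illustration only, not JonaWhisper's actual code: `rerank`, `score_fn`, and `toy_score` are hypothetical names, and the candidate list is assumed to come from SymSpell. The scorer is passed in as a plain callable so the sketch runs without a model file.

```python
from typing import Callable, Sequence


def rerank(
    tokens: Sequence[str],
    index: int,
    candidates: Sequence[str],
    score_fn: Callable[[str], float],
    context: int = 2,
) -> str:
    """Return the candidate that scores highest in its local context.

    Hypothetical helper: `score_fn` maps a text fragment to a score
    where higher means more natural (e.g. a log probability).
    """
    if not candidates:
        # Nothing to choose from: keep the original word.
        return tokens[index]

    # A trigram model ties a word to at most two neighbours on each side,
    # so a window of `context` words in both directions is enough.
    left = list(tokens[max(0, index - context):index])
    right = list(tokens[index + 1:index + 1 + context])

    def window_score(candidate: str) -> float:
        return score_fn(" ".join(left + [candidate] + right))

    return max(candidates, key=window_score)


def toy_score(text: str) -> float:
    # Stand-in for a real language model: rewards one known phrase.
    return 0.0 if "the weather is" in text else -5.0


best = rerank(
    ["the", "wether", "is", "nice"],
    index=1,
    candidates=["whether", "weather", "wetter"],
    score_fn=toy_score,
)
```

With the `kenlm` Python module installed, the toy scorer could presumably be swapped for one of the models above, for example `model = kenlm.Model("en.binary")` and `score_fn=lambda s: model.score(s, bos=False, eos=False)`, which returns a log10 probability for the fragment.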
Training
Trained on full Wikipedia dumps: KenLM's lmplz estimates each model, and build_binary converts it to the trie binary format.
See jonawhisper-model-tools/kenlm.
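As a rough sketch of how the settings in the Models table map onto KenLM's command-line tools, the helpers below assemble the two invocations. The real scripts live in jonawhisper-model-tools/kenlm and may differ; the function names, file layout, and the assumption that "8-bit" corresponds to `-q 8` are mine, and the corpus is assumed to be already extracted and tokenized plain text.

```python
import subprocess
from pathlib import Path


def lmplz_cmd(order: int = 3, prune: tuple[int, ...] = (0, 0, 1)) -> list[str]:
    """Build the lmplz command; it reads text on stdin and writes ARPA to stdout."""
    return ["lmplz", "-o", str(order), "--prune", *map(str, prune)]


def build_binary_cmd(arpa: Path, binary: Path, bits: int = 8) -> list[str]:
    """Build the build_binary command for a trie model.

    Assumption: `-q` quantizes probabilities to `bits` bits. Whether the
    published models also quantize backoffs (`-b`) is not stated.
    """
    return ["build_binary", "-q", str(bits), "trie", str(arpa), str(binary)]


def train(corpus: Path, lang: str, workdir: Path) -> Path:
    """Hypothetical driver: corpus text -> <lang>.arpa -> <lang>.binary."""
    arpa = workdir / f"{lang}.arpa"
    binary = workdir / f"{lang}.binary"
    # Stream the corpus through lmplz and capture the ARPA file.
    with corpus.open("rb") as src, arpa.open("wb") as dst:
        subprocess.run(lmplz_cmd(), stdin=src, stdout=dst, check=True)
    # Convert the ARPA file into the quantized trie binary.
    subprocess.run(build_binary_cmd(arpa, binary), check=True)
    return binary
```

Calling `train(Path("en.txt"), "en", Path("."))` would require `lmplz` and `build_binary` on the `PATH`; the two `*_cmd` helpers can be inspected without KenLM installed.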
License
KenLM is LGPL-2.1. Wikipedia text is CC BY-SA 3.0.