KenLM Language Models for JonaWhisper

Pruned trigram language models trained on Wikipedia, used for context-aware spell correction in JonaWhisper.

Models

File	Language	Order	Pruning	Quantization
`fr.binary`	French	3-gram	`--prune 0 0 1`	8-bit
`en.binary`	English	3-gram	`--prune 0 0 1`	8-bit
`de.binary`	German	3-gram	`--prune 0 0 1`	8-bit
`es.binary`	Spanish	3-gram	`--prune 0 0 1`	8-bit
`pt.binary`	Portuguese	3-gram	`--prune 0 0 1`	8-bit
`it.binary`	Italian	3-gram	`--prune 0 0 1`	8-bit
`nl.binary`	Dutch	3-gram	`--prune 0 0 1`	8-bit
`pl.binary`	Polish	3-gram	`--prune 0 0 1`	8-bit
`ru.binary`	Russian	3-gram	`--prune 0 0 1`	8-bit

Usage

These models are downloaded automatically by JonaWhisper when spell-check with KenLM reranking is enabled. They provide context-aware correction: SymSpell generates candidates for unknown words, and KenLM scores each candidate in trigram context to pick the most natural option.
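The reranking step described above can be sketched in a few lines. This is only an illustration, not JonaWhisper's actual implementation: `rerank`, `toy_score`, and `TOY_TRIGRAMS` are hypothetical names, and the toy scorer stands in for a real KenLM model so the example runs without any model file. Two words of context on each side are enough to cover every trigram that contains the candidate.

```python
from typing import Callable, Sequence


def rerank(left: Sequence[str], candidates: Sequence[str],
           right: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate whose local context window scores highest."""
    def window(candidate: str) -> str:
        # Two words on each side cover all trigrams containing the candidate.
        return " ".join([*left[-2:], candidate, *right[:2]])
    return max(candidates, key=lambda c: score(window(c)))


# Toy stand-in for a language model: counts known trigrams in the text.
# With the kenlm Python module installed, a real scorer could instead be:
#   model = kenlm.Model("en.binary")
#   score = lambda text: model.score(text, bos=False, eos=False)
TOY_TRIGRAMS = {("the", "weather", "is"), ("weather", "is", "nice")}


def toy_score(text: str) -> float:
    words = text.split()
    return float(sum(tuple(words[i:i + 3]) in TOY_TRIGRAMS
                     for i in range(len(words) - 2)))


# Candidates as a spell checker such as SymSpell might propose them.
best = rerank(["the"], ["whether", "weather", "wether"], ["is", "nice"], toy_score)
print(best)
```

Passing the scorer as a callable keeps candidate generation and language-model scoring separate, so the same function works with any model that returns a higher score for more natural text.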

Training

Trained from full Wikipedia dumps with KenLM: `lmplz` estimates each model, and `build_binary` converts it to the trie binary format. See jonawhisper-model-tools/kenlm.
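As a rough sketch of what the two-step pipeline could look like with the settings from the Models table, the snippet below only builds the command lines. The helper `kenlm_commands` and the corpus file name are hypothetical; the actual scripts live in jonawhisper-model-tools/kenlm and may use different options. In `--prune 0 0 1`, the three thresholds apply to unigrams, bigrams, and trigrams, so only trigrams seen once are dropped.

```python
from typing import List, Tuple


def kenlm_commands(lang: str, corpus: str) -> Tuple[List[str], List[str]]:
    """Build the lmplz and build_binary command lines for one language."""
    arpa = f"{lang}.arpa"
    # Estimate a 3-gram model, pruning singleton trigrams.
    estimate = ["lmplz", "-o", "3", "--prune", "0", "0", "1",
                "--text", corpus, "--arpa", arpa]
    # Convert the ARPA file to a trie binary with 8-bit quantization.
    convert = ["build_binary", "-q", "8", "trie", arpa, f"{lang}.binary"]
    return estimate, convert


# "fr_wiki.txt" is a placeholder for a tokenized, one-sentence-per-line corpus.
estimate, convert = kenlm_commands("fr", "fr_wiki.txt")
print(" ".join(estimate))
print(" ".join(convert))
```

If the KenLM binaries are on `PATH`, each list can be passed to `subprocess.run(cmd, check=True)` to run the step.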

License

KenLM is LGPL-2.1. Wikipedia text is CC BY-SA 3.0.
