tags:
  - kenlm
  - n-gram
  - language-model
  - spell-correction
license: lgpl-2.1
language:
  - fr
  - en
  - de
  - es
  - pt
  - it
  - nl
  - pl
  - ru

KenLM Language Models for JonaWhisper

Pruned trigram language models trained on Wikipedia, used for context-aware spell correction in JonaWhisper.

Models

| File | Language | Order | Pruning | Quantization |
|------|----------|-------|---------|--------------|
| fr.binary | French | 3-gram | --prune 0 0 1 | 8-bit |
| en.binary | English | 3-gram | --prune 0 0 1 | 8-bit |
| de.binary | German | 3-gram | --prune 0 0 1 | 8-bit |
| es.binary | Spanish | 3-gram | --prune 0 0 1 | 8-bit |
| pt.binary | Portuguese | 3-gram | --prune 0 0 1 | 8-bit |
| it.binary | Italian | 3-gram | --prune 0 0 1 | 8-bit |
| nl.binary | Dutch | 3-gram | --prune 0 0 1 | 8-bit |
| pl.binary | Polish | 3-gram | --prune 0 0 1 | 8-bit |
| ru.binary | Russian | 3-gram | --prune 0 0 1 | 8-bit |

Usage

These models are downloaded automatically by JonaWhisper when spell-check with KenLM reranking is enabled. They provide context-aware correction: SymSpell generates candidates for unknown words, and KenLM scores each candidate in trigram context to pick the most natural option.
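
The reranking step described above can be sketched roughly as follows. This is an illustration, not JonaWhisper's actual code: the names `rerank` and `kenlm_scorer` are hypothetical, candidate generation (SymSpell) is assumed to happen elsewhere, and the scorer assumes the `kenlm` Python bindings are installed. Because the models are trigrams, two words of context on each side cover every trigram that contains the candidate.

```python
from typing import Callable, Optional, Sequence


def rerank(
    tokens: Sequence[str],
    index: int,
    candidates: Sequence[str],
    score: Callable[[str], float],
    window: int = 2,
) -> Optional[str]:
    """Return the candidate that scores highest when placed at tokens[index].

    `score` maps a space-separated string to a log-probability (higher is
    better). Returns None if there are no candidates.
    """
    lo = max(0, index - window)
    hi = min(len(tokens), index + window + 1)
    best, best_score = None, float("-inf")
    for cand in candidates:
        # Substitute the candidate into its local context and score the span.
        context = list(tokens[lo:index]) + [cand] + list(tokens[index + 1:hi])
        s = score(" ".join(context))
        if s > best_score:
            best, best_score = cand, s
    return best


def kenlm_scorer(model_path: str) -> Callable[[str], float]:
    """Build a scorer from a KenLM binary (assumes `pip install kenlm`)."""
    import kenlm  # imported lazily so rerank() works without the bindings

    model = kenlm.Model(model_path)
    # bos/eos are disabled because the scored span is a fragment, not a sentence.
    return lambda text: model.score(text, bos=False, eos=False)
```

With a downloaded model, a call might look like `rerank("I went to teh store".split(), 3, ["the", "tea", "ten"], kenlm_scorer("en.binary"))`. Passing the scorer as a plain callable keeps the selection logic independent of KenLM, so it can be exercised with a stub.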

Training

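Each model is trained on a full Wikipedia dump with KenLM's lmplz (order 3, --prune 0 0 1, which drops trigrams seen only once while keeping all unigrams and bigrams), then converted with build_binary into the trie format with 8-bit quantization. See jonawhisper-model-tools/kenlm.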
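
A minimal sketch of how the two KenLM steps might be driven, assuming `lmplz` and `build_binary` are on `PATH` and the Wikipedia text has already been extracted to one sentence per line. The function names and file layout are hypothetical, and `-q 8 -b 8` is an assumption about how the 8-bit quantization is requested; the scripts in jonawhisper-model-tools/kenlm are the reference and may differ.

```python
import subprocess
from pathlib import Path
from typing import List


def training_commands(lang: str, corpus: str, workdir: str = ".") -> List[List[str]]:
    """Build the lmplz and build_binary command lines for one language."""
    out = Path(workdir)
    arpa = out / f"{lang}.arpa"
    binary = out / f"{lang}.binary"
    # Step 1: estimate a trigram model, pruning trigrams with count <= 1.
    lmplz = [
        "lmplz", "-o", "3", "--prune", "0", "0", "1",
        "--text", str(corpus), "--arpa", str(arpa),
    ]
    # Step 2: convert the ARPA file to a quantized trie binary.
    # Assumption: 8 bits for both probabilities (-q) and backoffs (-b).
    build = ["build_binary", "-q", "8", "-b", "8", "trie", str(arpa), str(binary)]
    return [lmplz, build]


def train(lang: str, corpus: str, workdir: str = ".") -> None:
    """Run both steps, stopping on the first failure."""
    for cmd in training_commands(lang, corpus, workdir):
        subprocess.run(cmd, check=True)
```

Calling `train("fr", "fr.txt")` would run both steps and write `fr.arpa` and `fr.binary` to the current directory.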

License

KenLM is LGPL-2.1. Wikipedia text is CC BY-SA 3.0.