
Khasi Spell Checker v1

Khasi Spell Checker v1 is a statistical spell-checking system for the Khasi language, built from a large monolingual corpus (~700k sentences). The system detects and corrects spelling errors using edit-distance candidate generation and contextual ranking with a probabilistic language model.

The goal of this project is to provide basic NLP infrastructure for Khasi, a low-resource language, enabling improved writing assistance, text preprocessing, and downstream NLP applications.

🚀 Live Demo: https://huggingface.co/spaces/Bapynshngain/Khasi-Spell-Checker


Overview

This spell checker follows a classical architecture historically used in early search engines and spelling correction systems.

Input sentence
      ↓
Tokenization
      ↓
Suspicious word detection
      ↓
Candidate generation (edit distance)
      ↓
Language model scoring
      ↓
Best correction selection
      ↓
Corrected sentence

The system combines:

  • Edit-distance candidate generation
  • Word frequency model
  • Bigram language model
  • Bidirectional context scoring

Training Data

The model is derived from a Khasi monolingual corpus containing ~700,000 sentences.

From this corpus we extracted:

| Resource | Description |
| --- | --- |
| Vocabulary | Unique Khasi tokens |
| Word frequencies | Frequency counts for each token |
| Bigram frequencies | Context probabilities between word pairs |

After cleaning, the vocabulary contains ~58,000 unique Khasi words.
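Assuming the corpus has already been tokenized into lists of words, the three resources above can be collected in a few lines. This is an illustrative sketch (function and variable names are not from the actual codebase):

```python
from collections import Counter

def build_models(sentences):
    """Collect vocabulary, word frequencies, and bigram frequencies
    from a tokenized corpus (a list of token lists)."""
    word_freq = Counter()
    bigram_freq = Counter()
    for tokens in sentences:
        word_freq.update(tokens)
        bigram_freq.update(zip(tokens, tokens[1:]))
    return set(word_freq), word_freq, bigram_freq

# Tiny illustrative corpus, not the real training data
corpus = [["nga", "leit", "sha", "shnong"],
          ["ka", "leit", "sha", "iew"]]
vocab, wf, bf = build_models(corpus)
```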


Detection of Misspelled Words

The system first determines whether a word is likely to be incorrect.

A word is trusted if it is:

  1. Present in the vocabulary
  2. Sufficiently frequent in the corpus

Formally:

If w ∈ V and freq(w) > τ

then the word is accepted.

Otherwise the system attempts correction.
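A minimal sketch of this acceptance test, assuming a frequency table and a threshold τ (the actual threshold value is not stated in this card):

```python
def is_trusted(word, vocabulary, word_freq, tau=1):
    """Accept a word if it is in the vocabulary AND its corpus
    frequency exceeds tau; otherwise it becomes a correction target.
    tau=1 is an assumed default, not the system's actual value."""
    return word in vocabulary and word_freq.get(word, 0) > tau

# Hypothetical vocabulary and counts for illustration
vocab = {"nga", "ka", "shnong"}
freq = {"nga": 500, "ka": 900, "shnong": 40}
```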

Example:

| Word | Frequency | Action |
| --- | --- | --- |
| nga | high | keep |
| ka | very high | keep |
| shnogn | low | correct |

Candidate Generation

Candidate corrections are generated using edit distance operations.

Allowed operations:

| Operation | Description |
| --- | --- |
| Deletion | remove a character |
| Insertion | add a character |
| Replacement | replace a character |
| Transposition | swap adjacent characters |

Example:

sngewhuh → sngewthuh

shnogn → shnong

Candidates are generated using edits1(word) and edits2(word), where:

edits2(w) = edits1(edits1(w))

After generation:

Candidates = edits(w) ∩ Vocabulary

Only candidates present in the vocabulary are retained.
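The classic (Norvig-style) way to implement edits1 and the vocabulary filter is sketched below. The alphabet is an assumption (basic Latin plus ï and ñ, which occur in Khasi orthography); the real system's character set may differ:

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyzïñ"  # assumed character set

def edits1(word):
    """All strings one edit away: deletion, transposition,
    replacement, and insertion of a single character."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in ALPHABET]
    inserts = [L + c + R for L, R in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def candidates(word, vocabulary):
    """Keep only in-vocabulary candidates; fall back to edit
    distance 2 when no distance-1 candidate exists."""
    e1 = edits1(word) & vocabulary
    if e1:
        return e1
    return {w2 for w1 in edits1(word) for w2 in edits1(w1)} & vocabulary
```

For example, the transposition rule turns shnogn into shnong, which then survives the vocabulary intersection.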


Probabilistic Ranking

Once candidate corrections are generated, the system ranks them probabilistically.

The classical spelling correction objective is:

ĉ = argmax_c P(c | w)

Using Bayes' theorem:

P(c | w) = P(w | c) · P(c) / P(w)

Since P(w) is constant for all candidates:

ĉ = argmax_c P(w | c) · P(c)

In this implementation:

  • P(c) is approximated using word frequency
  • contextual probabilities are modeled using bigram statistics
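Since there is no explicit error model, P(w | c) is effectively uniform and ranking reduces to the frequency prior. A sketch with hypothetical counts:

```python
def rank_by_prior(cands, word_freq):
    """Order candidates by P(c), approximated by raw corpus
    frequency; P(w|c) is treated as uniform (no error model)."""
    return sorted(cands, key=lambda c: word_freq.get(c, 0), reverse=True)

# Hypothetical counts for illustration
freq = {"sngewthuh": 3000, "sngewleh": 45}
```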

Language Model

A bigram language model is used to model contextual probability.

P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})

Example:

ban sngewthuh → common
ban sngewleh → rare

Thus:

P(sngewthuh | ban) > P(sngewleh | ban)
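The maximum-likelihood estimate above can be computed directly from the counts. The add-alpha smoothing below is an assumption (the card does not specify a smoothing scheme) to keep unseen bigrams from getting zero probability:

```python
def bigram_prob(prev, word, bigram_freq, word_freq, alpha=1.0, vocab_size=58000):
    """P(word | prev) = count(prev, word) / count(prev),
    with assumed add-alpha smoothing over a ~58k-word vocabulary."""
    return ((bigram_freq.get((prev, word), 0) + alpha)
            / (word_freq.get(prev, 0) + alpha * vocab_size))

# Hypothetical counts for illustration
bf = {("ban", "sngewthuh"): 120, ("ban", "sngewleh"): 2}
wf = {"ban": 500}
```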


Bidirectional Context Scoring

To improve correction accuracy, both left and right context are used.

The final candidate score is:

Score(c) = log P(c) + log P(c | w_{i-1}) + log P(w_{i+1} | c)

Where:

| Term | Meaning |
| --- | --- |
| P(c) | candidate word frequency (unigram prior) |
| P(c \| w_{i-1}) | left-context bigram probability |
| P(w_{i+1} \| c) | right-context bigram probability |

This allows the system to evaluate phrases such as: me khlem leit

instead of only evaluating: me khlem
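The three log terms can be combined as below; the smoothing constant and vocabulary size are assumptions introduced to keep every term finite:

```python
import math

def score(c, left, right, word_freq, bigram_freq, total_words,
          alpha=1.0, vocab_size=58000):
    """Score(c) = log P(c) + log P(c | left) + log P(right | c),
    with assumed add-alpha smoothing on each probability."""
    p_c = (word_freq.get(c, 0) + alpha) / (total_words + alpha * vocab_size)
    p_left = ((bigram_freq.get((left, c), 0) + alpha)
              / (word_freq.get(left, 0) + alpha * vocab_size))
    p_right = ((bigram_freq.get((c, right), 0) + alpha)
               / (word_freq.get(c, 0) + alpha * vocab_size))
    return math.log(p_c) + math.log(p_left) + math.log(p_right)

# Hypothetical counts: "khlem" fits the context "me _ leit" even
# though "kum" is globally more frequent
wf = {"me": 2000, "khlem": 1200, "kum": 9000, "leit": 3000}
bf = {("me", "khlem"): 50, ("khlem", "leit"): 40, ("me", "kum"): 5}
```

With these illustrative counts, the bidirectional context outweighs the raw frequency advantage of the more common word.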


Implementation Details

Language: Python

Framework: Gradio (via Hugging Face Spaces)


Limitations

Current limitations include:

  • No explicit typo probability model P(w | c)
  • Candidate explosion for short words
  • No phonetic error modeling
  • No neural context understanding

Example challenging case:

khlm → kum vs. khlem

Because:

frequency(kum) >> frequency(khlem)



Future Improvements

Character Error Model

Learn probabilities for common typing errors.

Trigram Language Model

Replace bigram model with:

P(w_i | w_{i-1}, w_{i-2})

using tools such as KenLM.
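As a sketch of what the upgrade would look like with raw counts (KenLM would add proper Kneser-Ney smoothing; the add-alpha fallback here is only an assumption for illustration):

```python
def trigram_prob(w1, w2, w3, trigram_freq, bigram_freq,
                 alpha=1.0, vocab_size=58000):
    """P(w3 | w1, w2) = count(w1, w2, w3) / count(w1, w2),
    with assumed add-alpha smoothing as a stand-in for Kneser-Ney."""
    return ((trigram_freq.get((w1, w2, w3), 0) + alpha)
            / (bigram_freq.get((w1, w2), 0) + alpha * vocab_size))

# Hypothetical counts for illustration
tf = {("nga", "ban", "sngewthuh"): 8}
bf = {("nga", "ban"): 20}
```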


Neural Spell Correction

Future versions may incorporate neural models such as:

  • BERT
  • T5
  • sequence-to-sequence transformers

for improved contextual understanding.


Intended Use

This spell checker is designed for:

  • Khasi writing assistance
  • educational tools
  • preprocessing Khasi text
  • improving downstream NLP pipelines

Citation

If you use this work, please cite:

@software{nongkynrih2026khasi_spellchecker_v1,
  author       = {Nongkynrih, Bapynshngainlang},
  title        = {Khasi Spell Checker v1},
  version      = {1.0},
  year         = {2026},
  month        = mar,
  day          = 13,
  publisher    = {Hugging Face},
  doi          = {10.57967/hf/7999},
  url          = {https://huggingface.co/Bapynshngain/Khasi-SpellChecker-v1}
}

APA Citation

Nongkynrih, B. (2026, March 13). Khasi Spell Checker v1 (Version 1.0) [Software]. Hugging Face. https://doi.org/10.57967/hf/7999