
Khasi Spell Checker v1

Khasi Spell Checker v1 is a statistical spell-checking system for the Khasi language, built from a large monolingual corpus (~700k sentences). The system detects and corrects spelling errors using edit-distance candidate generation and contextual ranking with a probabilistic language model.

The goal of this project is to provide basic NLP infrastructure for Khasi, a low-resource language, enabling improved writing assistance, text preprocessing, and downstream NLP applications.

🚀 Live Demo: https://huggingface.co/spaces/Bapynshngain/Khasi-Spell-Checker


Overview

This spell checker follows a classical architecture historically used in early search engines and spelling correction systems.

Input sentence
      ↓
Tokenization
      ↓
Suspicious word detection
      ↓
Candidate generation (edit distance)
      ↓
Language model scoring
      ↓
Best correction selection
      ↓
Corrected sentence

The system combines:

  • Edit-distance candidate generation
  • Word frequency model
  • Bigram language model
  • Bidirectional context scoring

Training Data

The model is derived from a Khasi monolingual corpus containing ~700,000 sentences.

From this corpus we extracted:

| Resource | Description |
| --- | --- |
| Vocabulary | Unique Khasi tokens |
| Word frequencies | Frequency counts for each token |
| Bigram frequencies | Context probabilities between word pairs |

After cleaning, the vocabulary contains ~58,000 unique Khasi words.
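Assuming the corpus has already been tokenized into lists of words, the three resources above can be collected in a few lines. This is an illustrative sketch (function and variable names are not from the actual codebase):

```python
from collections import Counter

def build_models(sentences):
    """Collect vocabulary, word frequencies, and bigram frequencies
    from a tokenized corpus (a list of token lists)."""
    word_freq = Counter()
    bigram_freq = Counter()
    for tokens in sentences:
        word_freq.update(tokens)
        bigram_freq.update(zip(tokens, tokens[1:]))
    return set(word_freq), word_freq, bigram_freq

# Tiny illustrative corpus, not the real training data
corpus = [["nga", "leit", "sha", "shnong"],
          ["ka", "leit", "sha", "iew"]]
vocab, wf, bf = build_models(corpus)
```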


Detection of Misspelled Words

The system first determines whether a word is likely to be incorrect.

A word is trusted if it is:

  1. Present in the vocabulary
  2. Sufficiently frequent in the corpus

Formally:

If w ∈ V and freq(w) > τ

then the word is accepted.

Otherwise the system attempts correction.
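A minimal sketch of this acceptance test, assuming a frequency table and a threshold τ (the actual threshold value is not stated in this card):

```python
def is_trusted(word, vocabulary, word_freq, tau=1):
    """Accept a word if it is in the vocabulary AND its corpus
    frequency exceeds tau; otherwise it becomes a correction target.
    tau=1 is an assumed default, not the system's actual value."""
    return word in vocabulary and word_freq.get(word, 0) > tau

# Hypothetical vocabulary and counts for illustration
vocab = {"nga", "ka", "shnong"}
freq = {"nga": 500, "ka": 900, "shnong": 40}
```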

Example:

| Word | Frequency | Action |
| --- | --- | --- |
| nga | high | keep |
| ka | very high | keep |
| shnogn | low | correct |

Candidate Generation

Candidate corrections are generated using edit distance operations.

Allowed operations:

| Operation | Description |
| --- | --- |
| Deletion | remove a character |
| Insertion | add a character |
| Replacement | replace a character |
| Transposition | swap adjacent characters |

Example:

sngewhuh → sngewthuh

shnogn → shnong

Candidates are generated using edits1(word) and edits2(word), where:

edits2(w) = edits1(edits1(w))

After generation:

Candidates = edits(w) ∩ Vocabulary

Only candidates present in the vocabulary are retained.
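The classic (Norvig-style) way to implement edits1 and the vocabulary filter is sketched below. The alphabet is an assumption (basic Latin plus ï and ñ, which occur in Khasi orthography); the real system's character set may differ:

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyzïñ"  # assumed character set

def edits1(word):
    """All strings one edit away: deletion, transposition,
    replacement, and insertion of a single character."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in ALPHABET]
    inserts = [L + c + R for L, R in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def candidates(word, vocabulary):
    """Keep only in-vocabulary candidates; fall back to edit
    distance 2 when no distance-1 candidate exists."""
    e1 = edits1(word) & vocabulary
    if e1:
        return e1
    return {w2 for w1 in edits1(word) for w2 in edits1(w1)} & vocabulary
```

For example, the transposition rule turns shnogn into shnong, which then survives the vocabulary intersection.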


Probabilistic Ranking

Once candidate corrections are generated, the system ranks them probabilistically.

The classical spelling correction objective is:

ĉ = argmax_c P(c | w)

Using Bayes' theorem:

P(c | w) = P(w | c) · P(c) / P(w)

Since P(w) is constant for all candidates:

ĉ = argmax_c P(w | c) · P(c)

In this implementation:

  • P(c) is approximated using word frequency
  • contextual probabilities are modeled using bigram statistics
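Since there is no explicit error model, P(w | c) is effectively uniform and ranking reduces to the frequency prior. A sketch with hypothetical counts:

```python
def rank_by_prior(cands, word_freq):
    """Order candidates by P(c), approximated by raw corpus
    frequency; P(w|c) is treated as uniform (no error model)."""
    return sorted(cands, key=lambda c: word_freq.get(c, 0), reverse=True)

# Hypothetical counts for illustration
freq = {"sngewthuh": 3000, "sngewleh": 45}
```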

Language Model

A bigram language model is used to model contextual probability.

P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})

Example:

ban sngewthuh → common
ban sngewleh → rare

Thus:

P(sngewthuh | ban) > P(sngewleh | ban)
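The maximum-likelihood estimate above can be computed directly from the counts. The add-alpha smoothing below is an assumption (the card does not specify a smoothing scheme) to keep unseen bigrams from getting zero probability:

```python
def bigram_prob(prev, word, bigram_freq, word_freq, alpha=1.0, vocab_size=58000):
    """P(word | prev) = count(prev, word) / count(prev),
    with assumed add-alpha smoothing over a ~58k-word vocabulary."""
    return ((bigram_freq.get((prev, word), 0) + alpha)
            / (word_freq.get(prev, 0) + alpha * vocab_size))

# Hypothetical counts for illustration
bf = {("ban", "sngewthuh"): 120, ("ban", "sngewleh"): 2}
wf = {"ban": 500}
```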


Bidirectional Context Scoring

To improve correction accuracy, both left and right context are used.

The final candidate score is:

Score(c) = log P(c) + log P(c | w_{i-1}) + log P(w_{i+1} | c)

Where:

| Term | Meaning |
| --- | --- |
| P(c) | candidate word frequency (unigram prior) |
| P(c \| w_{i-1}) | left-context bigram probability |
| P(w_{i+1} \| c) | right-context bigram probability |

This allows the system to evaluate phrases such as: me khlem leit

instead of only evaluating: me khlem
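The three log terms can be combined as below; the smoothing constant and vocabulary size are assumptions introduced to keep every term finite:

```python
import math

def score(c, left, right, word_freq, bigram_freq, total_words,
          alpha=1.0, vocab_size=58000):
    """Score(c) = log P(c) + log P(c | left) + log P(right | c),
    with assumed add-alpha smoothing on each probability."""
    p_c = (word_freq.get(c, 0) + alpha) / (total_words + alpha * vocab_size)
    p_left = ((bigram_freq.get((left, c), 0) + alpha)
              / (word_freq.get(left, 0) + alpha * vocab_size))
    p_right = ((bigram_freq.get((c, right), 0) + alpha)
               / (word_freq.get(c, 0) + alpha * vocab_size))
    return math.log(p_c) + math.log(p_left) + math.log(p_right)

# Hypothetical counts: "khlem" fits the context "me _ leit" even
# though "kum" is globally more frequent
wf = {"me": 2000, "khlem": 1200, "kum": 9000, "leit": 3000}
bf = {("me", "khlem"): 50, ("khlem", "leit"): 40, ("me", "kum"): 5}
```

With these illustrative counts, the bidirectional context outweighs the raw frequency advantage of the more common word.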


Implementation Details

Language: Python

Framework: Gradio (via Hugging Face Spaces)


Limitations

Current limitations include:

  • No explicit typo probability model P(w | c)
  • Candidate explosion for short words
  • No phonetic error modeling
  • No neural context understanding

Example challenging case:

khlm → kum vs. khlem

Because:

frequency(kum) >> frequency(khlem)



Future Improvements

Character Error Model

Learn probabilities for common typing errors.

Trigram Language Model

Replace bigram model with:

P(w_i | w_{i-1}, w_{i-2})

using tools such as KenLM.
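As a sketch of what the upgrade would look like with raw counts (KenLM would add proper Kneser-Ney smoothing; the add-alpha fallback here is only an assumption for illustration):

```python
def trigram_prob(w1, w2, w3, trigram_freq, bigram_freq,
                 alpha=1.0, vocab_size=58000):
    """P(w3 | w1, w2) = count(w1, w2, w3) / count(w1, w2),
    with assumed add-alpha smoothing as a stand-in for Kneser-Ney."""
    return ((trigram_freq.get((w1, w2, w3), 0) + alpha)
            / (bigram_freq.get((w1, w2), 0) + alpha * vocab_size))

# Hypothetical counts for illustration
tf = {("nga", "ban", "sngewthuh"): 8}
bf = {("nga", "ban"): 20}
```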


Neural Spell Correction

Future versions may incorporate neural models such as:

  • BERT
  • T5
  • sequence-to-sequence transformers

for improved contextual understanding.


Intended Use

This spell checker is designed for:

  • Khasi writing assistance
  • educational tools
  • preprocessing Khasi text
  • improving downstream NLP pipelines

Citation

If you use this work, please cite:

@software{nongkynrih2026khasi_spellchecker_v1,
  author       = {Nongkynrih, Bapynshngainlang},
  title        = {Khasi Spell Checker v1},
  version      = {1.0},
  year         = {2026},
  month        = mar,
  day          = 13,
  publisher    = {Hugging Face},
  doi          = {10.57967/hf/7999},
  url          = {https://huggingface.co/Bapynshngain/Khasi-SpellChecker-v1}
}

APA Citation

Nongkynrih, B. (2026, March 13). Khasi Spell Checker v1 (Version 1.0) [Software]. Hugging Face. https://doi.org/10.57967/hf/7999