Khasi Spell Checker v1
Khasi Spell Checker v1 is a statistical spell-checking system for the Khasi language, built from a large monolingual corpus (~700k sentences). It detects and corrects spelling errors by generating candidates with edit-distance operations and ranking them in context with a probabilistic language model.
The goal of this project is to provide basic NLP infrastructure for Khasi, a low-resource language, enabling improved writing assistance, text preprocessing, and downstream NLP applications.
Live Demo: https://huggingface.co/spaces/Bapynshngain/Khasi-Spell-Checker
Overview
This spell checker follows a classical architecture historically used in early search engines and spelling correction systems.
Input sentence
↓
Tokenization
↓
Suspicious word detection
↓
Candidate generation (edit distance)
↓
Language model scoring
↓
Best correction selection
↓
Corrected sentence
The system combines:
- Edit-distance candidate generation
- Word frequency model
- Bigram language model
- Bidirectional context scoring
Training Data
The model is derived from a Khasi monolingual corpus containing ~700,000 sentences.
From this corpus we extracted:
| Resource | Description |
|---|---|
| Vocabulary | Unique Khasi tokens |
| Word frequencies | Frequency counts for each token |
| Bigram frequencies | Context probabilities between word pairs |
After cleaning, the vocabulary contains ~58,000 unique Khasi words.
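The three resources above can be derived in a single pass over the corpus. The following is a minimal sketch, not the project's actual extraction code: the tokenizer regex, the function names `tokenize` and `build_counts`, and the toy sentences are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed tokenizer: runs of Unicode letters, which keeps Khasi letters
# such as "ï" and "ñ" inside a token.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens."""
    return TOKEN_RE.findall(sentence.lower())


def build_counts(sentences):
    """Return (word_freq, bigram_freq) Counters for an iterable of sentences."""
    word_freq = Counter()
    bigram_freq = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        word_freq.update(tokens)
        # Adjacent pairs (w_{i-1}, w_i) within the same sentence.
        bigram_freq.update(zip(tokens, tokens[1:]))
    return word_freq, bigram_freq


# Toy input assembled from words that appear in this README, not corpus data.
toy_sentences = ["Me khlem leit", "nga khlem leit"]
word_freq, bigram_freq = build_counts(toy_sentences)
vocabulary = set(word_freq)
```

In a real run, `sentences` would be an iterator over the corpus file, and the cleaning step that reduces the vocabulary to ~58,000 words would be applied to `word_freq` afterwards.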
Detection of Misspelled Words
The system first determines whether a word is likely to be incorrect.
A word is trusted if it is:
- Present in the vocabulary
- Sufficiently frequent in the corpus
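Formally, if

$$w \in V \quad \text{and} \quad \text{freq}(w) \geq \tau$$

where $V$ is the vocabulary and $\tau$ is a minimum frequency threshold, then the word is accepted.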
Otherwise the system attempts correction.
Example:
| Word | Frequency | Action |
|---|---|---|
| nga | high | keep |
| ka | very high | keep |
| shnogn | low | correct |
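The detection rule can be sketched as a single predicate. This is an illustrative version under stated assumptions: the threshold `MIN_FREQ = 5` is hypothetical (the README does not give the value used), and the counts in `toy_freq` are invented to mirror the table above.

```python
# Hypothetical threshold; the value used by the project is not stated here.
MIN_FREQ = 5


def is_trusted(word, word_freq, min_freq=MIN_FREQ):
    """A word is trusted if it is in the vocabulary and frequent enough."""
    return word in word_freq and word_freq[word] >= min_freq


# Invented counts for illustration only.
toy_freq = {"nga": 5000, "ka": 20000, "shnong": 800, "shnogn": 1}

for w in ["nga", "ka", "shnogn"]:
    action = "keep" if is_trusted(w, toy_freq) else "correct"
    print(w, action)
```

Treating rare in-vocabulary tokens as suspicious is what lets the checker catch misspellings that occur in the corpus itself.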
Candidate Generation
Candidate corrections are generated using edit distance operations.
Allowed operations:
| Operation | Description |
|---|---|
| Deletion | remove a character |
| Insertion | add a character |
| Replacement | replace a character |
| Transposition | swap adjacent characters |
Example:
sngewhuh → sngewthuh
shnogn → shnong
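Candidates are generated using edits1(word) and edits2(word), where:
- edits1(word) is the set of strings one edit operation away from the word
- edits2(word) is the set of strings two edit operations away from the word

After generation, only candidates present in the vocabulary are retained.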
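A minimal sketch of candidate generation in the style of the classic edit-distance spelling corrector is shown here. The `ALPHABET` string (basic Latin letters plus "ï" and "ñ") and the helper name `candidates` are assumptions, and the project's own implementation may differ in detail.

```python
# Assumed alphabet: basic Latin letters plus the Khasi letters "ï" and "ñ".
ALPHABET = "abcdefghijklmnopqrstuvwxyzïñ"


def edits1(word):
    """All strings one deletion, transposition, replacement or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + ch + right[1:]
                for left, right in splits if right for ch in ALPHABET]
    inserts = [left + ch + right for left, right in splits for ch in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def edits2(word):
    """All strings reachable by applying edits1 twice."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}


def candidates(word, vocabulary):
    """Keep only generated strings that are present in the vocabulary."""
    return (edits1(word) | edits2(word)) & vocabulary


# Toy vocabulary using words from the examples above.
toy_vocab = {"shnong", "sngewthuh", "nga"}
print(candidates("shnogn", toy_vocab))    # contains "shnong" (transposition)
print(candidates("sngewhuh", toy_vocab))  # contains "sngewthuh" (insertion)
```

The size of `edits2(word)` grows quickly with word length and alphabet size, which is why filtering against the vocabulary immediately after generation matters.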
Probabilistic Ranking
Once candidate corrections are generated, the system ranks them probabilistically.
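The classical spelling correction objective, for an observed word $w$ and a candidate correction $c$, is:

$$\hat{c} = \arg\max_{c} P(c \mid w)$$

Using Bayes' theorem:

$$\hat{c} = \arg\max_{c} \frac{P(w \mid c)\, P(c)}{P(w)}$$

Since $P(w)$ is constant across candidates:

$$\hat{c} = \arg\max_{c} P(w \mid c)\, P(c)$$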
In this implementation:
- $P(c)$ is approximated using word frequency
- contextual probabilities are modeled using bigram statistics
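With no explicit error model, ranking by the prior alone reduces to picking the most frequent candidate. The sketch below shows that reduced case, assuming a uniform error term; the function names and the counts in `toy_freq` are invented for illustration.

```python
def prior(candidate, word_freq):
    """Approximate P(c) by relative frequency in the corpus counts."""
    total = sum(word_freq.values())
    return word_freq.get(candidate, 0) / total


def rank_by_prior(candidates, word_freq):
    """Sort candidates from most to least probable under P(c) alone."""
    return sorted(candidates, key=lambda c: prior(c, word_freq), reverse=True)


# Invented counts; both words are valid candidates for the typo "sngewhuh".
toy_freq = {"sngewthuh": 400, "sngewleh": 3, "ban": 900}
ranked = rank_by_prior({"sngewleh", "sngewthuh"}, toy_freq)
print(ranked)
```

Contextual bigram probabilities are then multiplied into this score so that a frequent but contextually unlikely word does not always win.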
Language Model
A bigram language model is used to model contextual probability.
Example:
ban sngewthuh → common
ban sngewleh → rare
Thus:

$$P(\text{sngewthuh} \mid \text{ban}) > P(\text{sngewleh} \mid \text{ban})$$
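A bigram probability of this kind can be estimated directly from the word and bigram counts. The sketch below uses add-k smoothing so that unseen pairs do not receive zero probability. The smoothing scheme, the function name `bigram_prob`, and the toy counts are assumptions; the README does not say how the project smooths its estimates.

```python
def bigram_prob(prev, word, word_freq, bigram_freq, k=1.0):
    """Estimate P(word | prev) with add-k smoothing (an assumed scheme)."""
    vocab_size = len(word_freq)
    pair_count = bigram_freq.get((prev, word), 0)
    # The unigram count of `prev` stands in for the count of `prev` as a context.
    context_count = word_freq.get(prev, 0)
    return (pair_count + k) / (context_count + k * vocab_size)


# Invented counts for illustration only.
toy_word_freq = {"ban": 100, "sngewthuh": 40, "sngewleh": 2}
toy_bigram_freq = {("ban", "sngewthuh"): 30, ("ban", "sngewleh"): 1}

p_common = bigram_prob("ban", "sngewthuh", toy_word_freq, toy_bigram_freq)
p_rare = bigram_prob("ban", "sngewleh", toy_word_freq, toy_bigram_freq)
print(p_common > p_rare)  # True with these toy counts
```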
Bidirectional Context Scoring
To improve correction accuracy, both left and right context are used.
The final candidate score is:

$$\text{score}(c) = P(c) \cdot P(c \mid w_{i-1}) \cdot P(w_{i+1} \mid c)$$
Where:
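| Term | Meaning |
|---|---|
| $P(c)$ | candidate word frequency |
| $P(c \mid w_{i-1})$ | probability of the candidate given the previous word (left context) |
| $P(w_{i+1} \mid c)$ | probability of the next word given the candidate (right context) |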
This allows the system to evaluate phrases such as: me khlem leit
instead of only evaluating: me khlem
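The sketch below combines the three terms in log space, which avoids numerical underflow when small probabilities are multiplied. It is one possible reading of the scoring step and not the project's code: the add-k smoothing, the function name `log_score`, and every count are assumptions.

```python
import math


def log_score(cand, left, right, word_freq, bigram_freq, k=1.0):
    """Sum of log P(c), log P(c | left) and log P(right | c), add-k smoothed."""
    total = sum(word_freq.values())
    vocab_size = len(word_freq)

    def p_uni(w):
        return (word_freq.get(w, 0) + k) / (total + k * vocab_size)

    def p_bi(prev, w):
        pair = bigram_freq.get((prev, w), 0)
        return (pair + k) / (word_freq.get(prev, 0) + k * vocab_size)

    score = math.log(p_uni(cand))
    if left is not None:   # no left neighbour at the start of a sentence
        score += math.log(p_bi(left, cand))
    if right is not None:  # no right neighbour at the end of a sentence
        score += math.log(p_bi(cand, right))
    return score


# Invented counts for the input "me khlm leit" with two candidates.
toy_word_freq = {"kum": 5000, "khlem": 800, "me": 1000, "leit": 900}
toy_bigram_freq = {("me", "khlem"): 50, ("khlem", "leit"): 40, ("me", "kum"): 1}

best = max(
    ["kum", "khlem"],
    key=lambda c: log_score(c, "me", "leit", toy_word_freq, toy_bigram_freq),
)
print(best)  # "khlem" with these toy counts
```

With these invented counts, the two context terms outweigh the higher unigram frequency of "kum". Whether that happens on real data depends on the actual corpus statistics.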
Implementation Details
Language: Python
Framework: Gradio (via Hugging Face Spaces)
Limitations
Current limitations include:
- No explicit typo probability model $P(w \mid c)$
- Candidate explosion for short words
- No phonetic error modeling
- No neural context understanding
Example challenging case:
khlm → kum vs khlem
Because:
frequency(kum) >> frequency(khlem)
Future Improvements
Character Error Model
Learn probabilities for common typing errors.
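As a stand-in for a learned model, even a fixed per-edit penalty shows how an error term $P(w \mid c)$ could change the ranking. The sketch below is illustrative only: it does not learn anything from data, the per-edit probability `EDIT_PROB = 0.01` is hypothetical, and the counts are invented.

```python
import math

EDIT_PROB = 0.01  # hypothetical probability of a single typing error


def osa_distance(a, b):
    """Edit distance with adjacent transpositions (optimal string alignment)."""
    prev2 = None
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i] + [0] * len(b)
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
            if i > 1 and j > 1 and ca == b[j - 2] and a[i - 2] == cb:
                cur[j] = min(cur[j], prev2[j - 2] + 1)  # swap adjacent chars
        prev2, prev = prev, cur
    return prev[-1]


def noisy_channel_score(typo, cand, word_freq):
    """log P(c) + log P(w | c), with P(w | c) taken as EDIT_PROB ** distance."""
    total = sum(word_freq.values())
    log_prior = math.log(word_freq[cand] / total)
    log_error = osa_distance(typo, cand) * math.log(EDIT_PROB)
    return log_prior + log_error


# Invented counts: "kum" is far more frequent, but two edits away from "khlm".
toy_freq = {"kum": 5000, "khlem": 800}
best = max(toy_freq, key=lambda c: noisy_channel_score("khlm", c, toy_freq))
print(best)  # "khlem" with these toy values
```

A learned character error model would replace the constant `EDIT_PROB` with probabilities estimated per error type, for example from pairs of observed typos and their corrections.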
Trigram Language Model
Replace the bigram model with a trigram model:

$$P(w_i \mid w_{i-2}, w_{i-1})$$

using tools such as KenLM.
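To make the trigram idea concrete, the sketch below estimates trigram probabilities by plain maximum likelihood from counts. It is not KenLM and applies no smoothing or backoff, which a toolkit would normally provide; the function names and the toy token lists are assumptions.

```python
from collections import Counter


def build_trigram_counts(token_lists):
    """Count (w1, w2) contexts and (w1, w2, w3) trigrams over tokenized sentences."""
    context_freq = Counter()
    trigram_freq = Counter()
    for tokens in token_lists:
        # Only count a pair as a context when a third word follows it.
        context_freq.update(zip(tokens, tokens[1:-1]))
        trigram_freq.update(zip(tokens, tokens[1:], tokens[2:]))
    return context_freq, trigram_freq


def trigram_prob(w1, w2, w3, context_freq, trigram_freq):
    """Unsmoothed maximum-likelihood estimate of P(w3 | w1, w2)."""
    context = context_freq.get((w1, w2), 0)
    if context == 0:
        return 0.0
    return trigram_freq.get((w1, w2, w3), 0) / context


# Toy token lists assembled from words in this README, not corpus data.
toy_tokens = [
    ["me", "khlem", "leit"],
    ["me", "khlem", "leit"],
    ["me", "khlem", "sngewthuh"],
]
context_freq, trigram_freq = build_trigram_counts(toy_tokens)
p = trigram_prob("me", "khlem", "leit", context_freq, trigram_freq)
print(p)
```

Longer contexts make counts sparser, so a practical trigram model depends heavily on the smoothing and backoff that dedicated language-model toolkits implement.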
Neural Spell Correction
Future versions may incorporate neural models such as:
- BERT
- T5
- sequence-to-sequence transformers
for improved contextual understanding.
Intended Use
This spell checker is designed for:
- Khasi writing assistance
- educational tools
- preprocessing Khasi text
- improving downstream NLP pipelines
Citation
If you use this work, please cite:
@software{nongkynrih2026khasi_spellchecker_v1,
author = {Nongkynrih, Bapynshngainlang},
title = {Khasi Spell Checker v1},
version = {1.0},
year = {2026},
month = mar,
day = 13,
publisher = {Hugging Face},
doi = {10.57967/hf/7999},
url = {https://huggingface.co/Bapynshngain/Khasi-SpellChecker-v1}
}
APA Citation
Nongkynrih, B. (2026, March 13). Khasi Spell Checker v1 (Version 1.0) [Software]. Hugging Face. https://doi.org/10.57967/hf/7999