---
language:
- ka
tags:
- kenlm
- ngram
- georgian
- language-model
- asr
license: gpl-3.0
model-index:
- name: Georgian KenLM Language Model
  results: []
---
# 🦉 Georgian KenLM Language Model (3-gram)

This is a **KenLM 3-gram language model** trained on Georgian (ქართული) text data, intended for use in **automatic speech recognition (ASR)** and other **language modeling** tasks.

---
## 🧾 Model Details
- **Language**: Georgian (`ka`)
- **Model Type**: KenLM n-gram
- **n-gram size**: 3-gram
- **Format**: `.arpa`
- **Tooling**: [KenLM](https://github.com/kpu/kenlm)
---
## 📂 Files
- `ge_model9.arpa` – ARPA plaintext format
---
## 📚 Training Data
The model was trained on a curated collection of Georgian text from various domains:
- News articles
- Subtitles
- Books and web content

The data was cleaned, tokenized on whitespace, and normalized to standard Georgian orthography.

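
The exact cleaning pipeline is not documented here. As a rough illustration only, the sketch below shows one way such a step could look, assuming that "cleaning" means dropping everything except Mkhedruli letters and that tokens are separated by single spaces. The function name `normalize_georgian` and the character range are assumptions, not the author's actual preprocessing.

```python
import re
import unicodedata

# Assumption: keep only the 33 modern Mkhedruli letters (U+10D0..U+10F0) and whitespace.
_NON_GEORGIAN = re.compile(r"[^\u10D0-\u10F0\s]+")


def normalize_georgian(line: str) -> str:
    """Return a cleaned, whitespace-tokenized version of one line of text."""
    line = unicodedata.normalize("NFC", line)  # canonical Unicode composition
    line = _NON_GEORGIAN.sub(" ", line)        # drop digits, punctuation, Latin text
    return " ".join(line.split())              # collapse runs of whitespace


print(normalize_georgian("ეს  არის, ტესტი! 123"))
```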
---
## 💬 Intended Use
This model is ideal for:
- **Beam search decoding** in ASR systems (e.g., Whisper, DeepSpeech, Vosk)
- **Scoring and reranking** ASR hypotheses
- **Basic text modeling** or **spelling correction** in Georgian
### 🧪 Example Usage
```python
import kenlm


def transliterate_georgian(text):
    """Map each Georgian (Mkhedruli) letter to one ASCII character; leave other characters unchanged."""
    georgian_to_latin = {
        'ა': 'a', 'ბ': 'b', 'გ': 'g', 'დ': 'd', 'ე': 'e', 'ვ': 'v', 'ზ': 'z', 'თ': 'T', 'ი': 'i',
        'კ': 'k', 'ლ': 'l', 'მ': 'm', 'ნ': 'n', 'ო': 'o', 'პ': 'p', 'ჟ': 'J', 'რ': 'r', 'ს': 's',
        'ტ': 't', 'უ': 'u', 'ფ': 'f', 'ქ': 'q', 'ღ': 'R', 'ყ': 'y', 'შ': 'S', 'ჩ': 'C', 'ც': 'c',
        'ძ': 'Z', 'წ': 'w', 'ჭ': 'W', 'ხ': 'x', 'ჯ': 'j', 'ჰ': 'h'}
    return ''.join(georgian_to_latin.get(char, char) for char in text)


model = kenlm.Model("ge_model9.arpa")
sentence = "ეს არის ტესტი"  # "This is a test"
# score() returns the log10 probability of the whole sentence.
print(model.score(transliterate_georgian(sentence), bos=True, eos=True))
```
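
The value printed above is a log10 probability, which is hard to compare across sentences of different lengths. A small sketch of converting it to per-token perplexity follows, assuming the score was computed with `eos=True` so that the end-of-sentence token is counted. The helper `perplexity_from_log10` and the score `-6.0` are illustrative, not output from this model. The `kenlm` Python module also provides `model.perplexity(sentence)` for the same purpose.

```python
def perplexity_from_log10(log10_prob: float, sentence: str) -> float:
    """Convert a log10 sentence probability into per-token perplexity."""
    n_tokens = len(sentence.split()) + 1  # +1 for the end-of-sentence token
    return 10.0 ** (-log10_prob / n_tokens)


# Hypothetical score; in practice pass the value returned by model.score(...).
print(perplexity_from_log10(-6.0, "es aris testi"))
```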
---
### Citation
```bibtex
@misc{georgian-kenlm,
title={Georgian KenLM Language Model},
author={Giorgi G},
year={2025},
howpublished={\url{https://huggingface.co/psyfreak/GEO-KenLM}}
}
```