---
language:
- ka
tags:
- kenlm
- ngram
- georgian
- language-model
- asr
license: gpl-3.0
model-index:
- name: Georgian KenLM Language Model
  results: []
---

# Georgian KenLM Language Model (3-gram)

This is a **KenLM 3-gram language model** trained on Georgian (ქართული) text data, intended for use in **automatic speech recognition (ASR)** and other **language modeling** tasks.

---

## Model Details

- **Language**: Georgian (`ka`)
- **Model Type**: KenLM n-gram
- **n-gram size**: 3-gram
- **Format**: `.arpa`
- **Tooling**: [KenLM](https://github.com/kpu/kenlm)

---

## Files

- `ge_model9.arpa`: ARPA plaintext format

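
Because ARPA is a plain-text format, the file header can be inspected without loading the model into KenLM. The sketch below assumes the standard ARPA layout, in which a `\data\` section lists one `ngram N=count` line per order. The helper name `read_arpa_counts` and the sample counts are illustrative only and are not taken from this model.

```python
def read_arpa_counts(lines):
    """Return {order: count} parsed from the header of an ARPA file."""
    counts = {}
    in_header = False
    for line in lines:
        line = line.strip()
        if line == "\\data\\":
            in_header = True
        elif in_header and line.startswith("ngram "):
            order, count = line[len("ngram "):].split("=")
            counts[int(order)] = int(count)
        elif in_header and line.startswith("\\"):
            break  # the first n-gram section (e.g. \1-grams:) ends the header
    return counts


# Made-up header for illustration; real counts come from the model file.
sample = ["\\data\\", "ngram 1=5", "ngram 2=8", "ngram 3=3", "", "\\1-grams:"]
print(read_arpa_counts(sample))

# For the real file (any iterable of lines works):
# with open("ge_model9.arpa", encoding="utf-8") as f:
#     print(read_arpa_counts(f))
```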
---

## Training Data

The model was trained on a curated collection of Georgian text from several domains:

- News articles
- Subtitles
- Books and web content

The data was cleaned, tokenized on whitespace, and normalized to standard Georgian orthography.

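
The exact preprocessing script is not part of this card. As a rough sketch of what cleaning, whitespace tokenization, and normalization could look like, the example below assumes Unicode NFC normalization and a filter that keeps only the 33 modern Mkhedruli letters (U+10D0 to U+10F0). The function name and the filtering rules are assumptions for illustration, not the pipeline used for this model.

```python
import re
import unicodedata

# Assumption: anything that is not a modern Georgian letter or whitespace is dropped.
NON_GEORGIAN = re.compile(r"[^\u10D0-\u10F0\s]")


def normalize_line(line):
    """Normalize to NFC, replace non-Georgian characters with spaces, collapse whitespace."""
    line = unicodedata.normalize("NFC", line)
    line = NON_GEORGIAN.sub(" ", line)
    # Split on whitespace (tokenization) and rejoin with single spaces.
    return " ".join(line.split())


print(normalize_line("ეს   არის, ტესტი!"))  # ეს არის ტესტი
```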

---

## Intended Use

This model is intended for:

- **Beam search decoding** in ASR systems (e.g., Whisper, DeepSpeech, Vosk)
- **Scoring and reranking** ASR hypotheses
- **Basic text modeling** or **spelling correction** in Georgian

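
Reranking usually combines each hypothesis's ASR score with a weighted language-model score. The sketch below assumes log-domain scores and a tunable interpolation weight. The names `rerank` and `lm_weight` are hypothetical, and `toy_lm` stands in for a real scorer such as `lambda s: model.score(s, bos=True, eos=True)` so that the snippet runs without the model file.

```python
def rerank(hypotheses, lm_score, lm_weight=0.5):
    """Sort (text, asr_score) pairs by asr_score + lm_weight * lm_score(text), best first."""
    return sorted(
        hypotheses,
        key=lambda h: h[1] + lm_weight * lm_score(h[0]),
        reverse=True,
    )


def toy_lm(text):
    """Toy stand-in for a KenLM scorer: a fixed penalty per word (illustration only)."""
    return -1.0 * len(text.split())


nbest = [("a b c d", -2.0), ("a b", -2.5)]
print(rerank(nbest, toy_lm)[0][0])  # a b
```

ASR systems often report natural-log scores while KenLM reports base-10 logs, so `lm_weight` is best tuned on held-out data rather than fixed in advance.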
### Example Usage

```python
import kenlm


def transliterate_georgian(text):
    """Map Georgian letters to Latin codes; other characters pass through unchanged."""
    georgian_to_latin = {
        'ა': 'a', 'ბ': 'b', 'გ': 'g', 'დ': 'd', 'ე': 'e', 'ვ': 'v', 'ზ': 'z', 'თ': 'T', 'ი': 'i',
        'კ': 'k', 'ლ': 'l', 'მ': 'm', 'ნ': 'n', 'ო': 'o', 'პ': 'p', 'ჟ': 'J', 'რ': 'r', 'ს': 's',
        'ტ': 't', 'უ': 'u', 'ფ': 'f', 'ქ': 'q', 'ღ': 'R', 'ყ': 'y', 'შ': 'S', 'ჩ': 'C', 'ც': 'c',
        'ძ': 'Z', 'წ': 'w', 'ჭ': 'W', 'ხ': 'x', 'ჯ': 'j', 'ჰ': 'h'}

    return ''.join(georgian_to_latin.get(char, char) for char in text)


# Load the ARPA model and score a transliterated sentence.
model = kenlm.Model("ge_model9.arpa")
sentence = "ეს არის ტესტი"
# score() returns the total log10 probability, including <s> and </s> here.
print(model.score(transliterate_georgian(sentence), bos=True, eos=True))
```

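
`score` returns a total base-10 log probability, so longer sentences naturally score lower. To compare sentences of different lengths, the score can be converted to per-token perplexity. The sketch below shows that conversion; the helper name and the example score are made up for illustration. The KenLM Python bindings also provide `model.perplexity(sentence)`, which can be used directly.

```python
def perplexity_from_log10(log10_prob, num_words, eos=True):
    """Convert a total log10 probability into per-token perplexity."""
    # With eos=True the end-of-sentence token </s> is scored as one extra token.
    num_tokens = num_words + (1 if eos else 0)
    return 10.0 ** (-log10_prob / num_tokens)


# Made-up score: 3 words plus </s>, total log10 probability of -8.0.
print(perplexity_from_log10(-8.0, 3))  # 10 ** (8 / 4) = 100.0
```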

---

### Citation

```none
@misc{georgian-kenlm,
  title={Georgian KenLM Language Model},
  author={Giorgi G},
  year={2025},
  howpublished={\url{https://huggingface.co/psyfreak/GEO-KenLM}}
}
```