|
|
--- |
|
|
language: |
|
|
- az |
|
|
tags: |
|
|
- tokenizer |
|
|
- azerbaijani |
|
|
- nlp |
|
|
- morphology |
|
|
- hybrid |
|
|
- bpe |
|
|
- phonological-restoration |
|
|
license: apache-2.0 |
|
|
datasets: |
|
|
- uonlp/CulturaX |
|
|
- tatoeba |
|
|
metrics: |
|
|
- token_word_ratio |
|
|
- morphological_boundary_accuracy |
|
|
- root_consistency_rate |
|
|
--- |
|
|
|
|
|
# miLLi: Model Integrating Local Linguistic Insights for Morphologically Robust Tokenization |
|
|
|
|
|
**miLLi 1.0** is a hybrid tokenizer specifically engineered for the **Azerbaijani language**, addressing the limitations of standard statistical models (e.g., BPE, WordPiece) in processing agglutinative morphologies. By integrating a rule-based root dictionary with statistical learning, the model prioritizes **morphological integrity** and **semantic root preservation** over purely frequency-based compression. |
|
|
|
|
|
The model introduces a dynamic **Phonological Restoration** algorithm designed to map allomorphic variations (e.g., vowel loss, consonant mutations) back to their canonical root forms during the pre-tokenization phase. For example, in *uşağı* (the accusative of *uşaq*, 'child'), the root-final *q* surfaces as *ğ*; restoration recovers the canonical root before segmentation.
|
|
|
|
|
## 1. Methodology and Architecture |
|
|
|
|
|
The architecture of miLLi 1.0 is built upon a three-stage hybrid pipeline: |
|
|
|
|
|
1. **Linguistic Pre-processing:** |
|
|
* Utilization of a cleaned root dictionary based on Mozilla's `az.dic`. |
|
|
* Implementation of an **Aho-Corasick** based Trie structure for efficient root matching. |
|
|
* **Case Handling:** Application of a special `<UPPER>` token strategy to consolidate vocabulary and preserve Named Entity Recognition (NER) signals without case-sensitivity redundancy. |
|
|
|
|
|
2. **Phonological Restoration:** |
|
|
* A dynamic algorithm that identifies phonetically modified stems (e.g., *q-ğ*, *k-y* mutations) and restores them to their lemma forms before segmentation. |
|
|
* Adopts a **"Longest Restored Match"** principle to ensure valid morphological segmentation (illustrated in the sketch following this list).
|
|
|
|
|
3. **Statistical Subword Segmentation:** |
|
|
* A Byte-Pair Encoding (BPE) model trained on the **CulturaX** Azerbaijani corpus (a 500k-line subset) is applied to the remaining suffixes and out-of-vocabulary terms.
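
To make the pipeline concrete, the sketch below illustrates stages 1 and 2: longest-match lookup against a root dictionary combined with phonological restoration of mutated stems. Everything in it is a minimal, illustrative assumption: the toy root set, the `RESTORATIONS` table, and the function names are not the released miLLi implementation, and a naive prefix scan stands in for the Aho-Corasick automaton.

```python
# Illustrative sketch only: toy data and names, not the miLLi internals.
ROOTS = {"kitab", "uşaq", "ürək", "dalğa"}  # the real model uses a cleaned az.dic

# Word-final consonant mutations to undo before dictionary lookup
# (q -> ğ and k -> y before vowel-initial suffixes).
RESTORATIONS = {"ğ": "q", "y": "k"}

def restore_candidates(stem: str):
    """Yield the stem itself plus any phonologically restored variant."""
    yield stem
    if stem and stem[-1] in RESTORATIONS:
        yield stem[:-1] + RESTORATIONS[stem[-1]]

def longest_restored_match(word: str):
    """Longest Restored Match: scan prefixes from longest to shortest and
    accept the first one whose (restored) form is a known root."""
    for end in range(len(word), 0, -1):
        for candidate in restore_candidates(word[:end]):
            if candidate in ROOTS:
                return candidate, word[end:]
    return None, word  # out-of-vocabulary: falls through to plain BPE

# "uşağı" surfaces with the q -> ğ mutation; restoration recovers the
# canonical root "uşaq" before the suffix "ı" is handed to the BPE stage.
print(longest_restored_match("uşağı"))     # ('uşaq', 'ı')
print(longest_restored_match("ürəyim"))    # ('ürək', 'im')
print(longest_restored_match("kitabdan"))  # ('kitab', 'dan')
```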
|
|
|
|
|
## 2. Empirical Evaluation |
|
|
|
|
|
The performance of miLLi 1.0 was evaluated using a dual-strategy approach: **Quantitative Efficiency** (measured on the Tatoeba corpus) and **Qualitative Accuracy** (measured on a curated Morphological Challenge Set). |
|
|
|
|
|
### 2.1. Quantitative Analysis: Token Efficiency |
|
|
*Metric: Token/Word (T/W) Ratio (Lower indicates higher compression)* |
|
|
|
|
|
Evaluations on the **Tatoeba** corpus (~4,500 sentences) demonstrate that miLLi 1.0 offers a balanced representation. While local statistical models achieve higher compression through whole-word memorization, miLLi 1.0 significantly outperforms global multilingual models. |
|
|
|
|
|
| Model | Category | T/W Ratio | |
|
|
| :--- | :--- | :---: | |
|
|
| **aLLMA** | Local (Statistical) | 1.418 | |
|
|
| **AzeBERT** | Local (Statistical) | 1.571 | |
|
|
| **miLLi 1.0** | **Local (Hybrid)** | **1.955** | |
|
|
| **GPT-4o** | Global (SOTA) | 2.387 | |
|
|
| **mBERT** | Global (Multilingual) | 2.521 | |
|
|
| **GPT-3.5** | Global (Legacy) | 3.490 | |
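
For reference, the T/W ratio is simply the total number of subword tokens divided by the total number of words. The snippet below is a minimal sketch of this computation for any loaded Hugging Face tokenizer; whitespace word segmentation is an assumption, since the exact segmentation used in the evaluation is not specified here.

```python
def token_word_ratio(tokenizer, sentences):
    """Total subword tokens divided by total whitespace-delimited words
    (lower means fewer tokens per word, i.e. better compression)."""
    total_tokens = sum(len(tokenizer.tokenize(s)) for s in sentences)
    total_words = sum(len(s.split()) for s in sentences)
    return total_tokens / total_words

# e.g. ratio = token_word_ratio(tokenizer, tatoeba_sentences)
```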
|
|
|
|
|
### 2.2. Qualitative Analysis: Linguistic Robustness |
|
|
*Metrics: Morphological Boundary Accuracy (MBA) & Root Consistency Rate (RCR)* |
|
|
|
|
|
This evaluation measures the model's ability to correctly identify the linguistic root and restore phonetically modified stems. |
|
|
|
|
|
* **MBA:** Measures the percentage of words split correctly at the root-suffix boundary.

* **RCR:** Measures how consistently inflected variants of the same lemma are mapped to an identical root token.
|
|
|
|
|
| Model | MBA (%) | |
|
|
| :--- | :---: | |
|
|
| **miLLi 1.0** | **53.0%** | |
|
|
| **XLM-RoBERTa** | 38.0% |
|
|
| **aLLMA** | 16.0% | |
|
|
| **mBERT** | 11.0% | |
|
|
| **GPT-4o** | 4.0% | |
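
The sketch below shows one way MBA can be scored, assuming a gold list of `(word, root)` pairs and treating a word as correct when its first emitted token equals the gold root; the scoring details, including the handling of subword continuation markers, are assumptions rather than the exact evaluation protocol.

```python
def boundary_accuracy(tokenizer, gold_pairs):
    """Share of words whose first emitted token matches the gold root,
    i.e. the first split falls exactly on the root-suffix boundary."""
    correct = 0
    for word, root in gold_pairs:
        tokens = tokenizer.tokenize(word)
        # Strip common subword prefix markers before comparison.
        first = tokens[0].lstrip("▁Ġ#") if tokens else ""
        correct += int(first == root)
    return correct / len(gold_pairs)

# e.g. mba = boundary_accuracy(tokenizer, [("uşağı", "uşaq"), ("kitabdan", "kitab")])
```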
|
|
|
|
|
## 3. Usage |
|
|
|
|
|
The model is compatible with the Hugging Face `transformers` library. Due to the custom Python logic required for phonological restoration, the `trust_remote_code=True` parameter is mandatory. |
|
|
|
|
|
### Installation |
|
|
|
|
|
```bash
pip install transformers tokenizers
```

### Quick Start

```python
from transformers import AutoTokenizer

# Load the tokenizer (the custom restoration logic requires trust_remote_code)
tokenizer = AutoTokenizer.from_pretrained("elshadrahimov/miLLi-1.0", trust_remote_code=True)

text = "Vətənimizin bayrağı yüksəkliklərdə dalğalanır."

# Tokenize
tokens = tokenizer.tokenize(text)
print(tokens)

# Encode
input_ids = tokenizer.encode(text)
print(input_ids)
```
|
|
## 4. Limitations

* **Dictionary Dependence:** The restoration capability is strictly limited to the coverage of the underlying root dictionary. Neologisms and dialectal forms absent from the dictionary are processed via standard BPE without restoration.

* **Computational Latency:** The pre-tokenization layer (Trie search and rule-based restoration) introduces a slight inference latency compared to purely C++-optimized tokenizers.

* **Sequence Length:** The use of the `<UPPER>` token for capitalization handling results in a marginal increase in sequence length compared to cased tokenizers.
|
|
## 5. Citation

If you use miLLi 1.0 in your research, please cite it as follows:
|
|
|
|
|
**APA Style:** |
|
|
Rahimov, E. (2025). *miLLi: Model Integrating Local Linguistic Insights for Morphologically Robust Tokenization*. Hugging Face. https://huggingface.co/elshadrahimov/miLLi-1.0 |
|
|
|
|
|
**BibTeX:** |
|
|
```bibtex |
|
|
@misc{rahimov2025milli, |
|
|
author = {Rahimov, Elshad}, |
|
|
title = {miLLi: Model Integrating Local Linguistic Insights for Morphologically Robust Tokenization}, |
|
|
year = {2025}, |
|
|
howpublished = {\url{https://huggingface.co/elshadrahimov/miLLi-1.0}}, |
|
|
note = {Hugging Face Model Hub} |
|
|
} |
|
|
|
|
|
|
|
|
|