---
language:
- az
tags:
- tokenizer
- azerbaijani
- nlp
- morphology
- hybrid
- bpe
- phonological-restoration
license: apache-2.0
datasets:
- uonlp/CulturaX
- tatoeba
metrics:
- token_word_ratio
- morphological_boundary_accuracy
- root_consistency_rate
---

# miLLi: Model Integrating Local Linguistic Insights for Morphologically Robust Tokenization

**miLLi 1.0** is a hybrid tokenizer specifically engineered for the **Azerbaijani language**, addressing the limitations of standard statistical models (e.g., BPE, WordPiece) in processing agglutinative morphologies. By integrating a rule-based root dictionary with statistical learning, the model prioritizes **morphological integrity** and **semantic root preservation** over purely frequency-based compression.

The model introduces a dynamic **Phonological Restoration** algorithm designed to map allomorphic variations (e.g., vowel loss, consonant mutations) back to their canonical root forms during the pre-tokenization phase.

## 1. Methodology and Architecture

The architecture of miLLi 1.0 is built upon a three-stage hybrid pipeline (a minimal code sketch follows the list):

1.  **Linguistic Pre-processing:**
    *   Utilization of a cleaned root dictionary based on Mozilla's `az.dic`.
    *   Implementation of an **Aho-Corasick**-based trie for efficient root matching.
    *   **Case Handling:** A special `<UPPER>` token strategy that lowercases capitalized words while keeping the casing signal available for Named Entity Recognition (NER), avoiding duplicate cased/uncased vocabulary entries.

2.  **Phonological Restoration:**
    *   A dynamic algorithm that identifies phonetically modified stems (e.g., *q-ğ*, *k-y* mutations) and restores them to their lemma forms before segmentation.
    *   Adopts a **"Longest Restored Match"** principle to ensure valid morphological segmentation.

3.  **Statistical Subword Segmentation:**
    *   A Byte-Pair Encoding (BPE) model trained on the **CulturaX** Azerbaijani corpus (500k line subset) is applied to the remaining suffixes and out-of-vocabulary terms.
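
The listing below is a minimal, self-contained sketch of how the three stages compose. The root set, restoration table, and BPE stand-in are toy placeholders (the actual model uses the cleaned `az.dic` dictionary, an Aho-Corasick trie, and a trained BPE model); it illustrates the control flow, not the production implementation.

```python
# Toy stand-ins for the real resources: the production model uses the cleaned
# az.dic root dictionary, an Aho-Corasick trie, and a trained BPE model.
ROOTS = {"bayraq", "vətən", "kitab"}

# Phonological restoration table: mutated final consonant -> canonical form
# (e.g. q -> ğ before vowel-initial suffixes, so "bayrağı" restores to "bayraq")
RESTORE = {"ğ": "q", "y": "k"}


def mark_case(word: str) -> tuple[list[str], str]:
    """Stage 1 case handling: emit an <UPPER> marker and lowercase the word,
    so cased and uncased surface forms share one vocabulary entry."""
    if word[:1].isupper():
        return ["<UPPER>"], word.lower()
    return [], word


def longest_restored_match(word: str):
    """Stages 1-2: find the longest known root, trying phonological restoration
    on the final consonant of each candidate stem ("Longest Restored Match")."""
    for end in range(len(word), 1, -1):
        stem = word[:end]
        if stem in ROOTS:
            return stem, word[end:]
        if stem[-1] in RESTORE and stem[:-1] + RESTORE[stem[-1]] in ROOTS:
            return stem[:-1] + RESTORE[stem[-1]], word[end:]
    return None


def tokenize_word(word: str, bpe) -> list[str]:
    """Full pipeline for one word: case marker, restored root, BPE on the rest."""
    markers, word = mark_case(word)
    match = longest_restored_match(word)
    if match is None:
        return markers + bpe(word)          # OOV word: fall back to plain BPE
    root, suffix = match
    return markers + [root] + (bpe(suffix) if suffix else [])


naive_bpe = lambda s: [s]                    # placeholder for the trained BPE model

print(tokenize_word("Bayrağı", naive_bpe))   # ['<UPPER>', 'bayraq', 'ı']
```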

## 2. Empirical Evaluation

The performance of miLLi 1.0 was evaluated using a dual-strategy approach: **Quantitative Efficiency** (measured on the Tatoeba corpus) and **Qualitative Accuracy** (measured on a curated Morphological Challenge Set).

### 2.1. Quantitative Analysis: Token Efficiency
*Metric: Token/Word (T/W) Ratio (Lower indicates higher compression)*

Evaluations on the **Tatoeba** corpus (~4,500 sentences) demonstrate that miLLi 1.0 offers a balanced representation. While local statistical models achieve higher compression through whole-word memorization, miLLi 1.0 significantly outperforms global multilingual models.

| Model | Category | T/W Ratio |
| :--- | :--- | :---: |
| **aLLMA** | Local (Statistical) | 1.418 |
| **AzeBERT** | Local (Statistical) | 1.571 |
| **miLLi 1.0** | **Local (Hybrid)** | **1.955** |
| **GPT-4o** | Global (SOTA) | 2.387 |
| **mBERT** | Global (Multilingual) | 2.521 |
| **GPT-3.5** | Global (Legacy) | 3.490 |
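
For reference, the T/W ratio reported above can be computed in a few lines of Python. This is a minimal sketch: `token_word_ratio` is a hypothetical helper, and words are approximated by whitespace splitting, which is an assumption rather than the exact evaluation protocol.

```python
def token_word_ratio(tokenizer, sentences: list[str]) -> float:
    """Total subword tokens divided by total whitespace-separated words."""
    total_tokens = sum(len(tokenizer.tokenize(s)) for s in sentences)
    total_words = sum(len(s.split()) for s in sentences)
    return total_tokens / total_words
```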

### 2.2. Qualitative Analysis: Linguistic Robustness
*Metrics: Morphological Boundary Accuracy (MBA) & Root Consistency Rate (RCR)*

This evaluation measures the model's ability to correctly identify the linguistic root and restore phonetically modified stems.

*   **MBA:** Measures the percentage of words split correctly at the root-suffix boundary.

| Model | MBA (%) |
| :--- | :---: |
| **miLLi 1.0** | **53.0%** |
| **XLM-RoBERTa**| 38.0% |
| **aLLMA** | 16.0% |
| **mBERT** | 11.0% |
| **GPT-4o** | 4.0% |
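
As a reference point, one plausible way to score MBA is sketched below, assuming the challenge set provides one gold (restored) root per word. The `boundary_accuracy` helper, the `<UPPER>` filtering, and the `▁` prefix stripping are illustrative assumptions, not the exact evaluation script.

```python
def boundary_accuracy(tokenizer, challenge_set: list[tuple[str, str]]) -> float:
    """challenge_set: (surface_word, gold_root) pairs, e.g. ("bayrağı", "bayraq").

    A word is scored correct when the first produced content token equals the
    gold (restored) root, i.e. the first split lands on the root-suffix boundary.
    """
    correct = 0
    for word, gold_root in challenge_set:
        tokens = [t for t in tokenizer.tokenize(word) if t != "<UPPER>"]
        first = tokens[0].lstrip("▁") if tokens else ""
        if first == gold_root:
            correct += 1
    return correct / len(challenge_set)
```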

## 3. Usage

The model is compatible with the Hugging Face `transformers` library. Due to the custom Python logic required for phonological restoration, the `trust_remote_code=True` parameter is mandatory.

### Installation

```bash
pip install transformers tokenizers
```

### Quick Start

```python
from transformers import AutoTokenizer

# Load the tokenizer (the custom restoration logic ships with the repository,
# hence trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("elshadrahimov/miLLi-1.0", trust_remote_code=True)

text = "Vətənimizin bayrağı yüksəkliklərdə dalğalanır."

# Tokenize
tokens = tokenizer.tokenize(text)
print(tokens)

# Encode
input_ids = tokenizer.encode(text)
print(input_ids)
```
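
Beyond single sentences, the standard `transformers` call interface also works for batched, padded model inputs; the snippet below additionally assumes PyTorch is installed for `return_tensors="pt"`.

```python
batch = tokenizer(
    ["Salam, dünya!", "Vətənimizin bayrağı yüksəkliklərdə dalğalanır."],
    padding=True,
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # (batch_size, max_sequence_length)
```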
## 4. Limitations

*   **Dictionary Dependence:** Restoration is strictly limited by the coverage of the underlying root dictionary; neologisms and dialectal forms absent from the dictionary are processed via standard BPE without restoration.
*   **Computational Latency:** The pre-tokenization layer (trie search and rule-based restoration) introduces a slight inference latency compared to purely C++-optimized tokenizers.
*   **Sequence Length:** The `<UPPER>` token used for capitalization handling results in a marginal increase in sequence length compared to cased tokenizers.

## 5. Citation

If you use miLLi 1.0 in your research, please cite it as follows:

**APA Style:**
Rahimov, E. (2025). *miLLi: Model Integrating Local Linguistic Insights for Morphologically Robust Tokenization*. Hugging Face. https://huggingface.co/elshadrahimov/miLLi-1.0

**BibTeX:**
```bibtex
@misc{rahimov2025milli,
  author = {Rahimov, Elshad},
  title = {miLLi: Model Integrating Local Linguistic Insights for Morphologically Robust Tokenization},
  year = {2025},
  howpublished = {\url{https://huggingface.co/elshadrahimov/miLLi-1.0}},
  note = {Hugging Face Model Hub}
}
```