|
|
--- |
|
|
language: |
|
|
- az |
|
|
tags: |
|
|
- tokenizer |
|
|
- azerbaijani |
|
|
- nlp |
|
|
- morphology |
|
|
- hybrid |
|
|
- bpe |
|
|
- phonological-restoration |
|
|
license: apache-2.0 |
|
|
datasets: |
|
|
- uonlp/CulturaX |
|
|
- tatoeba |
|
|
metrics: |
|
|
- token_word_ratio |
|
|
- morphological_boundary_accuracy |
|
|
- root_consistency_rate |
|
|
--- |
|
|
|
|
|
# miLLi: Model Integrating Local Linguistic Insights for Morphologically Robust Tokenization |
|
|
|
|
|
**miLLi 1.0** is a hybrid tokenizer specifically engineered for the **Azerbaijani language**, addressing the limitations of standard statistical models (e.g., BPE, WordPiece) in processing agglutinative morphologies. By integrating a rule-based root dictionary with statistical learning, the model prioritizes **morphological integrity** and **semantic root preservation** over purely frequency-based compression. |
|
|
|
|
|
The model introduces a dynamic **Phonological Restoration** algorithm designed to map allomorphic variations (e.g., vowel loss, consonant mutations) back to their canonical root forms during the pre-tokenization phase. For example, in *uşağı* (the accusative of *uşaq*, 'child'), the root-final *q* surfaces as *ğ*; restoration recovers the canonical root before segmentation.
|
|
|
|
|
## 1. Methodology and Architecture |
|
|
|
|
|
The architecture of miLLi 1.0 is built upon a three-stage hybrid pipeline: |
|
|
|
|
|
1. **Linguistic Pre-processing:** |
|
|
* Utilization of a cleaned root dictionary based on Mozilla's `az.dic`. |
|
|
* Implementation of an **Aho-Corasick** based Trie structure for efficient root matching. |
|
|
* **Case Handling:** Application of a special `<UPPER>` token strategy to consolidate vocabulary and preserve Named Entity Recognition (NER) signals without case-sensitivity redundancy. |
|
|
|
|
|
2. **Phonological Restoration:** |
|
|
* A dynamic algorithm that identifies phonetically modified stems (e.g., *q-ğ*, *k-y* mutations) and restores them to their lemma forms before segmentation. |
|
|
* Adopts a **"Longest Restored Match"** principle to ensure valid morphological segmentation (illustrated in the sketch following this list).
|
|
|
|
|
3. **Statistical Subword Segmentation:** |
|
|
* A Byte-Pair Encoding (BPE) model trained on the **CulturaX** Azerbaijani corpus (a 500k-line subset) is applied to the remaining suffixes and out-of-vocabulary terms.
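
To make the pipeline concrete, the sketch below illustrates stages 1 and 2: longest-match lookup against a root dictionary combined with phonological restoration of mutated stems. Everything in it is a minimal, illustrative assumption: the toy root set, the `RESTORATIONS` table, and the function names are not the released miLLi implementation, and a naive prefix scan stands in for the Aho-Corasick automaton.

```python
# Illustrative sketch only: toy data and names, not the miLLi internals.
ROOTS = {"kitab", "uşaq", "ürək", "dalğa"}  # the real model uses a cleaned az.dic

# Word-final consonant mutations to undo before dictionary lookup
# (q -> ğ and k -> y before vowel-initial suffixes).
RESTORATIONS = {"ğ": "q", "y": "k"}

def restore_candidates(stem: str):
    """Yield the stem itself plus any phonologically restored variant."""
    yield stem
    if stem and stem[-1] in RESTORATIONS:
        yield stem[:-1] + RESTORATIONS[stem[-1]]

def longest_restored_match(word: str):
    """Longest Restored Match: scan prefixes from longest to shortest and
    accept the first one whose (restored) form is a known root."""
    for end in range(len(word), 0, -1):
        for candidate in restore_candidates(word[:end]):
            if candidate in ROOTS:
                return candidate, word[end:]
    return None, word  # out-of-vocabulary: falls through to plain BPE

# "uşağı" surfaces with the q -> ğ mutation; restoration recovers the
# canonical root "uşaq" before the suffix "ı" is handed to the BPE stage.
print(longest_restored_match("uşağı"))     # ('uşaq', 'ı')
print(longest_restored_match("ürəyim"))    # ('ürək', 'im')
print(longest_restored_match("kitabdan"))  # ('kitab', 'dan')
```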
|
|
|
|
|
## 2. Empirical Evaluation |
|
|
|
|
|
The performance of miLLi 1.0 was evaluated using a dual-strategy approach: **Quantitative Efficiency** (measured on the Tatoeba corpus) and **Qualitative Accuracy** (measured on a curated Morphological Challenge Set). |
|
|
|
|
|
### 2.1. Quantitative Analysis: Token Efficiency |
|
|
*Metric: Token/Word (T/W) Ratio (Lower indicates higher compression)* |
|
|
|
|
|
Evaluations on the **Tatoeba** corpus (~4,500 sentences) demonstrate that miLLi 1.0 offers a balanced representation. While local statistical models achieve higher compression through whole-word memorization, miLLi 1.0 significantly outperforms global multilingual models. |
|
|
|
|
|
| Model | Category | T/W Ratio | |
|
|
| :--- | :--- | :---: | |
|
|
| **aLLMA** | Local (Statistical) | 1.418 | |
|
|
| **AzeBERT** | Local (Statistical) | 1.571 | |
|
|
| **miLLi 1.0** | **Local (Hybrid)** | **1.955** | |
|
|
| **GPT-4o** | Global (SOTA) | 2.387 | |
|
|
| **mBERT** | Global (Multilingual) | 2.521 | |
|
|
| **GPT-3.5** | Global (Legacy) | 3.490 | |
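
For reference, the T/W ratio is simply the total number of subword tokens divided by the total number of words. The snippet below is a minimal sketch of this computation for any loaded Hugging Face tokenizer; whitespace word segmentation is an assumption, since the exact segmentation used in the evaluation is not specified here.

```python
def token_word_ratio(tokenizer, sentences):
    """Total subword tokens divided by total whitespace-delimited words
    (lower means fewer tokens per word, i.e. better compression)."""
    total_tokens = sum(len(tokenizer.tokenize(s)) for s in sentences)
    total_words = sum(len(s.split()) for s in sentences)
    return total_tokens / total_words

# e.g. ratio = token_word_ratio(tokenizer, tatoeba_sentences)
```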
|
|
|
|
|
### 2.2. Qualitative Analysis: Linguistic Robustness |
|
|
*Metrics: Morphological Boundary Accuracy (MBA) & Root Consistency Rate (RCR)* |
|
|
|
|
|
This evaluation measures the model's ability to correctly identify the linguistic root and restore phonetically modified stems. |
|
|
|
|
|
* **MBA:** Measures the percentage of words split correctly at the root-suffix boundary.

* **RCR:** Measures how consistently inflected variants of the same lemma are mapped to an identical root token.
|
|
|
|
|
| Model | MBA (%) | |
|
|
| :--- | :---: | |
|
|
| **miLLi 1.0** | **53.0%** | |
|
|
| **XLM-RoBERTa** | 38.0% |
|
|
| **aLLMA** | 16.0% | |
|
|
| **mBERT** | 11.0% | |
|
|
| **GPT-4o** | 4.0% | |
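
The sketch below shows one way MBA can be scored, assuming a gold list of `(word, root)` pairs and treating a word as correct when its first emitted token equals the gold root; the scoring details, including the handling of subword continuation markers, are assumptions rather than the exact evaluation protocol.

```python
def boundary_accuracy(tokenizer, gold_pairs):
    """Share of words whose first emitted token matches the gold root,
    i.e. the first split falls exactly on the root-suffix boundary."""
    correct = 0
    for word, root in gold_pairs:
        tokens = tokenizer.tokenize(word)
        # Strip common subword prefix markers before comparison.
        first = tokens[0].lstrip("▁Ġ#") if tokens else ""
        correct += int(first == root)
    return correct / len(gold_pairs)

# e.g. mba = boundary_accuracy(tokenizer, [("uşağı", "uşaq"), ("kitabdan", "kitab")])
```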
|
|
|
|
|
## 3. Usage |
|
|
|
|
|
The model is compatible with the Hugging Face `transformers` library. Due to the custom Python logic required for phonological restoration, the `trust_remote_code=True` parameter is mandatory. |
|
|
|
|
|
### Installation |
|
|
|
|
|
```bash
pip install transformers tokenizers
```

### Quick Start

```python
from transformers import AutoTokenizer

# Load the tokenizer (the custom restoration logic requires trust_remote_code)
tokenizer = AutoTokenizer.from_pretrained("elshadrahimov/miLLi-1.0", trust_remote_code=True)

text = "Vətənimizin bayrağı yüksəkliklərdə dalğalanır."

# Tokenize
tokens = tokenizer.tokenize(text)
print(tokens)

# Encode
input_ids = tokenizer.encode(text)
print(input_ids)
```
|
|
## 4. Limitations

* **Dictionary Dependence:** The restoration capability is strictly limited to the coverage of the underlying root dictionary. Neologisms and dialectal forms absent from the dictionary are processed via standard BPE without restoration.

* **Computational Latency:** The pre-tokenization layer (Trie search and rule-based restoration) introduces a slight inference latency compared to purely C++-optimized tokenizers.

* **Sequence Length:** The use of the `<UPPER>` token for capitalization handling results in a marginal increase in sequence length compared to cased tokenizers.
|
|
## 5. Citation

If you use miLLi 1.0 in your research, please cite it as follows:
|
|
|
|
|
**APA Style:** |
|
|
Rahimov, E. (2025). *miLLi: Model Integrating Local Linguistic Insights for Morphologically Robust Tokenization*. Hugging Face. https://huggingface.co/elshadrahimov/miLLi-1.0 |
|
|
|
|
|
**BibTeX:** |
|
|
```bibtex |
|
|
@misc{rahimov2025milli, |
|
|
author = {Rahimov, Elshad}, |
|
|
title = {miLLi: Model Integrating Local Linguistic Insights for Morphologically Robust Tokenization}, |
|
|
year = {2025}, |
|
|
howpublished = {\url{https://huggingface.co/elshadrahimov/miLLi-1.0}}, |
|
|
note = {Hugging Face Model Hub} |
|
|
} |
|
|
|
|
|
|
|
|
|