📌 Note

This model was built on AraBERT v02 and trained specifically on my own dataset. It is optimized for my internal training pipeline and use case.

Users who wish to further improve or extend this model should adapt it according to their own datasets and requirements. Since each dataset has its own linguistic distribution, style, and domain characteristics, it is unrealistic to expect a single model to perform optimally across all scenarios.

However, the model is designed to be easily fine-tuned, allowing it to be efficiently adapted to different datasets and tasks with minimal adjustments.

⚠️ Limitations of Previous Methods

Most traditional tokenizers and pre-trained models struggle with Arabic for the following reasons:

❌ Poor handling of broken (irregular) plurals

❌ Poor representation of dual forms and plural declension

❌ Inconsistent handling of derivational forms

❌ Over- or under-segmentation of words

❌ Ignoring attached clitics (prefixes and suffixes) such as:

The definite article

Conjunctions

Prepositions

Pronoun suffixes
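
To make the clitic categories above concrete, here is a toy rule-based splitter. It is only an illustrative sketch and is not how this tokenizer works internally: the clitic lists, the `MIN_STEM` threshold, and the function name `split_clitics` are all assumptions made for the example. It also shows how naive rules over-segment a word whose first letter merely looks like a preposition.

```python
# Toy clitic lists (illustrative assumptions, far from complete).
CONJUNCTIONS = ("و", "ف")
PREPOSITIONS = ("ب", "ل")
ARTICLE = "ال"
# Longer suffixes come first so that "هم" wins over "ه".
PRONOUN_SUFFIXES = ("هما", "هم", "ها", "نا", "كم", "ه", "ك", "ي")

MIN_STEM = 3  # never strip a clitic if fewer than 3 letters would remain


def split_clitics(word: str) -> list[str]:
    """Split one undiacritized word into [prefixes..., stem, suffix]."""
    prefixes = []
    stem = word

    # Peel at most one conjunction, then one preposition, then the article.
    for group in (CONJUNCTIONS, PREPOSITIONS, (ARTICLE,)):
        for clitic in group:
            if stem.startswith(clitic) and len(stem) - len(clitic) >= MIN_STEM:
                prefixes.append(clitic)
                stem = stem[len(clitic):]
                break

    # Peel at most one pronoun suffix.
    suffix = []
    for clitic in PRONOUN_SUFFIXES:
        if stem.endswith(clitic) and len(stem) - len(clitic) >= MIN_STEM:
            suffix = [clitic]
            stem = stem[: -len(clitic)]
            break

    return prefixes + [stem] + suffix


print(split_clitics("وبالكتاب"))  # conjunction + preposition + article + stem
print(split_clitics("كتابهم"))    # stem + pronoun suffix
print(split_clitics("بيتنا"))     # over-segmented: the leading ب belongs to the stem
```

The last call returns `["ب", "يتنا"]` instead of the correct `["بيت", "نا"]`, which is exactly the kind of over-segmentation listed above and the reason purely rule-based splitting is not enough.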

🧠 Proposed Solution

This tokenizer is designed to improve the handling of Arabic linguistic structures. It offers:

✔ Smarter sub-word segmentation tailored to Arabic morphology

✔ Preservation of semantic structure during segmentation

✔ Improved handling of prefixes and suffixes

✔ Reduced loss of meaning in long texts

✔ Optimization for training high-quality Arabic text-to-speech models

🎯 Intended Use

This tokenizer is designed for:

Training Arabic text-to-speech models

Speech synthesis pipelines

Improving the quality of Arabic pronunciation in neural models

Research in Arabic natural language processing and speech systems

🌍 Complexity of the Arabic Language

Arabic is a morphologically rich language with complex root-pattern formation, inflection, derivation, and clitic attachment. Features such as broken plurals, dual forms, and extensive morphological variation increase lexical diversity and make Arabic particularly challenging for traditional tokenization approaches.
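
As a small illustration of root-pattern formation, the sketch below fills templates written with the traditional placeholder letters ف ع ل with the consonants of a triliteral root. The helper name `apply_pattern` is hypothetical, the patterns are undiacritized, and the sketch ignores weak roots and other irregularities, so treat it as a teaching aid only.

```python
def apply_pattern(root: str, pattern: str) -> str:
    """Fill a template written with the placeholders ف ع ل using a triliteral root."""
    if len(root) != 3:
        raise ValueError("this sketch only handles triliteral roots")
    # Map each placeholder letter to the corresponding root consonant.
    slots = dict(zip("فعل", root))
    # Letters that are not placeholders (e.g. م, ا, و) are copied unchanged.
    return "".join(slots.get(ch, ch) for ch in pattern)


ROOT = "كتب"  # k-t-b, the root associated with writing

PATTERNS = {
    "فاعل": "active participle (writer)",
    "مفعول": "passive participle (written)",
    "فعال": "noun (book)",
    "مفعل": "noun of place (office, desk)",
}

for pattern, gloss in PATTERNS.items():
    print(pattern, "->", apply_pattern(ROOT, pattern), "|", gloss)
```

One root yields many surface forms that share no common prefix or suffix, which is part of why tokenizers built around affix splitting or simple frequency statistics tend to fragment Arabic words inconsistently.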

from transformers import AutoTokenizer

MODEL_NAME = "sherif1313/arabic-tokenizer-tts"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)

print("\n" + "=" * 80)
print("HF ARABIC TOKENIZER TEST")
print("=" * 80)
print("VOCAB SIZE:", tokenizer.vocab_size)
print("=" * 80)

test_cases = [
    "كتابهم كبير",
    "كتابها قديم",
    "بيتنا جميل",
    "سيارتي جديدة",
    "مدرستك بعيدة",
    "اختلافاتها التاريخية والثقافية",
    "التطورات التكنولوجية الحديثة غيرت شكل العالم بشكل كبير",
    "الديمقراطية التمثيلية البرلمانية",
    "الديموغرافيا الاقتصادية",
    "الاستشراف المستقبلي",
    "المؤسسات التعليمية والثقافية تلعب دورا مهما",
    "فبالتالي ومن هنا",
]

total_tokens = 0

for text in test_cases:
    encoding = tokenizer(text, add_special_tokens=False)

    tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"])
    ids = encoding["input_ids"]

    print("\nTEXT:", text)
    print("TOKENS:", tokens)
    print("IDS:", ids)
    print("COUNT:", len(ids))

    total_tokens += len(ids)

print("\n" + "=" * 80)
print("TOTAL TOKENS:", total_tokens)
print("AVERAGE TOKENS PER SENTENCE:", total_tokens / len(test_cases))
print("=" * 80)
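
The average tokens per sentence printed above depends on how long each test sentence is. A complementary measure is the average number of tokens per whitespace-separated word, which is easier to compare across test sets. The sketch below is one possible way to compute it. The names `tokens_per_word` and `chunk_tokenizer` are hypothetical, and the stand-in tokenizer exists only so the example runs without downloading anything.

```python
from typing import Callable, Iterable


def tokens_per_word(tokenize: Callable[[str], list], texts: Iterable[str]) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("texts contain no words")
    return total_tokens / total_words


def chunk_tokenizer(text: str, size: int = 3) -> list[str]:
    """Stand-in tokenizer (an assumption): cut each word into chunks of `size` letters."""
    return [
        word[i:i + size]
        for word in text.split()
        for i in range(0, len(word), size)
    ]


samples = ["كتابهم كبير", "بيتنا جميل"]
print("tokens per word:", tokens_per_word(chunk_tokenizer, samples))
```

With the tokenizer loaded earlier, `tokenizer.tokenize` could be passed in place of `chunk_tokenizer`. Lower values mean words are split into fewer pieces, although a low value alone does not show that the splits are linguistically meaningful.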

⚠️ Note: I plan to expand and refine this tokenizer by adding more training data.

Thanks to AraBERT v02

This tokenizer is also the initial core component for training the 3arab-TTS-500M-v1 project, which is currently under active development to build a large-scale Arabic TTS system from scratch.
