Update README.md
README.md
CHANGED
@@ -1,3 +1,18 @@
+---
+license: cc-by-sa-4.0
+language:
+- ar
+base_model:
+- U4RASD/NeoAraBERT
+tags:
+- NeoAraBERT
+- NeoBERT
+- BERT
+- MSA
+- dialect-arabic
+- masked-language-model
+pipeline_tag: feature-extraction
+---
 # NeoAraBERT
 NeoAraBERT is a state-of-the-art open-source Arabic text-embedding model built on the NeoBERT architecture. We pretrain NeoAraBERT on diverse open-source and internal datasets covering modern standard, classical, and dialectal Arabic. We guided our design choices with Arabic tailored ablation studies including text normalization, light stemming, and diacritics-aware tokenization handling. We also performed POS-aware token masking and learning-rate scheduling ablation studies. We benchmarked NeoAraBERT against five top-performing Arabic models on 23 tasks, including a synonym-based task, [Muradif](https://huggingface.co/datasets/U4RASD/Muradif), that directly assesses embedding quality with no additional fine-tuning. NeoAraBERT variants rank first in 18 tasks and improve average performance across the full benchmark suite.
 
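
The README text in this diff says the synonym-based task assesses embedding quality with no additional fine-tuning. A minimal sketch of what that kind of check can look like is given here, assuming candidates are ranked by cosine similarity to a query embedding. The toy vectors, the word labels, and the helper names `cosine_similarity` and `rank_candidates` are hypothetical illustrations; they are not model outputs, and the actual Muradif format and scoring protocol are not described in this change.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_candidates(
    query: Sequence[float],
    candidates: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Sort candidate labels by similarity to the query, most similar first."""
    scored = [(label, cosine_similarity(query, vec)) for label, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy vectors standing in for embeddings produced by a model.
query_vec = [0.9, 0.1, 0.0]
candidate_vecs = {
    "synonym": [0.8, 0.2, 0.1],
    "unrelated": [0.0, 0.1, 0.9],
    "loosely_related": [0.5, 0.5, 0.5],
}

ranking = rank_candidates(query_vec, candidate_vecs)
for label, score in ranking:
    print(f"{label}: {score:.3f}")
```

Because the ranking uses only the raw vectors, no classifier head or task-specific training is involved; any difference in ranking quality between two models comes from the embeddings themselves.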