Update README.md
README.md
CHANGED
|
@@ -18,16 +18,28 @@ library_name: Transformers
|
| 18 |
# NeoAraBERT
|
| 19 |
NeoAraBERT is a state-of-the-art open-source Arabic text-embedding model built on the NeoBERT architecture. We pretrained NeoAraBERT on diverse open-source and internal datasets covering Modern Standard, Classical, and Dialectal Arabic. We guided our design choices with Arabic-tailored ablation studies, including text normalization, light stemming, and diacritics-aware tokenization. We also performed ablation studies on POS-aware token masking and learning-rate scheduling. We benchmarked NeoAraBERT against five top-performing Arabic models on 23 tasks, including a synonym-based task, [Muradif](https://huggingface.co/datasets/U4RASD/Muradif), that directly assesses embedding quality with no additional fine-tuning. NeoAraBERT variants rank first in 18 of the 23 tasks and improve average performance across the full benchmark suite.
|
| 20 |
|
| 21 |
-
This is the NeoAraBERT_Mix checkpoint, our best-performing checkpoint overall. This model was introduced at the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026). For more information, visit our website: https://acr.ps/neoarabert.
|
| 22 |
|
| 23 |
The available NeoAraBERT checkpoints:
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
|
| 29 |
|
| 30 |
-
|
| 31 |
|
| 32 |
### How to Use
|
| 33 |
Install these libraries:
|
| 18 |
# NeoAraBERT
|
| 19 |
NeoAraBERT is a state-of-the-art open-source Arabic text-embedding model built on the NeoBERT architecture. We pretrained NeoAraBERT on diverse open-source and internal datasets covering Modern Standard, Classical, and Dialectal Arabic. We guided our design choices with Arabic-tailored ablation studies, including text normalization, light stemming, and diacritics-aware tokenization. We also performed ablation studies on POS-aware token masking and learning-rate scheduling. We benchmarked NeoAraBERT against five top-performing Arabic models on 23 tasks, including a synonym-based task, [Muradif](https://huggingface.co/datasets/U4RASD/Muradif), that directly assesses embedding quality with no additional fine-tuning. NeoAraBERT variants rank first in 18 of the 23 tasks and improve average performance across the full benchmark suite.
|
| 20 |
|
| 21 |
+
This is the **NeoAraBERT_Mix** checkpoint, our best-performing checkpoint overall. This model was introduced at the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026). For more information, visit our website: https://acr.ps/neoarabert.
|
| 22 |
|
| 23 |
The available NeoAraBERT checkpoints:
|
| 24 |
+
| Model | Description | Link |
|
| 25 |
+
|---|---|---|
|
| 26 |
+
| NeoAraBERT (**NeoAraBERT_Mix**) | Trained on both Modern Standard Arabic and Dialectal Arabic. | this repository ✅ |
|
| 27 |
+
| NeoAraBERT_MSA | Trained on Modern Standard Arabic. | [link](https://huggingface.co/U4RASD/NeoAraBERT_MSA) |
|
| 28 |
+
| NeoAraBERT_DA | Trained on Dialectal Arabic. | [link](https://huggingface.co/U4RASD/NeoAraBERT_DA) |
|
| 29 |
+
|
| 30 |
+
| Model | Average Score |
|
| 31 |
+
| ------------------ | ------------: |
|
| 32 |
+
| **NeoAraBERT_Mix** | **83.79** |
|
| 33 |
+
| NeoAraBERT_DA | 83.44 |
|
| 34 |
+
| NeoAraBERT_MSA | 83.30 |
|
| 35 |
+
| AraModernBERT | 81.04 |
|
| 36 |
+
| AraBERTv2 | 80.75 |
|
| 37 |
+
| MARBERTv2 | 80.45 |
|
| 38 |
+
| ARBERTv2 | 80.31 |
|
| 39 |
+
| CAMeLBERT-mix | 80.04 |
|
| 40 |
|
| 41 |
+
|
| 42 |
+
For detailed benchmark results, see https://acr.ps/neoarabert.
|
| 43 |
|
| 44 |
### How to Use
|
| 45 |
Install these libraries:
|