Osama-Rakan-Al-Mraikhat committed
Commit 5fd9d41 (verified) · 1 Parent(s): 6c88e62

Update README.md

Files changed (1)
  1. README.md +19 -2
README.md CHANGED
@@ -19,7 +19,21 @@ pipeline_tag: feature-extraction
19
  library_name: Transformers
20
  ---
21
  # NeoAraBERT
22
- NeoAraBERT is a state-of-the-art open-source Arabic text-embedding model built on the NeoBERT architecture. We pretrained NeoAraBERT on diverse open-source and internal datasets covering modern standard, classical, and dialectal Arabic. We guided our design choices with Arabic-tailored ablation studies including text normalization, light stemming, and diacritics-aware tokenization. We also performed POS-aware token masking and learning-rate scheduling ablation studies. We benchmarked NeoAraBERT against five top-performing Arabic models on 23 tasks, including a synonym-based task, [Muradif](https://huggingface.co/datasets/U4RASD/Muradif), that directly assesses embedding quality with no additional fine-tuning. NeoAraBERT variants rank first in 18 tasks and improve average performance across the full benchmark suite.
23
 
24
  This is the **NeoAraBERT_Mix** checkpoint, our best-performing checkpoint overall. This model was introduced at the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026). For more information, visit our website: https://acr.ps/neoarabert.
25
 
@@ -30,7 +44,7 @@ The available NeoAraBERT checkpoints:
30
  | NeoAraBERT_MSA | Trained on Modern Standard Arabic. | [link](https://huggingface.co/U4RASD/NeoAraBERT_MSA) |
31
  | NeoAraBERT_DA | Trained on Dialectal Arabic. | [link](https://huggingface.co/U4RASD/NeoAraBERT_DA) |
32
 
33
- ![mix](https://cdn-uploads.huggingface.co/production/uploads/65338533a78e70d19c850120/Xupu7ff-rv7bmu8NYT7Sp.png)
34
 
35
  For detailed benchmarking, see https://acr.ps/neoarabert.
36
 
@@ -77,5 +91,8 @@ If you use the code, model, or the Muradif benchmark, please cite:
77
  }
78
  ```
79
 
 
 
 
80
  ### License
81
  This model is licensed under the CC BY-SA 4.0 license. The text of the license can be found [here](https://creativecommons.org/licenses/by-sa/4.0/).
 
19
  library_name: Transformers
20
  ---
21
  # NeoAraBERT
22
+ <table align="right" style="border: none; margin-left: 28px; margin-bottom: 16px; width: 290px;">
23
+ <tr style="border: none;">
24
+ <td align="center" style="border: none; padding: 0;">
25
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/65338533a78e70d19c850120/jl_hN3qIJtAm-oqlH2BXW.png" width="168" style="border: none; box-shadow: none;">
26
+ </td>
27
+ </tr>
28
+ <tr style="border: none;">
29
+ <td align="center" style="border: none; padding: 0; padding-top: 4px;">
30
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/65338533a78e70d19c850120/8ijYrACVusalZ3CIU0rk_.png" width="290" style="border: none; box-shadow: none;">
31
+ </td>
32
+ </tr>
33
+ </table>
34
+ NeoAraBERT is a state-of-the-art open-source Arabic text-embedding model built on the NeoBERT architecture. This project was a collaboration between the Arab Center for Research and Policy Studies’ (ACRPS) Unit for Research In Arabic Social and Digital Spaces (U4RASD) and the American University of Beirut (AUB).
35
+
36
+ We pretrained NeoAraBERT on diverse open-source and internal datasets covering modern standard, classical, and dialectal Arabic. We guided our design choices with Arabic-tailored ablation studies including text normalization, light stemming, and diacritics-aware tokenization. We also performed POS-aware token masking and learning-rate scheduling ablation studies. We benchmarked NeoAraBERT against five top-performing Arabic models on 23 tasks, including a synonym-based task, [Muradif](https://huggingface.co/datasets/U4RASD/Muradif), that directly assesses embedding quality with no additional fine-tuning. NeoAraBERT variants rank first in 18 tasks and improve average performance across the full benchmark suite.
37
 
38
  This is the **NeoAraBERT_Mix** checkpoint, our best-performing checkpoint overall. This model was introduced at the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026). For more information, visit our website: https://acr.ps/neoarabert.
39
 
 
44
  | NeoAraBERT_MSA | Trained on Modern Standard Arabic. | [link](https://huggingface.co/U4RASD/NeoAraBERT_MSA) |
45
  | NeoAraBERT_DA | Trained on Dialectal Arabic. | [link](https://huggingface.co/U4RASD/NeoAraBERT_DA) |
46
 
47
+ ![mix](https://cdn-uploads.huggingface.co/production/uploads/65338533a78e70d19c850120/Xupu7ff-rv7bmu8NYT7Sp.png)
48
 
49
  For detailed benchmarking, see https://acr.ps/neoarabert.
50
 
 
91
  }
92
  ```
93
 
94
+ ### Acknowledgements
95
+ We would like to acknowledge Ahmad Talal Salman from Assafir and Professor Amer Abdo Mouawad from the American University of Beirut for sharing Assafir data, which was instrumental to the work presented in this paper.
96
+
97
  ### License
98
  This model is licensed under the CC BY-SA 4.0 license. The text of the license can be found [here](https://creativecommons.org/licenses/by-sa/4.0/).