Osama-Rakan-Al-Mraikhat committed
Commit 9860bfb · verified · 1 Parent(s): 77ed0cb

Update README.md

Files changed (1)
  1. README.md +19 -7
README.md CHANGED
@@ -18,16 +18,28 @@ library_name: Transformers
  # NeoAraBERT
  NeoAraBERT is a state-of-the-art open-source Arabic text-embedding model built on the NeoBERT architecture. We pretrain NeoAraBERT on diverse open-source and internal datasets covering modern standard, classical, and dialectal Arabic. We guided our design choices with Arabic tailored ablation studies including text normalization, light stemming, and diacritics-aware tokenization handling. We also performed POS-aware token masking and learning-rate scheduling ablation studies. We benchmarked NeoAraBERT against five top-performing Arabic models on 23 tasks, including a synonym-based task, [Muradif](https://huggingface.co/datasets/U4RASD/Muradif), that directly assesses embedding quality with no additional fine-tuning. NeoAraBERT variants rank first in 18 tasks and improve average performance across the full benchmark suite.
 
- This is the NeoAraBERT_Mix checkpoint, our best-performing checkpoint overall. This model was introduced at the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026). For more information, visit our website: https://acr.ps/neoarabert.
+ This is the **NeoAraBERT_Mix** checkpoint, our best-performing checkpoint overall. This model was introduced at the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026). For more information, visit our website: https://acr.ps/neoarabert.
 
  The available NeoAraBERT checkpoints:
- | Model | Description | Link |
- |---|---|---|
- | NeoAraBERT | Trained on both Modern Standard Arabic and Dialectal Arabic. | this repository ✅ |
- | NeoAraBERT_MSA | Trained on Modern Standard Arabic. | [link](https://huggingface.co/U4RASD/NeoAraBERT_MSA) |
- | NeoAraBERT_DA | Trained on Dialectal Arabic. | [link](https://huggingface.co/U4RASD/NeoAraBERT_DA) |
+ | Model | Description | Link |
+ |---|---|---|
+ | NeoAraBERT (**NeoAraBERT_Mix**) | Trained on both Modern Standard Arabic and Dialectal Arabic. | this repository ✅ |
+ | NeoAraBERT_MSA | Trained on Modern Standard Arabic. | [link](https://huggingface.co/U4RASD/NeoAraBERT_MSA) |
+ | NeoAraBERT_DA | Trained on Dialectal Arabic. | [link](https://huggingface.co/U4RASD/NeoAraBERT_DA) |
+
+ | Model | Average Score |
+ | ------------------ | ------------: |
+ | **NeoAraBERT_Mix** | **83.79** |
+ | NeoAraBERT_DA | 83.44 |
+ | NeoAraBERT_MSA | 83.30 |
+ | AraModernBERT | 81.04 |
+ | AraBERTv2 | 80.75 |
+ | MARBERTv2 | 80.45 |
+ | ARBERTv2 | 80.31 |
+ | CAMeLBERT-mix | 80.04 |
 
- ![bench](https://cdn-uploads.huggingface.co/production/uploads/65338533a78e70d19c850120/1Hmc13qHxygG2bQl98xv9.png)
+
+ For detailed benchmarking, see https://acr.ps/neoarabert.
 
  ### How to Use
  Install these libraries: