Osama-Rakan-Al-Mraikhat committed on
Commit 33dc0a6 · verified · 1 Parent(s): a7495f6

Update README.md

Files changed (1)
  1. README.md +16 -2
README.md CHANGED
@@ -52,9 +52,23 @@ print(embedding.shape)
  ```
 
  ### Citation
- If you use the code, model, or the Muradif benchmark, please reference this work in your paper:
+ If you use the code, model, or the Muradif benchmark, please cite:
  ```bibtex
- The citation will be added here soon.
+ @inproceedings{abou-chakra-etal-2026-neoarabert,
+ title = "{NeoAraBERT}: A Modern Foundation Model for Arabic Embeddings with Diacritics-Aware Tokenization and POS-Targeted Masking",
+ author = "Abou Chakra, Chadi and
+ Hamoud, Hadi and
+ Al Mraikhat, Osama Rakan and
+ Abu Obaida, Qusai and
+ Ballout, Mohamad and
+ Zaraket, Fadi A.",
+ booktitle = "Findings of the Association for Computational Linguistics: ACL 2026",
+ address = "San Diego, California, United States",
+ year = "2026",
+ note = "Accepted paper",
+ url = "https://acr.ps/neoarabert",
+ abstract = {We present NeoAraBERT, a state-of-the-art open-source Arabic text-embedding model built on the NeoBERT architecture. We pre-train NeoAraBERT on diverse open-source and internal datasets covering modern standard, classical, and dialectal Arabic. We guided our design choices with Arabic tailored ablation studies including text normalization, light stemming, and diacritics-aware tokenization handling. We also performed more general POS-aware token masking and learning-rate scheduling ablation studies. We benchmarked NeoAraBERT against five top-performing Arabic models on 23 tasks, including a novel synonym-based task, ``Muradif'', that directly assesses embedding quality with no additional fine-tuning. NeoAraBERT variants (MSA, dialectal, and mixed) rank first in 18 tasks, second in two, third in two, and fourth in one task. They show strong performance on classical and modern standard Arabic, substantial margins of improvement ($>$7\%) in two tasks, and a $+$2.75\% improvement on average across all tasks. Our code and links to checkpoints for our model variants are available on our website: \url{https://acr.ps/neoarabert}}
+ }
  ```
 
  ### License