# character-bert_NArabizi

This model is a **CharacterBERT-based model trained from scratch** on **raw NArabizi data**: code-switched dialectal Arabic (NArabizi) from social media. It was introduced in the paper:

- [Can Character-based Language Models Improve Downstream Task Performances in Low-Resource and Noisy Language Scenarios?](https://aclanthology.org/2021.wnut-1.7/) (Riabi et al., 2021, W-NUT)

---

## 📝 Description

The model follows the **CharacterBERT** architecture (El Boukkouri et al., 2020), which removes WordPiece tokenization in favor of a **Character-CNN module** that generates full word representations from raw character sequences.

---

## 📖 Citation

If you use this model, please cite both the paper that introduced it and the original CharacterBERT architecture:

```bibtex
@inproceedings{riabi-2021-can,
    title = "Can Character-based Language Models Improve Downstream Task Performances in Low-Resource and Noisy Language Scenarios?",
    author = {Riabi, Arij and Sagot, Beno{\^i}t and Seddah, Djam{\'e}},
    booktitle = "Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic (Online)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.wnut-1.7/"
}

@inproceedings{el-boukkouri-etal-2020-characterbert,
    title = "{C}haracter{BERT}: Reconciling {ELM}o and {BERT} for Word-Level Open-Vocabulary Representations From Characters",
    author = "El Boukkouri, Hicham and Ferret, Olivier and Lavergne, Thomas and Noji, Hiroshi and Zweigenbaum, Pierre and Tsujii, Jun{'}ichi",
    booktitle = "Proceedings of the 28th International Conference on Computational Linguistics",
    month = dec,
    year = "2020",
    address = "Barcelona, Spain (Online)",
    publisher = "International Committee on Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.coling-main.609",
    doi = "10.18653/v1/2020.coling-main.609",
    pages = "6903--6915"
}
```
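The Character-CNN described above consumes each word as a fixed-length sequence of character IDs instead of subword pieces, then pools that sequence into a single word embedding. The sketch below illustrates this input representation only; the special-marker IDs, byte offset, and 50-character budget are illustrative assumptions, not the exact constants of the released implementation.

```python
# Illustrative sketch of CharacterBERT-style character-level word encoding.
# All constants below are assumptions for illustration, not the values used
# by the reference implementation.

MAX_WORD_LEN = 50   # fixed character budget per word (assumed)
PAD_ID = 0          # padding slot (assumed)
BOW_ID = 1          # begin-of-word marker (assumed)
EOW_ID = 2          # end-of-word marker (assumed)
BYTE_OFFSET = 3     # UTF-8 byte values are shifted past the special IDs

def encode_word(word: str, max_len: int = MAX_WORD_LEN) -> list[int]:
    """Map one word to a fixed-length sequence of character IDs."""
    body = [BYTE_OFFSET + b for b in word.encode("utf-8")]
    body = body[: max_len - 2]                  # leave room for BOW/EOW
    ids = [BOW_ID] + body + [EOW_ID]
    return ids + [PAD_ID] * (max_len - len(ids))

def encode_sentence(words: list[str]) -> list[list[int]]:
    """One row of character IDs per word; the Character-CNN pools each row
    into a word embedding, so no subword vocabulary is ever built."""
    return [encode_word(w) for w in words]

# NArabizi-style tokens: Arabic written in Latin script with digits.
matrix = encode_sentence(["3lach", "rak", "tdir", "hakda"])
```

Because the encoding works on raw characters, out-of-vocabulary or noisily spelled tokens (frequent in user-generated NArabizi) still receive a representation, which is the motivation for this architecture in the paper above.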