# character-bert_NArabizi

This is a **CharacterBERT model trained from scratch** on **raw NArabizi data**, i.e. code-switched dialectal Arabic collected from social media.

It was introduced in the paper:

- [Can Character-based Language Models Improve Downstream Task Performances in Low-Resource and Noisy Language Scenarios?](https://aclanthology.org/2021.wnut-1.7/)  
  (Riabi et al., 2021, W-NUT)

---

## 📝 Description

The model follows the **CharacterBERT** architecture (El Boukkouri et al., 2020), which removes WordPiece tokenization in favor of a **Character-CNN module** that generates full word representations from raw character sequences.
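
To make the idea of building word representations from raw characters more concrete, here is a minimal sketch of the input side only: each whitespace-separated word is turned into a fixed-length sequence of byte ids, which is the kind of input a character-CNN would then consume. This is an illustration, not the model's actual indexer: the names `word_to_char_ids` and `encode_sentence`, the cap `MAX_WORD_LEN`, the padding id, and the whitespace splitting are all assumptions made for the example.

```python
from typing import List

MAX_WORD_LEN = 8  # hypothetical cap on bytes kept per word (assumption)
PAD_ID = 0        # hypothetical padding id (assumption)


def word_to_char_ids(word: str, max_len: int = MAX_WORD_LEN) -> List[int]:
    """Map one word to a fixed-length list of byte ids.

    Each UTF-8 byte is shifted by 1 so that 0 stays reserved for padding.
    Words longer than max_len bytes are truncated.
    """
    ids = [b + 1 for b in word.encode("utf-8")[:max_len]]
    return ids + [PAD_ID] * (max_len - len(ids))


def encode_sentence(sentence: str) -> List[List[int]]:
    """Split on whitespace and encode every word independently.

    No subword vocabulary is involved, so unseen or noisy spellings
    still receive a well-defined input.
    """
    return [word_to_char_ids(w) for w in sentence.split()]


if __name__ == "__main__":
    for row in encode_sentence("3lach hak"):
        print(row)
```

Because every word is encoded from its own characters, spelling variants common in noisy user-generated text are never split into arbitrary WordPiece fragments; they simply produce slightly different character sequences.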

---

## 📖 Citation

If you use this model, please cite both the paper that introduced it and the original CharacterBERT architecture:

```bibtex
@inproceedings{riabi-2021-can,
    title = "Can Character-based Language Models Improve Downstream Task Performances in Low-Resource and Noisy Language Scenarios?",
    author = {Riabi, Arij and Sagot, Beno{\^i}t and Seddah, Djam{\'e}},
    booktitle = "Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic (Online)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.wnut-1.7/"
}

@inproceedings{el-boukkouri-etal-2020-characterbert,
    title = "{C}haracter{BERT}: Reconciling {ELM}o and {BERT} for Word-Level Open-Vocabulary Representations From Characters",
    author = "El Boukkouri, Hicham  and
      Ferret, Olivier  and
      Lavergne, Thomas  and
      Noji, Hiroshi  and
      Zweigenbaum, Pierre  and
      Tsujii, Jun{'}ichi",
    booktitle = "Proceedings of the 28th International Conference on Computational Linguistics",
    month = dec,
    year = "2020",
    address = "Barcelona, Spain (Online)",
    publisher = "International Committee on Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.coling-main.609",
    doi = "10.18653/v1/2020.coling-main.609",
    pages = "6903--6915"
}
```