---
language:
- ar
license: apache-2.0
library_name: transformers
pipeline_tag: text-classification
base_model: FacebookAI/xlm-roberta-base
tags:
- quality-classifier
- data-filtering
- pretraining
- fineweb2-hq
---

<p align="center">
  <a href="https://huggingface.co/collections/AdaMLLab/mixminmatch">
    <img src="https://img.shields.io/badge/🤗_Collection-MixMinMatch-blue" alt="MixMinMatch Collection">
  </a>
</p>

# XLM-RoBERTa Arabic Quality Classifier

A text quality classifier for Arabic pretraining data, fine-tuned from XLM-RoBERTa. This model reproduces the FineWeb2-HQ approach ([Messmer et al., 2025](https://arxiv.org/abs/2502.10361)) for Arabic: the original authors released their training code but not their trained classifiers.

For better Arabic performance and faster inference, see [mmBERT-Arabic-Quality-Classifier](https://huggingface.co/AdaMLLab/mmBERT-Arabic-Quality-Classifier).

## Usage

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="AdaMLLab/XLM-RoBERTa-Arabic-Quality-Classifier")
result = classifier("النص العربي هنا")  # "Arabic text here"
```

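The score the classifier assigns can drive data filtering as in FineWeb2-HQ, which retains only the highest-scoring documents (the paper selects roughly the top 10% per language). A minimal sketch of that selection step, assuming each document has already been paired with its classifier score; `select_top_fraction` is a hypothetical helper for illustration, not part of this repository:

```python
def select_top_fraction(docs_with_scores, fraction=0.1):
    """Keep the highest-scoring fraction of (document, score) pairs.

    FineWeb2-HQ-style filtering keeps roughly the top 10% by classifier score.
    """
    ranked = sorted(docs_with_scores, key=lambda item: item[1], reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return [doc for doc, _ in ranked[:keep]]

# Example with made-up scores:
corpus = [("doc_a", 0.91), ("doc_b", 0.12), ("doc_c", 0.55), ("doc_d", 0.73)]
selected = select_top_fraction(corpus, fraction=0.5)
print(selected)  # ['doc_a', 'doc_d']
```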
## Citation

```bibtex
@misc{messmer2025fineweb2hq,
  title={Enhancing Multilingual LLM Pretraining with Model-Based Data Selection},
  author={Bettina Messmer and Vinko Sabolčec and Martin Jaggi},
  year={2025},
  eprint={2502.10361},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.10361},
}

@misc{alrashed2025mixminmatch,
  title={Mix, MinHash, and Match: Cross-Source Agreement for Multilingual Pretraining Datasets},
  author={Sultan Alrashed and Francesco Orabona},
  year={2025},
  eprint={2512.18834v2},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2512.18834v2},
}
```