---
language:
- ar
license: apache-2.0
library_name: transformers
pipeline_tag: text-classification
base_model: jhu-clsp/mmBERT-small
tags:
- quality-classifier
- data-filtering
- pretraining
---

<p align="center">
  <a href="https://huggingface.co/collections/AdaMLLab/mixminmatch">
    <img src="https://img.shields.io/badge/🤗_Collection-MixMinMatch-blue" alt="MixMinMatch Collection">
  </a>
</p>

# mmBERT Arabic Quality Classifier

A text quality classifier for Arabic pretraining data, built on [mmBERT-small](https://huggingface.co/jhu-clsp/mmBERT-small). It was used to create [AraMix-HQ](https://huggingface.co/datasets/AdaMLLab/AraMix-HQ).

This model follows the FineWeb2-HQ approach ([Messmer et al., 2025](https://arxiv.org/abs/2502.10361)), but uses mmBERT as the encoder, with the aim of improving Arabic understanding.

## Usage

```python
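from transformers import pipeline

# Load the classifier from the Hugging Face Hub.
classifier = pipeline("text-classification", model="AdaMLLab/mmBERT-Arabic-Quality-Classifier")

# Returns a list with one {"label": ..., "score": ...} dict per input text.
result = classifier("النص العربي هنا")
print(result)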
```
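
For data filtering, the pipeline call above can be wrapped in a small helper that keeps only the documents the classifier rates as high quality. The following is a minimal sketch, not the procedure used to build AraMix-HQ: the helper name `filter_by_quality`, the `keep_label` argument, and the `min_score` default are all assumptions made for illustration. This card does not list the model's label names, so check `classifier.model.config.id2label` before choosing `keep_label`.

```python
from typing import Any, Callable, Dict, List, Sequence

# One prediction as returned by a text-classification pipeline:
# {"label": str, "score": float}
Prediction = Dict[str, Any]


def filter_by_quality(
    texts: Sequence[str],
    classify: Callable[[List[str]], List[Prediction]],
    keep_label: str,
    min_score: float = 0.5,  # hypothetical default, tune for your data
) -> List[str]:
    """Keep texts whose top prediction is `keep_label` with score >= `min_score`.

    `classify` is any callable that maps a list of texts to one prediction
    dict per text, such as a transformers text-classification pipeline.
    """
    predictions = classify(list(texts))
    kept = []
    for text, pred in zip(texts, predictions):
        if pred["label"] == keep_label and pred["score"] >= min_score:
            kept.append(text)
    return kept
```

With the `classifier` from the usage snippet, a call would look like `filter_by_quality(docs, classifier, keep_label=...)`, where `docs` is your list of strings and `keep_label` is whichever entry of `id2label` denotes high-quality text. Taking the classifier as a parameter keeps the helper independent of model loading, so the filtering logic can be checked with a stub before running it over a large corpus.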

## Citation

```bib
@misc{alrashed2025mixminmatch,
      title={Mix, MinHash, and Match: Cross-Source Agreement for Multilingual Pretraining Datasets}, 
      author={Sultan Alrashed and Francesco Orabona},
      year={2025},
      eprint={2512.18834v2},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2512.18834v2}, 
}
```