---
language:
- ar
license: apache-2.0
library_name: transformers
pipeline_tag: text-classification
base_model: jhu-clsp/mmBERT-small
tags:
- quality-classifier
- data-filtering
- pretraining
---
# mmBERT Arabic Quality Classifier

A text quality classifier for Arabic pretraining data, trained from [mmBERT-small](https://huggingface.co/jhu-clsp/mmBERT-small). Used to create [AraMix-HQ](https://huggingface.co/datasets/AdaMLLab/AraMix-HQ).

This model implements the FineWeb2-HQ approach ([Messmer et al., 2025](https://arxiv.org/abs/2502.10361)) but uses mmBERT as the encoder for improved Arabic understanding.

## Usage

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="AdaMLLab/mmBERT-Arabic-Quality-Classifier")
result = classifier("النص العربي هنا")
```

## Citation

```bib
@misc{alrashed2025mixminmatch,
  title={Mix, MinHash, and Match: Cross-Source Agreement for Multilingual Pretraining Datasets},
  author={Sultan Alrashed and Francesco Orabona},
  year={2025},
  eprint={2512.18834v2},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2512.18834v2},
}
```
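
## Filtering sketch

Since the classifier is intended for data filtering, the snippet from the Usage section can be extended to keep only documents above a score cutoff. The following is a minimal sketch, not the procedure used to build AraMix-HQ. The helper names `load_classifier` and `filter_by_score`, the `threshold` default of 0.5, and the `positive_label` parameter are all assumptions made for illustration. This card does not document the model's label names or a recommended cutoff, so check `config.id2label` and validate a threshold on your own data before relying on it.

```python
from typing import Callable, Iterable, List, Optional


def load_classifier():
    """Build the pipeline shown in the Usage section.

    transformers is imported lazily so that filter_by_score below can be
    used (and tested) without it installed.
    """
    from transformers import pipeline

    return pipeline(
        "text-classification",
        model="AdaMLLab/mmBERT-Arabic-Quality-Classifier",
    )


def filter_by_score(
    texts: Iterable[str],
    classify: Callable[[List[str]], List[dict]],
    threshold: float = 0.5,  # hypothetical cutoff, not taken from the model card
    positive_label: Optional[str] = None,  # hypothetical; see config.id2label
) -> List[str]:
    """Keep the texts whose top prediction passes the threshold.

    `classify` is any callable that maps a list of strings to a list of
    {"label": str, "score": float} dicts, which is the shape a
    text-classification pipeline returns for list input.
    If `positive_label` is given, a text is kept only when its top label
    matches it; otherwise only the score is compared.
    """
    texts = list(texts)
    predictions = classify(texts)
    kept = []
    for text, pred in zip(texts, predictions):
        if positive_label is not None and pred["label"] != positive_label:
            continue
        if pred["score"] >= threshold:
            kept.append(text)
    return kept
```

To run it against the real model, pass the pipeline as the callable: `filter_by_score(docs, load_classifier(), threshold=0.5)`. Keeping the classifier as a parameter lets the same helper work with a batched or cached wrapper around the pipeline.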