---
language:
- ar
license: apache-2.0
library_name: transformers
pipeline_tag: text-classification
base_model: jhu-clsp/mmBERT-small
tags:
- quality-classifier
- data-filtering
- pretraining
---

<p align="center">
  <a href="https://huggingface.co/collections/AdaMLLab/mixminmatch">
    <img src="https://img.shields.io/badge/🤗_Collection-MixMinMatch-blue" alt="MixMinMatch Collection">
  </a>
</p>

# mmBERT Arabic Quality Classifier

A text quality classifier for Arabic pretraining data, trained from [mmBERT-small](https://huggingface.co/jhu-clsp/mmBERT-small). Used to create [AraMix-HQ](https://huggingface.co/datasets/AdaMLLab/AraMix-HQ).

This model implements the FineWeb2-HQ approach ([Messmer et al., 2025](https://arxiv.org/abs/2502.10361)) but uses mmBERT as the encoder for improved Arabic understanding.

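Quality classifiers of this kind are commonly applied by scoring every document and keeping only the highest-scoring share of the corpus. The following is a minimal sketch of that selection step, assuming the scores have already been computed and that a higher score means higher quality. The function name `select_top_fraction` and the `fraction` parameter are hypothetical, and this card does not state which fraction or threshold was used to build AraMix-HQ.

```python
import math
from typing import List, Sequence, Tuple


def select_top_fraction(
    texts: Sequence[str],
    scores: Sequence[float],
    fraction: float,
) -> List[Tuple[str, float]]:
    """Return the highest-scoring share of documents, best first.

    `fraction` is an illustrative parameter (e.g. 0.1 keeps the top 10%);
    it is not a value taken from the model card.
    """
    if len(texts) != len(scores):
        raise ValueError("texts and scores must have the same length")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")

    # Sort (text, score) pairs by score, highest first. The sort is stable,
    # so documents with equal scores keep their original relative order.
    ranked = sorted(zip(texts, scores), key=lambda pair: pair[1], reverse=True)

    # Round up so that a non-empty corpus always keeps at least one document.
    n_keep = math.ceil(len(ranked) * fraction)
    return ranked[:n_keep]
```

Selecting by rank rather than by a fixed score cutoff keeps the size of the filtered corpus predictable, at the cost of making the effective cutoff depend on the score distribution of each corpus.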
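## Usage

```python
from transformers import pipeline

# trust_remote_code=True lets Transformers run the custom model code hosted in the repository.
classifier = pipeline("text-classification", model="AdaMLLab/mmBERT-Arabic-Quality-Classifier", trust_remote_code=True)
result = classifier("النص العربي هنا")  # placeholder input meaning "Arabic text here"
print(result)  # a list with one dict holding a label and a score
```
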
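For data filtering, the classifier is usually run over many documents rather than a single string. The sketch below batches the inputs and keeps texts whose score meets a cutoff. It assumes the pipeline returns one `{"label": ..., "score": ...}` dict per input and that the returned score rises with quality; check the model's label mapping before relying on that. The helper names `load_classifier`, `iter_batches`, and `filter_by_score`, and the `threshold` parameter, are hypothetical and not part of this repository.

```python
from typing import Any, Callable, Dict, Iterator, List, Sequence, Tuple

MODEL_ID = "AdaMLLab/mmBERT-Arabic-Quality-Classifier"

# Any callable that maps a list of texts to a list of
# {"label": ..., "score": ...} dicts, as a Transformers pipeline does.
Classifier = Callable[[List[str]], List[Dict[str, Any]]]


def load_classifier() -> Classifier:
    """Build the pipeline lazily; the model is downloaded on first use."""
    from transformers import pipeline

    return pipeline("text-classification", model=MODEL_ID, trust_remote_code=True)


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def filter_by_score(
    classifier: Classifier,
    texts: Sequence[str],
    threshold: float,
    batch_size: int = 32,
) -> List[Tuple[str, float]]:
    """Keep (text, score) pairs whose score is at least `threshold`.

    Assumption: the score of the returned label increases with quality.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    kept: List[Tuple[str, float]] = []
    for batch in iter_batches(texts, batch_size):
        predictions = classifier(batch)
        for text, prediction in zip(batch, predictions):
            score = float(prediction["score"])
            if score >= threshold:
                kept.append((text, score))
    return kept
```

A typical call would be `filter_by_score(load_classifier(), documents, threshold=0.5)`, where `0.5` is only a placeholder value. Accepting the classifier as a plain callable keeps the filtering logic testable with a stub, without downloading the model.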
## Citation

```bib
@misc{alrashed2025mixminmatch,
  title={Mix, MinHash, and Match: Cross-Source Agreement for Multilingual Pretraining Datasets},
  author={Sultan Alrashed and Francesco Orabona},
  year={2025},
  eprint={2512.18834v2},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2512.18834v2},
}
```