---
library_name: transformers
tags:
- readability
license: mit
base_model:
- aubmindlab/bert-base-arabertv2
pipeline_tag: text-classification
---
|
|
|
|
|
# AraBERTv2+D3Tok+CE Readability Model |
|
|
|
|
|
## Model description |
|
|
**AraBERTv2+D3Tok+CE** is a readability assessment model that was built by fine-tuning the **AraBERTv2** model with cross-entropy loss (**CE**). |
|
|
For the fine-tuning, we used the **D3Tok** input variant from [BAREC-Corpus-v1.0](https://huggingface.co/datasets/CAMeL-Lab/BAREC-Corpus-v1.0). |
|
|
The fine-tuning procedure and hyperparameters are described in our paper *"[A Large and Balanced Corpus for Fine-grained Arabic Readability Assessment](https://arxiv.org/abs/2502.13520)."*
|
|
|
|
|
## Intended uses |
|
|
You can use the AraBERTv2+D3Tok+CE model directly with the transformers `pipeline` API.

Note that your text must first be preprocessed into the D3Tok input variant using the preprocessing steps [here](https://github.com/CAMeL-Lab/barec_analyzer/tree/main).
|
|
|
|
|
## How to use |
|
|
To use the model: |
|
|
|
|
|
```python
from transformers import pipeline

readability = pipeline("text-classification", model="CAMeL-Lab/readability-arabertv2-d3tok-CE")

# Read sentences that have already been preprocessed into the D3Tok variant
with open("/PATH/TO/preprocessed_d3tok", "r") as f:
    sentences = f.read().split("\n")

# Run the pipeline once over all sentences, then map each predicted
# "LABEL_<i>" to a 1-based readability level
predictions = readability(sentences)
readability_levels = [int(pred["label"][len("LABEL_"):]) + 1 for pred in predictions]
```
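The `LABEL_<i>` parsing above can be isolated into a small helper. This is a minimal sketch (the function name `label_to_level` is our own), assuming the model's 0-based class indices map to the corpus's 1-based readability levels as in the snippet above:

```python
def label_to_level(label: str) -> int:
    # Hugging Face classifiers name classes "LABEL_<index>";
    # strip the prefix and shift the 0-based index to a 1-based level.
    return int(label[len("LABEL_"):]) + 1

print(label_to_level("LABEL_0"))   # -> 1
print(label_to_level("LABEL_18"))  # -> 19
```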
|
|
|
|
|
## Citation |
|
|
```bibtex |
|
|
@inproceedings{elmadani-etal-2025-readability, |
|
|
title = "A Large and Balanced Corpus for Fine-grained Arabic Readability Assessment", |
|
|
author = "Elmadani, Khalid N. and |
|
|
Habash, Nizar and |
|
|
Taha-Thomure, Hanada", |
|
|
booktitle = "Findings of the Association for Computational Linguistics: ACL 2025", |
|
|
year = "2025", |
|
|
address = "Vienna, Austria", |
|
|
publisher = "Association for Computational Linguistics" |
|
|
} |
|
|
``` |