|
|
--- |
|
|
library_name: transformers |
|
|
base_model: |
|
|
- microsoft/deberta-v3-xsmall |
|
|
tags: |
|
|
- generated_from_trainer |
|
|
model-index: |
|
|
- name: deberta-v3-xsmall-readability |
|
|
results: [] |
|
|
license: mit |
|
|
datasets: |
|
|
- agentlans/readability |
|
|
language: |
|
|
- en |
|
|
pipeline_tag: text-classification |
|
|
--- |
|
|
|
|
|
# English Text Readability Prediction |
|
|
|
|
|
This is a fine-tuned DeBERTa-v3-xsmall model that predicts the readability of English texts as a U.S. grade level.
|
|
|
|
|
Suitable for: |
|
|
- Assessing educational material complexity |
|
|
- Evaluating content readability for diverse audiences |
|
|
- Assisting writers in tailoring content to specific reading levels |
|
|
|
|
|
## Training Data |
|
|
|
|
|
The model was fine-tuned on the [agentlans/readability](https://huggingface.co/datasets/agentlans/readability) dataset, which contains paragraphs from four sources:
|
|
|
|
|
1. Hugging Face's FineWeb-Edu
|
|
2. Ronen Eldan's TinyStories |
|
|
3. Wikipedia-2023-11-embed-multilingual-v3 (English only) |
|
|
4. ArXiv Abstracts-2021 |
|
|
|
|
|
Each paragraph was annotated with six readability metrics that estimate the U.S. grade level required to comprehend it.
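
For context, the Flesch-Kincaid Grade Level is one classic formula of this kind. The sketch below is purely illustrative (see the dataset card for the exact metrics used; the regex-based syllable counter is a rough assumption):

```python
import re

def count_syllables(word):
    """Very rough syllable estimate: count runs of vowels (illustrative only)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    """Flesch-Kincaid Grade Level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59

print(round(flesch_kincaid_grade("One day, Tim's teddy bear was sad."), 1))
```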
|
|
|
|
|
## How to use |
|
|
|
|
|
```python |
|
|
from transformers import AutoTokenizer, AutoModelForSequenceClassification |
|
|
import torch |
|
|
|
|
|
model_name = "agentlans/deberta-v3-xsmall-readability"
|
|
|
|
|
# Load the tokenizer and model, then move the model to the GPU if one is available
|
|
tokenizer = AutoTokenizer.from_pretrained(model_name) |
|
|
model = AutoModelForSequenceClassification.from_pretrained(model_name) |
|
|
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
|
|
model = model.to(device) |
|
|
|
|
|
def readability(text): |
|
|
"""Processes the text using the model and returns its logits. |
|
|
In this case, it's reading grade level in years of education |
|
|
(the higher the number, the harder it is to read the text).""" |
|
|
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(device) |
|
|
with torch.no_grad(): |
|
|
logits = model(**inputs).logits.squeeze().cpu() |
|
|
return logits.tolist() |
|
|
|
|
|
# Example usage |
|
|
text = ["One day, Tim's teddy bear was sad. Tim did not know why his teddy bear was sad.", |
|
|
"A few years back, I decided it was time for me to take a break from my mundane routine and embark on an adventure.", |
|
|
"We also experimentally verify that simply scaling the pulse energy by 3/2 between linearly and circularly polarized pumping closely reproduces the soliton and dispersive wave dynamics."] |
|
|
result = readability(text) |
|
|
print([round(x, 1) for x in result])  # Estimated reading grades: [2.9, 9.8, 21.9]
|
|
``` |
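
Alternatively, you can call the model through the `pipeline` API. A minimal sketch (with a single-logit regression head, `function_to_apply="none"` returns the raw score, i.e. the grade-level estimate; the printed output is illustrative):

```python
from transformers import pipeline

# function_to_apply="none" keeps the raw regression output
# instead of passing it through softmax or sigmoid
scorer = pipeline(
    "text-classification",
    model="agentlans/deberta-v3-xsmall-readability",
    function_to_apply="none",
)

print(scorer("One day, Tim's teddy bear was sad."))
# e.g. [{'label': 'LABEL_0', 'score': 2.9}]
```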
|
|
|
|
|
<details> |
|
|
<summary>Performance metrics and training details</summary> |
|
|
|
|
|
## Performance Metrics |
|
|
|
|
|
On the evaluation set: |
|
|
- **Loss**: 1.0767 |
|
|
- **Mean Squared Error (MSE)**: 1.0767 (the model is trained with an MSE objective, so loss and MSE coincide)
|
|
|
|
|
## Training Procedure |
|
|
|
|
|
### Hyperparameters |
|
|
|
|
|
- Learning Rate: 5e-05 |
|
|
- Train Batch Size: 8 |
|
|
- Eval Batch Size: 8 |
|
|
- Seed: 42 |
|
|
- Optimizer: Adam (betas=(0.9, 0.999), epsilon=1e-08) |
|
|
- Learning Rate Scheduler: Linear |
|
|
- Number of Epochs: 3.0 |
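
As a rough sketch, these settings map onto `transformers` `TrainingArguments` as follows (the actual training script is not published; the commented dataset arguments are placeholders):

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-xsmall")
# num_labels=1 creates a single-output regression head trained with MSE loss
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-xsmall", num_labels=1
)

args = TrainingArguments(
    output_dir="deberta-v3-xsmall-readability",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3.0,
    lr_scheduler_type="linear",  # the default, listed here for clarity
    seed=42,                     # Adam betas/epsilon are the defaults above
)
# trainer = Trainer(model=model, args=args,
#                   train_dataset=...,  # tokenized train split (placeholder)
#                   eval_dataset=...)   # tokenized eval split (placeholder)
# trainer.train()
```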
|
|
|
|
|
### Framework Versions |
|
|
|
|
|
- Transformers: 4.44.2 |
|
|
- PyTorch: 2.2.2+cu121 |
|
|
- Datasets: 2.18.0 |
|
|
- Tokenizers: 0.19.1 |
|
|
|
|
|
</details> |
|
|
|
|
|
## Limitations |
|
|
|
|
|
- English only |
|
|
- Performance may vary for very long or very short texts; inputs beyond the model's maximum sequence length are truncated (see the chunking sketch after this list)
|
|
- Trained on general texts, so it is not optimized for specialized material such as children's books or medical documents
|
|
- Estimates reading difficulty only; it does not assess whether a text is coherent or appropriate for the reader
|
|
- Readability metrics in the literature differ considerably from one another, so the predicted grade levels should be treated as estimates
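
For long documents, one possible workaround (a sketch, not part of the released model) is to score the text in chunks and average the estimates, reusing the `readability` function defined above:

```python
def readability_long(text, chunk_words=200):
    """Rough workaround for long documents: split into fixed-size word
    chunks, score each chunk, and average the grade-level estimates."""
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    scores = readability(chunks)
    if isinstance(scores, float):  # a single chunk comes back as a scalar
        return scores
    return sum(scores) / len(scores)
```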
|
|
|
|
|
## Ethical Considerations |
|
|
|
|
|
- The model should not be the sole determinant for content suitability decisions |
|
|
- The writer or publisher should also consider the content, context, and reader expectations |
|
|
- The training data sources may encode social or cultural biases, which can influence the model's predictions