|
|
--- |
|
|
library_name: transformers |
|
|
base_model: |
|
|
- microsoft/deberta-v3-xsmall |
|
|
tags: |
|
|
- generated_from_trainer |
|
|
model-index: |
|
|
- name: deberta-v3-xsmall-readability |
|
|
results: [] |
|
|
license: mit |
|
|
datasets: |
|
|
- agentlans/readability |
|
|
language: |
|
|
- en |
|
|
pipeline_tag: text-classification |
|
|
--- |
|
|
|
|
|
# English Text Readability Prediction |
|
|
|
|
|
This is a fine-tuned DeBERTa-v3-xsmall model that predicts the readability of English texts as a U.S. grade level.
|
|
|
|
|
Suitable for: |
|
|
- Assessing educational material complexity |
|
|
- Evaluating content readability for diverse audiences |
|
|
- Assisting writers in tailoring content to specific reading levels |
|
|
|
|
|
## Training Data |
|
|
|
|
|
The model was fine-tuned on the [agentlans/readability](https://huggingface.co/datasets/agentlans/readability) dataset, which contains paragraphs from four sources:
|
|
|
|
|
1. Hugging Face's FineWeb-Edu
|
|
2. Ronen Eldan's TinyStories |
|
|
3. Wikipedia-2023-11-embed-multilingual-v3 (English only) |
|
|
4. ArXiv Abstracts-2021 |
|
|
|
|
|
Each paragraph was annotated with six readability metrics that estimate the U.S. grade level required to comprehend it.
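
For context, the Flesch-Kincaid Grade Level is one classic formula of this kind. The sketch below is purely illustrative (see the dataset card for the exact metrics used; the regex-based syllable counter is a rough assumption):

```python
import re

def count_syllables(word):
    """Very rough syllable estimate: count runs of vowels (illustrative only)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    """Flesch-Kincaid Grade Level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59

print(round(flesch_kincaid_grade("One day, Tim's teddy bear was sad."), 1))
```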
|
|
|
|
|
## How to use |
|
|
|
|
|
```python |
|
|
from transformers import AutoTokenizer, AutoModelForSequenceClassification |
|
|
import torch |
|
|
|
|
|
model_name = "agentlans/deberta-v3-xsmall-readability"
|
|
|
|
|
# Load the tokenizer and model, then move the model to the GPU if one is available
|
|
tokenizer = AutoTokenizer.from_pretrained(model_name) |
|
|
model = AutoModelForSequenceClassification.from_pretrained(model_name) |
|
|
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
|
|
model = model.to(device) |
|
|
|
|
|
def readability(text): |
|
|
"""Processes the text using the model and returns its logits. |
|
|
In this case, it's reading grade level in years of education |
|
|
(the higher the number, the harder it is to read the text).""" |
|
|
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(device) |
|
|
with torch.no_grad(): |
|
|
logits = model(**inputs).logits.squeeze().cpu() |
|
|
return logits.tolist() |
|
|
|
|
|
# Example usage |
|
|
text = ["One day, Tim's teddy bear was sad. Tim did not know why his teddy bear was sad.", |
|
|
"A few years back, I decided it was time for me to take a break from my mundane routine and embark on an adventure.", |
|
|
"We also experimentally verify that simply scaling the pulse energy by 3/2 between linearly and circularly polarized pumping closely reproduces the soliton and dispersive wave dynamics."] |
|
|
result = readability(text) |
|
|
print([round(x, 1) for x in result])  # Estimated reading grades: [2.9, 9.8, 21.9]
|
|
``` |
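
Alternatively, you can call the model through the `pipeline` API. A minimal sketch (with a single-logit regression head, `function_to_apply="none"` returns the raw score, i.e. the grade-level estimate; the printed output is illustrative):

```python
from transformers import pipeline

# function_to_apply="none" keeps the raw regression output
# instead of passing it through softmax or sigmoid
scorer = pipeline(
    "text-classification",
    model="agentlans/deberta-v3-xsmall-readability",
    function_to_apply="none",
)

print(scorer("One day, Tim's teddy bear was sad."))
# e.g. [{'label': 'LABEL_0', 'score': 2.9}]
```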
|
|
|
|
|
<details> |
|
|
<summary>Performance metrics and training details</summary> |
|
|
|
|
|
## Performance Metrics |
|
|
|
|
|
On the evaluation set: |
|
|
- **Loss**: 1.0767 |
|
|
- **Mean Squared Error (MSE)**: 1.0767 (the model is trained with an MSE objective, so loss and MSE coincide)
|
|
|
|
|
## Training Procedure |
|
|
|
|
|
### Hyperparameters |
|
|
|
|
|
- Learning Rate: 5e-05 |
|
|
- Train Batch Size: 8 |
|
|
- Eval Batch Size: 8 |
|
|
- Seed: 42 |
|
|
- Optimizer: Adam (betas=(0.9, 0.999), epsilon=1e-08) |
|
|
- Learning Rate Scheduler: Linear |
|
|
- Number of Epochs: 3.0 |
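
As a rough sketch, these settings map onto `transformers` `TrainingArguments` as follows (the actual training script is not published; the commented dataset arguments are placeholders):

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-xsmall")
# num_labels=1 creates a single-output regression head trained with MSE loss
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-xsmall", num_labels=1
)

args = TrainingArguments(
    output_dir="deberta-v3-xsmall-readability",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3.0,
    lr_scheduler_type="linear",  # the default, listed here for clarity
    seed=42,                     # Adam betas/epsilon are the defaults above
)
# trainer = Trainer(model=model, args=args,
#                   train_dataset=...,  # tokenized train split (placeholder)
#                   eval_dataset=...)   # tokenized eval split (placeholder)
# trainer.train()
```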
|
|
|
|
|
### Framework Versions |
|
|
|
|
|
- Transformers: 4.44.2 |
|
|
- PyTorch: 2.2.2+cu121 |
|
|
- Datasets: 2.18.0 |
|
|
- Tokenizers: 0.19.1 |
|
|
|
|
|
</details> |
|
|
|
|
|
## Limitations |
|
|
|
|
|
- English only |
|
|
- Performance may vary for very long or very short texts; inputs beyond the model's maximum sequence length are truncated (see the chunking sketch after this list)
|
|
- Trained on general texts, so it is not optimized for specialized material such as children's books or medical documents
|
|
- Estimates reading difficulty only; it does not assess whether a text is coherent or appropriate for the reader
|
|
- Readability metrics in the literature differ considerably from one another, so the predicted grade levels should be treated as estimates
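
For long documents, one possible workaround (a sketch, not part of the released model) is to score the text in chunks and average the estimates, reusing the `readability` function defined above:

```python
def readability_long(text, chunk_words=200):
    """Rough workaround for long documents: split into fixed-size word
    chunks, score each chunk, and average the grade-level estimates."""
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    scores = readability(chunks)
    if isinstance(scores, float):  # a single chunk comes back as a scalar
        return scores
    return sum(scores) / len(scores)
```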
|
|
|
|
|
## Ethical Considerations |
|
|
|
|
|
- The model should not be the sole determinant for content suitability decisions |
|
|
- The writer or publisher should also consider the content, context, and reader expectations |
|
|
- The training data sources may encode social or cultural biases, which can influence the model's predictions