---
language: ar
license: apache-2.0
tags:
- arabic
- regression
- arabertv02
- scoring
- education
datasets:
- AraScore
metrics:
- mse
- rmse
- mae
- r2
pipeline_tag: text-classification
library_name: transformers
---

# Arabic Text Scoring Regression Model

## Model Description

This model is fine-tuned from [AraBERTv02](https://huggingface.co/aubmindlab/bert-base-arabertv02) to score Arabic text answers. Given an Arabic text response, it predicts a continuous score.

## Training Data

The model was trained on the AraScore dataset, which contains Arabic text answers with corresponding scores.

## Metrics

The model is evaluated with the following regression metrics:

- MSE (Mean Squared Error)
- RMSE (Root Mean Squared Error)
- MAE (Mean Absolute Error)
- R² (R-squared)

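To make the four metrics concrete, here is a minimal sketch of how they can be computed from gold scores and predicted scores. The helper name `regression_metrics` and the sample numbers are made up for illustration; they are not results for this model and not necessarily how the original evaluation was run.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R² for two equal-length lists of scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    mse = ss_res / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    # R² is undefined when all gold scores are identical
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": r2}

# Hypothetical gold and predicted scores, for illustration only
gold = [1.0, 2.0, 3.0, 4.0]
pred = [1.5, 2.0, 2.5, 4.0]
metrics = regression_metrics(gold, pred)
print(metrics)
```

Lower MSE, RMSE and MAE mean predictions sit closer to the gold scores, while R² closer to 1 means the model explains more of the variance in those scores.
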
## Usage

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
import re

# Load model and tokenizer
model_name = "kenzykhaled/arabic-answer-scoring"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Move the model to the appropriate device once (GPU if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

# Function to preprocess Arabic text
def preprocess_arabic_text(text):
    if not isinstance(text, str):
        return ""

    # Remove diacritics (tashkeel, تشكيل): U+064B to U+065F plus U+0670
    text = re.sub(r'[\u064B-\u065F\u0670]', '', text)

    # Normalize Arabic letters
    text = re.sub('[إأآا]', 'ا', text)  # Normalize Alif forms
    text = re.sub('ى', 'ي', text)  # Normalize Yaa
    text = re.sub('ة', 'ه', text)  # Normalize Taa Marbouta

    # Remove non-Arabic characters (outside U+0600 to U+06FF) except whitespace
    text = re.sub(r'[^\u0600-\u06FF\s]', '', text)

    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Define prediction function
def predict_score(text):
    # Preprocess and tokenize
    processed_text = preprocess_arabic_text(text)
    inputs = tokenizer(processed_text, return_tensors="pt", padding=True, truncation=True, max_length=256)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Predict without tracking gradients
    with torch.no_grad():
        outputs = model(**inputs)

    # The regression head returns a single value per input
    score = outputs.logits.item()

    return score

# Example usage
sample_text = "هذه إجابة نموذجية باللغة العربية."
score = predict_score(sample_text)
print(f"Predicted score: {score:.2f}")
```

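The preprocessing step is lossy, so it helps to see what it does to an input before the text reaches the tokenizer. The sketch below reproduces the normalization on its own, using only the standard library. It assumes the intended character ranges are U+064B to U+065F plus U+0670 for diacritics and U+0600 to U+06FF for the Arabic block; the name `normalize_arabic` and the sample strings are illustrative only.

```python
import re

# Assumed ranges, written as escapes so they survive copy and paste
DIACRITICS = re.compile(r"[\u064B-\u065F\u0670]")  # tashkeel marks and superscript alif
NON_ARABIC = re.compile(r"[^\u0600-\u06FF\s]")     # anything outside the Arabic block

def normalize_arabic(text: str) -> str:
    text = DIACRITICS.sub("", text)
    text = re.sub("[\u0625\u0623\u0622\u0627]", "\u0627", text)  # alif variants -> bare alif
    text = text.replace("\u0649", "\u064A")  # alif maqsura -> yaa
    text = text.replace("\u0629", "\u0647")  # taa marbouta -> haa
    text = NON_ARABIC.sub("", text)          # drops Latin letters, ASCII digits, ASCII punctuation
    return re.sub(r"\s+", " ", text).strip()

# "madrasatun" with full tashkeel, built from code points
vowelled = "\u0645\u064E\u062F\u0652\u0631\u064E\u0633\u064E\u0629\u064C"
# An Arabic phrase with hamza forms, a double space, punctuation and Latin text
mixed = "\u0623\u062D\u0645\u062F  \u0625\u0644\u0649! abc"

print(normalize_arabic(vowelled))
print(normalize_arabic(mixed))
```

Because Latin letters, ASCII digits and ASCII punctuation are removed, an answer that relies on them (for example a numeric answer written with ASCII digits) reaches the model with that content stripped.
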
## Limitations

- The model is optimized for educational answer scoring and may not perform well on other types of text.
- The model works best with text similar to that in the training data.

## Citation

If you use this model, please cite:

```
@misc{arabic-scoring-model,
  author = {Your Name},
  title = {Arabic Text Answer Scoring Model},
  year = {2025},
  publisher = {Hugging Face}
}
```