---
language: ar
license: mit
library_name: transformers
tags:
- arabic
- authorship-attribution
- text-classification
- arabert
- literature
datasets:
- custom
metrics:
- accuracy
- f1
model-index:
- name: arabic-authorship-classification
  results:
  - task:
      type: text-classification
      name: Authorship Attribution
    metrics:
    - type: accuracy
      value: 0.7912
      name: Accuracy
    - type: f1
      value: 0.7023
      name: F1 Macro
    - type: f1
      value: 0.7891
      name: F1 Weighted
---

# Arabic Authorship Classification Model

## Model Description

This model is fine-tuned for Arabic authorship attribution: given an Arabic text, it predicts which of **21 authors** wrote it. The label set covers 18 authors who wrote in Arabic and 3 authors whose works appear in Arabic translation. Built on the AraBERT architecture, it reaches 79.12% accuracy and 70.23% macro F1 in identifying literary writing styles across classical and modern Arabic literature.

## Model Details

- **Model Type:** Text Classification
- **Base Model:** aubmindlab/bert-base-arabertv2
- **Language:** Arabic (ar)
- **Task:** Multi-class Authorship Attribution
- **Classes:** 21 authors
- **Parameters:** ~163M
- **Dataset Size:** 4,157 texts

## Performance

| Metric | Score |
|--------|-------|
| Accuracy | 79.12% |
| F1 Macro | 70.23% |
| F1 Micro | 79.12% |
| F1 Weighted | 78.91% |
| Training Loss | 0.3439 |
| Validation Loss | 0.7434 |

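
The three F1 variants in the table differ only in how per-class scores are combined. The sketch below is a minimal pure-Python illustration on made-up toy labels, not the evaluation code used for this model; the function name `f1_scores` and the toy data are assumptions. It shows why micro F1 equals accuracy for single-label classification and why macro F1 drops when a rare class is predicted poorly.

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Return (accuracy, macro, micro, weighted) F1 for single-label data."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    accuracy = sum(tp.values()) / len(y_true)
    # Macro: every class counts equally, regardless of size.
    macro = sum(per_class.values()) / len(labels)
    # Micro: pool all counts first, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Weighted: per-class F1 averaged by the number of true samples.
    weighted = sum(per_class[c] * support[c] for c in labels) / len(y_true)
    return accuracy, macro, micro, weighted

# Toy, made-up labels: class "A" is frequent, class "C" is rare and always missed.
y_true = ["A"] * 8 + ["B"] * 3 + ["C"] * 1
y_pred = ["A"] * 8 + ["B"] * 2 + ["A"] + ["B"]
accuracy, macro, micro, weighted = f1_scores(y_true, y_pred)
print(f"accuracy={accuracy:.4f} macro={macro:.4f} micro={micro:.4f} weighted={weighted:.4f}")
```

On the toy data, the missed rare class pulls macro F1 well below weighted F1, which is the same direction as the gap between 70.23% and 78.91% reported in the table.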
## Supported Authors

The model identifies texts from these 21 authors:

**Arabic Literature:**
- حسن حنفي (Hassan Hanafi) - 548 samples
- عبد الغفار مكاوي (Abdul Ghaffar Makawi) - 396 samples
- نجيب محفوظ (Naguib Mahfouz) - 327 samples
- جُرجي زيدان (Jurji Zaydan) - 327 samples
- نوال السعداوي (Nawal El Saadawi) - 295 samples
- عباس محمود العقاد (Abbas Mahmoud al-Aqqad) - 267 samples
- محمد حسين هيكل (Mohamed Hussein Heikal) - 260 samples
- طه حسين (Taha Hussein) - 255 samples
- أحمد أمين (Ahmed Amin) - 246 samples
- أمين الريحاني (Ameen Rihani) - 142 samples
- فؤاد زكريا (Fouad Zakaria) - 125 samples
- يوسف إدريس (Yusuf Idris) - 120 samples
- سلامة موسى (Salama Moussa) - 119 samples
- ثروت أباظة (Tharwat Abaza) - 90 samples
- أحمد شوقي (Ahmed Shawqi) - 58 samples
- أحمد تيمور باشا (Ahmed Taymour Pasha) - 57 samples
- جبران خليل جبران (Khalil Gibran) - 30 samples
- كامل كيلاني (Kamel Kilani) - 25 samples

**Translated Literature:**
- ويليام شيكسبير (William Shakespeare) - 238 samples
- غوستاف لوبون (Gustave Le Bon) - 150 samples
- روبرت بار (Robert Barr) - 82 samples

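
The sample counts range from 25 to 548 per author, so the classes are strongly imbalanced. This card does not say whether class weighting was used during training. As an illustration only, the sketch below computes inverse-frequency weights from the listed counts using the common `N / (K * n_c)` heuristic; the function name and the idea of applying such weights here are assumptions.

```python
# Sample counts copied from the author list in this card.
SAMPLE_COUNTS = {
    "Hassan Hanafi": 548,
    "Abdul Ghaffar Makawi": 396,
    "Naguib Mahfouz": 327,
    "Jurji Zaydan": 327,
    "Nawal El Saadawi": 295,
    "Abbas Mahmoud al-Aqqad": 267,
    "Mohamed Hussein Heikal": 260,
    "Taha Hussein": 255,
    "Ahmed Amin": 246,
    "Ameen Rihani": 142,
    "Fouad Zakaria": 125,
    "Yusuf Idris": 120,
    "Salama Moussa": 119,
    "Tharwat Abaza": 90,
    "Ahmed Shawqi": 58,
    "Ahmed Taymour Pasha": 57,
    "Khalil Gibran": 30,
    "Kamel Kilani": 25,
    "William Shakespeare": 238,
    "Gustave Le Bon": 150,
    "Robert Barr": 82,
}

def inverse_frequency_weights(counts):
    """Weight each class by N / (K * n_c): rare classes get larger weights."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {name: total / (num_classes * n) for name, n in counts.items()}

weights = inverse_frequency_weights(SAMPLE_COUNTS)

# Show the three most up-weighted (rarest) authors.
rarest = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:3]
for name, weight in rarest:
    print(f"{name}: {SAMPLE_COUNTS[name]} samples, weight {weight:.2f}")
```

Weights of this kind could be passed to a weighted cross-entropy loss when retraining; whether that would improve macro F1 for this model is untested.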
## Usage

### Direct Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the tokenizer and model ("your-username" is a placeholder for the repo owner)
model_id = "your-username/arabic-authorship-classification"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Predict; inputs longer than 512 tokens are truncated
text = "النص العربي المراد تصنيفه"  # "The Arabic text to classify"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)
    # Convert logits to class probabilities
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=-1)
    confidence = torch.max(predictions)

print(f"Predicted class: {predicted_class.item()}")
print(f"Confidence: {confidence.item():.4f}")
```

### Pipeline Usage

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="your-username/arabic-authorship-classification",
    tokenizer="your-username/arabic-authorship-classification",
)

result = classifier("النص العربي للتصنيف")  # "The Arabic text for classification"
print(result)
```

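
With `truncation=True` and `max_length=512`, anything past the first 512 tokens of a long text is ignored. One common workaround, sketched below as an assumption rather than something this model card prescribes, is to split the token ids into overlapping windows, classify each window, and average the class probabilities. The helper names, the 510-token window (leaving room for the two BERT special tokens), and the 128-token overlap are all illustrative choices.

```python
def chunk_token_ids(token_ids, max_len=510, stride=128):
    """Split token ids into windows of at most max_len that overlap by stride tokens."""
    if max_len <= stride:
        raise ValueError("max_len must be larger than stride")
    if len(token_ids) <= max_len:
        return [list(token_ids)]
    step = max_len - stride
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(list(token_ids[start:start + max_len]))
        if start + max_len >= len(token_ids):
            break  # the last window already reaches the end of the text
    return chunks

def average_probabilities(per_chunk_probs):
    """Average per-chunk class probabilities into a single distribution."""
    n_chunks = len(per_chunk_probs)
    n_classes = len(per_chunk_probs[0])
    return [sum(p[c] for p in per_chunk_probs) / n_chunks for c in range(n_classes)]

# Stand-in ids; real ids would come from the tokenizer without special tokens.
fake_ids = list(range(1200))
chunks = chunk_token_ids(fake_ids)
print([len(c) for c in chunks])

# Made-up two-class probabilities for two chunks, for illustration only.
print(average_probabilities([[0.2, 0.8], [0.6, 0.4]]))
```

Averaging treats every window equally; weighting by window length or taking a majority vote are alternatives, and none of them has been evaluated for this model.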
## Training Data

- **Size:** 4,157 Arabic text samples
- **Source:** Curated Arabic literary corpus
- **Genres:** Essays, novels, poetry, philosophical works
- **Period:** Classical to modern Arabic literature
- **Quality:** High-quality literary texts

## Training Procedure

### Training Hyperparameters

- **Base Model:** aubmindlab/bert-base-arabertv2
- **Max Length:** 512 tokens
- **Learning Rate:** 2e-5
- **Batch Size:** 8 (train), 16 (eval)
- **Epochs:** 150 (with early stopping)
- **Optimizer:** AdamW
- **Weight Decay:** 0.01

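
The 150-epoch setting is an upper bound; early stopping ends training once validation loss stops improving. The card does not state the patience or the monitored metric, so the sketch below uses a hypothetical patience of 3 epochs on validation loss and made-up loss values, purely to illustrate the rule. In the Transformers `Trainer`, the `EarlyStoppingCallback` class provides this kind of behavior.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None if it never triggers."""
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss      # new best validation loss
            bad_epochs = 0   # reset the patience counter
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return None

# Made-up validation losses, for illustration only.
losses = [1.90, 1.20, 0.95, 0.80, 0.82, 0.81, 0.85, 0.90]
stop_epoch = early_stopping_epoch(losses, patience=3)
print(f"Training would stop after epoch {stop_epoch}")
```

A larger patience tolerates noisier validation curves at the cost of extra epochs.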
### Training Infrastructure

- **Hardware:** GPU-accelerated training
- **Framework:** PyTorch + Transformers
- **Mixed Precision:** Enabled (fp16)

## Evaluation

Results across the 21 author classes:

- **Accuracy:** 79.12% on the 21-class task
- **Weighted vs. Macro F1:** Weighted F1 (78.91%) is close to accuracy, while macro F1 is lower (70.23%). Macro F1 weights every author equally, so the gap suggests weaker results on authors with few samples.
- **Generalization:** Validation loss (0.7434) is roughly twice the training loss (0.3439), which may indicate some overfitting.

## Limitations

- Performance may vary on non-literary Arabic texts
- Best suited for Modern Standard Arabic (MSA)
- May struggle with very short texts (<50 words)
- Not tested on dialectal Arabic
- Limited to the 21 authors in the training data; texts by any other author are still assigned to one of these classes

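
Because the classifier always returns one of its 21 labels, callers may want to abstain on inputs that fall outside these limits. The sketch below is one possible guard, not part of the model: it rejects texts under 50 words (the threshold mentioned in the list) and predictions below a confidence cutoff. The function name and the 0.6 cutoff are hypothetical, and softmax confidence is only a rough proxy for detecting unknown authors.

```python
def accept_prediction(text, probabilities, min_words=50, min_confidence=0.6):
    """Return (label_index, confidence), or (None, confidence) when abstaining."""
    confidence = max(probabilities)
    label_index = probabilities.index(confidence)
    too_short = len(text.split()) < min_words
    if too_short or confidence < min_confidence:
        return None, confidence
    return label_index, confidence

# Made-up probabilities over three classes, for illustration only.
long_text = " ".join(["كلمة"] * 60)  # 60 repetitions of the word "word"
print(accept_prediction(long_text, [0.1, 0.85, 0.05]))   # long and confident
print(accept_prediction("نص قصير", [0.1, 0.85, 0.05]))   # too short ("short text")
print(accept_prediction(long_text, [0.4, 0.35, 0.25]))   # low confidence
```

Any cutoff would need to be tuned on held-out data that includes texts by authors outside the 21 classes.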
## Bias and Ethical Considerations

- Training data focuses on established literary figures
- May reflect historical and cultural biases in the literary canon
- Gender representation varies across authors
- Consider fairness when applying to contemporary texts

## Citation

```bibtex
@misc{arabic-authorship-classification-2024,
  title={Arabic Authorship Classification Model},
  author={Sabari Nathan},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/your-username/arabic-authorship-classification}
}
```

## Model Card Authors

Sabari Nathan