---
language: ar
license: mit
library_name: transformers
tags:
- arabic
- authorship-attribution
- text-classification
- arabert
- literature
datasets:
- custom
metrics:
- accuracy
- f1
model-index:
- name: arabic-authorship-classification
  results:
  - task:
      type: text-classification
      name: Authorship Attribution
    metrics:
    - type: accuracy
      value: 0.7912
      name: Accuracy
    - type: f1
      value: 0.7023
      name: F1 Macro
    - type: f1
      value: 0.7891
      name: F1 Weighted
---
# Arabic Authorship Classification Model
## Model Description
This model is fine-tuned for Arabic authorship attribution and classifies texts among **21 authors**: 18 who wrote in Arabic and 3 represented by Arabic translations. Built on the AraBERT architecture, it reaches 79.12% accuracy and 70.23% macro F1 at identifying literary writing styles across classical and modern Arabic literature.
## Model Details
- **Model Type:** Text Classification
- **Base Model:** aubmindlab/bert-base-arabertv2
- **Language:** Arabic (ar)
- **Task:** Multi-class Authorship Attribution
- **Classes:** 21 authors
- **Parameters:** ~163M
- **Dataset Size:** 4,157 texts
## Performance
| Metric | Score |
|--------|-------|
| Accuracy | 79.12% |
| F1 Macro | 70.23% |
| F1 Micro | 79.12% |
| F1 Weighted | 78.91% |
| Training Loss | 0.3439 |
| Validation Loss | 0.7434 |
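
The averaged F1 scores in the table differ only in how per-class F1 values are combined. The following is a minimal plain-Python sketch on made-up toy labels (not this model's predictions); the function name `f1_report` is hypothetical. It also illustrates why micro F1 and accuracy coincide for single-label classification, as they do in the table.

```python
from collections import Counter


def f1_report(y_true, y_pred):
    """Accuracy plus macro, micro and weighted F1 for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)  # number of true samples per class
    per_class_f1 = {}
    tp_sum = fp_sum = fn_sum = 0
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        per_class_f1[label] = 2 * tp / denom if denom else 0.0
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    n = len(y_true)
    return {
        "accuracy": tp_sum / n,
        # Macro: every class counts equally, so small classes weigh as much as large ones.
        "f1_macro": sum(per_class_f1.values()) / len(labels),
        # Micro: pool all decisions; each error is one FP and one FN, so this equals accuracy.
        "f1_micro": 2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum),
        # Weighted: per-class F1 weighted by class support.
        "f1_weighted": sum(per_class_f1[l] * support[l] for l in labels) / n,
    }


# Toy, imbalanced example: class "a" is large, "b" and "c" are small.
y_true = ["a", "a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "a", "b", "b", "c", "c"]
report = f1_report(y_true, y_pred)
print(report)
```

When small classes are predicted less reliably than large ones, macro F1 falls below weighted F1, which is the same pattern as in the table above.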
## Supported Authors
The model identifies texts from these 21 authors:
**Arabic Literature:**
- حسن حنفي (Hassan Hanafi) - 548 samples
- عبد الغفار مكاوي (Abdul Ghaffar Makawi) - 396 samples
- نجيب محفوظ (Naguib Mahfouz) - 327 samples
- جُرجي زيدان (Jurji Zaydan) - 327 samples
- نوال السعداوي (Nawal El Saadawi) - 295 samples
- عباس محمود العقاد (Abbas Mahmoud al-Aqqad) - 267 samples
- محمد حسين هيكل (Mohamed Hussein Heikal) - 260 samples
- طه حسين (Taha Hussein) - 255 samples
- أحمد أمين (Ahmed Amin) - 246 samples
- أمين الريحاني (Ameen Rihani) - 142 samples
- فؤاد زكريا (Fouad Zakaria) - 125 samples
- يوسف إدريس (Yusuf Idris) - 120 samples
- سلامة موسى (Salama Moussa) - 119 samples
- ثروت أباظة (Tharwat Abaza) - 90 samples
- أحمد شوقي (Ahmed Shawqi) - 58 samples
- أحمد تيمور باشا (Ahmed Taymour Pasha) - 57 samples
- جبران خليل جبران (Khalil Gibran) - 30 samples
- كامل كيلاني (Kamel Kilani) - 25 samples
**Translated Literature:**
- ويليام شيكسبير (William Shakespeare) - 238 samples
- غوستاف لوبون (Gustave Le Bon) - 150 samples
- روبرت بار (Robert Barr) - 82 samples
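
The sample counts above are heavily skewed. As an illustration only, the sketch below computes the imbalance ratio and inverse-frequency ("balanced") class weights from those counts. This card does not say whether class weighting was used in training, so treat the weighting scheme and the variable names as assumptions.

```python
# Sample counts copied from the author list above (English transliterations as keys).
counts = {
    "Hassan Hanafi": 548, "Abdul Ghaffar Makawi": 396, "Naguib Mahfouz": 327,
    "Jurji Zaydan": 327, "Nawal El Saadawi": 295, "Abbas Mahmoud al-Aqqad": 267,
    "Mohamed Hussein Heikal": 260, "Taha Hussein": 255, "Ahmed Amin": 246,
    "Ameen Rihani": 142, "Fouad Zakaria": 125, "Yusuf Idris": 120,
    "Salama Moussa": 119, "Tharwat Abaza": 90, "Ahmed Shawqi": 58,
    "Ahmed Taymour Pasha": 57, "Khalil Gibran": 30, "Kamel Kilani": 25,
    "William Shakespeare": 238, "Gustave Le Bon": 150, "Robert Barr": 82,
}

total = sum(counts.values())
n_classes = len(counts)

# Ratio between the largest and the smallest class.
imbalance_ratio = max(counts.values()) / min(counts.values())

# Assumed scheme: weight_c = total / (n_classes * count_c), so rare authors get larger weights.
class_weights = {author: total / (n_classes * c) for author, c in counts.items()}

print(f"{n_classes} classes, {total} samples, imbalance ratio {imbalance_ratio:.2f}")
for author, w in sorted(class_weights.items(), key=lambda kv: kv[1], reverse=True)[:3]:
    print(f"{author}: weight {w:.2f}")
```

Weights like these could be passed, in label-index order, to `torch.nn.CrossEntropyLoss(weight=...)` when fine-tuning.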
## Usage
### Direct Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer ("your-username" is a placeholder for the repository owner)
model_id = "your-username/arabic-authorship-classification"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Predict
text = "النص العربي المراد تصنيفه"  # "The Arabic text to classify"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)

# Convert logits to probabilities and take the most likely class
probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
confidence, predicted_class = torch.max(probabilities, dim=-1)
class_id = predicted_class.item()

print(f"Predicted class: {class_id}")
# id2label holds generic LABEL_n names unless author names were saved in the config
print(f"Label: {model.config.id2label[class_id]}")
print(f"Confidence: {confidence.item():.4f}")
```
### Pipeline Usage
```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="your-username/arabic-authorship-classification",
    tokenizer="your-username/arabic-authorship-classification",
)
result = classifier("النص العربي للتصنيف")  # "The Arabic text to classify"
print(result)
```
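
The direct-usage snippet truncates input at 512 tokens, so anything past that point in a long passage is ignored. One possible workaround, which is not described in this card, is to split the token ids into overlapping windows, classify each window, and average the per-window probabilities. The helpers below are a plain-Python sketch of that idea; `sliding_windows`, `average_probs`, the window size of 510 (leaving room for two special tokens) and the stride are all assumptions.

```python
def sliding_windows(token_ids, window=510, stride=255):
    """Split a list of token ids into overlapping windows of at most `window` ids."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(token_ids) <= window:
        return [list(token_ids)]
    chunks = []
    start = 0
    while True:
        chunks.append(list(token_ids[start:start + window]))
        if start + window >= len(token_ids):
            break  # the last window reached the end of the text
        start += stride
    return chunks


def average_probs(prob_rows):
    """Average per-window probability vectors element-wise."""
    n = len(prob_rows)
    return [sum(col) / n for col in zip(*prob_rows)]


# Stand-in ids; in practice these would come from the tokenizer without special tokens.
ids = list(range(1200))
chunks = sliding_windows(ids)
print(len(chunks), [len(c) for c in chunks])

# Stand-in per-window probabilities for a 2-class toy case.
mean_probs = average_probs([[0.2, 0.8], [0.6, 0.4]])
print(mean_probs)
```

Averaging treats every window equally; weighting by window length is another reasonable choice.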
## Training Data
- **Size:** 4,157 Arabic text samples
- **Source:** Curated Arabic literary corpus
- **Genres:** Essays, novels, poetry, philosophical works
- **Period:** Classical to modern Arabic literature
- **Quality:** High-quality literary texts
## Training Procedure
### Training Hyperparameters
- **Base Model:** aubmindlab/bert-base-arabertv2
- **Max Length:** 512 tokens
- **Learning Rate:** 2e-5
- **Batch Size:** 8 (train), 16 (eval)
- **Epochs:** 150 (with early stopping)
- **Optimizer:** AdamW
- **Weight Decay:** 0.01
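
The card lists 150 epochs with early stopping but does not state the monitored metric or the patience. The sketch below shows a common patience-based rule on validation loss; the function name, `patience=3`, and the loss values are illustrative assumptions, not values from this training run. In the Transformers `Trainer`, comparable behavior is available through `EarlyStoppingCallback`.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    bad_epochs = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return None


# Made-up validation losses: the best value is reached at epoch 3.
losses = [1.2, 0.9, 0.8, 0.85, 0.82, 0.81, 0.9]
stop_at = early_stopping_epoch(losses, patience=3)
print(f"Training would stop after epoch {stop_at}")
```

With a rule like this, the 150-epoch setting acts as an upper bound rather than the number of epochs actually run.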
### Training Infrastructure
- **Hardware:** GPU-accelerated training
- **Framework:** PyTorch + Transformers
- **Mixed Precision:** Enabled (fp16)
## Evaluation
The model was evaluated on all 21 author classes:
- **Accuracy:** 79.12% for 21-class classification
- **Weighted vs. Macro F1:** Weighted F1 (78.91%) is close to accuracy, while macro F1 (70.23%) is about 9 points lower, which suggests weaker performance on authors with few samples
- **Generalization:** Validation loss (0.7434) is roughly twice the training loss (0.3439), which suggests some overfitting
## Limitations
- Performance may vary on non-literary Arabic texts
- Best suited for Modern Standard Arabic (MSA)
- May struggle with very short texts (<50 words)
- Not tested on dialectal Arabic varieties
- Limited to the 21 authors in training data
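
Some of these limitations can be flagged at inference time. The sketch below is one possible guard, not part of the released model: it warns on texts under 50 words (the threshold named above) and on low-confidence predictions, which may indicate an author outside the 21 training classes. The function name `screen_prediction` and the `min_confidence=0.5` cutoff are assumptions, and softmax confidence is not a calibrated probability.

```python
def screen_prediction(text, confidence, min_words=50, min_confidence=0.5):
    """Return a list of warnings for a prediction; an empty list means no flags."""
    warnings = []
    n_words = len(text.split())  # whitespace tokenization is a rough word count
    if n_words < min_words:
        warnings.append(f"short text ({n_words} words < {min_words})")
    if confidence < min_confidence:
        warnings.append(
            f"low confidence ({confidence:.2f} < {min_confidence}); "
            "the author may be outside the 21 supported classes"
        )
    return warnings


# A 4-word input with a low top-class probability triggers both warnings.
flags = screen_prediction("النص العربي المراد تصنيفه", confidence=0.31)
for flag in flags:
    print("warning:", flag)
```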
## Bias and Ethical Considerations
- Training data focuses on established literary figures
- May reflect historical and cultural biases in literary canon
- Gender representation varies across authors
- Consider fairness when applying to contemporary texts
## Citation
```bibtex
@misc{arabic-authorship-classification-2024,
title={Arabic Authorship Classification Model},
author={Sabari Nathan},
year={2024},
publisher={Hugging Face},
url={https://huggingface.co/your-username/arabic-authorship-classification}
}
```
## Model Card Authors
Sabari Nathan