---
language: ar
license: mit
library_name: transformers
tags:
- arabic
- authorship-attribution
- text-classification
- arabert
- literature
datasets:
- custom
metrics:
- accuracy
- f1
model-index:
- name: arabic-authorship-classification
  results:
  - task:
      type: text-classification
      name: Authorship Attribution
    metrics:
    - type: accuracy
      value: 0.7912
      name: Accuracy
    - type: f1
      value: 0.7023
      name: F1 Macro
    - type: f1
      value: 0.7891
      name: F1 Weighted
---

# Arabic Authorship Classification Model

## Model Description

This model is fine-tuned for Arabic authorship attribution and classifies a text as the work of one of **21 authors**: 18 who wrote in Arabic and 3 whose works appear in Arabic translation. Built on the AraBERT architecture (`aubmindlab/bert-base-arabertv2`), it reaches 79.12% accuracy and 70.23% macro F1 on a corpus spanning classical to modern Arabic literature.

## Model Details

- **Model Type:** Text Classification
- **Base Model:** aubmindlab/bert-base-arabertv2
- **Language:** Arabic (ar)
- **Task:** Multi-class Authorship Attribution
- **Classes:** 21 authors
- **Parameters:** ~163M
- **Dataset Size:** 4,157 texts

## Performance

| Metric | Score |
|--------|-------|
| Accuracy | 79.12% |
| F1 Macro | 70.23% |
| F1 Micro | 79.12% |
| F1 Weighted | 78.91% |
| Training Loss | 0.3439 |
| Validation Loss | 0.7434 |
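
The gap between macro and weighted F1 in the table follows from how each average treats class size. The sketch below uses only the standard library and made-up toy labels (not this model's predictions) to show the relationship: micro F1 equals accuracy in single-label classification, macro F1 weights every class equally, and weighted F1 weights each class by its support. The function name `f1_report` is hypothetical.

```python
from collections import Counter


def f1_report(y_true, y_pred):
    """Return (accuracy, micro, macro, weighted) F1 for single-label data."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[t] += 1  # t was the truth but was missed

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    accuracy = sum(tp.values()) / len(y_true)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(per_class.values()) / len(labels)
    weighted = sum(per_class[c] * support[c] for c in labels) / len(y_true)
    return accuracy, micro, macro, weighted


# Toy data: one frequent author (0) and one rare author (1).
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]
accuracy, micro, macro, weighted = f1_report(y_true, y_pred)
print(f"acc={accuracy:.3f} micro={micro:.3f} macro={macro:.3f} weighted={weighted:.3f}")
```

In the toy data the rare class is recognized only half the time, so macro F1 drops below weighted F1 while accuracy stays high, the same ordering seen in the table.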

## Supported Authors

The model identifies texts from these 21 authors:

**Arabic Literature:**
- حسن حنفي (Hassan Hanafi) - 548 samples
- عبد الغفار مكاوي (Abdul Ghaffar Makawi) - 396 samples
- نجيب محفوظ (Naguib Mahfouz) - 327 samples
- جُرجي زيدان (Jurji Zaydan) - 327 samples
- نوال السعداوي (Nawal El Saadawi) - 295 samples
- عباس محمود العقاد (Abbas Mahmoud al-Aqqad) - 267 samples
- محمد حسين هيكل (Mohamed Hussein Heikal) - 260 samples
- طه حسين (Taha Hussein) - 255 samples
- أحمد أمين (Ahmed Amin) - 246 samples
- أمين الريحاني (Ameen Rihani) - 142 samples
- فؤاد زكريا (Fouad Zakaria) - 125 samples
- يوسف إدريس (Yusuf Idris) - 120 samples
- سلامة موسى (Salama Moussa) - 119 samples
- ثروت أباظة (Tharwat Abaza) - 90 samples
- أحمد شوقي (Ahmed Shawqi) - 58 samples
- أحمد تيمور باشا (Ahmed Taymour Pasha) - 57 samples
- جبران خليل جبران (Khalil Gibran) - 30 samples
- كامل كيلاني (Kamel Kilani) - 25 samples

**Translated Literature:**
- ويليام شيكسبير (William Shakespeare) - 238 samples
- غوستاف لوبون (Gustave Le Bon) - 150 samples
- روبرت بار (Robert Barr) - 82 samples
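
The sample counts above are strongly imbalanced. As a rough illustration, the sketch below copies the counts from the list, checks that they sum to the stated dataset size, and computes inverse-frequency ("balanced") class weights. The card does not say whether class weighting was used in training; this is one common option, and the helper name `balanced_class_weights` is hypothetical.

```python
# Sample counts copied from the author list (English transliterations as keys).
SAMPLES_PER_AUTHOR = {
    "Hassan Hanafi": 548, "Abdul Ghaffar Makawi": 396, "Naguib Mahfouz": 327,
    "Jurji Zaydan": 327, "Nawal El Saadawi": 295, "Abbas Mahmoud al-Aqqad": 267,
    "Mohamed Hussein Heikal": 260, "Taha Hussein": 255, "Ahmed Amin": 246,
    "Ameen Rihani": 142, "Fouad Zakaria": 125, "Yusuf Idris": 120,
    "Salama Moussa": 119, "Tharwat Abaza": 90, "Ahmed Shawqi": 58,
    "Ahmed Taymour Pasha": 57, "Khalil Gibran": 30, "Kamel Kilani": 25,
    "William Shakespeare": 238, "Gustave Le Bon": 150, "Robert Barr": 82,
}


def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {name: total / (n_classes * count) for name, count in counts.items()}


weights = balanced_class_weights(SAMPLES_PER_AUTHOR)
total = sum(SAMPLES_PER_AUTHOR.values())
imbalance_ratio = max(SAMPLES_PER_AUTHOR.values()) / min(SAMPLES_PER_AUTHOR.values())

print(f"{len(SAMPLES_PER_AUTHOR)} authors, {total} samples")
print(f"largest/smallest class ratio: {imbalance_ratio:.1f}x")
for name in ("Hassan Hanafi", "Kamel Kilani"):
    print(f"{name}: weight {weights[name]:.2f}")
```

Weights of this form could be passed to a weighted cross-entropy loss so that errors on rarely seen authors count for more.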

## Usage

### Direct Usage

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Replace the placeholder with the actual Hub repository ID
model_id = "your-username/arabic-authorship-classification"

# Load the tokenizer and the fine-tuned classifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Tokenize the input; anything beyond 512 tokens is truncated
text = "النص العربي المراد تصنيفه"  # placeholder Arabic text to classify
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    logits = model(**inputs).logits

# Convert logits to probabilities and pick the most likely class
probabilities = torch.softmax(logits, dim=-1)
confidence, predicted_class = probabilities.max(dim=-1)
class_id = predicted_class.item()

# id2label comes from the model config (generic LABEL_n names if none were saved)
print(f"Predicted class: {class_id} ({model.config.id2label[class_id]})")
print(f"Confidence: {confidence.item():.4f}")
```

### Pipeline Usage

```python
from transformers import pipeline

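# The tokenizer is loaded from the same repository by default
classifier = pipeline(
    "text-classification",
    model="your-username/arabic-authorship-classification",
)

# truncation=True keeps long inputs within the model's 512-token limit
result = classifier("النص العربي للتصنيف", truncation=True)
print(result)  # a list with one dict containing "label" and "score"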
```
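
The direct-usage snippet truncates input at 512 tokens, so the rest of a longer passage is ignored. One common workaround, which this card does not describe and which is shown only as an assumption, is to split the token IDs into overlapping windows, classify each window, and average the probabilities. The helper names and the defaults `max_tokens=510` and `stride=128` below are hypothetical.

```python
def sliding_windows(token_ids, max_tokens=510, stride=128):
    """Split token IDs into overlapping windows of at most max_tokens.

    max_tokens=510 leaves room for the [CLS] and [SEP] tokens of a 512-token
    BERT input; stride is the number of tokens shared by consecutive windows.
    """
    if not 0 <= stride < max_tokens:
        raise ValueError("stride must satisfy 0 <= stride < max_tokens")
    step = max_tokens - stride
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_tokens])
        if start + max_tokens >= len(token_ids):
            break  # this window already reaches the end of the text
    return windows


def average_probabilities(per_window_probs):
    """Average per-window class probabilities into one distribution."""
    n = len(per_window_probs)
    return [sum(column) / n for column in zip(*per_window_probs)]


# Stand-in token IDs; in practice these would come from the tokenizer.
fake_ids = list(range(1000))
windows = sliding_windows(fake_ids)
print([len(w) for w in windows])

# Stand-in probabilities for two windows and two classes.
doc_probs = average_probabilities([[0.2, 0.8], [0.6, 0.4]])
print(doc_probs)
```

Averaging treats every window equally; weighting by window length or taking a majority vote are alternatives worth comparing on held-out data.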

## Training Data

- **Size:** 4,157 Arabic text samples
- **Source:** Curated Arabic literary corpus
- **Genres:** Essays, novels, poetry, philosophical works
- **Period:** Classical to modern Arabic literature
- **Quality:** High-quality literary texts

## Training Procedure

### Training Hyperparameters

- **Base Model:** aubmindlab/bert-base-arabertv2
- **Max Length:** 512 tokens
- **Learning Rate:** 2e-5
- **Batch Size:** 8 (train), 16 (eval)
- **Epochs:** up to 150 (with early stopping)
- **Optimizer:** AdamW
- **Weight Decay:** 0.01
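
The card lists early stopping but not the monitored metric or the patience. A minimal sketch of patience-based early stopping on validation loss follows; the `EarlyStopping` class, `patience=3`, and the loss values are all illustrative assumptions, not the settings used for this model.

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs in a row."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


MAX_EPOCHS = 150  # upper bound on epochs, from the hyperparameter list

# Made-up validation losses, for illustration only.
fake_val_losses = [1.20, 0.95, 0.80, 0.78, 0.79, 0.81, 0.80, 0.83]

stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(fake_val_losses[:MAX_EPOCHS], start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break

print(f"stopped at epoch {stopped_at}, best validation loss {stopper.best_loss}")
```

With the Transformers `Trainer`, the same behavior is typically obtained through `EarlyStoppingCallback` and its `early_stopping_patience` argument rather than a hand-written loop.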

### Training Infrastructure

- **Hardware:** GPU-accelerated training
- **Framework:** PyTorch + Transformers
- **Mixed Precision:** Enabled (fp16)

## Evaluation

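The model reaches 79.12% accuracy on the 21-class task, but performance is uneven across authors:

- **Accuracy:** 79.12% for 21-class classification
- **Weighted vs. Macro F1:** Weighted F1 (78.91%) is dominated by the well-represented authors; the lower macro F1 (70.23%) indicates weaker results on authors with few samples
- **Generalization:** Validation loss (0.7434) is roughly twice the training loss (0.3439), a gap that may indicate some overfitting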
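
To put the 79.12% accuracy in context, it helps to compare it with two trivial baselines: guessing uniformly among 21 authors, and always predicting the most frequent author. The sketch below uses the full-dataset counts from this card; the class distribution of the evaluation split is not stated, so treating it as similar is an assumption.

```python
TOTAL_SAMPLES = 4157       # dataset size reported in the card
LARGEST_CLASS = 548        # samples for the most frequent author (Hassan Hanafi)
NUM_CLASSES = 21
REPORTED_ACCURACY = 0.7912

# Expected accuracy of guessing uniformly at random among all authors
uniform_baseline = 1 / NUM_CLASSES

# Accuracy of always predicting the most frequent author
# (assumes the evaluation split mirrors the full-dataset proportions)
majority_baseline = LARGEST_CLASS / TOTAL_SAMPLES

print(f"uniform guess:  {uniform_baseline:.2%}")
print(f"majority class: {majority_baseline:.2%}")
print(f"reported:       {REPORTED_ACCURACY:.2%}")
```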

## Limitations

- Performance may vary on non-literary Arabic texts
- Best suited for Modern Standard Arabic (MSA)
- May struggle with very short texts (<50 words)
- Not tested on dialectal Arabic varieties
- Limited to the 21 authors in training data

## Bias and Ethical Considerations

- Training data focuses on established literary figures
- May reflect historical and cultural biases in literary canon
- Gender representation varies across authors
- Consider fairness when applying to contemporary texts

## Citation

```bibtex
@misc{arabic-authorship-classification-2024,
  title={Arabic Authorship Classification Model},
  author={Sabari Nathan},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/your-username/arabic-authorship-classification}
}
```

## Model Card Authors

Sabari Nathan