Commit 5bee3a1 (verified) by sabaridsnfuji · Parent: 4897273

Update README.md

Files changed (1): README.md (+196 −3)
---
language: ar
license: mit
library_name: transformers
tags:
- arabic
- authorship-attribution
- text-classification
- arabert
- literature
datasets:
- custom
metrics:
- accuracy
- f1
model-index:
- name: arabic-authorship-classification
  results:
  - task:
      type: text-classification
      name: Authorship Attribution
    metrics:
    - type: accuracy
      value: 0.7912
      name: Accuracy
    - type: f1
      value: 0.7023
      name: F1 Macro
    - type: f1
      value: 0.7891
      name: F1 Weighted
---

# Arabic Authorship Classification Model

## Model Description

This model is fine-tuned for Arabic authorship attribution: it classifies texts among **21 distinguished Arabic authors**. Built on the AraBERT architecture, it demonstrates strong performance in identifying literary writing styles across classical and modern Arabic literature.

## Model Details

- **Model Type:** Text Classification
- **Base Model:** aubmindlab/bert-base-arabertv2
- **Language:** Arabic (ar)
- **Task:** Multi-class Authorship Attribution
- **Classes:** 21 authors
- **Parameters:** ~163M
- **Dataset Size:** 4,157 texts

## Performance

| Metric | Score |
|--------|-------|
| Accuracy | 79.12% |
| F1 Macro | 70.23% |
| F1 Micro | 79.12% |
| F1 Weighted | 78.91% |
| Training Loss | 0.3439 |
| Validation Loss | 0.7434 |

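The gap between macro F1 (70.23%) and weighted F1 (78.91%) reflects class imbalance: macro F1 averages per-class scores equally, while weighted F1 scales each class by its sample count. A toy illustration of the difference, using hypothetical per-class scores rather than the model's actual numbers:

```python
# Hypothetical per-class F1 scores and support counts (for illustration only).
per_class_f1 = {"author_a": 0.90, "author_b": 0.85, "author_c": 0.40}
support = {"author_a": 500, "author_b": 300, "author_c": 30}

# Macro F1: unweighted mean over classes; a rare, hard class drags it down.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

# Weighted F1: mean weighted by class support; dominated by frequent classes.
total = sum(support.values())
weighted_f1 = sum(per_class_f1[c] * support[c] / total for c in per_class_f1)

print(f"macro:    {macro_f1:.4f}")    # 0.7167
print(f"weighted: {weighted_f1:.4f}") # 0.8639
```

The same pattern holds for this model: its rarest author classes pull the macro average below the weighted one.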
## Supported Authors

The model identifies texts from these 21 authors:

**Arabic Literature:**
- حسن حنفي (Hassan Hanafi) - 548 samples
- عبد الغفار مكاوي (Abdul Ghaffar Makawi) - 396 samples
- نجيب محفوظ (Naguib Mahfouz) - 327 samples
- جُرجي زيدان (Jurji Zaydan) - 327 samples
- نوال السعداوي (Nawal El Saadawi) - 295 samples
- عباس محمود العقاد (Abbas Mahmoud al-Aqqad) - 267 samples
- محمد حسين هيكل (Mohamed Hussein Heikal) - 260 samples
- طه حسين (Taha Hussein) - 255 samples
- أحمد أمين (Ahmed Amin) - 246 samples
- أمين الريحاني (Ameen Rihani) - 142 samples
- فؤاد زكريا (Fouad Zakaria) - 125 samples
- يوسف إدريس (Yusuf Idris) - 120 samples
- سلامة موسى (Salama Moussa) - 119 samples
- ثروت أباظة (Tharwat Abaza) - 90 samples
- أحمد شوقي (Ahmed Shawqi) - 58 samples
- أحمد تيمور باشا (Ahmed Taymour Pasha) - 57 samples
- جبران خليل جبران (Khalil Gibran) - 30 samples
- كامل كيلاني (Kamel Kilani) - 25 samples

**Translated Literature:**
- ويليام شيكسبير (William Shakespeare) - 238 samples
- غوستاف لوبون (Gustave Le Bon) - 150 samples
- روبرت بار (Robert Barr) - 82 samples

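The sample counts above are heavily skewed, from 548 texts for the largest class down to 25 for the smallest, roughly a 22:1 ratio, which helps explain the macro/weighted F1 gap. A quick way to check the totals (counts copied from the list above, keyed by transliterated names for brevity):

```python
# Per-author sample counts, copied from the list above.
counts = {
    "Hassan Hanafi": 548, "Abdul Ghaffar Makawi": 396, "Naguib Mahfouz": 327,
    "Jurji Zaydan": 327, "Nawal El Saadawi": 295, "Abbas Mahmoud al-Aqqad": 267,
    "Mohamed Hussein Heikal": 260, "Taha Hussein": 255, "Ahmed Amin": 246,
    "William Shakespeare": 238, "Gustave Le Bon": 150, "Ameen Rihani": 142,
    "Fouad Zakaria": 125, "Yusuf Idris": 120, "Salama Moussa": 119,
    "Tharwat Abaza": 90, "Robert Barr": 82, "Ahmed Shawqi": 58,
    "Ahmed Taymour Pasha": 57, "Khalil Gibran": 30, "Kamel Kilani": 25,
}

total = sum(counts.values())    # 4157, matching the dataset size above
largest = max(counts.values())  # 548
smallest = min(counts.values()) # 25
print(f"total={total}, imbalance={largest / smallest:.1f}:1")
```
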
## Usage

### Direct Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model
tokenizer = AutoTokenizer.from_pretrained("your-username/arabic-authorship-classification")
model = AutoModelForSequenceClassification.from_pretrained("your-username/arabic-authorship-classification")

# Predict
text = "النص العربي المراد تصنيفه"  # "The Arabic text to classify"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=-1)
    confidence = torch.max(predictions)

print(f"Predicted class: {predicted_class.item()}")
print(f"Confidence: {confidence:.4f}")
```

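Because the call above truncates inputs at 512 tokens, long documents lose most of their content. A common workaround (an assumption, not something the released model ships with) is to split the document into overlapping chunks, classify each, and average the per-chunk probabilities before the argmax. A minimal, tokenizer-free sketch of the chunking step, splitting on words:

```python
def chunk_words(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into overlapping word chunks.

    chunk_size and overlap are in words, picked so each chunk stays safely
    under the 512-token limit after subword tokenization; both values are
    illustrative, not tuned.
    """
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, max(len(words) - overlap, 1), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
    return chunks

# Each chunk would then go through the tokenizer/model loop shown above,
# and the resulting probability vectors would be averaged before argmax.
doc = "كلمة " * 1000  # a 1,000-word stand-in document
print(len(chunk_words(doc)))  # 3
```
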
### Pipeline Usage

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="your-username/arabic-authorship-classification",
    tokenizer="your-username/arabic-authorship-classification",
)

result = classifier("النص العربي للتصنيف")  # "The Arabic text to classify"
print(result)
```

## Training Data

- **Size:** 4,157 Arabic text samples
- **Source:** Curated Arabic literary corpus
- **Genres:** Essays, novels, poetry, philosophical works
- **Period:** Classical to modern Arabic literature
- **Quality:** High-quality literary texts

## Training Procedure

### Training Hyperparameters

- **Base Model:** aubmindlab/bert-base-arabertv2
- **Max Length:** 512 tokens
- **Learning Rate:** 2e-5
- **Batch Size:** 8 (train), 16 (eval)
- **Epochs:** 150 (with early stopping)
- **Optimizer:** AdamW
- **Weight Decay:** 0.01

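The card lists 150 epochs with early stopping but not the patience used. The sketch below shows the usual patience-based rule on validation loss; the patience value is an assumption, and in practice this is handled by a trainer callback rather than hand-rolled code:

```python
def should_stop(val_losses: list[float], patience: int = 3) -> bool:
    """Stop when the best validation loss hasn't improved for `patience` epochs.

    patience=3 is an illustrative value; the card does not state the one used.
    """
    if len(val_losses) <= patience:
        return False
    best_epoch = val_losses.index(min(val_losses))
    return len(val_losses) - 1 - best_epoch >= patience

# Validation loss improves, then plateaus: stop after 3 non-improving epochs.
losses = [1.2, 0.9, 0.8, 0.82, 0.81, 0.83]
print(should_stop(losses))  # True: best loss was 3 epochs ago
```

This is why training can be configured for 150 epochs yet finish much earlier once validation loss plateaus.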
### Training Infrastructure

- **Hardware:** GPU-accelerated training
- **Framework:** PyTorch + Transformers
- **Mixed Precision:** Enabled (fp16)

## Evaluation

The model performs consistently across the 21 author classes:

- **Balanced Performance:** Weighted F1 (78.91%) shows good performance across all authors
- **High Accuracy:** 79.12% accuracy on a 21-class classification task
- **Robust Generalization:** Reasonable gap between training and validation loss

## Limitations

- Performance may vary on non-literary Arabic texts
- Best suited for Modern Standard Arabic (MSA)
- May struggle with very short texts (<50 words)
- Not tested on dialectal Arabic varieties
- Limited to the 21 authors in the training data

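Given the short-text limitation above, one practical safeguard (an assumption, not part of the model) is to check input length before trusting a prediction:

```python
MIN_WORDS = 50  # threshold taken from the limitation above

def long_enough(text: str, min_words: int = MIN_WORDS) -> bool:
    """Return True when the input meets the suggested minimum word count."""
    return len(text.split()) >= min_words

print(long_enough("نص قصير جدا"))  # False: only 3 words
print(long_enough("كلمة " * 60))   # True: 60 words
```
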
## Bias and Ethical Considerations

- Training data focuses on established literary figures
- May reflect historical and cultural biases in the literary canon
- Gender representation varies across authors
- Consider fairness when applying the model to contemporary texts

## Citation

```bibtex
@misc{arabic-authorship-classification-2024,
  title={Arabic Authorship Classification Model},
  author={Your Name},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/your-username/arabic-authorship-classification}
}
```

## Model Card Authors

[Your Name]

## Model Card Contact

[Your Contact Information]