Belall87 committed
Commit ee97f7f · verified · 1 Parent(s): 530eb07

Upload folder using huggingface_hub

Files changed (7)
  1. README.md +239 -0
  2. config.json +25 -0
  3. model.safetensors +3 -0
  4. special_tokens_map.json +37 -0
  5. tokenizer.json +0 -0
  6. tokenizer_config.json +87 -0
  7. vocab.txt +0 -0
README.md CHANGED
@@ -1,3 +1,242 @@
---
language: ar
license: mit
tags:
- sentiment-analysis
- arabic
- arabert
- text-classification
- pytorch
base_model: aubmindlab/bert-base-arabertv02
datasets:
- custom
metrics:
- accuracy
- f1
model-index:
- name: arabert-arabic-sentiment
  results:
  - task:
      type: text-classification
      name: Sentiment Analysis
    dataset:
      type: custom
      name: Arabic Sentiment Dataset
    metrics:
    - type: accuracy
      value: 0.85
      name: Accuracy
    - type: f1
      value: 0.85
      name: F1 Score
library_name: transformers
pipeline_tag: text-classification
widget:
- text: "هذا المنتج رائع جداً وأنصح به بشدة"
  example_title: "Positive Example"
- text: "تجربة سيئة جداً ولن أشتري مرة أخرى"
  example_title: "Negative Example"
- text: "الخدمة ممتازة والتوصيل سريع"
  example_title: "Positive Service"
---

# AraBERT for Arabic Sentiment Analysis

Fine-tuned [AraBERT v0.2](https://huggingface.co/aubmindlab/bert-base-arabertv02) for binary sentiment classification on Arabic text.

## Model Description

This model is a fine-tuned version of `aubmindlab/bert-base-arabertv02` on a custom Arabic sentiment dataset. It classifies Arabic text as expressing positive or negative sentiment.

### Key Features
- 🎯 **85% accuracy** on Arabic sentiment classification
- 🌍 Pre-trained on a **large Arabic corpus** (AraBERT v0.2)
- ⚡ **Fast inference** with the transformer architecture
- 🔄 **Transfer learning** from a 110M-parameter BERT model

## Intended Uses & Limitations

### Intended Uses
- Arabic social media sentiment analysis
- Product review classification
- Customer feedback analysis
- Market research on Arabic content

### Limitations
- Binary classification only (positive/negative)
- Trained on a specific domain (may need fine-tuning for other domains)
- Arabic text only (Modern Standard Arabic and dialects)
- May not perform well on very short texts (<5 words)

## How to Use

### Quick Start with Pipeline
```python
from transformers import pipeline

# Load the sentiment analysis pipeline
classifier = pipeline(
    "sentiment-analysis",
    model="Belall87/arabert-arabic-sentiment"
)

# Classify text ("This product is really great")
result = classifier("هذا المنتج رائع جداً")
print(result)
# Output: [{'label': 'POSITIVE', 'score': 0.95}]
```

### Manual Loading
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(
    "Belall87/arabert-arabic-sentiment"
)
tokenizer = AutoTokenizer.from_pretrained(
    "Belall87/arabert-arabic-sentiment"
)

# Prepare input ("The service is excellent and the staff are helpful")
text = "الخدمة ممتازة والموظفون متعاونون"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

# Get prediction
with torch.no_grad():
    outputs = model(**inputs)
prediction = torch.argmax(outputs.logits, dim=-1)
probabilities = torch.softmax(outputs.logits, dim=-1)

sentiment = "Positive" if prediction.item() == 1 else "Negative"
confidence = probabilities[0][prediction].item()

print(f"Sentiment: {sentiment} (Confidence: {confidence:.2%})")
```

### Batch Processing
```python
texts = [
    "المطعم نظيف والطعام لذيذ",  # "The restaurant is clean and the food is delicious"
    "الخدمة سيئة جداً",  # "The service is very bad"
    "منتج عادي لا بأس به"  # "An ordinary product, it's okay"
]

# Reuse the pipeline from the quick-start example for batch inference
results = classifier(texts)
for text, result in zip(texts, results):
    print(f"{text}: {result['label']} ({result['score']:.2%})")
```

## Training Details

### Training Data

- **Dataset Size:** ~4,200 Arabic text samples
- **Train/Val/Test Split:** 72% / 8% / 20%
- **Data Sources:** Arabic tweets, reviews, and comments
- **Preprocessing:** Text normalization, diacritics removal, character standardization

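The preprocessing code itself is not part of this upload, so the following is only a minimal sketch of the steps named above, assuming the common Alef/Alef-Maqsura/Ta-Marbuta normalization conventions; the `normalize_arabic` helper is hypothetical. (AraBERT also ships an `ArabertPreprocessor` in the `arabert` package, which may have been used instead.)

```python
import re

# Hypothetical sketch of the preprocessing described above; the actual
# training pipeline may differ in its details.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")  # harakat, dagger alef, tatweel

def normalize_arabic(text: str) -> str:
    text = DIACRITICS.sub("", text)           # remove diacritics / tatweel
    text = re.sub("[إأآ]", "ا", text)         # standardize Alef variants
    text = re.sub("ى", "ي", text)             # Alef Maqsura -> Ya
    text = re.sub("ة", "ه", text)             # Ta Marbuta -> Ha
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return text

print(normalize_arabic("هذا المنتجُ رائعٌ جداً"))  # -> "هذا المنتج رائع جدا"
```
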
### Training Procedure

#### Hyperparameters
```text
Learning Rate: 2e-5
Batch Size: 8 (train), 16 (eval)
Epochs: 3
Optimizer: AdamW
Weight Decay: 0.01
LR Scheduler: Cosine with 5% warmup
Max Sequence Length: 256
```

#### Training Configuration

- **Framework:** PyTorch with Hugging Face Transformers
- **Base Model:** aubmindlab/bert-base-arabertv02
- **Fine-tuning Strategy:** Full model fine-tuning
- **Early Stopping:** Patience of 3 epochs on validation accuracy
- **Mixed Precision:** FP16 (if GPU available)

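The training script is likewise not included here; the sketch below shows how the hyperparameters and configuration above could map onto the Transformers `Trainer` API. Dataset loading is elided, and the `compute_metrics` helper and early-stopping wiring are assumptions rather than code from this repository.

```python
import numpy as np
import torch
from sklearn.metrics import accuracy_score, f1_score
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "aubmindlab/bert-base-arabertv02", num_labels=2
)
tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv02")

# Tokenized train/validation splits (built with tokenizer(...,
# truncation=True, max_length=256)) would go here; elided in this sketch.
train_dataset, eval_dataset = ..., ...

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, preds),
            "f1": f1_score(labels, preds)}

args = TrainingArguments(
    output_dir="arabert-arabic-sentiment",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,               # AdamW is the Trainer default optimizer
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,               # cosine schedule with 5% warmup
    eval_strategy="epoch",           # `evaluation_strategy` in older Transformers
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    fp16=torch.cuda.is_available(),  # mixed precision only when a GPU is present
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```
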
### Evaluation Results

| Metric | Score |
|--------|-------|
| **Accuracy** | 85.0% |
| **Precision** | 85.2% |
| **Recall** | 84.8% |
| **F1-Score** | 85.0% |

#### Per-Class Performance

| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| Negative | 0.84 | 0.86 | 0.85 | 421 |
| Positive | 0.86 | 0.84 | 0.85 | 421 |

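The per-class table has the shape of scikit-learn's `classification_report`; below is a minimal sketch of reproducing such a table on a held-out test set, where `test_texts` and `test_labels` are hypothetical names for the raw test strings and their 0/1 labels:

```python
from sklearn.metrics import classification_report

# `classifier` is the pipeline from the quick-start example above.
preds = [1 if r["label"] in ("POSITIVE", "LABEL_1") else 0
         for r in classifier(test_texts)]
print(classification_report(test_labels, preds,
                            target_names=["Negative", "Positive"]))
```
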
## Model Comparison

This model was developed as part of a comparative study:

| Model | Accuracy | Parameters | Inference Speed |
|-------|----------|------------|-----------------|
| BiLSTM | 62% | ~500K | Fast (~5× baseline) |
| **AraBERT** | **85%** | ~110M | Baseline |

AraBERT achieves **23 percentage points higher accuracy** than the BiLSTM baseline while maintaining reasonable inference speed.

## Framework Versions

- **Transformers:** 4.30.0+
- **PyTorch:** 2.0.0+
- **Datasets:** 2.12.0+
- **Tokenizers:** 0.13.0+

## Citation
```bibtex
@misc{arabert-sentiment-2025,
  author       = {Belal Mahmoud Hussien},
  title        = {AraBERT Fine-tuned for Arabic Sentiment Analysis},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Belall87/arabert-arabic-sentiment}}
}
```

### Base Model Citation
```bibtex
@inproceedings{antoun2020arabert,
  title     = {AraBERT: Transformer-based Model for Arabic Language Understanding},
  author    = {Antoun, Wissam and Baly, Fady and Hajj, Hazem},
  booktitle = {LREC 2020 Workshop Language Resources and Evaluation Conference},
  year      = {2020}
}
```

## License

This model is licensed under the MIT License.

The base AraBERT model is also released under the MIT License; see [aubmindlab/arabert](https://github.com/aub-mind/arabert).

## Related Links

- **📊 Full Project:** [Arabic Sentiment BiLSTM vs AraBERT Comparison](https://github.com/Bolaal/Arabic-Sentiment-BiLSTM-vs-AraBERT)
- **💻 Training Code:** [GitHub Repository](https://github.com/Bolaal/Arabic-Sentiment-BiLSTM-vs-AraBERT)
- **📓 Kaggle Notebook:** [Comparison Study](https://kaggle.com/...)
- **🤖 Base Model:** [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02)

## Model Card Authors

Belal Mahmoud Hussien

## Contact

- **Email:** belalmahmoud8787@gmail.com
- **GitHub:** [@Bolaal](https://github.com/Bolaal)
- **LinkedIn:** [Belal Mahmoud](https://www.linkedin.com/in/belal-mahmoud-husien)

---

**Developed as part of a comparative study of classical deep learning vs. modern transfer learning for Arabic NLP.**
config.json ADDED
@@ -0,0 +1,25 @@
{
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "dtype": "float32",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "transformers_version": "4.57.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 64000
}
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e0a5092c3c4fa50559d1dc54b9fc7ec1ec29618406f9b1aa879e4f9599b4634e
size 540803072
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
{
  "cls_token": {
    "content": "[CLS]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "[MASK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "[PAD]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "[SEP]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "[UNK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,87 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "4": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "5": {
      "content": "[رابط]",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": true,
      "special": true
    },
    "6": {
      "content": "[بريد]",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": true,
      "special": true
    },
    "7": {
      "content": "[مستخدم]",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": true,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": false,
  "cls_token": "[CLS]",
  "do_basic_tokenize": true,
  "do_lower_case": false,
  "extra_special_tokens": {},
  "mask_token": "[MASK]",
  "max_len": 512,
  "model_max_length": 512,
  "never_split": [
    "[بريد]",
    "[مستخدم]",
    "[رابط]"
  ],
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]"
}
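
Aside from the standard BERT special tokens, this config registers three Arabic placeholder tokens in `added_tokens_decoder` and `never_split`: [رابط] ("link"), [بريد] ("email"), and [مستخدم] ("user"), presumably substituted for URLs, e-mail addresses, and user mentions during preprocessing. A minimal sketch of their effect (the input string is an invented example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Belall87/arabert-arabic-sentiment")

# Registered as special tokens, the placeholders survive tokenization
# as single tokens instead of being split into word pieces.
print(tokenizer.tokenize("شاهد [رابط] من [مستخدم]"))  # "see [LINK] from [USER]"
```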
vocab.txt ADDED
The diff for this file is too large to render. See raw diff