Belall87 committed
Commit ee97f7f · verified · 1 Parent(s): 530eb07

Upload folder using huggingface_hub

Files changed (7)
  1. README.md +239 -0
  2. config.json +25 -0
  3. model.safetensors +3 -0
  4. special_tokens_map.json +37 -0
  5. tokenizer.json +0 -0
  6. tokenizer_config.json +87 -0
  7. vocab.txt +0 -0
README.md CHANGED
@@ -1,3 +1,242 @@
---
language: ar
license: mit
tags:
- sentiment-analysis
- arabic
- arabert
- text-classification
- pytorch
base_model: aubmindlab/bert-base-arabertv02
datasets:
- custom
metrics:
- accuracy
- f1
model-index:
- name: arabert-arabic-sentiment
  results:
  - task:
      type: text-classification
      name: Sentiment Analysis
    dataset:
      type: custom
      name: Arabic Sentiment Dataset
    metrics:
    - type: accuracy
      value: 0.85
      name: Accuracy
    - type: f1
      value: 0.85
      name: F1 Score
library_name: transformers
pipeline_tag: text-classification
widget:
- text: "هذا المنتج رائع جداً وأنصح به بشدة"
  example_title: "Positive Example"
- text: "تجربة سيئة جداً ولن أشتري مرة أخرى"
  example_title: "Negative Example"
- text: "الخدمة ممتازة والتوصيل سريع"
  example_title: "Positive Service"
---

# AraBERT for Arabic Sentiment Analysis

Fine-tuned [AraBERT v0.2](https://huggingface.co/aubmindlab/bert-base-arabertv02) for binary sentiment classification on Arabic text.

## Model Description

This model is a fine-tuned version of `aubmindlab/bert-base-arabertv02` on a custom Arabic sentiment dataset. It classifies Arabic text as expressing positive or negative sentiment.

### Key Features
- 🎯 **85% accuracy** on Arabic sentiment classification
- 🌍 Pre-trained on a **large Arabic corpus** (AraBERT v0.2)
- ⚡ **Fast inference** with the transformer architecture
- 🔄 **Transfer learning** from a 110M-parameter BERT model

## Intended Uses & Limitations

### Intended Uses
- Arabic social media sentiment analysis
- Product review classification
- Customer feedback analysis
- Market research on Arabic content

### Limitations
- Binary classification only (positive/negative)
- Trained on a specific domain (may need fine-tuning for other domains)
- Arabic text only (Modern Standard Arabic and dialects)
- May not perform well on very short texts (<5 words)

## How to Use

### Quick Start with Pipeline
```python
from transformers import pipeline

# Load the sentiment analysis pipeline
classifier = pipeline(
    "sentiment-analysis",
    model="Belall87/arabert-arabic-sentiment"
)

# Classify text ("This product is really great")
result = classifier("هذا المنتج رائع جداً")
print(result)
# Output: [{'label': 'POSITIVE', 'score': 0.95}]
```

### Manual Loading
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(
    "Belall87/arabert-arabic-sentiment"
)
tokenizer = AutoTokenizer.from_pretrained(
    "Belall87/arabert-arabic-sentiment"
)

# Prepare input ("The service is excellent and the staff are helpful")
text = "الخدمة ممتازة والموظفون متعاونون"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

# Get prediction
with torch.no_grad():
    outputs = model(**inputs)
prediction = torch.argmax(outputs.logits, dim=-1)
probabilities = torch.softmax(outputs.logits, dim=-1)

sentiment = "Positive" if prediction.item() == 1 else "Negative"
confidence = probabilities[0][prediction].item()

print(f"Sentiment: {sentiment} (Confidence: {confidence:.2%})")
```

### Batch Processing
```python
texts = [
    "المطعم نظيف والطعام لذيذ",  # "The restaurant is clean and the food is delicious"
    "الخدمة سيئة جداً",  # "The service is very bad"
    "منتج عادي لا بأس به"  # "An ordinary product, it's okay"
]

# Reuse the pipeline from the quick-start example for batch inference
results = classifier(texts)
for text, result in zip(texts, results):
    print(f"{text}: {result['label']} ({result['score']:.2%})")
```

## Training Details

### Training Data

- **Dataset Size:** ~4,200 Arabic text samples
- **Train/Val/Test Split:** 72% / 8% / 20%
- **Data Sources:** Arabic tweets, reviews, and comments
- **Preprocessing:** Text normalization, diacritics removal, character standardization

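The preprocessing code itself is not part of this upload, so the following is only a minimal sketch of the steps named above, assuming the common Alef/Alef-Maqsura/Ta-Marbuta normalization conventions; the `normalize_arabic` helper is hypothetical. (AraBERT also ships an `ArabertPreprocessor` in the `arabert` package, which may have been used instead.)

```python
import re

# Hypothetical sketch of the preprocessing described above; the actual
# training pipeline may differ in its details.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")  # harakat, dagger alef, tatweel

def normalize_arabic(text: str) -> str:
    text = DIACRITICS.sub("", text)           # remove diacritics / tatweel
    text = re.sub("[إأآ]", "ا", text)         # standardize Alef variants
    text = re.sub("ى", "ي", text)             # Alef Maqsura -> Ya
    text = re.sub("ة", "ه", text)             # Ta Marbuta -> Ha
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return text

print(normalize_arabic("هذا المنتجُ رائعٌ جداً"))  # -> "هذا المنتج رائع جدا"
```
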
### Training Procedure

#### Hyperparameters
```text
Learning Rate: 2e-5
Batch Size: 8 (train), 16 (eval)
Epochs: 3
Optimizer: AdamW
Weight Decay: 0.01
LR Scheduler: Cosine with 5% warmup
Max Sequence Length: 256
```

#### Training Configuration

- **Framework:** PyTorch with Hugging Face Transformers
- **Base Model:** aubmindlab/bert-base-arabertv02
- **Fine-tuning Strategy:** Full model fine-tuning
- **Early Stopping:** Patience of 3 epochs on validation accuracy
- **Mixed Precision:** FP16 (if GPU available)

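The training script is likewise not included here; the sketch below shows how the hyperparameters and configuration above could map onto the Transformers `Trainer` API. Dataset loading is elided, and the `compute_metrics` helper and early-stopping wiring are assumptions rather than code from this repository.

```python
import numpy as np
import torch
from sklearn.metrics import accuracy_score, f1_score
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "aubmindlab/bert-base-arabertv02", num_labels=2
)
tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv02")

# Tokenized train/validation splits (built with tokenizer(...,
# truncation=True, max_length=256)) would go here; elided in this sketch.
train_dataset, eval_dataset = ..., ...

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, preds),
            "f1": f1_score(labels, preds)}

args = TrainingArguments(
    output_dir="arabert-arabic-sentiment",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,               # AdamW is the Trainer default optimizer
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,               # cosine schedule with 5% warmup
    eval_strategy="epoch",           # `evaluation_strategy` in older Transformers
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    fp16=torch.cuda.is_available(),  # mixed precision only when a GPU is present
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```
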
### Evaluation Results

| Metric | Score |
|--------|-------|
| **Accuracy** | 85.0% |
| **Precision** | 85.2% |
| **Recall** | 84.8% |
| **F1-Score** | 85.0% |

#### Per-Class Performance

| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| Negative | 0.84 | 0.86 | 0.85 | 421 |
| Positive | 0.86 | 0.84 | 0.85 | 421 |

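The per-class table has the shape of scikit-learn's `classification_report`; below is a minimal sketch of reproducing such a table on a held-out test set, where `test_texts` and `test_labels` are hypothetical names for the raw test strings and their 0/1 labels:

```python
from sklearn.metrics import classification_report

# `classifier` is the pipeline from the quick-start example above.
preds = [1 if r["label"] in ("POSITIVE", "LABEL_1") else 0
         for r in classifier(test_texts)]
print(classification_report(test_labels, preds,
                            target_names=["Negative", "Positive"]))
```
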
## Model Comparison

This model was developed as part of a comparative study:

| Model | Accuracy | Parameters | Inference Speed |
|-------|----------|------------|-----------------|
| BiLSTM | 62% | ~500K | Fast (~5× baseline) |
| **AraBERT** | **85%** | ~110M | Baseline |

AraBERT achieves **23 percentage points higher accuracy** than the BiLSTM baseline while maintaining reasonable inference speed.

## Framework Versions

- **Transformers:** 4.30.0+
- **PyTorch:** 2.0.0+
- **Datasets:** 2.12.0+
- **Tokenizers:** 0.13.0+

## Citation
```bibtex
@misc{arabert-sentiment-2025,
  author       = {Belal Mahmoud Hussien},
  title        = {AraBERT Fine-tuned for Arabic Sentiment Analysis},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Belall87/arabert-arabic-sentiment}}
}
```

### Base Model Citation
```bibtex
@inproceedings{antoun2020arabert,
  title     = {AraBERT: Transformer-based Model for Arabic Language Understanding},
  author    = {Antoun, Wissam and Baly, Fady and Hajj, Hazem},
  booktitle = {LREC 2020 Workshop Language Resources and Evaluation Conference},
  year      = {2020}
}
```

## License

This model is licensed under the MIT License.

The base AraBERT model is also released under the MIT License; see [aubmindlab/arabert](https://github.com/aub-mind/arabert).

## Related Links

- **📊 Full Project:** [Arabic Sentiment BiLSTM vs AraBERT Comparison](https://github.com/Bolaal/Arabic-Sentiment-BiLSTM-vs-AraBERT)
- **💻 Training Code:** [GitHub Repository](https://github.com/Bolaal/Arabic-Sentiment-BiLSTM-vs-AraBERT)
- **📓 Kaggle Notebook:** [Comparison Study](https://kaggle.com/...)
- **🤖 Base Model:** [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02)

## Model Card Authors

Belal Mahmoud Hussien

## Contact

- **Email:** belalmahmoud8787@gmail.com
- **GitHub:** [@Bolaal](https://github.com/Bolaal)
- **LinkedIn:** [Belal Mahmoud](https://www.linkedin.com/in/belal-mahmoud-husien)

---

**Developed as part of a comparative study of classical deep learning vs. modern transfer learning for Arabic NLP.**
config.json ADDED
@@ -0,0 +1,25 @@
{
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "dtype": "float32",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "transformers_version": "4.57.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 64000
}
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e0a5092c3c4fa50559d1dc54b9fc7ec1ec29618406f9b1aa879e4f9599b4634e
size 540803072
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
{
  "cls_token": {
    "content": "[CLS]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "[MASK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "[PAD]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "[SEP]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "[UNK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,87 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "4": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "5": {
      "content": "[رابط]",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": true,
      "special": true
    },
    "6": {
      "content": "[بريد]",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": true,
      "special": true
    },
    "7": {
      "content": "[مستخدم]",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": true,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": false,
  "cls_token": "[CLS]",
  "do_basic_tokenize": true,
  "do_lower_case": false,
  "extra_special_tokens": {},
  "mask_token": "[MASK]",
  "max_len": 512,
  "model_max_length": 512,
  "never_split": [
    "[بريد]",
    "[مستخدم]",
    "[رابط]"
  ],
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]"
}
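
Aside from the standard BERT special tokens, this config registers three Arabic placeholder tokens in `added_tokens_decoder` and `never_split`: [رابط] ("link"), [بريد] ("email"), and [مستخدم] ("user"), presumably substituted for URLs, e-mail addresses, and user mentions during preprocessing. A minimal sketch of their effect (the input string is an invented example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Belall87/arabert-arabic-sentiment")

# Registered as special tokens, the placeholders survive tokenization
# as single tokens instead of being split into word pieces.
print(tokenizer.tokenize("شاهد [رابط] من [مستخدم]"))  # "see [LINK] from [USER]"
```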
vocab.txt ADDED
The diff for this file is too large to render. See raw diff