# DziriBERT — Algerian Darija Misinformation Detection

**DziriBERT** is a fine-tuned **XLM-RoBERTa-large** model for detecting misinformation in **Algerian Darija** text from social media and news.

- **Base model**: `xlm-roberta-large` (355M parameters)
- **Task**: Multi-class text classification (5 classes)
- **Classes**:
  - **F**: Fake
  - **R**: Real
  - **N**: Non-new
  - **M**: Misleading
  - **S**: Satire

---

## Performance (Test set: 3,344 samples)

- **Accuracy**: 78.32%
- **Macro F1**: 68.22%
- **Weighted F1**: 78.43%

**Per-class F1**:
- Fake (F): 85.04%
- Real (R): 80.44%
- Non-new (N): 83.23%
- Misleading (M): 64.57%
- Satire (S): 27.83%
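
The gap between macro and weighted F1 follows directly from the per-class scores: macro F1 is the plain mean over the five classes, so the weak Satire score pulls it down, while weighted F1 scales each class by its share of the test set. A quick check in plain Python, using only the numbers reported above:

```python
# Per-class F1 scores reported above, in percent.
per_class_f1 = {"F": 85.04, "R": 80.44, "N": 83.23, "M": 64.57, "S": 27.83}

# Macro F1 is the unweighted mean over classes, so the small Satire
# class counts as much as the large Fake class and drags it down.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(f"Macro F1: {macro_f1:.2f}%")  # → 68.22%, matching the reported score
```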

---

## Training Summary

- **Max sequence length**: 128
- **Epochs**: 3 (early stopping)
- **Batch size**: 8 (effective 16 with gradient accumulation)
- **Learning rate**: 1e-5
- **Loss**: Weighted CrossEntropy
- **Data augmentation**: Applied to minority classes (M, S)
- **Seed**: 42
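
The card does not say how the CrossEntropy class weights were derived; a common choice for an imbalanced label set like this one is inverse-frequency weighting. A minimal sketch with made-up label counts (the real class distribution is not published here):

```python
from collections import Counter

# Hypothetical training-set label counts, for illustration only;
# the actual class distribution is not published in this card.
counts = Counter({"F": 5000, "R": 4000, "N": 3500, "M": 1200, "S": 600})
total = sum(counts.values())

# Inverse-frequency weighting: weight_c = total / (num_classes * count_c),
# so rare classes (M, S) contribute proportionally more to the loss.
weights = {c: total / (len(counts) * n) for c, n in counts.items()}
for label, weight in weights.items():
    print(f"{label}: {weight:.2f}")
```

A weight vector built this way would typically be passed to `torch.nn.CrossEntropyLoss(weight=...)`; the effective batch size of 16 then comes from accumulating gradients over two steps of batch 8 before each optimizer update.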

---

## Strengths & Limitations

**Strengths**
- Strong performance on the Fake, Real, and Non-new classes
- Handles Darija, Arabic, and French code-switching well

**Limitations**
- Low performance on Satire due to limited training samples
- The Misleading class remains challenging

---

## Usage

```python
import os
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Force the PyTorch backend of transformers.
os.environ["USE_TF"] = "0"
os.environ["USE_TORCH"] = "1"

MODEL_ID = "Rahilgh/model4_2"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=False)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).to(DEVICE)
model.eval()

LABEL_MAP = {0: "F", 1: "R", 2: "N", 3: "M", 4: "S"}
LABEL_NAMES = {
    "F": "Fake",
    "R": "Real",
    "N": "Non-new",
    "M": "Misleading",
    "S": "Satire",
}

texts = [
    "الجزائر فازت ببطولة امم افريقيا 2019",  # "Algeria won the 2019 Africa Cup of Nations"
    "صورة زعيم عالمي يرتدي ملابس غريبة تثير السخرية",  # "Image of a world leader in strange clothes sparks ridicule"
]

for text in texts:
    inputs = tokenizer(
        text,
        return_tensors="pt",
        max_length=128,
        truncation=True,
        padding=True,
    ).to(DEVICE)

    with torch.no_grad():
        outputs = model(**inputs)

    # Convert the raw logits into class probabilities.
    probs = torch.softmax(outputs.logits, dim=1)
    pred_id = probs.argmax().item()
    confidence = probs[0][pred_id].item()

    label = LABEL_MAP[pred_id]
    print(f"Text: {text}")
    print(f"Prediction: {LABEL_NAMES[label]} ({label}) — {confidence:.2%}")
```