---
language:
  - ar
  - fr
license: mit
pipeline_tag: text-classification
tags:
  - misinformation-detection
  - fake-news
  - text-classification
  - algerian-darija
  - arabic
  - xlm-roberta

base_model: xlm-roberta-large
---

# DziriBERT — Algerian Darija Misinformation Detection

**DziriBERT** is a fine-tuned **XLM-RoBERTa-large** model for detecting misinformation in **Algerian Darija** text from social media and news.

- **Base model**: `xlm-roberta-large` (355M parameters)
- **Task**: Multi-class text classification (5 classes)
- **Classes**:
  - **F**: Fake
  - **R**: Real
  - **N**: Non-news
  - **M**: Misleading
  - **S**: Satire

---

## Performance (Test set: 3,344 samples)

- **Accuracy**: 78.32%
- **Macro F1**: 68.22%
- **Weighted F1**: 78.43%

**Per-class F1**:
- Fake (F): 85.04%
- Real (R): 80.44%
- Non-news (N): 83.23%
- Misleading (M): 64.57%
- Satire (S): 27.83%

---

## Training Summary

- **Max sequence length**: 128
- **Epochs**: 3 (early stopping)
- **Batch size**: 8 (effective 16 with gradient accumulation)
- **Learning rate**: 1e-5
- **Loss**: Weighted CrossEntropy
- **Data augmentation**: Applied to minority classes (M, S)
- **Seed**: 42
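The hyperparameters listed above could be expressed as a `transformers.TrainingArguments` configuration along these lines. This is a hedged sketch, not the authors' training script: `output_dir` is a placeholder, and any argument not stated in the card (evaluation/save cadence, early-stopping wiring) is an assumption.

```python
from transformers import TrainingArguments

# Sketch reconstructed from the reported setup; unstated values are assumptions.
args = TrainingArguments(
    output_dir="checkpoints",        # placeholder path
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,   # effective batch size 16
    learning_rate=1e-5,
    eval_strategy="epoch",           # named `evaluation_strategy` in older transformers
    save_strategy="epoch",
    load_best_model_at_end=True,     # needed for EarlyStoppingCallback
    seed=42,
)
```

The weighted cross-entropy loss and minority-class augmentation are not covered by `TrainingArguments`; they would require a custom `Trainer.compute_loss` and dataset preprocessing, respectively.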

---

## Strengths & Limitations

**Strengths**
- Strong performance on the Fake, Real, and Non-news classes
- Handles Darija, Arabic, and French code-switching well

**Limitations**
- Low performance on Satire due to limited samples
- Misleading class remains challenging

---

## Usage

```python
import os
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

os.environ["USE_TF"] = "0"
os.environ["USE_TORCH"] = "1"

MODEL_ID = "Rahilgh/model4_2"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=False)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).to(DEVICE)
model.eval()

LABEL_MAP = {0: "F", 1: "R", 2: "N", 3: "M", 4: "S"}
LABEL_NAMES = {
    "F": "Fake",
    "R": "Real",
    "N": "Non-news",
    "M": "Misleading",
    "S": "Satire",
}

texts = [
    "الجزائر فازت ببطولة امم افريقيا 2019",
    "صورة زعيم عالمي يرتدي ملابس غريبة تثير السخرية",
]

for text in texts:
    inputs = tokenizer(
        text,
        return_tensors="pt",
        max_length=128,
        truncation=True,
        padding=True,
    ).to(DEVICE)

    with torch.no_grad():
        outputs = model(**inputs)
        probs = torch.softmax(outputs.logits, dim=1)
        pred_id = probs.argmax().item()
        confidence = probs[0][pred_id].item()

    label = LABEL_MAP[pred_id]
    print(f"Text: {text}")
    print(f"Prediction: {LABEL_NAMES[label]} ({label}) — {confidence:.2%}")
```