---
language:
- ar
- fr
license: mit
pipeline_tag: text-classification
tags:
- misinformation-detection
- fake-news
- text-classification
- algerian-darija
- arabic
- xlm-roberta
base_model: xlm-roberta-large
---
|
|
|
|
|
# DziriBERT — Algerian Darija Misinformation Detection |
|
|
|
|
|
**DziriBERT** is a fine-tuned **XLM-RoBERTa-large** model for detecting misinformation in **Algerian Darija** text from social media and news.

- **Base model**: `xlm-roberta-large` (~550M parameters)
- **Task**: Multi-class text classification (5 classes)
- **Classes**:
  - **F**: Fake
  - **R**: Real
  - **N**: Non-news
  - **M**: Misleading
  - **S**: Satire
|
|
|
|
|
--- |
|
|
|
|
|
## Performance (Test set: 3,344 samples)

- **Accuracy**: 78.32%
- **Macro F1**: 68.22%
- **Weighted F1**: 78.43%

**Per-class F1**:
- Fake (F): 85.04%
- Real (R): 80.44%
- Non-news (N): 83.23%
- Misleading (M): 64.57%
- Satire (S): 27.83%
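
The macro F1 (68.22%) is the unweighted mean of the five per-class scores, which is why the weak Satire class pulls it well below the support-weighted score. A minimal sketch of how these metrics can be recomputed with scikit-learn; `y_true` and `y_pred` are hypothetical placeholders for the test-set gold labels and model predictions:

```python
# Minimal sketch: accuracy, macro/weighted F1, and per-class F1 via scikit-learn.
# y_true / y_pred are hypothetical placeholder data, not the actual test set.
from sklearn.metrics import accuracy_score, classification_report, f1_score

LABELS = ["F", "R", "N", "M", "S"]
y_true = ["F", "R", "N", "M", "S", "F"]  # gold labels (placeholder)
y_pred = ["F", "R", "N", "S", "S", "F"]  # model predictions (placeholder)

print(f"Accuracy:    {accuracy_score(y_true, y_pred):.2%}")
print(f"Macro F1:    {f1_score(y_true, y_pred, labels=LABELS, average='macro'):.2%}")
print(f"Weighted F1: {f1_score(y_true, y_pred, labels=LABELS, average='weighted'):.2%}")
print(classification_report(y_true, y_pred, labels=LABELS))  # per-class breakdown
```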
|
|
|
|
|
--- |
|
|
|
|
|
## Training Summary

- **Max sequence length**: 128
- **Epochs**: 3 (with early stopping)
- **Batch size**: 8 (effective 16 with gradient accumulation)
- **Learning rate**: 1e-5
- **Loss**: Weighted cross-entropy
- **Data augmentation**: Applied to minority classes (M, S)
- **Seed**: 42
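
The card does not include the training script; below is a minimal sketch of a setup consistent with the hyperparameters above. The class-weight values, dataset objects (`train_ds`, `val_ds`), and output directory are hypothetical placeholders.

```python
import torch
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

# Hypothetical class weights: the card states the loss is weighted
# cross-entropy but does not publish the actual weight values.
CLASS_WEIGHTS = torch.tensor([1.0, 1.0, 1.0, 2.0, 4.0])

class WeightedTrainer(Trainer):
    """Trainer variant that applies class-weighted cross-entropy."""
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fct = torch.nn.CrossEntropyLoss(
            weight=CLASS_WEIGHTS.to(outputs.logits.device))
        loss = loss_fct(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss

model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-large", num_labels=5)

args = TrainingArguments(
    output_dir="dziribert-misinfo",  # hypothetical output path
    num_train_epochs=3,              # early stopping may end sooner
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,   # effective batch size 16
    learning_rate=1e-5,
    seed=42,
)
# train_ds / val_ds: tokenized datasets (max_length=128), not shown here.
# trainer = WeightedTrainer(model=model, args=args,
#                           train_dataset=train_ds, eval_dataset=val_ds)
# trainer.train()
```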
|
|
|
|
|
--- |
|
|
|
|
|
## Strengths & Limitations

**Strengths**
- Strong performance on the Fake, Real, and Non-news classes
- Handles code-switching between Darija, Arabic, and French well

**Limitations**
- Low performance on Satire, owing to the limited number of training samples
- The Misleading class remains challenging
|
|
|
|
|
--- |
|
|
|
|
|
## Usage |
|
|
|
|
|
```python
import os

# Select the PyTorch backend before transformers is imported.
os.environ["USE_TF"] = "0"
os.environ["USE_TORCH"] = "1"

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "Rahilgh/model4_2"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# use_fast=False selects the slow SentencePiece tokenizer.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=False)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).to(DEVICE)
model.eval()

LABEL_MAP = {0: "F", 1: "R", 2: "N", 3: "M", 4: "S"}
LABEL_NAMES = {
    "F": "Fake",
    "R": "Real",
    "N": "Non-news",
    "M": "Misleading",
    "S": "Satire",
}

texts = [
    # "Algeria won the 2019 Africa Cup of Nations"
    "الجزائر فازت ببطولة امم افريقيا 2019",
    # "Photo of a world leader wearing strange clothes sparks ridicule"
    "صورة زعيم عالمي يرتدي ملابس غريبة تثير السخرية",
]

for text in texts:
    inputs = tokenizer(
        text,
        return_tensors="pt",
        max_length=128,  # matches the training sequence length
        truncation=True,
        padding=True,
    ).to(DEVICE)

    with torch.no_grad():
        outputs = model(**inputs)
        probs = torch.softmax(outputs.logits, dim=1)
        pred_id = probs.argmax().item()
        confidence = probs[0][pred_id].item()

    label = LABEL_MAP[pred_id]
    print(f"Text: {text}")
    print(f"Prediction: {LABEL_NAMES[label]} ({label}) — {confidence:.2%}")
```
|
|
|