---
language:
- ar
- fr
license: mit
pipeline_tag: text-classification
tags:
- misinformation-detection
- fake-news
- text-classification
- algerian-darija
- arabic
- xlm-roberta
base_model: xlm-roberta-large
---
# DziriBERT — Algerian Darija Misinformation Detection
**DziriBERT** is a fine-tuned **XLM-RoBERTa-large** model for detecting misinformation in **Algerian Darija** text from social media and news.
- **Base model**: `xlm-roberta-large` (355M parameters)
- **Task**: Multi-class text classification (5 classes)
- **Classes**:
- **F**: Fake
- **R**: Real
  - **N**: Non-news
- **M**: Misleading
- **S**: Satire
---
## Performance (Test set: 3,344 samples)
- **Accuracy**: 78.32%
- **Macro F1**: 68.22%
- **Weighted F1**: 78.43%
**Per-class F1**:
- Fake (F): 85.04%
- Real (R): 80.44%
- Non-news (N): 83.23%
- Misleading (M): 64.57%
- Satire (S): 27.83%
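
The gap between macro F1 (68.22%) and weighted F1 (78.43%) follows directly from the averaging: macro F1 treats all five classes equally, so the weak Satire class drags it down, while weighted F1 scales each class by its support. A minimal pure-Python sketch of the two averages (toy labels for illustration, not this model's test set):

```python
from collections import Counter

def f1_scores(y_true, y_pred, labels):
    """Per-class F1, plus macro (unweighted) and weighted (by support) averages."""
    per_class = {}
    support = Counter(y_true)
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        per_class[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    macro = sum(per_class.values()) / len(labels)
    n = len(y_true)
    weighted = sum(per_class[c] * support[c] / n for c in labels)
    return per_class, macro, weighted
```

With an imbalanced label set, a poorly-predicted rare class lowers the macro score far more than the weighted one, which is exactly the pattern in the table above.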
---
## Training Summary
- **Max sequence length**: 128
- **Epochs**: 3 (early stopping)
- **Batch size**: 8 (effective 16 with gradient accumulation)
- **Learning rate**: 1e-5
- **Loss**: Weighted CrossEntropy
- **Data augmentation**: Applied to minority classes (M, S)
- **Seed**: 42
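
The weighted CrossEntropy above scales each example's loss by the weight of its true class, so under-represented classes like M and S contribute more per example. A minimal pure-Python sketch of the computation, matching the normalization PyTorch's `CrossEntropyLoss(weight=...)` uses with mean reduction (the weights below are hypothetical, not the values used in training):

```python
import math

def weighted_cross_entropy(logits, labels, weights):
    """Weighted CE: each sample's negative log-likelihood is scaled by the
    weight of its true class, then normalized by the sum of those weights."""
    total, wsum = 0.0, 0.0
    for row, y in zip(logits, labels):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += weights[y] * (log_z - row[y])
        wsum += weights[y]
    return total / wsum
```

Upweighting a minority class makes mistakes on it cost more, which is why it helps on M and S here.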
---
## Strengths & Limitations
**Strengths**
- Strong performance on the Fake, Real, and Non-news classes
- Handles Darija, Arabic, and French code-switching well
**Limitations**
- Low performance on Satire due to limited samples
- Misleading class remains challenging
---
## Usage
```python
import os

# Set these before importing transformers so they take effect.
os.environ["USE_TF"] = "0"
os.environ["USE_TORCH"] = "1"

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
MODEL_ID = "Rahilgh/model4_2"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=False)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).to(DEVICE)
model.eval()
LABEL_MAP = {0: "F", 1: "R", 2: "N", 3: "M", 4: "S"}
LABEL_NAMES = {
    "F": "Fake",
    "R": "Real",
    "N": "Non-news",
    "M": "Misleading",
    "S": "Satire",
}
texts = [
    "الجزائر فازت ببطولة امم افريقيا 2019",  # "Algeria won the 2019 Africa Cup of Nations"
    "صورة زعيم عالمي يرتدي ملابس غريبة تثير السخرية",  # "Photo of a world leader in strange clothes sparks mockery"
]
for text in texts:
    inputs = tokenizer(
        text,
        return_tensors="pt",
        max_length=128,
        truncation=True,
        padding=True,
    ).to(DEVICE)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=1)
    pred_id = probs.argmax().item()
    confidence = probs[0][pred_id].item()
    label = LABEL_MAP[pred_id]
    print(f"Text: {text}")
    print(f"Prediction: {LABEL_NAMES[label]} ({label}) — {confidence:.2%}")