---
language:
- ar
- fr
license: mit
pipeline_tag: text-classification
tags:
- misinformation-detection
- fake-news
- text-classification
- algerian-darija
- arabic
- mbert
model_name: mBERT-Algerian-Darija
base_model: bert-base-multilingual-cased
---

# mBERT — Algerian Darija Misinformation Detection

Fine-tuned **bert-base-multilingual-cased** for detecting misinformation in **Algerian Darija** text.

- **Base model**: `bert-base-multilingual-cased` (170M parameters)
- **Task**: Multi-class text classification (5 classes)
- **Classes**: F (Factual), R (Reporting), N (Non-factual), M (Misleading), S (Satire)

---

## Performance (Test set: 3,344 samples)

- **Accuracy**: 75.42%
- **Macro F1**: 64.48%
- **Weighted F1**: 75.70%

**Per-class F1**:
- Factual (F): 83.72%
- Reporting (R): 76.35%
- Non-factual (N): 81.01%
- Misleading (M): 61.46%
- Satire (S): 19.86%
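The macro F1 is the unweighted mean of the per-class scores, which is why the weak Satire class (19.86%) pulls it well below the weighted F1. A quick sanity check of the reported figure:

```python
# Per-class F1 scores reported above (percent)
per_class_f1 = {"F": 83.72, "R": 76.35, "N": 81.01, "M": 61.46, "S": 19.86}

# Macro F1 = unweighted mean over classes
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(f"{macro_f1:.2f}")  # 64.48 — matches the reported Macro F1
```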

---

## Training Summary

- **Max sequence length**: 128
- **Epochs**: 3 (early stopping)
- **Batch size**: 16
- **Learning rate**: 2e-5
- **Loss**: Weighted CrossEntropy
- **Seed**: 42 (reproducibility)
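The weighted cross-entropy loss counters class imbalance by scaling each class's loss term. The card does not state the weighting scheme or the training distribution; one common choice, sketched here with **hypothetical class counts**, is inverse-frequency weighting, `w_c = N / (K * n_c)`:

```python
# Hypothetical class counts — the card does not publish the training distribution
counts = {"F": 6000, "R": 4000, "N": 3500, "M": 1500, "S": 400}

total = sum(counts.values())  # N: total samples
k = len(counts)               # K: number of classes

# Inverse-frequency weights: rare classes (e.g. Satire) get a larger weight,
# so their errors contribute more to the loss.
weights = {label: total / (k * n) for label, n in counts.items()}

for label, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{label}: {w:.2f}")
```

The resulting values would typically be passed as the `weight` tensor of `torch.nn.CrossEntropyLoss`.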

---

## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "Rahilgh/model4_1"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device).eval()

LABEL_MAP = {0: "F", 1: "R", 2: "N", 3: "M", 4: "S"}
LABEL_NAMES = {
    "F": "Factual",
    "R": "Reporting",
    "N": "Non-factual",
    "M": "Misleading",
    "S": "Satire",
}

texts = [
    # Darija: "They say they're going to cancel the BAC this year"
    "قالك بلي رايحين ينحو الباك هذا العام",
]

for text in texts:
    inputs = tokenizer(
        text,
        return_tensors="pt",
        max_length=128,
        truncation=True,
        padding=True,
    ).to(device)

    with torch.no_grad():
        outputs = model(**inputs)

    probs = torch.softmax(outputs.logits, dim=1)[0]
    pred_id = probs.argmax().item()
    confidence = probs[pred_id].item()

    label = LABEL_MAP[pred_id]
    print(f"Text: {text}")
    print(f"Prediction: {LABEL_NAMES[label]} ({label}) — {confidence:.2%}\n")
```