---
language:
- ar
- fr
license: mit
pipeline_tag: text-classification
tags:
- misinformation-detection
- fake-news
- text-classification
- algerian-darija
- arabic
- xlm-roberta
base_model: xlm-roberta-large
---

# DziriBERT — Algerian Darija Misinformation Detection

**DziriBERT** is a fine-tuned **XLM-RoBERTa-large** model for detecting misinformation in **Algerian Darija** text from social media and news.

- **Base model**: `xlm-roberta-large` (355M parameters)
- **Task**: Multi-class text classification (5 classes)
- **Classes**:
  - **F**: Fake
  - **R**: Real
  - **N**: Non-new
  - **M**: Misleading
  - **S**: Satire

---

## Performance (Test set: 3,344 samples)

- **Accuracy**: 78.32%
- **Macro F1**: 68.22%
- **Weighted F1**: 78.43%

**Per-class F1**:
- Fake (F): 85.04%
- Real (R): 80.44%
- Non-new (N): 83.23%
- Misleading (M): 64.57%
- Satire (S): 27.83%

---

## Training Summary

- **Max sequence length**: 128
- **Epochs**: 3 (early stopping)
- **Batch size**: 8 (effective 16 with gradient accumulation)
- **Learning rate**: 1e-5
- **Loss**: Weighted CrossEntropy
- **Data augmentation**: Applied to minority classes (M, S)
- **Seed**: 42

---

## Strengths & Limitations

**Strengths**
- Strong performance on the Fake, Real, and Non-new classes
- Handles Darija, Arabic, and French code-switching well

**Limitations**
- Low performance on Satire due to the limited number of training samples
- The Misleading class remains challenging

---

## Usage

```python
import os

# Force the PyTorch backend of Transformers (must be set before import).
os.environ["USE_TF"] = "0"
os.environ["USE_TORCH"] = "1"

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "Rahilgh/model4_2"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=False)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).to(DEVICE)
model.eval()

LABEL_MAP = {0: "F", 1: "R", 2: "N", 3: "M", 4: "S"}
LABEL_NAMES = {
    "F": "Fake",
    "R": "Real",
    "N": "Non-new",
    "M": "Misleading",
    "S": "Satire",
}

texts = [
    "الجزائر فازت ببطولة امم افريقيا 2019",  # "Algeria won the 2019 Africa Cup of Nations"
    "صورة زعيم عالمي يرتدي ملابس غريبة تثير السخرية",  # "Image of a world leader in strange clothes draws mockery"
]

for text in texts:
    inputs = tokenizer(
        text,
        return_tensors="pt",
        max_length=128,
        truncation=True,
        padding=True,
    ).to(DEVICE)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=1)
    pred_id = probs.argmax().item()
    confidence = probs[0][pred_id].item()
    label = LABEL_MAP[pred_id]
    print(f"Text: {text}")
    print(f"Prediction: {LABEL_NAMES[label]} ({label}) — {confidence:.2%}")
```