# DziriBERT — Algerian Darija Misinformation Detection

**DziriBERT** is a fine-tuned **XLM-RoBERTa-large** model for detecting misinformation in **Algerian Darija** text from social media and news.

- **Base model**: `xlm-roberta-large` (355M parameters)
- **Task**: Multi-class text classification (5 classes)
- **Classes**:
  - **F**: Fake
  - **R**: Real
  - **N**: Non-new
  - **M**: Misleading
  - **S**: Satire

---

## Performance (Test set: 3,344 samples)

- **Accuracy**: 78.32%
- **Macro F1**: 68.22%
- **Weighted F1**: 78.43%

**Per-class F1**:
- Fake (F): 85.04%
- Real (R): 80.44%
- Non-new (N): 83.23%
- Misleading (M): 64.57%
- Satire (S): 27.83%
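
The gap between macro and weighted F1 follows directly from the per-class scores: macro F1 is the plain mean over the five classes, so the weak Satire score pulls it down, while weighted F1 scales each class by its share of the test set. A quick check in plain Python, using only the numbers reported above:

```python
# Per-class F1 scores reported above, in percent.
per_class_f1 = {"F": 85.04, "R": 80.44, "N": 83.23, "M": 64.57, "S": 27.83}

# Macro F1 is the unweighted mean over classes, so the small Satire
# class counts as much as the large Fake class and drags it down.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(f"Macro F1: {macro_f1:.2f}%")  # → 68.22%, matching the reported score
```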

---

## Training Summary

- **Max sequence length**: 128
- **Epochs**: 3 (early stopping)
- **Batch size**: 8 (effective 16 with gradient accumulation)
- **Learning rate**: 1e-5
- **Loss**: Weighted CrossEntropy
- **Data augmentation**: Applied to minority classes (M, S)
- **Seed**: 42
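
The card does not say how the CrossEntropy class weights were derived; a common choice for an imbalanced label set like this one is inverse-frequency weighting. A minimal sketch with made-up label counts (the real class distribution is not published here):

```python
from collections import Counter

# Hypothetical training-set label counts, for illustration only;
# the actual class distribution is not published in this card.
counts = Counter({"F": 5000, "R": 4000, "N": 3500, "M": 1200, "S": 600})
total = sum(counts.values())

# Inverse-frequency weighting: weight_c = total / (num_classes * count_c),
# so rare classes (M, S) contribute proportionally more to the loss.
weights = {c: total / (len(counts) * n) for c, n in counts.items()}
for label, weight in weights.items():
    print(f"{label}: {weight:.2f}")
```

A weight vector built this way would typically be passed to `torch.nn.CrossEntropyLoss(weight=...)`; the effective batch size of 16 then comes from accumulating gradients over two steps of batch 8 before each optimizer update.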

---

## Strengths & Limitations

**Strengths**
- Strong performance on the Fake, Real, and Non-new classes
- Handles Darija, Arabic, and French code-switching well

**Limitations**
- Low performance on Satire due to limited training samples
- The Misleading class remains challenging

---

## Usage

```python
import os
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Force the PyTorch backend of transformers.
os.environ["USE_TF"] = "0"
os.environ["USE_TORCH"] = "1"

MODEL_ID = "Rahilgh/model4_2"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=False)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).to(DEVICE)
model.eval()

LABEL_MAP = {0: "F", 1: "R", 2: "N", 3: "M", 4: "S"}
LABEL_NAMES = {
    "F": "Fake",
    "R": "Real",
    "N": "Non-new",
    "M": "Misleading",
    "S": "Satire",
}

texts = [
    "الجزائر فازت ببطولة امم افريقيا 2019",  # "Algeria won the 2019 Africa Cup of Nations"
    "صورة زعيم عالمي يرتدي ملابس غريبة تثير السخرية",  # "Image of a world leader in strange clothes sparks ridicule"
]

for text in texts:
    inputs = tokenizer(
        text,
        return_tensors="pt",
        max_length=128,
        truncation=True,
        padding=True,
    ).to(DEVICE)

    with torch.no_grad():
        outputs = model(**inputs)

    # Convert the raw logits into class probabilities.
    probs = torch.softmax(outputs.logits, dim=1)
    pred_id = probs.argmax().item()
    confidence = probs[0][pred_id].item()

    label = LABEL_MAP[pred_id]
    print(f"Text: {text}")
    print(f"Prediction: {LABEL_NAMES[label]} ({label}) — {confidence:.2%}")
```