bilalzafar committed on
Commit
e586797
·
verified ·
1 Parent(s): 61db00e

Update README.md

  1. README.md +77 -3
README.md CHANGED
@@ -1,3 +1,77 @@
1
- ---
2
- license: mit
3
- ---
1
+ ---
2
+ license: mit
3
+ ---
4
+
5
+
6
+
7
+ ## Training data
8
+ * **Source:** custom, manually annotated CBDC sentences
9
+ * **Size:** 2,405 sentences
10
+
11
+ **Class distribution:**
12
+ * `neutral`: 1,068 (44.41%)
13
+ * `positive`: 1,026 (42.66%)
14
+ * `negative`: 311 (12.93%)
15
+
16
+ **Splits** (row-wise, stratified by label):
17
+ * **train:** 1,924
18
+ * **validation:** 240
19
+ * **test:** 241
20
+
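A stratified row-wise split like the one listed above can be sketched as follows. This is a minimal illustration, not the author's actual split code: the helper name `stratified_split`, the 80/10/10 fractions, and the rounding rule are assumptions, and the label list is rebuilt from the reported class counts rather than from the real dataset.

```python
import random


def stratified_split(labels, fractions=(0.8, 0.1), seed=42):
    """Split row indices into train/validation/test, class by class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train, val, test = [], [], []
    for label in sorted(by_class):
        rows = by_class[label]
        rng.shuffle(rows)
        n_train = round(fractions[0] * len(rows))
        n_val = round(fractions[1] * len(rows))
        train += rows[:n_train]
        val += rows[n_train:n_train + n_val]
        test += rows[n_train + n_val:]  # the remainder of each class goes to test
    return train, val, test


# Label counts taken from the class distribution above (placeholder rows).
labels = ["neutral"] * 1068 + ["positive"] * 1026 + ["negative"] * 311
train, val, test = stratified_split(labels)
```

With this rounding rule the training split has 1,924 rows, matching the card; the validation and test sizes can differ by one row from the reported 240/241, since that depends on how the remainder is assigned.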
21
+ ---
22
+
23
+ ## Preprocessing
24
+
25
+ * Sentences are lowercased but otherwise left raw (no stemming or lemmatization)
26
+ * Tokenization: base model tokenizer (`bilalzafar/cb-bert-mlm`), **max\_length=320**, truncation enabled
27
+ * Dynamic padding via `DataCollatorWithPadding`
28
+
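The idea behind truncation plus dynamic padding can be illustrated without any library: each batch is padded only to its own longest sequence, never beyond the truncation limit. This is a simplified sketch, not the behavior of `DataCollatorWithPadding` itself; `pad_batch` and `PAD_ID` are hypothetical, and a real tokenizer also preserves special tokens when truncating.

```python
MAX_LENGTH = 320  # truncation limit from the list above
PAD_ID = 0        # assumed padding id; the real value comes from the tokenizer


def pad_batch(batch, max_length=MAX_LENGTH, pad_id=PAD_ID):
    """Truncate each sequence, then pad to the longest sequence in this batch."""
    truncated = [ids[:max_length] for ids in batch]
    width = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in truncated]
    attention_mask = [
        [1] * len(ids) + [0] * (width - len(ids)) for ids in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


short_batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
long_batch = pad_batch([[101, 7, 8, 102], list(range(1, 401))])
```

Here `short_batch` is only four tokens wide, while `long_batch` is capped at 320 tokens because one of its sequences exceeds the limit.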
29
+ ---
30
+
31
+ ## Training procedure
32
+
33
+ * **Base model:** `bilalzafar/cb-bert-mlm`
34
+ * **Head:** `AutoModelForSequenceClassification` with 3 labels
35
+ * **Optimizer:** AdamW (via HF Trainer)
36
+ * **Learning rate:** 2e-5
37
+ * **Batch size:** 16 (train/eval)
38
+ * **Epochs:** up to 8 with early stopping (patience=2); best epoch \~6
39
+ * **Warmup ratio:** 0.06
40
+ * **Weight decay:** 0.01
41
+ * **Precision:** fp16
42
+ * **Seed:** 42
43
+ * **Hardware:** Google Colab (T4)
44
+
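For orientation, the hyperparameters above imply a rough step budget. The arithmetic below is a sketch under stated assumptions: one device, no gradient accumulation, one pass over the 1,924 training rows per epoch (a custom sampler could change this), the last partial batch kept, and warmup rounded up. Early stopping can end training before the full budget is used.

```python
import math

train_rows = 1924     # size of the train split
batch_size = 16
max_epochs = 8        # upper bound; early stopping may end training sooner
warmup_ratio = 0.06

steps_per_epoch = math.ceil(train_rows / batch_size)  # last partial batch kept
total_steps = steps_per_epoch * max_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)  # rounding up is an assumption

print(steps_per_epoch, total_steps, warmup_steps)  # 121 968 59
```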
45
+ ---
46
+
47
+ ## Class imbalance & loss
48
+
49
+ * **Loss:** Focal Loss with γ = 1.0
50
+ * **Class weights:** computed from the **train split** (`class_weight="balanced"`) and applied in the loss
51
+ * **Sampler:** `WeightedRandomSampler` with √(inverse frequency) per-sample weights
52
+
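The three imbalance measures above can be written out in plain Python. This is a sketch of the standard formulas, not the author's training code: the function names are hypothetical, the "balanced" weights use the common `n_samples / (n_classes * count)` heuristic, and the full-set class counts stand in for the train-split counts, which the card does not list.

```python
import math


def balanced_class_weights(counts):
    """Weight per class: n_samples / (n_classes * class_count)."""
    total = sum(counts.values())
    return {c: total / (len(counts) * n) for c, n in counts.items()}


def sqrt_inverse_frequency(counts):
    """Sampling weight per class: sqrt(1 / relative frequency)."""
    total = sum(counts.values())
    return {c: math.sqrt(total / n) for c, n in counts.items()}


def focal_loss(probs, target, class_weights, gamma=1.0):
    """Weighted focal loss for one example: -w_t * (1 - p_t)**gamma * log(p_t)."""
    p_t = probs[target]
    return -class_weights[target] * (1.0 - p_t) ** gamma * math.log(p_t)


# Full-set counts as a stand-in; the card computes weights on the train split.
counts = {"neutral": 1068, "positive": 1026, "negative": 311}
weights = balanced_class_weights(counts)
sample_weights = sqrt_inverse_frequency(counts)

# A confident correct prediction is down-weighted relative to an uncertain one.
easy = focal_loss({"negative": 0.9, "neutral": 0.05, "positive": 0.05}, "negative", weights)
hard = focal_loss({"negative": 0.3, "neutral": 0.4, "positive": 0.3}, "negative", weights)
```

Taking the square root softens the sampler's correction, which seems sensible when class weights are already applied in the loss: the minority class is drawn more often without being oversampled at the full inverse-frequency rate.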
53
+ ---
54
+
55
+ ## Evaluation
56
+
57
+ **Validation** (\~10% split):
58
+ * accuracy: **0.8458**
59
+ * macro-F1: **0.8270**
60
+ * weighted-F1: **0.8453**
61
+
62
+ **Test** (\~10% split):
63
+ * accuracy: **0.8216**
64
+ * macro-F1: **0.8121**
65
+ * weighted-F1: **0.8216**
66
+
67
+ **Per-class (test):**
68
+
69
+ | class | precision | recall | f1 | support |
70
+ | -------- | --------- | ------ | ------ | ------- |
71
+ | negative | 0.8214 | 0.7419 | 0.7797 | 31 |
72
+ | neutral | 0.7857 | 0.8224 | 0.8037 | 107 |
73
+ | positive | 0.8614 | 0.8447 | 0.8529 | 103 |
74
+
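As a consistency check, the aggregate test metrics can be recomputed from the per-class table. The sketch below uses only the rounded table values, so results agree with the reported figures to about three decimal places; recovering correct-prediction counts by rounding `recall * support` is an assumption of this check.

```python
# (precision, recall, support) per class, copied from the table above.
rows = {
    "negative": (0.8214, 0.7419, 31),
    "neutral": (0.7857, 0.8224, 107),
    "positive": (0.8614, 0.8447, 103),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


per_class_f1 = {c: f1(p, r) for c, (p, r, _) in rows.items()}
total = sum(n for _, _, n in rows.values())

# Macro-F1 averages classes equally; weighted-F1 averages by support.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
weighted_f1 = sum(per_class_f1[c] * n for c, (_, _, n) in rows.items()) / total

# Accuracy equals total correct predictions over total support.
correct = sum(round(r * n) for _, r, n in rows.values())
accuracy = correct / total
```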
75
+ > Note: On the **entire annotated set** (in-domain evaluation, not a hold-out),
76
+ > the same model reaches \~0.95 accuracy / weighted-F1.
77
+ > That set includes the training examples, so treat those figures as upper bounds; the **test split** above is the recommended reference.