Commit d67e401 (verified) by bilalzafar · Parent(s): 741d011

Update README.md

Files changed (1): README.md (+35 -27)
README.md CHANGED
@@ -2,42 +2,25 @@
  license: mit
  ---
 
 
 
 
- **Training data:** The dataset consists of **2,405** custom, *manually annotated* sentences related to Central Bank Digital Currencies (CBDCs). The class distribution is **neutral**: *1,068* (44.41%), **positive**: *1,026* (42.66%), and **negative**: *311* (12.93%). The data is split **row-wise**, stratified by label, into **train**: *1,924*, **validation**: *240*, and **test**: *241* examples.
-
- ---
-
- ## Preprocessing
-
- * Lowercased, raw sentences (no stemming or lemmatization)
- * Tokenization: base model tokenizer (`bilalzafar/cb-bert-mlm`), **max_length=320**, truncation enabled
- * Dynamic padding via `DataCollatorWithPadding`
 
  ---
 
- ## Training procedure
-
- * **Base model:** `bilalzafar/cb-bert-mlm`
- * **Head:** `AutoModelForSequenceClassification` with 3 labels
- * **Optimizer:** AdamW (via HF Trainer)
- * **Learning rate:** 2e-5
- * **Batch size:** 16 (train/eval)
- * **Epochs:** up to 8 with early stopping (patience=2); best epoch ~6
- * **Warmup ratio:** 0.06
- * **Weight decay:** 0.01
- * **Precision:** fp16
- * **Seed:** 42
- * **Hardware:** Google Colab (T4)
 
  ---
 
- ## Class imbalance & loss
-
- * **Loss:** Focal Loss with γ = 1.0
- * **Class weights:** computed from the **train split** (`class_weight="balanced"`) and applied in the loss
- * **Sampler:** `WeightedRandomSampler` with √(inverse frequency) per-sample weights
 
  ---
 
@@ -63,4 +46,29 @@ weighted-F1: **0.8216**
 
  > Note: On the **entire annotated set** (in-domain evaluation, not a hold-out),
  > the same model reaches ~0.95 accuracy / weighted-F1.
- > Treat those as upper bounds; the **test split** above is the recommended reference.
 
  license: mit
  ---
 
+ # **CBDC-BERT-Sentiment: A Domain-Specific BERT for CBDC-Related Sentiment Analysis**
+
+ **CBDC-BERT-Sentiment** is a **3-class** (*negative / neutral / positive*) sentence-level classifier built for **Central Bank Digital Currency (CBDC)** communications. It is trained to identify overall sentiment in central-bank-style text such as consultations, speeches, reports, and reputable news.
+
+ **Base Model:** [`bilalzafar/cb-bert-mlm`](https://huggingface.co/bilalzafar/cb-bert-mlm). **CB-BERT (`cb-bert-mlm`)** is a domain-adapted **BERT base (uncased)** model, pretrained on **66M+ tokens** across **2M+ sentences** from central-bank speeches published by the **Bank for International Settlements (BIS, 1996–2024)**. It is optimized for *masked-token prediction* in the specialized domains of **monetary policy, financial regulation, and macroeconomic communication**, enabling better contextual understanding of central-bank discourse and financial narratives.
+
+ **Training data:** The dataset consists of **2,405** custom, *manually annotated* sentences related to Central Bank Digital Currencies (CBDCs), extracted from **BIS speeches**. The class distribution is **neutral**: *1,068* (44.41%), **positive**: *1,026* (42.66%), and **negative**: *311* (12.93%). The data is split **row-wise**, stratified by label, into **train**: *1,924*, **validation**: *240*, and **test**: *241* examples.
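
An illustrative sketch of the row-wise, label-stratified split described above (not part of this commit; the DataFrame `df`, its `label` column, and the split seed are assumptions):

```python
from sklearn.model_selection import train_test_split

# Hold out 481 of the 2,405 rows (240 validation + 241 test), stratified by label.
train_df, heldout_df = train_test_split(
    df, test_size=481, stratify=df["label"], random_state=42
)

# Split the held-out pool into validation (240) and test (241), again stratified.
val_df, test_df = train_test_split(
    heldout_df, test_size=241, stratify=heldout_df["label"], random_state=42
)

print(len(train_df), len(val_df), len(test_df))  # 1924 240 241
```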
 
+ **Intended usage:** Use this model to **classify sentence-level sentiment** in **CBDC** texts (reports, consultations, speeches, research notes, reputable news). It is **domain-specific** and *not intended* for generic or informal sentiment tasks.
 
  ---
 
+ ## Preprocessing & class imbalance
+ Sentences were **lowercased** (no stemming or lemmatization) and tokenized with the base tokenizer from [`bilalzafar/cb-bert-mlm`](https://huggingface.co/bilalzafar/cb-bert-mlm) using **max_length=320** with truncation and **dynamic padding** via `DataCollatorWithPadding`. To address class imbalance, training used **Focal Loss (γ=1.0)** with **class weights** computed from the *train* split (`class_weight="balanced"`) and applied in the loss, plus a **WeightedRandomSampler** with √(inverse-frequency) **per-sample weights**, as sketched below.
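
A minimal sketch of the tokenization and imbalance handling just described (illustrative, not from the commit; `train_df` and integer labels 0–2 are assumptions, and the focal-loss formulation is a common reconstruction rather than the author's exact code):

```python
import numpy as np
import torch
import torch.nn.functional as F
from torch.utils.data import WeightedRandomSampler
from sklearn.utils.class_weight import compute_class_weight
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("bilalzafar/cb-bert-mlm")

def tokenize(batch):
    # Truncate at 320 tokens; padding is deferred to the collator.
    return tokenizer(batch["text"], truncation=True, max_length=320)

collator = DataCollatorWithPadding(tokenizer)  # dynamic per-batch padding

# Balanced class weights, computed on the train split only.
labels = np.asarray(train_df["label"])  # assumed integer labels: 0/1/2
weights = compute_class_weight("balanced", classes=np.unique(labels), y=labels)
class_weights = torch.tensor(weights, dtype=torch.float)

def focal_loss(logits, targets, gamma=1.0):
    # Focal term (1 - p_t)^gamma from the plain CE; class weights enter the weighted CE.
    ce_plain = F.cross_entropy(logits, targets, reduction="none")
    p_t = torch.exp(-ce_plain)  # model probability of the true class
    ce_weighted = F.cross_entropy(
        logits, targets, weight=class_weights.to(logits.device), reduction="none"
    )
    return ((1.0 - p_t) ** gamma * ce_weighted).mean()

# sqrt(inverse-frequency) per-sample weights for the sampler.
counts = np.bincount(labels)
per_sample_w = np.sqrt(1.0 / counts[labels])
sampler = WeightedRandomSampler(
    weights=torch.as_tensor(per_sample_w, dtype=torch.double),
    num_samples=len(per_sample_w),
    replacement=True,
)
```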
 
  ---
 
+ ## Training procedure
+ Training used **[`bilalzafar/cb-bert-mlm`](https://huggingface.co/bilalzafar/cb-bert-mlm)** as the base model, with a 3-label **`AutoModelForSequenceClassification`** head. Optimization used **AdamW** (the HF Trainer default) with a **learning rate of 2e-5**, a **batch size of 16** (train/eval), and up to **8 epochs** with **early stopping (patience=2)**; the best epoch was ~**6**. A **warmup ratio of 0.06**, **weight decay of 0.01**, and **fp16** precision were applied. Runs were seeded (**42**) and executed on **Google Colab (T4)**.
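
These hyperparameters map onto the HF `Trainer` roughly as follows (a sketch under assumptions, not the original training script: dataset and collator names are placeholders, argument names follow recent `transformers` releases, and the custom focal loss/sampler from the previous section would require subclassing `Trainer`):

```python
from transformers import (
    AutoModelForSequenceClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "bilalzafar/cb-bert-mlm", num_labels=3
)

args = TrainingArguments(
    output_dir="cbdc-bert-sentiment",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=8,
    warmup_ratio=0.06,
    weight_decay=0.01,  # Trainer optimizes with AdamW by default
    fp16=True,
    seed=42,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,       # required for early stopping
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # placeholder: tokenized train split
    eval_dataset=val_dataset,     # placeholder: tokenized validation split
    data_collator=collator,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```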
 
  ---
 
  > Note: On the **entire annotated set** (in-domain evaluation, not a hold-out),
  > the same model reaches ~0.95 accuracy / weighted-F1.
+ > Treat those as upper bounds; the **test split** above is the recommended reference.
+
+ ---
+
+ ## Usage
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
+
+ model_name = "bilalzafar/cbdc-bert-sentiment"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
+
+ classifier = pipeline(
+     "text-classification",
+     model=model,
+     tokenizer=tokenizer,
+     truncation=True,
+     padding=True,
+     top_k=1,  # return only the top prediction
+ )
+
+ text = "CBDCs will revolutionize payment systems and improve financial inclusion."
+ print(classifier(text))
+ # Example output: [{'label': 'positive', 'score': 0.9789}]
+ ```