---
license: mit
---
# **CBDC-BERT-Sentiment: A Domain-Specific BERT for CBDC-Related Sentiment Analysis**

**CBDC-BERT-Sentiment** is a **3-class** (*negative / neutral / positive*) sentence-level classifier built for **Central Bank Digital Currency (CBDC)** communications. It is trained to identify overall sentiment in central-bank-style text such as consultations, speeches, reports, and reputable news.

**Base Model:** [`bilalzafar/cb-bert-mlm`](https://huggingface.co/bilalzafar/cb-bert-mlm) — **CB-BERT** is a domain-adapted **BERT base (uncased)**, pretrained on **66M+ tokens** across **2M+ sentences** from central-bank speeches published via the **Bank for International Settlements (1996–2024)**. It is optimized for *masked-token prediction* in the specialized domains of **monetary policy, financial regulation, and macroeconomic communication**, enabling better contextual understanding of central-bank discourse and financial narratives.

**Training data:** The dataset consists of **2,405** custom, *manually annotated* sentences related to Central Bank Digital Currencies (CBDCs), extracted from **BIS speeches**. The class distribution is **neutral**: *1,068* (44.41%), **positive**: *1,026* (42.66%), and **negative**: *311* (12.93%). The data is split **row-wise**, stratified by label, into **train**: *1,924*, **validation**: *240*, and **test**: *241* examples.

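The row-wise, label-stratified split can be reproduced with scikit-learn's `train_test_split`. This is an illustrative sketch only: the `text`/`label` column names, the placeholder sentences, and the random seed are assumptions, not taken from the original training code.

```python
# Illustrative sketch of the 1,924 / 240 / 241 stratified split;
# column names, placeholder rows, and seed are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "text": [f"sentence {i}" for i in range(2405)],
    "label": (["neutral"] * 1068) + (["positive"] * 1026) + (["negative"] * 311),
})

# Carve off the train split first, then halve the remainder into val/test.
train_df, rest_df = train_test_split(
    df, test_size=481, stratify=df["label"], random_state=42
)
val_df, test_df = train_test_split(
    rest_df, test_size=241, stratify=rest_df["label"], random_state=42
)

print(len(train_df), len(val_df), len(test_df))  # 1924 240 241
```

Stratifying both calls keeps the 44/43/13 class ratio roughly intact in every split.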
**Intended usage:** Use this model to **classify sentence-level sentiment** in **CBDC** texts (reports, consultations, speeches, research notes, reputable news). It is **domain-specific** and *not intended* for generic or informal sentiment tasks.

---

## Preprocessing & class imbalance

Sentences were **lowercased** (no stemming or lemmatization) and tokenized with the base tokenizer from [`bilalzafar/cb-bert-mlm`](https://huggingface.co/bilalzafar/cb-bert-mlm) using **max\_length=320** with truncation and **dynamic padding** via `DataCollatorWithPadding`. To address class imbalance, training used **Focal Loss (γ=1.0)** with **class weights** computed from the *train* split (`class_weight="balanced"`) and applied in the loss, plus a **WeightedRandomSampler** with √(inverse-frequency) **per-sample weights**.

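The imbalance handling can be sketched in plain PyTorch. The `FocalLoss` class and the wiring below are illustrative assumptions (the original training script is not shown here), but the γ value, the "balanced" class-weight formula, and the √(inverse-frequency) sampler follow the description above.

```python
# Sketch of the imbalance handling; class names and wiring are assumptions.
import torch
import torch.nn.functional as F
from torch.utils.data import WeightedRandomSampler

class FocalLoss(torch.nn.Module):
    """Focal loss with per-class weights; gamma=1.0 as used in training."""
    def __init__(self, class_weights, gamma=1.0):
        super().__init__()
        self.class_weights = class_weights
        self.gamma = gamma

    def forward(self, logits, targets):
        # Class-weighted cross entropy, kept per-sample.
        ce = F.cross_entropy(logits, targets, weight=self.class_weights, reduction="none")
        # Probability of the true class, from the unweighted loss.
        pt = torch.exp(-F.cross_entropy(logits, targets, reduction="none"))
        return ((1 - pt) ** self.gamma * ce).mean()

# Train-split class counts (neutral, positive, negative) -> "balanced" weights:
# n_samples / (n_classes * count), as in sklearn's class_weight="balanced".
counts = torch.tensor([1068.0, 1026.0, 311.0])
class_weights = counts.sum() / (len(counts) * counts)

# Per-sample weights: sqrt of inverse class frequency.
labels = torch.tensor([0] * 1068 + [1] * 1026 + [2] * 311)
sample_weights = torch.sqrt(1.0 / counts[labels])
sampler = WeightedRandomSampler(sample_weights.tolist(),
                                num_samples=len(labels), replacement=True)
```

The sampler oversamples the rare negative class at the batch level, while the focal loss down-weights easy examples at the loss level.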
---

## Training procedure

Training used **[`bilalzafar/cb-bert-mlm`](https://huggingface.co/bilalzafar/cb-bert-mlm)** as the base model, with a 3-label **`AutoModelForSequenceClassification`** head. Optimization used **AdamW** (via the HF `Trainer`) with a **learning rate of 2e-5**, **batch size 16** (train/eval), and up to **8 epochs** with **early stopping (patience=2)**; the best epoch was \~**6**. A **warmup ratio of 0.06**, **weight decay of 0.01**, and **fp16** precision were applied. Runs were seeded (**42**) and executed on **Google Colab (T4)**.

---

> **Note:** On the **entire annotated set** (in-domain evaluation, not a hold-out), the same model reaches \~0.95 accuracy / weighted-F1. Treat those as upper bounds; the **test split** results above (weighted-F1: **0.8216**) are the recommended reference.

---

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

model_name = "bilalzafar/cbdc-bert-sentiment"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    truncation=True,
    padding=True,
    top_k=1,  # return only the top prediction
)

text = "CBDCs will revolutionize payment systems and improve financial inclusion."
print(classifier(text))
# Example output: [{'label': 'positive', 'score': 0.9789}]
```