---
license: mit
---
# **CBDC-BERT-Sentiment: A Domain-Specific BERT for CBDC-Related Sentiment Analysis**

**CBDC-BERT-Sentiment** is a **3-class** (*negative / neutral / positive*) sentence-level classifier built for **Central Bank Digital Currency (CBDC)** communications. It is trained to identify overall sentiment in central-bank-style text such as consultations, speeches, reports, and reputable news.

**Base Model:** [`bilalzafar/cb-bert-mlm`](https://huggingface.co/bilalzafar/cb-bert-mlm) — **CB-BERT** is a domain-adapted **BERT base (uncased)**, pretrained on **66M+ tokens** across **2M+ sentences** from central-bank speeches published via the **Bank for International Settlements (1996–2024)**. It is optimized for *masked-token prediction* in the specialized domains of **monetary policy, financial regulation, and macroeconomic communication**, enabling better contextual understanding of central-bank discourse and financial narratives.

**Training data:** The dataset consists of **2,405** custom, *manually annotated* sentences related to Central Bank Digital Currencies (CBDCs), extracted from **BIS speeches**. The class distribution is **neutral**: *1,068* (44.41%), **positive**: *1,026* (42.66%), and **negative**: *311* (12.93%). The data is split **row-wise**, stratified by label, into **train**: *1,924*, **validation**: *240*, and **test**: *241* examples.

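The row-wise, label-stratified split can be reproduced with scikit-learn's `train_test_split`. This is an illustrative sketch only: the `text`/`label` column names, the placeholder sentences, and the random seed are assumptions, not taken from the original training code.

```python
# Illustrative sketch of the 1,924 / 240 / 241 stratified split;
# column names, placeholder rows, and seed are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "text": [f"sentence {i}" for i in range(2405)],
    "label": (["neutral"] * 1068) + (["positive"] * 1026) + (["negative"] * 311),
})

# Carve off the train split first, then halve the remainder into val/test.
train_df, rest_df = train_test_split(
    df, test_size=481, stratify=df["label"], random_state=42
)
val_df, test_df = train_test_split(
    rest_df, test_size=241, stratify=rest_df["label"], random_state=42
)

print(len(train_df), len(val_df), len(test_df))  # 1924 240 241
```

Stratifying both calls keeps the 44/43/13 class ratio roughly intact in every split.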
**Intended usage:** Use this model to **classify sentence-level sentiment** in **CBDC** texts (reports, consultations, speeches, research notes, reputable news). It is **domain-specific** and *not intended* for generic or informal sentiment tasks.

---

## Preprocessing & class imbalance

Sentences were **lowercased** (no stemming or lemmatization) and tokenized with the base tokenizer from [`bilalzafar/cb-bert-mlm`](https://huggingface.co/bilalzafar/cb-bert-mlm) using **max\_length=320** with truncation and **dynamic padding** via `DataCollatorWithPadding`. To address class imbalance, training used **Focal Loss (γ=1.0)** with **class weights** computed from the *train* split (`class_weight="balanced"`) and applied in the loss, plus a **WeightedRandomSampler** with √(inverse-frequency) **per-sample weights**.

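The imbalance handling can be sketched in plain PyTorch. The `FocalLoss` class and the wiring below are illustrative assumptions (the original training script is not shown here), but the γ value, the "balanced" class-weight formula, and the √(inverse-frequency) sampler follow the description above.

```python
# Sketch of the imbalance handling; class names and wiring are assumptions.
import torch
import torch.nn.functional as F
from torch.utils.data import WeightedRandomSampler

class FocalLoss(torch.nn.Module):
    """Focal loss with per-class weights; gamma=1.0 as used in training."""
    def __init__(self, class_weights, gamma=1.0):
        super().__init__()
        self.class_weights = class_weights
        self.gamma = gamma

    def forward(self, logits, targets):
        # Class-weighted cross entropy, kept per-sample.
        ce = F.cross_entropy(logits, targets, weight=self.class_weights, reduction="none")
        # Probability of the true class, from the unweighted loss.
        pt = torch.exp(-F.cross_entropy(logits, targets, reduction="none"))
        return ((1 - pt) ** self.gamma * ce).mean()

# Train-split class counts (neutral, positive, negative) -> "balanced" weights:
# n_samples / (n_classes * count), as in sklearn's class_weight="balanced".
counts = torch.tensor([1068.0, 1026.0, 311.0])
class_weights = counts.sum() / (len(counts) * counts)

# Per-sample weights: sqrt of inverse class frequency.
labels = torch.tensor([0] * 1068 + [1] * 1026 + [2] * 311)
sample_weights = torch.sqrt(1.0 / counts[labels])
sampler = WeightedRandomSampler(sample_weights.tolist(),
                                num_samples=len(labels), replacement=True)
```

The sampler oversamples the rare negative class at the batch level, while the focal loss down-weights easy examples at the loss level.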
---

## Training procedure

Training used **[`bilalzafar/cb-bert-mlm`](https://huggingface.co/bilalzafar/cb-bert-mlm)** as the base model, with a 3-label **`AutoModelForSequenceClassification`** head. Optimization used **AdamW** (via the HF `Trainer`) with a **learning rate of 2e-5**, **batch size 16** (train/eval), and up to **8 epochs** with **early stopping (patience=2)**; the best epoch was \~**6**. A **warmup ratio of 0.06**, **weight decay of 0.01**, and **fp16** precision were applied. Runs were seeded (**42**) and executed on **Google Colab (T4)**.

---

> **Note:** On the **entire annotated set** (in-domain evaluation, not a hold-out), the same model reaches \~0.95 accuracy / weighted-F1. Treat those as upper bounds; the **test split** results above (weighted-F1: **0.8216**) are the recommended reference.

---

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

model_name = "bilalzafar/cbdc-bert-sentiment"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    truncation=True,
    padding=True,
    top_k=1,  # return only the top prediction
)

text = "CBDCs will revolutionize payment systems and improve financial inclusion."
print(classifier(text))
# Example output: [{'label': 'positive', 'score': 0.9789}]
```