bilalzafar committed on
Commit 90f84ae · verified · 1 Parent(s): 223ca7a

Update README.md

Files changed (1):
  1. README.md +3 -25

README.md CHANGED
@@ -23,50 +23,28 @@ tags:
   - BERT
   - Transformers
   - Digital Currency
 ---
 
 # CBDC-Type-BERT: Classifying Retail vs Wholesale vs General CBDC Sentences
 
 **A domain-specialized BERT classifier that labels central-bank text about CBDCs into three categories:**
-
 * **Retail CBDC** – statements about a **general-purpose** CBDC for the public (households, merchants, wallets, offline use, legal-tender for everyday payments, holding limits, tiered remuneration, “digital euro/pound/rupee” for citizens, etc.).
 * **Wholesale CBDC** – statements about a **financial-institution** CBDC (RTGS/settlement, DLT platforms, PvP/DvP, tokenised assets/markets, interbank use, central-bank reserves on ledger, etc.).
 * **General/Unspecified** – CBDC mentions that **don’t clearly indicate retail or wholesale** scope, or discuss CBDCs at a conceptual/policy level without specifying the type.
 
-
 **Training data:** 1,417 manually annotated CBDC sentences from BIS central-bank speeches — **Retail CBDC** (543), **Wholesale CBDC** (329), and **General/Unspecified** (545) — split **80/10/10** (train/validation/test) with stratification.
-
 **Base model:** [`bilalzafar/CentralBank-BERT`](https://huggingface.co/bilalzafar/CentralBank-BERT) - **CentralBank-BERT** is a domain-adapted BERT trained on \~2M sentences (66M tokens) of **central bank speeches** (BIS, 1996–2024). It captures monetary-policy and payments vocabulary far better than generic BERT, which materially helps downstream CBDC classification.
 
 ---
 
 ## Preprocessing, Class Weights & Training
-
- * Text cleaning: minimal.
- * Tokenizer: CentralBank-BERT WordPiece (max length **192**).
- * **Class imbalance:** fewer **Wholesale** examples, hence used **inverse-frequency class weights** in `CrossEntropyLoss` to balance learning:
- * General ≈ 0.866, Retail ≈ 0.870, Wholesale ≈ 1.436 (computed from train split).
-
- * Optimizer: AdamW; **lr=2e-5**, **weight\_decay=0.01**, **warmup\_ratio=0.1**
- * Batch sizes: train **8**, eval **16**; epochs: **5**; **fp16**
- * Early stopping on validation **macro-F1** (patience=2), best model loaded at end.
- * Hardware: single GPU (Colab).
 
 ---
 
 ## Performance & Evaluation
-
- **Held-out test (10%)**
-
- * **Accuracy:** **0.887**
- * **F1 (macro):** **0.898**
- * **F1 (weighted):** **0.887**
-
- Class-wise F1 (test):
-
- * **Retail:** \~0.86
- * **Wholesale:** \~0.97
- * **General:** \~0.86
 
 ---
 
 
- BERT
- Transformers
- Digital Currency
library_name: transformers
---
 
# CBDC-Type-BERT: Classifying Retail vs Wholesale vs General CBDC Sentences

**A domain-specialized BERT classifier that labels central-bank text about CBDCs into three categories:**

* **Retail CBDC** – statements about a **general-purpose** CBDC for the public (households, merchants, wallets, offline use, legal tender for everyday payments, holding limits, tiered remuneration, “digital euro/pound/rupee” for citizens, etc.).
* **Wholesale CBDC** – statements about a **financial-institution** CBDC (RTGS/settlement, DLT platforms, PvP/DvP, tokenised assets/markets, interbank use, central-bank reserves on ledger, etc.).
* **General/Unspecified** – CBDC mentions that **don’t clearly indicate retail or wholesale** scope, or that discuss CBDCs at a conceptual/policy level without specifying the type.

  **Training data:** 1,417 manually annotated CBDC sentences from BIS central-bank speeches — **Retail CBDC** (543), **Wholesale CBDC** (329), and **General/Unspecified** (545) — split **80/10/10** (train/validation/test) with stratification.
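The 80/10/10 stratified split can be sketched as below. The rounding rule (floor train and validation, remainder to test) is an assumption, so the actual per-class splits may differ by a sentence or two:

```python
# Sketch of a stratified 80/10/10 split over the README's label counts.
# Rounding (floor train/val, remainder to test) is an assumption.
label_counts = {"Retail": 543, "Wholesale": 329, "General": 545}

def stratified_sizes(n, fractions=(0.8, 0.1, 0.1)):
    train = int(n * fractions[0])   # 80% for training
    val = int(n * fractions[1])     # 10% for validation
    test = n - train - val          # remainder (~10%) for test
    return train, val, test

splits = {label: stratified_sizes(n) for label, n in label_counts.items()}
print(splits)
```

In practice a library helper such as scikit-learn's `train_test_split(..., stratify=labels)` applied twice (once for test, once for validation) gives the same per-class proportions.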
 
**Base model:** [`bilalzafar/CentralBank-BERT`](https://huggingface.co/bilalzafar/CentralBank-BERT) is a domain-adapted BERT trained on ~2M sentences (66M tokens) of **central-bank speeches** (BIS, 1996–2024). It captures monetary-policy and payments vocabulary far better than generic BERT, which materially helps downstream CBDC classification.

---

## Preprocessing, Class Weights & Training

Text received light **manual cleaning** (trimming whitespace, normalizing quotes/dashes, de-duplication, dropping nulls) and was tokenized with the [`bilalzafar/CentralBank-BERT`](https://huggingface.co/bilalzafar/CentralBank-BERT) WordPiece tokenizer (max length **192**). Because **Wholesale** had fewer examples, **inverse-frequency class weights** were applied in `CrossEntropyLoss` to balance learning (train-split weights ≈ General **0.866**, Retail **0.870**, Wholesale **1.436**). The model was fine-tuned with AdamW (lr **2e-5**, weight decay **0.01**, warmup ratio **0.1**) and batch sizes of **8/16** (train/eval) for **5 epochs** with **fp16** mixed precision. Early stopping monitored validation **macro-F1** (patience = 2), and the best checkpoint was restored at the end. Training ran on a single Colab GPU.
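The quoted class weights are consistent with the common inverse-frequency formula `w_c = N / (K * n_c)` over the train split. The per-class train counts below are assumptions inferred from the stratified 80% split, not published figures:

```python
# Sketch: inverse-frequency class weights, w_c = N / (K * n_c).
# Train-split counts are assumptions (~80% of General 545, Retail 543,
# Wholesale 329); they reproduce the weights quoted above.
train_counts = {"General": 436, "Retail": 434, "Wholesale": 263}

n_total = sum(train_counts.values())   # ~1,133 training sentences
n_classes = len(train_counts)

weights = {c: n_total / (n_classes * n) for c, n in train_counts.items()}
print({c: round(w, 3) for c, w in weights.items()})

# The resulting vector would then be handed to the loss, e.g.
# torch.nn.CrossEntropyLoss(weight=torch.tensor(list(weights.values())))
```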
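The early-stopping rule described above (patience = 2 on validation macro-F1, best checkpoint kept) amounts to the following logic; the score sequence is illustrative, not a real training log:

```python
# Sketch of early stopping on validation macro-F1 with patience = 2,
# keeping the best epoch. Scores below are illustrative only.
def best_epoch(val_macro_f1, patience=2):
    best_score, best_idx, stale = float("-inf"), -1, 0
    for i, score in enumerate(val_macro_f1):
        if score > best_score:
            best_score, best_idx, stale = score, i, 0
        else:
            stale += 1
            if stale >= patience:   # no improvement for `patience` evals
                break               # stop training early
    return best_idx, best_score     # best checkpoint "loaded at end"

idx, score = best_epoch([0.84, 0.88, 0.90, 0.89, 0.89])
print(idx, score)  # stops after two stale evals, keeps epoch 2
```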

---

## Performance & Evaluation

On a 10% held-out test set, the model achieved **88.7% accuracy**, **0.898 macro-F1**, and **0.887 weighted-F1**. Class-wise F1 was **Retail ≈ 0.86**, **Wholesale ≈ 0.97**, and **General ≈ 0.86**, indicating particularly strong performance on Wholesale and balanced, reliable performance on Retail and General.
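As a sanity check, the macro and weighted averages follow from the class-wise scores. The test-split counts below are assumptions from the stratified 10% split, and the class F1s are the rounded values above, so the reconstruction lands within rounding error of the reported numbers:

```python
# Sketch: macro-F1 vs weighted-F1 from class-wise scores.
# Test-split counts are assumed (~10% stratified); class F1s are the
# rounded values reported above.
class_f1 = {"Retail": 0.86, "Wholesale": 0.97, "General": 0.86}
test_counts = {"Retail": 55, "Wholesale": 34, "General": 55}  # assumption

macro_f1 = sum(class_f1.values()) / len(class_f1)  # unweighted mean
total = sum(test_counts.values())
weighted_f1 = sum(class_f1[c] * test_counts[c] for c in class_f1) / total

# Both land within rounding error of the reported 0.898 / 0.887.
print(round(macro_f1, 3), round(weighted_f1, 3))
```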

---