---
license: mit
language:
- en
metrics:
- f1
- accuracy
base_model:
- bilalzafar/CentralBank-BERT
pipeline_tag: text-classification
tags:
- CBDC
- Central Bank Digital Currencies
- Central Bank Digital Currency
- Classification
- Wholesale CBDC
- Retail CBDC
- Central Bank
- Tone
- Finance
- NLP
- Finance NLP
- BERT
- Transformers
- Digital Currency
---
|
# CBDC-Type-BERT: Classifying Retail vs Wholesale vs General CBDC Sentences

**A domain-specialized BERT classifier that labels central-bank text about CBDCs into three categories:**

* **Retail CBDC** – statements about a **general-purpose** CBDC for the public (households, merchants, wallets, offline use, legal tender for everyday payments, holding limits, tiered remuneration, "digital euro/pound/rupee" for citizens, etc.).
* **Wholesale CBDC** – statements about a **financial-institution** CBDC (RTGS/settlement, DLT platforms, PvP/DvP, tokenised assets/markets, interbank use, central-bank reserves on ledger, etc.).
* **General/Unspecified** – CBDC mentions that **don't clearly indicate retail or wholesale** scope, or that discuss CBDCs at a conceptual/policy level without specifying the type.

**Training data:** 1,417 manually annotated CBDC sentences from BIS central-bank speeches — **Retail CBDC** (543), **Wholesale CBDC** (329), and **General/Unspecified** (545) — split **80/10/10** (train/validation/test) with stratification.
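
A stratified 80/10/10 split can be reproduced with two chained `train_test_split` calls; a minimal scikit-learn sketch (the synthetic sentences and `random_state` are placeholders, not the original pipeline):

```python
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the 1,417 annotated sentences (labels only, for illustration).
labels = ["Retail"] * 543 + ["Wholesale"] * 329 + ["General"] * 545
sentences = [f"sentence {i}" for i in range(len(labels))]

# First carve out 20% for validation+test, stratified by label...
train_x, rest_x, train_y, rest_y = train_test_split(
    sentences, labels, test_size=0.2, stratify=labels, random_state=42
)
# ...then split that 20% evenly into validation and test (10% each overall).
val_x, test_x, val_y, test_y = train_test_split(
    rest_x, rest_y, test_size=0.5, stratify=rest_y, random_state=42
)
print(len(train_x), len(val_x), len(test_x))  # → 1133 142 142
```

Stratifying both splits keeps the Retail/Wholesale/General proportions intact in every partition.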

**Base model:** [`bilalzafar/CentralBank-BERT`](https://huggingface.co/bilalzafar/CentralBank-BERT) – **CentralBank-BERT** is a domain-adapted BERT trained on ~2M sentences (66M tokens) of **central-bank speeches** (BIS, 1996–2024). It captures monetary-policy and payments vocabulary far better than generic BERT, which materially helps downstream CBDC classification.

---

## Preprocessing, Class Weights & Training

* Text cleaning: minimal.
* Tokenizer: CentralBank-BERT WordPiece (max length **192**).
* **Class imbalance:** with fewer **Wholesale** examples, **inverse-frequency class weights** were used in `CrossEntropyLoss` to balance learning: General ≈ 0.866, Retail ≈ 0.870, Wholesale ≈ 1.436 (computed from the train split).
* Optimizer: AdamW; **lr=2e-5**, **weight_decay=0.01**, **warmup_ratio=0.1**.
* Batch sizes: train **8**, eval **16**; epochs: **5**; **fp16** mixed precision.
* Early stopping on validation **macro-F1** (patience=2); the best checkpoint is loaded at the end.
* Hardware: single GPU (Colab).
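
The inverse-frequency weights above follow the standard `n_total / (n_classes * n_c)` formula; a minimal PyTorch sketch (the counts shown are the card's full-dataset figures, whose ratios the stratified train split preserves):

```python
import torch
from torch import nn

# Label counts from the card: General (545), Retail (543), Wholesale (329).
counts = torch.tensor([545.0, 543.0, 329.0])

# Inverse-frequency weights: n_total / (n_classes * count_c)
weights = counts.sum() / (len(counts) * counts)
# weights ≈ [0.866, 0.870, 1.436] for General, Retail, Wholesale

# Weighted loss so the rarer Wholesale class contributes proportionally more.
loss_fn = nn.CrossEntropyLoss(weight=weights)
logits = torch.randn(4, 3)           # batch of 4 sentences, 3 classes
labels = torch.tensor([0, 1, 2, 2])  # gold class indices
loss = loss_fn(logits, labels)
```

With the Hugging Face `Trainer`, one common pattern is to subclass it and apply this weighted loss inside an overridden `compute_loss`.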

---

## Performance & Evaluation

**Held-out test (10%):**

* **Accuracy:** **0.887**
* **F1 (macro):** **0.898**
* **F1 (weighted):** **0.887**

Class-wise F1 (test):

* **Retail:** ~0.86
* **Wholesale:** ~0.97
* **General:** ~0.86
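
These figures (and the validation macro-F1 used for early stopping) follow the standard scikit-learn definitions; a small sketch with hypothetical gold/predicted labels illustrating the averaging modes:

```python
from sklearn.metrics import accuracy_score, classification_report, f1_score

# Hypothetical gold and predicted labels for a handful of test sentences.
y_true = ["Retail", "Wholesale", "General", "Retail", "Wholesale"]
y_pred = ["Retail", "Wholesale", "General", "General", "Wholesale"]

accuracy = accuracy_score(y_true, y_pred)                   # share of exact matches
macro_f1 = f1_score(y_true, y_pred, average="macro")        # unweighted mean of per-class F1
weighted_f1 = f1_score(y_true, y_pred, average="weighted")  # support-weighted mean

# Per-class precision/recall/F1, as in the class-wise list above.
print(classification_report(y_true, y_pred, digits=3))
```

Macro-F1 treats the three classes equally, which is why it was chosen for model selection despite the smaller Wholesale class.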

---

## Usage

```python
from transformers import pipeline

model_id = "your-username/cbdc-type-bert"  # replace with your repo
clf = pipeline("text-classification", model=model_id, tokenizer=model_id,
               truncation=True, max_length=192)

texts = [
    "The digital euro will be available to citizens and merchants for daily payments.",         # Retail
    "DLT-based interbank settlement with a central bank liability will lower PvP risk.",        # Wholesale
    "Several central banks are assessing CBDCs to modernise payments and policy transmission."  # General
]

for t in texts:
    out = clf(t)[0]
    print(f"{out['label']:>20} {out['score']:.3f} | {t}")
```