Update README.md

README.md +58 -0

@@ -29,3 +29,61 @@ This model enables structured analysis of CBDC-related policy and research texts
 This classifier is built on top of [`bilalzafar/CentralBank-BERT`](https://huggingface.co/bilalzafar/CentralBank-BERT), a **domain-adapted BERT model** pretrained on over **2 million sentences (\~66M tokens)** from **BIS central bank speeches (1996–2024)**.
 CentralBank-BERT captures the vocabulary and context of **monetary policy, financial regulation, and central banking discourse**, making it a well-suited foundation for downstream CBDC-related text classification.

+## Dataset
+The model was fine-tuned on a **manually annotated dataset of CBDC-related sentences** extracted from **Bank for International Settlements (BIS) central bank speeches (1996–2024)**.
+The dataset is balanced across three discourse classes (**Feature**, **Process**, and **Risk-Benefit**), with a total of **2,886 sentences (962 per class)**.
+## Intended Use
+This model is designed for the **automatic classification of CBDC discourse** in policy, research, and financial communications. It enables researchers, analysts, and practitioners to distinguish whether a sentence describes **procedural aspects**, **design features**, or **evaluative outcomes** of central bank digital currencies.
+Such categorization supports **policy analysis, thematic mapping of central bank communication, and structured NLP-based research** in the fields of **finance, monetary economics, and economic policy**.
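
For thematic mapping, predictions can be grouped by class once sentences have been scored. The sketch below is a minimal illustration, not the author's code: `group_by_label` and `keyword_stub` are hypothetical helpers, and the label strings are assumed to match the class names in the per-class table (the strings the model actually emits may differ). The classifier is passed in as a callable that returns one `{"label", "score"}` dict per sentence, which is the shape a Hugging Face `text-classification` pipeline returns for a list of strings.

```python
from collections import defaultdict

# Class names taken from the per-class table in this README. The exact label
# strings emitted by the model (e.g. "LABEL_0") are an assumption.
LABELS = ("Feature", "Process", "Risk-Benefit")


def group_by_label(sentences, classify, min_score=0.0):
    """Group sentences by predicted class, dropping low-confidence predictions.

    `classify` is any callable that maps a list of strings to a list of
    {"label": str, "score": float} dicts, one per input sentence.
    """
    grouped = defaultdict(list)
    for sentence, pred in zip(sentences, classify(sentences)):
        if pred["score"] >= min_score:
            grouped[pred["label"]].append(sentence)
    return dict(grouped)


def keyword_stub(sentences):
    """Hypothetical stand-in for the real model so the sketch runs offline."""
    results = []
    for text in sentences:
        lowered = text.lower()
        if "risk" in lowered or "benefit" in lowered:
            label = "Risk-Benefit"
        elif "pilot" in lowered or "phase" in lowered:
            label = "Process"
        else:
            label = "Feature"
        results.append({"label": label, "score": 1.0})
    return results


examples = [
    "The pilot will launch next year.",
    "Offline payments are supported.",
    "A CBDC could pose risks to bank funding.",
]
themes = group_by_label(examples, keyword_stub)
for label in LABELS:
    print(label, themes.get(label, []))
```

To use the real model, replace `keyword_stub` with a `transformers.pipeline("text-classification", model=...)` object pointing at this repository's model ID.
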
+## Training Details
+* Tokenization: WordPiece (CentralBank-BERT tokenizer)
+* Maximum sequence length: 256 tokens
+* Dynamic padding (`DataCollatorWithPadding`)
+* Train/Val/Test split: 80/10/10 stratified by label
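
The 80/10/10 split stratified by label can be sketched in plain Python as follows. This is an illustrative assumption about how such a split might be done, not the procedure actually used for this model: `stratified_split`, the seed, and the rounding rule are all hypothetical, so the resulting split sizes may differ slightly from the real ones.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=42):
    """Return (train, val, test) index lists with per-class proportions preserved."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_train = round(fractions[0] * len(idxs))
        n_val = round(fractions[1] * len(idxs))
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# 962 sentences per class, as stated in the Dataset section.
labels = [name for name in ("Feature", "Process", "Risk-Benefit") for _ in range(962)]
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))  # 2310 288 288
```
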
+| Parameter                     | Value                       |
+| ----------------------------- | --------------------------- |
+| Base model                    | [`bilalzafar/CentralBank-BERT`](https://huggingface.co/bilalzafar/CentralBank-BERT) |
+| Epochs                        | 6                           |
+| Train batch size (per device) | 8                           |
+| Eval batch size (per device)  | 16                          |
+| Gradient accumulation         | 2                           |
+| Effective batch size          | 16                          |
+| Learning rate                 | 2e-5                        |
+| Weight decay                  | 0.01                        |
+| Warmup ratio                  | 0.06                        |
+| Scheduler                     | Cosine                      |
+| Mixed precision (fp16)        | Enabled                     |
+* Environment: Google Colab
+* GPU: Tesla T4 (16 GB)
+* Framework: PyTorch 2.8.0 + Hugging Face Transformers
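
The hyperparameters in the table map onto keyword arguments of Hugging Face `TrainingArguments`. The sketch below collects them in a plain dictionary and checks the effective batch size; it is a reconstruction from the table, not the original training script, and `effective_batch_size` is a hypothetical helper. A single GPU is assumed, matching the one Tesla T4 listed above.

```python
# Keyword names follow Hugging Face `TrainingArguments`; values come from the table.
training_kwargs = {
    "num_train_epochs": 6,
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 16,
    "gradient_accumulation_steps": 2,
    "learning_rate": 2e-5,
    "weight_decay": 0.01,
    "warmup_ratio": 0.06,
    "lr_scheduler_type": "cosine",
    "fp16": True,
}


def effective_batch_size(kwargs, num_devices=1):
    """Examples consumed per optimizer step across all devices."""
    return (
        kwargs["per_device_train_batch_size"]
        * kwargs["gradient_accumulation_steps"]
        * num_devices
    )


print(effective_batch_size(training_kwargs))  # 16 on a single GPU
```

With Transformers installed, the dictionary could be passed as `TrainingArguments(output_dir="out", **training_kwargs)`, where `output_dir="out"` is a placeholder path.
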
+## Evaluation
+### Validation (10%)
+* Accuracy: **0.851**
+* Macro-F1: **0.839**
+* Weighted-F1: **0.852**
+### Test (10%)
+* Accuracy: **0.823**
+* Macro-F1: **0.803**
+* Weighted-F1: **0.825**
+#### Per-class performance (Test)
+| Class        | Precision | Recall | F1    |
+| ------------ | --------- | ------ | ----- |
+| Feature      | 0.759     | 0.782  | 0.770 |
+| Process      | 0.927     | 0.845  | 0.884 |
+| Risk-Benefit | 0.700     | 0.817  | 0.754 |
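
As a consistency check, the per-class F1 scores and the test Macro-F1 can be recomputed from the precision and recall values in the table. The snippet below uses only the reported figures; `f1_score` is a small helper defined here, and because the inputs are already rounded to three decimals, results are compared at that precision.

```python
# Precision and recall per class, copied from the test-set table.
per_class = {
    "Feature": (0.759, 0.782),
    "Process": (0.927, 0.845),
    "Risk-Benefit": (0.700, 0.817),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1_by_class = {name: f1_score(p, r) for name, (p, r) in per_class.items()}
macro_f1 = sum(f1_by_class.values()) / len(f1_by_class)

for name, value in f1_by_class.items():
    print(f"{name}: {value:.3f}")
print(f"Macro-F1: {macro_f1:.3f}")
```

The recomputed values round to 0.770, 0.884, and 0.754 per class and 0.803 for Macro-F1, matching the reported figures.
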