---
license: mit
base_model:
- bilalzafar/CentralBank-BERT
pipeline_tag: token-classification
tags:
- NER
- named-entity-recognition
- central-bank
- BIS
- speeches
- finance
- economics
- monetary policy
datasets:
- bilalzafar/BIS-Speeches-NER-dataset
language:
- en
metrics:
- f1
- accuracy
library_name: transformers
---
# Central Bank-BERT for Named Entity Recognition (NER)
This model fine-tunes the domain-adapted **[`CentralBank-BERT`](https://huggingface.co/bilalzafar/CentralBank-BERT)** encoder for **Named Entity Recognition (NER)** in central banking discourse. It automatically identifies and labels key entities in central bank speeches and related documents, focusing on three categories of interest:
* **AUTHOR / SPEAKER** – the individual delivering the speech or statement
* **POSITION** – the official title or role of the speaker (e.g., Governor, Deputy Governor, Board Member)
* **AFFILIATION** – the institution or organization associated with the speaker (e.g., Bank of Japan, European Central Bank, Bank of England)
The **COUNTRY** label was not explicitly modeled, since this information can be reliably **inferred from the affiliation of the central bank**.
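Because central bank affiliations map almost one-to-one onto jurisdictions, a simple post-processing lookup is enough to recover the country. Below is a minimal, hypothetical sketch of that step; the `AFFILIATION_TO_COUNTRY` table and `infer_country` helper are illustrative and not part of the model:

```python
# Hypothetical post-processing: COUNTRY is not a model label, so it is
# recovered from the predicted AFFILIATION span via a lookup table.
AFFILIATION_TO_COUNTRY = {
    "bank of japan": "Japan",
    "european central bank": "Euro area",
    "bank of england": "United Kingdom",
}

def infer_country(affiliation: str):
    """Return the country for a predicted AFFILIATION span, if known."""
    return AFFILIATION_TO_COUNTRY.get(affiliation.strip().lower())
```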
## Data
* **Source**: **BIS database of central bank speeches (1996–2024)**
* **Corpus Size**: 17,648 training speeches, with a further 1,961 held out for validation.
* **Input Field**: *Speech descriptions*, which typically contain a short speech title along with the name, position, and institutional affiliation of the speaker.
**Annotation Process**:
1. A subset of short speech descriptions was **manually annotated** with entity spans for Author, Position, and Affiliation.
2. This annotated subset was used to **train an initial NER model**.
3. The model was then applied to the larger dataset (1996–2024) to generate preliminary labels.
4. All generated labels were **manually reviewed and corrected**, ensuring complete and consistent annotation across the entire corpus of available speeches.
This approach combined **manual expertise** with **machine-assisted annotation**, making it feasible to build a large-scale, high-quality dataset covering nearly three decades of central bank communication.
## Data Preparation
1. **Normalization**: Lowercasing, removal of diacritics, and unification of punctuation.
2. **Alias resolution**: Institution abbreviations normalized (e.g., “BOJ” → “Bank of Japan”, “ECB” → “European Central Bank”).
3. **Entity alignment**: Fuzzy string matching used to locate annotated entities in raw text.
4. **BIO Encoding**:
* Tokenization with *BERT WordPiece tokenizer*.
* Conversion of annotations into **BIO tags** (`B-`, `I-`, `O`) at token level.
* Construction of a training file in **JSONL format** with `tokens` and `ner_tags`.
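As a concrete illustration of steps 3 and 4, the sketch below shows how word-level annotations could be converted into a BIO-tagged JSONL record. It is a minimal reconstruction under stated assumptions: the `bio_encode` helper and the hard-coded spans stand in for the actual fuzzy-matching output, and the label set mirrors the seven-tag scheme described under *Model Training*:

```python
import json

# Seven labels: O plus B-/I- tags for each of the three entity types.
LABELS = ["O", "B-AUTHOR", "I-AUTHOR", "B-POSITION", "I-POSITION",
          "B-AFFILIATION", "I-AFFILIATION"]

def bio_encode(words, spans):
    """Tag words with BIO labels; `spans` maps an entity type to the
    half-open (start, end) word indices located by fuzzy matching."""
    tags = ["O"] * len(words)
    for ent_type, (start, end) in spans.items():
        tags[start] = f"B-{ent_type}"
        for i in range(start + 1, end):
            tags[i] = f"I-{ent_type}"
    return tags

words = "speech by mr yi gang governor of the people's bank of china".split()
spans = {"AUTHOR": (3, 5), "POSITION": (5, 6), "AFFILIATION": (8, 12)}
print(json.dumps({"tokens": words, "ner_tags": bio_encode(words, spans)}))
# -> one JSONL training line with `tokens` and `ner_tags`
```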
## Model Training
* **Base model**: [`CentralBank-BERT`](https://huggingface.co/bilalzafar/CentralBank-BERT), a domain-adapted BERT trained on central banking corpora.
* **Task head**: Token classification layer with `num_labels = 7` (BIO scheme for Author, Position, Affiliation).
* **Token alignment**: Word-to-token mapping with subword label propagation (`-100` used for ignored positions).
* **Training setup**:
* Optimizer: AdamW with weight decay `0.01`
* Learning rate: `2e-5`
* Batch size: `16` (train & eval)
* Epochs: `3`
* Mixed precision (`fp16`) when available
* Evaluation with `seqeval` metrics (precision, recall, F1)
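The fragment below condenses this setup into a runnable sketch using the Hugging Face `Trainer`. It is an approximation rather than the exact training script: `train_ds` and `val_ds` are placeholders for the tokenized JSONL splits, and the label-alignment helper follows the subword-propagation scheme described above:

```python
import numpy as np
import evaluate
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

LABELS = ["O", "B-AUTHOR", "I-AUTHOR", "B-POSITION", "I-POSITION",
          "B-AFFILIATION", "I-AFFILIATION"]
LABEL2ID = {label: i for i, label in enumerate(LABELS)}

tokenizer = AutoTokenizer.from_pretrained("bilalzafar/CentralBank-BERT")
model = AutoModelForTokenClassification.from_pretrained(
    "bilalzafar/CentralBank-BERT", num_labels=len(LABELS))

def align_labels(example):
    """Tokenize pre-split words; propagate each word's BIO tag to all of
    its subwords and mark special tokens with -100 (ignored by the loss)."""
    enc = tokenizer(example["tokens"], is_split_into_words=True, truncation=True)
    enc["labels"] = [-100 if wid is None else LABEL2ID[example["ner_tags"][wid]]
                     for wid in enc.word_ids()]
    return enc

seqeval = evaluate.load("seqeval")

def compute_metrics(p):
    """Entity-level precision/recall/F1 via seqeval, skipping -100 positions."""
    preds = np.argmax(p.predictions, axis=2)
    y_pred = [[LABELS[pr] for pr, lb in zip(ps, ls) if lb != -100]
              for ps, ls in zip(preds, p.label_ids)]
    y_true = [[LABELS[lb] for pr, lb in zip(ps, ls) if lb != -100]
              for ps, ls in zip(preds, p.label_ids)]
    res = seqeval.compute(predictions=y_pred, references=y_true)
    return {k: v for k, v in res.items() if isinstance(v, float)}

args = TrainingArguments(
    output_dir="centralbank-ner",
    learning_rate=2e-5, weight_decay=0.01,      # AdamW is the Trainer default
    per_device_train_batch_size=16, per_device_eval_batch_size=16,
    num_train_epochs=3, fp16=True,              # mixed precision if supported
)

# `train_ds` / `val_ds`: placeholder datasets already mapped with align_labels.
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=val_ds,
                  data_collator=DataCollatorForTokenClassification(tokenizer),
                  compute_metrics=compute_metrics)
trainer.train()
trainer.evaluate()
```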
## Results
The model was trained on **17,648 annotated speeches** with a **1,961-speech validation set**. Evaluation metrics are reported using **entity-level precision, recall, and F1-score** from the `seqeval` library.
**Final Validation Performance (Epoch 3):**
| Entity Type | Precision | Recall | F1-score | Support |
| --------------- | ---------- | ---------- | ---------- | ------- |
| **Affiliation** | 0.9850 | 0.9862 | 0.9856 | 1,734 |
| **Author** | 0.9816 | 0.9912 | 0.9864 | 1,936 |
| **Position** | 0.9735 | 0.9846 | 0.9790 | 1,942 |
| **Overall** | **0.9798** | **0.9862** | **0.9830** | — |
* **Token-level accuracy:** 0.9956
* **Overall F1 (macro):** 0.983
The results show **high precision and recall across all three categories**, confirming that the model provides reliable structured metadata extraction from central bank communications.
---
## Other CBDC Models
This model is part of the **CentralBank-BERT / CBDC model family**, a suite of domain-adapted classifiers for analyzing central-bank communication.
| **Model** | **Purpose** | **Intended Use** | **Link** |
| ------------------------------- | ------------------------------------------------------------------- | ------------------------------------------------------------------- | ---------------------------------------------------------------------- |
| **bilalzafar/CentralBank-BERT** | Domain-adaptive masked LM trained on BIS speeches (1996–2024). | Base encoder for CBDC downstream tasks; fill-mask tasks. | [CentralBank-BERT](https://huggingface.co/bilalzafar/CentralBank-BERT) |
| **bilalzafar/CBDC-BERT** | Binary classifier: CBDC vs. Non-CBDC. | Flagging CBDC-related discourse in large corpora. | [CBDC-BERT](https://huggingface.co/bilalzafar/CBDC-BERT) |
| **bilalzafar/CBDC-Stance** | 3-class stance model (Pro, Wait-and-See, Anti). | Research on policy stances and discourse monitoring. | [CBDC-Stance](https://huggingface.co/bilalzafar/CBDC-Stance) |
| **bilalzafar/CBDC-Sentiment** | 3-class sentiment model (Positive, Neutral, Negative). | Tone analysis in central bank communications. | [CBDC-Sentiment](https://huggingface.co/bilalzafar/CBDC-Sentiment) |
| **bilalzafar/CBDC-Type** | Classifies Retail, Wholesale, General CBDC mentions. | Distinguishing policy focus (retail vs wholesale). | [CBDC-Type](https://huggingface.co/bilalzafar/CBDC-Type) |
| **bilalzafar/CBDC-Discourse** | 3-class discourse classifier (Feature, Process, Risk-Benefit). | Structured categorization of CBDC communications. | [CBDC-Discourse](https://huggingface.co/bilalzafar/CBDC-Discourse) |
| **bilalzafar/CentralBank-NER**   | Named Entity Recognition (NER) model for central banking discourse. | Extracting speaker names, positions, and affiliations from speeches. | [CentralBank-NER](https://huggingface.co/bilalzafar/CentralBank-NER)   |
## Repository and Replication Package
All **training pipelines, preprocessing scripts, evaluation notebooks, and result outputs** are available in the companion GitHub repository:
🔗 **[https://github.com/bilalezafar/CentralBank-BERT](https://github.com/bilalezafar/CentralBank-BERT)**
---
## Usage
```python
from transformers import pipeline

# HF model repo
model = "bilalzafar/CentralBank-NER"

ner = pipeline(
    task="token-classification",
    model=model,
    tokenizer=model,
    aggregation_strategy="simple",  # merges subword pieces into entity spans
)

# Example text
text = "Speech by Mr Yi Gang, Governor of the People's Bank of China, at the IMF Annual Meeting."

for ent in ner(text):
    print(f"{ent['entity_group']:12} {ent['word']:<25} score={ent['score']:.3f}")

# Example output:
# AUTHOR       yi gang                   score=0.997
# POSITION     governor                  score=0.999
# AFFILIATION  people ' s bank of china  score=0.999
```
---
## Citation
If you use this model, please cite as:
**Zafar, M. B. (2026). CentralBank-BERT: Machine learning evidence on central bank digital currency discourse. *Journal of Economics and Business.* [https://doi.org/10.1016/j.jeconbus.2026.106300](https://doi.org/10.1016/j.jeconbus.2026.106300)**
```bibtex
@article{zafar2025centralbankbert,
title={CentralBank-BERT: Machine learning evidence on central bank digital currency discourse},
author={Zafar, Muhammad Bilal},
year={2026},
journal={Journal of Economics and Business},
url={https://doi.org/10.1016/j.jeconbus.2026.106300}
}
```