---
license: mit
base_model:
- bilalzafar/CentralBank-BERT
pipeline_tag: token-classification
tags:
- NER
- named-entity-recognition
- central-bank
- BIS
- speeches
- finance
- economics
- monetary policy
datasets:
- bilalzafar/BIS-Speeches-NER-dataset
language:
- en
metrics:
- f1
- accuracy
library_name: transformers
---
# Central Bank-BERT for Named Entity Recognition (NER)
This model fine-tunes the domain-adapted **[`CentralBank-BERT`](https://huggingface.co/bilalzafar/CentralBank-BERT)** encoder for **Named Entity Recognition (NER)** in central banking discourse. It automatically identifies and labels key entities in central bank speeches and related documents, focusing on three categories of interest:
* **AUTHOR / SPEAKER** – the individual delivering the speech or statement
* **POSITION** – the official title or role of the speaker (e.g., Governor, Deputy Governor, Board Member)
* **AFFILIATION** – the institution or organization associated with the speaker (e.g., Bank of Japan, European Central Bank, Bank of England)
The **COUNTRY** label was not explicitly modeled, since this information can be reliably **inferred from the affiliation of the central bank**.
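Because central bank affiliations map almost one-to-one onto jurisdictions, a simple post-processing lookup is enough to recover the country. Below is a minimal, hypothetical sketch of that step; the `AFFILIATION_TO_COUNTRY` table and `infer_country` helper are illustrative and not part of the model:

```python
# Hypothetical post-processing: COUNTRY is not a model label, so it is
# recovered from the predicted AFFILIATION span via a lookup table.
AFFILIATION_TO_COUNTRY = {
    "bank of japan": "Japan",
    "european central bank": "Euro area",
    "bank of england": "United Kingdom",
}

def infer_country(affiliation: str):
    """Return the country for a predicted AFFILIATION span, if known."""
    return AFFILIATION_TO_COUNTRY.get(affiliation.strip().lower())
```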
## Data
* **Source**: **BIS database of central bank speeches (1996–2024)**
* **Corpus Size**: 17,648 training speeches, with a further 1,961 held out for validation.
* **Input Field**: *Speech descriptions*, which typically contain a short speech title along with the name, position, and institutional affiliation of the speaker.
**Annotation Process**:
1. A subset of short speech descriptions was **manually annotated** with entity spans for Author, Position, and Affiliation.
2. This annotated subset was used to **train an initial NER model**.
3. The model was then applied to the larger dataset (1996–2024) to generate preliminary labels.
4. All generated labels were **manually reviewed and corrected**, ensuring complete and consistent annotation across the entire corpus of available speeches.
This approach combined **manual expertise** with **machine-assisted annotation**, making it feasible to build a large-scale, high-quality dataset covering nearly three decades of central bank communication.
## Data Preparation
1. **Normalization**: Lowercasing, removal of diacritics, and unification of punctuation.
2. **Alias resolution**: Institution abbreviations normalized (e.g., “BOJ” → “Bank of Japan”, “ECB” → “European Central Bank”).
3. **Entity alignment**: Fuzzy string matching used to locate annotated entities in raw text.
4. **BIO Encoding**:
* Tokenization with *BERT WordPiece tokenizer*.
* Conversion of annotations into **BIO tags** (`B-`, `I-`, `O`) at token level.
* Construction of a training file in **JSONL format** with `tokens` and `ner_tags`.
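As a concrete illustration of steps 3 and 4, the sketch below shows how word-level annotations could be converted into a BIO-tagged JSONL record. It is a minimal reconstruction under stated assumptions: the `bio_encode` helper and the hard-coded spans stand in for the actual fuzzy-matching output, and the label set mirrors the seven-tag scheme described under *Model Training*:

```python
import json

# Seven labels: O plus B-/I- tags for each of the three entity types.
LABELS = ["O", "B-AUTHOR", "I-AUTHOR", "B-POSITION", "I-POSITION",
          "B-AFFILIATION", "I-AFFILIATION"]

def bio_encode(words, spans):
    """Tag words with BIO labels; `spans` maps an entity type to the
    half-open (start, end) word indices located by fuzzy matching."""
    tags = ["O"] * len(words)
    for ent_type, (start, end) in spans.items():
        tags[start] = f"B-{ent_type}"
        for i in range(start + 1, end):
            tags[i] = f"I-{ent_type}"
    return tags

words = "speech by mr yi gang governor of the people's bank of china".split()
spans = {"AUTHOR": (3, 5), "POSITION": (5, 6), "AFFILIATION": (8, 12)}
print(json.dumps({"tokens": words, "ner_tags": bio_encode(words, spans)}))
# -> one JSONL training line with `tokens` and `ner_tags`
```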
## Model Training
* **Base model**: [`CentralBank-BERT`](https://huggingface.co/bilalzafar/CentralBank-BERT), a domain-adapted BERT trained on central banking corpora.
* **Task head**: Token classification layer with `num_labels = 7` (BIO scheme for Author, Position, Affiliation).
* **Token alignment**: Word-to-token mapping with subword label propagation (`-100` used for ignored positions).
* **Training setup**:
* Optimizer: AdamW with weight decay `0.01`
* Learning rate: `2e-5`
* Batch size: `16` (train & eval)
* Epochs: `3`
* Mixed precision (`fp16`) when available
* Evaluation with `seqeval` metrics (precision, recall, F1)
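The fragment below condenses this setup into a runnable sketch using the Hugging Face `Trainer`. It is an approximation rather than the exact training script: `train_ds` and `val_ds` are placeholders for the tokenized JSONL splits, and the label-alignment helper follows the subword-propagation scheme described above:

```python
import numpy as np
import evaluate
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

LABELS = ["O", "B-AUTHOR", "I-AUTHOR", "B-POSITION", "I-POSITION",
          "B-AFFILIATION", "I-AFFILIATION"]
LABEL2ID = {label: i for i, label in enumerate(LABELS)}

tokenizer = AutoTokenizer.from_pretrained("bilalzafar/CentralBank-BERT")
model = AutoModelForTokenClassification.from_pretrained(
    "bilalzafar/CentralBank-BERT", num_labels=len(LABELS))

def align_labels(example):
    """Tokenize pre-split words; propagate each word's BIO tag to all of
    its subwords and mark special tokens with -100 (ignored by the loss)."""
    enc = tokenizer(example["tokens"], is_split_into_words=True, truncation=True)
    enc["labels"] = [-100 if wid is None else LABEL2ID[example["ner_tags"][wid]]
                     for wid in enc.word_ids()]
    return enc

seqeval = evaluate.load("seqeval")

def compute_metrics(p):
    """Entity-level precision/recall/F1 via seqeval, skipping -100 positions."""
    preds = np.argmax(p.predictions, axis=2)
    y_pred = [[LABELS[pr] for pr, lb in zip(ps, ls) if lb != -100]
              for ps, ls in zip(preds, p.label_ids)]
    y_true = [[LABELS[lb] for pr, lb in zip(ps, ls) if lb != -100]
              for ps, ls in zip(preds, p.label_ids)]
    res = seqeval.compute(predictions=y_pred, references=y_true)
    return {k: v for k, v in res.items() if isinstance(v, float)}

args = TrainingArguments(
    output_dir="centralbank-ner",
    learning_rate=2e-5, weight_decay=0.01,      # AdamW is the Trainer default
    per_device_train_batch_size=16, per_device_eval_batch_size=16,
    num_train_epochs=3, fp16=True,              # mixed precision if supported
)

# `train_ds` / `val_ds`: placeholder datasets already mapped with align_labels.
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=val_ds,
                  data_collator=DataCollatorForTokenClassification(tokenizer),
                  compute_metrics=compute_metrics)
trainer.train()
trainer.evaluate()
```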
## Results
The model was trained on **17,648 annotated speeches** with a **1,961-speech validation set**. Evaluation metrics are reported using **entity-level precision, recall, and F1-score** from the `seqeval` library.
**Final Validation Performance (Epoch 3):**
| Entity Type | Precision | Recall | F1-score | Support |
| --------------- | ---------- | ---------- | ---------- | ------- |
| **Affiliation** | 0.9850 | 0.9862 | 0.9856 | 1,734 |
| **Author** | 0.9816 | 0.9912 | 0.9864 | 1,936 |
| **Position** | 0.9735 | 0.9846 | 0.9790 | 1,942 |
| **Overall** | **0.9798** | **0.9862** | **0.9830** | — |
* **Token-level accuracy:** 0.9956
* **Overall F1 (macro):** 0.983
The results show **high precision and recall across all three categories**, confirming that the model provides reliable structured metadata extraction from central bank communications.
---
## Other CBDC Models
This model is part of the **CentralBank-BERT / CBDC model family**, a suite of domain-adapted classifiers for analyzing central-bank communication.
| **Model** | **Purpose** | **Intended Use** | **Link** |
| ------------------------------- | ------------------------------------------------------------------- | ------------------------------------------------------------------- | ---------------------------------------------------------------------- |
| **bilalzafar/CentralBank-BERT** | Domain-adaptive masked LM trained on BIS speeches (1996–2024). | Base encoder for CBDC downstream tasks; fill-mask tasks. | [CentralBank-BERT](https://huggingface.co/bilalzafar/CentralBank-BERT) |
| **bilalzafar/CBDC-BERT** | Binary classifier: CBDC vs. Non-CBDC. | Flagging CBDC-related discourse in large corpora. | [CBDC-BERT](https://huggingface.co/bilalzafar/CBDC-BERT) |
| **bilalzafar/CBDC-Stance** | 3-class stance model (Pro, Wait-and-See, Anti). | Research on policy stances and discourse monitoring. | [CBDC-Stance](https://huggingface.co/bilalzafar/CBDC-Stance) |
| **bilalzafar/CBDC-Sentiment** | 3-class sentiment model (Positive, Neutral, Negative). | Tone analysis in central bank communications. | [CBDC-Sentiment](https://huggingface.co/bilalzafar/CBDC-Sentiment) |
| **bilalzafar/CBDC-Type** | Classifies Retail, Wholesale, General CBDC mentions. | Distinguishing policy focus (retail vs wholesale). | [CBDC-Type](https://huggingface.co/bilalzafar/CBDC-Type) |
| **bilalzafar/CBDC-Discourse** | 3-class discourse classifier (Feature, Process, Risk-Benefit). | Structured categorization of CBDC communications. | [CBDC-Discourse](https://huggingface.co/bilalzafar/CBDC-Discourse) |
| **bilalzafar/CentralBank-NER**   | Named Entity Recognition (NER) model for central banking discourse. | Extracting speaker names, positions, and affiliations from speeches. | [CentralBank-NER](https://huggingface.co/bilalzafar/CentralBank-NER)   |
## Repository and Replication Package
All **training pipelines, preprocessing scripts, evaluation notebooks, and result outputs** are available in the companion GitHub repository:
🔗 **[https://github.com/bilalezafar/CentralBank-BERT](https://github.com/bilalezafar/CentralBank-BERT)**
---
## Usage
```python
from transformers import pipeline

# HF model repo
model = "bilalzafar/CentralBank-NER"

ner = pipeline(
    task="token-classification",
    model=model,
    tokenizer=model,
    aggregation_strategy="simple",  # merges subword pieces into entity spans
)

# Example text
text = "Speech by Mr Yi Gang, Governor of the People's Bank of China, at the IMF Annual Meeting."

for ent in ner(text):
    print(f"{ent['entity_group']:12} {ent['word']:<25} score={ent['score']:.3f}")

# Example output:
# AUTHOR       yi gang                   score=0.997
# POSITION     governor                  score=0.999
# AFFILIATION  people ' s bank of china  score=0.999
```
---
## Citation
If you use this model, please cite as:
**Zafar, M. B. (2026). CentralBank-BERT: Machine learning evidence on central bank digital currency discourse. *Journal of Economics and Business.* [https://doi.org/10.1016/j.jeconbus.2026.106300](https://doi.org/10.1016/j.jeconbus.2026.106300)**
```bibtex
@article{zafar2025centralbankbert,
title={CentralBank-BERT: Machine learning evidence on central bank digital currency discourse},
author={Zafar, Muhammad Bilal},
year={2026},
journal={Journal of Economics and Business},
url={https://doi.org/10.1016/j.jeconbus.2026.106300}
}
```