---
library_name: transformers
license: apache-2.0
base_model: answerdotai/ModernBERT-base
tags:
- generated_from_trainer
metrics:
- accuracy
model-index:
- name: Fin-ModernBERT
  results: []
datasets:
- clapAI/FinData-dedup
language:
- en
pipeline_tag: fill-mask
---
# Fin-ModernBERT

Fin-ModernBERT is a domain-adapted pretrained language model for the **financial domain**, obtained by continual pretraining of [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) with a **context length of 1024 tokens** on large-scale finance-related corpora.

---

## Model Description

- **Base model:** ModernBERT-base (context length = 1024)
- **Domain:** Finance, Stock Market, Cryptocurrency
- **Objective:** Improve representation and understanding of financial text for downstream NLP tasks (sentiment analysis, NER, classification, QA, retrieval, etc.)

---
## Training Data

We collected and combined multiple publicly available finance-related datasets, including:

- [danidanou/Bloomberg_Financial_News](https://huggingface.co/datasets/danidanou/Bloomberg_Financial_News)
- [juanberasategui/Crypto_Tweets](https://huggingface.co/datasets/juanberasategui/Crypto_Tweets)
- [StephanAkkerman/crypto-stock-tweets](https://huggingface.co/datasets/StephanAkkerman/crypto-stock-tweets)
- [SahandNZ/cryptonews-articles-with-price-momentum-labels](https://huggingface.co/datasets/SahandNZ/cryptonews-articles-with-price-momentum-labels)
- [edaschau/financial_news](https://huggingface.co/datasets/edaschau/financial_news)
- [sabareesh88/FNSPID_nasdaq](https://huggingface.co/datasets/sabareesh88/FNSPID_nasdaq)
- [BAAI/IndustryCorpus_finance](https://huggingface.co/datasets/BAAI/IndustryCorpus_finance)
- [mjw/stock_market_tweets](https://huggingface.co/datasets/mjw/stock_market_tweets)

After aggregation, we obtained **~50M financial records**. A deduplication pass reduced this to **~20M records**, available at:
👉 [clapAI/FinData-dedup](https://huggingface.co/datasets/clapAI/FinData-dedup)
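The deduplication procedure is not specified here; a minimal sketch of exact-match deduplication via normalized text hashing (an assumption — the released dataset may use fuzzier matching such as MinHash) could look like:

```python
import hashlib

def dedup_exact(records):
    """Drop records whose normalized text has been seen before.

    Normalization (lowercase + whitespace collapse) is an assumption;
    the actual FinData-dedup pipeline may differ.
    """
    seen = set()
    unique = []
    for text in records:
        normalized = " ".join(text.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique

corpus = [
    "Fed hints at rate cuts.",
    "Fed  hints at rate CUTS.",  # duplicate after normalization
    "Bitcoin rallies past resistance.",
]
print(len(dedup_exact(corpus)))  # -> 2
```

Hashing keys instead of storing raw strings keeps memory bounded when scanning tens of millions of records.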

---
## Training Hyperparameters

The following hyperparameters were used during training:

- **Learning rate:** 2e-4
- **Train batch size:** 24
- **Eval batch size:** 24
- **Seed:** 0
- **Gradient accumulation steps:** 128
- **Effective total train batch size:** 3072
- **Optimizer:** fused AdamW (`adamw_torch_fused`) with betas=(0.9, 0.999) and epsilon=1e-8
- **LR scheduler:** linear
- **Epochs:** 1
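The effective batch size follows directly from the per-device batch and gradient accumulation (assuming a single training device, which the card does not state):

```python
train_batch_size = 24
gradient_accumulation_steps = 128

# Effective batch = per-step batch x accumulation steps
effective_batch = train_batch_size * gradient_accumulation_steps
print(effective_batch)  # -> 3072
```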

---
## Evaluation Benchmarks

We benchmarked **Fin-ModernBERT** against two strong baselines:

- [ProsusAI/finbert](https://huggingface.co/ProsusAI/finbert)
- [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base)

### Fine-tuning Setup

All models were fine-tuned under the same configuration:

- **Optimizer:** AdamW
- **Learning rate:** 5e-5
- **Batch size:** 16
- **Epochs:** 5
- **Scheduler:** linear
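In `transformers`, this shared configuration corresponds roughly to the following `TrainingArguments` sketch (`output_dir` is an assumption; it is not stated above):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="finetune-out",        # assumed; not specified in the card
    optim="adamw_torch",              # AdamW
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    lr_scheduler_type="linear",
)
```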

### Results

| Dataset | Metric | FinBERT (ProsusAI) | ModernBERT-base | Fin-ModernBERT |
|---------|--------|--------------------|-----------------|----------------|
| CIKM (datht/fin-cikm) | F1-score | 42.77 | 53.08 | **54.89** |
| PhraseBank (soumakchak/phrasebank) | F1-score | 86.33 | 85.03 | **88.09** |

> Further evaluations on additional datasets and tasks are ongoing to provide a more comprehensive view of its performance.

---
## Use Cases

Fin-ModernBERT can be used for various financial NLP applications, such as:

- **Financial Sentiment Analysis** (e.g., market mood detection from news/tweets)
- **Event-driven Stock Prediction**
- **Financial Named Entity Recognition (NER)** (companies, tickers, financial instruments)
- **Document Classification & Clustering**
- **Question Answering over financial reports and news**

---

## How to Use
```python
from transformers import AutoTokenizer, AutoModel, pipeline

model_name = "clapAI/Fin-ModernBERT"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode a financial sentence to get contextual embeddings
text = "Federal Reserve hints at possible interest rate cuts."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)  # outputs.last_hidden_state: (1, seq_len, hidden_size)

# Or use the model for its pretraining task (fill-mask)
fill_mask = pipeline("fill-mask", model=model_name)
print(fill_mask(f"Federal Reserve hints at possible interest rate {tokenizer.mask_token}."))
```

## Citation

If you use this model, please cite:

```bibtex
@misc{finmodernbert2025,
  title={Fin-ModernBERT: Continual Pretraining of ModernBERT for Financial Domain},
  author={ClapAI},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/clapAI/Fin-ModernBERT}}
}
```