	---
	license: apache-2.0
	datasets:
	- FinGPT/fingpt-sentiment-train
	language:
	- en
	metrics:
	- accuracy
	- f1
	- recall
	- precision
	base_model:
	- ProsusAI/finbert
	pipeline_tag: text-classification
	tags:
	- finance
	- financial
	- news
	- sentiment-analysis
	- finbert
	- transformer
	- text-classification
	- financial-news
	- financial-news-sentiment
	library_name: transformers
	---


	# 📊 FinBERT Fine-Tuned on Financial News/Texts

	A fine-tuned version of [`ProsusAI/finbert`](https://huggingface.co/ProsusAI/finbert) for sentiment analysis of financial news texts and headlines.
	On the fingpt-sentiment-train evaluation split, it raises accuracy from 0.7131 (base FinBERT) to 0.9894, a relative improvement of over 38%. On the other benchmarks reported below, relative accuracy gains range from 2.0% to 17.1%.

	---

	## 🔧 Model Objective

	The goal of this model is to classify the sentiment of financial texts and headlines as positive, neutral, or negative.

	---

	## 🗂️ Training Dataset

	Primary Dataset: [`fingpt-sentiment-train`](https://huggingface.co/datasets/FinGPT/fingpt-sentiment-train) (~60,000 examples)

	- Labeled financial text samples (positive / neutral / negative)
	- Includes earnings statements, market commentary, and financial news headlines
	- Only samples labeled neutral, positive, or negative were included
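
The label filter described in the last bullet can be sketched as follows. This is a minimal illustration, not the preprocessing actually used for this model: the record layout (`text` and `label` keys) and the extra label value `mildly positive` are hypothetical stand-ins.

```python
# Hypothetical records: the field names and the extra label value are assumptions.
KEPT_LABELS = {"positive", "neutral", "negative"}

def keep_three_way(records):
    """Return only the records whose label is one of the three target classes."""
    return [r for r in records if r["label"].strip().lower() in KEPT_LABELS]

samples = [
    {"text": "Operating profit rose.", "label": "positive"},
    {"text": "Sales were slightly higher.", "label": "mildly positive"},
    {"text": "The company is based in Helsinki.", "label": "Neutral"},
]
filtered = keep_three_way(samples)
print(len(filtered))
```

Normalizing case and whitespace before the comparison keeps the filter from silently dropping rows whose labels differ only in formatting.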

	---

	## 🧪 Benchmark Evaluation

	The model was evaluated on three benchmark sources (Financial PhraseBank is used in two variants, giving four evaluation sets):
	- [Financial PhraseBank (All Agree and All Combined)](https://www.researchgate.net/publication/251231364_FinancialPhraseBank-v10)
	- [FiQA + PhraseBank Kaggle Merge](https://www.kaggle.com/datasets/sbhatti/financial-sentiment-analysis/data)
	- [fingpt-sentiment-train (test split)](https://huggingface.co/datasets/FinGPT/fingpt-sentiment-train)

	Metrics used:
	- Accuracy
	- F1 Score
	- Precision
	- Recall
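
The results table reports both macro and weighted averages of these metrics. As a rough illustration of the difference (this is not the evaluation script used for the card), the sketch below computes accuracy and both F1 averages in plain Python on a made-up set of labels. Macro averaging gives every class equal weight, while weighted averaging weights each class by its number of true samples.

```python
from collections import Counter

def per_class_f1(y_true, y_pred, label):
    """F1 score for a single class, treating it as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def summarize(y_true, y_pred):
    """Accuracy plus macro- and support-weighted F1."""
    labels = sorted(set(y_true))
    support = Counter(y_true)  # number of true samples per class
    f1 = {c: per_class_f1(y_true, y_pred, c) for c in labels}
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    macro = sum(f1.values()) / len(labels)
    weighted = sum(f1[c] * support[c] for c in labels) / len(y_true)
    return {"accuracy": accuracy, "f1_macro": macro, "f1_weighted": weighted}

# Made-up labels, chosen so that the two averages differ.
y_true = ["neutral", "neutral", "neutral", "neutral", "positive", "negative"]
y_pred = ["neutral", "neutral", "neutral", "positive", "positive", "negative"]
scores = summarize(y_true, y_pred)
print(scores)
```

On imbalanced data the two averages diverge, which is why the table lists both.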


	We benchmarked this model against the original [`ProsusAI/finbert`](https://huggingface.co/ProsusAI/finbert) on multiple financial datasets:

	| Dataset | Samples | Model | Accuracy | F1 (Macro) | F1 (Weighted) | Precision (Macro) | Precision (Weighted) | Recall (Macro) | Recall (Weighted) |
	|---|---|---|---|---|---|---|---|---|---|
	| fingpt-sentiment-train Eval | 12511 | FinBERT | 0.7131 | 0.70 | 0.71 | 0.71 | 0.72 | 0.70 | 0.71 |
	| | | FinBERT-Finetuned (Ours) | 0.9894 (+38.8%) | 0.99 (+41.4%) | 0.99 (+39.4%) | 0.99 (+39.4%) | 0.99 (+37.5%) | 0.99 (+41.4%) | 0.99 (+39.4%) |
	| Financial Phrasebank (Agree) | 2264 | FinBERT | 0.9717 | 0.96 | 0.97 | 0.95 | 0.97 | 0.98 | 0.97 |
	| | | FinBERT-Finetuned (Ours) | 0.9912 (+2.0%) | 0.99 (+3.1%) | 0.99 (+2.1%) | 0.99 (+4.2%) | 0.99 (+2.1%) | 0.99 (+1.0%) | 0.99 (+2.1%) |
	| Financial Phrasebank (Combined) | 14780 | FinBERT | 0.9238 | 0.91 | 0.92 | 0.89 | 0.93 | 0.94 | 0.92 |
	| | | FinBERT-Finetuned (Ours) | 0.9792 (+6.0%) | 0.98 (+7.7%) | 0.98 (+6.5%) | 0.98 (+10.1%) | 0.98 (+5.4%) | 0.98 (+4.3%) | 0.98 (+6.5%) |
	| FiQA + PhraseBank (Kaggle) | 5842 | FinBERT | 0.7581 | 0.74 | 0.77 | 0.73 | 0.79 | 0.77 | 0.76 |
	| | | FinBERT-Finetuned (Ours) | 0.8879 (+17.1%) | 0.87 (+17.6%) | 0.89 (+15.6%) | 0.85 (+16.4%) | 0.92 (+16.5%) | 0.92 (+19.5%) | 0.89 (+17.1%) |


	> Note: Both models are evaluated on the same datasets. Values in parentheses give the relative improvement of the fine-tuned model over base FinBERT.
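
The percentages in parentheses can be reproduced with a one-line formula. The sketch below assumes they were computed as `(ours - base) / base`, which is not stated in the card but matches the rows checked here. The function name is illustrative.

```python
def relative_improvement(base: float, ours: float) -> float:
    """Relative gain of `ours` over `base`, in percent."""
    return (ours - base) / base * 100.0

# Accuracy values copied from the table above.
phrasebank_agree = relative_improvement(0.9717, 0.9912)
fiqa_kaggle = relative_improvement(0.7581, 0.8879)
print(f"{phrasebank_agree:.1f}% {fiqa_kaggle:.1f}%")  # 2.0% 17.1%
```

Because these are relative rather than absolute gains, a jump from 0.7581 to 0.8879 is about 13 accuracy points but a 17.1% relative improvement.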

	---
	## 🧠 Text-Level Comparison: FinBERT vs FinBERT-Finetuned (Ours)

	### 🔴 Texts FinBERT Failed On (as Discussed in Its [Paper](https://arxiv.org/abs/1908.10063)), Correctly Predicted by Ours
	| Text | Expected | FinBERT | Ours |
	|---|---|---|---|
	| Pre-tax loss totaled euro 0.3 million, compared to a loss of euro 2.2 million in the first quarter of 2005. | Positive | ❌ Negative (0.7223) | ✅ Positive (0.9997) |
	| This implementation is very important to the operator, since it is about to launch its Fixed to Mobile convergence service | Neutral | ❌ Positive (0.7204) | ✅ Neutral (0.9998) |
	| The situation of coated magazine printing paper will continue to be weak. | Negative | ✅ Negative (0.8811) | ✅ Negative (0.9996) |

	### 🟡 FinBERT Incorrect, Ours Correct
	| Text | Expected | FinBERT | Ours |
	|---|---|---|---|
	| The debt-to-equity ratio was 1.15, flat quarter-over-quarter. | Neutral | ❌ Negative (0.6239) | ✅ Neutral (0.9998) |
	| Earnings smashed expectations $AAPL posts $0.89 EPS vs $0.78 est. Bullish momentum incoming! | Positive | ❌ Neutral (0.4237) | ✅ Positive (0.9998) |
	| $TSLA growth is slowing — but hey, at least Elon tweeted something funny today. #Tesla #markets | Negative | ❌ Neutral (0.5884) | ✅ Negative (0.7084) |

	### ⚪ Out-of-Context Texts (FinBERT Misclassified, Ours Handled Properly)
	| Text | Expected | FinBERT | Ours |
	|---|---|---|---|
	| Unexpected Snowstorm Hits Sahara Desert, Blanketing Sand Dunes | Neutral | ❌ Negative (0.8675) | ✅ Neutral (0.9993) |
	| Virtual Reality Therapy Shows Promise for Treating PTSD | Neutral | ❌ Positive (0.8522) | ✅ Neutral (0.9997) |

	> Note: These examples illustrate the fine-tuned model's better handling of context, informal language, and non-financial text. Values in parentheses (e.g., `0.9997`) indicate the model’s confidence score for its predicted sentiment.
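
The card does not say how the confidence scores are computed. They are most likely softmax probabilities of the predicted class, which is what the `transformers` text-classification pipeline reports as `score` for single-label models, but treat that as an assumption. A minimal sketch of the computation in plain Python, using made-up logits and the label order from the Usage section:

```python
import math

# Label order taken from the Usage section of this card.
LABELS = {0: "neutral", 1: "negative", 2: "positive"}

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_with_confidence(logits):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[idx], probs[idx]

# Made-up logits for illustration; real values come from the model's output.
label, score = predict_with_confidence([-2.0, -3.0, 5.0])
print(label, round(score, 4))
```

A score near 1.0 only means the logit gap is large. It does not guarantee the prediction is correct, as the failure cases in the Limitations section show.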

	---

	## ⚠️ Limitations & Failure Cases

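	While the model outperforms base FinBERT across benchmarks, failure cases were observed on statements that require fine-grained numerical reasoning, particularly when the sentiment depends on the direction of a numerical comparison. In the pairs below, the fine-tuned model assigns the same label with nearly identical confidence regardless of which number is larger.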

	| Text | Expected | FinBERT | Ours |
	|---|---|---|---|
	| Net profit to euro 203 million from euro 172 million in the previous year. | Positive | ✅ Positive (0.9485) | ✅ Positive (0.9995) |
	| Net profit to euro 103 million from euro 172 million in the previous year. | Negative | ❌ Positive (0.9486) | ❌ Positive (0.9994) |
	| Pre-tax loss totaled euro 0.3 million, compared to a loss of euro 2.2 million in Q1 2005. | Positive | ❌ Negative (0.7223) | ✅ Positive (0.9997) |
	| Pre-tax loss totaled euro 5.3 million, compared to a loss of euro 2.2 million in Q1 2005. | Negative | ✅ Negative (0.7205) | ❌ Positive (0.9997) |
	| Net profit totaled euro 5.3 million, compared to euro 2.2 million in the previous quarter of 2005. | Positive | ❌ Negative (0.6347) | ❌ Negative (0.9996) |
	| Net profit totaled euro 0.3 million, compared to euro 2.2 million in the previous quarter of 2005. | Negative | ✅ Negative (0.6320) | ✅ Negative (0.9996) |

	> Note: Values in parentheses (e.g., `0.9485`) indicate the model’s confidence score for its predicted sentiment.

	This suggests that explicit numerical comparison remains challenging without targeted pretraining or numerical reasoning augmentation.
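
A check of this kind can be automated with minimal pairs: two sentences that differ only in their numbers and should receive different labels. The sketch below is illustrative only. `find_insensitive_pairs` and the stub classifier are hypothetical names, and in practice `predict` would wrap the real model, for example the pipeline from the Usage section.

```python
from typing import Callable, List, Tuple

# Two texts that differ only in their numbers and should get different labels.
Pair = Tuple[str, str]

def find_insensitive_pairs(pairs: List[Pair], predict: Callable[[str], str]) -> List[Pair]:
    """Return the pairs for which the classifier gives both texts the same label."""
    return [(a, b) for a, b in pairs if predict(a) == predict(b)]

def stub_predict(text: str) -> str:
    # Hypothetical stand-in that ignores the numbers entirely,
    # mimicking the failure mode shown in the table above.
    return "negative" if "loss" in text else "positive"

pairs = [
    (
        "Net profit to euro 203 million from euro 172 million in the previous year.",
        "Net profit to euro 103 million from euro 172 million in the previous year.",
    ),
]
insensitive = find_insensitive_pairs(pairs, stub_predict)
print(f"{len(insensitive)} of {len(pairs)} pairs got the same label for both texts")
```

The share of insensitive pairs gives a simple number to track if numerical augmentation is added later.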

	---

	## Hyperparameters

	During fine-tuning, the following hyperparameters were used to optimize model performance:

	- Learning Rate: 2e-5
	- Batch Size: 32
	- Number of Epochs: 3
	- Max Sequence Length: 128 tokens
	- Optimizer: AdamW
	- Weight Decay: 0.01
	- Evaluation Strategy: Evaluation performed after each epoch

	> Note: These settings were chosen to balance training efficiency and accuracy for financial news sentiment classification.
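
For reference, the settings above can be collected into a plain configuration dictionary. This is a sketch, not the training script: the key names follow Hugging Face `TrainingArguments` fields where one exists (`max_length` is a tokenizer argument instead), the optimizer and evaluation schedule are left out, and the dataset size passed to the helper is a hypothetical round number used only to show how batch size and epochs translate into optimizer steps.

```python
import math

# Values from the list above; the key names are an assumption about how they would be passed on.
HYPERPARAMS = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 32,
    "num_train_epochs": 3,
    "weight_decay": 0.01,
    "max_length": 128,  # tokenizer truncation length, in tokens
}

def total_optimizer_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps on a single device without gradient accumulation."""
    return math.ceil(num_examples / batch_size) * epochs

# 50_000 is a hypothetical training-set size, not a figure from this card.
steps = total_optimizer_steps(
    50_000,
    HYPERPARAMS["per_device_train_batch_size"],
    HYPERPARAMS["num_train_epochs"],
)
print(steps)
```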

	---

	## 💡 Summary

	- ✅ Better generalization than FinBERT on both benchmark and noisy real-world samples
	- ✅ Accuracy between 0.8879 and 0.9912 across the four evaluation sets, improving on FinBERT on every reported metric
	- ⚠️ Room to improve on numerical comparisons, potentially through numerical-aware transformers or contrastive fine-tuning

	---
	## Usage

	### Pipeline Approach
	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

	model_name = "project-aps/finbert-finetune"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForSequenceClassification.from_pretrained(model_name)

	# Override the config's id2label and label2id so the pipeline returns readable labels
	label_map = {0: "neutral", 1: "negative", 2: "positive"}
	model.config.id2label = label_map
	model.config.label2id = {v: k for k, v in label_map.items()}

	pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)

	text = "Earnings smashed expectations AAPL posts $0.89 EPS vs $0.78 est. Bullish momentum incoming! #EarningsSeason"
	print(pipe(text))  # Output: [{'label': 'positive', 'score': 0.9997484087944031}]
	```

	### Simple Approach
	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	import torch

	model_name = "project-aps/finbert-finetune"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForSequenceClassification.from_pretrained(model_name)
	model.eval()  # inference mode (disables dropout)

	text = "Earnings smashed expectations AAPL posts $0.89 EPS vs $0.78 est. Bullish momentum incoming! #EarningsSeason"
	inputs = tokenizer(text, return_tensors="pt", truncation=True)

	# Gradients are not needed for inference
	with torch.no_grad():
	    outputs = model(**inputs)
	predicted_class = torch.argmax(outputs.logits, dim=1).item()

	# Map the predicted class index to its sentiment label
	label_map = {0: "neutral", 1: "negative", 2: "positive"}
	print(f"Text : {text}")
	print(f"Sentiment: {label_map[predicted_class]}")
	```

	---
	## Acknowledgements

	We gratefully acknowledge the creators and maintainers of the resources used in this project:

	- [ProsusAI/FinBERT](https://huggingface.co/ProsusAI/finbert) – A pre-trained BERT model specifically designed for financial sentiment analysis, which served as the foundation for our fine-tuning efforts.

	- [FinGPT Sentiment Train Dataset](https://huggingface.co/datasets/FinGPT/fingpt-sentiment-train) – The dataset used for fine-tuning, containing a large collection of finance-related news headlines and sentiment annotations.

	- [Financial PhraseBank Dataset](https://www.researchgate.net/publication/251231364_FinancialPhraseBank-v10) – A widely used benchmark dataset for financial sentiment classification, including the All Agree and All Combined subsets.

	- [FiQA + PhraseBank Kaggle Merged Dataset](https://www.kaggle.com/datasets/sbhatti/financial-sentiment-analysis/data) – A merged dataset combining FiQA and Financial PhraseBank entries, used for broader benchmarking of sentiment performance.


	We thank these contributors for making their models and datasets publicly available, enabling high-quality research and development in financial NLP.


	---