finbert-finetune / README.md
project-aps's picture
updated README.md on failed cases
447fea0 verified
---
license: apache-2.0
datasets:
- FinGPT/fingpt-sentiment-train
language:
- en
metrics:
- accuracy
- f1
- recall
- precision
base_model:
- ProsusAI/finbert
pipeline_tag: text-classification
tags:
- finance
- financial
- news
- sentiment-analysis
- finbert
- transfomer
- text-classification
- financial-news
- financial-news-sentiment
library_name: transformers
---
# 📊 FinBERT Fine-Tuned on Financial News/Texts
A fine-tuned version of [`ProsusAI/finbert`](https://huggingface.co/ProsusAI/finbert) trained for **financial sentiment analysis** on financial news texts and headlines.
This fine-tuned model achieves a significant improvement over the original finbert, **outperforming it by over 38% in accuracy** on financial sentiment classification tasks.
---
## 🔧 Model Objective
The goal of this model is to detect **positive**, **neutral**, or **negative sentiment** on financial texts and headlines.
---
## 🗂️ Training Dataset
**Primary Dataset**: [`fingpt-sentiment-train`](https://huggingface.co/datasets/FinGPT/fingpt-sentiment-train) (~60,000 examples)
- Labeled financial text samples (positive / neutral / negative)
- Includes earnings statements, market commentary, and financial news headlines
- Only included **neutral**, **positive** and **negative** texts.
---
## 🧪 Benchmark Evaluation
The model was evaluated against **three benchmark datasets**:
- **[Financial PhraseBank (All Agree and All Combined)](https://www.researchgate.net/publication/251231364_FinancialPhraseBank-v10)**
- **[FiQA + PhraseBank Kaggle Merge](https://www.kaggle.com/datasets/sbhatti/financial-sentiment-analysis/data)**
- **[fingpt-sentiment-train (test split)](https://huggingface.co/datasets/FinGPT/fingpt-sentiment-train)**
Metrics used:
- **Accuracy**
- **F1 Score**
- **Precision**
- **Recall**
We benchmarked this model against the original [`ProsusAI/finbert`](https://huggingface.co/ProsusAI/finbert) on multiple financial datasets:
| Dataset | Samples | Model | Accuracy | F1 (Macro) | F1 (Weighted) | Precision (Macro) | Precision (Weighted) | Recall (Macro) | Recall (Weighted) |
|------------------------------------|---------|--------------------------|---------------|---------------|----------------|--------------------|------------------------|----------------|--------------------|
| **fingpt-sentiment-train Eval** | 12511 | FinBERT | 0.7131 | 0.70 | 0.71 | 0.71 | 0.72 | 0.70 | 0.71 |
| | | **FinBERT-Finetuned (Ours)** | **0.9894 (+38.8%)** | **0.99 (+41.4%)** | **0.99 (+39.4%)** | **0.99 (+39.4%)** | **0.99 (+37.5%)** | **0.99 (+41.4%)** | **0.99 (+39.4%)** |
| **Financial Phrasebank (Agree)** | 2264 | FinBERT | 0.9717 | 0.96 | 0.97 | 0.95 | 0.97 | 0.98 | 0.97 |
| | | **FinBERT-Finetuned (Ours)** | **0.9912 (+2.0%)** | **0.99 (+3.1%)** | **0.99 (+2.1%)** | **0.99 (+4.2%)** | **0.99 (+2.1%)** | **0.99 (+1.0%)** | **0.99 (+2.1%)** |
| **Financial Phrasebank (Combined)**| 14780 | FinBERT | 0.9238 | 0.91 | 0.92 | 0.89 | 0.93 | 0.94 | 0.92 |
| | | **FinBERT-Finetuned (Ours)** | **0.9792 (+6.0%)** | **0.98 (+7.7%)** | **0.98 (+6.5%)** | **0.98 (+10.1%)** | **0.98 (+5.4%)** | **0.98 (+4.3%)** | **0.98 (+6.5%)** |
| **FiQA + PhraseBank (Kaggle)** | 5842 | FinBERT | 0.7581 | 0.74 | 0.77 | 0.73 | 0.79 | 0.77 | 0.76 |
| | | **FinBERT-Finetuned (Ours)** | **0.8879 (+17.1%)** | **0.87 (+17.6%)** | **0.89 (+15.6%)** | **0.85 (+16.4%)** | **0.92 (+16.5%)** | **0.92 (+19.5%)** | **0.89 (+17.1%)** |
> **Note:** All metrics represent classification performance improvements after fine-tuning FinBERT on respective financial sentiment datasets. Metrics in parentheses represent relative improvement over base FinBERT performance.
---
## 🧠 Text-Level Comparison: FinBERT vs FinBERT-Finetuned (Ours)
### 🔴 FinBERT Failed Texts (as per discussed in its [`Paper`](https://arxiv.org/abs/1908.10063)) (Correctly Predicted by Ours)
| Text | Expected | FinBERT | Ours |
|-----------------------------------------------------------------------------------------------------------------------------|-----------|------------------------------|-------------------------------|
| Pre-tax loss totaled euro 0.3 million, compared to a loss of euro 2.2 million in the first quarter of 2005. | Positive | ❌ Negative (0.7223) | ✅ Positive (0.9997) |
| This implementation is very important to the operator, since it is about to launch its Fixed to Mobile convergence service | Neutral | ❌ Positive (0.7204) | ✅ Neutral (0.9998) |
| The situation of coated magazine printing paper will continue to be weak. | Negative | ✅ Negative (0.8811) | ✅ Negative (0.9996) |
### 🟡 FinBERT Incorrect, Ours Corrected It
| Text | Expected | FinBERT | Ours |
|----------------------------------------------------------------------------------------------------------------|-----------|------------------------------|-------------------------------|
| The debt-to-equity ratio was 1.15, flat quarter-over-quarter. | Neutral | ❌ Negative (0.6239) | ✅ Neutral (0.9998) |
| Earnings smashed expectations $AAPL posts $0.89 EPS vs $0.78 est. Bullish momentum incoming! | Positive | ❌ Neutral (0.4237) | ✅ Positive (0.9998) |
| $TSLA growth is slowing — but hey, at least Elon tweeted something funny today. #Tesla #markets | Negative | ❌ Neutral (0.5884) | ✅ Negative (0.7084) |
### ⚪ Out-of-Context Texts (FinBERT Misclassified, Ours Handled Properly)
| Text | Expected | FinBERT | Ours |
|--------------------------------------------------------------------------------------------|-----------|------------------------------|-------------------------------|
| Unexpected Snowstorm Hits Sahara Desert, Blanketing Sand Dunes | Neutral | ❌ Negative (0.8675) | ✅ Neutral (0.9993) |
| Virtual Reality Therapy Shows Promise for Treating PTSD | Neutral | ❌ Positive (0.8522) | ✅ Neutral (0.9997) |
> **Note**: These examples demonstrate improvements in real-world understanding, context handling, and sentiment differentiation with our FinBERT-finetuned model. Values in parentheses (e.g., `0.9485`) indicate the model’s confidence score for its predicted sentiment.
---
## ⚠️ Limitations & Failure Cases
While the model outperformed the base FinBERT across benchmarks, **some failure cases** were observed in statements involving **fine-grained numerical reasoning**, particularly when numerical comparison semantics are complex or subtle.
| Text | Expected | FinBERT | Ours |
|---------------------------------------------------------------------------------------------------------|-----------|------------------------------|-------------------------------|
| Net profit to euro 203 million from euro 172 million in the previous year. | Positive | ✅ Positive (0.9485) | ✅ Positive (0.9995) |
| Net profit to euro 103 million from euro 172 million in the previous year. | Negative | ❌ Positive (0.9486) | ❌ Positive (0.9994) |
| Pre-tax loss totaled euro 0.3 million, compared to a loss of euro 2.2 million in Q1 2005. | Positive | ❌ Negative (0.7223) | ✅ Positive (0.9997) |
| Pre-tax loss totaled euro 5.3 million, compared to a loss of euro 2.2 million in Q1 2005. | Negative | ✅ Negative (0.7205) | ❌ Positive (0.9997) |
| Net profit totaled euro 5.3 million, compared to euro 2.2 million in the previous quarter of 2005. | Positive | ❌ Negative (0.6347) | ❌ Negative (0.9996) |
| Net profit totaled euro 0.3 million, compared to euro 2.2 million in the previous quarter of 2005. | Negative | ✅ Negative (0.6320) | ✅ Negative (0.9996) |
> **Note**: Values in parentheses (e.g., `0.9485`) indicate the model’s confidence score for its predicted sentiment.
This suggests that **explicit numerical comparison reasoning** still remains challenging without targeted pretraining or numerical reasoning augmentation.
---
## Hyperparameters
During fine-tuning, the following hyperparameters were used to optimize model performance:
- **Learning Rate:** 2e-5
- **Batch Size:** 32
- **Number of Epochs:** 3
- **Max Sequence Length:** 128 tokens
- **Optimizer:** AdamW
- **Weight Decay:** 0.01
- **Evaluation Strategy:** Evaluation performed after each epoch
> **Note**: These settings were chosen to balance training efficiency and accuracy for financial news sentiment classification.
---
## 💡 Summary
**Better generalization** than FinBERT on both benchmark and noisy real-world samples
**Strong accuracy and F1 scores**
⚠️ Room to improve on **numerical reasoning comparisons** — potential for integration with numerical-aware transformers or contrastive fine-tuning
---
## Usage
### Pipeline Approach
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import torch
model_name = "project-aps/finbert-finetune"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Override the config's id2label and label2id
label_map = {0: "neutral", 1: "negative", 2: "positive"}
model.config.id2label = label_map
model.config.label2id = {v: k for k, v in label_map.items()}
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)
text = "Earnings smashed expectations AAPL posts $0.89 EPS vs $0.78 est. Bullish momentum incoming! #EarningsSeason"
print(pipe(text)) #Output: [{'label': 'positive', 'score': 0.9997484087944031}]
```
### Simple Approach
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
model_name = "project-aps/finbert-finetune"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
text = "Earnings smashed expectations AAPL posts $0.89 EPS vs $0.78 est. Bullish momentum incoming! #EarningsSeason"
inputs = tokenizer(text, return_tensors="pt", truncation=True)
outputs = model(**inputs)
predicted_class = torch.argmax(outputs.logits, dim=1).item()
label_map = {0: "neutral", 1: "negative", 2: "positive"}
print(f"Text : {text}")
print(f"Sentiment: {label_map[predicted_class]}")
```
---
## Acknowledgements
We gratefully acknowledge the creators and maintainers of the resources used in this project:
- **[ProsusAI/FinBERT](https://huggingface.co/ProsusAI/finbert)** – A pre-trained BERT model specifically designed for financial sentiment analysis, which served as the foundation for our fine-tuning efforts.
- **[FinGPT Sentiment Train Dataset](https://huggingface.co/datasets/FinGPT/fingpt-sentiment-train)** – The dataset used for fine-tuning, containing a large collection of finance-related news headlines and sentiment annotations.
- **[Financial PhraseBank Dataset](https://www.researchgate.net/publication/251231364_FinancialPhraseBank-v10)** – A widely used benchmark dataset for financial sentiment classification, including the *All Agree* and *All Combined* subsets.
- **[FiQA + PhraseBank Kaggle Merged Dataset](https://www.kaggle.com/datasets/sbhatti/financial-sentiment-analysis/data)** – A merged dataset combining FiQA and Financial PhraseBank entries, used for broader benchmarking of sentiment performance.
We thank these contributors for making their models and datasets publicly available, enabling high-quality research and development in financial NLP.
---