---
license: mit
language:
- en
metrics:
- accuracy
- precision
- recall
- f1
pipeline_tag: text-classification
paperswithcode_id: nlp-stock-sentiment-analysis
library_name: sklearn
tags:
- finance
- sentiment-analysis
- embeddings
- gradient-boosting
- classical-ml
- market-analysis
- nlp
- weekly-sentiment
---
<p align="left">
  <a href="https://github.com/joyjitroy/Machine_Learning/tree/main/NLP_Stock_Sentiment_Analysis">
    <img src="https://img.shields.io/badge/GitHub-Repo-blue?logo=github" />
  </a>
  <a href="https://doi.org/10.5281/zenodo.17510735">
    <img src="https://img.shields.io/badge/Zenodo-DOI-1877f2?logo=zenodo" />
  </a>
  <a href="https://huggingface.co/joyjitroy/Stock_Market_News_Sentiment_Analysis">
    <img src="https://img.shields.io/badge/HuggingFace-Model-yellow?logo=huggingface" />
  </a>
  <a href="https://ssrn.com/abstract=5784922">
    <img src="https://img.shields.io/badge/SSRN-Preprint-red" />
  </a>
  <img src="https://img.shields.io/badge/License-MIT-green" />
</p>

# 📰 NLP Stock Sentiment Analysis: Embedding-Based Models for Market Signals

![Model3_Thumbnail](https://cdn-uploads.huggingface.co/production/uploads/68faf2c58b9b8d06b47b769c/uuw0OJrmZ3No5S17TJ5ti.png)

This model card documents the resources and workflow described in the paper:

**“NLP Stock Sentiment Analysis: A Comparative Embedding-Based Model Report”**  
Joyjit Roy  
SSRN, August 2024  
Available at: https://ssrn.com/abstract=5784922  
DOI: http://dx.doi.org/10.2139/ssrn.5784922  

## 👥 Authors

| Name | Email | GitHub | ORCID | Google Scholar | ResearchGate |
|------|-------|--------|-------|----------------|--------------|
| **[Joyjit Roy](https://www.linkedin.com/in/royjoyjit/)** | [joyjit.roy.tech@gmail.com](mailto:joyjit.roy.tech@gmail.com) | [GitHub](https://github.com/joyjitroy) | [ORCID](https://orcid.org/0009-0000-0886-782X) | [Google Scholar](https://scholar.google.com/citations?hl=en&user=66qcSP8AAAAJ) | [ResearchGate](https://www.researchgate.net/profile/Joyjit-Roy-3) |
| **[Samaresh Kumar Singh](https://www.linkedin.com/in/samaresh-singh-9772ba23/)** | [ssam3003@gmail.com](mailto:ssam3003@gmail.com) | [GitHub](https://github.com/ssam18) | [ORCID](https://orcid.org/0009-0008-1351-0719) | [Google Scholar](https://scholar.google.com/citations?hl=en&user=3z9qzooAAAAJ) | - |


Zenodo DOI: https://doi.org/10.5281/zenodo.17510735  
GitHub: https://github.com/joyjitroy/Machine_Learning/tree/main/NLP_Stock_Sentiment_Analysis  

This project provides a complete NLP workflow for classifying the sentiment of headline-level financial news and aggregating weekly sentiment indicators. It serves as a reproducible reference for embedding-based sentiment analysis using classical machine learning models.

---
## 🔍 Project Overview
The goal is to transform unstructured financial news into structured sentiment scores that can support market commentary and trend evaluation. Three embedding-based approaches were explored:
### **1. Word2Vec (300D)**  
- Trained locally on the news corpus  
- Mean pooled per headline  
### **2. GloVe (100D)**  
- Pretrained vectors  
- Mean pooled per headline  
### **3. SentenceTransformer Embeddings (384D)**  
- Model: `all-MiniLM-L6-v2`  
- Direct sentence embeddings without fine-tuning  

For each embedding approach, a **Gradient Boosting classifier** is trained and evaluated on the same dataset. Weekly sentiment summaries are generated to demonstrate applied use cases for financial analysis.
> Note: These models are scikit-learn models and are not directly deployed as Hugging Face inference endpoints.
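
The mean pooling and classifier steps described above can be sketched as follows. This is a minimal illustration, not the authors' code: the 4-D toy vectors, the example headlines, the whitespace tokenizer, and the zero-vector fallback for headlines with no known tokens are all assumptions, and the classifier uses scikit-learn defaults because the card does not list the tuned hyperparameters.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical 4-D toy vectors standing in for 300-D Word2Vec or 100-D GloVe.
TOY_VECTORS = {
    "profit": np.array([0.9, 0.1, 0.0, 0.2]),
    "surges": np.array([0.8, 0.2, 0.1, 0.1]),
    "record": np.array([0.7, 0.0, 0.2, 0.3]),
    "steady": np.array([0.1, 0.9, 0.1, 0.0]),
    "unchanged": np.array([0.0, 0.8, 0.2, 0.1]),
    "plunges": np.array([0.0, 0.1, 0.8, 0.9]),
    "loss": np.array([0.1, 0.0, 0.9, 0.8]),
}
DIM = 4


def mean_pool(headline, vectors, dim):
    """Average the vectors of all known tokens in a headline."""
    tokens = headline.lower().split()  # assumed tokenizer: lowercase + whitespace
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)  # assumed fallback when no token has a vector
    return np.mean(known, axis=0)


# Invented headlines with labels in the card's scheme: 1, 0, -1.
headlines = [
    "Profit surges",
    "Record profit",
    "Outlook steady",
    "Guidance unchanged",
    "Stock plunges",
    "Quarterly loss widens",
]
y = np.array([1, 1, 0, 0, -1, -1])

# One pooled feature row per headline.
X = np.vstack([mean_pool(h, TOY_VECTORS, DIM) for h in headlines])

clf = GradientBoostingClassifier(random_state=42)  # default hyperparameters
clf.fit(X, y)
preds = clf.predict(X)
```

For the SentenceTransformer variant, the pooled matrix `X` would be replaced by the sentence embeddings themselves, with the classifier step unchanged.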

---
## 💽 Dataset Summary
The dataset contains **349 financial news headlines**, each paired with OHLCV market indicators and a sentiment label:  
`1` = positive, `0` = neutral, `-1` = negative.

The dataset is available on Hugging Face and also archived on Zenodo for citation and long-term access:
👉 https://doi.org/10.5281/zenodo.17510735

### 📊 Data Dictionary
| Column   | Description                                                   |
|----------|---------------------------------------------------------------|
| `Date`   | Date the news item was released                               |
| `News`   | Headline or snippet text                                      |
| `Open`   | Opening price (USD)                                           |
| `High`   | Highest price (USD) of the day                                |
| `Low`    | Lowest price (USD) of the day                                 |
| `Close`  | Adjusted closing price (USD)                                  |
| `Volume` | Total shares traded                                           |
| `Label`  | Sentiment (`1`=positive, `0`=neutral, `-1`=negative)          |
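
A minimal sketch of loading rows that follow this data dictionary, assuming the data is distributed as a CSV with exactly these column names. The two sample rows and the ISO date format are invented for illustration and are not taken from the dataset.

```python
import csv
import io
from datetime import datetime

EXPECTED_COLUMNS = ["Date", "News", "Open", "High", "Low", "Close", "Volume", "Label"]
VALID_LABELS = {1, 0, -1}

# Invented sample rows; the YYYY-MM-DD date format is an assumption.
SAMPLE_CSV = """Date,News,Open,High,Low,Close,Volume,Label
2024-01-02,Chipmaker shares rally after upbeat forecast,41.5,42.3,41.1,42.0,1200000,1
2024-01-03,Regulator opens probe into lender,42.0,42.1,40.2,40.5,1500000,-1
"""


def load_rows(text):
    """Parse CSV text into typed rows, rejecting unknown columns or labels."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    rows = []
    for raw in reader:
        label = int(raw["Label"])
        if label not in VALID_LABELS:
            raise ValueError(f"unexpected label: {label}")
        rows.append({
            "Date": datetime.strptime(raw["Date"], "%Y-%m-%d").date(),
            "News": raw["News"],
            "Open": float(raw["Open"]),
            "High": float(raw["High"]),
            "Low": float(raw["Low"]),
            "Close": float(raw["Close"]),
            "Volume": int(raw["Volume"]),
            "Label": label,
        })
    return rows


rows = load_rows(SAMPLE_CSV)
```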

---
## 📊 Model Performance (Validation)
Across all embedding variants, the **tuned GloVe + Gradient Boosting** model achieved the strongest validation results:
- **Accuracy:** 0.714  
- **Precision:** 0.758  
- **Recall:** 0.714  
- **F1 Score:** 0.694  
- **Error Rate:** 0.286  

Every model reaches perfect training accuracy, which on a dataset this small points to overfitting, so the comparison relies on validation metrics. Full results appear in the associated paper (Table 3).
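
The reported recall equals the reported accuracy (0.714), and the error rate is 1 - 0.714 = 0.286. Recall matching accuracy is what support-weighted averaging across classes produces, although the card does not state which averaging mode was used. The sketch below computes the metrics under that assumption; the labels and predictions are invented toy values, not the paper's validation data.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy, error rate, and support-weighted precision, recall, and F1."""
    n = len(y_true)
    support = Counter(y_true)
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / n

    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in pairs)
        predicted = sum(p == label for p in y_pred)
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / count
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = count / n  # each class weighted by its share of true labels
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "error_rate": 1 - accuracy,
    }


# Invented toy labels in the card's scheme: 1, 0, -1.
y_true = [1, 1, 1, 0, 0, 0, -1, -1]
y_pred = [1, 1, 0, 0, 0, 1, -1, -1]
metrics = weighted_metrics(y_true, y_pred)
```

With weighted averaging, `metrics["recall"]` always equals `metrics["accuracy"]`, because each class's recall is scaled by its share of the true labels.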

---
## 🧠 Weekly Sentiment Summaries
To show how sentiment predictions can support market interpretation:
1. Daily predictions are aggregated by week  
2. Sentiment ratios (positive / neutral / negative) are computed  
3. Weekly summaries are generated using **Mistral-7B-Instruct**  
4. These summaries provide narrative insight into market mood  

This illustrates a practical workflow where classical NLP models feed into downstream financial analysis.
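
Steps 1 and 2 above can be sketched in plain Python. The grouping by ISO calendar week, the function name, and the sample dates are assumptions made for illustration; the card does not say how weeks are delimited. The LLM summarization step is not shown.

```python
from collections import Counter, defaultdict
from datetime import date

LABEL_NAMES = {1: "positive", 0: "neutral", -1: "negative"}


def weekly_sentiment_ratios(predictions):
    """Group (date, label) pairs by ISO week and return per-week label ratios."""
    by_week = defaultdict(Counter)
    for day, label in predictions:
        iso = day.isocalendar()  # (ISO year, ISO week, ISO weekday)
        by_week[(iso[0], iso[1])][label] += 1

    ratios = {}
    for week, counts in sorted(by_week.items()):
        total = sum(counts.values())
        ratios[week] = {
            name: counts[label] / total for label, name in LABEL_NAMES.items()
        }
    return ratios


# Invented daily predictions for illustration.
daily_predictions = [
    (date(2024, 1, 1), 1),
    (date(2024, 1, 2), 1),
    (date(2024, 1, 3), -1),
    (date(2024, 1, 4), 0),
    (date(2024, 1, 8), -1),
    (date(2024, 1, 9), -1),
]
ratios = weekly_sentiment_ratios(daily_predictions)
```

Each entry in `ratios` maps a `(year, week)` key to the three sentiment shares, which is the kind of structured input a summarization prompt could be built from.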

---
## 🌟 Intended Use
This repository is intended for:
- Research on embedding-based sentiment classification  
- Educational exploration of Word2Vec, GloVe, and Sentence Transformer embeddings  
- Demonstrating weekly sentiment aggregation  
- Benchmarking classical ML approaches for small financial text datasets  

---
## ⚠️ Limitations
- Small dataset (349 samples)  
- Potential overfitting of classical models  
- Not designed for automated trading or real-time systems  
- Weekly summaries rely on LLM outputs and may include stylistic bias  
- No direct price prediction or financial forecasting  

This model is best used for experimentation and learning.

---
## 📄 Related Publication

The full experimental design, results, and limitations are documented in the following model report:

> Roy, Joyjit.  
> *NLP Stock Sentiment Analysis: A Comparative Embedding-Based Model Report.*  
> SSRN, 2024.  
> https://ssrn.com/abstract=5784922  
> DOI: http://dx.doi.org/10.2139/ssrn.5784922

---

## 📘 Citation
If you use this model, dataset, or workflow, please cite:

**Model Report**
> Roy, Joyjit.  
> *NLP Stock Sentiment Analysis: A Comparative Embedding-Based Model Report.*  
> SSRN, August 2024.  
> https://ssrn.com/abstract=5784922  
> DOI: http://dx.doi.org/10.2139/ssrn.5784922

**Dataset**
> Roy, Joyjit.  
> *Stock Market News Sentiment Analysis – Supplementary Materials.*  
> Zenodo, 2025.  
> https://doi.org/10.5281/zenodo.17510735


---