---
license: mit
language:
- en
metrics:
- accuracy
- precision
- recall
- f1
pipeline_tag: text-classification
paperswithcode_id: nlp-stock-sentiment-analysis
library_name: sklearn
tags:
- finance
- sentiment-analysis
- embeddings
- gradient-boosting
- classical-ml
- market-analysis
- nlp
- weekly-sentiment
---
<p align="left">
  <a href="https://github.com/joyjitroy/Machine_Learning/tree/main/NLP_Stock_Sentiment_Analysis">
    <img src="https://img.shields.io/badge/GitHub-Repo-blue?logo=github" />
  </a>
  <a href="https://doi.org/10.5281/zenodo.17510735">
    <img src="https://img.shields.io/badge/Zenodo-DOI-1877f2?logo=zenodo" />
  </a>
  <a href="https://huggingface.co/joyjitroy/Stock_Market_News_Sentiment_Analysis">
    <img src="https://img.shields.io/badge/HuggingFace-Model-yellow?logo=huggingface" />
  </a>
  <a href="https://ssrn.com/abstract=5784922">
    <img src="https://img.shields.io/badge/SSRN-Preprint-red" />
  </a>
  <img src="https://img.shields.io/badge/License-MIT-green" />
</p>

# 📰 NLP Stock Sentiment Analysis — Embedding-Based Models for Market Signals

![Model3_Thumbnail](https://cdn-uploads.huggingface.co/production/uploads/68faf2c58b9b8d06b47b769c/uuw0OJrmZ3No5S17TJ5ti.png)

This model card documents the resources and workflow described in the paper:

**“NLP Stock Sentiment Analysis: A Comparative Embedding-Based Model Report”**  
Joyjit Roy  
SSRN, August 2024  
Available at: https://ssrn.com/abstract=5784922  
DOI: http://dx.doi.org/10.2139/ssrn.5784922  

## 👥 Authors

| Name | Email | GitHub | ORCID | Google Scholar | ResearchGate |
|------|-------|--------|-------|----------------|--------------|
| **[Joyjit Roy](https://www.linkedin.com/in/royjoyjit/)** | [joyjit.roy.tech@gmail.com](mailto:joyjit.roy.tech@gmail.com) | [GitHub](https://github.com/joyjitroy) | [ORCID](https://orcid.org/0009-0000-0886-782X) | [Google Scholar](https://scholar.google.com/citations?hl=en&user=66qcSP8AAAAJ) | [ResearchGate](https://www.researchgate.net/profile/Joyjit-Roy-3) |
| **[Samaresh Kumar Singh](https://www.linkedin.com/in/samaresh-singh-9772ba23/)** | [ssam3003@gmail.com](mailto:ssam3003@gmail.com) | [GitHub](https://github.com/ssam18) | [ORCID](https://orcid.org/0009-0008-1351-0719) | [Google Scholar](https://scholar.google.com/citations?hl=en&user=3z9qzooAAAAJ) | — |


Zenodo DOI: https://doi.org/10.5281/zenodo.17510735  
GitHub: https://github.com/joyjitroy/Machine_Learning/tree/main/NLP_Stock_Sentiment_Analysis  

This project provides a complete NLP workflow for classifying the sentiment of headline-level financial news and aggregating weekly sentiment indicators. It serves as a reproducible reference for embedding-based sentiment analysis using classical machine learning models.

---
## 🔍 Project Overview
The goal is to transform unstructured financial news into structured sentiment scores that can support market commentary and trend evaluation. Three embedding-based approaches were explored:
### **1. Word2Vec (300D)**  
- Trained locally on the news corpus  
- Mean pooled per headline  
### **2. GloVe (100D)**  
- Pretrained vectors  
- Mean pooled per headline  
### **3. SentenceTransformer Embeddings (384D)**  
- Model: `all-MiniLM-L6-v2`  
- Direct sentence embeddings without fine-tuning  

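The mean pooling used for the Word2Vec and GloVe variants can be sketched as below. This is a minimal illustration, not the code from the paper: the `mean_pool` helper, the whitespace tokenizer, and the toy 4-dimensional vectors are assumptions. The real vectors are 300D (Word2Vec) or 100D (GloVe).

```python
import numpy as np

def mean_pool(headline, vectors, dim):
    """Average the vectors of all in-vocabulary tokens in a headline.

    Hypothetical helper: `vectors` is any mapping from token to a 1-D array.
    Headlines with no known tokens map to a zero vector.
    """
    tokens = headline.lower().split()  # naive whitespace tokenizer (assumption)
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Toy 4-D vocabulary standing in for real Word2Vec/GloVe vectors.
toy_vectors = {
    "stocks": np.array([1.0, 0.0, 0.0, 0.0]),
    "rally": np.array([0.0, 1.0, 0.0, 0.0]),
    "fall": np.array([0.0, 0.0, 1.0, 0.0]),
}

# "today" is out of vocabulary, so only two vectors are averaged.
pooled = mean_pool("Stocks rally today", toy_vectors, dim=4)
print(pooled)
```
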
For each embedding approach, a **Gradient Boosting classifier** is trained and evaluated on the same dataset. Weekly sentiment summaries are generated to demonstrate applied use cases for financial analysis.
> Note: These are scikit-learn models and are not directly deployed as Hugging Face inference endpoints.

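A minimal sketch of the training step, assuming the embeddings are already available as a feature matrix. The random features, the 80/20 split, and the hyperparameters are placeholders chosen so the example runs quickly; they are not the paper's settings.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Placeholder data: 349 rows of 100-D features standing in for pooled GloVe
# embeddings, with labels in {-1, 0, 1}. Real embeddings are not reproduced here.
X = rng.normal(size=(349, 100))
y = rng.choice([-1, 0, 1], size=349)

# Stratified hold-out split; the 80/20 ratio is an assumption.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Small ensemble so the sketch runs quickly; these are not tuned values.
clf = GradientBoostingClassifier(n_estimators=20, max_depth=2, random_state=42)
clf.fit(X_train, y_train)

val_pred = clf.predict(X_val)
print("validation accuracy:", (val_pred == y_val).mean())
```
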
---
## 💽 Dataset Summary
The dataset contains **349 financial news headlines**, each paired with OHLCV market indicators and a sentiment label:  
`1` = positive, `0` = neutral, `-1` = negative.

The dataset is available on Hugging Face and also archived on Zenodo for citation and long-term access:
👉 https://doi.org/10.5281/zenodo.17510735

### 📊 Data Dictionary
| Column   | Description                                                   |
|----------|---------------------------------------------------------------|
| `Date`   | Date the news item was released                               |
| `News`   | Headline or snippet text                                      |
| `Open`   | Opening price (USD)                                           |
| `High`   | Highest price (USD) of the day                                |
| `Low`    | Lowest price (USD) of the day                                 |
| `Close`  | Adjusted closing price (USD)                                  |
| `Volume` | Total shares traded                                           |
| `Label`  | Sentiment (`1`=positive, `0`=neutral, `-1`=negative)          |

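As an illustration of the schema above, the sketch below parses two made-up rows with pandas and checks the label range. The rows are invented for the example and are not taken from the dataset; only the column names follow the data dictionary.

```python
import io
import pandas as pd

# Two invented rows that follow the data dictionary; not real dataset records.
csv_text = """Date,News,Open,High,Low,Close,Volume,Label
2019-01-02,Example headline about strong earnings,40.0,41.0,39.5,40.5,1000000,1
2019-01-03,Example headline about a supply warning,40.5,40.8,38.9,39.0,1500000,-1
"""

df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

# Labels must be one of -1 (negative), 0 (neutral), 1 (positive).
valid_labels = df["Label"].isin([-1, 0, 1]).all()

print(df.dtypes)
print("labels valid:", valid_labels)
```
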
---
## 📊 Model Performance (Validation)
Across all embedding variants, the **tuned GloVe + Gradient Boosting** model achieved the strongest validation results:
- **Accuracy:** 0.714  
- **Precision:** 0.758  
- **Recall:** 0.714  
- **F1 Score:** 0.694  
- **Error Rate:** 0.286  

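The reported recall equals the reported accuracy, which is what support-weighted averaging produces for multiclass recall, and the error rate is 1 minus accuracy. The sketch below assumes weighted averaging (the card does not state the averaging mode) and uses made-up labels, so its numbers are not the paper's.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Made-up validation labels and predictions, for illustration only.
y_true = [1, 0, -1, 1, 0, -1, 1, 0, 0, -1]
y_pred = [1, 0, -1, 0, 0, 1, 1, 0, 1, -1]

accuracy = accuracy_score(y_true, y_pred)

# Support-weighted averaging across the three classes (assumed mode).
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
error_rate = 1.0 - accuracy

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f} error_rate={error_rate:.3f}")
```
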
All models reach perfect training accuracy, a sign of overfitting on a dataset this small, so validation metrics are used for a fair comparison. Full results appear in the associated paper (Table 3).

---
## 🧠 Weekly Sentiment Summaries
To show how sentiment predictions can support market interpretation:
1. Daily predictions are aggregated by week  
2. Sentiment ratios (positive / neutral / negative) are computed  
3. Weekly summaries are generated using **Mistral-7B-Instruct**  
4. These summaries provide narrative insight into market mood  

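Steps 1 and 2 above can be sketched with pandas as follows. The sample frame, the `Predicted` column name, and the Monday-to-Sunday weekly bucket are assumptions for illustration; the LLM summarization step is not shown.

```python
import pandas as pd

# Invented daily predictions; `Predicted` is a hypothetical column name.
daily = pd.DataFrame({
    "Date": pd.to_datetime([
        "2019-01-07", "2019-01-08", "2019-01-09", "2019-01-10",
        "2019-01-14", "2019-01-15",
    ]),
    "Predicted": [1, 1, 0, -1, -1, -1],
})

# Bucket each date into a calendar week, keyed by the week's start (Monday).
daily["Week"] = daily["Date"].dt.to_period("W").dt.start_time

# Count predictions per week and label, then convert counts to ratios.
counts = (
    daily.groupby(["Week", "Predicted"]).size().unstack(fill_value=0)
    .reindex(columns=[-1, 0, 1], fill_value=0)
)
ratios = counts.div(counts.sum(axis=1), axis=0)
ratios.columns = ["negative", "neutral", "positive"]

print(ratios)
```
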
This illustrates a practical workflow where classical NLP models feed into downstream financial analysis.

---
## 🌟 Intended Use
This repository is intended for:
- Research on embedding-based sentiment classification  
- Educational exploration of Word2Vec, GloVe, and Sentence Transformer embeddings  
- Demonstrating weekly sentiment aggregation  
- Benchmarking classical ML approaches for small financial text datasets  

---
## ⚠️ Limitations
- Small dataset (349 samples)  
- Potential overfitting of classical models  
- Not designed for automated trading or real-time systems  
- Weekly summaries rely on LLM outputs and may include stylistic bias  
- No direct price prediction or financial forecasting  

This model is best used for experimentation and learning.

---
## 📄 Related Publication

The full experimental design, results, and limitations are documented in the following model report:

> Roy, Joyjit.  
> *NLP Stock Sentiment Analysis: A Comparative Embedding-Based Model Report.*  
> SSRN, August 2024.  
> https://ssrn.com/abstract=5784922  
> DOI: http://dx.doi.org/10.2139/ssrn.5784922

---

## 📘 Citation
If you use this model, dataset, or workflow, please cite:

**Model Report**
> Roy, Joyjit.  
> *NLP Stock Sentiment Analysis: A Comparative Embedding-Based Model Report.*  
> SSRN, August 2024.  
> https://ssrn.com/abstract=5784922  
> DOI: http://dx.doi.org/10.2139/ssrn.5784922

**Dataset**
> Roy, Joyjit.  
> *Stock Market News Sentiment Analysis – Supplementary Materials.*  
> Zenodo, 2025.  
> https://doi.org/10.5281/zenodo.17510735


---