---
license: mit
language:
- en
metrics:
- accuracy
- precision
- recall
- f1
pipeline_tag: text-classification
paperswithcode_id: nlp-stock-sentiment-analysis
library_name: sklearn
tags:
- finance
- sentiment-analysis
- embeddings
- gradient-boosting
- classical-ml
- market-analysis
- nlp
- weekly-sentiment
---
<p align="left">
  <a href="https://github.com/joyjitroy/Machine_Learning/tree/main/NLP_Stock_Sentiment_Analysis">
    <img src="https://img.shields.io/badge/GitHub-Repo-blue?logo=github" />
  </a>
  <a href="https://doi.org/10.5281/zenodo.17510735">
    <img src="https://img.shields.io/badge/Zenodo-DOI-1877f2?logo=zenodo" />
  </a>
  <a href="https://huggingface.co/joyjitroy/Stock_Market_News_Sentiment_Analysis">
    <img src="https://img.shields.io/badge/HuggingFace-Model-yellow?logo=huggingface" />
  </a>
  <a href="https://ssrn.com/abstract=5784922">
  <img src="https://img.shields.io/badge/SSRN-Preprint-red" />
  </a>
  <img src="https://img.shields.io/badge/License-MIT-green" />
</p>

# 📰 NLP Stock Sentiment Analysis — Embedding-Based Models for Market Signals

![Model3_Thumbnail](https://cdn-uploads.huggingface.co/production/uploads/68faf2c58b9b8d06b47b769c/uuw0OJrmZ3No5S17TJ5ti.png)

This model card documents the resources and workflow described in the paper:

**“NLP Stock Sentiment Analysis: A Comparative Embedding-Based Model Report”**  
Joyjit Roy  
SSRN, August 2024  
Available at: https://ssrn.com/abstract=5784922  
DOI: http://dx.doi.org/10.2139/ssrn.5784922  

## 👥 Authors

| Name | Email | GitHub | ORCID | Google Scholar | ResearchGate |
|------|-------|--------|-------|----------------|--------------|
| **[Joyjit Roy](https://www.linkedin.com/in/royjoyjit/)** | [joyjit.roy.tech@gmail.com](mailto:joyjit.roy.tech@gmail.com) | [GitHub](https://github.com/joyjitroy) | [ORCID](https://orcid.org/0009-0000-0886-782X) | [Google Scholar](https://scholar.google.com/citations?hl=en&user=66qcSP8AAAAJ) | [ResearchGate](https://www.researchgate.net/profile/Joyjit-Roy-3) |
| **[Samaresh Kumar Singh](https://www.linkedin.com/in/samaresh-singh-9772ba23/)** | [ssam3003@gmail.com](mailto:ssam3003@gmail.com) | [GitHub](https://github.com/ssam18) | [ORCID](https://orcid.org/0009-0008-1351-0719) | [Google Scholar](https://scholar.google.com/citations?hl=en&user=3z9qzooAAAAJ) | — |


Zenodo DOI: https://doi.org/10.5281/zenodo.17510735  
GitHub: https://github.com/joyjitroy/Machine_Learning/tree/main/NLP_Stock_Sentiment_Analysis  

This project provides a complete NLP workflow for classifying the sentiment of headline-level financial news and aggregating weekly sentiment indicators. It serves as a reproducible reference for embedding-based sentiment analysis using classical machine learning models.

---
## 🔍 Project Overview
The goal is to transform unstructured financial news into structured sentiment scores that can support market commentary and trend evaluation. Three embedding-based approaches were explored:
### **1. Word2Vec (300D)**  
- Trained locally on the news corpus  
- Mean pooled per headline  
### **2. GloVe (100D)**  
- Pretrained vectors  
- Mean pooled per headline  
### **3. SentenceTransformer Embeddings (384D)**  
- Model: `all-MiniLM-L6-v2`  
- Direct sentence embeddings without fine-tuning  

For each embedding approach, a **Gradient Boosting classifier** is trained and evaluated on the same dataset. Weekly sentiment summaries are generated to demonstrate applied use cases for financial analysis.
> Note: These are scikit-learn models and are not directly deployed as Hugging Face inference endpoints.
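
For the Word2Vec and GloVe variants, "mean pooled per headline" can be illustrated with a minimal sketch. The whitespace tokenizer, the lowercase normalization, the zero-vector fallback for headlines with no known words, and the three-dimensional `toy_vectors` table are all assumptions made for illustration; the repository's own preprocessing may differ.

```python
import numpy as np


def mean_pool(headline, vectors, dim):
    """Average the vectors of all known tokens in a headline.

    `vectors` is any mapping from token to a 1-D array of length `dim`.
    Tokenization here is a plain lowercase whitespace split (an assumption).
    """
    tokens = headline.lower().split()
    found = [vectors[token] for token in tokens if token in vectors]
    if not found:
        # Assumed fallback: an all-zero vector when no token is in the vocabulary.
        return np.zeros(dim)
    return np.mean(found, axis=0)


# Toy 3-dimensional embedding table, standing in for Word2Vec (300D) or GloVe (100D).
toy_vectors = {
    "stocks": np.array([1.0, 0.0, 2.0]),
    "rally": np.array([3.0, 2.0, 0.0]),
}

headlines = ["Stocks rally today", "unknown words"]

# One row per headline; this matrix is what a classifier would consume.
features = np.vstack([mean_pool(h, toy_vectors, dim=3) for h in headlines])
```

Each row of `features` has the same width regardless of headline length, which is what lets a fixed-input classifier such as Gradient Boosting train on it.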

---
## 💽 Dataset Summary
The dataset contains **349 financial news headlines**, each paired with OHLCV market indicators and a sentiment label:  
`1` = positive, `0` = neutral, `-1` = negative.

The dataset is available on Hugging Face and also archived on Zenodo for citation and long-term access:
👉 https://doi.org/10.5281/zenodo.17510735

### 📊 Data Dictionary
| Column   | Description                                                   |
|----------|---------------------------------------------------------------|
| `Date`   | Date the news item was released                               |
| `News`   | Headline or snippet text                                      |
| `Open`   | Opening price (USD)                                           |
| `High`   | Highest price (USD) of the day                                |
| `Low`    | Lowest price (USD) of the day                                 |
| `Close`  | Adjusted closing price (USD)                                  |
| `Volume` | Total shares traded                                           |
| `Label`  | Sentiment (`1`=positive, `0`=neutral, `-1`=negative)          |
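
As a rough illustration of the schema above, the following sketch parses rows with the standard library and rejects labels outside `{-1, 0, 1}`. The ISO date format, the `NewsRow` and `parse_row` names, and the two sample rows are hypothetical and made up for the example; they are not taken from the dataset.

```python
import csv
import io
from dataclasses import dataclass
from datetime import date

VALID_LABELS = {-1, 0, 1}  # negative, neutral, positive


@dataclass
class NewsRow:
    day: date
    news: str
    open: float
    high: float
    low: float
    close: float
    volume: int
    label: int


def parse_row(raw):
    """Convert one CSV record (a dict of strings) into a typed NewsRow."""
    label = int(raw["Label"])
    if label not in VALID_LABELS:
        raise ValueError(f"unexpected sentiment label: {label}")
    return NewsRow(
        day=date.fromisoformat(raw["Date"]),  # assumes YYYY-MM-DD dates
        news=raw["News"],
        open=float(raw["Open"]),
        high=float(raw["High"]),
        low=float(raw["Low"]),
        close=float(raw["Close"]),
        volume=int(raw["Volume"]),
        label=label,
    )


# Made-up sample records that follow the column names in the data dictionary.
sample_csv = """Date,News,Open,High,Low,Close,Volume,Label
2024-01-02,Illustrative headline about strong earnings,100.0,102.5,99.5,101.0,1500000,1
2024-01-03,Illustrative headline about a product recall,101.0,101.5,97.0,98.0,2100000,-1
"""

rows = [parse_row(record) for record in csv.DictReader(io.StringIO(sample_csv))]
```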

---
## 📊 Model Performance (Validation)
Across all embedding variants, the **tuned GloVe + Gradient Boosting** model achieved the strongest validation results:
- **Accuracy:** 0.714  
- **Precision:** 0.758  
- **Recall:** 0.714  
- **F1 Score:** 0.694  
- **Error Rate:** 0.286  

Every model reaches perfect training accuracy, which on a dataset this small most likely reflects overfitting, so validation metrics are used for comparison. Full results appear in the associated paper (Table 3).
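
The metrics above can be computed along the following lines. This is a sketch under stated assumptions, not the paper's evaluation code: the features and labels are random stand-ins, the 75/25 stratified split and default hyperparameters are illustrative, and support-weighted averaging is assumed because the reported recall equals the reported accuracy, which is what weighted averaging produces.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split

# Synthetic stand-in for pooled headline embeddings and labels (NOT the real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 16))
y = rng.choice([-1, 0, 1], size=120)

# Assumed split; the paper's actual split may differ.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_train, y_train)


def summarize(y_true, y_pred):
    """Return accuracy, weighted precision/recall/F1, and error rate."""
    accuracy = accuracy_score(y_true, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "error_rate": 1.0 - accuracy,
    }


train_metrics = summarize(y_train, clf.predict(X_train))
val_metrics = summarize(y_val, clf.predict(X_val))
```

Comparing `train_metrics` with `val_metrics` makes the gap between training and validation performance explicit, which is the reason validation figures are the ones reported.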

---
## 🧠 Weekly Sentiment Summaries
To show how sentiment predictions can support market interpretation:
1. Daily predictions are aggregated by week  
2. Sentiment ratios (positive / neutral / negative) are computed  
3. Weekly summaries are generated using **Mistral-7B-Instruct**  
4. These summaries provide narrative insight into market mood  

This illustrates a practical workflow where classical NLP models feed into downstream financial analysis.
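
Steps 1 and 2 above can be sketched with the standard library alone. Grouping by ISO calendar week, the function names, the sample predictions, and the prompt wording are assumptions for illustration; the call to Mistral-7B-Instruct itself is omitted.

```python
from collections import Counter, defaultdict
from datetime import date

LABEL_NAMES = {1: "positive", 0: "neutral", -1: "negative"}


def weekly_sentiment_ratios(predictions):
    """Group (date, label) pairs by ISO week and return per-week label ratios."""
    counts = defaultdict(Counter)
    for day, label in predictions:
        iso = day.isocalendar()
        counts[(iso[0], iso[1])][label] += 1  # key: (ISO year, ISO week number)

    ratios = {}
    for week, counter in sorted(counts.items()):
        total = sum(counter.values())
        ratios[week] = {
            name: counter[label] / total for label, name in LABEL_NAMES.items()
        }
    return ratios


def build_summary_prompt(week, ratios):
    """Format one week's ratios as a text prompt (hypothetical wording)."""
    year, week_no = week
    return (
        f"Week {week_no} of {year}: "
        f"{ratios['positive']:.0%} positive, {ratios['neutral']:.0%} neutral, "
        f"{ratios['negative']:.0%} negative headlines. "
        "Summarize the market mood in two sentences."
    )


# Made-up predictions: four headlines in one week, two in the next.
predictions = [
    (date(2024, 1, 1), 1),
    (date(2024, 1, 2), 1),
    (date(2024, 1, 3), -1),
    (date(2024, 1, 5), 0),
    (date(2024, 1, 8), -1),
    (date(2024, 1, 9), -1),
]

weekly = weekly_sentiment_ratios(predictions)
prompt = build_summary_prompt((2024, 1), weekly[(2024, 1)])
```

Keeping the ratio computation separate from the prompt text means the numeric indicators stay reproducible even if the summarization model or its wording changes.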

---
## 🌟 Intended Use
This repository is intended for:
- Research on embedding-based sentiment classification  
- Educational exploration of Word2Vec, GloVe, and Sentence Transformer embeddings  
- Demonstrating weekly sentiment aggregation  
- Benchmarking classical ML approaches for small financial text datasets  

---
## ⚠️ Limitations
- Small dataset (349 samples)  
- Potential overfitting of classical models  
- Not designed for automated trading or real-time systems  
- Weekly summaries rely on LLM outputs and may include stylistic bias  
- No direct price prediction or financial forecasting  

This model is best used for experimentation and learning.

---
## 📄 Related Publication

The full experimental design, results, and limitations are documented in the following model report:

> Roy, Joyjit.  
> *NLP Stock Sentiment Analysis: A Comparative Embedding-Based Model Report.*  
> SSRN, August 2024.  
> https://ssrn.com/abstract=5784922  
> DOI: http://dx.doi.org/10.2139/ssrn.5784922

---

## 📘 Citation
If you use this model, dataset, or workflow, please cite:

**Model Report**
> Roy, Joyjit.  
> *NLP Stock Sentiment Analysis: A Comparative Embedding-Based Model Report.*  
> SSRN, August 2024.  
> https://ssrn.com/abstract=5784922  
> DOI: http://dx.doi.org/10.2139/ssrn.5784922

**Dataset**
> Roy, Joyjit.  
> *Stock Market News Sentiment Analysis – Supplementary Materials.*  
> Zenodo, 2025.  
> https://doi.org/10.5281/zenodo.17510735


---