---
license: mit
language:
- en
metrics:
- accuracy
- precision
- recall
- f1
pipeline_tag: text-classification
paperswithcode_id: nlp-stock-sentiment-analysis
library_name: sklearn
tags:
- finance
- sentiment-analysis
- embeddings
- gradient-boosting
- classical-ml
- market-analysis
- nlp
- weekly-sentiment
---
<p align="left">
<a href="https://github.com/joyjitroy/Machine_Learning/tree/main/NLP_Stock_Sentiment_Analysis">
<img src="https://img.shields.io/badge/GitHub-Repo-blue?logo=github" />
</a>
<a href="https://doi.org/10.5281/zenodo.17510735">
<img src="https://img.shields.io/badge/Zenodo-DOI-1877f2?logo=zenodo" />
</a>
<a href="https://huggingface.co/joyjitroy/Stock_Market_News_Sentiment_Analysis">
<img src="https://img.shields.io/badge/HuggingFace-Model-yellow?logo=huggingface" />
</a>
<a href="https://ssrn.com/abstract=5784922">
<img src="https://img.shields.io/badge/SSRN-Preprint-red" />
</a>
<img src="https://img.shields.io/badge/License-MIT-green" />
</p>

# NLP Stock Sentiment Analysis: Embedding-Based Models for Market Signals

This model card documents the resources and workflow described in the paper:
**"NLP Stock Sentiment Analysis: A Comparative Embedding-Based Model Report"**
Joyjit Roy
SSRN, August 2024
Available at: https://ssrn.com/abstract=5784922
DOI: http://dx.doi.org/10.2139/ssrn.5784922
## Authors
| Name | Email | GitHub | ORCID | Google Scholar | ResearchGate |
|------|-------|--------|-------|----------------|--------------|
| **[Joyjit Roy](https://www.linkedin.com/in/royjoyjit/)** | [joyjit.roy.tech@gmail.com](mailto:joyjit.roy.tech@gmail.com) | [GitHub](https://github.com/joyjitroy) | [ORCID](https://orcid.org/0009-0000-0886-782X) | [Google Scholar](https://scholar.google.com/citations?hl=en&user=66qcSP8AAAAJ) | [ResearchGate](https://www.researchgate.net/profile/Joyjit-Roy-3) |
| **[Samaresh Kumar Singh](https://www.linkedin.com/in/samaresh-singh-9772ba23/)** | [ssam3003@gmail.com](mailto:ssam3003@gmail.com) | [GitHub](https://github.com/ssam18) | [ORCID](https://orcid.org/0009-0008-1351-0719) | [Google Scholar](https://scholar.google.com/citations?hl=en&user=3z9qzooAAAAJ) | - |

Zenodo DOI: https://doi.org/10.5281/zenodo.17510735
GitHub: https://github.com/joyjitroy/Machine_Learning/tree/main/NLP_Stock_Sentiment_Analysis

This project provides a complete NLP workflow for classifying the sentiment of headline-level financial news and aggregating weekly sentiment indicators. It serves as a reproducible reference for embedding-based sentiment analysis using classical machine learning models.

---
## Project Overview
The goal is to transform unstructured financial news into structured sentiment scores that can support market commentary and trend evaluation. Three embedding-based approaches were explored:
### **1. Word2Vec (300D)**
- Trained locally on the news corpus
- Mean pooled per headline
### **2. GloVe (100D)**
- Pretrained vectors
- Mean pooled per headline
### **3. SentenceTransformer Embeddings (384D)**
- Model: `all-MiniLM-L6-v2`
- Direct sentence embeddings without fine-tuning
For each embedding approach, a **Gradient Boosting classifier** is trained and evaluated on the same dataset. Weekly sentiment summaries are generated to demonstrate applied use cases for financial analysis.
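
The mean pooling used for the Word2Vec and GloVe variants can be sketched as follows. This is a minimal illustration, not the project's code: the `mean_pool` helper, the toy 3-dimensional vocabulary, the whitespace tokenization, and the zero-vector fallback for headlines with no known tokens are all assumptions made for the example.

```python
from typing import Dict, List, Sequence


def mean_pool(tokens: Sequence[str],
              vectors: Dict[str, List[float]],
              dim: int) -> List[float]:
    """Average the vectors of all in-vocabulary tokens in one headline."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # Assumption: headlines with no known token map to a zero vector.
        return [0.0] * dim
    # zip(*known) walks the vectors dimension by dimension.
    return [sum(column) / len(known) for column in zip(*known)]


# Toy 3-dimensional vocabulary, invented for illustration only.
toy_vectors = {
    "shares": [1.0, 0.0, 2.0],
    "rally": [3.0, 2.0, 0.0],
}

# Assumption: lowercase + whitespace split stands in for real tokenization.
headline = "Shares rally on earnings".lower().split()
pooled = mean_pool(headline, toy_vectors, dim=3)
print(pooled)  # [2.0, 1.0, 1.0]
```

Each pooled vector then becomes one feature row for the classifier, for example scikit-learn's `GradientBoostingClassifier`.
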
> Note: The classifiers are scikit-learn models and are not deployed as Hugging Face inference endpoints.
---
## Dataset Summary
The dataset contains **349 financial news headlines**, each paired with OHLCV market indicators and a sentiment label:
`1` = positive, `0` = neutral, `-1` = negative.
The dataset is available on Hugging Face and also archived on Zenodo for citation and long-term access:
https://doi.org/10.5281/zenodo.17510735
### Data Dictionary
| Column | Description |
|----------|---------------------------------------------------------------|
| `Date` | Date the news item was released |
| `News` | Headline or snippet text |
| `Open` | Opening price (USD) |
| `High` | Highest price (USD) of the day |
| `Low` | Lowest price (USD) of the day |
| `Close` | Adjusted closing price (USD) |
| `Volume` | Total shares traded |
| `Label` | Sentiment (`1`=positive, `0`=neutral, `-1`=negative) |
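
As a rough sketch of how rows with these columns might be read and validated, the snippet below parses an in-memory CSV into typed records. The `NewsRow` class, the `parse_row` helper, the ISO date format, and the two sample rows are assumptions invented for illustration; they are not taken from the actual dataset file.

```python
import csv
import io
from dataclasses import dataclass
from datetime import date

VALID_LABELS = {-1, 0, 1}  # negative, neutral, positive


@dataclass
class NewsRow:
    day: date
    news: str
    open: float
    high: float
    low: float
    close: float
    volume: int
    label: int


def parse_row(raw: dict) -> NewsRow:
    """Convert one CSV row (all strings) into a typed record."""
    label = int(raw["Label"])
    if label not in VALID_LABELS:
        raise ValueError(f"unexpected label: {label}")
    return NewsRow(
        day=date.fromisoformat(raw["Date"]),  # assumes YYYY-MM-DD dates
        news=raw["News"].strip(),
        open=float(raw["Open"]),
        high=float(raw["High"]),
        low=float(raw["Low"]),
        close=float(raw["Close"]),
        volume=int(raw["Volume"]),
        label=label,
    )


# Placeholder rows, made up for this example (not real dataset values).
SAMPLE = """Date,News,Open,High,Low,Close,Volume,Label
2019-01-02,Example headline one,10.0,11.0,9.5,10.5,1000,1
2019-01-03,Example headline two,10.5,10.8,9.9,10.0,1200,-1
"""

rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
print(len(rows), rows[0].label, rows[1].day)
```

Rejecting labels outside `{-1, 0, 1}` at load time keeps a malformed row from silently turning into a fourth class during training.
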
---
## Model Performance (Validation)
Across all embedding variants, the **tuned GloVe + Gradient Boosting** model achieved the strongest validation results:
- **Accuracy:** 0.714
- **Precision:** 0.758
- **Recall:** 0.714
- **F1 Score:** 0.694
- **Error Rate:** 0.286
Every model reaches perfect training accuracy, which on a dataset this small points to overfitting rather than genuine skill, so validation metrics are used for fair comparison. Full results appear in the associated paper (Table 3).
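
The reported recall equals the reported accuracy (0.714), which is what support-weighted averaging always produces, and the error rate is simply 1 minus accuracy. The sketch below therefore assumes weighted averaging; the card does not state the averaging mode. The `weighted_scores` helper and the six toy labels are made up for illustration.

```python
from collections import Counter
from typing import Sequence, Tuple


def weighted_scores(y_true: Sequence[int],
                    y_pred: Sequence[int]) -> Tuple[float, float, float, float]:
    """Return (accuracy, precision, recall, f1), support-weighted per class."""
    n = len(y_true)
    support = Counter(y_true)      # true count per class
    predicted = Counter(y_pred)    # predicted count per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precision = recall = f1 = 0.0
    for label, count in support.items():
        p = hits[label] / predicted[label] if predicted[label] else 0.0
        r = hits[label] / count
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = count / n         # class share of the true labels
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    accuracy = sum(hits.values()) / n
    return accuracy, precision, recall, f1


# Toy labels, invented for illustration (1 positive, 0 neutral, -1 negative).
y_true = [1, 1, 0, 0, -1, -1]
y_pred = [1, 1, 0, 1, -1, 0]

accuracy, precision, recall, f1 = weighted_scores(y_true, y_pred)
error_rate = 1.0 - accuracy
print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(error_rate, 3))
```

In scikit-learn the same averaging is selected with `average="weighted"` in `precision_recall_fscore_support`.
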
---
## Weekly Sentiment Summaries
To show how sentiment predictions can support market interpretation:
1. Daily predictions are aggregated by week
2. Sentiment ratios (positive / neutral / negative) are computed
3. Weekly summaries are generated using **Mistral-7B-Instruct**
4. These summaries provide narrative insight into market mood
This illustrates a practical workflow where classical NLP models feed into downstream financial analysis.
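
Steps 1 and 2 above can be sketched in plain Python. This is an illustrative sketch only: grouping by ISO calendar week, the `weekly_ratios` helper, and the sample predictions are assumptions, and the LLM summarization step is not shown.

```python
from collections import Counter, defaultdict
from datetime import date
from typing import Dict, Iterable, Tuple

NAMES = {1: "positive", 0: "neutral", -1: "negative"}


def weekly_ratios(
    predictions: Iterable[Tuple[date, int]]
) -> Dict[Tuple[int, int], Dict[str, float]]:
    """Group (date, label) pairs by ISO (year, week) and compute label shares."""
    counts: Dict[Tuple[int, int], Counter] = defaultdict(Counter)
    for day, label in predictions:
        iso = day.isocalendar()          # (ISO year, ISO week, weekday)
        counts[(iso[0], iso[1])][label] += 1

    result = {}
    for week, counter in sorted(counts.items()):
        total = sum(counter.values())
        result[week] = {name: counter[label] / total
                        for label, name in NAMES.items()}
    return result


# Made-up daily predictions for illustration.
sample = [
    (date(2024, 1, 1), 1),
    (date(2024, 1, 2), 1),
    (date(2024, 1, 3), 0),
    (date(2024, 1, 5), -1),
    (date(2024, 1, 8), -1),
]

ratios = weekly_ratios(sample)
print(ratios[(2024, 1)])
# {'positive': 0.5, 'neutral': 0.25, 'negative': 0.25}
```

The per-week ratio dictionaries are the kind of compact input that could be placed in a prompt for the summarization model.
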
---
## Intended Use
This repository is intended for:
- Research on embedding-based sentiment classification
- Educational exploration of Word2Vec, GloVe, and Sentence Transformer embeddings
- Demonstrating weekly sentiment aggregation
- Benchmarking classical ML approaches for small financial text datasets
---
## Limitations
- Small dataset (349 samples)
- Potential overfitting of classical models
- Not designed for automated trading or real-time systems
- Weekly summaries rely on LLM outputs and may include stylistic bias
- No direct price prediction or financial forecasting
This model is best used for experimentation and learning.
---
## Related Publication
The full experimental design, results, and limitations are documented in the following model report:
> Roy, Joyjit.
> *NLP Stock Sentiment Analysis: A Comparative Embedding-Based Model Report.*
> SSRN, 2024.
> https://ssrn.com/abstract=5784922
> DOI: http://dx.doi.org/10.2139/ssrn.5784922
---
## Citation
If you use this model, dataset, or workflow, please cite:
**Model Report**
> Roy, Joyjit.
> *NLP Stock Sentiment Analysis: A Comparative Embedding-Based Model Report.*
> SSRN, August 2024.
> https://ssrn.com/abstract=5784922
> DOI: http://dx.doi.org/10.2139/ssrn.5784922
**Dataset**
> Roy, Joyjit.
> *Stock Market News Sentiment Analysis - Supplementary Materials.*
> Zenodo, 2025.
> https://doi.org/10.5281/zenodo.17510735
---