---
license: mit
language:
- en
metrics:
- accuracy
- precision
- recall
- f1
pipeline_tag: text-classification
paperswithcode_id: nlp-stock-sentiment-analysis
library_name: sklearn
tags:
- finance
- sentiment-analysis
- embeddings
- gradient-boosting
- classical-ml
- market-analysis
- nlp
- weekly-sentiment
---
<p align="left">
<a href="https://github.com/joyjitroy/Machine_Learning/tree/main/NLP_Stock_Sentiment_Analysis">
<img src="https://img.shields.io/badge/GitHub-Repo-blue?logo=github" />
</a>
<a href="https://doi.org/10.5281/zenodo.17510735">
<img src="https://img.shields.io/badge/Zenodo-DOI-1877f2?logo=zenodo" />
</a>
<a href="https://huggingface.co/joyjitroy/Stock_Market_News_Sentiment_Analysis">
<img src="https://img.shields.io/badge/HuggingFace-Model-yellow?logo=huggingface" />
</a>
<a href="https://ssrn.com/abstract=5784922">
<img src="https://img.shields.io/badge/SSRN-Preprint-red" />
</a>
<img src="https://img.shields.io/badge/License-MIT-green" />
</p>
# 📰 NLP Stock Sentiment Analysis: Embedding-Based Models for Market Signals
![Model3_Thumbnail](https://cdn-uploads.huggingface.co/production/uploads/68faf2c58b9b8d06b47b769c/uuw0OJrmZ3No5S17TJ5ti.png)
This model card documents the resources and workflow described in the paper:
**“NLP Stock Sentiment Analysis: A Comparative Embedding-Based Model Report”**
Joyjit Roy
SSRN, August 2024
Available at: https://ssrn.com/abstract=5784922
DOI: http://dx.doi.org/10.2139/ssrn.5784922
## 👥 Authors
| Name | Email | GitHub | ORCID | Google Scholar | ResearchGate |
|------|-------|--------|-------|----------------|--------------|
| **[Joyjit Roy](https://www.linkedin.com/in/royjoyjit/)** | [joyjit.roy.tech@gmail.com](mailto:joyjit.roy.tech@gmail.com) | [GitHub](https://github.com/joyjitroy) | [ORCID](https://orcid.org/0009-0000-0886-782X) | [Google Scholar](https://scholar.google.com/citations?hl=en&user=66qcSP8AAAAJ) | [ResearchGate](https://www.researchgate.net/profile/Joyjit-Roy-3) |
| **[Samaresh Kumar Singh](https://www.linkedin.com/in/samaresh-singh-9772ba23/)** | [ssam3003@gmail.com](mailto:ssam3003@gmail.com) | [GitHub](https://github.com/ssam18) | [ORCID](https://orcid.org/0009-0008-1351-0719) | [Google Scholar](https://scholar.google.com/citations?hl=en&user=3z9qzooAAAAJ) | - |
Zenodo DOI: https://doi.org/10.5281/zenodo.17510735
GitHub: https://github.com/joyjitroy/Machine_Learning/tree/main/NLP_Stock_Sentiment_Analysis
This project provides a complete NLP workflow for classifying the sentiment of headline-level financial news and aggregating weekly sentiment indicators. It serves as a reproducible reference for embedding-based sentiment analysis using classical machine learning models.
---
## 🔍 Project Overview
The goal is to transform unstructured financial news into structured sentiment scores that can support market commentary and trend evaluation. Three embedding-based approaches were explored:
### **1. Word2Vec (300D)**
- Trained locally on the news corpus
- Mean pooled per headline
### **2. GloVe (100D)**
- Pretrained vectors
- Mean pooled per headline
### **3. SentenceTransformer Embeddings (384D)**
- Model: `all-MiniLM-L6-v2`
- Direct sentence embeddings without fine-tuning
For each embedding approach, a **Gradient Boosting classifier** is trained and evaluated on the same dataset. Weekly sentiment summaries are generated to demonstrate applied use cases for financial analysis.
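The mean-pooling and classifier steps described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the 4-dimensional `TOY_VECTORS` table, the whitespace tokenizer, the helper name `mean_pool`, and the hyperparameters are all assumptions made for the example. Real runs would use the 300D Word2Vec or 100D GloVe vectors.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical 4-dimensional word vectors standing in for Word2Vec/GloVe.
# Real embeddings would be 300D (Word2Vec) or 100D (GloVe).
TOY_VECTORS = {
    "shares": np.array([0.2, 0.1, 0.0, 0.5]),
    "surge": np.array([0.9, 0.8, 0.1, 0.0]),
    "plunge": np.array([-0.9, -0.7, 0.2, 0.1]),
    "steady": np.array([0.0, 0.1, 0.9, 0.2]),
    "profit": np.array([0.7, 0.6, 0.0, 0.3]),
    "loss": np.array([-0.8, -0.6, 0.1, 0.2]),
}
DIM = 4


def mean_pool(headline: str, vectors: dict, dim: int) -> np.ndarray:
    """Average the vectors of known tokens; return zeros if none are known."""
    tokens = headline.lower().split()  # naive whitespace tokenizer (assumption)
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


# Made-up headlines and labels for illustration only.
headlines = [
    "Shares surge", "Profit surge", "Shares plunge",
    "Loss plunge", "Shares steady", "Steady",
]
labels = [1, 1, -1, -1, 0, 0]  # 1 = positive, 0 = neutral, -1 = negative

# One fixed-length feature row per headline, then a Gradient Boosting fit.
X = np.vstack([mean_pool(h, TOY_VECTORS, DIM) for h in headlines])
clf = GradientBoostingClassifier(n_estimators=20, random_state=0)
clf.fit(X, labels)

pooled = mean_pool("Shares surge", TOY_VECTORS, DIM)
```

Mean pooling gives every headline a vector of the same length regardless of word count, which is what lets a tabular classifier such as Gradient Boosting consume the text. Swapping in SentenceTransformer embeddings would replace only the `mean_pool` step.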
> Note: These are scikit-learn models and are not deployed as Hugging Face inference endpoints.
---
## 💽 Dataset Summary
The dataset contains **349 financial news headlines**, each paired with OHLCV market indicators and a sentiment label:
`1` = positive, `0` = neutral, `-1` = negative.
The dataset is available on Hugging Face and also archived on Zenodo for citation and long-term access:
👉 https://doi.org/10.5281/zenodo.17510735
### 📊 Data Dictionary
| Column | Description |
|----------|---------------------------------------------------------------|
| `Date` | Date the news item was released |
| `News` | Headline or snippet text |
| `Open` | Opening price (USD) |
| `High` | Highest price (USD) of the day |
| `Low` | Lowest price (USD) of the day |
| `Close` | Adjusted closing price (USD) |
| `Volume` | Total shares traded |
| `Label` | Sentiment (`1`=positive, `0`=neutral, `-1`=negative) |
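As a rough sketch of how the columns above might be read and checked, the snippet below parses CSV text into typed rows and rejects labels outside `1`, `0`, `-1`. The `NewsRow` class, the `parse_rows` helper, the CSV layout, and the sample rows are hypothetical and not taken from the dataset. The date is kept as plain text because the card does not state its format.

```python
import csv
import io
from dataclasses import dataclass

VALID_LABELS = {1: "positive", 0: "neutral", -1: "negative"}


@dataclass
class NewsRow:
    date: str      # kept as text; the date format is not documented here
    news: str
    open: float
    high: float
    low: float
    close: float
    volume: int
    label: int


def parse_rows(csv_text: str) -> list[NewsRow]:
    """Parse CSV text with the documented columns and validate each label."""
    rows = []
    for rec in csv.DictReader(io.StringIO(csv_text)):
        label = int(rec["Label"])
        if label not in VALID_LABELS:
            raise ValueError(f"unexpected label {label!r}")
        rows.append(NewsRow(
            date=rec["Date"], news=rec["News"],
            open=float(rec["Open"]), high=float(rec["High"]),
            low=float(rec["Low"]), close=float(rec["Close"]),
            volume=int(rec["Volume"]), label=label,
        ))
    return rows


# Made-up sample rows for illustration only; not real dataset entries.
SAMPLE = """Date,News,Open,High,Low,Close,Volume,Label
2019-01-02,"Chipmaker cuts revenue forecast",41.7,42.2,41.5,41.9,130672400,-1
2019-01-03,"Retailer reports record holiday sales",43.6,43.8,42.9,43.1,103544800,1
"""

rows = parse_rows(SAMPLE)
```

Validating labels at load time surfaces encoding problems (for example, a label stored as text) before they reach the classifier.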
---
## 📊 Model Performance (Validation)
Across all embedding variants, the **tuned GloVe + Gradient Boosting** model achieved the strongest validation results:
- **Accuracy:** 0.714
- **Precision:** 0.758
- **Recall:** 0.714
- **F1 Score:** 0.694
- **Error Rate:** 0.286
All models reach perfect training accuracy, which on a dataset of 349 headlines suggests overfitting, so validation metrics are used for comparison. Full results appear in the associated paper (Table 3).
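The card does not say how precision, recall, and F1 are averaged across the three classes. The sketch below assumes support-weighted averaging, which fits the reported numbers: weighted recall always equals accuracy (both 0.714 here), and a weighted F1 can fall below both precision and recall because it is averaged per class. The function name `weighted_metrics` and the sample labels are hypothetical.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Return accuracy, error rate, and support-weighted precision/recall/F1."""
    n = len(y_true)
    support = Counter(y_true)        # true count per class
    predicted = Counter(y_pred)      # predicted count per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precision = recall = f1 = 0.0
    for label, count in support.items():
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / count
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = count / n           # class share of the true labels
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    accuracy = sum(correct.values()) / n
    return {
        "accuracy": accuracy,
        "error_rate": 1.0 - accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels for illustration; not the validation split from the paper.
y_true = [1, 1, 1, 0, 0, -1, -1, -1]
y_pred = [1, 1, 0, 0, 0, -1, 1, -1]
scores = weighted_metrics(y_true, y_pred)
```

The error rate is simply `1 - accuracy`, which matches the reported pair of 0.714 and 0.286.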
---
## 🧠 Weekly Sentiment Summaries
To show how sentiment predictions can support market interpretation:
1. Daily predictions are aggregated by week
2. Sentiment ratios (positive / neutral / negative) are computed
3. Weekly summaries are generated using **Mistral-7B-Instruct**
4. These summaries provide narrative insight into market mood
This illustrates a practical workflow where classical NLP models feed into downstream financial analysis.
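Steps 1 and 2 above can be sketched as a small grouping routine. This is one possible reading, assuming weeks are ISO calendar weeks and that each prediction arrives as a `(date, label)` pair. The helper name `weekly_sentiment_ratios` and the sample predictions are hypothetical, and the LLM summarization step is left out because the card does not specify the prompt.

```python
from collections import Counter, defaultdict
from datetime import date

LABEL_NAMES = {1: "positive", 0: "neutral", -1: "negative"}


def weekly_sentiment_ratios(predictions):
    """Group (date, label) pairs by ISO week and return per-week label shares."""
    counts = defaultdict(Counter)
    for day, label in predictions:
        iso = day.isocalendar()      # (ISO year, ISO week, ISO weekday)
        counts[(iso[0], iso[1])][LABEL_NAMES[label]] += 1

    ratios = {}
    for week, counter in sorted(counts.items()):
        total = sum(counter.values())
        ratios[week] = {
            name: counter[name] / total for name in LABEL_NAMES.values()
        }
    return ratios


# Made-up daily predictions for illustration only.
predictions = [
    (date(2019, 1, 7), 1),
    (date(2019, 1, 8), -1),
    (date(2019, 1, 9), 1),
    (date(2019, 1, 10), 0),
    (date(2019, 1, 14), -1),
    (date(2019, 1, 15), -1),
]
weekly = weekly_sentiment_ratios(predictions)
```

Each entry of `weekly` maps a `(year, week)` key to the positive, neutral, and negative shares for that week, which could then be formatted into the text handed to the summarization model.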
---
## 🌟 Intended Use
This repository is intended for:
- Research on embedding-based sentiment classification
- Educational exploration of Word2Vec, GloVe, and Sentence Transformer embeddings
- Demonstrating weekly sentiment aggregation
- Benchmarking classical ML approaches for small financial text datasets
---
## ⚠️ Limitations
- Small dataset (349 samples)
- Potential overfitting of classical models
- Not designed for automated trading or real-time systems
- Weekly summaries rely on LLM outputs and may include stylistic bias
- No direct price prediction or financial forecasting
This model is best used for experimentation and learning.
---
## 📄 Related Publication
The full experimental design, results, and limitations are documented in the following model report:
> Roy, Joyjit.
> *NLP Stock Sentiment Analysis: A Comparative Embedding-Based Model Report.*
> SSRN, 2024.
> https://ssrn.com/abstract=5784922
> DOI: http://dx.doi.org/10.2139/ssrn.5784922
---
## 📘 Citation
If you use this model, dataset, or workflow, please cite:
**Model Report**
> Roy, Joyjit.
> *NLP Stock Sentiment Analysis: A Comparative Embedding-Based Model Report.*
> SSRN, August 2024.
> https://ssrn.com/abstract=5784922
> DOI: http://dx.doi.org/10.2139/ssrn.5784922
**Dataset**
> Roy, Joyjit.
> *Stock Market News Sentiment Analysis – Supplementary Materials.*
> Zenodo, 2025.
> https://doi.org/10.5281/zenodo.17510735
---