---
license: mit
language:
- en
metrics:
- accuracy
- precision
- recall
- f1
pipeline_tag: text-classification
paperswithcode_id: nlp-stock-sentiment-analysis
library_name: sklearn
tags:
- finance
- sentiment-analysis
- embeddings
- gradient-boosting
- classical-ml
- market-analysis
- nlp
- weekly-sentiment
---
<p align="left">
<a href="https://github.com/joyjitroy/Machine_Learning/tree/main/NLP_Stock_Sentiment_Analysis">
<img src="https://img.shields.io/badge/GitHub-Repo-blue?logo=github" />
</a>
<a href="https://doi.org/10.5281/zenodo.17510735">
<img src="https://img.shields.io/badge/Zenodo-DOI-1877f2?logo=zenodo" />
</a>
<a href="https://huggingface.co/joyjitroy/Stock_Market_News_Sentiment_Analysis">
<img src="https://img.shields.io/badge/HuggingFace-Model-yellow?logo=huggingface" />
</a>
<a href="https://ssrn.com/abstract=5784922">
<img src="https://img.shields.io/badge/SSRN-Preprint-red" />
</a>
<img src="https://img.shields.io/badge/License-MIT-green" />
</p>

# NLP Stock Sentiment Analysis: Embedding-Based Models for Market Signals

This model card documents the resources and workflow described in the paper:

**"NLP Stock Sentiment Analysis: A Comparative Embedding-Based Model Report"**
Joyjit Roy
SSRN, August 2024
Available at: https://ssrn.com/abstract=5784922
DOI: http://dx.doi.org/10.2139/ssrn.5784922

## Authors

| Name | Email | GitHub | ORCID | Google Scholar | ResearchGate |
|------|-------|--------|-------|----------------|--------------|
| **[Joyjit Roy](https://www.linkedin.com/in/royjoyjit/)** | [joyjit.roy.tech@gmail.com](mailto:joyjit.roy.tech@gmail.com) | [GitHub](https://github.com/joyjitroy) | [ORCID](https://orcid.org/0009-0000-0886-782X) | [Google Scholar](https://scholar.google.com/citations?hl=en&user=66qcSP8AAAAJ) | [ResearchGate](https://www.researchgate.net/profile/Joyjit-Roy-3) |
| **[Samaresh Kumar Singh](https://www.linkedin.com/in/samaresh-singh-9772ba23/)** | [ssam3003@gmail.com](mailto:ssam3003@gmail.com) | [GitHub](https://github.com/ssam18) | [ORCID](https://orcid.org/0009-0008-1351-0719) | [Google Scholar](https://scholar.google.com/citations?hl=en&user=3z9qzooAAAAJ) | N/A |

Zenodo DOI: https://doi.org/10.5281/zenodo.17510735
GitHub: https://github.com/joyjitroy/Machine_Learning/tree/main/NLP_Stock_Sentiment_Analysis

This project provides a complete NLP workflow for classifying the sentiment of financial news headlines and aggregating the predictions into weekly sentiment indicators. It serves as a reproducible reference for embedding-based sentiment analysis using classical machine learning models.

---
## Project Overview
The goal is to transform unstructured financial news into structured sentiment scores that can support market commentary and trend evaluation. Three embedding-based approaches were explored:
### **1. Word2Vec (300D)**
- Trained locally on the news corpus
- Mean-pooled per headline (the headline vector is the average of its word vectors)
### **2. GloVe (100D)**
- Pretrained vectors
- Mean-pooled per headline
### **3. SentenceTransformer Embeddings (384D)**
- Model: `all-MiniLM-L6-v2`
- Direct sentence embeddings without fine-tuning

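Mean pooling, as used for the Word2Vec and GloVe variants above, can be sketched as follows. This is a minimal illustration, not the project's code: the `mean_pool` helper, the whitespace tokenization, the zero-vector fallback for unknown words, and the 4-dimensional `TOY_VECTORS` are all assumptions made for the example.

```python
import numpy as np

# Hypothetical 4-dimensional word vectors, for illustration only.
# Real Word2Vec or GloVe vectors here would have 300 or 100 dimensions.
TOY_VECTORS = {
    "stocks": np.array([0.2, 0.1, 0.0, 0.5]),
    "rally": np.array([0.6, 0.3, 0.2, 0.1]),
    "fall": np.array([-0.4, 0.1, 0.0, 0.3]),
}


def mean_pool(headline, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    tokens = headline.lower().split()  # assumed tokenization: lowercase + whitespace
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


# "on" and "earnings" are not in the toy vocabulary and are skipped.
embedding = mean_pool("Stocks rally on earnings", TOY_VECTORS, dim=4)
print(embedding)
```
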
For each embedding approach, a **Gradient Boosting classifier** is trained and evaluated on the same dataset. Weekly sentiment summaries are generated to demonstrate applied use cases for financial analysis.

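A rough sketch of fitting a Gradient Boosting classifier on embedding features is shown below. The features are synthetic stand-ins generated with NumPy, and the split ratio, `n_estimators`, and random seeds are illustrative assumptions rather than the settings used in the paper.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for headline embeddings: 90 samples, 100 dimensions.
# Each class is shifted slightly so the toy problem is learnable.
labels = np.repeat([-1, 0, 1], 30)
features = rng.normal(size=(90, 100)) + labels[:, None] * 0.5

# Stratified split so every sentiment class appears in both subsets.
X_train, X_val, y_train, y_val = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=0
)

clf = GradientBoostingClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

train_acc = clf.score(X_train, y_train)
val_acc = clf.score(X_val, y_val)
print(f"train accuracy: {train_acc:.3f}, validation accuracy: {val_acc:.3f}")
```
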
> Note: These models are scikit-learn models and are not directly deployed as Hugging Face inference endpoints.

---
## Dataset Summary
The dataset contains **349 financial news headlines**, each paired with OHLCV (open, high, low, close, volume) market indicators and a sentiment label:
`1` = positive, `0` = neutral, `-1` = negative.

The dataset is available on Hugging Face and also archived on Zenodo for citation and long-term access:
https://doi.org/10.5281/zenodo.17510735

### Data Dictionary
| Column   | Description                                            |
|----------|--------------------------------------------------------|
| `Date`   | Date the news item was released                        |
| `News`   | Headline or snippet text                               |
| `Open`   | Opening price (USD)                                    |
| `High`   | Highest price (USD) of the day                         |
| `Low`    | Lowest price (USD) of the day                          |
| `Close`  | Adjusted closing price (USD)                           |
| `Volume` | Total shares traded                                    |
| `Label`  | Sentiment (`1`=positive, `0`=neutral, `-1`=negative)   |

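The schema above could be checked while loading the data, roughly as sketched here. The rows in `SAMPLE_CSV` are made up for illustration and are not taken from the dataset; the CSV layout, the date format, and the `load_rows` helper are likewise assumptions.

```python
import csv
import io
from collections import Counter

EXPECTED_COLUMNS = ["Date", "News", "Open", "High", "Low", "Close", "Volume", "Label"]
LABEL_NAMES = {1: "positive", 0: "neutral", -1: "negative"}

# Made-up rows for illustration; they are not taken from the dataset.
SAMPLE_CSV = """Date,News,Open,High,Low,Close,Volume,Label
2019-01-02,Example headline about strong earnings,41.5,42.1,41.2,41.9,1000000,1
2019-01-03,Example headline about a product delay,41.9,42.0,40.8,41.0,1200000,-1
2019-01-04,Example headline about a scheduled meeting,41.0,41.6,40.9,41.3,900000,0
"""


def load_rows(handle):
    """Read rows, failing early on missing columns or unexpected labels."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for raw in reader:
        label = int(raw["Label"])
        if label not in LABEL_NAMES:
            raise ValueError(f"unexpected label: {label}")
        rows.append({
            "date": raw["Date"],
            "news": raw["News"],
            "close": float(raw["Close"]),
            "label": label,
        })
    return rows


rows = load_rows(io.StringIO(SAMPLE_CSV))
label_counts = Counter(LABEL_NAMES[r["label"]] for r in rows)
print(label_counts)
```
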
---
## Model Performance (Validation)
Across all embedding variants, the **tuned GloVe + Gradient Boosting** model achieved the strongest validation results:
- **Accuracy:** 0.714
- **Precision:** 0.758
- **Recall:** 0.714
- **F1 Score:** 0.694
- **Error Rate:** 0.286

All models reach perfect training accuracy, which is expected on a dataset this small; validation metrics are therefore used for fair comparison. Full results appear in the associated paper (Table 3).

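The relationship between these metrics can be illustrated with a small pure-Python sketch. The reported recall equals the reported accuracy, which is what support-weighted averaging over classes produces, and the error rate is one minus accuracy. The card does not state the averaging mode, so weighted averaging is an assumption here, and the labels below are made up rather than the paper's predictions.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1."""
    n = len(y_true)
    support = Counter(y_true)      # true samples per class
    predicted = Counter(y_pred)    # predicted samples per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precision = recall = f1 = 0.0
    for cls, count in support.items():
        p = correct[cls] / predicted[cls] if predicted[cls] else 0.0
        r = correct[cls] / count
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = count / n         # class weight = share of true samples
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    accuracy = sum(correct.values()) / n
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "error_rate": 1.0 - accuracy,
    }


# Made-up labels for illustration; these are not the paper's predictions.
y_true = [1, 1, 1, 0, 0, -1, -1, -1]
y_pred = [1, 1, 0, 0, 0, -1, 1, -1]
scores = weighted_metrics(y_true, y_pred)
print({k: round(v, 3) for k, v in scores.items()})
```
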
---
## Weekly Sentiment Summaries
To show how sentiment predictions can support market interpretation:
1. Daily predictions are aggregated by week
2. Sentiment ratios (positive / neutral / negative) are computed
3. Weekly summaries are generated using **Mistral-7B-Instruct**
4. These summaries provide narrative insight into market mood

This illustrates a practical workflow where classical NLP models feed into downstream financial analysis.

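Steps 1 and 2 of the workflow above might look like the following sketch. Grouping by ISO calendar week is an assumption (the card does not say how weeks are delimited), and the `weekly_sentiment_ratios` helper and the sample predictions are hypothetical. The resulting ratios would then be passed to the summarization model; the prompt used for that step is not given in this card.

```python
from collections import Counter, defaultdict
from datetime import date

LABEL_NAMES = {1: "positive", 0: "neutral", -1: "negative"}


def weekly_sentiment_ratios(predictions):
    """Group (date, label) pairs by ISO week and return label shares per week."""
    by_week = defaultdict(Counter)
    for day, label in predictions:
        iso = day.isocalendar()  # (ISO year, ISO week number, weekday)
        by_week[(iso[0], iso[1])][LABEL_NAMES[label]] += 1

    ratios = {}
    for week, counts in sorted(by_week.items()):
        total = sum(counts.values())
        ratios[week] = {name: counts[name] / total for name in LABEL_NAMES.values()}
    return ratios


# Made-up predictions for illustration only.
predictions = [
    (date(2019, 1, 7), 1),
    (date(2019, 1, 8), 1),
    (date(2019, 1, 9), -1),
    (date(2019, 1, 10), 0),
    (date(2019, 1, 14), -1),
    (date(2019, 1, 15), -1),
]

for week, shares in weekly_sentiment_ratios(predictions).items():
    print(week, shares)
```
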
---
## Intended Use
This repository is intended for:
- Research on embedding-based sentiment classification
- Educational exploration of Word2Vec, GloVe, and Sentence Transformer embeddings
- Demonstrating weekly sentiment aggregation
- Benchmarking classical ML approaches for small financial text datasets

---
## Limitations
- Small dataset (349 samples)
- Potential overfitting of classical models
- Not designed for automated trading or real-time systems
- Weekly summaries rely on LLM outputs and may include stylistic bias
- No direct price prediction or financial forecasting

This model is best used for experimentation and learning.

---
## Related Publication

The full experimental design, results, and limitations are documented in the following model report:

> Roy, Joyjit.
> *NLP Stock Sentiment Analysis: A Comparative Embedding-Based Model Report.*
> SSRN, 2024.
> https://ssrn.com/abstract=5784922
> DOI: http://dx.doi.org/10.2139/ssrn.5784922

---

## Citation
If you use this model, dataset, or workflow, please cite:

**Model Report**
> Roy, Joyjit.
> *NLP Stock Sentiment Analysis: A Comparative Embedding-Based Model Report.*
> SSRN, August 2024.
> https://ssrn.com/abstract=5784922
> DOI: http://dx.doi.org/10.2139/ssrn.5784922

**Dataset**
> Roy, Joyjit.
> *Stock Market News Sentiment Analysis - Supplementary Materials.*
> Zenodo, 2025.
> https://doi.org/10.5281/zenodo.17510735

---