	---
	license: mit
	language:
	- en
	metrics:
	- accuracy
	- precision
	- recall
	- f1
	pipeline_tag: text-classification
	paperswithcode_id: nlp-stock-sentiment-analysis
	library_name: sklearn
	tags:
	- finance
	- sentiment-analysis
	- embeddings
	- gradient-boosting
	- classical-ml
	- market-analysis
	- nlp
	- weekly-sentiment
	---
	<p align="left">
	<a href="https://github.com/joyjitroy/Machine_Learning/tree/main/NLP_Stock_Sentiment_Analysis">
	<img src="https://img.shields.io/badge/GitHub-Repo-blue?logo=github" />
	</a>
	<a href="https://doi.org/10.5281/zenodo.17510735">
	<img src="https://img.shields.io/badge/Zenodo-DOI-1877f2?logo=zenodo" />
	</a>
	<a href="https://huggingface.co/joyjitroy/Stock_Market_News_Sentiment_Analysis">
	<img src="https://img.shields.io/badge/HuggingFace-Model-yellow?logo=huggingface" />
	</a>
	<a href="https://ssrn.com/abstract=5784922">
	<img src="https://img.shields.io/badge/SSRN-Preprint-red" />
	</a>
	<img src="https://img.shields.io/badge/License-MIT-green" />
	</p>

	# 📰 NLP Stock Sentiment Analysis — Embedding-Based Models for Market Signals

	![Model3_Thumbnail](https://cdn-uploads.huggingface.co/production/uploads/68faf2c58b9b8d06b47b769c/uuw0OJrmZ3No5S17TJ5ti.png)

	This model card documents the resources and workflow described in the paper:

	“NLP Stock Sentiment Analysis: A Comparative Embedding-Based Model Report”
	Joyjit Roy
	SSRN, August 2024
	Available at: https://ssrn.com/abstract=5784922
	DOI: http://dx.doi.org/10.2139/ssrn.5784922

	## 👥 Authors

	| Name | Email | GitHub | ORCID | Google Scholar | ResearchGate |
	|------|-------|--------|-------|----------------|--------------|
	| [Joyjit Roy](https://www.linkedin.com/in/royjoyjit/) | [joyjit.roy.tech@gmail.com](mailto:joyjit.roy.tech@gmail.com) | [GitHub](https://github.com/joyjitroy) | [ORCID](https://orcid.org/0009-0000-0886-782X) | [Google Scholar](https://scholar.google.com/citations?hl=en&user=66qcSP8AAAAJ) | [ResearchGate](https://www.researchgate.net/profile/Joyjit-Roy-3) |
	| [Samaresh Kumar Singh](https://www.linkedin.com/in/samaresh-singh-9772ba23/) | [ssam3003@gmail.com](mailto:ssam3003@gmail.com) | [GitHub](https://github.com/ssam18) | [ORCID](https://orcid.org/0009-0008-1351-0719) | [Google Scholar](https://scholar.google.com/citations?hl=en&user=3z9qzooAAAAJ) | N/A |


	Zenodo DOI: https://doi.org/10.5281/zenodo.17510735
	GitHub: https://github.com/joyjitroy/Machine_Learning/tree/main/NLP_Stock_Sentiment_Analysis

	This project provides a complete NLP workflow for classifying the sentiment of headline-level financial news and aggregating weekly sentiment indicators. It serves as a reproducible reference for embedding-based sentiment analysis using classical machine learning models.

	---
	## 🔍 Project Overview
	The goal is to transform unstructured financial news into structured sentiment scores that can support market commentary and trend evaluation. Three embedding-based approaches were explored:
	### 1. Word2Vec (300D)
	- Trained locally on the news corpus
	- Mean pooled per headline
	### 2. GloVe (100D)
	- Pretrained vectors
	- Mean pooled per headline
	### 3. SentenceTransformer Embeddings (384D)
	- Model: `all-MiniLM-L6-v2`
	- Direct sentence embeddings without fine-tuning
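
For the Word2Vec and GloVe variants, "mean pooled per headline" means averaging the word vectors of a headline into one fixed-length vector. The following is a minimal sketch of that idea, not the project's actual code. The `mean_pool` helper, the whitespace tokenizer, the zero-vector fallback for headlines with no known words, and the 4-dimensional toy vectors are all assumptions made for illustration.

```python
import numpy as np


def mean_pool(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # Assumed fallback: a headline with no in-vocabulary words maps to zeros.
        return np.zeros(dim)
    return np.mean(known, axis=0)


# Hypothetical 4-dimensional vectors, for illustration only.
# The project itself uses 300D Word2Vec and 100D GloVe vectors.
toy_vectors = {
    "shares": np.array([1.0, 0.0, 2.0, 0.0]),
    "rally": np.array([3.0, 2.0, 0.0, 0.0]),
}

headline = "Shares rally after earnings"
tokens = headline.lower().split()  # naive whitespace tokenizer (assumption)
embedding = mean_pool(tokens, toy_vectors, dim=4)
print(embedding)
```

Out-of-vocabulary words ("after", "earnings") are simply skipped here, so the result is the average of the two known vectors.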

	For each embedding approach, a Gradient Boosting classifier is trained and evaluated on the same dataset. Weekly sentiment summaries are generated to demonstrate applied use cases for financial analysis.
	> Note: These are scikit-learn models and are not directly deployed as Hugging Face inference endpoints.
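
As a rough illustration of the training step, the sketch below fits a scikit-learn `GradientBoostingClassifier` on embedding-like features and scores it on a held-out split. It is not the project's pipeline: the features and labels are synthetic, and the 80/20 stratified split, default hyperparameters, and random seed are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for headline embeddings: 120 rows, 16 dimensions.
# Real features would be the 300D, 100D, or 384D vectors described above.
X = rng.normal(size=(120, 16))
# Synthetic labels in {-1, 0, 1}, tied to the first feature for illustration.
y = np.where(X[:, 0] > 0.5, 1, np.where(X[:, 0] < -0.5, -1, 0))

# Assumed 80/20 split, stratified so each class appears in both parts.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = GradientBoostingClassifier(random_state=0)  # default hyperparameters
clf.fit(X_train, y_train)

val_pred = clf.predict(X_val)
val_accuracy = float((val_pred == y_val).mean())
print(f"validation accuracy: {val_accuracy:.3f}")
```

Swapping in a different embedding only changes how `X` is built; the classifier code stays the same, which is what makes the three variants comparable.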

	---
	## 💽 Dataset Summary
	The dataset contains 349 financial news headlines, each paired with OHLCV market indicators and a sentiment label:
	`1` = positive, `0` = neutral, `-1` = negative.

	The dataset is available on Hugging Face and also archived on Zenodo for citation and long-term access:
	👉 https://doi.org/10.5281/zenodo.17510735

	### 📊 Data Dictionary
	| Column | Description |
	|----------|---------------------------------------------------------------|
	| `Date` | Date the news item was released |
	| `News` | Headline or snippet text |
	| `Open` | Opening price (USD) |
	| `High` | Highest price (USD) of the day |
	| `Low` | Lowest price (USD) of the day |
	| `Close` | Adjusted closing price (USD) |
	| `Volume` | Total shares traded |
	| `Label` | Sentiment (`1`=positive, `0`=neutral, `-1`=negative) |
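
The data dictionary above can be turned into a small typed loader. The sketch below is one possible way to do that with the standard library. The `NewsRow` class and `parse_row` helper are hypothetical names, the `%Y-%m-%d` date format is an assumption about the file, and the sample record is invented rather than taken from the dataset.

```python
import csv
import io
from dataclasses import dataclass
from datetime import date, datetime


@dataclass
class NewsRow:
    day: date
    news: str
    open: float
    high: float
    low: float
    close: float
    volume: int
    label: int  # 1 = positive, 0 = neutral, -1 = negative


def parse_row(raw):
    """Convert one CSV record (a dict of strings) into a typed NewsRow."""
    label = int(raw["Label"])
    if label not in (-1, 0, 1):
        raise ValueError(f"unexpected sentiment label: {label}")
    return NewsRow(
        day=datetime.strptime(raw["Date"], "%Y-%m-%d").date(),  # assumed format
        news=raw["News"].strip(),
        open=float(raw["Open"]),
        high=float(raw["High"]),
        low=float(raw["Low"]),
        close=float(raw["Close"]),
        volume=int(raw["Volume"]),
        label=label,
    )


# Invented sample record, not taken from the real dataset.
sample_csv = io.StringIO(
    "Date,News,Open,High,Low,Close,Volume,Label\n"
    '2024-01-02,"Example headline, for illustration",10.0,11.0,9.5,10.5,1000,1\n'
)
rows = [parse_row(record) for record in csv.DictReader(sample_csv)]
print(rows[0])
```

Rejecting labels outside `-1`, `0`, `1` at load time catches encoding mistakes before they reach the classifier.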

	---
	## 📊 Model Performance (Validation)
	Across all embedding variants, the tuned GloVe + Gradient Boosting model achieved the strongest validation results:
	- Accuracy: 0.714
	- Precision: 0.758
	- Recall: 0.714
	- F1 Score: 0.694
	- Error Rate: 0.286

	All models reach perfect training accuracy, which on a dataset this small points to overfitting rather than genuine skill; validation metrics are therefore used for comparison. Full results appear in the associated paper (Table 3).
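
The reported recall equals the accuracy (0.714) and the error rate is one minus the accuracy (0.286). That pattern is what support-weighted averaging over classes produces, although this card does not state which averaging mode was used. The sketch below computes the five metrics under that assumption in plain Python. The `weighted_metrics` function and the toy labels are illustrative only and do not reproduce the reported numbers.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy, error rate, and support-weighted precision, recall, and F1."""
    n = len(y_true)
    support = Counter(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

    precision = recall = f1 = 0.0
    for cls, count in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        predicted = sum(p == cls for p in y_pred)
        cls_precision = tp / predicted if predicted else 0.0
        cls_recall = tp / count
        denom = cls_precision + cls_recall
        cls_f1 = 2 * cls_precision * cls_recall / denom if denom else 0.0
        weight = count / n  # share of true samples belonging to this class
        precision += weight * cls_precision
        recall += weight * cls_recall
        f1 += weight * cls_f1

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "error_rate": 1 - accuracy,
    }


# Toy labels for illustration only (not the project's validation set).
y_true = [1, 1, 1, 0, 0, -1, -1, -1]
y_pred = [1, 1, 0, 0, 0, -1, 1, -1]
metrics = weighted_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With support weighting, recall always collapses to accuracy, because each class's true positives are divided by its support and then multiplied back by that same support over the total.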

	---
	## 🧠 Weekly Sentiment Summaries
	To show how sentiment predictions can support market interpretation:
	1. Daily predictions are aggregated by week
	2. Sentiment ratios (positive / neutral / negative) are computed
	3. Weekly summaries are generated using Mistral-7B-Instruct
	4. These summaries provide narrative insight into market mood

	This illustrates a practical workflow where classical NLP models feed into downstream financial analysis.
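
Steps 1 and 2 above can be sketched with the standard library alone. This is a minimal illustration, not the project's implementation: grouping by ISO calendar week is an assumption (the card does not say how week boundaries are defined), and the function name and the sample predictions are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import date

LABEL_NAMES = {1: "positive", 0: "neutral", -1: "negative"}


def weekly_sentiment_ratios(predictions):
    """Group (date, label) pairs by ISO week and return per-week class ratios."""
    counts = defaultdict(Counter)
    for day, label in predictions:
        iso = day.isocalendar()
        week_key = (iso[0], iso[1])  # (ISO year, ISO week number)
        counts[week_key][LABEL_NAMES[label]] += 1

    ratios = {}
    for week_key, counter in sorted(counts.items()):
        total = sum(counter.values())
        ratios[week_key] = {
            name: counter[name] / total for name in LABEL_NAMES.values()
        }
    return ratios


# Hypothetical daily predictions, for illustration only.
predictions = [
    (date(2024, 1, 1), 1),
    (date(2024, 1, 2), 1),
    (date(2024, 1, 3), -1),
    (date(2024, 1, 4), 0),
    (date(2024, 1, 8), -1),
    (date(2024, 1, 9), -1),
]
weekly = weekly_sentiment_ratios(predictions)
for week_key, shares in weekly.items():
    print(week_key, shares)
```

The per-week ratios are the kind of structured input that could then be passed to an LLM for the narrative summary in step 3; the prompt format used for that step is not described in this card.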

	---
	## 🌟 Intended Use
	This repository is intended for:
	- Research on embedding-based sentiment classification
	- Educational exploration of Word2Vec, GloVe, and Sentence Transformer embeddings
	- Demonstrating weekly sentiment aggregation
	- Benchmarking classical ML approaches for small financial text datasets

	---
	## ⚠️ Limitations
	- Small dataset (349 samples)
	- Potential overfitting of classical models
	- Not designed for automated trading or real-time systems
	- Weekly summaries rely on LLM outputs and may include stylistic bias
	- No direct price prediction or financial forecasting

	This model is best used for experimentation and learning.

	---
	## 📄 Related Publication

	The full experimental design, results, and limitations are documented in the following model report:

	> Roy, Joyjit.
	> NLP Stock Sentiment Analysis: A Comparative Embedding-Based Model Report.
	> SSRN, 2024.
	> https://ssrn.com/abstract=5784922
	> DOI: http://dx.doi.org/10.2139/ssrn.5784922
	---

	## 📘 Citation
	If you use this model, dataset, or workflow, please cite:

	Model Report
	> Roy, Joyjit.
	> NLP Stock Sentiment Analysis: A Comparative Embedding-Based Model Report.
	> SSRN, August 2024.
	> https://ssrn.com/abstract=5784922
	> DOI: http://dx.doi.org/10.2139/ssrn.5784922

	Dataset
	> Roy, Joyjit.
	> Stock Market News Sentiment Analysis – Supplementary Materials.
	> Zenodo, 2025.
	> https://doi.org/10.5281/zenodo.17510735


	---