---
license: mit
language:
- en
metrics:
- accuracy
- precision
- recall
- f1
pipeline_tag: text-classification
paperswithcode_id: nlp-stock-sentiment-analysis
library_name: sklearn
tags:
- finance
- sentiment-analysis
- embeddings
- gradient-boosting
- classical-ml
- market-analysis
- nlp
- weekly-sentiment
---

# 📰 NLP Stock Sentiment Analysis: Embedding-Based Models for Market Signals

![Model3_Thumbnail](https://cdn-uploads.huggingface.co/production/uploads/68faf2c58b9b8d06b47b769c/uuw0OJrmZ3No5S17TJ5ti.png)

This model card documents the resources and workflow described in the paper:

**"NLP Stock Sentiment Analysis: A Comparative Embedding-Based Model Report"**
Joyjit Roy
SSRN, August 2024

- Available at: https://ssrn.com/abstract=5784922
- DOI: http://dx.doi.org/10.2139/ssrn.5784922

## 👥 Authors

| Name | Email | GitHub | ORCID | Google Scholar | ResearchGate |
|------|-------|--------|-------|----------------|--------------|
| **[Joyjit Roy](https://www.linkedin.com/in/royjoyjit/)** | [joyjit.roy.tech@gmail.com](mailto:joyjit.roy.tech@gmail.com) | [GitHub](https://github.com/joyjitroy) | [ORCID](https://orcid.org/0009-0000-0886-782X) | [Google Scholar](https://scholar.google.com/citations?hl=en&user=66qcSP8AAAAJ) | [ResearchGate](https://www.researchgate.net/profile/Joyjit-Roy-3) |
| **[Samaresh Kumar Singh](https://www.linkedin.com/in/samaresh-singh-9772ba23/)** | [ssam3003@gmail.com](mailto:ssam3003@gmail.com) | [GitHub](https://github.com/ssam18) | [ORCID](https://orcid.org/0009-0008-1351-0719) | [Google Scholar](https://scholar.google.com/citations?hl=en&user=3z9qzooAAAAJ) | - |

- Zenodo DOI: https://doi.org/10.5281/zenodo.17510735
- GitHub: https://github.com/joyjitroy/Machine_Learning/tree/main/NLP_Stock_Sentiment_Analysis

This project provides a complete NLP workflow for classifying the sentiment of headline-level financial news and aggregating the predictions into weekly sentiment indicators. It serves as a reproducible reference for embedding-based sentiment analysis using classical machine learning models.

---

## 🔍 Project Overview

The goal is to transform unstructured financial news into structured sentiment scores that can support market commentary and trend evaluation.

Three embedding-based approaches were explored:

### **1. Word2Vec (300D)**
- Trained locally on the news corpus
- Mean pooled per headline

### **2. GloVe (100D)**
- Pretrained vectors
- Mean pooled per headline

### **3. SentenceTransformer Embeddings (384D)**
- Model: `all-MiniLM-L6-v2`
- Direct sentence embeddings without fine-tuning

For each embedding approach, a **Gradient Boosting classifier** is trained and evaluated on the same dataset. Weekly sentiment summaries are generated to demonstrate applied use cases for financial analysis.

> Note: These models are scikit-learn models and are not directly deployed as Hugging Face inference endpoints.

---

## 💽 Dataset Summary

The dataset contains **349 financial news headlines**, each paired with OHLCV market indicators and a sentiment label: `1` = positive, `0` = neutral, `-1` = negative.

The dataset is available on Hugging Face and also archived on Zenodo for citation and long-term access:
👉 https://doi.org/10.5281/zenodo.17510735

### 📊 Data Dictionary

| Column   | Description                                          |
|----------|------------------------------------------------------|
| `Date`   | Date the news item was released                      |
| `News`   | Headline or snippet text                             |
| `Open`   | Opening price (USD)                                  |
| `High`   | Highest price (USD) of the day                       |
| `Low`    | Lowest price (USD) of the day                        |
| `Close`  | Adjusted closing price (USD)                         |
| `Volume` | Total shares traded                                  |
| `Label`  | Sentiment (`1`=positive, `0`=neutral, `-1`=negative) |

---

## 📊 Model Performance (Validation)

Across all embedding variants, the **tuned GloVe + Gradient Boosting** model achieved the strongest validation results:

- **Accuracy:** 0.714
- **Precision:** 0.758
- **Recall:** 0.714
- **F1 Score:** 0.694
- **Error Rate:** 0.286

All models reach perfect training accuracy, which is expected on a dataset this small and says little about generalization; validation metrics are therefore used for fair comparison. Full results appear in the associated paper (Table 3).

---

## 🧠 Weekly Sentiment Summaries

To show how sentiment predictions can support market interpretation:

1. Daily predictions are aggregated by week
2. Sentiment ratios (positive / neutral / negative) are computed
3. Weekly summaries are generated using **Mistral-7B-Instruct**
4. These summaries provide narrative insight into market mood

This illustrates a practical workflow where classical NLP models feed into downstream financial analysis.

---

## 🌟 Intended Use

This repository is intended for:

- Research on embedding-based sentiment classification
- Educational exploration of Word2Vec, GloVe, and Sentence Transformer embeddings
- Demonstrating weekly sentiment aggregation
- Benchmarking classical ML approaches for small financial text datasets

---

## ⚠️ Limitations

- Small dataset (349 samples)
- Potential overfitting of classical models
- Not designed for automated trading or real-time systems
- Weekly summaries rely on LLM outputs and may include stylistic bias
- No direct price prediction or financial forecasting

This model is best used for experimentation and learning.

---

## 📄 Related Publication

The full experimental design, results, and limitations are documented in the following model report:

> Roy, Joyjit.
> *NLP Stock Sentiment Analysis: A Comparative Embedding-Based Model Report.*
> SSRN, 2024.
> https://ssrn.com/abstract=5784922
> DOI: http://dx.doi.org/10.2139/ssrn.5784922

---

## 📘 Citation

If you use this model, dataset, or workflow, please cite:

**Model Report**

> Roy, Joyjit.
> *NLP Stock Sentiment Analysis: A Comparative Embedding-Based Model Report.*
> SSRN, August 2024.
> https://ssrn.com/abstract=5784922
> DOI: http://dx.doi.org/10.2139/ssrn.5784922

**Dataset**

> Roy, Joyjit.
> *Stock Market News Sentiment Analysis – Supplementary Materials.*
> Zenodo, 2025.
> https://doi.org/10.5281/zenodo.17510735

---
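
The first two steps of the weekly sentiment workflow described earlier (aggregating daily predictions by week and computing positive / neutral / negative ratios) can be sketched in plain Python. This is a minimal illustration, not the code from the paper or repository: the function name `weekly_sentiment_ratios`, the choice of ISO calendar weeks as the grouping key, and the sample dates and labels are all assumptions made for the example. Only the label convention (`1`, `0`, `-1`) comes from the dataset description.

```python
from collections import Counter, defaultdict
from datetime import date

# Label convention from the dataset: 1 = positive, 0 = neutral, -1 = negative.
LABEL_NAMES = {1: "positive", 0: "neutral", -1: "negative"}


def weekly_sentiment_ratios(predictions):
    """Group (date, label) pairs by ISO week and return per-week label shares.

    Hypothetical helper: grouping by ISO (year, week) is an assumption,
    not necessarily the week definition used in the paper.
    """
    counts_by_week = defaultdict(Counter)
    for day, label in predictions:
        iso = day.isocalendar()  # (ISO year, ISO week number, weekday)
        counts_by_week[(iso[0], iso[1])][LABEL_NAMES[label]] += 1

    ratios = {}
    for week, counts in sorted(counts_by_week.items()):
        total = sum(counts.values())
        # Counter returns 0 for labels that never occurred in this week.
        ratios[week] = {name: counts[name] / total for name in LABEL_NAMES.values()}
    return ratios


# Hypothetical per-headline predictions, not taken from the dataset.
sample = [
    (date(2019, 1, 2), 1),
    (date(2019, 1, 3), -1),
    (date(2019, 1, 4), -1),
    (date(2019, 1, 4), 0),
    (date(2019, 1, 7), 1),
    (date(2019, 1, 8), 1),
]

for week, shares in weekly_sentiment_ratios(sample).items():
    print(week, shares)
```

The resulting per-week ratio dictionaries are the kind of structured input that could then be passed to a language model prompt for the narrative summary step; that step is not shown here.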