
FNSPID Sentiment Analysis Pipeline

Overview

This pipeline analyzes the relationship between news sentiment (from FNSPID dataset) and financial market movements using the paper's methodology:

  • FinBERT and RoBERTa dual model predictions (as per paper)
  • Meta-classifier (XGBoost) for sentiment aggregation (paper's approach)
  • Rolling correlation analysis
  • Johansen cointegration tests
  • Statistical analysis and visualization

Key Features - Paper Methodology

✅ Dual Model Architecture: Uses both FinBERT (ProsusAI/finbert) and RoBERTa (cardiffnlp/twitter-roberta-base-sentiment)
✅ Meta-Classifier: XGBoost meta-classifier trained on FPB AllAgree dataset
✅ Feature Engineering: Probability-based features matching paper's approach
✅ Ensemble Method: Falls back to ensemble averaging if meta-classifier unavailable
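
The ensemble fallback can be pictured as averaging the two models' class probabilities. The following is a minimal sketch, not the code in fnspid_pipeline.py: it assumes both probability arrays have already been reordered to a shared (negative, neutral, positive) column order, and the function name `ensemble_average` and the equal weighting are illustrative assumptions.

```python
import numpy as np

# Assumed shared column order; each model's native label order must be
# mapped to this before averaging.
LABELS = ("negative", "neutral", "positive")


def ensemble_average(p_finbert, p_roberta):
    """Average two (n_samples, 3) probability arrays and pick the top class."""
    p_finbert = np.asarray(p_finbert, dtype=float)
    p_roberta = np.asarray(p_roberta, dtype=float)
    if p_finbert.shape != p_roberta.shape:
        raise ValueError("probability arrays must have the same shape")
    avg = (p_finbert + p_roberta) / 2.0  # equal weights (assumption)
    labels = [LABELS[i] for i in avg.argmax(axis=1)]
    return avg, labels


# Toy probabilities for two articles (made-up numbers, not model output).
avg, labels = ensemble_average(
    [[0.1, 0.2, 0.7], [0.6, 0.3, 0.1]],
    [[0.2, 0.2, 0.6], [0.2, 0.6, 0.2]],
)
print(labels)
```

Because each input row sums to 1, the averaged rows do too, so the result can be used wherever a single model's probabilities would be.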

Files

  • fnspid_pipeline.py - Main pipeline implementation
  • run_pipeline.py - Simple runner script
  • config.py - Configuration parameters
  • nasdaq_2018_2019.csv - FNSPID news data (2018-2019)
  • Price_2018_2019/ - Directory containing ETF price data

Setup

  1. Install required packages:

    pip install matplotlib arch statsmodels torch transformers pandas numpy tqdm scikit-learn
    
  2. Ensure data files are in place:

    • nasdaq_2018_2019.csv (FNSPID news data)
    • Price_2018_2019/*.csv (price data files)

Usage

Quick Start

python run_pipeline.py

Direct Execution

python fnspid_pipeline.py

Custom Configuration

Edit config.py to modify:

  • Date ranges
  • Price files to analyze
  • Model parameters
  • Output settings

Pipeline Steps

  1. Data Loading: Load news articles and price data
  2. Sentiment Analysis: Score news articles using FinBERT and RoBERTa, combined by the meta-classifier (or ensemble averaging as a fallback)
  3. Daily Aggregation: Aggregate sentiment scores by date
  4. Alignment: Align sentiment with price data
  5. Correlation Analysis: Rolling correlation between returns and sentiment
  6. Cointegration Tests: Johansen tests for long-term relationships
  7. Visualization: Generate correlation plots
  8. Results Export: Save analysis results to CSV
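
Steps 3 to 5 can be sketched with pandas as follows. This is an illustration under stated assumptions, not the pipeline's actual code: the per-article `score` column, the use of log returns, the 20-day window, and both function names are hypothetical. It also assumes the news and price dates are timezone-naive.

```python
import numpy as np
import pandas as pd


def daily_sentiment(news: pd.DataFrame) -> pd.Series:
    """Step 3: mean per-article score for each calendar date."""
    dates = pd.to_datetime(news["Date"]).dt.normalize()
    return news.groupby(dates)["score"].mean().rename("sentiment")


def rolling_return_sentiment_corr(prices, sentiment, window=30):
    """Steps 4-5: align on date, then rolling correlation of returns and sentiment."""
    px = (
        prices.assign(date=pd.to_datetime(prices["date"]))
        .set_index("date")["adj close"]
        .sort_index()
    )
    returns = np.log(px).diff().rename("ret")  # log returns (assumption)
    # Inner join keeps only dates present in both series.
    aligned = pd.concat([returns, sentiment], axis=1, join="inner").dropna()
    return aligned["ret"].rolling(window).corr(aligned["sentiment"])


# Synthetic data for illustration only.
rng = np.random.default_rng(0)
days = pd.bdate_range("2018-01-02", periods=60)
news = pd.DataFrame(
    {"Date": days.repeat(2).strftime("%Y-%m-%d"), "score": rng.uniform(-1, 1, 120)}
)
prices = pd.DataFrame(
    {"date": days.strftime("%Y-%m-%d"), "adj close": 100 + rng.normal(0, 1, 60).cumsum()}
)
corr = rolling_return_sentiment_corr(prices, daily_sentiment(news), window=20)
print(corr.dropna().describe())
```

The mean and standard deviation of the rolling series correspond to the "mean" and "volatility" figures reported per asset.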

Outputs

All outputs are saved to outputs_fnspid/ directory:

  • daily_sentiment.csv - Daily aggregated sentiment scores
  • aligned_series.csv - Aligned price and sentiment data
  • market_linkage_results.csv - Analysis results summary
  • rolling_corr_*.png - Correlation plots for each asset

Data Requirements

  • FNSPID Data: CSV with 'Date' and 'Article' columns
  • Price Data: CSV files with 'date' and 'adj close' columns
  • Date Format: YYYY-MM-DD or compatible pandas format
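
A quick way to check a price file against these requirements before running the pipeline is sketched below. The helper name `load_price_csv` is hypothetical, and lowercasing the column headers is an assumption made here for convenience; the pipeline itself may handle headers differently.

```python
import io

import pandas as pd

REQUIRED = {"date", "adj close"}


def load_price_csv(source) -> pd.DataFrame:
    """Read a price CSV, check required columns, and drop unparseable rows."""
    df = pd.read_csv(source)
    # Normalize headers so 'Date' / 'Adj Close' also pass (assumption).
    df.columns = [c.strip().lower() for c in df.columns]
    missing = REQUIRED - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    # Rows whose date cannot be parsed become NaT and are dropped.
    df["date"] = pd.to_datetime(df["date"], errors="coerce")
    df = df.dropna(subset=["date", "adj close"]).sort_values("date")
    return df.reset_index(drop=True)


# In-memory example; a real call would pass a path under Price_2018_2019/.
sample = io.StringIO(
    "Date,Adj Close\n2018-01-03,101.5\n2018-01-02,100.0\nnot-a-date,99.0\n"
)
df = load_price_csv(sample)
print(df)
```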

Hardware Requirements

  • CPU: Multi-core recommended for FinBERT processing
  • GPU: Optional but recommended for faster sentiment analysis
  • Memory: 4GB+ RAM for full dataset processing

Troubleshooting

  • Ensure all data files exist in correct locations
  • Check date formats match expected patterns
  • Verify sufficient data points for analysis (minimum 50 per asset)
  • If a GPU is unavailable or fails to initialize, the pipeline automatically falls back to CPU
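
The last two checks can be reproduced by hand when debugging. The sketch below shows one common pattern for device selection and a minimum-observation guard; `pick_device` and `has_enough_data` are hypothetical names, not functions from the pipeline.

```python
def pick_device() -> str:
    """Return 'cuda' when PyTorch can see a GPU, otherwise 'cpu'."""
    try:
        import torch
    except ImportError:
        # Without PyTorch installed there is nothing to run on a GPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def has_enough_data(n_points: int, minimum: int = 50) -> bool:
    """Mirror the 'minimum 50 per asset' rule from the list above."""
    return n_points >= minimum


device = pick_device()
print(f"Using device: {device}")
```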

Results Interpretation

  • Mean Correlation: Average correlation between asset returns and sentiment
  • Correlation Volatility: How much correlation varies over time
  • Johansen Tests: Statistical tests for cointegration relationships
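    • trace r=0: Tests the null of no cointegrating relationship against the alternative of at least one
    • trace r≤1: Tests the null of at most one cointegrating relationship against the alternative of more than one
    • A trace statistic above the 5% critical value rejects the corresponding null hypothesis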

Example Output

✅ Loaded price data for: ['VOO', 'VTI', 'IWM', 'XLF', 'EFA', 'ACWI']
📅 Date range: 2018-01-02 to 2019-12-31
✅ News rows in window: 25,847
✅ Using device: cuda
✅ Saved daily sentiment -> outputs_fnspid/daily_sentiment.csv
✅ Saved aligned series -> outputs_fnspid/aligned_series.csv

=== Analyzing VOO ===
Rolling Correlation: mean=0.0234  volatility=0.1456
Johansen: trace r=0 12.34 (crit5 15.41), trace r≤1 3.21 (crit5 3.76)
✅ Saved correlation plot -> outputs_fnspid/rolling_corr_VOO.png