FNSPID Sentiment Analysis Pipeline
Overview
This pipeline analyzes the relationship between news sentiment (from the FNSPID dataset) and financial market movements, following the paper's methodology:
- FinBERT and RoBERTa dual model predictions (as per paper)
- Meta-classifier (XGBoost) for sentiment aggregation (paper's approach)
- Rolling correlation analysis
- Johansen cointegration tests
- Statistical analysis and visualization
Key Features - Paper Methodology
- Dual Model Architecture: Uses both FinBERT (ProsusAI/finbert) and RoBERTa (cardiffnlp/twitter-roberta-base-sentiment)
- Meta-Classifier: XGBoost meta-classifier trained on FPB AllAgree dataset
- Feature Engineering: Probability-based features matching paper's approach
- Ensemble Method: Falls back to ensemble averaging if meta-classifier unavailable
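The probability-based features and the averaging fallback can be sketched roughly as follows. This is an illustration, not the pipeline's actual code: the function names, the six-column feature layout, the shared label order (negative, neutral, positive), and the scalar score P(positive) - P(negative) are all assumptions. The two checkpoints may order their output labels differently, so reorder each model's probabilities to a common order before calling these helpers.

```python
import numpy as np

# Assumed common label order for both models' probability vectors.
LABELS = ("negative", "neutral", "positive")


def _as_probs(probs):
    """Convert input to a float array and check it is (n_articles, 3)."""
    arr = np.asarray(probs, dtype=float)
    if arr.ndim != 2 or arr.shape[1] != len(LABELS):
        raise ValueError("expected an (n_articles, 3) probability array")
    return arr


def build_meta_features(finbert_probs, roberta_probs):
    """Concatenate both models' class probabilities into one row per article.

    The resulting (n_articles, 6) matrix is the kind of input a
    meta-classifier such as XGBoost could be trained on (assumed layout).
    """
    fin, rob = _as_probs(finbert_probs), _as_probs(roberta_probs)
    if fin.shape != rob.shape:
        raise ValueError("both models must score the same articles")
    return np.concatenate([fin, rob], axis=1)


def ensemble_average(finbert_probs, roberta_probs):
    """Fallback: average the two probability distributions class by class."""
    return (_as_probs(finbert_probs) + _as_probs(roberta_probs)) / 2.0


def sentiment_score(probs):
    """Collapse class probabilities to a scalar: P(positive) - P(negative)."""
    arr = _as_probs(probs)
    return arr[:, 2] - arr[:, 0]


# Made-up probabilities for two articles, in LABELS order.
finbert = np.array([[0.1, 0.2, 0.7], [0.6, 0.3, 0.1]])
roberta = np.array([[0.2, 0.3, 0.5], [0.5, 0.4, 0.1]])

features = build_meta_features(finbert, roberta)
averaged = ensemble_average(finbert, roberta)
scores = sentiment_score(averaged)
```

Averaging keeps each row a valid probability distribution, which is why it works as a drop-in fallback when no trained meta-classifier is available.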
Files
- fnspid_pipeline.py - Main pipeline implementation
- run_pipeline.py - Simple runner script
- config.py - Configuration parameters
- nasdaq_2018_2019.csv - FNSPID news data (2018-2019)
- Price_2018_2019/ - Directory containing ETF price data
Setup
Install required packages:
pip install matplotlib arch statsmodels torch transformers pandas numpy tqdm scikit-learn
Ensure data files are in place:
- nasdaq_2018_2019.csv (FNSPID news data)
- Price_2018_2019/*.csv (price data files)
Usage
Quick Start
python run_pipeline.py
Direct Execution
python fnspid_pipeline.py
Custom Configuration
Edit config.py to modify:
- Date ranges
- Price files to analyze
- Model parameters
- Output settings
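For orientation, a configuration module covering these four groups of settings might look like the sketch below. Every field name here is hypothetical and may not match the real config.py. The default values are taken from elsewhere in this README where it states them; the rolling window length and the one-file-per-ticker naming are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class PipelineConfig:
    # Date ranges
    start_date: str = "2018-01-02"
    end_date: str = "2019-12-31"

    # Input data and price files to analyze
    news_csv: str = "nasdaq_2018_2019.csv"
    price_dir: str = "Price_2018_2019"
    tickers: list = field(
        default_factory=lambda: ["VOO", "VTI", "IWM", "XLF", "EFA", "ACWI"]
    )

    # Model parameters
    finbert_model: str = "ProsusAI/finbert"
    roberta_model: str = "cardiffnlp/twitter-roberta-base-sentiment"
    rolling_window: int = 60      # assumption: not stated in this README
    min_observations: int = 50    # minimum data points per asset

    # Output settings
    output_dir: str = "outputs_fnspid"

    def price_paths(self):
        """Build one CSV path per ticker (assumed naming: <ticker>.csv)."""
        return [f"{self.price_dir}/{ticker}.csv" for ticker in self.tickers]


config = PipelineConfig()
```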
Pipeline Steps
- Data Loading: Load news articles and price data
- Sentiment Analysis: Score news articles using FinBERT and RoBERTa, combined by the XGBoost meta-classifier (or ensemble averaging as a fallback)
- Daily Aggregation: Aggregate sentiment scores by date
- Alignment: Align sentiment with price data
- Correlation Analysis: Rolling correlation between returns and sentiment
- Cointegration Tests: Johansen tests for long-term relationships
- Visualization: Generate correlation plots
- Results Export: Save analysis results to CSV
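The aggregation, alignment, and rolling-correlation steps above can be sketched with pandas as shown here. This is a simplified illustration on synthetic data, not the pipeline's code. The function names, the per-article "score" column, the use of log returns, the 60-day window, and the reading of "volatility" as the standard deviation of the rolling correlation are all assumptions.

```python
import numpy as np
import pandas as pd


def daily_sentiment(article_scores: pd.DataFrame) -> pd.Series:
    """Average per-article scores into one value per calendar date."""
    dates = pd.to_datetime(article_scores["Date"]).dt.normalize()
    return article_scores["score"].groupby(dates).mean().rename("sentiment")


def align_with_prices(sentiment: pd.Series, prices: pd.DataFrame) -> pd.DataFrame:
    """Inner-join daily sentiment with daily log returns on shared dates."""
    px = (
        prices.assign(date=pd.to_datetime(prices["date"]))
        .set_index("date")["adj close"]
        .sort_index()
    )
    returns = np.log(px).diff().rename("ret")
    # The first return is NaN (no previous price), so it is dropped here.
    return pd.concat([returns, sentiment], axis=1, join="inner").dropna()


def rolling_correlation(aligned: pd.DataFrame, window: int = 60) -> pd.Series:
    """Pearson correlation of returns and sentiment over a moving window."""
    return aligned["ret"].rolling(window).corr(aligned["sentiment"]).dropna()


# Synthetic inputs: two scored articles per business day, one price per day.
rng = np.random.default_rng(0)
days = pd.bdate_range("2018-01-02", periods=120)
articles = pd.DataFrame(
    {"Date": days.repeat(2).strftime("%Y-%m-%d"), "score": rng.uniform(-1, 1, 240)}
)
prices = pd.DataFrame(
    {
        "date": days.strftime("%Y-%m-%d"),
        "adj close": 100 * np.exp(np.cumsum(rng.normal(0, 0.01, 120))),
    }
)

aligned = align_with_prices(daily_sentiment(articles), prices)
roll = rolling_correlation(aligned, window=60)
summary = {"mean": roll.mean(), "volatility": roll.std()}
```

The inner join keeps only dates that have both a return and a sentiment value, so days with no news or no trading drop out before any correlation is computed.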
Outputs
All outputs are saved to outputs_fnspid/ directory:
- daily_sentiment.csv - Daily aggregated sentiment scores
- aligned_series.csv - Aligned price and sentiment data
- market_linkage_results.csv - Analysis results summary
- rolling_corr_*.png - Correlation plots for each asset
Data Requirements
- FNSPID Data: CSV with 'Date' and 'Article' columns
- Price Data: CSV files with 'date' and 'adj close' columns
- Date Format: YYYY-MM-DD or compatible pandas format
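A quick pre-flight check of these requirements could look like the following sketch. The helper name load_checked is hypothetical, and the sample rows are invented purely to exercise it. The check fails early if a required column is missing or a date cannot be parsed by pandas.

```python
import io

import pandas as pd

NEWS_COLUMNS = {"Date", "Article"}
PRICE_COLUMNS = {"date", "adj close"}


def load_checked(source, required, date_column):
    """Read a CSV, verify required columns, and parse the date column."""
    df = pd.read_csv(source)
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    # Unparseable dates become NaT so they can be counted and reported.
    df[date_column] = pd.to_datetime(df[date_column], errors="coerce")
    bad = int(df[date_column].isna().sum())
    if bad:
        raise ValueError(f"{bad} rows have unparseable dates in {date_column!r}")
    return df


# Invented sample rows; real files would be passed as paths instead.
news_csv = io.StringIO(
    "Date,Article\n2018-01-02,Markets open higher\n2018-01-03,Tech shares slip\n"
)
price_csv = io.StringIO("date,adj close\n2018-01-02,246.1\n2018-01-03,247.6\n")

news = load_checked(news_csv, NEWS_COLUMNS, "Date")
prices = load_checked(price_csv, PRICE_COLUMNS, "date")
```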
Hardware Requirements
- CPU: Multi-core recommended for FinBERT processing
- GPU: Optional but recommended for faster sentiment analysis
- Memory: 4GB+ RAM for full dataset processing
Troubleshooting
- Ensure all data files exist in correct locations
- Check date formats match expected patterns
- Verify sufficient data points for analysis (minimum 50 per asset)
- If a GPU is unavailable, the pipeline automatically falls back to CPU
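The checks in this list can also be run by hand before starting a long job. The helpers below are a hypothetical sketch and not part of the pipeline: pick_device mirrors the usual PyTorch CUDA-or-CPU selection (and returns "cpu" if torch is not installed), missing_files lists inputs that do not exist, and has_enough_data applies the 50-observation minimum mentioned above.

```python
from pathlib import Path

MIN_OBSERVATIONS = 50  # minimum data points per asset, as stated above


def pick_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def missing_files(paths):
    """Return the subset of paths that do not exist on disk."""
    return [str(p) for p in paths if not Path(p).exists()]


def has_enough_data(n_rows: int, minimum: int = MIN_OBSERVATIONS) -> bool:
    """Check whether an aligned series is long enough to analyze."""
    return n_rows >= minimum


device = pick_device()
```

For example, missing_files(["nasdaq_2018_2019.csv", "Price_2018_2019"]) returns an empty list when both inputs are in place.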
Results Interpretation
- Mean Correlation: Average correlation between asset returns and sentiment
- Correlation Volatility: How much correlation varies over time
- Johansen Tests: Statistical tests for cointegration relationships
  - trace r=0: Null hypothesis of no cointegrating relationship; rejecting it indicates at least one
  - trace r≤1: Null hypothesis of at most one cointegrating relationship
  - A trace statistic above the 5% critical value rejects the corresponding null hypothesis
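The decision rule for the trace tests can be written out as a short sketch. The function names are hypothetical. The inputs are plain lists of trace statistics and 5% critical values, ordered by the rank under the null hypothesis (r=0 first, then r≤1). With statsmodels, the result of coint_johansen exposes these as lr1 and as column 1 of cvt. The numbers used below are the VOO figures from the Example Output section.

```python
def johansen_trace_decisions(trace_stats, crit_5pct):
    """Compare each trace statistic with its 5% critical value."""
    decisions = []
    for rank, (stat, crit) in enumerate(zip(trace_stats, crit_5pct)):
        null = "r=0" if rank == 0 else f"r<={rank}"
        decisions.append(
            {"null": null, "stat": stat, "crit5": crit, "reject": stat > crit}
        )
    return decisions


def estimated_rank(decisions):
    """Sequential testing: the first null not rejected gives the rank."""
    for rank, decision in enumerate(decisions):
        if not decision["reject"]:
            return rank
    return len(decisions)


# VOO values from the Example Output section of this README.
decisions = johansen_trace_decisions([12.34, 3.21], [15.41, 3.76])
rank = estimated_rank(decisions)
```

Here 12.34 is below 15.41, so the r=0 null is not rejected and the estimated rank is 0: no evidence of cointegration between VOO and sentiment at the 5% level in that example.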
Example Output
Loaded price data for: ['VOO', 'VTI', 'IWM', 'XLF', 'EFA', 'ACWI']
Date range: 2018-01-02 to 2019-12-31
News rows in window: 25,847
Using device: cuda
Saved daily sentiment -> outputs_fnspid/daily_sentiment.csv
Saved aligned series -> outputs_fnspid/aligned_series.csv
=== Analyzing VOO ===
Rolling Correlation: mean=0.0234 volatility=0.1456
Johansen: trace r=0 12.34 (crit5 15.41), trace r≤1 3.21 (crit5 3.76)
Saved correlation plot -> outputs_fnspid/rolling_corr_VOO.png