# FNSPID Sentiment Analysis Pipeline

## Overview

This pipeline analyzes the relationship between news sentiment (from the FNSPID dataset) and financial market movements using **the paper's methodology**:

- **FinBERT and RoBERTa** dual model predictions (as per the paper)
- **Meta-classifier** (XGBoost) for sentiment aggregation (the paper's approach)
- Rolling correlation analysis
- Johansen cointegration tests
- Statistical analysis and visualization

## Key Features - Paper Methodology

- ✅ **Dual Model Architecture**: Uses both FinBERT (ProsusAI/finbert) and RoBERTa (cardiffnlp/twitter-roberta-base-sentiment)
- ✅ **Meta-Classifier**: XGBoost meta-classifier trained on the FPB AllAgree dataset
- ✅ **Feature Engineering**: Probability-based features matching the paper's approach
- ✅ **Ensemble Method**: Falls back to ensemble averaging if the meta-classifier is unavailable

## Files

- `fnspid_pipeline.py` - Main pipeline implementation
- `run_pipeline.py` - Simple runner script
- `config.py` - Configuration parameters
- `nasdaq_2018_2019.csv` - FNSPID news data (2018-2019)
- `Price_2018_2019/` - Directory containing ETF price data

## Setup

1. Install required packages:
   ```bash
   pip install matplotlib arch statsmodels torch transformers pandas numpy tqdm scikit-learn
   ```

2. Ensure data files are in place:
   - `nasdaq_2018_2019.csv` (FNSPID news data)
   - `Price_2018_2019/*.csv` (price data files)

## Usage

### Quick Start

```bash
python run_pipeline.py
```

### Direct Execution

```bash
python fnspid_pipeline.py
```

### Custom Configuration

Edit `config.py` to modify:

- Date ranges
- Price files to analyze
- Model parameters
- Output settings

## Pipeline Steps

1. **Data Loading**: Load news articles and price data
2. **Sentiment Analysis**: Score news articles using FinBERT and RoBERTa, combined by the meta-classifier (or by ensemble averaging if the meta-classifier is unavailable)
3. **Daily Aggregation**: Aggregate sentiment scores by date
4. **Alignment**: Align sentiment with price data
5. **Correlation Analysis**: Rolling correlation between returns and sentiment
6. **Cointegration Tests**: Johansen tests for long-term relationships
7. **Visualization**: Generate correlation plots
8. **Results Export**: Save analysis results to CSV

## Outputs

All outputs are saved to the `outputs_fnspid/` directory:

- `daily_sentiment.csv` - Daily aggregated sentiment scores
- `aligned_series.csv` - Aligned price and sentiment data
- `market_linkage_results.csv` - Analysis results summary
- `rolling_corr_*.png` - Correlation plots for each asset

## Data Requirements

- **FNSPID Data**: CSV with `Date` and `Article` columns
- **Price Data**: CSV files with `date` and `adj close` columns
- **Date Format**: YYYY-MM-DD or any format pandas can parse

## Hardware Requirements

- **CPU**: Multi-core recommended for FinBERT processing
- **GPU**: Optional but recommended for faster sentiment analysis
- **Memory**: 4GB+ RAM for full dataset processing

## Troubleshooting

- Ensure all data files exist in the correct locations
- Check that date formats match the expected patterns
- Verify there are enough data points for analysis (minimum 50 per asset)
- For GPU issues, the pipeline automatically falls back to CPU

## Results Interpretation

- **Mean Correlation**: Average rolling correlation between asset returns and sentiment
- **Correlation Volatility**: How much the correlation varies over time
- **Johansen Tests**: Statistical tests for cointegration relationships
  - trace r=0: Null hypothesis of no cointegrating relationship; rejecting it indicates at least one
  - trace r≤1: Null hypothesis of at most one cointegrating relationship
  - A trace statistic above the 5% critical value rejects the null hypothesis

## Example Output

```
✅ Loaded price data for: ['VOO', 'VTI', 'IWM', 'XLF', 'EFA', 'ACWI']
📅 Date range: 2018-01-02 to 2019-12-31
✅ News rows in window: 25,847
✅ Using device: cuda
✅ Saved daily sentiment -> outputs_fnspid/daily_sentiment.csv
✅ Saved aligned series -> outputs_fnspid/aligned_series.csv

=== Analyzing VOO ===
Rolling Correlation: mean=0.0234 volatility=0.1456
Johansen: trace r=0 12.34 (crit5 15.41), trace r≤1 3.21 (crit5 3.76)
✅ Saved correlation plot -> outputs_fnspid/rolling_corr_VOO.png
```
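
### Reading the Johansen line

The Johansen line in the example output can be read with the usual sequential trace procedure: test r=0 first, move on to r≤1 only if r=0 is rejected, and stop at the first null hypothesis that is not rejected. The sketch below illustrates that decision rule using the VOO numbers from the example output. The function name `estimate_rank` is hypothetical and is not part of `fnspid_pipeline.py`; the pipeline itself may report the statistics without applying this rule.

```python
from typing import Sequence


def estimate_rank(trace_stats: Sequence[float], crit_values: Sequence[float]) -> int:
    """Estimate the cointegration rank from Johansen trace statistics.

    trace_stats[r] is the trace statistic for the null "rank <= r" and
    crit_values[r] is the matching critical value (here, the 5% level).
    Hypothetical helper for illustration only.
    """
    if len(trace_stats) != len(crit_values):
        raise ValueError("need one critical value per trace statistic")
    for r, (stat, crit) in enumerate(zip(trace_stats, crit_values)):
        # The null "rank <= r" is rejected only when the statistic
        # exceeds the critical value; stop at the first non-rejection.
        if stat <= crit:
            return r
    # Every null was rejected: the system has full rank, which suggests
    # the series are stationary in levels rather than cointegrated.
    return len(trace_stats)


# VOO values copied from the example output above.
voo_rank = estimate_rank([12.34, 3.21], [15.41, 3.76])
print(f"VOO estimated cointegration rank: {voo_rank}")
```

For VOO the r=0 statistic (12.34) is below its 5% critical value (15.41), so the estimated rank is 0: the example shows no evidence of a long-term relationship between VOO and sentiment at the 5% level.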