# FNSPID Sentiment Analysis Pipeline

## Overview

This pipeline analyzes the relationship between news sentiment (from the FNSPID dataset) and financial market movements using **the paper's methodology**:
- **FinBERT and RoBERTa** dual model predictions (as per paper)
- **Meta-classifier** (XGBoost) for sentiment aggregation (paper's approach)
- Rolling correlation analysis
- Johansen cointegration tests
- Statistical analysis and visualization

## Key Features - Paper Methodology

- **Dual Model Architecture**: Uses both FinBERT (ProsusAI/finbert) and RoBERTa (cardiffnlp/twitter-roberta-base-sentiment)
- **Meta-Classifier**: XGBoost meta-classifier trained on the FPB AllAgree dataset
- **Feature Engineering**: Probability-based features matching the paper's approach
- **Ensemble Method**: Falls back to ensemble averaging if the meta-classifier is unavailable
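
The ensemble fallback could look roughly like the sketch below. It assumes each model returns one probability per class, that the label orders shown match each checkpoint's `config.id2label` (worth verifying before relying on them), and that the averaging is unweighted. The names `ensemble_average` and `sentiment_score`, and the choice of P(positive) minus P(negative) as a single score, are illustrative assumptions rather than the pipeline's actual code.

```python
# Illustrative sketch: unweighted ensemble averaging of two sentiment models.
# Label orders are assumptions; confirm them via each model's config.id2label.
FINBERT_LABELS = ("positive", "negative", "neutral")   # assumed order
ROBERTA_LABELS = ("negative", "neutral", "positive")   # assumed order
COMMON_ORDER = ("negative", "neutral", "positive")


def to_common_order(probs, labels):
    """Map a probability vector onto COMMON_ORDER using its label names."""
    by_label = dict(zip(labels, probs))
    return [by_label[name] for name in COMMON_ORDER]


def ensemble_average(finbert_probs, roberta_probs):
    """Average the two models' class probabilities in a shared label order."""
    a = to_common_order(finbert_probs, FINBERT_LABELS)
    b = to_common_order(roberta_probs, ROBERTA_LABELS)
    return [(x + y) / 2.0 for x, y in zip(a, b)]


def sentiment_score(avg_probs):
    """Collapse averaged probabilities to one score: P(positive) - P(negative)."""
    return avg_probs[2] - avg_probs[0]


# Toy probabilities for a single article (not real model output).
avg = ensemble_average([0.7, 0.1, 0.2], [0.2, 0.2, 0.6])
score = sentiment_score(avg)
```

Reordering by label name before averaging matters because the two checkpoints may not list their classes in the same order; averaging raw positions would mix positive with negative.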

## Files

- `fnspid_pipeline.py` - Main pipeline implementation
- `run_pipeline.py` - Simple runner script
- `config.py` - Configuration parameters
- `nasdaq_2018_2019.csv` - FNSPID news data (2018-2019)
- `Price_2018_2019/` - Directory containing ETF price data

## Setup

1. Install required packages (including `xgboost`, which the meta-classifier needs):
   ```bash
   pip install matplotlib arch statsmodels torch transformers pandas numpy tqdm scikit-learn xgboost
   ```
2. Ensure data files are in place:
   - `nasdaq_2018_2019.csv` (FNSPID news data)
   - `Price_2018_2019/*.csv` (price data files)

## Usage

### Quick Start

```bash
python run_pipeline.py
```

### Direct Execution

```bash
python fnspid_pipeline.py
```

### Custom Configuration

Edit `config.py` to modify:
- Date ranges
- Price files to analyze
- Model parameters
- Output settings
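
As a rough idea of what such a configuration might contain, here is a hypothetical layout. Every variable name, the batch size, the rolling window length, and the `<ticker>.csv` file naming are assumptions for illustration; only the file and directory names, model identifiers, tickers, and the 50-observation minimum come from this README. Check the real `config.py` for the actual names.

```python
# Hypothetical config.py layout; every name and default here is an assumption.
from pathlib import Path

# Date range to analyze (the bundled data covers 2018-2019).
START_DATE = "2018-01-01"
END_DATE = "2019-12-31"

# Input data locations.
NEWS_CSV = Path("nasdaq_2018_2019.csv")
PRICE_DIR = Path("Price_2018_2019")
TICKERS = ["VOO", "VTI", "IWM", "XLF", "EFA", "ACWI"]

# Model parameters.
FINBERT_MODEL = "ProsusAI/finbert"
ROBERTA_MODEL = "cardiffnlp/twitter-roberta-base-sentiment"
BATCH_SIZE = 32        # assumed value
ROLLING_WINDOW = 30    # assumed value, in trading days
MIN_OBSERVATIONS = 50  # minimum data points per asset

# Output settings.
OUTPUT_DIR = Path("outputs_fnspid")


def price_files():
    """Return the expected price file path for each ticker (assumed naming)."""
    return [PRICE_DIR / f"{ticker}.csv" for ticker in TICKERS]
```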

## Pipeline Steps

1. **Data Loading**: Load news articles and price data
2. **Sentiment Analysis**: Score news articles with FinBERT and RoBERTa, then combine the predictions with the XGBoost meta-classifier (or ensemble averaging as a fallback)
3. **Daily Aggregation**: Aggregate sentiment scores by date
4. **Alignment**: Align sentiment with price data
5. **Correlation Analysis**: Rolling correlation between returns and sentiment
6. **Cointegration Tests**: Johansen tests for long-term relationships
7. **Visualization**: Generate correlation plots
8. **Results Export**: Save analysis results to CSV
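
Steps 3 to 5 can be sketched on toy data using only the standard library. This is not the pipeline's code: it assumes daily sentiment is the mean of article scores, that returns are simple day-over-day changes in adjusted close, and that the window length is a free parameter. The function names are illustrative. In practice pandas would do the same with `groupby`, a join on date, and `Series.rolling(window).corr(other)`.

```python
# Sketch of steps 3-5 on toy data, standard library only.
from collections import defaultdict
from math import sqrt


def daily_mean(scored_articles):
    """Step 3: average per-article sentiment scores for each date."""
    buckets = defaultdict(list)
    for date, score in scored_articles:
        buckets[date].append(score)
    return {d: sum(v) / len(v) for d, v in buckets.items()}


def align(daily_sentiment, prices):
    """Step 4: keep dates that have both a return and a sentiment value.

    Returns are simple returns between consecutive adjusted closes (assumed).
    """
    dates = sorted(prices)  # ISO dates sort chronologically as strings
    rows = []
    for prev, cur in zip(dates, dates[1:]):
        if cur in daily_sentiment:
            ret = prices[cur] / prices[prev] - 1.0
            rows.append((cur, ret, daily_sentiment[cur]))
    return rows


def pearson(xs, ys):
    """Pearson correlation; NaN if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / sqrt(vx * vy)


def rolling_corr(rows, window):
    """Step 5: correlation of returns and sentiment over each trailing window."""
    out = []
    for end in range(window, len(rows) + 1):
        chunk = rows[end - window:end]
        rets = [r[1] for r in chunk]
        sents = [r[2] for r in chunk]
        out.append((chunk[-1][0], pearson(rets, sents)))
    return out


# Toy inputs (made up for illustration).
articles = [("2018-01-03", 0.2), ("2018-01-03", 0.4), ("2018-01-04", -0.1),
            ("2018-01-05", 0.3), ("2018-01-08", 0.0)]
prices = {"2018-01-02": 100.0, "2018-01-03": 101.0, "2018-01-04": 100.0,
          "2018-01-05": 102.0, "2018-01-08": 102.0}

daily = daily_mean(articles)
rows = align(daily, prices)
corr = rolling_corr(rows, window=3)
```

The first price date drops out during alignment because it has no prior close to compute a return from, which is one reason the aligned series can be shorter than either input.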

## Outputs

All outputs are saved to the `outputs_fnspid/` directory:
- `daily_sentiment.csv` - Daily aggregated sentiment scores
- `aligned_series.csv` - Aligned price and sentiment data
- `market_linkage_results.csv` - Analysis results summary
- `rolling_corr_*.png` - Correlation plots for each asset

## Data Requirements

- **FNSPID Data**: CSV with 'Date' and 'Article' columns
- **Price Data**: CSV files with 'date' and 'adj close' columns
- **Date Format**: YYYY-MM-DD or compatible pandas format
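
A quick pre-flight check of these requirements might look like the following sketch. The helper name `check_csv` is hypothetical, and the date check is deliberately stricter than the pipeline (it accepts only values that start with YYYY-MM-DD, whereas pandas can parse more formats), so a reported "bad date" is a prompt to look, not proof of failure.

```python
# Sketch: validate required columns and date format before running the pipeline.
import csv
import io
from datetime import datetime

REQUIRED_NEWS = {"Date", "Article"}
REQUIRED_PRICE = {"date", "adj close"}


def check_csv(handle, required, date_col):
    """Return a list of problems found in an open CSV file handle."""
    reader = csv.DictReader(handle)
    missing = required - set(reader.fieldnames or [])
    if missing:
        return [f"missing columns: {sorted(missing)}"]
    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        value = row[date_col] or ""
        try:
            datetime.strptime(value[:10], "%Y-%m-%d")
        except ValueError:
            problems.append(f"line {line_no}: bad date {value!r}")
    return problems


# In-memory examples; with real files, pass open(path, newline="") instead.
good_prices = io.StringIO("date,adj close\n2018-01-02,100.0\n2018-01-03,101.5\n")
bad_columns = io.StringIO("Date,Text\n2018-01-02,hello\n")
bad_dates = io.StringIO("Date,Article\n01/02/2018,hello\n")

ok = check_csv(good_prices, REQUIRED_PRICE, "date")
col_problems = check_csv(bad_columns, REQUIRED_NEWS, "Date")
date_problems = check_csv(bad_dates, REQUIRED_NEWS, "Date")
```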

## Hardware Requirements

- **CPU**: Multi-core recommended for FinBERT and RoBERTa inference
- **GPU**: Optional but recommended for faster sentiment analysis
- **Memory**: 4 GB+ RAM for full dataset processing

## Troubleshooting

- Ensure all data files exist in the correct locations
- Check that date formats match the expected patterns
- Verify there are sufficient data points for analysis (minimum 50 per asset)
- If no usable GPU is found, the pipeline automatically falls back to CPU
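
The CPU fallback presumably amounts to a device check along these lines. `pick_device` is a hypothetical helper, not a function from the pipeline; `torch.cuda.is_available()` is the standard PyTorch call. Running it is a quick way to see which device the pipeline is likely to report.

```python
# Sketch: choose "cuda" when PyTorch can see a GPU, otherwise "cpu".
def pick_device():
    """Return the device string a PyTorch model would be moved to."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed; the pipeline itself would not run,
        # but report "cpu" so this check never crashes.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")
```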

## Results Interpretation

- **Mean Correlation**: Average rolling correlation between asset returns and sentiment
- **Correlation Volatility**: How much the rolling correlation varies over time
- **Johansen Tests**: Trace tests for cointegrating relationships
  - trace r=0: Null hypothesis of no cointegrating relationship; rejecting it indicates at least one
  - trace r≤1: Null hypothesis of at most one cointegrating relationship; rejecting it indicates more than one
  - A trace statistic above the 5% critical value rejects the corresponding null hypothesis
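
The decision rule above is applied sequentially: test r=0 first, and move on to r≤1 only if r=0 is rejected. A minimal sketch follows, with `estimate_rank` as an illustrative name. If the statistics come from statsmodels' `coint_johansen`, the trace statistics should be in the result's `lr1` attribute and the critical values in `cvt` (columns for 90%, 95%, 99%); how the pipeline extracts them is an assumption here.

```python
# Sketch: sequential reading of Johansen trace statistics at the 5% level.
def estimate_rank(trace_stats, crit_5pct):
    """Return the first r whose null hypothesis (rank <= r) is not rejected.

    trace_stats[r] and crit_5pct[r] belong to the null "rank <= r".
    If every null is rejected, the rank equals the number of series.
    """
    for r, (stat, crit) in enumerate(zip(trace_stats, crit_5pct)):
        if stat <= crit:
            return r
    return len(trace_stats)


# Statistics taken from the VOO line in the Example Output section.
rank = estimate_rank([12.34, 3.21], [15.41, 3.76])
```

Here 12.34 is below 15.41, so the r=0 null is not rejected and the estimated rank is 0: these numbers give no evidence of cointegration between the two series at the 5% level.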

## Example Output

```
Loaded price data for: ['VOO', 'VTI', 'IWM', 'XLF', 'EFA', 'ACWI']
Date range: 2018-01-02 to 2019-12-31
News rows in window: 25,847
Using device: cuda
Saved daily sentiment -> outputs_fnspid/daily_sentiment.csv
Saved aligned series -> outputs_fnspid/aligned_series.csv
=== Analyzing VOO ===
Rolling Correlation: mean=0.0234 volatility=0.1456
Johansen: trace r=0 12.34 (crit5 15.41), trace r≤1 3.21 (crit5 3.76)
Saved correlation plot -> outputs_fnspid/rolling_corr_VOO.png
```