# FNSPID Sentiment Analysis Pipeline
## Overview
This pipeline analyzes the relationship between news sentiment (from the FNSPID dataset) and financial market movements using **the paper's methodology**:
- **FinBERT and RoBERTa** dual model predictions (as per paper)
- **Meta-classifier** (XGBoost) for sentiment aggregation (paper's approach)
- Rolling correlation analysis
- Johansen cointegration tests
- Statistical analysis and visualization
## Key Features - Paper Methodology
- **Dual Model Architecture**: Uses both FinBERT (`ProsusAI/finbert`) and RoBERTa (`cardiffnlp/twitter-roberta-base-sentiment`)
- **Meta-Classifier**: XGBoost meta-classifier trained on the FPB AllAgree dataset
- **Feature Engineering**: Probability-based features matching the paper's approach
- **Ensemble Method**: Falls back to ensemble averaging if the meta-classifier is unavailable
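
A minimal sketch of how the two models' outputs could feed the meta-classifier, with the averaging fallback. This is an illustration, not the code in `fnspid_pipeline.py`: the function names, the shared label order `[negative, neutral, positive]`, and the `meta_clf.predict` interface returning class indices are all assumptions.

```python
import numpy as np

# Assumed: both models' probabilities have already been reordered to this layout.
LABELS = ["negative", "neutral", "positive"]


def build_meta_features(p_finbert, p_roberta):
    """Concatenate both models' class probabilities into one 6-value row per article."""
    return np.hstack([np.asarray(p_finbert, dtype=float),
                      np.asarray(p_roberta, dtype=float)])


def ensemble_average(p_finbert, p_roberta):
    """Fallback: average the two probability matrices and pick the top class."""
    avg = (np.asarray(p_finbert, dtype=float) + np.asarray(p_roberta, dtype=float)) / 2.0
    return [LABELS[i] for i in avg.argmax(axis=1)]


def predict_sentiment(p_finbert, p_roberta, meta_clf=None):
    """Use the meta-classifier when one is supplied, otherwise average."""
    if meta_clf is not None:
        # Hypothetical interface: predict() returns integer class indices.
        features = build_meta_features(p_finbert, p_roberta)
        return [LABELS[int(i)] for i in meta_clf.predict(features)]
    return ensemble_average(p_finbert, p_roberta)


# Toy probabilities for two articles (made-up numbers).
p_fin = [[0.10, 0.20, 0.70], [0.60, 0.30, 0.10]]
p_rob = [[0.20, 0.30, 0.50], [0.30, 0.50, 0.20]]
print(predict_sentiment(p_fin, p_rob))  # ['positive', 'negative']
```

The two checkpoints do not necessarily emit classes in the same order, so the reordering step assumed above should be checked against each model's `id2label` mapping before the probabilities are combined.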
## Files
- `fnspid_pipeline.py` - Main pipeline implementation
- `run_pipeline.py` - Simple runner script
- `config.py` - Configuration parameters
- `nasdaq_2018_2019.csv` - FNSPID news data (2018-2019)
- `Price_2018_2019/` - Directory containing ETF price data
## Setup
1. Install required packages:
```bash
pip install matplotlib arch statsmodels torch transformers pandas numpy tqdm scikit-learn xgboost
```
2. Ensure data files are in place:
- `nasdaq_2018_2019.csv` (FNSPID news data)
- `Price_2018_2019/*.csv` (price data files)
## Usage
### Quick Start
```bash
python run_pipeline.py
```
### Direct Execution
```bash
python fnspid_pipeline.py
```
### Custom Configuration
Edit `config.py` to modify:
- Date ranges
- Price files to analyze
- Model parameters
- Output settings
## Pipeline Steps
1. **Data Loading**: Load news articles and price data
2. **Sentiment Analysis**: Score news articles with FinBERT and RoBERTa, combined by the meta-classifier (or by ensemble averaging as a fallback)
3. **Daily Aggregation**: Aggregate sentiment scores by date
4. **Alignment**: Align sentiment with price data
5. **Correlation Analysis**: Rolling correlation between returns and sentiment
6. **Cointegration Tests**: Johansen tests for long-term relationships
7. **Visualization**: Generate correlation plots
8. **Results Export**: Save analysis results to CSV
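
Steps 3 to 5 can be sketched with pandas as follows. This is an illustration under assumptions rather than the pipeline's actual code: the per-article `score` column, the use of log returns, the 30-day window, and the function names are hypothetical, and the data is synthetic.

```python
import numpy as np
import pandas as pd


def daily_sentiment(scored: pd.DataFrame) -> pd.Series:
    """Step 3: average per-article scores into one value per calendar day."""
    dates = pd.to_datetime(scored["Date"]).dt.normalize()
    return scored["score"].groupby(dates).mean().rename("sentiment")


def align(prices: pd.DataFrame, sentiment: pd.Series) -> pd.DataFrame:
    """Step 4: inner-join daily log returns with daily sentiment on the date."""
    px = (prices.assign(date=pd.to_datetime(prices["date"]))
                .set_index("date")["adj close"]
                .sort_index())
    returns = np.log(px).diff().rename("return")
    return pd.concat([returns, sentiment], axis=1, join="inner").dropna()


def rolling_corr(aligned: pd.DataFrame, window: int = 30) -> pd.Series:
    """Step 5: rolling Pearson correlation between returns and sentiment."""
    return aligned["return"].rolling(window).corr(aligned["sentiment"]).dropna()


# Synthetic data: 60 business days, two scored articles per day.
rng = np.random.default_rng(0)
days = pd.bdate_range("2018-01-02", periods=60)
news = pd.DataFrame({"Date": days.repeat(2), "score": rng.uniform(-1, 1, 120)})
prices = pd.DataFrame({"date": days, "adj close": 100 + rng.normal(0, 1, 60).cumsum()})

aligned = align(prices, daily_sentiment(news))
corr = rolling_corr(aligned)
print(f"mean={corr.mean():.4f} volatility={corr.std():.4f}")
```

The inner join drops days that have prices but no news (and the reverse), so the rolling window counts aligned observations rather than calendar days.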
## Outputs
All outputs are saved to the `outputs_fnspid/` directory:
- `daily_sentiment.csv` - Daily aggregated sentiment scores
- `aligned_series.csv` - Aligned price and sentiment data
- `market_linkage_results.csv` - Analysis results summary
- `rolling_corr_*.png` - Correlation plots for each asset
## Data Requirements
- **FNSPID Data**: CSV with 'Date' and 'Article' columns
- **Price Data**: CSV files with 'date' and 'adj close' columns
- **Date Format**: YYYY-MM-DD or compatible pandas format
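
A small pre-flight check for these requirements might look like the sketch below. The helper name `load_checked` and the inline CSV text are made up for illustration; the column names come from the list above, and real files would be passed as paths instead of `StringIO` objects.

```python
import io

import pandas as pd


def load_checked(source, date_col, required_cols):
    """Read a CSV, verify required columns exist, and parse the date column."""
    df = pd.read_csv(source)
    missing = [c for c in required_cols if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # Unparseable dates become NaT so they can be counted and reported.
    df[date_col] = pd.to_datetime(df[date_col], errors="coerce")
    bad = int(df[date_col].isna().sum())
    if bad:
        raise ValueError(f"{bad} rows have unparseable dates")
    return df.sort_values(date_col).reset_index(drop=True)


# Toy inputs standing in for the news file and one price file.
news_csv = io.StringIO("Date,Article\n2018-01-03,Second headline\n2018-01-02,First headline\n")
price_csv = io.StringIO("date,adj close\n2018-01-02,100.0\n2018-01-03,101.0\n")

news = load_checked(news_csv, "Date", ["Date", "Article"])
prices = load_checked(price_csv, "date", ["date", "adj close"])
print(news["Date"].min().date(), len(prices))
```

Failing early on a missing column or a bad date is cheaper than discovering the problem after sentiment scoring has already run.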
## Hardware Requirements
- **CPU**: Multi-core recommended for FinBERT and RoBERTa inference
- **GPU**: Optional but recommended for faster sentiment analysis
- **Memory**: 4GB+ RAM for full dataset processing
## Troubleshooting
- Ensure all data files exist in correct locations
- Check date formats match expected patterns
- Verify sufficient data points for analysis (minimum 50 per asset)
- If no GPU is available, the pipeline automatically falls back to CPU
## Results Interpretation
- **Mean Correlation**: Average correlation between asset returns and sentiment
- **Correlation Volatility**: How much correlation varies over time
- **Johansen Tests**: Statistical tests for cointegration relationships
  - trace r=0: Null hypothesis of no cointegrating relationship
  - trace r≤1: Null hypothesis of at most one cointegrating relationship
  - A trace statistic above the 5% critical value rejects the corresponding null hypothesis
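
The decision rule for the trace tests can be written out as a short sketch. The function name is hypothetical, and the numbers are the VOO values from the example output in this README. If the pipeline uses statsmodels' `coint_johansen` (an assumption), the trace statistics would come from the result's `lr1` attribute and the critical values from `cvt`.

```python
def cointegration_rank(trace_stats, crit_values):
    """Sequential trace test.

    trace_stats[r] tests the null "at most r cointegrating relationships".
    Return the first r whose null is NOT rejected; if every null is rejected,
    return the number of tests (full rank).
    """
    for r, (stat, crit) in enumerate(zip(trace_stats, crit_values)):
        if stat <= crit:
            return r
    return len(trace_stats)


# VOO example: trace r=0 is 12.34 (crit5 15.41), trace r<=1 is 3.21 (crit5 3.76).
rank = cointegration_rank([12.34, 3.21], [15.41, 3.76])
print(rank)  # 0
```

Here 12.34 is below 15.41, so the r=0 null is not rejected and the test finds no evidence of cointegration at the 5% level.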
## Example Output
```
Loaded price data for: ['VOO', 'VTI', 'IWM', 'XLF', 'EFA', 'ACWI']
Date range: 2018-01-02 to 2019-12-31
News rows in window: 25,847
Using device: cuda
Saved daily sentiment -> outputs_fnspid/daily_sentiment.csv
Saved aligned series -> outputs_fnspid/aligned_series.csv
=== Analyzing VOO ===
Rolling Correlation: mean=0.0234 volatility=0.1456
Johansen: trace r=0 12.34 (crit5 15.41), trace r≤1 3.21 (crit5 3.76)
Saved correlation plot -> outputs_fnspid/rolling_corr_VOO.png
```