# FNSPID Sentiment Analysis Pipeline

## Overview

This pipeline analyzes the relationship between news sentiment (from the FNSPID dataset) and financial market movements using **the paper's methodology**:

- **FinBERT and RoBERTa** dual model predictions (as per paper)
- **Meta-classifier** (XGBoost) for sentiment aggregation (paper's approach)
- Rolling correlation analysis
- Johansen cointegration tests
- Statistical analysis and visualization

## Key Features - Paper Methodology

✅ **Dual Model Architecture**: Uses both FinBERT (ProsusAI/finbert) and RoBERTa (cardiffnlp/twitter-roberta-base-sentiment)  
✅ **Meta-Classifier**: XGBoost meta-classifier trained on FPB AllAgree dataset  
✅ **Feature Engineering**: Probability-based features matching paper's approach  
✅ **Ensemble Method**: Falls back to ensemble averaging if meta-classifier unavailable
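
The ensemble fallback can be pictured with a minimal sketch. It assumes both models' outputs have already been remapped to a shared label order of (negative, neutral, positive), since the two checkpoints may not emit labels in the same order, and it assumes the signed score is P(positive) minus P(negative). The function name `ensemble_average` is hypothetical and not taken from `fnspid_pipeline.py`.

```python
from typing import Sequence

# Assumed shared label order after remapping both models' outputs.
LABELS = ("negative", "neutral", "positive")


def ensemble_average(finbert_probs: Sequence[float],
                     roberta_probs: Sequence[float]) -> dict:
    """Average two class-probability vectors and derive a signed score."""
    if len(finbert_probs) != len(LABELS) or len(roberta_probs) != len(LABELS):
        raise ValueError("expected one probability per label")
    # Element-wise mean of the two probability vectors.
    avg = [(f + r) / 2.0 for f, r in zip(finbert_probs, roberta_probs)]
    # Predicted label is the class with the highest averaged probability.
    label = LABELS[max(range(len(avg)), key=avg.__getitem__)]
    # Signed score in [-1, 1]: P(positive) - P(negative) (an assumption).
    score = avg[2] - avg[0]
    return {"probs": avg, "label": label, "score": score}


result = ensemble_average([0.10, 0.20, 0.70], [0.20, 0.40, 0.40])
print(result["label"], round(result["score"], 2))
```

When the meta-classifier is available, the same two probability vectors would presumably be concatenated into a feature row for XGBoost instead of being averaged.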

## Files

- `fnspid_pipeline.py` - Main pipeline implementation
- `run_pipeline.py` - Simple runner script
- `config.py` - Configuration parameters
- `nasdaq_2018_2019.csv` - FNSPID news data (2018-2019)
- `Price_2018_2019/` - Directory containing ETF price data

## Setup

1. Install required packages:

   ```bash
   pip install matplotlib arch statsmodels torch transformers pandas numpy tqdm scikit-learn xgboost
   ```

2. Ensure data files are in place:
   - `nasdaq_2018_2019.csv` (FNSPID news data)
   - `Price_2018_2019/*.csv` (price data files)

## Usage

### Quick Start

```bash
python run_pipeline.py
```

### Direct Execution

```bash
python fnspid_pipeline.py
```

### Custom Configuration

Edit `config.py` to modify:

- Date ranges
- Price files to analyze
- Model parameters
- Output settings
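
As an illustration of how these settings might be grouped, here is a sketch of a configuration object. Every field name and default below is hypothetical, as is the `<ticker>.csv` naming of price files; only the paths, the ticker symbols, and the 50-observation minimum come from this README. Check the actual `config.py` for the real parameter names.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class PipelineConfig:
    # All field names and defaults are hypothetical, for illustration only.
    start_date: str = "2018-01-01"
    end_date: str = "2019-12-31"
    news_csv: Path = Path("nasdaq_2018_2019.csv")
    price_dir: Path = Path("Price_2018_2019")
    tickers: list = field(
        default_factory=lambda: ["VOO", "VTI", "IWM", "XLF", "EFA", "ACWI"]
    )
    rolling_window: int = 30     # rows per correlation window (assumed)
    min_observations: int = 50   # minimum data points per asset
    batch_size: int = 16         # texts per model forward pass (assumed)
    output_dir: Path = Path("outputs_fnspid")

    def price_files(self) -> list:
        """Build one path per ticker, assuming files are named <ticker>.csv."""
        return [self.price_dir / f"{t}.csv" for t in self.tickers]


cfg = PipelineConfig(rolling_window=60)
print(cfg.price_files()[0])
```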

## Pipeline Steps

1. **Data Loading**: Load news articles and price data
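2. **Sentiment Analysis**: Score news articles with FinBERT and RoBERTa, then combine the two predictions with the XGBoost meta-classifier (or ensemble averaging as a fallback)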
3. **Daily Aggregation**: Aggregate sentiment scores by date
4. **Alignment**: Align sentiment with price data
5. **Correlation Analysis**: Rolling correlation between returns and sentiment
6. **Cointegration Tests**: Johansen tests for long-term relationships
7. **Visualization**: Generate correlation plots
8. **Results Export**: Save analysis results to CSV
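
Steps 3 to 5 can be sketched in plain Python to show the data flow. This is not the pipeline's own code, which most likely relies on pandas. The function names, the mean as the aggregation rule, and the trailing-window definition are all assumptions.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean


def daily_mean_sentiment(scored_articles):
    """Step 3: average per-article scores into one value per date."""
    by_date = defaultdict(list)
    for date, score in scored_articles:
        by_date[date].append(score)
    return {d: mean(v) for d, v in by_date.items()}


def align(returns_by_date, sentiment_by_date):
    """Step 4: keep only dates present in both series, in date order."""
    dates = sorted(set(returns_by_date) & set(sentiment_by_date))
    return [(d, returns_by_date[d], sentiment_by_date[d]) for d in dates]


def pearson(xs, ys):
    """Pearson correlation; NaN when either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / sqrt(vx * vy)


def rolling_correlation(aligned, window):
    """Step 5: correlation over each trailing window of aligned rows."""
    out = []
    for end in range(window, len(aligned) + 1):
        chunk = aligned[end - window:end]
        rets = [r for _, r, _ in chunk]
        sents = [s for _, _, s in chunk]
        out.append((chunk[-1][0], pearson(rets, sents)))
    return out


# Toy data: ISO date strings sort chronologically.
articles = [("2018-01-02", 0.2), ("2018-01-02", 0.4), ("2018-01-03", -0.1),
            ("2018-01-04", 0.5), ("2018-01-05", 0.1)]
returns = {"2018-01-02": 0.01, "2018-01-03": -0.02, "2018-01-04": 0.03,
           "2018-01-05": 0.00, "2018-01-08": 0.01}

aligned = align(returns, daily_mean_sentiment(articles))
rolling = rolling_correlation(aligned, window=3)
for date, corr in rolling:
    print(date, round(corr, 3))
```

The date 2018-01-08 has a return but no news, so alignment drops it. With four aligned rows and a window of three, the sketch yields two correlation values.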

## Outputs

All outputs are saved to `outputs_fnspid/` directory:

- `daily_sentiment.csv` - Daily aggregated sentiment scores
- `aligned_series.csv` - Aligned price and sentiment data
- `market_linkage_results.csv` - Analysis results summary
- `rolling_corr_*.png` - Correlation plots for each asset

## Data Requirements

- **FNSPID Data**: CSV with 'Date' and 'Article' columns
- **Price Data**: CSV files with 'date' and 'adj close' columns
- **Date Format**: YYYY-MM-DD or compatible pandas format
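
A quick header check can catch a missing column before a long run starts. The sketch below uses only the standard library and assumes column names are matched case-sensitively, exactly as listed above. The helper `missing_columns` is hypothetical.

```python
import csv
import io

NEWS_COLUMNS = {"Date", "Article"}
PRICE_COLUMNS = {"date", "adj close"}


def missing_columns(csv_file, required):
    """Return the required column names absent from the CSV header."""
    header = next(csv.reader(csv_file), [])
    present = {name.strip() for name in header}
    return sorted(required - present)


# In-memory stand-ins for real files, so the sketch runs without data on disk.
news = io.StringIO("Date,Article\n2018-01-02,Stocks rally\n")
prices = io.StringIO("date,open,close\n2018-01-02,100,101\n")

news_missing = missing_columns(news, NEWS_COLUMNS)
price_missing = missing_columns(prices, PRICE_COLUMNS)
print(news_missing)    # []
print(price_missing)   # ['adj close']
```

For a real file, pass an open handle instead, for example `open("nasdaq_2018_2019.csv", newline="")`.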

## Hardware Requirements

- **CPU**: Multi-core recommended for FinBERT and RoBERTa inference
- **GPU**: Optional but recommended for faster sentiment analysis
- **Memory**: 4GB+ RAM for full dataset processing

## Troubleshooting

- Ensure all data files exist in correct locations
- Check date formats match expected patterns
- Verify sufficient data points for analysis (minimum 50 per asset)
- If no GPU is available, the pipeline automatically falls back to CPU
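
Two of these checks are easy to reproduce by hand when debugging. The sketch below shows one common way to pick a device with PyTorch and a simple minimum-size guard. Both helper names are hypothetical, and treating 50 as an inclusive lower bound is an assumption.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so no GPU can be used
    return "cuda" if torch.cuda.is_available() else "cpu"


MIN_OBSERVATIONS = 50  # minimum data points per asset, from the list above


def has_enough_data(n_rows, minimum=MIN_OBSERVATIONS):
    """True when an asset has at least `minimum` aligned rows."""
    return n_rows >= minimum


device = pick_device()
print("Using device:", device)
print(has_enough_data(49), has_enough_data(50))
```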

## Results Interpretation

- **Mean Correlation**: Average correlation between asset returns and sentiment
- **Correlation Volatility**: How much correlation varies over time
- **Johansen Tests**: Statistical tests for cointegration relationships
  - trace r=0: Null hypothesis of no cointegrating relationship; rejecting it indicates at least one
  - trace r≤1: Null hypothesis of at most one cointegrating relationship
  - A trace statistic greater than the 5% critical value rejects the corresponding null hypothesis
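
The trace tests are read sequentially: start at r=0 and stop at the first null hypothesis that cannot be rejected. The sketch below encodes that rule for a pair of (statistic, critical value) lists. The function name is hypothetical. If the pipeline uses `coint_johansen` from statsmodels (an assumption, though statsmodels is in the install list), the inputs would correspond to the result's `lr1` array and the 95% column of `cvt`.

```python
def cointegration_rank(trace_stats, crit_5pct):
    """Sequential Johansen trace test: return the estimated rank.

    trace_stats[i] and crit_5pct[i] belong to the null hypothesis r <= i.
    """
    for rank, (stat, crit) in enumerate(zip(trace_stats, crit_5pct)):
        if stat <= crit:
            return rank  # first null hypothesis that cannot be rejected
    return len(trace_stats)  # every null rejected: full rank


# Values taken from the VOO lines in the example output of this README.
rank = cointegration_rank([12.34, 3.21], [15.41, 3.76])
print(rank)  # 0
```

A rank of 0 means the r=0 null is not rejected at the 5% level, so the data give no evidence of a long-term relationship between the price series and sentiment for that asset.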

## Example Output

```
✅ Loaded price data for: ['VOO', 'VTI', 'IWM', 'XLF', 'EFA', 'ACWI']
📅 Date range: 2018-01-02 to 2019-12-31
✅ News rows in window: 25,847
✅ Using device: cuda
✅ Saved daily sentiment -> outputs_fnspid/daily_sentiment.csv
✅ Saved aligned series -> outputs_fnspid/aligned_series.csv

=== Analyzing VOO ===
Rolling Correlation: mean=0.0234  volatility=0.1456
Johansen: trace r=0 12.34 (crit5 15.41), trace r≤1 3.21 (crit5 3.76)
✅ Saved correlation plot -> outputs_fnspid/rolling_corr_VOO.png
```