# FNSPID Sentiment Analysis Pipeline
## Overview
This pipeline analyzes the relationship between news sentiment (from the FNSPID dataset) and financial market movements using **the paper's methodology**:
- **FinBERT and RoBERTa** dual model predictions (as per paper)
- **Meta-classifier** (XGBoost) for sentiment aggregation (paper's approach)
- Rolling correlation analysis
- Johansen cointegration tests
- Statistical analysis and visualization
## Key Features - Paper Methodology
✅ **Dual Model Architecture**: Uses both FinBERT (ProsusAI/finbert) and RoBERTa (cardiffnlp/twitter-roberta-base-sentiment)
✅ **Meta-Classifier**: XGBoost meta-classifier trained on FPB AllAgree dataset
✅ **Feature Engineering**: Probability-based features matching paper's approach
✅ **Ensemble Method**: Falls back to ensemble averaging if meta-classifier unavailable
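The feature-engineering and fallback ideas above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the function names `build_meta_features` and `ensemble_average`, the common label order, and the extra confidence features (max probability, top-two margin) are all assumptions. It also assumes each model's probabilities have already been mapped to the same label order, since the two checkpoints may not emit labels in the same order.

```python
import numpy as np

# Assumed common label order; each model's outputs must be mapped to it first.
LABELS = ("negative", "neutral", "positive")

def build_meta_features(p_finbert, p_roberta):
    """Stack both models' class probabilities plus simple confidence features."""
    p_f = np.asarray(p_finbert, dtype=float)
    p_r = np.asarray(p_roberta, dtype=float)
    feats = [p_f, p_r]
    for p in (p_f, p_r):
        top2 = np.sort(p, axis=1)[:, -2:]          # two largest probabilities per row
        feats.append(top2[:, [1]])                 # max probability
        feats.append(top2[:, [1]] - top2[:, [0]])  # margin between the top two classes
    return np.hstack(feats)

def ensemble_average(p_finbert, p_roberta):
    """Fallback: average the two probability vectors and take the argmax."""
    p_avg = (np.asarray(p_finbert, dtype=float) + np.asarray(p_roberta, dtype=float)) / 2.0
    return p_avg, [LABELS[i] for i in p_avg.argmax(axis=1)]

# Made-up probabilities for two articles (illustration only).
p_fin = [[0.10, 0.20, 0.70], [0.60, 0.30, 0.10]]
p_rob = [[0.20, 0.30, 0.50], [0.30, 0.50, 0.20]]

X = build_meta_features(p_fin, p_rob)       # 3 + 3 + 2 + 2 = 10 columns
p_avg, labels = ensemble_average(p_fin, p_rob)
print(X.shape, labels)
```

When a trained meta-classifier is available, a matrix like `X` is the kind of input it would consume. Otherwise the averaged labels serve as the prediction.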
## Files
- `fnspid_pipeline.py` - Main pipeline implementation
- `run_pipeline.py` - Simple runner script
- `config.py` - Configuration parameters
- `nasdaq_2018_2019.csv` - FNSPID news data (2018-2019)
- `Price_2018_2019/` - Directory containing ETF price data
## Setup
1. Install required packages:
```bash
pip install matplotlib arch statsmodels torch transformers pandas numpy tqdm scikit-learn xgboost
```
2. Ensure data files are in place:
- `nasdaq_2018_2019.csv` (FNSPID news data)
- `Price_2018_2019/*.csv` (price data files)
## Usage
### Quick Start
```bash
python run_pipeline.py
```
### Direct Execution
```bash
python fnspid_pipeline.py
```
### Custom Configuration
Edit `config.py` to modify:
- Date ranges
- Price files to analyze
- Model parameters
- Output settings
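For orientation, a `config.py` covering these four groups might look like the sketch below. All variable names are hypothetical and the real file may be organized differently. The batch size, rolling window, and per-ticker file naming are assumptions; the model identifiers, tickers, directories, and the 50-observation minimum are taken from elsewhere in this README.

```python
# Hypothetical config.py layout; variable names are illustrative assumptions.
from pathlib import Path

# Date range
START_DATE = "2018-01-01"
END_DATE = "2019-12-31"

# Price files to analyze (one CSV per ticker is an assumed naming scheme)
PRICE_DIR = Path("Price_2018_2019")
TICKERS = ["VOO", "VTI", "IWM", "XLF", "EFA", "ACWI"]
PRICE_FILES = {t: PRICE_DIR / f"{t}.csv" for t in TICKERS}

# Model parameters
FINBERT_MODEL = "ProsusAI/finbert"
ROBERTA_MODEL = "cardiffnlp/twitter-roberta-base-sentiment"
BATCH_SIZE = 32        # assumed value
ROLLING_WINDOW = 60    # assumed value, in trading days
MIN_OBS = 50           # minimum data points per asset

# Output settings
OUTPUT_DIR = Path("outputs_fnspid")
```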
## Pipeline Steps
1. **Data Loading**: Load news articles and price data
2. **Sentiment Analysis**: Score news articles using FinBERT and RoBERTa, combined by the meta-classifier (or ensemble averaging as a fallback)
3. **Daily Aggregation**: Aggregate sentiment scores by date
4. **Alignment**: Align sentiment with price data
5. **Correlation Analysis**: Rolling correlation between returns and sentiment
6. **Cointegration Tests**: Johansen tests for long-term relationships
7. **Visualization**: Generate correlation plots
8. **Results Export**: Save analysis results to CSV
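Steps 3 to 5 can be sketched with pandas as shown below. This is a minimal sketch on synthetic data, not the pipeline's code: the `score` column, the use of log returns, the 30-day window, and the use of the standard deviation as "volatility" are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for scored articles and one asset's prices (not real data).
rng = np.random.default_rng(0)
days = pd.bdate_range("2018-01-02", periods=120)
articles = pd.DataFrame({
    "Date": days.repeat(3),                           # three articles per day
    "score": rng.uniform(-1, 1, size=len(days) * 3),  # assumed per-article sentiment score
})
prices = pd.DataFrame({
    "date": days,
    "adj close": 100 * np.exp(np.cumsum(rng.normal(0, 0.01, len(days)))),
})

# Step 3: aggregate article-level scores to one value per date.
daily_sent = articles.groupby("Date")["score"].mean().rename("sentiment")

# Step 4: align on date; an inner join keeps only days present in both series.
px = prices.set_index("date")["adj close"]
log_ret = np.log(px).diff().rename("ret")
aligned = pd.concat([log_ret, daily_sent], axis=1, join="inner").dropna()

# Step 5: rolling Pearson correlation between returns and sentiment.
WINDOW = 30  # assumed window length
roll = aligned["ret"].rolling(WINDOW).corr(aligned["sentiment"]).dropna()
print(f"mean={roll.mean():.4f} volatility={roll.std():.4f}")
```

The inner join drops any day that lacks either a price or a news score, which is one reason the number of usable observations per asset can fall below what the raw files suggest.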
## Outputs
All outputs are saved to `outputs_fnspid/` directory:
- `daily_sentiment.csv` - Daily aggregated sentiment scores
- `aligned_series.csv` - Aligned price and sentiment data
- `market_linkage_results.csv` - Analysis results summary
- `rolling_corr_*.png` - Correlation plots for each asset
## Data Requirements
- **FNSPID Data**: CSV with 'Date' and 'Article' columns
- **Price Data**: CSV files with 'date' and 'adj close' columns
- **Date Format**: YYYY-MM-DD or compatible pandas format
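A quick way to check input files against these requirements is sketched below. The helper names `load_news` and `load_prices` are hypothetical, and lower-casing the price column names is an assumption added so that headers such as `Adj Close` are also accepted.

```python
import io
import pandas as pd

def load_news(src):
    """Read news data and verify the 'Date' and 'Article' columns exist."""
    df = pd.read_csv(src)
    missing = {"Date", "Article"} - set(df.columns)
    if missing:
        raise ValueError(f"news file is missing columns: {sorted(missing)}")
    # Unparseable dates become NaT and are dropped below.
    df["Date"] = pd.to_datetime(df["Date"], errors="coerce").dt.normalize()
    return df.dropna(subset=["Date", "Article"])

def load_prices(src):
    """Read one price file and return the adjusted close indexed by date."""
    df = pd.read_csv(src)
    df.columns = [c.strip().lower() for c in df.columns]
    missing = {"date", "adj close"} - set(df.columns)
    if missing:
        raise ValueError(f"price file is missing columns: {sorted(missing)}")
    df["date"] = pd.to_datetime(df["date"], errors="coerce")
    df = df.dropna(subset=["date", "adj close"])
    return df.set_index("date")["adj close"].sort_index()

# Tiny in-memory examples standing in for the real CSV files.
news_csv = "Date,Article\n2018-01-02,Stocks rally\nnot-a-date,Bad row\n"
price_csv = "Date,Adj Close\n2018-01-03,101.5\n2018-01-02,100.0\n"
news = load_news(io.StringIO(news_csv))
prices = load_prices(io.StringIO(price_csv))
print(len(news), list(prices))
```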
## Hardware Requirements
- **CPU**: Multi-core recommended for FinBERT processing
- **GPU**: Optional but recommended for faster sentiment analysis
- **Memory**: 4GB+ RAM for full dataset processing
## Troubleshooting
- Ensure all data files exist in correct locations
- Check date formats match expected patterns
- Verify sufficient data points for analysis (minimum 50 per asset)
- For GPU issues, pipeline will automatically fall back to CPU
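The last two checks can be reproduced by hand when debugging. The sketch below is illustrative only: `pick_device` and `has_enough_data` are hypothetical helper names, and the pipeline's own fallback logic may differ.

```python
MIN_OBS = 50  # minimum data points per asset

def pick_device():
    """Return 'cuda' when PyTorch reports a usable GPU, otherwise 'cpu'."""
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch installed: nothing to run on GPU
    return "cuda" if torch.cuda.is_available() else "cpu"

def has_enough_data(n_obs, min_obs=MIN_OBS):
    """True when an asset has at least `min_obs` aligned observations."""
    return n_obs >= min_obs

print("device:", pick_device())
print("enough data:", has_enough_data(42))
```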
## Results Interpretation
- **Mean Correlation**: Average correlation between asset returns and sentiment
- **Correlation Volatility**: How much correlation varies over time
- **Johansen Tests**: Statistical tests for cointegration relationships
  - trace r=0: null hypothesis of no cointegrating relationship; rejecting it indicates at least one
  - trace r≤1: null hypothesis of at most one cointegrating relationship
  - A test statistic above the 5% critical value rejects the corresponding null hypothesis
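The decision rule for the two trace statistics can be written as a short function. This is a sketch for the two-variable case (one asset price series and one sentiment series), and `interpret_trace` is a hypothetical helper name. If the statistics come from `statsmodels`, `coint_johansen` in `statsmodels.tsa.vector_ar.vecm` exposes them as `lr1` with critical values in `cvt`; whether the pipeline uses exactly that call is an assumption.

```python
def interpret_trace(stat_r0, crit_r0, stat_r1, crit_r1):
    """Sequential trace test for two series; returns the estimated cointegration rank."""
    if stat_r0 <= crit_r0:
        return 0  # cannot reject r = 0: no evidence of cointegration
    if stat_r1 <= crit_r1:
        return 1  # reject r = 0 but not r <= 1: one cointegrating relationship
    return 2      # both nulls rejected: full rank

# Values from the example output in this README: 12.34 < 15.41, so r = 0 is not rejected.
rank = interpret_trace(12.34, 15.41, 3.21, 3.76)
print(rank)
```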
## Example Output
```
✅ Loaded price data for: ['VOO', 'VTI', 'IWM', 'XLF', 'EFA', 'ACWI']
📅 Date range: 2018-01-02 to 2019-12-31
✅ News rows in window: 25,847
✅ Using device: cuda
✅ Saved daily sentiment -> outputs_fnspid/daily_sentiment.csv
✅ Saved aligned series -> outputs_fnspid/aligned_series.csv
=== Analyzing VOO ===
Rolling Correlation: mean=0.0234 volatility=0.1456
Johansen: trace r=0 12.34 (crit5 15.41), trace r≤1 3.21 (crit5 3.76)
✅ Saved correlation plot -> outputs_fnspid/rolling_corr_VOO.png
```