# Data Pipeline
A three-step pipeline that prepares training data for fine-tuning the FinEE model.
## Overview
```
Raw Data (XML/JSON/CSV/MBOX)
β”‚
β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Step 1: Unify β”‚ β†’ step1_unified.csv
β”‚ Standardize formats β”‚ (all records, single schema)
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚
β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Step 2: Filter β”‚ β†’ step2_training_ready.csv
β”‚ Remove garbage β”‚ (clean transactions only)
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚
β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Step 3: Baseline β”‚ β†’ step3_baseline_results.csv
β”‚ Test current accuracy β”‚ (extracted fields + metrics)
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
## Quick Start
### 1. Copy your data to the workspace
```bash
# Example: Google Takeout export
cp -r ~/Downloads/Takeout /Users/ranjit/llm-mail-trainer/data/raw/
# Or SMS backup
cp sms_backup.xml /Users/ranjit/llm-mail-trainer/data/raw/
```
### 2. Run the pipeline
```bash
cd /Users/ranjit/llm-mail-trainer
# Step 1: Unify all data formats
python scripts/data_pipeline/step1_unify.py --input data/raw/ --output data/pipeline/step1_unified.csv
# Step 2: Filter garbage (OTPs, spam, etc.)
python scripts/data_pipeline/step2_filter.py --input data/pipeline/step1_unified.csv
# Step 3: Test baseline accuracy
python scripts/data_pipeline/step3_baseline.py
```
## Supported Input Formats
| Format | Source | Example |
|--------|--------|---------|
| `.mbox` | Gmail export | Mail.mbox |
| `.json` | Google Takeout | transactions.json |
| `.csv` | Bank exports | statements.csv |
| `.xml` | SMS Backup apps | sms_backup.xml |
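Step 1 has to route each raw file to a format-specific parser. A minimal dispatch sketch is below; the parser names and registry layout are illustrative assumptions, not the actual structure of `step1_unify.py`:

```python
from pathlib import Path

# Hypothetical parser registry; the real step1_unify.py may organize this differently.
PARSERS = {
    ".mbox": "parse_mbox",          # Gmail export
    ".json": "parse_takeout_json",  # Google Takeout
    ".csv": "parse_bank_csv",       # Bank exports
    ".xml": "parse_sms_xml",        # SMS Backup apps
}

def pick_parser(path: str) -> str:
    """Map a raw input file to its parser by file extension (case-insensitive)."""
    ext = Path(path).suffix.lower()
    if ext not in PARSERS:
        raise ValueError(f"unsupported input format: {ext}")
    return PARSERS[ext]
```

Unknown extensions fail loudly rather than being silently skipped, so stray files in `data/raw/` are surfaced early.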
## Output Schema
All data is standardized to:
| Column | Type | Description |
|--------|------|-------------|
| `timestamp` | string | When the message was received |
| `sender` | string | Bank/sender name |
| `body` | string | Message content |
| `source` | string | Original file source |
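Writing the unified CSV can be sketched with the standard-library `csv` module; the sample record and the fill-missing-fields-with-empty-string behavior are assumptions for illustration:

```python
import csv
import io

# The four-column schema every record is normalized to.
UNIFIED_COLUMNS = ["timestamp", "sender", "body", "source"]

def write_unified(rows, fh):
    """Write records to a Step 1-style unified CSV, filling missing fields with ''."""
    writer = csv.DictWriter(fh, fieldnames=UNIFIED_COLUMNS)
    writer.writeheader()
    for row in rows:
        writer.writerow({col: row.get(col, "") for col in UNIFIED_COLUMNS})

buf = io.StringIO()
write_unified(
    [{"timestamp": "2024-01-05 10:32:00", "sender": "HDFCBK",
      "body": "Rs.500 debited from a/c XX1234", "source": "sms_backup.xml"}],
    buf,
)
```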
## Step 2 Filters
### Messages REMOVED (Garbage):
- OTPs / Verification codes
- Login alerts
- Marketing spam (% off, offers)
- Bill reminders (not transactions)
- Account statements
- Delivery notifications
### Messages KEPT (Transactions):
- Debit/Credit notifications
- UPI payments
- NEFT/IMPS transfers
- Messages containing an amount plus an account reference
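The keep/remove split above can be sketched as a rule-based classifier. The patterns here are illustrative only; the actual rules in `step2_filter.py` may differ:

```python
import re

# Illustrative patterns only; the real step2_filter.py rules may differ.
GARBAGE = [r"\bOTP\b", r"verification code", r"login alert",
           r"\d+% off", r"bill due", r"statement", r"out for delivery"]
TRANSACTION = [r"\bdebited\b", r"\bcredited\b", r"\bUPI\b", r"\bNEFT\b", r"\bIMPS\b"]

def classify(body: str) -> str:
    """Bucket a message as 'garbage', 'transaction', or 'uncertain'.

    Garbage rules win, so an OTP that happens to mention an amount is still
    dropped; anything matching neither list is routed to the uncertain bucket
    (cf. step2_uncertain.csv) for manual review.
    """
    if any(re.search(p, body, re.IGNORECASE) for p in GARBAGE):
        return "garbage"
    if any(re.search(p, body, re.IGNORECASE) for p in TRANSACTION):
        return "transaction"
    return "uncertain"
```

Checking garbage rules first is the safer ordering: it biases toward dropping ambiguous messages rather than letting noise into the training set.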
## Step 3 Metrics
The baseline test measures:
| Metric | Description |
|--------|-------------|
| Extraction Success Rate | % of messages where both amount and type were extracted |
| Field Coverage | % for each field (amount, merchant, etc.) |
| Confidence Distribution | LOW / MEDIUM / HIGH breakdown |
| Top Merchants | Most common extracted merchants |
| Processing Speed | Messages per second |
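The headline metric is simple to compute. A minimal sketch follows; the field names `amount` and `type` are assumptions about the columns in `step3_baseline_results.csv`, not confirmed names:

```python
def extraction_success_rate(results):
    """Fraction of messages where both an amount and a transaction type
    were extracted (the headline Step 3 metric)."""
    if not results:
        return 0.0
    hits = sum(1 for r in results if r.get("amount") and r.get("type"))
    return hits / len(results)
```

The same pattern extends to per-field coverage: count non-empty values for each column (merchant, account, etc.) and divide by the total.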
## Directory Structure
```
data/
β”œβ”€β”€ raw/ # Put your raw data here
β”‚ └── Takeout/
β”‚ └── ...
β”œβ”€β”€ pipeline/ # Pipeline outputs
β”‚ β”œβ”€β”€ step1_unified.csv
β”‚ β”œβ”€β”€ step2_training_ready.csv
β”‚ β”œβ”€β”€ step2_garbage.csv (optional)
β”‚ β”œβ”€β”€ step2_uncertain.csv (optional)
β”‚ β”œβ”€β”€ step3_baseline_results.csv
β”‚ └── step3_baseline_analysis.json
└── training/ # Final training data (after labeling)
```
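A quick sanity check that a run completed can verify the layout above. This helper is a sketch, not part of the pipeline; the optional `step2_garbage.csv` / `step2_uncertain.csv` files are deliberately not checked:

```python
from pathlib import Path

# Required outputs per the directory layout above; the optional Step 2
# side-files (garbage/uncertain) are excluded on purpose.
REQUIRED = [
    "data/pipeline/step1_unified.csv",
    "data/pipeline/step2_training_ready.csv",
    "data/pipeline/step3_baseline_results.csv",
    "data/pipeline/step3_baseline_analysis.json",
]

def missing_outputs(root="."):
    """List required pipeline outputs that do not yet exist under `root`."""
    return [rel for rel in REQUIRED if not (Path(root) / rel).exists()]
```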
## Next Steps
After Step 3:
1. Review low-confidence extractions
2. Add ground truth labels for training
3. Identify patterns that need new regex
4. Fine-tune the LLM on labeled data