Ranjit Behera
feat: Add comprehensive data pipeline and fine-tuning
9101d7e

Data Pipeline

A 3-step pipeline that prepares training data for fine-tuning the FinEE model.

Overview

Raw Data (XML/JSON/CSV/MBOX)
         β”‚
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Step 1: Unify          β”‚  β†’ step1_unified.csv
β”‚ Standardize formats    β”‚     (all records, single schema)
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Step 2: Filter         β”‚  β†’ step2_training_ready.csv
β”‚ Remove garbage         β”‚     (clean transactions only)
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Step 3: Baseline       β”‚  β†’ step3_baseline_results.csv
β”‚ Test current accuracy  β”‚     (extracted fields + metrics)
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
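The flow above can be sketched end to end in a few lines. The function names and keyword/regex logic below are hypothetical stand-ins, not the real implementations in scripts/data_pipeline/:

```python
import re

# Toy stand-ins for step1_unify.py, step2_filter.py, step3_baseline.py.
# Same data flow, simplified logic.

def unify(records):
    """Step 1: coerce every raw record into the single 4-column schema."""
    return [
        {"timestamp": r.get("date", ""), "sender": r.get("from", ""),
         "body": r.get("text", ""), "source": r.get("file", "")}
        for r in records
    ]

def filter_garbage(rows):
    """Step 2: keep only rows that look like transactions (toy keyword test)."""
    keep = ("debited", "credited", "upi")
    return [r for r in rows if any(k in r["body"].lower() for k in keep)]

def baseline_success_rate(rows):
    """Step 3: fraction of kept rows where an amount can be extracted."""
    if not rows:
        return 0.0
    hits = sum(1 for r in rows if re.search(r"(?:Rs\.?|INR)\s*[\d,.]+", r["body"]))
    return hits / len(rows)
```

Each real step reads the previous step's CSV from disk; the sketch just passes lists of dicts between functions.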

Quick Start

1. Copy your data to the workspace

# Example: Google Takeout export
cp -r ~/Downloads/Takeout /Users/ranjit/llm-mail-trainer/data/raw/

# Or SMS backup
cp sms_backup.xml /Users/ranjit/llm-mail-trainer/data/raw/

2. Run the pipeline

cd /Users/ranjit/llm-mail-trainer

# Step 1: Unify all data formats
python scripts/data_pipeline/step1_unify.py --input data/raw/ --output data/pipeline/step1_unified.csv

# Step 2: Filter garbage (OTPs, spam, etc.)
python scripts/data_pipeline/step2_filter.py --input data/pipeline/step1_unified.csv

# Step 3: Test baseline accuracy
python scripts/data_pipeline/step3_baseline.py

Supported Input Formats

Format   Source            Example
.mbox    Gmail export      Mail.mbox
.json    Google Takeout    transactions.json
.csv     Bank exports      statements.csv
.xml     SMS Backup apps   sms_backup.xml
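Routing by file extension is one straightforward way to dispatch these formats; the parser names below are illustrative placeholders, not the actual step1 internals:

```python
from pathlib import Path

# Map each supported extension to a (hypothetical) parser name.
PARSERS = {
    ".mbox": "parse_mbox",     # Gmail export
    ".json": "parse_json",     # Google Takeout
    ".csv":  "parse_csv",      # bank statement exports
    ".xml":  "parse_sms_xml",  # SMS Backup apps
}

def pick_parser(path):
    """Return the parser name for a raw input file, or raise on unknown types."""
    ext = Path(path).suffix.lower()
    if ext not in PARSERS:
        raise ValueError(f"unsupported input format: {ext}")
    return PARSERS[ext]
```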

Output Schema

All records are standardized to the following schema:

Column      Type     Description
timestamp   string   When the message was received
sender      string   Bank/sender name
body        string   Message content
source      string   Original source file
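Rows in that schema can be round-tripped with the standard csv module; the sample values here are made up for illustration:

```python
import csv
import io

FIELDS = ["timestamp", "sender", "body", "source"]

# Write one unified row to an in-memory CSV (on disk this would be
# data/pipeline/step1_unified.csv).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=FIELDS)
writer.writeheader()
writer.writerow({
    "timestamp": "2024-03-01 10:15:00",
    "sender": "HDFCBK",
    "body": "Rs.1,250.00 debited from A/c XX1234",
    "source": "sms_backup.xml",
})

# Read it back; DictReader recovers the schema from the header row.
rows = list(csv.DictReader(io.StringIO(buf.getvalue())))
```

Note the body column can contain commas; DictWriter quotes such fields so the round trip is lossless.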

Step 2 Filters

Messages REMOVED (Garbage):

  • OTPs / Verification codes
  • Login alerts
  • Marketing spam (% off, offers)
  • Bill reminders (not transactions)
  • Account statements
  • Delivery notifications

Messages KEPT (Transactions):

  • Debit/Credit notifications
  • UPI payments
  • NEFT/IMPS transfers
  • Amount + account references

Step 3 Metrics

The baseline test measures:

Metric                    Description
Extraction Success Rate   % of messages where amount + type were extracted
Field Coverage            Per-field % (amount, merchant, etc.)
Confidence Distribution   LOW / MEDIUM / HIGH breakdown
Top Merchants             Most common extracted merchants
Processing Speed          Messages per second
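The headline metric reduces to a simple ratio. This sketch assumes the results table exposes extracted amount and txn_type columns (column names are an assumption):

```python
def extraction_success_rate(results):
    """% of rows where both amount and type were extracted (non-empty)."""
    if not results:
        return 0.0
    hits = sum(1 for r in results if r.get("amount") and r.get("txn_type"))
    return 100.0 * hits / len(results)
```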

Directory Structure

data/
β”œβ”€β”€ raw/                    # Put your raw data here
β”‚   └── Takeout/
β”‚       └── ...
β”œβ”€β”€ pipeline/               # Pipeline outputs
β”‚   β”œβ”€β”€ step1_unified.csv
β”‚   β”œβ”€β”€ step2_training_ready.csv
β”‚   β”œβ”€β”€ step2_garbage.csv       # optional
β”‚   β”œβ”€β”€ step2_uncertain.csv     # optional
β”‚   β”œβ”€β”€ step3_baseline_results.csv
β”‚   └── step3_baseline_analysis.json
└── training/               # Final training data (after labeling)

Next Steps

After Step 3:

  1. Review low-confidence extractions
  2. Add ground truth labels for training
  3. Identify patterns that need new regex
  4. Fine-tune the LLM on labeled data
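For step 1 of the follow-up, low-confidence rows can be pulled straight out of the baseline results with the csv module. The confidence column name and values mirror the LOW / MEDIUM / HIGH breakdown above, but the exact schema of step3_baseline_results.csv is an assumption:

```python
import csv
import io

# Toy stand-in for step3_baseline_results.csv (column names assumed).
raw = """body,amount,confidence
Rs.500 debited from A/c XX12,500,HIGH
paid 200 to corner shop,,LOW
UPI txn of 99 at cafe,99,MEDIUM
"""

rows = csv.DictReader(io.StringIO(raw))
to_review = [r for r in rows if r["confidence"] != "HIGH"]
```

In practice you would open the real results file instead of the inline sample, and hand to_review to whoever is adding ground-truth labels.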