# Data Pipeline

A three-step pipeline that prepares training data for FinEE model fine-tuning.
## Overview

```
Raw Data (XML/JSON/CSV/MBOX)
            │
            ▼
┌──────────────────────────┐
│ Step 1: Unify            │ → step1_unified.csv
│ Standardize formats      │   (all records, single schema)
└──────────────────────────┘
            │
            ▼
┌──────────────────────────┐
│ Step 2: Filter           │ → step2_training_ready.csv
│ Remove garbage           │   (clean transactions only)
└──────────────────────────┘
            │
            ▼
┌──────────────────────────┐
│ Step 3: Baseline         │ → step3_baseline_results.csv
│ Test current accuracy    │   (extracted fields + metrics)
└──────────────────────────┘
```
## Quick Start

### 1. Copy your data to the workspace

```bash
# Example: Google Takeout export
cp -r ~/Downloads/Takeout /Users/ranjit/llm-mail-trainer/data/raw/

# Or an SMS backup
cp sms_backup.xml /Users/ranjit/llm-mail-trainer/data/raw/
```

### 2. Run the pipeline

```bash
cd /Users/ranjit/llm-mail-trainer

# Step 1: Unify all data formats
python scripts/data_pipeline/step1_unify.py --input data/raw/ --output data/pipeline/step1_unified.csv

# Step 2: Filter garbage (OTPs, spam, etc.)
python scripts/data_pipeline/step2_filter.py --input data/pipeline/step1_unified.csv

# Step 3: Test baseline accuracy
python scripts/data_pipeline/step3_baseline.py
```
## Supported Input Formats

| Format | Source | Example |
|---|---|---|
| `.mbox` | Gmail export | `Mail.mbox` |
| `.json` | Google Takeout | `transactions.json` |
| `.csv` | Bank exports | `statements.csv` |
| `.xml` | SMS backup apps | `sms_backup.xml` |
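One way to picture Step 1 is as a dispatch on file extension. The parser names below are hypothetical stubs for illustration, not the actual internals of `step1_unify.py`:

```python
from pathlib import Path

# Stub parsers for illustration; each would return a list of dicts
# in the unified schema (timestamp, sender, body, source).
def parse_mbox(path):
    return []

def parse_json(path):
    return []

def parse_csv(path):
    return []

def parse_xml(path):
    return []

# Dispatch table keyed on file extension.
PARSERS = {
    ".mbox": parse_mbox,
    ".json": parse_json,
    ".csv": parse_csv,
    ".xml": parse_xml,
}

def parse_file(path):
    """Route a raw file to the parser for its extension."""
    ext = Path(path).suffix.lower()
    if ext not in PARSERS:
        raise ValueError(f"Unsupported format: {ext}")
    return PARSERS[ext](path)
```

A dispatch table like this makes adding a new input format a one-line change plus a parser function.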
## Output Schema

All data is standardized to:

| Column | Type | Description |
|---|---|---|
| `timestamp` | string | When the message was received |
| `sender` | string | Bank/sender name |
| `body` | string | Message content |
| `source` | string | Original file source |
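A minimal sketch of that schema as a dataclass plus a CSV writer — the class and function names here are illustrative, not taken from the pipeline scripts:

```python
import csv
from dataclasses import dataclass, asdict, fields

@dataclass
class UnifiedRecord:
    timestamp: str  # when the message was received
    sender: str     # bank/sender name
    body: str       # message content
    source: str     # original file the record came from

def write_unified(records, out_path):
    """Write records to a CSV whose header is the unified schema."""
    cols = [f.name for f in fields(UnifiedRecord)]
    with open(out_path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=cols)
        writer.writeheader()
        for rec in records:
            writer.writerow(asdict(rec))
```

Deriving the header from the dataclass keeps the schema defined in exactly one place.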
## Step 2 Filters

**Messages removed (garbage):**

- OTPs / verification codes
- Login alerts
- Marketing spam (% off, offers)
- Bill reminders (not transactions)
- Account statements
- Delivery notifications

**Messages kept (transactions):**

- Debit/credit notifications
- UPI payments
- NEFT/IMPS transfers
- Messages with amount + account references
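A keyword-based classifier along these lines could implement the split; the patterns below are illustrative examples, not `step2_filter.py`'s actual rules:

```python
import re

# Illustrative patterns only; the real filter rules may differ.
GARBAGE_PATTERNS = [
    r"\bOTP\b", r"verification code", r"login alert",
    r"\d+% off", r"bill (?:due|reminder)", r"out for delivery",
]
KEEP_PATTERNS = [
    r"\bdebited\b", r"\bcredited\b", r"\bUPI\b",
    r"\bNEFT\b", r"\bIMPS\b",
]

def classify(body):
    """Return 'garbage', 'transaction', or 'uncertain' for a message body."""
    if any(re.search(p, body, re.IGNORECASE) for p in GARBAGE_PATTERNS):
        return "garbage"
    if any(re.search(p, body, re.IGNORECASE) for p in KEEP_PATTERNS):
        return "transaction"
    return "uncertain"
```

Checking garbage patterns first is deliberate: a message that mentions both an OTP and a debit is almost certainly a verification message, not a transaction. The `uncertain` bucket corresponds to the optional `step2_uncertain.csv` output.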
## Step 3 Metrics

The baseline test measures:
| Metric | Description |
|---|---|
| Extraction Success Rate | % of messages where both amount and type were extracted |
| Field Coverage | % for each field (amount, merchant, etc.) |
| Confidence Distribution | LOW / MEDIUM / HIGH breakdown |
| Top Merchants | Most common extracted merchants |
| Processing Speed | Messages per second |
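The first two metrics reduce to simple counts over the result rows. A sketch, assuming each row is a dict of extracted fields (the field names are assumptions, not the script's actual column names):

```python
def extraction_success_rate(rows):
    """Fraction of rows where both amount and type were extracted."""
    if not rows:
        return 0.0
    ok = sum(1 for r in rows if r.get("amount") and r.get("type"))
    return ok / len(rows)

def field_coverage(rows, field):
    """Fraction of rows with a non-empty value for a given field."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(field)) / len(rows)
```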
## Directory Structure

```
data/
├── raw/                 # Put your raw data here
│   ├── Takeout/
│   └── ...
├── pipeline/            # Pipeline outputs
│   ├── step1_unified.csv
│   ├── step2_training_ready.csv
│   ├── step2_garbage.csv            (optional)
│   ├── step2_uncertain.csv          (optional)
│   ├── step3_baseline_results.csv
│   └── step3_baseline_analysis.json
└── training/            # Final training data (after labeling)
```
## Next Steps

After Step 3:

- Review low-confidence extractions
- Add ground-truth labels for training
- Identify patterns that need new regexes
- Fine-tune the LLM on the labeled data
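Pulling the low-confidence rows out of the Step 3 results for review needs only the standard library; note that the `confidence` column name is an assumption about `step3_baseline_results.csv`:

```python
import csv

def low_confidence_rows(results_csv, confidence_col="confidence"):
    """Yield result rows flagged LOW for manual review.

    The column name is assumed; adjust it to match the actual CSV header.
    """
    with open(results_csv, newline="") as fh:
        for row in csv.DictReader(fh):
            if row.get(confidence_col, "").upper() == "LOW":
                yield row
```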