Ranjit Behera
feat: Add comprehensive data pipeline and fine-tuning
9101d7e

Data Pipeline

A 3-step pipeline that prepares training data for fine-tuning the FinEE model.

Overview

Raw Data (XML/JSON/CSV/MBOX)
         β”‚
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Step 1: Unify          β”‚  β†’ step1_unified.csv
β”‚ Standardize formats    β”‚     (all records, single schema)
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Step 2: Filter         β”‚  β†’ step2_training_ready.csv
β”‚ Remove garbage         β”‚     (clean transactions only)
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Step 3: Baseline       β”‚  β†’ step3_baseline_results.csv
β”‚ Test current accuracy  β”‚     (extracted fields + metrics)
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
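The flow above can be sketched end to end in a few lines. The function names and keyword/regex logic below are hypothetical stand-ins, not the real implementations in scripts/data_pipeline/:

```python
import re

# Toy stand-ins for step1_unify.py, step2_filter.py, step3_baseline.py.
# Same data flow, simplified logic.

def unify(records):
    """Step 1: coerce every raw record into the single 4-column schema."""
    return [
        {"timestamp": r.get("date", ""), "sender": r.get("from", ""),
         "body": r.get("text", ""), "source": r.get("file", "")}
        for r in records
    ]

def filter_garbage(rows):
    """Step 2: keep only rows that look like transactions (toy keyword test)."""
    keep = ("debited", "credited", "upi")
    return [r for r in rows if any(k in r["body"].lower() for k in keep)]

def baseline_success_rate(rows):
    """Step 3: fraction of kept rows where an amount can be extracted."""
    if not rows:
        return 0.0
    hits = sum(1 for r in rows if re.search(r"(?:Rs\.?|INR)\s*[\d,.]+", r["body"]))
    return hits / len(rows)
```

Each real step reads the previous step's CSV from disk; the sketch just passes lists of dicts between functions.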

Quick Start

1. Copy your data to the workspace

# Example: Google Takeout export
cp -r ~/Downloads/Takeout /Users/ranjit/llm-mail-trainer/data/raw/

# Or SMS backup
cp sms_backup.xml /Users/ranjit/llm-mail-trainer/data/raw/

2. Run the pipeline

cd /Users/ranjit/llm-mail-trainer

# Step 1: Unify all data formats
python scripts/data_pipeline/step1_unify.py --input data/raw/ --output data/pipeline/step1_unified.csv

# Step 2: Filter garbage (OTPs, spam, etc.)
python scripts/data_pipeline/step2_filter.py --input data/pipeline/step1_unified.csv

# Step 3: Test baseline accuracy
python scripts/data_pipeline/step3_baseline.py

Supported Input Formats

Format   Source            Example
.mbox    Gmail export      Mail.mbox
.json    Google Takeout    transactions.json
.csv     Bank exports      statements.csv
.xml     SMS Backup apps   sms_backup.xml
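Routing by file extension is one straightforward way to dispatch these formats; the parser names below are illustrative placeholders, not the actual step1 internals:

```python
from pathlib import Path

# Map each supported extension to a (hypothetical) parser name.
PARSERS = {
    ".mbox": "parse_mbox",     # Gmail export
    ".json": "parse_json",     # Google Takeout
    ".csv":  "parse_csv",      # bank statement exports
    ".xml":  "parse_sms_xml",  # SMS Backup apps
}

def pick_parser(path):
    """Return the parser name for a raw input file, or raise on unknown types."""
    ext = Path(path).suffix.lower()
    if ext not in PARSERS:
        raise ValueError(f"unsupported input format: {ext}")
    return PARSERS[ext]
```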

Output Schema

All records are standardized to the following schema:

Column      Type     Description
timestamp   string   When the message was received
sender      string   Bank/sender name
body        string   Message content
source      string   Original source file
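Rows in that schema can be round-tripped with the standard csv module; the sample values here are made up for illustration:

```python
import csv
import io

FIELDS = ["timestamp", "sender", "body", "source"]

# Write one unified row to an in-memory CSV (on disk this would be
# data/pipeline/step1_unified.csv).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=FIELDS)
writer.writeheader()
writer.writerow({
    "timestamp": "2024-03-01 10:15:00",
    "sender": "HDFCBK",
    "body": "Rs.1,250.00 debited from A/c XX1234",
    "source": "sms_backup.xml",
})

# Read it back; DictReader recovers the schema from the header row.
rows = list(csv.DictReader(io.StringIO(buf.getvalue())))
```

Note the body column can contain commas; DictWriter quotes such fields so the round trip is lossless.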

Step 2 Filters

Messages REMOVED (Garbage):

  • OTPs / Verification codes
  • Login alerts
  • Marketing spam (% off, offers)
  • Bill reminders (not transactions)
  • Account statements
  • Delivery notifications

Messages KEPT (Transactions):

  • Debit/Credit notifications
  • UPI payments
  • NEFT/IMPS transfers
  • Amount + account references

Step 3 Metrics

The baseline test measures:

Metric                    Description
Extraction Success Rate   % of messages where amount + type were extracted
Field Coverage            Per-field % (amount, merchant, etc.)
Confidence Distribution   LOW / MEDIUM / HIGH breakdown
Top Merchants             Most common extracted merchants
Processing Speed          Messages per second
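The headline metric reduces to a simple ratio. This sketch assumes the results table exposes extracted amount and txn_type columns (column names are an assumption):

```python
def extraction_success_rate(results):
    """% of rows where both amount and type were extracted (non-empty)."""
    if not results:
        return 0.0
    hits = sum(1 for r in results if r.get("amount") and r.get("txn_type"))
    return 100.0 * hits / len(results)
```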

Directory Structure

data/
β”œβ”€β”€ raw/                    # Put your raw data here
β”‚   └── Takeout/
β”‚       └── ...
β”œβ”€β”€ pipeline/               # Pipeline outputs
β”‚   β”œβ”€β”€ step1_unified.csv
β”‚   β”œβ”€β”€ step2_training_ready.csv
β”‚   β”œβ”€β”€ step2_garbage.csv       # optional
β”‚   β”œβ”€β”€ step2_uncertain.csv     # optional
β”‚   β”œβ”€β”€ step3_baseline_results.csv
β”‚   └── step3_baseline_analysis.json
└── training/               # Final training data (after labeling)

Next Steps

After Step 3:

  1. Review low-confidence extractions
  2. Add ground truth labels for training
  3. Identify patterns that need new regex
  4. Fine-tune the LLM on labeled data
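For step 1 of the follow-up, low-confidence rows can be pulled straight out of the baseline results with the csv module. The confidence column name and values mirror the LOW / MEDIUM / HIGH breakdown above, but the exact schema of step3_baseline_results.csv is an assumption:

```python
import csv
import io

# Toy stand-in for step3_baseline_results.csv (column names assumed).
raw = """body,amount,confidence
Rs.500 debited from A/c XX12,500,HIGH
paid 200 to corner shop,,LOW
UPI txn of 99 at cafe,99,MEDIUM
"""

rows = csv.DictReader(io.StringIO(raw))
to_review = [r for r in rows if r["confidence"] != "HIGH"]
```

In practice you would open the real results file instead of the inline sample, and hand to_review to whoever is adding ground-truth labels.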