# Data Pipeline
A 3-step pipeline that prepares training data for fine-tuning the FinEE model.
## Overview
```
Raw Data (XML/JSON/CSV/MBOX)
             │
             ▼
┌──────────────────────────┐
│ Step 1: Unify            │ → step1_unified.csv
│ Standardize formats      │   (all records, single schema)
└──────────────────────────┘
             │
             ▼
┌──────────────────────────┐
│ Step 2: Filter           │ → step2_training_ready.csv
│ Remove garbage           │   (clean transactions only)
└──────────────────────────┘
             │
             ▼
┌──────────────────────────┐
│ Step 3: Baseline         │ → step3_baseline_results.csv
│ Test current accuracy    │   (extracted fields + metrics)
└──────────────────────────┘
```
## Quick Start
### 1. Copy your data to the workspace
```bash
# Example: Google Takeout export
cp -r ~/Downloads/Takeout /Users/ranjit/llm-mail-trainer/data/raw/
# Or SMS backup
cp sms_backup.xml /Users/ranjit/llm-mail-trainer/data/raw/
```
### 2. Run the pipeline
```bash
cd /Users/ranjit/llm-mail-trainer
# Step 1: Unify all data formats
python scripts/data_pipeline/step1_unify.py --input data/raw/ --output data/pipeline/step1_unified.csv
# Step 2: Filter garbage (OTPs, spam, etc.)
python scripts/data_pipeline/step2_filter.py --input data/pipeline/step1_unified.csv
# Step 3: Test baseline accuracy
python scripts/data_pipeline/step3_baseline.py
```
## Supported Input Formats
| Format | Source | Example |
|--------|--------|---------|
| `.mbox` | Gmail export | Mail.mbox |
| `.json` | Google Takeout | transactions.json |
| `.csv` | Bank exports | statements.csv |
| `.xml` | SMS Backup apps | sms_backup.xml |
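
Step 1 presumably picks a parser per input file. A minimal dispatch-by-extension sketch (these helper names are illustrative, not the actual `step1_unify.py` internals):

```python
import csv
import json
from pathlib import Path

def parse_csv(path):
    """Bank statement exports: one record per row."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def parse_json(path):
    """Google Takeout dumps: a list of records (or a single object)."""
    with open(path) as f:
        data = json.load(f)
    return data if isinstance(data, list) else [data]

# .mbox and .xml would be handled analogously with the stdlib
# `mailbox` and `xml.etree.ElementTree` modules (omitted for brevity).
PARSERS = {".csv": parse_csv, ".json": parse_json}

def load_records(path):
    """Pick a parser by file extension; fail loudly on unknown formats."""
    ext = Path(path).suffix.lower()
    if ext not in PARSERS:
        raise ValueError(f"unsupported format: {ext}")
    return PARSERS[ext](path)
```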
## Output Schema
All data is standardized to:
| Column | Type | Description |
|--------|------|-------------|
| `timestamp` | string | When the message was received |
| `sender` | string | Bank/sender name |
| `body` | string | Message content |
| `source` | string | Original file source |
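
A sketch of mapping one source-specific record onto this schema. The fallback key names (`date`, `address`, `text`) are assumptions about raw input fields, not the real Step 1 mapping:

```python
UNIFIED_COLUMNS = ["timestamp", "sender", "body", "source"]

def to_unified(record: dict, source_file: str) -> dict:
    """Normalize a raw record into the four unified columns.
    The fallback keys below are illustrative, not the actual step1 logic."""
    return {
        "timestamp": record.get("timestamp") or record.get("date", ""),
        "sender": record.get("sender") or record.get("address", ""),
        "body": record.get("body") or record.get("text", ""),
        "source": source_file,
    }
```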
## Step 2 Filters
### Messages REMOVED (Garbage):
- OTPs / Verification codes
- Login alerts
- Marketing spam (% off, offers)
- Bill reminders (not transactions)
- Account statements
- Delivery notifications
### Messages KEPT (Transactions):
- Debit/Credit notifications
- UPI payments
- NEFT/IMPS transfers
- Amount + account references
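
The exact heuristics live in `step2_filter.py`; a minimal keyword-based sketch of the same idea (the pattern lists below are illustrative, not the script's actual rules), including the three-way split that explains the optional `step2_uncertain.csv` output:

```python
import re

# Hypothetical patterns for messages to drop (OTPs, spam, reminders).
GARBAGE_PATTERNS = [
    r"\botp\b", r"verification code", r"login alert",
    r"%\s*off", r"bill due", r"out for delivery",
]
# Hypothetical patterns for messages to keep (real transactions).
KEEP_PATTERNS = [
    r"\bdebited\b", r"\bcredited\b", r"\bupi\b",
    r"\bneft\b", r"\bimps\b", r"(?:rs\.?|inr)\s*[\d,]+",
]

def classify(body: str) -> str:
    """Return 'garbage', 'transaction', or 'uncertain' for one message."""
    text = body.lower()
    if any(re.search(p, text) for p in GARBAGE_PATTERNS):
        return "garbage"
    if any(re.search(p, text) for p in KEEP_PATTERNS):
        return "transaction"
    return "uncertain"
```

Garbage rules fire first, so an OTP that happens to mention an amount is still dropped.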
## Step 3 Metrics
The baseline test measures:
| Metric | Description |
|--------|-------------|
| Extraction Success Rate | % of messages where both amount and type were extracted |
| Field Coverage | % for each field (amount, merchant, etc.) |
| Confidence Distribution | LOW / MEDIUM / HIGH breakdown |
| Top Merchants | Most common extracted merchants |
| Processing Speed | Messages per second |
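
The first two metrics reduce to simple ratios over the extraction output. A sketch, assuming Step 3 emits one dict per message with extracted columns (the column names `amount` and `txn_type` are assumptions, not confirmed from `step3_baseline.py`):

```python
def extraction_success_rate(rows: list[dict]) -> float:
    """Fraction of rows where both 'amount' and 'txn_type' are non-empty.
    Column names are hypothetical placeholders for step3's output."""
    if not rows:
        return 0.0
    ok = sum(1 for r in rows if r.get("amount") and r.get("txn_type"))
    return ok / len(rows)

def field_coverage(rows: list[dict], field: str) -> float:
    """Fraction of rows with a non-empty value for one extracted field."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(field)) / len(rows)
```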
## Directory Structure
```
data/
├── raw/                      # Put your raw data here
│   ├── Takeout/
│   └── ...
├── pipeline/                 # Pipeline outputs
│   ├── step1_unified.csv
│   ├── step2_training_ready.csv
│   ├── step2_garbage.csv (optional)
│   ├── step2_uncertain.csv (optional)
│   ├── step3_baseline_results.csv
│   └── step3_baseline_analysis.json
└── training/                 # Final training data (after labeling)
```
## Next Steps
After Step 3:
1. Review low-confidence extractions
2. Add ground truth labels for training
3. Identify patterns that need new regex
4. Fine-tune the LLM on labeled data