# Data Pipeline

A 3-step pipeline that prepares training data for fine-tuning the FinEE model.

## Overview

```
Raw Data (XML/JSON/CSV/MBOX)
         β”‚
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Step 1: Unify          β”‚  β†’ step1_unified.csv
β”‚ Standardize formats    β”‚     (all records, single schema)
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Step 2: Filter         β”‚  β†’ step2_training_ready.csv
β”‚ Remove garbage         β”‚     (clean transactions only)
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Step 3: Baseline       β”‚  β†’ step3_baseline_results.csv
β”‚ Test current accuracy  β”‚     (extracted fields + metrics)
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

## Quick Start

### 1. Copy your data to the workspace

```bash
# Example: Google Takeout export
cp -r ~/Downloads/Takeout /Users/ranjit/llm-mail-trainer/data/raw/

# Or SMS backup
cp sms_backup.xml /Users/ranjit/llm-mail-trainer/data/raw/
```

### 2. Run the pipeline

```bash
cd /Users/ranjit/llm-mail-trainer

# Step 1: Unify all data formats
python scripts/data_pipeline/step1_unify.py --input data/raw/ --output data/pipeline/step1_unified.csv

# Step 2: Filter garbage (OTPs, spam, etc.)
python scripts/data_pipeline/step2_filter.py --input data/pipeline/step1_unified.csv

# Step 3: Test baseline accuracy
python scripts/data_pipeline/step3_baseline.py
```
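
The same chain can be driven from Python. The `build_commands` helper below is a sketch, not part of the repo; it only mirrors the three CLI invocations above so they can be scripted:

```python
from pathlib import Path

# Illustrative runner: build_commands is NOT part of the repo; it just
# mirrors the three CLI invocations above in a scriptable form.
def build_commands(raw_dir="data/raw/", out_dir="data/pipeline"):
    out = Path(out_dir)
    return [
        ["python", "scripts/data_pipeline/step1_unify.py",
         "--input", raw_dir,
         "--output", str(out / "step1_unified.csv")],
        ["python", "scripts/data_pipeline/step2_filter.py",
         "--input", str(out / "step1_unified.csv")],
        ["python", "scripts/data_pipeline/step3_baseline.py"],
    ]

# To actually run them:
#   for cmd in build_commands():
#       subprocess.run(cmd, check=True)
```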

## Supported Input Formats

| Format | Source | Example |
|--------|--------|---------|
| `.mbox` | Gmail export | Mail.mbox |
| `.json` | Google Takeout | transactions.json |
| `.csv` | Bank exports | statements.csv |
| `.xml` | SMS Backup apps | sms_backup.xml |
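
Step 1 can dispatch on the file extension alone. The sketch below shows the idea; the parser names are placeholders, not the pipeline's actual function names:

```python
from pathlib import Path

# Hypothetical extension -> parser mapping; parser names are placeholders.
LOADERS = {
    ".mbox": "parse_mbox",  # Gmail export
    ".json": "parse_json",  # Google Takeout
    ".csv": "parse_csv",    # bank exports
    ".xml": "parse_xml",    # SMS backup apps
}

def pick_loader(path: str) -> str:
    """Map an input file to its parser by extension (case-insensitive)."""
    ext = Path(path).suffix.lower()
    if ext not in LOADERS:
        raise ValueError(f"unsupported input format: {ext}")
    return LOADERS[ext]
```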

## Output Schema

All data is standardized to:

| Column | Type | Description |
|--------|------|-------------|
| `timestamp` | string | When the message was received |
| `sender` | string | Bank/sender name |
| `body` | string | Message content |
| `source` | string | Original file source |
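
As a sketch of how Step 1's output could be produced (the helper names here are illustrative, not the script's actual API), every parsed message collapses into one row of this schema:

```python
import csv
import io

UNIFIED_COLUMNS = ["timestamp", "sender", "body", "source"]

def to_unified(timestamp, sender, body, source):
    """Coerce one parsed message into the unified schema (all strings)."""
    return {
        "timestamp": str(timestamp),
        "sender": str(sender).strip(),
        "body": str(body).strip(),
        "source": str(source),
    }

def write_unified(rows, fh):
    """Write unified rows as CSV with a fixed column order."""
    writer = csv.DictWriter(fh, fieldnames=UNIFIED_COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
```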

## Step 2 Filters

### Messages REMOVED (Garbage):
- OTPs / Verification codes
- Login alerts
- Marketing spam (% off, offers)
- Bill reminders (not transactions)
- Account statements
- Delivery notifications

### Messages KEPT (Transactions):
- Debit/Credit notifications
- UPI payments
- NEFT/IMPS transfers
- Amount + account references
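
A keyword pass over the message body is enough to sketch this split. The patterns below are illustrative only; the real rules live in `step2_filter.py`:

```python
import re

# Illustrative patterns; step2_filter.py's actual rules may differ.
GARBAGE_PATTERNS = [
    r"\botp\b", r"verification code", r"login alert",
    r"\d+% off", r"bill (?:due|reminder)", r"statement",
    r"\bdelivered\b",
]
KEEP_PATTERNS = [
    r"\bdebited\b", r"\bcredited\b", r"\bupi\b",
    r"\bneft\b", r"\bimps\b",
]

def classify(body: str) -> str:
    """Return 'garbage', 'keep', or 'uncertain' for one message body."""
    text = body.lower()
    if any(re.search(p, text) for p in GARBAGE_PATTERNS):
        return "garbage"
    if any(re.search(p, text) for p in KEEP_PATTERNS):
        return "keep"
    return "uncertain"  # routed to step2_uncertain.csv for review
```

Checking garbage first means a message that mentions both an OTP and a debit is dropped, which keeps obvious non-transactions out of the training set.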

## Step 3 Metrics

The baseline test measures:

| Metric | Description |
|--------|-------------|
| Extraction Success Rate | % of messages where amount + type extracted |
| Field Coverage | % for each field (amount, merchant, etc.) |
| Confidence Distribution | LOW / MEDIUM / HIGH breakdown |
| Top Merchants | Most common extracted merchants |
| Processing Speed | Messages per second |
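
Assuming Step 3 yields one dict of extracted fields per message (field names as in the table above), the first two metrics reduce to simple ratios:

```python
def extraction_success_rate(results):
    """% of messages where both amount and type were extracted."""
    if not results:
        return 0.0
    ok = sum(1 for r in results if r.get("amount") and r.get("type"))
    return 100.0 * ok / len(results)

def field_coverage(results, field):
    """% of messages with a non-empty value for one field."""
    if not results:
        return 0.0
    return 100.0 * sum(1 for r in results if r.get(field)) / len(results)
```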

## Directory Structure

```
data/
β”œβ”€β”€ raw/                    # Put your raw data here
β”‚   └── Takeout/
β”‚       └── ...
β”œβ”€β”€ pipeline/               # Pipeline outputs
β”‚   β”œβ”€β”€ step1_unified.csv
β”‚   β”œβ”€β”€ step2_training_ready.csv
β”‚   β”œβ”€β”€ step2_garbage.csv       (optional)
β”‚   β”œβ”€β”€ step2_uncertain.csv     (optional)
β”‚   β”œβ”€β”€ step3_baseline_results.csv
β”‚   └── step3_baseline_analysis.json
└── training/               # Final training data (after labeling)
```

## Next Steps

After Step 3:

1. Review low-confidence extractions
2. Add ground truth labels for training
3. Identify patterns that need new regex
4. Fine-tune the LLM on labeled data