# Data Pipeline
A 3-step pipeline that prepares training data for fine-tuning the FinEE model.
## Overview
```
Raw Data (XML/JSON/CSV/MBOX)
             │
             ▼
┌──────────────────────────┐
│ Step 1: Unify            │ → step1_unified.csv
│ Standardize formats      │   (all records, single schema)
└──────────────────────────┘
             │
             ▼
┌──────────────────────────┐
│ Step 2: Filter           │ → step2_training_ready.csv
│ Remove garbage           │   (clean transactions only)
└──────────────────────────┘
             │
             ▼
┌──────────────────────────┐
│ Step 3: Baseline         │ → step3_baseline_results.csv
│ Test current accuracy    │   (extracted fields + metrics)
└──────────────────────────┘
```
## Quick Start
### 1. Copy your data to the workspace
```bash
# Example: Google Takeout export
cp -r ~/Downloads/Takeout /Users/ranjit/llm-mail-trainer/data/raw/
# Or SMS backup
cp sms_backup.xml /Users/ranjit/llm-mail-trainer/data/raw/
```
### 2. Run the pipeline
```bash
cd /Users/ranjit/llm-mail-trainer
# Step 1: Unify all data formats
python scripts/data_pipeline/step1_unify.py --input data/raw/ --output data/pipeline/step1_unified.csv
# Step 2: Filter garbage (OTPs, spam, etc.)
python scripts/data_pipeline/step2_filter.py --input data/pipeline/step1_unified.csv
# Step 3: Test baseline accuracy
python scripts/data_pipeline/step3_baseline.py
```
## Supported Input Formats
| Format | Source | Example |
|--------|--------|---------|
| `.mbox` | Gmail export | Mail.mbox |
| `.json` | Google Takeout | transactions.json |
| `.csv` | Bank exports | statements.csv |
| `.xml` | SMS Backup apps | sms_backup.xml |
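
Step 1 presumably picks a parser per input file. A minimal dispatch-by-extension sketch (these helper names are illustrative, not the actual `step1_unify.py` internals):

```python
import csv
import json
from pathlib import Path

def parse_csv(path):
    """Bank statement exports: one record per row."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def parse_json(path):
    """Google Takeout dumps: a list of records (or a single object)."""
    with open(path) as f:
        data = json.load(f)
    return data if isinstance(data, list) else [data]

# .mbox and .xml would be handled analogously with the stdlib
# `mailbox` and `xml.etree.ElementTree` modules (omitted for brevity).
PARSERS = {".csv": parse_csv, ".json": parse_json}

def load_records(path):
    """Pick a parser by file extension; fail loudly on unknown formats."""
    ext = Path(path).suffix.lower()
    if ext not in PARSERS:
        raise ValueError(f"unsupported format: {ext}")
    return PARSERS[ext](path)
```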
## Output Schema
All data is standardized to:
| Column | Type | Description |
|--------|------|-------------|
| `timestamp` | string | When the message was received |
| `sender` | string | Bank/sender name |
| `body` | string | Message content |
| `source` | string | Original file source |
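
A sketch of mapping one source-specific record onto this schema. The fallback key names (`date`, `address`, `text`) are assumptions about raw input fields, not the real Step 1 mapping:

```python
UNIFIED_COLUMNS = ["timestamp", "sender", "body", "source"]

def to_unified(record: dict, source_file: str) -> dict:
    """Normalize a raw record into the four unified columns.
    The fallback keys below are illustrative, not the actual step1 logic."""
    return {
        "timestamp": record.get("timestamp") or record.get("date", ""),
        "sender": record.get("sender") or record.get("address", ""),
        "body": record.get("body") or record.get("text", ""),
        "source": source_file,
    }
```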
## Step 2 Filters
### Messages REMOVED (Garbage):
- OTPs / Verification codes
- Login alerts
- Marketing spam (% off, offers)
- Bill reminders (not transactions)
- Account statements
- Delivery notifications
### Messages KEPT (Transactions):
- Debit/Credit notifications
- UPI payments
- NEFT/IMPS transfers
- Amount + account references
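
The exact heuristics live in `step2_filter.py`; a minimal keyword-based sketch of the same idea (the pattern lists below are illustrative, not the script's actual rules), including the three-way split that explains the optional `step2_uncertain.csv` output:

```python
import re

# Hypothetical patterns for messages to drop (OTPs, spam, reminders).
GARBAGE_PATTERNS = [
    r"\botp\b", r"verification code", r"login alert",
    r"%\s*off", r"bill due", r"out for delivery",
]
# Hypothetical patterns for messages to keep (real transactions).
KEEP_PATTERNS = [
    r"\bdebited\b", r"\bcredited\b", r"\bupi\b",
    r"\bneft\b", r"\bimps\b", r"(?:rs\.?|inr)\s*[\d,]+",
]

def classify(body: str) -> str:
    """Return 'garbage', 'transaction', or 'uncertain' for one message."""
    text = body.lower()
    if any(re.search(p, text) for p in GARBAGE_PATTERNS):
        return "garbage"
    if any(re.search(p, text) for p in KEEP_PATTERNS):
        return "transaction"
    return "uncertain"
```

Garbage rules fire first, so an OTP that happens to mention an amount is still dropped.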
## Step 3 Metrics
The baseline test measures:
| Metric | Description |
|--------|-------------|
| Extraction Success Rate | % of messages where both amount and type were extracted |
| Field Coverage | % for each field (amount, merchant, etc.) |
| Confidence Distribution | LOW / MEDIUM / HIGH breakdown |
| Top Merchants | Most common extracted merchants |
| Processing Speed | Messages per second |
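
The first two metrics reduce to simple ratios over the extraction output. A sketch, assuming Step 3 emits one dict per message with extracted columns (the column names `amount` and `txn_type` are assumptions, not confirmed from `step3_baseline.py`):

```python
def extraction_success_rate(rows: list[dict]) -> float:
    """Fraction of rows where both 'amount' and 'txn_type' are non-empty.
    Column names are hypothetical placeholders for step3's output."""
    if not rows:
        return 0.0
    ok = sum(1 for r in rows if r.get("amount") and r.get("txn_type"))
    return ok / len(rows)

def field_coverage(rows: list[dict], field: str) -> float:
    """Fraction of rows with a non-empty value for one extracted field."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(field)) / len(rows)
```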
## Directory Structure
```
data/
├── raw/                      # Put your raw data here
│   ├── Takeout/
│   └── ...
├── pipeline/                 # Pipeline outputs
│   ├── step1_unified.csv
│   ├── step2_training_ready.csv
│   ├── step2_garbage.csv (optional)
│   ├── step2_uncertain.csv (optional)
│   ├── step3_baseline_results.csv
│   └── step3_baseline_analysis.json
└── training/                 # Final training data (after labeling)
```
## Next Steps
After Step 3:
1. Review low-confidence extractions
2. Add ground truth labels for training
3. Identify patterns that need new regex
4. Fine-tune the LLM on labeled data