# Data Pipeline

A 3-step pipeline to prepare training data for FinEE model fine-tuning.

## Overview

```
Raw Data (XML/JSON/CSV/MBOX)
            │
            ▼
┌────────────────────────┐
│  Step 1: Unify         │ → step1_unified.csv
│  Standardize formats   │   (all records, single schema)
└────────────────────────┘
            │
            ▼
┌────────────────────────┐
│  Step 2: Filter        │ → step2_training_ready.csv
│  Remove garbage        │   (clean transactions only)
└────────────────────────┘
            │
            ▼
┌────────────────────────┐
│  Step 3: Baseline      │ → step3_baseline_results.csv
│  Test current accuracy │   (extracted fields + metrics)
└────────────────────────┘
```

## Quick Start

### 1. Copy your data to the workspace

```bash
# Example: Google Takeout export
cp -r ~/Downloads/Takeout /Users/ranjit/llm-mail-trainer/data/raw/

# Or SMS backup
cp sms_backup.xml /Users/ranjit/llm-mail-trainer/data/raw/
```

### 2. Run the pipeline

```bash
cd /Users/ranjit/llm-mail-trainer

# Step 1: Unify all data formats
python scripts/data_pipeline/step1_unify.py --input data/raw/ --output data/pipeline/step1_unified.csv

# Step 2: Filter garbage (OTPs, spam, etc.)
python scripts/data_pipeline/step2_filter.py --input data/pipeline/step1_unified.csv

# Step 3: Test baseline accuracy
python scripts/data_pipeline/step3_baseline.py
```

## Supported Input Formats

| Format | Source | Example |
|--------|--------|---------|
| `.mbox` | Gmail export | Mail.mbox |
| `.json` | Google Takeout | transactions.json |
| `.csv` | Bank exports | statements.csv |
| `.xml` | SMS Backup apps | sms_backup.xml |
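
As a rough illustration of how the raw directory could be scanned for these formats, the sketch below walks a folder and tags each supported file with the source named in the table. The function name `discover_inputs` and the `SUPPORTED` mapping are hypothetical and are not taken from `step1_unify.py`.

```python
from pathlib import Path

# Extension -> source label, copied from the table above.
# This mapping is illustrative; the real script may organize it differently.
SUPPORTED = {
    ".mbox": "Gmail export",
    ".json": "Google Takeout",
    ".csv": "Bank exports",
    ".xml": "SMS Backup apps",
}


def discover_inputs(raw_dir):
    """Return (path, source_label) pairs for every supported file under raw_dir."""
    found = []
    for path in sorted(Path(raw_dir).rglob("*")):
        label = SUPPORTED.get(path.suffix.lower())
        if path.is_file() and label is not None:
            found.append((path, label))
    return found


if __name__ == "__main__":
    for path, label in discover_inputs("data/raw"):
        print(f"{label:16} {path}")
```

Matching on the lowercased suffix means a file such as `Mail.MBOX` is still picked up, while anything outside the four listed formats is skipped.
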
## Output Schema

All data is standardized to:

| Column | Type | Description |
|--------|------|-------------|
| `timestamp` | string | When message was received |
| `sender` | string | Bank/sender name |
| `body` | string | Message content |
| `source` | string | Original file source |
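
To make the schema concrete, here is a minimal sketch of normalizing one record into these four columns and writing it as CSV. It assumes the raw timestamp arrives as epoch milliseconds and is stored as an ISO 8601 UTC string; neither detail is specified above, and `to_unified_row` and `write_rows` are hypothetical helper names.

```python
import csv
import io
from datetime import datetime, timezone

# Column order taken from the schema table above.
FIELDS = ["timestamp", "sender", "body", "source"]


def to_unified_row(epoch_ms, sender, body, source):
    """Build one unified row. Assumes epoch_ms is milliseconds since the Unix epoch."""
    ts = datetime.fromtimestamp(int(epoch_ms) / 1000, tz=timezone.utc).isoformat()
    return {
        "timestamp": ts,
        "sender": sender.strip(),
        "body": " ".join(body.split()),  # collapse newlines and repeated spaces
        "source": source,
    }


def write_rows(rows, handle):
    """Write unified rows with a header line to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    buffer = io.StringIO()
    row = to_unified_row(0, " HDFCBK ", "Rs 500\n debited", "sms_backup.xml")
    write_rows([row], buffer)
    print(buffer.getvalue())
```

Keeping every column a string, as the table specifies, lets later steps read the file without format-specific parsing.
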
## Step 2 Filters

### Messages REMOVED (Garbage):

- OTPs / Verification codes
- Login alerts
- Marketing spam (% off, offers)
- Bill reminders (not transactions)
- Account statements
- Delivery notifications

### Messages KEPT (Transactions):

- Debit/Credit notifications
- UPI payments
- NEFT/IMPS transfers
- Amount + account references
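
The two lists above can be approximated with a small keyword classifier. This is only a sketch: the regular expressions and the `classify` function are assumptions made for illustration, not the rules used by `step2_filter.py`. Messages matching neither list fall into an "uncertain" bucket.

```python
import re

# Illustrative patterns for the REMOVED list; the real filter may differ.
GARBAGE_PATTERNS = [
    r"\botp\b",
    r"verification code",
    r"\blog(?:ged)?[\s-]?in\b",
    r"\d+\s*% off",
    r"\boffers?\b",
    r"\bbill\b.*\bdue\b",
    r"\bstatement\b",
    r"\bdeliver(?:y|ed)\b",
]

# Illustrative patterns for the KEPT list: a transaction keyword plus an amount.
TRANSACTION_PATTERN = r"\b(?:debited|credited|upi|neft|imps)\b"
AMOUNT_PATTERN = r"(?:rs\.?|inr|₹)\s*[\d,]+(?:\.\d+)?"


def classify(body):
    """Return 'garbage', 'transaction', or 'uncertain' for one message body."""
    text = body.lower()
    # Garbage is checked first so an OTP that mentions an amount is still dropped.
    if any(re.search(pattern, text) for pattern in GARBAGE_PATTERNS):
        return "garbage"
    if re.search(TRANSACTION_PATTERN, text) and re.search(AMOUNT_PATTERN, text):
        return "transaction"
    return "uncertain"


if __name__ == "__main__":
    print(classify("Rs.500.00 debited from A/c XX1234 via UPI"))
    print(classify("Your OTP is 482913. Do not share."))
```

Checking the removal patterns before the keep patterns is a deliberate choice here: it favors a cleaner training set over recall.
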
## Step 3 Metrics

The baseline test measures:

| Metric | Description |
|--------|-------------|
| Extraction Success Rate | % of messages where both amount and type were extracted |
| Field Coverage | % of messages with each field extracted (amount, merchant, etc.) |
| Confidence Distribution | LOW / MEDIUM / HIGH breakdown |
| Top Merchants | Most common extracted merchants |
| Processing Speed | Messages per second |
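
A minimal sketch of how the first four metrics could be computed from extraction results follows. It assumes each result is a dict with `amount`, `type`, `merchant`, and `confidence` keys; that layout and the name `baseline_metrics` are hypothetical, and processing speed is left out because it depends on timing the extractor itself.

```python
from collections import Counter


def _present(value):
    """A field counts as extracted when it is neither None nor an empty string."""
    return value not in (None, "")


def baseline_metrics(results, fields=("amount", "type", "merchant"), top_n=3):
    """Summarize extraction results given as a list of dicts (assumed layout)."""
    total = len(results)
    if total == 0:
        return {
            "extraction_success_rate": 0.0,
            "field_coverage": {field: 0.0 for field in fields},
            "confidence_distribution": {},
            "top_merchants": [],
        }
    # Success requires both amount and type, as described in the table above.
    success = sum(
        1 for r in results if _present(r.get("amount")) and _present(r.get("type"))
    )
    coverage = {
        field: sum(1 for r in results if _present(r.get(field))) / total
        for field in fields
    }
    confidence = Counter(r.get("confidence", "LOW") for r in results)
    merchants = Counter(r["merchant"] for r in results if _present(r.get("merchant")))
    return {
        "extraction_success_rate": success / total,
        "field_coverage": coverage,
        "confidence_distribution": dict(confidence),
        "top_merchants": merchants.most_common(top_n),
    }


if __name__ == "__main__":
    sample = [
        {"amount": 500.0, "type": "debit", "merchant": "Amazon", "confidence": "HIGH"},
        {"amount": None, "type": None, "merchant": None, "confidence": "LOW"},
    ]
    print(baseline_metrics(sample))
```
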

## Directory Structure

```
data/
├── raw/                    # Put your raw data here
│   ├── Takeout/
│   └── ...
├── pipeline/               # Pipeline outputs
│   ├── step1_unified.csv
│   ├── step2_training_ready.csv
│   ├── step2_garbage.csv (optional)
│   ├── step2_uncertain.csv (optional)
│   ├── step3_baseline_results.csv
│   └── step3_baseline_analysis.json
└── training/               # Final training data (after labeling)
```

## Next Steps

After Step 3:

1. Review low-confidence extractions
2. Add ground truth labels for training
3. Identify patterns that need new regex
4. Fine-tune the LLM on labeled data