{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 📄 Phase 2: Bank Statement Row Extraction\n",
    "\n",
    "This notebook guides you through Phase 2 of the model upgrade:\n",
    "- Extract text rows from bank statement PDFs\n",
    "- Label rows with entities\n",
    "- Generate synthetic variations\n",
    "- Prepare training data with the [BANK_STATEMENT] prefix\n",
    "\n",
    "## Goal\n",
    "**Model parses bank statement rows with high accuracy**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1: Setup & Check PDF Files\n",
    "\n",
    "Place your bank statements in: `data/raw/pdfs/statements/`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "import json\n",
    "\n",
    "# Setup directories\n",
    "PROJECT_ROOT = Path.cwd()\n",
    "PDF_DIR = PROJECT_ROOT / \"data/raw/pdfs/statements\"\n",
    "LABELING_DIR = PROJECT_ROOT / \"data/labeling\"\n",
    "TRAINING_DIR = PROJECT_ROOT / \"data/training\"\n",
    "\n",
    "# Create directories\n",
    "PDF_DIR.mkdir(parents=True, exist_ok=True)\n",
    "LABELING_DIR.mkdir(parents=True, exist_ok=True)\n",
    "TRAINING_DIR.mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "# Check for PDFs (the set avoids double counting on case-insensitive matching)\n",
    "pdfs = sorted(set(PDF_DIR.glob(\"*.pdf\")) | set(PDF_DIR.glob(\"*.PDF\")))\n",
    "print(f\"📂 Found {len(pdfs)} PDF files in {PDF_DIR}\")\n",
    "for pdf in pdfs:\n",
    "    print(f\"  • {pdf.name}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2: Extract Rows from PDFs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from src.data.statement_extractor import StatementRowExtractor\n",
    "\n",
    "extractor = StatementRowExtractor(debug=False)\n",
    "\n",
    "all_rows = []\n",
    "\n",
    "for pdf in pdfs:\n",
    "    print(f\"\\n📄 Processing: {pdf.name}\")\n",
    "    try:\n",
    "        rows, stats = extractor.extract_rows(pdf)\n",
    "        all_rows.extend(rows)\n",
    "        print(f\"  ✅ Extracted {stats.valid_rows} rows ({stats.bank.upper()})\")\n",
    "    except Exception as e:\n",
    "        print(f\"  ❌ Error: {e}\")\n",
    "\n",
    "print(f\"\\n📊 Total rows extracted: {len(all_rows)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3: Preview Extracted Rows"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Convert to DataFrame for easy viewing\n",
    "df = pd.DataFrame([r.to_dict() for r in all_rows])\n",
    "\n",
    "# Display columns\n",
    "display_cols = ['date', 'description', 'debit', 'credit', 'balance', 'bank']\n",
    "cols = [c for c in display_cols if c in df.columns]\n",
    "\n",
    "print(\"📋 Sample rows (first 10):\")\n",
    "df[cols].head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4: Export for Manual Labeling\n",
    "\n",
    "Export rows to JSON for manual entity labeling."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "output_file = LABELING_DIR / \"statement_rows_unlabeled.json\"\n",
    "extractor.export_for_labeling(all_rows, output_file)\n",
    "\n",
    "print(f\"✅ Exported {len(all_rows)} rows to:\")\n",
    "print(f\"  {output_file}\")\n",
    "print(\"\\n📝 Next: Open the JSON file and add 'entities' to each row\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 5: Manual Labeling Guide\n",
    "\n",
    "For each row, set `labeled` to `true` and add an `entities` field like this:\n",
    "\n",
    "```json\n",
    "{\n",
    "  \"raw_text\": \"01-12-2025 | UPI-SWIGGY@ybl | 250.00 | | 45,230.50\",\n",
    "  \"labeled\": true,\n",
    "  \"entities\": {\n",
    "    \"date\": \"01-12-2025\",\n",
    "    \"description\": \"UPI-SWIGGY@ybl\",\n",
    "    \"amount\": \"250.00\",\n",
    "    \"type\": \"debit\",\n",
    "    \"balance\": \"45,230.50\",\n",
    "    \"merchant\": \"swiggy\",\n",
    "    \"category\": \"food\"\n",
    "  }\n",
    "}\n",
    "```\n",
    "\n",
    "**Target: Label 500+ rows**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 6: Load Labeled Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# After manual labeling, load the data\n",
    "labeled_file = LABELING_DIR / \"statement_rows_labeled.json\"\n",
    "\n",
    "if labeled_file.exists():\n",
    "    labeled_rows = extractor.load_labeled_data(labeled_file)\n",
    "    labeled_count = sum(1 for r in labeled_rows if r.labeled)\n",
    "    print(f\"📊 Loaded {len(labeled_rows)} rows ({labeled_count} labeled)\")\n",
    "else:\n",
    "    print(f\"⚠️ Labeled file not found: {labeled_file}\")\n",
    "    print(\"  Rename your labeled file to: statement_rows_labeled.json\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 7: Generate Synthetic Variations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from src.data.statement_extractor import StatementSyntheticGenerator\n",
    "\n",
    "generator = StatementSyntheticGenerator(seed=42)\n",
    "\n",
    "# Generate variations from labeled data\n",
    "if 'labeled_rows' in dir() and labeled_rows:\n",
    "    # Filter to labeled only\n",
    "    base_rows = [r for r in labeled_rows if r.labeled]\n",
    "\n",
    "    # Generate 5 variations per row\n",
    "    synthetic_rows = generator.generate_variations(\n",
    "        base_rows,\n",
    "        variations_per_row=5,\n",
    "        total_limit=2000\n",
    "    )\n",
    "\n",
    "    print(f\"✅ Generated {len(synthetic_rows)} synthetic variations\")\n",
    "else:\n",
    "    print(\"⚠️ Load labeled data first (Step 6)\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 8: Export Training Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from src.data.statement_extractor import export_training_data\n",
    "\n",
    "if 'synthetic_rows' in dir() and synthetic_rows:\n",
    "    # Combine labeled + synthetic\n",
    "    all_training = base_rows + synthetic_rows\n",
    "\n",
    "    # Export\n",
    "    train_file, valid_file = export_training_data(\n",
    "        all_training,\n",
    "        TRAINING_DIR / \"statement\"\n",
    "    )\n",
    "\n",
    "    print(\"✅ Training files created:\")\n",
    "    print(f\"  Train: {train_file}\")\n",
    "    print(f\"  Valid: {valid_file}\")\n",
    "else:\n",
    "    print(\"⚠️ Generate synthetic data first (Step 7)\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 9: Combine with Phase 1 Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "\n",
    "# Combine Phase 1 and Phase 2 training data\n",
    "phase1_train = TRAINING_DIR / \"train.jsonl\"\n",
    "phase2_train = TRAINING_DIR / \"statement_train.jsonl\"\n",
    "combined_train = TRAINING_DIR / \"combined_train.jsonl\"\n",
    "\n",
    "if phase1_train.exists() and phase2_train.exists():\n",
    "    # Read both files\n",
    "    with open(phase1_train) as f:\n",
    "        p1_data = f.readlines()\n",
    "    with open(phase2_train) as f:\n",
    "        p2_data = f.readlines()\n",
    "\n",
    "    # Combine, make sure every record ends with a newline so that\n",
    "    # shuffling cannot glue two records together, then shuffle\n",
    "    combined = [line if line.endswith(\"\\n\") else line + \"\\n\" for line in p1_data + p2_data]\n",
    "    random.shuffle(combined)\n",
    "\n",
    "    with open(combined_train, 'w') as f:\n",
    "        f.writelines(combined)\n",
    "\n",
    "    print(\"✅ Combined training data:\")\n",
    "    print(f\"  Phase 1: {len(p1_data)} samples\")\n",
    "    print(f\"  Phase 2: {len(p2_data)} samples\")\n",
    "    print(f\"  Total: {len(combined)} samples\")\n",
    "else:\n",
    "    print(\"⚠️ Missing files (True = found):\")\n",
    "    print(f\"  Phase 1: {phase1_train.exists()}\")\n",
    "    print(f\"  Phase 2: {phase2_train.exists()}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 10: Retrain Model\n",
    "\n",
    "Run in terminal (not notebook):\n",
    "\n",
    "```bash\n",
    "cd ~/llm-mail-trainer\n",
    "source venv/bin/activate\n",
    "\n",
    "mlx_lm.lora \\\n",
    "  --model models/base/phi3-mini \\\n",
    "  --data data/training \\\n",
    "  --train \\\n",
    "  --batch-size 1 \\\n",
    "  --lora-layers 8 \\\n",
    "  --iters 800 \\\n",
    "  --adapter-path models/adapters/finance-lora-v4\n",
    "```\n",
    "\n",
    "Note: `mlx_lm.lora` reads `train.jsonl` and `valid.jsonl` from the `--data` directory, so `data/training` as written trains on the Phase 1 files. To train on the combined set, place `combined_train.jsonl` and the corresponding validation file in a directory under those two names and pass that directory to `--data`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 11: Evaluate on Statement Rows"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from src.inference.predict import Predictor\n",
    "\n",
    "# Load new model\n",
    "predictor = Predictor(\n",
    "    model_path=\"models/base/phi3-mini\",\n",
    "    adapter_path=\"models/adapters/finance-lora-v4\"\n",
    ")\n",
    "\n",
    "# Test on statement row\n",
    "test_row = \"01-12-2025 | UPI-SWIGGY@ybl | 250.00 | | 45,230.50\"\n",
    "prompt = f\"[BANK_STATEMENT] Extract financial entities from this bank statement row:\\n\\n{test_row}\"\n",
    "\n",
    "result = predictor.predict(email_text=prompt)\n",
    "print(f\"📋 Input: {test_row}\")\n",
    "print(f\"\\n🎯 Extracted:\")\n",
    "print(result.to_json())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## ✅ Phase 2 Checklist\n",
    "\n",
    "- [ ] Collect bank statements (3-6 months)\n",
    "- [ ] Extract text rows using pdfplumber\n",
    "- [ ] Manually label 500+ rows\n",
    "- [ ] Generate synthetic variations\n",
    "- [ ] Add [BANK_STATEMENT] prefix to training\n",
    "- [ ] Retrain model\n",
    "- [ ] Test accuracy\n",
    "\n",
    "**Deliverable: Model parses bank statement rows**"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.9.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}