
FinEE v2.0 - Upgrade Roadmap

From Extraction Engine to Intelligent Financial Agent

Current vs Target Comparison

| Dimension | Current State | Target State | Priority |
|---|---|---|---|
| Base Model | Phi-3 Mini (3.8B) | Llama 3.1 8B / Qwen2.5 7B | P0 |
| Training Data | 456 samples | 100K+ distilled samples | P0 |
| Output Format | Token extraction | Instruction-following JSON | P0 |
| Context | None | RAG + Knowledge Graph | P1 |
| Interaction | Single-turn | Multi-turn agent | P1 |
| Input Types | Email only | SMS + Email + PDF + Images | P1 |
| Accuracy | ~70% (estimated) | 95%+ (measured) | P0 |

Phase 1: Foundation (Week 1-2)

1.1 Model Upgrade

  • Download Llama 3.1 8B Instruct
  • Download Qwen2.5 7B Instruct (backup)
  • Benchmark both on finance extraction task
  • Set up quantization pipeline (4-bit, 8-bit)
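To make the quantization step concrete, here is a minimal absmax int8 quantize/dequantize sketch in plain Python. It only illustrates what 8-bit quantization does to a weight vector; the actual pipeline would use bitsandbytes or GGUF tooling, and the function names here are illustrative:

```python
def quantize_absmax_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map float weights onto int8 range [-127, 127] via absmax scaling."""
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]

q, s = quantize_absmax_int8([0.4, -1.27, 0.02])
approx = dequantize(q, s)  # close to the originals, within one scale step
```

4-bit schemes like NF4 follow the same idea with a coarser, non-uniform grid, trading a little reconstruction error for a 4x memory reduction versus fp16.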

1.2 Training Data Expansion

  • Generate 100K synthetic samples (DONE ✅)
  • Distill from GPT-4/Claude for complex cases
  • Add real user data (2,419 SMS samples ✅)
  • Create validation set (10K samples)
  • Create test set (5K unseen samples)

1.3 Instruction Format

{
  "system": "You are a financial entity extractor...",
  "instruction": "Extract entities from this message",
  "input": "<bank SMS or email>",
  "output": {
    "amount": 2500.00,
    "type": "debit",
    "merchant": "Swiggy",
    "category": "food",
    "date": "2026-01-12",
    "reference": "123456789012"
  }
}
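A small helper can wrap a raw message and its gold entities into this record shape; a sketch (the `SYSTEM` string and helper name are illustrative, and the gold entities come from whatever labeling pipeline produces them):

```python
import json

SYSTEM = "You are a financial entity extractor..."

def to_instruction_record(message: str, entities: dict) -> dict:
    """Wrap a raw SMS/email and its gold entities in the SFT record format."""
    return {
        "system": SYSTEM,
        "instruction": "Extract entities from this message",
        "input": message,
        "output": entities,
    }

record = to_instruction_record(
    "INR 2500.00 debited for Swiggy on 12-01-26 Ref 123456789012",
    {"amount": 2500.00, "type": "debit", "merchant": "Swiggy",
     "category": "food", "date": "2026-01-12", "reference": "123456789012"},
)
line = json.dumps(record)  # one JSONL line of training data
```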

Phase 2: Multi-Modal Support (Week 3-4)

2.1 Input Types

  • SMS Parser (DONE ✅)
  • Email Parser (DONE ✅)
  • PDF Statement Parser
    • Use pdfplumber for text extraction
    • Table detection with camelot
    • OCR fallback with pytesseract
  • Image/Screenshot Parser
    • OCR with EasyOCR or PaddleOCR
    • Vision model for structured extraction

2.2 Bank Statement Processing

PDF Input → Text Extraction → Table Detection →
Row Parsing → Entity Extraction → Transaction List
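A sketch of the row-parsing step, assuming the table detector yields rows of (date, description, debit, credit) strings; real statements vary by bank, so the column order here is an assumption:

```python
def parse_statement_row(row: list[str]) -> dict:
    """Turn one detected statement table row into a transaction dict.
    Assumed column order: date, description, debit amount, credit amount."""
    date, description, debit, credit = row
    amount = debit or credit  # exactly one column is populated per row
    return {
        "date": date,
        "description": description.strip(),
        "amount": float(amount.replace(",", "")),
        "type": "debit" if debit else "credit",
    }

txn = parse_statement_row(["2026-01-12", "UPI/SWIGGY ", "2,500.00", ""])
```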

2.3 Image Processing Pipeline

Image → OCR → Text Blocks → Layout Analysis →
Entity Extraction → Structured Output

Phase 3: RAG + Knowledge Graph (Week 5-6)

3.1 Knowledge Base

  • Merchant database (10K+ Indian merchants)
  • Bank template patterns
  • Category taxonomy
  • UPI VPA mappings

3.2 RAG Architecture

Query → Retrieve Similar Transactions →
Augment Context → Generate Extraction

3.3 Knowledge Graph

[Merchant: Swiggy] --is_a--> [Category: Food Delivery]
                   --accepts--> [Payment: UPI, Card]
                   --typical_amount--> [Range: 100-2000]
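The edges above are just (subject, relation, object) triples; a minimal in-memory sketch of how the extractor might query them (a production system would use a graph database, and the triple values here mirror the example above):

```python
# Triples as (subject, relation, object) tuples.
TRIPLES = [
    ("Swiggy", "is_a", "Food Delivery"),
    ("Swiggy", "accepts", "UPI"),
    ("Swiggy", "accepts", "Card"),
    ("Swiggy", "typical_amount", (100, 2000)),
]

def lookup(subject: str, relation: str) -> list:
    """Return all objects linked to a subject by the given relation."""
    return [o for s, r, o in TRIPLES if s == subject and r == relation]

lookup("Swiggy", "is_a")  # ["Food Delivery"]
```

A `typical_amount` lookup like this is what lets the agent flag a ₹50,000 Swiggy transaction as anomalous.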

3.4 Vector Store

  • Use Qdrant/ChromaDB for transaction embeddings
  • Enable semantic search for similar transactions
  • Support for "transactions like this" queries
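A "transactions like this" query reduces to nearest-neighbor search over embeddings; a toy pure-Python sketch using bag-of-words vectors and cosine similarity (Qdrant/ChromaDB would replace both the embedding and the linear scan):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; production would use a sentence encoder."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def similar_transactions(query: str, history: list[str], k: int = 2) -> list[str]:
    """Rank past transaction descriptions by similarity to the query."""
    q = embed(query)
    return sorted(history, key=lambda t: cosine(q, embed(t)), reverse=True)[:k]

history = ["Swiggy food order UPI", "Uber ride card", "Zomato food order UPI"]
similar_transactions("food order", history, k=2)
```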

Phase 4: Multi-Turn Agent (Week 7-8)

4.1 Agent Capabilities

class FinancialAgent:
    def extract_entities(self, message) -> dict: ...
    def categorize_spending(self, transactions) -> dict: ...
    def detect_anomalies(self, transactions) -> list: ...
    def generate_report(self, period) -> str: ...
    def answer_question(self, question, context) -> str: ...

4.2 Conversation Flow

User: "How much did I spend on food last month?"
Agent: [Retrieves transactions] → [Filters by category] →
       [Aggregates amounts] → "You spent ₹12,450 on food"

User: "Compare with previous month"
Agent: [Uses conversation context] → [Retrieves both months] →
       "December: ₹12,450, November: ₹9,800 (+27%)"
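The first turn above reduces to a filter-and-sum over retrieved transactions; a sketch, assuming transactions carry the fields from the extraction JSON and dates are ISO-formatted:

```python
def spend_by_category(transactions: list[dict], category: str, month: str) -> float:
    """Sum debit amounts for one category in one month ('YYYY-MM')."""
    return sum(
        t["amount"]
        for t in transactions
        if t["type"] == "debit"
        and t["category"] == category
        and t["date"].startswith(month)
    )

txns = [
    {"amount": 2500.0, "type": "debit", "category": "food", "date": "2025-12-05"},
    {"amount": 9950.0, "type": "debit", "category": "food", "date": "2025-12-18"},
    {"amount": 800.0, "type": "debit", "category": "travel", "date": "2025-12-20"},
]
spend_by_category(txns, "food", "2025-12")  # 12450.0
```

The "compare with previous month" turn is the same call with a different `month` argument, with the category carried over from conversation context.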

4.3 Tool Use

  • Calculator for aggregations
  • Date parser for time queries
  • Budget tracker integration
  • Export to CSV/Excel

Phase 5: Production Deployment (Week 9-10)

5.1 Model Optimization

  • GGUF quantization for llama.cpp
  • ONNX export for faster inference
  • vLLM for batch processing
  • MLX optimization for Apple Silicon

5.2 API Design

# FastAPI endpoints
POST /extract          # Single message extraction
POST /extract/batch    # Batch extraction
POST /parse/pdf        # PDF statement parsing
POST /parse/image      # Image OCR + extraction
POST /chat             # Multi-turn agent
GET  /analytics        # Spending analytics
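The `POST /extract` handler can be prototyped as a pure function before any FastAPI wiring; a hedged regex-only fallback sketch (in production the fine-tuned model produces the full JSON, and this function name is illustrative):

```python
import re

def extract_handler(message: str) -> dict:
    """Fallback extraction for POST /extract using regex heuristics.
    Covers only amount and type, so the endpoint shape can be exercised
    before the model is in place."""
    amount = re.search(r"(?:INR|Rs\.?)\s*([\d,]+\.?\d*)", message)
    txn_type = "debit" if re.search(r"debit", message, re.I) else "credit"
    return {
        "amount": float(amount.group(1).replace(",", "")) if amount else None,
        "type": txn_type,
    }

extract_handler("INR 2,500.00 debited from A/c XX1234")
```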

5.3 Deployment Options

  • Docker container
  • Hugging Face Spaces (demo)
  • Modal/Replicate (serverless)
  • Self-hosted with vLLM

Technical Architecture

┌────────────────────────────────────────────────────────┐
│                    FinEE v2.0 Agent                    │
├────────────────────────────────────────────────────────┤
│  ┌────────┐   ┌────────┐   ┌────────┐   ┌────────┐     │
│  │  SMS   │   │ Email  │   │  PDF   │   │ Image  │     │
│  │ Parser │   │ Parser │   │ Parser │   │  OCR   │     │
│  └───┬────┘   └───┬────┘   └───┬────┘   └───┬────┘     │
│      └────────────┴──────┬─────┴────────────┘          │
│                          ▼                             │
│                   ┌──────────────┐                     │
│                   │ Preprocessor │                     │
│                   └──────┬───────┘                     │
│                          ▼                             │
│  ┌──────────────────────────────────────────────────┐  │
│  │                   RAG Pipeline                   │  │
│  │  ┌────────┐   ┌───────────┐   ┌────────────┐     │  │
│  │  │ Vector │   │ Knowledge │   │  Merchant  │     │  │
│  │  │ Store  │   │   Graph   │   │  Database  │     │  │
│  │  └───┬────┘   └─────┬─────┘   └──────┬─────┘     │  │
│  │      └──────────────┼────────────────┘           │  │
│  └─────────────────────┼──────────────────────────────┘
│                        ▼                               │
│            ┌───────────────────────┐                   │
│            │  Llama 3.1 8B / Qwen  │                   │
│            │   Instruction-Tuned   │                   │
│            └───────────┬───────────┘                   │
│                        ▼                               │
│            ┌───────────────────────┐                   │
│            │      JSON Output      │                   │
│            │  + Confidence Score   │                   │
│            └───────────────────────┘                   │
└────────────────────────────────────────────────────────┘

Model Selection Analysis

| Model | Size | Speed | Quality | License | Choice |
|---|---|---|---|---|---|
| Llama 3.1 8B | 8B | Fast | Excellent | Meta | ⭐ Primary |
| Qwen2.5 7B | 7B | Fast | Excellent | Apache | ⭐ Backup |
| Mistral 7B | 7B | Fast | Good | Apache | Alternative |
| Phi-3 Medium | 14B | Medium | Excellent | MIT | Future |

Why Llama 3.1 8B?

  1. Instruction following - Best in class for its size
  2. Structured output - Reliable JSON generation
  3. Context length - 128K tokens (future RAG)
  4. Quantization - Excellent 4-bit performance
  5. Ecosystem - Wide support (vLLM, llama.cpp, MLX)

Training Strategy

Stage 1: Supervised Fine-tuning (SFT)

Base: Llama 3.1 8B Instruct
Data: 100K synthetic + 2.4K real
Method: LoRA (rank=16, alpha=32)
Epochs: 3

Stage 2: DPO (Direct Preference Optimization)

Create preference pairs:
- Chosen: Correct extraction with confidence
- Rejected: Partial/incorrect extraction
Objective: Improve extraction precision
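DPO preference pairs are just (prompt, chosen, rejected) records; a sketch of how they might be assembled, assuming a gold-label source and a pool of flawed model outputs to reject (the field names follow common DPO trainer conventions):

```python
import json

def make_preference_pair(message: str, gold: dict, bad: dict) -> dict:
    """One DPO record: full correct extraction vs a partial/incorrect one."""
    return {
        "prompt": f"Extract entities from this message:\n{message}",
        "chosen": json.dumps(gold),
        "rejected": json.dumps(bad),
    }

pair = make_preference_pair(
    "INR 2500.00 debited for Swiggy",
    {"amount": 2500.0, "type": "debit", "merchant": "Swiggy"},
    {"amount": 2500.0, "type": "credit", "merchant": None},  # wrong type, missing merchant
)
```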

Stage 3: RLHF (Optional)

Reward model based on:
- JSON validity
- Field accuracy
- Merchant identification
- Category correctness
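The four criteria above can be combined into a simple rule-based reward; a toy sketch (the weighting is an assumption, and a real reward model would be learned rather than hand-written):

```python
import json

def reward(output: str, gold: dict) -> float:
    """Toy reward: JSON validity plus per-field agreement with gold labels."""
    try:
        pred = json.loads(output)
    except json.JSONDecodeError:
        return 0.0   # invalid JSON earns nothing
    score = 1.0      # base reward for valid JSON
    for field in ("amount", "type", "merchant", "category"):
        if pred.get(field) == gold.get(field):
            score += 1.0
    return score / 5.0  # normalize to [0, 1]

gold = {"amount": 2500.0, "type": "debit", "merchant": "Swiggy", "category": "food"}
```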

Metrics & Benchmarks

Extraction Accuracy

  • Amount: Target 99%+
  • Type (debit/credit): Target 98%+
  • Merchant: Target 90%+
  • Category: Target 85%+
  • Reference: Target 95%+
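Measuring these targets on the test set is a per-field exact-match computation; a minimal sketch over predicted and gold entity dicts:

```python
def field_accuracy(preds: list[dict], golds: list[dict], field: str) -> float:
    """Fraction of samples where the predicted field equals the gold label."""
    hits = sum(p.get(field) == g.get(field) for p, g in zip(preds, golds))
    return hits / len(golds)

preds = [{"amount": 2500.0}, {"amount": 100.0}]
golds = [{"amount": 2500.0}, {"amount": 150.0}]
field_accuracy(preds, golds, "amount")  # 0.5
```

Fields like merchant may eventually want fuzzy matching (e.g. "SWIGGY*ORDER" vs "Swiggy"); exact match is the stricter baseline.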

System Metrics

  • Latency: <100ms per extraction
  • Throughput: >100 msgs/sec
  • Memory: <8GB (quantized)

Next Steps (Immediate)

  1. Download Llama 3.1 8B Instruct
  2. Create instruction-format training data
  3. Set up LoRA fine-tuning pipeline
  4. Run first training experiment
  5. Benchmark against current Phi-3 model