
FinEE v2.0 - Upgrade Roadmap

From Extraction Engine to Intelligent Financial Agent

Current vs Target Comparison

| Dimension | Current State | Target State | Priority |
|---|---|---|---|
| Base Model | Phi-3 Mini (3.8B) | Llama 3.1 8B / Qwen2.5 7B | P0 |
| Training Data | 456 samples | 100K+ distilled samples | P0 |
| Output Format | Token extraction | Instruction-following JSON | P0 |
| Context | None | RAG + Knowledge Graph | P1 |
| Interaction | Single-turn | Multi-turn agent | P1 |
| Input Types | Email only | SMS + Email + PDF + Images | P1 |
| Accuracy | ~70% (estimated) | 95%+ (measured) | P0 |

Phase 1: Foundation (Week 1-2)

1.1 Model Upgrade

  • Download Llama 3.1 8B Instruct
  • Download Qwen2.5 7B Instruct (backup)
  • Benchmark both on finance extraction task
  • Set up quantization pipeline (4-bit, 8-bit)
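To make the quantization step concrete, here is a minimal absmax int8 quantize/dequantize sketch in plain Python. It only illustrates what 8-bit quantization does to a weight vector; the actual pipeline would use bitsandbytes or GGUF tooling, and the function names here are illustrative:

```python
def quantize_absmax_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map float weights onto int8 range [-127, 127] via absmax scaling."""
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]

q, s = quantize_absmax_int8([0.4, -1.27, 0.02])
approx = dequantize(q, s)  # close to the originals, within one scale step
```

4-bit schemes like NF4 follow the same idea with a coarser, non-uniform grid, trading a little reconstruction error for a 4x memory reduction versus fp16.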

1.2 Training Data Expansion

  • Generate 100K synthetic samples (DONE ✅)
  • Distill from GPT-4/Claude for complex cases
  • Add real user data (2,419 SMS samples ✅)
  • Create validation set (10K samples)
  • Create test set (5K unseen samples)

1.3 Instruction Format

{
  "system": "You are a financial entity extractor...",
  "instruction": "Extract entities from this message",
  "input": "<bank SMS or email>",
  "output": {
    "amount": 2500.00,
    "type": "debit",
    "merchant": "Swiggy",
    "category": "food",
    "date": "2026-01-12",
    "reference": "123456789012"
  }
}
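A small helper can wrap a raw message and its gold entities into this record shape; a sketch (the `SYSTEM` string and helper name are illustrative, and the gold entities come from whatever labeling pipeline produces them):

```python
import json

SYSTEM = "You are a financial entity extractor..."

def to_instruction_record(message: str, entities: dict) -> dict:
    """Wrap a raw SMS/email and its gold entities in the SFT record format."""
    return {
        "system": SYSTEM,
        "instruction": "Extract entities from this message",
        "input": message,
        "output": entities,
    }

record = to_instruction_record(
    "INR 2500.00 debited for Swiggy on 12-01-26 Ref 123456789012",
    {"amount": 2500.00, "type": "debit", "merchant": "Swiggy",
     "category": "food", "date": "2026-01-12", "reference": "123456789012"},
)
line = json.dumps(record)  # one JSONL line of training data
```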

Phase 2: Multi-Modal Support (Week 3-4)

2.1 Input Types

  • SMS Parser (DONE ✅)
  • Email Parser (DONE ✅)
  • PDF Statement Parser
    • Use pdfplumber for text extraction
    • Table detection with camelot
    • OCR fallback with pytesseract
  • Image/Screenshot Parser
    • OCR with EasyOCR or PaddleOCR
    • Vision model for structured extraction

2.2 Bank Statement Processing

PDF Input → Text Extraction → Table Detection →
Row Parsing → Entity Extraction → Transaction List
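A sketch of the row-parsing step, assuming the table detector yields rows of (date, description, debit, credit) strings; real statements vary by bank, so the column order here is an assumption:

```python
def parse_statement_row(row: list[str]) -> dict:
    """Turn one detected statement table row into a transaction dict.
    Assumed column order: date, description, debit amount, credit amount."""
    date, description, debit, credit = row
    amount = debit or credit  # exactly one column is populated per row
    return {
        "date": date,
        "description": description.strip(),
        "amount": float(amount.replace(",", "")),
        "type": "debit" if debit else "credit",
    }

txn = parse_statement_row(["2026-01-12", "UPI/SWIGGY ", "2,500.00", ""])
```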

2.3 Image Processing Pipeline

Image → OCR → Text Blocks → Layout Analysis →
Entity Extraction → Structured Output

Phase 3: RAG + Knowledge Graph (Week 5-6)

3.1 Knowledge Base

  • Merchant database (10K+ Indian merchants)
  • Bank template patterns
  • Category taxonomy
  • UPI VPA mappings

3.2 RAG Architecture

Query → Retrieve Similar Transactions →
Augment Context → Generate Extraction

3.3 Knowledge Graph

[Merchant: Swiggy] --is_a--> [Category: Food Delivery]
                   --accepts--> [Payment: UPI, Card]
                   --typical_amount--> [Range: 100-2000]
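The edges above are just (subject, relation, object) triples; a minimal in-memory sketch of how the extractor might query them (a production system would use a graph database, and the triple values here mirror the example above):

```python
# Triples as (subject, relation, object) tuples.
TRIPLES = [
    ("Swiggy", "is_a", "Food Delivery"),
    ("Swiggy", "accepts", "UPI"),
    ("Swiggy", "accepts", "Card"),
    ("Swiggy", "typical_amount", (100, 2000)),
]

def lookup(subject: str, relation: str) -> list:
    """Return all objects linked to a subject by the given relation."""
    return [o for s, r, o in TRIPLES if s == subject and r == relation]

lookup("Swiggy", "is_a")  # ["Food Delivery"]
```

A `typical_amount` lookup like this is what lets the agent flag a ₹50,000 Swiggy transaction as anomalous.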

3.4 Vector Store

  • Use Qdrant/ChromaDB for transaction embeddings
  • Enable semantic search for similar transactions
  • Support for "transactions like this" queries
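A "transactions like this" query reduces to nearest-neighbor search over embeddings; a toy pure-Python sketch using bag-of-words vectors and cosine similarity (Qdrant/ChromaDB would replace both the embedding and the linear scan):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; production would use a sentence encoder."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def similar_transactions(query: str, history: list[str], k: int = 2) -> list[str]:
    """Rank past transaction descriptions by similarity to the query."""
    q = embed(query)
    return sorted(history, key=lambda t: cosine(q, embed(t)), reverse=True)[:k]

history = ["Swiggy food order UPI", "Uber ride card", "Zomato food order UPI"]
similar_transactions("food order", history, k=2)
```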

Phase 4: Multi-Turn Agent (Week 7-8)

4.1 Agent Capabilities

class FinancialAgent:
    def extract_entities(self, message) -> dict: ...
    def categorize_spending(self, transactions) -> dict: ...
    def detect_anomalies(self, transactions) -> list: ...
    def generate_report(self, period) -> str: ...
    def answer_question(self, question, context) -> str: ...

4.2 Conversation Flow

User: "How much did I spend on food last month?"
Agent: [Retrieves transactions] → [Filters by category] →
       [Aggregates amounts] → "You spent ₹12,450 on food"

User: "Compare with previous month"
Agent: [Uses conversation context] → [Retrieves both months] →
       "December: ₹12,450, November: ₹9,800 (+27%)"
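The first turn above reduces to a filter-and-sum over retrieved transactions; a sketch, assuming transactions carry the fields from the extraction JSON and dates are ISO-formatted:

```python
def spend_by_category(transactions: list[dict], category: str, month: str) -> float:
    """Sum debit amounts for one category in one month ('YYYY-MM')."""
    return sum(
        t["amount"]
        for t in transactions
        if t["type"] == "debit"
        and t["category"] == category
        and t["date"].startswith(month)
    )

txns = [
    {"amount": 2500.0, "type": "debit", "category": "food", "date": "2025-12-05"},
    {"amount": 9950.0, "type": "debit", "category": "food", "date": "2025-12-18"},
    {"amount": 800.0, "type": "debit", "category": "travel", "date": "2025-12-20"},
]
spend_by_category(txns, "food", "2025-12")  # 12450.0
```

The "compare with previous month" turn is the same call with a different `month` argument, with the category carried over from conversation context.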

4.3 Tool Use

  • Calculator for aggregations
  • Date parser for time queries
  • Budget tracker integration
  • Export to CSV/Excel

Phase 5: Production Deployment (Week 9-10)

5.1 Model Optimization

  • GGUF quantization for llama.cpp
  • ONNX export for faster inference
  • vLLM for batch processing
  • MLX optimization for Apple Silicon

5.2 API Design

# FastAPI endpoints
POST /extract          # Single message extraction
POST /extract/batch    # Batch extraction
POST /parse/pdf        # PDF statement parsing
POST /parse/image      # Image OCR + extraction
POST /chat             # Multi-turn agent
GET  /analytics        # Spending analytics
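The `POST /extract` handler can be prototyped as a pure function before any FastAPI wiring; a hedged regex-only fallback sketch (in production the fine-tuned model produces the full JSON, and this function name is illustrative):

```python
import re

def extract_handler(message: str) -> dict:
    """Fallback extraction for POST /extract using regex heuristics.
    Covers only amount and type, so the endpoint shape can be exercised
    before the model is in place."""
    amount = re.search(r"(?:INR|Rs\.?)\s*([\d,]+\.?\d*)", message)
    txn_type = "debit" if re.search(r"debit", message, re.I) else "credit"
    return {
        "amount": float(amount.group(1).replace(",", "")) if amount else None,
        "type": txn_type,
    }

extract_handler("INR 2,500.00 debited from A/c XX1234")
```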

5.3 Deployment Options

  • Docker container
  • Hugging Face Spaces (demo)
  • Modal/Replicate (serverless)
  • Self-hosted with vLLM

Technical Architecture

┌────────────────────────────────────────────────────────┐
│                    FinEE v2.0 Agent                    │
├────────────────────────────────────────────────────────┤
│  ┌────────┐   ┌────────┐   ┌────────┐   ┌────────┐     │
│  │  SMS   │   │ Email  │   │  PDF   │   │ Image  │     │
│  │ Parser │   │ Parser │   │ Parser │   │  OCR   │     │
│  └───┬────┘   └───┬────┘   └───┬────┘   └───┬────┘     │
│      └────────────┴──────┬─────┴────────────┘          │
│                          ▼                             │
│                   ┌──────────────┐                     │
│                   │ Preprocessor │                     │
│                   └──────┬───────┘                     │
│                          ▼                             │
│  ┌──────────────────────────────────────────────────┐  │
│  │                   RAG Pipeline                   │  │
│  │  ┌────────┐   ┌───────────┐   ┌────────────┐     │  │
│  │  │ Vector │   │ Knowledge │   │  Merchant  │     │  │
│  │  │ Store  │   │   Graph   │   │  Database  │     │  │
│  │  └───┬────┘   └─────┬─────┘   └──────┬─────┘     │  │
│  │      └──────────────┼────────────────┘           │  │
│  └─────────────────────┼──────────────────────────────┘
│                        ▼                               │
│            ┌───────────────────────┐                   │
│            │  Llama 3.1 8B / Qwen  │                   │
│            │   Instruction-Tuned   │                   │
│            └───────────┬───────────┘                   │
│                        ▼                               │
│            ┌───────────────────────┐                   │
│            │      JSON Output      │                   │
│            │  + Confidence Score   │                   │
│            └───────────────────────┘                   │
└────────────────────────────────────────────────────────┘

Model Selection Analysis

| Model | Size | Speed | Quality | License | Choice |
|---|---|---|---|---|---|
| Llama 3.1 8B | 8B | Fast | Excellent | Meta | ⭐ Primary |
| Qwen2.5 7B | 7B | Fast | Excellent | Apache | ⭐ Backup |
| Mistral 7B | 7B | Fast | Good | Apache | Alternative |
| Phi-3 Medium | 14B | Medium | Excellent | MIT | Future |

Why Llama 3.1 8B?

  1. Instruction following - Best in class for its size
  2. Structured output - Reliable JSON generation
  3. Context length - 128K tokens (future RAG)
  4. Quantization - Excellent 4-bit performance
  5. Ecosystem - Wide support (vLLM, llama.cpp, MLX)

Training Strategy

Stage 1: Supervised Fine-tuning (SFT)

Base: Llama 3.1 8B Instruct
Data: 100K synthetic + 2.4K real
Method: LoRA (rank=16, alpha=32)
Epochs: 3

Stage 2: DPO (Direct Preference Optimization)

Create preference pairs:
- Chosen: Correct extraction with confidence
- Rejected: Partial/incorrect extraction
Objective: Improve extraction precision
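DPO preference pairs are just (prompt, chosen, rejected) records; a sketch of how they might be assembled, assuming a gold-label source and a pool of flawed model outputs to reject (the field names follow common DPO trainer conventions):

```python
import json

def make_preference_pair(message: str, gold: dict, bad: dict) -> dict:
    """One DPO record: full correct extraction vs a partial/incorrect one."""
    return {
        "prompt": f"Extract entities from this message:\n{message}",
        "chosen": json.dumps(gold),
        "rejected": json.dumps(bad),
    }

pair = make_preference_pair(
    "INR 2500.00 debited for Swiggy",
    {"amount": 2500.0, "type": "debit", "merchant": "Swiggy"},
    {"amount": 2500.0, "type": "credit", "merchant": None},  # wrong type, missing merchant
)
```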

Stage 3: RLHF (Optional)

Reward model based on:
- JSON validity
- Field accuracy
- Merchant identification
- Category correctness
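The four criteria above can be combined into a simple rule-based reward; a toy sketch (the weighting is an assumption, and a real reward model would be learned rather than hand-written):

```python
import json

def reward(output: str, gold: dict) -> float:
    """Toy reward: JSON validity plus per-field agreement with gold labels."""
    try:
        pred = json.loads(output)
    except json.JSONDecodeError:
        return 0.0   # invalid JSON earns nothing
    score = 1.0      # base reward for valid JSON
    for field in ("amount", "type", "merchant", "category"):
        if pred.get(field) == gold.get(field):
            score += 1.0
    return score / 5.0  # normalize to [0, 1]

gold = {"amount": 2500.0, "type": "debit", "merchant": "Swiggy", "category": "food"}
```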

Metrics & Benchmarks

Extraction Accuracy

  • Amount: Target 99%+
  • Type (debit/credit): Target 98%+
  • Merchant: Target 90%+
  • Category: Target 85%+
  • Reference: Target 95%+
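Measuring these targets on the test set is a per-field exact-match computation; a minimal sketch over predicted and gold entity dicts:

```python
def field_accuracy(preds: list[dict], golds: list[dict], field: str) -> float:
    """Fraction of samples where the predicted field equals the gold label."""
    hits = sum(p.get(field) == g.get(field) for p, g in zip(preds, golds))
    return hits / len(golds)

preds = [{"amount": 2500.0}, {"amount": 100.0}]
golds = [{"amount": 2500.0}, {"amount": 150.0}]
field_accuracy(preds, golds, "amount")  # 0.5
```

Fields like merchant may eventually want fuzzy matching (e.g. "SWIGGY*ORDER" vs "Swiggy"); exact match is the stricter baseline.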

System Metrics

  • Latency: <100ms per extraction
  • Throughput: >100 msgs/sec
  • Memory: <8GB (quantized)

Next Steps (Immediate)

  1. Download Llama 3.1 8B Instruct
  2. Create instruction-format training data
  3. Set up LoRA fine-tuning pipeline
  4. Run first training experiment
  5. Benchmark against current Phi-3 model