# FinEE v2.0 - Upgrade Roadmap
## From Extraction Engine to Intelligent Financial Agent
### Current vs Target Comparison
| Dimension | Current State | Target State | Priority |
|-----------|--------------|--------------|----------|
| **Base Model** | Phi-3 Mini (3.8B) | Llama 3.1 8B / Qwen2.5 7B | P0 |
| **Training Data** | 456 samples | 100K+ distilled samples | P0 |
| **Output Format** | Token extraction | Instruction-following JSON | P0 |
| **Context** | None | RAG + Knowledge Graph | P1 |
| **Interaction** | Single-turn | Multi-turn agent | P1 |
| **Input Types** | Email only | SMS + Email + PDF + Images | P1 |
| **Accuracy** | ~70% (estimated) | 95%+ (measured) | P0 |
---
## Phase 1: Foundation (Week 1-2)
### 1.1 Model Upgrade
- [ ] Download Llama 3.1 8B Instruct
- [ ] Download Qwen2.5 7B Instruct (backup)
- [ ] Benchmark both on finance extraction task
- [ ] Set up quantization pipeline (4-bit, 8-bit)
### 1.2 Training Data Expansion
- [x] Generate 100K synthetic samples (DONE ✅)
- [ ] Distill from GPT-4/Claude for complex cases
- [x] Add real data from user (2,419 SMS samples ✅)
- [ ] Create validation set (10K samples)
- [ ] Create test set (5K unseen samples)
### 1.3 Instruction Format
```json
{
  "system": "You are a financial entity extractor...",
  "instruction": "Extract entities from this message",
  "input": "<bank SMS or email>",
  "output": {
    "amount": 2500.00,
    "type": "debit",
    "merchant": "Swiggy",
    "category": "food",
    "date": "2026-01-12",
    "reference": "123456789012"
  }
}
```
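The instruction format above can be generated programmatically from labelled messages. A minimal sketch, assuming the schema shown in the example; `make_sample` and `SYSTEM_PROMPT` are hypothetical names, and the full system prompt is elided as in the original:

```python
import json

SYSTEM_PROMPT = "You are a financial entity extractor..."

def make_sample(message: str, entities: dict) -> dict:
    """Wrap one raw bank message and its labelled entities into the
    instruction format used for fine-tuning."""
    return {
        "system": SYSTEM_PROMPT,
        "instruction": "Extract entities from this message",
        "input": message,
        "output": entities,
    }

sample = make_sample(
    "INR 2500.00 debited for Swiggy on 12-01-2026. Ref 123456789012.",
    {
        "amount": 2500.00,
        "type": "debit",
        "merchant": "Swiggy",
        "category": "food",
        "date": "2026-01-12",
        "reference": "123456789012",
    },
)
# Serialise one record per line (JSONL), a common layout for SFT datasets
line = json.dumps(sample, ensure_ascii=False)
```

Emitting JSONL keeps the 100K-sample corpus streamable during training rather than requiring the whole file in memory.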
---
## Phase 2: Multi-Modal Support (Week 3-4)
### 2.1 Input Types
- [x] SMS Parser (DONE ✅)
- [x] Email Parser (DONE ✅)
- [ ] PDF Statement Parser
  - Use `pdfplumber` for text extraction
  - Table detection with `camelot`
  - OCR fallback with `pytesseract`
- [ ] Image/Screenshot Parser
  - OCR with `EasyOCR` or `PaddleOCR`
  - Vision model for structured extraction
### 2.2 Bank Statement Processing
```
PDF Input β†’ Text Extraction β†’ Table Detection β†’
Row Parsing β†’ Entity Extraction β†’ Transaction List
```
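The "Row Parsing → Entity Extraction" step of this pipeline can be prototyped with a regex before involving the model. A sketch for one hypothetical statement layout (`DD/MM/YYYY  DESCRIPTION  AMOUNT DR/CR`); real bank statements vary, so each bank template would need its own pattern:

```python
import re
from datetime import datetime
from typing import Optional

# Matches rows like "12/01/2026  SWIGGY BANGALORE  2,500.00 DR"
ROW_RE = re.compile(
    r"(?P<date>\d{2}/\d{2}/\d{4})\s+"
    r"(?P<desc>.+?)\s+"
    r"(?P<amount>[\d,]+\.\d{2})\s+"
    r"(?P<drcr>DR|CR)$"
)

def parse_row(row: str) -> Optional[dict]:
    """Parse one statement row; None signals a fall-through to the
    OCR/LLM fallback stage."""
    m = ROW_RE.match(row.strip())
    if not m:
        return None
    return {
        "date": datetime.strptime(m["date"], "%d/%m/%Y").date().isoformat(),
        "description": m["desc"],
        "amount": float(m["amount"].replace(",", "")),
        "type": "debit" if m["drcr"] == "DR" else "credit",
    }

txn = parse_row("12/01/2026  SWIGGY BANGALORE  2,500.00 DR")
```

Rows the regex rejects are exactly the "complex cases" worth routing to the fine-tuned model, which keeps LLM calls off the easy 90%.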
### 2.3 Image Processing Pipeline
```
Image β†’ OCR β†’ Text Blocks β†’ Layout Analysis β†’
Entity Extraction β†’ Structured Output
```
---
## Phase 3: RAG + Knowledge Graph (Week 5-6)
### 3.1 Knowledge Base
- Merchant database (10K+ Indian merchants)
- Bank template patterns
- Category taxonomy
- UPI VPA mappings
### 3.2 RAG Architecture
```
Query β†’ Retrieve Similar Transactions β†’
Augment Context β†’ Generate Extraction
```
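The retrieve-then-augment loop above reduces to a nearest-neighbour search over transaction embeddings. A toy sketch with hand-made vectors; in practice the vectors would come from a sentence-embedding model and live in the Qdrant/ChromaDB store described in 3.4:

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, store, k=2):
    """Return the k most similar stored transactions to augment the
    extraction prompt."""
    ranked = sorted(store, key=lambda t: cosine(query_vec, t["vec"]), reverse=True)
    return ranked[:k]

store = [
    {"text": "Swiggy order INR 450", "vec": [0.9, 0.1, 0.0]},
    {"text": "Uber ride INR 320", "vec": [0.1, 0.9, 0.0]},
    {"text": "Zomato order INR 600", "vec": [0.8, 0.2, 0.0]},
]
context = retrieve([1.0, 0.0, 0.0], store, k=2)
```

The retrieved texts are then prepended to the extraction prompt so the model sees how similar messages were parsed before.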
### 3.3 Knowledge Graph
```
[Merchant: Swiggy] --is_a--> [Category: Food Delivery]
--accepts--> [Payment: UPI, Card]
--typical_amount--> [Range: 100-2000]
```
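The merchant graph above can be prototyped in memory before committing to a graph store: nodes are strings and edges are labelled attributes. `plausible_amount` is a hypothetical helper showing how the `typical_amount` edge feeds anomaly detection:

```python
# Minimal in-memory sketch of the merchant knowledge graph.
graph = {
    "Swiggy": {
        "is_a": "Food Delivery",
        "accepts": ["UPI", "Card"],
        "typical_amount": (100, 2000),
    }
}

def plausible_amount(merchant: str, amount: float) -> bool:
    """True if the amount falls in the merchant's typical range;
    unknown merchants default to an unbounded range."""
    lo, hi = graph.get(merchant, {}).get("typical_amount", (0, float("inf")))
    return lo <= amount <= hi

ok = plausible_amount("Swiggy", 450.0)     # inside 100-2000
odd = plausible_amount("Swiggy", 25000.0)  # outside -> anomaly signal
```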
### 3.4 Vector Store
- Use Qdrant/ChromaDB for transaction embeddings
- Enable semantic search for similar transactions
- Support for "transactions like this" queries
---
## Phase 4: Multi-Turn Agent (Week 7-8)
### 4.1 Agent Capabilities
```python
class FinancialAgent:
    def extract_entities(self, message: str) -> dict: ...
    def categorize_spending(self, transactions: list) -> dict: ...
    def detect_anomalies(self, transactions: list) -> list: ...
    def generate_report(self, period: str) -> str: ...
    def answer_question(self, question: str, context: dict) -> str: ...
```
### 4.2 Conversation Flow
```
User: "How much did I spend on food last month?"
Agent: [Retrieves transactions] β†’ [Filters by category] β†’
[Aggregates amounts] β†’ "You spent β‚Ή12,450 on food"
User: "Compare with previous month"
Agent: [Uses conversation context] β†’ [Retrieves both months] β†’
"December: β‚Ή12,450, November: β‚Ή9,800 (+27%)"
```
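The aggregation behind "How much did I spend on food last month?" is straightforward once extraction has run. A sketch assuming transactions already carry the extractor's `amount`, `type`, `category`, and ISO `date` fields:

```python
from collections import defaultdict

def spend_by_month(transactions, category):
    """Sum debits for one category, keyed by YYYY-MM."""
    totals = defaultdict(float)
    for t in transactions:
        if t["type"] == "debit" and t["category"] == category:
            totals[t["date"][:7]] += t["amount"]
    return dict(totals)

txns = [
    {"amount": 450.0, "type": "debit", "category": "food", "date": "2025-12-03"},
    {"amount": 600.0, "type": "debit", "category": "food", "date": "2025-12-18"},
    {"amount": 320.0, "type": "debit", "category": "transport", "date": "2025-12-05"},
    {"amount": 500.0, "type": "debit", "category": "food", "date": "2025-11-21"},
]
food = spend_by_month(txns, "food")
# Month-over-month change for the follow-up question
change = (food["2025-12"] - food["2025-11"]) / food["2025-11"] * 100
```

The "Compare with previous month" turn then only needs the category and two month keys from conversation context, not a fresh extraction pass.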
### 4.3 Tool Use
- Calculator for aggregations
- Date parser for time queries
- Budget tracker integration
- Export to CSV/Excel
---
## Phase 5: Production Deployment (Week 9-10)
### 5.1 Model Optimization
- [ ] GGUF quantization for llama.cpp
- [ ] ONNX export for faster inference
- [ ] vLLM for batch processing
- [ ] MLX optimization for Apple Silicon
### 5.2 API Design
```
# FastAPI endpoints
POST /extract # Single message extraction
POST /extract/batch # Batch extraction
POST /parse/pdf # PDF statement parsing
POST /parse/image # Image OCR + extraction
POST /chat # Multi-turn agent
GET /analytics # Spending analytics
```
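The `/extract` request and response shapes can be pinned down before wiring up the service. A sketch using stdlib dataclasses; in the actual FastAPI app these would be Pydantic models, and the field names here are assumptions, not a settled contract:

```python
from dataclasses import dataclass, field, asdict

@dataclass
class ExtractRequest:
    message: str
    source: str = "sms"  # "sms" | "email"

@dataclass
class ExtractResponse:
    entities: dict          # extracted fields (amount, merchant, ...)
    confidence: float       # model confidence score, 0-1
    warnings: list = field(default_factory=list)

req = ExtractRequest(message="INR 2500.00 debited for Swiggy")
resp = ExtractResponse(
    entities={"amount": 2500.0, "merchant": "Swiggy"},
    confidence=0.93,
)
payload = asdict(resp)  # JSON-serialisable response body
```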
### 5.3 Deployment Options
- Docker container
- Hugging Face Spaces (demo)
- Modal/Replicate (serverless)
- Self-hosted with vLLM
---
## Technical Architecture
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ FinEE v2.0 Agent β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ SMS β”‚ β”‚ Email β”‚ β”‚ PDF β”‚ β”‚ Image β”‚ β”‚
β”‚ β”‚ Parser β”‚ β”‚ Parser β”‚ β”‚ Parser β”‚ β”‚ OCR β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜ β”‚
β”‚ β”‚ β”‚ β”‚ β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚ β–Ό β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ Preprocessor β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚ β–Ό β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ RAG Pipeline β”‚ β”‚
β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚
β”‚ β”‚ β”‚ Vector β”‚ β”‚ Knowledge β”‚ β”‚ Merchant β”‚ β”‚ β”‚
β”‚ β”‚ β”‚ Store β”‚ β”‚ Graph β”‚ β”‚ Database β”‚ β”‚ β”‚
β”‚ β”‚ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚
β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚ β–Ό β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ Llama 3.1 8B / Qwen β”‚ β”‚
β”‚ β”‚ Instruction-Tuned β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚ β–Ό β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ JSON Output β”‚ β”‚
β”‚ β”‚ + Confidence Score β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
---
## Model Selection Analysis
| Model | Size | Speed | Quality | License | Choice |
|-------|------|-------|---------|---------|--------|
| Llama 3.1 8B | 8B | Fast | Excellent | Llama Community | ⭐ Primary |
| Qwen2.5 7B | 7B | Fast | Excellent | Apache | ⭐ Backup |
| Mistral 7B | 7B | Fast | Good | Apache | Alternative |
| Phi-3 Medium | 14B | Medium | Excellent | MIT | Future |
### Why Llama 3.1 8B?
1. **Instruction following** - Best in class for its size
2. **Structured output** - Reliable JSON generation
3. **Context length** - 128K tokens (future RAG)
4. **Quantization** - Excellent 4-bit performance
5. **Ecosystem** - Wide support (vLLM, llama.cpp, MLX)
---
## Training Strategy
### Stage 1: Supervised Fine-tuning (SFT)
```
Base: Llama 3.1 8B Instruct
Data: 100K synthetic + 2.4K real
Method: LoRA (rank=16, alpha=32)
Epochs: 3
```
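Each JSONL record must be rendered into the base model's chat template before SFT. A sketch for the published Llama 3 header/eot token layout; verify against the tokenizer's own chat template (e.g. `tokenizer.apply_chat_template`) before training, since a mismatched template silently degrades instruction following:

```python
import json

def render(sample: dict) -> str:
    """Render one instruction-format record into Llama 3 chat markup."""
    return (
        "<|begin_of_text|>"
        f"<|start_header_id|>system<|end_header_id|>\n\n{sample['system']}<|eot_id|>"
        f"<|start_header_id|>user<|end_header_id|>\n\n"
        f"{sample['instruction']}\n{sample['input']}<|eot_id|>"
        f"<|start_header_id|>assistant<|end_header_id|>\n\n"
        f"{json.dumps(sample['output'])}<|eot_id|>"
    )

text = render({
    "system": "You are a financial entity extractor...",
    "instruction": "Extract entities from this message",
    "input": "INR 2500.00 debited for Swiggy",
    "output": {"amount": 2500.0, "merchant": "Swiggy"},
})
```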
### Stage 2: DPO (Direct Preference Optimization)
```
Create preference pairs:
- Chosen: Correct extraction with confidence
- Rejected: Partial/incorrect extraction
Objective: Improve extraction precision
```
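Constructing the preference pairs can be sketched as follows; here the rejected response is the gold extraction with fields dropped to mimic a partial output, though in practice rejected samples would come from actual model errors. `make_pair` is a hypothetical helper emitting the prompt/chosen/rejected layout DPO trainers expect:

```python
import json

def make_pair(prompt: str, gold: dict, dropped: list) -> dict:
    """Build one DPO pair: chosen = full gold extraction,
    rejected = the same extraction with fields removed."""
    partial = {k: v for k, v in gold.items() if k not in dropped}
    return {
        "prompt": prompt,
        "chosen": json.dumps(gold),
        "rejected": json.dumps(partial),
    }

pair = make_pair(
    "Extract entities: INR 2500.00 debited for Swiggy",
    {"amount": 2500.0, "merchant": "Swiggy", "type": "debit"},
    dropped=["merchant"],
)
```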
### Stage 3: RLHF (Optional)
```
Reward model based on:
- JSON validity
- Field accuracy
- Merchant identification
- Category correctness
```
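The four reward signals above can be combined in a rule-based scorer as a starting point; the equal 0.25 weights are illustrative, and a learned reward model would eventually replace this:

```python
import json

def reward(raw_output: str, gold: dict) -> float:
    """Score a raw model output against the gold extraction, 0.0-1.0."""
    try:
        pred = json.loads(raw_output)
    except json.JSONDecodeError:
        return 0.0  # invalid JSON earns nothing
    score = 0.25  # JSON validity
    fields = [k for k in gold if k not in ("merchant", "category")]
    if fields:
        hits = sum(pred.get(k) == gold[k] for k in fields)
        score += 0.25 * hits / len(fields)  # field accuracy
    if pred.get("merchant") == gold.get("merchant"):
        score += 0.25  # merchant identification
    if pred.get("category") == gold.get("category"):
        score += 0.25  # category correctness
    return score

gold = {"amount": 2500.0, "type": "debit", "merchant": "Swiggy", "category": "food"}
full = reward(json.dumps(gold), gold)  # perfect output
bad = reward("not json", gold)         # malformed output
```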
---
## Metrics & Benchmarks
### Extraction Accuracy
- **Amount**: Target 99%+
- **Type (debit/credit)**: Target 98%+
- **Merchant**: Target 90%+
- **Category**: Target 85%+
- **Reference**: Target 95%+
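Measuring these per-field targets on the 5K test set is a simple exact-match computation; a sketch, assuming predictions and gold labels share the field names from the instruction format:

```python
def field_accuracy(preds, golds):
    """Fraction of samples where each field exactly matches gold."""
    fields = golds[0].keys()
    n = len(golds)
    return {f: sum(p.get(f) == g[f] for p, g in zip(preds, golds)) / n
            for f in fields}

golds = [
    {"amount": 2500.0, "merchant": "Swiggy"},
    {"amount": 320.0, "merchant": "Uber"},
]
preds = [
    {"amount": 2500.0, "merchant": "Swiggy"},
    {"amount": 320.0, "merchant": "UBER INDIA"},  # merchant mismatch
]
acc = field_accuracy(preds, golds)
```

Exact match is deliberately strict for `amount` and `reference`; `merchant` and `category` may warrant normalised or fuzzy matching, which is one reason their targets are lower.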
### System Metrics
- Latency: <100ms per extraction
- Throughput: >100 msgs/sec
- Memory: <8GB (quantized)
---
## Next Steps (Immediate)
1. [ ] Download Llama 3.1 8B Instruct
2. [ ] Create instruction-format training data
3. [ ] Set up LoRA fine-tuning pipeline
4. [ ] Run first training experiment
5. [ ] Benchmark against current Phi-3 model