# FinEE v2.0 - Upgrade Roadmap
## From Extraction Engine to Intelligent Financial Agent
### Current vs Target Comparison
| Dimension | Current State | Target State | Priority |
|-----------|--------------|--------------|----------|
| **Base Model** | Phi-3 Mini (3.8B) | Llama 3.1 8B / Qwen2.5 7B | P0 |
| **Training Data** | 456 samples | 100K+ distilled samples | P0 |
| **Output Format** | Token extraction | Instruction-following JSON | P0 |
| **Context** | None | RAG + Knowledge Graph | P1 |
| **Interaction** | Single-turn | Multi-turn agent | P1 |
| **Input Types** | Email only | SMS + Email + PDF + Images | P1 |
| **Accuracy** | ~70% (estimated) | 95%+ (measured) | P0 |
---
## Phase 1: Foundation (Week 1-2)
### 1.1 Model Upgrade
- [ ] Download Llama 3.1 8B Instruct
- [ ] Download Qwen2.5 7B Instruct (backup)
- [ ] Benchmark both on finance extraction task
- [ ] Set up quantization pipeline (4-bit, 8-bit)
### 1.2 Training Data Expansion
- [x] Generate 100K synthetic samples (DONE ✅)
- [ ] Distill from GPT-4/Claude for complex cases
- [x] Add real data from user (2,419 SMS samples ✅)
- [ ] Create validation set (10K samples)
- [ ] Create test set (5K unseen samples)
### 1.3 Instruction Format
```json
{
  "system": "You are a financial entity extractor...",
  "instruction": "Extract entities from this message",
  "input": "<bank SMS or email>",
  "output": {
    "amount": 2500.00,
    "type": "debit",
    "merchant": "Swiggy",
    "category": "food",
    "date": "2026-01-12",
    "reference": "123456789012"
  }
}
```
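The instruction format above can be generated programmatically from labelled messages. A minimal sketch, assuming the schema shown in the example; `make_sample` and `SYSTEM_PROMPT` are hypothetical names, and the full system prompt is elided as in the original:

```python
import json

SYSTEM_PROMPT = "You are a financial entity extractor..."

def make_sample(message: str, entities: dict) -> dict:
    """Wrap one raw bank message and its labelled entities into the
    instruction format used for fine-tuning."""
    return {
        "system": SYSTEM_PROMPT,
        "instruction": "Extract entities from this message",
        "input": message,
        "output": entities,
    }

sample = make_sample(
    "INR 2500.00 debited for Swiggy on 12-01-2026. Ref 123456789012.",
    {
        "amount": 2500.00,
        "type": "debit",
        "merchant": "Swiggy",
        "category": "food",
        "date": "2026-01-12",
        "reference": "123456789012",
    },
)
# Serialise one record per line (JSONL), a common layout for SFT datasets
line = json.dumps(sample, ensure_ascii=False)
```

Emitting JSONL keeps the 100K-sample corpus streamable during training rather than requiring the whole file in memory.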
---
## Phase 2: Multi-Modal Support (Week 3-4)
### 2.1 Input Types
- [x] SMS Parser (DONE ✅)
- [x] Email Parser (DONE ✅)
- [ ] PDF Statement Parser
  - Use `pdfplumber` for text extraction
  - Table detection with `camelot`
  - OCR fallback with `pytesseract`
- [ ] Image/Screenshot Parser
  - OCR with `EasyOCR` or `PaddleOCR`
  - Vision model for structured extraction
### 2.2 Bank Statement Processing
```
PDF Input β†’ Text Extraction β†’ Table Detection β†’
Row Parsing β†’ Entity Extraction β†’ Transaction List
```
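The "Row Parsing → Entity Extraction" step of this pipeline can be prototyped with a regex before involving the model. A sketch for one hypothetical statement layout (`DD/MM/YYYY  DESCRIPTION  AMOUNT DR/CR`); real bank statements vary, so each bank template would need its own pattern:

```python
import re
from datetime import datetime
from typing import Optional

# Matches rows like "12/01/2026  SWIGGY BANGALORE  2,500.00 DR"
ROW_RE = re.compile(
    r"(?P<date>\d{2}/\d{2}/\d{4})\s+"
    r"(?P<desc>.+?)\s+"
    r"(?P<amount>[\d,]+\.\d{2})\s+"
    r"(?P<drcr>DR|CR)$"
)

def parse_row(row: str) -> Optional[dict]:
    """Parse one statement row; None signals a fall-through to the
    OCR/LLM fallback stage."""
    m = ROW_RE.match(row.strip())
    if not m:
        return None
    return {
        "date": datetime.strptime(m["date"], "%d/%m/%Y").date().isoformat(),
        "description": m["desc"],
        "amount": float(m["amount"].replace(",", "")),
        "type": "debit" if m["drcr"] == "DR" else "credit",
    }

txn = parse_row("12/01/2026  SWIGGY BANGALORE  2,500.00 DR")
```

Rows the regex rejects are exactly the "complex cases" worth routing to the fine-tuned model, which keeps LLM calls off the easy 90%.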
### 2.3 Image Processing Pipeline
```
Image β†’ OCR β†’ Text Blocks β†’ Layout Analysis β†’
Entity Extraction β†’ Structured Output
```
---
## Phase 3: RAG + Knowledge Graph (Week 5-6)
### 3.1 Knowledge Base
- Merchant database (10K+ Indian merchants)
- Bank template patterns
- Category taxonomy
- UPI VPA mappings
### 3.2 RAG Architecture
```
Query β†’ Retrieve Similar Transactions β†’
Augment Context β†’ Generate Extraction
```
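The retrieve-then-augment loop above reduces to a nearest-neighbour search over transaction embeddings. A toy sketch with hand-made vectors; in practice the vectors would come from a sentence-embedding model and live in the Qdrant/ChromaDB store described in 3.4:

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, store, k=2):
    """Return the k most similar stored transactions to augment the
    extraction prompt."""
    ranked = sorted(store, key=lambda t: cosine(query_vec, t["vec"]), reverse=True)
    return ranked[:k]

store = [
    {"text": "Swiggy order INR 450", "vec": [0.9, 0.1, 0.0]},
    {"text": "Uber ride INR 320", "vec": [0.1, 0.9, 0.0]},
    {"text": "Zomato order INR 600", "vec": [0.8, 0.2, 0.0]},
]
context = retrieve([1.0, 0.0, 0.0], store, k=2)
```

The retrieved texts are then prepended to the extraction prompt so the model sees how similar messages were parsed before.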
### 3.3 Knowledge Graph
```
[Merchant: Swiggy] --is_a--> [Category: Food Delivery]
--accepts--> [Payment: UPI, Card]
--typical_amount--> [Range: 100-2000]
```
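The merchant graph above can be prototyped in memory before committing to a graph store: nodes are strings and edges are labelled attributes. `plausible_amount` is a hypothetical helper showing how the `typical_amount` edge feeds anomaly detection:

```python
# Minimal in-memory sketch of the merchant knowledge graph.
graph = {
    "Swiggy": {
        "is_a": "Food Delivery",
        "accepts": ["UPI", "Card"],
        "typical_amount": (100, 2000),
    }
}

def plausible_amount(merchant: str, amount: float) -> bool:
    """True if the amount falls in the merchant's typical range;
    unknown merchants default to an unbounded range."""
    lo, hi = graph.get(merchant, {}).get("typical_amount", (0, float("inf")))
    return lo <= amount <= hi

ok = plausible_amount("Swiggy", 450.0)     # inside 100-2000
odd = plausible_amount("Swiggy", 25000.0)  # outside -> anomaly signal
```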
### 3.4 Vector Store
- Use Qdrant/ChromaDB for transaction embeddings
- Enable semantic search for similar transactions
- Support for "transactions like this" queries
---
## Phase 4: Multi-Turn Agent (Week 7-8)
### 4.1 Agent Capabilities
```python
class FinancialAgent:
    def extract_entities(self, message: str) -> dict: ...
    def categorize_spending(self, transactions: list) -> dict: ...
    def detect_anomalies(self, transactions: list) -> list: ...
    def generate_report(self, period: str) -> str: ...
    def answer_question(self, question: str, context: dict) -> str: ...
```
### 4.2 Conversation Flow
```
User: "How much did I spend on food last month?"
Agent: [Retrieves transactions] β†’ [Filters by category] β†’
[Aggregates amounts] β†’ "You spent β‚Ή12,450 on food"
User: "Compare with previous month"
Agent: [Uses conversation context] β†’ [Retrieves both months] β†’
"December: β‚Ή12,450, November: β‚Ή9,800 (+27%)"
```
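The aggregation behind "How much did I spend on food last month?" is straightforward once extraction has run. A sketch assuming transactions already carry the extractor's `amount`, `type`, `category`, and ISO `date` fields:

```python
from collections import defaultdict

def spend_by_month(transactions, category):
    """Sum debits for one category, keyed by YYYY-MM."""
    totals = defaultdict(float)
    for t in transactions:
        if t["type"] == "debit" and t["category"] == category:
            totals[t["date"][:7]] += t["amount"]
    return dict(totals)

txns = [
    {"amount": 450.0, "type": "debit", "category": "food", "date": "2025-12-03"},
    {"amount": 600.0, "type": "debit", "category": "food", "date": "2025-12-18"},
    {"amount": 320.0, "type": "debit", "category": "transport", "date": "2025-12-05"},
    {"amount": 500.0, "type": "debit", "category": "food", "date": "2025-11-21"},
]
food = spend_by_month(txns, "food")
# Month-over-month change for the follow-up question
change = (food["2025-12"] - food["2025-11"]) / food["2025-11"] * 100
```

The "Compare with previous month" turn then only needs the category and two month keys from conversation context, not a fresh extraction pass.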
### 4.3 Tool Use
- Calculator for aggregations
- Date parser for time queries
- Budget tracker integration
- Export to CSV/Excel
---
## Phase 5: Production Deployment (Week 9-10)
### 5.1 Model Optimization
- [ ] GGUF quantization for llama.cpp
- [ ] ONNX export for faster inference
- [ ] vLLM for batch processing
- [ ] MLX optimization for Apple Silicon
### 5.2 API Design
```
# FastAPI endpoints
POST /extract # Single message extraction
POST /extract/batch # Batch extraction
POST /parse/pdf # PDF statement parsing
POST /parse/image # Image OCR + extraction
POST /chat # Multi-turn agent
GET /analytics # Spending analytics
```
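The `/extract` request and response shapes can be pinned down before wiring up the service. A sketch using stdlib dataclasses; in the actual FastAPI app these would be Pydantic models, and the field names here are assumptions, not a settled contract:

```python
from dataclasses import dataclass, field, asdict

@dataclass
class ExtractRequest:
    message: str
    source: str = "sms"  # "sms" | "email"

@dataclass
class ExtractResponse:
    entities: dict          # extracted fields (amount, merchant, ...)
    confidence: float       # model confidence score, 0-1
    warnings: list = field(default_factory=list)

req = ExtractRequest(message="INR 2500.00 debited for Swiggy")
resp = ExtractResponse(
    entities={"amount": 2500.0, "merchant": "Swiggy"},
    confidence=0.93,
)
payload = asdict(resp)  # JSON-serialisable response body
```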
### 5.3 Deployment Options
- Docker container
- Hugging Face Spaces (demo)
- Modal/Replicate (serverless)
- Self-hosted with vLLM
---
## Technical Architecture
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ FinEE v2.0 Agent β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ SMS β”‚ β”‚ Email β”‚ β”‚ PDF β”‚ β”‚ Image β”‚ β”‚
β”‚ β”‚ Parser β”‚ β”‚ Parser β”‚ β”‚ Parser β”‚ β”‚ OCR β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜ β”‚
β”‚ β”‚ β”‚ β”‚ β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚ β–Ό β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ Preprocessor β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚ β–Ό β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ RAG Pipeline β”‚ β”‚
β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚
β”‚ β”‚ β”‚ Vector β”‚ β”‚ Knowledge β”‚ β”‚ Merchant β”‚ β”‚ β”‚
β”‚ β”‚ β”‚ Store β”‚ β”‚ Graph β”‚ β”‚ Database β”‚ β”‚ β”‚
β”‚ β”‚ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚
β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚ β–Ό β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ Llama 3.1 8B / Qwen β”‚ β”‚
β”‚ β”‚ Instruction-Tuned β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚ β–Ό β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ JSON Output β”‚ β”‚
β”‚ β”‚ + Confidence Score β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
---
## Model Selection Analysis
| Model | Size | Speed | Quality | License | Choice |
|-------|------|-------|---------|---------|--------|
| Llama 3.1 8B | 8B | Fast | Excellent | Llama Community | ⭐ Primary |
| Qwen2.5 7B | 7B | Fast | Excellent | Apache | ⭐ Backup |
| Mistral 7B | 7B | Fast | Good | Apache | Alternative |
| Phi-3 Medium | 14B | Medium | Excellent | MIT | Future |
### Why Llama 3.1 8B?
1. **Instruction following** - Best in class for its size
2. **Structured output** - Reliable JSON generation
3. **Context length** - 128K tokens (future RAG)
4. **Quantization** - Excellent 4-bit performance
5. **Ecosystem** - Wide support (vLLM, llama.cpp, MLX)
---
## Training Strategy
### Stage 1: Supervised Fine-tuning (SFT)
```
Base: Llama 3.1 8B Instruct
Data: 100K synthetic + 2.4K real
Method: LoRA (rank=16, alpha=32)
Epochs: 3
```
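Each JSONL record must be rendered into the base model's chat template before SFT. A sketch for the published Llama 3 header/eot token layout; verify against the tokenizer's own chat template (e.g. `tokenizer.apply_chat_template`) before training, since a mismatched template silently degrades instruction following:

```python
import json

def render(sample: dict) -> str:
    """Render one instruction-format record into Llama 3 chat markup."""
    return (
        "<|begin_of_text|>"
        f"<|start_header_id|>system<|end_header_id|>\n\n{sample['system']}<|eot_id|>"
        f"<|start_header_id|>user<|end_header_id|>\n\n"
        f"{sample['instruction']}\n{sample['input']}<|eot_id|>"
        f"<|start_header_id|>assistant<|end_header_id|>\n\n"
        f"{json.dumps(sample['output'])}<|eot_id|>"
    )

text = render({
    "system": "You are a financial entity extractor...",
    "instruction": "Extract entities from this message",
    "input": "INR 2500.00 debited for Swiggy",
    "output": {"amount": 2500.0, "merchant": "Swiggy"},
})
```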
### Stage 2: DPO (Direct Preference Optimization)
```
Create preference pairs:
- Chosen: Correct extraction with confidence
- Rejected: Partial/incorrect extraction
Objective: Improve extraction precision
```
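Constructing the preference pairs can be sketched as follows; here the rejected response is the gold extraction with fields dropped to mimic a partial output, though in practice rejected samples would come from actual model errors. `make_pair` is a hypothetical helper emitting the prompt/chosen/rejected layout DPO trainers expect:

```python
import json

def make_pair(prompt: str, gold: dict, dropped: list) -> dict:
    """Build one DPO pair: chosen = full gold extraction,
    rejected = the same extraction with fields removed."""
    partial = {k: v for k, v in gold.items() if k not in dropped}
    return {
        "prompt": prompt,
        "chosen": json.dumps(gold),
        "rejected": json.dumps(partial),
    }

pair = make_pair(
    "Extract entities: INR 2500.00 debited for Swiggy",
    {"amount": 2500.0, "merchant": "Swiggy", "type": "debit"},
    dropped=["merchant"],
)
```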
### Stage 3: RLHF (Optional)
```
Reward model based on:
- JSON validity
- Field accuracy
- Merchant identification
- Category correctness
```
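The four reward signals above can be combined in a rule-based scorer as a starting point; the equal 0.25 weights are illustrative, and a learned reward model would eventually replace this:

```python
import json

def reward(raw_output: str, gold: dict) -> float:
    """Score a raw model output against the gold extraction, 0.0-1.0."""
    try:
        pred = json.loads(raw_output)
    except json.JSONDecodeError:
        return 0.0  # invalid JSON earns nothing
    score = 0.25  # JSON validity
    fields = [k for k in gold if k not in ("merchant", "category")]
    if fields:
        hits = sum(pred.get(k) == gold[k] for k in fields)
        score += 0.25 * hits / len(fields)  # field accuracy
    if pred.get("merchant") == gold.get("merchant"):
        score += 0.25  # merchant identification
    if pred.get("category") == gold.get("category"):
        score += 0.25  # category correctness
    return score

gold = {"amount": 2500.0, "type": "debit", "merchant": "Swiggy", "category": "food"}
full = reward(json.dumps(gold), gold)  # perfect output
bad = reward("not json", gold)         # malformed output
```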
---
## Metrics & Benchmarks
### Extraction Accuracy
- **Amount**: Target 99%+
- **Type (debit/credit)**: Target 98%+
- **Merchant**: Target 90%+
- **Category**: Target 85%+
- **Reference**: Target 95%+
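Measuring these per-field targets on the 5K test set is a simple exact-match computation; a sketch, assuming predictions and gold labels share the field names from the instruction format:

```python
def field_accuracy(preds, golds):
    """Fraction of samples where each field exactly matches gold."""
    fields = golds[0].keys()
    n = len(golds)
    return {f: sum(p.get(f) == g[f] for p, g in zip(preds, golds)) / n
            for f in fields}

golds = [
    {"amount": 2500.0, "merchant": "Swiggy"},
    {"amount": 320.0, "merchant": "Uber"},
]
preds = [
    {"amount": 2500.0, "merchant": "Swiggy"},
    {"amount": 320.0, "merchant": "UBER INDIA"},  # merchant mismatch
]
acc = field_accuracy(preds, golds)
```

Exact match is deliberately strict for `amount` and `reference`; `merchant` and `category` may warrant normalised or fuzzy matching, which is one reason their targets are lower.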
### System Metrics
- Latency: <100ms per extraction
- Throughput: >100 msgs/sec
- Memory: <8GB (quantized)
---
## Next Steps (Immediate)
1. [ ] Download Llama 3.1 8B Instruct
2. [ ] Create instruction-format training data
3. [ ] Set up LoRA fine-tuning pipeline
4. [ ] Run first training experiment
5. [ ] Benchmark against current Phi-3 model