# FinEE v2.0 - Upgrade Roadmap
## From Extraction Engine to Intelligent Financial Agent

### Current vs Target Comparison

| Dimension | Current State | Target State | Priority |
|-----------|--------------|--------------|----------|
| **Base Model** | Phi-3 Mini (3.8B) | Llama 3.1 8B / Qwen2.5 7B | P0 |
| **Training Data** | 456 samples | 100K+ distilled samples | P0 |
| **Output Format** | Token extraction | Instruction-following JSON | P0 |
| **Context** | None | RAG + Knowledge Graph | P1 |
| **Interaction** | Single-turn | Multi-turn agent | P1 |
| **Input Types** | Email only | SMS + Email + PDF + Images | P1 |
| **Accuracy** | ~70% (estimated) | 95%+ (measured) | P0 |

---

## Phase 1: Foundation (Week 1-2)
### 1.1 Model Upgrade
- [ ] Download Llama 3.1 8B Instruct
- [ ] Download Qwen2.5 7B Instruct (backup)
- [ ] Benchmark both on finance extraction task
- [ ] Set up quantization pipeline (4-bit, 8-bit)

### 1.2 Training Data Expansion
- [x] Generate 100K synthetic samples (DONE ✅)
- [ ] Distill from GPT-4/Claude for complex cases
- [x] Add real user data (2,419 SMS samples, DONE ✅)
- [ ] Create validation set (10K samples)
- [ ] Create test set (5K unseen samples)

### 1.3 Instruction Format
```json
{
  "system": "You are a financial entity extractor...",
  "instruction": "Extract entities from this message",
  "input": "<bank SMS or email>",
  "output": {
    "amount": 2500.00,
    "type": "debit",
    "merchant": "Swiggy",
    "category": "food",
    "date": "2026-01-12",
    "reference": "123456789012"
  }
}
```
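
As a rough illustration of how records in this format could be assembled and sanity-checked before training, here is a minimal sketch. The `REQUIRED_FIELDS` schema and the helper names `validate_output` and `build_record` are assumptions made for this example, not part of the existing codebase.

```python
import json

# Hypothetical schema: field name -> accepted Python type(s), mirroring the example above.
REQUIRED_FIELDS = {
    "amount": (int, float),
    "type": str,
    "merchant": str,
    "category": str,
    "date": str,
    "reference": str,
}


def validate_output(output):
    """Return a list of problems; an empty list means the output looks well-formed."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in output:
            problems.append(f"missing field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(output[field], bool) or not isinstance(output[field], expected):
            problems.append(f"wrong type for field: {field}")
    if output.get("type") not in ("debit", "credit"):
        problems.append("type must be 'debit' or 'credit'")
    return problems


def build_record(message, output):
    """Serialize one training sample as a single JSON line (JSONL)."""
    record = {
        "system": "You are a financial entity extractor...",
        "instruction": "Extract entities from this message",
        "input": message,
        "output": output,
    }
    return json.dumps(record, ensure_ascii=False)
```

Running `validate_output` over every synthetic and distilled sample before writing it out is one cheap way to keep malformed targets out of the training set.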

---

## Phase 2: Multi-Modal Support (Week 3-4)
### 2.1 Input Types
- [x] SMS Parser (DONE ✅)
- [x] Email Parser (DONE ✅)
- [ ] PDF Statement Parser
  - Use `pdfplumber` for text extraction
  - Table detection with `camelot`
  - OCR fallback with `pytesseract`
- [ ] Image/Screenshot Parser
  - OCR with `EasyOCR` or `PaddleOCR`
  - Vision model for structured extraction

### 2.2 Bank Statement Processing
```
PDF Input → Text Extraction → Table Detection → 
Row Parsing → Entity Extraction → Transaction List
```
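
The "Row Parsing" step could look roughly like the sketch below. It assumes table detection has already produced rows of five string cells in the order date, description, debit, credit, balance, with dates as `DD/MM/YYYY`. That column layout is a guess for illustration; real statements differ per bank.

```python
from datetime import datetime


def parse_amount(cell):
    """Convert a cell such as '2,500.00' to a float; empty cells count as 0."""
    cleaned = cell.replace(",", "").strip()
    return float(cleaned) if cleaned else 0.0


def parse_row(row):
    """Turn one table row into a transaction dict, or None if it is not a transaction."""
    if len(row) != 5:
        return None
    date_cell, description, debit_cell, credit_cell, _balance = row
    try:
        date = datetime.strptime(date_cell.strip(), "%d/%m/%Y").date()
        debit = parse_amount(debit_cell)
        credit = parse_amount(credit_cell)
    except ValueError:
        return None  # header rows and page footers fail to parse and are skipped
    if debit == 0 and credit == 0:
        return None
    return {
        "date": date.isoformat(),
        "description": description.strip(),
        "amount": debit if debit else credit,
        "type": "debit" if debit else "credit",
    }


def parse_statement(rows):
    """Keep only the rows that parse as transactions."""
    return [txn for txn in (parse_row(r) for r in rows) if txn is not None]
```

The `description` field is what would then be handed to the entity extractor to resolve merchant and category.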

### 2.3 Image Processing Pipeline
```
Image → OCR → Text Blocks → Layout Analysis → 
Entity Extraction → Structured Output
```
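
For the "Text Blocks → Layout Analysis" step, one simple approach is to group OCR blocks into reading-order lines by their vertical position. The sketch below assumes each block is a dict with `text`, `x`, and `y` keys; that shape is hypothetical, since each OCR library returns its own structure that would need adapting first.

```python
def group_into_lines(blocks, y_tolerance=10):
    """Group blocks with similar vertical positions, then read each line left to right."""
    lines = []
    for block in sorted(blocks, key=lambda b: b["y"]):
        # A block joins the current line if it sits close to the previous block vertically.
        if lines and abs(block["y"] - lines[-1][-1]["y"]) <= y_tolerance:
            lines[-1].append(block)
        else:
            lines.append([block])
    return [
        " ".join(b["text"] for b in sorted(line, key=lambda b: b["x"]))
        for line in lines
    ]
```

The `y_tolerance` value is in pixels and would need tuning per image resolution.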

---

## Phase 3: RAG + Knowledge Graph (Week 5-6)
### 3.1 Knowledge Base
- Merchant database (10K+ Indian merchants)
- Bank template patterns
- Category taxonomy
- UPI VPA mappings

### 3.2 RAG Architecture
```
Query → Retrieve Similar Transactions → 
Augment Context → Generate Extraction
```
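
The retrieve-then-augment flow can be sketched without any external dependency. The example below uses bag-of-words cosine similarity as a stand-in for real embeddings, which would come from an embedding model and a vector store in practice; the `text` and `label` keys on history items are assumptions for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercased bag of alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, history, k=2):
    """Return the k past transactions most similar to the query."""
    q = tokenize(query)
    ranked = sorted(history, key=lambda item: cosine(q, tokenize(item["text"])), reverse=True)
    return ranked[:k]


def augment(query, examples):
    """Build the prompt: retrieved examples first, then the message to extract from."""
    lines = ["Similar past transactions:"]
    for ex in examples:
        lines.append(f"- {ex['text']} => {ex['label']}")
    lines.append(f"Extract entities from: {query}")
    return "\n".join(lines)
```

The output of `augment` is what would be passed to the model in the "Generate Extraction" step.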

### 3.3 Knowledge Graph
```
[Merchant: Swiggy] --is_a--> [Category: Food Delivery]
                   --accepts--> [Payment: UPI, Card]
                   --typical_amount--> [Range: 100-2000]
```
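
A graph like the one above can be prototyped as subject-relation-object triples before committing to a graph database. This is a minimal sketch; the `KnowledgeGraph` class and `is_typical_amount` helper are hypothetical names, and the Swiggy values are copied from the diagram.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Tiny in-memory triple store keyed by (subject, relation)."""

    def __init__(self):
        self._edges = defaultdict(list)

    def add(self, subject, relation, obj):
        self._edges[(subject, relation)].append(obj)

    def objects(self, subject, relation):
        # .get avoids creating empty entries for unknown keys.
        return list(self._edges.get((subject, relation), []))


def is_typical_amount(graph, merchant, amount):
    """True if the amount falls inside any typical range recorded for the merchant."""
    ranges = graph.objects(merchant, "typical_amount")
    return any(low <= amount <= high for low, high in ranges)


kg = KnowledgeGraph()
kg.add("Swiggy", "is_a", "Food Delivery")
kg.add("Swiggy", "accepts", "UPI")
kg.add("Swiggy", "accepts", "Card")
kg.add("Swiggy", "typical_amount", (100, 2000))
```

A lookup such as `is_typical_amount(kg, "Swiggy", 25000)` returning `False` is the kind of signal that could feed anomaly detection in Phase 4.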

### 3.4 Vector Store
- Use Qdrant/ChromaDB for transaction embeddings
- Enable semantic search for similar transactions
- Support for "transactions like this" queries

---

## Phase 4: Multi-Turn Agent (Week 7-8)
### 4.1 Agent Capabilities
```python
class FinancialAgent:
    """Planned agent interface; method bodies are stubs for now."""

    def extract_entities(self, message) -> dict: ...
    def categorize_spending(self, transactions) -> dict: ...
    def detect_anomalies(self, transactions) -> list: ...
    def generate_report(self, period) -> str: ...
    def answer_question(self, question, context) -> str: ...
```

### 4.2 Conversation Flow
```
User: "How much did I spend on food last month?"
Agent: [Retrieves transactions] → [Filters by category] → 
       [Aggregates amounts] → "You spent ₹12,450 on food"

User: "Compare with previous month"
Agent: [Uses conversation context] → [Retrieves both months] →
       "December: ₹12,450, November: ₹9,800 (+27%)"
```
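
The filter, aggregate, and compare steps in this flow reduce to a few lines of arithmetic. The sketch below uses made-up sample transactions chosen to reproduce the figures in the conversation; the function names and the `YYYY-MM` month convention are assumptions for illustration.

```python
# Illustrative sample data only; amounts are in rupees.
TRANSACTIONS = [
    {"date": "2025-12-03", "amount": 8000, "type": "debit", "category": "food"},
    {"date": "2025-12-19", "amount": 4450, "type": "debit", "category": "food"},
    {"date": "2025-12-21", "amount": 1200, "type": "debit", "category": "travel"},
    {"date": "2025-11-08", "amount": 9800, "type": "debit", "category": "food"},
]


def total_for(transactions, category, month):
    """Sum debit amounts for one category in one month given as 'YYYY-MM'."""
    return sum(
        t["amount"]
        for t in transactions
        if t["type"] == "debit" and t["category"] == category and t["date"].startswith(month)
    )


def previous_month(month):
    """'2026-01' -> '2025-12'."""
    year, mon = map(int, month.split("-"))
    return f"{year - 1}-12" if mon == 1 else f"{year}-{mon - 1:02d}"


def compare_with_previous(transactions, category, month):
    """Return (current total, previous total, percent change rounded to an integer)."""
    current = total_for(transactions, category, month)
    previous = total_for(transactions, category, previous_month(month))
    change = round((current - previous) / previous * 100) if previous else None
    return current, previous, change
```

Here `compare_with_previous(TRANSACTIONS, "food", "2025-12")` yields totals of 12450 and 9800 with a change of 27, matching the second agent reply. For the follow-up question to work, the agent has to remember the category and month from the first turn.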

### 4.3 Tool Use
- Calculator for aggregations
- Date parser for time queries
- Budget tracker integration
- Export to CSV/Excel

---

## Phase 5: Production Deployment (Week 9-10)
### 5.1 Model Optimization
- [ ] GGUF quantization for llama.cpp
- [ ] ONNX export for faster inference
- [ ] vLLM for batch processing
- [ ] MLX optimization for Apple Silicon

### 5.2 API Design
```
# FastAPI endpoints
POST /extract          # Single message extraction
POST /extract/batch    # Batch extraction
POST /parse/pdf        # PDF statement parsing
POST /parse/image      # Image OCR + extraction
POST /chat             # Multi-turn agent
GET  /analytics        # Spending analytics
```

### 5.3 Deployment Options
- Docker container
- Hugging Face Spaces (demo)
- Modal/Replicate (serverless)
- Self-hosted with vLLM

---

## Technical Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                     FinEE v2.0 Agent                        │
├─────────────────────────────────────────────────────────────┤
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│  │   SMS    │  │  Email   │  │   PDF    │  │  Image   │   │
│  │  Parser  │  │  Parser  │  │  Parser  │  │   OCR    │   │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘   │
│       │             │             │             │          │
│       └─────────────┴──────┬──────┴─────────────┘          │
│                            ▼                               │
│                   ┌────────────────┐                       │
│                   │  Preprocessor  │                       │
│                   └────────┬───────┘                       │
│                            ▼                               │
│  ┌─────────────────────────────────────────────────────┐  │
│  │                  RAG Pipeline                        │  │
│  │  ┌─────────┐   ┌─────────────┐   ┌──────────────┐  │  │
│  │  │ Vector  │   │  Knowledge  │   │   Merchant   │  │  │
│  │  │  Store  │   │    Graph    │   │   Database   │  │  │
│  │  └────┬────┘   └──────┬──────┘   └───────┬──────┘  │  │
│  │       └───────────────┼──────────────────┘          │  │
│  └───────────────────────┼─────────────────────────────┘  │
│                          ▼                                │
│              ┌───────────────────────┐                    │
│              │  Llama 3.1 8B / Qwen  │                    │
│              │   Instruction-Tuned   │                    │
│              └───────────┬───────────┘                    │
│                          ▼                                │
│              ┌───────────────────────┐                    │
│              │     JSON Output       │                    │
│              │   + Confidence Score  │                    │
│              └───────────────────────┘                    │
└─────────────────────────────────────────────────────────────┘
```

---

## Model Selection Analysis

| Model | Size | Speed | Quality | License | Choice |
|-------|------|-------|---------|---------|--------|
| Llama 3.1 8B | 8B | Fast | Excellent | Llama 3.1 Community License | ⭐ Primary |
| Qwen2.5 7B | 7B | Fast | Excellent | Apache 2.0 | ⭐ Backup |
| Mistral 7B | 7B | Fast | Good | Apache 2.0 | Alternative |
| Phi-3 Medium | 14B | Medium | Excellent | MIT | Future |

### Why Llama 3.1 8B?
1. **Instruction following** - Strong for its size (to be confirmed by the Phase 1 benchmark)
2. **Structured output** - Reliable JSON generation
3. **Context length** - 128K tokens (headroom for future RAG context)
4. **Quantization** - Holds up well at 4-bit (to be checked against the accuracy targets)
5. **Ecosystem** - Wide support (vLLM, llama.cpp, MLX)

---

## Training Strategy

### Stage 1: Supervised Fine-tuning (SFT)
```
Base: Llama 3.1 8B Instruct
Data: 100K synthetic + 2.4K real
Method: LoRA (rank=16, alpha=32)
Epochs: 3
```
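
To get a feel for what `rank=16, alpha=32` means in practice, the adapter size can be estimated by hand. The sketch below assumes LoRA is applied only to the attention query and value projections, which the roadmap does not specify, and uses the published Llama 3.1 8B dimensions (hidden size 4096, 32 layers, 8 key/value heads of dimension 128).

```python
HIDDEN_SIZE = 4096      # model width
NUM_LAYERS = 32         # transformer blocks
KV_DIM = 8 * 128        # grouped-query attention: 8 KV heads x 128 dims = 1024

RANK = 16
ALPHA = 32


def lora_params(d_in, d_out, rank):
    """A LoRA adapter adds two matrices: (d_in x rank) and (rank x d_out)."""
    return rank * (d_in + d_out)


# Assumption: adapters on q_proj (4096 -> 4096) and v_proj (4096 -> 1024) only.
per_layer = lora_params(HIDDEN_SIZE, HIDDEN_SIZE, RANK) + lora_params(HIDDEN_SIZE, KV_DIM, RANK)
trainable = per_layer * NUM_LAYERS

# The adapter update is scaled by alpha / rank before being added to the frozen weight.
scaling = ALPHA / RANK
```

Under these assumptions `trainable` is 6,815,744, well under 0.1% of the 8B base parameters, and `scaling` is 2.0. Targeting more projection matrices would raise the count proportionally.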

### Stage 2: DPO (Direct Preference Optimization)
```
Create preference pairs:
- Chosen: Correct extraction with confidence
- Rejected: Partial/incorrect extraction
Objective: Improve extraction precision
```
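
One cheap way to create such pairs is to corrupt a known-good extraction. The sketch below assumes the `prompt` / `chosen` / `rejected` record layout commonly used by preference-tuning tooling; the helper names are hypothetical, and dropping a field is only one of several possible corruptions.

```python
import json


def drop_field(gold, field):
    """Simulate a partial extraction by removing one field."""
    return {k: v for k, v in gold.items() if k != field}


def make_preference_pair(message, gold, corrupted):
    """Pair a correct extraction (chosen) with a degraded one (rejected)."""
    return {
        "prompt": f"Extract entities from this message:\n{message}",
        "chosen": json.dumps(gold, sort_keys=True),
        "rejected": json.dumps(corrupted, sort_keys=True),
    }
```

Rejected samples could also be taken from real mistakes made by the Stage 1 model, which may be more informative than synthetic corruptions.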

### Stage 3: RLHF (Optional)
```
Reward model based on:
- JSON validity
- Field accuracy
- Merchant identification
- Category correctness
```
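
All four criteria can be checked programmatically against a gold label, so a rule-based reward is a plausible starting point before training a learned reward model. The sketch below is one such scoring function; the point weights and the -1.0 penalty for invalid JSON are arbitrary assumptions.

```python
import json

# Hypothetical weights in points out of 10; integers keep the arithmetic exact.
FIELD_POINTS = {"amount": 4, "type": 2, "merchant": 2, "category": 2}
INVALID_JSON_REWARD = -1.0


def reward(model_output, gold):
    """Score a raw model output string against the gold extraction."""
    try:
        predicted = json.loads(model_output)
    except json.JSONDecodeError:
        return INVALID_JSON_REWARD
    if not isinstance(predicted, dict):
        return INVALID_JSON_REWARD
    points = sum(
        value for field, value in FIELD_POINTS.items()
        if field in predicted and predicted[field] == gold.get(field)
    )
    return points / 10
```

Exact string matching on `merchant` is strict; a production version would likely normalize case and known aliases first.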

---

## Metrics & Benchmarks

### Extraction Accuracy
- **Amount**: Target 99%+
- **Type (debit/credit)**: Target 98%+
- **Merchant**: Target 90%+
- **Category**: Target 85%+
- **Reference**: Target 95%+
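
These targets are per-field, so the benchmark has to score each field separately rather than reporting one overall number. A minimal sketch, assuming predictions and gold labels are parallel lists of dicts and that accuracy means exact match:

```python
# Targets copied from the list above, as fractions.
TARGETS = {"amount": 0.99, "type": 0.98, "merchant": 0.90, "category": 0.85, "reference": 0.95}


def field_accuracy(predictions, gold, fields):
    """Exact-match accuracy per field over parallel lists of dicts."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and the same length")
    return {
        field: sum(
            field in p and p[field] == g.get(field) for p, g in zip(predictions, gold)
        ) / len(gold)
        for field in fields
    }


def below_target(accuracy):
    """Names of fields whose accuracy misses the target, sorted alphabetically."""
    return sorted(f for f, acc in accuracy.items() if acc < TARGETS[f])
```

Running this on the held-out test set after each training stage shows which fields still need attention.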

### System Metrics
- Latency: <100ms per extraction
- Throughput: >100 msgs/sec
- Memory: <8GB (quantized)
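
The memory target can be sanity-checked with a back-of-the-envelope estimate of weight storage alone. The sketch below assumes a nominal 8 billion parameters and ignores the KV cache, activations, and per-group quantization metadata, so real usage will be higher.

```python
NOMINAL_PARAMS = 8e9  # "8B" taken at face value; the exact count differs slightly


def weight_gb(num_params, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


estimates = {bits: weight_gb(NOMINAL_PARAMS, bits) for bits in (4, 8, 16)}
```

Under this estimate, 4-bit weights take about 4 GB and fit the budget with room for the KV cache, while 8-bit weights alone reach about 8 GB and would not leave any headroom.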

---

## Next Steps (Immediate)

1. [ ] Download Llama 3.1 8B Instruct
2. [ ] Create instruction-format training data
3. [ ] Set up LoRA fine-tuning pipeline
4. [ ] Run first training experiment
5. [ ] Benchmark against current Phi-3 model