title: Financial Intelligence AI
emoji: 💸
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false

🔍 Financial Document Extractor & Anomaly Detector

A fine-tuned Qwen 2.5 7B Instruct model that extracts structured JSON from financial documents and flags anomalies in five categories. Built in 20 hours as a practical demonstration of production-grade ML engineering.

🎯 What It Does

Input	Output
Raw financial PDF (invoice, PO, receipt, bank statement)	Structured JSON + anomaly flags

Anomaly Detection (5 Categories)

🔴 Arithmetic Errors — Totals that don't add up
🟡 Missing Fields — Required information absent from the document
🔵 Format Anomalies — Invalid dates, negative quantities, duplicates
🟡 Business Logic — Round-number fraud indicators, extreme amounts
🔴 Cross-Field Inconsistencies — Mismatched PO references, currency conflicts
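To make two of these categories concrete, here is a minimal rule-based sketch of arithmetic and format checks over an already-extracted document. It is not the project's detection logic (the fine-tuned model produces the flags); the function names, the `format_anomaly` category string, the `medium` severity, and the half-cent tolerance are all assumptions for illustration.

```python
from datetime import date
from math import isclose

TOLERANCE = 0.005  # assumed rounding tolerance: half a cent


def flag(category: str, field: str, severity: str, description: str) -> dict:
    """Build a flag dict in the same shape as the output schema's flags."""
    return {"category": category, "field": field,
            "severity": severity, "description": description}


def check_document(doc: dict) -> list[dict]:
    """Return flags for simple arithmetic and format problems."""
    flags = []
    common = doc.get("common", {})
    extra = doc.get("type_specific", {})

    for i, item in enumerate(doc.get("line_items", [])):
        # Arithmetic: each line amount should equal quantity * unit_price.
        expected = item["quantity"] * item["unit_price"]
        if not isclose(item["amount"], expected, abs_tol=TOLERANCE):
            flags.append(flag(
                "arithmetic_error", f"line_items[{i}].amount", "high",
                f"Amount {item['amount']:.2f} != quantity * unit_price ({expected:.2f})"))
        # Format: quantities should not be negative.
        if item["quantity"] < 0:
            flags.append(flag("format_anomaly", f"line_items[{i}].quantity",
                              "medium", "Negative quantity"))

    # Arithmetic: total should equal subtotal + tax when all three are present.
    if "total_amount" in common and "subtotal" in extra and "tax_amount" in extra:
        expected_total = extra["subtotal"] + extra["tax_amount"]
        if not isclose(common["total_amount"], expected_total, abs_tol=TOLERANCE):
            flags.append(flag("arithmetic_error", "total_amount", "high",
                              "Total does not equal subtotal + tax"))

    # Format: a date, if present, should parse as an ISO date (YYYY-MM-DD).
    raw_date = common.get("date")
    if raw_date is not None:
        try:
            date.fromisoformat(raw_date)
        except (TypeError, ValueError):
            flags.append(flag("format_anomaly", "date", "medium",
                              "Date is not a valid ISO date"))
    return flags
```

Deterministic checks like these could also serve as a cross-check on the model's own flags, since arithmetic is something rules verify more reliably than a language model.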

📊 Results: Base vs Fine-Tuned

Results will be filled in after training. Run the evaluation script (evaluation/evaluate.py) to generate them.

Metric	Base Qwen 2.5 7B	Fine-Tuned	Improvement
L1: Valid JSON Rate	TBD	TBD	TBD
L2: Schema Compliance	TBD	TBD	TBD
L3: Field Extraction F1	TBD	TBD	TBD
L4: Anomaly Detection F1	TBD	TBD	TBD
L5: End-to-End Success	TBD	TBD	TBD

Evaluated on 30 held-out test documents
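The README does not define how the F1 metrics are computed, so the following is only one plausible reading of "Field Extraction F1": flatten the predicted and expected JSON into (path, value) pairs and score exact matches. The helper names `flatten` and `field_f1` are hypothetical and may differ from what evaluation/evaluate.py does.

```python
def flatten(obj, prefix=""):
    """Yield (path, value) pairs for every leaf in nested dicts and lists."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from flatten(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            yield from flatten(value, f"{prefix}[{i}]")
    else:
        yield prefix, obj


def field_f1(predicted: dict, expected: dict) -> float:
    """Exact-match F1 over flattened (path, value) pairs."""
    pred = set(flatten(predicted))
    gold = set(flatten(expected))
    true_pos = len(pred & gold)
    if true_pos == 0:
        return 0.0  # also covers the case where either side is empty
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Exact matching is strict: a correct amount with the wrong currency counts as one hit and one miss. A real evaluation might normalize dates and numbers before comparing.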

🏗️ Architecture

PDF Upload → PyPDF2 Text Extraction → Fine-Tuned Qwen 2.5 7B → Pydantic Validation → JSON + Flags
                                                                      ↓ (if invalid)
                                                                  Retry (up to 3x)
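The validate-then-retry step in the diagram can be sketched as a small loop. This is an illustration under assumptions, not the code in src/extractor.py: `generate` stands in for the model call, `validate` stands in for the Pydantic check and is expected to raise `ValueError` on bad input, "up to 3x" is read as three attempts in total, and feeding the error message back into the prompt is a guess at how a retry might be steered.

```python
import json
from typing import Callable


def extract_with_retry(
    text: str,
    generate: Callable[[str], str],
    validate: Callable[[dict], dict],
    max_attempts: int = 3,
) -> dict:
    """Call the model until its output parses and validates, or attempts run out."""
    prompt = text
    last_error = None
    for _ in range(max_attempts):
        raw = generate(prompt)
        try:
            # json.JSONDecodeError is a subclass of ValueError, so one
            # except clause covers both parsing and validation failures.
            return validate(json.loads(raw))
        except ValueError as exc:
            last_error = exc
            prompt = (f"{text}\n\nPrevious output was invalid ({exc}). "
                      "Return only valid JSON.")
    raise RuntimeError(f"No valid output after {max_attempts} attempts") from last_error
```

Passing the model and validator in as callables keeps the loop testable without a GPU: a stub that returns canned strings is enough to exercise the retry path.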

🛠️ Tech Stack

Component	Technology
Base Model	Qwen 2.5 7B Instruct (4-bit quantized)
Fine-Tuning	QLoRA (Rank=16, Alpha=32) via Unsloth
Training Compute	Kaggle T4 GPU (free)
Output Validation	Pydantic v2 with retry logic
Web Interface	Gradio on HuggingFace Spaces
Data Strategy	Hybrid: real documents + synthetic anomaly injection
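For orientation, the QLoRA row above might translate into an Unsloth setup along these lines. Only the rank (16), alpha (32), and 4-bit loading come from the table; the Hub model ID, sequence length, target modules, and remaining arguments are assumptions, and training/train.py may be configured differently. With alpha = 32 and rank = 16, the LoRA update is scaled by alpha / rank = 2.

```python
# Values stated in the Tech Stack table.
LORA_RANK = 16
LORA_ALPHA = 32

# Assumptions for illustration only.
BASE_MODEL = "Qwen/Qwen2.5-7B-Instruct"  # assumed Hub ID for the base model
MAX_SEQ_LENGTH = 2048                    # assumed context length
TARGET_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj",
                  "gate_proj", "up_proj", "down_proj"]


def load_qlora_model():
    """Load the 4-bit base model and attach LoRA adapters.

    Requires a CUDA GPU and the unsloth package, so the import is done
    lazily and nothing heavy happens until this function is called.
    """
    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=BASE_MODEL,
        max_seq_length=MAX_SEQ_LENGTH,
        load_in_4bit=True,
    )
    model = FastLanguageModel.get_peft_model(
        model,
        r=LORA_RANK,
        lora_alpha=LORA_ALPHA,
        target_modules=TARGET_MODULES,
        lora_dropout=0.0,
        bias="none",
    )
    return model, tokenizer
```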

📁 Project Structure

financial-doc-extractor/
├── src/                    # Core library
│   ├── schema.py          # Pydantic models (Option C schema)
│   ├── pdf_reader.py      # PDF text extraction
│   ├── extractor.py       # Inference pipeline + retry
│   └── validator.py       # Validation helpers
├── scripts/               # Data generation pipeline
│   ├── generate_synthetic.py
│   ├── inject_anomalies.py
│   └── prepare_training_data.py
├── training/              # Model training
│   └── train.py           # Unsloth + QLoRA training script
├── evaluation/            # Evaluation framework
│   └── evaluate.py        # 5-level metric evaluation
├── app/                   # Web application
│   └── app.py             # Gradio interface
└── data/                  # Training & test data
    ├── training/
    └── test/

🚀 Quick Start

1. Generate Training Data

pip install -r requirements.txt
python scripts/generate_synthetic.py --count 150
python scripts/inject_anomalies.py --anomaly-rate 0.4
python scripts/prepare_training_data.py --test-size 30
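As a rough picture of what the `--anomaly-rate 0.4` step implies, the sketch below corrupts about 40% of clean documents and records the matching flag as the training label. It is a simplified stand-in for scripts/inject_anomalies.py, not its actual code: it injects only one anomaly type, works on the label JSON alone, and the function names, perturbation range, and seed handling are assumptions.

```python
import copy
import random


def inject_arithmetic_error(doc: dict, rng: random.Random) -> dict:
    """Return a copy of doc with an inflated total and a matching flag."""
    corrupted = copy.deepcopy(doc)
    original = corrupted["common"]["total_amount"]
    # Assumed perturbation: add between 10 and 500 currency units.
    corrupted["common"]["total_amount"] = round(original + rng.uniform(10, 500), 2)
    corrupted.setdefault("flags", []).append({
        "category": "arithmetic_error",
        "field": "total_amount",
        "severity": "high",
        "description": "Total does not equal subtotal + tax",
    })
    return corrupted


def inject_anomalies(docs: list[dict], anomaly_rate: float, seed: int = 0) -> list[dict]:
    """Corrupt each document with probability anomaly_rate; inputs are not mutated."""
    rng = random.Random(seed)  # seeded so a dataset can be regenerated exactly
    return [
        inject_arithmetic_error(doc, rng) if rng.random() < anomaly_rate
        else copy.deepcopy(doc)
        for doc in docs
    ]
```

In a full pipeline the same corruption would also have to appear in the rendered document text, so that the model learns to detect the error from the input rather than from the label.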

2. Train on Kaggle

Upload data/training/train.jsonl to a Kaggle Dataset, then run training/train.py in a Kaggle Notebook with T4 GPU enabled.

3. Run the Demo

# UI testing (no model required)
python app/app.py --demo-mode

# With model (after training)
MODEL_REPO=your-username/financial-doc-extractor-qwen2.5-7b python app/app.py

📐 Output Schema (Option C: Common Core + Type Extensions)

{
  "common": {
    "document_type": "invoice",
    "date": "2024-03-15",
    "issuer": {"name": "Acme Corp", "address": "123 Business Ave"},
    "recipient": {"name": "Widget Inc", "address": "456 Commerce St"},
    "total_amount": 1728.00,
    "currency": "USD"
  },
  "line_items": [
    {"description": "Steel Bolts", "quantity": 500, "unit_price": 2.50, "amount": 1250.00}
  ],
  "type_specific": {
    "invoice_number": "INV-2024-0847",
    "due_date": "2024-04-14",
    "payment_terms": "Net 30",
    "tax_amount": 128.00,
    "subtotal": 1600.00
  },
  "flags": [
    {
      "category": "arithmetic_error",
      "field": "subtotal",
      "severity": "high",
      "description": "Subtotal does not equal the sum of line item amounts"
    }
  ],
  "confidence_score": 0.92
}
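The "Common Core + Type Extensions" idea can be illustrated with a small structural check that validates the shared `common` block and then dispatches on `document_type` for the extension. The project does this with Pydantic models in src/schema.py; the plain-Python version below is only a sketch, and it assumes every field shown in the example is required. Only the invoice extension is listed because it is the only one the example shows.

```python
COMMON_FIELDS = {"document_type", "date", "issuer", "recipient",
                 "total_amount", "currency"}

# Field names taken from the invoice example above. Other document types
# (PO, receipt, bank statement) would need their own entries, which the
# README does not list.
TYPE_SPECIFIC_FIELDS = {
    "invoice": {"invoice_number", "due_date", "payment_terms",
                "tax_amount", "subtotal"},
}


def schema_problems(doc: dict) -> list[str]:
    """Return human-readable problems; an empty list means the shape is valid."""
    problems = []
    common = doc.get("common", {})
    missing_common = COMMON_FIELDS - set(common)
    problems += [f"common.{name} is missing" for name in sorted(missing_common)]

    # Dispatch on document_type to pick the extension's field set.
    doc_type = common.get("document_type")
    if doc_type in TYPE_SPECIFIC_FIELDS:
        missing = TYPE_SPECIFIC_FIELDS[doc_type] - set(doc.get("type_specific", {}))
        problems += [f"type_specific.{name} is missing" for name in sorted(missing)]
    elif doc_type is not None:
        problems.append(f"unknown document_type: {doc_type!r}")
    return problems
```

Keeping the shared fields in one place means a new document type only adds an entry to the extension table instead of duplicating the common block.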

🔮 Future Extensions

OCR Integration — Tesseract/EasyOCR for scanned document support
Multi-document Analysis — Cross-reference invoices with POs
Streaming Inference — Real-time extraction for large batches
Fine-grained Evaluation — Per-field accuracy breakdown by document type

📜 License

MIT

👤 Author

Built by Vaibhav Patil (vaibhavofficial413@gmail.com, LinkedIn: https://www.linkedin.com/in/vaibhav-patil225/) as a demonstration of production ML engineering skills.