title: Financial Intelligence AI
emoji: 💸
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false

🔍 Financial Document Extractor & Anomaly Detector

A fine-tuned Qwen 2.5 7B model that extracts structured JSON from financial documents and flags anomalies in five categories. Built in 20 hours as a practical demonstration of production-grade ML engineering.

🎯 What It Does

| Input | Output |
|---|---|
| Raw financial PDF (invoice, PO, receipt, bank statement) | Structured JSON + anomaly flags |

Anomaly Detection (5 Categories)

  • 🔴 Arithmetic Errors: Totals that don't add up
  • 🟡 Missing Fields: Required information absent from the document
  • 🔵 Format Anomalies: Invalid dates, negative quantities, duplicates
  • 🟡 Business Logic: Round-number fraud indicators, extreme amounts
  • 🔴 Cross-Field Inconsistencies: Mismatched PO references, currency conflicts
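In this project the fine-tuned model emits the flags itself, and the README does not show any rule-based checker. Purely to make two of the categories concrete, here is a minimal sketch of deterministic checks for arithmetic errors and format anomalies. The function names, the `format_anomaly` category string, the `medium` severity, and the 0.01 tolerance are assumptions for illustration; only `arithmetic_error` and `high` appear in the schema example further down.

```python
from datetime import date


def _flag(category: str, field: str, severity: str, description: str) -> dict:
    """Build a flag in the shape used by the output schema."""
    return {"category": category, "field": field,
            "severity": severity, "description": description}


def check_document(doc: dict, tolerance: float = 0.01) -> list[dict]:
    """Return illustrative flags for arithmetic and format anomalies."""
    flags = []
    common = doc.get("common", {})
    extra = doc.get("type_specific", {})
    total = common.get("total_amount")
    subtotal, tax = extra.get("subtotal"), extra.get("tax_amount")

    # Arithmetic error: total should equal subtotal + tax, within rounding.
    if None not in (total, subtotal, tax) and abs(total - (subtotal + tax)) > tolerance:
        flags.append(_flag("arithmetic_error", "total_amount", "high",
                           "Total does not equal subtotal + tax"))

    # Format anomaly: the date must be a real ISO calendar date.
    raw_date = common.get("date")
    if raw_date is not None:
        try:
            date.fromisoformat(raw_date)
        except (ValueError, TypeError):
            flags.append(_flag("format_anomaly", "date", "medium",  # assumed names
                               f"Invalid date: {raw_date!r}"))

    # Format anomaly: quantities should not be negative.
    for index, item in enumerate(doc.get("line_items", [])):
        quantity = item.get("quantity")
        if isinstance(quantity, (int, float)) and quantity < 0:
            flags.append(_flag("format_anomaly", f"line_items[{index}].quantity",
                               "medium", "Negative quantity"))
    return flags
```

Checks like these could also serve as a cross-check on the model's own flags, though nothing in the repository layout confirms that they are used that way.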

📊 Results: Base vs Fine-Tuned

Results are populated after training. Run the evaluation script (evaluation/evaluate.py) to generate them.

| Metric | Base Qwen 2.5 7B | Fine-Tuned | Improvement |
|---|---|---|---|
| L1: Valid JSON Rate | TBD | TBD | TBD |
| L2: Schema Compliance | TBD | TBD | TBD |
| L3: Field Extraction F1 | TBD | TBD | TBD |
| L4: Anomaly Detection F1 | TBD | TBD | TBD |
| L5: End-to-End Success | TBD | TBD | TBD |

Evaluated on 30 held-out test documents
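The README names the five metric levels but does not define them. As a rough sketch of how two of them might be computed, the code below treats L1 as the fraction of raw outputs that parse as JSON and L3 as F1 over exact-match (field, value) pairs. Both definitions and both function names are assumptions, not the contents of evaluation/evaluate.py.

```python
import json


def valid_json_rate(outputs: list[str]) -> float:
    """Fraction of raw model outputs that parse as JSON (assumed meaning of L1)."""
    def parses(text: str) -> bool:
        try:
            json.loads(text)
            return True
        except json.JSONDecodeError:
            return False

    return sum(parses(text) for text in outputs) / len(outputs) if outputs else 0.0


def field_f1(predicted: dict, expected: dict) -> float:
    """F1 over flat (field, value) pairs, compared by exact match.

    Both dicts must be flat with hashable values (strings, numbers).
    """
    pred_pairs, gold_pairs = set(predicted.items()), set(expected.items())
    true_positives = len(pred_pairs & gold_pairs)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_pairs)
    recall = true_positives / len(gold_pairs)
    return 2 * precision * recall / (precision + recall)
```

Exact matching is strict: a correct amount formatted differently would count as a miss, so a real evaluation may need to normalize values first.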

🏗️ Architecture

PDF Upload → PyPDF2 Text Extraction → Fine-Tuned Qwen 2.5 7B → Pydantic Validation → JSON + Flags
                                                                      ↓ (if invalid)
                                                                  Retry (up to 3x)
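The validate-then-retry step in the diagram can be sketched as a small loop. This is not the code in src/extractor.py: `extract_with_retry`, `generate`, and `validate` are hypothetical names, and the sketch assumes "up to 3x" means one initial attempt plus at most three retries. The validator is passed in as a callable that raises `ValueError` on bad output, so the sketch stays independent of any particular schema library.

```python
import json
from typing import Callable


def extract_with_retry(
    text: str,
    generate: Callable[[str], str],
    validate: Callable[[dict], None],
    max_retries: int = 3,
) -> dict:
    """Run the model, then retry up to `max_retries` times on invalid output."""
    last_error = None
    for _attempt in range(1 + max_retries):
        raw = generate(text)  # hypothetical model call returning a JSON string
        try:
            parsed = json.loads(raw)
            validate(parsed)  # expected to raise ValueError on schema violations
            return parsed
        except ValueError as err:  # json.JSONDecodeError is a ValueError subclass
            last_error = err
    raise RuntimeError(f"No valid output after {max_retries} retries") from last_error
```

A common refinement, not shown here, is to feed the validation error back into the prompt on each retry so the model can correct itself.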

🛠️ Tech Stack

| Component | Technology |
|---|---|
| Base Model | Qwen 2.5 7B Instruct (4-bit quantized) |
| Fine-Tuning | QLoRA (Rank=16, Alpha=32) via Unsloth |
| Training Compute | Kaggle T4 GPU (free) |
| Output Validation | Pydantic v2 with retry logic |
| Web Interface | Gradio on HuggingFace Spaces |
| Data Strategy | Hybrid: real documents + synthetic anomaly injection |

📁 Project Structure

financial-doc-extractor/
├── src/                    # Core library
│   ├── schema.py          # Pydantic models (Option C schema)
│   ├── pdf_reader.py      # PDF text extraction
│   ├── extractor.py       # Inference pipeline + retry
│   └── validator.py       # Validation helpers
├── scripts/               # Data generation pipeline
│   ├── generate_synthetic.py
│   ├── inject_anomalies.py
│   └── prepare_training_data.py
├── training/              # Model training
│   └── train.py           # Unsloth + QLoRA training script
├── evaluation/            # Evaluation framework
│   └── evaluate.py        # 5-level metric evaluation
├── app/                   # Web application
│   └── app.py             # Gradio interface
└── data/                  # Training & test data
    ├── training/
    └── test/

🚀 Quick Start

1. Generate Training Data

pip install -r requirements.txt
python scripts/generate_synthetic.py --count 150
python scripts/inject_anomalies.py --anomaly-rate 0.4
python scripts/prepare_training_data.py --test-size 30
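The `--anomaly-rate 0.4` flag suggests that roughly 40% of the synthetic documents receive an injected anomaly. The README does not show how scripts/inject_anomalies.py does this, so the following is only a sketch of the idea for one category: corrupt the total and attach the matching ground-truth flag. The function names, the offsets, and the seeding scheme are all assumptions.

```python
import copy
import random


def inject_arithmetic_error(doc: dict, rng: random.Random) -> dict:
    """Return a copy of `doc` with a corrupted total and a matching flag."""
    corrupted = copy.deepcopy(doc)
    offset = rng.choice([-100.0, -10.0, 10.0, 100.0])  # illustrative, never zero
    corrupted["common"]["total_amount"] = round(doc["common"]["total_amount"] + offset, 2)
    corrupted.setdefault("flags", []).append({
        "category": "arithmetic_error",
        "field": "total_amount",
        "severity": "high",
        "description": "Total does not equal subtotal + tax",
    })
    return corrupted


def inject_anomalies(docs: list[dict], anomaly_rate: float = 0.4, seed: int = 0) -> list[dict]:
    """Corrupt each document independently with probability `anomaly_rate`."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    return [
        inject_arithmetic_error(doc, rng) if rng.random() < anomaly_rate
        else copy.deepcopy(doc)
        for doc in docs
    ]
```

Because the corruption and the label are created together, every injected anomaly comes with a known-correct flag for training and evaluation.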

2. Train on Kaggle

Upload data/training/train.jsonl to a Kaggle Dataset, then run training/train.py in a Kaggle Notebook with T4 GPU enabled.

3. Run the Demo

# UI testing (no model required)
python app/app.py --demo-mode

# With model (after training)
MODEL_REPO=your-username/financial-doc-extractor-qwen2.5-7b python app/app.py

Output Schema (Option C: Common Core + Type Extensions)

{
  "common": {
    "document_type": "invoice",
    "date": "2024-03-15",
    "issuer": {"name": "Acme Corp", "address": "123 Business Ave"},
    "recipient": {"name": "Widget Inc", "address": "456 Commerce St"},
    "total_amount": 1728.00,
    "currency": "USD"
  },
  "line_items": [
    {"description": "Steel Bolts", "quantity": 500, "unit_price": 2.50, "amount": 1250.00}
  ],
  "type_specific": {
    "invoice_number": "INV-2024-0847",
    "due_date": "2024-04-14",
    "payment_terms": "Net 30",
    "tax_amount": 128.00,
    "subtotal": 1600.00
  },
  "flags": [
    {
      "category": "arithmetic_error",
      "field": "subtotal",
      "severity": "high",
      "description": "Subtotal (1600.00) does not equal the sum of line item amounts (1250.00)"
    }
  ],
  "confidence_score": 0.92
}
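For readers who want to validate this structure themselves, here is one way the schema might be expressed with Pydantic v2. It is a sketch inferred from the JSON example, not the contents of src/schema.py. The class names, which fields are optional, the three severity levels, and the 0 to 1 bound on `confidence_score` are assumptions, and `type_specific` is left as a plain dict because its per-type fields are not documented here.

```python
from typing import Literal, Optional

from pydantic import BaseModel, Field, ValidationError


class Party(BaseModel):
    name: str
    address: Optional[str] = None  # assumed optional


class Common(BaseModel):
    document_type: str
    date: str
    issuer: Party
    recipient: Party
    total_amount: float
    currency: str


class LineItem(BaseModel):
    description: str
    quantity: float
    unit_price: float
    amount: float


class Flag(BaseModel):
    category: str
    field: str
    severity: Literal["low", "medium", "high"]  # only "high" appears in the example
    description: str


class ExtractionResult(BaseModel):
    common: Common
    line_items: list[LineItem] = Field(default_factory=list)
    type_specific: dict = Field(default_factory=dict)
    flags: list[Flag] = Field(default_factory=list)
    confidence_score: float = Field(ge=0.0, le=1.0)  # assumed range


def parse_result(raw_json: str) -> Optional[ExtractionResult]:
    """Return the validated result, or None if the JSON violates the schema."""
    try:
        return ExtractionResult.model_validate_json(raw_json)
    except ValidationError:
        return None
```

`parse_result` returns `None` on failure only to keep the example short; a pipeline with retries would more likely let the `ValidationError` propagate so its message can inform the next attempt.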

🔮 Future Extensions

  • OCR Integration: Tesseract/EasyOCR for scanned document support
  • Multi-document Analysis: Cross-reference invoices with POs
  • Streaming Inference: Real-time extraction for large batches
  • Fine-grained Evaluation: Per-field accuracy breakdown by document type

📜 License

MIT

👤 Author

Built by Vaibhav Patil (vaibhavofficial413@gmail.com, LinkedIn: https://www.linkedin.com/in/vaibhav-patil225/) as a demonstration of production ML engineering skills.