title: Financial Intelligence AI
emoji: 💸
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
Financial Document Extractor & Anomaly Detector
A fine-tuned Qwen 2.5 7B model that extracts structured JSON from financial documents and intelligently flags anomalies. Built in 20 hours as a practical demonstration of production-grade ML engineering.
🎯 What It Does
| Input | Output |
|---|---|
| Raw financial PDF (invoice, PO, receipt, bank statement) | Structured JSON + anomaly flags |
Anomaly Detection (5 Categories)
- 🔴 Arithmetic Errors: totals that don't add up
- 🟡 Missing Fields: required information absent from the document
- 🔵 Format Anomalies: invalid dates, negative quantities, duplicates
- 🟡 Business Logic: round-number fraud indicators, extreme amounts
- 🔴 Cross-Field Inconsistencies: mismatched PO references, currency conflicts
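Of the five categories, arithmetic errors are the most mechanical to check. The sketch below is one way such a check could look; it is not taken from the project. The helper name `check_arithmetic` and the 0.01 tolerance are assumptions made for illustration, while the field names and the flag shape follow the Output Schema section of this README.

```python
from decimal import Decimal


def _money(value) -> Decimal:
    # Convert via str() so floats like 2.5 do not drag in binary rounding noise.
    return Decimal(str(value))


def check_arithmetic(doc: dict, tolerance: str = "0.01") -> list:
    """Hypothetical helper: return arithmetic_error flags for one document."""
    tol = Decimal(tolerance)
    flags = []

    # Line level: quantity * unit_price should match amount.
    for i, item in enumerate(doc.get("line_items", [])):
        expected = _money(item["quantity"]) * _money(item["unit_price"])
        if abs(expected - _money(item["amount"])) > tol:
            flags.append({
                "category": "arithmetic_error",
                "field": f"line_items[{i}].amount",
                "severity": "high",
                "description": "Amount does not equal quantity x unit_price",
            })

    # Document level: total should match subtotal + tax when both are present.
    extra = doc.get("type_specific", {})
    if "subtotal" in extra and "tax_amount" in extra:
        expected = _money(extra["subtotal"]) + _money(extra["tax_amount"])
        if abs(expected - _money(doc["common"]["total_amount"])) > tol:
            flags.append({
                "category": "arithmetic_error",
                "field": "total_amount",
                "severity": "high",
                "description": "Total does not equal subtotal + tax",
            })

    return flags


example = {
    "common": {"total_amount": 1800.00},
    "line_items": [{"quantity": 500, "unit_price": 2.50, "amount": 1250.00}],
    "type_specific": {"subtotal": 1600.00, "tax_amount": 128.00},
}
print([flag["field"] for flag in check_arithmetic(example)])  # ['total_amount']
```

Rules like this are deterministic, so they could serve to label synthetic training documents or to cross-check model output. Whether the project uses them either way is not stated here.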
Results: Base vs Fine-Tuned
The table below is filled in after training. Run evaluation/evaluate.py to generate the numbers.
| Metric | Base Qwen 2.5 7B | Fine-Tuned | Improvement |
|---|---|---|---|
| L1: Valid JSON Rate | TBD | TBD | TBD |
| L2: Schema Compliance | TBD | TBD | TBD |
| L3: Field Extraction F1 | TBD | TBD | TBD |
| L4: Anomaly Detection F1 | TBD | TBD | TBD |
| L5: End-to-End Success | TBD | TBD | TBD |
Evaluated on 30 held-out test documents
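To make the first and third metric levels concrete, here is a minimal sketch of how a valid-JSON rate and a field-level F1 could be computed. How evaluation/evaluate.py actually scores fields is not shown in this README; the exact-match comparison on flattened `(path, value)` pairs and the function names below are assumptions.

```python
import json


def valid_json_rate(outputs: list) -> float:
    """L1: fraction of raw model outputs (strings) that parse as JSON."""
    if not outputs:
        return 0.0
    parsed = 0
    for text in outputs:
        try:
            json.loads(text)
            parsed += 1
        except json.JSONDecodeError:
            pass
    return parsed / len(outputs)


def flatten(obj, prefix: str = "") -> set:
    """Turn nested dicts and lists into a set of (path, leaf_value) pairs."""
    pairs = set()
    if isinstance(obj, dict):
        for key, value in obj.items():
            pairs |= flatten(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            pairs |= flatten(value, f"{prefix}[{i}]")
    else:
        pairs.add((prefix, obj))
    return pairs


def field_f1(predicted: dict, expected: dict) -> float:
    """L3 (assumed definition): F1 over exactly matching (path, value) pairs."""
    pred, gold = flatten(predicted), flatten(expected)
    if not pred and not gold:
        return 1.0
    true_positives = len(pred & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Exact matching is strict: a date extracted as `2024-3-15` instead of `2024-03-15` counts as a miss. A real evaluator might normalize values before comparing.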
🏗️ Architecture
PDF Upload → PyPDF2 Text Extraction → Fine-Tuned Qwen 2.5 7B → Pydantic Validation → JSON + Flags
                                                                        ↓ (if invalid)
                                                                 Retry (up to 3x)
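The validate-and-retry step in the diagram can be sketched as a small loop. This is an illustration under stated assumptions, not the code in src/extractor.py: `extract_with_retry`, `generate`, and `validate` are hypothetical names, "up to 3x" is read as three attempts in total, and feeding the previous error back into the prompt is a design choice added here.

```python
import json
from typing import Callable, Optional


def extract_with_retry(
    text: str,
    generate: Callable[[str], str],
    validate: Callable[[dict], dict],
    max_attempts: int = 3,
) -> Optional[dict]:
    """Call `generate` until its output parses and validates, or attempts run out.

    `generate` stands in for the fine-tuned model; `validate` is any callable
    that returns the checked document or raises ValueError.
    """
    last_error = None
    for _ in range(max_attempts):
        prompt = text
        if last_error is not None:
            # Assumption: tell the model why the previous attempt was rejected.
            prompt = f"{text}\n\nPrevious output was rejected: {last_error}"
        raw = generate(prompt)
        try:
            return validate(json.loads(raw))
        except ValueError as err:  # json.JSONDecodeError is a ValueError subclass
            last_error = str(err)
    return None  # caller decides how to report a document that never validated
```

Because Pydantic v2's `ValidationError` derives from `ValueError`, a thin wrapper around a model's `model_validate` method should slot in as `validate` without changes to the loop.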
🛠️ Tech Stack
| Component | Technology |
|---|---|
| Base Model | Qwen 2.5 7B Instruct (4-bit quantized) |
| Fine-Tuning | QLoRA (Rank=16, Alpha=32) via Unsloth |
| Training Compute | Kaggle T4 GPU (free) |
| Output Validation | Pydantic v2 with retry logic |
| Web Interface | Gradio on HuggingFace Spaces |
| Data Strategy | Hybrid: real documents + synthetic anomaly injection |
Project Structure
financial-doc-extractor/
├── src/                          # Core library
│   ├── schema.py                 # Pydantic models (Option C schema)
│   ├── pdf_reader.py             # PDF text extraction
│   ├── extractor.py              # Inference pipeline + retry
│   └── validator.py              # Validation helpers
├── scripts/                      # Data generation pipeline
│   ├── generate_synthetic.py
│   ├── inject_anomalies.py
│   └── prepare_training_data.py
├── training/                     # Model training
│   └── train.py                  # Unsloth + QLoRA training script
├── evaluation/                   # Evaluation framework
│   └── evaluate.py               # 5-level metric evaluation
├── app/                          # Web application
│   └── app.py                    # Gradio interface
└── data/                         # Training & test data
    ├── training/
    └── test/
Quick Start
1. Generate Training Data
pip install -r requirements.txt
python scripts/generate_synthetic.py --count 150
python scripts/inject_anomalies.py --anomaly-rate 0.4
python scripts/prepare_training_data.py --test-size 30
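To illustrate what an `--anomaly-rate 0.4` step might do, the sketch below corrupts roughly that share of clean documents and attaches the matching flag, so each synthetic example carries its own label. The internals of scripts/inject_anomalies.py are not shown in this README; the function names, the seed parameter, and the choice to corrupt only `total_amount` are assumptions.

```python
import copy
import random


def inject_arithmetic_error(doc: dict, rng: random.Random) -> dict:
    """Return a copy of `doc` whose total is shifted, plus the flag describing it."""
    bad = copy.deepcopy(doc)
    offset = rng.randint(1, 500)  # whole currency units, never zero
    bad["common"]["total_amount"] = round(bad["common"]["total_amount"] + offset, 2)
    bad.setdefault("flags", []).append({
        "category": "arithmetic_error",
        "field": "total_amount",
        "severity": "high",
        "description": "Total does not equal subtotal + tax",
    })
    return bad


def inject_anomalies(docs: list, anomaly_rate: float = 0.4, seed: int = 0) -> list:
    """Corrupt each document with probability `anomaly_rate`; inputs are not mutated."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    result = []
    for doc in docs:
        if rng.random() < anomaly_rate:
            result.append(inject_arithmetic_error(doc, rng))
        else:
            result.append(copy.deepcopy(doc))
    return result
```

A fuller version would pick among all five anomaly categories rather than a single corruption.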
2. Train on Kaggle
Upload data/training/train.jsonl to a Kaggle Dataset, then run training/train.py in a Kaggle Notebook with a T4 GPU enabled.
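For readers unfamiliar with the file being uploaded: a `.jsonl` file holds one JSON object per line. The sketch below shows a seeded hold-out split followed by a JSONL writer. It is not the code in scripts/prepare_training_data.py; the function names, the default seed, and the record contents are assumptions.

```python
import json
import random
from pathlib import Path


def split_train_test(records: list, test_size: int = 30, seed: int = 42):
    """Shuffle deterministically and hold out `test_size` records for evaluation."""
    if test_size >= len(records):
        raise ValueError("test_size must be smaller than the number of records")
    shuffled = records[:]  # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]


def write_jsonl(records: list, path: str) -> None:
    """Write one JSON object per line, creating parent directories if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
```

A typical call would be `train, test = split_train_test(records)` followed by `write_jsonl(train, "data/training/train.jsonl")`.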
3. Run the Demo
# UI testing (no model required)
python app/app.py --demo-mode
# With model (after training)
MODEL_REPO=your-username/financial-doc-extractor-qwen2.5-7b python app/app.py
Output Schema (Option C: Common Core + Type Extensions)
{
  "common": {
    "document_type": "invoice",
    "date": "2024-03-15",
    "issuer": {"name": "Acme Corp", "address": "123 Business Ave"},
    "recipient": {"name": "Widget Inc", "address": "456 Commerce St"},
    "total_amount": 1728.00,
    "currency": "USD"
  },
  "line_items": [
    {"description": "Steel Bolts", "quantity": 500, "unit_price": 2.50, "amount": 1250.00}
  ],
  "type_specific": {
    "invoice_number": "INV-2024-0847",
    "due_date": "2024-04-14",
    "payment_terms": "Net 30",
    "tax_amount": 128.00,
    "subtotal": 1600.00
  },
  "flags": [
    {
      "category": "arithmetic_error",
      "field": "subtotal",
      "severity": "high",
      "description": "Subtotal does not equal the sum of line item amounts"
    }
  ],
  "confidence_score": 0.92
}
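As a rough illustration of what schema compliance means for this shape, the following standard-library sketch reports missing keys. It is a stand-in for the Pydantic v2 models in src/schema.py, which are not reproduced here. The required-key sets are inferred from the example above, and treating `confidence_score` as a value in [0, 1] is an assumption.

```python
REQUIRED_TOP = {"common", "line_items", "type_specific", "flags", "confidence_score"}
REQUIRED_COMMON = {"document_type", "date", "issuer", "recipient", "total_amount", "currency"}
REQUIRED_FLAG = {"category", "field", "severity", "description"}


def schema_errors(doc: dict) -> list:
    """Hypothetical helper: list shape problems; an empty list means none were found."""
    errors = [f"missing top-level key: {key}" for key in sorted(REQUIRED_TOP - set(doc))]

    common = doc.get("common", {})
    errors += [f"missing common.{key}" for key in sorted(REQUIRED_COMMON - set(common))]

    for i, flag in enumerate(doc.get("flags", [])):
        errors += [f"missing flags[{i}].{key}" for key in sorted(REQUIRED_FLAG - set(flag))]

    score = doc.get("confidence_score")
    if isinstance(score, (int, float)) and not 0.0 <= score <= 1.0:
        errors.append("confidence_score outside [0, 1]")

    return errors
```

Pydantic adds type coercion and structured error messages on top of presence checks like these, which is presumably why the project validates with it.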
🔮 Future Extensions
- OCR Integration: Tesseract/EasyOCR for scanned document support
- Multi-document Analysis: cross-reference invoices with POs
- Streaming Inference: real-time extraction for large batches
- Fine-grained Evaluation: per-field accuracy breakdown by document type
License
MIT
👤 Author
Built by Vaibhav Patil (vaibhavofficial413@gmail.com, LinkedIn: https://www.linkedin.com/in/vaibhav-patil225/) as a demonstration of production ML engineering skills.