Update README.md

7873060 verified 19 days ago

8.34 kB

	---
	license: apache-2.0
	language:
	- en
	base_model: Qwen/Qwen2.5-7B-Instruct
	pipeline_tag: text-generation
	tags:
	- finance
	- document-parsing
	- qlora
	- qwen2.5
	- invoice
	- sap
	- structured-extraction
	- json
	---

	# 🏦 Multi-Format Finance Document Parser

	A production-grade financial document parser fine-tuned on Qwen2.5-7B-Instruct using QLoRA (4-bit NF4 quantization). Given raw text from any financial document, it outputs structured JSON — ready for downstream processing, ERP integration, or analytics pipelines.

	---

	## 🚀 Live Demo

	👉 [Try it on HuggingFace Spaces](https://huggingface.co/spaces/ratulsur/finance-parser-demo)

	---

	## 📄 Supported Document Types

	\| Format \| Examples \|
	\|---\|---\|
	\| Invoice \| Vendor invoices, GST bills, service bills \|
	\| SAP Report \| ALV exports, FI vendor payment reports \|
	\| Income Statement \| P&L statements, quarterly earnings \|
	\| Balance Sheet \| Assets, liabilities, equity statements \|
	\| Bank Statement \| Transaction records, account summaries \|
	\| Purchase Order \| PO documents, procurement records \|
	\| SQL Result \| Query outputs from finance databases \|
	\| CSV / Excel \| Tabular finance data \|

	---

	## 🧠 Model Details

	\| Property \| Value \|
	\|---\|---\|
	\| Base model \| Qwen/Qwen2.5-7B-Instruct \|
	\| Model size \| 8B parameters \|
	\| Fine-tuning method \| QLoRA (PEFT) \|
	\| Quantization \| 4-bit NF4 + double quantization \|
	\| Compute dtype \| bfloat16 \|
	\| LoRA rank \| r=8, alpha=16 \|
	\| Max sequence length \| 512 tokens \|
	\| Training hardware \| L40S 48GB GPU (Lightning AI) \|
	\| Training time \| ~1 hour \|
	\| License \| Apache 2.0 \|

	---

	## 📊 Training Data

	\| Dataset \| Samples \| Type \|
	\|---\|---\|---\|
	\| [CORD-v2](https://huggingface.co/datasets/naver-clova-ix/cord-v2) \| 454 \| Real receipt images + structured JSON \|
	\| Synthetic invoices \| 300 \| Generated with realistic Indian/global vendors \|
	\| Synthetic SAP reports \| 100 \| ALV-style pipe-delimited exports \|
	\| Synthetic income statements \| 100 \| P&L with revenue, COGS, EBIT, net income \|
	\| Total \| 954 \| Train: 812 · Eval: 95 · Test: 47 \|

	---

	## ⚙️ Quantization Techniques

	\| Technique \| Purpose \|
	\|---\|---\|
	\| NF4 4-bit quantization \| Stores weights in 4-bit NormalFloat format — ~4x model size reduction \|
	\| Double quantization \| Quantizes the quantization constants — additional ~0.4 bits/param saving \|
	\| bfloat16 compute \| Full precision operations, 4-bit storage \|
	\| LoRA adapters (r=8) \| Only 0.5% of parameters trained — 99.5% frozen \|
	\| Paged AdamW 8-bit \| Optimizer state memory reduction \|
	\| Gradient checkpointing \| ~40% activation memory reduction \|

	---

	## 📤 Output Schema

	```json
	{
	"document_type": "invoice\|balance_sheet\|income_stmt\|sap_report\|sql_result\|bank_statement\|purchase_order",
	"vendor": "string or null",
	"client": "string or null",
	"date": "YYYY-MM-DD or null",
	"due_date": "YYYY-MM-DD or null",
	"document_id": "string or null",
	"currency": "USD\|EUR\|INR\|GBP\|...",
	"subtotal": "float or null",
	"tax_amount": "float or null",
	"tax_rate_pct": "float or null",
	"total_amount": "float or null",
	"line_items": [
	{
	"description": "string",
	"quantity": "float or null",
	"unit_price": "float or null",
	"amount": "float"
	}
	],
	"payment_terms": "string or null",
	"notes": "string or null",
	"metadata": {}
	}
	```

	---

	## 💻 Usage

	### Via HuggingFace Inference API

	```python
	import requests
	import json
	import re

	API_URL = "https://api-inference.huggingface.co/models/ratulsur/multi-format-finance-parser"
	HF_TOKEN = "hf_xxxxxxxxxxxx"

	SYSTEM_PROMPT = """You are a production financial document parser.
	Given raw text from any financial document, output ONLY a single valid JSON object.
	Schema: {document_type, vendor, client, date (YYYY-MM-DD), due_date, document_id,
	currency, subtotal, tax_amount, tax_rate_pct, total_amount,
	line_items:[{description,quantity,unit_price,amount}], payment_terms, notes, metadata}.
	All monetary values must be floats. Unknown fields → null. No explanation."""

	def parse_document(text: str) -> dict:
	prompt = (
	f"<\|im_start\|>system\n{SYSTEM_PROMPT}<\|im_end\|>\n"
	f"<\|im_start\|>user\nParse this financial document:\n\n{text}<\|im_end\|>\n"
	f"<\|im_start\|>assistant\n"
	)
	headers = {"Authorization": f"Bearer {HF_TOKEN}"}
	payload = {
	"inputs": prompt,
	"parameters": {
	"max_new_tokens": 512,
	"temperature": 0.05,
	"return_full_text": False,
	"do_sample": False,
	}
	}
	resp = requests.post(API_URL, headers=headers, json=payload, timeout=120)
	raw = resp.json()[0]["generated_text"].strip()
	raw = re.sub(r"```json\s\|```\s", "", raw).strip()
	return json.loads(raw)

	# Example
	invoice = """
	INVOICE
	Vendor: Tata Consultancy Services Ltd.
	Invoice No: TCS-2024-8821
	Date: 2024-11-15
	Service: Cloud Infrastructure Management INR 42,500.00
	GST @ 18%: INR 7,650.00
	TOTAL DUE: INR 50,150.00
	Payment Terms: Net 30
	"""

	result = parse_document(invoice)
	print(json.dumps(result, indent=2))
	```

	### Expected output

	```json
	{
	"document_type": "invoice",
	"vendor": "Tata Consultancy Services Ltd.",
	"client": null,
	"date": "2024-11-15",
	"due_date": null,
	"document_id": "TCS-2024-8821",
	"currency": "INR",
	"subtotal": 42500.0,
	"tax_amount": 7650.0,
	"tax_rate_pct": 18.0,
	"total_amount": 50150.0,
	"line_items": [
	{
	"description": "Cloud Infrastructure Management",
	"quantity": 1,
	"unit_price": 42500.0,
	"amount": 42500.0
	}
	],
	"payment_terms": "Net 30",
	"notes": null,
	"metadata": {}
	}
	```

	### Load locally with transformers

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
	import torch

	bnb_config = BitsAndBytesConfig(
	load_in_4bit=True,
	bnb_4bit_quant_type="nf4",
	bnb_4bit_use_double_quant=True,
	bnb_4bit_compute_dtype=torch.bfloat16,
	)

	model = AutoModelForCausalLM.from_pretrained(
	"ratulsur/multi-format-finance-parser",
	quantization_config=bnb_config,
	device_map="auto",
	trust_remote_code=True,
	)
	tokenizer = AutoTokenizer.from_pretrained(
	"ratulsur/multi-format-finance-parser",
	trust_remote_code=True,
	)
	```

	---

	## 🏗️ Training Setup

	```python
	# QLoRA config
	bnb_config = BitsAndBytesConfig(
	load_in_4bit=True,
	bnb_4bit_quant_type="nf4",
	bnb_4bit_use_double_quant=True,
	bnb_4bit_compute_dtype=torch.bfloat16,
	)

	lora_config = LoraConfig(
	r=8,
	lora_alpha=16,
	target_modules=["q_proj","k_proj","v_proj","o_proj",
	"gate_proj","up_proj","down_proj"],
	lora_dropout=0.05,
	bias="none",
	task_type=TaskType.CAUSAL_LM,
	)

	# Training args
	SFTConfig(
	num_train_epochs=3,
	per_device_train_batch_size=1,
	gradient_accumulation_steps=8,
	learning_rate=2e-4,
	lr_scheduler_type="cosine",
	optim="paged_adamw_8bit",
	bf16=True,
	gradient_checkpointing=True,
	max_length=512,
	)
	```

	---

	## 📁 Repository Structure

	```
	ratulsur/multi-format-finance-parser/
	├── model.safetensors # Merged model weights (15.2 GB)
	├── config.json # Model configuration
	├── tokenizer.json # Tokenizer
	├── tokenizer_config.json # Tokenizer configuration
	├── chat_template.jinja # Chat template
	└── generation_config.json # Generation configuration
	```

	---

	## ⚠️ Limitations

	- Trained primarily on English financial documents
	- Best performance on structured text (not handwritten documents)
	- OCR quality affects accuracy for scanned documents
	- SAP reports tested on ALV-style exports only
	- 954 training samples — production use should involve more data

	---

	## 🔗 Links

	- Live Demo: [HuggingFace Spaces](https://huggingface.co/spaces/ratulsur/finance-parser-demo)
	- Base Model: [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)
	- Training Dataset: [CORD-v2](https://huggingface.co/datasets/naver-clova-ix/cord-v2)

	---

	## 👤 Author

	Ratul Sur
	- HuggingFace: [ratulsur](https://huggingface.co/ratulsur)

	---

	If you find this model useful, please give it a ⭐ like on HuggingFace!