| --- |
| license: apache-2.0 |
| language: |
| - en |
| base_model: Qwen/Qwen2.5-7B-Instruct |
| pipeline_tag: text-generation |
| tags: |
| - finance |
| - document-parsing |
| - qlora |
| - qwen2.5 |
| - invoice |
| - sap |
| - structured-extraction |
| - json |
| --- |
| |
| # Multi-Format Finance Document Parser |
|
|
| A production-grade financial document parser fine-tuned from **Qwen2.5-7B-Instruct** using **QLoRA (4-bit NF4 quantization)**. Given raw text from a supported financial document, it outputs structured JSON ready for downstream processing, ERP integration, or analytics pipelines. |
|
|
| --- |
|
|
| ## Live Demo |
|
|
| [Try it on HuggingFace Spaces](https://huggingface.co/spaces/ratulsur/finance-parser-demo) |
|
|
| --- |
|
|
| ## Supported Document Types |
|
|
| | Format | Examples | |
| |---|---| |
| | **Invoice** | Vendor invoices, GST bills, service bills | |
| | **SAP Report** | ALV exports, FI vendor payment reports | |
| | **Income Statement** | P&L statements, quarterly earnings | |
| | **Balance Sheet** | Assets, liabilities, equity statements | |
| | **Bank Statement** | Transaction records, account summaries | |
| | **Purchase Order** | PO documents, procurement records | |
| | **SQL Result** | Query outputs from finance databases | |
| | **CSV / Excel** | Tabular finance data | |
|
|
| --- |
|
|
| ## Model Details |
|
|
| | Property | Value | |
| |---|---| |
| | **Base model** | Qwen/Qwen2.5-7B-Instruct | |
| | **Model size** | ~7.6B parameters (the Hub rounds this to 8B) | |
| | **Fine-tuning method** | QLoRA (PEFT) | |
| | **Quantization** | 4-bit NF4 + double quantization | |
| | **Compute dtype** | bfloat16 | |
| | **LoRA rank** | r=8, alpha=16 | |
| | **Max sequence length** | 512 tokens | |
| | **Training hardware** | L40S 48GB GPU (Lightning AI) | |
| | **Training time** | ~1 hour | |
| | **License** | Apache 2.0 | |
|
|
| --- |
|
|
| ## Training Data |
|
|
| | Dataset | Samples | Type | |
| |---|---|---| |
| | [CORD-v2](https://huggingface.co/datasets/naver-clova-ix/cord-v2) | 454 | Real receipt images + structured JSON | |
| | Synthetic invoices | 300 | Generated with realistic Indian/global vendors | |
| | Synthetic SAP reports | 100 | ALV-style pipe-delimited exports | |
| | Synthetic income statements | 100 | P&L with revenue, COGS, EBIT, net income | |
| | **Total** | **954** | Train: 812 · Eval: 95 · Test: 47 | |
|
|
| --- |
|
|
| ## Quantization Techniques |
|
|
| | Technique | Purpose | |
| |---|---| |
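| | **NF4 4-bit quantization** | Stores weights in 4-bit NormalFloat format, roughly 4x smaller than 16-bit weights | |
| | **Double quantization** | Quantizes the quantization constants themselves, saving a further ~0.4 bits per parameter | |
| | **bfloat16 compute** | Weights are stored in 4 bits and dequantized to bfloat16 (16-bit) for matrix multiplications | |
| | **LoRA adapters (r=8)** | Only the low-rank adapters are trained (about 0.3% of parameters, ~20M); the base weights stay frozen | |
| | **Paged AdamW 8-bit** | Keeps optimizer state in 8 bits and pages it to CPU memory under GPU memory pressure | |
| | **Gradient checkpointing** | Recomputes activations during the backward pass, cutting activation memory by ~40% at the cost of extra compute | |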
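
The savings in the table can be sanity-checked with simple arithmetic. The sketch below is a back-of-the-envelope estimate, not a measurement. The layer dimensions (hidden size 3584, MLP size 18944, 512-wide key/value projections, 28 layers) and the 7.6B total parameter count are assumptions taken from the base model's published configuration rather than from this card, and the helper names `lora_params` and `weight_gib` are illustrative.

```python
# Rough size estimates for the techniques listed in the table above.
# ASSUMPTION: dimensions of Qwen2.5-7B taken from the base model's config.
HIDDEN, MLP, KV_DIM, LAYERS = 3584, 18944, 512, 28
TOTAL_PARAMS = 7.6e9  # ASSUMPTION: approximate total parameter count
RANK = 8              # LoRA rank stated in this card

# (in_features, out_features) of each LoRA-targeted projection in one layer
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_params(rank: int) -> int:
    """Trainable parameters: each adapter adds rank * (in + out) weights."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in PROJECTIONS.values())
    return per_layer * LAYERS


def weight_gib(n_params: float, bits: float) -> float:
    """Storage needed for n_params weights at the given bit width, in GiB."""
    return n_params * bits / 8 / 2**30


trainable = lora_params(RANK)
print(f"LoRA trainable parameters: {trainable:,}")
print(f"Share of all parameters:   {trainable / TOTAL_PARAMS:.2%}")
print(f"bf16 weights: {weight_gib(TOTAL_PARAMS, 16):.1f} GiB")
print(f"NF4 weights:  {weight_gib(TOTAL_PARAMS, 4):.1f} GiB")
```

Under these assumptions the adapters come to about 20 million trainable parameters. The weight estimates ignore quantization constants and any layers kept in higher precision, so real memory use will be somewhat higher.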
|
|
| --- |
|
|
| ## Output Schema |
|
|
| ```json |
| { |
|   "document_type": "invoice|balance_sheet|income_stmt|sap_report|sql_result|bank_statement|purchase_order", |
|   "vendor": "string or null", |
|   "client": "string or null", |
|   "date": "YYYY-MM-DD or null", |
|   "due_date": "YYYY-MM-DD or null", |
|   "document_id": "string or null", |
|   "currency": "USD|EUR|INR|GBP|...", |
|   "subtotal": "float or null", |
|   "tax_amount": "float or null", |
|   "tax_rate_pct": "float or null", |
|   "total_amount": "float or null", |
|   "line_items": [ |
|     { |
|       "description": "string", |
|       "quantity": "float or null", |
|       "unit_price": "float or null", |
|       "amount": "float" |
|     } |
|   ], |
|   "payment_terms": "string or null", |
|   "notes": "string or null", |
|   "metadata": {} |
| } |
| ``` |
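
Because downstream systems consume this JSON directly, it can help to validate each parsed document against the schema before using it. The following is a minimal sketch using only the standard library. `validate_parsed` and the rules it applies are illustrative assumptions, not part of the model or its repository.

```python
import re

# Top-level keys copied from the schema above.
SCHEMA_KEYS = {
    "document_type", "vendor", "client", "date", "due_date", "document_id",
    "currency", "subtotal", "tax_amount", "tax_rate_pct", "total_amount",
    "line_items", "payment_terms", "notes", "metadata",
}
NUMERIC_FIELDS = ("subtotal", "tax_amount", "tax_rate_pct", "total_amount")
DATE_FIELDS = ("date", "due_date")
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_number(value) -> bool:
    # bool is a subclass of int, so exclude it explicitly
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_parsed(doc: dict) -> list:
    """Return a list of problems; an empty list means the document passed."""
    problems = []
    missing = SCHEMA_KEYS - set(doc)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    for field in NUMERIC_FIELDS:
        value = doc.get(field)
        if value is not None and not is_number(value):
            problems.append(f"{field} should be a number or null")
    for field in DATE_FIELDS:
        value = doc.get(field)
        if value is not None and not (
            isinstance(value, str) and ISO_DATE.fullmatch(value)
        ):
            problems.append(f"{field} should be YYYY-MM-DD or null")
    items = doc.get("line_items")
    if not isinstance(items, list):
        problems.append("line_items should be a list")
        items = []
    for i, item in enumerate(items):
        if not isinstance(item, dict) or not isinstance(item.get("description"), str):
            problems.append(f"line_items[{i}] needs a string description")
        elif not is_number(item.get("amount")):
            problems.append(f"line_items[{i}].amount should be a number")
    return problems


# Example: a deliberately incomplete document produces several problems.
for problem in validate_parsed({"date": "15/11/2024", "line_items": []}):
    print(problem)
```

Returning a list of problems instead of raising on the first one makes it easier to log every issue with a document in a single pass.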
|
|
| --- |
|
|
| ## Usage |
|
|
| ### Via HuggingFace Inference API |
|
|
| ```python |
| import json |
| import re |
| |
| import requests |
| |
| API_URL = "https://api-inference.huggingface.co/models/ratulsur/multi-format-finance-parser" |
| HF_TOKEN = "hf_xxxxxxxxxxxx"  # placeholder: replace with your own access token |
| |
| SYSTEM_PROMPT = """You are a production financial document parser. |
| Given raw text from any financial document, output ONLY a single valid JSON object. |
| Schema: {document_type, vendor, client, date (YYYY-MM-DD), due_date, document_id, |
| currency, subtotal, tax_amount, tax_rate_pct, total_amount, |
| line_items:[{description,quantity,unit_price,amount}], payment_terms, notes, metadata}. |
| All monetary values must be floats. Unknown fields → null. No explanation.""" |
| |
| def parse_document(text: str) -> dict: |
|     # Build the ChatML prompt by hand (system, user, then an open assistant turn) |
|     prompt = ( |
|         f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n" |
|         f"<|im_start|>user\nParse this financial document:\n\n{text}<|im_end|>\n" |
|         f"<|im_start|>assistant\n" |
|     ) |
|     headers = {"Authorization": f"Bearer {HF_TOKEN}"} |
|     payload = { |
|         "inputs": prompt, |
|         "parameters": { |
|             "max_new_tokens": 512, |
|             "return_full_text": False, |
|             "do_sample": False,  # greedy decoding, so no temperature is set |
|         }, |
|     } |
|     resp = requests.post(API_URL, headers=headers, json=payload, timeout=120) |
|     resp.raise_for_status()  # surface HTTP errors instead of failing on the JSON below |
|     raw = resp.json()[0]["generated_text"].strip() |
|     # Strip Markdown code fences the model may wrap around the JSON |
|     raw = re.sub(r"```json\s*|```\s*", "", raw).strip() |
|     return json.loads(raw) |
| |
| # Example |
| invoice = """ |
| INVOICE |
| Vendor: Tata Consultancy Services Ltd. |
| Invoice No: TCS-2024-8821 |
| Date: 2024-11-15 |
| Service: Cloud Infrastructure Management INR 42,500.00 |
| GST @ 18%: INR 7,650.00 |
| TOTAL DUE: INR 50,150.00 |
| Payment Terms: Net 30 |
| """ |
| |
| result = parse_document(invoice) |
| print(json.dumps(result, indent=2)) |
| ``` |
| |
| ### Expected output |
| |
| ```json |
| { |
|   "document_type": "invoice", |
|   "vendor": "Tata Consultancy Services Ltd.", |
|   "client": null, |
|   "date": "2024-11-15", |
|   "due_date": null, |
|   "document_id": "TCS-2024-8821", |
|   "currency": "INR", |
|   "subtotal": 42500.0, |
|   "tax_amount": 7650.0, |
|   "tax_rate_pct": 18.0, |
|   "total_amount": 50150.0, |
|   "line_items": [ |
|     { |
|       "description": "Cloud Infrastructure Management", |
|       "quantity": 1.0, |
|       "unit_price": 42500.0, |
|       "amount": 42500.0 |
|     } |
|   ], |
|   "payment_terms": "Net 30", |
|   "notes": null, |
|   "metadata": {} |
| } |
| ``` |
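
The amounts in this example are internally consistent: 42,500.00 plus 18% GST of 7,650.00 gives 50,150.00. A similar cross-check can flag extractions where the model misread a figure. The sketch below is one possible way to do it. `check_totals` and its 0.01 tolerance are assumptions for illustration, and documents with discounts, shipping, or multiple tax lines would need looser rules.

```python
import math


def check_totals(doc: dict, tolerance: float = 0.01) -> list:
    """Cross-check monetary fields of a parsed document; return any mismatches."""
    problems = []
    subtotal = doc.get("subtotal")
    tax = doc.get("tax_amount")
    total = doc.get("total_amount")
    rate = doc.get("tax_rate_pct")
    items = doc.get("line_items") or []

    # subtotal + tax should equal the total
    if None not in (subtotal, tax, total):
        if not math.isclose(subtotal + tax, total, abs_tol=tolerance):
            problems.append("subtotal + tax_amount does not match total_amount")

    # tax should equal subtotal * rate / 100
    if None not in (subtotal, tax, rate):
        if not math.isclose(subtotal * rate / 100, tax, abs_tol=tolerance):
            problems.append("tax_amount does not match subtotal * tax_rate_pct")

    # line item amounts should add up to the subtotal
    amounts = [item.get("amount") for item in items]
    if amounts and subtotal is not None and None not in amounts:
        if not math.isclose(sum(amounts), subtotal, abs_tol=tolerance):
            problems.append("line item amounts do not add up to subtotal")

    return problems


example = {
    "subtotal": 42500.0,
    "tax_amount": 7650.0,
    "tax_rate_pct": 18.0,
    "total_amount": 50150.0,
    "line_items": [{"description": "Cloud Infrastructure Management", "amount": 42500.0}],
}
print(check_totals(example))  # prints []
```

Fields that are `null` are skipped, so the check only fires when enough values were extracted to compare.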
| |
| ### Load locally with transformers |
|
|
| ```python |
| from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig |
| import torch |
| |
| bnb_config = BitsAndBytesConfig( |
|     load_in_4bit=True, |
|     bnb_4bit_quant_type="nf4", |
|     bnb_4bit_use_double_quant=True, |
|     bnb_4bit_compute_dtype=torch.bfloat16, |
| ) |
| |
| model = AutoModelForCausalLM.from_pretrained( |
|     "ratulsur/multi-format-finance-parser", |
|     quantization_config=bnb_config, |
|     device_map="auto", |
|     trust_remote_code=True, |
| ) |
| tokenizer = AutoTokenizer.from_pretrained( |
|     "ratulsur/multi-format-finance-parser", |
|     trust_remote_code=True, |
| ) |
| ``` |
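
Once the model and tokenizer are loaded, a request still has to be turned into chat messages and the generated text turned back into JSON. The helpers below are a hedged sketch of those two steps. `build_messages` and `extract_json` are hypothetical names, the system prompt is passed in by the caller, and the commented-out generation calls are an untested outline of the usual `transformers` flow, not this repository's own inference code.

```python
import json


def build_messages(system_prompt: str, document_text: str) -> list:
    """Chat messages in the same shape as the hand-built prompt in the API example."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Parse this financial document:\n\n{document_text}"},
    ]


def extract_json(raw: str) -> dict:
    """Return the first JSON object found in raw model output.

    Tolerates Markdown fences or stray text around the object by trying
    to decode from each '{' in turn.
    """
    decoder = json.JSONDecoder()
    start = raw.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(raw, start)
            return obj
        except json.JSONDecodeError:
            start = raw.find("{", start + 1)
    raise ValueError("no JSON object found in model output")


# Outline of local generation (assumes `model` and `tokenizer` from the block above):
# messages = build_messages(SYSTEM_PROMPT, invoice_text)
# prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
# result = extract_json(tokenizer.decode(new_tokens, skip_special_tokens=True))

print(extract_json('```json\n{"document_type": "invoice"}\n```'))
```

Decoding from the first `{` avoids depending on exactly how the model wraps its answer, which a fixed fence-stripping regex cannot guarantee.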
|
|
| --- |
|
|
| ## Training Setup |
|
|
| ```python |
| import torch |
| from peft import LoraConfig, TaskType |
| from transformers import BitsAndBytesConfig |
| from trl import SFTConfig |
| |
| # QLoRA config: 4-bit NF4 base weights with bfloat16 compute |
| bnb_config = BitsAndBytesConfig( |
|     load_in_4bit=True, |
|     bnb_4bit_quant_type="nf4", |
|     bnb_4bit_use_double_quant=True, |
|     bnb_4bit_compute_dtype=torch.bfloat16, |
| ) |
| |
| # LoRA adapters on all attention and MLP projections |
| lora_config = LoraConfig( |
|     r=8, |
|     lora_alpha=16, |
|     target_modules=["q_proj", "k_proj", "v_proj", "o_proj", |
|                     "gate_proj", "up_proj", "down_proj"], |
|     lora_dropout=0.05, |
|     bias="none", |
|     task_type=TaskType.CAUSAL_LM, |
| ) |
| |
| # Training args (effective batch size: 1 x 8 = 8 sequences per optimizer step) |
| training_args = SFTConfig( |
|     num_train_epochs=3, |
|     per_device_train_batch_size=1, |
|     gradient_accumulation_steps=8, |
|     learning_rate=2e-4, |
|     lr_scheduler_type="cosine", |
|     optim="paged_adamw_8bit", |
|     bf16=True, |
|     gradient_checkpointing=True, |
|     max_length=512, |
| ) |
| ``` |
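
As a rough guide to what these settings mean in practice, the number of optimizer steps follows from the training split size, the batch settings, and the epoch count. The estimate below assumes a single GPU and the 812-sample training split listed under Training Data. `optimizer_steps` is a hypothetical helper, and the trainer's own rounding may differ by a step or two.

```python
import math


def optimizer_steps(
    train_samples: int,
    per_device_batch: int,
    grad_accum: int,
    epochs: int,
    num_gpus: int = 1,  # ASSUMPTION: single-GPU training
) -> int:
    """Approximate number of optimizer updates over the whole run."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(train_samples / effective_batch)
    return steps_per_epoch * epochs


# Values from the training configuration and the data split in this card
total = optimizer_steps(train_samples=812, per_device_batch=1, grad_accum=8, epochs=3)
print(f"approximate optimizer steps: {total}")
```

With so few updates, the cosine schedule decays over a short horizon, which is worth keeping in mind when changing the dataset size or batch settings.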
|
|
| --- |
|
|
| ## Repository Structure |
|
|
| ``` |
| ratulsur/multi-format-finance-parser/ |
| ├── model.safetensors # Merged model weights (15.2 GB) |
| ├── config.json # Model configuration |
| ├── tokenizer.json # Tokenizer |
| ├── tokenizer_config.json # Tokenizer configuration |
| ├── chat_template.jinja # Chat template |
| └── generation_config.json # Generation configuration |
| ``` |
|
|
| --- |
|
|
| ## Limitations |
|
|
| - Trained primarily on English financial documents |
| - Best performance on structured text (not handwritten documents) |
| - OCR quality affects accuracy for scanned documents |
| - SAP reports tested on ALV-style exports only |
| - Trained on only 954 samples; production use calls for more training data |
|
|
| --- |
|
|
| ## Links |
|
|
| - **Live Demo:** [HuggingFace Spaces](https://huggingface.co/spaces/ratulsur/finance-parser-demo) |
| - **Base Model:** [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) |
| - **Training Dataset:** [CORD-v2](https://huggingface.co/datasets/naver-clova-ix/cord-v2) |
|
|
| --- |
|
|
| ## Author |
|
|
| **Ratul Sur** |
| - HuggingFace: [ratulsur](https://huggingface.co/ratulsur) |
|
|
| --- |
|
|
| *If you find this model useful, please give it a like on HuggingFace!* |