---
license: apache-2.0
language:
- en
base_model: Qwen/Qwen2.5-7B-Instruct
pipeline_tag: text-generation
tags:
- finance
- document-parsing
- qlora
- qwen2.5
- invoice
- sap
- structured-extraction
- json
---

# 🏦 Multi-Format Finance Document Parser

A production-grade financial document parser fine-tuned on **Qwen2.5-7B-Instruct** using **QLoRA (4-bit NF4 quantization)**. Given raw text from a supported financial document, it outputs structured JSON, ready for downstream processing, ERP integration, or analytics pipelines.

---

## 🚀 Live Demo

👉 [Try it on HuggingFace Spaces](https://huggingface.co/spaces/ratulsur/finance-parser-demo)

---

## 📄 Supported Document Types

| Format | Examples |
|---|---|
| **Invoice** | Vendor invoices, GST bills, service bills |
| **SAP Report** | ALV exports, FI vendor payment reports |
| **Income Statement** | P&L statements, quarterly earnings |
| **Balance Sheet** | Assets, liabilities, equity statements |
| **Bank Statement** | Transaction records, account summaries |
| **Purchase Order** | PO documents, procurement records |
| **SQL Result** | Query outputs from finance databases |
| **CSV / Excel** | Tabular finance data |

---

## 🧠 Model Details

| Property | Value |
|---|---|
| **Base model** | Qwen/Qwen2.5-7B-Instruct |
| **Model size** | 7.6B parameters |
| **Fine-tuning method** | QLoRA (PEFT) |
| **Quantization** | 4-bit NF4 + double quantization |
| **Compute dtype** | bfloat16 |
| **LoRA rank** | r=8, alpha=16 |
| **Max sequence length** | 512 tokens |
| **Training hardware** | L40S 48GB GPU (Lightning AI) |
| **Training time** | ~1 hour |
| **License** | Apache 2.0 |

---

## 📊 Training Data

| Dataset | Samples | Type |
|---|---|---|
| [CORD-v2](https://huggingface.co/datasets/naver-clova-ix/cord-v2) | 454 | Real receipt images + structured JSON |
| Synthetic invoices | 300 | Generated with realistic Indian/global vendors |
| Synthetic SAP reports | 100 | ALV-style pipe-delimited exports |
| Synthetic income statements | 100 | P&L with revenue, COGS, EBIT, net income |
| **Total** | **954** | Train: 812 · Eval: 95 · Test: 47 |

---

## ⚙️ Quantization and Memory-Saving Techniques

| Technique | Purpose |
|---|---|
| **NF4 4-bit quantization** | Stores weights in 4-bit NormalFloat format, roughly 4x smaller than 16-bit weights |
| **Double quantization** | Quantizes the quantization constants, saving roughly a further 0.4 bits per parameter |
| **bfloat16 compute** | Matrix multiplications run in bfloat16 on dequantized weights; storage stays 4-bit |
| **LoRA adapters (r=8)** | Only ~0.5% of parameters are trained; ~99.5% stay frozen |
| **Paged AdamW 8-bit** | Reduces optimizer state memory |
| **Gradient checkpointing** | ~40% activation memory reduction |

---

## 📤 Output Schema

```json
{
  "document_type": "invoice|balance_sheet|income_stmt|sap_report|sql_result|bank_statement|purchase_order",
  "vendor": "string or null",
  "client": "string or null",
  "date": "YYYY-MM-DD or null",
  "due_date": "YYYY-MM-DD or null",
  "document_id": "string or null",
  "currency": "USD|EUR|INR|GBP|...",
  "subtotal": "float or null",
  "tax_amount": "float or null",
  "tax_rate_pct": "float or null",
  "total_amount": "float or null",
  "line_items": [
    {
      "description": "string",
      "quantity": "float or null",
      "unit_price": "float or null",
      "amount": "float"
    }
  ],
  "payment_terms": "string or null",
  "notes": "string or null",
  "metadata": {}
}
```

---

## 💻 Usage

### Via HuggingFace Inference API

```python
import json
import re

import requests

API_URL = "https://api-inference.huggingface.co/models/ratulsur/multi-format-finance-parser"
HF_TOKEN = "hf_xxxxxxxxxxxx"  # replace with your own access token

SYSTEM_PROMPT = """You are a production financial document parser.
Given raw text from any financial document, output ONLY a single valid JSON object.
Schema: {document_type, vendor, client, date (YYYY-MM-DD), due_date, document_id, currency, subtotal, tax_amount, tax_rate_pct, total_amount, line_items:[{description,quantity,unit_price,amount}], payment_terms, notes, metadata}.
All monetary values must be floats. Unknown fields → null. No explanation."""


def parse_document(text: str) -> dict:
    # Build the prompt in the Qwen2.5 chat (ChatML) format.
    prompt = (
        f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n"
        f"<|im_start|>user\nParse this financial document:\n\n{text}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )
    headers = {"Authorization": f"Bearer {HF_TOKEN}"}
    payload = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": 512,
            "return_full_text": False,
            # Greedy decoding; a temperature setting has no effect when sampling is off.
            "do_sample": False,
        },
    }
    resp = requests.post(API_URL, headers=headers, json=payload, timeout=120)
    resp.raise_for_status()  # surface HTTP errors instead of failing on the JSON lookup
    raw = resp.json()[0]["generated_text"].strip()
    # Strip Markdown code fences if the model wrapped its answer in them.
    raw = re.sub(r"```json\s*|```\s*", "", raw).strip()
    return json.loads(raw)


# Example
invoice = """
INVOICE
Vendor: Tata Consultancy Services Ltd.
Invoice No: TCS-2024-8821
Date: 2024-11-15

Service: Cloud Infrastructure Management INR 42,500.00
GST @ 18%: INR 7,650.00
TOTAL DUE: INR 50,150.00

Payment Terms: Net 30
"""

result = parse_document(invoice)
print(json.dumps(result, indent=2))
```

### Expected output

```json
{
  "document_type": "invoice",
  "vendor": "Tata Consultancy Services Ltd.",
  "client": null,
  "date": "2024-11-15",
  "due_date": null,
  "document_id": "TCS-2024-8821",
  "currency": "INR",
  "subtotal": 42500.0,
  "tax_amount": 7650.0,
  "tax_rate_pct": 18.0,
  "total_amount": 50150.0,
  "line_items": [
    {
      "description": "Cloud Infrastructure Management",
      "quantity": 1,
      "unit_price": 42500.0,
      "amount": 42500.0
    }
  ],
  "payment_terms": "Net 30",
  "notes": null,
  "metadata": {}
}
```

### Load locally with transformers

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Load the merged weights in 4-bit NF4 to keep GPU memory low.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "ratulsur/multi-format-finance-parser",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "ratulsur/multi-format-finance-parser",
    trust_remote_code=True,
)
```

---

## 🏗️ Training Setup

```python
import torch
from peft import LoraConfig, TaskType
from transformers import BitsAndBytesConfig
from trl import SFTConfig

# QLoRA config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

# Training args (effective batch size: 1 x 8 = 8)
training_args = SFTConfig(
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    optim="paged_adamw_8bit",
    bf16=True,
    gradient_checkpointing=True,
    max_length=512,
)
```

---

## 📁 Repository Structure

```
ratulsur/multi-format-finance-parser/
├── model.safetensors        # Merged model weights (15.2 GB)
├── config.json              # Model configuration
├── tokenizer.json           # Tokenizer
├── tokenizer_config.json    # Tokenizer configuration
├── chat_template.jinja      # Chat template
└── generation_config.json   # Generation configuration
```

---

## ⚠️ Limitations

- Trained primarily on English financial documents
- Best performance on structured text (not handwritten documents)
- OCR quality affects accuracy for scanned documents
- SAP reports tested on ALV-style exports only
- Only 954 training samples; production use should involve more data

---

## 🔗 Links

- **Live Demo:** [HuggingFace Spaces](https://huggingface.co/spaces/ratulsur/finance-parser-demo)
- **Base Model:** [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)
- **Training Dataset:** [CORD-v2](https://huggingface.co/datasets/naver-clova-ix/cord-v2)

---

## 👤 Author

**Ratul Sur**

- HuggingFace: [ratulsur](https://huggingface.co/ratulsur)

---

*If you find this model useful, please give it a ⭐ like on HuggingFace!*
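
---

## ✅ Validating Parsed Output (Sketch)

The fields in the Output Schema section lend themselves to a few cheap sanity checks before results reach an ERP or analytics pipeline. The sketch below is not part of this repository: `extract_json`, `check_document`, and the 0.01 tolerance are hypothetical names and values chosen for illustration. The arithmetic checks assume that `subtotal + tax_amount` equals `total_amount` and that line item amounts sum to `subtotal`, which will not hold for every document (discounts, shipping, multiple tax rates), so treat the returned problems as warnings and not as hard failures.

```python
import json
import re
from datetime import datetime

# Allowed values taken from the "document_type" field of the output schema.
DOCUMENT_TYPES = {
    "invoice", "balance_sheet", "income_stmt", "sap_report",
    "sql_result", "bank_statement", "purchase_order",
}
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def extract_json(raw: str) -> dict:
    """Parse the outermost {...} span, ignoring code fences or stray text around it."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    return json.loads(raw[start:end + 1])


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def check_document(doc: dict, tolerance: float = 0.01) -> list:
    """Return a list of problem descriptions; an empty list means all checks passed."""
    problems = []

    if doc.get("document_type") not in DOCUMENT_TYPES:
        problems.append(f"unexpected document_type: {doc.get('document_type')!r}")

    # Dates may be null, but if present they must be real YYYY-MM-DD dates.
    for field in ("date", "due_date"):
        value = doc.get(field)
        if value is None:
            continue
        if not (isinstance(value, str) and ISO_DATE.fullmatch(value)):
            problems.append(f"{field} is not YYYY-MM-DD: {value!r}")
            continue
        try:
            datetime.strptime(value, "%Y-%m-%d")
        except ValueError:
            problems.append(f"{field} is not a real calendar date: {value!r}")

    # Assumed relation: subtotal + tax_amount == total_amount (within tolerance).
    subtotal = doc.get("subtotal")
    tax = doc.get("tax_amount")
    total = doc.get("total_amount")
    if all(_is_number(v) for v in (subtotal, tax, total)):
        if abs(subtotal + tax - total) > tolerance:
            problems.append("subtotal + tax_amount does not match total_amount")

    # Assumed relation: line item amounts sum to subtotal (within tolerance).
    amounts = [item.get("amount") for item in doc.get("line_items") or []]
    if _is_number(subtotal) and amounts and all(_is_number(a) for a in amounts):
        if abs(sum(amounts) - subtotal) > tolerance:
            problems.append("line item amounts do not sum to subtotal")

    return problems
```

Applied to the expected output of the example invoice in the Usage section, `check_document` returns an empty list, since 42500.0 + 7650.0 equals 50150.0 and the single line item matches the subtotal. `extract_json` can stand in for the fence-stripping regex when the model adds text around the JSON object.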