---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- vision-language-model
- document-understanding
- handwritten-text
- insurance-forms
- vqa
- phi-3.5-vision
- lora
- qlora
- unsloth
- medical-forms
- ocr-free
pipeline_tag: image-to-text
base_model: microsoft/Phi-3.5-vision-instruct
datasets:
- custom-mdf-forms
metrics:
- exact_match
model-index:
- name: mdf-form-reader-phi35-vision
results:
- task:
type: visual-question-answering
name: Visual Question Answering (MDF Forms)
metrics:
- type: exact_match
value: 0
name: Exact Match (%)
- type: ood_refusal_rate
value: 0
name: OOD Refusal Rate (%)
---
# MDF Form Reader β€” Phi-3.5-Vision Fine-tuned
**Vision-native handwritten insurance form understanding, fine-tuned from [microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) using QLoRA.**
> **No OCR needed.** This model reads handwriting, determines checkbox states, and extracts structured data directly from scanned MDF (Monthly Disability Verification) form images.
---
## πŸ“‹ Model Summary
| Property | Value |
|---|---|
| **Base Model** | `microsoft/Phi-3.5-vision-instruct` (4.2B) |
| **Task** | Visual Question Answering on MDF forms |
| **Fine-tuning Method** | QLoRA (r=16, alpha=32) via Unsloth |
| **Quantization** | 4-bit NF4 (training) β†’ 16-bit merged |
| **Annotator** | Vertex AI Gemini 2.5 Flash |
| **Exact Match** | 0% |
| **OOD Refusal Rate** | 0% |
| **License** | Apache 2.0 |
---
## πŸš€ Quick Start
```python
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import torch

model_id = "solvrays/mdf-form-reader-phi35-vision"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    trust_remote_code=True,
)

# Load your scanned MDF form image
image = Image.open("mdf_form.png").convert("RGB")

# Ask a question about the form
question = "What is the name of the physician who signed this form?"
messages = [{"role": "user", "content": f"<|image_1|>\n{question}"}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to("cuda")

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.1)

answer = processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(answer)
```
---
## πŸ₯ What is an MDF Form?
A **Monthly Disability Verification Form (Form 441.O.MDF.O)** is issued by TriPlus Services, acting as Third-Party Administrator of Penn Treaty Network America and American Network policies. It requires a licensed physician to certify a patient's ongoing disability status monthly.
### Key Fields Extracted
- Physician name, address, phone, fax
- Submission date range (from / to)
- Patient disability status (YES checked / NO checked)
- Disability end date (if applicable)
- Form completion date
- Physician signature presence
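Rather than asking one question per field, all of the fields above can be requested in a single structured query. The sketch below is illustrative: the helper names (`build_extraction_prompt`, `parse_json_answer`) and the snake_case field keys are assumptions, not part of the model's API.

```python
import json

# Hypothetical field keys derived from the "Key Fields Extracted" list above.
MDF_FIELDS = [
    "physician_name", "physician_address", "physician_phone", "physician_fax",
    "submission_from_date", "submission_to_date", "disability_status",
    "disability_end_date", "form_completion_date", "signature_present",
]

def build_extraction_prompt(fields=MDF_FIELDS):
    """Build one prompt that requests every field as a single JSON object."""
    field_list = ", ".join(fields)
    return (
        "Extract the following fields from this MDF form and answer with "
        f"a single JSON object (use null for empty or redacted fields): {field_list}"
    )

def parse_json_answer(answer: str):
    """Parse the model's answer, tolerating markdown code fences around the JSON."""
    cleaned = (
        answer.strip()
        .removeprefix("```json")
        .removeprefix("```")
        .removesuffix("```")
    )
    return json.loads(cleaned)
```

In the Quick Start snippet above, `build_extraction_prompt()` can be passed as the `question` to get all fields in one generation call.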
---
## πŸ”¬ Why Vision-Native vs OCR?
| Challenge | OCR Approach | This Model |
|---|---|---|
| Cursive physician names | Often fails (e.g. "Carnazzo", "Kruszka") | Reads directly from the image |
| Checkbox state (YES/NO) | Missed (no text to extract) | Sees the βœ“/βœ— mark in context |
| Date grid cells (MM/DD/YYYY) | Digit confusion in small boxes | Layout-aware reading |
| Signature field | Garbage output | Ignored by design |
| Handwritten addresses | High error rate | Corrected from context |
---
## πŸ› οΈ Training Pipeline
```
Scanned MDF Form (PDF)
↓ Image pre-processing (deskew 300 DPI, bilateral denoise, CLAHE)
↓ Vertex AI Gemini 2.5 Flash β†’ structured JSON annotation
↓ VQA triplet dataset (field extraction + OOD refusal pairs)
↓ Phi-3.5-Vision + QLoRA (Unsloth, 2-5Γ— faster, 80% less VRAM)
↓ Merge adapters β†’ full 16-bit model
↓ HuggingFace Hub (safetensors)
```
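The "VQA triplet dataset" step above can be sketched as follows. This is a minimal illustration, assuming one Gemini JSON annotation per page; the function names, the refusal phrasing, and the question templates are hypothetical, not the actual pipeline code.

```python
import json

# Hypothetical canonical refusal answer used for OOD training pairs.
REFUSAL = "This form does not contain that information."

def to_vqa_records(image_path, annotation):
    """Turn one annotation dict into (image, question, answer) triplets,
    plus refusal pairs for questions the form cannot answer."""
    records = [
        {"image": image_path,
         "question": f"What is the {field.replace('_', ' ')} on this form?",
         "answer": value if value is not None else "null"}
        for field, value in annotation.items()
    ]
    # OOD refusal pairs teach the model to decline unanswerable questions.
    for ood_q in ("What is the diagnosis?", "Has this claim been approved?"):
        records.append({"image": image_path, "question": ood_q, "answer": REFUSAL})
    return records

def write_jsonl(records, path):
    """Serialize the triplets as JSON Lines for the training step."""
    with open(path, "w") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")
```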
### Training Configuration
```yaml
base_model: microsoft/Phi-3.5-vision-instruct
fine_tuning_method: QLoRA (NF4, double quantization)
lora_rank: 16
lora_alpha: 32
lora_dropout: 0.05
use_rslora: true
vision_layers: frozen
language_layers: adapted
optimizer: AdamW 8-bit (paged)
lr_scheduler: cosine
neftune_noise_alpha: 5
annotator: Vertex AI Gemini 2.5 Flash
framework: Unsloth + HuggingFace TRL
```
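For readers reproducing this configuration outside Unsloth, the YAML above maps onto a standard `peft` adapter config roughly as follows. This is a sketch of the equivalent settings, not the script that produced this model; the actual run used Unsloth's wrappers.

```python
# Hyperparameters mirroring the training configuration YAML above.
LORA = {"r": 16, "lora_alpha": 32, "lora_dropout": 0.05, "use_rslora": True}
TRAIN = {
    "neftune_noise_alpha": 5,      # TRL SFTConfig option
    "lr_scheduler_type": "cosine",
    "optim": "paged_adamw_8bit",   # paged 8-bit AdamW
}

def make_peft_config():
    """Build the equivalent peft LoraConfig (import deferred: needs the
    peft package, which is only required at training time)."""
    from peft import LoraConfig
    return LoraConfig(
        r=LORA["r"],
        lora_alpha=LORA["lora_alpha"],
        lora_dropout=LORA["lora_dropout"],
        use_rslora=LORA["use_rslora"],
        task_type="CAUSAL_LM",
    )
```

Note that `lora_alpha` is twice `r`, a common heuristic; with `use_rslora` enabled, scaling uses `alpha / sqrt(r)` instead of `alpha / r`.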
---
## πŸ“Š Evaluation Results
| Metric | Value |
|---|---|
| Exact Match (field extraction) | 0% |
| OOD Refusal Rate | 0% |
| Evaluation Set | Held-out MDF form pages |
**OOD Refusal Rate** measures how reliably the model declines to answer questions not answerable from the form (e.g. "What is the diagnosis?", "Has this claim been approved?").
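A minimal sketch of how these two metrics can be computed, assuming case-insensitive string comparison for exact match and a small set of refusal-phrase markers (the markers below are assumptions, not the phrases this model was actually trained to emit):

```python
# Hypothetical substrings treated as a refusal; adjust to the model's actual phrasing.
REFUSAL_MARKERS = ("not contain", "cannot be answered", "not answerable")

def exact_match(preds, golds):
    """Case-insensitive exact match over (prediction, gold) pairs, in percent."""
    hits = sum(p.strip().lower() == g.strip().lower() for p, g in zip(preds, golds))
    return 100.0 * hits / max(len(golds), 1)

def ood_refusal_rate(ood_preds):
    """Share of out-of-domain answers containing a refusal phrase, in percent."""
    refused = sum(any(m in p.lower() for m in REFUSAL_MARKERS) for p in ood_preds)
    return 100.0 * refused / max(len(ood_preds), 1)
```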
---
## ⚠️ Limitations
- **Domain-specific**: Trained exclusively on TriPlus Services MDF forms. Performance on other form types is not guaranteed.
- **Image quality**: Works best on scans β‰₯ 300 DPI. Very low-resolution or heavily degraded scans may reduce accuracy.
- **Language**: English only.
- **Redacted fields**: Returns `null` for blacked-out fields (insured name/policy number).
- **Not for medical diagnosis**: This model extracts administrative form data only.
---
## πŸ“„ License
This model is released under the **Apache 2.0 License**.
The base model ([microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct)) is also Apache 2.0.
---
## πŸ™ Acknowledgements
- [Unsloth](https://github.com/unslothai/unsloth) for 2-5Γ— faster fine-tuning
- [Microsoft Phi-3.5-Vision](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) for the base vision-language model
- [Vertex AI Gemini 2.5 Flash](https://cloud.google.com/vertex-ai) for dataset annotation
- [HuggingFace TRL](https://github.com/huggingface/trl) for SFTTrainer