---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- vision-language-model
- document-understanding
- handwritten-text
- insurance-forms
- vqa
- phi-3.5-vision
- lora
- qlora
- unsloth
- medical-forms
- ocr-free
pipeline_tag: image-to-text
base_model: microsoft/Phi-3.5-vision-instruct
datasets:
- custom-mdf-forms
metrics:
- exact_match
model-index:
- name: mdf-form-reader-phi35-vision
results:
- task:
type: visual-question-answering
name: Visual Question Answering (MDF Forms)
metrics:
- type: exact_match
value: 0
name: Exact Match (%)
- type: ood_refusal_rate
value: 0
name: OOD Refusal Rate (%)
---
# MDF Form Reader β€” Phi-3.5-Vision Fine-tuned
**Vision-native handwritten insurance form understanding, fine-tuned from [microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) using QLoRA.**
> **No OCR needed.** This model reads handwriting, determines checkbox states, and extracts structured data directly from scanned MDF (Monthly Disability Verification) form images.
---
## πŸ“‹ Model Summary
| Property | Value |
|---|---|
| **Base Model** | `microsoft/Phi-3.5-vision-instruct` (4.2B) |
| **Task** | Visual Question Answering on MDF forms |
| **Fine-tuning Method** | QLoRA (r=16, alpha=32) via Unsloth |
| **Quantization** | 4-bit NF4 (training) β†’ 16-bit merged |
| **Annotator** | Vertex AI Gemini 2.5 Flash |
| **Exact Match** | 0% |
| **OOD Refusal Rate** | 0% |
| **License** | Apache 2.0 |
---
## πŸš€ Quick Start
```python
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import torch

model_id = "solvrays/mdf-form-reader-phi35-vision"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    trust_remote_code=True,
)

# Load your scanned MDF form image
image = Image.open("mdf_form.png").convert("RGB")

# Ask a question about the form
question = "What is the name of the physician who signed this form?"
messages = [{"role": "user", "content": f"<|image_1|>\n{question}"}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to("cuda")

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.1)

answer = processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(answer)
```
---
## πŸ₯ What is an MDF Form?
A **Monthly Disability Verification Form (Form 441.O.MDF.O)** is issued by TriPlus Services, acting as Third-Party Administrator of Penn Treaty Network America and American Network policies. It requires a licensed physician to certify a patient's ongoing disability status monthly.
### Key Fields Extracted
- Physician name, address, phone, fax
- Submission date range (from / to)
- Patient disability status (YES checked / NO checked)
- Disability end date (if applicable)
- Form completion date
- Physician signature presence
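Rather than asking one question per field, all of the fields above can be requested in a single structured query. The sketch below is illustrative: the helper names (`build_extraction_prompt`, `parse_json_answer`) and the snake_case field keys are assumptions, not part of the model's API.

```python
import json

# Hypothetical field keys derived from the "Key Fields Extracted" list above.
MDF_FIELDS = [
    "physician_name", "physician_address", "physician_phone", "physician_fax",
    "submission_from_date", "submission_to_date", "disability_status",
    "disability_end_date", "form_completion_date", "signature_present",
]

def build_extraction_prompt(fields=MDF_FIELDS):
    """Build one prompt that requests every field as a single JSON object."""
    field_list = ", ".join(fields)
    return (
        "Extract the following fields from this MDF form and answer with "
        f"a single JSON object (use null for empty or redacted fields): {field_list}"
    )

def parse_json_answer(answer: str):
    """Parse the model's answer, tolerating markdown code fences around the JSON."""
    cleaned = (
        answer.strip()
        .removeprefix("```json")
        .removeprefix("```")
        .removesuffix("```")
    )
    return json.loads(cleaned)
```

In the Quick Start snippet above, `build_extraction_prompt()` can be passed as the `question` to get all fields in one generation call.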
---
## πŸ”¬ Why Vision-Native vs OCR?
| Challenge | OCR Approach | This Model |
|---|---|---|
| Cursive physician names | Often fails (e.g. "Carnazzo", "Kruszka") | Reads directly from the image |
| Checkbox state (YES/NO) | Missed (no text to extract) | Sees the βœ“/βœ— mark in context |
| Date grid cells (MM/DD/YYYY) | Digit confusion in small boxes | Layout-aware reading |
| Signature field | Garbage output | Ignored by design |
| Handwritten addresses | High error rate | Corrected from context |
---
## πŸ› οΈ Training Pipeline
```
Scanned MDF Form (PDF)
↓ Image pre-processing (deskew 300 DPI, bilateral denoise, CLAHE)
↓ Vertex AI Gemini 2.5 Flash β†’ structured JSON annotation
↓ VQA triplet dataset (field extraction + OOD refusal pairs)
↓ Phi-3.5-Vision + QLoRA (Unsloth, 2-5Γ— faster, 80% less VRAM)
↓ Merge adapters β†’ full 16-bit model
↓ HuggingFace Hub (safetensors)
```
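The "VQA triplet dataset" step above can be sketched as follows. This is a minimal illustration, assuming one Gemini JSON annotation per page; the function names, the refusal phrasing, and the question templates are hypothetical, not the actual pipeline code.

```python
import json

# Hypothetical canonical refusal answer used for OOD training pairs.
REFUSAL = "This form does not contain that information."

def to_vqa_records(image_path, annotation):
    """Turn one annotation dict into (image, question, answer) triplets,
    plus refusal pairs for questions the form cannot answer."""
    records = [
        {"image": image_path,
         "question": f"What is the {field.replace('_', ' ')} on this form?",
         "answer": value if value is not None else "null"}
        for field, value in annotation.items()
    ]
    # OOD refusal pairs teach the model to decline unanswerable questions.
    for ood_q in ("What is the diagnosis?", "Has this claim been approved?"):
        records.append({"image": image_path, "question": ood_q, "answer": REFUSAL})
    return records

def write_jsonl(records, path):
    """Serialize the triplets as JSON Lines for the training step."""
    with open(path, "w") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")
```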
### Training Configuration
```yaml
base_model: microsoft/Phi-3.5-vision-instruct
fine_tuning_method: QLoRA (NF4, double quantization)
lora_rank: 16
lora_alpha: 32
lora_dropout: 0.05
use_rslora: true
vision_layers: frozen
language_layers: adapted
optimizer: AdamW 8-bit (paged)
lr_scheduler: cosine
neftune_noise_alpha: 5
annotator: Vertex AI Gemini 2.5 Flash
framework: Unsloth + HuggingFace TRL
```
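For readers reproducing this configuration outside Unsloth, the YAML above maps onto a standard `peft` adapter config roughly as follows. This is a sketch of the equivalent settings, not the script that produced this model; the actual run used Unsloth's wrappers.

```python
# Hyperparameters mirroring the training configuration YAML above.
LORA = {"r": 16, "lora_alpha": 32, "lora_dropout": 0.05, "use_rslora": True}
TRAIN = {
    "neftune_noise_alpha": 5,      # TRL SFTConfig option
    "lr_scheduler_type": "cosine",
    "optim": "paged_adamw_8bit",   # paged 8-bit AdamW
}

def make_peft_config():
    """Build the equivalent peft LoraConfig (import deferred: needs the
    peft package, which is only required at training time)."""
    from peft import LoraConfig
    return LoraConfig(
        r=LORA["r"],
        lora_alpha=LORA["lora_alpha"],
        lora_dropout=LORA["lora_dropout"],
        use_rslora=LORA["use_rslora"],
        task_type="CAUSAL_LM",
    )
```

Note that `lora_alpha` is twice `r`, a common heuristic; with `use_rslora` enabled, scaling uses `alpha / sqrt(r)` instead of `alpha / r`.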
---
## πŸ“Š Evaluation Results
| Metric | Value |
|---|---|
| Exact Match (field extraction) | 0% |
| OOD Refusal Rate | 0% |
| Evaluation Set | Held-out MDF form pages |
**OOD Refusal Rate** measures how reliably the model declines to answer questions not answerable from the form (e.g. "What is the diagnosis?", "Has this claim been approved?").
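A minimal sketch of how these two metrics can be computed, assuming case-insensitive string comparison for exact match and a small set of refusal-phrase markers (the markers below are assumptions, not the phrases this model was actually trained to emit):

```python
# Hypothetical substrings treated as a refusal; adjust to the model's actual phrasing.
REFUSAL_MARKERS = ("not contain", "cannot be answered", "not answerable")

def exact_match(preds, golds):
    """Case-insensitive exact match over (prediction, gold) pairs, in percent."""
    hits = sum(p.strip().lower() == g.strip().lower() for p, g in zip(preds, golds))
    return 100.0 * hits / max(len(golds), 1)

def ood_refusal_rate(ood_preds):
    """Share of out-of-domain answers containing a refusal phrase, in percent."""
    refused = sum(any(m in p.lower() for m in REFUSAL_MARKERS) for p in ood_preds)
    return 100.0 * refused / max(len(ood_preds), 1)
```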
---
## ⚠️ Limitations
- **Domain-specific**: Trained exclusively on TriPlus Services MDF forms. Performance on other form types is not guaranteed.
- **Image quality**: Works best on scans β‰₯ 300 DPI. Very low-resolution or heavily degraded scans may reduce accuracy.
- **Language**: English only.
- **Redacted fields**: Returns `null` for blacked-out fields (insured name/policy number).
- **Not for medical diagnosis**: This model extracts administrative form data only.
---
## πŸ“„ License
This model is released under the **Apache 2.0 License**.
The base model ([microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct)) is also Apache 2.0.
---
## πŸ™ Acknowledgements
- [Unsloth](https://github.com/unslothai/unsloth) for 2-5Γ— faster fine-tuning
- [Microsoft Phi-3.5-Vision](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) for the base vision-language model
- [Vertex AI Gemini 2.5 Flash](https://cloud.google.com/vertex-ai) for dataset annotation
- [HuggingFace TRL](https://github.com/huggingface/trl) for SFTTrainer