---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- vision-language-model
- document-understanding
- handwritten-text
- insurance-forms
- vqa
- phi-3.5-vision
- lora
- qlora
- unsloth
- medical-forms
- ocr-free
pipeline_tag: image-text-to-text
base_model: microsoft/Phi-3.5-vision-instruct
datasets:
- custom-mdf-forms
metrics:
- exact_match
model-index:
- name: mdf-form-reader-phi35-vision
results:
- task:
type: visual-question-answering
name: Visual Question Answering (MDF Forms)
metrics:
- type: exact_match
value: 0
name: Exact Match (%)
- type: ood_refusal_rate
value: 0
name: OOD Refusal Rate (%)
---
# MDF Form Reader – Phi-3.5-Vision Fine-tuned
**Vision-native handwritten insurance form understanding, fine-tuned from [microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) using QLoRA.**
> **No OCR needed.** This model reads handwriting, checks checkbox states, and extracts structured data directly from scanned MDF (Monthly Disability Verification) form images.
---
## Model Summary
| Property | Value |
|---|---|
| **Base Model** | `microsoft/Phi-3.5-vision-instruct` (4.2B) |
| **Task** | Visual Question Answering on MDF forms |
| **Fine-tuning Method** | QLoRA (r=16, alpha=32) via Unsloth |
| **Quantization** | 4-bit NF4 (training) → 16-bit merged |
| **Annotator** | Vertex AI Gemini 2.5 Flash |
| **Exact Match** | 0% |
| **OOD Refusal Rate** | 0% |
| **License** | Apache 2.0 |
---
## Quick Start
```python
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import torch
model_id = "solvrays/mdf-form-reader-phi35-vision"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    trust_remote_code=True,
)
# Load your scanned MDF form image
image = Image.open("mdf_form.png").convert("RGB")
# Ask a question about the form
question = "What is the name of the physician who signed this form?"
messages = [{"role": "user", "content": f"<|image_1|>
{question}"}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.1)
answer = processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(answer)
```
---
## What is an MDF Form?
A **Monthly Disability Verification Form (Form 441.O.MDF.O)** is issued by TriPlus Services, acting as Third-Party Administrator of Penn Treaty Network America and American Network policies. It requires a licensed physician to certify a patient's ongoing disability status monthly.
### Key Fields Extracted
- Physician name, address, phone, fax
- Submission date range (from / to)
- Patient disability status (YES checked / NO checked)
- Disability end date (if applicable)
- Form completion date
- Physician signature presence
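These fields can be pulled one question at a time. Below is a minimal extraction sketch that reuses the `model` and `processor` objects from the Quick Start; the `FIELD_QUESTIONS` phrasings and the `ask` helper are illustrative assumptions, not the exact prompts used during fine-tuning:

```python
import torch

# `model` and `processor` are loaded exactly as in the Quick Start above.
# Question phrasings are assumptions; the fine-tuning prompts may differ.
FIELD_QUESTIONS = {
    "physician_name": "What is the name of the physician who signed this form?",
    "physician_phone": "What is the physician's phone number?",
    "date_from": "What is the start date of the submission date range?",
    "date_to": "What is the end date of the submission date range?",
    "disability_status": "Is the YES or the NO checkbox marked for the patient's disability status?",
    "completion_date": "On what date was the form completed?",
}

def ask(image, question, max_new_tokens=100):
    """Run one VQA query against the form image and return the decoded answer."""
    messages = [{"role": "user", "content": f"<|image_1|>\n{question}"}]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True).strip()

def extract_fields(image):
    """Query every key field one at a time and collect a flat record."""
    return {field: ask(image, q) for field, q in FIELD_QUESTIONS.items()}
```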
---
## Why Vision-Native vs. OCR?
| Challenge | OCR Approach | This Model |
|---|---|---|
| Cursive physician names | Fails ("Carnazzo", "Kruszka") | Reads directly from image |
| Checkbox state (YES/NO) | Misses (no text to extract) | Sees the ✓/✗ mark in context |
| Date grid cells (MM/DD/YYYY) | Digit confusion in small boxes | Layout-aware reading |
| Signature field | Garbage output | Correctly ignored |
| Handwritten addresses | High error rate | Contextual correction |
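As an example of the checkbox case, the state can be queried directly, with no OCR pass involved (this reuses the hypothetical `ask` helper from the extraction sketch above; the answer format is an assumption):

```python
# Ask about the disability-status checkboxes directly from the image.
status = ask(image, "Is the YES or the NO checkbox marked for the patient's disability status?")
print(status)  # expected output like "YES" or "NO", depending on the mark on the scan
```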
---
## Training Pipeline
```
Scanned MDF Form (PDF)
  → Image pre-processing (deskew, 300 DPI, bilateral denoise, CLAHE)
  → Vertex AI Gemini 2.5 Flash → structured JSON annotation
  → VQA triplet dataset (field extraction + OOD refusal pairs)
  → Phi-3.5-Vision + QLoRA (Unsloth, 2-5× faster, 80% less VRAM)
  → Merge adapters → full 16-bit model
  → HuggingFace Hub (safetensors)
```
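The pre-processing code itself is not shipped with this repository; the following OpenCV sketch shows one plausible implementation of the three steps named above (deskew, bilateral denoise, CLAHE). All parameter values are assumptions, and PDF rasterization at 300 DPI (e.g. with `pdf2image`) is assumed to happen beforehand:

```python
import cv2
import numpy as np

def preprocess_scan(gray: np.ndarray) -> np.ndarray:
    """Deskew, denoise, and contrast-normalize one rasterized grayscale page."""
    # Deskew: coarse skew estimate from the bounding box of the ink pixels.
    # (Points are (row, col); adequate for the small skew angles of flatbed scans.)
    coords = np.column_stack(np.where(gray < 128)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    if angle > 45:  # OpenCV >= 4.5 reports angles in (0, 90]; normalize
        angle -= 90
    h, w = gray.shape
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    gray = cv2.warpAffine(gray, M, (w, h), flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)

    # Bilateral filter: smooth scanner noise while preserving pen strokes.
    gray = cv2.bilateralFilter(gray, d=9, sigmaColor=75, sigmaSpace=75)

    # CLAHE: local contrast equalization to bring up faint handwriting.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return clahe.apply(gray)
```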
### Training Configuration
```yaml
base_model: microsoft/Phi-3.5-vision-instruct
fine_tuning_method: QLoRA (NF4, double quantization)
lora_rank: 16
lora_alpha: 32
lora_dropout: 0.05
use_rslora: true
vision_layers: frozen
language_layers: adapted
optimizer: AdamW 8-bit (paged)
lr_scheduler: cosine
neftune_noise_alpha: 5
annotator: Vertex AI Gemini 2.5 Flash
framework: Unsloth + HuggingFace TRL
```
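As a rough illustration, the configuration above might be wired up with Unsloth and TRL as follows. This is a hedged sketch, not the authors' training script: `train_dataset` is a placeholder for the private MDF VQA dataset, the vision data collator is omitted, and Phi-3.5-Vision support in your installed Unsloth version should be verified:

```python
from unsloth import FastVisionModel
from trl import SFTConfig, SFTTrainer

# Load the base model with 4-bit NF4 weights (Unsloth applies double quantization).
model, tokenizer = FastVisionModel.from_pretrained(
    "microsoft/Phi-3.5-vision-instruct",
    load_in_4bit=True,
)

# Attach QLoRA adapters: vision tower frozen, language layers adapted,
# matching the r=16 / alpha=32 / rsLoRA settings in the config above.
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=False,
    finetune_language_layers=True,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    use_rslora=True,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,           # newer TRL versions name this `processing_class`
    train_dataset=train_dataset,   # placeholder: the private MDF VQA triplet dataset
    args=SFTConfig(
        optim="paged_adamw_8bit",
        lr_scheduler_type="cosine",
        neftune_noise_alpha=5,
        output_dir="outputs",
    ),
)
trainer.train()

# Merge the LoRA adapters into full 16-bit weights before uploading.
model.save_pretrained_merged("mdf-form-reader-phi35-vision", tokenizer,
                             save_method="merged_16bit")
```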
---
## Evaluation Results
| Metric | Value |
|---|---|
| Exact Match (field extraction) | 0% |
| OOD Refusal Rate | 0% |
| Evaluation Set | Held-out MDF form pages |
**OOD Refusal Rate** measures how reliably the model declines to answer questions not answerable from the form (e.g. "What is the diagnosis?", "Has this claim been approved?").
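Both metrics reduce to simple counts over the held-out question/answer pairs. A sketch, where the `eval_pairs` record layout, the `ask` helper from above, and the refusal phrasing are all assumptions:

```python
REFUSAL_MARKER = "not stated on the form"  # assumed canonical refusal phrasing

def evaluate(eval_pairs, ask):
    """Compute Exact Match and OOD Refusal Rate over held-out VQA pairs.

    eval_pairs: list of dicts with keys "image", "question", "answer",
    and a boolean "is_ood" marking questions the model should refuse.
    """
    em_hits = em_total = refusals = ood_total = 0
    for pair in eval_pairs:
        pred = ask(pair["image"], pair["question"]).strip().lower()
        if pair["is_ood"]:
            ood_total += 1
            refusals += int(REFUSAL_MARKER in pred)
        else:
            em_total += 1
            em_hits += int(pred == pair["answer"].strip().lower())
    return {
        "exact_match_pct": 100 * em_hits / max(em_total, 1),
        "ood_refusal_rate_pct": 100 * refusals / max(ood_total, 1),
    }
```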
---
## Limitations
- **Domain-specific**: Trained exclusively on TriPlus Services MDF forms. Performance on other form types is not guaranteed.
- **Image quality**: Works best on scans ≥ 300 DPI. Very low-resolution or heavily degraded scans may reduce accuracy.
- **Language**: English only.
- **Redacted fields**: Returns `null` for blacked-out fields (insured name/policy number).
- **Not for medical diagnosis**: This model extracts administrative form data only.
---
## License
This model is released under the **Apache 2.0 License**.
The base model ([microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct)) is also Apache 2.0.
---
## Acknowledgements
- [Unsloth](https://github.com/unslothai/unsloth) for 2-5× faster fine-tuning
- [Microsoft Phi-3.5-Vision](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) for the base vision-language model
- [Vertex AI Gemini 2.5 Flash](https://cloud.google.com/vertex-ai) for dataset annotation
- [HuggingFace TRL](https://github.com/huggingface/trl) for SFTTrainer