---
language:
  - en
license: apache-2.0
library_name: transformers
tags:
  - vision-language-model
  - document-understanding
  - handwritten-text
  - insurance-forms
  - vqa
  - phi-3.5-vision
  - lora
  - qlora
  - unsloth
  - medical-forms
  - ocr-free
pipeline_tag: image-to-text
base_model: microsoft/Phi-3.5-vision-instruct
datasets:
  - custom-mdf-forms
metrics:
  - exact_match
model-index:
  - name: mdf-form-reader-phi35-vision
    results:
      - task:
          type: visual-question-answering
          name: Visual Question Answering (MDF Forms)
        metrics:
          - type: exact_match
            value: 0
            name: Exact Match (%)
          - type: ood_refusal_rate
            value: 0
            name: OOD Refusal Rate (%)
---

# MDF Form Reader: Phi-3.5-Vision Fine-tuned

**Vision-native handwritten insurance form understanding, fine-tuned from [microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) using QLoRA.**

> **No OCR needed.** This model reads handwriting, checks checkbox states, and extracts structured data directly from scanned MDF (Monthly Disability Verification) form images.

---

## 📋 Model Summary

| Property | Value |
|---|---|
| **Base Model** | `microsoft/Phi-3.5-vision-instruct` (4.2B) |
| **Task** | Visual Question Answering on MDF forms |
| **Fine-tuning Method** | QLoRA (r=16, alpha=32) via Unsloth |
| **Quantization** | 4-bit NF4 (training) → 16-bit merged |
| **Annotator** | Vertex AI Gemini 2.5 Flash |
| **Exact Match** | 0% |
| **OOD Refusal Rate** | 0% |
| **License** | Apache 2.0 |

---

## 🚀 Quick Start

```python
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import torch

model_id = "solvrays/mdf-form-reader-phi35-vision"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    trust_remote_code=True,
)

# Load your scanned MDF form image
image = Image.open("mdf_form.png").convert("RGB")

# Ask a question about the form
question = "What is the name of the physician who signed this form?"

messages = [{"role": "user", "content": f"<|image_1|>\n{question}"}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = processor(text=[text], images=[image], return_tensors="pt").to("cuda")

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=200, do_sample=False)  # greedy decoding for deterministic extraction

answer = processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(answer)
```

---

## πŸ₯ What is an MDF Form?

A **Monthly Disability Verification Form (Form 441.O.MDF.O)** is issued by TriPlus Services, acting as Third-Party Administrator of Penn Treaty Network America and American Network policies. It requires a licensed physician to certify a patient's ongoing disability status monthly.

### Key Fields Extracted

- Physician name, address, phone, fax
- Submission date range (from / to)
- Patient disability status (YES checked / NO checked)
- Disability end date (if applicable)
- Form completion date
- Physician signature presence
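
The fields above can be requested one at a time, or pulled in a single structured pass and validated downstream. A minimal validation sketch follows; the schema and field names here are illustrative assumptions, not the model's official output contract:

```python
# Illustrative schema for one extracted MDF form record.
# Field names are assumptions for this sketch, not an official contract.
EXPECTED_FIELDS = {
    "physician_name": str,
    "physician_phone": str,
    "date_from": str,          # MM/DD/YYYY
    "date_to": str,            # MM/DD/YYYY
    "disability_status": str,  # "YES" or "NO"
    "signature_present": bool,
}

def validate_extraction(record: dict) -> list[str]:
    """Return a list of problems found in an extracted record."""
    problems = []
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif record[field] is not None and not isinstance(record[field], expected_type):
            problems.append(f"wrong type for {field}: {type(record[field]).__name__}")
    if record.get("disability_status") not in ("YES", "NO", None):
        problems.append("disability_status must be YES or NO")
    return problems
```

Running the validator before writing records to a claims database catches partially answered forms early.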

---

## 🔬 Why Vision-Native vs OCR?

| Challenge | OCR Approach | This Model |
|---|---|---|
| Cursive physician names | Fails ("Carnazzo", "Kruszka") | Reads directly from image |
| Checkbox state (YES/NO) | Misses (no text to extract) | Sees the ✓/✗ mark in context |
| Date grid cells (MM/DD/YYYY) | Digit confusion in small boxes | Layout-aware reading |
| Signature field | Garbage output | Correctly ignored |
| Handwritten addresses | High error rate | Contextual correction |

---

## πŸ› οΈ Training Pipeline

```
Scanned MDF Form (PDF)
    ↓ Image pre-processing (render at 300 DPI, deskew, bilateral denoise, CLAHE)
    ↓ Vertex AI Gemini 2.5 Flash β†’ structured JSON annotation
    ↓ VQA triplet dataset (field extraction + OOD refusal pairs)
    ↓ Phi-3.5-Vision + QLoRA (Unsloth, 2-5× faster, 80% less VRAM)
    ↓ Merge adapters β†’ full 16-bit model
    ↓ HuggingFace Hub (safetensors)
```
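
The VQA-triplet step above pairs each annotated field with a natural-language question, and mixes in refusal pairs for questions the form cannot answer. A simplified sketch of that dataset construction; the question wording, field names, and refusal phrasing are assumptions for illustration:

```python
REFUSAL_ANSWER = "This information is not present on the MDF form."

# Questions the form cannot answer; the model should learn to refuse these.
OOD_QUESTIONS = [
    "What is the patient's diagnosis?",
    "Has this claim been approved?",
]

def build_vqa_pairs(annotation: dict) -> list[dict]:
    """Turn one structured JSON annotation into question/answer training pairs."""
    field_questions = {
        "physician_name": "What is the name of the physician who signed this form?",
        "date_from": "What is the start of the submission date range?",
    }
    pairs = [
        {"question": q, "answer": str(annotation[field])}
        for field, q in field_questions.items()
        if annotation.get(field) is not None
    ]
    # Out-of-distribution refusal pairs teach the model to decline politely.
    pairs += [{"question": q, "answer": REFUSAL_ANSWER} for q in OOD_QUESTIONS]
    return pairs
```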

### Training Configuration

```yaml
base_model: microsoft/Phi-3.5-vision-instruct
fine_tuning_method: QLoRA (NF4, double quantization)
lora_rank: 16
lora_alpha: 32
lora_dropout: 0.05
use_rslora: true
vision_layers: frozen
language_layers: adapted
optimizer: AdamW 8-bit (paged)
lr_scheduler: cosine
neftune_noise_alpha: 5
annotator: Vertex AI Gemini 2.5 Flash
framework: Unsloth + HuggingFace TRL
```
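
In `peft`/`transformers` terms, the settings above roughly correspond to the following configuration objects. This is a sketch assuming recent library versions; `target_modules` in particular is an assumption, since the exact adapted modules are not listed in this card:

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# NF4 with double quantization for QLoRA training, per the YAML above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapters on the language layers only (vision tower stays frozen).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    use_rslora=True,
    target_modules=["qkv_proj", "o_proj", "gate_up_proj", "down_proj"],  # assumed
    bias="none",
    task_type="CAUSAL_LM",
)
```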

---

## 📊 Evaluation Results

| Metric | Value |
|---|---|
| Exact Match (field extraction) | 0% |
| OOD Refusal Rate | 0% |
| Evaluation Set | Held-out MDF form pages |

**OOD Refusal Rate** measures how reliably the model declines to answer questions not answerable from the form (e.g. "What is the diagnosis?", "Has this claim been approved?").
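
Both metrics reduce to a few lines of normalization and counting. A minimal sketch; the normalization rules and refusal markers are assumptions, as the actual evaluation script is not published here:

```python
def normalize(text: str) -> str:
    """Lowercase, trim, and collapse whitespace before comparison."""
    return " ".join(text.lower().split())

def exact_match_rate(predictions: list[str], references: list[str]) -> float:
    """Percentage of predictions matching the reference after normalization."""
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)

def ood_refusal_rate(predictions: list[str]) -> float:
    """Percentage of OOD answers containing a refusal phrase (assumed markers)."""
    refusal_markers = ("not present", "cannot be determined", "not answerable")
    refused = sum(any(m in p.lower() for m in refusal_markers) for p in predictions)
    return 100.0 * refused / len(predictions)
```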

---

## ⚠️ Limitations

- **Domain-specific**: Trained exclusively on TriPlus Services MDF forms. Performance on other form types is not guaranteed.
- **Image quality**: Works best on scans ≥ 300 DPI. Very low-resolution or heavily degraded scans may reduce accuracy.
- **Language**: English only.
- **Redacted fields**: Returns `null` for blacked-out fields (insured name/policy number).
- **Not for medical diagnosis**: This model extracts administrative form data only.

---

## 📄 License

This model is released under the **Apache 2.0 License**.
The base model ([microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct)) is also Apache 2.0.

---

## πŸ™ Acknowledgements

- [Unsloth](https://github.com/unslothai/unsloth) for 2-5Γ— faster fine-tuning
- [Microsoft Phi-3.5-Vision](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) for the base vision-language model
- [Vertex AI Gemini 2.5 Flash](https://cloud.google.com/vertex-ai) for dataset annotation
- [HuggingFace TRL](https://github.com/huggingface/trl) for SFTTrainer