---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- vision-language-model
- document-understanding
- handwritten-text
- insurance-forms
- vqa
- phi-3.5-vision
- lora
- qlora
- unsloth
- medical-forms
- ocr-free
pipeline_tag: image-text-to-text
base_model: microsoft/Phi-3.5-vision-instruct
datasets:
- custom-mdf-forms
metrics:
- exact_match
model-index:
- name: mdf-form-reader-phi35-vision
results:
- task:
type: visual-question-answering
name: Visual Question Answering (MDF Forms)
metrics:
- type: exact_match
value: 0
name: Exact Match (%)
- type: ood_refusal_rate
value: 0
name: OOD Refusal Rate (%)
---
# MDF Form Reader – Phi-3.5-Vision Fine-tuned
**Vision-native handwritten insurance form understanding, fine-tuned from [microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) using QLoRA.**
> **No OCR needed.** This model reads handwriting, checks checkbox states, and extracts structured data directly from scanned MDF (Monthly Disability Verification) form images.
---
## Model Summary
| Property | Value |
|---|---|
| **Base Model** | `microsoft/Phi-3.5-vision-instruct` (4.2B) |
| **Task** | Visual Question Answering on MDF forms |
| **Fine-tuning Method** | QLoRA (r=16, alpha=32) via Unsloth |
| **Quantization** | 4-bit NF4 (training) → 16-bit merged |
| **Annotator** | Vertex AI Gemini 2.5 Flash |
| **Exact Match** | 0% |
| **OOD Refusal Rate** | 0% |
| **License** | Apache 2.0 |
---
## Quick Start
```python
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import torch
model_id = "solvrays/mdf-form-reader-phi35-vision"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    trust_remote_code=True,
)
# Load your scanned MDF form image
image = Image.open("mdf_form.png").convert("RGB")
# Ask a question about the form
question = "What is the name of the physician who signed this form?"
messages = [{"role": "user", "content": f"<|image_1|>
{question}"}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.1)
answer = processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(answer)
```
---
## What is an MDF Form?
A **Monthly Disability Verification Form (Form 441.O.MDF.O)** is issued by TriPlus Services, acting as Third-Party Administrator of Penn Treaty Network America and American Network policies. It requires a licensed physician to certify a patient's ongoing disability status monthly.
### Key Fields Extracted
- Physician name, address, phone, fax
- Submission date range (from / to)
- Patient disability status (YES checked / NO checked)
- Disability end date (if applicable)
- Form completion date
- Physician signature presence
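These fields can be pulled one question at a time. Below is a minimal extraction sketch that reuses the `model` and `processor` objects from the Quick Start; the `FIELD_QUESTIONS` phrasings and the `ask` helper are illustrative assumptions, not the exact prompts used during fine-tuning:

```python
import torch

# `model` and `processor` are loaded exactly as in the Quick Start above.
# Question phrasings are assumptions; the fine-tuning prompts may differ.
FIELD_QUESTIONS = {
    "physician_name": "What is the name of the physician who signed this form?",
    "physician_phone": "What is the physician's phone number?",
    "date_from": "What is the start date of the submission date range?",
    "date_to": "What is the end date of the submission date range?",
    "disability_status": "Is the YES or the NO checkbox marked for the patient's disability status?",
    "completion_date": "On what date was the form completed?",
}

def ask(image, question, max_new_tokens=100):
    """Run one VQA query against the form image and return the decoded answer."""
    messages = [{"role": "user", "content": f"<|image_1|>\n{question}"}]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True).strip()

def extract_fields(image):
    """Query every key field one at a time and collect a flat record."""
    return {field: ask(image, q) for field, q in FIELD_QUESTIONS.items()}
```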
---
## Why Vision-Native vs. OCR?
| Challenge | OCR Approach | This Model |
|---|---|---|
| Cursive physician names | Fails ("Carnazzo", "Kruszka") | Reads directly from image |
| Checkbox state (YES/NO) | Misses (no text to extract) | Sees the ✓/✗ mark in context |
| Date grid cells (MM/DD/YYYY) | Digit confusion in small boxes | Layout-aware reading |
| Signature field | Garbage output | Correctly ignored |
| Handwritten addresses | High error rate | Contextual correction |
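As an example of the checkbox case, the state can be queried directly, with no OCR pass involved (this reuses the hypothetical `ask` helper from the extraction sketch above; the answer format is an assumption):

```python
# Ask about the disability-status checkboxes directly from the image.
status = ask(image, "Is the YES or the NO checkbox marked for the patient's disability status?")
print(status)  # expected output like "YES" or "NO", depending on the mark on the scan
```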
---
## Training Pipeline
```
Scanned MDF Form (PDF)
  → Image pre-processing (deskew, 300 DPI, bilateral denoise, CLAHE)
  → Vertex AI Gemini 2.5 Flash → structured JSON annotation
  → VQA triplet dataset (field extraction + OOD refusal pairs)
  → Phi-3.5-Vision + QLoRA (Unsloth, 2-5× faster, 80% less VRAM)
  → Merge adapters → full 16-bit model
  → HuggingFace Hub (safetensors)
```
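The pre-processing code itself is not shipped with this repository; the following OpenCV sketch shows one plausible implementation of the three steps named above (deskew, bilateral denoise, CLAHE). All parameter values are assumptions, and PDF rasterization at 300 DPI (e.g. with `pdf2image`) is assumed to happen beforehand:

```python
import cv2
import numpy as np

def preprocess_scan(gray: np.ndarray) -> np.ndarray:
    """Deskew, denoise, and contrast-normalize one rasterized grayscale page."""
    # Deskew: coarse skew estimate from the bounding box of the ink pixels.
    # (Points are (row, col); adequate for the small skew angles of flatbed scans.)
    coords = np.column_stack(np.where(gray < 128)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    if angle > 45:  # OpenCV >= 4.5 reports angles in (0, 90]; normalize
        angle -= 90
    h, w = gray.shape
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    gray = cv2.warpAffine(gray, M, (w, h), flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)

    # Bilateral filter: smooth scanner noise while preserving pen strokes.
    gray = cv2.bilateralFilter(gray, d=9, sigmaColor=75, sigmaSpace=75)

    # CLAHE: local contrast equalization to bring up faint handwriting.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return clahe.apply(gray)
```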
### Training Configuration
```yaml
base_model: microsoft/Phi-3.5-vision-instruct
fine_tuning_method: QLoRA (NF4, double quantization)
lora_rank: 16
lora_alpha: 32
lora_dropout: 0.05
use_rslora: true
vision_layers: frozen
language_layers: adapted
optimizer: AdamW 8-bit (paged)
lr_scheduler: cosine
neftune_noise_alpha: 5
annotator: Vertex AI Gemini 2.5 Flash
framework: Unsloth + HuggingFace TRL
```
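As a rough illustration, the configuration above might be wired up with Unsloth and TRL as follows. This is a hedged sketch, not the authors' training script: `train_dataset` is a placeholder for the private MDF VQA dataset, the vision data collator is omitted, and Phi-3.5-Vision support in your installed Unsloth version should be verified:

```python
from unsloth import FastVisionModel
from trl import SFTConfig, SFTTrainer

# Load the base model with 4-bit NF4 weights (Unsloth applies double quantization).
model, tokenizer = FastVisionModel.from_pretrained(
    "microsoft/Phi-3.5-vision-instruct",
    load_in_4bit=True,
)

# Attach QLoRA adapters: vision tower frozen, language layers adapted,
# matching the r=16 / alpha=32 / rsLoRA settings in the config above.
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=False,
    finetune_language_layers=True,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    use_rslora=True,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,           # newer TRL versions name this `processing_class`
    train_dataset=train_dataset,   # placeholder: the private MDF VQA triplet dataset
    args=SFTConfig(
        optim="paged_adamw_8bit",
        lr_scheduler_type="cosine",
        neftune_noise_alpha=5,
        output_dir="outputs",
    ),
)
trainer.train()

# Merge the LoRA adapters into full 16-bit weights before uploading.
model.save_pretrained_merged("mdf-form-reader-phi35-vision", tokenizer,
                             save_method="merged_16bit")
```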
---
## Evaluation Results
| Metric | Value |
|---|---|
| Exact Match (field extraction) | 0% |
| OOD Refusal Rate | 0% |
| Evaluation Set | Held-out MDF form pages |
**OOD Refusal Rate** measures how reliably the model declines to answer questions not answerable from the form (e.g. "What is the diagnosis?", "Has this claim been approved?").
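Both metrics reduce to simple counts over the held-out question/answer pairs. A sketch, where the `eval_pairs` record layout, the `ask` helper from above, and the refusal phrasing are all assumptions:

```python
REFUSAL_MARKER = "not stated on the form"  # assumed canonical refusal phrasing

def evaluate(eval_pairs, ask):
    """Compute Exact Match and OOD Refusal Rate over held-out VQA pairs.

    eval_pairs: list of dicts with keys "image", "question", "answer",
    and a boolean "is_ood" marking questions the model should refuse.
    """
    em_hits = em_total = refusals = ood_total = 0
    for pair in eval_pairs:
        pred = ask(pair["image"], pair["question"]).strip().lower()
        if pair["is_ood"]:
            ood_total += 1
            refusals += int(REFUSAL_MARKER in pred)
        else:
            em_total += 1
            em_hits += int(pred == pair["answer"].strip().lower())
    return {
        "exact_match_pct": 100 * em_hits / max(em_total, 1),
        "ood_refusal_rate_pct": 100 * refusals / max(ood_total, 1),
    }
```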
---
## Limitations
- **Domain-specific**: Trained exclusively on TriPlus Services MDF forms. Performance on other form types is not guaranteed.
- **Image quality**: Works best on scans ≥ 300 DPI. Very low-resolution or heavily degraded scans may reduce accuracy.
- **Language**: English only.
- **Redacted fields**: Returns `null` for blacked-out fields (insured name/policy number).
- **Not for medical diagnosis**: This model extracts administrative form data only.
---
## License
This model is released under the **Apache 2.0 License**.
The base model ([microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct)) is also Apache 2.0.
---
## Acknowledgements
- [Unsloth](https://github.com/unslothai/unsloth) for 2-5× faster fine-tuning
- [Microsoft Phi-3.5-Vision](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) for the base vision-language model
- [Vertex AI Gemini 2.5 Flash](https://cloud.google.com/vertex-ai) for dataset annotation
- [HuggingFace TRL](https://github.com/huggingface/trl) for SFTTrainer