## How to Use This Model

### With Transformers

```python
# Use a pipeline as a high-level helper.
# Warning: the "image-to-text" pipeline type is no longer supported in transformers v5.
# Load the model directly (see below) or downgrade to v4.x with:
#   pip install "transformers<5.0.0"
from transformers import pipeline

pipe = pipeline("image-to-text", model="solvrays/mdf-form-reader-phi35-vision")
```

```python
# Load the model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("solvrays/mdf-form-reader-phi35-vision")
model = AutoModelForImageTextToText.from_pretrained("solvrays/mdf-form-reader-phi35-vision")
```

### With Unsloth Studio

Install Unsloth Studio (macOS, Linux, WSL):

```bash
curl -fsSL https://unsloth.ai/install.sh | sh

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# and search for solvrays/mdf-form-reader-phi35-vision to start chatting.
```

Install Unsloth Studio (Windows):

```powershell
irm https://unsloth.ai/install.ps1 | iex

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# and search for solvrays/mdf-form-reader-phi35-vision to start chatting.
```

Using Hugging Face Spaces for Unsloth: no setup required. Open https://huggingface.co/spaces/unsloth/studio in your browser and search for solvrays/mdf-form-reader-phi35-vision to start chatting.

Load the model with FastModel:

```python
# pip install unsloth
from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name="solvrays/mdf-form-reader-phi35-vision",
    max_seq_length=2048,
)
```
---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- vision-language-model
- document-understanding
- handwritten-text
- insurance-forms
- vqa
- phi-3.5-vision
- lora
- qlora
- unsloth
- medical-forms
- ocr-free
pipeline_tag: image-to-text
base_model: microsoft/Phi-3.5-vision-instruct
datasets:
- custom-mdf-forms
metrics:
- exact_match
model-index:
- name: mdf-form-reader-phi35-vision
  results:
  - task:
      type: visual-question-answering
      name: Visual Question Answering (MDF Forms)
    metrics:
    - type: exact_match
      value: 0
      name: Exact Match (%)
    - type: ood_refusal_rate
      value: 0
      name: OOD Refusal Rate (%)
---
# MDF Form Reader – Phi-3.5-Vision Fine-tuned

**Vision-native handwritten insurance form understanding, fine-tuned from [microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) using QLoRA.**

> **No OCR needed.** This model reads handwriting, checks checkbox states, and extracts structured data directly from scanned MDF (Monthly Disability Verification) form images.

---

## Model Summary
| Property | Value |
|---|---|
| **Base Model** | `microsoft/Phi-3.5-vision-instruct` (4.2B) |
| **Task** | Visual Question Answering on MDF forms |
| **Fine-tuning Method** | QLoRA (r=16, alpha=32) via Unsloth |
| **Quantization** | 4-bit NF4 (training) → 16-bit merged |
| **Annotator** | Vertex AI Gemini 2.5 Flash |
| **Exact Match** | 0% |
| **OOD Refusal Rate** | 0% |
| **License** | Apache 2.0 |

---
## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import torch

model_id = "solvrays/mdf-form-reader-phi35-vision"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    trust_remote_code=True,
)

# Load your scanned MDF form image
image = Image.open("mdf_form.png").convert("RGB")

# Ask a question about the form
question = "What is the name of the physician who signed this form?"
messages = [{"role": "user", "content": f"<|image_1|>\n{question}"}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to("cuda")

with torch.no_grad():
    # do_sample=True is required for temperature to take effect
    out = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.1)

answer = processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(answer)
```
---

## What is an MDF Form?

A **Monthly Disability Verification Form (Form 441.O.MDF.O)** is issued by TriPlus Services, acting as Third-Party Administrator of Penn Treaty Network America and American Network policies. It requires a licensed physician to certify a patient's ongoing disability status monthly.

### Key Fields Extracted

- Physician name, address, phone, fax
- Submission date range (from / to)
- Patient disability status (YES checked / NO checked)
- Disability end date (if applicable)
- Form completion date
- Physician signature presence
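The fields above can be requested in one structured-extraction query. A minimal sketch of such a prompt follows; the field names and JSON-schema wording are illustrative assumptions, since the exact prompts used during training are not published.

```python
# Hypothetical target schema for the key fields above. Treat these key
# names as placeholders, not the trained field vocabulary.
MDF_FIELDS = {
    "physician_name": None,
    "physician_address": None,
    "physician_phone": None,
    "physician_fax": None,
    "submission_from_date": None,   # MM/DD/YYYY
    "submission_to_date": None,     # MM/DD/YYYY
    "disability_status": None,      # "YES" or "NO" (checkbox state)
    "disability_end_date": None,    # only if applicable
    "form_completion_date": None,
    "signature_present": None,      # true / false
}

def extraction_prompt(fields: dict) -> str:
    """Build a single question asking the model for all fields as JSON."""
    keys = ", ".join(fields)
    return (
        "<|image_1|>\nExtract the following fields from this MDF form "
        f"and answer with a JSON object: {keys}. "
        "Use null for redacted or empty fields."
    )

prompt = extraction_prompt(MDF_FIELDS)
print(prompt)
```

The prompt string can then be passed through `apply_chat_template` exactly as in the Quick Start example.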
---

## Why Vision-Native vs OCR?

| Challenge | OCR Approach | This Model |
|---|---|---|
| Cursive physician names | Fails ("Carnazzo", "Kruszka") | Reads directly from image |
| Checkbox state (YES/NO) | Misses (no text to extract) | Sees the ✓/✗ mark in context |
| Date grid cells (MM/DD/YYYY) | Digit confusion in small boxes | Layout-aware reading |
| Signature field | Garbage output | Correctly ignored |
| Handwritten addresses | High error rate | Contextual correction |

---
## Training Pipeline

```
Scanned MDF Form (PDF)
  ↓ Image pre-processing (deskew 300 DPI, bilateral denoise, CLAHE)
  ↓ Vertex AI Gemini 2.5 Flash → structured JSON annotation
  ↓ VQA triplet dataset (field extraction + OOD refusal pairs)
  ↓ Phi-3.5-Vision + QLoRA (Unsloth, 2-5× faster, 80% less VRAM)
  ↓ Merge adapters → full 16-bit model
  ↓ HuggingFace Hub (safetensors)
```
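The "VQA triplet dataset" step above can be sketched as follows: each annotated form yields (image, question, answer) records for field extraction, plus paired out-of-distribution (OOD) questions the model must learn to refuse. The question templates and refusal wording here are illustrative, not the exact strings used in training.

```python
# Illustrative question templates for a few of the extracted fields.
FIELD_QUESTIONS = {
    "physician_name": "What is the name of the physician who signed this form?",
    "disability_status": "Is the patient still disabled according to this form?",
    "form_completion_date": "On what date was this form completed?",
}

# Questions that are NOT answerable from the form image.
OOD_QUESTIONS = [
    "What is the patient's diagnosis?",
    "Has this claim been approved?",
]

REFUSAL = "This information is not present on the form."

def build_triplets(image_path: str, annotation: dict) -> list[dict]:
    """Turn one Gemini-annotated form into VQA training records."""
    records = []
    for field, question in FIELD_QUESTIONS.items():
        if field in annotation:
            records.append({
                "image": image_path,
                "question": question,
                "answer": str(annotation[field]),
            })
    # OOD refusal pairs teach the model to decline unanswerable questions.
    for question in OOD_QUESTIONS:
        records.append({"image": image_path, "question": question, "answer": REFUSAL})
    return records

triplets = build_triplets(
    "mdf_form.png",
    {"physician_name": "J. Carnazzo", "disability_status": "YES"},
)
print(len(triplets))  # 2 field records + 2 refusal records = 4
```

Mixing refusal pairs into the same dataset is what makes the OOD Refusal Rate metric below trainable rather than emergent.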
### Training Configuration

```yaml
base_model: microsoft/Phi-3.5-vision-instruct
fine_tuning_method: QLoRA (NF4, double quantization)
lora_rank: 16
lora_alpha: 32
lora_dropout: 0.05
use_rslora: true
vision_layers: frozen
language_layers: adapted
optimizer: AdamW 8-bit (paged)
lr_scheduler: cosine
neftune_noise_alpha: 5
annotator: Vertex AI Gemini 2.5 Flash
framework: Unsloth + HuggingFace TRL
```
---

## Evaluation Results

| Metric | Value |
|---|---|
| Exact Match (field extraction) | 0% |
| OOD Refusal Rate | 0% |
| Evaluation Set | Held-out MDF form pages |

**OOD Refusal Rate** measures how reliably the model declines to answer questions not answerable from the form (e.g. "What is the diagnosis?", "Has this claim been approved?").
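For reference, both metrics can be computed from raw model outputs as below. The text normalization and the refusal-detection phrases are assumptions; the card does not specify how answers were scored.

```python
def normalize(text: str) -> str:
    """Case-fold and collapse whitespace before comparison (assumed scoring rule)."""
    return " ".join(text.strip().lower().split())

def exact_match_rate(predictions: list[str], references: list[str]) -> float:
    """Percentage of field-extraction answers matching the reference exactly."""
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)

def ood_refusal_rate(predictions: list[str]) -> float:
    """Percentage of OOD questions where the model declined to answer."""
    refusal_markers = ("not present on the form", "cannot be determined")
    refused = sum(any(m in normalize(p) for m in refusal_markers) for p in predictions)
    return 100.0 * refused / len(predictions)

em = exact_match_rate(["Dr. Kruszka", "05/14/2024"], ["Dr. Kruszka", "05/15/2024"])
rr = ood_refusal_rate(["This is not present on the form.", "Approved."])
print(em, rr)  # 50.0 50.0
```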
---

## Limitations

- **Domain-specific**: Trained exclusively on TriPlus Services MDF forms. Performance on other form types is not guaranteed.
- **Image quality**: Works best on scans ≥ 300 DPI. Very low-resolution or heavily degraded scans may reduce accuracy.
- **Language**: English only.
- **Redacted fields**: Returns `null` for blacked-out fields (insured name/policy number).
- **Not for medical diagnosis**: This model extracts administrative form data only.
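Because redacted fields come back as the literal string `null`, downstream code usually wants to map that to a real `None`. A minimal post-processing helper, assuming the marker strings below cover what the model emits:

```python
def parse_field(answer: str):
    """Map a raw model answer to a cleaned value, or None for redacted/empty fields.

    The marker set here is an assumption about the model's output
    conventions, not documented behavior.
    """
    cleaned = answer.strip()
    if cleaned.lower() in {"null", "none", "n/a", ""}:
        return None  # redacted or empty field
    return cleaned

print(parse_field("null"))            # None
print(parse_field(" Dr. Carnazzo "))  # Dr. Carnazzo
```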
---

## License

This model is released under the **Apache 2.0 License**.
The base model ([microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct)) is also Apache 2.0.

---

## Acknowledgements

- [Unsloth](https://github.com/unslothai/unsloth) for 2-5× faster fine-tuning
- [Microsoft Phi-3.5-Vision](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) for the base vision-language model
- [Vertex AI Gemini 2.5 Flash](https://cloud.google.com/vertex-ai) for dataset annotation
- [HuggingFace TRL](https://github.com/huggingface/trl) for SFTTrainer