---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- vision-language-model
- document-understanding
- handwritten-text
- insurance-forms
- vqa
- phi-3.5-vision
- lora
- qlora
- unsloth
- medical-forms
- ocr-free
pipeline_tag: image-to-text
base_model: microsoft/Phi-3.5-vision-instruct
datasets:
- custom-mdf-forms
metrics:
- exact_match
model-index:
- name: mdf-form-reader-phi35-vision
  results:
  - task:
      type: visual-question-answering
      name: Visual Question Answering (MDF Forms)
    metrics:
    - type: exact_match
      value: 0
      name: Exact Match (%)
    - type: ood_refusal_rate
      value: 0
      name: OOD Refusal Rate (%)
---

# MDF Form Reader — Phi-3.5-Vision Fine-tuned

**Vision-native handwritten insurance form understanding, fine-tuned from [microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) using QLoRA.**

> **No OCR needed.** This model reads handwriting, checks checkbox states, and extracts structured data directly from scanned MDF (Monthly Disability Verification) form images.

---

## 📋 Model Summary

| Property | Value |
|---|---|
| **Base Model** | `microsoft/Phi-3.5-vision-instruct` (4.2B params) |
| **Task** | Visual Question Answering on MDF forms |
| **Fine-tuning Method** | QLoRA (r=16, alpha=32) via Unsloth |
| **Quantization** | 4-bit NF4 (training) → 16-bit merged |
| **Annotator** | Vertex AI Gemini 2.5 Flash |
| **Exact Match** | 0% |
| **OOD Refusal Rate** | 0% |
| **License** | Apache 2.0 |

---

## 🚀 Quick Start

```python
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import torch

model_id = "solvrays/mdf-form-reader-phi35-vision"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    trust_remote_code=True,
)

# Load your scanned MDF form image
image = Image.open("mdf_form.png").convert("RGB")

# Ask a question about the form
question = "What is the name of the physician who signed this form?"
messages = [{"role": "user", "content": f"<|image_1|>\n{question}"}]

prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(prompt, [image], return_tensors="pt").to("cuda")

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.1)

answer = processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(answer)
```

---

## 🏥 What is an MDF Form?

A **Monthly Disability Verification Form (Form 441.O.MDF.O)** is issued by TriPlus Services, acting as Third-Party Administrator for Penn Treaty Network America and American Network policies. It requires a licensed physician to certify a patient's ongoing disability status each month.

### Key Fields Extracted

- Physician name, address, phone, fax
- Submission date range (from / to)
- Patient disability status (YES checked / NO checked)
- Disability end date (if applicable)
- Form completion date
- Physician signature presence
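Each field above maps to a single VQA prompt. The sketch below drives the whole checklist in one loop, reusing the `model` and `processor` objects from the Quick Start; the `FIELD_QUESTIONS` mapping and the `extract_fields` helper are illustrative assumptions, not part of the released model's API or training schema.

```python
# Hypothetical field -> question mapping (illustrative, not the full schema)
FIELD_QUESTIONS = {
    "physician_name": "What is the name of the physician who signed this form?",
    "physician_phone": "What is the physician's phone number?",
    "disability_status": "Is the YES checkbox or the NO checkbox marked?",
    "completion_date": "On what date was this form completed?",
}

def extract_fields(image, model, processor):
    """Ask one question per field and collect the raw string answers."""
    results = {}
    for field, question in FIELD_QUESTIONS.items():
        messages = [{"role": "user", "content": f"<|image_1|>\n{question}"}]
        prompt = processor.tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        inputs = processor(prompt, [image], return_tensors="pt").to(model.device)
        # Greedy decoding: deterministic output for field extraction
        out = model.generate(**inputs, max_new_tokens=100, do_sample=False)
        results[field] = processor.decode(
            out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        ).strip()
    return results
```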
---

## 🔬 Why Vision-Native vs OCR?

| Challenge | OCR Approach | This Model |
|---|---|---|
| Cursive physician names | Fails ("Carnazzo", "Kruszka") | Reads directly from the image |
| Checkbox state (YES/NO) | Misses (no text to extract) | Sees the ✓/✗ mark in context |
| Date grid cells (MM/DD/YYYY) | Digit confusion in small boxes | Layout-aware reading |
| Signature field | Garbage output | Correctly ignored |
| Handwritten addresses | High error rate | Contextual correction |

---

## 🛠️ Training Pipeline

```
Scanned MDF form (PDF)
        ↓
Image pre-processing (300 DPI rasterization, deskew, bilateral denoise, CLAHE)
        ↓
Vertex AI Gemini 2.5 Flash → structured JSON annotation
        ↓
VQA triplet dataset (field extraction + OOD refusal pairs)
        ↓
Phi-3.5-Vision + QLoRA (Unsloth, 2-5× faster, 80% less VRAM)
        ↓
Merge adapters → full 16-bit model
        ↓
HuggingFace Hub (safetensors)
```

An illustrative sketch of the pre-processing step appears at the end of this card.

### Training Configuration

```yaml
base_model: microsoft/Phi-3.5-vision-instruct
fine_tuning_method: QLoRA (NF4, double quantization)
lora_rank: 16
lora_alpha: 32
lora_dropout: 0.05
use_rslora: true
vision_layers: frozen
language_layers: adapted
optimizer: AdamW 8-bit (paged)
lr_scheduler: cosine
neftune_noise_alpha: 5
annotator: Vertex AI Gemini 2.5 Flash
framework: Unsloth + HuggingFace TRL
```

---

## 📊 Evaluation Results

| Metric | Value |
|---|---|
| Exact Match (field extraction) | 0% |
| OOD Refusal Rate | 0% |
| Evaluation Set | Held-out MDF form pages |

**OOD Refusal Rate** measures how reliably the model declines to answer questions that cannot be answered from the form (e.g., "What is the diagnosis?", "Has this claim been approved?").

---

## ⚠️ Limitations

- **Domain-specific**: Trained exclusively on TriPlus Services MDF forms. Performance on other form types is not guaranteed.
- **Image quality**: Works best on scans ≥ 300 DPI. Very low-resolution or heavily degraded scans may reduce accuracy.
- **Language**: English only.
- **Redacted fields**: Returns `null` for blacked-out fields (insured name / policy number).
- **Not for medical diagnosis**: This model extracts administrative form data only.

---

## 📄 License

This model is released under the **Apache 2.0 License**. The base model ([microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct)) is released under the MIT License.

---

## 🙏 Acknowledgements

- [Unsloth](https://github.com/unslothai/unsloth) for 2-5× faster fine-tuning
- [Microsoft Phi-3.5-Vision](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) for the base vision-language model
- [Vertex AI Gemini 2.5 Flash](https://cloud.google.com/vertex-ai) for dataset annotation
- [HuggingFace TRL](https://github.com/huggingface/trl) for SFTTrainer
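---

## 🧪 Appendix: Pre-processing Sketch

The training pipeline above only names the pre-processing steps (300 DPI rasterization, deskew, bilateral denoise, CLAHE). The sketch below shows one plausible OpenCV implementation; the function name, parameter values, and the deskew heuristic are assumptions for illustration, not the card's documented settings.

```python
import cv2
import numpy as np

def preprocess_scan(path: str) -> np.ndarray:
    """Deskew, denoise, and contrast-normalize one scanned form page."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)

    # Deskew: estimate the page angle from the ink pixels.
    # Assumes the OpenCV >= 4.5 minAreaRect convention (angle in (0, 90]).
    coords = np.column_stack(np.where(img < 128))
    angle = cv2.minAreaRect(coords)[-1]
    if angle > 45:
        angle -= 90
    h, w = img.shape
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    img = cv2.warpAffine(img, M, (w, h), flags=cv2.INTER_CUBIC,
                         borderMode=cv2.BORDER_CONSTANT, borderValue=255)

    # Bilateral filter: smooth scanner noise while preserving pen-stroke edges.
    img = cv2.bilateralFilter(img, d=9, sigmaColor=75, sigmaSpace=75)

    # CLAHE: lift faint handwriting without blowing out the white background.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return clahe.apply(img)
```

The parameter values here (`d=9`, `sigmaColor=75`, `clipLimit=2.0`) are common defaults for document scans and would need tuning against real MDF pages.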