Insurance Policy PDF Extractor
A robust pipeline to extract structured JSON data from two-wheeler insurance policy PDFs.
Architecture
Hybrid Approach:
- Text Extraction - Extract text from the PDF using pdfplumber/PyMuPDF
- Regex Extraction - Extract known fields using domain-specific regex patterns
- VLM Fallback - Use Qwen2.5-VL-3B-Instruct for fields not found via regex (handles scanned/image-based PDFs)
- Merge & Validate - Combine results, apply validation rules (VIN=17 chars, Engine=8-15 chars, etc.)
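The regex-first, fallback-second flow can be sketched roughly as below. This is a minimal illustration, not the code in extract.py: the patterns, the names `regex_extract` and `fill_missing`, and the lambda standing in for the VLM call are all hypothetical.

```python
import re
from typing import Callable, Dict, List

NOT_FOUND = "Not Found"

# Hypothetical patterns, for illustration only; real policy layouts differ by insurer.
PATTERNS: Dict[str, re.Pattern] = {
    "policy_number": re.compile(
        r"Policy\s*(?:No\.?|Number)\s*[:\-]?\s*([A-Z0-9/\-]+)", re.I
    ),
    "chassis_number": re.compile(
        r"Chassis\s*(?:No\.?|Number)\s*[:\-]?\s*([A-Z0-9]{17})\b", re.I
    ),
    "engine_number": re.compile(
        r"Engine\s*(?:No\.?|Number)\s*[:\-]?\s*([A-Z0-9]{8,15})\b", re.I
    ),
}


def regex_extract(text: str) -> Dict[str, str]:
    """Run every pattern over the text; unmatched fields become NOT_FOUND."""
    result = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        result[field] = match.group(1) if match else NOT_FOUND
    return result


def fill_missing(
    result: Dict[str, str],
    fallback: Callable[[List[str]], Dict[str, str]],
) -> Dict[str, str]:
    """Ask the fallback only for fields regex missed, then merge.

    Regex hits are never overwritten; fields the fallback cannot
    supply stay NOT_FOUND.
    """
    missing = [field for field, value in result.items() if value == NOT_FOUND]
    merged = dict(result)
    if not missing:
        return merged  # nothing to ask the (expensive) fallback for
    extra = fallback(missing)
    for field in missing:
        merged[field] = extra.get(field) or NOT_FOUND
    return merged


text = "Policy No: P123/45\nEngine No. JF50E1234567"
fields = regex_extract(text)
# Stand-in for the VLM call: it only sees the list of missing field names.
merged = fill_missing(fields, lambda missing: {"chassis_number": "ME4JF501AB1234567"})
```

Calling the fallback only for missing fields keeps the slower model out of the loop whenever the text layer of the PDF already yields everything.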
Installation
pip install -r requirements.txt
Usage
python extract.py sample_policy.pdf
Or programmatically:
from extract import extract_policy
import json
result = extract_policy("path/to/policy.pdf", use_vlm=True)
print(json.dumps(result, indent=2))
Output Schema
{
"policy_number": "",
"customer_name": "",
"customer_address": "",
"make": "",
"model": "",
"variant": "",
"cc": "",
"manufacturing_year": "",
"engine_number": "",
"chassis_number": "",
"total_insured_value": "",
"total_premium_amount": "",
"breakdown": {
"own_damage_premium": "",
"liability_premium": ""
},
"financier_name": "",
"addons": {
"nil_dep": ""
}
}
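One way to guarantee that every result has exactly this shape is to conform partial results against a template. The sketch below is an assumption about how that could be done, not the repository's code: `SCHEMA_TEMPLATE` and `conform_to_schema` are hypothetical names, and the "Not Found" default follows the missing-value convention this README describes.

```python
from typing import Any, Dict

NOT_FOUND = "Not Found"

# Mirrors the output schema above; empty strings mark leaf fields.
SCHEMA_TEMPLATE: Dict[str, Any] = {
    "policy_number": "",
    "customer_name": "",
    "customer_address": "",
    "make": "",
    "model": "",
    "variant": "",
    "cc": "",
    "manufacturing_year": "",
    "engine_number": "",
    "chassis_number": "",
    "total_insured_value": "",
    "total_premium_amount": "",
    "breakdown": {"own_damage_premium": "", "liability_premium": ""},
    "financier_name": "",
    "addons": {"nil_dep": ""},
}


def conform_to_schema(
    partial: Dict[str, Any],
    template: Dict[str, Any] = SCHEMA_TEMPLATE,
) -> Dict[str, Any]:
    """Return a dict with exactly the template's keys.

    Missing or blank leaves become NOT_FOUND, nested sections are
    handled recursively, and keys outside the template are dropped.
    """
    out: Dict[str, Any] = {}
    for key, default in template.items():
        value = partial.get(key)
        if isinstance(default, dict):
            nested = value if isinstance(value, dict) else {}
            out[key] = conform_to_schema(nested, default)
        else:
            text = "" if value is None else str(value).strip()
            out[key] = text or NOT_FOUND
    return out


example = conform_to_schema(
    {"make": "Honda", "breakdown": {"own_damage_premium": "1200"}, "extra": 1}
)
```

Here `example` keeps "Honda" and "1200", fills every other field with "Not Found", and drops the unknown `extra` key.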
Rules
- Chassis = VIN (17 characters)
- Engine = 8-15 characters
- Chassis and engine can never be the same
- Nil Dep: "Yes" if found, "No" if explicitly absent, "Not Found" otherwise
- Missing values return "Not Found"
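The rules above could be applied along the following lines. This is a sketch under stated assumptions rather than the actual validation code: `validate_ids` and `nil_dep_label` are hypothetical helpers, and treating a failed length check as "Not Found" is one possible policy.

```python
from typing import Any, Dict, Optional

NOT_FOUND = "Not Found"


def _clean(value: Any) -> str:
    """Strip whitespace and map None or blank values to NOT_FOUND."""
    text = "" if value is None else str(value).strip()
    return text or NOT_FOUND


def validate_ids(result: Dict[str, Any]) -> Dict[str, Any]:
    """Apply the chassis and engine length rules to a copy of the result."""
    out = dict(result)
    chassis = _clean(out.get("chassis_number"))
    engine = _clean(out.get("engine_number"))

    # The sentinel is checked explicitly: "Not Found" is 9 characters long
    # and would otherwise pass the 8-15 engine length rule.
    if chassis != NOT_FOUND and len(chassis) != 17:
        chassis = NOT_FOUND
    if engine != NOT_FOUND and not (8 <= len(engine) <= 15):
        engine = NOT_FOUND

    out["chassis_number"] = chassis
    out["engine_number"] = engine
    return out


def nil_dep_label(found: Optional[bool]) -> str:
    """Map a tri-state finding to the Nil Dep output value.

    True  -> the add-on was found
    False -> the policy explicitly states it is absent
    None  -> the policy does not mention it either way
    """
    if found is None:
        return NOT_FOUND
    return "Yes" if found else "No"


# The same 17-character value in both fields survives only as the chassis.
checked = validate_ids(
    {"chassis_number": "ME4JF501AB1234567", "engine_number": "ME4JF501AB1234567"}
)
```

Because 17 lies outside the 8-15 range, two values that both pass their length checks can never be equal, so the "never the same" rule holds without a separate comparison.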
Model
Uses Qwen/Qwen2.5-VL-3B-Instruct for vision-language extraction on scanned or complex-layout PDFs.
Generated by ML Intern
This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.
- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern