Insurance Policy PDF Extractor

A robust pipeline to extract structured JSON data from two-wheeler insurance policy PDFs.

Architecture

Hybrid Approach:

Text Extraction - Extract text from the PDF using pdfplumber / PyMuPDF
Regex Extraction - Extract known fields using domain-specific regex patterns
VLM Fallback - Use Qwen2.5-VL-3B-Instruct for fields the regex pass does not find (also handles scanned/image-based PDFs)
Merge & Validate - Combine both results and apply validation rules (chassis/VIN = 17 characters, engine number = 8-15 characters, etc.)
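
The regex-first, fallback-second flow above could be sketched roughly as follows. This is a minimal illustration, not the code in extract.py: the patterns, the function names (regex_extract, fields_needing_fallback, merge_results), and the choice to let regex values win over VLM values are all assumptions made for the example.

```python
import re
from typing import Dict, Iterable, List

# Hypothetical label patterns; real policy wording differs between insurers.
PATTERNS = {
    "policy_number": re.compile(
        r"Policy\s*(?:Number|No\.?)\s*[:\-]?\s*([A-Z0-9/\-]+)", re.IGNORECASE),
    "chassis_number": re.compile(
        r"Chassis\s*(?:Number|No\.?)\s*[:\-]?\s*([A-Z0-9]{17})\b", re.IGNORECASE),
    "engine_number": re.compile(
        r"Engine\s*(?:Number|No\.?)\s*[:\-]?\s*([A-Z0-9]{8,15})\b", re.IGNORECASE),
}


def regex_extract(text: str) -> Dict[str, str]:
    """Return only the fields whose pattern matched the extracted PDF text."""
    found = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            found[field] = match.group(1).upper()
    return found


def fields_needing_fallback(found: Dict[str, str], wanted: Iterable[str]) -> List[str]:
    """List the wanted fields the regex pass did not fill (candidates for the VLM)."""
    return [field for field in wanted if not found.get(field)]


def merge_results(regex_fields: Dict[str, str], vlm_fields: Dict[str, str]) -> Dict[str, str]:
    """Combine both sources; assumption: a non-empty regex value takes precedence."""
    merged = dict(vlm_fields)
    merged.update({k: v for k, v in regex_fields.items() if v})
    return merged
```

In this sketch the VLM would only be asked about the fields returned by fields_needing_fallback, which keeps the slower model call off the path for text-based PDFs where the regex pass already succeeds.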

Installation

pip install -r requirements.txt

Usage

python extract.py sample_policy.pdf

Or programmatically:

from extract import extract_policy
import json

result = extract_policy("path/to/policy.pdf", use_vlm=True)
print(json.dumps(result, indent=2))

Output Schema

{
  "policy_number": "",
  "customer_name": "",
  "customer_address": "",
  "make": "",
  "model": "",
  "variant": "",
  "cc": "",
  "manufacturing_year": "",
  "engine_number": "",
  "chassis_number": "",
  "total_insured_value": "",
  "total_premium_amount": "",
  "breakdown": {
    "own_damage_premium": "",
    "liability_premium": ""
  },
  "financier_name": "",
  "addons": {
    "nil_dep": ""
  }
}
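
One way to guarantee that every result has exactly this shape is to project a partial result onto a template. The sketch below is an assumption about how that could be done, not the behavior of extract_policy itself; SCHEMA_TEMPLATE and fill_missing are hypothetical names, and the "Not Found" default follows the Rules section.

```python
from typing import Any, Dict

# Mirrors the output schema; nested dicts mark nested sections.
SCHEMA_TEMPLATE: Dict[str, Any] = {
    "policy_number": "", "customer_name": "", "customer_address": "",
    "make": "", "model": "", "variant": "", "cc": "",
    "manufacturing_year": "", "engine_number": "", "chassis_number": "",
    "total_insured_value": "", "total_premium_amount": "",
    "breakdown": {"own_damage_premium": "", "liability_premium": ""},
    "financier_name": "",
    "addons": {"nil_dep": ""},
}


def fill_missing(result: Dict[str, Any],
                 template: Dict[str, Any] = SCHEMA_TEMPLATE,
                 default: str = "Not Found") -> Dict[str, Any]:
    """Copy known values into the schema shape; blank or absent ones get the default."""
    filled: Dict[str, Any] = {}
    for key, template_value in template.items():
        value = result.get(key)
        if isinstance(template_value, dict):
            # Recurse into nested sections such as "breakdown" and "addons".
            nested = value if isinstance(value, dict) else {}
            filled[key] = fill_missing(nested, template_value, default)
        elif isinstance(value, str) and value.strip():
            filled[key] = value.strip()
        else:
            filled[key] = default
    return filled
```

Keys that are not in the template are dropped, so downstream consumers can rely on a fixed set of fields.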

Rules

Chassis number = VIN (exactly 17 characters)
Engine number = 8-15 characters
Chassis number and engine number can never be the same
Nil Dep: "Yes" if the add-on is found, "No" if the policy explicitly states it is absent, "Not Found" otherwise
All other missing values are returned as "Not Found"
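
The rules above might be enforced along these lines. This is a hedged sketch: validate_vehicle_ids and nil_dep_status are hypothetical helpers, the alphanumeric-only check is an added assumption, and the Nil Dep wording heuristics (a short window after the mention, a few negative phrases) are illustrative rather than taken from extract.py.

```python
import re
from typing import Tuple

NOT_FOUND = "Not Found"


def validate_vehicle_ids(chassis: str, engine: str) -> Tuple[str, str]:
    """Apply the length rules; values that fail are replaced with "Not Found"."""
    chassis = chassis.strip().upper()
    engine = engine.strip().upper()
    # Assumption: both identifiers are purely alphanumeric.
    if len(chassis) != 17 or not chassis.isalnum():
        chassis = NOT_FOUND
    if not 8 <= len(engine) <= 15 or not engine.isalnum():
        engine = NOT_FOUND
    # With these length rules identical values cannot both pass; the explicit
    # check is kept as a guard in case the length rules are ever relaxed.
    if chassis != NOT_FOUND and chassis == engine:
        engine = NOT_FOUND
    return chassis, engine


def nil_dep_status(text: str) -> str:
    """Return "Yes", "No", or "Not Found" for the nil depreciation add-on."""
    lowered = text.lower()
    mention = re.search(r"(?:nil|zero)\s*dep(?:reciation)?", lowered)
    if mention is None:
        return NOT_FOUND
    # Look just after the mention for an explicit negative statement.
    window = lowered[mention.end():mention.end() + 40]
    if re.search(r"\b(?:no|not\s+(?:opted|covered|applicable))\b", window):
        return "No"
    return "Yes"
```

For example, a value that was read into both fields by mistake keeps only the slot whose length rule it satisfies, and text that never mentions the add-on yields "Not Found" rather than "No".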

Model

Uses Qwen/Qwen2.5-VL-3B-Instruct for vision-language extraction on scanned or complex-layout PDFs.

Generated by ML Intern

This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.

Try ML Intern: https://smolagents-ml-intern.hf.space
Source code: https://github.com/huggingface/ml-intern
