---
library_name: transformers
license: mit
datasets:
- lmms-lab/DocVQA
- pixparse/docvqa-single-page-questions
spaces:
- JayRay5/DIVE-Doc-docvqa
tags:
- docvqa
- distillation
- VLM
- document-understanding
- OCR-free
---

## 1 Introduction

DIVE-Doc is a VLM architecture designed as a trade-off between end-to-end lightweight architectures and LVLMs for the DocVQA task.
It processes its inputs end-to-end, without relying on external tools such as OCR.
It takes a document image and a question as input and returns an answer. <br>

- **Paper (Spotlight/Best Paper Award VisionDocs@ICCV2025):** <br>
  [DIVE-Doc: Downscaling foundational Image Visual Encoder into hierarchical architecture for DocVQA](https://openaccess.thecvf.com/content/ICCV2025W/VisionDocs/html/Bencharef_DIVE-Doc_Downscaling_foundational_Image_Visual_Encoder_into_hierarchical_architecture_for_ICCVW_2025_paper.html)
- **Repository:** [GitHub](https://github.com/JayRay5/DIVE-Doc)
- **Demo:** [Space](https://huggingface.co/spaces/JayRay5/DIVE-Doc-docvqa) <br>

## 2 Model Summary

DIVE-Doc is built as a trade-off between end-to-end lightweight architectures and LVLMs.
Whereas the first category pairs a lightweight visual encoder with a lightweight language decoder, and LVLMs pair a large visual encoder with a large decoder,
DIVE-Doc combines a small visual encoder with a large decoder to balance model size and performance.
It is built by distilling the [SigLIP-400m](https://arxiv.org/abs/2303.15343) visual encoder of [PaliGemma](https://arxiv.org/abs/2407.07726) into a small hierarchical [Swin transformer](https://openaccess.thecvf.com/content/ICCV2021/html/Liu_Swin_Transformer_Hierarchical_Vision_Transformer_Using_Shifted_Windows_ICCV_2021_paper) initialized with the weights of [Donut](https://link.springer.com/chapter/10.1007/978-3-031-19815-1_29), while reusing the original [Gemma](https://arxiv.org/abs/2403.08295) decoder.
This reduces the visual encoder's parameter count by 80%.
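
The distillation recipe itself is described in the paper; conceptually, the Swin student is trained to reproduce the features of the frozen SigLIP teacher. Below is a minimal, illustrative sketch of such a feature-level distillation step. The MSE objective, the projection layer, and the assumption that the student's output is aligned to the teacher's token grid are simplifications for illustration, not the authors' exact recipe.

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, projector, pixel_values, optimizer):
    """One hypothetical feature-distillation step (illustrative only)."""
    # Frozen teacher: SigLIP patch features, shape (B, N, D_teacher).
    with torch.no_grad():
        teacher_feats = teacher(pixel_values)
    # Student: Swin features projected to the teacher's width; assumed here to
    # be spatially aligned with the teacher's token grid.
    student_feats = projector(student(pixel_values))
    # Regress the student onto the teacher (MSE is an assumption).
    loss = F.mse_loss(student_feats, teacher_feats)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```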

Moreover, the model is finetuned using LoRA adapters, which have been merged into the base model using [merge_and_unload](https://huggingface.co/docs/peft/main/en/package_reference/lora#peft.LoraModel.merge_and_unload).
Trained on the [DocVQA dataset](https://openaccess.thecvf.com/content/WACV2021/html/Mathew_DocVQA_A_Dataset_for_VQA_on_Document_Images_WACV_2021_paper.html) for both the distillation and finetuning steps, DIVE-Doc is competitive with LVLMs while outperforming lightweight architectures.
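
Since the adapters are already merged, the published checkpoint can be used directly without PEFT. For reference, merging your own LoRA adapters with PEFT typically looks like the sketch below (the paths are placeholders):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the base model, attach trained LoRA adapters, then fold the adapter
# weights into the base weights so no PEFT wrapper is needed at inference.
base = AutoModelForCausalLM.from_pretrained("path/to/base-model", trust_remote_code=True)
peft_model = PeftModel.from_pretrained(base, "path/to/lora-adapters")
merged = peft_model.merge_and_unload()
merged.save_pretrained("path/to/merged-model")
```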

## 3 Quick Start

### Direct Use

#### From Hugging Face Space

[Click here](https://huggingface.co/spaces/JayRay5/DIVE-Doc-docvqa)

#### From the Transformers library

```python
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import torch

# Load the processor and the model from the Hub (the model uses custom code,
# hence trust_remote_code=True).
processor = AutoProcessor.from_pretrained("JayRay5/DIVE-Doc-FRD", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("JayRay5/DIVE-Doc-FRD", trust_remote_code=True)

# Load a document image and define a question.
image = Image.open("your_image_document_path/image_document.png").convert("RGB")
question_example = "What is the name of the author?"

# Preprocess the question and the image, then move the tensors to the model's
# device and dtype.
inputs = (
    processor(text=question_example, images=image, return_tensors="pt", padding=True)
    .to(model.device)
    .to(model.dtype)
)
input_length = inputs["input_ids"].shape[-1]

# Greedy decoding of the answer.
with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=100, do_sample=False)

# Keep only the newly generated tokens and decode them.
generated_ids = output_ids[0][input_length:]
answer = processor.decode(generated_ids, skip_special_tokens=True)

print(answer)
```
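
By default the model loads on CPU in float32. If a GPU is available, you can move the model before building the inputs (standard PyTorch usage, not specific to this checkpoint); the `.to(model.device)` call above will then place the input tensors accordingly:

```python
# Optional: run inference on a CUDA GPU.
model = model.to("cuda")
```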

#### From the GitHub repository

##### Installation

```bash
git clone https://github.com/JayRay5/DIVE-Doc.git
cd DIVE-Doc
conda create -n dive-doc-env python=3.11.5
conda activate dive-doc-env
pip install -r requirements.txt
```

##### Inference example using the model repository and Gradio

In `app.py`, set the `path` variable to `"JayRay5/DIVE-Doc-FRD"`:

```python
if __name__ == "__main__":
    path = "JayRay5/DIVE-Doc-FRD"
    app(path)
```

Then run:

```bash
python app.py
```

This will start a [Gradio](https://www.gradio.app/) web interface where you can use the model.

## Uses

### Direct Use

This model is designed to answer questions about single-page document images. It was mostly trained on the industrial documents of the [DocVQA dataset](https://openaccess.thecvf.com/content/WACV2021/html/Mathew_DocVQA_A_Dataset_for_VQA_on_Document_Images_WACV_2021_paper.html).

### Downstream Use

This model can be finetuned on other DocVQA datasets such as [InfographicVQA](https://openaccess.thecvf.com/content/WACV2022/html/Mathew_InfographicVQA_WACV_2022_paper.html) to improve its performance on infographic documents.
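
The training scripts are provided in the GitHub repository. As a rough illustration, attaching fresh LoRA adapters for such a finetuning with PEFT could look like the sketch below; the rank, alpha, and `target_modules` values are placeholders and must be adapted to the actual submodule names of this architecture:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("JayRay5/DIVE-Doc-FRD", trust_remote_code=True)

# Placeholder LoRA configuration: adjust r, lora_alpha, and target_modules to
# the linear layers you actually want to adapt in this architecture.
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, target_modules=["q_proj", "v_proj"])
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
# ...train peft_model on the new dataset, then merge the adapters with merge_and_unload().
```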

## Citation

**BibTeX:**

```bibtex
@inproceedings{Bencharef_2025_ICCV,
    author    = {Bencharef, Rayane and Rahiche, Abderrahmane and Cheriet, Mohamed},
    title     = {DIVE-Doc: Downscaling foundational Image Visual Encoder into hierarchical architecture for DocVQA},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2025},
    pages     = {7547-7556}
}
```

## Contact

rayane.bencharef.1@ens.etsmtl.ca