---
library_name: transformers
license: mit
datasets:
- lmms-lab/DocVQA
- pixparse/docvqa-single-page-questions
spaces:
- JayRay5/DIVE-Doc-docvqa
tags:
- docvqa
- distillation
- VLM
- document-understanding
- OCR-free
---

## 1 Introduction

DIVE-Doc is a VLM architecture designed as a trade-off between end-to-end lightweight architectures and LVLMs for the DocVQA task.
It processes its inputs end-to-end, without relying on external tools such as OCR.
It takes a document image and a question as input and returns an answer. <br>

- **Paper (Spotlight/Best Paper Award VisionDocs@ICCV2025):** <br>
  [DIVE-Doc: Downscaling foundational Image Visual Encoder into hierarchical architecture for DocVQA](https://openaccess.thecvf.com/content/ICCV2025W/VisionDocs/html/Bencharef_DIVE-Doc_Downscaling_foundational_Image_Visual_Encoder_into_hierarchical_architecture_for_ICCVW_2025_paper.html)
- **Repository:** [GitHub](https://github.com/JayRay5/DIVE-Doc)
- **Demo:** [Space](https://huggingface.co/spaces/JayRay5/DIVE-Doc-docvqa) <br>

## 2 Model Summary

DIVE-Doc is built as a trade-off between end-to-end lightweight architectures and LVLMs.
Whereas the first category pairs a lightweight visual encoder with a lightweight language decoder, and LVLMs pair a large visual encoder with a large decoder,
DIVE-Doc combines a small visual encoder with a large decoder to balance model size and performance.
It is built by distilling the [SigLIP-400m](https://arxiv.org/abs/2303.15343) visual encoder of [PaliGemma](https://arxiv.org/abs/2407.07726) into a small hierarchical [Swin transformer](https://openaccess.thecvf.com/content/ICCV2021/html/Liu_Swin_Transformer_Hierarchical_Vision_Transformer_Using_Shifted_Windows_ICCV_2021_paper) initialized with the weights of [Donut](https://link.springer.com/chapter/10.1007/978-3-031-19815-1_29), while reusing the original [Gemma](https://arxiv.org/abs/2403.08295) decoder.
This reduces the visual encoder's parameter count by 80%.
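
The distillation recipe itself is described in the paper; conceptually, the Swin student is trained to reproduce the features of the frozen SigLIP teacher. Below is a minimal, illustrative sketch of such a feature-level distillation step. The MSE objective, the projection layer, and the assumption that the student's output is aligned to the teacher's token grid are simplifications for illustration, not the authors' exact recipe.

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, projector, pixel_values, optimizer):
    """One hypothetical feature-distillation step (illustrative only)."""
    # Frozen teacher: SigLIP patch features, shape (B, N, D_teacher).
    with torch.no_grad():
        teacher_feats = teacher(pixel_values)
    # Student: Swin features projected to the teacher's width; assumed here to
    # be spatially aligned with the teacher's token grid.
    student_feats = projector(student(pixel_values))
    # Regress the student onto the teacher (MSE is an assumption).
    loss = F.mse_loss(student_feats, teacher_feats)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```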

Moreover, the model is finetuned using LoRA adapters, which have been merged into the base model using [merge_and_unload](https://huggingface.co/docs/peft/main/en/package_reference/lora#peft.LoraModel.merge_and_unload).
Trained on the [DocVQA dataset](https://openaccess.thecvf.com/content/WACV2021/html/Mathew_DocVQA_A_Dataset_for_VQA_on_Document_Images_WACV_2021_paper.html) for both the distillation and finetuning steps, DIVE-Doc is competitive with LVLMs while outperforming lightweight architectures.
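
Since the adapters are already merged, the published checkpoint can be used directly without PEFT. For reference, merging your own LoRA adapters with PEFT typically looks like the sketch below (the paths are placeholders):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the base model, attach trained LoRA adapters, then fold the adapter
# weights into the base weights so no PEFT wrapper is needed at inference.
base = AutoModelForCausalLM.from_pretrained("path/to/base-model", trust_remote_code=True)
peft_model = PeftModel.from_pretrained(base, "path/to/lora-adapters")
merged = peft_model.merge_and_unload()
merged.save_pretrained("path/to/merged-model")
```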

## 3 Quick Start

### Direct Use

#### From Hugging Face Space

[Click here](https://huggingface.co/spaces/JayRay5/DIVE-Doc-docvqa)

#### From the Transformers library

```python
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import torch

# Load the processor and the model from the Hub (the model uses custom code,
# hence trust_remote_code=True).
processor = AutoProcessor.from_pretrained("JayRay5/DIVE-Doc-FRD", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("JayRay5/DIVE-Doc-FRD", trust_remote_code=True)

# Load a document image and define a question.
image = Image.open("your_image_document_path/image_document.png").convert("RGB")
question_example = "What is the name of the author?"

# Preprocess the question and the image, then move the tensors to the model's
# device and dtype.
inputs = (
    processor(text=question_example, images=image, return_tensors="pt", padding=True)
    .to(model.device)
    .to(model.dtype)
)
input_length = inputs["input_ids"].shape[-1]

# Greedy decoding of the answer.
with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=100, do_sample=False)

# Keep only the newly generated tokens and decode them.
generated_ids = output_ids[0][input_length:]
answer = processor.decode(generated_ids, skip_special_tokens=True)

print(answer)
```
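
By default the model loads on CPU in float32. If a GPU is available, you can move the model before building the inputs (standard PyTorch usage, not specific to this checkpoint); the `.to(model.device)` call above will then place the input tensors accordingly:

```python
# Optional: run inference on a CUDA GPU.
model = model.to("cuda")
```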

#### From the GitHub repository

##### Installation

```bash
git clone https://github.com/JayRay5/DIVE-Doc.git
cd DIVE-Doc
conda create -n dive-doc-env python=3.11.5
conda activate dive-doc-env
pip install -r requirements.txt
```

##### Inference example using the model repository and Gradio

In `app.py`, set the `path` variable to `"JayRay5/DIVE-Doc-FRD"`:

```python
if __name__ == "__main__":
    path = "JayRay5/DIVE-Doc-FRD"
    app(path)
```

Then run:

```bash
python app.py
```

This will start a [Gradio](https://www.gradio.app/) web interface where you can use the model.

## Uses

### Direct Use

This model is designed to answer questions about single-page document images. It was mostly trained on the industrial documents of the [DocVQA dataset](https://openaccess.thecvf.com/content/WACV2021/html/Mathew_DocVQA_A_Dataset_for_VQA_on_Document_Images_WACV_2021_paper.html).

### Downstream Use

This model can be finetuned on other DocVQA datasets such as [InfographicVQA](https://openaccess.thecvf.com/content/WACV2022/html/Mathew_InfographicVQA_WACV_2022_paper.html) to improve its performance on infographic documents.
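
The training scripts are provided in the GitHub repository. As a rough illustration, attaching fresh LoRA adapters for such a finetuning with PEFT could look like the sketch below; the rank, alpha, and `target_modules` values are placeholders and must be adapted to the actual submodule names of this architecture:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("JayRay5/DIVE-Doc-FRD", trust_remote_code=True)

# Placeholder LoRA configuration: adjust r, lora_alpha, and target_modules to
# the linear layers you actually want to adapt in this architecture.
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, target_modules=["q_proj", "v_proj"])
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
# ...train peft_model on the new dataset, then merge the adapters with merge_and_unload().
```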

## Citation

**BibTeX:**

```bibtex
@inproceedings{Bencharef_2025_ICCV,
    author    = {Bencharef, Rayane and Rahiche, Abderrahmane and Cheriet, Mohamed},
    title     = {DIVE-Doc: Downscaling foundational Image Visual Encoder into hierarchical architecture for DocVQA},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2025},
    pages     = {7547-7556}
}
```

## Contact

rayane.bencharef.1@ens.etsmtl.ca