---
library_name: transformers
license: mit
datasets:
- lmms-lab/DocVQA
- pixparse/docvqa-single-page-questions
tags:
- docvqa
- distillation
- VLM
- document-understanding
- OCR-free
---

## 1 Introduction
DIVE-Doc is a VLM architecture designed as a trade-off between end-to-end lightweight architectures and LVLMs for the DocVQA task.
It processes its inputs end to end, without relying on external tools such as OCR.
It takes a document image and a question as input and returns an answer. <br>
- **Paper (Spotlight/Best Paper Award VisionDocs@ICCV2025):** <br>
  [DIVE-Doc: Downscaling foundational Image Visual Encoder into hierarchical architecture for DocVQA](https://openaccess.thecvf.com/content/ICCV2025W/VisionDocs/html/Bencharef_DIVE-Doc_Downscaling_foundational_Image_Visual_Encoder_into_hierarchical_architecture_for_ICCVW_2025_paper.html)
- **Repository:** [GitHub](https://github.com/JayRay5/DIVE-Doc)<br>

## 2 Model Summary
DIVE-Doc is built as a trade-off between end-to-end lightweight architectures and LVLMs.
Lightweight architectures pair a small visual encoder with a small language decoder, whereas LVLMs pair a large visual encoder with a large decoder.
DIVE-Doc combines a small visual encoder with a large decoder in order to balance model size and performance.
It is built by distilling the [SigLIP-400m](https://arxiv.org/abs/2303.15343) visual encoder of [PaliGemma](https://arxiv.org/abs/2407.07726) into a small hierarchical [Swin transformer](https://openaccess.thecvf.com/content/ICCV2021/html/Liu_Swin_Transformer_Hierarchical_Vision_Transformer_Using_Shifted_Windows_ICCV_2021_paper) initialized with the weights of [Donut](https://link.springer.com/chapter/10.1007/978-3-031-19815-1_29), while reusing the original [Gemma](https://arxiv.org/abs/2403.08295) decoder.
This reduces the visual encoder's parameter count by 80%.
The model is then finetuned using LoRA adapters, which have been merged into the base model using [merge_and_unload](https://huggingface.co/docs/peft/main/en/package_reference/lora#peft.LoraModel.merge_and_unload).
Both the distillation and finetuning steps use the [DocVQA dataset](https://openaccess.thecvf.com/content/WACV2021/html/Mathew_DocVQA_A_Dataset_for_VQA_on_Document_Images_WACV_2021_paper.html); this strategy allows DIVE-Doc to be competitive with LVLMs while outperforming lightweight architectures.
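
The distillation step described above can be illustrated with a toy feature-matching objective. This is only a sketch: the mean-squared-error loss, the helper name `mse_distillation_loss`, and the tiny feature lists are assumptions made for illustration, and the card does not state which loss DIVE-Doc actually uses. It also assumes the student features have already been projected to the teacher's shape.

```python
# Toy feature distillation: the student encoder is trained so that its
# features match those of the frozen teacher encoder.
# ASSUMPTION: MSE is used here for illustration; the card does not specify the loss.

def mse_distillation_loss(student_feats, teacher_feats):
    """Mean squared error between two equally shaped lists of feature vectors."""
    if len(student_feats) != len(teacher_feats):
        raise ValueError("student and teacher must produce the same number of tokens")
    total, count = 0.0, 0
    for s_vec, t_vec in zip(student_feats, teacher_feats):
        if len(s_vec) != len(t_vec):
            raise ValueError("feature dimensions must match")
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    return total / count

# Hypothetical features: 2 visual tokens, 2 dimensions each.
teacher = [[1.0, 0.0], [0.5, 0.5]]
student = [[0.0, 0.0], [0.5, 1.5]]

loss = mse_distillation_loss(student, teacher)
print(loss)  # 0.5
```

In a real training loop the loss would be computed on tensors and minimized with respect to the student's parameters only, with the teacher kept frozen.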


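To make the adapter merge mentioned in the summary concrete, the sketch below applies the standard LoRA update `W + (lora_alpha / r) * B @ A` to a tiny weight matrix. It is a pure-Python illustration, not PEFT's implementation: the helper names and the toy values are hypothetical, and options such as rank-stabilized scaling are ignored.

```python
# Toy illustration of merging a LoRA adapter into a base weight matrix.
# ASSUMPTION: standard LoRA scaling (lora_alpha / r); all values are made up.

def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Return the merged weight W + (lora_alpha / r) * (B @ A)."""
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]

weight = [[1.0, 0.0], [0.0, 1.0]]  # base weight, shape (out_features, in_features)
lora_a = [[1.0, 2.0]]              # adapter A, shape (r, in_features) with r = 1
lora_b = [[0.5], [0.0]]            # adapter B, shape (out_features, r)

merged = merge_lora(weight, lora_a, lora_b, lora_alpha=2, r=1)
print(merged)  # [[2.0, 2.0], [0.0, 1.0]]
```

After the merge, a single matrix product with `merged` gives the same output as the base layer plus the adapter branch, which is why a merged checkpoint can be loaded without any adapter-specific code at inference time.
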
## 3 Quick Start

### Direct Use

#### From the Transformers library
```python
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import torch

# trust_remote_code=True allows loading the custom model code from the Hub repository
processor = AutoProcessor.from_pretrained("JayRay5/DIVE-Doc-ARD-LRes", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("JayRay5/DIVE-Doc-ARD-LRes", trust_remote_code=True)

# Load a single-page document image and write a question about it
image = Image.open("your_image_document_path/image_document.png").convert("RGB")
question_example = "What is the name of the author?"

# Preprocess, then move the inputs to the model's device and dtype
inputs = (
    processor(text=question_example, images=image, return_tensors="pt", padding=True)
    .to(model.device)
    .to(model.dtype)
)
input_length = inputs["input_ids"].shape[-1]

# Greedy decoding without gradient tracking
with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=100, do_sample=False)

# Drop the prompt tokens and decode only the newly generated ones
generated_ids = output_ids[0][input_length:]
answer = processor.decode(generated_ids, skip_special_tokens=True)

print(answer)
```

#### From the GitHub repository
##### Installation
```bash
git clone https://github.com/JayRay5/DIVE-Doc.git
cd DIVE-Doc
conda create -n dive-doc-env python=3.11.5
conda activate dive-doc-env
pip install -r requirements.txt
```
##### Inference example using the model repository and Gradio
In app.py, modify the path variable to "JayRay5/DIVE-Doc-ARD-LRes":
```python
if __name__ == "__main__":
    path = "JayRay5/DIVE-Doc-ARD-LRes"
    app(path)
```
Then run:
```bash
python app.py
```
This will start a [Gradio](https://www.gradio.app/) web interface where you can use the model.

## Notification


### Direct Use

This model is designed to answer a question about a single-page document image. It was trained mostly on industrial documents from the [DocVQA dataset](https://openaccess.thecvf.com/content/WACV2021/html/Mathew_DocVQA_A_Dataset_for_VQA_on_Document_Images_WACV_2021_paper.html).


### Downstream Use 

This model can be finetuned on other DocVQA datasets such as [InfographicVQA](https://openaccess.thecvf.com/content/WACV2022/html/Mathew_InfographicVQA_WACV_2022_paper.html) to improve its performance on infographic documents.



## Citation

**BibTeX:**

```bibtex
@inproceedings{Bencharef_2025_ICCV,
    author    = {Bencharef, Rayane and Rahiche, Abderrahmane and Cheriet, Mohamed},
    title     = {DIVE-Doc: Downscaling foundational Image Visual Encoder into hierarchical architecture for DocVQA},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2025},
    pages     = {7547-7556}
}
```
## Contact

rayane.bencharef.1@ens.etsmtl.ca