Tags: Transformers · Safetensors · DIVEdoc · docvqa · distillation · VLM · document-understanding · OCR-free
JayRay5 committed · Commit 2c564cf · verified · 1 Parent(s): 3c0d1f0

Update README.md

Files changed (1):
  1. README.md +45 -5
README.md CHANGED
@@ -3,15 +3,22 @@ library_name: transformers
 license: mit
 datasets:
 - lmms-lab/DocVQA
+- pixparse/docvqa-single-page-questions
+tags:
+- docvqa
+- distillation
+- VLM
+- document-understanding
+- OCR-free
 ---
 
 ## 1 Introduction
 DIVE-Doc is a VLM architecture built as a trade-off between end-to-end lightweight architectures and LVLMs for the DocVQA task.
 Without relying on external tools such as OCR, it processes the inputs in an end-to-end way.
 It takes an image document and a question as input and returns an answer. <br>
-- **Repository:** [GitHub](https://github.com/JayRay5/DIVE-Doc)
-- **Paper:** [DIVE-Doc: Downscaling foundational Image Visual Encoder into hierarchical architecture for DocVQA](https://openaccess.thecvf.com/content/ICCV2025W/VisionDocs/html/Bencharef_DIVE-Doc_Downscaling_foundational_Image_Visual_Encoder_into_hierarchical_architecture_for_ICCVW_2025_paper.html)
-
+- **Paper (Spotlight/Best Paper Award VisionDocs@ICCV2025):** <br>
+[DIVE-Doc: Downscaling foundational Image Visual Encoder into hierarchical architecture for DocVQA](https://openaccess.thecvf.com/content/ICCV2025W/VisionDocs/html/Bencharef_DIVE-Doc_Downscaling_foundational_Image_Visual_Encoder_into_hierarchical_architecture_for_ICCVW_2025_paper.html)
+- **Repository:** [GitHub](https://github.com/JayRay5/DIVE-Doc)<br>
 
 ## 2 Model Summary
 DIVE-Doc is built as a trade-off between end-to-end lightweight architectures and LVLMs.
@@ -25,7 +32,38 @@ Trained on the [DocVQA dataset](https://openaccess.thecvf.com/content/WACV2021/h
 
 ## 3 Quick Start
 
-### Installation
+### Direct Use
+
+#### From the Transformers library
+```python
+from transformers import AutoProcessor, AutoModelForCausalLM
+from PIL import Image
+import torch
+
+processor = AutoProcessor.from_pretrained("JayRay5/DIVE-Doc-ARD-LRes", trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained("JayRay5/DIVE-Doc-ARD-LRes", trust_remote_code=True)
+
+image = Image.open("your_image_document_path/image_document.png").convert("RGB")
+question_example = "What is the name of the author?"
+
+inputs = (
+    processor(text=question_example, images=image, return_tensors="pt", padding=True)
+    .to(model.device)
+    .to(model.dtype)
+)
+input_length = inputs["input_ids"].shape[-1]
+
+with torch.inference_mode():
+    output_ids = model.generate(**inputs, max_new_tokens=100, do_sample=False)
+
+generated_ids = output_ids[0][input_length:]
+answer = processor.decode(generated_ids, skip_special_tokens=True)
+
+print(answer)
+```
+
+#### From the GitHub repository
+##### Installation
 ```bash
 git clone https://github.com/JayRay5/DIVE-Doc.git
 cd DIVE-Doc
@@ -33,7 +71,7 @@ conda create -n dive-doc-env python=3.11.5
 conda activate dive-doc-env
 pip install -r requirements.txt
 ```
-### Inference example using the model repository and gradio
+##### Inference example using the model repository and gradio
 In app.py, modify the path variable to "JayRay5/DIVE-Doc-ARD-LRes":
 ```python
 if __name__ == "__main__":
@@ -45,6 +83,7 @@ Then run:
 python app.py
 ```
 This will start a [gradio](https://www.gradio.app/) web interface where you can use the model.
+
 ## Notification
 
 
@@ -74,4 +113,5 @@ This model can be finetuned on other DocVQA datasets such as [InfoGraphVQA](http
 }
 ```
 ## Contact
+
 rayane.bencharef.1@ens.etsmtl.ca
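
In the Transformers snippet this commit adds, `generate` returns the prompt token ids followed by the newly generated answer ids, which is why the code slices at `input_length` before decoding. A minimal sketch of that slicing step, using hypothetical token ids in plain lists instead of real tensors:

```python
def strip_prompt(output_ids, input_length):
    # Keep only the tokens generated after the prompt; decoding the full
    # sequence would echo the question back inside the answer.
    return output_ids[input_length:]

# Hypothetical token ids standing in for a real tokenizer's output.
prompt_ids = [101, 2054, 2003, 1996, 3166, 102]   # question tokens (made up)
answer_ids = [2198, 3044, 102]                    # answer tokens (made up)
full_output = prompt_ids + answer_ids             # what generate() returns per sample

print(strip_prompt(full_output, len(prompt_ids)))  # -> [2198, 3044, 102]
```

The same slice appears in the snippet as `output_ids[0][input_length:]`, where `[0]` selects the first (and only) sequence in the batch.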