---
library_name: transformers
license: mit
datasets:
- lmms-lab/DocVQA
- pixparse/docvqa-single-page-questions
tags:
- docvqa
- distillation
- VLM
- document-understanding
- OCR-free
---

## 1 Introduction
DIVE-Doc is a VLM architecture built as a trade-off between lightweight end-to-end architectures and LVLMs for the DocVQA task.
It processes its inputs end to end, without relying on external tools such as OCR.
It takes a document image and a question as input and returns an answer. <br>
- **Paper (Spotlight/Best Paper Award, VisionDocs@ICCV2025):** <br>
[DIVE-Doc: Downscaling foundational Image Visual Encoder into hierarchical architecture for DocVQA](https://openaccess.thecvf.com/content/ICCV2025W/VisionDocs/html/Bencharef_DIVE-Doc_Downscaling_foundational_Image_Visual_Encoder_into_hierarchical_architecture_for_ICCVW_2025_paper.html)
- **Repository:** [GitHub](https://github.com/JayRay5/DIVE-Doc)<br>

## 2 Model Summary
DIVE-Doc is built as a trade-off between lightweight end-to-end architectures and LVLMs.
It is trained on the DocVQA dataset (WACV 2021).

## 3 Quick Start

### Direct Use

#### From the Transformers library
```python
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import torch

# Load the processor and model from the Hugging Face Hub.
processor = AutoProcessor.from_pretrained("JayRay5/DIVE-Doc-ARD-LRes", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("JayRay5/DIVE-Doc-ARD-LRes", trust_remote_code=True)

# Load a document image and write a question about it.
image = Image.open("your_image_document_path/image_document.png").convert("RGB")
question_example = "What is the name of the author?"

# Preprocess the inputs and move them to the model's device and dtype.
inputs = (
    processor(text=question_example, images=image, return_tensors="pt", padding=True)
    .to(model.device)
    .to(model.dtype)
)
input_length = inputs["input_ids"].shape[-1]

# Greedy decoding, capped at 100 new tokens.
with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=100, do_sample=False)

# Keep only the newly generated tokens, i.e. the answer.
generated_ids = output_ids[0][input_length:]
answer = processor.decode(generated_ids, skip_special_tokens=True)

print(answer)
```
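
By default, `from_pretrained` loads the weights in float32 on the CPU. If a GPU is available, loading in half precision usually speeds up inference and reduces memory use. A minimal sketch, assuming a CUDA device and that this checkpoint behaves well in bfloat16 (an assumption, not something this card states):

```python
import torch
from transformers import AutoModelForCausalLM

# Assumption: a CUDA device is present and bfloat16 is acceptable for this
# checkpoint; otherwise keep the float32/CPU defaults shown above.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32

model = AutoModelForCausalLM.from_pretrained(
    "JayRay5/DIVE-Doc-ARD-LRes",
    trust_remote_code=True,
    torch_dtype=dtype,
).to(device)
```

The quick-start snippet above already moves the processed inputs to `model.device` and `model.dtype`, so it works unchanged with this loading variant.
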
#### From the GitHub repository
##### Installation
```bash
git clone https://github.com/JayRay5/DIVE-Doc.git
cd DIVE-Doc
conda create -n dive-doc-env python=3.11.5
conda activate dive-doc-env
pip install -r requirements.txt
```
##### Inference example using the model repository and gradio
In app.py, modify the path variable to "JayRay5/DIVE-Doc-ARD-LRes":
```python
if __name__ == "__main__":
    path = "JayRay5/DIVE-Doc-ARD-LRes"
    ...
```
Then run:
```bash
python app.py
```
This will start a [gradio](https://www.gradio.app/) web interface where you can use the model.
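
For orientation, here is a minimal sketch of such an interface. It is not the repository's actual app.py; it only illustrates how the `processor` and `model` from the Transformers example above could be wrapped in gradio, and `answer_question` is a hypothetical helper name:

```python
import gradio as gr
import torch

# Hypothetical wrapper reusing `processor` and `model` from the Transformers
# example above; the repository's real app.py may differ.
def answer_question(image, question):
    inputs = (
        processor(text=question, images=image, return_tensors="pt", padding=True)
        .to(model.device)
        .to(model.dtype)
    )
    input_length = inputs["input_ids"].shape[-1]
    with torch.inference_mode():
        output_ids = model.generate(**inputs, max_new_tokens=100, do_sample=False)
    return processor.decode(output_ids[0][input_length:], skip_special_tokens=True)

demo = gr.Interface(
    fn=answer_question,
    inputs=[gr.Image(type="pil", label="Document image"), gr.Textbox(label="Question")],
    outputs=gr.Textbox(label="Answer"),
)
demo.launch()
```
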
## Notification

This model can be fine-tuned on other DocVQA datasets such as InfoGraphVQA.
## Contact

rayane.bencharef.1@ens.etsmtl.ca