Transformers
Safetensors
DIVEdoc
docvqa
distillation
VLM
document-understanding
OCR-free
JayRay5 committed on
Commit 5abcde8 · verified · 1 Parent(s): 6519e2d

Update README.md

Files changed (1)
  1. README.md +13 -18
README.md CHANGED
@@ -13,54 +13,49 @@ It takes an image document and a question as input and returns an answer. <br>
  - **Paper [optional]:** [More Information Needed]


- ## Model Summary
  DIVE-Doc is built as a trade-off between end-to-end lightweight architectures and LVLMs.
  Where the first category has both a lightweight visual encoder and a language decoder, and LVLMs have both a large visual encoder and a large decoder,
  DIVE-Doc contains a small visual encoder in combination with a large decoder in order to balance model size and performance.
- It is built by distilling the [SigLIP-400m](https://arxiv.org/abs/2303.15343) visual encoder of [PaliGEMMA]() into a small hierarchical [Swin transformer]() initialized with the weights of [Donut](), while reusing the original [GEMMA]() decoder.
- This allows DIVE-Doc to keep competitive performance with its teacher while reducing the visual encoder's parameters to one-fifth.
- Moreover, the model is finetuned using LoRA adapters, which have been merged into the base model using [unload_and_merge]()).



-
- ## Quick Start

  ### Installation
  ```bash
  git clone https://github.com/JayRay5/DIVE-Doc.git
-
  cd DIVE-Doc
  conda create -n dive-doc-env python=3.11.5
  conda activate dive-doc-env
  pip install -r requirements.txt
  ```
  ### Inference example using the model repository and gradio
- In app.py, modify the path variable by "JayRay5/DIVE-Doc-ARD-LRes":
  ```bash



  ```
  Then run:
  ```bash
  python app.py
  ```
- This will start a [Gradio]() web interface where you can use the model.
  ## Notification


  ### Direct Use

- This model is designed to answer a question from a single-page image document.


  ### Downstream Use

- This model can be finetuned on other DocVQA dataset such as [InfoGraphVQA]() to improve its performance on Infographic documents, or []() for a specialization on degraded or historical documents.
-
-
- ## Bias, Risks, and Limitations
-
- This model may not perform well on degraded or infographic documents because of its finetuning on mostly industrial documents.
-
-


 
  - **Paper [optional]:** [More Information Needed]


+ ## 2 Model Summary
  DIVE-Doc is built as a trade-off between end-to-end lightweight architectures and LVLMs.
  Where the first category has both a lightweight visual encoder and a language decoder, and LVLMs have both a large visual encoder and a large decoder,
  DIVE-Doc contains a small visual encoder in combination with a large decoder in order to balance model size and performance.
+ It is built by distilling the [SigLIP-400m](https://arxiv.org/abs/2303.15343) visual encoder of [PaliGEMMA](https://arxiv.org/abs/2407.07726) into a small hierarchical [Swin transformer](https://openaccess.thecvf.com/content/ICCV2021/html/Liu_Swin_Transformer_Hierarchical_Vision_Transformer_Using_Shifted_Windows_ICCV_2021_paper) initialized with the weights of [Donut](https://link.springer.com/chapter/10.1007/978-3-031-19815-1_29), while reusing the original [GEMMA](https://arxiv.org/abs/2403.08295) decoder.
+ This enables DIVE-Doc to reduce its visual encoder's parameter count by 80%.
+ Moreover, the model is finetuned using LoRA adapters, which have been merged into the base model using [merge_and_unload](https://huggingface.co/docs/peft/main/en/package_reference/lora#peft.LoraModel.merge_and_unload).
+ Trained on the [DocVQA dataset](https://openaccess.thecvf.com/content/WACV2021/html/Mathew_DocVQA_A_Dataset_for_VQA_on_Document_Images_WACV_2021_paper.html) for both the distillation and finetuning steps, this strategy allows DIVE-Doc to be competitive with LVLMs while outperforming lightweight architectures.
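The `merge_and_unload` step folds each LoRA update back into the frozen base weight, so inference afterwards needs no adapter machinery. A minimal NumPy sketch of that arithmetic (toy shapes, not the actual DIVE-Doc weights):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for one frozen projection matrix plus its LoRA factors.
d, r, alpha = 8, 2, 4
W = rng.normal(size=(d, d))   # frozen base weight
A = rng.normal(size=(r, d))   # LoRA down-projection
B = rng.normal(size=(d, r))   # LoRA up-projection
scaling = alpha / r

x = rng.normal(size=(d,))

# During LoRA finetuning the adapted layer computes W x + scaling * B (A x).
y_adapter = W @ x + scaling * (B @ (A @ x))

# Merging folds the adapter into the base weight: W' = W + scaling * B A.
W_merged = W + scaling * (B @ A)
y_merged = W_merged @ x

print(np.allclose(y_adapter, y_merged))  # → True
```

Because the merged weight reproduces the adapted layer exactly, the published checkpoint can be loaded as a plain model.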
+ ## 3 Quick Start

  ### Installation
  ```bash
  git clone https://github.com/JayRay5/DIVE-Doc.git
  cd DIVE-Doc
  conda create -n dive-doc-env python=3.11.5
  conda activate dive-doc-env
  pip install -r requirements.txt
  ```
  ### Inference example using the model repository and gradio
+ In app.py, modify the path variable to "JayRay5/DIVE-Doc-ARD-LRes":
  ```python
+ if __name__ == "__main__":
+     path = "JayRay5/DIVE-Doc-ARD-LRes"
+     app(path)
  ```
  Then run:
  ```bash
  python app.py
  ```
+ This will start a [Gradio](https://www.gradio.app/) web interface where you can use the model.
  ## Notification


  ### Direct Use

+ This model is designed to answer a question from a single-page image document and is mostly trained on industrial documents ([DocVQA dataset](https://openaccess.thecvf.com/content/WACV2021/html/Mathew_DocVQA_A_Dataset_for_VQA_on_Document_Images_WACV_2021_paper.html)).


  ### Downstream Use

+ This model can be finetuned on other DocVQA datasets such as [InfographicVQA](https://openaccess.thecvf.com/content/WACV2022/html/Mathew_InfographicVQA_WACV_2022_paper.html) to improve its performance on infographic documents.
 
 
 
 
 
 
 