It takes an image document and a question as input and returns an answer. <br>

## Model Summary
DIVE-Doc is built as a trade-off between end-to-end lightweight architectures and LVLMs.
Where the first category has both a lightweight visual encoder and a lightweight language decoder, and LVLMs have both a large visual encoder and a large decoder,
DIVE-Doc contains a small visual encoder in combination with a large decoder in order to balance model size and performance.

It is built by distilling the [SigLIP-400m](https://arxiv.org/abs/2303.15343) visual encoder of [PaliGEMMA](https://arxiv.org/abs/2407.07726) into a small hierarchical [Swin transformer](https://openaccess.thecvf.com/content/ICCV2021/html/Liu_Swin_Transformer_Hierarchical_Vision_Transformer_Using_Shifted_Windows_ICCV_2021_paper) initialized with the weights of [Donut](https://link.springer.com/chapter/10.1007/978-3-031-19815-1_29), while reusing the original [GEMMA](https://arxiv.org/abs/2403.08295) decoder.

This enables DIVE-Doc to reduce its visual encoder's parameter count by 80%.

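The distillation step trains the small Swin student to reproduce the SigLIP teacher's features. The exact objective is not specified in this README; the sketch below assumes a plain mean-squared-error feature loss, written in pure Python for illustration only:

```python
# Illustrative only: DIVE-Doc's actual distillation loss is not given in this
# README. MSE between teacher and student features is a common choice.

def mse(teacher_feats, student_feats):
    """Mean squared error between two equal-length feature vectors."""
    assert len(teacher_feats) == len(student_feats)
    n = len(teacher_feats)
    return sum((t - s) ** 2 for t, s in zip(teacher_feats, student_feats)) / n

# The student (Swin) is optimized so its features match the teacher's (SigLIP):
teacher = [0.2, -0.5, 1.0]  # toy SigLIP features for one image patch
student = [0.0, -0.5, 0.5]  # toy Swin features for the same patch
loss = mse(teacher, student)  # 0.29 / 3, about 0.0967; zero iff features match
```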
Moreover, the model is finetuned using LoRA adapters, which have been merged into the base model using [merge_and_unload](https://huggingface.co/docs/peft/main/en/package_reference/lora#peft.LoraModel.merge_and_unload).

Trained on the [DocVQA dataset](https://openaccess.thecvf.com/content/WACV2021/html/Mathew_DocVQA_A_Dataset_for_VQA_on_Document_Images_WACV_2021_paper.html) for both the distillation and finetuning steps, this strategy allows DIVE-Doc to be competitive with LVLMs while outperforming lightweight architectures.

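Arithmetically, merging a LoRA adapter just folds the low-rank update back into the base weights; peft's `merge_and_unload` does this for every adapted layer and strips the adapter wrappers. A minimal pure-Python sketch of the update, with illustrative shapes and scaling:

```python
# W_merged = W + (alpha / r) * (B @ A)  -- the LoRA fold-in, written out by hand.

def matmul(X, Y):
    """Naive matrix product of X (m x k) and Y (k x n)."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def merge_lora(W, A, B, alpha, r):
    """Fold the scaled low-rank update B @ A into the base weight W."""
    BA = matmul(B, A)
    scale = alpha / r
    return [[W[i][j] + scale * BA[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

# Toy 2x2 base weight with a rank-1 adapter (r = 1, alpha = 1):
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]   # (out_dim x r)
A = [[0.5, 0.5]]     # (r x in_dim)
W_merged = merge_lora(W, A, B, alpha=1.0, r=1)  # [[1.5, 0.5], [1.0, 2.0]]
```

After merging, inference costs the same as the base model, which is why the adapters were folded in before release.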
## Quick Start
### Installation
```bash
git clone https://github.com/JayRay5/DIVE-Doc.git
cd DIVE-Doc
conda create -n dive-doc-env python=3.11.5
conda activate dive-doc-env
pip install -r requirements.txt
```
### Inference example using the model repository and gradio
In app.py, modify the path variable to "JayRay5/DIVE-Doc-ARD-LRes":
```python
if __name__ == "__main__":
    path = "JayRay5/DIVE-Doc-ARD-LRes"
    app(path)
```
Then run:
```bash
python app.py
```
This will start a [gradio](https://www.gradio.app/) web interface where you can use the model.
## Uses
### Direct Use
This model is designed to answer a question from a single-page image document and is mostly trained on industrial documents ([DocVQA dataset](https://openaccess.thecvf.com/content/WACV2021/html/Mathew_DocVQA_A_Dataset_for_VQA_on_Document_Images_WACV_2021_paper.html)).
### Downstream Use
This model can be finetuned on other DocVQA datasets such as [InfographicVQA](https://openaccess.thecvf.com/content/WACV2022/html/Mathew_InfographicVQA_WACV_2022_paper.html) to improve its performance on infographic documents.
## Bias, Risks, and Limitations
This model may not perform well on degraded or infographic documents because of its finetuning on mostly industrial documents.