Transformers
Safetensors
DIVEdoc
docvqa
distillation
VLM
document-understanding
OCR-free
JayRay5 committed on
Commit 5abcde8 · verified · 1 Parent(s): 6519e2d

Update README.md

Files changed (1)
  1. README.md +13 -18
README.md CHANGED
@@ -13,54 +13,49 @@ It takes an image document and a question as input and returns an answer. <br>
  - **Paper [optional]:** [More Information Needed]


- ## Model Summary
  DIVE-Doc is built as a trade-off between end-to-end lightweight architectures and LVLMs.
  Where the first category has both a lightweight visual encoder and a language decoder, and LVLMs have both a large visual encoder and a large decoder,
  DIVE-Doc contains a small visual encoder in combination with a large decoder in order to balance model size and performance.
- It is built by distilling the [SigLIP-400m](https://arxiv.org/abs/2303.15343) visual encoder of [PaliGEMMA]() into a small hierarchical [Swin transformer]() initialized with the weights of [Donut](), while reusing the original [GEMMA]() decoder.
- This allows DIVE-Doc to keep competitive performance with its teacher while reducing the visual encoder's parameters to one-fifth.
- Moreover, the model is finetuned using LoRA adapters, which have been merged into the base model using [unload_and_merge]()).



-
- ## Quick Start

  ### Installation
  ```bash
  git clone https://github.com/JayRay5/DIVE-Doc.git
-
  cd DIVE-Doc
  conda create -n dive-doc-env python=3.11.5
  conda activate dive-doc-env
  pip install -r requirements.txt
  ```
  ### Inference example using the model repository and gradio
- In app.py, modify the path variable by "JayRay5/DIVE-Doc-ARD-LRes":
  ```bash



  ```
  Then run:
  ```bash
  python app.py
  ```
- This will start a [Gradio]() web interface where you can use the model.
  ## Notification


  ### Direct Use

- This model is designed to answer a question from a single-page image document.


  ### Downstream Use

- This model can be finetuned on other DocVQA dataset such as [InfoGraphVQA]() to improve its performance on Infographic documents, or []() for a specialization on degraded or historical documents.
-
-
- ## Bias, Risks, and Limitations
-
- This model may not perform well on degraded or infographic documents because of its finetuning on mostly industrial documents.
-
-


 
  - **Paper [optional]:** [More Information Needed]


+ ## 2 Model Summary
  DIVE-Doc is built as a trade-off between end-to-end lightweight architectures and LVLMs.
  Where the first category has both a lightweight visual encoder and a language decoder, and LVLMs have both a large visual encoder and a large decoder,
  DIVE-Doc contains a small visual encoder in combination with a large decoder in order to balance model size and performance.
+ It is built by distilling the [SigLIP-400m](https://arxiv.org/abs/2303.15343) visual encoder of [PaliGEMMA](https://arxiv.org/abs/2407.07726) into a small hierarchical [Swin transformer](https://openaccess.thecvf.com/content/ICCV2021/html/Liu_Swin_Transformer_Hierarchical_Vision_Transformer_Using_Shifted_Windows_ICCV_2021_paper) initialized with the weights of [Donut](https://link.springer.com/chapter/10.1007/978-3-031-19815-1_29), while reusing the original [GEMMA](https://arxiv.org/abs/2403.08295) decoder.
+ This enables DIVE-Doc to reduce its visual encoder's parameter count by 80%.
+ Moreover, the model is finetuned using LoRA adapters, which have been merged into the base model using [merge_and_unload](https://huggingface.co/docs/peft/main/en/package_reference/lora#peft.LoraModel.merge_and_unload).
+ Trained on the [DocVQA dataset](https://openaccess.thecvf.com/content/WACV2021/html/Mathew_DocVQA_A_Dataset_for_VQA_on_Document_Images_WACV_2021_paper.html) for both the distillation and finetuning steps, this strategy allows DIVE-Doc to be competitive with LVLMs while outperforming lightweight architectures.
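The `merge_and_unload` step folds each LoRA update back into the frozen base weight, so inference afterwards needs no adapter machinery. A minimal NumPy sketch of that arithmetic (toy shapes, not the actual DIVE-Doc weights):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for one frozen projection matrix plus its LoRA factors.
d, r, alpha = 8, 2, 4
W = rng.normal(size=(d, d))   # frozen base weight
A = rng.normal(size=(r, d))   # LoRA down-projection
B = rng.normal(size=(d, r))   # LoRA up-projection
scaling = alpha / r

x = rng.normal(size=(d,))

# During LoRA finetuning the adapted layer computes W x + scaling * B (A x).
y_adapter = W @ x + scaling * (B @ (A @ x))

# Merging folds the adapter into the base weight: W' = W + scaling * B A.
W_merged = W + scaling * (B @ A)
y_merged = W_merged @ x

print(np.allclose(y_adapter, y_merged))  # → True
```

Because the merged weight reproduces the adapted layer exactly, the published checkpoint can be loaded as a plain model.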
+ ## 3 Quick Start

  ### Installation
  ```bash
  git clone https://github.com/JayRay5/DIVE-Doc.git
  cd DIVE-Doc
  conda create -n dive-doc-env python=3.11.5
  conda activate dive-doc-env
  pip install -r requirements.txt
  ```
  ### Inference example using the model repository and gradio
+ In app.py, modify the path variable to "JayRay5/DIVE-Doc-ARD-LRes":
  ```python
+ if __name__ == "__main__":
+     path = "JayRay5/DIVE-Doc-ARD-LRes"
+     app(path)
  ```
  Then run:
  ```bash
  python app.py
  ```
+ This will start a [Gradio](https://www.gradio.app/) web interface where you can use the model.
  ## Notification


  ### Direct Use

+ This model is designed to answer a question from a single-page image document and is mostly trained on industrial documents ([DocVQA dataset](https://openaccess.thecvf.com/content/WACV2021/html/Mathew_DocVQA_A_Dataset_for_VQA_on_Document_Images_WACV_2021_paper.html)).


  ### Downstream Use

+ This model can be finetuned on other DocVQA datasets such as [InfographicVQA](https://openaccess.thecvf.com/content/WACV2022/html/Mathew_InfographicVQA_WACV_2022_paper.html) to improve its performance on infographic documents.
 
 
 
 
 
 
 