JayRay5 committed on
Commit 6d6d406 · verified · 1 Parent(s): 8344496

Update README.md

  1. README.md +20 -62
README.md CHANGED
@@ -5,12 +5,6 @@ datasets:
5
  - lmms-lab/DocVQA
6
  ---
7
 
8
- # Model Card for Model ID
9
-
10
- <!-- Provide a quick summary of what the model is/does. -->
11
-
12
-
13
-
14
  ## 1 Introduction
15
  DIVE-Doc is a VLM architecture built as a trade-off between end-to-end lightweight architectures and LVLMs for the DocVQA task.
16
  Without relying on external tools such as OCR, it processes the inputs in an end-to-end way.
@@ -31,22 +25,33 @@ Moreover, the model is finetuned using LoRA adapters (in this repo, adapters hav
31
 
32
  ## Quick Start
33
 
34
- <!-- Provide the basic links for the model. -->
35
-
36
- - **Repository:** [GitHub](https://github.com/JayRay5/DIVE-Doc)
37
- - **Paper [optional]:** [More Information Needed]
38
-
39
- ## Uses
40
 
41
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
42
 
43
  ### Direct Use
44
 
45
  This model is designed to answer a question from a single-page image document.
46
 
47
- [More Information Needed]
48
 
49
- ### Downstream Use [optional]
50
 
51
  This model can be finetuned on other DocVQA datasets such as [InfoGraphVQA]() to improve its performance on infographic documents, or []() to specialize it for degraded or historical documents.
52
 
@@ -56,55 +61,8 @@ This model can be finetuned on other DocVQA dataset such as [InfoGraphVQA]() to
56
  This model may not perform well on degraded or infographic documents because it was finetuned mostly on industrial documents.
57
 
58
 
59
- ## How to Get Started with the Model
60
-
61
- Use the code below to get started with the model.
62
-
63
- [More Information Needed]
64
-
65
- ## Implementation Details
66
-
67
- ### Training Data
68
-
69
- This model has been trained using the [DocVQA dataset]().
70
- [More Information Needed]
71
-
72
- ### Evaluation
73
-
74
- <!-- Met. -->
75
-
76
- #### Metrics
77
-
78
- <!-- This should link to a Dataset Card if possible. -->
79
-
80
- [More Information Needed]
81
-
82
-
83
- #### Results
84
- <!-- docvqa performance -->
85
- <!-- inference time -->
86
-
87
-
88
-
89
-
90
- <!--
91
- ## Technical Specifications [optional]
92
-
93
- ### Model Architecture and Objective
94
-
95
- [More Information Needed]
96
-
97
- ### Compute Infrastructure
98
-
99
- [More Information Needed]
100
-
101
- #### Hardware
102
-
103
- [More Information Needed]
104
 
105
- #### Software
106
 
107
- [More Information Needed] -->
108
 
109
  ## Citation [optional]
110
 
 
5
  - lmms-lab/DocVQA
6
  ---
7
 
8
  ## 1 Introduction
9
  DIVE-Doc is a VLM architecture built as a trade-off between end-to-end lightweight architectures and LVLMs for the DocVQA task.
10
  Without relying on external tools such as OCR, it processes the inputs in an end-to-end way.
 
25
 
26
  ## Quick Start
27
 
28
+ ### Installation
29
+ ```bash
30
+ git clone https://github.com/JayRay5/DIVE-Doc.git
31
+
32
+ cd DIVE-Doc
33
+ conda create -n dive-doc-env python=3.11.5
34
+ conda activate dive-doc-env
35
+ pip install -r requirements.txt
36
+ ```
37
+ ### Inference example using the model repository and Gradio
38
+ In app.py, set the path variable to "JayRay5/DIVE-Doc-ARD-LRes":
39
+ ```bash
40
+ ```
41
+ Then run:
42
+ ```bash
43
+ python app.py
44
+ ```
45
+ This will start a [Gradio]() web interface where you can use the model.
46
+ ## Notification
47
 
 
48
 
49
  ### Direct Use
50
 
51
  This model is designed to answer a question from a single-page image document.
52
 
 
53
 
54
+ ### Downstream Use
55
 
56
  This model can be finetuned on other DocVQA datasets such as [InfoGraphVQA]() to improve its performance on infographic documents, or []() to specialize it for degraded or historical documents.
57
 
 
61
  This model may not perform well on degraded or infographic documents because it was finetuned mostly on industrial documents.
62
 
63
 
64
 
 
65
 
 
66
 
67
  ## Citation [optional]
68