Tags: Transformers · Safetensors · DIVEdoc · docvqa · distillation · VLM · document-understanding · OCR-free
JayRay5 committed · Commit 6519e2d · verified · 1 Parent(s): 6d6d406

Update README.md

Files changed (1): README.md +1 -1
README.md CHANGED
@@ -19,7 +19,7 @@ Where the first category has both a lightweight visual encoder and a language de
  DIVE-Doc contains a small visual encoder in combination with a large decoder in order to balance model size and performance.
  It is built by distilling the [SigLIP-400m](https://arxiv.org/abs/2303.15343) visual encoder of [PaliGEMMA]() into a small hierarchical [Swin transformer]() initialized with the weights of [Donut](), while reusing the original [GEMMA]() decoder.
  This allows DIVE-Doc to keep competitive performance with its teacher while reducing the visual encoder's parameters to one-fifth.
- Moreover, the model is finetuned using LoRA adapters (in this repo, adapters have been merged into the base model using [unload_and_merge]()).
+ Moreover, the model is finetuned using LoRA adapters, which have been merged into the base model using [unload_and_merge]().
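The distillation step the README describes (compressing the SigLIP-400m encoder into a smaller Swin student) can be sketched with a minimal feature-distillation objective. This is a generic illustration, not DIVE-Doc's training code: the MSE loss, the feature shapes, and the projection head mapping student features to the teacher's dimension are all assumptions.

```python
import numpy as np

# Illustrative feature-distillation sketch (NOT DIVE-Doc's actual code):
# the student encoder is trained so its output features match the frozen
# teacher's features; a projection head (assumed here) bridges dimensions.
rng = np.random.default_rng(1)
n_patches, d_teacher, d_student = 4, 16, 8

teacher_feats = rng.normal(size=(n_patches, d_teacher))  # frozen teacher output
student_feats = rng.normal(size=(n_patches, d_student))  # trainable student output
proj = rng.normal(size=(d_student, d_teacher))           # assumed projection head

projected = student_feats @ proj                 # map student dim -> teacher dim
loss = np.mean((projected - teacher_feats) ** 2)  # MSE distillation objective
```

Minimizing this loss over the student (and projection) weights is what lets a much smaller encoder approximate the teacher's representation.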
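The merged-adapter note in the diff can be illustrated by the arithmetic behind folding a LoRA adapter into a base weight: the low-rank update scale · B·A is added to the frozen weight once, so inference needs no adapter path. The dimensions, scaling convention, and variable names below are illustrative assumptions, not values from this repo.

```python
import numpy as np

# Hedged sketch of the LoRA-merge arithmetic (illustrative, not repo code).
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 6, 2, 4
W = rng.normal(size=(d_out, d_in))   # frozen base weight
A = rng.normal(size=(r, d_in))       # LoRA down-projection
B = rng.normal(size=(d_out, r))      # LoRA up-projection
scale = alpha / r                    # common LoRA scaling convention

x = rng.normal(size=(d_in,))
y_adapter = W @ x + scale * (B @ (A @ x))  # base path + adapter path
W_merged = W + scale * (B @ A)             # the one-time merge step
y_merged = W_merged @ x                    # single matmul after merging

assert np.allclose(y_adapter, y_merged)    # merging preserves the output
```

Because the merge is exact (by distributivity of matrix multiplication), the merged checkpoint behaves identically to base-plus-adapter while keeping the original model's shape.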