AlessioChenn committed · Commit 3bcbbc0 · verified · 1 Parent(s): f24e585

Update README.md

Files changed (1): README.md (+6 -6)
README.md CHANGED
 
DocExplainer is an approach to Document Visual Question Answering (Document VQA) with bounding box localization.
Unlike standard VLMs that only provide text-based answers, DocExplainer adds **visual evidence through bounding boxes**, making model predictions more interpretable.
It is designed as a **plug-and-play module** to be combined with existing Vision-Language Models (VLMs), decoupling answer generation from spatial grounding.

- **Authors:** Alessio Chen, Simone Giovannini, Andrea Gemelli, Fabio Coppini, Simone Marinai
- **Affiliations:** [Letxbe AI](https://letxbe.ai/), [University of Florence](https://www.unifi.it/it)
- **License:** CC-BY-4.0

## Model Details

DocExplainer is a fine-tuned regressor built on [SigLIP2 Giant](https://huggingface.co/google/siglip2-giant-opt-patch16-384) that predicts bounding box coordinates for answer localization in document images. The system operates in two stages:

1. **Question Answering**: Any VLM is used as a black-box component to generate a textual answer, given a document image and a question as input.
2. **Bounding Box Explanation**: DocExplainer takes the image, question, and generated answer and predicts the coordinates of the supporting evidence (see the sketch below).
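
A minimal end-to-end sketch of this two-stage pipeline follows. The class and method names (`StubVLM`, `StubDocExplainer`, `predict`) are illustrative placeholders, not the released API:

```python
# Hypothetical sketch of the two-stage pipeline; stubs stand in for real models.
from PIL import Image

class StubVLM:
    """Stage 1: any off-the-shelf VLM, used as a black box (stubbed here)."""
    def answer(self, image: Image.Image, question: str) -> str:
        return "42.00 EUR"  # a real VLM would generate this text

class StubDocExplainer:
    """Stage 2: predicts a normalized box around the supporting evidence (stubbed)."""
    def predict(self, image: Image.Image, question: str, answer: str):
        return (0.62, 0.81, 0.79, 0.85)  # normalized [x1, y1, x2, y2]

image = Image.new("RGB", (1240, 1754))  # stand-in for a document page
question = "What is the total amount due?"

answer = StubVLM().answer(image, question)                            # stage 1
x1, y1, x2, y2 = StubDocExplainer().predict(image, question, answer)  # stage 2

w, h = image.size
pixel_box = (x1 * w, y1 * h, x2 * w, y2 * h)  # scale back to pixel coordinates
```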

## Model Architecture

DocExplainer builds on [SigLIP2 Giant](https://huggingface.co/google/siglip2-giant-opt-patch16-384) visual and text embeddings.

![DocExplainer architecture diagram](https://i.postimg.cc/2yN9GYwb/Blank-diagram-5.jpg)

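As an illustration, the backbone embeddings could be extracted with the standard `transformers` SigLIP interface. This is a sketch under the assumption that the checkpoint loads via `AutoModel`/`AutoProcessor` and exposes the usual `get_image_features`/`get_text_features` methods; it is not the repository's own loading code:

```python
# Sketch: extracting frozen SigLIP2 visual and text embeddings.
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

ckpt = "google/siglip2-giant-opt-patch16-384"
model = AutoModel.from_pretrained(ckpt).eval()
processor = AutoProcessor.from_pretrained(ckpt)

image = Image.new("RGB", (384, 384))  # stand-in for a document page
text = "question: What is the total? answer: 42.00 EUR"  # assumed input format

with torch.no_grad():
    img_inputs = processor(images=image, return_tensors="pt")
    txt_inputs = processor(text=[text], padding="max_length", return_tensors="pt")
    img_emb = model.get_image_features(**img_inputs)  # (1, hidden_dim)
    txt_emb = model.get_text_features(**txt_inputs)   # (1, hidden_dim)
```
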
## Training Procedure

- Visual and textual embeddings from SigLIP2 are projected into a shared latent space and fused via fully connected layers.
- A regression head outputs normalized coordinates `[x1, y1, x2, y2]`.
- **Backbone**: SigLIP2 Giant (frozen).
- **Loss Function**: Smooth L1 (Huber loss) applied to normalized coordinates in `[0, 1]` (a sketch of this head follows the list).
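
A minimal PyTorch sketch of the projection, fusion, and regression head described above. The layer sizes, the 1536-dimensional embedding width, and the sigmoid used to keep outputs in `[0, 1]` are illustrative assumptions, not the released architecture:

```python
# Sketch of the described head: project, fuse, regress, Smooth L1 loss.
import torch
import torch.nn as nn

class BBoxRegressor(nn.Module):
    def __init__(self, emb_dim: int = 1536, hidden: int = 512):  # sizes assumed
        super().__init__()
        self.proj_img = nn.Linear(emb_dim, hidden)  # project image embedding
        self.proj_txt = nn.Linear(emb_dim, hidden)  # project text embedding
        self.fuse = nn.Sequential(                  # fully connected fusion
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.head = nn.Linear(hidden, 4)            # -> [x1, y1, x2, y2]

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        z = torch.cat([self.proj_img(img_emb), self.proj_txt(txt_emb)], dim=-1)
        return self.head(self.fuse(z)).sigmoid()    # keep coordinates in [0, 1]

model = BBoxRegressor()
img_emb, txt_emb = torch.randn(2, 1536), torch.randn(2, 1536)  # frozen backbone outputs
target = torch.rand(2, 4)                                      # normalized ground-truth boxes
loss = nn.SmoothL1Loss()(model(img_emb, txt_emb), target)      # Huber-style loss
loss.backward()
```
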
 
  #### Training Setup