Update README.md
README.md CHANGED
@@ -27,7 +27,7 @@ license: apache-2.0
DocExplainer is an approach to Document Visual Question Answering (Document VQA) with bounding box localization.
Unlike standard VLMs that only provide text-based answers, DocExplainer adds **visual evidence through bounding boxes**, making model predictions more interpretable.
It is designed as a **plug-and-play module** to be combined with existing Vision-Language Models (VLMs), decoupling answer generation from spatial grounding.

- **Authors:** Alessio Chen, Simone Giovannini, Andrea Gemelli, Fabio Coppini, Simone Marinai
- **Affiliations:** [Letxbe AI](https://letxbe.ai/), [University of Florence](https://www.unifi.it/it)
- **License:** CC-BY-4.0
@@ -41,21 +41,21 @@ It is designed as a **plug-and-play module** to be combined with existing Vision

## Model Details

- DocExplainer is a fine-tuned
+ DocExplainer is a fine-tuned [SigLIP2 Giant](https://huggingface.co/google/siglip2-giant-opt-patch16-384)-based regressor that predicts bounding box coordinates for answer localization in document images. The system operates in a two-stage process (see the sketch after this list):
- 1. **Question Answering**: Any VLM
+ 1. **Question Answering**: Any VLM is used as a black-box component to generate a textual answer, given a document image and a question as input.
2. **Bounding Box Explanation**: DocExplainer takes the image, question, and generated answer to predict the coordinates of the supporting evidence.
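Below is a minimal sketch of how the two stages could be wired together. The `answer_question` and `localize_answer` functions and the `GroundedAnswer` container are illustrative placeholders, not the released API; only the flow (generate the answer first, then ground it) comes from the description above.

```python
from dataclasses import dataclass
from typing import Tuple

from PIL import Image


@dataclass
class GroundedAnswer:
    answer: str
    box: Tuple[float, float, float, float]  # normalized [x1, y1, x2, y2]


def answer_question(image: Image.Image, question: str) -> str:
    """Stage 1: any off-the-shelf VLM, used as a black box (placeholder)."""
    return "42"


def localize_answer(image: Image.Image, question: str, answer: str) -> Tuple[float, float, float, float]:
    """Stage 2: DocExplainer-style regressor predicting the evidence box (placeholder)."""
    return (0.1, 0.2, 0.4, 0.25)


def answer_with_evidence(image: Image.Image, question: str) -> GroundedAnswer:
    # Answer generation and spatial grounding stay decoupled: the localizer
    # only consumes the image, the question, and the generated text answer.
    answer = answer_question(image, question)
    box = localize_answer(image, question, answer)
    return GroundedAnswer(answer=answer, box=box)


if __name__ == "__main__":
    page = Image.new("RGB", (1024, 1448), "white")  # stand-in for a document page
    result = answer_with_evidence(page, "What is the invoice total?")
    print(result.answer, result.box)
```

Keeping the two stages behind separate functions mirrors the plug-and-play design: any VLM can fill stage 1 without retraining the localizer.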
## Model Architecture
- DocExplainer builds on [SigLIP2](https://huggingface.co/google/siglip2-giant-opt-patch16-384) visual and text embeddings.
+ DocExplainer builds on [SigLIP2 Giant](https://huggingface.co/google/siglip2-giant-opt-patch16-384) visual and text embeddings.
- ![https://i.postimg.cc/
+ ![(image/png)](https://i.postimg.cc/B6W9Nd0t/siglip-1.png)
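For context, a minimal sketch of pulling pooled image and text embeddings from the SigLIP2 Giant backbone with `transformers`. It assumes the checkpoint exposes the usual SigLIP-style `get_image_features` / `get_text_features` API, and the way the question and answer are concatenated into one text query is an illustrative guess rather than the released preprocessing.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

ckpt = "google/siglip2-giant-opt-patch16-384"
model = AutoModel.from_pretrained(ckpt)          # frozen SigLIP2 backbone
processor = AutoProcessor.from_pretrained(ckpt)

page = Image.new("RGB", (1024, 1448), "white")   # stand-in for a document page
query = "What is the invoice total? Answer: 1,250 EUR"  # question + answer text (assumed format)

inputs = processor(
    text=[query],
    images=page,
    padding="max_length",
    max_length=64,
    return_tensors="pt",
)

with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"])

print(image_emb.shape, text_emb.shape)  # pooled embeddings passed on to the regression head
```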
## Training Procedure
- Visual and textual embeddings from SigLIP2 are projected into a shared latent space and fused via fully connected layers.
- A regression head outputs normalized coordinates `[x1, y1, x2, y2]`.
- - **Backbone**: SigLiP2 (frozen).
+ - **Backbone**: SigLIP2 Giant (frozen).
- **Loss Function**: Smooth L1 (Huber) loss applied to normalized coordinates in [0, 1] (see the training-step sketch below).
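A compact sketch of the kind of projection, fusion, and regression head described in the bullets above, plus a single Smooth L1 training step. The embedding and hidden sizes, the sigmoid used to keep outputs in [0, 1], and the optimizer are illustrative assumptions, not the released configuration.

```python
import torch
import torch.nn as nn


class BoxRegressionHead(nn.Module):
    """Illustrative head: project both modalities, fuse, regress a normalized box."""

    def __init__(self, img_dim: int = 1536, txt_dim: int = 1536, hidden: int = 512):
        super().__init__()
        # Project visual and textual embeddings into a shared latent space.
        self.img_proj = nn.Linear(img_dim, hidden)
        self.txt_proj = nn.Linear(txt_dim, hidden)
        # Fuse via fully connected layers, then regress [x1, y1, x2, y2].
        self.fusion = nn.Sequential(
            nn.Linear(2 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 4),
        )

    def forward(self, image_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.img_proj(image_emb), self.txt_proj(text_emb)], dim=-1)
        return self.fusion(fused).sigmoid()  # keep coordinates in [0, 1]


head = BoxRegressionHead()
criterion = nn.SmoothL1Loss()  # Smooth L1 / Huber loss on normalized coordinates
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)  # only the head trains; the backbone stays frozen

# One illustrative training step on random stand-in embeddings and boxes.
image_emb = torch.randn(8, 1536)
text_emb = torch.randn(8, 1536)
target_boxes = torch.rand(8, 4)  # normalized [x1, y1, x2, y2] ground truth

pred_boxes = head(image_emb, text_emb)
loss = criterion(pred_boxes, target_boxes)
loss.backward()
optimizer.step()
print(float(loss))
```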
#### Training Setup