Update README.md
README.md CHANGED

````diff
@@ -14,18 +14,18 @@ license: apache-2.0
 
 <div align="center">
 
-<h1>
+<h1>DocExplainer: Visual Document QA with Bounding Box Localization</h1>
 
 [](https://creativecommons.org/licenses/by/4.0/)
 <!-- []() -->
-[](https://huggingface.co/letxbe/
+[](https://huggingface.co/letxbe/DocExplainer)
 
 </div>
 
 ## Model description
 
-
-Unlike standard VLMs that only provide text-based answers,
+DocExplainer is a **first-step approach** to Visual Document Question Answering with bounding box localization.
+Unlike standard VLMs that only provide text-based answers, DocExplainer adds **visual evidence through bounding boxes**, making model predictions more interpretable.
 It is designed as a **plug-and-play module** to be combined with existing Vision-Language Models (VLMs), decoupling answer generation from spatial grounding.
 
 ⚠️ **Important Note**:
@@ -45,14 +45,14 @@ This model is **not intended as a final solution**, but rather as a **baseline f
 
 ## Model Details
 
-
+DocExplainer is a fine-tuned SigLIP-based regressor that predicts bounding box coordinates for answer localization in document images. The system operates in a two-stage process:
 
 1. **Question Answering**: Any VLM processes the document image and question to generate a textual answer.
-2. **Bounding Box Explanation**:
+2. **Bounding Box Explanation**: DocExplainer takes the image, question, and generated answer to predict the coordinates of the supporting evidence.
 
 
 ## Model Architecture
-
+DocExplainer builds on [SigLIP2](https://huggingface.co/google/siglip2-giant-opt-patch16-384) visual and text embeddings.
 
 
 
@@ -140,9 +140,9 @@ answer = json.loads(decoded_output)['content']
 print(f"Answer: {answer}")
 
 # -----------------------
-# 2. Load
+# 2. Load DocExplainer for bounding box prediction
 # -----------------------
-explainer = AutoModel.from_pretrained("letxbe/
+explainer = AutoModel.from_pretrained("letxbe/DocExplainer", trust_remote_code=True)
 bbox = explainer.predict(image, answer)
 print(f"Predicted bounding box (normalized): {bbox}")
 ```
````
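The new "Model Architecture" line says DocExplainer builds on SigLIP2 visual and text embeddings with a fine-tuned regressor, but the commit shows no architecture code. As a reading aid, here is a minimal PyTorch sketch of that kind of design; the class name, the frozen backbone, the MLP head sizes, and the sigmoid-normalized `(x1, y1, x2, y2)` output are all illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
from transformers import AutoModel


class SiglipBBoxRegressor(nn.Module):
    """Sketch of a SigLIP2-based bbox regressor: pooled image and text
    embeddings are concatenated and mapped to 4 normalized coordinates."""

    def __init__(self, backbone_id: str = "google/siglip2-giant-opt-patch16-384"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_id)
        # Assumption: the dual encoder stays frozen; only the head is trained.
        for p in self.backbone.parameters():
            p.requires_grad = False
        in_dim = (self.backbone.config.vision_config.hidden_size
                  + self.backbone.config.text_config.hidden_size)
        # Hypothetical regression head; sizes are illustrative.
        self.head = nn.Sequential(
            nn.Linear(in_dim, 512),
            nn.GELU(),
            nn.Linear(512, 4),
            nn.Sigmoid(),  # keep (x1, y1, x2, y2) in [0, 1]
        )

    def forward(self, pixel_values: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
        # Pooled embeddings from the SigLIP vision and text towers.
        img = self.backbone.get_image_features(pixel_values=pixel_values)
        txt = self.backbone.get_text_features(input_ids=input_ids)
        return self.head(torch.cat([img, txt], dim=-1))
```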
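The usage snippet prints a normalized bounding box. A natural follow-up, sketched below under the assumption that `bbox` is `(x1, y1, x2, y2)` with values in `[0, 1]` as the `print` statement suggests, is scaling it back to pixel coordinates and drawing it on the page image; `draw_bbox` is a hypothetical helper, not part of the model's API.

```python
from PIL import Image, ImageDraw


def draw_bbox(image: Image.Image, bbox, outline: str = "red", width: int = 3) -> Image.Image:
    """Scale a normalized (x1, y1, x2, y2) box to pixels and draw it."""
    w, h = image.size
    x1, y1, x2, y2 = bbox
    box_px = (x1 * w, y1 * h, x2 * w, y2 * h)
    annotated = image.copy()
    ImageDraw.Draw(annotated).rectangle(box_px, outline=outline, width=width)
    return annotated


# Example, continuing from the README snippet:
# annotated = draw_bbox(image, bbox)
# annotated.save("answer_evidence.png")
```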