AlessioChenn committed
Commit 73c54dc · verified · 1 Parent(s): eb5fc4b

Update README.md

Files changed (1)
  1. README.md +9 -9
@@ -14,18 +14,18 @@ license: apache-2.0
 
 <div align="center">
 
-<h1>DocExplainerV0: Visual Document QA with Bounding Box Localization</h1>
+<h1>DocExplainer: Visual Document QA with Bounding Box Localization</h1>
 
 [![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)
 <!-- [![arXiv](https://img.shields.io/badge/arXiv-2501.03403-b31b1b.svg)]() -->
-[![HuggingFace](https://img.shields.io/badge/🤗%20Hugging%20Face-Datasets-yellow)](https://huggingface.co/letxbe/DocExplainerV0)
+[![HuggingFace](https://img.shields.io/badge/🤗%20Hugging%20Face-Datasets-yellow)](https://huggingface.co/letxbe/DocExplainer)
 
 </div>
 
 ## Model description
 
-DocExplainerV0 is a **first-step approach** to Visual Document Question Answering with bounding box localization.
-Unlike standard VLMs that only provide text-based answers, DocExplainerV0 adds **visual evidence through bounding boxes**, making model predictions more interpretable.
+DocExplainer is a **first-step approach** to Visual Document Question Answering with bounding box localization.
+Unlike standard VLMs that only provide text-based answers, DocExplainer adds **visual evidence through bounding boxes**, making model predictions more interpretable.
 It is designed as a **plug-and-play module** to be combined with existing Vision-Language Models (VLMs), decoupling answer generation from spatial grounding.
 
 ⚠️ **Important Note**:
@@ -45,14 +45,14 @@ This model is **not intended as a final solution**, but rather as a **baseline f
 
 ## Model Details
 
-DocExplainerV0 is a fine-tuned SigLIP-based regressor that predicts bounding box coordinates for answer localization in document images. The system operates in a two-stage process:
+DocExplainer is a fine-tuned SigLIP-based regressor that predicts bounding box coordinates for answer localization in document images. The system operates in a two-stage process:
 
 1. **Question Answering**: Any VLM processes the document image and question to generate a textual answer.
-2. **Bounding Box Explanation**: DocExplainerV0 takes the image, question, and generated answer to predict the coordinates of the supporting evidence.
+2. **Bounding Box Explanation**: DocExplainer takes the image, question, and generated answer to predict the coordinates of the supporting evidence.
 
 
 ## Model Architecture
-DocExplainerV0 builds on [SigLIP2](https://huggingface.co/google/siglip2-giant-opt-patch16-384) visual and text embeddings.
+DocExplainer builds on [SigLIP2](https://huggingface.co/google/siglip2-giant-opt-patch16-384) visual and text embeddings.
 
 ![https://i.postimg.cc/BQR21cxp/Blank-diagram-5.png](https://i.postimg.cc/BQR21cxp/Blank-diagram-5.png)
 
@@ -140,9 +140,9 @@ answer = json.loads(decoded_output)['content']
 print(f"Answer: {answer}")
 
 # -----------------------
-# 2. Load DocExplainerV0 for bounding box prediction
+# 2. Load DocExplainer for bounding box prediction
 # -----------------------
-explainer = AutoModel.from_pretrained("letxbe/DocExplainerV0", trust_remote_code=True)
+explainer = AutoModel.from_pretrained("letxbe/DocExplainer", trust_remote_code=True)
 bbox = explainer.predict(image, answer)
 print(f"Predicted bounding box (normalized): {bbox}")
 ```
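
The README snippet touched by this commit prints a normalized bounding box. To draw it on the page, the box has to be scaled to pixel coordinates. The following is a minimal sketch of that step, not part of the commit. It assumes the box is `[x_min, y_min, x_max, y_max]` with values in `[0, 1]`, a layout the diff does not confirm. The helper name `to_pixel_box` is hypothetical.

```python
from typing import Sequence, Tuple


def to_pixel_box(
    bbox: Sequence[float], width: int, height: int
) -> Tuple[int, int, int, int]:
    """Scale a normalized box to integer pixel coordinates.

    ASSUMPTION: bbox is [x_min, y_min, x_max, y_max] with values in [0, 1].
    The diff does not state the layout, so check it against the model card.
    """
    if len(bbox) != 4:
        raise ValueError("expected four coordinates")
    # Clamp to [0, 1] so slightly out-of-range regressor outputs stay on the page.
    x0, y0, x1, y1 = (min(max(float(v), 0.0), 1.0) for v in bbox)
    # Order the corners in case the prediction comes out flipped.
    x0, x1 = sorted((x0, x1))
    y0, y1 = sorted((y0, y1))
    return (
        round(x0 * width),
        round(y0 * height),
        round(x1 * width),
        round(y1 * height),
    )
```

If `image` in the README snippet is a PIL image (also an assumption), the call would look like `to_pixel_box(bbox, *image.size)`, since `image.size` is `(width, height)`.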