AlessioChenn committed (verified)
Commit eb5fc4b · Parent: 8d28966

Update README.md

Files changed (1):
  1. README.md +63 -31
README.md CHANGED
@@ -17,7 +17,7 @@ license: apache-2.0
 <h1>DocExplainerV0: Visual Document QA with Bounding Box Localization</h1>
 
 [![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)
-[![arXiv](https://img.shields.io/badge/arXiv-2501.03403-b31b1b.svg)]()
+<!-- [![arXiv](https://img.shields.io/badge/arXiv-2501.03403-b31b1b.svg)]() -->
 [![HuggingFace](https://img.shields.io/badge/🤗%20Hugging%20Face-Datasets-yellow)](https://huggingface.co/letxbe/DocExplainerV0)
 
 </div>
@@ -78,17 +78,73 @@ Here is a simple example of how to use `DocExplainer` to get an answer and its c
 ```python
 from PIL import Image
 import requests
-from transformers import AutoModel
+import torch
+from transformers import AutoModel, AutoModelForImageTextToText, AutoProcessor
+import json
 
-# Load example document image
-url = "https://datasets-server.huggingface.co/cached-assets/letxbe/BoundingDocs/--/47db6d2b6af0aadfd082591a8445d0f47c3b8d61/--/default/test/7/doc_images/image-1d100e9.jpg"
+url = "https://i.postimg.cc/BvftyvS3/image-1d100e9.jpg"
 image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
 question = "What is the invoice number?"
-answer = "3Y8M2d-846" # generate it with any VLM
 
-explainer = AutoModel.from_pretrained("letxbe/DocExplainerv0", trust_remote_code=True)
+# -----------------------
+# 1. Load SmolVLM2-2.2B for answer generation
+# -----------------------
+vlm_model = AutoModelForImageTextToText.from_pretrained(
+    "HuggingFaceTB/SmolVLM2-2.2B-Instruct",
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+    attn_implementation="flash_attention_2"
+)
+processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM2-2.2B-Instruct")
+
+PROMPT = """Based only on the document image, answer the following question:
+Question: {QUESTION}
+Provide ONLY a JSON response in the following format (no trailing commas!):
+{{
+    "content": "answer"
+}}
+"""
+
+prompt_text = PROMPT.format(QUESTION=question)
+
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "image", "image": image},
+            {"type": "text", "text": prompt_text},
+        ]
+    },
+]
+
+inputs = processor.apply_chat_template(
+    messages,
+    add_generation_prompt=True,
+    tokenize=True,
+    return_dict=True,
+    return_tensors="pt",
+).to(vlm_model.device, dtype=torch.bfloat16)
+
+input_length = inputs['input_ids'].shape[1]
+generated_ids = vlm_model.generate(**inputs, do_sample=False, max_new_tokens=2056)
+
+output_ids = generated_ids[:, input_length:]
+generated_texts = processor.batch_decode(
+    output_ids,
+    skip_special_tokens=True,
+)
+
+decoded_output = generated_texts[0].replace("Assistant:", "", 1).strip()
+answer = json.loads(decoded_output)['content']
+
+print(f"Answer: {answer}")
+
+# -----------------------
+# 2. Load DocExplainerV0 for bounding box prediction
+# -----------------------
+explainer = AutoModel.from_pretrained("letxbe/DocExplainerV0", trust_remote_code=True)
 bbox = explainer.predict(image, answer)
-print(f"Bounding box: {bbox}") # [x1, y1, x2, y2]
+print(f"Predicted bounding box (normalized): {bbox}")
 ```
 
 
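The updated snippet prints the box in normalized coordinates. As a quick follow-up, here is a minimal sketch of overlaying that prediction on the page image; it continues from the snippet above and assumes `predict` returns `[x1, y1, x2, y2]` values in the 0-1 range (the ordering given in the removed `# [x1, y1, x2, y2]` comment):

```python
# Continues from the README example above: `image` and `bbox` are already defined.
# Assumption: bbox = [x1, y1, x2, y2], normalized to [0, 1].
from PIL import ImageDraw

w, h = image.size
x1, y1, x2, y2 = bbox
pixel_box = (x1 * w, y1 * h, x2 * w, y2 * h)  # scale normalized coords to pixels

annotated = image.copy()
ImageDraw.Draw(annotated).rectangle(pixel_box, outline="red", width=3)
annotated.save("answer_localization.jpg")  # page with the answer region boxed
```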
@@ -109,30 +165,6 @@ Example Output:
 </table>
 
 
-
-
-## Performance
-Evaluated on [BoundingDocs v2.0](https://huggingface.co/datasets/letxbe/BoundingDocs) dataset:
-### Full DocExplainer Pipeline
-
-| VLM Model       | ANLS ↑ | IoU ↑ |
-| --------------- | ------ | ----- |
-| SmolVLM2-2.2b   | 0.572  | 0.175 |
-| qwen2.5-vl-7b   | 0.689  | 0.188 |
-
-
-### VLM-only Baseline (for comparison)
-| VLM Model       | ANLS ↑ | IoU ↑ |
-| --------------- | ------ | ----- |
-| SmolVLM2-2.2b   | 0.561  | 0.011 |
-| qwen2.5-vl-7b   | 0.720  | 0.038 |
-| Claude Sonnet 4 | 0.737  | 0.031 |
-
-
-
-
-
-
 ## Limitations
 - **Prototype only**: Intended as a first approach, not a production-ready solution.
 - **Dataset constraints**: Current evaluation is limited to cases where an answer fits in a single bounding box. Answers requiring reasoning over multiple regions or not fully captured by OCR cannot be properly evaluated.
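The removed Performance tables report ANLS and IoU. For reference, here is a minimal sketch of the standard intersection-over-union between two `[x1, y1, x2, y2]` boxes in the same units; this is the textbook formula, not necessarily the exact evaluation code behind those numbers:

```python
def iou(box_a, box_b):
    """Intersection over Union of two [x1, y1, x2, y2] boxes (same units)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Hypothetical predicted vs. ground-truth boxes, both normalized:
print(iou([0.10, 0.10, 0.40, 0.20], [0.15, 0.10, 0.40, 0.25]))
```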
 
17
  <h1>DocExplainerV0: Visual Document QA with Bounding Box Localization</h1>
18
 
19
  [![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)
20
+ <!-- [![arXiv](https://img.shields.io/badge/arXiv-2501.03403-b31b1b.svg)]() -->
21
  [![HuggingFace](https://img.shields.io/badge/🤗%20Hugging%20Face-Datasets-yellow)](https://huggingface.co/letxbe/DocExplainerV0)
22
 
23
  </div>
 
78
  ```python
79
  from PIL import Image
80
  import requests
81
+ import torch
82
+ from transformers import AutoModel, AutoModelForImageTextToText, AutoProcessor
83
+ import json
84
 
85
+ url = "https://i.postimg.cc/BvftyvS3/image-1d100e9.jpg"
 
86
  image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
87
  question = "What is the invoice number?"
 
88
 
89
+ # -----------------------
90
+ # 1. Load SmolVLM2-2.2B for answer generation
91
+ # -----------------------
92
+ vlm_model = AutoModelForImageTextToText.from_pretrained(
93
+ "HuggingFaceTB/SmolVLM2-2.2B-Instruct",
94
+ torch_dtype=torch.bfloat16,
95
+ device_map="auto",
96
+ attn_implementation="flash_attention_2"
97
+ )
98
+ processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM2-2.2B-Instruct")
99
+
100
+ PROMPT = """Based only on the document image, answer the following question:
101
+ Question: {QUESTION}
102
+ Provide ONLY a JSON response in the following format (no trailing commas!):
103
+ {{
104
+ "content": "answer"
105
+ }}
106
+ """
107
+
108
+ prompt_text = PROMPT.format(QUESTION=question)
109
+
110
+ messages = [
111
+ {
112
+ "role": "user",
113
+ "content": [
114
+ {"type": "image", "image": image},
115
+ {"type": "text", "text": prompt_text},
116
+ ]
117
+ },
118
+ ]
119
+
120
+ inputs = processor.apply_chat_template(
121
+ messages,
122
+ add_generation_prompt=True,
123
+ tokenize=True,
124
+ return_dict=True,
125
+ return_tensors="pt",
126
+ ).to(vlm_model.device, dtype=torch.bfloat16)
127
+
128
+ input_length = inputs['input_ids'].shape[1]
129
+ generated_ids = vlm_model.generate(**inputs, do_sample=False, max_new_tokens=2056)
130
+
131
+ output_ids = generated_ids[:, input_length:]
132
+ generated_texts = processor.batch_decode(
133
+ output_ids,
134
+ skip_special_tokens=True,
135
+ )
136
+
137
+ decoded_output = generated_texts[0].replace("Assistant:", "", 1).strip()
138
+ answer = json.loads(decoded_output)['content']
139
+
140
+ print(f"Answer: {answer}")
141
+
142
+ # -----------------------
143
+ # 2. Load DocExplainerV0 for bounding box prediction
144
+ # -----------------------
145
+ explainer = AutoModel.from_pretrained("letxbe/DocExplainerV0", trust_remote_code=True)
146
  bbox = explainer.predict(image, answer)
147
+ print(f"Predicted bounding box (normalized): {bbox}")
148
  ```
149
 
150
 
 
165
  </table>
166
 
167
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
168
  ## Limitations
169
  - **Prototype only**: Intended as a first approach, not a production-ready solution.
170
  - **Dataset constraints**: Current evaluation is limited to cases where an answer fits in a single bounding box. Answers requiring reasoning over multiple regions or not fully captured by OCR cannot be properly evaluated.