nielsr (HF Staff) committed
Commit 84109fe · verified · 1 parent: b14aa1b

Improve metadata with `any-to-any` pipeline tag and `transformers` library name

This PR updates the model card to accurately reflect the model's versatile capabilities and improve discoverability.

Key changes include:
- Updating `pipeline_tag` from `image-text-to-text` to `any-to-any` to better represent the model's ability to generate both textual and diverse visual outputs (like segmentation masks and bounding boxes).
- Adding `library_name: transformers` to ensure the model benefits from automated code snippets and is correctly identified as compatible with the Hugging Face `transformers` library.
- Updating the sample usage code snippet's prompt to match the more descriptive example found in the GitHub README.
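With these changes applied, the model card's YAML front matter would read roughly as follows (a sketch assembled from the diff below, not the authoritative file):

```yaml
---
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
- Qwen/Qwen2.5-VL-7B-Instruct
language:
- en
- zh
license: apache-2.0
pipeline_tag: any-to-any
library_name: transformers
---
```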

Please review and merge this PR.

Files changed (1)
  1. README.md +85 -14
README.md CHANGED
@@ -1,12 +1,13 @@
 ---
-license: apache-2.0
+base_model:
+- Qwen/Qwen2.5-VL-3B-Instruct
+- Qwen/Qwen2.5-VL-7B-Instruct
 language:
 - en
 - zh
-base_model:
-- Qwen/Qwen2.5-VL-3B-Instruct
-- Qwen/Qwen2.5-VL-7B-Instruct
-pipeline_tag: image-text-to-text
+license: apache-2.0
+pipeline_tag: any-to-any
+library_name: transformers
 ---
 
 <div align='center'><h1>Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs</h1></div>
@@ -30,13 +31,21 @@ By introducing VRTs, we achieve **semantic reasoning and object-specific visual
 
 As illustrated in Figure C, we have validated PaDT across four major visual perception and understanding tasks. In all cases, PaDT achieves **state-of-the-art** performance compared to conventional character-by-character coordinate-generation MLLMs.
 
-We hope this work will inspire further exploration in the community:
+### Why PaDT Succeeds
 
-- What does true multimodal reasoning look like?
+The success of PaDT stems from its deep insight into the visual capability bottlenecks of MLLMs.
 
-- How can textual and visual elements be generated together in an MLLM output sequence?
+1. **Native Vision-Language Alignment**: Instead of "fitting" vision into text space, PaDT directly treats visual patches as decodable tokens, achieving seamless modality alignment.
+2. **Dynamic Visual Binding**: A dynamic embedding mechanism tightly binds Visual Reference Tokens (VRTs) to each image, preventing cross-image confusion.
+3. **Unified Token Space**: Enables the LLM to handle language and vision uniformly, simplifying training and improving consistency.
+4. **Lightweight Decoder**: Decouples dense prediction from the LLM, preserving its semantic reasoning while adding precise spatial output capability.
+5. **Strong Multi-Task Generalization**: The PaDT Pro model, jointly trained on REC/RES/OVD/RIC, can switch tasks via prompts and outperforms single-task models.
 
-- And is a purely text-based output ever sufficient for visual reasoning?
+We hope this work will **inspire further exploration** in the community:
+
+- What does true multimodal reasoning look like?
+
+- And is a purely text-based output ever sufficient for visual reasoning?
 
 <div align="center">
 <img src="./assets/Motivation.webp" width="900"/>
@@ -62,7 +71,7 @@ conda activate PaDT
 bash setup.sh
 ```
 
-The following contains a code snippet illustrating how to use our PaDT.
+The following code snippet illustrates how to use PaDT. More details can be found in [eval/test_demo.py](eval/test_demo.py).
 
 ```python
 import torch
@@ -84,7 +93,7 @@ processor = VisonTextProcessingClass(processor, model.config.vision_config.spati
 processor.prepare(model.model.embed_tokens.weight.shape[0])
 
 # question prompt
-PROMPT = "Please describe this image."
+PROMPT = """Please carefully check the image and detect the object this sentence describes: "The car is on the left side of the horse"."""
 
 # construct conversation
 message = [
@@ -122,7 +131,8 @@ with torch.inference_mode():
 # extract Visual Reference Tokens within the sequence
 completions, feats, labels, vrts, vrts_feats = parseVRTintoCompletion(processor, completion_ids, generate_returned_result['hidden_states'], torch.Tensor([False]))
 
-print("\ngenerate result:", completions[0])
+print("
+generate result:", completions[0])
 
 # decode low-level visual task results
 low_res_image_embeds = generate_returned_result.past_image_embeds
@@ -130,7 +140,10 @@ with torch.inference_mode():
 visual_pe = generate_returned_result.past_visual_pe
 decoded_list = model.vl_decode(feats, low_res_image_embeds, high_res_image_embeds, prompt_inputs['image_grid_thw'], visual_pe)
 
-print(f"\npred_bboxes: {decoded_list['pred_boxes']},\npred_scores: {decoded_list['pred_score'].sigmoid()}\n")
+print(f"
+pred_bboxes: {decoded_list['pred_boxes']},
+pred_scores: {decoded_list['pred_score'].sigmoid()}
+")
 ```
 
 ## Models
@@ -174,6 +187,64 @@ Here are some randomly selected test examples showcasing PaDT's excellent perf
 <img src="./assets/TAM.webp" width="900"/>
 </div>
 
+## Training Instructions
+
+Download the datasets:
+
+- [COCO](https://cocodataset.org/#home)
+
+- RefCOCO/+/g
+```bash
+wget https://web.archive.org/web/20220413011718/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcoco.zip
+wget https://web.archive.org/web/20220413011656/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcoco+.zip
+wget https://web.archive.org/web/20220413012904/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcocog.zip
+```
+
+Unpack these datasets and place them under the following directory:
+
+```
+PaDT/
+├── dataset/
+│   ├── coco/
+│   │   ├── annotations/
+│   │   ├── train2014/
+│   │   ├── train2017/
+│   │   ├── val2014/
+│   │   └── val2017/
+│   └── RefCOCO/
+│       ├── refcoco/
+│       ├── refcoco+/
+│       └── refcocog/
+```
+
+Preprocess the datasets:
+1. Preprocess via our scripts (first update the dataset path configuration in the preprocessing scripts):
+```bash
+cd src/preprocess
+python process_coco.py
+python process_refcoco.py
+```
+2. Alternatively, we have released preprocessed, training-ready datasets on Hugging Face:
+
+| Dataset | Dataset Path | Task Type |
+| - | - | - |
+| COCO | [PaDT-MLLM/COCO](https://huggingface.co/datasets/PaDT-MLLM/COCO) | Open Vocabulary Detection |
+| RefCOCO | [PaDT-MLLM/RefCOCO](https://huggingface.co/datasets/PaDT-MLLM/RefCOCO) | Referring Expression Comprehension/Segmentation |
+| RIC | [PaDT-MLLM/ReferringImageCaptioning](https://huggingface.co/datasets/PaDT-MLLM/ReferringImageCaptioning) | Referring Image Captioning |
+
+
+The training scripts in `run_scripts` are ready to execute.
+
+For example, to train the PaDT-Pro 3B model on a single node with 8×96 GB GPUs:
+
+```bash
+bash ./run_scripts/padt_pro_3b_sft.sh
+```
+
+## Evaluation
+
+We provide a simple inference example in `eval/test_demo.py`. More evaluation scripts will be added soon.
+
 ## License Agreement
 
 PaDT is licensed under Apache 2.0.
@@ -192,4 +263,4 @@ We kindly encourage citation of our work if you find it useful.
 primaryClass={cs.CV},
 url={https://arxiv.org/abs/2510.01954},
 }
-```
+```