Improve metadata with `any-to-any` pipeline tag and `transformers` library name

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +85 -14
README.md CHANGED
@@ -1,12 +1,13 @@
1
  ---
2
- license: apache-2.0
 
 
3
  language:
4
  - en
5
  - zh
6
- base_model:
7
- - Qwen/Qwen2.5-VL-3B-Instruct
8
- - Qwen/Qwen2.5-VL-7B-Instruct
9
- pipeline_tag: image-text-to-text
10
  ---
11
 
12
  <div align='center'><h1>Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs</h1></div>
@@ -30,13 +31,21 @@ By introducing VRTs, we achieve **semantic reasoning and object-specific visual
30
 
31
  As illustrated in Figure C, we have validated PaDT across four major visual perception and understanding tasks. In all cases, PaDT achieves **state-of-the-art** performance compared to conventional character-by-character coordinate-generation MLLMs.
32
 
33
- We hope this work will inspire further exploration in the community:
34
 
35
- - What does true multimodal reasoning look like?
36
 
37
- - How can textual and visual elements be generated together in an MLLM output sequence?
38
 
39
- - And is a purely text-based output ever sufficient for visual reasoning?
40
 
41
  <div align="center">
42
  <img src="./assets/Motivation.webp" width="900"/>
@@ -62,7 +71,7 @@ conda activate PaDT
62
  bash setup.sh
63
  ```
64
 
65
- The following contains a code snippet illustrating how to use our PaDT.
66
 
67
  ```python
68
  import torch
@@ -84,7 +93,7 @@ processor = VisonTextProcessingClass(processor, model.config.vision_config.spati
84
  processor.prepare(model.model.embed_tokens.weight.shape[0])
85
 
86
  # question prompt
87
- PROMPT = "Please describe this image."
88
 
89
  # construct conversation
90
  message = [
@@ -122,7 +131,8 @@ with torch.inference_mode():
122
  # extract Visual Reference Tokens within the sequence
123
  completions, feats, labels, vrts, vrts_feats = parseVRTintoCompletion(processor, completion_ids, generate_returned_result['hidden_states'], torch.Tensor([False]))
124
 
125
- print("\ngenerate result:", completions[0])
 
126
 
127
  # decode low-level visual task results
128
  low_res_image_embeds = generate_returned_result.past_image_embeds
@@ -130,7 +140,10 @@ with torch.inference_mode():
130
  visual_pe = generate_returned_result.past_visual_pe
131
  decoded_list = model.vl_decode(feats, low_res_image_embeds, high_res_image_embeds, prompt_inputs['image_grid_thw'], visual_pe)
132
 
133
- print(f"\npred_bboxes: {decoded_list['pred_boxes']},\npred_scores: {decoded_list['pred_score'].sigmoid()}\n")
134
  ```
135
 
136
  ## Models
@@ -174,6 +187,64 @@ Here are some randomly selected test examples showcasing PaDT's excellent perf
174
  <img src="./assets/TAM.webp" width="900"/>
175
  </div>
176
 
177
  ## License Agreement
178
 
179
  PaDT is licensed under Apache 2.0.
@@ -192,4 +263,4 @@ We kindly encourage citation of our work if you find it useful.
192
  primaryClass={cs.CV},
193
  url={https://arxiv.org/abs/2510.01954},
194
  }
195
- ```
 
1
  ---
2
+ base_model:
3
+ - Qwen/Qwen2.5-VL-3B-Instruct
4
+ - Qwen/Qwen2.5-VL-7B-Instruct
5
  language:
6
  - en
7
  - zh
8
+ license: apache-2.0
9
+ pipeline_tag: any-to-any
10
+ library_name: transformers
 
11
  ---
12
 
13
  <div align='center'><h1>Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs</h1></div>
 
31
 
32
  As illustrated in Figure C, we have validated PaDT across four major visual perception and understanding tasks. In all cases, PaDT achieves **state-of-the-art** performance compared to conventional character-by-character coordinate-generation MLLMs.
33
 
34
+ ### Why Does PaDT Succeed?
35
 
36
+ PaDT succeeds because it directly targets the visual capability bottlenecks of MLLMs.
37
 
38
+ 1. **Native Vision-Language Alignment**: Instead of "fitting" vision into text space, PaDT directly treats visual patches as decodable tokens, achieving seamless modality alignment.
39
+ 2. **Dynamic Visual Binding**: A dynamic embedding mechanism tightly binds Visual Reference Tokens (VRTs) to each image, preventing cross-image confusion.
40
+ 3. **Unified Token Space**: Enables the LLM to handle language and vision uniformly, simplifying training and improving consistency.
41
+ 4. **Lightweight Decoder**: Decouples dense prediction from the LLM, preserving its semantic reasoning while adding precise spatial output capability.
42
+ 5. **Strong Multi-Task Generalization**: The PaDT Pro model, jointly trained on REC/RES/OVD/RIC, can switch tasks via prompts and outperforms single-task models.
43
 
44
+ We hope this work will **inspire further exploration** in the community:
45
+
46
+ - What does true multimodal reasoning look like?
47
+
48
+ - And is a purely text-based output ever sufficient for visual reasoning?
49
 
50
  <div align="center">
51
  <img src="./assets/Motivation.webp" width="900"/>
 
71
  bash setup.sh
72
  ```
73
 
74
+ The following code snippet illustrates how to use PaDT; see [eval/test_demo.py](eval/test_demo.py) for more details.
75
 
76
  ```python
77
  import torch
 
93
  processor.prepare(model.model.embed_tokens.weight.shape[0])
94
 
95
  # question prompt
96
+ PROMPT = """Please carefully check the image and detect the object this sentence describes: "The car is on the left side of the horse"."""
97
 
98
  # construct conversation
99
  message = [
 
131
  # extract Visual Reference Tokens within the sequence
132
  completions, feats, labels, vrts, vrts_feats = parseVRTintoCompletion(processor, completion_ids, generate_returned_result['hidden_states'], torch.Tensor([False]))
133
 
134
+ print("\ngenerate result:", completions[0])
136
 
137
  # decode low-level visual task results
138
  low_res_image_embeds = generate_returned_result.past_image_embeds
 
140
  visual_pe = generate_returned_result.past_visual_pe
141
  decoded_list = model.vl_decode(feats, low_res_image_embeds, high_res_image_embeds, prompt_inputs['image_grid_thw'], visual_pe)
142
 
143
+ print(f"\npred_bboxes: {decoded_list['pred_boxes']},\npred_scores: {decoded_list['pred_score'].sigmoid()}\n")
147
  ```
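The snippet above prints raw outputs; here is a minimal post-processing sketch, assuming `pred_boxes` is an (N, 4) tensor and `pred_score` holds raw logits (shapes inferred from the `sigmoid()` call in the snippet, not from a documented API):

```python
import torch

def filter_detections(pred_boxes: torch.Tensor, pred_score: torch.Tensor, threshold: float = 0.5):
    # Convert logits to confidences and keep only boxes above the threshold.
    scores = pred_score.sigmoid()
    keep = scores > threshold
    return pred_boxes[keep], scores[keep]

# Synthetic example: three candidate boxes with raw logit scores.
boxes = torch.tensor([[10.0, 20.0, 50.0, 60.0],
                      [0.0, 0.0, 5.0, 5.0],
                      [30.0, 30.0, 90.0, 90.0]])
logits = torch.tensor([2.0, -3.0, 1.0])
kept_boxes, kept_scores = filter_detections(boxes, logits)
print(kept_boxes.shape[0])  # 2: only the logits 2.0 and 1.0 clear the 0.5 threshold
```

With the real model, the equivalent call would be `filter_detections(decoded_list['pred_boxes'], decoded_list['pred_score'])`, assuming those keys hold per-box tensors as the print statement suggests.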
148
 
149
  ## Models
 
187
  <img src="./assets/TAM.webp" width="900"/>
188
  </div>
189
 
190
+ ## Training Instructions
191
+
192
+ Download Datasets:
193
+
194
+ - [COCO](https://cocodataset.org/#home)
195
+
196
+ - RefCOCO/+/g
197
+ ```bash
198
+ wget https://web.archive.org/web/20220413011718/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcoco.zip
199
+ wget https://web.archive.org/web/20220413011656/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcoco+.zip
200
+ wget https://web.archive.org/web/20220413012904/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcocog.zip
201
+ ```
202
+
203
+ Unpack these datasets and place them under the following directory:
204
+
205
+ ```
206
+ PaDT/
207
+ ├── dataset/
208
+ │   ├── coco/
209
+ │   │   ├── annotations/
210
+ │   │   ├── train2014/
211
+ │   │   ├── train2017/
212
+ │   │   ├── val2014/
213
+ │   │   └── val2017/
214
+ │   └── RefCOCO/
215
+ │       ├── refcoco/
216
+ │       ├── refcoco+/
217
+ │       └── refcocog/
218
+ ```
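The unpacking step can be sketched in shell. The `mkdir`/`unzip` targets below are inferred from the tree above (the COCO image folders come from the separate COCO download), so treat this as a sketch rather than an official setup script:

```shell
# Create the directory skeleton the training scripts expect.
mkdir -p PaDT/dataset/coco/annotations PaDT/dataset/RefCOCO

# Unpack each RefCOCO archive (from the wget step) into place,
# skipping any archive that has not been downloaded yet.
for z in refcoco refcoco+ refcocog; do
  if [ -f "${z}.zip" ]; then
    unzip -q "${z}.zip" -d PaDT/dataset/RefCOCO/
  fi
done
```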
219
+
220
+ Preprocess the datasets:
221
+ 1. Preprocess via our scripts (first update the dataset path configuration in the preprocessing scripts):
222
+ ```bash
223
+ cd src/preprocess
224
+ python process_coco.py
225
+ python process_refcoco.py
226
+ ```
227
+ 2. Alternatively, use the preprocessed datasets we released on Hugging Face, which are ready for training:
228
+
229
+ | Dataset | Dataset Path | Task Type |
230
+ | --- | --- | --- |
231
+ | COCO | [PaDT-MLLM/COCO](https://huggingface.co/datasets/PaDT-MLLM/COCO) | Open Vocabulary Detection |
232
+ | RefCOCO | [PaDT-MLLM/RefCOCO](https://huggingface.co/datasets/PaDT-MLLM/RefCOCO) | Referring Expression Comprehension/Segmentation |
233
+ | RIC | [PaDT-MLLM/ReferringImageCaptioning](https://huggingface.co/datasets/PaDT-MLLM/ReferringImageCaptioning) | Referring Image Captioning |
234
+
235
+
236
+ The training scripts in `run_scripts` are ready to execute.
237
+
238
+ For example, to train the PaDT-Pro 3B model on a single node with 8×96 GB GPUs:
239
+
240
+ ```bash
241
+ bash ./run_scripts/padt_pro_3b_sft.sh
242
+ ```
243
+
244
+ ## Evaluation
245
+
246
+ We provide a simple inference example in `eval/test_demo.py`. More evaluation scripts will be added soon.
247
+
248
  ## License Agreement
249
 
250
  PaDT is licensed under Apache 2.0.
 
263
  primaryClass={cs.CV},
264
  url={https://arxiv.org/abs/2510.01954},
265
  }
266
+ ```