Improve metadata with `any-to-any` pipeline tag and `transformers` library name
#1
by nielsr (HF Staff) - opened

README.md CHANGED

@@ -1,12 +1,13 @@
---
-
+base_model:
+- Qwen/Qwen2.5-VL-3B-Instruct
+- Qwen/Qwen2.5-VL-7B-Instruct
language:
- en
- zh
-
-
-
-pipeline_tag: image-text-to-text
+license: apache-2.0
+pipeline_tag: any-to-any
+library_name: transformers
---

<div align='center'><h1>Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs</h1></div>

@@ -30,13 +31,21 @@ By introducing VRTs, we achieve **semantic reasoning and object-specific visual

As illustrated in Figure C, we have validated PaDT across four major visual perception and understanding tasks. In all cases, PaDT achieves **state-of-the-art** performance compared to conventional character-by-character coordinate-generation MLLMs.

-
-
-
-
+### Why Does PaDT Succeed?
+
+The success of PaDT stems from its deep insight into the visual capability bottlenecks of MLLMs.
+
+1. **Native Vision-Language Alignment**: Instead of "fitting" vision into text space, PaDT directly treats visual patches as decodable tokens, achieving seamless modality alignment (a toy sketch of this idea follows the figure below).
+2. **Dynamic Visual Binding**: A dynamic embedding mechanism tightly binds Visual Reference Tokens (VRTs) to each image, preventing cross-image confusion.
+3. **Unified Token Space**: Enables the LLM to handle language and vision uniformly, simplifying training and improving consistency.
+4. **Lightweight Decoder**: Decouples dense prediction from the LLM, preserving its semantic reasoning while adding precise spatial output capability.
+5. **Strong Multi-Task Generalization**: The PaDT Pro model, jointly trained on REC/RES/OVD/RIC, can switch tasks via prompts and outperforms single-task models.
+
+We hope this work will **inspire further exploration** in the community:
+
+- What does true multimodal reasoning look like?
+- Is a purely text-based output ever sufficient for visual reasoning?

<div align="center">
<img src="./assets/Motivation.webp" width="900"/>
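As a toy sketch of the patch-as-token idea described above (not the repository's actual implementation; every name and size here is an illustrative assumption): the decoding vocabulary is extended so that, alongside text tokens, the LM head also scores the current image's patch embeddings, making a patch directly decodable like a word.

```python
import torch

# Toy sketch only: decode over a joint vocabulary of text tokens and the
# current image's visual patches. Names and sizes are illustrative assumptions.
hidden_size, text_vocab_size, num_patches = 64, 1000, 196

hidden_state = torch.randn(1, hidden_size)                   # LLM hidden state at the next position
text_out_embed = torch.randn(text_vocab_size, hidden_size)   # text output embeddings (LM head)
patch_embed = torch.randn(num_patches, hidden_size)          # this image's patch features (the VRT embeddings)

# One score vector over [text tokens ; image patches]: a patch id is decoded like a word.
logits = torch.cat([hidden_state @ text_out_embed.T, hidden_state @ patch_embed.T], dim=-1)
next_id = int(logits.argmax(dim=-1))
if next_id >= text_vocab_size:
    print(f"decoded visual patch #{next_id - text_vocab_size} of this image")
else:
    print(f"decoded text token id {next_id}")
```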

@@ -62,7 +71,7 @@ conda activate PaDT
bash setup.sh
```

-The following contains a code snippet illustrating how to use our PaDT.
+The following code snippet illustrates how to use PaDT; more details are in [eval/test_demo.py](eval/test_demo.py).

```python
import torch

@@ -84,7 +93,7 @@ processor = VisonTextProcessingClass(processor, model.config.vision_config.spati
processor.prepare(model.model.embed_tokens.weight.shape[0])

# question prompt
-PROMPT = "Please
+PROMPT = """Please carefully check the image and detect the object this sentence describes: "The car is on the left side of the horse"."""

# construct conversation
message = [

@@ -122,7 +131,8 @@ with torch.inference_mode():
# extract Visual Reference Tokens within the sequence
completions, feats, labels, vrts, vrts_feats = parseVRTintoCompletion(processor, completion_ids, generate_returned_result['hidden_states'], torch.Tensor([False]))

-print("
+print("\ngenerate result:", completions[0])

# decode low-level visual task results
low_res_image_embeds = generate_returned_result.past_image_embeds

@@ -130,7 +140,10 @@ with torch.inference_mode():
visual_pe = generate_returned_result.past_visual_pe
decoded_list = model.vl_decode(feats, low_res_image_embeds, high_res_image_embeds, prompt_inputs['image_grid_thw'], visual_pe)

-print(f"
+print(f"\npred_bboxes: {decoded_list['pred_boxes']},\npred_scores: {decoded_list['pred_score'].sigmoid()}\n")
```
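A minimal sketch of filtering the decoded outputs above, continuing the snippet: it assumes `decoded_list['pred_boxes']` can be viewed as (N, 4) and `decoded_list['pred_score']` holds raw logits as in the snippet; the 0.5 threshold is an arbitrary illustrative choice.

```python
# Keep only confident detections; shapes and threshold are assumptions for illustration.
scores = decoded_list['pred_score'].sigmoid().flatten()
boxes = decoded_list['pred_boxes'].reshape(-1, 4)
keep = scores > 0.5
for box, score in zip(boxes[keep].tolist(), scores[keep].tolist()):
    print(f"box={box}, score={score:.3f}")
```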

## Models

@@ -174,6 +187,64 @@ Here are some randomly selected test examples showcasing PaDT's excellent perf
<img src="./assets/TAM.webp" width="900"/>
</div>

+## Training Instruction
+
+Download Datasets:
+
+- [COCO](https://cocodataset.org/#home)
+
+- RefCOCO/+/g
+```bash
+wget https://web.archive.org/web/20220413011718/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcoco.zip
+wget https://web.archive.org/web/20220413011656/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcoco+.zip
+wget https://web.archive.org/web/20220413012904/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcocog.zip
+```
+
+Unpack these datasets and place them under the following directory:
+
+```
+PaDT/
+├── dataset/
+│   ├── coco/
+│   │   ├── annotations/
+│   │   ├── train2014/
+│   │   ├── train2017/
+│   │   ├── val2014/
+│   │   └── val2017/
+│   └── RefCOCO/
+│       ├── refcoco/
+│       ├── refcoco+/
+│       └── refcocog/
+```
+
+Preprocess the datasets:
+- 1. Preprocess via our scripts. (Please first update the dataset path configuration in the preprocessing scripts.)
+```bash
+cd src/preprocess
+python process_coco.py
+python process_refcoco.py
+```
+- 2. We have also released the preprocessed datasets on Hugging Face, ready to use for training (see the loading sketch after the table).
+
+| Dataset | Dataset Path | Task Type |
+| - | - | - |
+| COCO | [PaDT-MLLM/COCO](https://huggingface.co/datasets/PaDT-MLLM/COCO) | Open Vocabulary Detection |
+| RefCOCO | [PaDT-MLLM/RefCOCO](https://huggingface.co/datasets/PaDT-MLLM/RefCOCO) | Referring Expression Comprehension/Segmentation |
+| RIC | [PaDT-MLLM/ReferringImageCaptioning](https://huggingface.co/datasets/PaDT-MLLM/ReferringImageCaptioning) | Referring Image Captioning |
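A minimal sketch of pulling one of the preprocessed datasets above with the `datasets` library; the split and column names are assumptions, so check each dataset card before wiring it into training.

```python
from datasets import load_dataset

# Illustrative only: the available splits and columns depend on each dataset card.
ds = load_dataset("PaDT-MLLM/RefCOCO", split="train")
print(ds)       # number of rows and column names
print(ds[0])    # inspect one example before training
```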

+The training scripts in `run_scripts` are ready to execute.
+
+For example, to train the PaDT-Pro 3B model on a single node with 8×96 GB GPUs:
+
+```bash
+bash ./run_scripts/padt_pro_3b_sft.sh
+```
+
+## Evaluation
+
+We provide a simple inference example in `eval/test_demo.py`. More evaluation scripts will be added soon.
+

## License Agreement

PaDT is licensed under Apache 2.0.

@@ -192,4 +263,4 @@ We kindly encourage citation of our work if you find it useful.
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.01954},
}
-```
+```