---
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-text-to-text
---
<a href="" target="_blank">
  <img alt="arXiv" src="https://img.shields.io/badge/arXiv-SpatialLadder-red?logo=arxiv" height="20" />
</a>
<a href="" target="_blank">
  <img alt="Website" src="https://img.shields.io/badge/🌎_Website-SpatialLadder-blue.svg" height="20" />
</a>
<a href="https://github.com/ZJU-REAL/SpatialLadder" target="_blank">
  <img alt="Code" src="https://img.shields.io/badge/Code-SpatialLadder-white?logo=github" height="20" />
</a>
<a href="" target="_blank">
  <img alt="Data" src="https://img.shields.io/badge/%F0%9F%A4%97%20_Data-SpatialLadder--26k-ffc107?color=ffc107&logoColor=white" height="20" />
</a>
<a href="" target="_blank">
  <img alt="Bench" src="https://img.shields.io/badge/%F0%9F%A4%97%20_Bench-SPBench-ffc107?color=ffc107&logoColor=white" height="20" />
</a>

# SpatialLadder-3B

This repository contains the SpatialLadder-3B model, introduced in [SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models]().

## Model Description

## Usage

First, install the required dependencies:
```bash
pip install transformers==4.49.0 qwen-vl-utils
```
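The example below enables `flash_attention_2`, which additionally requires the optional `flash-attn` package (the command below assumes a CUDA build environment; if you skip it, also drop the `attn_implementation` argument to fall back to the default attention):

```bash
pip install flash-attn --no-build-isolation
```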
Then run the model for single-image inference:

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# We recommend enabling flash_attention_2 for better acceleration and memory
# saving, especially in multi-image and video scenarios.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "hongxingli/SpatialLadder-3B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

processor = AutoProcessor.from_pretrained("hongxingli/SpatialLadder-3B")
image_path = ''  # path to your input image
instruction = ''  # your question about the image

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": image_path,
            },
            {"type": "text", "text": instruction},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
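The loading comment above also mentions video scenarios. As a minimal sketch of the video input format that `qwen_vl_utils` supports for the Qwen2.5-VL family (the path, fps value, and question below are placeholders, not values from this card), only the `messages` structure changes; the preparation and generation code stays the same:

```python
# Hypothetical video-input message following the Qwen2.5-VL convention that
# qwen_vl_utils supports; the path, fps, and question are placeholders.
video_messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video.mp4",  # placeholder path
                "fps": 1.0,  # frame sampling rate used by qwen_vl_utils
            },
            {"type": "text", "text": "How many chairs are in the room?"},
        ],
    }
]
# Reuse apply_chat_template, process_vision_info, processor(...), and
# model.generate from the example above with video_messages in place of messages.
```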
## Training

The training code and usage guidelines are available in our [GitHub repository](https://github.com/ZJU-REAL/SpatialLadder). For comprehensive details, please refer to our paper and the repository documentation.

## Citation