base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-text-to-text
---

<a href="" target="_blank">
  <img alt="arXiv" src="https://img.shields.io/badge/arXiv-SpatialLadder-red?logo=arxiv" height="20" />
</a>
<a href="" target="_blank">
  <img alt="Website" src="https://img.shields.io/badge/🌎_Website-SpatialLadder-blue.svg" height="20" />
</a>
<a href="https://github.com/ZJU-REAL/SpatialLadder" target="_blank">
  <img alt="Code" src="https://img.shields.io/badge/Code-SpatialLadder-white?logo=github" height="20" />
</a>
<a href="" target="_blank">
  <img alt="Data" src="https://img.shields.io/badge/%F0%9F%A4%97%20_Data-SpatialLadder--26k-ffc107?color=ffc107&logoColor=white" height="20" />
</a>
<a href="" target="_blank">
  <img alt="Bench" src="https://img.shields.io/badge/%F0%9F%A4%97%20_Bench-SPBench-ffc107?color=ffc107&logoColor=white" height="20" />
</a>

# SpatialLadder-3B

This repository contains the SpatialLadder-3B model, introduced in [SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models]().

## Model Description

## Usage

First, install the required dependencies:

```bash
pip install transformers==4.49.0 qwen-vl-utils
```

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# We recommend enabling flash_attention_2 for better acceleration and memory
# saving, especially in multi-image and video scenarios.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "hongxingli/SpatialLadder-3B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

processor = AutoProcessor.from_pretrained("hongxingli/SpatialLadder-3B")
image_path = ''   # path to your input image
instruction = ''  # your question about the image

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": image_path,
            },
            {"type": "text", "text": instruction},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: generate the output, then strip the prompt tokens from each
# sequence before decoding so only the model's answer remains.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
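
The comment in the code above notes that flash attention matters most in multi-image and video scenarios. For video input, qwen-vl-utils accepts the same message schema with a `video` content entry instead of an `image` one; below is a minimal sketch, in which the file path, `fps` value, and prompt text are placeholders rather than values from this model card:

```python
# Hedged sketch: a video message in the qwen-vl-utils schema.
# The path, fps, and prompt are placeholders; substitute your own.
video_messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/video.mp4", "fps": 1.0},
            {"type": "text", "text": "Describe the spatial layout of the scene."},
        ],
    }
]
```

Passing `video_messages` to `process_vision_info` would then populate `video_inputs`, which feeds into the processor exactly as in the image example above.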

## Training

The training code and usage guidelines are available in our [GitHub repository](https://github.com/ZJU-REAL/SpatialLadder). For comprehensive details, please refer to our paper and the repository documentation.

## Citation