BBBBCHAN committed on
Commit d296cf5 · verified · 1 Parent(s): a07fe85

Update README.md

Files changed (1)
  1. README.md +175 -3
README.md CHANGED
@@ -1,3 +1,175 @@
1
- ---
2
- license: cc-by-nc-4.0
3
- ---
1
+ ---
2
+ base_model:
3
+ - google/siglip-so400m-patch14-384
4
+ - Qwen/Qwen2.5-7B-Instruct
5
+ - Qwen/Qwen2.5-VL-7B-Instruct
6
+ datasets:
7
+ - lmms-lab/LLaVA-Video-178K
8
+ - DAMO-NLP-SG/VideoRefer-700K
9
+ language:
10
+ - en
11
+ - zh
12
+ library_name: transformers
13
+ license: cc-by-nc-4.0
14
+ metrics:
15
+ - accuracy
16
+ pipeline_tag: video-text-to-text
17
+ tags:
18
+ - video-understanding
19
+ - multimodal
20
+ - SWIM
21
+ - Qwen2.5-VL
22
+ - fine-grained-understanding
23
+ model-index:
24
+ - name: SWIM-7B
25
+   results:
26
+   - task:
27
+       type: multimodal
28
+     dataset:
29
+       name: VideoRefer-Q
30
+       type: VideoRefer-Q
31
+     metrics:
32
+     - type: accuracy
33
+       value: 78.3
34
+       name: accuracy
35
+       verified: true
36
+   - task:
37
+       type: multimodal
38
+     dataset:
39
+       name: VideoRefer-D
40
+       type: VideoRefer-D
41
+     metrics:
42
+     - type: accuracy
43
+       value: 3.78
44
+       name: accuracy
45
+       verified: true
46
+   - task:
47
+       type: multimodal
48
+     dataset:
49
+       name: MVBench
50
+       type: mvbench
51
+     metrics:
52
+     - type: accuracy
53
+       value: 62.1
54
+       name: accuracy
55
+       verified: true
56
+   - task:
57
+       type: multimodal
58
+     dataset:
59
+       name: VideoMME
60
+       type: videomme
61
+     metrics:
62
+     - type: accuracy
63
+       value: 55.9
64
+       name: accuracy
65
+       verified: true
66
+   - task:
67
+       type: multimodal
68
+     dataset:
69
+       name: ActivityNetQA
70
+       type: ActivityNetQA
71
+     metrics:
72
+     - type: accuracy
73
+       value: 55.6
74
+       name: accuracy
75
+       verified: true
76
+
77
+ ---
78
+
79
+ # SWIM-7B
80
+
81
+ This repository contains the baseline model for [See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding](https://huggingface.co/papers/2506.21862).
82
+
83
+ Code: https://github.com/HumanMLLM/
84
+
85
+ ## Model Summary
86
+ This repository contains the baseline model SWIM-7B.
87
+ This model is fine-tuned from [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct), with the [SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384) vision encoder and the [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) large language model.
88
+
89
+ SWIM shares the same architecture as Qwen2.5-VL, so you can replace "Qwen/Qwen2.5-VL-7B-Instruct" with "BBBBCHAN/SWIM-7B" in existing code to get fine-grained object understanding with natural language.
90
+
91
+ ## Quick Start
92
+ Here we provide a quick-start script for SWIM-7B, adapted from Qwen2.5-VL.
93
+ ```python
94
+ import torch  # needed for torch.bfloat16 in the optional flash_attention_2 variant below
+ from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
95
+ from qwen_vl_utils import process_vision_info
96
+
97
+ # default: Load the model on the available device(s)
98
+ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
99
+     "BBBBCHAN/SWIM-7B", torch_dtype="auto", device_map="auto"
100
+ )
101
+
102
+ # We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
103
+ # model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
104
+ #     "BBBBCHAN/SWIM-7B",
105
+ #     torch_dtype=torch.bfloat16,
106
+ #     attn_implementation="flash_attention_2",
107
+ #     device_map="auto",
108
+ # )
109
+
110
+ # default processor
111
+ processor = AutoProcessor.from_pretrained("BBBBCHAN/SWIM-7B")
112
+
113
+ # The default range for the number of visual tokens per image in the model is 4-16384.
114
+ # You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
115
+ # min_pixels = 256*28*28
116
+ # max_pixels = 1280*28*28
117
+ # processor = AutoProcessor.from_pretrained("BBBBCHAN/SWIM-7B", min_pixels=min_pixels, max_pixels=max_pixels)
118
+
119
+
120
+ # Messages containing a local video path and a text query
121
+ messages = [
122
+     {
123
+         "role": "user",
124
+         "content": [
125
+             {
126
+                 "type": "video",
127
+                 "video": "file:///path/to/video1.mp4",
128
+                 "max_pixels": 360 * 420,
129
+                 "fps": 1.0,
130
+             },
131
+             {"type": "text", "text": "Describe this video."},
132
+         ],
133
+     }
134
+ ]
135
+
136
+ # In Qwen2.5-VL, frame rate information is also passed to the model to align with absolute time.
137
+ # Preparation for inference
138
+ text = processor.apply_chat_template(
139
+     messages, tokenize=False, add_generation_prompt=True
140
+ )
141
+ image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
142
+ inputs = processor(
143
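+     text=[text],
144
+     images=image_inputs,
145
+     videos=video_inputs,
146
+     # fps is supplied through video_kwargs (returned by process_vision_info); a separate fps= argument would reference an undefined name
147
+     padding=True,
148
+     return_tensors="pt",
149
+     **video_kwargs,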
150
+ )
151
+ inputs = inputs.to(model.device)
152
+
153
+ # Inference
154
+ generated_ids = model.generate(**inputs, max_new_tokens=128)
155
+ generated_ids_trimmed = [
156
+     out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
157
+ ]
158
+ output_text = processor.batch_decode(
159
+     generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
160
+ )
161
+ print(output_text)
162
+ ```
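
The message structure used in the script above can be wrapped in a small helper when several videos or prompts are queried. The following is a minimal sketch that assumes the same keys as the script (`type`, `video`, `max_pixels`, `fps`); `build_video_message` is a hypothetical name introduced here for illustration and is not part of `transformers` or `qwen_vl_utils`.

```python
from typing import Any


def build_video_message(
    video_path: str,
    prompt: str,
    fps: float = 1.0,
    max_pixels: int = 360 * 420,
) -> list[dict[str, Any]]:
    """Return a single-turn message list with one video and one text query."""
    # Assumption: local paths are absolute and use the file:// scheme, as in
    # the Quick Start script; inputs that already carry a scheme are kept as-is.
    video_uri = video_path if "://" in video_path else f"file://{video_path}"
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "video",
                    "video": video_uri,
                    "max_pixels": max_pixels,
                    "fps": fps,
                },
                {"type": "text", "text": prompt},
            ],
        }
    ]


messages = build_video_message("/path/to/video1.mp4", "Describe this video.")
print(messages[0]["content"][0]["video"])  # file:///path/to/video1.mp4
```

The returned list can be passed to `processor.apply_chat_template` and `process_vision_info` in place of the hand-written `messages` literal.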
163
+
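
The `min_pixels` and `max_pixels` comments in the script suggest that one visual token covers a 28x28 pixel area, so a token range converts to a pixel budget by multiplication. The sketch below works through that arithmetic; the helper names `pixel_budget` and `approx_tokens` are illustrative assumptions, and the real token count depends on how the processor resizes each image or frame.

```python
# Assumption: one visual token corresponds to a 28x28 pixel area, as implied
# by the `256*28*28` and `1280*28*28` comments in the Quick Start script.
PIXELS_PER_TOKEN = 28 * 28


def pixel_budget(min_tokens: int, max_tokens: int) -> tuple[int, int]:
    """Convert a visual-token range into (min_pixels, max_pixels)."""
    if min_tokens <= 0 or max_tokens < min_tokens:
        raise ValueError("expected 0 < min_tokens <= max_tokens")
    return min_tokens * PIXELS_PER_TOKEN, max_tokens * PIXELS_PER_TOKEN


def approx_tokens(pixels: int) -> int:
    """Rough estimate of the visual tokens that fit in a pixel budget."""
    return pixels // PIXELS_PER_TOKEN


min_pixels, max_pixels = pixel_budget(256, 1280)
print(min_pixels, max_pixels)    # 200704 1003520
print(approx_tokens(360 * 420))  # 192
```

Under this assumption, the `"max_pixels": 360 * 420` setting in the video message corresponds to roughly 192 tokens per frame before any further processing.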
164
+ ## Citation
165
+
166
+ If you find our repo useful for your research, please consider citing our paper:
167
+
168
+ ```bibtex
169
+ @article{sun2025see,
170
+ title={See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding},
171
+ author={Sun, Boyuan and Yin, Bowen and Li, Yuanming and Wei, Xihan and Hou, Qibin},
172
+ journal={arXiv preprint arXiv:xxxx},
173
+ year={2025}
174
+ }
175
+ ```