+---
+base_model:
+- google/siglip-so400m-patch14-384
+- Qwen/Qwen2.5-7B-Instruct
+- Qwen/Qwen2.5-VL-7B-Instruct
+datasets:
+- lmms-lab/LLaVA-Video-178K
+- DAMO-NLP-SG/VideoRefer-700K
+language:
+- en
+- zh
+library_name: transformers
+license: cc-by-nc-4.0
+metrics:
+- accuracy
+pipeline_tag: video-text-to-text
+tags:
+- video-understanding
+- multimodal
+- SWIM
+- Qwen2.5-VL
+- fine-grained-understanding
+model-index:
+- name: SWIM-7B
+  results:
+  - task:
+      type: multimodal
+    dataset:
+      name: VideoRefer-Q
+      type: VideoRefer-Q
+    metrics:
+    - type: accuracy
+      value: 78.3
+      name: accuracy
+      verified: true
+  - task:
+      type: multimodal
+    dataset:
+      name: VideoRefer-D
+      type: VideoRefer-D
+    metrics:
+    - type: accuracy
+      value: 3.78
+      name: accuracy
+      verified: true
+  - task:
+      type: multimodal
+    dataset:
+      name: MVBench
+      type: mvbench
+    metrics:
+    - type: accuracy
+      value: 62.1
+      name: accuracy
+      verified: true
+  - task:
+      type: multimodal
+    dataset:
+      name: VideoMME
+      type: videomme
+    metrics:
+    - type: accuracy
+      value: 55.9
+      name: accuracy
+      verified: true
+  - task:
+      type: multimodal
+    dataset:
+      name: ActivityNetQA
+      type: ActivityNetQA
+    metrics:
+    - type: accuracy
+      value: 55.6
+      name: accuracy
+      verified: true
+---
+# SWIM-7B
+This repository contains the baseline model for [See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding](https://huggingface.co/papers/2506.21862).
+Code: https://github.com/HumanMLLM/
+## Model Summary
+This repository contains the baseline model SWIM-7B.
+This model is fine-tuned from the [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) model, with a [SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384) vision encoder and the [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) large language model.
+SWIM shares the same architecture as Qwen2.5-VL. You can directly replace "Qwen/Qwen2.5-VL-7B-Instruct" with "BBBBCHAN/SWIM-7B" in existing code to get fine-grained object understanding with natural language.
+## Quick Start
+Here we provide a quick-start script for SWIM-7B, adapted from Qwen2.5-VL.
+```python
+import torch  # only needed for the optional flash_attention_2 variant below
+from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
+from qwen_vl_utils import process_vision_info
+# default: Load the model on the available device(s)
+model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+    "BBBBCHAN/SWIM-7B", torch_dtype="auto", device_map="auto"
+)
+# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
+# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+#     "BBBBCHAN/SWIM-7B",
+#     torch_dtype=torch.bfloat16,
+#     attn_implementation="flash_attention_2",
+#     device_map="auto",
+# )
+# default processor
+processor = AutoProcessor.from_pretrained("BBBBCHAN/SWIM-7B")
+# The default range for the number of visual tokens per image in the model is 4-16384.
+# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
+# min_pixels = 256*28*28
+# max_pixels = 1280*28*28
+# processor = AutoProcessor.from_pretrained("BBBBCHAN/SWIM-7B", min_pixels=min_pixels, max_pixels=max_pixels)
+# Messages containing a local video path and a text query
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {
+                "type": "video",
+                "video": "file:///path/to/video1.mp4",
+                "max_pixels": 360 * 420,
+                "fps": 1.0,
+            },
+            {"type": "text", "text": "Describe this video."},
+        ],
+    }
+]
+# In Qwen2.5-VL, frame rate information is also fed to the model to align with absolute time.
+# Preparation for inference
+text = processor.apply_chat_template(
+    messages, tokenize=False, add_generation_prompt=True
+)
+image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
+inputs = processor(
+    text=[text],
+    images=image_inputs,
+    videos=video_inputs,
+    # fps is passed through video_kwargs, which process_vision_info returned above
+    padding=True,
+    return_tensors="pt",
+    **video_kwargs,
+)
+inputs = inputs.to(model.device)  # follow the device chosen by device_map="auto"
+# Inference
+generated_ids = model.generate(**inputs, max_new_tokens=128)
+generated_ids_trimmed = [
+    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+]
+output_text = processor.batch_decode(
+    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+)
+print(output_text)
+```
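The `min_pixels` / `max_pixels` comments in the script above express a visual-token range as a pixel budget, at `28*28` pixels per token. The arithmetic can be sketched as follows. The helper names `pixel_budget` and `approx_visual_tokens` are hypothetical and not part of `transformers` or `qwen_vl_utils`, and the token estimate is only a rough floor-division approximation: the real processor also rounds frame sizes, so exact counts may differ.

```python
# Rough sketch of the pixel-budget arithmetic used in the Quick Start comments.
# Assumption (taken from those comments): one visual token covers 28 x 28 pixels.
PIXELS_PER_TOKEN = 28 * 28


def pixel_budget(num_tokens: int) -> int:
    """Pixel budget corresponding to a given number of visual tokens (hypothetical helper)."""
    return num_tokens * PIXELS_PER_TOKEN


def approx_visual_tokens(num_pixels: int) -> int:
    """Approximate token count for a pixel budget; ignores the processor's own rounding."""
    return num_pixels // PIXELS_PER_TOKEN


# Token range 256-1280, as suggested in the Quick Start comments.
min_pixels = pixel_budget(256)
max_pixels = pixel_budget(1280)
print(min_pixels, max_pixels)

# Approximate per-frame token count for the "max_pixels": 360 * 420 video setting.
print(approx_visual_tokens(360 * 420))
```

The resulting `min_pixels` and `max_pixels` values are the ones that would be passed to `AutoProcessor.from_pretrained` in the commented-out line of the script above.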
+## Citation
+If you find our repo useful for your research, please consider citing our paper:
+```bibtex
+@article{sun2025see,
+  title={See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding},
+  author={Sun, Boyuan and Yin, Bowen and Li, Yuanming and Wei, Xihan and Hou, Qibin},
+  journal={arXiv preprint arXiv:xxxx},
+  year={2025}
+}
+```