	---
	base_model:
	- Qwen/Qwen2.5-VL-7B-Instruct
	- google/siglip-so400m-patch14-384
	- Qwen/Qwen2.5-7B-Instruct
	datasets:
	- lmms-lab/LLaVA-Video-178K
	- DAMO-NLP-SG/VideoRefer-700K
	- BBBBCHAN/NL-Refer
	language:
	- en
	- zh
	library_name: transformers
	license: cc-by-nc-4.0
	metrics:
	- accuracy
	pipeline_tag: video-text-to-text
	tags:
	- video-understanding
	- multimodal
	- SWIM
	- Qwen2.5-VL
	- fine-grained-understanding
	model-index:
	- name: SWIM-7B
	  results:
	  - task:
	      type: multimodal
	    dataset:
	      name: VideoRefer-Q
	      type: VideoRefer-Q
	    metrics:
	    - type: accuracy
	      value: 78.3
	      name: accuracy
	      verified: true
	  - task:
	      type: multimodal
	    dataset:
	      name: VideoRefer-D
	      type: VideoRefer-D
	    metrics:
	    - type: accuracy
	      value: 3.78
	      name: accuracy
	      verified: true
	  - task:
	      type: multimodal
	    dataset:
	      name: MVBench
	      type: mvbench
	    metrics:
	    - type: accuracy
	      value: 62.1
	      name: accuracy
	      verified: true
	  - task:
	      type: multimodal
	    dataset:
	      name: VideoMME
	      type: videomme
	    metrics:
	    - type: accuracy
	      value: 55.9
	      name: accuracy
	      verified: true
	  - task:
	      type: multimodal
	    dataset:
	      name: ActivityNetQA
	      type: ActivityNetQA
	    metrics:
	    - type: accuracy
	      value: 55.6
	      name: accuracy
	      verified: true
	---

	# SWIM-7B

	[Paper](https://arxiv.org/abs/2605.18018) \| [GitHub](https://github.com/HumanMLLM/SWIM) \| [NL-Refer Dataset](https://huggingface.co/datasets/BBBBCHAN/NL-Refer)

	This repository contains the baseline model for [See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding](https://arxiv.org/abs/2605.18018).

	## Model Summary
	SWIM-7B is fine-tuned from the [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) model, with the [SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384) vision encoder and the [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) large language model.

	SWIM shares the same architecture as Qwen2.5-VL, so you can replace "Qwen/Qwen2.5-VL-7B-Instruct" with "BBBBCHAN/SWIM-7B" in existing Qwen2.5-VL code to get fine-grained object understanding from natural-language queries.
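
Because the chat message format carries over from Qwen2.5-VL unchanged, a natural-language object query can be packaged as an ordinary video-plus-text message. The sketch below is illustrative only: the helper name `build_object_query`, its defaults, and the example question are assumptions rather than part of the SWIM codebase, and the exact prompt wording the model was trained on is not specified here.

```python
def build_object_query(video_path, question, fps=1.0, max_pixels=360 * 420):
    """Build a single-turn message list asking about an object in a video.

    Hypothetical helper: the dict layout follows the Qwen2.5-VL message
    format; the function name and default values are illustrative.
    """
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "video",
                    "video": video_path,
                    "max_pixels": max_pixels,  # per-frame pixel cap
                    "fps": fps,                # frame sampling rate
                },
                {"type": "text", "text": question},
            ],
        }
    ]


# Refer to the object in plain language instead of a mask or box.
messages = build_object_query(
    "file:///path/to/video1.mp4",
    "What is the person in the red jacket doing?",
)
```

The resulting `messages` list can be passed to `processor.apply_chat_template` and `process_vision_info` exactly as with Qwen2.5-VL.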

	## Quick Start
	Here is a quick-start script for SWIM-7B, adapted from Qwen2.5-VL.
	```python
	import torch  # only used by the optional bfloat16 / flash_attention_2 variant below
	from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
	from qwen_vl_utils import process_vision_info

	# default: Load the model on the available device(s)
	model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
	    "BBBBCHAN/SWIM-7B", torch_dtype="auto", device_map="auto"
	)

	# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
	# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
	#     "BBBBCHAN/SWIM-7B",
	#     torch_dtype=torch.bfloat16,
	#     attn_implementation="flash_attention_2",
	#     device_map="auto",
	# )

	# default processor
	processor = AutoProcessor.from_pretrained("BBBBCHAN/SWIM-7B")

	# The default range for the number of visual tokens per image in the model is 4-16384.
	# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
	# min_pixels = 256 * 28 * 28
	# max_pixels = 1280 * 28 * 28
	# processor = AutoProcessor.from_pretrained("BBBBCHAN/SWIM-7B", min_pixels=min_pixels, max_pixels=max_pixels)


	# Messages containing a local video path and a text query
	messages = [
	    {
	        "role": "user",
	        "content": [
	            {
	                "type": "video",
	                "video": "file:///path/to/video1.mp4",
	                "max_pixels": 360 * 420,
	                "fps": 1.0,
	            },
	            {"type": "text", "text": "Describe this video."},
	        ],
	    }
	]

	# In Qwen2.5-VL, frame rate information is also passed to the model to align with absolute time.
	# process_vision_info returns it inside video_kwargs.
	# Preparation for inference
	text = processor.apply_chat_template(
	    messages, tokenize=False, add_generation_prompt=True
	)
	image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
	inputs = processor(
	    text=[text],
	    images=image_inputs,
	    videos=video_inputs,
	    padding=True,
	    return_tensors="pt",
	    **video_kwargs,  # already carries fps, so no separate fps= argument is passed
	)
	inputs = inputs.to(model.device)

	# Inference
	generated_ids = model.generate(**inputs, max_new_tokens=128)
	generated_ids_trimmed = [
	    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
	]
	output_text = processor.batch_decode(
	    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
	)
	print(output_text)
	```
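
The `min_pixels` / `max_pixels` settings mentioned in the script are pixel budgets derived from a visual-token range. As a rough sketch, assuming each visual token covers a 28 x 28 pixel area (the convention used in the Qwen2.5-VL documentation), a token range can be converted to pixel limits as follows. The helper name `pixel_budget` is hypothetical and not part of any library.

```python
# Assumption: one visual token corresponds to a 28 x 28 pixel area,
# following the Qwen2.5-VL convention.
PIXELS_PER_TOKEN = 28 * 28


def pixel_budget(min_tokens, max_tokens):
    """Convert a visual-token range into (min_pixels, max_pixels)."""
    return min_tokens * PIXELS_PER_TOKEN, max_tokens * PIXELS_PER_TOKEN


# The 256-1280 token range suggested for balancing performance and cost.
min_pixels, max_pixels = pixel_budget(256, 1280)
print(min_pixels, max_pixels)  # 200704 1003520

# The default 4-16384 token range.
default_min, default_max = pixel_budget(4, 16384)
print(default_min, default_max)  # 3136 12845056
```

These values can then be passed as `min_pixels` and `max_pixels` to `AutoProcessor.from_pretrained`, as in the commented-out lines of the script.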

	## Citation

	If you find our repo useful for your research, please consider citing our paper:

	```bibtex
	@inproceedings{sun2026swim,
	  title = {See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding},
	  author = {Sun, Boyuan and Yin, Bowen and Li, Yuanming and Wei, Xihan and Hou, Qibin},
	  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
	  year = {2026}
	}
	```