---
pipeline_tag: image-text-to-text
library_name: transformers
base_model: Qwen/Qwen3-VL-4B-Instruct
tags:
- vision-language-tracking
- multimodal
- mllm
- video
---

# VPTracker: Global Vision-Language Tracking via Visual Prompt and MLLM

This repository contains the weights for **VPTracker**, the first global tracking framework based on Multimodal Large Language Models (MLLMs).

VPTracker exploits the powerful semantic reasoning of MLLMs to locate targets across the entire image space. To suppress distractions from visually or semantically similar objects during global search, it introduces a location-aware visual prompting mechanism that injects spatial priors into the model input.
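
One way to picture a location-aware visual prompt is to draw the target's last known box onto the frame before the image is handed to the MLLM, giving the model an explicit spatial prior. The sketch below is purely illustrative and is **not** the paper's implementation; the function name, box format, and drawing scheme are assumptions.

```python
import numpy as np

def draw_location_prompt(image: np.ndarray, box: tuple,
                         color=(255, 0, 0), thickness=2) -> np.ndarray:
    """Overlay a rectangular visual prompt (e.g. the target's last known
    location) onto an HxWx3 image as a spatial prior.
    `box` is (x1, y1, x2, y2) in pixel coordinates."""
    out = image.copy()
    x1, y1, x2, y2 = box
    out[y1:y1 + thickness, x1:x2] = color  # top edge
    out[y2 - thickness:y2, x1:x2] = color  # bottom edge
    out[y1:y2, x1:x1 + thickness] = color  # left edge
    out[y1:y2, x2 - thickness:x2] = color  # right edge
    return out

# Hypothetical usage: a blank 320x240 frame with a prior box drawn on it
frame = np.zeros((240, 320, 3), dtype=np.uint8)
prompted = draw_location_prompt(frame, (50, 60, 120, 140))
```

The original image is left untouched; only the copy carries the prompt, so the raw frame remains available for later steps.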

- **Paper:** [VPTracker: Global Vision-Language Tracking via Visual Prompt and MLLM](https://huggingface.co/papers/2512.22799)
- **Repository:** [GitHub - jcwang0602/VPTracker](https://github.com/jcwang0602/VPTracker)

[arXiv](https://arxiv.org/abs/2512.22799)
[Python](https://www.python.org/downloads/)
[PyTorch](https://pytorch.org/)
[Transformers](https://huggingface.co/docs/transformers/)

<!-- <img src="assets/VPTracker.jpg" width="800"> -->

## Quick Start

### Installation

```bash
conda create -n gltrack python=3.10
conda activate gltrack

# Assumes the ms-swift repository has been cloned into the current directory
cd ms-swift
conda install -c conda-forge pyarrow sentencepiece
pip install -e .
pip install "sglang[all]" -U
pip install "vllm>=0.5.1" "transformers<4.55" "trl<0.21" -U
pip install "lmdeploy>=0.5" -U
pip install autoawq -U --no-deps
pip install auto_gptq optimum bitsandbytes "gradio<5.33" -U
# Alternatively, install ms-swift directly from GitHub instead of the editable install above
pip install git+https://github.com/modelscope/ms-swift.git
pip install timm -U
pip install "deepspeed" -U
pip install flash-attn==2.7.4.post1 --no-build-isolation

# Video/audio decoding and utility dependencies
conda install av -c conda-forge
pip install qwen_vl_utils qwen_omni_utils decord librosa icecream soundfile -U
pip install liger_kernel nvitop pre-commit math_verify py-spy -U
```
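
MLLM-based trackers typically emit the target location as text rather than as a tensor. As a hedged illustration of the post-processing such a pipeline needs, the snippet below parses a response of an **assumed** form `(x1,y1),(x2,y2)` with coordinates normalized to [0, 1000] (a convention used by some Qwen-VL-style grounding outputs) into pixel coordinates; this is not VPTracker's actual output schema.

```python
import re

def parse_box(response: str, width: int, height: int):
    """Extract a bounding box from an MLLM text response of the assumed
    form '(x1,y1),(x2,y2)', where coordinates are normalized to
    [0, 1000], and map it to pixel coordinates for the given image size.
    Returns None when no box is found in the response."""
    m = re.search(r"\((\d+),\s*(\d+)\),\s*\((\d+),\s*(\d+)\)", response)
    if m is None:
        return None
    x1, y1, x2, y2 = (int(g) for g in m.groups())
    return (x1 * width // 1000, y1 * height // 1000,
            x2 * width // 1000, y2 * height // 1000)

# Hypothetical usage on a 640x480 frame
box = parse_box("The target is at (250,100),(750,900).", 640, 480)
```

Returning `None` on a parse failure lets the tracking loop fall back to the previous box instead of crashing on a malformed response.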

<!-- ## Visualization

<img src="assets/Results.jpg" width="800"> -->

## Acknowledgments

This code is built on top of [ms-swift](https://github.com/modelscope/ms-swift).

## Contact

Email: jcwang@stu.ecnu.edu.cn. Discussions of any kind are welcome!

---

## Citation

If you find our work useful for your research, please consider citing:

```bibtex
@misc{wang2025vptrackerglobalvisionlanguagetracking,
  title={VPTracker: Global Vision-Language Tracking via Visual Prompt and MLLM},
  author={Jingchao Wang and Kaiwen Zhou and Zhijian Wu and Kunhua Ji and Dingjiang Huang and Yefeng Zheng},
  year={2025},
  eprint={2512.22799},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.22799},
}
```