---
pipeline_tag: image-text-to-text
library_name: transformers
base_model: Qwen/Qwen3-VL-4B-Instruct
tags:
- vision-language-tracking
- multimodal
- mllm
- video
---

# VPTracker: Global Vision-Language Tracking via Visual Prompt and MLLM

This repository contains the weights for **VPTracker**, the first global tracking framework based on Multimodal Large Language Models (MLLMs).

VPTracker exploits the powerful semantic reasoning of MLLMs to locate targets across the entire image space. To suppress distractions from visually or semantically similar objects during global search, it introduces a location-aware visual prompting mechanism that injects spatial priors into the model input.
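
One way to picture a location-aware visual prompt is to draw the target's last known box onto the frame before the image is handed to the MLLM, giving the model an explicit spatial prior. The sketch below is purely illustrative and is **not** the paper's implementation; the function name, box format, and drawing scheme are assumptions.

```python
import numpy as np

def draw_location_prompt(image: np.ndarray, box: tuple,
                         color=(255, 0, 0), thickness=2) -> np.ndarray:
    """Overlay a rectangular visual prompt (e.g. the target's last known
    location) onto an HxWx3 image as a spatial prior.
    `box` is (x1, y1, x2, y2) in pixel coordinates."""
    out = image.copy()
    x1, y1, x2, y2 = box
    out[y1:y1 + thickness, x1:x2] = color  # top edge
    out[y2 - thickness:y2, x1:x2] = color  # bottom edge
    out[y1:y2, x1:x1 + thickness] = color  # left edge
    out[y1:y2, x2 - thickness:x2] = color  # right edge
    return out

# Hypothetical usage: a blank 320x240 frame with a prior box drawn on it
frame = np.zeros((240, 320, 3), dtype=np.uint8)
prompted = draw_location_prompt(frame, (50, 60, 120, 140))
```

The original image is left untouched; only the copy carries the prompt, so the raw frame remains available for later steps.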

- **Paper:** [VPTracker: Global Vision-Language Tracking via Visual Prompt and MLLM](https://huggingface.co/papers/2512.22799)
- **Repository:** [GitHub - jcwang0602/VPTracker](https://github.com/jcwang0602/VPTracker)

[arXiv](https://arxiv.org/abs/2512.22799)
[Python](https://www.python.org/downloads/)
[PyTorch](https://pytorch.org/)
[Transformers](https://huggingface.co/docs/transformers/)

<!-- <img src="assets/VPTracker.jpg" width="800"> -->

## Quick Start

### Installation

```bash
conda create -n gltrack python=3.10
conda activate gltrack

# Assumes the ms-swift repository has been cloned into the current directory
cd ms-swift
conda install -c conda-forge pyarrow sentencepiece
pip install -e .
pip install "sglang[all]" -U
pip install "vllm>=0.5.1" "transformers<4.55" "trl<0.21" -U
pip install "lmdeploy>=0.5" -U
pip install autoawq -U --no-deps
pip install auto_gptq optimum bitsandbytes "gradio<5.33" -U
# Alternatively, install ms-swift directly from GitHub instead of the editable install above
pip install git+https://github.com/modelscope/ms-swift.git
pip install timm -U
pip install "deepspeed" -U
pip install flash-attn==2.7.4.post1 --no-build-isolation

# Video/audio decoding and utility dependencies
conda install av -c conda-forge
pip install qwen_vl_utils qwen_omni_utils decord librosa icecream soundfile -U
pip install liger_kernel nvitop pre-commit math_verify py-spy -U
```
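
MLLM-based trackers typically emit the target location as text rather than as a tensor. As a hedged illustration of the post-processing such a pipeline needs, the snippet below parses a response of an **assumed** form `(x1,y1),(x2,y2)` with coordinates normalized to [0, 1000] (a convention used by some Qwen-VL-style grounding outputs) into pixel coordinates; this is not VPTracker's actual output schema.

```python
import re

def parse_box(response: str, width: int, height: int):
    """Extract a bounding box from an MLLM text response of the assumed
    form '(x1,y1),(x2,y2)', where coordinates are normalized to
    [0, 1000], and map it to pixel coordinates for the given image size.
    Returns None when no box is found in the response."""
    m = re.search(r"\((\d+),\s*(\d+)\),\s*\((\d+),\s*(\d+)\)", response)
    if m is None:
        return None
    x1, y1, x2, y2 = (int(g) for g in m.groups())
    return (x1 * width // 1000, y1 * height // 1000,
            x2 * width // 1000, y2 * height // 1000)

# Hypothetical usage on a 640x480 frame
box = parse_box("The target is at (250,100),(750,900).", 640, 480)
```

Returning `None` on a parse failure lets the tracking loop fall back to the previous box instead of crashing on a malformed response.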

<!-- ## Visualization

<img src="assets/Results.jpg" width="800"> -->

## Acknowledgments

This code is built on top of [ms-swift](https://github.com/modelscope/ms-swift).

## Contact

Email: jcwang@stu.ecnu.edu.cn. Discussions of any kind are welcome!

---

## Citation

If you find our work useful for your research, please consider citing:

```bibtex
@misc{wang2025vptrackerglobalvisionlanguagetracking,
  title={VPTracker: Global Vision-Language Tracking via Visual Prompt and MLLM},
  author={Jingchao Wang and Kaiwen Zhou and Zhijian Wu and Kunhua Ji and Dingjiang Huang and Yefeng Zheng},
  year={2025},
  eprint={2512.22799},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.22799},
}
```