---
license: bsd-3-clause
pipeline_tag: video-text-to-text
library_name: transformers
base_model: Qwen/Qwen2-VL-2B-Instruct
---
# VideoMind-2B-FT-QVHighlights
VideoMind is a multi-modal agent framework that enhances video reasoning by emulating human-like processes, such as breaking down tasks, localizing and verifying moments, and synthesizing answers.
This repository contains the fine-tuned LoRA adapter (specifically the Grounder role for temporal event localization) for the 2B version of the framework, trained on the QVHighlights dataset. It is based on the paper VideoMind: A Chain-of-LoRA Agent for Temporal-Grounded Video Reasoning.
## Model Details

### Model Description
- Model type: Multi-modal Large Language Model (LoRA Adapter)
- Base Model: Qwen/Qwen2-VL-2B-Instruct
- Role: Grounder (Temporal Event Localization)
- Language(s): English
- License: BSD-3-Clause
### Framework Overview
VideoMind identifies four essential capabilities for grounded video reasoning:
- Planner: coordinates the other roles and decomposes the task.
- Grounder: localizes temporal events in the video.
- Verifier: assesses candidate moments.
- Answerer: answers the question.
This specific checkpoint is the Grounder specialized for the QVHighlights task using the Chain-of-LoRA mechanism.
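As a minimal sketch of how such a role adapter could be attached to the frozen base model, assuming the standard `transformers` + `peft` LoRA workflow (the adapter repo id below is a placeholder, and the `span_to_seconds` post-processing helper is hypothetical, since this card does not specify the Grounder's output format):

```python
def load_role(role_adapter_id: str, base_id: str = "Qwen/Qwen2-VL-2B-Instruct"):
    """Attach one role's LoRA adapter (e.g. this Grounder) to the frozen base.

    Imports are deferred so the pure helper below works without these
    libraries installed.
    """
    from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
    from peft import PeftModel

    model = Qwen2VLForConditionalGeneration.from_pretrained(
        base_id, torch_dtype="auto"
    )
    # Only the small LoRA weights are loaded here; the base stays frozen.
    model = PeftModel.from_pretrained(model, role_adapter_id)
    processor = AutoProcessor.from_pretrained(base_id)
    return model, processor


def span_to_seconds(span: tuple[float, float], duration: float) -> tuple[float, float]:
    """Hypothetical helper: map a normalized (start, end) prediction in
    [0, 1] to seconds for a video of the given duration."""
    start, end = span
    return round(start * duration, 2), round(end * duration, 2)
```

Under the Chain-of-LoRA mechanism, a single base model would be reused across roles by swapping in the corresponding adapter, rather than keeping four full model copies in memory.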
## Citation

Please cite our paper if you find this project helpful.
```bibtex
@inproceedings{liu2026videomind,
  title={VideoMind: A Chain-of-LoRA Agent for Temporal-Grounded Video Reasoning},
  author={Liu, Ye and Lin, Kevin Qinghong and Chen, Chang Wen and Shou, Mike Zheng},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}
```