---
license: bsd-3-clause
pipeline_tag: video-text-to-text
library_name: transformers
base_model: Qwen/Qwen2-VL-2B-Instruct
---
# VideoMind-2B-FT-QVHighlights
VideoMind is a multi-modal agent framework that enhances video reasoning by emulating human-like processes, such as breaking down tasks, localizing and verifying moments, and synthesizing answers.
This repository contains the fine-tuned LoRA adapter (specifically the Grounder role for temporal event localization) for the 2B version of the framework, trained on the QVHighlights dataset. It is based on the paper VideoMind: A Chain-of-LoRA Agent for Temporal-Grounded Video Reasoning.
## Model Details

### Model Description
- Model type: Multi-modal Large Language Model (LoRA Adapter)
- Base Model: Qwen/Qwen2-VL-2B-Instruct
- Role: Grounder (Temporal Event Localization)
- Language(s): English
- License: BSD-3-Clause
### Framework Overview
VideoMind identifies four essential capabilities for grounded video reasoning:
- Planner: coordinates the other roles and decomposes the task.
- Grounder: localizes temporal events in the video.
- Verifier: assesses candidate moments.
- Answerer: answers the question.
This specific checkpoint is the Grounder specialized for the QVHighlights task using the Chain-of-LoRA mechanism.
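As a minimal sketch of how such a role adapter could be attached to the frozen base model, assuming the standard `transformers` + `peft` LoRA workflow (the adapter repo id below is a placeholder, and the `span_to_seconds` post-processing helper is hypothetical, since this card does not specify the Grounder's output format):

```python
def load_role(role_adapter_id: str, base_id: str = "Qwen/Qwen2-VL-2B-Instruct"):
    """Attach one role's LoRA adapter (e.g. this Grounder) to the frozen base.

    Imports are deferred so the pure helper below works without these
    libraries installed.
    """
    from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
    from peft import PeftModel

    model = Qwen2VLForConditionalGeneration.from_pretrained(
        base_id, torch_dtype="auto"
    )
    # Only the small LoRA weights are loaded here; the base stays frozen.
    model = PeftModel.from_pretrained(model, role_adapter_id)
    processor = AutoProcessor.from_pretrained(base_id)
    return model, processor


def span_to_seconds(span: tuple[float, float], duration: float) -> tuple[float, float]:
    """Hypothetical helper: map a normalized (start, end) prediction in
    [0, 1] to seconds for a video of the given duration."""
    start, end = span
    return round(start * duration, 2), round(end * duration, 2)
```

Under the Chain-of-LoRA mechanism, a single base model would be reused across roles by swapping in the corresponding adapter, rather than keeping four full model copies in memory.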
## Citation

Please cite our paper if you find this project helpful.
```bibtex
@inproceedings{liu2026videomind,
  title={VideoMind: A Chain-of-LoRA Agent for Temporal-Grounded Video Reasoning},
  author={Liu, Ye and Lin, Kevin Qinghong and Chen, Chang Wen and Shou, Mike Zheng},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}
```