---
license: bsd-3-clause
pipeline_tag: video-text-to-text
library_name: transformers
base_model: Qwen/Qwen2-VL-2B-Instruct
---

VideoMind-2B-FT-QVHighlights

VideoMind is a multi-modal agent framework that enhances video reasoning by emulating human-like processes, such as breaking down tasks, localizing and verifying moments, and synthesizing answers.

This repository contains the fine-tuned LoRA adapter (specifically the Grounder role for temporal event localization) for the 2B version of the framework, trained on the QVHighlights dataset. It is based on the paper VideoMind: A Chain-of-LoRA Agent for Temporal-Grounded Video Reasoning.

🔖 Model Details

Model Description

  • Model type: Multi-modal Large Language Model (LoRA Adapter)
  • Base Model: Qwen/Qwen2-VL-2B-Instruct
  • Role: Grounder (Temporal Event Localization)
  • Language(s): English
  • License: BSD-3-Clause

Framework Overview

VideoMind identifies four essential capabilities for grounded video reasoning:

  1. Planner: coordinates the other roles based on the query.
  2. Grounder: localizes temporal events in the video.
  3. Verifier: assesses candidate moments.
  4. Answerer: answers questions over the selected moment.

This specific checkpoint is the Grounder specialized for the QVHighlights task using the Chain-of-LoRA mechanism.
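Since this checkpoint is a LoRA adapter rather than a full model, it is typically loaded on top of the Qwen2-VL-2B-Instruct base weights. Below is a minimal sketch using `transformers` and `peft`; the adapter repo id placeholder and generation settings are assumptions, and the full VideoMind agent pipeline (role switching via Chain-of-LoRA) requires the project's own code:

```python
# Minimal sketch: attach this Grounder LoRA adapter to the Qwen2-VL-2B base.
# "<adapter-repo-id>" is a placeholder for this repository's Hub id.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from peft import PeftModel

base = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

# Wrap the base model with the fine-tuned Grounder adapter.
model = PeftModel.from_pretrained(base, "<adapter-repo-id>")
model.eval()
```

Note that this loads only the Grounder role; reproducing the full plan-ground-verify-answer loop, including video frame sampling and timestamp decoding, is handled by the VideoMind codebase.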

📖 Citation

Please cite our paper if you find this project helpful.

@inproceedings{liu2026videomind,
  title={VideoMind: A Chain-of-LoRA Agent for Temporal-Grounded Video Reasoning},
  author={Liu, Ye and Lin, Kevin Qinghong and Chen, Chang Wen and Shou, Mike Zheng},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}