Sudong Wang committed · Update README.md
---
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
datasets:
- longvideotool/LongVT-Parquet
license: apache-2.0
library_name: transformers
pipeline_tag: video-text-to-text
---

# LongVT: Incentivizing “Thinking with Long Videos” via Native Tool Calling

<div align="center">

## Overview

Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucination, especially when processing long-form videos where evidence is sparse and temporally dispersed.

Inspired by how humans comprehend long videos, first skimming globally and then examining relevant clips for details, we introduce **LongVT**, an end-to-end agentic framework that enables “Thinking with **Long** **V**ideos” via interleaved Multimodal Chain-of-**T**ool-Thought. Specifically, we exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames. This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence.

Given the scarcity of fine-grained question-answering (QA) data for long video reasoning, we curate and will release a data suite named **VideoSIAH** to facilitate both training and evaluation. Our training data comprise 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning. Our evaluation benchmark consists of 1,280 QA pairs, carefully curated through a semi-automatic data pipeline with human-in-the-loop validation.

With a meticulously designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks.
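For intuition only, the global-to-local loop described above can be sketched as follows. Every name in this sketch (`run_tool_loop`, `Step`, the toy reasoner) is hypothetical and merely stands in for the actual tool interface released in our GitHub repository:

```python
from dataclasses import dataclass

@dataclass
class Step:
    kind: str                     # "answer" or "crop" (a tool call)
    text: str = ""
    crop: tuple = (0.0, 0.0)      # (start_sec, end_sec) when kind == "crop"

def run_tool_loop(duration, reasoner, max_rounds=4):
    """Hypothetical sketch: start from the full video, let the model either
    answer or call a video-cropping tool to zoom into a sub-clip."""
    clip = (0.0, duration)
    trace = [clip]
    for _ in range(max_rounds):
        step = reasoner(clip)     # stands in for LMM reasoning over sampled frames
        if step.kind == "answer":
            return step.text, trace   # answer grounded in the inspected clip
        clip = step.crop              # zoom in; frames are resampled finer here
        trace.append(clip)
    return None, trace                # budget exhausted without a grounded answer

# Toy reasoner: keeps zooming toward the middle until the clip is short enough.
def toy_reasoner(clip):
    start, end = clip
    if end - start > 60:
        mid = (start + end) / 2
        return Step("crop", crop=(mid - 30, mid + 30))
    return Step("answer", text="the key event occurs near the clip midpoint")

answer, trace = run_tool_loop(3600.0, toy_reasoner)
print(answer, trace)
```

The real framework interleaves this cropping tool with textual reasoning inside a single model rollout rather than an external controller loop.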
## Model Card

This model is the RL version of LongVT, trained on the [longvideotool/LongVT-Parquet](https://huggingface.co/datasets/longvideotool/LongVT-Parquet) dataset.
## Basic Usage

We present basic inference usage for our model below. The model can be used just like Qwen2.5-VL-7B-Instruct, including with vLLM. For more details on usage and evaluation of our model, please visit our [GitHub repository](https://github.com/EvolvingLMMs-Lab/LongVT).

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Replace with this repository's model id when loading our checkpoint;
# the base model id is shown here as a stand-in.
model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "path/to/your/video.mp4"},
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Build the chat prompt and collect the video inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate and decode only the newly produced tokens.
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)
]
print(processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0])
```