Sudong Wang committed on
Commit c611d41 · verified · 1 Parent(s): dfc1a88

Update README.md

Files changed (1): README.md (+14 -7)
README.md CHANGED
@@ -2,13 +2,13 @@
 base_model:
 - Qwen/Qwen2.5-VL-7B-Instruct
 datasets:
-- OpenMMReasoner/OpenMMReasoner-RL-74K
+- longvideotool/LongVT-Parquet
 license: apache-2.0
 library_name: transformers
-pipeline_tag: image-text-to-text
+pipeline_tag: video-text-to-text
 ---
 
-# OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe
+# LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling
 
 <div align="center">
 
@@ -20,18 +20,25 @@ pipeline_tag: image-text-to-text
 
 ## Overview
 
-Recent advancements in large reasoning models have fueled growing interest in extending such capabilities to multimodal domains. However, despite notable progress in visual reasoning, the lack of transparent and reproducible data curation and training strategies remains a major barrier to scalable research.
+Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought.
+However, they remain vulnerable to hallucination, especially when processing long-form videos where evidence is sparse and temporally dispersed.
+Inspired by how humans comprehend long videos, first skimming globally and then examining relevant clips for details, we introduce **LongVT**, an end-to-end agentic framework that enables "Thinking with **Long** **V**ideos" via interleaved Multimodal Chain-of-**T**ool-Thought.
+Specifically, we exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames.
 
-In this work, we introduce **OpenMMReasoner**, a fully transparent two-stage recipe for multimodal reasoning spanning supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct an 874K-sample cold-start dataset with rigorous step-by-step validation, providing a strong foundation for reasoning capabilities. The subsequent RL stage leverages a 74K-sample dataset across diverse domains to further sharpen and stabilize these abilities, resulting in a more robust and efficient learning process. Extensive evaluations demonstrate that our training recipe not only surpasses strong baselines but also highlights the critical role of data quality and training design in shaping multimodal reasoning performance. Notably, our method achieves a 11.6% improvement over the Qwen2.5-VL-7B-Instruct baseline across nine multimodal reasoning benchmarks, establishing a solid empirical foundation for future large-scale multimodal reasoning research.
+This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence.
+Given the scarcity of fine-grained question-answering (QA) data for the long-video reasoning task, we curate and will release a data suite named **VideoSIAH** to facilitate both training and evaluation.
+Specifically, our training data consist of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning.
+Our evaluation benchmark consists of 1,280 QA pairs that are carefully curated through a semi-automatic data pipeline with human-in-the-loop validation.
+With a meticulously designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks.
 
 
 ## Model Card
 
-The model is the RL version of the OpenMMReasoner and was trained on https://huggingface.co/datasets/OpenMMReasoner/OpenMMReasoner-RL-74K.
+This model is the RL version of LongVT, trained on https://huggingface.co/datasets/longvideotool/LongVT-Parquet.
 
 ## Basic Usage
 
-We present a very basic inference usage here for our model. Our model can be used just as Qwen2.5-VL-7B-Instruct and using vllm. For more detail about using and evaluation of our model, please visit [GitHub](https://github.com/EvolvingLMMs-Lab/OpenMMReasoner) for more information.
+We present basic inference usage for our model below. It can be used in the same way as Qwen2.5-VL-7B-Instruct, including with vLLM. For more details on usage and evaluation, please visit our [GitHub](https://github.com/EvolvingLMMs-Lab/LongVT).
 
 ```python
 from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
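
The usage snippet is truncated at the import line in this diff chunk. As a hedged sketch, the block below shows only the standard Qwen2.5-VL-style chat-message structure for video input that such a model expects; the file name and question are illustrative placeholders, and the checkpoint-loading and generation steps are indicated in comments because they require downloading the model weights.

```python
# Hedged sketch of Qwen2.5-VL-style video chat-message construction.
# "demo_video.mp4" and the question are illustrative placeholders, not
# values from the model card.

def build_video_messages(video_path: str, question: str) -> list:
    """Build a chat-template message list with one video and one text part."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "video", "video": video_path},
                {"type": "text", "text": question},
            ],
        }
    ]

messages = build_video_messages("demo_video.mp4", "What happens in this video?")
print(messages[0]["role"])  # the single user turn carrying video + question

# Typical continuation (not executed here; requires the checkpoint):
# from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto")
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
# text = processor.apply_chat_template(
#     messages, tokenize=False, add_generation_prompt=True)
```

The same message list can be passed to vLLM's chat interface, consistent with the note above that the model is used like Qwen2.5-VL-7B-Instruct.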