Sudong Wang committed
Commit baed025 · verified · 1 Parent(s): c611d41

Update README.md

Files changed (1)
  1. README.md +24 -16
README.md CHANGED
@@ -96,25 +96,33 @@ print(output_text)
 
  ## Evaluation Results
 
- Our **OpenMMReasoner-7B (OMR-7B)** model demonstrates strong performance across a comprehensive suite of multimodal reasoning benchmarks. With only 874K SFT samples and 74K RL samples—significantly less data than many competing methods—our model achieves state-of-the-art or highly competitive results on 9 out of 14 benchmark tasks. Notably, OMR-7B achieves **79.5%** on MathVista testmini (best among all models), **63.8%** on MathVerse testmini (best), and **79.0%** on WeMath loose (best), demonstrating the effectiveness of our transparent two-stage training recipe. This performance validates our emphasis on data quality and rigorous training design over simply scaling dataset size.
-
- | Model | SFT Data | RL Data | MathVista<br/>testmini | MathVision<br/>test | MathVision<br/>testmini | MathVerse<br/>testmini | DynaMath<br/>worst | WeMath<br/>loose | LogicVista<br/>test | MMMU<br/>val | MMMU-Pro<br/>standard | MMMU-Pro<br/>vision | CharXiv<br/>reas. | CharXiv<br/>desc. |
- |-------|----------|---------|------------------------|---------------------|-------------------------|------------------------|--------------------|--------------------|---------------------|--------------|-----------------------|---------------------|-------------------|-------------------|
- | VLAA-Thinker-Qwen2.5-7B | 126k | 25k | 68.0 | 26.4 | - | 48.2 | 22.4 | - | 48.5 | - | - | - | - | - |
- | ThinkLite-7B-VL | - | 11k | 71.6 | 24.6 | - | 42.9 | 16.5 | - | 42.7 | - | - | - | - | - |
- | VL-Rethinker-7B | - | 39k | 73.7 | 28.4 | - | 46.4 | 17.8 | - | 42.7 | - | 41.7 | - | - | - |
- | M2-Reasoning | 6.2M | 102k | 75.0 | 42.1 | - | 40.4 | - | - | 50.6 | - | - | - | - | - |
- | MMR1 | 1.6M | 15k | 72.0 | 31.8 | 29.0† | 55.4 | 27.9† | 68.0† | 48.9 | 52.4† | 41.1† | 37.1† | 43.5† | 71.1† |
- | OpenVLThinker-7B | 3.3k | 9.6k | 65.3 | 23.0 | 26.9† | 38.1 | 16.8 | 61.9† | 44.5 | 55.1† | 39.7† | 38.4† | 41.0† | 69.2† |
- | MM-Eureka-Qwen-7B | - | 15.6k | 72.6 | 28.1 | 32.1† | 45.4 | 23.0 | 59.8† | 46.3 | 54.4† | 40.1† | 37.1† | 42.4† | 74.1† |
- | OVR-7B | 2M | 300k | 72.1 | **51.8** | 38.2† | 54.6 | 33.5 | 64.8 | **54.8** | 51.8† | **50.2** | 29.1† | 44.5 | 73.6 |
- | **OMR-7B (ours)** | **874k** | **74k** | **79.5** | 43.6 | **38.8** | **63.8** | **34.9** | **79.0** | 50.0 | **57.8** | 44.1 | **40.6** | **46.1** | 73.5 |
-
- **Note:** Bold numbers indicate the best performance, and † indicates results reproduced using the authors' checkpoints.
 
  ## Citation
 
- If you find OpenMMReasoner useful for your research and applications, please cite using this BibTeX:
 
  ```bibtex
  @misc{zhang2025openmmreasonerpushingfrontiersmultimodal,
 
 
  ## Evaluation Results
 
+
+ | Model | Reasoning Prompt | Tool Calling | VideoMME<br>(≈1018s) | VideoMMMU<br>(subtitle) | VideoMMMU<br>(adaptation) | VideoMMMU<br>(comprehension) | LVBench<br>(≈4101s) | VideoSIAH-Eval<br>(≈1688s) | Average Score |
+ | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
+ | **Proprietary LMMs** | | | | | | | | | |
+ | GPT-4o | ✗ | ✗ | 77.2<sup>†</sup> | 66.0<sup>†</sup> | 62.0<sup>†</sup> | 55.7<sup>†</sup> | 30.8<sup>†</sup> | 17.4 | 51.5 |
+ | Gemini 1.5 Pro | ✗ | ✗ | 81.3<sup>†</sup> | 59.0<sup>†</sup> | 53.3<sup>†</sup> | 49.3<sup>†</sup> | 33.1<sup>†</sup> | - | 55.2 |
+ | **Open-Source (Sparse)** | | | | | | | | | |
+ | Qwen2.5-VL-7B | ✗ | ✗ | <u>62.6</u> | <u>37.3</u> | 28.0 | 36.7 | 30.7 | <u>28.1</u> | 37.2 |
+ | Video-R1-7B | ✓ | ✗ | 61.0 | 36.3 | 40.7 | 52.3 | 37.2 | 27.9 | <u>42.6</u> |
+ | VideoRFT-7B | ✓ | ✗ | 60.9 | 36.7 | 42.0 | <u>53.0</u> | 34.7 | 26.5 | 42.3 |
+ | Video-Thinker-7B | ✓ | ✗ | 61.0 | 34.3 | <u>44.7</u> | <u>53.0</u> | **52.2** | 10.4 | <u>42.6</u> |
+ | LongVT-7B-SFT (Ours) | ✓ | ✓ | 12.5 | **37.7** | **46.0** | **58.3** | 36.0 | 26.8 | 36.2 |
+ | **LongVT-7B-RL (Ours)** | ✓ | ✓ | **66.1** | 32.7 | <u>44.7</u> | 50.0 | <u>37.8</u> | **31.0** | **43.7** |
+ | **Open-Source (Dense)** | | | | | | | | | |
+ | Qwen2.5-VL-7B | ✗ | ✗ | 64.3 | 35.7 | **44.3** | **56.7** | 40.9 | 33.8 | 46.0 |
+ | Video-R1-7B | ✓ | ✗ | 60.5 | <u>37.3</u> | 38.7 | 46.3 | 40.1 | 33.1 | 42.7 |
+ | VideoRFT-7B | ✓ | ✗ | 49.2 | **37.7** | 40.7 | 48.7 | 18.7 | 26.9 | 37.0 |
+ | Video-Thinker-7B | ✓ | ✗ | 60.8 | **37.7** | 42.7 | 55.3 | **54.3** | 6.6 | 42.9 |
+ | LongVT-7B-SFT (Ours) | ✓ | ✓ | 64.9 | 32.3 | 42.0 | 49.7 | 41.1 | 34.8 | 44.1 |
+ | LongVT-7B-RL (Ours) | ✓ | ✓ | <u>66.1</u> | **37.7** | 42.3 | <u>56.3</u> | <u>41.4</u> | <u>35.9</u> | <u>46.6</u> |
+ | **LongVT-7B-RFT (Ours)** | ✓ | ✓ | **67.0** | 35.7 | <u>43.7</u> | **56.7** | 41.3 | **42.0** | **47.7** |
+
+ > **Performance Comparison with Existing Video-Centric LMMs across Various Long Video Understanding and Reasoning Benchmarks.** The best and second-best results among open-source models in each column are marked in **bold** and <u>underlined</u>, respectively. The numbers with "≈" denote the average video duration of each benchmark. <sup>†</sup> indicates results sourced from official reports. **Reasoning Prompt** indicates whether a standard reasoning-style prompt (✓) or a direct question-answering prompt (✗) is applied; **Tool Calling** denotes whether native tool calling is enabled (✓) or disabled (✗) in the prompt.
 
  ## Citation
 
+ If you find LongVT useful for your research and applications, please cite using this BibTeX:
 
  ```bibtex
  @misc{zhang2025openmmreasonerpushingfrontiersmultimodal,