Sudong Wang committed
Update README.md

README.md CHANGED
@@ -96,25 +96,33 @@ print(output_text)
## Evaluation Results

| Model | Reasoning Prompt | Tool Calling | VideoMME<br>(≈1018s) | VideoMMMU<br>(subtitle) | VideoMMMU<br>(adaptation) | VideoMMMU<br>(comprehension) | LVBench<br>(≈4101s) | VideoSIAH-Eval<br>(≈1688s) | Average Score |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **Proprietary LMMs** | | | | | | | | | |
| GPT-4o | ✗ | ✗ | 77.2<sup>†</sup> | 66.0<sup>†</sup> | 62.0<sup>†</sup> | 55.7<sup>†</sup> | 30.8<sup>†</sup> | 17.4 | 51.5 |
| Gemini 1.5 Pro | ✗ | ✗ | 81.3<sup>†</sup> | 59.0<sup>†</sup> | 53.3<sup>†</sup> | 49.3<sup>†</sup> | 33.1<sup>†</sup> | - | 55.2 |
| **Open-Source (Sparse)** | | | | | | | | | |
| Qwen2.5-VL-7B | ✗ | ✗ | <u>62.6</u> | <u>37.3</u> | 28.0 | 36.7 | 30.7 | <u>28.1</u> | 37.2 |
| Video-R1-7B | ✓ | ✗ | 61.0 | 36.3 | 40.7 | 52.3 | 37.2 | 27.9 | <u>42.6</u> |
| VideoRFT-7B | ✓ | ✗ | 60.9 | 36.7 | 42.0 | <u>53.0</u> | 34.7 | 26.5 | 42.3 |
| Video-Thinker-7B | ✓ | ✗ | 61.0 | 34.3 | <u>44.7</u> | <u>53.0</u> | **52.2** | 10.4 | <u>42.6</u> |
| LongVT-7B-SFT (Ours) | ✓ | ✓ | 12.5 | **37.7** | **46.0** | **58.3** | 36.0 | 26.8 | 36.2 |
| **LongVT-7B-RL (Ours)** | ✓ | ✓ | **66.1** | 32.7 | <u>44.7</u> | 50.0 | <u>37.8</u> | **31.0** | **43.7** |
| **Open-Source (Dense)** | | | | | | | | | |
| Qwen2.5-VL-7B | ✗ | ✗ | 64.3 | 35.7 | **44.3** | **56.7** | 40.9 | 33.8 | 46.0 |
| Video-R1-7B | ✓ | ✗ | 60.5 | <u>37.3</u> | 38.7 | 46.3 | 40.1 | 33.1 | 42.7 |
| VideoRFT-7B | ✓ | ✗ | 49.2 | **37.7** | 40.7 | 48.7 | 18.7 | 26.9 | 37.0 |
| Video-Thinker-7B | ✓ | ✗ | 60.8 | **37.7** | 42.7 | 55.3 | **54.3** | 6.6 | 42.9 |
| LongVT-7B-SFT (Ours) | ✓ | ✓ | 64.9 | 32.3 | 42.0 | 49.7 | 41.1 | 34.8 | 44.1 |
| LongVT-7B-RL (Ours) | ✓ | ✓ | <u>66.1</u> | **37.7** | 42.3 | <u>56.3</u> | <u>41.4</u> | <u>35.9</u> | <u>46.6</u> |
| **LongVT-7B-RFT (Ours)** | ✓ | ✓ | **67.0** | 35.7 | <u>43.7</u> | **56.7** | 41.3 | **42.0** | **47.7** |

> **Performance Comparison with Existing Video-Centric LMMs across Various Long Video Understanding and Reasoning Benchmarks.** The best and second-best results among open-source models in each column are marked in **bold** and <u>underlined</u>, respectively. Numbers with "≈" denote the average video duration of each benchmark. <sup>†</sup> indicates results sourced from official reports. **Reasoning Prompt** indicates whether a standard reasoning-style prompt (✓) or a direct question-answering prompt (✗) is applied; **Tool Calling** denotes whether native tool calling is enabled (✓) or disabled (✗) in the prompt.

## Citation

If you find LongVT useful for your research and applications, please cite using this BibTeX:

```bibtex
@misc{zhang2025openmmreasonerpushingfrontiersmultimodal,