OpenGVLab/InternVL2-Llama3-76B


czczup committed on Jul 18, 2024

Commit 12d403f · verified · 1 Parent(s): 6ce5764

Upload folder using huggingface_hub

Files changed (1)

README.md +2 -2


@@ -75,8 +75,8 @@ InternVL 2.0 is a multimodal large language model series, featuring models of va
 |           MVBench           |   -    |   -    |       -        |     72.5      |         69.6         |
 | MMBench-Video<sub>8f</sub>  |  1.62  |  1.53  |      1.30      |     1.32      |         1.37         |
 | MMBench-Video<sub>16f</sub> |  1.86  |  1.68  |      1.60      |     1.45      |         1.52         |
-|    Video-MME<br>w/o subs    |  71.9  |  59.9  |      75.0      |     TODO      |         TODO         |
-|     Video-MME<br>w subs     |  77.2  |  63.3  |      81.3      |     TODO      |         TODO         |
+|    Video-MME<br>w/o subs    |  71.9  |  59.9  |      75.0      |     61.2      |         TODO         |
+|     Video-MME<br>w subs     |  77.2  |  63.3  |      81.3      |     62.4      |         TODO         |
 - We evaluate our models on MVBench and Video-MME by extracting 16 frames from each video, and each frame was resized to a 448x448 image.
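
The evaluation note in this hunk says 16 frames are extracted per video and each is resized to 448x448, but it does not say how the frames are chosen. The sketch below assumes evenly spaced sampling (one frame at the midpoint of each of 16 equal segments), which is an illustration only and may differ from the authors' procedure. `sample_frame_indices` and `TARGET_SIZE` are hypothetical names, not part of the repository.

```python
from typing import List

# Per-frame resolution stated in the evaluation note (width, height).
TARGET_SIZE = (448, 448)


def sample_frame_indices(total_frames: int, num_frames: int = 16) -> List[int]:
    """Pick num_frames indices spread evenly across a video.

    Assumption: the video is split into equal-length segments and the
    midpoint of each segment is taken. The source does not confirm this.
    """
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    segment = total_frames / num_frames
    # Midpoint of each segment, clamped so the index is always valid.
    return [
        min(total_frames - 1, int(segment * i + segment / 2))
        for i in range(num_frames)
    ]


if __name__ == "__main__":
    # Example: a 160-frame clip yields one index per 10-frame segment.
    print(sample_frame_indices(160))
```

Decoding the selected frames and resizing them to `TARGET_SIZE` would be handled by whichever video and image library is in use; that step is left out here to keep the sketch dependency-free.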