OpenGVLab/InternVL_2_5_HiCo_R64

lixinhao committed on Feb 13, 2025

Commit c99d9ed · verified · 1 Parent(s): 54e2c8d

Update README.md

Files changed (1)

README.md +1 -1

@@ -70,7 +70,7 @@ model-index:
 [\[📜 Tech Report\]](https://arxiv.org/abs/2501.12386)
 <!-- [\[🗨️ Chat Demo\]](https://huggingface.co/spaces/OpenGVLab/VideoChat-Flash) -->
- InternVideo2.5 is a video multimodal large language model (MLLM, built upoon InternVL2.5) enhanced with **long and rich context (LRC) modeling**. It significantly improves upon existing MLLMs by enhancing their ability to perceive fine-grained details and capture long-form temporal structures. We achieve this through dense vision task annotations using direct preference optimization (TPO) and compact spatiotemporal representations via adaptive hierarchical token compression (HiCo). This model is a variant of InternVideo2.5's ablation experiment, built on HiCo technology only (R64 means 64 tokens per frames).

+ InternVideo2.5 is a video multimodal large language model (MLLM, built upoon InternVL2.5) enhanced with **long and rich context (LRC) modeling**. It significantly improves upon existing MLLMs by enhancing their ability to perceive fine-grained details and capture long-form temporal structures. We achieve this through dense vision task annotations using direct preference optimization (TPO) and compact spatiotemporal representations via adaptive hierarchical token compression (HiCo). This model is a variant of InternVideo2.5's ablation experiment, built on HiCo technology only (**R64 means 64 tokens per frames**).