Video-Text-to-Text
Transformers
Safetensors
English
internvl_chat
feature-extraction
multimodal
custom_code
Eval Results (legacy)
Instructions to use OpenGVLab/InternVL_2_5_HiCo_R64 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use OpenGVLab/InternVL_2_5_HiCo_R64 with Transformers:
# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("OpenGVLab/InternVL_2_5_HiCo_R64", trust_remote_code=True, dtype="auto")
- Notebooks
- Google Colab
- Kaggle
Update README.md
README.md
CHANGED
@@ -70,7 +70,7 @@ model-index:
 [\[📜 Tech Report\]](https://arxiv.org/abs/2501.12386)
 <!-- [\[🗨️ Chat Demo\]](https://huggingface.co/spaces/OpenGVLab/VideoChat-Flash) -->
 
-InternVideo2.5 is a video multimodal large language model (MLLM, built upon InternVL2.5) enhanced with **long and rich context (LRC) modeling**. It significantly improves upon existing MLLMs by enhancing their ability to perceive fine-grained details and capture long-form temporal structures. We achieve this through dense vision task annotations using direct preference optimization (TPO) and compact spatiotemporal representations via adaptive hierarchical token compression (HiCo). This model is a variant of InternVideo2.5's ablation experiment, built on HiCo technology only (R64 means 64 tokens per frame).
+InternVideo2.5 is a video multimodal large language model (MLLM, built upon InternVL2.5) enhanced with **long and rich context (LRC) modeling**. It significantly improves upon existing MLLMs by enhancing their ability to perceive fine-grained details and capture long-form temporal structures. We achieve this through dense vision task annotations using direct preference optimization (TPO) and compact spatiotemporal representations via adaptive hierarchical token compression (HiCo). This model is a variant of InternVideo2.5's ablation experiment, built on HiCo technology only (**R64 means 64 tokens per frame**).
 
 
 
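To make the "R64" note concrete, the sketch below works out how the visual token count would scale with clip length if every frame contributed exactly 64 tokens. That fixed per-frame budget is the only figure taken from the README; the helper names, the frame counts, and the 8192-token budget are hypothetical and chosen purely for illustration. Because HiCo is described as adaptive, the real model may produce a different count for a given clip.

```python
# Rough token-budget arithmetic for a fixed per-frame token count.
# Assumption: every frame contributes exactly TOKENS_PER_FRAME visual tokens.
# The README only states "R64 means 64 tokens per frame"; everything else
# here (function names, example frame counts, budgets) is illustrative.

TOKENS_PER_FRAME = 64


def visual_token_count(num_frames: int, tokens_per_frame: int = TOKENS_PER_FRAME) -> int:
    """Return the total number of visual tokens for a clip of num_frames frames."""
    if num_frames < 0:
        raise ValueError("num_frames must be non-negative")
    return num_frames * tokens_per_frame


def max_frames_for_budget(token_budget: int, tokens_per_frame: int = TOKENS_PER_FRAME) -> int:
    """Return how many whole frames fit inside a given visual token budget."""
    if token_budget < 0:
        raise ValueError("token_budget must be non-negative")
    return token_budget // tokens_per_frame


if __name__ == "__main__":
    # Hypothetical clip lengths, in sampled frames.
    for frames in (16, 128, 512):
        print(f"{frames} frames -> {visual_token_count(frames)} visual tokens")
    # Hypothetical budget of 8192 visual tokens.
    print(f"8192-token budget -> {max_frames_for_budget(8192)} frames")
```

Under this assumption the visual token count grows linearly with the number of sampled frames, so halving the per-frame budget would double the number of frames that fit in the same context.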