---
library_name: transformers
tags:
- multi-modal
- large-language-model
- video-language-model
pipeline_tag: video-text-to-text
datasets:
- OpenGVLab/VideoChat2-IT
language:
- en
metrics:
- accuracy
base_model:
- OpenGVLab/Mini-InternVL-Chat-4B-V1-5
---

<h3 align="center"><a href="https://arxiv.org/abs/2510.13251">[ICLR 2026] Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs</a></h3>

<div align="center">
<img width="1000" alt="teaser" src="https://cdn-uploads.huggingface.co/production/uploads/66e345c9596fcff3e4b22e5a/z8qfSvZXfIHb0IdSWCLNA.jpeg">
</div>

<h5 align="center"> TL;DR: This paper presents a systematic analysis of where and how information flows in VideoLLMs for temporal reasoning in VideoQA, revealing key patterns and effective pathways. </h5>
<h5 align="center"> If you like our project, please give us a star ⭐ on <a href="https://github.com/byminji/map-the-flow">GitHub</a> for the latest updates. </h5>

## Introduction

This is **Mini-InternVL-4B-Video-FT**, a video-language model fine-tuned for our ICLR 2026 paper [Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs](https://arxiv.org/abs/2510.13251).

We fine-tuned [OpenGVLab/Mini-InternVL-Chat-4B-V1-5](https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-4B-V1-5) on the video portion of [VideoChat2-IT](https://huggingface.co/datasets/OpenGVLab/VideoChat2-IT) for 3 epochs to study how video instruction tuning shapes information flow in VideoLLMs.
This model is used to analyze temporal reasoning patterns with interpretability tools such as Attention Knockout and the Logit Lens.

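The checkpoint is in Hugging Face Transformers format. The snippet below is only a minimal loading sketch: it assumes the checkpoint keeps the Mini-InternVL remote-code interface of its base model, and the dtype and device choices are illustrative. The full video preprocessing and generation calls follow the base model's card.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Minimal loading sketch (assumption: the checkpoint reuses the Mini-InternVL
# remote-code interface of the base model, hence trust_remote_code=True).
path = "byminji/Mini-InternVL-4B-Video-FT"

model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,  # illustrative; pick a dtype that fits your GPU
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# Sampled video frames are preprocessed into pixel values and passed to the
# chat interface defined by the base model's remote code; see the
# Mini-InternVL-Chat-4B-V1-5 model card for the exact preprocessing and
# generation call.
```
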
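For reference, the sketch below shows the generic idea behind the Logit Lens as applied to the underlying language decoder: each layer's hidden state at the answer position is projected through the final unembedding to see which tokens that layer already favors. This is a model-agnostic simplification (it omits the final normalization that a faithful implementation applies), not the exact procedure used in the paper.

```python
import torch

@torch.no_grad()
def logit_lens(causal_lm, tokenizer, input_ids, top_k=5):
    """Project every layer's last-position hidden state through the final unembedding.

    Works for a decoder-only Hugging Face causal LM; a faithful Logit Lens also
    applies the model's final normalization before the unembedding (omitted here).
    """
    out = causal_lm(input_ids=input_ids, output_hidden_states=True)
    unembed = causal_lm.get_output_embeddings()          # final LM head (vocab projection)
    for layer, hidden in enumerate(out.hidden_states):   # embedding layer + every block
        logits = unembed(hidden[0, -1])                  # last token of the first sample
        top_ids = logits.topk(top_k).indices.tolist()
        print(f"layer {layer:2d}:", tokenizer.convert_ids_to_tokens(top_ids))
```
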
## Model Zoo

| Model | Base Model | HF Link |
|-------|------------|---------|
| LLaVA-NeXT-7B-Video-FT | [llava-hf/llava-v1.6-vicuna-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-vicuna-7b-hf) | [byminji/LLaVA-NeXT-7B-Video-FT](https://huggingface.co/byminji/LLaVA-NeXT-7B-Video-FT) |
| LLaVA-NeXT-13B-Video-FT | [llava-hf/llava-v1.6-vicuna-13b-hf](https://huggingface.co/llava-hf/llava-v1.6-vicuna-13b-hf) | [byminji/LLaVA-NeXT-13B-Video-FT](https://huggingface.co/byminji/LLaVA-NeXT-13B-Video-FT) |
| Mini-InternVL-4B-Video-FT (**this checkpoint**) | [OpenGVLab/Mini-InternVL-Chat-4B-V1-5](https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-4B-V1-5) | [byminji/Mini-InternVL-4B-Video-FT](https://huggingface.co/byminji/Mini-InternVL-4B-Video-FT) |

## Results

We identify effective information pathways in VideoLLMs and show that these sparse pathways are sufficient for solving VideoQA tasks.
In Mini-InternVL-4B-Video-FT, these effective pathways cover only **40%** of the attention edges, yet restricting the model to them retains its VideoQA performance.

<img width="800" alt="main results" src="https://cdn-uploads.huggingface.co/production/uploads/66e345c9596fcff3e4b22e5a/v_yig9G_yG-F7exis4ueZ.png">

## Citation

If you find our paper useful in your research, please consider citing:

```bibtex
@inproceedings{kim2026map,
  author    = {Kim, Minji and Kim, Taekyung and Han, Bohyung},
  title     = {Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026},
}

@article{kim2025map,
  author  = {Kim, Minji and Kim, Taekyung and Han, Bohyung},
  title   = {Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs},
  journal = {arXiv preprint arXiv:2510.13251},
  year    = {2025},
}
```