---
library_name: transformers
tags:
- multi-modal
- large-language-model
- video-language-model
pipeline_tag: video-text-to-text
datasets:
- OpenGVLab/VideoChat2-IT
- byminji/VideoChat2-IT-clean
language:
- en
metrics:
- accuracy
base_model:
- OpenGVLab/Mini-InternVL-Chat-4B-V1-5
---
|
<h3 align="center"><a href="https://arxiv.org/abs/2510.13251">[ICLR 2026] Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs</a></h3>

<div align="center">
<img width="1000" alt="teaser" src="https://cdn-uploads.huggingface.co/production/uploads/66e345c9596fcff3e4b22e5a/z8qfSvZXfIHb0IdSWCLNA.jpeg">
</div>

<h5 align="center"> TL;DR: This paper presents a systematic analysis of where and how information flows in VideoLLMs for temporal reasoning in VideoQA, revealing key patterns and effective pathways. </h5>
<h5 align="center"> If you like our project, please give us a star ⭐ on <a href="https://github.com/byminji/map-the-flow">GitHub</a> for the latest updates. </h5>

## Introduction
|
This is **Mini-InternVL-4B-Video-FT**, a video-language model fine-tuned for our ICLR 2026 paper [Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs](https://arxiv.org/abs/2510.13251).

We fine-tuned [OpenGVLab/Mini-InternVL-Chat-4B-V1-5](https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-4B-V1-5) on the video portion of [VideoChat2-IT](https://huggingface.co/datasets/OpenGVLab/VideoChat2-IT) (using our cleaned annotations: [VideoChat2-IT-clean](https://huggingface.co/datasets/byminji/VideoChat2-IT-clean)) for 3 epochs to study how video instruction tuning shapes information flow in VideoLLMs.
This model is used to analyze temporal reasoning patterns via causal intervention tools such as Attention Knockout and Logit Lens.
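
To give an intuition for one of these tools, here is a minimal Logit Lens sketch on a hypothetical toy decoder (illustrative only; it does not use this checkpoint's actual architecture or interface). The idea is to decode each layer's intermediate hidden state with the final LM head, revealing what the model "predicts" at every depth:

```python
import torch
import torch.nn as nn

# Logit-lens sketch on a toy transformer (hypothetical model; the paper
# applies the technique to the language model inside the VideoLLM).
torch.manual_seed(0)
vocab, d_model, n_layers = 100, 32, 4

embed = nn.Embedding(vocab, d_model)
layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
     for _ in range(n_layers)]
)
norm = nn.LayerNorm(d_model)
lm_head = nn.Linear(d_model, vocab, bias=False)

tokens = torch.tensor([[1, 2, 3]])
h = embed(tokens)
per_layer_top1 = []
for layer in layers:
    h = layer(h)
    # Logit lens: project the intermediate hidden state through the
    # final norm + LM head to read out a per-layer token prediction.
    logits = lm_head(norm(h))
    per_layer_top1.append(logits[0, -1].argmax().item())

print(per_layer_top1)  # top-1 token id at the last position, per layer
```

Tracking where along the depth axis the correct answer first emerges is what makes this a useful probe of information flow.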

## Model Zoo
|
| Model | Base Model | HF Link |
|-------|------------|---------|
| LLaVA-NeXT-7B-Video-FT | [llava-hf/llava-v1.6-vicuna-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-vicuna-7b-hf) | [byminji/LLaVA-NeXT-7B-Video-FT](https://huggingface.co/byminji/LLaVA-NeXT-7B-Video-FT) |
| LLaVA-NeXT-13B-Video-FT | [llava-hf/llava-v1.6-vicuna-13b-hf](https://huggingface.co/llava-hf/llava-v1.6-vicuna-13b-hf) | [byminji/LLaVA-NeXT-13B-Video-FT](https://huggingface.co/byminji/LLaVA-NeXT-13B-Video-FT) |
| Mini-InternVL-4B-Video-FT (**this checkpoint**) | [OpenGVLab/Mini-InternVL-Chat-4B-V1-5](https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-4B-V1-5) | [byminji/Mini-InternVL-4B-Video-FT](https://huggingface.co/byminji/Mini-InternVL-4B-Video-FT) |

## Results
|
We identify effective information pathways in VideoLLMs and show that these sparse pathways are sufficient for solving VideoQA tasks.
Keeping only the **40%** of attention edges in Mini-InternVL-4B-Video-FT that form these pathways preserves its VideoQA performance.
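
The mechanism behind removing attention edges can be sketched in a few lines of standard dot-product attention (a minimal single-head illustration with made-up tensors, not the paper's actual code): an edge from key position `j` to query position `i` is knocked out by setting its score to `-inf` before the softmax.

```python
import torch
import torch.nn.functional as F

# Attention-knockout sketch: cut one attention edge by masking its score.
torch.manual_seed(0)
seq_len, d = 5, 8
q = torch.randn(1, seq_len, d)  # queries (illustrative random tensors)
k = torch.randn(1, seq_len, d)  # keys

scores = q @ k.transpose(-1, -2) / d ** 0.5
knockout = torch.zeros(seq_len, seq_len)
knockout[4, 1] = float("-inf")  # block query 4 from attending to key 1

attn = F.softmax(scores + knockout, dim=-1)
print(attn[0, 4, 1].item())  # the knocked-out edge carries exactly 0.0 weight
```

Knocking out edges between token groups (e.g. video tokens to question tokens) and measuring the drop in answer probability is how the effective pathways are localized.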

<img width="800" alt="main results" src="https://cdn-uploads.huggingface.co/production/uploads/66e345c9596fcff3e4b22e5a/v_yig9G_yG-F7exis4ueZ.png">

## Citation
|
If you find our paper useful in your research, please consider citing:

```bibtex
@inproceedings{kim2026map,
  author    = {Kim, Minji and Kim, Taekyung and Han, Bohyung},
  title     = {Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026},
}

@article{kim2025map,
  author  = {Kim, Minji and Kim, Taekyung and Han, Bohyung},
  title   = {Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs},
  journal = {arXiv preprint arXiv:2510.13251},
  year    = {2025},
}
```