---
library_name: transformers
tags:
- multi-modal
- large-language-model
- video-language-model
pipeline_tag: video-text-to-text
datasets:
- OpenGVLab/VideoChat2-IT
- byminji/VideoChat2-IT-clean
language:
- en
metrics:
- accuracy
base_model:
- OpenGVLab/Mini-InternVL-Chat-4B-V1-5
---
|
<h3 align="center"><a href="https://arxiv.org/abs/2510.13251">[ICLR 2026] Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs</a></h3>

<div align="center">
<img width="1000" alt="teaser" src="https://cdn-uploads.huggingface.co/production/uploads/66e345c9596fcff3e4b22e5a/z8qfSvZXfIHb0IdSWCLNA.jpeg">
</div>

<h5 align="center"> TL;DR: This paper presents a systematic analysis of where and how information flows in VideoLLMs for temporal reasoning in VideoQA, revealing key patterns and effective pathways. </h5>
<h5 align="center"> If you like our project, please give us a star ⭐ on <a href="https://github.com/byminji/map-the-flow">GitHub</a> for the latest updates. </h5>

## Introduction
|
This is **Mini-InternVL-4B-Video-FT**, a video-language model fine-tuned for our ICLR 2026 paper [Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs](https://arxiv.org/abs/2510.13251).

We fine-tuned [OpenGVLab/Mini-InternVL-Chat-4B-V1-5](https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-4B-V1-5) on the video portion of [VideoChat2-IT](https://huggingface.co/datasets/OpenGVLab/VideoChat2-IT) (using our cleaned annotations: [VideoChat2-IT-clean](https://huggingface.co/datasets/byminji/VideoChat2-IT-clean)) for 3 epochs to study how video instruction tuning shapes information flow in VideoLLMs.
This model is used to analyze temporal reasoning patterns via causal intervention tools such as Attention Knockout and Logit Lens.
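
To give an intuition for one of these tools, here is a minimal Logit Lens sketch on a hypothetical toy decoder (illustrative only; it does not use this checkpoint's actual architecture or interface). The idea is to decode each layer's intermediate hidden state with the final LM head, revealing what the model "predicts" at every depth:

```python
import torch
import torch.nn as nn

# Logit-lens sketch on a toy transformer (hypothetical model; the paper
# applies the technique to the language model inside the VideoLLM).
torch.manual_seed(0)
vocab, d_model, n_layers = 100, 32, 4

embed = nn.Embedding(vocab, d_model)
layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
     for _ in range(n_layers)]
)
norm = nn.LayerNorm(d_model)
lm_head = nn.Linear(d_model, vocab, bias=False)

tokens = torch.tensor([[1, 2, 3]])
h = embed(tokens)
per_layer_top1 = []
for layer in layers:
    h = layer(h)
    # Logit lens: project the intermediate hidden state through the
    # final norm + LM head to read out a per-layer token prediction.
    logits = lm_head(norm(h))
    per_layer_top1.append(logits[0, -1].argmax().item())

print(per_layer_top1)  # top-1 token id at the last position, per layer
```

Tracking where along the depth axis the correct answer first emerges is what makes this a useful probe of information flow.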

## Model Zoo
|
| Model | Base Model | HF Link |
|-------|------------|---------|
| LLaVA-NeXT-7B-Video-FT | [llava-hf/llava-v1.6-vicuna-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-vicuna-7b-hf) | [byminji/LLaVA-NeXT-7B-Video-FT](https://huggingface.co/byminji/LLaVA-NeXT-7B-Video-FT) |
| LLaVA-NeXT-13B-Video-FT | [llava-hf/llava-v1.6-vicuna-13b-hf](https://huggingface.co/llava-hf/llava-v1.6-vicuna-13b-hf) | [byminji/LLaVA-NeXT-13B-Video-FT](https://huggingface.co/byminji/LLaVA-NeXT-13B-Video-FT) |
| Mini-InternVL-4B-Video-FT (**this checkpoint**) | [OpenGVLab/Mini-InternVL-Chat-4B-V1-5](https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-4B-V1-5) | [byminji/Mini-InternVL-4B-Video-FT](https://huggingface.co/byminji/Mini-InternVL-4B-Video-FT) |

## Results
|
We identify effective information pathways in VideoLLMs and show that these sparse pathways are sufficient for solving VideoQA tasks.
Keeping only the **40%** of attention edges in Mini-InternVL-4B-Video-FT that form these pathways preserves its VideoQA performance.
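
The mechanism behind removing attention edges can be sketched in a few lines of standard dot-product attention (a minimal single-head illustration with made-up tensors, not the paper's actual code): an edge from key position `j` to query position `i` is knocked out by setting its score to `-inf` before the softmax.

```python
import torch
import torch.nn.functional as F

# Attention-knockout sketch: cut one attention edge by masking its score.
torch.manual_seed(0)
seq_len, d = 5, 8
q = torch.randn(1, seq_len, d)  # queries (illustrative random tensors)
k = torch.randn(1, seq_len, d)  # keys

scores = q @ k.transpose(-1, -2) / d ** 0.5
knockout = torch.zeros(seq_len, seq_len)
knockout[4, 1] = float("-inf")  # block query 4 from attending to key 1

attn = F.softmax(scores + knockout, dim=-1)
print(attn[0, 4, 1].item())  # the knocked-out edge carries exactly 0.0 weight
```

Knocking out edges between token groups (e.g. video tokens to question tokens) and measuring the drop in answer probability is how the effective pathways are localized.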

<img width="800" alt="main results" src="https://cdn-uploads.huggingface.co/production/uploads/66e345c9596fcff3e4b22e5a/v_yig9G_yG-F7exis4ueZ.png">

## Citation
|
If you find our paper useful in your research, please consider citing:

```bibtex
@inproceedings{kim2026map,
  author    = {Kim, Minji and Kim, Taekyung and Han, Bohyung},
  title     = {Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026},
}

@article{kim2025map,
  author  = {Kim, Minji and Kim, Taekyung and Han, Bohyung},
  title   = {Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs},
  journal = {arXiv preprint arXiv:2510.13251},
  year    = {2025},
}
```