---
library_name: transformers
tags:
- multi-modal
- large-language-model
- video-language-model
pipeline_tag: video-text-to-text
datasets:
- OpenGVLab/VideoChat2-IT
language:
- en
metrics:
- accuracy
base_model:
- OpenGVLab/Mini-InternVL-Chat-4B-V1-5
---

<h3 align="center"><a href="https://arxiv.org/abs/2510.13251">[ICLR 2026] Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs</a></h3>

<div align="center">
<img width="1000" alt="teaser" src="https://cdn-uploads.huggingface.co/production/uploads/66e345c9596fcff3e4b22e5a/z8qfSvZXfIHb0IdSWCLNA.jpeg">
</div>

<h5 align="center"> TL;DR: This paper presents a systematic analysis of where and how information flows in VideoLLMs during temporal reasoning for VideoQA, revealing key patterns and effective pathways. </h5>
<h5 align="center"> If you like our project, please give us a star ⭐ on <a href="https://github.com/byminji/map-the-flow">GitHub</a> for the latest updates. </h5>

## Introduction

This is **Mini-InternVL-4B-Video-FT**, a video-language model fine-tuned for our ICLR 2026 paper [Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs](https://arxiv.org/abs/2510.13251).

We fine-tuned [OpenGVLab/Mini-InternVL-Chat-4B-V1-5](https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-4B-V1-5) on the video portion of [VideoChat2-IT](https://huggingface.co/datasets/OpenGVLab/VideoChat2-IT) for 3 epochs to study how video instruction tuning shapes information flow in VideoLLMs.
This model is used to analyze temporal reasoning patterns via causal intervention tools such as Attention Knockout and Logit Lens.

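As a quick intuition for the Logit Lens used in our analysis: an intermediate hidden state is normalized and projected through the unembedding matrix to read off what the model would predict at that layer. The sketch below is a minimal, self-contained illustration of the general technique, not our analysis code; the names and shapes (`W_U`, `h_mid`, `d_model`, `vocab`) are illustrative toys.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Final-layer-norm stand-in: normalize each position's activation vector.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def logit_lens(hidden, W_U):
    # Project mid-layer hidden states through the norm and unembedding
    # matrix to obtain per-position "early" token logits.
    return layer_norm(hidden) @ W_U  # [seq, vocab]

rng = np.random.default_rng(0)
d_model, vocab, seq = 16, 32, 5
W_U = rng.standard_normal((d_model, vocab))      # toy unembedding matrix
h_mid = rng.standard_normal((seq, d_model))      # toy mid-layer activations
logits = logit_lens(h_mid, W_U)
early_preds = logits.argmax(-1)                  # most likely token per position
print(logits.shape)  # (5, 32)
```

With a real checkpoint, `W_U` and the final norm would come from the model's output head, and `h_mid` from a forward hook on the layer of interest.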
## Model Zoo

| Model | Base Model | HF Link |
|-------|------------|---------|
| LLaVA-NeXT-7B-Video-FT | [llava-hf/llava-v1.6-vicuna-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-vicuna-7b-hf) | [byminji/LLaVA-NeXT-7B-Video-FT](https://huggingface.co/byminji/LLaVA-NeXT-7B-Video-FT) |
| LLaVA-NeXT-13B-Video-FT | [llava-hf/llava-v1.6-vicuna-13b-hf](https://huggingface.co/llava-hf/llava-v1.6-vicuna-13b-hf) | [byminji/LLaVA-NeXT-13B-Video-FT](https://huggingface.co/byminji/LLaVA-NeXT-13B-Video-FT) |
| Mini-InternVL-4B-Video-FT (**This Checkpoint**) | [OpenGVLab/Mini-InternVL-Chat-4B-V1-5](https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-4B-V1-5) | [byminji/Mini-InternVL-4B-Video-FT](https://huggingface.co/byminji/Mini-InternVL-4B-Video-FT) |

## Results

We identify effective information pathways in VideoLLMs and show that these sparse pathways are sufficient for solving VideoQA tasks.
Although the effective pathways comprise only **40%** of the attention edges in Mini-InternVL-4B-Video-FT, the model retains its VideoQA performance when restricted to them.

<img width="800" alt="main results" src="https://cdn-uploads.huggingface.co/production/uploads/66e345c9596fcff3e4b22e5a/v_yig9G_yG-F7exis4ueZ.png">

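The edge-pruning result above relies on Attention Knockout: an attention edge from a query position to a key position is severed by forcing its pre-softmax score to a large negative value, so it receives zero weight. This is a minimal single-head numpy sketch of that idea under toy shapes, not the paper's implementation; the choice of which positions to sever is purely illustrative.

```python
import numpy as np

def attention(q, k, v, knockout=None):
    """Single-head attention. `knockout` is a boolean [seq, seq] matrix
    marking edges (query i -> key j) to sever before the softmax."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                  # [seq, seq]
    if knockout is not None:
        scores = np.where(knockout, -1e9, scores)  # cut the edge pre-softmax
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                  # rows renormalize to 1
    return w @ v, w

rng = np.random.default_rng(0)
seq, d = 6, 8
q, k, v = (rng.standard_normal((seq, d)) for _ in range(3))

# Illustrative knockout: sever attention from the last (answer) position
# to the first four positions (standing in for video tokens).
mask = np.zeros((seq, seq), dtype=bool)
mask[-1, :4] = True
out, w = attention(q, k, v, knockout=mask)
print(np.allclose(w[-1, :4], 0.0))  # True: severed edges carry no weight
```

In the actual analysis, the mask would be applied inside each transformer layer's attention, and the drop (or lack of drop) in answer accuracy measures how much the severed edges mattered.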
## Citation

If you find our paper useful in your research, please consider citing:

```bibtex
@inproceedings{kim2026map,
  author    = {Kim, Minji and Kim, Taekyung and Han, Bohyung},
  title     = {Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026},
}

@article{kim2025map,
  author  = {Kim, Minji and Kim, Taekyung and Han, Bohyung},
  title   = {Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs},
  journal = {arXiv preprint arXiv:2510.13251},
  year    = {2025},
}
```