---
language:
- en
license: apache-2.0
tags:
- video-understanding
- sparse-attention
- vision-language
- qwen2.5-vl
- multimodal
pipeline_tag: video-text-to-text
---
|
|
|
|
|
# VideoNSA: Native Sparse Attention for Video Understanding |
|
|
|
|
|
<div align="center"> |
|
|
<img src="VideoNSA.png" alt="VideoNSA Overview" width="100%"> |
|
|
</div> |
|
|
|
|
|
## Model Description |
|
|
|
|
|
VideoNSA is a learnable, hardware-aware sparse-attention framework built on Qwen2.5-VL-7B for efficient long video understanding. It processes up to **128K vision-text tokens** using only **3.6%** of the full attention budget while maintaining competitive performance. |
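For a rough sense of scale, here is a back-of-the-envelope sketch of what that 3.6% budget means at the 128K-token limit (illustrative arithmetic only, assuming dense attention scores all N² query-key pairs):

```python
# Illustrative only: dense attention cost vs. a ~3.6% sparse budget.
N = 128 * 1024                            # vision-text tokens in context
dense_pairs = N * N                       # query-key pairs scored by full attention
sparse_pairs = int(0.036 * dense_pairs)   # VideoNSA's reported budget

print(f"dense attention pairs: {dense_pairs:,}")   # ~17.2 billion
print(f"~3.6% sparse budget:   {sparse_pairs:,}")  # ~0.6 billion
```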
|
|
|
|
|
### Key Features |
|
|
|
|
|
- 🎯 **Learned Sparsity**: Learns data-dependent sparsity patterns over video tokens end-to-end
- 🚀 **Efficient Scaling**: Handles vision-text contexts of up to 128K tokens using only ~3.6% of the full attention budget
- 🎬 **Hybrid Attention**: Combines compression, selection, and sliding-window mechanisms
- 🔧 **Hardware-Aware**: Optimized for efficient inference on modern GPUs
- 📊 **Strong Performance**: Competitive results on video understanding benchmarks
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
VideoNSA employs a hybrid attention strategy with three complementary branches: |
|
|
|
|
|
1. **Compression Branch**: Averages frame-level KV blocks into compact summaries, preserving salient visual cues
|
|
2. **Selection Branch**: Ranks and retains the most informative video segments |
|
|
3. **Sliding Window Branch**: Ensures local temporal coverage for fine-grained details |
|
|
|
|
|
Each branch output is weighted by a learnable per-head gate, enabling adaptive token allocation across tasks.
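To make the gating concrete, here is a minimal PyTorch sketch of how three branch outputs can be merged with learnable per-head gates. The module name, the sigmoid parameterization, and the tensor layout are illustrative assumptions, not the exact VideoNSA implementation:

```python
import torch
import torch.nn as nn


class GatedBranchMerge(nn.Module):
    """Sketch: merge three attention branches with learnable per-head gates."""

    def __init__(self, num_heads: int):
        super().__init__()
        # One learnable logit per (branch, head); sigmoid keeps each gate in (0, 1).
        self.gate_logits = nn.Parameter(torch.zeros(3, num_heads))

    def forward(self, cmp_out, sel_out, win_out):
        # Each branch output has shape (batch, num_heads, seq_len, head_dim).
        gates = torch.sigmoid(self.gate_logits).view(3, 1, -1, 1, 1)
        branches = torch.stack([cmp_out, sel_out, win_out])  # (3, B, H, S, D)
        return (gates * branches).sum(dim=0)  # per-head gated sum
```

With zero-initialized logits every gate starts at 0.5, and training can then shift each head's budget between compression, selection, and the sliding window.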
|
|
|
|
|
## Training Details |
|
|
|
|
|
- **Base Model**: Qwen2.5-VL-7B-Instruct |
|
|
- **Dataset**: Filtered LLaVA-Video-178K |
|
|
- **Sampling Rate**: 4 fps |
|
|
- **Context Limit**: 36K tokens during training |
|
|
- **Compute**: ~4,600 H100 GPU hours |
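As a rough illustration of how the 4 fps sampling rate interacts with the 36K-token training cap, here is a hypothetical helper. The `tokens_per_frame` and `text_budget` values are placeholder assumptions (the real numbers depend on resolution and prompt length), not figures from the paper:

```python
def max_video_seconds(context_limit: int = 36_000,  # training context cap
                      fps: float = 4.0,             # training sampling rate
                      tokens_per_frame: int = 128,  # assumption: varies with resolution
                      text_budget: int = 1_000):    # assumption: prompt/answer tokens
    """Hypothetical: seconds of video that fit in one training sample."""
    frames = (context_limit - text_budget) // tokens_per_frame
    return frames / fps

print(f"~{max_video_seconds():.0f} s of video per sample")  # ~68 s under these assumptions
```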
|
|
|
|
|
## Usage |
|
|
|
|
|
For installation, training, and evaluation instructions, please refer to: |
|
|
- 💻 [GitHub Repository](https://github.com/Espere-1119-Song/VideoNSA)
- 🌐 [Project Page](https://enxinsong.com/VideoNSA-web/)
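For quick reference, here is a minimal inference sketch following the standard Qwen2.5-VL recipe with Hugging Face Transformers and `qwen-vl-utils`. The model path is a placeholder, and VideoNSA's sparse-attention kernels may need the environment setup from the GitHub repository rather than this vanilla recipe:

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "path/to/VideoNSA-checkpoint"  # placeholder; see the GitHub repo

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/video.mp4", "fps": 4.0},
        {"type": "text", "text": "Describe what happens in this video."},
    ],
}]

# Build the chat prompt and extract the video frames.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Generate and decode only the newly produced tokens.
generated = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    generated[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```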
|
|
|
|
|
## Limitations |
|
|
|
|
|
- Tuned for video understanding tasks; may be less effective on pure image tasks
|
|
- Requires sufficient GPU memory for long video processing |
|
|
- Performance may vary with different video resolutions and frame rates |
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex
@misc{song2025videonsanativesparseattention,
      title={VideoNSA: Native Sparse Attention Scales Video Understanding},
      author={Enxin Song and Wenhao Chai and Shusheng Yang and Ethan Armand and Xiaojun Shan and Haiyang Xu and Jianwen Xie and Zhuowen Tu},
      year={2025},
      eprint={2510.02295},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.02295},
}
```
|
|
|
|
|
## Resources |
|
|
|
|
|
- 📄 [Paper](https://arxiv.org/abs/2510.02295)
- 🌐 [Project Page](https://enxinsong.com/VideoNSA-web/)
- 💻 [GitHub Repository](https://github.com/Espere-1119-Song/VideoNSA)
|
|
|
|
|
## License |
|
|
|
|
|
This model is released under the Apache 2.0 License. |
|
|
|