---
language:
- en
license: apache-2.0
tags:
- video-understanding
- sparse-attention
- vision-language
- qwen2.5-vl
- multimodal
pipeline_tag: video-text-to-text
---

# VideoNSA: Native Sparse Attention for Video Understanding
*Figure: VideoNSA overview.*
## Model Description

VideoNSA is a learnable, hardware-aware sparse-attention framework built on Qwen2.5-VL-7B for efficient long-video understanding. It processes up to **128K vision-text tokens** using only **3.6%** of the full attention budget while maintaining competitive performance.

### Key Features

- 🎯 **Learned Sparsity**: Intelligently learns sparsity patterns over video tokens
- 🚀 **Efficient Scaling**: Handles massive video contexts with minimal computational overhead
- 🎬 **Hybrid Attention**: Combines compression, selection, and sliding-window mechanisms
- 🔧 **Hardware-Aware**: Optimized for efficient inference on modern GPUs
- 📊 **Strong Performance**: Achieves leading results on video understanding benchmarks

## Model Architecture

VideoNSA employs a hybrid attention strategy with three complementary branches:

1. **Compression Branch**: Averages frame KV blocks to maintain salient visual cues
2. **Selection Branch**: Ranks and retains the most informative video segments
3. **Sliding Window Branch**: Ensures local temporal coverage for fine-grained details

Each branch is weighted by learnable per-head gates for adaptive token allocation across different tasks (a hedged code sketch of this gating scheme is given in the appendix below).

## Training Details

- **Base Model**: Qwen2.5-VL-7B-Instruct
- **Dataset**: Filtered LLaVA-Video-178K
- **Sampling Rate**: 4 fps
- **Context Limit**: 36K tokens during training
- **Compute**: ~4,600 H100 GPU hours

## Usage

For installation, training, and evaluation instructions, please refer to:

- 💻 [GitHub Repository](https://github.com/Espere-1119-Song/VideoNSA)
- 🌐 [Project Page](https://enxinsong.com/VideoNSA-web/)

An unofficial inference sketch also appears in the appendix below.

## Limitations

- Optimized for video understanding tasks; may not be optimal for pure image tasks
- Requires sufficient GPU memory for long-video processing
- Performance may vary with different video resolutions and frame rates

## Citation

```bibtex
@misc{song2025videonsanativesparseattention,
  title={VideoNSA: Native Sparse Attention Scales Video Understanding},
  author={Enxin Song and Wenhao Chai and Shusheng Yang and Ethan Armand and Xiaojun Shan and Haiyang Xu and Jianwen Xie and Zhuowen Tu},
  year={2025},
  eprint={2510.02295},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.02295},
}
```

## Resources

- 📄 [Paper](https://arxiv.org/abs/2510.02295)
- 🌐 [Project Page](https://enxinsong.com/VideoNSA-web/)
- 💻 [GitHub Repository](https://github.com/Espere-1119-Song/VideoNSA)

## License

This model is released under the Apache 2.0 License.
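## Appendix: Illustrative Code Sketches

The short PyTorch sketch below is meant only to build intuition for the hybrid attention described above: a single attention head whose output blends a compression branch, a selection branch, and a sliding-window branch through per-head gate weights. Everything here — function names, block sizes, the bidirectional window, and the fixed gate parameters — is an assumption for illustration, not the VideoNSA implementation; the actual hardware-aware kernels are in the GitHub repository.

```python
import torch
import torch.nn.functional as F

def attn(q, k, v):
    """Plain scaled dot-product attention for one head: q is (T, d), k/v are (S, d)."""
    scores = q @ k.T / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

def compression_branch(q, k, v, block=16):
    """Mean-pool keys/values into coarse blocks, then attend over the summaries."""
    T, d = k.shape
    nb = T // block
    kc = k[: nb * block].reshape(nb, block, d).mean(1)
    vc = v[: nb * block].reshape(nb, block, d).mean(1)
    return attn(q, kc, vc)

def selection_branch(q, k, v, block=16, top_n=4):
    """Score KV blocks with pooled queries/keys; attend densely to the top-n blocks."""
    T, d = k.shape
    nb = T // block
    kb = k[: nb * block].reshape(nb, block, d)
    vb = v[: nb * block].reshape(nb, block, d)
    block_scores = q.mean(0) @ kb.mean(1).T / d ** 0.5  # one score per block
    keep = block_scores.topk(min(top_n, nb)).indices    # indices of retained blocks
    return attn(q, kb[keep].reshape(-1, d), vb[keep].reshape(-1, d))

def sliding_window_branch(q, k, v, window=64):
    """Local attention: each position sees only keys within `window` steps.
    (Simplified to a bidirectional window; a decoder LM would use a causal one.)"""
    T = q.shape[0]
    idx = torch.arange(T)
    mask = (idx[:, None] - idx[None, :]).abs() > window
    scores = (q @ k.T / q.shape[-1] ** 0.5).masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

def hybrid_sparse_attention(q, k, v, gates):
    """Blend the three branches with gate weights (learned per head in the real model)."""
    g = torch.sigmoid(gates)
    return (g[0] * compression_branch(q, k, v)
            + g[1] * selection_branch(q, k, v)
            + g[2] * sliding_window_branch(q, k, v))

# Toy run: one head, 256 tokens, head dimension 64.
T, d = 256, 64
q, k, v = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)
out = hybrid_sparse_attention(q, k, v, gates=torch.zeros(3))
print(out.shape)  # torch.Size([256, 64])
```

Note the budget intuition: with 256 tokens, the selection branch here touches only 4 blocks of 16 keys and the compression branch only 16 block summaries, so most query-key pairs are never computed.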
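Since VideoNSA builds on Qwen2.5-VL-7B-Instruct, inference plausibly follows the standard Qwen2.5-VL `transformers` pipeline sketched below. This is an assumption, not the official recipe: the checkpoint id is a placeholder, and the model may require the custom sparse-attention kernels from the repository, so defer to the GitHub instructions for actual use.

```python
# Hypothetical inference sketch using the stock Qwen2.5-VL pipeline.
import torch
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "path/to/VideoNSA"  # placeholder: use this repository's model id

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/video.mp4", "fps": 4.0},
        {"type": "text", "text": "Describe what happens in this video."},
    ],
}]

# Build the chat prompt and extract the video frames.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Generate and decode only the newly produced tokens.
output_ids = model.generate(**inputs, max_new_tokens=256)
new_tokens = output_ids[:, inputs.input_ids.shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```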