---
language:
- en
license: apache-2.0
tags:
- video-understanding
- sparse-attention
- vision-language
- qwen2.5-vl
- multimodal
pipeline_tag: video-text-to-text
---
# VideoNSA: Native Sparse Attention for Video Understanding
<div align="center">
<img src="VideoNSA.png" alt="VideoNSA Overview" width="100%">
</div>
## Model Description
VideoNSA is a learnable, hardware-aware sparse-attention framework built on Qwen2.5-VL-7B for efficient long video understanding. It processes up to **128K vision-text tokens** using only **3.6%** of the full attention budget while maintaining competitive performance.
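To put the 3.6% budget in perspective, a back-of-the-envelope comparison of attention costs (the quadratic-pair counting is a simplification for illustration; it ignores the per-branch structure described below):

```python
# Back-of-the-envelope attention-cost comparison (illustrative only).
# Full self-attention over N tokens scores N*N query-key pairs; a sparse
# budget of 3.6% touches roughly that fraction of pairs.

n_tokens = 128_000          # 128K vision-text context (from the model card)
budget = 0.036              # fraction of the full attention budget used

full_pairs = n_tokens ** 2
sparse_pairs = int(round(full_pairs * budget))

print(f"full attention pairs:   {full_pairs:,}")
print(f"sparse attention pairs: {sparse_pairs:,}")
print(f"reduction factor:       {full_pairs / sparse_pairs:.1f}x")
```

At this context length, the sparse budget corresponds to roughly a 28x reduction in scored query-key pairs.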
### Key Features
- **Learned Sparsity**: learns attention sparsity patterns over video tokens
- **Efficient Scaling**: handles contexts up to 128K vision-text tokens with minimal computational overhead
- **Hybrid Attention**: combines compression, selection, and sliding-window mechanisms
- **Hardware-Aware**: optimized for efficient inference on modern GPUs
- **Strong Performance**: competitive results on video understanding benchmarks
## Model Architecture
VideoNSA employs a hybrid attention strategy with three complementary branches:
1. **Compression Branch**: averages blocks of frame key-value (KV) tokens, preserving salient visual cues at reduced cost
2. **Selection Branch**: ranks video segments and retains the most informative ones
3. **Sliding-Window Branch**: ensures local temporal coverage for fine-grained details

Each branch is weighted by a learnable per-head gate, allowing adaptive token allocation across tasks.
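The per-head gating over the three branches can be sketched as follows. The array shapes, the softmax normalization over branches, and all values are illustrative assumptions, not VideoNSA's actual parameterization; the random arrays stand in for the real branch attention outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
n_branches, n_heads, seq_len, d_head = 3, 4, 8, 16

# Stand-ins for the per-branch attention outputs, shape (branch, head, seq, dim).
# In VideoNSA these would come from the compression, selection, and
# sliding-window branches; here they are random for illustration.
branch_out = rng.normal(size=(n_branches, n_heads, seq_len, d_head))

# Learnable gate logits, one scalar per (branch, head) pair.
gate_logits = rng.normal(size=(n_branches, n_heads))

# Normalize over branches so each head's three gates sum to 1
# (softmax is an assumption; independent sigmoid gates are also common).
gates = np.exp(gate_logits) / np.exp(gate_logits).sum(axis=0, keepdims=True)

# Per-head weighted sum of the three branch outputs.
output = np.einsum("bh,bhsd->hsd", gates, branch_out)

assert output.shape == (n_heads, seq_len, d_head)
```

Because the gates are per-head, different heads can specialize, e.g. leaning on the sliding window for local detail while others favor compressed global context.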
## Training Details
- **Base Model**: Qwen2.5-VL-7B-Instruct
- **Dataset**: Filtered LLaVA-Video-178K
- **Sampling Rate**: 4 fps
- **Context Limit**: 36K tokens during training
- **Compute**: ~4,600 H100 GPU hours
## Usage
For installation, training, and evaluation instructions, please refer to:
- [GitHub Repository](https://github.com/Espere-1119-Song/VideoNSA)
- [Project Page](https://enxinsong.com/VideoNSA-web/)
## Limitations
- Optimized for video understanding; may be suboptimal for image-only tasks
- Long video inputs require substantial GPU memory
- Performance may vary with video resolution and frame rate
## Citation
```bibtex
@misc{song2025videonsanativesparseattention,
title={VideoNSA: Native Sparse Attention Scales Video Understanding},
author={Enxin Song and Wenhao Chai and Shusheng Yang and Ethan Armand and Xiaojun Shan and Haiyang Xu and Jianwen Xie and Zhuowen Tu},
year={2025},
eprint={2510.02295},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.02295},
}
```
## Resources
- [Paper](https://arxiv.org/abs/2510.02295)
- [Project Page](https://enxinsong.com/VideoNSA-web/)
- [GitHub Repository](https://github.com/Espere-1119-Song/VideoNSA)
## License
This model is released under the Apache 2.0 License.