---
language:
- en
license: apache-2.0
tags:
- video-understanding
- sparse-attention
- vision-language
- qwen2.5-vl
- multimodal
pipeline_tag: video-text-to-text
---
|
|
|
|
|
# VideoNSA: Native Sparse Attention for Video Understanding |
|
|
|
|
|
<div align="center"> |
|
|
<img src="VideoNSA.png" alt="VideoNSA Overview" width="100%"> |
|
|
</div> |
|
|
|
|
|
## Model Description |
|
|
|
|
|
VideoNSA is a learnable, hardware-aware sparse-attention framework built on Qwen2.5-VL-7B for efficient long video understanding. It processes up to **128K vision-text tokens** using only **3.6%** of the full attention budget while maintaining competitive performance. |
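For a rough sense of scale, here is a back-of-the-envelope sketch of what that 3.6% budget means at the 128K-token limit (illustrative arithmetic only, assuming dense attention scores all N² query-key pairs):

```python
# Illustrative only: dense attention cost vs. a ~3.6% sparse budget.
N = 128 * 1024                            # vision-text tokens in context
dense_pairs = N * N                       # query-key pairs scored by full attention
sparse_pairs = int(0.036 * dense_pairs)   # VideoNSA's reported budget

print(f"dense attention pairs: {dense_pairs:,}")   # ~17.2 billion
print(f"~3.6% sparse budget:   {sparse_pairs:,}")  # ~0.6 billion
```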
|
|
|
|
|
### Key Features |
|
|
|
|
|
- 🎯 **Learned Sparsity**: Learns data-dependent sparsity patterns over video tokens end-to-end
- 🚀 **Efficient Scaling**: Handles vision-text contexts of up to 128K tokens using only ~3.6% of the full attention budget
- 🎬 **Hybrid Attention**: Combines compression, selection, and sliding-window mechanisms
- 🔧 **Hardware-Aware**: Optimized for efficient inference on modern GPUs
- 📊 **Strong Performance**: Competitive results on video understanding benchmarks
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
VideoNSA employs a hybrid attention strategy with three complementary branches: |
|
|
|
|
|
1. **Compression Branch**: Averages frame-level KV blocks into compact summaries, preserving salient visual cues
|
|
2. **Selection Branch**: Ranks and retains the most informative video segments |
|
|
3. **Sliding Window Branch**: Ensures local temporal coverage for fine-grained details |
|
|
|
|
|
Each branch output is weighted by a learnable per-head gate, enabling adaptive token allocation across tasks.
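To make the gating concrete, here is a minimal PyTorch sketch of how three branch outputs can be merged with learnable per-head gates. The module name, the sigmoid parameterization, and the tensor layout are illustrative assumptions, not the exact VideoNSA implementation:

```python
import torch
import torch.nn as nn


class GatedBranchMerge(nn.Module):
    """Sketch: merge three attention branches with learnable per-head gates."""

    def __init__(self, num_heads: int):
        super().__init__()
        # One learnable logit per (branch, head); sigmoid keeps each gate in (0, 1).
        self.gate_logits = nn.Parameter(torch.zeros(3, num_heads))

    def forward(self, cmp_out, sel_out, win_out):
        # Each branch output has shape (batch, num_heads, seq_len, head_dim).
        gates = torch.sigmoid(self.gate_logits).view(3, 1, -1, 1, 1)
        branches = torch.stack([cmp_out, sel_out, win_out])  # (3, B, H, S, D)
        return (gates * branches).sum(dim=0)  # per-head gated sum
```

With zero-initialized logits every gate starts at 0.5, and training can then shift each head's budget between compression, selection, and the sliding window.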
|
|
|
|
|
## Training Details |
|
|
|
|
|
- **Base Model**: Qwen2.5-VL-7B-Instruct |
|
|
- **Dataset**: Filtered LLaVA-Video-178K |
|
|
- **Sampling Rate**: 4 fps |
|
|
- **Context Limit**: 36K tokens during training |
|
|
- **Compute**: ~4,600 H100 GPU hours |
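As a rough illustration of how the 4 fps sampling rate interacts with the 36K-token training cap, here is a hypothetical helper. The `tokens_per_frame` and `text_budget` values are placeholder assumptions (the real numbers depend on resolution and prompt length), not figures from the paper:

```python
def max_video_seconds(context_limit: int = 36_000,  # training context cap
                      fps: float = 4.0,             # training sampling rate
                      tokens_per_frame: int = 128,  # assumption: varies with resolution
                      text_budget: int = 1_000):    # assumption: prompt/answer tokens
    """Hypothetical: seconds of video that fit in one training sample."""
    frames = (context_limit - text_budget) // tokens_per_frame
    return frames / fps

print(f"~{max_video_seconds():.0f} s of video per sample")  # ~68 s under these assumptions
```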
|
|
|
|
|
## Usage |
|
|
|
|
|
For installation, training, and evaluation instructions, please refer to: |
|
|
- 💻 [GitHub Repository](https://github.com/Espere-1119-Song/VideoNSA)
- 🌐 [Project Page](https://enxinsong.com/VideoNSA-web/)
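For quick reference, here is a minimal inference sketch following the standard Qwen2.5-VL recipe with Hugging Face Transformers and `qwen-vl-utils`. The model path is a placeholder, and VideoNSA's sparse-attention kernels may need the environment setup from the GitHub repository rather than this vanilla recipe:

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "path/to/VideoNSA-checkpoint"  # placeholder; see the GitHub repo

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/video.mp4", "fps": 4.0},
        {"type": "text", "text": "Describe what happens in this video."},
    ],
}]

# Build the chat prompt and extract the video frames.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Generate and decode only the newly produced tokens.
generated = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    generated[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```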
|
|
|
|
|
## Limitations |
|
|
|
|
|
- Tuned for video understanding tasks; may be less effective on pure image tasks
|
|
- Requires sufficient GPU memory for long video processing |
|
|
- Performance may vary with different video resolutions and frame rates |
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex
@misc{song2025videonsanativesparseattention,
      title={VideoNSA: Native Sparse Attention Scales Video Understanding},
      author={Enxin Song and Wenhao Chai and Shusheng Yang and Ethan Armand and Xiaojun Shan and Haiyang Xu and Jianwen Xie and Zhuowen Tu},
      year={2025},
      eprint={2510.02295},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.02295},
}
```
|
|
|
|
|
## Resources |
|
|
|
|
|
- 📄 [Paper](https://arxiv.org/abs/2510.02295)
- 🌐 [Project Page](https://enxinsong.com/VideoNSA-web/)
- 💻 [GitHub Repository](https://github.com/Espere-1119-Song/VideoNSA)
|
|
|
|
|
## License |
|
|
|
|
|
This model is released under the Apache 2.0 License. |
|
|
|