---
language:
- en
license: apache-2.0
tags:
- video-understanding
- sparse-attention
- vision-language
- qwen2.5-vl
- multimodal
pipeline_tag: video-text-to-text
---

# VideoNSA: Native Sparse Attention for Video Understanding

<div align="center">
  <img src="VideoNSA.png" alt="VideoNSA Overview" width="100%">
</div>

## Model Description

VideoNSA is a learnable, hardware-aware sparse-attention framework built on Qwen2.5-VL-7B for efficient long video understanding. It processes up to **128K vision-text tokens** using only **3.6%** of the full attention budget while maintaining competitive performance.
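
As a back-of-the-envelope reading of those numbers (illustrative only, assuming the 3.6% figure is the fraction of key-value pairs each query attends to):

```python
# At the 128K-token limit, a 3.6% attention budget corresponds to roughly
# 4.6K attended tokens per query, versus 128K under full attention.
full_context = 128_000
sparse_budget = 0.036
print(round(full_context * sparse_budget))  # 4608
```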

### Key Features

- 🎯 **Learned Sparsity**: Learns data-dependent sparsity patterns over video tokens end-to-end, rather than relying on fixed heuristics
- 🚀 **Efficient Scaling**: Handles contexts up to 128K vision-text tokens at a small fraction of the full attention cost
- 🎬 **Hybrid Attention**: Combines compression, selection, and sliding-window mechanisms
- 🔧 **Hardware-Aware**: Optimized for efficient inference on modern GPUs
- 📊 **Strong Performance**: Competitive results across video understanding benchmarks

## Model Architecture

VideoNSA employs a hybrid attention strategy with three complementary branches:

1. **Compression Branch**: Averages frame KV blocks to maintain salient visual cues
2. **Selection Branch**: Ranks and retains the most informative video segments
3. **Sliding Window Branch**: Ensures local temporal coverage for fine-grained details

Each branch is weighted by learnable per-head gates for adaptive token allocation across different tasks.
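
The gating can be pictured as a sigmoid-weighted sum of the three branch outputs. The sketch below is illustrative only: the names, shapes, and the choice of conditioning the gates on the query states are assumptions following the original NSA design, not VideoNSA's actual implementation.

```python
import torch
import torch.nn as nn

class GatedBranchMixer(nn.Module):
    """Per-head sigmoid gates blending the three attention branches.
    Branch internals (compression, selection, sliding window) are elided."""

    def __init__(self, hidden_dim: int, num_heads: int):
        super().__init__()
        # One gate per branch per head, conditioned on the query states.
        self.gate_proj = nn.Linear(hidden_dim, 3 * num_heads)

    def forward(self, query_states, cmp_out, sel_out, win_out):
        # query_states: (B, S, hidden); branch outputs: (B, S, heads, head_dim)
        gates = torch.sigmoid(self.gate_proj(query_states))  # (B, S, 3*heads)
        g_cmp, g_sel, g_win = gates.chunk(3, dim=-1)         # each (B, S, heads)
        return (g_cmp.unsqueeze(-1) * cmp_out
                + g_sel.unsqueeze(-1) * sel_out
                + g_win.unsqueeze(-1) * win_out)

# Shape check with toy tensors.
B, S, H, Dh = 2, 16, 4, 32
mixer = GatedBranchMixer(hidden_dim=H * Dh, num_heads=H)
q = torch.randn(B, S, H * Dh)
branches = [torch.randn(B, S, H, Dh) for _ in range(3)]
print(mixer(q, *branches).shape)  # torch.Size([2, 16, 4, 32])
```

Because the gates are per head, different heads can specialize, e.g. some leaning on the sliding window for local detail while others favor the compressed global view.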

## Training Details

- **Base Model**: Qwen2.5-VL-7B-Instruct
- **Dataset**: Filtered LLaVA-Video-178K
- **Sampling Rate**: 4 fps
- **Context Limit**: 36K tokens during training
- **Compute**: ~4,600 H100 GPU hours

## Usage

For installation, training, and evaluation instructions, please refer to:
- 💻 [GitHub Repository](https://github.com/Espere-1119-Song/VideoNSA)
- 🌐 [Project Page](https://enxinsong.com/VideoNSA-web/)
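
As a rough starting point, the snippet below assumes VideoNSA loads through the standard Qwen2.5-VL interface in `transformers` (with `qwen-vl-utils` for video preprocessing); the checkpoint ID is a placeholder, and the GitHub repository remains the authoritative reference.

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

MODEL_ID = "path/to/VideoNSA"  # placeholder; see the GitHub repo for the actual checkpoint

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/video.mp4", "fps": 4.0},
        {"type": "text", "text": "Describe what happens in this video."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

out_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, out_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```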

## Limitations

- Designed for video understanding tasks; may not transfer as well to pure image tasks
- Requires sufficient GPU memory for long video processing
- Performance may vary with different video resolutions and frame rates

## Citation

```bibtex
@misc{song2025videonsanativesparseattention,
      title={VideoNSA: Native Sparse Attention Scales Video Understanding}, 
      author={Enxin Song and Wenhao Chai and Shusheng Yang and Ethan Armand and Xiaojun Shan and Haiyang Xu and Jianwen Xie and Zhuowen Tu},
      year={2025},
      eprint={2510.02295},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.02295}, 
}
```

## Resources

- 📄 [Paper](https://arxiv.org/abs/2510.02295)
- 🌐 [Project Page](https://enxinsong.com/VideoNSA-web/)
- 💻 [GitHub Repository](https://github.com/Espere-1119-Song/VideoNSA)

## License

This model is released under the Apache 2.0 License.