---
language:
- en
license: apache-2.0
tags:
- video-understanding
- sparse-attention
- vision-language
- qwen2.5-vl
- multimodal
pipeline_tag: video-text-to-text
---

# VideoNSA: Native Sparse Attention for Video Understanding

<div align="center">
<img src="https://enxinsong.com/VideoNSA-web/assets/teaser.png" alt="VideoNSA Overview" width="100%">
</div>

## Model Description

VideoNSA is a learnable, hardware-aware sparse-attention framework built on Qwen2.5-VL-7B for efficient long video understanding. It processes up to **128K vision-text tokens** using only **3.6%** of the full attention budget while maintaining competitive performance.
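
To put the **3.6%** figure in perspective, here is a back-of-the-envelope comparison. It is a rough sketch assuming dense attention scales quadratically in sequence length; the paper defines the exact budget accounting:

```python
# Rough attention-budget comparison (illustrative only; the paper defines
# the exact budget accounting).
seq_len = 128 * 1024                # 128K vision-text tokens
dense_pairs = seq_len ** 2          # query-key pairs under full attention
sparse_pairs = dense_pairs * 0.036  # VideoNSA's reported 3.6% budget

print(f"full attention:  {dense_pairs:.2e} query-key pairs")
print(f"VideoNSA budget: {sparse_pairs:.2e} query-key pairs (~28x fewer)")
```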

### Key Features

- 🎯 **Learned Sparsity**: Learns sparsity patterns over video tokens
- πŸš€ **Efficient Scaling**: Handles massive video contexts with minimal computational overhead
- 🎬 **Hybrid Attention**: Combines compression, selection, and sliding-window mechanisms
- πŸ”§ **Hardware-Aware**: Optimized for efficient inference on modern GPUs
- πŸ“Š **Strong Performance**: Achieves leading results on video understanding benchmarks

## Model Architecture

VideoNSA employs a hybrid attention strategy with three complementary branches:

1. **Compression Branch**: Averages frame KV blocks to preserve salient visual cues
2. **Selection Branch**: Ranks and retains the most informative video segments
3. **Sliding Window Branch**: Ensures local temporal coverage for fine-grained details

Each branch is weighted by learnable per-head gates, enabling adaptive token allocation across different tasks.
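
The per-head gating can be pictured with the minimal sketch below. It is illustrative only, not the released implementation: the branch internals are stubbed with dense attention so the example runs, and the softmax gate parameterization is an assumption (the actual hardware-aware kernels and gating function are described in the paper and repository).

```python
# Minimal sketch of per-head gating over three attention branches.
# ASSUMPTIONS: branch internals are stubbed with dense attention so the
# sketch runs, and the softmax gate parameterization is illustrative;
# the released kernels and gating function may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

def _branch_stub(q, k, v):
    # Stand-in for a real branch (compression / selection / sliding window);
    # dense attention is used here only so the example executes.
    return F.scaled_dot_product_attention(q, k, v)

class GatedBranchMixer(nn.Module):
    def __init__(self, num_heads: int):
        super().__init__()
        # One learnable gate logit per head per branch.
        self.gate_logits = nn.Parameter(torch.zeros(num_heads, 3))

    def forward(self, q, k, v):
        # q, k, v: (batch, heads, seq_len, head_dim)
        outs = torch.stack([_branch_stub(q, k, v) for _ in range(3)], dim=-1)
        gates = torch.softmax(self.gate_logits, dim=-1)  # (heads, 3)
        return (outs * gates.view(1, -1, 1, 1, 3)).sum(dim=-1)

mixer = GatedBranchMixer(num_heads=4)
q = k = v = torch.randn(1, 4, 16, 32)
print(mixer(q, k, v).shape)  # torch.Size([1, 4, 16, 32])
```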

## Training Details

- **Base Model**: Qwen2.5-VL-7B-Instruct
- **Dataset**: Filtered LLaVA-Video-178K
- **Sampling Rate**: 4 fps
- **Context Limit**: 36K tokens during training
- **Compute**: ~4,600 H100 GPU hours
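
For intuition, a rough estimate of how much video the 36K training context holds. The per-frame visual token count below is a hypothetical figure; in Qwen2.5-VL it varies with input resolution:

```python
# Rough capacity estimate for the 36K training context.
# ASSUMPTION: tokens_per_frame is a hypothetical figure; Qwen2.5-VL's
# visual token count varies with input resolution.
context_limit = 36_000
tokens_per_frame = 144  # hypothetical
fps = 4

max_frames = context_limit // tokens_per_frame
print(f"~{max_frames} frames β‰ˆ {max_frames / fps:.0f}s of video at {fps} fps")
```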

## Usage

For installation, training, and evaluation instructions, please refer to:
- πŸ’» [GitHub Repository](https://github.com/Espere-1119-Song/VideoNSA)
- 🌐 [Project Page](https://enxinsong.com/VideoNSA-web/)
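
In the meantime, a minimal inference sketch, assuming the checkpoint loads through the standard Qwen2.5-VL classes in πŸ€— Transformers plus `qwen-vl-utils` for video preprocessing; the model id below is a placeholder, and the repository remains the authoritative setup guide:

```python
# Minimal inference sketch. ASSUMPTIONS: the checkpoint exposes the standard
# Qwen2.5-VL interface in transformers, and the model id is a placeholder --
# follow the GitHub repository for the supported setup.
import torch
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "path/to/VideoNSA-checkpoint"  # placeholder
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/video.mp4", "fps": 4.0},
        {"type": "text", "text": "Describe what happens in this video."},
    ],
}]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
_, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], videos=video_inputs, padding=True, return_tensors="pt"
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=256)
answer = processor.batch_decode(
    generated[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```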

## Limitations

- Optimized for video understanding tasks; may not be optimal for pure image tasks
- Requires sufficient GPU memory for long video processing
- Performance may vary with different video resolutions and frame rates

## Citation

```bibtex
@misc{chai2025auroracapefficientperformantvideo,
  title={AuroraCap: Efficient, Performant Video Detailed Captioning},
  author={Wenhao Chai et al.},
  year={2025}
}
```

## Resources

- πŸ“„ [Paper](https://arxiv.org/abs/TODO)
- 🌐 [Project Page](https://enxinsong.com/VideoNSA-web/)
- πŸ’» [GitHub Repository](https://github.com/Espere-1119-Song/VideoNSA)

## License

This model is released under the Apache 2.0 License.