---
language:
- en
license: apache-2.0
tags:
- video-understanding
- sparse-attention
- vision-language
- qwen2.5-vl
- multimodal
pipeline_tag: video-text-to-text
---

# VideoNSA: Native Sparse Attention for Video Understanding

<div align="center">
<img src="https://enxinsong.com/VideoNSA-web/assets/teaser.png" alt="VideoNSA Overview" width="100%">
</div>

## Model Description

VideoNSA is a learnable, hardware-aware sparse-attention framework built on Qwen2.5-VL-7B for efficient long video understanding. It processes up to **128K vision-text tokens** using only **3.6%** of the full attention budget while maintaining competitive performance.
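
To put the **3.6%** figure in perspective, here is a back-of-the-envelope comparison. It is a rough sketch assuming dense attention scales quadratically in sequence length; the paper defines the exact budget accounting:

```python
# Rough attention-budget comparison (illustrative only; the paper defines
# the exact budget accounting).
seq_len = 128 * 1024                # 128K vision-text tokens
dense_pairs = seq_len ** 2          # query-key pairs under full attention
sparse_pairs = dense_pairs * 0.036  # VideoNSA's reported 3.6% budget

print(f"full attention:  {dense_pairs:.2e} query-key pairs")
print(f"VideoNSA budget: {sparse_pairs:.2e} query-key pairs (~28x fewer)")
```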

### Key Features

- 🎯 **Learned Sparsity**: Learns sparsity patterns over video tokens
- πŸš€ **Efficient Scaling**: Handles massive video contexts with minimal computational overhead
- 🎬 **Hybrid Attention**: Combines compression, selection, and sliding-window mechanisms
- πŸ”§ **Hardware-Aware**: Optimized for efficient inference on modern GPUs
- πŸ“Š **Strong Performance**: Achieves leading results on video understanding benchmarks

## Model Architecture

VideoNSA employs a hybrid attention strategy with three complementary branches:

1. **Compression Branch**: Averages frame KV blocks to preserve salient visual cues
2. **Selection Branch**: Ranks and retains the most informative video segments
3. **Sliding Window Branch**: Ensures local temporal coverage for fine-grained details

Each branch is weighted by learnable per-head gates, enabling adaptive token allocation across different tasks.
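
The per-head gating can be pictured with the minimal sketch below. It is illustrative only, not the released implementation: the branch internals are stubbed with dense attention so the example runs, and the softmax gate parameterization is an assumption (the actual hardware-aware kernels and gating function are described in the paper and repository).

```python
# Minimal sketch of per-head gating over three attention branches.
# ASSUMPTIONS: branch internals are stubbed with dense attention so the
# sketch runs, and the softmax gate parameterization is illustrative;
# the released kernels and gating function may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

def _branch_stub(q, k, v):
    # Stand-in for a real branch (compression / selection / sliding window);
    # dense attention is used here only so the example executes.
    return F.scaled_dot_product_attention(q, k, v)

class GatedBranchMixer(nn.Module):
    def __init__(self, num_heads: int):
        super().__init__()
        # One learnable gate logit per head per branch.
        self.gate_logits = nn.Parameter(torch.zeros(num_heads, 3))

    def forward(self, q, k, v):
        # q, k, v: (batch, heads, seq_len, head_dim)
        outs = torch.stack([_branch_stub(q, k, v) for _ in range(3)], dim=-1)
        gates = torch.softmax(self.gate_logits, dim=-1)  # (heads, 3)
        return (outs * gates.view(1, -1, 1, 1, 3)).sum(dim=-1)

mixer = GatedBranchMixer(num_heads=4)
q = k = v = torch.randn(1, 4, 16, 32)
print(mixer(q, k, v).shape)  # torch.Size([1, 4, 16, 32])
```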

## Training Details

- **Base Model**: Qwen2.5-VL-7B-Instruct
- **Dataset**: Filtered LLaVA-Video-178K
- **Sampling Rate**: 4 fps
- **Context Limit**: 36K tokens during training
- **Compute**: ~4,600 H100 GPU hours
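
For intuition, a rough estimate of how much video the 36K training context holds. The per-frame visual token count below is a hypothetical figure; in Qwen2.5-VL it varies with input resolution:

```python
# Rough capacity estimate for the 36K training context.
# ASSUMPTION: tokens_per_frame is a hypothetical figure; Qwen2.5-VL's
# visual token count varies with input resolution.
context_limit = 36_000
tokens_per_frame = 144  # hypothetical
fps = 4

max_frames = context_limit // tokens_per_frame
print(f"~{max_frames} frames β‰ˆ {max_frames / fps:.0f}s of video at {fps} fps")
```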

## Usage

For installation, training, and evaluation instructions, please refer to:
- πŸ’» [GitHub Repository](https://github.com/Espere-1119-Song/VideoNSA)
- 🌐 [Project Page](https://enxinsong.com/VideoNSA-web/)
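
In the meantime, a minimal inference sketch, assuming the checkpoint loads through the standard Qwen2.5-VL classes in πŸ€— Transformers plus `qwen-vl-utils` for video preprocessing; the model id below is a placeholder, and the repository remains the authoritative setup guide:

```python
# Minimal inference sketch. ASSUMPTIONS: the checkpoint exposes the standard
# Qwen2.5-VL interface in transformers, and the model id is a placeholder --
# follow the GitHub repository for the supported setup.
import torch
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "path/to/VideoNSA-checkpoint"  # placeholder
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/video.mp4", "fps": 4.0},
        {"type": "text", "text": "Describe what happens in this video."},
    ],
}]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
_, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], videos=video_inputs, padding=True, return_tensors="pt"
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=256)
answer = processor.batch_decode(
    generated[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```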

## Limitations

- Optimized for video understanding tasks; may not be optimal for pure image tasks
- Requires sufficient GPU memory for long video processing
- Performance may vary with different video resolutions and frame rates

## Citation

```bibtex
@misc{chai2025auroracapefficientperformantvideo,
  title={AuroraCap: Efficient, Performant Video Detailed Captioning},
  author={Wenhao Chai et al.},
  year={2025}
}
```

## Resources

- πŸ“„ [Paper](https://arxiv.org/abs/TODO)
- 🌐 [Project Page](https://enxinsong.com/VideoNSA-web/)
- πŸ’» [GitHub Repository](https://github.com/Espere-1119-Song/VideoNSA)

## License

This model is released under the Apache 2.0 License.