---
license: apache-2.0
tags:
- pytorch
---

<a id="top"></a>
<div align="center">
  <h1>🚀 ViSAGE @ CVPR-NTIRE Video Saliency Prediction Challenge 2026</h1>

  <p>
    <b>Kun Wang</b><sup>1</sup>&nbsp;
    <b>Yupeng Hu</b><sup>1</sup>&nbsp;
    <b>Zhiran Li</b><sup>1</sup>&nbsp;
    <b>Hao Liu</b><sup>1</sup>&nbsp;
    <b>Qianlong Xiang</b><sup>2,3,4</sup>&nbsp;
    <b>Liqiang Nie</b><sup>2</sup>
  </p>

  <p>
    <sup>1</sup>Shandong University<br>
    <sup>2</sup>Harbin Institute of Technology<br>
    <sup>3</sup>City University of Hong Kong<br>
    <sup>4</sup>Shenzhen Loop Area Institute
  </p>
</div>

This repository provides the official implementation, pre-trained model weights, and configuration files for **ViSAGE**, our entry in the NTIRE 2026 Challenge on Video Saliency Prediction (CVPRW 2026).

🔗 **Paper:** [Accepted by CVPRW 2026](https://arxiv.org)
🔗 **GitHub Repository:** [iLearn-Lab/CVPRW26-ViSAGE](https://github.com/iLearn-Lab/CVPRW26-ViSAGE.git)
🔗 **Challenge Page:** [NTIRE 2026 VSP Challenge](https://www.codabench.org/competitions/12842/)

---

<p align="center">
  <video src="https://github.com/user-attachments/assets/a2dbabc0-9d8e-4f7a-8b16-c2d56af7b071" controls width="95%"></video>
</p>

---

## 📌 Model Information

### 1. Model Name
**ViSAGE (Video Saliency with Adaptive Gated Experts)**

### 2. Task Type & Applicable Tasks
- **Task Type:** Video Saliency Prediction (VSP) / Computer Vision
- **Applicable Tasks:** Robust and adaptive prediction of human visual attention (saliency maps) in dynamic video sequences.

### 3. Project Introduction
Video Saliency Prediction requires capturing complex spatio-temporal dynamics and human visual priors. **ViSAGE** tackles this by leveraging a powerful multi-expert ensemble framework.

> 💡 **Method Highlight:** The framework consists of a shared **InternVideo2 backbone**, adapted via two-stage LoRA fine-tuning, and two specialized experts: one using Temporal Modulation (for explicit spatial priors) and one using Multi-Scale Fusion (for adaptive, data-driven perception). The **Ensemble Fusion Module** obtains the final prediction by converting the expert outputs to logit space before averaging, which yields more accurate estimates than directly averaging the saliency maps.

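The logit-space averaging described above can be illustrated with a minimal NumPy sketch. This is not the repository's `ensemble.py`. The function names, the `eps` clipping value, and the assumption that each expert outputs a per-pixel saliency map with values in [0, 1] are all hypothetical and chosen only for illustration.

```python
import numpy as np


def logit(p, eps=1e-6):
    """Map probabilities to logits; clip first so 0 and 1 stay finite."""
    p = np.clip(p, eps, 1.0 - eps)
    return np.log(p) - np.log1p(-p)


def sigmoid(x):
    """Map logits back to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def ensemble_logit_space(maps, weights=None, eps=1e-6):
    """Fuse saliency maps by averaging them in logit space.

    maps:    sequence of equally shaped arrays with values in [0, 1]
             (assumed format of the expert outputs).
    weights: optional per-expert weights; None means a plain mean.
    """
    stacked = np.stack([np.asarray(m, dtype=np.float64) for m in maps])
    fused_logits = np.average(logit(stacked, eps), axis=0, weights=weights)
    return sigmoid(fused_logits)


if __name__ == "__main__":
    # Two toy 1x2 "saliency maps" standing in for the expert outputs.
    expert1 = np.array([[0.99, 0.50]])
    expert2 = np.array([[0.50, 0.50]])
    print("logit-space mean:", ensemble_logit_space([expert1, expert2]))
    print("arithmetic mean: ", (expert1 + expert2) / 2)
```

Under these assumptions, a confident prediction from one expert pulls the fused value further from 0.5 than a plain arithmetic mean would, which is one plausible reason to average in logit space.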
### 4. Training Data Source
- Dataset provided by the **NTIRE 2026 Video Saliency Prediction Challenge** (Private Test and Validation sets).

---

## 🚀 Usage & Basic Inference

### Step 1: Prepare the Environment
Clone the GitHub repository and set up the Conda environment:
```bash
git clone https://github.com/iLearn-Lab/CVPRW26-ViSAGE.git
cd CVPRW26-ViSAGE
```
```bash
conda create -n visage python=3.10 -y
conda activate visage
pip install -r requirements.txt
```

### Step 2: Data & Pre-trained Weights Preparation
1. **Challenge Data:** Use the provided script to extract frames from the source videos. The extracted frames are saved automatically to `derived_fullfps`.
   *(⚠️ **Important:** Do not rename the output directory `derived_fullfps` unless you also update the path settings in all inference scripts.)*
   ```bash
   python video_to_frames.py
   ```
2. **ViSAGE Checkpoints:** Download our [model checkpoints](https://huggingface.co/iLearn-Lab/CVPRW26-ViSAGE).
3. **InternVideo2 Backbone:** Download the pre-trained `InternVideo2-Stage2_6B-224p-f4` model from [Hugging Face](https://huggingface.co/OpenGVLab/InternVideo2-Stage2_6B-224p-f4) and clone the `InternVideo` repo:
   ```bash
   git clone https://github.com/OpenGVLab/InternVideo.git
   ```
   *(Update the pre-trained weight paths in `Expert1/inference.py` and `Expert2/inference.py` to match your local directory.)*

### Step 3: Run Inference & Ensemble

**1. Inference:** Generate predictions for both experts.
```bash
python Expert1/inference.py
python Expert2/inference.py
```
**2. Ensemble:** Merge the inference results from Expert 1 and Expert 2 in logit space.
```bash
python ensemble.py
```
**3. Format Check & Video Generation:** Validate your submission format and render the predicted saliency outputs onto the source video frames.
```bash
python check.py
python makevideos.py
```

### Step 4: Training (Optional)
If you wish to train the model from scratch, run the two-stage LoRA fine-tuning pipeline:
```bash
python trainnew.py   # Stage 1
python trainnew2.py  # Stage 2
```

---

## ⚠️ Limitations & Notes

**Disclaimer:** This framework and its pre-trained weights are intended for **academic research purposes only**.
- The model relies heavily on the InternVideo2 backbone; out-of-memory (OOM) errors may occur on GPUs with less than 24 GB of VRAM.
- Inference speed and performance may fluctuate depending on the hardware utilized.

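Before running inference, it may help to check how much memory the GPU reports. The sketch below is not part of the repository. The helper names are hypothetical, and reading "24GB" as 24 × 10⁹ bytes is an assumption, chosen because cards marketed as 24 GB typically report slightly less than 24 GiB of usable memory. `check_first_gpu()` returns `None` when PyTorch or a CUDA device is unavailable.

```python
# Threshold taken from the note above; interpreting "24GB" as decimal
# gigabytes is an assumption made for this sketch.
MIN_VRAM_BYTES = 24 * 10**9


def has_enough_vram(total_bytes, min_bytes=MIN_VRAM_BYTES):
    """Return True if the reported memory meets the assumed threshold."""
    return total_bytes >= min_bytes


def check_first_gpu():
    """Check GPU 0; return None if PyTorch or CUDA is not available."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    total = torch.cuda.get_device_properties(0).total_memory
    return has_enough_vram(total)


if __name__ == "__main__":
    result = check_first_gpu()
    if result is None:
        print("No CUDA device (or PyTorch) found; cannot check VRAM.")
    elif result:
        print("GPU 0 meets the assumed 24 GB threshold.")
    else:
        print("GPU 0 is below the assumed 24 GB threshold; OOM errors are possible.")
```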
---

## 🤝 Acknowledgements & Contact

- **Contact:** If you have any questions or encounter issues, feel free to open an issue or contact the author Kun Wang at `khylon.kun.wang@gmail.com`.

---

## 📝⭐️ Citation

If you find this project useful for your research, please consider citing:


    @inproceedings{ntire26visage,
      title={{ViSAGE @ NTIRE 2026 Challenge on Video Saliency Prediction: Methods and Results}},
      author={Wang, Kun and Hu, Yupeng and Li, Zhiran and Liu, Hao and Xiang, Qianlong and Nie, Liqiang},
      booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
      year={2026}
    }