---
license: apache-2.0
tags:
- pytorch
- video-saliency-prediction
pipeline_tag: other
---

# 🚀 ViSAGE @ CVPR-NTIRE Video Saliency Prediction Challenge 2026

Kun Wang1  Yupeng Hu1  Zhiran Li1  Hao Liu1  Qianlong Xiang2,3,4  Liqiang Nie2

1Shandong University
2Harbin Institute of Technology
3City University of Hong Kong
4Shenzhen Loop Area Institute

This repository contains the official implementation, pre-trained model weights, and configuration files for **ViSAGE**, designed for the NTIRE 2026 Challenge on Video Saliency Prediction (CVPRW 2026).

🔗 **Paper:** [ViSAGE @ NTIRE 2026 Challenge on Video Saliency Prediction](https://huggingface.co/papers/2604.08613)

🔗 **GitHub Repository:** [iLearn-Lab/CVPRW26-ViSAGE](https://github.com/iLearn-Lab/CVPRW26-ViSAGE)

🔗 **Challenge Page:** [NTIRE 2026 VSP Challenge](https://www.codabench.org/competitions/12842/)

---

## 📌 Model Information

### 1. Model Name

**ViSAGE (Video Saliency with Adaptive Gated Experts)**

### 2. Task Type & Applicable Tasks

- **Task Type:** Video Saliency Prediction (VSP) / Computer Vision
- **Applicable Tasks:** Robust and adaptive prediction of human visual attention (saliency maps) in dynamic video sequences.

### 3. Project Introduction

Video Saliency Prediction requires capturing complex spatio-temporal dynamics and human visual priors. **ViSAGE** tackles this with a multi-expert ensemble framework.

> 💡 **Method Highlight:** The framework consists of a shared **InternVideo2 backbone** adapted via two-stage LoRA fine-tuning, alongside two specialized experts that use Temporal Modulation (for explicit spatial priors) and Multi-Scale Fusion (for adaptive data-driven perception). For robust performance, the **Ensemble Fusion Module** obtains the final prediction by converting the expert outputs to logit space before averaging.

### 4. Training Data Source

- Dataset provided by the **NTIRE 2026 Video Saliency Prediction Challenge** (Private Test and Validation sets).

---

## 🚀 Usage & Basic Inference

### Step 1: Prepare the Environment

Clone the GitHub repository and set up the Conda environment:

```bash
git clone https://github.com/iLearn-Lab/CVPRW26-ViSAGE.git
# git names the clone directory after the repository
cd CVPRW26-ViSAGE
conda create -n visage python=3.10 -y
conda activate visage
pip install -r requirements.txt
```

### Step 2: Data & Pre-trained Weights Preparation

1. **Challenge Data:** Use the provided script to extract frames from the source videos.
   ```bash
   python video_to_frames.py
   ```
2. **InternVideo2 Backbone:** Download the pre-trained `InternVideo2-Stage2_6B-224p-f4` model from [Hugging Face](https://huggingface.co/OpenGVLab/InternVideo2-Stage2_6B-224p-f4) and clone the `InternVideo` repo.
3. **Paths:** Update the pre-trained weight paths in `Expert1/inference.py` and `Expert2/inference.py` to match your local directory.

### Step 3: Run Inference & Ensemble

**1. Inference:** Generate predictions for both experts.

```bash
python Expert1/inference.py
python Expert2/inference.py
```

**2. Ensemble:** Merge the inference results from Expert 1 and Expert 2 in logit space.

```bash
python ensemble.py
```

**3. Format Check & Video Generation:**

```bash
python check.py
python makevideos.py
```

### Step 4: Training (Optional)

Run the two-stage LoRA fine-tuning pipeline:

```bash
python trainnew.py   # Stage 1
python trainnew2.py  # Stage 2
```

---

## ⚠️ Limitations & Notes

- The model relies heavily on the InternVideo2 backbone; out-of-memory (OOM) errors may occur on GPUs with less than 24 GB of VRAM.
- This framework and its pre-trained weights are intended for **academic research purposes only**.

---

## 📝⭐️ Citation

If you find this project useful for your research, please consider citing:

```bibtex
@inproceedings{ntire26visage,
  title={{ViSAGE @ NTIRE 2026 Challenge on Video Saliency Prediction: Methods and Results}},
  author={Wang, Kun and Hu, Yupeng and Li, Zhiran and Liu, Hao and Xiang, Qianlong and Nie, Liqiang},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  year={2026}
}
```
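
---

## 🧩 Note: Logit-Space Ensemble Sketch

The Method Highlight and the ensemble step in Step 3 describe averaging the expert outputs in logit space. The snippet below is a minimal sketch of that idea, not the repository's `ensemble.py`, whose details may differ. It assumes each expert outputs a saliency map with values in [0, 1] and that the experts are weighted equally; the function names and the `eps` clipping constant are hypothetical and chosen only for illustration.

```python
import numpy as np


def logit(p, eps=1e-6):
    """Map probabilities to logits; clipping by eps (an assumed constant) avoids log(0)."""
    p = np.clip(p, eps, 1.0 - eps)
    return np.log(p) - np.log1p(-p)


def sigmoid(z):
    """Map logits back to the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))


def ensemble_logit_mean(maps, eps=1e-6):
    """Average same-shaped saliency maps in logit space (equal weights assumed)."""
    if len(maps) == 0:
        raise ValueError("at least one saliency map is required")
    # Convert every expert map to logits and stack along a new leading axis.
    logits = np.stack(
        [logit(np.asarray(m, dtype=np.float64), eps) for m in maps], axis=0
    )
    # Average over the expert axis, then return to the probability range.
    return sigmoid(logits.mean(axis=0))


if __name__ == "__main__":
    # Two random stand-ins for the Expert 1 and Expert 2 predictions.
    rng = np.random.default_rng(0)
    expert1 = rng.random((224, 224))
    expert2 = rng.random((224, 224))
    fused = ensemble_logit_mean([expert1, expert2])
    print(fused.shape, float(fused.min()), float(fused.max()))
```

Averaging logits is equivalent to taking the geometric mean of the experts' odds, so a prediction close to 0 or 1 from one expert pulls the fused value further than it would under a plain arithmetic mean of the maps.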