SparkVSR: Interactive Video Super-Resolution via Sparse Keyframe Propagation

Jiongze Yu1, Xiangbo Gao1, Pooja Verlani2, Akshay Gadde2, Yilin Wang2, Balu Adsumilli2, Zhengzhong Tu†,1

1Texas A&M University    2YouTube, Google
†Corresponding author

   

πŸ“° News

  • 2026.03.17: This repo is released. πŸ”₯πŸ”₯πŸ”₯

Abstract: Video Super-Resolution (VSR) aims to restore high-quality video frames from low-resolution (LR) inputs, yet most existing VSR approaches behave like black boxes at inference time: users cannot reliably correct unexpected artifacts and can only accept whatever the model produces. In this paper, we propose a novel interactive VSR framework, dubbed SparkVSR, that turns sparse keyframes into a simple and expressive control signal. Specifically, users first super-resolve (and optionally edit) a small set of keyframes using any off-the-shelf image super-resolution (ISR) model; SparkVSR then propagates the keyframe priors to the entire video sequence while remaining grounded in the motion of the original LR video. Concretely, we introduce a keyframe-conditioned, two-stage (latent-space then pixel-space) training pipeline that fuses LR video latents with sparsely encoded HR keyframe latents to learn robust cross-space propagation and refine perceptual details. At inference time, SparkVSR supports flexible keyframe selection (manual specification, codec I-frame extraction, or random sampling) and a reference-free guidance mechanism that continuously balances keyframe adherence and blind restoration, ensuring robust performance even when reference keyframes are absent or imperfect. Experiments on multiple VSR benchmarks demonstrate improved temporal consistency and strong restoration quality, surpassing baselines by up to 24.6%, 21.8%, and 5.6% on CLIP-IQA, DOVER, and MUSIQ, respectively, enabling controllable, keyframe-driven video super-resolution. Moreover, SparkVSR serves as a generic interactive, keyframe-conditioned video processing framework: it can be applied out of the box to unseen tasks such as old-film restoration and video style transfer.


Inference Pipeline


Training Pipeline

πŸ”– TODO

  • βœ… Release inference code.
  • βœ… Release pre-trained models.
  • βœ… Release training code.
  • βœ… Release project page.

βš™οΈ Dependencies

  • Python 3.10+
  • PyTorch >= 2.5.0
  • Diffusers
  • Other dependencies (see requirements.txt)
# Clone the github repo and go to the directory
git clone https://github.com/taco-group/SparkVSR
cd SparkVSR

# Create and activate conda environment
conda create -n sparkvsr python=3.10
conda activate sparkvsr

# Install all required dependencies
pip install -r requirements.txt

πŸ“– Contents

  1. Datasets
  2. Models
  3. Training
  4. Inference
  5. Citation
  6. Acknowledgements

πŸ“ Datasets

πŸ—³οΈ Train Datasets

Our model is trained on the same datasets as DOVE: HQ-VSR and DIV2K-HR. All datasets should be placed in the directory datasets/train/.

| Dataset | Type | # Videos / Images | Download |
| :--- | :--- | :--- | :--- |
| HQ-VSR | Video | 2,055 | Google Drive |
| DIV2K-HR | Image | 800 | Official Link |

All datasets should follow this structure:

datasets/
└── train/
    β”œβ”€β”€ HQ-VSR/
    └── DIV2K_train_HR/

πŸ—³οΈ Test Datasets

We use several real-world and synthetic test datasets for evaluation. All datasets follow a consistent directory structure:

| Dataset | Type | # Videos | Average Frames | Download |
| :--- | :--- | :--- | :--- | :--- |
| UDM10 | Synthetic | 10 | 32 | Google Drive |
| SPMCS | Synthetic | 30 | 32 | Google Drive |
| YouHQ40 | Synthetic | 40 | 32 | Google Drive |
| RealVSR | Real-world | 50 | 50 | Google Drive |
| MovieLQ | Old-movie | 10 | 192 | Google Drive |

Make sure the path (datasets/test/) is correct before running inference.

The directory structure is as follows:

datasets/
└── test/
    └── [DatasetName]/
        β”œβ”€β”€ GT/         # Ground Truth: folder of high-quality frames (one per clip)
        β”œβ”€β”€ GT-Video/   # Ground Truth (video version): lossless MKV format
        β”œβ”€β”€ LQ/         # Low-quality Input: folder of degraded frames (one per clip)
        └── LQ-Video/   # Low-Quality Input (video version): lossless MKV format
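
Before running inference, it can help to sanity-check that each test set follows this layout. A small helper along these lines (not part of the repo; the expected subdirectory names are taken from the tree above) would do:

```python
import os

# Subdirectories each test dataset is expected to contain (per the layout above)
EXPECTED_SUBDIRS = ("GT", "GT-Video", "LQ", "LQ-Video")

def check_test_dataset(root: str, name: str) -> list[str]:
    """Return the list of missing subdirectories for <root>/test/<name>."""
    base = os.path.join(root, "test", name)
    return [s for s in EXPECTED_SUBDIRS if not os.path.isdir(os.path.join(base, s))]
```

An empty return value means the dataset matches the expected structure.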

πŸ“Š Dataset Preparation (Path Lists)

Before training or testing, you need to generate .txt files containing the relative paths of all valid video and image files in your dataset directories. These text lists act as the index for the dataloader during training and inference. Run the following commands:

# πŸ”Ή Train dataset
python finetune/scripts/prepare_dataset.py --dir datasets/train/HQ-VSR
python finetune/scripts/prepare_dataset.py --dir datasets/train/DIV2K_train_HR

# πŸ”Ή Testing dataset
python finetune/scripts/prepare_dataset.py --dir datasets/test/UDM10/GT-Video
python finetune/scripts/prepare_dataset.py --dir datasets/test/UDM10/LQ-Video
# (You may need to repeat the above for other test datasets as needed)
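
Conceptually, `prepare_dataset.py` walks a dataset directory and writes the relative paths of valid media files into a `.txt` index that the dataloader reads. A minimal sketch of that idea (the accepted extensions and the output filename here are assumptions, not the script's actual values):

```python
import os

# Media extensions accepted by the dataloader (assumed set, for illustration)
MEDIA_EXTS = {".mp4", ".mkv", ".avi", ".png", ".jpg", ".jpeg"}

def prepare_dataset(dir_path: str, list_name: str = "file_list.txt") -> list[str]:
    """Collect relative paths of media files under dir_path and write them to a .txt index."""
    paths = []
    for root, _, files in os.walk(dir_path):
        for name in files:
            if os.path.splitext(name)[1].lower() in MEDIA_EXTS:
                paths.append(os.path.relpath(os.path.join(root, name), dir_path))
    paths.sort()  # deterministic ordering for the dataloader
    with open(os.path.join(dir_path, list_name), "w") as f:
        f.write("\n".join(paths))
    return paths
```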

πŸ“¦ Models

Our model is built upon the CogVideoX1.5-5B-I2V base model. We provide pretrained weights for SparkVSR at different training stages.

| Model Name | Description | HuggingFace |
| :--- | :--- | :--- |
| CogVideoX1.5-5B-I2V | Base model used for initialization | zai-org/CogVideoX1.5-5B-I2V |
| SparkVSR (Stage-1) | SparkVSR Stage-1 trained weights | JiongzeYu/SparkVSR-S1 |
| SparkVSR (Stage-2) | SparkVSR Stage-2 final weights | JiongzeYu/SparkVSR |

πŸ’‘ Placement of Models:

  • Place the base model (CogVideoX1.5-5B-I2V) into the pretrained_weights/ folder.
  • Place the downloaded SparkVSR weights (Stage-1 and Stage-2) into the checkpoints/ folder.

πŸ”§ Training

Note: Training requires 4Γ—A100 GPUs.

  • πŸ”Ή Stage-1 (Latent-Space): Keyframe-Conditioned Adaptation. Enter the finetune/ directory and start training:

    cd finetune/
    bash sparkvsr_train_s1_ref.sh
    

    This stage adapts the base model to VSR by learning to fuse LR video latents with sparse HR keyframe latents for robust cross-space propagation.

  • πŸ”Ή Stage-2 (Pixel-Space): Detail Refinement. First, convert the Stage-1 checkpoint into a loadable SFT weight format:

    python scripts/prepare_sft_ckpt.py --checkpoint_dir ../checkpoint/SparkVSR-s1/checkpoint-10000
    

    (Adjust the path and step number to match your actual training output).

    Alternatively, you can skip Stage-1 by downloading our SparkVSR Stage-1 weights and using them as the starting point for Stage-2.

    Then, run the second-stage fine-tuning:

    bash sparkvsr_train_s2_ref.sh
    

    This stage refines perceptual details in pixel space, ensuring adherence to provided keyframes while simultaneously maintaining strong no-reference blind SR capabilities when keyframes are absent or imperfect.

  • Finally, convert the Stage-2 checkpoint for inference:

    python scripts/prepare_sft_ckpt.py --checkpoint_dir ../checkpoint/SparkVSR-s2/checkpoint-500
    

πŸ”¨ Inference

  • Before running inference, make sure you have downloaded the corresponding pre-trained models and test datasets.
  • The full inference commands are provided in the shell script: sparkvsr_inference.sh.

SparkVSR supports flexible keyframe propagation through three primary inference modes (--ref_mode).
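
One practical source for reference indices is codec I-frame extraction, one of the keyframe selection strategies the paper mentions. Assuming you have a per-frame list of picture types (e.g. from the standard `ffprobe` invocation shown in the comment), the I-frame indices can be recovered with a trivial helper (illustrative, not part of the repo):

```python
# pict_types could come from, e.g.:
#   ffprobe -v error -select_streams v:0 -show_entries frame=pict_type -of csv=p=0 input.mp4

def iframe_indices(pict_types: list[str]) -> list[int]:
    """Return 0-indexed positions of codec I-frames, suitable for --ref_indices."""
    return [i for i, t in enumerate(pict_types) if t == "I"]
```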

🌟 Global Customization Flags

Regardless of the mode you choose, you can customize the temporal propagation behavior using these flags:

  • --ref_indices: Specifies the indices of the keyframes you want to use as references (0-indexed).
    • Example: --ref_indices 0 16 32
    • ⚠️ Important: The interval between any two reference frame indices must be strictly greater than 4.
  • --ref_guidance_scale: Controls the strength of the reference keyframe's influence on the output video (Default is 1.0). Increasing this value forces the model to adhere more strictly to the provided keyframes.
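
The spacing rule for `--ref_indices` is easy to check before launching a run. A hypothetical validator (not part of the repo) encoding "strictly greater than 4", i.e. a minimum gap of 5 frames:

```python
def validate_ref_indices(indices: list[int], min_gap: int = 5) -> bool:
    """Indices must be 0-indexed and pairwise spaced strictly more than 4 frames apart."""
    idx = sorted(indices)
    if any(i < 0 for i in idx):
        return False
    return all(b - a >= min_gap for a, b in zip(idx, idx[1:]))
```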

1️⃣ No-Ref Mode (--ref_mode no_ref)

Performs blind video super-resolution without any reference keyframes.

MODEL_PATH="checkpoints/sparkvsr-s2/ckpt-500-sft" 

CUDA_VISIBLE_DEVICES=0 python sparkvsr_inference_script.py \
    --input_dir datasets/test/UDM10/LQ-Video \
    --model_path $MODEL_PATH \
    --output_path results/UDM10/no_ref \
    --gt_dir datasets/test/UDM10/GT-Video \
    --is_vae_st \
    --ref_mode no_ref \
    --ref_prompt_mode fixed \
    --ref_guidance_scale 1.0 \
    --eval_metrics psnr,ssim,lpips,dists,clipiqa \
    --upscale 4

2️⃣ API Mode (--ref_mode api)

Uses keyframes restored by a commercial API as the conditioning signal. SparkVSR defaults to the fal-ai/nano-banana-pro/edit endpoint.

⚠️ Setup Requirement:

  1. Open finetune/utils/ref_utils.py.
  2. Locate the configuration block at the top of the file.
  3. Replace 'your_fal_key' with your actual API key.
  4. (Optional) Customize the TASK_PROMPT in the same file to better guide the restoration process.
MODEL_PATH="checkpoints/sparkvsr-s2/ckpt-500-sft" 

CUDA_VISIBLE_DEVICES=0 python sparkvsr_inference_script.py \
    --input_dir datasets/test/UDM10/LQ-Video \
    --model_path $MODEL_PATH \
    --output_path results/UDM10/api_ref \
    --gt_dir datasets/test/UDM10/GT-Video \
    --is_vae_st \
    --ref_mode api \
    --ref_prompt_mode fixed \
    --ref_guidance_scale 1.0 \
    --eval_metrics psnr,ssim,lpips,dists,clipiqa \
    --upscale 4 \
    --ref_indices 0

3️⃣ PiSA-SR Mode (--ref_mode pisasr)

Uses keyframes restored by the open-source PiSA-SR model.

⚠️ Setup Requirement:

  1. Clone the PiSA-SR Repository and follow their instructions to install dependencies in a separate Conda environment.
  2. Download their pre-trained weights (stable-diffusion-2-1-base and pisa_sr.pkl).
  3. Update the --pisa_* flags in sparkvsr_inference.sh to point to your actual cloned PiSA-SR directory, environment, and desired GPU.
MODEL_PATH="checkpoints/sparkvsr-s2/ckpt-500-sft" 

CUDA_VISIBLE_DEVICES=0 python sparkvsr_inference_script.py \
    --input_dir datasets/test/UDM10/LQ-Video \
    --model_path $MODEL_PATH \
    --output_path results/UDM10/pisa_ref \
    --gt_dir datasets/test/UDM10/GT-Video \
    --is_vae_st \
    --ref_mode pisasr \
    --ref_prompt_mode fixed \
    --ref_guidance_scale 1.0 \
    --eval_metrics psnr,ssim,lpips,dists,clipiqa \
    --upscale 4 \
    --ref_indices 0 \
    --pisa_python_executable "path/to/your/pisasr/conda/env/bin/python" \
    --pisa_script_path "path/to/your/PiSA-SR/test_pisasr.py" \
    --pisa_sd_model_path "path/to/your/PiSA-SR/preset/models/stable-diffusion-2-1-base" \
    --pisa_chkpt_path "path/to/your/PiSA-SR/preset/models/pisa_sr.pkl" \
    --pisa_gpu "0"

πŸ’‘ Note: All three of the above inference modes and their complete execution commands are fully organized and ready to run in the sparkvsr_inference.sh script!

πŸ“ Metric Evaluation

To quantitatively evaluate the super-resolved videos, we provide a unified evaluation script: run_eval_all.sh.

⚠️ Evaluation Setup Requirement: To calculate DOVER and FastVQA/FasterVQA scores, you must clone their respective repositories and place them (along with their weights) into the metrics/ directory.

  1. Clone VQAssessment/DOVER into metrics/DOVER.
  2. Clone VQAssessment/FAST-VQA-and-FasterVQA into metrics/FastVQA.
  3. Download the pre-trained weights specified in their repositories to their respective nested algorithm folders.

Once the metrics are set up, you can simply run the unified evaluation script run_eval_all.sh to calculate the scores. The evaluation results will be saved as all_metrics_results.json in your specified output directory.
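
If you want a quick terminal summary of the results file, a small reader like the following works, assuming `all_metrics_results.json` is (or contains) a flat metric-to-score mapping; the actual layout produced by `run_eval_all.sh` may be nested differently:

```python
import json

def summarize_metrics(json_path: str) -> str:
    """Pretty-print a metrics JSON file, assuming a flat {metric: score} layout."""
    with open(json_path) as f:
        results = json.load(f)
    lines = [f"{metric:>10s}: {score:.4f}" for metric, score in sorted(results.items())]
    return "\n".join(lines)
```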

πŸ“Ž Citation

If you find the code helpful in your research or work, please cite the following paper(s).

@misc{yu2026sparkvsrinteractivevideosuperresolution,
      title={SparkVSR: Interactive Video Super-Resolution via Sparse Keyframe Propagation}, 
      author={Jiongze Yu and Xiangbo Gao and Pooja Verlani and Akshay Gadde and Yilin Wang and Balu Adsumilli and Zhengzhong Tu},
      year={2026},
      eprint={2603.16864},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.16864}, 
}

πŸ’‘ Acknowledgements

Our work is built upon the solid foundations laid by DOVE and CogVideoX. We sincerely thank the authors for their excellent open-source contributions.
