📄 Paper   |   🖥️ GitHub

This repository contains the model weights for the paper "V-Bridge: Bridging Video Generative Priors to Versatile Few-shot Image Restoration".

Overview

Large-scale video generative models are trained on vast and diverse visual data, enabling them to internalize rich structural, semantic, and dynamic priors of the visual world. While these models have demonstrated impressive generative capability, their potential as general-purpose visual learners remains largely untapped. In this work, we introduce V-Bridge, a framework that bridges this latent capacity to versatile few-shot image restoration. We reinterpret image restoration not as a static regression problem but as a progressive generative process, and leverage video models to simulate the gradual refinement from degraded inputs to high-fidelity outputs. Surprisingly, with only 1,000 multi-task training samples (less than 2% of the data used by existing restoration methods), pretrained video models can be induced to perform competitive image restoration, handling multiple tasks with a single model and rivaling specialized architectures designed explicitly for this purpose. Our findings reveal that video generative models implicitly learn powerful, transferable restoration priors that can be activated with extremely limited data, challenging the traditional boundary between generative modeling and low-level vision and opening a new design paradigm for visual foundation models.
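Since restoration is framed as a progressive refinement from the degraded input toward a clean output, one natural readout is to take the final frame of the generated clip as the restored image. This is a minimal sketch of that idea; the readout convention is an assumption here, and the official procedure is documented in the GitHub repository.

```python
import numpy as np

def restored_frame(frames: np.ndarray) -> np.ndarray:
    """Return the last frame of a generated clip of shape (T, H, W, C).

    Assumes the clip refines the degraded input toward a clean image over
    time, so the final frame is the most restored one (a hypothetical
    readout, not confirmed by the text above).
    """
    assert frames.ndim == 4, "expected a clip of shape (T, H, W, C)"
    return frames[-1]

# Tiny synthetic example: a 5-frame clip of 2x2 RGB frames.
clip = np.arange(5 * 2 * 2 * 3, dtype=np.uint8).reshape(5, 2, 2, 3)
out = restored_frame(clip)
```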

Details

Our model uses a full fine-tuning approach, with the base model being Wan2.2-TI2V-5B.

The main inference parameters are listed below.

```python
import torch

# Sampling configuration
cfg_skip_ratio = 0.15          # fraction of steps on which classifier-free guidance is skipped
sampler_name = "Flow_Unipc"    # flow-matching UniPC sampler
shift = 5                      # timestep shift

# Output clip settings
video_length = 5
fps = 24

weight_dtype = torch.bfloat16

prompt = (
    "A restoration-focused video strictly based on the input image. "
    "The camera is completely static with no movement, no zoom, and no rotation. "
    "The original composition, objects, layout, and perspective are preserved exactly. "
    "Focus on visual restoration and enhancement: remove noise, reduce blur, eliminate rain artifacts, "
    "remove compression artifacts, and improve clarity, sharpness, and fine details while maintaining "
    "natural textures, accurate colors, and balanced lighting. "
    "Only extremely subtle and natural temporal consistency is allowed. "
    "The video should appear stable, clean, and realistic, as if the input image has been gently restored over time."
)
negative_prompt = (
    "camera movement, panning, tilting, zooming, rotation, "
    "scene change, object movement, new objects, object deformation, "
    "style change, artistic style, illustration, painting, cartoon, "
    "over-saturated colors, overexposure, underexposure, "
    "motion blur, jitter, flickering, shaking, "
    "low quality, worst quality, noise, blur, rain, fog, "
    "compression artifacts, jpeg artifacts, aliasing, "
    "text, subtitles, watermark, logo, "
    "distorted anatomy, extra limbs, duplicated objects, "
    "exaggerated motion, creative animation"
)

guidance_scale = 6.0
num_inference_steps = 50
```
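As a rough sketch, the settings above could be collected into keyword arguments for a sampling call. The `pipe` object, its argument names, and the image-conditioning interface are assumptions here; the actual entry point is documented in the GitHub repository.

```python
# Hypothetical wiring of the parameters above into a pipeline call.
sample_kwargs = {
    "num_frames": 5,             # video_length
    "fps": 24,
    "guidance_scale": 6.0,
    "num_inference_steps": 50,
    "shift": 5,
    "cfg_skip_ratio": 0.15,
}
# frames = pipe(image=degraded_image, prompt=prompt,
#               negative_prompt=negative_prompt, **sample_kwargs)
```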

More details and usage instructions can be found on GitHub.

Acknowledgements

We would like to thank the contributors to the Qwen, VideoX-Fun, and HuggingFace repositories for their open research.
