---
tags:
- face-restoration
- diffusion
- one-step
- stable-diffusion
- lora
- image-to-image
base_model: stabilityai/stable-diffusion-2-1-base
pipeline_tag: image-to-image
---

# OSDFace: Pretrained Weights (Mirror)

> **This is an unofficial mirror.**
> All credit goes to the original authors. The weights are mirrored here from the [official OSDFace repository](https://github.com/jkwang28/OSDFace) for convenience, as the original download is hosted on OneDrive/Google Drive, which can be slow or inaccessible in some regions.
> Please cite the original paper and star the original repo if you use these weights.

## Overview

OSDFace (**One-Step Diffusion Model for Face Restoration**) is a single-step diffusion model that restores degraded, low-quality face images into high-fidelity, identity-consistent outputs. It was accepted at **CVPR 2025**.

Unlike multi-step diffusion approaches, OSDFace requires only **one forward pass** through a modified Stable Diffusion 2.1 UNet, making it significantly faster at inference while achieving state-of-the-art results on both synthetic (CelebA-Test) and real-world (Wider-Test, LFW-Test, WebPhoto-Test) benchmarks.

The key innovations are:

- **Visual Representation Embedder (VRE):** A VQ-VAE encoder that tokenizes the low-quality input face and produces visual prompt embeddings via a vector-quantized dictionary. These embeddings replace the text encoder's output and are fed directly into the UNet's cross-attention layers.
- **Facial Identity Loss:** A face-recognition-derived loss that enforces identity consistency between the restored and ground-truth faces.
- **GAN Guidance:** A generative adversarial network guides the one-step diffusion to align the output distribution with the ground truth.

## Usage

### Prerequisites

- **Base model:** [stabilityai/stable-diffusion-2-1-base](https://huggingface.co/stabilityai/stable-diffusion-2-1-base)
- **Python 3.10**, PyTorch 2.4.0, diffusers 0.27.2

### Quick Start

```bash
# Clone the official repo
git clone https://github.com/jkwang28/OSDFace.git
cd OSDFace

# Download these weights into pretrained/
# Place: associate_2.ckpt, embedding_change_weights.pth, pytorch_lora_weights.safetensors

# Run inference (with LoRA merging for speed)
python infer.py \
    --input_image data/WebPhoto-Test \
    --output_dir results/WebPhoto-Test \
    --pretrained_model_name_or_path "stabilityai/stable-diffusion-2-1-base" \
    --img_encoder_weight "pretrained/associate_2.ckpt" \
    --ckpt_path pretrained \
    --merge_lora \
    --mixed_precision fp16 \
    --gpu_ids 0
```

> **Note on the base model**
> Although the project is based on `stabilityai/stable-diffusion-2-1-base`, this mirror uses `Manojb/stable-diffusion-2-1-base` because the former cannot be downloaded from Hugging Face.

## Files in This Repository

### `associate_2.ckpt` (1.87 GB)

The **VQ-VAE image encoder** (referred to as the Visual Representation Embedder in the paper). This is the core component that understands the degraded input face.

It contains a multi-head encoder with downsampling blocks, a mid-block with attention, and a vector quantizer with a learned 1024-entry codebook (embedding dim 512). At inference, the encoder processes a 512×512 low-quality face, extracts spatial features, quantizes them against the codebook, and selects the 77 closest (non-duplicate) codebook entries, producing a `(batch, 77, 512)` tensor that acts as a drop-in replacement for CLIP text embeddings in the UNet's cross-attention.

**Loaded via:** `--img_encoder_weight associate_2.ckpt`
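
The selection step can be sketched as a nearest-neighbor lookup against the codebook. This is a hypothetical illustration of the mechanism described above (function name and shapes are assumptions), not the actual OSDFace code:

```python
import torch

def select_visual_prompt(features: torch.Tensor, codebook: torch.Tensor,
                         num_tokens: int = 77) -> torch.Tensor:
    """Pick the `num_tokens` distinct codebook entries closest to the
    encoder's spatial features. Illustrative sketch only.

    features: (B, N, D) flattened spatial features from the encoder
    codebook: (K, D) learned dictionary (K=1024, D=512 per the card)
    returns:  (B, num_tokens, D) visual prompt embeddings
    """
    B = features.shape[0]
    # Euclidean distance from every feature to every codebook entry: (B, N, K)
    dists = torch.cdist(features, codebook.unsqueeze(0).expand(B, -1, -1))
    # Best (smallest) distance each codebook entry achieves across features: (B, K)
    best_per_code = dists.min(dim=1).values
    # Indices of the num_tokens closest entries; topk indices are distinct,
    # which gives the "non-duplicate" property for free: (B, num_tokens)
    idx = best_per_code.topk(num_tokens, largest=False).indices
    return codebook[idx]  # (B, num_tokens, D)
```

The `(B, 77, 512)` result is then handed to the cross-attention layers in place of CLIP text embeddings.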

### `embedding_change_weights.pth` (1.58 MB)

A lightweight **embedding projection module** (`TwoLayerConv1x1`) that maps the VRE output from 512 dimensions to 1024 dimensions, matching the hidden size expected by Stable Diffusion 2.1's UNet cross-attention layers.

Architecture: two 1×1 Conv1d layers with SiLU activations (`512 → 256 → 1024`), operating over the 77-token sequence.

This module is used in the default configuration (without `--cat_prompt_embedding`). When `--cat_prompt_embedding` is enabled, the VRE instead outputs 154 tokens at 512-dim, which are reshaped to 77 tokens at 1024-dim, bypassing this module entirely.

**Loaded from:** `<ckpt_path>/embedding_change_weights.pth`
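
The module above is small enough to sketch in full. This is a plausible reconstruction from the description (layer sizes `512 → 256 → 1024`, SiLU between layers); the official module's exact activation placement may differ:

```python
import torch
import torch.nn as nn

class TwoLayerConv1x1(nn.Module):
    """Sketch of the embedding projection: two 1x1 Conv1d layers with a
    SiLU in between, mapping 512-dim VRE tokens to the 1024-dim hidden
    size of the SD2.1 UNet cross-attention. Illustrative, not official."""

    def __init__(self, in_dim: int = 512, mid_dim: int = 256, out_dim: int = 1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv1d(in_dim, mid_dim, kernel_size=1),
            nn.SiLU(),
            nn.Conv1d(mid_dim, out_dim, kernel_size=1),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Conv1d expects (B, C, L); tokens arrive as (B, 77, 512),
        # so transpose channels and sequence before and after.
        return self.proj(tokens.transpose(1, 2)).transpose(1, 2)
```

A 1×1 convolution over the token axis is equivalent to applying the same small MLP to every token independently, which is why the checkpoint is only 1.58 MB.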

### `pytorch_lora_weights.safetensors` (67.9 MB)

**LoRA (Low-Rank Adaptation) weights** for the Stable Diffusion 2.1 UNet. These adapt the frozen SD2.1 UNet to perform one-step face restoration conditioned on the VRE embeddings.

Default LoRA configuration: **rank 16, alpha 16** (effective scaling factor `alpha/rank = 1.0`). The weights cover both standard LoRA layers (`lora_A`/`lora_B`) and some additional `lora.up`/`lora.down` layers.

These can be loaded in two ways:

- **Dynamic loading** (default): loaded at runtime via `diffusers`' `load_lora_weights()`
- **Merged loading** (`--merge_lora`): pre-merged into the UNet weights before inference for slightly faster execution

**Loaded from:** `<ckpt_path>/pytorch_lora_weights.safetensors`
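
Why the two paths give identical outputs is plain linear algebra. For a single linear layer (names and shapes here are illustrative, not OSDFace's code), merging folds the low-rank update `scale * B @ A` into the base weight once, so inference needs no extra matmuls:

```python
import torch

rank, alpha = 16, 16
d_out, d_in = 320, 320
W = torch.randn(d_out, d_in)          # frozen base weight
A = torch.randn(rank, d_in) * 0.01    # lora_A (down-projection)
B = torch.randn(d_out, rank) * 0.01   # lora_B (up-projection)
scale = alpha / rank                  # = 1.0 for rank 16, alpha 16

x = torch.randn(4, d_in)

# Dynamic path: base output plus the scaled low-rank correction.
y_dynamic = x @ W.T + scale * (x @ A.T) @ B.T

# Merged path (--merge_lora): fold B @ A into W once, then one matmul.
W_merged = W + scale * (B @ A)
y_merged = x @ W_merged.T

assert torch.allclose(y_dynamic, y_merged, atol=1e-3)
```

The merged path trades the flexibility of swapping adapters at runtime for a small speedup, which is why `--merge_lora` is recommended for pure inference.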

## Key Inference Arguments

| Argument | Default | Description |
|---|---|---|
| `--merge_lora` | off | Merge LoRA into the UNet weights (recommended) |
| `--mixed_precision` | `fp32` | Use `fp16` for faster inference / lower VRAM |
| `--gpu_ids` | `[0]` | Multi-GPU support, e.g. `--gpu_ids 0 1 2 3` |
| `--cat_prompt_embedding` | off | Alternative embedding strategy (skips the embedding_change module) |
| `--lora_rank` | 16 | LoRA rank (must match training) |
| `--lora_alpha` | 16 | LoRA alpha (must match training) |

## Inference Pipeline (Summary)

1. Input image resized to **512×512**
2. VRE encodes the LQ face → `(B, 77, 512)` visual prompt
3. Embedding projection maps to `(B, 77, 1024)` (or the concatenation path)
4. VAE encodes the LQ face to latent space
5. UNet performs a **single denoising step** at timestep 399, conditioned on the visual prompt
6. Predicted clean latent is decoded by the VAE → restored face
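
The single denoising step (steps 5-6) can be sketched with the standard diffusion identity for recovering a clean latent from a noise prediction. This assumes an ε-predicting UNet and treats the encoded latent as `x_t` at `t = 399`; it is an illustrative sketch, not the project's scheduler code:

```python
import torch

def one_step_restore(latent_t: torch.Tensor, eps_pred: torch.Tensor,
                     alpha_bar_t: float) -> torch.Tensor:
    """Recover the clean latent x0 in a single step from the UNet's noise
    prediction at timestep t, via the standard diffusion identity
        x0 = (x_t - sqrt(1 - a_bar_t) * eps) / sqrt(a_bar_t).
    Illustrative; the real pipeline wraps this in a diffusers scheduler."""
    a = torch.tensor(alpha_bar_t)
    return (latent_t - torch.sqrt(1 - a) * eps_pred) / torch.sqrt(a)
```

Because the model was trained so that one UNet pass at a fixed timestep yields a usable `x0`, no iterative sampling loop is needed; the recovered latent goes straight to the VAE decoder.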

## Citation

```bibtex
@InProceedings{wang2025osdface,
    author    = {Wang, Jingkai and Gong, Jue and Zhang, Lin and Chen, Zheng and Liu, Xing and Gu, Hong and Liu, Yutong and Zhang, Yulun and Yang, Xiaokang},
    title     = {{OSDFace}: One-Step Diffusion Model for Face Restoration},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {12626-12636}
}
```

## Links

- [Paper (arXiv)](https://arxiv.org/abs/2411.17163)
- [Official Repository](https://github.com/jkwang28/OSDFace)
- [Project Page](https://www.jingkaiwang.com/OSDFace/)