---
tags:
- face-restoration
- diffusion
- one-step
- stable-diffusion
- lora
- image-to-image
base_model: stabilityai/stable-diffusion-2-1-base
pipeline_tag: image-to-image
---

# OSDFace: Pretrained Weights (Mirror)

> **This is an unofficial mirror.**
> All credit goes to the original authors. The weights are mirrored here from the [official OSDFace repository](https://github.com/jkwang28/OSDFace) for convenience, as the original download is hosted on OneDrive/Google Drive, which can be slow or inaccessible in some regions.
> Please cite the original paper and star the original repo if you use these weights.

## Overview

OSDFace (**One-Step Diffusion Model for Face Restoration**) is a single-step diffusion model that restores degraded, low-quality face images into high-fidelity, identity-consistent outputs. It was accepted at **CVPR 2025**.

Unlike multi-step diffusion approaches, OSDFace requires only **one forward pass** through a modified Stable Diffusion 2.1 UNet, making it significantly faster at inference while achieving state-of-the-art results on both synthetic (CelebA-Test) and real-world (Wider-Test, LFW-Test, WebPhoto-Test) benchmarks.

The key innovations are:

- **Visual Representation Embedder (VRE):** A VQ-VAE encoder that tokenizes the low-quality input face and produces visual prompt embeddings via a vector-quantized dictionary. These embeddings replace the text encoder's output and are fed directly into the UNet's cross-attention layers.
- **Facial Identity Loss:** A face-recognition-derived loss that enforces identity consistency between the restored and ground-truth faces.
- **GAN Guidance:** A generative adversarial network guides the one-step diffusion model to align its output distribution with the ground truth.

## Usage

### Prerequisites

- **Base model:** [stabilityai/stable-diffusion-2-1-base](https://huggingface.co/stabilityai/stable-diffusion-2-1-base)
- **Environment:** Python 3.10, PyTorch 2.4.0, diffusers 0.27.2

### Quick Start

```bash
# Clone the official repo
git clone https://github.com/jkwang28/OSDFace.git
cd OSDFace

# Download these weights into pretrained/
# Place: associate_2.ckpt, embedding_change_weights.pth, pytorch_lora_weights.safetensors

# Run inference (with LoRA merging for speed)
python infer.py \
    --input_image data/WebPhoto-Test \
    --output_dir results/WebPhoto-Test \
    --pretrained_model_name_or_path "stabilityai/stable-diffusion-2-1-base" \
    --img_encoder_weight "pretrained/associate_2.ckpt" \
    --ckpt_path pretrained \
    --merge_lora \
    --mixed_precision fp16 \
    --gpu_ids 0
```

> **Note on using a different pretrained model**
> Although the project is based on `stabilityai/stable-diffusion-2-1-base`, we use `Manojb/stable-diffusion-2-1-base` because the former can no longer be downloaded from Hugging Face.

## Files in This Repository

### `associate_2.ckpt` (1.87 GB)

The **VQ-VAE image encoder** (referred to as the Visual Representation Embedder in the paper). This is the core component that interprets the degraded input face.

It contains a multi-head encoder with downsampling blocks, a mid-block with attention, and a vector quantizer with a learned 1024-entry codebook (embedding dim 512). At inference, the encoder processes a 512×512 low-quality face, extracts spatial features, quantizes them against the codebook, and selects the 77 closest (non-duplicate) codebook entries, producing a `(batch, 77, 512)` tensor that acts as a drop-in replacement for CLIP text embeddings in the UNet's cross-attention.

**Loaded via:** `--img_encoder_weight associate_2.ckpt`

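The token-selection step above can be sketched as follows. This is a minimal NumPy illustration of "quantize, then keep the 77 closest non-duplicate codes", not the authors' code; shapes follow the description above, and the feature-grid size is an assumption:

```python
import numpy as np

def select_visual_tokens(features, codebook, n_tokens=77):
    """Quantize features against the codebook and keep the n_tokens
    closest non-duplicate entries (a sketch of the VRE token selection).

    features: (N, d) flattened spatial features of the LQ face
    codebook: (K, d) learned VQ dictionary (K=1024, d=512 in OSDFace)
    returns:  (n_tokens, d) visual prompt embeddings
    """
    # squared distances via ||f||^2 - 2 f.c + ||c||^2, shape (N, K)
    d2 = ((features**2).sum(1, keepdims=True)
          - 2.0 * features @ codebook.T
          + (codebook**2).sum(1))
    idx = d2.argmin(axis=1)         # nearest codebook index per feature
    dist = d2.min(axis=1)           # distance to that nearest code
    seen, chosen = set(), []
    for i in np.argsort(dist):      # visit features by match quality
        if idx[i] not in seen:      # skip duplicate codebook entries
            seen.add(idx[i])
            chosen.append(idx[i])
            if len(chosen) == n_tokens:
                break
    return codebook[np.array(chosen)]

rng = np.random.default_rng(0)
feats = rng.standard_normal((4096, 512))   # e.g. a 64x64 spatial grid (assumed)
book = rng.standard_normal((1024, 512))    # 1024-entry codebook, dim 512
tokens = select_visual_tokens(feats, book)
print(tokens.shape)  # (77, 512)
```

The resulting `(77, 512)` tensor is what stands in for the CLIP text embedding sequence.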
### `embedding_change_weights.pth` (1.58 MB)

A lightweight **embedding projection module** (`TwoLayerConv1x1`) that maps the VRE output from 512 dimensions to 1024 dimensions, matching the hidden size expected by Stable Diffusion 2.1's UNet cross-attention layers.

Architecture: two 1×1 Conv1d layers with SiLU activations (`512 → 256 → 1024`), operating over the 77-token sequence.

This module is used in the default configuration (without `--cat_prompt_embedding`). When `--cat_prompt_embedding` is enabled, the VRE instead outputs 154 tokens at 512 dimensions, which are reshaped to 77 tokens at 1024 dimensions, bypassing this module entirely.

**Loaded from:** `<ckpt_path>/embedding_change_weights.pth`

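As a shape sanity check: a 1×1 Conv1d is just a linear map applied independently to each token, so the projection can be sketched in NumPy. The weights here are random placeholders (the real module loads them from `embedding_change_weights.pth`), and the activation placement is illustrative:

```python
import numpy as np

def silu(x):
    # SiLU (swish): x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def two_layer_conv1x1(tokens, w1, b1, w2, b2):
    """Sketch of the TwoLayerConv1x1 projection: 512 -> 256 -> 1024.
    A 1x1 Conv1d over the token axis acts as a per-token linear layer.

    tokens: (B, 77, 512) visual prompt from the VRE
    returns: (B, 77, 1024) embeddings for SD2.1 cross-attention
    """
    h = silu(tokens @ w1.T + b1)   # (B, 77, 256)
    return h @ w2.T + b2           # (B, 77, 1024)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 77, 512))
w1, b1 = rng.standard_normal((256, 512)) * 0.02, np.zeros(256)
w2, b2 = rng.standard_normal((1024, 256)) * 0.02, np.zeros(1024)
out = two_layer_conv1x1(x, w1, b1, w2, b2)
print(out.shape)  # (2, 77, 1024)
```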
### `pytorch_lora_weights.safetensors` (67.9 MB)

**LoRA (Low-Rank Adaptation) weights** for the Stable Diffusion 2.1 UNet. These adapt the frozen SD2.1 UNet to perform one-step face restoration conditioned on the VRE embeddings.

Default LoRA configuration: **rank 16, alpha 16** (effective scaling factor `alpha/rank = 1.0`). The weights cover both standard LoRA layers (`lora_A`/`lora_B`) and some additional `lora.up`/`lora.down` layers.

These can be loaded in two ways:

- **Dynamic loading** (default): loaded at runtime via `diffusers`' `load_lora_weights()`
- **Merged loading** (`--merge_lora`): pre-merged into the UNet weights before inference for slightly faster execution

**Loaded from:** `<ckpt_path>/pytorch_lora_weights.safetensors`

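Merging folds the low-rank update into the base weight once, so inference pays no extra cost per step. In NumPy terms, this is the standard LoRA merge (a sketch, not the project's code):

```python
import numpy as np

def merge_lora(w_base, lora_a, lora_b, rank=16, alpha=16):
    """Fold a LoRA update into the base weight: W' = W + (alpha/rank) * B @ A.

    w_base: (out, in) frozen base weight
    lora_a: (rank, in) down-projection ("lora_A" / "lora.down")
    lora_b: (out, rank) up-projection  ("lora_B" / "lora.up")
    """
    return w_base + (alpha / rank) * (lora_b @ lora_a)

rng = np.random.default_rng(0)
w = rng.standard_normal((1024, 1024))
a = rng.standard_normal((16, 1024)) * 0.02
b = np.zeros((1024, 16))          # B is zero-initialized in LoRA training
w_merged = merge_lora(w, a, b)
assert np.allclose(w_merged, w)   # with B = 0 the merge is a no-op
```

With rank 16 and alpha 16 the scaling factor is exactly 1.0, so the merged update is simply `B @ A`.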
## Key Inference Arguments

| Argument | Default | Description |
|---|---|---|
| `--merge_lora` | off | Merge LoRA into UNet weights (recommended) |
| `--mixed_precision` | `fp32` | Use `fp16` for faster inference / lower VRAM |
| `--gpu_ids` | `[0]` | Multi-GPU support, e.g. `--gpu_ids 0 1 2 3` |
| `--cat_prompt_embedding` | off | Alternative embedding strategy (skips the embedding_change module) |
| `--lora_rank` | 16 | LoRA rank (must match training) |
| `--lora_alpha` | 16 | LoRA alpha (must match training) |

## Inference Pipeline (Summary)

1. Input image resized to **512×512**
2. VRE encodes the LQ face → `(B, 77, 512)` visual prompt
3. Embedding projection maps to `(B, 77, 1024)` (or concatenation path)
4. VAE encodes the LQ face to latent space
5. UNet performs a **single denoising step** at timestep 399, conditioned on the visual prompt
6. Predicted clean latent is decoded by the VAE → restored face

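Step 5 amounts to a single closed-form clean-latent estimate. Assuming the usual epsilon-prediction parameterization (an illustration of the math, not the project's exact scheduler code; the `alpha_bar` value below is a placeholder, not the real value at t = 399):

```python
import numpy as np

def predict_x0(x_t, eps_pred, alpha_bar_t):
    """One-step x0 estimate under epsilon-prediction:
    x0 = (x_t - sqrt(1 - abar_t) * eps) / sqrt(abar_t).
    """
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)

# Round trip: noise a known latent, then recover it exactly with the true eps.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 64, 64))   # "clean" latent
eps = rng.standard_normal((4, 64, 64))  # noise
abar = 0.41                             # placeholder alpha_bar for t = 399
x_t = np.sqrt(abar) * x0 + np.sqrt(1 - abar) * eps
x0_hat = predict_x0(x_t, eps, abar)
assert np.allclose(x0_hat, x0)
```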
## Citation

```bibtex
@InProceedings{wang2025osdface,
    author    = {Wang, Jingkai and Gong, Jue and Zhang, Lin and Chen, Zheng and Liu, Xing and Gu, Hong and Liu, Yutong and Zhang, Yulun and Yang, Xiaokang},
    title     = {{OSDFace}: One-Step Diffusion Model for Face Restoration},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {12626-12636}
}
```

## Links

- [Paper (arXiv)](https://arxiv.org/abs/2411.17163)
- [Official Repository](https://github.com/jkwang28/OSDFace)
- [Project Page](https://www.jingkaiwang.com/OSDFace/)