---
tags:
- face-restoration
- diffusion
- one-step
- stable-diffusion
- lora
- image-to-image
base_model: stabilityai/stable-diffusion-2-1-base
pipeline_tag: image-to-image
---

# OSDFace: Pretrained Weights (Mirror)

![github.com_jkwang28_OSDFace_](https://cdn-uploads.huggingface.co/production/uploads/6557b9b4deee83130ac92941/Qs7aOIkhOyghdH8Bdt5wF.png)

> **This is an unofficial mirror.**
> All credit goes to the original authors. The weights are mirrored here from the [official OSDFace repository](https://github.com/jkwang28/OSDFace) for convenience, since the original download is hosted on OneDrive/Google Drive, which can be slow or inaccessible in some regions.
> Please cite the original paper and star the original repo if you use these weights.

## Overview

OSDFace (**One-Step Diffusion Model for Face Restoration**) is a single-step diffusion model that restores degraded, low-quality face images into high-fidelity, identity-consistent outputs. It was accepted at **CVPR 2025**.

Unlike multi-step diffusion approaches, OSDFace requires only **one forward pass** through a modified Stable Diffusion 2.1 UNet, making it significantly faster at inference while achieving state-of-the-art results on both synthetic (CelebA-Test) and real-world (Wider-Test, LFW-Test, WebPhoto-Test) benchmarks.

The key innovations are:

- **Visual Representation Embedder (VRE):** A VQ-VAE encoder that tokenizes the low-quality input face and produces visual prompt embeddings via a vector-quantized dictionary. These embeddings replace the text encoder's output and are fed directly into the UNet's cross-attention layers.
- **Facial Identity Loss:** A face-recognition-derived loss that enforces identity consistency between the restored and ground-truth faces.
- **GAN Guidance:** A generative adversarial network guides the one-step diffusion to align the output distribution with the ground truth.

## Usage

### Prerequisites

- **Base model:** [stabilityai/stable-diffusion-2-1-base](https://huggingface.co/stabilityai/stable-diffusion-2-1-base)
- **Python 3.10**, PyTorch 2.4.0, diffusers 0.27.2

### Quick Start

```bash
# Clone the official repo
git clone https://github.com/jkwang28/OSDFace.git
cd OSDFace

# Download these weights into pretrained/
# Place: associate_2.ckpt, embedding_change_weights.pth, pytorch_lora_weights.safetensors

# Run inference (with LoRA merging for speed)
python infer.py \
  --input_image data/WebPhoto-Test \
  --output_dir results/WebPhoto-Test \
  --pretrained_model_name_or_path "stabilityai/stable-diffusion-2-1-base" \
  --img_encoder_weight "pretrained/associate_2.ckpt" \
  --ckpt_path pretrained \
  --merge_lora \
  --mixed_precision fp16 \
  --gpu_ids 0
```

> **Note on the different pretrained model**
> Although the project is based on `stabilityai/stable-diffusion-2-1-base`, we use `Manojb/stable-diffusion-2-1-base`, since the former cannot be downloaded from Hugging Face.


## Files in This Repository

![image](https://cdn-uploads.huggingface.co/production/uploads/6557b9b4deee83130ac92941/rqOY2DHGJ1MhnJkJ5keJ0.png)

### `associate_2.ckpt` (1.87 GB)

The **VQ-VAE image encoder** (referred to as the Visual Representation Embedder in the paper). This is the core component that understands the degraded input face.

It contains a multi-head encoder with downsampling blocks, a mid-block with attention, and a vector quantizer with a learned 1024-entry codebook (embedding dim 512). At inference, the encoder processes a 512×512 low-quality face, extracts spatial features, quantizes them against the codebook, and selects the 77 closest (non-duplicate) codebook entries, producing a `(batch, 77, 512)` tensor that acts as a drop-in replacement for CLIP text embeddings in the UNet's cross-attention.

**Loaded via:** `--img_encoder_weight associate_2.ckpt`

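The selection step can be sketched in NumPy. This is an illustrative approximation, not the shipped implementation: the function name `select_visual_prompt` is hypothetical, the exact ranking/deduplication rule is an assumption, and random arrays stand in for real encoder features and the learned codebook.

```python
import numpy as np

def select_visual_prompt(feats, codebook, n_tokens=77):
    """Rank codebook entries by distance to the encoder's spatial
    features and keep the n_tokens closest distinct ones (an
    approximation of the VRE selection step described above)."""
    # Squared Euclidean distances, (N, K), without the (N, K, D) blow-up.
    d2 = ((feats ** 2).sum(1)[:, None]
          + (codebook ** 2).sum(1)[None, :]
          - 2.0 * feats @ codebook.T)
    # For each code, its distance to the nearest feature; smaller = closer.
    best = d2.min(axis=0)                 # (K,)
    chosen = np.argsort(best)[:n_tokens]  # n_tokens distinct code indices
    return codebook[chosen]               # (n_tokens, D)

rng = np.random.default_rng(0)
feats = rng.normal(size=(32 * 32, 512))   # flattened spatial features
codebook = rng.normal(size=(1024, 512))   # 1024-entry codebook, dim 512
prompt = select_visual_prompt(feats, codebook)
print(prompt.shape)  # (77, 512)
```

Because the indices come from a single `argsort`, the 77 selected entries are distinct by construction, matching the "non-duplicate" constraint.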
### `embedding_change_weights.pth` (1.58 MB)

A lightweight **embedding projection module** (`TwoLayerConv1x1`) that maps the VRE output from 512 dimensions to 1024 dimensions, matching the hidden size expected by Stable Diffusion 2.1's UNet cross-attention layers.

Architecture: two 1×1 Conv1d layers with SiLU activations (`512 → 256 → 1024`), operating over the 77-token sequence.

This module is used in the default configuration (without `--cat_prompt_embedding`). When `--cat_prompt_embedding` is enabled, the VRE instead outputs 154 tokens at 512-dim, which are reshaped to 77 tokens at 1024-dim, bypassing this module entirely.

**Loaded from:** `<ckpt_path>/embedding_change_weights.pth`

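A 1×1 Conv1d over a token sequence is equivalent to applying the same linear layer to every token, so the projection can be sketched in plain NumPy. This is a minimal sketch with random stand-in weights (not the shipped checkpoint); the exact activation placement is an assumption, so check the repo code for details.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))  # SiLU, a.k.a. swish

def project_prompt(tokens, w1, b1, w2, b2):
    """Per-token projection equivalent to two 1x1 Conv1d layers
    (512 -> 256 -> 1024). Applying SiLU only between the two layers
    is an assumption about the exact architecture."""
    h = silu(tokens @ w1.T + b1)  # (B, 77, 256)
    return h @ w2.T + b2          # (B, 77, 1024)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(1, 77, 512))                      # VRE output
w1, b1 = rng.normal(size=(256, 512)) * 0.02, np.zeros(256)  # layer 1
w2, b2 = rng.normal(size=(1024, 256)) * 0.02, np.zeros(1024)  # layer 2
out = project_prompt(tokens, w1, b1, w2, b2)
print(out.shape)  # (1, 77, 1024)
```

The output shape matches what SD 2.1's cross-attention expects in place of CLIP text embeddings.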
### `pytorch_lora_weights.safetensors` (67.9 MB)

**LoRA (Low-Rank Adaptation) weights** for the Stable Diffusion 2.1 UNet. These adapt the frozen SD2.1 UNet to perform one-step face restoration conditioned on the VRE embeddings.

Default LoRA configuration: **rank 16, alpha 16** (effective scaling factor `alpha/rank = 1.0`). The weights cover both standard LoRA layers (`lora_A`/`lora_B`) and some additional `lora.up`/`lora.down` layers.

These can be loaded in two ways:

- **Dynamic loading** (default): loaded at runtime via `diffusers`' `load_lora_weights()`
- **Merged loading** (`--merge_lora`): pre-merged into the UNet weights before inference for slightly faster execution

**Loaded from:** `<ckpt_path>/pytorch_lora_weights.safetensors`

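What `--merge_lora` does can be sketched with the standard LoRA merge identity: the low-rank update is folded into the frozen base weight once, so inference needs no extra matmuls. Toy random matrices stand in for the actual UNet weights below.

```python
import numpy as np

def merge_lora(W, A, B, rank=16, alpha=16):
    """Fold a LoRA update into a frozen base weight:
    W_merged = W + (alpha / rank) * B @ A.
    With rank 16 / alpha 16 the scaling factor is exactly 1.0."""
    return W + (alpha / rank) * (B @ A)

rng = np.random.default_rng(0)
W = rng.normal(size=(320, 320))        # frozen base weight, (out, in)
A = rng.normal(size=(16, 320)) * 0.01  # lora_A: down-projection, (rank, in)
B = rng.normal(size=(320, 16)) * 0.01  # lora_B: up-projection, (out, rank)
W_merged = merge_lora(W, A, B)
print(W_merged.shape)  # (320, 320)
```

After merging, the adapted layer is a single dense matrix again, which is why merged loading runs slightly faster than applying `A` and `B` at runtime.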
## Key Inference Arguments

| Argument | Default | Description |
|---|---|---|
| `--merge_lora` | off | Merge LoRA into UNet weights (recommended) |
| `--mixed_precision` | `fp32` | Use `fp16` for faster inference / lower VRAM |
| `--gpu_ids` | `[0]` | Multi-GPU support, e.g. `--gpu_ids 0 1 2 3` |
| `--cat_prompt_embedding` | off | Alternative embedding strategy (skips the embedding projection module) |
| `--lora_rank` | 16 | LoRA rank (must match training) |
| `--lora_alpha` | 16 | LoRA alpha (must match training) |

## Inference Pipeline (Summary)

1. Input image resized to **512×512**
2. VRE encodes the LQ face → `(B, 77, 512)` visual prompt
3. Embedding projection maps to `(B, 77, 1024)` (or concatenation path)
4. VAE encodes the LQ face to latent space
5. UNet performs a **single denoising step** at timestep 399, conditioned on the visual prompt
6. Predicted clean latent is decoded by the VAE → restored face

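Step 5 relies on the standard epsilon-parameterized diffusion identity: given a noisy latent and a single noise prediction, the clean latent follows in closed form. The sketch below assumes the usual DDPM forward process; the `alpha_bar` value at t=399 is illustrative, not the real SD 2.1 schedule, and OSDFace's exact scheduler settings may differ.

```python
import numpy as np

def predict_x0(x_t, eps_pred, alpha_bar_t):
    """Recover the clean latent from one noise prediction:
    x0 = (x_t - sqrt(1 - abar_t) * eps) / sqrt(abar_t)."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)

rng = np.random.default_rng(0)
x0 = rng.normal(size=(1, 4, 64, 64))  # "true" clean latent (SD latent shape)
eps = rng.normal(size=x0.shape)       # Gaussian noise
alpha_bar = 0.25                      # illustrative value, not the real schedule
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps  # forward process
x0_hat = predict_x0(x_t, eps, alpha_bar)  # invert with the exact noise
assert np.allclose(x0_hat, x0)
```

With a perfect noise prediction the recovery is exact, which is why a well-trained one-step model can skip the iterative denoising loop entirely.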
## Citation

```bibtex
@InProceedings{wang2025osdface,
    author    = {Wang, Jingkai and Gong, Jue and Zhang, Lin and Chen, Zheng and Liu, Xing and Gu, Hong and Liu, Yutong and Zhang, Yulun and Yang, Xiaokang},
    title     = {{OSDFace}: One-Step Diffusion Model for Face Restoration},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {12626-12636}
}
```

## Links

- 📄 [Paper (arXiv)](https://arxiv.org/abs/2411.17163)
- 💻 [Official Repository](https://github.com/jkwang28/OSDFace)
- 🌐 [Project Page](https://www.jingkaiwang.com/OSDFace/)