---
tags:
- face-restoration
- diffusion
- one-step
- stable-diffusion
- lora
- image-to-image
base_model: stabilityai/stable-diffusion-2-1-base
pipeline_tag: image-to-image
---
# OSDFace: Pretrained Weights (Mirror)
![github.com_jkwang28_OSDFace_](https://cdn-uploads.huggingface.co/production/uploads/6557b9b4deee83130ac92941/Qs7aOIkhOyghdH8Bdt5wF.png)
> **This is an unofficial mirror.**
> All credit goes to the original authors. The weights are mirrored here from the [official OSDFace repository](https://github.com/jkwang28/OSDFace) for convenience, as the original download is hosted on OneDrive/Google Drive which can be slow or inaccessible in some regions.
> Please cite the original paper and star the original repo if you use these weights.
## Overview
OSDFace (**One-Step Diffusion Model for Face Restoration**) is a single-step diffusion model that restores degraded, low-quality face images into high-fidelity, identity-consistent outputs. It was accepted at **CVPR 2025**.
Unlike multi-step diffusion approaches, OSDFace requires only **one forward pass** through a modified Stable Diffusion 2.1 UNet, making it significantly faster at inference while achieving state-of-the-art results on both synthetic (CelebA-Test) and real-world (Wider-Test, LFW-Test, WebPhoto-Test) benchmarks.
The key innovations are:
- **Visual Representation Embedder (VRE):** A VQ-VAE encoder that tokenizes the low-quality input face and produces visual prompt embeddings via a vector-quantized dictionary. These embeddings replace the text encoder's output and are fed directly into the UNet's cross-attention layers.
- **Facial Identity Loss:** A face-recognition-derived loss that enforces identity consistency between the restored and ground-truth faces.
- **GAN Guidance:** A generative adversarial network guides the one-step diffusion to align the output distribution with the ground truth.
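The cross-attention layers in the UNet are agnostic to where their 77 conditioning tokens come from, which is what lets the VRE output substitute for CLIP text embeddings. The following NumPy sketch shows this at the shape level; the projection matrices and dimensions are illustrative toys, not the learned weights:

```python
import numpy as np

def cross_attention(query, context, d_k=64, seed=0):
    """Single-head cross-attention with random toy projections: the latent
    queries attend to whatever 77-token conditioning sequence they are given."""
    rng = np.random.default_rng(seed)
    w_q = rng.standard_normal((query.shape[-1], d_k)) * 0.02
    w_k = rng.standard_normal((context.shape[-1], d_k)) * 0.02
    w_v = rng.standard_normal((context.shape[-1], query.shape[-1])) * 0.02
    q, k, v = query @ w_q, context @ w_k, context @ w_v
    scores = (q @ k.transpose(0, 2, 1)) / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(1)
latent_tokens = rng.standard_normal((1, 4096, 320))   # 64x64 latent map, flattened
visual_prompt = rng.standard_normal((1, 77, 1024))    # VRE output after projection
out = cross_attention(latent_tokens, visual_prompt)
print(out.shape)  # (1, 4096, 320)
```

Because only the token count (77) and channel width must match, swapping text embeddings for visual ones requires no change to the UNet architecture itself.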
## Usage
### Prerequisites
- **Base model:** [stabilityai/stable-diffusion-2-1-base](https://huggingface.co/stabilityai/stable-diffusion-2-1-base)
- **Python 3.10**, PyTorch 2.4.0, diffusers 0.27.2
### Quick Start
```bash
# Clone the official repo
git clone https://github.com/jkwang28/OSDFace.git
cd OSDFace
# Download these weights into pretrained/
# Place: associate_2.ckpt, embedding_change_weights.pth, pytorch_lora_weights.safetensors
# Run inference (with LoRA merging for speed)
python infer.py \
--input_image data/WebPhoto-Test \
--output_dir results/WebPhoto-Test \
--pretrained_model_name_or_path "stabilityai/stable-diffusion-2-1-base" \
--img_encoder_weight "pretrained/associate_2.ckpt" \
--ckpt_path pretrained \
--merge_lora \
--mixed_precision fp16 \
--gpu_ids 0
```
> **Note on the base model**
> Although the project is built on `stabilityai/stable-diffusion-2-1-base`, this mirror uses `Manojb/stable-diffusion-2-1-base` because the former cannot currently be downloaded from Hugging Face.
## Files in This Repository
![image](https://cdn-uploads.huggingface.co/production/uploads/6557b9b4deee83130ac92941/rqOY2DHGJ1MhnJkJ5keJ0.png)
### `associate_2.ckpt` (1.87 GB)
The **VQ-VAE image encoder** (referred to as the Visual Representation Embedder in the paper). This is the core component that understands the degraded input face.
It contains a multi-head encoder with downsampling blocks, a mid-block with attention, and a vector quantizer with a learned 1024-entry codebook (embedding dim 512). At inference, the encoder processes a 512×512 low-quality face, extracts spatial features, quantizes them against the codebook, and selects the 77 closest (non-duplicate) codebook entries, producing a `(batch, 77, 512)` tensor that acts as a drop-in replacement for CLIP text embeddings in the UNet's cross-attention.
**Loaded via:** `--img_encoder_weight associate_2.ckpt`
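The quantize-and-select step can be sketched as follows. This is an illustrative reimplementation of the idea, not the repository's exact code, and the random codebook and features stand in for the learned ones:

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.standard_normal((1024, 512))     # stand-in for the learned codebook
features = rng.standard_normal((1024, 512))     # stand-in for encoder spatial features

# Squared Euclidean distance from every spatial position to every code.
dist = ((features ** 2).sum(1)[:, None]
        + (codebook ** 2).sum(1)[None, :]
        - 2.0 * features @ codebook.T)          # (positions, codes)
nearest = dist.argmin(axis=1)                   # best code per position
nearest_dist = dist.min(axis=1)

# Keep the 77 best-matching *unique* codes, ordered by match quality.
seen, picked = set(), []
for i in nearest_dist.argsort():
    code = int(nearest[i])
    if code not in seen:
        seen.add(code)
        picked.append(code)
    if len(picked) == 77:
        break

visual_prompt = codebook[picked]                # (77, 512), one row per token
print(visual_prompt.shape)
```

Deduplicating before truncating to 77 entries keeps the prompt from wasting tokens on codes that many spatial positions map to.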
### `embedding_change_weights.pth` (1.58 MB)
A lightweight **embedding projection module** (`TwoLayerConv1x1`) that maps the VRE output from 512 dimensions to 1024 dimensions, matching the hidden size expected by Stable Diffusion 2.1's UNet cross-attention layers.
Architecture: two 1×1 Conv1d layers with SiLU activations (`512 → 256 → 1024`), operating over the 77-token sequence.
This module is used in the default configuration (without `--cat_prompt_embedding`). When `--cat_prompt_embedding` is enabled, the VRE instead outputs 154 tokens at 512-dim which are reshaped to 77 tokens at 1024-dim, bypassing this module entirely.
**Loaded from:** `<ckpt_path>/embedding_change_weights.pth`
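Since a 1×1 Conv1d over a token sequence is equivalent to a per-token linear layer, the whole module reduces to two matmuls. A minimal NumPy sketch (weights are random stand-ins, and placing SiLU after the first layer only is an assumption about the real module):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def embedding_change(tokens, w1, b1, w2, b2):
    """A 1x1 Conv1d over a token sequence is just a per-token linear map,
    so the 512 -> 256 -> 1024 projection reduces to two matmuls."""
    h = silu(tokens @ w1.T + b1)   # (B, 77, 256)
    return h @ w2.T + b2           # (B, 77, 1024)

rng = np.random.default_rng(0)
w1, b1 = rng.standard_normal((256, 512)) * 0.02, np.zeros(256)
w2, b2 = rng.standard_normal((1024, 256)) * 0.02, np.zeros(1024)

vre_tokens = rng.standard_normal((1, 77, 512))        # VRE visual prompt
projected = embedding_change(vre_tokens, w1, b1, w2, b2)
print(projected.shape)  # (1, 77, 1024)
```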
### `pytorch_lora_weights.safetensors` (67.9 MB)
**LoRA (Low-Rank Adaptation) weights** for the Stable Diffusion 2.1 UNet. These adapt the frozen SD2.1 UNet to perform one-step face restoration conditioned on the VRE embeddings.
Default LoRA configuration: **rank 16, alpha 16** (effective scaling factor `alpha/rank = 1.0`). The weights cover both standard LoRA layers (`lora_A`/`lora_B`) and some additional `lora.up`/`lora.down` layers.
These can be loaded in two ways:
- **Dynamic loading** (default): loaded at runtime via `diffusers`' `load_lora_weights()`
- **Merged loading** (`--merge_lora`): pre-merged into the UNet weights before inference for slightly faster execution
**Loaded from:** `<ckpt_path>/pytorch_lora_weights.safetensors`
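The two loading modes produce identical outputs because the LoRA update is linear: applying `scale * B @ A` at runtime or folding it into the base weight once gives the same result. A small numerical check of that identity, with toy dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 320, 320, 16, 16
scale = alpha / rank                               # 16/16 = 1.0, as in the table

W = rng.standard_normal((d_out, d_in)) * 0.02      # frozen base weight
A = rng.standard_normal((rank, d_in)) * 0.02       # lora_A (down-projection)
B = rng.standard_normal((d_out, rank)) * 0.02      # lora_B (up-projection)
x = rng.standard_normal((77, d_in))

# Dynamic path: keep W frozen, add the low-rank update at runtime.
y_dynamic = x @ W.T + scale * (x @ A.T) @ B.T

# Merged path (--merge_lora): fold the update into W once before inference.
W_merged = W + scale * (B @ A)
y_merged = x @ W_merged.T

print(np.allclose(y_dynamic, y_merged))  # True
```

This is why `--merge_lora` only changes speed, not results: the merged path skips the extra low-rank matmuls on every forward pass.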
## Key Inference Arguments
| Argument | Default | Description |
|---|---|---|
| `--merge_lora` | off | Merge LoRA into UNet weights (recommended) |
| `--mixed_precision` | `fp32` | Use `fp16` for faster inference / lower VRAM |
| `--gpu_ids` | `[0]` | Multi-GPU support, e.g. `--gpu_ids 0 1 2 3` |
| `--cat_prompt_embedding` | off | Alternative embedding strategy (skips embedding_change module) |
| `--lora_rank` | 16 | LoRA rank (must match training) |
| `--lora_alpha` | 16 | LoRA alpha (must match training) |
## Inference Pipeline (Summary)
1. Input image resized to **512Γ—512**
2. VRE encodes the LQ face into a `(B, 77, 512)` visual prompt
3. Embedding projection maps to `(B, 77, 1024)` (or concatenation path)
4. VAE encodes the LQ face to latent space
5. UNet performs a **single denoising step** at timestep 399, conditioned on the visual prompt
6. Predicted clean latent is decoded by the VAE into the restored face
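The six steps above can be traced as a shape-only sketch, with stub functions standing in for the real networks (all names and signatures here are illustrative, not the repository's API):

```python
import numpy as np

# Shape-only stubs standing in for the real networks (all names illustrative).
def vre_encode(lq):              return np.zeros((lq.shape[0], 77, 512))
def embedding_change(p):         return np.zeros((p.shape[0], 77, 1024))
def vae_encode(img):             return np.zeros((img.shape[0], 4, 64, 64))
def unet_one_step(z, t, cond):   return z          # single pass at t = 399
def vae_decode(z):               return np.zeros((z.shape[0], 3, 512, 512))

lq = np.zeros((1, 3, 512, 512))                    # 1. resized LQ face
prompt = embedding_change(vre_encode(lq))          # 2-3. visual prompt (1, 77, 1024)
z = vae_encode(lq)                                 # 4. LQ latent
z0 = unet_one_step(z, t=399, cond=prompt)          # 5. one denoising step
restored = vae_decode(z0)                          # 6. decode to image
print(restored.shape)  # (1, 3, 512, 512)
```

The single fixed-timestep UNet call in step 5 is what distinguishes OSDFace from multi-step diffusion restorers, which loop this call over a denoising schedule.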
## Citation
```bibtex
@InProceedings{wang2025osdface,
author = {Wang, Jingkai and Gong, Jue and Zhang, Lin and Chen, Zheng and Liu, Xing and Gu, Hong and Liu, Yutong and Zhang, Yulun and Yang, Xiaokang},
title = {{OSDFace}: One-Step Diffusion Model for Face Restoration},
booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
month = {June},
year = {2025},
pages = {12626-12636}
}
```
## Links
- πŸ“„ [Paper (arXiv)](https://arxiv.org/abs/2411.17163)
- πŸ’» [Official Repository](https://github.com/jkwang28/OSDFace)
- 🌐 [Project Page](https://www.jingkaiwang.com/OSDFace/)