license: fair-noncommercial-research-license
datasets:
- bitmind/celeb-a-hq
base_model:
- SG161222/Realistic_Vision_V4.0_noVAE
tags:
- text-to-image
- stable-diffusion
- diffusers
- ip-adapter
- face-id
- custom-finetune
language:
- en
library_name: diffusers
IP-Adapter-FaceID-PlusV2-Finetuned (RishabhInCode)
Introduction
This is a custom fine-tuned version of the IP-Adapter-FaceID-PlusV2 model for Stable Diffusion 1.5. It was trained to prioritize identity preservation while keeping compositions realistic across diverse prompts.
The model conditions image generation on FaceID embeddings extracted with the InsightFace buffalo_l model. These embeddings are injected into the UNet cross-attention layers.
- Base Diffusion Model: SG161222/Realistic_Vision_V4.0_noVAE
- VAE: stabilityai/sd-vae-ft-mse
- Image Encoder: laion/CLIP-ViT-H-14-laion2B-s32B-b79K
- Dataset: images sampled from bitmind/celeb-a-hq.
- Optimization: Joint optimization utilizing standard Diffusion Loss paired with Identity Loss (ArcFace Cosine Similarity).
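The joint objective named in the Optimization item can be illustrated with a small framework-free sketch. The source states only that a diffusion loss is paired with an ArcFace cosine-similarity identity loss; the function names, the `1 - cosine similarity` form of the identity term, and the `id_weight` factor below are assumptions made for illustration, not the values or code used to train this model.

```python
import math

def mse(pred, target):
    """Mean squared error, the usual noise-prediction diffusion loss."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def joint_loss(noise_pred, noise, gen_embed, ref_embed, id_weight=0.1):
    """Diffusion loss plus a weighted identity term.

    id_weight is a hypothetical hyperparameter; the model card does not
    report how the two terms were balanced.
    """
    diffusion_loss = mse(noise_pred, noise)
    # Identity term: 0 when the embeddings point the same way, 2 when opposite.
    identity_loss = 1.0 - cosine_similarity(gen_embed, ref_embed)
    return diffusion_loss + id_weight * identity_loss
```

In an actual training loop the same arithmetic would run on tensors, with `gen_embed` coming from a face-recognition model applied to the decoded prediction and `ref_embed` from the reference face.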
Evaluation Metrics
The model was evaluated against the generic zero-shot IP-Adapter baseline. Testing involved generating multiple stylistic variations (cinematic lighting, charcoal sketch, outdoor lighting, etc.) from various seed images.
| Metric | Baseline (Zero-Shot) | Fine-Tuned (This Model) | Note |
|---|---|---|---|
| Identity Score (higher is better) | 0.8327 | 0.8754 | Improved facial structure retention (+0.0427). |
| FID Score (lower is better) | 259.27 | 283.11 | Worse than the baseline (+23.84); attributed to the distributional-gap trade-off of enforcing strict identity constraints. |
Note: In one-to-one sample comparisons, the fine-tuned model reached Identity Scores as high as 0.9680 and a lower sample-specific FID than the baseline (421.97 vs. 448.15).
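To put the two aggregate rows of the table on a common footing, the changes can be expressed relative to the baseline. The helper name `relative_change` is hypothetical; the input numbers are the ones reported in the table.

```python
def relative_change(baseline, finetuned):
    """Return (absolute change, percent change relative to the baseline)."""
    delta = finetuned - baseline
    return delta, 100.0 * delta / baseline

# Values copied from the evaluation table.
identity_delta, identity_pct = relative_change(0.8327, 0.8754)
fid_delta, fid_pct = relative_change(259.27, 283.11)

print(f"Identity Score: {identity_delta:+.4f} ({identity_pct:+.2f}%)")
print(f"FID:            {fid_delta:+.2f} ({fid_pct:+.2f}%)")
```

Read this way, the Identity Score rises by roughly 5% while the aggregate FID rises by roughly 9%, which is the trade-off the table's note column describes.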
Usage
To use this model, you first need to extract the face embedding and aligned face image using insightface.
import cv2
import torch
from insightface.app import FaceAnalysis
from insightface.utils import face_align
from diffusers import StableDiffusionPipeline, DDIMScheduler, AutoencoderKL
from ip_adapter.ip_adapter_faceid import IPAdapterFaceIDPlus
# 1. Setup Face Extraction
app = FaceAnalysis(name="buffalo_l", providers=['CUDAExecutionProvider'])
app.prepare(ctx_id=0, det_size=(640, 640))
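image = cv2.imread("your_seed_image.jpg")  # BGR array, or None if the file cannot be read
if image is None:
    raise FileNotFoundError("Could not read your_seed_image.jpg")
faces = app.get(image)
if not faces:
    raise ValueError("No face detected in the seed image")
# L2-normalized ArcFace embedding of the first detected face, with a batch dimension added
faceid_embeds = torch.from_numpy(faces[0].normed_embedding).unsqueeze(0)
# Aligned 224x224 face crop that is passed to the image encoder
face_image = face_align.norm_crop(image, landmark=faces[0].kps, image_size=224)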
# 2. Setup Pipeline
device = "cuda"
base_model_path = "SG161222/Realistic_Vision_V4.0_noVAE"
vae_model_path = "stabilityai/sd-vae-ft-mse"
image_encoder_path = "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"
ip_ckpt = "ip-adapter-faceid-plusv2_sd15-finetuned_RishabhInCode.bin" # This repo's file
noise_scheduler = DDIMScheduler(
    num_train_timesteps=1000,
    beta_start=0.00085,
    beta_end=0.012,
    beta_schedule="scaled_linear",
    clip_sample=False,
    set_alpha_to_one=False,
    steps_offset=1,
)
vae = AutoencoderKL.from_pretrained(vae_model_path).to(dtype=torch.float16)
pipe = StableDiffusionPipeline.from_pretrained(
    base_model_path,
    torch_dtype=torch.float16,
    scheduler=noise_scheduler,
    vae=vae,
    safety_checker=None,
).to(device)
# 3. Load IP-Adapter with Custom Fine-Tuned Weights
ip_model = IPAdapterFaceIDPlus(pipe, image_encoder_path, ip_ckpt, device)
# 4. Generate
prompt = "a cinematic portrait of the person in cyberpunk lighting"
images = ip_model.generate(
    prompt=prompt,
    face_image=face_image,
    faceid_embeds=faceid_embeds,
    shortcut=True,
    s_scale=1.0,
    num_samples=1,
    width=512,
    height=768,
    num_inference_steps=30,
)
images[0].save("output.png")
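The usage code above indexes `faces[0]`, which is whichever detection comes first. When a seed image contains several people, one option is to select the largest detected face instead. The helper below is a minimal sketch of that idea and is not part of this repository or of InsightFace; it relies only on each detection exposing a `bbox` of `(x1, y1, x2, y2)`, as InsightFace detections do. The stand-in objects exist only to make the example self-contained.

```python
from types import SimpleNamespace

def pick_largest_face(faces):
    """Return the detection with the largest bounding-box area.

    Raises ValueError when the detector returned no faces.
    """
    if not faces:
        raise ValueError("No face detected in the seed image")

    def area(face):
        x1, y1, x2, y2 = face.bbox[:4]
        return max(0.0, x2 - x1) * max(0.0, y2 - y1)

    return max(faces, key=area)

# Stand-in detections for illustration; real ones come from app.get(image).
demo_faces = [
    SimpleNamespace(bbox=(10, 10, 50, 60)),
    SimpleNamespace(bbox=(100, 80, 300, 340)),
]
largest = pick_largest_face(demo_faces)
```

With real detections, `face = pick_largest_face(app.get(image))` would replace the two `faces[0]` lookups, and `face.normed_embedding` and `face.kps` would be used exactly as before.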