---
license: apache-2.0
base_model:
  - Wan-AI/Wan2.2-I2V-14B-480P
tags:
  - facecam
  - portrait-video
  - camera-control
  - wan2.2
  - diffusion
  - safetensors
  - comfyui
pipeline_tag: image-to-video
---

# FaceCam — Merged bf16 Checkpoints

**Portrait Video Camera Control via Scale-Aware Conditioning**

🏔️ CVPR 2026 🏔️ | [Paper (arXiv 2603.05506)](https://arxiv.org/abs/2603.05506) | Code | Project Page | Original Weights

Weijie Lyu, Ming-Hsuan Yang, Zhixin Shu  
University of California, Merced · Adobe Research

## What's in This Repo

Pre-merged single-file bf16 `safetensors` converted from the upstream sharded checkpoints at `wlyu/FaceCam`. These are partial fine-tune checkpoints (self-attention and patch-embedding layers only, ~402 keys each) that patch on top of a base Wan 2.2 14B I2V model.
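Conceptually, "patching on top of" the base model means overwriting only the matching keys in the full base state dict while leaving every other tensor untouched. A minimal sketch, with plain dicts standing in for tensor state dicts (the key names are illustrative, not the actual Wan 2.2 layout):

```python
def apply_partial_checkpoint(base_sd: dict, patch_sd: dict):
    """Overlay a partial fine-tune onto a full base state dict.

    Only keys present in the patch are replaced; every other tensor
    keeps its base weights. Patch keys with no counterpart in the
    base are returned separately instead of being silently added.
    """
    merged = dict(base_sd)
    unmatched = []
    for key, value in patch_sd.items():
        if key in merged:
            merged[key] = value
        else:
            unmatched.append(key)
    return merged, unmatched

# Illustrative key names only, not the real Wan 2.2 schema.
base = {
    "blocks.0.self_attn.q.weight": "base_q",
    "blocks.0.ffn.0.weight": "base_ffn",
    "patch_embedding.weight": "base_pe",
}
patch = {
    "blocks.0.self_attn.q.weight": "ft_q",
    "patch_embedding.weight": "ft_pe",
}

merged, unmatched = apply_partial_checkpoint(base, patch)
```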

| File | Description | Size |
| --- | --- | --- |
| `facecam_wan2.2_14b_high_bf16.safetensors` | High-noise stage DiT (camera trajectory) | ~7.9 GB |
| `facecam_wan2.2_14b_low_bf16.safetensors` | Low-noise stage DiT (detail refinement) | ~7.9 GB |
| `gaussians.ply` | 3D Gaussian head proxy for camera conditioning | ~43 MB |
| `face_landmarker_v2_with_blendshapes.task` | MediaPipe face landmarker for conditioning extraction | ~3.6 MB |

## Usage with ComfyUI-FFMPEGA

These weights are used by the FaceCam node in ComfyUI-FFMPEGA.

1. Place `facecam_wan2.2_14b_*_bf16.safetensors` in `ComfyUI/models/diffusion_models/`
2. Place `gaussians.ply` and `face_landmarker_v2_with_blendshapes.task` in `ComfyUI/models/facecam/`
3. Load a Wan 2.2 14B I2V base model (e.g. a GGUF Q4_K_M quantization) via "Load Diffusion Model"
4. Connect it to the FaceCam node along with the FaceCam checkpoint
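A quick way to confirm the files from steps 1–2 landed in the right places is to check the expected layout on disk. A sketch, assuming a ComfyUI install at `~/ComfyUI` (adjust the root to your setup):

```python
from pathlib import Path

COMFY = Path.home() / "ComfyUI"  # assumed install root

EXPECTED = {
    COMFY / "models" / "diffusion_models": [
        "facecam_wan2.2_14b_high_bf16.safetensors",
        "facecam_wan2.2_14b_low_bf16.safetensors",
    ],
    COMFY / "models" / "facecam": [
        "gaussians.ply",
        "face_landmarker_v2_with_blendshapes.task",
    ],
}


def missing_files(expected=EXPECTED):
    """Return the expected files that are not present on disk."""
    return [d / name for d, names in expected.items()
            for name in names if not (d / name).is_file()]


if __name__ == "__main__":
    for path in missing_files():
        print(f"missing: {path}")
```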

## Pipeline

FaceCam generates portrait videos with precise camera control from a single input video:

1. Face-centered crop of the input video
2. 3D Gaussian proxy rendering for camera-trajectory conditioning (via `gaussians.ply`)
3. MediaPipe face-landmark extraction from the proxy → `camera_cond`
4. VAE-encode `video_cond` + `camera_cond`
5. Wan 2.2 DiT inference with FaceCam conditioning:
   - Two-stage denoising: HIGH model (first ~90% of steps) → LOW model (final ~10%)
   - Temporal concat: `[noise_latents | video_cond_latents]`
   - Channel concat: `[camera_cond_latents | i2v_y]`
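The two-stage schedule and the two concatenations in step 5 can be sketched as follows. Python lists stand in for latent tensors and plain callables for the two DiTs; the 90/10 boundary and variable names mirror the description above, not the node's exact implementation:

```python
def facecam_denoise(noise_latents, video_cond_latents,
                    camera_cond_latents, i2v_y,
                    high_model, low_model, steps=20, high_frac=0.9):
    """Two-stage denoising over concatenated FaceCam latents (sketch)."""
    # Temporal concat: conditioning frames appended along the frame axis.
    x = noise_latents + video_cond_latents
    # Channel concat: camera conditioning alongside the I2V image latent.
    cond = camera_cond_latents + i2v_y
    boundary = round(steps * high_frac)  # HIGH model covers the first ~90%
    for t in range(steps):
        model = high_model if t < boundary else low_model
        x = model(x, cond, t)
    return x
```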

## Citation

```bibtex
@misc{facecam,
  title   = {FaceCam: Portrait Video Camera Control via Scale-Aware Conditioning},
  author  = {Weijie Lyu and Ming-Hsuan Yang and Zhixin Shu},
  year    = {2026},
  eprint  = {2603.05506},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url     = {https://arxiv.org/abs/2603.05506},
}
```

## License

These model weights are released under the Apache License 2.0, matching the license of the upstream FaceCam repository.

## Acknowledgements