Audio2Lipsync

Audio-driven lipsync model for Unreal Engine MetaHumans. Given a mono 16 kHz audio clip, the model predicts 52-channel ARKit blendshape coefficients at 60 fps: frame-accurate mouth, jaw, tongue, and cheek motion that drops straight into the MetaHuman face rig via LiveLink.
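For sizing purposes, the input/output contract reduces to simple arithmetic. The snippet below is illustrative only; the (frames, 52) array layout is an assumption, not the repo's actual return type.

# Back-of-the-envelope I/O sizing; the (frames, 52) layout is assumed for illustration.
sample_rate = 16_000          # required input rate, mono
fps = 60                      # output blendshape frame rate
num_samples = 48_000          # e.g. a 3-second clip
num_frames = round(num_samples / sample_rate * fps)   # -> 180 frames
# Each frame carries 52 ARKit coefficients, so a 3-second clip yields a 180 x 52 array.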

Companion code repository: https://github.com/aaryansachdeva/unreal-audio2lipsync

Architecture

  • Encoder: HuBERT-Large (frozen), extracting phonetic features from 16 kHz audio
  • Decoder: 8-layer Transformer (d=512, 8 heads, FF=2048, dropout=0.2)
  • Output: 52 ARKit blendshape coefficients per frame at 60 fps
  • Loss: Weighted masked L1 (mouth channels get 3× weight) + velocity loss; see the sketch after this list
  • Training data: iPhone Live Link Face recordings (ARKit blendshapes) paired with their audio tracks
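Training code is not distributed with this checkpoint; the following is a minimal PyTorch sketch of the decoder and loss described above. The use of nn.TransformerEncoder for the 8-layer stack, the 1024-d HuBERT-Large feature size, and the exact masking and velocity terms are assumptions, and resampling the HuBERT feature rate to 60 fps is omitted; the authoritative implementation is in the code repo.

import torch
import torch.nn as nn

class BlendshapeDecoder(nn.Module):
    """Sketch: 8-layer Transformer over HuBERT features -> 52 ARKit coefficients per frame."""
    def __init__(self, hubert_dim=1024, d_model=512, nhead=8, ff=2048, layers=8,
                 dropout=0.2, n_blendshapes=52):
        super().__init__()
        self.proj = nn.Linear(hubert_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, ff, dropout, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, layers)
        self.head = nn.Linear(d_model, n_blendshapes)

    def forward(self, feats):                      # feats: (B, T, hubert_dim)
        x = self.backbone(self.proj(feats))
        return self.head(x)                        # (B, T, 52), one row per output frame

def weighted_masked_l1(pred, target, mask, mouth_idx, mouth_weight=3.0, vel_weight=1.0):
    """Sketch of the loss: masked L1 with 3x weight on mouth channels, plus a velocity term."""
    w = torch.ones(pred.shape[-1], device=pred.device)
    w[mouth_idx] = mouth_weight                    # mouth channels get 3x weight
    m = mask.unsqueeze(-1).float()                 # (B, T, 1), 1 = valid (unpadded) frame
    denom = (m.sum() * pred.shape[-1]).clamp(min=1)
    pos = ((pred - target).abs() * w * m).sum() / denom
    dp = pred[:, 1:] - pred[:, :-1]                # frame-to-frame deltas
    dt = target[:, 1:] - target[:, :-1]
    vel = ((dp - dt).abs() * w * m[:, 1:]).sum() / denom
    return pos + vel_weight * vel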

Files

File     Purpose
best.pt  EMA inference checkpoint (trainable params only; the frozen HuBERT encoder is loaded from torchaudio)
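A minimal loading sketch, assuming best.pt holds a plain state_dict for a decoder like the one sketched above and that the frozen encoder is torchaudio's HUBERT_LARGE bundle. The checkpoint's actual key layout, the choice of HuBERT feature layer, and the 50 Hz-to-60 fps frame resampling are all defined by the code repo, not here.

import torch
import torchaudio

bundle = torchaudio.pipelines.HUBERT_LARGE          # frozen encoder, loaded from torchaudio
hubert = bundle.get_model().eval()

decoder = BlendshapeDecoder()                        # sketch class from the Architecture section
state = torch.load("checkpoints/best.pt", map_location="cpu")
decoder.load_state_dict(state.get("model", state), strict=False)   # key layout is an assumption
decoder.eval()

wav, sr = torchaudio.load("clip.wav")
wav = wav.mean(0, keepdim=True)                      # force mono
wav = torchaudio.functional.resample(wav, sr, bundle.sample_rate)   # resample to 16 kHz
with torch.inference_mode():
    feats, _ = hubert.extract_features(wav)          # per-layer feature list
    coeffs = decoder(feats[-1])                      # (1, T, 52) blendshape coefficients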

Usage

Clone the code repo, install its Python dependencies, and point the sidecar server at this checkpoint:

git clone https://github.com/aaryansachdeva/unreal-audio2lipsync.git
cd unreal-audio2lipsync
pip install -r python/requirements.txt

# Download best.pt from this HF repo into a local checkpoints/ dir, then:
python python/src/server.py --ckpt checkpoints/best.pt --stats-dir python/stats

See the code repo README for full Unreal Engine integration instructions.

License

MIT: use commercially, modify, and redistribute; no obligations beyond keeping the license notice. See LICENSE in the code repo.
