# Audio2Lipsync
Audio-driven lipsync model for Unreal Engine MetaHumans. Given a mono 16 kHz audio clip, the model predicts 52-channel ARKit blendshape coefficients at 60 fps: frame-accurate mouth, jaw, tongue, and cheek motion that drops straight into the MetaHuman face rig via LiveLink.
Companion code repository: https://github.com/aaryansachdeva/unreal-audio2lipsync
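The I/O contract is simple: mono 16 kHz audio in, one 52-dimensional ARKit coefficient vector per 60 fps frame out. Below is a minimal preprocessing sketch; the commented `predict_blendshapes` call is a hypothetical placeholder, not the repo's actual inference API.

```python
import torchaudio

# Load a voice clip and convert it to the mono 16 kHz format the model expects.
waveform, sr = torchaudio.load("line_read.wav")            # (channels, samples)
waveform = waveform.mean(dim=0, keepdim=True)              # downmix to mono
if sr != 16_000:
    waveform = torchaudio.functional.resample(waveform, sr, 16_000)

# Hypothetical inference call (see the code repo for the real entry point);
# the output is one row of 52 ARKit blendshape coefficients per 60 fps frame.
# coeffs = predict_blendshapes(waveform)                    # shape: (num_frames, 52)
```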
## Architecture
- Encoder: HuBERT-Large (frozen) → phonetic features from 16 kHz audio
- Decoder: 8-layer Transformer (d=512, 8 heads, FF=2048, dropout=0.2)
- Output: 52 ARKit blendshape coefficients per frame at 60 fps
- Loss: Weighted masked L1 (mouth channels get 3× weight) + velocity loss (sketched after this list)
- Training data: iPhone Live Link Face recordings (ARKit blendshapes) paired with their audio tracks
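As a rough illustration of the objective, the sketch below implements a masked, channel-weighted L1 with a 3× weight on mouth channels plus an L1 velocity term on frame-to-frame differences. The channel indices, velocity weight, and function name are assumptions for illustration, not values taken from the training code.

```python
import torch

def lipsync_loss(pred, target, mask, mouth_idx, mouth_weight=3.0, vel_weight=1.0):
    """pred, target: (B, T, 52) blendshape coefficients; mask: (B, T), 1 for valid frames.
    mouth_idx: indices of mouth/jaw channels to up-weight (assumed, not from the repo)."""
    # Per-channel weights: 3x on mouth channels, 1x elsewhere.
    weights = torch.ones(pred.shape[-1], device=pred.device)
    weights[mouth_idx] = mouth_weight

    # Masked, channel-weighted L1 on the coefficients themselves.
    l1 = (pred - target).abs() * weights                                  # (B, T, 52)
    l1 = (l1.sum(-1) * mask).sum() / mask.sum().clamp(min=1)

    # Velocity loss: L1 on first differences, masked so pairs spanning padding are ignored.
    pred_vel = pred[:, 1:] - pred[:, :-1]
    tgt_vel = target[:, 1:] - target[:, :-1]
    vel_mask = mask[:, 1:] * mask[:, :-1]
    vel = ((pred_vel - tgt_vel).abs().sum(-1) * vel_mask).sum() / vel_mask.sum().clamp(min=1)

    return l1 + vel_weight * vel
```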
## Files
| File | Purpose |
|---|---|
| best.pt | EMA inference checkpoint (trainable params only; frozen HuBERT loaded from torchaudio) |
## Usage
Clone the code repo, install its Python dependencies, and point the sidecar server at this checkpoint:
```bash
git clone https://github.com/aaryansachdeva/unreal-audio2lipsync.git
cd unreal-audio2lipsync
pip install -r python/requirements.txt
# Download best.pt from this HF repo into a local checkpoints/ dir, then:
python python/src/server.py --ckpt checkpoints/best.pt --stats-dir python/stats
```
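As an optional sanity check before launching the server, the downloaded checkpoint can be inspected on CPU. The key layout is an assumption here; per the Files table, best.pt holds only the trainable parameters as EMA weights, with the frozen HuBERT encoder pulled in separately from torchaudio.

```python
import torch

# Load the checkpoint on CPU and list its top-level keys (assumed to be a
# plain dict of tensors / sub-dicts; adjust if the repo stores extra metadata).
ckpt = torch.load("checkpoints/best.pt", map_location="cpu")
if isinstance(ckpt, dict):
    for key in ckpt:
        print(key)
else:
    print(type(ckpt))
```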
See the code repo README for full Unreal Engine integration instructions.
## License
MIT: use commercially, modify, redistribute; no obligations beyond keeping the license notice. See LICENSE in the code repo.