Audio2Lipsync

Audio-driven lipsync model for Unreal Engine MetaHumans. Given a mono 16 kHz audio clip, the model predicts 52-channel ARKit blendshape coefficients at 60 fps: frame-accurate mouth, jaw, tongue, and cheek motion that drops straight into the MetaHuman face rig via LiveLink.
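For sizing purposes, the input/output contract reduces to simple arithmetic. The snippet below is illustrative only; the (frames, 52) array layout is an assumption, not the repo's actual return type.

# Back-of-the-envelope I/O sizing; the (frames, 52) layout is assumed for illustration.
sample_rate = 16_000          # required input rate, mono
fps = 60                      # output blendshape frame rate
num_samples = 48_000          # e.g. a 3-second clip
num_frames = round(num_samples / sample_rate * fps)   # -> 180 frames
# Each frame carries 52 ARKit coefficients, so a 3-second clip yields a 180 x 52 array.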

Companion code repository: https://github.com/aaryansachdeva/unreal-audio2lipsync

Architecture

  • Encoder: HuBERT-Large (frozen), extracting phonetic features from 16 kHz audio
  • Decoder: 8-layer Transformer (d=512, 8 heads, FF=2048, dropout=0.2)
  • Output: 52 ARKit blendshape coefficients per frame at 60 fps
  • Loss: Weighted masked L1 (mouth channels get 3× weight) + velocity loss; see the sketch after this list
  • Training data: iPhone Live Link Face recordings (ARKit blendshapes) paired with their audio tracks
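Training code is not distributed with this checkpoint; the following is a minimal PyTorch sketch of the decoder and loss described above. The use of nn.TransformerEncoder for the 8-layer stack, the 1024-d HuBERT-Large feature size, and the exact masking and velocity terms are assumptions, and resampling the HuBERT feature rate to 60 fps is omitted; the authoritative implementation is in the code repo.

import torch
import torch.nn as nn

class BlendshapeDecoder(nn.Module):
    """Sketch: 8-layer Transformer over HuBERT features -> 52 ARKit coefficients per frame."""
    def __init__(self, hubert_dim=1024, d_model=512, nhead=8, ff=2048, layers=8,
                 dropout=0.2, n_blendshapes=52):
        super().__init__()
        self.proj = nn.Linear(hubert_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, ff, dropout, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, layers)
        self.head = nn.Linear(d_model, n_blendshapes)

    def forward(self, feats):                      # feats: (B, T, hubert_dim)
        x = self.backbone(self.proj(feats))
        return self.head(x)                        # (B, T, 52), one row per output frame

def weighted_masked_l1(pred, target, mask, mouth_idx, mouth_weight=3.0, vel_weight=1.0):
    """Sketch of the loss: masked L1 with 3x weight on mouth channels, plus a velocity term."""
    w = torch.ones(pred.shape[-1], device=pred.device)
    w[mouth_idx] = mouth_weight                    # mouth channels get 3x weight
    m = mask.unsqueeze(-1).float()                 # (B, T, 1), 1 = valid (unpadded) frame
    denom = (m.sum() * pred.shape[-1]).clamp(min=1)
    pos = ((pred - target).abs() * w * m).sum() / denom
    dp = pred[:, 1:] - pred[:, :-1]                # frame-to-frame deltas
    dt = target[:, 1:] - target[:, :-1]
    vel = ((dp - dt).abs() * w * m[:, 1:]).sum() / denom
    return pos + vel_weight * vel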

Files

File     Purpose
best.pt  EMA inference checkpoint (trainable params only; the frozen HuBERT encoder is loaded from torchaudio)
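A minimal loading sketch, assuming best.pt holds a plain state_dict for a decoder like the one sketched above and that the frozen encoder is torchaudio's HUBERT_LARGE bundle. The checkpoint's actual key layout, the choice of HuBERT feature layer, and the 50 Hz-to-60 fps frame resampling are all defined by the code repo, not here.

import torch
import torchaudio

bundle = torchaudio.pipelines.HUBERT_LARGE          # frozen encoder, loaded from torchaudio
hubert = bundle.get_model().eval()

decoder = BlendshapeDecoder()                        # sketch class from the Architecture section
state = torch.load("checkpoints/best.pt", map_location="cpu")
decoder.load_state_dict(state.get("model", state), strict=False)   # key layout is an assumption
decoder.eval()

wav, sr = torchaudio.load("clip.wav")
wav = wav.mean(0, keepdim=True)                      # force mono
wav = torchaudio.functional.resample(wav, sr, bundle.sample_rate)   # resample to 16 kHz
with torch.inference_mode():
    feats, _ = hubert.extract_features(wav)          # per-layer feature list
    coeffs = decoder(feats[-1])                      # (1, T, 52) blendshape coefficients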

Usage

Clone the code repo, install its Python dependencies, and point the sidecar server at this checkpoint:

git clone https://github.com/aaryansachdeva/unreal-audio2lipsync.git
cd unreal-audio2lipsync
pip install -r python/requirements.txt

# Download best.pt from this HF repo into a local checkpoints/ dir, then:
python python/src/server.py --ckpt checkpoints/best.pt --stats-dir python/stats

See the code repo README for full Unreal Engine integration instructions.

License

MIT: use commercially, modify, and redistribute; no obligations beyond keeping the license notice. See LICENSE in the code repo.
