---
base_model:
- Wan-AI/Wan2.1-T2V-14B
license: agpl-3.0
pipeline_tag: image-to-video
---
# MoCha: End-to-End Video Character Replacement without Structural Guidance

Paper | Project Page | GitHub
MoCha is a framework for controllable video character replacement: it replaces a character in a video with a user-provided identity using only a mask on a single, arbitrarily chosen frame.

Unlike prior reconstruction-based methods, MoCha requires neither per-frame segmentation masks nor explicit structural guidance such as skeletons or depth maps. This makes it more robust in complex scenarios involving occlusion, unusual poses, or challenging illumination.
## Key Features
- End-to-End Replacement: Bypasses the need for per-frame masks and structural guidance.
- Identity Preservation: Uses condition-aware RoPE and RL-based post-training to enhance facial identity and adapt to multi-modal inputs.
- Robustness: Handles character-object interactions and complex scenarios better than previous state-of-the-art methods.
- Data Construction: Trained on specialized high-fidelity datasets including UE5-rendered videos and expression-driven portrait animations.
## Usage
To use MoCha, please refer to the official GitHub repository for environment setup and inference scripts.
The basic inference workflow requires:
- Source Video: The original video with the character to be replaced.
- Designation Mask: A mask on a single frame (e.g., the first) marking the character to be replaced.
- Reference Images: Images of the new character identity.
```shell
python inference_mocha.py --data_path path/to/your/data.csv
```
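The inference script consumes a CSV that lists, per row, the source video, the single-frame mask, and the reference identity images. The exact schema is defined in the official repository; as a minimal sketch, the column names `video_path`, `mask_path`, and `ref_image_paths` below are assumptions for illustration, not the confirmed format:

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical column names -- consult the official MoCha repo for the real schema.
FIELDS = ["video_path", "mask_path", "ref_image_paths"]

def write_mocha_csv(rows, out_path):
    """Write one inference entry per row to a CSV consumable by the script."""
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)

rows = [{
    "video_path": "videos/source.mp4",                  # original video
    "mask_path": "masks/frame0_mask.png",               # single-frame character mask
    "ref_image_paths": "refs/id_0.png;refs/id_1.png",   # new identity images
}]
out = Path(tempfile.gettempdir()) / "data.csv"
write_mocha_csv(rows, out)
```

The resulting file is then passed to `inference_mocha.py` via `--data_path`.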
## Citation
If you find MoCha helpful for your research, please cite:
```bibtex
@article{orange2025mocha,
  title={MoCha: End-to-End Video Character Replacement without Structural Guidance},
  author={Zhengbo Xu and Jie Ma and Ziheng Wang and Zhan Peng and Jun Liang and Jing Li},
  journal={arXiv preprint arXiv:2601.08587},
  year={2026},
  url={https://github.com/Orange-3DV-Team/MoCha}
}
```