---
base_model:
- Wan-AI/Wan2.1-T2V-14B
license: agpl-3.0
pipeline_tag: image-to-video
---
# MoCha: End-to-End Video Character Replacement without Structural Guidance

Paper | Project Page | GitHub
MoCha is a framework for controllable video character replacement: it replaces a character in a video with a user-provided identity using only a mask on a single, arbitrarily chosen frame.

Unlike prior reconstruction-based methods, MoCha requires neither per-frame segmentation masks nor explicit structural guidance such as skeletons or depth maps. This makes it more robust in complex scenarios involving occlusion, unusual poses, or challenging illumination.
## Key Features
- End-to-End Replacement: Bypasses the need for per-frame masks and structural guidance.
- Identity Preservation: Uses condition-aware RoPE and RL-based post-training to enhance facial identity and adapt to multi-modal inputs.
- Robustness: Handles character-object interactions and complex scenarios better than previous state-of-the-art methods.
- Data Construction: Trained on specialized high-fidelity datasets including UE5-rendered videos and expression-driven portrait animations.
## Usage
To use MoCha, please refer to the official GitHub repository for environment setup and inference scripts.
The basic inference workflow requires:
- Source Video: The original video with the character to be replaced.
- Designation Mask: A mask on a single frame (e.g., the first) marking the character to be replaced.
- Reference Images: Images of the new character identity.
```shell
python inference_mocha.py --data_path path/to/your/data.csv
```
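The inference script consumes a CSV that lists, per row, the source video, the single-frame mask, and the reference identity images. The exact schema is defined in the official repository; as a minimal sketch, the column names `video_path`, `mask_path`, and `ref_image_paths` below are assumptions for illustration, not the confirmed format:

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical column names -- consult the official MoCha repo for the real schema.
FIELDS = ["video_path", "mask_path", "ref_image_paths"]

def write_mocha_csv(rows, out_path):
    """Write one inference entry per row to a CSV consumable by the script."""
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)

rows = [{
    "video_path": "videos/source.mp4",                  # original video
    "mask_path": "masks/frame0_mask.png",               # single-frame character mask
    "ref_image_paths": "refs/id_0.png;refs/id_1.png",   # new identity images
}]
out = Path(tempfile.gettempdir()) / "data.csv"
write_mocha_csv(rows, out)
```

The resulting file is then passed to `inference_mocha.py` via `--data_path`.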
## Citation
If you find MoCha helpful for your research, please cite:
```bibtex
@article{orange2025mocha,
  title={MoCha: End-to-End Video Character Replacement without Structural Guidance},
  author={Zhengbo Xu and Jie Ma and Ziheng Wang and Zhan Peng and Jun Liang and Jing Li},
  journal={arXiv preprint arXiv:2601.08587},
  year={2026},
  url={https://github.com/Orange-3DV-Team/MoCha}
}
```