---
license: apache-2.0
base_model: Wan-AI/Wan2.1-I2V-14B-480P
pipeline_tag: image-to-video
tags:
- video-generation
- diffusion
- lora
- hand-pose
- egocentric
- controllable-generation
---

# JointControlVideo: MANO Hand-Conditioned Video Generation

Official checkpoint for **"Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints"** (ECCV 2026).

Given a starting frame and a sequence of 3D MANO hand-joint trajectories, this checkpoint generates an egocentric video that follows the prescribed hand motion. The conditioning uses occlusion-aware 3D geometric hand-joint embeddings injected directly into the latent space of [Wan2.1-I2V-14B-480P](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-480P), rather than dense 2D tracks or a separate pose tokenizer.

[![arXiv](https://img.shields.io/badge/arXiv-2603.11755-b31b1b.svg)](https://arxiv.org/pdf/2603.11755)
[![Project Page](https://img.shields.io/badge/Project-Page-blue.svg)](https://zhangcyg.github.io/handcontrolvideo/)
[![Code](https://img.shields.io/badge/GitHub-Code-black.svg)](https://github.com/ZhangCYG/JointControlVideo)

## Files

| File | Description |
| --- | --- |
| `dit.safetensors` | LoRA weights for the Wan2.1 DiT (modules `q,k,v,o,ffn.0,ffn.2`, rank 64), plus the expanded `patch_embedding` (+16 input channels for the fused hand embedding) |
| `hand-controller.safetensors` | `HandConditioningModule`, the occlusion-aware hand-joint conditioning network (42 MANO joints: 21 per hand, two hands) |

Both files are trained on top of a **frozen** Wan2.1-I2V-14B-480P backbone. You still need the base model's DiT, VAE, text-encoder, and CLIP weights, which are pulled automatically from [`Wan-AI/Wan2.1-I2V-14B-480P`](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-480P) the first time you run the pipeline.

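To check what the two files contain before wiring them into the pipeline, the header of a `.safetensors` file can be read with the Python standard library alone. The sketch below is illustrative and not part of the released code: the helper names `read_safetensors_header` and `summarize` are hypothetical, the local path is a placeholder, and the assumption that tensor names in `dit.safetensors` contain the substring `patch_embedding` follows the file description above but has not been verified against the file itself.

```python
import json
import struct
from pathlib import Path


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without loading tensor data.

    File layout: an 8-byte little-endian unsigned header length, followed by
    that many bytes of UTF-8 JSON mapping each tensor name to its dtype,
    shape, and data offsets.
    """
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header


def summarize(header, needle):
    """List (name, dtype, shape) for tensors whose name contains `needle`."""
    return [
        (name, info["dtype"], info["shape"])
        for name, info in sorted(header.items())
        if needle in name
    ]


if __name__ == "__main__":
    # Placeholder path: point it at wherever you saved the checkpoint.
    ckpt = Path("dit.safetensors")
    if ckpt.exists():
        header = read_safetensors_header(ckpt)
        print(f"{len(header)} tensors")
        # Assumed substring; adjust it after looking at the real key names.
        for row in summarize(header, "patch_embedding"):
            print(row)
```

Because only the header is parsed, this runs in well under a second even on large checkpoints, which makes it a convenient way to confirm the LoRA rank or the widened input-channel dimension before loading the full base model.
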
## Usage

Set up the code from [ZhangCYG/JointControlVideo](https://github.com/ZhangCYG/JointControlVideo).

## Citation

```bibtex
@inproceedings{zhang2026controllable,
  title     = {Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints},
  author    = {Zhang, Chenyangguang and Ye, Botao and Chen, Boqi and Delitzas, Alexandros and Wang, Fangjinhua and Pollefeys, Marc and Wang, Xi},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}
```

## License

Apache 2.0.