---
license: apache-2.0
base_model: Wan-AI/Wan2.1-I2V-14B-480P
pipeline_tag: image-to-video
tags:
- video-generation
- diffusion
- lora
- hand-pose
- egocentric
- controllable-generation
---

# JointControlVideo: MANO Hand-Conditioned Video Generation

Official checkpoint for **"Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints"** (ECCV 2026).

Given a starting frame and a sequence of 3D MANO hand-joint trajectories, this checkpoint generates an egocentric video that follows the prescribed hand motion. Conditioning uses occlusion-aware 3D geometric hand-joint embeddings injected directly into the latent space of [Wan2.1-I2V-14B-480P](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-480P), rather than dense 2D tracks or a separate pose tokenizer.

[Paper](https://arxiv.org/pdf/2603.11755) · [Project page](https://zhangcyg.github.io/handcontrolvideo/) · [Code](https://github.com/ZhangCYG/JointControlVideo)

## Files

| File | Description |
| --- | --- |
| `dit.safetensors` | LoRA weights for the Wan2.1 DiT (target modules `q,k,v,o,ffn.0,ffn.2`, rank 64), plus the expanded `patch_embedding` (16 extra input channels for the fused hand embedding) |
| `hand-controller.safetensors` | `HandConditioningModule`, the occlusion-aware hand-joint conditioning network (42 MANO joints: 2 hands × 21 joints each) |

Both are trained on top of a **frozen** Wan2.1-I2V-14B-480P backbone. You still need the base model's DiT, VAE, text-encoder, and CLIP weights, which are pulled automatically from [`Wan-AI/Wan2.1-I2V-14B-480P`](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-480P) the first time you run the pipeline.
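
To check what the two files contain without loading any weights, the safetensors header can be read directly. The following is a minimal sketch using only the Python standard library. It relies on the safetensors file layout (an 8-byte little-endian header length followed by a JSON header); the helper names `read_safetensors_header` and `summarize`, the local file paths, and the `patch_embedding` keyword filter are illustrative assumptions, and the actual tensor key names inside the checkpoint may differ.

```python
import json
import struct
from pathlib import Path


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without reading tensor data."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian unsigned 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional string-to-string map, not a tensor entry.
    header.pop("__metadata__", None)
    return header


def summarize(path, keyword=""):
    """Return sorted (name, dtype, shape) tuples for tensors whose name contains `keyword`."""
    header = read_safetensors_header(path)
    return sorted(
        (name, info["dtype"], tuple(info["shape"]))
        for name, info in header.items()
        if keyword in name
    )


if __name__ == "__main__":
    # Assumed local paths: adjust to wherever the two files were downloaded.
    for filename in ("dit.safetensors", "hand-controller.safetensors"):
        if Path(filename).exists():
            rows = summarize(filename)
            print(f"{filename}: {len(rows)} tensors")
            for name, dtype, shape in rows[:10]:
                print(f"  {name}  {dtype}  {shape}")
```

Calling `summarize("dit.safetensors", "patch_embedding")` would, if the key names contain that substring, show the shape of the expanded input projection, which is one way to confirm the extra 16 input channels described in the table above.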

## Usage

Set up the code from [ZhangCYG/JointControlVideo](https://github.com/ZhangCYG/JointControlVideo).

## Citation

```bibtex
@inproceedings{zhang2026controllable,
  title = {Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints},
  author = {Zhang, Chenyangguang and Ye, Botao and Chen, Boqi and Delitzas, Alexandros and Wang, Fangjinhua and Pollefeys, Marc and Wang, Xi},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year = {2026}
}
```

## License

Apache 2.0.