JointControlVideo — MANO Hand-Conditioned Video Generation

Official checkpoint for "Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints" (ECCV 2026).

Given a starting frame and a sequence of 3D MANO hand-joint trajectories, this checkpoint generates an egocentric video that follows the prescribed hand motion. Conditioning uses occlusion-aware 3D geometric hand-joint embeddings injected directly into the latent space of Wan2.1-I2V-14B-480P, rather than dense 2D tracks or a separate pose tokenizer.

Files

| File | Description |
| --- | --- |
| `dit.safetensors` | LoRA weights for the Wan2.1 DiT (target modules `q,k,v,o,ffn.0,ffn.2`, rank 64), plus the expanded `patch_embedding` (16 extra input channels for the fused hand embedding) |
| `hand-controller.safetensors` | `HandConditioningModule`: the occlusion-aware hand-joint conditioning network (42 MANO joints: 21 per hand, two hands) |
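
To give a rough sense of how small a rank-64 adapter is compared with the layers it modifies, the sketch below counts the extra weights LoRA adds to a single linear layer. The rank and the target module names come from the table above. The hidden and feed-forward widths (`HIDDEN_DIM`, `FFN_DIM`) are assumptions for illustration only and are not stated in this card; check them against the base model's config before relying on the numbers.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Extra weights LoRA adds to one d_in -> d_out linear layer.

    LoRA learns an update B @ A with A of shape (rank, d_in) and
    B of shape (d_out, rank), so it holds rank * (d_in + d_out) weights.
    """
    return rank * (d_in + d_out)


RANK = 64  # from the Files table

# ASSUMPTION: illustrative layer widths, not taken from this model card.
HIDDEN_DIM = 5120
FFN_DIM = 13824

# (d_in, d_out) for each LoRA target module named in the Files table.
target_shapes = {
    "q": (HIDDEN_DIM, HIDDEN_DIM),
    "k": (HIDDEN_DIM, HIDDEN_DIM),
    "v": (HIDDEN_DIM, HIDDEN_DIM),
    "o": (HIDDEN_DIM, HIDDEN_DIM),
    "ffn.0": (HIDDEN_DIM, FFN_DIM),
    "ffn.2": (FFN_DIM, HIDDEN_DIM),
}

per_module = {
    name: lora_param_count(d_in, d_out, RANK)
    for name, (d_in, d_out) in target_shapes.items()
}

for name, count in per_module.items():
    d_in, d_out = target_shapes[name]
    share = count / (d_in * d_out)
    print(f"{name:>6}: {count:>9,d} LoRA weights ({share:.2%} of the dense layer)")
```

This counts one instance of each target module. The total adapter size depends on how many blocks and attention types the target names match, which this card does not specify.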

Both are trained on top of a frozen Wan2.1-I2V-14B-480P backbone; you still need the base model's DiT/VAE/text-encoder/CLIP weights, which are pulled automatically from Wan-AI/Wan2.1-I2V-14B-480P the first time you run the pipeline.

Usage

Set up the code from the ZhangCYG/JointControlVideo repository.
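
If you want to fetch the two checkpoint files ahead of time, a minimal sketch using `huggingface_hub` might look like the following. The repository id is the one shown on this model page, and the file names come from the Files section. The `checkpoints` target directory and the `download_checkpoints` helper are hypothetical; where the JointControlVideo code expects the files to live is not stated here, so follow that repository's instructions for the final paths.

```python
from pathlib import Path

# Repository id as shown on this model page; file names from the Files section.
REPO_ID = "CyrusZhang312/JointControlvideo"
CHECKPOINT_FILES = ("dit.safetensors", "hand-controller.safetensors")


def download_checkpoints(local_dir: str = "checkpoints") -> dict[str, Path]:
    """Download both checkpoint files and return their local paths.

    ASSUMPTION: `local_dir` is an arbitrary location chosen for this sketch.
    """
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    paths = {}
    for name in CHECKPOINT_FILES:
        local_path = hf_hub_download(
            repo_id=REPO_ID,
            filename=name,
            local_dir=local_dir,
        )
        paths[name] = Path(local_path)
    return paths


# Example (requires network access and `pip install huggingface_hub`):
#   paths = download_checkpoints()
#   print(paths["dit.safetensors"])
```

The base Wan2.1-I2V-14B-480P weights are not part of this download.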

Citation

@inproceedings{zhang2026controllable,
  title     = {Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints},
  author    = {Zhang, Chenyangguang and Ye, Botao and Chen, Boqi and Delitzas, Alexandros and Wang, Fangjinhua and Pollefeys, Marc and Wang, Xi},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}

License

Apache 2.0.

Paper

Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints (arXiv:2603.11755)