	---
	license: apache-2.0
	base_model: Wan-AI/Wan2.1-I2V-14B-480P
	pipeline_tag: image-to-video
	tags:
	- video-generation
	- diffusion
	- lora
	- hand-pose
	- egocentric
	- controllable-generation
	---

	# JointControlVideo — MANO Hand-Conditioned Video Generation

	Official checkpoint for "Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints" (ECCV 2026).

	Given a starting frame and a sequence of 3D MANO hand-joint trajectories, this checkpoint generates an egocentric video that follows the prescribed hand motion. Conditioning uses occlusion-aware 3D geometric hand-joint embeddings injected directly into the latent space of [Wan2.1-I2V-14B-480P](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-480P), rather than dense 2D tracks or a separate pose tokenizer.

	[![arXiv](https://img.shields.io/badge/arXiv-2603.11755-b31b1b.svg)](https://arxiv.org/pdf/2603.11755)
	[![Project Page](https://img.shields.io/badge/Project-Page-blue.svg)](https://zhangcyg.github.io/handcontrolvideo/)
	[![Code](https://img.shields.io/badge/GitHub-Code-black.svg)](https://github.com/ZhangCYG/JointControlVideo)

	## Files

	| File | Description |
	| --- | --- |
	| `dit.safetensors` | LoRA weights for the Wan2.1 DiT (target modules `q,k,v,o,ffn.0,ffn.2`, rank 64), plus the expanded `patch_embedding` (16 extra input channels for the fused hand embedding) |
	| `hand-controller.safetensors` | `HandConditioningModule`: the occlusion-aware hand-joint conditioning network (42 MANO joints in total, 21 per hand × 2 hands) |
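
To give a rough sense of how small a rank-64 LoRA is compared with the layers it adapts, the sketch below counts the parameters a single adapter adds. A LoRA adapter on a linear layer with `in_features` inputs and `out_features` outputs stores two matrices of shapes `rank × in_features` and `out_features × rank`. The layer widths `HIDDEN_DIM` and `FFN_DIM` are assumptions chosen for illustration and are not read from this checkpoint; the reading of `ffn.0` and `ffn.2` as the two feed-forward linear layers is likewise an assumption.

```python
def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


RANK = 64  # rank stated in the Files table

# Assumed layer widths, for illustration only; confirm against the real model.
HIDDEN_DIM = 5120
FFN_DIM = 13824

attn_proj = lora_param_count(HIDDEN_DIM, HIDDEN_DIM, RANK)  # one of q, k, v, o
ffn_up = lora_param_count(HIDDEN_DIM, FFN_DIM, RANK)        # assumed role of ffn.0
ffn_down = lora_param_count(FFN_DIM, HIDDEN_DIM, RANK)      # assumed role of ffn.2

print(attn_proj, ffn_up, ffn_down)
```

Under these assumed widths, each attention projection gains 655,360 parameters, against 26,214,400 in the dense 5120 × 5120 weight it adapts.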

	Both are trained on top of a frozen Wan2.1-I2V-14B-480P backbone; you still need the base model's DiT/VAE/text-encoder/CLIP weights, which are pulled automatically from [`Wan-AI/Wan2.1-I2V-14B-480P`](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-480P) the first time you run the pipeline.
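
To check what a downloaded file actually contains before wiring it into the pipeline, the following sketch lists tensor names, dtypes, and shapes by reading only the safetensors header (an 8-byte little-endian length followed by a JSON index), so no tensor data is loaded. It uses the Python standard library only. The helper names `read_safetensors_header` and `find_tensors` are hypothetical, the local path `dit.safetensors` assumes the file sits in the working directory, and the `"patch_embedding"` substring filter assumes the tensor keys contain that module name.

```python
import json
import struct
from pathlib import Path


def read_safetensors_header(path):
    """Return {tensor_name: (dtype, shape)} without loading any tensor data."""
    with open(path, "rb") as f:
        # The file starts with an unsigned 64-bit little-endian header length...
        (header_len,) = struct.unpack("<Q", f.read(8))
        # ...followed by that many bytes of UTF-8 JSON describing every tensor.
        header = json.loads(f.read(header_len).decode("utf-8"))
    header.pop("__metadata__", None)  # optional free-form metadata, not a tensor
    return {name: (info["dtype"], tuple(info["shape"])) for name, info in header.items()}


def find_tensors(header, substring):
    """Keep only the header entries whose tensor name contains `substring`."""
    return {name: meta for name, meta in header.items() if substring in name}


if __name__ == "__main__":
    # Assumption: dit.safetensors has already been downloaded to this directory.
    path = Path("dit.safetensors")
    if path.exists():
        header = read_safetensors_header(path)
        print(f"{len(header)} tensors")
        for name, (dtype, shape) in sorted(find_tensors(header, "patch_embedding").items()):
            print(name, dtype, shape)
```

If the key-name assumption holds, the printed shape of the patch-embedding weight should show the enlarged input-channel dimension described in the table.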

	## Usage

	Set up the code from [ZhangCYG/JointControlVideo](https://github.com/ZhangCYG/JointControlVideo).

	## Citation

	```bibtex
	@inproceedings{zhang2026controllable,
	title = {Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints},
	author = {Zhang, Chenyangguang and Ye, Botao and Chen, Boqi and Delitzas, Alexandros and Wang, Fangjinhua and Pollefeys, Marc and Wang, Xi},
	booktitle = {European Conference on Computer Vision (ECCV)},
	year = {2026}
	}
	```

	## License

	Apache 2.0.