pipeline_tag: video-to-video
license: apache-2.0
language:
  - en
base_model:
  - Wan-AI/Wan2.1-T2V-1.3B

DecMem: Towards Minute-Long Consistent World Generation with Decoupled Memory

We propose DecMem, a decoupled memory architecture that employs Sparse Global Memory for efficient fine-grained access to global history and Anchored Local Memory for stable and high-quality extrapolation.

Project Page | Paper | Code

Checkpoints

Download the Wan2.1 backbone (VAE + tokenizer weights used by the pipeline):

huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B \
    --local-dir-use-symlinks False \
    --local-dir wan_models/Wan2.1-T2V-1.3B

Download the trained DecMem checkpoints from Hugging Face:

huggingface-cli download KlingTeam/DecMem --local-dir checkpoints

Checkpoint layout expected by training / inference scripts:

checkpoints/
└── decmem.pt             # released weights
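
Before launching a script, it can help to confirm that both downloads landed where the commands above put them. The following is a minimal sketch, not part of the DecMem codebase: `missing_paths` and `EXPECTED` are hypothetical names, and the paths are assumed to be relative to the directory where the download commands were run.

```python
from pathlib import Path

# Paths taken from the two download commands above (assumed relative to the
# directory the commands were run from).
EXPECTED = ("wan_models/Wan2.1-T2V-1.3B", "checkpoints/decmem.pt")


def missing_paths(root="."):
    """Return the expected paths that do not exist under `root`."""
    base = Path(root)
    return [p for p in EXPECTED if not (base / p).exists()]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print(f"missing: {missing}")
    else:
        print("all expected checkpoint paths found")
```

An empty list means the layout matches what this README describes; it does not verify that the files themselves are complete.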

Quick start

We provide example video-pose pairs for quick inference. Inference runs block by block, using causal denoising with a KV cache.

bash scripts/infer_example.sh
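
For readers unfamiliar with block-wise causal generation, the toy sketch below illustrates the general pattern only. It is not DecMem's implementation: the update rule is a placeholder rather than a trained denoiser, keys and values are the tokens themselves (identity projections), and neither Sparse Global Memory nor Anchored Local Memory is modeled. All names (`attend`, `generate_blocks`, and their parameters) are hypothetical.

```python
import math
import random


def attend(queries, keys, values):
    """Scaled dot-product attention over plain Python lists."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[i] for w, v in zip(weights, values)) / total
                    for i in range(dim)])
    return out


def generate_blocks(num_blocks=3, block_len=2, dim=4, num_steps=4, seed=0):
    """Generate blocks one at a time; each sees only finished blocks and itself."""
    rng = random.Random(seed)
    cache_k, cache_v = [], []  # K/V of finished blocks, reused by later blocks
    blocks = []
    for _ in range(num_blocks):
        # Each new block starts from noise.
        x = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(block_len)]
        for _ in range(num_steps):
            # Causal context: cached past blocks plus the current block.
            keys = cache_k + x
            values = cache_v + x
            context = attend(x, keys, values)
            # Placeholder update standing in for a real denoising step.
            x = [[0.5 * a + 0.5 * c for a, c in zip(tok, ctx)]
                 for tok, ctx in zip(x, context)]
        # The finished block's K/V are stored once and never recomputed.
        cache_k.extend(x)
        cache_v.extend(x)
        blocks.append(x)
    return blocks, cache_k
```

Because each block attends only to the cache and to itself, earlier blocks are unaffected by how many blocks follow, and the cache grows by one block's worth of entries per iteration instead of being rebuilt at every step.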

Citation

If you find our work helpful, please cite our paper:

@misc{yang2026decmemminutelongconsistentworld,
      title={DecMem: Towards Minute-Long Consistent World Generation with Decoupled Memory}, 
      author={Zhenhao Yang and Xiaoshi Wu and Zhengyao Lv and Xiaoyu Shi and Xintao Wang and Pengfei Wan and Kun Gai and Kwan-Yee K. Wong},
      year={2026},
      eprint={2605.31336},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.31336}, 
}