metadata
base_model:
  - Wan-AI/Wan2.1-T2V-1.3B
language:
  - en
license: apache-2.0
pipeline_tag: text-to-video

DecMem: Towards Minute-Long Consistent World Generation with Decoupled Memory

DecMem is a decoupled memory architecture designed for consistent, long-horizon world generation. It employs Sparse Global Memory for efficient, fine-grained access to global history and Anchored Local Memory for stable, high-quality extrapolation. This approach enables controllable, minute-long video generation with high fidelity and consistency.

Project Page | Paper | Code

Checkpoints

Download the Wan2.1 backbone (VAE + tokenizer weights used by the pipeline):

huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B \
    --local-dir-use-symlinks False \
    --local-dir wan_models/Wan2.1-T2V-1.3B

Download DecMem trained checkpoints:

huggingface-cli download KlingTeam/DecMem --local-dir checkpoints

Checkpoint layout expected by the training and inference scripts:

checkpoints/
└── decmem.pt             # released weights
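
Before running inference, it can help to confirm that both downloads landed where the commands above put them. The following is a minimal sketch under stated assumptions: `check_decmem_layout` is a hypothetical helper that is not part of the DecMem repository, and it only checks that `checkpoints/decmem.pt` exists and is non-empty and that `wan_models/Wan2.1-T2V-1.3B` is a non-empty directory. It does not validate file contents, and the scripts may require files beyond the ones named here.

```shell
# Hypothetical helper, not part of the DecMem repository.
# Checks only the paths named in this README; file contents are not validated.
check_decmem_layout() {
    local root="${1:-.}"   # directory containing checkpoints/ and wan_models/
    local status=0

    # The released DecMem weights must exist and be non-empty.
    if [ ! -s "$root/checkpoints/decmem.pt" ]; then
        echo "missing or empty: $root/checkpoints/decmem.pt" >&2
        status=1
    fi

    # The Wan2.1 backbone directory must exist and contain at least one entry.
    local wan_dir="$root/wan_models/Wan2.1-T2V-1.3B"
    if [ ! -d "$wan_dir" ] || [ -z "$(ls -A "$wan_dir" 2>/dev/null)" ]; then
        echo "missing or empty: $wan_dir" >&2
        status=1
    fi

    return "$status"
}
```

After sourcing the function, call `check_decmem_layout .` from the directory in which the download commands were run. It returns a non-zero status and names each missing path on stderr if either download is absent.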

Quick start

We provide example video-pose pairs for quick inference. Inference runs block by block, using causal denoising with a KV cache.

To run inference, follow the installation instructions in the official repository, then run:

bash scripts/infer_example.sh

Citation

If you find our work helpful, please cite our paper:

@misc{yang2026decmemminutelongconsistentworld,
      title={DecMem: Towards Minute-Long Consistent World Generation with Decoupled Memory}, 
      author={Zhenhao Yang and Xiaoshi Wu and Zhengyao Lv and Xiaoyu Shi and Xintao Wang and Pengfei Wan and Kun Gai and Kwan-Yee K. Wong},
      year={2026},
      eprint={2605.31336},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.31336}, 
}