---
license: apache-2.0
library_name: pytorch
pipeline_tag: text-to-video
tags:
- text-to-video
- video-generation
- streaming
- self-forcing
- wan2.1
- 3d-aware
base_model: Wan-AI/Wan2.1-T2V-1.3B
---

# EndlessWorld: Real-Time 3D-Aware Long Video Generation

Checkpoint for **EndlessWorld**, a streaming video diffusion model that produces *unbounded-length*, 3D-consistent videos in real time on a single GPU.

- **Paper:** [arXiv:2512.12430](https://arxiv.org/abs/2512.12430)
- **Code:** [github.com/BWGZK-keke/EndlessWorld](https://github.com/BWGZK-keke/EndlessWorld)
- **Base model:** [Wan-AI/Wan2.1-T2V-1.3B](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B)
- **3D encoder:** [lhjiang/anysplat](https://huggingface.co/lhjiang/anysplat)

## What's in this repo

| File | Description |
|------------|-------------------------------------------------------------------------|
| `model.pt` | DMD-distilled generator weights for the EndlessWorld causal Wan model (step 1000 of the `self_forcing_dmd_separate` SOTA run). |

This is the generator checkpoint only. To run inference you also need:

1. The Wan2.1-T2V-1.3B base weights (text encoder, VAE, etc.)
2. The AnySplat 3D Gaussian feature encoder

See the [GitHub README](https://github.com/BWGZK-keke/EndlessWorld#installation) for the full setup.

## Method

EndlessWorld extends the **Self-Forcing** causal diffusion framework (Wan2.1 T2V-1.3B backbone) with a **Global 3D-Aware Attention** module that injects scene geometry, extracted on the fly by AnySplat, into the conditional embedding of every autoregressive chunk.

![EndlessWorld pipeline](pipeline.png)

Three ingredients:

- **Conditional autoregressive (self-forcing) training**: frames are denoised block-by-block with KV-cache, conditioning each new block on previously generated content.
- **Global 3D-Aware Attention**: `CrossAttentionFusion` + `To3D` modules ingest 3D Gaussian features produced by AnySplat and fuse them with the text embedding, giving the generator a persistent geometric memory of the world rendered so far.
- **Real-time streaming inference**: the rollout loop re-extracts 3D features from the most recently decoded chunk and feeds the fused embedding back into the causal generator, enabling indefinite extension on a single GPU.

## Quick start

```bash
git clone https://github.com/BWGZK-keke/EndlessWorld
cd EndlessWorld
pip install -r requirements.txt

# Download this checkpoint
huggingface-cli download BWGZK/EndlessWorld model.pt --local-dir checkpoints/

# Update configs/self_forcing_dmd.yaml -> generator_ckpt: checkpoints/model.pt
bash test.sh
```

Loading directly from Python:

```python
import torch
from huggingface_hub import hf_hub_download

ckpt = hf_hub_download(repo_id="BWGZK/EndlessWorld", filename="model.pt")
state_dict = torch.load(ckpt, map_location="cpu")
```

## Training

- **Framework:** Multi-GPU FSDP via the [`train.py`](https://github.com/BWGZK-keke/EndlessWorld/blob/main/train.py) entry point with [`configs/self_forcing_dmd.yaml`](https://github.com/BWGZK-keke/EndlessWorld/blob/main/configs/self_forcing_dmd.yaml).

## Citation

```bibtex
@article{zhang2025endlessworld,
  title   = {Endless World: Real-Time 3D-Aware Long Video Generation},
  author  = {Zhang, Ke and others},
  journal = {arXiv preprint arXiv:2512.12430},
  year    = {2025}
}
```

## License

Apache 2.0, same as the upstream Wan2.1 and Self-Forcing projects.
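
## Appendix: streaming rollout sketch

The control flow of the streaming inference loop described under Method can be sketched roughly as below. This is a minimal illustration, not the repository's code: `stream_rollout`, `generate_chunk`, `extract_3d`, and `fuse` are hypothetical names standing in for the causal generator (which would hold its own KV-cache), the AnySplat feature extractor, and the `CrossAttentionFusion` / `To3D` fusion step. Real inputs would be tensors; plain Python values are used here so the sketch runs without a GPU.

```python
import itertools
from typing import Any, Callable, Iterator, Optional


def stream_rollout(
    text_emb: Any,
    generate_chunk: Callable[[Any], Any],
    extract_3d: Callable[[Any], Any],
    fuse: Callable[[Any, Any], Any],
    num_chunks: Optional[int] = None,
) -> Iterator[Any]:
    """Yield video chunks one at a time (hypothetical sketch).

    generate_chunk: stand-in for the causal generator; assumed to keep
        its own KV-cache across calls.
    extract_3d: stand-in for the 3D feature encoder applied to the most
        recently decoded chunk.
    fuse: stand-in for fusing the text embedding with 3D features.
    num_chunks: stop after this many chunks; None means run until the
        caller stops iterating.
    """
    # The first chunk is conditioned on the text embedding alone.
    cond = text_emb
    steps = range(num_chunks) if num_chunks is not None else itertools.count()
    for _ in steps:
        chunk = generate_chunk(cond)
        yield chunk
        # Re-extract geometry from the newest chunk and fold it into the
        # conditioning used for the next chunk.
        geometry = extract_3d(chunk)
        cond = fuse(text_emb, geometry)
```

Writing the loop as a generator means the caller decides when to stop, which matches the unbounded-length setting: wrap it in `itertools.islice` or break out of a `for` loop once enough chunks have been produced.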