---
license: apache-2.0
library_name: pytorch
pipeline_tag: text-to-video
tags:
  - text-to-video
  - video-generation
  - streaming
  - self-forcing
  - wan2.1
  - 3d-aware
base_model: Wan-AI/Wan2.1-T2V-1.3B
---

# EndlessWorld — Real-Time 3D-Aware Long Video Generation

Checkpoint for **EndlessWorld**, a streaming video diffusion model that produces
*unbounded-length*, 3D-consistent videos in real time on a single GPU.

- **Paper:** [arXiv:2512.12430](https://arxiv.org/abs/2512.12430)
- **Code:** [github.com/BWGZK-keke/EndlessWorld](https://github.com/BWGZK-keke/EndlessWorld)
- **Base model:** [Wan-AI/Wan2.1-T2V-1.3B](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B)
- **3D encoder:** [lhjiang/anysplat](https://huggingface.co/lhjiang/anysplat)

## What's in this repo

| File       | Description                                                             |
|------------|-------------------------------------------------------------------------|
| `model.pt` | DMD-distilled generator weights for the EndlessWorld causal Wan model (step 1000 of the `self_forcing_dmd_separate` SOTA run). |

This repo contains the generator checkpoint only. To run inference you also need:
1. The Wan2.1-T2V-1.3B base weights (text encoder, VAE, etc.).
2. The AnySplat 3D Gaussian feature encoder.

See the [GitHub README](https://github.com/BWGZK-keke/EndlessWorld#installation)
for the full setup.

## Method

EndlessWorld extends the **Self-Forcing** causal diffusion framework (on a
Wan2.1-T2V-1.3B backbone) with a **Global 3D-Aware Attention** module. The
module injects scene geometry, extracted on the fly by AnySplat, into the
conditional embedding of every autoregressive chunk.
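
As a rough illustration of what injecting geometry into the conditional embedding can look like, the sketch below runs single-head cross-attention with text tokens as queries and 3D feature tokens as keys and values, then adds the result back residually. It is pure Python with no learned projections, the function name `fuse_text_with_3d` is hypothetical, and the choice of queries versus keys is an assumption; the repository's own modules are PyTorch layers whose internals this card does not describe.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_text_with_3d(text_tokens, geo_tokens):
    """Hypothetical fusion: each text token attends over 3D feature tokens.

    text_tokens: list of d-dim vectors, used as queries
    geo_tokens:  list of d-dim vectors, used as keys and values
    Returns the text tokens plus the attended geometry (residual connection).
    """
    d = len(text_tokens[0])
    scale = 1.0 / math.sqrt(d)
    fused = []
    for q in text_tokens:
        # Scaled dot-product score between this text token and every 3D token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in geo_tokens]
        weights = softmax(scores)
        # Weighted average of the 3D tokens, one output dimension at a time.
        attended = [sum(w * v[j] for w, v in zip(weights, geo_tokens)) for j in range(d)]
        fused.append([qi + ai for qi, ai in zip(q, attended)])
    return fused

# Toy inputs: two text tokens and three 3D feature tokens, all 2-dimensional.
text = [[1.0, 0.0], [0.0, 1.0]]
geo = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = fuse_text_with_3d(text, geo)
```

The output keeps the shape of the text embedding, which is why a fused embedding of this kind can be passed to a generator that otherwise expects plain text conditioning.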

![EndlessWorld pipeline](pipeline.png)

Three ingredients:

- **Conditional autoregressive (self-forcing) training:** frames are denoised
  block by block with a KV cache, conditioning each new block on previously
  generated content.
- **Global 3D-Aware Attention:** `CrossAttentionFusion` and `To3D` modules ingest
  3D Gaussian features produced by AnySplat and fuse them with the text
  embedding, giving the generator a persistent geometric memory of the world
  rendered so far.
- **Real-time streaming inference:** the rollout loop re-extracts 3D features
  from the most recently decoded chunk and feeds the fused embedding back into
  the causal generator, enabling indefinite extension on a single GPU.
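
The way these three ingredients interact at inference time can be sketched as a loop. Every callable here (`generate_chunk`, `decode`, `extract_3d`, `fuse`) is a hypothetical stand-in supplied by the caller, not a function from the repository, and the list used as a cache is a simplification of a real KV cache. The sketch only shows the order in which data flows.

```python
def stream_rollout(text_emb, num_chunks, generate_chunk, decode, extract_3d, fuse):
    """Yield decoded chunks one at a time; all callables are hypothetical stand-ins."""
    cache = []         # simplified stand-in for the KV cache of earlier blocks
    cond = text_emb    # the first chunk is conditioned on the text alone
    for _ in range(num_chunks):
        latents = generate_chunk(cond, cache)  # denoise one block causally
        cache.append(latents)                  # later blocks can attend to it
        frames = decode(latents)               # latents -> pixels
        geo = extract_3d(frames)               # 3D features from the newest chunk
        cond = fuse(text_emb, geo)             # fused embedding for the next block
        yield frames

# Toy string-based stand-ins so the loop can be executed end to end.
chunks = list(stream_rollout(
    text_emb="prompt",
    num_chunks=3,
    generate_chunk=lambda cond, cache: f"latent{len(cache)}<{cond}>",
    decode=lambda latents: f"frames({latents})",
    extract_3d=lambda frames: f"geo[{len(frames)}]",
    fuse=lambda text, geo: f"{text}+{geo}",
))
```

Writing the rollout as a generator mirrors the streaming setting: the caller can consume each chunk as soon as it is decoded and stop whenever it likes, so the number of chunks does not need to be fixed in advance.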

## Quick start

```bash
git clone https://github.com/BWGZK-keke/EndlessWorld
cd EndlessWorld
pip install -r requirements.txt

# Download this checkpoint
huggingface-cli download BWGZK/EndlessWorld model.pt --local-dir checkpoints/

# Update configs/self_forcing_dmd.yaml -> generator_ckpt: checkpoints/model.pt
bash test.sh
```
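
The config edit mentioned in the comment above can also be scripted. The helper below is a minimal sketch that assumes `generator_ckpt` is a top-level scalar key in the YAML file; the function name `set_generator_ckpt` and the other keys in the example text are made up for illustration and are not taken from the repository's config.

```python
import re

def set_generator_ckpt(yaml_text, ckpt_path):
    """Replace the value of a top-level `generator_ckpt:` key, or append the key."""
    pattern = re.compile(r"^generator_ckpt:.*$", flags=re.MULTILINE)
    new_line = f"generator_ckpt: {ckpt_path}"
    if pattern.search(yaml_text):
        # A callable replacement avoids backslash-escape surprises in the path.
        return pattern.sub(lambda _match: new_line, yaml_text, count=1)
    sep = "" if yaml_text.endswith("\n") or not yaml_text else "\n"
    return f"{yaml_text}{sep}{new_line}\n"

# Made-up config text; only the generator_ckpt key comes from this card.
example = "num_frames: 21\ngenerator_ckpt: old/path.pt\nseed: 0\n"
updated = set_generator_ckpt(example, "checkpoints/model.pt")
```

To apply it to the real file, read `configs/self_forcing_dmd.yaml` with `pathlib.Path.read_text`, pass the text through the helper, and write the result back with `write_text`.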

Loading directly from Python:

```python
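import torch
from huggingface_hub import hf_hub_download

# Fetch model.pt into the local Hugging Face cache and return its path.
ckpt = hf_hub_download(repo_id="BWGZK/EndlessWorld", filename="model.pt")
# Load on CPU first; move the weights to the GPU after the model is built.
state_dict = torch.load(ckpt, map_location="cpu")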
```
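
What `torch.load` returns depends on how the checkpoint was saved: it may be a flat parameter mapping, or a dictionary that nests the generator weights under a named entry. The helper below is a sketch for telling the two apart. The candidate key names (`generator_ema`, `generator`) are assumptions based on common training-script conventions, not something this card confirms, so inspect the top-level keys of the real file before relying on them.

```python
def pick_generator_weights(ckpt, candidates=("generator_ema", "generator")):
    """Return (weights, source_key); source_key is None for a flat mapping.

    `candidates` lists hypothetical wrapper keys to try, in priority order.
    """
    for key in candidates:
        nested = ckpt.get(key)
        if isinstance(nested, dict):
            return nested, key
    return ckpt, None

# Stand-in dictionaries; with the real file, pass the object from torch.load.
wrapped = {"generator": {"blocks.0.weight": [0.0]}, "step": 1000}
flat = {"blocks.0.weight": [0.0]}

weights, source = pick_generator_weights(wrapped)
```

The returned mapping is what you would hand to the generator's `load_state_dict`, once the model has been constructed as described in the GitHub README.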

## Training

- **Framework:** Multi-GPU FSDP via the [`train.py`](https://github.com/BWGZK-keke/EndlessWorld/blob/main/train.py)
  entry point with [`configs/self_forcing_dmd.yaml`](https://github.com/BWGZK-keke/EndlessWorld/blob/main/configs/self_forcing_dmd.yaml).

## Citation

```bibtex
@article{zhang2025endlessworld,
  title   = {Endless World: Real-Time 3D-Aware Long Video Generation},
  author  = {Zhang, Ke and others},
  journal = {arXiv preprint arXiv:2512.12430},
  year    = {2025}
}
```

## License

Apache 2.0 — same as the upstream Wan2.1 and Self-Forcing projects.