	---
	license: apache-2.0
	library_name: pytorch
	pipeline_tag: text-to-video
	tags:
	- text-to-video
	- video-generation
	- streaming
	- self-forcing
	- wan2.1
	- 3d-aware
	base_model: Wan-AI/Wan2.1-T2V-1.3B
	---

	# EndlessWorld — Real-Time 3D-Aware Long Video Generation

	Checkpoint for EndlessWorld, a streaming video diffusion model that produces
	unbounded-length, 3D-consistent videos in real time on a single GPU.

	- Paper: [arXiv:2512.12430](https://arxiv.org/abs/2512.12430)
	- Code: [github.com/BWGZK-keke/EndlessWorld](https://github.com/BWGZK-keke/EndlessWorld)
	- Base model: [Wan-AI/Wan2.1-T2V-1.3B](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B)
	- 3D encoder: [lhjiang/anysplat](https://huggingface.co/lhjiang/anysplat)

	## What's in this repo

	| File | Description |
	|------|-------------|
	| `model.pt` | DMD-distilled generator weights for the EndlessWorld causal Wan model (step 1000 of the `self_forcing_dmd_separate` SOTA run). |

	This is the generator checkpoint only. To run inference you also need:
	1. The Wan2.1-T2V-1.3B base weights (text encoder, VAE, etc.)
	2. The AnySplat 3D Gaussian feature encoder

	See the [GitHub README](https://github.com/BWGZK-keke/EndlessWorld#installation)
	for the full setup.
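As a rough illustration of the two prerequisites listed above, the sketch below maps each Hugging Face repo to a local folder and downloads it with `huggingface_hub.snapshot_download`. The repo ids come from the links in this card; the folder names and the helper names `plan_downloads` and `fetch_all` are hypothetical, so adjust them to whatever layout the GitHub README expects.

```python
from pathlib import Path

# Repo ids are taken from this model card. The local folder names are
# assumptions for illustration, not the layout required by the code base.
REQUIRED_REPOS = {
    "Wan-AI/Wan2.1-T2V-1.3B": "wan_models/Wan2.1-T2V-1.3B",
    "lhjiang/anysplat": "checkpoints/anysplat",
}


def plan_downloads(root="."):
    """Return (repo_id, target_dir) pairs without touching the network."""
    return [(repo_id, Path(root) / sub) for repo_id, sub in REQUIRED_REPOS.items()]


def fetch_all(root="."):
    """Download every prerequisite repo; needs network access and disk space."""
    from huggingface_hub import snapshot_download  # imported lazily

    for repo_id, target in plan_downloads(root):
        snapshot_download(repo_id=repo_id, local_dir=str(target))


# Preview what would be fetched; call fetch_all() to start the download.
for repo_id, target in plan_downloads():
    print(f"{repo_id} -> {target}")
```

Keeping the plan separate from the download makes it easy to check the target paths before pulling several gigabytes of weights.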

	## Method

	EndlessWorld extends the Self-Forcing causal diffusion framework (Wan2.1
	T2V-1.3B backbone) with a Global 3D-Aware Attention module that injects
	scene geometry — extracted on the fly by AnySplat — into the conditional
	embedding of every autoregressive chunk.

	![EndlessWorld pipeline](pipeline.png)
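To make the idea of injecting geometry into the conditional embedding concrete, here is a minimal PyTorch sketch in which text tokens attend over 3D feature tokens. It is not the repository's implementation: the class name `FusionSketch`, the feature widths, and the residual-plus-LayerNorm layout are all assumptions chosen for illustration.

```python
import torch
from torch import nn


class FusionSketch(nn.Module):
    """Illustrative stand-in for a text/3D cross-attention fusion block."""

    def __init__(self, text_dim=64, geo_dim=32, num_heads=4):
        super().__init__()
        # Project 3D Gaussian features to the text embedding width.
        self.geo_proj = nn.Linear(geo_dim, text_dim)
        self.attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, text_emb, geo_feats):
        # text_emb:  (batch, text_tokens, text_dim)
        # geo_feats: (batch, num_gaussians, geo_dim)
        geo = self.geo_proj(geo_feats)
        # Text tokens are the queries; geometry tokens are keys and values.
        fused, _ = self.attn(query=text_emb, key=geo, value=geo)
        # The residual keeps the text signal alongside the geometry.
        return self.norm(text_emb + fused)


torch.manual_seed(0)
fusion = FusionSketch()
text_emb = torch.randn(2, 16, 64)    # hypothetical text embedding
geo_feats = torch.randn(2, 100, 32)  # hypothetical 3D Gaussian features
cond = fusion(text_emb, geo_feats)
print(cond.shape)  # same shape as text_emb
```

Because the output keeps the shape of the text embedding, a block like this could feed a generator that expects a text-shaped condition without changes to the generator's interface.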

	Three ingredients:

	- Conditional autoregressive (self-forcing) training: frames are denoised
	block by block with a KV cache, so each new block is conditioned on
	previously generated content.
	- Global 3D-Aware Attention: the `CrossAttentionFusion` and `To3D` modules ingest
	3D Gaussian features produced by AnySplat and fuse them with the text
	embedding, giving the generator a persistent geometric memory of the world
	rendered so far.
	- Real-time streaming inference: the rollout loop re-extracts 3D features
	from the most recently decoded chunk and feeds the fused embedding back into
	the causal generator, which allows indefinite extension on a single GPU.
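The control flow of the streaming rollout described above can be sketched in plain Python. Every callable here (`generate_chunk`, `decode`, `extract_3d`, `fuse`) is a hypothetical placeholder for the real generator, VAE decoder, AnySplat encoder, and fusion module. The bounded cache and the text-only condition for the first chunk are assumptions, not details confirmed by this card.

```python
from collections import deque


def stream_rollout(prompt_emb, generate_chunk, decode, extract_3d, fuse,
                   num_chunks, cache_len=3):
    """Yield decoded chunks one at a time; all callables are hypothetical."""
    kv_cache = deque(maxlen=cache_len)  # assumed bounded context of past chunks
    cond = prompt_emb                   # assumed: first chunk sees text only
    for _ in range(num_chunks):
        latent = generate_chunk(cond, list(kv_cache))
        kv_cache.append(latent)
        frames = decode(latent)
        # Re-extract geometry from the newest frames and refresh the condition.
        cond = fuse(prompt_emb, extract_3d(frames))
        yield frames


# Toy stand-ins that only trace the data flow.
chunks = list(stream_rollout(
    prompt_emb="text",
    generate_chunk=lambda cond, cache: (cond, len(cache)),
    decode=lambda latent: f"frames<{latent[0]}|ctx={latent[1]}>",
    extract_3d=lambda frames: "geo",
    fuse=lambda text, geo: f"{text}+{geo}",
    num_chunks=5,
))
for chunk in chunks:
    print(chunk)
```

Writing the loop as a generator means each chunk can be displayed or saved as soon as it is decoded, and memory stays flat as long as the cache is bounded.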

	## Quick start

	```bash
	git clone https://github.com/BWGZK-keke/EndlessWorld
	cd EndlessWorld
	pip install -r requirements.txt

	# Download this checkpoint
	huggingface-cli download BWGZK/EndlessWorld model.pt --local-dir checkpoints/

	# Set generator_ckpt: checkpoints/model.pt in configs/self_forcing_dmd.yaml, then run:
	bash test.sh
	```

	Loading directly from Python:

	```python
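	import torch
	from huggingface_hub import hf_hub_download

	# Fetch model.pt into the local Hugging Face cache and return its path.
	ckpt = hf_hub_download(repo_id="BWGZK/EndlessWorld", filename="model.pt")
	# Load on CPU first; move tensors to the GPU after building the model.
	# The top-level key layout is not documented in this card, so inspect
	# state_dict.keys() before passing it to load_state_dict.
	state_dict = torch.load(ckpt, map_location="cpu")
```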
	```
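Since this card does not say whether `model.pt` is a flat state dict or a dict that wraps one, a small helper for summarizing the loaded object can be useful before wiring it into a model. The sketch below is duck-typed (anything with a `.shape` counts as a tensor) so it runs without PyTorch. The helper name `summarize_checkpoint` and the toy key names are hypothetical and do not describe the real file.

```python
def summarize_checkpoint(obj, prefix="", max_depth=2):
    """Return (key_path, description) pairs for a loaded checkpoint mapping."""
    rows = []
    for key, value in obj.items():
        path = f"{prefix}{key}"
        if hasattr(value, "shape"):
            # Tensor-like entry: record its shape.
            rows.append((path, f"tensor{tuple(value.shape)}"))
        elif isinstance(value, dict) and max_depth > 1:
            # Nested dict: descend one level and prefix the child keys.
            rows.extend(summarize_checkpoint(value, path + "/", max_depth - 1))
        else:
            rows.append((path, type(value).__name__))
    return rows


class FakeTensor:
    """Stand-in with a .shape attribute so the example runs without torch."""

    def __init__(self, *shape):
        self.shape = shape


# Toy layout for demonstration only; the real keys may differ.
toy = {"generator": {"blocks.0.weight": FakeTensor(4, 4)}, "step": 1000}
summary = summarize_checkpoint(toy)
for path, desc in summary:
    print(path, desc)
```

With a real checkpoint, pass the object returned by `torch.load` in place of `toy`.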

	## Training

	- Framework: multi-GPU FSDP, launched through the [`train.py`](https://github.com/BWGZK-keke/EndlessWorld/blob/main/train.py)
	entry point with the [`configs/self_forcing_dmd.yaml`](https://github.com/BWGZK-keke/EndlessWorld/blob/main/configs/self_forcing_dmd.yaml) config.

	## Citation

	```bibtex
	@article{zhang2025endlessworld,
	title = {Endless World: Real-Time 3D-Aware Long Video Generation},
	author = {Zhang, Ke and others},
	journal = {arXiv preprint arXiv:2512.12430},
	year = {2025}
	}
	```

	## License

	Apache 2.0 — same as the upstream Wan2.1 and Self-Forcing projects.