How to use from the Diffusers library
pip install -U diffusers transformers accelerate
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import load_image, export_to_video

# switch to "mps" for apple devices
pipe = DiffusionPipeline.from_pretrained("GD-ML/DreamX-World-5B", dtype=torch.bfloat16, device_map="cuda")
pipe.to("cuda")

prompt = "A man with short gray hair plays a red electric guitar."
image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/guitar-man.png"
)

output = pipe(image=image, prompt=prompt).frames[0]
export_to_video(output, "output.mp4")

DreamX-World-5B

Project Page GitHub Technical Report

Model Description

DreamX-World is a general-purpose world model for interactive world simulation. It generates diverse, high-fidelity worlds that users can explore, control, and transform with event prompts.

DreamX-World-5B is the long-horizon autoregressive variant of DreamX-World. It is built on top of Wan2.2-TI2V-5B and generates videos from an input image, a text prompt, and keyboard-style camera action commands. Compared with the 5B-Cam variant, DreamX-World-5B uses chunk-wise causal autoregressive inference with KV caching, making long-horizon generation practical, including videos up to about 1 minute at 16 FPS.

The model is trained with a scalable data engine on Unreal Engine data, gameplay footage, and real-world videos, together with camera estimation and data filtering. DreamX-World follows a progressive training pipeline for action control, open-ended event response, reinforcement-learning-based action following, and efficient inference through forcing and distillation.

Key Features

  • Long-horizon world generation: Supports autoregressive generation for coherent world exploration over hundreds of frames.
  • Camera-controllable video generation: Converts keyboard-style action commands into camera trajectories and PRoPE camera conditioning.
  • Image, text, and action conditioning: Uses a starting image, a scene/event prompt, and an action sequence to generate controllable videos.
  • Chunk-wise causal inference: Generates latent frames in causal blocks with KV cache reuse for efficient long rollouts.
  • Diverse world types: Supports realistic, indoor, outdoor, urban, natural, game-like, fantasy, sci-fi, and stylized scenes.

How to Use

Requirements

Clone the inference code and install dependencies:

git clone https://github.com/AMAP-ML/DreamX-World
cd DreamX-World
pip install -r requirements.txt

Key dependencies include:

  • torch==2.5.1
  • torchvision==0.20.1
  • diffusers>=0.30.1
  • transformers>=4.46.2
  • xfuser==0.4.1
  • flash_attn==2.8.3
  • triton==3.1.0

Download Base Model

DreamX-World-5B uses Wan2.2-TI2V-5B components for the text encoder, tokenizer, and VAE:

pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.2-TI2V-5B --local-dir ./Wan2.2-TI2V-5B

Download the DreamX-World-5B checkpoint from this repository and set BASE_CHECKPOINT_PATH to the .pt checkpoint path.

Prepare Input JSON

The inference script expects a JSON list. Each item contains an initial image, a text prompt, and camera actions:

[
  {
    "image_path": "./demo/your_image.png",
    "caption": "Style: Photorealistic. A description of the scene and desired world behavior.",
    "action_seq": ["w", "wj", "wl"],
    "action_speed_list": [4, 6, 6]
  }
]

In the current inference script, action_speed_list is used as the relative duration weight for each action segment. For example, [4, 6, 6] allocates the rollout across the three action segments in a 4:6:6 ratio.
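The input file can also be built programmatically. The sketch below assembles one item with the fields described above and illustrates how a 4:6:6 weighting could divide a rollout. The helper names `make_item` and `split_by_weight` are hypothetical, and the largest-remainder rounding is an assumption for illustration; the actual inference script may allocate frames to segments differently.

```python
import json


def make_item(image_path, caption, action_seq, action_speed_list):
    """Build one input item; field names follow the JSON format above."""
    if len(action_seq) != len(action_speed_list):
        raise ValueError("action_seq and action_speed_list must have equal length")
    return {
        "image_path": image_path,
        "caption": caption,
        "action_seq": list(action_seq),
        "action_speed_list": list(action_speed_list),
    }


def split_by_weight(total_frames, weights):
    """Split total_frames across segments in proportion to weights.

    Illustrative only: largest-remainder rounding keeps the parts summing
    to total_frames. The real script may round differently.
    """
    total_weight = sum(weights)
    exact = [total_frames * w / total_weight for w in weights]
    parts = [int(x) for x in exact]
    # Give leftover frames to the segments with the largest fractional parts.
    leftover = total_frames - sum(parts)
    order = sorted(range(len(weights)), key=lambda i: exact[i] - parts[i], reverse=True)
    for i in order[:leftover]:
        parts[i] += 1
    return parts


item = make_item(
    "./demo/your_image.png",
    "Style: Photorealistic. A description of the scene and desired world behavior.",
    ["w", "wj", "wl"],
    [4, 6, 6],
)
# The script expects a JSON list, so wrap the item in a list before saving.
payload = json.dumps([item], indent=2)
```

Writing `payload` to a file and passing that file as the data path gives the script the list it expects.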

Camera Action Commands

DreamX-World-5B uses WASD for camera translation and IJKL for camera rotation:

| Action | Camera Control |
|--------|----------------|
| w      | Push in        |
| s      | Pull out       |
| a      | Move left      |
| d      | Move right     |
| i      | Tilt up        |
| k      | Tilt down      |
| j      | Pan left       |
| l      | Pan right      |

Actions can be composed in one string:

  • wi: push in while tilting up
  • wk: push in while tilting down
  • wj: push in while panning left
  • wl: push in while panning right
  • dj: move right while panning left
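Before launching a long rollout, it can help to check action strings for typos. The sketch below is a minimal validator built only from the key mapping listed above; `describe_action` is a hypothetical helper and is not part of the DreamX-World code.

```python
# Key-to-control mapping, copied from the action list above.
TRANSLATION = {"w": "push in", "s": "pull out", "a": "move left", "d": "move right"}
ROTATION = {"i": "tilt up", "k": "tilt down", "j": "pan left", "l": "pan right"}


def describe_action(action):
    """Return a readable description of a composed action string such as 'wj'."""
    parts = []
    for key in action:
        if key in TRANSLATION:
            parts.append(TRANSLATION[key])
        elif key in ROTATION:
            parts.append(ROTATION[key])
        else:
            raise ValueError(f"unknown action key: {key!r}")
    return " + ".join(parts)


# Describe every action in a sequence; an unknown key raises ValueError.
descriptions = [describe_action(a) for a in ["w", "wj", "wl"]]
```

Here `descriptions` holds one readable string per segment, for example "push in + pan left" for `wj`.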

Run Inference

Use the provided AR-forcing script:

BASE_CHECKPOINT_PATH=./DreamX-World-5B/baseline.pt \
MODEL_NAME=./Wan2.2-TI2V-5B \
DATA_PATH=configs/dreamx/eval.json \
OUTPUT_FOLDER=./outputs_ar \
bash inference_ar_forcing.sh

For custom generation length or direct control over all arguments, run the Python entry point:

python inference_ar_forcing.py \
  --config_path configs/dreamx-ar/causal_camera_forcing_5b.yaml \
  --model_name ./Wan2.2-TI2V-5B \
  --transformer_path ./configs/dreamx-ar/ \
  --base_checkpoint_path ./DreamX-World-5B/baseline.pt \
  --data_path configs/dreamx/eval.json \
  --output_folder ./outputs_ar \
  --num_output_frames 123 \
  --fps 16 \
  --seed 42 \
  --color_correction_strength 1.0 \
  --chunk_relative

--num_output_frames is the number of latent frames. The generated pixel-frame count is:

pixel_frames = (num_output_frames - 1) * 4 + 1

Because the default causal block size is 3 latent frames, num_output_frames should be divisible by 3. Examples:

| num_output_frames | Pixel frames | Duration at 16 FPS |
|-------------------|--------------|--------------------|
| 21                | 81           | ~5.1s              |
| 63                | 249          | ~15.6s             |
| 123               | 489          | ~30.6s             |
| 243               | 969          | ~60.6s             |
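The frame arithmetic above can be wrapped in a small helper for choosing `--num_output_frames`. This is a sketch based only on the formula and the block size of 3 stated here; the function names `rollout_summary` and `latent_frames_for` are hypothetical and do not come from the inference code.

```python
import math


def pixel_frames(num_output_frames):
    """Pixel-frame count for a given number of latent frames."""
    return (num_output_frames - 1) * 4 + 1


def rollout_summary(num_output_frames, fps=16, block_size=3):
    """Report causal blocks, pixel frames, and duration for a latent-frame count."""
    if num_output_frames % block_size != 0:
        raise ValueError(f"num_output_frames must be divisible by {block_size}")
    frames = pixel_frames(num_output_frames)
    return {
        "blocks": num_output_frames // block_size,
        "pixel_frames": frames,
        "seconds": frames / fps,
    }


def latent_frames_for(seconds, fps=16, block_size=3):
    """Smallest valid num_output_frames whose clip lasts at least `seconds`."""
    needed_pixels = math.ceil(seconds * fps)
    latents = math.ceil((needed_pixels - 1) / 4) + 1
    # Round up to the next multiple of the causal block size.
    return math.ceil(latents / block_size) * block_size


summary = rollout_summary(123)
```

For a target of about 30 seconds, `latent_frames_for(30)` gives 123, which matches the third row of the table.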

Technical Specifications

| Attribute | Value |
|-----------|-------|
| Architecture | Causal Wan/Wan2.2-style Diffusion Transformer |
| Parameters | ~5B |
| Base Model | Wan2.2-TI2V-5B |
| Input | Initial image, text prompt, camera action sequence |
| Output | Camera-controlled video |
| Resolution | 704 x 1280 in the provided inference script |
| FPS | 16 |
| Long-horizon Length | Up to about 1 minute |
| Camera Control | PRoPE camera conditioning from generated camera trajectories |
| Action Interface | WASD translation + IJKL view rotation |
| Inference Mode | Chunk-wise causal autoregressive generation with KV cache |
| Causal Block Size | 3 latent frames per block by default |
| VAE | Wan2.2 VAE, temporal compression 4x, spatial compression 16x |
| Text Encoder | UMT5-XXL |
| Precision | BFloat16 |

WeChat Group

Join our WeChat group for discussion:

WeChat Group QR Code

License

This model is released under the MIT License.

Citation

If you find this model useful, please cite:

@article{dreamxworld2026,
  title={DreamX-World: A General-Purpose Interactive World Model},
  author={DreamX Team},
  journal={arXiv preprint arXiv:2606.16993},
  year={2026}
}

Acknowledgement

We thank the Wan Team for open-sourcing their code and models.
