DreamX-World-5B

Model Description

DreamX-World is a general-purpose world model for interactive world simulation. It generates diverse, high-fidelity worlds that users can explore, control, and transform with event prompts.

DreamX-World-5B is the long-horizon autoregressive variant of DreamX-World. It is built on top of Wan2.2-TI2V-5B and generates videos from an input image, a text prompt, and keyboard-style camera action commands. Compared with the 5B-Cam variant, DreamX-World-5B uses chunk-wise causal autoregressive inference with KV caching, which makes long-horizon generation practical: videos can run up to about 1 minute at 16 FPS.

The model is trained with a scalable data engine that draws on Unreal Engine data, gameplay footage, and real-world videos, and applies camera estimation and data filtering. DreamX-World follows a progressive training pipeline that covers action control, open-ended event response, reinforcement-learning-based action following, and efficient inference through forcing and distillation.

Key Features

Long-horizon world generation: Supports autoregressive generation for coherent world exploration over hundreds of frames.
Camera-controllable video generation: Converts keyboard-style action commands into camera trajectories and PRoPE camera conditioning.
Image, text, and action conditioning: Uses a starting image, a scene/event prompt, and an action sequence to generate controllable videos.
Chunk-wise causal inference: Generates latent frames in causal blocks with KV cache reuse for efficient long rollouts.
Diverse world types: Supports realistic, indoor, outdoor, urban, natural, game-like, fantasy, sci-fi, and stylized scenes.

How to Use

Requirements

Clone the inference code and install dependencies:

git clone https://github.com/AMAP-ML/DreamX-World
cd DreamX-World
pip install -r requirements.txt

Key dependencies include:

torch==2.5.1
torchvision==0.20.1
diffusers>=0.30.1
transformers>=4.46.2
xfuser==0.4.1
flash_attn==2.8.3
triton==3.1.0

Download Base Model

DreamX-World-5B uses Wan2.2-TI2V-5B components for the text encoder, tokenizer, and VAE:

pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.2-TI2V-5B --local-dir ./Wan2.2-TI2V-5B

Download the DreamX-World-5B checkpoint from this repository and set BASE_CHECKPOINT_PATH to the path of the downloaded .pt file.

Prepare Input JSON

The inference script expects a JSON list. Each item contains an initial image, a text prompt, and camera actions:

[
  {
    "image_path": "./demo/your_image.png",
    "caption": "Style: Photorealistic. A description of the scene and desired world behavior.",
    "action_seq": ["w", "wj", "wl"],
    "action_speed_list": [4, 6, 6]
  }
]

In the current inference script, action_speed_list is used as the relative duration weight for each action segment. For example, [4, 6, 6] allocates the rollout across the three action segments in a 4:6:6 ratio.
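
To make the 4:6:6 example concrete, the following sketch splits a frame budget across action segments in proportion to action_speed_list. The helper name allocate_frames and the largest-remainder rounding are assumptions made for illustration; the inference script may round segment lengths differently.

```python
def allocate_frames(weights, total_frames):
    """Split total_frames across segments in proportion to weights.

    Hypothetical helper: uses largest-remainder rounding so the parts
    always sum to total_frames. The real script may round differently.
    """
    if not weights or any(w <= 0 for w in weights):
        raise ValueError("weights must be a non-empty list of positive numbers")
    weight_sum = sum(weights)
    exact = [total_frames * w / weight_sum for w in weights]
    parts = [int(x) for x in exact]  # floor, since all values are non-negative
    leftover = total_frames - sum(parts)
    # Hand the leftover frames to the segments with the largest fractional parts.
    order = sorted(range(len(weights)), key=lambda i: exact[i] - parts[i], reverse=True)
    for i in order[:leftover]:
        parts[i] += 1
    return parts


item = {
    "action_seq": ["w", "wj", "wl"],
    "action_speed_list": [4, 6, 6],
}
# One weight per action segment is assumed here.
assert len(item["action_seq"]) == len(item["action_speed_list"])
print(allocate_frames(item["action_speed_list"], 489))
```

The check on list lengths is also worth running on your own JSON before launching a long rollout, since a mismatch between action_seq and action_speed_list is easy to introduce by hand.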

Camera Action Commands

DreamX-World-5B uses WASD for camera translation and IJKL for camera rotation:

Action	Camera Control
`w`	Push in
`s`	Pull out
`a`	Move left
`d`	Move right
`i`	Tilt up
`k`	Tilt down
`j`	Pan left
`l`	Pan right

Actions can be composed in one string:

wi: push in while tilting up
wk: push in while tilting down
wj: push in while panning left
wl: push in while panning right
dj: move right while panning left
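
A composed action string can be checked and described with a small lookup built from the table above. This is a minimal sketch: the names ACTION_TABLE and describe_action are hypothetical and not part of the released code, and the sketch does not model any restrictions the inference script may place on key combinations.

```python
# Key-to-control mapping copied from the camera action table.
ACTION_TABLE = {
    "w": "push in",
    "s": "pull out",
    "a": "move left",
    "d": "move right",
    "i": "tilt up",
    "k": "tilt down",
    "j": "pan left",
    "l": "pan right",
}


def describe_action(action):
    """Return a readable description of a composed action string such as 'wj'."""
    unknown = [key for key in action if key not in ACTION_TABLE]
    if not action or unknown:
        raise ValueError(f"invalid action {action!r}; unknown keys: {unknown}")
    return " + ".join(ACTION_TABLE[key] for key in action)


for action in ["w", "wj", "dj"]:
    print(action, "->", describe_action(action))
```

Running a check like this over every entry of action_seq catches typos such as an uppercase key or a stray character before inference starts.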

Run Inference

Use the provided AR-forcing script:

BASE_CHECKPOINT_PATH=./DreamX-World-5B/baseline.pt \
MODEL_NAME=./Wan2.2-TI2V-5B \
DATA_PATH=configs/dreamx/eval.json \
OUTPUT_FOLDER=./outputs_ar \
bash inference_ar_forcing.sh

For custom generation length or direct control over all arguments, run the Python entry point:

python inference_ar_forcing.py \
  --config_path configs/dreamx-ar/causal_camera_forcing_5b.yaml \
  --model_name ./Wan2.2-TI2V-5B \
  --transformer_path ./configs/dreamx-ar/ \
  --base_checkpoint_path ./DreamX-World-5B/baseline.pt \
  --data_path configs/dreamx/eval.json \
  --output_folder ./outputs_ar \
  --num_output_frames 123 \
  --fps 16 \
  --seed 42 \
  --color_correction_strength 1.0 \
  --chunk_relative

--num_output_frames is the number of latent frames. The generated pixel-frame count is:

pixel_frames = (num_output_frames - 1) * 4 + 1

Because the default causal block size is 3 latent frames, num_output_frames should be divisible by 3 so that the rollout splits into whole blocks. Examples:

`num_output_frames`	Pixel frames	Duration at 16 FPS
21	81	~5.1s
63	249	~15.6s
123	489	~30.6s
243	969	~60.6s
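
The frame arithmetic above can be wrapped in two small helpers, one for the forward formula and one that picks a valid num_output_frames for a target duration. This is a sketch under the defaults stated in this card (block size 3, 16 FPS); the function names are hypothetical and do not exist in the inference code.

```python
import math

TEMPORAL_COMPRESSION = 4  # pixel frames per latent step, from the formula above
BLOCK_SIZE = 3            # default causal block size in latent frames


def pixel_frames(num_output_frames):
    """Pixel-frame count for a given number of latent frames."""
    if num_output_frames % BLOCK_SIZE != 0:
        raise ValueError(f"num_output_frames should be divisible by {BLOCK_SIZE}")
    return (num_output_frames - 1) * TEMPORAL_COMPRESSION + 1


def latent_frames_for(seconds, fps=16):
    """Smallest valid num_output_frames covering at least `seconds` of video."""
    needed_pixels = math.ceil(seconds * fps)
    latents = math.ceil((needed_pixels - 1) / TEMPORAL_COMPRESSION) + 1
    # Round up to a whole number of causal blocks.
    return math.ceil(latents / BLOCK_SIZE) * BLOCK_SIZE


for n in (21, 63, 123, 243):
    print(n, pixel_frames(n), round(pixel_frames(n) / 16, 1))
```

For example, latent_frames_for(30) returns 123, the value passed to --num_output_frames in the command above.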

Technical Specifications

Attribute	Value
Architecture	Causal Wan/Wan2.2-style Diffusion Transformer
Parameters	~5B
Base Model	Wan2.2-TI2V-5B
Input	Initial image, text prompt, camera action sequence
Output	Camera-controlled video
Resolution	704 x 1280 in the provided inference script
FPS	16
Long-horizon Length	Up to about 1 minute
Camera Control	PRoPE camera conditioning from generated camera trajectories
Action Interface	`WASD` translation + `IJKL` view rotation
Inference Mode	Chunk-wise causal autoregressive generation with KV cache
Causal Block Size	3 latent frames per block by default
VAE	Wan2.2 VAE, temporal compression 4x, spatial compression 16x
Text Encoder	UMT5-XXL
Precision	BFloat16
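
As a rough guide to what these numbers imply for one rollout, the sketch below derives the VAE latent grid and the number of causal blocks from the resolution, spatial compression, and block size listed above. The helper rollout_layout is hypothetical, and the result covers VAE latents only; this card does not say whether the transformer applies further patchification.

```python
SPATIAL_COMPRESSION = 16  # VAE spatial compression from the table above
BLOCK_SIZE = 3            # default causal block size in latent frames


def rollout_layout(resolution, num_output_frames):
    """Return the latent grid size and causal block count for one rollout."""
    if any(side % SPATIAL_COMPRESSION for side in resolution):
        raise ValueError("resolution should be divisible by the spatial compression")
    if num_output_frames % BLOCK_SIZE:
        raise ValueError("num_output_frames should be a whole number of blocks")
    latent_grid = tuple(side // SPATIAL_COMPRESSION for side in resolution)
    return latent_grid, num_output_frames // BLOCK_SIZE


print(rollout_layout((704, 1280), 123))  # ((44, 80), 41)
```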

License

This model is released under the MIT License.

Citation

If you find this model useful, please cite:

@article{dreamxworld2026,
  title={DreamX-World: A General-Purpose Interactive World Model},
  author={DreamX Team},
  journal={arXiv preprint arXiv:2606.16993},
  year={2026}
}