# DreamX-World

A General-Purpose Interactive World Model
```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Switch device_map to "mps" for Apple devices.
pipe = DiffusionPipeline.from_pretrained(
    "GD-ML/DreamX-World-5B-Cam", dtype=torch.bfloat16, device_map="cuda"
)

prompt = "A man with short gray hair plays a red electric guitar."
image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/guitar-man.png"
)

output = pipe(image=image, prompt=prompt).frames[0]
export_to_video(output, "output.mp4")
```

DreamX-World is a general-purpose world model for interactive world simulation. It generates diverse, high-fidelity worlds that users can explore, control, and transform with event prompts.
DreamX-World-5B-Cam is the 5B-parameter camera-control variant, built on top of Wan2.2-TI2V-5B. Given a single input image, a text description, and camera action commands, it generates high-quality videos with precise camera-trajectory control, using PRoPE (Projective Position Encoding) for camera conditioning.
Install the dependencies:

```shell
pip install -r requirements.txt
```

Key dependencies:

- torch==2.5.1
- diffusers>=0.30.1
- transformers>=4.46.2
- xfuser==0.4.1
- flash_attn==2.8.3

Download the base model weights:
Create an evaluation config (see `configs/dreamx/eval.json` for examples):

```json
{
  "image_path": "./demo/your_image.png",
  "caption": "Style: Photorealistic. A description of the scene...",
  "action_seq": ["w", "wj"],
  "action_speed_list": [4, 6]
}
```
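Config entries like the one above can also be built programmatically. A minimal sketch, assuming the schema shown in the example (`make_entry` and its validation rules are illustrative, not part of the repo; the one-speed-per-action pairing is inferred from the matching list lengths above):

```python
import json

# Single-key camera commands supported by DreamX-World (w/s/a/d move, j/k tilt, l/h pan).
VALID_KEYS = set("wsadjklh")

def make_entry(image_path, caption, action_seq, action_speed_list):
    """Build one eval-config entry; composed actions like 'wj' are concatenated keys."""
    for action in action_seq:
        assert action and set(action) <= VALID_KEYS, f"unknown action: {action}"
    assert len(action_seq) == len(action_speed_list), "one speed per action"
    return {
        "image_path": image_path,
        "caption": caption,
        "action_seq": action_seq,
        "action_speed_list": action_speed_list,
    }

entry = make_entry(
    "./demo/your_image.png",
    "Style: Photorealistic. A description of the scene...",
    ["w", "wj"],
    [4, 6],
)
print(json.dumps(entry, indent=2))
```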
Configure the inference script:

```shell
# ======================== Model Path ========================
MODEL_NAME="./Wan2.2-TI2V-5B"
CONFIG_PATH="./configs/wan2.2/wan_ti2v_5b.yaml"
TRANSFORMER_PATH="./Dreamx-5b/"

# ====================== Basic Settings ======================
INPUT_DIR="./configs/dreamx/eval.json"
OUTPUT_DIR="./outputs/"
SAMPLE_HEIGHT=704
SAMPLE_WIDTH=1280
VIDEO_LENGTH=121  # 121 frames = 5s @ 24fps, 81 frames = 5s @ 16fps
FPS=24
GUIDANCE_SCALE=3.0
NUM_INFERENCE_STEPS=50
SEED=42

# ====================== Camera Control ======================
CAM_METHOD="prope"
ADD_CONTROL_ADAPTER="--add_control_adapter"

# ======================== Multi-GPU ========================
WEIGHT_DTYPE="bfloat16"
ULYSSES_DEGREE=8
RING_DEGREE=1
CUDA_DEVICES="0,1,2,3,4,5,6,7"
```

Then run:

```shell
sh inference_dreamx_5b.sh
```
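The `VIDEO_LENGTH` values follow the `fps × seconds + 1` convention common to Wan-style video models (121 = 24 × 5 + 1, 81 = 16 × 5 + 1). A quick helper to pick a valid frame count, with that convention inferred from the comments above:

```python
def frame_count(fps: int, seconds: float) -> int:
    """Frame count under the fps*seconds + 1 convention (e.g. 121 = 24*5 + 1)."""
    return int(fps * seconds) + 1

print(frame_count(24, 5))    # 121
print(frame_count(16, 5))    # 81
print(frame_count(16, 7.5))  # 121
```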
| Action | Description |
|---|---|
| `w` | Move forward |
| `s` | Move backward |
| `a` | Move left |
| `d` | Move right |
| `j` | Tilt down |
| `k` | Tilt up |
| `l` | Pan right |
| `h` | Pan left |
Actions can be composed (e.g., `wj` = move forward + tilt down, `dj` = move right + tilt down).
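Since composed actions are just concatenations of single-key commands, expanding one is a simple lookup. A sketch using the mapping from the table above (`describe` is an illustrative helper, not part of the repo):

```python
# Single-key commands, per the action table.
ACTIONS = {
    "w": "move forward", "s": "move backward",
    "a": "move left",    "d": "move right",
    "j": "tilt down",    "k": "tilt up",
    "l": "pan right",    "h": "pan left",
}

def describe(action: str) -> str:
    """Expand a composed action like 'wj' into its component motions."""
    return " + ".join(ACTIONS[key] for key in action)

print(describe("wj"))  # move forward + tilt down
print(describe("dj"))  # move right + tilt down
```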
| Attribute | Value |
|---|---|
| Architecture | Transformer-based DiT (Diffusion Transformer) |
| Parameters | ~5B |
| Base Model | Wan2.2-TI2V-5B |
| Camera Control | PRoPE (Projective Position Encoding) |
| VAE | AutoencoderKLWan3_8 (temporal compression 4×, spatial compression 16×) |
| Text Encoder | UMT5-XXL |
| Scheduler | Flow Matching Euler Discrete |
| Precision | BFloat16 |
| Max Resolution | 704 × 1280 |
| Frame Count | 121 (5s@24fps) / 81 (5s@16fps), up to 7.5s@16fps |
| Multi-GPU | Ulysses + Ring parallelism via xfuser |
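The VAE compression factors in the table determine the size of the latent tensor the DiT actually denoises. A back-of-the-envelope sketch, assuming the usual Wan-VAE `(frames − 1) / 4 + 1` temporal formula (an assumption; only the 4× / 16× factors come from the table):

```python
def latent_shape(frames: int, height: int, width: int) -> tuple:
    """Latent (frames, height, width) under 4x temporal, 16x spatial compression."""
    return ((frames - 1) // 4 + 1, height // 16, width // 16)

# Max resolution, 5s @ 24fps:
print(latent_shape(121, 704, 1280))  # (31, 44, 80)
```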
This model is released under the MIT License.
We thank the Wan Team for open-sourcing their code and models.