Add diffusers support
#1
by
dn6
HF Staff
- opened
- README.md +91 -3
- __init__.py +61 -0
- before_denoise.py +612 -0
- decoders.py +122 -0
- denoise.py +210 -0
- encoders.py +318 -0
- modular_blocks.py +45 -0
- modular_config.json +7 -0
- modular_model_index.json +76 -0
- transformer/__init__.py +31 -0
- transformer/attn.py +297 -0
- transformer/cache.py +112 -0
- transformer/config.json +49 -0
- transformer/diffusion_pytorch_model.safetensors +3 -0
- transformer/model.py +452 -0
- transformer/nn.py +153 -0
- transformer/quantize.py +245 -0
- vae/__init__.py +19 -0
- vae/__pycache__/__init__.cpython-311.pyc +0 -0
- vae/__pycache__/ae_model.cpython-311.pyc +0 -0
- vae/__pycache__/dcae.cpython-311.pyc +0 -0
- vae/ae_model.py +141 -0
- vae/config.json +33 -0
- vae/dcae.py +271 -0
- vae/diffusion_pytorch_model.safetensors +3 -0
- vae/model.py +142 -0
README.md
CHANGED
|
@@ -8,7 +8,7 @@ tags:
|
|
| 8 |
- Egocentric
|
| 9 |
---
|
| 10 |
|
| 11 |
-
Waypoint-1-Small is a 2.3 billion parameter control-and-text-conditioned causal diffusion model. It is a transformer architecture utilizing rectified flow, distilled via self forcing with DMD. The model can autoregressively generate new frames given historical frames, actions, and text.
|
| 12 |
|
| 13 |
# Capabilities:
|
| 14 |
|
|
@@ -23,12 +23,99 @@ In order to simply use Waypoint-1-Small, we recommend [Biome](https://github.com
|
|
| 23 |
|
| 24 |
To run the model locally, we recommend an NVIDIA RTX 5090, which should achieve 20-30 FPS, or an RTX 6000 Pro Blackwell, which should achieve ~35 FPS.
|
| 25 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 26 |
# Keywords
|
| 27 |
|
| 28 |
To properly explain limitations and misuse we must define some terms. While the model can be used for general interactive video generation tasks, we herein define interacting with the model via sending controls and receiving new frames as “playing” the model, and the agent/user inputting controls as the “player”. The model has two forms of output, continuations and generations. Continuations occur when seed frames are given and no inputs are given. For example, if a scene has fire or water, you may see them evolve progressively in the generated frames even if no action is given. Likewise, if you seed with an image of a humanoid entity, the entity will persist on the screen as you move/look around. However, generations occur when the player plays with the model extensively, for example moving around, turning around fully, or interacting with objects/items. Continuations roughly correspond to moving around already existing information in the given context frames while generations correspond to creating entirely new information.
|
| 29 |
|
| 30 |
# Limitations
|
| 31 |
-
|
| 32 |
- Continuations can plausibly model any inputted scene or photo, and will depend largely on the seed frame given. For generations, the model may occasionally:
|
| 33 |
- Ignore given text prompt
|
| 34 |
- Ignore certain controls in specific contexts
|
|
@@ -46,4 +133,5 @@ To properly explain limitations and misuse we must define some terms. While the
|
|
| 46 |
- For simulating extremely violent acts
|
| 47 |
- For generating violent/gory video
|
| 48 |
- For facilitation of large-scale disinformation campaigns
|
| 49 |
-
- For the purpose of generating any sexually explicit or suggestive material
|
|
|
|
|
|
| 8 |
- Egocentric
|
| 9 |
---
|
| 10 |
|
| 11 |
+
Waypoint-1-Small is a 2.3 billion parameter control-and-text-conditioned causal diffusion model. It is a transformer architecture utilizing rectified flow, distilled via self forcing with DMD. The model can autoregressively generate new frames given historical frames, actions, and text.
|
| 12 |
|
| 13 |
# Capabilities:
|
| 14 |
|
|
|
|
| 23 |
|
| 24 |
To run the model locally, we recommend an NVIDIA RTX 5090, which should achieve 20-30 FPS, or an RTX 6000 Pro Blackwell, which should achieve ~35 FPS.
|
| 25 |
|
| 26 |
+
# Run with Diffusers Modular Pipelines
|
| 27 |
+
|
| 28 |
+
World Engine and Waypoint-1 can be used with [Diffusers Modular Pipelines](https://huggingface.co/docs/diffusers/main/en/api/modular_pipelines/modular_pipeline).
|
| 29 |
+
|
| 30 |
+
## Setup
|
| 31 |
+
|
| 32 |
+
```bash
|
| 33 |
+
uv venv -p 3.11 && uv pip install \
|
| 34 |
+
torch>=2.9.0 \
|
| 35 |
+
diffusers>=0.36.0 \
|
| 36 |
+
transformers>=4.57.1 \
|
| 37 |
+
einops>=0.8.0 \
|
| 38 |
+
tensordict>=0.5.0 \
|
| 39 |
+
regex \
|
| 40 |
+
ftfy \
|
| 41 |
+
imageio \
|
| 42 |
+
imageio-ffmpeg \
|
| 43 |
+
tqdm
|
| 44 |
+
```
|
| 45 |
+
|
| 46 |
+
## Usage Example
|
| 47 |
+
|
| 48 |
+
```python
|
| 49 |
+
import random
|
| 50 |
+
import torch
|
| 51 |
+
|
| 52 |
+
from tqdm import tqdm
|
| 53 |
+
from dataclasses import dataclass, field
|
| 54 |
+
from typing import Set, Tuple
|
| 55 |
+
from diffusers.modular_pipelines import ModularPipeline
|
| 56 |
+
from diffusers.utils import load_image, export_to_video
|
| 57 |
+
|
| 58 |
+
@dataclass
|
| 59 |
+
class CtrlInput:
|
| 60 |
+
button: Set[int] = field(default_factory=set) # pressed button IDs
|
| 61 |
+
mouse: Tuple[float, float] = (0.0, 0.0) # (x, y) velocity
|
| 62 |
+
|
| 63 |
+
|
| 64 |
+
# Generate random control trajectories
|
| 65 |
+
ctrl = lambda: random.choice(
|
| 66 |
+
[
|
| 67 |
+
CtrlInput(button={48, 42}, mouse=[0.4, 0.3]),
|
| 68 |
+
CtrlInput(mouse=[0.1, 0.2]),
|
| 69 |
+
CtrlInput(button={95, 32, 105}),
|
| 70 |
+
]
|
| 71 |
+
)
|
| 72 |
+
model_id = "Overworld/Waypoint-1-Small"
|
| 73 |
+
|
| 74 |
+
pipe = ModularPipeline.from_pretrained(model_id, trust_remote_code=True)
|
| 75 |
+
pipe.load_components(
|
| 76 |
+
device_map="cuda", torch_dtype=torch.bfloat16, trust_remote_code=True
|
| 77 |
+
)
|
| 78 |
+
pipe.transformer.apply_inference_patches()
|
| 79 |
+
|
| 80 |
+
# Optional Quantization Step
|
| 81 |
+
# Available options are: nvfp4 (if running on Blackwell hardware), fp8, w8a8
|
| 82 |
+
# pipe.transformer.quantize("nvfp4")
|
| 83 |
+
pipe.transformer.compile(fullgraph=True, mode="max-autotune", dynamic=False)
|
| 84 |
+
pipe.vae.bake_weight_norm()
|
| 85 |
+
pipe.vae.compile(fullgraph=True, mode="max-autotune")
|
| 86 |
+
|
| 87 |
+
prompt = "A fun game"
|
| 88 |
+
image = load_image(
|
| 89 |
+
"https://gist.github.com/user-attachments/assets/4adc5a3d-6980-4d1e-b6e8-9033cdf61c66"
|
| 90 |
+
)
|
| 91 |
+
|
| 92 |
+
num_frames = 240
|
| 93 |
+
outputs = []
|
| 94 |
+
|
| 95 |
+
# create world state based on an initial image
|
| 96 |
+
state = pipe(prompt=prompt, image=image, button=ctrl().button, mouse=ctrl().mouse)
|
| 97 |
+
outputs.append(state.values["images"])
|
| 98 |
+
|
| 99 |
+
state.values["image"] = None
|
| 100 |
+
for _ in tqdm(range(1, num_frames)):
|
| 101 |
+
state = pipe(
|
| 102 |
+
state,
|
| 103 |
+
prompt=prompt,
|
| 104 |
+
button=ctrl().button,
|
| 105 |
+
mouse=ctrl().mouse,
|
| 106 |
+
output_type="pil",
|
| 107 |
+
)
|
| 108 |
+
outputs.append(state.values["images"])
|
| 109 |
+
|
| 110 |
+
export_to_video(outputs, "waypoint-1-small.mp4", fps=60)
|
| 111 |
+
```
|
| 112 |
+
|
| 113 |
# Keywords
|
| 114 |
|
| 115 |
To properly explain limitations and misuse we must define some terms. While the model can be used for general interactive video generation tasks, we herein define interacting with the model via sending controls and receiving new frames as “playing” the model, and the agent/user inputting controls as the “player”. The model has two forms of output, continuations and generations. Continuations occur when seed frames are given and no inputs are given. For example, if a scene has fire or water, you may see them evolve progressively in the generated frames even if no action is given. Likewise, if you seed with an image of a humanoid entity, the entity will persist on the screen as you move/look around. However, generations occur when the player plays with the model extensively, for example moving around, turning around fully, or interacting with objects/items. Continuations roughly correspond to moving around already existing information in the given context frames while generations correspond to creating entirely new information.
|
| 116 |
|
| 117 |
# Limitations
|
| 118 |
+
|
| 119 |
- Continuations can plausibly model any inputted scene or photo, and will depend largely on the seed frame given. For generations, the model may occasionally:
|
| 120 |
- Ignore given text prompt
|
| 121 |
- Ignore certain controls in specific contexts
|
|
|
|
| 133 |
- For simulating extremely violent acts
|
| 134 |
- For generating violent/gory video
|
| 135 |
- For facilitation of large-scale disinformation campaigns
|
| 136 |
+
- For the purpose of generating any sexually explicit or suggestive material
|
| 137 |
+
|
__init__.py
ADDED
|
@@ -0,0 +1,61 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Copyright (C) 2025 Hugging Face Team and Overworld
|
| 2 |
+
#
|
| 3 |
+
# This program is free software: you can redistribute it and/or modify
|
| 4 |
+
# it under the terms of the GNU General Public License as published by
|
| 5 |
+
# the Free Software Foundation, either version 3 of the License, or
|
| 6 |
+
# (at your option) any later version.
|
| 7 |
+
#
|
| 8 |
+
# This program is distributed in the hope that it will be useful,
|
| 9 |
+
# but WITHOUT ANY WARRANTY; without even the implied warranty of
|
| 10 |
+
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
| 11 |
+
# GNU General Public License for more details.
|
| 12 |
+
#
|
| 13 |
+
# You should have received a copy of the GNU General Public License
|
| 14 |
+
# along with this program. If not, see <https://www.gnu.org/licenses/>.
|
| 15 |
+
|
| 16 |
+
"""
|
| 17 |
+
WorldEngine Modular Pipeline
|
| 18 |
+
|
| 19 |
+
A Diffusers-compatible modular pipeline for frame-by-frame world model generation.
|
| 20 |
+
Supports text and controller (mouse + button + scroll) conditioning.
|
| 21 |
+
"""
|
| 22 |
+
|
| 23 |
+
from .modular_blocks import WorldEngineBlocks, AUTO_BLOCKS
|
| 24 |
+
from .encoders import WorldEngineTextEncoderStep, WorldEngineControllerEncoderStep
|
| 25 |
+
from .before_denoise import (
|
| 26 |
+
WorldEngineBeforeDenoiseStep,
|
| 27 |
+
WorldEngineSetTimestepsStep,
|
| 28 |
+
WorldEnginePrepareLatentsStep,
|
| 29 |
+
WorldEngineSetupKVCacheStep,
|
| 30 |
+
StaticKVCache,
|
| 31 |
+
LayerKVCache,
|
| 32 |
+
)
|
| 33 |
+
from .denoise import WorldEngineDenoiseLoop
|
| 34 |
+
from .decoders import WorldEngineDecodeStep
|
| 35 |
+
from .vae import WorldEngineVAE
|
| 36 |
+
|
| 37 |
+
__version__ = "0.1.0"
|
| 38 |
+
|
| 39 |
+
__all__ = [
|
| 40 |
+
# Main pipeline blocks
|
| 41 |
+
"WorldEngineBlocks",
|
| 42 |
+
"AUTO_BLOCKS",
|
| 43 |
+
# Encoder blocks
|
| 44 |
+
"WorldEngineTextEncoderStep",
|
| 45 |
+
"WorldEngineControllerEncoderStep",
|
| 46 |
+
# Before denoise blocks
|
| 47 |
+
"WorldEngineBeforeDenoiseStep",
|
| 48 |
+
"WorldEngineSetTimestepsStep",
|
| 49 |
+
"WorldEnginePrepareLatentsStep",
|
| 50 |
+
"WorldEngineSetupKVCacheStep",
|
| 51 |
+
# Denoise block
|
| 52 |
+
"WorldEngineDenoiseLoop",
|
| 53 |
+
# Decoder blocks
|
| 54 |
+
"WorldEngineDecodeStep",
|
| 55 |
+
# Models
|
| 56 |
+
"WorldModel",
|
| 57 |
+
"WorldEngineVAE",
|
| 58 |
+
# KV Cache
|
| 59 |
+
"StaticKVCache",
|
| 60 |
+
"LayerKVCache",
|
| 61 |
+
]
|
before_denoise.py
ADDED
|
@@ -0,0 +1,612 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Copyright (C) 2025 Hugging Face Team and Overworld
|
| 2 |
+
#
|
| 3 |
+
# This program is free software: you can redistribute it and/or modify
|
| 4 |
+
# it under the terms of the GNU General Public License as published by
|
| 5 |
+
# the Free Software Foundation, either version 3 of the License, or
|
| 6 |
+
# (at your option) any later version.
|
| 7 |
+
#
|
| 8 |
+
# This program is distributed in the hope that it will be useful,
|
| 9 |
+
# but WITHOUT ANY WARRANTY; without even the implied warranty of
|
| 10 |
+
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
| 11 |
+
# GNU General Public License for more details.
|
| 12 |
+
#
|
| 13 |
+
# You should have received a copy of the GNU General Public License
|
| 14 |
+
# along with this program. If not, see <https://www.gnu.org/licenses/>.
|
| 15 |
+
|
| 16 |
+
"""Before-denoise blocks for WorldEngine modular pipeline."""
|
| 17 |
+
|
| 18 |
+
from typing import List, Optional, Union
|
| 19 |
+
|
| 20 |
+
import PIL.Image
|
| 21 |
+
import torch
|
| 22 |
+
from torch import nn, Tensor
|
| 23 |
+
from tensordict import TensorDict
|
| 24 |
+
from torch.nn.attention.flex_attention import _DEFAULT_SPARSE_BLOCK_SIZE, BlockMask
|
| 25 |
+
|
| 26 |
+
from diffusers.configuration_utils import FrozenDict
|
| 27 |
+
from diffusers.image_processor import VaeImageProcessor
|
| 28 |
+
from diffusers.utils import logging
|
| 29 |
+
from diffusers.utils.torch_utils import randn_tensor
|
| 30 |
+
from diffusers.modular_pipelines import (
|
| 31 |
+
ModularPipelineBlocks,
|
| 32 |
+
ModularPipeline,
|
| 33 |
+
PipelineState,
|
| 34 |
+
SequentialPipelineBlocks,
|
| 35 |
+
)
|
| 36 |
+
from diffusers.modular_pipelines.modular_pipeline_utils import (
|
| 37 |
+
ComponentSpec,
|
| 38 |
+
ConfigSpec,
|
| 39 |
+
InputParam,
|
| 40 |
+
OutputParam,
|
| 41 |
+
)
|
| 42 |
+
|
| 43 |
+
logger = logging.get_logger(__name__)
|
| 44 |
+
|
| 45 |
+
|
| 46 |
+
def make_block_mask(T: int, L: int, written: torch.Tensor) -> BlockMask:
|
| 47 |
+
"""
|
| 48 |
+
Create a block mask for flex_attention.
|
| 49 |
+
|
| 50 |
+
Args:
|
| 51 |
+
T: Q length for this frame
|
| 52 |
+
L: KV capacity == written.numel()
|
| 53 |
+
written: [L] bool, True where there is valid KV data
|
| 54 |
+
"""
|
| 55 |
+
BS = _DEFAULT_SPARSE_BLOCK_SIZE
|
| 56 |
+
KV_blocks = (L + BS - 1) // BS
|
| 57 |
+
Q_blocks = (T + BS - 1) // BS
|
| 58 |
+
|
| 59 |
+
# [KV_blocks, BS]
|
| 60 |
+
written_blocks = torch.nn.functional.pad(written, (0, KV_blocks * BS - L)).view(
|
| 61 |
+
KV_blocks, BS
|
| 62 |
+
)
|
| 63 |
+
|
| 64 |
+
# Block-level occupancy
|
| 65 |
+
block_any = written_blocks.any(-1) # block has at least one written token
|
| 66 |
+
block_all = written_blocks.all(-1) # block is fully written
|
| 67 |
+
|
| 68 |
+
# Every Q-block sees the same KV-block pattern
|
| 69 |
+
nonzero_bm = block_any[None, :].expand(Q_blocks, KV_blocks) # [Q_blocks, KV_blocks]
|
| 70 |
+
full_bm = block_all[None, :].expand_as(nonzero_bm) # [Q_blocks, KV_blocks]
|
| 71 |
+
partial_bm = nonzero_bm & ~full_bm # [Q_blocks, KV_blocks]
|
| 72 |
+
|
| 73 |
+
def dense_to_ordered(dense_mask: torch.Tensor):
|
| 74 |
+
# dense_mask: [Q_blocks, KV_blocks] bool
|
| 75 |
+
# returns: [1,1,Q_blocks], [1,1,Q_blocks,KV_blocks]
|
| 76 |
+
num_blocks = dense_mask.sum(dim=-1, dtype=torch.int32) # [Q_blocks]
|
| 77 |
+
indices = dense_mask.argsort(dim=-1, descending=True, stable=True).to(
|
| 78 |
+
torch.int32
|
| 79 |
+
)
|
| 80 |
+
return num_blocks[None, None].contiguous(), indices[None, None].contiguous()
|
| 81 |
+
|
| 82 |
+
# Partial blocks (need mask_mod)
|
| 83 |
+
kv_num_blocks, kv_indices = dense_to_ordered(partial_bm)
|
| 84 |
+
|
| 85 |
+
# Full blocks (mask_mod can be skipped entirely)
|
| 86 |
+
full_kv_num_blocks, full_kv_indices = dense_to_ordered(full_bm)
|
| 87 |
+
|
| 88 |
+
def mask_mod(b, h, q, kv):
|
| 89 |
+
return written[kv]
|
| 90 |
+
|
| 91 |
+
bm = BlockMask.from_kv_blocks(
|
| 92 |
+
kv_num_blocks,
|
| 93 |
+
kv_indices,
|
| 94 |
+
full_kv_num_blocks,
|
| 95 |
+
full_kv_indices,
|
| 96 |
+
BLOCK_SIZE=BS,
|
| 97 |
+
mask_mod=mask_mod,
|
| 98 |
+
seq_lengths=(T, L),
|
| 99 |
+
compute_q_blocks=False, # no backward, avoids the transpose/_ordered_to_dense path
|
| 100 |
+
)
|
| 101 |
+
|
| 102 |
+
return bm
|
| 103 |
+
|
| 104 |
+
|
| 105 |
+
class LayerKVCache(nn.Module):
|
| 106 |
+
"""
|
| 107 |
+
Ring-buffer KV cache with fixed capacity L (tokens) for history plus
|
| 108 |
+
one extra frame (tokens_per_frame) at the tail holding the current frame.
|
| 109 |
+
"""
|
| 110 |
+
|
| 111 |
+
def __init__(
|
| 112 |
+
self, B, H, L, Dh, dtype, tokens_per_frame: int, pinned_dilation: int = 1
|
| 113 |
+
):
|
| 114 |
+
super().__init__()
|
| 115 |
+
self.tpf = tokens_per_frame
|
| 116 |
+
self.L = L
|
| 117 |
+
# total KV capacity: ring (L) + tail frame (tpf)
|
| 118 |
+
self.capacity = L + self.tpf
|
| 119 |
+
self.pinned_dilation = pinned_dilation
|
| 120 |
+
self.num_buckets = (L // self.tpf) // self.pinned_dilation
|
| 121 |
+
assert (L // self.tpf) % pinned_dilation == 0 and L % self.tpf == 0
|
| 122 |
+
|
| 123 |
+
# KV buffer: [2, B, H, capacity, Dh]
|
| 124 |
+
self.kv = nn.Buffer(
|
| 125 |
+
torch.zeros(2, B, H, self.capacity, Dh, dtype=dtype),
|
| 126 |
+
persistent=False,
|
| 127 |
+
)
|
| 128 |
+
|
| 129 |
+
# which slots have ever been written
|
| 130 |
+
# tail slice [L, L+tpf) always holds the current frame and is considered written
|
| 131 |
+
written = torch.zeros(self.capacity, dtype=torch.bool)
|
| 132 |
+
written[L:] = True
|
| 133 |
+
self.written = nn.Buffer(written, persistent=False)
|
| 134 |
+
|
| 135 |
+
# Precompute indices:
|
| 136 |
+
# frame_offsets: [0, 1, ..., tpf-1] (for ring indexing)
|
| 137 |
+
# current_idx: [L, L+1, ..., L+tpf-1] (tail slice)
|
| 138 |
+
self.frame_offsets = nn.Buffer(
|
| 139 |
+
torch.arange(self.tpf, dtype=torch.long), persistent=False
|
| 140 |
+
)
|
| 141 |
+
self.current_idx = nn.Buffer(self.frame_offsets + L, persistent=False)
|
| 142 |
+
|
| 143 |
+
def reset(self):
|
| 144 |
+
self.kv.zero_()
|
| 145 |
+
self.written.zero_()
|
| 146 |
+
self.written[self.L :].fill_(True)
|
| 147 |
+
|
| 148 |
+
def upsert(self, kv: Tensor, pos_ids: TensorDict, is_frozen: bool):
|
| 149 |
+
"""
|
| 150 |
+
Args:
|
| 151 |
+
kv: [2, B, H, T, Dh] for a single frame (T = tokens_per_frame)
|
| 152 |
+
pos_ids: TensorDict with t_pos [B, T], all equal per frame (ignoring -1)
|
| 153 |
+
"""
|
| 154 |
+
T = self.tpf
|
| 155 |
+
t_pos = pos_ids["t_pos"]
|
| 156 |
+
|
| 157 |
+
if not torch.compiler.is_compiling():
|
| 158 |
+
torch._check(
|
| 159 |
+
kv.size(3) == self.tpf, "KV cache expects exactly one frame per upsert"
|
| 160 |
+
)
|
| 161 |
+
torch._check(t_pos.shape == (kv.size(1), T), "t_pos must be [B, T]")
|
| 162 |
+
torch._check(self.tpf <= self.L, "frame longer than KV ring capacity")
|
| 163 |
+
torch._check(
|
| 164 |
+
self.L % self.tpf == 0,
|
| 165 |
+
f"L ({self.L}) must be a multiple of tokens_per_frame ({self.tpf})",
|
| 166 |
+
)
|
| 167 |
+
torch._check(
|
| 168 |
+
self.kv.size(3) == self.capacity,
|
| 169 |
+
"KV buffer has unexpected length (expected L + tokens_per_frame)",
|
| 170 |
+
)
|
| 171 |
+
torch._check(
|
| 172 |
+
(t_pos >= 0).all().item(),
|
| 173 |
+
"t_pos must be non-negative during inference",
|
| 174 |
+
)
|
| 175 |
+
torch._check(
|
| 176 |
+
((t_pos == t_pos[:, :1]).all()).item(),
|
| 177 |
+
"t_pos must be constant within frame",
|
| 178 |
+
)
|
| 179 |
+
|
| 180 |
+
frame_t = t_pos[0, 0]
|
| 181 |
+
|
| 182 |
+
# map frame_t to a bucket, each bucket owns T contiguous slots
|
| 183 |
+
bucket = (frame_t + (self.pinned_dilation - 1)) // self.pinned_dilation
|
| 184 |
+
slot = bucket % self.num_buckets
|
| 185 |
+
base = slot * T
|
| 186 |
+
|
| 187 |
+
# indices in the ring for this frame: [T] in [0, L)
|
| 188 |
+
ring_idx = self.frame_offsets + base
|
| 189 |
+
|
| 190 |
+
# Always write current frame into the tail slice [L, L+T):
|
| 191 |
+
# this is the "self-attention component" for the current frame.
|
| 192 |
+
self.kv.index_copy_(3, self.current_idx, kv)
|
| 193 |
+
|
| 194 |
+
write_step = frame_t.remainder(self.pinned_dilation) == 0
|
| 195 |
+
mask_written = self.written.clone()
|
| 196 |
+
mask_written[ring_idx] = mask_written[ring_idx] & ~write_step
|
| 197 |
+
bm = make_block_mask(T, self.capacity, mask_written)
|
| 198 |
+
|
| 199 |
+
# Persist current frame into the ring for future queries when unfrozen.
|
| 200 |
+
if not is_frozen:
|
| 201 |
+
# Persist current frame into the ring for future queries.
|
| 202 |
+
dst = torch.where(write_step, ring_idx, self.current_idx)
|
| 203 |
+
self.kv.index_copy_(3, dst, kv)
|
| 204 |
+
self.written[dst] = True
|
| 205 |
+
|
| 206 |
+
k, v = self.kv.unbind(0)
|
| 207 |
+
return k, v, bm
|
| 208 |
+
|
| 209 |
+
|
| 210 |
+
class StaticKVCache(nn.Module):
|
| 211 |
+
"""Static KV cache with per-layer configuration for local/global attention."""
|
| 212 |
+
|
| 213 |
+
def __init__(self, config, batch_size, dtype):
|
| 214 |
+
super().__init__()
|
| 215 |
+
|
| 216 |
+
self.tpf = config.tokens_per_frame
|
| 217 |
+
|
| 218 |
+
local_L = config.local_window * self.tpf
|
| 219 |
+
global_L = config.global_window * self.tpf
|
| 220 |
+
|
| 221 |
+
period = config.global_attn_period
|
| 222 |
+
off = getattr(config, "global_attn_offset", 0) % period
|
| 223 |
+
self.layers = nn.ModuleList(
|
| 224 |
+
[
|
| 225 |
+
LayerKVCache(
|
| 226 |
+
batch_size,
|
| 227 |
+
getattr(config, "n_kv_heads", config.n_heads),
|
| 228 |
+
global_L if ((layer_idx - off) % period == 0) else local_L,
|
| 229 |
+
config.d_model // config.n_heads,
|
| 230 |
+
dtype,
|
| 231 |
+
self.tpf,
|
| 232 |
+
(
|
| 233 |
+
config.global_pinned_dilation
|
| 234 |
+
if ((layer_idx - off) % period == 0)
|
| 235 |
+
else 1
|
| 236 |
+
),
|
| 237 |
+
)
|
| 238 |
+
for layer_idx in range(config.n_layers)
|
| 239 |
+
]
|
| 240 |
+
)
|
| 241 |
+
|
| 242 |
+
self._is_frozen = True
|
| 243 |
+
|
| 244 |
+
def reset(self):
|
| 245 |
+
for layer in self.layers:
|
| 246 |
+
layer.reset()
|
| 247 |
+
self._is_frozen = True
|
| 248 |
+
|
| 249 |
+
def set_frozen(self, is_frozen: bool):
|
| 250 |
+
self._is_frozen = is_frozen
|
| 251 |
+
|
| 252 |
+
def upsert(self, k: Tensor, v: Tensor, pos_ids: TensorDict, layer: int):
|
| 253 |
+
kv = torch.stack([k, v], dim=0)
|
| 254 |
+
return self.layers[layer].upsert(kv, pos_ids, self._is_frozen)
|
| 255 |
+
|
| 256 |
+
|
| 257 |
+
class WorldEngineSetTimestepsStep(ModularPipelineBlocks):
|
| 258 |
+
"""Sets up the scheduler sigmas for rectified flow denoising."""
|
| 259 |
+
|
| 260 |
+
model_name = "world_engine"
|
| 261 |
+
|
| 262 |
+
@property
|
| 263 |
+
def description(self) -> str:
|
| 264 |
+
return "Sets up scheduler sigmas for rectified flow denoising"
|
| 265 |
+
|
| 266 |
+
@property
|
| 267 |
+
def expected_components(self) -> List[ComponentSpec]:
|
| 268 |
+
return []
|
| 269 |
+
|
| 270 |
+
@property
|
| 271 |
+
def expected_configs(self) -> List[ConfigSpec]:
|
| 272 |
+
return [ConfigSpec("scheduler_sigmas", [1.0, 0.94921875, 0.83984375, 0.0])]
|
| 273 |
+
|
| 274 |
+
@property
|
| 275 |
+
def inputs(self) -> List[InputParam]:
|
| 276 |
+
return [
|
| 277 |
+
InputParam(
|
| 278 |
+
"scheduler_sigmas",
|
| 279 |
+
type_hint=List[float],
|
| 280 |
+
description="Custom scheduler sigmas (overrides config)",
|
| 281 |
+
),
|
| 282 |
+
InputParam(
|
| 283 |
+
"frame_timestamp",
|
| 284 |
+
type_hint=torch.Tensor,
|
| 285 |
+
description="Current frame timestamp",
|
| 286 |
+
),
|
| 287 |
+
]
|
| 288 |
+
|
| 289 |
+
@property
|
| 290 |
+
def intermediate_outputs(self) -> List[OutputParam]:
|
| 291 |
+
return [
|
| 292 |
+
OutputParam(
|
| 293 |
+
"scheduler_sigmas",
|
| 294 |
+
type_hint=torch.Tensor,
|
| 295 |
+
description="Tensor of scheduler sigmas for denoising",
|
| 296 |
+
),
|
| 297 |
+
OutputParam(
|
| 298 |
+
"frame_timestamp",
|
| 299 |
+
type_hint=torch.Tensor,
|
| 300 |
+
description="Current frame timestamp",
|
| 301 |
+
),
|
| 302 |
+
]
|
| 303 |
+
|
| 304 |
+
@torch.no_grad()
|
| 305 |
+
def __call__(
|
| 306 |
+
self, components: ModularPipeline, state: PipelineState
|
| 307 |
+
) -> PipelineState:
|
| 308 |
+
block_state = self.get_block_state(state)
|
| 309 |
+
device = components._execution_device
|
| 310 |
+
dtype = components.transformer.dtype
|
| 311 |
+
|
| 312 |
+
# Use provided sigmas or get from config
|
| 313 |
+
sigmas = block_state.scheduler_sigmas
|
| 314 |
+
if sigmas is None:
|
| 315 |
+
sigmas = components.config.scheduler_sigmas
|
| 316 |
+
block_state.scheduler_sigmas = torch.tensor(
|
| 317 |
+
sigmas, device=device, dtype=dtype
|
| 318 |
+
)
|
| 319 |
+
|
| 320 |
+
frame_ts = block_state.frame_timestamp
|
| 321 |
+
if frame_ts is None:
|
| 322 |
+
frame_ts = torch.tensor([[0]], dtype=torch.long, device=device)
|
| 323 |
+
elif isinstance(frame_ts, int):
|
| 324 |
+
frame_ts = torch.tensor([[frame_ts]], dtype=torch.long, device=device)
|
| 325 |
+
|
| 326 |
+
block_state.frame_timestamp = frame_ts
|
| 327 |
+
|
| 328 |
+
self.set_block_state(state, block_state)
|
| 329 |
+
return components, state
|
| 330 |
+
|
| 331 |
+
|
| 332 |
+
class WorldEngineSetupKVCacheStep(ModularPipelineBlocks):
|
| 333 |
+
"""Initializes or reuses the KV cache for autoregressive generation."""
|
| 334 |
+
|
| 335 |
+
model_name = "world_engine"
|
| 336 |
+
|
| 337 |
+
@property
|
| 338 |
+
def description(self) -> str:
|
| 339 |
+
return "Initializes or reuses KV cache for autoregressive frame generation"
|
| 340 |
+
|
| 341 |
+
@property
|
| 342 |
+
def expected_components(self) -> List[ComponentSpec]:
|
| 343 |
+
return []
|
| 344 |
+
|
| 345 |
+
@property
|
| 346 |
+
def inputs(self) -> List[InputParam]:
|
| 347 |
+
return [
|
| 348 |
+
InputParam(
|
| 349 |
+
"kv_cache",
|
| 350 |
+
type_hint=Optional[StaticKVCache],
|
| 351 |
+
description="Existing KV cache (will be reused if provided)",
|
| 352 |
+
),
|
| 353 |
+
InputParam(
|
| 354 |
+
"reset_cache",
|
| 355 |
+
type_hint=bool,
|
| 356 |
+
default=False,
|
| 357 |
+
description="If True, reset the KV cache even if one exists",
|
| 358 |
+
),
|
| 359 |
+
]
|
| 360 |
+
|
| 361 |
+
@property
|
| 362 |
+
def intermediate_outputs(self) -> List[OutputParam]:
|
| 363 |
+
return [
|
| 364 |
+
OutputParam(
|
| 365 |
+
"kv_cache",
|
| 366 |
+
type_hint=StaticKVCache,
|
| 367 |
+
description="KV cache for transformer attention",
|
| 368 |
+
),
|
| 369 |
+
]
|
| 370 |
+
|
| 371 |
+
@torch.no_grad()
|
| 372 |
+
def __call__(
|
| 373 |
+
self, components: ModularPipeline, state: PipelineState
|
| 374 |
+
) -> PipelineState:
|
| 375 |
+
block_state = self.get_block_state(state)
|
| 376 |
+
device = components._execution_device
|
| 377 |
+
dtype = components.transformer.dtype
|
| 378 |
+
|
| 379 |
+
# Create or reuse KV cache
|
| 380 |
+
if block_state.kv_cache is None:
|
| 381 |
+
block_state.kv_cache = StaticKVCache(
|
| 382 |
+
components.transformer.config,
|
| 383 |
+
batch_size=1,
|
| 384 |
+
dtype=dtype,
|
| 385 |
+
).to(device)
|
| 386 |
+
elif block_state.reset_cache:
|
| 387 |
+
block_state.kv_cache.reset()
|
| 388 |
+
|
| 389 |
+
self.set_block_state(state, block_state)
|
| 390 |
+
return components, state
|
| 391 |
+
|
| 392 |
+
|
| 393 |
+
class WorldEnginePrepareLatentsStep(ModularPipelineBlocks):
    """Prepares latents for frame generation, optionally encoding an input image.

    If an image is supplied (typically only on the first frame) it is
    preprocessed, encoded by the VAE, and persisted into the KV cache via a
    zero-sigma transformer pass. Fresh random noise is then created for the
    actual denoising step.
    """

    model_name = "world_engine"

    @property
    def description(self) -> str:
        return (
            "Prepares latents for frame generation. If an image is provided on the "
            "first frame, encodes it and caches it as context. Always creates fresh "
            "random noise for the actual denoising."
        )

    @property
    def expected_components(self) -> List[ComponentSpec]:
        return [
            ComponentSpec(
                "image_processor",
                VaeImageProcessor,
                config=FrozenDict(
                    {
                        "vae_scale_factor": 16,
                        "do_normalize": False,
                        "do_convert_rgb": False,
                    }
                ),
                default_creation_method="from_config",
            ),
        ]

    @property
    def expected_configs(self) -> List[ConfigSpec]:
        return [
            ConfigSpec("channels", 16),
            ConfigSpec("height", 16),
            ConfigSpec("width", 16),
            ConfigSpec("patch", [2, 2]),
            ConfigSpec("vae_scale_factor", 16),
        ]

    @property
    def inputs(self) -> List[InputParam]:
        return [
            InputParam(
                "image",
                type_hint=Union[PIL.Image.Image, torch.Tensor],
                description="Input image (PIL Image or [H, W, 3] uint8 tensor), only used on first frame",
            ),
            InputParam(
                "latents",
                type_hint=torch.Tensor,
                description="Latent tensor for denoising [1, 1, C, H, W]. Only used if use_random_latents=False.",
            ),
            InputParam(
                "use_random_latents",
                type_hint=bool,
                default=True,
                description="If True, always generate fresh random latents. If False, use provided latents.",
            ),
            InputParam(
                "kv_cache",
                description="KV cache to update",
            ),
            InputParam(
                "frame_timestamp",
                type_hint=torch.Tensor,
                description="Current frame timestamp",
            ),
            InputParam(
                "prompt_embeds",
                type_hint=torch.Tensor,
                description="Prompt embeddings for cache pass",
            ),
            InputParam(
                "prompt_pad_mask",
                type_hint=torch.Tensor,
                description="Prompt padding mask",
            ),
            InputParam(
                "button_tensor",
                type_hint=torch.Tensor,
                description="Button tensor for cache pass",
            ),
            InputParam(
                "mouse_tensor",
                type_hint=torch.Tensor,
                description="Mouse tensor for cache pass",
            ),
            InputParam(
                "scroll_tensor",
                type_hint=torch.Tensor,
                description="Scroll tensor for cache pass",
            ),
            InputParam(
                "generator",
                type_hint=torch.Generator,
                default=None,
                description="torch Generator for deterministic output",
            ),
        ]

    @property
    def intermediate_outputs(self) -> List[OutputParam]:
        return [
            OutputParam(
                "latents",
                type_hint=torch.Tensor,
                description="Latent tensor for denoising [1, 1, C, H, W]",
            ),
        ]

    @staticmethod
    def _cache_pass(
        transformer,
        x,
        frame_timestamp,
        prompt_emb,
        prompt_pad_mask,
        mouse,
        button,
        scroll,
        kv_cache,
    ):
        """Cache pass to persist frame in KV cache.

        Runs the transformer at sigma=0 with the cache unfrozen so the frame's
        keys/values are written into the cache; the output is discarded.
        """
        kv_cache.set_frozen(False)
        transformer(
            x=x,
            sigma=x.new_zeros((x.size(0), x.size(1))),
            frame_timestamp=frame_timestamp,
            prompt_emb=prompt_emb,
            prompt_pad_mask=prompt_pad_mask,
            mouse=mouse,
            button=button,
            scroll=scroll,
            kv_cache=kv_cache,
        )

    @torch.inference_mode()
    def __call__(
        self, components: ModularPipeline, state: PipelineState
    ) -> PipelineState:
        block_state = self.get_block_state(state)
        device = components._execution_device
        # NOTE(review): transformer dtype is fetched but latents are created as
        # bfloat16 below — confirm whether they were meant to match.
        dtype = components.transformer.dtype

        # Get latent shape info
        channels = components.config.channels
        height = components.config.height
        width = components.config.width
        patch = components.config.patch

        pH, pW = patch if isinstance(patch, (list, tuple)) else (patch, patch)
        shape = (
            1,
            1,
            channels,
            components.config.vae_scale_factor * pH,
            components.config.vae_scale_factor * pW,
        )

        if block_state.image is not None:
            image = block_state.image
            # Preprocess: PIL/tensor -> [B, C, H, W] float32 in [0, 1]
            image = components.image_processor.preprocess(
                image,
                height=height,
                width=width,
            )
            # Convert to [H, W, 3] uint8 for VAE encoder
            image = (image[0].permute(1, 2, 0) * 255).to(torch.uint8)

            # Hard failure here instead of `assert`: asserts are stripped
            # under `python -O`, and a wrong dtype would corrupt the encode.
            if image.dtype != torch.uint8:
                raise ValueError(f"Expected uint8 image, got {image.dtype}")

            latents = components.vae.encode(image)
            latents = latents.unsqueeze(1)

            # Run cache pass to persist encoded frame
            self._cache_pass(
                components.transformer,
                latents,
                block_state.frame_timestamp,
                block_state.prompt_embeds,
                block_state.prompt_pad_mask,
                block_state.mouse_tensor,
                block_state.button_tensor,
                block_state.scroll_tensor,
                block_state.kv_cache,
            )
            block_state.frame_timestamp.add_(1)

        # Generate latents based on use_random_latents flag.
        # Bug fix: the declared `generator` input was never wired into
        # torch.randn, making deterministic generation impossible.
        # (The generator's device must match `device` for CUDA runs.)
        if block_state.use_random_latents or block_state.latents is None:
            block_state.latents = torch.randn(
                shape,
                device=device,
                dtype=torch.bfloat16,
                generator=block_state.generator,
            )

        self.set_block_state(state, block_state)
        return components, state
|
| 593 |
+
|
| 594 |
+
|
| 595 |
+
class WorldEngineBeforeDenoiseStep(SequentialPipelineBlocks):
    """Sequential pipeline that prepares all inputs for denoising."""

    # Sub-blocks run in order: timesteps first, then the cache, then latents.
    block_classes = [
        WorldEngineSetTimestepsStep,
        WorldEngineSetupKVCacheStep,
        WorldEnginePrepareLatentsStep,
    ]
    block_names = ["set_timesteps", "setup_kv_cache", "prepare_latents"]

    @property
    def description(self) -> str:
        lines = (
            "Before denoise step that prepares inputs for denoising:\n"
            " - WorldEngineSetTimestepsStep: Set up scheduler sigmas\n"
            " - WorldEngineSetupKVCacheStep: Initialize or reuse KV cache\n"
            " - WorldEnginePrepareLatentsStep: Encode image (if first frame) and create noise"
        )
        return lines
|
decoders.py
ADDED
|
@@ -0,0 +1,122 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Copyright (C) 2025 Hugging Face Team and Overworld
|
| 2 |
+
#
|
| 3 |
+
# This program is free software: you can redistribute it and/or modify
|
| 4 |
+
# it under the terms of the GNU General Public License as published by
|
| 5 |
+
# the Free Software Foundation, either version 3 of the License, or
|
| 6 |
+
# (at your option) any later version.
|
| 7 |
+
#
|
| 8 |
+
# This program is distributed in the hope that it will be useful,
|
| 9 |
+
# but WITHOUT ANY WARRANTY; without even the implied warranty of
|
| 10 |
+
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
| 11 |
+
# GNU General Public License for more details.
|
| 12 |
+
#
|
| 13 |
+
# You should have received a copy of the GNU General Public License
|
| 14 |
+
# along with this program. If not, see <https://www.gnu.org/licenses/>.
|
| 15 |
+
|
| 16 |
+
"""Decoder blocks for WorldEngine modular pipeline."""
|
| 17 |
+
|
| 18 |
+
from typing import List, Union
|
| 19 |
+
|
| 20 |
+
import numpy as np
|
| 21 |
+
import PIL.Image
|
| 22 |
+
import torch
|
| 23 |
+
|
| 24 |
+
from diffusers import AutoModel
|
| 25 |
+
from diffusers.configuration_utils import FrozenDict
|
| 26 |
+
from diffusers.image_processor import VaeImageProcessor
|
| 27 |
+
from diffusers.utils import logging
|
| 28 |
+
from diffusers.modular_pipelines import (
|
| 29 |
+
ModularPipelineBlocks,
|
| 30 |
+
ModularPipeline,
|
| 31 |
+
PipelineState,
|
| 32 |
+
)
|
| 33 |
+
from diffusers.modular_pipelines.modular_pipeline_utils import (
|
| 34 |
+
ComponentSpec,
|
| 35 |
+
InputParam,
|
| 36 |
+
OutputParam,
|
| 37 |
+
)
|
| 38 |
+
|
| 39 |
+
logger = logging.get_logger(__name__)
|
| 40 |
+
|
| 41 |
+
|
| 42 |
+
class WorldEngineDecodeStep(ModularPipelineBlocks):
    """Decodes denoised latents back to RGB image using VAE."""

    model_name = "world_engine"

    @property
    def expected_components(self) -> List[ComponentSpec]:
        processor_config = FrozenDict(
            {
                "vae_scale_factor": 16,
                "do_normalize": False,
                "do_convert_rgb": True,
            }
        )
        return [
            ComponentSpec("vae", AutoModel),
            ComponentSpec(
                "image_processor",
                VaeImageProcessor,
                config=processor_config,
                default_creation_method="from_config",
            ),
        ]

    @property
    def description(self) -> str:
        return "Decodes denoised latents to RGB image using the VAE decoder"

    @property
    def inputs(self) -> List[InputParam]:
        return [
            InputParam(
                "latents",
                required=True,
                type_hint=torch.Tensor,
                description="Denoised latent tensor [1, 1, C, H, W]",
            ),
            InputParam(
                "output_type",
                default="pil",
                description="The output format for the generated images (pil, latent, pt, or np)",
            ),
        ]

    @property
    def intermediate_outputs(self) -> List[OutputParam]:
        return [
            OutputParam(
                "images",
                type_hint=Union[PIL.Image.Image, torch.Tensor, np.ndarray],
                description="Decoded RGB image in requested output format",
            ),
        ]

    @torch.no_grad()
    def __call__(
        self, components: ModularPipeline, state: PipelineState
    ) -> PipelineState:
        block_state = self.get_block_state(state)
        latents = block_state.latents
        output_type = block_state.output_type or "pil"

        if output_type == "latent":
            # Caller wants raw latents; skip the VAE entirely.
            block_state.images = latents
        else:
            # VAE expects [B, C, H, W]: drop the frame dimension.
            # Decoder output is an [H, W, 3] uint8 tensor.
            decoded = components.vae.decode(latents.squeeze(1))

            if output_type == "pt":
                block_state.images = decoded
            elif output_type == "np":
                block_state.images = decoded.cpu().numpy()
            else:  # "pil"
                block_state.images = PIL.Image.fromarray(decoded.cpu().numpy())

        # Drop latents so the next frame starts from fresh random noise.
        block_state.latents = None
        self.set_block_state(state, block_state)
        return components, state
|
denoise.py
ADDED
|
@@ -0,0 +1,210 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Copyright (C) 2025 Hugging Face Team and Overworld
|
| 2 |
+
#
|
| 3 |
+
# This program is free software: you can redistribute it and/or modify
|
| 4 |
+
# it under the terms of the GNU General Public License as published by
|
| 5 |
+
# the Free Software Foundation, either version 3 of the License, or
|
| 6 |
+
# (at your option) any later version.
|
| 7 |
+
#
|
| 8 |
+
# This program is distributed in the hope that it will be useful,
|
| 9 |
+
# but WITHOUT ANY WARRANTY; without even the implied warranty of
|
| 10 |
+
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
| 11 |
+
# GNU General Public License for more details.
|
| 12 |
+
#
|
| 13 |
+
# You should have received a copy of the GNU General Public License
|
| 14 |
+
# along with this program. If not, see <https://www.gnu.org/licenses/>.
|
| 15 |
+
|
| 16 |
+
"""Denoising block for WorldEngine modular pipeline."""
|
| 17 |
+
|
| 18 |
+
from typing import List
|
| 19 |
+
|
| 20 |
+
import torch
|
| 21 |
+
|
| 22 |
+
from diffusers.utils import logging
|
| 23 |
+
from diffusers.modular_pipelines import (
|
| 24 |
+
ModularPipelineBlocks,
|
| 25 |
+
ModularPipeline,
|
| 26 |
+
PipelineState,
|
| 27 |
+
)
|
| 28 |
+
from diffusers.modular_pipelines.modular_pipeline_utils import (
|
| 29 |
+
ComponentSpec,
|
| 30 |
+
InputParam,
|
| 31 |
+
OutputParam,
|
| 32 |
+
)
|
| 33 |
+
from diffusers import AutoModel
|
| 34 |
+
|
| 35 |
+
logger = logging.get_logger(__name__)
|
| 36 |
+
|
| 37 |
+
|
| 38 |
+
class WorldEngineDenoiseLoop(ModularPipelineBlocks):
    """Denoises latents using rectified flow and updates KV cache."""

    model_name = "world_engine"

    @property
    def expected_components(self) -> List[ComponentSpec]:
        return [ComponentSpec("transformer", AutoModel)]

    @property
    def description(self) -> str:
        return (
            "Denoises latents using rectified flow (x = x + dsigma * v) "
            "and updates KV cache for autoregressive generation."
        )

    @property
    def inputs(self) -> List[InputParam]:
        return [
            InputParam(
                "scheduler_sigmas",
                required=True,
                type_hint=torch.Tensor,
                description="Scheduler sigmas for denoising",
            ),
            InputParam(
                "latents",
                required=True,
                type_hint=torch.Tensor,
                description="Initial noisy latents [1, 1, C, H, W]",
            ),
            InputParam(
                "kv_cache",
                required=True,
                description="KV cache for transformer attention",
            ),
            InputParam(
                "frame_timestamp",
                required=True,
                type_hint=torch.Tensor,
                description="Current frame timestamp",
            ),
            InputParam(
                "prompt_embeds",
                required=True,
                type_hint=torch.Tensor,
                description="Text embeddings for conditioning",
            ),
            InputParam(
                "prompt_pad_mask",
                type_hint=torch.Tensor,
                description="Padding mask for prompt embeddings",
            ),
            InputParam(
                "button_tensor",
                required=True,
                type_hint=torch.Tensor,
                description="One-hot encoded button tensor",
            ),
            InputParam(
                "mouse_tensor",
                required=True,
                type_hint=torch.Tensor,
                description="Mouse velocity tensor",
            ),
            InputParam(
                "scroll_tensor",
                required=True,
                type_hint=torch.Tensor,
                description="Scroll wheel sign tensor",
            ),
        ]

    @property
    def intermediate_outputs(self) -> List[OutputParam]:
        return [
            OutputParam(
                "latents",
                type_hint=torch.Tensor,
                description="Denoised latents",
            ),
        ]

    @staticmethod
    def _denoise_pass(
        transformer,
        x,
        sigmas,
        frame_timestamp,
        prompt_emb,
        prompt_pad_mask,
        mouse,
        button,
        scroll,
        kv_cache,
    ):
        """Denoising loop using rectified flow.

        Integrates x forward with Euler steps: x <- x + dsigma * v, where v is
        the velocity predicted by the transformer at the current sigma.
        """
        # Freeze the cache: denoising steps must not persist new entries.
        kv_cache.set_frozen(True)
        # Reusable per-frame sigma buffer, filled in place each step.
        sigma_buf = x.new_empty((x.size(0), x.size(1)))
        deltas = sigmas.diff()
        for idx in range(deltas.numel()):
            velocity = transformer(
                x=x,
                sigma=sigma_buf.fill_(sigmas[idx]),
                frame_timestamp=frame_timestamp,
                prompt_emb=prompt_emb,
                prompt_pad_mask=prompt_pad_mask,
                mouse=mouse,
                button=button,
                scroll=scroll,
                kv_cache=kv_cache,
            )
            x = x + deltas[idx] * velocity
        return x

    @staticmethod
    def _cache_pass(
        transformer,
        x,
        frame_timestamp,
        prompt_emb,
        prompt_pad_mask,
        mouse,
        button,
        scroll,
        kv_cache,
    ):
        """Cache pass to persist frame for next generation.

        Unfreezes the cache and runs the transformer once at sigma=0; only the
        cache-write side effect matters, the output is discarded.
        """
        kv_cache.set_frozen(False)
        transformer(
            x=x,
            sigma=x.new_zeros((x.size(0), x.size(1))),
            frame_timestamp=frame_timestamp,
            prompt_emb=prompt_emb,
            prompt_pad_mask=prompt_pad_mask,
            mouse=mouse,
            button=button,
            scroll=scroll,
            kv_cache=kv_cache,
        )

    @torch.inference_mode()
    def __call__(
        self, components: ModularPipeline, state: PipelineState
    ) -> PipelineState:
        block_state = self.get_block_state(state)

        # Shared conditioning passed to both passes in the same order.
        conditioning = (
            block_state.frame_timestamp,
            block_state.prompt_embeds,
            block_state.prompt_pad_mask,
            block_state.mouse_tensor,
            block_state.button_tensor,
            block_state.scroll_tensor,
            block_state.kv_cache,
        )

        # Denoise, then clone so the cache pass below cannot alias the output.
        denoised = self._denoise_pass(
            components.transformer,
            block_state.latents,
            block_state.scheduler_sigmas,
            *conditioning,
        )
        block_state.latents = denoised.clone()

        # Persist the finished frame into the KV cache and advance time.
        self._cache_pass(components.transformer, block_state.latents, *conditioning)
        block_state.frame_timestamp.add_(1)

        self.set_block_state(state, block_state)
        return components, state
|
encoders.py
ADDED
|
@@ -0,0 +1,318 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Copyright (C) 2025 Hugging Face Team and Overworld
|
| 2 |
+
#
|
| 3 |
+
# This program is free software: you can redistribute it and/or modify
|
| 4 |
+
# it under the terms of the GNU General Public License as published by
|
| 5 |
+
# the Free Software Foundation, either version 3 of the License, or
|
| 6 |
+
# (at your option) any later version.
|
| 7 |
+
#
|
| 8 |
+
# This program is distributed in the hope that it will be useful,
|
| 9 |
+
# but WITHOUT ANY WARRANTY; without even the implied warranty of
|
| 10 |
+
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
| 11 |
+
# GNU General Public License for more details.
|
| 12 |
+
#
|
| 13 |
+
# You should have received a copy of the GNU General Public License
|
| 14 |
+
# along with this program. If not, see <https://www.gnu.org/licenses/>.
|
| 15 |
+
|
| 16 |
+
"""Text and controller encoder blocks for WorldEngine modular pipeline."""
|
| 17 |
+
|
| 18 |
+
import html
|
| 19 |
+
from typing import List, Set, Tuple, Union
|
| 20 |
+
|
| 21 |
+
import regex as re
|
| 22 |
+
import torch
|
| 23 |
+
from transformers import AutoTokenizer, UMT5EncoderModel
|
| 24 |
+
|
| 25 |
+
from diffusers.utils import is_ftfy_available, logging
|
| 26 |
+
from diffusers.modular_pipelines import (
|
| 27 |
+
ModularPipelineBlocks,
|
| 28 |
+
ModularPipeline,
|
| 29 |
+
PipelineState,
|
| 30 |
+
)
|
| 31 |
+
from diffusers.modular_pipelines.modular_pipeline_utils import (
|
| 32 |
+
ComponentSpec,
|
| 33 |
+
ConfigSpec,
|
| 34 |
+
InputParam,
|
| 35 |
+
OutputParam,
|
| 36 |
+
)
|
| 37 |
+
|
| 38 |
+
if is_ftfy_available():
|
| 39 |
+
import ftfy
|
| 40 |
+
|
| 41 |
+
|
| 42 |
+
logger = logging.get_logger(__name__)
|
| 43 |
+
|
| 44 |
+
|
| 45 |
+
def basic_clean(text):
    """Repair mojibake with ftfy, unescape HTML entities twice, and strip ends."""
    repaired = ftfy.fix_text(text)
    # Double-unescape handles entities that were themselves HTML-escaped.
    return html.unescape(html.unescape(repaired)).strip()
|
| 49 |
+
|
| 50 |
+
|
| 51 |
+
def whitespace_clean(text):
    """Collapse every whitespace run into a single space and strip the ends."""
    return re.sub(r"\s+", " ", text).strip()
|
| 55 |
+
|
| 56 |
+
|
| 57 |
+
def prompt_clean(text):
    """Normalize a prompt: fix encoding/entities, then collapse whitespace."""
    return whitespace_clean(basic_clean(text))
|
| 60 |
+
|
| 61 |
+
|
| 62 |
+
class WorldEngineTextEncoderStep(ModularPipelineBlocks):
    """Encodes text prompts using UMT5-XL for conditioning."""

    model_name = "world_engine"

    @property
    def description(self) -> str:
        return (
            "Text Encoder step that generates text embeddings to guide frame generation"
        )

    @property
    def expected_components(self) -> List[ComponentSpec]:
        return [
            ComponentSpec("text_encoder", UMT5EncoderModel),
            ComponentSpec("tokenizer", AutoTokenizer),
        ]

    @property
    def inputs(self) -> List[InputParam]:
        return [
            InputParam(
                "prompt",
                description="The prompt or prompts to guide the frame generation",
            ),
            InputParam(
                "prompt_embeds",
                type_hint=torch.Tensor,
                description="Pre-computed text embeddings",
            ),
            InputParam(
                "prompt_pad_mask",
                type_hint=torch.Tensor,
                description="Padding mask for prompt embeddings",
            ),
        ]

    @property
    def intermediate_outputs(self) -> List[OutputParam]:
        return [
            OutputParam(
                "prompt_embeds",
                type_hint=torch.Tensor,
                kwargs_type="denoiser_input_fields",
                description="Text embeddings used to guide frame generation",
            ),
            OutputParam(
                "prompt_pad_mask",
                type_hint=torch.Tensor,
                kwargs_type="denoiser_input_fields",
                description="Padding mask for prompt embeddings",
            ),
        ]

    @staticmethod
    def check_inputs(block_state):
        """Reject prompts that are neither a string nor a list."""
        prompt = block_state.prompt
        if prompt is not None and not isinstance(prompt, (str, list)):
            raise ValueError(
                f"`prompt` has to be of type `str` or `list` but is {type(prompt)}"
            )

    @staticmethod
    def encode_prompt(
        components,
        prompt: Union[str, List[str]],
        device: torch.device,
        max_sequence_length: int = 512,
    ):
        """Tokenize and encode prompts; return (embeddings, padding mask)."""
        encoder_dtype = components.text_encoder.dtype

        prompts = [prompt] if isinstance(prompt, str) else prompt
        prompts = [prompt_clean(p) for p in prompts]

        tokenized = components.tokenizer(
            prompts,
            padding="max_length",
            max_length=max_sequence_length,
            truncation=True,
            return_attention_mask=True,
            return_tensors="pt",
        )

        input_ids = tokenized.input_ids.to(device)
        attention_mask = tokenized.attention_mask.to(device)

        embeds = components.text_encoder(input_ids, attention_mask).last_hidden_state
        embeds = embeds.to(dtype=encoder_dtype)

        # Zero out the embeddings at padded token positions.
        embeds = embeds * attention_mask.unsqueeze(-1).type_as(embeds)

        # Padding mask convention: True marks a padded position.
        pad_mask = attention_mask.eq(0)

        return embeds, pad_mask

    @torch.no_grad()
    def __call__(
        self, components: ModularPipeline, state: PipelineState
    ) -> PipelineState:
        block_state = self.get_block_state(state)
        self.check_inputs(block_state)

        device = components._execution_device
        if block_state.prompt_embeds is None:
            # Fall back to a generic prompt when none was given.
            block_state.prompt = block_state.prompt or "An explorable world"
            (
                block_state.prompt_embeds,
                block_state.prompt_pad_mask,
            ) = self.encode_prompt(components, block_state.prompt, device)
            block_state.prompt_embeds = block_state.prompt_embeds.contiguous()

        if block_state.prompt_pad_mask is None:
            # Pre-computed embeddings without a mask: treat all tokens as valid.
            block_state.prompt_pad_mask = torch.zeros(
                block_state.prompt_embeds.shape[:2],
                dtype=torch.bool,
                device=device,
            )

        self.set_block_state(state, block_state)
        return components, state
|
| 190 |
+
|
| 191 |
+
|
| 192 |
+
class WorldEngineControllerEncoderStep(ModularPipelineBlocks):
    """Encodes controller inputs (mouse + buttons + scroll) for conditioning.

    Converts the raw per-frame control state into three small tensors that the
    transformer consumes. Tensors are created once and updated in place so
    their shapes stay static (friendly to torch.compile).
    """

    model_name = "world_engine"

    @property
    def description(self) -> str:
        return "Controller Encoder step that encodes mouse, button, and scroll inputs for conditioning"

    @property
    def expected_components(self) -> List[ComponentSpec]:
        return []  # Controller embedding is part of transformer

    @property
    def expected_configs(self) -> List[ConfigSpec]:
        # Fix: the return annotation previously said List[ComponentSpec] even
        # though ConfigSpec objects are returned.
        return [ConfigSpec("n_buttons", 256)]

    @property
    def inputs(self) -> List[InputParam]:
        return [
            InputParam(
                "button",
                type_hint=Set[int],
                default=set(),
                description="Set of pressed button IDs",
            ),
            InputParam(
                "mouse",
                type_hint=Tuple[float, float],
                default=(0.0, 0.0),
                description="Mouse velocity (x, y)",
            ),
            InputParam(
                "scroll",
                type_hint=int,
                default=0,
                description="Scroll wheel direction (-1, 0, 1)",
            ),
            InputParam(
                "button_tensor",
                type_hint=torch.Tensor,
                kwargs_type="denoiser_input_fields",
                description="One-hot encoded button tensor",
            ),
            InputParam(
                "mouse_tensor",
                type_hint=torch.Tensor,
                kwargs_type="denoiser_input_fields",
                description="Mouse velocity tensor",
            ),
            InputParam(
                "scroll_tensor",
                type_hint=torch.Tensor,
                kwargs_type="denoiser_input_fields",
                description="Scroll wheel sign tensor",
            ),
        ]

    @property
    def intermediate_outputs(self) -> List[OutputParam]:
        return [
            OutputParam(
                "button_tensor",
                type_hint=torch.Tensor,
                kwargs_type="denoiser_input_fields",
                description="One-hot encoded button tensor",
            ),
            OutputParam(
                "mouse_tensor",
                type_hint=torch.Tensor,
                kwargs_type="denoiser_input_fields",
                description="Mouse velocity tensor",
            ),
            OutputParam(
                "scroll_tensor",
                type_hint=torch.Tensor,
                kwargs_type="denoiser_input_fields",
                description="Scroll wheel sign tensor",
            ),
        ]

    @torch.no_grad()
    def __call__(
        self, components: ModularPipeline, state: PipelineState
    ) -> PipelineState:
        block_state = self.get_block_state(state)
        device = components._execution_device
        dtype = components.transformer.dtype

        n_buttons = components.config.n_buttons

        # Create or reuse button tensor [1, 1, n_buttons]
        if block_state.button_tensor is None:
            block_state.button_tensor = torch.zeros(
                (1, 1, n_buttons), device=device, dtype=dtype
            )

        # Update button tensor in-place (avoid dynamic shapes for torch.compile);
        # silently ignore button IDs outside [0, n_buttons).
        block_state.button_tensor.zero_()
        if block_state.button:
            for btn_id in block_state.button:
                if 0 <= btn_id < n_buttons:
                    block_state.button_tensor[0, 0, btn_id] = 1.0

        # Create or reuse mouse tensor [1, 1, 2]
        if block_state.mouse_tensor is None:
            block_state.mouse_tensor = torch.zeros(
                (1, 1, 2), device=device, dtype=dtype
            )

        # Update mouse tensor in-place
        mouse = block_state.mouse if block_state.mouse is not None else (0.0, 0.0)
        block_state.mouse_tensor[0, 0, 0] = mouse[0]
        block_state.mouse_tensor[0, 0, 1] = mouse[1]

        # Create or reuse scroll tensor [1, 1, 1]
        if block_state.scroll_tensor is None:
            block_state.scroll_tensor = torch.zeros(
                (1, 1, 1), device=device, dtype=dtype
            )

        # Update scroll tensor in-place (sign of scroll value: -1, 0, or 1)
        scroll = block_state.scroll if block_state.scroll is not None else 0
        block_state.scroll_tensor[0, 0, 0] = float(scroll > 0) - float(scroll < 0)

        self.set_block_state(state, block_state)
        return components, state
|
modular_blocks.py
ADDED
|
@@ -0,0 +1,45 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Copyright (C) 2025 Hugging Face Team and Overworld
|
| 2 |
+
#
|
| 3 |
+
# This program is free software: you can redistribute it and/or modify
|
| 4 |
+
# it under the terms of the GNU General Public License as published by
|
| 5 |
+
# the Free Software Foundation, either version 3 of the License, or
|
| 6 |
+
# (at your option) any later version.
|
| 7 |
+
#
|
| 8 |
+
# This program is distributed in the hope that it will be useful,
|
| 9 |
+
# but WITHOUT ANY WARRANTY; without even the implied warranty of
|
| 10 |
+
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
| 11 |
+
# GNU General Public License for more details.
|
| 12 |
+
#
|
| 13 |
+
# You should have received a copy of the GNU General Public License
|
| 14 |
+
# along with this program. If not, see <https://www.gnu.org/licenses/>.
|
| 15 |
+
|
| 16 |
+
"""Block registry for WorldEngine modular pipeline."""
|
| 17 |
+
|
| 18 |
+
from diffusers.utils import logging
|
| 19 |
+
from diffusers.modular_pipelines import SequentialPipelineBlocks
|
| 20 |
+
from diffusers.modular_pipelines.modular_pipeline_utils import InsertableDict
|
| 21 |
+
|
| 22 |
+
from .encoders import WorldEngineTextEncoderStep, WorldEngineControllerEncoderStep
|
| 23 |
+
from .before_denoise import WorldEngineBeforeDenoiseStep
|
| 24 |
+
from .denoise import WorldEngineDenoiseLoop
|
| 25 |
+
from .decoders import WorldEngineDecodeStep
|
| 26 |
+
|
| 27 |
+
logger = logging.get_logger(__name__)
|
| 28 |
+
|
| 29 |
+
|
| 30 |
+
# Ordered registry of pipeline steps; execution follows insertion order:
# text encode -> controller encode -> pre-denoise prep -> denoise loop -> decode.
AUTO_BLOCKS = InsertableDict(
    [
        ("text_encoder", WorldEngineTextEncoderStep),
        ("controller_encoder", WorldEngineControllerEncoderStep),
        ("before_denoise", WorldEngineBeforeDenoiseStep),
        ("denoise", WorldEngineDenoiseLoop),
        ("decode", WorldEngineDecodeStep),
    ]
)


class WorldEngineBlocks(SequentialPipelineBlocks):
    """Sequential pipeline blocks for WorldEngine frame generation."""

    # .copy() keeps the shared AUTO_BLOCKS registry insulated from any
    # mutation diffusers might perform on the class-level lists.
    block_classes = list(AUTO_BLOCKS.copy().values())
    block_names = list(AUTO_BLOCKS.copy().keys())
|
modular_config.json
ADDED
|
@@ -0,0 +1,7 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"_class_name": "WorldEngineBlocks",
|
| 3 |
+
"_diffusers_version": "0.36.0.dev0",
|
| 4 |
+
"auto_map": {
|
| 5 |
+
"ModularPipelineBlocks": "modular_blocks.WorldEngineBlocks"
|
| 6 |
+
}
|
| 7 |
+
}
|
modular_model_index.json
ADDED
|
@@ -0,0 +1,76 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"_blocks_class_name": "WorldEngineBlocks",
|
| 3 |
+
"_class_name": "ModularPipeline",
|
| 4 |
+
"_diffusers_version": "0.36.0.dev0",
|
| 5 |
+
"channels": 16,
|
| 6 |
+
"height": 360,
|
| 7 |
+
"width": 640,
|
| 8 |
+
"patch": [
|
| 9 |
+
2,
|
| 10 |
+
2
|
| 11 |
+
],
|
| 12 |
+
"vae_scale_factor": 16,
|
| 13 |
+
"n_buttons": 256,
|
| 14 |
+
"tokens_per_frame": 256,
|
| 15 |
+
"scheduler_sigmas": [
|
| 16 |
+
1.0,
|
| 17 |
+
0.8609585762023926,
|
| 18 |
+
0.729332447052002,
|
| 19 |
+
0.3205108940601349,
|
| 20 |
+
0.0
|
| 21 |
+
],
|
| 22 |
+
"transformer": [
|
| 23 |
+
null,
|
| 24 |
+
null,
|
| 25 |
+
{
|
| 26 |
+
"pretrained_model_name_or_path": "Overworld/Waypoint-1-Small",
|
| 27 |
+
"subfolder": "transformer",
|
| 28 |
+
"type_hint": [
|
| 29 |
+
"diffusers",
|
| 30 |
+
"AutoModel"
|
| 31 |
+
],
|
| 32 |
+
"revision": null,
|
| 33 |
+
"variant": null
|
| 34 |
+
}
|
| 35 |
+
],
|
| 36 |
+
"vae": [
|
| 37 |
+
null,
|
| 38 |
+
null,
|
| 39 |
+
{
|
| 40 |
+
"pretrained_model_name_or_path": "Overworld/Waypoint-1-Small",
|
| 41 |
+
"subfolder": "vae",
|
| 42 |
+
"type_hint": [
|
| 43 |
+
"diffusers",
|
| 44 |
+
"AutoModel"
|
| 45 |
+
],
|
| 46 |
+
"revision": null,
|
| 47 |
+
"variant": null
|
| 48 |
+
}
|
| 49 |
+
],
|
| 50 |
+
"text_encoder": [
|
| 51 |
+
null,
|
| 52 |
+
null,
|
| 53 |
+
{
|
| 54 |
+
"pretrained_model_name_or_path": "google/umt5-xl",
|
| 55 |
+
"type_hint": [
|
| 56 |
+
"transformers",
|
| 57 |
+
"UMT5EncoderModel"
|
| 58 |
+
],
|
| 59 |
+
"revision": null,
|
| 60 |
+
"variant": null
|
| 61 |
+
}
|
| 62 |
+
],
|
| 63 |
+
"tokenizer": [
|
| 64 |
+
null,
|
| 65 |
+
null,
|
| 66 |
+
{
|
| 67 |
+
"pretrained_model_name_or_path": "google/umt5-xl",
|
| 68 |
+
"type_hint": [
|
| 69 |
+
"transformers",
|
| 70 |
+
"AutoTokenizer"
|
| 71 |
+
],
|
| 72 |
+
"revision": null,
|
| 73 |
+
"variant": null
|
| 74 |
+
}
|
| 75 |
+
]
|
| 76 |
+
}
|
transformer/__init__.py
ADDED
|
@@ -0,0 +1,31 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Copyright (C) 2025 Hugging Face Team and Overworld
|
| 2 |
+
#
|
| 3 |
+
# This program is free software: you can redistribute it and/or modify
|
| 4 |
+
# it under the terms of the GNU General Public License as published by
|
| 5 |
+
# the Free Software Foundation, either version 3 of the License, or
|
| 6 |
+
# (at your option) any later version.
|
| 7 |
+
#
|
| 8 |
+
# This program is distributed in the hope that it will be useful,
|
| 9 |
+
# but WITHOUT ANY WARRANTY; without even the implied warranty of
|
| 10 |
+
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
| 11 |
+
# GNU General Public License for more details.
|
| 12 |
+
#
|
| 13 |
+
# You should have received a copy of the GNU General Public License
|
| 14 |
+
# along with this program. If not, see <https://www.gnu.org/licenses/>.
|
| 15 |
+
|
| 16 |
+
from .model import WorldModel
|
| 17 |
+
from .attn import Attn, CrossAttention, OrthoRoPE
|
| 18 |
+
from .nn import MLP, AdaLN, NoiseConditioner, rms_norm, ada_rmsnorm, ada_gate
|
| 19 |
+
|
| 20 |
+
__all__ = [
|
| 21 |
+
"WorldModel",
|
| 22 |
+
"Attn",
|
| 23 |
+
"CrossAttention",
|
| 24 |
+
"OrthoRoPE",
|
| 25 |
+
"MLP",
|
| 26 |
+
"AdaLN",
|
| 27 |
+
"NoiseConditioner",
|
| 28 |
+
"rms_norm",
|
| 29 |
+
"ada_rmsnorm",
|
| 30 |
+
"ada_gate",
|
| 31 |
+
]
|
transformer/attn.py
ADDED
|
@@ -0,0 +1,297 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Copyright (C) 2025 Hugging Face Team and Overworld
|
| 2 |
+
#
|
| 3 |
+
# This program is free software: you can redistribute it and/or modify
|
| 4 |
+
# it under the terms of the GNU General Public License as published by
|
| 5 |
+
# the Free Software Foundation, either version 3 of the License, or
|
| 6 |
+
# (at your option) any later version.
|
| 7 |
+
#
|
| 8 |
+
# This program is distributed in the hope that it will be useful,
|
| 9 |
+
# but WITHOUT ANY WARRANTY; without even the implied warranty of
|
| 10 |
+
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
| 11 |
+
# GNU General Public License for more details.
|
| 12 |
+
#
|
| 13 |
+
# You should have received a copy of the GNU General Public License
|
| 14 |
+
# along with this program. If not, see <https://www.gnu.org/licenses/>.
|
| 15 |
+
|
| 16 |
+
"""Attention mechanisms for WorldModel transformer."""
|
| 17 |
+
|
| 18 |
+
import math
|
| 19 |
+
|
| 20 |
+
import einops as eo
|
| 21 |
+
import torch
|
| 22 |
+
from torch import nn
|
| 23 |
+
from torch.nn.attention.flex_attention import flex_attention
|
| 24 |
+
|
| 25 |
+
from .nn import rms_norm, NoCastModule
|
| 26 |
+
|
| 27 |
+
|
| 28 |
+
def pixel_frequencies(dim: int, max_freq: float) -> torch.Tensor:
    """Linear frequency spectrum for spatial RoPE (pixel positions).

    Matches rotary_embedding_torch RotaryEmbedding(freqs_for='pixel').

    Args:
        dim: Output dimension (freqs will be repeated to fill this).
        max_freq: Maximum frequency (should be below Nyquist).

    Returns:
        Tensor of shape [dim // 2] with linear frequencies.
    """
    # The reference library treats max_freq / 2 as the top of the spectrum.
    upper_bound = max_freq / 2
    n_freqs = dim // 2
    spectrum = torch.linspace(1.0, upper_bound, n_freqs)
    return spectrum * math.pi
|
| 42 |
+
|
| 43 |
+
|
| 44 |
+
def lang_frequencies(dim: int) -> torch.Tensor:
    """Geometric frequency spectrum for temporal RoPE (language-style).

    Matches rotary_embedding_torch RotaryEmbedding(freqs_for='lang').

    Args:
        dim: Output dimension (freqs will be repeated to fill this).

    Returns:
        Tensor of shape [dim // 2] with geometric frequencies.
    """
    # Reference library follows a 10^(-i/2) decay pattern.
    half_dim = dim // 2
    exponents = torch.arange(half_dim).float() / 2
    return 10.0 ** (-exponents)
|
| 57 |
+
|
| 58 |
+
|
| 59 |
+
class OrthoRoPE(NoCastModule):
    """Rotary Position Embeddings for orthogonal axes: time, height, and width.

    - Time: Geometric spectrum (like language models) -- rotates 1/2 of head dim
    - Height/Width: Linear spectrum (for pixels) -- rotates 1/4 of head dim each

    All cos/sin angles for every (t, y, x) position are precomputed once at
    construction and kept as non-persistent fp32 buffers; lookups at runtime
    are pure index_select.
    """

    def __init__(self, config):
        super().__init__()
        self.config = config
        # Audio tokens would need their own frequency axis; unsupported here.
        assert not getattr(self.config, "has_audio", False)

        # Compute frequencies and store cos/sin buffers
        freqs = self._compute_freqs()
        self.cos = nn.Buffer(freqs.cos().contiguous(), persistent=False)
        self.sin = nn.Buffer(freqs.sin().contiguous(), persistent=False)

    def _compute_freqs(self):
        """Compute frequency table for all positions.

        Matches the behavior of rotary_embedding_torch.RotaryEmbedding.
        The library interleaves frequencies so each freq value is used twice.

        Returns a [T*H*W, head_dim/2] tensor laid out as [X | Y | T] along
        the last dim (D/8 + D/8 + D/4 = D/2).
        """
        config = self.config
        H, W, T = config.height, config.width, config.n_frames
        head_dim = config.d_model // config.n_heads

        # Spatial frequencies (linear spectrum, below Nyquist)
        # Library: RotaryEmbedding(dim=head_dim//8) creates head_dim//16 freqs,
        # outputs head_dim//8 values (each freq repeated twice)
        max_freq = min(H, W) * 0.8
        spatial_freqs = pixel_frequencies(head_dim // 8, max_freq)  # [D/16]

        # Positions in [-1, 1] range (cell centers, hence the 1/W, 1/H insets)
        pos_x = torch.linspace(-1 + 1 / W, 1 - 1 / W, W)  # [W]
        pos_y = torch.linspace(-1 + 1 / H, 1 - 1 / H, H)  # [H]

        # Spatial frequency embeddings with interleaving (like library)
        freqs_x = torch.outer(pos_x, spatial_freqs)  # [W, D/16]
        freqs_y = torch.outer(pos_y, spatial_freqs)  # [H, D/16]
        freqs_x = freqs_x.repeat_interleave(2, dim=-1)  # [W, D/8]
        freqs_y = freqs_y.repeat_interleave(2, dim=-1)  # [H, D/8]

        # Expand to grid and repeat for all frames
        freqs_x = freqs_x[None, :, :].expand(H, W, -1)  # [H, W, D/8]
        freqs_y = freqs_y[:, None, :].expand(H, W, -1)  # [H, W, D/8]

        freqs_x = eo.repeat(freqs_x, "h w d -> (t h w) d", t=T)  # [T*H*W, D/8]
        freqs_y = eo.repeat(freqs_y, "h w d -> (t h w) d", t=T)  # [T*H*W, D/8]

        # Temporal frequencies (geometric spectrum)
        # Library: RotaryEmbedding(dim=head_dim//4) creates head_dim//8 freqs,
        # outputs head_dim//4 values (each freq repeated twice)
        temporal_freqs = lang_frequencies(head_dim // 4)  # [D/8]
        pos_t = torch.arange(T).float()  # [T]
        freqs_t = torch.outer(pos_t, temporal_freqs)  # [T, D/8]
        freqs_t = freqs_t.repeat_interleave(2, dim=-1)  # [T, D/4]
        freqs_t = eo.repeat(freqs_t, "t d -> (t h w) d", h=H, w=W)  # [T*H*W, D/4]

        # Concatenate: [X, Y, T] -> [T*H*W, D/2]
        return torch.cat([freqs_x, freqs_y, freqs_t], dim=-1)

    def get_angles(self, pos_ids):
        """Look up cos/sin angles for given position IDs.

        pos_ids maps "t_pos"/"y_pos"/"x_pos" -> [B, T] index tensors; the
        three coordinates are flattened into a single row index of the
        precomputed table.
        """
        t, y, x = pos_ids["t_pos"], pos_ids["y_pos"], pos_ids["x_pos"]  # [B,T]
        H, W = self.config.height, self.config.width
        # Bounds check is skipped under torch.compile to keep the graph static.
        if not torch.compiler.is_compiling():
            torch._assert(
                (y.max() < H) & (x.max() < W),
                f"pos_ids out of bounds, {y.max()}, {x.max()}",
            )
        flat = t * (H * W) + y * W + x  # [B,T]
        idx = flat.reshape(-1).to(torch.long)
        cos = self.cos.index_select(0, idx).view(*flat.shape, -1)
        sin = self.sin.index_select(0, idx).view(*flat.shape, -1)
        return cos[:, None], sin[:, None]  # add head dim for broadcast

    @torch.autocast("cuda", enabled=False)
    def forward(self, x, pos_ids):
        # Rotation is done in fp32 for precision (autocast disabled above),
        # then cast back to the input dtype.
        assert self.cos.dtype == self.sin.dtype == torch.float32
        cos, sin = self.get_angles(pos_ids)
        # Split even/odd interleaved pairs; rotate each pair by its angle.
        x0, x1 = x.float().unfold(-1, 2, 2).unbind(-1)
        y0 = x0 * cos - x1 * sin
        y1 = x1 * cos + x0 * sin
        # NOTE(review): output halves are concatenated rather than
        # re-interleaved; q and k go through the same layout so dot products
        # are unaffected — confirm downstream code never assumes interleaving.
        return torch.cat((y0, y1), dim=-1).type_as(x)
|
| 144 |
+
|
| 145 |
+
|
| 146 |
+
class Attn(nn.Module):
    """Self-attention with RoPE and optional GQA, value residual, and gated attention."""

    def __init__(self, config, layer_idx):
        super().__init__()
        self.config = config
        # layer_idx identifies this layer's slot in the external KV cache.
        self.layer_idx = layer_idx

        # Value residual (optional): blend this layer's V with the first
        # layer's V using a learned mixing coefficient.
        self.value_residual = getattr(config, "value_residual", False)
        if self.value_residual:
            self.v_lamb = nn.Parameter(torch.tensor(0.5))

        self.n_heads = config.n_heads
        # Fewer KV heads than Q heads => grouped-query attention.
        self.n_kv_heads = getattr(config, "n_kv_heads", config.n_heads)
        self.d_head = config.d_model // self.n_heads
        assert config.d_model % self.n_heads == 0

        self.enable_gqa = self.n_heads != self.n_kv_heads

        self.q_proj = nn.Linear(config.d_model, self.n_heads * self.d_head, bias=False)
        self.k_proj = nn.Linear(
            config.d_model, self.n_kv_heads * self.d_head, bias=False
        )
        self.v_proj = nn.Linear(
            config.d_model, self.n_kv_heads * self.d_head, bias=False
        )
        self.out_proj = nn.Linear(config.d_model, config.d_model, bias=False)

        self.rope = OrthoRoPE(config)

        # Optional per-head output gate, initialized to zero weights so
        # sigmoid(0) = 0.5 gives a uniform gate at the start of training.
        self.gated_attn = getattr(config, "gated_attn", False)
        if self.gated_attn:
            self.gate_proj = nn.Linear(
                self.n_heads, self.n_heads, bias=False
            )  # sparse attn gate
            nn.init.zeros_(self.gate_proj.weight)

    def forward(self, x, pos_ids, v1, kv_cache):
        """Run self-attention over x.

        Args:
            x: [B, T, d_model] token activations.
            pos_ids: dict of t/y/x position indices consumed by RoPE.
            v1: first layer's V tensor (or None) for the value residual.
            kv_cache: external cache; its upsert() returns the full K/V plus
                a flex_attention block mask.

        Returns:
            (output [B, T, d_model], v1) so callers can thread v1 onward.
        """
        # Q, K, V proj -> QK-norm -> RoPE
        q = eo.rearrange(
            self.q_proj(x), "b t (h d) -> b h t d", h=self.n_heads, d=self.d_head
        )
        k = eo.rearrange(
            self.k_proj(x), "b t (h d) -> b h t d", h=self.n_kv_heads, d=self.d_head
        )
        v = eo.rearrange(
            self.v_proj(x), "b t (h d) -> b h t d", h=self.n_kv_heads, d=self.d_head
        )

        if self.value_residual:
            # First call captures v as v1; later layers lerp toward it.
            v1 = v if v1 is None else v1
            v = torch.lerp(v, v1.view_as(v), self.v_lamb)

        q, k = rms_norm(q), rms_norm(k)
        q, k = self.rope(q, pos_ids), self.rope(k, pos_ids)

        # Cache insert returns the accumulated K/V and the attention mask.
        k, v, bm = kv_cache.upsert(k, v, pos_ids, self.layer_idx)
        y = flex_attention(q, k, v, block_mask=bm, enable_gqa=self.enable_gqa)

        if self.gated_attn:
            # Gate is computed from the first n_heads channels of the input.
            gates = torch.sigmoid(self.gate_proj(x[..., : self.n_heads]))
            y = y * gates.permute(0, 2, 1).unsqueeze(-1)
        y = eo.rearrange(y, "b h t d -> b t (h d)")
        y = self.out_proj(y)
        return y, v1
|
| 211 |
+
|
| 212 |
+
|
| 213 |
+
class MergedQKVAttn(Attn):
    """Inference-time variant of Attn with q/k/v fused into one projection.

    Built FROM an existing trained Attn instance: copies its weights, then
    concatenates the three projection matrices into a single qkv_proj so the
    forward pass issues one matmul instead of three.
    """

    def __init__(self, src: Attn, config):
        super().__init__(config, src.layer_idx)  # makes fresh q/k/v/out/etc
        self.to(device=src.q_proj.weight.device, dtype=src.q_proj.weight.dtype)
        # strict=False: src has no qkv_proj entry yet, and extras are fine.
        self.load_state_dict(
            src.state_dict(), strict=False
        )  # copies trained weights/buffers
        self.train(src.training)  # preserve train/eval mode

        # Split sizes for slicing the fused projection output.
        self.q_out = self.n_heads * self.d_head
        self.kv_out = self.n_kv_heads * self.d_head

        self.qkv_proj = nn.Linear(
            self.q_proj.in_features,
            self.q_out + 2 * self.kv_out,
            bias=False,
            device=self.q_proj.weight.device,
            dtype=self.q_proj.weight.dtype,
        )
        with torch.no_grad():
            # Stack rows in q/k/v order — must match the split() in forward.
            self.qkv_proj.weight.copy_(
                torch.cat(
                    [self.q_proj.weight, self.k_proj.weight, self.v_proj.weight], dim=0
                )
            )

        # Drop the separate projections; only the fused weight remains.
        del self.q_proj, self.k_proj, self.v_proj

    def forward(self, x, pos_ids, v1, kv_cache):
        """Same contract as Attn.forward, using the fused qkv projection."""
        q, k, v = self.qkv_proj(x).split((self.q_out, self.kv_out, self.kv_out), dim=-1)

        B, T = x.shape[:2]
        q = q.reshape(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = k.reshape(B, T, self.n_kv_heads, self.d_head).transpose(1, 2)
        v = v.reshape(B, T, self.n_kv_heads, self.d_head).transpose(1, 2)

        if self.value_residual:
            v1 = v if v1 is None else v1
            v = torch.lerp(v, v1.view_as(v), self.v_lamb)

        q, k = rms_norm(q), rms_norm(k)
        q, k = self.rope(q, pos_ids), self.rope(k, pos_ids)

        k, v, bm = kv_cache.upsert(k, v, pos_ids, self.layer_idx)
        y = flex_attention(q, k, v, block_mask=bm, enable_gqa=self.enable_gqa)

        if self.gated_attn:
            gates = torch.sigmoid(self.gate_proj(x[..., : self.n_heads]))
            y = y * gates.permute(0, 2, 1).unsqueeze(-1)

        y = y.transpose(1, 2).reshape(B, T, -1)
        y = self.out_proj(y)
        return y, v1
|
| 266 |
+
|
| 267 |
+
|
| 268 |
+
class CrossAttention(nn.Module):
    """Cross-attention for prompt conditioning.

    Queries come from the model stream (d_model); keys/values come from the
    conditioning context (context_dim, e.g. text-encoder embeddings). The
    output projection is zero-initialized so conditioning starts as a no-op.
    """

    def __init__(self, config, context_dim=None):
        super().__init__()
        assert config.d_model % config.n_heads == 0

        self.d_head = config.d_model // config.n_heads
        # inner_dim follows the context width when given, so head count is
        # derived from it rather than fixed to config.n_heads.
        self.inner_dim = context_dim or config.d_model
        assert self.inner_dim % self.d_head == 0
        self.n_heads = self.inner_dim // self.d_head
        self.q_proj = nn.Linear(config.d_model, self.inner_dim, bias=False)
        self.k_proj = nn.Linear(
            context_dim or config.d_model, self.inner_dim, bias=False
        )
        self.v_proj = nn.Linear(
            context_dim or config.d_model, self.inner_dim, bias=False
        )

        self.out_proj = nn.Linear(self.inner_dim, config.d_model, bias=False)
        # Zero init => cross-attention contributes nothing until trained.
        self.out_proj.weight.detach().zero_()

    def forward(self, x, context, context_pad_mask=None):
        # NOTE(review): context_pad_mask is accepted but currently unused —
        # padded context tokens are attended to; confirm whether a mask
        # should be passed to flex_attention.
        q = eo.rearrange(self.q_proj(x), "b t (h d) -> b h t d", h=self.n_heads)
        k = eo.rearrange(self.k_proj(context), "b t (h d) -> b h t d", h=self.n_heads)
        v = eo.rearrange(self.v_proj(context), "b t (h d) -> b h t d", h=self.n_heads)
        q, k = rms_norm(q), rms_norm(k)
        out = flex_attention(q, k, v)
        out = out.transpose(1, 2).contiguous().reshape(x.size(0), x.size(1), -1)
        return self.out_proj(out)
|
transformer/cache.py
ADDED
|
@@ -0,0 +1,112 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Copyright (C) 2025 Hugging Face Team and Overworld
|
| 2 |
+
#
|
| 3 |
+
# This program is free software: you can redistribute it and/or modify
|
| 4 |
+
# it under the terms of the GNU General Public License as published by
|
| 5 |
+
# the Free Software Foundation, either version 3 of the License, or
|
| 6 |
+
# (at your option) any later version.
|
| 7 |
+
#
|
| 8 |
+
# This program is distributed in the hope that it will be useful,
|
| 9 |
+
# but WITHOUT ANY WARRANTY; without even the implied warranty of
|
| 10 |
+
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
| 11 |
+
# GNU General Public License for more details.
|
| 12 |
+
#
|
| 13 |
+
# You should have received a copy of the GNU General Public License
|
| 14 |
+
# along with this program. If not, see <https://www.gnu.org/licenses/>.
|
| 15 |
+
|
| 16 |
+
import torch
|
| 17 |
+
from torch import nn, Tensor
|
| 18 |
+
|
| 19 |
+
|
| 20 |
+
def _bf16_u16(x: Tensor) -> Tensor:
|
| 21 |
+
# reinterpret bf16 storage as int16 -> unsigned 0..65535 in int32
|
| 22 |
+
return x.contiguous().view(torch.int16).to(torch.int32) & 0xFFFF
|
| 23 |
+
|
| 24 |
+
|
| 25 |
+
class CachedDenoiseStepEmb(nn.Module):
    """bf16 sigma -> bf16 embedding via 64k LUT; invalid sigma => OOB index error (no silent wrong).

    Precomputes the noise-step embedding for every scheduler sigma once,
    then maps each incoming bf16 sigma's 16-bit pattern through a full
    65536-entry lookup table to its row in the embedding table. A sigma not
    in the cached set maps to index S, which is out of bounds for the
    [S, D] table — so misuse raises instead of returning a wrong embedding.
    """

    def __init__(self, base: nn.Module, sigmas: list[float]):
        super().__init__()
        device = next(base.parameters()).device

        levels = torch.tensor(sigmas, device=device, dtype=torch.bfloat16)  # [S]
        bits = _bf16_u16(levels)  # [S]
        # Two distinct sigmas rounding to the same bf16 pattern would make
        # the bit-pattern key ambiguous; refuse to cache in that case.
        if torch.unique(bits).numel() != bits.numel():
            raise ValueError(
                "scheduler_sigmas collide in bf16; caching would be ambiguous"
            )

        with torch.no_grad():
            table = (
                base(levels[:, None]).squeeze(1).to(torch.bfloat16).contiguous()
            )  # [S,D]

        # lut[bit_pattern] = row index into table, or -1 for unknown sigmas.
        lut = torch.full((65536,), -1, device=device, dtype=torch.int32)
        lut[bits] = torch.arange(bits.numel(), device=device, dtype=torch.int32)

        self.register_buffer("table", table, persistent=False)  # [S,D] bf16
        self.register_buffer("lut", lut, persistent=False)  # [65536] int32
        # Sentinel index S: one past the end of table, triggers an indexing
        # error on lookup of an uncached sigma.
        self.register_buffer(
            "oob",
            torch.tensor(bits.numel(), device=device, dtype=torch.int32),
            persistent=False,
        )

    def forward(self, sigma: Tensor) -> Tensor:
        """Return the cached embedding for each bf16 sigma in ``sigma``."""
        if sigma.dtype is not torch.bfloat16:
            raise RuntimeError("CachedDenoiseStepEmb expects sigma bf16")
        idx = self.lut[_bf16_u16(sigma)]
        idx = torch.where(idx >= 0, idx, self.oob)  # invalid -> S (OOB)
        return self.table[idx.to(torch.int64)]  # [...,D] bf16
|
| 61 |
+
|
| 62 |
+
|
| 63 |
+
class CachedCondHead(nn.Module):
    """bf16 cond -> cached (s0,b0,g0,s1,b1,g1); invalid cond => OOB index error (no silent wrong).

    Precomputes the conditioning head's six output tensors for every cached
    noise-step embedding. At runtime the sigma row is recovered from the
    bit pattern of a single embedding dimension that uniquely identifies
    each sigma, so the head becomes a pure table lookup.
    """

    def __init__(
        self, base, cached_denoise_step_emb: CachedDenoiseStepEmb, max_key_dims: int = 8
    ):
        super().__init__()
        table = cached_denoise_step_emb.table  # [S,D] bf16
        S, D = table.shape

        with torch.no_grad():
            emb = table[:, None, :]  # [S,1,D]
            # base(emb) yields six modulation tensors (scale/bias/gate x2).
            cache = (
                torch.stack([t.squeeze(1) for t in base(emb)], 0)
                .to(torch.bfloat16)
                .contiguous()
            )  # [6,S,D]

        # pick a single embedding dimension whose bf16 bits uniquely identify sigma
        key_dim = None
        for d in range(min(D, max_key_dims)):
            b = _bf16_u16(table[:, d])
            if torch.unique(b).numel() == S:
                key_dim = d
                key_bits = b
                break
        if key_dim is None:
            raise ValueError(
                "Could not find a unique bf16 key dim for cond->sigma mapping; increase max_key_dims"
            )

        # lut[bit_pattern of emb[key_dim]] = sigma row, -1 for unknown.
        lut = torch.full((65536,), -1, device=table.device, dtype=torch.int32)
        lut[key_bits] = torch.arange(S, device=table.device, dtype=torch.int32)

        self.key_dim = int(key_dim)
        self.register_buffer("cache", cache, persistent=False)  # [6,S,D] bf16
        self.register_buffer("lut", lut, persistent=False)  # [65536] int32
        # Sentinel index S: out of bounds for cache, errors on uncached cond.
        self.register_buffer(
            "oob",
            torch.tensor(S, device=table.device, dtype=torch.int32),
            persistent=False,
        )

    def forward(self, cond: Tensor):
        """Return (s0, b0, g0, s1, b1, g1) cached for the given cond embedding."""
        if cond.dtype is not torch.bfloat16:
            raise RuntimeError("CachedCondHead expects cond bf16")
        idx = self.lut[_bf16_u16(cond[..., self.key_dim])]
        idx = torch.where(idx >= 0, idx, self.oob)  # invalid -> S (OOB)
        g = self.cache[:, idx.to(torch.int64)]  # [6,...,D] bf16 (or errors)
        return tuple(g.unbind(0))  # (s0,b0,g0,s1,b1,g1)
|
transformer/config.json
ADDED
|
@@ -0,0 +1,49 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"_class_name": "WorldModel",
|
| 3 |
+
"_diffusers_version": "0.36.0.dev0",
|
| 4 |
+
"auto_map": {
|
| 5 |
+
"AutoModel": "model.WorldModel"
|
| 6 |
+
},
|
| 7 |
+
"d_model": 2560,
|
| 8 |
+
"n_heads": 40,
|
| 9 |
+
"n_kv_heads": 20,
|
| 10 |
+
"n_layers": 22,
|
| 11 |
+
"mlp_ratio": 5,
|
| 12 |
+
"channels": 16,
|
| 13 |
+
"height": 16,
|
| 14 |
+
"width": 16,
|
| 15 |
+
"patch": [
|
| 16 |
+
2,
|
| 17 |
+
2
|
| 18 |
+
],
|
| 19 |
+
"tokens_per_frame": 256,
|
| 20 |
+
"n_frames": 4096,
|
| 21 |
+
"local_window": 16,
|
| 22 |
+
"global_window": 128,
|
| 23 |
+
"global_attn_period": 4,
|
| 24 |
+
"global_pinned_dilation": 8,
|
| 25 |
+
"global_attn_offset": 0,
|
| 26 |
+
"value_residual": false,
|
| 27 |
+
"gated_attn": true,
|
| 28 |
+
"n_buttons": 256,
|
| 29 |
+
"ctrl_conditioning": "mlp_fusion",
|
| 30 |
+
"ctrl_conditioning_period": 3,
|
| 31 |
+
"ctrl_cond_dropout": 0.0,
|
| 32 |
+
"prompt_conditioning": "cross_attention",
|
| 33 |
+
"prompt_conditioning_period": 3,
|
| 34 |
+
"prompt_embedding_dim": 2048,
|
| 35 |
+
"prompt_cond_dropout": 0.0,
|
| 36 |
+
"noise_conditioning": "wan",
|
| 37 |
+
"base_fps": 60,
|
| 38 |
+
"causal": true,
|
| 39 |
+
"mlp_gradient_checkpointing": true,
|
| 40 |
+
"block_gradient_checkpointing": true,
|
| 41 |
+
"rope_impl": "ortho",
|
| 42 |
+
"scheduler_sigmas": [
|
| 43 |
+
1.0,
|
| 44 |
+
0.8609585762023926,
|
| 45 |
+
0.729332447052002,
|
| 46 |
+
0.3205108940601349,
|
| 47 |
+
0.0
|
| 48 |
+
]
|
| 49 |
+
}
|
transformer/diffusion_pytorch_model.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:14356db9229453850f9ad650f31c3e1c4744066abd43562f6fbee161fb36c9e6
|
| 3 |
+
size 12515075376
|
transformer/model.py
ADDED
|
@@ -0,0 +1,452 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Copyright (C) 2025 Hugging Face Team and Overworld
|
| 2 |
+
#
|
| 3 |
+
# This program is free software: you can redistribute it and/or modify
|
| 4 |
+
# it under the terms of the GNU General Public License as published by
|
| 5 |
+
# the Free Software Foundation, either version 3 of the License, or
|
| 6 |
+
# (at your option) any later version.
|
| 7 |
+
#
|
| 8 |
+
# This program is distributed in the hope that it will be useful,
|
| 9 |
+
# but WITHOUT ANY WARRANTY; without even the implied warranty of
|
| 10 |
+
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
| 11 |
+
# GNU General Public License for more details.
|
| 12 |
+
#
|
| 13 |
+
# You should have received a copy of the GNU General Public License
|
| 14 |
+
# along with this program. If not, see <https://www.gnu.org/licenses/>.
|
| 15 |
+
|
| 16 |
+
"""WorldModel transformer for frame generation."""
|
| 17 |
+
|
| 18 |
+
from typing import Optional, List
|
| 19 |
+
import math
|
| 20 |
+
|
| 21 |
+
import einops as eo
|
| 22 |
+
import torch
|
| 23 |
+
from torch import nn, Tensor
|
| 24 |
+
import torch.nn.functional as F
|
| 25 |
+
from tensordict import TensorDict
|
| 26 |
+
from diffusers.configuration_utils import ConfigMixin, register_to_config
|
| 27 |
+
from diffusers.models.modeling_utils import ModelMixin
|
| 28 |
+
|
| 29 |
+
from .attn import Attn, MergedQKVAttn, CrossAttention
|
| 30 |
+
from .nn import AdaLN, MLP, NoiseConditioner, ada_gate, ada_rmsnorm, rms_norm
|
| 31 |
+
from .quantize import quantize_model
|
| 32 |
+
from .cache import CachedDenoiseStepEmb, CachedCondHead
|
| 33 |
+
|
| 34 |
+
|
| 35 |
+
def patch_cached_noise_conditioning(model) -> None:
|
| 36 |
+
# Call AFTER: model.to(device="cuda", dtype=torch.bfloat16).eval()
|
| 37 |
+
cached_denoise_step_emb = CachedDenoiseStepEmb(
|
| 38 |
+
model.denoise_step_emb, model.config.scheduler_sigmas
|
| 39 |
+
)
|
| 40 |
+
model.denoise_step_emb = cached_denoise_step_emb
|
| 41 |
+
for blk in model.transformer.blocks:
|
| 42 |
+
blk.cond_head = CachedCondHead(blk.cond_head, cached_denoise_step_emb)
|
| 43 |
+
|
| 44 |
+
|
| 45 |
+
def patch_Attn_merge_qkv(model) -> None:
|
| 46 |
+
for name, mod in list(model.named_modules()):
|
| 47 |
+
if isinstance(mod, Attn) and not isinstance(mod, MergedQKVAttn):
|
| 48 |
+
model.set_submodule(name, MergedQKVAttn(mod, model.config))
|
| 49 |
+
|
| 50 |
+
|
| 51 |
+
def patch_MLPFusion_split(model) -> None:
|
| 52 |
+
for name, mod in list(model.named_modules()):
|
| 53 |
+
if isinstance(mod, MLPFusion) and not isinstance(mod, SplitMLPFusion):
|
| 54 |
+
model.set_submodule(name, SplitMLPFusion(mod))
|
| 55 |
+
|
| 56 |
+
|
| 57 |
+
def _apply_inference_patches(model) -> None:
|
| 58 |
+
patch_cached_noise_conditioning(model)
|
| 59 |
+
patch_Attn_merge_qkv(model)
|
| 60 |
+
patch_MLPFusion_split(model)
|
| 61 |
+
|
| 62 |
+
|
| 63 |
+
class CFG(nn.Module):
|
| 64 |
+
def __init__(self, d_model: int, dropout: float):
|
| 65 |
+
super().__init__()
|
| 66 |
+
self.dropout = dropout
|
| 67 |
+
self.null_emb = nn.Parameter(torch.zeros(1, 1, d_model))
|
| 68 |
+
|
| 69 |
+
def forward(
|
| 70 |
+
self, x: torch.Tensor, is_conditioned: Optional[bool] = None
|
| 71 |
+
) -> torch.Tensor:
|
| 72 |
+
"""
|
| 73 |
+
x: [B, L, D]
|
| 74 |
+
is_conditioned:
|
| 75 |
+
- None: training-style random dropout
|
| 76 |
+
- bool: whole batch conditioned / unconditioned at sampling
|
| 77 |
+
"""
|
| 78 |
+
B, L, _ = x.shape
|
| 79 |
+
null = self.null_emb.expand(B, L, -1)
|
| 80 |
+
|
| 81 |
+
# training-style dropout OR unspecified
|
| 82 |
+
if self.training or is_conditioned is None:
|
| 83 |
+
if self.dropout == 0.0:
|
| 84 |
+
return x
|
| 85 |
+
drop = torch.rand(B, 1, 1, device=x.device) < self.dropout # [B,1,1]
|
| 86 |
+
return torch.where(drop, null, x)
|
| 87 |
+
|
| 88 |
+
# sampling-time switch
|
| 89 |
+
return x if is_conditioned else null
|
| 90 |
+
|
| 91 |
+
|
| 92 |
+
class ControllerInputEmbedding(nn.Module):
|
| 93 |
+
"""Embeds controller inputs (mouse + buttons) into model dimension."""
|
| 94 |
+
|
| 95 |
+
def __init__(self, n_buttons: int, d_model: int, mlp_ratio: int = 4):
|
| 96 |
+
super().__init__()
|
| 97 |
+
self.mlp = MLP(n_buttons + 3, d_model * mlp_ratio, d_model) # mouse velocity (x,y) + scroll sign
|
| 98 |
+
|
| 99 |
+
def forward(self, mouse: Tensor, button: Tensor, scroll: Tensor):
|
| 100 |
+
assert len(mouse.shape) == 3
|
| 101 |
+
x = torch.cat((mouse, button, scroll), dim=-1)
|
| 102 |
+
return self.mlp(x)
|
| 103 |
+
|
| 104 |
+
|
| 105 |
+
class MLPFusion(nn.Module):
|
| 106 |
+
"""Fuses per-group conditioning into tokens by applying an MLP to cat([x, cond])."""
|
| 107 |
+
|
| 108 |
+
def __init__(self, d_model: int):
|
| 109 |
+
super().__init__()
|
| 110 |
+
self.mlp = MLP(2 * d_model, d_model, d_model)
|
| 111 |
+
|
| 112 |
+
def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
|
| 113 |
+
B, _, D = x.shape
|
| 114 |
+
L = cond.shape[1]
|
| 115 |
+
|
| 116 |
+
Wx, Wc = self.mlp.fc1.weight.chunk(2, dim=1) # each [D, D]
|
| 117 |
+
|
| 118 |
+
x = x.view(B, L, -1, D)
|
| 119 |
+
h = F.linear(x, Wx) + F.linear(cond, Wc).unsqueeze(
|
| 120 |
+
2
|
| 121 |
+
) # broadcast, no repeat/cat
|
| 122 |
+
h = F.silu(h)
|
| 123 |
+
y = F.linear(h, self.mlp.fc2.weight)
|
| 124 |
+
return y.flatten(1, 2)
|
| 125 |
+
|
| 126 |
+
|
| 127 |
+
class SplitMLPFusion(nn.Module):
|
| 128 |
+
"""Packed MLPFusion -> split linears (no cat, quant-friendly)."""
|
| 129 |
+
|
| 130 |
+
def __init__(self, src: MLPFusion):
|
| 131 |
+
super().__init__()
|
| 132 |
+
D = src.mlp.fc2.in_features
|
| 133 |
+
dev, dt = src.mlp.fc2.weight.device, src.mlp.fc2.weight.dtype
|
| 134 |
+
|
| 135 |
+
self.fc1_x = nn.Linear(D, D, bias=False, device=dev, dtype=dt)
|
| 136 |
+
self.fc1_c = nn.Linear(D, D, bias=False, device=dev, dtype=dt)
|
| 137 |
+
self.fc2 = nn.Linear(D, D, bias=False, device=dev, dtype=dt)
|
| 138 |
+
|
| 139 |
+
with torch.no_grad():
|
| 140 |
+
Wx, Wc = src.mlp.fc1.weight.chunk(2, dim=1)
|
| 141 |
+
self.fc1_x.weight.copy_(Wx)
|
| 142 |
+
self.fc1_c.weight.copy_(Wc)
|
| 143 |
+
self.fc2.weight.copy_(src.mlp.fc2.weight)
|
| 144 |
+
|
| 145 |
+
self.train(src.training)
|
| 146 |
+
|
| 147 |
+
def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
|
| 148 |
+
B, _, D = x.shape
|
| 149 |
+
L = cond.shape[1]
|
| 150 |
+
x = x.reshape(B, L, -1, D)
|
| 151 |
+
return self.fc2(F.silu(self.fc1_x(x) + self.fc1_c(cond).unsqueeze(2))).flatten(
|
| 152 |
+
1, 2
|
| 153 |
+
)
|
| 154 |
+
|
| 155 |
+
|
| 156 |
+
class CondHead(nn.Module):
|
| 157 |
+
"""Per-layer conditioning head: bias_in -> SiLU -> Linear -> chunk(n_cond)."""
|
| 158 |
+
|
| 159 |
+
n_cond = 6
|
| 160 |
+
|
| 161 |
+
def __init__(self, d_model: int, noise_conditioning: str = "wan"):
|
| 162 |
+
super().__init__()
|
| 163 |
+
self.bias_in = (
|
| 164 |
+
nn.Parameter(torch.zeros(d_model)) if noise_conditioning == "wan" else None
|
| 165 |
+
)
|
| 166 |
+
self.cond_proj = nn.ModuleList(
|
| 167 |
+
[nn.Linear(d_model, d_model, bias=False) for _ in range(self.n_cond)]
|
| 168 |
+
)
|
| 169 |
+
|
| 170 |
+
def forward(self, cond):
|
| 171 |
+
cond = cond + self.bias_in if self.bias_in is not None else cond
|
| 172 |
+
h = F.silu(cond)
|
| 173 |
+
return tuple(p(h) for p in self.cond_proj)
|
| 174 |
+
|
| 175 |
+
|
| 176 |
+
class WorldDiTBlock(nn.Module):
|
| 177 |
+
"""Single transformer block with self-attention, optional cross-attention, and MLP."""
|
| 178 |
+
|
| 179 |
+
def __init__(
|
| 180 |
+
self,
|
| 181 |
+
d_model: int,
|
| 182 |
+
n_heads: int,
|
| 183 |
+
mlp_ratio: int,
|
| 184 |
+
layer_idx: int,
|
| 185 |
+
prompt_conditioning: Optional[str],
|
| 186 |
+
prompt_conditioning_period: int,
|
| 187 |
+
prompt_embedding_dim: int,
|
| 188 |
+
ctrl_conditioning_period: int,
|
| 189 |
+
noise_conditioning: str,
|
| 190 |
+
config,
|
| 191 |
+
):
|
| 192 |
+
super().__init__()
|
| 193 |
+
self.config = config
|
| 194 |
+
self.attn = Attn(config, layer_idx)
|
| 195 |
+
self.mlp = MLP(d_model, d_model * mlp_ratio, d_model)
|
| 196 |
+
self.cond_head = CondHead(d_model, noise_conditioning)
|
| 197 |
+
|
| 198 |
+
do_prompt_cond = (
|
| 199 |
+
prompt_conditioning is not None
|
| 200 |
+
and layer_idx % prompt_conditioning_period == 0
|
| 201 |
+
)
|
| 202 |
+
self.prompt_cross_attn = (
|
| 203 |
+
CrossAttention(config, prompt_embedding_dim) if do_prompt_cond else None
|
| 204 |
+
)
|
| 205 |
+
do_ctrl_cond = layer_idx % ctrl_conditioning_period == 0
|
| 206 |
+
self.ctrl_mlpfusion = MLPFusion(d_model) if do_ctrl_cond else None
|
| 207 |
+
|
| 208 |
+
def forward(self, x, pos_ids, cond, ctx, v, kv_cache=None):
|
| 209 |
+
"""
|
| 210 |
+
0) Causal Frame Attention
|
| 211 |
+
1) Frame->CTX Cross Attention
|
| 212 |
+
2) MLP
|
| 213 |
+
"""
|
| 214 |
+
s0, b0, g0, s1, b1, g1 = self.cond_head(cond)
|
| 215 |
+
|
| 216 |
+
# Self / Causal Attention
|
| 217 |
+
residual = x
|
| 218 |
+
x = ada_rmsnorm(x, s0, b0)
|
| 219 |
+
x, v = self.attn(x, pos_ids, v, kv_cache=kv_cache)
|
| 220 |
+
x = ada_gate(x, g0) + residual
|
| 221 |
+
|
| 222 |
+
# Cross Attention Prompt Conditioning
|
| 223 |
+
if self.prompt_cross_attn is not None:
|
| 224 |
+
x = (
|
| 225 |
+
self.prompt_cross_attn(
|
| 226 |
+
rms_norm(x),
|
| 227 |
+
context=rms_norm(ctx["prompt_emb"]),
|
| 228 |
+
context_pad_mask=ctx["prompt_pad_mask"],
|
| 229 |
+
)
|
| 230 |
+
+ x
|
| 231 |
+
)
|
| 232 |
+
|
| 233 |
+
# MLPFusion Controller Conditioning
|
| 234 |
+
if self.ctrl_mlpfusion is not None:
|
| 235 |
+
x = self.ctrl_mlpfusion(rms_norm(x), rms_norm(ctx["ctrl_emb"])) + x
|
| 236 |
+
|
| 237 |
+
# MLP
|
| 238 |
+
x = ada_gate(self.mlp(ada_rmsnorm(x, s1, b1)), g1) + x
|
| 239 |
+
|
| 240 |
+
return x, v
|
| 241 |
+
|
| 242 |
+
|
| 243 |
+
class WorldDiT(nn.Module):
|
| 244 |
+
"""Stack of WorldDiTBlocks with shared parameters."""
|
| 245 |
+
|
| 246 |
+
def __init__(self, config):
|
| 247 |
+
super().__init__()
|
| 248 |
+
self.config = config
|
| 249 |
+
self.blocks = nn.ModuleList(
|
| 250 |
+
[
|
| 251 |
+
WorldDiTBlock(
|
| 252 |
+
d_model=config.d_model,
|
| 253 |
+
n_heads=config.n_heads,
|
| 254 |
+
mlp_ratio=config.mlp_ratio,
|
| 255 |
+
layer_idx=idx,
|
| 256 |
+
prompt_conditioning=config.prompt_conditioning,
|
| 257 |
+
prompt_conditioning_period=config.prompt_conditioning_period,
|
| 258 |
+
prompt_embedding_dim=config.prompt_embedding_dim,
|
| 259 |
+
ctrl_conditioning_period=config.ctrl_conditioning_period,
|
| 260 |
+
noise_conditioning=config.noise_conditioning,
|
| 261 |
+
config=config,
|
| 262 |
+
)
|
| 263 |
+
for idx in range(config.n_layers)
|
| 264 |
+
]
|
| 265 |
+
)
|
| 266 |
+
|
| 267 |
+
if config.noise_conditioning in ("dit_air", "wan"):
|
| 268 |
+
ref_proj = self.blocks[0].cond_head.cond_proj
|
| 269 |
+
for blk in self.blocks[1:]:
|
| 270 |
+
for blk_mod, ref_mod in zip(blk.cond_head.cond_proj, ref_proj):
|
| 271 |
+
blk_mod.weight = ref_mod.weight
|
| 272 |
+
|
| 273 |
+
# Shared RoPE buffers
|
| 274 |
+
ref_rope = self.blocks[0].attn.rope
|
| 275 |
+
for blk in self.blocks[1:]:
|
| 276 |
+
blk.attn.rope = ref_rope
|
| 277 |
+
|
| 278 |
+
def forward(self, x, pos_ids, cond, ctx, kv_cache=None):
|
| 279 |
+
v = None
|
| 280 |
+
for i, block in enumerate(self.blocks):
|
| 281 |
+
x, v = block(x, pos_ids, cond, ctx, v, kv_cache=kv_cache)
|
| 282 |
+
return x
|
| 283 |
+
|
| 284 |
+
|
| 285 |
+
class WorldModel(ModelMixin, ConfigMixin):
|
| 286 |
+
"""
|
| 287 |
+
WORLD: Wayfarer Operator-driven Rectified-flow Long-context Diffuser.
|
| 288 |
+
|
| 289 |
+
Denoises a frame given:
|
| 290 |
+
- All previous frames (via KV cache)
|
| 291 |
+
- The prompt embedding
|
| 292 |
+
- The controller input embedding
|
| 293 |
+
- The current noise level
|
| 294 |
+
"""
|
| 295 |
+
|
| 296 |
+
_supports_gradient_checkpointing = False
|
| 297 |
+
_keep_in_fp32_modules = ["denoise_step_emb", "rope"]
|
| 298 |
+
|
| 299 |
+
@register_to_config
|
| 300 |
+
def __init__(
|
| 301 |
+
self,
|
| 302 |
+
# Model architecture
|
| 303 |
+
d_model: int = 2560,
|
| 304 |
+
n_heads: int = 40,
|
| 305 |
+
n_kv_heads: Optional[int] = 20,
|
| 306 |
+
n_layers: int = 22,
|
| 307 |
+
mlp_ratio: int = 5,
|
| 308 |
+
channels: int = 16,
|
| 309 |
+
height: int = 16,
|
| 310 |
+
width: int = 16,
|
| 311 |
+
patch: tuple = (2, 2),
|
| 312 |
+
tokens_per_frame: int = 256,
|
| 313 |
+
n_frames: int = 512,
|
| 314 |
+
local_window: int = 16,
|
| 315 |
+
global_window: int = 128,
|
| 316 |
+
global_attn_period: int = 4,
|
| 317 |
+
global_pinned_dilation: int = 8,
|
| 318 |
+
global_attn_offset: int = -1,
|
| 319 |
+
value_residual: bool = False,
|
| 320 |
+
gated_attn: bool = True,
|
| 321 |
+
n_buttons: int = 256,
|
| 322 |
+
ctrl_conditioning: Optional[str] = "mlp_fusion",
|
| 323 |
+
ctrl_conditioning_period: int = 3,
|
| 324 |
+
ctrl_cond_dropout: float = 0.0,
|
| 325 |
+
prompt_conditioning: Optional[str] = "cross_attention",
|
| 326 |
+
prompt_conditioning_period: int = 3,
|
| 327 |
+
prompt_embedding_dim: int = 2048,
|
| 328 |
+
prompt_cond_dropout: float = 0.0,
|
| 329 |
+
noise_conditioning: str = "wan",
|
| 330 |
+
scheduler_sigmas: Optional[List[float]] = [
|
| 331 |
+
1.0,
|
| 332 |
+
0.9483006596565247,
|
| 333 |
+
0.8379597067832947,
|
| 334 |
+
0.0,
|
| 335 |
+
],
|
| 336 |
+
base_fps: int = 60,
|
| 337 |
+
causal: bool = True,
|
| 338 |
+
mlp_gradient_checkpointing: bool = True,
|
| 339 |
+
block_gradient_checkpointing: bool = True,
|
| 340 |
+
rope_impl: str = "ortho",
|
| 341 |
+
):
|
| 342 |
+
super().__init__()
|
| 343 |
+
|
| 344 |
+
self.denoise_step_emb = NoiseConditioner(d_model)
|
| 345 |
+
self.ctrl_emb = ControllerInputEmbedding(n_buttons, d_model, mlp_ratio)
|
| 346 |
+
|
| 347 |
+
if self.config.ctrl_conditioning is not None:
|
| 348 |
+
self.ctrl_cfg = CFG(self.config.d_model, self.config.ctrl_cond_dropout)
|
| 349 |
+
if self.config.prompt_conditioning is not None:
|
| 350 |
+
self.prompt_cfg = CFG(
|
| 351 |
+
self.config.prompt_embedding_dim, self.config.prompt_cond_dropout
|
| 352 |
+
)
|
| 353 |
+
|
| 354 |
+
self.transformer = WorldDiT(self.config)
|
| 355 |
+
self.patch = tuple(patch)
|
| 356 |
+
|
| 357 |
+
C, D = channels, d_model
|
| 358 |
+
self.patchify = nn.Conv2d(
|
| 359 |
+
C, D, kernel_size=self.patch, stride=self.patch, bias=False
|
| 360 |
+
)
|
| 361 |
+
self.unpatchify = nn.Linear(D, C * math.prod(self.patch), bias=True)
|
| 362 |
+
self.out_norm = AdaLN(d_model)
|
| 363 |
+
|
| 364 |
+
# Cached 1-frame pos_ids (buffers + cached TensorDict view)
|
| 365 |
+
T = tokens_per_frame
|
| 366 |
+
idx = torch.arange(T, dtype=torch.long)
|
| 367 |
+
self.register_buffer(
|
| 368 |
+
"_t_pos_1f", torch.empty(T, dtype=torch.long), persistent=False
|
| 369 |
+
)
|
| 370 |
+
self.register_buffer(
|
| 371 |
+
"_y_pos_1f", idx.div(width, rounding_mode="floor"), persistent=False
|
| 372 |
+
)
|
| 373 |
+
self.register_buffer("_x_pos_1f", idx.remainder(width), persistent=False)
|
| 374 |
+
|
| 375 |
+
def forward(
|
| 376 |
+
self,
|
| 377 |
+
x: Tensor,
|
| 378 |
+
sigma: Tensor,
|
| 379 |
+
frame_timestamp: Tensor,
|
| 380 |
+
prompt_emb: Optional[Tensor] = None,
|
| 381 |
+
prompt_pad_mask: Optional[Tensor] = None,
|
| 382 |
+
mouse: Optional[Tensor] = None,
|
| 383 |
+
button: Optional[Tensor] = None,
|
| 384 |
+
scroll: Optional[Tensor] = None,
|
| 385 |
+
kv_cache=None,
|
| 386 |
+
):
|
| 387 |
+
"""
|
| 388 |
+
Args:
|
| 389 |
+
x: [B, N, C, H, W] - latent frames
|
| 390 |
+
sigma: [B, N] - noise levels
|
| 391 |
+
frame_timestamp: [B, N] - frame indices
|
| 392 |
+
prompt_emb: [B, P, D] - prompt embeddings
|
| 393 |
+
prompt_pad_mask: [B, P] - padding mask for prompts
|
| 394 |
+
mouse: [B, N, 2] - mouse velocity
|
| 395 |
+
button: [B, N, n_buttons] - button states
|
| 396 |
+
scroll: [B, N, 1] - scroll wheel sign (-1, 0, 1)
|
| 397 |
+
kv_cache: StaticKVCache instance
|
| 398 |
+
ctrl_cond: whether to apply controller conditioning (inference only)
|
| 399 |
+
prompt_cond: whether to apply prompt conditioning (inference only)
|
| 400 |
+
"""
|
| 401 |
+
B, N, C, H, W = x.shape
|
| 402 |
+
ph, pw = self.patch
|
| 403 |
+
assert (H % ph == 0) and (W % pw == 0), "H, W must be divisible by patch"
|
| 404 |
+
Hp, Wp = H // ph, W // pw
|
| 405 |
+
torch._assert(
|
| 406 |
+
Hp * Wp == self.config.tokens_per_frame,
|
| 407 |
+
f"{Hp} * {Wp} != {self.config.tokens_per_frame}",
|
| 408 |
+
)
|
| 409 |
+
|
| 410 |
+
torch._assert(
|
| 411 |
+
B == 1 and N == 1, "WorldModel.forward currently supports B==1, N==1"
|
| 412 |
+
)
|
| 413 |
+
self._t_pos_1f.copy_(frame_timestamp[0, 0].expand_as(self._t_pos_1f))
|
| 414 |
+
pos_ids = TensorDict(
|
| 415 |
+
{
|
| 416 |
+
"t_pos": self._t_pos_1f[None],
|
| 417 |
+
"y_pos": self._y_pos_1f[None],
|
| 418 |
+
"x_pos": self._x_pos_1f[None],
|
| 419 |
+
},
|
| 420 |
+
batch_size=[1, self._t_pos_1f.numel()],
|
| 421 |
+
)
|
| 422 |
+
cond = self.denoise_step_emb(sigma) # [B, N, d]
|
| 423 |
+
|
| 424 |
+
assert button is not None
|
| 425 |
+
ctx = {
|
| 426 |
+
"ctrl_emb": self.ctrl_emb(mouse, button, scroll),
|
| 427 |
+
"prompt_emb": prompt_emb,
|
| 428 |
+
"prompt_pad_mask": prompt_pad_mask,
|
| 429 |
+
}
|
| 430 |
+
|
| 431 |
+
D = self.unpatchify.in_features
|
| 432 |
+
x = self.patchify(x.reshape(B * N, C, H, W))
|
| 433 |
+
x = eo.rearrange(x.view(B, N, D, Hp, Wp), "b n d hp wp -> b (n hp wp) d")
|
| 434 |
+
x = self.transformer(x, pos_ids, cond, ctx, kv_cache)
|
| 435 |
+
x = F.silu(self.out_norm(x, cond))
|
| 436 |
+
x = eo.rearrange(
|
| 437 |
+
self.unpatchify(x),
|
| 438 |
+
"b (n hp wp) (c ph pw) -> b n c (hp ph) (wp pw)",
|
| 439 |
+
n=N,
|
| 440 |
+
hp=Hp,
|
| 441 |
+
wp=Wp,
|
| 442 |
+
ph=ph,
|
| 443 |
+
pw=pw,
|
| 444 |
+
)
|
| 445 |
+
|
| 446 |
+
return x
|
| 447 |
+
|
| 448 |
+
def quantize(self, quant_type: str):
|
| 449 |
+
quantize_model(self, quant_type)
|
| 450 |
+
|
| 451 |
+
def apply_inference_patches(self):
|
| 452 |
+
_apply_inference_patches(self)
|
transformer/nn.py
ADDED
|
@@ -0,0 +1,153 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Copyright (C) 2025 Hugging Face Team and Overworld
|
| 2 |
+
#
|
| 3 |
+
# This program is free software: you can redistribute it and/or modify
|
| 4 |
+
# it under the terms of the GNU General Public License as published by
|
| 5 |
+
# the Free Software Foundation, either version 3 of the License, or
|
| 6 |
+
# (at your option) any later version.
|
| 7 |
+
#
|
| 8 |
+
# This program is distributed in the hope that it will be useful,
|
| 9 |
+
# but WITHOUT ANY WARRANTY; without even the implied warranty of
|
| 10 |
+
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
| 11 |
+
# GNU General Public License for more details.
|
| 12 |
+
#
|
| 13 |
+
# You should have received a copy of the GNU General Public License
|
| 14 |
+
# along with this program. If not, see <https://www.gnu.org/licenses/>.
|
| 15 |
+
|
| 16 |
+
"""Neural network building blocks for WorldModel transformer."""
|
| 17 |
+
|
| 18 |
+
import warnings
|
| 19 |
+
|
| 20 |
+
import einops as eo
|
| 21 |
+
import torch
|
| 22 |
+
from torch import nn
|
| 23 |
+
import torch.nn.functional as F
|
| 24 |
+
|
| 25 |
+
|
| 26 |
+
class NoCastModule(torch.nn.Module):
|
| 27 |
+
"""Module that prevents dtype casting during .to() calls."""
|
| 28 |
+
|
| 29 |
+
def _apply(self, fn):
|
| 30 |
+
def keep_dtype(t):
|
| 31 |
+
old_dtype = t.dtype
|
| 32 |
+
out = fn(t)
|
| 33 |
+
if out.dtype is not old_dtype:
|
| 34 |
+
warnings.warn(
|
| 35 |
+
f"{self.__class__.__name__}: requested dtype cast ignored; "
|
| 36 |
+
f"keeping {old_dtype}.",
|
| 37 |
+
stacklevel=3,
|
| 38 |
+
)
|
| 39 |
+
out = out.to(dtype=old_dtype)
|
| 40 |
+
return out
|
| 41 |
+
|
| 42 |
+
return super()._apply(keep_dtype)
|
| 43 |
+
|
| 44 |
+
def to(self, *args, **kwargs):
|
| 45 |
+
warn_cast = False
|
| 46 |
+
|
| 47 |
+
# m.to(ref_tensor): use ref's device, ignore its dtype
|
| 48 |
+
if args and isinstance(args[0], torch.Tensor):
|
| 49 |
+
ref, *rest = args
|
| 50 |
+
args = (ref.device, *rest)
|
| 51 |
+
base = next(self.parameters(), None) or next(self.buffers(), None)
|
| 52 |
+
if base is not None and ref.dtype is not base.dtype:
|
| 53 |
+
warn_cast = True
|
| 54 |
+
|
| 55 |
+
# keyword dtype
|
| 56 |
+
if kwargs.pop("dtype", None) is not None:
|
| 57 |
+
warn_cast = True
|
| 58 |
+
|
| 59 |
+
# positional dtype
|
| 60 |
+
args = tuple(a for a in args if not isinstance(a, torch.dtype))
|
| 61 |
+
|
| 62 |
+
if warn_cast:
|
| 63 |
+
warnings.warn(
|
| 64 |
+
f"{self.__class__.__name__}.to: requested dtype cast ignored; "
|
| 65 |
+
"keeping existing dtypes.",
|
| 66 |
+
stacklevel=2,
|
| 67 |
+
)
|
| 68 |
+
|
| 69 |
+
return super().to(*args, **kwargs)
|
| 70 |
+
|
| 71 |
+
|
| 72 |
+
def rms_norm(x: torch.Tensor) -> torch.Tensor:
|
| 73 |
+
"""Root mean square layer normalization."""
|
| 74 |
+
return F.rms_norm(x, (x.size(-1),))
|
| 75 |
+
|
| 76 |
+
|
| 77 |
+
class MLP(nn.Module):
|
| 78 |
+
"""Simple MLP with SiLU activation."""
|
| 79 |
+
|
| 80 |
+
def __init__(self, dim_in, dim_middle, dim_out):
|
| 81 |
+
super().__init__()
|
| 82 |
+
self.fc1 = nn.Linear(dim_in, dim_middle, bias=False)
|
| 83 |
+
self.fc2 = nn.Linear(dim_middle, dim_out, bias=False)
|
| 84 |
+
|
| 85 |
+
def forward(self, x):
|
| 86 |
+
return self.fc2(F.silu(self.fc1(x)))
|
| 87 |
+
|
| 88 |
+
|
| 89 |
+
class AdaLN(nn.Module):
|
| 90 |
+
"""Adaptive Layer Normalization."""
|
| 91 |
+
|
| 92 |
+
def __init__(self, dim):
|
| 93 |
+
super().__init__()
|
| 94 |
+
self.fc = nn.Linear(dim, 2 * dim, bias=False)
|
| 95 |
+
|
| 96 |
+
def forward(self, x, cond):
|
| 97 |
+
# cond: [b, n, d], x: [b, n*m, d]
|
| 98 |
+
b, n, d = cond.shape
|
| 99 |
+
_, nm, _ = x.shape
|
| 100 |
+
m = nm // n
|
| 101 |
+
|
| 102 |
+
y = F.silu(cond)
|
| 103 |
+
ab = self.fc(y) # [b, n, 2d]
|
| 104 |
+
ab = ab.view(b, n, 1, 2 * d) # [b, n, 1, 2d]
|
| 105 |
+
ab = ab.expand(-1, -1, m, -1) # [b, n, m, 2d]
|
| 106 |
+
ab = ab.reshape(b, nm, 2 * d) # [b, nm, 2d]
|
| 107 |
+
|
| 108 |
+
a, b_ = ab.chunk(2, dim=-1) # [b, nm, d] each
|
| 109 |
+
x = rms_norm(x) * (1 + a) + b_
|
| 110 |
+
return x
|
| 111 |
+
|
| 112 |
+
|
| 113 |
+
def ada_rmsnorm(x, scale, bias):
|
| 114 |
+
"""Adaptive RMS normalization with scale and bias."""
|
| 115 |
+
x4 = eo.rearrange(x, "b (n m) d -> b n m d", n=scale.size(1))
|
| 116 |
+
y4 = rms_norm(x4) * (1 + scale.unsqueeze(2)) + bias.unsqueeze(2)
|
| 117 |
+
return eo.rearrange(y4, "b n m d -> b (n m) d")
|
| 118 |
+
|
| 119 |
+
|
| 120 |
+
def ada_gate(x, gate):
|
| 121 |
+
"""Apply gating to x with per-frame gates."""
|
| 122 |
+
x4 = eo.rearrange(x, "b (n m) d -> b n m d", n=gate.size(1))
|
| 123 |
+
return eo.rearrange(x4 * gate.unsqueeze(2), "b n m d -> b (n m) d")
|
| 124 |
+
|
| 125 |
+
|
| 126 |
+
class NoiseConditioner(NoCastModule):
|
| 127 |
+
"""Sigma -> logSNR -> Fourier Features -> Dense embedding."""
|
| 128 |
+
|
| 129 |
+
def __init__(self, dim, fourier_dim=512, base=10_000.0):
|
| 130 |
+
super().__init__()
|
| 131 |
+
assert fourier_dim % 2 == 0
|
| 132 |
+
half = fourier_dim // 2
|
| 133 |
+
self.freq = nn.Buffer(
|
| 134 |
+
torch.logspace(0, -1, steps=half, base=base, dtype=torch.float32),
|
| 135 |
+
persistent=False,
|
| 136 |
+
)
|
| 137 |
+
self.mlp = MLP(fourier_dim, dim * 4, dim)
|
| 138 |
+
|
| 139 |
+
def forward(self, s, eps=torch.finfo(torch.float32).eps):
|
| 140 |
+
assert self.freq.dtype == torch.float32
|
| 141 |
+
orig_dtype, shape = s.dtype, s.shape
|
| 142 |
+
|
| 143 |
+
with torch.autocast("cuda", enabled=False):
|
| 144 |
+
s = s.reshape(-1).float() # fp32 for fourier numerical stability
|
| 145 |
+
s = s * 1000 # expressive rotation range
|
| 146 |
+
|
| 147 |
+
# calculate fourier features
|
| 148 |
+
phase = s[:, None] * self.freq[None, :]
|
| 149 |
+
emb = torch.cat((torch.sin(phase), torch.cos(phase)), dim=-1)
|
| 150 |
+
emb = emb * 2**0.5 # Ensure unit variance
|
| 151 |
+
emb = self.mlp(emb)
|
| 152 |
+
|
| 153 |
+
return emb.to(orig_dtype).view(*shape, -1)
|
transformer/quantize.py
ADDED
|
@@ -0,0 +1,245 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Copyright (C) 2025 Hugging Face Team and Overworld
|
| 2 |
+
#
|
| 3 |
+
# This program is free software: you can redistribute it and/or modify
|
| 4 |
+
# it under the terms of the GNU General Public License as published by
|
| 5 |
+
# the Free Software Foundation, either version 3 of the License, or
|
| 6 |
+
# (at your option) any later version.
|
| 7 |
+
#
|
| 8 |
+
# This program is distributed in the hope that it will be useful,
|
| 9 |
+
# but WITHOUT ANY WARRANTY; without even the implied warranty of
|
| 10 |
+
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
| 11 |
+
# GNU General Public License for more details.
|
| 12 |
+
#
|
| 13 |
+
# You should have received a copy of the GNU General Public License
|
| 14 |
+
# along with this program. If not, see <https://www.gnu.org/licenses/>.
|
| 15 |
+
|
| 16 |
+
from typing import Optional
|
| 17 |
+
|
| 18 |
+
import torch
|
| 19 |
+
import torch.nn as nn
|
| 20 |
+
|
| 21 |
+
|
| 22 |
+
QUANTS = [
|
| 23 |
+
None
|
| 24 |
+
] # TODO: enable specific quant based on model config, which should specify compatible quants
|
| 25 |
+
|
| 26 |
+
|
| 27 |
+
try:
|
| 28 |
+
from flashinfer import nvfp4_quantize, mm_fp4, SfLayout
|
| 29 |
+
|
| 30 |
+
QUANTS.append("nvfp4")
|
| 31 |
+
except ImportError:
|
| 32 |
+
pass
|
| 33 |
+
|
| 34 |
+
|
| 35 |
+
@torch.library.custom_op("world_engine::fp4_linear", mutates_args=())
|
| 36 |
+
def fp4_linear(
|
| 37 |
+
a_bf16: torch.Tensor,
|
| 38 |
+
b_fp4_T: torch.Tensor,
|
| 39 |
+
a_global_sf: torch.Tensor,
|
| 40 |
+
b_sf_T: torch.Tensor,
|
| 41 |
+
alpha: torch.Tensor,
|
| 42 |
+
) -> torch.Tensor:
|
| 43 |
+
a_fp4, a_sf = nvfp4_quantize(
|
| 44 |
+
a_bf16,
|
| 45 |
+
a_global_sf,
|
| 46 |
+
sfLayout=SfLayout.layout_128x4,
|
| 47 |
+
do_shuffle=False,
|
| 48 |
+
)
|
| 49 |
+
return mm_fp4(
|
| 50 |
+
a_fp4, b_fp4_T, a_sf, b_sf_T, alpha, out_dtype=torch.bfloat16, backend="cutlass"
|
| 51 |
+
)
|
| 52 |
+
|
| 53 |
+
|
| 54 |
+
@fp4_linear.register_fake
|
| 55 |
+
def _fp4_linear_fake(
|
| 56 |
+
a_bf16: torch.Tensor,
|
| 57 |
+
b_fp4_T: torch.Tensor,
|
| 58 |
+
a_global_sf: torch.Tensor,
|
| 59 |
+
b_sf_T: torch.Tensor,
|
| 60 |
+
alpha: torch.Tensor,
|
| 61 |
+
) -> torch.Tensor:
|
| 62 |
+
return torch.empty(
|
| 63 |
+
(a_bf16.shape[0], b_fp4_T.shape[1]), device=a_bf16.device, dtype=torch.bfloat16
|
| 64 |
+
)
|
| 65 |
+
|
| 66 |
+
|
| 67 |
+
class FP4Linear(nn.Module):
    """FP4 Linear layer using FlashInfer's NVFP4 quantization.

    The source layer's weight is quantized to NVFP4 once at construction;
    activations are quantized per call inside the ``fp4_linear`` custom op.
    NOTE(review): any bias on the source layer is silently dropped — confirm
    callers only wrap bias-free linears.
    """

    def __init__(self, lin: nn.Linear):
        super().__init__()

        self.in_features = lin.in_features
        self.out_features = lin.out_features

        # Check alignment requirements for NVFP4 TMA
        assert self.in_features % 32 == 0 and self.out_features % 32 == 0, (
            "features % 32 != 0, nvfp4 disallowed"
        )

        # Store weight from original linear layer
        self.weight = nn.Parameter(lin.weight.detach().clone())

        # Cached FP4 weight and scales (filled eagerly below)
        self._weight_fp4_T: Optional[torch.Tensor] = None
        self._weight_scales_T: Optional[torch.Tensor] = None
        self._alpha: Optional[torch.Tensor] = None
        self._dummy_scale: Optional[torch.Tensor] = None
        self._weight_global_sf = None

        with torch.no_grad():
            # Quantize weights eagerly (no lazy path)
            # Activations use a fixed global scale of 1.0; per-block scales
            # still come from nvfp4_quantize at runtime.
            self._dummy_scale = torch.full(
                (1,), 1.0, device=self.weight.device, dtype=torch.float32
            )
            weight_bf16 = (
                self.weight.to(torch.bfloat16).to(self.weight.device).contiguous()
            )
            weight_amax = weight_bf16.float().abs().nan_to_num().max()
            # Per-tensor global scale: maps the weight's abs-max to 1.
            self._weight_global_sf = (1.0) / weight_amax
            # Dequantization factor applied by the GEMM epilogue.
            self._alpha = 1.0 / (self._weight_global_sf * self._dummy_scale)
            w_fp4, w_sf = nvfp4_quantize(
                weight_bf16,
                self._weight_global_sf,
                sfLayout=SfLayout.layout_128x4,
                do_shuffle=False,
            )
            self._weight_fp4_T = w_fp4.t()
            self._weight_scales_T = w_sf.t()

            # Warmup flashinfer fp4 graphs
            assert self.weight.is_cuda, "Weights need to be on GPU before quantization"
            # TODO: test actual shape warmup, might perform better
            lazy_x = torch.zeros(
                (1, lin.in_features), device=self.weight.device, dtype=torch.bfloat16
            )
            fp4_linear(
                lazy_x,
                self._weight_fp4_T,
                self._dummy_scale,
                self._weight_scales_T,
                self._alpha,
            )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass using FP4 quantization and FlashInfer GEMM.

        Args:
            x: [..., in_features] tensor; cast to bf16 and flattened to 2D.

        Returns:
            [..., out_features] bf16 tensor.
        """
        x_flat = x.reshape(-1, x.shape[-1])
        y = fp4_linear(
            x_flat.to(torch.bfloat16).contiguous(),
            self._weight_fp4_T,
            self._dummy_scale,
            self._weight_scales_T,
            self._alpha,
        )
        return y.reshape(x.shape[:-1] + (-1,))
|
| 136 |
+
|
| 137 |
+
|
| 138 |
+
class FP8W8A8Linear(nn.Module):
    """FP8 (e4m3) linear with static weight scale and dynamic activation scale.

    The weight is quantized once at construction with a per-tensor scale;
    activations are re-scaled and quantized on every forward call (W8A8).
    The matmul accumulates to fp16 and the result is cast back to x.dtype.
    """

    __constants__ = ("in_features", "out_features")

    def __init__(self, lin: nn.Linear):
        super().__init__()
        self.in_features, self.out_features = lin.in_features, lin.out_features

        f8 = torch.float8_e4m3fn
        # 1 / max representable e4m3 value: maps an abs-max onto the fp8 range.
        inv = 1.0 / float(torch.finfo(f8).max)
        self._inv = inv

        w = lin.weight.detach()
        ws = (w.abs().amax() * inv).clamp_min(1e-8).float()  # 0-d
        wf8 = (w / ws.to(w.dtype)).to(f8).contiguous()  # row-major
        self.register_buffer("wT", wf8.t())  # col-major view (no contiguous)
        self.register_buffer("ws", ws)

        if lin.bias is None:
            self.bias = None
        else:
            # fp16 bias to match the matmul's fp16 output dtype below.
            self.register_buffer("bias", lin.bias.detach().to(torch.float16))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Quantize x per-tensor to e4m3, run the fp8 GEMM, return in x.dtype."""
        s = x.shape
        x2 = x.reshape(-1, s[-1])

        # Dynamic per-tensor activation scale, recomputed every call.
        xs = (x2.abs().amax() * self._inv).clamp_min(1e-8).float()  # 0-d
        xf8 = (x2 / xs.to(x2.dtype)).to(torch.float8_e4m3fn).contiguous()

        y = torch._scaled_mm(
            xf8,
            self.wT,
            xs,
            self.ws,
            bias=self.bias,
            out_dtype=torch.float16,
            use_fast_accum=True,
        )
        return y.reshape(*s[:-1], self.out_features).to(x.dtype)
|
| 177 |
+
|
| 178 |
+
|
| 179 |
+
class FP8Linear(nn.Module):
|
| 180 |
+
def __init__(self, lin: nn.Linear):
|
| 181 |
+
super().__init__()
|
| 182 |
+
self.in_features, self.out_features = lin.in_features, lin.out_features
|
| 183 |
+
|
| 184 |
+
self.bias = (
|
| 185 |
+
nn.Parameter(lin.bias.data.clone().to(torch.float8_e4m3fn))
|
| 186 |
+
if lin.bias is not None
|
| 187 |
+
else None
|
| 188 |
+
)
|
| 189 |
+
w_amax = lin.weight.data.clone().amax().float().squeeze()
|
| 190 |
+
w = lin.weight.data.clone().div(w_amax).to(torch.float8_e4m3fn)
|
| 191 |
+
self.register_buffer("w_amax", w_amax)
|
| 192 |
+
self.register_buffer("weightT", w.t())
|
| 193 |
+
self.dummy_scale = torch.ones((), device=lin.weight.device, dtype=torch.float32)
|
| 194 |
+
|
| 195 |
+
def forward(self, x: torch.Tensor) -> torch.Tensor:
|
| 196 |
+
"""
|
| 197 |
+
Forward pass using FP8 matmul.
|
| 198 |
+
|
| 199 |
+
Args:
|
| 200 |
+
x: Input tensor of shape [..., in_features] (flattens if > 2D)
|
| 201 |
+
|
| 202 |
+
Returns:
|
| 203 |
+
Output tensor of shape [..., out_features] in BF16 format, unflattened if input is > 2D
|
| 204 |
+
"""
|
| 205 |
+
|
| 206 |
+
# Convert input to FP8 e4m3
|
| 207 |
+
x_fp8 = x.to(torch.float8_e4m3fn).reshape(-1, x.size(-1)).contiguous()
|
| 208 |
+
|
| 209 |
+
result = torch._scaled_mm(
|
| 210 |
+
x_fp8,
|
| 211 |
+
self.weightT,
|
| 212 |
+
bias=self.bias,
|
| 213 |
+
scale_a=self.dummy_scale,
|
| 214 |
+
scale_b=self.w_amax,
|
| 215 |
+
out_dtype=torch.bfloat16,
|
| 216 |
+
use_fast_accum=True,
|
| 217 |
+
)
|
| 218 |
+
|
| 219 |
+
return result.reshape(x.shape[:-1] + (-1,))
|
| 220 |
+
|
| 221 |
+
|
| 222 |
+
def quantize_model(model: nn.Module, quant: Optional[str]) -> nn.Module:
    """Recursively replace eligible ``nn.Linear`` children with quantized layers.

    Args:
        model: Module tree to rewrite in place.
        quant: One of "w8a8", "nvfp4", "fp8", or None (no-op).

    Returns:
        The same ``model`` instance, with eligible linears swapped.

    Raises:
        KeyError: If ``quant`` is not a recognized mode.
    """
    if quant is None:
        return model

    def _eligible(m: nn.Module) -> bool:
        # Only bf16 Linear layers whose dims satisfy the 32-alignment the
        # quantized kernels require.
        if not isinstance(m, nn.Linear):
            return False
        w = m.weight
        if w.dtype != torch.bfloat16:
            return False
        out_f, in_f = w.shape
        return out_f % 32 == 0 and in_f % 32 == 0

    new_linear = {
        "w8a8": FP8W8A8Linear,
        "nvfp4": FP4Linear,
        "fp8": FP8Linear,
    }[quant]

    for name, child in model.named_children():
        if _eligible(child):
            setattr(model, name, new_linear(child))
        else:
            # Recurse: nested containers may hold eligible linears.
            quantize_model(child, quant)
    return model
|
vae/__init__.py
ADDED
|
@@ -0,0 +1,19 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Copyright (C) 2025 Hugging Face Team and Overworld
|
| 2 |
+
#
|
| 3 |
+
# This program is free software: you can redistribute it and/or modify
|
| 4 |
+
# it under the terms of the GNU General Public License as published by
|
| 5 |
+
# the Free Software Foundation, either version 3 of the License, or
|
| 6 |
+
# (at your option) any later version.
|
| 7 |
+
#
|
| 8 |
+
# This program is distributed in the hope that it will be useful,
|
| 9 |
+
# but WITHOUT ANY WARRANTY; without even the implied warranty of
|
| 10 |
+
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
| 11 |
+
# GNU General Public License for more details.
|
| 12 |
+
#
|
| 13 |
+
# You should have received a copy of the GNU General Public License
|
| 14 |
+
# along with this program. If not, see <https://www.gnu.org/licenses/>.
|
| 15 |
+
|
| 16 |
+
from .ae_model import WorldEngineVAE
|
| 17 |
+
from .dcae import bake_weight_norm
|
| 18 |
+
|
| 19 |
+
__all__ = ["WorldEngineVAE", "bake_weight_norm"]
|
vae/__pycache__/__init__.cpython-311.pyc
ADDED
|
Binary file (265 Bytes). View file
|
|
|
vae/__pycache__/ae_model.cpython-311.pyc
ADDED
|
Binary file (5.64 kB). View file
|
|
|
vae/__pycache__/dcae.cpython-311.pyc
ADDED
|
Binary file (17.1 kB). View file
|
|
|
vae/ae_model.py
ADDED
|
@@ -0,0 +1,141 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Copyright (C) 2025 Hugging Face Team and Overworld
|
| 2 |
+
#
|
| 3 |
+
# This program is free software: you can redistribute it and/or modify
|
| 4 |
+
# it under the terms of the GNU General Public License as published by
|
| 5 |
+
# the Free Software Foundation, either version 3 of the License, or
|
| 6 |
+
# (at your option) any later version.
|
| 7 |
+
#
|
| 8 |
+
# This program is distributed in the hope that it will be useful,
|
| 9 |
+
# but WITHOUT ANY WARRANTY; without even the implied warranty of
|
| 10 |
+
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
| 11 |
+
# GNU General Public License for more details.
|
| 12 |
+
#
|
| 13 |
+
# You should have received a copy of the GNU General Public License
|
| 14 |
+
# along with this program. If not, see <https://www.gnu.org/licenses/>.
|
| 15 |
+
|
| 16 |
+
"""VAE model for WorldEngine frame encoding/decoding."""
|
| 17 |
+
|
| 18 |
+
from dataclasses import dataclass
|
| 19 |
+
from typing import List, Tuple
|
| 20 |
+
|
| 21 |
+
import torch
|
| 22 |
+
from torch import Tensor
|
| 23 |
+
|
| 24 |
+
from diffusers.configuration_utils import ConfigMixin, register_to_config
|
| 25 |
+
from diffusers.models.modeling_utils import ModelMixin
|
| 26 |
+
from .dcae import Encoder, Decoder, bake_weight_norm
|
| 27 |
+
|
| 28 |
+
|
| 29 |
+
@dataclass
class EncoderDecoderConfig:
    """Config object for Encoder/Decoder initialization."""

    channels: int  # image channels (e.g. 3 for RGB)
    latent_channels: int  # channels of the latent tensor
    ch_0: int  # base channel width at full resolution
    ch_max: int  # cap applied when channel width doubles per stage
    encoder_blocks_per_stage: List[int]  # ResBlocks per encoder stage
    decoder_blocks_per_stage: List[int]  # ResBlocks per decoder stage
    skip_logvar: bool = False  # when True, the encoder omits its logvar head
|
| 40 |
+
|
| 41 |
+
|
| 42 |
+
class WorldEngineVAE(ModelMixin, ConfigMixin):
    """
    VAE for encoding/decoding video frames using DCAE architecture.

    Encodes RGB uint8 images to latent space and decodes latents back to RGB.
    Inherits diffusers' ModelMixin/ConfigMixin so the config round-trips via
    `register_to_config`.
    """

    _supports_gradient_checkpointing = False

    @register_to_config
    def __init__(
        self,
        # Common parameters
        sample_size: Tuple[int, int] = (360, 640),  # (H, W); stored in config only
        channels: int = 3,
        latent_channels: int = 16,
        # Encoder parameters
        encoder_ch_0: int = 64,
        encoder_ch_max: int = 256,
        encoder_blocks_per_stage: List[int] = None,  # None -> [1, 1, 1, 1]
        # Decoder parameters
        decoder_ch_0: int = 128,
        decoder_ch_max: int = 1024,
        decoder_blocks_per_stage: List[int] = None,  # None -> [1, 1, 1, 1]
        # Shared parameters
        skip_logvar: bool = False,
        # Scaling factors (kept in the config; not applied in this class)
        scale_factor: float = 1.0,
        shift_factor: float = 0.0,
    ):
        super().__init__()

        # Default blocks per stage
        if encoder_blocks_per_stage is None:
            encoder_blocks_per_stage = [1, 1, 1, 1]
        if decoder_blocks_per_stage is None:
            decoder_blocks_per_stage = [1, 1, 1, 1]

        # Create encoder config
        encoder_config = EncoderDecoderConfig(
            channels=channels,
            latent_channels=latent_channels,
            ch_0=encoder_ch_0,
            ch_max=encoder_ch_max,
            encoder_blocks_per_stage=list(encoder_blocks_per_stage),
            decoder_blocks_per_stage=list(decoder_blocks_per_stage),
            skip_logvar=skip_logvar,
        )

        # Create decoder config (shares block counts; differs in widths)
        decoder_config = EncoderDecoderConfig(
            channels=channels,
            latent_channels=latent_channels,
            ch_0=decoder_ch_0,
            ch_max=decoder_ch_max,
            encoder_blocks_per_stage=list(encoder_blocks_per_stage),
            decoder_blocks_per_stage=list(decoder_blocks_per_stage),
            skip_logvar=skip_logvar,
        )

        self.encoder = Encoder(encoder_config)
        self.decoder = Decoder(decoder_config)

    def encode(self, img: Tensor):
        """RGB -> RGB+D -> latent

        Args:
            img: [H, W, C] image with values in [0, 255].

        Returns:
            Encoder output for the batched (size-1) input.
        """
        assert img.dim() == 3, "Expected [H, W, C] image tensor"
        img = img.unsqueeze(0).to(device=self.device, dtype=self.dtype)
        # HWC -> CHW, then rescale [0, 255] -> [-1, 1].
        rgb = img.permute(0, 3, 1, 2).contiguous().div(255).mul(2).sub(1)
        return self.encoder(rgb)

    def decode(self, latent: Tensor):
        """Latent -> [H, W, 3] uint8 RGB image (assumes batch size 1)."""
        decoded = self.decoder(latent)
        # [-1, 1] -> [0, 1], then to uint8 [0, 255].
        decoded = (decoded / 2 + 0.5).clamp(0, 1)
        decoded = (decoded * 255).round().to(torch.uint8)
        # Drop batch dim, CHW -> HWC, keep first 3 channels.
        return decoded.squeeze(0).permute(1, 2, 0)[..., :3]

    def forward(self, x: Tensor, encode: bool = True) -> Tensor:
        """
        Forward pass - encode or decode based on flag.

        Args:
            x: Input tensor (image for encode, latent for decode)
            encode: If True, encode; if False, decode

        Returns:
            Encoded latent or decoded image
        """
        if encode:
            return self.encode(x)
        else:
            return self.decode(x)

    def bake_weight_norm(self):
        """Remove weight_norm parametrizations, baking normalized weights into regular tensors.

        Call this after loading weights and before torch.compile to avoid
        CUDA graph capture errors from in-place weight updates.
        """
        bake_weight_norm(self)
        return self
|
vae/config.json
ADDED
|
@@ -0,0 +1,33 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"_class_name": "WorldEngineVAE",
|
| 3 |
+
"_diffusers_version": "0.36.0.dev0",
|
| 4 |
+
"auto_map": {
|
| 5 |
+
"AutoModel": "ae_model.WorldEngineVAE"
|
| 6 |
+
},
|
| 7 |
+
"sample_size": [
|
| 8 |
+
360,
|
| 9 |
+
640
|
| 10 |
+
],
|
| 11 |
+
"channels": 3,
|
| 12 |
+
"latent_channels": 16,
|
| 13 |
+
"encoder_ch_0": 64,
|
| 14 |
+
"encoder_ch_max": 256,
|
| 15 |
+
"encoder_blocks_per_stage": [
|
| 16 |
+
1,
|
| 17 |
+
1,
|
| 18 |
+
1,
|
| 19 |
+
1
|
| 20 |
+
],
|
| 21 |
+
"decoder_ch_0": 128,
|
| 22 |
+
"decoder_ch_max": 1024,
|
| 23 |
+
"decoder_blocks_per_stage": [
|
| 24 |
+
1,
|
| 25 |
+
1,
|
| 26 |
+
1,
|
| 27 |
+
1
|
| 28 |
+
],
|
| 29 |
+
"use_middle_block": false,
|
| 30 |
+
"skip_logvar": false,
|
| 31 |
+
"scale_factor": 1.0,
|
| 32 |
+
"shift_factor": 0.0
|
| 33 |
+
}
|
vae/dcae.py
ADDED
|
@@ -0,0 +1,271 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Copyright (C) 2025 Hugging Face Team and Overworld
|
| 2 |
+
#
|
| 3 |
+
# This program is free software: you can redistribute it and/or modify
|
| 4 |
+
# it under the terms of the GNU General Public License as published by
|
| 5 |
+
# the Free Software Foundation, either version 3 of the License, or
|
| 6 |
+
# (at your option) any later version.
|
| 7 |
+
#
|
| 8 |
+
# This program is distributed in the hope that it will be useful,
|
| 9 |
+
# but WITHOUT ANY WARRANTY; without even the implied warranty of
|
| 10 |
+
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
| 11 |
+
# GNU General Public License for more details.
|
| 12 |
+
#
|
| 13 |
+
# You should have received a copy of the GNU General Public License
|
| 14 |
+
# along with this program. If not, see <https://www.gnu.org/licenses/>.
|
| 15 |
+
|
| 16 |
+
import torch
|
| 17 |
+
from torch import nn
|
| 18 |
+
import torch.nn.functional as F
|
| 19 |
+
|
| 20 |
+
from torch.nn.utils.parametrizations import weight_norm
|
| 21 |
+
from torch.nn.utils.parametrize import remove_parametrizations
|
| 22 |
+
|
| 23 |
+
|
| 24 |
+
def bake_weight_norm(model: nn.Module) -> nn.Module:
|
| 25 |
+
"""Remove weight_norm parametrizations, baking normalized weights into regular tensors.
|
| 26 |
+
|
| 27 |
+
This is required for torch.compile/CUDA graph compatibility since weight_norm
|
| 28 |
+
performs in-place updates during forward passes.
|
| 29 |
+
"""
|
| 30 |
+
for module in model.modules():
|
| 31 |
+
if hasattr(module, "parametrizations") and "weight" in getattr(module, "parametrizations", {}):
|
| 32 |
+
remove_parametrizations(module, "weight", leave_parametrized=True)
|
| 33 |
+
return model
|
| 34 |
+
|
| 35 |
+
|
| 36 |
+
# === General Blocks ===
|
| 37 |
+
|
| 38 |
+
def WeightNormConv2d(*args, **kwargs):
|
| 39 |
+
return weight_norm(nn.Conv2d(*args, **kwargs))
|
| 40 |
+
|
| 41 |
+
class ResBlock(nn.Module):
    """Inverted-bottleneck residual block: 1x1 expand -> grouped 3x3 -> 1x1 project."""

    def __init__(self, ch):
        super().__init__()

        expanded = 2 * ch
        # 16 channels per group (matches checkpoint shapes like [128,16,3,3] when ch=64)
        num_groups = max(1, expanded // 16)

        self.conv1 = WeightNormConv2d(ch, expanded, 1, 1, 0)
        self.conv2 = WeightNormConv2d(expanded, expanded, 3, 1, 1, groups=num_groups)
        self.conv3 = WeightNormConv2d(expanded, ch, 1, 1, 0, bias=False)

        self.act1 = nn.LeakyReLU(inplace=False)
        self.act2 = nn.LeakyReLU(inplace=False)

    def forward(self, x):
        # Expand, activate, mix spatially, activate, project, then add skip.
        hidden = self.act1(self.conv1(x))
        hidden = self.act2(self.conv2(hidden))
        return x + self.conv3(hidden)
|
| 63 |
+
|
| 64 |
+
# === Encoder ===
|
| 65 |
+
|
| 66 |
+
class LandscapeToSquare(nn.Module):
    """Resize a landscape frame to a fixed 512x512 square, then 3x3-project channels.

    Strict assumption of 360p input.
    """

    def __init__(self, ch_in, ch_out):
        super().__init__()
        self.proj = WeightNormConv2d(ch_in, ch_out, 3, 1, 1)

    def forward(self, x):
        squared = F.interpolate(x, (512, 512), mode='bicubic')
        return self.proj(squared)
|
| 77 |
+
|
| 78 |
+
class Downsample(nn.Module):
    """Halve spatial resolution (bicubic), then 1x1-project channels."""

    def __init__(self, ch_in, ch_out):
        super().__init__()
        self.proj = WeightNormConv2d(ch_in, ch_out, 1, 1, 0, bias=False)

    def forward(self, x):
        halved = F.interpolate(x, scale_factor=0.5, mode='bicubic')
        return self.proj(halved)
|
| 88 |
+
|
| 89 |
+
class DownBlock(nn.Module):
    """``num_res`` residual blocks followed by a 2x downsample."""

    def __init__(self, ch_in, ch_out, num_res=1):
        super().__init__()
        self.down = Downsample(ch_in, ch_out)
        self.blocks = nn.ModuleList(ResBlock(ch_in) for _ in range(num_res))

    def forward(self, x):
        for res_block in self.blocks:
            x = res_block(x)
        return self.down(x)
|
| 104 |
+
|
| 105 |
+
class SpaceToChannel(nn.Module):
    """Project to ch_out//4 channels, then 2x pixel-unshuffle.

    Spatial detail moves into channels: output has ch_out channels at half
    the input resolution.
    """

    def __init__(self, ch_in, ch_out):
        super().__init__()
        self.proj = WeightNormConv2d(ch_in, ch_out // 4, 3, 1, 1)

    def forward(self, x):
        projected = self.proj(x)
        return F.pixel_unshuffle(projected, 2).contiguous()
|
| 115 |
+
|
| 116 |
+
class ChannelAverage(nn.Module):
    """Channel-reducing 3x3 projection with a grouped-mean shortcut.

    The shortcut splits the input into ``grps = ch_in // ch_out`` channel
    groups, averages them (rescaled by sqrt(grps)), and adds the result to
    the projection.
    """

    def __init__(self, ch_in, ch_out):
        super().__init__()
        self.proj = WeightNormConv2d(ch_in, ch_out, 3, 1, 1)
        self.grps = ch_in // ch_out
        self.scale = self.grps ** 0.5

    def forward(self, x):
        projected = self.proj(x.contiguous())  # [b, ch_out, h, w]

        # Shortcut: mean over channel groups, variance-preserving rescale.
        b, c, h, w = x.shape
        shortcut = x.view(b, self.grps, c // self.grps, h, w).contiguous()
        shortcut = shortcut.mean(dim=1) * self.scale  # [b, ch_out, h, w]

        return shortcut + projected
|
| 133 |
+
|
| 134 |
+
# === Decoder ===
|
| 135 |
+
|
| 136 |
+
class SquareToLandscape(nn.Module):
    """3x3-project channels, then resize the square feature map to 360x640."""

    def __init__(self, ch_in, ch_out):
        super().__init__()
        self.proj = WeightNormConv2d(ch_in, ch_out, 3, 1, 1)

    def forward(self, x):
        # TODO This ordering is wrong for both
        projected = self.proj(x)
        return F.interpolate(projected, (360, 640), mode='bicubic')
|
| 146 |
+
|
| 147 |
+
class Upsample(nn.Module):
    """1x1 channel projection (identity when widths match), then 2x bicubic upsample."""

    def __init__(self, ch_in, ch_out):
        super().__init__()
        if ch_in == ch_out:
            self.proj = nn.Identity()
        else:
            self.proj = WeightNormConv2d(ch_in, ch_out, 1, 1, 0, bias=False)

    def forward(self, x):
        projected = self.proj(x)
        return F.interpolate(projected, scale_factor=2.0, mode='bicubic')
|
| 159 |
+
|
| 160 |
+
class UpBlock(nn.Module):
    """2x upsample followed by ``num_res`` residual blocks."""

    def __init__(self, ch_in, ch_out, num_res=1):
        super().__init__()
        self.up = Upsample(ch_in, ch_out)
        self.blocks = nn.ModuleList(ResBlock(ch_out) for _ in range(num_res))

    def forward(self, x):
        x = self.up(x)
        for res_block in self.blocks:
            x = res_block(x)
        return x
|
| 175 |
+
|
| 176 |
+
class ChannelToSpace(nn.Module):
    """Project to 4*ch_out channels, then 2x pixel-shuffle: channels -> spatial."""

    def __init__(self, ch_in, ch_out):
        super().__init__()
        self.proj = WeightNormConv2d(ch_in, ch_out * 4, 3, 1, 1)

    def forward(self, x):
        projected = self.proj(x)
        return F.pixel_shuffle(projected, 2).contiguous()
|
| 186 |
+
|
| 187 |
+
class ChannelDuplication(nn.Module):
    """Channel-expanding 3x3 projection with a repeat shortcut.

    The shortcut tiles the input ``reps = ch_out // ch_in`` times along the
    channel axis (rescaled by 1/sqrt(reps)) and adds it to the projection.
    """

    def __init__(self, ch_in, ch_out):
        super().__init__()
        self.proj = WeightNormConv2d(ch_in, ch_out, 3, 1, 1)
        self.reps = ch_out // ch_in
        self.scale = self.reps ** -0.5

    def forward(self, x):
        projected = self.proj(x.contiguous())

        # Shortcut: duplicate each channel `reps` times, variance-preserving
        # rescale, then add.
        b, c, h, w = x.shape
        shortcut = x.unsqueeze(2)                       # [b, c, 1, h, w]
        shortcut = shortcut.expand(b, c, self.reps, h, w)
        shortcut = shortcut.reshape(b, c * self.reps, h, w).contiguous()
        shortcut = shortcut * self.scale

        return shortcut + projected
|
| 206 |
+
|
| 207 |
+
# === Main AE ===
|
| 208 |
+
|
| 209 |
+
class Encoder(nn.Module):
    """DCAE encoder: square-ify the frame, then alternate downsample stages
    with space-to-channel shortcut paths down to the latent resolution."""

    def __init__(self, config):
        super().__init__()

        self.conv_in = LandscapeToSquare(config.channels, config.ch_0)

        blocks = []
        residuals = []

        # Channel width doubles per stage, capped at ch_max.
        ch = config.ch_0
        for block_count in config.encoder_blocks_per_stage:
            next_ch = min(ch*2, config.ch_max)

            blocks.append(DownBlock(ch, next_ch, block_count))
            residuals.append(SpaceToChannel(ch, next_ch))

            ch = next_ch

        self.blocks = nn.ModuleList(blocks)
        self.residuals = nn.ModuleList(residuals)
        self.conv_out = ChannelAverage(ch, config.latent_channels)

        self.skip_logvar = bool(getattr(config, "skip_logvar", False))
        if not self.skip_logvar:
            # Checkpoint expects a 1-channel logvar head: [1, ch, 3, 3]
            self.conv_out_logvar = WeightNormConv2d(ch, 1, 3, 1, 1)

    def forward(self, x):
        # NOTE(review): only the mean path is returned here; conv_out_logvar
        # (when present) is unused in forward — presumably a training-time
        # head. Confirm against the training code.
        x = self.conv_in(x)
        for block, residual in zip(self.blocks, self.residuals):
            # Each stage sums the downsample path and its shortcut.
            x = block(x) + residual(x)
        return self.conv_out(x)
|
| 241 |
+
|
| 242 |
+
class Decoder(nn.Module):
    """DCAE decoder: duplicate latent channels up to ch_max, then alternate
    upsample stages with channel-to-space shortcuts, ending at 360x640."""

    def __init__(self, config):
        super().__init__()

        self.conv_in = ChannelDuplication(config.latent_channels, config.ch_max)

        blocks = []
        residuals = []

        # Build stages from the output side (ch_0) upward, then reverse so
        # the forward pass runs high-channel -> low-channel.
        ch = config.ch_0
        for block_count in reversed(config.decoder_blocks_per_stage):
            next_ch = min(ch*2, config.ch_max)

            blocks.append(UpBlock(next_ch, ch, block_count))
            residuals.append(ChannelToSpace(next_ch, ch))

            ch = next_ch

        self.blocks = nn.ModuleList(reversed(blocks))
        self.residuals = nn.ModuleList(reversed(residuals))

        self.act_out = nn.SiLU()
        self.conv_out = SquareToLandscape(config.ch_0, config.channels)

    def forward(self, x):
        x = self.conv_in(x)
        for block, residual in zip(self.blocks, self.residuals):
            # Each stage sums the upsample path and its shortcut.
            x = block(x) + residual(x)
        x = self.act_out(x)
        return self.conv_out(x)
|
vae/diffusion_pytorch_model.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:ecdebf692a6b02610163948251dcf264c5793da12b3729986fb4a3e3c4dc4d1f
|
| 3 |
+
size 141887736
|
vae/model.py
ADDED
|
@@ -0,0 +1,142 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Copyright (C) 2025 Hugging Face Team and Overworld
|
| 2 |
+
#
|
| 3 |
+
# This program is free software: you can redistribute it and/or modify
|
| 4 |
+
# it under the terms of the GNU General Public License as published by
|
| 5 |
+
# the Free Software Foundation, either version 3 of the License, or
|
| 6 |
+
# (at your option) any later version.
|
| 7 |
+
#
|
| 8 |
+
# This program is distributed in the hope that it will be useful,
|
| 9 |
+
# but WITHOUT ANY WARRANTY; without even the implied warranty of
|
| 10 |
+
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
| 11 |
+
# GNU General Public License for more details.
|
| 12 |
+
#
|
| 13 |
+
# You should have received a copy of the GNU General Public License
|
| 14 |
+
# along with this program. If not, see <https://www.gnu.org/licenses/>.
|
| 15 |
+
|
| 16 |
+
"""VAE model for WorldEngine frame encoding/decoding."""
|
| 17 |
+
|
| 18 |
+
from dataclasses import dataclass
|
| 19 |
+
from typing import List, Tuple
|
| 20 |
+
|
| 21 |
+
import torch
|
| 22 |
+
from torch import Tensor
|
| 23 |
+
|
| 24 |
+
from diffusers.configuration_utils import ConfigMixin, register_to_config
|
| 25 |
+
from diffusers.models.modeling_utils import ModelMixin
|
| 26 |
+
from .dcae import Encoder, Decoder
|
| 27 |
+
|
| 28 |
+
|
| 29 |
+
@dataclass
class EncoderDecoderConfig:
    """Config object for Encoder/Decoder initialization."""

    sample_size: Tuple[int, int]  # (H, W) of input frames
    channels: int  # image channels (e.g. 3 for RGB)
    latent_channels: int  # channels of the latent tensor
    ch_0: int  # base channel width
    ch_max: int  # cap applied when channel width doubles per stage
    encoder_blocks_per_stage: List[int]  # ResBlocks per encoder stage
    decoder_blocks_per_stage: List[int]  # ResBlocks per decoder stage
    use_middle_block: bool  # NOTE(review): not read by dcae Encoder/Decoder shown here — confirm
    skip_logvar: bool = False  # when True, the encoder omits its logvar head
    skip_residuals: bool = False  # NOTE(review): unused in visible code — confirm
    normalize_mu: bool = False  # NOTE(review): unused in visible code — confirm
|
| 44 |
+
|
| 45 |
+
|
| 46 |
+
class WorldEngineVAE(ModelMixin, ConfigMixin):
    """
    VAE for encoding/decoding video frames using DCAE architecture.

    Encodes RGB uint8 images to latent space and decodes latents back to RGB.
    Inherits diffusers' ModelMixin/ConfigMixin so the config round-trips via
    `register_to_config`.
    """

    _supports_gradient_checkpointing = False

    @register_to_config
    def __init__(
        self,
        # Common parameters
        sample_size: Tuple[int, int] = (360, 640),  # (H, W) of input frames
        channels: int = 3,
        latent_channels: int = 16,
        # Encoder parameters
        encoder_ch_0: int = 64,
        encoder_ch_max: int = 256,
        encoder_blocks_per_stage: List[int] = None,  # None -> [1, 1, 1, 1]
        # Decoder parameters
        decoder_ch_0: int = 128,
        decoder_ch_max: int = 1024,
        decoder_blocks_per_stage: List[int] = None,  # None -> [1, 1, 1, 1]
        # Shared parameters
        use_middle_block: bool = False,
        skip_logvar: bool = False,
        # Scaling factors (kept in the config; not applied in this class)
        scale_factor: float = 1.0,
        shift_factor: float = 0.0,
    ):
        super().__init__()

        # Default blocks per stage
        if encoder_blocks_per_stage is None:
            encoder_blocks_per_stage = [1, 1, 1, 1]
        if decoder_blocks_per_stage is None:
            decoder_blocks_per_stage = [1, 1, 1, 1]

        # Create encoder config
        encoder_config = EncoderDecoderConfig(
            sample_size=tuple(sample_size),
            channels=channels,
            latent_channels=latent_channels,
            ch_0=encoder_ch_0,
            ch_max=encoder_ch_max,
            encoder_blocks_per_stage=list(encoder_blocks_per_stage),
            decoder_blocks_per_stage=list(decoder_blocks_per_stage),
            use_middle_block=use_middle_block,
            skip_logvar=skip_logvar,
        )

        # Create decoder config (shares block counts; differs in widths)
        decoder_config = EncoderDecoderConfig(
            sample_size=tuple(sample_size),
            channels=channels,
            latent_channels=latent_channels,
            ch_0=decoder_ch_0,
            ch_max=decoder_ch_max,
            encoder_blocks_per_stage=list(encoder_blocks_per_stage),
            decoder_blocks_per_stage=list(decoder_blocks_per_stage),
            use_middle_block=use_middle_block,
            skip_logvar=skip_logvar,
        )

        self.encoder = Encoder(encoder_config)
        self.decoder = Decoder(decoder_config)

    def encode(self, img: Tensor):
        """RGB -> RGB+D -> latent

        Args:
            img: [H, W, C] image with values in [0, 255].

        Returns:
            Encoder output for the batched (size-1) input.
        """
        assert img.dim() == 3, "Expected [H, W, C] image tensor"
        img = img.unsqueeze(0).to(device=self.device, dtype=self.dtype)
        # HWC -> CHW, then rescale [0, 255] -> [-1, 1].
        rgb = img.permute(0, 3, 1, 2).contiguous().div(255).mul(2).sub(1)
        return self.encoder(rgb)

    # Compiled for inference speed; call bake_weight_norm() on loaded weights
    # first, since weight_norm's in-place updates break CUDA graph capture.
    @torch.compile
    def decode(self, latent: Tensor):
        """Latent -> [H, W, 3] uint8 RGB image (assumes batch size 1)."""
        decoded = self.decoder(latent)
        # [-1, 1] -> [0, 1], then to uint8 [0, 255].
        decoded = (decoded / 2 + 0.5).clamp(0, 1)
        decoded = (decoded * 255).round().to(torch.uint8)
        # Drop batch dim, CHW -> HWC, keep first 3 channels.
        return decoded.squeeze(0).permute(1, 2, 0)[..., :3]

    def forward(self, x: Tensor, encode: bool = True) -> Tensor:
        """
        Forward pass - encode or decode based on flag.

        Args:
            x: Input tensor (image for encode, latent for decode)
            encode: If True, encode; if False, decode

        Returns:
            Encoded latent or decoded image
        """
        if encode:
            return self.encode(x)
        else:
            return self.decode(x)
|