Can be prompted with any number of starting frames and controls.
# Usage

In order to use Waypoint-1-Small, we recommend [Biome](https://github.com/Overworldai/Biome) for local use, the [Overworld streaming client](https://www.overworld.stream/), or the Hugging Face hosted [Gradio Space](TODO).

To run the model locally, we recommend an NVIDIA RTX 5090, which should achieve 20-30 FPS, or an RTX 6000 Pro Blackwell, which should achieve ~35 FPS.
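Those frame rates translate directly into a per-frame latency budget (1000 / FPS milliseconds); a quick back-of-the-envelope sketch:

```python
def frame_budget_ms(fps: float) -> float:
    """Milliseconds available per generated frame at a target frame rate."""
    return 1000.0 / fps

# At 20 FPS (RTX 5090, lower bound) each frame has a ~50 ms budget;
# at ~35 FPS (RTX 6000 Pro Blackwell) the budget is ~28.6 ms.
print(round(frame_budget_ms(20), 1))  # 50.0
print(round(frame_budget_ms(35), 1))  # 28.6
```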
# Run with Diffusers Modular Pipelines
World Engine and Waypoint-1 can be used with [Diffusers Modular Pipelines](https://huggingface.co/docs/diffusers/main/en/api/modular_pipelines/modular_pipeline).
## Setup

```bash
uv venv -p 3.11 && uv pip install \
  "torch>=2.9.0" \
  "diffusers>=0.36.0" \
  "transformers>=4.57.1" \
  "einops>=0.8.0" \
  "tensordict>=0.5.0" \
  regex \
  ftfy \
  imageio \
  imageio-ffmpeg \
  tqdm
```

## Usage Example

```python
import random
from dataclasses import dataclass, field
from typing import Set, Tuple

import torch
from tqdm import tqdm
from diffusers.modular_pipelines import ModularPipeline
from diffusers.utils import load_image, export_to_video


@dataclass
class CtrlInput:
    button: Set[int] = field(default_factory=set)  # pressed button IDs
    mouse: Tuple[float, float] = (0.0, 0.0)        # (x, y) velocity


# Generate random control trajectories
ctrl = lambda: random.choice(
    [
        CtrlInput(button={48, 42}, mouse=(0.4, 0.3)),
        CtrlInput(mouse=(0.1, 0.2)),
        CtrlInput(button={95, 32, 105}),
    ]
)

model_id = "Overworld/Waypoint-1-Small"

pipe = ModularPipeline.from_pretrained(model_id, trust_remote_code=True)
pipe.load_components(
    device_map="cuda", torch_dtype=torch.bfloat16, trust_remote_code=True
)
pipe.transformer.apply_inference_patches()

# Optional quantization step
# Available options are: nvfp4 (if running on Blackwell hardware), fp8, w8a8
# pipe.transformer.quantize("nvfp4")
pipe.transformer.compile(fullgraph=True, mode="max-autotune", dynamic=False)
pipe.vae.bake_weight_norm()
pipe.vae.compile(fullgraph=True, mode="max-autotune")

prompt = "A fun game"
image = load_image(
    "https://gist.github.com/user-attachments/assets/4adc5a3d-6980-4d1e-b6e8-9033cdf61c66"
)

num_frames = 240
outputs = []

# Create the world state from an initial image
state = pipe(prompt=prompt, image=image, button=ctrl().button, mouse=ctrl().mouse)
outputs.append(state.values["images"])

# Drop the seed image so subsequent steps continue from the world state
state.values["image"] = None
for _ in tqdm(range(1, num_frames)):
    state = pipe(
        state,
        prompt=prompt,
        button=ctrl().button,
        mouse=ctrl().mouse,
        output_type="pil",
    )
    outputs.append(state.values["images"])

export_to_video(outputs, "waypoint-1-small.mp4", fps=60)
```
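The example above samples controls at random. For reproducible rollouts you can instead script a fixed control trajectory and feed it to the pipeline step by step. A minimal, model-free sketch (it reuses the `CtrlInput` dataclass from the example; the button ID 32 and mouse velocities are illustrative, not a documented control mapping):

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class CtrlInput:
    button: Set[int] = field(default_factory=set)  # pressed button IDs
    mouse: Tuple[float, float] = (0.0, 0.0)        # (x, y) velocity


def scripted_trajectory(num_frames: int) -> List[CtrlInput]:
    """First half: hold a (hypothetical) forward button while panning right.
    Second half: release all buttons and slowly look up."""
    traj = []
    for i in range(num_frames):
        if i < num_frames // 2:
            traj.append(CtrlInput(button={32}, mouse=(0.2, 0.0)))
        else:
            traj.append(CtrlInput(mouse=(0.0, 0.1)))
    return traj


traj = scripted_trajectory(240)
# Then pass traj[i].button / traj[i].mouse to pipe(...) instead of ctrl()
```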

# Keywords

To properly explain limitations and misuse, we must first define some terms. While the model can be used for general interactive video generation tasks, we herein define interacting with the model by sending controls and receiving new frames as “playing” the model, and the agent/user inputting controls as the “player”. The model has two forms of output: continuations and generations. Continuations occur when seed frames are given and no inputs are sent. For example, if a scene contains fire or water, you may see them evolve progressively in the generated frames even if no action is given. Likewise, if you seed with an image of a humanoid entity, the entity will persist on screen as you move and look around. Generations, by contrast, occur when the player plays with the model extensively, for example moving around, turning around fully, or interacting with objects and items. Continuations roughly correspond to moving around information already present in the given context frames, while generations correspond to creating entirely new information.
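One illustrative (and unofficial) way to operationalize this distinction in tooling: treat a rollout with no player input as a continuation, and any rollout with sustained control input as a generation. A hedged sketch, reusing the `CtrlInput` shape from the usage example:

```python
from dataclasses import dataclass, field
from typing import Iterable, Set, Tuple


@dataclass
class CtrlInput:
    button: Set[int] = field(default_factory=set)  # pressed button IDs
    mouse: Tuple[float, float] = (0.0, 0.0)        # (x, y) velocity


def rollout_kind(controls: Iterable[CtrlInput]) -> str:
    """Classify a rollout: 'continuation' if the player sent no input at all,
    'generation' if any button press or mouse movement occurred."""
    active = any(c.button or c.mouse != (0.0, 0.0) for c in controls)
    return "generation" if active else "continuation"


print(rollout_kind([CtrlInput() for _ in range(10)]))  # continuation
print(rollout_kind([CtrlInput(mouse=(0.1, 0.0))]))     # generation
```

In practice the boundary is fuzzier than a boolean (a single small mouse nudge mostly re-renders existing context), but the binary split is enough for logging or filtering rollouts.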