---
license: apache-2.0
language:
- en
tags:
- WM
- Diffusion
- Egocentric
---

Waypoint-1-Small is a 2.3 billion parameter control-and-text-conditioned causal diffusion model. It uses a transformer architecture with rectified flow, distilled via self-forcing with DMD. The model autoregressively generates new frames given historical frames, actions, and text.

# Capabilities

- Generates worlds in real time on high-end consumer hardware
- Allows for exploration of and interaction with worlds via control inputs
- Allows for guidance of the generated world via text prompts
- Can be prompted with any number of starting frames and controls

# Usage

To use Waypoint-1-Small without writing code, we recommend [Biome](https://github.com/Overworldai/Biome) for local use, the [Overworld streaming client](https://www.overworld.stream/), or the Hugging Face hosted [Gradio Space](TODO).

To run the model locally, we recommend an NVIDIA RTX 5090, which should achieve 20-30 FPS, or an RTX 6000 Pro Blackwell, which should achieve ~35 FPS.

# Run with Diffusers Modular Pipelines

World Engine and Waypoint-1 can be used with [Diffusers Modular Pipelines](https://huggingface.co/docs/diffusers/main/en/api/modular_pipelines/modular_pipeline).

## Setup

```bash
uv venv -p 3.11 && uv pip install \
    "torch>=2.9.0" \
    "diffusers>=0.36.0" \
    "transformers>=4.57.1" \
    "einops>=0.8.0" \
    "tensordict>=0.5.0" \
    regex \
    ftfy \
    imageio \
    imageio-ffmpeg \
    tqdm
```
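After installing, you may want to confirm the pinned minimum versions are actually met before loading the model. A minimal sketch using `importlib.metadata` (the version-comparison helper below is illustrative, not part of any package, and ignores pre-release suffixes):

```python
from importlib.metadata import version, PackageNotFoundError

# Pinned minimums from the install command above
MINIMUMS = {
    "torch": "2.9.0",
    "diffusers": "0.36.0",
    "transformers": "4.57.1",
    "einops": "0.8.0",
    "tensordict": "0.5.0",
}

def as_tuple(v: str) -> tuple:
    """Parse a dotted version string into a comparable tuple of ints.

    Local version tags (e.g. '2.9.0+cu121') are stripped first.
    """
    v = v.split("+")[0]
    return tuple(int(part) for part in v.split(".")[:3] if part.isdigit())

def meets_minimum(installed: str, required: str) -> bool:
    return as_tuple(installed) >= as_tuple(required)

for pkg, req in MINIMUMS.items():
    try:
        ok = meets_minimum(version(pkg), req)
    except PackageNotFoundError:
        ok = False
    print(f"{pkg}: {'OK' if ok else 'MISSING/OUTDATED'}")
```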

## Usage Example

```python
import random
import torch

from tqdm import tqdm
from dataclasses import dataclass, field
from typing import Set, Tuple
from diffusers.modular_pipelines import ModularPipeline
from diffusers.utils import load_image, export_to_video

@dataclass
class CtrlInput:
    button: Set[int] = field(default_factory=set)  # pressed button IDs
    mouse: Tuple[float, float] = (0.0, 0.0)  # (x, y) velocity


# Generate random control trajectories
ctrl = lambda: random.choice(
    [
        CtrlInput(button={48, 42}, mouse=(0.4, 0.3)),
        CtrlInput(mouse=(0.1, 0.2)),
        CtrlInput(button={95, 32, 105}),
    ]
)
model_id = "Overworld/Waypoint-1-Small"

pipe = ModularPipeline.from_pretrained(model_id, trust_remote_code=True)
pipe.load_components(
    device_map="cuda", torch_dtype=torch.bfloat16, trust_remote_code=True
)
pipe.transformer.apply_inference_patches()

# Optional Quantization Step
# Available options are: nvfp4 (if running on Blackwell hardware), fp8, w8a8
# pipe.transformer.quantize("nvfp4")
pipe.transformer.compile(fullgraph=True, mode="max-autotune", dynamic=False)
pipe.vae.bake_weight_norm()
pipe.vae.compile(fullgraph=True, mode="max-autotune")

prompt = "A fun game"
image = load_image(
    "https://gist.github.com/user-attachments/assets/4adc5a3d-6980-4d1e-b6e8-9033cdf61c66"
)

num_frames = 240
outputs = []

# create world state based on an initial image
c = ctrl()  # sample one control so button and mouse come from the same input
state = pipe(prompt=prompt, image=image, button=c.button, mouse=c.mouse)
outputs.append(state.values["images"])

state.values["image"] = None
for _ in tqdm(range(1, num_frames)):
    c = ctrl()
    state = pipe(
        state,
        prompt=prompt,
        button=c.button,
        mouse=c.mouse,
        output_type="pil",
    )
    outputs.append(state.values["images"])

export_to_video(outputs, "waypoint-1-small.mp4", fps=60)
```
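Random controls are handy for smoke tests, but playthroughs and evaluations usually need deterministic trajectories. A minimal sketch building one with the same `CtrlInput` shape used above (the button ID below is a placeholder borrowed from the random example, not a documented control mapping):

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

@dataclass
class CtrlInput:
    button: Set[int] = field(default_factory=set)  # pressed button IDs
    mouse: Tuple[float, float] = (0.0, 0.0)        # (x, y) camera velocity

def scripted_trajectory(walk_frames: int, pan_frames: int) -> List[CtrlInput]:
    """Hold an assumed 'move forward' button, then pan the camera right."""
    walk = [CtrlInput(button={32}) for _ in range(walk_frames)]  # 32: placeholder ID
    pan = [CtrlInput(mouse=(0.5, 0.0)) for _ in range(pan_frames)]
    return walk + pan

trajectory = scripted_trajectory(walk_frames=120, pan_frames=60)

# Feed one CtrlInput per pipeline step instead of sampling ctrl():
# for c in trajectory:
#     state = pipe(state, prompt=prompt, button=c.button, mouse=c.mouse, output_type="pil")
```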

# Keywords

To properly explain limitations and misuse we must define some terms. While the model can be used for general interactive video generation tasks, we herein define interacting with the model by sending controls and receiving new frames as "playing" the model, and the agent or user inputting controls as the "player".

The model has two forms of output: continuations and generations. Continuations occur when seed frames are given and no inputs are given. For example, if a scene has fire or water, you may see them evolve progressively in the generated frames even if no action is given. Likewise, if you seed with an image of a humanoid entity, the entity will persist on screen as you move and look around.

Generations, by contrast, occur when the player plays with the model extensively, for example moving around, turning around fully, or interacting with objects and items. Continuations roughly correspond to moving around already existing information in the given context frames, while generations correspond to creating entirely new information.

# Limitations

Continuations can plausibly model any inputted scene or photo, and quality depends largely on the seed frame given. For generations, the model may occasionally:

- Ignore the given text prompt
- Ignore certain controls in specific contexts
- Fail to generate realistic text or interactive HUD/UI elements
- Fail to generate human/animal entities
- Fail to generate realistic motion for given entities
- Fail to generate faces

Prompt adherence is heavily dependent on prompting strategy.

# Out of Scope Usage

The model and its derivatives must not be used:

- For harassment or bullying
- For the purpose of exploiting or harming minors in any way
- For simulating extremely violent acts
- For generating violent/gory video
- For facilitation of large-scale disinformation campaigns
- For the purpose of generating any sexually explicit or suggestive material