lapp0 committed · Commit 93fe14e · verified · 1 Parent(s): 64746fd

Update README.md

Files changed (1): README.md (+88 −1)
Can be prompted with any number of starting frames and controls.

# Usage:

The simplest way to use Waypoint-1-Small is through [Biome](https://github.com/Overworldai/Biome) for local play, the [Overworld streaming client](https://www.overworld.stream/), or the Hugging Face hosted [Gradio Space](TODO).

To run the model locally, we recommend an NVIDIA RTX 5090, which should achieve 20-30 FPS, or an RTX 6000 Pro Blackwell, which should achieve ~35 FPS.
 
26
+ # Run with Diffusers Modular Pipelines
27
+
28
+ World Engine and Waypoint-1 can be used with [Diffusers Modular Pipelines](https://huggingface.co/docs/diffusers/main/en/api/modular_pipelines/modular_pipeline).
29
+
30
+ ## Setup
31
+
```bash
uv venv -p 3.11 && uv pip install \
  "torch>=2.9.0" \
  "diffusers>=0.36.0" \
  "transformers>=4.57.1" \
  "einops>=0.8.0" \
  "tensordict>=0.5.0" \
  regex \
  ftfy \
  imageio \
  imageio-ffmpeg \
  tqdm
```

Note that the version specifiers must be quoted, since an unquoted `>=` would be interpreted by the shell as output redirection.
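After installation, a quick sanity check can confirm that the pinned packages resolved to compatible versions before loading the pipeline. This is a hypothetical helper, not part of the Waypoint tooling; the version comparison is a naive numeric parse, not full PEP 440:

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions taken from the setup command above
REQUIRED = {
    "torch": "2.9.0",
    "diffusers": "0.36.0",
    "transformers": "4.57.1",
    "einops": "0.8.0",
    "tensordict": "0.5.0",
}

def parse(v: str) -> tuple:
    # Naive numeric parse of up to three leading components;
    # good enough for a quick check, not a full PEP 440 comparison.
    return tuple(int(p) for p in v.split(".")[:3] if p.isdigit())

def check(required=REQUIRED):
    # Map each package to "ok", "too old", or "missing"
    results = {}
    for pkg, minimum in required.items():
        try:
            installed = version(pkg)
        except PackageNotFoundError:
            results[pkg] = "missing"
            continue
        results[pkg] = "ok" if parse(installed) >= parse(minimum) else "too old"
    return results

if __name__ == "__main__":
    for pkg, status in check().items():
        print(f"{pkg}: {status}")
```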

## Usage Example
```python
import random
import torch

from tqdm import tqdm
from dataclasses import dataclass, field
from typing import Set, Tuple
from diffusers.modular_pipelines import ModularPipeline
from diffusers.utils import load_image, export_to_video


@dataclass
class CtrlInput:
    button: Set[int] = field(default_factory=set)  # pressed button IDs
    mouse: Tuple[float, float] = (0.0, 0.0)        # (x, y) velocity


# Sample random control trajectories
ctrl = lambda: random.choice(
    [
        CtrlInput(button={48, 42}, mouse=(0.4, 0.3)),
        CtrlInput(mouse=(0.1, 0.2)),
        CtrlInput(button={95, 32, 105}),
    ]
)

model_id = "Overworld/Waypoint-1-Small"

pipe = ModularPipeline.from_pretrained(model_id, trust_remote_code=True)
pipe.load_components(
    device_map="cuda", torch_dtype=torch.bfloat16, trust_remote_code=True
)
pipe.transformer.apply_inference_patches()

# Optional quantization step.
# Available options: nvfp4 (if running on Blackwell hardware), fp8, w8a8
# pipe.transformer.quantize("nvfp4")
pipe.transformer.compile(fullgraph=True, mode="max-autotune", dynamic=False)
pipe.vae.bake_weight_norm()
pipe.vae.compile(fullgraph=True, mode="max-autotune")

prompt = "A fun game"
image = load_image(
    "https://gist.github.com/user-attachments/assets/4adc5a3d-6980-4d1e-b6e8-9033cdf61c66"
)

num_frames = 240
outputs = []

# Create the world state from an initial image
state = pipe(prompt=prompt, image=image, button=ctrl().button, mouse=ctrl().mouse)
outputs.append(state.values["images"])

# Subsequent steps evolve the state; the seed image is no longer needed
state.values["image"] = None
for _ in tqdm(range(1, num_frames)):
    state = pipe(
        state,
        prompt=prompt,
        button=ctrl().button,
        mouse=ctrl().mouse,
        output_type="pil",
    )
    outputs.append(state.values["images"])

export_to_video(outputs, "waypoint-1-small.mp4", fps=60)
```
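The random `ctrl` sampler above is handy for smoke tests, but reproducible clips usually call for a scripted trajectory. A minimal sketch reusing the `CtrlInput` dataclass — the button IDs here are illustrative placeholders, not the model's actual keymap:

```python
from dataclasses import dataclass, field
from typing import Set, Tuple


@dataclass
class CtrlInput:
    button: Set[int] = field(default_factory=set)  # pressed button IDs
    mouse: Tuple[float, float] = (0.0, 0.0)        # (x, y) velocity


def scripted_trajectory(num_frames: int):
    """Yield one CtrlInput per frame: hold a button for the first half,
    then release and pan the camera for the second half."""
    for i in range(num_frames):
        if i < num_frames // 2:
            # hypothetical 'forward' button held for the first half
            yield CtrlInput(button={48})
        else:
            # release all buttons and pan the view to the right
            yield CtrlInput(mouse=(0.5, 0.0))


controls = list(scripted_trajectory(240))
# Each frame's controls would then be fed to the pipeline in the loop above:
#   state = pipe(state, prompt=prompt, button=c.button, mouse=c.mouse)
```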
# Keywords

To properly explain limitations and misuse, we must define some terms. While the model can be used for general interactive video generation tasks, we herein define interacting with the model by sending controls and receiving new frames as "playing" the model, and the agent or user inputting the controls as the "player". The model has two forms of output: continuations and generations. Continuations occur when seed frames are given and no inputs are given. For example, if a scene contains fire or water, you may see them evolve progressively in the generated frames even when no action is given; likewise, if you seed with an image of a humanoid entity, the entity will persist on screen as you move and look around. Generations, by contrast, occur when the player plays with the model extensively, for example by moving around, turning around fully, or interacting with objects and items. Roughly, continuations correspond to moving around information that already exists in the given context frames, while generations correspond to creating entirely new information.