anntnikita committed
Commit 41fb5d9 · verified · 1 Parent(s): d43bea6

Add Matrix Game app, requirements, and updated README

Files changed (3)
  1. README.md +74 -14
  2. app.py +248 -0
  3. requirements.txt +10 -0
README.md CHANGED
@@ -1,14 +1,74 @@
- ---
- title: Matrix Game Demo
- emoji:
- colorFrom: yellow
- colorTo: blue
- sdk: gradio
- sdk_version: 5.42.0
- app_file: app.py
- pinned: false
- license: mit
- short_description: Interactive demo for Matrix Game 2.0 model.
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Matrix Game 2.0 Interactive Demo
+
+ This folder contains a minimal but **fully‑working** example for running and interacting with the [Matrix‑Game 2.0](https://huggingface.co/Skywork/Matrix-Game-2.0) model. The goal of this demo is to expose the core mechanics of the model—turning a single input image and a sequence of user actions into a short video—behind a simple web interface.
+
+ > **❗️ Hardware requirements**
+ >
+ > Matrix‑Game 2.0 is a very large model (over 1.8B parameters) and was designed to run on datacenter GPUs like the NVIDIA A100 or H100. You can technically run it on a consumer GPU or even a CPU, but inference will be extremely slow and may run out of memory. For the best experience, launch this demo on a machine with at least 24 GiB of GPU VRAM. The code will gracefully fall back to CPU execution if no GPU is available, but expect generation to take minutes per frame on a CPU.
+
+ ## Setup
+
+ 1. Create a fresh Python environment (Python 3.10+ is recommended):
+
+    ```bash
+    python -m venv .venv
+    source .venv/bin/activate
+    ```
+
+ 2. Install the dependencies listed in `requirements.txt`:
+
+    ```bash
+    pip install --upgrade pip
+    pip install -r requirements.txt
+    ```
+
+ 3. Log in to the Hugging Face Hub using your own access token. You can either export it as an environment variable or pass it directly to the application. To export it, replace `YOUR_HF_TOKEN` with a valid token generated from your Hugging Face account:
+
+    ```bash
+    export HF_TOKEN="YOUR_HF_TOKEN"
+    ```
+
+    Alternatively, pass the token to the Hugging Face CLI directly:
+
+    ```bash
+    huggingface-cli login --token YOUR_HF_TOKEN
+    ```
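+
+    If you prefer not to keep the token in your shell environment, you can also pass it in code; the `MatrixGame` wrapper defined in `app.py` (see below) accepts an `hf_token` constructor argument. A minimal sketch:
+
+    ```python
+    from app import MatrixGame
+
+    # Pass the token explicitly instead of relying on the HF_TOKEN variable.
+    # Note: instantiating the wrapper downloads the model weights.
+    game = MatrixGame(hf_token="YOUR_HF_TOKEN")
+    ```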
+
+ 4. Launch the interactive demo:
+
+    ```bash
+    python app.py
+    ```
+
+ The first time you run the script it will download several gigabytes of model weights from Hugging Face. Subsequent runs will reuse the cached files.
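+
+ If you would rather pre-fetch the weights before the first launch, a short sketch using `huggingface_hub` should work; the repo id below matches the `MODEL_ID` constant in `app.py`:
+
+ ```python
+ from huggingface_hub import snapshot_download
+
+ # Downloads the model repo into the local Hugging Face cache (or reuses it).
+ snapshot_download(repo_id="Skywork/Matrix-Game-2.0")
+ ```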
+
+ ## Usage
+
+ Once the Gradio interface starts, open the local URL printed in the terminal (by default http://127.0.0.1:7860) and follow these steps:
+
+ 1. **Select an input image** – this image acts as the first frame of the generated video. The model expects images with a 16∶9 aspect ratio. You can upload any photo (for example, a screenshot from a game or a view you want to explore).
+
+ 2. **Choose the number of frames** you want to generate. The demo's slider allows between 4 and 32 frames (roughly two seconds of video at the 15 fps the demo writes out). Longer videos require more memory and compute.
+
+ 3. **Click “Generate Video”**. The model will synthesize a sequence of frames conditioned on your chosen image. You can watch the result directly in the browser or download the MP4 file to view it offline.
+
+ ### Action control
+
+ Matrix‑Game 2.0 normally accepts keyboard and mouse actions at each time step to steer the camera within the scene. The simplified interface provided here does not expose those controls directly—primarily because real‑time interaction requires high‑frequency communication with the model that cannot be reliably handled in a browser without significant latency.
+
+ However, the `generate_frames` method of the underlying `MatrixGame` class is a natural place to add optional `mouse` and `keyboard` tensors (representing camera and movement commands), as sketched below. Feel free to modify the UI to add your own custom controls if you would like to experiment with full action‑conditioned generation.
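+
+ A minimal sketch of what that experiment might look like. The tensor shapes and the idea that the loaded pipeline accepts `mouse` and `keyboard` keyword arguments are assumptions for illustration; consult the official Matrix‑Game repository for the real action format:
+
+ ```python
+ import torch
+ from PIL import Image
+
+ from app import MatrixGame
+
+ game = MatrixGame()  # reads HF_TOKEN from the environment
+ first_frame = Image.open("scene.png")  # any 16:9 image (hypothetical file name)
+
+ num_frames = 16
+ # Hypothetical action tensors: a mouse delta (dx, dy) and a 4-way movement
+ # one-hot (W, A, S, D) for each generated frame.
+ mouse = torch.zeros(num_frames, 2)
+ keyboard = torch.zeros(num_frames, 4)
+ keyboard[:, 0] = 1.0  # hold "W" to move forward for the whole clip
+
+ # Today generate_frames ignores actions; extending it would mean forwarding
+ # the tensors into the pipeline call, e.g.
+ #   self.pipeline(image, num_frames=num_frames, mouse=mouse, keyboard=keyboard)
+ frames = game.generate_frames(first_frame, num_frames=num_frames)
+ ```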
+
+ ## Project structure
+
+ ```
+ matrix_game_app/
+ ├── README.md          – this file
+ ├── requirements.txt   – minimal Python dependencies
+ └── app.py             – entry point for the interactive demo
+ ```
+
+ ## Notes
+
+ This project is intentionally lightweight. It does not attempt to replicate the full training or inference pipeline from the official Matrix‑Game repository. Instead, it leverages the `diffusers` integration for the model to provide a quick way to run inference. If you need the full streaming inference pipeline with mouse/keyboard injection (as available in the original repo), please clone [SkyworkAI/Matrix‑Game](https://github.com/SkyworkAI/Matrix-Game) and follow the instructions in its `README.md`.
+
+ Finally, please be aware that the model weights are released under the MIT license. Make sure you adhere to the license terms when redistributing or using the model.
app.py ADDED
@@ -0,0 +1,248 @@
+ """
+ app.py
+ ======
+
+ This script exposes a simple web interface for the Matrix‑Game 2.0 model via
+ Gradio. Given an initial image, the model produces a short video that
+ continues the scene forward in time. The code uses the diffusers library to
+ download and load the model from Hugging Face. It automatically selects CPU
+ or GPU based on availability.
+
+ To run this script you must have installed the dependencies in
+ `requirements.txt` and logged in to the Hugging Face Hub using your access
+ token. You can set the token at runtime via the `HF_TOKEN` environment
+ variable or by passing it into the constructor of the `MatrixGame` class.
+
+ Note: generating videos with Matrix‑Game 2.0 is computationally intensive and
+ requires a machine with significant memory. On a CPU the generation may be
+ very slow. For best results use a GPU with at least 24 GiB of VRAM.
+ """
+
+ from __future__ import annotations
+
+ import os
+ import tempfile
+ from typing import List, Optional
+
+ import numpy as np
+ from PIL import Image
+
+ import torch
+
+ from huggingface_hub import login
+
+ try:
+     # A dedicated video auto-pipeline is not shipped by every diffusers
+     # release, so guard the import; the generic loader below is the fallback.
+     from diffusers import AutoPipelineForVideo  # type: ignore
+ except Exception:
+     AutoPipelineForVideo = None  # type: ignore
+
+ try:
+     # DiffusionPipeline is the generic loader available in all recent
+     # diffusers versions; it picks the pipeline class from the model config.
+     from diffusers import DiffusionPipeline
+ except Exception:
+     DiffusionPipeline = None  # type: ignore
+
+ try:
+     import gradio as gr
+ except Exception:
+     gr = None  # type: ignore
+
+ try:
+     from moviepy.editor import ImageSequenceClip  # moviepy 1.x layout
+ except Exception:
+     try:
+         from moviepy import ImageSequenceClip  # moviepy 2.x removed `editor`
+     except Exception:
+         ImageSequenceClip = None  # type: ignore
+
+
+ class MatrixGame:
+     """Wrapper around the Matrix‑Game 2.0 model.
+
+     This class handles logging in to Hugging Face, downloading the model,
+     selecting the appropriate device and performing video generation. It
+     currently supports the universal mode, which uses the base distilled
+     model weights. Real‑time interactive control with mouse and keyboard
+     inputs is possible but not exposed through the Gradio UI.
+     """
+
+     MODEL_ID: str = "Skywork/Matrix-Game-2.0"
+
+     def __init__(self, hf_token: Optional[str] = None, *, mode: str = "universal"):
+         self.mode = mode
+         self.hf_token = hf_token or os.environ.get("HF_TOKEN")
+         if not self.hf_token:
+             raise ValueError(
+                 "A Hugging Face token must be provided either via the HF_TOKEN "
+                 "environment variable or the hf_token argument."
+             )
+         # Authenticate with Hugging Face. This call is idempotent; if you're
+         # already logged in it does nothing.
+         login(token=self.hf_token, add_to_git_credential=False)
+
+         # Select the compute device: GPU if available, otherwise CPU.
+         self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+         # Use a lower-precision dtype on GPU to save memory.
+         if self.device.type == "cuda":
+             self.dtype = torch.float16
+         else:
+             self.dtype = torch.float32
+
+         # Load the pipeline. Try `AutoPipelineForVideo` first if the installed
+         # diffusers version provides it; otherwise fall back to the generic
+         # `DiffusionPipeline` loader, which resolves the concrete pipeline
+         # class from the model's configuration.
+         pipeline = None
+         if AutoPipelineForVideo is not None:
+             try:
+                 pipeline = AutoPipelineForVideo.from_pretrained(
+                     self.MODEL_ID,
+                     torch_dtype=self.dtype,
+                     variant="fp16" if self.dtype == torch.float16 else None,
+                     token=self.hf_token,
+                 )
+             except Exception as e:
+                 print(f"AutoPipelineForVideo failed to load: {e}")
+
+         if pipeline is None and DiffusionPipeline is not None:
+             try:
+                 pipeline = DiffusionPipeline.from_pretrained(
+                     self.MODEL_ID,
+                     torch_dtype=self.dtype,
+                     variant="fp16" if self.dtype == torch.float16 else None,
+                     token=self.hf_token,
+                 )
+             except Exception as e:
+                 print(f"DiffusionPipeline failed to load: {e}")
+
+         if pipeline is None:
+             raise RuntimeError(
+                 "Could not load a video pipeline for Matrix‑Game 2.0. Please "
+                 "ensure diffusers is up to date (>=0.33) and that your token "
+                 "has access to the model repository."
+             )
+
+         self.pipeline = pipeline.to(self.device)
+
+     def generate_frames(self, image: Image.Image, num_frames: int = 8) -> List[Image.Image]:
+         """Generate a sequence of frames given an initial image.
+
+         Args:
+             image: A PIL.Image that will act as the first frame of the video.
+             num_frames: The number of frames to generate (including the input).
+
+         Returns:
+             A list of PIL.Image objects representing the generated video frames.
+         """
+         # Normalize the input image. The diffusers pipelines handle resizing
+         # internally, but explicitly converting to RGB ensures consistent
+         # results.
+         if not isinstance(image, Image.Image):
+             raise ValueError("Input must be a PIL.Image")
+
+         image = image.convert("RGB")
+         # Some pipelines support passing `num_frames` directly to control the
+         # video length; others ignore the argument and use a default (the
+         # Matrix‑Game model natively produces 16 frames per call), so the
+         # result is truncated to the requested length below. Autocast is only
+         # enabled on CUDA because CPU autocast does not support float32.
+         with torch.autocast(
+             self.device.type, dtype=self.dtype, enabled=self.device.type == "cuda"
+         ):
+             result = self.pipeline(image, num_frames=num_frames)
+
+         # The result usually carries the frames in a `frames` attribute; some
+         # diffusers versions return a dict with a "frames" key instead, and
+         # some return a batch (a list of per-video frame lists).
+         frames = getattr(result, "frames", None)
+         if frames is None and isinstance(result, dict):
+             frames = result.get("frames")
+         if frames is None:
+             raise RuntimeError("Unexpected output format from the pipeline")
+         if frames and isinstance(frames[0], (list, tuple)):
+             # Batched output: take the first (and only) video in the batch.
+             frames = frames[0]
+         # Limit to the requested number of frames if more were produced.
+         return list(frames)[:num_frames]
+
+     def frames_to_video(self, frames: List[Image.Image], fps: int = 15) -> str:
+         """Convert a list of frames into a temporary MP4 file.
+
+         Args:
+             frames: A list of PIL images.
+             fps: Frames per second for the output video.
+
+         Returns:
+             The file path to the generated MP4 video.
+         """
+         if ImageSequenceClip is None:
+             raise ImportError(
+                 "moviepy is required to assemble videos. Please install it with "
+                 "`pip install moviepy` or use an alternative method."
+             )
+         # Convert the PIL images to uint8 numpy arrays, which is what
+         # ImageSequenceClip expects.
+         arrays = [np.array(frame) for frame in frames]
+         clip = ImageSequenceClip(arrays, fps=fps)
+         # Write to a temporary file. `logger=None` silences the progress bar;
+         # the deprecated `verbose` argument is omitted so this also works with
+         # moviepy 2.x.
+         tmp_dir = tempfile.mkdtemp(prefix="matrix_game_")
+         video_path = os.path.join(tmp_dir, "output.mp4")
+         clip.write_videofile(video_path, codec="libx264", audio=False, logger=None)
+         return video_path
+
+
+ def launch_interface():
+     """Launch a Gradio interface for Matrix‑Game 2.0."""
+     if gr is None:
+         raise ImportError(
+             "Gradio is not installed. Please install it with `pip install gradio`."
+         )
+     # Instantiate the model wrapper once; this downloads the weights
+     # automatically on first use. We read the token from the environment. If
+     # you prefer you can hard-code the token here, but be mindful of security
+     # best practices.
+     hf_token = os.environ.get("HF_TOKEN")
+     if not hf_token:
+         raise RuntimeError(
+             "Please set the HF_TOKEN environment variable to your Hugging Face "
+             "access token before launching the interface."
+         )
+     matrix_game = MatrixGame(hf_token=hf_token)
+
+     def generate_fn(image: Image.Image, num_frames: int) -> str:
+         """Callback invoked by Gradio to generate a video file."""
+         frames = matrix_game.generate_frames(image, num_frames=int(num_frames))
+         video_path = matrix_game.frames_to_video(frames, fps=15)
+         return video_path
+
+     with gr.Blocks() as demo:
+         gr.Markdown(
+             """
+             # Matrix‑Game 2.0 Demo
+
+             Upload an image and choose how many frames to generate. The model
+             will synthesize a short video that extends the scene forward in
+             time. Note that generation may take several minutes on machines
+             without high‑end GPUs.
+             """
+         )
+         with gr.Row():
+             with gr.Column():
+                 image_input = gr.Image(type="pil", label="Initial Frame")
+                 num_frames = gr.Slider(
+                     minimum=4,
+                     maximum=32,
+                     step=1,
+                     value=16,
+                     label="Number of Frames",
+                     info="Total frames in the generated video (including the initial frame)",
+                 )
+                 generate_btn = gr.Button("Generate Video")
+             with gr.Column():
+                 video_output = gr.Video(label="Generated Video", interactive=False)
+
+         generate_btn.click(
+             fn=generate_fn,
+             inputs=[image_input, num_frames],
+             outputs=video_output,
+         )
+
+     demo.launch()
+
+
+ if __name__ == "__main__":
+     launch_interface()
requirements.txt ADDED
@@ -0,0 +1,10 @@
+ torch>=2.1
+ diffusers>=0.33.0
+ huggingface_hub>=0.20.0
+ gradio>=4.0
+ numpy>=1.21
+ Pillow>=9.2
+ moviepy>=1.0
+ omegaconf>=2.3
+ einops>=0.7
+ safetensors>=0.3