BonanDing committed
Commit 681f346 · 1 Parent(s): 8652b14

Reproduce Training & Fix distributed eval
README.md CHANGED
@@ -1,201 +1,66 @@
-<br>
-<p align="center">
-
-  <p align="center">
-    <img src="assets/worldmem_logo.png" alt="WORLDMEM Icon" width="80"/>
-  </p>
-  <h1 align="center"><strong>WorldMem: Long-term Consistent World Simulation <br> with Memory</strong></h1>
-  <p align="center"><span><a href=""></a></span>
-    <a href="https://xizaoqu.github.io">Zeqi Xiao<sup>1</sup></a>
-    <a href="https://nirvanalan.github.io/">Yushi Lan<sup>1</sup></a>
-    <a href="https://zhouyifan.net/about/">Yifan Zhou<sup>1</sup></a>
-    <a href="https://vicky0522.github.io/Wenqi-Ouyang/">Wenqi Ouyang<sup>1</sup></a>
-    <a href="https://williamyang1991.github.io/">Shuai Yang<sup>2</sup></a>
-    <a href="https://zengyh1900.github.io/">Yanhong Zeng<sup>3</sup></a>
-    <a href="https://xingangpan.github.io/">Xingang Pan<sup>1</sup></a> <br>
-    <sup>1</sup>S-Lab, Nanyang Technological University, <br> <sup>2</sup>Wangxuan Institute of Computer Technology, Peking University,<br> <sup>3</sup>Shanghai AI Laboratory
-  </p>
-</p>
-
-<p align="center">
-  <a href="https://arxiv.org/abs/2504.12369" target='_blank'>
-    <img src="https://img.shields.io/badge/arXiv-2504.12369-blue?">
-  </a>
-  <a href="https://xizaoqu.github.io/worldmem/" target='_blank'>
-    <img src="https://img.shields.io/badge/Project-&#x1F680-blue">
-  </a>
-  <a href="https://huggingface.co/spaces/yslan/worldmem" target="_blank">
-    <img src="https://img.shields.io/badge/🤗 HuggingFace-Demo-orange" />
-  </a>
-</p>
-
-https://github.com/user-attachments/assets/fb8a32e2-9470-4819-a93d-c38caf76d72c
-
-## Installation
-
-```
-conda create python=3.10 -n worldmem
-conda activate worldmem
-pip install -r requirements.txt
-conda install -c conda-forge ffmpeg=4.3.2
-```
-
-## Quick start
-
-```
-python app.py
-```
-
-## Run
-
-To enable cloud logging with [Weights & Biases (wandb)](https://wandb.ai/site), follow these steps:
-
-1. Sign up for a wandb account.
-2. Run the following command to log in:
-
-```bash
-wandb login
-```
-
-3. Open `configurations/training.yaml` and set the `entity` and `project` fields to your wandb username.
-
----
-
-### Training
-
-Download pretrained weights from [Oasis](https://github.com/etched-ai/open-oasis).
-
-Trained on 4 H100 GPUs, the model converges after approximately 500K steps.
-We observe that gradually increasing task difficulty improves performance, so we adopt a multi-stage training strategy:
-
-```bash
-sh train_stage_1.sh # Small range, no vertical turning
-sh train_stage_2.sh # Large range, no vertical turning
-sh train_stage_3.sh # Large range, with vertical turning
-```
-
-To resume training from a previous checkpoint, configure the `resume` and `output_dir` variables in the corresponding `.sh` script.
-
----
-
-### Inference
-
-To run inference:
-
-```bash
-sh infer.sh
-```
-
-You can either **load the diffusion model and VAE separately**:
-
-```bash
-+diffusion_model_path=zeqixiao/worldmem_checkpoints/diffusion_only.ckpt \
-+vae_path=zeqixiao/worldmem_checkpoints/vae_only.ckpt \
-+customized_load=true \
-+seperate_load=true \
-```
-
-Or **load a combined checkpoint**:
-
-```bash
-+load=your_model_path \
-+customized_load=true \
-+seperate_load=false \
-```
-
-### Evaluation
-
-To run evaluation:
-
-```bash
-sh evaluate.sh
-```
-
-This script reproduces the results in Table 1 (beyond the context window). It reports PSNR and LPIPS. Evaluating one case on one A100 GPU takes approximately 6 minutes. You can adjust `experiment.test.limit_batch` to set the number of cases to evaluate.
-
-Visual results are saved by default to a timestamped directory (e.g., `outputs/2025-11-30/00-02-42`).
-
-To calculate the FID score, run:
-
-```bash
-python calculate_fid.py --videos_dir <path_to_videos>
-```
-
-For example:
-
-```bash
-python calculate_fid.py --videos_dir outputs/2025-11-30/00-02-42/videos/test_vis
-```
-
-**Expected Results:**
-
-| Metric | Value  |
-|--------|--------|
-| PSNR   | 24.01  |
-| LPIPS  | 0.1667 |
-| FID    | 15.13  |
-
-*Note: FID is computed over 5000 frames.*
-
----
-
-## Dataset
-
-Download the Minecraft dataset from [Hugging Face](https://huggingface.co/datasets/zeqixiao/worldmem_minecraft_dataset).
-
-Place the dataset in the following directory structure:
-
-```
-data/
-└── minecraft/
-    ├── training/
-    ├── validation/
-    └── test/
-```
-
-## Data Generation
-
-After setting up the environment as described in [MineDojo's GitHub repository](https://github.com/MineDojo/MineDojo), you can generate data using the following command:
-
-```bash
-xvfb-run -a python data_generator.py -o data/test -z 4 --env_type plains
-```
-
-**Parameters:**
-- `-o`: Output directory for generated data
-- `-z`: Number of parallel workers
-- `--env_type`: Environment type (e.g., `plains`)
-
-## TODO
-
-- [x] Release inference models and weights;
-- [x] Release training pipeline on Minecraft;
-- [x] Release training data on Minecraft;
-- [x] Release evaluation scripts and data generator.
-
-## 🔗 Citation
-
-If you find our work helpful, please cite:
-
-```
-@misc{xiao2025worldmemlongtermconsistentworld,
-      title={WORLDMEM: Long-term Consistent World Simulation with Memory},
-      author={Zeqi Xiao and Yushi Lan and Yifan Zhou and Wenqi Ouyang and Shuai Yang and Yanhong Zeng and Xingang Pan},
-      year={2025},
-      eprint={2504.12369},
-      archivePrefix={arXiv},
-      primaryClass={cs.CV},
-      url={https://arxiv.org/abs/2504.12369},
-}
-```
-
-## 👏 Acknowledgements
-- [Diffusion Forcing](https://github.com/buoyancy99/diffusion-forcing): Diffusion Forcing provides flexible training and inference strategies for our method.
-- [MineDojo](https://github.com/MineDojo/MineDojo): We collect our Minecraft dataset with MineDojo.
-- [Open-oasis](https://github.com/etched-ai/open-oasis): Our model architecture is based on Open-oasis. We also use its pretrained VAE and DiT weights.
+# WorldMem
+
+Long-term consistent world simulation with memory.
+
+## Environment (conda)
+
+```bash
+conda create -n worldmem python=3.10
+conda activate worldmem
+pip install -r requirements.txt
+conda install -c conda-forge ffmpeg=4.3.2
+```
+
+## Data preparation (data folder)
+
+1. Download the Minecraft dataset:
+   https://huggingface.co/datasets/zeqixiao/worldmem_minecraft_dataset
+2. Place it under `data/` with this structure:
+
+```text
+data/
+└── minecraft/
+    ├── training/
+    ├── validation/
+    └── test/
+```
+
+The training and evaluation scripts expect the dataset to live at `data/minecraft` by default.
+
+## Training
+
+Run a single stage:
+
+```bash
+sh train_stage_1.sh
+sh train_stage_2.sh
+sh train_stage_3.sh
+```
+
+Run all stages:
+
+```bash
+sh train_3stages.sh
+```
+
+The stage scripts include dataset and checkpoint paths. Update those paths or override them on the CLI to match your local setup.
+
+## Training config (exp_video.yaml)
+
+Defaults live in `configurations/experiment/exp_video.yaml`.
+
+Common fields to edit:
+- `training.lr`
+- `training.precision`
+- `training.batch_size`
+- `training.max_steps`
+- `training.checkpointing.every_n_train_steps`
+- `validation.val_every_n_step`
+- `validation.batch_size`
+- `test.batch_size`
+
+You can also override these values from the CLI used in the scripts:
+
+```bash
+python -m main +name=train experiment.training.batch_size=8 experiment.training.max_steps=100000
+```
algorithms/worldmem/df_base.py CHANGED
@@ -33,6 +33,7 @@ class DiffusionForcingBase(BasePytorchAlgo):
         self.action_cond_dim = cfg.action_cond_dim
         self.causal = cfg.causal
 
+
         self.uncertainty_scale = cfg.uncertainty_scale
         self.timesteps = cfg.diffusion.timesteps
         self.sampling_timesteps = cfg.diffusion.sampling_timesteps
algorithms/worldmem/df_video.py CHANGED
@@ -3,6 +3,7 @@ import random
 import math
 import numpy as np
 import torch
+import torch.distributed as dist
 import torch.nn.functional as F
 import torchvision.transforms.functional as TF
 from torchvision.transforms import InterpolationMode
@@ -21,6 +22,7 @@ from .models.vae import VAE_models
 from .models.diffusion import Diffusion
 from .models.pose_prediction import PosePredictionNet
 import glob
+import wandb
 
 # Utility Functions
 def euler_to_rotation_matrix(pitch, yaw):
@@ -376,7 +378,8 @@ class WorldMemMinecraft(DiffusionForcingBase):
             ref_mode=self.ref_mode
         )
 
-        self.validation_lpips_model = LearnedPerceptualImagePatchSimilarity()
+        # Avoid distributed sync inside torchmetrics; reduce metrics manually across ranks.
+        self.validation_lpips_model = LearnedPerceptualImagePatchSimilarity(sync_on_compute=False)
         vae = VAE_models["vit-l-20-shallow-encoder"]()
         self.vae = vae.eval()
 
@@ -430,13 +433,13 @@ class WorldMemMinecraft(DiffusionForcingBase):
                         focal_length=self.focal_length,
                         image_height=xs.shape[-2], image_width=xs.shape[-1]
                     ).to(xs.dtype)
-                )
+                )  # [V(1 + memory_condition_length), B, H, W, 6]
                 frame_idx_list.append(
                     torch.cat([
                         frame_idx[i:i + 1] - frame_idx[i:i + 1],
                         frame_idx[-self.memory_condition_length:] - frame_idx[i:i + 1]
                     ]).clone()
-                )
+                )  # [V(1 + memory_condition_length), B] (0 for the current frame; memory frames carry indices relative to the current frame)
             input_pose_condition = torch.cat(input_pose_condition)
             frame_idx_list = torch.cat(frame_idx_list)
         else:
@@ -476,66 +479,78 @@ class WorldMemMinecraft(DiffusionForcingBase):
         return {"loss": loss}
 
     def on_validation_epoch_end(self, namespace="validation") -> None:
-        if not self.validation_step_outputs:
-            return
-
-        xs_pred = []
-        xs = []
-        for pred, gt in self.validation_step_outputs:
-            xs_pred.append(pred)
-            xs.append(gt)
-
-        xs_pred = torch.cat(xs_pred, 1)
-        if gt is not None:
-            xs = torch.cat(xs, 1)
-        else:
-            xs = None
-
-        if self.logger and self.log_video:
-            log_video(
-                xs_pred,
-                xs,
-                step=None if namespace == "test" else self.global_step,
-                namespace=namespace + "_vis",
-                context_frames=self.context_frames,
-                logger=self.logger.experiment,
-                save_local=self.save_local,
-                local_save_dir=self.local_save_dir,
-            )
-
-        if xs is not None:
-            # Move data to the same device as the LPIPS model for metric calculation
-            device = next(self.validation_lpips_model.parameters()).device
-            xs_pred_device = xs_pred.to(device)
-            xs_device = xs.to(device)
-
-            metric_dict = get_validation_metrics_for_videos(
-                xs_pred_device, xs_device,
-                lpips_model=self.validation_lpips_model,
-                lpips_batch_size=self.lpips_batch_size)
-
-            self.log_dict(
-                {"mse": metric_dict['mse'],
-                 "psnr": metric_dict['psnr'],
-                 "lpips": metric_dict['lpips']},
-                sync_dist=True
-            )
-
-            if self.log_curve:
-                psnr_values = metric_dict['frame_wise_psnr'].cpu().tolist()
-                frames = list(range(len(psnr_values)))
-                line_plot = wandb.plot.line_series(
-                    xs=frames,
-                    ys=[psnr_values],
-                    keys=["PSNR"],
-                    title="Frame-wise PSNR",
-                    xname="Frame index"
-                )
-
-                self.logger.experiment.log({"frame_wise_psnr_plot": line_plot})
-
+        if not hasattr(self, "_metric_device"):
+            return
+
+        if dist.is_available() and dist.is_initialized():
+            for tensor in (
+                self._mse_sum,
+                self._mse_count,
+                self._psnr_sum,
+                self._psnr_count,
+                self._lpips_sum,
+                self._lpips_count,
+            ):
+                dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
+
+        mse = self._mse_sum / self._mse_count.clamp_min(1.0)
+        psnr = self._psnr_sum / self._psnr_count.clamp_min(1.0)
+        lpips = self._lpips_sum / self._lpips_count.clamp_min(1.0)
+
+        if self.trainer is None or self.trainer.is_global_zero:
+            if self._mse_count.item() > 0:
+                self.log_dict(
+                    {"mse": mse, "psnr": psnr, "lpips": lpips},
+                    sync_dist=False,
+                )
+
         self.validation_step_outputs.clear()
 
+    def on_validation_epoch_start(self) -> None:
+        self._reset_metric_accumulators()
+
+    def on_test_epoch_start(self) -> None:
+        self._reset_metric_accumulators()
+
+    def _reset_metric_accumulators(self) -> None:
+        self._metric_device = next(self.validation_lpips_model.parameters()).device
+        self._mse_sum = torch.tensor(0.0, device=self._metric_device)
+        self._mse_count = torch.tensor(0.0, device=self._metric_device)
+        self._psnr_sum = torch.tensor(0.0, device=self._metric_device)
+        self._psnr_count = torch.tensor(0.0, device=self._metric_device)
+        self._lpips_sum = torch.tensor(0.0, device=self._metric_device)
+        self._lpips_count = torch.tensor(0.0, device=self._metric_device)
+
+    def _update_metric_accumulators(self, xs_pred: torch.Tensor, xs_gt: torch.Tensor) -> None:
+        xs_pred_device = xs_pred.to(self._metric_device)
+        xs_device = xs_gt.to(self._metric_device)
+
+        metric_dict = get_validation_metrics_for_videos(
+            xs_pred_device,
+            xs_device,
+            lpips_model=self.validation_lpips_model,
+            lpips_batch_size=self.lpips_batch_size,
+        )
+
+        mse_val = metric_dict["mse"].detach()
+        psnr_val = metric_dict["psnr"].detach()
+        lpips_val = torch.tensor(metric_dict["lpips"], device=self._metric_device)
+
+        mse_count_batch = torch.tensor(float(xs_pred_device.numel()), device=self._metric_device)
+        psnr_count_batch = torch.tensor(float(xs_pred_device.shape[1]), device=self._metric_device)
+        lpips_count_batch = torch.tensor(
+            float(xs_pred_device.shape[0] * xs_pred_device.shape[1]), device=self._metric_device
+        )
+
+        self._mse_sum += mse_val * mse_count_batch
+        self._psnr_sum += psnr_val * psnr_count_batch
+        self._lpips_sum += lpips_val * lpips_count_batch
+        self._mse_count += mse_count_batch
+        self._psnr_count += psnr_count_batch
+        self._lpips_count += lpips_count_batch
+
+        del xs_pred_device, xs_device
+
     def _preprocess_batch(self, batch):
 
         xs, conditions, pose_conditions, frame_index = batch
@@ -554,7 +569,7 @@ class WorldMemMinecraft(DiffusionForcingBase):
         return xs, conditions, pose_conditions, c2w_mat, frame_index
 
     def encode(self, x):
-        # vae encoding
+        # vae encoding; x has shape (t, b, c, h, w)
        T = x.shape[0]
        H, W = x.shape[-2:]
        scaling_factor = 0.07843137255
@@ -783,8 +798,21 @@ class WorldMemMinecraft(DiffusionForcingBase):
         xs_pred = self.decode(xs_pred[n_context_frames:].to(conditions.device))
         xs_decode = self.decode(xs[n_context_frames:].to(conditions.device))
 
-        # Store results for evaluation (move to CPU to save GPU memory)
-        self.validation_step_outputs.append((xs_pred.detach().cpu(), xs_decode.detach().cpu()))
+        # Save videos for every batch (rank is encoded in filenames).
+        if self.logger and self.log_video:
+            log_video(
+                xs_pred,
+                xs_decode,
+                step=batch_idx,
+                namespace=namespace + "_vis",
+                context_frames=self.context_frames,
+                logger=self.logger.experiment,
+                save_local=self.save_local,
+                local_save_dir=self.local_save_dir,
+            )
+
+        # Stream metrics to avoid holding all outputs in memory.
+        self._update_metric_accumulators(xs_pred, xs_decode)
         return
 
     @torch.no_grad()
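The change above replaces torchmetrics' built-in cross-rank sync with manual sum/count accumulators that are `all_reduce`d once per epoch, so partial means merge exactly regardless of per-rank batch counts. A minimal standalone sketch of that pattern (class and variable names here are illustrative, not the repository's API):

```python
import torch
import torch.distributed as dist

class MeanAccumulator:
    """Streams a weighted mean as (sum, count) so ranks can be merged exactly."""

    def __init__(self, device="cpu"):
        self.total = torch.tensor(0.0, device=device)
        self.count = torch.tensor(0.0, device=device)

    def update(self, batch_mean: torch.Tensor, batch_count: float) -> None:
        # Store sum = mean * count; summing sums is exact, averaging averages is not.
        self.total += batch_mean.detach() * batch_count
        self.count += batch_count

    def compute(self) -> torch.Tensor:
        # In multi-process runs, sum partial totals across ranks first;
        # single-process runs skip the collective entirely.
        if dist.is_available() and dist.is_initialized():
            dist.all_reduce(self.total, op=dist.ReduceOp.SUM)
            dist.all_reduce(self.count, op=dist.ReduceOp.SUM)
        return self.total / self.count.clamp_min(1.0)

acc = MeanAccumulator()
acc.update(torch.tensor(2.0), batch_count=4)  # e.g. mean metric over 4 frames
acc.update(torch.tensor(5.0), batch_count=1)  # mean over 1 frame
print(float(acc.compute()))  # (2*4 + 5*1) / 5 = 2.6
```

Note the `clamp_min(1.0)` guard mirrors the committed code: it keeps the division finite on ranks that saw no batches.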
algorithms/worldmem/models/diffusion.py CHANGED
@@ -169,7 +169,7 @@ class Diffusion(nn.Module):
             mode=mode, reference_length=reference_length, frame_idx=frame_idx)
         model_output = model_output.permute(1,0,2,3,4)
         x = x.permute(1,0,2,3,4)
-        t = t.permute(1,0)
+        t = t.permute(1,0)
 
         if self.objective == "pred_noise":
             pred_noise = torch.clamp(model_output, -self.clip_noise, self.clip_noise)
configurations/experiment/base_pytorch.yaml CHANGED
@@ -35,7 +35,8 @@ validation:
   val_every_n_step: 2000 # controls how frequently we run validation; can be float (fraction of epochs), int (steps), or null (if val_every_n_epoch is set)
   val_every_n_epoch: null # if you want to run validation every n epochs; requires val_every_n_step to be null.
   limit_batch: null # if null, run through the validation set. Otherwise limit the number of batches used for validation.
-  inference_mode: True # whether to run validation in inference mode (enable_grad won't work!)
+  # inference_mode: True # whether to run validation in inference mode (enable_grad won't work!)
+  inference_mode: False # whether to run validation in inference mode (enable_grad won't work!)
   data:
     num_workers: 4 # number of CPU threads for data preprocessing, for validation.
     shuffle: False # whether validation data will be shuffled
@@ -45,6 +46,7 @@ test:
   compile: False # whether to compile the model with torch.compile
   batch_size: 4 # test batch size per GPU; effective batch size is this number * gpus * nodes iff using distributed training
   limit_batch: null # if null, run through the test set. Otherwise limit the number of batches used for test.
+  inference_mode: False # whether to run test in inference mode (enable_grad won't work!)
   data:
     num_workers: 4 # number of CPU threads for data preprocessing, for test.
     shuffle: False # whether test data will be shuffled
configurations/experiment/exp_video.yaml CHANGED
@@ -7,6 +7,7 @@ training:
   lr: 2e-5
   precision: 16-mixed
   batch_size: 4
+  # batch_size: 8
   max_epochs: -1
   max_steps: 2000005
   checkpointing:
configurations/training.yaml CHANGED
@@ -8,7 +8,7 @@ defaults:
 debug: false # global debug flag will be passed into the configuration of experiment, dataset and algorithm
 
 wandb:
-  entity: xizaoqu # wandb account name / organization name [fixme]
+  entity: turlin # wandb account name / organization name [fixme]
   project: worldmem # wandb project name; if not provided, defaults to root folder name [fixme]
   mode: online # set wandb logging to online, offline or dryrun
datasets/video/base_video_dataset.py CHANGED
@@ -47,6 +47,7 @@ class BaseVideoDataset(torch.utils.data.Dataset, ABC):
         self.clips_per_video = np.clip(np.array(self.metadata) - self.n_frames + 1, a_min=1, a_max=None).astype(
             np.int32
         )
+
         self.cum_clips_per_video = np.cumsum(self.clips_per_video)
         self.transform = transforms.Resize((self.resolution, self.resolution), antialias=True)
datasets/video/minecraft_video_dataset.py CHANGED
@@ -126,7 +126,7 @@ class MinecraftVideoDataset(BaseVideoDataset):
             try:
                 return self.load_data(idx)
             except Exception as e:
-                # print(f"Retrying due to error: {e}")
+                print(f"Retrying due to error: {e}")
                 idx = (idx + 1) % len(self)
 
     def load_data(self, idx):
@@ -184,9 +184,9 @@ class MinecraftVideoDataset(BaseVideoDataset):
         dis = np.abs(poses[:, None] - poses_pool[None, :])
         dis[..., 3:][dis[..., 3:] > 180] = 360 - dis[..., 3:][dis[..., 3:] > 180]
 
-        spatial_match = (dis[..., :3] <= self.pos_range).sum(-1) >= 3
-        angular_match = (dis[..., 3:] <= self.angle_range).sum(-1) >= 2
-        not_exact_match = ((dis[..., :3] > 0).sum(-1) >= 1) | ((dis[..., 3:] > 0).sum(-1) >= 1)
+        spatial_match = (dis[..., :3] <= self.pos_range).sum(-1) >= 3  # X, Y, Z axes all within range
+        angular_match = (dis[..., 3:] <= self.angle_range).sum(-1) >= 2  # pitch and yaw both within range
+        not_exact_match = ((dis[..., :3] > 0).sum(-1) >= 1) | ((dis[..., 3:] > 0).sum(-1) >= 1)  # at least one axis differs
 
         valid_index = (spatial_match & angular_match & not_exact_match).sum(0)
         valid_index[:100] = 0 # skip unstable early frames
@@ -237,7 +237,7 @@ class MinecraftVideoDataset(BaseVideoDataset):
         timestamp = np.arange(self.n_frames)
 
         # === 7. Convert video to torch format ===
-        video = torch.from_numpy(video / 255.0).float().permute(0, 3, 1, 2).contiguous()
+        video = torch.from_numpy(video / 255.0).float().permute(0, 3, 1, 2).contiguous()  # (T, H, W, C) -> (T, C, H, W)
 
         # === 9. Return all items ===
         return (
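The pose-matching lines annotated above rely on NumPy broadcasting: `poses[:, None] - poses_pool[None, :]` yields an (N, M, 5) tensor of pairwise distances over (x, y, z, pitch, yaw), angles are wrapped to [0, 180], and three boolean reductions combine into the final mask. A toy sketch of the same logic, with made-up `pos_range` / `angle_range` thresholds and example poses:

```python
import numpy as np

# Illustrative thresholds, not the dataset's configured values.
pos_range, angle_range = 2.0, 30.0

poses = np.array([[0.0, 0.0, 0.0, 10.0, 350.0]])        # (N, 5): x, y, z, pitch, yaw
poses_pool = np.array([[1.0, 0.5, -1.0, 20.0, 5.0],     # close in position, yaw wraps to 15°
                       [10.0, 0.0, 0.0, 10.0, 350.0]])  # 10 units away along x

dis = np.abs(poses[:, None] - poses_pool[None, :])      # (N, M, 5) pairwise distances
dis[..., 3:][dis[..., 3:] > 180] = 360 - dis[..., 3:][dis[..., 3:] > 180]  # wrap angles

spatial_match = (dis[..., :3] <= pos_range).sum(-1) >= 3    # all of x, y, z within range
angular_match = (dis[..., 3:] <= angle_range).sum(-1) >= 2  # pitch and yaw within range
not_exact = ((dis[..., :3] > 0).sum(-1) >= 1) | ((dis[..., 3:] > 0).sum(-1) >= 1)

print(spatial_match & angular_match & not_exact)  # [[ True False]]
```

The `not_exact` term excludes a frame matching itself (zero distance on every axis), which is why the dataset's `not_exact_match` mask requires at least one strictly positive difference.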
evaluate.sh CHANGED
@@ -1,15 +1,27 @@
 export PYTHONWARNINGS="ignore"
+export CUDA_VISIBLE_DEVICES=4,5,6,7
+
+# export NCCL_DEBUG=INFO
+# export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
+# export TORCH_DISTRIBUTED_DEBUG=DETAIL
+# export NCCL_DEBUG_SUBSYS=COLL
+# # Optional but very helpful while debugging (slower):
+# export TORCH_NCCL_BLOCKING_WAIT=1
+export NCCL_TIMEOUT=7200
+export NCCL_P2P_DISABLE=1
+export HYDRA_FULL_ERROR=1
+
 wandb offline
 python -m main +name=infer \
     experiment.tasks=[test] \
     dataset.validation_multiplier=1 \
     +dataset.seed=42 \
-    +diffusion_model_path=zeqixiao/worldmem_checkpoints/diffusion_only.ckpt \
-    +vae_path=zeqixiao/worldmem_checkpoints/vae_only.ckpt \
+    +diffusion_model_path=/share_1/users/bonan_ding/worldmem_ckpt/diffusion_only.ckpt \
+    +vae_path=/share_1/users/bonan_ding/worldmem_ckpt/vae_only.ckpt \
     +customized_load=true \
     +seperate_load=true \
     dataset.n_frames=8 \
-    dataset.save_dir=data/minecraft \
+    dataset.save_dir=/share_1/users/bonan_ding/worldmem_data/minecraft \
     +dataset.n_frames_valid=700 \
     algorithm.diffusion.sampling_timesteps=20 \
     +algorithm.memory_condition_length=8 \
@@ -20,4 +32,4 @@ python -m main +name=infer \
     +algorithm.n_tokens=8 \
     algorithm.context_frames=600 \
     experiment.test.batch_size=1 \
-    experiment.test.limit_batch=10 \
+    experiment.test.limit_batch=160 \
experiments/exp_base.py CHANGED
@@ -9,7 +9,7 @@ from abc import ABC, abstractmethod
 from typing import Optional, Union, Literal, List, Dict
 import pathlib
 import os
-
+from datetime import timedelta
 import hydra
 import torch
 from lightning.pytorch.strategies.ddp import DDPStrategy
@@ -415,10 +415,11 @@ class BaseLightningExperiment(BaseExperiment):
             logger=self.logger,
             devices="auto",
             num_nodes=self.cfg.num_nodes,
-            strategy=DDPStrategy(find_unused_parameters=False) if torch.cuda.device_count() > 1 else "auto",
+            strategy=DDPStrategy(find_unused_parameters=False, timeout=timedelta(hours=1)) if torch.cuda.device_count() > 1 else "auto",
             callbacks=callbacks,
             limit_test_batches=self.cfg.test.limit_batch,
             precision=self.cfg.test.precision,
+            inference_mode=self.cfg.test.inference_mode,
             detect_anomaly=False,  # self.cfg.debug,
         )
infer.sh CHANGED
@@ -1,14 +1,21 @@
 export PYTHONWARNINGS="ignore"
+export NCCL_DEBUG=INFO
+export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
+export TORCH_DISTRIBUTED_DEBUG=DETAIL
+export NCCL_DEBUG_SUBSYS=COLL
+# Optional but very helpful while debugging (slower):
+export TORCH_NCCL_BLOCKING_WAIT=1
+export NCCL_P2P_DISABLE=1
 wandb offline
 python -m main +name=infer \
     experiment.tasks=[validation] \
     dataset.validation_multiplier=1 \
-    +diffusion_model_path=zeqixiao/worldmem_checkpoints/diffusion_only.ckpt \
-    +vae_path=zeqixiao/worldmem_checkpoints/vae_only.ckpt \
+    +diffusion_model_path=/share_1/users/bonan_ding/worldmem_ckpt/diffusion_only.ckpt \
+    +vae_path=/share_1/users/bonan_ding/worldmem_ckpt/vae_only.ckpt \
     +customized_load=true \
     +seperate_load=true \
     dataset.n_frames=8 \
-    dataset.save_dir=data/minecraft \
+    dataset.save_dir=/share_1/users/bonan_ding/worldmem_data/minecraft \
     +dataset.n_frames_valid=700 \
     +dataset.memory_condition_length=8 \
     +dataset.customized_validation=true \
main.py CHANGED
@@ -59,6 +59,10 @@ def run_local(cfg: DictConfig):
     OmegaConf.set_readonly(hydra_cfg, True)
 
     output_dir = Path(hydra_cfg.runtime.output_dir)
+    if not output_dir.exists():
+        output_dir.mkdir(parents=True, exist_ok=True)
+        if is_rank_zero:
+            print(cyan(f"Created output directory: {output_dir}"))
 
     if is_rank_zero:
         print(cyan(f"Outputs will be saved to:"), output_dir)
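The directory creation added above uses `Path.mkdir(parents=True, exist_ok=True)`, which is idempotent and safe when several ranks race to create the same output directory. A self-contained sketch under a throwaway temp path (the nested path here is illustrative, not the project's real `output_dir`):

```python
import tempfile
from pathlib import Path

base = Path(tempfile.mkdtemp())
output_dir = base / "outputs" / "2025-11-30" / "00-02-42"

# parents=True creates the intermediate directories; exist_ok=True makes a
# repeated call (or a concurrent call from another rank) a harmless no-op.
output_dir.mkdir(parents=True, exist_ok=True)
output_dir.mkdir(parents=True, exist_ok=True)  # second call does nothing

print(output_dir.is_dir())  # True
```

Without `exist_ok=True`, the second call (or the loser of a multi-rank race) would raise `FileExistsError`.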
requirements.txt CHANGED
@@ -1,28 +1,139 @@
- torch~=2.4.0
- torchvision~=0.19.1
- lightning~=2.1.2
- wandb~=0.17.0
- hydra-core~=1.3.2
- omegaconf~=2.3.0
- torchmetrics[image]==0.11.4
- wandb-osh==1.2.1
- gluonts[torch]==0.13.1
- pytorchvideo~=0.1.5
- colorama
- tqdm
- opencv-python
- matplotlib
- click
+ aiofiles==23.2.1
+ aiohappyeyeballs==2.6.1
+ aiohttp==3.13.3
+ aiosignal==1.4.0
+ altair==5.5.0
+ annotated-doc==0.0.4
+ antlr4-python3-runtime==4.9.3
+ anyio==4.12.1
+ async-timeout==5.0.1
+ attrs==25.4.0
+ av==16.1.0
+ certifi==2026.1.4
+ charset-normalizer==3.4.4
+ click==8.3.1
+ colorama==0.4.6
+ colorlog==6.10.1
+ contourpy==1.3.2
+ cycler==0.12.1
+ decorator==4.4.2
+ diffusers==0.36.0
+ docker-pycreds==0.4.0
+ einops==0.8.1
+ exceptiongroup==1.3.1
+ fastapi==0.125.0
+ ffmpy==1.0.0
+ filelock==3.20.3
+ fonttools==4.61.1
+ frozenlist==1.8.0
+ fsspec==2024.12.0
+ fvcore==0.1.5.post20221221
+ gitdb==4.0.12
+ GitPython==3.1.46
+ gluonts==0.13.1
+ gradio==3.50.2
+ gradio_client==0.6.1
+ h11==0.16.0
+ h5py==3.15.1
+ hf-xet==1.2.0
+ httpcore==1.0.9
+ httpx==0.28.1
+ huggingface_hub==1.3.2
+ hydra-core==1.3.2
+ idna==3.11
+ ImageIO==2.37.2
+ imageio-ffmpeg==0.6.0
+ importlib_metadata==8.7.1
+ importlib_resources==6.5.2
+ internetarchive==5.7.1
+ iopath==0.1.10
+ Jinja2==3.1.6
+ jsonpatch==1.33
+ jsonpointer==3.0.0
+ jsonschema==4.26.0
+ jsonschema-specifications==2025.9.1
+ kiwisolver==1.4.9
+ lightning==2.1.4
+ lightning-utilities==0.15.2
+ lpips==0.1.4
+ MarkupSafe==2.1.5
+ matplotlib==3.10.8
  moviepy==1.0.3
- imageio
- einops
- pandas
- pyzmq
- pyrealsense2
- internetarchive
- h5py
- rotary_embedding_torch
- diffusers
- timm
- gradio
- spaces
+ mpmath==1.3.0
+ multidict==6.7.0
+ narwhals==2.15.0
+ networkx==3.4.2
+ numpy==1.26.4
+ nvidia-cublas-cu12==12.1.3.1
+ nvidia-cuda-cupti-cu12==12.1.105
+ nvidia-cuda-nvrtc-cu12==12.1.105
+ nvidia-cuda-runtime-cu12==12.1.105
+ nvidia-cudnn-cu12==9.1.0.70
+ nvidia-cufft-cu12==11.0.2.54
+ nvidia-curand-cu12==10.3.2.106
+ nvidia-cusolver-cu12==11.4.5.107
+ nvidia-cusparse-cu12==12.1.0.106
+ nvidia-nccl-cu12==2.20.5
+ nvidia-nvjitlink-cu12==12.9.86
+ nvidia-nvtx-cu12==12.1.105
+ omegaconf==2.3.0
+ opencv-python==4.11.0.86
+ orjson==3.11.5
+ packaging==24.2
+ pandas==2.3.3
+ parameterized==0.9.0
+ pillow==10.4.0
+ platformdirs==4.5.1
+ portalocker==3.2.0
+ proglog==0.1.12
+ propcache==0.4.1
+ protobuf==3.19.6
+ psutil==5.9.8
+ pydantic==1.10.26
+ pydub==0.25.1
+ pyparsing==3.3.1
+ pyrealsense2==2.56.5.9235
+ python-dateutil==2.9.0.post0
+ python-multipart==0.0.21
+ pytorch-lightning==2.6.0
+ pytorchvideo==0.1.5
+ pytz==2025.2
+ PyYAML==6.0.3
+ pyzmq==27.1.0
+ referencing==0.37.0
+ regex==2026.1.15
+ requests==2.32.5
+ rotary-embedding-torch==0.8.9
+ rpds-py==0.30.0
+ safetensors==0.7.0
+ scipy==1.15.3
+ semantic-version==2.10.0
+ sentry-sdk==2.49.0
+ setproctitle==1.3.7
+ shellingham==1.5.4
+ six==1.17.0
+ smmap==5.0.2
+ spaces==0.46.0
+ starlette==0.50.0
+ sympy==1.14.0
+ tabulate==0.9.0
+ termcolor==3.3.0
+ timm==1.0.24
+ toolz==0.12.1
+ torch==2.4.1
+ torch-fidelity==0.3.0
+ torchmetrics==0.11.4
+ torchvision==0.19.1
+ tqdm==4.67.1
+ triton==3.0.0
+ typer-slim==0.21.1
+ typing_extensions==4.15.0
+ tzdata==2025.3
+ urllib3==2.6.3
+ uvicorn==0.40.0
+ wandb==0.17.9
+ wandb_osh==1.2.1
+ websockets==11.0.3
+ yacs==0.1.8
+ yarl==1.22.0
+ zipp==3.23.0
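The rewritten requirements file pins exact versions (`==`) in place of the old loose specifiers (`~=` or none at all), which is what makes the reproduction environment deterministic. A quick standalone check that every line is an exact pin (illustrative, not part of the repo):

```python
import re

# Accept "name==version" lines; names may carry extras such as "[torch]".
PIN_RE = re.compile(r"^[A-Za-z0-9._-]+(\[[A-Za-z0-9,_-]+\])?==\S+$")

def unpinned(lines):
    """Return requirement lines that are not exact '==' pins."""
    reqs = [ln.strip() for ln in lines if ln.strip() and not ln.startswith("#")]
    return [r for r in reqs if not PIN_RE.match(r)]

old = ["torch~=2.4.0", "colorama", "moviepy==1.0.3"]
new = ["torch==2.4.1", "colorama==0.4.6", "moviepy==1.0.3"]
print(unpinned(old))  # → ['torch~=2.4.0', 'colorama']
print(unpinned(new))  # → []
```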
train_3stages.sh ADDED
@@ -0,0 +1,80 @@
+ wandb enabled
+ export CUDA_VISIBLE_DEVICES=0,1,2,3
+ export NCCL_P2P_DISABLE=1
+ # export HYDRA_FULL_ERROR=1
+ 
+ set -e           # Exit on any error
+ set -o pipefail  # Exit on pipe failures
+ 
+ # Stage 1
+ python -m main +name=train \
+ +diffusion_model_path=/share_1/users/bonan_ding/worldmem_ckpt/diffusion_only.ckpt \
+ +vae_path=/share_1/users/bonan_ding/worldmem_ckpt/vae_only.ckpt \
+ +customized_load=true \
+ +seperate_load=true \
+ +zero_init_gate=true \
+ dataset.n_frames=8 \
+ dataset.save_dir=/share_1/users/bonan_ding/worldmem_data/minecraft \
+ +dataset.n_frames_valid=700 \
+ +dataset.angle_range=110 \
+ +dataset.pos_range=2 \
+ +dataset.memory_condition_length=8 \
+ +dataset.customized_validation=true \
+ +dataset.add_timestamp_embedding=true \
+ +dataset.wo_updown=true \
+ +algorithm.n_tokens=8 \
+ +algorithm.memory_condition_length=8 \
+ algorithm.context_frames=600 \
+ +algorithm.relative_embedding=true \
+ +algorithm.log_video=true \
+ +algorithm.add_timestamp_embedding=true \
+ +algorithm.metrics=[lpips,psnr] \
+ experiment.training.checkpointing.every_n_train_steps=2500 \
+ experiment.training.max_steps=120000 \
+ +output_dir=/share_1/users/bonan_ding/worldmem_ckpt/reproduce_official_set \
+ 
+ # Stage 2
+ python -m main +name=train \
+ dataset.n_frames=8 \
+ dataset.save_dir=data/minecraft \
+ +dataset.n_frames_valid=700 \
+ +dataset.angle_range=110 \
+ +dataset.pos_range=8 \
+ +dataset.memory_condition_length=8 \
+ +dataset.customized_validation=true \
+ +dataset.add_timestamp_embedding=true \
+ +dataset.wo_updown=true \
+ +algorithm.n_tokens=8 \
+ +algorithm.memory_condition_length=8 \
+ algorithm.context_frames=600 \
+ +algorithm.relative_embedding=true \
+ +algorithm.log_video=true \
+ +algorithm.add_timestamp_embedding=true \
+ +algorithm.metrics=[lpips,psnr] \
+ experiment.training.checkpointing.every_n_train_steps=2500 \
+ resume=ot7jqmgn \
+ +output_dir=/share_1/users/bonan_ding/worldmem_ckpt/reproduce_official_set \
+ experiment.training.max_steps=240000
+ 
+ # Stage 3
+ python -m main +name=train \
+ dataset.n_frames=8 \
+ dataset.save_dir=data/minecraft \
+ +dataset.n_frames_valid=700 \
+ +dataset.angle_range=110 \
+ +dataset.pos_range=8 \
+ +dataset.memory_condition_length=8 \
+ +dataset.customized_validation=true \
+ +dataset.add_timestamp_embedding=true \
+ +dataset.wo_updown=false \
+ +algorithm.n_tokens=8 \
+ +algorithm.memory_condition_length=8 \
+ algorithm.context_frames=600 \
+ +algorithm.relative_embedding=true \
+ +algorithm.log_video=true \
+ +algorithm.add_timestamp_embedding=true \
+ +algorithm.metrics=[lpips,psnr] \
+ experiment.training.checkpointing.every_n_train_steps=2500 \
+ resume=ot7jqmgn \
+ +output_dir=/share_1/users/bonan_ding/worldmem_ckpt/reproduce_official_set \
+ experiment.training.max_steps=700000
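The `set -e` / `set -o pipefail` pair at the top of this script is what lets the three stages run unattended: without `pipefail`, a failure on the left side of a pipe is masked by the last command in the pipe. A standalone sketch of the difference, driven through `bash -c` from Python (illustrative, not part of the repo):

```python
import subprocess

def run(cmd: str) -> subprocess.CompletedProcess:
    """Run a snippet under bash and capture its output."""
    return subprocess.run(["bash", "-c", cmd], capture_output=True, text=True)

# Without pipefail the pipeline's status comes from `true`,
# so `set -e` does not stop the script.
print(run('set -e; false | true; echo "masked, rc=$?"').stdout.strip())
# → masked, rc=0

# With pipefail the pipeline fails and `set -e` aborts before the echo.
r = run('set -e -o pipefail; false | true; echo "not reached"')
print(r.returncode != 0, repr(r.stdout))  # → True ''
```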
train_stage_1.sh CHANGED
@@ -1,14 +1,16 @@
  wandb enabled
- 
+ export CUDA_VISIBLE_DEVICES=0,1,2,3
+ export NCCL_P2P_DISABLE=1
+ # export HYDRA_FULL_ERROR=1
  # set -e
  python -m main +name=train \
- +diffusion_model_path=your_diffusion_model_path \
- +vae_path=your_vae_path \
+ +diffusion_model_path=/share_1/users/bonan_ding/worldmem_ckpt/diffusion_only.ckpt \
+ +vae_path=/share_1/users/bonan_ding/worldmem_ckpt/vae_only.ckpt \
  +customized_load=true \
  +seperate_load=true \
  +zero_init_gate=true \
  dataset.n_frames=8 \
- dataset.save_dir=data/minecraft \
+ dataset.save_dir=/share_1/users/bonan_ding/worldmem_data/minecraft \
  +dataset.n_frames_valid=700 \
  +dataset.angle_range=110 \
  +dataset.pos_range=2 \
@@ -22,8 +24,7 @@ python -m main +name=train \
  +algorithm.relative_embedding=true \
  +algorithm.log_video=true \
  +algorithm.add_timestamp_embedding=true \
- algorithm.metrics=[lpips,psnr] \
+ +algorithm.metrics=[lpips,psnr] \
  experiment.training.checkpointing.every_n_train_steps=2500 \
- experiment.training.max_steps=120000
- 
- 
+ experiment.training.max_steps=120000 \
+ +output_dir=/share_1/users/bonan_ding/worldmem_ckpt/reproduce_official_set \
train_stage_2.sh CHANGED
@@ -1,9 +1,11 @@
  wandb enabled
- 
- # set -e
+ export CUDA_VISIBLE_DEVICES=4,5,6,7
+ export NCCL_P2P_DISABLE=1
+ # export HYDRA_FULL_ERROR=1
+ set -e
  python -m main +name=train \
  dataset.n_frames=8 \
- dataset.save_dir=data/minecraft \
+ dataset.save_dir=/share_1/users/bonan_ding/worldmem_data/minecraft \
  +dataset.n_frames_valid=700 \
  +dataset.angle_range=110 \
  +dataset.pos_range=8 \
@@ -17,9 +19,31 @@ python -m main +name=train \
  +algorithm.relative_embedding=true \
  +algorithm.log_video=true \
  +algorithm.add_timestamp_embedding=true \
- algorithm.metrics=[lpips,psnr] \
+ +algorithm.metrics=[lpips,psnr] \
  experiment.training.checkpointing.every_n_train_steps=2500 \
- resume=your_wandb_job_id e.g.yhht29bz \
- +output_dir=your_saving_path e.g. outputs/2025-05-18/15-16-32 \
+ resume=ot7jqmgn \
+ +output_dir=/share_1/users/bonan_ding/worldmem_ckpt/reproduce_official_set \
  experiment.training.max_steps=240000
  
+ # Stage 3
+ python -m main +name=train \
+ dataset.n_frames=8 \
+ dataset.save_dir=/share_1/users/bonan_ding/worldmem_data/minecraft \
+ +dataset.n_frames_valid=700 \
+ +dataset.angle_range=110 \
+ +dataset.pos_range=8 \
+ +dataset.memory_condition_length=8 \
+ +dataset.customized_validation=true \
+ +dataset.add_timestamp_embedding=true \
+ +dataset.wo_updown=false \
+ +algorithm.n_tokens=8 \
+ +algorithm.memory_condition_length=8 \
+ algorithm.context_frames=600 \
+ +algorithm.relative_embedding=true \
+ +algorithm.log_video=true \
+ +algorithm.add_timestamp_embedding=true \
+ +algorithm.metrics=[lpips,psnr] \
+ experiment.training.checkpointing.every_n_train_steps=2500 \
+ resume=ot7jqmgn \
+ +output_dir=/share_1/users/bonan_ding/worldmem_ckpt/reproduce_official_set \
+ experiment.training.max_steps=700000
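All of these command-line arguments are Hydra overrides in dotted-path syntax: `dataset.n_frames=8` changes a key that already exists in the base config, while a leading `+` (as in `+dataset.pos_range=8`) adds a key the config does not yet have. A tiny pure-Python sketch of that semantics (not Hydra's actual implementation):

```python
def apply_override(cfg, override, allow_new=False):
    """Set a dotted-path key in a nested dict, mimicking Hydra's
    `key=value` (existing key) vs `+key=value` (new key) overrides."""
    if override.startswith("+"):
        override, allow_new = override[1:], True
    path, value = override.split("=", 1)
    *parents, leaf = path.split(".")
    node = cfg
    for p in parents:
        node = node.setdefault(p, {}) if allow_new else node[p]
    if not allow_new and leaf not in node:
        raise KeyError(f"unknown key (use +{path} to add it): {path}")
    node[leaf] = value
    return cfg

cfg = {"dataset": {"n_frames": "4"}}
apply_override(cfg, "dataset.n_frames=8")    # existing key: plain override
apply_override(cfg, "+dataset.pos_range=8")  # new key: needs the '+' prefix
print(cfg)  # → {'dataset': {'n_frames': '8', 'pos_range': '8'}}
```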
train_stage_3.sh CHANGED
@@ -1,5 +1,7 @@
  wandb enabled
- 
+ export CUDA_VISIBLE_DEVICES=4,5
+ export NCCL_P2P_DISABLE=1
+ # export HYDRA_FULL_ERROR=1
  # set -e
  python -m main +name=train \
  dataset.n_frames=8 \
@@ -17,8 +19,9 @@ python -m main +name=train \
  +algorithm.relative_embedding=true \
  +algorithm.log_video=true \
  +algorithm.add_timestamp_embedding=true \
- algorithm.metrics=[lpips,psnr] \
+ +algorithm.metrics=[lpips,psnr] \
  experiment.training.checkpointing.every_n_train_steps=2500 \
- resume=your_wandb_job_id e.g.yhht29bz \
- +output_dir=your_saving_path e.g. outputs/2025-05-18/15-16-32 \
- experiment.training.max_steps=700000
+ resume=qyyk38nw \
+ +output_dir=/share_1/users/bonan_ding/worldmem_ckpt/reproduce_1 \
+ experiment.training.max_steps=350000
+ # experiment.training.max_steps=700000
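A pitfall worth noting with long backslash-continued commands like the ones above: a `\` continuation followed by a line that begins with `#` ends the command at the comment, so any override placed after the commented-out line is no longer passed to `python -m main` — it runs as a separate shell statement instead. A minimal demonstration with a hypothetical `printf` command, driven through bash from Python:

```python
import subprocess

# A '\' continuation followed by a '#' line: the comment terminates the
# printf command, so 'c=1' is NOT passed as an argument -- it runs as a
# separate shell statement (here, a plain variable assignment).
SCRIPT = r"""
printf 'arg=%s\n' a b \
  # this comment ends the printf command
c=1
echo "done, c=$c"
"""

out = subprocess.run(["bash", "-c", SCRIPT], capture_output=True, text=True).stdout
print(out, end="")
# arg=a
# arg=b
# done, c=1
```

Keeping commented-out overrides after the last active line (or outside the continued command entirely) avoids this.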
utils/distributed_utils.py CHANGED
@@ -1,3 +1,10 @@
- import wandb
+ import os
  
- is_rank_zero = wandb.run is not None
+ # Check standard environment variables for distributed training
+ # Default to True (rank 0) if not in a distributed environment
+ _rank = int(os.environ.get("RANK", 0))
+ _local_rank = int(os.environ.get("LOCAL_RANK", 0))
+ 
+ # We consider it rank zero if global rank is 0.
+ # Local rank check is usually redundant if rank is 0, but good for sanity.
+ is_rank_zero = _rank == 0
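The old check (`is_rank_zero = wandb.run is not None`) is only meaningful after `wandb.init()` has been called; the fix instead reads the standard `RANK` variable that launchers such as `torchrun` export for every worker. The rule can be exercised without any launcher — a standalone sketch mirroring the new module:

```python
def compute_is_rank_zero(env) -> bool:
    """Same rule as the new utils/distributed_utils.py: global RANK 0,
    or no distributed environment at all, counts as rank zero."""
    return int(env.get("RANK", 0)) == 0

print(compute_is_rank_zero({}))                                # → True  (single process)
print(compute_is_rank_zero({"RANK": "0", "LOCAL_RANK": "0"}))  # → True  (launcher rank 0)
print(compute_is_rank_zero({"RANK": "3", "LOCAL_RANK": "3"}))  # → False (worker rank)
```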