# WaveGen: Wave Generative Method
[[Paper]]() | [[Project Page]](https://world-snapshot.github.io/) | [[Huggingface]](https://huggingface.co/World-Snapshot) | [[Document]](https://world-snapshot.github.io/doc/index.html) | [[Org Materials]](https://github.com/World-Snapshot) | [[WSMs]](https://github.com/World-Snapshot/WSMs) | [[ControlWave]](https://github.com/World-Snapshot/ControlWave)
This generation method is mainly used to create the core space of WSM. The input is text/wsg; the output is a set of mathematical functions (differentiable, movable, and deformable waves, rather than probability clouds) that construct the core space for simulating the operation of the world.
First, please read the project page to understand that WaveGen is currently relatively weak and operates at the level of multiple objects with simple dynamic shapes. If you are pursuing [a large number of tasks](https://github.com/World-Snapshot/The-results-of-Augustus), please check [WSMs](https://github.com/World-Snapshot/WSMs) (the Wave2Pixel decoder), which is where most of our current effort goes. If you want to continue developing WaveGen, please read on.
## Abstract
**Wave Generative Method** is a conceptually new generation mechanism<sup>1</sup>. It aims to use mathematical functions to losslessly record dynamic 3D shapes, fundamentally reducing the amount of generated strings/values while maintaining the same expressive power. It is dedicated to achieving new features that were not well supported by previous methods<sup>2</sup>: consistency, native 3D+t, variable resolution, unlimited duration, predicting the future, variable FPS, physics, causality, predicting the past, pixel-level control of the world distribution, synchronizing the real world with the model's core space, training the world itself, etc.
This architecture is essentially designed for training the core space of the [World Snapshot Model (WSM)](https://world-snapshot.github.io/doc/index.html?page=S5_blogs/00_world_model.md#definition), and it usually requires a [WSMs](https://github.com/World-Snapshot/WSMs) decoder (by default, a submodule of WaveGen). For users, the [ControlWave UI](https://github.com/World-Snapshot/ControlWave) is recommended for control.
**Motivation:** As stated in the [Unified Law for Visual Tasks](https://github.com/World-Snapshot/Unified-Law-for-Visual-Tasks), WSM can almost uniformly and controllably generate any type of visual output. However, we want the generation method to come with a core space (a functional implicit reconstruction), which traditional methods lack. Therefore, we developed the [WaveGen](https://github.com/World-Snapshot/WaveGen) generation method to obtain these advanced functions/features and address this shortcoming. It is not yet perfect, but for general cases the traditional generation methods are sufficient.
<sub>1. It is mainly based on RF and parametric expression of mathematical primitives.</sub>
<sub>2. These features (except for predicting the future, consistency, unlimited duration, native 3D+t) are almost unique to the Wave Model.</sub>
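To make the representation concrete, here is a minimal sketch of what a differentiable, movable primitive could look like. It uses the standard superquadric inside-outside function (the preprocessing step below fits superquadrics to point clouds) with a toy time-dependent translation. The class name, fields, and parameterization are illustrative assumptions, not the actual WaveGen implementation.
```python
# Illustrative sketch only -- not the actual WaveGen primitive.
from dataclasses import dataclass
import numpy as np

@dataclass
class MovingSuperquadric:
    """A dynamic 3D shape recorded as a closed-form function of (x, y, z, t)."""
    scale: np.ndarray     # (a1, a2, a3): half-extents along each axis
    eps: np.ndarray       # (eps1, eps2): shape exponents
    velocity: np.ndarray  # constant linear velocity, a toy stand-in for dynamics

    def inside_outside(self, points: np.ndarray, t: float) -> np.ndarray:
        """Standard superquadric inside-outside function F; F < 1 means inside.

        F(x, y, z) = (|x/a1|^(2/eps2) + |y/a2|^(2/eps2))^(eps2/eps1)
                     + |z/a3|^(2/eps1)
        """
        x, y, z = ((points - self.velocity * t) / self.scale).T
        e1, e2 = self.eps
        xy = np.abs(x) ** (2 / e2) + np.abs(y) ** (2 / e2)
        return xy ** (e2 / e1) + np.abs(z) ** (2 / e1)

# A box-like unit shape drifting along +x; query one point at t = 0.5.
sq = MovingSuperquadric(np.array([1.0, 1.0, 1.0]),
                        np.array([0.1, 0.1]),
                        np.array([0.2, 0.0, 0.0]))
print(sq.inside_outside(np.array([[0.5, 0.0, 0.0]]), t=0.5))  # < 1: inside
```
Because the shape is a closed-form function of time, it can be queried at any instant and resolution, which is the intuition behind the variable-FPS and variable-resolution features above.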
## Inference
**For users:**
The [ControlWave UI](https://github.com/World-Snapshot/ControlWave) is recommended for control.
**For developers:**
First, prepare the code.
```bash
# You need to install the submodules.
git clone --recursive git@github.com:World-Snapshot/WaveGen.git
# If you have cloned WaveGen but do not have WSMs, or if WSMs needs to be updated:
cd WaveGen
git submodule update --remote --merge
```
Inside `WaveGen`, each folder represents one WaveGen model. This makes development and model switching easier and is convenient for the ControlWave UI. If you plan to use a particular WaveGen model, you need to install the corresponding environment.
Then, prepare the environment.
```bash
conda create -n WaveGen python=3.11 -y
conda activate WaveGen
pip install -r Augustus_v1/env/requirements_WaveGen.txt
```
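As a quick sanity check that you are running inside the new environment (only the interpreter version is guaranteed by the commands above; package contents depend on the requirements file):
```python
# Run inside the activated WaveGen env; only checks the interpreter version.
import sys

assert sys.version_info[:2] == (3, 11), f"wrong env: {sys.version}"
print("WaveGen env OK:", sys.version.split()[0])
```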
## Training
2025.9.5: Estimated general WaveGen & WSM training procedure:
### Text2Wave:
1. First, freeze the decoder and train the main model so that it can generate an appropriate wave space based on the camera position (this requires the common RealCam-Vid and Articulation-XL2.0 datasets). These two might require us to re-extract the relevant camera information, mainly to train a wave-space generation model with 3D knowledge. The former features scenes, extremely long texts, and high-quality dynamic perspectives, while the latter uses normalized point clouds to learn the shapes of objects.
**Note:** For the initial version, I used [MOVi-A](https://github.com/google-research/kubric/tree/main/challenges/movi) instead of the datasets above, merely as a proof of concept.
2. **Prepare Data:**
**Step 1: Download MOVi-A Dataset**
```bash
python EMS-superquadric_fitting_inference/download_movi_a.py
```
| This will download the MOVi-A dataset from Google Cloud Storage to `data/movi_a_128x128/`. The download includes: | |
| - Train split: ~9,700 samples | |
| - Validation split: ~250 samples | |
| - Each sample contains 24 frames with RGB, depth, segmentation, and metadata | |
| **Note:** The download can be interrupted and resumed by running the script again. | |
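A minimal way to eyeball a downloaded sample before preprocessing. The per-sample layout sketched here (a sample directory holding frame `.png` RGB images plus `.npy` arrays) is an assumption based on the file types this README mentions; adjust the paths to whatever the download script actually writes:
```python
# Inspect one downloaded sample; the directory layout is assumed, adjust as needed.
from pathlib import Path
import numpy as np

sample_dir = Path("data/movi_a_128x128/train/00000")  # hypothetical sample path
pngs = sorted(sample_dir.glob("*.png"))
npys = sorted(sample_dir.glob("*.npy"))
print(f"{len(pngs)} RGB frames (expect 24), {len(npys)} .npy arrays")
for f in npys[:5]:
    arr = np.load(f, allow_pickle=True)
    print(f.name, getattr(arr, "shape", None), getattr(arr, "dtype", None))
```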
**Step 2: Preprocess Dataset (Generate Superquadric Caches)**
After downloading, preprocess the dataset to fit superquadrics to point clouds and generate cache files:
```bash
# Preprocess training set (process the first 100 samples for quick testing)
python data/preprocess_dataset.py \
    --data_root data/movi_a_128x128 \
    --split train \
    --max_samples 100 \
    --num_workers 8
# Preprocess validation set
python data/preprocess_dataset.py \
    --data_root data/movi_a_128x128 \
    --split validation \
    --max_samples 10 \
    --num_workers 8
```
**Parameters:**
- `--max_samples`: Number of samples to process (`-1` for all samples)
- `--num_workers`: Number of parallel processes (adjust based on your CPU cores)
This step generates a `Full_Sample_Data_for_Learning_Target.npz` file for each sample, containing all training data (superquadric parameters, camera info, physics properties). Processing time: ~5-10 seconds per sample.
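To see what a cache actually contains, you can list the arrays in the archive. The path below is a placeholder; the README only states that the file holds superquadric parameters, camera info, and physics properties, so the key names are whatever the preprocessing script emits:
```python
# List the contents of one preprocessed cache file.
import numpy as np

# Hypothetical location; substitute a real sample directory.
path = "data/movi_a_128x128/train/00000/Full_Sample_Data_for_Learning_Target.npz"
with np.load(path, allow_pickle=True) as cache:
    for key in cache.files:  # e.g. superquadric params, camera info, physics
        print(f"{key}: shape={cache[key].shape}, dtype={cache[key].dtype}")
```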
**Step 2.5 (Optional): Merge .npy Files to Reduce File Count**
For better storage efficiency and easier file transfer, you can merge the individual .npy files into compressed .npz archives:
```bash
# Merge all .npy files in the train and validation splits
python data/merge_npy_to_npz.py --data_root data/movi_a_128x128
# Preview what will be merged (dry-run mode)
python data/merge_npy_to_npz.py --data_root data/movi_a_128x128 --dry-run
# Merge only a specific split
python data/merge_npy_to_npz.py --data_root data/movi_a_128x128 --split train
# Revert back to .npy files if needed
python data/merge_npy_to_npz.py --data_root data/movi_a_128x128 --revert
```
**Benefits:**
- Reduces file count by **~98%** (e.g., 303 files → 5 .npz files per sample)
- Saves **~70%** storage space with compression
- Faster file transfers and backups
- RGB images (.png) are preserved as-is
**Note:** The training code automatically handles both .npy and .npz formats, so this step is completely optional.
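The dual-format support can be implemented as a simple fallback loader. This is a sketch of the pattern, not the repository's actual code, and the archive filename is a placeholder:
```python
# Sketch of a loader that accepts either loose .npy files or a merged .npz.
from pathlib import Path
import numpy as np

def load_array(sample_dir: str, name: str, archive: str = "merged.npz") -> np.ndarray:
    """Return array `name`, preferring a loose .npy, else the merged archive."""
    npy_path = Path(sample_dir) / f"{name}.npy"
    if npy_path.exists():
        return np.load(npy_path, allow_pickle=True)
    with np.load(Path(sample_dir) / archive, allow_pickle=True) as z:
        return z[name]
```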
**Step 3: Start Training**
```bash
cd WaveGen_Augustus_v1
bash launch_text2wave_training.sh
```
The training script will:
1. Automatically check and preprocess any uncached samples
2. Train the Text2Wave model using a T5 encoder-decoder
3. Save checkpoints and generation results to `core_space/`
**Training Configuration:**
Edit `WaveGen_Augustus_v1/configs/default.yaml` (or patch it programmatically, as sketched after this list) to adjust:
- `data.max_sequences`: Number of training samples (default: 100)
- `training.batch_size`: Batch size (default: 24)
- `training.max_steps`: Total training steps (default: 50000)
- Loss weights and other hyperparameters
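For scripted sweeps, you can load and patch the config before launching. A sketch using PyYAML, assuming only the three documented keys and nothing else about the file's schema:
```python
# Patch documented keys in default.yaml before a run (schema otherwise unknown).
import yaml

cfg_path = "WaveGen_Augustus_v1/configs/default.yaml"
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

cfg["data"]["max_sequences"] = 1000   # documented default: 100
cfg["training"]["batch_size"] = 8     # documented default: 24
cfg["training"]["max_steps"] = 10000  # documented default: 50000

with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f)
```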
**Resume Training:**
```bash
bash launch_text2wave_training.sh 1000  # Resume from step 1000
```
### WSMs (Wave2Pixel Decoder):
See the [training section](https://github.com/World-Snapshot/WSMs/tree/main#training) of WSMs.