# WaveGen: Wave Generative Method
[[Paper]]() | [[Project Page]](https://world-snapshot.github.io/) | [[Huggingface]](https://huggingface.co/World-Snapshot) | [[Document]](https://world-snapshot.github.io/doc/index.html) | [[Org Materials]](https://github.com/World-Snapshot) | [[WSMs]](https://github.com/World-Snapshot/WSMs) | [[ControlWave]](https://github.com/World-Snapshot/ControlWave)
This generation method is mainly used to create the core space of WSM. The input is text/wsg; the output is a set of mathematical functions (differentiable, movable, and deformable waves, rather than probability clouds) that construct the core space for simulating the operation of the world.
First, please read the project page to understand that WaveGen is currently relatively weak and operates at the level of multiple objects with simple dynamic shapes. If you are pursuing [a large number of tasks](https://github.com/World-Snapshot/The-results-of-Augustus), please check [WSMs](https://github.com/World-Snapshot/WSMs) (the Wave2Pixel decoder), which is where most of our current effort goes. If you want to continue developing WaveGen, please read on.
## Abstract
**Wave Generative Method** is a conceptually new generation mechanism<sup>1</sup>. It aims to use mathematical functions to losslessly record dynamic 3D shapes, fundamentally reducing the amount of generated strings/values while maintaining the same expressive power. It is dedicated to achieving new features that were not well supported by previous methods<sup>2</sup>: consistency, native 3D+t, variable resolution, unlimited duration, predicting the future, variable FPS, physics, causality, predicting the past, pixel-level control of the world distribution, synchronizing the real world with the model's core space, training the world itself, etc.
This architecture is essentially designed for training the core space of the [World Snapshot Model (WSM)](https://world-snapshot.github.io/doc/index.html?page=S5_blogs/00_world_model.md#definition), and it usually requires a [WSMs](https://github.com/World-Snapshot/WSMs) decoder (by default, a submodule of WaveGen). For users, the [ControlWave UI](https://github.com/World-Snapshot/ControlWave) is recommended for control.
**Motivation:** As stated in the [Unified Law for Visual Tasks](https://github.com/World-Snapshot/Unified-Law-for-Visual-Tasks), WSM can almost uniformly and controllably generate any type of visual output. However, we want the generation method to come with a core space (a functional implicit reconstruction), which traditional methods lack. Therefore, we developed the [WaveGen](https://github.com/World-Snapshot/WaveGen) generation method to obtain these advanced functions/features and address this shortcoming. It is not yet perfect, but for general cases the traditional generation methods are sufficient.
<sub>1. It is mainly based on RF and parametric expression of mathematical primitives.</sub>
<sub>2. These features (except for predicting the future, consistency, unlimited duration, native 3D+t) are almost unique to the Wave Model.</sub>
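To make the representation concrete, here is a minimal sketch of what a differentiable, movable primitive could look like. It uses the standard superquadric inside-outside function (the preprocessing step below fits superquadrics to point clouds) with a toy time-dependent translation. The class name, fields, and parameterization are illustrative assumptions, not the actual WaveGen implementation.
```python
# Illustrative sketch only -- not the actual WaveGen primitive.
from dataclasses import dataclass
import numpy as np

@dataclass
class MovingSuperquadric:
    """A dynamic 3D shape recorded as a closed-form function of (x, y, z, t)."""
    scale: np.ndarray     # (a1, a2, a3): half-extents along each axis
    eps: np.ndarray       # (eps1, eps2): shape exponents
    velocity: np.ndarray  # constant linear velocity, a toy stand-in for dynamics

    def inside_outside(self, points: np.ndarray, t: float) -> np.ndarray:
        """Standard superquadric inside-outside function F; F < 1 means inside.

        F(x, y, z) = (|x/a1|^(2/eps2) + |y/a2|^(2/eps2))^(eps2/eps1)
                     + |z/a3|^(2/eps1)
        """
        x, y, z = ((points - self.velocity * t) / self.scale).T
        e1, e2 = self.eps
        xy = np.abs(x) ** (2 / e2) + np.abs(y) ** (2 / e2)
        return xy ** (e2 / e1) + np.abs(z) ** (2 / e1)

# A box-like unit shape drifting along +x; query one point at t = 0.5.
sq = MovingSuperquadric(np.array([1.0, 1.0, 1.0]),
                        np.array([0.1, 0.1]),
                        np.array([0.2, 0.0, 0.0]))
print(sq.inside_outside(np.array([[0.5, 0.0, 0.0]]), t=0.5))  # < 1: inside
```
Because the shape is a closed-form function of time, it can be queried at any instant and resolution, which is the intuition behind the variable-FPS and variable-resolution features above.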
## Inference
**For users:**
The [ControlWave UI](https://github.com/World-Snapshot/ControlWave) is recommended for control.
**For developers:**
First, prepare the code.
```bash
# You need to install the submodules.
git clone --recursive git@github.com:World-Snapshot/WaveGen.git
# If you have cloned WaveGen but do not have WSMs, or if WSMs needs to be updated:
cd WaveGen
git submodule update --remote --merge
```
Inside `WaveGen`, each folder represents one WaveGen model. This makes development and model switching easier and is convenient for the ControlWave UI. If you plan to use a particular WaveGen model, you need to install the corresponding environment.
Then, prepare the environment.
```bash
conda create -n WaveGen python=3.11 -y
conda activate WaveGen
pip install -r Augustus_v1/env/requirements_WaveGen.txt
```
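As a quick sanity check that you are running inside the new environment (only the interpreter version is guaranteed by the commands above; package contents depend on the requirements file):
```python
# Run inside the activated WaveGen env; only checks the interpreter version.
import sys

assert sys.version_info[:2] == (3, 11), f"wrong env: {sys.version}"
print("WaveGen env OK:", sys.version.split()[0])
```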
## Training
2025.9.5: Estimated general WaveGen & WSM training procedure:
### Text2Wave:
1. First, freeze the decoder and train the main model so that it can generate an appropriate wave space based on the camera position (this requires the common RealCam-Vid and Articulation-XL2.0 datasets). These two might require us to re-extract the relevant camera information, mainly to train a wave-space generation model with 3D knowledge. The former features scenes, extremely long texts, and high-quality dynamic perspectives, while the latter uses normalized point clouds to learn the shapes of objects.
**Note:** For the initial version, I used [MOVi-A](https://github.com/google-research/kubric/tree/main/challenges/movi) instead of the datasets above, merely as a proof of concept.
2. **Prepare Data:**
**Step 1: Download MOVi-A Dataset**
```bash
python EMS-superquadric_fitting_inference/download_movi_a.py
```
| This will download the MOVi-A dataset from Google Cloud Storage to `data/movi_a_128x128/`. The download includes: | |
| - Train split: ~9,700 samples | |
| - Validation split: ~250 samples | |
| - Each sample contains 24 frames with RGB, depth, segmentation, and metadata | |
| **Note:** The download can be interrupted and resumed by running the script again. | |
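A minimal way to eyeball a downloaded sample before preprocessing. The per-sample layout sketched here (a sample directory holding frame `.png` RGB images plus `.npy` arrays) is an assumption based on the file types this README mentions; adjust the paths to whatever the download script actually writes:
```python
# Inspect one downloaded sample; the directory layout is assumed, adjust as needed.
from pathlib import Path
import numpy as np

sample_dir = Path("data/movi_a_128x128/train/00000")  # hypothetical sample path
pngs = sorted(sample_dir.glob("*.png"))
npys = sorted(sample_dir.glob("*.npy"))
print(f"{len(pngs)} RGB frames (expect 24), {len(npys)} .npy arrays")
for f in npys[:5]:
    arr = np.load(f, allow_pickle=True)
    print(f.name, getattr(arr, "shape", None), getattr(arr, "dtype", None))
```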
**Step 2: Preprocess Dataset (Generate Superquadric Caches)**
After downloading, preprocess the dataset to fit superquadrics to point clouds and generate cache files:
```bash
# Preprocess training set (process the first 100 samples for quick testing)
python data/preprocess_dataset.py \
    --data_root data/movi_a_128x128 \
    --split train \
    --max_samples 100 \
    --num_workers 8
# Preprocess validation set
python data/preprocess_dataset.py \
    --data_root data/movi_a_128x128 \
    --split validation \
    --max_samples 10 \
    --num_workers 8
```
**Parameters:**
- `--max_samples`: Number of samples to process (`-1` for all samples)
- `--num_workers`: Number of parallel processes (adjust based on your CPU cores)
This step generates a `Full_Sample_Data_for_Learning_Target.npz` file for each sample, containing all training data (superquadric parameters, camera info, physics properties). Processing time: ~5-10 seconds per sample.
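To see what a cache actually contains, you can list the arrays in the archive. The path below is a placeholder; the README only states that the file holds superquadric parameters, camera info, and physics properties, so the key names are whatever the preprocessing script emits:
```python
# List the contents of one preprocessed cache file.
import numpy as np

# Hypothetical location; substitute a real sample directory.
path = "data/movi_a_128x128/train/00000/Full_Sample_Data_for_Learning_Target.npz"
with np.load(path, allow_pickle=True) as cache:
    for key in cache.files:  # e.g. superquadric params, camera info, physics
        print(f"{key}: shape={cache[key].shape}, dtype={cache[key].dtype}")
```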
**Step 2.5 (Optional): Merge .npy Files to Reduce File Count**
For better storage efficiency and easier file transfer, you can merge the individual .npy files into compressed .npz archives:
```bash
# Merge all .npy files in the train and validation splits
python data/merge_npy_to_npz.py --data_root data/movi_a_128x128
# Preview what will be merged (dry-run mode)
python data/merge_npy_to_npz.py --data_root data/movi_a_128x128 --dry-run
# Merge only a specific split
python data/merge_npy_to_npz.py --data_root data/movi_a_128x128 --split train
# Revert back to .npy files if needed
python data/merge_npy_to_npz.py --data_root data/movi_a_128x128 --revert
```
**Benefits:**
- Reduces file count by **~98%** (e.g., 303 files → 5 .npz files per sample)
- Saves **~70%** storage space with compression
- Faster file transfers and backups
- RGB images (.png) are preserved as-is
**Note:** The training code automatically handles both .npy and .npz formats, so this step is completely optional.
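The dual-format support can be implemented as a simple fallback loader. This is a sketch of the pattern, not the repository's actual code, and the archive filename is a placeholder:
```python
# Sketch of a loader that accepts either loose .npy files or a merged .npz.
from pathlib import Path
import numpy as np

def load_array(sample_dir: str, name: str, archive: str = "merged.npz") -> np.ndarray:
    """Return array `name`, preferring a loose .npy, else the merged archive."""
    npy_path = Path(sample_dir) / f"{name}.npy"
    if npy_path.exists():
        return np.load(npy_path, allow_pickle=True)
    with np.load(Path(sample_dir) / archive, allow_pickle=True) as z:
        return z[name]
```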
**Step 3: Start Training**
```bash
cd WaveGen_Augustus_v1
bash launch_text2wave_training.sh
```
The training script will:
1. Automatically check and preprocess any uncached samples
2. Train the Text2Wave model using a T5 encoder-decoder
3. Save checkpoints and generation results to `core_space/`
**Training Configuration:**
Edit `WaveGen_Augustus_v1/configs/default.yaml` (or patch it programmatically, as sketched after this list) to adjust:
- `data.max_sequences`: Number of training samples (default: 100)
- `training.batch_size`: Batch size (default: 24)
- `training.max_steps`: Total training steps (default: 50000)
- Loss weights and other hyperparameters
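For scripted sweeps, you can load and patch the config before launching. A sketch using PyYAML, assuming only the three documented keys and nothing else about the file's schema:
```python
# Patch documented keys in default.yaml before a run (schema otherwise unknown).
import yaml

cfg_path = "WaveGen_Augustus_v1/configs/default.yaml"
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

cfg["data"]["max_sequences"] = 1000   # documented default: 100
cfg["training"]["batch_size"] = 8     # documented default: 24
cfg["training"]["max_steps"] = 10000  # documented default: 50000

with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f)
```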
**Resume Training:**
```bash
bash launch_text2wave_training.sh 1000  # Resume from step 1000
```
### WSMs (Wave2Pixel Decoder):
See the [training section](https://github.com/World-Snapshot/WSMs/tree/main#training) of WSMs.