---
license: mit
tags:
- video-generation
- diffusion
- world-model
library_name: diffusion
model-index:
- name: Astra
  results: []
---
# Astra: General Interactive World Model with Autoregressive Denoising

[arXiv] | [Project Page] | [GitHub]

Yixuan Zhu¹, Jiaqi Feng¹, Wenzhao Zheng¹†, Yuan Gao², Xin Tao², Pengfei Wan², Jie Zhou¹, Jiwen Lu¹

(*Work done during an internship at Kuaishou Technology, †Project leader)

¹Tsinghua University, ²Kuaishou Technology.
## Updates

- [2025.11.17] Released the project page.
- [2025.12.09] Released the training and inference code and model checkpoints.
## TODO List

- Release full inference pipelines for additional scenarios:
  - Autonomous driving
  - Robotic manipulation
  - Drone navigation / exploration
- Open-source training scripts:
  - Action-conditioned autoregressive denoising training
  - Multi-scenario joint training pipeline
- Release dataset preprocessing tools
- Provide a unified evaluation toolkit
## Introduction

TL;DR: Astra is an interactive world model that delivers realistic long-horizon video rollouts under a wide range of scenarios and action inputs.
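As a rough illustration of the autoregressive denoising idea, each new chunk of frames is denoised while conditioning on everything generated so far plus the current action. The sketch below is purely conceptual, not Astra's actual implementation: the chunk size, the `denoise_chunk` stand-in, and the action labels are all invented for illustration.

```python
# Conceptual sketch of an autoregressive denoising rollout.
# All names here (denoise_chunk, CHUNK, the action strings) are illustrative.

CHUNK = 4  # frames generated per autoregressive step (illustrative)

def denoise_chunk(context, action):
    """Toy stand-in for a diffusion denoiser: returns CHUNK new 'frames'
    derived from the conditioning context and the action label."""
    start = len(context)
    return [f"{action}_frame_{start + i}" for i in range(CHUNK)]

def rollout(first_frame, actions):
    """Autoregressively extend a video: each chunk conditions on all
    frames generated so far, which is what keeps long rollouts coherent."""
    frames = [first_frame]
    for action in actions:
        frames.extend(denoise_chunk(frames, action))
    return frames

video = rollout("init_frame", ["pan_right", "pan_right", "zoom_in"])
```

The real model denoises latent video chunks with a diffusion transformer rather than producing strings, but the control flow, conditioning each chunk on the accumulated history and an action input, follows the same pattern.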
## Gallery

### Astra + Wan2.1
## Code: Astra + Wan2.1 (Inference & Training)

Astra is built upon Wan2.1-1.3B, a diffusion-based video generation model. We provide inference scripts to help you quickly generate videos from images and action inputs. Follow the steps below.
### Inference

#### Step 1: Set up the environment

DiffSynth-Studio requires Rust and Cargo to compile extensions. You can install them using the following command:

```shell
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
. "$HOME/.cargo/env"
```

Install DiffSynth-Studio:

```shell
git clone https://github.com/EternalEvan/Astra.git
cd Astra
pip install -e .
```
#### Step 2: Download the pretrained checkpoints

Download the pre-trained Wan2.1 models:

```shell
cd script
python download_wan2.1.py
```

Download the pre-trained Astra checkpoint from Hugging Face and place it in `models/Astra/checkpoints`.
#### Step 3: Test the example videos

```shell
python inference_astra.py --cam_type 1
```
#### Step 4: Test your own videos

To test your own videos, prepare your test data following the structure of the `example_test_data` folder: N mp4 videos, each with at least 81 frames, and a `metadata.csv` file that stores their paths and corresponding captions. You can refer to the Prompt Extension section of Wan2.1 for guidance on preparing video captions.

```shell
python inference_astra.py --cam_type 1 --dataset_path path/to/your/data
```
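The authoritative schema for `metadata.csv` is whatever the `example_test_data` folder contains; as a minimal sketch (the `video_path` and `caption` column names below are our assumption, not confirmed by the repo), the file could be assembled like this:

```python
import csv

# Hypothetical rows: paths to your test videos plus their captions.
# The column names "video_path" and "caption" are assumptions; check
# the example_test_data folder in the repo for the actual schema.
rows = [
    {"video_path": "videos/clip_000.mp4", "caption": "A car drives down a coastal road."},
    {"video_path": "videos/clip_001.mp4", "caption": "A robot arm picks up a red cube."},
]

with open("metadata.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["video_path", "caption"])
    writer.writeheader()
    writer.writerows(rows)
```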
We provide several preset camera types, as shown in the table below. Additionally, you can generate new trajectories for testing.
| cam_type | Trajectory |
|---|---|
| 1 | Pan Right |
| 2 | Pan Left |
| 3 | Tilt Up |
| 4 | Tilt Down |
| 5 | Zoom In |
| 6 | Zoom Out |
| 7 | Translate Up (with rotation) |
| 8 | Translate Down (with rotation) |
| 9 | Arc Left (with rotation) |
| 10 | Arc Right (with rotation) |
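For a new trajectory beyond the presets, the general idea is to produce one camera pose per frame. The sketch below is illustration only: Astra's actual trajectory file format and coordinate conventions are not documented here, so inspect the preset trajectories shipped with the repo before relying on it. It builds an 81-frame rightward translation as a list of 4×4 camera-to-world matrices with a linearly increasing x-offset:

```python
# Illustrative only: generate per-frame camera poses for a simple
# rightward translation. The real format Astra expects may differ.

NUM_FRAMES = 81     # matches the 81-frame clips used above
TOTAL_SHIFT = 0.5   # total x-translation in (arbitrary) world units

def translate_right_trajectory(num_frames=NUM_FRAMES, total_shift=TOTAL_SHIFT):
    poses = []
    for i in range(num_frames):
        t = i / (num_frames - 1)  # 0.0 -> 1.0 across the clip
        pose = [
            [1.0, 0.0, 0.0, t * total_shift],  # x row carries the shift
            [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0],
        ]
        poses.append(pose)
    return poses

traj = translate_right_trajectory()
```

Rotating trajectories (the arc and tilt presets) would additionally vary the upper-left 3×3 rotation block per frame.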
### Training

#### Step 1: Set up the environment

```shell
pip install lightning pandas websockets
```
#### Step 2: Prepare the training dataset

Download the MultiCamVideo dataset.

Extract VAE features:

```shell
CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" python train_recammaster.py \
  --task data_process \
  --dataset_path path/to/the/MultiCamVideo/Dataset \
  --output_path ./models \
  --text_encoder_path "models/Wan-AI/Wan2.1-T2V-1.3B/models_t5_umt5-xxl-enc-bf16.pth" \
  --vae_path "models/Wan-AI/Wan2.1-T2V-1.3B/Wan2.1_VAE.pth" \
  --tiled --num_frames 81 --height 480 --width 832 \
  --dataloader_num_workers 2
```
Generate captions for each video: you can use a video captioning tool such as LLaVA to caption each video and store the results in the `metadata.csv` file.
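A hedged sketch of that step: the `caption_video` function below is a placeholder for whatever captioning model you run (e.g. LLaVA), and the `video_path`/`caption` column names are our assumption, so match them to the schema your training script actually reads.

```python
import csv
from pathlib import Path

def caption_video(path):
    """Placeholder: swap in a real captioner (e.g. LLaVA) here."""
    return f"A video clip named {path.stem}."

def build_metadata(video_dir, out_csv="metadata.csv"):
    """Scan a directory of mp4 files and write one caption row per video."""
    videos = sorted(Path(video_dir).glob("*.mp4"))
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["video_path", "caption"])  # assumed schema
        for v in videos:
            writer.writerow([str(v), caption_video(v)])
    return len(videos)
```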
#### Step 3: Training

```shell
CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" python train_recammaster.py \
  --task train \
  --dataset_path recam_train_data \
  --output_path ./models/train \
  --dit_path "models/Wan-AI/Wan2.1-T2V-1.3B/diffusion_pytorch_model.safetensors" \
  --steps_per_epoch 8000 --max_epochs 100 \
  --learning_rate 1e-4 --accumulate_grad_batches 1 \
  --use_gradient_checkpointing --dataloader_num_workers 4
```
We did not search for the optimal set of hyperparameters and trained with a batch size of 1 on each GPU. You may achieve better model performance by tuning hyperparameters such as the learning rate or by increasing the batch size.
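For reference when scaling, the effective batch size implied by the command above is GPUs × per-GPU batch × gradient-accumulation steps. This is simple bookkeeping with the flag values from the command; it assumes `steps_per_epoch` counts optimizer steps, which the training script may define differently.

```python
# Values taken from the training command above.
num_gpus = 8
per_gpu_batch = 1               # batch size 1 on each GPU
accumulate_grad_batches = 1
steps_per_epoch = 8000
max_epochs = 100

effective_batch = num_gpus * per_gpu_batch * accumulate_grad_batches  # 8
total_steps = steps_per_epoch * max_epochs                            # 800000
```

Doubling `accumulate_grad_batches` doubles the effective batch without extra GPU memory, at the cost of fewer optimizer steps per wall-clock hour.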
#### Step 4: Test the model

```shell
python inference_recammaster.py --cam_type 1 --ckpt_path path/to/the/checkpoint
```
## Awesome Related Works

Feel free to explore these outstanding related works, including but not limited to:

- ReCamMaster: re-captures in-the-wild videos with novel camera trajectories.
- GCD: synthesizes large-angle novel viewpoints of 4D dynamic scenes from a monocular video.
- ReCapture: generates new videos with novel camera trajectories from a single user-provided video.
- Trajectory Attention: facilitates tasks such as camera motion control on images and videos, and video editing.
- GS-DiT: provides 4D video control for a single monocular video.
- Diffusion as Shader: a versatile video generation control model for various tasks.
- TrajectoryCrafter: achieves high-fidelity novel-view generation from casually captured monocular video.
- GEN3C: a generative video model with precise camera control and temporal 3D consistency.
## Citation

Please leave us a star and cite our paper if you find our work helpful.

```bibtex
@inproceedings{zhu2025astra,
  title={Astra: General Interactive World Model with Autoregressive Denoising},
  author={Zhu, Yixuan and Feng, Jiaqi and Zheng, Wenzhao and Gao, Yuan and Tao, Xin and Wan, Pengfei and Zhou, Jie and Lu, Jiwen},
  booktitle={arXiv},
  year={2025}
}
```