---
license: mit
tags:
  - video-generation
  - diffusion
  - world-model
library_name: diffusion
model-index:
  - name: Astra
    results: []
---

# Astra 🌏: General Interactive World Model with Autoregressive Denoising

📄 [arXiv]    🏠 [Project Page]    🖥️ [GitHub]

Yixuan Zhu¹, Jiaqi Feng¹, Wenzhao Zheng¹ †, Yuan Gao², Xin Tao², Pengfei Wan², Jie Zhou¹, Jiwen Lu¹

(\*Work done during an internship at Kuaishou Technology; † project leader)

¹Tsinghua University, ²Kuaishou Technology

## 🔥 Updates

- **[2025.11.17]** Released the project page.
- **[2025.12.09]** Released the training and inference code and the model checkpoint.

## 🎯 TODO List

- Release full inference pipelines for additional scenarios:
  - 🚗 Autonomous driving
  - 🤖 Robotic manipulation
  - 🛸 Drone navigation / exploration
- Open-source training scripts:
  - ⬆️ Action-conditioned autoregressive denoising training
  - 🔄 Multi-scenario joint training pipeline
- Release dataset preprocessing tools
- Provide a unified evaluation toolkit

## 📖 Introduction

**TL;DR:** Astra is an interactive world model that delivers realistic long-horizon video rollouts across a wide range of scenarios and action inputs.

## Gallery

*(Astra + Wan2.1 sample rollouts; see the project page for videos.)*

βš™οΈ Code: Astra + Wan2.1 (Inference & Training)

Astra is built upon Wan2.1-1.3B, a diffusion-based video generation model. We provide inference scripts to help you quickly generate videos from images and action inputs. Follow the steps below:

### Inference

#### Step 1: Set up the environment

DiffSynth-Studio requires Rust and Cargo to compile extensions. You can install them using the following command:

```shell
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
. "$HOME/.cargo/env"
```

Install DiffSynth-Studio:

```shell
git clone https://github.com/EternalEvan/Astra.git
cd Astra
pip install -e .
```

#### Step 2: Download the pretrained checkpoints

1. Download the pre-trained Wan2.1 models:

   ```shell
   cd script
   python download_wan2.1.py
   ```

2. Download the pre-trained Astra checkpoint:

   Please download it from Hugging Face and place it in `models/Astra/checkpoints`.

#### Step 3: Test the example videos

```shell
python inference_astra.py --cam_type 1
```

#### Step 4: Test your own videos

To test your own videos, prepare your test data following the structure of the `example_test_data` folder: N `.mp4` videos, each with at least 81 frames, and a `metadata.csv` file that stores their paths and corresponding captions. You can refer to the Prompt Extension section in Wan2.1 for guidance on preparing video captions.

```shell
python inference_astra.py --cam_type 1 --dataset_path path/to/your/data
```
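As a concrete illustration, the expected `metadata.csv` can be assembled with a short script. This is a hedged sketch: the column names used here (`video_path`, `caption`) are assumptions for illustration — mirror whatever `example_test_data/metadata.csv` actually uses.

```python
import csv
import os
import tempfile

def write_metadata(data_dir, entries, csv_name="metadata.csv"):
    """Write a metadata.csv mapping each video path to its caption.

    NOTE: the header below is an assumption; copy the exact column
    names from example_test_data/metadata.csv.
    """
    csv_path = os.path.join(data_dir, csv_name)
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["video_path", "caption"])  # assumed header
        for video, caption in entries:
            writer.writerow([os.path.join(data_dir, video), caption])
    return csv_path

# Example: two hypothetical clips (each would need >= 81 frames in practice).
tmp = tempfile.mkdtemp()
path = write_metadata(tmp, [
    ("clip_000.mp4", "A car drives down a rainy street."),
    ("clip_001.mp4", "A drone flies over a forest."),
])
```

The resulting folder (videos plus `metadata.csv`) is then passed via `--dataset_path`.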

We provide several preset camera types, as shown in the table below. Additionally, you can generate new trajectories for testing.

| `cam_type` | Trajectory |
| --- | --- |
| 1 | Pan Right |
| 2 | Pan Left |
| 3 | Tilt Up |
| 4 | Tilt Down |
| 5 | Zoom In |
| 6 | Zoom Out |
| 7 | Translate Up (with rotation) |
| 8 | Translate Down (with rotation) |
| 9 | Arc Left (with rotation) |
| 10 | Arc Right (with rotation) |
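If you generate your own trajectories, a trajectory is essentially a per-frame camera pose. Below is a minimal pure-Python sketch of a pan-right path as a list of 4×4 camera-to-world matrices; the matrix convention and units are assumptions for illustration, so match the actual trajectory format expected by the inference script.

```python
def pan_right_trajectory(num_frames=81, step=0.02):
    """Build a pan-right camera path: identity rotation, camera
    translating along +x by `step` units per frame.

    Returns a list of 4x4 row-major camera-to-world matrices.
    The convention here is illustrative, not Astra's actual format.
    """
    poses = []
    for t in range(num_frames):
        pose = [[1.0, 0.0, 0.0, step * t],   # x-axis + x-translation
                [0.0, 1.0, 0.0, 0.0],        # y-axis
                [0.0, 0.0, 1.0, 0.0],        # z-axis
                [0.0, 0.0, 0.0, 1.0]]        # homogeneous row
        poses.append(pose)
    return poses

traj = pan_right_trajectory()  # 81 poses, matching the 81-frame videos
```

Other presets follow the same pattern: tilts rotate about the x-axis, zooms translate along the camera's view axis, and the arc variants combine translation with rotation.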

### Training

#### Step 1: Set up the environment

```shell
pip install lightning pandas websockets
```

#### Step 2: Prepare the training dataset

1. Download the MultiCamVideo dataset.

2. Extract VAE features:

   ```shell
   CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" python train_recammaster.py \
     --task data_process \
     --dataset_path path/to/the/MultiCamVideo/Dataset \
     --output_path ./models \
     --text_encoder_path "models/Wan-AI/Wan2.1-T2V-1.3B/models_t5_umt5-xxl-enc-bf16.pth" \
     --vae_path "models/Wan-AI/Wan2.1-T2V-1.3B/Wan2.1_VAE.pth" \
     --tiled \
     --num_frames 81 \
     --height 480 \
     --width 832 \
     --dataloader_num_workers 2
   ```

3. Generate captions for each video:

   You can use video captioning tools such as LLaVA to generate captions for each video and store them in the `metadata.csv` file.

#### Step 3: Training

```shell
CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" python train_recammaster.py \
  --task train \
  --dataset_path recam_train_data \
  --output_path ./models/train \
  --dit_path "models/Wan-AI/Wan2.1-T2V-1.3B/diffusion_pytorch_model.safetensors" \
  --steps_per_epoch 8000 \
  --max_epochs 100 \
  --learning_rate 1e-4 \
  --accumulate_grad_batches 1 \
  --use_gradient_checkpointing \
  --dataloader_num_workers 4
```

We did not search for optimal hyper-parameters and train with a batch size of 1 per GPU. You may achieve better model performance by tuning hyper-parameters such as the learning rate or by increasing the batch size.
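To make the batch-size note concrete: under data parallelism, the effective (global) batch size is the per-GPU batch size times the number of GPUs times the gradient-accumulation factor. A quick sanity check, assuming the 8-GPU launch shown above:

```python
def effective_batch_size(num_gpus, per_gpu_batch, accumulate_grad_batches):
    """Global number of samples contributing to one optimizer step."""
    return num_gpus * per_gpu_batch * accumulate_grad_batches

# The training command above: 8 GPUs, batch size 1 per GPU, no accumulation.
print(effective_batch_size(8, 1, 1))  # -> 8

# Raising --accumulate_grad_batches doubles the effective batch
# without increasing per-GPU memory.
print(effective_batch_size(8, 1, 2))  # -> 16
```

When you change the effective batch size, the learning rate typically needs to be re-tuned as well.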

#### Step 4: Test the model

```shell
python inference_recammaster.py --cam_type 1 --ckpt_path path/to/the/checkpoint
```

## 🤗 Awesome Related Works

Feel free to explore these outstanding related works, including but not limited to:

- **ReCamMaster**: re-captures in-the-wild videos with novel camera trajectories.
- **GCD**: synthesizes large-angle novel viewpoints of 4D dynamic scenes from a monocular video.
- **ReCapture**: a method for generating new videos with novel camera trajectories from a single user-provided video.
- **Trajectory Attention**: facilitates tasks such as camera motion control on images and videos, and video editing.
- **GS-DiT**: provides 4D video control for a single monocular video.
- **Diffusion as Shader**: a versatile video generation control model for various tasks.
- **TrajectoryCrafter**: achieves high-fidelity novel view generation from casually captured monocular video.
- **GEN3C**: a generative video model with precise camera control and temporal 3D consistency.

## 🌟 Citation

Please leave us a star 🌟 and cite our paper if you find our work helpful.

```bibtex
@inproceedings{zhu2025astra,
  title={Astra: General Interactive World Model with Autoregressive Denoising},
  author={Zhu, Yixuan and Feng, Jiaqi and Zheng, Wenzhao and Gao, Yuan and Tao, Xin and Wan, Pengfei and Zhou, Jie and Lu, Jiwen},
  booktitle={arXiv},
  year={2025}
}
```