---
license: mit
tags:
- video-generation
- diffusion
- world-model
library_name: diffusion
model-index:
- name: Astra
results: []
---
# Astra 🌏: General Interactive World Model with Autoregressive Denoising
**Yixuan Zhu¹, Jiaqi Feng¹, Wenzhao Zheng¹ †, Yuan Gao², Xin Tao², Pengfei Wan², Jie Zhou¹, Jiwen Lu¹**

(\*Work done during an internship at Kuaishou Technology; † Project leader)

¹Tsinghua University, ²Kuaishou Technology.
## 🔥 Updates
- __[2025.11.17]__: Released the [project page](https://eternalevan.github.io/Astra-project/).
- __[2025.12.09]__: Released the training and inference code and the model checkpoint.
## 🎯 TODO List
- [ ] **Release full inference pipelines** for additional scenarios:
- [ ] 🚗 Autonomous driving
- [ ] 🤖 Robotic manipulation
- [ ] 🛸 Drone navigation / exploration
- [ ] **Open-source training scripts**:
- [ ] ⬆️ Action-conditioned autoregressive denoising training
- [ ] 🔄 Multi-scenario joint training pipeline
- [ ] **Release dataset preprocessing tools**
- [ ] **Provide unified evaluation toolkit**
## 📖 Introduction
**TL;DR:** Astra is an **interactive world model** that delivers realistic long-horizon video rollouts under a wide range of scenarios and action inputs.
## Gallery
### Astra+Wan2.1
## ⚙️ Code: Astra + Wan2.1 (Inference & Training)
Astra is built upon [Wan2.1-1.3B](https://github.com/Wan-Video/Wan2.1), a diffusion-based video generation model. We provide inference scripts to help you quickly generate videos from images and action inputs. Follow the steps below:
### Inference
Step 1: Set up the environment
[DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio) requires Rust and Cargo to compile extensions. You can install them using the following command:
```shell
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
. "$HOME/.cargo/env"
```
Then clone the Astra repository, which is built on [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio), and install it in editable mode:
```shell
git clone https://github.com/EternalEvan/Astra.git
cd Astra
pip install -e .
```
Step 2: Download the pretrained checkpoints
1. Download the pre-trained Wan2.1 models
```shell
cd script
python download_wan2.1.py
```
2. Download the pre-trained Astra checkpoint

Download the checkpoint from [Hugging Face](https://huggingface.co/wjque/lyra/blob/main/diffusion_pytorch_model.ckpt) and place it in `models/Astra/checkpoints`.
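If you prefer scripting the download, `huggingface_hub` can fetch the file directly. The repo id and filename below are read off the link above and are assumptions that may change if the hosting location moves:

```python
# Hypothetical download helper; repo id and filename taken from the link
# above -- adjust them if the checkpoint is re-hosted.
from pathlib import Path

REPO_ID = "wjque/lyra"
FILENAME = "diffusion_pytorch_model.ckpt"
TARGET_DIR = Path("models/Astra/checkpoints")

def checkpoint_path(target_dir: Path = TARGET_DIR) -> Path:
    # Where the inference script is assumed to look for the weights.
    return target_dir / FILENAME

if __name__ == "__main__":
    from huggingface_hub import hf_hub_download
    TARGET_DIR.mkdir(parents=True, exist_ok=True)
    hf_hub_download(repo_id=REPO_ID, filename=FILENAME, local_dir=str(TARGET_DIR))
    print(checkpoint_path())
```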
Step 3: Test the example videos
```shell
python inference_astra.py --cam_type 1
```
Step 4: Test your own videos
If you want to test your own videos, prepare your test data following the structure of the `example_test_data` folder: N MP4 videos, each with at least 81 frames, plus a `metadata.csv` file that stores their paths and corresponding captions. See the [Prompt Extension section](https://github.com/Wan-Video/Wan2.1?tab=readme-ov-file#2-using-prompt-extension) of Wan2.1 for guidance on preparing video captions.
```shell
python inference_astra.py --cam_type 1 --dataset_path path/to/your/data
```
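As one possible way to assemble the `metadata.csv`, the sketch below scans a folder for MP4s and writes a two-column CSV. The column names `file_name` and `text` are assumptions; check `example_test_data` for the exact schema the inference script expects:

```python
# Sketch: build a metadata.csv from a folder of MP4 videos.
# Column names "file_name" and "text" are assumptions -- verify them
# against the example_test_data folder before use.
import csv
from pathlib import Path

def build_metadata(data_dir, captions, out_name="metadata.csv"):
    """Write one CSV row per .mp4 in data_dir; captions maps name -> text."""
    data_dir = Path(data_dir)
    rows = []
    for video in sorted(data_dir.glob("*.mp4")):
        # Fall back to an empty caption when none is provided.
        rows.append({"file_name": video.name, "text": captions.get(video.name, "")})
    out_path = data_dir / out_name
    with out_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["file_name", "text"])
        writer.writeheader()
        writer.writerows(rows)
    return out_path
```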
We provide several preset camera types, as shown in the table below. Additionally, you can generate new trajectories for testing.
| cam_type | Trajectory |
|-------------------|-----------------------------|
| 1 | Pan Right |
| 2 | Pan Left |
| 3 | Tilt Up |
| 4 | Tilt Down |
| 5 | Zoom In |
| 6 | Zoom Out |
| 7 | Translate Up (with rotation) |
| 8 | Translate Down (with rotation) |
| 9 | Arc Left (with rotation) |
| 10 | Arc Right (with rotation) |
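To compare all presets on the same input, a dry-run loop like the one below can queue one run per trajectory (the dataset path is a placeholder, and the `echo` makes it print the commands instead of launching them):

```shell
# Dry-run sketch: print one inference command per preset trajectory.
# Remove the leading `echo` to actually launch the runs.
for cam in $(seq 1 10); do
  echo python inference_astra.py --cam_type "$cam" --dataset_path path/to/your/data
done
```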
### Training
Step 1: Set up the environment
```shell
pip install lightning pandas websockets
```
Step 2: Prepare the training dataset
1. Download the [MultiCamVideo dataset](https://huggingface.co/datasets/KwaiVGI/MultiCamVideo-Dataset).
2. Extract VAE features
```shell
CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" python train_recammaster.py \
  --task data_process \
  --dataset_path path/to/the/MultiCamVideo/Dataset \
  --output_path ./models \
  --text_encoder_path "models/Wan-AI/Wan2.1-T2V-1.3B/models_t5_umt5-xxl-enc-bf16.pth" \
  --vae_path "models/Wan-AI/Wan2.1-T2V-1.3B/Wan2.1_VAE.pth" \
  --tiled \
  --num_frames 81 \
  --height 480 \
  --width 832 \
  --dataloader_num_workers 2
```
3. Generate Captions for Each Video
You can use a video captioning tool such as [LLaVA](https://github.com/haotian-liu/LLaVA) to generate a caption for each video and store the captions in the `metadata.csv` file.
Step 3: Training
```shell
CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" python train_recammaster.py \
  --task train \
  --dataset_path recam_train_data \
  --output_path ./models/train \
  --dit_path "models/Wan-AI/Wan2.1-T2V-1.3B/diffusion_pytorch_model.safetensors" \
  --steps_per_epoch 8000 \
  --max_epochs 100 \
  --learning_rate 1e-4 \
  --accumulate_grad_batches 1 \
  --use_gradient_checkpointing \
  --dataloader_num_workers 4
```
We did not search for the optimal set of hyper-parameters and trained with a batch size of 1 on each GPU. You may achieve better model performance by tuning hyper-parameters such as the learning rate, or by increasing the batch size.
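For reference, the effective global batch size of the training command above follows from the GPU count, the per-GPU batch size, and the gradient-accumulation factor:

```python
# Effective global batch size = GPUs x per-GPU batch x accumulated steps.
def effective_batch_size(num_gpus: int, per_gpu_batch: int,
                         accumulate_grad_batches: int) -> int:
    return num_gpus * per_gpu_batch * accumulate_grad_batches

# Defaults above: 8 GPUs, batch size 1 per GPU, accumulate_grad_batches=1.
print(effective_batch_size(8, 1, 1))  # 8
```

Raising `--accumulate_grad_batches` is one way to grow the effective batch size without extra GPU memory.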
Step 4: Test the model
```shell
python inference_recammaster.py --cam_type 1 --ckpt_path path/to/the/checkpoint
```
## 🤗 Awesome Related Works
Feel free to explore these outstanding related works, including but not limited to:
- [ReCamMaster](https://github.com/KlingTeam/ReCamMaster): re-captures in-the-wild videos with novel camera trajectories.
- [GCD](https://gcd.cs.columbia.edu/): synthesizes large-angle novel viewpoints of 4D dynamic scenes from a monocular video.
- [ReCapture](https://generative-video-camera-controls.github.io/): generates new videos with novel camera trajectories from a single user-provided video.
- [Trajectory Attention](https://xizaoqu.github.io/trajattn/): facilitates tasks such as camera motion control on images and videos, and video editing.
- [GS-DiT](https://wkbian.github.io/Projects/GS-DiT/): provides 4D video control for a single monocular video.
- [Diffusion as Shader](https://igl-hkust.github.io/das/): a versatile video generation control model for various tasks.
- [TrajectoryCrafter](https://trajectorycrafter.github.io/): achieves high-fidelity novel view generation from casually captured monocular video.
- [GEN3C](https://research.nvidia.com/labs/toronto-ai/GEN3C/): a generative video model with precise camera control and temporal 3D consistency.
## 🌟 Citation
Please leave us a star 🌟 and cite our paper if you find our work helpful.
```bibtex
@inproceedings{zhu2025astra,
  title     = {Astra: General Interactive World Model with Autoregressive Denoising},
  author    = {Zhu, Yixuan and Feng, Jiaqi and Zheng, Wenzhao and Gao, Yuan and Tao, Xin and Wan, Pengfei and Zhou, Jie and Lu, Jiwen},
  booktitle = {arXiv},
  year      = {2025}
}
```