---
license: mit
tags:
- video-generation
- diffusion
- world-model
library_name: diffusion
model-index:
- name: Astra
  results: []
---
# Astra: General Interactive World Model with Autoregressive Denoising

[arXiv] | [Project Page] | [GitHub]

Yixuan Zhu¹, Jiaqi Feng¹, Wenzhao Zheng¹†, Yuan Gao², Xin Tao², Pengfei Wan², Jie Zhou¹, Jiwen Lu¹

(*Work done during an internship at Kuaishou Technology, †Project leader)

¹Tsinghua University, ²Kuaishou Technology.
## Updates

- [2025.11.17] Released the project page.
- [2025.12.09] Released the training and inference code and model checkpoints.
## TODO List

- Release full inference pipelines for additional scenarios:
  - Autonomous driving
  - Robotic manipulation
  - Drone navigation / exploration
- Open-source training scripts:
  - Action-conditioned autoregressive denoising training
  - Multi-scenario joint training pipeline
- Release dataset preprocessing tools
- Provide a unified evaluation toolkit
## Introduction

TL;DR: Astra is an interactive world model that delivers realistic long-horizon video rollouts under a wide range of scenarios and action inputs.
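As a rough illustration of the autoregressive denoising idea, each new chunk of frames is denoised while conditioning on everything generated so far plus the current action. The sketch below is purely conceptual, not Astra's actual implementation: the chunk size, the `denoise_chunk` stand-in, and the action labels are all invented for illustration.

```python
# Conceptual sketch of an autoregressive denoising rollout.
# All names here (denoise_chunk, CHUNK, the action strings) are illustrative.

CHUNK = 4  # frames generated per autoregressive step (illustrative)

def denoise_chunk(context, action):
    """Toy stand-in for a diffusion denoiser: returns CHUNK new 'frames'
    derived from the conditioning context and the action label."""
    start = len(context)
    return [f"{action}_frame_{start + i}" for i in range(CHUNK)]

def rollout(first_frame, actions):
    """Autoregressively extend a video: each chunk conditions on all
    frames generated so far, which is what keeps long rollouts coherent."""
    frames = [first_frame]
    for action in actions:
        frames.extend(denoise_chunk(frames, action))
    return frames

video = rollout("init_frame", ["pan_right", "pan_right", "zoom_in"])
```

The real model denoises latent video chunks with a diffusion transformer rather than producing strings, but the control flow, conditioning each chunk on the accumulated history and an action input, follows the same pattern.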
## Gallery

### Astra + Wan2.1
## Code: Astra + Wan2.1 (Inference & Training)

Astra is built upon Wan2.1-1.3B, a diffusion-based video generation model. We provide inference scripts to help you quickly generate videos from images and action inputs. Follow the steps below.
### Inference

#### Step 1: Set up the environment

DiffSynth-Studio requires Rust and Cargo to compile extensions. You can install them using the following command:

```shell
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
. "$HOME/.cargo/env"
```

Install DiffSynth-Studio:

```shell
git clone https://github.com/EternalEvan/Astra.git
cd Astra
pip install -e .
```
#### Step 2: Download the pretrained checkpoints

Download the pre-trained Wan2.1 models:

```shell
cd script
python download_wan2.1.py
```

Download the pre-trained Astra checkpoint from Hugging Face and place it in `models/Astra/checkpoints`.
#### Step 3: Test the example videos

```shell
python inference_astra.py --cam_type 1
```
#### Step 4: Test your own videos

To test your own videos, prepare your test data following the structure of the `example_test_data` folder: N mp4 videos, each with at least 81 frames, and a `metadata.csv` file that stores their paths and corresponding captions. You can refer to the Prompt Extension section of Wan2.1 for guidance on preparing video captions.

```shell
python inference_astra.py --cam_type 1 --dataset_path path/to/your/data
```
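The authoritative schema for `metadata.csv` is whatever the `example_test_data` folder contains; as a minimal sketch (the `video_path` and `caption` column names below are our assumption, not confirmed by the repo), the file could be assembled like this:

```python
import csv

# Hypothetical rows: paths to your test videos plus their captions.
# The column names "video_path" and "caption" are assumptions; check
# the example_test_data folder in the repo for the actual schema.
rows = [
    {"video_path": "videos/clip_000.mp4", "caption": "A car drives down a coastal road."},
    {"video_path": "videos/clip_001.mp4", "caption": "A robot arm picks up a red cube."},
]

with open("metadata.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["video_path", "caption"])
    writer.writeheader()
    writer.writerows(rows)
```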
We provide several preset camera types, as shown in the table below. Additionally, you can generate new trajectories for testing.
| cam_type | Trajectory |
|---|---|
| 1 | Pan Right |
| 2 | Pan Left |
| 3 | Tilt Up |
| 4 | Tilt Down |
| 5 | Zoom In |
| 6 | Zoom Out |
| 7 | Translate Up (with rotation) |
| 8 | Translate Down (with rotation) |
| 9 | Arc Left (with rotation) |
| 10 | Arc Right (with rotation) |
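For a new trajectory beyond the presets, the general idea is to produce one camera pose per frame. The sketch below is illustration only: Astra's actual trajectory file format and coordinate conventions are not documented here, so inspect the preset trajectories shipped with the repo before relying on it. It builds an 81-frame rightward translation as a list of 4×4 camera-to-world matrices with a linearly increasing x-offset:

```python
# Illustrative only: generate per-frame camera poses for a simple
# rightward translation. The real format Astra expects may differ.

NUM_FRAMES = 81     # matches the 81-frame clips used above
TOTAL_SHIFT = 0.5   # total x-translation in (arbitrary) world units

def translate_right_trajectory(num_frames=NUM_FRAMES, total_shift=TOTAL_SHIFT):
    poses = []
    for i in range(num_frames):
        t = i / (num_frames - 1)  # 0.0 -> 1.0 across the clip
        pose = [
            [1.0, 0.0, 0.0, t * total_shift],  # x row carries the shift
            [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0],
        ]
        poses.append(pose)
    return poses

traj = translate_right_trajectory()
```

Rotating trajectories (the arc and tilt presets) would additionally vary the upper-left 3×3 rotation block per frame.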
### Training

#### Step 1: Set up the environment

```shell
pip install lightning pandas websockets
```
#### Step 2: Prepare the training dataset

Download the MultiCamVideo dataset.

Extract VAE features:

```shell
CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" python train_recammaster.py \
  --task data_process \
  --dataset_path path/to/the/MultiCamVideo/Dataset \
  --output_path ./models \
  --text_encoder_path "models/Wan-AI/Wan2.1-T2V-1.3B/models_t5_umt5-xxl-enc-bf16.pth" \
  --vae_path "models/Wan-AI/Wan2.1-T2V-1.3B/Wan2.1_VAE.pth" \
  --tiled --num_frames 81 --height 480 --width 832 \
  --dataloader_num_workers 2
```
Generate captions for each video: you can use a video captioning tool such as LLaVA to caption each video and store the results in the `metadata.csv` file.
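A hedged sketch of that step: the `caption_video` function below is a placeholder for whatever captioning model you run (e.g. LLaVA), and the `video_path`/`caption` column names are our assumption, so match them to the schema your training script actually reads.

```python
import csv
from pathlib import Path

def caption_video(path):
    """Placeholder: swap in a real captioner (e.g. LLaVA) here."""
    return f"A video clip named {path.stem}."

def build_metadata(video_dir, out_csv="metadata.csv"):
    """Scan a directory of mp4 files and write one caption row per video."""
    videos = sorted(Path(video_dir).glob("*.mp4"))
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["video_path", "caption"])  # assumed schema
        for v in videos:
            writer.writerow([str(v), caption_video(v)])
    return len(videos)
```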
#### Step 3: Training

```shell
CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" python train_recammaster.py \
  --task train \
  --dataset_path recam_train_data \
  --output_path ./models/train \
  --dit_path "models/Wan-AI/Wan2.1-T2V-1.3B/diffusion_pytorch_model.safetensors" \
  --steps_per_epoch 8000 --max_epochs 100 \
  --learning_rate 1e-4 --accumulate_grad_batches 1 \
  --use_gradient_checkpointing --dataloader_num_workers 4
```
We did not search for the optimal set of hyperparameters and trained with a batch size of 1 on each GPU. You may achieve better model performance by tuning hyperparameters such as the learning rate or by increasing the batch size.
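For reference when scaling, the effective batch size implied by the command above is GPUs × per-GPU batch × gradient-accumulation steps. This is simple bookkeeping with the flag values from the command; it assumes `steps_per_epoch` counts optimizer steps, which the training script may define differently.

```python
# Values taken from the training command above.
num_gpus = 8
per_gpu_batch = 1               # batch size 1 on each GPU
accumulate_grad_batches = 1
steps_per_epoch = 8000
max_epochs = 100

effective_batch = num_gpus * per_gpu_batch * accumulate_grad_batches  # 8
total_steps = steps_per_epoch * max_epochs                            # 800000
```

Doubling `accumulate_grad_batches` doubles the effective batch without extra GPU memory, at the cost of fewer optimizer steps per wall-clock hour.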
#### Step 4: Test the model

```shell
python inference_recammaster.py --cam_type 1 --ckpt_path path/to/the/checkpoint
```
## Awesome Related Works

Feel free to explore these outstanding related works, including but not limited to:

- ReCamMaster: re-captures in-the-wild videos with novel camera trajectories.
- GCD: synthesizes large-angle novel viewpoints of 4D dynamic scenes from a monocular video.
- ReCapture: generates new videos with novel camera trajectories from a single user-provided video.
- Trajectory Attention: facilitates tasks such as camera motion control on images and videos, and video editing.
- GS-DiT: provides 4D video control for a single monocular video.
- Diffusion as Shader: a versatile video generation control model for various tasks.
- TrajectoryCrafter: achieves high-fidelity novel-view generation from casually captured monocular video.
- GEN3C: a generative video model with precise camera control and temporal 3D consistency.
## Citation

Please leave us a star and cite our paper if you find our work helpful.

```bibtex
@inproceedings{zhu2025astra,
  title={Astra: General Interactive World Model with Autoregressive Denoising},
  author={Zhu, Yixuan and Feng, Jiaqi and Zheng, Wenzhao and Gao, Yuan and Tao, Xin and Wan, Pengfei and Zhou, Jie and Lu, Jiwen},
  booktitle={arXiv},
  year={2025}
}
```