---
license: mit
tags:
- video-generation
- diffusion
- world-model
library_name: diffusion
model-index:
- name: Astra
results: []
---
# Astra 🌏: General Interactive World Model with Autoregressive Denoising
**Yixuan Zhu¹, Jiaqi Feng¹, Wenzhao Zheng¹ †, Yuan Gao², Xin Tao², Pengfei Wan², Jie Zhou¹, Jiwen Lu¹**

(\*Work done during an internship at Kuaishou Technology; † Project leader)

¹Tsinghua University, ²Kuaishou Technology.
## 🔥 Updates
- __[2025.11.17]__: Released the [project page](https://eternalevan.github.io/Astra-project/).
- __[2025.12.09]__: Released the training and inference code and the model checkpoint.
## 🎯 TODO List
- [ ] **Release full inference pipelines** for additional scenarios:
- [ ] 🚗 Autonomous driving
- [ ] 🤖 Robotic manipulation
- [ ] 🛸 Drone navigation / exploration
- [ ] **Open-source training scripts**:
- [ ] ⬆️ Action-conditioned autoregressive denoising training
- [ ] 🔄 Multi-scenario joint training pipeline
- [ ] **Release dataset preprocessing tools**
- [ ] **Provide unified evaluation toolkit**
## 📖 Introduction
**TL;DR:** Astra is an **interactive world model** that delivers realistic long-horizon video rollouts under a wide range of scenarios and action inputs.
## Gallery
### Astra+Wan2.1
## ⚙️ Code: Astra + Wan2.1 (Inference & Training)
Astra is built upon [Wan2.1-1.3B](https://github.com/Wan-Video/Wan2.1), a diffusion-based video generation model. We provide inference scripts to help you quickly generate videos from images and action inputs. Follow the steps below:
### Inference
Step 1: Set up the environment
[DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio) requires Rust and Cargo to compile extensions. You can install them using the following command:
```shell
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
. "$HOME/.cargo/env"
```
Then clone the Astra repository, which is built on [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio), and install it in editable mode:
```shell
git clone https://github.com/EternalEvan/Astra.git
cd Astra
pip install -e .
```
Step 2: Download the pretrained checkpoints
1. Download the pre-trained Wan2.1 models
```shell
cd script
python download_wan2.1.py
```
2. Download the pre-trained Astra checkpoint

Download the checkpoint from [Hugging Face](https://huggingface.co/wjque/lyra/blob/main/diffusion_pytorch_model.ckpt) and place it in `models/Astra/checkpoints`.
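If you prefer scripting the download, `huggingface_hub` can fetch the file directly. The repo id and filename below are read off the link above and are assumptions that may change if the hosting location moves:

```python
# Hypothetical download helper; repo id and filename taken from the link
# above -- adjust them if the checkpoint is re-hosted.
from pathlib import Path

REPO_ID = "wjque/lyra"
FILENAME = "diffusion_pytorch_model.ckpt"
TARGET_DIR = Path("models/Astra/checkpoints")

def checkpoint_path(target_dir: Path = TARGET_DIR) -> Path:
    # Where the inference script is assumed to look for the weights.
    return target_dir / FILENAME

if __name__ == "__main__":
    from huggingface_hub import hf_hub_download
    TARGET_DIR.mkdir(parents=True, exist_ok=True)
    hf_hub_download(repo_id=REPO_ID, filename=FILENAME, local_dir=str(TARGET_DIR))
    print(checkpoint_path())
```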
Step 3: Test the example videos
```shell
python inference_astra.py --cam_type 1
```
Step 4: Test your own videos
If you want to test your own videos, prepare your test data following the structure of the `example_test_data` folder: N MP4 videos, each with at least 81 frames, plus a `metadata.csv` file that stores their paths and corresponding captions. See the [Prompt Extension section](https://github.com/Wan-Video/Wan2.1?tab=readme-ov-file#2-using-prompt-extension) of Wan2.1 for guidance on preparing video captions.
```shell
python inference_astra.py --cam_type 1 --dataset_path path/to/your/data
```
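As one possible way to assemble the `metadata.csv`, the sketch below scans a folder for MP4s and writes a two-column CSV. The column names `file_name` and `text` are assumptions; check `example_test_data` for the exact schema the inference script expects:

```python
# Sketch: build a metadata.csv from a folder of MP4 videos.
# Column names "file_name" and "text" are assumptions -- verify them
# against the example_test_data folder before use.
import csv
from pathlib import Path

def build_metadata(data_dir, captions, out_name="metadata.csv"):
    """Write one CSV row per .mp4 in data_dir; captions maps name -> text."""
    data_dir = Path(data_dir)
    rows = []
    for video in sorted(data_dir.glob("*.mp4")):
        # Fall back to an empty caption when none is provided.
        rows.append({"file_name": video.name, "text": captions.get(video.name, "")})
    out_path = data_dir / out_name
    with out_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["file_name", "text"])
        writer.writeheader()
        writer.writerows(rows)
    return out_path
```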
We provide several preset camera types, as shown in the table below. Additionally, you can generate new trajectories for testing.
| cam_type | Trajectory |
|-------------------|-----------------------------|
| 1 | Pan Right |
| 2 | Pan Left |
| 3 | Tilt Up |
| 4 | Tilt Down |
| 5 | Zoom In |
| 6 | Zoom Out |
| 7 | Translate Up (with rotation) |
| 8 | Translate Down (with rotation) |
| 9 | Arc Left (with rotation) |
| 10 | Arc Right (with rotation) |
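To compare all presets on the same input, a dry-run loop like the one below can queue one run per trajectory (the dataset path is a placeholder, and the `echo` makes it print the commands instead of launching them):

```shell
# Dry-run sketch: print one inference command per preset trajectory.
# Remove the leading `echo` to actually launch the runs.
for cam in $(seq 1 10); do
  echo python inference_astra.py --cam_type "$cam" --dataset_path path/to/your/data
done
```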
### Training
Step 1: Set up the environment
```shell
pip install lightning pandas websockets
```
Step 2: Prepare the training dataset
1. Download the [MultiCamVideo dataset](https://huggingface.co/datasets/KwaiVGI/MultiCamVideo-Dataset).
2. Extract VAE features
```shell
CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" python train_recammaster.py \
  --task data_process \
  --dataset_path path/to/the/MultiCamVideo/Dataset \
  --output_path ./models \
  --text_encoder_path "models/Wan-AI/Wan2.1-T2V-1.3B/models_t5_umt5-xxl-enc-bf16.pth" \
  --vae_path "models/Wan-AI/Wan2.1-T2V-1.3B/Wan2.1_VAE.pth" \
  --tiled \
  --num_frames 81 \
  --height 480 \
  --width 832 \
  --dataloader_num_workers 2
```
3. Generate Captions for Each Video
You can use a video captioning tool such as [LLaVA](https://github.com/haotian-liu/LLaVA) to generate a caption for each video and store the captions in the `metadata.csv` file.
Step 3: Training
```shell
CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" python train_recammaster.py \
  --task train \
  --dataset_path recam_train_data \
  --output_path ./models/train \
  --dit_path "models/Wan-AI/Wan2.1-T2V-1.3B/diffusion_pytorch_model.safetensors" \
  --steps_per_epoch 8000 \
  --max_epochs 100 \
  --learning_rate 1e-4 \
  --accumulate_grad_batches 1 \
  --use_gradient_checkpointing \
  --dataloader_num_workers 4
```
We did not search for the optimal set of hyper-parameters and trained with a batch size of 1 on each GPU. You may achieve better model performance by tuning hyper-parameters such as the learning rate, or by increasing the batch size.
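For reference, the effective global batch size of the training command above follows from the GPU count, the per-GPU batch size, and the gradient-accumulation factor:

```python
# Effective global batch size = GPUs x per-GPU batch x accumulated steps.
def effective_batch_size(num_gpus: int, per_gpu_batch: int,
                         accumulate_grad_batches: int) -> int:
    return num_gpus * per_gpu_batch * accumulate_grad_batches

# Defaults above: 8 GPUs, batch size 1 per GPU, accumulate_grad_batches=1.
print(effective_batch_size(8, 1, 1))  # 8
```

Raising `--accumulate_grad_batches` is one way to grow the effective batch size without extra GPU memory.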
Step 4: Test the model
```shell
python inference_recammaster.py --cam_type 1 --ckpt_path path/to/the/checkpoint
```
## 🤗 Awesome Related Works
Feel free to explore these outstanding related works, including but not limited to:
- [ReCamMaster](https://github.com/KlingTeam/ReCamMaster): re-captures in-the-wild videos with novel camera trajectories.
- [GCD](https://gcd.cs.columbia.edu/): synthesizes large-angle novel viewpoints of 4D dynamic scenes from a monocular video.
- [ReCapture](https://generative-video-camera-controls.github.io/): generates new videos with novel camera trajectories from a single user-provided video.
- [Trajectory Attention](https://xizaoqu.github.io/trajattn/): facilitates tasks such as camera motion control on images and videos, and video editing.
- [GS-DiT](https://wkbian.github.io/Projects/GS-DiT/): provides 4D video control for a single monocular video.
- [Diffusion as Shader](https://igl-hkust.github.io/das/): a versatile video generation control model for various tasks.
- [TrajectoryCrafter](https://trajectorycrafter.github.io/): achieves high-fidelity novel view generation from casually captured monocular video.
- [GEN3C](https://research.nvidia.com/labs/toronto-ai/GEN3C/): a generative video model with precise camera control and temporal 3D consistency.
## 🌟 Citation
Please leave us a star 🌟 and cite our paper if you find our work helpful.
```bibtex
@inproceedings{zhu2025astra,
  title     = {Astra: General Interactive World Model with Autoregressive Denoising},
  author    = {Zhu, Yixuan and Feng, Jiaqi and Zheng, Wenzhao and Gao, Yuan and Tao, Xin and Wan, Pengfei and Zhou, Jie and Lu, Jiwen},
  booktitle = {arXiv},
  year      = {2025}
}
```