diff --git a/.DS_Store b/.DS_Store deleted file mode 100644 index f064bd0dbbb637bbbcea87d28372b0538836c572..0000000000000000000000000000000000000000 Binary files a/.DS_Store and /dev/null differ diff --git a/.gitattributes b/.gitattributes index 44fae5a1a8dd53b77161d2037abfdf0782386d11..1206523dafd3ed8757110b7bc19fba84a316cb15 100644 --- a/.gitattributes +++ b/.gitattributes @@ -40,4 +40,3 @@ diffsynth/tokenizer_configs/kolors/tokenizer/vocab.txt filter=lfs diff=lfs merge examples/output_videos/output_moe_framepack_sliding.mp4 filter=lfs diff=lfs merge=lfs -text logo-text-2.png filter=lfs diff=lfs merge=lfs -text assets/images/logo-text-2.png filter=lfs diff=lfs merge=lfs -text -pipeline.png filter=lfs diff=lfs merge=lfs -text diff --git a/.gitignore b/.gitignore deleted file mode 100644 index 324e69a0773535fcc114384e651d40252a619903..0000000000000000000000000000000000000000 --- a/.gitignore +++ /dev/null @@ -1,3 +0,0 @@ -# Ignore all checkpoint files -*.ckpt -*.ckpt.* \ No newline at end of file diff --git a/README.md b/README.md index 6b944907d5f3a8074797bc8aa19784130b13378c..56e2800b526e74f3636d8a5466f64b7896f90ae6 100644 --- a/README.md +++ b/README.md @@ -20,7 +20,7 @@ model-index:

📄 - [arXiv] + [arXiv]    🏠 [Project Page] @@ -48,23 +48,36 @@ model-index:
-**[Yixuan Zhu1](https://eternalevan.github.io/), [Jiaqi Feng1](https://github.com/Aurora-edu/), [Wenzhao Zheng1 †](https://wzzheng.net), [Yuan Gao2](https://openreview.net/profile?id=~Yuan_Gao32), [Xin Tao2](https://www.xtao.website), [Pengfei Wan2](https://scholar.google.com/citations?user=P6MraaYAAAAJ&hl=en), [Jie Zhou 1](https://scholar.google.com/citations?user=6a79aPwAAAAJ&hl=en&authuser=1), [Jiwen Lu1](https://ivg.au.tsinghua.edu.cn/Jiwen_Lu/)** +**[Yixuan Zhu1](https://eternalevan.github.io/), [Jiaqi Feng1](https://github.com/Aurora-edu/), [Wenzhao Zheng1 †](https://wzzheng.net), [Yuan Gao2](https://openreview.net/profile?id=~Yuan_Gao32), [Xin Tao2](https://www.xtao.website), [Pengfei Wan2](https://scholar.google.com/citations?user=P6MraaYAAAAJ&hl=en), [Jie Zhou 1](https://scholar.google.com/citations?user=6a79aPwAAAAJ&hl=en&authuser=1), [Jiwen Lu1](https://ivg.au.tsinghua.edu.cn/Jiwen_Lu/)** -(† Project leader) +(*Work done during an internship at Kuaishou Technology, +† Project leader) 1Tsinghua University, 2Kuaishou Technology.
+## 🔥 Updates +- __[2025.11.17]__: Released the [project page](https://eternalevan.github.io/Astra-project/). +- __[2025.12.09]__: Released the training and inference code and the model checkpoint. -## 📖 Introduction +## 🎯 TODO List -**TL;DR:** Astra is an **interactive world model** that delivers realistic long-horizon video rollouts under a wide range of scenarios and action inputs. +- [ ] **Release full inference pipelines** for additional scenarios: + - [ ] 🚗 Autonomous driving + - [ ] 🤖 Robotic manipulation + - [ ] 🛸 Drone navigation / exploration -**Astra** is an **interactive**, action-driven world model that predicts long-horizon future videos across diverse real-world scenarios. Built on an autoregressive diffusion transformer with temporal causal attention, Astra supports **streaming prediction** while preserving strong temporal coherence. Astra introduces **noise-augmented history memory** to stabilize long rollouts, an **action-aware adapter** for precise control signals, and a **mixture of action experts** to route heterogeneous action modalities. Through these key innovations, Astra delivers consistent, controllable, and high-fidelity video futures for applications such as autonomous driving, robot manipulation, and camera motion. -
- Astra Pipeline -
+- [ ] **Open-source training scripts**: + - [ ] ⬆️ Action-conditioned autoregressive denoising training + - [ ] 🔄 Multi-scenario joint training pipeline + +- [ ] **Release dataset preprocessing tools** + +- [ ] **Provide unified evaluation toolkit** +## 📖 Introduction + +**TL;DR:** Astra is an **interactive world model** that delivers realistic long-horizon video rollouts under a wide range of scenarios and action inputs. ## Gallery @@ -119,27 +132,7 @@ model-index: If you would like to use ReCamMaster as a baseline and need qualitative or quantitative comparisons, please feel free to drop an email to [jianhongbai@zju.edu.cn](mailto:jianhongbai@zju.edu.cn). We can assist you with batch inference of our model. --> -## 🔥 Updates -- __[2025.11.17]__: Release the [project page](https://eternalevan.github.io/Astra-project/). -- __[2025.12.09]__: Release the inference code, model checkpoint. - -## 🎯 TODO List - -- [ ] **Release full inference pipelines** for additional scenarios: - - [ ] 🚗 Autonomous driving - - [ ] 🤖 Robotic manipulation - - [ ] 🛸 Drone navigation / exploration - - -- [ ] **Open-source training scripts**: - - [ ] ⬆️ Action-conditioned autoregressive denoising training - - [ ] 🔄 Multi-scenario joint training pipeline - -- [ ] **Release dataset preprocessing tools** - -- [ ] **Provide unified evaluation toolkit** - -## ⚙️ Run Astra (Inference) +## ⚙️ Code: Astra + Wan2.1 (Inference & Training) Astra is built upon [Wan2.1-1.3B](https://github.com/Wan-Video/Wan2.1), a diffusion-based video generation model. We provide inference scripts to help you quickly generate videos from images and action inputs. Follow the steps below: ### Inference @@ -167,57 +160,37 @@ python download_wan2.1.py ``` 2. Download the pre-trained Astra checkpoint -Please download from [huggingface](https://huggingface.co/EvanEternal/Astra/blob/main/models/Astra/checkpoints/diffusion_pytorch_model.ckpt) and place it in ```models/Astra/checkpoints```. 
+Please download from [huggingface](https://huggingface.co/EvanEternal/Astra/blob/main/models/Astra/checkpoints/diffusion_pytorch_model.ckpt) and place it in ```models/Astra/checkpoints```. -Step 3: Test the example image +Step 3: Test the example videos ```shell -python infer_demo.py \ - --dit_path ../models/Astra/checkpoints/diffusion_pytorch_model.ckpt \ - --wan_model_path ../models/Wan-AI/Wan2.1-T2V-1.3B \ - --condition_image ../examples/condition_images/garden_1.png \ - --cam_type 4 \ - --prompt "A sunlit European street lined with historic buildings and vibrant greenery creates a warm, charming, and inviting atmosphere. The scene shows a picturesque open square paved with red bricks, surrounded by classic narrow townhouses featuring tall windows, gabled roofs, and dark-painted facades. On the right side, a lush arrangement of potted plants and blooming flowers adds rich color and texture to the foreground. A vintage-style streetlamp stands prominently near the center-right, contributing to the timeless character of the street. Mature trees frame the background, their leaves glowing in the warm afternoon sunlight. Bicycles are visible along the edges of the buildings, reinforcing the urban yet leisurely feel. The sky is bright blue with scattered clouds, and soft sun flares enter the frame from the left, enhancing the scene’s inviting, peaceful mood." \ - --output_path ../examples/output_videos/output_moe_framepack_sliding.mp4 \ +python inference_astra.py --cam_type 1 ``` -Step 4: Test your own images +Step 4: Test your own videos -To test with your own custom images, you need to prepare the target images and their corresponding text prompts. **We recommend that the size of the input images is close to 832×480 (width × height)**, which is consistent with the resolution of the generated video and can help achieve better video generation effects. 
For prompts generation, you can refer to the [Prompt Extension section](https://github.com/Wan-Video/Wan2.1?tab=readme-ov-file#2-using-prompt-extension) in Wan2.1 for guidance on crafting the captions. +If you want to test your own videos, you need to prepare your test data following the structure of the ```example_test_data``` folder. This includes N mp4 videos, each with at least 81 frames, and a ```metadata.csv``` file that stores their paths and corresponding captions. You can refer to the [Prompt Extension section](https://github.com/Wan-Video/Wan2.1?tab=readme-ov-file#2-using-prompt-extension) in Wan2.1 for guidance on preparing video captions. ```shell -python infer_demo.py \ - --dit_path path/to/your/dit_ckpt \ - --wan_model_path path/to/your/Wan2.1-T2V-1.3B \ - --condition_image path/to/your/image \ - --cam_type your_cam_type \ - --prompt your_prompt \ - --output_path path/to/your/output_video +python inference_astra.py --cam_type 1 --dataset_path path/to/your/data ``` We provide several preset camera types, as shown in the table below. Additionally, you can generate new trajectories for testing. -| cam_type | Trajectory | -|:-----------:|-----------------------------| -| 1 | Move Forward (Straight) | -| 2 | Rotate Left In Place | -| 3 | Rotate Right In Place | -| 4 | Move Forward + Rotate Left | -| 5 | Move Forward + Rotate Right | -| 6 | S-shaped Trajectory | -| 7 | Rotate Left → Rotate Right | - - -## Future Work 🚀 - -Looking ahead, we plan to further enhance Astra in several directions: - -- **Training with Wan-2.2:** Upgrade our model using the latest Wan-2.2 framework to release a more powerful version with improved generation quality. -- **3D Spatial Consistency:** Explore techniques to better preserve 3D consistency across frames for more coherent and realistic video generation. -- **Long-Term Memory:** Incorporate mechanisms for long-term memory, enabling the model to handle extended temporal dependencies and complex action sequences. 
- -These directions aim to push Astra towards more robust and interactive video world modeling. +| cam_type | Trajectory | +|:-----------------:|-----------------------------| +| 1 | Pan Right | +| 2 | Pan Left | +| 3 | Tilt Up | +| 4 | Tilt Down | +| 5 | Zoom In | +| 6 | Zoom Out | +| 7 | Translate Up (with rotation) | +| 8 | Translate Down (with rotation) | +| 9 | Arc Left (with rotation) | +| 10 | Arc Right (with rotation) | -
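The test-data layout required in Step 4 (N mp4 clips plus a ```metadata.csv``` listing their paths and captions) can be sketched as below. This is a hypothetical helper, not part of the repo: the column names ```file_name``` and ```text``` are assumptions, so check the ```example_test_data``` folder for the exact schema that ```inference_astra.py``` expects.

```python
# Hypothetical sketch: build the metadata.csv that pairs each test video
# with its caption. Column names ("file_name", "text") are assumptions --
# verify them against the example_test_data folder in the repo.
import csv
from pathlib import Path


def write_metadata(data_dir: str, captions: dict) -> Path:
    """Map every .mp4 under data_dir to its caption in data_dir/metadata.csv."""
    root = Path(data_dir)
    rows = [
        {"file_name": video.name, "text": captions.get(video.name, "")}
        for video in sorted(root.glob("*.mp4"))  # stable, sorted order
    ]
    out_path = root / "metadata.csv"
    with out_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["file_name", "text"])
        writer.writeheader()
        writer.writerows(rows)
    return out_path
```

Each video should have at least 81 frames, as noted above; captions can be produced following the Prompt Extension guidance linked from Wan2.1.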