EmbodiedCity/WorldVLN

aerial-vision-language-navigation

Improve model card and add metadata

#1

by nielsr HF Staff - opened May 18

base: refs/heads/main ← from: refs/pr/1

Files changed (1)

README.md +18 -10

@@ -1,6 +1,8 @@
 ---
 language:
 - en
+license: cc-by-4.0
+pipeline_tag: robotics
 tags:
 - embodied-ai
 - aerial-vision-language-navigation
@@ -8,28 +10,34 @@ tags:
 - model-weights
 ---
-# WorldVLN Model Weights
-This repository contains the model weights introduced in the paper:
-[WorldVLN: Autoregressive World Action Model for Aerial Vision-Language Navigation].
-It includes the weights for the world model backbone and the action decoder.
-For more details about the model and its implementation, please refer to the GitHub repository:
-https://github.com/EmbodiedCity/WorldVLN.code
+# WorldVLN: Autoregressive World Action Model for Aerial Vision-Language Navigation
+This repository contains the model weights for WorldVLN, the first autoregressive world action model for aerial vision-language navigation (VLN).
+[**Paper**](https://huggingface.co/papers/2605.15964) | [**Project Page**](https://embodiedcity.github.io/WorldVLN/) | [**Code**](https://github.com/EmbodiedCity/WorldVLN.code)
+WorldVLN formulates aerial navigation as a prediction-driven world-action problem. It adapts a latent autoregressive video backbone to predict short-horizon world-state transitions and decodes them directly into executable waypoint actions. After each action segment is executed, newly received observations are encoded back into the autoregressive context, enabling closed-loop world-action prediction.
+## Model Weights
+This repository includes the weights for:
+- The world model backbone.
+- The action decoder.
+## Usage
+For detailed instructions on installation, setup, and inference (including the autoregressive I/O protocol), please refer to the [official GitHub repository](https://github.com/EmbodiedCity/WorldVLN.code).
 ## Citation
-If this work has contributed to your research, welcome to cite it:
+If this work is useful for your research, please cite:
 ```bibtex
 @misc{zhao2026worldvln,
-      title={WorldVLN: Autoregressive World Action Model for Aerial Vision-Language Navigation},
+      title={WorldVLN: Autoregressive World Action Model for Aerial Vision-Language Navigation},
       author={Baining Zhao and Jiacheng Xu and Weicheng Feng and Xin Zhang and Zhaolu Wang and Haoyang Wang and Shilong Ji and Ziyou Wang and Jianjie Fang and Zhiheng Zheng and Weichen Zhang and Yu Shang and Wei Wu and Chen Gao and Xinlei Chen and Yong Li},
       year={2026},
       eprint={2605.15964},
       archivePrefix={arXiv},
       primaryClass={cs.RO},
-      url={https://arxiv.org/abs/2605.15964},
+      url={https://arxiv.org/abs/2605.15964},
 }
-```
+```
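
Note on the closed-loop description added in this PR: the new model card text describes a cycle of predicting a short-horizon world-state transition, decoding it into waypoint actions, executing the segment, and encoding the new observation back into the autoregressive context. The sketch below only illustrates that control flow. Every name in it (`closed_loop_rollout`, the `toy_*` functions, `context_len`) is hypothetical and is not the WorldVLN API, and the toy stand-ins replace the real backbone and action decoder with trivial arithmetic on positions. The actual I/O protocol is documented in the GitHub repository.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

State = Tuple[float, float, float]     # toy latent state: an (x, y, z) position
Waypoint = Tuple[float, float, float]  # toy action: a target (x, y, z) position


def closed_loop_rollout(
    encode: Callable[[State], State],
    predict: Callable[[List[State]], State],
    decode: Callable[[State], List[Waypoint]],
    execute: Callable[[List[Waypoint]], State],
    first_observation: State,
    num_segments: int,
    context_len: int = 8,  # assumed bound on the context; not from the paper
) -> List[Waypoint]:
    """Run predict -> decode -> execute -> re-encode for num_segments segments."""
    # Bounded context: the oldest encoded observation is dropped when full.
    context: Deque[State] = deque([encode(first_observation)], maxlen=context_len)
    trajectory: List[Waypoint] = []
    for _ in range(num_segments):
        predicted = predict(list(context))   # short-horizon world-state prediction
        waypoints = decode(predicted)        # prediction -> executable waypoints
        trajectory.extend(waypoints)
        observation = execute(waypoints)     # act, then receive a new observation
        context.append(encode(observation))  # feed the observation back (closed loop)
    return trajectory


# Toy stand-ins (hypothetical): the "world" is just a position on a line.
def toy_encode(observation: State) -> State:
    return observation


def toy_predict(context: List[State]) -> State:
    x, y, z = context[-1]
    return (x + 1.0, y, z)  # predict one unit of forward motion


def toy_decode(predicted: State) -> List[Waypoint]:
    return [predicted]  # one waypoint per segment


def toy_execute(waypoints: List[Waypoint]) -> State:
    return waypoints[-1]  # assume the vehicle reaches the last waypoint exactly


trajectory = closed_loop_rollout(
    toy_encode, toy_predict, toy_decode, toy_execute,
    first_observation=(0.0, 0.0, 0.0),
    num_segments=3,
)
print(trajectory)
```

Because each prediction is conditioned on the observation returned by `execute` rather than on the model's own previous prediction, execution errors would be visible to the next step. That is the property the model card calls closed-loop world-action prediction.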