Add metadata and paper/code links for FantasyVLN

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +27 -5
README.md CHANGED
@@ -1,12 +1,34 @@
  ---
- license: apache-2.0
  language:
  - en
+ license: apache-2.0
+ library_name: transformers
+ pipeline_tag: robotics
  ---
- # FantasyVLN

- The model weights of **FantasyVLN**.
+ # FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation
+
+ **FantasyVLN** is a unified multimodal Chain-of-Thought (CoT) reasoning framework that enables efficient and precise navigation based on natural language instructions and visual observations.
+
+ - **Paper:** [FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation](https://huggingface.co/papers/2601.13976)
+ - **Project Page:** [https://fantasy-amap.github.io/fantasy-vln/](https://fantasy-amap.github.io/fantasy-vln/)
+ - **Code:** [https://github.com/Fantasy-AMAP/fantasy-vln](https://github.com/Fantasy-AMAP/fantasy-vln)
+
+ ## Introduction
+
+ Achieving human-level performance in Vision-and-Language Navigation (VLN) requires an embodied agent to jointly understand multimodal instructions and visual-spatial context while reasoning over long action sequences.
+
+ FantasyVLN combines the benefits of textual, visual, and multimodal CoT reasoning by constructing a unified representation space across these reasoning modes. To enable efficient reasoning, we align these CoT reasoning modes with non-CoT reasoning during training, while using only non-CoT reasoning at test time. Notably, we perform visual CoT in the latent space of a [VAR](https://github.com/FoundationVision/VAR) model, where only low-scale latent representations are predicted. Compared to traditional pixel-level visual CoT methods, our approach significantly improves both training and inference efficiency, reducing inference latency by an order of magnitude compared to explicit CoT methods.
+
+ ## Citation

- **FantasyVLN** is a unified multimodal Chain-of-Thought (CoT) reasoning framework that enables efficient and precise navigation based on natural language instructions and visual observations. **FantasyVLN** combines the benefits of textual, visual, and multimodal CoT reasoning by constructing a unified representation space across these reasoning modes. To enable efficient reasoning, we align these CoT reasoning modes with non-CoT reasoning during training, while using only non-CoT reasoning at test time. Notably, we perform visual CoT in the latent space of a [VAR](https://github.com/FoundationVision/VAR) model, where only low-scale latent representations are predicted. Compared to traditional pixel-level visual CoT methods, our approach significantly improves both training and inference efficiency.
+ If you find this work helpful, please consider citing:

- See the offical code for detail: [https://fantasy-amap.github.io/fantasy-vln](https://fantasy-amap.github.io/fantasy-vln/)
+ ```bibtex
+ @article{zuo2025fantasyvln,
+ title={FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation},
+ author={Zuo, Jing and Mu, Lingzhou and Jiang, Fan and Ma, Chengcheng and Xu, Mu and Qi, Yonggang},
+ journal={arXiv preprint arXiv:2601.13976},
+ year={2025}
+ }
+ ```
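
The metadata this change adds to the README front matter can be sanity-checked locally. The following is a minimal sketch, not the Hub's own validation logic: `parse_front_matter`, `README_HEAD`, and `EXPECTED_KEYS` are hypothetical names introduced here for illustration, and the parser only handles the flat `key: value` and `- item` forms that appear in this front matter.

```python
# Hypothetical helper for illustration only; it handles just the flat
# front matter used in this README, not general YAML.
README_HEAD = """\
---
language:
- en
license: apache-2.0
library_name: transformers
pipeline_tag: robotics
---

# FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation
"""

# Keys this change is expected to provide (an assumption based on the diff).
EXPECTED_KEYS = ("language", "license", "library_name", "pipeline_tag")


def parse_front_matter(text: str) -> dict:
    """Parse `key: value` pairs and simple `- item` lists between the --- fences."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # no front matter at the top of the file
    meta = {}
    current_key = None  # most recent key that had no inline value
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":
            return meta  # closing fence ends the metadata block
        if not stripped:
            continue
        if stripped.startswith("- "):
            # List item: attach it to the pending key, if there is one.
            if current_key is not None:
                meta.setdefault(current_key, []).append(stripped[2:].strip())
            continue
        if ":" in stripped:
            key, _, value = stripped.partition(":")
            key, value = key.strip(), value.strip()
            if value:
                meta[key] = value
                current_key = None
            else:
                current_key = key
    return {}  # closing fence never found


meta = parse_front_matter(README_HEAD)
missing = [key for key in EXPECTED_KEYS if key not in meta]
print("metadata:", meta)
print("missing keys:", missing)
```

For anything beyond this flat layout, a real YAML parser would be the safer choice.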