---
license: mit
pipeline_tag: robotics
library_name: transformers
---

# VLN-PE Benchmark Models

This repository hosts models and results for the [Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities](https://huggingface.co/papers/2507.13019) benchmark.

VLN-PE is a physically realistic Vision-and-Language Navigation (VLN) platform that supports humanoid, quadruped, and wheeled robots. It aims to bridge the gap between the idealized assumptions of prior VLN benchmarks and the challenges of physical deployment by systematically evaluating ego-centric VLN methods across different technical pipelines.

* **Project Page**: https://crystalsixone.github.io/vln_pe.github.io/
* **Code Repository**: https://github.com/InternRobotics/InternNav
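To fetch a checkpoint programmatically, the snippet below is a minimal sketch using `huggingface_hub`; the `repo_id` and `filename` shown are illustrative placeholders rather than confirmed paths, so use the download links in the tables below for the exact locations.

```python
# Minimal sketch: fetch one of the benchmark checkpoints with huggingface_hub.
# NOTE: repo_id and filename are illustrative placeholders, not confirmed
# paths -- use the download links in the benchmark tables for real locations.
from huggingface_hub import hf_hub_download

checkpoint_path = hf_hub_download(
    repo_id="InternRobotics/VLN-PE",  # placeholder repository ID
    filename="cma_r2r_vln_pe.ckpt",   # placeholder checkpoint filename
)
print(f"Checkpoint downloaded to: {checkpoint_path}")
```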
## Benchmark Results

The following tables report results for various models evaluated on the VLN-PE platform, split into the **Val Seen** and **Val Unseen** splits.

**VLN-PE Benchmark**

Metric abbreviations: TL (trajectory length, m), NE (navigation error, m), FR (fall rate, %), StR (stuck rate, %), OS (oracle success rate, %), SR (success rate, %), SPL (success weighted by path length, %). Lower is better for NE, FR, and StR; higher is better for OS, SR, and SPL.

**Val Seen**

| Model | Dataset/Benchmark | TL | NE | FR | StR | OS | SR | SPL | Download |
|---|---|---|---|---|---|---|---|---|---|
| *Zero-shot transfer evaluation from VLN-CE* | | | | | | | | | |
| Seq2Seq-Full | R2R VLN-PE | 7.80 | 7.62 | 20.21 | 3.04 | 19.30 | 15.20 | 12.79 | model |
| CMA-Full | R2R VLN-PE | 6.62 | 7.37 | 20.06 | 3.95 | 18.54 | 16.11 | 14.61 | model |
| *Train on VLN-PE* | | | | | | | | | |
| Seq2Seq | R2R VLN-PE | 10.61 | 7.53 | 27.36 | 4.26 | 32.67 | 19.75 | 14.68 | model |
| CMA | R2R VLN-PE | 11.13 | 7.59 | 23.71 | 3.19 | 34.94 | 21.58 | 16.10 | model |
| RDP | R2R VLN-PE | 13.26 | 6.76 | 27.51 | 1.82 | 38.60 | 25.08 | 17.07 | model |
| Seq2Seq+ | R2R VLN-PE | 10.22 | 7.75 | 33.43 | 3.19 | 30.09 | 16.86 | 12.54 | model |
| CMA+ | R2R VLN-PE | 8.86 | 7.14 | 23.56 | 3.50 | 36.17 | 25.84 | 21.75 | model |

**Val Unseen**

| Model | Dataset/Benchmark | TL | NE | FR | StR | OS | SR | SPL | Download |
|---|---|---|---|---|---|---|---|---|---|
| *Zero-shot transfer evaluation from VLN-CE* | | | | | | | | | |
| Seq2Seq-Full | R2R VLN-PE | 7.73 | 7.18 | 18.04 | 3.04 | 22.42 | 16.48 | 14.11 | model |
| CMA-Full | R2R VLN-PE | 6.58 | 7.09 | 17.07 | 3.79 | 20.86 | 16.93 | 15.24 | model |
| *Train on VLN-PE* | | | | | | | | | |
| Seq2Seq | R2R VLN-PE | 10.85 | 7.88 | 26.80 | 5.57 | 28.13 | 15.14 | 10.77 | model |
| CMA | R2R VLN-PE | 11.16 | 7.98 | 22.64 | 3.27 | 33.11 | 19.15 | 14.05 | model |
| RDP | R2R VLN-PE | 12.70 | 6.72 | 24.57 | 3.11 | 36.90 | 25.24 | 17.73 | model |
| Seq2Seq+ | R2R VLN-PE | 9.88 | 7.85 | 26.27 | 6.52 | 28.79 | 16.56 | 12.70 | model |
| CMA+ | R2R VLN-PE | 8.79 | 7.26 | 21.75 | 3.27 | 31.40 | 22.12 | 18.65 | model |