---
pipeline_tag: robotics
library_name: transformers
license: mit
---
This repository contains models for the **VLN-PE Benchmark**, as presented in the paper [Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities](https://huggingface.co/papers/2507.13019).
VLN-PE introduces a physically realistic Vision-and-Language Navigation platform supporting humanoid, quadruped, and wheeled robots, and systematically evaluates several ego-centric VLN methods in physical robotic settings.
For more details, visit the [project page](https://crystalsixone.github.io/vln_pe.github.io/) or the main [GitHub repository](https://github.com/InternRobotics/InternNav).
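## VLN-PE Benchmark
Column abbreviations: TL (trajectory length, m), NE (navigation error, m), FR (fall rate, %), StR (stuck rate, %), OS (oracle success rate, %), SR (success rate, %), SPL (success weighted by path length, %).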
<style type="text/css">
.tg {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-c3ow{border-color:inherit;text-align:center;vertical-align:top}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
.tg .tg-fymr{border-color:inherit;font-weight:bold;text-align:left;vertical-align:top}
</style>
<table class="tg"><thead>
<tr>
<th class="tg-c3ow" rowspan="2"><span style="font-weight:bold">Model</span></th>
<th class="tg-0pky" rowspan="2"><span style="font-weight:bold">Dataset/Benchmark</span></th>
<th class="tg-c3ow" colspan="7"><span style="font-weight:bold">Val Seen</span></th>
<th class="tg-c3ow" colspan="7"><span style="font-weight:bold">Val Unseen</span></th>
<th class="tg-fymr" rowspan="2">Download</th>
</tr>
<tr>
<th class="tg-fymr">TL</th>
<th class="tg-fymr">NE</th>
<th class="tg-fymr">FR</th>
<th class="tg-fymr">StR</th>
<th class="tg-fymr">OS</th>
<th class="tg-fymr">SR</th>
<th class="tg-fymr">SPL</th>
<th class="tg-fymr">TL</th>
<th class="tg-fymr">NE</th>
<th class="tg-fymr">FR</th>
<th class="tg-fymr">StR</th>
<th class="tg-fymr">OS</th>
<th class="tg-fymr">SR</th>
<th class="tg-fymr">SPL</th>
</tr></thead>
<tbody>
<tr>
<td class="tg-c3ow" colspan="17">Zero-shot transfer evaluation from VLN-CE</td>
</tr>
<tr>
<td class="tg-0pky">Seq2Seq-Full</td>
<td class="tg-0pky">R2R VLN-PE</td>
<td class="tg-0pky">7.80</td>
<td class="tg-0pky">7.62</td>
<td class="tg-0pky">20.21</td>
<td class="tg-0pky">3.04</td>
<td class="tg-0pky">19.3</td>
<td class="tg-0pky">15.2</td>
<td class="tg-0pky">12.79</td>
<td class="tg-0pky">7.73</td>
<td class="tg-0pky">7.18</td>
<td class="tg-0pky">18.04</td>
<td class="tg-0pky">3.04</td>
<td class="tg-0pky">22.42</td>
<td class="tg-0pky">16.48</td>
<td class="tg-0pky">14.11</td>
<td class="tg-0pky"><a href="https://huggingface.co/InternRobotics/VLN-PE/tree/main/r2r/zero_shot/seq2seq" target="_blank" rel="noopener noreferrer">model</a></td>
</tr>
<tr>
<td class="tg-0pky">CMA-Full</td>
<td class="tg-0pky">R2R VLN-PE</td>
<td class="tg-0pky">6.62</td>
<td class="tg-0pky">7.37</td>
<td class="tg-0pky">20.06</td>
<td class="tg-0pky">3.95</td>
<td class="tg-0pky">18.54</td>
<td class="tg-0pky">16.11</td>
<td class="tg-0pky">14.61</td>
<td class="tg-0pky">6.58</td>
<td class="tg-0pky">7.09</td>
<td class="tg-0pky">17.07</td>
<td class="tg-0pky">3.79</td>
<td class="tg-0pky">20.86</td>
<td class="tg-0pky">16.93</td>
<td class="tg-0pky">15.24</td>
<td class="tg-0pky"><a href="https://huggingface.co/InternRobotics/VLN-PE/tree/main/r2r/zero_shot/cma" target="_blank" rel="noopener noreferrer">model</a></td>
</tr>
<tr>
<td class="tg-c3ow" colspan="17">Train on VLN-PE</td>
</tr>
<tr>
<td class="tg-0pky">Seq2Seq</td>
<td class="tg-0pky">R2R VLN-PE</td>
<td class="tg-0pky">10.61</td>
<td class="tg-0pky">7.53</td>
<td class="tg-0pky">27.36</td>
<td class="tg-0pky">4.26</td>
<td class="tg-0pky">32.67</td>
<td class="tg-0pky">19.75</td>
<td class="tg-0pky">14.68</td>
<td class="tg-0pky">10.85</td>
<td class="tg-0pky">7.88</td>
<td class="tg-0pky">26.8</td>
<td class="tg-0pky">5.57</td>
<td class="tg-0pky">28.13</td>
<td class="tg-0pky">15.14</td>
<td class="tg-0pky">10.77</td>
<td class="tg-0pky"><a href="https://huggingface.co/InternRobotics/VLN-PE/tree/main/r2r/fine_tuned/seq2seq" target="_blank" rel="noopener noreferrer">model</a></td>
</tr>
<tr>
<td class="tg-0pky">CMA</td>
<td class="tg-0pky">R2R VLN-PE</td>
<td class="tg-0pky">11.13</td>
<td class="tg-0pky">7.59</td>
<td class="tg-0pky">23.71</td>
<td class="tg-0pky">3.19</td>
<td class="tg-0pky">34.94</td>
<td class="tg-0pky">21.58</td>
<td class="tg-0pky">16.1</td>
<td class="tg-0pky">11.16</td>
<td class="tg-0pky">7.98</td>
<td class="tg-0pky">22.64</td>
<td class="tg-0pky">3.27</td>
<td class="tg-0pky">33.11</td>
<td class="tg-0pky">19.15</td>
<td class="tg-0pky">14.05</td>
<td class="tg-0pky"><a href="https://huggingface.co/InternRobotics/VLN-PE/tree/main/r2r/fine_tuned/cma" target="_blank" rel="noopener noreferrer">model</a></td>
</tr>
<tr>
<td class="tg-0pky">RDP</td>
<td class="tg-0pky">R2R VLN-PE</td>
<td class="tg-0pky">13.26</td>
<td class="tg-0pky">6.76</td>
<td class="tg-0pky">27.51</td>
<td class="tg-0pky">1.82</td>
<td class="tg-0pky">38.6</td>
<td class="tg-0pky">25.08</td>
<td class="tg-0pky">17.07</td>
<td class="tg-0pky">12.7</td>
<td class="tg-0pky">6.72</td>
<td class="tg-0pky">24.57</td>
<td class="tg-0pky">3.11</td>
<td class="tg-0pky">36.9</td>
<td class="tg-0pky">25.24</td>
<td class="tg-0pky">17.73</td>
<td class="tg-0pky"><a href="https://huggingface.co/InternRobotics/VLN-PE/tree/main/r2r/fine_tuned/rdp" target="_blank" rel="noopener noreferrer">model</a></td>
</tr>
<tr>
<td class="tg-0pky">Seq2Seq+</td>
<td class="tg-0pky">R2R VLN-PE</td>
<td class="tg-0pky">10.22</td>
<td class="tg-0pky">7.75</td>
<td class="tg-0pky">33.43</td>
<td class="tg-0pky">3.19</td>
<td class="tg-0pky">30.09</td>
<td class="tg-0pky">16.86</td>
<td class="tg-0pky">12.54</td>
<td class="tg-0pky">9.88</td>
<td class="tg-0pky">7.85</td>
<td class="tg-0pky">26.27</td>
<td class="tg-0pky">6.52</td>
<td class="tg-0pky">28.79</td>
<td class="tg-0pky">16.56</td>
<td class="tg-0pky">12.7</td>
<td class="tg-0pky"><a href="https://huggingface.co/InternRobotics/VLN-PE/tree/main/r2r/fine_tuned/seq2seq_plus" target="_blank" rel="noopener noreferrer">model</a></td>
</tr>
<tr>
<td class="tg-0pky">CMA+</td>
<td class="tg-0pky">R2R VLN-PE</td>
<td class="tg-0pky">8.86</td>
<td class="tg-0pky">7.14</td>
<td class="tg-0pky">23.56</td>
<td class="tg-0pky">3.5</td>
<td class="tg-0pky">36.17</td>
<td class="tg-0pky">25.84</td>
<td class="tg-0pky">21.75</td>
<td class="tg-0pky">8.79</td>
<td class="tg-0pky">7.26</td>
<td class="tg-0pky">21.75</td>
<td class="tg-0pky">3.27</td>
<td class="tg-0pky">31.4</td>
<td class="tg-0pky">22.12</td>
<td class="tg-0pky">18.65</td>
<td class="tg-0pky"><a href="https://huggingface.co/InternRobotics/VLN-PE/tree/main/r2r/fine_tuned/cma_plus" target="_blank" rel="noopener noreferrer">model</a></td>
</tr>
</tbody></table>
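
Each download link in the table points to a subfolder of the `InternRobotics/VLN-PE` repository. The following is a minimal sketch of fetching a single subfolder with `huggingface_hub.snapshot_download`, assuming the `huggingface_hub` package is installed. The helper names `checkpoint_subfolder` and `download_checkpoint` are hypothetical and not part of InternNav; the subfolder paths are copied from the links above, and the files inside each folder are not described here.

```python
# Sketch only: helper names are illustrative, not part of InternNav.
REPO_ID = "InternRobotics/VLN-PE"  # repository named in the table's download links

# Subfolders exactly as they appear in the table's download links.
KNOWN_CHECKPOINTS = {
    "zero_shot": {"seq2seq", "cma"},
    "fine_tuned": {"seq2seq", "cma", "rdp", "seq2seq_plus", "cma_plus"},
}


def checkpoint_subfolder(setting: str, model: str) -> str:
    """Return the repository subfolder for a (setting, model) pair from the table."""
    if model not in KNOWN_CHECKPOINTS.get(setting, set()):
        raise ValueError(f"no checkpoint listed for setting={setting!r}, model={model!r}")
    return f"r2r/{setting}/{model}"


def download_checkpoint(setting: str, model: str, local_dir=None) -> str:
    """Download only the files under one checkpoint subfolder; return the local snapshot path."""
    # Imported lazily so the path helper above works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    subfolder = checkpoint_subfolder(setting, model)
    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=[f"{subfolder}/*"],  # skip every other checkpoint in the repo
        local_dir=local_dir,
    )
```

For example, `download_checkpoint("fine_tuned", "rdp")` would fetch only the RDP checkpoint folder and return the local path of the snapshot. Restricting the download with `allow_patterns` avoids pulling all seven checkpoints when only one is needed.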
## Citation
If you find our work helpful, please cite:
```bibtex
@inproceedings{vlnpe,
title={Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities},
author={Wang, Liuyi and Xia, Xinyuan and Zhao, Hui and Wang, Hanqing and Wang, Tai and Chen, Yilun and Liu, Chengju and Chen, Qijun and Pang, Jiangmiao},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
year={2025}
}
@misc{internnav2025,
title = {{InternNav: InternRobotics'} open platform for building generalized navigation foundation models},
author = {InternNav Contributors},
howpublished={\url{https://github.com/InternRobotics/InternNav}},
year = {2025}
}
```