---
license: mit
tags:
- zero-shot evaluation
- foundation models
- visual navigation
- robot learning
- real-world evaluation
- onnx
pipeline_tag: robotics
library_name: onnxruntime
arxiv: 2603.25937
base_model:
- rail-berkeley/crossformer
- robodhruv/visualnav-transformer
- hren20/NaiviBridger
---
# Can Vision Foundation Models Navigate? Zero-Shot Real-World Evaluation and Lessons Learned — ONNX Models
ONNX-optimized exports of visual navigation models for deployment on physical robots (e.g., Boston Dynamics Spot, AgileX Limo, AgileX Bunker). These exports are derived from the original works listed below; all credit for architectures and training goes to the respective authors.
See https://github.com/MaevaGuerrier/vnm-zeroshot-eval for deployment instructions.
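Before wiring an export into a robot stack, it can help to check which inputs and outputs it expects. The snippet below is a minimal sketch, not the deployment code from the repository above: the function names and the model path are hypothetical, dynamic dimensions are assumed to accept size 1, and unrecognized input types are assumed to be float32. It uses only standard `onnxruntime` calls (`InferenceSession`, `get_inputs`, `get_outputs`, `run`).

```python
def resolve_shape(shape, default=1):
    """Replace symbolic or unknown dimensions (str or None) with a concrete size."""
    return [d if isinstance(d, int) else default for d in shape]


def describe_and_run(model_path):
    """Print a model's input/output signature and run it once on zero-filled inputs."""
    # Imported lazily so resolve_shape stays usable without these packages.
    import numpy as np
    import onnxruntime as ort

    # Assumption: inputs not listed here are treated as float32.
    dtypes = {"tensor(float)": np.float32, "tensor(int64)": np.int64}

    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])

    feeds = {}
    for inp in session.get_inputs():
        print("input ", inp.name, inp.shape, inp.type)
        dtype = dtypes.get(inp.type, np.float32)
        feeds[inp.name] = np.zeros(resolve_shape(inp.shape), dtype=dtype)

    # Passing None as the first argument requests every model output.
    outputs = session.run(None, feeds)
    for meta, value in zip(session.get_outputs(), outputs):
        print("output", meta.name, value.shape)
    return outputs
```

Calling `describe_and_run("path/to/model.onnx")` with the path of a downloaded export (placeholder shown here) prints the tensor names and shapes to match against your camera and goal-image preprocessing. Zero-filled inputs only confirm that the graph executes; they say nothing about navigation quality.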
# Acknowledgements
We would like to thank the authors of the following works, whose open-source models made this evaluation possible.
- [GNM](https://arxiv.org/abs/2210.03370)
- [ViNT](https://arxiv.org/abs/2306.14846)
- [NoMaD](https://arxiv.org/abs/2310.07896)
- [NaviBridger](https://arxiv.org/abs/2504.10041)
- [CrossFormer](https://arxiv.org/abs/2408.11812)
# Citations
If you use this work, please cite:
```bibtex
@article{guerrier2026vnm,
title = {Can Vision Foundation Models Navigate? Zero-Shot Real-World Evaluation and Lessons Learned},
author = {Guerrier, Maeva and Soma, Karthik and Pavlasek, Jana and Beltrame, Giovanni},
journal = {arXiv preprint arXiv:2603.25937},
year = {2026}
}
```
Consider citing the original models as well:
```bibtex
@misc{shah2023gnmgeneralnavigationmodel,
title={GNM: A General Navigation Model to Drive Any Robot},
author={Dhruv Shah and Ajay Sridhar and Arjun Bhorkar and Noriaki Hirose and Sergey Levine},
year={2023},
eprint={2210.03370},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2210.03370},
}
```
```bibtex
@misc{shah2023vintfoundationmodelvisual,
title={ViNT: A Foundation Model for Visual Navigation},
author={Dhruv Shah and Ajay Sridhar and Nitish Dashora and Kyle Stachowicz and Kevin Black and Noriaki Hirose and Sergey Levine},
year={2023},
eprint={2306.14846},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2306.14846},
}
```
```bibtex
@misc{sridhar2023nomadgoalmaskeddiffusion,
title={NoMaD: Goal Masked Diffusion Policies for Navigation and Exploration},
author={Ajay Sridhar and Dhruv Shah and Catherine Glossop and Sergey Levine},
year={2023},
eprint={2310.07896},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2310.07896},
}
```
```bibtex
@misc{ren2025priordoesmattervisual,
title={Prior Does Matter: Visual Navigation via Denoising Diffusion Bridge Models},
author={Hao Ren and Yiming Zeng and Zetong Bi and Zhaoliang Wan and Junlong Huang and Hui Cheng},
year={2025},
eprint={2504.10041},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2504.10041},
}
```
```bibtex
@misc{doshi2024scalingcrossembodiedlearningpolicy,
title={Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation},
author={Ria Doshi and Homer Walke and Oier Mees and Sudeep Dasari and Sergey Levine},
year={2024},
eprint={2408.11812},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2408.11812},
}
```