---
license: mit
tags:
- zero-shot evaluation
- foundation models
- visual navigation
- robot learning
- real-world evaluation
- onnx
pipeline_tag: robotics
library_name: onnxruntime
arxiv: 2603.25937
base_model:
- rail-berkeley/crossformer
- robodhruv/visualnav-transformer
- hren20/NaiviBridger
---

# Can Vision Foundation Models Navigate? Zero-Shot Real-World Evaluation and Lessons Learned - ONNX Models

ONNX-optimized exports of visual navigation models for deployment on physical robots (e.g., Boston Dynamics Spot, AgileX Limo, AgileX Bunker).

These exports are derived from the original works listed below; all credit for the architectures and training goes to the respective authors.

See https://github.com/MaevaGuerrier/vnm-zeroshot-eval for deployment instructions.

# Acknowledgements

We would like to thank the authors of the following works, whose open-source models made this evaluation possible.

- [GNM](https://arxiv.org/abs/2210.03370)
- [ViNT](https://arxiv.org/abs/2306.14846)
- [NoMaD](https://arxiv.org/abs/2310.07896)
- [NaviBridger](https://arxiv.org/abs/2504.10041)
- [CrossFormer](https://arxiv.org/abs/2408.11812)

# Citations

If you use this work, please cite:

```bibtex
@article{guerrier2026vnm,
  title   = {Can Vision Foundation Models Navigate? Zero-Shot Real-World Evaluation and Lessons Learned},
  author  = {Guerrier, Maeva and Soma, Karthik and Pavlasek, Jana and Beltrame, Giovanni},
  journal = {arXiv preprint arXiv:2603.25937},
  year    = {2026}
}
```

Consider citing the original models as well:

```bibtex
@misc{shah2023gnmgeneralnavigationmodel,
  title={GNM: A General Navigation Model to Drive Any Robot},
  author={Dhruv Shah and Ajay Sridhar and Arjun Bhorkar and Noriaki Hirose and Sergey Levine},
  year={2023},
  eprint={2210.03370},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2210.03370},
}
```

```bibtex
@misc{shah2023vintfoundationmodelvisual,
  title={ViNT: A Foundation Model for Visual Navigation},
  author={Dhruv Shah and Ajay Sridhar and Nitish Dashora and Kyle Stachowicz and Kevin Black and Noriaki Hirose and Sergey Levine},
  year={2023},
  eprint={2306.14846},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2306.14846},
}
```

```bibtex
@misc{sridhar2023nomadgoalmaskeddiffusion,
  title={NoMaD: Goal Masked Diffusion Policies for Navigation and Exploration},
  author={Ajay Sridhar and Dhruv Shah and Catherine Glossop and Sergey Levine},
  year={2023},
  eprint={2310.07896},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2310.07896},
}
```

```bibtex
@misc{ren2025priordoesmattervisual,
  title={Prior Does Matter: Visual Navigation via Denoising Diffusion Bridge Models},
  author={Hao Ren and Yiming Zeng and Zetong Bi and Zhaoliang Wan and Junlong Huang and Hui Cheng},
  year={2025},
  eprint={2504.10041},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2504.10041},
}
```

```bibtex
@misc{doshi2024scalingcrossembodiedlearningpolicy,
  title={Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation},
  author={Ria Doshi and Homer Walke and Oier Mees and Sudeep Dasari and Sergey Levine},
  year={2024},
  eprint={2408.11812},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2408.11812},
}
```
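
# Inspecting an exported model

This card does not list the input and output tensors of each export, and the linked repository remains the reference for deployment. As a quick sanity check before wiring a model into a robot stack, the sketch below shows one way to print the tensor names, types, and shapes of an ONNX file with `onnxruntime`. The file path and the helper names `concrete_shape` and `describe_model` are hypothetical and are not part of this repository.

```python
from typing import List, Optional, Sequence, Union

# A dimension reported by ONNX Runtime can be an int, a symbolic name, or None.
Dim = Union[int, str, None]


def concrete_shape(shape: Sequence[Dim], default: int = 1) -> List[int]:
    """Replace symbolic or unknown dimensions (e.g. "batch", None) with `default`."""
    return [d if isinstance(d, int) and d > 0 else default for d in shape]


def describe_model(model_path: str, providers: Optional[List[str]] = None) -> None:
    """Print the inputs and outputs declared by an ONNX file.

    `model_path` is a hypothetical local path; the actual file names of the
    exports are not stated in this card.
    """
    # Imported lazily so that `concrete_shape` can be used without onnxruntime.
    import onnxruntime as ort

    session = ort.InferenceSession(
        model_path, providers=providers or ["CPUExecutionProvider"]
    )
    for tensor in session.get_inputs():
        print("input ", tensor.name, tensor.type, tensor.shape,
              "->", concrete_shape(tensor.shape))
    for tensor in session.get_outputs():
        print("output", tensor.name, tensor.type, tensor.shape)


# Example call (hypothetical path, adjust to the file you downloaded):
# describe_model("path/to/model.onnx")
```

The shapes printed for the inputs indicate what a dummy array would need to look like for a first test run; preprocessing, normalization, and action decoding are model-specific and are covered by the deployment instructions in the repository.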