---
license: mit
tags:
  - zero-shot evaluation
  - foundation models
  - visual navigation
  - robot learning
  - real-world evaluation
  - onnx
pipeline_tag: robotics
library_name: onnxruntime
arxiv: 2603.25937
base_model:
  - rail-berkeley/crossformer
  - robodhruv/visualnav-transformer
  - hren20/NaiviBridger
---


# Can Vision Foundation Models Navigate? Zero-Shot Real-World Evaluation and Lessons Learned — ONNX Models

ONNX-optimized exports of visual navigation models for deployment on physical robots (e.g., Boston Dynamics Spot, AgileX Limo, AgileX Bunker). These exports are derived from the original works listed below; all credit for the architectures and training goes to the respective authors.

See https://github.com/MaevaGuerrier/vnm-zeroshot-eval for deployment instructions.
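
For a quick sanity check before following the full deployment instructions, an export can be loaded with ONNX Runtime and run once on placeholder inputs. The snippet below is a minimal sketch, not the repository's deployment code: the file name `model.onnx` and the helper names `resolve_shape` and `run_dummy_inference` are hypothetical, and input names, shapes, and dtypes are read from the model itself rather than assumed.

```python
from typing import Dict, List, Sequence, Tuple, Union

Dim = Union[int, str, None]


def resolve_shape(shape: Sequence[Dim], dynamic_size: int = 1) -> List[int]:
    """Replace symbolic or unknown dimensions (str or None) with a concrete size."""
    return [d if isinstance(d, int) and d > 0 else dynamic_size for d in shape]


def run_dummy_inference(model_path: str) -> Dict[str, Tuple[int, ...]]:
    """Load an ONNX export, run it once on zero-filled inputs, return output shapes."""
    # Imported lazily so resolve_shape can be used without these packages installed.
    import numpy as np
    import onnxruntime as ort

    # Assumption: CPU execution; swap in another provider if one is available.
    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])

    # Map ONNX Runtime type strings to NumPy dtypes (falls back to float32).
    dtypes = {
        "tensor(float)": np.float32,
        "tensor(float16)": np.float16,
        "tensor(double)": np.float64,
        "tensor(int64)": np.int64,
    }

    feeds = {}
    for inp in session.get_inputs():
        shape = resolve_shape(inp.shape)
        feeds[inp.name] = np.zeros(shape, dtype=dtypes.get(inp.type, np.float32))

    outputs = session.run(None, feeds)
    return {
        meta.name: tuple(out.shape)
        for meta, out in zip(session.get_outputs(), outputs)
    }


# Example call (hypothetical file name):
# print(run_dummy_inference("model.onnx"))
```

Zero-filled inputs only confirm that the graph loads and executes; they say nothing about navigation quality, and real observations need the preprocessing described in the repository above.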

# Acknowledgements

We would like to thank the authors of the following works, whose open-source models made this evaluation possible.
- [GNM](https://arxiv.org/abs/2210.03370) 
- [ViNT](https://arxiv.org/abs/2306.14846) 
- [NoMaD](https://arxiv.org/abs/2310.07896)
- [NaviBridger](https://arxiv.org/abs/2504.10041) 
- [CrossFormer](https://arxiv.org/abs/2408.11812)

# Citations 

If you use this work, please cite:

```bibtex
@article{guerrier2026vnm,
  title   = {Can Vision Foundation Models Navigate? Zero-Shot Real-World Evaluation and Lessons Learned},
  author  = {Guerrier, Maeva and Soma, Karthik and Pavlasek, Jana and Beltrame, Giovanni},
  journal = {arXiv preprint arXiv:2603.25937},
  year    = {2026}
}
```

Please also consider citing the original models:

```bibtex
@misc{shah2023gnmgeneralnavigationmodel,
      title={GNM: A General Navigation Model to Drive Any Robot}, 
      author={Dhruv Shah and Ajay Sridhar and Arjun Bhorkar and Noriaki Hirose and Sergey Levine},
      year={2023},
      eprint={2210.03370},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2210.03370}, 
}
```


```bibtex
@misc{shah2023vintfoundationmodelvisual,
      title={ViNT: A Foundation Model for Visual Navigation}, 
      author={Dhruv Shah and Ajay Sridhar and Nitish Dashora and Kyle Stachowicz and Kevin Black and Noriaki Hirose and Sergey Levine},
      year={2023},
      eprint={2306.14846},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2306.14846}, 
}
```


```bibtex
@misc{sridhar2023nomadgoalmaskeddiffusion,
      title={NoMaD: Goal Masked Diffusion Policies for Navigation and Exploration}, 
      author={Ajay Sridhar and Dhruv Shah and Catherine Glossop and Sergey Levine},
      year={2023},
      eprint={2310.07896},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2310.07896}, 
}
```


```bibtex
@misc{ren2025priordoesmattervisual,
      title={Prior Does Matter: Visual Navigation via Denoising Diffusion Bridge Models}, 
      author={Hao Ren and Yiming Zeng and Zetong Bi and Zhaoliang Wan and Junlong Huang and Hui Cheng},
      year={2025},
      eprint={2504.10041},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2504.10041}, 
}
```


```bibtex
@misc{doshi2024scalingcrossembodiedlearningpolicy,
      title={Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation}, 
      author={Ria Doshi and Homer Walke and Oier Mees and Sudeep Dasari and Sergey Levine},
      year={2024},
      eprint={2408.11812},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2408.11812}, 
}
```