| --- |
| license: mit |
| tags: |
| - zero-shot evaluation |
| - foundation models |
| - visual navigation |
| - robot learning |
| - real-world evaluation |
| - onnx |
| pipeline_tag: robotics |
| library_name: onnxruntime |
| arxiv: 2603.25937 |
| base_model: |
| - rail-berkeley/crossformer |
| - robodhruv/visualnav-transformer |
| - hren20/NaiviBridger |
| --- |
| |
|
|
| # Can Vision Foundation Models Navigate? Zero-Shot Real-World Evaluation and Lessons Learned — ONNX Models |
|
|
| ONNX-optimized exports of visual navigation models for deployment on physical robots (e.g., Boston Dynamics Spot, AgileX Limo, AgileX Bunker). These exports are derived from the original works listed below; all credit for the architectures and training goes to the respective authors. |
|
|
| See https://github.com/MaevaGuerrier/vnm-zeroshot-eval for deployment instructions. |
|
|
| # Acknowledgements |
|
|
| We thank the authors of the following works, whose open-source models made this evaluation possible: |
| - [GNM](https://arxiv.org/abs/2210.03370) |
| - [ViNT](https://arxiv.org/abs/2306.14846) |
| - [NoMaD](https://arxiv.org/abs/2310.07896) |
| - [NaviBridger](https://arxiv.org/abs/2504.10041) |
| - [CrossFormer](https://arxiv.org/abs/2408.11812) |
|
|
| # Citations |
|
|
| If you use this work, please cite: |
|
|
| ```bibtex |
| @article{guerrier2026vnm, |
| title = {Can Vision Foundation Models Navigate? Zero-Shot Real-World Evaluation and Lessons Learned}, |
| author = {Guerrier, Maeva and Soma, Karthik and Pavlasek, Jana and Beltrame, Giovanni}, |
| journal = {arXiv preprint arXiv:2603.25937}, |
| year = {2026} |
| } |
| ``` |
|
|
| Consider citing the original models as well: |
|
|
| ```bibtex |
| @misc{shah2023gnmgeneralnavigationmodel, |
| title={GNM: A General Navigation Model to Drive Any Robot}, |
| author={Dhruv Shah and Ajay Sridhar and Arjun Bhorkar and Noriaki Hirose and Sergey Levine}, |
| year={2023}, |
| eprint={2210.03370}, |
| archivePrefix={arXiv}, |
| primaryClass={cs.RO}, |
| url={https://arxiv.org/abs/2210.03370}, |
| } |
| ``` |
|
|
|
|
| ```bibtex |
| @misc{shah2023vintfoundationmodelvisual, |
| title={ViNT: A Foundation Model for Visual Navigation}, |
| author={Dhruv Shah and Ajay Sridhar and Nitish Dashora and Kyle Stachowicz and Kevin Black and Noriaki Hirose and Sergey Levine}, |
| year={2023}, |
| eprint={2306.14846}, |
| archivePrefix={arXiv}, |
| primaryClass={cs.RO}, |
| url={https://arxiv.org/abs/2306.14846}, |
| } |
| ``` |
|
|
|
|
| ```bibtex |
| @misc{sridhar2023nomadgoalmaskeddiffusion, |
| title={NoMaD: Goal Masked Diffusion Policies for Navigation and Exploration}, |
| author={Ajay Sridhar and Dhruv Shah and Catherine Glossop and Sergey Levine}, |
| year={2023}, |
| eprint={2310.07896}, |
| archivePrefix={arXiv}, |
| primaryClass={cs.RO}, |
| url={https://arxiv.org/abs/2310.07896}, |
| } |
| ``` |
|
|
|
|
| ```bibtex |
| @misc{ren2025priordoesmattervisual, |
| title={Prior Does Matter: Visual Navigation via Denoising Diffusion Bridge Models}, |
| author={Hao Ren and Yiming Zeng and Zetong Bi and Zhaoliang Wan and Junlong Huang and Hui Cheng}, |
| year={2025}, |
| eprint={2504.10041}, |
| archivePrefix={arXiv}, |
| primaryClass={cs.RO}, |
| url={https://arxiv.org/abs/2504.10041}, |
| } |
| ``` |
|
|
|
|
| ```bibtex |
| @misc{doshi2024scalingcrossembodiedlearningpolicy, |
| title={Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation}, |
| author={Ria Doshi and Homer Walke and Oier Mees and Sudeep Dasari and Sergey Levine}, |
| year={2024}, |
| eprint={2408.11812}, |
| archivePrefix={arXiv}, |
| primaryClass={cs.RO}, |
| url={https://arxiv.org/abs/2408.11812}, |
| } |
| ``` |