MaevaGuerrier committed 69464bc (1 parent: 0b3d249)
README.md CHANGED
@@ -1,3 +1,112 @@
  ---
  license: mit
+ tags:
+ - zero-shot evaluation
+ - foundation models
+ - visual navigation
+ - robot learning
+ - real-world evaluation
+ - onnx
+ pipeline_tag: vnm_zeroshot_eval
+ library_name: onnxruntime
+ arxiv: 2603.25937
+ base_model:
+ - rail-berkeley/crossformer
+ - robodhruv/visualnav-transformer
+ - hren20/NaiviBridger
  ---
+
+
+ # Can Vision Foundation Models Navigate? Zero-Shot Real-World Evaluation and Lessons Learned (ONNX Models)
+
+ ONNX-optimized exports of visual navigation models for deployment on physical robots (e.g., Boston Dynamics Spot, AgileX Limo, AgileX Bunker). These exports are derived from the original works listed below; all credit for architectures and training goes to the respective authors.
+
+ See https://github.com/MaevaGuerrier/vnm-zeroshot-eval for deployment instructions.
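Before wiring an export into a robot stack, it can help to inspect what each graph expects. The following is a minimal sketch, assuming the `onnxruntime` Python package is installed and the files have been fetched through Git LFS. The `ONNX_FILES` table and the `describe_model` helper are illustrative names, not part of the vnm-zeroshot-eval code. Input names and shapes are not documented in this commit, so the sketch reads them from the files instead of assuming them.

```python
from pathlib import Path

# ONNX files added in this commit, grouped by model (paths relative to the repo root).
ONNX_FILES = {
    "GNM": ["models/GNM/gnm.onnx"],
    "ViNT": ["models/ViNT/vint.onnx"],
    "NoMaD": [
        "models/NoMaD/nomad_vision_encoder.onnx",
        "models/NoMaD/nomad_noise_pred_net.onnx",
        "models/NoMaD/nomad_dist_pred_net.onnx",
    ],
    "NaviBridger": [
        "models/NaviBridger/navibridger_vision_encoder.onnx",
        "models/NaviBridger/navibridger_dist_pred_net.onnx",
    ],
}


def describe_model(name, repo_root="."):
    """Open every ONNX file of one model and return its input/output signatures."""
    # Imported lazily so the table above stays usable without onnxruntime.
    import onnxruntime as ort

    report = {}
    for rel_path in ONNX_FILES[name]:
        session = ort.InferenceSession(
            str(Path(repo_root) / rel_path),
            providers=["CPUExecutionProvider"],  # CPU only; an assumption for portability
        )
        report[rel_path] = {
            "inputs": [(i.name, i.shape, i.type) for i in session.get_inputs()],
            "outputs": [(o.name, o.shape, o.type) for o in session.get_outputs()],
        }
    return report
```

Calling `describe_model("NoMaD")` from a local clone would return one entry per sub-network. The actual deployment pipeline is described in the linked repository.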
+
+ # Acknowledgements
+
+ We would like to thank the authors of the following works, whose open-source models made this evaluation possible.
+ - [GNM](https://arxiv.org/abs/2210.03370)
+ - [ViNT](https://arxiv.org/abs/2306.14846)
+ - [NoMaD](https://arxiv.org/abs/2310.07896)
+ - [NaviBridger](https://arxiv.org/abs/2504.10041)
+ - [CrossFormer](https://arxiv.org/abs/2408.11812)
+
+ # Citations
+
+ If you use this work, please cite:
+
+ ```bibtex
+ @article{guerrier2026vnm,
+ title = {Can Vision Foundation Models Navigate? Zero-Shot Real-World Evaluation and Lessons Learned},
+ author = {Guerrier, Maeva and Soma, Karthik and Pavlasek, Jana and Beltrame, Giovanni},
+ journal = {arXiv preprint arXiv:2603.25937},
+ year = {2026}
+ }
+ ```
+
+ Consider citing the original models as well:
+
+ ```bibtex
+ @misc{shah2023gnmgeneralnavigationmodel,
+ title={GNM: A General Navigation Model to Drive Any Robot},
+ author={Dhruv Shah and Ajay Sridhar and Arjun Bhorkar and Noriaki Hirose and Sergey Levine},
+ year={2023},
+ eprint={2210.03370},
+ archivePrefix={arXiv},
+ primaryClass={cs.RO},
+ url={https://arxiv.org/abs/2210.03370},
+ }
+ ```
+
+
+ ```bibtex
+ @misc{shah2023vintfoundationmodelvisual,
+ title={ViNT: A Foundation Model for Visual Navigation},
+ author={Dhruv Shah and Ajay Sridhar and Nitish Dashora and Kyle Stachowicz and Kevin Black and Noriaki Hirose and Sergey Levine},
+ year={2023},
+ eprint={2306.14846},
+ archivePrefix={arXiv},
+ primaryClass={cs.RO},
+ url={https://arxiv.org/abs/2306.14846},
+ }
+ ```
+
+
+ ```bibtex
+ @misc{sridhar2023nomadgoalmaskeddiffusion,
+ title={NoMaD: Goal Masked Diffusion Policies for Navigation and Exploration},
+ author={Ajay Sridhar and Dhruv Shah and Catherine Glossop and Sergey Levine},
+ year={2023},
+ eprint={2310.07896},
+ archivePrefix={arXiv},
+ primaryClass={cs.RO},
+ url={https://arxiv.org/abs/2310.07896},
+ }
+ ```
+
+
+ ```bibtex
+ @misc{ren2025priordoesmattervisual,
+ title={Prior Does Matter: Visual Navigation via Denoising Diffusion Bridge Models},
+ author={Hao Ren and Yiming Zeng and Zetong Bi and Zhaoliang Wan and Junlong Huang and Hui Cheng},
+ year={2025},
+ eprint={2504.10041},
+ archivePrefix={arXiv},
+ primaryClass={cs.RO},
+ url={https://arxiv.org/abs/2504.10041},
+ }
+ ```
+
+
+ ```bibtex
+ @misc{doshi2024scalingcrossembodiedlearningpolicy,
+ title={Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation},
+ author={Ria Doshi and Homer Walke and Oier Mees and Sudeep Dasari and Sergey Levine},
+ year={2024},
+ eprint={2408.11812},
+ archivePrefix={arXiv},
+ primaryClass={cs.RO},
+ url={https://arxiv.org/abs/2408.11812},
+ }
+ ```
models/GNM/gnm.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5c2525cb2d42b2a7d8174d00345285b7ee5acff5232a6fc91a7531b19b145652
+ size 34630394
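The binary files in this commit are stored as Git LFS pointers: three `key value` lines giving the spec version, a SHA-256 object id, and the size in bytes. As a minimal sketch (the helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical), a downloaded file can be checked against its pointer like this:

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text):
    """Parse a Git LFS pointer into its version, hash algorithm, digest and size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file at `path` has the size and digest the pointer records."""
    path = Path(path)
    if path.stat().st_size != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with path.open("rb") as f:
        # Hash in chunks so large checkpoints are not read into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


# Pointer contents for models/GNM/gnm.onnx, copied from the diff above.
GNM_ONNX_POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:5c2525cb2d42b2a7d8174d00345285b7ee5acff5232a6fc91a7531b19b145652
size 34630394
"""
pointer = parse_lfs_pointer(GNM_ONNX_POINTER)
```

With a local copy of the file, `matches_pointer("models/GNM/gnm.onnx", pointer)` should return `True` if the download is intact.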
models/GNM/gnm.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4b03e0255f8a547290d4079f4e7d610ff69987122f17e019bd36684c08b3ee95
+ size 104806886
models/NaviBridger/cvae.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bd8414f2b37e7bb20fb61c8cd7064d112c24fdedb8ef5f2e9c066749fcc02ab5
+ size 915311478
models/NaviBridger/cvae.yaml ADDED
@@ -0,0 +1,151 @@
+ project_name: cvae
+ run_name: cvae
+
+ # training setup
+ use_wandb: True # set to false if you don't want to log to wandb
+ train: True
+ batch_size: 256
+ epochs: 30
+ gpu_ids: [0]
+ num_workers: 12
+ lr: 1e-4
+ optimizer: adamw
+ clipping: False
+ max_norm: 1.
+ scheduler: "cosine"
+ warmup: True
+ warmup_epochs: 4
+ cyclic_period: 10
+ plateau_patience: 3
+ plateau_factor: 0.5
+ seed: 0
+ save_freq: 1
+
+ # model params
+ model_type: cvae
+ vision_encoder: navibridge_encoder
+ encoding_size: 256
+ obs_encoder: efficientnet-b0
+ attn_unet: False
+ cond_predict_scale: False
+ mha_num_attention_heads: 4
+ mha_num_attention_layers: 4
+ mha_ff_dim_factor: 4
+ down_dims: [64, 128, 256]
+
+ # diffusion model params
+ num_diffusion_iters: 10
+
+ # mask
+ goal_mask_prob: 0.5
+
+ # normalization for the action space
+ normalize: True
+
+ # context
+ context_type: temporal
+ context_size: 3 # 5
+ alpha: 1e-4
+
+ # distance bounds for distance and action predictions
+ distance:
+ min_dist_cat: 0
+ max_dist_cat: 20
+ action:
+ min_dist_cat: 3
+ max_dist_cat: 20
+
+ # action output params
+ len_traj_pred: 8
+ action_dim: 2
+ learn_angle: False
+
+ # navibridge
+ sampler_name: "uniform"
+ pred_mode: "ve"
+ weight_schedule: "karras"
+ sigma_data: 0.5
+ sigma_min: 0.002
+ sigma_max: 80.0
+ rho: 7.0
+ beta_d: 2
+ beta_min: 0.1
+ cov_xy: 0.
+ guidance: 1.
+ # sample defaults
+ clip_denoised: True
+ sampler: "euler"
+ churn_step_ratio: 0.
+ # prior settings
+ prior_policy: "gaussian" # handcraft, gaussian, cvae
+ class_num: 5
+
+ angle_ranges: [[0, 67.5],
+ [67.5, 112.5],
+ [112.5, 180],
+ [180, 270],
+ [270, 360]]
+ min_std_angle: 5.0
+ max_std_angle: 20.0
+ min_std_length: 1.0
+ max_std_length: 5.0
+
+ # cvae
+ train_params:
+ batch_size: 256
+ num_itr: 3001
+ lr: 0.5e-5
+ lr_gamma: 0.99
+ lr_step: 1000
+ l2_norm: 0.0
+ ema: 0.99
+
+
+ diffuse_params:
+ latent_dim: 64
+ layer: 3
+ net_type: vae_mlp
+ ckpt_path: /workspace/src/NaiviBridger/deployment/model_weights/cvae.pth
+ pretrain: False
+
+ # dataset specific parameters
+ image_size: [96, 96] # width, height
+ datasets:
+ recon:
+ data_folder: ./datasets/recon
+ train: ./datasets/data_splits/recon/train # path to train folder with traj_names.txt
+ test: ./datasets/data_splits/recon/test # path to test folder with traj_names.txt
+ end_slack: 3 # because many trajectories end in collisions
+ goals_per_obs: 1 # how many goals are sampled per observation
+ negative_mining: True # negative mining from the ViNG paper (Shah et al.)
+ go_stanford:
+ data_folder: ./datasets/go_stanford/ # datasets/stanford_go_new
+ train: ./datasets/data_splits/go_stanford/train/
+ test: ./datasets/data_splits/go_stanford/test/
+ end_slack: 0
+ goals_per_obs: 2 # increase dataset size
+ negative_mining: True
+ sacson:
+ data_folder: ./datasets/sacson/
+ train: ./datasets/data_splits/sacson/train/
+ test: ./datasets/data_splits/sacson/test/
+ end_slack: 3 # because many trajectories end in collisions
+ goals_per_obs: 1
+ negative_mining: True
+ scand:
+ data_folder: ./datasets/scand/
+ train: ./datasets/data_splits/scand/train/
+ test: ./datasets/data_splits/scand/test/
+ end_slack: 0
+ goals_per_obs: 1
+ negative_mining: True
+
+ # logging stuff
+ ## =0 turns off
+ print_log_freq: 500 # in iterations
+ image_log_freq: 1000 #0 # in iterations
+ num_images_log: 8 #0
+ pairwise_test_freq: 0 # in epochs
+ eval_fraction: 0.25
+ wandb_log_freq: 10 # in iterations
+ eval_freq: 1 # in epochs
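The `sigma_min`, `sigma_max` and `rho` keys in this config use the parameter names of the noise discretization from Karras et al. (EDM), where noise levels are spaced evenly in `sigma ** (1 / rho)`. The sketch below shows what that spacing would look like for the values in `cvae.yaml`. Pairing it with `num_diffusion_iters: 10` as the step count is an assumption, and the function `karras_sigmas` is illustrative rather than taken from the NaviBridger code, which may discretize differently.

```python
def karras_sigmas(n, sigma_min, sigma_max, rho):
    """Return n noise levels from sigma_max down to sigma_min (Karras et al. spacing)."""
    min_inv = sigma_min ** (1.0 / rho)
    max_inv = sigma_max ** (1.0 / rho)
    # Interpolate linearly in sigma^(1/rho), then map back with the power rho.
    return [
        (max_inv + i / (n - 1) * (min_inv - max_inv)) ** rho
        for i in range(n)
    ]


# Values from cvae.yaml: sigma_min=0.002, sigma_max=80.0, rho=7.0.
# Using num_diffusion_iters=10 as the number of levels is an assumption.
cvae_sigmas = karras_sigmas(10, sigma_min=0.002, sigma_max=80.0, rho=7.0)
```

A larger `rho` concentrates more of the levels near `sigma_min`. `navibridger_cvae.yaml` keeps the same `sigma_min` and `rho` but lowers `sigma_max` to 10.0, which would shrink the range the same spacing has to cover.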
models/NaviBridger/navibridger_cvae.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:511334b25ca38da88787f1ccdf6eea1f0f7eff6d762acd9223a04fd347920fe2
+ size 76547213
models/NaviBridger/navibridger_cvae.yaml ADDED
@@ -0,0 +1,150 @@
+ project_name: navibridge
+ run_name: navibridge
+
+ # training setup
+ use_wandb: True # set to false if you don't want to log to wandb
+ train: True
+ batch_size: 224
+ epochs: 30
+ gpu_ids: [1]
+ num_workers: 12
+ lr: 1e-4
+ optimizer: adamw
+ clipping: False
+ max_norm: 1.
+ scheduler: "cosine"
+ warmup: True
+ warmup_epochs: 4
+ cyclic_period: 10
+ plateau_patience: 3
+ plateau_factor: 0.5
+ seed: 0
+ save_freq: 1
+
+ # model params
+ model_type: navibridge
+ vision_encoder: navibridge_encoder
+ encoding_size: 256
+ obs_encoder: efficientnet-b0
+ attn_unet: False
+ cond_predict_scale: False
+ mha_num_attention_heads: 4
+ mha_num_attention_layers: 4
+ mha_ff_dim_factor: 4
+ down_dims: [64, 128, 256]
+
+ # diffusion model params
+ num_diffusion_iters: 10
+
+ # mask
+ goal_mask_prob: 0.5
+
+ # normalization for the action space
+ normalize: True
+
+ # context
+ context_type: temporal
+ context_size: 3 # 5
+ alpha: 1e-4
+
+ # distance bounds for distance and action predictions
+ distance:
+ min_dist_cat: 0
+ max_dist_cat: 20
+ action:
+ min_dist_cat: 3
+ max_dist_cat: 20
+
+ # action output params
+ len_traj_pred: 8
+ action_dim: 2
+ learn_angle: False
+
+ # navibridge
+ sampler_name: "uniform"
+ pred_mode: "ve"
+ weight_schedule: "karras"
+ sigma_data: 0.5
+ sigma_min: 0.002
+ sigma_max: 10.0
+ rho: 7.0
+ beta_d: 2
+ beta_min: 0.1
+ cov_xy: 0.
+ guidance: 1.
+ clip_denoised: True
+ sampler: "euler"
+ churn_step_ratio: 0.
+ # prior settings
+ prior_policy: "cvae" # handcraft, gaussian, cvae
+ class_num: 5
+
+ angle_ranges: [[0, 67.5],
+ [67.5, 112.5],
+ [112.5, 180],
+ [180, 270],
+ [270, 360]]
+ min_std_angle: 5.0
+ max_std_angle: 20.0
+ min_std_length: 1.0
+ max_std_length: 5.0
+
+ # cvae
+ train_params:
+ batch_size: 256
+ num_itr: 3001
+ lr: 0.5e-5
+ lr_gamma: 0.99
+ lr_step: 1000
+ l2_norm: 0.0
+ ema: 0.99
+
+
+ diffuse_params:
+ latent_dim: 64
+ layer: 3
+ net_type: vae_mlp
+ ckpt_path: /workspace/src/NaiviBridger/deployment/model_weights/cvae.pth
+ pretrain: False
+
+ # dataset specific parameters
+ image_size: [96, 96] # width, height
+ datasets:
+ recon:
+ data_folder: ./datasets/recon
+ train: ./datasets/data_splits/recon/train # path to train folder with traj_names.txt
+ test: ./datasets/data_splits/recon/test # path to test folder with traj_names.txt
+ end_slack: 3 # because many trajectories end in collisions
+ goals_per_obs: 1 # how many goals are sampled per observation
+ negative_mining: True # negative mining from the ViNG paper (Shah et al.)
+ go_stanford:
+ data_folder: ./datasets/go_stanford/ # datasets/stanford_go_new
+ train: ./datasets/data_splits/go_stanford/train/
+ test: ./datasets/data_splits/go_stanford/test/
+ end_slack: 0
+ goals_per_obs: 2 # increase dataset size
+ negative_mining: True
+ sacson:
+ data_folder: ./datasets/sacson/
+ train: ./datasets/data_splits/sacson/train/
+ test: ./datasets/data_splits/sacson/test/
+ end_slack: 3 # because many trajectories end in collisions
+ goals_per_obs: 1
+ negative_mining: True
+ scand:
+ data_folder: ./datasets/scand/
+ train: ./datasets/data_splits/scand/train/
+ test: ./datasets/data_splits/scand/test/
+ end_slack: 0
+ goals_per_obs: 1
+ negative_mining: True
+
+ # logging stuff
+ ## =0 turns off
+ print_log_freq: 100 # in iterations
+ image_log_freq: 1000 #0 # in iterations
+ num_images_log: 8 #0
+ pairwise_test_freq: 0 # in epochs
+ eval_fraction: 0.25
+ wandb_log_freq: 10 # in iterations
+ eval_freq: 1 # in epochs
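The config lists five `angle_ranges` (in degrees) alongside `class_num: 5`, which suggests that headings are binned into five direction classes. The sketch below shows one way such a lookup could work. Treating each range as half-open `[low, high)`, wrapping angles into `[0, 360)`, and the name `angle_class` are all assumptions for illustration; the NaviBridger code may use these ranges differently.

```python
# Ranges copied from the angle_ranges entry in the config, in degrees.
ANGLE_RANGES = [(0, 67.5), (67.5, 112.5), (112.5, 180), (180, 270), (270, 360)]


def angle_class(angle_deg):
    """Map a heading in degrees to the index of the range that contains it."""
    wrapped = angle_deg % 360.0  # e.g. -90 becomes 270, 360 becomes 0
    for index, (low, high) in enumerate(ANGLE_RANGES):
        if low <= wrapped < high:  # half-open interval: an assumption
            return index
    # Not reachable: the ranges cover [0, 360) without gaps.
    raise ValueError(f"angle {angle_deg} not covered by ANGLE_RANGES")
```

Under these assumptions a heading of 90 degrees falls in the second range (index 1), and a heading of -90 degrees wraps to 270 and falls in the last one (index 4).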
models/NaviBridger/navibridger_dist_pred_net.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cba7328b993c06858db7776f7c46293b17c032dcdefcb540a1cecce181a1bc61
+ size 71653
models/NaviBridger/navibridger_vision_encoder.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:89e15954b138aca90940677ca0b75855408f86caed50b7b0f497a4234bbd7721
+ size 47967171
models/NoMaD/nomad.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:70f79b8262527e20e56ced64a3e3d7ef91855bc9e7c3fa348d78edcb83c6a333
+ size 76473631
models/NoMaD/nomad_dist_pred_net.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:47a4272c8b6fea3982cc403fcbb7275461135b173b02e4a8bac545138ff641ed
+ size 71653
models/NoMaD/nomad_noise_pred_net.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d22b30234f2a2db469a82e5304feec65f20c7daa0df4f4d4224c4d32063153c5
+ size 15550505
models/NoMaD/nomad_vision_encoder.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:35439a338d1a12f481c0186443521b7cd9d4a330eb82309f54a348f237e9fa97
+ size 47967171
models/ViNT/vint.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:092fb24e9f73c6ea1a42e07442232e73b98a920195fc1b550e4aed52c3f43304
+ size 96004784
models/ViNT/vint.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:155fd72de2e98ae0e2fef9404072e1aefa79dae5f7f2411d4bcf7e384b83aa1f
+ size 430167114