---
license: apache-2.0
tags:
- robotics
- navigation
- visual-navigation
- embodied-ai
- onnx
pipeline_tag: robotics
---
# Navigation Model Zoo
A collection of vision-based navigation policies exported to **ONNX**, each wrapped in a small,
uniform Python inference API. Maintained by **Honglin He @ UCLA-VAIL**.
Every model takes a short history of RGB frames and predicts a local trajectory (and optionally a
distance-to-goal / arrival signal); a built-in PD controller turns the trajectory into `(v, ω)`
velocity commands. All models share the same wrapper interface so they can be swapped and
benchmarked without per-model glue code.
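
As a rough picture of the trajectory-to-command step, the sketch below steers toward a single waypoint with a proportional law and clips the result to speed limits. It is not the zoo's built-in PD controller: the function name `waypoint_to_vw`, the gain `k_w`, the time step `dt`, and the default limits are assumptions chosen for illustration.

```python
import numpy as np

def waypoint_to_vw(waypoint_xy, max_v=0.5, max_w=1.0, dt=0.2, k_w=1.0):
    """Map one standard-frame waypoint (x=forward, y=left, meters) to (v, w).

    Hypothetical proportional law, not the wrappers' actual controller.
    """
    x, y = float(waypoint_xy[0]), float(waypoint_xy[1])
    heading_err = np.arctan2(y, x)                       # angle to the waypoint, radians
    v = np.clip(x / dt, 0.0, max_v)                      # drive forward, never reverse
    w = np.clip(k_w * heading_err / dt, -max_w, max_w)   # turn toward the waypoint
    return float(v), float(w)

# A waypoint straight ahead gives full forward speed and no turn.
v, w = waypoint_to_vw(np.array([0.3, 0.0]))
```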
## Models
| Folder | Model / paper | Goal mode | Context | Input H×W | Waypoints | Weights |
|--------|---------------|-----------|:-------:|:---------:|:---------:|---------|
| [`GNM_GL_Official`](GNM_GL_Official) | [GNM](https://arxiv.org/abs/2210.03370) · ICRA 2023 | goal-free | 6 | 64×85 | 5 | `gnm_imagegoal.onnx` (+`.data`) · 35 MB |
| [`Vint_GL_Official`](Vint_GL_Official) | [ViNT](https://arxiv.org/abs/2306.14846) · CoRL 2023 | goal-free | 6 | 64×85 | 5 | `vint_imagegoal.onnx` (+`.data`) · 97 MB |
| [`NoMaD_GL_Official`](NoMaD_GL_Official) | [NoMaD](https://arxiv.org/abs/2310.07896) · ICRA 2024 | goal-free (diffusion) | 4 | 96×96 | 8 × 8 samples | 3× `.onnx` (+`.data`) · 111 MB |
| [`CityWalker_PG_Official`](CityWalker_PG_Official) | [CityWalker](https://arxiv.org/abs/2411.17820) · CVPR 2025 | point-goal | 5 | 350×630 | 5 | `citywalker.onnx` · 806 MB |
| [`MBRA_PG_Official`](MBRA_PG_Official) | [MBRA](https://arxiv.org/abs/2505.05592) · RA-L 2025 | point-goal | 6 | 96×96 | 8 | `mbra.onnx` · 254 MB |
| [`S2E`](S2E) | [S2E](https://arxiv.org/abs/2507.22028) · ICLR 2026 | point-goal / goal-free | 11 | 256×256 | 10 | `s2e.onnx` · 382 MB |
| [`MIMIC`](MIMIC) | [MIMIC](https://arxiv.org/abs/2603.22527) · ICRA 2026 | goal-free | 16 | 288×512 | 13 | `mimic.onnx` · 318 MB |

Suffix legend: `PG` = point-goal, `GL` = goal-less (goal-free). Models with a `.onnx.data` companion
(GNM, ViNT, NoMaD) store their weights as ONNX external data; keep each `.onnx` and its `.onnx.data` in the same folder.
## Common interface
Each folder is a self-contained module exposing one navigator class. They all follow the same contract:
```python
import numpy as np
from MBRA_PG_Official.inference import MBRAPGNavigator # run from the repo root
nav = MBRAPGNavigator(device="cuda") # use device="cpu" if you have no GPU
# obs: (B, nav.context_size, 3, H, W) float32 in [0, 1]
# the wrapper resizes & normalizes to the model's spec internally
obs = np.random.rand(1, nav.context_size, 3, 96, 96).astype(np.float32)
# Point-goal models take goal_xy (standard frame: x=forward, y=left, meters);
# goal-free models omit it.
traj, scores = nav.inference_trajectory(obs, goal_xy=np.array([5.0, 0.2])) # (B, M, W, 2) meters
vw, best = nav.inference_vw(obs, goal_xy=np.array([5.0, 0.2])) # vw: (B, 2) = [v, ω]
nav.reset() # clears PD-controller velocity smoothing between episodes
```
Conventions shared by every model:
- **Coordinate frame** — all user-facing inputs/outputs are *standard frame*: `x = forward`, `y = left`, in meters. Models with a different internal convention (e.g. CityWalker) convert transparently.
- **Observations** — `(B, context_size, 3, H, W)`, `float32`, pixel values in `[0, 1]`. The wrapper handles resize and any ImageNet normalization. *(Exception: `MIMIC` expects frames already at 288×512 and does not resize.)*
- **`inference_trajectory(obs[, goal_xy])`** → `(trajectory, scores)`. `trajectory` is `(B, M, W, 2)` in meters, where `M` is the number of modes (1 for unimodal, 8 for NoMaD) and `W` the waypoint count; `scores` is `(B, M)`.
- **`inference_vw(obs[, goal_xy])`** → `(vw, best_traj)` where `vw` is a `(B, 2)` torch tensor of `[linear_v, angular_w]`. Tune limits with `max_v` / `max_w` at construction.
- Goal-free models (`Vint`, `GNM`, `NoMaD`, `MIMIC`) ignore `goal_xy` — call `inference_trajectory(obs)`.
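
The observation layout above can be produced from raw camera frames as in the following minimal sketch. The helper name `frames_to_obs` and the oldest-first frame ordering are assumptions, not part of the zoo's API; check each wrapper for the ordering it expects.

```python
import numpy as np

def frames_to_obs(frames_uint8):
    """Stack T RGB frames of shape (H, W, 3), uint8, into (1, T, 3, H, W) float32 in [0, 1]."""
    stack = np.stack(frames_uint8, axis=0)        # (T, H, W, 3)
    stack = stack.astype(np.float32) / 255.0      # scale pixel values to [0, 1]
    stack = np.transpose(stack, (0, 3, 1, 2))     # (T, 3, H, W), channels first
    return stack[None]                            # add the batch axis

# Six dummy camera frames, oldest first (the ordering is an assumption).
frames = [np.full((96, 96, 3), 255, dtype=np.uint8) for _ in range(6)]
obs = frames_to_obs(frames)
```

The resulting array can be passed to `inference_trajectory` or `inference_vw` for a model whose `context_size` is 6.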
## Installation
```bash
pip install onnxruntime-gpu numpy torch torchvision pyyaml pillow
# CPU-only: use onnxruntime instead of onnxruntime-gpu
pip install opencv-python # required by S2E (frame resizing)
```
Optional, lab-internal dependency: `Vint`, `GNM`, and `NoMaD` expose an extra `inference_vw_pp()`
method that uses `urbansim.custom.pp.PurePursuitController`; it is imported lazily and only needed
for that method. **`MIMIC` imports `urbansim` at module load**, so its `inference.py` will not import
without the `urbansim` package on your path.
## Model details
### GNM_GL_Official — `gnm_imagegoal.onnx` (+ `.onnx.data`)
**Paper:** *GNM: A General Navigation Model to Drive Any Robot* (ICRA 2023) · [arXiv:2210.03370](https://arxiv.org/abs/2210.03370) · [code](https://github.com/robodhruv/drive-any-robot)
Goal-free General Navigation Model — same NavDP image-goal I/O contract as ViNT (`obs_img (B,18,64,85)` + `goal_img (B,3,64,85)` → `dist_pred (B,1)`, `action_pred (B,5,4)`), with a lower top speed. Expects input downsampled to ≈ 3 Hz.
### Vint_GL_Official — `vint_imagegoal.onnx` (+ `.onnx.data`)
**Paper:** *ViNT: A Foundation Model for Visual Navigation* (CoRL 2023) · [arXiv:2306.14846](https://arxiv.org/abs/2306.14846) · [project](https://general-navigation-models.github.io/vint/)
Goal-free ViNT (NavDP image-goal backbone run with a random goal image). **ONNX I/O:** `obs_img (B,18,64,85)` (6 ImageNet-normalized frames × 3 ch) + `goal_img (B,3,64,85)` (random noise) → `dist_pred (B,1)`, `action_pred (B,5,4)`. Cumulative `xy` is already baked in; the wrapper scales by the 0.8 m metric spacing. Expects input downsampled to ≈ 3 Hz.
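
The tensor shapes above suggest preprocessing along the lines of this sketch. It uses the standard ImageNet mean and standard deviation; the helper names, the frame order along the channel axis, and the reading of the first two columns of `action_pred` as `xy` are assumptions, so treat it as an illustration rather than the wrapper's code.

```python
import numpy as np

# Standard ImageNet statistics, shaped to broadcast over (B, T, 3, H, W).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32).reshape(1, 1, 3, 1, 1)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32).reshape(1, 1, 3, 1, 1)

def pack_obs_img(obs):
    """(B, 6, 3, 64, 85) in [0, 1] -> (B, 18, 64, 85), ImageNet-normalized."""
    normed = (obs - IMAGENET_MEAN) / IMAGENET_STD
    b, t, c, h, w = normed.shape
    return normed.reshape(b, t * c, h, w)     # frames concatenated along channels

def scale_actions(action_pred, spacing_m=0.8):
    """Keep the first two columns of action_pred (B, 5, 4) and scale to meters.

    Assumes columns 0-1 hold the cumulative xy waypoints.
    """
    return action_pred[..., :2] * spacing_m

obs_img = pack_obs_img(np.zeros((2, 6, 3, 64, 85), dtype=np.float32))
waypoints_m = scale_actions(np.ones((2, 5, 4), dtype=np.float32))
```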
### NoMaD_GL_Official — 3× ONNX (diffusion, + `.onnx.data`)
**Paper:** *NoMaD: Goal Masked Diffusion Policies for Navigation and Exploration* (ICRA 2024) · [arXiv:2310.07896](https://arxiv.org/abs/2310.07896) · [project](https://general-navigation-models.github.io/nomad/)
Goal-free diffusion policy. Runs a 10-step DDPM loop (`squaredcos_cap_v2`) over 3 components:
`nomad_vision_encoder.onnx` (`obs_img (B,12,96,96)` + `goal_img (B,3,96,96)` + `goal_mask (B)` → `cond (B,256)`), `nomad_noise_pred.onnx` (one denoising step), and `nomad_dist_pred.onnx`. Produces **8 trajectory samples** → `trajectory (B,8,8,2)` meters (decode: unnormalize → cumsum → ×0.267 m spacing). This is the only multi-modal model and the slowest (diffusion + multiple samples).
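
The decode chain (unnormalize, cumulative sum, scale by spacing) can be sketched as below. The bounds `delta_min` and `delta_max` are hypothetical placeholders for the action statistics, which this README does not list, and the `[-1, 1]` input range is likewise an assumption.

```python
import numpy as np

def decode_samples(normalized_deltas, delta_min, delta_max, spacing_m=0.267):
    """normalized_deltas: (B, 8, 8, 2) in [-1, 1] -> cumulative xy in meters."""
    # Unnormalize from [-1, 1] back to the per-step delta range.
    deltas = (normalized_deltas + 1.0) / 2.0 * (delta_max - delta_min) + delta_min
    # Integrate per-step deltas along the waypoint axis (axis 2).
    waypoints = np.cumsum(deltas, axis=2)
    # Convert from normalized units to meters.
    return waypoints * spacing_m

# Placeholder bounds, not the model's real statistics.
lo = np.array([-1.0, -1.0])
hi = np.array([1.0, 1.0])
trajectory = decode_samples(np.ones((1, 8, 8, 2)), lo, hi)
```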
### CityWalker_PG_Official — `citywalker.onnx`
**Paper:** *CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos* (CVPR 2025) · [arXiv:2411.17820](https://arxiv.org/abs/2411.17820) · [project](https://ai4ce.github.io/CityWalker/)
Point-goal urban walker. **ONNX I/O:** `obs_images (B,5,3,350,630)` + `trajectory (B,6,2)` past waypoints → `wp_pred (B,5,2)`, `arrive_pred (B,1)` (arrival probability). Images are ImageNet-normalized internally; the model's internal `y=forward, x=right` frame is converted to standard frame by the wrapper. Input rate ≈ 5 Hz.
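
The frame conversion is a 90° axis swap, sketched here. The function names are illustrative, and the sketch assumes the internal waypoints are stored as `[x, y]` columns.

```python
import numpy as np

def internal_to_standard(wp_internal):
    """(..., 2) waypoints in [x=right, y=forward] -> [x=forward, y=left]."""
    x_right, y_forward = wp_internal[..., 0], wp_internal[..., 1]
    return np.stack([y_forward, -x_right], axis=-1)

def standard_to_internal(wp_standard):
    """Inverse mapping: [x=forward, y=left] -> [x=right, y=forward]."""
    x_forward, y_left = wp_standard[..., 0], wp_standard[..., 1]
    return np.stack([-y_left, x_forward], axis=-1)

# A point 1 m to the right and 2 m ahead in the internal frame.
wp = np.array([[1.0, 2.0]])
wp_std = internal_to_standard(wp)
```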
### MBRA_PG_Official — `mbra.onnx`
**Paper:** *Learning to Drive Anywhere with Model-Based Reannotation* (RA-L 2025) · [arXiv:2505.05592](https://arxiv.org/abs/2505.05592) · [project](https://model-base-reannotation.github.io/)
Point-goal policy. **ONNX I/O:** `obs_images (B,6,3,96,96)` ImageNet-normalized + `goal_pose (B,4)` = `[x, y, sin(yaw), cos(yaw)]` → `waypoints (B,8,4)`. Goal is given as `goal_xy` (meters) and converted internally; waypoints are un-normalized by a 0.8 m metric spacing. Input rate ≈ 5 Hz.
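
One plausible way to build `goal_pose` from `goal_xy` is sketched below. Both the division of the goal by the 0.8 m spacing and the default goal yaw of 0 are assumptions; the README only states that the conversion happens internally, so check `inference.py` for the actual rule.

```python
import numpy as np

def build_goal_pose(goal_xy, goal_yaw=0.0, spacing_m=0.8):
    """goal_xy in meters -> (B, 4) array [x, y, sin(yaw), cos(yaw)].

    Hypothetical helper: normalizing by spacing_m and yaw=0 are assumptions.
    """
    goal_xy = np.atleast_2d(np.asarray(goal_xy, dtype=np.float32))    # (B, 2)
    yaw = np.full((goal_xy.shape[0], 1), goal_yaw, dtype=np.float32)  # (B, 1)
    return np.concatenate([goal_xy / spacing_m, np.sin(yaw), np.cos(yaw)], axis=1)

goal_pose = build_goal_pose([4.0, 0.8])
```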
### S2E — `s2e.onnx`
**Paper:** *From Seeing to Experiencing: Scaling Navigation Foundation Models with Reinforcement Learning* (ICLR 2026) · [arXiv:2507.22028](https://arxiv.org/abs/2507.22028) · [project](https://metadriverse.github.io/s2e)
UCLA-VAIL navigation foundation model; this is the behavior-cloning, point-goal, web-pretrained variant (`S2EBC-PG-Web100`). **ONNX I/O:** `obs_images (B,11,3,256,256)` in `[0,1]` (no ImageNet norm) + `goal (B,3)` = `[norm_dist, cos(θ), sin(θ)]` → `wp_pred (B,10,3)` `[x,y,yaw]`, `wp_pred_score (B,63)` mode scores. Frames are resized to 256×256 with OpenCV.
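
The goal encoding can be illustrated with the sketch below, which turns a standard-frame `goal_xy` into `[norm_dist, cos(θ), sin(θ)]`. The normalization constant `max_dist_m` and the clipping to `[0, 1]` are assumptions; the README does not state how `norm_dist` is computed.

```python
import numpy as np

def build_goal(goal_xy, max_dist_m=10.0):
    """goal_xy in meters (x=forward, y=left) -> (B, 3) [norm_dist, cos, sin].

    max_dist_m is a hypothetical normalization constant.
    """
    goal_xy = np.atleast_2d(np.asarray(goal_xy, dtype=np.float32))   # (B, 2)
    dist = np.linalg.norm(goal_xy, axis=1)                           # range to goal
    theta = np.arctan2(goal_xy[:, 1], goal_xy[:, 0])                 # bearing to goal
    norm_dist = np.clip(dist / max_dist_m, 0.0, 1.0)
    return np.stack([norm_dist, np.cos(theta), np.sin(theta)], axis=1).astype(np.float32)

goal = build_goal([5.0, 0.0])
```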
### MIMIC — `mimic.onnx`
**Paper:** *Learning Sidewalk Autopilot from Multi-Scale Imitation with Corrective Behavior Expansion* (ICRA 2026) · [arXiv:2603.22527](https://arxiv.org/abs/2603.22527) · [project](https://vail-ucla.github.io/MIMIC)
UCLA-VAIL goal-free long-context sidewalk policy. **ONNX I/O:** `input (1,16,3,288,512)` in `[0,1]` → `output (1,15,3)` `[x,y,yaw]` at non-uniform timestamps (0.2 s–5.0 s @ 5 Hz). Batch is processed one sample at a time; the wrapper keeps the first 13 waypoints (~4 s) and scales to meters. Requires `urbansim` (see Installation).
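
The per-sample loop and waypoint truncation can be pictured as follows. `run_batched` and `fake_run_one` are hypothetical names; `fake_run_one` stands in for the ONNX session call, and the scaling to meters is omitted because the README does not give the factor.

```python
import numpy as np

def run_batched(obs, run_one, keep=13):
    """Call a batch-1 model once per sample and keep the first `keep` waypoints."""
    outputs = [run_one(obs[i:i + 1]) for i in range(obs.shape[0])]   # each (1, 15, 3)
    stacked = np.concatenate(outputs, axis=0)                        # (B, 15, 3)
    return stacked[:, :keep, :2]                                     # (B, keep, 2), xy only

def fake_run_one(sample):
    """Stand-in for running mimic.onnx on one (1, 16, 3, 288, 512) sample."""
    return np.zeros((1, 15, 3), dtype=np.float32)

obs = np.zeros((2, 16, 3, 288, 512), dtype=np.float32)
traj = run_batched(obs, fake_run_one)
```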
## Downloading
**Full repo** (includes the LFS-tracked ONNX weights):
```bash
hf download UCLA-VAIL/Navigation-Model-Zoo-Public --local-dir ./Navigation-Model-Zoo-Public
```
**One model** — fetch just its folder, e.g. MBRA:
```bash
hf download UCLA-VAIL/Navigation-Model-Zoo-Public \
--include "MBRA_PG_Official/*" --local-dir .
```
Then run from the repo root: `from MBRA_PG_Official.inference import MBRAPGNavigator`.
> **External weights:** GNM, ViNT, and NoMaD ship `*.onnx.data` files — keep each `.onnx` and its
> `.onnx.data` together in the same folder so ONNX Runtime can resolve the weights.
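
A quick check after downloading can catch a missing companion file before ONNX Runtime fails to load the model. This is a minimal sketch using only the standard library; the helper name `missing_external_data` is illustrative.

```python
from pathlib import Path

def missing_external_data(model_dir, onnx_names):
    """Return the .onnx files in model_dir whose .onnx.data companion is absent."""
    model_dir = Path(model_dir)
    missing = []
    for name in onnx_names:
        companion = model_dir / (name + ".data")   # e.g. gnm_imagegoal.onnx.data
        if not companion.is_file():
            missing.append(name)
    return missing

# An empty list means every listed model has its weights file alongside it.
print(missing_external_data("GNM_GL_Official", ["gnm_imagegoal.onnx"]))
```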
## Intended use & limitations
These are **research artifacts** for navigation research, reproduction, and benchmarking — not
safety-validated for deployment on real robots without additional testing. Each policy's behavior
is bounded by its training distribution (camera intrinsics, embodiment, frame rate, environment).
Several wrappers rectify/resize inputs to a specific training camera; mismatched cameras may degrade
performance.
## License
Released under **Apache 2.0**. Individual models carry the licenses and terms of their original
sources (ViNT, GNM, NoMaD, CityWalker, MBRA) — check upstream before commercial use.
## Citation
If you use a model from this zoo, please cite its original paper.

**GNM**
```bibtex
@inproceedings{shah2023gnm,
title={{GNM}: A general navigation model to drive any robot},
author={Shah, Dhruv and Sridhar, Ajay and Bhorkar, Arjun and Hirose, Noriaki and Levine, Sergey},
booktitle={2023 IEEE International Conference on Robotics and Automation (ICRA)},
pages={7226--7233},
year={2023},
organization={IEEE}
}
```
**ViNT**
```bibtex
@article{shah2023vint,
title={ViNT: A foundation model for visual navigation},
author={Shah, Dhruv and Sridhar, Ajay and Dashora, Nitish and Stachowicz, Kyle and Black, Kevin and Hirose, Noriaki and Levine, Sergey},
journal={arXiv preprint arXiv:2306.14846},
year={2023}
}
```
**NoMaD**
```bibtex
@inproceedings{sridhar2024nomad,
title={{NoMaD}: Goal masked diffusion policies for navigation and exploration},
author={Sridhar, Ajay and Shah, Dhruv and Glossop, Catherine and Levine, Sergey},
booktitle={2024 IEEE International Conference on Robotics and Automation (ICRA)},
pages={63--70},
year={2024},
organization={IEEE}
}
```
**CityWalker**
```bibtex
@inproceedings{liu2025citywalker,
title={{CityWalker}: Learning embodied urban navigation from web-scale videos},
author={Liu, Xinhao and Li, Jintong and Jiang, Yicheng and Sujay, Niranjan and Yang, Zhicheng and Zhang, Juexiao and Abanes, John and Zhang, Jing and Feng, Chen},
booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
pages={6875--6885},
year={2025}
}
```
**MBRA**
```bibtex
@article{hirose2025learning,
title={Learning to drive anywhere with model-based reannotation},
author={Hirose, Noriaki and Ignatova, Lydia and Stachowicz, Kyle and Glossop, Catherine and Levine, Sergey and Shah, Dhruv},
journal={IEEE Robotics and Automation Letters},
volume={11},
number={2},
pages={1242--1249},
year={2025},
publisher={IEEE}
}
```
**S2E**
```bibtex
@article{he2025seeing,
title={From seeing to experiencing: Scaling navigation foundation models with reinforcement learning},
author={He, Honglin and Ma, Yukai and Squicciarini, Brad and Wu, Wayne and Zhou, Bolei},
journal={arXiv preprint arXiv:2507.22028},
year={2025}
}
```
**MIMIC**
```bibtex
@article{he2026learning,
title={Learning Sidewalk Autopilot from Multi-Scale Imitation with Corrective Behavior Expansion},
author={He, Honglin and Ma, Yukai and Squicciarini, Brad and Wu, Wayne and Zhou, Bolei},
journal={arXiv preprint arXiv:2603.22527},
year={2026}
}
```
## Contact
Maintained by [UCLA-VAIL](https://vail-ucla.github.io/). Open an issue/discussion on the
repository page for questions or contributions.