---
license: apache-2.0
tags:
- robotics
- navigation
- visual-navigation
- embodied-ai
- onnx
pipeline_tag: robotics
---
# Navigation Model Zoo
A collection of vision-based navigation policies exported to **ONNX**, each wrapped in a small,
uniform Python inference API. Maintained by **Honglin He @ UCLA-VAIL**.
Every model takes a short history of RGB frames and predicts a local trajectory (and optionally a
distance-to-goal / arrival signal); a built-in PD controller turns the trajectory into `(v, ω)`
velocity commands. All models share the same wrapper interface so they can be swapped and
benchmarked without per-model glue code.
## Models
| Folder | Model / paper | Goal mode | Context | Input H×W | Waypoints | Weights |
|--------|---------------|-----------|:-------:|:---------:|:---------:|---------|
| [`GNM_GL_Official`](GNM_GL_Official) | [GNM](https://arxiv.org/abs/2210.03370) · ICRA 2023 | goal-free | 6 | 64×85 | 5 | `gnm_imagegoal.onnx` (+`.data`) · 35 MB |
| [`Vint_GL_Official`](Vint_GL_Official) | [ViNT](https://arxiv.org/abs/2306.14846) · CoRL 2023 | goal-free | 6 | 64×85 | 5 | `vint_imagegoal.onnx` (+`.data`) · 97 MB |
| [`NoMaD_GL_Official`](NoMaD_GL_Official) | [NoMaD](https://arxiv.org/abs/2310.07896) · ICRA 2024 | goal-free (diffusion) | 4 | 96×96 | 8 (× 8 samples) | 3× `.onnx` (+`.data`) · 111 MB |
| [`CityWalker_PG_Official`](CityWalker_PG_Official) | [CityWalker](https://arxiv.org/abs/2411.17820) · CVPR 2025 | point-goal | 5 | 350×630 | 5 | `citywalker.onnx` · 806 MB |
| [`MBRA_PG_Official`](MBRA_PG_Official) | [MBRA](https://arxiv.org/abs/2505.05592) · RA-L 2025 | point-goal | 6 | 96×96 | 8 | `mbra.onnx` · 254 MB |
| [`S2E`](S2E) | [S2E](https://arxiv.org/abs/2507.22028) · ICLR 2026 | point-goal / goal-free | 11 | 256×256 | 10 | `s2e.onnx` · 382 MB |
| [`MIMIC`](MIMIC) | [MIMIC](https://arxiv.org/abs/2603.22527) · ICRA 2026 | goal-free | 16 | 288×512 | 13 | `mimic.onnx` · 318 MB |
Suffix legend: `PG` = point-goal, `GL` = goal-less (goal-free). Models with a `.onnx.data` companion
(GNM, ViNT, NoMaD) use ONNX external weights — keep each `.onnx` and its `.onnx.data` together.
## Common interface
Each folder is a self-contained module exposing one navigator class. They all follow the same contract:
```python
import numpy as np
from MBRA_PG_Official.inference import MBRAPGNavigator # run from the repo root
nav = MBRAPGNavigator(device="cuda") # use device="cpu" if you have no GPU
# obs: (B, nav.context_size, 3, H, W) float32 in [0, 1]
# the wrapper resizes & normalizes to the model's spec internally
obs = np.random.rand(1, nav.context_size, 3, 96, 96).astype(np.float32)
# Point-goal models take goal_xy (standard frame: x=forward, y=left, meters);
# goal-free models omit it.
traj, scores = nav.inference_trajectory(obs, goal_xy=np.array([5.0, 0.2])) # (B, M, W, 2) meters
vw, best = nav.inference_vw(obs, goal_xy=np.array([5.0, 0.2])) # vw: (B, 2) = [v, ω]
nav.reset() # clears PD-controller velocity smoothing between episodes
```
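
The `obs` layout used above has to be assembled from camera frames before calling the wrapper. A minimal sketch, assuming the frames arrive as `uint8` RGB arrays of shape `(H, W, 3)` and that the oldest frame comes first along the context axis (both are assumptions; `frames_to_obs` is a hypothetical helper, not part of the wrappers):

```python
import numpy as np

def frames_to_obs(frames, context_size):
    """Stack the newest `context_size` HWC uint8 frames into (1, T, 3, H, W) float32 in [0, 1]."""
    if len(frames) < context_size:
        raise ValueError(f"need {context_size} frames, got {len(frames)}")
    recent = np.stack(frames[-context_size:], axis=0)   # (T, H, W, 3); ASSUMPTION: oldest first
    obs = recent.astype(np.float32) / 255.0             # uint8 -> [0, 1]
    obs = obs.transpose(0, 3, 1, 2)                     # (T, 3, H, W)
    return np.ascontiguousarray(obs[None])              # add batch axis: (1, T, 3, H, W)

# Usage with synthetic frames; a real loop would append camera images instead.
frames = [np.full((96, 96, 3), 10 * i, dtype=np.uint8) for i in range(8)]
obs = frames_to_obs(frames, context_size=6)
print(obs.shape, obs.dtype)
```

The result can be passed as `obs` to `inference_trajectory` or `inference_vw`; resizing is left to the wrapper, as described below.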
Conventions shared by every model:
- **Coordinate frame** — all user-facing inputs/outputs are *standard frame*: `x = forward`, `y = left`, in meters. Models with a different internal convention (e.g. CityWalker) convert transparently.
- **Observations**: `(B, context_size, 3, H, W)`, `float32`, pixel values in `[0, 1]`. The wrapper handles resize and any ImageNet normalization. *(Exception: `MIMIC` expects frames already at 288×512 and does not resize.)*
- **`inference_trajectory(obs[, goal_xy])`** → `(trajectory, scores)`. `trajectory` is `(B, M, W, 2)` in meters, where `M` is the number of modes (1 for unimodal, 8 for NoMaD) and `W` the waypoint count; `scores` is `(B, M)`.
- **`inference_vw(obs[, goal_xy])`** → `(vw, best_traj)`, where `vw` is a `(B, 2)` torch tensor of `[linear_v, angular_w]`. Tune limits with `max_v` / `max_w` at construction.
- Goal-free models (`Vint`, `GNM`, `NoMaD`, `MIMIC`) ignore `goal_xy` — call `inference_trajectory(obs)`.
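
To give a feel for how a waypoint in the standard frame can become a `(v, ω)` command, here is a minimal PD-style tracker. It is an illustrative sketch only and is not the wrappers' built-in controller: the class name `WaypointPD`, the gains, and the time step `dt` are all assumptions. Only the `max_v` / `max_w` limits and the `reset()` idea mirror the interface described above.

```python
import math

class WaypointPD:
    """Illustrative waypoint tracker (hypothetical; NOT the wrappers' controller)."""

    def __init__(self, kp_v=1.0, kp_w=2.0, kd_w=0.1, max_v=1.0, max_w=1.0, dt=0.2):
        self.kp_v, self.kp_w, self.kd_w = kp_v, kp_w, kd_w
        self.max_v, self.max_w, self.dt = max_v, max_w, dt
        self.prev_err = 0.0

    def reset(self):
        """Clear the stored heading error between episodes."""
        self.prev_err = 0.0

    def step(self, waypoint_xy):
        """waypoint_xy: target (x forward, y left) in meters. Returns (v, w)."""
        x, y = float(waypoint_xy[0]), float(waypoint_xy[1])
        heading_err = math.atan2(y, x)                        # positive = target is to the left
        # Proportional speed, reduced when the target is off-axis.
        v = self.kp_v * math.hypot(x, y) * max(0.0, math.cos(heading_err))
        # PD term on the heading error.
        w = self.kp_w * heading_err + self.kd_w * (heading_err - self.prev_err) / self.dt
        self.prev_err = heading_err
        v = min(max(v, 0.0), self.max_v)                      # forward-only, capped
        w = min(max(w, -self.max_w), self.max_w)
        return v, w

pd = WaypointPD(max_v=0.8, max_w=1.0)
print(pd.step((1.0, 0.2)))
```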
## Installation
```bash
pip install onnxruntime-gpu numpy torch torchvision pyyaml pillow
# CPU-only: use onnxruntime instead of onnxruntime-gpu
pip install opencv-python # required by S2E (frame resizing)
```
Optional, lab-internal dependency: `Vint`, `GNM`, and `NoMaD` expose an extra `inference_vw_pp()`
method that uses `urbansim.custom.pp.PurePursuitController`; it is imported lazily and only needed
for that method. **`MIMIC` imports `urbansim` at module load**, so its `inference.py` will not import
without the `urbansim` package on your path.
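
A quick environment check can save a failed import later. The sketch below uses `onnxruntime.get_available_providers()` and the standard-library `importlib.util.find_spec`; the helper names `pick_device` and `has_urbansim` are hypothetical. Note that a listed CUDA provider only means the installed build supports it, not that a working GPU is present.

```python
import importlib.util

def pick_device():
    """Return "cuda" if ONNX Runtime lists the CUDA execution provider, else "cpu"."""
    try:
        import onnxruntime as ort
    except ImportError:
        return "cpu"  # onnxruntime missing entirely; install it before loading a model
    providers = ort.get_available_providers()
    return "cuda" if "CUDAExecutionProvider" in providers else "cpu"

def has_urbansim():
    """True if the lab-internal `urbansim` package is importable (needed by MIMIC)."""
    return importlib.util.find_spec("urbansim") is not None

print("device:", pick_device())
print("urbansim available:", has_urbansim())
```

The returned string can be passed as the `device` argument of a navigator class.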
## Model details
### GNM_GL_Official — `gnm_imagegoal.onnx` (+ `.onnx.data`)
**Paper:** *GNM: A General Navigation Model to Drive Any Robot* (ICRA 2023) · [arXiv:2210.03370](https://arxiv.org/abs/2210.03370) · [code](https://github.com/robodhruv/drive-any-robot)
Goal-free General Navigation Model — same NavDP image-goal I/O contract as ViNT (`obs_img (B,18,64,85)` + `goal_img (B,3,64,85)` → `dist_pred (B,1)`, `action_pred (B,5,4)`), with a lower top speed. Expects input downsampled to ≈ 3 Hz.
### Vint_GL_Official — `vint_imagegoal.onnx` (+ `.onnx.data`)
**Paper:** *ViNT: A Foundation Model for Visual Navigation* (CoRL 2023) · [arXiv:2306.14846](https://arxiv.org/abs/2306.14846) · [project](https://general-navigation-models.github.io/vint/)
Goal-free ViNT (NavDP image-goal backbone run with a random goal image). **ONNX I/O:** `obs_img (B,18,64,85)` (6 ImageNet-normalized frames × 3 ch) + `goal_img (B,3,64,85)` (random noise) → `dist_pred (B,1)`, `action_pred (B,5,4)`. Cumulative `xy` is already baked in; the wrapper scales by the 0.8 m metric spacing. Expects input downsampled to ≈ 3 Hz.
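
The I/O contract above can be sketched in NumPy as follows. This is an approximation of what the wrapper does, not its actual code: the frame-major channel order, the Gaussian goal noise, and the choice of the first two `action_pred` channels as `xy` are assumptions, and both function names are hypothetical.

```python
import numpy as np

# Standard ImageNet statistics, shaped to broadcast over (B, T, 3, H, W).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32).reshape(1, 1, 3, 1, 1)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32).reshape(1, 1, 3, 1, 1)

def pack_vint_inputs(obs, rng=None):
    """obs: (B, 6, 3, 64, 85) float32 in [0, 1], already resized.

    Returns obs_img (B, 18, 64, 85) and a random goal_img (B, 3, 64, 85).
    """
    b, t, c, h, w = obs.shape
    normed = (obs - IMAGENET_MEAN) / IMAGENET_STD
    obs_img = normed.reshape(b, t * c, h, w)           # ASSUMPTION: frame-major channel stacking
    if rng is None:
        rng = np.random.default_rng()
    goal_img = rng.standard_normal((b, c, h, w)).astype(np.float32)  # ASSUMPTION: Gaussian noise
    return obs_img, goal_img

def decode_vint_actions(action_pred, spacing_m=0.8):
    """action_pred: (B, 5, 4) -> trajectory (B, 1, 5, 2) in meters.

    ASSUMPTION: channels 0 and 1 hold the already-cumulative xy.
    """
    return action_pred[..., :2][:, None] * spacing_m

obs = np.random.rand(1, 6, 3, 64, 85).astype(np.float32)
obs_img, goal_img = pack_vint_inputs(obs)
print(obs_img.shape, goal_img.shape)
```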
### NoMaD_GL_Official — 3× ONNX (diffusion, + `.onnx.data`)
**Paper:** *NoMaD: Goal Masked Diffusion Policies for Navigation and Exploration* (ICRA 2024) · [arXiv:2310.07896](https://arxiv.org/abs/2310.07896) · [project](https://general-navigation-models.github.io/nomad/)
Goal-free diffusion policy. Runs a 10-step DDPM loop (`squaredcos_cap_v2`) over 3 components:
`nomad_vision_encoder.onnx` (`obs_img (B,12,96,96)` + `goal_img (B,3,96,96)` + `goal_mask (B)` → `cond (B,256)`), `nomad_noise_pred.onnx` (one denoising step), and `nomad_dist_pred.onnx`. Produces **8 trajectory samples** → `trajectory (B,8,8,2)` in meters (decode: unnormalize → cumsum → ×0.267 m spacing). This is the only multi-modal model and the slowest (diffusion + multiple samples).
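
The decode chain (unnormalize, cumulative sum, metric scaling) can be sketched as below. The min-max form of the unnormalization and the `delta_min` / `delta_max` arguments are assumptions; the real statistics live in the wrapper and are not reproduced here, so the usage example passes placeholder values.

```python
import numpy as np

def decode_nomad_samples(normalized_deltas, delta_min, delta_max, spacing_m=0.267):
    """normalized_deltas: (B, 8, 8, 2) in [-1, 1] -> trajectory (B, 8, 8, 2) in meters."""
    lo = np.asarray(delta_min, dtype=np.float32)
    hi = np.asarray(delta_max, dtype=np.float32)
    # 1) Unnormalize. ASSUMPTION: min-max scaling from [-1, 1] back to [lo, hi].
    deltas = (normalized_deltas + 1.0) / 2.0 * (hi - lo) + lo
    # 2) Per-step deltas -> cumulative positions along the waypoint axis.
    waypoints = np.cumsum(deltas, axis=-2)
    # 3) Scale unitless steps to meters.
    return waypoints * spacing_m

# Placeholder statistics, NOT the model's real ones.
samples = np.random.uniform(-1, 1, size=(1, 8, 8, 2)).astype(np.float32)
traj = decode_nomad_samples(samples, delta_min=[-1.0, -1.0], delta_max=[1.0, 1.0])
print(traj.shape)
```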
### CityWalker_PG_Official — `citywalker.onnx`
**Paper:** *CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos* (CVPR 2025) · [arXiv:2411.17820](https://arxiv.org/abs/2411.17820) · [project](https://ai4ce.github.io/CityWalker/)
Point-goal urban walker. **ONNX I/O:** `obs_images (B,5,3,350,630)` + `trajectory (B,6,2)` past waypoints → `wp_pred (B,5,2)`, `arrive_pred (B,1)` (arrival probability). Images are ImageNet-normalized internally; the model's internal `y=forward, x=right` frame is converted to standard frame by the wrapper. Input rate ≈ 5 Hz.
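
The frame conversion is a 90° axis swap. A small sketch of both directions, derived from the two conventions stated above (`y=forward, x=right` internally; `x=forward, y=left` in the standard frame); the function names are hypothetical:

```python
import numpy as np

def internal_to_standard(xy):
    """(x=right, y=forward) -> (x=forward, y=left); works on any (..., 2) array."""
    xy = np.asarray(xy, dtype=np.float32)
    return np.stack([xy[..., 1], -xy[..., 0]], axis=-1)

def standard_to_internal(xy):
    """(x=forward, y=left) -> (x=right, y=forward); inverse of the function above."""
    xy = np.asarray(xy, dtype=np.float32)
    return np.stack([-xy[..., 1], xy[..., 0]], axis=-1)

# A point 2 m ahead and 1 m to the right, expressed in the internal frame.
print(internal_to_standard([1.0, 2.0]))
```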
### MBRA_PG_Official — `mbra.onnx`
**Paper:** *Learning to Drive Anywhere with Model-Based Reannotation* (RA-L 2025) · [arXiv:2505.05592](https://arxiv.org/abs/2505.05592) · [project](https://model-base-reannotation.github.io/)
Point-goal policy. **ONNX I/O:** `obs_images (B,6,3,96,96)` ImageNet-normalized + `goal_pose (B,4)` = `[x, y, sin(yaw), cos(yaw)]` → `waypoints (B,8,4)`. Goal is given as `goal_xy` (meters) and converted internally; waypoints are un-normalized by a 0.8 m metric spacing. Input rate ≈ 5 Hz.
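
One plausible shape for the `goal_xy` to `goal_pose` conversion and the waypoint scaling is sketched below. Several details are guesses rather than facts from the wrapper: dividing the goal by the 0.8 m spacing, choosing the yaw that faces the goal, and reading channels 0 and 1 of `waypoints` as `xy`. Both function names are hypothetical.

```python
import math
import numpy as np

def goal_xy_to_pose(goal_xy, spacing_m=0.8):
    """goal_xy in meters -> goal_pose (1, 4) = [x, y, sin(yaw), cos(yaw)]."""
    x, y = float(goal_xy[0]), float(goal_xy[1])
    yaw = math.atan2(y, x)                      # ASSUMPTION: desired heading faces the goal
    return np.array(
        [[x / spacing_m, y / spacing_m, math.sin(yaw), math.cos(yaw)]],  # ASSUMPTION: spacing-normalized xy
        dtype=np.float32,
    )

def waypoints_to_meters(waypoints, spacing_m=0.8):
    """waypoints (B, 8, 4) -> xy (B, 8, 2) in meters. ASSUMPTION: channels 0-1 are xy."""
    return waypoints[..., :2] * spacing_m

print(goal_xy_to_pose([5.0, 0.2]))
```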
### S2E — `s2e.onnx`
**Paper:** *From Seeing to Experiencing: Scaling Navigation Foundation Models with Reinforcement Learning* (ICLR 2026) · [arXiv:2507.22028](https://arxiv.org/abs/2507.22028) · [project](https://metadriverse.github.io/s2e)
UCLA-VAIL navigation foundation model; this is the behavior-cloning, point-goal, web-pretrained variant (`S2EBC-PG-Web100`). **ONNX I/O:** `obs_images (B,11,3,256,256)` in `[0,1]` (no ImageNet norm) + `goal (B,3)` = `[norm_dist, cos(θ), sin(θ)]` → `wp_pred (B,10,3)` `[x,y,yaw]`, `wp_pred_score (B,63)` mode scores. Frames are resized to 256×256 with OpenCV.
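
The polar goal encoding `[norm_dist, cos(θ), sin(θ)]` can be illustrated as follows. How the wrapper normalizes distance is not documented here, so the clipped division by `max_dist_m` is an assumption, and both that parameter and the function name are hypothetical.

```python
import math
import numpy as np

def encode_s2e_goal(goal_xy, max_dist_m=20.0):
    """goal_xy (x forward, y left, meters) -> goal (1, 3) = [norm_dist, cos(theta), sin(theta)]."""
    x, y = float(goal_xy[0]), float(goal_xy[1])
    theta = math.atan2(y, x)                            # bearing to the goal, left positive
    norm_dist = min(math.hypot(x, y) / max_dist_m, 1.0)  # ASSUMPTION: clip to [0, 1]
    return np.array([[norm_dist, math.cos(theta), math.sin(theta)]], dtype=np.float32)

print(encode_s2e_goal([5.0, 0.2]))
```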
### MIMIC — `mimic.onnx`
**Paper:** *Learning Sidewalk Autopilot from Multi-Scale Imitation with Corrective Behavior Expansion* (ICRA 2026) · [arXiv:2603.22527](https://arxiv.org/abs/2603.22527) · [project](https://vail-ucla.github.io/MIMIC)
UCLA-VAIL goal-free long-context sidewalk policy. **ONNX I/O:** `input (1,16,3,288,512)` in `[0,1]` → `output (1,15,3)` `[x,y,yaw]` at non-uniform timestamps (0.2 s–5.0 s @ 5 Hz). Batch is processed one sample at a time; the wrapper keeps the first 13 waypoints (~4 s) and scales to meters. Requires `urbansim` (see Installation).
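
Because the exported graph has a fixed batch size of 1, a batch has to be fed one sample at a time. The sketch below shows that loop with a stand-in callable; `run_per_sample` and `fake_model` are hypothetical, and the scaling to meters is omitted because its factor is not stated here.

```python
import numpy as np

def run_per_sample(run_one, obs, keep=13):
    """Apply a batch-1 model to each sample of obs (B, 16, 3, H, W).

    run_one maps (1, 16, 3, H, W) -> (1, 15, 3); returns (B, keep, 3).
    """
    outputs = [run_one(obs[i:i + 1])[:, :keep] for i in range(obs.shape[0])]
    return np.concatenate(outputs, axis=0)

def fake_model(x):
    """Stand-in for the ONNX session; a real run_one would call the session on `x`."""
    assert x.shape[0] == 1
    return np.zeros((1, 15, 3), dtype=np.float32)

# Small spatial size to keep the demo light; MIMIC itself expects 288x512 frames.
batch = np.zeros((2, 16, 3, 8, 8), dtype=np.float32)
print(run_per_sample(fake_model, batch).shape)
```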
## Downloading
**Full repo** (includes the LFS-tracked ONNX weights):
```bash
hf download UCLA-VAIL/Navigation-Model-Zoo-Public --local-dir ./Navigation-Model-Zoo-Public
```
**One model** — fetch just its folder, e.g. MBRA:
```bash
hf download UCLA-VAIL/Navigation-Model-Zoo-Public \
--include "MBRA_PG_Official/*" --local-dir .
```
Then run from the repo root: `from MBRA_PG_Official.inference import MBRAPGNavigator`.
> **External weights:** GNM, ViNT, and NoMaD ship `*.onnx.data` files — keep each `.onnx` and its
> `.onnx.data` together in the same folder so ONNX Runtime can resolve the weights.
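
After a partial download it can be worth checking that no companion file was left behind. A minimal sketch using only `pathlib`; it assumes every `.onnx` file in the three listed folders has a `.onnx.data` sibling, and the function name is hypothetical:

```python
from pathlib import Path

EXTERNAL_WEIGHT_FOLDERS = ("GNM_GL_Official", "Vint_GL_Official", "NoMaD_GL_Official")

def find_missing_external_data(root, folders=EXTERNAL_WEIGHT_FOLDERS):
    """Return the .onnx files under `folders` that have no sibling `<name>.onnx.data`."""
    missing = []
    for folder in folders:
        for onnx_path in sorted(Path(root, folder).glob("*.onnx")):
            companion = onnx_path.with_name(onnx_path.name + ".data")
            if not companion.exists():
                missing.append(onnx_path)
    return missing

# Run from the repo root; folders that were not downloaded are simply skipped.
for path in find_missing_external_data("."):
    print("missing external weights for", path)
```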
## Intended use & limitations
These are **research artifacts** for navigation research, reproduction, and benchmarking — not
safety-validated for deployment on real robots without additional testing. Each policy's behavior
is bounded by its training distribution (camera intrinsics, embodiment, frame rate, environment).
Several wrappers rectify/resize inputs to a specific training camera; mismatched cameras may degrade
performance.
## License
Released under **Apache 2.0**. Individual models carry the licenses and terms of their original
sources (ViNT, GNM, NoMaD, CityWalker, MBRA) — check upstream before commercial use.
## Citation
If you use a model from this zoo, please cite its original paper.
**GNM**
```bibtex
@inproceedings{shah2023gnm,
title={Gnm: A general navigation model to drive any robot},
author={Shah, Dhruv and Sridhar, Ajay and Bhorkar, Arjun and Hirose, Noriaki and Levine, Sergey},
booktitle={2023 IEEE International Conference on Robotics and Automation (ICRA)},
pages={7226--7233},
year={2023},
organization={IEEE}
}
```
**ViNT**
```bibtex
@article{shah2023vint,
title={ViNT: A foundation model for visual navigation},
author={Shah, Dhruv and Sridhar, Ajay and Dashora, Nitish and Stachowicz, Kyle and Black, Kevin and Hirose, Noriaki and Levine, Sergey},
journal={arXiv preprint arXiv:2306.14846},
year={2023}
}
```
**NoMaD**
```bibtex
@inproceedings{sridhar2024nomad,
title={Nomad: Goal masked diffusion policies for navigation and exploration},
author={Sridhar, Ajay and Shah, Dhruv and Glossop, Catherine and Levine, Sergey},
booktitle={2024 IEEE International Conference on Robotics and Automation (ICRA)},
pages={63--70},
year={2024},
organization={IEEE}
}
```
**CityWalker**
```bibtex
@inproceedings{liu2025citywalker,
title={Citywalker: Learning embodied urban navigation from web-scale videos},
author={Liu, Xinhao and Li, Jintong and Jiang, Yicheng and Sujay, Niranjan and Yang, Zhicheng and Zhang, Juexiao and Abanes, John and Zhang, Jing and Feng, Chen},
booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
pages={6875--6885},
year={2025}
}
```
**MBRA**
```bibtex
@article{hirose2025learning,
title={Learning to drive anywhere with model-based reannotation},
author={Hirose, Noriaki and Ignatova, Lydia and Stachowicz, Kyle and Glossop, Catherine and Levine, Sergey and Shah, Dhruv},
journal={IEEE Robotics and Automation Letters},
volume={11},
number={2},
pages={1242--1249},
year={2025},
publisher={IEEE}
}
```
**S2E**
```bibtex
@article{he2025seeing,
title={From seeing to experiencing: Scaling navigation foundation models with reinforcement learning},
author={He, Honglin and Ma, Yukai and Squicciarini, Brad and Wu, Wayne and Zhou, Bolei},
journal={arXiv preprint arXiv:2507.22028},
year={2025}
}
```
**MIMIC**
```bibtex
@article{he2026learning,
title={Learning Sidewalk Autopilot from Multi-Scale Imitation with Corrective Behavior Expansion},
author={He, Honglin and Ma, Yukai and Squicciarini, Brad and Wu, Wayne and Zhou, Bolei},
journal={arXiv preprint arXiv:2603.22527},
year={2026}
}
```
## Contact
Maintained by [UCLA-VAIL](https://vail-ucla.github.io/). Open an issue/discussion on the
repository page for questions or contributions.