| --- |
| license: apache-2.0 |
| tags: |
| - robotics |
| - navigation |
| - visual-navigation |
| - embodied-ai |
| - onnx |
| pipeline_tag: robotics |
| --- |
| |
| # Navigation Model Zoo |
|
|
| A collection of vision-based navigation policies exported to **ONNX**, each wrapped in a small, |
| uniform Python inference API. Maintained by **Honglin He @ UCLA-VAIL**. |
|
|
| Every model takes a short history of RGB frames and predicts a local trajectory (and optionally a |
| distance-to-goal / arrival signal); a built-in PD controller turns the trajectory into `(v, ω)` |
| velocity commands. All models share the same wrapper interface so they can be swapped and |
| benchmarked without per-model glue code. |
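
To make the trajectory-to-velocity step concrete, here is a minimal sketch of a PD-style tracker that maps one local waypoint to `(v, ω)`. It is an illustration only, not the wrapper's actual controller: the class name, the gains, `dt`, and the cosine slowdown are all assumptions, and the velocity smoothing that `reset()` clears in the real wrappers is not modeled.

```python
import math


class PDVelocityController:
    """Illustrative PD tracker: one local waypoint -> (v, w). Gains are placeholders."""

    def __init__(self, kp_v=1.0, kp_w=2.0, kd_w=0.1, max_v=1.0, max_w=1.0, dt=0.2):
        self.kp_v, self.kp_w, self.kd_w = kp_v, kp_w, kd_w
        self.max_v, self.max_w, self.dt = max_v, max_w, dt
        self.prev_heading_err = 0.0

    def reset(self):
        # Forget the previous heading error between episodes.
        self.prev_heading_err = 0.0

    def step(self, target_xy):
        x, y = target_xy  # standard frame: x forward, y left, meters
        dist = math.hypot(x, y)
        heading_err = math.atan2(y, x)
        d_err = (heading_err - self.prev_heading_err) / self.dt
        self.prev_heading_err = heading_err
        # Proportional speed, reduced when the target is far off-axis.
        v = self.kp_v * dist * max(0.0, math.cos(heading_err))
        # PD term on the heading error.
        w = self.kp_w * heading_err + self.kd_w * d_err
        # Clamp to the configured limits.
        v = min(max(v, 0.0), self.max_v)
        w = min(max(w, -self.max_w), self.max_w)
        return v, w
```

A waypoint straight ahead yields pure forward motion, while a waypoint to the left saturates the turn rate at `max_w`.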
|
|
| ## Models |
|
|
| | Folder | Model / paper | Goal mode | Context | Input H×W | Waypoints | Weights | |
| |--------|---------------|-----------|:-------:|:---------:|:---------:|---------| |
| | [`GNM_GL_Official`](GNM_GL_Official) | [GNM](https://arxiv.org/abs/2210.03370) · ICRA 2023 | goal-free | 6 | 64×85 | 5 | `gnm_imagegoal.onnx` (+`.data`) · 35 MB | |
| | [`Vint_GL_Official`](Vint_GL_Official) | [ViNT](https://arxiv.org/abs/2306.14846) · CoRL 2023 | goal-free | 6 | 64×85 | 5 | `vint_imagegoal.onnx` (+`.data`) · 97 MB | |
| | [`NoMaD_GL_Official`](NoMaD_GL_Official) | [NoMaD](https://arxiv.org/abs/2310.07896) · ICRA 2024 | goal-free (diffusion) | 4 | 96×96 | 8 × 8 samples | 3× `.onnx` (+`.data`) · 111 MB | |
| | [`CityWalker_PG_Official`](CityWalker_PG_Official) | [CityWalker](https://arxiv.org/abs/2411.17820) · CVPR 2025 | point-goal | 5 | 350×630 | 5 | `citywalker.onnx` · 806 MB | |
| | [`MBRA_PG_Official`](MBRA_PG_Official) | [MBRA](https://arxiv.org/abs/2505.05592) · RA-L 2025 | point-goal | 6 | 96×96 | 8 | `mbra.onnx` · 254 MB | |
| | [`S2E`](S2E) | [S2E](https://arxiv.org/abs/2507.22028) · ICLR 2026 | point-goal / goal-free | 11 | 256×256 | 10 | `s2e.onnx` · 382 MB | |
| | [`MIMIC`](MIMIC) | [MIMIC](https://arxiv.org/abs/2603.22527) · ICRA 2026 | goal-free | 16 | 288×512 | 13 | `mimic.onnx` · 318 MB | |
|
|
| Suffix legend: `PG` = point-goal, `GL` = goal-less (goal-free). Models with a `.onnx.data` companion |
| (GNM, ViNT, NoMaD) use ONNX external weights — keep each `.onnx` and its `.onnx.data` together. |
|
|
| ## Common interface |
|
|
| Each folder is a self-contained module exposing one navigator class. They all follow the same contract: |
|
|
| ```python |
| import numpy as np |
| from MBRA_PG_Official.inference import MBRAPGNavigator # run from the repo root |
| |
| nav = MBRAPGNavigator(device="cuda") # use device="cpu" if you have no GPU |
| |
| # obs: (B, nav.context_size, 3, H, W) float32 in [0, 1] |
| # the wrapper resizes & normalizes to the model's spec internally |
| obs = np.random.rand(1, nav.context_size, 3, 96, 96).astype(np.float32) |
| |
| # Point-goal models take goal_xy (standard frame: x=forward, y=left, meters); |
| # goal-free models omit it. |
| traj, scores = nav.inference_trajectory(obs, goal_xy=np.array([5.0, 0.2])) # (B, M, W, 2) meters |
| vw, best = nav.inference_vw(obs, goal_xy=np.array([5.0, 0.2])) # vw: (B, 2) = [v, ω] |
| |
| nav.reset() # clears PD-controller velocity smoothing between episodes |
| ``` |
|
|
| Conventions shared by every model: |
|
|
| - **Coordinate frame** — all user-facing inputs/outputs are *standard frame*: `x = forward`, `y = left`, in meters. Models with a different internal convention (e.g. CityWalker) convert transparently. |
| - **Observations** — `(B, context_size, 3, H, W)`, `float32`, pixel values in `[0, 1]`. The wrapper handles resize and any ImageNet normalization. *(Exception: `MIMIC` expects frames already at 288×512 and does not resize.)* |
| - **`inference_trajectory(obs[, goal_xy])`** → `(trajectory, scores)`. `trajectory` is `(B, M, W, 2)` in meters, where `M` is the number of modes (1 for unimodal, 8 for NoMaD) and `W` the waypoint count; `scores` is `(B, M)`. |
| - **`inference_vw(obs[, goal_xy])`** → `(vw, best_traj)` where `vw` is a `(B, 2)` torch tensor of `[linear_v, angular_w]`. Tune limits with `max_v` / `max_w` at construction. |
| - Goal-free models (`Vint`, `GNM`, `NoMaD`, `MIMIC`) ignore `goal_xy` — call `inference_trajectory(obs)`. |
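
As a rough sketch of the observation convention above, the helper below turns a list of camera frames into the `(B, context_size, 3, H, W)` float32 layout. The function name is hypothetical, and two choices are assumptions rather than documented behavior: frames are ordered oldest first, and a short history is padded by repeating the oldest available frame.

```python
import numpy as np


def frames_to_obs(frames, context_size):
    """Stack the latest HxWx3 uint8 frames into (1, T, 3, H, W) float32 in [0, 1]."""
    if not frames:
        raise ValueError("need at least one frame")
    recent = list(frames)[-context_size:]
    # Assumption: pad a short history by repeating the oldest available frame.
    recent = [recent[0]] * (context_size - len(recent)) + recent
    stack = np.stack(recent).astype(np.float32) / 255.0  # (T, H, W, 3)
    stack = stack.transpose(0, 3, 1, 2)                   # (T, 3, H, W)
    return stack[None]                                    # (1, T, 3, H, W)


# Three white 96x96 frames padded up to a context of six.
frames = [np.full((96, 96, 3), 255, dtype=np.uint8) for _ in range(3)]
obs = frames_to_obs(frames, context_size=6)
```

The result can be passed as `obs` to `inference_trajectory` or `inference_vw`, with `context_size` taken from `nav.context_size`.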
|
|
| ## Installation |
|
|
| ```bash |
| pip install onnxruntime-gpu numpy torch torchvision pyyaml pillow |
| # CPU-only: use onnxruntime instead of onnxruntime-gpu |
| pip install opencv-python # required by S2E (frame resizing) |
| ``` |
|
|
| Optional, lab-internal dependency: `Vint`, `GNM`, and `NoMaD` expose an extra `inference_vw_pp()` |
| method that uses `urbansim.custom.pp.PurePursuitController`; it is imported lazily and only needed |
| for that method. **`MIMIC` imports `urbansim` at module load**, so its `inference.py` cannot be imported |
| unless the `urbansim` package is on your Python path. |
|
|
| ## Model details |
|
|
| ### GNM_GL_Official — `gnm_imagegoal.onnx` (+ `.onnx.data`) |
| **Paper:** *GNM: A General Navigation Model to Drive Any Robot* (ICRA 2023) · [arXiv:2210.03370](https://arxiv.org/abs/2210.03370) · [code](https://github.com/robodhruv/drive-any-robot) |
| |
| Goal-free General Navigation Model — same NavDP image-goal I/O contract as ViNT (`obs_img (B,18,64,85)` + `goal_img (B,3,64,85)` → `dist_pred (B,1)`, `action_pred (B,5,4)`), with a lower top speed. Expects input downsampled to ≈ 3 Hz. |
| |
| ### Vint_GL_Official — `vint_imagegoal.onnx` (+ `.onnx.data`) |
| **Paper:** *ViNT: A Foundation Model for Visual Navigation* (CoRL 2023) · [arXiv:2306.14846](https://arxiv.org/abs/2306.14846) · [project](https://general-navigation-models.github.io/vint/) |
|
|
| Goal-free ViNT (NavDP image-goal backbone run with a random goal image). **ONNX I/O:** `obs_img (B,18,64,85)` (6 ImageNet-normalized frames × 3 ch) + `goal_img (B,3,64,85)` (random noise) → `dist_pred (B,1)`, `action_pred (B,5,4)`. The exported graph already outputs cumulative `xy`, so the wrapper only scales it by the 0.8 m metric spacing. Expects input downsampled to ≈ 3 Hz. |
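
The channel-stacked `obs_img` layout can be sketched as follows. This is an illustration under assumptions, not the wrapper's code: the helper name is hypothetical, frames are assumed to be ordered oldest first, resizing to 64×85 is omitted, and the random goal image is drawn uniformly from `[0, 1)` because the exact noise distribution is not stated here. The mean and standard deviation are the standard ImageNet values.

```python
import numpy as np

# Standard ImageNet statistics, shaped to broadcast over (B, T, 3, H, W).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32).reshape(1, 1, 3, 1, 1)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32).reshape(1, 1, 3, 1, 1)


def to_channel_stacked(obs):
    """(B, T, 3, H, W) in [0, 1] -> ImageNet-normalized (B, T*3, H, W)."""
    b, t, c, h, w = obs.shape
    normed = (obs.astype(np.float32) - IMAGENET_MEAN) / IMAGENET_STD
    # Merge the time and channel axes: output channel index = frame * 3 + rgb.
    return normed.reshape(b, t * c, h, w)


obs = np.zeros((2, 6, 3, 64, 85), dtype=np.float32)
obs_img = to_channel_stacked(obs)                            # (2, 18, 64, 85)
goal_img = np.random.rand(2, 3, 64, 85).astype(np.float32)  # assumed noise model
```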
|
|
| ### NoMaD_GL_Official — 3× ONNX (diffusion, + `.onnx.data`) |
| **Paper:** *NoMaD: Goal Masked Diffusion Policies for Navigation and Exploration* (ICRA 2024) · [arXiv:2310.07896](https://arxiv.org/abs/2310.07896) · [project](https://general-navigation-models.github.io/nomad/) |
|
|
| Goal-free diffusion policy. Runs a 10-step DDPM loop (`squaredcos_cap_v2` beta schedule) over 3 components: |
| `nomad_vision_encoder.onnx` (`obs_img (B,12,96,96)` + `goal_img (B,3,96,96)` + `goal_mask (B)` → `cond (B,256)`), `nomad_noise_pred.onnx` (one denoising step), and `nomad_dist_pred.onnx`. Produces **8 trajectory samples** → `trajectory (B,8,8,2)` meters (decode: unnormalize → cumsum → ×0.267 m spacing). This is the only multi-modal model and the slowest (diffusion + multiple samples). |
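
The decode chain named above (unnormalize, cumulative sum, scale by spacing) can be sketched like this. The function name is hypothetical, and the `action_min` / `action_max` bounds used in the example are placeholders, not the statistics shipped with the model; the sketch also assumes the denoised actions are per-step deltas normalized to `[-1, 1]`.

```python
import numpy as np


def decode_actions(naction, action_min, action_max, spacing=0.267):
    """Normalized deltas in [-1, 1], shape (..., W, 2) -> cumulative xy in meters."""
    lo = np.asarray(action_min, dtype=np.float32)
    hi = np.asarray(action_max, dtype=np.float32)
    deltas = (naction + 1.0) / 2.0 * (hi - lo) + lo  # unnormalize to raw deltas
    waypoints = np.cumsum(deltas, axis=-2)           # accumulate along the waypoint axis
    return waypoints * spacing                       # scale by the metric spacing


# Placeholder bounds of [-1, 1] keep the arithmetic easy to follow.
naction = np.ones((1, 8, 8, 2), dtype=np.float32)
traj = decode_actions(naction, action_min=[-1.0, -1.0], action_max=[1.0, 1.0])
```

With these placeholder bounds every delta is 1, so the k-th waypoint of each sample sits at `k × 0.267` m along both axes.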
|
|
| ### CityWalker_PG_Official — `citywalker.onnx` |
| **Paper:** *CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos* (CVPR 2025) · [arXiv:2411.17820](https://arxiv.org/abs/2411.17820) · [project](https://ai4ce.github.io/CityWalker/) |
|
|
| Point-goal urban walker. **ONNX I/O:** `obs_images (B,5,3,350,630)` + `trajectory (B,6,2)` past waypoints → `wp_pred (B,5,2)`, `arrive_pred (B,1)` (arrival probability). Images are ImageNet-normalized internally; the model's internal `y=forward, x=right` frame is converted to standard frame by the wrapper. Input rate ≈ 5 Hz. |
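
The frame conversion follows directly from the two conventions: "forward" is the internal `y`, and "left" is the negative of the internal `x`. A minimal sketch, with hypothetical helper names:

```python
import numpy as np


def internal_to_standard(xy):
    """(x=right, y=forward) -> (x=forward, y=left)."""
    xy = np.asarray(xy, dtype=np.float32)
    return np.stack([xy[..., 1], -xy[..., 0]], axis=-1)


def standard_to_internal(xy):
    """(x=forward, y=left) -> (x=right, y=forward)."""
    xy = np.asarray(xy, dtype=np.float32)
    return np.stack([-xy[..., 1], xy[..., 0]], axis=-1)


# A point 1 m to the right and 2 m ahead in the internal frame.
std = internal_to_standard([1.0, 2.0])  # x=2 (forward), y=-1 (1 m to the right)
```

Both helpers operate on the last axis, so they apply unchanged to a `(B, 5, 2)` waypoint array.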
|
|
| ### MBRA_PG_Official — `mbra.onnx` |
| **Paper:** *Learning to Drive Anywhere with Model-Based Reannotation* (RA-L 2025) · [arXiv:2505.05592](https://arxiv.org/abs/2505.05592) · [project](https://model-base-reannotation.github.io/) |
|
|
| Point-goal policy. **ONNX I/O:** `obs_images (B,6,3,96,96)` ImageNet-normalized + `goal_pose (B,4)` = `[x, y, sin(yaw), cos(yaw)]` → `waypoints (B,8,4)`. Goal is given as `goal_xy` (meters) and converted internally; waypoints are un-normalized by a 0.8 m metric spacing. Input rate ≈ 5 Hz. |
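
The `goal_pose` layout and the waypoint scaling can be sketched as below. Both helpers are hypothetical. How the wrapper picks the goal yaw, and whether it rescales `x` and `y` before packing them, is not stated here, so the sketch takes yaw as an explicit argument and leaves the position untouched. It also assumes the first two waypoint columns are `x` and `y`.

```python
import math


def make_goal_pose(x, y, yaw=0.0):
    """Pack a goal as [x, y, sin(yaw), cos(yaw)]; any rescaling is left to the caller."""
    return [x, y, math.sin(yaw), math.cos(yaw)]


def waypoints_to_meters(waypoints, spacing=0.8):
    """Scale the first two columns (assumed x, y) of (W, 4) waypoints by the spacing."""
    return [(wx * spacing, wy * spacing) for wx, wy, *_ in waypoints]


goal_pose = make_goal_pose(5.0, 0.2)
meters = waypoints_to_meters([[1.0, 0.5, 0.0, 1.0]])
```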
|
|
| ### S2E — `s2e.onnx` |
| **Paper:** *From Seeing to Experiencing: Scaling Navigation Foundation Models with Reinforcement Learning* (ICLR 2026) · [arXiv:2507.22028](https://arxiv.org/abs/2507.22028) · [project](https://metadriverse.github.io/s2e) |
|
|
| UCLA-VAIL navigation foundation model; this is the behavior-cloning, point-goal, web-pretrained variant (`S2EBC-PG-Web100`). **ONNX I/O:** `obs_images (B,11,3,256,256)` in `[0,1]` (no ImageNet norm) + `goal (B,3)` = `[norm_dist, cos(θ), sin(θ)]` → `wp_pred (B,10,3)` `[x,y,yaw]`, `wp_pred_score (B,63)` mode scores. Frames are resized to 256×256 with OpenCV. |
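
The polar goal encoding can be sketched from `goal_xy` as follows. The function name and the `max_dist` normalization constant are hypothetical; the value the real wrapper uses to normalize distance is not given here, so treat the number below as a placeholder.

```python
import math


def encode_goal(goal_xy, max_dist=20.0):
    """goal_xy (standard frame, meters) -> [norm_dist, cos(theta), sin(theta)]."""
    x, y = goal_xy
    dist = math.hypot(x, y)
    theta = math.atan2(y, x)               # bearing, positive to the left
    norm_dist = min(dist / max_dist, 1.0)  # placeholder normalization, capped at 1
    return [norm_dist, math.cos(theta), math.sin(theta)]


goal = encode_goal((10.0, 0.0))
```

Splitting the goal into a bounded distance and a unit bearing vector keeps every input component within `[-1, 1]`, whatever the goal range.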
|
|
| ### MIMIC — `mimic.onnx` |
| **Paper:** *Learning Sidewalk Autopilot from Multi-Scale Imitation with Corrective Behavior Expansion* (ICRA 2026) · [arXiv:2603.22527](https://arxiv.org/abs/2603.22527) · [project](https://vail-ucla.github.io/MIMIC) |
|
|
| UCLA-VAIL goal-free long-context sidewalk policy. **ONNX I/O:** `input (1,16,3,288,512)` in `[0,1]` → `output (1,15,3)` `[x,y,yaw]` at non-uniform timestamps (0.2 s–5.0 s @ 5 Hz). Batch is processed one sample at a time; the wrapper keeps the first 13 waypoints (~4 s) and scales to meters. Requires `urbansim` (see Installation). |
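
Because the exported graph has a fixed batch of 1, a batched call has to loop over samples. The sketch below shows that pattern with a stand-in for the ONNX session; `run_per_sample`, `fake_model`, and the `scale` factor are hypothetical, and applying the scale to `x` and `y` only is an assumption.

```python
import numpy as np


def run_per_sample(run_single, obs, keep=13, scale=1.0):
    """Feed a batch one sample at a time to a batch-1 model; keep the first `keep` waypoints."""
    outputs = []
    for sample in obs:                      # sample: (T, 3, H, W)
        out = run_single(sample[None])      # batch-1 call -> (1, 15, 3)
        outputs.append(out[:, :keep])
    traj = np.concatenate(outputs, axis=0)  # (B, keep, 3)
    traj[..., :2] *= scale                  # scale x, y only; yaw is left as-is
    return traj


# Stand-in for the real ONNX session call (hypothetical).
def fake_model(x):
    assert x.shape[0] == 1
    return np.ones((1, 15, 3), dtype=np.float32)


obs = np.zeros((2, 16, 3, 8, 8), dtype=np.float32)  # tiny H, W just for the demo
traj = run_per_sample(fake_model, obs, scale=2.0)
```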
|
|
| ## Downloading |
|
|
| **Full repo** (includes the LFS-tracked ONNX weights): |
| ```bash |
| hf download UCLA-VAIL/Navigation-Model-Zoo-Public --local-dir ./Navigation-Model-Zoo-Public |
| ``` |
|
|
| **One model** — fetch just its folder, e.g. MBRA: |
| ```bash |
| hf download UCLA-VAIL/Navigation-Model-Zoo-Public \ |
| --include "MBRA_PG_Official/*" --local-dir . |
| ``` |
|
|
| Then, from the repo root, import the navigator: `from MBRA_PG_Official.inference import MBRAPGNavigator`. |
|
|
| > **External weights:** GNM, ViNT, and NoMaD ship `*.onnx.data` files — keep each `.onnx` and its |
| > `.onnx.data` together in the same folder so ONNX Runtime can resolve the weights. |
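
After a partial download it can help to confirm that the companions landed next to their graphs. A small sketch, with a hypothetical helper name, that reports for each `.onnx` file in a folder whether a matching `.onnx.data` file sits beside it:

```python
from pathlib import Path


def external_data_status(model_dir):
    """Map each .onnx file in model_dir to whether its .onnx.data companion is present."""
    status = {}
    for onnx_path in sorted(Path(model_dir).glob("*.onnx")):
        companion = onnx_path.with_name(onnx_path.name + ".data")
        status[onnx_path.name] = companion.is_file()
    return status
```

For the GNM, ViNT, and NoMaD folders every entry should be `True`. Models that ship a single self-contained `.onnx` file will report `False`, which is expected.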
| |
| ## Intended use & limitations |
| |
| These are **research artifacts** for navigation research, reproduction, and benchmarking — not |
| safety-validated for deployment on real robots without additional testing. Each policy's behavior |
| is bounded by its training distribution (camera intrinsics, embodiment, frame rate, environment). |
| Several wrappers rectify/resize inputs to a specific training camera; mismatched cameras may degrade |
| performance. |
| |
| ## License |
| |
| Released under **Apache 2.0**. Individual models carry the licenses and terms of their original |
| sources (ViNT, GNM, NoMaD, CityWalker, MBRA) — check upstream before commercial use. |
| |
| ## Citation |
| |
| If you use a model from this zoo, please cite its original paper. |
| |
| **GNM** |
| ```bibtex |
| @inproceedings{shah2023gnm, |
| title={Gnm: A general navigation model to drive any robot}, |
| author={Shah, Dhruv and Sridhar, Ajay and Bhorkar, Arjun and Hirose, Noriaki and Levine, Sergey}, |
| booktitle={2023 IEEE International Conference on Robotics and Automation (ICRA)}, |
| pages={7226--7233}, |
| year={2023}, |
| organization={IEEE} |
| } |
| ``` |
| |
| **ViNT** |
| ```bibtex |
| @article{shah2023vint, |
| title={ViNT: A foundation model for visual navigation}, |
| author={Shah, Dhruv and Sridhar, Ajay and Dashora, Nitish and Stachowicz, Kyle and Black, Kevin and Hirose, Noriaki and Levine, Sergey}, |
| journal={arXiv preprint arXiv:2306.14846}, |
| year={2023} |
| } |
| ``` |
| |
| **NoMaD** |
| ```bibtex |
| @inproceedings{sridhar2024nomad, |
| title={Nomad: Goal masked diffusion policies for navigation and exploration}, |
| author={Sridhar, Ajay and Shah, Dhruv and Glossop, Catherine and Levine, Sergey}, |
| booktitle={2024 IEEE International Conference on Robotics and Automation (ICRA)}, |
| pages={63--70}, |
| year={2024}, |
| organization={IEEE} |
| } |
| ``` |
| |
| **CityWalker** |
| ```bibtex |
| @inproceedings{liu2025citywalker, |
| title={Citywalker: Learning embodied urban navigation from web-scale videos}, |
| author={Liu, Xinhao and Li, Jintong and Jiang, Yicheng and Sujay, Niranjan and Yang, Zhicheng and Zhang, Juexiao and Abanes, John and Zhang, Jing and Feng, Chen}, |
| booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference}, |
| pages={6875--6885}, |
| year={2025} |
| } |
| ``` |
| |
| **MBRA** |
| ```bibtex |
| @article{hirose2025learning, |
| title={Learning to drive anywhere with model-based reannotation}, |
| author={Hirose, Noriaki and Ignatova, Lydia and Stachowicz, Kyle and Glossop, Catherine and Levine, Sergey and Shah, Dhruv}, |
| journal={IEEE Robotics and Automation Letters}, |
| volume={11}, |
| number={2}, |
| pages={1242--1249}, |
| year={2025}, |
| publisher={IEEE} |
| } |
| ``` |
| |
| **S2E** |
| ```bibtex |
| @article{he2025seeing, |
| title={From seeing to experiencing: Scaling navigation foundation models with reinforcement learning}, |
| author={He, Honglin and Ma, Yukai and Squicciarini, Brad and Wu, Wayne and Zhou, Bolei}, |
| journal={arXiv preprint arXiv:2507.22028}, |
| year={2025} |
| } |
| ``` |
| |
| **MIMIC** |
| ```bibtex |
| @article{he2026learning, |
| title={Learning Sidewalk Autopilot from Multi-Scale Imitation with Corrective Behavior Expansion}, |
| author={He, Honglin and Ma, Yukai and Squicciarini, Brad and Wu, Wayne and Zhou, Bolei}, |
| journal={arXiv preprint arXiv:2603.22527}, |
| year={2026} |
| } |
| ``` |
| |
| ## Contact |
| |
| Maintained by [UCLA-VAIL](https://vail-ucla.github.io/). Open an issue/discussion on the |
| repository page for questions or contributions. |
| |