license: apache-2.0
tags:
  - robotics
  - navigation
  - visual-navigation
  - embodied-ai
  - onnx
pipeline_tag: robotics

Navigation Model Zoo

A collection of vision-based navigation policies exported to ONNX, each wrapped in a small, uniform Python inference API. Maintained by Honglin He @ UCLA-VAIL.

Every model takes a short history of RGB frames and predicts a local trajectory (and optionally a distance-to-goal / arrival signal); a built-in PD controller turns the trajectory into (v, ω) velocity commands. All models share the same wrapper interface so they can be swapped and benchmarked without per-model glue code.
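
To make the trajectory-to-command step concrete, here is a minimal sketch of a PD-style controller that steers toward a single waypoint in the standard frame (x = forward, y = left). It is not the controller shipped in the wrappers: the class name `PDWaypointController`, the gains, and the time step `dt` are all illustrative assumptions.

```python
import math


class PDWaypointController:
    """Toy PD controller (hypothetical, not the zoo's implementation)."""

    def __init__(self, kp_v=1.0, kp_w=2.0, kd_w=0.1, max_v=1.0, max_w=1.0, dt=0.2):
        self.kp_v, self.kp_w, self.kd_w = kp_v, kp_w, kd_w
        self.max_v, self.max_w, self.dt = max_v, max_w, dt
        self.prev_heading_err = 0.0

    def reset(self):
        # Forget the stored error so a new episode starts without a derivative kick.
        self.prev_heading_err = 0.0

    def step(self, wx, wy):
        """Map a waypoint (wx, wy) in meters to a clamped (v, w) command."""
        heading_err = math.atan2(wy, wx)           # angle to the waypoint
        dist = math.hypot(wx, wy)                  # distance to the waypoint
        d_err = (heading_err - self.prev_heading_err) / self.dt
        self.prev_heading_err = heading_err
        # Slow down when the waypoint is far off-axis; never drive backwards.
        v = self.kp_v * dist * max(0.0, math.cos(heading_err))
        w = self.kp_w * heading_err + self.kd_w * d_err
        v = min(max(v, 0.0), self.max_v)
        w = min(max(w, -self.max_w), self.max_w)
        return v, w
```

A waypoint straight ahead yields pure forward motion, while a waypoint directly to the left saturates the angular command and drives the linear command toward zero.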

Models

| Folder | Model / paper | Goal mode | Context | Input H×W | Waypoints | Weights |
| --- | --- | --- | --- | --- | --- | --- |
| `GNM_GL_Official` | GNM · ICRA 2023 | goal-free | 6 | 64×85 | 5 | `gnm_imagegoal.onnx` (+`.data`) · 35 MB |
| `Vint_GL_Official` | ViNT · CoRL 2023 | goal-free | 6 | 64×85 | 5 | `vint_imagegoal.onnx` (+`.data`) · 97 MB |
| `NoMaD_GL_Official` | NoMaD · ICRA 2024 | goal-free (diffusion) | 4 | 96×96 | 8 × 8 samples | 3× `.onnx` (+`.data`) · 111 MB |
| `CityWalker_PG_Official` | CityWalker · CVPR 2025 | point-goal | 5 | 350×630 | 5 | `citywalker.onnx` · 806 MB |
| `MBRA_PG_Official` | MBRA · RA-L 2025 | point-goal | 6 | 96×96 | 8 | `mbra.onnx` · 254 MB |
| `S2E` | S2E · ICLR 2026 | point-goal / goal-free | 11 | 256×256 | 10 | `s2e.onnx` · 382 MB |
| `MIMIC` | MIMIC · ICRA 2026 | goal-free | 16 | 288×512 | 13 | `mimic.onnx` · 318 MB |

Suffix legend: PG = point-goal, GL = goal-less (goal-free). Models with a .onnx.data companion (GNM, ViNT, NoMaD) use ONNX external weights — keep each .onnx and its .onnx.data together.

Common interface

Each folder is a self-contained module exposing one navigator class. They all follow the same contract:

import numpy as np
from MBRA_PG_Official.inference import MBRAPGNavigator   # run from the repo root

nav = MBRAPGNavigator(device="cuda")          # use device="cpu" if you have no GPU

# obs: (B, nav.context_size, 3, H, W) float32 in [0, 1]
#      the wrapper resizes & normalizes to the model's spec internally
obs = np.random.rand(1, nav.context_size, 3, 96, 96).astype(np.float32)

# Point-goal models take goal_xy (standard frame: x=forward, y=left, meters);
# goal-free models omit it.
traj, scores = nav.inference_trajectory(obs, goal_xy=np.array([5.0, 0.2]))  # (B, M, W, 2) meters
vw, best     = nav.inference_vw(obs,        goal_xy=np.array([5.0, 0.2]))   # vw: (B, 2) = [v, ω]

nav.reset()   # clears PD-controller velocity smoothing between episodes

Conventions shared by every model:

Coordinate frame — all user-facing inputs/outputs are standard frame: x = forward, y = left, in meters. Models with a different internal convention (e.g. CityWalker) convert transparently.
Observations — (B, context_size, 3, H, W), float32, pixel values in [0, 1]. The wrapper handles resize and any ImageNet normalization. (Exception: MIMIC expects frames already at 288×512 and does not resize.)
inference_trajectory(obs[, goal_xy]) → (trajectory, scores). trajectory is (B, M, W, 2) in meters, where M is the number of modes (1 for unimodal, 8 for NoMaD) and W the waypoint count; scores is (B, M).
inference_vw(obs[, goal_xy]) → (vw, best_traj) where vw is a (B, 2) torch tensor of [linear_v, angular_w]. Tune limits with max_v / max_w at construction.
Goal-free models (ViNT, GNM, NoMaD, MIMIC) ignore goal_xy — call inference_trajectory(obs).
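
The shape conventions above can be sketched with plain NumPy. The two helpers below are hypothetical (they are not part of the wrappers): `pack_frames` turns a list of uint8 RGB images into the expected observation layout, and `pick_best_mode` selects one trajectory per batch element, assuming that a higher score means a better mode.

```python
import numpy as np


def pack_frames(frames):
    """List of T uint8 (H, W, 3) RGB images -> (1, T, 3, H, W) float32 in [0, 1]."""
    stack = np.stack(frames).astype(np.float32) / 255.0      # (T, H, W, 3)
    chw = stack.transpose(0, 3, 1, 2)                        # (T, 3, H, W)
    return np.ascontiguousarray(chw[None])                   # add batch axis


def pick_best_mode(trajectory, scores):
    """trajectory (B, M, W, 2), scores (B, M) -> (B, W, 2).

    Assumption: the highest score marks the preferred mode.
    """
    best = np.argmax(scores, axis=1)                         # (B,)
    return trajectory[np.arange(trajectory.shape[0]), best]
```

For a unimodal model (M = 1) `pick_best_mode` simply drops the mode axis.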

Installation

pip install onnxruntime-gpu numpy torch torchvision pyyaml pillow
# CPU-only: use onnxruntime instead of onnxruntime-gpu
pip install opencv-python   # required by S2E (frame resizing)

Optional, lab-internal dependency: ViNT, GNM, and NoMaD expose an extra inference_vw_pp() method that uses urbansim.custom.pp.PurePursuitController; it is imported lazily and only needed for that method. MIMIC imports urbansim at module load, so its inference.py cannot be imported unless the urbansim package is on your Python path.
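
If you are unsure whether the optional dependency is installed, one way to check before importing a wrapper is the standard-library `importlib.util.find_spec`. The helper name `has_module` is an illustrative assumption; the check itself does not import the package.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module `name` can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


if not has_module("urbansim"):
    # MIMIC's inference.py and the inference_vw_pp() methods need this package.
    print("urbansim not found: skip MIMIC and avoid inference_vw_pp()")
```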

Model details

GNM_GL_Official — `gnm_imagegoal.onnx` (+ `.onnx.data`)

Paper: GNM: A General Navigation Model to Drive Any Robot (ICRA 2023) · arXiv:2210.03370 · code

Goal-free General Navigation Model — same NavDP image-goal I/O contract as ViNT (obs_img (B,18,64,85) + goal_img (B,3,64,85) → dist_pred (B,1), action_pred (B,5,4)), with a lower top speed. Expects input downsampled to ≈ 3 Hz.

Vint_GL_Official — `vint_imagegoal.onnx` (+ `.onnx.data`)

Paper: ViNT: A Foundation Model for Visual Navigation (CoRL 2023) · arXiv:2306.14846 · project

Goal-free ViNT (NavDP image-goal backbone run with a random goal image). ONNX I/O: obs_img (B,18,64,85) (6 ImageNet-normalized frames × 3 ch) + goal_img (B,3,64,85) (random noise) → dist_pred (B,1), action_pred (B,5,4). Cumulative xy is already baked in; the wrapper scales by the 0.8 m metric spacing. Expects input downsampled to ≈ 3 Hz.
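
The I/O contract above can be illustrated with a rough NumPy sketch. It assumes the frames are already resized to 64×85, that frames are stacked along the channel axis in the order given, that the random goal is Gaussian noise, and that the first two channels of action_pred hold the cumulative xy. None of these details are confirmed by the wrapper source, and the function names are hypothetical.

```python
import numpy as np

# Standard ImageNet statistics, broadcast over (B, T, 3, H, W).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32).reshape(1, 1, 3, 1, 1)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32).reshape(1, 1, 3, 1, 1)


def pack_vint_inputs(obs, rng=None):
    """obs (B, 6, 3, 64, 85) in [0, 1] -> obs_img (B, 18, 64, 85), goal_img (B, 3, 64, 85)."""
    b, t, c, h, w = obs.shape
    normed = (obs.astype(np.float32) - IMAGENET_MEAN) / IMAGENET_STD
    obs_img = normed.reshape(b, t * c, h, w)                 # fold frames into channels
    if rng is None:
        rng = np.random.default_rng()
    goal_img = rng.standard_normal((b, c, h, w)).astype(np.float32)  # assumed noise goal
    return obs_img, goal_img


def scale_actions(action_pred, spacing=0.8):
    """action_pred (B, 5, 4) -> (B, 5, 2) waypoints in meters (assumes xy comes first)."""
    return action_pred[..., :2] * spacing
```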

NoMaD_GL_Official — 3× ONNX (diffusion, + `.onnx.data`)

Paper: NoMaD: Goal Masked Diffusion Policies for Navigation and Exploration (ICRA 2024) · arXiv:2310.07896 · project

Goal-free diffusion policy. Runs a 10-step DDPM loop (squaredcos_cap_v2) over 3 components: nomad_vision_encoder.onnx (obs_img (B,12,96,96) + goal_img (B,3,96,96) + goal_mask (B) → cond (B,256)), nomad_noise_pred.onnx (one denoising step), and nomad_dist_pred.onnx. Produces 8 trajectory samples → trajectory (B,8,8,2) meters (decode: unnormalize → cumsum → ×0.267 m spacing). This is the only multi-modal model and the slowest (diffusion + multiple samples).
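
The decode chain (unnormalize → cumsum → ×0.267 m) might look roughly like the sketch below. It assumes the denoised output holds per-step deltas normalized to [-1, 1] with min/max statistics; `delta_min` and `delta_max` are hypothetical placeholders, since the actual statistics are not given here.

```python
import numpy as np


def decode_deltas(normalized, delta_min, delta_max, spacing=0.267):
    """normalized (B, M, W, 2) in [-1, 1] -> cumulative xy waypoints (B, M, W, 2) in meters."""
    # Unnormalize from [-1, 1] back to the [delta_min, delta_max] range.
    deltas = (normalized + 1.0) / 2.0 * (delta_max - delta_min) + delta_min
    # Accumulate per-step deltas along the waypoint axis.
    waypoints = np.cumsum(deltas, axis=2)
    # Convert from normalized units to meters.
    return waypoints * spacing
```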

CityWalker_PG_Official — `citywalker.onnx`

Paper: CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos (CVPR 2025) · arXiv:2411.17820 · project

Point-goal urban walker. ONNX I/O: obs_images (B,5,3,350,630) + trajectory (B,6,2) past waypoints → wp_pred (B,5,2), arrive_pred (B,1) (arrival probability). Images are ImageNet-normalized internally; the model's internal y=forward, x=right frame is converted to standard frame by the wrapper. Input rate ≈ 5 Hz.
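
The frame conversion is a 90° axis swap. A minimal sketch, with hypothetical function names, of mapping between the internal frame (x = right, y = forward) and the standard frame (x = forward, y = left):

```python
import numpy as np


def internal_to_standard(wp):
    """(..., 2) in [x=right, y=forward] -> (..., 2) in [x=forward, y=left]."""
    wp = np.asarray(wp, dtype=np.float32)
    # forward stays forward; left is the negative of right.
    return np.stack([wp[..., 1], -wp[..., 0]], axis=-1)


def standard_to_internal(wp):
    """Inverse mapping: (..., 2) in [x=forward, y=left] -> [x=right, y=forward]."""
    wp = np.asarray(wp, dtype=np.float32)
    return np.stack([-wp[..., 1], wp[..., 0]], axis=-1)
```

For example, a point 1 m to the right and 2 m ahead in the internal frame becomes (2, -1) in the standard frame.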

MBRA_PG_Official — `mbra.onnx`

Paper: Learning to Drive Anywhere with Model-Based Reannotation (RA-L 2025) · arXiv:2505.05592 · project

Point-goal policy. ONNX I/O: obs_images (B,6,3,96,96) ImageNet-normalized + goal_pose (B,4) = [x, y, sin(yaw), cos(yaw)] → waypoints (B,8,4). Goal is given as goal_xy (meters) and converted internally; waypoints are un-normalized by a 0.8 m metric spacing. Input rate ≈ 5 Hz.

S2E — `s2e.onnx`

Paper: From Seeing to Experiencing: Scaling Navigation Foundation Models with Reinforcement Learning (ICLR 2026) · arXiv:2507.22028 · project

UCLA-VAIL navigation foundation model; this is the behavior-cloning, point-goal, web-pretrained variant (S2EBC-PG-Web100). ONNX I/O: obs_images (B,11,3,256,256) in [0,1] (no ImageNet norm) + goal (B,3) = [norm_dist, cos(θ), sin(θ)] → wp_pred (B,10,3) [x,y,yaw], wp_pred_score (B,63) mode scores. Frames are resized to 256×256 with OpenCV.
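
As an illustration of the goal layout [norm_dist, cos(θ), sin(θ)], the sketch below encodes a standard-frame goal_xy. The normalization is an assumption: `max_dist` is a hypothetical constant and the clipping to [0, 1] is a guess, so treat this as a shape-level example rather than the wrapper's actual conversion.

```python
import numpy as np


def encode_goal(goal_xy, max_dist=20.0):
    """goal_xy (2,) or (B, 2) in meters -> (B, 3) = [norm_dist, cos(theta), sin(theta)]."""
    goal_xy = np.atleast_2d(np.asarray(goal_xy, dtype=np.float32))
    dist = np.linalg.norm(goal_xy, axis=-1)
    theta = np.arctan2(goal_xy[:, 1], goal_xy[:, 0])         # bearing to the goal
    norm_dist = np.clip(dist / max_dist, 0.0, 1.0)           # hypothetical normalization
    return np.stack([norm_dist, np.cos(theta), np.sin(theta)], axis=-1).astype(np.float32)
```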

MIMIC — `mimic.onnx`

Paper: Learning Sidewalk Autopilot from Multi-Scale Imitation with Corrective Behavior Expansion (ICRA 2026) · arXiv:2603.22527 · project

UCLA-VAIL goal-free long-context sidewalk policy. ONNX I/O: input (1,16,3,288,512) in [0,1] → output (1,15,3) [x,y,yaw] at non-uniform timestamps (0.2 s–5.0 s @ 5 Hz). Batch is processed one sample at a time; the wrapper keeps the first 13 waypoints (~4 s) and scales to meters. Requires urbansim (see Installation).
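
The per-sample batching and waypoint truncation can be sketched as follows. `run_one` stands in for a single ONNX call on a batch of one, and `fake_model` is a placeholder used only to make the example runnable; neither name comes from the wrapper, and the scaling to meters is left out.

```python
import numpy as np


def run_per_sample(run_one, obs, keep=13):
    """obs (B, 16, 3, H, W); run_one maps (1, ...) -> (1, 15, 3). Returns (B, keep, 3)."""
    outputs = [run_one(obs[i:i + 1]) for i in range(obs.shape[0])]   # one sample per call
    return np.concatenate(outputs, axis=0)[:, :keep]                 # drop the tail waypoints


def fake_model(x):
    """Placeholder for the ONNX session: returns a fixed (1, 15, 3) array."""
    return np.arange(45, dtype=np.float32).reshape(1, 15, 3)
```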

Downloading

Full repo (includes the LFS-tracked ONNX weights):

hf download UCLA-VAIL/Navigation-Model-Zoo-Public --local-dir ./Navigation-Model-Zoo-Public

One model — fetch just its folder, e.g. MBRA:

hf download UCLA-VAIL/Navigation-Model-Zoo-Public \
  --include "MBRA_PG_Official/*" --local-dir .

Then run from the repo root: from MBRA_PG_Official.inference import MBRAPGNavigator.

External weights: GNM, ViNT, and NoMaD ship *.onnx.data files — keep each .onnx and its .onnx.data together in the same folder so ONNX Runtime can resolve the weights.
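
A quick sanity check after downloading one of these three folders is to confirm that each .onnx file has its companion next to it. The helper below is a hypothetical sketch using only `pathlib`; it assumes the companion is named `<file>.onnx.data`, as described above.

```python
from pathlib import Path


def missing_external_data(folder):
    """Return names of .onnx files in `folder` with no sibling <name>.onnx.data file."""
    folder = Path(folder)
    return sorted(
        p.name
        for p in folder.glob("*.onnx")
        if not p.with_name(p.name + ".data").exists()
    )
```

Calling `missing_external_data("GNM_GL_Official")` from the repo root should return an empty list when the download is complete. Folders whose models embed their weights (for example CityWalker or MBRA) will legitimately list their .onnx files here.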

Intended use & limitations

These are research artifacts for navigation research, reproduction, and benchmarking — not safety-validated for deployment on real robots without additional testing. Each policy's behavior is bounded by its training distribution (camera intrinsics, embodiment, frame rate, environment). Several wrappers rectify/resize inputs to a specific training camera; mismatched cameras may degrade performance.

License

Released under Apache 2.0. Individual models carry the licenses and terms of their original sources (ViNT, GNM, NoMaD, CityWalker, MBRA) — check upstream before commercial use.

Citation

If you use a model from this zoo, please cite its original paper.

GNM

@inproceedings{shah2023gnm,
  title={{GNM}: A general navigation model to drive any robot},
  author={Shah, Dhruv and Sridhar, Ajay and Bhorkar, Arjun and Hirose, Noriaki and Levine, Sergey},
  booktitle={2023 IEEE International Conference on Robotics and Automation (ICRA)},
  pages={7226--7233},
  year={2023},
  organization={IEEE}
}

ViNT

@article{shah2023vint,
  title={{ViNT}: A foundation model for visual navigation},
  author={Shah, Dhruv and Sridhar, Ajay and Dashora, Nitish and Stachowicz, Kyle and Black, Kevin and Hirose, Noriaki and Levine, Sergey},
  journal={arXiv preprint arXiv:2306.14846},
  year={2023}
}

NoMaD

@inproceedings{sridhar2024nomad,
  title={{NoMaD}: Goal masked diffusion policies for navigation and exploration},
  author={Sridhar, Ajay and Shah, Dhruv and Glossop, Catherine and Levine, Sergey},
  booktitle={2024 IEEE International Conference on Robotics and Automation (ICRA)},
  pages={63--70},
  year={2024},
  organization={IEEE}
}

CityWalker

@inproceedings{liu2025citywalker,
  title={{CityWalker}: Learning embodied urban navigation from web-scale videos},
  author={Liu, Xinhao and Li, Jintong and Jiang, Yicheng and Sujay, Niranjan and Yang, Zhicheng and Zhang, Juexiao and Abanes, John and Zhang, Jing and Feng, Chen},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={6875--6885},
  year={2025}
}

MBRA

@article{hirose2025learning,
  title={Learning to drive anywhere with model-based reannotation},
  author={Hirose, Noriaki and Ignatova, Lydia and Stachowicz, Kyle and Glossop, Catherine and Levine, Sergey and Shah, Dhruv},
  journal={IEEE Robotics and Automation Letters},
  volume={11},
  number={2},
  pages={1242--1249},
  year={2025},
  publisher={IEEE}
}

S2E

@article{he2025seeing,
  title={From seeing to experiencing: Scaling navigation foundation models with reinforcement learning},
  author={He, Honglin and Ma, Yukai and Squicciarini, Brad and Wu, Wayne and Zhou, Bolei},
  journal={arXiv preprint arXiv:2507.22028},
  year={2025}
}

MIMIC

@article{he2026learning,
  title={Learning Sidewalk Autopilot from Multi-Scale Imitation with Corrective Behavior Expansion},
  author={He, Honglin and Ma, Yukai and Squicciarini, Brad and Wu, Wayne and Zhou, Bolei},
  journal={arXiv preprint arXiv:2603.22527},
  year={2026}
}

Contact

Maintained by UCLA-VAIL. Open an issue/discussion on the repository page for questions or contributions.