	---
	license: apache-2.0
	tags:
	- robotics
	- navigation
	- visual-navigation
	- embodied-ai
	- onnx
	pipeline_tag: robotics
	---

	# Navigation Model Zoo

	A collection of vision-based navigation policies exported to ONNX, each wrapped in a small,
	uniform Python inference API. Maintained by Honglin He @ UCLA-VAIL.

	Every model takes a short history of RGB frames and predicts a local trajectory (and optionally a
	distance-to-goal / arrival signal); a built-in PD controller turns the trajectory into `(v, ω)`
	velocity commands. All models share the same wrapper interface so they can be swapped and
	benchmarked without per-model glue code.
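
As a rough illustration of how a predicted trajectory might be turned into a `(v, ω)` command, the sketch below steers toward one lookahead waypoint with a PD term on the heading error. The class name `SimplePDController`, the gains, the lookahead index, and the speed rule are all assumptions for illustration; the controller built into the wrappers may work differently.

```python
import numpy as np


class SimplePDController:
    """Illustrative heading PD controller; NOT the wrappers' built-in one."""

    def __init__(self, kp=1.5, kd=0.2, max_v=1.0, max_w=1.0, dt=0.2):
        # All gains and limits here are assumed values for illustration.
        self.kp, self.kd = kp, kd
        self.max_v, self.max_w = max_v, max_w
        self.dt = dt
        self.prev_err = 0.0

    def reset(self):
        """Clear the derivative state between episodes."""
        self.prev_err = 0.0

    def step(self, traj, lookahead=2):
        """traj: (W, 2) waypoints, x forward / y left, meters -> (v, w)."""
        x, y = traj[min(lookahead, len(traj) - 1)]
        err = float(np.arctan2(y, x))  # heading error to the lookahead waypoint
        d_err = (err - self.prev_err) / self.dt
        self.prev_err = err
        w = float(np.clip(self.kp * err + self.kd * d_err, -self.max_w, self.max_w))
        # Slow down as the heading error grows; never command reverse motion.
        v = float(np.clip(self.max_v * np.cos(err), 0.0, self.max_v))
        return v, w
```

A straight-ahead trajectory yields zero angular velocity and full speed, while a trajectory curving left yields a positive `ω` (counter-clockwise in the `x = forward`, `y = left` frame).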

	## Models

	| Folder | Model / paper | Goal mode | Context | Input H×W | Waypoints | Weights |
	|--------|---------------|-----------|:-------:|:---------:|:---------:|---------|
	| [`GNM_GL_Official`](GNM_GL_Official) | [GNM](https://arxiv.org/abs/2210.03370) · ICRA 2023 | goal-free | 6 | 64×85 | 5 | `gnm_imagegoal.onnx` (+`.data`) · 35 MB |
	| [`Vint_GL_Official`](Vint_GL_Official) | [ViNT](https://arxiv.org/abs/2306.14846) · CoRL 2023 | goal-free | 6 | 64×85 | 5 | `vint_imagegoal.onnx` (+`.data`) · 97 MB |
	| [`NoMaD_GL_Official`](NoMaD_GL_Official) | [NoMaD](https://arxiv.org/abs/2310.07896) · ICRA 2024 | goal-free (diffusion) | 4 | 96×96 | 8 × 8 samples | 3× `.onnx` (+`.data`) · 111 MB |
	| [`CityWalker_PG_Official`](CityWalker_PG_Official) | [CityWalker](https://arxiv.org/abs/2411.17820) · CVPR 2025 | point-goal | 5 | 350×630 | 5 | `citywalker.onnx` · 806 MB |
	| [`MBRA_PG_Official`](MBRA_PG_Official) | [MBRA](https://arxiv.org/abs/2505.05592) · RA-L 2025 | point-goal | 6 | 96×96 | 8 | `mbra.onnx` · 254 MB |
	| [`S2E`](S2E) | [S2E](https://arxiv.org/abs/2507.22028) · ICLR 2026 | point-goal / goal-free | 11 | 256×256 | 10 | `s2e.onnx` · 382 MB |
	| [`MIMIC`](MIMIC) | [MIMIC](https://arxiv.org/abs/2603.22527) · ICRA 2026 | goal-free | 16 | 288×512 | 13 | `mimic.onnx` · 318 MB |

	Suffix legend: `PG` = point-goal, `GL` = goal-less (goal-free). Models with a `.onnx.data` companion
	(GNM, ViNT, NoMaD) use ONNX external weights — keep each `.onnx` and its `.onnx.data` together.
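
After downloading or moving files, a quick check like the one below can confirm that no `.onnx` file has been separated from its `.onnx.data` companion. This is a minimal sketch: the helper name `onnx_missing_data` is hypothetical, and it assumes every `.onnx` in the folders you pass it is expected to have a companion file, so only point it at the external-weight folders.

```python
from pathlib import Path


def onnx_missing_data(folder):
    """List .onnx files in `folder` that have no sibling `<name>.onnx.data`."""
    folder = Path(folder)
    return [
        p.name
        for p in sorted(folder.glob("*.onnx"))
        if not p.with_name(p.name + ".data").exists()
    ]


if __name__ == "__main__":
    # Folder names taken from the table above; run from the repo root.
    for name in ("GNM_GL_Official", "Vint_GL_Official", "NoMaD_GL_Official"):
        if Path(name).is_dir():
            print(name, "missing companions:", onnx_missing_data(name))
```

An empty list means every `.onnx` file in that folder sits next to its weights.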

	## Common interface

	Each folder is a self-contained module exposing one navigator class. They all follow the same contract:

	```python
	import numpy as np
	from MBRA_PG_Official.inference import MBRAPGNavigator # run from the repo root

	nav = MBRAPGNavigator(device="cuda") # use device="cpu" if you have no GPU

	# obs: (B, nav.context_size, 3, H, W) float32 in [0, 1]
	# the wrapper resizes & normalizes to the model's spec internally
	obs = np.random.rand(1, nav.context_size, 3, 96, 96).astype(np.float32)

	# Point-goal models take goal_xy (standard frame: x=forward, y=left, meters);
	# goal-free models omit it.
	traj, scores = nav.inference_trajectory(obs, goal_xy=np.array([5.0, 0.2])) # (B, M, W, 2) meters
	vw, best = nav.inference_vw(obs, goal_xy=np.array([5.0, 0.2])) # vw: (B, 2) = [v, ω]

	nav.reset() # clears PD-controller velocity smoothing between episodes
	```

	Conventions shared by every model:

	- Coordinate frame — all user-facing inputs/outputs are standard frame: `x = forward`, `y = left`, in meters. Models with a different internal convention (e.g. CityWalker) convert transparently.
	- Observations — `(B, context_size, 3, H, W)`, `float32`, pixel values in `[0, 1]`. The wrapper handles resize and any ImageNet normalization. (Exception: `MIMIC` expects frames already at 288×512 and does not resize.)
	- `inference_trajectory(obs[, goal_xy])` → `(trajectory, scores)`. `trajectory` is `(B, M, W, 2)` in meters, where `M` is the number of modes (1 for unimodal, 8 for NoMaD) and `W` the waypoint count; `scores` is `(B, M)`.
	- `inference_vw(obs[, goal_xy])` → `(vw, best_traj)` where `vw` is a `(B, 2)` torch tensor of `[linear_v, angular_w]`. Tune limits with `max_v` / `max_w` at construction.
	- Goal-free models (`Vint`, `GNM`, `NoMaD`, `MIMIC`) ignore `goal_xy` — call `inference_trajectory(obs)`.
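
On a robot, frames usually arrive one at a time as `H×W×3` `uint8` images, so the observation tensor has to be assembled from a rolling history. The sketch below shows one way to do that. The `FrameBuffer` class is hypothetical, and the oldest-first frame order and the first-frame padding are assumptions; check the wrapper you use for its expected ordering.

```python
from collections import deque

import numpy as np


class FrameBuffer:
    """Keeps the last `context_size` frames and stacks them into a model input."""

    def __init__(self, context_size):
        self.frames = deque(maxlen=context_size)

    def push(self, frame_hwc_uint8):
        # HWC uint8 -> CHW float32 in [0, 1]
        chw = np.transpose(frame_hwc_uint8, (2, 0, 1)).astype(np.float32) / 255.0
        if not self.frames:
            # Assumption: pad the history with the first frame so it is always full.
            self.frames.extend([chw] * self.frames.maxlen)
        else:
            self.frames.append(chw)  # the deque drops the oldest frame automatically

    def as_obs(self):
        # (1, context_size, 3, H, W), oldest frame first (assumed ordering)
        return np.stack(self.frames, axis=0)[None]
```

With a navigator `nav` constructed as in the example above, `FrameBuffer(nav.context_size)` would produce arrays that can be passed as `obs` to `inference_trajectory` or `inference_vw`.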

	## Installation

	```bash
	pip install onnxruntime-gpu numpy torch torchvision pyyaml pillow
	# CPU-only: use onnxruntime instead of onnxruntime-gpu
	pip install opencv-python # required by S2E (frame resizing)
	```

	Optional, lab-internal dependency: `Vint`, `GNM`, and `NoMaD` expose an extra `inference_vw_pp()`
	method that uses `urbansim.custom.pp.PurePursuitController`; it is imported lazily and only needed
	for that method. `MIMIC` imports `urbansim` at module load, so its `inference.py` will not import
	without the `urbansim` package on your path.
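
If you are unsure whether the optional dependency is present, you can probe for it before importing `MIMIC` or calling `inference_vw_pp()`. This is a small sketch using only the standard library; the helper name `has_module` is hypothetical.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if top-level module `name` is importable, without importing it."""
    return importlib.util.find_spec(name) is not None


if not has_module("urbansim"):
    print("urbansim not found: MIMIC and inference_vw_pp() will be unavailable")
```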

	## Model details

	### GNM_GL_Official — `gnm_imagegoal.onnx` (+ `.onnx.data`)
	Paper: GNM: A General Navigation Model to Drive Any Robot (ICRA 2023) · [arXiv:2210.03370](https://arxiv.org/abs/2210.03370) · [code](https://github.com/robodhruv/drive-any-robot)

	Goal-free General Navigation Model. It shares the NavDP image-goal I/O contract with ViNT (`obs_img (B,18,64,85)` + `goal_img (B,3,64,85)` → `dist_pred (B,1)`, `action_pred (B,5,4)`) but runs with a lower top speed. Expects input downsampled to ≈ 3 Hz.

	### Vint_GL_Official — `vint_imagegoal.onnx` (+ `.onnx.data`)
	Paper: ViNT: A Foundation Model for Visual Navigation (CoRL 2023) · [arXiv:2306.14846](https://arxiv.org/abs/2306.14846) · [project](https://general-navigation-models.github.io/vint/)

	Goal-free ViNT: the NavDP image-goal backbone run with a random goal image. ONNX I/O: `obs_img (B,18,64,85)` (6 ImageNet-normalized frames × 3 ch) + `goal_img (B,3,64,85)` (random noise) → `dist_pred (B,1)`, `action_pred (B,5,4)`. The predicted `xy` waypoints are already cumulative (no `cumsum` needed); the wrapper scales them by the 0.8 m metric spacing. Expects input downsampled to ≈ 3 Hz.
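
The decoding step for the GNM/ViNT output can be sketched as follows. It assumes that channels 0 and 1 of `action_pred` hold the cumulative `x, y` values (the remaining two channels are ignored here); the function name `decode_action_pred` is hypothetical and the wrapper's own code may differ.

```python
import numpy as np

METRIC_SPACING_M = 0.8  # metric waypoint spacing quoted in the text above


def decode_action_pred(action_pred):
    """(B, 5, 4) normalized output -> (B, 1, 5, 2) trajectory in meters."""
    xy = action_pred[..., :2]        # assumed: channels 0-1 are cumulative x, y
    traj = xy * METRIC_SPACING_M     # normalized units -> meters
    return traj[:, None]             # add a mode axis (M = 1, unimodal)
```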

	### NoMaD_GL_Official — 3× ONNX (diffusion, + `.onnx.data`)
	Paper: NoMaD: Goal Masked Diffusion Policies for Navigation and Exploration (ICRA 2024) · [arXiv:2310.07896](https://arxiv.org/abs/2310.07896) · [project](https://general-navigation-models.github.io/nomad/)

	Goal-free diffusion policy. Runs a 10-step DDPM loop (`squaredcos_cap_v2`) over 3 components:
	`nomad_vision_encoder.onnx` (`obs_img (B,12,96,96)` + `goal_img (B,3,96,96)` + `goal_mask (B)` → `cond (B,256)`), `nomad_noise_pred.onnx` (one denoising step), and `nomad_dist_pred.onnx`. Produces 8 trajectory samples → `trajectory (B,8,8,2)` meters (decode: unnormalize → cumsum → ×0.267 m spacing). This is the only multi-modal model and the slowest (diffusion + multiple samples).
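
The decode chain (unnormalize → cumsum → scale) can be sketched as below. The action statistics `ACTION_MIN` / `ACTION_MAX` are placeholder values assumed for illustration, as is the min-max form of the unnormalization; use the statistics that ship with the wrapper or its config.

```python
import numpy as np

# Assumed per-axis action statistics (placeholders, not read from this repo).
ACTION_MIN = np.array([-2.5, -4.0], dtype=np.float32)
ACTION_MAX = np.array([5.0, 4.0], dtype=np.float32)
WAYPOINT_SPACING_M = 0.267  # spacing quoted in the text above


def decode_nomad(samples):
    """samples: (B, 8, 8, 2) denoised deltas in [-1, 1] -> trajectories in meters."""
    # 1) Unnormalize from [-1, 1] back to the (assumed) action range.
    deltas = (samples + 1.0) / 2.0 * (ACTION_MAX - ACTION_MIN) + ACTION_MIN
    # 2) Integrate per-step deltas along the waypoint axis.
    cumulative = np.cumsum(deltas, axis=2)
    # 3) Scale normalized units to meters.
    return cumulative * WAYPOINT_SPACING_M
```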

	### CityWalker_PG_Official — `citywalker.onnx`
	Paper: CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos (CVPR 2025) · [arXiv:2411.17820](https://arxiv.org/abs/2411.17820) · [project](https://ai4ce.github.io/CityWalker/)

	Point-goal urban walker. ONNX I/O: `obs_images (B,5,3,350,630)` + `trajectory (B,6,2)` past waypoints → `wp_pred (B,5,2)`, `arrive_pred (B,1)` (arrival probability). Images are ImageNet-normalized internally; the model's internal `y=forward, x=right` frame is converted to standard frame by the wrapper. Input rate ≈ 5 Hz.
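
The frame conversion amounts to a 90° rotation of the axes. A minimal sketch, assuming waypoints are stored as `[x, y]` in the last dimension (the function names are hypothetical):

```python
import numpy as np


def citywalker_to_standard(wp):
    """[x=right, y=forward] -> [x=forward, y=left], any leading shape (..., 2)."""
    x_right, y_forward = wp[..., 0], wp[..., 1]
    return np.stack([y_forward, -x_right], axis=-1)


def standard_to_citywalker(wp):
    """[x=forward, y=left] -> [x=right, y=forward], inverse of the above."""
    x_forward, y_left = wp[..., 0], wp[..., 1]
    return np.stack([-y_left, x_forward], axis=-1)
```

For example, a point 1 m ahead and 0.5 m to the right is `(0.5, 1.0)` in the internal frame and `(1.0, -0.5)` in standard frame.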

	### MBRA_PG_Official — `mbra.onnx`
	Paper: Learning to Drive Anywhere with Model-Based Reannotation (RA-L 2025) · [arXiv:2505.05592](https://arxiv.org/abs/2505.05592) · [project](https://model-base-reannotation.github.io/)

	Point-goal policy. ONNX I/O: `obs_images (B,6,3,96,96)` ImageNet-normalized + `goal_pose (B,4)` = `[x, y, sin(yaw), cos(yaw)]` → `waypoints (B,8,4)`. Goal is given as `goal_xy` (meters) and converted internally; waypoints are un-normalized by a 0.8 m metric spacing. Input rate ≈ 5 Hz.

	### S2E — `s2e.onnx`
	Paper: From Seeing to Experiencing: Scaling Navigation Foundation Models with Reinforcement Learning (ICLR 2026) · [arXiv:2507.22028](https://arxiv.org/abs/2507.22028) · [project](https://metadriverse.github.io/s2e)

	UCLA-VAIL navigation foundation model; this is the behavior-cloning, point-goal, web-pretrained variant (`S2EBC-PG-Web100`). ONNX I/O: `obs_images (B,11,3,256,256)` in `[0,1]` (no ImageNet norm) + `goal (B,3)` = `[norm_dist, cos(θ), sin(θ)]` → `wp_pred (B,10,3)` `[x,y,yaw]`, `wp_pred_score (B,63)` mode scores. Frames are resized to 256×256 with OpenCV.
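
To make the goal encoding concrete, the sketch below builds a `[norm_dist, cos(θ), sin(θ)]` vector from a standard-frame `goal_xy`. How the wrapper normalizes distance is not stated here, so `max_dist` and the clipping are assumptions, and `encode_goal` is a hypothetical name.

```python
import numpy as np


def encode_goal(goal_xy, max_dist=20.0):
    """goal_xy: (2,) or (B, 2), x forward / y left, meters -> (B, 3) float32.

    `max_dist` is an assumed normalization constant, not taken from the model.
    """
    goal_xy = np.atleast_2d(np.asarray(goal_xy, dtype=np.float32))
    dist = np.linalg.norm(goal_xy, axis=-1)
    theta = np.arctan2(goal_xy[:, 1], goal_xy[:, 0])  # bearing, positive to the left
    norm_dist = np.clip(dist / max_dist, 0.0, 1.0)
    return np.stack([norm_dist, np.cos(theta), np.sin(theta)], axis=-1).astype(np.float32)
```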

	### MIMIC — `mimic.onnx`
	Paper: Learning Sidewalk Autopilot from Multi-Scale Imitation with Corrective Behavior Expansion (ICRA 2026) · [arXiv:2603.22527](https://arxiv.org/abs/2603.22527) · [project](https://vail-ucla.github.io/MIMIC)

	UCLA-VAIL goal-free long-context sidewalk policy. ONNX I/O: `input (1,16,3,288,512)` in `[0,1]` → `output (1,15,3)` `[x,y,yaw]` at non-uniform timestamps (0.2 s–5.0 s @ 5 Hz). Batch is processed one sample at a time; the wrapper keeps the first 13 waypoints (~4 s) and scales to meters. Requires `urbansim` (see Installation).

	## Downloading

	Full repo (includes the LFS-tracked ONNX weights):
	```bash
	hf download UCLA-VAIL/Navigation-Model-Zoo-Public --local-dir ./Navigation-Model-Zoo-Public
	```

	One model — fetch just its folder, e.g. MBRA:
	```bash
	hf download UCLA-VAIL/Navigation-Model-Zoo-Public \
	  --include "MBRA_PG_Official/*" --local-dir .
	```

	Then run from the repo root: `from MBRA_PG_Official.inference import MBRAPGNavigator`.

	> External weights: GNM, ViNT, and NoMaD ship `*.onnx.data` files — keep each `.onnx` and its
	> `.onnx.data` together in the same folder so ONNX Runtime can resolve the weights.

	## Intended use & limitations

	These are research artifacts intended for navigation research, reproduction, and benchmarking. They
	are not safety-validated and should not be deployed on real robots without additional testing. Each
	policy's behavior is bounded by its training distribution (camera intrinsics, embodiment, frame rate,
	environment).
	Several wrappers rectify/resize inputs to a specific training camera; mismatched cameras may degrade
	performance.

	## License

	Released under Apache 2.0. Individual models carry the licenses and terms of their original
	sources (ViNT, GNM, NoMaD, CityWalker, MBRA) — check upstream before commercial use.

	## Citation

	If you use a model from this zoo, please cite its original paper.

	GNM
	```bibtex
	@inproceedings{shah2023gnm,
	title={GNM: A general navigation model to drive any robot},
	author={Shah, Dhruv and Sridhar, Ajay and Bhorkar, Arjun and Hirose, Noriaki and Levine, Sergey},
	booktitle={2023 IEEE International Conference on Robotics and Automation (ICRA)},
	pages={7226--7233},
	year={2023},
	organization={IEEE}
	}
	```

	ViNT
	```bibtex
	@article{shah2023vint,
	title={ViNT: A foundation model for visual navigation},
	author={Shah, Dhruv and Sridhar, Ajay and Dashora, Nitish and Stachowicz, Kyle and Black, Kevin and Hirose, Noriaki and Levine, Sergey},
	journal={arXiv preprint arXiv:2306.14846},
	year={2023}
	}
	```

	NoMaD
	```bibtex
	@inproceedings{sridhar2024nomad,
	title={NoMaD: Goal masked diffusion policies for navigation and exploration},
	author={Sridhar, Ajay and Shah, Dhruv and Glossop, Catherine and Levine, Sergey},
	booktitle={2024 IEEE International Conference on Robotics and Automation (ICRA)},
	pages={63--70},
	year={2024},
	organization={IEEE}
	}
	```

	CityWalker
	```bibtex
	@inproceedings{liu2025citywalker,
	title={CityWalker: Learning embodied urban navigation from web-scale videos},
	author={Liu, Xinhao and Li, Jintong and Jiang, Yicheng and Sujay, Niranjan and Yang, Zhicheng and Zhang, Juexiao and Abanes, John and Zhang, Jing and Feng, Chen},
	booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
	pages={6875--6885},
	year={2025}
	}
	```

	MBRA
	```bibtex
	@article{hirose2025learning,
	title={Learning to drive anywhere with model-based reannotation},
	author={Hirose, Noriaki and Ignatova, Lydia and Stachowicz, Kyle and Glossop, Catherine and Levine, Sergey and Shah, Dhruv},
	journal={IEEE Robotics and Automation Letters},
	volume={11},
	number={2},
	pages={1242--1249},
	year={2025},
	publisher={IEEE}
	}
	```

	S2E
	```bibtex
	@article{he2025seeing,
	title={From seeing to experiencing: Scaling navigation foundation models with reinforcement learning},
	author={He, Honglin and Ma, Yukai and Squicciarini, Brad and Wu, Wayne and Zhou, Bolei},
	journal={arXiv preprint arXiv:2507.22028},
	year={2025}
	}
	```

	MIMIC
	```bibtex
	@article{he2026learning,
	title={Learning Sidewalk Autopilot from Multi-Scale Imitation with Corrective Behavior Expansion},
	author={He, Honglin and Ma, Yukai and Squicciarini, Brad and Wu, Wayne and Zhou, Bolei},
	journal={arXiv preprint arXiv:2603.22527},
	year={2026}
	}
	```

	## Contact

	Maintained by [UCLA-VAIL](https://vail-ucla.github.io/). Open an issue/discussion on the
	repository page for questions or contributions.