| license: apache-2.0 | |
| library_name: transformers | |
| tags: | |
| - robotics | |
| - navigation | |
| - waypoint-prediction | |
| - citywalker | |
| - dinov2 | |
| pipeline_tag: robotics | |
| # CityWalker (2000hr) | |
| HuggingFace port of the [CityWalker](https://github.com/ai4ce/CityWalker) | |
| waypoint-prediction model (CVPR 2025), trained on ~2000 hours of web-scale | |
| urban walking and driving videos. This repo contains the converted weights | |
| of `CityWalker_2000hr.ckpt` (originally a PyTorch Lightning checkpoint) | |
| re-packaged as a `transformers.PreTrainedModel` so it can be loaded with | |
| `AutoModel.from_pretrained` directly. | |
| Upstream training dataset: [ai4ce/CityWalker](https://huggingface.co/datasets/ai4ce/CityWalker). | |
| Our port (model wrapper + ckpt converter + benchmark integration) lives in | |
| [ai4ce/wanderland-benchmark](https://github.com/ai4ce/wanderland-benchmark). | |
| ## Architecture | |
| ``` | |
| images (B, 5, 3, H, W) ─► center_crop(400) + resize(392) + ImageNet norm | |
|                        ─► DINOv2 (vit-b/14) ─► obs tokens (B, 5, 768) | |
| coords (B, 6, 2)       ─► PolarEmbedding + Linear ─► goal token (B, 1, 768) | |
|                        ─► concat ─► (B, 6, 768) | |
|                        ─► TransformerEncoder (8 heads, 16 layers) | |
|                        ─► MLP head | |
|                             ─► waypoints (B, 5, 2) | |
|                             ─► arrive_logits (B, 1) | |
| ``` | |
| - `context_size` = 5 past RGB frames. | |
| - `len_traj_pred` = 5 future XY waypoints. | |
| - The 6 coord rows are the **5 past poses + 1 target pose**, all expressed in | |
| the current-pose-relative frame and divided by the per-video step_scale | |
| (so the model consumes dimensionless units, not meters). | |
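The coordinate convention above can be sketched in plain Python. This is only an illustration, not the upstream preprocessing code: the helper name `build_coords` is hypothetical, it assumes the most recent of the five past poses is the current pose, and the optional rotation by the current heading is an assumption (the card does not say whether the relative frame is rotated or only translated).

```python
import math

def build_coords(past_xy, target_xy, step_scale, current_yaw=None):
    """Assemble the 6 x 2 `coords` rows for one sample (hypothetical helper).

    past_xy     : five (x, y) world-frame positions, oldest first; the last
                  one is assumed to be the current pose.
    target_xy   : (x, y) world-frame target position.
    step_scale  : positive per-video scale used to make coords dimensionless.
    current_yaw : optional heading in radians. If given, points are rotated
                  into a heading-aligned frame (an assumption, see lead-in).
    """
    if len(past_xy) != 5:
        raise ValueError("expected exactly 5 past positions")
    if step_scale <= 0:
        raise ValueError("step_scale must be positive")

    cx, cy = past_xy[-1]
    if current_yaw is None:
        cos_y, sin_y = 1.0, 0.0  # translation only
    else:
        cos_y, sin_y = math.cos(current_yaw), math.sin(current_yaw)

    coords = []
    for x, y in list(past_xy) + [target_xy]:
        dx, dy = x - cx, y - cy
        # Rotate by -current_yaw so the heading maps onto the +x axis.
        rx = cos_y * dx + sin_y * dy
        ry = -sin_y * dx + cos_y * dy
        coords.append((rx / step_scale, ry / step_scale))
    return coords
```

With this convention the current pose always maps to `(0, 0)` in row 5 (zero-based row 4), and the target occupies the last row.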
| ## Usage | |
| ```python | |
| from transformers import AutoModel | |
| model = AutoModel.from_pretrained( | |
| "ai4ce/citywalker", | |
| trust_remote_code=True, | |
| ) | |
| model.eval() | |
| ``` | |
| The repo bundles `modeling_citywalker.py` and `configuration_citywalker.py` | |
| under `auto_map`, so `trust_remote_code=True` is all you need; there is no need to | |
| pip-install the wanderland-lab package. The DINOv2 backbone weights are | |
| included in `model.safetensors`. | |
| ## Inputs / Outputs | |
| | Name | Shape | Notes | | |
| |-------------------|-----------------------------|-------| | |
| `images` | `(B, 5, 3, H, W)` float32 | RGB in `[0, 1]`; the model applies `center_crop(400) → resize(392) → ImageNet normalize` internally | |
| | `coords` | `(B, 6, 2)` float32 | 5 past poses + 1 target pose in the current-pose-relative frame, scaled by `1 / step_scale` | | |
| `waypoints` out | `(B, 5, 2)` float32 | Predicted XY waypoints in the current-pose-relative frame, in step_scale units; multiply by `step_scale` to recover meters | |
| | `arrive_logits` | `(B, 1)` float32 | Pre-sigmoid logit for the "arrived at target" binary classifier | | |
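As a rough illustration of how the two outputs in the table might be post-processed for a single sample, the sketch below rescales waypoints to meters and turns the arrival logit into a probability. The function name `postprocess` and the `arrive_threshold` default are assumptions made for this example; the model card does not prescribe a decision threshold.

```python
import math

def postprocess(waypoints, arrive_logit, step_scale, arrive_threshold=0.5):
    """Convert raw outputs for one sample into metric waypoints and an
    arrival decision (hypothetical helper, not part of the model repo).

    waypoints        : iterable of (x, y) pairs in step_scale units.
    arrive_logit     : pre-sigmoid scalar from `arrive_logits`.
    step_scale       : the same scale that was used to normalize `coords`.
    arrive_threshold : assumed probability cutoff for "arrived".
    """
    # Multiply by step_scale to recover meters, as the table describes.
    waypoints_m = [(x * step_scale, y * step_scale) for x, y in waypoints]

    # Numerically stable sigmoid for large positive or negative logits.
    if arrive_logit >= 0:
        prob = 1.0 / (1.0 + math.exp(-arrive_logit))
    else:
        e = math.exp(arrive_logit)
        prob = e / (1.0 + e)

    return waypoints_m, prob, prob >= arrive_threshold
```

When working with batched tensors, the same two steps are an elementwise multiply and a sigmoid.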
| **The model predicts 2D XY waypoints only.** It does not output a heading or | |
| yaw. Downstream controllers that need `(vx, vy, yaw_rate)` derive yaw from | |
| the predicted waypoint direction (e.g. `atan2(wp_y, wp_x)`). | |
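A minimal sketch of the yaw derivation mentioned above follows. The heading is taken directly from `atan2(wp_y, wp_x)`; the proportional `gain` and the `max_yaw_rate` clamp are hypothetical controller parameters added for illustration and are not taken from the CityWalker code.

```python
import math

def yaw_command(wp_x, wp_y, gain=1.0, max_yaw_rate=1.0):
    """Return (heading_error, yaw_rate) for one predicted waypoint.

    heading_error is atan2(wp_y, wp_x): the angle of the waypoint measured
    from the +x axis of the current-pose-relative frame, in radians.
    gain and max_yaw_rate are assumed values for a simple P-controller.
    """
    heading_error = math.atan2(wp_y, wp_x)
    # Proportional command, clamped to the allowed yaw-rate range.
    yaw_rate = max(-max_yaw_rate, min(max_yaw_rate, gain * heading_error))
    return heading_error, yaw_rate
```

Because `atan2` depends only on the ratio of the two components, the heading is the same whether the waypoint is given in step_scale units or in meters.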
| ## Policy wrapper | |
| For robot-control use (per-episode position history, step_scale estimation | |
| from recent motion, lookahead along a reference path, and conversion of the | |
| waypoint to a body-frame velocity command), see `CityWalkerPolicy` in | |
| [ai4ce/wanderland-benchmark](https://github.com/ai4ce/wanderland-benchmark). | |
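To give a feel for what "step_scale estimation from recent motion" could look like, here is one plausible sketch: the mean distance between consecutive recent positions, with a floor so a stationary robot does not cause a division by zero. This is an assumption for illustration only; `estimate_step_scale` and `min_scale` are hypothetical names, and `CityWalkerPolicy` may compute the scale differently.

```python
import math

def estimate_step_scale(positions, min_scale=1e-2):
    """Estimate step_scale from a short history of (x, y) positions.

    Uses the mean Euclidean distance between consecutive positions
    (an assumed definition). min_scale is a hypothetical lower bound that
    keeps the later division by step_scale well defined when the robot is
    standing still or the history is too short.
    """
    steps = [math.dist(a, b) for a, b in zip(positions, positions[1:])]
    if not steps:
        return min_scale
    return max(sum(steps) / len(steps), min_scale)
```

The returned value would then be used both to normalize `coords` before the forward pass and to rescale the predicted waypoints back to meters afterwards.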
| ## Citation | |
| ``` | |
| @inproceedings{liu2025citywalker, | |
| title={Citywalker: Learning embodied urban navigation from web-scale videos}, | |
| author={Liu, Xinhao and Li, Jintong and Jiang, Yicheng and Sujay, Niranjan and Yang, Zhicheng and Zhang, Juexiao and Abanes, John and Zhang, Jing and Feng, Chen}, | |
| booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference}, | |
| pages={6875--6885}, | |
| year={2025} | |
| } | |
| ``` | |
| ## License | |
| Apache-2.0, matching the upstream | |
| [ai4ce/CityWalker](https://github.com/ai4ce/CityWalker) repository. | |