---
license: apache-2.0
library_name: transformers
tags:
- robotics
- navigation
- waypoint-prediction
- citywalker
- dinov2
pipeline_tag: robotics
---

# CityWalker (2000hr)

HuggingFace port of the [CityWalker](https://github.com/ai4ce/CityWalker) waypoint-prediction model (CVPR 2025), trained on ~2000 hours of web-scale urban walking and driving videos.

This repo contains the converted weights of `CityWalker_2000hr.ckpt` (originally a PyTorch Lightning checkpoint) re-packaged as a `transformers.PreTrainedModel` so it can be loaded with `AutoModel.from_pretrained` directly.

Upstream training dataset: [ai4ce/CityWalker](https://huggingface.co/datasets/ai4ce/CityWalker).
Our port (model wrapper + ckpt converter + benchmark integration) lives in [ai4ce/wanderland-benchmark](https://github.com/ai4ce/wanderland-benchmark).

## Architecture

```
images (B, 5, 3, H, W)
  ─► center_crop(400) + resize(392) + ImageNet norm
  ─► DINOv2 (vit-b/14)
  ─► obs tokens (B, 5, 768)
coords (B, 6, 2)
  ─► PolarEmbedding + Linear
  ─► goal token (B, 1, 768)
  ─► concat ─► (B, 6, 768)
  ─► TransformerEncoder (8 heads, 16 layers)
  ─► MLP head ─► waypoints (B, 5, 2)
              ─► arrive_logits (B, 1)
```

- `context_size` = 5 past RGB frames.
- `len_traj_pred` = 5 future XY waypoints.
- The 6 coord rows are the **5 past poses + 1 target pose**, all expressed in the current-pose-relative frame and divided by the per-video `step_scale` (so the model consumes dimensionless units, not meters).

## Usage

```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "ai4ce/citywalker",
    trust_remote_code=True,
)
model.eval()
```

The repo bundles `modeling_citywalker.py` and `configuration_citywalker.py` under `auto_map`, so `trust_remote_code=True` is all you need; there is no need to pip-install the wanderland-lab package. The DINOv2 backbone weights are included in `model.safetensors`.

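
The coordinate convention described under Architecture can be sketched as follows. This is an illustration only, not the upstream preprocessing: `build_coords` is a hypothetical helper, and it assumes that the last past pose is the current pose, that the relative frame is obtained by subtracting the current position and rotating by the negative current yaw, and that `step_scale` is already known. Check the upstream code for the exact convention before relying on it.

```python
import math


def build_coords(past_xy, target_xy, current_yaw, step_scale):
    """Hypothetical helper: pack 5 past poses + 1 target pose into 6 (x, y) rows.

    past_xy:     5 world-frame (x, y) positions in meters, oldest first.
                 Assumption: the last entry is the current pose.
    target_xy:   world-frame (x, y) of the target in meters.
    current_yaw: current heading in radians (assumed rotation convention).
    step_scale:  per-video step length used to make the values dimensionless.
    """
    if len(past_xy) != 5:
        raise ValueError("expected 5 past poses (context_size = 5)")
    cx, cy = past_xy[-1]
    # Rotate by -yaw so the current heading points along +x.
    c, s = math.cos(-current_yaw), math.sin(-current_yaw)
    coords = []
    for x, y in list(past_xy) + [target_xy]:
        dx, dy = x - cx, y - cy            # translate to the current pose
        rx = c * dx - s * dy               # rotate into the current-pose frame
        ry = s * dx + c * dy
        coords.append([rx / step_scale, ry / step_scale])  # meters -> step units
    return coords


past = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
coords = build_coords(past, target_xy=(6.0, 0.0), current_yaw=0.0, step_scale=2.0)
```

The resulting 6x2 list can be wrapped as `torch.tensor(coords, dtype=torch.float32).unsqueeze(0)` to obtain the `(1, 6, 2)` batch the model expects.
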
## Inputs / Outputs

| Name            | Shape                     | Notes |
|-----------------|---------------------------|-------|
| `images`        | `(B, 5, 3, H, W)` float32 | RGB in `[0, 1]`; the model applies `center_crop(400) → resize(392) → ImageNet normalize` internally |
| `coords`        | `(B, 6, 2)` float32       | 5 past poses + 1 target pose in the current-pose-relative frame, scaled by `1 / step_scale` |
| `waypoints` out | `(B, 5, 2)` float32       | Predicted XY waypoints in the current-pose-relative frame, in `step_scale` units; multiply by `step_scale` to recover meters |
| `arrive_logits` | `(B, 1)` float32          | Pre-sigmoid logit for the "arrived at target" binary classifier |

**The model predicts 2D XY waypoints only.** It does not output a heading or yaw. Downstream controllers that need `(vx, vy, yaw_rate)` derive yaw from the predicted waypoint direction (e.g. `atan2(wp_y, wp_x)`).

## Policy wrapper

For robot-control use (per-episode position history, `step_scale` estimation from recent motion, lookahead along a reference path, and conversion of the waypoint to a body-frame velocity command), see `CityWalkerPolicy` in [ai4ce/wanderland-benchmark](https://github.com/ai4ce/wanderland-benchmark).

## Citation

```
@inproceedings{liu2025citywalker,
  title={Citywalker: Learning embodied urban navigation from web-scale videos},
  author={Liu, Xinhao and Li, Jintong and Jiang, Yicheng and Sujay, Niranjan and Yang, Zhicheng and Zhang, Juexiao and Abanes, John and Zhang, Jing and Feng, Chen},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={6875--6885},
  year={2025}
}
```

## License

Apache-2.0, matching the upstream [ai4ce/CityWalker](https://github.com/ai4ce/CityWalker) repository.
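
As a rough sketch of the output handling described under Inputs / Outputs, the snippet below rescales waypoints to meters, derives a heading with `atan2`, and turns the arrival logit into a probability. `postprocess` is a hypothetical helper, not part of this repo, and using the first waypoint for the heading is an assumption; `CityWalkerPolicy` may choose differently.

```python
import math


def postprocess(waypoints, arrive_logit, step_scale):
    """Hypothetical helper for one sample.

    waypoints:    5 predicted (x, y) pairs in step_scale units.
    arrive_logit: pre-sigmoid logit from `arrive_logits`.
    step_scale:   the same per-video scale used to normalize `coords`.
    """
    # Multiply by step_scale to recover meters.
    wps_m = [(x * step_scale, y * step_scale) for x, y in waypoints]
    # Assumption: take the heading from the first waypoint's direction.
    x0, y0 = wps_m[0]
    yaw = math.atan2(y0, x0)
    # Sigmoid turns the logit into an "arrived" probability.
    p_arrived = 1.0 / (1.0 + math.exp(-arrive_logit))
    return wps_m, yaw, p_arrived


wps = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0), (5.0, 5.0)]
wps_m, yaw, p_arrived = postprocess(wps, arrive_logit=0.0, step_scale=0.5)
```

For a model output tensor, `waypoints[0].tolist()` and `arrive_logits[0, 0].item()` would give inputs in the form this helper expects.
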