---
license: apache-2.0
library_name: transformers
tags:
- robotics
- navigation
- waypoint-prediction
- citywalker
- dinov2
pipeline_tag: robotics
---
# CityWalker (2000hr)
HuggingFace port of the [CityWalker](https://github.com/ai4ce/CityWalker)
waypoint-prediction model (CVPR 2025), trained on ~2000 hours of web-scale
urban walking and driving videos. This repo contains the converted weights
of `CityWalker_2000hr.ckpt` (originally a PyTorch Lightning checkpoint)
re-packaged as a `transformers.PreTrainedModel` so it can be loaded with
`AutoModel.from_pretrained` directly.
Upstream training dataset: [ai4ce/CityWalker](https://huggingface.co/datasets/ai4ce/CityWalker).
Our port (model wrapper + ckpt converter + benchmark integration) lives in
[ai4ce/wanderland-benchmark](https://github.com/ai4ce/wanderland-benchmark).
## Architecture
```
images (B, 5, 3, H, W) ─► center_crop(400) + resize(392) + ImageNet norm
─► DINOv2 (vit-b/14) ─► obs tokens (B, 5, 768)
coords (B, 6, 2) ─► PolarEmbedding + Linear ─► goal token (B, 1, 768)
─► concat ─► (B, 6, 768)
─► TransformerEncoder (8 heads, 16 layers)
─► MLP head
─► waypoints (B, 5, 2)
─► arrive_logits (B, 1)
```
- `context_size` = 5 past RGB frames.
- `len_traj_pred` = 5 future XY waypoints.
- The 6 coord rows are the **5 past poses + 1 target pose**, all expressed in
the current-pose-relative frame and divided by the per-video step_scale
(so the model consumes dimensionless units, not meters).
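
The coord layout described above can be sketched in NumPy as follows. This is a minimal illustration rather than the upstream preprocessing code: the helper name `build_coords` is hypothetical, and it assumes that the last past pose is the current pose and that the relative frame has +x along the current heading.

```python
import numpy as np


def build_coords(past_xy, target_xy, current_yaw, step_scale):
    """Build the (6, 2) coords array: 5 past poses followed by 1 target pose.

    past_xy:     (5, 2) world-frame XY positions, oldest first; the last row
                 is taken to be the current pose (assumption).
    target_xy:   (2,) world-frame XY position of the target.
    current_yaw: heading of the current pose in radians (assumption: the
                 relative frame has +x along this heading).
    step_scale:  per-video scale that makes the coordinates dimensionless.
    """
    past_xy = np.asarray(past_xy, dtype=np.float64)
    target_xy = np.asarray(target_xy, dtype=np.float64)
    if past_xy.shape != (5, 2) or target_xy.shape != (2,):
        raise ValueError("expected past_xy of shape (5, 2) and target_xy of shape (2,)")

    # Stack the 5 past poses and the target into one (6, 2) array.
    points = np.vstack([past_xy, target_xy[None, :]])

    # Translate so the current pose becomes the origin.
    rel = points - past_xy[-1]

    # Rotate by -yaw so the current heading lines up with +x.
    c, s = np.cos(-current_yaw), np.sin(-current_yaw)
    rot = np.array([[c, -s], [s, c]])
    rel = rel @ rot.T

    # Divide by step_scale so the model sees dimensionless units, not meters.
    return (rel / step_scale).astype(np.float32)


# Walking along +y at 0.5 m per frame, target 2.5 m straight ahead.
past = [[0.0, 0.0], [0.0, 0.5], [0.0, 1.0], [0.0, 1.5], [0.0, 2.0]]
coords = build_coords(past, target_xy=[0.0, 4.5], current_yaw=np.pi / 2, step_scale=0.5)
print(coords.shape)
```

With these inputs the target row lands 5 step units ahead on the x-axis and the current pose sits at the origin.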
## Usage
```python
from transformers import AutoModel
model = AutoModel.from_pretrained(
"ai4ce/citywalker",
trust_remote_code=True,
)
model.eval()
```
The repo bundles `modeling_citywalker.py` and `configuration_citywalker.py`
under `auto_map`, so `trust_remote_code=True` is all you need; there is no
need to pip-install the wanderland-lab package. The DINOv2 backbone weights
are included in `model.safetensors`.
## Inputs / Outputs
| Name | Shape | Notes |
|-------------------|-----------------------------|-------|
| `images` | `(B, 5, 3, H, W)` float32 | RGB in `[0, 1]`; the model applies `center_crop(400) → resize(392) → ImageNet normalize` internally |
| `coords` | `(B, 6, 2)` float32 | 5 past poses + 1 target pose in the current-pose-relative frame, scaled by `1 / step_scale` |
| `waypoints` out | `(B, 5, 2)` float32 | Predicted XY waypoints in the current-pose-relative frame, in step_scale units; multiply by `step_scale` to recover meters |
| `arrive_logits` | `(B, 1)` float32 | Pre-sigmoid logit for the "arrived at target" binary classifier |
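
A quick way to catch shape, dtype, and range mistakes before calling the model is to check the arrays against the table above. The sketch below is illustrative only: `check_inputs` and `frames_to_images` are hypothetical helpers, not part of this repo, and the `(B, 5, H, W, 3)` uint8 frame layout is an assumption about how frames arrive from a camera or video decoder.

```python
import numpy as np

CONTEXT_SIZE = 5  # past RGB frames
NUM_COORDS = 6    # 5 past poses + 1 target pose


def frames_to_images(frames_uint8):
    """Convert (B, 5, H, W, 3) uint8 frames to (B, 5, 3, H, W) float32 in [0, 1]."""
    return np.transpose(frames_uint8, (0, 1, 4, 2, 3)).astype(np.float32) / 255.0


def check_inputs(images, coords):
    """Raise ValueError if the arrays break the input contract in the table above."""
    if images.ndim != 5 or images.shape[1:3] != (CONTEXT_SIZE, 3):
        raise ValueError(f"images must be (B, 5, 3, H, W), got {images.shape}")
    if coords.shape != (images.shape[0], NUM_COORDS, 2):
        raise ValueError(f"coords must be (B, 6, 2) with matching B, got {coords.shape}")
    if images.dtype != np.float32 or coords.dtype != np.float32:
        raise ValueError("both inputs must be float32")
    if images.min() < 0.0 or images.max() > 1.0:
        raise ValueError("images must be RGB in [0, 1]; divide uint8 frames by 255")


# Example: two clips of five 8x8 frames (tiny on purpose) and zeroed coords.
frames = np.zeros((2, 5, 8, 8, 3), dtype=np.uint8)
images = frames_to_images(frames)
coords = np.zeros((2, 6, 2), dtype=np.float32)
check_inputs(images, coords)  # passes silently
```

The arrays can then be converted with `torch.from_numpy` before being passed to the model.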
**The model predicts 2D XY waypoints only.** It does not output a heading or
yaw. Downstream controllers that need `(vx, vy, yaw_rate)` derive yaw from
the predicted waypoint direction (e.g. `atan2(wp_y, wp_x)`).
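
The post-processing described above (rescaling to meters, deriving a heading, and turning the arrival logit into a probability) can be sketched in plain Python. The helper name `postprocess` is hypothetical, and taking the heading from the first predicted waypoint is an assumption; a controller may prefer a later waypoint or an average.

```python
import math


def postprocess(waypoints, arrive_logit, step_scale):
    """Convert raw model outputs for one sample into metric quantities.

    waypoints:    list of 5 (x, y) pairs in step_scale units.
    arrive_logit: pre-sigmoid scalar for the "arrived at target" classifier.
    step_scale:   the same scale that was used to normalize the input coords.
    """
    # Multiply by step_scale to recover meters.
    waypoints_m = [(x * step_scale, y * step_scale) for x, y in waypoints]

    # Heading toward the first waypoint (assumption: index 0 is the nearest one).
    first_x, first_y = waypoints_m[0]
    yaw = math.atan2(first_y, first_x)

    # Sigmoid turns the logit into an arrival probability.
    p_arrived = 1.0 / (1.0 + math.exp(-arrive_logit))
    return waypoints_m, yaw, p_arrived


raw_waypoints = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0), (5.0, 5.0)]
waypoints_m, yaw, p_arrived = postprocess(raw_waypoints, arrive_logit=0.0, step_scale=0.5)
print(waypoints_m[0], yaw, p_arrived)
```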
## Policy wrapper
For robot-control use (per-episode position history, step_scale estimation
from recent motion, lookahead along a reference path, and conversion of the
waypoint to a body-frame velocity command), see `CityWalkerPolicy` in
[ai4ce/wanderland-benchmark](https://github.com/ai4ce/wanderland-benchmark).
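
To give a rough idea of two of these steps, the sketch below estimates a step scale from recent positions and maps one predicted waypoint to a velocity command. It is not the `CityWalkerPolicy` implementation: the function names, the mean-displacement estimate, the `horizon_s`, `max_speed`, and `yaw_gain` parameters, and the proportional yaw rule are all assumptions made for illustration.

```python
import math


def estimate_step_scale(recent_xy, min_scale=1e-2):
    """Mean distance between consecutive positions, floored to avoid division by zero."""
    steps = [math.dist(a, b) for a, b in zip(recent_xy, recent_xy[1:])]
    if not steps:
        return min_scale
    return max(sum(steps) / len(steps), min_scale)


def waypoint_to_velocity(wp_xy, step_scale, horizon_s, max_speed=1.0, yaw_gain=1.0):
    """Map one body-frame waypoint (step_scale units) to (vx, vy, yaw_rate).

    horizon_s is the time allotted to reach the waypoint (hypothetical parameter).
    """
    # Back to meters, then to a velocity that reaches the waypoint in horizon_s.
    x_m, y_m = wp_xy[0] * step_scale, wp_xy[1] * step_scale
    vx, vy = x_m / horizon_s, y_m / horizon_s

    # Clamp the planar speed while keeping the direction.
    speed = math.hypot(vx, vy)
    if speed > max_speed:
        vx, vy = vx * max_speed / speed, vy * max_speed / speed

    # Proportional turn toward the waypoint direction.
    yaw_rate = yaw_gain * math.atan2(y_m, x_m)
    return vx, vy, yaw_rate


history = [(0.0, 0.0), (0.0, 0.5), (0.0, 1.0)]
scale = estimate_step_scale(history)
command = waypoint_to_velocity((2.0, 0.0), scale, horizon_s=1.0)
print(scale, command)
```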
## Citation
```
@inproceedings{liu2025citywalker,
title={Citywalker: Learning embodied urban navigation from web-scale videos},
author={Liu, Xinhao and Li, Jintong and Jiang, Yicheng and Sujay, Niranjan and Yang, Zhicheng and Zhang, Juexiao and Abanes, John and Zhang, Jing and Feng, Chen},
booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
pages={6875--6885},
year={2025}
}
```
## License
Apache-2.0, matching the upstream
[ai4ce/CityWalker](https://github.com/ai4ce/CityWalker) repository.