---
license: apache-2.0
library_name: transformers
tags:
- robotics
- navigation
- waypoint-prediction
- citywalker
- dinov2
pipeline_tag: robotics
---

# CityWalker (2000hr)

HuggingFace port of the [CityWalker](https://github.com/ai4ce/CityWalker)
waypoint-prediction model (CVPR 2025), trained on ~2000 hours of web-scale
urban walking and driving videos. This repo contains the converted weights
of `CityWalker_2000hr.ckpt` (originally a PyTorch Lightning checkpoint)
re-packaged as a `transformers.PreTrainedModel` so it can be loaded with
`AutoModel.from_pretrained` directly.

Upstream training dataset: [ai4ce/CityWalker](https://huggingface.co/datasets/ai4ce/CityWalker).
Our port (model wrapper + ckpt converter + benchmark integration) lives in
[ai4ce/wanderland-benchmark](https://github.com/ai4ce/wanderland-benchmark).

## Architecture

```
images (B, 5, 3, H, W)  ─► center_crop(400) + resize(392) + ImageNet norm
                         ─► DINOv2 (vit-b/14)        ─► obs tokens (B, 5, 768)
coords (B, 6, 2)         ─► PolarEmbedding + Linear  ─► goal token  (B, 1, 768)
                                                      ─► concat ─► (B, 6, 768)
                                                      ─► TransformerEncoder (8 heads, 16 layers)
                                                      ─► MLP head
                                                      ─► waypoints (B, 5, 2)
                                                      ─► arrive_logits (B, 1)
```

- `context_size` = 5 past RGB frames.
- `len_traj_pred` = 5 future XY waypoints.
- The 6 coord rows are the **5 past poses + 1 target pose**, all expressed in
  the current-pose-relative frame and divided by the per-video `step_scale`
  (so the model consumes dimensionless units, not meters).
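
The `coords` layout described above can be sketched in plain Python. This is a minimal illustration, not the upstream preprocessing code: the helper name `build_coords`, the assumption that the last past pose is the current pose, and the rotation by the current yaw are all assumptions made here. Check `CityWalkerPolicy` for the exact frame convention before relying on it.

```python
import math

def build_coords(past_xy, target_xy, current_xy, current_yaw, step_scale):
    """Hypothetical helper: build the 6 x 2 `coords` rows for one sample.

    past_xy     -- 5 world-frame (x, y) positions, oldest first
    target_xy   -- world-frame (x, y) of the target
    current_xy  -- world-frame (x, y) of the current pose
    current_yaw -- current heading in radians (assumed convention)
    step_scale  -- per-video scale, in meters per step
    """
    if len(past_xy) != 5:
        raise ValueError("expected 5 past poses (context_size = 5)")
    if step_scale <= 0:
        raise ValueError("step_scale must be positive")

    # Rotating by -yaw expresses world-frame offsets in the current pose frame.
    cos_y, sin_y = math.cos(-current_yaw), math.sin(-current_yaw)
    rows = []
    for x, y in list(past_xy) + [target_xy]:
        dx, dy = x - current_xy[0], y - current_xy[1]
        rx = cos_y * dx - sin_y * dy
        ry = sin_y * dx + cos_y * dy
        # Divide by step_scale so the model sees dimensionless units.
        rows.append([rx / step_scale, ry / step_scale])
    return rows

# Walking along +x at 0.5 m per step, target 2 m ahead of the current pose.
past = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (1.5, 0.0), (2.0, 0.0)]
coords = build_coords(past, (4.0, 0.0), (2.0, 0.0), 0.0, 0.5)
```

Batching these rows into a `(B, 6, 2)` float32 tensor is left to the caller.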

## Usage

```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "ai4ce/citywalker",
    trust_remote_code=True,
)
model.eval()
```

The repo bundles `modeling_citywalker.py` and `configuration_citywalker.py`,
registered under `auto_map`, so `trust_remote_code=True` is all you need;
you do not have to pip-install the wanderland-lab package. The DINOv2
backbone weights are included in `model.safetensors`.

## Inputs / Outputs

| Name              | Shape                       | Notes |
|-------------------|-----------------------------|-------|
| `images`          | `(B, 5, 3, H, W)` float32   | RGB in `[0, 1]`; the model applies `center_crop(400) → resize(392) → ImageNet normalize` internally |
| `coords`          | `(B, 6, 2)` float32         | 5 past poses + 1 target pose in the current-pose-relative frame, scaled by `1 / step_scale` |
| `waypoints` out   | `(B, 5, 2)` float32         | Predicted XY waypoints in the current-pose-relative frame, in step_scale units — multiply by `step_scale` to recover meters |
| `arrive_logits`   | `(B, 1)` float32            | Pre-sigmoid logit for the "arrived at target" binary classifier |
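
As a rough sketch of the post-processing the table implies, the snippet below converts predicted waypoints back to meters and turns the arrival logit into a probability. The function names are hypothetical, the inputs are plain Python lists rather than tensors, and how you threshold the probability is a design choice not specified by this model card.

```python
import math

def waypoints_to_meters(waypoints, step_scale):
    """Multiply step_scale-unit waypoints by step_scale to recover meters."""
    return [[x * step_scale, y * step_scale] for x, y in waypoints]

def arrival_probability(logit):
    """Numerically stable sigmoid of the pre-sigmoid arrival logit."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)

# Example values standing in for one sample of model output.
wp_meters = waypoints_to_meters([[1.0, 0.0], [2.0, 0.5]], step_scale=0.5)
p_arrived = arrival_probability(0.0)
```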

**The model predicts 2D XY waypoints only.** It does not output a heading or
yaw. Downstream controllers that need `(vx, vy, yaw_rate)` derive yaw from
the predicted waypoint direction (e.g. `atan2(wp_y, wp_x)`).
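
One way a downstream controller might derive `(vx, vy, yaw_rate)` from a single waypoint is sketched below. It is an illustration only: the parameters `dt`, `max_speed`, and `yaw_gain` are hypothetical, the waypoint is assumed to be in meters already, and the proportional yaw law is an assumption rather than the method used by any particular controller.

```python
import math

def waypoint_to_command(wp_x, wp_y, dt=1.0, max_speed=1.0, yaw_gain=1.0):
    """Hypothetical helper: turn one waypoint (meters) into (vx, vy, yaw_rate)."""
    # Direction of the waypoint relative to the current pose's x axis.
    heading = math.atan2(wp_y, wp_x)

    # Velocity that would reach the waypoint in dt seconds, capped at max_speed.
    vx, vy = wp_x / dt, wp_y / dt
    speed = math.hypot(vx, vy)
    if speed > max_speed:
        scale = max_speed / speed
        vx, vy = vx * scale, vy * scale

    # Simple proportional law: turn toward the waypoint direction.
    yaw_rate = yaw_gain * heading
    return vx, vy, yaw_rate

command = waypoint_to_command(0.0, 2.0)
```

Here the waypoint lies 2 m along the y axis, so the speed cap scales the velocity down to `max_speed` while the heading term stays at `pi / 2`.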

## Policy wrapper

For robot-control use, see `CityWalkerPolicy` in
[ai4ce/wanderland-benchmark](https://github.com/ai4ce/wanderland-benchmark).
It handles per-episode position history, `step_scale` estimation from recent
motion, lookahead along a reference path, and conversion of the waypoint to
a body-frame velocity command.
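
To give a sense of what estimating `step_scale` from recent motion could look like, here is a minimal sketch that takes the mean distance between consecutive recent positions. This is an assumption for illustration, not the `CityWalkerPolicy` implementation; the function name and the `min_scale` floor are hypothetical.

```python
import math

def estimate_step_scale(recent_xy, min_scale=1e-3):
    """Hypothetical estimate: mean step length over recent (x, y) positions."""
    if len(recent_xy) < 2:
        return min_scale
    steps = [
        math.hypot(x1 - x0, y1 - y0)
        for (x0, y0), (x1, y1) in zip(recent_xy, recent_xy[1:])
    ]
    # Floor the result so a stationary robot does not cause division by zero.
    return max(sum(steps) / len(steps), min_scale)

scale = estimate_step_scale([(0.0, 0.0), (0.5, 0.0), (1.0, 0.0)])
```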

## Citation

```
@inproceedings{liu2025citywalker,
  title={Citywalker: Learning embodied urban navigation from web-scale videos},
  author={Liu, Xinhao and Li, Jintong and Jiang, Yicheng and Sujay, Niranjan and Yang, Zhicheng and Zhang, Juexiao and Abanes, John and Zhang, Jing and Feng, Chen},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={6875--6885},
  year={2025}
}
```

## License

Apache-2.0, matching the upstream
[ai4ce/CityWalker](https://github.com/ai4ce/CityWalker) repository.