	---
	license: apache-2.0
	library_name: transformers
	tags:
	- robotics
	- navigation
	- waypoint-prediction
	- citywalker
	- dinov2
	pipeline_tag: robotics
	---

	# CityWalker (2000hr)

	HuggingFace port of the [CityWalker](https://github.com/ai4ce/CityWalker)
	waypoint-prediction model (CVPR 2025), trained on ~2000 hours of web-scale
	urban walking and driving videos. This repo contains the converted weights
	of `CityWalker_2000hr.ckpt` (originally a PyTorch Lightning checkpoint)
	re-packaged as a `transformers.PreTrainedModel` so it can be loaded with
	`AutoModel.from_pretrained` directly.

	Upstream training dataset: [ai4ce/CityWalker](https://huggingface.co/datasets/ai4ce/CityWalker).
	Our port (model wrapper + ckpt converter + benchmark integration) lives in
	[ai4ce/wanderland-benchmark](https://github.com/ai4ce/wanderland-benchmark).

	## Architecture

	```
	images (B, 5, 3, H, W) ─► center_crop(400) + resize(392) + ImageNet norm
	                       ─► DINOv2 (vit-b/14) ─► obs tokens (B, 5, 768)
	coords (B, 6, 2)       ─► PolarEmbedding + Linear ─► goal token (B, 1, 768)
	                       ─► concat ─► (B, 6, 768)
	                       ─► TransformerEncoder (8 heads, 16 layers)
	                       ─► MLP head
	                            ─► waypoints (B, 5, 2)
	                            ─► arrive_logits (B, 1)
	```

	- `context_size` = 5 past RGB frames.
	- `len_traj_pred` = 5 future XY waypoints.
	- The 6 coord rows are the 5 past poses + 1 target pose, all expressed in
	  the current-pose-relative frame and divided by the per-video `step_scale`
	  (so the model consumes dimensionless units, not meters).
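
As a rough illustration of the coordinate convention above, the sketch below builds the six `coords` rows from world-frame poses. It is not the upstream preprocessing code: the helper name `to_model_coords` is hypothetical, and it assumes that the most recent of the five past poses is the current pose, that the relative frame includes a rotation by the current yaw, and that x points forward with yaw measured counter-clockwise. Check these assumptions against the benchmark repository before relying on them.

```python
import math


def to_model_coords(past_xy, target_xy, current_yaw, step_scale):
    """Return 6 (x, y) rows: 5 past poses + 1 target pose.

    Hypothetical helper. Assumes past_xy is ordered oldest first and that
    its last entry is the current pose (an assumption, not confirmed upstream).
    """
    if len(past_xy) != 5:
        raise ValueError("expected exactly 5 past poses")
    if step_scale <= 0:
        raise ValueError("step_scale must be positive")

    cur_x, cur_y = past_xy[-1]
    cos_yaw, sin_yaw = math.cos(current_yaw), math.sin(current_yaw)

    coords = []
    for x, y in list(past_xy) + [target_xy]:
        # Translate so the current pose sits at the origin.
        dx, dy = x - cur_x, y - cur_y
        # Rotate the world-frame offset by -yaw into the current-pose frame.
        local_x = cos_yaw * dx + sin_yaw * dy
        local_y = -sin_yaw * dx + cos_yaw * dy
        # Divide by step_scale so the values are dimensionless.
        coords.append((local_x / step_scale, local_y / step_scale))
    return coords
```

The returned list has shape `(6, 2)`; stacking one such list per sample gives the `(B, 6, 2)` tensor the model expects.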

	## Usage

	```python
	from transformers import AutoModel

	model = AutoModel.from_pretrained(
	"ai4ce/citywalker",
	trust_remote_code=True,
	)
	model.eval()
	```

	The repo bundles `modeling_citywalker.py` and `configuration_citywalker.py`
	and registers them via `auto_map`, so `trust_remote_code=True` is all you
	need; there is no need to pip-install the wanderland-lab package. The
	DINOv2 backbone weights are included in `model.safetensors`.

	## Inputs / Outputs

	| Name | Shape | Notes |
	|------|-------|-------|
	| `images` | `(B, 5, 3, H, W)` float32 | RGB in `[0, 1]`; the model applies `center_crop(400) → resize(392) → ImageNet normalize` internally |
	| `coords` | `(B, 6, 2)` float32 | 5 past poses + 1 target pose in the current-pose-relative frame, scaled by `1 / step_scale` |
	| `waypoints` out | `(B, 5, 2)` float32 | Predicted XY waypoints in the current-pose-relative frame, in `step_scale` units; multiply by `step_scale` to recover meters |
	| `arrive_logits` | `(B, 1)` float32 | Pre-sigmoid logit for the "arrived at target" binary classifier |
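
The two outputs need light post-processing before use: waypoints are rescaled to meters and the arrival logit is passed through a sigmoid. The sketch below shows this on plain Python values for a single sample. The helper name `postprocess` is hypothetical, and any threshold you apply to the resulting probability (such as 0.5) is your own choice, not something this model card specifies.

```python
import math


def postprocess(waypoints, arrive_logit, step_scale):
    """Convert raw model outputs for one sample into physical quantities.

    Hypothetical helper: waypoints is a list of 5 (x, y) pairs in
    step_scale units, arrive_logit is a single pre-sigmoid float.
    """
    # Multiply by step_scale to recover meters, as described in the table.
    waypoints_m = [(x * step_scale, y * step_scale) for x, y in waypoints]

    # Numerically stable sigmoid: avoids overflow for large |logit|.
    if arrive_logit >= 0:
        arrive_prob = 1.0 / (1.0 + math.exp(-arrive_logit))
    else:
        e = math.exp(arrive_logit)
        arrive_prob = e / (1.0 + e)
    return waypoints_m, arrive_prob
```

With tensors, the same two steps amount to an elementwise multiply and a sigmoid over the batch.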

	The model predicts 2D XY waypoints only. It does not output a heading or
	yaw. Downstream controllers that need `(vx, vy, yaw_rate)` derive yaw from
	the predicted waypoint direction (e.g. `atan2(wp_y, wp_x)`).
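
A minimal sketch of that derivation follows, assuming the waypoint has already been converted to meters and is expressed in the body frame. The function name `waypoint_to_command` and the parameters `dt` (time allotted to reach the waypoint) and `yaw_gain` (a proportional gain on the heading error) are hypothetical; a real controller would likely add velocity limits and smoothing.

```python
import math


def waypoint_to_command(wp_x, wp_y, dt, yaw_gain=1.0):
    """Map one body-frame waypoint (meters) to (vx, vy, yaw_rate).

    Hypothetical proportional controller, for illustration only.
    """
    if dt <= 0:
        raise ValueError("dt must be positive")

    # Heading error: direction of the waypoint relative to the body x-axis.
    heading_error = math.atan2(wp_y, wp_x)

    # Linear velocity that would reach the waypoint in dt seconds.
    vx = wp_x / dt
    vy = wp_y / dt

    # Turn toward the waypoint at a rate proportional to the heading error.
    yaw_rate = yaw_gain * heading_error
    return vx, vy, yaw_rate
```

For example, a waypoint at `(1.0, 1.0)` with `dt=2.0` yields a heading error of 45 degrees and linear velocities of 0.5 m/s on each axis.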

	## Policy wrapper

	For robot-control use (per-episode position history, `step_scale` estimation
	from recent motion, lookahead along a reference path, and conversion of the
	waypoint to a body-frame velocity command), see `CityWalkerPolicy` in
	[ai4ce/wanderland-benchmark](https://github.com/ai4ce/wanderland-benchmark).
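
To give a sense of what estimating the step scale from recent motion could look like, the sketch below takes the mean distance between consecutive positions in a short history. This is an assumption about one plausible approach, not a description of what `CityWalkerPolicy` actually does; the function name `estimate_step_scale` and the `min_scale` floor are hypothetical.

```python
import math


def estimate_step_scale(positions, min_scale=1e-3):
    """Mean distance between consecutive (x, y) positions, in meters.

    Hypothetical helper. min_scale guards against division by zero
    when the robot is stationary or the history is too short.
    """
    if len(positions) < 2:
        return min_scale

    steps = [
        math.hypot(x1 - x0, y1 - y0)
        for (x0, y0), (x1, y1) in zip(positions, positions[1:])
    ]
    return max(sum(steps) / len(steps), min_scale)
```

The floor matters in practice: a robot standing still would otherwise produce a scale near zero and blow up the normalized coordinates.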

	## Citation

	```
	@inproceedings{liu2025citywalker,
	title={Citywalker: Learning embodied urban navigation from web-scale videos},
	author={Liu, Xinhao and Li, Jintong and Jiang, Yicheng and Sujay, Niranjan and Yang, Zhicheng and Zhang, Juexiao and Abanes, John and Zhang, Jing and Feng, Chen},
	booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
	pages={6875--6885},
	year={2025}
	}
	```

	## License

	Apache-2.0, matching the upstream
	[ai4ce/CityWalker](https://github.com/ai4ce/CityWalker) repository.