@@ -13,29 +13,34 @@ pipeline_tag: robotics
 # CityWalker (2000hr)
 HuggingFace port of the [CityWalker](https://github.com/ai4ce/CityWalker)
-waypoint-prediction model, trained on 2000 hours of urban pedestrian
-footage. This repo contains the converted weights of
-`CityWalker_2000hr.ckpt` (originally a PyTorch Lightning checkpoint)
 re-packaged as a `transformers.PreTrainedModel` so it can be loaded with
 `AutoModel.from_pretrained`.
-Model implementation lives in
-[`ai4ce/wanderland-benchmark`](https://github.com/ai4ce/wanderland-benchmark)
-under `src/wanderland_lab/models/citywalker/`.
 ## Architecture
 ```
-images (B, T, 3, H, W)   ─► DINOv2 (vit-b/14) ─► obs tokens (B, T, 768)
-coords (B, T+1, 2)       ─► PolarEmbedding + Linear ─► goal token (B, 1, 768)
-                           ─► concat ─► (B, T+1, 768)
-                           ─► TransformerEncoder (8 heads, 16 layers)
-                           ─► MLP head ─► (waypoints, arrive_logits)
 ```
-- **T** = `context_size` = 5 recent RGB frames.
-- **waypoints**: `(B, 5, 2)` cumulative XY deltas in body frame.
-- **arrive_logits**: `(B, 1)` pre-sigmoid arrival score.
 ## Usage
@@ -53,18 +58,23 @@ Meta's pretrained checkpoint; `load_obs_encoder()` pulls it via `torch.hub`.
 ## Inputs / Outputs
-| Name              | Shape                     | Notes                             |
-|-------------------|---------------------------|-----------------------------------|
-| `images`          | `(B, 5, 3, H, W)` float32 | `[0, 1]` RGB; model handles resize + ImageNet normalize |
-| `coords`          | `(B, 6, 2)` float32       | Recent body-frame XY positions    |
-| `waypoints` out   | `(B, 5, 2)` float32       | Cumulative XY deltas, body frame  |
-| `arrive_logits`   | `(B, 1)` float32          | Pre-sigmoid                       |
 ## Policy wrapper
-For robot-control use (body-frame `(vx, vy, yaw_rate)` with per-episode
-history + lookahead along a reference path), see `CityWalkerPolicy` in the
-[`wanderland-lab`](https://github.com/ai4ce/wanderland-benchmark) repo.
 ## Citation

 # CityWalker (2000hr)
 HuggingFace port of the [CityWalker](https://github.com/ai4ce/CityWalker)
+waypoint-prediction model (CVPR 2025), trained on ~2000 hours of web-scale
+urban walking and driving videos. This repo contains the converted weights
+of `CityWalker_2000hr.ckpt` (originally a PyTorch Lightning checkpoint)
 re-packaged as a `transformers.PreTrainedModel` so it can be loaded with
 `AutoModel.from_pretrained`.
+Upstream training dataset: [ai4ce/CityWalker](https://huggingface.co/datasets/ai4ce/CityWalker).
+Our port (model wrapper + ckpt converter + benchmark integration) lives in
+[ai4ce/wanderland-benchmark](https://github.com/ai4ce/wanderland-benchmark).
 ## Architecture
 ```
+images (B, 5, 3, H, W)  ─► center_crop(400) + resize(392) + ImageNet norm
+                         ─► DINOv2 (vit-b/14)        ─► obs tokens (B, 5, 768)
+coords (B, 6, 2)         ─► PolarEmbedding + Linear  ─► goal token  (B, 1, 768)
+                                                      ─► concat ─► (B, 6, 768)
+                                                      ─► TransformerEncoder (8 heads, 16 layers)
+                                                      ─► MLP head
+                                                      ─► waypoints (B, 5, 2)
+                                                      ─► arrive_logits (B, 1)
 ```
+- `context_size` = 5 past RGB frames.
+- `len_traj_pred` = 5 future XY waypoints.
+- The 6 coord rows are the **5 past poses + 1 target pose**, all expressed in
+  the current-pose-relative frame and divided by the per-video `step_scale`
+  (so the model consumes dimensionless units, not meters).
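The coordinate layout described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the port's actual preprocessing: `build_coords` and `estimate_step_scale` are hypothetical helpers, the current pose is assumed to be the last of the 5 past poses, the relative frame is approximated by translation only (any rotation into the current heading is omitted), and `step_scale` is assumed to be the mean distance between consecutive past positions.

```python
import math


def estimate_step_scale(past_xy):
    """Hypothetical estimate: mean distance between consecutive past positions."""
    steps = [
        math.hypot(x1 - x0, y1 - y0)
        for (x0, y0), (x1, y1) in zip(past_xy[:-1], past_xy[1:])
    ]
    return sum(steps) / len(steps)


def build_coords(past_xy, target_xy, step_scale):
    """Return 6 rows: 5 past poses + 1 target pose, relative and scaled.

    Assumptions (not confirmed by the model card): the last past pose is
    the current pose, and only translation is removed (no rotation).
    """
    if len(past_xy) != 5:
        raise ValueError("expected exactly 5 past poses")
    if step_scale <= 0:
        raise ValueError("step_scale must be positive")
    cx, cy = past_xy[-1]
    rows = list(past_xy) + [target_xy]
    # Subtract the current position, then divide by step_scale so the
    # resulting values are dimensionless rather than in meters.
    return [((x - cx) / step_scale, (y - cy) / step_scale) for x, y in rows]


past = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
scale = estimate_step_scale(past)
coords = build_coords(past, (8.0, 2.0), scale)
```

The resulting 6 rows would still need to be converted to a float32 tensor with a leading batch dimension to match the `(B, 6, 2)` input shape.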
 ## Usage
 ## Inputs / Outputs
+| Name              | Shape                       | Notes |
+|-------------------|-----------------------------|-------|
+| `images`          | `(B, 5, 3, H, W)` float32   | RGB in `[0, 1]`; the model applies `center_crop(400) → resize(392) → ImageNet normalize` internally |
+| `coords`          | `(B, 6, 2)` float32         | 5 past poses + 1 target pose in the current-pose-relative frame, scaled by `1 / step_scale` |
+| `waypoints` out   | `(B, 5, 2)` float32         | Predicted XY waypoints in the current-pose-relative frame, in `step_scale` units; multiply by `step_scale` to recover meters |
+| `arrive_logits`   | `(B, 1)` float32            | Pre-sigmoid logit for the "arrived at target" binary classifier |
+**The model predicts 2D XY waypoints only.** It does not output a heading or
+yaw. Downstream controllers that need `(vx, vy, yaw_rate)` derive yaw from
+the predicted waypoint direction (e.g. `atan2(wp_y, wp_x)`).
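A minimal post-processing sketch for the outputs described above: rescale the waypoints to meters, derive a heading with `atan2`, and turn the arrival logit into a probability. The helper name `postprocess` is hypothetical, and using the first waypoint for the heading is an assumption; a controller could just as well pick a later waypoint.

```python
import math


def postprocess(waypoints, arrive_logit, step_scale):
    """Convert raw model outputs into metric waypoints, a heading, and P(arrived).

    waypoints: list of (x, y) pairs in step_scale units (one batch element).
    arrive_logit: the pre-sigmoid arrival score as a plain float.
    """
    # Multiply by step_scale to recover meters.
    meters = [(x * step_scale, y * step_scale) for x, y in waypoints]
    # Assumption: heading is taken from the first predicted waypoint.
    yaw = math.atan2(meters[0][1], meters[0][0])
    # Sigmoid maps the logit to a probability in (0, 1).
    p_arrived = 1.0 / (1.0 + math.exp(-arrive_logit))
    return meters, yaw, p_arrived


raw = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0), (5.0, 5.0)]
meters, yaw, p_arrived = postprocess(raw, 0.0, step_scale=0.5)
```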
 ## Policy wrapper
+For robot-control use (per-episode position history, `step_scale` estimation
+from recent motion, lookahead along a reference path, and conversion of the
+waypoint to a body-frame velocity command), see `CityWalkerPolicy` in
+[ai4ce/wanderland-benchmark](https://github.com/ai4ce/wanderland-benchmark).
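As a rough illustration of the last step, the sketch below turns one metric waypoint into a `(vx, vy, yaw_rate)` command. It is not the `CityWalkerPolicy` implementation: `waypoint_to_command`, `dt`, and `max_speed` are hypothetical, and the current-pose-relative frame is assumed to coincide with the robot body frame (x forward).

```python
import math


def waypoint_to_command(wp_x_m, wp_y_m, dt, max_speed=1.5):
    """Map a metric waypoint to (vx, vy, yaw_rate), assuming it is reached in dt seconds."""
    if dt <= 0:
        raise ValueError("dt must be positive")
    vx = wp_x_m / dt
    vy = wp_y_m / dt
    # Clamp the planar speed while keeping the direction of travel.
    speed = math.hypot(vx, vy)
    if speed > max_speed:
        k = max_speed / speed
        vx *= k
        vy *= k
    # Turn toward the waypoint direction over the same horizon.
    yaw_rate = math.atan2(wp_y_m, wp_x_m) / dt
    return vx, vy, yaw_rate


straight = waypoint_to_command(1.0, 0.0, dt=1.0)
clamped = waypoint_to_command(3.0, 4.0, dt=1.0, max_speed=2.5)
```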
 ## Citation