Commit bc06f9e (verified), committed by Gaaaavin
1 Parent(s): ab88df5

Model card: fix factual errors (no yaw output, coords are past+target, normalized by step_scale)

  1. README.md +33 -23
README.md CHANGED
@@ -13,29 +13,34 @@ pipeline_tag: robotics
  # CityWalker (2000hr)
 
  HuggingFace port of the [CityWalker](https://github.com/ai4ce/CityWalker)
- waypoint-prediction model, trained on 2000 hours of urban pedestrian
- footage. This repo contains the converted weights of
- `CityWalker_2000hr.ckpt` (originally a PyTorch Lightning checkpoint)
  re-packaged as a `transformers.PreTrainedModel` so it can be loaded with
  `AutoModel.from_pretrained`.
 
- Model implementation lives in
- [`ai4ce/wanderland-benchmark`](https://github.com/ai4ce/wanderland-benchmark)
- under `src/wanderland_lab/models/citywalker/`.
 
  ## Architecture
 
  ```
- images (B, T, 3, H, W) ─► DINOv2 (vit-b/14) ─► obs tokens (B, T, 768)
- coords (B, T+1, 2) ─► PolarEmbedding + Linear ─► goal token (B, 1, 768)
- ─► concat ─► (B, T+1, 768)
- ─► TransformerEncoder (8 heads, 16 layers)
- ─► MLP head ─► (waypoints, arrive_logits)
  ```
 
- - **T** = `context_size` = 5 recent RGB frames.
- - **waypoints**: `(B, 5, 2)` cumulative XY deltas in body frame.
- - **arrive_logits**: `(B, 1)` pre-sigmoid arrival score.
 
  ## Usage
 
@@ -53,18 +58,23 @@ Meta's pretrained checkpoint; `load_obs_encoder()` pulls it via `torch.hub`.
 
  ## Inputs / Outputs
 
- | Name | Shape | Notes |
- |-------------------|---------------------------|-----------------------------------|
- | `images` | `(B, 5, 3, H, W)` float32 | `[0, 1]` RGB; model handles resize + ImageNet normalize |
- | `coords` | `(B, 6, 2)` float32 | Recent body-frame XY positions |
- | `waypoints` out | `(B, 5, 2)` float32 | Cumulative XY deltas, body frame |
- | `arrive_logits` | `(B, 1)` float32 | Pre-sigmoid |
 
  ## Policy wrapper
 
- For robot-control use (body-frame `(vx, vy, yaw_rate)` with per-episode
- history + lookahead along a reference path), see `CityWalkerPolicy` in the
- [`wanderland-lab`](https://github.com/ai4ce/wanderland-benchmark) repo.
 
  ## Citation
 
 
  # CityWalker (2000hr)
 
  HuggingFace port of the [CityWalker](https://github.com/ai4ce/CityWalker)
+ waypoint-prediction model (CVPR 2025), trained on ~2000 hours of web-scale
+ urban walking and driving videos. This repo contains the converted weights
+ of `CityWalker_2000hr.ckpt` (originally a PyTorch Lightning checkpoint)
  re-packaged as a `transformers.PreTrainedModel` so it can be loaded with
  `AutoModel.from_pretrained`.
 
+ Upstream training dataset: [ai4ce/CityWalker](https://huggingface.co/datasets/ai4ce/CityWalker).
+ Our port (model wrapper + ckpt converter + benchmark integration) lives in
+ [ai4ce/wanderland-benchmark](https://github.com/ai4ce/wanderland-benchmark).
 
  ## Architecture
 
  ```
+ images (B, 5, 3, H, W) ─► center_crop(400) + resize(392) + ImageNet norm
+ ─► DINOv2 (vit-b/14) ─► obs tokens (B, 5, 768)
+ coords (B, 6, 2) ─► PolarEmbedding + Linear ─► goal token (B, 1, 768)
+ ─► concat ─► (B, 6, 768)
+ ─► TransformerEncoder (8 heads, 16 layers)
+ ─► MLP head
+ ─► waypoints (B, 5, 2)
+ ─► arrive_logits (B, 1)
  ```
 
+ - `context_size` = 5 past RGB frames.
+ - `len_traj_pred` = 5 future XY waypoints.
+ - The 6 coord rows are the **5 past poses + 1 target pose**, all expressed in
+ the current-pose-relative frame and divided by the per-video step_scale
+ (so the model consumes dimensionless units, not meters).
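
The `coords` convention in the added bullets (5 past poses + 1 target pose, current-pose-relative, divided by `step_scale`) could be sketched roughly as follows. This is an illustration under stated assumptions, not the port's actual code: the helper names `estimate_step_scale` and `build_coords` are hypothetical, the relative frame is assumed to mean translation plus rotation by the current heading, and the mean-step-length estimator is only one plausible way to obtain a scale.

```python
import math

def estimate_step_scale(positions):
    """Average distance between consecutive XY positions.

    Hypothetical estimator, for illustration only.
    """
    steps = [math.dist(a, b) for a, b in zip(positions, positions[1:])]
    # Guard against a stationary track, which would give a zero scale.
    return max(sum(steps) / len(steps), 1e-6)

def build_coords(past_xy, current_pose, target_xy, step_scale):
    """Return 6 rows: 5 past poses followed by 1 target pose.

    past_xy:      five world-frame (x, y) positions, oldest first
    current_pose: (x, y, yaw) of the current pose in the world frame
    target_xy:    world-frame (x, y) of the target
    Assumption: "current-pose-relative" means translate by the current
    position, then rotate by -yaw.
    """
    cx, cy, yaw = current_pose
    c, s = math.cos(yaw), math.sin(yaw)
    rows = []
    for x, y in list(past_xy) + [target_xy]:
        dx, dy = x - cx, y - cy
        # Rotate the world-frame offset into the current pose's frame.
        rx = c * dx + s * dy
        ry = -s * dx + c * dy
        # Divide by step_scale so the values are dimensionless.
        rows.append((rx / step_scale, ry / step_scale))
    return rows

# A walker moving along +x at 0.5 m per frame, target 2 m ahead.
past = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (1.5, 0.0), (2.0, 0.0)]
scale = estimate_step_scale(past)
coords = build_coords(past, (2.0, 0.0, 0.0), (4.0, 0.0), scale)
```

In this example the target sits 2 m ahead, which is 4 steps at a 0.5 m step length, so the last row carries a forward distance of 4 in `step_scale` units rather than 2 in meters.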
44
 
45
  ## Usage
46
 
 
58
 
59
  ## Inputs / Outputs
60
 
61
+ | Name | Shape | Notes |
62
+ |-------------------|-----------------------------|-------|
63
+ | `images` | `(B, 5, 3, H, W)` float32 | RGB in `[0, 1]`; the model applies `center_crop(400) → resize(392) → ImageNet normalize` internally |
64
+ | `coords` | `(B, 6, 2)` float32 | 5 past poses + 1 target pose in the current-pose-relative frame, scaled by `1 / step_scale` |
65
+ | `waypoints` out | `(B, 5, 2)` float32 | Predicted XY waypoints in the current-pose-relative frame, in step_scale units — multiply by `step_scale` to recover meters |
+ | `arrive_logits` | `(B, 1)` float32 | Pre-sigmoid logit for the "arrived at target" binary classifier |
+
+ **The model predicts 2D XY waypoints only.** It does not output a heading or
+ yaw. Downstream controllers that need `(vx, vy, yaw_rate)` derive yaw from
+ the predicted waypoint direction (e.g. `atan2(wp_y, wp_x)`).
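
A minimal sketch of how the outputs described in the new table and note might be post-processed: waypoints are scaled back to meters with `step_scale`, a heading is derived with `atan2`, and the arrival logit is passed through a sigmoid. The function name `decode_outputs` is hypothetical, and taking the heading from the first waypoint is an assumption; the README only says yaw is derived from the waypoint direction.

```python
import math

def decode_outputs(waypoints, arrive_logit, step_scale):
    """Convert raw model outputs for one sample into physical quantities.

    waypoints:    list of (x, y) in step_scale units, current-pose-relative
    arrive_logit: pre-sigmoid arrival score (a float)
    step_scale:   meters per unit, as used to normalize the inputs
    """
    # Multiply by step_scale to recover meters.
    meters = [(x * step_scale, y * step_scale) for x, y in waypoints]
    # Assumption: use the first waypoint's direction as the heading.
    wp_x, wp_y = meters[0]
    heading = math.atan2(wp_y, wp_x)
    # Sigmoid turns the logit into an arrival probability.
    p_arrive = 1.0 / (1.0 + math.exp(-arrive_logit))
    return meters, heading, p_arrive

# Five diagonal waypoints, a neutral arrival logit, 0.5 m per unit.
raw = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0), (5.0, 5.0)]
meters, heading, p_arrive = decode_outputs(raw, 0.0, 0.5)
```

Here the first waypoint lands at (0.5 m, 0.5 m), the heading is 45 degrees to the left of straight ahead, and a logit of 0 corresponds to an arrival probability of 0.5.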
 
  ## Policy wrapper
 
+ For robot-control use — per-episode position history, step_scale estimation
+ from recent motion, lookahead along a reference path, and conversion of the
+ waypoint to a body-frame velocity command — see `CityWalkerPolicy` in
+ [ai4ce/wanderland-benchmark](https://github.com/ai4ce/wanderland-benchmark).
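
For the last step named above, converting a waypoint to a body-frame velocity command, a rough sketch might look like the following. This is not `CityWalkerPolicy`'s implementation: the function `waypoint_to_command`, the time horizon `dt`, and the limits `max_speed` and `max_yaw_rate` are all hypothetical, chosen only to make the idea concrete.

```python
import math

def waypoint_to_command(wp_xy_m, dt, max_speed=1.0, max_yaw_rate=1.0):
    """Turn one waypoint (meters, body frame) into (vx, vy, yaw_rate).

    Assumptions: the waypoint should be reached in dt seconds, and the
    robot should turn toward the waypoint direction over the same dt.
    """
    wx, wy = wp_xy_m
    vx, vy = wx / dt, wy / dt
    # Scale the velocity down if it exceeds the speed limit.
    speed = math.hypot(vx, vy)
    if speed > max_speed:
        k = max_speed / speed
        vx, vy = vx * k, vy * k
    # Yaw is not predicted by the model; derive it from the waypoint direction.
    yaw_rate = math.atan2(wy, wx) / dt
    yaw_rate = max(-max_yaw_rate, min(max_yaw_rate, yaw_rate))
    return vx, vy, yaw_rate

# Waypoint straight ahead: pure forward motion, no turning.
cmd_forward = waypoint_to_command((0.5, 0.0), dt=0.5)
# Waypoint directly to the left: yaw rate is clamped to the limit.
cmd_left = waypoint_to_command((0.0, 1.0), dt=1.0)
```

The clamping is a design choice for the sketch: a waypoint far off to the side would otherwise request an arbitrarily large turn rate.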
 
  ## Citation
 