Initial upload: CityWalker 2000hr converted from Lightning .ckpt

Files changed (3)

README.md +83 -0
config.json +27 -0
model.safetensors +3 -0

README.md ADDED

	@@ -0,0 +1,83 @@

+---
+license: apache-2.0
+library_name: transformers
+tags:
+- robotics
+- navigation
+- waypoint-prediction
+- citywalker
+- dinov2
+pipeline_tag: robotics
+---
+# CityWalker (2000hr)
+Hugging Face port of the [CityWalker](https://github.com/ai4ce/CityWalker)
+waypoint-prediction model, trained on 2000 hours of urban pedestrian
+footage. This repo contains the converted weights of
+`CityWalker_2000hr.ckpt` (originally a PyTorch Lightning checkpoint)
+re-packaged as a `transformers.PreTrainedModel` so it can be loaded with
+`AutoModel.from_pretrained`.
+The model implementation lives in
+[`ai4ce/wanderland-benchmark`](https://github.com/ai4ce/wanderland-benchmark)
+under `src/wanderland_lab/models/citywalker/`.
+## Architecture
+```
+images (B, T, 3, H, W)   ─► DINOv2 (vit-b/14) ─► obs tokens (B, T, 768)
+coords (B, T+1, 2)       ─► PolarEmbedding + Linear ─► goal token (B, 1, 768)
+                           ─► concat ─► (B, T+1, 768)
+                           ─► TransformerEncoder (8 heads, 16 layers)
+                           ─► MLP head ─► (waypoints, arrive_logits)
+```
+- **T** = `context_size` = 5 recent RGB frames.
+- **waypoints**: `(B, 5, 2)` cumulative XY deltas in body frame.
+- **arrive_logits**: `(B, 1)` pre-sigmoid arrival score.
+## Usage
+```python
+from transformers import AutoModel
+from wanderland_lab.models.citywalker import CityWalkerModel  # imported for its side effect: registers the model with AutoModel
+model = AutoModel.from_pretrained("ai4ce/citywalker")
+model.load_obs_encoder()   # fetches DINOv2 via torch.hub on first call
+model.eval()
+```
+The DINOv2 backbone is not bundled with the weights to avoid redistributing
+Meta's pretrained checkpoint; `load_obs_encoder()` pulls it via `torch.hub`.
+## Inputs / Outputs
+| Name                | Shape and dtype           | Notes                                                    |
+|---------------------|---------------------------|----------------------------------------------------------|
+| `images`            | `(B, 5, 3, H, W)` float32 | `[0, 1]` RGB; model handles resize + ImageNet normalize  |
+| `coords`            | `(B, 6, 2)` float32       | Recent body-frame XY positions                           |
+| `waypoints` out     | `(B, 5, 2)` float32       | Cumulative XY deltas, body frame                         |
+| `arrive_logits` out | `(B, 1)` float32          | Pre-sigmoid                                              |
+## Policy wrapper
+For robot-control use (body-frame `(vx, vy, yaw_rate)` with per-episode
+history and lookahead along a reference path), see `CityWalkerPolicy` in the
+[`ai4ce/wanderland-benchmark`](https://github.com/ai4ce/wanderland-benchmark) repo.
+## Citation
+```
+@inproceedings{liu2024citywalker,
+  title     = {CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos},
+  author    = {Liu, Xinhao and Li, Jintong and Jiang, Yicheng and Sujay, Niranjan and Yang, Zhicheng and Zhang, Juexiao and Abanes, John and Zhang, Jing and Feng, Chen},
+  booktitle = {CVPR},
+  year      = {2025}
+}
+```
+## License
+Apache-2.0, matching the upstream
+[ai4ce/CityWalker](https://github.com/ai4ce/CityWalker) repository.
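
Note on the README's output description: it says `waypoints` are cumulative XY deltas and `arrive_logits` is pre-sigmoid. A minimal post-processing sketch follows. It assumes "cumulative" means each waypoint is an offset from the current pose, and it uses made-up numbers, not real predictions. The helpers `sigmoid` and `cumulative_to_steps` are hypothetical and not part of `wanderland_lab`.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def cumulative_to_steps(waypoints):
    """Turn cumulative (x, y) offsets into per-step displacements."""
    steps = []
    prev_x, prev_y = 0.0, 0.0
    for x, y in waypoints:
        steps.append((x - prev_x, y - prev_y))
        prev_x, prev_y = x, y
    return steps


# Made-up values for one sample (B = 1); real outputs would be tensors.
waypoints = [(0.5, 0.0), (1.0, 0.1), (1.5, 0.3), (2.0, 0.6), (2.5, 1.0)]
arrive_logit = 0.0

steps = cumulative_to_steps(waypoints)  # five per-step (dx, dy) pairs
arrive_prob = sigmoid(arrive_logit)     # arrival probability in [0, 1]
```

Summing `steps` in order recovers the original cumulative waypoints, which is a quick sanity check on the assumed convention.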

config.json ADDED

	@@ -0,0 +1,27 @@

+{
+  "architectures": [
+    "CityWalkerModel"
+  ],
+  "context_size": 5,
+  "cord_include_input": true,
+  "cord_num_freqs": 6,
+  "crop": [
+    400,
+    400
+  ],
+  "decoder_ff_dim_factor": 4,
+  "decoder_num_heads": 8,
+  "decoder_num_layers": 16,
+  "do_resize": true,
+  "do_rgb_normalize": true,
+  "dtype": "float32",
+  "freeze_obs_encoder": true,
+  "len_traj_pred": 5,
+  "model_type": "citywalker",
+  "obs_encoder_type": "dinov2_vitb14",
+  "resize": [
+    392,
+    392
+  ],
+  "transformers_version": "5.8.0"
+}
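
Note on the config values: the sketch below derives a few quantities from a subset of the fields above. It assumes that the trailing `14` in `dinov2_vitb14` is the ViT patch size and that the sequence holds one token per context frame plus one goal token, as the README diagram suggests. The config does not state the order in which `crop` and `resize` are applied, so the sketch only checks divisibility.

```python
import json
import re

# Subset of the config.json values shown above.
CONFIG_TEXT = """
{
  "context_size": 5,
  "crop": [400, 400],
  "len_traj_pred": 5,
  "obs_encoder_type": "dinov2_vitb14",
  "resize": [392, 392]
}
"""

cfg = json.loads(CONFIG_TEXT)

# Assumption: the trailing digits of the encoder name are the patch size.
patch_size = int(re.search(r"(\d+)$", cfg["obs_encoder_type"]).group(1))

resize_h, resize_w = cfg["resize"]
patches_per_image = (resize_h // patch_size) * (resize_w // patch_size)

# One token per context frame plus one goal token (per the README diagram).
num_tokens = cfg["context_size"] + 1

# A ViT needs image sides that are whole multiples of the patch size.
resize_fits = resize_h % patch_size == 0 and resize_w % patch_size == 0
crop_fits = all(side % patch_size == 0 for side in cfg["crop"])
```

Under these assumptions the `resize` shape is a whole number of patches per side while the `crop` shape is not, which may be why both fields exist.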

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:cb3c609a411eb901cdf4500a542c324e33bcf7a2b6ce328de6590cc55b8b8ca9
+size 833735756
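
Note on the pointer file: the three lines above are a Git LFS pointer, not the weights themselves. The sketch below parses it and converts the size to MiB. `parse_lfs_pointer` is a hypothetical helper. The parameter estimate assumes every stored value is float32 (as the config's `dtype` suggests) and ignores the safetensors header, so it is only a rough upper bound.

```python
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:cb3c609a411eb901cdf4500a542c324e33bcf7a2b6ce328de6590cc55b8b8ca9
size 833735756
"""


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


info = parse_lfs_pointer(POINTER_TEXT)
size_mib = info["size"] / 2**20

# Rough upper bound: 4 bytes per float32 value, header ignored.
max_float32_params = info["size"] // 4
```

The digest can be compared against `hashlib.sha256` of the downloaded file to confirm that the download matches the pointer.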