justinstrong committed · Commit cd604b4 · verified · 1 Parent(s): db13178

Upload folder using huggingface_hub

Files changed (7)
  1. .gitignore +2 -0
  2. README.md +204 -0
  3. build_index.py +287 -0
  4. compute_stats.py +98 -0
  5. filtered_index.json +0 -0
  6. norm_stats.json +38 -0
  7. so100_dataset.py +312 -0
.gitignore ADDED
outputs/
__pycache__/
README.md ADDED
# Pi0.5 SO-100 Diverse Finetune

Finetune Pi0.5's action expert on diverse SO-100/101 community data to enable
multi-task manipulation controlled by natural language.

**Goal**: A Pi0.5 model that can perform many different tasks on an SO-100/101 arm
("pick up the red cube", "fold the cloth", "stack the blocks") — not a single-task
policy, but a generalist that understands the SO-100 embodiment.

**Approach**: Freeze the VLM backbone (3B params), finetune only the action expert
(693M params including projections) using `train_expert_only=true`. The frozen VLM
retains its general vision-language understanding while the expert learns SO-100
motor control from diverse demonstrations.

## Dataset

**Source**: [HuggingFaceVLA/community_dataset_v3](https://huggingface.co/datasets/HuggingFaceVLA/community_dataset_v3)
— a community-contributed collection of SO-100/101 teleoperation demonstrations.

**Filtering** (`build_index.py`):
- Robot type: so100, so101, so100_follower, so101_follower
- Schema: exactly 2 cameras (`observation.images.image` + `image2`), 6-DOF state/action
- Resolution: 480x640, FPS: 30
- Episode length: 150-1800 frames (5-60 seconds at 30 fps)
- Integrity: every episode verified against the actual parquet row count and both video files
- Per-task cap: max 200 episodes per unique task string (prevents dominant tasks)
- Per-contributor cap: max 200 episodes per contributor (prevents style bias)

**Result**: 376 datasets, 10,155 episodes, 215 unique tasks, 5.4M frames (~50 hours)

**Task categories** (see `so100_101_task_analysis.txt` for the full breakdown):
- Pick-and-place (cubes, legos, balls, pens, cups, toys)
- Block stacking and building (towers, Hanoi, Jenga)
- Clothing folding (t-shirts, towels, blankets)
- Drawing and writing (smiley faces, letters, iPad)
- Food manipulation (fork strawberries, spoon food)
- Sorting and organizing (pins, blocks by color)
- Peg/hole insertion (shape sorting)
- Cleaning (table, area)
- Kitchen tasks (open/close cabinet, lids, containers)

## Architecture

**No dataset merging or conversion required.** The custom `SO100Dataset` class reads
directly from the community_dataset_v3 v2.1 files on disk. A thin `.meta` adapter
makes it compatible with LeRobot's `lerobot_train.py` training script.

```
community_dataset_v3/ (cloned from HuggingFace, ~261GB for the filtered subset)
  contributor/dataset/
    data/chunk-000/episode_NNNNNN.parquet (state + action per frame)
    videos/chunk-000/
      observation.images.image/episode_NNNNNN.mp4 (camera 1)
      observation.images.image2/episode_NNNNNN.mp4 (camera 2)
    meta/info.json, tasks.jsonl, episodes.jsonl

filtered_index.json (maps episodes to files, verified frame counts)
norm_stats.json (mean/std for state and action normalization)
so100_dataset.py (PyTorch Dataset that reads the above)
```

LeRobot's `factory.py` is patched to recognize the `so100:` prefix in `--dataset.repo_id`:
```
--dataset.repo_id="so100:/path/to/community_dataset_v3:/path/to/filtered_index.json:/path/to/norm_stats.json"
```
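The colon-separated `repo_id` string unpacks with a fixed four-way split. A minimal sketch of what such a prefix handler could look like — the actual `factory.py` patch is not shown in this commit, `parse_so100_repo_id` is a hypothetical name, and the scheme assumes the three paths contain no colons themselves:

```python
def parse_so100_repo_id(repo_id: str) -> tuple[str, str, str]:
    """Split 'so100:DATA_ROOT:INDEX_PATH:STATS_PATH' into its three paths."""
    prefix, data_root, index_path, stats_path = repo_id.split(":", 3)
    if prefix != "so100":
        raise ValueError(f"not an so100 repo_id: {repo_id!r}")
    return data_root, index_path, stats_path


parts = parse_so100_repo_id(
    "so100:/data/community_dataset_v3:filtered_index.json:norm_stats.json"
)
print(parts)  # ('/data/community_dataset_v3', 'filtered_index.json', 'norm_stats.json')
```

The factory would then construct `SO100Dataset(data_root, index_path, stats_path)` instead of going through the HuggingFace dataset loader.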
## Files

| File | Purpose |
|------|---------|
| `build_index.py` | Scan community_dataset_v3, apply filters, verify parquets, output index |
| `compute_stats.py` | Compute mean/std normalization stats from filtered parquets |
| `so100_dataset.py` | PyTorch Dataset class with `.meta` adapter for lerobot compatibility |
| `filtered_index.json` | The verified training index (10,155 episodes, 215 tasks) |
| `norm_stats.json` | Precomputed mean/std for state and action |

## Training

### Local sanity check (1x RTX 3090)

```bash
cd /home/anon/pi05-so100-diverse
PYTHONPATH=/home/anon/pi05-so100-diverse:$PYTHONPATH python -m lerobot.scripts.lerobot_train \
  --dataset.repo_id="so100:/home/anon/lap/community_dataset_v3:filtered_index.json:norm_stats.json" \
  --policy.path=lerobot/pi05_base \
  --policy.train_expert_only=true \
  --policy.dtype=bfloat16 \
  --policy.gradient_checkpointing=true \
  --policy.push_to_hub=false \
  --policy.normalization_mapping='{"VISUAL": "IDENTITY", "STATE": "MEAN_STD", "ACTION": "MEAN_STD"}' \
  --policy.scheduler_warmup_steps=1000 \
  --policy.scheduler_decay_steps=15000 \
  --rename_map='{"observation.images.image": "observation.images.base_0_rgb", "observation.images.image2": "observation.images.left_wrist_0_rgb"}' \
  --batch_size=4 \
  --steps=15000 \
  --early_stop_steps=1000 \
  --save_freq=500 \
  --log_freq=50 \
  --num_workers=2 \
  --output_dir=outputs/scale_up_1k \
  --job_name=scale_up_1k \
  --save_checkpoint=true
```

### Cloud training (8x H100)

```bash
# 1. Selective download of the filtered datasets (~261GB)
python download_filtered.py --data-root /data/community_dataset_v3

# 2. Launch training
accelerate launch --multi_gpu --num_processes 8 \
  -m lerobot.scripts.lerobot_train \
  --dataset.repo_id="so100:/data/community_dataset_v3:filtered_index.json:norm_stats.json" \
  --policy.path=lerobot/pi05_base \
  --policy.train_expert_only=true \
  --policy.dtype=bfloat16 \
  --policy.gradient_checkpointing=true \
  --policy.compile_model=true \
  --policy.push_to_hub=true \
  --policy.repo_id=StrongRoboticsLab/pi05-so100-diverse \
  --policy.normalization_mapping='{"VISUAL": "IDENTITY", "STATE": "MEAN_STD", "ACTION": "MEAN_STD"}' \
  --policy.scheduler_warmup_steps=1000 \
  --policy.scheduler_decay_steps=15000 \
  --rename_map='{"observation.images.image": "observation.images.base_0_rgb", "observation.images.image2": "observation.images.left_wrist_0_rgb"}' \
  --batch_size=32 \
  --steps=15000 \
  --save_freq=500 \
  --log_freq=50 \
  --num_workers=4 \
  --wandb.enable=true \
  --wandb.project=pi05-so100-diverse \
  --output_dir=outputs/cloud_run \
  --job_name=pi05_so100_diverse
```

## Training Configuration

| Parameter | Value | Rationale |
|-----------|-------|-----------|
| Base model | `lerobot/pi05_base` | Pi0.5 pretrained on cross-embodiment data |
| train_expert_only | true | Freeze VLM, train action expert + projections (693M params) |
| dtype | bfloat16 | Standard for H100/3090 training |
| gradient_checkpointing | true | Saves VRAM by recomputing activations |
| LR | 2.5e-5 (peak) | Pi0.5 default, conservative for finetuning |
| LR schedule | Cosine decay with 1000-step warmup | Standard; decays to 2.5e-6 |
| Batch size | 32/GPU (256 effective on 8x H100) | Matches community configs |
| Steps | 15,000 (just under 1 epoch) | ~5M samples / 256 batch ≈ 19k steps per epoch |
| Normalization | MEAN_STD for state/action, IDENTITY for images | Simpler than QUANTILES, proven to work |
| ImageNet stats | Yes | Standard image normalization |
| save_freq | 500 | 30 checkpoints over the full run, low risk of data loss |

## Camera Mapping

The community datasets use generic camera names (`image`, `image2`), while Pi0.5 expects
the specific names from its pretraining. We map between them via `--rename_map`:

| Dataset | Pi0.5 | Meaning |
|---------|-------|---------|
| `observation.images.image` | `observation.images.base_0_rgb` | Front/base camera |
| `observation.images.image2` | `observation.images.left_wrist_0_rgb` | Wrist camera |

The third expected camera (`right_wrist_0_rgb`) is left empty — Pi0.5 handles
missing cameras via its `empty_cameras` mechanism.
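Mechanically the rename is just a key substitution on each batch dict before it reaches the policy. A minimal sketch of the idea (not LeRobot's actual implementation):

```python
RENAME_MAP = {
    "observation.images.image": "observation.images.base_0_rgb",
    "observation.images.image2": "observation.images.left_wrist_0_rgb",
}

def apply_rename_map(batch: dict, rename_map: dict) -> dict:
    """Return a copy of `batch` with mapped keys renamed; other keys pass through."""
    return {rename_map.get(k, k): v for k, v in batch.items()}

batch = {"observation.images.image": "front-frame", "observation.state": "joints"}
renamed = apply_rename_map(batch, RENAME_MAP)
print(sorted(renamed))  # ['observation.images.base_0_rgb', 'observation.state']
```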
## LeRobot Modifications

Two changes to `lerobot/src/lerobot/`:

1. **`datasets/factory.py`**: Added an `so100:` prefix handler that returns `SO100Dataset`
   instead of going through HuggingFace dataset loading. Also re-enabled
   `MultiLeRobotDataset` (it was behind a `NotImplementedError`).

2. **`configs/train.py`** + **`scripts/lerobot_train.py`**: Added an `early_stop_steps`
   parameter for local testing — trains with the full LR schedule shape but exits
   early after N steps.
## Reproducibility

To rebuild the dataset index from scratch:

```bash
python build_index.py --data-root /path/to/community_dataset_v3
python compute_stats.py --data-root /path/to/community_dataset_v3
```

This verifies every parquet file and video on disk and takes about 2 minutes.

## Status

- [x] Dataset filtering pipeline (build_index.py)
- [x] Dataset verification (all 10,155 episodes validated)
- [x] Normalization stats computed
- [x] Custom dataset class (so100_dataset.py)
- [x] LeRobot integration (factory.py patch)
- [x] Local sanity check (100 steps, loss decreasing)
- [ ] Local scale-up (1000 steps with the real LR schedule) — in progress
- [ ] Cloud training (8x H100, 15k steps)
- [ ] Evaluation on a real SO-101
- [ ] Inference script for deployment

## License

Apache 2.0 (same as the source data and the Pi0.5 base model)
build_index.py ADDED
```python
#!/usr/bin/env python3
"""
Build a filtered training index from community_dataset_v3 on disk.

Applies:
- Robot type filter (so100/so101 variants only)
- Schema filter (2 cameras, 6-DOF, 30fps)
- Episode length filter (5s-60s)
- Per-task cap (default 200)
- Per-contributor cap (default 200)
- Excludes datasets with file count mismatches

Outputs filtered_index.json with all info needed to train.
"""

import argparse
import glob
import json
import random
from collections import defaultdict
from pathlib import Path

import pandas as pd


def load_dataset_meta(dataset_root: Path) -> dict | None:
    """Load and validate a single dataset's metadata."""
    info_path = dataset_root / "meta" / "info.json"
    if not info_path.exists():
        return None

    with open(info_path) as f:
        info = json.load(f)

    # Robot type filter
    robot = info.get("robot_type", "")
    if robot not in ("so100", "so101", "so100_follower", "so101_follower"):
        return None

    # Schema filter: exactly the 2-camera, 6-DOF schema
    features = info.get("features", {})
    expected_keys = {
        "action", "episode_index", "frame_index", "index",
        "observation.images.image", "observation.images.image2",
        "observation.state", "task_index", "timestamp",
    }
    if set(features.keys()) != expected_keys:
        return None

    # Dimension check
    if features.get("action", {}).get("shape") != [6]:
        return None
    if features.get("observation.state", {}).get("shape") != [6]:
        return None

    # FPS check
    if info.get("fps") != 30:
        return None

    # Resolution check
    for cam_key in ("observation.images.image", "observation.images.image2"):
        shape = features.get(cam_key, {}).get("shape", [])
        if len(shape) < 2 or shape[0] != 480 or shape[1] != 640:
            return None

    # Load tasks
    tasks_path = dataset_root / "meta" / "tasks.jsonl"
    tasks = {}
    if tasks_path.exists():
        with open(tasks_path) as f:
            for line in f:
                line = line.strip()
                if line:
                    t = json.loads(line)
                    tasks[t["task_index"]] = t["task"]

    # Integrity check: video and parquet file counts must match the metadata
    total_eps = info.get("total_episodes", 0)
    vids = glob.glob(str(dataset_root / "videos" / "**" / "*.mp4"), recursive=True)
    parquets = glob.glob(str(dataset_root / "data" / "**" / "*.parquet"), recursive=True)
    expected_vids = total_eps * 2  # 2 cameras
    if len(vids) != expected_vids or len(parquets) != total_eps:
        return None

    # Load episode metadata if available
    episodes = []
    ep_jsonl = dataset_root / "meta" / "episodes.jsonl"
    if ep_jsonl.exists():
        with open(ep_jsonl) as f:
            for line in f:
                line = line.strip()
                if line:
                    episodes.append(json.loads(line))

    return {
        "robot_type": robot,
        "total_episodes": total_eps,
        "total_frames": info.get("total_frames", 0),
        "fps": info["fps"],
        "tasks": tasks,
        "episodes": episodes,
        "features": {k: v.get("shape") for k, v in features.items()},
    }


def build_index(
    data_root: Path,
    max_per_task: int = 200,
    max_per_contributor: int = 200,
    min_episode_frames: int = 150,
    max_episode_frames: int = 1800,
    seed: int = 42,
) -> dict:
    """Build the filtered training index."""
    rng = random.Random(seed)

    # Discover all contributor/dataset pairs
    contributors = sorted([
        d for d in data_root.iterdir()
        if d.is_dir() and not d.name.startswith(".")
    ])

    # Phase 1: Load all valid datasets
    all_episodes = []  # (contributor, dataset_name, episode_idx, task, num_frames)
    datasets_passed = 0
    datasets_rejected = 0
    skipped_missing = 0

    for contrib_dir in contributors:
        contributor = contrib_dir.name

        for ds_dir in sorted(contrib_dir.iterdir()):
            if not ds_dir.is_dir():
                continue

            meta = load_dataset_meta(ds_dir)
            if meta is None:
                datasets_rejected += 1
                continue

            datasets_passed += 1
            dataset_name = f"{contributor}/{ds_dir.name}"

            # Default task if none specified
            if not meta["tasks"]:
                meta["tasks"] = {0: "(no task)"}

            # Map episode_index -> task_index once per dataset
            # (episodes.jsonl may be absent, in which case this is empty)
            ep_task_map = {
                ep_meta.get("episode_index"): ep_meta.get("task_index", 0)
                for ep_meta in meta["episodes"]
            }

            # Build the episode list by reading the actual parquet files.
            # Trust the parquet row count, not metadata.
            for ep_idx in range(meta["total_episodes"]):
                parquet_path = ds_dir / f"data/chunk-000/episode_{ep_idx:06d}.parquet"
                if not parquet_path.exists():
                    skipped_missing += 1
                    continue

                # Read the actual row count (cheap — loads only one small column)
                pf = pd.read_parquet(parquet_path, columns=["frame_index"])
                actual_length = len(pf)

                if actual_length < min_episode_frames or actual_length > max_episode_frames:
                    continue

                # Also verify both video files exist
                vid1 = ds_dir / f"videos/chunk-000/observation.images.image/episode_{ep_idx:06d}.mp4"
                vid2 = ds_dir / f"videos/chunk-000/observation.images.image2/episode_{ep_idx:06d}.mp4"
                if not vid1.exists() or not vid2.exists():
                    skipped_missing += 1
                    continue

                # Get the task from episodes.jsonl if available, else default
                task_idx = ep_task_map.get(ep_idx, 0)
                task = meta["tasks"].get(task_idx, "(no task)")
                all_episodes.append((contributor, dataset_name, ep_idx, task, actual_length))

    print(f"Datasets: {datasets_passed} passed, {datasets_rejected} rejected")
    print(f"Episodes verified: {len(all_episodes)}, skipped (missing files): {skipped_missing}")
    print(f"Episodes before caps: {len(all_episodes)}")

    # Phase 2: Apply the per-task cap
    task_buckets = defaultdict(list)
    for ep in all_episodes:
        task_buckets[ep[3]].append(ep)

    after_task_cap = []
    tasks_capped = 0
    for task, eps in task_buckets.items():
        rng.shuffle(eps)
        if len(eps) > max_per_task:
            tasks_capped += 1
        after_task_cap.extend(eps[:max_per_task])

    print(f"Episodes after per-task cap ({max_per_task}): {len(after_task_cap)} ({tasks_capped} tasks capped)")

    # Phase 3: Apply the per-contributor cap
    contrib_buckets = defaultdict(list)
    for ep in after_task_cap:
        contrib_buckets[ep[0]].append(ep)

    final_episodes = []
    contribs_capped = 0
    for contributor, eps in contrib_buckets.items():
        rng.shuffle(eps)
        if len(eps) > max_per_contributor:
            contribs_capped += 1
        final_episodes.extend(eps[:max_per_contributor])

    print(f"Episodes after per-contributor cap ({max_per_contributor}): {len(final_episodes)} ({contribs_capped} contributors capped)")

    # Phase 4: Build the index
    # Sort for determinism
    final_episodes.sort(key=lambda x: (x[1], x[2]))

    # Collect unique tasks
    unique_tasks = sorted(set(ep[3] for ep in final_episodes))
    task_to_idx = {t: i for i, t in enumerate(unique_tasks)}

    # Collect the unique datasets used
    datasets_used = sorted(set(ep[1] for ep in final_episodes))

    # Build episode entries
    entries = []
    total_frames = 0
    for contributor, dataset_name, ep_idx, task, num_frames in final_episodes:
        entries.append({
            "dataset": dataset_name,
            "episode_index": ep_idx,
            "task": task,
            "task_index": task_to_idx[task],
            "num_frames": num_frames,
        })
        total_frames += num_frames

    index = {
        "source_repo": "HuggingFaceVLA/community_dataset_v3",
        "filters": {
            "max_per_task": max_per_task,
            "max_per_contributor": max_per_contributor,
            "min_episode_frames": min_episode_frames,
            "max_episode_frames": max_episode_frames,
            "seed": seed,
        },
        "summary": {
            "datasets": len(datasets_used),
            "episodes": len(entries),
            "unique_tasks": len(unique_tasks),
            "total_frames": total_frames,
            "est_hours": total_frames / 30 / 3600,
        },
        "tasks": unique_tasks,
        "datasets_used": datasets_used,
        "episodes": entries,
    }

    return index


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-root", type=Path, default=Path.home() / "lap" / "community_dataset_v3")
    parser.add_argument("--output", type=Path, default=Path(__file__).parent / "filtered_index.json")
    parser.add_argument("--max-per-task", type=int, default=200)
    parser.add_argument("--max-per-contributor", type=int, default=200)
    parser.add_argument("--seed", type=int, default=42)
    args = parser.parse_args()

    index = build_index(
        args.data_root,
        max_per_task=args.max_per_task,
        max_per_contributor=args.max_per_contributor,
        seed=args.seed,
    )

    args.output.parent.mkdir(parents=True, exist_ok=True)
    with open(args.output, "w") as f:
        json.dump(index, f, indent=2)

    print(f"\nSaved to {args.output}")
    print(f"  Datasets: {index['summary']['datasets']}")
    print(f"  Episodes: {index['summary']['episodes']}")
    print(f"  Tasks: {index['summary']['unique_tasks']}")
    print(f"  Frames: {index['summary']['total_frames']:,}")
    print(f"  Est. hours: {index['summary']['est_hours']:.1f}")
```
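The resulting `filtered_index.json` is plain JSON, so downstream checks are one-liners — for example, counting episodes per task to confirm the per-task cap held. Sketched against a miniature hand-built index with the same shape as the real file:

```python
from collections import Counter

# Miniature stand-in with the same structure as filtered_index.json.
index = {
    "episodes": [
        {"dataset": "alice/ds1", "episode_index": 0, "task": "pick cube", "num_frames": 300},
        {"dataset": "alice/ds1", "episode_index": 1, "task": "pick cube", "num_frames": 450},
        {"dataset": "bob/ds2", "episode_index": 0, "task": "fold cloth", "num_frames": 600},
    ],
}

per_task = Counter(ep["task"] for ep in index["episodes"])
total_frames = sum(ep["num_frames"] for ep in index["episodes"])
print(per_task.most_common())  # [('pick cube', 2), ('fold cloth', 1)]
print(total_frames)            # 1350
```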
compute_stats.py ADDED
```python
#!/usr/bin/env python3
"""
Compute normalization statistics (mean/std) for state and action across the filtered dataset.
Only reads parquet files — no video decoding, so it's fast.
"""

import argparse
import json
import time
from pathlib import Path

import numpy as np
import pandas as pd


def compute_stats(data_root: Path, index_path: Path) -> dict:
    with open(index_path) as f:
        index = json.load(f)

    # Collect all unique (dataset, episode) pairs
    episode_set = set()
    for ep in index["episodes"]:
        episode_set.add((ep["dataset"], ep["episode_index"]))

    print(f"Computing stats from {len(episode_set)} episodes...")

    # Streaming mean/variance via per-dimension sums and sums of squares
    # (simpler than Welford's algorithm; float64 keeps it accurate here)
    state_sum = np.zeros(6, dtype=np.float64)
    state_sq_sum = np.zeros(6, dtype=np.float64)
    action_sum = np.zeros(6, dtype=np.float64)
    action_sq_sum = np.zeros(6, dtype=np.float64)
    n_state = 0
    n_action = 0

    start = time.time()
    for i, (dataset, ep_idx) in enumerate(sorted(episode_set)):
        parquet_path = data_root / dataset / f"data/chunk-000/episode_{ep_idx:06d}.parquet"
        if not parquet_path.exists():
            continue

        df = pd.read_parquet(parquet_path)

        states = np.stack(df["observation.state"].values).astype(np.float64)
        actions = np.stack(df["action"].values).astype(np.float64)

        state_sum += states.sum(axis=0)
        state_sq_sum += (states ** 2).sum(axis=0)
        n_state += len(states)

        action_sum += actions.sum(axis=0)
        action_sq_sum += (actions ** 2).sum(axis=0)
        n_action += len(actions)

        if (i + 1) % 1000 == 0:
            elapsed = time.time() - start
            rate = (i + 1) / elapsed
            eta = (len(episode_set) - i - 1) / rate
            print(f"  [{i+1}/{len(episode_set)}] {rate:.0f} eps/s, ETA: {eta:.0f}s")

    state_mean = state_sum / n_state
    state_std = np.sqrt(state_sq_sum / n_state - state_mean ** 2)

    action_mean = action_sum / n_action
    action_std = np.sqrt(action_sq_sum / n_action - action_mean ** 2)

    elapsed = time.time() - start
    print(f"Done in {elapsed:.1f}s ({n_state:,} state frames, {n_action:,} action frames)")

    print(f"\nState mean: {state_mean}")
    print(f"State std:  {state_std}")
    print(f"Action mean: {action_mean}")
    print(f"Action std:  {action_std}")

    stats = {
        "observation.state": {
            "mean": state_mean.tolist(),
            "std": state_std.tolist(),
        },
        "action": {
            "mean": action_mean.tolist(),
            "std": action_std.tolist(),
        },
    }
    return stats


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-root", type=Path, default=Path.home() / "lap" / "community_dataset_v3")
    parser.add_argument("--index", type=Path, default=Path(__file__).parent / "filtered_index.json")
    parser.add_argument("--output", type=Path, default=Path(__file__).parent / "norm_stats.json")
    args = parser.parse_args()

    stats = compute_stats(args.data_root, args.index)

    with open(args.output, "w") as f:
        json.dump(stats, f, indent=2)
    print(f"\nSaved to {args.output}")
```
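The sum/sum-of-squares accumulation in `compute_stats` matches NumPy's population mean and std up to floating-point error. A quick sanity check on random data, accumulating in chunks to mimic the streaming pass (illustrative only, not part of the pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=50.0, scale=20.0, size=(10_000, 6))

# Accumulate as compute_stats.py does, split into two chunks to mimic streaming.
s = np.zeros(6)
sq = np.zeros(6)
n = 0
for chunk in np.array_split(x, 2):
    s += chunk.sum(axis=0)
    sq += (chunk ** 2).sum(axis=0)
    n += len(chunk)

mean = s / n
std = np.sqrt(sq / n - mean ** 2)

assert np.allclose(mean, x.mean(axis=0))
assert np.allclose(std, x.std(axis=0))  # population std (ddof=0), as in the script
print("ok")
```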
filtered_index.json ADDED
The diff for this file is too large to render.
norm_stats.json ADDED
```json
{
  "observation.state": {
    "mean": [
      3.2129562341482223,
      81.25934383631572,
      97.87567545165706,
      58.2558965428857,
      -3.869688922486154,
      13.552276313577162
    ],
    "std": [
      26.932913188864053,
      85.10186432539234,
      60.096302230313775,
      32.18041942119004,
      64.69174273514702,
      17.38995233769721
    ]
  },
  "action": {
    "mean": [
      3.2667901525244267,
      82.01517467950833,
      96.44080348317482,
      58.19181662702153,
      -3.898391972920288,
      11.117041393936647
    ],
    "std": [
      27.026112586762707,
      85.80857081004108,
      60.86058528648729,
      32.566689386004555,
      64.99547212544971,
      17.279498490768535
    ]
  }
}
```
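MEAN_STD normalization maps each raw 6-DOF vector through `(x - mean) / std`, using exactly these per-dimension stats. A small demonstration with the state values above:

```python
import numpy as np

state_mean = np.array([3.2129562341482223, 81.25934383631572, 97.87567545165706,
                       58.2558965428857, -3.869688922486154, 13.552276313577162])
state_std = np.array([26.932913188864053, 85.10186432539234, 60.096302230313775,
                      32.18041942119004, 64.69174273514702, 17.38995233769721])

def normalize(x: np.ndarray) -> np.ndarray:
    return (x - state_mean) / state_std

def denormalize(z: np.ndarray) -> np.ndarray:
    return z * state_std + state_mean

raw = np.array([0.0, 90.0, 100.0, 60.0, 0.0, 10.0])
z = normalize(raw)
assert np.allclose(denormalize(z), raw)  # round-trips up to fp error
```

Normalizing the mean vector itself yields all zeros, which is a handy spot check that the stats were loaded in the right order.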
so100_dataset.py ADDED
```python
#!/usr/bin/env python3
"""
Custom PyTorch Dataset that reads directly from community_dataset_v3 v2.1 files on disk.
No merging, no conversion, no copying. Just reads parquets + decodes video frames.

Returns raw (unnormalized) data in the format LeRobotDataset returns — the existing
Pi0.5 preprocessor handles normalization, padding, tokenization, and device placement.

Provides a .meta adapter so lerobot_train.py can use it as a drop-in replacement.
"""

import json
from pathlib import Path

import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset


class _DatasetMeta:
    """
    Lightweight adapter that provides the .meta interface lerobot_train.py expects.
    Wraps our filtered index + precomputed stats.
    """

    def __init__(self, index: dict, stats: dict, data_root: Path):
        self.repo_id = "SO100Dataset/local"
        self.root = data_root

        # Stats: the training script expects dict[str, dict[str, torch.Tensor]]
        self.stats = {}
        for key, s in stats.items():
            mean = torch.tensor(s["mean"], dtype=torch.float32)
            std = torch.tensor(s["std"], dtype=torch.float32)
            self.stats[key] = {
                "mean": mean,
                "std": std,
                # The preprocessor may also look for min/max/quantiles.
                # Approximate them as mean +/- 3*std; only MEAN_STD is used.
                "min": mean - 3 * std,
                "max": mean + 3 * std,
            }

        # Tasks
        self.tasks = pd.DataFrame(
            {"task_index": range(len(index["tasks"]))},
            index=index["tasks"],
        )

        # Features
        self._features = {
            "observation.images.image": {
                "dtype": "video",
                "shape": [3, 480, 640],
                "names": ["channels", "height", "width"],
            },
            "observation.images.image2": {
                "dtype": "video",
                "shape": [3, 480, 640],
                "names": ["channels", "height", "width"],
            },
            "observation.state": {"dtype": "float32", "shape": [6]},
            "action": {"dtype": "float32", "shape": [6]},
            "timestamp": {"dtype": "float32", "shape": []},
            "frame_index": {"dtype": "int64", "shape": []},
            "episode_index": {"dtype": "int64", "shape": []},
            "index": {"dtype": "int64", "shape": []},
            "task_index": {"dtype": "int64", "shape": []},
        }

        self.info = {
            "fps": 30,
            "robot_type": "so100",
            "total_episodes": index["summary"]["episodes"],
            "total_frames": index["summary"]["total_frames"],
        }

    @property
    def fps(self):
        return 30

    @property
    def features(self):
        return self._features

    @property
    def camera_keys(self):
        return ["observation.images.image", "observation.images.image2"]

    @property
    def video_keys(self):
        return ["observation.images.image", "observation.images.image2"]

    @property
    def image_keys(self):
        return []

    @property
    def total_episodes(self):
        return self.info["total_episodes"]

    @property
    def total_frames(self):
        return self.info["total_frames"]

    @property
    def robot_type(self):
        return "so100"


class SO100Dataset(Dataset):
    """
    Loads filtered SO-100/101 episodes from community_dataset_v3 on disk.

    Each sample is one frame with an action chunk of the next `chunk_size` steps.
    Returns raw unnormalized data — the Pi0.5 preprocessor handles normalization.

    Provides a .meta property compatible with lerobot_train.py.
    """

    def __init__(
        self,
        data_root: str | Path,
        index_path: str | Path,
        stats_path: str | Path | None = None,
        video_backend: str = "pyav",
        chunk_size: int = 50,
        image_transforms=None,
    ):
        self.data_root = Path(data_root)
        self.video_backend = video_backend
        self.chunk_size = chunk_size
        self.image_transforms = image_transforms
        self.fps = 30

        # Load the index
        with open(index_path) as f:
            self._index = json.load(f)

        self.tasks = self._index["tasks"]

        # Load stats
        raw_stats = {}
        if stats_path and Path(stats_path).exists():
            with open(stats_path) as f:
                raw_stats = json.load(f)

        # Create the meta adapter
        self.meta = _DatasetMeta(self._index, raw_stats, self.data_root)

        # Build a flat frame-level index
        self._frame_index = []
        self._episode_offsets = []

        for ep in self._index["episodes"]:
            dataset_path = self.data_root / ep["dataset"]
            ep_idx = ep["episode_index"]
            task = ep["task"]
            task_idx = ep["task_index"]
            num_frames = ep["num_frames"]

            # Only include frames where a full action chunk fits
            valid_frames = max(0, num_frames - self.chunk_size)
            if valid_frames == 0:
                continue

            start = len(self._frame_index)
            self._episode_offsets.append(start)

            for frame_idx in range(valid_frames):
                self._frame_index.append((
                    dataset_path, ep_idx, frame_idx,
                    num_frames, task, task_idx,
                ))

        # Parquet cache (evicted FIFO once it exceeds _cache_max entries)
        self._parquet_cache = {}
        self._cache_max = 200

    def __len__(self):
        return len(self._frame_index)

    @property
    def num_episodes(self):
        return len(self._episode_offsets)

    @property
    def num_frames(self):
        return len(self._frame_index)

    @property
    def episodes(self):
        return None  # Use all episodes (no further filtering)

    @property
    def features(self):
        return self.meta.features

    @property
    def video(self):
        return True

    @property
    def camera_keys(self):
        return self.meta.camera_keys

    @property
    def video_frame_keys(self):
        return self.meta.camera_keys

    def _load_parquet(self, dataset_path: Path, episode_index: int) -> pd.DataFrame:
        """Load and cache a parquet file."""
        key = (str(dataset_path), episode_index)
        if key in self._parquet_cache:
            return self._parquet_cache[key]

        parquet_path = dataset_path / f"data/chunk-000/episode_{episode_index:06d}.parquet"
        df = pd.read_parquet(parquet_path)

        if len(self._parquet_cache) >= self._cache_max:
            oldest_key = next(iter(self._parquet_cache))
            del self._parquet_cache[oldest_key]

        self._parquet_cache[key] = df
        return df

    def _decode_video_frame(self, video_path: Path, timestamp: float) -> torch.Tensor:
        """Decode a single frame from an MP4 at the given timestamp. Returns (C, H, W) float32 [0,1]."""
        if self.video_backend == "torchcodec":
            from torchcodec.decoders import VideoDecoder

            decoder = VideoDecoder(str(video_path))
            frame = decoder.get_frame_played_at(timestamp)
            return frame.data.float() / 255.0
        else:
            import av

            container = av.open(str(video_path))
            stream = container.streams.video[0]
            target_pts = int(timestamp / float(stream.time_base))
            # seek() lands on the nearest keyframe at or before target_pts,
            # so decode forward until we reach the requested timestamp.
            container.seek(target_pts, stream=stream)
            for frame in container.decode(video=0):
                if frame.pts is not None and frame.pts < target_pts:
                    continue
                arr = frame.to_ndarray(format="rgb24")
                tensor = torch.from_numpy(arr).permute(2, 0, 1).float() / 255.0
                container.close()
                return tensor
            container.close()
            raise RuntimeError(f"Could not decode frame at t={timestamp} from {video_path}")

    def __getitem__(self, idx: int) -> dict:
        dataset_path, ep_idx, frame_idx, num_frames, task, task_idx = self._frame_index[idx]

        df = self._load_parquet(dataset_path, ep_idx)

        # Current frame
        row = df.iloc[frame_idx]
        state = torch.tensor(row["observation.state"], dtype=torch.float32)
        timestamp = float(row["timestamp"])

        # Action chunk: the next chunk_size actions starting from the current frame
        action_end = min(frame_idx + self.chunk_size, len(df))
        action_rows = df.iloc[frame_idx:action_end]
        actions = torch.tensor(
            np.stack(action_rows["action"].values),
            dtype=torch.float32,
        )
        # Pad with the last action if near the episode end. Since the frame index
        # excludes the final chunk_size frames this should never trigger; kept as
        # a safety net.
        if actions.shape[0] < self.chunk_size:
            pad = actions[-1:].expand(self.chunk_size - actions.shape[0], -1)
            actions = torch.cat([actions, pad], dim=0)

        # Decode video frames
        video_dir = dataset_path / "videos" / "chunk-000"
        ep_str = f"episode_{ep_idx:06d}.mp4"

        image1 = self._decode_video_frame(
            video_dir / "observation.images.image" / ep_str, timestamp
        )
        image2 = self._decode_video_frame(
            video_dir / "observation.images.image2" / ep_str, timestamp
        )

        if self.image_transforms is not None:
            image1 = self.image_transforms(image1)
            image2 = self.image_transforms(image2)

        return {
            "observation.images.image": image1,   # (3, 480, 640) float32 [0,1]
            "observation.images.image2": image2,  # (3, 480, 640) float32 [0,1]
            "observation.state": state,           # (6,) float32, raw values
            "action": actions,                    # (chunk_size, 6) float32, raw values
            "task": task,                         # str
            "task_index": torch.tensor(task_idx),
            "timestamp": torch.tensor(timestamp),
            "frame_index": torch.tensor(frame_idx),
            "episode_index": torch.tensor(ep_idx),
            "index": torch.tensor(idx),
        }

    def __repr__(self):
        return (
            f"SO100Dataset(\n"
            f"  data_root='{self.data_root}',\n"
            f"  episodes={self.num_episodes},\n"
            f"  frames={self.num_frames:,},\n"
            f"  tasks={len(self.tasks)},\n"
            f"  video_backend='{self.video_backend}',\n"
            f")"
        )
```
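The end-of-episode padding in `__getitem__` (repeat the final action until the chunk is full) can be exercised in isolation. A NumPy equivalent of the torch logic, for illustration:

```python
import numpy as np

def pad_action_chunk(actions: np.ndarray, chunk_size: int) -> np.ndarray:
    """Repeat the final action so the chunk always has `chunk_size` rows
    (mirrors the torch expand/cat padding in SO100Dataset.__getitem__)."""
    if len(actions) < chunk_size:
        pad = np.repeat(actions[-1:], chunk_size - len(actions), axis=0)
        actions = np.concatenate([actions, pad], axis=0)
    return actions

chunk = pad_action_chunk(np.arange(12, dtype=np.float32).reshape(2, 6), chunk_size=5)
print(chunk.shape)  # (5, 6)
```

Because the frame index excludes the final `chunk_size` frames of each episode, the padding path is effectively dead code during training, but it keeps the dataset safe if that invariant ever changes.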