Upload folder using huggingface_hub

Files changed:
- .gitignore +2 -0
- README.md +204 -0
- build_index.py +287 -0
- compute_stats.py +98 -0
- filtered_index.json +0 -0
- norm_stats.json +38 -0
- so100_dataset.py +312 -0
.gitignore
ADDED
@@ -0,0 +1,2 @@
outputs/
__pycache__/
README.md
ADDED
@@ -0,0 +1,204 @@
# Pi0.5 SO-100 Diverse Finetune

Finetune Pi0.5's action expert on diverse SO-100/101 community data to enable
multi-task manipulation controlled by natural language.

**Goal**: A Pi0.5 model that can perform many different tasks on an SO-100/101 arm
("pick up the red cube", "fold the cloth", "stack the blocks") — not a single-task
policy, but a generalist that understands the SO-100 embodiment.

**Approach**: Freeze the VLM backbone (3B params), finetune only the action expert
(693M params including projections) using `train_expert_only=true`. The frozen VLM
retains its general vision-language understanding while the expert learns SO-100
motor control from diverse demonstrations.
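Conceptually, `train_expert_only=true` reduces to freezing one submodule and training the other. A minimal PyTorch sketch of the idea (placeholder submodule names `vlm` and `action_expert`; not LeRobot's actual implementation):

```python
import torch

def expert_only_params(policy: torch.nn.Module):
    """Freeze the VLM backbone; return only the action-expert params for the optimizer."""
    for p in policy.vlm.parameters():            # hypothetical submodule name
        p.requires_grad = False
    for p in policy.action_expert.parameters():  # hypothetical submodule name
        p.requires_grad = True
    return [p for p in policy.parameters() if p.requires_grad]
```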

## Dataset

**Source**: [HuggingFaceVLA/community_dataset_v3](https://huggingface.co/datasets/HuggingFaceVLA/community_dataset_v3)
— a community-contributed collection of SO-100/101 teleoperation demonstrations.

**Filtering** (`build_index.py`):
- Robot type: so100, so101, so100_follower, so101_follower
- Schema: exactly 2 cameras (`observation.images.image` + `image2`), 6-DOF state/action
- Resolution: 480x640, FPS: 30
- Episode length: 150-1800 frames (5-60 seconds)
- Integrity: every episode verified against the actual parquet row count plus both video files
- Per-task cap: 200 episodes max per unique task string (prevents dominant tasks)
- Per-contributor cap: 200 episodes max per contributor (prevents style bias)

**Result**: 376 datasets, 10,155 episodes, 215 unique tasks, 5.4M frames (~50 hours)

**Task categories** (see `so100_101_task_analysis.txt` for the full breakdown):
- Pick-and-place (cubes, legos, balls, pens, cups, toys)
- Block stacking and building (towers, Hanoi, Jenga)
- Clothing folding (t-shirts, towels, blankets)
- Drawing and writing (smiley faces, letters, iPad)
- Food manipulation (fork strawberries, spoon food)
- Sorting and organizing (pins, blocks by color)
- Peg/hole insertion (shape sorting)
- Cleaning (table, area)
- Kitchen tasks (open/close cabinet, lids, containers)

## Architecture

**No dataset merging or conversion required.** The custom `SO100Dataset` class reads
directly from the community_dataset_v3 v2.1 files on disk. A thin `.meta` adapter
makes it compatible with LeRobot's `lerobot_train.py` training script.

```
community_dataset_v3/                 (cloned from HuggingFace, ~261GB for filtered subset)
  contributor/dataset/
    data/chunk-000/episode_NNNNNN.parquet           (state + action per frame)
    videos/chunk-000/
      observation.images.image/episode_NNNNNN.mp4   (camera 1)
      observation.images.image2/episode_NNNNNN.mp4  (camera 2)
    meta/info.json, tasks.jsonl, episodes.jsonl

filtered_index.json  (maps episodes to files, verified frame counts)
norm_stats.json      (mean/std for state and action normalization)
so100_dataset.py     (PyTorch Dataset that reads the above)
```

LeRobot's `factory.py` is patched to recognize the `so100:` prefix in `--dataset.repo_id`:
```
--dataset.repo_id="so100:/path/to/community_dataset_v3:/path/to/filtered_index.json:/path/to/norm_stats.json"
```

## Files

| File | Purpose |
|------|---------|
| `build_index.py` | Scan community_dataset_v3, apply filters, verify parquets, output the index |
| `compute_stats.py` | Compute mean/std normalization stats from the filtered parquets |
| `so100_dataset.py` | PyTorch Dataset class with a `.meta` adapter for lerobot compatibility |
| `filtered_index.json` | The verified training index (10,155 episodes, 215 tasks) |
| `norm_stats.json` | Precomputed mean/std for state and action |

## Training

### Local sanity check (1x RTX 3090)

```bash
cd /home/anon/pi05-so100-diverse
PYTHONPATH=/home/anon/pi05-so100-diverse:$PYTHONPATH python -m lerobot.scripts.lerobot_train \
  --dataset.repo_id="so100:/home/anon/lap/community_dataset_v3:filtered_index.json:norm_stats.json" \
  --policy.path=lerobot/pi05_base \
  --policy.train_expert_only=true \
  --policy.dtype=bfloat16 \
  --policy.gradient_checkpointing=true \
  --policy.push_to_hub=false \
  --policy.normalization_mapping='{"VISUAL": "IDENTITY", "STATE": "MEAN_STD", "ACTION": "MEAN_STD"}' \
  --policy.scheduler_warmup_steps=1000 \
  --policy.scheduler_decay_steps=15000 \
  --rename_map='{"observation.images.image": "observation.images.base_0_rgb", "observation.images.image2": "observation.images.left_wrist_0_rgb"}' \
  --batch_size=4 \
  --steps=15000 \
  --early_stop_steps=1000 \
  --save_freq=500 \
  --log_freq=50 \
  --num_workers=2 \
  --output_dir=outputs/scale_up_1k \
  --job_name=scale_up_1k \
  --save_checkpoint=true
```

### Cloud training (8x H100)

```bash
# 1. Selective download of the filtered datasets (~261GB)
python download_filtered.py --data-root /data/community_dataset_v3

# 2. Launch training
accelerate launch --multi_gpu --num_processes 8 \
  -m lerobot.scripts.lerobot_train \
  --dataset.repo_id="so100:/data/community_dataset_v3:filtered_index.json:norm_stats.json" \
  --policy.path=lerobot/pi05_base \
  --policy.train_expert_only=true \
  --policy.dtype=bfloat16 \
  --policy.gradient_checkpointing=true \
  --policy.compile_model=true \
  --policy.push_to_hub=true \
  --policy.repo_id=StrongRoboticsLab/pi05-so100-diverse \
  --policy.normalization_mapping='{"VISUAL": "IDENTITY", "STATE": "MEAN_STD", "ACTION": "MEAN_STD"}' \
  --policy.scheduler_warmup_steps=1000 \
  --policy.scheduler_decay_steps=15000 \
  --rename_map='{"observation.images.image": "observation.images.base_0_rgb", "observation.images.image2": "observation.images.left_wrist_0_rgb"}' \
  --batch_size=32 \
  --steps=15000 \
  --save_freq=500 \
  --log_freq=50 \
  --num_workers=4 \
  --wandb.enable=true \
  --wandb.project=pi05-so100-diverse \
  --output_dir=outputs/cloud_run \
  --job_name=pi05_so100_diverse
```

## Training Configuration

| Parameter | Value | Rationale |
|-----------|-------|-----------|
| Base model | `lerobot/pi05_base` | Pi0.5 pretrained on cross-embodiment data |
| train_expert_only | true | Freeze VLM, train action expert + projections (693M params) |
| dtype | bfloat16 | Standard for H100/3090 training |
| gradient_checkpointing | true | Saves VRAM by recomputing activations |
| LR | 2.5e-5 (peak) | Pi0.5 default, conservative for finetuning |
| LR schedule | Cosine decay with 1000-step warmup | Standard, decays to 2.5e-6 |
| Batch size | 32/GPU (256 effective on 8x H100) | Matches community configs |
| Steps | 15,000 (~0.8 epoch) | ~5M valid samples / 256 batch ≈ 19k steps per epoch |
| Normalization | MEAN_STD for state/action, IDENTITY for images | Simpler than QUANTILES, proven to work |
| ImageNet stats | Yes | Standard image normalization |
| save_freq | 500 | 30 checkpoints over the full run, low risk of data loss |

## Camera Mapping

The community datasets use generic camera names (`image`, `image2`). Pi0.5 expects
specific names from its pretraining. We map them via `--rename_map`:

| Dataset | Pi0.5 | Meaning |
|---------|-------|---------|
| `observation.images.image` | `observation.images.base_0_rgb` | Front/base camera |
| `observation.images.image2` | `observation.images.left_wrist_0_rgb` | Wrist camera |

The third expected camera (`right_wrist_0_rgb`) is left empty — Pi0.5 handles
missing cameras via its `empty_cameras` mechanism.

## LeRobot Modifications

Two changes to `lerobot/src/lerobot/`:

1. **`datasets/factory.py`**: Added a `so100:` prefix handler that returns `SO100Dataset`
   instead of going through HuggingFace dataset loading (see the sketch below). Also
   re-enabled `MultiLeRobotDataset` (was behind a `NotImplementedError`).

2. **`configs/train.py`** + **`scripts/lerobot_train.py`**: Added an `early_stop_steps`
   parameter for local testing — trains with the full LR schedule shape but exits
   early after N steps.
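A minimal sketch of what the `factory.py` prefix handler amounts to (hypothetical hook point and argument parsing; the actual patch lives in the modified LeRobot tree):

```python
# Sketch only: assumes make_dataset(cfg) is the hook point and that the three
# paths are colon-separated as in the README; not the verbatim patch.
def make_dataset(cfg):
    repo_id = cfg.dataset.repo_id
    if repo_id.startswith("so100:"):
        # "so100:<data_root>:<index_path>:<stats_path>"
        _, data_root, index_path, stats_path = repo_id.split(":", maxsplit=3)
        from so100_dataset import SO100Dataset
        return SO100Dataset(
            data_root=data_root,
            index_path=index_path,
            stats_path=stats_path,
        )
    # ... otherwise fall through to the standard LeRobotDataset path
```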

## Reproducibility

To rebuild the dataset index from scratch:

```bash
python build_index.py --data-root /path/to/community_dataset_v3
python compute_stats.py --data-root /path/to/community_dataset_v3
```

This verifies every parquet file and video on disk. Takes ~2 minutes.

## Status

- [x] Dataset filtering pipeline (build_index.py)
- [x] Dataset verification (all 10,155 episodes validated)
- [x] Normalization stats computed
- [x] Custom dataset class (so100_dataset.py)
- [x] LeRobot integration (factory.py patch)
- [x] Local sanity check (100 steps, loss decreasing)
- [ ] Local scale-up (1000 steps with real LR schedule) — in progress
- [ ] Cloud training (8x H100, 15k steps)
- [ ] Evaluation on real SO-101
- [ ] Inference script for deployment

## License

Apache 2.0 (same as source data and Pi0.5 base model)
build_index.py
ADDED
@@ -0,0 +1,287 @@
#!/usr/bin/env python3
"""
Build a filtered training index from community_dataset_v3 on disk.

Applies:
- Robot type filter (so100/so101 variants only)
- Schema filter (2 cameras, 6-DOF, 30fps)
- Episode length filter (5s-60s)
- Per-task cap (default 200)
- Per-contributor cap (default 200)
- Excludes datasets with file count mismatches

Outputs filtered_index.json with all info needed to train.
"""

import argparse
import glob
import json
import random
from collections import defaultdict
from pathlib import Path

import pandas as pd


def load_dataset_meta(dataset_root: Path) -> dict | None:
    """Load and validate a single dataset's metadata."""
    info_path = dataset_root / "meta" / "info.json"
    if not info_path.exists():
        return None

    with open(info_path) as f:
        info = json.load(f)

    # Robot type filter
    robot = info.get("robot_type", "")
    if robot not in ("so100", "so101", "so100_follower", "so101_follower"):
        return None

    # Schema filter: exactly the 2-camera, 6-DOF schema
    features = info.get("features", {})
    expected_keys = {
        "action", "episode_index", "frame_index", "index",
        "observation.images.image", "observation.images.image2",
        "observation.state", "task_index", "timestamp",
    }
    if set(features.keys()) != expected_keys:
        return None

    # Dimension check
    if features.get("action", {}).get("shape") != [6]:
        return None
    if features.get("observation.state", {}).get("shape") != [6]:
        return None

    # FPS check
    if info.get("fps") != 30:
        return None

    # Resolution check
    for cam_key in ("observation.images.image", "observation.images.image2"):
        shape = features.get(cam_key, {}).get("shape", [])
        if len(shape) < 2 or shape[0] != 480 or shape[1] != 640:
            return None

    # Load tasks
    tasks_path = dataset_root / "meta" / "tasks.jsonl"
    tasks = {}
    if tasks_path.exists():
        with open(tasks_path) as f:
            for line in f:
                line = line.strip()
                if line:
                    t = json.loads(line)
                    tasks[t["task_index"]] = t["task"]

    # Integrity check: video and parquet file counts
    total_eps = info.get("total_episodes", 0)
    vids = glob.glob(str(dataset_root / "videos" / "**" / "*.mp4"), recursive=True)
    parquets = glob.glob(str(dataset_root / "data" / "**" / "*.parquet"), recursive=True)
    expected_vids = total_eps * 2  # 2 cameras
    if len(vids) != expected_vids or len(parquets) != total_eps:
        return None

    # Load episode metadata if available
    episodes = []
    ep_jsonl = dataset_root / "meta" / "episodes.jsonl"
    if ep_jsonl.exists():
        with open(ep_jsonl) as f:
            for line in f:
                line = line.strip()
                if line:
                    episodes.append(json.loads(line))

    return {
        "robot_type": robot,
        "total_episodes": total_eps,
        "total_frames": info.get("total_frames", 0),
        "fps": info["fps"],
        "tasks": tasks,
        "episodes": episodes,
        "features": {k: v.get("shape") for k, v in features.items()},
    }


def build_index(
    data_root: Path,
    max_per_task: int = 200,
    max_per_contributor: int = 200,
    min_episode_frames: int = 150,
    max_episode_frames: int = 1800,
    seed: int = 42,
) -> dict:
    """Build the filtered training index."""
    rng = random.Random(seed)

    # Discover all contributor/dataset pairs
    contributors = sorted([
        d for d in data_root.iterdir()
        if d.is_dir() and not d.name.startswith(".")
    ])

    # Phase 1: Load all valid datasets
    all_episodes = []  # (contributor, dataset_name, episode_idx, task, num_frames)
    datasets_passed = 0
    datasets_rejected = 0
    skipped_missing = 0

    for contrib_dir in contributors:
        if not contrib_dir.is_dir():
            continue
        contributor = contrib_dir.name

        for ds_dir in sorted(contrib_dir.iterdir()):
            if not ds_dir.is_dir():
                continue

            meta = load_dataset_meta(ds_dir)
            if meta is None:
                datasets_rejected += 1
                continue

            datasets_passed += 1
            dataset_name = f"{contributor}/{ds_dir.name}"

            # Default task if none specified
            if not meta["tasks"]:
                meta["tasks"] = {0: "(no task)"}

            # Build the episode list by reading the actual parquet files.
            # Trust the parquet row count, not the metadata.
            for ep_idx in range(meta["total_episodes"]):
                parquet_path = ds_dir / f"data/chunk-000/episode_{ep_idx:06d}.parquet"
                if not parquet_path.exists():
                    skipped_missing += 1
                    continue

                # Read the actual row count from the parquet
                # (loads a single small column, not the full table)
                pf = pd.read_parquet(parquet_path, columns=["frame_index"])
                actual_length = len(pf)

                if actual_length < min_episode_frames or actual_length > max_episode_frames:
                    continue

                # Also verify that both video files exist
                vid1 = ds_dir / f"videos/chunk-000/observation.images.image/episode_{ep_idx:06d}.mp4"
                vid2 = ds_dir / f"videos/chunk-000/observation.images.image2/episode_{ep_idx:06d}.mp4"
                if not vid1.exists() or not vid2.exists():
                    skipped_missing += 1
                    continue

                # Get the task from episodes.jsonl if available, else default
                task_idx = 0
                if meta["episodes"]:
                    for ep_meta in meta["episodes"]:
                        if ep_meta.get("episode_index") == ep_idx:
                            task_idx = ep_meta.get("task_index", 0)
                            break

                task = meta["tasks"].get(task_idx, "(no task)")
                all_episodes.append((contributor, dataset_name, ep_idx, task, actual_length))

    print(f"Datasets: {datasets_passed} passed, {datasets_rejected} rejected")
    print(f"Episodes verified: {len(all_episodes)}, skipped (missing files): {skipped_missing}")
    print(f"Episodes before caps: {len(all_episodes)}")

    # Phase 2: Apply the per-task cap
    task_buckets = defaultdict(list)
    for ep in all_episodes:
        task_buckets[ep[3]].append(ep)

    after_task_cap = []
    tasks_capped = 0
    for task, eps in task_buckets.items():
        rng.shuffle(eps)
        if len(eps) > max_per_task:
            tasks_capped += 1
        after_task_cap.extend(eps[:max_per_task])

    print(f"Episodes after per-task cap ({max_per_task}): {len(after_task_cap)} ({tasks_capped} tasks capped)")

    # Phase 3: Apply the per-contributor cap
    contrib_buckets = defaultdict(list)
    for ep in after_task_cap:
        contrib_buckets[ep[0]].append(ep)

    final_episodes = []
    contribs_capped = 0
    for contributor, eps in contrib_buckets.items():
        rng.shuffle(eps)
        if len(eps) > max_per_contributor:
            contribs_capped += 1
        final_episodes.extend(eps[:max_per_contributor])

    print(f"Episodes after per-contributor cap ({max_per_contributor}): {len(final_episodes)} ({contribs_capped} contributors capped)")

    # Phase 4: Build the index
    # Sort for determinism
    final_episodes.sort(key=lambda x: (x[1], x[2]))

    # Collect unique tasks
    unique_tasks = sorted(set(ep[3] for ep in final_episodes))
    task_to_idx = {t: i for i, t in enumerate(unique_tasks)}

    # Collect the unique datasets used
    datasets_used = sorted(set(ep[1] for ep in final_episodes))

    # Build episode entries
    entries = []
    total_frames = 0
    for contributor, dataset_name, ep_idx, task, num_frames in final_episodes:
        entries.append({
            "dataset": dataset_name,
            "episode_index": ep_idx,
            "task": task,
            "task_index": task_to_idx[task],
            "num_frames": num_frames,
        })
        total_frames += num_frames

    index = {
        "source_repo": "HuggingFaceVLA/community_dataset_v3",
        "filters": {
            "max_per_task": max_per_task,
            "max_per_contributor": max_per_contributor,
            "min_episode_frames": min_episode_frames,
            "max_episode_frames": max_episode_frames,
            "seed": seed,
        },
        "summary": {
            "datasets": len(datasets_used),
            "episodes": len(entries),
            "unique_tasks": len(unique_tasks),
            "total_frames": total_frames,
            "est_hours": total_frames / 30 / 3600,
        },
        "tasks": unique_tasks,
        "datasets_used": datasets_used,
        "episodes": entries,
    }

    return index


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-root", type=Path, default=Path.home() / "lap" / "community_dataset_v3")
    parser.add_argument("--output", type=Path, default=Path(__file__).parent / "filtered_index.json")
    parser.add_argument("--max-per-task", type=int, default=200)
    parser.add_argument("--max-per-contributor", type=int, default=200)
    parser.add_argument("--seed", type=int, default=42)
    args = parser.parse_args()

    index = build_index(
        args.data_root,
        max_per_task=args.max_per_task,
        max_per_contributor=args.max_per_contributor,
        seed=args.seed,
    )

    args.output.parent.mkdir(parents=True, exist_ok=True)
    with open(args.output, "w") as f:
        json.dump(index, f, indent=2)

    print(f"\nSaved to {args.output}")
    print(f"  Datasets: {index['summary']['datasets']}")
    print(f"  Episodes: {index['summary']['episodes']}")
    print(f"  Tasks: {index['summary']['unique_tasks']}")
    print(f"  Frames: {index['summary']['total_frames']:,}")
    print(f"  Est. hours: {index['summary']['est_hours']:.1f}")
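To sanity-check the generated index, the summary block can be read back directly (a usage sketch; the keys match what `build_index.py` writes above):

```python
import json

with open("filtered_index.json") as f:
    index = json.load(f)

print(index["summary"])      # datasets, episodes, unique_tasks, total_frames, est_hours
print(index["tasks"][:5])    # first few unique task strings
print(index["episodes"][0])  # one entry: dataset, episode_index, task, task_index, num_frames
```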
compute_stats.py
ADDED
@@ -0,0 +1,98 @@
#!/usr/bin/env python3
"""
Compute normalization statistics (mean/std) for state and action across the filtered dataset.
Only reads parquet files — no video decoding, so it's fast.
"""

import argparse
import json
import time
from pathlib import Path

import numpy as np
import pandas as pd


def compute_stats(data_root: Path, index_path: Path) -> dict:
    with open(index_path) as f:
        index = json.load(f)

    # Collect all unique (dataset, episode) pairs
    episode_set = set()
    for ep in index["episodes"]:
        episode_set.add((ep["dataset"], ep["episode_index"]))

    print(f"Computing stats from {len(episode_set)} episodes...")

    # Single-pass mean/variance from running sums:
    # var = E[x^2] - E[x]^2, accumulated as sums and sums of squares in float64.
    state_sum = np.zeros(6, dtype=np.float64)
    state_sq_sum = np.zeros(6, dtype=np.float64)
    action_sum = np.zeros(6, dtype=np.float64)
    action_sq_sum = np.zeros(6, dtype=np.float64)
    n_state = 0
    n_action = 0

    start = time.time()
    for i, (dataset, ep_idx) in enumerate(sorted(episode_set)):
        parquet_path = data_root / dataset / f"data/chunk-000/episode_{ep_idx:06d}.parquet"
        if not parquet_path.exists():
            continue

        df = pd.read_parquet(parquet_path)

        states = np.stack(df["observation.state"].values).astype(np.float64)
        actions = np.stack(df["action"].values).astype(np.float64)

        state_sum += states.sum(axis=0)
        state_sq_sum += (states ** 2).sum(axis=0)
        n_state += len(states)

        action_sum += actions.sum(axis=0)
        action_sq_sum += (actions ** 2).sum(axis=0)
        n_action += len(actions)

        if (i + 1) % 1000 == 0:
            elapsed = time.time() - start
            rate = (i + 1) / elapsed
            eta = (len(episode_set) - i - 1) / rate
            print(f"  [{i+1}/{len(episode_set)}] {rate:.0f} eps/s, ETA: {eta:.0f}s")

    state_mean = state_sum / n_state
    state_std = np.sqrt(state_sq_sum / n_state - state_mean ** 2)

    action_mean = action_sum / n_action
    action_std = np.sqrt(action_sq_sum / n_action - action_mean ** 2)

    elapsed = time.time() - start
    print(f"Done in {elapsed:.1f}s ({n_state:,} state frames, {n_action:,} action frames)")

    print(f"\nState mean: {state_mean}")
    print(f"State std: {state_std}")
    print(f"Action mean: {action_mean}")
    print(f"Action std: {action_std}")

    stats = {
        "observation.state": {
            "mean": state_mean.tolist(),
            "std": state_std.tolist(),
        },
        "action": {
            "mean": action_mean.tolist(),
            "std": action_std.tolist(),
        },
    }
    return stats


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-root", type=Path, default=Path.home() / "lap" / "community_dataset_v3")
    parser.add_argument("--index", type=Path, default=Path(__file__).parent / "filtered_index.json")
    parser.add_argument("--output", type=Path, default=Path(__file__).parent / "norm_stats.json")
    args = parser.parse_args()

    stats = compute_stats(args.data_root, args.index)

    with open(args.output, "w") as f:
        json.dump(stats, f, indent=2)
    print(f"\nSaved to {args.output}")
filtered_index.json
ADDED
The diff for this file is too large to render. See raw diff.
norm_stats.json
ADDED
@@ -0,0 +1,38 @@
{
  "observation.state": {
    "mean": [
      3.2129562341482223,
      81.25934383631572,
      97.87567545165706,
      58.2558965428857,
      -3.869688922486154,
      13.552276313577162
    ],
    "std": [
      26.932913188864053,
      85.10186432539234,
      60.096302230313775,
      32.18041942119004,
      64.69174273514702,
      17.38995233769721
    ]
  },
  "action": {
    "mean": [
      3.2667901525244267,
      82.01517467950833,
      96.44080348317482,
      58.19181662702153,
      -3.898391972920288,
      11.117041393936647
    ],
    "std": [
      27.026112586762707,
      85.80857081004108,
      60.86058528648729,
      32.566689386004555,
      64.99547212544971,
      17.279498490768535
    ]
  }
}
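These stats feed MEAN_STD normalization: inputs are normalized as (x - mean) / std, and policy outputs are mapped back with the inverse. A small sketch using the file above:

```python
import json
import numpy as np

with open("norm_stats.json") as f:
    stats = json.load(f)

action_mean = np.array(stats["action"]["mean"])
action_std = np.array(stats["action"]["std"])

def normalize(action):
    return (action - action_mean) / action_std

def denormalize(action_norm):
    # Inverse transform, applied to policy outputs at inference time.
    return action_norm * action_std + action_mean
```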
so100_dataset.py
ADDED
@@ -0,0 +1,312 @@
#!/usr/bin/env python3
"""
Custom PyTorch Dataset that reads directly from community_dataset_v3 v2.1 files on disk.
No merging, no conversion, no copying. Just reads parquets + decodes video frames.

Returns raw (unnormalized) data in the format LeRobotDataset returns — the existing
Pi0.5 preprocessor handles normalization, padding, tokenization, and device placement.

Provides a .meta adapter so lerobot_train.py can use it as a drop-in replacement.
"""

import json
from pathlib import Path

import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset


class _DatasetMeta:
    """
    Lightweight adapter that provides the .meta interface lerobot_train.py expects.
    Wraps our filtered index + precomputed stats.
    """

    def __init__(self, index: dict, stats: dict, data_root: Path):
        self.repo_id = "SO100Dataset/local"
        self.root = data_root

        # Stats: the training script expects dict[str, dict[str, torch.Tensor]]
        self.stats = {}
        for key, s in stats.items():
            mean = torch.tensor(s["mean"], dtype=torch.float32)
            std = torch.tensor(s["std"], dtype=torch.float32)
            self.stats[key] = {
                "mean": mean,
                "std": std,
                # The preprocessor may also look for min/max/quantiles.
                # Approximate them from mean/std for MEAN_STD normalization.
                "min": mean - 3 * std,
                "max": mean + 3 * std,
            }

        # Tasks
        self.tasks = pd.DataFrame(
            {"task_index": range(len(index["tasks"]))},
            index=index["tasks"],
        )

        # Features
        self._features = {
            "observation.images.image": {
                "dtype": "video",
                "shape": [3, 480, 640],
                "names": ["channels", "height", "width"],
            },
            "observation.images.image2": {
                "dtype": "video",
                "shape": [3, 480, 640],
                "names": ["channels", "height", "width"],
            },
            "observation.state": {
                "dtype": "float32",
                "shape": [6],
            },
            "action": {
                "dtype": "float32",
                "shape": [6],
            },
            "timestamp": {"dtype": "float32", "shape": []},
            "frame_index": {"dtype": "int64", "shape": []},
            "episode_index": {"dtype": "int64", "shape": []},
            "index": {"dtype": "int64", "shape": []},
            "task_index": {"dtype": "int64", "shape": []},
        }

        self.info = {
            "fps": 30,
            "robot_type": "so100",
            "total_episodes": index["summary"]["episodes"],
            "total_frames": index["summary"]["total_frames"],
        }

    @property
    def fps(self):
        return 30

    @property
    def features(self):
        return self._features

    @property
    def camera_keys(self):
        return ["observation.images.image", "observation.images.image2"]

    @property
    def video_keys(self):
        return ["observation.images.image", "observation.images.image2"]

    @property
    def image_keys(self):
        return []

    @property
    def total_episodes(self):
        return self.info["total_episodes"]

    @property
    def total_frames(self):
        return self.info["total_frames"]

    @property
    def robot_type(self):
        return "so100"


class SO100Dataset(Dataset):
    """
    Loads filtered SO-100/101 episodes from community_dataset_v3 on disk.

    Each sample is one frame with an action chunk of the next `chunk_size` steps.
    Returns raw unnormalized data — the Pi0.5 preprocessor handles normalization.

    Provides a .meta property compatible with lerobot_train.py.
    """

    def __init__(
        self,
        data_root: str | Path,
        index_path: str | Path,
        stats_path: str | Path | None = None,
        video_backend: str = "pyav",
        chunk_size: int = 50,
        image_transforms=None,
    ):
        self.data_root = Path(data_root)
        self.video_backend = video_backend
        self.chunk_size = chunk_size
        self.image_transforms = image_transforms
        self.fps = 30

        # Load the index
        with open(index_path) as f:
            self._index = json.load(f)

        self.tasks = self._index["tasks"]

        # Load the stats
        raw_stats = {}
        if stats_path and Path(stats_path).exists():
            with open(stats_path) as f:
                raw_stats = json.load(f)

        # Create the meta adapter
        self.meta = _DatasetMeta(self._index, raw_stats, self.data_root)

        # Build a flat frame-level index
        self._frame_index = []
        self._episode_offsets = []

        for ep in self._index["episodes"]:
            dataset_path = self.data_root / ep["dataset"]
            ep_idx = ep["episode_index"]
            task = ep["task"]
            task_idx = ep["task_index"]
            num_frames = ep["num_frames"]

            # Only include frames where a full action chunk fits
            valid_frames = max(0, num_frames - self.chunk_size)
            if valid_frames == 0:
                continue

            start = len(self._frame_index)
            self._episode_offsets.append(start)

            for frame_idx in range(valid_frames):
                self._frame_index.append((
                    dataset_path, ep_idx, frame_idx,
                    num_frames, task, task_idx,
                ))

        # Parquet cache (bounded, FIFO eviction)
        self._parquet_cache = {}
        self._cache_max = 200

    def __len__(self):
        return len(self._frame_index)

    @property
    def num_episodes(self):
        return len(self._episode_offsets)

    @property
    def num_frames(self):
        return len(self._frame_index)

    @property
    def episodes(self):
        return None  # Use all episodes (no further filtering)

    @property
    def features(self):
        return self.meta.features

    @property
    def video(self):
        return True

    @property
    def camera_keys(self):
        return self.meta.camera_keys

    @property
    def video_frame_keys(self):
        return self.meta.camera_keys

    def _load_parquet(self, dataset_path: Path, episode_index: int) -> pd.DataFrame:
        """Load and cache a parquet file."""
        key = (str(dataset_path), episode_index)
        if key in self._parquet_cache:
            return self._parquet_cache[key]

        parquet_path = dataset_path / f"data/chunk-000/episode_{episode_index:06d}.parquet"
        df = pd.read_parquet(parquet_path)

        # Evict the oldest entry once the cache is full (dicts preserve insertion order).
        if len(self._parquet_cache) >= self._cache_max:
            oldest_key = next(iter(self._parquet_cache))
            del self._parquet_cache[oldest_key]

        self._parquet_cache[key] = df
        return df

    def _decode_video_frame(self, video_path: Path, timestamp: float) -> torch.Tensor:
        """Decode a single frame from an MP4 at the given timestamp. Returns (C, H, W) float32 [0,1]."""
        if self.video_backend == "torchcodec":
            from torchcodec.decoders import VideoDecoder
            decoder = VideoDecoder(str(video_path))
            frame = decoder.get_frame_played_at(timestamp)
            return frame.data.float() / 255.0
        else:
            import av
            container = av.open(str(video_path))
            stream = container.streams.video[0]
            target_pts = int(timestamp / float(stream.time_base))
            container.seek(target_pts, stream=stream)
            for frame in container.decode(video=0):
                # seek() lands on the keyframe at or before target_pts, so skip
                # decoded frames until we reach the one at the requested time.
                if frame.time is not None and frame.time < timestamp - 1e-4:
                    continue
                arr = frame.to_ndarray(format="rgb24")
                tensor = torch.from_numpy(arr).permute(2, 0, 1).float() / 255.0
                container.close()
                return tensor
            container.close()
            raise RuntimeError(f"Could not decode frame at t={timestamp} from {video_path}")

    def __getitem__(self, idx: int) -> dict:
        dataset_path, ep_idx, frame_idx, num_frames, task, task_idx = self._frame_index[idx]

        df = self._load_parquet(dataset_path, ep_idx)

        # Current frame
        row = df.iloc[frame_idx]
        state = torch.tensor(row["observation.state"], dtype=torch.float32)
        timestamp = float(row["timestamp"])

        # Action chunk: the next chunk_size actions starting from the current frame
        action_end = min(frame_idx + self.chunk_size, len(df))
        action_rows = df.iloc[frame_idx:action_end]
        actions = torch.tensor(
            np.stack(action_rows["action"].values),
            dtype=torch.float32,
        )
        # Pad with the last action if near the episode end
        if actions.shape[0] < self.chunk_size:
            pad = actions[-1:].expand(self.chunk_size - actions.shape[0], -1)
            actions = torch.cat([actions, pad], dim=0)

        # Decode video frames
        video_dir = dataset_path / "videos" / "chunk-000"
        ep_str = f"episode_{ep_idx:06d}.mp4"

        image1 = self._decode_video_frame(
            video_dir / "observation.images.image" / ep_str, timestamp
        )
        image2 = self._decode_video_frame(
            video_dir / "observation.images.image2" / ep_str, timestamp
        )

        if self.image_transforms is not None:
            image1 = self.image_transforms(image1)
            image2 = self.image_transforms(image2)

        return {
            "observation.images.image": image1,   # (3, 480, 640) float32 [0,1]
            "observation.images.image2": image2,  # (3, 480, 640) float32 [0,1]
            "observation.state": state,           # (6,) float32, raw values
            "action": actions,                    # (chunk_size, 6) float32, raw values
            "task": task,                         # str
            "task_index": torch.tensor(task_idx),
            "timestamp": torch.tensor(timestamp),
            "frame_index": torch.tensor(frame_idx),
            "episode_index": torch.tensor(ep_idx),
            "index": torch.tensor(idx),
        }

    def __repr__(self):
        return (
            f"SO100Dataset(\n"
            f"  data_root='{self.data_root}',\n"
            f"  episodes={self.num_episodes},\n"
            f"  frames={self.num_frames:,},\n"
            f"  tasks={len(self.tasks)},\n"
            f"  video_backend='{self.video_backend}',\n"
            f")"
        )
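A quick usage sketch for the class above (paths as in the README's local run; adjust to your clone):

```python
from so100_dataset import SO100Dataset

ds = SO100Dataset(
    data_root="/home/anon/lap/community_dataset_v3",
    index_path="filtered_index.json",
    stats_path="norm_stats.json",
)
print(ds)                      # episode/frame/task counts
sample = ds[0]
print(sample["task"])          # the episode's task string
print(sample["action"].shape)  # torch.Size([50, 6]) with the default chunk_size
```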