ConorWang committed on
Commit
ae7851f
·
verified ·
1 Parent(s): 03426f9

Upload README.md

Files changed (1)
  1. README.md +332 -48
README.md CHANGED
@@ -2,78 +2,362 @@
2
  license: gemma
3
  language:
4
  - en
5
  ---
6
- # π₀.₅ (Pi05)
7
 
8
- These weights directly come from the Pytorch conversion script of openpi and their `pi05_base` model.
9
 
10
- π₀.₅ is a **Vision-Language-Action model with open-world generalization**, from Physical Intelligence. The LeRobot implementation is adapted from their open source [OpenPI](https://github.com/Physical-Intelligence/openpi) repository.
 
 
11
 
12
- ## Model Overview
 
13
 
14
- π₀.₅ represents a significant evolution from π₀, developed by [Physical Intelligence](https://www.physicalintelligence.company/blog/pi05) to address a big challenge in robotics: **open-world generalization**. While robots can perform impressive tasks in controlled environments, π₀.₅ is designed to generalize to entirely new environments and situations that were never seen during training.
15
 
16
- ### The Generalization Challenge
17
 
18
- As Physical Intelligence explains, the fundamental challenge isn't performing tasks of agility or dexterity, but generalization, the ability to correctly perform tasks in new settings with new objects. Consider a robot cleaning different homes: each home has different objects in different places. Generalization must occur at multiple levels:
 
 
19
 
20
- - **Physical Level**: Understanding how to pick up a spoon (by the handle) or plate (by the edge), even with unseen objects in cluttered environments
21
- - **Semantic Level**: Understanding task semantics, where to put clothes and shoes (laundry hamper, not on the bed), and what tools are appropriate for cleaning spills
22
- - **Environmental Level**: Adapting to "messy" real-world environments like homes, grocery stores, offices, and hospitals
23
 
24
- ### Co-Training on Heterogeneous Data
 
25
 
26
- The breakthrough innovation in π₀.₅ is **co-training on heterogeneous data sources**. The model learns from:
27
 
28
- 1. **Multimodal Web Data**: Image captioning, visual question answering, object detection
29
- 2. **Verbal Instructions**: Humans coaching robots through complex tasks step-by-step
30
- 3. **Subtask Commands**: High-level semantic behavior labels (e.g., "pick up the pillow" for an unmade bed)
31
- 4. **Cross-Embodiment Robot Data**: Data from various robot platforms with different capabilities
32
- 5. **Multi-Environment Data**: Static robots deployed across many different homes
33
- 6. **Mobile Manipulation Data**: ~400 hours of mobile robot demonstrations
34
 
35
- This diverse training mixture creates a "curriculum" that enables generalization across physical, visual, and semantic levels simultaneously.
 
 
 
 
36
 
 
37
 
38
- ## Training
39
 
40
- Here's a complete training command for finetuning the base π₀.₅ model on your own dataset:
41
 
42
  ```bash
43
- python src/lerobot/scripts/train.py \
44
- --dataset.repo_id=your_dataset \
45
- --policy.type=pi05 \
46
- --output_dir=./outputs/pi05_training \
47
- --job_name=pi05_training \
48
- --policy.repo_id=your_repo_id \
49
- --policy.pretrained_path=lerobot/pi05_base \
50
- --policy.compile_model=true \
51
- --policy.gradient_checkpointing=true \
52
- --wandb.enable=true \
53
- --policy.dtype=bfloat16 \
54
- --steps=3000 \
55
- --policy.scheduler_decay_steps=3000 \
56
- --policy.device=cuda \
57
- --batch_size=32
58
  ```
59
 
60
- ## Citation
61
 
62
- If you use this model, please cite the original OpenPI work:
 
 
 
 
 
 
 
 
 
63
 
64
- ```bibtex
65
- @article{openpi2024,
66
- title={Open-World Robotic Manipulation with Vision-Language-Action Models},
67
- author={Physical Intelligence},
68
- year={2024},
69
- url={https://github.com/Physical-Intelligence/openpi}
70
  }
 
 
 
 
 
 
 
 
71
  ```
72
 
73
- ## Original Repository
 
 
74
 
75
- [OpenPI GitHub Repository](https://github.com/Physical-Intelligence/openpi)
76
 
77
- ## License
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
78
 
79
- This model follows the same license as the original OpenPI repository.
2
  license: gemma
3
  language:
4
  - en
5
+ tags:
6
+ - vision-language-action
7
+ - humanoid-robotics
8
+ - telepathy
9
+ - multimodal
10
+ - robotics-control
11
+ - lora
12
+ - pytorch
13
+ base_model: lerobot/pi05_base
14
+ datasets:
15
+ - lerobot/svla_so101_pickplace
16
+ library_name: transformers
17
+ pipeline_tag: other
18
+ author: "Libo Wang"
19
  ---
 
20
 
21
+ # Sigma: The Key for Vision–Language–Action Models toward Telepathy
22
 
23
+ [![Model Card](https://img.shields.io/badge/HF-Sigma-orange?logo=huggingface)](https://huggingface.co/Veltraxor/Sigma)
24
+ [![Base Model](https://img.shields.io/badge/base-lerobot%2Fpi05__base-blue)](https://huggingface.co/lerobot/pi05_base)
25
+ [![Dataset](https://img.shields.io/badge/dataset-lerobot%2Fsvla__so101__pickplace-green)](https://huggingface.co/datasets/lerobot/svla_so101_pickplace)
26
 
27
+ Sigma is a **telepathy-style Vision–Language–Action (VLA) model** built on top of `lerobot/pi05_base`.
28
+ It adds a semantic “telepathy” path and LoRA adapters that steer continuous robot control using internal **semantic memory** and **intent states**, while keeping the original π0.5 backbone weights intact and recoverable.
29
 
30
+ ---
31
+
32
+ ## 1. Summary
33
+
34
+ - **Base policy**: `lerobot/pi05_base` (π0.5)
35
+ - **Author**: **Libo Wang**
36
+ - **GPU for training**: single RTX 4090 (24 GB)
37
+ - **Data**: `lerobot/svla_so101_pickplace`
38
+ - **Objective**:
39
+ Make a π0.5-style VLA **use internal semantic & intent states** to refine continuous control, rather than only imitating trajectories.
40
+
41
+ Sigma keeps the perception and control structure of π0.5, and introduces an additional pathway that:
42
+
43
+ - fuses **vision, language, and robot state** into a shared latent sequence,
44
+ - maintains a **semantic state** \(m_t\) and an **intent vector** \(z_{\text{intent}}\) over time,
45
+ - converts them into **telepathy factors** that modulate the policy’s action outputs as residual corrections.
46
+
47
+ ---
48
+
49
+ ## 2. Architecture at a Glance
50
+
51
+ Sigma can be seen as **π0.5 + telepathic head + LoRA adapters**:
52
+
53
+ - **Vision / State stream**
54
+ - reuse π0.5 encoders for images and robot state;
55
+ - add FiLM-style modulation from telepathy factors on vision tokens.
56
+
57
+ - **Language–semantic stream**
58
+ - take text tokens, vision tokens, and state tokens into a shared MLLM backbone;
59
+ - derive:
60
+ - a **semantic memory** \(m_t\) that accumulates cross-time information,
61
+ - an **intent vector** \(z_{\text{intent}}\),
62
+ - pooled **semantic factors** aligned with the text embedding space.
63
+
64
+ - **Action stream (three branches)**
65
+ - treat π0.5 outputs as **baseline**:
66
+ - action vector (per-step),
67
+ - action chunk (short horizon),
68
+ - action trajectory (full horizon);
69
+ - learn **residual actions** driven by telepathy factors on all three branches.
70
 
71
+ The resulting policy still *looks like* π0.5 from the outside (same inputs, same output types), but actions are now corrected by an internal telepathy pathway that is aware of **deep semantics and associative intent**.
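
The FiLM-style modulation mentioned for the vision stream can be illustrated with a minimal, dependency-free sketch. The residual form `(1 + gamma) * x + beta` and the names `film_modulate`, `gamma`, and `beta` are assumptions made for illustration; the README does not say how Sigma derives the scale and shift from the telepathy factors.

```python
from typing import List


def film_modulate(
    tokens: List[List[float]], gamma: List[float], beta: List[float]
) -> List[List[float]]:
    """Feature-wise scale and shift: out[i][d] = (1 + gamma[d]) * tokens[i][d] + beta[d].

    gamma and beta are hypothetical per-feature values that would be
    predicted from the telepathy factors; with both at zero the tokens
    pass through unchanged.
    """
    return [
        [(1.0 + g) * x + b for x, g, b in zip(token, gamma, beta)]
        for token in tokens
    ]


# Two toy vision tokens with two features each.
vision_tokens = [[1.0, 2.0], [3.0, 4.0]]

# Zero modulation leaves the tokens untouched.
identity = film_modulate(vision_tokens, gamma=[0.0, 0.0], beta=[0.0, 0.0])

# Scale feature 0 by 1.5 and shift feature 1 by -1.
shifted = film_modulate(vision_tokens, gamma=[0.5, 0.0], beta=[0.0, -1.0])
```

Writing the scale as `1 + gamma` means a zero prediction falls back to the unmodulated π0.5 features, which fits the stated goal of keeping the base behavior recoverable.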
72
 
73
+ ---
74
+
75
+ ## 3. Training Setup
76
 
77
+ ### 3.1 Dataset & preprocessing
 
 
78
 
79
+ - **Upstream dataset**: `lerobot/svla_so101_pickplace`
80
+ - **Task**: pick-and-place style manipulation with multi-frame RGB + robot state + continuous actions.
81
 
82
+ A preprocessing script (`dataset_preprocess_sigma_vla.py`) performs the following steps:
83
 
84
+ - sliding-window segmentation with horizon `T = 16`,
85
+ - filtering out windows with nearly zero action norm to remove static segments,
86
+ - packing vision frames, robot state, and 3-scale action targets into tensor batches,
87
+ - exporting three sharded files:
 
 
88
 
89
+ ```text
90
+ storage/sigma_pickplace/shard_00000.pt
91
+ storage/sigma_pickplace/shard_00001.pt
92
+ storage/sigma_pickplace/shard_00002.pt
93
+ ```
94
 
95
+ These shards are the **only** data used for Sigma training and evaluation.
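
The windowing and static-segment filtering steps can be sketched roughly as follows. This is not the code of `dataset_preprocess_sigma_vla.py`: the `stride` and `min_norm` parameters and both function names are assumptions, and only the horizon `T = 16` comes from the README.

```python
import math
from typing import List, Tuple

Action = List[float]


def action_norm(window: List[Action]) -> float:
    """L2 norm of all action values in a window, flattened."""
    return math.sqrt(sum(a * a for step in window for a in step))


def sliding_windows(
    actions: List[Action], horizon: int = 16, stride: int = 1, min_norm: float = 1e-3
) -> List[Tuple[int, List[Action]]]:
    """Return (start_index, window) pairs whose action norm exceeds min_norm."""
    kept = []
    for start in range(0, len(actions) - horizon + 1, stride):
        window = actions[start : start + horizon]
        if action_norm(window) > min_norm:  # drop (nearly) static segments
            kept.append((start, window))
    return kept


# Toy episode: 20 static steps followed by 20 moving steps (2-D actions).
episode = [[0.0, 0.0]] * 20 + [[0.1, -0.1]] * 20
windows = sliding_windows(episode, horizon=16, stride=4)
starts = [start for start, _ in windows]  # windows at 0 and 4 are fully static
```

In this toy episode the windows starting at steps 0 and 4 contain only zero actions and are filtered out, while every later window overlaps the moving segment and is kept.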
96
 
97
+ ### 3.2 LoRA fine-tuning (Sigma training)
98
 
99
+ Training is performed on a **single RTX 4090** using `train_sigma_telepathy_vla_lora.py`:
100
 
101
  ```bash
102
+ python train_sigma_telepathy_vla_lora.py \
103
+ --base_model_id lerobot/pi05_base \
104
+ --dataset_dir /workspace/storage/sigma_pickplace \
105
+ --output_dir /workspace/storage/sigma_lora_out \
106
+ --batch_size 4 \
107
+ --gradient_accumulation_steps 4 \
108
+ --max_steps 300 \
109
+ --dtype bf16
 
 
 
 
 
 
 
110
  ```
111
 
112
+ Key aspects:
113
 
114
+ - freeze backbone weights from `lerobot/pi05_base`;
115
+ - attach **LoRA** on key projections (`q`, `k`, `v`, `o`) and the telepathy heads;
116
+ - jointly optimize:
117
+ - **three control losses**:
118
+ - `L_act_vec` for per-step action vectors,
119
+ - `L_act_chk` for short-horizon chunks,
120
+ - `L_act_trj` for full trajectories;
121
+ - **semantic & telepathy regularizers**:
122
+ - alignment of semantic factors with text embeddings,
123
+ - control of telepathy factor norm `tau_l2`.
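
One way to picture the joint objective is a weighted sum of the three control losses plus the two regularizers. The sketch below is an illustration under assumptions: the weight names, the linear penalty on `tau_l2`, and the `1 - sem_align` alignment term are hypothetical choices, not the formula used in `train_sigma_telepathy_vla_lora.py`.

```python
from typing import Dict, Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error over two equally long, flattened sequences."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equally long")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def combined_loss(
    preds: Dict[str, Sequence[float]],
    targets: Dict[str, Sequence[float]],
    tau_l2: float,
    sem_align: float,
    w: Dict[str, float],
) -> float:
    """Weighted sum of three control losses and two regularizers (assumed form)."""
    l_act_vec = mse(preds["vec"], targets["vec"])  # per-step action vectors
    l_act_chk = mse(preds["chk"], targets["chk"])  # short-horizon chunks
    l_act_trj = mse(preds["trj"], targets["trj"])  # full trajectories
    control = w["vec"] * l_act_vec + w["chk"] * l_act_chk + w["trj"] * l_act_trj
    # Penalize large telepathy factors and low semantic/text alignment.
    regularizers = w["tau"] * tau_l2 + w["sem"] * (1.0 - sem_align)
    return control + regularizers


preds = {"vec": [1.0, 2.0], "chk": [0.0] * 4, "trj": [0.5, 0.5, 0.5]}
targets = {"vec": [1.0, 4.0], "chk": [1.0] * 4, "trj": [0.5, 0.5, 0.5]}
weights = {"vec": 1.0, "chk": 1.0, "trj": 1.0, "tau": 0.1, "sem": 1.0}
total = combined_loss(preds, targets, tau_l2=2.0, sem_align=0.5, w=weights)
```

With these toy values the control terms contribute 2.0 + 1.0 + 0.0 and the regularizers 0.2 + 0.5, giving a total of about 3.7.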
124
 
125
+ All LoRA and telepathy parameters are stored under:
126
+
127
+ ```text
128
+ storage/sigma_lora_out/
129
+ sigma_telepathy_heads.pt
130
+ adapter_config.json
131
+ adapter_model.bin
132
+ ...
133
+ ```
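
For readers unfamiliar with LoRA, the following dependency-free sketch shows why the frozen backbone stays recoverable: the adapter adds a scaled low-rank product to the output of an unchanged weight matrix. It follows the standard LoRA formulation `y = W x + (alpha / r) * B (A x)`; the toy matrices and the `alpha` and `rank` values are made up and are not taken from `adapter_config.json`.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def lora_forward(
    x: Vector, w_frozen: Matrix, lora_a: Matrix, lora_b: Matrix, alpha: float, rank: int
) -> Vector:
    """y = W x + (alpha / rank) * B (A x); W itself is never modified."""
    base = matvec(w_frozen, x)
    delta = matvec(lora_b, matvec(lora_a, x))
    scale = alpha / rank
    return [b + scale * d for b, d in zip(base, delta)]


w = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 projection (toy values)
a = [[0.5, 0.5]]               # rank-1 down-projection, shape 1x2
b_zero = [[0.0], [0.0]]        # zero-initialized up-projection, shape 2x1
b_trained = [[1.0], [-1.0]]    # pretend values after fine-tuning
x = [2.0, 4.0]

y_init = lora_forward(x, w, a, b_zero, alpha=2.0, rank=1)      # equals W x
y_tuned = lora_forward(x, w, a, b_trained, alpha=2.0, rank=1)  # W x plus low-rank delta
```

Because `B` starts at zero, the adapted layer initially reproduces the base output exactly, and removing the adapter at any time restores the original projection.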
134
+
135
+ ### 3.3 Telepathy-aware training logic
136
+
137
+ Two key training mechanisms are implemented inside the loss:
138
+
139
+ - **Telepathic Residual Action Focusing (TRAF)**
140
+ Focuses learning on *residual actions* instead of full actions, and uses **hard-sample mining** (top-k error segments) to allocate more gradient budget to difficult control windows.
141
+
142
+ - **Telepathic Semantic Alignment Curriculum (TSAC)**
143
+ Gradually increases the weights of:
144
+ - semantic memory–text alignment,
145
+ - intent–telepathy alignment,
146
+ while maintaining action regression as the primary objective early on.
147
+ Late in training, Sigma is encouraged to let **internal semantic/intent structure** drive the residual corrections.
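
Both mechanisms can be sketched in a few lines. The code below is illustrative only: `topk_hard_mean`, `curriculum_weight`, the linear ramp, and the `warmup_frac` parameter are assumptions, since the README does not state the schedule shape or the value of k. The 300-step horizon mirrors `--max_steps 300` from the training command.

```python
from typing import List


def topk_hard_mean(per_window_errors: List[float], k: int) -> float:
    """Average the k largest per-window errors (hard-sample mining)."""
    if k <= 0 or not per_window_errors:
        raise ValueError("need k > 0 and at least one error value")
    hardest = sorted(per_window_errors, reverse=True)[:k]
    return sum(hardest) / len(hardest)


def curriculum_weight(
    step: int, max_steps: int, final_weight: float, warmup_frac: float = 0.5
) -> float:
    """Alignment-loss weight: zero during warmup, then a linear ramp to final_weight."""
    start = warmup_frac * max_steps
    if step <= start:
        return 0.0  # early on, action regression is the only active objective
    progress = (step - start) / (max_steps - start)
    return final_weight * min(1.0, progress)


# TRAF-style focus: only the two worst residual errors drive the loss.
residual_errors = [0.1, 0.9, 0.4, 0.7]
hard_loss = topk_hard_mean(residual_errors, k=2)

# TSAC-style schedule over a 300-step run.
schedule = [curriculum_weight(s, 300, final_weight=1.0) for s in (0, 150, 225, 300)]
```

Here the alignment weight stays at zero for the first half of training, reaches half strength at step 225, and reaches its final value at step 300, while the mined loss averages the two largest errors (0.9 and 0.7).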
148
+
149
+ ---
150
+
151
+ ## 4. Inference-time Telepathy Adapter
152
+
153
+ A lightweight adapter (`sigma_adapter.py`) controls how much the telepathy residuals are allowed to modify the baseline π0.5 actions:
154
+
155
+ - reads:
156
+ - baseline π0.5 actions (`base_action_vector`, …),
157
+ - Sigma residuals,
158
+ - telepathy diagnostics (norms, cosine alignments),
159
+ - computes a **risk-aware scaling factor** in `[min_scale, max_scale]`,
160
+ - blends:
161
+
162
+ ```python
163
+ action = base_action + scale * telepathy_residual
164
+ ```
165
+
166
+ If residuals are too large or misaligned, `scale` is pushed toward 0, effectively reverting to π0.5 behavior.
167
+ If residuals are moderate and well aligned, `scale` approaches 1, enabling telepathy-enhanced control.
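
A possible shape for such a scaling rule is sketched below. It is a guess at the behavior described above, not the logic of `sigma_adapter.py`: the risk formula, the `max_ratio` threshold, and the scalar `alignment` diagnostic are hypothetical, while `min_scale`, `max_scale`, and `risk_temperature` reuse the argument names from the usage example in this README.

```python
import math
from typing import List


def risk_aware_scale(
    residual_norm: float,
    base_norm: float,
    alignment: float,
    min_scale: float = 0.0,
    max_scale: float = 1.0,
    risk_temperature: float = 1.0,
    max_ratio: float = 0.5,
) -> float:
    """Map residual size and alignment to a scale in [min_scale, max_scale].

    alignment is assumed to be a cosine-style diagnostic in [-1, 1];
    risk_temperature must be positive.
    """
    ratio = residual_norm / (base_norm + 1e-8)
    size_risk = max(0.0, ratio / max_ratio - 1.0)  # residual too large
    misalign_risk = max(0.0, -alignment)           # residual misaligned
    trust = math.exp(-(size_risk + misalign_risk) / risk_temperature)
    return min_scale + (max_scale - min_scale) * trust


def blend(base_action: List[float], residual: List[float], scale: float) -> List[float]:
    """action = base_action + scale * telepathy_residual, element-wise."""
    return [b + scale * r for b, r in zip(base_action, residual)]


# Moderate, well-aligned residual: full trust.
scale_ok = risk_aware_scale(residual_norm=0.1, base_norm=1.0, alignment=0.9)

# Oversized, misaligned residual: scale collapses toward min_scale.
scale_bad = risk_aware_scale(residual_norm=5.0, base_norm=1.0, alignment=-0.5)

action = blend([1.0, 2.0], [0.5, -0.5], scale_ok)
```

The exponential keeps the scale strictly inside the allowed range, and a larger `risk_temperature` makes the adapter more tolerant of risky residuals.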
168
+
169
+ ---
170
+
171
+ ## 5. Evaluation Protocol
172
+
173
+ Evaluation uses `eval_sigma_vla_rollout.py` in **offline closed-loop replay**:
174
+
175
+ - both Sigma and the baseline:
176
+ - use the *same* preprocessed shards (`shard_0000x.pt`),
177
+ - share the *same* telepathy heads file `sigma_telepathy_heads.pt`,
178
+ - **only Sigma**:
179
+ - loads LoRA weights,
180
+ - activates telepathy residuals and the adapter in control output.
181
+
182
+ ### 5.1 CHECK A – telepathy geometry & alignment sanity
183
+
184
+ CHECK A verifies that **telepathy geometry is identical** between experimental and control runs:
185
+
186
+ - `heads_tensors = 325`
187
+ - `mean ≈ 0.002`, `std ≈ 0.107`, `rms ≈ 0.107` for telepathy head weights
188
+ - `avg_tau_l2 ≈ 51.6` – average L2 norm of telepathy factors
189
+ - `avg_semantic_text_alignment ≈ 0.13` – semantic factor vs. text embedding alignment
190
+
191
+ These values match between the Sigma and π0.5 baseline runs, so differences in behavior cannot be attributed to changes in telepathy parameters or text-alignment geometry.
192
+
193
+ ### 5.2 CHECK B – multiscale control & telepathy metrics
194
+
195
+ CHECK B defines and reports:
196
+
197
+ - `mse_vec` – per-step action vector MSE (fine-grain control precision)
198
+ - `mse_chk` – short segment chunk MSE (local motion consistency)
199
+ - `mse_trj` – full trajectory MSE (long-horizon tracking)
200
+ - `tau_l2` – telepathy factor norms (activation strength)
201
+ - `sem_align` – semantic alignment (e.g., cosine) between semantic factors and text embeddings
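
The two telepathy diagnostics can be computed along these lines. This is a sketch under assumptions: the function names are hypothetical, and averaging per-sample values over the batch is one plausible reading of `avg_tau_l2` and `avg_semantic_text_alignment`, not a description of `eval_sigma_vla_rollout.py`.

```python
import math
from typing import Dict, List

Vector = List[float]


def l2_norm(v: Vector) -> float:
    return math.sqrt(sum(x * x for x in v))


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    nu, nv = l2_norm(u), l2_norm(v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def telepathy_diagnostics(
    telepathy_factors: List[Vector],
    semantic_factors: List[Vector],
    text_embeddings: List[Vector],
) -> Dict[str, float]:
    """Batch averages of the telepathy-factor norm and semantic/text alignment."""
    n = len(telepathy_factors)
    tau_l2 = sum(l2_norm(t) for t in telepathy_factors) / n
    sem_align = sum(
        cosine_similarity(s, e) for s, e in zip(semantic_factors, text_embeddings)
    ) / n
    return {"tau_l2": tau_l2, "sem_align": sem_align}


# Toy batch of two samples with 2-D factors.
metrics = telepathy_diagnostics(
    telepathy_factors=[[3.0, 4.0], [0.0, 0.0]],
    semantic_factors=[[1.0, 0.0], [1.0, 0.0]],
    text_embeddings=[[1.0, 0.0], [0.0, 1.0]],
)
```

In the toy batch the norms are 5 and 0 and the cosines are 1 and 0, so the averages come out at 2.5 and 0.5.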
202
+
203
+ On the same 723 samples and 181 batches:
204
+
205
+ - Sigma shows **consistently lower `mse_vec`, `mse_chk`, `mse_trj`** than the baseline,
206
+ - while **`tau_l2` and `sem_align` remain similar** between both models.
207
+
208
+ This pattern supports the interpretation that Sigma **uses the same semantic / telepathy geometry more effectively**, converting it into tangible gains in control accuracy instead of merely altering the embedding space.
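
As a quick consistency check, the reported 723 samples and 181 batches fit an evaluation batch size of 4 with the final partial batch kept. That batch size is an assumption borrowed from the training command (`--batch_size 4`); the README does not state the evaluation batch size, and `num_batches` is a hypothetical helper.

```python
import math


def num_batches(num_samples: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of batches a simple sequential loader would yield."""
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)


# Assumed evaluation batch size of 4, keeping the last partial batch.
eval_batches = num_batches(723, 4)

# Training side: batch_size 4 with gradient_accumulation_steps 4.
effective_train_batch = 4 * 4
```

If the last incomplete batch were dropped instead, the count would be 180, so the reported figure suggests that all 723 samples were evaluated.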
209
+
210
+ ---
211
+
212
+ ## 6. How to Use Sigma
213
+
214
+ > ⚠️ You must have access to `lerobot/pi05_base` and the preprocessed shards or an equivalent environment to reproduce full experiments.
215
+
216
+ ### 6.1 Installation (example)
217
+
218
+ ```bash
219
+ # base env
220
+ pip install "transformers>=4.40.0" accelerate torch torchvision
221
+ pip install lerobot
222
+
223
+ # clone this repository (example path)
224
+ git clone https://github.com/Veltraxor/Sigma.git
225
+ cd Sigma
226
+ ```
227
+
228
+ ### 6.2 Loading Sigma on top of π0.5
229
+
230
+ ```python
231
+ import torch
232
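+ # Import path as in recent LeRobot releases; adjust if your version differs
+ from lerobot.policies.pi05.modeling_pi05 import PI05Policy
233
+ # Module names follow the repository layout in section 7 (assumed locations)
+ from sigma_telepathy_vla import SigmaTelepathyVLA
+ from sigma_adapter import SigmaTelepathyAdapter
234
+
235
+ device = "cuda"
236
+ dtype = torch.bfloat16
237
+
238
+ # 1. Load base π0.5 policy
239
+ base_policy = PI05Policy.from_pretrained("lerobot/pi05_base")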
240
+
241
+ # 2. Build Sigma on top of the base policy
242
+ sigma_policy = SigmaTelepathyVLA.from_base(
243
+ base_policy=base_policy,
244
+ lora_dir="./storage/sigma_lora_out",
245
+ telepathy_heads_path="./storage/sigma_lora_out/sigma_telepathy_heads.pt",
246
+ device=device,
247
+ dtype=dtype,
248
+ )
249
+
250
+ # 3. Optional runtime adapter
251
+ adapter = SigmaTelepathyAdapter(
252
+ min_scale=0.0,
253
+ max_scale=1.0,
254
+ risk_temperature=1.0,
255
+ )
256
+
257
+ # 4. Single batch forward (offline replay); the three values below are placeholders you must supply
258
+ batch = {
259
+ "vis_obs": vis_obs_tensor, # [B, T, C, H, W]
260
+ "robot_state": robot_state_tensor, # [B, T, D_state]
261
+ "texts": list_of_text_prompts, # length B
262
  }
263
+
264
+ with torch.no_grad():
265
+ out = sigma_policy(**batch, use_telepathy=True)
266
+ blended_action = adapter(
267
+ base_action_vector=out["base_action_vector"],
268
+ telepathy_residual=out["telepathy_residual_vector"],
269
+ telepathy_factors=out["telepathy_factors"],
270
+ )
271
  ```
272
 
273
+ ---
274
+
275
+ ## 7. Repository Layout (typical)
276
 
277
+ A typical Sigma repo / model card includes:
278
 
279
+ ```text
280
+ README.md # this file
281
+ sigma_env.example # example env file for HF tokens, paths
282
+ dataset_preprocess_sigma_vla.py
283
+ train_sigma_telepathy_vla_lora.py
284
+ eval_sigma_vla_rollout.py
285
+ sigma_telepathy_vla.py # model definition
286
+ sigma_adapter.py # inference-time adapter
287
+
288
+ storage/
289
+ sigma_pickplace/
290
+ shard_00000.pt
291
+ shard_00001.pt
292
+ shard_00002.pt
293
+ sigma_lora_out/
294
+ sigma_telepathy_heads.pt
295
+ adapter_config.json
296
+ adapter_model.bin
297
+ ...
298
+
299
+ logs/
300
+ sigma_eval_report.json
301
+ sigma_eval_checkA.json
302
+ sigma_eval_checkB.json
303
+ ```
304
+
305
+ You can adapt this layout to your own environment; the key assumption is that **Sigma is always loaded as a LoRA + telepathy delta on top of `lerobot/pi05_base`**.
306
+
307
+ ---
308
 
309
+ ## 8. Intended Use, Risks, and Limitations
310
+
311
+ - **Intended use**
312
+ Sigma is intended for **research and experimentation** on:
313
+ - semantic / telepathy-style control in VLA systems,
314
+ - offline trajectory analysis and simulation,
315
+ - early-stage humanoid / manipulator control studies.
316
+
317
+ - **Not intended for**
318
+ - direct deployment on physical robots **without additional safety layers**;
319
+ - safety-critical or human-facing applications.
320
+
321
+ - **Known limitations**
322
+ - trained only on `svla_so101_pickplace`;
323
+ - evaluated only in offline replay;
324
+ - telepathy path tuned for a single task family and embodiment.
325
+
326
+ Users should treat Sigma as a **proof-of-concept** that demonstrates how “deep semantic + associative intent” can be engineered into residual control, not as a generic controller.
327
+
328
+ ---
329
+
330
+ ## 9. Author & Acknowledgements
331
+
332
+ - **Author**: **Libo Wang**
333
+ - Base policy and dataset by **Physical Intelligence / LeRobot** teams.
334
+ - Training environment based on a single RTX 4090 GPU; all scripts are structured to be portable to other single-GPU or multi-GPU setups with minimal changes.
335
+
336
+ ---
337
+
338
+ ## 10. Citation
339
+
340
+ If you use Sigma, please cite both the original π0.5 / OpenPI work and this Sigma extension.
341
+
342
+ **π0.5 / OpenPI:**
343
+
344
+ ```bibtex
345
+ @article{openpi2024,
346
+ title = {Open-World Robotic Manipulation with Vision-Language-Action Models},
347
+ author = {Physical Intelligence},
348
+ year = {2024},
349
+ url = {https://github.com/Physical-Intelligence/openpi}
350
+ }
351
+ ```
352
+
353
+ **Sigma (example entry):**
354
+
355
+ ```bibtex
356
+ @article{sigma2025,
357
+ title = {Sigma: The Key for Vision--Language--Action Models toward Telepathy},
358
+ author = {Wang, Libo},
359
+ year = {2025},
360
+ note = {Telepathy-style extension of lerobot/pi05_base},
361
+ url = {https://huggingface.co/Veltraxor/Sigma}
362
+ }
363
+ ```