---
license: gemma
language:
- en
tags:
- vision-language-action
- humanoid-robotics
- telepathy
- multimodal
- robotics-control
- lora
- pytorch
base_model: lerobot/pi05_base
datasets:
- lerobot/svla_so101_pickplace
library_name: transformers
pipeline_tag: other
author: "Libo Wang"
---

# Sigma: The Key for Vision–Language–Action Models toward Telepathy

[![Model Card](https://img.shields.io/badge/HF-Sigma-orange?logo=huggingface)](https://huggingface.co/Veltraxor/Sigma)
[![Base Model](https://img.shields.io/badge/base-lerobot%2Fpi05__base-blue)](https://huggingface.co/lerobot/pi05_base)
[![Dataset](https://img.shields.io/badge/dataset-lerobot%2Fsvla__so101__pickplace-green)](https://huggingface.co/datasets/lerobot/svla_so101_pickplace)

Sigma is a **telepathy-style Vision–Language–Action (VLA) model** built on top of `lerobot/pi05_base`.  
It adds a semantic “telepathy” path and LoRA adapters that steer continuous robot control using internal **semantic memory** and **intent states**, while keeping the original π0.5 backbone weights intact and recoverable.

---

## 1. Summary

- **Base policy**: `lerobot/pi05_base` (π0.5)  
- **Author**: **Libo Wang**  
- **GPU for training**: single RTX 4090 (24GB)  
- **Data**: `lerobot/svla_so101_pickplace`  
- **Objective**:  
  Make a π0.5-style VLA **use internal semantic & intent states** to refine continuous control, rather than only imitating trajectories.

Sigma keeps the perception and control structure of π0.5, and introduces an additional pathway that:

- fuses **vision, language, and robot state** into a shared latent sequence,
- maintains a **semantic state** m_t and an **intent vector** z_intent over time,
- converts them into **telepathy factors** that modulate the policy’s action outputs as residual corrections.
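The recurrence over the semantic state m_t can be pictured as a simple gated update. The sketch below is purely illustrative — the fusion input and gate structure are assumptions, not the actual Sigma implementation:

```python
import torch
import torch.nn as nn

class SemanticMemoryCell(nn.Module):
    """Illustrative gated update of the semantic state m_t (not the real Sigma code)."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)   # decides how much new info to absorb
        self.cand = nn.Linear(2 * dim, dim)   # candidate update from the fused latent

    def forward(self, m_prev: torch.Tensor, fused: torch.Tensor) -> torch.Tensor:
        x = torch.cat([m_prev, fused], dim=-1)
        g = torch.sigmoid(self.gate(x))
        # blend the old memory with a candidate derived from the fused latent
        return (1 - g) * m_prev + g * torch.tanh(self.cand(x))

cell = SemanticMemoryCell(dim=8)
m = torch.zeros(2, 8)                 # batch of 2, semantic dim 8
for _ in range(4):                    # roll the memory over 4 timesteps
    m = cell(m, torch.randn(2, 8))
print(m.shape)                        # torch.Size([2, 8])
```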

---

## 2. Architecture at a Glance

Sigma can be seen as **π0.5 + telepathic head + LoRA adapters**:

- **Vision / State stream**  
  - reuse π0.5 encoders for images and robot state;  
  - add FiLM-style modulation from telepathy factors on vision tokens.

- **Language–semantic stream**  
  - take text tokens, vision tokens, and state tokens into a shared MLLM backbone;  
  - derive:
    - a **semantic memory** m_t that accumulates cross-time information,
    - an **intent vector** z_intent,
    - pooled **semantic factors** aligned with the text embedding space.

- **Action stream (three branches)**  
  - treat π0.5 outputs as **baseline**:  
    - action vector (per-step),  
    - action chunk (short horizon),  
    - action trajectory (full horizon);  
  - learn **residual actions** driven by telepathy factors on all three branches.

The resulting policy still *looks like* π0.5 from the outside (same inputs, same output types), but actions are now corrected by an internal telepathy pathway that is aware of **deep semantics and associative intent**.
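The FiLM-style modulation of vision tokens by telepathy factors mentioned above can be sketched as a per-channel scale-and-shift; the factor-to-(γ, β) projection here is an assumption for illustration, not Sigma's actual layer:

```python
import torch
import torch.nn as nn

class TelepathyFiLM(nn.Module):
    """Illustrative FiLM layer: telepathy factors produce a per-channel scale and shift."""
    def __init__(self, factor_dim: int, token_dim: int):
        super().__init__()
        self.to_gamma_beta = nn.Linear(factor_dim, 2 * token_dim)

    def forward(self, vision_tokens: torch.Tensor, factors: torch.Tensor) -> torch.Tensor:
        # vision_tokens: [B, N, D]; factors: [B, F]
        gamma, beta = self.to_gamma_beta(factors).chunk(2, dim=-1)
        # broadcast the modulation over all N vision tokens
        return gamma.unsqueeze(1) * vision_tokens + beta.unsqueeze(1)

film = TelepathyFiLM(factor_dim=16, token_dim=32)
tokens = torch.randn(2, 10, 32)       # 2 samples, 10 vision tokens
out = film(tokens, torch.randn(2, 16))
print(out.shape)                      # torch.Size([2, 10, 32])
```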

---

## 3. Training Setup

### 3.1 Dataset & preprocessing

- **Upstream dataset**: `lerobot/svla_so101_pickplace`  
- **Task**: pick-and-place style manipulation with multi-frame RGB + robot state + continuous actions.

A preprocessing script (`dataset_preprocess_sigma_vla.py`) does:

- sliding-window segmentation with horizon `T = 16`,  
- filtering out windows with nearly zero action norm to remove static segments,  
- packing vision frames, robot state, and 3-scale action targets into tensor batches,  
- exporting three sharded files:

```text
storage/sigma_pickplace/shard_00000.pt
storage/sigma_pickplace/shard_00001.pt
storage/sigma_pickplace/shard_00002.pt
```

These shards are the **only** data used for Sigma training and evaluation.
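The windowing and static-segment filter can be sketched roughly as below; the threshold value and array layout are illustrative, not taken from `dataset_preprocess_sigma_vla.py`:

```python
import numpy as np

def make_windows(actions: np.ndarray, horizon: int = 16, min_action_norm: float = 1e-3):
    """Slide a horizon-T window over an episode and drop near-static segments."""
    windows = []
    for start in range(len(actions) - horizon + 1):
        chunk = actions[start:start + horizon]
        # mean per-step L2 norm; near-zero means the robot barely moved
        if np.linalg.norm(chunk, axis=-1).mean() > min_action_norm:
            windows.append(chunk)
    return windows

np.random.seed(0)
# synthetic 7-DoF episode: 20 static steps followed by 40 moving steps
episode = np.concatenate([np.zeros((20, 7)), np.random.randn(40, 7)])
kept = make_windows(episode, horizon=16)
print(len(kept))  # windows lying entirely in the static prefix are filtered out
```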

### 3.2 LoRA fine-tuning (Sigma training)

Training is performed on a **single RTX 4090** using `train_sigma_telepathy_vla_lora.py`:

```bash
python train_sigma_telepathy_vla_lora.py \
  --base_model_id lerobot/pi05_base \
  --dataset_dir /workspace/storage/sigma_pickplace \
  --output_dir /workspace/storage/sigma_lora_out \
  --batch_size 4 \
  --gradient_accumulation_steps 4 \
  --max_steps 300 \
  --dtype bf16
```

Key aspects:

- freeze backbone weights from `lerobot/pi05_base`;  
- attach **LoRA** adapters to the attention projections (q, k, v, o) and the telepathy heads;  
- jointly optimize:
  - **three control losses**:
    - `L_act_vec` for per-step action vectors,
    - `L_act_chk` for short-horizon chunks,
    - `L_act_trj` for full trajectories;
  - **semantic & telepathy regularizers**:
    - alignment of semantic factors with text embeddings,
    - control of telepathy factor norm `tau_l2`.
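The joint objective above can be summarized as a weighted sum; the sketch below uses placeholder weights and tensor shapes, not the tuned values from the training script:

```python
import torch
import torch.nn.functional as F

def sigma_loss(pred, target, sem_factors, text_emb, tau, w_sem=0.1, w_tau=0.01):
    """Illustrative combination of the three control losses and the two regularizers."""
    l_vec = F.mse_loss(pred["vec"], target["vec"])   # per-step action vectors
    l_chk = F.mse_loss(pred["chk"], target["chk"])   # short-horizon chunks
    l_trj = F.mse_loss(pred["trj"], target["trj"])   # full trajectories
    # semantic factors should point in the direction of the text embedding
    l_sem = 1 - F.cosine_similarity(sem_factors, text_emb, dim=-1).mean()
    # keep the telepathy factor norm (tau_l2) from exploding
    l_tau = tau.norm(dim=-1).mean()
    return l_vec + l_chk + l_trj + w_sem * l_sem + w_tau * l_tau

B, T, D = 4, 16, 7
pred = {"vec": torch.randn(B, D), "chk": torch.randn(B, 4, D), "trj": torch.randn(B, T, D)}
tgt  = {"vec": torch.randn(B, D), "chk": torch.randn(B, 4, D), "trj": torch.randn(B, T, D)}
loss = sigma_loss(pred, tgt, torch.randn(B, 64), torch.randn(B, 64), torch.randn(B, 32))
print(loss.item())
```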

All LoRA and telepathy parameters are stored under:

```text
storage/sigma_lora_out/
  sigma_telepathy_heads.pt
  adapter_config.json
  adapter_model.bin
  ...
```

### 3.3 Telepathy-aware training logic

Two key training mechanisms are implemented inside the loss:

- **Telepathic Residual Action Focusing (TRAF)**  
  Focuses learning on *residual actions* instead of full actions, and uses **hard-sample mining** (top-k error segments) to allocate more gradient budget to difficult humanoid control windows.

- **Telepathic Semantic Alignment Curriculum (TSAC)**  
  Gradually increases the weights of:
  - semantic memory–text alignment,
  - intent–telepathy alignment,
  while maintaining action regression as the primary objective early on.  
  Late in training, Sigma is encouraged to let **internal semantic/intent structure** drive the residual corrections.
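The hard-sample mining step in TRAF can be sketched as averaging the residual-action error over only the hardest fraction of windows in a batch; the fraction and error reduction below are assumptions:

```python
import torch

def traf_mining_loss(residual_pred: torch.Tensor,
                     residual_tgt: torch.Tensor,
                     hard_frac: float = 0.25) -> torch.Tensor:
    """Average the residual-action error only over the hardest fraction of windows."""
    # per-window mean squared error over all timesteps and action dims
    per_window = ((residual_pred - residual_tgt) ** 2).flatten(1).mean(dim=1)  # [B]
    k = max(1, int(hard_frac * per_window.numel()))
    hard_errors, _ = per_window.topk(k)   # keep the k largest errors
    return hard_errors.mean()

pred = torch.randn(8, 16, 7)              # 8 windows, horizon 16, 7-DoF residuals
tgt = torch.randn(8, 16, 7)
loss = traf_mining_loss(pred, tgt)
print(loss.item())                        # gradient budget focuses on hard windows
```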

---

## 4. Inference-time Telepathy Adapter

A lightweight adapter (`sigma_adapter.py`) controls how much the telepathy residuals are allowed to modify the baseline π0.5 actions:

- reads:
  - baseline π0.5 actions (`base_action_vector`, …),
  - Sigma residuals,
  - telepathy diagnostics (norms, cosine alignments),
- computes a **risk-aware scaling factor** in `[min_scale, max_scale]`,
- blends:

```python
action = base_action + scale * telepathy_residual
```

If residuals are too large or misaligned, `scale` is pushed toward 0, effectively reverting to π0.5 behavior.  
If residuals are moderate and well aligned, `scale` approaches 1, enabling telepathy-enhanced control.
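One plausible form of the adapter's scale computation is a sigmoid gate on the residual's relative norm; this is a guess at the shape of the logic, not the actual `sigma_adapter.py` code:

```python
import torch

def risk_aware_scale(residual: torch.Tensor,
                     base_action: torch.Tensor,
                     min_scale: float = 0.0,
                     max_scale: float = 1.0,
                     norm_budget: float = 1.0,
                     temperature: float = 0.25) -> torch.Tensor:
    """Shrink the residual's influence when it is large relative to the base action."""
    # risk grows with the residual norm relative to an allowed budget
    risk = residual.norm(dim=-1) / (norm_budget * base_action.norm(dim=-1) + 1e-6)
    gate = torch.sigmoid((1.0 - risk) / temperature)  # high risk -> gate near 0
    return min_scale + (max_scale - min_scale) * gate

base = torch.randn(4, 7)
small_res, big_res = 0.01 * base, 10.0 * base
print(risk_aware_scale(small_res, base))  # close to 1: trust the residual
print(risk_aware_scale(big_res, base))    # close to 0: revert toward pi0.5
```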

---

## 5. Evaluation Protocol

Evaluation uses `eval_sigma_vla_rollout.py` in **offline closed-loop replay**:

- both Sigma and the baseline:
  - use the *same* preprocessed shards (`shard_0000x.pt`),
  - share the *same* telepathy heads file `sigma_telepathy_heads.pt`,
- **only Sigma**:
  - loads LoRA weights,
  - activates telepathy residuals and the adapter in control output.

### 5.1 CHECK A – telepathy geometry & alignment sanity

CHECK A verifies that **telepathy geometry is identical** between experimental and control runs:

- `heads_tensors = 325`  
- `mean ≈ 0.002`, `std ≈ 0.107`, `rms ≈ 0.107` for telepathy head weights  
- `avg_tau_l2 ≈ 51.6` – average L2 norm of telepathy factors  
- `avg_semantic_text_alignment ≈ 0.13` – semantic factor vs. text embedding alignment

These numbers are matched between Sigma and the π0.5 baseline, so behavioral differences cannot be attributed to changes in the telepathy parameters or the text-alignment geometry.

### 5.2 CHECK B – multiscale control & telepathy metrics

CHECK B defines and reports:

- `mse_vec` – per-step action vector MSE (fine-grain control precision)  
- `mse_chk` – short segment chunk MSE (local motion consistency)  
- `mse_trj` – full trajectory MSE (long-horizon tracking)  
- `tau_l2` – telepathy factor norms (activation strength)  
- `sem_align` – semantic alignment (e.g., cosine) between semantic factors and text embeddings

On the same 723 samples and 181 batches:

- Sigma shows **consistently lower `mse_vec`, `mse_chk`, `mse_trj`** than the baseline,
- while **`tau_l2` and `sem_align` remain similar** between both models.

This pattern supports the interpretation that Sigma **uses the same semantic / telepathy geometry more effectively**, converting it into tangible gains in control accuracy instead of merely altering the embedding space.
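The CHECK B metrics reduce to standard tensor reductions; this sketch shows plausible definitions (the exact reductions in `eval_sigma_vla_rollout.py` may differ):

```python
import torch
import torch.nn.functional as F

def check_b_metrics(pred_trj, tgt_trj, tau, sem_factors, text_emb, chunk: int = 4):
    """Illustrative multiscale control + telepathy metrics over a batch of rollouts."""
    # pred_trj, tgt_trj: [B, T, D]
    mse_vec = F.mse_loss(pred_trj, tgt_trj)                        # per-step precision
    mse_chk = F.mse_loss(pred_trj[:, :chunk], tgt_trj[:, :chunk])  # local consistency
    mse_trj = ((pred_trj - tgt_trj) ** 2).sum(dim=(1, 2)).mean()   # long-horizon tracking
    tau_l2 = tau.norm(dim=-1).mean()                               # activation strength
    sem_align = F.cosine_similarity(sem_factors, text_emb, dim=-1).mean()
    return {"mse_vec": mse_vec, "mse_chk": mse_chk, "mse_trj": mse_trj,
            "tau_l2": tau_l2, "sem_align": sem_align}

B, T, D = 4, 16, 7
m = check_b_metrics(torch.randn(B, T, D), torch.randn(B, T, D),
                    torch.randn(B, 32), torch.randn(B, 64), torch.randn(B, 64))
print({k: round(v.item(), 3) for k, v in m.items()})
```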

---

## 6. How to Use Sigma

> ⚠️ You must have access to `lerobot/pi05_base` and the preprocessed shards or an equivalent environment to reproduce full experiments.

### 6.1 Installation (example)

```bash
# base env
pip install "transformers>=4.40.0" accelerate torch torchvision
pip install lerobot

# clone this repository (example path)
git clone https://github.com/Veltraxor/Sigma.git
cd Sigma
```

### 6.2 Loading Sigma on top of pi0.5

```python
import torch
from lerobot import Pi05Policy
from sigma_vla import SigmaTelepathyVLA, SigmaTelepathyAdapter

device = "cuda"
dtype = torch.bfloat16

# 1. Load base π0.5 policy
base_policy = Pi05Policy.from_pretrained("lerobot/pi05_base")

# 2. Build Sigma on top of the base policy
sigma_policy = SigmaTelepathyVLA.from_base(
    base_policy=base_policy,
    lora_dir="./storage/sigma_lora_out",
    telepathy_heads_path="./storage/sigma_lora_out/sigma_telepathy_heads.pt",
    device=device,
    dtype=dtype,
)

# 3. Optional runtime adapter
adapter = SigmaTelepathyAdapter(
    min_scale=0.0,
    max_scale=1.0,
    risk_temperature=1.0,
)

# 4. Single batch forward (offline replay)
batch = {
    "vis_obs": vis_obs_tensor,           # [B, T, C, H, W]
    "robot_state": robot_state_tensor,   # [B, T, D_state]
    "texts": list_of_text_prompts,       # length B
}

with torch.no_grad():
    out = sigma_policy(**batch, use_telepathy=True)
    blended_action = adapter(
        base_action_vector=out["base_action_vector"],
        telepathy_residual=out["telepathy_residual_vector"],
        telepathy_factors=out["telepathy_factors"],
    )
```

---

## 7. Repository Layout (typical)

A typical Sigma repo / model card includes:

```text
README.md                      # this file
sigma_env.example              # example env file for HF tokens, paths
dataset_preprocess_sigma_vla.py
train_sigma_telepathy_vla_lora.py
eval_sigma_vla_rollout.py
sigma_telepathy_vla.py         # model definition
sigma_adapter.py               # inference-time adapter

storage/
  sigma_pickplace/
    shard_00000.pt
    shard_00001.pt
    shard_00002.pt
  sigma_lora_out/
    sigma_telepathy_heads.pt
    adapter_config.json
    adapter_model.bin
    ...

logs/
  sigma_eval_report.json
  sigma_eval_checkA.json
  sigma_eval_checkB.json
```

You can adapt this layout to your own environment; the key assumption is that **Sigma is always loaded as a LoRA + telepathy delta on top of `lerobot/pi05_base`**.

---

## 8. Intended Use, Risks, and Limitations

- **Intended use**  
  Sigma is intended for **research and experimentation** on:
  - semantic / telepathy-style control in VLA systems,
  - offline trajectory analysis and simulation,
  - early-stage humanoid / manipulator control studies.

- **Not intended for**  
  - direct deployment on physical robots **without additional safety layers**;  
  - safety-critical or human-facing applications.

- **Known limitations**  
  - trained only on `svla_so101_pickplace`;  
  - evaluated only in offline replay;  
  - telepathy path tuned for a single task family and embodiment.

Users should treat Sigma as a **proof-of-concept** that demonstrates how “deep semantic + associative intent” can be engineered into residual control, not as a generic controller.

---

## 9. Author & Acknowledgements

- **Author**: **Libo Wang**  
- Base policy and dataset by **Physical Intelligence / LeRobot** teams.  
- Training environment based on a single RTX 4090 GPU; all scripts are structured to be portable to other single-GPU or multi-GPU setups with minimal changes.

---

## 10. Citation

If you use Sigma, please cite both the original π0.5 / OpenPI work and this Sigma extension.

**π0.5 / OpenPI:**

```bibtex
@article{openpi2024,
  title   = {Open-World Robotic Manipulation with Vision-Language-Action Models},
  author  = {Physical Intelligence},
  year    = {2024},
  url     = {https://github.com/Physical-Intelligence/openpi}
}
```

**Sigma (example entry):**

```bibtex
@article{sigma2025,
  title   = {Sigma: The Key for Vision--Language--Action Models toward Telepathy},
  author  = {Wang, Libo},
  year    = {2025},
  note    = {Telepathy-style extension of lerobot/pi05_base},
  url     = {https://huggingface.co/Veltraxor/Sigma}
}
```