---
license: gemma
language:
- en
tags:
- vision-language-action
- humanoid-robotics
- telepathy
- multimodal
- robotics-control
- lora
- pytorch
base_model: lerobot/pi05_base
datasets:
- lerobot/svla_so101_pickplace
library_name: transformers
pipeline_tag: other
author: "Libo Wang"
---
# Sigma: The Key for Vision–Language–Action Models toward Telepathy
[Model](https://huggingface.co/Veltraxor/Sigma) · [Base policy](https://huggingface.co/lerobot/pi05_base) · [Dataset](https://huggingface.co/datasets/lerobot/svla_so101_pickplace)
Sigma is a **telepathy-style Vision–Language–Action (VLA) model** built on top of `lerobot/pi05_base`.
It adds a semantic “telepathy” path and LoRA adapters that steer continuous robot control using internal **semantic memory** and **intent states**, while keeping the original π0.5 backbone weights intact and recoverable.
---
## 1. Summary
- **Base policy**: `lerobot/pi05_base` (π0.5)
- **Author**: **Libo Wang**
- **GPU for training**: single RTX 4090 (24GB)
- **Data**: `lerobot/svla_so101_pickplace`
- **Objective**:
Make a π0.5-style VLA **use internal semantic & intent states** to refine continuous control, rather than only imitating trajectories.
Sigma keeps the perception and control structure of π0.5, and introduces an additional pathway that:
- fuses **vision, language, and robot state** into a shared latent sequence,
- maintains a **semantic memory** `m_t` and an **intent vector** `z_intent` over time,
- converts them into **telepathy factors** that modulate the policy’s action outputs as residual corrections.
---
## 2. Architecture at a Glance
Sigma can be seen as **π0.5 + telepathic head + LoRA adapters**:
- **Vision / State stream**
- reuse π0.5 encoders for images and robot state;
- add FiLM-style modulation from telepathy factors on vision tokens.
- **Language–semantic stream**
- take text tokens, vision tokens, and state tokens into a shared MLLM backbone;
- derive:
- a **semantic memory** m_t that accumulates cross-time information,
- an **intent vector** z_intent,
- pooled **semantic factors** aligned with the text embedding space.
- **Action stream (three branches)**
- treat π0.5 outputs as **baseline**:
- action vector (per-step),
- action chunk (short horizon),
- action trajectory (full horizon);
- learn **residual actions** driven by telepathy factors on all three branches.
The resulting policy still *looks like* π0.5 from the outside (same inputs, same output types), but actions are now corrected by an internal telepathy pathway that is aware of **deep semantics and associative intent**.
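The FiLM-style modulation mentioned in the vision/state stream can be sketched as follows. This is a minimal illustration, not the actual Sigma module: the class name `TelepathyFiLM` and its projection layers are assumptions; only the general shape (per-channel scale and shift derived from telepathy factors) reflects the description above.

```python
import torch
import torch.nn as nn

class TelepathyFiLM(nn.Module):
    """Hypothetical sketch: FiLM-style modulation of vision tokens
    by telepathy factors (per-channel scale gamma and shift beta)."""

    def __init__(self, factor_dim: int, token_dim: int):
        super().__init__()
        self.to_gamma = nn.Linear(factor_dim, token_dim)
        self.to_beta = nn.Linear(factor_dim, token_dim)

    def forward(self, vision_tokens: torch.Tensor,
                telepathy_factors: torch.Tensor) -> torch.Tensor:
        # vision_tokens: [B, N, token_dim]; telepathy_factors: [B, factor_dim]
        gamma = self.to_gamma(telepathy_factors).unsqueeze(1)  # [B, 1, token_dim]
        beta = self.to_beta(telepathy_factors).unsqueeze(1)
        # identity at gamma = beta = 0, so modulation starts near a no-op
        return (1 + gamma) * vision_tokens + beta
```

The `(1 + gamma)` form keeps the modulation close to identity at initialization, which matches the design goal of leaving the π0.5 backbone behavior recoverable.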
---
## 3. Training Setup
### 3.1 Dataset & preprocessing
- **Upstream dataset**: `lerobot/svla_so101_pickplace`
- **Task**: pick-and-place style manipulation with multi-frame RGB + robot state + continuous actions.
A preprocessing script (`dataset_preprocess_sigma_vla.py`) does:
- sliding-window segmentation with horizon `T = 16`,
- filtering out windows with nearly zero action norm to remove static segments,
- packing vision frames, robot state, and 3-scale action targets into tensor batches,
- exporting three sharded files:
```text
storage/sigma_pickplace/shard_00000.pt
storage/sigma_pickplace/shard_00001.pt
storage/sigma_pickplace/shard_00002.pt
```
These shards are the **only** data used for Sigma training and evaluation.
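The windowing and static-segment filtering steps can be sketched like this. The function name and threshold are hypothetical; only the horizon `T = 16` and the zero-action-norm filter come from the description above.

```python
import torch

def make_windows(actions: torch.Tensor, horizon: int = 16,
                 min_action_norm: float = 1e-3):
    """Hypothetical sketch of the preprocessing step: slide a window of
    length `horizon` over an episode's actions [T_ep, D_act] and keep
    only windows whose mean per-step action norm exceeds a threshold,
    dropping near-static segments."""
    windows = []
    for start in range(actions.shape[0] - horizon + 1):
        window = actions[start:start + horizon]
        if window.norm(dim=-1).mean() > min_action_norm:
            windows.append((start, window))
    return windows
```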
### 3.2 LoRA fine-tuning (Sigma training)
Training is performed on a **single RTX 4090** using `train_sigma_telepathy_vla_lora.py`:
```bash
python train_sigma_telepathy_vla_lora.py \
--base_model_id lerobot/pi05_base \
--dataset_dir /workspace/storage/sigma_pickplace \
--output_dir /workspace/storage/sigma_lora_out \
--batch_size 4 \
--gradient_accumulation_steps 4 \
--max_steps 300 \
--dtype bf16
```
Key aspects:
- freeze backbone weights from `lerobot/pi05_base`;
- attach **LoRA** on key projections (q, k, v, o) and the telepathy heads;
- jointly optimize:
- **three control losses**:
- `L_act_vec` for per-step action vectors,
- `L_act_chk` for short-horizon chunks,
- `L_act_trj` for full trajectories;
- **semantic & telepathy regularizers**:
- alignment of semantic factors with text embeddings,
- control of telepathy factor norm `tau_l2`.
All LoRA and telepathy parameters are stored under:
```text
storage/sigma_lora_out/
sigma_telepathy_heads.pt
adapter_config.json
adapter_model.bin
...
```
### 3.3 Telepathy-aware training logic
Two key training mechanisms are implemented inside the loss:
- **Telepathic Residual Action Focusing (TRAF)**
Focuses learning on *residual actions* instead of full actions, and uses **hard-sample mining** (top-k error segments) to allocate more gradient budget to difficult humanoid control windows.
- **Telepathic Semantic Alignment Curriculum (TSAC)**
Gradually increases the weights of:
- semantic memory–text alignment,
- intent–telepathy alignment,
while maintaining action regression as the primary objective early on.
Late in training, Sigma is encouraged to let **internal semantic/intent structure** drive the residual corrections.
---
## 4. Inference-time Telepathy Adapter
A lightweight adapter (`sigma_adapter.py`) controls how much the telepathy residuals are allowed to modify the baseline π0.5 actions:
- reads:
- baseline π0.5 actions (`base_action_vector`, …),
- Sigma residuals,
- telepathy diagnostics (norms, cosine alignments),
- computes a **risk-aware scaling factor** in `[min_scale, max_scale]`,
- blends:
```python
action = base_action + scale * telepathy_residual
```
If residuals are too large or misaligned, `scale` is pushed toward 0, effectively reverting to π0.5 behavior.
If residuals are moderate and well aligned, `scale` approaches 1, enabling telepathy-enhanced control.
---
## 5. Evaluation Protocol
Evaluation uses `eval_sigma_vla_rollout.py` in **offline closed-loop replay**:
- both Sigma and the baseline:
- use the *same* preprocessed shards (`shard_0000x.pt`),
- share the *same* telepathy heads file `sigma_telepathy_heads.pt`,
- **only Sigma**:
- loads LoRA weights,
- activates telepathy residuals and the adapter in control output.
### 5.1 CHECK A – telepathy geometry & alignment sanity
CHECK A verifies that **telepathy geometry is identical** between experimental and control runs:
- `heads_tensors = 325`
- `mean ≈ 0.002`, `std ≈ 0.107`, `rms ≈ 0.107` for telepathy head weights
- `avg_tau_l2 ≈ 51.6` – average L2 norm of telepathy factors
- `avg_semantic_text_alignment ≈ 0.13` – semantic factor vs. text embedding alignment
These numbers are matched between Sigma and the π0.5 baseline, so behavior differences cannot be explained by changing telepathy parameters or text alignment geometry.
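The geometry comparison boils down to summarizing the telepathy head weights so two runs can be diffed numerically. A minimal sketch, assuming the heads are loadable as a plain `state_dict` of tensors (the function name is hypothetical):

```python
import torch

def head_stats(state_dict: dict) -> dict:
    """Hypothetical sketch of the CHECK A summary: count the head
    tensors and compute mean / std / rms over all their weights."""
    flat = torch.cat([t.flatten().float() for t in state_dict.values()])
    return {
        "heads_tensors": len(state_dict),
        "mean": flat.mean().item(),
        "std": flat.std().item(),
        "rms": flat.pow(2).mean().sqrt().item(),
    }
```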
### 5.2 CHECK B – multiscale control & telepathy metrics
CHECK B defines and reports:
- `mse_vec` – per-step action vector MSE (fine-grain control precision)
- `mse_chk` – short segment chunk MSE (local motion consistency)
- `mse_trj` – full trajectory MSE (long-horizon tracking)
- `tau_l2` – telepathy factor norms (activation strength)
- `sem_align` – semantic alignment (e.g., cosine) between semantic factors and text embeddings
On the same 723 samples and 181 batches:
- Sigma shows **consistently lower `mse_vec`, `mse_chk`, `mse_trj`** than the baseline,
- while **`tau_l2` and `sem_align` remain similar** between both models.
This pattern supports the interpretation that Sigma **uses the same semantic / telepathy geometry more effectively**, converting it into tangible gains in control accuracy instead of merely altering the embedding space.
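Per-batch, the CHECK B metrics could be computed along these lines. The dict keys mirror the report fields above; the function itself is an illustrative assumption, and the actual eval script may aggregate across batches differently.

```python
import torch
import torch.nn.functional as F

def check_b_metrics(pred, target, telepathy_factors, semantic_factors, text_emb):
    """Hypothetical sketch of the CHECK B metrics for one batch.
    pred / target: dicts with 'vec', 'chk', 'trj' action tensors."""
    return {
        "mse_vec": F.mse_loss(pred["vec"], target["vec"]).item(),
        "mse_chk": F.mse_loss(pred["chk"], target["chk"]).item(),
        "mse_trj": F.mse_loss(pred["trj"], target["trj"]).item(),
        "tau_l2": telepathy_factors.norm(dim=-1).mean().item(),
        "sem_align": F.cosine_similarity(semantic_factors, text_emb, dim=-1).mean().item(),
    }
```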
---
## 6. How to Use Sigma
> ⚠️ You must have access to `lerobot/pi05_base` and the preprocessed shards or an equivalent environment to reproduce full experiments.
### 6.1 Installation (example)
```bash
# base env
pip install "transformers>=4.40.0" accelerate torch torchvision
pip install lerobot
# clone this repository (example path)
git clone https://github.com/Veltraxor/Sigma.git
cd Sigma
```
### 6.2 Loading Sigma on top of pi0.5
```python
import torch
from lerobot import Pi05Policy
from sigma_vla import SigmaTelepathyVLA, SigmaTelepathyAdapter
device = "cuda"
dtype = torch.bfloat16
# 1. Load base π0.5 policy
base_policy = Pi05Policy.from_pretrained("lerobot/pi05_base")
# 2. Build Sigma on top of the base policy
sigma_policy = SigmaTelepathyVLA.from_base(
base_policy=base_policy,
lora_dir="./storage/sigma_lora_out",
telepathy_heads_path="./storage/sigma_lora_out/sigma_telepathy_heads.pt",
device=device,
dtype=dtype,
)
# 3. Optional runtime adapter
adapter = SigmaTelepathyAdapter(
min_scale=0.0,
max_scale=1.0,
risk_temperature=1.0,
)
# 4. Single batch forward (offline replay)
batch = {
"vis_obs": vis_obs_tensor, # [B, T, C, H, W]
"robot_state": robot_state_tensor, # [B, T, D_state]
"texts": list_of_text_prompts, # length B
}
with torch.no_grad():
out = sigma_policy(**batch, use_telepathy=True)
blended_action = adapter(
base_action_vector=out["base_action_vector"],
telepathy_residual=out["telepathy_residual_vector"],
telepathy_factors=out["telepathy_factors"],
)
```
---
## 7. Repository Layout (typical)
A typical Sigma repo / model card includes:
```text
README.md # this file
sigma_env.example # example env file for HF tokens, paths
dataset_preprocess_sigma_vla.py
train_sigma_telepathy_vla_lora.py
eval_sigma_vla_rollout.py
sigma_telepathy_vla.py # model definition
sigma_adapter.py # inference-time adapter
storage/
sigma_pickplace/
shard_00000.pt
shard_00001.pt
shard_00002.pt
sigma_lora_out/
sigma_telepathy_heads.pt
adapter_config.json
adapter_model.bin
...
logs/
sigma_eval_report.json
sigma_eval_checkA.json
sigma_eval_checkB.json
```
You can adapt this layout to your own environment; the key assumption is that **Sigma is always loaded as a LoRA + telepathy delta on top of `lerobot/pi05_base`**.
---
## 8. Intended Use, Risks, and Limitations
- **Intended use**
Sigma is intended for **research and experimentation** on:
- semantic / telepathy-style control in VLA systems,
- offline trajectory analysis and simulation,
- early-stage humanoid / manipulator control studies.
- **Not intended for**
- direct deployment on physical robots **without additional safety layers**;
- safety-critical or human-facing applications.
- **Known limitations**
- trained only on `svla_so101_pickplace`;
- evaluated only in offline replay;
- telepathy path tuned for a single task family and embodiment.
Users should treat Sigma as a **proof-of-concept** that demonstrates how “deep semantic + associative intent” can be engineered into residual control, not as a generic controller.
---
## 9. Author & Acknowledgements
- **Author**: **Libo Wang**
- Base policy and dataset by **Physical Intelligence / LeRobot** teams.
- Training environment based on a single RTX 4090 GPU; all scripts are structured to be portable to other single-GPU or multi-GPU setups with minimal changes.
---
## 10. Citation
If you use Sigma, please cite both the original π0.5 / OpenPI work and this Sigma extension.
**π0.5 / OpenPI:**
```bibtex
@article{openpi2024,
title = {Open-World Robotic Manipulation with Vision-Language-Action Models},
author = {Physical Intelligence},
year = {2024},
url = {https://github.com/Physical-Intelligence/openpi}
}
```
**Sigma (example entry):**
```bibtex
@article{sigma2025,
title = {Sigma: The Key for Vision--Language--Action Models toward Telepathy},
author = {Wang, Libo},
year = {2025},
note = {Telepathy-style extension of lerobot/pi05_base},
url = {https://huggingface.co/Veltraxor/Sigma}
}
```