---
license: gemma
language:
- en
tags:
- vision-language-action
- humanoid-robotics
- telepathy
- multimodal
- robotics-control
- lora
- pytorch
base_model: lerobot/pi05_base
datasets:
- lerobot/svla_so101_pickplace
library_name: transformers
pipeline_tag: other
author: "Libo Wang"
---

# Sigma: The Key for Vision–Language–Action Models toward Telepathy

[Model: Veltraxor/Sigma](https://huggingface.co/Veltraxor/Sigma)
[Base model: lerobot/pi05_base](https://huggingface.co/lerobot/pi05_base)
[Dataset: lerobot/svla_so101_pickplace](https://huggingface.co/datasets/lerobot/svla_so101_pickplace)

Sigma is a **telepathy-style Vision–Language–Action (VLA) model** built on top of `lerobot/pi05_base`.
It adds a semantic “telepathy” path and LoRA adapters that steer continuous robot control using internal **semantic memory** and **intent states**, while keeping the original π0.5 backbone weights intact and recoverable.

---

## 1. Summary

- **Base policy**: `lerobot/pi05_base` (π0.5)
- **Author**: **Libo Wang**
- **GPU for training**: single RTX 4090 (24 GB)
- **Data**: `lerobot/svla_so101_pickplace`
- **Objective**:
  Make a π0.5-style VLA **use internal semantic and intent states** to refine continuous control, rather than only imitating trajectories.

Sigma keeps the perception and control structure of π0.5 and introduces an additional pathway that:

- fuses **vision, language, and robot state** into a shared latent sequence,
- maintains a **semantic memory** `m_t` and an **intent vector** `z_intent` over time,
- converts them into **telepathy factors** that modulate the policy’s action outputs as residual corrections.

---

## 2. Architecture at a Glance

Sigma can be seen as **π0.5 + telepathic head + LoRA adapters**:

- **Vision / State stream**
  - reuse π0.5 encoders for images and robot state;
  - add FiLM-style modulation from telepathy factors on vision tokens.

- **Language–semantic stream**
  - feed text tokens, vision tokens, and state tokens into a shared MLLM backbone;
  - derive:
    - a **semantic memory** `m_t` that accumulates information across time steps,
    - an **intent vector** `z_intent`,
    - pooled **semantic factors** aligned with the text embedding space.

- **Action stream (three branches)**
  - treat π0.5 outputs as the **baseline**:
    - action vector (per-step),
    - action chunk (short horizon),
    - action trajectory (full horizon);
  - learn **residual actions** driven by telepathy factors on all three branches.

The resulting policy still *looks like* π0.5 from the outside (same inputs, same output types), but its actions are now corrected by an internal telepathy pathway that is aware of **deep semantics and associative intent**.
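
The FiLM-style modulation mentioned for the vision stream can be illustrated with a minimal sketch. It is not taken from the Sigma code: plain Python lists stand in for tensors, the helper names (`linear`, `film_modulate`) are hypothetical, and the residual form `(1 + gamma) * x + beta` is an assumption chosen so that zero-initialised projections leave the π0.5 vision tokens unchanged.

```python
from typing import List

Vector = List[float]


def linear(weights: List[Vector], bias: Vector, x: Vector) -> Vector:
    """Dense layer y = W x + b on plain Python lists."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def film_modulate(tokens: List[Vector], tau: Vector,
                  w_gamma: List[Vector], b_gamma: Vector,
                  w_beta: List[Vector], b_beta: Vector) -> List[Vector]:
    """Apply y = (1 + gamma(tau)) * x + beta(tau) to every vision token."""
    gamma = linear(w_gamma, b_gamma, tau)  # per-channel scale offset
    beta = linear(w_beta, b_beta, tau)     # per-channel shift
    return [[(1.0 + g) * x + b for x, g, b in zip(token, gamma, beta)]
            for token in tokens]


# Two 3-channel vision tokens and a 2-dimensional telepathy factor (toy sizes).
tokens = [[1.0, 2.0, 3.0], [0.5, -1.0, 0.0]]
tau = [1.0, -1.0]
zeros_w = [[0.0, 0.0] for _ in range(3)]
zeros_b = [0.0, 0.0, 0.0]

# With zero-initialised projections the modulation is the identity.
unchanged = film_modulate(tokens, tau, zeros_w, zeros_b, zeros_w, zeros_b)

# A non-zero gamma projection rescales channel 0 only.
w_gamma = [[0.5, 0.0], [0.0, 0.0], [0.0, 0.0]]
scaled = film_modulate(tokens, tau, w_gamma, zeros_b, zeros_w, zeros_b)
```

In this sketch the telepathy factor only sets a per-channel scale and shift, so the token count and shape seen by the rest of the backbone stay the same.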

---

## 3. Training Setup

### 3.1 Dataset & preprocessing

- **Upstream dataset**: `lerobot/svla_so101_pickplace`
- **Task**: pick-and-place style manipulation with multi-frame RGB + robot state + continuous actions.

A preprocessing script (`dataset_preprocess_sigma_vla.py`) performs:

- sliding-window segmentation with horizon `T = 16`,
- filtering out windows with near-zero action norm to remove static segments,
- packing vision frames, robot state, and three-scale action targets into tensor batches,
- exporting three sharded files:

```text
storage/sigma_pickplace/shard_00000.pt
storage/sigma_pickplace/shard_00001.pt
storage/sigma_pickplace/shard_00002.pt
```

These shards are the **only** data used for Sigma training and evaluation.
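
The windowing and static-segment filtering steps can be sketched as follows. Only the horizon `T = 16` comes from the description above. The stride, the mean-norm threshold, and the function names are assumptions for illustration, and the actual script may use a different criterion.

```python
import math
from typing import List, Sequence, Tuple

Action = Sequence[float]


def action_norm(action: Action) -> float:
    """Euclidean norm of one action vector."""
    return math.sqrt(sum(a * a for a in action))


def sliding_windows(actions: Sequence[Action], horizon: int = 16, stride: int = 1,
                    min_mean_norm: float = 1e-3) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs of windows that contain real motion."""
    kept = []
    for start in range(0, len(actions) - horizon + 1, stride):
        window = actions[start:start + horizon]
        mean_norm = sum(action_norm(a) for a in window) / horizon
        if mean_norm >= min_mean_norm:  # drop static segments
            kept.append((start, start + horizon))
    return kept


# Toy episode: 20 static steps followed by 20 moving steps (2-D actions).
episode = [(0.0, 0.0)] * 20 + [(0.1, -0.2)] * 20
windows = sliding_windows(episode, horizon=16, stride=4)
```

With this toy episode, the two windows that lie entirely in the static prefix are dropped and the remaining five are kept.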

### 3.2 LoRA fine-tuning (Sigma training)

Training is performed on a **single RTX 4090** using `train_sigma_telepathy_vla_lora.py`:

```bash
python train_sigma_telepathy_vla_lora.py \
  --base_model_id lerobot/pi05_base \
  --dataset_dir /workspace/storage/sigma_pickplace \
  --output_dir /workspace/storage/sigma_lora_out \
  --batch_size 4 \
  --gradient_accumulation_steps 4 \
  --max_steps 300 \
  --dtype bf16
```

Key aspects:

- freeze backbone weights from `lerobot/pi05_base`;
- attach **LoRA** to the attention projections (q, k, v, o) and the telepathy heads;
- jointly optimize:
  - **three control losses**:
    - `L_act_vec` for per-step action vectors,
    - `L_act_chk` for short-horizon chunks,
    - `L_act_trj` for full trajectories;
  - **semantic & telepathy regularizers**:
    - alignment of semantic factors with text embeddings,
    - regularization of the telepathy factor norm `tau_l2`.
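
To make the "frozen backbone plus LoRA" idea concrete, here is a minimal sketch of the standard LoRA update `W + (alpha / r) * B @ A` on a single projection. The rank, `alpha`, and matrix values are hypothetical and plain Python lists stand in for tensors. It does not reflect the adapter configuration actually used for Sigma.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product (stand-in for a tensor library)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w_frozen: Matrix, lora_b: Matrix, lora_a: Matrix,
                          alpha: float, rank: int) -> Matrix:
    """Return W + (alpha / rank) * B @ A without modifying the frozen W."""
    delta = matmul(lora_b, lora_a)
    scale = alpha / rank
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w_frozen, delta)]


# Frozen 2x3 projection weight (e.g. one of q, k, v, o) and a rank-1 adapter.
w_q = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
lora_a = [[1.0, 2.0, 3.0]]      # shape (rank, in_features)
lora_b_init = [[0.0], [0.0]]    # B starts at zero, so training starts from the base model
lora_b_trained = [[0.5], [0.0]]

w_start = lora_effective_weight(w_q, lora_b_init, lora_a, alpha=2.0, rank=1)
w_tuned = lora_effective_weight(w_q, lora_b_trained, lora_a, alpha=2.0, rank=1)
```

Because the update is additive and stored separately, dropping the adapter recovers the original π0.5 projection exactly.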

All LoRA and telepathy parameters are stored under:

```text
storage/sigma_lora_out/
  sigma_telepathy_heads.pt
  adapter_config.json
  adapter_model.bin
  ...
```

### 3.3 Telepathy-aware training logic

Two key training mechanisms are implemented inside the loss:

- **Telepathic Residual Action Focusing (TRAF)**
  Focuses learning on *residual actions* instead of full actions, and uses **hard-sample mining** (top-k error segments) to allocate more gradient budget to difficult humanoid control windows.

- **Telepathic Semantic Alignment Curriculum (TSAC)**
  Gradually increases the weights of:
  - semantic memory–text alignment,
  - intent–telepathy alignment,

  while keeping action regression as the primary objective early on.
  Late in training, Sigma is encouraged to let **internal semantic/intent structure** drive the residual corrections.
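
The two mechanisms can be sketched together under stated assumptions: top-k averaging stands in for TRAF's hard-sample mining, and a warm-up followed by a linear ramp stands in for the TSAC schedule. The value of `k`, the warm-up fraction, the final weight, and all function names are hypothetical. Only `max_steps = 300` is taken from the training command.

```python
from typing import List


def traf_topk_loss(window_errors: List[float], k: int) -> float:
    """Average only the k largest per-window residual errors (hard-sample mining)."""
    hardest = sorted(window_errors, reverse=True)[:k]
    return sum(hardest) / len(hardest)


def tsac_weight(step: int, max_steps: int, final_weight: float,
                warmup_frac: float = 0.2) -> float:
    """Alignment weight: zero during warm-up, then a linear ramp to final_weight."""
    warmup = int(warmup_frac * max_steps)
    if step <= warmup:
        return 0.0
    progress = (step - warmup) / (max_steps - warmup)
    return final_weight * min(progress, 1.0)


def total_loss(window_errors: List[float], align_loss: float, step: int,
               max_steps: int = 300, k: int = 2, final_weight: float = 0.5) -> float:
    """Action regression on hard windows plus curriculum-weighted alignment."""
    action_term = traf_topk_loss(window_errors, k)
    return action_term + tsac_weight(step, max_steps, final_weight) * align_loss


errors = [0.10, 0.40, 0.05, 0.30]   # per-window residual-action errors in one batch
early = total_loss(errors, align_loss=1.0, step=30)    # alignment weight still zero
late = total_loss(errors, align_loss=1.0, step=300)    # alignment weight at its maximum
```

Early in training only the action term contributes, and the alignment term reaches its full weight at the final step.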

---

## 4. Inference-time Telepathy Adapter

A lightweight adapter (`sigma_adapter.py`) controls how much the telepathy residuals are allowed to modify the baseline π0.5 actions. It:

- reads:
  - baseline π0.5 actions (`base_action_vector`, …),
  - Sigma residuals,
  - telepathy diagnostics (norms, cosine alignments),
- computes a **risk-aware scaling factor** in `[min_scale, max_scale]`,
- blends:

```python
action = base_action + scale * telepathy_residual
```

If residuals are too large or misaligned, `scale` is pushed toward `min_scale` (0 in the example in Section 6.2), effectively reverting to π0.5 behavior.
If residuals are moderate and well aligned, `scale` approaches `max_scale` (1 in that example), enabling telepathy-enhanced control.
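
One way such a risk-aware scale could be computed is sketched below. The formula is an assumption, not the logic of `sigma_adapter.py`: risk is taken as the sum of a norm term and a misalignment term, and an exponential maps it into `[min_scale, max_scale]`. The `alignment` input (a precomputed cosine diagnostic) and `max_residual_norm` are hypothetical parameters.

```python
import math
from typing import List, Sequence


def risk_aware_scale(residual: Sequence[float], alignment: float,
                     min_scale: float = 0.0, max_scale: float = 1.0,
                     risk_temperature: float = 1.0,
                     max_residual_norm: float = 0.5) -> float:
    """Map residual size and alignment to a scale in [min_scale, max_scale]."""
    norm = math.sqrt(sum(r * r for r in residual))
    norm_risk = norm / max_residual_norm        # grows with residual magnitude
    align_risk = 1.0 - max(alignment, 0.0)      # 0 when perfectly aligned
    confidence = math.exp(-(norm_risk + align_risk) / risk_temperature)
    return min_scale + (max_scale - min_scale) * confidence


def blend(base_action: Sequence[float], residual: Sequence[float],
          scale: float) -> List[float]:
    """action = base_action + scale * telepathy_residual, element-wise."""
    return [b + scale * r for b, r in zip(base_action, residual)]


base = [0.20, -0.10]

# Small, well-aligned residual: most of it is applied.
good_scale = risk_aware_scale([0.03, 0.04], alignment=0.9)
action = blend(base, [0.03, 0.04], good_scale)

# Large, misaligned residual: the scale collapses toward min_scale.
bad_scale = risk_aware_scale([3.0, 4.0], alignment=-0.5)
```

A higher `risk_temperature` makes the adapter more tolerant of large or misaligned residuals in this sketch, and a lower one makes it fall back to the baseline sooner.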

---

## 5. Evaluation Protocol

Evaluation uses `eval_sigma_vla_rollout.py` in **offline closed-loop replay**:

- both Sigma and the baseline:
  - use the *same* preprocessed shards (`shard_0000x.pt`),
  - share the *same* telepathy heads file `sigma_telepathy_heads.pt`,
- **only Sigma**:
  - loads LoRA weights,
  - activates telepathy residuals and the adapter in the control output.

### 5.1 CHECK A – telepathy geometry & alignment sanity

CHECK A verifies that **telepathy geometry is identical** between experimental and control runs:

- `heads_tensors = 325`
- `mean ≈ 0.002`, `std ≈ 0.107`, `rms ≈ 0.107` for telepathy head weights
- `avg_tau_l2 ≈ 51.6` – average L2 norm of telepathy factors
- `avg_semantic_text_alignment ≈ 0.13` – semantic factor vs. text embedding alignment

These numbers are matched between Sigma and the π0.5 baseline, so behavior differences cannot be explained by changes in telepathy parameters or text-alignment geometry.
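
The head-weight statistics above can be reproduced in spirit with a short sketch. The function name and toy values are hypothetical, and population (not sample) variance is assumed. Under that assumption `rms² = mean² + std²`, and the reported figures are consistent with the identity.

```python
import math
from typing import Dict, Iterable


def weight_stats(values: Iterable[float]) -> Dict[str, float]:
    """Mean, population standard deviation, and RMS of a flat list of weights."""
    values = list(values)
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    rms = math.sqrt(sum(v * v for v in values) / n)
    return {"mean": mean, "std": math.sqrt(var), "rms": rms}


# Toy weights standing in for the flattened telepathy head tensors.
stats = weight_stats([0.1, -0.1, 0.2, -0.2, 0.05])

# Consistency check on the reported values: sqrt(0.002^2 + 0.107^2).
reported_rms = math.hypot(0.002, 0.107)
```

Running the same statistics on the heads file loaded by both runs is one simple way to confirm that the two runs share identical telepathy parameters.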

### 5.2 CHECK B – multiscale control & telepathy metrics

CHECK B defines and reports:

- `mse_vec` – per-step action vector MSE (fine-grained control precision)
- `mse_chk` – short-segment chunk MSE (local motion consistency)
- `mse_trj` – full-trajectory MSE (long-horizon tracking)
- `tau_l2` – telepathy factor norms (activation strength)
- `sem_align` – semantic alignment (e.g., cosine) between semantic factors and text embeddings

On the same 723 samples (181 batches):

- Sigma shows **consistently lower `mse_vec`, `mse_chk`, `mse_trj`** than the baseline,
- while **`tau_l2` and `sem_align` remain similar** between both models.

This pattern supports the interpretation that Sigma **uses the same semantic / telepathy geometry more effectively**, converting it into tangible gains in control accuracy instead of merely altering the embedding space.
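
The three control metrics can be sketched as the same MSE evaluated at different horizons. For simplicity this sketch reads all three scales off a single predicted trajectory with a hypothetical `chunk_len`. The evaluation script may instead compare the three separate output branches against their own targets.

```python
from typing import Dict, List

Trajectory = List[List[float]]  # T steps x action_dim


def mse(pred: Trajectory, target: Trajectory) -> float:
    """Mean squared error over every step and action dimension."""
    sq = [(p - t) ** 2 for ps, ts in zip(pred, target) for p, t in zip(ps, ts)]
    return sum(sq) / len(sq)


def multiscale_mse(pred: Trajectory, target: Trajectory,
                   chunk_len: int = 4) -> Dict[str, float]:
    """MSE at per-step, short-chunk, and full-trajectory scale."""
    return {
        "mse_vec": mse(pred[:1], target[:1]),
        "mse_chk": mse(pred[:chunk_len], target[:chunk_len]),
        "mse_trj": mse(pred, target),
    }


# Toy case: the prediction is exact for 4 steps, then drifts by 1.0 per dimension.
target = [[0.0, 0.0] for _ in range(8)]
pred = [[0.0, 0.0] for _ in range(4)] + [[1.0, 1.0] for _ in range(4)]
metrics = multiscale_mse(pred, target)
```

In the toy case only `mse_trj` is non-zero, which shows why long-horizon drift can be invisible to the per-step and chunk metrics.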

---

## 6. How to Use Sigma

> ⚠️ To reproduce the full experiments you need access to `lerobot/pi05_base` and the preprocessed shards, or an equivalent environment.

### 6.1 Installation (example)

```bash
# base env
pip install "transformers>=4.40.0" accelerate torch torchvision
pip install lerobot

# clone this repository (example path)
git clone https://github.com/Veltraxor/Sigma.git
cd Sigma
```

### 6.2 Loading Sigma on top of π0.5

```python
import torch
# NOTE: the import path of the π0.5 policy class depends on the installed
# lerobot version; adjust this line if `lerobot` does not export it at top level.
from lerobot import Pi05Policy
# NOTE: `sigma_vla` is the module name used in this example; point it at the
# files that define the model and the adapter in your checkout.
from sigma_vla import SigmaTelepathyVLA, SigmaTelepathyAdapter

device = "cuda"
dtype = torch.bfloat16

# 1. Load base π0.5 policy
base_policy = Pi05Policy.from_pretrained("lerobot/pi05_base")

# 2. Build Sigma on top of the base policy
sigma_policy = SigmaTelepathyVLA.from_base(
    base_policy=base_policy,
    lora_dir="./storage/sigma_lora_out",
    telepathy_heads_path="./storage/sigma_lora_out/sigma_telepathy_heads.pt",
    device=device,
    dtype=dtype,
)

# 3. Optional runtime adapter
adapter = SigmaTelepathyAdapter(
    min_scale=0.0,
    max_scale=1.0,
    risk_temperature=1.0,
)

# 4. Single batch forward (offline replay)
# vis_obs_tensor, robot_state_tensor, and list_of_text_prompts are placeholders
# for tensors and prompts loaded from your own preprocessed shards.
batch = {
    "vis_obs": vis_obs_tensor,            # [B, T, C, H, W]
    "robot_state": robot_state_tensor,    # [B, T, D_state]
    "texts": list_of_text_prompts,        # length B
}

with torch.no_grad():
    out = sigma_policy(**batch, use_telepathy=True)
    blended_action = adapter(
        base_action_vector=out["base_action_vector"],
        telepathy_residual=out["telepathy_residual_vector"],
        telepathy_factors=out["telepathy_factors"],
    )
```

---

## 7. Repository Layout (typical)

A typical Sigma repo / model card includes:

```text
README.md                          # this file
sigma_env.example                  # example env file for HF tokens, paths
dataset_preprocess_sigma_vla.py
train_sigma_telepathy_vla_lora.py
eval_sigma_vla_rollout.py
sigma_telepathy_vla.py             # model definition
sigma_adapter.py                   # inference-time adapter

storage/
  sigma_pickplace/
    shard_00000.pt
    shard_00001.pt
    shard_00002.pt
  sigma_lora_out/
    sigma_telepathy_heads.pt
    adapter_config.json
    adapter_model.bin
    ...

logs/
  sigma_eval_report.json
  sigma_eval_checkA.json
  sigma_eval_checkB.json
```

You can adapt this layout to your own environment; the key assumption is that **Sigma is always loaded as a LoRA + telepathy delta on top of `lerobot/pi05_base`**.

---

## 8. Intended Use, Risks, and Limitations

- **Intended use**
  Sigma is intended for **research and experimentation** on:
  - semantic / telepathy-style control in VLA systems,
  - offline trajectory analysis and simulation,
  - early-stage humanoid / manipulator control studies.

- **Not intended for**
  - direct deployment on physical robots **without additional safety layers**;
  - safety-critical or human-facing applications.

- **Known limitations**
  - trained only on `svla_so101_pickplace`;
  - evaluated only in offline replay;
  - telepathy path tuned for a single task family and embodiment.

Users should treat Sigma as a **proof-of-concept** that demonstrates how “deep semantic + associative intent” can be engineered into residual control, not as a generic controller.

---

## 9. Author & Acknowledgements

- **Author**: **Libo Wang**
- Base policy and dataset by the **Physical Intelligence / LeRobot** teams.
- The training environment is a single RTX 4090 GPU; all scripts are structured to be portable to other single-GPU or multi-GPU setups with minimal changes.

---

## 10. Citation

If you use Sigma, please cite both the original π0.5 / OpenPI work and this Sigma extension.

**π0.5 / OpenPI:**

```bibtex
@article{openpi2024,
  title  = {Open-World Robotic Manipulation with Vision-Language-Action Models},
  author = {Physical Intelligence},
  year   = {2024},
  url    = {https://github.com/Physical-Intelligence/openpi}
}
```

**Sigma (example entry):**

```bibtex
@article{sigma2025,
  title  = {Sigma: The Key for Vision--Language--Action Models toward Telepathy},
  author = {Wang, Libo},
  year   = {2025},
  note   = {Telepathy-style extension of lerobot/pi05_base},
  url    = {https://huggingface.co/Veltraxor/Sigma}
}
```