	---
	license: gemma
	language:
	- en
	tags:
	- vision-language-action
	- humanoid-robotics
	- telepathy
	- multimodal
	- robotics-control
	- lora
	- pytorch
	base_model: lerobot/pi05_base
	datasets:
	- lerobot/svla_so101_pickplace
	library_name: transformers
	pipeline_tag: other
	author: "Libo Wang"
	---

	# Sigma: The Key for Vision–Language–Action Models toward Telepathy

	[![Model Card](https://img.shields.io/badge/HF-Sigma-orange?logo=huggingface)](https://huggingface.co/Veltraxor/Sigma)
	[![Base Model](https://img.shields.io/badge/base-lerobot%2Fpi05__base-blue)](https://huggingface.co/lerobot/pi05_base)
	[![Dataset](https://img.shields.io/badge/dataset-lerobot%2Fsvla__so101__pickplace-green)](https://huggingface.co/datasets/lerobot/svla_so101_pickplace)

	Sigma is a telepathy-style Vision–Language–Action (VLA) model built on top of `lerobot/pi05_base`.
	It adds a semantic “telepathy” path and LoRA adapters that steer continuous robot control using internal semantic memory and intent states, while keeping the original π0.5 backbone weights intact and recoverable.

	---

	## 1. Summary

	- Base policy: `lerobot/pi05_base` (π0.5)
	- Author: Libo Wang
- GPU for training: single RTX 4090 (24 GB)
- Data: `lerobot/svla_so101_pickplace`
- Objective: make a π0.5-style VLA use internal semantic and intent states to refine continuous control, rather than only imitating trajectories.

	Sigma keeps the perception and control structure of π0.5, and introduces an additional pathway that:

	- fuses vision, language, and robot state into a shared latent sequence,
- maintains a semantic memory state `m_t` and an intent vector `z_intent` over time,
	- converts them into telepathy factors that modulate the policy’s action outputs as residual corrections.

	---

	## 2. Architecture at a Glance

	Sigma can be seen as π0.5 + telepathic head + LoRA adapters:

- Vision / State stream
  - reuse π0.5 encoders for images and robot state;
  - add FiLM-style modulation from telepathy factors on vision tokens.

- Language–semantic stream
  - take text tokens, vision tokens, and state tokens into a shared MLLM backbone;
  - derive:
    - a semantic memory m_t that accumulates cross-time information,
    - an intent vector z_intent,
    - pooled semantic factors aligned with the text embedding space.

- Action stream (three branches)
  - treat π0.5 outputs as baseline:
    - action vector (per-step),
    - action chunk (short horizon),
    - action trajectory (full horizon);
  - learn residual actions driven by telepathy factors on all three branches.

	The resulting policy still looks like π0.5 from the outside (same inputs, same output types), but actions are now corrected by an internal telepathy pathway that is aware of deep semantics and associative intent.
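
The FiLM-style modulation on vision tokens can be pictured as a per-channel scale and shift. The sketch below is a minimal pure-Python illustration rather than Sigma's implementation: the function name `film_modulate`, the `1 + gamma` parameterisation, and the toy numbers are assumptions, and in Sigma the `gamma` and `beta` values would presumably be predicted from the telepathy factors.

```python
from typing import List, Sequence


def film_modulate(
    tokens: Sequence[Sequence[float]],
    gamma: Sequence[float],
    beta: Sequence[float],
) -> List[List[float]]:
    """Apply a FiLM-style affine modulation to every token.

    tokens: N vision tokens, each with D channels.
    gamma, beta: per-channel scale and shift of length D (hypothetical inputs;
    a real model would compute them from the telepathy factors).
    """
    dim = len(gamma)
    if len(beta) != dim or any(len(tok) != dim for tok in tokens):
        raise ValueError("gamma, beta and every token must share the same width")
    # Using (1 + gamma) leaves the tokens unchanged when gamma = beta = 0.
    return [[(1.0 + g) * x + b for x, g, b in zip(tok, gamma, beta)] for tok in tokens]


vision_tokens = [[1.0, 2.0], [3.0, 4.0]]
identity = film_modulate(vision_tokens, gamma=[0.0, 0.0], beta=[0.0, 0.0])
shifted = film_modulate(vision_tokens, gamma=[1.0, 0.0], beta=[0.0, -1.0])
```

With zero `gamma` and `beta` the tokens pass through untouched, which mirrors the idea that the telepathy pathway only adds corrections on top of the π0.5 features.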

	---

	## 3. Training Setup

	### 3.1 Dataset & preprocessing

	- Upstream dataset: `lerobot/svla_so101_pickplace`
	- Task: pick-and-place style manipulation with multi-frame RGB + robot state + continuous actions.

A preprocessing script (`dataset_preprocess_sigma_vla.py`) performs:

	- sliding-window segmentation with horizon `T = 16`,
	- filtering out windows with nearly zero action norm to remove static segments,
	- packing vision frames, robot state, and 3-scale action targets into tensor batches,
	- exporting three sharded files:

	```text
	storage/sigma_pickplace/shard_00000.pt
	storage/sigma_pickplace/shard_00001.pt
	storage/sigma_pickplace/shard_00002.pt
	```

	These shards are the only data used for Sigma training and evaluation.
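
The windowing and static-segment filtering steps can be sketched as follows. This is an illustration under assumptions, not the contents of `dataset_preprocess_sigma_vla.py`: the `stride` and `min_norm` parameters, the use of a mean L2 norm, and the toy episode are hypothetical; only the horizon `T = 16` comes from the text above.

```python
import math
from typing import List, Sequence, Tuple


def action_norm(window: Sequence[Sequence[float]]) -> float:
    """Mean L2 norm of the action vectors in one window."""
    return sum(math.sqrt(sum(a * a for a in step)) for step in window) / len(window)


def sliding_windows(
    actions: Sequence[Sequence[float]],
    horizon: int = 16,
    stride: int = 1,
    min_norm: float = 1e-3,
) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs of windows that contain real motion."""
    kept = []
    for start in range(0, len(actions) - horizon + 1, stride):
        window = actions[start:start + horizon]
        if action_norm(window) >= min_norm:  # drop near-static segments
            kept.append((start, start + horizon))
    return kept


# Toy episode: 20 static steps followed by 20 moving steps (2-D actions).
episode = [[0.0, 0.0]] * 20 + [[0.1, -0.2]] * 20
windows = sliding_windows(episode, horizon=16, stride=4)
```

In this toy episode the two windows that lie entirely inside the static prefix are dropped, while every window that overlaps the moving part is kept.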

	### 3.2 LoRA fine-tuning (Sigma training)

	Training is performed on a single RTX 4090 using `train_sigma_telepathy_vla_lora.py`:

```bash
python train_sigma_telepathy_vla_lora.py \
    --base_model_id lerobot/pi05_base \
    --dataset_dir /workspace/storage/sigma_pickplace \
    --output_dir /workspace/storage/sigma_lora_out \
    --batch_size 4 \
    --gradient_accumulation_steps 4 \
    --max_steps 300 \
    --dtype bf16
```

	Key aspects:

- freeze backbone weights from `lerobot/pi05_base`;
- attach LoRA on the attention projections (q, k, v, o) and the telepathy heads;
- jointly optimize:
  - three control losses:
    - `L_act_vec` for per-step action vectors,
    - `L_act_chk` for short-horizon chunks,
    - `L_act_trj` for full trajectories;
  - semantic & telepathy regularizers:
    - alignment of semantic factors with text embeddings,
    - control of telepathy factor norm `tau_l2`.
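
One way to picture the joint objective is as a sum of the three control losses plus weighted regularizers. The sketch below is an assumption-laden illustration, not the loss in `train_sigma_telepathy_vla_lora.py`: the function name `sigma_total_loss`, the weights `w_align` and `w_tau`, and the choice of `1 - cosine` as the alignment penalty are all hypothetical.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def sigma_total_loss(
    l_act_vec: float,
    l_act_chk: float,
    l_act_trj: float,
    semantic: Sequence[float],
    text: Sequence[float],
    tau: Sequence[float],
    w_align: float = 0.1,   # hypothetical weight
    w_tau: float = 1e-3,    # hypothetical weight
) -> float:
    """Combine control losses with alignment and norm regularizers."""
    control = l_act_vec + l_act_chk + l_act_trj
    align_penalty = 1.0 - cosine(semantic, text)   # 0 when perfectly aligned
    tau_l2 = math.sqrt(sum(t * t for t in tau))    # telepathy factor norm
    return control + w_align * align_penalty + w_tau * tau_l2


aligned = sigma_total_loss(0.2, 0.3, 0.5, [1.0, 0.0], [1.0, 0.0], tau=[3.0, 4.0])
orthogonal = sigma_total_loss(0.2, 0.3, 0.5, [1.0, 0.0], [0.0, 1.0], tau=[3.0, 4.0])
```

With identical control losses, the orthogonal case is penalized more than the aligned one, which is the behavior the alignment regularizer is meant to encourage.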

	All LoRA and telepathy parameters are stored under:

```text
storage/sigma_lora_out/
    sigma_telepathy_heads.pt
    adapter_config.json
    adapter_model.bin
    ...
```

	### 3.3 Telepathy-aware training logic

	Two key training mechanisms are implemented inside the loss:

- Telepathic Residual Action Focusing (TRAF)
  Focuses learning on residual actions instead of full actions, and uses hard-sample mining (top-k error segments) to allocate more gradient budget to difficult humanoid control windows.

- Telepathic Semantic Alignment Curriculum (TSAC)
  Gradually increases the weights of:
  - semantic memory–text alignment,
  - intent–telepathy alignment,

  while maintaining action regression as the primary objective early on.
  Late in training, Sigma is encouraged to let internal semantic/intent structure drive the residual corrections.
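
The two mechanisms can be sketched in a few lines. This is a hedged illustration only: `top_k_mean`, `curriculum_weight`, the linear ramp, and the `warmup_frac` and `w_max` values are assumptions, not the schedule or mining rule used in Sigma's training script.

```python
from typing import Sequence


def top_k_mean(errors: Sequence[float], k: int) -> float:
    """TRAF-style hard-sample mining: average the k largest residual errors."""
    if k < 1:
        raise ValueError("k must be at least 1")
    hardest = sorted(errors, reverse=True)[:k]
    return sum(hardest) / len(hardest)


def curriculum_weight(
    step: int, total_steps: int, w_max: float, warmup_frac: float = 0.5
) -> float:
    """TSAC-style schedule: ramp a regularizer weight linearly from 0 to w_max."""
    ramp_steps = max(1, int(total_steps * warmup_frac))
    return w_max * min(1.0, step / ramp_steps)


# Per-window residual-action errors for one toy batch.
residual_errors = [0.10, 0.90, 0.30, 0.70]
hard_loss = top_k_mean(residual_errors, k=2)

# Alignment weight at the start, a quarter of the way in, and at the end.
w_start = curriculum_weight(0, 300, w_max=0.2)
w_quarter = curriculum_weight(75, 300, w_max=0.2)
w_end = curriculum_weight(300, 300, w_max=0.2)
```

Early in training the alignment weight is near zero, so action regression dominates; once the ramp completes the semantic terms contribute at their full weight.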

	---

	## 4. Inference-time Telepathy Adapter

	A lightweight adapter (`sigma_adapter.py`) controls how much the telepathy residuals are allowed to modify the baseline π0.5 actions:

- reads:
  - baseline π0.5 actions (`base_action_vector`, …),
  - Sigma residuals,
  - telepathy diagnostics (norms, cosine alignments),
- computes a risk-aware scaling factor in `[min_scale, max_scale]`,
- blends:

	```python
	action = base_action + scale * telepathy_residual
	```

	If residuals are too large or misaligned, `scale` is pushed toward 0, effectively reverting to π0.5 behavior.
	If residuals are moderate and well aligned, `scale` approaches 1, enabling telepathy-enhanced control.
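
A minimal sketch of such a risk-aware scale is shown below. It is not the logic in `sigma_adapter.py`: the risk formula, the `max_ratio` threshold, and the single scalar `alignment` input are assumptions chosen only to reproduce the qualitative behavior described above.

```python
import math
from typing import List, Sequence


def l2(v: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def risk_aware_scale(
    base_action: Sequence[float],
    residual: Sequence[float],
    alignment: float,              # cosine-style diagnostic in [-1, 1]
    min_scale: float = 0.0,
    max_scale: float = 1.0,
    risk_temperature: float = 1.0,
    max_ratio: float = 0.5,        # hypothetical "moderate residual" threshold
) -> float:
    """Map residual size and alignment to a scale in [min_scale, max_scale]."""
    ratio = l2(residual) / (l2(base_action) + 1e-8)
    # Risk grows when the residual is oversized or poorly aligned.
    risk = max(0.0, ratio - max_ratio) + (1.0 - alignment)
    confidence = math.exp(-risk / risk_temperature)  # 1 at zero risk, toward 0 otherwise
    return min_scale + (max_scale - min_scale) * confidence


def blend(base: Sequence[float], residual: Sequence[float], scale: float) -> List[float]:
    """action = base_action + scale * telepathy_residual"""
    return [b + scale * r for b, r in zip(base, residual)]


safe_scale = risk_aware_scale([1.0, 0.0], [0.1, 0.0], alignment=1.0)
risky_scale = risk_aware_scale([1.0, 0.0], [5.0, 0.0], alignment=-1.0)
blended = blend([1.0, 0.0], [0.1, 0.0], safe_scale)
```

A small, well-aligned residual passes through at full strength, while a large, misaligned one is almost entirely suppressed, so the output falls back to the baseline action.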

	---

	## 5. Evaluation Protocol

	Evaluation uses `eval_sigma_vla_rollout.py` in offline closed-loop replay:

- both Sigma and the baseline:
  - use the same preprocessed shards (`shard_0000x.pt`),
  - share the same telepathy heads file `sigma_telepathy_heads.pt`,
- only Sigma:
  - loads LoRA weights,
  - activates telepathy residuals and the adapter in control output.

	### 5.1 CHECK A – telepathy geometry & alignment sanity

	CHECK A verifies that telepathy geometry is identical between experimental and control runs:

	- `heads_tensors = 325`
	- `mean ≈ 0.002`, `std ≈ 0.107`, `rms ≈ 0.107` for telepathy head weights
	- `avg_tau_l2 ≈ 51.6` – average L2 norm of telepathy factors
	- `avg_semantic_text_alignment ≈ 0.13` – semantic factor vs. text embedding alignment

These numbers match between Sigma and the π0.5 baseline, so differences in behavior cannot be attributed to changes in the telepathy parameters or in the text-alignment geometry.
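
The reported weight statistics are internally consistent: for population statistics, `rms² = mean² + std²`, so `mean ≈ 0.002` and `std ≈ 0.107` imply `rms ≈ 0.107`. The sketch below shows how such a summary could be computed and checked over a flat list of weights; `weight_stats` and the toy values are hypothetical and are not taken from `eval_sigma_vla_rollout.py`.

```python
import math
from typing import Dict, Sequence


def weight_stats(values: Sequence[float]) -> Dict[str, float]:
    """Population mean, standard deviation, and RMS of a flat weight list."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    rms = math.sqrt(sum(v * v for v in values) / n)
    return {"mean": mean, "std": math.sqrt(var), "rms": rms}


# Toy weights standing in for the flattened telepathy head tensors.
stats = weight_stats([0.12, -0.08, 0.05, -0.11, 0.03])

# Identity check: rms^2 equals mean^2 + std^2 (up to rounding error).
identity_gap = abs(stats["rms"] ** 2 - (stats["mean"] ** 2 + stats["std"] ** 2))

# RMS implied by the mean and std reported in CHECK A.
implied_rms = math.sqrt(0.002 ** 2 + 0.107 ** 2)
```

Running the same summary on the heads file loaded by both runs is one simple way to confirm that the two runs share identical telepathy parameters.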

	### 5.2 CHECK B – multiscale control & telepathy metrics

	CHECK B defines and reports:

- `mse_vec` – per-step action vector MSE (fine-grained control precision)
	- `mse_chk` – short segment chunk MSE (local motion consistency)
	- `mse_trj` – full trajectory MSE (long-horizon tracking)
	- `tau_l2` – telepathy factor norms (activation strength)
	- `sem_align` – semantic alignment (e.g., cosine) between semantic factors and text embeddings

	On the same 723 samples and 181 batches:

	- Sigma shows consistently lower `mse_vec`, `mse_chk`, `mse_trj` than the baseline,
	- while `tau_l2` and `sem_align` remain similar between both models.

	This pattern supports the interpretation that Sigma uses the same semantic / telepathy geometry more effectively, converting it into tangible gains in control accuracy instead of merely altering the embedding space.
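
The three control metrics can be illustrated on a toy trajectory. The slicing convention here is an assumption made for the example (first step for `mse_vec`, first `chunk_len` steps for `mse_chk`, all steps for `mse_trj`); the evaluation script may define the scales differently.

```python
from typing import Dict, Sequence

Trajectory = Sequence[Sequence[float]]  # shape [T][D]


def mse(pred: Trajectory, target: Trajectory) -> float:
    """Mean squared error over all steps and action dimensions."""
    squared = [(p - t) ** 2 for ps, ts in zip(pred, target) for p, t in zip(ps, ts)]
    return sum(squared) / len(squared)


def multiscale_mse(pred: Trajectory, target: Trajectory, chunk_len: int = 4) -> Dict[str, float]:
    """Report MSE at vector, chunk, and trajectory scale (hypothetical slicing)."""
    return {
        "mse_vec": mse(pred[:1], target[:1]),
        "mse_chk": mse(pred[:chunk_len], target[:chunk_len]),
        "mse_trj": mse(pred, target),
    }


# Toy 4-step, 1-D trajectory: errors of 0, 1, 0, and 2 at successive steps.
pred = [[1.0], [1.0], [1.0], [1.0]]
target = [[1.0], [2.0], [1.0], [3.0]]
metrics = multiscale_mse(pred, target, chunk_len=2)
```

Here the first step is exact, so `mse_vec` is zero, while errors later in the horizon only show up in `mse_chk` and `mse_trj`; this is why the three scales are reported separately.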

	---

	## 6. How to Use Sigma

	> ⚠️ You must have access to `lerobot/pi05_base` and the preprocessed shards or an equivalent environment to reproduce full experiments.

	### 6.1 Installation (example)

	```bash
	# base env
	pip install "transformers>=4.40.0" accelerate torch torchvision
	pip install lerobot

	# clone this repository (example path)
	git clone https://github.com/Veltraxor/Sigma.git
	cd Sigma
	```

### 6.2 Loading Sigma on top of π0.5

```python
import torch

# NOTE: import paths follow the original example. The location of the π0.5
# policy class depends on your lerobot version, and section 7 lists the Sigma
# modules as sigma_telepathy_vla.py and sigma_adapter.py, so adjust as needed.
from lerobot import Pi05Policy
from sigma_vla import SigmaTelepathyVLA, SigmaTelepathyAdapter

device = "cuda"
dtype = torch.bfloat16

# 1. Load base π0.5 policy
base_policy = Pi05Policy.from_pretrained("lerobot/pi05_base")

# 2. Build Sigma on top of the base policy
sigma_policy = SigmaTelepathyVLA.from_base(
    base_policy=base_policy,
    lora_dir="./storage/sigma_lora_out",
    telepathy_heads_path="./storage/sigma_lora_out/sigma_telepathy_heads.pt",
    device=device,
    dtype=dtype,
)

# 3. Optional runtime adapter
adapter = SigmaTelepathyAdapter(
    min_scale=0.0,
    max_scale=1.0,
    risk_temperature=1.0,
)

# 4. Single batch forward (offline replay)
# vis_obs_tensor, robot_state_tensor, and list_of_text_prompts are placeholders
# that you supply from your own data pipeline.
batch = {
    "vis_obs": vis_obs_tensor,          # [B, T, C, H, W]
    "robot_state": robot_state_tensor,  # [B, T, D_state]
    "texts": list_of_text_prompts,      # length B
}

with torch.no_grad():
    out = sigma_policy(**batch, use_telepathy=True)
    blended_action = adapter(
        base_action_vector=out["base_action_vector"],
        telepathy_residual=out["telepathy_residual_vector"],
        telepathy_factors=out["telepathy_factors"],
    )
```

	---

	## 7. Repository Layout (typical)

	A typical Sigma repo / model card includes:

```text
README.md                           # this file
sigma_env.example                   # example env file for HF tokens, paths
dataset_preprocess_sigma_vla.py
train_sigma_telepathy_vla_lora.py
eval_sigma_vla_rollout.py
sigma_telepathy_vla.py              # model definition
sigma_adapter.py                    # inference-time adapter

storage/
    sigma_pickplace/
        shard_00000.pt
        shard_00001.pt
        shard_00002.pt
    sigma_lora_out/
        sigma_telepathy_heads.pt
        adapter_config.json
        adapter_model.bin
        ...

logs/
    sigma_eval_report.json
    sigma_eval_checkA.json
    sigma_eval_checkB.json
```

	You can adapt this layout to your own environment; the key assumption is that Sigma is always loaded as a LoRA + telepathy delta on top of `lerobot/pi05_base`.
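
Before loading Sigma, it can help to confirm that the expected artifacts are present. The helper below is a small convenience sketch, not part of the repository: `missing_artifacts` and `EXPECTED_ARTIFACTS` are hypothetical names, and the file list simply mirrors the named entries under `storage/` in the layout above.

```python
from pathlib import Path
from typing import List

# Relative paths copied from the layout above (the elided "..." entries are not listed).
EXPECTED_ARTIFACTS = [
    "sigma_pickplace/shard_00000.pt",
    "sigma_pickplace/shard_00001.pt",
    "sigma_pickplace/shard_00002.pt",
    "sigma_lora_out/sigma_telepathy_heads.pt",
    "sigma_lora_out/adapter_config.json",
    "sigma_lora_out/adapter_model.bin",
]


def missing_artifacts(storage_root: str) -> List[str]:
    """Return the expected files that are absent under storage_root."""
    root = Path(storage_root)
    return [rel for rel in EXPECTED_ARTIFACTS if not (root / rel).is_file()]


# An empty list means every expected shard and adapter file was found.
missing = missing_artifacts("./storage")
```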

	---

	## 8. Intended Use, Risks, and Limitations

- Intended use
  Sigma is intended for research and experimentation on:
  - semantic / telepathy-style control in VLA systems,
  - offline trajectory analysis and simulation,
  - early-stage humanoid / manipulator control studies.

- Not intended for
  - direct deployment on physical robots without additional safety layers;
  - safety-critical or human-facing applications.

- Known limitations
  - trained only on `svla_so101_pickplace`;
  - evaluated only in offline replay;
  - telepathy path tuned for a single task family and embodiment.

	Users should treat Sigma as a proof-of-concept that demonstrates how “deep semantic + associative intent” can be engineered into residual control, not as a generic controller.

	---

	## 9. Author & Acknowledgements

	- Author: Libo Wang
- The base policy and dataset are provided by the Physical Intelligence and LeRobot teams.
	- Training environment based on a single RTX 4090 GPU; all scripts are structured to be portable to other single-GPU or multi-GPU setups with minimal changes.

	---

	## 10. Citation

	If you use Sigma, please cite both the original π0.5 / OpenPI work and this Sigma extension.

	π0.5 / OpenPI:

```bibtex
@article{openpi2024,
  title  = {Open-World Robotic Manipulation with Vision-Language-Action Models},
  author = {Physical Intelligence},
  year   = {2024},
  url    = {https://github.com/Physical-Intelligence/openpi}
}
```

	Sigma (example entry):

```bibtex
@article{sigma2025,
  title  = {Sigma: The Key for Vision--Language--Action Models toward Telepathy},
  author = {Wang, Libo},
  year   = {2025},
  note   = {Telepathy-style extension of lerobot/pi05_base},
  url    = {https://huggingface.co/Veltraxor/Sigma}
}
```