---
license: gemma
language:
- en
tags:
- vision-language-action
- humanoid-robotics
- telepathy
- multimodal
- robotics-control
- lora
- pytorch
base_model: lerobot/pi05_base
datasets:
- lerobot/svla_so101_pickplace
library_name: transformers
pipeline_tag: other
author: "Libo Wang"
---

# Sigma: The Key for Vision–Language–Action Models toward Telepathy

[![Model Card](https://img.shields.io/badge/HF-Sigma-orange?logo=huggingface)](https://huggingface.co/Veltraxor/Sigma)
[![Base Model](https://img.shields.io/badge/base-lerobot%2Fpi05__base-blue)](https://huggingface.co/lerobot/pi05_base)
[![Dataset](https://img.shields.io/badge/dataset-lerobot%2Fsvla__so101__pickplace-green)](https://huggingface.co/datasets/lerobot/svla_so101_pickplace)

Sigma is a **telepathy-style Vision–Language–Action (VLA) model** built on top of `lerobot/pi05_base`. It adds a semantic “telepathy” path and LoRA adapters that steer continuous robot control using internal **semantic memory** and **intent states**, while keeping the original π0.5 backbone weights intact and recoverable.

---

## 1. Summary

- **Base policy**: `lerobot/pi05_base` (π0.5)
- **Author**: **Libo Wang**
- **GPU for training**: single RTX 4090 (24 GB)
- **Data**: `lerobot/svla_so101_pickplace`
- **Objective**: Make a π0.5-style VLA **use internal semantic & intent states** to refine continuous control, rather than only imitating trajectories.

Sigma keeps the perception and control structure of π0.5, and introduces an additional pathway that:

- fuses **vision, language, and robot state** into a shared latent sequence,
- maintains a **semantic state** m_t and an **intent vector** z_intent over time,
- converts them into **telepathy factors** that modulate the policy’s action outputs as residual corrections.

---

## 2. Architecture at a Glance

Sigma can be seen as **π0.5 + telepathic head + LoRA adapters**:

- **Vision / State stream**
  - reuse π0.5 encoders for images and robot state;
  - add FiLM-style modulation from telepathy factors on vision tokens.
- **Language–semantic stream**
  - take text tokens, vision tokens, and state tokens into a shared MLLM backbone;
  - derive:
    - a **semantic memory** m_t that accumulates cross-time information,
    - an **intent vector** z_intent,
    - pooled **semantic factors** aligned with the text embedding space.
- **Action stream (three branches)**
  - treat π0.5 outputs as **baseline**:
    - action vector (per-step),
    - action chunk (short horizon),
    - action trajectory (full horizon);
  - learn **residual actions** driven by telepathy factors on all three branches.

The resulting policy still *looks like* π0.5 from the outside (same inputs, same output types), but actions are now corrected by an internal telepathy pathway that is aware of **deep semantics and associative intent**.

---

## 3. Training Setup

### 3.1 Dataset & preprocessing

- **Upstream dataset**: `lerobot/svla_so101_pickplace`
- **Task**: pick-and-place style manipulation with multi-frame RGB + robot state + continuous actions.

A preprocessing script (`dataset_preprocess_sigma_vla.py`) does:

- sliding-window segmentation with horizon `T = 16`,
- filtering out windows with nearly zero action norm to remove static segments,
- packing vision frames, robot state, and 3-scale action targets into tensor batches,
- exporting three sharded files:

```text
storage/sigma_pickplace/shard_00000.pt
storage/sigma_pickplace/shard_00001.pt
storage/sigma_pickplace/shard_00002.pt
```

These shards are the **only** data used for Sigma training and evaluation.
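
The windowing and filtering steps above can be sketched as follows. This is a minimal illustration, not the logic of `dataset_preprocess_sigma_vla.py`: the function name `sliding_windows`, the stride of 1, the mean-norm criterion, and the threshold `min_action_norm=1e-3` are all assumptions, and plain Python lists stand in for tensors so the snippet runs without dependencies.

```python
import math


def sliding_windows(actions, horizon=16, stride=1, min_action_norm=1e-3):
    """Return (start, end) index pairs for windows that contain real motion.

    actions: list of per-step action vectors (lists of floats).
    A window is dropped when the mean L2 norm of its actions falls below
    min_action_norm, i.e. when the segment is essentially static.
    (Hypothetical helper; threshold and stride are illustrative.)
    """
    kept = []
    for start in range(0, len(actions) - horizon + 1, stride):
        window = actions[start:start + horizon]
        mean_norm = sum(
            math.sqrt(sum(a * a for a in step)) for step in window
        ) / horizon
        if mean_norm >= min_action_norm:
            kept.append((start, start + horizon))
    return kept


# Toy episode: 20 static steps followed by 20 moving steps, 6-dim actions.
episode = [[0.0] * 6 for _ in range(20)] + [[0.1] * 6 for _ in range(20)]
windows = sliding_windows(episode)
```

In this toy episode the windows that lie entirely inside the static prefix are discarded, while every window that overlaps the moving part is kept. A real pipeline would slice the matching vision frames and robot states with the same index pairs before packing them into shards.
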
### 3.2 LoRA fine-tuning (Sigma training)

Training is performed on a **single RTX 4090** using `train_sigma_telepathy_vla_lora.py`:

```bash
python train_sigma_telepathy_vla_lora.py \
  --base_model_id lerobot/pi05_base \
  --dataset_dir /workspace/storage/sigma_pickplace \
  --output_dir /workspace/storage/sigma_lora_out \
  --batch_size 4 \
  --gradient_accumulation_steps 4 \
  --max_steps 300 \
  --dtype bf16
```

Key aspects:

- freeze backbone weights from `lerobot/pi05_base`;
- attach **LoRA** on key projections (q, k, v, o) and the telepathy heads;
- jointly optimize:
  - **three control losses**:
    - `L_act_vec` for per-step action vectors,
    - `L_act_chk` for short-horizon chunks,
    - `L_act_trj` for full trajectories;
  - **semantic & telepathy regularizers**:
    - alignment of semantic factors with text embeddings,
    - control of telepathy factor norm `tau_l2`.

All LoRA and telepathy parameters are stored under:

```text
storage/sigma_lora_out/
  sigma_telepathy_heads.pt
  adapter_config.json
  adapter_model.bin
  ...
```

### 3.3 Telepathy-aware training logic

Two key training mechanisms are implemented inside the loss:

- **Telepathic Residual Action Focusing (TRAF)**
  Focuses learning on *residual actions* instead of full actions, and uses **hard-sample mining** (top-k error segments) to allocate more gradient budget to difficult humanoid control windows.
- **Telepathic Semantic Alignment Curriculum (TSAC)**
  Gradually increases the weights of:
  - semantic memory–text alignment,
  - intent–telepathy alignment,

  while maintaining action regression as the primary objective early on. Late in training, Sigma is encouraged to let **internal semantic/intent structure** drive the residual corrections.

---

## 4. Inference-time Telepathy Adapter

A lightweight adapter (`sigma_adapter.py`) controls how much the telepathy residuals are allowed to modify the baseline π0.5 actions:

- reads:
  - baseline π0.5 actions (`base_action_vector`, …),
  - Sigma residuals,
  - telepathy diagnostics (norms, cosine alignments),
- computes a **risk-aware scaling factor** in `[min_scale, max_scale]`,
- blends:

  ```python
  action = base_action + scale * telepathy_residual
  ```

If residuals are too large or misaligned, `scale` is pushed toward 0, effectively reverting to π0.5 behavior. If residuals are moderate and well aligned, `scale` approaches 1, enabling telepathy-enhanced control.

---

## 5. Evaluation Protocol

Evaluation uses `eval_sigma_vla_rollout.py` in **offline closed-loop replay**:

- both Sigma and the baseline:
  - use the *same* preprocessed shards (`shard_0000x.pt`),
  - share the *same* telepathy heads file `sigma_telepathy_heads.pt`,
- **only Sigma**:
  - loads LoRA weights,
  - activates telepathy residuals and the adapter in control output.

### 5.1 CHECK A – telepathy geometry & alignment sanity

CHECK A verifies that **telepathy geometry is identical** between experimental and control runs:

- `heads_tensors = 325`
- `mean ≈ 0.002`, `std ≈ 0.107`, `rms ≈ 0.107` for telepathy head weights
- `avg_tau_l2 ≈ 51.6` – average L2 norm of telepathy factors
- `avg_semantic_text_alignment ≈ 0.13` – semantic factor vs. text embedding alignment

These numbers are matched between Sigma and the π0.5 baseline, so behavior differences cannot be explained by changing telepathy parameters or text alignment geometry.
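
Summary statistics like those in CHECK A can be obtained by pooling every value across the head tensors. The sketch below is an assumption about how such numbers might be computed, not the code in `eval_sigma_vla_rollout.py`: `head_stats` and `l2_norm` are hypothetical helpers, and flat Python lists stand in for the tensors stored in `sigma_telepathy_heads.pt`. Because rms² = mean² + std² (with the population standard deviation), a mean close to zero makes `rms` and `std` nearly equal, which is consistent with both being reported as roughly 0.107.

```python
import math


def head_stats(tensors):
    """Pool all values from a dict of flat weight lists and summarize them.

    (Hypothetical helper; real head files would hold tensors, not lists.)
    """
    values = [v for weights in tensors.values() for v in weights]
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n  # population variance
    rms = math.sqrt(sum(v * v for v in values) / n)
    return {
        "tensors": len(tensors),
        "mean": mean,
        "std": math.sqrt(var),
        "rms": rms,
    }


def l2_norm(vector):
    """L2 norm of one telepathy factor vector (the quantity averaged as tau_l2)."""
    return math.sqrt(sum(v * v for v in vector))


# Toy stand-in for the heads file: two small "tensors".
heads = {"head_a.weight": [0.1, -0.1, 0.2], "head_b.bias": [-0.2, 0.0]}
stats = head_stats(heads)
```

Running the same summary on the heads loaded by the Sigma run and by the baseline run, then comparing the two dictionaries, is one simple way to confirm that both runs share identical telepathy parameters.
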
### 5.2 CHECK B – multiscale control & telepathy metrics

CHECK B defines and reports:

- `mse_vec` – per-step action vector MSE (fine-grain control precision)
- `mse_chk` – short segment chunk MSE (local motion consistency)
- `mse_trj` – full trajectory MSE (long-horizon tracking)
- `tau_l2` – telepathy factor norms (activation strength)
- `sem_align` – semantic alignment (e.g., cosine) between semantic factors and text embeddings

On the same 723 samples and 181 batches:

- Sigma shows **consistently lower `mse_vec`, `mse_chk`, `mse_trj`** than the baseline,
- while **`tau_l2` and `sem_align` remain similar** between both models.

This pattern supports the interpretation that Sigma **uses the same semantic / telepathy geometry more effectively**, converting it into tangible gains in control accuracy instead of merely altering the embedding space.

---

## 6. How to Use Sigma

> ⚠️ You must have access to `lerobot/pi05_base` and the preprocessed shards or an equivalent environment to reproduce full experiments.

### 6.1 Installation (example)

```bash
# base env
pip install "transformers>=4.40.0" accelerate torch torchvision
pip install lerobot

# clone this repository (example path)
git clone https://github.com/Veltraxor/Sigma.git
cd Sigma
```

### 6.2 Loading Sigma on top of π0.5

```python
import torch

# Import paths are the ones given in this card; check them against your
# installed lerobot version and your Sigma checkout before running.
from lerobot import Pi05Policy
from sigma_vla import SigmaTelepathyVLA, SigmaTelepathyAdapter

device = "cuda"
dtype = torch.bfloat16

# 1. Load base π0.5 policy
base_policy = Pi05Policy.from_pretrained("lerobot/pi05_base")

# 2. Build Sigma on top of the base policy
sigma_policy = SigmaTelepathyVLA.from_base(
    base_policy=base_policy,
    lora_dir="./storage/sigma_lora_out",
    telepathy_heads_path="./storage/sigma_lora_out/sigma_telepathy_heads.pt",
    device=device,
    dtype=dtype,
)

# 3. Optional runtime adapter
adapter = SigmaTelepathyAdapter(
    min_scale=0.0,
    max_scale=1.0,
    risk_temperature=1.0,
)

# 4. Single batch forward (offline replay)
batch = {
    "vis_obs": vis_obs_tensor,          # [B, T, C, H, W]
    "robot_state": robot_state_tensor,  # [B, T, D_state]
    "texts": list_of_text_prompts,      # length B
}

with torch.no_grad():
    out = sigma_policy(**batch, use_telepathy=True)
    blended_action = adapter(
        base_action_vector=out["base_action_vector"],
        telepathy_residual=out["telepathy_residual_vector"],
        telepathy_factors=out["telepathy_factors"],
    )
```

---

## 7. Repository Layout (typical)

A typical Sigma repo / model card includes:

```text
README.md                          # this file
sigma_env.example                  # example env file for HF tokens, paths
dataset_preprocess_sigma_vla.py
train_sigma_telepathy_vla_lora.py
eval_sigma_vla_rollout.py
sigma_telepathy_vla.py             # model definition
sigma_adapter.py                   # inference-time adapter
storage/
  sigma_pickplace/
    shard_00000.pt
    shard_00001.pt
    shard_00002.pt
  sigma_lora_out/
    sigma_telepathy_heads.pt
    adapter_config.json
    adapter_model.bin
    ...
logs/
  sigma_eval_report.json
  sigma_eval_checkA.json
  sigma_eval_checkB.json
```

You can adapt this layout to your own environment; the key assumption is that **Sigma is always loaded as a LoRA + telepathy delta on top of `lerobot/pi05_base`**.

---

## 8. Intended Use, Risks, and Limitations

- **Intended use**
  Sigma is intended for **research and experimentation** on:
  - semantic / telepathy-style control in VLA systems,
  - offline trajectory analysis and simulation,
  - early-stage humanoid / manipulator control studies.
- **Not intended for**
  - direct deployment on physical robots **without additional safety layers**;
  - safety-critical or human-facing applications.
- **Known limitations**
  - trained only on `svla_so101_pickplace`;
  - evaluated only in offline replay;
  - telepathy path tuned for a single task family and embodiment.

Users should treat Sigma as a **proof-of-concept** that demonstrates how “deep semantic + associative intent” can be engineered into residual control, not as a generic controller.

---

## 9. Author & Acknowledgements

- **Author**: **Libo Wang**
- Base policy and dataset by **Physical Intelligence / LeRobot** teams.
- Training environment based on a single RTX 4090 GPU; all scripts are structured to be portable to other single-GPU or multi-GPU setups with minimal changes.

---

## 10. Citation

If you use Sigma, please cite both the original π0.5 / OpenPI work and this Sigma extension.

**π0.5 / OpenPI:**

```bibtex
@article{openpi2024,
  title  = {Open-World Robotic Manipulation with Vision-Language-Action Models},
  author = {Physical Intelligence},
  year   = {2024},
  url    = {https://github.com/Physical-Intelligence/openpi}
}
```

**Sigma (example entry):**

```bibtex
@article{sigma2025,
  title  = {Sigma: The Key for Vision--Language--Action Models toward Telepathy},
  author = {Wang, Libo},
  year   = {2025},
  note   = {Telepathy-style extension of lerobot/pi05_base},
  url    = {https://huggingface.co/Veltraxor/Sigma}
}
```