Veltraxor/Sigma

@@ -2,78 +2,362 @@
 license: gemma
 language:
 - en
 ---
-# π₀.₅ (Pi05)
-These weights directly come from the Pytorch conversion script of openpi and their `pi05_base` model.
-π₀.₅ is a **Vision-Language-Action model with open-world generalization**, from Physical Intelligence. The LeRobot implementation is adapted from their open source [OpenPI](https://github.com/Physical-Intelligence/openpi) repository.
-## Model Overview
-π₀.₅ represents a significant evolution from π₀, developed by [Physical Intelligence](https://www.physicalintelligence.company/blog/pi05) to address a big challenge in robotics: **open-world generalization**. While robots can perform impressive tasks in controlled environments, π₀.₅ is designed to generalize to entirely new environments and situations that were never seen during training.
-### The Generalization Challenge
-As Physical Intelligence explains, the fundamental challenge isn't performing tasks of agility or dexterity, but generalization, the ability to correctly perform tasks in new settings with new objects. Consider a robot cleaning different homes: each home has different objects in different places. Generalization must occur at multiple levels:
-- **Physical Level**: Understanding how to pick up a spoon (by the handle) or plate (by the edge), even with unseen objects in cluttered environments
-- **Semantic Level**: Understanding task semantics, where to put clothes and shoes (laundry hamper, not on the bed), and what tools are appropriate for cleaning spills
-- **Environmental Level**: Adapting to "messy" real-world environments like homes, grocery stores, offices, and hospitals
-### Co-Training on Heterogeneous Data
-The breakthrough innovation in π₀.₅ is **co-training on heterogeneous data sources**. The model learns from:
-1. **Multimodal Web Data**: Image captioning, visual question answering, object detection
-2. **Verbal Instructions**: Humans coaching robots through complex tasks step-by-step
-3. **Subtask Commands**: High-level semantic behavior labels (e.g., "pick up the pillow" for an unmade bed)
-4. **Cross-Embodiment Robot Data**: Data from various robot platforms with different capabilities
-5. **Multi-Environment Data**: Static robots deployed across many different homes
-6. **Mobile Manipulation Data**: ~400 hours of mobile robot demonstrations
-This diverse training mixture creates a "curriculum" that enables generalization across physical, visual, and semantic levels simultaneously.
-## Training
-Here's a complete training command for finetuning the base π₀.₅ model on your own dataset:
 ```bash
-python src/lerobot/scripts/train.py \
-    --dataset.repo_id=your_dataset \
-    --policy.type=pi05 \
-    --output_dir=./outputs/pi05_training \
-    --job_name=pi05_training \
-    --policy.repo_id=your_repo_id \
-    --policy.pretrained_path=lerobot/pi05_base \
-    --policy.compile_model=true \
-    --policy.gradient_checkpointing=true \
-    --wandb.enable=true \
-    --policy.dtype=bfloat16 \
-    --steps=3000 \
-    --policy.scheduler_decay_steps=3000 \
-    --policy.device=cuda \
-    --batch_size=32
 ```
-## Citation
-If you use this model, please cite the original OpenPI work:
-```bibtex
-@article{openpi2024,
-    title={Open-World Robotic Manipulation with Vision-Language-Action Models},
-    author={Physical Intelligence},
-    year={2024},
-    url={https://github.com/Physical-Intelligence/openpi}
 }
 ```
-## Original Repository
-[OpenPI GitHub Repository](https://github.com/Physical-Intelligence/openpi)
-## License
-This model follows the same license as the original OpenPI repository.

 license: gemma
 language:
 - en
+tags:
+- vision-language-action
+- humanoid-robotics
+- telepathy
+- multimodal
+- robotics-control
+- lora
+- pytorch
+base_model: lerobot/pi05_base
+datasets:
+- lerobot/svla_so101_pickplace
+library_name: transformers
+pipeline_tag: other
+author: "Libo Wang"
 ---
+# Sigma: The Key for Vision–Language–Action Models toward Telepathy
+[![Model Card](https://img.shields.io/badge/HF-Sigma-orange?logo=huggingface)](https://huggingface.co/Veltraxor/Sigma)
+[![Base Model](https://img.shields.io/badge/base-lerobot%2Fpi05__base-blue)](https://huggingface.co/lerobot/pi05_base)
+[![Dataset](https://img.shields.io/badge/dataset-lerobot%2Fsvla__so101__pickplace-green)](https://huggingface.co/datasets/lerobot/svla_so101_pickplace)
+Sigma is a **telepathy-style Vision–Language–Action (VLA) model** built on top of `lerobot/pi05_base`.
+It adds a semantic “telepathy” path and LoRA adapters that steer continuous robot control using internal **semantic memory** and **intent states**, while keeping the original π0.5 backbone weights intact and recoverable.
+---
+## 1. Summary
+- **Base policy**: `lerobot/pi05_base` (π0.5)
+- **Author**: **Libo Wang**
+- **GPU for training**: single RTX 4090 (24 GB)
+- **Data**: `lerobot/svla_so101_pickplace`
+- **Objective**:
+  Make a π0.5-style VLA **use internal semantic & intent states** to refine continuous control, rather than only imitating trajectories.
+Sigma keeps the perception and control structure of π0.5, and introduces an additional pathway that:
+- fuses **vision, language, and robot state** into a shared latent sequence,
+- maintains a **semantic state** \(m_t\) and an **intent vector** \(z_\text{intent}\) over time,
+- converts them into **telepathy factors** that modulate the policy’s action outputs as residual corrections.
+---
+## 2. Architecture at a Glance
+Sigma can be seen as **π0.5 + telepathic head + LoRA adapters**:
+- **Vision / State stream**
+  - reuse π0.5 encoders for images and robot state;
+  - add FiLM-style modulation from telepathy factors on vision tokens.
+- **Language–semantic stream**
+  - take text tokens, vision tokens, and state tokens into a shared MLLM backbone;
+  - derive:
+    - a **semantic memory** \(m_t\) that accumulates cross-time information,
+    - an **intent vector** \(z_\text{intent}\),
+    - pooled **semantic factors** aligned with the text embedding space.
+- **Action stream (three branches)**
+  - treat π0.5 outputs as **baseline**:
+    - action vector (per-step),
+    - action chunk (short horizon),
+    - action trajectory (full horizon);
+  - learn **residual actions** driven by telepathy factors on all three branches.
+The resulting policy still *looks like* π0.5 from the outside (same inputs, same output types), but actions are now corrected by an internal telepathy pathway that is aware of **deep semantics and associative intent**.
+---
+## 3. Training Setup
+### 3.1 Dataset & preprocessing
+- **Upstream dataset**: `lerobot/svla_so101_pickplace`
+- **Task**: pick-and-place style manipulation with multi-frame RGB + robot state + continuous actions.
+A preprocessing script (`dataset_preprocess_sigma_vla.py`) performs the following steps:
+- sliding-window segmentation with horizon `T = 16`,
+- filtering out windows with nearly zero action norm to remove static segments,
+- packing vision frames, robot state, and 3-scale action targets into tensor batches,
+- exporting three sharded files:
+```text
+storage/sigma_pickplace/shard_00000.pt
+storage/sigma_pickplace/shard_00001.pt
+storage/sigma_pickplace/shard_00002.pt
+```
+These shards are the **only** data used for Sigma training and evaluation.
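The windowing and static-segment filter described above can be sketched as follows. This is a minimal illustration, not the code in `dataset_preprocess_sigma_vla.py`: only the horizon `T = 16` comes from the text, while the function names, `stride`, and the `min_mean_norm` threshold are assumptions.

```python
import math


def action_norm(step):
    """L2 norm of one action vector (a sequence of floats)."""
    return math.sqrt(sum(a * a for a in step))


def sliding_windows(actions, horizon=16, stride=1, min_mean_norm=1e-3):
    """Yield (start_index, window) pairs of length `horizon`.

    Windows whose mean per-step action norm is at or below `min_mean_norm`
    are skipped as static segments. `stride` and `min_mean_norm` are
    illustrative values, not the ones used by the actual script.
    """
    for start in range(0, len(actions) - horizon + 1, stride):
        window = actions[start:start + horizon]
        mean_norm = sum(action_norm(step) for step in window) / horizon
        if mean_norm > min_mean_norm:
            yield start, window


# Toy episode: 20 static steps followed by 20 moving steps.
episode = [[0.0, 0.0]] * 20 + [[1.0, 0.0]] * 20
kept_starts = [start for start, _ in sliding_windows(episode)]
```

In the toy episode, windows that lie entirely inside the static prefix are dropped, and every window that overlaps the moving part is kept.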
+### 3.2 LoRA fine-tuning (Sigma training)
+Training is performed on a **single RTX 4090** using `train_sigma_telepathy_vla_lora.py`:
 ```bash
+python train_sigma_telepathy_vla_lora.py \
+  --base_model_id lerobot/pi05_base \
+  --dataset_dir /workspace/storage/sigma_pickplace \
+  --output_dir /workspace/storage/sigma_lora_out \
+  --batch_size 4 \
+  --gradient_accumulation_steps 4 \
+  --max_steps 300 \
+  --dtype bf16
 ```
+Key aspects:
+- freeze backbone weights from `lerobot/pi05_base`;
+- attach **LoRA** to the attention projections (`q`, `k`, `v`, `o`) and the telepathy heads;
+- jointly optimize:
+  - **three control losses**:
+    - `L_act_vec` for per-step action vectors,
+    - `L_act_chk` for short-horizon chunks,
+    - `L_act_trj` for full trajectories;
+  - **semantic & telepathy regularizers**:
+    - alignment of semantic factors with text embeddings,
+    - control of telepathy factor norm `tau_l2`.
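One plausible way to combine the terms listed above is a weighted sum. The sketch below assumes that form; the weight names and values, the `1 - cosine` alignment penalty, and the quadratic norm penalty are illustrative choices, not the formulation used in `train_sigma_telepathy_vla_lora.py`.

```python
from dataclasses import dataclass


def mse(pred, target):
    """Mean squared error over two equally sized flat sequences."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


@dataclass(frozen=True)
class LossWeights:
    # Hypothetical weights; the values used for Sigma are not stated here.
    w_vec: float = 1.0
    w_chk: float = 1.0
    w_trj: float = 1.0
    w_align: float = 0.1
    w_tau: float = 0.01


def total_loss(l_act_vec, l_act_chk, l_act_trj, sem_align, tau_l2, weights=None):
    """Weighted sum of the three control losses and two regularizers.

    `sem_align` is a cosine similarity in [-1, 1], so (1 - sem_align) is the
    quantity to minimize. `tau_l2` is penalized quadratically (an assumption).
    """
    w = weights or LossWeights()
    control = w.w_vec * l_act_vec + w.w_chk * l_act_chk + w.w_trj * l_act_trj
    regularizers = w.w_align * (1.0 - sem_align) + w.w_tau * tau_l2 ** 2
    return control + regularizers
```

With perfect alignment and a zero telepathy norm the regularizers vanish, and the total reduces to the weighted control losses.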
+All LoRA and telepathy parameters are stored under:
+```text
+storage/sigma_lora_out/
+  sigma_telepathy_heads.pt
+  adapter_config.json
+  adapter_model.bin
+  ...
+```
+### 3.3 Telepathy-aware training logic
+Two key training mechanisms are implemented inside the loss:
+- **Telepathic Residual Action Focusing (TRAF)**
+  Focuses learning on *residual actions* instead of full actions, and uses **hard-sample mining** (top-k error segments) to allocate more gradient budget to difficult humanoid control windows.
+- **Telepathic Semantic Alignment Curriculum (TSAC)**
+  Gradually increases the weights of:
+  - semantic memory–text alignment,
+  - intent–telepathy alignment,
+  while maintaining action regression as the primary objective early on.
+  Late in training, Sigma is encouraged to let **internal semantic/intent structure** drive the residual corrections.
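The two mechanisms can be illustrated with a small sketch. It assumes that hard-sample mining averages the `k` largest per-window residual errors and that the curriculum is a linear ramp. Both function names and the ramp shape are hypothetical, since the text only says that the weights increase gradually.

```python
def topk_mean(window_errors, k):
    """Average of the k largest per-window errors (hard-sample mining)."""
    if not 1 <= k <= len(window_errors):
        raise ValueError("k must be between 1 and the number of windows")
    hardest = sorted(window_errors, reverse=True)[:k]
    return sum(hardest) / k


def curriculum_weight(step, max_steps, start=0.0, end=1.0):
    """Linear ramp of an alignment-loss weight from `start` to `end`."""
    fraction = min(max(step / max_steps, 0.0), 1.0)
    return start + (end - start) * fraction


# Residual-action errors for four windows; only the two hardest contribute.
traf_loss = topk_mean([0.1, 0.9, 0.5, 0.3], k=2)

# Halfway through a 300-step run, the alignment weight is half its final value.
align_weight = curriculum_weight(step=150, max_steps=300)
```

Averaging only the hardest windows concentrates gradient on the segments with the largest residual error, and the ramp keeps action regression dominant early in training.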
+---
+## 4. Inference-time Telepathy Adapter
+A lightweight adapter (`sigma_adapter.py`) controls how much the telepathy residuals are allowed to modify the baseline π0.5 actions:
+- reads:
+  - baseline π0.5 actions (`base_action_vector`, …),
+  - Sigma residuals,
+  - telepathy diagnostics (norms, cosine alignments),
+- computes a **risk-aware scaling factor** in the range `[min_scale, max_scale]`,
+- blends:
+```python
+action = base_action + scale * telepathy_residual
+```
+If residuals are too large or misaligned, `scale` is pushed toward 0, effectively reverting to π0.5 behavior.
+If residuals are moderate and well aligned, `scale` approaches 1, enabling telepathy-enhanced control.
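The gating behavior described above could be realized in many ways. The sketch below is one assumption-laden option, not the logic in `sigma_adapter.py`: it defines risk as the residual-to-baseline norm ratio plus a misalignment term, and maps it to a scale with an exponential gate. The parameter names `min_scale`, `max_scale`, and `risk_temperature` follow the adapter arguments shown in Section 6.2, and everything else is hypothetical.

```python
import math


def risk_aware_scale(residual_norm, base_norm, alignment,
                     min_scale=0.0, max_scale=1.0,
                     risk_temperature=1.0, eps=1e-8):
    """Map telepathy diagnostics to a blend scale in [min_scale, max_scale].

    Risk grows with the relative size of the residual and with its
    misalignment (1 - cosine). The exponential gate is an illustrative choice.
    """
    ratio = residual_norm / (base_norm + eps)
    misalignment = 1.0 - max(-1.0, min(1.0, alignment))  # in [0, 2]
    risk = ratio + misalignment
    gate = math.exp(-risk / risk_temperature)  # 1 at zero risk, toward 0 as risk grows
    return min_scale + (max_scale - min_scale) * gate


def blend(base_action, telepathy_residual, scale):
    """Element-wise action = base_action + scale * telepathy_residual."""
    return [b + scale * r for b, r in zip(base_action, telepathy_residual)]


# A small, well-aligned residual keeps most of its effect.
small = risk_aware_scale(residual_norm=0.1, base_norm=1.0, alignment=1.0)

# A large, misaligned residual is almost entirely suppressed.
large = risk_aware_scale(residual_norm=100.0, base_norm=1.0, alignment=-1.0)
```

A higher `risk_temperature` makes the gate more permissive, while setting `max_scale=0.0` pins the scale at or below zero and so disables the positive telepathy correction.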
+---
+## 5. Evaluation Protocol
+Evaluation uses `eval_sigma_vla_rollout.py` in **offline closed-loop replay**:
+- both Sigma and the baseline:
+  - use the *same* preprocessed shards (`shard_0000x.pt`),
+  - share the *same* telepathy heads file `sigma_telepathy_heads.pt`,
+- **only Sigma**:
+  - loads LoRA weights,
+  - activates telepathy residuals and the adapter in control output.
+### 5.1 CHECK A – telepathy geometry & alignment sanity
+CHECK A verifies that **telepathy geometry is identical** between experimental and control runs:
+- `heads_tensors = 325`
+- `mean ≈ 0.002`, `std ≈ 0.107`, `rms ≈ 0.107` for telepathy head weights
+- `avg_tau_l2 ≈ 51.6` – average L2 norm of telepathy factors
+- `avg_semantic_text_alignment ≈ 0.13` – semantic factor vs. text embedding alignment
+These values match between the Sigma and π0.5 baseline runs, so differences in behavior cannot be attributed to different telepathy parameters or a different text-alignment geometry.
+### 5.2 CHECK B – multiscale control & telepathy metrics
+CHECK B defines and reports:
+- `mse_vec` – per-step action vector MSE (fine-grain control precision)
+- `mse_chk` – short segment chunk MSE (local motion consistency)
+- `mse_trj` – full trajectory MSE (long-horizon tracking)
+- `tau_l2` – telepathy factor norms (activation strength)
+- `sem_align` – semantic alignment (e.g., cosine) between semantic factors and text embeddings
+On the same 723 samples and 181 batches:
+- Sigma shows **consistently lower `mse_vec`, `mse_chk`, `mse_trj`** than the baseline,
+- while **`tau_l2` and `sem_align` remain similar** between both models.
+This pattern supports the interpretation that Sigma **uses the same semantic / telepathy geometry more effectively**, converting it into tangible gains in control accuracy instead of merely altering the embedding space.
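For reference, the telepathy diagnostics and their aggregation over batches can be sketched as below. This assumes `tau_l2` is a plain L2 norm, `sem_align` is a cosine similarity, and per-batch means are combined with sample-count weights. The helper names are hypothetical and `eval_sigma_vla_rollout.py` may aggregate differently.

```python
import math


def l2_norm(vector):
    """L2 norm, as used for the telepathy factor norm `tau_l2`."""
    return math.sqrt(sum(x * x for x in vector))


def cosine(a, b, eps=1e-8):
    """Cosine similarity, as used for `sem_align`."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (l2_norm(a) * l2_norm(b) + eps)


def sample_weighted_mean(batch_means, batch_sizes):
    """Combine per-batch means so that every sample counts equally."""
    total = sum(batch_sizes)
    return sum(m * n for m, n in zip(batch_means, batch_sizes)) / total


# 723 samples at batch size 4 give 180 full batches plus one batch of 3.
batch_sizes = [4] * 180 + [3]
```

Weighting by batch size matters here because the final batch is smaller than the rest, so an unweighted mean over 181 batches would slightly over-count its samples.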
+---
+## 6. How to Use Sigma
+> ⚠️ You must have access to `lerobot/pi05_base` and the preprocessed shards or an equivalent environment to reproduce full experiments.
+### 6.1 Installation (example)
+```bash
+# base env
+pip install "transformers>=4.40.0" accelerate torch torchvision
+pip install lerobot
+# clone this repository (example path)
+git clone https://github.com/Veltraxor/Sigma.git
+cd Sigma
+```
+### 6.2 Loading Sigma on top of π0.5
+```python
+import torch
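+# Import paths below are assumptions: PI05Policy as exposed by recent LeRobot
+# releases, and the Sigma modules named as in the repository layout (Section 7).
+from lerobot.policies.pi05.modeling_pi05 import PI05Policy
+from sigma_telepathy_vla import SigmaTelepathyVLA
+from sigma_adapter import SigmaTelepathyAdapter
+device = "cuda"
+dtype = torch.bfloat16
+# 1. Load base π0.5 policy
+base_policy = PI05Policy.from_pretrained("lerobot/pi05_base")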
+# 2. Build Sigma on top of the base policy
+sigma_policy = SigmaTelepathyVLA.from_base(
+    base_policy=base_policy,
+    lora_dir="./storage/sigma_lora_out",
+    telepathy_heads_path="./storage/sigma_lora_out/sigma_telepathy_heads.pt",
+    device=device,
+    dtype=dtype,
+)
+# 3. Optional runtime adapter
+adapter = SigmaTelepathyAdapter(
+    min_scale=0.0,
+    max_scale=1.0,
+    risk_temperature=1.0,
+)
+# 4. Single batch forward (offline replay)
+batch = {
+    "vis_obs": vis_obs_tensor,           # [B, T, C, H, W]
+    "robot_state": robot_state_tensor,   # [B, T, D_state]
+    "texts": list_of_text_prompts,       # length B
 }
+with torch.no_grad():
+    out = sigma_policy(**batch, use_telepathy=True)
+    blended_action = adapter(
+        base_action_vector=out["base_action_vector"],
+        telepathy_residual=out["telepathy_residual_vector"],
+        telepathy_factors=out["telepathy_factors"],
+    )
 ```
+---
+## 7. Repository Layout (typical)
+A typical Sigma repo / model card includes:
+```text
+README.md                      # this file
+sigma_env.example              # example env file for HF tokens, paths
+dataset_preprocess_sigma_vla.py
+train_sigma_telepathy_vla_lora.py
+eval_sigma_vla_rollout.py
+sigma_telepathy_vla.py         # model definition
+sigma_adapter.py               # inference-time adapter
+storage/
+  sigma_pickplace/
+    shard_00000.pt
+    shard_00001.pt
+    shard_00002.pt
+  sigma_lora_out/
+    sigma_telepathy_heads.pt
+    adapter_config.json
+    adapter_model.bin
+    ...
+logs/
+  sigma_eval_report.json
+  sigma_eval_checkA.json
+  sigma_eval_checkB.json
+```
+You can adapt this layout to your own environment; the key assumption is that **Sigma is always loaded as a LoRA + telepathy delta on top of `lerobot/pi05_base`**.
+---
+## 8. Intended Use, Risks, and Limitations
+- **Intended use**
+  Sigma is intended for **research and experimentation** on:
+  - semantic / telepathy-style control in VLA systems,
+  - offline trajectory analysis and simulation,
+  - early-stage humanoid / manipulator control studies.
+- **Not intended for**
+  - direct deployment on physical robots **without additional safety layers**;
+  - safety-critical or human-facing applications.
+- **Known limitations**
+  - trained only on `svla_so101_pickplace`;
+  - evaluated only in offline replay;
+  - telepathy path tuned for a single task family and embodiment.
+Users should treat Sigma as a **proof-of-concept** that demonstrates how “deep semantic + associative intent” can be engineered into residual control, not as a generic controller.
+---
+## 9. Author & Acknowledgements
+- **Author**: **Libo Wang**
+- Base policy and dataset by **Physical Intelligence / LeRobot** teams.
+- Training environment based on a single RTX 4090 GPU; all scripts are structured to be portable to other single-GPU or multi-GPU setups with minimal changes.
+---
+## 10. Citation
+If you use Sigma, please cite both the original π0.5 / OpenPI work and this Sigma extension.
+**π0.5 / OpenPI:**
+```bibtex
+@article{openpi2024,
+  title   = {Open-World Robotic Manipulation with Vision-Language-Action Models},
+  author  = {Physical Intelligence},
+  year    = {2024},
+  url     = {https://github.com/Physical-Intelligence/openpi}
+}
+```
+**Sigma (example entry):**
+```bibtex
+@article{sigma2025,
+  title   = {Sigma: The Key for Vision--Language--Action Models toward Telepathy},
+  author  = {Wang, Libo},
+  year    = {2025},
+  note    = {Telepathy-style extension of lerobot/pi05_base},
+  url     = {https://huggingface.co/Veltraxor/Sigma}
+}
+```