+---
+tags:
+- robotics
+- imitation-learning
+- xvla
+- so101
+- pick-and-place
+license: apache-2.0
+---
+# X-VLA SO-101 Phase II - All Checkpoints
+Fine-tuned X-VLA model checkpoints for a pick-and-place task on the SO-101 robot arm.
+## Model Details
+- **Base model:** [lerobot/xvla-base](https://huggingface.co/lerobot/xvla-base)
+- **Training steps:** 200,000 total
+- **Task:** Pick up cube and place in bin
+- **Robot:** SO-101 single arm
+- **Action space:** Delta position control (4D: x, y, z, gripper)
+- **Domain ID:** 0 (WidowX-compatible)
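To make the 4D delta action space concrete, here is a minimal sketch of how one action could be applied to the current end-effector state. `ArmState`, `apply_delta_action`, the 0.5 gripper threshold, and the convention that a high gripper value means "closed" are all illustrative assumptions, not part of the released model or of LeRobot.

```python
from dataclasses import dataclass


@dataclass
class ArmState:
    """Hypothetical end-effector state: position plus a binary gripper flag."""
    x: float
    y: float
    z: float
    gripper_closed: bool


def apply_delta_action(state, action, gripper_threshold=0.5):
    """Apply one 4D action (dx, dy, dz, gripper) to the current state.

    The first three components are added to the current position (delta
    control). The fourth is treated as a probability-like value and
    thresholded; both the threshold and the direction are assumptions.
    """
    dx, dy, dz, gripper = action
    return ArmState(
        x=state.x + dx,
        y=state.y + dy,
        z=state.z + dz,
        gripper_closed=gripper > gripper_threshold,
    )


state = ArmState(x=0.20, y=0.00, z=0.10, gripper_closed=False)
state = apply_delta_action(state, (0.01, 0.0, -0.02, 0.9))
```

Because each action is a relative offset, small prediction errors accumulate over time, which is one reason frequent re-querying of the policy can help.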
+## Available Checkpoints
+| Checkpoint | Steps | Path |
+|------------|-------|------|
+| 020000 | 20,000 | `020000/pretrained_model/` |
+| 040000 | 40,000 | `040000/pretrained_model/` |
+| 060000 | 60,000 | `060000/pretrained_model/` |
+| 080000 | 80,000 | `080000/pretrained_model/` |
+| 100000 | 100,000 | `100000/pretrained_model/` |
+| 120000 | 120,000 | `120000/pretrained_model/` |
+| 140000 | 140,000 | `140000/pretrained_model/` |
+| 160000 | 160,000 | `160000/pretrained_model/` |
+| 180000 | 180,000 | `180000/pretrained_model/` |
+| 200000 | 200,000 | `200000/pretrained_model/` |
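The paths in the table follow a regular pattern (zero-padded step count, then `pretrained_model`), so a subfolder can be derived from a step count. The helper below is a small sketch of that pattern; `CHECKPOINT_STEPS` and `checkpoint_subfolder` are hypothetical names introduced here for illustration.

```python
# Steps listed in the table above: 20,000 to 200,000 in increments of 20,000.
CHECKPOINT_STEPS = list(range(20_000, 200_001, 20_000))


def checkpoint_subfolder(step):
    """Return the repo subfolder for a saved checkpoint, e.g. 20000 -> '020000/pretrained_model'."""
    if step not in CHECKPOINT_STEPS:
        raise ValueError(f"No checkpoint saved at step {step}")
    # Directory names are the step count zero-padded to six digits.
    return f"{step:06d}/pretrained_model"


subfolders = [checkpoint_subfolder(step) for step in CHECKPOINT_STEPS]
```

This is handy when sweeping an evaluation over all ten checkpoints instead of typing each path by hand.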
+## Training Configuration
+- **Frozen:** Vision encoder, language encoder
+- **Trained:** Policy transformer, soft prompts, action heads
+- **Loss:** L1 on the XYZ deltas, binary cross-entropy (BCE) on the gripper
+- **Learning rate:** 1e-4 decayed to 1e-5, with warmup
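The loss split listed above can be sketched in plain Python for a single 4D action. This is an illustration under assumptions, not the training code: it assumes the gripper output is a raw logit, the gripper target is 0 or 1, and the two terms are summed with a hypothetical `gripper_weight` of 1.0.

```python
import math


def l1_loss(pred_xyz, target_xyz):
    """Mean absolute error over the three position deltas."""
    return sum(abs(p - t) for p, t in zip(pred_xyz, target_xyz)) / len(pred_xyz)


def bce_with_logits(logit, target):
    """Binary cross-entropy on a raw logit, in a numerically stable form."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def action_loss(pred, target, gripper_weight=1.0):
    """Combine L1 on (x, y, z) with BCE on the gripper component.

    `gripper_weight` is an assumed knob; the model card does not state
    how the two terms are weighted.
    """
    xyz_term = l1_loss(pred[:3], target[:3])
    gripper_term = bce_with_logits(pred[3], target[3])
    return xyz_term + gripper_weight * gripper_term


# Perfect position prediction, maximally uncertain gripper (logit 0).
loss = action_loss([0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0])
```

Treating the gripper as a classification target rather than regressing it with L1 fits a gripper that is effectively open or closed.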
+## Best Checkpoint
+The **200000** checkpoint is recommended. Its status on each phase of the task:
+| Phase | Status |
+|-------|--------|
+| Approach cube | ✅ Works |
+| Grasp cube | ✅ Works |
+| Place in bin | ⚠️ Partial |
+## Usage
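+```python
+# If this import fails, check your LeRobot version: releases that dropped the
+# `common` package expose policies under `lerobot.policies` instead.
+from lerobot.common.policies.xvla.modeling_xvla import XVLAPolicy
+
+# Load the recommended checkpoint (200k steps)
+policy = XVLAPolicy.from_pretrained(
+    "gpudad/xvla-so101-phase2-checkpoints",
+    subfolder="200000/pretrained_model",
+)
+
+# Or load an earlier checkpoint (here 100k steps) under a separate name
+policy_100k = XVLAPolicy.from_pretrained(
+    "gpudad/xvla-so101-phase2-checkpoints",
+    subfolder="100000/pretrained_model",
+)
+```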
+## Evaluation Tips
+- Use `n_action_steps=4` so the policy is re-queried more often; this gave better performance.
+- The model works best with 128x128 images from the front and wrist cameras.
+- Language instruction: "pick up the cube and place it in the bin"
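To illustrate what `n_action_steps` controls, the sketch below runs a stand-in policy through a simple action queue: only the first `n_action_steps` actions of each predicted chunk are executed before the policy is queried again. `ChunkedController`, `dummy_predict_chunk`, and the chunk length of 16 are hypothetical; this is not LeRobot's implementation.

```python
from collections import deque


class ChunkedController:
    """Executes the first n_action_steps of each predicted chunk, then re-queries."""

    def __init__(self, predict_chunk, n_action_steps):
        self.predict_chunk = predict_chunk
        self.n_action_steps = n_action_steps
        self.queue = deque()
        self.num_queries = 0

    def select_action(self, observation):
        # Query the policy only when the queue of pending actions is empty.
        if not self.queue:
            chunk = self.predict_chunk(observation)
            self.queue.extend(chunk[: self.n_action_steps])
            self.num_queries += 1
        return self.queue.popleft()


def dummy_predict_chunk(observation):
    """Stand-in for the policy: returns 16 zero actions (dx, dy, dz, gripper)."""
    return [(0.0, 0.0, 0.0, 0.0)] * 16


controller = ChunkedController(dummy_predict_chunk, n_action_steps=4)
for _ in range(12):
    controller.select_action(observation=None)
```

With `n_action_steps=4`, twelve control steps trigger three policy queries; a larger value would query less often and react more slowly to new observations.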
+## File Structure
+```
+├── 020000/
+│   └── pretrained_model/
+│       ├── model.safetensors
+│       ├── config.json
+│       └── ...
+├── 040000/
+│   └── pretrained_model/
+├── ...
+└── 200000/
+    └── pretrained_model/
+```
+## Citation
+Based on X-VLA from LeRobot.