---
license: apache-2.0
tags:
- robotics
- lerobot
- pi0-fast
- imitation-learning
- vla
datasets:
- gpudad/so101_pick_cube_chunked
base_model:
- lerobot/pi0fast-base
pipeline_tag: robotics
---

# π₀-FAST SO101 Pick & Place

A finetuned [π₀-FAST](https://huggingface.co/lerobot/pi0fast-base) model for pick-and-place tasks on the SO101 robot arm.

## Model Details

- **Base Model:** lerobot/pi0fast-base (3B parameters)
- **Training Dataset:** [gpudad/so101_pick_cube_chunked](https://huggingface.co/datasets/gpudad/so101_pick_cube_chunked)
  - 10,990 episodes
  - 1,456,443 frames @ 30 FPS
  - 3 cameras: front, overhead, wrist (512×512)
  - 6-DOF action space
- **Training Steps:** 10,000 (quick validation run)
- **Final Loss:** 2.35
- **Hardware:** NVIDIA RTX 5090 (32 GB VRAM)
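
The step count is small relative to the dataset. A rough coverage sketch, assuming each optimizer step draws one batch of 4 frames (the batch size from the training configuration below) and ignoring repeated sampling:

```python
# How much of the dataset does the 10,000-step run sample?
# Assumes each step draws batch_size frames; repeats are not accounted for.
steps = 10_000
batch_size = 4
total_frames = 1_456_443

frames_seen = steps * batch_size
coverage = frames_seen / total_frames
print(frames_seen)                       # 40000
print(f"{coverage:.1%} of the dataset")  # 2.7% of the dataset
```

Even counting every sampled frame once, the run touches under 3% of the data, consistent with its "quick validation" framing.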

## Performance

Tested on held-out samples from the dataset:

| Metric | Value |
|--------|-------|
| Mean MAE | 0.079 |
| Relative Error | ~2.6% of action range |
| Best MAE | 0.0085 |
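
The relative-error row is the MAE normalized by the span of the action space. A minimal sketch, where the ~3.0-unit range is an assumption back-solved to match the reported figure (the card does not state the true range):

```python
# Relative error = mean absolute error / span of the action space.
# action_range below is an assumed value, not taken from the card.
mean_mae = 0.079
action_range = 3.0

relative_error = mean_mae / action_range
print(f"{relative_error:.1%} of action range")  # 2.6% of action range
```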

## Usage
```python
import torch

from lerobot.policies.pi0_fast.modeling_pi0_fast import PI0FastPolicy
from lerobot.processor.pipeline import PolicyProcessorPipeline

# Load model
policy = PI0FastPolicy.from_pretrained("gpudad/pi0fast-so101-pick-cube")
policy.to("cuda")
policy.eval()

# Load pre/post-processors (normalization, tokenization, etc.)
preprocessor = PolicyProcessorPipeline.from_pretrained(
    "gpudad/pi0fast-so101-pick-cube",
    "policy_preprocessor.json",
)
postprocessor = PolicyProcessorPipeline.from_pretrained(
    "gpudad/pi0fast-so101-pick-cube",
    "policy_postprocessor.json",
)

# Run inference. state_tensor and the three camera images come from your
# robot driver; see the dataset card for the expected shapes.
observation = {
    "observation.state": state_tensor,
    "observation.images.front": front_image,
    "observation.images.wrist": wrist_image,
    "observation.images.overhead": overhead_image,
    "task": "pick up the object and place it in the target location",
}

batch = preprocessor(observation)
# The tokenizer emits an integer attention mask; the policy expects booleans.
batch["observation.language.attention_mask"] = batch["observation.language.attention_mask"].bool()

policy.reset()
with torch.no_grad():
    action = policy.select_action(batch)

result = postprocessor({"action": action})
final_action = result["action"]
```
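
The snippet above assumes `state_tensor` and the camera images are already policy-ready. As a sketch of the usual raw-frame conversion (NumPy for illustration; the channel-first layout and [0, 1] scaling are assumptions here — the authoritative contract is in `policy_preprocessor.json`):

```python
import numpy as np

# Hypothetical raw camera frame: 512x512 RGB, uint8, HWC layout.
raw = np.zeros((512, 512, 3), dtype=np.uint8)
raw[..., 0] = 255  # pretend the red channel is saturated

# Convert to float CHW in [0, 1] and add a batch dimension — the layout
# vision policies commonly expect (an assumption; verify against the
# preprocessor config before deploying).
img = raw.transpose(2, 0, 1).astype(np.float32) / 255.0
batched = img[None]

print(batched.shape)  # (1, 3, 512, 512)
print(batched.max())  # 1.0
```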

## Training Configuration

```yaml
policy.type: pi0_fast
policy.dtype: bfloat16
policy.gradient_checkpointing: true
policy.chunk_size: 10
policy.n_action_steps: 10
batch_size: 4
optimizer_lr: 2.5e-5
scheduler_warmup_steps: 400
```
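
With `chunk_size` and `n_action_steps` both set to 10, the policy predicts a 10-action chunk per forward pass and executes the whole chunk before re-planning. Against the dataset's 30 FPS control rate, that works out to roughly three model calls per second:

```python
# Replanning cadence implied by the chunking config and the 30 FPS data rate.
chunk_size = 10       # actions predicted per forward pass
n_action_steps = 10   # actions executed before the next forward pass
fps = 30

calls_per_second = fps / n_action_steps
seconds_per_chunk = n_action_steps / fps
print(calls_per_second)             # 3.0 model calls per second
print(round(seconds_per_chunk, 3))  # 0.333 s of actions per call
```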

## Citation

If you use this model, please cite the original π₀ paper and LeRobot:

```bibtex
@article{black2024pi0,
  title={$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control},
  author={Black, Kevin and Brown, Noah and Driess, Danny and others},
  journal={arXiv preprint arXiv:2410.24164},
  year={2024}
}
```