---
datasets:
- GetSoloTech/FoodStack
language:
- en
base_model:
- lerobot/smolvla_base
library_name: transformers
tags:
- Robotics
- Lerobot
- Food
- PickPlace
- VLA
- SmolVLA
- PhysicalAI
---
### SmolVLA Fine-Tuned for Food Stacking
**Summary**: This is a fine-tuned version of `lerobot/smolvla_base` for stacking food objects (e.g., burgers, sandwiches). It was fine-tuned on the `GetSoloTech/FoodStack` dataset using the LeRobot framework.
### Model details
- **Base model**: `lerobot/smolvla_base`
- **Task**: Vision-Language-Action control for manipulation (stacking)
- **Domain**: Food item stacking (burger, sandwich, etc.)
- **Params**: ~450M (SmolVLA)
- **Library**: LeRobot (`lerobot`)
### Quick start
Install LeRobot with SmolVLA extras:
```bash
git clone https://github.com/huggingface/lerobot.git
cd lerobot
pip install -e ".[smolvla]"
```
Load the policy from this repo and run inference:
```python
import torch
from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy

# Replace with your actual model ID on the Hub
model_id = "GetSoloTech/SmolVLA-FoodStack"
policy = SmolVLAPolicy.from_pretrained(model_id)
policy.eval()

# Build an observation batch. The exact keys (camera names, state layout)
# must match this checkpoint's training config; the names below are
# illustrative placeholders.
batch = {
    "observation.images.top": ...,  # image tensor, preprocessed as at train time
    "observation.state": ...,       # optional proprioceptive state, if used
    "task": "Stack the burger: bun, patty, cheese, lettuce, bun.",  # language instruction
}

# Inside your control loop, query the policy for the next action
with torch.no_grad():
    action = policy.select_action(batch)

# Send the action to your robot controller
# send_action_to_robot(action)
```
For end-to-end examples (policy loops, camera/robot IO), see the LeRobot docs and examples.
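As a structural sketch of such a policy loop, the snippet below uses a hypothetical `DummyPolicy` stand-in (the real `SmolVLAPolicy`, camera capture, and robot IO are omitted); the batch keys and the 6-DoF action shape are assumptions you should replace with your own setup:

```python
class DummyPolicy:
    """Stand-in for SmolVLAPolicy; returns a fixed 6-DoF command each step."""
    def select_action(self, batch: dict) -> list[float]:
        return [0.0] * 6

def control_loop(policy, instruction: str, steps: int = 3) -> list[list[float]]:
    trajectory = []
    for _ in range(steps):
        # In a real setup: grab the camera frame(s) and proprio state here,
        # preprocessed exactly as at train time.
        batch = {
            "observation.images.top": None,  # placeholder for the image tensor
            "observation.state": None,       # placeholder for proprio state
            "task": instruction,
        }
        action = policy.select_action(batch)
        trajectory.append(action)  # real setup: send to the robot controller instead
    return trajectory

actions = control_loop(DummyPolicy(), "Stack the burger: bun, patty, cheese, lettuce, bun.")
print(len(actions))  # 3
```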
Notes:
- Tune batch size/steps and augmentation to your hardware and dataset split.
- Ensure your observation preprocessing at train-time matches inference.
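To illustrate the preprocessing-consistency point, here is a minimal, dependency-light sketch that converts a raw HWC uint8 camera frame into the CHW float32 `[0, 1]` layout commonly used for image policies; the target size and layout are assumptions, so mirror whatever your training config actually used:

```python
import numpy as np

def preprocess_frame(frame: np.ndarray, size: int = 256) -> np.ndarray:
    """Convert an HWC uint8 frame to CHW float32 in [0, 1].
    The 256x256 size and [0, 1] scaling are illustrative assumptions;
    match your train-time pipeline exactly."""
    h, w = frame.shape[:2]
    # Nearest-neighbor resize via index sampling (keeps the example self-contained)
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    resized = frame[rows][:, cols]
    # Scale to [0, 1] and move channels first: HWC -> CHW
    chw = resized.astype(np.float32) / 255.0
    return np.transpose(chw, (2, 0, 1))

frame = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
obs_image = preprocess_frame(frame)
print(obs_image.shape, obs_image.dtype)  # (3, 256, 256) float32
```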
### Limitations
- Specializes in food stacking; may not generalize to unseen objects/layouts.
- Sensitive to perception domain shift (lighting, textures, camera intrinsics).
- Requires correct observation normalization consistent with training.
### Dataset
- **Training data**: `GetSoloTech/FoodStack`
### Resources and references
- SmolVLA base: `https://huggingface.co/lerobot/smolvla_base`
- SmolVLA overview: `https://smolvla.net/index_en.html`
- LeRobot: `https://github.com/huggingface/lerobot`