---
language:
- en
license: mit
tags:
- robotics
- vision-language-action
- vla
- flow-matching
- clip
- manipulation
- pick-and-place
- educational
- lightweight
- beginner-friendly
library_name: transformers
pipeline_tag: robotics
datasets:
- synthetic-pick-place-1k
metrics:
- mae
- position_error
- quaternion_error
- accuracy
---
|
|
|
|
|
# Minimal VLA: The Simplest Vision-Language-Action Model

> **A lightweight VLA implementation for learning and experimentation, at only ~20MB!**

A beginner-friendly, minimal Vision-Language-Action (VLA) model designed for **educational purposes** and **rapid prototyping**. This project demonstrates the core concepts of VLA systems using CLIP + Flow Matching in the simplest possible setup.
|
|
|
|
|
## Why This Project?

| Feature | This Model | Typical VLAs |
|---------|-----------|--------------|
| **Model Size** | **~20MB** | 1-7GB+ |
| **Training Time** | **~20 min** | Hours to days |
| **Hardware** | Any GPU / CPU | High-end GPUs |
| **Simulation** | 2D rendering | Physics engines |
| **Complexity** | ~1000 lines | 10,000+ lines |
| **Dependencies** | PyTorch + CLIP | Complex stacks |
|
|
|
|
|
**Perfect for:**

- **Students** learning VLA fundamentals
- **Researchers** prototyping new ideas quickly
- **Educators** teaching robot learning concepts
- **Developers** building their first VLA system
|
|
|
|
|
## Model Overview

This minimal VLA predicts 8-DOF robotic actions from RGB images and natural language:

```
Input: Image (224×224) + Text ("pick up the red cube")
        ↓
CLIP ViT-B/32 (frozen, vision + language encoding)
        ↓
Flow Matching Policy (~2MB trainable parameters)
        ↓
Output: [x, y, z, qx, qy, qz, qw, gripper]
```
|
|
|
|
|
### Key Design Choices for Simplicity

1. **Frozen CLIP Backbone**: no need to train vision-language understanding
2. **2D Synthetic Environment**: no physics engine required
3. **Flow Matching**: an elegant generative approach for continuous actions
4. **Separate Gripper Classifier**: a binary open/close decision
|
|
|
|
|
## Performance

Evaluated on 10 held-out samples from 1000 synthetic demonstrations:

| Metric | Value | Notes |
|--------|-------|-------|
| Position Error | **8.60 cm** | On the order of the ~5 cm cube size |
| Gripper Accuracy | **75%** | Binary open/close classification |
| Overall MAE | **0.1217** | Across all 8 action dimensions |
| Quaternion Error | 19.36° | Best suited to top-down grasps |
|
|
|
|
|
> ⚠️ **Note**: This is an educational model trained on simplified 2D projections. Real-world deployment requires fine-tuning on actual robot data.
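For reference, the position and quaternion numbers above follow standard formulas: Euclidean distance for position, and the geodesic angle between unit quaternions for orientation. The helpers below are a hypothetical sketch (function names and the meters-to-centimeters conversion are our assumptions, not the repository's API):

```python
import numpy as np

def position_error_cm(pred_xyz: np.ndarray, true_xyz: np.ndarray) -> float:
    # Euclidean distance between predicted and ground-truth positions,
    # assuming both are expressed in meters.
    return float(np.linalg.norm(pred_xyz - true_xyz)) * 100.0

def quaternion_error_deg(pred_q: np.ndarray, true_q: np.ndarray) -> float:
    # Geodesic angle between two quaternions; the absolute value handles
    # the q / -q double-cover ambiguity.
    pred_q = pred_q / np.linalg.norm(pred_q)
    true_q = true_q / np.linalg.norm(true_q)
    dot = np.clip(abs(float(np.dot(pred_q, true_q))), 0.0, 1.0)
    return float(np.degrees(2.0 * np.arccos(dot)))
```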
|
|
|
|
|
## Quick Start

### Installation

```bash
pip install torch transformers pillow numpy matplotlib
```
|
|
|
|
|
### Inference

```python
import torch
from PIL import Image

from vla_flow_matching import VLM_Encoder, ImprovedFlowMatchingPolicy

# Load the trained checkpoint
device = 'cuda' if torch.cuda.is_available() else 'cpu'
checkpoint = torch.load('vla_checkpoint_best.pt', map_location=device)

vlm_encoder = VLM_Encoder().to(device)
policy = ImprovedFlowMatchingPolicy(action_dim=8, context_dim=1024, hidden_dim=512).to(device)
policy.load_state_dict(checkpoint['policy'])
policy.eval()

# Encode the observation and instruction, then sample an action
image = Image.open('workspace.jpg').resize((224, 224))
context = vlm_encoder.encode([image], ["pick up the red cube"])
action = policy.sample(context, num_samples=1, device=device)

print(f"Position: {action[0, :3].cpu().numpy()}")
print(f"Gripper: {'CLOSE' if action[0, 7] > 0 else 'OPEN'}")
```
|
|
|
|
|
### Train from Scratch (~20 minutes)

```bash
# Step 1: Generate synthetic data
python vla_flow_matching.py --mode generate_data --num_demos 1000

# Step 2: Train (takes ~20 min on a consumer GPU)
python vla_flow_matching.py --mode train --epochs 200 --batch_size 32

# Step 3: Evaluate
python vla_flow_matching.py --mode replay --checkpoint vla_checkpoint_best.pt
```
|
|
|
|
|
## Repository Structure

```
├── vla_flow_matching.py     # Complete implementation (~1000 lines)
├── vla_checkpoint_best.pt   # Trained weights (~20MB)
├── demo_data.pkl            # Training data (1000 demos)
├── replay_results.png       # Evaluation visualization
└── README.md                # This file
```
|
|
|
|
|
## What You'll Learn

This codebase teaches core VLA concepts:

1. **Vision-Language Encoding**: Using CLIP for joint image-text understanding
2. **Flow Matching**: A modern generative approach for action prediction (see the sketch after this list)
3. **Action Representation**: 8-DOF with quaternion rotations
4. **Synthetic Data Generation**: Creating training environments without physics
5. **Model Architecture**: Combining frozen backbones with trainable policies
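To make item 2 concrete, here is a minimal sketch of one flow-matching training step, assuming the common linear-interpolation formulation; the `policy(x_t, t, context)` call signature is illustrative, not the exact API in `vla_flow_matching.py`:

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(policy, actions, context):
    """One training step: regress the velocity that carries noise to data.

    actions: (B, 7) ground-truth targets (position + quaternion)
    context: (B, 1024) CLIP image+text features
    """
    noise = torch.randn_like(actions)                            # x_0 ~ N(0, I)
    t = torch.rand(actions.shape[0], 1, device=actions.device)   # t ~ U[0, 1]
    x_t = (1 - t) * noise + t * actions                          # point on the linear path
    target_velocity = actions - noise                            # d x_t / dt is constant
    pred_velocity = policy(x_t, t, context)                      # illustrative signature
    return F.mse_loss(pred_velocity, target_velocity)
```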
|
|
|
|
|
## Architecture Details

### VLM Encoder (Frozen CLIP)

- Vision: ViT-B/32 → 512-dim features
- Text: Transformer → 512-dim features
- Combined: 1024-dim context vector
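The encoder's behavior can be approximated directly with Hugging Face's `CLIPModel`; the sketch below shows the idea (concatenating the two 512-dim embeddings into the 1024-dim context), not the exact internals of `VLM_Encoder`:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def encode(images, texts):
    # Encode images and texts separately (512-dim each), then concatenate
    # into the 1024-dim context vector the policy conditions on.
    inputs = processor(text=texts, images=images, return_tensors="pt", padding=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    return torch.cat([img, txt], dim=-1)  # (B, 1024)
```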
|
|
|
|
|
### Flow Matching Policy (~2MB)

```
Context Encoder:  1024 → 512 → 128 (with LayerNorm, GELU, Dropout)
Time Embedding:   sinusoidal, 128-dim
Action Encoder:   7D → 128
Velocity Network: 384 → 512 → 256 → 7
```
|
|
|
|
|
### Gripper Classifier

```
Context → 512 → 256 → 2 (softmax)
```
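An equivalent module is only a few lines; this is a sketch assuming the layer widths shown above and the 1024-dim context:

```python
import torch.nn as nn

class GripperClassifier(nn.Module):
    """Binary open/close head over the 1024-dim CLIP context."""

    def __init__(self, context_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(context_dim, 512), nn.GELU(),
            nn.Linear(512, 256), nn.GELU(),
            nn.Linear(256, 2),  # logits; softmax is applied in the loss
        )

    def forward(self, context):
        return self.net(context)
```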
|
|
|
|
|
### Training Configuration

```yaml
Epochs: 200
Batch Size: 32
Learning Rate: 1e-4 (cosine decay to 1e-5)
Optimizer: AdamW (weight_decay=1e-4)
Flow Steps: 200 (Euler integration)
```
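At inference time the 200 flow steps amount to fixed-step Euler integration of the learned velocity field, from Gaussian noise to an action. A minimal sketch of what `policy.sample` does (the velocity-network call is illustrative):

```python
import torch

@torch.no_grad()
def sample(policy, context, steps=200, action_dim=7):
    # Integrate dx/dt = v(x, t, context) from t=0 (noise) to t=1 (action).
    x = torch.randn(context.shape[0], action_dim, device=context.device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((context.shape[0], 1), i * dt, device=context.device)
        x = x + policy(x, t, context) * dt  # Euler step
    return x  # predicted [x, y, z, qx, qy, qz, qw]
```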
|
|
|
|
|
## Training Data

The synthetic environment generates pick-and-place demonstrations:

- **1000 demonstrations** with diverse object positions
- **6 cube colors**: red, blue, green, yellow, purple, orange
- **24 instruction templates**: "pick up the [color] cube", "grasp the [color] block", etc.
- **40cm × 40cm workspace** with position and orientation variations
- **2D projection with 3D visual effects** (shadows, shading)
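A single demonstration can be as simple as a dict pairing an instruction with a target action. The structure below is a hypothetical sketch of the idea, not the exact `demo_data.pkl` schema:

```python
import random
import numpy as np

COLORS = ["red", "blue", "green", "yellow", "purple", "orange"]
TEMPLATES = ["pick up the {} cube", "grasp the {} block"]  # 2 of the 24 templates

def make_demo():
    color = random.choice(COLORS)
    # Random cube position inside the 40cm x 40cm workspace (meters).
    x, y = np.random.uniform(-0.2, 0.2, size=2)
    action = np.array([x, y, 0.05,           # grasp position
                       0.0, 0.0, 0.0, 1.0,   # identity quaternion (top-down)
                       1.0])                 # gripper: close
    return {
        "instruction": random.choice(TEMPLATES).format(color),
        "action": action,
        "color": color,  # used when rendering the matching image
    }
```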
|
|
|
|
|
## Extending This Work

### Ideas for Students/Researchers

1. **Add more objects**: Extend beyond cubes to spheres, cylinders
2. **Multi-step tasks**: Chain pick → place actions
3. **Real images**: Fine-tune on real robot data
4. **Better orientation**: Improve quaternion prediction accuracy
5. **Action chunking**: Predict action sequences instead of single steps
6. **Physics simulation**: Replace 2D rendering with PyBullet/MuJoCo
|
|
|
|
|
### Fine-tuning for Real Robots

```bash
# Collect 10-50 real demonstrations, then:
python vla_flow_matching.py --mode finetune \
    --checkpoint vla_checkpoint_best.pt \
    --data_path real_robot_demos.pkl \
    --epochs 30 --lr 1e-5
```
|
|
|
|
|
## Limitations

This is an **educational model** with intentional simplifications:

- 2D synthetic environment (no physics)
- Single-object scenes only
- Limited orientation precision
- Not suitable for direct real-world deployment
- No temporal/sequential reasoning

**Do NOT use for**: safety-critical applications, precision assembly, or autonomous operation without extensive testing.
|
|
|
|
|
## Acknowledgments

Built with:

- Transformers (CLIP)
- PyTorch
- NumPy & Matplotlib
|
|
|
|
|
Inspired by:

- [Flow Matching](https://arxiv.org/abs/2210.02747) (Lipman et al., 2023)
- [CLIP](https://arxiv.org/abs/2103.00020) (Radford et al., 2021)
- [RT-1](https://arxiv.org/abs/2212.06817) (Brohan et al., 2022)
- [OpenVLA](https://openvla.github.io/) (Kim et al., 2024)
|
|
|
|
|
## Citation

```bibtex
@misc{minimal-vla-2025,
  title={Minimal VLA: A Lightweight Vision-Language-Action Model for Education},
  author={LeTau},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/your-username/minimal-vla}
}
```
|
|
|
|
|
## License

MIT License: feel free to use, modify, and share!
|
|
|
|
|
--- |
|
|
|
|
|
**Questions?** Open an issue or reach out. Happy learning! 🤖
|
|
|
|
|
|