---
license: apache-2.0
tags:
- vision-language-action
- mobile-robot
- kosmos-2b
- robotics
- obstacle-avoidance
datasets:
- mobile-vla-dataset
language:
- en
- ko
metrics:
- mae
- r2_score
library_name: transformers
pipeline_tag: robotics
---

# Mobile VLA: Vision-Language-Action Model for Mobile Robots

## Model Description

Mobile VLA is a Vision-Language-Action model for mobile robots, built on Kosmos-2B.
It predicts continuous 3D actions in obstacle-avoidance scenarios.

### Key Features

- **Vision-Language-Action**: Takes an image and a text instruction and predicts robot actions
- **3D continuous control**: Continuous action space of the form `[linear_x, linear_y, angular_z]`
- **Obstacle avoidance**: Learns left/right avoidance strategies in 1-box and 2-box scenarios
- **Real-time processing**: Fast inference through efficient vision-only processing

### Technical Specifications

- **Backbone model**: microsoft/kosmos-2-patch14-224
- **Input**: RGB image (224x224) + text instruction
- **Output**: 3D continuous action vector
- **Training objective**: Huber loss regression
- **Data**: 72 real-robot episodes

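
The Huber loss named in the specifications can be sketched in plain Python as follows. This is a minimal illustration, not the training code of this model: the `delta` value of 1.0 and the sample target vector are assumptions, since the model card does not state them.

```python
def huber_loss(pred, target, delta=1.0):
    """Mean Huber loss over two equal-length sequences of floats.

    delta is an assumed transition point; the model card does not give one.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    total = 0.0
    for p, t in zip(pred, target):
        err = abs(p - t)
        if err <= delta:
            total += 0.5 * err * err              # quadratic for small errors
        else:
            total += delta * (err - 0.5 * delta)  # linear for large errors
    return total / len(pred)


# One [linear_x, linear_y, angular_z] prediction against a hypothetical target
pred = [0.5, 0.0, 0.1]
target = [1.0, 2.0, 0.1]
loss = huber_loss(pred, target)
print(f"Huber loss: {loss:.4f}")
```

The linear branch keeps a single large error (here on `linear_y`) from dominating the loss the way it would under squared error. In PyTorch, `torch.nn.HuberLoss` computes the same quantity and also takes a `delta` argument.
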
## Performance Metrics

### Overall Performance

- **Overall MAE**: 0.285
- **Threshold accuracy (0.1)**: 37.5%

### Per-Action Performance

| Action | MAE | R² Score | Description |
|--------|-----|----------|-------------|
| linear_x | 0.243 | 0.354 | Forward/backward (excellent) |
| linear_y | 0.550 | 0.293 | Lateral movement (fair) |
| angular_z | 0.062 | 0.000 | Rotation (low) |

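
The three metrics reported above can be computed along the following lines. This is a sketch under stated assumptions: the model card does not define "threshold accuracy", so it is taken here to be the fraction of predictions whose absolute error is at most the threshold, and the sample values are hypothetical rather than taken from the evaluation set.

```python
def mae(pred, target):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def threshold_accuracy(pred, target, threshold=0.1):
    """Fraction of predictions within `threshold` of the target (assumed definition)."""
    hits = sum(1 for p, t in zip(pred, target) if abs(p - t) <= threshold)
    return hits / len(target)


def r2_score(pred, target):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(target) / len(target)
    ss_res = sum((t - p) ** 2 for p, t in zip(pred, target))
    ss_tot = sum((t - mean_t) ** 2 for t in target)
    if ss_tot == 0.0:
        return 0.0  # convention chosen for this sketch when targets are constant
    return 1.0 - ss_res / ss_tot


# Hypothetical targets with little variation, and a predictor that always
# outputs the target mean
target = [0.0, 0.1, 0.0, 0.1]
constant_pred = [0.05] * len(target)

print("MAE:", mae(constant_pred, target))
print("Threshold accuracy:", threshold_accuracy(constant_pred, target))
print("R2:", r2_score(constant_pred, target))
```

A predictor that always outputs the target mean has an R² of 0 by construction, yet its MAE stays small when the targets barely vary. That is one way a low MAE and an R² of 0.000 can appear together, as in the `angular_z` row; whether it is the cause for this model is not stated in the card.
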
### Per-Scenario Performance

| Scenario | MAE | Grade | Description |
|----------|-----|-------|-------------|
| 1box_right_vertical | 0.217 | B+ | Excellent |
| 1box_left_horizontal | 0.303 | B | Good |
| 2box_left_vertical | 0.322 | B | Good |
| 1box_left_vertical | 0.337 | B- | Fair |

## Usage

### Installation

```bash
pip install transformers torch pillow numpy
```

### Basic Usage

```python
from mobile_vla import MobileVLAModel
from PIL import Image
import torch

# Load the model
model = MobileVLAModel.from_pretrained("minuum/mobile-vla")

# Prepare the image and the task instruction
image = Image.open("robot_camera.jpg").convert("RGB")
task = "Navigate around obstacles to track the target cup"

# Predict without tracking gradients
with torch.no_grad():
    actions = model.predict(image, task)

print(f"Predicted actions: {actions}")
# Output: [linear_x, linear_y, angular_z]
```

### Advanced Usage

```python
# Continues from the basic example: `model`, `task`, and `Image` are already defined.

# Batch processing over a sequence of frames
images = [Image.open(f"frame_{i}.jpg") for i in range(8)]
actions = model.predict_sequence(images, task)

# Real-time control
# `camera_stream` and `robot` are placeholders for your own camera iterator and robot interface.
for frame in camera_stream:
    action = model.predict(frame, task)
    robot.execute(action)
```

## Model Architecture

```
[RGB Images] → [Kosmos-2B Vision] → [Action Head] → [3D Actions]
     ↓                  ↓                 ↓              ↓
  224x224        Image Features      Regression      [x, y, θ]
```

### Core Components

1. **Kosmos-2B Vision Model**: Extracts image features
2. **Action Head**: 3D regression head (512 → 3*chunk_size)
3. **Window/Chunk**: Observes 8 frames → predicts 2 frames

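
Because the action head emits `3*chunk_size` values per window, its flat output has to be split back into one action per predicted frame. The sketch below shows one way to do that. It assumes frame-major ordering (the first three values belong to the first predicted frame), which the model card does not confirm; `split_action_chunk` and the sample values are hypothetical.

```python
ACTION_DIM = 3  # [linear_x, linear_y, angular_z]
CHUNK_SIZE = 2  # frames predicted per window, per the component list above


def split_action_chunk(flat, chunk_size=CHUNK_SIZE, action_dim=ACTION_DIM):
    """Split a flat head output of length action_dim * chunk_size into per-frame actions.

    Assumes frame-major ordering: values 0..action_dim-1 belong to frame 0.
    """
    expected = action_dim * chunk_size
    if len(flat) != expected:
        raise ValueError(f"expected {expected} values, got {len(flat)}")
    return [
        list(flat[i * action_dim:(i + 1) * action_dim])
        for i in range(chunk_size)
    ]


# Hypothetical flat output for a 2-frame chunk
flat_output = [0.5, 0.0, 0.1, 0.4, -0.2, 0.0]
chunk = split_action_chunk(flat_output)
for step, (linear_x, linear_y, angular_z) in enumerate(chunk):
    print(step, linear_x, linear_y, angular_z)
```
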
## Comparison with RoboVLMs

| Item | RoboVLMs | Mobile VLA |
|------|----------|------------|
| **Data requirement** | Millions of demos | 72 episodes |
| **Action space** | 7-DOF Discrete | 3D Continuous |
| **Inference speed** | Complex | Fast |
| **Specialization** | General-purpose manipulation | Mobile Robot |
| **Evaluation** | Success rate | Multi-dimensional regression metrics |

## Key Improvements

- **Data efficiency**: Practical performance with 1000x less data
- **Real-time performance**: Fast inference through vision-only processing
- **Continuous control**: Precise 3D action prediction
- **Scenario specialization**: Optimized specifically for obstacle avoidance

## Training Data

- **Number of episodes**: 72
- **Scenarios**: 1box/2box × left/right × vertical/horizontal
- **Actions**: Continuous `[linear_x, linear_y, angular_z]` values
- **Images**: RGB from a real robot camera (224x224)

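
The scenario grid above is a product of three factors, which gives eight combinations. A short sketch of the enumeration follows. The underscore-joined naming is an assumption inferred from labels such as `1box_right_vertical` used elsewhere in this card, and `scenario_names` is a hypothetical helper.

```python
from itertools import product

BOX_COUNTS = ("1box", "2box")
SIDES = ("left", "right")
ORIENTATIONS = ("vertical", "horizontal")


def scenario_names():
    """Return every box-count / side / orientation combination as a label."""
    return ["_".join(parts) for parts in product(BOX_COUNTS, SIDES, ORIENTATIONS)]


names = scenario_names()
for name in names:
    print(name)
```

If the 72 episodes were spread evenly over these eight scenarios, each would have nine episodes, but the card does not state the actual split.
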
## Research Background

This model is a research effort that keeps the Window/Chunk mechanism of RoboVLMs while adding features specialized for mobile robots:

1. **Window/Chunk retained**: 8-frame observation → 2-frame prediction structure
2. **Kosmos-2B integration**: Uses a vision-language backbone
3. **Continuous control**: Switch from a discrete to a continuous action space
4. **Real robot data**: Real collected data in HDF5 format

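
The Window/Chunk structure in item 1 can be illustrated by slicing one episode into training pairs. This is a sketch only, not the RoboVLMs implementation. It assumes that frames and actions are aligned per time step and that the 2-frame target chunk starts right after the 8-frame observation window; the card specifies neither, and `window_chunk_pairs` is a hypothetical helper.

```python
WINDOW_SIZE = 8  # frames observed
CHUNK_SIZE = 2   # frames of actions predicted


def window_chunk_pairs(frames, actions, window_size=WINDOW_SIZE, chunk_size=CHUNK_SIZE):
    """Yield (observed_frames, target_actions) pairs from one episode.

    Assumes frames[i] and actions[i] share a time step and that the target
    chunk immediately follows the observation window.
    """
    if len(frames) != len(actions):
        raise ValueError("frames and actions must have the same length")
    last_start = len(frames) - window_size - chunk_size
    for start in range(last_start + 1):
        observed = frames[start:start + window_size]
        targets = actions[start + window_size:start + window_size + chunk_size]
        yield observed, targets


# A hypothetical 12-step episode: integers stand in for images
frames = list(range(12))
actions = [[0.1 * t, 0.0, 0.0] for t in range(12)]
pairs = list(window_chunk_pairs(frames, actions))
print(f"{len(pairs)} training pairs from {len(frames)} steps")
```

Episodes shorter than `window_size + chunk_size` steps yield no pairs under this scheme.
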
## Citation

```bibtex
@misc{mobile_vla_2024,
    title={Mobile VLA: Vision-Language-Action Model for Mobile Robot Navigation},
    author={Mobile VLA Team},
    year={2024},
    publisher={HuggingFace},
    url={https://huggingface.co/minuum/mobile-vla}
}
```

## Contributing

This model was developed on top of the RoboVLMs framework and is released to support the mobile robot community.

## Contact

- **Issues**: [GitHub Issues](https://github.com/minuum/vla/issues)
- **Discussions**: [HuggingFace Discussions](https://huggingface.co/minuum/mobile-vla/discussions)

---

*Generated on 2025-08-21*