---
license: apache-2.0
tags:
- vision-language-action
- mobile-robot
- kosmos-2b
- robotics
- obstacle-avoidance
datasets:
- mobile-vla-dataset
language:
- en
- ko
metrics:
- mae
- r2_score
library_name: transformers
pipeline_tag: robotics
---
# 🚀 Mobile VLA: Vision-Language-Action Model for Mobile Robots
## 📋 Model Description
Mobile VLA is a Vision-Language-Action model for mobile robots built on Kosmos-2B.
It performs continuous 3D action prediction in obstacle-avoidance scenarios.
### 🎯 Key Features
- **Vision-Language-Action**: predicts robot actions from an image and a text instruction
- **Continuous 3D control**: continuous action space of the form `[linear_x, linear_y, angular_z]`
- **Obstacle avoidance**: learns left/right avoidance strategies in 1-box and 2-box scenarios
- **Real-time operation**: fast inference through efficient vision-only processing
### 🔧 Technical Specifications
- **Backbone**: microsoft/kosmos-2-patch14-224
- **Input**: RGB image (224x224) + text instruction
- **Output**: continuous 3D action vector
- **Training objective**: Huber-loss regression
- **Data**: 72 real robot episodes
## 📊 Performance
### Overall
- **Overall MAE**: 0.285
- **Threshold accuracy (0.1)**: 37.5%
### Per-Action Performance
| Action | MAE | R² Score | Notes |
|--------|-----|----------|-------|
| linear_x | 0.243 | 0.354 | forward/backward (good) |
| linear_y | 0.550 | 0.293 | lateral motion (fair) |
| angular_z | 0.062 | 0.000 | rotation (low) |
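The metrics above can be reproduced per action dimension as in the sketch below; `action_metrics` is a hypothetical helper, not part of the released package, and the all-dimensions-within-threshold definition of threshold accuracy is an assumption:

```python
import numpy as np

def action_metrics(pred, target, threshold=0.1):
    """Per-dimension MAE and R^2, plus overall threshold accuracy.

    pred, target: arrays of shape (N, 3) -> [linear_x, linear_y, angular_z].
    Hypothetical helper mirroring the metrics reported in this card.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    # mean absolute error per action dimension
    mae = np.mean(np.abs(pred - target), axis=0)
    # coefficient of determination per dimension (guard against zero variance)
    ss_res = np.sum((target - pred) ** 2, axis=0)
    ss_tot = np.sum((target - target.mean(axis=0)) ** 2, axis=0)
    r2 = 1.0 - ss_res / np.where(ss_tot == 0, 1.0, ss_tot)
    # a prediction counts as accurate if every dimension is within the threshold
    acc = np.mean(np.all(np.abs(pred - target) <= threshold, axis=1))
    return mae, r2, acc

pred = np.array([[0.10, 0.00, 0.00], [0.20, 0.05, 0.01]])
mae, r2, acc = action_metrics(pred, pred)  # perfect prediction: MAE 0, accuracy 1.0
```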
### Per-Scenario Performance
| Scenario | MAE | Grade | Notes |
|----------|-----|-------|-------|
| 1box_right_vertical | 0.217 | B+ | good |
| 1box_left_horizontal | 0.303 | B | fair |
| 2box_left_vertical | 0.322 | B | fair |
| 1box_left_vertical | 0.337 | B- | average |
## 🚀 Usage
### Installation
```bash
pip install transformers torch pillow numpy
```
### Basic Usage
```python
from mobile_vla import MobileVLAModel
from PIL import Image
import torch

# Load the model
model = MobileVLAModel.from_pretrained("minuum/mobile-vla")

# Prepare an image and a task instruction
image = Image.open("robot_camera.jpg")
task = "Navigate around obstacles to track the target cup"

# Predict
with torch.no_grad():
    actions = model.predict(image, task)
    print(f"Predicted actions: {actions}")  # [linear_x, linear_y, angular_z]
```
### Advanced Usage
```python
# Batch processing over an 8-frame observation window
images = [Image.open(f"frame_{i}.jpg") for i in range(8)]
actions = model.predict_sequence(images, task)

# Real-time control loop
for frame in camera_stream:
    action = model.predict(frame, task)
    robot.execute(action)
```
## ๐Ÿ—๏ธ ๋ชจ๋ธ ์•„ํ‚คํ…์ฒ˜
```
[RGB Images] → [Kosmos-2B Vision] → [Action Head] → [3D Actions]
      ↓                ↓                  ↓              ↓
   224x224       Image Features       Regression     [x, y, θ]
```
### Core Components
1. **Kosmos-2B Vision Model**: image feature extraction
2. **Action Head**: 3D regression head (512 → 3 × chunk_size)
3. **Window/Chunk**: 8-frame observation → 2-frame action prediction
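The action head and chunking above can be sketched in PyTorch. This is an illustrative assumption, not the released implementation: only the 512 → 3 × chunk_size mapping and the 8-frame window / 2-frame chunk come from this card, while the hidden width of 256 and the use of a two-layer MLP are made up for the sketch:

```python
import torch
import torch.nn as nn

class ActionHead(nn.Module):
    """Hypothetical regression head: pooled vision features -> chunked actions.

    Maps a 512-d feature vector (pooled from an 8-frame observation window)
    to chunk_size future action steps of [linear_x, linear_y, angular_z].
    """
    def __init__(self, feat_dim=512, chunk_size=2, action_dim=3):
        super().__init__()
        self.chunk_size = chunk_size
        self.action_dim = action_dim
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 256),   # hidden width is an assumption
            nn.ReLU(),
            nn.Linear(256, action_dim * chunk_size),
        )

    def forward(self, feats):
        # feats: (batch, feat_dim) -> (batch, chunk_size, action_dim)
        out = self.mlp(feats)
        return out.view(-1, self.chunk_size, self.action_dim)

head = ActionHead()
feats = torch.randn(4, 512)   # e.g. pooled features from an 8-frame window
actions = head(feats)
print(actions.shape)          # torch.Size([4, 2, 3])
```

Training such a head against ground-truth actions with `nn.HuberLoss()` matches the Huber-loss regression objective stated above.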
## 📈 Comparison with RoboVLMs
| Aspect | RoboVLMs | Mobile VLA |
|--------|----------|------------|
| **Data requirements** | millions of demos | 72 episodes |
| **Action space** | 7-DOF discrete | 3D continuous |
| **Inference speed** | heavyweight | fast |
| **Specialization** | general manipulation | mobile robots |
| **Evaluation** | success rate | multi-dimensional regression metrics |
## 🎯 Key Improvements
- **Data efficiency**: practical performance with roughly 1000× less data
- **Real-time performance**: fast inference via vision-only processing
- **Continuous control**: precise 3D action prediction
- **Scenario specialization**: optimized specifically for obstacle avoidance
## 📚 Training Data
- **Episodes**: 72
- **Scenarios**: 1box/2box × left/right × vertical/horizontal
- **Actions**: continuous `[linear_x, linear_y, angular_z]` values
- **Images**: real robot camera RGB (224x224)
## 🔬 Research Background
This model keeps the Window/Chunk mechanism of RoboVLMs while adding mobile-robot-specific capabilities:
1. **Window/Chunk retained**: 8-frame observation → 2-frame prediction structure
2. **Kosmos-2B integration**: leverages a vision-language backbone
3. **Continuous control**: switch from a discrete to a continuous action space
4. **Real robot data**: data collected on a real robot, stored in HDF5 format
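Reading one HDF5 episode might look like the sketch below. The file name, dataset keys (`images`, `actions`), and the `scenario` attribute are assumptions for illustration, since the card only states that episodes are stored as HDF5:

```python
import os
import tempfile

import h5py
import numpy as np

# Build a dummy episode with the assumed layout (8 frames, 224x224 RGB, 3D actions).
path = os.path.join(tempfile.mkdtemp(), "episode_000.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("images", data=np.zeros((8, 224, 224, 3), dtype=np.uint8))
    f.create_dataset("actions", data=np.zeros((8, 3), dtype=np.float32))
    f.attrs["scenario"] = "1box_left_vertical"

# Load it back the way a training loader might.
with h5py.File(path, "r") as f:
    images = f["images"][:]    # (T, 224, 224, 3) RGB frames
    actions = f["actions"][:]  # (T, 3) [linear_x, linear_y, angular_z]
    scenario = f.attrs["scenario"]
```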
## 📄 Citation
```bibtex
@misc{mobile_vla_2024,
  title={Mobile VLA: Vision-Language-Action Model for Mobile Robot Navigation},
  author={Mobile VLA Team},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/minuum/mobile-vla}
}
```
## 🤝 Contributing
This model was developed on top of the RoboVLMs framework and is released openly to support the mobile-robot community.
## 📞 Contact
- **Issues**: [GitHub Issues](https://github.com/minuum/vla/issues)
- **Discussions**: [HuggingFace Discussions](https://huggingface.co/minuum/mobile-vla/discussions)
---
*Generated on 2025-08-21*