File size: 5,050 Bytes

ca6d175

---
license: apache-2.0
tags:
- vision-language-action
- mobile-robot
- kosmos-2b
- robotics
- obstacle-avoidance
datasets:
- mobile-vla-dataset
language:
- en
- ko
metrics:
- mae
- r2_score
library_name: transformers
pipeline_tag: robotics
---

# 🚀 Mobile VLA: Vision-Language-Action Model for Mobile Robots

## 📋 Model Description

Mobile VLA는 Kosmos-2B를 기반으로 한 Mobile Robot 전용 Vision-Language-Action 모델입니다. 
장애물 회피 시나리오에서 연속적인 3D 액션 예측을 수행합니다.

### 🎯 핵심 기능

- **Vision-Language-Action**: 이미지와 텍스트 지시사항을 받아 로봇 액션 예측
- **3D 연속 제어**: `[linear_x, linear_y, angular_z]` 형태의 연속 액션 공간
- **장애물 회피**: 1-box, 2-box 시나리오에서 좌우 회피 전략 학습
- **실시간 처리**: 효율적인 vision-only 처리로 빠른 추론

### 🔧 기술 사양

- **백본 모델**: microsoft/kosmos-2-patch14-224
- **입력**: RGB 이미지 (224x224) + 텍스트 지시사항
- **출력**: 3D 연속 액션 벡터
- **학습 방식**: Huber Loss 기반 회귀
- **데이터**: 72개 실제 로봇 에피소드

## 📊 성능 지표

### 전체 성능
- **전체 MAE**: 0.285
- **임계값 정확도 (0.1)**: 37.5%

### 액션별 성능
| 액션 | MAE | R² Score | 설명 |
|------|-----|----------|------|
| linear_x | 0.243 | 0.354 | 전진/후진 (우수) |
| linear_y | 0.550 | 0.293 | 좌우 이동 (보통) |
| angular_z | 0.062 | 0.000 | 회전 (낮음) |

### 시나리오별 성능
| 시나리오 | MAE | 등급 | 설명 |
|----------|-----|------|------|
| 1box_right_vertical | 0.217 | B+ | 우수 |
| 1box_left_horizontal | 0.303 | B | 양호 |
| 2box_left_vertical | 0.322 | B | 양호 |
| 1box_left_vertical | 0.337 | B- | 보통 |

## 🚀 사용 방법

### 설치
```bash
pip install transformers torch pillow numpy
```

### 기본 사용법
```python
from mobile_vla import MobileVLAModel, MobileVLATrainer
from PIL import Image
import torch

# 모델 로드
model = MobileVLAModel.from_pretrained("minuum/mobile-vla")

# 이미지와 태스크 준비
image = Image.open("robot_camera.jpg")
task = "Navigate around obstacles to track the target cup"

# 예측
with torch.no_grad():
    actions = model.predict(image, task)
    
print(f"Predicted actions: {actions}")
# 출력: [linear_x, linear_y, angular_z]
```

### 고급 사용법
```python
# 배치 처리
images = [Image.open(f"frame_{i}.jpg") for i in range(8)]
actions = model.predict_sequence(images, task)

# 실시간 제어
for frame in camera_stream:
    action = model.predict(frame, task)
    robot.execute(action)
```

## 🏗️ 모델 아키텍처

```
[RGB Images] → [Kosmos-2B Vision] → [Action Head] → [3D Actions]
     ↓              ↓                    ↓             ↓
   224x224    Image Features         Regression    [x, y, θ]
```

### 핵심 컴포넌트
1. **Kosmos-2B Vision Model**: 이미지 특징 추출
2. **Action Head**: 3D 회귀 헤드 (512 → 3*chunk_size)
3. **Window/Chunk**: 8프레임 관찰 → 2프레임 예측

## 📈 RoboVLMs와의 비교

| 항목 | RoboVLMs | Mobile VLA |
|------|----------|------------|
| **데이터 요구량** | 수백만 데모 | 72 에피소드 |
| **액션 공간** | 7-DOF Discrete | 3D Continuous |
| **추론 속도** | 복합적 | 빠름 |
| **특화 분야** | 범용 Manipulation | Mobile Robot |
| **평가 방식** | 성공률 | 다차원 회귀 지표 |

## 🎯 주요 개선사항

- **데이터 효율성**: 1000배 적은 데이터로 실용적 성능
- **실시간 성능**: Vision-only 처리로 빠른 추론
- **연속 제어**: 정밀한 3D 액션 예측
- **시나리오 특화**: 장애물 회피 전용 최적화

## 📚 학습 데이터

- **에피소드 수**: 72개
- **시나리오**: 1box/2box × left/right × vertical/horizontal
- **액션**: [linear_x, linear_y, angular_z] 연속 값
- **이미지**: 실제 로봇 카메라 RGB (224x224)

## 🔬 연구 배경

이 모델은 RoboVLMs의 Window/Chunk 메커니즘을 유지하면서 Mobile Robot에 특화된 기능을 추가한 연구입니다:

1. **Window/Chunk 유지**: 8프레임 관찰 → 2프레임 예측 구조
2. **Kosmos-2B 통합**: Vision-Language 백본 활용
3. **연속 제어**: Discrete → Continuous 액션 공간 전환
4. **실제 로봇 데이터**: HDF5 형태의 실제 수집 데이터

## 📄 인용

```bibtex
@misc{mobile_vla_2024,
  title={Mobile VLA: Vision-Language-Action Model for Mobile Robot Navigation},
  author={Mobile VLA Team},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/minuum/mobile-vla}
}
```

## 🤝 기여

이 모델은 RoboVLMs 프레임워크를 기반으로 개발되었으며, Mobile Robot 커뮤니티의 발전을 위해 공개됩니다.

## 📞 연락처

- **Issues**: [GitHub Issues](https://github.com/minuum/vla/issues)
- **Discussions**: [HuggingFace Discussions](https://huggingface.co/minuum/mobile-vla/discussions)

---
*Generated on 2025-08-21*