minium committed on
Commit ca6d175 · verified · 1 Parent(s): 83cd5ec

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +172 -0
README.md ADDED
@@ -0,0 +1,172 @@
+ ---
+ license: apache-2.0
+ tags:
+ - vision-language-action
+ - mobile-robot
+ - kosmos-2b
+ - robotics
+ - obstacle-avoidance
+ datasets:
+ - mobile-vla-dataset
+ language:
+ - en
+ - ko
+ metrics:
+ - mae
+ - r2_score
+ library_name: transformers
+ pipeline_tag: robotics
+ ---
+
+ # 🚀 Mobile VLA: Vision-Language-Action Model for Mobile Robots
+
+ ## 📋 Model Description
+
+ Mobile VLA is a Vision-Language-Action model for mobile robots, built on Kosmos-2B.
+ It predicts continuous 3D actions in obstacle-avoidance scenarios.
+
+ ### 🎯 Key Features
+
+ - **Vision-Language-Action**: takes an image and a text instruction and predicts robot actions
+ - **3D continuous control**: continuous action space of the form `[linear_x, linear_y, angular_z]`
+ - **Obstacle avoidance**: learns left/right avoidance strategies in 1-box and 2-box scenarios
+ - **Real-time processing**: fast inference through efficient vision-only processing
+
+ ### 🔧 Technical Specifications
+
+ - **Backbone model**: microsoft/kosmos-2-patch14-224
+ - **Input**: RGB image (224x224) + text instruction
+ - **Output**: 3D continuous action vector
+ - **Training method**: regression with Huber loss
+ - **Data**: 72 real-robot episodes
+
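The training method listed above is regression with a Huber loss. As a rough illustration only (this is not the project's training code), the sketch below computes the Huber loss for a single 3D action in plain Python. The helper name `huber_loss` and the threshold `delta=1.0` are assumptions; the README does not state which value was used.

```python
def huber_loss(pred, target, delta=1.0):
    """Mean Huber loss over paired action components.

    delta is an assumed threshold; the model card does not specify it.
    """
    total = 0.0
    for p, t in zip(pred, target):
        r = abs(p - t)
        if r <= delta:
            total += 0.5 * r * r                # quadratic for small errors
        else:
            total += delta * (r - 0.5 * delta)  # linear for large errors
    return total / len(pred)


# Hypothetical [linear_x, linear_y, angular_z] prediction and label.
pred = [1.0, 0.0, 0.0]
target = [1.0, 2.0, 0.5]
print(huber_loss(pred, target))
```

Compared with a squared-error loss, the linear branch limits the influence of occasional large action errors, which may matter with only 72 episodes of data.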
+ ## 📊 Performance Metrics
+
+ ### Overall Performance
+ - **Overall MAE**: 0.285
+ - **Threshold accuracy (0.1)**: 37.5%
+
+ ### Per-Action Performance
+ | Action | MAE | R² Score | Description |
+ |------|-----|----------|------|
+ | linear_x | 0.243 | 0.354 | Forward/backward (excellent) |
+ | linear_y | 0.550 | 0.293 | Lateral movement (fair) |
+ | angular_z | 0.062 | 0.000 | Rotation (low) |
+
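For reference, the overall MAE of 0.285 equals the unweighted mean of the three per-axis MAEs in the table. The sketch below checks that arithmetic and shows how MAE, R², and a threshold accuracy could be computed from per-frame values. The function names and sample values are hypothetical, and the threshold-accuracy definition (share of predictions whose absolute error is below 0.1) is an assumption, since the README does not define it.

```python
def mae(pred, target):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def r2_score(pred, target):
    """Coefficient of determination for one action axis."""
    mean_t = sum(target) / len(target)
    ss_res = sum((t - p) ** 2 for p, t in zip(pred, target))
    ss_tot = sum((t - mean_t) ** 2 for t in target)
    # Convention assumed here: report 0.0 when the targets never vary.
    return 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0


def threshold_accuracy(pred, target, threshold=0.1):
    """Fraction of predictions within `threshold` of the target (assumed definition)."""
    hits = sum(abs(p - t) < threshold for p, t in zip(pred, target))
    return hits / len(pred)


# Per-axis MAE values copied from the table above.
per_axis_mae = {"linear_x": 0.243, "linear_y": 0.550, "angular_z": 0.062}
overall_mae = sum(per_axis_mae.values()) / len(per_axis_mae)
print(round(overall_mae, 3))  # 0.285

# Made-up values for one axis, only to exercise the helpers.
pred = [1.0, 0.5, 0.0, 0.25]
target = [1.0, 1.0, 0.0, 0.0]
print(mae(pred, target), r2_score(pred, target), threshold_accuracy(pred, target))
```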
+ ### Per-Scenario Performance
+ | Scenario | MAE | Grade | Description |
+ |----------|-----|------|------|
+ | 1box_right_vertical | 0.217 | B+ | Excellent |
+ | 1box_left_horizontal | 0.303 | B | Good |
+ | 2box_left_vertical | 0.322 | B | Good |
+ | 1box_left_vertical | 0.337 | B- | Fair |
+
+ ## 🚀 Usage
+
+ ### Installation
+ ```bash
+ pip install transformers torch pillow numpy
+ ```
+
+ ### Basic Usage
+ ```python
+ from mobile_vla import MobileVLAModel, MobileVLATrainer
+ from PIL import Image
+ import torch
+
+ # Load the model
+ model = MobileVLAModel.from_pretrained("minuum/mobile-vla")
+
+ # Prepare the image and the task instruction
+ image = Image.open("robot_camera.jpg")
+ task = "Navigate around obstacles to track the target cup"
+
+ # Predict without tracking gradients
+ with torch.no_grad():
+     actions = model.predict(image, task)
+
+ print(f"Predicted actions: {actions}")
+ # Output: [linear_x, linear_y, angular_z]
+ ```
+
+ ### Advanced Usage
+ ```python
+ # Batch processing (`model` and `task` come from the basic example above)
+ images = [Image.open(f"frame_{i}.jpg") for i in range(8)]
+ actions = model.predict_sequence(images, task)
+
+ # Real-time control (`camera_stream` and `robot` are placeholders you provide)
+ for frame in camera_stream:
+     action = model.predict(frame, task)
+     robot.execute(action)
+ ```
+
+ ## 🏗️ Model Architecture
+
+ ```
+ [RGB Images] → [Kosmos-2B Vision] → [Action Head] → [3D Actions]
+      ↓                 ↓                  ↓              ↓
+   224x224        Image Features      Regression      [x, y, θ]
+ ```
+
+ ### Key Components
+ 1. **Kosmos-2B Vision Model**: extracts image features
+ 2. **Action Head**: 3D regression head (512 → 3*chunk_size)
+ 3. **Window/Chunk**: observes 8 frames → predicts 2 frames
+
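The action head maps a 512-dimensional feature to `3 * chunk_size` values, so with a 2-frame chunk it emits 6 numbers. The sketch below shows one way those values could be regrouped into per-frame `[linear_x, linear_y, angular_z]` actions. The frame-major ordering and the helper name `unflatten_actions` are assumptions, not something the model card confirms.

```python
ACTION_DIM = 3  # [linear_x, linear_y, angular_z]


def unflatten_actions(flat, chunk_size=2):
    """Split a flat head output into chunk_size actions of 3 values each.

    Assumes frame-major ordering: all values for frame 0, then frame 1, ...
    """
    expected = ACTION_DIM * chunk_size
    if len(flat) != expected:
        raise ValueError(f"expected {expected} values, got {len(flat)}")
    return [flat[i * ACTION_DIM:(i + 1) * ACTION_DIM] for i in range(chunk_size)]


head_output = [0.5, 0.0, 0.1, 0.4, 0.2, 0.0]  # made-up values
for step, action in enumerate(unflatten_actions(head_output)):
    print(step, action)
```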
+ ## 📈 Comparison with RoboVLMs
+
+ | Item | RoboVLMs | Mobile VLA |
+ |------|----------|------------|
+ | **Data requirement** | Millions of demos | 72 episodes |
+ | **Action space** | 7-DOF Discrete | 3D Continuous |
+ | **Inference speed** | Complex | Fast |
+ | **Specialization** | General-purpose manipulation | Mobile robot |
+ | **Evaluation method** | Success rate | Multi-dimensional regression metrics |
+
+ ## 🎯 Key Improvements
+
+ - **Data efficiency**: practical performance with 1000x less data
+ - **Real-time performance**: fast inference through vision-only processing
+ - **Continuous control**: precise 3D action prediction
+ - **Scenario specialization**: optimized specifically for obstacle avoidance
+
+ ## 📚 Training Data
+
+ - **Number of episodes**: 72
+ - **Scenarios**: 1box/2box × left/right × vertical/horizontal
+ - **Actions**: continuous [linear_x, linear_y, angular_z] values
+ - **Images**: RGB from a real robot camera (224x224)
+
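The scenario grid above is a Cartesian product of three factors, which gives 8 scenario names. A short sketch, assuming the `{boxes}_{side}_{orientation}` naming seen in the per-scenario performance table (for example `1box_right_vertical`):

```python
from itertools import product

boxes = ["1box", "2box"]
sides = ["left", "right"]
orientations = ["vertical", "horizontal"]

# Join each combination into a scenario name such as "1box_left_vertical".
scenarios = ["_".join(parts) for parts in product(boxes, sides, orientations)]

print(len(scenarios))  # 8
print(scenarios[0])    # 1box_left_vertical
```

With 72 episodes, an even split would be 9 episodes per scenario, though the README does not state how episodes are actually distributed.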
+ ## 🔬 Research Background
+
+ This model keeps the Window/Chunk mechanism of RoboVLMs while adding features specialized for mobile robots:
+
+ 1. **Window/Chunk retained**: 8-frame observation → 2-frame prediction structure
+ 2. **Kosmos-2B integration**: uses a vision-language backbone
+ 3. **Continuous control**: switches the action space from discrete to continuous
+ 4. **Real robot data**: data collected on a real robot, stored in HDF5 format
+
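To make the Window/Chunk structure concrete, the sketch below slides an 8-frame observation window over one episode and pairs it with the next 2 actions as the prediction target. The stride of 1, the exact alignment between window and chunk, and the function name `window_chunk_pairs` are assumptions; the project's actual data loader may differ.

```python
def window_chunk_pairs(frames, actions, window=8, chunk=2, stride=1):
    """Yield (observed frames, target actions) pairs from one episode."""
    last_start = len(frames) - window - chunk
    for start in range(0, last_start + 1, stride):
        observed = frames[start:start + window]
        targets = actions[start + window:start + window + chunk]
        yield observed, targets


# Toy episode: 12 frame ids and one made-up 3D action per frame.
frames = list(range(12))
actions = [[0.1 * i, 0.0, 0.0] for i in range(12)]

pairs = list(window_chunk_pairs(frames, actions))
print(len(pairs))   # 3
print(pairs[0][0])  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Episodes shorter than `window + chunk` frames yield no pairs under this scheme, so very short recordings would need padding or a smaller window.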
+ ## 📄 Citation
+
+ ```bibtex
+ @misc{mobile_vla_2024,
+   title={Mobile VLA: Vision-Language-Action Model for Mobile Robot Navigation},
+   author={Mobile VLA Team},
+   year={2024},
+   publisher={HuggingFace},
+   url={https://huggingface.co/minuum/mobile-vla}
+ }
+ ```
+
+ ## 🤝 Contribution
+
+ This model was developed on top of the RoboVLMs framework and is released to support the mobile robot community.
+
+ ## 📞 Contact
+
+ - **Issues**: [GitHub Issues](https://github.com/minuum/vla/issues)
+ - **Discussions**: [HuggingFace Discussions](https://huggingface.co/minuum/mobile-vla/discussions)
+
+ ---
+ *Generated on 2025-08-21*