---
license: apache-2.0
language:
- en
tags:
- vision-language-action
- edge-deployment
- tensorRT
- qwen
base_model: Stanford-ILIAD/minivla-vq-libero90-prismatic
library_name: transformers
datasets:
- LIBERO
pipeline_tag: image-text-to-text
---

# MiniVLA

This repository hosts **MiniVLA** – a modular and deployment-friendly Vision-Language-Action (VLA) model designed for **edge hardware** (e.g., Jetson Orin Nano).  
It contains the model checkpoints, a Hugging Face–compatible Qwen-0.5B LLM, and ONNX/TensorRT exports of the vision encoder for accelerated inference.  

---

## 🔎 Introduction

To enable low-latency, secure tabletop robot tasks on local devices, this project addresses the deployment and performance challenges of lightweight multimodal models on edge hardware. Using OpenVLA-Mini as a case study, we propose a hybrid acceleration pipeline designed to alleviate deployment bottlenecks on resource-constrained platforms.

We reproduced a lightweight VLA model and then significantly reduced its end-to-end latency and GPU memory usage by exporting the vision encoder to ONNX and TensorRT engines. While we observed a moderate drop in task success rate (roughly 5–10% on LIBERO tabletop manipulation tasks), our results still demonstrate the feasibility of efficient, real-time VLA inference at the edge.

---  

## 🏗️ System Architecture

The MiniVLA deployment is designed with modular microservices:  

<p align="center">
  <img src="./Results/System_Architecture.svg" width="100%" >
</p>


- **Inputs**: image + language instruction  
- **Vision Encoder**: DinoV2 / SigLIP → ONNX/TensorRT  
- **LLM**: Qwen 2.5 0.5B (Hugging Face / TensorRT-LLM)  
- **Router & Fallback**: balances between local inference and accelerated microservices  
- **Robot Action**: decoded from predicted action tokens  
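The router & fallback behaviour described above can be sketched as a small helper. This is an illustrative pattern only, not the repository's actual API: the function names, signatures, and stand-in embedding are assumptions.

```python
# Minimal sketch of the router & fallback pattern: prefer the accelerated
# microservice, degrade to local PyTorch inference if the call fails.
# Names and signatures are illustrative, not the repo's actual API.

def with_fallback(accelerated, local, *args, **kwargs):
    """Run `accelerated` first; on any failure, fall back to `local`."""
    try:
        return accelerated(*args, **kwargs)
    except Exception:
        # Timeout, connection error, or engine failure: keep serving actions.
        return local(*args, **kwargs)

# Example wiring: a TensorRT microservice client vs. a local encoder.
def encode_remote(image_b64):
    raise ConnectionError("vision microservice unreachable")  # simulated outage

def encode_local(image_b64):
    return [0.0] * 8  # stand-in embedding

embedding = with_fallback(encode_remote, encode_local, "img")
```

The same wrapper applies to the LLM path: if the TensorRT-LLM service is down, the main process can fall back to the Hugging Face model.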

### Hybrid Acceleration

<p align="center">
  <img src="./Results/MiniVLA_Architecture.svg" width="100%" >
</p>


- **Vision Encoder Acceleration**: PyTorch → ONNX → TensorRT, deployed as microservice (`/vision/encode`)  
- **LLM Acceleration**: Hugging Face → TensorRT-LLM engine, deployed as microservice (`/llm/generate`)  
- **Main Process**: Orchestrates requests, ensures fallback, and outputs robot actions  

---

## 📦 Contents

- **`models/`**  
  Contains the original MiniVLA model checkpoints, based on  
  [Stanford-ILIAD/minivla-vq-libero90-prismatic](https://huggingface.co/Stanford-ILIAD/minivla-vq-libero90-prismatic).  
  Special thanks to the Stanford ILIAD team for their open-source contribution.  

- **`qwen25-0_5b-trtllm/`**  
  Qwen-0.5B language model converted to TensorRT-LLM format.  

- **`qwen25-0_5b-with-extra-tokenizer/`**  
  Hugging Face–compatible Qwen-0.5B model with extended tokenizer.  

- **`tensorRT/`**  
  Vision encoder acceleration files:  
  - `vision_encoder_fp16.onnx`  
  - `vision_encoder_fp16.engine`  

---


## 🔗 Related Project

For full implementation and code, please visit the companion GitHub repository:  
👉 [https://github.com/Zhenxintao/MiniVLA](https://github.com/Zhenxintao/MiniVLA)


## 🚀 Usage

### Load Hugging Face Qwen-0.5B

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# The model lives in a subfolder of this repo, so pass `subfolder`
# rather than appending the folder name to the repo ID.
repo_id = "xintaozhen/MiniVLA"
subfolder = "qwen25-0_5b-with-extra-tokenizer"
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder=subfolder, trust_remote_code=True)
```

### Call TensorRT Vision Encoder (HTTP API)

```python
import requests

# Hypothetical service address; adjust host/port to your deployment.
url = "http://vision.svc:8000/vision/encode"
payload = {"image": "base64_encoded_image"}
response = requests.post(url, json=payload)
response.raise_for_status()
vision_embedding = response.json()
```
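The base64 image placeholder in the example above would typically be produced from raw file bytes. A stdlib-only helper (the `{"image": ...}` payload shape is taken from the example; the function name is illustrative):

```python
import base64

def image_to_payload(path):
    """Read an image file and wrap it in the {"image": <base64>} payload shape."""
    with open(path, "rb") as f:
        return {"image": base64.b64encode(f.read()).decode("ascii")}
```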

### Call TensorRT-LLM (HTTP API)

```python
import requests

# Hypothetical service address; adjust host/port to your deployment.
url = "http://llm.svc:8810/llm/generate"
payload = {"prompt": "Close the top drawer of the cabinet."}
response = requests.post(url, json=payload)
response.raise_for_status()
generated_actions = response.json()
```

---

## 🔑 Key Contributions

- Built an **end-to-end online inference framework** with a FastAPI service (`/act`), transforming offline benchmark code into a **real-time deployable system**.  
- Reproduced a lightweight **OpenVLA-Mini** and proposed a **hybrid acceleration pipeline**.  
- Exported the **vision encoder** to TensorRT, reducing perception latency and GPU memory usage.  
- Improved **GPU memory efficiency**: reduced average utilization from ~67% to ~43%, and peak usage from ~85% to ~65%, making deployment feasible under 8 GB memory constraints (similar to Jetson-class devices).  
- Integrated **Qwen 2.5 0.5B** in Hugging Face and TensorRT-LLM formats.  
- Designed a **modular system architecture** with router & fallback for robustness.  
- Demonstrated efficient **edge-side VLA inference** on Jetson Orin Nano in LIBERO tasks, with only a moderate performance drop (5–10%).  

---

## 🖥️ Device & Performance

Target deployment: **Jetson Orin Nano (16 GB / 8 GB variants)**.  

For simulation and reproducibility, experiments were conducted on a **local workstation** equipped with:

- **GPU**: NVIDIA GeForce RTX 4060 Laptop GPU (8 GB VRAM)  
- **Driver / CUDA**: Driver 550.144.03, CUDA 12.4  
- **OS**: Ubuntu 22.04 LTS  

⚠️ **Note**: Although the experiments were run on an RTX 4060, its 8 GB of GPU memory is comparable to entry-level Jetson devices, making it a suitable proxy for evaluating edge deployment feasibility.  

### GPU Memory Utilization (Long-Sequence Tasks)

| Model Variant                           | Avg. GPU Utilization | Peak GPU Utilization |
| --------------------------------------- | -------------------- | -------------------- |
| Original MiniVLA (PyTorch, no TRT)      | ~67%                 | ~85%                 |
| MiniVLA w/ TensorRT Vision Acceleration | ~43%                 | ~65%                 |

**Observation:**  

- The hybrid acceleration pipeline (TensorRT vision + VLA main process) reduced **average GPU utilization by ~24 percentage points** (67% → 43%) and **peak utilization by ~20 percentage points** (85% → 65%).  
- This indicates better **GPU memory efficiency**, allowing longer sequence tasks to run stably under resource-constrained devices.  

### Example nvidia-smi Output

Original model:

```
GPU Memory-Usage: 4115MiB / 8188MiB
GPU-Util: 67% (peak 85%)
```

With TensorRT vision acceleration:

```
GPU Memory-Usage: 4055MiB / 8188MiB
GPU-Util: 43% (peak 65%)
```
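Numbers like those above can be sampled programmatically via `nvidia-smi`'s CSV query mode. The exact polling setup behind the table is not part of this card; the helper below is one way to collect the same statistics.

```python
import subprocess

def read_gpu_stats():
    """Query GPU utilization (%) and memory used (MiB) via nvidia-smi."""
    out = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=utilization.gpu,memory.used",
            "--format=csv,noheader",
        ],
        text=True,
    )
    return parse_gpu_line(out.strip().splitlines()[0])

def parse_gpu_line(line):
    """Parse one CSV line such as '43 %, 4055 MiB' into (43, 4055)."""
    util, mem = line.split(",")
    return int(util.strip().rstrip(" %")), int(mem.strip().rstrip(" MiB"))
```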

---

## 📑 License

This repository is released under the **Apache 2.0** license, as declared in the model card metadata above. The bundled Qwen 2.5 0.5B model and the upstream MiniVLA checkpoints remain subject to their respective upstream licenses.  

---

## 📚 Citation

If you use **MiniVLA** in your research or deployment, please cite:

```bibtex
@misc{MiniVLA2025,
  title   = {MiniVLA: A Modular Vision-Language-Action Model for Edge Deployment},
  author  = {Xintao Zhen},
  year    = {2025},
  url     = {https://huggingface.co/xintaozhen/MiniVLA}
}
```

We also acknowledge and thank the authors of [Stanford-ILIAD/minivla-vq-libero90-prismatic](https://huggingface.co/Stanford-ILIAD/minivla-vq-libero90-prismatic), which serves as the base for the checkpoints included in this repository.  

---