---
license: apache-2.0
language:
- en
tags:
- vision-language-action
- edge-deployment
- tensorRT
- qwen
base_model: Stanford-ILIAD/minivla-vq-libero90-prismatic
library_name: transformers
datasets:
- LIBERO
pipeline_tag: image-text-to-text
---
# MiniVLA
This repository hosts **MiniVLA** – a modular and deployment-friendly Vision-Language-Action (VLA) model designed for **edge hardware** (e.g., Jetson Orin Nano).
It contains the model checkpoints, a Hugging Face–compatible Qwen-0.5B LLM, and ONNX/TensorRT exports for accelerated inference.
---
## 🔎 Introduction
To enable low-latency, secure desktop robot manipulation on local devices, this project addresses the deployment and performance challenges of lightweight multimodal models on edge hardware. Using OpenVLA-Mini as a case study, we propose a hybrid acceleration pipeline designed to alleviate deployment bottlenecks on resource-constrained platforms.
We reproduced a lightweight VLA model and then significantly reduced its end-to-end latency and GPU memory usage by exporting the vision encoder to ONNX and TensorRT engines. While we observed a moderate drop in task success rate (around 5–10% on LIBERO desktop manipulation tasks), our results demonstrate the feasibility of efficient, real-time VLA inference at the edge.
---
## 🏗️ System Architecture
The MiniVLA deployment is designed with modular microservices:
<p align="center">
<img src="./Results/System_Architecture.svg" width="100%" >
</p>
- **Inputs**: image + language instruction
- **Vision Encoder**: DinoV2 / SigLIP → ONNX/TensorRT
- **LLM**: Qwen 2.5 0.5B (Hugging Face / TensorRT-LLM)
- **Router & Fallback**: balances between local inference and accelerated microservices
- **Robot Action**: decoded from predicted action tokens
### Hybrid Acceleration
<p align="center">
<img src="./Results/MiniVLA_Architecture.svg" width="100%" >
</p>
- **Vision Encoder Acceleration**: PyTorch → ONNX → TensorRT, deployed as microservice (`/vision/encode`)
- **LLM Acceleration**: Hugging Face → TensorRT-LLM engine, deployed as microservice (`/llm/generate`)
- **Main Process**: Orchestrates requests, ensures fallback, and outputs robot actions
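The router-and-fallback behaviour can be sketched as a small helper that prefers the accelerated microservice and falls back to local inference on failure; the endpoint URL and timeout below are illustrative assumptions, not fixed by the repository:

```python
import requests

def encode_with_fallback(image_b64, local_encoder,
                         url="http://vision.svc:8000/vision/encode",
                         timeout=2.0):
    """Prefer the TensorRT microservice; fall back to local inference.

    `local_encoder` is any callable mapping the base64 image to an
    embedding (e.g. the original PyTorch vision encoder).
    """
    try:
        resp = requests.post(url, json={"image": image_b64}, timeout=timeout)
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException:
        # Accelerated path unavailable: run the slower local model instead.
        return local_encoder(image_b64)
```

The same pattern applies to the `/llm/generate` endpoint, which keeps the main process responsive even when a microservice is down.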
---
## 📦 Contents
- **`models/`**
Contains the original MiniVLA model checkpoints, based on
[Stanford-ILIAD/minivla-vq-libero90-prismatic](https://huggingface.co/Stanford-ILIAD/minivla-vq-libero90-prismatic).
Special thanks to the Stanford ILIAD team for their open-source contribution.
- **`qwen25-0_5b-trtllm/`**
Qwen-0.5B language model converted to TensorRT-LLM format.
- **`qwen25-0_5b-with-extra-tokenizer/`**
Hugging Face–compatible Qwen-0.5B model with extended tokenizer.
- **`tensorRT/`**
  Vision encoder acceleration files:
  - `vision_encoder_fp16.onnx`
  - `vision_encoder_fp16.engine`
---
## 🔗 Related Project
For full implementation and code, please visit the companion GitHub repository:
👉 [https://github.com/Zhenxintao/MiniVLA](https://github.com/Zhenxintao/MiniVLA)
## 🚀 Usage
### Load Hugging Face Qwen-0.5B
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
# The model lives in a subfolder of the repository, so pass `subfolder`
# rather than appending it to the repo id.
repo_id = "xintaozhen/MiniVLA"
subfolder = "qwen25-0_5b-with-extra-tokenizer"
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder=subfolder, trust_remote_code=True)
```
### Call TensorRT Vision Encoder (HTTP API)
```python
import requests

url = "http://vision.svc:8000/vision/encode"
payload = {"image": "base64_encoded_image"}  # base64-encoded camera frame
response = requests.post(url, json=payload)
response.raise_for_status()  # surface HTTP errors instead of parsing them
vision_embedding = response.json()
```
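The `base64_encoded_image` placeholder above can be produced from an image file with the standard library; a minimal sketch (the file name is illustrative):

```python
import base64

def image_to_b64(path):
    """Read an image file and base64-encode it for the JSON payload."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")

# "frame.png" is a placeholder for the current camera frame:
# payload = {"image": image_to_b64("frame.png")}
```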
### Call TensorRT-LLM (HTTP API)
```python
import requests

url = "http://llm.svc:8810/llm/generate"
payload = {"prompt": "Close the top drawer of the cabinet."}
response = requests.post(url, json=payload)
response.raise_for_status()
generated_actions = response.json()
```
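The generated output is a sequence of discrete action tokens rather than raw joint commands. A hedged sketch of the final decoding step, with a small hypothetical NumPy codebook standing in for MiniVLA's learned VQ action decoder:

```python
import numpy as np

# Hypothetical 4-entry codebook; the real model uses a learned VQ codebook
# mapping token ids to (normalized) robot action chunks.
codebook = np.array([
    [0.00, 0.00, 0.00],
    [0.05, 0.00, 0.00],
    [0.00, 0.05, 0.00],
    [0.00, 0.00, -0.05],
])

def decode_actions(token_ids):
    """Look up each predicted action token in the codebook."""
    return codebook[np.array(token_ids)]

actions = decode_actions([1, 2, 0])  # one row per predicted token
```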
---
## 🔑 Key Contributions
- Built an **end-to-end online inference framework** with a FastAPI service (`/act`), transforming offline benchmark code into a **real-time deployable system**.
- Reproduced a lightweight **OpenVLA-Mini** and proposed a **hybrid acceleration pipeline**.
- Exported the **vision encoder** to TensorRT, reducing perception latency and GPU memory usage.
- Improved **GPU memory efficiency**: reduced average utilization from ~67% to ~43%, and peak usage from ~85% to ~65%, making deployment feasible under 8 GB memory constraints (similar to Jetson-class devices).
- Integrated **Qwen 2.5 0.5B** in Hugging Face and TensorRT-LLM formats.
- Designed a **modular system architecture** with router & fallback for robustness.
- Demonstrated the feasibility of efficient **edge-side VLA inference** for Jetson Orin Nano-class hardware on LIBERO tasks, with only a moderate success-rate drop (5–10%).
---
## 🖥️ Device & Performance
Target deployment: **Jetson Orin Nano (16 GB / 8 GB variants)**.
For simulation and reproducibility, experiments were conducted on a **local workstation** equipped with:
- **GPU**: NVIDIA GeForce RTX 4060 Laptop GPU (8 GB VRAM)
- **Driver / CUDA**: Driver 550.144.03, CUDA 12.4
- **OS**: Ubuntu 22.04 LTS
⚠️ **Note**: Although the experiments were run on an RTX 4060, its 8 GB of GPU memory is comparable to entry-level Jetson devices, making it a suitable proxy for evaluating edge-deployment feasibility.
### GPU Memory Utilization (Long-Sequence Tasks)
| Model Variant | Avg. GPU Utilization | Peak GPU Utilization |
| --------------------------------------- | -------------------- | -------------------- |
| Original MiniVLA (PyTorch, no TRT) | ~67% | ~85% |
| MiniVLA w/ TensorRT Vision Acceleration | ~43% | ~65% |
**Observation:**
- The hybrid acceleration pipeline (TensorRT vision + VLA main process) reduced **average GPU utilization by ~24 percentage points** and **peak usage by ~20 percentage points**.
- This indicates better **GPU memory efficiency**, allowing longer-sequence tasks to run stably on resource-constrained devices.
### Example nvidia-smi Output
Original model:
```
GPU Memory-Usage: 4115MiB / 8188MiB
GPU-Util: 67% (peak 85%)
```
With TensorRT vision acceleration:
```
GPU Memory-Usage: 4055MiB / 8188MiB
GPU-Util: 43% (peak 65%)
```
---
## 📑 License
This repository is released under the **Apache 2.0** license, matching the `license` field in the model card metadata. Components derived from MiniVLA and Qwen 2.5 remain subject to their respective upstream licenses.
---
## 📚 Citation
If you use **MiniVLA** in your research or deployment, please cite:
```
@misc{MiniVLA2025,
  title  = {MiniVLA: A Modular Vision-Language-Action Model for Edge Deployment},
  author = {Xintao Zhen},
  year   = {2025},
  url    = {https://huggingface.co/xintaozhen/MiniVLA}
}
```
We also acknowledge and thank the authors of [Stanford-ILIAD/minivla-vq-libero90-prismatic](https://huggingface.co/Stanford-ILIAD/minivla-vq-libero90-prismatic), which serves as the base for the checkpoints included in this repository.
---