---
license: apache-2.0
language:
- en
tags:
- vision-language-action
- edge-deployment
- tensorRT
- qwen
base_model: Stanford-ILIAD/minivla-vq-libero90-prismatic
library_name: transformers
datasets:
- LIBERO
pipeline_tag: image-text-to-text
---
# MiniVLA
This repository hosts **MiniVLA** – a modular and deployment-friendly Vision-Language-Action (VLA) model designed for **edge hardware** (e.g., Jetson Orin Nano).
It contains the model checkpoints, a Hugging Face–compatible Qwen-0.5B LLM, and ONNX/TensorRT exports for accelerated inference.
---
## 🔎 Introduction
To enable low-latency, secure desktop robot manipulation on local devices, this project addresses the deployment and performance challenges of lightweight multimodal models on edge hardware. Using OpenVLA-Mini as a case study, we propose a hybrid acceleration pipeline designed to alleviate deployment bottlenecks on resource-constrained platforms.
We reproduced a lightweight VLA model and then significantly reduced its end-to-end latency and GPU memory usage by exporting the vision encoder to ONNX and TensorRT engines. While we observed a moderate drop in task success rate (around 5–10% on LIBERO desktop manipulation tasks), our results demonstrate the feasibility of efficient, real-time VLA inference at the edge.
---
## 🏗️ System Architecture
The MiniVLA deployment is designed with modular microservices:
<p align="center">
<img src="./Results/System_Architecture.svg" width="100%" >
</p>
- **Inputs**: image + language instruction
- **Vision Encoder**: DinoV2 / SigLIP → ONNX/TensorRT
- **LLM**: Qwen 2.5 0.5B (Hugging Face / TensorRT-LLM)
- **Router & Fallback**: balances between local inference and accelerated microservices
- **Robot Action**: decoded from predicted action tokens
### Hybrid Acceleration
<p align="center">
<img src="./Results/MiniVLA_Architecture.svg" width="100%" >
</p>
- **Vision Encoder Acceleration**: PyTorch → ONNX → TensorRT, deployed as microservice (`/vision/encode`)
- **LLM Acceleration**: Hugging Face → TensorRT-LLM engine, deployed as microservice (`/llm/generate`)
- **Main Process**: Orchestrates requests, ensures fallback, and outputs robot actions
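The router-and-fallback behaviour can be sketched as a small helper that prefers the accelerated microservice and falls back to local inference on failure; the endpoint URL and timeout below are illustrative assumptions, not fixed by the repository:

```python
import requests

def encode_with_fallback(image_b64, local_encoder,
                         url="http://vision.svc:8000/vision/encode",
                         timeout=2.0):
    """Prefer the TensorRT microservice; fall back to local inference.

    `local_encoder` is any callable mapping the base64 image to an
    embedding (e.g. the original PyTorch vision encoder).
    """
    try:
        resp = requests.post(url, json={"image": image_b64}, timeout=timeout)
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException:
        # Accelerated path unavailable: run the slower local model instead.
        return local_encoder(image_b64)
```

The same pattern applies to the `/llm/generate` endpoint, which keeps the main process responsive even when a microservice is down.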
---
## 📦 Contents
- **`models/`**
Contains the original MiniVLA model checkpoints, based on
[Stanford-ILIAD/minivla-vq-libero90-prismatic](https://huggingface.co/Stanford-ILIAD/minivla-vq-libero90-prismatic).
Special thanks to the Stanford ILIAD team for their open-source contribution.
- **`qwen25-0_5b-trtllm/`**
Qwen-0.5B language model converted to TensorRT-LLM format.
- **`qwen25-0_5b-with-extra-tokenizer/`**
Hugging Face–compatible Qwen-0.5B model with extended tokenizer.
- **`tensorRT/`**
  Vision encoder acceleration files:
  - `vision_encoder_fp16.onnx`
  - `vision_encoder_fp16.engine`
---
## 🔗 Related Project
For full implementation and code, please visit the companion GitHub repository:
👉 [https://github.com/Zhenxintao/MiniVLA](https://github.com/Zhenxintao/MiniVLA)
## 🚀 Usage
### Load Hugging Face Qwen-0.5B
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
# The model lives in a subfolder of the repository, so pass `subfolder`
# rather than appending it to the repo id.
repo_id = "xintaozhen/MiniVLA"
subfolder = "qwen25-0_5b-with-extra-tokenizer"
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder=subfolder, trust_remote_code=True)
```
### Call TensorRT Vision Encoder (HTTP API)
```python
import requests

url = "http://vision.svc:8000/vision/encode"
payload = {"image": "base64_encoded_image"}  # base64-encoded camera frame
response = requests.post(url, json=payload)
response.raise_for_status()  # surface HTTP errors instead of parsing them
vision_embedding = response.json()
```
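The `base64_encoded_image` placeholder above can be produced from an image file with the standard library; a minimal sketch (the file name is illustrative):

```python
import base64

def image_to_b64(path):
    """Read an image file and base64-encode it for the JSON payload."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")

# "frame.png" is a placeholder for the current camera frame:
# payload = {"image": image_to_b64("frame.png")}
```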
### Call TensorRT-LLM (HTTP API)
```python
import requests

url = "http://llm.svc:8810/llm/generate"
payload = {"prompt": "Close the top drawer of the cabinet."}
response = requests.post(url, json=payload)
response.raise_for_status()
generated_actions = response.json()
```
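The generated output is a sequence of discrete action tokens rather than raw joint commands. A hedged sketch of the final decoding step, with a small hypothetical NumPy codebook standing in for MiniVLA's learned VQ action decoder:

```python
import numpy as np

# Hypothetical 4-entry codebook; the real model uses a learned VQ codebook
# mapping token ids to (normalized) robot action chunks.
codebook = np.array([
    [0.00, 0.00, 0.00],
    [0.05, 0.00, 0.00],
    [0.00, 0.05, 0.00],
    [0.00, 0.00, -0.05],
])

def decode_actions(token_ids):
    """Look up each predicted action token in the codebook."""
    return codebook[np.array(token_ids)]

actions = decode_actions([1, 2, 0])  # one row per predicted token
```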
---
## 🔑 Key Contributions
- Built an **end-to-end online inference framework** with a FastAPI service (`/act`), transforming offline benchmark code into a **real-time deployable system**.
- Reproduced a lightweight **OpenVLA-Mini** and proposed a **hybrid acceleration pipeline**.
- Exported the **vision encoder** to TensorRT, reducing perception latency and GPU memory usage.
- Improved **GPU memory efficiency**: reduced average utilization from ~67% to ~43%, and peak usage from ~85% to ~65%, making deployment feasible under 8 GB memory constraints (similar to Jetson-class devices).
- Integrated **Qwen 2.5 0.5B** in Hugging Face and TensorRT-LLM formats.
- Designed a **modular system architecture** with router & fallback for robustness.
- Demonstrated the feasibility of efficient **edge-side VLA inference** for Jetson Orin Nano-class hardware on LIBERO tasks, with only a moderate success-rate drop (5–10%).
---
## 🖥️ Device & Performance
Target deployment: **Jetson Orin Nano (16 GB / 8 GB variants)**.
For simulation and reproducibility, experiments were conducted on a **local workstation** equipped with:
- **GPU**: NVIDIA GeForce RTX 4060 Laptop GPU (8 GB VRAM)
- **Driver / CUDA**: Driver 550.144.03, CUDA 12.4
- **OS**: Ubuntu 22.04 LTS
⚠️ **Note**: Although the experiments were run on an RTX 4060, its 8 GB of GPU memory is comparable to entry-level Jetson devices, making it a suitable proxy for evaluating edge-deployment feasibility.
### GPU Memory Utilization (Long-Sequence Tasks)
| Model Variant | Avg. GPU Utilization | Peak GPU Utilization |
| --------------------------------------- | -------------------- | -------------------- |
| Original MiniVLA (PyTorch, no TRT) | ~67% | ~85% |
| MiniVLA w/ TensorRT Vision Acceleration | ~43% | ~65% |
**Observation:**
- The hybrid acceleration pipeline (TensorRT vision + VLA main process) reduced **average GPU utilization by ~24 percentage points** and **peak usage by ~20 percentage points**.
- This indicates better **GPU memory efficiency**, allowing longer-sequence tasks to run stably on resource-constrained devices.
### Example nvidia-smi Output
Original model:
```
GPU Memory-Usage: 4115MiB / 8188MiB
GPU-Util: 67% (peak 85%)
```
With TensorRT vision acceleration:
```
GPU Memory-Usage: 4055MiB / 8188MiB
GPU-Util: 43% (peak 65%)
```
---
## 📑 License
This repository is released under the **Apache 2.0** license, matching the `license` field in the model card metadata. Components derived from MiniVLA and Qwen 2.5 remain subject to their respective upstream licenses.
---
## 📚 Citation
If you use **MiniVLA** in your research or deployment, please cite:
```
@misc{MiniVLA2025,
  title  = {MiniVLA: A Modular Vision-Language-Action Model for Edge Deployment},
  author = {Xintao Zhen},
  year   = {2025},
  url    = {https://huggingface.co/xintaozhen/MiniVLA}
}
```
We also acknowledge and thank the authors of [Stanford-ILIAD/minivla-vq-libero90-prismatic](https://huggingface.co/Stanford-ILIAD/minivla-vq-libero90-prismatic), which serves as the base for the checkpoints included in this repository.
---