---
license: apache-2.0
language:
- en
tags:
- vision-language-action
- edge-deployment
- tensorRT
- qwen
base_model: Stanford-ILIAD/minivla-vq-libero90-prismatic
library_name: transformers
datasets:
- LIBERO
pipeline_tag: image-text-to-text
---

# MiniVLA

This repository hosts **MiniVLA** – a modular, deployment-friendly Vision-Language-Action (VLA) model designed for **edge hardware** (e.g., Jetson Orin Nano). It contains the model checkpoints, a Hugging Face–compatible Qwen-0.5B LLM, and ONNX/TensorRT exports for accelerated inference.

---

## 🔎 Introduction

To enable low-latency, secure desktop robot tasks on local devices, this project addresses the deployment and performance challenges of lightweight multimodal models on edge hardware. Using OpenVLA-Mini as a case study, we propose a hybrid acceleration pipeline that alleviates deployment bottlenecks on resource-constrained platforms. We reproduced a lightweight VLA model and then significantly reduced its end-to-end latency and GPU memory usage by exporting the vision encoder to ONNX and TensorRT engines. Although the task success rate drops moderately (around 5–10% on LIBERO desktop manipulation tasks), the results demonstrate that efficient, real-time VLA inference is feasible on edge devices.

---

## 🏗️ System Architecture

The MiniVLA deployment is organized as modular microservices:

- **Inputs**: image + language instruction
- **Vision Encoder**: DINOv2 / SigLIP → ONNX/TensorRT
- **LLM**: Qwen 2.5 0.5B (Hugging Face / TensorRT-LLM)
- **Router & Fallback**: balances requests between local inference and the accelerated microservices
- **Robot Action**: decoded from the predicted action tokens

### Hybrid Acceleration

- **Vision Encoder Acceleration**: PyTorch → ONNX → TensorRT, deployed as a microservice (`/vision/encode`); see the export sketch below
- **LLM Acceleration**: Hugging Face → TensorRT-LLM engine, deployed as a microservice (`/llm/generate`)
- **Main Process**: orchestrates requests, handles fallback, and outputs robot actions (see the orchestration sketch after the export example)
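As a concrete illustration of the vision path, here is a minimal sketch of the PyTorch → ONNX → TensorRT export flow. It is not the repository's export script: a torchvision ViT-B/16 stands in for MiniVLA's fused DINOv2/SigLIP backbone, and the input resolution, tensor names, and opset are assumptions.

```python
# Sketch of the PyTorch -> ONNX -> TensorRT flow for the vision encoder.
# A torchvision ViT-B/16 stands in for MiniVLA's DINOv2/SigLIP backbone;
# input resolution, tensor names, and file names are illustrative only.
import torch
from torchvision.models import vit_b_16

encoder = vit_b_16(weights=None).eval()      # stand-in for the real backbone
dummy_image = torch.randn(1, 3, 224, 224)    # assumed single-image input

torch.onnx.export(
    encoder,
    dummy_image,
    "vision_encoder.onnx",
    input_names=["image"],
    output_names=["features"],
    opset_version=17,
    dynamic_axes={"image": {0: "batch"}},    # allow batched requests
)

# FP16 precision is applied when the engine is built with TensorRT's CLI, e.g.:
#   trtexec --onnx=vision_encoder.onnx --saveEngine=vision_encoder_fp16.engine --fp16
```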

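The main process ties these services together behind a single `/act` endpoint (see Key Contributions below) and falls back to local PyTorch inference when an accelerated service is unreachable. Below is a minimal sketch of how such a router could be structured; the service URLs mirror the usage examples later in this card, while the request schema and the `encode_locally` / `generate_locally` fallbacks are hypothetical placeholders.

```python
# Minimal sketch of the router & fallback logic behind the /act endpoint.
# Service URLs follow the usage examples in this card; the local-fallback
# stubs and the request/response fields are hypothetical, not the repo's API.
import requests
from fastapi import FastAPI
from pydantic import BaseModel

VISION_URL = "http://vision.svc:8000/vision/encode"
LLM_URL = "http://llm.svc:8810/llm/generate"

app = FastAPI()

class ActRequest(BaseModel):
    image: str          # base64-encoded camera frame
    instruction: str    # natural-language task description

def call_service(url, payload, timeout=2.0):
    """Try an accelerated microservice; return None so the caller can fall back."""
    try:
        resp = requests.post(url, json=payload, timeout=timeout)
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException:
        return None

def encode_locally(image_b64):
    # Hypothetical fallback: run the PyTorch vision encoder in-process.
    raise NotImplementedError

def generate_locally(prompt):
    # Hypothetical fallback: run the Hugging Face Qwen model in-process.
    raise NotImplementedError

@app.post("/act")
def act(req: ActRequest):
    # Vision: prefer the TensorRT microservice, otherwise fall back locally.
    features = call_service(VISION_URL, {"image": req.image})
    if features is None:
        features = encode_locally(req.image)

    # LLM: prefer TensorRT-LLM, otherwise fall back to the Hugging Face model.
    actions = call_service(LLM_URL, {"prompt": req.instruction})
    if actions is None:
        actions = generate_locally(req.instruction)

    # Decoding action tokens into a robot command is omitted from this sketch.
    return {"vision_features": features, "actions": actions}
```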
---

## 📦 Contents

- **`models/`**
  Original MiniVLA model checkpoints, based on [Stanford-ILIAD/minivla-vq-libero90-prismatic](https://huggingface.co/Stanford-ILIAD/minivla-vq-libero90-prismatic). Special thanks to the Stanford ILIAD team for their open-source contribution.
- **`qwen25-0_5b-trtllm/`**
  Qwen-0.5B language model converted to TensorRT-LLM format.
- **`qwen25-0_5b-with-extra-tokenizer/`**
  Hugging Face–compatible Qwen-0.5B model with an extended tokenizer.
- **`tensorRT/`**
  Vision encoder acceleration files:
  - `vision_encoder_fp16.onnx`
  - `vision_encoder_fp16.engine`

---

## 🔗 Related Project

For the full implementation and code, please visit the companion GitHub repository:
👉 [https://github.com/Zhenxintao/MiniVLA](https://github.com/Zhenxintao/MiniVLA)

---

## 🚀 Usage

### Load Hugging Face Qwen-0.5B

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# The Qwen-0.5B model lives in a subfolder of this repository.
repo_id = "xintaozhen/MiniVLA"
subfolder = "qwen25-0_5b-with-extra-tokenizer"

tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder=subfolder, trust_remote_code=True)
```

### Call TensorRT Vision Encoder (HTTP API)

```python
import requests

url = "http://vision.svc:8000/vision/encode"
image_data = {"image": "base64_encoded_image"}
response = requests.post(url, json=image_data)
vision_embedding = response.json()
```

### Call TensorRT-LLM (HTTP API)

```python
import requests

url = "http://llm.svc:8810/llm/generate"
payload = {"prompt": "Close the top drawer of the cabinet."}
response = requests.post(url, json=payload)
generated_actions = response.json()
```

---

## 🔑 Key Contributions

- Built an **end-to-end online inference framework** with a FastAPI service (`/act`), transforming offline benchmark code into a **real-time deployable system**.
- Reproduced a lightweight **OpenVLA-Mini** and proposed a **hybrid acceleration pipeline**.
- Exported the **vision encoder** to TensorRT, reducing perception latency and GPU memory usage.
- Improved **GPU efficiency**: reduced average utilization from ~67% to ~43% and peak utilization from ~85% to ~65%, making deployment feasible under an 8 GB memory budget (comparable to Jetson-class devices).
- Integrated **Qwen 2.5 0.5B** in both Hugging Face and TensorRT-LLM formats.
- Designed a **modular system architecture** with a router & fallback mechanism for robustness.
- Demonstrated the feasibility of efficient **edge-side VLA inference** (targeting Jetson Orin Nano-class hardware) on LIBERO tasks, with only a moderate success-rate drop (5–10%).

---

## 🖥️ Device & Performance

Target deployment: **Jetson Orin Nano (16 GB / 8 GB variants)**.

For simulation and reproducibility, experiments were conducted on a **local workstation** equipped with:

- **GPU**: NVIDIA GeForce RTX 4060 Laptop GPU (8 GB VRAM)
- **Driver / CUDA**: Driver 550.144.03, CUDA 12.4
- **OS**: Ubuntu 22.04 LTS

⚠️ **Note**: Although the experiments were run on an RTX 4060, its 8 GB of GPU memory is comparable to entry-level Jetson devices, making it a reasonable proxy for evaluating edge-deployment feasibility.

### GPU Utilization (Long-Sequence Tasks)

| Model Variant                            | Avg. GPU Utilization | Peak GPU Utilization |
| ---------------------------------------- | -------------------- | -------------------- |
| Original MiniVLA (PyTorch, no TRT)       | ~67%                 | ~85%                 |
| MiniVLA w/ TensorRT Vision Acceleration  | ~43%                 | ~65%                 |

**Observation:**

- The hybrid acceleration pipeline (TensorRT vision + VLA main process) reduced **average GPU utilization by ~24 percentage points** and **peak utilization by ~20 percentage points**.
- This improved headroom allows longer-sequence tasks to run stably on resource-constrained devices.

### Example nvidia-smi Output

Original model:

```
GPU Memory-Usage: 4115MiB / 8188MiB
GPU-Util: 67% (peak 85%)
```

With TensorRT vision acceleration:

```
GPU Memory-Usage: 4055MiB / 8188MiB
GPU-Util: 43% (peak 65%)
```
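The readings above were taken with `nvidia-smi` during long-horizon rollouts. A minimal way to sample comparable numbers programmatically is sketched below; it assumes a single visible GPU with `nvidia-smi` on the `PATH`, and the sampling window is arbitrary.

```python
# Sample GPU memory and utilization once per second via nvidia-smi's query mode.
# Assumes a single GPU and that nvidia-smi is on PATH; interval/duration are arbitrary.
import subprocess
import time

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total,utilization.gpu",
    "--format=csv,noheader,nounits",
]

peak_util = 0
for _ in range(60):  # sample for ~1 minute during a rollout
    used, total, util = subprocess.check_output(QUERY, text=True).strip().split(", ")
    peak_util = max(peak_util, int(util))
    print(f"GPU Memory-Usage: {used}MiB / {total}MiB | GPU-Util: {util}% (peak {peak_util}%)")
    time.sleep(1)
```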
---

## 📑 License

This repository is released under the **Apache 2.0** license (see the metadata above). The included MiniVLA checkpoints and the Qwen 2.5 model remain subject to their respective upstream licenses.

---

## 📚 Citation

If you use **MiniVLA** in your research or deployment, please cite:

```
@misc{MiniVLA2025,
  title  = {MiniVLA: A Modular Vision-Language-Action Model for Edge Deployment},
  author = {Xintao Zhen},
  year   = {2025},
  url    = {https://huggingface.co/xintaozhen/MiniVLA}
}
```

We also acknowledge and thank the authors of [Stanford-ILIAD/minivla-vq-libero90-prismatic](https://huggingface.co/Stanford-ILIAD/minivla-vq-libero90-prismatic), which serves as the base for the checkpoints included in this repository.

---