# NanoVLA-Flow-V1 (Gemma-4-E2B)
NanoVLA-Flow is an experimental, highly parameter-efficient Vision-Language-Action (VLA) architecture. By combining a 2.6B-parameter google/gemma-4-E2B-it backbone with a continuous Flow Matching action head integrated by an ODE solver, we trained a capable robotic policy entirely on a single consumer-grade 16 GB NVIDIA T4 GPU.
This repository contains the merged weights: the LoRA adapters have been folded into the base model, so no adapter loading is needed at inference time.
## Core Innovations
- Continuous Flow Matching: Instead of discretizing 3D space into language tokens (as in RT-2 or OpenVLA), NanoVLA-Flow predicts continuous 3D velocity vectors and integrates them with an ODE solver (Heun's second-order method); see the sketch after this list.
- Min-Max Action Bounding: Achieves SOTA trajectory precision (0.015 MSE) by strictly scaling target vectors to [-1.0, 1.0].
- Parameter Efficiency: The entire architecture, including the 80M-parameter ActionExpert, was trained via LoRA on a single T4.
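To make the first two bullets concrete, here is a minimal sketch, not the repository's actual implementation: min-max bounding maps raw action targets into [-1.0, 1.0] (and back at inference time), and Heun's second-order method integrates a learned velocity field from noise to action. `velocity_fn`, `action_min`, and `action_max` are hypothetical placeholders for the trained model and dataset statistics.

```python
import torch

# Hypothetical min-max bounding: scale raw actions into [-1, 1] for training
# and invert the mapping at inference. `action_min`/`action_max` are assumed
# per-dimension bounds computed from the training data.
def scale_actions(actions, action_min, action_max):
    return 2.0 * (actions - action_min) / (action_max - action_min) - 1.0

def unscale_actions(scaled, action_min, action_max):
    return (scaled + 1.0) * (action_max - action_min) / 2.0 + action_min

# Heun's second-order method: integrate a learned velocity field v(x, t)
# from t=0 (noise) to t=1 (action) in a fixed number of steps.
def heun_integrate(velocity_fn, x, num_steps=10):
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        v1 = velocity_fn(x, t)                  # slope at the current point
        v2 = velocity_fn(x + dt * v1, t + dt)   # slope at the Euler prediction
        x = x + dt * 0.5 * (v1 + v2)            # Heun corrector: average slopes
    return x
```

Because the targets are bounded, the integrated sample stays near [-1.0, 1.0] and would be mapped back to metric units with `unscale_actions`.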
## Benchmarks
We rigorously tested the model to ensure our robotic fine-tuning did not cause catastrophic forgetting of the VLM's general intelligence.
- A-OKVQA (Augmented Outside Knowledge VQA): 55.11% multiple-choice accuracy
- Flow Trajectory MSE: 0.156
- Trajectory Cosine Similarity: 0.666
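For reference, the two trajectory metrics follow their standard definitions; a minimal sketch, assuming predicted and ground-truth trajectories as `(N, 3)` tensors:

```python
import torch.nn.functional as F

def trajectory_metrics(pred, target):
    """Mean squared error and mean cosine similarity between 3D vectors."""
    mse = F.mse_loss(pred, target).item()
    cos = F.cosine_similarity(pred, target, dim=-1).mean().item()
    return mse, cos
```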
## Inference Snippet
Here is how to load the merged weights and extract, from an image, the backbone features that drive trajectory prediction:
```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. Load the merged Vision-Language backbone
processor = AutoProcessor.from_pretrained("AryanNsc/NanoVLA-Flow-V1")
backbone = AutoModelForImageTextToText.from_pretrained(
    "AryanNsc/NanoVLA-Flow-V1",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to(device)

# 2. Process your camera feed and task language
image = Image.open("path_to_your_camera_frame.jpg")
instruction = "pick up the black bowl"
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": instruction}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(device)

# 3. Extract the spatial hidden states (the "brain")
with torch.no_grad():
    outputs = backbone(**inputs, output_hidden_states=True)

# From here, pass `outputs.hidden_states` into the ActionExpert to compute the
# continuous Flow trajectory.
```
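The ActionExpert itself lives in the GitHub repository rather than in this snippet. Purely as an illustrative sketch (the `ActionExpert` class below is a hypothetical placeholder, not the repository's module), the rollout could condition a small velocity head on the pooled hidden states and integrate with the same Heun steps described above:

```python
import torch.nn as nn

# Hypothetical placeholder head -- NOT the repository's ActionExpert.
# Maps (noisy action, time, pooled vision-language context) to a 3D velocity.
class ActionExpert(nn.Module):
    def __init__(self, hidden_dim, action_dim=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim + action_dim + 1, 256),
            nn.GELU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, action, t, context):
        return self.net(torch.cat([action, t[:, None], context], dim=-1))

hidden = outputs.hidden_states[-1].mean(dim=1).float()  # pooled context (B, H)
expert = ActionExpert(hidden.shape[-1]).to(device)      # load trained weights here

action = torch.randn(hidden.shape[0], 3, device=device)  # start from noise, t=0
num_steps = 10
dt = 1.0 / num_steps
with torch.no_grad():
    for i in range(num_steps):
        t = torch.full((action.shape[0],), i * dt, device=device)
        v1 = expert(action, t, hidden)
        v2 = expert(action + dt * v1, t + dt, hidden)
        action = action + dt * 0.5 * (v1 + v2)  # Heun corrector step
# `action` is still in the bounded [-1, 1] space; rescale with the dataset's
# action min/max to recover metric-space targets.
```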
## Resources
| Resource | Link |
|---|---|
| GitHub Repository | [Guney-olu/NanoVLA-Flow](https://github.com/Guney-olu/NanoVLA-Flow) |