NanoVLA-Flow-V1 (Gemma-4-E2B)

NanoVLA-Flow is an experimental, parameter-efficient Vision-Language-Action (VLA) architecture. It pairs a 2.6B-parameter google/gemma-4-E2B-it backbone with a continuous Flow Matching ODE solver, and the full robotic policy was trained on a single 16 GB NVIDIA T4 GPU.

This repository contains the merged weights: the LoRA adapters have been folded into the base model, so no adapter loading is required at inference time.

Core Innovations

  1. Continuous Flow Matching: Instead of discretizing 3D space into language tokens (as in RT-2 or OpenVLA), NanoVLA-Flow predicts continuous 3D velocity vectors and integrates them with an ODE solver (Heun's 2nd-order method).
  2. Min-Max Action Bounding: Improves trajectory precision (0.015 MSE) by strictly scaling target action vectors to [-1.0, 1.0].
  3. Parameter Efficiency: The entire architecture, including the 80M-parameter ActionExpert, was trained via LoRA on a single T4.
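The min-max bounding in point 2 is simple to reproduce. A minimal sketch, assuming per-dimension `low`/`high` bounds recorded from the training data (the function names and example bounds are illustrative, not the repository's API):

```python
import numpy as np

def scale_actions(actions, low, high):
    """Map raw action vectors into [-1.0, 1.0] per dimension."""
    return 2.0 * (actions - low) / (high - low) - 1.0

def unscale_actions(scaled, low, high):
    """Invert the scaling to recover actions in the original units."""
    return (scaled + 1.0) / 2.0 * (high - low) + low

low = np.array([-0.5, -0.5, 0.0])   # example workspace bounds (metres)
high = np.array([0.5, 0.5, 0.3])
raw = np.array([0.0, 0.25, 0.15])

scaled = scale_actions(raw, low, high)
print(scaled)                               # → [0.  0.5 0. ]
print(unscale_actions(scaled, low, high))   # → [0.   0.25 0.15]
```

Keeping targets in a fixed [-1, 1] range stabilizes regression training; the same bounds must be stored and applied in reverse at deployment time.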

Benchmarks

We rigorously tested the model to ensure our robotic fine-tuning did not cause catastrophic forgetting of the VLM's general intelligence.

  • A-OKVQA (Augmented Outside Knowledge VQA): 55.11% Multiple Choice Accuracy
  • Flow Trajectory MSE: 0.156
  • Trajectory Cosine Similarity: 0.666
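For reference, the two trajectory metrics above are standard and easy to reproduce. A minimal sketch with random tensors (the shapes are illustrative assumptions, not the repository's evaluation code):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
pred = torch.randn(8, 50, 3)    # (batch, horizon, xyz): predicted trajectories
target = torch.randn(8, 50, 3)  # ground-truth trajectories

# Flow Trajectory MSE: mean squared error over all waypoints
mse = F.mse_loss(pred, target)

# Trajectory Cosine Similarity: per-waypoint direction agreement, averaged
cos = F.cosine_similarity(pred.flatten(0, 1), target.flatten(0, 1), dim=-1).mean()

print(f"MSE: {mse.item():.3f}  Cosine: {cos.item():.3f}")
```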

Inference Snippet

Here is how to load the merged weights and predict a robotic trajectory directly from an image:

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. Load the merged Vision-Language backbone
processor = AutoProcessor.from_pretrained("AryanNsc/NanoVLA-Flow-V1")
backbone = AutoModelForImageTextToText.from_pretrained(
    "AryanNsc/NanoVLA-Flow-V1",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to(device)

# 2. Process your camera frame and task instruction
image = Image.open("path_to_your_camera_frame.jpg")
instruction = "pick up the black bowl"
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": instruction}]}]

prompt = processor.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(device)

# 3. Extract the spatial hidden states (the "brain")
with torch.no_grad():
    outputs = backbone(**inputs, output_hidden_states=True)

# From here, pass `outputs.hidden_states` into the ActionExpert to compute
# the continuous Flow trajectory.
```
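The final integration step can be sketched with Heun's 2nd-order method, as named in Core Innovations. Here `action_expert` is a stand-in for the repository's 80M-parameter ActionExpert (its real signature may differ); the velocity field is assumed to take the current action state, the flow time `t`, and the backbone hidden states, and the 10-step schedule is an illustrative choice:

```python
import torch

def heun_flow_integrate(action_expert, hidden_states, action_dim=3, steps=10):
    """Integrate the learned velocity field from noise (t=0) to an action (t=1)
    using Heun's 2nd-order (predictor-corrector) method."""
    x = torch.randn(1, action_dim)                 # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1,), i * dt)
        v1 = action_expert(x, t, hidden_states)            # slope at interval start
        x_pred = x + dt * v1                               # Euler predictor step
        v2 = action_expert(x_pred, t + dt, hidden_states)  # slope at interval end
        x = x + dt * 0.5 * (v1 + v2)                       # corrector: average slopes
    return x  # action in [-1, 1]; unscale with the training min-max bounds
```

Heun's method evaluates the velocity field twice per step but halves the local truncation error order relative to plain Euler, which is why few integration steps suffice.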

Resources

  • GitHub Repository: Guney-olu/NanoVLA-Flow