# NanoVLA-Flow-V1 (Gemma-4-E2B)
NanoVLA-Flow is an experimental, highly parameter-efficient Vision-Language-Action (VLA) architecture. By combining a 2.6B-parameter google/gemma-4-E2B-it backbone with a continuous Flow Matching action head integrated by an ODE solver, we trained a capable robotic policy entirely on a single consumer-grade 16 GB NVIDIA T4 GPU.
This repository contains the merged weights: the LoRA adapters have been folded into the base model, so no adapter loading is needed at inference time.
## Core Innovations
- Continuous Flow Matching: Instead of discretizing 3D space into language tokens (as in RT-2 or OpenVLA), NanoVLA-Flow predicts continuous 3D velocity vectors and integrates them with an ODE solver (Heun's second-order method); see the sketch after this list.
- Min-Max Action Bounding: Achieves SOTA trajectory precision (0.015 MSE) by strictly scaling target vectors to [-1.0, 1.0].
- Parameter Efficiency: The entire architecture, including the 80M-parameter ActionExpert, was trained via LoRA on a single T4.
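To make the first two bullets concrete, here is a minimal sketch, not the repository's actual implementation: min-max bounding maps raw action targets into [-1.0, 1.0] (and back at inference time), and Heun's second-order method integrates a learned velocity field from noise to action. `velocity_fn`, `action_min`, and `action_max` are hypothetical placeholders for the trained model and dataset statistics.

```python
import torch

# Hypothetical min-max bounding: scale raw actions into [-1, 1] for training
# and invert the mapping at inference. `action_min`/`action_max` are assumed
# per-dimension bounds computed from the training data.
def scale_actions(actions, action_min, action_max):
    return 2.0 * (actions - action_min) / (action_max - action_min) - 1.0

def unscale_actions(scaled, action_min, action_max):
    return (scaled + 1.0) * (action_max - action_min) / 2.0 + action_min

# Heun's second-order method: integrate a learned velocity field v(x, t)
# from t=0 (noise) to t=1 (action) in a fixed number of steps.
def heun_integrate(velocity_fn, x, num_steps=10):
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        v1 = velocity_fn(x, t)                  # slope at the current point
        v2 = velocity_fn(x + dt * v1, t + dt)   # slope at the Euler prediction
        x = x + dt * 0.5 * (v1 + v2)            # Heun corrector: average slopes
    return x
```

Because the targets are bounded, the integrated sample stays near [-1.0, 1.0] and would be mapped back to metric units with `unscale_actions`.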
## Benchmarks
We rigorously tested the model to ensure our robotic fine-tuning did not cause catastrophic forgetting of the VLM's general intelligence.
- A-OKVQA (Augmented Outside Knowledge VQA): 55.11% multiple-choice accuracy
- Flow Trajectory MSE: 0.156
- Trajectory Cosine Similarity: 0.666
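For reference, the two trajectory metrics follow their standard definitions; a minimal sketch, assuming predicted and ground-truth trajectories as `(N, 3)` tensors:

```python
import torch.nn.functional as F

def trajectory_metrics(pred, target):
    """Mean squared error and mean cosine similarity between 3D vectors."""
    mse = F.mse_loss(pred, target).item()
    cos = F.cosine_similarity(pred, target, dim=-1).mean().item()
    return mse, cos
```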
## Inference Snippet
Here is how to load the merged weights and extract, from an image, the backbone features that drive trajectory prediction:
```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. Load the merged Vision-Language backbone
processor = AutoProcessor.from_pretrained("AryanNsc/NanoVLA-Flow-V1")
backbone = AutoModelForImageTextToText.from_pretrained(
    "AryanNsc/NanoVLA-Flow-V1",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to(device)

# 2. Process your camera feed and task language
image = Image.open("path_to_your_camera_frame.jpg")
instruction = "pick up the black bowl"
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": instruction}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(device)

# 3. Extract the spatial hidden states (the "brain")
with torch.no_grad():
    outputs = backbone(**inputs, output_hidden_states=True)

# From here, pass `outputs.hidden_states` into the ActionExpert to compute the
# continuous Flow trajectory.
```
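The ActionExpert itself lives in the GitHub repository rather than in this snippet. Purely as an illustrative sketch (the `ActionExpert` class below is a hypothetical placeholder, not the repository's module), the rollout could condition a small velocity head on the pooled hidden states and integrate with the same Heun steps described above:

```python
import torch.nn as nn

# Hypothetical placeholder head -- NOT the repository's ActionExpert.
# Maps (noisy action, time, pooled vision-language context) to a 3D velocity.
class ActionExpert(nn.Module):
    def __init__(self, hidden_dim, action_dim=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim + action_dim + 1, 256),
            nn.GELU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, action, t, context):
        return self.net(torch.cat([action, t[:, None], context], dim=-1))

hidden = outputs.hidden_states[-1].mean(dim=1).float()  # pooled context (B, H)
expert = ActionExpert(hidden.shape[-1]).to(device)      # load trained weights here

action = torch.randn(hidden.shape[0], 3, device=device)  # start from noise, t=0
num_steps = 10
dt = 1.0 / num_steps
with torch.no_grad():
    for i in range(num_steps):
        t = torch.full((action.shape[0],), i * dt, device=device)
        v1 = expert(action, t, hidden)
        v2 = expert(action + dt * v1, t + dt, hidden)
        action = action + dt * 0.5 * (v1 + v2)  # Heun corrector step
# `action` is still in the bounded [-1, 1] space; rescale with the dataset's
# action min/max to recover metric-space targets.
```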
## Resources
| Resource | Link |
|---|---|
| GitHub Repository | [Guney-olu/NanoVLA-Flow](https://github.com/Guney-olu/NanoVLA-Flow) |