Nemotron-VLA

A Vision-Language-Action model powered by NVIDIA foundation models for robot manipulation.

Architecture

Component          Model                         Trainable?
Vision Encoder     NVIDIA RADIO (ViT-B)          Frozen
Language Encoder   NVIDIA Nemotron Nano 9B v2    Frozen
Fusion             Cross-Attention (4 heads)     Trained
Action Head        DDPM Diffusion Policy         Trained
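The fusion stage can be pictured as vision tokens cross-attending to the instruction embedding. Below is a minimal sketch, not the actual implementation in models.py: the embedding dimension (256), token count (196), and residual/LayerNorm placement are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Sketch: fuse frozen vision tokens with a frozen language
    embedding via 4-head cross-attention (dims are assumptions)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vision_tokens, text_emb):
        # Vision tokens (queries) attend to the instruction embedding (keys/values)
        fused, _ = self.attn(query=vision_tokens, key=text_emb, value=text_emb)
        return self.norm(vision_tokens + fused)

fusion = FusionBlock()
v = torch.randn(1, 196, 256)   # e.g. 14x14 grid of RADIO patch tokens
t = torch.randn(1, 1, 256)     # pooled Nemotron instruction embedding
out = fusion(v, t)
print(out.shape)  # torch.Size([1, 196, 256])
```

Only this block and the diffusion head carry gradients; the RADIO and Nemotron encoders stay frozen.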

Quick Start

import torch
from models import NemotronVLA, load_radio_model, load_nemotron_model, extract_nemotron_embedding
from huggingface_hub import hf_hub_download

# Download checkpoint
ckpt_path = hf_hub_download("keivalya/nemotron-vla", "nemotron_vla.pt")
ckpt = torch.load(ckpt_path, map_location="cuda", weights_only=False)

# Build model
model = NemotronVLA(**ckpt["config"]).to("cuda")
model.load_state_dict(ckpt["model_state_dict"])
model.eval()

# Load RADIO for vision
radio_model, _ = load_radio_model(device="cuda")

# Encode instruction with Nemotron
nemotron_model, tokenizer, _ = load_nemotron_model(device="cuda")
text_emb = extract_nemotron_embedding(nemotron_model, tokenizer, "push the object to the goal")
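At inference time the DDPM action head denoises an action sequence from Gaussian noise, conditioned on the fused features. A minimal sketch of the reverse process follows; the denoiser signature, noise schedule, horizon, and action dimension are assumptions, not the API of the checkpoint's head.

```python
import torch

@torch.no_grad()
def ddpm_sample(denoiser, cond, action_dim=4, horizon=8, steps=50, device="cpu"):
    """Sketch of DDPM reverse sampling for a diffusion policy head.
    `denoiser(x, t, cond)` predicts the noise added to x (assumed API)."""
    betas = torch.linspace(1e-4, 0.02, steps, device=device)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(1, horizon, action_dim, device=device)  # start from pure noise
    for t in reversed(range(steps)):
        eps = denoiser(x, torch.tensor([t], device=device), cond)
        # Posterior mean of x_{t-1} given the predicted noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = mean + torch.sqrt(betas[t]) * torch.randn_like(x)
        else:
            x = mean  # final step is noise-free
    return x

# Usage with a stand-in denoiser (the real one conditions on fused features):
dummy_denoiser = lambda x, t, cond: torch.zeros_like(x)
actions = ddpm_sample(dummy_denoiser, cond=None)
print(actions.shape)  # torch.Size([1, 8, 4])
```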

Training Details

  • Environment: MetaWorld (push-v3, door-open-v3, drawer-close-v3, etc.)
  • Demonstrations: Expert policy, 30 episodes per task
  • Trainable params: ~0.8M (fusion + diffusion head only)
  • Training: 80 epochs, AdamW, cosine LR schedule
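The optimizer setup above (AdamW over only the ~0.8M trainable parameters, cosine LR decay over 80 epochs) can be sketched as follows; the learning rate and weight decay values are illustrative assumptions, not taken from the repo.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# Stand-in for the trainable fusion + diffusion head parameters
model = torch.nn.Linear(16, 4)
trainable = [p for p in model.parameters() if p.requires_grad]

opt = AdamW(trainable, lr=1e-4, weight_decay=1e-2)  # assumed hyperparameters
sched = CosineAnnealingLR(opt, T_max=80)            # cosine decay over 80 epochs

for epoch in range(80):
    # ... per-epoch training on the diffusion (noise-prediction) loss ...
    sched.step()
```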

Files

  • nemotron_vla.pt - model checkpoint
  • config.json - architecture config
  • models.py - model definitions
  • utils.py - training and evaluation utilities
  • env.py - MetaWorld environment wrapper
  • collect_multitask.py - multi-task data collection