Nemotron-VLA

A Vision-Language-Action model powered by NVIDIA foundation models for robot manipulation.

Architecture

Component          Model                         Trainable?
Vision Encoder     NVIDIA RADIO (ViT-B)          Frozen
Language Encoder   NVIDIA Nemotron Nano 9B v2    Frozen
Fusion             Cross-Attention (4 heads)     Trained
Action Head        DDPM Diffusion Policy         Trained
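The fusion stage can be pictured as vision tokens cross-attending to the instruction embedding. Below is a minimal sketch, not the actual implementation in models.py: the embedding dimension (256), token count (196), and residual/LayerNorm placement are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Sketch: fuse frozen vision tokens with a frozen language
    embedding via 4-head cross-attention (dims are assumptions)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vision_tokens, text_emb):
        # Vision tokens (queries) attend to the instruction embedding (keys/values)
        fused, _ = self.attn(query=vision_tokens, key=text_emb, value=text_emb)
        return self.norm(vision_tokens + fused)

fusion = FusionBlock()
v = torch.randn(1, 196, 256)   # e.g. 14x14 grid of RADIO patch tokens
t = torch.randn(1, 1, 256)     # pooled Nemotron instruction embedding
out = fusion(v, t)
print(out.shape)  # torch.Size([1, 196, 256])
```

Only this block and the diffusion head carry gradients; the RADIO and Nemotron encoders stay frozen.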

Quick Start

import torch
from models import NemotronVLA, load_radio_model, load_nemotron_model, extract_nemotron_embedding
from huggingface_hub import hf_hub_download

# Download checkpoint
ckpt_path = hf_hub_download("keivalya/nemotron-vla", "nemotron_vla.pt")
ckpt = torch.load(ckpt_path, map_location="cuda", weights_only=False)

# Build model
model = NemotronVLA(**ckpt["config"]).to("cuda")
model.load_state_dict(ckpt["model_state_dict"])
model.eval()

# Load RADIO for vision
radio_model, _ = load_radio_model(device="cuda")

# Encode instruction with Nemotron
nemotron_model, tokenizer, _ = load_nemotron_model(device="cuda")
text_emb = extract_nemotron_embedding(nemotron_model, tokenizer, "push the object to the goal")
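At inference time the DDPM action head denoises an action sequence from Gaussian noise, conditioned on the fused features. A minimal sketch of the reverse process follows; the denoiser signature, noise schedule, horizon, and action dimension are assumptions, not the API of the checkpoint's head.

```python
import torch

@torch.no_grad()
def ddpm_sample(denoiser, cond, action_dim=4, horizon=8, steps=50, device="cpu"):
    """Sketch of DDPM reverse sampling for a diffusion policy head.
    `denoiser(x, t, cond)` predicts the noise added to x (assumed API)."""
    betas = torch.linspace(1e-4, 0.02, steps, device=device)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(1, horizon, action_dim, device=device)  # start from pure noise
    for t in reversed(range(steps)):
        eps = denoiser(x, torch.tensor([t], device=device), cond)
        # Posterior mean of x_{t-1} given the predicted noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = mean + torch.sqrt(betas[t]) * torch.randn_like(x)
        else:
            x = mean  # final step is noise-free
    return x

# Usage with a stand-in denoiser (the real one conditions on fused features):
dummy_denoiser = lambda x, t, cond: torch.zeros_like(x)
actions = ddpm_sample(dummy_denoiser, cond=None)
print(actions.shape)  # torch.Size([1, 8, 4])
```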

Training Details

  • Environment: MetaWorld (push-v3, door-open-v3, drawer-close-v3, etc.)
  • Demonstrations: Expert policy, 30 episodes per task
  • Trainable params: ~0.8M (fusion + diffusion head only)
  • Training: 80 epochs, AdamW, cosine LR schedule
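The optimizer setup above (AdamW over only the ~0.8M trainable parameters, cosine LR decay over 80 epochs) can be sketched as follows; the learning rate and weight decay values are illustrative assumptions, not taken from the repo.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# Stand-in for the trainable fusion + diffusion head parameters
model = torch.nn.Linear(16, 4)
trainable = [p for p in model.parameters() if p.requires_grad]

opt = AdamW(trainable, lr=1e-4, weight_decay=1e-2)  # assumed hyperparameters
sched = CosineAnnealingLR(opt, T_max=80)            # cosine decay over 80 epochs

for epoch in range(80):
    # ... per-epoch training on the diffusion (noise-prediction) loss ...
    sched.step()
```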

Files

  • nemotron_vla.pt - model checkpoint
  • config.json - architecture config
  • models.py - model definitions
  • utils.py - training and evaluation utilities
  • env.py - MetaWorld environment wrapper
  • collect_multitask.py - multi-task data collection