# Nemotron-VLA
A Vision-Language-Action model powered by NVIDIA foundation models for robot manipulation.
## Architecture
| Component | Model | Trainable? |
|---|---|---|
| Vision Encoder | NVIDIA RADIO (ViT-B) | Frozen |
| Language Encoder | NVIDIA Nemotron Nano 9B v2 | Frozen |
| Fusion | Cross-Attention (4 heads) | Trained |
| Action Head | DDPM Diffusion Policy | Trained |
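The trainable fusion stage cross-attends vision tokens to the instruction embedding. A minimal sketch of that pattern, assuming illustrative hidden sizes (the real dimensions come from the checkpoint's `config`, and `CrossAttentionFusion` is a hypothetical stand-in, not the class in `models.py`):

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Sketch: vision tokens (queries) attend to the language embedding.

    Dimensions and names are illustrative, not the checkpoint's actual config.
    """
    def __init__(self, vision_dim=768, text_dim=768, num_heads=4):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, vision_dim)
        self.attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)

    def forward(self, vision_tokens, text_emb):
        # vision_tokens: (B, N, vision_dim); text_emb: (B, text_dim)
        text = self.text_proj(text_emb).unsqueeze(1)  # (B, 1, vision_dim)
        fused, _ = self.attn(query=vision_tokens, key=text, value=text)
        return fused

fusion = CrossAttentionFusion()
out = fusion(torch.randn(2, 16, 768), torch.randn(2, 768))
print(out.shape)  # torch.Size([2, 16, 768])
```

Because both backbones stay frozen, only this fusion block and the diffusion head contribute gradients during training.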
## Quick Start
```python
import torch
from models import NemotronVLA, load_radio_model, load_nemotron_model, extract_nemotron_embedding
from huggingface_hub import hf_hub_download

# Download checkpoint
ckpt_path = hf_hub_download("keivalya/nemotron-vla", "nemotron_vla.pt")
ckpt = torch.load(ckpt_path, map_location="cuda", weights_only=False)

# Build model
model = NemotronVLA(**ckpt["config"]).to("cuda")
model.load_state_dict(ckpt["model_state_dict"])
model.eval()

# Load RADIO for vision
radio_model, _ = load_radio_model(device="cuda")

# Encode instruction with Nemotron
nemotron_model, tokenizer, _ = load_nemotron_model(device="cuda")
text_emb = extract_nemotron_embedding(nemotron_model, tokenizer, "push the object to the goal")
```
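At inference time the DDPM action head denoises a random action conditioned on the fused features. A toy, self-contained version of that reverse-diffusion loop (the schedule values, `ddpm_sample`, and the zero noise predictor are all illustrative stand-ins, not the repository's actual API):

```python
import torch

def ddpm_sample(noise_pred_fn, cond, action_dim=4, steps=50, device="cpu"):
    """Toy DDPM reverse process: iteratively denoise a random action vector.

    noise_pred_fn(x_t, t, cond) -> predicted noise. All names here are
    hypothetical; the real head lives in models.py.
    """
    betas = torch.linspace(1e-4, 0.02, steps, device=device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(1, action_dim, device=device)  # start from pure noise
    for t in reversed(range(steps)):
        eps = noise_pred_fn(x, t, cond)
        # Standard DDPM posterior mean update
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:  # add noise on all but the final step
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x

# Stand-in predictor; the real model conditions on fused vision-language features.
action = ddpm_sample(lambda x, t, c: torch.zeros_like(x), cond=None)
print(action.shape)  # torch.Size([1, 4])
```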
## Training Details
- Environment: MetaWorld (push-v3, door-open-v3, drawer-close-v3, etc.)
- Demonstrations: Expert policy, 30 episodes per task
- Trainable params: ~0.8M (fusion + diffusion head only)
- Training: 80 epochs, AdamW, cosine LR schedule
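The optimizer setup described above might be wired as follows; the two `nn.Linear` modules are placeholders for the fusion and diffusion head, and the hyperparameters (`lr`, `weight_decay`) are assumptions, not the repository's values:

```python
import torch

# Stand-ins for the only trainable modules (fusion + action head);
# the frozen RADIO/Nemotron backbones are excluded from the optimizer.
fusion = torch.nn.Linear(768, 768)
head = torch.nn.Linear(768, 4)
params = list(fusion.parameters()) + list(head.parameters())

optimizer = torch.optim.AdamW(params, lr=1e-4, weight_decay=1e-2)  # assumed values
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=80)

for epoch in range(80):
    # ... run training batches with optimizer.step() per batch ...
    optimizer.step()
    scheduler.step()  # anneal LR once per epoch over the 80-epoch run

print(f"final lr: {scheduler.get_last_lr()[0]:.2e}")
```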
## Files
- `nemotron_vla.pt` – model checkpoint
- `config.json` – architecture config
- `models.py` – model definitions
- `utils.py` – training and evaluation utilities
- `env.py` – MetaWorld environment wrapper
- `collect_multitask.py` – multi-task data collection