Model Overview

Description:

G1 Locomanipulation Finetune fine-tunes the GR00T N1.5 Vision-Language-Action (VLA) foundation model to perform loco-manipulation on the Unitree G1 humanoid robot: it jointly controls bimanual manipulation and whole-body locomotion to pick up an object at one location, navigate to a second location, and place it there. This checkpoint is provided as an example artifact for the Isaac Lab locomanipulation SDG pipeline. Users who have already run the data generation and GR00T N1.5 training steps can reproduce it; it is offered here so those steps can be skipped when working through the pipeline examples. This model is for demonstration purposes and not for production use.

License/Terms of Use:

Governing Terms: Use of this model is governed by the NVIDIA License.

Deployment Geography:

Global

Use Case:

This model is an example artifact for users working through the Isaac Lab locomanipulation SDG pipeline. It demonstrates the output of running the locomanipulation data generation pipeline followed by GR00T N1.5 finetuning. It is provided so that users can skip those steps and proceed directly to running rollout/evaluation examples. It is not intended for deployment on physical robots or production systems.
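
For example, the checkpoint can be fetched from the Hugging Face Hub before running the rollout examples. A minimal sketch using huggingface_hub (snapshot_download is the standard Hub API; the local directory is illustrative):

```python
# Minimal sketch: download this demonstration checkpoint from the Hub
# before running the Isaac Lab rollout/evaluation examples.
from huggingface_hub import snapshot_download

ckpt_dir = snapshot_download(
    repo_id="nvidia/g1_locomanip_finetune",         # this model card's repo
    local_dir="checkpoints/g1_locomanip_finetune",  # illustrative path
)
print(f"Checkpoint downloaded to: {ckpt_dir}")
```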

Release Date:

Hugging Face 05/28/2026 via https://huggingface.co/nvidia/g1_locomanip_finetune

Reference(s):

  • GR00T N1.5 base model: https://huggingface.co/nvidia/GR00T-N1.5-3B

Model Architecture:

Architecture Type: Transformer (Vision-Language-Action)

Network Architecture: GR00T N1.5

This model was developed based on nvidia/GR00T-N1.5-3B.

Number of model parameters: ~3 × 10^9 (3B)

Input(s):

Input Type(s): Video, Tabular

Input Format(s):

  • Video: Red, Green, Blue (RGB), HDF5-encoded frames
  • Tabular: Float32 arrays

Input Parameters:

  • Video: Three-Dimensional (3D), T × H × W
  • Tabular: One-Dimensional (1D) vectors

Other Properties Related to Input:

| Modality                         | Description                              | Shape                  |
|----------------------------------|------------------------------------------|------------------------|
| video.ego_view                   | RGB from torso-mounted Intel D435 camera | 224×224 (preprocessed) |
| state.left_hand_pose             | Left wrist pose (xyz + xyzw quat)        | 7D                     |
| state.right_hand_pose            | Right wrist pose (xyz + xyzw quat)       | 7D                     |
| state.left_hand_joint_positions  | Left finger joint angles                 | 7D                     |
| state.right_hand_joint_positions | Right finger joint angles                | 7D                     |
| state.object_pose                | Manipulated object pose                  | 7D                     |
| state.goal_pose                  | Target placement pose                    | 7D                     |
| state.end_fixture_pose           | Drop-off table pose                      | 7D                     |

Video preprocessing: 95% center crop, resize to 224×224, color jitter. State/action normalization: min-max.
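
A minimal sketch of this preprocessing, assuming torchvision-style transforms and 640×480 source frames; the jitter strengths and the [-1, 1] normalization range are placeholders, not the pipeline's documented settings:

```python
# Illustrative preprocessing sketch; the actual GR00T N1.5 data pipeline
# may differ. Assumes 640x480 RGB source frames from the simulated D435.
import torch
import torchvision.transforms.v2 as T

video_transform = T.Compose([
    T.CenterCrop(size=(int(480 * 0.95), int(640 * 0.95))),  # keep central 95%
    T.Resize(size=(224, 224)),                              # match model input
    T.ColorJitter(brightness=0.3, contrast=0.4,
                  saturation=0.5, hue=0.08),                # placeholder strengths
])

def minmax_normalize(x: torch.Tensor,
                     lo: torch.Tensor,
                     hi: torch.Tensor) -> torch.Tensor:
    """Min-max normalize a state/action vector using dataset statistics.

    Mapping to [-1, 1] is an assumption; the card only states "min-max".
    """
    return 2.0 * (x - lo) / (hi - lo) - 1.0
```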

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Output(s):

Output Type(s): Tabular

Output Format(s):

  • Tabular: Float32 arrays

Output Parameters:

  • Tabular: One-Dimensional (1D) vectors

Other Properties Related to Output:

| Key                               | Description                                 | Dim |
|-----------------------------------|---------------------------------------------|-----|
| action.left_hand_pose             | Left end-effector target (xyz + xyzw quat)  | 7D  |
| action.right_hand_pose            | Right end-effector target (xyz + xyzw quat) | 7D  |
| action.left_hand_joint_positions  | Left finger joint targets                   | 7D  |
| action.right_hand_joint_positions | Right finger joint targets                  | 7D  |
| action.base_velocity              | Base navigation command (vx, vy, yaw_rate)  | 3D  |
| action.base_height                | Base height target                          | 1D  |

Total action dimension: 32D (28D manipulation + 4D locomotion)
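
As a sketch of how the output dict relates to the 32D vector above, the snippet below concatenates the six action keys; the key names come from the table, while the concatenation order is an assumption for illustration:

```python
import numpy as np

# Hypothetical flattening of the model's output dict into the 32D action
# vector; the ordering is illustrative, not the documented layout.
MANIP_KEYS = [
    "action.left_hand_pose",              # 7D wrist pose target
    "action.right_hand_pose",             # 7D wrist pose target
    "action.left_hand_joint_positions",   # 7D finger targets
    "action.right_hand_joint_positions",  # 7D finger targets
]
LOCO_KEYS = [
    "action.base_velocity",  # 3D: (vx, vy, yaw_rate)
    "action.base_height",    # 1D height target
]

def flatten_action(out: dict) -> np.ndarray:
    manip = np.concatenate([out[k] for k in MANIP_KEYS])  # 28D manipulation
    loco = np.concatenate([out[k] for k in LOCO_KEYS])    # 4D locomotion
    action = np.concatenate([manip, loco])                # 32D total
    assert action.shape == (32,)
    return action
```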

Software Integration:

Runtime Engine(s):

  • Not Applicable (N/A): Isaac Lab deployment stack

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere or newer

Supported Operating System(s):

  • Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

Model Version(s):

v1.0: Initial fine-tuned checkpoint of GR00T N1.5 for G1 loco-manipulation, trained for 10,000 steps on synthetic Isaac Lab SDG demonstrations. Data config class: G1LocomanipulationSDGDataConfig.
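
For orientation, the settings stated above can be collected into a configuration sketch; the dictionary below is purely illustrative and does not correspond to an actual Isaac Lab or GR00T configuration schema:

```python
# Hypothetical summary of the fine-tuning setup described above; the key
# names are illustrative, not a real GR00T/Isaac Lab config format.
finetune_config = {
    "base_model": "nvidia/GR00T-N1.5-3B",              # base checkpoint
    "data_config": "G1LocomanipulationSDGDataConfig",  # data config class above
    "max_steps": 10_000,                               # training steps
    "dataset": "Isaac Lab SDG loco-manipulation demonstrations",
    "state_action_normalization": "min-max",
}
```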

Training and Evaluation Datasets:

Training Dataset:

Data Modality:

  • Video (RGB ego-view)
  • Other: Proprioceptive state (end-effector poses, finger joint positions, and object/goal/fixture poses)

Training Data Size:

  • Video Training Data Size: Less than 10,000 Hours
  • Non-Audio, Image, Text Training Data Size: 100K–1M timesteps; ~478 MB compressed

Data Collection Method by dataset:

  • Synthetic: generated via the Isaac Lab SDG pipeline

Labeling Method by dataset:

  • Automatic/Sensors: action labels derived from simulation state at each timestep (200 Hz)

Properties: Synthetic robot demonstration data. Modalities: RGB video (3D, T×H×W) and proprioceptive numerical state (1D vectors). Content: machine-generated simulation data with no personal data or copyrighted content. Sensor type: simulated Intel D435 RGB camera and simulated joint encoders.
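
As an illustration of the automatic labeling, the sketch below pairs each logged simulation state with the following timestep as its action label, assuming 200 Hz logging; the exact pairing used by the SDG pipeline is not specified on this card:

```python
import numpy as np

# Minimal sketch of automatic action labeling from a logged trajectory.
# Treating the next recorded state as the action label is an assumption
# for illustration, not the pipeline's documented scheme.
SIM_HZ = 200  # simulation/logging rate stated above

def label_actions(states: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Pair each state s_t with s_{t+1} as its action label."""
    obs = states[:-1]     # (T-1, D) observations
    actions = states[1:]  # (T-1, D) action labels from the next timestep
    return obs, actions

# Example: a 2-second trajectory of 32D vectors at 200 Hz.
traj = np.zeros((2 * SIM_HZ, 32), dtype=np.float32)
obs, act = label_actions(traj)
assert obs.shape == (399, 32) and act.shape == (399, 32)
```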

Evaluation Dataset:

Benchmark Score: Not Applicable; no quantitative benchmarks have been established for this demonstration checkpoint.

Data Collection Method by dataset:

  • Synthetic: Isaac Lab simulation

Labeling Method by dataset:

  • Automatic/Sensors

Properties: Evaluation was performed qualitatively in the Isaac Lab simulation environment using the same synthetic SDG pipeline; no quantitative benchmark scores have been established.

Inference:

Acceleration Engine: Not Applicable (N/A)
Test Hardware:

  • NVIDIA GPU (Ampere or newer recommended)

Key Considerations:

This model is provided solely as a demonstration artifact for the Isaac Lab locomanipulation SDG pipeline examples and is not intended for deployment on physical robots or production systems. It was trained entirely on synthetic simulation data and has not been validated for real-world use.

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report model quality, risk, security vulnerabilities, or NVIDIA AI Concerns here.

Citation

If you use this model, please cite GR00T N1.5:

@misc{nvidia2025groot,
  title = {GR00T N1: An Open Foundation Model for Generalist Humanoid Robots},
  author = {NVIDIA},
  year = {2025},
  url = {https://huggingface.co/nvidia/GR00T-N1.5-3B},
}