Model Overview

Description

GR00T-N1.5-RL-Rheo-AssembleTrocar is a vision-language-action (VLA) model fine-tuned for surgical instrument handling in the Isaac for Healthcare Rheo workflow. It uses a G1 embodiment to perform a trocar assembly task. A trocar is a surgical instrument consisting of an obturator and a cannula, commonly used as an access port in minimally invasive surgery. The robot retrieves the trocar from a surgical tray on the left, assembles it, and places it on a Mayo stand on the right. This model is ready for commercial/non-commercial use.

License/Terms of Use

Governing Terms: Your usage of the GR00T-N1.5-RL-Rheo-AssembleTrocar model is governed by the NVIDIA License.
You are responsible for ensuring that your use of NVIDIA provided models complies with all applicable laws.

Deployment Geography

Global

Use Case

This model is intended for Rheo simulation workflows focused on surgical instrument assembly (assembling a trocar). It is not intended for real-world clinical deployment.

Release Date

Hugging Face (03/10/2026) via https://huggingface.co/nvidia/GR00T-N1.5-RL-Rheo-AssembleTrocar

Reference(s)

  • NVIDIA Isaac GR00T N1.5
  • Isaac for Healthcare

Model Architecture

Architecture Type: Vision Language Action model
Network Architecture: GR00T N1.5
This model was developed based on GR00T N1.5.
Number of model parameters: 3 billion

Computational Load

Cumulative Compute: 6.8×10^21 FLOPs (hardware-based calculation using a single NVIDIA H100 NVL for training)
Estimated Energy and Emissions for Model Training: 1505.28 kWh, 0.618 tCO2e
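
The reported figures imply a grid emission factor and a compute-per-energy ratio. The sketch below derives both by simple division from the numbers in this card; the constant names are illustrative, and neither derived value is stated by the model authors.

```python
# Figures copied from this model card.
CUMULATIVE_FLOPS = 6.8e21      # training compute
ENERGY_KWH = 1505.28           # estimated training energy
EMISSIONS_TCO2E = 0.618        # estimated training emissions

# Derived quantities (not stated in the card; plain arithmetic on the above).
kg_co2e_per_kwh = EMISSIONS_TCO2E * 1000.0 / ENERGY_KWH
flops_per_kwh = CUMULATIVE_FLOPS / ENERGY_KWH

print(f"Implied emission factor: {kg_co2e_per_kwh:.3f} kgCO2e/kWh")
print(f"Implied compute per unit energy: {flops_per_kwh:.3e} FLOPs/kWh")
```

The implied emission factor is roughly 0.41 kgCO2e per kWh.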

Input(s)

Input Type(s): Vision, State, Language Instruction
Input Format(s):

  • Vision: Red, Green, Blue (RGB) images (uint8)
  • State: Floating point
  • Language Instruction: String

Input Parameters:

  • Vision: Two-Dimensional (2D)
  • State: One-Dimensional (1D)
  • Language Instruction: One-Dimensional (1D)

Other Properties Related to Input:

  • Vision: Three 480x640 uint8 RGB image frames, one from the robot head camera and two from the wrist cameras.
  • State: 1x28 vector.
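
The input specification above can be sketched as a NumPy observation dictionary with a shape check. This is a minimal illustration, not the model's actual interface: the key names (`video.head`, `video.left_wrist`, `video.right_wrist`, `state`, `language_instruction`), the height-by-width reading of 480x640, and the `float32` state dtype are all assumptions that the card does not confirm.

```python
import numpy as np

# Hypothetical camera names; the card only says one head and two wrist cameras.
CAMERAS = ("head", "left_wrist", "right_wrist")


def make_dummy_observation(instruction: str) -> dict:
    """Build a placeholder observation matching the shapes listed in the card."""
    # Assumption: 480x640 means height 480, width 640, with 3 RGB channels last.
    obs = {f"video.{name}": np.zeros((480, 640, 3), dtype=np.uint8) for name in CAMERAS}
    # Assumption: float32; the card only says "floating point".
    obs["state"] = np.zeros((1, 28), dtype=np.float32)
    obs["language_instruction"] = instruction
    return obs


def validate_observation(obs: dict) -> None:
    """Raise ValueError if the observation does not match the documented shapes."""
    images = [v for k, v in obs.items() if k.startswith("video.")]
    if len(images) != 3:
        raise ValueError(f"expected 3 camera frames, got {len(images)}")
    for img in images:
        if img.shape != (480, 640, 3) or img.dtype != np.uint8:
            raise ValueError(f"bad image: shape={img.shape}, dtype={img.dtype}")
    state = obs["state"]
    if state.shape != (1, 28) or not np.issubdtype(state.dtype, np.floating):
        raise ValueError(f"bad state: shape={state.shape}, dtype={state.dtype}")
    if not isinstance(obs["language_instruction"], str):
        raise ValueError("language instruction must be a string")


obs = make_dummy_observation("assemble the trocar")
validate_observation(obs)
```

Checking shapes and dtypes before inference helps catch camera or state mismatches early, since the model is not expected to generalize beyond its trained input layout.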

Output(s)

Output Type(s): Actions
Output Format(s): Continuous-value vectors
Output Parameters: Two-Dimensional (2D), 16x28 tensor
Other Properties Related to Output: Continuous-value vectors correspond to different motor controls on the robot embodiment.
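
A 16x28 output can be read as a chunk of 16 consecutive actions, each a 28-dimensional motor command. The sketch below shows one way to step through such a chunk. That reading is an assumption, based on the 28 columns matching the 1x28 state vector, and `iter_action_chunk` is a hypothetical helper rather than part of any released API.

```python
from typing import Iterator, Optional

import numpy as np


def iter_action_chunk(chunk, n_steps: Optional[int] = None) -> Iterator[np.ndarray]:
    """Yield per-timestep 28-D action vectors from a 16x28 action chunk.

    n_steps limits how many of the 16 actions are executed before the
    caller would query the policy again (all 16 by default).
    """
    chunk = np.asarray(chunk)
    if chunk.shape != (16, 28):
        raise ValueError(f"expected a (16, 28) action chunk, got {chunk.shape}")
    limit = chunk.shape[0] if n_steps is None else n_steps
    for action in chunk[:limit]:
        yield action


# Usage with a placeholder chunk standing in for a model output.
dummy_chunk = np.zeros((16, 28), dtype=np.float32)
executed = [action for action in iter_action_chunk(dummy_chunk, n_steps=8)]
```

Executing only part of a chunk before re-planning is a common trade-off between responsiveness and inference cost; the card does not say how many steps the Rheo workflow executes per chunk.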

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration

Runtime Engine(s): PyTorch 2.8.0

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Blackwell
  • NVIDIA Hopper

Supported Operating System(s):

  • Linux (Ubuntu 22.04/24.04 LTS)

Model Version(s)

GR00T-N1.5-RL-Rheo-AssembleTrocar

Training, Testing, and Evaluation Datasets

Manual teleoperation.

Training Dataset

Total Size: 59 samples
Data Modality:

  • Text
  • Video
  • Action

Text Training Data Size: Less than a Billion Tokens
Video Training Data Size: Less than 10,000 Hours
Non-Audio, Image, Text Training Data Size:

Image/Video Data: RGB video frames from robot head camera (640x480 pixels)
Text Data: 59 language instruction strings by human labelling
Action Data: 59 episodes of robot action trajectories (state observations and action sequences)

Data Collection Method by dataset: Human, Automatic/Sensors
Labeling Method by dataset: Human, Automatic/Sensors

Data Properties:
Quantity: 59 simulation samples
Modalities: Multi-modal data consisting of (i) RGB video frames, (ii) text-based language instructions, (iii) robot state observations
Nature of Content: Data from Isaac Sim simulation environment collected in Isaac Lab; no personal data or copyright-protected content; data represents surgical instrument assembly tasks
Linguistic Characteristics: Language instructions describing surgical instrument assembly

Sensor(s):
Vision sensors: Three RGB cameras (one robot head-mounted, two wrist-mounted) capturing 640x480 pixel images in simulation
Action sensors: Motor sensors on G1 embodiment

Testing Datasets

Data Collection Method by dataset: Not Applicable
Labeling Method by dataset: Not Applicable
Data Properties: The evaluation was performed in simulation using the Isaac for Healthcare Rheo workflow. The testing data consists of dynamically generated episodes of the trocar assembly task.

Evaluation Datasets

Data Collection Method by dataset: Not Applicable
Labeling Method by dataset: Not Applicable
Data Properties: The evaluation was performed in simulation using the Isaac for Healthcare Rheo workflow. The testing data consists of dynamically generated episodes of the trocar assembly task.

Inference

Engine: PyTorch
Test Hardware: NVIDIA RTX 5880 Ada Generation
Inference mode / Latency / Memory: PyTorch 54.2 ± 8.5 ms, 8 GB
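
As a rough sanity check, the latency and memory figures can be related to the parameter count and tensor type given elsewhere in this card. The sketch below assumes exactly 3 billion parameters stored in BF16 (2 bytes each) and one 16-step action chunk per inference call; the resulting rates are upper bounds from arithmetic, not measured values.

```python
# Figures from this card (parameter count is rounded to 3 billion).
N_PARAMS = 3e9
BYTES_PER_PARAM_BF16 = 2       # bfloat16 uses 16 bits per value
MEAN_LATENCY_MS = 54.2         # reported mean PyTorch inference latency
ACTIONS_PER_CALL = 16          # rows in the 16x28 output tensor

# Weight storage in decimal gigabytes; activations and buffers come on top.
weights_gb = N_PARAMS * BYTES_PER_PARAM_BF16 / 1e9

# Inference calls per second if calls run back to back at the mean latency.
calls_per_second = 1000.0 / MEAN_LATENCY_MS

# Upper bound on actions per second if every action in each chunk is executed.
max_actions_per_second = calls_per_second * ACTIONS_PER_CALL

print(f"Weights: ~{weights_gb:.1f} GB")
print(f"Inference rate: ~{calls_per_second:.1f} calls/s")
print(f"Action rate upper bound: ~{max_actions_per_second:.0f} actions/s")
```

About 6 GB of weights is consistent with the reported 8 GB memory footprint, with the remainder plausibly going to activations and runtime overhead.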

Limitations

This model was trained on data from the Isaac for Healthcare Rheo workflow. Therefore, the model will only perform well in that specific operating room environment. This model is not expected to generalize to different robot platforms, environments, or surgical procedures outside of the trained domain.

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

For more detailed information on ethical considerations for this model, please see the Model Card++ Bias, Explainability, Safety & Security, and Privacy Subcards.

Please make sure you have proper rights and permissions for all input image and video content; if image or video includes people, personal health information, or intellectual property, the image or video generated will not blur or maintain proportions of image subjects included.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.

Safetensors: model size 3B params, tensor type BF16