# Model Overview

### Description

GR00T-N1.6-Rheo-PickNPlace is a vision language action (VLA) model fine-tuned for surgical instrument preparation and handling in the Isaac for Healthcare Rheo workflow. It performs pick-and-place of a sterilized box from a shelf to a cart using a G1 embodiment. This model is ready for commercial/non-commercial use.

### License/Terms of Use

**Governing Terms:** Your use of the GR00T-N1.6-Rheo-PickNPlaceTray model is governed by the [NVIDIA License](https://developer.download.nvidia.cn/licenses/NVIDIA-OneWay-Noncommercial-License-22Mar2022.pdf?t=eyJscyI6ImdzZW8iLCJsc2QiOiJodHRwczovL3d3dy5nb29nbGUuY29tLyIsIm5jaWQiOiJzby15b3V0LTg3MTcwMS12dDQ4In0=).
You are responsible for ensuring that your use of NVIDIA provided models complies with all applicable laws.

### Deployment Geography

Global

### Use Case

This model is intended for Rheo simulation workflows focused on surgical instrument handling (sterilized box pick-and-place from shelf to cart). It is not intended for real-world clinical deployment.

### Release Date

Hugging Face (03/10/2026) via https://huggingface.co/nvidia/GR00T-N1.6-Rheo-PickNPlaceTray/tree/main

## Reference(s)

- [NVIDIA Isaac-GR00T N1.6](https://github.com/NVIDIA/Isaac-GR00T)
- [Isaac for Healthcare](https://github.com/isaac-for-healthcare)

## Model Architecture

**Architecture Type:** Vision Language Action (VLA) model

**Network Architecture:** GR00T N1.6

**This model was developed based on:** GR00T N1.6

**Number of model parameters:** 3 billion

## Computational Load

**Cumulative Compute:** 2.45×10^19 FLOPs (hardware-based calculation using a single NVIDIA H100 NVL for training)

**Estimated Energy and Emissions for Model Training:** 5.37 kWh, 0.00217 tCO₂e

## Input(s)

**Input Type(s):** Vision, State, Language Instruction

**Input Format(s):**
- Vision: RGB images (uint8)
- State: Floating point
- Language Instruction: String

**Input Parameters:**
- Vision: Two-Dimensional (2D)
- State: One-Dimensional (1D)
- Language Instruction: One-Dimensional (1D)

**Other Properties Related to Input:**
- Vision: Raw 480x640 uint8 RGB frames from the robot head camera; training preprocessing resizes to shortest_edge=256 with crop_fraction=0.95 (albumentations).
- State: 1x31 vector.

## Output(s)

**Output Type(s):** Actions

**Output Format(s):** Continuous-value vectors

**Output Parameters:** Two-Dimensional (2D), 16x32 tensor

**Other Properties Related to Output:** Continuous-value vectors correspond to different motor controls on the robot embodiment.

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems.
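As a sanity check on the preprocessing geometry described for the vision input, the short sketch below works out the frame dimensions after a shortest-edge resize to 256 and a 0.95 crop fraction. The helper names are illustrative stand-ins, not the actual GR00T preprocessing code, which uses albumentations.

```python
# Illustrative geometry of the documented preprocessing (shortest_edge=256,
# crop_fraction=0.95). These helpers are hypothetical, not the real pipeline.

def resize_shortest_edge(h: int, w: int, shortest_edge: int = 256) -> tuple[int, int]:
    """Scale (h, w) so the shorter side equals `shortest_edge`, keeping aspect ratio."""
    scale = shortest_edge / min(h, w)
    return round(h * scale), round(w * scale)

def center_crop_fraction(h: int, w: int, fraction: float = 0.95) -> tuple[int, int]:
    """Side lengths after keeping the central `fraction` of each dimension."""
    return int(h * fraction), int(w * fraction)

h, w = resize_shortest_edge(480, 640)   # raw head-camera frame is 480x640
ch, cw = center_crop_fraction(h, w)
print((h, w), (ch, cw))                 # → (256, 341) (243, 323)
```

So a raw 480x640 frame becomes roughly 256x341 after the resize and about 243x323 after the crop, under these assumptions about how the two parameters compose.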
By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

## Software Integration

**Runtime Engine(s):** PyTorch 2.8.0

**Supported Hardware Microarchitecture Compatibility:**
- NVIDIA Ampere
- NVIDIA Blackwell
- NVIDIA Hopper

**Preferred/Supported Operating System(s):**
- Linux (Ubuntu 22.04/24.04 LTS)

## Model Version(s)

GR00T-N1.6-Rheo-PickNPlace

## Training, Testing, and Evaluation Datasets

Data was collected via manual teleoperation and IsaacLab Mimic generation.

### Training Dataset

**Total Size:** 120 samples

**Text Training Data Size:** Less than a Billion Tokens

**Video Training Data Size:** Less than 10,000 Hours

**Non-Audio, Image, Text Training Data Size:**
- Image/Video Data: RGB video frames from the robot head camera (640x480 pixels)
- Text Data: 120 language instruction strings from human labeling
- Action Data: 120 episodes of robot action trajectories (state observations and action sequences)

**Data Modality:**
- Text
- Video
- Action

**Data Collection Method by dataset:** Automatic/Sensors

**Labeling Method by dataset:** Human

**Data Properties:**
- Quantity: 120 simulation samples
- Modalities: Multi-modal data consisting of (i) RGB video frames, (ii) text-based language instructions, and (iii) robot state observations
- Nature of Content: Data from the Isaac Sim simulation environment collected with Isaac Lab Mimic; no personal data or copyright-protected content; data represents surgical instrument manipulation tasks
- Linguistic Characteristics: Language instructions describing surgical instrument preparation

**Sensor(s):**
- Vision sensors: RGB cameras (robot head-mounted) capturing 640x480 pixel images in simulation
- Action sensors: Motor sensors on the G1 embodiment

### Testing Datasets

**Data Collection Method by dataset:** Not Applicable

**Labeling Method by dataset:** Not Applicable

**Data Properties:** Testing was performed in simulation using the Isaac for Healthcare Rheo workflow. The testing data consists of dynamically generated episodes of the pick-and-place task.

### Evaluation Datasets

**Data Collection Method by dataset:** Not Applicable

**Labeling Method by dataset:** Not Applicable

**Data Properties:** The evaluation was performed in simulation using the Isaac for Healthcare Rheo workflow. The evaluation data consists of dynamically generated episodes of the pick-and-place task.

## Inference

**Engine:** PyTorch

**Test Hardware:** NVIDIA RTX 5880 Ada Generation

**Inference Mode / Latency / Memory:** PyTorch, 92.4 ± 1.3 ms, 8 GB

## Limitations

This model was trained on data from the Isaac for Healthcare Rheo workflow and is therefore expected to perform well only in that specific operating room environment. It is not expected to generalize to different robot platforms, environments, or surgical procedures outside the trained domain.

## Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

For more detailed information on ethical considerations for this model, please see the Model Card++ Bias, Explainability, Safety & Security, and Privacy Subcards.

Please make sure you have proper rights and permissions for all input image and video content; if an image or video includes people, personal health information, or intellectual property, the generated output will not blur or maintain the proportions of the image subjects included.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://app.intigriti.com/programs/nvidia/nvidiavdp/detail).