Cosmos-H-Surgical-Simulator

Description

Cosmos-H-Surgical-Simulator is a surgical world foundation model fine-tuned on the Open-H Embodiment dataset, which includes clinical surgical procedures, for evaluating physically grounded surgical robotics policies in simulation and for synthetic data generation. The model supports evaluating surgical robotics policies in simulation before transitioning to a physical system, primarily for CMR Surgical Versius clinical procedures (cholecystectomy, prostatectomy, inguinal hernia, and hysterectomy), as well as for the dVRK, MITIC, and other surgical platforms on tasks such as suturing, tissue manipulation, and peg transfer.

The released model is based on the public NVIDIA Cosmos-predict2.5 world foundation model for physical AI.

This model is for commercial/non-commercial use.

License/Terms of Use

Use of this model is governed by the NVIDIA Open Model License Agreement.

Deployment Geography

Global

Use Case

Primarily intended for surgical robotics researchers, healthcare AI developers, academic institutions, and surgical robotics companies exploring surgical robotics policy evaluation and synthetic data generation.

Release Date

Reference(s)

Ali, A., Bai, J., Bala, M., Balaji, Y., Blakeman, A., Cai, T., Cao, J., Cao, T., Cha, E., Chao, Y.-W., Chattopadhyay, P., Chen, M., Chen, Y., Cheng, S., Cui, Y., Diamond, J., Ding, Y., Fan, J., Fan, L., Feng, L., Ferroni, F., Fidler, S., Fu, X., Gao, R., Ge, Y., Gu, J., … Zhu, Y. (2025). World Simulation with Video Foundation Models for Physical AI (arXiv:2511.00062) [Preprint]. arXiv. https://doi.org/10.48550/arXiv.2511.00062

Link to Cosmos’ nvidia/Cosmos-Predict2.5-2B-Video2World Model Card

Model Architecture

Architecture Type: Diffusion Transformer

Network Architecture: Latent video diffusion transformer (DiT-style denoiser) with cross-attention conditioning.

This model was developed based on Cosmos-Predict2.5-2B-Video2World.

The Cosmos-H-Surgical-Simulator model extends Cosmos-Predict2.5-2B-Video2World, a diffusion transformer for video generation in latent space. It incorporates an MLP to condition the model on kinematic actions. The model accepts a 44-dimensional action vector (22 dimensions per arm) alongside the current video frame, and predicts the subsequent 12 frames. Through autoregressive rollout, it can generate videos of complete surgical trajectories from either learned policies or manually designed action sequences.
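The chunked, autoregressive rollout described above can be sketched as follows. This is a shape-level illustration only: `predict_next_frames` is a hypothetical placeholder standing in for the actual latent diffusion denoiser, which is not reproduced here.

```python
import numpy as np

FRAME_SHAPE = (288, 512, 3)   # model's native 512x288 RGB resolution
ACTION_DIM = 44               # 22 dimensions per arm, two arms
CHUNK = 12                    # frames predicted per forward pass

def predict_next_frames(frame: np.ndarray, actions: np.ndarray) -> np.ndarray:
    """Placeholder for the denoiser: one frame + 12 actions -> 12 frames."""
    assert frame.shape == FRAME_SHAPE and actions.shape == (CHUNK, ACTION_DIM)
    # The real model runs latent diffusion conditioned on the action MLP embedding.
    return np.repeat(frame[None], CHUNK, axis=0)  # dummy output with correct shape

def rollout(initial_frame: np.ndarray, trajectory: np.ndarray) -> np.ndarray:
    """Autoregressive rollout: the last generated frame seeds the next chunk."""
    assert trajectory.shape[0] % CHUNK == 0, "trajectory length must be a multiple of 12"
    frames, current = [], initial_frame
    for start in range(0, trajectory.shape[0], CHUNK):
        chunk = predict_next_frames(current, trajectory[start:start + CHUNK])
        frames.append(chunk)
        current = chunk[-1]
    return np.concatenate(frames, axis=0)
```

With a 72-step action trajectory this produces 6 chunks of 12 frames, matching the 72-frame generation used in the quantitative evaluation below.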

Input

  • Input Type(s): Image (camera frame), a sequence of twelve 44-dimensional numerical vectors
  • Input Format(s): Red, Green, Blue (RGB) frame; numeric vector
  • Input Parameters: Image: Two-Dimensional (2D) image frame; Vector: sequence of twelve 44-dimensional (44D) vectors
  • Other Properties Related to Input:
    • Recommended resolution: The model operates at 512x288 resolution. Input frames are automatically resized; a 16:9 aspect ratio is recommended.
    • Pre-processing: The Versius kinematic action is a hybrid relative action as defined here.
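As a shape-level illustration of the image input (assumed, not taken from the released code), the sketch below resizes an RGB frame to the model's 512x288 operating resolution with nearest-neighbor sampling. The released pipeline resizes frames automatically; this only shows the shapes involved.

```python
import numpy as np

def resize_nearest(frame: np.ndarray, out_h: int = 288, out_w: int = 512) -> np.ndarray:
    """Nearest-neighbor resize of an HxWx3 RGB frame to the model resolution."""
    in_h, in_w = frame.shape[:2]
    rows = np.arange(out_h) * in_h // out_h   # source row index per output row
    cols = np.arange(out_w) * in_w // out_w   # source column index per output column
    return frame[rows[:, None], cols[None, :]]

# A 16:9 source (e.g. 1920x1080) maps onto 512x288 without distortion.
resized = resize_nearest(np.zeros((1080, 1920, 3), dtype=np.uint8))
```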

Output

  • Output Type(s): A sequence of 12 video frames
  • Output Format: mp4
  • Output Parameters: Three-Dimensional (3D)
  • Other Properties Related to Output:
    • By default, the model generates twelve video frames as output, representing the next world states incorporating the input kinematic actions into the current video frame. Through an autoregressive loop, a complete video sequence, representing a complete surgical trajectory, can be generated.
    • No additional post-processing is strictly required.

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration

Runtime Engine(s):

Cosmos-Predict2.5

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Blackwell
  • NVIDIA Hopper

Note: Only BF16 precision is tested. Other precisions like FP16 or FP32 are not officially supported.

Preferred/Supported Operating System(s): Linux (We have not tested on other operating systems.)

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

Model Version(s)

v1.0 (Finetuned on the Open-H Embodiment dataset, which includes clinical procedures data such as cholecystectomy, prostatectomy, inguinal hernia, and hysterectomy)

Developers may integrate the model into an AI evaluation system by providing video frames as input along with corresponding kinematic actions to evaluate a surgical policy model, such as one for the CMR Surgical Versius robotic system.
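Such a closed-loop evaluation setup, where a policy acts on the simulator's output instead of a physical robot, could look roughly like the following sketch. Both `policy` and `world_model` are hypothetical placeholders for the actual policy model (e.g. a Gr00t-H-style model) and for Cosmos-H-Surgical-Simulator.

```python
import numpy as np

ACTION_DIM, CHUNK = 44, 12

def policy(frame: np.ndarray) -> np.ndarray:
    """Placeholder surgical policy: current frame -> next 12 actions."""
    return np.zeros((CHUNK, ACTION_DIM), dtype=np.float32)

def world_model(frame: np.ndarray, actions: np.ndarray) -> np.ndarray:
    """Placeholder for Cosmos-H-Surgical-Simulator: frame + actions -> 12 frames."""
    return np.repeat(frame[None], CHUNK, axis=0)

def evaluate(initial_frame: np.ndarray, n_chunks: int = 6):
    """Closed loop: the policy acts on the simulator's output each step."""
    frame, video, actions = initial_frame, [], []
    for _ in range(n_chunks):
        a = policy(frame)
        chunk = world_model(frame, a)
        video.append(chunk)
        actions.append(a)
        frame = chunk[-1]   # feed the last generated frame back to the policy
    return np.concatenate(video), np.concatenate(actions)
```

The generated video and action log can then be scored against ground truth or inspected by domain experts before any deployment on a physical system.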

Training, Testing, and Evaluation Datasets

Dataset: Open-H Embodiment community-generated dataset.

  • Total size: ~26,500+ surgical task demonstrations (plus CMR Versius clinical procedures) spanning 32 datasets, 9 robot embodiments, and 10+ institutions, totaling ~4.9 million synchronized video-kinematics frames.
  • Dataset partition: 95% training and validation, 5% test (approximate).

Training Dataset

  • Link: Open-H Embodiment dataset
  • Data Collection Method: Human (kinematic action trajectories).
  • Labeling Method: Human
  • Properties:
    • ~26,500+ surgical task demonstrations (plus CMR Versius clinical procedures) spanning 32 datasets, 9 robot embodiments, and 10+ institutions, totaling ~4.9 million synchronized video-kinematics frames.
    • Human experts teleoperated the surgical robotic arms, which were recorded as video and kinematics pairs.

Testing Dataset

  • Link: We tested on the holdout portion of the Open-H Embodiment dataset.
  • Data Collection Method: Human
  • Labeling Method: Human
  • Properties:
    • ~1K frames for testing.

Evaluation Dataset

  • Link: We evaluated on the Open-H Embodiment dataset, using frames never seen during training.
  • Data Collection Method: Human
  • Labeling Method: Human
  • Properties:
    • ~1K frames for final evaluation.

Key performance: The model can generate physically accurate surgical robotic videos given an initial frame and a kinematic action trajectory, either recorded from a surgeon or a surgical robotic policy model such as Gr00t-H.

Quantitative Evaluation

Evaluated on 4 CMR Versius clinical surgery procedures (prostatectomy, inguinal hernia, hysterectomy, cholecystectomy) at 360p resolution, 2 episodes per procedure, 2 seeds each, 72-frame autoregressive generation (6 chunks × 12 frames).

Aggregate Metrics

| Checkpoint | FDS (L1) ↓ | GATC ↑ | TCD (px) ↓ |
|------------|------------|--------|------------|
| 4k         | 0.2286     | 0.3517 | 84.42      |
| 8k         | 0.2298     | 0.3414 | 98.21      |
| 12k        | 0.2253     | 0.3814 | 138.54     |
| 16k        | 0.2227     | 0.4167 | 83.68      |
| 20k        | 0.2219     | 0.4058 | 124.96     |

Metrics:

  • FDS (L1): Frame Decay Score - mean L1 distance between generated and ground-truth frames normalized to [-1, 1], averaged across all generated frames (lower is better)
  • GATC: Ground-truth Anchored Tool Consistency - median zero-mean normalized cross-correlation (ZNCC) of grayscale pixels within SAM3-segmented tool regions between generated and ground-truth frames, weighted by a gradient-based tool presence penalty (higher is better)
  • TCD: Tool Centroid Distance - median per-frame average Euclidean distance (in pixels) between Hungarian-matched tool instance centroids in generated vs ground-truth frames, with a half-diagonal penalty for unmatched tools (lower is better)
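Of the three metrics, FDS (L1) is straightforward to reproduce; the sketch below assumes both videos are already normalized to [-1, 1]. GATC and TCD additionally require SAM3 tool segmentation and Hungarian matching, which are omitted here.

```python
import numpy as np

def fds_l1(generated: np.ndarray, ground_truth: np.ndarray) -> float:
    """Frame Decay Score (L1): mean absolute error over all frames and pixels,
    with both videos normalized to [-1, 1]. Lower is better."""
    assert generated.shape == ground_truth.shape
    return float(np.mean(np.abs(generated - ground_truth)))

# Identical videos score 0.0; maximally different videos score 2.0.
```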

Per-Procedure Metrics (16k checkpoint)

| Procedure       | FDS (L1) ↓ | GATC ↑ | TCD (px) ↓ |
|-----------------|------------|--------|------------|
| Prostatectomy   | 0.229      | 0.429  | 130.7      |
| Inguinal Hernia | 0.259      | 0.261  | 211.3      |
| Hysterectomy    | 0.173      | 0.593  | 153.0      |
| Cholecystectomy | 0.248      | 0.188  | 60.8       |

Inference

Acceleration Engine: PyTorch, Transformer Engine

Test Hardware: Tested on A100.

Usage:

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please make sure you have the proper rights and permissions for all input image and video content. If an input image or video includes people, personal health information, or intellectual property, the generated image or video will not blur or maintain the proportions of the subjects included.

For more detailed information on ethical considerations for this model, please see the Model Card++ for Bias, Explainability, Safety & Security, and Privacy Subcards below.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.
